NebulaGraph Importer

NebulaGraph Importer (Importer) is a standalone tool for importing data from CSV files into NebulaGraph. Importer reads local CSV files and imports the data into the NebulaGraph database.

Scenario

Importer is used to import the contents of local CSV files into NebulaGraph.

Advantage

  • Lightweight and fast: no complex environment is required, and data can be imported quickly.
  • Flexible filtering: You can flexibly filter CSV data through configuration files.

Release note

Release

Prerequisites

Before using NebulaGraph Importer, make sure:

  • The schema has been created in NebulaGraph, including the graph space, Tags, and Edge types. Alternatively, the schema can be created with the parameter clientSettings.postStart.commands.
  • Golang environment has been deployed on the machine running the Importer. For details, see Build Go environment.

Steps

Configure the YAML file and prepare the CSV files to be imported, then use the tool to batch write data to NebulaGraph.

Download binary package and run

  1. Download the binary package directly and add execute permission to it.

  2. Start the service.

    $ ./<binary_package_name> --config <yaml_config_file_path>
    

Source code compile and run

  1. Clone repository.

    $ git clone -b release-3.4 https://github.com/vesoft-inc/nebula-importer.git
    

    Note

    Use the correct branch. NebulaGraph 2.x and 3.x have different RPC protocols.

  2. Access the directory nebula-importer.

    $ cd nebula-importer
    
  3. Compile the source code.

    $ make build
    
  4. Start the service.

    $ ./nebula-importer --config <yaml_config_file_path>
    

    Note

    For details about the YAML configuration file, see the configuration file description at the end of this topic.

No network compilation mode

If the server cannot access the Internet, it is recommended to package the source code and its dependencies on a machine that can access the Internet, and then upload the package to the target server for compilation. The steps are as follows:

  1. Clone repository.

    $ git clone -b release-3.4 https://github.com/vesoft-inc/nebula-importer.git
    
  2. Use the following command to download and package the dependent source code.

    $ cd nebula-importer
    $ go mod vendor
    $ cd .. && tar -zcvf nebula-importer.tar.gz nebula-importer
    
  3. Upload the compressed package to a server that cannot be connected to the Internet.

  4. Decompress the package and compile.

    $ tar -zxvf nebula-importer.tar.gz 
    $ cd nebula-importer
    $ go build -mod vendor cmd/importer.go
    

Run in Docker mode

Instead of deploying the Go environment locally, you can use Docker to pull the NebulaGraph Importer image and mount the local configuration file and CSV data files into the container. The command is as follows:

$ docker run --rm -ti \
    --network=host \
    -v <config_file>:<config_file> \
    -v <csv_data_dir>:<csv_data_dir> \
    vesoft/nebula-importer:<version> \
    --config <config_file>
  • <config_file>: The absolute path to the local YAML configuration file.
  • <csv_data_dir>: The absolute path to the local CSV data file.
  • <version>: The version of the NebulaGraph Importer image. For NebulaGraph 3.x, fill in v3.

Note

A relative path is recommended. If you use a local absolute path, check that the path maps correctly to the path inside the container.

Configuration File Description

NebulaGraph Importer uses configuration files (for example, nebula-importer/examples/v2/example.yaml) to describe information about the files to be imported, the NebulaGraph server, and more. You can refer to the example configuration files: Configuration without Header/Configuration with Header. This section describes the fields in the configuration file by category.

Note

If you downloaded the binary package, create the configuration file manually.
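The sections described below combine into a single file. As a rough sketch of how they fit together (all values here are placeholders taken from the examples in this topic, not defaults), a minimal configuration might look like:

```yaml
version: v2
description: example
removeTempFiles: false
clientSettings:
  space: test                   # target graph space
  connection:
    user: user
    password: password
    address: 192.168.*.13:9669  # Graph service address(es)
logPath: ./err/test.log
files:
  - path: ./student.csv         # CSV file to import
    failDataPath: ./err/student.csv
    batchSize: 128
    type: csv
    csv:
      withHeader: false
    schema:
      type: vertex
      vertex:
        vid:
          index: 0              # column 0 holds the vertex ID
        tags:
          - name: student
            props:
              - name: name
                type: string
                index: 1
```

Each block is explained in detail in the subsections that follow.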

Basic configuration

The example configuration is as follows:

version: v2
description: example
removeTempFiles: false
| Parameter | Default value | Required | Description |
| --- | --- | --- | --- |
| version | v2 | Yes | Target version of the configuration file. |
| description | example | No | Description of the configuration file. |
| removeTempFiles | false | No | Whether to delete the temporarily generated logs and error data files. |

Client configuration

The client configuration stores the configurations associated with NebulaGraph.

The example configuration is as follows:

clientSettings:
  retry: 3
  concurrency: 10
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 192.168.*.13:9669,192.168.*.14:9669
  postStart:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=3600;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
    afterPeriod: 8s
  preStop:
    commands: |
      UPDATE CONFIGS storage:wal_ttl=86400;
      UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = false };
| Parameter | Default value | Required | Description |
| --- | --- | --- | --- |
| clientSettings.retry | 3 | No | The number of retries after an nGQL statement fails to execute. |
| clientSettings.concurrency | 10 | No | The number of concurrent NebulaGraph clients. |
| clientSettings.channelBufferSize | 128 | No | The size of the cache queue for each NebulaGraph client. |
| clientSettings.space | - | Yes | The NebulaGraph space to import the data into. Do not import into multiple spaces at the same time to avoid impacting performance. |
| clientSettings.connection.user | - | Yes | The NebulaGraph user name. |
| clientSettings.connection.password | - | Yes | The password of the NebulaGraph user. |
| clientSettings.connection.address | - | Yes | The addresses and ports of all Graph services. |
| clientSettings.postStart.commands | - | No | Operations to perform after connecting to the NebulaGraph server and before inserting data. |
| clientSettings.postStart.afterPeriod | - | No | The interval between executing the above commands and inserting data, for example 8s. |
| clientSettings.preStop.commands | - | No | Operations to perform before disconnecting from the NebulaGraph server. |

File configuration

File configuration stores the configuration of data files and logs, and details about the schema.

File and log configuration

The example configuration is as follows:

workingDir: ./data/
logPath: ./err/test.log
files:
  - path: ./student.csv
    failDataPath: ./err/student.csv
    batchSize: 128
    limit: 10
    inOrder: false
    type: csv
    csv:
      withHeader: false
      withLabel: false
      delimiter: ","
      lazyQuotes: false
| Parameter | Default value | Required | Description |
| --- | --- | --- | --- |
| workingDir | - | No | If you have multiple directories containing data files with the same structure, use this parameter to switch between them. For example, with the configuration above, path and failDataPath automatically resolve to ./data/student.csv and ./data/err/student.csv. If you change workingDir to ./data1, the paths change accordingly. The parameter can be either absolute or relative. |
| logPath | - | No | The path for exporting log information, such as errors during import. |
| files.path | - | Yes | The path where the data files are stored. A relative path is resolved against the directory of the configuration file. You can use an asterisk (*) for fuzzy matching to import multiple files with similar names, but the files must have the same structure. |
| files.failDataPath | - | Yes | The path where data that failed to be inserted is stored, so that it can be written again later. |
| files.batchSize | 128 | No | The number of statements for inserting data in a batch. |
| files.limit | - | No | The maximum number of rows of data to read. |
| files.inOrder | - | No | Whether to insert the rows in the file in order. If the value is set to false, the import rate decreases due to data skew. |
| files.type | - | Yes | The file type. |
| files.csv.withHeader | false | Yes | Whether the file has a header. |
| files.csv.withLabel | false | Yes | Whether the file has a label. |
| files.csv.delimiter | "," | Yes | The delimiter of the CSV file. Only one-character string delimiters are supported. |
| files.csv.lazyQuotes | false | No | If lazyQuotes is true, a quote may appear in an unquoted field and a non-doubled quote may appear in a quoted field. |
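To illustrate how workingDir and fuzzy matching interact, the following fragment is a sketch assuming a hypothetical layout of ./data/student_1.csv, ./data/student_2.csv, and so on; all matched files must share the same column structure:

```yaml
workingDir: ./data/
logPath: ./err/test.log
files:
  # With workingDir set, this resolves to ./data/student_*.csv
  # and matches every file whose name starts with student_.
  - path: ./student_*.csv
    failDataPath: ./err/student.csv
    batchSize: 128
    type: csv
    csv:
      withHeader: false
```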

Schema configuration

Schema configuration describes the metadata of the current data file. The schema types are vertex and edge. Multiple vertexes or edges can be configured at the same time.

  • vertex configuration

The example configuration is as follows:

schema:
  type: vertex
  vertex:
    vid:
      index: 1
      function: hash
      prefix: abc
    tags:
      - name: student
        props:
          - name: age
            type: int
            index: 2
          - name: name
            type: string
            index: 1
          - name: gender
            type: string
          - name: phone
            type: string
            nullable: true
          - name: wechat
            type: string
            nullable: true
            nullValue: "__NULL__"
| Parameter | Default value | Required | Description |
| --- | --- | --- | --- |
| files.schema.type | - | Yes | The schema type. Possible values are vertex and edge. |
| files.schema.vertex.vid.index | - | No | The column number in the CSV file that corresponds to the vertex ID. |
| files.schema.vertex.vid.function | - | No | The function used to generate VIDs. Currently, only the hash function is supported. |
| files.schema.vertex.vid.prefix | - | No | The prefix added to the original VID. When function is also specified, the prefix is applied to the original VID before the function. |
| files.schema.vertex.tags.name | - | Yes | The Tag name. |
| files.schema.vertex.tags.props.name | - | Yes | The Tag property name, which must match the Tag property in NebulaGraph. |
| files.schema.vertex.tags.props.type | - | Yes | The property data type. Supported types are bool, int, float, double, string, time, timestamp, date, datetime, geography, geography(point), geography(linestring), and geography(polygon). |
| files.schema.vertex.tags.props.index | - | No | The column number in the CSV file that corresponds to the property. |
| files.schema.vertex.tags.props.nullable | false | No | Whether the property can be NULL. Optional values are true and false. |
| files.schema.vertex.tags.props.nullValue | "" | No | Ignored when nullable is false. The property is set to NULL when the value equals nullValue. |
| files.schema.vertex.tags.props.alternativeIndices | - | No | Ignored when nullable is false. When the value fetched by index equals nullValue, values are fetched from the CSV file according to these column numbers in sequence until one does not equal nullValue. |
| files.schema.vertex.tags.props.defaultValue | - | No | Ignored when nullable is false. The default value of the property, used when all the values obtained by index and alternativeIndices equal nullValue. |
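The interplay of nullValue, alternativeIndices, and defaultValue can be hard to see from the table alone. The following fragment is a sketch with hypothetical column numbers and values:

```yaml
- name: phone
  type: string
  index: 3
  nullable: true
  nullValue: "__NULL__"
  # If column 3 equals "__NULL__", try column 4, then column 5, in order.
  alternativeIndices:
    - 4
    - 5
  # If every candidate column equals "__NULL__", fall back to this value.
  defaultValue: "unknown"
```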

Note

The column numbers in the CSV file start from 0, that is, the number of the first column is 0 and the number of the second column is 1.
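For example, given the hypothetical CSV row 100,Steve,22, a vertex configuration that reads the VID from the first column would map the columns as follows:

```yaml
# CSV row:      100,Steve,22
# column index:   0     1  2
vertex:
  vid:
    index: 0        # VID comes from the first column (number 0)
  tags:
    - name: student
      props:
        - name: name
          type: string
          index: 1  # second column
        - name: age
          type: int
          index: 2  # third column
```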

  • edge configuration

The example configuration is as follows:

schema:
  type: edge
  edge:
    name: follow
    srcVID:
      index: 0
      function: hash
    dstVID:
      index: 1
      function: 
    rank:
      index: 2
    props:
      - name: grade
        type: int
        index: 3
| Parameter | Default value | Required | Description |
| --- | --- | --- | --- |
| files.schema.type | - | Yes | The schema type. Possible values are vertex and edge. |
| files.schema.edge.name | - | Yes | The Edge type name. |
| files.schema.edge.srcVID.index | - | No | The column number in the CSV file that corresponds to the source vertex ID of the edge. |
| files.schema.edge.srcVID.function | - | No | The function used to generate the source vertex ID. Currently, only the hash function is supported. |
| files.schema.edge.dstVID.index | - | No | The column number in the CSV file that corresponds to the destination vertex ID of the edge. |
| files.schema.edge.dstVID.function | - | No | The function used to generate the destination vertex ID. Currently, only the hash function is supported. |
| files.schema.edge.rank.index | - | No | The column number in the CSV file that corresponds to the rank value of the edge. |
| files.schema.edge.props.name | - | Yes | The Edge type property name, which must match the Edge type property in NebulaGraph. |
| files.schema.edge.props.type | - | Yes | The property data type. Supported types are bool, int, float, double, timestamp, string, and geo. |
| files.schema.edge.props.index | - | No | The column number in the CSV file that corresponds to the property. |

About the CSV file header

Depending on whether the CSV file has a header, the Importer requires different settings in the configuration file. For relevant examples and explanations, see the example configuration files: Configuration without Header and Configuration with Header.


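To make the distinction concrete, the following sketch creates the same two hypothetical rows with and without a header (the column names here are illustrative; the exact header syntax expected by the Importer is described in the Configuration with Header example):

```shell
# Without a header, every row is data; set files.csv.withHeader to false.
printf '100,Steve,22\n101,Jane,23\n' > student_no_header.csv

# With a header, the first row describes the columns; set files.csv.withHeader to true.
printf 'vid,name,age\n100,Steve,22\n101,Jane,23\n' > student_with_header.csv

head -n 1 student_with_header.csv   # the header row
```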
Last update: February 19, 2024