Import data from CSV files

This article uses an example to show how to use Exchange to import data from CSV files stored on HDFS into Nebula Graph.

If you want to import data from local CSV files into Nebula Graph v1.x, see Nebula Importer.

Dataset

In this article, the Social Network: MOOC User Action Dataset provided by Stanford Network Analysis Platform (SNAP) and 97 unique course names obtained from the public network are used as the sample dataset. The dataset contains:

  • Two vertex types (user and course), 7,144 vertices in total.
  • One edge type (action), 411,749 edges in total.

You can download the example dataset from the nebula-web-docker repository.

Environment

This practice is performed on macOS. Here is the environment information:

  • Hardware specifications:
    • CPU: 1.7 GHz Quad-Core Intel Core i7
    • Memory: 16 GB
  • Spark 2.3.0, deployed in the Standalone mode
  • Hadoop 2.9.2, deployed in the Pseudo-Distributed mode

Prerequisites

To import data from CSV files on HDFS with Exchange v1.x, verify the following:

  • Exchange v1.x is compiled. For more information, see Compile Exchange v1.x. Exchange 1.1.0 is used in this example.
  • Spark is installed.
  • Hadoop is installed and started.
  • Nebula Graph is deployed and started. Obtain the following information:
    • IP addresses and ports of the Graph Service and the Meta Service.
    • A Nebula Graph account with the privilege of writing data and its password.
  • Get the necessary information for schema creation in Nebula Graph, including tags and edge types.

Procedure

Step 1. Create a schema in Nebula Graph

Analyze the data in the CSV files and follow these steps to create a schema in Nebula Graph:

  1. Confirm the essential elements of the schema.

    | Element   | Name   | Properties |
    | --------- | ------ | ---------- |
    | Tag       | user   | userId int |
    | Tag       | course | courseId int, courseName string |
    | Edge type | action | actionId int, duration double, label bool, feature0 double, feature1 double, feature2 double, feature3 double |
  2. In Nebula Graph, create a graph space named csv and create a schema.

    -- Create a graph space named csv
    CREATE SPACE csv(partition_num=10, replica_factor=1);
    
    -- Choose the csv graph space
    USE csv;
    
    -- Create the user tag
    CREATE TAG user(userId int);
    
    -- Create the course tag
    CREATE TAG course(courseId int, courseName string);
    
    -- Create the action edge type
    CREATE EDGE action (actionId int, duration double, label bool, feature0 double, feature1 double, feature2 double, feature3 double);
    

For more information, see Quick Start of Nebula Graph Database User Guide.

Step 2. Prepare CSV files

Verify the following:

  1. The CSV files are processed to meet the requirements of the schema. For more information, see Quick Start of Nebula Graph Studio.

     NOTE: Exchange supports importing CSV files with or without headers.

  2. The CSV files are stored on HDFS. Record the file storage paths.
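Before uploading, it can save a failed import run to confirm that every row of a headerless CSV file has enough columns for the column indices (_c0, _c1, ...) you plan to reference in the Exchange configuration. Here is a minimal sketch of such a check; the file path and expected column count are assumptions you adjust to your own files:

```python
import csv

def check_csv_columns(path, min_columns, separator=","):
    """Return True if every row of the headerless CSV file at `path`
    has at least `min_columns` columns, so that column indices
    _c0 .. _c(min_columns - 1) are valid in the Exchange configuration."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=separator)
        for line_no, row in enumerate(reader, start=1):
            if len(row) < min_columns:
                print(f"line {line_no}: expected >= {min_columns} columns, got {len(row)}")
                return False
    return True

# Hypothetical usage: the action edge configuration in this article
# references columns up to _c8, so each row of actions.csv needs 9 columns.
# check_csv_columns("actions.csv", 9)
```

Run this against the local copies of the files before putting them on HDFS; a single short row would otherwise only surface as an error during the Spark job.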

Step 3. Edit configuration file

After compiling Exchange, copy the target/classes/application.conf file and edit the configuration for CSV files. In this example, the new configuration file is named csv_application.conf. In this file, the vertex and edge related configuration is introduced as comments, and all the items that are not used in this example are commented out. For more information about the Spark and Nebula Graph related parameters, see Spark related parameters and Nebula Graph related parameters.

{
  # Spark related configuration
  spark: {
    app: {
      name: Spark Writer
    }
    driver: {
      cores: 1
      maxResultSize: 1G
    }
    cores {
      max: 16
    }
  }

  # Nebula Graph related configuration
  nebula: {
    address:{
      # Specifies the IP addresses and ports of the Graph Service and the Meta Service of Nebula Graph.
      # If multiple servers are used, separate the addresses with commas. 
      # Format: "ip1:port","ip2:port","ip3:port". 
      graph:["127.0.0.1:3699"]
      meta:["127.0.0.1:45500"]
    }

    # Specifies an account that has the WriteData privilege in Nebula Graph and its password.
    user: user
    pswd: password

    # Specifies a graph space name
    space: csv
    connection {
      timeout: 3000
      retry: 3
    }
    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/errors
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  # Process vertices
  tags: [
    # Sets for the course tag
    {
      # Specifies a tag name defined in Nebula Graph.
      name: course
      type: {
        # Specifies the data source. csv is used.
        source: csv

        # Specifies how to import vertex data into Nebula Graph: client or sst.
        # For more information about importing sst files, see Import SST files (doc_to_do).
        sink: client
      }

      # Specifies the HDFS path of the CSV file. 
      # Enclose the path with double quotes and start the path with hdfs://.
      path: "hdfs://namenode_ip:port/path/to/course.csv"

      # If the CSV file has no header, use [_c0, _c1, _c2, ..., _cn] to 
      # represent its header and to indicate columns as the source of the property values.
      fields: [_c0, _c1]
      # If the CSV file has a header, use the actual column names.

      # Specifies property names defined in Nebula Graph.
      # fields for the CSV file and nebula.fields for Nebula Graph must 
      # have the one-to-one correspondence relationship.
      nebula.fields: [courseId, courseName]

      # As of Exchange 1.1.0, csv.fields is available.
      # If csv.fields is specified, fields must have the same values as csv.fields.
      # csv.fields: [courseId, courseName]

      # Specifies a column as the source of VIDs.
      # The value of vertex.field must be one column of the CSV file.
      # If the values are not of the int type, use vertex.policy to 
      # set the mapping policy. "hash" is preferred.
      vertex: {
        field: _c1,
        policy: "hash"
      }

      # Specifies the separator. The default value is a comma.
      separator: ","

      # If the CSV file has a header, set header to true.
      # If the CSV file has no header, set header to false (default value).
      header: false

      # Specifies the maximum number of vertex records to be written into 
      # Nebula Graph in a single batch.
      batch: 256

      # Specifies the partition number of Spark.
      partition: 32

      # For the isImplicit information, refer to the application.conf file in 
      # the nebula-java/tools/exchange/target/classes directory. 
      isImplicit: true
    }

    # Sets for the user tag
    {
      name: user
      type: {
        source: csv
        sink: client
      }
      path: "hdfs://namenode_ip:port/path/to/user.csv"

      # As of Exchange 1.1.0, csv.fields is available.
      # If csv.fields is used, Exchange uses the values of csv.fields as 
      # the header of the CSV file, and fields must have the same values as csv.fields.
      # fields for the CSV file and nebula.fields for Nebula Graph must 
      # have the one-to-one correspondence relationship.
      fields: [userId]

      # Specifies property names defined in Nebula Graph.
      # fields for the CSV file and nebula.fields for Nebula Graph must 
      # have the one-to-one correspondence relationship.
      nebula.fields: [userId]

      # If csv.fields is set, its value is used to represent the header of 
      # the CSV file, even though the CSV file has its own header.
      # fields and csv.fields must have the same value.
      # The value of vertex must be one of the values of csv.fields.
      csv.fields: [userId]

      # The value of vertex must be one of the values of csv.fields.
      vertex: userId
      separator: ","
      header: false
      batch: 256
      partition: 32

      # For the isImplicit information, refer to the application.conf file in
      # the nebula-java/tools/exchange/target/classes directory.
      isImplicit: true
    }

    # If more tags are necessary, refer to the preceding configuration to add more.
  ]
  # Process edges
  edges: [
    # Sets for the action edge type
    {
      # Specifies an edge type name defined in Nebula Graph
      name: action
      type: {
        # Specifies the data source. csv is used.
        source: csv

        # Specifies how to import edge data into Nebula Graph: client or sst.
        # For more information about importing sst files, see Import SST files (doc_to_do).
        sink: client
      }

      # Specifies the HDFS path of the CSV file. 
      # Enclose the path with double quotes and start the path with hdfs://.
      path: "hdfs://namenode_ip:port/path/to/actions.csv"

      # If the CSV file has no header, use [_c0, _c1, _c2, ..., _cn] to 
      # represent its header and to indicate columns as the source of the property values.
      fields: [_c0, _c3, _c4, _c5, _c6, _c7, _c8]
      # If the CSV file has a header, use the actual column names.

      # Specifies property names defined in Nebula Graph.
      # fields for the CSV file and nebula.fields for Nebula Graph must 
      # have the one-to-one correspondence relationship.
      nebula.fields: [actionId, duration, feature0, feature1, feature2, feature3, label]

      # As of Exchange 1.1.0, csv.fields is available.
      # If csv.fields is used, Exchange uses the values of csv.fields as 
      # the header of the CSV file and fields must have the same values as csv.fields.
      # csv.fields: [actionId, duration, feature0, feature1, feature2, feature3, label]

      # Specifies the columns as the source of the IDs of the source and target vertices.
      # If the values are not of the int type, use the policy option to set the
      # mapping policy, as shown for target below. "hash" is preferred.
      source: _c1
      target: {
        field: _c2
        policy: "hash"
      }

      # Specifies the separator. The default value is a comma.
      separator: ","

      # If the CSV file has a header, set header to true.
      # If the CSV file has no header, set header to false (default value).
      header: false

      # Specifies the maximum number of edge records to be written into
      # Nebula Graph in a single batch.
      batch: 256

      # Specifies the partition number of Spark.
      partition: 32

      # For the isImplicit information, refer to the application.conf file
      # in the nebula-java/tools/exchange/target/classes directory.
      isImplicit: true
    }
  ]
  # If more edge types are necessary, refer to the preceding configuration to add more.
}
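A frequent source of import errors is a mismatch between fields, nebula.fields, and csv.fields, which Exchange requires to correspond one to one. The constraints can be sanity-checked with a small script before running the job; this is a sketch, with the example lists copied from the action edge configuration above:

```python
def check_field_mapping(fields, nebula_fields, csv_fields=None):
    """Verify Exchange field-mapping constraints:
    fields and nebula.fields must have the same length (one-to-one),
    and csv.fields, when set, must have the same values as fields."""
    if len(fields) != len(nebula_fields):
        raise ValueError(
            f"fields has {len(fields)} entries but nebula.fields has {len(nebula_fields)}"
        )
    if csv_fields is not None and csv_fields != fields:
        raise ValueError("csv.fields must have the same values as fields")
    # Return the (CSV column, Nebula Graph property) pairs for review.
    return list(zip(fields, nebula_fields))

# Mapping taken from the action edge configuration above.
mapping = check_field_mapping(
    fields=["_c0", "_c3", "_c4", "_c5", "_c6", "_c7", "_c8"],
    nebula_fields=["actionId", "duration", "feature0", "feature1",
                   "feature2", "feature3", "label"],
)
print(mapping)
```

Reviewing the printed pairs makes it easy to spot a column that is mapped to the wrong property before any data is written.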

Step 4. (Optional) Verify the configuration

After editing the configuration file, run the import command with the -D parameter to verify the configuration without actually importing data. For more information about the parameters, see Import command parameters.

$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.1.0.jar -c /path/to/conf/csv_application.conf -D

Step 5. Import data into Nebula Graph

When the configuration is ready, run this command to import data from CSV files into Nebula Graph. For more information about the parameters, see Import command parameters.

$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.1.0.jar -c /path/to/conf/csv_application.conf 

Step 6. (Optional) Verify data in Nebula Graph

You can use a Nebula Graph client, such as Nebula Graph Studio, to verify the imported data. For example, in Nebula Graph Studio, run this statement.

GO FROM 1 OVER action;

If the query returns destination vertices, the data has been imported into Nebula Graph.

You can also use db_dump to count the imported data. For more information, see Dump Tool.

Step 7. (Optional) Create and rebuild indexes in Nebula Graph

After the data is imported, you can create and rebuild indexes in Nebula Graph. For more information, see nGQL User Guide.
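For example, an index on the userId property of the user tag might be created and rebuilt as follows. This is a sketch: the index name user_index is arbitrary, and the REBUILD ... OFFLINE syntax applies to Nebula Graph v1.x, so check the nGQL User Guide for your version.

```ngql
-- Create an index on the userId property of the user tag
CREATE TAG INDEX user_index ON user(userId);

-- Rebuild the index so that the imported data is indexed
REBUILD TAG INDEX user_index OFFLINE;
```

After the index is rebuilt, statements such as LOOKUP can query vertices by property values.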