Import data from JSON files¶
This article uses an example to show how to use Exchange to import data from JSON files stored on HDFS into Nebula Graph.
Dataset¶
The JSON file (test.json) used in this example contains one record per line in the form {"source":int, "target":int, "likeness":double}. Each record represents a like relationship between source and target. The file contains 21,645 records in total.
Here are some sample records:
{"source":53802643,"target":87847387,"likeness":0.34}
{"source":29509860,"target":57501950,"likeness":0.40}
{"source":97319348,"target":50240344,"likeness":0.77}
{"source":94295709,"target":8189720,"likeness":0.82}
{"source":78707720,"target":53874070,"likeness":0.98}
{"source":23399562,"target":20136097,"likeness":0.47}
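Before uploading the file, it can help to confirm that every line matches the expected record shape. The following is a minimal Python sketch, not part of Exchange; the helper name parse_record is hypothetical, and it assumes the file holds one JSON object per line, as in the sample records above.

```python
import json

def parse_record(line):
    """Parse one line of test.json into (source, target, likeness)."""
    record = json.loads(line)
    for key in ("source", "target"):
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer: {line!r}")
    likeness = record["likeness"]
    if isinstance(likeness, bool) or not isinstance(likeness, (int, float)):
        raise ValueError(f"likeness must be a number: {line!r}")
    return record["source"], record["target"], float(likeness)

# Three of the sample records shown above.
sample_lines = [
    '{"source":53802643,"target":87847387,"likeness":0.34}',
    '{"source":29509860,"target":57501950,"likeness":0.40}',
    '{"source":97319348,"target":50240344,"likeness":0.77}',
]
records = [parse_record(line) for line in sample_lines]
print(records[0])
```

A record with a missing key raises KeyError, and a record with a wrongly typed value raises ValueError, so malformed lines surface before the import starts.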
Environment¶
This example was run on macOS. The environment is as follows:
- Hardware specifications:
  - CPU: 1.7 GHz Quad-Core Intel Core i7
  - Memory: 16 GB
- Spark 2.3.0, deployed in the Standalone mode
- Hadoop 2.9.2, deployed in the Pseudo-Distributed mode
- Nebula Graph v1.1.0, deployed with Docker Compose. For more information, see Deploy Nebula Graph with Docker Compose.
Prerequisites¶
To import data from JSON files on HDFS with Exchange v1.x, make sure that these prerequisites are met:
- Exchange v1.x is compiled. For more information, see Compile Exchange v1.x. Exchange 1.0.1 is used in this example.
- Spark is installed.
- Hadoop is installed and started.
- Nebula Graph is deployed and started. Get this information:
  - The IP addresses and ports of the Graph Service and the Meta Service.
  - A Nebula Graph account that has the privilege to write data, and its password.
- You have the information necessary to create a schema in Nebula Graph, including tags and edge types.
Procedure¶
Step 1. Create a schema in Nebula Graph¶
Analyze the data in the JSON files and follow these steps to create a schema in Nebula Graph:
- Confirm the essential elements of the schema.

  | Elements  | Names  | Properties      |
  |-----------|--------|-----------------|
  | Tag       | source | srcId int       |
  | Tag       | target | dstId int       |
  | Edge Type | like   | likeness double |

- In Nebula Graph, create a graph space named json and create a schema.

  -- Create a graph space named json
  CREATE SPACE json (partition_num=10, replica_factor=1);
  -- Choose the json graph space
  USE json;
  -- Create the source tag
  CREATE TAG source (srcId int);
  -- Create the target tag
  CREATE TAG target (dstId int);
  -- Create the like edge type
  CREATE EDGE like (likeness double);

For more information, see Quick Start of Nebula Graph Database User Guide.
Step 2. Prepare JSON files¶
Create separate JSON files for vertex and edge data. Store the JSON files in HDFS and get the HDFS path of the files.
NOTE: In this example, a single JSON file is used to import vertex and edge data at the same time. Because some source and target vertices appear in more than one record, these vertices are written repeatedly during the import. In Nebula Graph, repeated insertion overwrites the existing data, and the last write is the one that is read. In practice, creating separate files for vertex and edge data is recommended to increase the write speed.
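One way to produce the separate files recommended here is sketched below in Python. This is an illustration under stated assumptions, not an Exchange feature: the function names and the output file names (source_vertices.json, target_vertices.json, like_edges.json) are hypothetical, and the input is assumed to hold one JSON object per line.

```python
import json
import os
import tempfile

def split_records(lines):
    """Return (source_rows, target_rows, edge_rows) with duplicate vertices removed."""
    source_ids, target_ids, edge_rows = set(), set(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        source_ids.add(record["source"])
        target_ids.add(record["target"])
        edge_rows.append(record)
    # Keep the original key names so that a "source" or "target" key
    # can still be referenced from the import configuration.
    source_rows = [{"source": vid} for vid in sorted(source_ids)]
    target_rows = [{"target": vid} for vid in sorted(target_ids)]
    return source_rows, target_rows, edge_rows

def write_json_lines(path, rows):
    """Write one JSON object per line, the layout used by test.json."""
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row, separators=(",", ":")) + "\n")

# Two sample records; the first is repeated on purpose to show deduplication.
sample_lines = [
    '{"source":53802643,"target":87847387,"likeness":0.34}',
    '{"source":29509860,"target":57501950,"likeness":0.40}',
    '{"source":53802643,"target":87847387,"likeness":0.34}',
]
source_rows, target_rows, edge_rows = split_records(sample_lines)

out_dir = tempfile.mkdtemp()  # hypothetical local staging directory
write_json_lines(os.path.join(out_dir, "source_vertices.json"), source_rows)
write_json_lines(os.path.join(out_dir, "target_vertices.json"), target_rows)
write_json_lines(os.path.join(out_dir, "like_edges.json"), edge_rows)
```

The resulting files can then be uploaded to HDFS, for example with hdfs dfs -put, and each tag or edge type in the configuration file can point to its own path.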
Step 3. Edit configuration file¶
After compiling Exchange, copy the target/classes/application.conf file and edit the configuration for JSON files. In this example, the new configuration file is named json_application.conf. In this file, the vertex and edge related configuration is explained in comments, and all the items that are not used in this example are commented out. For more information about the Spark and Nebula related parameters, see Spark related parameters and Nebula Graph related parameters.
{
  # Spark related configuration
  spark: {
    app: {
      name: Spark Writer
    }
    driver: {
      cores: 1
      maxResultSize: 1G
    }
    cores {
      max: 16
    }
  }
  # Nebula Graph related configuration
  nebula: {
    address:{
      # Specifies the IP addresses and ports of the Graph Service and the Meta Service of Nebula Graph.
      # If multiple servers are used, separate the addresses with commas.
      # Format: "ip1:port","ip2:port","ip3:port"
      graph:["127.0.0.1:3699"]
      meta:["127.0.0.1:45500"]
    }
    # Specifies an account that has the WriteData privilege in Nebula Graph and its password.
    user: user
    pswd: password
    # Specifies a graph space name
    space: json
    connection {
      timeout: 3000
      retry: 3
    }
    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/errors
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }
  # Process vertices
  tags: [
    # Settings for the source tag
    {
      # Specifies a tag name defined in Nebula Graph
      name: source
      type: {
        # Specifies the data source. json is used.
        source: json
        # Specifies how to import vertex data into Nebula Graph: client or sst.
        # For more information about importing sst files, see Import SST files (doc to do).
        sink: client
      }
      # Specifies the HDFS path of the JSON file.
      # Enclose the path with double quotes and start the path with hdfs://.
      path: "hdfs://namenode_ip:port/path/to/test.json"
      # Specifies the keys in the JSON file.
      # Their values are used as the source of the srcId property
      # defined in Nebula Graph.
      # If more than one key is specified, separate them with commas.
      fields: ["source"]
      nebula.fields: ["srcId"]
      # Specifies the values of a key in the JSON file as
      # the source of the VID in Nebula Graph.
      # If the values are not of the int type, use vertex.policy to
      # set the mapping policy. "hash" is preferred.
      #
      # vertex: {
      #   field: key_name_in_json
      #   policy: "hash"
      # }
      vertex: source
      batch: 256
      partition: 32
      # For the isImplicit information, refer to the application.conf file
      # in the nebula-java/tools/exchange/target/classes directory.
      isImplicit: true
    }
    # Settings for the target tag
    {
      name: target
      type: {
        source: json
        sink: client
      }
      path: "hdfs://namenode_ip:port/path/to/test.json"
      fields: ["target"]
      nebula.fields: ["dstId"]
      vertex: "target"
      batch: 256
      partition: 32
      isImplicit: true
    }
    # If more tags are necessary, refer to the preceding configuration to add more.
  ]
  # Process edges
  edges: [
    # Settings for the like edge type
    {
      # Specifies an edge type name defined in Nebula Graph
      name: like
      type: {
        # Specifies the data source. json is used.
        source: json
        # Specifies how to import edge data into Nebula Graph: client or sst.
        # For more information about importing sst files, see Import SST files (doc to do).
        sink: client
      }
      # Specifies the HDFS path of the JSON file.
      # Enclose the path with double quotes and start the path with hdfs://.
      path: "hdfs://namenode_ip:port/path/to/test.json"
      # Specifies the keys in the JSON file.
      # Their values are used as the source of the likeness property defined in Nebula Graph.
      # If more than one key is specified, separate them with commas.
      fields: ["likeness"]
      nebula.fields: ["likeness"]
      # Specifies the values of two keys in the JSON file as the source
      # of the IDs of source and destination vertices of the like edges in Nebula Graph.
      # If the values are not of the int type, use source.policy and/or
      # target.policy to set the mapping policy. "hash" is preferred.
      # source: {
      #   field: key_name_in_json
      #   policy: "hash"
      # }
      # target: {
      #   field: key_name_in_json
      #   policy: "hash"
      # }
      source: "source"
      target: "target"
      batch: 256
      partition: 32
      isImplicit: true
    }
    # If more edge types are necessary, refer to the preceding configuration to add more.
  ]
}
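To make the field mapping in this configuration concrete, the following minimal Python sketch builds the nGQL INSERT statements that correspond to one JSON record under the tags and edges settings above. It only illustrates what fields, nebula.fields, vertex, source, and target mean. It is not how Exchange is implemented (Exchange writes in batches through Spark), and the helper names vertex_statement and edge_statement are hypothetical.

```python
def vertex_statement(record, tag, fields, nebula_fields, vertex_key):
    """Build the INSERT VERTEX statement implied by one tag configuration."""
    props = ", ".join(nebula_fields)                    # nebula.fields
    values = ", ".join(str(record[f]) for f in fields)  # fields
    vid = record[vertex_key]                            # vertex
    return f"INSERT VERTEX {tag}({props}) VALUES {vid}:({values});"

def edge_statement(record, edge, fields, nebula_fields, source_key, target_key):
    """Build the INSERT EDGE statement implied by one edge type configuration."""
    props = ", ".join(nebula_fields)
    values = ", ".join(str(record[f]) for f in fields)
    src, dst = record[source_key], record[target_key]   # source and target
    return f"INSERT EDGE {edge}({props}) VALUES {src}->{dst}:({values});"

# The first sample record from the dataset.
record = {"source": 53802643, "target": 87847387, "likeness": 0.34}

statements = [
    vertex_statement(record, "source", ["source"], ["srcId"], "source"),
    vertex_statement(record, "target", ["target"], ["dstId"], "target"),
    edge_statement(record, "like", ["likeness"], ["likeness"], "source", "target"),
]
for statement in statements:
    print(statement)
```

In this example the same key supplies both the VID and the property value for each tag, which is why 53802643 appears twice in the statement for the source tag.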
Step 4. (Optional) Verify the configuration¶
After editing the configuration, run the import command with the -D parameter to verify the configuration file. For more information about the parameters, see Import command parameters.
$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.0.1.jar -c /path/to/conf/json_application.conf -D
Step 5. Import data into Nebula Graph¶
When the configuration is ready, run this command to import data from JSON files into Nebula Graph. For more information about the parameters, see Import command parameters.
$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.0.1.jar -c /path/to/conf/json_application.conf
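If the verification and import commands are run from a script, the argument list can be assembled once and reused. The sketch below is a hypothetical Python helper (build_submit_command is not part of Exchange) that reproduces the two commands shown in this article. The fallback /path/to/spark is a placeholder used only when SPARK_HOME is not set.

```python
import os
import shlex

def build_submit_command(jar_path, conf_path, dry_run=False, master="local"):
    """Return the spark-submit argument list used in this article."""
    # Placeholder fallback; set SPARK_HOME to the real Spark directory.
    spark_home = os.environ.get("SPARK_HOME", "/path/to/spark")
    command = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--master", master,
        "--class", "com.vesoft.nebula.tools.importer.Exchange",
        jar_path,
        "-c", conf_path,
    ]
    if dry_run:
        command.append("-D")  # only verify the configuration file
    return command

jar = "/path/to/exchange-1.0.1.jar"
conf = "/path/to/conf/json_application.conf"
verify_cmd = build_submit_command(jar, conf, dry_run=True)
import_cmd = build_submit_command(jar, conf)

# Print the commands instead of running them (shlex.join needs Python 3.8+).
print(shlex.join(verify_cmd))
print(shlex.join(import_cmd))
```

To actually run a command, the list can be passed to subprocess.run(import_cmd, check=True), which raises an exception if spark-submit exits with a non-zero status.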
Step 6. (Optional) Verify data in Nebula Graph¶
You can use a Nebula Graph client, such as Nebula Graph Studio, to verify the imported data. For example, in Nebula Graph Studio, run this statement:
GO FROM 53802643 OVER like;
If the query returns the destination vertices, the data has been imported into Nebula Graph.
You can also use db_dump to count the imported data. For more information, see Dump Tool.
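To have numbers to compare against, the expected counts can be computed from the JSON file itself. The following Python sketch is an illustration with a hypothetical helper name (expected_counts). It assumes one JSON object per line and, in line with the overwrite behavior described in Step 2, that repeated vertices and repeated source-target pairs are each stored once. How db_dump presents its counts is not shown here.

```python
import json

def expected_counts(lines):
    """Count the distinct vertices and edges that the import should produce."""
    sources, targets, edges = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        sources.add(record["source"])
        targets.add(record["target"])
        # Assumption: a repeated (source, target) pair overwrites the earlier edge.
        edges.add((record["source"], record["target"]))
    return {
        "source": len(sources),             # vertices carrying the source tag
        "target": len(targets),             # vertices carrying the target tag
        "vertices": len(sources | targets), # distinct VIDs overall
        "like": len(edges),                 # like edges
    }

# Three of the sample records from the dataset.
sample_lines = [
    '{"source":53802643,"target":87847387,"likeness":0.34}',
    '{"source":29509860,"target":57501950,"likeness":0.40}',
    '{"source":97319348,"target":50240344,"likeness":0.77}',
]
counts = expected_counts(sample_lines)
print(counts)
```

For the real file, pass the open file object instead of the sample list, for example expected_counts(open("test.json", encoding="utf-8")).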
Step 7. (Optional) Create and rebuild indexes in Nebula Graph¶
After the data is imported, you can create and rebuild indexes in Nebula Graph. For more information, see nGQL User Guide.