Import data from HIVE¶
This article uses an example to show how to use Exchange to import data from HIVE into Nebula Graph.
Dataset¶
In this article, the Social Network: MOOC User Action Dataset provided by Stanford Network Analysis Platform (SNAP) and 97 unique course names obtained from the public network are used as the sample dataset. The dataset contains:
- Two vertex types (user and course), 7,144 vertices in total.
- One edge type (action), 411,749 edges in total.
You can download the example dataset from the nebula-web-docker repository.
In this example, the dataset is stored in a database named mooc in HIVE, and the information of all vertices and edges is stored in the users, courses, and actions tables. Here are the structures of all the tables.
scala> sql("describe mooc.users").show
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| userid| bigint| null|
+--------+---------+-------+
scala> sql("describe mooc.courses").show
+----------+---------+-------+
| col_name|data_type|comment|
+----------+---------+-------+
| courseid| bigint| null|
|coursename| string| null|
+----------+---------+-------+
scala> sql("describe mooc.actions").show
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|actionid| bigint| null|
| srcid| bigint| null|
| dstid| string| null|
|duration| double| null|
|feature0| double| null|
|feature1| double| null|
|feature2| double| null|
|feature3| double| null|
| label| boolean| null|
+--------+---------+-------+
NOTE: The bigint type in HIVE corresponds to the int type in Nebula Graph.
Environment¶
The practice is done on macOS. Here is the environment information:
- Hardware specifications:
- CPU: 1.7 GHz Quad-Core Intel Core i7
- Memory: 16 GB
- Spark 2.4.7, deployed in the Standalone mode
- Hadoop 2.9.2, deployed in the Pseudo-Distributed mode
- HIVE 2.3.7, with MySQL 8.0.22
- Nebula Graph v1.2.1, deployed with Docker Compose. For more information, see Deploy Nebula Graph with Docker Compose.
Prerequisites¶
To import data from HIVE with Exchange v1.x, check the following:
- Exchange v1.x is compiled. For more information, see Compile Exchange v1.x. Exchange 1.2.1 is used in this example.
- Spark is installed.
- Hadoop is installed and started, and the HIVE metastore database (MySQL in this example) is started.
- Nebula Graph is deployed and started. Get the following information:
- IP addresses and ports of the Graph Service and the Meta Service.
- A Nebula Graph account with the privilege of writing data and its password.
- Get the necessary information for schema creation in Nebula Graph, including tags and edge types.
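To confirm that Nebula Graph is up before importing, you can connect with a client and check the service status. A minimal sketch in nGQL:
-- List the storage hosts and their status
SHOW HOSTS;
All listed hosts should be online before you continue.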
Procedure¶
Step 1. Create a schema in Nebula Graph¶
Follow these steps to create a schema in Nebula Graph:
- Confirm the essential elements of the schema.

| Elements | Names | Properties |
| --- | --- | --- |
| Tag | user | userId int |
| Tag | course | courseId int, courseName string |
| Edge Type | action | actionId int, duration double, label bool, feature0 double, feature1 double, feature2 double, feature3 double |
- In Nebula Graph, create a graph space named hive and create a schema.

-- Create a graph space named hive
CREATE SPACE hive(partition_num=10, replica_factor=1);
-- Choose the hive graph space
USE hive;
-- Create the user tag
CREATE TAG user(userId int);
-- Create the course tag
CREATE TAG course(courseId int, courseName string);
-- Create the action edge type
CREATE EDGE action (actionId int, duration double, label bool, feature0 double, feature1 double, feature2 double, feature3 double);
For more information, see Quick Start of Nebula Graph Database User Guide.
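Before running the import, you can confirm that the schema is in place from the Nebula Graph console. A minimal sketch in nGQL:
-- List the tags and edge types in the current graph space
SHOW TAGS;
SHOW EDGES;
-- Check the properties of the user tag
DESCRIBE TAG user;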
Step 2. Verify the HIVE SQL statements¶
After spark-shell starts, run these statements one by one to make sure that Spark can read data from HIVE.
scala> sql("select userid from mooc.users").show
scala> sql("select courseid, coursename from mooc.courses").show
scala> sql("select actionid, srcid, dstid, duration, feature0, feature1, feature2, feature3, label from mooc.actions").show
Here is an example of the data read from the mooc.actions table.
+--------+-----+--------------------+--------+------------+------------+-----------+-----------+-----+
|actionid|srcid| dstid|duration| feature0| feature1| feature2| feature3|label|
+--------+-----+--------------------+--------+------------+------------+-----------+-----------+-----+
| 0| 0|Environmental Dis...| 0.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 1| 0| History of Ecology| 6.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 2| 0| Women in Islam| 41.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 3| 0| History of Ecology| 49.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 4| 0| Women in Islam| 51.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 5| 0|Legacies of the A...| 55.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 6| 0| ITP Core 2| 59.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 7| 0|The Research Pape...| 62.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 8| 0| Neurobiology| 65.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 9| 0| Wikipedia| 113.0|-0.319991479|-0.435701433|1.108826104|12.77723482|false|
| 10| 0|Media History and...| 226.0|-0.319991479|-0.435701433|0.607804941|149.4512115|false|
| 11| 0| WIKISOO| 974.0|-0.319991479|-0.435701433|1.108826104|3.344522776|false|
| 12| 0|Environmental Dis...| 1000.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 13| 0| WIKISOO| 1172.0|-0.319991479|-0.435701433|1.108826104|1.136866766|false|
| 14| 0| Women in Islam| 1182.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 15| 0| History of Ecology| 1185.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 16| 0|Human Development...| 1687.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 17| 1|Human Development...| 7262.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 18| 1| History of Ecology| 7266.0|-0.319991479|-0.435701433|0.106783779|-0.06730924|false|
| 19| 1| Women in Islam| 7273.0|-0.319991479|-0.435701433|0.607804941|0.936170765|false|
+--------+-----+--------------------+--------+------------+------------+-----------+-----------+-----+
only showing top 20 rows
Step 3. Edit configuration file¶
After Exchange is compiled, copy the target/classes/application.conf file and edit the configuration for HIVE. In this example, the new configuration file is named hive_application.conf. In this file, the vertex and edge related configuration is introduced as comments, and all the items that are not used in this example are commented out. For more information about the Spark and Nebula Graph related parameters, see Spark related parameters and Nebula Graph related parameters.
{
# Spark related configuration
spark: {
app: {
name: Spark Writer
}
driver: {
cores: 1
maxResultSize: 1G
}
cores {
max: 16
}
}
# Nebula Graph related configuration
nebula: {
address:{
# Specifies the IP addresses and ports of the Graph Service and the Meta Service of Nebula Graph
# If multiple servers are used, separate the addresses with commas,
# "ip1:port","ip2:port","ip3:port"
graph:["127.0.0.1:3699"]
meta:["127.0.0.1:45500"]
}
# Specifies an account that has the WriteData privilege in Nebula Graph and its password
user: user
pswd: password
# Specifies a graph space name
space: hive
connection {
timeout: 3000
retry: 3
}
execution {
retry: 3
}
error: {
max: 32
output: /tmp/errors
}
rate: {
limit: 1024
timeout: 1000
}
}
# Process vertices
tags: [
# Sets for the user tag
{
# Specifies a tag name defined in Nebula Graph
name: user
type: {
# Specifies the data source. hive is used.
source: hive
# Specifies how to import vertex data into Nebula Graph: client or sst.
# For more information about importing sst files, see Import SST files (doc to do).
sink: client
}
# Specifies the SQL statement to read data from the users table in the mooc database
exec: "select userid from mooc.users"
# Specifies the column names of the users table in fields.
# Their values are used as the source of the userId property (nebula.fields) defined in Nebula Graph.
# If more than one column name is specified, separate them with commas.
# The fields for HIVE and the nebula.fields for Nebula Graph must correspond one to one.
fields: [userid]
nebula.fields: [userId]
# Specifies a column as the source of VIDs.
# The value of vertex must be a column name used in the exec statement.
# If the values are not of the int type, use vertex.policy to
# set the mapping policy. "hash" is preferred.
# Refer to the configuration of the course tag.
vertex: userid
# Specifies the maximum number of vertices written into
# Nebula Graph in a single batch.
batch: 256
# Specifies the partition number of Spark.
partition: 32
# For the isImplicit information, refer to the application.conf file in
# the nebula-java/tools/exchange/target/classes directory.
isImplicit: true
}
# Sets for the course tag
{
name: course
type: {
source: hive
sink: client
}
exec: "select courseid, coursename from mooc.courses"
fields: [courseid, coursename]
nebula.fields: [courseId, courseName]
# Specifies a column as the source of VIDs.
# The value of vertex.field must be a column name used in the exec statement.
# If the values are not of the int type, use vertex.policy to
# set the mapping policy. "hash" is preferred.
vertex: {
field: coursename
policy: "hash"
}
batch: 256
partition: 32
isImplicit: true
}
# If more tags are necessary, refer to the preceding configuration to add more.
]
# Process edges
edges: [
# Sets for the action edge type
{
# Specifies an edge type name defined in Nebula Graph
name: action
type: {
# Specifies the data source. hive is used.
source: hive
# Specifies how to import edge data into Nebula Graph: client or sst.
# For more information about importing sst files,
# see Import SST files (doc to do).
sink: client
}
# Specifies the SQL statement to read data from the actions table in
# the mooc database.
exec: "select actionid, srcid, dstid, duration, feature0, feature1, feature2, feature3, label from mooc.actions"
# Specifies the column names of the actions table in fields.
# Their values are used as the source of the properties of
# the action edge type defined in Nebula Graph.
# If more than one column name is specified, separate them with commas.
# The fields for HIVE and the nebula.fields for Nebula Graph must
# correspond one to one.
fields: [actionid, duration, feature0, feature1, feature2, feature3, label]
nebula.fields: [actionId, duration, feature0, feature1, feature2, feature3, label]
# source specifies a column as the source of the IDs of
# the source vertex of an edge.
# target specifies a column as the source of the IDs of
# the target vertex of an edge.
# The value of source.field and target.field must be
# column names used in the exec statement.
# If the values are not of the int type, use source.policy or
# target.policy to set the mapping policy. "hash" is preferred.
source: srcid
target: {
field: dstid
policy: "hash"
}
# Specifies the maximum number of edges written into
# Nebula Graph in a single batch.
batch: 256
# Specifies the partition number of Spark.
partition: 32
}
]
}
Step 4. (Optional) Verify the configuration¶
After the configuration is done, run the import command with the -D parameter to verify the configuration file. For more information about the parameters, see Import command parameters.
$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.2.1.jar -c /path/to/conf/hive_application.conf -h -D
Step 5. Import data into Nebula Graph¶
When the configuration is ready, run this command to import data from HIVE into Nebula Graph. For more information about the parameters, see Import command parameters.
$SPARK_HOME/bin/spark-submit --master "local" --class com.vesoft.nebula.tools.importer.Exchange /path/to/exchange-1.2.1.jar -c /path/to/conf/hive_application.conf -h
Step 6. (Optional) Verify data in Nebula Graph¶
You can use a Nebula Graph client, such as Nebula Graph Studio, to verify the imported data. For example, in Nebula Graph Studio, run this statement.
GO FROM 1 OVER action;
If the query returns the destination vertices, the data has been imported into Nebula Graph.
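Because the VIDs of the course vertices were generated with the hash policy on the course names, you can also look up a course by hashing its name. A minimal sketch, using the course name History of Ecology from the sample data above:
-- Query the action edges starting from vertex 1 and return some properties
GO FROM 1 OVER action YIELD action._dst, action.duration, action.label;
-- Fetch a course vertex whose VID was generated by hashing the course name
FETCH PROP ON course hash("History of Ecology");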
You can use db_dump to count the data. For more information, see Dump Tool.
Step 7. (Optional) Create and rebuild indexes in Nebula Graph¶
After the data is imported, you can create and rebuild indexes in Nebula Graph. For more information, see nGQL User Guide.
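For example, to look up vertices by property values, you can create indexes and then rebuild them so that the data imported before the index creation is also indexed. A minimal sketch, assuming the index names user_index and course_index:
-- Create indexes for the user and course tags
CREATE TAG INDEX user_index ON user(userId);
CREATE TAG INDEX course_index ON course(courseName);
-- Rebuild the indexes to cover the imported data
REBUILD TAG INDEX user_index OFFLINE;
REBUILD TAG INDEX course_index OFFLINE;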