NebulaGraph Analytics¶
NebulaGraph Analytics is a high-performance graph computing framework tool that performs graph analysis of data in the NebulaGraph database.
Prerequisites¶
- The NebulaGraph Analytics installation package has been obtained. Contact us to apply.
- The license is ready.
- The HDFS 2.2.x or later has been deployed.
- The JDK 1.8 has been deployed.
Scenarios¶
You can import data from data sources as NebulaGraph clusters, CSV files on HDFS, or local CSV files into NebulaGraph Analytics and export the graph computation results to NebulaGraph clusters, CSV files on HDFS, or local CSV files from NebulaGraph Analytics.
Limitations¶
When you import NebulaGraph cluster data into NebulaGraph Analytics and export the graph computation results from NebulaGraph Analytics to a NebulaGraph cluster, the graph computation results can only be exported to the graph space where the data source is located.
Version compatibility¶
The version correspondence between NebulaGraph Analytics and NebulaGraph is as follows.
NebulaGraph | NebulaGraph Analytics |
---|---|
3.4.0 | 3.4.0 |
3.3.0 | 3.3.0 |
3.1.0 ~ 3.2.x | 3.2.0 |
3.0.x | 1.0.x |
2.6.x | 0.9.0 |
Graph algorithms¶
NebulaGraph Analytics supports the following graph algorithms.
Algorithm | Description | Category |
---|---|---|
APSP | All Pair Shortest Path | Path |
SSSP | Single Source Shortest Path | Path |
BFS | Breadth-first search | Path |
ShortestPath | The shortest path | Path |
PageRank | It is used to rank web pages. | Node importance measurement |
KCore | k-Cores | Node importance measurement |
DegreeCentrality | It is a simple count of the total number of connections linked to a vertex. | Node importance measurement |
DegreeWithTime | Neighbor statistics based on the time range of edge ranks | Node importance measurement |
BetweennessCentrality | Intermediate centrality | Node importance measurement |
ClosenessCentrality | Closeness centrality | Node importance measurement |
TriangleCount | It counts the number of triangles. | Graph feature |
Node2Vec | Graph neural network | Graph feature |
Tree_stat | Tree structure statistics | Graph feature |
HyperANF | Estimate the average distance of the graph | Graph feature |
LPA | Label Propagation Algorithm | Community discovery |
WCC | Weakly connected component | Community discovery |
LOUVAIN | It detects communities in large networks. | Community discovery |
InfoMap | Community classification | Community discovery |
HANP | Hop attenuation & Node Preference | Community discovery |
Clustering Coefficient | It is a measure of the degree to which nodes in a graph tend to cluster together. | Clustering |
Jaccard | Jaccard similarity | Similarity |
Install NebulaGraph Analytics¶
-
Install the NebulaGraph Analytics. When installing a cluster of multiple NebulaGraph Analytics on multiple nodes, you need to install NebulaGraph Analytics to the same path and set up SSH-free login between nodes.
sudo rpm -ivh <analytics_package_name> --prefix <install_path> sudo chown <user>:<user> -R <install path>
For example:
sudo rpm -ivh nebula-analytics-3.4.0-centos.x86_64.rpm --prefix=/home/vesoft/nebula-analytics sudo chown vesoft:vesoft -R /home/vesoft/nebula-analytics
-
Configure the correct Hadoop path and JDK path in the file
set_env.sh
, the file path isnebula-analytics/scripts/set_env.sh
. If there are multiple machines, ensure that the paths are the same.Note
The default TCP port range used by the MPICH process manager and MPICH library is 10000 to 10100. To adjust this, modify the value of the environment variable
MPIR_CVAR_CH3_PORT_RANGE
in theset_env.sh
file.export HADOOP_HOME=<hadoop_path> export JAVA_HOME=<java_path>
-
Copy the license into the directory
scripts
of the NebulaGraph Analytics installation path on all machines.
How to use NebulaGraph Analytics¶
After installation, you can set parameters of different algorithms and then execute a script to obtain the results of the algorithms and export them to the specified format.
-
Select one node from the NebulaGraph Analytics cluster and then access the
scripts
directory.$ cd scripts
-
Confirm the data source and export path. Configuration steps are as follows.
-
NebulaGraph clusters as the data source
-
Modify the configuration file
nebula.conf
to configure the NebulaGraph cluster.# The number of retries connecting to NebulaGraph. --retry=3 # The name of the graph space where you read or write data. --space=baskeyballplayer # Read data from NebulaGraph. # The name of edges. --edges=LIKES # The name of the property to be read as the weight of the edge. Can be either the attribute name or _rank. #--edge_data_fields # The number of rows read per scan. --read_batch_size=10000 # Write data to NebulaGraph. # The graphd process address. --graph_server_addrs=192.168.8.100:9669 # The account to log into NebulaGraph. --user=root # The password to log into NebulaGraph. --password=nebula # The pattern used to write data back to NebulaGraph: insert or update. --mode=insert # The tag name written back to NebulaGraph. --tag=pagerank # The property name corresponding to the tag. --prop=pr # The property type corresponding the the tag. --type=double # The number of rows per write. --write_batch_size=1000 # The file path where the data failed to be written back to NebulaGraph is stored. --err_file=/home/xxx/analytics/err.txt # other # The access timeout period of the service. --graphd_timeout=60000 --metad_timeout=60000 --storaged_timeout=60000
-
Modify the related parameters in the script to be used, such as
run_pagerank.sh
.# The sum of the number of processes running on all machines in the cluster. It is recommended to be the number of machines or the number of nodes in the NUMA architecture. WNUM=3 # The number of threads per process. It is recommended to set the maximum value to be the number of hardware threads of the machine. WCORES=4 # The path to the data source. # Set to read data from NebulaGraph via the nebula.conf file. INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"} # Set to read data from the CSV files on HDFS or on local directories. # #INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"} # The export path to the graph computation results. # Data can be exported to a NebulaGraph. If the data source is also a NebulaGraph, the results will be exported to the graph space specified in nebula.conf. OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"} # Data can also be exported to the CSV files on HDFS or on local directories. # OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'} # If the value is true, it is a directed graph, if false, it is an undirected graph. IS_DIRECTED=${IS_DIRECTED:=true} # Set whether to encode ID or not. NEED_ENCODE=${NEED_ENCODE:=true} # The ID type of the data source vertices. For example string, int32, and int64. VTYPE=${VTYPE:=int32} # Encoding type. The value distributed specifies the distributed vertex ID encoding. The value single specifies the single-machine vertex ID encoding. ENCODER=${ENCODER:="distributed"} # The parameter for the PageRank algorithm. Algorithms differ in parameters. EPS=${EPS:=0.0001} DAMPING=${DAMPING:=0.85} # The number of iterations. ITERATIONS=${ITERATIONS:=100}
-
-
Local or HDFS CSV files as the data source
Modify parameters in the script to be used, such as
run_pagerank.sh
.# The sum of the number of processes running on all machines in the cluster. It is recommended to be the number of machines or the number of nodes in the NUMA architecture. WNUM=3 # The number of threads per process. It is recommended to set the maximum value to be the number of hardware threads of the machine. WCORES=4 # The path to the data source. # Set to read data from NebulaGraph via the nebula.conf file. # INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"} # Set to read data from the CSV files on HDFS or on local directories. INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"} # The export path to the graph computation results. # Data can be exported to a NebulaGraph. If the data source is also a NebulaGraph, the results will be exported to the graph space specified in nebula.conf. # OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"} # Data can also be exported to the CSV files on HDFS or on local directories. OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'} # If the value is true, it is a directed graph, if false, it is an undirected graph. IS_DIRECTED=${IS_DIRECTED:=true} # Set whether to encode ID or not. NEED_ENCODE=${NEED_ENCODE:=true} # The ID type of the data source vertices. For example string, int32, and int64. VTYPE=${VTYPE:=int32} # The value distributed specifies the distributed vertex ID encoding. The value single specifies the single-machine vertex ID encoding. ENCODER=${ENCODER:="distributed"} # The parameter for the PageRank algorithm. Algorithms differ in parameters. EPS=${EPS:=0.0001} DAMPING=${DAMPING:=0.85} # The number of iterations. ITERATIONS=${ITERATIONS:=100}
-
-
Modify the configuration file
cluster
to set the NebulaGraph Analytics cluster nodes and task assignment weights for executing the algorithm.# NebulaGraph Analytics Cluster Node IP Addresses: Task Assignment Weights 192.168.8.200:1 192.168.8.201:1 192.168.8.202:1
-
Run the algorithm script. For example:
./run_pagerank.sh
-
View the graph computation results in the export path.
- For exporting to a NebulaGraph cluster, check the results according to the settings in
nebula.conf
.
- For exporting the results to the CSV files on HDFS or on local directories, check the results according to the settings in
OUTPUT
, which is a compressed file in the.gz
format.
- For exporting to a NebulaGraph cluster, check the results according to the settings in