NebulaGraph Analytics

NebulaGraph Analytics is a high-performance graph computing framework that analyzes graph data stored in the NebulaGraph database.

Prerequisites

The NebulaGraph Analytics installation package has been obtained. Contact us to apply.

The license is ready.

HDFS 2.2.x or later has been deployed.

JDK 1.8 has been deployed.

Scenarios

You can import data into NebulaGraph Analytics from NebulaGraph clusters, CSV files on HDFS, or local CSV files, and export the graph computation results from NebulaGraph Analytics to NebulaGraph clusters, CSV files on HDFS, or local CSV files.

Limitations

When both the data source and the export target are NebulaGraph clusters, the graph computation results can only be exported to the graph space that the source data was read from.

Version compatibility

The following table lists the compatible versions of NebulaGraph and NebulaGraph Analytics.

NebulaGraph	NebulaGraph Analytics
3.4.0	3.4.0
3.3.0	3.3.0
3.1.0 ~ 3.2.x	3.2.0
3.0.x	1.0.x
2.6.x	0.9.0
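
If you script deployments, the table above can be encoded as a small lookup. The following is a minimal sketch; `analytics_version_for` is a hypothetical helper name, not part of NebulaGraph Analytics, and it assumes that the range "3.1.0 ~ 3.2.x" covers every 3.1.x and 3.2.x release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a NebulaGraph version to the compatible
# NebulaGraph Analytics version listed in the compatibility table.
analytics_version_for() {
  case "$1" in
    3.4.0)       echo "3.4.0" ;;
    3.3.0)       echo "3.3.0" ;;
    3.1.*|3.2.*) echo "3.2.0" ;;   # assumption: "3.1.0 ~ 3.2.x" includes all 3.1.x
    3.0.*)       echo "1.0.x" ;;
    2.6.*)       echo "0.9.0" ;;
    *)           echo "unknown"; return 1 ;;
  esac
}
```

Versions that are not in the table return a non-zero status, so a deployment script can stop early instead of installing an untested combination.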

Graph algorithms

NebulaGraph Analytics supports the following graph algorithms.

Algorithm	Description	Category
APSP	All-pairs shortest path	Path
SSSP	Single-source shortest path	Path
BFS	Breadth-first search	Path
ShortestPath	Shortest path	Path
PageRank	Ranks vertices by link structure; originally used to rank web pages.	Node importance measurement
KCore	k-core	Node importance measurement
DegreeCentrality	Counts the total number of edges connected to a vertex.	Node importance measurement
DegreeWithTime	Neighbor statistics based on the time range of edge ranks	Node importance measurement
BetweennessCentrality	Betweenness centrality	Node importance measurement
ClosenessCentrality	Closeness centrality	Node importance measurement
TriangleCount	Counts the number of triangles.	Graph feature
Node2Vec	Node embedding based on random walks	Graph feature
Tree_stat	Tree structure statistics	Graph feature
HyperANF	Estimates the average distance of the graph.	Graph feature
LPA	Label propagation algorithm	Community discovery
WCC	Weakly connected components	Community discovery
LOUVAIN	Detects communities in large networks.	Community discovery
InfoMap	Community classification	Community discovery
HANP	Hop attenuation and node preference	Community discovery
Clustering Coefficient	Measures the degree to which vertices in a graph tend to cluster together.	Clustering
Jaccard	Jaccard similarity	Similarity

Install NebulaGraph Analytics

Install NebulaGraph Analytics. To deploy a cluster across multiple nodes, install NebulaGraph Analytics to the same path on every node and set up passwordless SSH login between the nodes.

sudo rpm -ivh <analytics_package_name> --prefix=<install_path>
sudo chown <user>:<user> -R <install_path>

For example:

sudo rpm -ivh nebula-analytics-3.4.0-centos.x86_64.rpm --prefix=/home/vesoft/nebula-analytics
sudo chown vesoft:vesoft -R /home/vesoft/nebula-analytics
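
Before going further on a multi-node cluster, it can help to confirm the two requirements stated above: passwordless SSH between nodes and an identical install path on each node. The following is a minimal sketch; `check_nodes` is a hypothetical helper, and the host names you pass to it are your own.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each host, confirm that SSH works without a
# password prompt and that the install path exists on that host.
check_nodes() {
  local install_path=$1; shift
  local host failed=0
  for host in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" test -d "$install_path"; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_nodes /home/vesoft/nebula-analytics 192.168.8.200 192.168.8.201` prints one line per host and returns a non-zero status if any host fails either check.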

Configure the correct Hadoop path and JDK path in the file set_env.sh, which is located at nebula-analytics/scripts/set_env.sh. If there are multiple machines, ensure that the paths are the same on all of them.

Note

The default TCP port range used by the MPICH process manager and MPICH library is 10000 to 10100. To adjust this, modify the value of the environment variable MPIR_CVAR_CH3_PORT_RANGE in the set_env.sh file.
```
export HADOOP_HOME=<hadoop_path>
export JAVA_HOME=<java_path>
```
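
A quick sanity check of these settings can catch a wrong path before an algorithm run fails. The sketch below is illustrative only: `check_env` and `valid_port_range` are hypothetical helpers, the `bin/hadoop` and `bin/java` locations are the usual layout of Hadoop and JDK installations, and the `<low>:<high>` format for MPIR_CVAR_CH3_PORT_RANGE is an assumption based on the MPICH documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the paths exported in set_env.sh point
# at usable Hadoop and JDK installations.
check_env() {
  local status=0
  if [ ! -x "${HADOOP_HOME:-}/bin/hadoop" ]; then
    echo "HADOOP_HOME does not contain bin/hadoop: ${HADOOP_HOME:-<unset>}" >&2
    status=1
  fi
  if [ ! -x "${JAVA_HOME:-}/bin/java" ]; then
    echo "JAVA_HOME does not contain bin/java: ${JAVA_HOME:-<unset>}" >&2
    status=1
  fi
  return "$status"
}

# Hypothetical helper: check a "<low>:<high>" port range such as 10000:10100
# (assumed format of MPIR_CVAR_CH3_PORT_RANGE).
valid_port_range() {
  local low=${1%%:*} high=${1##*:}
  [[ $1 == *:* && $low =~ ^[0-9]+$ && $high =~ ^[0-9]+$ ]] || return 1
  (( 10#$low <= 10#$high && 10#$high <= 65535 ))
}
```

After sourcing set_env.sh, running `check_env` on every machine confirms that the configured paths exist there.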
Copy the license into the scripts directory under the NebulaGraph Analytics installation path on all machines.

How to use NebulaGraph Analytics

After installation, set the parameters of the algorithm you want to run, and then execute its script to compute the results and export them in the specified format.

Select one node from the NebulaGraph Analytics cluster and then access the scripts directory.
```
$ cd scripts
```

Confirm the data source and export path. Configuration steps are as follows.

NebulaGraph clusters as the data source

Modify the configuration file nebula.conf to configure the NebulaGraph cluster.

# The number of retries connecting to NebulaGraph.
--retry=3  
# The name of the graph space where you read or write data.
--space=basketballplayer

# Read data from NebulaGraph.
# The name of edges.
--edges=LIKES  
# The property to read as the edge weight. Can be either a property name or _rank.
#--edge_data_fields 
# The number of rows read per scan.
--read_batch_size=10000  

# Write data to NebulaGraph.
# The graphd process address.
--graph_server_addrs=192.168.8.100:9669  
# The account to log into NebulaGraph.
--user=root  
# The password to log into NebulaGraph.
--password=nebula  
# The pattern used to write data back to NebulaGraph: insert or update.
--mode=insert  
# The tag name written back to NebulaGraph.
--tag=pagerank  
# The property name corresponding to the tag.
--prop=pr  
# The property type corresponding to the tag.
--type=double 
# The number of rows per write. 
--write_batch_size=1000 
# The path of the file that stores the data that failed to be written back to NebulaGraph.
--err_file=/home/xxx/analytics/err.txt 

# Other settings.
# The access timeout for the graphd, metad, and storaged services.
--graphd_timeout=60000
--metad_timeout=60000
--storaged_timeout=60000
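
Because nebula.conf is a plain list of `--key=value` flags, you can read a setting back from it to double-check what a run will use. The following is a minimal sketch; `conf_get` is a hypothetical helper, not a tool shipped with NebulaGraph Analytics.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of an active "--key=value" line in a
# flag file such as nebula.conf. Commented-out lines are ignored.
conf_get() {
  local file=$1 key=$2
  # Strip trailing whitespace, then print the value of the last matching line.
  sed -n -e 's/[[:space:]]*$//' -e "s/^--${key}=//p" "$file" | tail -n 1
}
```

For example, `conf_get nebula.conf space` prints the graph space name, and an empty result for `conf_get nebula.conf edge_data_fields` shows that the flag is still commented out.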

Modify the related parameters in the script to be used, such as run_pagerank.sh.

# The total number of processes running on all machines in the cluster. The recommended value is the number of machines or the number of NUMA nodes.
WNUM=3
# The number of threads per process. The recommended maximum is the number of hardware threads of the machine.
WCORES=4  
# The path to the data source.
# Set to read data from NebulaGraph via the nebula.conf file.
INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"}  
# Set to read data from the CSV files on HDFS or on local directories.
# INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"}

# The export path to the graph computation results.
# Data can be exported to a NebulaGraph cluster. If the data source is also a NebulaGraph cluster, the results are exported to the graph space specified in nebula.conf.
OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"}
# Data can also be exported to the CSV files on HDFS or on local directories.
# OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'}

# If true, the graph is directed; if false, it is undirected.
IS_DIRECTED=${IS_DIRECTED:=true}
# Whether to encode vertex IDs.
NEED_ENCODE=${NEED_ENCODE:=true}
# The ID type of the data source vertices, for example string, int32, or int64.
VTYPE=${VTYPE:=int32}
# The encoding type. The value distributed specifies distributed vertex ID encoding. The value single specifies single-machine vertex ID encoding.
ENCODER=${ENCODER:="distributed"}
# Parameters for the PageRank algorithm. Parameters differ between algorithms.
EPS=${EPS:=0.0001}
DAMPING=${DAMPING:=0.85}
# The number of iterations.
ITERATIONS=${ITERATIONS:=100}
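
Most parameters above use the shell expansion `${VAR:=default}`, which keeps a value that is already set in the environment and falls back to the default otherwise. The sketch below demonstrates that behavior in isolation; `show_iterations` is a hypothetical stand-in for an algorithm script, not part of the product.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an algorithm script: it applies the same
# ${VAR:=default} pattern and reports the value that would be used.
show_iterations() {
  (
    # Keep ITERATIONS if the caller set it; otherwise default to 100.
    ITERATIONS=${ITERATIONS:=100}
    echo "ITERATIONS=$ITERATIONS"
  )
}
```

Under this pattern, a call such as `ITERATIONS=20 ./run_pagerank.sh` should override the default without editing the script, assuming the script does not reassign the variable elsewhere. WNUM and WCORES are plain assignments in the listing above, so they can only be changed by editing the script.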

Local or HDFS CSV files as the data source

Modify parameters in the script to be used, such as run_pagerank.sh.

# The total number of processes running on all machines in the cluster. The recommended value is the number of machines or the number of NUMA nodes.
WNUM=3
# The number of threads per process. The recommended maximum is the number of hardware threads of the machine.
WCORES=4  
# The path to the data source.
# Set to read data from NebulaGraph via the nebula.conf file.
# INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"}  
# Set to read data from the CSV files on HDFS or on local directories.
INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"}

# The export path to the graph computation results.
# Data can be exported to a NebulaGraph cluster. If the data source is also a NebulaGraph cluster, the results are exported to the graph space specified in nebula.conf.
# OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"}
# Data can also be exported to the CSV files on HDFS or on local directories.
OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'}

# If true, the graph is directed; if false, it is undirected.
IS_DIRECTED=${IS_DIRECTED:=true}
# Whether to encode vertex IDs.
NEED_ENCODE=${NEED_ENCODE:=true}
# The ID type of the data source vertices, for example string, int32, or int64.
VTYPE=${VTYPE:=int32}
# The encoding type. The value distributed specifies distributed vertex ID encoding. The value single specifies single-machine vertex ID encoding.
ENCODER=${ENCODER:="distributed"}
# Parameters for the PageRank algorithm. Parameters differ between algorithms.
EPS=${EPS:=0.0001}
DAMPING=${DAMPING:=0.85}
# The number of iterations.
ITERATIONS=${ITERATIONS:=100}

Modify the configuration file cluster to set the NebulaGraph Analytics cluster nodes and task assignment weights for executing the algorithm.
```
# NebulaGraph Analytics Cluster Node IP Addresses: Task Assignment Weights
192.168.8.200:1
192.168.8.201:1
192.168.8.202:1
```
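
To double-check the cluster file against the WNUM setting in the algorithm script, you can summarize it before a run. The following is a minimal sketch; `cluster_summary` is a hypothetical helper that assumes one `<ip>:<weight>` entry per line, with `#` starting a comment line as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the nodes in a cluster file and sum their
# task assignment weights.
cluster_summary() {
  awk -F: '
    /^[ \t]*(#|$)/ { next }              # skip comment lines and blank lines
    { nodes++; weight += $2 }            # field 1 is the IP, field 2 the weight
    END { printf "nodes=%d total_weight=%d\n", nodes, weight }
  ' "$1"
}
```

Running `cluster_summary cluster` in the scripts directory prints the node count and total weight, which you can compare with the number of machines you expect to take part in the computation.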
Run the algorithm script. For example:
```
./run_pagerank.sh
```
View the graph computation results in the export path.
- If the results are exported to a NebulaGraph cluster, check them according to the settings in nebula.conf.
- If the results are exported to CSV files on HDFS or in local directories, check the path set in OUTPUT. The results are stored as compressed files in the .gz format.
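
For results exported to a local directory, the compressed files can be previewed without unpacking them to disk. The following is a minimal sketch; `preview_results` is a hypothetical helper, and the `*.gz` glob is an assumption about the output file names, so adjust it to what your run produces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first N lines (default 10) of the
# gzip-compressed result files in a local output directory.
preview_results() {
  local dir=$1 lines=${2:-10}
  # gzip -dc decompresses to standard output and leaves the files untouched.
  gzip -dc "$dir"/*.gz | head -n "$lines"
}
```

For results on HDFS, a similar preview should be possible by piping the output of `hdfs dfs -cat` on the files under the OUTPUT path through `gzip -dc`.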

Last update: February 19, 2024