NebulaGraph Analytics

NebulaGraph Analytics is a high-performance graph computing framework for analyzing data stored in the NebulaGraph database.

Enterprise only

To apply for the NebulaGraph Analytics installation package, send an email to inquiry@vesoft.com.

Scenarios

You can import data into NebulaGraph Analytics from data sources such as NebulaGraph clusters, CSV files on HDFS, or local CSV files. The graph computation results can be exported to the same kinds of targets: NebulaGraph clusters, CSV files on HDFS, or local CSV files.

Limitations

When the data source is a NebulaGraph cluster and the graph computation results are also exported to a NebulaGraph cluster, the results can only be exported to the graph space where the data source is located.

Version compatibility

The version correspondence between NebulaGraph Analytics and NebulaGraph is as follows.

NebulaGraph Analytics	NebulaGraph
3.2.0	3.2.1, 3.1.0
1.0.x	3.0.x
0.9.0	2.6.x

Graph algorithms

NebulaGraph Analytics supports the following graph algorithms.

Algorithm	Description	Category
APSP	All-pairs shortest path	Path
SSSP	Single-source shortest path	Path
BFS	Breadth-first search	Path
PageRank	It is used to rank web pages.	Node importance measurement
KCore	k-Cores	Node importance measurement
DegreeCentrality	It is a simple count of the total number of connections linked to a vertex.	Node importance measurement
DegreeWithTime	Neighbor statistics based on the time range of edge ranks	Node importance measurement
BetweennessCentrality	Betweenness centrality	Node importance measurement
ClosenessCentrality	Closeness centrality	Node importance measurement
TriangleCount	It counts the number of triangles.	Graph feature
LPA	Label Propagation Algorithm	Community discovery
WCC	Weakly connected component	Community discovery
LOUVAIN	It detects communities in large networks.	Community discovery
HANP	Hop attenuation & Node Preference	Community discovery
Clustering Coefficient	It is a measure of the degree to which nodes in a graph tend to cluster together.	Clustering
Jaccard	Jaccard similarity	Similarity

Install NebulaGraph Analytics

When deploying NebulaGraph Analytics as a cluster across multiple nodes, install it to the same path on every node and set up passwordless SSH login between the nodes.

sudo rpm -i nebula-analytics-3.2.0-centos.x86_64.rpm  --prefix /home/xxx/nebula-analytics
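
The passwordless SSH login mentioned above is commonly set up with `ssh-keygen` and `ssh-copy-id`. The following is a minimal sketch that only prints the commands for review rather than running them. The node addresses and the login user `nebula` are hypothetical placeholders, not values taken from the product documentation.

```shell
# Hypothetical node list and login user; replace them with your own values.
NODES="192.168.8.200 192.168.8.201 192.168.8.202"
SSH_USER="nebula"

# Print the commands that would create a key pair on this node and copy the
# public key to every node. Nothing is executed here; review the output first.
print_ssh_setup_commands() {
  echo "ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519"
  for node in $NODES; do
    echo "ssh-copy-id ${SSH_USER}@${node}"
  done
}

print_ssh_setup_commands
```

The printed commands would typically be repeated on each node so that every node can log in to every other node without a password.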

How to use NebulaGraph Analytics

After installation, you can set the parameters of an algorithm and then execute its script to obtain the results and export them in the specified format.

Select one node from the NebulaGraph Analytics cluster and then access the scripts directory.
```
$ cd scripts
```

Confirm the data source and export path. Configuration steps are as follows.

NebulaGraph clusters as the data source

Modify the configuration file nebula.conf to configure the NebulaGraph cluster.

# The number of retries connecting to NebulaGraph.
--retry=3  
# The name of the graph space where you read or write data.
--space=basketballplayer

# Read data from NebulaGraph.
# The metad process address.
--meta_server_addrs=192.168.8.100:9559,192.168.8.101:9559,192.168.8.102:9559
# The name of edges.
--edges=LIKES  
# The name of the property to be read as the weight of the edge. The value can be either a property name or _rank.
#--edge_data_fields 
# The number of rows read per scan.
--read_batch_size=10000  

# Write data to NebulaGraph.
# The graphd process address.
--graph_server_addrs=192.168.8.100:9669  
# The account to log into NebulaGraph.
--user=root  
# The password to log into NebulaGraph.
--password=nebula  
# The pattern used to write data back to NebulaGraph: insert or update.
--mode=insert  
# The tag name written back to NebulaGraph.
--tag=pagerank  
# The property name corresponding to the tag.
--prop=pr  
# The property type corresponding to the tag.
--type=double 
# The number of rows per write. 
--write_batch_size=1000 
# The path of the file that stores the data that failed to be written back to NebulaGraph.
--err_file=/home/xxx/analytics/err.txt
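
Because nebula.conf mixes comments, commented-out flags, and active flags, it can help to list only the flags that are actually in effect before running an algorithm. The sketch below works on a small stand-in file created in a temporary location. The helper name `active_flags` is hypothetical, and the check assumes that active flags always take the `--name=value` form shown above.

```shell
# Create a small stand-in for nebula.conf so the example is self-contained.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
# The number of retries connecting to NebulaGraph.
--retry=3
# The name of the graph space where you read or write data.
--space=basketballplayer
#--edge_data_fields
--mode=insert
EOF

# Print only the active "--name=value" flags, without the leading dashes
# and without trailing whitespace. Comments and commented-out flags are skipped.
active_flags() {
  grep -E '^--[A-Za-z_]+=' "$1" | sed -e 's/^--//' -e 's/[[:space:]]*$//'
}

active_flags "$conf_file"
```

Pointing `active_flags` at the real nebula.conf in the scripts directory would give a quick view of the graph space, addresses, and write-back settings that the next run would use.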

Modify the related parameters in the script to be used, such as run_pagerank.sh.

# The total number of processes running across all machines in the cluster. The recommended value is the number of machines or the number of NUMA nodes.
WNUM=3
# The number of threads per process. The recommended maximum is the number of hardware threads of the machine.
WCORES=4
# The path to the data source.
# Set to read data from NebulaGraph via the nebula.conf file.
INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"}  
# Set to read data from the CSV files on HDFS or on local directories.
# INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"}

# The export path to the graph computation results.
# Data can be exported to a NebulaGraph cluster. If the data source is also a NebulaGraph cluster, the results are exported to the graph space specified in nebula.conf.
OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"}
# Data can also be exported to the CSV files on HDFS or on local directories.
# OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'}

# Whether the graph is directed. true means a directed graph; false means an undirected graph.
IS_DIRECTED=${IS_DIRECTED:=true}
# Set whether to encode ID or not.
NEED_ENCODE=${NEED_ENCODE:=true}
# The ID type of the data source vertices, for example, string, int32, or int64.
VTYPE=${VTYPE:=int32}
# Encoding type. The value distributed specifies the distributed vertex ID encoding. The value single specifies the single-machine vertex ID encoding. 
ENCODER=${ENCODER:="distributed"}
# The parameters for the PageRank algorithm. Parameters differ between algorithms.
EPS=${EPS:=0.0001}
DAMPING=${DAMPING:=0.85}
# The number of iterations.
ITERATIONS=${ITERATIONS:=100}
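
Most settings above use the shell form `${VAR:=default}`, which assigns the default only when the variable is unset or empty. The sketch below illustrates that behavior with two of the PageRank settings copied from above. The function `resolve_params` is a hypothetical stand-in for the script, not part of NebulaGraph Analytics.

```shell
# Stand-in for two parameter lines of run_pagerank.sh, copied from the settings above.
resolve_params() {
  ITERATIONS=${ITERATIONS:=100}
  DAMPING=${DAMPING:=0.85}
  echo "ITERATIONS=$ITERATIONS DAMPING=$DAMPING"
}

# Nothing set beforehand: the defaults apply.
( unset ITERATIONS DAMPING; resolve_params )

# Values set by the caller: the defaults are skipped.
( ITERATIONS=20; DAMPING=0.9; resolve_params )
```

If the script keeps these lines unchanged, a call such as `ITERATIONS=20 DAMPING=0.9 ./run_pagerank.sh` should therefore override the defaults for a single run without editing the file. WNUM and WCORES are plain assignments in the listing above, so they would still need to be edited in the script itself.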

Local or HDFS CSV files as the data source

Modify parameters in the script to be used, such as run_pagerank.sh.

# The total number of processes running across all machines in the cluster. The recommended value is the number of machines or the number of NUMA nodes.
WNUM=3
# The number of threads per process. The recommended maximum is the number of hardware threads of the machine.
WCORES=4
# The path to the data source.
# Set to read data from NebulaGraph via the nebula.conf file.
# INPUT=${INPUT:="nebula:$PROJECT/scripts/nebula.conf"}  
# Set to read data from the CSV files on HDFS or on local directories.
INPUT=${INPUT:="$PROJECT/data/graph/v100_e2150_ua_c3.csv"}

# The export path to the graph computation results.
# Data can be exported to a NebulaGraph cluster. If the data source is also a NebulaGraph cluster, the results are exported to the graph space specified in nebula.conf.
# OUTPUT=${OUTPUT:="nebula:$PROJECT/scripts/nebula.conf"}
# Data can also be exported to the CSV files on HDFS or on local directories.
OUTPUT=${OUTPUT:='hdfs://192.168.8.100:9000/_test/output'}

# Whether the graph is directed. true means a directed graph; false means an undirected graph.
IS_DIRECTED=${IS_DIRECTED:=true}
# Set whether to encode ID or not.
NEED_ENCODE=${NEED_ENCODE:=true}
# The ID type of the data source vertices, for example, string, int32, or int64.
VTYPE=${VTYPE:=int32}
# Encoding type. The value distributed specifies the distributed vertex ID encoding. The value single specifies the single-machine vertex ID encoding.
ENCODER=${ENCODER:="distributed"}
# The parameters for the PageRank algorithm. Parameters differ between algorithms.
EPS=${EPS:=0.0001}
DAMPING=${DAMPING:=0.85}
# The number of iterations.
ITERATIONS=${ITERATIONS:=100}

Modify the configuration file cluster to set the NebulaGraph Analytics cluster nodes and task assignment weights for executing the algorithm.
```
# NebulaGraph Analytics Cluster Node IP Addresses: Task Assignment Weights
192.168.8.200:1
192.168.8.201:1
192.168.8.202:1
```
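
A malformed line in the cluster file is easy to miss, so a quick format check before running an algorithm may be useful. The sketch below assumes that every entry is an IPv4 address followed by a colon and an integer weight, as in the example above. Whether host names or other forms are also accepted is not covered here. The helper name `invalid_cluster_lines` is hypothetical.

```shell
# Create a stand-in for the cluster file so the example is self-contained.
cluster_file="$(mktemp)"
cat > "$cluster_file" <<'EOF'
# NebulaGraph Analytics Cluster Node IP Addresses: Task Assignment Weights
192.168.8.200:1
192.168.8.201:1
192.168.8.202:1
EOF

# Print every line that is neither a comment nor empty and that does not
# match the <IPv4 address>:<integer weight> form.
invalid_cluster_lines() {
  grep -v -E '^[[:space:]]*(#|$)' "$1" |
    grep -v -E '^([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+$'
}

# The function exits with status 0 only when it printed at least one bad line.
if invalid_cluster_lines "$cluster_file"; then
  echo "cluster file has malformed lines (printed above)"
else
  echo "cluster file looks well-formed"
fi
```
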
Run the algorithm script. For example:
```
./run_pagerank.sh
```
View the graph computation results in the export path.
- If the results are exported to a NebulaGraph cluster, check them according to the settings in nebula.conf.
- If the results are exported to CSV files on HDFS or in a local directory, check them in the path set in OUTPUT. The results are stored as compressed files in the .gz format.
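
For results exported to a local directory, the .gz files can be read without unpacking them to disk. The sketch below builds its own sample file, because the actual file names and column layout of the output are not described here. The file name `part-0.gz`, the `vertex ID,score` layout, and the helper `top_results` are all assumptions made for illustration.

```shell
# Build a hypothetical sample output directory so the example is self-contained.
out_dir="$(mktemp -d)"
# Assumed layout: vertex ID and score separated by a comma. The real layout may differ.
printf '101,0.15\n102,0.42\n103,0.27\n' | gzip > "$out_dir/part-0.gz"

# Print the N rows with the highest value in the second column.
# $1: directory that contains the .gz files, $2: number of rows to print.
top_results() {
  gunzip -c "$1"/*.gz | LC_ALL=C sort -t ',' -k2,2nr | head -n "$2"
}

top_results "$out_dir" 2
```

For results on HDFS, the files would first need to be copied to a local directory, for example with `hdfs dfs -get`, before a helper like this could read them.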

Last update: February 1, 2023