Skip to content

Storage Design

Abstract

This document gives an introduction to the storage design of the graph database Nebula Graph.

The Storage Service of Nebula Graph is composed of two parts. One is Meta Service that stores the meta data, the other is Storage Servicethat stores the data. The two services are in two independent processes. The data and deployment are separated but their architectures are almost the same.

Architecture


image

Pict.1 The Architecture of Storage Service


As shown in Picture One, there are three layers in Storage Service. The bottom layer is the local storage engine, providing get, put, scan and delete operations on local data. The related interfaces are in KVStore/KVEngine.h and users can develop their own local store plugins based on their needs. Currently, Nebula Graph provides store engine based on RocksDB.

Above the local storage engine is the consensus layer that implements multi group raft. Each partition corresponding to a Raft group, is for the data sharding. Currently, Nebula Graph uses hash to shard data. When creating a space, users need to specify the partition number. Once set, partition number cannot be changed. Generally, the partition number must meet the need to scale-out in the future.

Above the consensus layer is the storage interface that defines a set of APIs that are related to graph. These API requirements are translated into a set of kv operations targeting at the corresponding partition. It is this layer that makes the storage service a real graph storage, otherwise it's just a kv storage. Nebula Graph doesn't separate kv as an independent service as graph query involves a lot of calculation that uses schema, which is not existed in kv layer. Such architecture makes pushing down more easily.

Schema & Partition

Nebula Graph stores the vertices, edges and properties. Efficient property filtering is critical for a Graph Database.

Nebula Graph uses tags to indicate vertices types. One vertexID can relate with multiple tags, and each tag has its own properties. In kv storage, we use vertexID + TagID to represent key, and the related properties are placed behind value after encoding. The format is shown in Picture Two:


image

Pict.2 Vertex Key Format


  • Type: one byte, used to indicate key type, current types are data, index, system, etc.
  • Part ID: three bytes, used to indicate the sharding partition. This field can be used to scan the partition data based on the prefix when re-balancing the partition
  • Vertex ID: eight bytes, used to indicate vertex ID
  • Tag ID: four bytes, used to relate tag
  • Timestamp: eight bytes, not available to users, used in MVCC in the future

Each edge in Nebula Graph is modeled as two independent key-values. One is stored in the same partition as the source vertex, namely out-edge. The other one is stored in the same partition as the destination vertex, namely in-edge. Generally, out-key and in-key are stored in different partitions.

There may be multiple edge types or multiple edges with the same edge type between two vertices. For example, define an edge type 'transfer', user A may transfer money to user B multiple times, thus a field rank is added to distinguish the transfer records between the two. Edge key format is shown in Picture Three:


image

Pict.3 Edge Key Format


  • Type: one byte, used to indicate key type, current types are data, index, system, etc.
  • Part ID: three bytes, used to indicate the sharding partition. This field can be used to scan the partition data based on the prefix when re-balancing the partition
  • Vertex ID: eight bytes, used to indicate source vertex ID in out-edge, dest vertex ID in in-edge
  • Edge Type: four bytes, used to indicate edge type, greater than zero means out-edge, less than zero means in-edge
  • Rank: four bytes, used to indicate multiple edges in one edge type. Users can set the field based on needs and store weight such as transaction time, transaction number and weight of a certain order
  • Vertex ID : eight bytes, used to indicate dest vertex ID in out-edge, source vertex ID in in-edge
  • Timestamp: eight bytes, not available to users, used in MVCC in the future

If the value of an edge type is greater than zero, the corresponding edge key format is shown in Picture Four; otherwise if the value is less than zero, the corresponding edge key format is shown in Picture Five.


image

Pict.4 Out-key format


image

Pict.5 In-key format


The properties of vertices and edges are stored in the values of the corresponding kv pairs after being encoded. As a strong typed database, Nebula Graph needs to get the schema information in the Meta Service before decoding. Besides, when properties are encoded, schema version is added to support online schema change.

Nebula Graph shards data through modulo operation on vertex ID. All the out-keys, in-keys and tag data are placed in the same partition through modulo operation on vertex ID. In this way, query efficiency is increased dramatically. For on-line queries, the most common operation is to do Breadth-First-Search (BFS) expansion from a vertex, thus it's the basic operation to get a vertex's in-edge or out-edge, and such operation's performance determines the traverse's performance. In BFS, there are cases filtering some properties, Nebula Graph guarantees the operation efficiency by putting properties and vertices and edges together. At present, most graph databases verify their efficiency with Graph 500 or Twitter data set, which are of no eloquence because properties are not included. While most cases use property graphs and BFS needs a lot of filtering as well.

KVStore

Nebula Graph develops its own kv store to meet the performance needs:

  • High performance pure kv
  • Provided as a library, as a strong typed database, the performance of pushing down is the key of the performance of Nebula Graph
  • Strong data consistency, determined by the distribution system
  • Written in C++

For users who are not sensitive to performance or unwilling to migrate data, Nebula Graph provides plugins for the whole kv store, builds the storage service on the third party kv store. Currently, Nebula Graph provides HBase plugin.

With RocksDB as the local storage engine, Nebula Graph supports self management of multiple hard disks to take full use of the parallel property. All users need to do is to configure multiple data directories. Nebula Graph manages the distributed kv store in a unified scheduling way with meta service. All the partition distribution data and current machine status can be found in the meta service. Users can input commands in the console to add or remove machines to generate and execute a balance plan in meta service.

Nebula Graph provides its own WAL mode to customize the WAL. Each partition owns its WAL to catch up data without WAL splitting to guarantee efficiency. Besides, Nebula Graph defines a Command Log class to conduct special operations. These logs have no real data and are only used to inform all replicas to execute certain operations with raft. What's more, Nebula Graph also provides a log class to conduct atomic operation on partitions. For example, the CAS and the read-modify-write which fully uses raft serialization.

One Nebula Graph kv store cluster supports multiple spaces, and each space is set with its own partition number and replica number. Different spaces are isolated physically and spaces in the same cluster support different storage engines and sharding strategies.

Raft

This part gives some details on how raft works for Nebula Graph.

Multi Raft Group

As the log IDs must be serial, almost all the implementations adapt Multi Raft Group to solve this problem. Therefore, partition number almost determines the performance of the Raft Group, but it doesn't mean the more partitions, the better the performance. Each raft group stores a set of state information and has its own WAL file, thus more partitions require more costs. Also, when there are too many Partitions, the batch operation is meaningless if the load is not high enough. For example, a single online system with 10,000 tps, if its partition number is greater than ten thousand, the tps of each partition per second is likely only 1, so the batch operation loses its meaning and increases CPU cost.

There are two key points to achieve the Multi Raft Group. First is to share the transport layer, because each Raft Group needs to send messages to the corresponding peers. If the transport layer cannot be shared, the connection costs are huge. Second is the threading model, multi Raft Group must share a set of thread pools, otherwise it will end with too many threads in the system, resulting in a large amount of context switch cost.

Batch

For each Partition, it is necessary to do batch to improve throughput when writing WAL serially. In general, there is nothing special about batch, but Nebula Graph designs some special types of WAL based on each part serialization, which brings some challenges.

For example, Nebula Graph uses WAL to implement lock-free CAS operations, and each CAS operation will execute after all the previous WAL are committed. So for a batch, if there are several WALs in CAS type, we need to divide this batch into several smaller serial groups.

Learner

New machines need to "catch up" data for quite a long time, and there may be accidents during the process. If a new machine joins the raft group as a follower, it will reduce the HA capacity of the entire cluster. Nebula Graph uses the command WAL mentioned above to implement its learner. If the leader encounters the add learner command when writing WAL, it will add the learner to its peers and mark it as a learner, so logs will send to them as normal. But the learners will not involve in the leader's election.

Transfer Leadership

Transfer leadership is extremely important for balance. When moving a partition from one machine to another, Nebula Graph first checks if the source is a leader. If so, it should be moved to another peer. After data migration is completed, please do balance the leader again, so that the load on each machine can be balanced.

When transferring leadership, note that when the leader abandons its leadership and when the follower starts a leader election. When a transfer leadership command is committed, the original leader abandons its leadership. And as for followers, when receiving the command, it starts a leader election. This is the same execution process with common raft leader election. Otherwise, some corner cases will occur.

Membership Change

To avoid split-brain, when members in a Raft Group change, an intermediate state is required. In such state, the majority of the old group and new group always have an overlap, thus preventing the old or new group from making decisions unilaterally. This is the joint consensus mentioned in the thesis. To make it even simpler, Diego Ongaro suggests adding or removing a peer once to ensure the overlap between the majority of the new group and the old group in his doctoral thesis. Nebula Graph's implementation also uses this approach, except that the way to add or remove member is different. For details, please refer to addPeer/removePeer in Raft Part class.

Snapshot

When the leader changes during sending snapshot, the follower may only receive half of the snapshot data, so partition data cleanup is required. Because multiple partitions share a single storage, how to clean up the data is a cumbersome problem. In addition, during the snapshot process, a large amount of IO is generated. Considering the performance, we do not want this process to share an IO threadPool with the normal Raft, and the process requires a large amount of memory, thus how to optimize the memory usage is critical for performance.

Storage Service

  • getNeighbors: query the in-edge or out-edge of a set of vertices, return the edges and the corresponding properties, support conditional filters
  • Insert vertex/edge: insert a vertex or edge and its properties
  • getProps: get the properties of a vertex or an edge

Graph semantics interfaces are translated into kv operations in this layer. In order to improve the performance of traversal, concurrent operations are also required.

Meta Service

On the kv store interface, Nebula Graph encapsulates a set of meta-related interfaces. Meta service supports CRUD on schema, cluster management and user management. Meta service supports single-machine deployment, and multiple replicas to ensure data security.