Storage load balance

You can use the BALANCE statement to balance the distribution of partitions and Raft leaders, or clear some Storage servers for easy maintenance. For details, see BALANCE.

Danger

The BALANCE commands migrate data and balance the distribution of partitions by creating and executing a set of subtasks. DO NOT stop any machine in the cluster or change its IP address until all the subtasks finish. Otherwise, the follow-up subtasks fail.

Balance partition distribution

Enterprise only

Only available for the NebulaGraph Enterprise Edition.

Note

If the current graph space already has a BALANCE DATA job in the FAILED status, you can restore the FAILED job, but cannot start a new BALANCE DATA job. If the job continues to fail, manually stop it, and then you can start a new one.

The BALANCE DATA command starts a job that balances the distribution of storage partitions in the current graph space by creating and executing a set of subtasks.

Examples

After you add new storage hosts to the cluster, no partition is deployed on the new hosts.

Run SHOW HOSTS to check the partition distribution.

nebula> SHOW HOSTS;
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+
| Host            | Port | Status   | Leader count | Leader distribution   | Partition distribution | Version              |
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+
| "192.168.8.101" | 9779 | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.4.0"              |
| "192.168.8.100" | 9779 | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.4.0"              |
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+

Enter the graph space basketballplayer, and execute the command BALANCE DATA to balance the distribution of storage partitions.

nebula> USE basketballplayer;
nebula> BALANCE DATA;
+------------+
| New Job Id |
+------------+
| 25         |
+------------+

The job ID is returned after running BALANCE DATA. Run SHOW JOB <job_id> to check the status of the job.

nebula> SHOW JOB 25;
+------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
| Job Id(spaceId:partId) | Command(src->dst) | Status     | Start Time                 | Stop Time                  | State       |
+------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
| 25                     | "DATA_BALANCE"    | "FINISHED" | 2023-01-17T06:24:35.000000 | 2023-01-17T06:24:35.000000 | "SUCCEEDED" |
| "Total:0"              | "Succeeded:0"     | "Failed:0" | "In Progress:0"            | "Invalid:0"                | ""          |
+------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
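The summary row in the SHOW JOB output above can be checked by a script. The following is a minimal sketch that assumes you already have that row's cells as strings in the form shown in the example. The helper names `parse_job_summary` and `all_subtasks_succeeded` are made up for illustration and are not part of any NebulaGraph client.

```python
def parse_job_summary(cells):
    """Turn summary cells such as '"Total:0"' into {"Total": 0, ...}.

    Empty cells (the trailing "" in the example output) are skipped.
    """
    summary = {}
    for cell in cells:
        text = cell.strip().strip('"')
        if not text:
            continue
        # Split on the last colon so names with spaces ("In Progress") survive.
        name, _, count = text.rpartition(":")
        summary[name] = int(count)
    return summary


def all_subtasks_succeeded(summary):
    """True when nothing failed, is still running, or was invalidated."""
    return (
        summary.get("Failed", 0) == 0
        and summary.get("In Progress", 0) == 0
        and summary.get("Invalid", 0) == 0
        and summary.get("Succeeded", 0) == summary.get("Total", 0)
    )


# Cells copied from the summary row of the example output.
row = ['"Total:0"', '"Succeeded:0"', '"Failed:0"', '"In Progress:0"', '"Invalid:0"', '""']
summary = parse_job_summary(row)
print(summary)
print(all_subtasks_succeeded(summary))
```

How the cells are obtained (console output, a client library, and so on) is left open here. The sketch only covers interpreting the counts.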

When all the subtasks succeed, the load balancing process finishes. Run SHOW HOSTS again to make sure the partition distribution is balanced.

Note

BALANCE DATA does not balance the leader distribution. For more information, see Balance leader distribution.

nebula> SHOW HOSTS;
+-----------------+------+----------+--------------+----------------------+------------------------+----------------------+
| Host            | Port | Status   | Leader count | Leader distribution  | Partition distribution | Version              |
+-----------------+------+----------+--------------+----------------------+------------------------+----------------------+
| "192.168.8.101" | 9779 | "ONLINE" | 7            | "basketballplayer:7" | "basketballplayer:7"   | "3.4.0"              |
| "192.168.8.100" | 9779 | "ONLINE" | 8            | "basketballplayer:8" | "basketballplayer:8"   | "3.4.0"              |
+-----------------+------+----------+--------------+----------------------+------------------------+----------------------+
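To make "balanced" concrete, the sketch below compares the Partition distribution cells before and after the job. It assumes each cell lists `space:count` pairs, separated by commas when a host serves several graph spaces, and that a host with no partitions shows "No valid partition" as in the first example. The function names are hypothetical.

```python
def parse_distribution(cell):
    """Parse a cell like '"basketballplayer:7"' into {space: count}.

    "No valid partition" (a host without partitions) yields an empty dict.
    """
    text = cell.strip().strip('"')
    if text == "No valid partition":
        return {}
    counts = {}
    for item in text.split(","):
        space, _, count = item.strip().rpartition(":")
        counts[space] = int(count)
    return counts


def partition_spread(cells, space):
    """Difference between the largest and smallest per-host count for a space."""
    counts = [parse_distribution(cell).get(space, 0) for cell in cells]
    return max(counts) - min(counts)


# Partition distribution column before and after BALANCE DATA in the examples.
before = ['"No valid partition"', '"basketballplayer:15"']
after = ['"basketballplayer:7"', '"basketballplayer:8"']
print(partition_spread(before, "basketballplayer"))  # 15
print(partition_spread(after, "basketballplayer"))   # 1
```

With 15 partitions on two hosts, a spread of 1 is the most even split possible, which matches the second SHOW HOSTS output.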

If any subtask fails, run RECOVER JOB <job_id> to recover the failed job. If rerunning the load balancing does not solve the problem, ask for help in the NebulaGraph community.

Stop data balancing

To stop a balance job, run STOP JOB <job_id>.

If no balance job is running, an error is returned.

If a balance job is running, Job stopped is returned.

Note

STOP JOB <job_id> does not stop the running subtasks but cancels all follow-up subtasks. The status of follow-up subtasks is set to INVALID. The status of ongoing subtasks is set to SUCCEEDED or FAILED based on the result. You can run the SHOW JOB <job_id> command to check the stopped job status.

Once all the subtasks are finished or stopped, you can run RECOVER JOB <job_id> to resume balancing the partitions. The remaining subtasks continue from their original state.
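The effect of STOP JOB on subtask statuses, as described in the note above, can be modeled with a small sketch. This is an illustration of the documented behavior only, not NebulaGraph code. The labels "IN_PROGRESS" and "PENDING", the subtask names, and the `finish_running` callback are assumptions made for the example.

```python
def stop_job(subtasks, finish_running):
    """Return subtask statuses after STOP JOB.

    subtasks: list of (name, status) pairs, where status is "SUCCEEDED",
        "FAILED", "IN_PROGRESS", or "PENDING" (labels chosen for this sketch).
    finish_running: callable(name) -> bool that stands in for the real
        outcome of a subtask that was already running when the job stopped.
    """
    result = []
    for name, status in subtasks:
        if status == "IN_PROGRESS":
            # Running subtasks are not interrupted; they end on their own result.
            status = "SUCCEEDED" if finish_running(name) else "FAILED"
        elif status == "PENDING":
            # Follow-up subtasks are cancelled.
            status = "INVALID"
        result.append((name, status))
    return result


subtasks = [("move-part-1", "SUCCEEDED"), ("move-part-2", "IN_PROGRESS"), ("move-part-3", "PENDING")]
print(stop_job(subtasks, lambda name: True))
```

In this run the already running subtask completes successfully, and the one that had not started ends up INVALID.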

Restore a balance job

To restore a balance job in the FAILED or STOPPED status, run RECOVER JOB <job_id>.

Note

For a STOPPED BALANCE DATA job, NebulaGraph checks whether any FAILED or FINISHED job of the same type has been created since the start time of the job. If so, the STOPPED job cannot be restored. For example, if chronologically there are STOPPED job1, FINISHED job2, and STOPPED job3, only job3 can be restored, and job1 cannot.
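The restore rule in the note above can be written out as a short check. This sketch assumes the jobs are all of the same type and are given in chronological order of start time. The function name `restorable_stopped_jobs` is hypothetical.

```python
def restorable_stopped_jobs(jobs):
    """Return the names of STOPPED jobs that can still be restored.

    jobs: list of (name, status) pairs in chronological order, all of the
        same job type (for example BALANCE DATA).
    A STOPPED job is blocked when any later job is FAILED or FINISHED.
    """
    restorable = []
    for index, (name, status) in enumerate(jobs):
        if status != "STOPPED":
            continue
        later_statuses = [later_status for _, later_status in jobs[index + 1:]]
        if not any(s in ("FAILED", "FINISHED") for s in later_statuses):
            restorable.append(name)
    return restorable


# The scenario from the note: only job3 can be restored.
history = [("job1", "STOPPED"), ("job2", "FINISHED"), ("job3", "STOPPED")]
print(restorable_stopped_jobs(history))  # ['job3']
```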

Migrate partition

To migrate specified partitions and scale in the cluster, you can run BALANCE DATA REMOVE <ip>:<port> [,<ip>:<port> ...].

For example, to migrate the partitions on server 192.168.8.100:9779, run the following command:

nebula> BALANCE DATA REMOVE 192.168.8.100:9779;
nebula> SHOW HOSTS;
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+
| Host            | Port | Status   | Leader count | Leader distribution   | Partition distribution | Version              |
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+
| "192.168.8.101" | 9779 | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.4.0"              |
| "192.168.8.100" | 9779 | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.4.0"              |
+-----------------+------+----------+--------------+-----------------------+------------------------+----------------------+

Note

This command migrates partitions to other storage hosts but does not delete the current storage host from the cluster. To delete storage hosts from the cluster, see Manage Storage hosts.
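When several hosts are drained at once, the statement can be assembled from a host list. The sketch below follows the unquoted `ip:port` form used in the example above. The helper name `balance_data_remove` is hypothetical, and whether your NebulaGraph version expects the address to be quoted is something to verify against the BALANCE reference.

```python
def balance_data_remove(hosts):
    """Build a BALANCE DATA REMOVE statement from (ip, port) pairs."""
    if not hosts:
        raise ValueError("at least one storage host is required")
    targets = []
    for ip, port in hosts:
        port = int(port)
        if not 0 < port < 65536:
            raise ValueError(f"invalid port: {port}")
        targets.append(f"{ip}:{port}")
    return "BALANCE DATA REMOVE " + ", ".join(targets) + ";"


print(balance_data_remove([("192.168.8.100", 9779)]))
# BALANCE DATA REMOVE 192.168.8.100:9779;
```

The function only builds the text of the statement. Running it still requires entering the target graph space first, as in the BALANCE DATA example.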

Balance leader distribution

To balance the Raft leaders, run BALANCE LEADER.

Example

nebula> BALANCE LEADER;

Run SHOW HOSTS to check the balance result.

nebula> SHOW HOSTS;
+------------------+------+----------+--------------+-----------------------------------+------------------------+----------------------+
| Host             | Port | Status   | Leader count | Leader distribution               | Partition distribution | Version              |
+------------------+------+----------+--------------+-----------------------------------+------------------------+----------------------+
| "192.168.10.101" | 9779 | "ONLINE" | 8            | "basketballplayer:3"              | "basketballplayer:8"   | "3.4.0"              |
| "192.168.10.102" | 9779 | "ONLINE" | 3            | "basketballplayer:3"              | "basketballplayer:8"   | "3.4.0"              |
| "192.168.10.103" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.0"              |
| "192.168.10.104" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.0"              |
| "192.168.10.105" | 9779 | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.4.0"              |
+------------------+------+----------+--------------+-----------------------------------+------------------------+----------------------+

Caution

In NebulaGraph 3.4.0, switching leaders will cause a large number of short-term request errors (Storage Error E_RPC_FAILURE). For solutions, see FAQ.

Last update: February 19, 2024