Storage load balance
You can use the SUBMIT JOB BALANCE statement to balance the distribution of partitions and Raft leaders, or clear some Storage servers for easy maintenance. For details, see SUBMIT JOB BALANCE.
Danger
The BALANCE commands migrate data and balance the distribution of partitions by creating and executing a set of subtasks. DO NOT stop any machine in the cluster or change its IP address until all the subtasks are finished. Otherwise, the follow-up subtasks fail.
Balance partition distribution
The SUBMIT JOB BALANCE DATA command starts a job to balance the distribution of storage partitions in the current graph space by creating and executing a set of subtasks. If the Zone feature is enabled, you can balance the partitions within each Zone by adding the IN ZONE keywords to the command, for example, SUBMIT JOB BALANCE DATA IN ZONE.
Enterprise only
Only available for the NebulaGraph Enterprise Edition.
Note
If the current graph space already has a SUBMIT JOB BALANCE DATA job in the FAILED status, you can restore the FAILED job, but cannot start a new SUBMIT JOB BALANCE DATA job. If the job continues to fail, manually stop it, and then you can start a new one.
Balance partitions with Zone disabled
After you add new storage hosts to the cluster, no partition is deployed on the new hosts. The following steps show how to balance the partition distribution when the Zone feature is disabled.
- Run SHOW HOSTS to check the partition distribution.
  nebula> SHOW HOSTS;
  +-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
  | Host | Port | Status | Leader count | Leader distribution | Partition distribution | Version |
  +-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | "ONLINE" | 0 | "No valid partition" | "No valid partition" | "3.5.0" |
  | "192.168.8.100" | 9779 | "ONLINE" | 15 | "basketballplayer:15" | "basketballplayer:15" | "3.5.0" |
  +-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
- Enter the graph space basketballplayer, and execute the command SUBMIT JOB BALANCE DATA to balance the distribution of storage partitions.
  nebula> USE basketballplayer;
  nebula> SUBMIT JOB BALANCE DATA;
  +------------+
  | New Job Id |
  +------------+
  | 25 |
  +------------+
- The job ID is returned after running SUBMIT JOB BALANCE DATA. Run SHOW JOB <job_id> to check the status of the job.
  nebula> SHOW JOB 25;
  +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
  | Job Id(spaceId:partId) | Command(src->dst) | Status | Start Time | Stop Time | State |
  +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
  | 25 | "DATA_BALANCE" | "FINISHED" | 2023-01-17T06:24:35.000000 | 2023-01-17T06:24:35.000000 | "SUCCEEDED" |
  | "Total:0" | "Succeeded:0" | "Failed:0" | "In Progress:0" | "Invalid:0" | "" |
  +------------------------+-------------------+------------+----------------------------+----------------------------+-------------+
- When all the subtasks succeed, the load balancing process finishes. Run SHOW HOSTS again to make sure the partition distribution is balanced.
  Note
  SUBMIT JOB BALANCE DATA does not balance the leader distribution. For more information, see Balance leader distribution.
  nebula> SHOW HOSTS;
  +-----------------+------+----------+--------------+----------------------+------------------------+-------------+
  | Host | Port | Status | Leader count | Leader distribution | Partition distribution | Version |
  +-----------------+------+----------+--------------+----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | "ONLINE" | 7 | "basketballplayer:7" | "basketballplayer:7" | "3.5.0" |
  | "192.168.8.100" | 9779 | "ONLINE" | 8 | "basketballplayer:8" | "basketballplayer:8" | "3.5.0" |
  +-----------------+------+----------+--------------+----------------------+------------------------+-------------+
If any subtask fails, run RECOVER JOB <job_id> to recover the failed job. If redoing load balancing does not solve the problem, ask for help in the NebulaGraph community.
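The last row of the SHOW JOB output summarizes the subtasks, so it can be checked by a script. The following is a minimal Python sketch, not part of NebulaGraph: the function names `parse_summary` and `balance_outcome` are hypothetical, the cells are assumed to arrive as the quoted strings shown in the output above, and mapping any `Failed` or `Invalid` subtask to "needs_recovery" is an assumption based on the RECOVER JOB guidance in this section.

```python
def parse_summary(cells):
    """Turn summary cells such as '"Total:5"' into a dict of integer counters."""
    counters = {}
    for cell in cells:
        cell = cell.strip().strip('"')
        if ":" not in cell:
            continue  # skip the empty trailing cell of the summary row
        name, _, value = cell.rpartition(":")
        counters[name.strip()] = int(value)
    return counters


def balance_outcome(counters):
    """Classify a balance job as 'running', 'needs_recovery', or 'succeeded'."""
    if counters.get("In Progress", 0) > 0:
        return "running"
    # Assumption: failed or invalidated subtasks call for RECOVER JOB <job_id>.
    if counters.get("Failed", 0) > 0 or counters.get("Invalid", 0) > 0:
        return "needs_recovery"
    return "succeeded"


# Summary row taken from a SHOW JOB output with five subtasks.
summary = parse_summary(
    ['"Total:5"', '"Succeeded:5"', '"Failed:0"', '"In Progress:0"', '"Invalid:0"', '""']
)
print(balance_outcome(summary))  # prints: succeeded
```

A wrapper script could call such a check after each SHOW JOB and only proceed to SHOW HOSTS once the outcome is "succeeded".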
Balance partitions with Zone enabled
For Zone-enabled clusters, you can balance the partitions within each Zone by adding the IN ZONE keywords to the SUBMIT JOB BALANCE DATA command. After you add a new storage host to the cluster, no partition is deployed on the new host. The following example adds a new storage host 192.168.8.158 and assigns it to zone1 to show how to balance the partition distribution within the Zone zone1.
- Run SHOW HOSTS to check the partition distribution.
  nebula> SHOW HOSTS;
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
  | Host | Port | Status | Leader count | Leader distribution | Partition distribution | Zone | Version |
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
  | "192.168.8.111" | 7779 | "ONLINE" | 5 | "my_space:5" | "my_space:10" | "zone1" | "3.5.0" |
  | "192.168.8.113" | 7779 | "ONLINE" | 5 | "my_space:5" | "my_space:10" | "zone3" | "3.5.0" |
  | "192.168.8.129" | 7779 | "ONLINE" | 0 | "No valid partition" | "my_space:10" | "zone2" | "3.5.0" |
  | "192.168.8.158" | 7779 | "ONLINE" | 0 | "No valid partition" | "No valid partition" | "zone1" | "3.5.0" |
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
- Enter the graph space my_zoned_space, and execute the command SUBMIT JOB BALANCE DATA IN ZONE to balance the distribution of storage partitions within each Zone.
  nebula> USE my_zoned_space;
  nebula> SUBMIT JOB BALANCE DATA IN ZONE;
  +------------+
  | New Job Id |
  +------------+
  | 2 |
  +------------+
- Run SHOW JOB <job_id> to check the status of the data balancing job. You can obtain the job ID after running SUBMIT JOB BALANCE DATA IN ZONE in the previous step.
  nebula> SHOW JOB 2;
  +------------------------+------------------------------------------+-------------+----------------------------+----------------------------+-------------+
  | Job Id(spaceId:partId) | Command(src->dst) | Status | Start Time | Stop Time | State |
  +------------------------+------------------------------------------+-------------+----------------------------+----------------------------+-------------+
  | 2 | "DATA_BALANCE" | "FINISHED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:32.000000 | "SUCCEEDED" |
  | "2, 1:1" | "192.168.8.111:7779->192.168.8.158:7779" | "SUCCEEDED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:27.000000 | "SUCCEEDED" |
  | "2, 1:2" | "192.168.8.111:7779->192.168.8.158:7779" | "SUCCEEDED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:32.000000 | "SUCCEEDED" |
  | "2, 1:3" | "192.168.8.111:7779->192.168.8.158:7779" | "SUCCEEDED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:27.000000 | "SUCCEEDED" |
  | "2, 1:4" | "192.168.8.111:7779->192.168.8.158:7779" | "SUCCEEDED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:27.000000 | "SUCCEEDED" |
  | "2, 1:5" | "192.168.8.111:7779->192.168.8.158:7779" | "SUCCEEDED" | 2024-04-11T02:41:27.000000 | 2024-04-11T02:41:32.000000 | "SUCCEEDED" |
  | "Total:5" | "Succeeded:5" | "Failed:0" | "In Progress:0" | "Invalid:0" | "" |
  +------------------------+------------------------------------------+-------------+----------------------------+----------------------------+-------------+
  The above result shows the process of balancing the partitions within the Zone zone1. When the job succeeds, the load balancing process finishes.
- Run SHOW HOSTS again to make sure the partition distribution is balanced.
  nebula> SHOW HOSTS;
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
  | Host | Port | Status | Leader count | Leader distribution | Partition distribution | Zone | Version |
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
  | "192.168.8.111" | 7779 | "ONLINE" | 3 | "my_space:3" | "my_space:5" | "zone1" | "3.5.0" |
  | "192.168.8.113" | 7779 | "ONLINE" | 7 | "my_space:7" | "my_space:10" | "zone3" | "3.5.0" |
  | "192.168.8.129" | 7779 | "ONLINE" | 0 | "No valid partition" | "my_space:10" | "zone2" | "3.5.0" |
  | "192.168.8.158" | 7779 | "ONLINE" | 0 | "No valid partition" | "my_space:5" | "zone1" | "3.5.0" |
  +-----------------+------+----------+--------------+----------------------+------------------------+---------+-------------+
  From the result, you can see that the partition distribution is balanced on all the storage hosts within Zone zone1.
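To make the "balanced within a Zone" check concrete, the following Python sketch compares the partition counts of the hosts in each Zone. It is an illustration only: `partition_count` and `zone_spread` are hypothetical helper names, the rows are copied by hand from the final SHOW HOSTS output above, and the handling of several comma-separated `space:count` entries in one cell is an assumption about the output format.

```python
from collections import defaultdict


def partition_count(cell):
    """Sum the counts in a Partition distribution cell such as '"my_space:5"'."""
    cell = cell.strip().strip('"')
    if cell == "No valid partition":
        return 0
    total = 0
    # Assumption: multiple graph spaces appear as comma-separated space:count items.
    for item in cell.split(","):
        _, _, count = item.rpartition(":")
        total += int(count)
    return total


def zone_spread(rows):
    """Return {zone: max - min partition count among the hosts of that zone}."""
    per_zone = defaultdict(list)
    for _host, zone, cell in rows:
        per_zone[zone].append(partition_count(cell))
    return {zone: max(counts) - min(counts) for zone, counts in per_zone.items()}


# (host, zone, Partition distribution) taken from the SHOW HOSTS output above.
rows_after = [
    ("192.168.8.111", "zone1", '"my_space:5"'),
    ("192.168.8.113", "zone3", '"my_space:10"'),
    ("192.168.8.129", "zone2", '"my_space:10"'),
    ("192.168.8.158", "zone1", '"my_space:5"'),
]
print(zone_spread(rows_after))  # prints: {'zone1': 0, 'zone3': 0, 'zone2': 0}
```

A spread of 0 or 1 within a Zone can reasonably be read as balanced. Before the job ran, the same check on zone1 would have reported a spread of 10, because the new host held no partitions.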
Stop data balancing
To stop a balance job, run STOP JOB <job_id>.
- If no balance job is running, an error is returned.
- If a balance job is running, Job stopped is returned.
Note
STOP JOB <job_id> does not stop the running subtasks but cancels all follow-up subtasks. The status of follow-up subtasks is set to INVALID. The status of ongoing subtasks is set to SUCCEEDED or FAILED based on the result. You can run the SHOW JOB <job_id> command to check the stopped job status.
Once all the subtasks are finished or stopped, you can run RECOVER JOB <job_id> to resume balancing the partitions. The subtasks continue to be executed from their original state.
Restore a balance job
To restore a balance job in the FAILED or STOPPED status, run RECOVER JOB <job_id>.
Note
For a STOPPED SUBMIT JOB BALANCE DATA job, NebulaGraph detects whether any FAILED or FINISHED jobs of the same type have been created since the start time of the job. If so, the STOPPED job cannot be restored. For example, if chronologically there are STOPPED job1, FINISHED job2, and STOPPED job3, only job3 can be restored, and job1 cannot.
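The restore rule in the note can be expressed as a small check. The Python sketch below is only a model of the rule as stated here, not NebulaGraph's implementation: the function name `recoverable_stopped_jobs` is hypothetical, and it assumes the jobs are all of the same type and are listed in chronological order of their start times.

```python
def recoverable_stopped_jobs(jobs):
    """Return the names of STOPPED jobs that can still be restored.

    jobs: list of (name, status) tuples of the same job type,
    ordered chronologically by start time.
    """
    recoverable = []
    for index, (name, status) in enumerate(jobs):
        if status != "STOPPED":
            continue
        # A later FAILED or FINISHED job of the same type blocks the restore.
        later_statuses = [later_status for _, later_status in jobs[index + 1:]]
        if not any(s in ("FAILED", "FINISHED") for s in later_statuses):
            recoverable.append(name)
    return recoverable


history = [("job1", "STOPPED"), ("job2", "FINISHED"), ("job3", "STOPPED")]
print(recoverable_stopped_jobs(history))  # prints: ['job3']
```

The result matches the example in the note: job2 finished after job1 started, so only job3 remains restorable.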
Migrate partitions
To migrate specified partitions and scale in the cluster, you can run SUBMIT JOB BALANCE DATA REMOVE <ip>:<port> [,<ip>:<port> ...].
To migrate specified partitions for Zone-enabled clusters, you need to add the IN ZONE keywords. For example, SUBMIT JOB BALANCE DATA IN ZONE REMOVE <ip>:<port> [,<ip>:<port> ...]. For details, see Manage Zones.
For example, to migrate the partitions in server 192.168.8.100:9779, the command is as follows:
nebula> SUBMIT JOB BALANCE DATA REMOVE 192.168.8.100:9779;
nebula> SHOW HOSTS;
+-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution | Version |
+-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
| "192.168.8.101" | 9779 | "ONLINE" | 15 | "basketballplayer:15" | "basketballplayer:15" | "3.5.0" |
| "192.168.8.100" | 9779 | "ONLINE" | 0 | "No valid partition" | "No valid partition" | "3.5.0" |
+-----------------+------+----------+--------------+-----------------------+------------------------+-------------+
Note
This command migrates partitions to other storage hosts but does not delete the current storage host from the cluster. To delete the Storage hosts from a cluster, see Manage Storage hosts.
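When several hosts are drained at once, it can help to generate the statement rather than type it. The following Python sketch assembles the statement forms described in this section. The helper name `balance_data_statement` is hypothetical, and the sketch only builds a string in the unquoted `<ip>:<port>` form used in the example above; it does not connect to a cluster or validate the addresses.

```python
def balance_data_statement(remove_hosts=(), in_zone=False):
    """Build a SUBMIT JOB BALANCE DATA statement.

    remove_hosts: iterable of (ip, port) pairs to migrate partitions away from.
    in_zone: add the IN ZONE keywords for Zone-enabled clusters.
    """
    parts = ["SUBMIT JOB BALANCE DATA"]
    if in_zone:
        parts.append("IN ZONE")
    hosts = [f"{ip}:{port}" for ip, port in remove_hosts]
    if hosts:
        parts.append("REMOVE " + ", ".join(hosts))
    return " ".join(parts) + ";"


print(balance_data_statement([("192.168.8.100", 9779)]))
# prints: SUBMIT JOB BALANCE DATA REMOVE 192.168.8.100:9779;
```

Called without arguments, the helper returns the plain SUBMIT JOB BALANCE DATA; statement, and with `in_zone=True` it inserts IN ZONE before the optional REMOVE clause.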
Balance leader distribution
To balance the Raft leaders, run SUBMIT JOB BALANCE LEADER. It starts a job to balance the distribution of all the storage leaders in all graph spaces.
For example, to balance the leader distribution, run the following command.
nebula> SUBMIT JOB BALANCE LEADER;
Run SHOW HOSTS to check the balance result.
nebula> SHOW HOSTS;
+------------------+------+----------+--------------+-----------------------------------+------------------------+-------------+
| Host | Port | Status | Leader count | Leader distribution | Partition distribution | Version |
+------------------+------+----------+--------------+-----------------------------------+------------------------+-------------+
| "192.168.10.100" | 9779 | "ONLINE" | 4 | "basketballplayer:4" | "basketballplayer:8" | "3.5.0" |
| "192.168.10.101" | 9779 | "ONLINE" | 8 | "basketballplayer:3" | "basketballplayer:8" | "3.5.0" |
| "192.168.10.102" | 9779 | "ONLINE" | 3 | "basketballplayer:3" | "basketballplayer:8" | "3.5.0" |
| "192.168.10.103" | 9779 | "ONLINE" | 0 | "basketballplayer:2" | "basketballplayer:7" | "3.5.0" |
| "192.168.10.104" | 9779 | "ONLINE" | 0 | "basketballplayer:2" | "basketballplayer:7" | "3.5.0" |
| "192.168.10.105" | 9779 | "ONLINE" | 0 | "basketballplayer:2" | "basketballplayer:7" | "3.5.0" |
+------------------+------+----------+--------------+-----------------------------------+------------------------+-------------+
Caution
In NebulaGraph 3.5.0, switching leaders causes a large number of short-term request errors (Storage Error E_RPC_FAILURE). For solutions, see FAQ.