Storage load balance¶
You can use the BALANCE statement to balance the distribution of partitions and Raft leaders, or clear some Storage servers for easy maintenance. For details, see BALANCE.
Danger
The BALANCE commands migrate data and balance the distribution of partitions by creating and executing a set of subtasks. DO NOT stop any machine in the cluster or change its IP address until all the subtasks finish. Otherwise, the follow-up subtasks fail.
Balance partition distribution¶
Enterpriseonly
Only available for the NebulaGraph Enterprise Edition.
Note
If the current graph space already has a BALANCE DATA job in the FAILED status, you can restore the FAILED job, but cannot start a new BALANCE DATA job. If the job continues to fail, manually stop it, and then you can start a new one, as sketched below.
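A minimal sketch of that workflow, assuming the existing FAILED job has the hypothetical ID 2:

nebula> SHOW JOB 2;      # job ID 2 is a placeholder; confirm that the job status is FAILED
nebula> RECOVER JOB 2;   # restore the FAILED job instead of starting a new one
nebula> STOP JOB 2;      # if the job keeps failing, stop it manually
nebula> BALANCE DATA;    # then start a new balance job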
The BALANCE DATA command starts a job to balance the distribution of storage partitions in the current graph space by creating and executing a set of subtasks.
Examples¶
After you add new storage hosts to the cluster, no partitions are deployed on the new hosts yet.
- Run SHOW HOSTS to check the partition distribution.

  nebula> SHOW HOSTS;
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
  | Host            | Port | HTTP port | Status   | Leader count | Leader distribution   | Partition distribution | Version     |
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | 19669     | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.1.0-ent" |
  | "192.168.8.100" | 9779 | 19669     | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.1.0-ent" |
  +-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
- Enter the graph space basketballplayer, and execute the command BALANCE DATA to balance the distribution of storage partitions.

  nebula> USE basketballplayer;
  nebula> BALANCE DATA;
  +------------+
  | New Job Id |
  +------------+
  | 2          |
  +------------+
- The job ID is returned after running BALANCE DATA. Run SHOW JOB <job_id> to check the status of the job.

  nebula> SHOW JOB 2;
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
  | Job Id(spaceId:partId) | Command(src->dst)                        | Status      | Start Time                      | Stop Time                       | Error Code  |
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
  | 2                      | "DATA_BALANCE"                           | "FINISHED"  | "2022-04-12T03:41:43.000000000" | "2022-04-12T03:41:53.000000000" | "SUCCEEDED" |
  | "2, 1:1"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:2"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:3"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:4"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:5"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "2, 1:6"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:43.000000      | "SUCCEEDED" |
  | "2, 1:7"               | "192.168.8.100:9779->192.168.8.101:9779" | "SUCCEEDED" | 2022-04-12T03:41:43.000000      | 2022-04-12T03:41:53.000000      | "SUCCEEDED" |
  | "Total:7"              | "Succeeded:7"                            | "Failed:0"  | "In Progress:0"                 | "Invalid:0"                     | ""          |
  +------------------------+------------------------------------------+-------------+---------------------------------+---------------------------------+-------------+
- When all the subtasks succeed, the load balancing process finishes. Run SHOW HOSTS again to make sure the partition distribution is balanced.

  Note

  BALANCE DATA does not balance the leader distribution. For more information, see Balance leader distribution.

  nebula> SHOW HOSTS;
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
  | Host            | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version     |
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
  | "192.168.8.101" | 9779 | 19669     | "ONLINE" | 7            | "basketballplayer:7" | "basketballplayer:7"   | "3.1.0-ent" |
  | "192.168.8.100" | 9779 | 19669     | "ONLINE" | 8            | "basketballplayer:8" | "basketballplayer:8"   | "3.1.0-ent" |
  +-----------------+------+-----------+----------+--------------+----------------------+------------------------+-------------+
If any subtask fails, run RECOVER JOB <job_id> to recover the failed jobs. If redoing load balancing does not solve the problem, ask for help in the NebulaGraph community.
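For example, a minimal recovery sketch for the walkthrough above, assuming job 2 had failed subtasks:

nebula> RECOVER JOB 2;   # job ID 2 is the example job above; re-run its failed subtasks
nebula> SHOW JOB 2;      # verify that the recovered subtasks succeed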
Stop data balancing¶
To stop a balance job, run STOP JOB <job_id>, as shown in the example below.
- If no balance job is running, an error is returned.
- If a balance job is running, Job stopped is returned.
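For example, to stop the balance job from the walkthrough above (assuming its job ID is 2):

nebula> STOP JOB 2;      # job ID 2 is a placeholder; cancels the follow-up subtasks of the job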
Note
STOP JOB <job_id> does not stop the running subtasks but cancels all follow-up subtasks. The status of follow-up subtasks is set to INVALID. The status of ongoing subtasks is set to SUCCEEDED or FAILED based on the result. You can run the SHOW JOB <job_id> command to check the stopped job status.
Once all the subtasks are finished or stopped, you can run RECOVER JOB <job_id> to balance the partitions again. The remaining subtasks continue to be executed from their original states.
Restore a balance job¶
To restore a balance job in the FAILED or STOPPED status, run RECOVER JOB <job_id>.
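A minimal sketch, assuming the stopped or failed balance job has the hypothetical ID 2:

nebula> SHOW JOB 2;      # check that the job status is FAILED or STOPPED
nebula> RECOVER JOB 2;   # restore the job; unfinished subtasks resume from their original states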
Note
For a STOPPED BALANCE DATA job, NebulaGraph detects whether the same type of FAILED jobs or FINISHED jobs have been created since the start time of the job. If so, the STOPPED job cannot be restored. For example, if chronologically there are STOPPED job1, FINISHED job2, and STOPPED job3, only job3 can be restored, and job1 cannot.
Migrate partition¶
To migrate specified partitions and scale in the cluster, you can run BALANCE DATA REMOVE <ip>:<port> [,<ip>:<port> ...].
For example, to migrate the partitions on server 192.168.8.100:9779, run the following command:
nebula> BALANCE DATA REMOVE 192.168.8.100:9779;
nebula> SHOW HOSTS;
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
| Host            | Port | HTTP port | Status   | Leader count | Leader distribution   | Partition distribution | Version     |
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
| "192.168.8.101" | 9779 | 19669     | "ONLINE" | 15           | "basketballplayer:15" | "basketballplayer:15"  | "3.1.0-ent" |
| "192.168.8.100" | 9779 | 19669     | "ONLINE" | 0            | "No valid partition"  | "No valid partition"   | "3.1.0-ent" |
+-----------------+------+-----------+----------+--------------+-----------------------+------------------------+-------------+
Note
This command migrates partitions to other storage hosts but does not delete the current storage host from the cluster. To delete the Storage hosts from the cluster, see Manage Storage hosts.
Balance leader distribution¶
To balance the Raft leaders, run BALANCE LEADER.
Example¶
nebula> BALANCE LEADER;
Run SHOW HOSTS to check the balance result.
nebula> SHOW HOSTS;
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution               | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
| "192.168.10.100" | 9779 | 19669     | "ONLINE" | 4            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.101" | 9779 | 19669     | "ONLINE" | 8            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.102" | 9779 | 19669     | "ONLINE" | 3            | "basketballplayer:3"              | "basketballplayer:8"   | "3.1.0" |
| "192.168.10.103" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
| "192.168.10.104" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
| "192.168.10.105" | 9779 | 19669     | "ONLINE" | 0            | "basketballplayer:2"              | "basketballplayer:7"   | "3.1.0" |
+------------------+------+-----------+----------+--------------+-----------------------------------+------------------------+---------+
Caution
In NebulaGraph 3.1.3, switching leaders will cause a large number of short-term request errors (Storage Error E_RPC_FAILURE). For solutions, see FAQ.