The cluster diagnostics feature in Dashboard Enterprise Edition locates and analyzes cluster problems within a specified time range, and summarizes the diagnostic results and cluster monitoring information in web-based diagnostic reports.
- Diagnostic reports allow you to troubleshoot the current cluster problems within a specified time range.
- Diagnostic reports give you a quick overview of the nodes, services, service configurations, and query sessions in the cluster.
- Based on the diagnostic reports, you can make operation and maintenance recommendations and cluster alerts.
- In the top navigation bar of the Dashboard Enterprise Edition page, click Cluster Management.
- On the right side of the target cluster, click Detail.
- In the left navigation bar, click Information->Cluster Diagnostics.
Create diagnostic reports
Select a time range for diagnostics. You can customize the time range or set the range by selecting time intervals, including
7 Days, and
Note that the end time of the diagnostic range cannot be later than the current time. If you set an end time that is later than the current time, it is reset to the current time.
On the Cluster Diagnostics page, click Start.
Wait for the diagnostic report to be generated. When the diagnostic status changes from generating to success, the diagnostic report is ready.
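The end-time rule in the note above can be illustrated with a minimal Python sketch. Dashboard applies this rule itself in the UI. The function name `clamp_diagnostic_range` and its parameters are hypothetical and are not part of any Dashboard API.

```python
from datetime import datetime, timedelta, timezone


def clamp_diagnostic_range(start, end, now=None):
    """Return (start, end) with the end time capped at the current time.

    Hypothetical helper for illustration only: it mirrors the documented
    rule that an end time later than the current time is reset to the
    current time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if end > now:
        end = now
    return start, end


# Example: a 7-day range whose end time lies one hour in the future.
current = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
start, end = clamp_diagnostic_range(
    current - timedelta(days=7),
    current + timedelta(hours=1),
    now=current,
)
print(end == current)  # True: the end time was reset to the current time
```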
View diagnostic reports
In the diagnostic report list, you can view the diagnostic reports by clicking Detail on the right side of the target report.
A diagnostic report contains the following information:
- Diagnosis Result
- Basic Info
- Load Info
- Service Info
- Configuration Info
When the following parameters are abnormal, the corresponding information is displayed in the Diagnosis Result section, including the parameter name, type, severity, and details.
The total number of nGQL statements that reach the memory high-water mark during execution.
Graph services stopped running.
Storage services stopped running.
Meta services stopped running.
The service used to collect data from the node stopped running.
- When no abnormality is diagnosed, no diagnostic information is displayed in the diagnostic result.
- Report Time Range: Displays the time range of the diagnostic report.
- Node Info: Displays the basic information of the node, including the node IP, number of services, CPU, memory, and disk.
    - Node IP: The IP address of the node.
    - Number of services: The number of NebulaGraph services deployed on this node, for example, metad*1 graphd*1 storaged*1.
    - CPU: The number of CPU cores. Unit: Core.
    - Memory: The memory size of the node. Unit: GB.
    - Disk: The disk size of the node. Unit: GB.
- Service Info: Displays the type, node IP, HTTP port, and operational status of each NebulaGraph service.
Leader Distribution: Displays the distribution of Leaders in Storage services.
Displays the access addresses for Storage services.
Number of Leaders
Displays the number of Leaders in the corresponding Storage service.
Displays the number of Leaders in each graph space in the corresponding Storage service.
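If you copy the service count shown under Node Info (for example, metad*1 graphd*1 storaged*1) into a script, it can be split into per-service counts. The following is a minimal sketch that assumes the value is a space-separated list of `name*count` tokens, as in the example. The function name `parse_service_counts` is hypothetical.

```python
def parse_service_counts(text):
    """Parse a string such as 'metad*1 graphd*1 storaged*1' into a dict.

    Assumption: tokens are separated by whitespace and each token has the
    form '<service>*<count>', matching the example in the report.
    """
    counts = {}
    for token in text.split():
        name, sep, count = token.partition("*")
        if not name or not sep or not count.isdigit():
            raise ValueError(f"unexpected token: {token!r}")
        counts[name] = counts.get(name, 0) + int(count)
    return counts


print(parse_service_counts("metad*1 graphd*1 storaged*1"))
# {'metad': 1, 'graphd': 1, 'storaged': 1}
```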
Displays the load information of the node, including the average value (AVG), maximum value (MAX), minimum value (MIN) of the following metrics of the node within the time range:
- Memory Utilization: Displays the node memory usage in %.
- CPU Utilization: Displays the node CPU usage in %.
- Disk Utilization: Displays the total disk utilization of the node and the utilization of each disk in the node in %.
Displays the network traffic information of all nodes in the cluster, including the average (AVG), maximum (MAX), and minimum (MIN) values of the following metrics:
- NetworkOut: Displays the outbound network rate of each node in the cluster and of each NIC on each node. Unit: Bytes/s.
- NetworkIn: Displays the inbound network rate of each node in the cluster and of each NIC on each node. Unit: Bytes/s.
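The AVG, MAX, and MIN values in the load and network sections summarize the samples of a metric within the report time range. The following minimal sketch shows what such a summary means for a list of samples. It only illustrates the arithmetic and is not Dashboard's implementation; the function name `summarize` and the sample values are made up.

```python
from statistics import fmean


def summarize(samples):
    """Return the average, maximum, and minimum of a list of metric samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {"AVG": fmean(samples), "MAX": max(samples), "MIN": min(samples)}


# Example: CPU utilization samples (in %) collected within the time range.
print(summarize([10.0, 20.0, 30.0]))
# {'AVG': 20.0, 'MAX': 30.0, 'MIN': 10.0}
```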
Displays the session-related information for all Graph services in the cluster.
- The number of sessions connected to the server.
- The number of sessions in which login authentication failed.
- The number of currently active sessions.
- The number of expired sessions actively reclaimed by the server.
Displays metrics related to the stability of each service in the cluster.
query: The number of all queries.
slow_queries: The number of slow queries.
num_killed_queries: The number of killed queries.
num_queries_hit_memory_watermark: The total number of nGQL statements that reach the memory high-water mark during execution.
num_rpc_sent_to_metad: The number of RPC requests that the Graphd service sent to the Metad service.
heartbeat: The number of heartbeats.
delete_vertices: The number of deleted vertices.
delete_edges: The number of deleted edges.
delete_tags: The number of deleted tags.
num_rpc_sent_to_metad: The number of RPC requests that the Storaged service sent to the Metad service.
The descriptions of other parameters are as follows:
- The total number of times this monitoring metric is executed.
- The number of errors that occurred.
- The 75th percentile latency.
- The 95th percentile latency.
- The 99th percentile latency.
- The 99.9th percentile latency.
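A percentile latency is the value below which the given percentage of measured latencies fall. The following minimal sketch uses the nearest-rank method to make this concrete. This document does not specify how NebulaGraph or Dashboard compute percentiles, so treat the method as an assumption. The function name `percentile` and the sample data are hypothetical.

```python
import math


def percentile(latencies, p):
    """Return the p-th percentile of latencies using the nearest-rank method."""
    if not latencies:
        raise ValueError("at least one latency sample is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(latencies)
    # Nearest rank: the smallest rank that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Example: 100 latency samples of 1, 2, ..., 100 (arbitrary time units).
samples = list(range(1, 101))
print(percentile(samples, 75))  # 75
print(percentile(samples, 95))  # 95
print(percentile(samples, 99))  # 99
```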
Lists all configuration information for Graph, Meta, and Storage services in the current cluster.
For information about the configurations of each service in NebulaGraph, see Configurations.