Dag Controller

Dag Controller is a task scheduling tool that schedules jobs of the DAG (directed acyclic graph) type. A job consists of multiple tasks that form a directed acyclic graph, with dependencies between the tasks.

The Dag Controller can perform complex graph computing with Nebula Analytics. For example, the Dag Controller sends an algorithm request to Nebula Analytics, which saves the result to NebulaGraph or HDFS. The Dag Controller then uses that result as the input of the next algorithm task.
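
As a rough illustration of how dependencies determine execution order, the following sketch orders three tasks with the standard `tsort` utility. The task names (`query`, `bfs`, `save`) are hypothetical, and this is not a description of how the Dag Controller is implemented internally; it only shows the ordering idea behind a DAG job.

```shell
#!/usr/bin/env bash
# Each input line "A B" means: task A must finish before task B starts.
# The task names are hypothetical and used only for illustration.
order_tasks() {
  tsort <<'EOF'
query bfs
bfs save
EOF
}

# Print the tasks in an order that respects every dependency.
order_tasks
```

If the input contained a cycle, `tsort` would report it as an error, which matches the requirement that a job graph be acyclic.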

This topic describes how to use the Dag Controller.

Enterprise only

Only available for the NebulaGraph Enterprise Edition.

Prerequisites

  • HDFS 2.2.x or later has been deployed.
  • JDK 1.8 has been deployed.
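
The HDFS version requirement can be checked from the command line. The sketch below is one possible approach, assuming GNU `sort` (for the `-V` option) and that the `hadoop` command is on `PATH`; the helper name `version_ge` is made up for this example, and the JDK is not checked here.

```shell
#!/usr/bin/env bash
# Return success if version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Only query Hadoop when the command is actually on PATH.
if command -v hadoop >/dev/null 2>&1; then
  # "hadoop version" prints "Hadoop <version>" on its first line.
  hadoop_version="$(hadoop version 2>/dev/null | awk 'NR==1 {print $2}')"
  if version_ge "$hadoop_version" "2.2"; then
    echo "Hadoop $hadoop_version meets the 2.2.x requirement"
  else
    echo "Hadoop $hadoop_version is older than 2.2.x"
  fi
else
  echo "hadoop not found on PATH"
fi
```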

Preparations

Installation packages and commands vary by environment. The environment used in this topic is as follows:

  • The operating system is CentOS 7.
  • If Nebula Analytics and the Dag Controller are deployed on multiple machines, ensure network connectivity between the machines.
  • If Nebula Analytics is a cluster with distributed architecture, ensure the paths and ports are configured identically on each machine.

Precautions

  • The BFS and SSSP algorithms verify the parameter root. They support only one upstream component, and rows and columns must be specified. If multiple files exist, one file is selected at random. If the specified row, column, or file is not found, an error is reported.
  • The similarity algorithm does not restrict the format of the upstream component, but columns must be specified. If multiple files exist, the files are combined in random order, and the first N rows of data are processed. If rows and columns are specified, or the specified column does not exist, an error is reported.

Deploy Nebula Analytics

  1. Install libatomic and psmisc.

    sudo yum -y install libatomic psmisc
    
  2. Install Nebula Analytics.

    sudo rpm -ivh <analytics_package_name> --prefix <install_path>
    sudo chown <user>:<user> -R <install_path>
    

    For example:

    sudo rpm -ivh nebula-analytics-3.2.0-centos.x86_64.rpm --prefix=/home/vesoft/nebula-analytics
    sudo chown vesoft:vesoft -R /home/vesoft/nebula-analytics
    
  3. Configure the correct Hadoop path and JDK path in the file set_env.sh, which is located at nebula-analytics/scripts/set_env.sh. If there are multiple machines, ensure that the paths are the same on every machine.

    export HADOOP_HOME=<hadoop_path>
    export JAVA_HOME=<java_path>
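
After editing set_env.sh, it can help to confirm that both paths point at usable installations. The following is a minimal sketch, assuming the conventional layouts in which `bin/java` sits under `JAVA_HOME` and `bin/hadoop` sits under `HADOOP_HOME`; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Return success if directory $1 contains an executable at relative path $2.
has_executable() {
  [ -n "$1" ] && [ -x "$1/$2" ]
}

# Report whether the paths exported in set_env.sh look usable.
# Assumes the conventional bin/java and bin/hadoop layouts.
check_env() {
  local status=0
  if ! has_executable "${JAVA_HOME:-}" "bin/java"; then
    echo "JAVA_HOME does not contain bin/java: ${JAVA_HOME:-<unset>}"
    status=1
  fi
  if ! has_executable "${HADOOP_HOME:-}" "bin/hadoop"; then
    echo "HADOOP_HOME does not contain bin/hadoop: ${HADOOP_HOME:-<unset>}"
    status=1
  fi
  return "$status"
}

if check_env; then
  echo "JAVA_HOME and HADOOP_HOME look valid"
fi
```

Running the check on every machine after sourcing set_env.sh is one way to confirm that the paths are identical across the cluster.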
    

Deploy Dag Controller

  1. Complete the SSH password-free configurations so that the Dag Controller machine can log directly into the Nebula Analytics machines and all machines within the Nebula Analytics cluster can connect directly to each other without passwords.

    For example, to let the user on machine A (Dag Controller) log in to machine B-1 in the Nebula Analytics cluster over SSH without a password, run the following commands on machine A:

    # Press Enter to accept the default options when generating the key.
    ssh-keygen -t rsa
    
    # After the public key of machine A is installed for the user on machine B-1, you can log in to machine B-1 from machine A without a password.
    ssh-copy-id -i ~/.ssh/id_rsa.pub <B_user>@<B_IP>
    

    In the same way, complete the SSH password-free configuration so that the user on machine A can log in directly to machines B-2, B-3, and so on, and so that all machines within the Nebula Analytics cluster can connect to each other without passwords.

  2. Add the following lines to the file ~/.bash_profile, and then run source ~/.bash_profile for them to take effect.

    eval $(ssh-agent)
    ssh-add ~/.ssh/id_rsa
    
  3. Install the Dag Controller.

    sudo rpm -ivh <dag_ctrl_package_name> --prefix <install_path>
    sudo chown <user>:<user> -R <install_path>
    

    For example:

    sudo rpm -ivh dag-ctrl-3.2.0-centos.x86_64.rpm --prefix=/home/vesoft/dag-ctrl
    sudo chown vesoft:vesoft -R /home/vesoft/dag-ctrl
    
  4. Configure the username and SSH port of the Nebula Analytics machines in the file dag-ctrl-api.yaml, which is located at dag-ctrl/etc/dag-ctrl-api.yaml. If there are multiple machines, ensure that the usernames are the same.

    # The user name and SSH port of the Nebula Analytics machine.
    SSH:
      UserName: vesoft
      Port: 22
    
    # The parallel thread pool sizes of the tasks and jobs.
    JobPool:
      Sleep: 3    # Check every 3 seconds for any outstanding jobs.
      Size: 3    # Up to 3 jobs can be executed in parallel.
    TaskPool:
      CheckStatusSleep: 1    # Check the task status every second.
      Size: 10    # Up to 10 tasks can be executed in parallel.
    
    Dag:
      VarDataListMaxSize: 100    # If HDFS columns are read, the number is limited to 100 at a time.
    
  5. Configure the algorithm file path (exec_file) in the file tasks.yaml, which is located at dag-ctrl/etc/tasks.yaml. If there are multiple machines, ensure that the paths are the same.

  6. Start the Dag Controller.

    cd <dag_ctrl_install_path>
    ./scripts/start.sh
    
  7. Check whether the startup is successful. The default port is 9002, which is set in the file dag-ctrl-api.yaml.

    netstat -aon | grep 9002
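
A plain `grep 9002` also matches unrelated lines, such as a connection on port 19002. The sketch below is a stricter variant, assuming Linux `netstat` output in which the fourth column is the local address; the function name `is_listening` is hypothetical.

```shell
#!/usr/bin/env bash
# Read netstat-style output on stdin and succeed only if some line shows
# a local address ending in ":<port>" in the LISTEN state.
is_listening() {
  local port="$1"
  awk -v port="$port" '
    # Column 4 of Linux netstat TCP lines is the local address.
    $4 ~ (":" port "$") && /LISTEN/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Run the check only when netstat is installed.
if command -v netstat >/dev/null 2>&1; then
  if netstat -ant | is_listening 9002; then
    echo "Dag Controller port 9002 is listening"
  else
    echo "Nothing is listening on port 9002"
  fi
fi
```

If the port in dag-ctrl-api.yaml has been changed from the default, pass that value instead of 9002.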
    

What to do next

After Nebula Analytics and the Dag Controller are configured and started, configure resources on Nebula Explorer to perform complex algorithm computing. For details, see Prepare resources.

FAQ

Will the Dag Controller service crash if the Graph service returns too much result data?

The Dag Controller service only provides scheduling capabilities and will not crash, but the Nebula Analytics service may crash due to insufficient memory when writing too much data to HDFS or NebulaGraph, or reading too much data from HDFS or NebulaGraph.

Can I continue a job from a failed task?

Not supported. You can only re-execute the entire job.

How can I speed it up if a task result is saved slowly or data is transferred slowly between tasks?

The Dag Controller contains graph query components and graph computing components. Graph queries send requests to a graph service for queries, so the graph queries can only be accelerated by increasing the memory of the graph service. Graph computing is performed on distributed nodes provided by Nebula Analytics, so graph computing can be accelerated by increasing the size of the Nebula Analytics cluster.

What should I do if the HDFS server cannot be connected and the task status stays running?

Set the connection timeout and the maximum number of retries on timeout for HDFS connections as follows:

<configuration>
<property>
    <name>ipc.client.connect.timeout</name>
    <value>3000</value>
</property>

<property>
    <name>ipc.client.connect.max.retries.on.timeouts</name>
    <value>3</value>
</property>
</configuration>

How to resolve the error Err:dial unix: missing address?

Modify the configuration file dag-ctrl/etc/dag-ctrl-api.yaml to set UserName under SSH.

How to resolve the error bash: /home/xxx/nebula-analytics/scripts/run_algo.sh: No such file or directory?

Modify the configuration file dag-ctrl/etc/tasks.yaml to configure the algorithm execution path parameter exec_file.

How to resolve the error /lib64/libm.so.6: version 'GLIBC_2.29' not found (required by /home/vesoft/jdk-18.0.1/jre/lib/amd64/server/libjvm.so)?

The operating system version does not support JDK 18, and YUM cannot download GLIBC_2.29. Install JDK 1.8 instead, and remember to update the JDK path in nebula-analytics/scripts/set_env.sh.

How to resolve the error handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain?

Reconfigure the permissions to 744 on the folder .ssh and 600 on the file .ssh/authorized_keys.
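
The permission change above can be sketched as a small helper. This is an illustrative example rather than an official script: the function names are hypothetical, `stat -c` assumes GNU coreutils, and the directory defaults to `~/.ssh` of the user that receives the SSH login.

```shell
#!/usr/bin/env bash
# Apply the permissions described above to an SSH directory.
# $1 defaults to ~/.ssh; another path can be passed for a trial run.
fix_ssh_permissions() {
  local ssh_dir="${1:-$HOME/.ssh}"
  chmod 744 "$ssh_dir" && chmod 600 "$ssh_dir/authorized_keys"
}

# Print the resulting modes in octal (GNU stat syntax).
show_ssh_permissions() {
  local ssh_dir="${1:-$HOME/.ssh}"
  stat -c '%a %n' "$ssh_dir" "$ssh_dir/authorized_keys"
}

# Usage: fix_ssh_permissions && show_ssh_permissions
```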

How to resolve the error There are 0 Nebula Analytics available. clusterSize should be less than or equal to it?

The possible causes are as follows:

  • Nebula Analytics has not been deployed. Deploy and configure Nebula Analytics as described in this document.
  • Nebula Analytics has been deployed but cannot connect to the Dag Controller. For example, the IP address is incorrect, SSH is not configured, or the two services are started by different users (which causes SSH login failures).

How to resolve the error broadcast.hpp:193] Check failed: (size_t)recv_bytes >= sizeof(chunk_tail_t) recv message too small: 0?

The amount of data to be processed is too small for the number of compute nodes and processes. Set smaller values for clusterSize and processes when submitting jobs.


Last update: March 13, 2023