Setting up Torque (PBS) for GPU job scheduling

In this article I provide a short guide on how to set up a GPU job scheduling system based on Torque (PBS). I also try to point out the advantages and disadvantages of this solution and give an overview of the current situation with respect to open source GPU job scheduling solutions.

Introduction

In the field of high performance computing (HPC), graphics processing units (GPUs) are new and hot. Large supercomputer clusters are already built based on GPUs, e.g. Titan or the new machines in Dresden. For special tasks, GPUs may offer a much higher performance-to-cost ratio than classical hardware. Hence, a lot of people interested in heavy calculations have a keen interest in applying GPUs. Research groups in academia and industry often build large-scale setups comprised of a moderate or large number of GPU devices. Some buy AMD devices, some are interested in Intel’s Xeon Phi, but most — at least currently — buy Nvidia GPUs, because many well-established software products support CUDA hardware. Some prefer the expensive Tesla cards; others go for the much cheaper GeForce cards (which often perform faster than their Tesla counterparts, for less money).

Eventually, people dealing with general purpose GPU computing end up having a bunch of graphics cards distributed among one or many machines. Now the work should begin — the computing jobs are already defined, and the GPUs are waiting to process them.

What is missing in this picture so far is an extremely important piece of infrastructure between the user and the hardware: a job queuing system. It is required for efficient hardware resource usage and for easy accessibility of the hardware from the user’s point of view. It manages the compute resources and assigns compute jobs to single resource units. In the HPC sector, various job queuing systems (or “batch systems”) are established for classical hardware. What about GPU hardware support? With respect to open source software, there unfortunately is a huge gap. Search the web for it — as of now, you won’t find a straight-forward solution at all. If you dig deeper, you’ll find that there are solutions for GPU support in

  • Torque (PBS), via its basic scheduler pbs_sched, and
  • SLURM, via its generic resource (GRES) concept.

In the open source segment, these are the options available for now. Please keep me posted if you think that this list is not complete.

For our research group, I have chosen to set up a GPU cluster based on Torque, because from the docs it seemed that this solution can be set up faster than the others. However, it was clear from the beginning that a Torque-based GPU cluster comes with a number of disadvantages.

Advantages and disadvantages of a Torque-based GPU scheduling solution

Pros:

  • Simple setup for basic features.
  • PBS/Torque is established in the HPC field.

Cons:

  • Limited scheduling features: GPU scheduling with Torque is by default controlled by pbs_sched, a very simplistic scheduler that is part of the Torque source distribution. pbs_sched does not support hardware priorities or any kind of scheduling intelligence. The advanced (free) Torque scheduler, Maui, does not officially support GPUs. There are certain hacky attempts to make it support them via a generic resource concept, but you do not necessarily want to maintain a manually patched Torque in your infrastructure. Also, the scheduling capabilities for GPUs added to Maui via such a simple patch are limited. The commercial Torque scheduler, Moab, does indeed support GPUs.
  • Torque does not set the environment variable CUDA_VISIBLE_DEVICES directly. The job itself must evaluate the environment variable PBS_GPUFILE, read the corresponding file, and set CUDA_VISIBLE_DEVICES accordingly before running the CUDA executable. Note that depending on how your cluster is set up and used, you might not need the CUDA_VISIBLE_DEVICES mechanism at all, especially when the GPUs are set to process-exclusive mode.
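The PBS_GPUFILE-to-CUDA_VISIBLE_DEVICES step can be sketched as follows. Torque points PBS_GPUFILE at a file with one line per assigned GPU, typically of the form <hostname>-gpu<index>; the exact format may vary between Torque versions, and the two "simulate" lines only exist to make the sketch self-contained (a real job gets PBS_GPUFILE from its environment):

```shell
# Simulate Torque's input so the sketch is self-contained;
# remove these two lines in a real job script:
PBS_GPUFILE="$(mktemp)"
printf 'gpu1-gpu0\ngpu1-gpu2\n' > "$PBS_GPUFILE"

# Strip everything up to the last "gpu" to obtain the device indices,
# then join them with commas:
CUDA_VISIBLE_DEVICES="$(sed 's/.*gpu//' "$PBS_GPUFILE" | paste -s -d, -)"
export CUDA_VISIBLE_DEVICES

echo "$CUDA_VISIBLE_DEVICES"   # prints "0,2" for the simulated file
```

After this, the CUDA runtime in the subsequently started executable only sees the assigned devices.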

The scheduler SLURM supports the generic resource (GRES) and CUDA_VISIBLE_DEVICES concepts natively, but it is quite a complex system that most likely takes a while to set up properly. However, of the open source solutions available so far, SLURM appears to be the most advanced and solid one. It is well-documented and constantly gains market share among the large HPC facilities. In fact, Bull decided to run the new HPC cluster in Dresden (“Taurus”, including many Tesla K20x devices) using SLURM — a decision I am quite happy about so far, as a user. SLURM is extremely responsive (even at large scales) and seems to have quite intelligent resource allocation methods. So, if you are considering setting up a largish GPU cluster involving many users, you might want to set up SLURM instead of Torque.

What I expect of a fully GPU-aware scheduler is support for a heterogeneous GPU resource pool, providing prioritizing methods for GPUs of different type and performance. If you, for instance, have a few old Tesla C1060 cards in your cluster and at the same time can call a few GTX Titans your own, the scheduling system should be able to assign jobs to the Titans first, if not explicitly requested otherwise. I am not sure whether something like this is simple to achieve with SLURM. All I can say is that Torque’s pbs_sched is not able to distinguish different GPU types.

In the following parts, I provide a short step-by-step guide on how to set up and get started with a Torque-based system.

Setting up a simplistic GPU cluster based on Torque

The batch system created below consists of two nodes, each containing multiple GPUs. One node, named gpu1, will act as the head node of the cluster as well as a compute node. The other node, gpu2, will be a compute node only. Both nodes run the computing client pbs_mom as well as the Torque authentication service trqauthd. The head node additionally runs the PBS server pbs_server as well as the basic scheduler pbs_sched.
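In summary, the daemon layout is:

```
gpu1 (head + compute):  pbs_server  pbs_sched  pbs_mom  trqauthd
gpu2 (compute only):                           pbs_mom  trqauthd
```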

General remarks

Sometimes, the structure of the Torque documentation can be confusing. Also, not all Torque-GPU-related information is contained in the most recent docs. A document that can be quite useful is the Nvidia GPU section from the 4.0.2 docs.

In the following steps, Torque binaries will be built. It does not really matter on which machine you build these packages, but it is obviously recommended to build on the exact same system setup that will be used in the cluster later on, to ensure hardware and software compatibility.

Build against NVML

One thing that has confused me in the Torque docs is whether it makes sense to build against the Nvidia Management Library (NVML).

First of all, the docs are wrong about how to obtain NVML: since CUDA 4.1, NVML is part of the download of the so-called Tesla Deployment Kit. Secondly, let me comment on the usage of NVML a bit more. From what I have read, it seems that Torque is able to monitor the status of Nvidia GPUs quite well. However, only the commercial scheduler Moab is able to also consider GPU status information in its scheduling decisions. But even with the basic scheduler pbs_sched, just displaying monitoring information via the pbsnodes command would be great, wouldn’t it? Unfortunately, from my own experience, monitoring of GPUs does not work with CUDA 5 and the latest Nvidia drivers. This is implicitly confirmed by statements such as this and this. Nevertheless, I decided to include NVML in the Torque build, because it does not hurt and it might work for you if you use an older CUDA version or older drivers. Just note that for a very simple and yet functional setup you can skip NVML entirely; it is optional.

I assume that on the build machine you have a local CUDA installation placed at /usr/local/cuda. Then, you can put the NVML-related files into that tree like so:

# wget https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML/tdk_3.304.5.tar.gz
# tar xzf tdk_3.304.5.tar.gz 
# cd tdk_3.304.5
# cp -a nvidia-healthmon/ nvml/ /usr/local/cuda

Build Torque packages

According to the Torque docs, Torque binary packages can be created in the following way:

# apt-get install libxml2-dev
# apt-get install libssl-dev
$ wget http://adaptive.wpengine.com/resources/downloads/torque/torque-4.1.5.tar.gz
$ tar xzf torque-4.1.5.tar.gz
$ cd torque-4.1.5
$ ./configure --with-debug --enable-nvidia-gpus --with-nvml-include=/usr/local/cuda/nvml/include --with-nvml-lib=/usr/local/cuda/nvml/lib64 2>&1 | tee configure_torque.log
$ make 2>&1 | tee make.log
$ make packages 2>&1 | tee make_packages.log

In the first two lines, missing build requirements are installed. This was done on Ubuntu 10.04; on your system, other packages might be missing. Note that I pointed configure to the previously installed NVML files. You might want to have a look at the other configure options and define, e.g., a different library installation path. make packages creates binary Torque installation packages, of which we require three:

  • torque-package-server-linux-x86_64.sh
  • torque-package-mom-linux-x86_64.sh
  • torque-package-clients-linux-x86_64.sh

They will be used for installing the various Torque components on all machines in the same cluster, as described below.

Install Torque packages and services

The machine running the Torque server can optionally be used as a compute node as well; this is what we are doing here. As stated before, the head node is planned to run pbs_server, pbs_sched, pbs_mom, and trqauthd. All other nodes (gpu2 in this case) are planned to run pbs_mom and trqauthd. In terms of packages, compute-only nodes need torque-package-mom-linux-x86_64.sh and torque-package-clients-linux-x86_64.sh installed. The head node additionally requires torque-package-server-linux-x86_64.sh:

# checkinstall /data/jpg/torque/torque-package-server-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_server.log
# checkinstall /data/jpg/torque/torque-package-clients-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_clients.log
# dpkg --force-overwrite -i  torque-clients_20130318-1_amd64.deb
# checkinstall /data/jpg/torque/torque-package-mom-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_mom.log
# dpkg --force-overwrite -i  torque-mom_20130318-1_amd64.deb

As you can see, this was performed as root. On Debian-based systems, it is a good idea to use checkinstall when installing custom software into top-level locations. The checkinstall wrapper first creates a Debian package file, which is then automatically installed in a second step. This package file can later be used to cleanly remove all installed files, or to re-install on a similar system. The first package, torque-package-server-linux, is converted to a Debian package and installs just fine. The following two conversions also succeed, but the automatic installation of the resulting Debian packages fails. This is because the Torque binary packages have a certain redundancy among themselves, which is fine; however, when a Debian package tries to overwrite files belonging to another Debian package, dpkg raises an error unless told explicitly otherwise. The deb packages are created anyway and can be installed manually with the option --force-overwrite, which makes perfect sense in this case. Again, after the procedure above, you have three Debian packages at hand that can be used to cleanly remove what you have just installed as root.

On gpu2, we install torque-package-mom-linux-x86_64.sh and torque-package-clients-linux-x86_64.sh in the same way.

By default (if not explicitly told otherwise via configure), Torque places shared objects in an ‘old’ place, i.e. /usr/local/lib. In case of Ubuntu 10.04, we need to modify the library search path:

# vi /etc/ld.so.conf
# cat /etc/ld.so.conf
include /etc/ld.so.conf.d/*.conf
/usr/local/cuda/lib64
/usr/local/cuda/lib
/usr/local/lib
# /sbin/ldconfig

Alternatively, you could modify LD_LIBRARY_PATH globally. In any case, make sure that on all nodes the operating system finds Torque’s library files.
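Such a global setting could, for instance, be shipped as a profile snippet (the /etc/profile.d location is an assumption for Debian/Ubuntu-like systems; adapt the path to your distribution):

```shell
# Content for a hypothetical /etc/profile.d/torque.sh:
# prepend /usr/local/lib, preserving any pre-existing value.
export LD_LIBRARY_PATH=/usr/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```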

Torque’s source tarball provides init scripts for all Torque components. On the head node, we install service scripts for pbs_mom, pbs_sched, pbs_server, and trqauthd in the following fashion:

root@gpu1:/***/torque-4.1.5/contrib/init.d# cp debian.pbs_mom /etc/init.d/pbs_mom && update-rc.d pbs_mom defaults
update-rc.d: warning: pbs_mom start runlevel arguments (2 3 4 5) do not match LSB Default-Start values (2 3 5)
update-rc.d: warning: pbs_mom stop runlevel arguments (0 1 6) do not match LSB Default-Stop values (S 0 1 6)
 Adding system startup for /etc/init.d/pbs_mom ...
   /etc/rc0.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc1.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc6.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc2.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc3.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc4.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc5.d/S20pbs_mom -> ../init.d/pbs_mom

Repeat that for pbs_sched, pbs_server, and trqauthd.
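Spelled out, and assuming the init scripts in contrib/init.d follow the same debian.<service> naming for all four daemons (verify the file names in your source tree), the remaining installs look like this:

```
# Run from within torque-4.1.5/contrib/init.d, as root:
cp debian.pbs_sched  /etc/init.d/pbs_sched  && update-rc.d pbs_sched  defaults
cp debian.pbs_server /etc/init.d/pbs_server && update-rc.d pbs_server defaults
cp debian.trqauthd   /etc/init.d/trqauthd   && update-rc.d trqauthd   defaults
```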

You can already start the services, although they are not properly configured yet:

# service pbs_server start
# service pbs_mom start
# service trqauthd start
hostname: gpu1
pbs_server port is: 15001
trqauthd daemonized - port 15005
# service pbs_sched start

On the compute nodes, it is sufficient to install pbs_mom and trqauthd in the same way as above.

Initial setup

It is time to specify config parameters that are common among head and compute nodes. Hence, for gpu1 as well as gpu2, do the following:

export TORQUE_HOME=/var/spool/torque
echo "\$pbsserver gpu1" > ${TORQUE_HOME}/mom_priv/config
echo "\$logevent 225" >> ${TORQUE_HOME}/mom_priv/config
echo "gpu1" > ${TORQUE_HOME}/server_name
echo "SERVERHOST gpu1" > ${TORQUE_HOME}/torque.cfg

All of this is required, except for the logevent setting, which I picked up somewhere as a recommendation. IIRC, TORQUE_HOME can be forgotten about afterwards; i.e., this environment variable is not required at runtime, although the Torque docs sometimes suggest that it is.

In the source tree, Torque provides an initial setup configuration script, which you can run like so:

./torque.setup username

You should now adjust the queue and server configuration to your needs, so take a look at the docs. My rudimentary config is:

# qmgr -c "print server"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_queuable = 1000
set queue batch acl_host_enable = True
set queue batch acl_hosts = gpu1
set queue batch acl_user_enable = False
set queue batch resources_max.nodect = 1
set queue batch resources_max.walltime = 100:00:00
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 100:00:00
set queue batch acl_group_enable = True
set queue batch acl_groups = group
set queue batch keep_completed = 10000
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = gpu1
set server managers = username@gpu1
set server operators = username@gpu1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 2
set server moab_array_compatible = True
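Individual attributes can later be adjusted with single qmgr commands instead of re-entering the whole dump; for example (the values here are purely illustrative):

```
qmgr -c "set queue batch resources_default.walltime = 24:00:00"
qmgr -c "set queue batch max_queuable = 500"
```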

For clarity, these are the settings I had success with on the head node, i.e. gpu1 in this case:

# cat ${TORQUE_HOME}/server_name
gpu1
# cat ${TORQUE_HOME}/server_priv/nodes
gpu1 np=3 gpus=3
gpu2 np=2 gpus=2
# cat ${TORQUE_HOME}/mom_priv/config 
$pbsserver gpu1
$logevent 225
# cat ${TORQUE_HOME}/torque.cfg 
SERVERHOST gpu1

In this case, I was already planning for a one-to-one correspondence of CPUs to GPUs, i.e. the np settings equal the gpus settings in the nodes file. Depending on the types of jobs you plan to run, a different setting might make more sense in your case.
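If, for instance, your jobs also need CPU cores for pre- or post-processing, you could decouple the two counts in the server_priv/nodes file (the numbers here are purely illustrative):

```
gpu1 np=12 gpus=3
gpu2 np=8 gpus=2
```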

The corresponding configuration file contents on the compute node, gpu2:

# cat ${TORQUE_HOME}/server_name
gpu1
# cat ${TORQUE_HOME}/server_priv/nodes
## This is the TORQUE server "nodes" file. 
## 
## To add a node, enter its hostname, optional processor count (np=), 
## and optional feature names.
## 
## Example:
##    host01 np=8 featureA featureB 
##    host02 np=8 featureA featureB
## 
## for more information, please visit:
## 
## http://www.adaptivecomputing.com/resources/docs/
# cat ${TORQUE_HOME}/mom_priv/config 
$pbsserver gpu1
$logevent 225
# cat ${TORQUE_HOME}/torque.cfg 
SERVERHOST gpu1

With this minimal config, the GPU cluster is already ready to go:

# pbsnodes -a
gpu1
     state = free
     np = 3
     ntype = cluster
     status = [...]
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 3
 
gpu2
     state = free
     np = 2
     ntype = cluster
     status = [...]
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2

Using the cluster

Job submission using Torque’s qsub is a bit old school (I also have experience with LSF, SGE, and SLURM, and must say that Torque provides the least convenient command line tools). An example submit command that runs a script with arguments and requests one GPU and one CPU for the job:

$ echo "/bin/bash job.sh out16" | qsub -d $PWD -l nodes=1:gpus=1:ppn=1

The same kind of submit command for a script without arguments:

$ qsub -d $PWD -l nodes=1:gpus=1:ppn=1 job.sh

These commands don’t even take care of job standard output and standard error collection (a problem which Torque solves, in my opinion, in an absolutely horrible way).
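That said, qsub does offer the standard PBS options -o, -e, and -j for directing output explicitly; for example (the file names are just placeholders):

```
# Write stdout to a chosen file and merge stderr into it via -j oe:
qsub -d $PWD -o $PWD/job.out -j oe -l nodes=1:gpus=1:ppn=1 job.sh
```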

Because of these job submission inconveniences, and mainly because I wanted a mechanism that automatically takes care of setting CUDA_VISIBLE_DEVICES on the executing node before running the job, I wrote two small scripts, located at https://bitbucket.org/jgehrcke/torque-gpu-compute-jobs/src. One is a submission wrapper for simplified job submission. I use it this way:

submit-gpu-job '/bin/bash job_script.sh' -o stdout_stderr.log

I have created this for my needs; it automatically requests 1 CPU and 1 GPU. On the executing node, the actual job is wrapped by the torque-gpu-compute-job-wrapper.py script (which, of course, must be in the PATH), which takes care of evaluating PBS_GPUFILE and setting CUDA_VISIBLE_DEVICES before running the job script. The user-given command as provided by submit-gpu-job is temporarily stored in the file system (in the CWD when using submit-gpu-job, to be more specific). It is later recovered by the job wrapper on the executing node.

Set up as described above, we have been running a small GPU cluster in our research group for a couple of months already, without any complication. By the way, the best way to monitor the queue in daily usage is, in my opinion, the qstat -na1 command (intuitive combination of parameters, right?).
