In this article I provide a short guide on how to set up a GPU job scheduling system based on Torque (PBS). I also point out the advantages and disadvantages of this solution and give an overview of the current situation with respect to open source GPU job scheduling solutions.
Introduction
In the field of high performance computing (HPC), graphics processing units (GPUs) are new and hot. Large supercomputer clusters are already built around GPUs, e.g. Titan or the new machines in Dresden. For special tasks, GPUs may offer a much better computing-performance-per-cost ratio than classical hardware. Hence, a lot of people interested in heavy calculations have a keen interest in using GPUs. Research groups in academia and industry often build large-scale setups comprising a moderate or large number of GPU devices. Some buy AMD devices, some are interested in Intel's Xeon Phi, but most of them, at least currently, buy Nvidia GPUs, because many well-established software products support CUDA hardware. Some prefer the expensive Tesla cards, others go for the much cheaper GeForce cards (which often perform faster than their Tesla counterparts, for less money).
Eventually, people dealing with general purpose GPU computing end up having a bunch of graphics cards distributed among one or many machines. Now the work should begin — the computing jobs are already defined, and the GPUs are waiting to process them.
What is missing in this picture so far is an extremely important piece of infrastructure between the user and the hardware: a job queuing system. It is required for efficient hardware resource usage and for easy accessibility of the hardware from the user's point of view. It manages the compute resources and assigns compute jobs to single resource units. In the HPC sector, various such job queuing systems (or "batch systems") are established for classical hardware. What about GPU hardware support? With respect to open source software, there unfortunately is a huge gap. Search the web for it: as of now, you won't find a straightforward solution at all. If you dig deeper, you'll find that there are solutions for GPU support in Torque as well as in SLURM, both of which are discussed below.
In the open source segment, these are the options available for now. Please keep me posted if you think that this list is not complete.
For our research group, I have chosen to set up a GPU cluster based on Torque, because from the docs it seemed that this solution could be set up more quickly than the alternatives. However, from the beginning it was clear that a Torque-based GPU cluster comes with a number of disadvantages.
Advantages and disadvantages of a Torque-based GPU scheduling solution
Pros:
- Simple setup for basic features.
- PBS/Torque is established in the HPC field.
Cons:
- Limited scheduling features: GPU scheduling with Torque is by default controlled by pbs_sched, a very simplistic scheduler that is part of the Torque source distribution. pbs_sched does not support hardware priorities or any kind of scheduling intelligence. The advanced (free) Torque scheduler, Maui, does not officially support GPUs. There are certain hacky attempts to make it support them via a generic resource concept, but you do not necessarily want to maintain a manually patched Torque in your infrastructure. Also, the scheduling capabilities for GPUs added to Maui via such a simple patch are limited. The commercial Torque scheduler, Moab, does indeed support GPUs.
- Torque does not set the environment variable CUDA_VISIBLE_DEVICES directly; the job itself must evaluate the environment variable PBS_GPUFILE, read the corresponding file, and set CUDA_VISIBLE_DEVICES accordingly before running the CUDA executable (a minimal sketch of this is shown right after this list). Note that depending on how your cluster is set up and used, you might not need the CUDA_VISIBLE_DEVICES mechanism at all, especially when GPUs are set to process-exclusive compute mode.
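To illustrate the second point, here is a minimal sketch of what a job script (or a wrapper around it) could do at its very beginning. It assumes the usual format of the GPU file, in which each line looks like hostname-gpu<index>; my_cuda_program is a placeholder for your actual executable:

# Collect the GPU indices assigned to this job from the file that Torque
# points to via PBS_GPUFILE (one line per assigned GPU, e.g. "gpu1-gpu0").
GPU_IDS=$(sed 's/.*-gpu//' "${PBS_GPUFILE}" | paste -s -d ',')
export CUDA_VISIBLE_DEVICES="${GPU_IDS}"
./my_cuda_program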
The scheduler SLURM supports the generic resource (GRES) and CUDA_VISIBLE_DEVICES concepts natively, but it is quite a complex system that most likely takes a while to set up properly. However, of the open source solutions available so far, SLURM makes the impression of being the most advanced and solid one. It is well-documented and constantly gains market share among large HPC facilities. In fact, Bull decided to run the new HPC cluster in Dresden ("Taurus", including many Tesla K20x devices) using SLURM, a decision I am quite happy about so far, as a user. SLURM is extremely responsive (even at large scale) and seems to have quite intelligent resource allocation methods. So, if you are considering setting up a largish GPU cluster involving many users, you might want to set up SLURM instead of Torque.
What I expect of a fully GPU-aware scheduler is support for a heterogeneous GPU resource pool, providing prioritization methods for GPUs of different types and performance. If you, for instance, have a few old Tesla C1060 cards in your cluster and at the same time can call a few GTX Titans your own, the scheduling system should assign jobs to the Titans first, unless explicitly requested otherwise. I am not sure whether something like this is simple to achieve with SLURM. All I can say is that Torque's pbs_sched is not able to distinguish between different GPU types.
In the following parts, I provide a short step-by-step guide on how to set up and get started with a Torque-based system.
Setting up a simplistic GPU cluster based on Torque
The batch system created below consists of two nodes, each containing multiple GPUs. One node, named gpu1, will act as the head node of the cluster as well as a compute node. The other node, gpu2, will be a compute node only. Both nodes run the compute node daemon pbs_mom as well as the Torque authentication service trqauthd. The head node additionally runs the PBS server pbs_server as well as the basic scheduler pbs_sched.
General remarks
Sometimes, the structure of the Torque documentation can be confusing. Also, not all Torque-GPU-related information is contained in the most recent docs. A document that can be quite useful is the Nvidia GPU section from the 4.0.2 docs.
In the following steps, Torque binaries will be built. It does not really matter on which machine you build these packages, but it is obviously recommended to build on the exact same system setup that is used in the cluster later on, in order to ensure hardware and software compatibility.
Build against NVML
One thing that has confused me in the Torque docs is whether it makes sense to build against the Nvidia Management Library (NVML).
First of all, the docs are wrong about how to obtain NVML. Since CUDA 4.1, NVML is part of the download of the so-called Tesla Deployment Kit. Secondly, let me comment a bit more on the usage of NVML. From here, it seems that Torque is able to monitor the status of Nvidia GPUs quite well. However, only the commercial scheduler Moab is able to also consider GPU status information in its scheduling decisions. But even with the basic scheduler pbs_sched, just displaying monitoring information via the pbsnodes command would be great, wouldn't it? Unfortunately, from my own experience, monitoring of GPUs does not work with CUDA 5 and the latest Nvidia drivers. This is implicitly confirmed by statements such as this and this. Nevertheless, I decided to include NVML in the Torque build, because it does not hurt and it may work for you if you use an older CUDA version or older drivers. Just note that for a very simple and yet functional setup you might skip including NVML; it is entirely optional.
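Independent of Torque, you can quickly check what the driver itself reports for your devices using nvidia-smi, which ships with the driver and uses NVML internally; if this already does not provide proper status information, Torque's NVML-based monitoring will not work either:

# nvidia-smi -q | head -n 30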
I assume that on the build machine you have a local CUDA installation at /usr/local/cuda. Then, you can put the NVML-related files into that tree like so:
# wget https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML/tdk_3.304.5.tar.gz
# tar xzf tdk_3.304.5.tar.gz
# cd tdk_3.304.5
# cp -a nvidia-healthmon/ nvml/ /usr/local/cuda
Build Torque packages
According to the Torque docs, Torque binary packages can be created in the following way:
# apt-get install libxml2-dev
# apt-get install libssl-dev
$ wget http://adaptive.wpengine.com/resources/downloads/torque/torque-4.1.5.tar.gz
$ tar xzf torque-4.1.5.tar.gz
$ cd torque-4.1.5
$ ./configure --with-debug --enable-nvidia-gpus --with-nvml-include=/usr/local/cuda/nvml/include --with-nvml-lib=/usr/local/cuda/nvml/lib64 2>&1 | tee configure_torque.log
$ make 2>&1 | tee make.log
$ make packages 2>&1 | tee make_packages.log
In the first two lines, missing requirements were installed. This was an Ubuntu 10.04 system; in your case, other packages might be missing. Note that I pointed configure to the previously installed NVML files. You might want to have a look at the other configure options and define e.g. a different library installation path. make packages creates binary Torque installation packages, of which we require three:
- torque-package-server-linux-x86_64.sh
- torque-package-mom-linux-x86_64.sh
- torque-package-clients-linux-x86_64.sh
They will be used for installing the various Torque components on all machines in the same cluster, as described below.
Install Torque packages and services
Although optional, it is possible to use the Torque server as a compute node, and this is what we are doing here. As stated before, the head node is planned to run pbs_server, pbs_sched, pbs_mom, and trqauthd. All other nodes (gpu2 in this case) are planned to run pbs_mom and trqauthd. In terms of packages, compute-only nodes need torque-package-mom-linux-x86_64.sh and torque-package-clients-linux-x86_64.sh installed. The head node additionally requires torque-package-server-linux-x86_64.sh:
# checkinstall /data/jpg/torque/torque-package-server-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_server.log
# checkinstall /data/jpg/torque/torque-package-clients-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_clients.log
# dpkg --force-overwrite -i torque-clients_20130318-1_amd64.deb
# checkinstall /data/jpg/torque/torque-package-mom-linux-x86_64.sh --install 2>&1 | tee checkinstall_torque_mom.log
# dpkg --force-overwrite -i torque-mom_20130318-1_amd64.deb
As you can see, this was performed as root. On Debian-based systems, it is quite a good idea to use checkinstall when installing custom software into top-level locations. The checkinstall wrapper first creates a Debian package file, which is then automatically installed in a second step. This package file can later be used to cleanly remove all files, or to easily re-install the software on a similar system. The first package, torque-package-server-linux, is converted to a Debian package and installs just fine. The other two are also converted, but the automatic installation of the resulting Debian packages fails. This is because the Torque binary packages have a certain redundancy, which is fine by itself; however, a Debian package trying to overwrite files that belong to another Debian package makes dpkg raise an error unless told explicitly otherwise. The deb packages are created anyway and can be installed manually with the option --force-overwrite, which makes perfect sense in this case. After the procedure above, you have three Debian packages at hand that can be used to cleanly remove what you have just installed as root.
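If you ever need to undo the installation, the generated Debian packages can be removed again via dpkg. A sketch, assuming package names as generated by checkinstall above (check the exact names with dpkg -l first, in particular for the server package):

# dpkg -l | grep torque
# dpkg -r torque-clients torque-mom    # plus the server package name listed by dpkg -l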
On gpu2, we install torque-package-mom-linux-x86_64.sh and torque-package-clients-linux-x86_64.sh in the same way.
By default (if not explicitly told otherwise via configure), Torque places shared objects in an 'old' place, i.e. /usr/local/lib. In the case of Ubuntu 10.04, we need to modify the library search path:
# vi /etc/ld.so.conf
# cat /etc/ld.so.conf
include /etc/ld.so.conf.d/*.conf
/usr/local/cuda/lib64
/usr/local/cuda/lib
/usr/local/lib
# /sbin/ldconfig
Alternatively, you could modify LD_LIBRARY_PATH globally. In any case, make sure that the operating system on all nodes finds Torque's library files.
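If you go the LD_LIBRARY_PATH route instead, a minimal sketch (assuming the default /usr/local/lib library location and the default /usr/local/sbin install path for pbs_server) would be a system-wide profile snippet plus a quick sanity check with ldd:

# echo 'export LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}' > /etc/profile.d/torque.sh
# ldd /usr/local/sbin/pbs_server | grep "not found"    # should print nothing

Note that environment variables set in profile scripts do not reach daemons started from init scripts at boot time, which is one reason why the ld.so.conf approach above is the more robust choice.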
Torque's source tarball provides init scripts for all Torque components. On the head node, we install service scripts for pbs_mom, pbs_sched, pbs_server, and trqauthd in the following fashion:
root@gpu1:/***/torque-4.1.5/contrib/init.d# cp debian.pbs_mom /etc/init.d/pbs_mom && update-rc.d pbs_mom defaults
update-rc.d: warning: pbs_mom start runlevel arguments (2 3 4 5) do not match LSB Default-Start values (2 3 5)
update-rc.d: warning: pbs_mom stop runlevel arguments (0 1 6) do not match LSB Default-Stop values (S 0 1 6)
 Adding system startup for /etc/init.d/pbs_mom ...
   /etc/rc0.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc1.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc6.d/K20pbs_mom -> ../init.d/pbs_mom
   /etc/rc2.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc3.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc4.d/S20pbs_mom -> ../init.d/pbs_mom
   /etc/rc5.d/S20pbs_mom -> ../init.d/pbs_mom
Repeat that for pbs_sched, pbs_server, and trqauthd.
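A compact way to do that (a sketch, assuming the init scripts in contrib/init.d follow the debian.<service> naming pattern seen above):

cd torque-4.1.5/contrib/init.d
for svc in pbs_sched pbs_server trqauthd; do
    cp debian.${svc} /etc/init.d/${svc} && update-rc.d ${svc} defaults
done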
You can already run the services, although they are not properly configured yet:
# service pbs_server start
# service pbs_mom start
# service trqauthd start
hostname: gpu1
pbs_server port is: 15001
trqauthd daemonized - port 15005
# service pbs_sched start
On the compute nodes, it is sufficient to install pbs_mom and trqauthd in the same way as above.
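On gpu2, this could look like the following (same pattern as on the head node, assuming the same source tree location):

root@gpu2:/***/torque-4.1.5/contrib/init.d# cp debian.pbs_mom /etc/init.d/pbs_mom && update-rc.d pbs_mom defaults
root@gpu2:/***/torque-4.1.5/contrib/init.d# cp debian.trqauthd /etc/init.d/trqauthd && update-rc.d trqauthd defaults
root@gpu2:~# service pbs_mom start
root@gpu2:~# service trqauthd start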
Initial setup
It is time to specify config parameters that are common to head and compute nodes. Hence, on gpu1 as well as gpu2, do the following:
export TORQUE_HOME=/var/spool/torque
echo "\$pbsserver gpu1" > ${TORQUE_HOME}/mom_priv/config
echo "\$logevent 225" >> ${TORQUE_HOME}/mom_priv/config
echo "gpu1" > ${TORQUE_HOME}/server_name
echo "SERVERHOST gpu1" > ${TORQUE_HOME}/torque.cfg
All of this is required, except for the logevent setting, which I picked up somewhere as a recommendation. IIRC, TORQUE_HOME can be forgotten about afterwards, i.e. this environment variable is not required during runtime, although the Torque docs sometimes make it seem that way.
In the source tree, Torque provides an initial setup configuration script, which you can run like so:
./torque.setup username
You should now adjust the queue and server configuration to your needs, so take a look at the docs. My rudimentary config is:
# qmgr -c "print server"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_queuable = 1000
set queue batch acl_host_enable = True
set queue batch acl_hosts = gpu1
set queue batch acl_user_enable = False
set queue batch resources_max.nodect = 1
set queue batch resources_max.walltime = 100:00:00
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 100:00:00
set queue batch acl_group_enable = True
set queue batch acl_groups = group
set queue batch keep_completed = 10000
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = gpu1
set server managers = username@gpu1
set server operators = username@gpu1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 2
set server moab_array_compatible = True
For clarity, these are the settings I had success with on the head node, i.e. gpu1 in this case:
# cat ${TORQUE_HOME}/server_name
gpu1
# cat ${TORQUE_HOME}/server_priv/nodes
gpu1 np=3 gpus=3
gpu2 np=2 gpus=2
# cat ${TORQUE_HOME}/mom_priv/config
$pbsserver gpu1
$logevent 225
# cat ${TORQUE_HOME}/torque.cfg
SERVERHOST gpu1
In this case, I was already planning for a correspondence of 1 CPU to 1 GPU, i.e. the np settings equal the gpus settings in the nodes file. Depending on the types of jobs you plan to run, a different setting might make more sense in your case.
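For example, a hypothetical node gpu3 with 8 CPU cores and 2 GPUs that should also accept CPU-only jobs could be listed in the nodes file like this:

gpu3 np=8 gpus=2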
The corresponding configuration file contents on the compute node, gpu2:
# cat ${TORQUE_HOME}/server_name
gpu1
# cat ${TORQUE_HOME}/server_priv/nodes
## This is the TORQUE server "nodes" file.
##
## To add a node, enter its hostname, optional processor count (np=),
## and optional feature names.
##
## Example:
##    host01 np=8 featureA featureB
##    host02 np=8 featureA featureB
##
## for more information, please visit:
##
## http://www.adaptivecomputing.com/resources/docs/
# cat ${TORQUE_HOME}/mom_priv/config
$pbsserver gpu1
$logevent 225
# cat ${TORQUE_HOME}/torque.cfg
SERVERHOST gpu1
With this minimal config, the GPU cluster is already ready to go:
# pbsnodes -a
gpu1
     state = free
     np = 3
     ntype = cluster
     status = [...]
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 3

gpu2
     state = free
     np = 2
     ntype = cluster
     status = [...]
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 2
Using the cluster
Job submission using Torque's qsub is a bit old school (I also have experience with LSF, SGE, and SLURM and must say that Torque provides the least convenient command line tools). An example submit command that runs a script with arguments and requests 1 GPU and 1 CPU for the job:
$ echo "/bin/bash job.sh out16" | qsub -d $PWD -l nodes=1:gpus=1:ppn=1
A submit command for a script without arguments:
$ qsub -d $PWD -l nodes=1:gpus=1:ppn=1 job.sh
These commands do not even take care of collecting the job's standard output and standard error (a problem which Torque solves, in my opinion, in an absolutely horrible way).
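If you want at least some control over where the output ends up, qsub's -o and -e options (or -j oe to merge both streams into one file) can be appended to the submit command; a minimal sketch based on the example above:

$ qsub -d $PWD -l nodes=1:gpus=1:ppn=1 -j oe -o $PWD/job.log job.sh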
Because of these inconveniences related to job submission, and mainly because I wanted a mechanism that automatically takes care of setting CUDA_VISIBLE_DEVICES on the executing node before running the job, I wrote two small scripts, located at https://bitbucket.org/jgehrcke/torque-gpu-compute-jobs/src. One is a submission wrapper, for simplified job submission. I use it this way:
submit-gpu-job '/bin/bash job_script.sh' -o stdout_stderr.log
I have created this for my needs; it automatically requests 1 CPU and 1 GPU. On the executing node, the actual job is wrapped by the torque-gpu-compute-job-wrapper.py script (which, of course, must be in the PATH). This wrapper takes care of evaluating PBS_GPUFILE and setting CUDA_VISIBLE_DEVICES before running the job script. The user-given command as provided by submit-gpu-job is temporarily stored in the file system (in the CWD when using submit-gpu-job, to be more specific) and is later recovered by the job wrapper on the executing node.
Set up as described above, we have been running a small GPU cluster in our research group for a couple of months already, without any complication. By the way, the best way to monitor the queue in daily usage is, in my opinion, the qstat -na1 command (an intuitive combination of parameters, right?).