
MLPerf stack: Slurm, Pyxis, Enroot


The MLPerf benchmark is considered one of the test suites that allows you to accurately judge the performance of servers with GPUs and AI accelerators. Unfortunately, it isn’t the kind of benchmark everyone is used to, where it’s enough to run an executable file and collect the finished result some time later. MLPerf is a set of scripts that run tests on various datasets. The datasets themselves are not included in the benchmark kit; you need to download them separately and prepare them for use.

In modern versions of MLPerf, this set of scripts cannot be executed on a clean system. For most tests, you’ll need to prepare the environment accordingly. As the job management tool, the authors of MLPerf chose the Slurm Workload Manager, which is used on most of the world’s supercomputers. This open-source application allows flexible workload management by distributing computing tasks among all members of a cluster.

A minimal Slurm cluster consists of one compute node and one management node. Ideally, these are two different servers that communicate with each other using hostnames, which on Linux are specified in the /etc/hosts file. In addition to a configured cluster, MLPerf requires two plugins: Pyxis and Enroot. The first is an extension that allows an unprivileged user to run containerized workloads. The second removes most of the isolation mechanisms of regular containers, thereby eliminating almost all performance overhead while maintaining file system separation.
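
For reference, a minimal sketch of such /etc/hosts entries might look like this (the hostnames ctlserver and gpuserver and the addresses are hypothetical; substitute your own, and on a single server that plays both roles, one entry is enough):

192.168.0.10    ctlserver
192.168.0.11    gpuserver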

Step 1. Prepare system

Start by updating the package cache repository and installed packages:

sudo apt update && sudo apt -y upgrade

Don’t forget to install NVIDIA® drivers using the autoinstall command or manually, using our step-by-step guide:

sudo ubuntu-drivers autoinstall

Reboot the server:

sudo shutdown -r now
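
After the reboot, you can verify that the driver loaded correctly; nvidia-smi should list your GPUs and the installed driver version:

nvidia-smi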

Step 2. Install Slurm

For these instructions, Ubuntu 22.04 LTS was selected as the operating system, so Slurm can be installed directly from the standard repository. The slurm-wlm package contains both slurmd for the compute node and slurmctld for the control node. Install the package:

sudo apt install slurm-wlm

By default, neither daemon will start after installation because the main configuration file, slurm.conf, is missing. The developers have made an effort to lower the entry barrier by creating a webpage with a configurator. There, you can assemble the configuration file piece by piece, using the hints provided for each item.
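
For reference, a minimal single-node configuration assembled this way might look roughly like the sketch below (the node name gpuserver matches this guide’s examples, while the CPU count and memory size are placeholders that the configurator will replace with your actual hardware values):

ClusterName=cluster
SlurmctldHost=gpuserver
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_tres
ReturnToService=1
NodeName=gpuserver CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP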

Please note that for a standard installation on Ubuntu 22.04 LTS, it makes sense to choose LinuxProc instead of Cgroup for process tracking. After clicking the Submit button at the bottom of the page, you’ll receive the finalized text of the configuration file. Copy it to your clipboard and run the following command:

sudo nano /etc/slurm/slurm.conf

Paste the contents of the clipboard and save the file. Now, you can start the compute node daemon:

sudo systemctl start slurmd.service

You can check the status of the daemon using the following command:

sudo systemctl status slurmd.service
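
If the daemon fails to start, usually because of a typo in slurm.conf, its recent log messages can help pinpoint the problem:

sudo journalctl -u slurmd.service -n 50 --no-pager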

Step 3. Set up Slurm controller

Before you can start the controller daemon, you’ll need to take a few extra steps. First, you need to create a directory where the daemon can store service information about running jobs:

sudo mkdir -p /var/spool/slurmctld

The directory has been created, but the slurm user, on whose behalf the controller runs, cannot write anything there yet. Make it the owner of the directory:

sudo chown slurm:slurm /var/spool/slurmctld

Now, you can start the controller daemon:

sudo systemctl start slurmctld.service

To check the operating status of the controller, you can use the standard command:

sudo systemctl status slurmctld.service

You can also immediately view the current status of the system, particularly how many compute nodes are connected to the controller, and determine their current state:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle gpuserver
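
You can also submit a trivial test job to confirm that the controller is able to schedule work on the node. If everything is working, the command prints the compute node’s hostname (gpuserver in this example):

srun -N1 hostname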

Step 4. Install Pyxis

Slurm extends its capabilities through the SPANK mechanism (Slurm Plug-in Architecture for Node and job (K)control). SPANK is part of Slurm itself and doesn’t need to be installed or configured separately. However, it is advisable to temporarily stop both daemons before installing any plugin:

sudo systemctl stop slurmd.service && sudo systemctl stop slurmctld.service

Since the Pyxis installation process involves building from source code, you must first install the package containing the Slurm development headers:

sudo apt -y install libslurm-dev

The Pyxis build process assumes that Slurm was installed from source, so it looks for header files at the absolute path /usr/include/slurm. In our example, the slurm-wlm package placed them in a different directory, /usr/include/slurm-wlm. To solve this problem, simply create a symbolic link:

sudo ln -s /usr/include/slurm-wlm /usr/include/slurm

Clone the plugin’s source code from the official NVIDIA® repository:

git clone https://github.com/NVIDIA/pyxis

Open the downloaded directory:

cd pyxis

Compile and install the plugin:

sudo make install

Once the compilation is complete, you’ll need to create another symbolic link:

sudo ln -s /usr/local/share/pyxis/pyxis.conf /etc/slurm/plugstack.conf.d/pyxis.conf
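
If the /etc/slurm/plugstack.conf.d directory doesn’t exist on your system, create it and, per the Pyxis README, make sure the main plugstack.conf includes it, then repeat the link command above:

sudo mkdir -p /etc/slurm/plugstack.conf.d
echo 'include /etc/slurm/plugstack.conf.d/*' | sudo tee /etc/slurm/plugstack.conf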

Now, everything is ready to launch both daemons:

sudo systemctl start slurmd.service && sudo systemctl start slurmctld.service

If done correctly, when you run the following command, you’ll see new options marked [pyxis]:

srun --help
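
Since the full help output is long, you can filter for just the Pyxis additions, all of which deal with containers:

srun --help | grep container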

Step 5. Install Enroot

Just like with the previous plugin, it makes sense to stop the daemons first:

sudo systemctl stop slurmd.service && sudo systemctl stop slurmctld.service

Next, use a shell variable to store the CPU architecture. This is convenient for the subsequent commands, as the saved value is substituted automatically rather than requiring manual editing:

arch=$(dpkg --print-architecture)

Download the DEB package:

curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot_3.4.1-1_${arch}.deb

You can install it using the dpkg utility:

sudo dpkg -i enroot_3.4.1-1_${arch}.deb

If the system reports that some dependencies are missing, install them by running:

sudo apt-get -f install
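
To confirm that Enroot itself works for an unprivileged user, you can import and start a small test image (alpine is used here only because it is small; any image will do):

enroot import docker://alpine
enroot create alpine.sqsh
enroot start alpine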

Finally, start both daemons:

sudo systemctl start slurmd.service && sudo systemctl start slurmctld.service
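
As a final end-to-end check, you can run a containerized job through the scheduler. This one-liner, adapted from the Pyxis README, prints the distribution name from inside a container (the ubuntu:22.04 image tag is just an example):

srun --container-image=ubuntu:22.04 grep PRETTY /etc/os-release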

This is the minimum set of steps required to deploy a simple Slurm cluster with the Pyxis and Enroot plugins. Additional information can be found in the official documentation on each project’s website.
