MLPerf stack: Slurm, Pyxis, Enroot

The MLPerf benchmark is widely regarded as a test suite that lets you accurately judge the performance of servers with GPUs and AI accelerators. Unfortunately, it isn’t the kind of benchmark most people are used to, where it is enough to run an executable file and read off the result some time later. MLPerf is a set of scripts that run tests against various datasets. The datasets themselves are not included in the benchmark kit: you need to download them separately and prepare them for use.
In modern versions of MLPerf, this set of scripts cannot be executed on a clean system; for most tests, you’ll need to prepare the environment accordingly. As a job management tool, the MLPerf authors chose the Slurm Workload Manager, which is used on most supercomputers in the world. This open-source application allows flexible workload management by distributing computing tasks among all cluster members.
A minimal Slurm cluster consists of one compute node and one management node. Ideally, these are two different servers that communicate with each other using hostnames, which, in the case of Linux, are specified in the /etc/hosts file. In addition to the configured cluster, MLPerf requires two additional components: Pyxis and Enroot. The first is a Slurm plugin that allows an unprivileged user to run containerized workloads. The second is a container runtime that removes most of the isolation mechanisms of regular containers, thereby eliminating almost all performance overhead while maintaining file system separation.
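For illustration, the hosts file for such a two-node cluster might look like the sketch below. The addresses and the hostnames slurmctl and gpuserver are placeholders; the example writes to a local file, while on a real cluster you would append the same lines to /etc/hosts with root privileges.

```shell
# Example host entries for a two-node Slurm cluster.
# IP addresses and hostnames are placeholders -- use your own.
# On a real system, append these lines to /etc/hosts as root.
cat <<'EOF' > hosts.example
192.168.0.10 slurmctl    # management node
192.168.0.11 gpuserver   # compute node
EOF
cat hosts.example
```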
Step 1. Prepare the system
Start by updating the repository package cache and upgrading the installed packages:
sudo apt update && sudo apt -y upgrade
Don’t forget to install the NVIDIA® drivers, either with the autoinstall command or manually using our step-by-step guide:
sudo ubuntu-drivers autoinstall
Reboot the server:
sudo shutdown -r now
Step 2. Install Slurm
For these instructions, Ubuntu 22.04 LTS was selected as the operating system, so Slurm can be installed directly from the standard repository. The slurm-wlm package contains both slurmd for the computation node and slurmctld for the control node. Install the package:
sudo apt install slurm-wlm
By default, after installation, neither of the daemons will operate because the main configuration file, slurm.conf, is missing. The developers have made an effort to reduce the entry barrier by creating a webpage with a configurator. There, you can assemble the configuration file step by step, using the hints provided for each item.
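For a single-node setup like the one described here, the resulting file might look roughly like the sketch below. The hostname, CPU count, and memory size are placeholders; generate a real file with the online configurator instead of copying this one.

```shell
# A minimal single-node slurm.conf sketch -- all values are placeholders.
# The example writes to a local file; the real file lives at /etc/slurm/slurm.conf.
cat <<'EOF' > slurm.conf.example
ClusterName=mlperf
SlurmctldHost=gpuserver
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
ProctrackType=proctrack/linuxproc
# Node and partition definitions (placeholder hardware counts)
NodeName=gpuserver CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF
cat slurm.conf.example
```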
Please note that for a standard installation on Ubuntu 22.04 LTS, it makes sense to choose LinuxProc instead of Cgroup. After clicking the Submit button at the bottom of the page, you’ll receive the finalized text of the configuration file. Copy it to your clipboard and run the following command:
sudo nano /etc/slurm/slurm.conf
Paste the contents of the clipboard and save the file. Now, you can start the compute node daemon:
sudo systemctl start slurmd.service
You can check the status of the daemon using the following command:
sudo systemctl status slurmd.service
Step 3. Set up Slurm controller
Before you can start the controller daemon, you’ll need to take a few extra steps. First, you need to create a directory where the daemon can store service information about running jobs:
sudo mkdir -p /var/spool/slurmctld
The directory now exists, but the slurm user, under which the controller runs, cannot yet write anything there. Make it the owner of the directory:
sudo chown slurm:slurm /var/spool/slurmctld
Now, you can start the controller daemon:
sudo systemctl start slurmctld.service
To check the operating status of the controller, you can use the standard command:
sudo systemctl status slurmctld.service
You can also immediately view the current status of the system, in particular how many compute nodes are connected to the controller and what their current state is:
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle gpuserver
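If a node appears in the down or drain state instead of idle, it can usually be returned to service with scontrol. A sketch, using the node name from the example output above:

```shell
# Return a downed/drained node to service (gpuserver is a placeholder).
sudo scontrol update NodeName=gpuserver State=RESUME
```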
Step 4. Install Pyxis
Slurm extends its capabilities using the SPANK mechanism (Slurm Plug-in Architecture for Node and job Kontrol). The mechanism itself doesn’t need to be installed or configured separately from the main application. However, it is advisable to temporarily stop both daemons before any plugin installation:
sudo systemctl stop slurmd.service && sudo systemctl stop slurmctld.service
Since the Pyxis installation process involves building from the source code, you must first install a package containing the developer libraries:
sudo apt -y install libslurm-dev
The Pyxis build process assumes that Slurm was installed from source, so it looks for some files at the absolute path /usr/include/slurm. In our example, the slurm-wlm package placed them in a different directory, /usr/include/slurm-wlm. To solve this problem, simply create a symbolic link:
sudo ln -s /usr/include/slurm-wlm /usr/include/slurm
Clone the plugin’s source code from the official NVIDIA® repository:
git clone https://github.com/NVIDIA/pyxis
Open the downloaded directory:
cd pyxis
Start the build and installation process:
sudo make install
Once the compilation is complete, you’ll need to create another symbolic link:
sudo ln -s /usr/local/share/pyxis/pyxis.conf /etc/slurm/plugstack.conf.d/pyxis.conf
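The linked file is tiny: at the time of writing, it essentially just tells SPANK to load the plugin, along these lines (the exact path and contents may differ between Pyxis versions):

```
required /usr/local/lib/slurm/spank_pyxis.so
```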
Now, everything is ready to launch both daemons:
sudo systemctl start slurmd.service && sudo systemctl start slurmctld.service
If done correctly, when you run the following command, you’ll see new options marked [pyxis]:
srun --help
Step 5. Install Enroot
Just like with the previous plugin, it makes sense to stop the daemons first:
sudo systemctl stop slurmd.service && sudo systemctl stop slurmctld.service
Next, use a shell feature to save the CPU architecture into a separate variable. This is convenient for the subsequent commands: the shell substitutes the saved value automatically instead of requiring manual editing:
arch=$(dpkg --print-architecture)
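For example, on an x86-64 machine the variable expands to amd64, so the download URL below resolves without manual editing. In this sketch, the fallback to amd64 exists only so the snippet also runs on systems without dpkg:

```shell
# Save the architecture (e.g. amd64 or arm64); the fallback to amd64
# is only so this sketch works on systems without dpkg installed.
arch=$(dpkg --print-architecture 2>/dev/null || echo amd64)
url="https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot_3.4.1-1_${arch}.deb"
echo "$url"
```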
Download the DEB package:
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.4.1/enroot_3.4.1-1_${arch}.deb
You can install it using the dpkg utility:
sudo dpkg -i enroot_3.4.1-1_${arch}.deb
If the system reports that some dependencies are missing, you can install them manually:
sudo apt-get -f install
Finally, start both daemons:
sudo systemctl start slurmd.service && sudo systemctl start slurmctld.service
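With both components in place, you can verify the whole stack by launching a container through Slurm. A sketch — the image name is just an example; Pyxis will use Enroot to import it from Docker Hub on first use:

```shell
# Run a one-off command inside an ubuntu:22.04 container via Slurm.
srun --container-image=ubuntu:22.04 grep PRETTY_NAME /etc/os-release
```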
This is the minimum set of steps required to deploy a simple Slurm cluster with Pyxis and Enroot plugins. Additional information can be found in the official documentation on the project website.
Updated: 28.03.2025
Published: 09.07.2024