How to run a deep learning benchmark
Running benchmarks on machine learning servers isn’t the simplest task. Popular options include MLPerf, ai-benchmark, and deeplearning-benchmark. The first is a suite aimed at equipment manufacturers rather than ordinary users, so running tests from MLPerf requires deep programming knowledge and experience with containerized applications.
The second benchmark is a bit simpler, but its installation instructions are outdated: the tensorflow-gpu package has been deprecated, and the pip install tensorflow[and-cuda] command continues to produce initialization errors. We’ll therefore focus on the third benchmark. First, let’s update the package cache and automatically install the GPU drivers. These instructions apply to Ubuntu 22.04.
Prerequisites
System update
sudo apt update && sudo apt -y upgrade && sudo ubuntu-drivers autoinstall
Reboot the server:
sudo shutdown -r now
Adding repositories
Since the nvidia-container-toolkit package and its dependencies are not part of the standard repositories, add a separate repository in accordance with NVIDIA®’s guide:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
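The sed filter in that pipeline rewrites each repository line so apt knows which key to verify it with. A minimal illustration, using a hypothetical input line of the same shape as the ones in NVIDIA®’s list file:

```shell
# The filter turns "deb https://..." into "deb [signed-by=<keyring>] https://..."
# (the single quotes keep $(ARCH) literal, as it appears in the list file)
echo 'deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /' \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```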
Update the package cache:
sudo apt-get update
Docker and NVIDIA® Container Toolkit
Install the Docker engine and NVIDIA® Container Toolkit:
sudo apt-get install docker.io nvidia-container-toolkit
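Depending on your Docker version, you may also need to register the NVIDIA® runtime with Docker; the toolkit ships the nvidia-ctk helper for this, which writes the runtime entry into Docker’s daemon configuration:

```shell
# Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker
# so the --gpus flag is available (skip if GPU containers already work)
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```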
To avoid prefixing every Docker command with sudo, add your user to the docker group and activate the new membership in the current session:
sudo usermod -aG docker $USER
newgrp docker
Prepare a container
Next, you’ll need to download a prepared container image named pytorch:22.10-py3 from the NVIDIA® Container Registry. To avoid typing this name every time, let’s utilize the command shell’s ability to create variables and assign this value to the NAME_NGC variable:
export NAME_NGC=pytorch:22.10-py3
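The shell expands the variable inside the image reference, so every later command resolves to the same full image name. A quick sanity check:

```shell
# Assign the image tag once, then verify the full reference it expands to
export NAME_NGC=pytorch:22.10-py3
echo "nvcr.io/nvidia/${NAME_NGC}"
```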
Now, pull the image from the registry using the created variable:
docker pull nvcr.io/nvidia/${NAME_NGC}
Once the container image is on the server, you need to download the content from two repositories. Clone the first repository with code examples and switch to the branch we need:
git clone https://github.com/LambdaLabsML/DeepLearningExamples.git && \
cd DeepLearningExamples && \
git checkout lambda/benchmark && \
cd ..
Clone the second repository, which contains the PyTorch implementation of the benchmark code:
git clone https://github.com/lambdal/deeplearning-benchmark.git && \
cd deeplearning-benchmark/pytorch
Launch the created container, mounting the necessary directories and calling the script that prepares the dataset. This command may take approximately half an hour to complete, so be patient and wait for it to finish:
docker run --gpus all --rm --shm-size=64g \
-v ~/DeepLearningExamples/PyTorch:/workspace/benchmark \
-v ~/data:/data \
-v $(pwd)"/scripts":/scripts \
nvcr.io/nvidia/${NAME_NGC} \
/bin/bash -c "cp -r /scripts/* /workspace; ./run_prepare.sh"
Run the benchmark
Finally, run the benchmark tests. The /deeplearning-benchmark/pytorch/scripts/ directory contains many typical configurations. You can choose one of the ready-made ones or create your own that best matches your server configuration. For this example, we used the 4xA100_SXM4_80GB_v1 configuration:
docker run \
--rm --shm-size=128g \
--gpus all \
-v ~/DeepLearningExamples/PyTorch:/workspace/benchmark \
-v ~/data:/data \
-v $(pwd)"/scripts":/scripts \
-v $(pwd)"/results":/results \
nvcr.io/nvidia/${NAME_NGC} \
/bin/bash -c "cp -r /scripts/* /workspace; ./run_benchmark.sh 4xA100_SXM4_80GB_v1 all 1500"
After the benchmarks are completed, you’ll find the test results in the results directory mounted earlier. You can also use additional scripts to convert them to other formats.
Updated: 28.03.2025
Published: 24.07.2024