Intel Habana Gaudi 2: install and test

Before you start installing the software for the Gaudi 2 accelerators, one important point is worth mentioning. We are accustomed to training and running neural networks on GPUs, but Intel Habana Gaudi 2 is very different from a GPU: it belongs to a separate class of devices designed solely for accelerating AI workloads.
Many familiar applications and frameworks will not work without first preparing the operating system and, in some cases, without a special GPU Migration Toolkit. This explains the large number of preparatory steps described in this article. Let's go through them in order.
Step 1. Install SynapseAI Software Stack
To start working with Intel Habana Gaudi 2 accelerators, you need to install the SynapseAI stack. It includes a special graph compiler that transforms the topology of the neural network model to optimize execution on the Gaudi architecture, API libraries for horizontal scaling, and a separate SDK for creating high-performance algorithms and machine learning models.
Separately, we note that SynapseAI is the part that creates a bridge between popular frameworks like PyTorch/TensorFlow and the Gaudi 2 AI accelerators. This lets you work with familiar abstractions while Gaudi 2 optimizes the computations on its own. Specific operators for which the accelerators lack hardware support are executed on the CPU.
To simplify the installation of individual SynapseAI components, a convenient shell script has been created. Let’s download it:
wget -nv https://vault.habana.ai/artifactory/gaudi-installer/latest/habanalabs-installer.sh
Make the file executable:
chmod +x habanalabs-installer.sh
Run the script:
./habanalabs-installer.sh install --type base
Follow the system prompts during installation. A detailed report is written to a log file, showing which packages were installed and whether the accelerators were successfully found and initialized:
/var/log/habana_logs/install-YYYY-MM-DD-HH-MM-SS.log
On success, the log contains a line for each of the eight discovered devices:
[ +3.881647] habanalabs hl5: Found GAUDI2 device with 96GB DRAM
[ +0.008145] habanalabs hl0: Found GAUDI2 device with 96GB DRAM
[ +0.032034] habanalabs hl3: Found GAUDI2 device with 96GB DRAM
[ +0.002376] habanalabs hl4: Found GAUDI2 device with 96GB DRAM
[ +0.005174] habanalabs hl1: Found GAUDI2 device with 96GB DRAM
[ +0.000390] habanalabs hl2: Found GAUDI2 device with 96GB DRAM
[ +0.007065] habanalabs hl7: Found GAUDI2 device with 96GB DRAM
[ +0.006256] habanalabs hl6: Found GAUDI2 device with 96GB DRAM
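These are kernel driver messages, so the same device discovery lines can also be retrieved at any time from the kernel ring buffer:
sudo dmesg | grep -i gaudi2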
Just as the nvidia-smi utility provides information about installed GPUs and running compute processes, SynapseAI has a similar program. You can run it to get a report on the current state of the Gaudi 2 AI accelerators:
hl-smi
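To monitor the accelerators continuously while a job is running, you can wrap it in the standard watch utility:
watch -n 1 hl-smi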

Step 2. PyTorch test
The example below uses PyTorch, one of the most popular machine learning frameworks, together with Habana's PyTorch bridge. Using the same installation script, you can install a pre-built version of PyTorch with support for Gaudi 2 accelerators. Let's start by installing the general dependencies:
./habanalabs-installer.sh install -t dependencies
Next, install the PyTorch platform inside a virtual environment implemented using the Python Virtual Environment (venv) mechanism:
./habanalabs-installer.sh install --type pytorch --venv
Let’s activate the created virtual environment:
source habanalabs-venv/bin/activate
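Before launching a full training job, it is worth checking that the bridge actually sees the accelerators. A minimal sketch (assuming the installer placed the habana_frameworks packages into this venv), which you can run with python3:
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
import habana_frameworks.torch.hpu as hthpu    # device query helpers

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())
t = torch.ones(2, 2).to("hpu")  # trivial op offloaded to the first accelerator
print((t * 2).cpu())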
Create a simple Python example (a small MNIST classifier) that uses the Gaudi 2 accelerators:
nano example.py
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import os
# Import Habana Torch Library
import habana_frameworks.torch.core as htcore
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        out = x.view(-1, 28 * 28)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out


def train(net, criterion, optimizer, trainloader, device):
    net.train()
    train_loss = 0.0
    correct = 0
    total = 0
    for batch_idx, (data, targets) in enumerate(trainloader):
        data, targets = data.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(data)
        loss = criterion(outputs, targets)
        loss.backward()
        # API call to trigger execution
        htcore.mark_step()
        optimizer.step()
        # API call to trigger execution
        htcore.mark_step()
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    train_loss = train_loss / (batch_idx + 1)
    train_acc = 100.0 * (correct / total)
    print("Training loss is {} and training accuracy is {}".format(train_loss, train_acc))


def test(net, criterion, testloader, device):
    net.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, targets) in enumerate(testloader):
            data, targets = data.to(device), targets.to(device)
            outputs = net(data)
            loss = criterion(outputs, targets)
            # API call to trigger execution
            htcore.mark_step()
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    test_loss = test_loss / (batch_idx + 1)
    test_acc = 100.0 * (correct / total)
    print("Testing loss is {} and testing accuracy is {}".format(test_loss, test_acc))


def main():
    epochs = 20
    batch_size = 128
    lr = 0.01
    milestones = [10, 15]
    load_path = './data'
    save_path = './checkpoints'
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    # Target the Gaudi HPU device
    device = torch.device("hpu")
    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])
    trainset = torchvision.datasets.MNIST(root=load_path, train=True,
                                          download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=2)
    testset = torchvision.datasets.MNIST(root=load_path, train=False,
                                         download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=2)
    net = SimpleModel()
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr,
                          momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    for epoch in range(1, epochs + 1):
        print("=====================================================================")
        print("Epoch : {}".format(epoch))
        train(net, criterion, optimizer, trainloader, device)
        test(net, criterion, testloader, device)
        torch.save(net.state_dict(), os.path.join(save_path, 'epoch_{}.pth'.format(epoch)))
        scheduler.step()


if __name__ == '__main__':
    main()
Finally, execute the application:
python3 example.py
To exit the virtual environment, run the following command:
deactivate
Step 3. Clone the training repository
Clone the repository with the MLPerf code:
git clone https://github.com/mlcommons/training_results_v3.0
Create a separate directory that will be used by the Docker container with MLPerf:
mkdir -p mlperf
Change the directory:
cd mlperf
Let’s export some environment variables:
export MLPERF_DIR=/home/usergpu/mlperf
export SCRATCH_DIR=/home/usergpu/mlperf/scratch
export DATASETS_DIR=/home/usergpu/mlperf/datasets
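These paths assume a home directory of /home/usergpu; if yours differs, base the variables on $HOME instead:
export MLPERF_DIR=$HOME/mlperf
export SCRATCH_DIR=$MLPERF_DIR/scratch
export DATASETS_DIR=$MLPERF_DIR/datasets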
Create new directories using the variables created:
mkdir -p $MLPERF_DIR/Habana
mkdir -p $SCRATCH_DIR
mkdir -p $DATASETS_DIR
Copy the benchmark code to $MLPERF_DIR/Habana (the repository was cloned one level up, in the home directory):
cp -R ../training_results_v3.0/Intel-HabanaLabs/benchmarks/ $MLPERF_DIR/Habana
Export another variable that stores the reference to the required version of the Docker image:
export MLPERF_DOCKER_IMAGE=vault.habana.ai/gaudi-docker-mlperf/ver3.1/pytorch-installer-2.0.1:1.13.99-41
Step 4. Install Docker
Our instance runs Ubuntu Linux 22.04 LTS, which does not ship with Docker by default. So, before downloading and running containers, you need to install Docker support. Let's refresh the package cache and install some basic packages that you'll need later:
sudo apt update && sudo apt -y install apt-transport-https ca-certificates curl software-properties-common
To install Docker, you need to add the project's digitally signed repository. Download the signing key and add it to the operating system's keyring:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
Docker can run on platforms with various architectures. The following command will detect your server’s architecture and add the corresponding repository line to the APT package manager list:
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
Update the package cache and policies, then install docker-ce (Docker Community Edition):
sudo apt update && apt-cache policy docker-ce && sudo apt install docker-ce
Finally, check that Docker daemon is up and running:
sudo systemctl status docker
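To verify the installation end to end, you can also run the standard test container:
sudo docker run --rm hello-world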
Step 5. Run Docker container
Let’s launch the container in privileged mode using the previously specified variables:
sudo docker run --privileged --security-opt seccomp=unconfined \
--name mlperf3.0 -td \
-v /dev:/dev \
--device=/dev:/dev \
-e LOG_LEVEL_ALL=6 \
-v /sys/kernel/debug:/sys/kernel/debug \
-v /tmp:/tmp \
-v $MLPERF_DIR:/root/MLPERF \
-v $SCRATCH_DIR:/root/scratch \
-v $DATASETS_DIR:/root/datasets/ \
--cap-add=sys_nice --cap-add=SYS_PTRACE \
--user root --workdir=/root --net=host \
--ulimit memlock=-1:-1 $MLPERF_DOCKER_IMAGE
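Because the host's /dev is passed through, the accelerators should be visible from inside the container. A quick check (assuming the image bundles the hl-smi utility, as Habana's prebuilt images normally do):
sudo docker exec mlperf3.0 hl-smi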
For convenience, you can access the terminal inside the container via SSH. Start the SSH service in the container:
sudo docker exec mlperf3.0 bash -c "service ssh start"
To open a command shell (bash) in the current session, run the following command:
sudo docker exec -it mlperf3.0 bash
Step 6. Prepare a dataset
To run the BERT implementation test from MLPerf, you need a prepared dataset. The optimal method is to generate the dataset from preloaded data. The MLPerf repository includes a special script, prepare_data.sh, which requires a specific set of packages to function. Let's navigate to the following directory:
cd /root/MLPERF/Habana/benchmarks/bert/implementations/PyTorch
Install all required packages using the pre-generated list and the pip package manager:
pip install -r requirements.txt
Set the PYTORCH_BERT_DATA variable to instruct the script where to store data:
export PYTORCH_BERT_DATA=/root/datasets/pytorch_bert
Run the script:
bash input_preprocessing/prepare_data.sh -o $PYTORCH_BERT_DATA
The generation procedure is quite long and can take several hours. Please be patient and do not interrupt the process. If you plan to disconnect from your SSH session, start the screen utility before launching the Docker container so that the process survives the disconnect.
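A minimal screen workflow for that case (run on the host, before docker exec):
screen -S mlperf
# ...work as usual; detach with Ctrl-A D, then reattach later with:
screen -r mlperf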
Step 7. Pack the dataset
The next step is to "cut" the dataset into equal pieces for the subsequent MLPerf run. Let's create a separate directory for the packed data:
mkdir $PYTORCH_BERT_DATA/packed
Run the packing script:
python3 pack_pretraining_data_pytorch.py \
--input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed \
--output_dir=$PYTORCH_BERT_DATA/packed \
--max_predictions_per_seq=76
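When the script finishes, the packed shards should appear in the output directory; a quick way to confirm:
ls $PYTORCH_BERT_DATA/packed | head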
Step 8. Run a test
Now that the dataset is prepared, it's time to run the test. However, some preparation is still required: the BERT test authors left several hard-coded values in the launch script that will interfere with the test execution. First, rename the following directory:
mv $PYTORCH_BERT_DATA/packed $PYTORCH_BERT_DATA/packed_data_500_pt
Change the directory:
cd /root/MLPERF/Habana/benchmarks/bert/implementations/HLS-Gaudi2-PT
Since the GNU Nano editor isn’t installed inside the container, it must be installed separately. Alternatively, you can use the built-in Vi editor:
apt update && apt -y install nano
Now, edit the test launch script:
nano launch_bert_pytorch.sh
Find the first line:
DATA_ROOT=/mnt/weka/data/pytorch/bert_mlperf/packed_data
Replace with the following:
DATA_ROOT=/root/datasets/pytorch_bert
Find the second line:
INPUT_DIR=$DATA_ROOT/packed
Replace with the following:
INPUT_DIR=$DATA_ROOT/packed_data_500_pt
Save the file and exit.
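If you prefer a non-interactive edit, the same two replacements can be made with sed (assuming the assignments start at the beginning of a line, as in the current script):
sed -i 's|^DATA_ROOT=.*|DATA_ROOT=/root/datasets/pytorch_bert|' launch_bert_pytorch.sh
sed -i 's|^INPUT_DIR=.*|INPUT_DIR=$DATA_ROOT/packed_data_500_pt|' launch_bert_pytorch.sh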
The test code includes a limiter function that restricts the gradient from exceeding certain values, preventing potential exponential growth. For reasons unknown to us, this function is absent in the PyTorch version used in the container, causing the test to terminate abnormally during the warm-up stage.
A potential workaround might be to temporarily remove this function from the code in the fastddp.py file. To do this, open the file:
nano ../PyTorch/fastddp.py
Find and comment out the following three lines of code using the # (hash) symbol so they look like this:
#from habana_frameworks.torch import _hpex_C
# clip_global_grad_norm = _hpex_C.fused_lamb_norm(grads, 1.0)
# _fusion_buffer.div_((clip_global_grad_norm * _all_reduce_group_size).to(_fusion_buffer.dtype))
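You can confirm that the lines are now commented out before moving on:
grep -n "clip_global_grad_norm" ../PyTorch/fastddp.py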
Also, save the file and exit. Change the directory:
cd ../HLS-Gaudi2-PT
Finally, run the script. It will take approximately 20 minutes to complete:
./launch_bert_pytorch.sh