Intel Habana Gaudi 2: install and test

Before you start installing the software for the Gaudi 2 accelerators, one important point is worth mentioning. We are accustomed to training and running neural networks on GPUs, but Intel Habana Gaudi 2 is very different from a GPU: it belongs to a separate class of devices designed solely for accelerating AI workloads.
Many familiar applications and frameworks will not work without first preparing the operating system and, in some cases, without a special GPU Migration Toolkit. This explains the large number of preparatory steps described in this article. Let's go through them in order.
Step 1. Install SynapseAI Software Stack
To start working with Intel Habana Gaudi 2 accelerators, you need to install the SynapseAI stack. It includes a special graph compiler that transforms the topology of the neural network model to optimize execution on the Gaudi architecture, API libraries for horizontal scaling, and a separate SDK for creating high-performance algorithms and machine learning models.
Separately, we note that SynapseAI is the part that creates a bridge between popular frameworks like PyTorch/TensorFlow and the Gaudi 2 AI accelerators. This lets you work with familiar abstractions while Gaudi 2 optimizes the computations on its own. Specific operators for which the accelerators lack hardware support are executed on the CPU.
To simplify the installation of individual SynapseAI components, a convenient shell script has been created. Let’s download it:
wget -nv https://vault.habana.ai/artifactory/gaudi-installer/latest/habanalabs-installer.sh
Make the file executable:
chmod +x habanalabs-installer.sh
Run the script:
./habanalabs-installer.sh install --type base
Follow the system prompts during installation. A detailed report is written to a log file, showing which packages were installed and whether the accelerators were successfully found and initialized:
/var/log/habana_logs/install-YYYY-MM-DD-HH-MM-SS.log
On success, the log contains a line for each of the eight discovered devices:
[ +3.881647] habanalabs hl5: Found GAUDI2 device with 96GB DRAM
[ +0.008145] habanalabs hl0: Found GAUDI2 device with 96GB DRAM
[ +0.032034] habanalabs hl3: Found GAUDI2 device with 96GB DRAM
[ +0.002376] habanalabs hl4: Found GAUDI2 device with 96GB DRAM
[ +0.005174] habanalabs hl1: Found GAUDI2 device with 96GB DRAM
[ +0.000390] habanalabs hl2: Found GAUDI2 device with 96GB DRAM
[ +0.007065] habanalabs hl7: Found GAUDI2 device with 96GB DRAM
[ +0.006256] habanalabs hl6: Found GAUDI2 device with 96GB DRAM
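These are kernel driver messages, so the same device discovery lines can also be retrieved at any time from the kernel ring buffer:
sudo dmesg | grep -i gaudi2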
Just as the nvidia-smi utility provides information about installed GPUs and running compute processes, SynapseAI has a similar program. You can run it to get a report on the current state of the Gaudi 2 AI accelerators:
hl-smi
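To monitor the accelerators continuously while a job is running, you can wrap it in the standard watch utility:
watch -n 1 hl-smi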

Step 2. PyTorch test
The example below uses PyTorch, one of the most popular machine learning frameworks, together with Habana's PyTorch bridge. Using the same installation script, you can install a pre-built version of PyTorch with support for Gaudi 2 accelerators. Let's start by installing the general dependencies:
./habanalabs-installer.sh install -t dependencies
Next, install the PyTorch platform inside a virtual environment implemented using the Python Virtual Environment (venv) mechanism:
./habanalabs-installer.sh install --type pytorch --venv
Let’s activate the created virtual environment:
source habanalabs-venv/bin/activate
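Before launching a full training job, it is worth checking that the bridge actually sees the accelerators. A minimal sketch (assuming the installer placed the habana_frameworks packages into this venv), which you can run with python3:
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
import habana_frameworks.torch.hpu as hthpu    # device query helpers

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())
t = torch.ones(2, 2).to("hpu")  # trivial op offloaded to the first accelerator
print((t * 2).cpu())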
Create a simple Python example (a small MNIST classifier) that uses the Gaudi 2 accelerators:
nano example.py
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import os
# Import Habana Torch Library
import habana_frameworks.torch.core as htcore
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        out = x.view(-1, 28 * 28)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out


def train(net, criterion, optimizer, trainloader, device):
    net.train()
    train_loss = 0.0
    correct = 0
    total = 0
    for batch_idx, (data, targets) in enumerate(trainloader):
        data, targets = data.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = net(data)
        loss = criterion(outputs, targets)
        loss.backward()
        # API call to trigger execution
        htcore.mark_step()
        optimizer.step()
        # API call to trigger execution
        htcore.mark_step()
        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()
    train_loss = train_loss / (batch_idx + 1)
    train_acc = 100.0 * (correct / total)
    print("Training loss is {} and training accuracy is {}".format(train_loss, train_acc))


def test(net, criterion, testloader, device):
    net.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, targets) in enumerate(testloader):
            data, targets = data.to(device), targets.to(device)
            outputs = net(data)
            loss = criterion(outputs, targets)
            # API call to trigger execution
            htcore.mark_step()
            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    test_loss = test_loss / (batch_idx + 1)
    test_acc = 100.0 * (correct / total)
    print("Testing loss is {} and testing accuracy is {}".format(test_loss, test_acc))


def main():
    epochs = 20
    batch_size = 128
    lr = 0.01
    milestones = [10, 15]
    load_path = './data'
    save_path = './checkpoints'
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    # Target the Gaudi HPU device
    device = torch.device("hpu")
    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])
    trainset = torchvision.datasets.MNIST(root=load_path, train=True,
                                          download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=2)
    testset = torchvision.datasets.MNIST(root=load_path, train=False,
                                         download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=2)
    net = SimpleModel()
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr,
                          momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)
    for epoch in range(1, epochs + 1):
        print("=====================================================================")
        print("Epoch : {}".format(epoch))
        train(net, criterion, optimizer, trainloader, device)
        test(net, criterion, testloader, device)
        torch.save(net.state_dict(), os.path.join(save_path, 'epoch_{}.pth'.format(epoch)))
        scheduler.step()


if __name__ == '__main__':
    main()
Finally, execute the application:
python3 example.py
To exit the virtual environment, run the following command:
deactivate
Step 3. Clone the training repository
Clone the repository with the MLPerf code:
git clone https://github.com/mlcommons/training_results_v3.0
Create a separate directory that will be used by the Docker container with MLPerf:
mkdir -p mlperf
Change the directory:
cd mlperf
Let’s export some environment variables:
export MLPERF_DIR=/home/usergpu/mlperf
export SCRATCH_DIR=/home/usergpu/mlperf/scratch
export DATASETS_DIR=/home/usergpu/mlperf/datasets
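These paths assume a home directory of /home/usergpu; if yours differs, base the variables on $HOME instead:
export MLPERF_DIR=$HOME/mlperf
export SCRATCH_DIR=$MLPERF_DIR/scratch
export DATASETS_DIR=$MLPERF_DIR/datasets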
Create new directories using the variables created:
mkdir -p $MLPERF_DIR/Habana
mkdir -p $SCRATCH_DIR
mkdir -p $DATASETS_DIR
Copy the benchmark code to $MLPERF_DIR/Habana (the repository was cloned one level up, in the home directory):
cp -R ../training_results_v3.0/Intel-HabanaLabs/benchmarks/ $MLPERF_DIR/Habana
Export another variable that stores the reference to the required version of the Docker image:
export MLPERF_DOCKER_IMAGE=vault.habana.ai/gaudi-docker-mlperf/ver3.1/pytorch-installer-2.0.1:1.13.99-41
Step 4. Install Docker
Our instance runs Ubuntu Linux 22.04 LTS, which does not ship with Docker by default. So, before downloading and running containers, you need to install Docker support. Let's refresh the package cache and install some basic packages that you'll need later:
sudo apt update && sudo apt -y install apt-transport-https ca-certificates curl software-properties-common
To install Docker, you need to add the project's digitally signed repository. Download the signing key and add it to the operating system's keyring:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
Docker can run on platforms with various architectures. The following command will detect your server’s architecture and add the corresponding repository line to the APT package manager list:
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
Update the package cache and policies, then install docker-ce (Docker Community Edition):
sudo apt update && apt-cache policy docker-ce && sudo apt install docker-ce
Finally, check that Docker daemon is up and running:
sudo systemctl status docker
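To verify the installation end to end, you can also run the standard test container:
sudo docker run --rm hello-world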
Step 5. Run Docker container
Let’s launch the container in privileged mode using the previously specified variables:
sudo docker run --privileged --security-opt seccomp=unconfined \
--name mlperf3.0 -td \
-v /dev:/dev \
--device=/dev:/dev \
-e LOG_LEVEL_ALL=6 \
-v /sys/kernel/debug:/sys/kernel/debug \
-v /tmp:/tmp \
-v $MLPERF_DIR:/root/MLPERF \
-v $SCRATCH_DIR:/root/scratch \
-v $DATASETS_DIR:/root/datasets/ \
--cap-add=sys_nice --cap-add=SYS_PTRACE \
--user root --workdir=/root --net=host \
--ulimit memlock=-1:-1 $MLPERF_DOCKER_IMAGE
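Because the host's /dev is passed through, the accelerators should be visible from inside the container. A quick check (assuming the image bundles the hl-smi utility, as Habana's prebuilt images normally do):
sudo docker exec mlperf3.0 hl-smi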
For convenience, you can access the terminal inside the container via SSH. Start the SSH service in the container:
sudo docker exec mlperf3.0 bash -c "service ssh start"
To open a command shell (bash) in the current session, run the following command:
sudo docker exec -it mlperf3.0 bash
Step 6. Prepare a dataset
To run the BERT implementation test from MLPerf, you need a prepared dataset. The optimal method is to generate the dataset from preloaded data. The MLPerf repository includes a special script, prepare_data.sh, which requires a specific set of packages to function. Let's navigate to the following directory:
cd /root/MLPERF/Habana/benchmarks/bert/implementations/PyTorch
Install all required packages using the pre-generated list and the pip package manager:
pip install -r requirements.txt
Set the PYTORCH_BERT_DATA variable to instruct the script where to store data:
export PYTORCH_BERT_DATA=/root/datasets/pytorch_bert
Run the script:
bash input_preprocessing/prepare_data.sh -o $PYTORCH_BERT_DATA
The generation procedure is quite long and can take several hours. Please be patient and do not interrupt the process. If you plan to disconnect from your SSH session, start the screen utility before launching the Docker container so that the process survives the disconnect.
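A minimal screen workflow for that case (run on the host, before docker exec):
screen -S mlperf
# ...work as usual; detach with Ctrl-A D, then reattach later with:
screen -r mlperf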
Step 7. Pack the dataset
The next step is to "cut" the dataset into equal pieces for the subsequent MLPerf run. Let's create a separate directory for the packed data:
mkdir $PYTORCH_BERT_DATA/packed
Run the packing script:
python3 pack_pretraining_data_pytorch.py \
--input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed \
--output_dir=$PYTORCH_BERT_DATA/packed \
--max_predictions_per_seq=76
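When the script finishes, the packed shards should appear in the output directory; a quick way to confirm:
ls $PYTORCH_BERT_DATA/packed | head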
Step 8. Run a test
Now that the dataset is prepared, it's time to run the test. However, some preparation is still required: the BERT test authors left several hard-coded values in the launch script that will interfere with the test execution. First, rename the following directory:
mv $PYTORCH_BERT_DATA/packed $PYTORCH_BERT_DATA/packed_data_500_pt
Change the directory:
cd /root/MLPERF/Habana/benchmarks/bert/implementations/HLS-Gaudi2-PT
Since the GNU Nano editor isn’t installed inside the container, it must be installed separately. Alternatively, you can use the built-in Vi editor:
apt update && apt -y install nano
Now, edit the test launch script:
nano launch_bert_pytorch.sh
Find the first line:
DATA_ROOT=/mnt/weka/data/pytorch/bert_mlperf/packed_data
Replace with the following:
DATA_ROOT=/root/datasets/pytorch_bert
Find the second line:
INPUT_DIR=$DATA_ROOT/packed
Replace with the following:
INPUT_DIR=$DATA_ROOT/packed_data_500_pt
Save the file and exit.
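If you prefer a non-interactive edit, the same two replacements can be made with sed (assuming the assignments start at the beginning of a line, as in the current script):
sed -i 's|^DATA_ROOT=.*|DATA_ROOT=/root/datasets/pytorch_bert|' launch_bert_pytorch.sh
sed -i 's|^INPUT_DIR=.*|INPUT_DIR=$DATA_ROOT/packed_data_500_pt|' launch_bert_pytorch.sh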
The test code includes a limiter function that restricts the gradient from exceeding certain values, preventing potential exponential growth. For reasons unknown to us, this function is absent in the PyTorch version used in the container, causing the test to terminate abnormally during the warm-up stage.
A potential workaround might be to temporarily remove this function from the code in the fastddp.py file. To do this, open the file:
nano ../PyTorch/fastddp.py
Find and comment out the following three lines of code using the # (hash) symbol so they look like this:
#from habana_frameworks.torch import _hpex_C
# clip_global_grad_norm = _hpex_C.fused_lamb_norm(grads, 1.0)
# _fusion_buffer.div_((clip_global_grad_norm * _all_reduce_group_size).to(_fusion_buffer.dtype))
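You can confirm that the lines are now commented out before moving on:
grep -n "clip_global_grad_norm" ../PyTorch/fastddp.py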
Also, save the file and exit. Change the directory:
cd ../HLS-Gaudi2-PT
Finally, run the script. It will take approximately 20 minutes to complete:
./launch_bert_pytorch.sh