CUDA® initialization: Unexpected error from cudaGetDeviceCount()
Some server configurations, such as those with 8 x A100 SXM4 GPUs, have a peculiar characteristic. When Ubuntu is chosen as the operating system and the latest NVIDIA® drivers and CUDA® Toolkit are installed, attempting to run an application built with the PyTorch framework often results in an error. This error typically appears as follows:
CUDA initialization: Unexpected error from cudaGetDeviceCount()
The error can be reproduced not only through the application but also directly in the interactive Python console:
python3
>>> import torch
>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
This error indicates that CUDA® can’t correctly determine the number of available GPUs, and thus can’t allocate resources for computation. Oddly enough, if you run the command to count available devices, it will display the correct number:
>>> torch.cuda.device_count()
8
The standard nvidia-smi utility functions correctly, but additional features like MIG are disabled. Upgrading or downgrading the operating system, GPU drivers, or CUDA® doesn’t solve this problem.
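The discrepancy is easy to see if you query the GPUs through NVML, the library that nvidia-smi is built on. The sketch below is only an illustration and assumes the nvidia-ml-py package (which provides the pynvml module) has been installed with pip; on such a server NVML should enumerate all eight GPUs even while the CUDA® check fails:

# Cross-check via NVML (the same library nvidia-smi is built on).
# Assumes: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print("NVML sees", count, "GPU(s)")
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()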
Possible reason
The error stems from the way the system detects available GPUs. When PyTorch is loaded, its modules probe for available computing devices by sending cuDeviceGetByPCIBusId or cuDeviceGetPCIBusId requests to the CUDA® driver API. If these requests fail, the system assumes no devices are available, which prevents PyTorch from using them.
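You can reproduce this low-level check without PyTorch by calling the CUDA® driver API directly through ctypes. The following is only an illustrative sketch of the failure mode: on an affected server, cuInit() or the subsequent device query returns error code 802 (CUDA_ERROR_SYSTEM_NOT_READY), the same code PyTorch reports in its warning:

# Minimal sketch: talk to the CUDA driver API (libcuda) directly.
# 0 means CUDA_SUCCESS; 802 means CUDA_ERROR_SYSTEM_NOT_READY.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

res = cuda.cuInit(0)                      # initialize the driver API
print("cuInit:", res)

count = ctypes.c_int(0)
res = cuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount:", res, "devices:", count.value)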
Server preparation
Before addressing the detection issue, let’s isolate our Python environment using a virtual environment as a precaution. Install the package:
sudo apt install python3-venv
Create a directory to store all files and folders for the virtual environment:
mkdir /home/usergpu/venv
Create an isolated environment:
python3 -m venv /home/usergpu/venv
Let’s activate the environment. All subsequent actions, such as installing packages or performing other Python-related tasks, will be confined to this isolated environment. These actions won’t affect the operating system:
source /home/usergpu/venv/bin/activate
Install PyTorch with CUDA® 12.4 support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
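If you want to confirm that the CUDA®-enabled build landed in the virtual environment, a quick check from the Python console is enough (the exact version strings depend on the installed build):

# Quick sanity check inside the activated virtual environment
import torch

print(torch.__version__)    # PyTorch build, e.g. a 2.x release with the +cu124 suffix
print(torch.version.cuda)   # CUDA version the wheel was built against, e.g. 12.4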
Solution
Now everything is ready to solve the problem causing the error. You’ll need to set the following three environment variables:
- CUDA_DEVICE_ORDER="PCI_BUS_ID" - sorts the GPUs so that their IDs follow the device order on the PCIe bus.
- PYTORCH_NVML_BASED_CUDA_CHECK=1 - performs the availability check through NVML (NVIDIA® Management Library), the API layer that the nvidia-smi utility is built on.
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 - explicitly tells the system the IDs of the available GPUs.
With the system now fully aware of the installed GPUs, you can proceed to run Python. The complete command will appear as follows:
CUDA_DEVICE_ORDER="PCI_BUS_ID" PYTORCH_NVML_BASED_CUDA_CHECK=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3
Checking:
>>> import torch
>>> torch.cuda.is_available()
True
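Alternatively, the same variables can be set from within Python itself, for example at the top of a launcher script, as long as this happens before torch is imported. A minimal sketch:

# Set the variables before importing torch; they are read during CUDA initialization.
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["PYTORCH_NVML_BASED_CUDA_CHECK"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

import torch

print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # 8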
You can now run any application on the PyTorch framework without encountering that error.
Another possible reason
Sometimes, the above solution may not work due to a less obvious issue. Configurations like 8 x A100 SXM4 rely on the NVIDIA® Fabric Manager (FM) software. FM acts as a coordinator, optimizing GPU connections and ensuring load balancing. It’s also responsible for monitoring and service functions, including how the GPUs are presented to the operating system.
FM constantly communicates with the NVIDIA® driver API, so its version must match the version of the installed driver. If there’s a mismatch, the FM daemon stops working. This leads to the behavior described earlier: the framework requests the available devices but receives incorrect data and can’t properly distribute computing tasks.
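If you want to compare the two versions directly, a small sketch like the following can help. It is only an illustration and assumes a Debian/Ubuntu system where Fabric Manager was installed as a versioned apt package; the package name pattern is an assumption and may differ on your system:

# Compare the loaded driver version with the installed Fabric Manager package.
# Assumption: Fabric Manager comes from a dpkg package named nvidia-fabricmanager-*.
import subprocess
from pathlib import Path

# Driver version as reported by the loaded kernel module
driver_version = Path("/sys/module/nvidia/version").read_text().strip()

# Installed Fabric Manager package version(s) on Debian/Ubuntu
fm_packages = subprocess.run(
    ["dpkg-query", "-W", "-f=${Package} ${Version}\n", "nvidia-fabricmanager-*"],
    capture_output=True, text=True,
).stdout.strip()

print("Driver version:", driver_version)
print("Fabric Manager:", fm_packages)
# The two versions must match (e.g. both 550.90.07); otherwise the
# nvidia-fabricmanager service will fail to start.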
To rule out this potential problem, perform a simple diagnosis:
sudo systemctl status nvidia-fabricmanager
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Thu 2024-10-17 21:01:05 UTC; 8h ago
    Process: 3992 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 11ms

Oct 17 21:01:05 ubuntu22044 systemd[1]: Starting NVIDIA fabric manager service...
Oct 17 21:01:05 ubuntu22044 nv-fabricmanager[3994]: fabric manager NVIDIA GPU driver interface version 550.90.07 don't match with driver version 550.54.15. Please update with matching NVIDIA driver package.
Oct 17 21:01:05 ubuntu22044 systemd[1]: Failed to start NVIDIA fabric manager service.
The example above shows that FM couldn’t start because its version doesn’t match the installed driver version. The simplest solution is to download and install a driver version that exactly matches the FM version (550.90.07). While there are various ways to do this, the easiest method is to download a self-extracting archive in .run format.
wget https://download.nvidia.com/XFree86/Linux-x86_64/550.90.07/NVIDIA-Linux-x86_64-550.90.07.run
Make this file executable:
sudo chmod a+x NVIDIA-Linux-x86_64-550.90.07.run
And start installation:
sudo ./NVIDIA-Linux-x86_64-550.90.07.run
Once the driver is installed, manually start the daemon:
sudo systemctl start nvidia-fabricmanager
Check that the launch was successful:
sudo systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-10-18 05:45:26 UTC; 5s ago
    Process: 36614 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
   Main PID: 36616 (nv-fabricmanage)
      Tasks: 19 (limit: 629145)
     Memory: 16.2M
        CPU: 32.350s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─36616 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Oct 18 05:45:02 ubuntu22044 systemd[1]: Starting NVIDIA fabric manager service...
Oct 18 05:45:15 ubuntu22044 nv-fabricmanager[36616]: Connected to 1 node.
Oct 18 05:45:26 ubuntu22044 nv-fabricmanager[36616]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Oct 18 05:45:26 ubuntu22044 systemd[1]: Started NVIDIA fabric manager service.
Now you can try running your PyTorch-based application to check that all GPUs are available.
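For a quick confirmation from the console, you can also enumerate the devices directly in PyTorch:

# Enumerate the GPUs visible to PyTorch once Fabric Manager is running
import torch

print(torch.cuda.is_available())       # True
print(torch.cuda.device_count())       # 8 on an 8 x A100 SXM4 server
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))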
Updated: 28.03.2025
Published: 17.10.2024