Check NVLink in Linux
Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide Install Nvidia driver in Linux. You also need to install the CUDA toolkit to compile the application samples. In this short guide, we've collected a few useful commands that you can use.
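To verify that both prerequisites are in place before proceeding, you can query the driver and the toolkit; both commands should succeed and print version information:
nvidia-smi
nvcc --version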
Basic commands
Check the physical topology of your system. The following command shows all GPUs and how they are interconnected:
nvidia-smi topo -m
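In the resulting matrix, NV# means that a pair of GPUs is connected by a bonded set of # NVLink links, while values such as PIX, PXB, NODE, and SYS indicate different kinds of PCIe paths; the legend printed below the matrix explains each abbreviation.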
To display the state of the links, execute the following command:
nvidia-smi nvlink -s
The command shows the status and speed of each link. To display the capabilities of the links of a specific GPU, add the -i option with the GPU index:
nvidia-smi nvlink -i 0 -c
Without the -i option, the link capabilities of all GPUs are displayed:
nvidia-smi nvlink -c
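To script these checks on a multi-GPU machine, a minimal sketch could iterate over the GPU indices reported by the driver:
# Show the NVLink status of every GPU in the system
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    echo "=== GPU ${i} ==="
    nvidia-smi nvlink -s -i "${i}"
done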
Install CUDA-samples
A good way to test throughput is to use the application samples from Nvidia. The source code of these samples is published on GitHub and is available to everyone. Clone the repository to the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change directory to the downloaded repository:
cd cuda-samples
Check out the tag that matches the installed CUDA version. For example, if you have CUDA 12.2:
git checkout tags/v12.2
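If you are not sure which tag to use, you can check the installed toolkit version and list the tags available in the repository first:
nvcc --version | grep release
git tag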
Install the prerequisites that are used during compilation:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now, you can compile any sample. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let’s test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the app:
make
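The sample Makefiles accept an SMS variable that restricts the build to specific GPU architectures, which can speed up compilation or help if the default build fails; for an RTX A6000 (compute capability 8.6), for example, this would be:
make SMS="86"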
Run tests
Start testing by executing the app using its name:
./bandwidthTest
The output might look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
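Quick Mode measures a single 32,000,000-byte transfer with pinned memory. The sample also accepts options to test all devices or sweep a range of transfer sizes, for example:
./bandwidthTest --device=all --mode=range --start=1048576 --end=67108864 --increment=1048576
Run ./bandwidthTest --help to see the full list of options.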
Alternatively, you can compile and run the p2pBandwidthLatencyTest:
cd 5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This app shows detailed information about your GPUs' bandwidth and latency in P2P mode. Sample output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 590.51   6.04
     1   6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0      1
     0 589.40  52.75
     1  52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 593.88   8.55
     1   8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 595.69 101.68
     1 101.97 595.69
P2P=Disabled Latency Matrix (us)
   GPU      0      1
     0   1.61  28.66
     1  18.49   1.53

   CPU      0      1
     0   2.27   6.06
     1   6.12   2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1
     0   1.62   1.27
     1   1.17   1.55

   CPU      0      1
     0   2.27   1.91
     1   1.90   2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
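In this output you can see the effect of NVLink between the two cards: with P2P disabled, traffic goes through the host over PCIe at about 6 GB/s unidirectional, while with P2P enabled it travels over the NVLink bridge at about 52 GB/s unidirectional (about 101 GB/s bidirectional), and GPU-to-GPU latency drops from roughly 18-28 us to about 1.2 us.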
On a configuration with multiple GPUs, the output may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0	     1     1     1     1     1     1     1     1
     1	     1     1     1     1     1     1     1     1
     2	     1     1     1     1     1     1     1     1
     3	     1     1     1     1     1     1     1     1
     4	     1     1     1     1     1     1     1     1
     5	     1     1     1     1     1     1     1     1
     6	     1     1     1     1     1     1     1     1
     7	     1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1629.83   38.43   38.39   37.66   38.51   38.19   38.09   37.92
     1   38.22 1637.04   35.52   35.59   38.15   38.38   38.08   37.55
     2   37.76   35.62 1635.32   35.45   38.59   38.21   38.77   37.94
     3   37.88   35.50   35.60 1639.45   38.49   37.43   38.72   38.49
     4   36.87   37.03   37.00   36.90 1635.86   34.48   38.06   37.22
     5   37.27   37.06   36.92   37.06   34.51 1636.18   37.80   37.50
     6   37.05   36.95   37.45   37.15   37.51   37.96 1630.79   34.94
     7   36.98   36.91   36.95   36.87   37.83   38.02   34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1635.22   34.42   33.84  256.54   27.74   28.68   28.00   28.41
     1   34.66 1636.93  256.16   17.97   71.58   71.64   71.65   71.61
     2   34.78  256.81 1655.79   30.29   70.34   70.42   70.37   70.33
     3  256.65   30.65   70.67 1654.53   70.66   70.69   70.70   70.73
     4   28.26   30.80   69.99   70.04 1630.36  256.45   69.97   70.02
     5   28.10   31.08   71.60   71.59  256.47 1654.31   71.62   71.54
     6   28.37   30.96   70.99   70.93   70.91   70.96 1632.12  257.11
     7   27.66   30.87   70.30   70.40   70.30   70.39  256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1673.16   51.88   51.95   51.76   51.61   51.44   52.07   51.30
     1   52.04 1676.28   39.06   39.21   51.62   51.62   51.98   51.36
     2   52.11   39.27 1674.62   39.16   51.42   51.21   51.72   51.71
     3   51.74   39.70   39.22 1672.77   51.50   51.27   51.70   51.24
     4   52.14   52.41   51.38   52.14 1671.54   38.81   46.76   45.72
     5   51.82   52.65   52.30   51.67   38.57 1676.33   46.90   45.96
     6   52.92   52.66   53.02   52.68   46.23   46.31 1672.74   38.91
     7   52.61   52.74   52.79   52.64   45.90   46.35   39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1670.31   52.41  140.69  508.68  139.85  141.88  141.71  140.55
     1  141.69 1673.30  509.23  141.22  139.91  143.28  141.71  140.61
     2  140.64  508.90 1669.67  140.68  139.93  140.61  140.67  140.50
     3  509.14  141.36  140.61 1682.65  139.93  141.45  141.45  140.67
     4  140.01  140.03  140.07  139.94 1670.68  508.37  140.01  139.90
     5  141.92  143.17  140.50  141.19  508.92 1670.73  141.72  140.52
     6  141.72  141.72  140.60  141.31  139.66  141.85 1671.51  510.03
     7  140.62  140.71  140.66  140.63  140.02  140.72  509.77 1668.28
P2P=Disabled Latency Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.35   17.23   17.13   13.38   12.86   21.15   21.39   21.12
     1   17.54    2.32   12.95   13.78   21.05   21.23   21.31   21.37
     2   16.85   14.83    2.35   16.07   12.71   12.80   21.23   12.79
     3   14.98   16.06   14.64    2.41   13.35   12.81   13.60   21.36
     4   21.31   21.31   20.49   21.32    2.62   12.33   12.66   12.98
     5   20.36   21.22   20.17   12.79   16.74    2.58   12.41   12.93
     6   17.51   12.84   12.79   12.70   17.63   18.78    2.36   13.69
     7   21.23   12.71   19.41   21.09   14.69   13.79   15.52    2.59

   CPU       0       1       2       3       4       5       6       7
     0    1.73    4.99    4.88    4.85    5.17    5.18    5.18    5.33
     1    5.04    1.71    4.74    4.82    5.04    5.14    5.10    5.19
     2    4.86    4.75    1.66    4.78    5.08    5.09    5.11    5.17
     3    4.80    4.72    4.73    1.63    5.09    5.11    5.06    5.10
     4    5.07    5.00    5.03    4.96    1.77    5.33    5.34    5.38
     5    5.12    4.94    5.00    4.96    5.31    1.77    5.38    5.41
     6    5.09    4.97    5.09    5.01    5.35    5.39    1.80    5.42
     7    5.18    5.09    5.02    5.00    5.39    5.40    5.40    1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.33    2.15    2.11    2.76    2.07    2.11    2.07    2.12
     1    2.07    2.30    2.77    2.07    2.12    2.06    2.06    2.10
     2    2.09    2.75    2.34    2.12    2.09    2.08    2.08    2.12
     3    2.78    2.10    2.13    2.40    2.13    2.14    2.14    2.13
     4    2.18    2.23    2.23    2.17    2.59    2.82    2.15    2.16
     5    2.15    2.17    2.15    2.20    2.82    2.56    2.17    2.16
     6    2.13    2.18    2.21    2.17    2.15    2.17    2.36    2.85
     7    2.19    2.21    2.19    2.22    2.19    2.19    2.86    2.61

   CPU       0       1       2       3       4       5       6       7
     0    1.78    1.32    1.29    1.40    1.33    1.34    1.34    1.33
     1    1.32    1.69    1.34    1.35    1.35    1.34    1.40    1.33
     2    1.38    1.37    1.73    1.36    1.36    1.35    1.35    1.34
     3    1.34    1.42    1.35    1.66    1.34    1.34    1.35    1.33
     4    1.53    1.41    1.40    1.40    1.77    1.43    1.48    1.47
     5    1.46    1.43    1.43    1.42    1.47    1.84    1.51    1.56
     6    1.53    1.45    1.45    1.45    1.45    1.44    1.85    1.47
     7    1.54    1.47    1.47    1.47    1.45    1.44    1.50    1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
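Note that in this example only the pairs 0-3, 1-2, 4-5, and 6-7 reach about 256 GB/s unidirectional with P2P enabled, meaning these pairs are connected by NVLink bridges, while the remaining combinations communicate over PCIe. To test a specific GPU pair in isolation, you can restrict the devices visible to the application with the CUDA_VISIBLE_DEVICES environment variable:
CUDA_VISIBLE_DEVICES=0,3 ./p2pBandwidthLatencyTest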
Published: 06.05.2024