Check NVLink in Linux

Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide, Install Nvidia driver in Linux. You also need to install the CUDA toolkit to compile the application samples. In this short guide, we've collected a few useful commands.
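After installing both, you can quickly confirm that the CUDA toolkit is available and check its version (this also tells you which cuda-samples tag to check out later):

nvcc --version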

Basic commands

Check the physical topology of your system. This command shows all GPUs and their interconnects:

nvidia-smi topo -m
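In the resulting matrix, entries such as NV1, NV2, or NV4 mean the two GPUs are connected by that many bonded NVLink links, while PIX, PXB, PHB, NODE, and SYS denote different PCIe/system paths, and X marks the device itself. An illustrative matrix for a hypothetical pair of NVLink-bridged GPUs (your devices and labels will differ):

        GPU0    GPU1    CPU Affinity    NUMA Affinity
  GPU0    X     NV4     0-31            0
  GPU1   NV4     X      0-31            0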

If you want to display the state of the links, execute the following command:

nvidia-smi nvlink -s
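On a host where NVLink is active, the output might look similar to this illustrative example (the number of links and the per-link speed depend on the GPU and bridge generation):

  GPU 0: NVIDIA RTX A6000 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
         Link 0: 14.062 GB/s
         Link 1: 14.062 GB/s
         Link 2: 14.062 GB/s
         Link 3: 14.062 GB/s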

The command displays the state and speed of each link. You can also query a single GPU: each GPU has an ID, which can be specified with the -i option. For example, let's display the link capabilities of the first GPU, with ID 0:

nvidia-smi nvlink -i 0 -c

Without the -i option, the link capabilities of all GPUs will be displayed:

nvidia-smi nvlink -c
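nvidia-smi can also report per-link traffic and error counters; the exact flags differ between driver versions, so check nvidia-smi nvlink --help on your system. On many recent drivers, for example, the cumulative data throughput counters are read with:

nvidia-smi nvlink -gt d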

Install CUDA-samples

A good way to test throughput is to use the official Nvidia application samples. Their source code is posted on GitHub and available to everyone. Clone the repository to the server:

git clone https://github.com/NVIDIA/cuda-samples.git

Change directory to the downloaded repository:

cd cuda-samples

Check out the tag that matches the installed CUDA version. For example, if you have CUDA 12.2:

git checkout tags/v12.2
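If the tag for your CUDA version is missing, or you are unsure which tags exist, list them first:

git tag -l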

Install the prerequisites needed to compile the samples:

sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
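On newer Ubuntu releases some of these packages may have been renamed or dropped (libgl1-mesa-glx, for example, is no longer shipped everywhere). If apt reports a missing package, installing the closest current equivalent is an assumed workaround, e.g.:

sudo apt -y install libgl1 libglu1-mesa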

Now, you can compile any sample. Go to the Samples directory:

cd Samples

Take a quick look at the contents:

ls -la
total 40
  drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
  drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
  drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
  drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
  drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
  drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
  drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
  drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
  drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
  drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM

Let’s test the GPU bandwidth. Change the directory:

cd 1_Utilities/bandwidthTest

Compile the app:

make
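Compilation can take a while. You can run it in parallel and, in the Makefile-based releases of the repository, restrict the build to your GPU's compute capability with the SMS variable (86 below is just an example for an Ampere card; check the Makefile on your branch):

make -j"$(nproc)" SMS="86"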

Run tests

Start the test by running the compiled binary:

./bandwidthTest

The output might look like this:

[CUDA Bandwidth Test] - Starting...
  Running on...
   Device 0: NVIDIA RTX A6000
   Quick Mode
   Host to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)        Bandwidth(GB/s)
     32000000                     6.0
   Device to Host Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)        Bandwidth(GB/s)
     32000000                     6.6
   Device to Device Bandwidth, 1 Device(s)
   PINNED Memory Transfers
     Transfer Size (Bytes)        Bandwidth(GB/s)
     32000000                     569.2
  Result = PASS
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
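bandwidthTest accepts several options (run ./bandwidthTest --help for the full list). For example, you can test all visible GPUs with pinned memory transfers:

./bandwidthTest --device=all --memory=pinned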

Alternatively, you can compile and start the p2pBandwidthLatencyTest:

cd 5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest

This app shows detailed information about bandwidth and latency between your GPUs in P2P mode. Sample output:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
  Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
  Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
  Device=0 CAN Access Peer Device=1
  Device=1 CAN Access Peer Device=0
  ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
  So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
  P2P Connectivity Matrix
       D\D     0     1
       0       1     1
       1       1     1
  Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D     0      1 
       0 590.51   6.04 
       1   6.02 590.51 
  Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
     D\D     0      1 
       0 589.40  52.75 
       1  52.88 592.53 
  Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D     0      1 
       0 593.88   8.55 
       1   8.55 595.32 
  Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
     D\D     0      1 
       0 595.69 101.68 
       1 101.97 595.69 
  P2P=Disabled Latency Matrix (us)
     GPU     0      1 
       0   1.61  28.66 
       1  18.49   1.53 
     CPU     0      1 
       0   2.27   6.06 
       1   6.12   2.23 
  P2P=Enabled Latency (P2P Writes) Matrix (us)
     GPU     0      1 
       0   1.62   1.27 
       1   1.17   1.55 
     CPU     0      1 
       0   2.27   1.91 
       1   1.90   2.34 
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

On a configuration with more GPUs, the output may look like this (here, eight H100 PCIe cards):

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
  Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
  Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
  Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
  Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
  Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
  Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
  Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
  Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
  Device=0 CAN Access Peer Device=1
  Device=0 CAN Access Peer Device=2
  Device=0 CAN Access Peer Device=3
  Device=0 CAN Access Peer Device=4
  Device=0 CAN Access Peer Device=5
  Device=0 CAN Access Peer Device=6
  Device=0 CAN Access Peer Device=7
  Device=1 CAN Access Peer Device=0
  Device=1 CAN Access Peer Device=2
  Device=1 CAN Access Peer Device=3
  Device=1 CAN Access Peer Device=4
  Device=1 CAN Access Peer Device=5
  Device=1 CAN Access Peer Device=6
  Device=1 CAN Access Peer Device=7
  Device=2 CAN Access Peer Device=0
  Device=2 CAN Access Peer Device=1
  Device=2 CAN Access Peer Device=3
  Device=2 CAN Access Peer Device=4
  Device=2 CAN Access Peer Device=5
  Device=2 CAN Access Peer Device=6
  Device=2 CAN Access Peer Device=7
  Device=3 CAN Access Peer Device=0
  Device=3 CAN Access Peer Device=1
  Device=3 CAN Access Peer Device=2
  Device=3 CAN Access Peer Device=4
  Device=3 CAN Access Peer Device=5
  Device=3 CAN Access Peer Device=6
  Device=3 CAN Access Peer Device=7
  Device=4 CAN Access Peer Device=0
  Device=4 CAN Access Peer Device=1
  Device=4 CAN Access Peer Device=2
  Device=4 CAN Access Peer Device=3
  Device=4 CAN Access Peer Device=5
  Device=4 CAN Access Peer Device=6
  Device=4 CAN Access Peer Device=7
  Device=5 CAN Access Peer Device=0
  Device=5 CAN Access Peer Device=1
  Device=5 CAN Access Peer Device=2
  Device=5 CAN Access Peer Device=3
  Device=5 CAN Access Peer Device=4
  Device=5 CAN Access Peer Device=6
  Device=5 CAN Access Peer Device=7
  Device=6 CAN Access Peer Device=0
  Device=6 CAN Access Peer Device=1
  Device=6 CAN Access Peer Device=2
  Device=6 CAN Access Peer Device=3
  Device=6 CAN Access Peer Device=4
  Device=6 CAN Access Peer Device=5
  Device=6 CAN Access Peer Device=7
  Device=7 CAN Access Peer Device=0
  Device=7 CAN Access Peer Device=1
  Device=7 CAN Access Peer Device=2
  Device=7 CAN Access Peer Device=3
  Device=7 CAN Access Peer Device=4
  Device=7 CAN Access Peer Device=5
  Device=7 CAN Access Peer Device=6
  ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
  So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
  P2P Connectivity Matrix
       D\D     0     1     2     3     4     5     6     7
       0       1     1     1     1     1     1     1     1
       1       1     1     1     1     1     1     1     1
       2       1     1     1     1     1     1     1     1
       3       1     1     1     1     1     1     1     1
       4       1     1     1     1     1     1     1     1
       5       1     1     1     1     1     1     1     1
       6       1     1     1     1     1     1     1     1
       7       1     1     1     1     1     1     1     1
  Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D     0      1      2      3      4      5      6      7
       0 1629.83  38.43  38.39  37.66  38.51  38.19  38.09  37.92
       1  38.22 1637.04  35.52  35.59  38.15  38.38  38.08  37.55
       2  37.76  35.62 1635.32  35.45  38.59  38.21  38.77  37.94
       3  37.88  35.50  35.60 1639.45  38.49  37.43  38.72  38.49
       4  36.87  37.03  37.00  36.90 1635.86  34.48  38.06  37.22
       5  37.27  37.06  36.92  37.06  34.51 1636.18  37.80  37.50
       6  37.05  36.95  37.45  37.15  37.51  37.96 1630.79  34.94
       7  36.98  36.91  36.95  36.87  37.83  38.02  34.73 1633.35
  Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
     D\D     0      1      2      3      4      5      6      7
       0 1635.22  34.42  33.84 256.54  27.74  28.68  28.00  28.41
       1  34.66 1636.93 256.16  17.97  71.58  71.64  71.65  71.61
       2  34.78 256.81 1655.79  30.29  70.34  70.42  70.37  70.33
       3 256.65  30.65  70.67 1654.53  70.66  70.69  70.70  70.73
       4  28.26  30.80  69.99  70.04 1630.36 256.45  69.97  70.02
       5  28.10  31.08  71.60  71.59 256.47 1654.31  71.62  71.54
       6  28.37  30.96  70.99  70.93  70.91  70.96 1632.12 257.11
       7  27.66  30.87  70.30  70.40  70.30  70.39 256.72 1649.57
  Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
     D\D     0      1      2      3      4      5      6      7
       0 1673.16  51.88  51.95  51.76  51.61  51.44  52.07  51.30
       1  52.04 1676.28  39.06  39.21  51.62  51.62  51.98  51.36
       2  52.11  39.27 1674.62  39.16  51.42  51.21  51.72  51.71
       3  51.74  39.70  39.22 1672.77  51.50  51.27  51.70  51.24
       4  52.14  52.41  51.38  52.14 1671.54  38.81  46.76  45.72
       5  51.82  52.65  52.30  51.67  38.57 1676.33  46.90  45.96
       6  52.92  52.66  53.02  52.68  46.23  46.31 1672.74  38.91
       7  52.61  52.74  52.79  52.64  45.90  46.35  39.07 1673.16
  Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
     D\D     0      1      2      3      4      5      6      7
       0 1670.31  52.41 140.69 508.68 139.85 141.88 141.71 140.55
       1 141.69 1673.30 509.23 141.22 139.91 143.28 141.71 140.61
       2 140.64 508.90 1669.67 140.68 139.93 140.61 140.67 140.50
       3 509.14 141.36 140.61 1682.65 139.93 141.45 141.45 140.67
       4 140.01 140.03 140.07 139.94 1670.68 508.37 140.01 139.90
       5 141.92 143.17 140.50 141.19 508.92 1670.73 141.72 140.52
       6 141.72 141.72 140.60 141.31 139.66 141.85 1671.51 510.03
       7 140.62 140.71 140.66 140.63 140.02 140.72 509.77 1668.28
  P2P=Disabled Latency Matrix (us)
     GPU     0      1      2      3      4      5      6      7
       0   2.35  17.23  17.13  13.38  12.86  21.15  21.39  21.12
       1  17.54   2.32  12.95  13.78  21.05  21.23  21.31  21.37
       2  16.85  14.83   2.35  16.07  12.71  12.80  21.23  12.79
       3  14.98  16.06  14.64   2.41  13.35  12.81  13.60  21.36
       4  21.31  21.31  20.49  21.32   2.62  12.33  12.66  12.98
       5  20.36  21.22  20.17  12.79  16.74   2.58  12.41  12.93
       6  17.51  12.84  12.79  12.70  17.63  18.78   2.36  13.69
       7  21.23  12.71  19.41  21.09  14.69  13.79  15.52   2.59
  CPU     0      1      2      3      4      5      6      7
       0   1.73   4.99   4.88   4.85   5.17   5.18   5.18   5.33
       1   5.04   1.71   4.74   4.82   5.04   5.14   5.10   5.19
       2   4.86   4.75   1.66   4.78   5.08   5.09   5.11   5.17
       3   4.80   4.72   4.73   1.63   5.09   5.11   5.06   5.10
       4   5.07   5.00   5.03   4.96   1.77   5.33   5.34   5.38
       5   5.12   4.94   5.00   4.96   5.31   1.77   5.38   5.41
       6   5.09   4.97   5.09   5.01   5.35   5.39   1.80   5.42
       7   5.18   5.09   5.02   5.00   5.39   5.40   5.40   1.76
  P2P=Enabled Latency (P2P Writes) Matrix (us)
     GPU     0      1      2      3      4      5      6      7
       0   2.33   2.15   2.11   2.76   2.07   2.11   2.07   2.12
       1   2.07   2.30   2.77   2.07   2.12   2.06   2.06   2.10
       2   2.09   2.75   2.34   2.12   2.09   2.08   2.08   2.12
       3   2.78   2.10   2.13   2.40   2.13   2.14   2.14   2.13
       4   2.18   2.23   2.23   2.17   2.59   2.82   2.15   2.16
       5   2.15   2.17   2.15   2.20   2.82   2.56   2.17   2.16
       6   2.13   2.18   2.21   2.17   2.15   2.17   2.36   2.85
       7   2.19   2.21   2.19   2.22   2.19   2.19   2.86   2.61
     CPU     0      1      2      3      4      5      6      7
       0   1.78   1.32   1.29   1.40   1.33   1.34   1.34   1.33
       1   1.32   1.69   1.34   1.35   1.35   1.34   1.40   1.33
       2   1.38   1.37   1.73   1.36   1.36   1.35   1.35   1.34
       3   1.34   1.42   1.35   1.66   1.34   1.34   1.35   1.33
       4   1.53   1.41   1.40   1.40   1.77   1.43   1.48   1.47
       5   1.46   1.43   1.43   1.42   1.47   1.84   1.51   1.56
       6   1.53   1.45   1.45   1.45   1.45   1.44   1.85   1.47
       7   1.54   1.47   1.47   1.47   1.45   1.44   1.50   1.84
  NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


Published: 06.05.2024


