Check NVLink in Linux
Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide Install Nvidia driver in Linux. You also need to install the CUDA toolkit to compile the application samples. In this short guide, we've collected a few useful commands that you can use.
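To verify that both prerequisites are in place before proceeding, you can query the driver and the toolkit; both commands should succeed and print version information:
nvidia-smi
nvcc --version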
Basic commands
Check the physical topology of your system. The following command shows all GPUs and how they are interconnected:
nvidia-smi topo -m
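In the resulting matrix, NV# means that a pair of GPUs is connected by a bonded set of # NVLink links, while values such as PIX, PXB, NODE, and SYS indicate different kinds of PCIe paths; the legend printed below the matrix explains each abbreviation.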
To display the state of the links, execute the following command:
nvidia-smi nvlink -s
The command shows the status and speed of each link. To display the capabilities of the links of a specific GPU, add the -i option with the GPU index:
nvidia-smi nvlink -i 0 -c
Without the -i option, the link capabilities of all GPUs are displayed:
nvidia-smi nvlink -c
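To script these checks on a multi-GPU machine, a minimal sketch could iterate over the GPU indices reported by the driver:
# Show the NVLink status of every GPU in the system
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
    echo "=== GPU ${i} ==="
    nvidia-smi nvlink -s -i "${i}"
done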
Install CUDA-samples
A good way to test throughput is to use the application samples from Nvidia. The source code of these samples is published on GitHub and is available to everyone. Clone the repository to the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change directory to the downloaded repository:
cd cuda-samples
Check out the tag that matches the installed CUDA version. For example, if you have CUDA 12.2:
git checkout tags/v12.2
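If you are not sure which tag to use, you can check the installed toolkit version and list the tags available in the repository first:
nvcc --version | grep release
git tag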
Install the prerequisites that are used during compilation:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now, you can compile any sample. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let’s test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the app:
make
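The sample Makefiles accept an SMS variable that restricts the build to specific GPU architectures, which can speed up compilation or help if the default build fails; for an RTX A6000 (compute capability 8.6), for example, this would be:
make SMS="86"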
Run tests
Start testing by executing the app using its name:
./bandwidthTest
The output might look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
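Quick Mode measures a single 32,000,000-byte transfer with pinned memory. The sample also accepts options to test all devices or sweep a range of transfer sizes, for example:
./bandwidthTest --device=all --mode=range --start=1048576 --end=67108864 --increment=1048576
Run ./bandwidthTest --help to see the full list of options.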
Alternatively, you can compile and run the p2pBandwidthLatencyTest:
cd 5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This app shows detailed information about your GPUs' bandwidth and latency in P2P mode. Sample output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 590.51   6.04
     1   6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0      1
     0 589.40  52.75
     1  52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 593.88   8.55
     1   8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 595.69 101.68
     1 101.97 595.69
P2P=Disabled Latency Matrix (us)
   GPU      0      1
     0   1.61  28.66
     1  18.49   1.53

   CPU      0      1
     0   2.27   6.06
     1   6.12   2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1
     0   1.62   1.27
     1   1.17   1.55

   CPU      0      1
     0   2.27   1.91
     1   1.90   2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
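In this output you can see the effect of NVLink between the two cards: with P2P disabled, traffic goes through the host over PCIe at about 6 GB/s unidirectional, while with P2P enabled it travels over the NVLink bridge at about 52 GB/s unidirectional (about 101 GB/s bidirectional), and GPU-to-GPU latency drops from roughly 18-28 us to about 1.2 us.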
On a configuration with multiple GPUs, the output may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0	     1     1     1     1     1     1     1     1
     1	     1     1     1     1     1     1     1     1
     2	     1     1     1     1     1     1     1     1
     3	     1     1     1     1     1     1     1     1
     4	     1     1     1     1     1     1     1     1
     5	     1     1     1     1     1     1     1     1
     6	     1     1     1     1     1     1     1     1
     7	     1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1629.83   38.43   38.39   37.66   38.51   38.19   38.09   37.92
     1   38.22 1637.04   35.52   35.59   38.15   38.38   38.08   37.55
     2   37.76   35.62 1635.32   35.45   38.59   38.21   38.77   37.94
     3   37.88   35.50   35.60 1639.45   38.49   37.43   38.72   38.49
     4   36.87   37.03   37.00   36.90 1635.86   34.48   38.06   37.22
     5   37.27   37.06   36.92   37.06   34.51 1636.18   37.80   37.50
     6   37.05   36.95   37.45   37.15   37.51   37.96 1630.79   34.94
     7   36.98   36.91   36.95   36.87   37.83   38.02   34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1635.22   34.42   33.84  256.54   27.74   28.68   28.00   28.41
     1   34.66 1636.93  256.16   17.97   71.58   71.64   71.65   71.61
     2   34.78  256.81 1655.79   30.29   70.34   70.42   70.37   70.33
     3  256.65   30.65   70.67 1654.53   70.66   70.69   70.70   70.73
     4   28.26   30.80   69.99   70.04 1630.36  256.45   69.97   70.02
     5   28.10   31.08   71.60   71.59  256.47 1654.31   71.62   71.54
     6   28.37   30.96   70.99   70.93   70.91   70.96 1632.12  257.11
     7   27.66   30.87   70.30   70.40   70.30   70.39  256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1673.16   51.88   51.95   51.76   51.61   51.44   52.07   51.30
     1   52.04 1676.28   39.06   39.21   51.62   51.62   51.98   51.36
     2   52.11   39.27 1674.62   39.16   51.42   51.21   51.72   51.71
     3   51.74   39.70   39.22 1672.77   51.50   51.27   51.70   51.24
     4   52.14   52.41   51.38   52.14 1671.54   38.81   46.76   45.72
     5   51.82   52.65   52.30   51.67   38.57 1676.33   46.90   45.96
     6   52.92   52.66   53.02   52.68   46.23   46.31 1672.74   38.91
     7   52.61   52.74   52.79   52.64   45.90   46.35   39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1670.31   52.41  140.69  508.68  139.85  141.88  141.71  140.55
     1  141.69 1673.30  509.23  141.22  139.91  143.28  141.71  140.61
     2  140.64  508.90 1669.67  140.68  139.93  140.61  140.67  140.50
     3  509.14  141.36  140.61 1682.65  139.93  141.45  141.45  140.67
     4  140.01  140.03  140.07  139.94 1670.68  508.37  140.01  139.90
     5  141.92  143.17  140.50  141.19  508.92 1670.73  141.72  140.52
     6  141.72  141.72  140.60  141.31  139.66  141.85 1671.51  510.03
     7  140.62  140.71  140.66  140.63  140.02  140.72  509.77 1668.28
P2P=Disabled Latency Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.35   17.23   17.13   13.38   12.86   21.15   21.39   21.12
     1   17.54    2.32   12.95   13.78   21.05   21.23   21.31   21.37
     2   16.85   14.83    2.35   16.07   12.71   12.80   21.23   12.79
     3   14.98   16.06   14.64    2.41   13.35   12.81   13.60   21.36
     4   21.31   21.31   20.49   21.32    2.62   12.33   12.66   12.98
     5   20.36   21.22   20.17   12.79   16.74    2.58   12.41   12.93
     6   17.51   12.84   12.79   12.70   17.63   18.78    2.36   13.69
     7   21.23   12.71   19.41   21.09   14.69   13.79   15.52    2.59

   CPU       0       1       2       3       4       5       6       7
     0    1.73    4.99    4.88    4.85    5.17    5.18    5.18    5.33
     1    5.04    1.71    4.74    4.82    5.04    5.14    5.10    5.19
     2    4.86    4.75    1.66    4.78    5.08    5.09    5.11    5.17
     3    4.80    4.72    4.73    1.63    5.09    5.11    5.06    5.10
     4    5.07    5.00    5.03    4.96    1.77    5.33    5.34    5.38
     5    5.12    4.94    5.00    4.96    5.31    1.77    5.38    5.41
     6    5.09    4.97    5.09    5.01    5.35    5.39    1.80    5.42
     7    5.18    5.09    5.02    5.00    5.39    5.40    5.40    1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.33    2.15    2.11    2.76    2.07    2.11    2.07    2.12
     1    2.07    2.30    2.77    2.07    2.12    2.06    2.06    2.10
     2    2.09    2.75    2.34    2.12    2.09    2.08    2.08    2.12
     3    2.78    2.10    2.13    2.40    2.13    2.14    2.14    2.13
     4    2.18    2.23    2.23    2.17    2.59    2.82    2.15    2.16
     5    2.15    2.17    2.15    2.20    2.82    2.56    2.17    2.16
     6    2.13    2.18    2.21    2.17    2.15    2.17    2.36    2.85
     7    2.19    2.21    2.19    2.22    2.19    2.19    2.86    2.61

   CPU       0       1       2       3       4       5       6       7
     0    1.78    1.32    1.29    1.40    1.33    1.34    1.34    1.33
     1    1.32    1.69    1.34    1.35    1.35    1.34    1.40    1.33
     2    1.38    1.37    1.73    1.36    1.36    1.35    1.35    1.34
     3    1.34    1.42    1.35    1.66    1.34    1.34    1.35    1.33
     4    1.53    1.41    1.40    1.40    1.77    1.43    1.48    1.47
     5    1.46    1.43    1.43    1.42    1.47    1.84    1.51    1.56
     6    1.53    1.45    1.45    1.45    1.45    1.44    1.85    1.47
     7    1.54    1.47    1.47    1.47    1.45    1.44    1.50    1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
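Note that in this example only the pairs 0-3, 1-2, 4-5, and 6-7 reach about 256 GB/s unidirectional with P2P enabled, meaning these pairs are connected by NVLink bridges, while the remaining combinations communicate over PCIe. To test a specific GPU pair in isolation, you can restrict the devices visible to the application with the CUDA_VISIBLE_DEVICES environment variable:
CUDA_VISIBLE_DEVICES=0,3 ./p2pBandwidthLatencyTest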
Published: 06.05.2024