Verify NVLink on Linux
Before verifying NVLink support in the operating system, install the Nvidia drivers by following our guide Install the Nvidia driver in Linux. You will also need the CUDA Toolkit installed to compile the sample applications. This short guide collects several useful commands.
Basic commands
Check your system's physical topology. This command displays all GPUs and how they are interconnected:
nvidia-smi topo -m
To display the state of the links, run the following command:
nvidia-smi nvlink -s
This command shows the state and speed of each link. To query the link capabilities of a single GPU, specify its index with the -i option:
nvidia-smi nvlink -i 0 -c
Without the -i option, information about the connections of all GPUs is displayed:
nvidia-smi nvlink -c
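On multi-GPU hosts it can be convenient to print the link status of every GPU in one pass. A minimal sketch, assuming nvidia-smi is on the PATH (nvlink_status_all is a helper name chosen for this sketch, not an NVIDIA tool):

```shell
# List per-link NVLink status for every GPU index reported by nvidia-smi.
# nvlink_status_all is a helper name used only in this sketch.
nvlink_status_all() {
    for idx in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
        echo "=== GPU ${idx} ==="
        nvidia-smi nvlink -s -i "${idx}"
    done
}
nvlink_status_all
```

The same loop works with the -c subcommand if you want per-GPU capabilities instead of status.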
Install the CUDA samples
A good way to test throughput is to use Nvidia's sample applications. Their source code is published on GitHub and available to everyone. Start by cloning the repository to the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change into the cloned repository:
cd cuda-samples
Check out the branch whose tag matches the installed CUDA version. For example, with CUDA 12.2:
git checkout tags/v12.2
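If you are not sure which tag to pick, the toolkit itself can tell you. A small sketch that extracts the major.minor release from nvcc's version banner (cuda_release is a helper name introduced here):

```shell
# Extract the "major.minor" CUDA release from nvcc's version banner,
# which contains a line like: Cuda compilation tools, release 12.2, V12.2.140
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
# With the toolkit installed you would run:
#   git checkout "tags/v$(nvcc --version | cuda_release)"
printf 'Cuda compilation tools, release 12.2, V12.2.140\n' | cuda_release   # prints 12.2
```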
Install a few prerequisites used during compilation:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
You can now compile any sample. Go to the Samples directory:
cd Samples
A quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let's test the GPU bandwidth. Change directory:
cd 1_Utilities/bandwidthTest
Compile the application:
make
Run the tests
Start the test by running the application by name:
./bandwidthTest
The output might look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
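Quick Mode above measures a single 32 MB transfer. The sample also accepts options for sweeping a range of transfer sizes; here is a sketch that assembles such a command line (flag names taken from the sample's --help output; the byte values are illustrative, and bw_range_cmd is a helper name introduced here):

```shell
# Build a bandwidthTest invocation that sweeps transfer sizes in range mode.
# bw_range_cmd is a helper name used only in this sketch.
bw_range_cmd() {
    echo "./bandwidthTest --mode=range --start=$1 --end=$2 --increment=$3"
}
# Sweep from 1 MB to 64 MB in 8 MB steps (run inside bandwidthTest's directory):
bw_range_cmd 1000000 64000000 8000000
```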
Alternatively, you can compile and run the p2pBandwidthLatencyTest:
cd 5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This application shows detailed information about GPU bandwidth in P2P mode. Sample output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0       1
     0  590.51    6.04
     1    6.02  590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0       1
     0  589.40   52.75
     1   52.88  592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0       1
     0  593.88    8.55
     1    8.55  595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0       1
     0  595.69  101.68
     1  101.97  595.69
P2P=Disabled Latency Matrix (us)
   GPU      0       1
     0    1.61   28.66
     1   18.49    1.53
   CPU      0       1
     0    2.27    6.06
     1    6.12    2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0       1
     0    1.62    1.27
     1    1.17    1.55
   CPU      0       1
     0    2.27    1.91
     1    1.90    2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
On a configuration with more GPUs, it may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4        5        6        7
     0  1629.83    38.43    38.39    37.66    38.51    38.19    38.09    37.92
     1    38.22  1637.04    35.52    35.59    38.15    38.38    38.08    37.55
     2    37.76    35.62  1635.32    35.45    38.59    38.21    38.77    37.94
     3    37.88    35.50    35.60  1639.45    38.49    37.43    38.72    38.49
     4    36.87    37.03    37.00    36.90  1635.86    34.48    38.06    37.22
     5    37.27    37.06    36.92    37.06    34.51  1636.18    37.80    37.50
     6    37.05    36.95    37.45    37.15    37.51    37.96  1630.79    34.94
     7    36.98    36.91    36.95    36.87    37.83    38.02    34.73  1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D        0        1        2        3        4        5        6        7
     0  1635.22    34.42    33.84   256.54    27.74    28.68    28.00    28.41
     1    34.66  1636.93   256.16    17.97    71.58    71.64    71.65    71.61
     2    34.78   256.81  1655.79    30.29    70.34    70.42    70.37    70.33
     3   256.65    30.65    70.67  1654.53    70.66    70.69    70.70    70.73
     4    28.26    30.80    69.99    70.04  1630.36   256.45    69.97    70.02
     5    28.10    31.08    71.60    71.59   256.47  1654.31    71.62    71.54
     6    28.37    30.96    70.99    70.93    70.91    70.96  1632.12   257.11
     7    27.66    30.87    70.30    70.40    70.30    70.39   256.72  1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4        5        6        7
     0  1673.16    51.88    51.95    51.76    51.61    51.44    52.07    51.30
     1    52.04  1676.28    39.06    39.21    51.62    51.62    51.98    51.36
     2    52.11    39.27  1674.62    39.16    51.42    51.21    51.72    51.71
     3    51.74    39.70    39.22  1672.77    51.50    51.27    51.70    51.24
     4    52.14    52.41    51.38    52.14  1671.54    38.81    46.76    45.72
     5    51.82    52.65    52.30    51.67    38.57  1676.33    46.90    45.96
     6    52.92    52.66    53.02    52.68    46.23    46.31  1672.74    38.91
     7    52.61    52.74    52.79    52.64    45.90    46.35    39.07  1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D        0        1        2        3        4        5        6        7
     0  1670.31    52.41   140.69   508.68   139.85   141.88   141.71   140.55
     1   141.69  1673.30   509.23   141.22   139.91   143.28   141.71   140.61
     2   140.64   508.90  1669.67   140.68   139.93   140.61   140.67   140.50
     3   509.14   141.36   140.61  1682.65   139.93   141.45   141.45   140.67
     4   140.01   140.03   140.07   139.94  1670.68   508.37   140.01   139.90
     5   141.92   143.17   140.50   141.19   508.92  1670.73   141.72   140.52
     6   141.72   141.72   140.60   141.31   139.66   141.85  1671.51   510.03
     7   140.62   140.71   140.66   140.63   140.02   140.72   509.77  1668.28
P2P=Disabled Latency Matrix (us)
   GPU      0      1      2      3      4      5      6      7
     0   2.35  17.23  17.13  13.38  12.86  21.15  21.39  21.12
     1  17.54   2.32  12.95  13.78  21.05  21.23  21.31  21.37
     2  16.85  14.83   2.35  16.07  12.71  12.80  21.23  12.79
     3  14.98  16.06  14.64   2.41  13.35  12.81  13.60  21.36
     4  21.31  21.31  20.49  21.32   2.62  12.33  12.66  12.98
     5  20.36  21.22  20.17  12.79  16.74   2.58  12.41  12.93
     6  17.51  12.84  12.79  12.70  17.63  18.78   2.36  13.69
     7  21.23  12.71  19.41  21.09  14.69  13.79  15.52   2.59
   CPU      0      1      2      3      4      5      6      7
     0   1.73   4.99   4.88   4.85   5.17   5.18   5.18   5.33
     1   5.04   1.71   4.74   4.82   5.04   5.14   5.10   5.19
     2   4.86   4.75   1.66   4.78   5.08   5.09   5.11   5.17
     3   4.80   4.72   4.73   1.63   5.09   5.11   5.06   5.10
     4   5.07   5.00   5.03   4.96   1.77   5.33   5.34   5.38
     5   5.12   4.94   5.00   4.96   5.31   1.77   5.38   5.41
     6   5.09   4.97   5.09   5.01   5.35   5.39   1.80   5.42
     7   5.18   5.09   5.02   5.00   5.39   5.40   5.40   1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1      2      3      4      5      6      7
     0   2.33   2.15   2.11   2.76   2.07   2.11   2.07   2.12
     1   2.07   2.30   2.77   2.07   2.12   2.06   2.06   2.10
     2   2.09   2.75   2.34   2.12   2.09   2.08   2.08   2.12
     3   2.78   2.10   2.13   2.40   2.13   2.14   2.14   2.13
     4   2.18   2.23   2.23   2.17   2.59   2.82   2.15   2.16
     5   2.15   2.17   2.15   2.20   2.82   2.56   2.17   2.16
     6   2.13   2.18   2.21   2.17   2.15   2.17   2.36   2.85
     7   2.19   2.21   2.19   2.22   2.19   2.19   2.86   2.61
   CPU      0      1      2      3      4      5      6      7
     0   1.78   1.32   1.29   1.40   1.33   1.34   1.34   1.33
     1   1.32   1.69   1.34   1.35   1.35   1.34   1.40   1.33
     2   1.38   1.37   1.73   1.36   1.36   1.35   1.35   1.34
     3   1.34   1.42   1.35   1.66   1.34   1.34   1.35   1.33
     4   1.53   1.41   1.40   1.40   1.77   1.43   1.48   1.47
     5   1.46   1.43   1.43   1.42   1.47   1.84   1.51   1.56
     6   1.53   1.45   1.45   1.45   1.45   1.44   1.85   1.47
     7   1.54   1.47   1.47   1.47   1.45   1.44   1.50   1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Published: 06.05.2024