Check NVLink in Linux
Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide Install Nvidia drivers in Linux. You also need to install the CUDA toolkit in order to compile the sample applications. In this short guide we have collected some useful commands you can use.
Basic commands
Check the physical topology of your system. This command shows all GPUs and how they are interconnected:
nvidia-smi topo -m
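As a quick illustration of how to read that matrix, the sketch below filters the NVLink entries (`NV#`, where `#` is the number of bonded links) out of a captured topology matrix with awk. The two-GPU sample matrix is an assumption embedded here so the parsing can be shown without a GPU; real `nvidia-smi topo -m` output also includes NIC columns and a legend, and on a live host you would pipe the command itself into the same awk program.

```shell
# Illustrative (assumed) two-GPU `nvidia-smi topo -m` matrix.
# NV4 means the pair is bonded by 4 NVLinks; X marks the GPU itself.
sample='      GPU0  GPU1  CPU Affinity
GPU0   X    NV4   0-31
GPU1  NV4    X    0-31'

# Print every pair that is connected via NVLink.
echo "$sample" | awk '
  NR == 1 {for (i = 1; i <= NF; i++) hdr[i] = $i; next}  # remember column names
  {for (i = 2; i <= NF; i++)
     if ($i ~ /^NV[0-9]+$/) print $1, "<->", hdr[i-1], "via", $i}'
```

On a real host, replace the `echo "$sample"` with `nvidia-smi topo -m`.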
To display the state of the links, run the following command:
nvidia-smi nvlink -s
The command shows the speed of each link. To display the link capabilities of a specific GPU, select it with the -i option (here GPU 0):
nvidia-smi nvlink -i 0 -c
Without the -i option, information for the links of all GPUs is displayed:
nvidia-smi nvlink -c
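If you want a quick health check rather than reading the status list by eye, a one-liner like the sketch below counts inactive links in `nvidia-smi nvlink -s` output, which reports such links as `<inactive>`. The captured sample is an assumption embedded for illustration; on a real host replace the `echo` with the live command.

```shell
# Illustrative (assumed) `nvidia-smi nvlink -s` capture with two idle links.
status='GPU 0: NVIDIA RTX A6000
         Link 0: 14.062 GB/s
         Link 1: <inactive>
GPU 1: NVIDIA RTX A6000
         Link 0: 14.062 GB/s
         Link 1: <inactive>'

# Count inactive links; live equivalent: nvidia-smi nvlink -s | grep -c '<inactive>'
echo "$status" | grep -c '<inactive>'
```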
Install the CUDA samples
A good way to test throughput is to use Nvidia's sample applications. The source code of these samples is publicly available on GitHub. Start by cloning the repository to the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change into the downloaded repository:
cd cuda-samples
Check out the tag that matches the installed CUDA version. For example, if you have CUDA 12.2:
git checkout tags/v12.2
Install some prerequisites that are used during compilation:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now you can compile any of the samples. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let's test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the app:
make
Run the tests
Start testing by running the app by its name:
./bandwidthTest
The output may look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
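When scripting around bandwidthTest, it can be handy to pull a single figure out of that report. The sketch below extracts the device-to-device bandwidth with awk from a captured fragment of the output shown above; on a GPU host you would pipe `./bandwidthTest` into the same awk program instead.

```shell
# Fragment of the bandwidthTest report, embedded for illustration.
report=' Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS'

# Print the GB/s figure from the Device-to-Device section.
echo "$report" | awk '/Device to Device/ {grab = 1}
                      grab && $1 == "32000000" {print $2; exit}'
```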
Alternatively, you can compile and run p2pBandwidthLatencyTest (the path below assumes you are still in the bandwidthTest directory):
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This app shows detailed information about the bandwidth of your GPUs in P2P mode. Example output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 590.51   6.04
     1   6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0      1
     0 589.40  52.75
     1  52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 593.88   8.55
     1   8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 595.69 101.68
     1 101.97 595.69
P2P=Disabled Latency Matrix (us)
   GPU      0      1
     0   1.61  28.66
     1  18.49   1.53

   CPU      0      1
     0   2.27   6.06
     1   6.12   2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1
     0   1.62   1.27
     1   1.17   1.55

   CPU      0      1
     0   2.27   1.91
     1   1.90   2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
On a configuration with more GPUs, it may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1629.83   38.43   38.39   37.66   38.51   38.19   38.09   37.92
     1   38.22 1637.04   35.52   35.59   38.15   38.38   38.08   37.55
     2   37.76   35.62 1635.32   35.45   38.59   38.21   38.77   37.94
     3   37.88   35.50   35.60 1639.45   38.49   37.43   38.72   38.49
     4   36.87   37.03   37.00   36.90 1635.86   34.48   38.06   37.22
     5   37.27   37.06   36.92   37.06   34.51 1636.18   37.80   37.50
     6   37.05   36.95   37.45   37.15   37.51   37.96 1630.79   34.94
     7   36.98   36.91   36.95   36.87   37.83   38.02   34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1635.22   34.42   33.84  256.54   27.74   28.68   28.00   28.41
     1   34.66 1636.93  256.16   17.97   71.58   71.64   71.65   71.61
     2   34.78  256.81 1655.79   30.29   70.34   70.42   70.37   70.33
     3  256.65   30.65   70.67 1654.53   70.66   70.69   70.70   70.73
     4   28.26   30.80   69.99   70.04 1630.36  256.45   69.97   70.02
     5   28.10   31.08   71.60   71.59  256.47 1654.31   71.62   71.54
     6   28.37   30.96   70.99   70.93   70.91   70.96 1632.12  257.11
     7   27.66   30.87   70.30   70.40   70.30   70.39  256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1673.16   51.88   51.95   51.76   51.61   51.44   52.07   51.30
     1   52.04 1676.28   39.06   39.21   51.62   51.62   51.98   51.36
     2   52.11   39.27 1674.62   39.16   51.42   51.21   51.72   51.71
     3   51.74   39.70   39.22 1672.77   51.50   51.27   51.70   51.24
     4   52.14   52.41   51.38   52.14 1671.54   38.81   46.76   45.72
     5   51.82   52.65   52.30   51.67   38.57 1676.33   46.90   45.96
     6   52.92   52.66   53.02   52.68   46.23   46.31 1672.74   38.91
     7   52.61   52.74   52.79   52.64   45.90   46.35   39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1670.31   52.41  140.69  508.68  139.85  141.88  141.71  140.55
     1  141.69 1673.30  509.23  141.22  139.91  143.28  141.71  140.61
     2  140.64  508.90 1669.67  140.68  139.93  140.61  140.67  140.50
     3  509.14  141.36  140.61 1682.65  139.93  141.45  141.45  140.67
     4  140.01  140.03  140.07  139.94 1670.68  508.37  140.01  139.90
     5  141.92  143.17  140.50  141.19  508.92 1670.73  141.72  140.52
     6  141.72  141.72  140.60  141.31  139.66  141.85 1671.51  510.03
     7  140.62  140.71  140.66  140.63  140.02  140.72  509.77 1668.28
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.35  17.23  17.13  13.38  12.86  21.15  21.39  21.12
     1  17.54   2.32  12.95  13.78  21.05  21.23  21.31  21.37
     2  16.85  14.83   2.35  16.07  12.71  12.80  21.23  12.79
     3  14.98  16.06  14.64   2.41  13.35  12.81  13.60  21.36
     4  21.31  21.31  20.49  21.32   2.62  12.33  12.66  12.98
     5  20.36  21.22  20.17  12.79  16.74   2.58  12.41  12.93
     6  17.51  12.84  12.79  12.70  17.63  18.78   2.36  13.69
     7  21.23  12.71  19.41  21.09  14.69  13.79  15.52   2.59

   CPU     0      1      2      3      4      5      6      7
     0   1.73   4.99   4.88   4.85   5.17   5.18   5.18   5.33
     1   5.04   1.71   4.74   4.82   5.04   5.14   5.10   5.19
     2   4.86   4.75   1.66   4.78   5.08   5.09   5.11   5.17
     3   4.80   4.72   4.73   1.63   5.09   5.11   5.06   5.10
     4   5.07   5.00   5.03   4.96   1.77   5.33   5.34   5.38
     5   5.12   4.94   5.00   4.96   5.31   1.77   5.38   5.41
     6   5.09   4.97   5.09   5.01   5.35   5.39   1.80   5.42
     7   5.18   5.09   5.02   5.00   5.39   5.40   5.40   1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.33   2.15   2.11   2.76   2.07   2.11   2.07   2.12
     1   2.07   2.30   2.77   2.07   2.12   2.06   2.06   2.10
     2   2.09   2.75   2.34   2.12   2.09   2.08   2.08   2.12
     3   2.78   2.10   2.13   2.40   2.13   2.14   2.14   2.13
     4   2.18   2.23   2.23   2.17   2.59   2.82   2.15   2.16
     5   2.15   2.17   2.15   2.20   2.82   2.56   2.17   2.16
     6   2.13   2.18   2.21   2.17   2.15   2.17   2.36   2.85
     7   2.19   2.21   2.19   2.22   2.19   2.19   2.86   2.61

   CPU     0      1      2      3      4      5      6      7
     0   1.78   1.32   1.29   1.40   1.33   1.34   1.34   1.33
     1   1.32   1.69   1.34   1.35   1.35   1.34   1.40   1.33
     2   1.38   1.37   1.73   1.36   1.36   1.35   1.35   1.34
     3   1.34   1.42   1.35   1.66   1.34   1.34   1.35   1.33
     4   1.53   1.41   1.40   1.40   1.77   1.43   1.48   1.47
     5   1.46   1.43   1.43   1.42   1.47   1.84   1.51   1.56
     6   1.53   1.45   1.45   1.45   1.45   1.44   1.85   1.47
     7   1.54   1.47   1.47   1.47   1.45   1.44   1.50   1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
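With eight GPUs it is easy to miss a single 0 in the connectivity matrix. The sketch below scans the P2P Connectivity Matrix section of the test output for zero entries. A small two-GPU sample fragment is embedded so the parsing is self-contained; on a real host you would feed it the output of `./p2pBandwidthLatencyTest` instead.

```shell
# Minimal captured fragment of the p2pBandwidthLatencyTest output (illustrative).
log='P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)'

echo "$log" | awk '
  /P2P Connectivity Matrix/ {grab = 1; next}   # start below the section title
  grab && /^[A-Z]/          {grab = 0}         # next section title ends the matrix
  grab && $1 ~ /^[0-9]+$/   {                  # matrix rows start with a GPU index
    for (i = 2; i <= NF; i++) if ($i == 0) bad++
  }
  END {print (bad ? "missing P2P pairs: " bad : "full P2P connectivity")}'
```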
Published: 06.05.2024