Check NVLink in Linux
Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide Install Nvidia drivers in Linux. You also need to install the CUDA toolkit in order to compile the sample applications. In this short guide we have collected some useful commands you can use.
Basic commands
Check the physical topology of your system. This command shows all GPUs and how they are interconnected:
nvidia-smi topo -m
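As a quick illustration of how to read that matrix, the sketch below filters the NVLink entries (`NV#`, where `#` is the number of bonded links) out of a captured topology matrix with awk. The two-GPU sample matrix is an assumption embedded here so the parsing can be shown without a GPU; real `nvidia-smi topo -m` output also includes NIC columns and a legend, and on a live host you would pipe the command itself into the same awk program.

```shell
# Illustrative (assumed) two-GPU `nvidia-smi topo -m` matrix.
# NV4 means the pair is bonded by 4 NVLinks; X marks the GPU itself.
sample='      GPU0  GPU1  CPU Affinity
GPU0   X    NV4   0-31
GPU1  NV4    X    0-31'

# Print every pair that is connected via NVLink.
echo "$sample" | awk '
  NR == 1 {for (i = 1; i <= NF; i++) hdr[i] = $i; next}  # remember column names
  {for (i = 2; i <= NF; i++)
     if ($i ~ /^NV[0-9]+$/) print $1, "<->", hdr[i-1], "via", $i}'
```

On a real host, replace the `echo "$sample"` with `nvidia-smi topo -m`.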
To display the state of the links, run the following command:
nvidia-smi nvlink -s
The command shows the speed of each link. To display the link capabilities of a specific GPU, select it with the -i option (here GPU 0):
nvidia-smi nvlink -i 0 -c
Without the -i option, information for the links of all GPUs is displayed:
nvidia-smi nvlink -c
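If you want a quick health check rather than reading the status list by eye, a one-liner like the sketch below counts inactive links in `nvidia-smi nvlink -s` output, which reports such links as `<inactive>`. The captured sample is an assumption embedded for illustration; on a real host replace the `echo` with the live command.

```shell
# Illustrative (assumed) `nvidia-smi nvlink -s` capture with two idle links.
status='GPU 0: NVIDIA RTX A6000
         Link 0: 14.062 GB/s
         Link 1: <inactive>
GPU 1: NVIDIA RTX A6000
         Link 0: 14.062 GB/s
         Link 1: <inactive>'

# Count inactive links; live equivalent: nvidia-smi nvlink -s | grep -c '<inactive>'
echo "$status" | grep -c '<inactive>'
```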
Install the CUDA samples
A good way to test throughput is to use Nvidia's sample applications. The source code of these samples is publicly available on GitHub. Start by cloning the repository to the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change into the downloaded repository:
cd cuda-samples
Check out the tag that matches the installed CUDA version. For example, if you have CUDA 12.2:
git checkout tags/v12.2
Install some prerequisites that are used during compilation:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now you can compile any of the samples. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let's test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the app:
make
Run the tests
Start testing by running the app by its name:
./bandwidthTest
The output may look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
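When scripting around bandwidthTest, it can be handy to pull a single figure out of that report. The sketch below extracts the device-to-device bandwidth with awk from a captured fragment of the output shown above; on a GPU host you would pipe `./bandwidthTest` into the same awk program instead.

```shell
# Fragment of the bandwidthTest report, embedded for illustration.
report=' Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS'

# Print the GB/s figure from the Device-to-Device section.
echo "$report" | awk '/Device to Device/ {grab = 1}
                      grab && $1 == "32000000" {print $2; exit}'
```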
Alternatively, you can compile and run p2pBandwidthLatencyTest (the path below assumes you are still in the bandwidthTest directory):
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This app shows detailed information about the bandwidth of your GPUs in P2P mode. Example output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 590.51   6.04
     1   6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0      1
     0 589.40  52.75
     1  52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 593.88   8.55
     1   8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D      0      1
     0 595.69 101.68
     1 101.97 595.69
P2P=Disabled Latency Matrix (us)
   GPU      0      1
     0   1.61  28.66
     1  18.49   1.53

   CPU      0      1
     0   2.27   6.06
     1   6.12   2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU      0      1
     0   1.62   1.27
     1   1.17   1.55

   CPU      0      1
     0   2.27   1.91
     1   1.90   2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
On a configuration with more GPUs, it may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3     4     5     6     7
     0       1     1     1     1     1     1     1     1
     1       1     1     1     1     1     1     1     1
     2       1     1     1     1     1     1     1     1
     3       1     1     1     1     1     1     1     1
     4       1     1     1     1     1     1     1     1
     5       1     1     1     1     1     1     1     1
     6       1     1     1     1     1     1     1     1
     7       1     1     1     1     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1629.83   38.43   38.39   37.66   38.51   38.19   38.09   37.92
     1   38.22 1637.04   35.52   35.59   38.15   38.38   38.08   37.55
     2   37.76   35.62 1635.32   35.45   38.59   38.21   38.77   37.94
     3   37.88   35.50   35.60 1639.45   38.49   37.43   38.72   38.49
     4   36.87   37.03   37.00   36.90 1635.86   34.48   38.06   37.22
     5   37.27   37.06   36.92   37.06   34.51 1636.18   37.80   37.50
     6   37.05   36.95   37.45   37.15   37.51   37.96 1630.79   34.94
     7   36.98   36.91   36.95   36.87   37.83   38.02   34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1635.22   34.42   33.84  256.54   27.74   28.68   28.00   28.41
     1   34.66 1636.93  256.16   17.97   71.58   71.64   71.65   71.61
     2   34.78  256.81 1655.79   30.29   70.34   70.42   70.37   70.33
     3  256.65   30.65   70.67 1654.53   70.66   70.69   70.70   70.73
     4   28.26   30.80   69.99   70.04 1630.36  256.45   69.97   70.02
     5   28.10   31.08   71.60   71.59  256.47 1654.31   71.62   71.54
     6   28.37   30.96   70.99   70.93   70.91   70.96 1632.12  257.11
     7   27.66   30.87   70.30   70.40   70.30   70.39  256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1673.16   51.88   51.95   51.76   51.61   51.44   52.07   51.30
     1   52.04 1676.28   39.06   39.21   51.62   51.62   51.98   51.36
     2   52.11   39.27 1674.62   39.16   51.42   51.21   51.72   51.71
     3   51.74   39.70   39.22 1672.77   51.50   51.27   51.70   51.24
     4   52.14   52.41   51.38   52.14 1671.54   38.81   46.76   45.72
     5   51.82   52.65   52.30   51.67   38.57 1676.33   46.90   45.96
     6   52.92   52.66   53.02   52.68   46.23   46.31 1672.74   38.91
     7   52.61   52.74   52.79   52.64   45.90   46.35   39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1670.31   52.41  140.69  508.68  139.85  141.88  141.71  140.55
     1  141.69 1673.30  509.23  141.22  139.91  143.28  141.71  140.61
     2  140.64  508.90 1669.67  140.68  139.93  140.61  140.67  140.50
     3  509.14  141.36  140.61 1682.65  139.93  141.45  141.45  140.67
     4  140.01  140.03  140.07  139.94 1670.68  508.37  140.01  139.90
     5  141.92  143.17  140.50  141.19  508.92 1670.73  141.72  140.52
     6  141.72  141.72  140.60  141.31  139.66  141.85 1671.51  510.03
     7  140.62  140.71  140.66  140.63  140.02  140.72  509.77 1668.28
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.35  17.23  17.13  13.38  12.86  21.15  21.39  21.12
     1  17.54   2.32  12.95  13.78  21.05  21.23  21.31  21.37
     2  16.85  14.83   2.35  16.07  12.71  12.80  21.23  12.79
     3  14.98  16.06  14.64   2.41  13.35  12.81  13.60  21.36
     4  21.31  21.31  20.49  21.32   2.62  12.33  12.66  12.98
     5  20.36  21.22  20.17  12.79  16.74   2.58  12.41  12.93
     6  17.51  12.84  12.79  12.70  17.63  18.78   2.36  13.69
     7  21.23  12.71  19.41  21.09  14.69  13.79  15.52   2.59

   CPU     0      1      2      3      4      5      6      7
     0   1.73   4.99   4.88   4.85   5.17   5.18   5.18   5.33
     1   5.04   1.71   4.74   4.82   5.04   5.14   5.10   5.19
     2   4.86   4.75   1.66   4.78   5.08   5.09   5.11   5.17
     3   4.80   4.72   4.73   1.63   5.09   5.11   5.06   5.10
     4   5.07   5.00   5.03   4.96   1.77   5.33   5.34   5.38
     5   5.12   4.94   5.00   4.96   5.31   1.77   5.38   5.41
     6   5.09   4.97   5.09   5.01   5.35   5.39   1.80   5.42
     7   5.18   5.09   5.02   5.00   5.39   5.40   5.40   1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   2.33   2.15   2.11   2.76   2.07   2.11   2.07   2.12
     1   2.07   2.30   2.77   2.07   2.12   2.06   2.06   2.10
     2   2.09   2.75   2.34   2.12   2.09   2.08   2.08   2.12
     3   2.78   2.10   2.13   2.40   2.13   2.14   2.14   2.13
     4   2.18   2.23   2.23   2.17   2.59   2.82   2.15   2.16
     5   2.15   2.17   2.15   2.20   2.82   2.56   2.17   2.16
     6   2.13   2.18   2.21   2.17   2.15   2.17   2.36   2.85
     7   2.19   2.21   2.19   2.22   2.19   2.19   2.86   2.61

   CPU     0      1      2      3      4      5      6      7
     0   1.78   1.32   1.29   1.40   1.33   1.34   1.34   1.33
     1   1.32   1.69   1.34   1.35   1.35   1.34   1.40   1.33
     2   1.38   1.37   1.73   1.36   1.36   1.35   1.35   1.34
     3   1.34   1.42   1.35   1.66   1.34   1.34   1.35   1.33
     4   1.53   1.41   1.40   1.40   1.77   1.43   1.48   1.47
     5   1.46   1.43   1.43   1.42   1.47   1.84   1.51   1.56
     6   1.53   1.45   1.45   1.45   1.45   1.44   1.85   1.47
     7   1.54   1.47   1.47   1.47   1.45   1.44   1.50   1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
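With eight GPUs it is easy to miss a single 0 in the connectivity matrix. The sketch below scans the P2P Connectivity Matrix section of the test output for zero entries. A small two-GPU sample fragment is embedded so the parsing is self-contained; on a real host you would feed it the output of `./p2pBandwidthLatencyTest` instead.

```shell
# Minimal captured fragment of the p2pBandwidthLatencyTest output (illustrative).
log='P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)'

echo "$log" | awk '
  /P2P Connectivity Matrix/ {grab = 1; next}   # start below the section title
  grab && /^[A-Z]/          {grab = 0}         # next section title ends the matrix
  grab && $1 ~ /^[0-9]+$/   {                  # matrix rows start with a GPU index
    for (i = 2; i <= NF; i++) if ($i == 0) bad++
  }
  END {print (bad ? "missing P2P pairs: " bad : "full P2P connectivity")}'
```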
Published: 06.05.2024