Check NVLink in Linux
Before checking NVLink support in the operating system, install the Nvidia drivers by following our guide Install Nvidia driver in Linux. You also need to install the CUDA toolkit to compile the sample applications. This short guide collects some useful commands you can use.
Basic commands
Check the physical topology of your system. This command shows all GPUs and their interconnections:
nvidia-smi topo -m
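In the matrix printed by nvidia-smi topo -m, GPU pairs connected through NVLink are marked with NV followed by the number of bonded links (for example NV4), while PCIe paths use labels such as PIX, PHB or SYS. The following is a minimal sketch for checking this in a script; the function name has_nvlink is a hypothetical helper, not part of any Nvidia tool, and it assumes GNU grep:

```shell
# Hypothetical helper: succeeds if the topology matrix read from stdin
# contains at least one NVLink entry (NV1, NV2, ...).
has_nvlink() {
    # -w matches whole words only, so "NVIDIA" or the legend's "NV#" do not count.
    grep -qw 'NV[0-9][0-9]*'
}

# Usage on a live system (requires the Nvidia driver):
#   if nvidia-smi topo -m | has_nvlink; then echo "NVLink detected"; fi
```

Reading from stdin keeps the helper usable on saved output as well as on a live system.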
To display the status of the links, run the following command:
nvidia-smi nvlink -s
This command displays the speed of each link. To display the link capabilities of a single GPU, select it by index with the -i option:
nvidia-smi nvlink -i 0 -c
Without the -i option, the command displays information about the links of all GPUs:
nvidia-smi nvlink -c
Install CUDA samples
A good way to test performance is to use Nvidia's sample applications. Their source code is published on GitHub and is available to everyone. Clone the repository onto the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change to the downloaded repository's directory:
cd cuda-samples
Check out the tag that matches your installed CUDA version. For example, if you have CUDA 12.2:
git checkout tags/v12.2
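If you are not sure which CUDA version is installed, the release number printed by nvcc --version can be turned into a tag name. This is a sketch only: cuda_samples_tag is a hypothetical helper, and it assumes that the repository has a tag named exactly after your release (check with git tag -l before relying on it):

```shell
# Hypothetical helper: reads `nvcc --version` output on stdin and prints
# the matching tag name, e.g. "v12.2" for "release 12.2".
cuda_samples_tag() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/v\1/p' | head -n 1
}

# Usage (requires the CUDA toolkit on PATH):
#   git checkout "tags/$(nvcc --version | cuda_samples_tag)"
```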
Install the prerequisites used during the build:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now you can compile any sample. Go to the Samples directory:
cd Samples
Take a quick look at its contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x  6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
Let's test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the application:
make
Run tests
Start the test by running the application by name:
./bandwidthTest
The output may look like this:
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     6.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     569.2

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
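To compare runs, it can help to reduce this output to one line per transfer direction. The sketch below assumes the layout shown above (a heading containing "Bandwidth, N Device(s)" followed by a row with the transfer size and the bandwidth); summarize_bandwidth is a hypothetical helper name:

```shell
# Hypothetical helper: condenses bandwidthTest output read from stdin
# into lines such as "Host to Device: 6.0 GB/s".
summarize_bandwidth() {
    awk '
        # Section heading, e.g. " Host to Device Bandwidth, 1 Device(s)"
        / Bandwidth, [0-9]+ Device/ {
            sub(/ Bandwidth,.*/, "")   # drop everything after the direction
            sub(/^ +/, "")             # drop leading spaces
            label = $0
        }
        # Data row: transfer size in bytes, then bandwidth in GB/s
        /^[ \t]*[0-9]+[ \t]+[0-9.]+[ \t]*$/ {
            print label ": " $2 " GB/s"
        }
    '
}

# Usage:
#   ./bandwidthTest | summarize_bandwidth
```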
Alternatively, you can compile and run p2pBandwidthLatencyTest. From the bandwidthTest directory, change to its directory:
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This application shows detailed information about the bandwidth and latency between your GPUs in P2P mode. Example output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D       0       1
     0       1       1
     1       1       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1
     0  590.51    6.04
     1    6.02  590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1
     0  589.40   52.75
     1   52.88  592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1
     0  593.88    8.55
     1    8.55  595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1
     0  595.69  101.68
     1  101.97  595.69
P2P=Disabled Latency Matrix (us)
   GPU       0       1
     0    1.61   28.66
     1   18.49    1.53

   CPU       0       1
     0    2.27    6.06
     1    6.12    2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1
     0    1.62    1.27
     1    1.17    1.55

   CPU       0       1
     0    2.27    1.91
     1    1.90    2.34

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
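The gain from peer-to-peer transfers in the two-GPU output above is easiest to read as a ratio between the P2P=Enabled and P2P=Disabled figures for the same GPU pair. A small sketch of that arithmetic, using the GPU 0 to GPU 1 values from the output; bandwidth_ratio is a hypothetical helper name:

```shell
# Hypothetical helper: ratio of two bandwidth figures, one decimal place.
bandwidth_ratio() {
    awk -v enabled="$1" -v disabled="$2" 'BEGIN { printf "%.1f\n", enabled / disabled }'
}

# Unidirectional, GPU 0 -> GPU 1: 52.75 GB/s with P2P, 6.04 GB/s without.
bandwidth_ratio 52.75 6.04     # prints 8.7

# Bidirectional, GPU 0 <-> GPU 1: 101.68 GB/s with P2P, 8.55 GB/s without.
bandwidth_ratio 101.68 8.55    # prints 11.9
```

As the note in the output says, the CUDA samples are not meant for performance measurements, so treat these ratios as a rough indication only.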
On a multi-GPU configuration, the output may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
   D\D       0       1       2       3       4       5       6       7
     0       1       1       1       1       1       1       1       1
     1       1       1       1       1       1       1       1       1
     2       1       1       1       1       1       1       1       1
     3       1       1       1       1       1       1       1       1
     4       1       1       1       1       1       1       1       1
     5       1       1       1       1       1       1       1       1
     6       1       1       1       1       1       1       1       1
     7       1       1       1       1       1       1       1       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1629.83   38.43   38.39   37.66   38.51   38.19   38.09   37.92
     1   38.22 1637.04   35.52   35.59   38.15   38.38   38.08   37.55
     2   37.76   35.62 1635.32   35.45   38.59   38.21   38.77   37.94
     3   37.88   35.50   35.60 1639.45   38.49   37.43   38.72   38.49
     4   36.87   37.03   37.00   36.90 1635.86   34.48   38.06   37.22
     5   37.27   37.06   36.92   37.06   34.51 1636.18   37.80   37.50
     6   37.05   36.95   37.45   37.15   37.51   37.96 1630.79   34.94
     7   36.98   36.91   36.95   36.87   37.83   38.02   34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1635.22   34.42   33.84  256.54   27.74   28.68   28.00   28.41
     1   34.66 1636.93  256.16   17.97   71.58   71.64   71.65   71.61
     2   34.78  256.81 1655.79   30.29   70.34   70.42   70.37   70.33
     3  256.65   30.65   70.67 1654.53   70.66   70.69   70.70   70.73
     4   28.26   30.80   69.99   70.04 1630.36  256.45   69.97   70.02
     5   28.10   31.08   71.60   71.59  256.47 1654.31   71.62   71.54
     6   28.37   30.96   70.99   70.93   70.91   70.96 1632.12  257.11
     7   27.66   30.87   70.30   70.40   70.30   70.39  256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1673.16   51.88   51.95   51.76   51.61   51.44   52.07   51.30
     1   52.04 1676.28   39.06   39.21   51.62   51.62   51.98   51.36
     2   52.11   39.27 1674.62   39.16   51.42   51.21   51.72   51.71
     3   51.74   39.70   39.22 1672.77   51.50   51.27   51.70   51.24
     4   52.14   52.41   51.38   52.14 1671.54   38.81   46.76   45.72
     5   51.82   52.65   52.30   51.67   38.57 1676.33   46.90   45.96
     6   52.92   52.66   53.02   52.68   46.23   46.31 1672.74   38.91
     7   52.61   52.74   52.79   52.64   45.90   46.35   39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0 1670.31   52.41  140.69  508.68  139.85  141.88  141.71  140.55
     1  141.69 1673.30  509.23  141.22  139.91  143.28  141.71  140.61
     2  140.64  508.90 1669.67  140.68  139.93  140.61  140.67  140.50
     3  509.14  141.36  140.61 1682.65  139.93  141.45  141.45  140.67
     4  140.01  140.03  140.07  139.94 1670.68  508.37  140.01  139.90
     5  141.92  143.17  140.50  141.19  508.92 1670.73  141.72  140.52
     6  141.72  141.72  140.60  141.31  139.66  141.85 1671.51  510.03
     7  140.62  140.71  140.66  140.63  140.02  140.72  509.77 1668.28
P2P=Disabled Latency Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.35   17.23   17.13   13.38   12.86   21.15   21.39   21.12
     1   17.54    2.32   12.95   13.78   21.05   21.23   21.31   21.37
     2   16.85   14.83    2.35   16.07   12.71   12.80   21.23   12.79
     3   14.98   16.06   14.64    2.41   13.35   12.81   13.60   21.36
     4   21.31   21.31   20.49   21.32    2.62   12.33   12.66   12.98
     5   20.36   21.22   20.17   12.79   16.74    2.58   12.41   12.93
     6   17.51   12.84   12.79   12.70   17.63   18.78    2.36   13.69
     7   21.23   12.71   19.41   21.09   14.69   13.79   15.52    2.59

   CPU       0       1       2       3       4       5       6       7
     0    1.73    4.99    4.88    4.85    5.17    5.18    5.18    5.33
     1    5.04    1.71    4.74    4.82    5.04    5.14    5.10    5.19
     2    4.86    4.75    1.66    4.78    5.08    5.09    5.11    5.17
     3    4.80    4.72    4.73    1.63    5.09    5.11    5.06    5.10
     4    5.07    5.00    5.03    4.96    1.77    5.33    5.34    5.38
     5    5.12    4.94    5.00    4.96    5.31    1.77    5.38    5.41
     6    5.09    4.97    5.09    5.01    5.35    5.39    1.80    5.42
     7    5.18    5.09    5.02    5.00    5.39    5.40    5.40    1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1       2       3       4       5       6       7
     0    2.33    2.15    2.11    2.76    2.07    2.11    2.07    2.12
     1    2.07    2.30    2.77    2.07    2.12    2.06    2.06    2.10
     2    2.09    2.75    2.34    2.12    2.09    2.08    2.08    2.12
     3    2.78    2.10    2.13    2.40    2.13    2.14    2.14    2.13
     4    2.18    2.23    2.23    2.17    2.59    2.82    2.15    2.16
     5    2.15    2.17    2.15    2.20    2.82    2.56    2.17    2.16
     6    2.13    2.18    2.21    2.17    2.15    2.17    2.36    2.85
     7    2.19    2.21    2.19    2.22    2.19    2.19    2.86    2.61

   CPU       0       1       2       3       4       5       6       7
     0    1.78    1.32    1.29    1.40    1.33    1.34    1.34    1.33
     1    1.32    1.69    1.34    1.35    1.35    1.34    1.40    1.33
     2    1.38    1.37    1.73    1.36    1.36    1.35    1.35    1.34
     3    1.34    1.42    1.35    1.66    1.34    1.34    1.35    1.33
     4    1.53    1.41    1.40    1.40    1.77    1.43    1.48    1.47
     5    1.46    1.43    1.43    1.42    1.47    1.84    1.51    1.56
     6    1.53    1.45    1.45    1.45    1.45    1.44    1.85    1.47
     7    1.54    1.47    1.47    1.47    1.45    1.44    1.50    1.84

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
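With eight GPUs the list of peer-access lines is long, so it is easy to miss a pair that lacks P2P access. One way to check it is to count the "CAN Access Peer" lines and compare the total with N x (N - 1) ordered pairs for N GPUs (56 for eight GPUs). The sketch below assumes the output format shown in this guide; check_full_p2p is a hypothetical helper name:

```shell
# Hypothetical helper: reads p2pBandwidthLatencyTest output on stdin and
# reports whether every ordered GPU pair has peer access.
check_full_p2p() {
    awk '
        /^Device: [0-9]+,/  { gpus++ }    # one header line per GPU
        / CAN Access Peer / { pairs++ }   # one line per ordered pair with P2P
        END {
            expected = gpus * (gpus - 1)
            printf "%d GPUs, %d of %d peer pairs\n", gpus, pairs, expected
            exit (pairs == expected ? 0 : 1)
        }
    '
}

# Usage:
#   ./p2pBandwidthLatencyTest | check_full_p2p
```

The function returns a non-zero exit status when some pairs are missing, so it can be used in a provisioning script.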
Published: 06.05.2024