MXNet GPU
This article presents the results of testing mxnet™ on various GPUs and compares the cost of data processing on AWS and LeaderGPU®.
The following tables show the performance test results, namely the number of images processed per second.
Scoring results
We used the official mxnet™ benchmark, benchmark_score.py, with cuDNN 6.0.
The results are as follows:
Ltbv 14 (GTX 1080 single):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 484.21 | 193.85 | 168.71 | 81.41 | 169.28 | 68.38 |
2 | 829.39 | 188.82 | 301.96 | 143.85 | 257.01 | 106.14 |
4 | 1255.73 | 279.13 | 472.93 | 207.13 | 335.21 | 141.79 |
8 | 2103.98 | 361.24 | 653.59 | 306.01 | 391.55 | 166.75 |
16 | 2531.79 | 467.27 | 765.40 | 312.82 | 429.70 | 182.00 |
32 | 3295.19 | 525.39 | 826.41 | 345.28 | 453.44 | 191.84 |
Ltbv 19 (GTX 1080TI single):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 567.68 | 275.97 | 161.42 | 84.16 | 184.49 | 66.96 |
2 | 1087.14 | 193.07 | 317.53 | 150.89 | 322.72 | 123.60 |
4 | 1556.19 | 314.09 | 558.75 | 244.64 | 465.17 | 197.84 |
8 | 2543.31 | 611.98 | 846.34 | 362.18 | 578.75 | 249.68 |
16 | 4033.49 | 757.19 | 1101.56 | 478.16 | 660.71 | 279.48 |
32 | 5435.72 | 827.58 | 1216.74 | 506.96 | 697.81 | 291.98 |
Ltbv 20 (Tesla P100 single):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 540.79 | 273.88 | 122.57 | 64.97 | 140.88 | 53.96 |
2 | 985.37 | 352.77 | 249.28 | 119.37 | 245.47 | 94.08 |
4 | 1570.85 | 478.42 | 424.92 | 195.56 | 374.74 | 155.38 |
8 | 2556.58 | 586.46 | 646.51 | 285.70 | 489.39 | 197.19 |
16 | 4162.64 | 819.96 | 899.54 | 392.91 | 609.39 | 255.19 |
32 | 5565.39 | 889.52 | 1116.51 | 468.03 | 682.52 | 279.87 |
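To make the metric concrete, here is a minimal sketch of how an images-per-second figure can be measured for a given batch size. It is not the code of benchmark_score.py: the function name `images_per_second`, its parameters, and the stand-in workload are assumptions made for illustration only.

```python
import time


def images_per_second(run_batch, batch_size, num_batches=20, warmup=5):
    """Time `num_batches` calls of `run_batch` and return images per second.

    `run_batch` is a hypothetical callable that processes one batch of
    `batch_size` images (for example, one forward pass of a network).
    """
    # Warm-up iterations are excluded from timing so that one-off
    # start-up costs do not distort the measurement.
    for _ in range(warmup):
        run_batch()

    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch()
    elapsed = time.perf_counter() - start

    # Total images processed divided by the elapsed wall-clock time.
    return batch_size * num_batches / elapsed


if __name__ == "__main__":
    # Stand-in workload; a real run would execute a network forward pass.
    rate = images_per_second(lambda: sum(range(100_000)), batch_size=32)
    print(f"{rate:.2f} images/sec")
```

Larger batches usually amortize per-call overhead better, which is consistent with throughput rising with batch size in the tables above.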
Below are the test results for the following GPUs: K80 (AWS EC2 p2.xlarge), M40, and P100 (DGX-1).
The official mxnet™ benchmark, benchmark_score.py, was used here as well.
The test results are quoted from the official mxnet™ web page.
The results are as follows:
K80 (single GPU):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 202.66 | 70.76 | 74.91 | 42.61 | 70.94 | 24.87 |
2 | 233.76 | 63.53 | 119.60 | 60.09 | 92.28 | 34.23 |
4 | 367.91 | 78.16 | 164.41 | 72.30 | 116.68 | 44.76 |
8 | 624.14 | 119.06 | 195.24 | 79.62 | 129.37 | 50.96 |
16 | 1071.19 | 195.83 | 256.06 | 99.38 | 160.40 | 66.51 |
32 | 1443.90 | 228.96 | 287.93 | 106.43 | 167.12 | 69.73 |
M40 (single GPU):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 412.09 | 142.10 | 115.89 | 64.40 | 126.90 | 46.15 |
2 | 743.49 | 212.21 | 205.31 | 108.06 | 202.17 | 75.05 |
4 | 1155.43 | 280.92 | 335.69 | 161.59 | 266.53 | 106.83 |
8 | 1606.87 | 332.76 | 491.12 | 224.22 | 317.20 | 128.67 |
16 | 2070.97 | 400.10 | 618.25 | 251.87 | 335.62 | 134.60 |
32 | 2694.91 | 466.95 | 624.27 | 258.59 | 373.35 | 152.71 |
P100 (single GPU):
Batch | Alexnet | VGG | Inception-BN | Inception-v3 | Resnet 50 | Resnet 152 |
---|---|---|---|---|---|---|
1 | 624.84 | 294.6 | 139.82 | 80.17 | 162.27 | 58.99 |
2 | 1226.85 | 282.3 | 267.41 | 142.63 | 278.02 | 102.95 |
4 | 1934.97 | 399.3 | 463.38 | 225.56 | 423.63 | 168.91 |
8 | 2900.54 | 522.9 | 709.30 | 319.52 | 529.34 | 210.10 |
16 | 4063.70 | 755.3 | 949.22 | 444.65 | 647.43 | 270.07 |
32 | 4883.77 | 854.4 | 1197.74 | 493.72 | 713.17 | 294.17 |
Next, we turn to pricing. We calculate the time and cost of processing 1,000,000 images with batch_size set to 32.
The results are as follows:
GPU | Model | Time | Price (per min) | Total cost |
---|---|---|---|---|
GTX 1080 | Alexnet | 5m 3sec | €0.06 | €0.303 |
GTX 1080 TI | Alexnet | 3m 3sec | €0.07 | €0.2135 |
P100 | Alexnet | 2m 55sec | €0.08 | €0.23333 |
K80 | Alexnet | 11m 31sec | €0.01 | €0.11517 |
GTX 1080 | VGG | 31m 43sec | €0.06 | €1.903 |
GTX 1080 TI | VGG | 20m 8sec | €0.07 | €1.40933 |
P100 | VGG | 18m 40sec | €0.08 | €1.49333 |
K80 | VGG | 72m 50sec | €0.01 | €0.72833 |
GTX 1080 | Inception-BN | 20m 21sec | €0.06 | €1.221 |
GTX 1080 TI | Inception-BN | 13m 41sec | €0.07 | €0.95783 |
P100 | Inception-BN | 14m 58sec | €0.08 | €1.19733 |
K80 | Inception-BN | 57m 45sec | €0.01 | €0.5775 |
GTX 1080 | Inception-v3 | 48m 15sec | €0.06 | €2.895 |
GTX 1080 TI | Inception-v3 | 32m 49sec | €0.07 | €2.29717 |
P100 | Inception-v3 | 35m 38sec | €0.08 | €2.85067 |
K80 | Inception-v3 | 156m 35sec | €0.01 | €1.56583 |
GTX 1080 | Resnet 50 | 36m 46sec | €0.06 | €2.206 |
GTX 1080 TI | Resnet 50 | 23m 53sec | €0.07 | €1.67183 |
P100 | Resnet 50 | 24m 27sec | €0.08 | €1.956 |
K80 | Resnet 50 | 99m 39sec | €0.01 | €0.9965 |
GTX 1080 | Resnet 152 | 86m 50sec | €0.06 | €5.21 |
GTX 1080 TI | Resnet 152 | 57m 5sec | €0.07 | €3.99583 |
P100 | Resnet 152 | 59m 32sec | €0.08 | €4.76267 |
K80 | Resnet 152 | 239m 1sec | €0.01 | €2.39017 |
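The Time and Total cost columns can be derived from the batch-size-32 throughput and the per-minute price. The sketch below shows that arithmetic. The function name `processing_cost` is hypothetical, and dropping the fractional second is an assumption: it matches the GTX 1080 and GTX 1080 TI rows, while other rows may have been rounded differently.

```python
def processing_cost(num_images, images_per_sec, price_per_min):
    """Return (minutes, seconds, total_cost) for processing `num_images`."""
    # Whole seconds needed at the measured throughput
    # (assumption: the fractional second is dropped).
    total_seconds = int(num_images / images_per_sec)
    minutes, seconds = divmod(total_seconds, 60)
    # The price is quoted per minute, so bill the time in fractional minutes.
    total_cost = total_seconds / 60 * price_per_min
    return minutes, seconds, total_cost


# GTX 1080, Alexnet, batch size 32: 3295.19 images/sec at 0.06 EUR per minute.
minutes, seconds, cost = processing_cost(1_000_000, 3295.19, 0.06)
print(f"{minutes}m {seconds}sec, EUR {cost:.3f}")  # prints "5m 3sec, EUR 0.303"
```

With these inputs the result agrees with the GTX 1080 Alexnet row of the table (5m 3sec, total cost 0.303 euro).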
Training results
In this section, we review the results of training networks in mxnet™.
These results are based on example/image-classification/train_imagenet.py and cuDNN 6.0. In addition, the testing script is available here. For the Alexnet network, the batch size was increased 8 times.
Ltbv 14 (GTX 1080 single):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 432.26 | 11.78 | 21.21 |
2 | 655.14 | 19.31 | 34.56 |
4 | 989.15 | 29.61 | 49.79 |
8 | 1167.86 | 39.83 | 71.92 |
16 | 1343.68 | 48.72 | 80.80 |
32 | 1407.41 | -** | 87.93 |
Ltbv 19 (GTX 1080TI single):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 1068.59 | 13.75 | 21.84 |
2 | 1341.03 | 23.20 | 39.08 |
4 | 1573.10 | 37.93 | 62.49 |
8 | 1770.16 | 54.98 | 90.64 |
16 | 1850.01 | 69.26 | 114.24 |
32 | 1729.24 | 75.57 | 124.84 |
Ltbv 20 (Tesla P100 single):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 1138.47 | 10.60 | 21.73 |
2 | 1462.89 | 20.29 | 33.05 |
4 | 1717.54 | 35.05 | 57.97 |
8 | 1914.71 | 51.05 | 83.90 |
16 | 1977.86 | 67.17 | 109.90 |
32 | 1754.03 | 77.74 | 123.48 |
** The available GPU memory is not sufficient for this batch size.
K80 (single GPU):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 230.69 | 9.81 | 13.83 |
2 | 348.10 | 15.31 | 21.85 |
4 | 457.28 | 20.48 | 29.58 |
8 | 533.51 | 24.47 | 36.83 |
16 | 582.36 | 28.46 | 43.60 |
32 | 483.37 | 29.62 | 45.52 |
M40 (single GPU):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 405.17 | 14.35 | 21.56 |
2 | 606.32 | 23.96 | 36.48 |
4 | 792.66 | 37.38 | 52.96 |
8 | 1016.51 | 52.69 | 70.21 |
16 | 1105.18 | 62.35 | 83.13 |
32 | 1046.23 | 68.87 | 90.74 |
P100 (single GPU):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 809.94 | 15.14 | 27.20 |
2 | 1202.93 | 30.34 | 49.55 |
4 | 1631.37 | 50.59 | 78.31 |
8 | 1882.74 | 77.75 | 122.45 |
16 | 2012.04 | 111.11 | 156.79 |
32 | 1869.69 | 129.98 | 181.53 |
Training results on Multiple Devices
This section analyzes data collected from testing mxnet™ with several GPUs on LeaderGPU® instances.
Ltbv 14 (4x GTX 1080):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 98.30 | -* | -* |
2 | 193.09 | -* | -* |
4 | 384.72 | 26.76 | 48.76 |
8 | 723.01 | 46.96 | 88.99 |
16 | 13341.50 | 68.90 | 155.29 |
32 | 1839.47 | 93.37 | 236.57 |
Ltbv 19 (4x GTX 1080TI):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 126.66 | -* | -* |
2 | 217.72 | -* | -* |
4 | 422.59 | 30.22 | 49.37 |
8 | 768.48 | 56.97 | 94.79 |
16 | 1599.70 | 103.13 | 165.64 |
32 | 2973.28 | 172.37 | 275.70 |
Ltbv 20 (2x Tesla P100):
Batch | Alexnet(*8) | Resnet 50 | Inception-v3 |
---|---|---|---|
1 | 465.45 | -* | -* |
2 | 637.42 | 18.47 | 15.76 |
4 | 1002.77 | 33.48 | 28.98 |
8 | 1857.60 | 63.66 | 46.13 |
16 | 2755.08 | 93.42 | 63.84 |
32 | 3500.40 | 129.25 | 78.66 |
* Too many slices, therefore some splits are empty.
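The empty cells marked with * appear exactly where the batch size is smaller than the number of GPUs, while Alexnet, whose batch is 8 times larger, still runs. The following simplified sketch shows why such a batch cannot be divided among the devices. It is an illustration under that assumption, not mxnet™'s own splitting code, and the function name `split_batch` is hypothetical.

```python
def split_batch(batch_size, num_devices):
    """Return the number of samples each device receives from one batch."""
    # With fewer samples than devices, at least one device gets nothing.
    if batch_size < num_devices:
        raise ValueError("too many slices: some splits would be empty")
    base, extra = divmod(batch_size, num_devices)
    # Spread any remainder over the first `extra` devices.
    return [base + 1 if i < extra else base for i in range(num_devices)]


print(split_batch(4, 4))      # Resnet 50, batch 4 on 4 GPUs
print(split_batch(1 * 8, 4))  # Alexnet, batch 1 increased 8 times, on 4 GPUs

try:
    split_batch(2, 4)         # Resnet 50, batch 2 on 4 GPUs
except ValueError as err:
    print("error:", err)
```

Under this reading, the two-GPU P100 instance fails only at batch size 1, which is consistent with its table.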