
Simple NN test implemented in PyTorch on a laptop and Mac mini 2018

2023-02-17 | Research | #Words: 570

I tested my current laptop and Mac mini running 10 iterations of the simplest single-layer NN (with backpropagation) built with PyTorch, hoping to learn which parameters to pay attention to when purchasing hardware in the future. Unfortunately, the results did not give a simple answer.
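A minimal sketch of this kind of benchmark (the layer size, data, and learning rate here are illustrative stand-ins, not the original script):

```python
import time
import torch
from torch import nn

# Pick CUDA when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Illustrative synthetic data; the original test's dimensions may differ.
x = torch.randn(8192, 1024, device=device)
y = torch.randn(8192, 10, device=device)

model = nn.Linear(1024, 10).to(device)  # simplest single-layer NN
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

start = time.time()
for _ in range(10):  # 10 iterations, as in the test above
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()  # backpropagation
    optimizer.step()
if device.type == "cuda":
    torch.cuda.synchronize()  # make sure all GPU work has finished
print(f"{time.time() - start:.2f}s")
```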

First, the specifications of each device:

|          | Dell Inspiron 7490                              | Mac mini 2018                      |
|----------|-------------------------------------------------|------------------------------------|
| CPU      | i7-10510U, 4C8T, 2.30–4.90 GHz                  | i5-8500B, 6C6T, 3.00–4.10 GHz      |
| Cache    | 8 MB                                            | 9 MB                               |
| Memory   | 16 GB DDR3 2133 MHz                             | 32 GB DDR4 2666 MHz                |
| Best GPU | MX250, 4 SM / 384 CUDA cores, 64-bit 2 GB GDDR5 | Intel® UHD Graphics 630, 1.10 GHz  |

The only pity is that I sold the iMac 5K early; otherwise I could also have tested MPS acceleration, since PyTorch's MPS backend only supports devices with Apple silicon or AMD graphics cards (a device-selection sketch follows the table below). The following results are the best values, or the middle value when the spread is large:

| Dell Inspiron 7490 CPU | Dell Inspiron 7490 GPU | Mac mini 2018 |
|------------------------|------------------------|---------------|
| 123.15 s               | 107.79 s               | 87.94 s       |
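As a reference for the MPS point above, device selection in PyTorch looks roughly like this (`torch.backends.mps` shipped in PyTorch 1.12; this sketch simply falls back in order of preference):

```python
import torch

# MPS needs a PyTorch build with Metal support and a supported
# Apple silicon or AMD GPU; otherwise fall back to CUDA or the CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```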

The most surprising thing is that the i5-8500B actually takes less time than the MX250 with CUDA acceleration. This is most likely down to the operating system; Windows performance is indeed not great.

In addition, memory bandwidth really matters (although capacity matters too). The MX250's bandwidth is simply too small: its 64-bit GDDR5 bus delivers only 48 GB/s. Even after increasing the iteration count and reducing the learning rate, the CUDA throughput would not go up. The memory bandwidth of ordinary consumer graphics cards is sufficient, but the MX250 is positioned for thin-and-light notebooks.
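The 48 GB/s figure follows from the usual peak-bandwidth formula; the 6 Gbps effective GDDR5 rate below is the value implied by that figure, not taken from a spec sheet:

```python
# Peak memory bandwidth (GB/s) = bus width (bits) / 8 * effective data rate (Gbps)
mx250_bw = 64 / 8 * 6  # 64-bit bus at an assumed 6 Gbps effective rate
print(mx250_bw)        # 48.0 GB/s; the same formula reproduces the table below
```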

This reminds me of a test I saw in someone else's post, "Running PyTorch on the M1 GPU", where one of the data points is the RTX 3080 Laptop coming in slightly slower than the RTX 2080 Ti:

(Figure: benchmark chart from the linked post)

The specs and results of each graphics card in the chart above are as follows:

|                         | GTX 1080 Ti | Titan V  | RTX 2080 Ti | RTX 3060 Laptop | RTX 3080 Laptop |
|-------------------------|-------------|----------|-------------|-----------------|-----------------|
| CUDA cores              | 3584        | 5120     | 4352        | 3840            | 6144            |
| Memory interface        | 352-bit     | 3072-bit | 352-bit     | 192-bit         | 256-bit         |
| Memory rate (Gbps)      | 11          | 1.7      | 14          | 14              | 14              |
| Memory bandwidth (GB/s) | 484         | 652.8    | 616         | 336             | 448             |
| Minutes/epoch           | 7.65        | 5.03     | 5.75        | 14.53           | 6.66            |

Since every GPU here has roughly 10 GiB of graphics memory, I ignore capacity.

The full table may not make the pattern obvious, so I keep only "CUDA Cores", "Memory Bandwidth", and "Performance", since the remaining columns only serve to determine "Memory Bandwidth":

(Chart: CUDA cores, memory bandwidth, and performance for each card)

You can clearly see that training performance is closely tied to memory bandwidth and the number of CUDA cores.

Notice that memory bandwidth is directly related to performance, while the number of CUDA cores is not necessarily so. That is because, except for the GTX 1080 Ti, all of these cards have Tensor Cores, which are designed specifically for tensor computation. Tensor Cores are nowadays much faster and more efficient at this workload, though by design their count is proportional to the number of CUDA cores.
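A quick sanity check on the numbers in the table above, taking throughput as 1/minutes-per-epoch (plain Pearson correlation; my own calculation, not from the original post):

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

cores = [3584, 5120, 4352, 3840, 6144]
bandwidth = [484, 652.8, 616, 336, 448]                        # GB/s, from the table
throughput = [1 / m for m in [7.65, 5.03, 5.75, 14.53, 6.66]]  # epochs/minute

print(pearson(bandwidth, throughput))  # ~0.94: bandwidth tracks performance closely
print(pearson(cores, throughput))      # ~0.49: CUDA core count tracks it only loosely
```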

Of course, it would be more direct to compare the GPUs' floating-point performance, but FP throughput figures for consumer GPUs are not that easy to find, so I can only guess from the specs.
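For a rough estimate, peak FP32 throughput can be derived as 2 FLOPs (one fused multiply-add) per CUDA core per clock; the boost clock below is NVIDIA's published figure for the GTX 1080 Ti and would need to be swapped per card:

```python
# Peak FP32 FLOPS ≈ CUDA cores * 2 (one FMA per clock) * boost clock
cuda_cores = 3584        # GTX 1080 Ti, from the table above
boost_clock_ghz = 1.582  # NVIDIA's published boost clock
tflops = cuda_cores * 2 * boost_clock_ghz / 1000
print(f"{tflops:.1f} TFLOPS")  # ~11.3 TFLOPS, matching the commonly cited figure
```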