Build a Budget Deep Learning Box
Dr. George Jen
There is a myth that a good deep learning machine has to be expensive; it does not have to be that way. I built one on a shoestring, and I want to compare it with a real enterprise-level server.
Here is what I have done:
I. Build the machine:
I did my own hardware design:
1. A mainstream Intel motherboard, supporting up to 7th-gen Intel Core CPUs (Kaby Lake); these boards are widely available and cheap. The motherboard I chose is the Gigabyte GA-Z170XP-SLI, which supports 7th-gen Intel CPUs, up to 64GB of memory, and up to 3 full-length video cards. It has pretty good reviews on Amazon, and my plan was to buy this motherboard and an Intel i3 CPU.
Why not a 7th-gen i7? Because in machine learning the CPU is merely an initiator, not an executor: it handles user request intake, prepares data, and presents the final classification result, while the GPU does the actual computation (see the sketch after this item). Since I had set a budget, I bought the board on eBay for $61 plus shipping.
Gigabyte GA-Z170XP-SLI Motherboard + Intel Celeron G3900 CPU Processor
The seller threw in an Intel Celeron G3900 CPU for free, the lowest end of its product family, with 2 cores. That is all right; it will do its job well as the initiator, and it saved me from buying and installing a new Intel processor.
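To make the initiator/executor split concrete, here is a minimal TensorFlow 1.x sketch (illustrative only; the device names, shapes, and random data are my assumptions, not code from this build). The CPU stages the input data, and the GPU runs the matrix math:

import numpy as np
import tensorflow as tf

with tf.device('/cpu:0'):
    # Initiator role: the CPU only holds and prepares the input tensors.
    x = tf.placeholder(tf.float32, shape=[None, 1024])
    w = tf.Variable(tf.random_normal([1024, 10]))

with tf.device('/gpu:0'):
    # Executor role: the heavy matrix multiplication runs on the GPU.
    logits = tf.matmul(x, w)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(64, 1024).astype(np.float32)  # CPU-side data prep
    print(sess.run(logits, feed_dict={x: batch}).shape)   # (64, 10)

With allow_soft_placement the same script still runs on a CPU-only box; it just runs slower.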
2. Then I bought an NVIDIA GeForce GTX 1070 Ti SC Black Edition, also on eBay. Why the GTX 1070 Ti? Because it has the best price/performance ratio: 8GB of GDDR5 onboard memory and 2,400+ CUDA cores, yet it was heavily discounted because NVIDIA was releasing the new RTX 20-series. I got it on eBay for $320, a good deal.
3. Then I headed to a local computer parts superstore and bought a mid-tower case with 5 case fans ($60), since I wanted to design my own airflow for proper cooling, plus 16GB of memory ($150) and an M.2 PCIe 256GB SSD ($70). I had a 600W power supply lying around, so I just used it, and a spare 1TB hard drive that serves as backup space.
4. It took a weekend to assemble; here is the assembled deep learning box:
5. Installed Ubuntu 18.04, which was easy and completed quickly. Then I followed the deep learning box setup instructions:
http://www.erpcomputing.com/deep-learning-using-keras
Software installation was done within 2 hours.
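If you want to confirm that TensorFlow actually sees the card after this setup, a quick check (assuming the GPU build of TensorFlow 1.x installed above) is:

from tensorflow.python.client import device_lib

# Lists CPU:0 plus GPU:0; the GPU entry shows the card name, PCI bus id
# and compute capability (6.1 for the GTX 1070 Ti).
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)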
The machine is powerful for its intended purpose, and the GPU has 2,400+ CUDA cores.
It can run up to 2,048 concurrent threads. Someone may argue there can be thousands of threads in the OS queue, but those are NOT concurrent threads: the number of truly concurrent threads is limited by the number of available cores in the machine, meaning the vast majority of queued threads sit in a wait state.
Why is large concurrent-thread capacity so important in deep learning? Because at the center of deep learning is matrix and vector computation, such as matmul() (matrix multiplication):
C = A × B
Each element c(i,j) can be calculated in parallel, at the same time, if the machine supports a large number of concurrent threads. On a regular computer you would do it with a nested loop over i and j, which runs serially, so the difference in speed is obvious; see the sketch below.
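Here is a small, illustrative Python comparison (the matrix size and timings are arbitrary, not from the benchmark below): the nested-loop version computes one c(i,j) at a time, while the vectorized call hands the whole multiplication to an optimized parallel library (BLAS on the CPU, cuBLAS when a GPU framework is involved):

import time
import numpy as np

n = 512
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

def matmul_loops(A, B):
    # Serial version: each c(i, j) is computed one after another.
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

t0 = time.time()
C_serial = matmul_loops(A, B)
print("nested loops: %.2f s" % (time.time() - t0))

t0 = time.time()
C_fast = A @ B   # vectorized: the library parallelizes the c(i, j) work
print("vectorized:   %.4f s" % (time.time() - t0))

print("results match:", np.allclose(C_serial, C_fast, rtol=1e-3))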
6. I built this machine on a shoestring.
Total cost:
Gigabyte GA-Z170XP-SLI Motherboard + Intel Celeron G3900 CPU Processor: $61+$12 (shipping) = $73
16GB memory: $150
256GB SSD: $70
NVIDIA GeForce GTX 1070 Ti SC Black Edition: $320
Mid Tower case: $60
Power supply (600W): $50
Total: $723; adding sales tax, still less than 850 bucks.
II. Next, I want to compare this machine with a server that has a decent number of CPU cores and a decent amount of memory, and see how this budget build stacks up.
I ran a benchmark on image recognition of handwritten digits, a popular classic starter for deep learning with neural network classification. The dataset is MNIST (http://yann.lecun.com/exdb/mnist/), and the classification algorithm is an RNN (recurrent neural network), which is heavy on matrix and vector computation.
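The script is in the style of the classic TensorFlow 1.x recurrent_network.py MNIST example; the condensed sketch below shows the idea (hyperparameters are illustrative, and this is not necessarily the exact code that was run):

import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

timesteps, num_input = 28, 28      # each 28x28 image is a 28-step sequence of 28 pixels
num_hidden, num_classes = 128, 10
batch_size, training_steps = 128, 10000

X = tf.placeholder(tf.float32, [None, timesteps, num_input])
Y = tf.placeholder(tf.float32, [None, num_classes])
W = tf.Variable(tf.random_normal([num_hidden, num_classes]))
b = tf.Variable(tf.random_normal([num_classes]))

# Unroll an LSTM over the 28 rows and classify from its last output.
cell = rnn.BasicLSTMCell(num_hidden)
outputs, _ = tf.nn.static_rnn(cell, tf.unstack(X, timesteps, 1), dtype=tf.float32)
logits = tf.matmul(outputs[-1], W) + b

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, training_steps + 1):
        bx, by = mnist.train.next_batch(batch_size)
        bx = bx.reshape((batch_size, timesteps, num_input))
        sess.run(train_op, feed_dict={X: bx, Y: by})
    test_x = mnist.test.images[:128].reshape((-1, timesteps, num_input))
    test_y = mnist.test.labels[:128]
    print("Testing Accuracy:", sess.run(accuracy, feed_dict={X: test_x, Y: test_y}))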
I timed the Python code on an enterprise-Linux bare-metal server with 2 Xeon CPUs (48 cores total) and 256GB of RAM, without a graphics card:
# grep "processor" /proc/cpuinfo | wc -l
48
# free
total used free shared buff/cache available
Mem: 263836644 8586980 28754508 1697668 226495156 252470536
Swap: 2097148 0 2097148
It took 5 minutes 24 seconds of elapsed (wall-clock) time, with over 27 minutes of total CPU time across the cores, to train the model to recognize handwritten digits, with a Testing Accuracy of 0.8984375:
real 5m24.519s
user 27m32.158s
sys 6m12.759s
Then I ran the same Python code on my machine with 1 Celeron (2 cores), 16GB of RAM, and an NVIDIA GTX 1070 Ti with 8GB of GDDR5 and 2,400+ CUDA cores.
It took about 1 minute 23 seconds of elapsed (wall-clock) time to train the model to recognize handwritten digits, with a Testing Accuracy of 0.8828125:
real 1m22.601s
user 1m21.002s
sys 0m7.588s
My machine significantly outperforms the enterprise server: roughly 4X faster (324.5 s vs 82.6 s of wall-clock time).
The testing transcripts follow:
1. On a server with 48 CPU cores/256GB RAM:
# time python recurrent_network.py
/root/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From recurrent_network.py:21: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /root/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /root/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /root/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /root/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /root/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From recurrent_network.py:77: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
2018-09-27 12:43:54.968363: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Step 1, Minibatch Loss= 2.9389, Training Accuracy= 0.141
Step 200, Minibatch Loss= 2.0791, Training Accuracy= 0.320
Step 400, Minibatch Loss= 1.9905, Training Accuracy= 0.328
Step 600, Minibatch Loss= 1.7382, Training Accuracy= 0.414
Step 800, Minibatch Loss= 1.6303, Training Accuracy= 0.484
Step 1000, Minibatch Loss= 1.6032, Training Accuracy= 0.469
Step 1200, Minibatch Loss= 1.5498, Training Accuracy= 0.523
Step 1400, Minibatch Loss= 1.3419, Training Accuracy= 0.578
Step 1600, Minibatch Loss= 1.4288, Training Accuracy= 0.539
Step 1800, Minibatch Loss= 1.2678, Training Accuracy= 0.609
Step 2000, Minibatch Loss= 1.2080, Training Accuracy= 0.617
Step 2200, Minibatch Loss= 1.2996, Training Accuracy= 0.641
Step 2400, Minibatch Loss= 1.1556, Training Accuracy= 0.656
Step 2600, Minibatch Loss= 1.2326, Training Accuracy= 0.617
Step 2800, Minibatch Loss= 1.0982, Training Accuracy= 0.656
Step 3000, Minibatch Loss= 1.1160, Training Accuracy= 0.680
Step 3200, Minibatch Loss= 0.9884, Training Accuracy= 0.680
Step 3400, Minibatch Loss= 1.1071, Training Accuracy= 0.641
Step 3600, Minibatch Loss= 1.0522, Training Accuracy= 0.688
Step 3800, Minibatch Loss= 0.8391, Training Accuracy= 0.750
Step 4000, Minibatch Loss= 1.1120, Training Accuracy= 0.609
Step 4200, Minibatch Loss= 0.8780, Training Accuracy= 0.727
Step 4400, Minibatch Loss= 0.8597, Training Accuracy= 0.727
Step 4600, Minibatch Loss= 0.8418, Training Accuracy= 0.734
Step 4800, Minibatch Loss= 0.7465, Training Accuracy= 0.750
Step 5000, Minibatch Loss= 0.8330, Training Accuracy= 0.742
Step 5200, Minibatch Loss= 0.6910, Training Accuracy= 0.781
Step 5400, Minibatch Loss= 0.8432, Training Accuracy= 0.742
Step 5600, Minibatch Loss= 0.6543, Training Accuracy= 0.781
Step 5800, Minibatch Loss= 0.8574, Training Accuracy= 0.750
Step 6000, Minibatch Loss= 0.7193, Training Accuracy= 0.781
Step 6200, Minibatch Loss= 0.7891, Training Accuracy= 0.781
Step 6400, Minibatch Loss= 0.7406, Training Accuracy= 0.828
Step 6600, Minibatch Loss= 0.6304, Training Accuracy= 0.797
Step 6800, Minibatch Loss= 0.5800, Training Accuracy= 0.844
Step 7000, Minibatch Loss= 0.6605, Training Accuracy= 0.734
Step 7200, Minibatch Loss= 0.7147, Training Accuracy= 0.734
Step 7400, Minibatch Loss= 0.5023, Training Accuracy= 0.836
Step 7600, Minibatch Loss= 0.5748, Training Accuracy= 0.859
Step 7800, Minibatch Loss= 0.4608, Training Accuracy= 0.852
Step 8000, Minibatch Loss= 0.5501, Training Accuracy= 0.828
Step 8200, Minibatch Loss= 0.7387, Training Accuracy= 0.758
Step 8400, Minibatch Loss= 0.6295, Training Accuracy= 0.773
Step 8600, Minibatch Loss= 0.4497, Training Accuracy= 0.875
Step 8800, Minibatch Loss= 0.4465, Training Accuracy= 0.867
Step 9000, Minibatch Loss= 0.3878, Training Accuracy= 0.875
Step 9200, Minibatch Loss= 0.4795, Training Accuracy= 0.883
Step 9400, Minibatch Loss= 0.4143, Training Accuracy= 0.859
Step 9600, Minibatch Loss= 0.4545, Training Accuracy= 0.844
Step 9800, Minibatch Loss= 0.3253, Training Accuracy= 0.922
Step 10000, Minibatch Loss= 0.4482, Training Accuracy= 0.859
Optimization Finished!
Testing Accuracy: 0.8984375
real 5m24.519s
user 27m32.158s
sys 6m12.759s
2. On my machine with 2 cores, 16GB RAM and an NVIDIA GTX 1070 Ti graphics card:
time python recurrent_network.py
/home/alice/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From recurrent_network.py:21: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/alice/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /home/alice/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /home/alice/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
WARNING:tensorflow:From /home/alice/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /home/alice/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From recurrent_network.py:77: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
2018-09-27 09:39:34.387663: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2
2018-09-27 09:39:34.488027: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-27 09:39:34.488501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1070 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.72GiB
2018-09-27 09:39:34.488518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-09-27 09:39:34.737643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 09:39:34.737678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0
2018-09-27 09:39:34.737685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N
2018-09-27 09:39:34.737899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7451 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Step 1, Minibatch Loss= 2.6665, Training Accuracy= 0.102
Step 200, Minibatch Loss= 2.1436, Training Accuracy= 0.273
Step 400, Minibatch Loss= 1.9958, Training Accuracy= 0.398
Step 600, Minibatch Loss= 1.8857, Training Accuracy= 0.391
Step 800, Minibatch Loss= 1.6399, Training Accuracy= 0.531
Step 1000, Minibatch Loss= 1.5933, Training Accuracy= 0.508
Step 1200, Minibatch Loss= 1.6511, Training Accuracy= 0.391
Step 1400, Minibatch Loss= 1.4492, Training Accuracy= 0.523
Step 1600, Minibatch Loss= 1.3485, Training Accuracy= 0.570
Step 1800, Minibatch Loss= 1.3464, Training Accuracy= 0.570
Step 2000, Minibatch Loss= 1.3131, Training Accuracy= 0.570
Step 2200, Minibatch Loss= 1.1193, Training Accuracy= 0.672
Step 2400, Minibatch Loss= 1.2354, Training Accuracy= 0.617
Step 2600, Minibatch Loss= 1.0234, Training Accuracy= 0.719
Step 2800, Minibatch Loss= 1.1582, Training Accuracy= 0.656
Step 3000, Minibatch Loss= 1.0689, Training Accuracy= 0.570
Step 3200, Minibatch Loss= 0.9953, Training Accuracy= 0.656
Step 3400, Minibatch Loss= 0.8416, Training Accuracy= 0.742
Step 3600, Minibatch Loss= 0.9553, Training Accuracy= 0.758
Step 3800, Minibatch Loss= 0.8145, Training Accuracy= 0.750
Step 4000, Minibatch Loss= 0.8229, Training Accuracy= 0.758
Step 4200, Minibatch Loss= 0.8903, Training Accuracy= 0.695
Step 4400, Minibatch Loss= 0.6680, Training Accuracy= 0.820
Step 4600, Minibatch Loss= 0.7681, Training Accuracy= 0.742
Step 4800, Minibatch Loss= 0.6999, Training Accuracy= 0.836
Step 5000, Minibatch Loss= 0.9245, Training Accuracy= 0.758
Step 5200, Minibatch Loss= 0.7183, Training Accuracy= 0.758
Step 5400, Minibatch Loss= 0.6794, Training Accuracy= 0.828
Step 5600, Minibatch Loss= 0.5801, Training Accuracy= 0.820
Step 5800, Minibatch Loss= 0.6639, Training Accuracy= 0.758
Step 6000, Minibatch Loss= 0.7383, Training Accuracy= 0.781
Step 6200, Minibatch Loss= 0.7214, Training Accuracy= 0.789
Step 6400, Minibatch Loss= 0.6905, Training Accuracy= 0.766
Step 6600, Minibatch Loss= 0.6147, Training Accuracy= 0.820
Step 6800, Minibatch Loss= 0.5964, Training Accuracy= 0.859
Step 7000, Minibatch Loss= 0.6019, Training Accuracy= 0.820
Step 7200, Minibatch Loss= 0.5961, Training Accuracy= 0.805
Step 7400, Minibatch Loss= 0.6146, Training Accuracy= 0.820
Step 7600, Minibatch Loss= 0.5122, Training Accuracy= 0.781
Step 7800, Minibatch Loss= 0.7244, Training Accuracy= 0.781
Step 8000, Minibatch Loss= 0.5255, Training Accuracy= 0.828
Step 8200, Minibatch Loss= 0.3798, Training Accuracy= 0.898
Step 8400, Minibatch Loss= 0.3982, Training Accuracy= 0.891
Step 8600, Minibatch Loss= 0.5743, Training Accuracy= 0.805
Step 8800, Minibatch Loss= 0.5829, Training Accuracy= 0.820
Step 9000, Minibatch Loss= 0.4793, Training Accuracy= 0.906
Step 9200, Minibatch Loss= 0.5694, Training Accuracy= 0.852
Step 9400, Minibatch Loss= 0.5455, Training Accuracy= 0.828
Step 9600, Minibatch Loss= 0.4215, Training Accuracy= 0.867
Step 9800, Minibatch Loss= 0.5300, Training Accuracy= 0.805
Step 10000, Minibatch Loss= 0.3755, Training Accuracy= 0.875
Optimization Finished!
Testing Accuracy: 0.8828125
real 1m22.601s
user 1m21.002s
sys 0m7.588s
In summary, comparing the real (wall-clock) time:
Machine             Config                                      Real time
-----------------   -----------------------------------------   ---------
Enterprise server   2 Xeon, 48 cores / 256GB RAM                5m24.519s
My machine          1 Celeron, 2 cores / 16GB, NVIDIA 1070 Ti   1m22.601s
My machine is about 4X faster than the enterprise server; given that Xeon cores are far more powerful than Celeron cores, 4X is impressive for such a modest machine.
Conclusion:
This is a typical image recognition application, and you can extrapolate from it: any image recognition workload, such as facial recognition, and any language processing that uses a neural net, will greatly benefit from having a GPU in place. Using only a CPU for deep learning is neither optimal nor a good use of hardware resources.