Deep Learning: Workstation PC with GTX Titan Vs Server with NVIDIA Tesla V100 Vs Cloud Instance
Jan 17, 2018


Selection of a Workstation for Deep Learning

GPU:

GPUs are the heart of deep learning. The computations involved in deep learning are matrix operations that run in parallel.

Best GPU overall: NVIDIA Titan Xp, GTX Titan X (Maxwell)
Cost-efficient but expensive: GTX 1080 Ti, GTX 1070, GTX 1080
Cost-efficient and cheap: GTX 1060 (6GB)

The memory bandwidth of the GPU determines how quickly it can operate on large batches of data. CUDA cores are small compute units whose threads allow the matrix operations to run in parallel, and therefore faster.

The CUDA toolkit is effectively the only choice for the deep learning practitioner, so AMD graphics cards will not help much here.
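Before committing to a particular card, it is worth confirming that the framework you plan to use can actually see a CUDA device. A minimal sketch, assuming PyTorch is installed (TensorFlow offers an equivalent check):

```python
# List the CUDA devices visible to the framework, with memory and SM count.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, "
              f"{props.total_memory / 1024**3:.1f} GB, "
              f"{props.multi_processor_count} SMs")
else:
    print("No CUDA device found - training will fall back to the CPU.")
```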

PCIe Lanes (Minimum 2 Slots):

PCIe lanes determine the maximum bandwidth available for the graphics card to communicate with the CPU.

A GPU would require 16 PCIe lanes to work at its full capacity.

A workstation with at least 24 PCIe lanes is needed to keep data flowing to the GPUs; otherwise disk access operations become a bottleneck when an SSD is used.

The HP Z820, for example, provides a total of nine graphics and I/O slots, including three PCIe 3.0 x16 slots for graphics cards. System configurations can support up to three cards totaling 160W with the standard 850W power supply.

Generally, an x8 PCIe 3.0 link already offers more bandwidth than any gaming card needs, so 16 lanes for dual cards or 24 lanes for triple cards is preferable.
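A back-of-the-envelope check of why x8 per card is usually sufficient, using approximate effective PCIe 3.0 bandwidths (assumed here as roughly 15.8 GB/s for x16 and 7.9 GB/s for x8) and a hypothetical batch of 256 ImageNet-sized images:

```python
# Time to copy one training batch from host memory to the GPU over PCIe 3.0.
batch_bytes = 256 * 3 * 224 * 224 * 4        # 256 float32 images, 3x224x224 (~154 MB)
for lanes, gb_per_s in [(16, 15.8), (8, 7.9)]:
    ms = batch_bytes / (gb_per_s * 1e9) * 1e3
    print(f"x{lanes}: ~{ms:.1f} ms per batch")
# ~10 ms at x16 and ~20 ms at x8 - small compared to the compute time of a
# typical training step, so x8 per card rarely becomes the bottleneck.
```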

Processors (Minimum 4Cores):

The number of cores and threads per core in the CPU matters for data preprocessing and for keeping the GPU fed with data. An Intel Xeon E5-1620 processor is a reasonable choice for a GPU-based workstation.
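How those CPU cores get used in practice is mostly through the data-loading pipeline. A minimal sketch, assuming PyTorch and torchvision are installed (the dataset path is hypothetical):

```python
# Use several CPU cores to preprocess and feed data to the GPU.
import torch
from torchvision import datasets, transforms

train_set = datasets.ImageFolder(
    "/data/train",                      # hypothetical dataset location
    transform=transforms.Compose([transforms.Resize(224),
                                  transforms.CenterCrop(224),
                                  transforms.ToTensor()]))
loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=4,      # one worker per physical core is a common starting point
    pin_memory=True)    # pinned host memory speeds up host-to-GPU copies
```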

RAM (64 GB Preferred):

The size of the RAM decides how much of the dataset you can hold in memory; aim for modules with a minimum clock speed of 2400 MHz.
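To get a feel for the numbers, a rough estimate of the memory footprint of an ImageNet-scale dataset (the image count and sizes below are assumptions for illustration):

```python
# Rough RAM footprint if the whole dataset were held decoded in memory.
images = 1_200_000                               # assumed ImageNet-like dataset
bytes_float32 = 3 * 224 * 224 * 4                # decoded float32 image tensor
bytes_uint8 = 3 * 224 * 224                      # decoded uint8 image tensor
for label, b in [("float32", bytes_float32), ("uint8", bytes_uint8)]:
    print(f"{label}: {images * b / 1024**3:.0f} GB")
# float32: ~673 GB, uint8: ~168 GB -- far more than 64 GB, which is why large
# datasets are streamed from the SSD and only the working set is cached in RAM.
```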

Storage (2TB):

256GB SSD for the OS and the datasets currently in use

2TB HDD (7200 rpm) for miscellaneous user data

Power Supply Unit (PSU):

The power supply should cover the combined power draw of the CPU and the GPUs, plus about 100 watts of headroom. If you plan to add more GPUs later, add another 100 watts of headroom per GPU and choose a PSU sized for that requirement from the start.
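A quick sizing example following that rule; the TDP figures below are rough assumptions and should be checked against the actual spec sheets:

```python
# PSU sizing: CPU TDP + GPU TDPs + ~100 W headroom per GPU (assumed figures).
cpu_tdp = 140                  # e.g. a Xeon E5-1620-class CPU (~130-140 W)
gpu_tdp = 250                  # e.g. a GTX 1080 Ti (~250 W)
num_gpus = 2
headroom = 100 * num_gpus

required = cpu_tdp + num_gpus * gpu_tdp + headroom
print(f"Choose a PSU rated for at least {required} W")   # -> 840 W for this build
```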

==============================================================================================

GPU-Optimized Servers for NVIDIA Tesla V100 GPUs

For maximum acceleration of highly parallel applications such as artificial intelligence (AI), deep learning, autonomous vehicle systems, energy, and engineering/science workloads, a server with NVIDIA Tesla V100 (Volta) GPUs and next-generation NVIDIA NVLink is optimized for overall performance.

NVLink is a high-bandwidth interconnect developed by NVIDIA to link GPUs together, allowing them to work in parallel and exchange data much faster than over the PCIe bus.

Selection of a Server with NVIDIA Tesla V100

The server adds the NVIDIA Tesla V100, which provides Tensor Cores for deep learning matrix-multiply acceleration.

CPU: Intel Xeon Scalable Gold 6130 processor (22M cache, 2.10 GHz) with the Intel C620 series chipset; the example here is the Dell EMC PowerEdge C4140.

MEMORY: 384GB DDR4 (12 x 32GB DDR4)
GPU: NVIDIA Tesla V100 SXM2 x 8 | Tesla P100 SXM2 x 8
OS: Ubuntu 16.04 x64
Driver: 384.81
CUDA: version 9
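To put several GPUs to work on a single training job, the simplest approach is data parallelism, where each card processes a slice of the batch. A minimal sketch, assuming PyTorch; the model and input are placeholders:

```python
# Spread one training job across all visible GPUs with data parallelism.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))

if torch.cuda.device_count() > 1:
    # DataParallel splits each batch across GPUs; on NVLink systems the
    # GPU-to-GPU copies can take the fast NVLink path instead of PCIe.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(512, 1024, device=device)     # placeholder batch
out = model(x)
print(out.shape, "computed on", torch.cuda.device_count(), "GPU(s)")
```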

Deep Learning Hardware DGX-1 with V100

Most deep learning frameworks make use of a specific library called cuDNN (the CUDA Deep Neural Network library), which is specific to NVIDIA GPUs.
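Frameworks pick up cuDNN automatically when it is installed; a small sketch (PyTorch shown here, as an assumption) to confirm it is active and to let cuDNN auto-tune its convolution algorithms:

```python
# Confirm cuDNN is available and enable its autotuner for fixed input shapes.
import torch

print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
torch.backends.cudnn.benchmark = True   # pick the fastest conv algorithm per shape
```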

SYSTEM SPECIFICATIONS

GPUs: 8x Tesla V100
GPU Memory: 128 GB total

CPU: Dual 20-core Intel Xeon E5-2698 v4, 2.2 GHz

NVIDIA CUDA Cores: 40,960

NVIDIA Tensor Cores: 5,120

System Memory: 512 GB 2,133 MHz DDR4 LRDIMM

Storage: 4x 1.92 TB SSD (RAID 0)

Network: Dual 10 GbE, 4 IB EDR

Software: Ubuntu Linux Host OS

GPU Comparison:

GPU                   Architecture  Tensor Cores  CUDA Cores     Memory         Memory Bandwidth  Memory Type  Interconnect Bandwidth
Quadro GP100          Pascal        0             3584           16GB           717GB/s           HBM2         32GB/s
Titan V               Volta         640           5120           12GB           653GB/s           HBM2         32GB/s
Tesla K40             Kepler        0             2880           12GB           288GB/s           GDDR5        32GB/s
Tesla K80             Kepler        0             2496 per GPU   12GB per GPU   240GB/s per GPU   GDDR5        32GB/s
Tesla M40             Maxwell       0             3072           24GB           288GB/s           GDDR5        32GB/s
Tesla P100 (PCI-E)    Pascal        0             3584           12GB or 16GB   540 or 720GB/s    HBM2         32GB/s
Tesla P100 (NVLink)   Pascal        0             3584           16GB           720GB/s           HBM2         160GB/s
Tesla V100 (PCI-E)    Volta         640           5120           16GB           900GB/s           HBM2         32GB/s
Tesla V100 (NVLink)   Volta         640           5120           16GB           900GB/s           HBM2         300GB/s
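The Tensor Cores in the Volta parts are only exercised when the matrix multiplies and convolutions run in reduced precision. A minimal sketch, assuming PyTorch on a Volta GPU (V100 or Titan V); full mixed-precision training additionally needs loss scaling, for example via NVIDIA's Apex library:

```python
# Run a convolution in FP16 so it is eligible for Tensor Core acceleration.
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().half()
x = torch.randn(32, 64, 56, 56, device="cuda", dtype=torch.float16)
y = conv(x)
print(y.dtype, y.shape)
```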

==============================================================================================

Selection of Cloud Instances

Amazon EC2

For entry-level exploration of deep learning problems, a local workstation or server gives you more control than EC2 instances.

Amazon EC2 instances

  1. The cost of an EC2 reserved instance will be very high for entry-level practitioners.
  2. AWS EC2 spot instances are cheaper but can be reclaimed, so you must set up the environment to back up and restore your data and training progress (see the sketch after this list).
  3. Amazon EC2 P3 instances, with NVIDIA Volta GPUs, are a good fit for researchers. They let users tackle large problems while eliminating difficult, time-consuming DIY software integration.
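A minimal checkpointing sketch for the spot-instance case, assuming PyTorch; the checkpoint path is hypothetical and would typically sit on a volume synced to durable storage such as S3:

```python
# Periodic checkpointing so a spot-instance interruption does not lose progress.
import os
import torch

CKPT = "/mnt/checkpoints/model.pt"      # hypothetical path, e.g. synced to S3

def save_checkpoint(model, optimizer, epoch):
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT)

def load_checkpoint(model, optimizer):
    if os.path.exists(CKPT):
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1       # resume from the next epoch
    return 0                            # fresh start
```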

Google Cloud

Google Compute Engine offers second-generation Tensor Processing Units (TPUs), which are optimized to both train and run machine learning models.

Each Tensor Processing Unit includes a custom high-speed network that allows Google to build machine learning supercomputers called TPU pods. These pods contain 64 second-generation TPUs and provide up to 11.5 petaflops to accelerate the training of a single large machine learning model. TensorFlow Lite, part of the TensorFlow open source project, will let developers use machine learning in their mobile apps.
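As a quick arithmetic check of that figure, each second-generation TPU delivers roughly 180 teraflops:

```python
# 64 second-generation TPUs at ~180 teraflops each.
tpus = 64
tflops_per_tpu = 180
print(tpus * tflops_per_tpu / 1000, "petaflops")   # -> 11.52, i.e. ~11.5 petaflops
```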

NVIDIA GPU Cloud

NVIDIA GPU Cloud empowers AI researchers with performance-engineered AI containers featuring deep learning software such as TensorFlow, PyTorch, MXNet, and TensorRT. These pre-integrated, GPU-accelerated containers include the NVIDIA CUDA runtime, NVIDIA libraries, and an operating system.