NVIDIA Hopper architecture H100 GPU 80GB
Jun 1, 2022

NVIDIA


Named after Grace Hopper, a pioneering computer scientist, the NVIDIA Hopper architecture will supersede the already powerful NVIDIA Ampere architecture.

The NVIDIA H100 is a data center GPU based on the new Hopper architecture, with 80 billion transistors on an 814 mm2 die.

The PCI Express 5.0 x16 version has 80 GB of HBM2e memory with a memory bandwidth of 2 TB/s.

It has 14,592 CUDA cores.

There will also be an SXM variant on the market with 16,896 cores and faster HBM3 memory.

The TDPs of the PCIe and SXM models are 350 W and 700 W, respectively.

The H100 is designed to speed up the training of artificial intelligence models.

For reference, its predecessor, the A100, has 54.2 billion transistors.

Manufacturer: NVIDIA

Original Series: Data Center Hopper

Launch Date: March 22nd, 2022

PCB Code: 180-1G520-DAAF-B01

P/N Code: 699-2G520-0201-200

Model: NVIDIA PG520 SKU 201

Graphics Processing Unit

GPU Model: GH100

Architecture: Hopper

Fabrication Process: 5 nm (TSMC 4N)

Die Size: 814 mm2

Transistors Count: 80B

Transistors Density: 98.3M TRAN/mm2

CUDAs: 16896

Tensor Cores: 528

SM: 132

GPCs: 8

Clocks

Base Clock: TBC MHz

Boost Clock: 1775 MHz

Memory Clock: 2400 MHz

Effective Memory Clock: 4800 Mbps

Memory Configuration

Memory Size: 81920 MB

Memory Type: HBM3

Memory Bus Width: 5120-bit

Memory Bandwidth: 3,072.0 GB/s
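
The derived figures in the list above follow from the raw ones. The Python sketch below reproduces them under the usual conventions: density is transistor count divided by die area, and peak bandwidth is bus width in bits times per-pin data rate, divided by 8. The function names are illustrative and do not come from any NVIDIA tool, and the double-data-rate factor is an assumption that happens to match the listed 2400 MHz and 4800 Mbps values.

```python
def transistor_density_m_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Millions of transistors per square millimetre."""
    return transistors / die_area_mm2 / 1e6


def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_mbps: float) -> float:
    """Peak bandwidth in GB/s: bits per transfer x transfers per second / 8 bits per byte."""
    return bus_width_bits * data_rate_mbps / 8 / 1000


# Assumption: double data rate, consistent with 2400 MHz -> 4800 Mbps in the list.
effective_rate_mbps = 2400 * 2

density = transistor_density_m_per_mm2(80e9, 814)
bandwidth = memory_bandwidth_gb_s(5120, effective_rate_mbps)

print(f"{density:.1f} M transistors/mm2")
print(f"{bandwidth:.1f} GB/s")
```

Both results agree with the listed 98.3M transistors/mm2 and 3,072.0 GB/s.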

An advanced chip H100

H100 features significant advances to accelerate AI, HPC, memory bandwidth, interconnect and communication, including nearly 5TB per second of external connectivity.

H100 is the first GPU to support PCIe Gen5 and the first to utilise HBM3, thus doubling the memory bandwidth to a staggering 3TB/s.

Even the L2 cache has been bumped up to 50MB (previously 40MB) to hold larger portions of datasets and models, reducing the number of accesses to HBM3 memory.

H100 and its predecessors:

NVIDIA Data Center GPUs compared

Graphics Card | H100 (SXM) | A100 (SXM) | Tesla V100
GPU | Hopper (GH100) | Ampere (GA100) | Volta (GV100)
Process | 4N (TSMC) | 7nm FinFET (TSMC) | 12nm FinFET
Die Size (mm2) | 814 | 826 | 815
Transistors | 80 billion | 54 billion | 21.1 billion
Streaming Multiprocessors (SM) | 132 | 108 | 80
CUDA cores (FP32) | 16896 | 6912 | 5120
Tensor Cores | 528 | 432 | 640
Tensor Performance (FP32) | 500 – 1000 TFLOPS | 156 – 312 TFLOPS | 120 TFLOPS
RT Cores | NIL | NIL | NIL
GPU boost clock speeds | TBD | 1410MHz | 1455MHz
GPU Memory | 80GB HBM3 | 40GB HBM2 | 16GB HBM2
Memory clock speed | TBD | 2.4Gbps | 1.75Gbps
Memory bus width | 5120-bit | 5120-bit | 4096-bit
Memory bandwidth | 3TB/s | 1.6TB/s | 900GB/s
Interconnect | 4th-gen NVLink (900GB/s) + PCIe 5.0 | 3rd-gen NVLink (600GB/s) + PCIe 4.0 | 2nd-gen NVLink (300GB/s) + PCIe 3.0
Multi-Instance GPU (MIG) support | 7 MIGs with confidential compute | 7 MIGs | -
GPU board form factor | SXM5 | SXM4 | SXM2
TDP | 700W | 400W | 300W
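
Dividing the core counts in the comparison above by the SM counts shows how each generation's SM is built. The sketch below is plain arithmetic on the table's figures; the dictionary layout and variable names are illustrative only.

```python
# Figures copied from the comparison table above.
gpus = {
    "H100 (SXM)": {"sms": 132, "cuda_cores": 16896, "tensor_cores": 528},
    "A100 (SXM)": {"sms": 108, "cuda_cores": 6912, "tensor_cores": 432},
    "Tesla V100": {"sms": 80, "cuda_cores": 5120, "tensor_cores": 640},
}

# (CUDA cores per SM, Tensor Cores per SM) for each GPU.
per_sm = {
    name: (spec["cuda_cores"] // spec["sms"], spec["tensor_cores"] // spec["sms"])
    for name, spec in gpus.items()
}

for name, (cuda, tensor) in per_sm.items():
    print(f"{name}: {cuda} CUDA cores/SM, {tensor} Tensor Cores/SM")
```

On these numbers the H100 doubles the FP32 CUDA cores per SM relative to the A100 (128 versus 64) while keeping four Tensor Cores per SM, so the Tensor Core gains come from faster cores rather than more of them per SM.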

Fourth-Gen Tensor Cores and updated Streaming Multiprocessor (SM) offer massive speed-up.

The H100’s fourth-generation Tensor Cores deliver at least three times the throughput of the A100’s.

The fourth-gen Tensor Core adds a new FP8 data format, which halves the data storage and doubles the throughput compared to the traditional FP16 data type. With an updated compiler that takes advantage of the new format, the H100 running FP8 can theoretically process workloads more than six times faster than the A100 can in FP16.
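
As a rough illustration of the arithmetic behind these claims, the sketch below compares the storage footprint of FP16 and FP8 values and multiplies the two speed-up factors quoted above. The one-billion-value model size is hypothetical, and treating the six-fold figure as the product of the 3x Tensor Core gain and the 2x FP8 gain is an assumption about how the number is composed, not something NVIDIA's documentation is quoted as saying here.

```python
BYTES_PER_VALUE = {"FP16": 2, "FP8": 1}


def footprint_gb(num_values: int, fmt: str) -> float:
    """Storage needed for num_values numbers in the given format, in GB."""
    return num_values * BYTES_PER_VALUE[fmt] / 1e9


params = 1_000_000_000  # hypothetical model size, for illustration only

fp16_gb = footprint_gb(params, "FP16")
fp8_gb = footprint_gb(params, "FP8")

tensor_core_gain = 3  # "at least three times" the A100's throughput (from the text)
fp8_gain = 2          # FP8 doubles throughput versus FP16 (from the text)
combined_gain = tensor_core_gain * fp8_gain  # assumed to be multiplicative

print(f"FP16: {fp16_gb} GB, FP8: {fp8_gb} GB, combined speed-up: {combined_gain}x")
```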

The Tensor Cores also manage data far more efficiently, saving up to 30% of operand delivery power. This is thanks to Distributed Shared Memory, which lets SMs within the same cluster communicate directly through shared memory blocks. On the A100, data had to be written to global memory before other SMs could use it; the direct SM-to-SM path on the H100 cuts that access latency by 7x.

Another contributor to more efficient data management is improved asynchronous execution. A new Asynchronous Transaction Barrier speeds up data exchange and memory copy functions, and the Tensor Memory Accelerator (TMA) does for Tensor Core data what a DMA engine does for memory in general.

Second-gen Secure Multi-Instance GPU (MIG)

For cloud computing, multi-tenant infrastructure translates directly to revenues and cost of service. The Ampere A100 was the first to sport Multi-Instance GPU (MIG) functionality, which partitions the GPU into seven independent instances, or seven virtual GPUs, each with its own resources (memory, cache, streaming multiprocessors) to tackle various workloads. The new Hopper H100 extends MIG capabilities by up to 7x over the previous generation by offering secure multi-tenant configurations in the cloud across every GPU instance. The A100 could do this on only a single GPU instance.

Confidential Computing

Sensitive data is often encrypted at-rest and in-transit over the network but unprotected during execution/use. Hopper architecture fixes this with secure data and AI model handling even during use.

Confidential computing was previously available only on CPUs; the H100 is the first accelerator to bring it to the GPU. This protects AI models and customer data while they are being processed. Customers can also apply confidential computing to federated learning for privacy-sensitive industries like healthcare and financial services, as well as on shared cloud infrastructures.

To achieve this, the H100 has a hardware firewall, an on-die root of trust, device attestation, and hardware-accelerated AES-256 encryption and decryption at full PCIe line rate. Confidential computing is ultimately a step towards supporting the zero-trust computing model.

Fourth-gen NVLink

As powerful as a single H100 GPU is, NVIDIA’s proprietary NVLink, a high-bandwidth, energy-efficient, low-latency, lossless GPU-to-GPU interconnect, is a crucial enabler of massive multi-node computational performance.

The H100’s fourth-generation NVLink maintains the same 25GB/s effective bandwidth in each direction per link, but the H100 has 18 NVLinks instead of the A100’s 12. Each link on the H100 also needs only two high-speed lanes instead of the four used on the A100. The net outcome is a 1.5x GPU-to-GPU bandwidth boost over third-gen NVLink, with 900GB/s of total multi-GPU IO and shared memory access.
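
The NVLink totals can be checked with a few lines of arithmetic. The sketch below uses only the figures from the paragraph above (25GB/s per link per direction, 12 versus 18 links, four versus two lanes per link); the function and variable names are illustrative.

```python
PER_LINK_GB_S_EACH_DIRECTION = 25  # same for third- and fourth-gen NVLink, per the text


def nvlink_total_gb_s(num_links: int) -> int:
    """Aggregate bidirectional bandwidth: links x per-direction rate x 2 directions."""
    return num_links * PER_LINK_GB_S_EACH_DIRECTION * 2


a100_total = nvlink_total_gb_s(12)
h100_total = nvlink_total_gb_s(18)
speedup = h100_total / a100_total

# Per-lane rate in each direction: same link bandwidth carried over fewer lanes.
a100_lane_gb_s = PER_LINK_GB_S_EACH_DIRECTION / 4
h100_lane_gb_s = PER_LINK_GB_S_EACH_DIRECTION / 2

print(f"A100: {a100_total} GB/s, H100: {h100_total} GB/s, ratio: {speedup}x")
print(f"Per lane: A100 {a100_lane_gb_s} GB/s, H100 {h100_lane_gb_s} GB/s")
```

The link count accounts for the whole 1.5x gain; halving the lanes per link means each lane runs twice as fast rather than each link carrying more data.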

(Image source: NVIDIA)

To accelerate the largest AI models, NVLink combines with a new external third-gen NVLink Switch to extend NVLink as a scale-up network beyond the server, connecting up to 256 H100 GPUs at 9x higher bandwidth versus the previous generation using NVIDIA HDR Quantum InfiniBand.

The new third-gen NVLink Switch is an impressive piece of standalone silicon. It resides both inside and outside nodes to connect multiple GPUs across servers, clusters and data center environments, and it offers 64 ports of fourth-gen NVLink for a total switch throughput of 1.6TB/s. That is a substantial step up from the first-gen NVSwitch that debuted four years earlier.

DGX A100 SuperPod vs. DGX H100 SuperPod deployment.
(source: NVIDIA)