Named after Grace Hopper, a pioneering computer scientist, the NVIDIA Hopper architecture will supersede the already powerful NVIDIA Ampere architecture.
The NVIDIA H100 is a data center GPU based on the latest Hopper architecture, packing 80 billion transistors into an 814 mm2 die.
The PCI Express 5.0 x16 version has 80 GB of HBM2e memory with 2 TB/s of memory bandwidth.
It has 14,592 CUDA cores.
There will also be an SXM variant on the market with 16,896 cores and faster HBM3 memory.
The TDPs of the two models are 350W and 700W, respectively.
Both are designed to speed up the training of artificial-intelligence models.
For reference, its predecessor, the A100, had 54.2 billion transistors.
Original Series: Data Center Hopper
Launch Date: March 22nd, 2022
PCB Code: 180-1G520-DAAF-B01
P/N Code: 699-2G520-0201-200
Model: NVIDIA PG520 SKU 201
Graphics Processing Unit: GPU
Fabrication Process: 5 nm (TSMC N4)
Die Size: 814 mm2
Transistor Count: 80B
Transistor Density: 98.3M TRAN/mm2
Tensor Cores: 528
GPCs: 8
Clocks
Base Clock: TBC MHz
Boost Clock: 1775 MHz
Memory Clock: 2400 MHz
Effective Memory Clock: 4800 Mbps
Memory Size: 81920 MB
Memory Type: HBM3
Memory Bus Width: 5120-bit
Memory Bandwidth: 3,072.0 GB/s
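As a sanity check, the headline bandwidth and density figures above follow from simple arithmetic on the other listed specs. A minimal sketch in Python:

```python
# Derive H100 (SXM) memory bandwidth from the spec-sheet figures above.
bus_width_bits = 5120          # HBM3 memory bus width
effective_rate_gbps = 4.8      # effective memory clock: 4800 Mbps per pin

# Bandwidth = pins * bits per second per pin, converted to bytes.
bandwidth_gbs = bus_width_bits * effective_rate_gbps / 8
print(bandwidth_gbs)           # 3072.0 GB/s, matching the listed figure

# Transistor density is simply transistor count over die area.
density_m_per_mm2 = 80_000 / 814   # 80B transistors, in millions per mm2
print(round(density_m_per_mm2, 1)) # 98.3 M TRAN/mm2, as listed
```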
H100: an advanced chip
H100 features significant advances to accelerate AI, HPC, memory bandwidth, interconnect and communication, including nearly 5TB per second of external connectivity.
H100 is the first GPU to support PCIe Gen5 and the first to utilise HBM3, thus doubling the memory bandwidth to a staggering 3TB/s.
Even the L2 cache has been bumped up to 50MB (previously 40MB) to fit larger portions of datasets and models, reducing trips to HBM3 memory.
H100 and its predecessors:
| Graphics Card | H100 (SXM) | A100 (SXM) | Tesla V100 |
|---|---|---|---|
| Die Size (mm2) | 814 | 826 | 815 |
| Transistors | 80 billion | 54 billion | 21.1 billion |
| Streaming Multiprocessors (SM) | 132 | 108 | 80 |
| CUDA cores (FP32) | 16,896 | 6,912 | 5,120 |
| Tensor Performance (FP32) | 500 – 1000 TFLOPS | 156 – 312 TFLOPS | 120 TFLOPS |
| GPU boost clock speed | TBD | 1410MHz | 1455MHz |
| GPU Memory | 80GB HBM3 | 40GB HBM2 | 16GB HBM2 |
| Memory clock speed | TBD | 2.4Gbps | 1.75Gbps |
| Memory bus width | 5120-bit | 5120-bit | 4096-bit |
| Interconnect | 4th-gen NVLink (900GB/s) + PCIe 5.0 | 3rd-gen NVLink (600GB/s) + PCIe 4.0 | 2nd-gen NVLink (300GB/s) + PCIe 3.0 |
| Multi-Instance GPU (MIG) support | 7 MIGs with confidential compute | 7 MIGs | — |
| GPU board form factor | SXM5 | SXM4 | SXM2 |
Fourth-gen Tensor Cores and an updated Streaming Multiprocessor (SM) offer a massive speed-up.
The H100’s fourth-generation Tensor Cores deliver at least three times the throughput of the A100’s.
The fourth-gen Tensor Core adds a new FP8 data format, which halves data storage and doubles throughput compared with the traditional FP16 data type. With a compiler updated to take advantage of the new format, the H100 can theoretically crunch workloads in FP8 more than six times faster than the A100 can in FP16.
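The storage claim is easy to make concrete. A minimal sketch in Python (the parameter count is a hypothetical example, not a measured figure):

```python
# Illustrative arithmetic: FP8 vs FP16 storage, and values delivered
# at a fixed memory bandwidth. The parameter count is hypothetical.
num_params = 175_000_000_000       # e.g. a large language model

fp16_gb = num_params * 2 / 1e9     # FP16: 2 bytes per value
fp8_gb = num_params * 1 / 1e9      # FP8: 1 byte per value
print(fp16_gb, fp8_gb)             # 350.0 GB vs 175.0 GB: half the storage

# At a fixed memory bandwidth (3,072 GB/s from the table above),
# halving bytes per value doubles the values delivered per second.
bandwidth_gbs = 3072
fp16_values_per_s = bandwidth_gbs * 1e9 / 2
fp8_values_per_s = bandwidth_gbs * 1e9 / 1
print(fp8_values_per_s / fp16_values_per_s)   # 2.0x in raw value throughput
```

The further speed-up beyond 2x comes from the Tensor Cores themselves, not from bandwidth alone.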
The Tensor Core also boasts far more efficient data management, saving up to 30% in operand-delivery power. This is thanks to Distributed Shared Memory, which enables direct SM-to-SM communication using shared memory blocks within the same cluster. It offers a 7x reduction in latency when accessing other SM units on the H100, compared with the A100, where data had to be written to global memory before other SM blocks could use it.
Yet another feature contributing to more efficient data management is improved asynchronous execution: a new Asynchronous Transaction Barrier speeds up data exchange and memory-copy operations, while the Tensor Memory Accelerator (TMA) does for Tensor Cores what DMA does for memory.
Second-gen Secure Multi-Instance GPU (MIG)
Confidential computing was previously only available on CPUs; the H100 is the first GPU to bring it to accelerated computing. This protects AI models and customer data while they are being processed. Customers can also apply confidential computing to federated learning in privacy-sensitive industries like healthcare and financial services, as well as on shared cloud infrastructure.
For one, it has a hardware firewall, an on-die root of trust, device attestation, and hardware-accelerated AES-256 encryption and decryption at the full PCIe line rate. At the end of the day, confidential computing is a step towards supporting the zero-trust computing model.
As powerful as a single H100 GPU is, NVIDIA’s scalable GPU architecture relies on NVLink, its proprietary high-bandwidth, energy-efficient, low-latency, lossless GPU-to-GPU interconnect, as a crucial enabler of massive multi-node computational performance.
The H100 boasts a fourth-generation NVLink that maintains the same 25GB/s effective bandwidth in each direction, but instead of 12 NVLinks on the A100, the H100 has 18 NVLinks. Also, the H100 now has only two high-speed lanes to create a single link instead of four on the A100. The net outcome is a 1.5x GPU-to-GPU bandwidth boost over the third-gen NVLink with 900GB/s total multi-GPU IO and shared memory access on the fourth-gen NVLink.
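The 1.5x figure follows directly from the link counts, since per-link bandwidth is unchanged. A quick sanity check:

```python
# Aggregate NVLink bandwidth = links * 25 GB/s per direction * 2 directions.
per_link_gbs = 25                    # effective bandwidth per link, each way

a100_total = 12 * per_link_gbs * 2   # third-gen NVLink: 12 links
h100_total = 18 * per_link_gbs * 2   # fourth-gen NVLink: 18 links

print(a100_total, h100_total)        # 600 vs 900 GB/s
print(h100_total / a100_total)       # 1.5x GPU-to-GPU bandwidth boost
```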
To accelerate the largest AI models, NVLink combines with a new external third-gen NVLink Switch to extend NVLink as a scale-up network beyond the server, connecting up to 256 H100 GPUs at 9x higher bandwidth versus the previous generation using NVIDIA HDR Quantum InfiniBand.
The new third-gen NVLink Switch is an impressive piece of standalone silicon. Residing inside and outside nodes to connect multiple GPUs across servers, clusters and data center environments, this new NVSwitch boasts 64 ports of fourth-gen NVLinks to accelerate multi-GPU connectivity in a big way, for a total of 1.6TB/s of switch throughput: far more powerful than the first-gen NVSwitch that debuted four years ago.
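The quoted switch throughput is consistent with the port count: 64 ports at the fourth-gen NVLink's 25GB/s per direction. A sketch:

```python
# NVSwitch throughput (per direction) from its port count.
ports = 64                 # fourth-gen NVLink ports on the new NVSwitch
per_port_gbs = 25          # GB/s per port, per direction

throughput_tbs = ports * per_port_gbs / 1000
print(throughput_tbs)      # 1.6 TB/s, matching the quoted figure
```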