HyperPath Bus - Using System RAM to Eliminate VRAM
This revision is from 2025/02/02 18:59.
Makes onboard VRAM redundant; with system RAM exposed as VRAM, up to 1 TB of memory becomes available for running LLMs.
Shared Graphics Memory: https://en.wikipedia.org/wiki/Shared_graphics_memory
Memory in the latest Nvidia GPUs, particularly the RTX 40 series:
- GDDR6X Memory: The primary VRAM in high-end consumer GPUs, offering per-pin speeds up to 21 Gbps. The RTX 4090, for example, uses GDDR6X on a 384-bit bus, achieving a bandwidth of 1008 GB/s at 21 Gbps.
- HBM2e Memory: Employed in data-center cards such as the A100, this memory type provides higher bandwidth through a much wider bus, with speeds up to 3.2 Gbps per pin.
- Bus Width Consideration: Total bandwidth is the product of bus width and per-pin speed, so a 384-bit bus at 21 Gbps delivers far more bandwidth than a narrower bus at the same per-pin rate.
In conclusion, the latest consumer-grade Nvidia GPUs, as of 2023, achieve per-pin speeds up to 21 Gbps using GDDR6X, while data-center models leverage HBM2e for even higher bandwidth through wider buses.
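The bandwidth arithmetic above can be checked with a one-line formula (bus width in bits times per-pin data rate, divided by 8 to get bytes); this is a minimal sketch, using the RTX 4090 figures quoted above:

```python
# Peak memory bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8.
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Return peak bandwidth in GB/s."""
    return bus_width_bits * data_rate_gbps / 8

# RTX 4090: 384-bit bus at 21 Gbps per pin.
print(peak_bandwidth_gbs(384, 21.0))  # 1008.0
```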
HyperPath Bus: The Bandwidth Engine
Physical Layer:
- 384-lane bus embedded in the PCB at 24 GT/s, carrying 3 bits per transfer per lane. (Note: strict PAM-3 signaling carries only log2 3 ≈ 1.58 bits per symbol; 3 bits per transfer implies a denser encoding such as PAM-8.)
- Bandwidth: 384 lanes × 24 GT/s × 3 bits ÷ 8 = 3.456 TB/s, well above the RTX 4090's 1.008 TB/s of GDDR6X bandwidth.
- Integrated into PCIe 5.0 x16 slots via 200 auxiliary pins, or mounted in a dedicated ZIF socket that bypasses PCIe entirely.
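A quick check of the bandwidth arithmetic, comparing the quoted 3 bits per transfer against strict PAM-3, which carries log2 3 ≈ 1.58 bits per symbol (a sketch; the lane count and transfer rate are the figures proposed above):

```python
import math

def bus_bandwidth_tbs(lanes: int, gts: float, bits_per_transfer: float) -> float:
    """Aggregate bus bandwidth in TB/s for a given lane count and encoding."""
    return lanes * gts * bits_per_transfer / 8 / 1000

# At 3 bits per transfer (the figure used above): 3.456 TB/s.
print(bus_bandwidth_tbs(384, 24, 3))
# Strict PAM-3 carries only log2(3) ~ 1.585 bits per symbol: ~1.83 TB/s.
print(bus_bandwidth_tbs(384, 24, math.log2(3)))
```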
Protocol:
- Direct RAM Mapping: GPU sees system RAM as contiguous VRAM space.
- GPU-Direct Access: Bypasses the CPU and PCIe protocol overhead. The GPU’s memory controller communicates directly with DDR5 via HyperPath, treating system RAM as its own memory pool.
- Burst Mode: Aggregates small GPU requests into large 512 B packets to maximize bus efficiency.
- Priority-Based Arbitration: Critical GPU requests (e.g., texture fetches) override CPU tasks to minimize latency.
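The Burst Mode idea above can be sketched as a greedy coalescer that packs small (offset, size) requests into 512-byte packets; all names here are illustrative, not part of any real HyperPath specification:

```python
# Hypothetical sketch of Burst Mode: coalesce small GPU reads into
# 512-byte packets before they hit the bus.
PACKET_SIZE = 512

def aggregate_requests(requests, packet_size=PACKET_SIZE):
    """Greedily pack (offset, size) requests into packets of <= packet_size bytes."""
    packets, current, used = [], [], 0
    for offset, size in requests:
        if used + size > packet_size and current:
            packets.append(current)  # flush the full packet
            current, used = [], 0
        current.append((offset, size))
        used += size
    if current:
        packets.append(current)
    return packets

# Eight 128-byte texture fetches fit into two 512-byte packets.
reqs = [(i * 128, 128) for i in range(8)]
print(len(aggregate_requests(reqs)))  # 2
```

A real controller would also merge adjacent offsets and respect alignment; this sketch only shows the size-based batching.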
Advanced Buffering for Low Latency
- Caching
- Lossless memory compression
- Distributed training and inference cache
- Layer-streamed inference (e.g., AirLLM) becomes practical: https://github.com/lyogavin/airllm
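AirLLM's core trick is keeping only one transformer layer's weights resident at a time, so peak memory is one layer rather than the whole model. A toy sketch of that loop, with hypothetical load/evict callbacks standing in for real weight I/O:

```python
# Toy sketch of layer-streamed inference in the spirit of AirLLM:
# only one layer is resident at a time. All names are illustrative.
def run_layered(x, num_layers, load_layer, evict_layer):
    for i in range(num_layers):
        layer = load_layer(i)   # pull layer i's weights from system RAM / disk
        x = layer(x)            # forward pass through this layer alone
        evict_layer(i)          # free it before loading the next
    return x

# Demo with trivial "layers" that each add 1 to the activation.
store = {i: (lambda v: v + 1) for i in range(4)}
result = run_layered(0, 4, lambda i: store[i], lambda i: None)
print(result)  # 4
```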
GPU Card Redesign: "VRAM-Less" Architecture
The GPU card becomes a pure processor:
- Could still utilize on-die caches (e.g., L1) to hide HyperPath latency.
- No GDDR Chips: Removes all VRAM, reducing PCB complexity, cost, and power draw.
- HyperPath PHY Layer: A dedicated chip on the GPU converts memory requests into HyperPath signals.
- Unified Memory Controller (UMC): Directly maps system RAM addresses to GPU memory space.
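The UMC's address mapping can be sketched as a fixed window of system RAM reserved as the GPU's pool, with GPU-visible addresses translated by a simple base offset; the base address and window size below are made-up illustrative values:

```python
# Minimal sketch of a Unified Memory Controller mapping: GPU addresses
# are offsets into a window of system RAM reserved for the GPU.
GPU_WINDOW_BASE = 0x1_0000_0000   # where the GPU's pool starts in system RAM
GPU_WINDOW_SIZE = 64 << 30        # 64 GiB carved out as "VRAM"

def gpu_to_system_addr(gpu_addr: int) -> int:
    """Translate a GPU-visible address to a system RAM physical address."""
    if not 0 <= gpu_addr < GPU_WINDOW_SIZE:
        raise ValueError("address outside GPU memory window")
    return GPU_WINDOW_BASE + gpu_addr

print(hex(gpu_to_system_addr(0x2000)))  # 0x100002000
```

A real UMC would translate through page tables (as IOMMUs do) rather than a single linear window, but the contiguous mapping matches the "Direct RAM Mapping" property described above.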