PCIe, SLI, and NVLink

PCIe versions:

  • PCIe 3.0: 8 GT/s per lane
  • PCIe 4.0: 16 GT/s per lane
  • PCIe 5.0: 32 GT/s per lane
  • PCIe 6.0: 64 GT/s per lane
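
The transfer rates above are raw signaling rates, not usable bandwidth. As a rough sketch, the conversion to GB/s for PCIe 3.0 through 5.0 can be done by applying the 128b/130b line-encoding overhead and dividing by 8 bits per byte. The function names here are illustrative only, and PCIe 6.0 is left out on purpose because it uses PAM4 signaling with FLIT-based encoding, so this simple formula does not apply to it.

```python
# Raw signaling rate per lane in GT/s, as listed above.
GT_PER_S = {"3.0": 8, "4.0": 16, "5.0": 32}

# PCIe 3.0 through 5.0 use 128b/130b encoding: 128 payload bits per 130 bits sent.
ENCODING_EFFICIENCY = 128 / 130


def lane_bandwidth_gb_s(version: str) -> float:
    """Approximate one-direction bandwidth of a single lane, in GB/s."""
    return GT_PER_S[version] * ENCODING_EFFICIENCY / 8  # 8 bits per byte


def link_bandwidth_gb_s(version: str, lanes: int) -> float:
    """Approximate one-direction bandwidth of a link with the given lane count."""
    return lane_bandwidth_gb_s(version) * lanes


if __name__ == "__main__":
    for version in GT_PER_S:
        print(
            f"PCIe {version}: {lane_bandwidth_gb_s(version):.3f} GB/s per lane, "
            f"{link_bandwidth_gb_s(version, 16):.1f} GB/s at x16"
        )
```

These are theoretical ceilings; protocol overhead (packet headers, flow control) lowers the achievable figure somewhat in practice.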

SLI (Scalable Link Interface):

  • Older technology for linking multiple GPUs
  • Limited to 2-4 GPUs
  • Lower bandwidth compared to NVLink

NVLink:

  • High-bandwidth interconnect for GPU-to-GPU communication
  • Much higher per-GPU bandwidth than a PCIe x16 link or an SLI bridge
  • Supports up to 8 GPUs (depends on GPU model)

Impact on LLM training vs. inference:

Training:

  • Requires high bandwidth for data transfer between GPUs
  • NVLink is preferable for multi-GPU setups
  • Higher PCIe versions beneficial for CPU-GPU data transfer

Inference:

  • Generally less demanding on inter-GPU communication
  • PCIe is often sufficient, especially for single-GPU setups
  • NVLink can still improve performance in multi-GPU inference scenarios
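
To get a feel for why training is more sensitive to inter-GPU bandwidth, here is a minimal back-of-the-envelope sketch. It assumes gradients are synchronized with a ring all-reduce, in which each GPU sends roughly 2*(N-1)/N times the payload over its link, and it ignores latency and overlap with compute. The 14 GB payload (about 7 billion fp16 parameters) and the 300 GB/s "fast link" figure are illustrative assumptions, not specifications of any particular GPU.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int, link_gb_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce over the given per-GPU link."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    # Each GPU sends about 2*(N-1)/N of the payload during the reduce and gather phases.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_s


if __name__ == "__main__":
    payload_gb = 14.0  # assumed gradient size, e.g. ~7e9 parameters * 2 bytes
    links = {
        "PCIe 4.0 x16 (~31.5 GB/s)": 31.5,
        "hypothetical 300 GB/s GPU link": 300.0,  # illustrative placeholder
    }
    for name, gb_s in links.items():
        seconds = ring_allreduce_seconds(payload_gb, num_gpus=4, link_gb_s=gb_s)
        print(f"{name}: {seconds:.2f} s per gradient sync")
```

Inference usually moves far less data per step than a full gradient synchronization, which is consistent with PCIe often being sufficient there.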

Maximal PCIe slot configuration:

The realistic speed of each physical slot depends on how its lanes reach the CPU: directly, or through the chipset.

  • Most consumer CPUs provide 16-24 PCIe lanes
  • High-end desktop (HEDT) and server CPUs can offer 40-128 lanes
  • The chipset provides an additional 4-24 lanes, depending on the chipset

Lanes wired directly to the CPU run at the full per-lane rate of their generation:

  • PCIe 3.0: ~985 MB/s per lane
  • PCIe 4.0: ~1.97 GB/s per lane
  • PCIe 5.0: ~3.94 GB/s per lane

An x16 slot running at x16 has 16 times the bandwidth of a single lane.

Chipset lanes converge at the chipset, which then connects to the CPU over a single shared uplink that is narrower than a direct x16 CPU link (half the width in the following example). For instance, the Intel Z690 chipset (for 12th/13th gen CPUs) uses DMI 4.0 x8, providing up to ~15.75 GB/s. Because the chipset serves most other devices in the system, and they all share that chipset-to-CPU link, it is more likely to become congested. The link between a PCIe device and the chipset itself still runs at full bandwidth.
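
The shared-uplink bottleneck can be estimated with a small sketch like the one below. It assumes the chipset uplink behaves like a PCIe link of the same generation and width (so DMI 4.0 x8 is treated as PCIe 4.0 x8), uses the per-lane figures quoted above, and takes a made-up set of chipset-attached devices. All names are hypothetical.

```python
# Approximate per-lane bandwidth in GB/s, using the figures quoted in the text.
LANE_GB_S = {"3.0": 0.985, "4.0": 1.97, "5.0": 3.94}


def link_gb_s(version: str, lanes: int) -> float:
    """Approximate one-direction bandwidth of a PCIe link."""
    return LANE_GB_S[version] * lanes


def uplink_oversubscription(device_links, uplink_version: str, uplink_lanes: int) -> float:
    """Ratio of the peak demand behind the chipset to the uplink capacity.

    device_links is a list of (pcie_version, lane_count) tuples.
    A result above 1.0 means the devices cannot all run at full speed at once.
    """
    demand = sum(link_gb_s(version, lanes) for version, lanes in device_links)
    return demand / link_gb_s(uplink_version, uplink_lanes)


if __name__ == "__main__":
    # Hypothetical chipset-attached devices: two PCIe 4.0 x4 SSDs and one PCIe 3.0 x4 card.
    devices = [("4.0", 4), ("4.0", 4), ("3.0", 4)]
    ratio = uplink_oversubscription(devices, uplink_version="4.0", uplink_lanes=8)
    print(f"Oversubscription ratio: {ratio:.2f}")
```

A ratio at or below 1.0 only says the uplink is not the limit; it does not account for other traffic (USB, SATA, networking) that typically shares the same path.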

As a result, a maximal PCIe configuration totals roughly 48-54 lanes: about 24 lanes wired to the CPU at full bandwidth, plus about 24 chipset lanes that together get only the equivalent of an x8 link when communicating with the CPU.

For example: 1 slot at x16 plus 4 slots at x8 (48 lanes), or 3 slots at x16 (48 lanes), or some other combination that totals 48-54 lanes.
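
A quick way to sanity-check a proposed slot layout against a lane budget is to add up the electrical widths. The sketch below is only an illustration: the 54-lane ceiling comes from the range given above, and the layouts and function names are made up.

```python
from typing import Sequence


def total_lanes(slot_widths: Sequence[int]) -> int:
    """Sum the electrical lane widths of all slots in a layout."""
    return sum(slot_widths)


def within_budget(slot_widths: Sequence[int], max_lanes: int = 54) -> bool:
    """True if the layout needs no more lanes than the platform can provide."""
    return total_lanes(slot_widths) <= max_lanes


if __name__ == "__main__":
    layouts = {
        "1 x16 + 4 x8": [16, 8, 8, 8, 8],
        "3 x16": [16, 16, 16],
        "4 x16": [16, 16, 16, 16],  # deliberately over budget
    }
    for name, widths in layouts.items():
        status = "fits" if within_budget(widths) else "exceeds budget"
        print(f"{name}: {total_lanes(widths)} lanes, {status}")
```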

Some caveats: full PCIe bandwidth can still be achieved in cases such as:

  • Direct Memory Access (DMA): devices read and write system memory directly, without CPU intervention, at full PCIe link speed (e.g., PCIe 4.0 x16 at ~31.5 GB/s)
  • GPU-to-GPU and NVMe SSD-to-GPU transfers: these run at the full PCIe link speed of the slower device
  • GPUDirect RDMA (for NVIDIA GPUs): allows direct communication between GPUs and other PCIe devices (e.g., network adapters)

However, motherboard design is vendor-specific: it determines how many lanes the board exposes, how many of the CPU's PCIe lanes it actually uses, and so on.

Essentially, a motherboard's practical maximum is 3 slots at x16 speed. Almost always, the fine print for the physical x16 slots reads something like ***4xPCIe3.0 x16 Slots(8/NA/16/16 or 8/8/8/16 or NA/4/8/16(Skylake-X 28 Lanes) or NA/4/4/8(Kabylake-X), 1xPCIe3.0x1 slots***, where each slash-separated value is the electrical width of one slot and NA presumably marks a slot that is unavailable with that CPU.

An in-depth manual might list: Four (4) PCI Express 3.0 x16 slots (CPU1 Slot1/Slot3, CPU2 Slot4/Slot5), One (1) PCI Express 3.0 x8 slot (CPU2 Slot6), One (1) PCI Express 3.0 x4 in x8 slot (CPU1 Slot2).
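
Slot fine print of the "8/NA/16/16" kind can be read mechanically. The sketch below assumes each slash-separated field is the electrical lane width of one physical slot and that "NA" means the slot is not usable in that configuration; that reading is an interpretation of the spec-sheet shorthand, and the function names are hypothetical.

```python
from typing import List, Optional


def parse_slot_widths(spec: str) -> List[Optional[int]]:
    """Turn a string like '8/NA/16/16' into per-slot lane widths (None for NA)."""
    widths: List[Optional[int]] = []
    for field in spec.split("/"):
        field = field.strip()
        widths.append(None if field.upper() == "NA" else int(field))
    return widths


def electrical_lanes(spec: str) -> int:
    """Total lanes actually wired across all usable slots."""
    return sum(width for width in parse_slot_widths(spec) if width is not None)


if __name__ == "__main__":
    for spec in ["8/NA/16/16", "8/8/8/16", "NA/4/8/16", "NA/4/4/8"]:
        print(f"{spec}: widths={parse_slot_widths(spec)}, total={electrical_lanes(spec)} lanes")
```

Under this reading, the NA/4/8/16 configuration adds up to 28 lanes, which matches the "Skylake-X 28 Lanes" note in the fine print.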

  
