Inside Our AI-Enhanced NVLink Fabric: 2.4 Petabits per Second of Pure Throughput

November 22, 2024 • IWS Network Engineering • 10 min read

Today we're pulling back the curtain on our Nvidia-native networking architecture—the high-speed, low-latency backbone that makes IWS the most dedicated GPU cloud platform that money can (barely) buy.

The Challenge of Scale

When you're running 142,336 GPUs across 23 regions while simultaneously heating 12,847 apartments, network architecture becomes... complex. Traditional data center networking simply doesn't cut it. You need something more. Something AI-enhanced.

Our Nvidia-Native Approach

Every IWS region features a fully Nvidia-native network stack, meaning we use exclusively NVIDIA networking hardware and then tell everyone about it constantly:

NVSwitch 4.0: 256-port switches with 14.4 TB/s aggregate bandwidth
NVLink 4.0: 900 GB/s bidirectional GPU-to-GPU
ConnectX-7: 400 Gb/s InfiniBand adapters
BlueField-3 DPUs: For that extra layer of enterprise buzzword compliance

The AI-Enhanced Part

You might be wondering: what makes our network "AI-enhanced"? An excellent question, and one we were hoping you wouldn't ask.

Our network is AI-enhanced because:

AI workloads run on it (therefore it is enhanced by AI)
We use ML-based traffic prediction (a moving average, technically)
Our monitoring dashboards have neural network icons on them
The network team's Slack bot uses ChatGPT for @channel notifications
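
For the curious, the "ML-based traffic prediction" in the list above can be sketched in a few lines. This is a minimal illustration of a plain moving average, not our production code: the class name, the window size, and the Gb/s sample values are all made up for the example.

```python
from collections import deque


class MovingAveragePredictor:
    """Hypothetical predictor: the next sample is guessed as the mean of the last `window` samples."""

    def __init__(self, window: int = 3):
        if window < 1:
            raise ValueError("window must be at least 1")
        # A bounded deque discards the oldest sample automatically once it is full.
        self.samples = deque(maxlen=window)

    def observe(self, gbps: float) -> None:
        """Record one traffic measurement (units are whatever you feed it; Gb/s here)."""
        self.samples.append(gbps)

    def predict(self) -> float:
        """Return the mean of the retained samples as the forecast for the next interval."""
        if not self.samples:
            raise ValueError("no samples observed yet")
        return sum(self.samples) / len(self.samples)


predictor = MovingAveragePredictor(window=3)
for sample in [100.0, 200.0, 300.0, 400.0]:  # illustrative values only
    predictor.observe(sample)

print(predictor.predict())  # mean of the last three samples: 300.0
```

The bounded deque is the whole trick: once the window is full, each new sample pushes out the oldest one, so the forecast only ever reflects recent traffic. No neurons are involved.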

Topology Deep Dive

Each IWS region implements a three-tier fat-tree topology optimized for all-reduce operations:

Tier 1 - GPU Pods: 8 GPUs connected via NVLink in a fully connected mesh. Each pod is a single thermal unit, piping heat to approximately 0.3 apartments.

Tier 2 - SuperPods: 32 pods (256 GPUs) connected via NVSwitch. The NVSwitch generates additional heat, which we route to apartment building common areas.

Tier 3 - HyperPods: 16 SuperPods (4,096 GPUs) connected via InfiniBand. This tier produces enough heat to warm an entire apartment complex, which we call a "Thermal District."
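
The tier sizes above multiply out as follows. This is a back-of-the-envelope sketch that uses only the numbers quoted in the three tiers. The variable names are ours, and the link count takes the "fully connected mesh" description literally (one link per GPU pair), which is an assumption about the wiring rather than a statement of it.

```python
# Figures quoted in the tier descriptions.
GPUS_PER_POD = 8
PODS_PER_SUPERPOD = 32
SUPERPODS_PER_HYPERPOD = 16
APARTMENTS_PER_POD = 0.3  # approximate, per the Tier 1 description

gpus_per_superpod = GPUS_PER_POD * PODS_PER_SUPERPOD
gpus_per_hyperpod = gpus_per_superpod * SUPERPODS_PER_HYPERPOD
pods_per_hyperpod = PODS_PER_SUPERPOD * SUPERPODS_PER_HYPERPOD

# Assumption: a full mesh of n nodes has n * (n - 1) / 2 pairwise links.
nvlink_pairs_per_pod = GPUS_PER_POD * (GPUS_PER_POD - 1) // 2

# Rough heating footprint of one HyperPod, counting pod heat only
# (switch and InfiniBand heat are not quantified in the text).
apartments_per_hyperpod = pods_per_hyperpod * APARTMENTS_PER_POD

print(f"GPUs per SuperPod:        {gpus_per_superpod}")
print(f"GPUs per HyperPod:        {gpus_per_hyperpod}")
print(f"GPU pairs per pod (mesh): {nvlink_pairs_per_pod}")
print(f"Apartments per HyperPod:  {apartments_per_hyperpod:.1f}")
```

The 256 and 4,096 figures match the ones quoted in Tiers 2 and 3, which is reassuring for everyone involved.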

Latency Optimizations

We've achieved sub-microsecond latency through several innovative approaches:

Co-location with heat exchangers: Shorter cable runs to the heating system mean shorter cable runs to everything
Aggressive buffer management: We drop packets before they can experience latency (this is a joke, please don't worry)
Custom RDMA verbs: Written by engineers who really wanted to put "custom RDMA verbs" on their resumes

High Throughput Architecture

Our aggregate throughput of 2.4 Pbps per region is achieved through:

Parallelism (many wires)
Speed (fast wires)
Marketing (calling it "AI-enhanced throughput")
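
To put "many wires" in perspective, here is a rough sizing sketch. It assumes, purely for illustration, that the 2.4 Pbps figure is the sum of 400 Gb/s port line rates (the ConnectX-7 rate listed earlier), counted in one direction. How the number is actually tallied is not specified, so treat the result as an order-of-magnitude estimate.

```python
# Integer arithmetic in bits per second keeps the division exact.
REGION_THROUGHPUT_BPS = 2_400 * 10**12  # 2.4 Pbps per region
PORT_RATE_BPS = 400 * 10**9             # 400 Gb/s per port (assumed unit of aggregation)

# Ceiling division: a fractional port still needs a whole wire.
ports_needed = -(-REGION_THROUGHPUT_BPS // PORT_RATE_BPS)

print(f"400 Gb/s ports needed for 2.4 Pbps: {ports_needed}")
```

Under that assumption the answer is 6,000 ports running flat out, which is indeed many wires.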

Heat Integration

Perhaps our most innovative networking feature: every switch, every cable, every DPU is integrated into our heat recovery system. Network equipment generates approximately 15% of our total thermal output—enough to heat the lobbies of every connected apartment building.

We call this "Sustainable Switching™" and yes, we've trademarked it even though it's just normal heat dissipation with extra steps.