AI's Hidden Bottleneck: Networking
Why does networking matter in AI? What does a good product look like? Where are the opportunities?
Welcome to Infinite Curiosity, a weekly newsletter that explores the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week.
AI is often hyped as a story of algorithms, models, and vast amounts of data. But there's a critical component that's quietly being ignored: networking. As you go from a single GPU to a massive cluster for training, networking quickly turns into the biggest bottleneck. It's actually one of the most crucial determinants of your training speed, cost, and ability to scale AI workloads. And yet, it's far too often treated as an afterthought. The companies that know this are building enormous moats around this capability.
How is networking even relevant in AI?
Training large AI models is not just about shoveling data into GPUs. It's about moving colossal amounts of data between GPUs. And these GPUs are often distributed across different servers. These servers are referred to as nodes. This is where networking requirements get demanding. Why?
Because without the right kind of network, your multi-GPU setup can quickly devolve into an underutilized, latency-ridden mess.
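To make this concrete, here's a minimal sketch of the communication step at the heart of data-parallel training. It assumes a machine with multiple NVIDIA GPUs, PyTorch, and NCCL; the tensor size, filename, and launch command are illustrative:

```python
# Minimal sketch of the gradient synchronization step in data-parallel
# training. Launch with, e.g.:
#   torchrun --nproc_per_node=4 allreduce_demo.py
import os

import torch
import torch.distributed as dist

def main():
    # NCCL is the standard backend for GPU-to-GPU communication.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Stand-in for one shard of gradients: 100M fp16 values (~200 MB).
    grads = torch.randn(100_000_000, dtype=torch.float16, device="cuda")

    # Every GPU sends and receives this buffer on every training step.
    # Once GPUs span multiple nodes, this traffic crosses the network,
    # and the network decides how long the GPUs sit waiting.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```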
High Throughput Isn't a Luxury. It's a Necessity.
Training foundation models requires rapid data transfer. The network must support extremely high throughput to avoid becoming the bottleneck. And if you're operating at anything less than 100 Gbps, you’re already behind. For larger clusters, we're talking about pushing the ceiling up to 800 Gbps or beyond.
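To put rough numbers on that, consider the gradient traffic in one training step. The model size, precision, and step time below are assumptions, and real systems overlap communication with compute, but the order of magnitude is the point:

```python
# Back-of-envelope: network traffic per GPU per training step for
# data-parallel training with a ring all-reduce, which moves roughly
# 2x the gradient volume through each GPU. Illustrative assumptions.
params = 70e9                         # 70B-parameter model
bytes_per_grad = 2                    # fp16 gradients
grad_bytes = params * bytes_per_grad  # ~140 GB of gradients
traffic_bytes = 2 * grad_bytes        # ring all-reduce, large GPU count

step_time_s = 10.0                    # assumed budget per optimizer step
required_gbps = traffic_bytes * 8 / step_time_s / 1e9
print(f"~{required_gbps:.0f} Gbps per GPU just for gradient sync")
# -> ~224 Gbps: a 100 Gbps link is already the bottleneck
```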
Low Latency or Bust
AI workloads are highly latency-sensitive. Synchronization between GPUs needs to be near-instantaneous. And the reality is that even small increases in latency can cripple performance. You need microsecond-level latencies.
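A simplified model shows why. A ring all-reduce across N GPUs takes roughly 2(N-1) sequential communication phases, and each phase pays the network's latency before any payload moves. The GPU count and latencies below are illustrative:

```python
# Why microseconds matter: each of the 2*(N-1) phases of a ring
# all-reduce pays the network's latency. Illustrative numbers; real
# collectives use tree/hierarchical algorithms to cut the phase count.
n_gpus = 1024
phases = 2 * (n_gpus - 1)

for latency_us in (2, 50):  # per-hop latency: tuned fabric vs. sloppy one
    overhead_ms = phases * latency_us / 1000
    print(f"{latency_us} us/hop -> {overhead_ms:.0f} ms of pure latency per all-reduce")
# 2 us/hop  -> ~4 ms per synchronization
# 50 us/hop -> ~102 ms, before a single byte of payload is counted
```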
Scalability Isn’t Optional. It’s Survival.
As model sizes explode, so does the GPU count. But more GPUs won't save you if your network can't scale to match. A solid networking setup has to handle scaling from a few GPUs to thousands while maintaining both low latency and high throughput.
RDMA: The Underappreciated Workhorse
RDMA stands for Remote Direct Memory Access. And it happens to be a super critical requirement. You need to enable direct GPU-to-GPU communication without involving the CPU. Why?
Because RDMA slashes overhead and latency. Without it, your GPUs are spending too much time waiting and too little time working.
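In practice, this is surfaced through your communication library. NCCL (the library behind most multi-GPU training) exposes RDMA and GPUDirect through environment variables; here's a hedged sketch of the relevant knobs, with the caveat that the right values depend on your fabric and NIC layout:

```python
# NCCL environment knobs that control whether GPU-to-GPU traffic uses
# RDMA (InfiniBand/RoCE) and GPUDirect. Set these before initializing
# the process group. Values here are illustrative, not universal.
import os

os.environ["NCCL_IB_DISABLE"] = "0"       # allow the InfiniBand/RoCE transport
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"  # permit GPUDirect RDMA when GPU and
                                          # NIC share a PCIe host bridge
os.environ["NCCL_DEBUG"] = "INFO"         # log which transport NCCL picked

# ... then dist.init_process_group(backend="nccl") as usual.
# In the INFO logs, mentions of NET/IB and GDRDMA indicate the traffic
# is going over RDMA rather than plain TCP sockets.
```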
What does a good networking product look like within the context of AI infrastructure?
Many features are needed to make this work well. But I've highlighted 5 key ones below:
1. Bandwidth and Density: More is More
Good AI networking hardware doesn't just offer high bandwidth; it offers a LOT of high-bandwidth ports. Supporting multiple 100, 400, or 800 Gbps connections is key here. Your network should never be the reason your GPUs are sitting idle.
2. Low Latency Switching: The Killer Feature
Switches need to be microsecond-level or better. Your GPUs need to synchronize rapidly during distributed training. If not, you are underutilizing all that expensive hardware.
3. Support for RDMA and GPUDirect
Without RDMA, you're playing with one hand tied behind your back. And if you're using NVIDIA GPUs, GPUDirect is essential for maximizing performance across nodes.
4. Programmability and Real-Time Telemetry
With the right programmable switches, you can tailor data flows to your specific workload. Real-time monitoring is crucial for detecting and diagnosing network bottlenecks on the fly. You can't improve what you can't measure (see the telemetry sketch after this list).
5. Flexible Scalability: Plan for Growth or Fail
AI workloads don't shrink. They only grow! If your networking hardware isn't designed to scale effortlessly, you'll hit a hard wall on model size and GPU count.
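On the telemetry point above: the simplest useful signal is link utilization, computed from two samples of a port's byte counter. A minimal sketch; `read_tx_bytes` is a hypothetical stand-in for however your switch actually exposes counters (SNMP, gNMI, a vendor REST API):

```python
# Estimate link utilization by sampling a port's TX byte counter twice.
# `read_tx_bytes` is hypothetical: wire it up to SNMP, gNMI, or your
# switch vendor's API.
import time

LINK_CAPACITY_BPS = 400e9  # assumed 400 Gbps port

def read_tx_bytes(port: str) -> int:
    """Hypothetical: return the cumulative TX byte counter for a port."""
    raise NotImplementedError("connect this to your switch's telemetry API")

def utilization(port: str, interval_s: float = 1.0) -> float:
    before = read_tx_bytes(port)
    time.sleep(interval_s)
    after = read_tx_bytes(port)
    bps = (after - before) * 8 / interval_s
    return bps / LINK_CAPACITY_BPS

# Sustained utilization near 1.0 on an inter-node link during training
# is exactly the "GPUs idle, network saturated" signal to alert on.
```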
Who are the major networking providers for AI clusters?
The amount of engineering and resources it takes to build a good networking product is astounding. That's why the industry is fairly concentrated in the hands of a few large companies.
NVIDIA (Mellanox): NVIDIA is a juggernaut in AI networking, thanks to the Mellanox acquisition. InfiniBand is the gold standard for low-latency, high-throughput networking. And features like GPUDirect RDMA make it the go-to choice for serious AI practitioners.
Arista Networks: Arista provides high-performance Ethernet switches tailored for AI workloads. They prioritize low-latency, high-throughput networking and scale well for dense GPU setups.
Broadcom and Intel: These are chip powerhouses. While not as specialized as NVIDIA, Broadcom and Intel bring high-performance Ethernet solutions to the table. And their chips are found in much of the data center infrastructure handling AI workloads. They offer the kind of scalability and throughput that AI training demands.
Cisco and Juniper Networks: Traditional networking giants like Cisco and Juniper have pivoted to offer AI-friendly networking solutions. Their Ethernet-based hardware is now tailored to accommodate the demands of AI workloads, but the real question is: are they innovating fast enough to keep up?
Where are the opportunities for improvement?
There are a few key areas within networking that people are actively working on:
Faster Networking Technologies: It's A Relentless March
With model sizes increasing exponentially, current standards like 100-400 Gbps aren't going to cut it much longer. The industry needs 800 Gbps, 1.6 Tbps, and beyond to keep up.
Dynamic Congestion Control: Don't Let Your Network Choke
With distributed systems, congestion is inevitable. We need smarter congestion control that can dynamically prioritize and adjust traffic.
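The flavor used on RDMA fabrics today (DCQCN on RoCE, for example) reacts to ECN marks from switches: back off the send rate sharply when packets get marked, recover gradually otherwise. A toy sketch of that idea, with illustrative constants rather than a tuned implementation:

```python
# Toy DCQCN-flavored rate control: multiplicative decrease on ECN
# marks, additive increase otherwise. Constants are illustrative.
LINE_RATE_GBPS = 400.0

def next_rate(rate_gbps: float, ecn_marked: bool) -> float:
    if ecn_marked:
        return rate_gbps * 0.5                   # back off fast on congestion
    return min(rate_gbps + 5.0, LINE_RATE_GBPS)  # probe back up gradually

rate = LINE_RATE_GBPS
for marked in (False, False, True, False, False, False):
    rate = next_rate(rate, marked)
    print(f"{rate:.0f} Gbps")
# One mark halves the rate; recovery takes many round trips. Smarter
# schemes adapt these constants to live queue depths.
```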
Advanced Network Topologies Aren't Just for Fun
Implementing topologies like Fat-Tree (https://en.wikipedia.org/wiki/Fat_tree) and Clos networks (https://en.wikipedia.org/wiki/Clos_network) can drastically reduce network latency and congestion. They provide non-blocking paths, which are critical for efficient GPU communication.
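The appeal is easy to quantify. A fat-tree built from k-port switches gives every host a non-blocking path to every other host, and the standard sizing formulas show how far a given switch radix scales:

```python
# Standard k-ary fat-tree sizing: k pods, each with k/2 edge and k/2
# aggregation switches, (k/2)^2 core switches, and k^3/4 hosts total,
# all at full bisection bandwidth.
for k in (16, 32, 64):
    hosts = k**3 // 4
    core = (k // 2) ** 2
    print(f"k={k}: {hosts:,} hosts, {core} core switches, {k} pods of {k} switches")
# k=64 (a realistic radix) already supports 65,536 hosts non-blocking.
```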
Data Flow Management
Network complexity grows with model complexity. Advanced networking should employ AI to manage data transfer intelligently. This will help optimize for current network conditions and training needs.
Edge Networking for Federated Learning
With federated learning on the rise, networking will need to support decentralized edge-to-cloud data flow. This will involve not just Ethernet, but 5G and beyond.
Security Protocols
Training large models often involves sensitive data. And the network must ensure data privacy and security through robust encryption, secure multi-party computation, and access control.
If you're a founder or an investor who has been thinking about this, I'd love to hear from you. I’m at prateek at moxxie dot vc.
If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 friend who's curious about AI.