MOUNTAIN VIEW, CA – Google's approach to large language model (LLM) training, particularly on its Tensor Processing Units (TPUs), predominantly leverages Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP) rather than Pipeline Parallelism (PP). The choice was highlighted in a recent social media post by "Rosinality": "This is why Google commonly just uses FSDP + TP, while on GPU PP is much preferred over it. I wonder how Google reached this design decision much earlier." The answer stems from the unique architectural design of TPUs and Google's tightly integrated software stack.
Google's TPUs are designed with a high-bandwidth, low-latency Inter-Chip Interconnect (ICI) network, forming a 2D or 3D torus topology. This contrasts with GPU architectures that often rely on NVLink within nodes and InfiniBand for inter-node communication. The fast ICI on TPUs, with bidirectional bandwidths reaching up to 1.2 TBps on newer generations like Ironwood (TPU v7p), makes FSDP and TP highly efficient by enabling rapid data exchange between chips. FSDP, which shards model parameters, gradients, and optimizer states across devices, benefits significantly from this fast interconnect, as it requires frequent all-gather and reduce-scatter operations.
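In JAX, the framework Google's TPU training stacks build on, the gather-at-use pattern at the heart of FSDP can be sketched in a few lines; the mesh axis name, matrix sizes, and device layout below are illustrative assumptions, not a production configuration.
    # Minimal FSDP-style sketch in JAX: parameters live sharded across chips and are
    # all-gathered over the interconnect just before use (axis names and sizes are assumptions).
    from functools import partial
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, PartitionSpec as P
    from jax.experimental.shard_map import shard_map

    # One mesh axis spanning every chip; on a TPU slice this maps onto the ICI torus.
    mesh = Mesh(jax.devices(), axis_names=("fsdp",))

    @partial(shard_map, mesh=mesh,
             in_specs=(P("fsdp", None), P(None, None)),
             out_specs=P(None, None))
    def fsdp_matmul(w_shard, x):
        # Each chip stores only 1/N of the rows of w; the all-gather reassembles the
        # full matrix for the matmul, and its transpose in the backward pass is the
        # corresponding reduce-scatter that returns gradients to their shards.
        w_full = jax.lax.all_gather(w_shard, "fsdp", axis=0, tiled=True)
        return x @ w_full

    w = jnp.zeros((8192, 1024))   # row count assumed divisible by the number of chips
    x = jnp.ones((16, 8192))
    y = fsdp_matmul(w, x)         # (16, 1024), replicated on every chip
In a full FSDP setup the gradients and optimizer state are kept sharded in the same way, which is why the scheme leans so heavily on the fast collectives the ICI provides.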
Tensor Parallelism, which splits individual model layers across multiple TPUs, likewise thrives on the high intra-pod communication bandwidth, allowing the matrix multiplications at the core of LLMs to be computed efficiently in parallel. Google's XLA compiler reinforces both strategies by generating highly efficient, hardware-specific code that schedules computation and communication together, minimizing overhead and maximizing utilization. XLA's ahead-of-time (AoT) compilation and its targeting of the TPU's systolic-array architecture are a large part of why these techniques are Google's default.
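A tensor-parallel layer follows the same pattern but splits the weight itself; the sketch below (again with assumed axis names and toy dimensions) shards a matmul's output dimension across chips and leaves the collective choices to XLA's GSPMD partitioner.
    # Minimal tensor-parallel sketch in JAX: the weight's output dimension is split
    # across a "model" mesh axis, so each chip computes its own slice of the layer.
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, PartitionSpec as P, NamedSharding

    mesh = Mesh(jax.devices(), axis_names=("model",))

    # Column-parallel weight: each chip holds 1/N of the output columns
    # (column count assumed divisible by the number of chips).
    w = jax.device_put(jnp.zeros((4096, 16384)), NamedSharding(mesh, P(None, "model")))
    x = jax.device_put(jnp.ones((16, 4096)), NamedSharding(mesh, P(None, None)))

    @jax.jit
    def layer(x, w):
        # XLA partitions the matmul according to the input shardings and inserts
        # whatever ICI collectives are needed wherever the sharding has to change.
        return jax.nn.relu(x @ w)   # output stays sharded over the "model" axis

    y = layer(x, w)
    print(y.sharding)               # NamedSharding over the "model" axis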
In contrast, Pipeline Parallelism, which divides a model layer-wise into sequential stages across devices, often introduces "pipeline bubbles": idle time while data flows through the stages. PP remains attractive on GPU clusters, where a model too large for a single GPU or node must be spread across a comparatively slow inter-node fabric and each stage needs only point-to-point transfers of activations to its neighbor. On TPUs, however, the fast, direct chip-to-chip communication within a pod makes FSDP and TP the more performant choice for many LLM workloads. The ability of FSDP to scale memory efficiency without requiring significant model modifications also contributes to its appeal for massive models.
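The cost of those bubbles is easy to estimate: for a simple GPipe-style schedule with p stages and m microbatches per step, the idle fraction is (p - 1) / (m + p - 1), as the short helper below illustrates (assuming equal-cost stages and ignoring more elaborate interleaved schedules).
    # Back-of-the-envelope pipeline-bubble estimate (illustrative assumptions:
    # equal-cost stages, plain GPipe-style fill-and-drain schedule).
    def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
        # Fraction of the step each stage spends idle while the pipeline
        # fills and drains: (p - 1) / (m + p - 1).
        return (num_stages - 1) / (num_microbatches + num_stages - 1)

    print(bubble_fraction(4, 16))   # ~0.16: roughly 16% of the step is idle
    print(bubble_fraction(8, 16))   # ~0.30: deeper pipelines need far more microbatches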
Google's Cloud TPU Multislice Training, which supports scaling LLM training across tens of thousands of TPU chips, exemplifies this strategy. For instance, in November 2023, Google conducted a distributed training job across over 50,000 Cloud TPU v5e chips, utilizing FSDP and TP for models up to 128 billion parameters. This extensive deployment underscores Google's early and continued commitment to optimizing its hardware and software for these specific parallelism techniques, allowing for unprecedented scale and efficiency in LLM development.