Training and serving large transformer models at scale is fundamentally a memory management problem. Every GPU in a cluster has a fixed amount of VRAM, and as model sizes and context lengths grow, engineers constantly have to make trade-offs about how to distribute work across hardware. A new technique from Zyphra, called Tensor and Sequence Parallelism (TSP), offers a way to rethink that trade-off — and in benchmark tests on up to 1,024 AMD MI300X GPUs, it consistently delivers lower per-GPU peak memory than any of the standard parallelism schemes used today, for both training and inference workloads.
The Problem TSP Is Solving
To understand why TSP matters, it helps to first look at the two parallelism strategies it folds together.
Tensor Parallelism (TP) splits model weights across GPUs. If you have a weight matrix in an attention or MLP layer, each GPU in the TP group holds only a fraction of that matrix. This directly reduces the per-GPU memory occupied by parameters, gradients, and optimizer states — the ‘model state’ memory. The trade-off is that TP requires collective communication operations (typically all-reduce or reduce-scatter/all-gather pairs) every time a layer is computed. This communication is proportional to activation size, so it becomes increasingly expensive as sequence length grows.
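For intuition, here is a minimal PyTorch-style sketch of Megatron-style tensor parallelism for an MLP block. It is our own illustration, not Zyphra's or Megatron's code; the class name, shapes, and use of a plain all-reduce are assumptions. The point to notice is that the weights shrink by 1/tp per rank, but every layer pays an activation-sized collective.

```python
# Minimal sketch of Megatron-style tensor parallelism for an MLP block (illustrative).
# Assumes torch.distributed is already initialized; weight initialization is omitted.
import torch
import torch.distributed as dist
import torch.nn.functional as F

class TensorParallelMLP(torch.nn.Module):
    def __init__(self, hidden: int, ffn: int, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp = dist.get_world_size(tp_group)
        # Each rank holds only 1/tp of the up and down projection weights.
        self.w_up = torch.nn.Parameter(torch.empty(ffn // tp, hidden))    # column-parallel
        self.w_down = torch.nn.Parameter(torch.empty(hidden, ffn // tp))  # row-parallel

    def forward(self, x):                        # x: [batch, seq, hidden], replicated on all ranks
        act = F.gelu(x @ self.w_up.t())          # local shard of the FFN activations
        y = act @ self.w_down.t()                # partial output that must be summed
        dist.all_reduce(y, group=self.tp_group)  # activation-sized collective, every layer
        return y
```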
Sequence Parallelism (SP) takes a different approach. Instead of splitting weights, it splits the input token sequence across GPUs. Each GPU processes only a fraction of the tokens, which reduces activation memory and the quadratic cost of attention computation. However, SP leaves model weights fully replicated on every GPU, which means model-state memory stays exactly the same regardless of how many GPUs you add to the SP group.
In standard multi-dimensional parallelism, engineers combine TP and SP by placing them on orthogonal axes of a device mesh. If you want a TP degree of T and an SP degree of Σ, your model replica consumes T × Σ GPUs. This is expensive in two ways. First, it uses more GPUs for the model-parallel group, leaving fewer available for data-parallel replicas. Second, if T × Σ is large enough to span multiple nodes, some collective communication has to travel over slower inter-node interconnects like InfiniBand or Ethernet instead of the high-bandwidth intra-node fabric, such as AMD Infinity Fabric or NVIDIA NVLink. Data Parallelism (DP), the other common baseline, avoids these model-parallel costs entirely but replicates all model state on every device, making it impractical for large models or long contexts on its own.
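To make the layout cost concrete, here is a small sketch using PyTorch's DeviceMesh API; the API choice and the degrees T = 8, Σ = 4 are our own assumptions, not anything the paper prescribes.

```python
# Illustrative only: a 2-D TP x SP mesh pins T * Sigma GPUs to one model replica.
# Requires torch.distributed to be initialized across T * Sigma ranks.
from torch.distributed.device_mesh import init_device_mesh

T, SIGMA = 8, 4                                                           # hypothetical degrees
mesh = init_device_mesh("cuda", (T, SIGMA), mesh_dim_names=("tp", "sp"))
print(mesh.size())                                                        # 32 GPUs per replica
```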
What Folding Actually Means
TSP’s core idea is parallelism folding: instead of placing TP and SP on separate, orthogonal mesh dimensions, it collapses both onto a single device-mesh axis of size D. Every GPU in the TSP group simultaneously holds 1/D of the model weights and 1/D of the token sequence. Because both are sharded across the same D GPUs, the per-device memory footprint decreases by 1/D for both parameter memory and activation memory — something no single standard parallelism scheme achieves on its own. TSP is therefore the only scheme that simultaneously reduces weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis without requiring a two-dimensional T × Σ device layout.
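A back-of-the-envelope helper makes the footprint argument concrete. This is our own sketch, not the paper's accounting (which also covers optimizer states, KV caches, and so on), and the numbers in the example are hypothetical: given weight-proportional state W, activation memory A, and group size D, it shows which terms each scheme actually divides.

```python
# Our own rough model of per-GPU memory under each scheme; W and A in GB, D = group size.
def per_gpu_memory(W: float, A: float, D: int) -> dict:
    return {
        "DP":  W + A,          # everything replicated on every device
        "TP":  W / D + A,      # weights sharded, activations largely replicated
        "SP":  W + A / D,      # sequence (activations) sharded, weights replicated
        "TSP": W / D + A / D,  # folded: both terms shrink by 1/D on the same D GPUs
    }

# Hypothetical illustration: 84 GB of model state, 56 GB of activations, D = 8.
print(per_gpu_memory(84.0, 56.0, 8))
# {'DP': 140.0, 'TP': 66.5, 'SP': 91.0, 'TSP': 17.5}
```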
The challenge is that if each GPU only has part of the weights and part of the sequence, it needs to coordinate with other GPUs to complete each layer’s forward pass. TSP uses two different communication schedules to handle this, one for attention and one for the gated MLP.
For attention, TSP iterates over weight shards. At each step, one GPU broadcasts its packed attention weight shards (WQ, WK, WV, and WO) to all other GPUs in the group. Every GPU then applies those weights to its local sequence tokens to compute local Q, K, and V projections. Since causal attention requires access to the full key/value context, the local K and V tensors are all-gathered across the TSP group and reordered using a zigzag partition scheme before FlashAttention is applied. The zigzag partition ensures that causal attention workload is balanced across ranks, since later tokens attend to larger prefixes and would otherwise cause load imbalance.
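Below is a hedged, single-head sketch of that attention schedule. It is our own simplification rather than Zyphra's kernels: the shapes, the omission of the W_O projection, and the use of PyTorch's scaled_dot_product_attention as a stand-in for FlashAttention are all assumptions, and the zigzag reordering and per-rank causal offsets are only noted in comments.

```python
# Hedged sketch of the broadcast-based attention schedule (illustrative, single head).
# Each of the D ranks holds a 1/D shard of the packed W_Q/W_K/W_V rows and 1/D of the tokens.
# Requires an initialized torch.distributed process group.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tsp_attention(x_local, packed_qkv_shard, group):
    """x_local: [B, S/D, h]; packed_qkv_shard: [3 * h/D, h] (this rank's rows of W_Q, W_K, W_V)."""
    rank, D = dist.get_rank(group), dist.get_world_size(group)
    q_parts, k_parts, v_parts = [], [], []
    for src in range(D):
        # Step `src`: one rank broadcasts its packed weight shard, everyone applies it locally.
        w = packed_qkv_shard.clone() if src == rank else torch.empty_like(packed_qkv_shard)
        dist.broadcast(w, src=dist.get_global_rank(group, src), group=group)
        wq, wk, wv = w.chunk(3, dim=0)
        q_parts.append(x_local @ wq.t())          # each piece: [B, S/D, h/D]
        k_parts.append(x_local @ wk.t())
        v_parts.append(x_local @ wv.t())
    q = torch.cat(q_parts, dim=-1)                # full projection of the *local* tokens
    k_local, v_local = torch.cat(k_parts, dim=-1), torch.cat(v_parts, dim=-1)

    # Causal attention needs every key/value, so all-gather K and V across the group.
    # (The zigzag reordering that balances the causal workload is omitted here.)
    k_list = [torch.empty_like(k_local) for _ in range(D)]
    v_list = [torch.empty_like(v_local) for _ in range(D)]
    dist.all_gather(k_list, k_local, group=group)
    dist.all_gather(v_list, v_local, group=group)
    k_full, v_full = torch.cat(k_list, dim=1), torch.cat(v_list, dim=1)

    # Stand-in for FlashAttention; a real kernel would apply the per-rank causal offset
    # and finish with the W_O output projection, both omitted for brevity.
    return F.scaled_dot_product_attention(q, k_full, v_full, is_causal=True)
```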
For the gated MLP, TSP uses a ring schedule. Each GPU starts with local shards of the gate, up, and down projections. These weight shards circulate around the TSP group via point-to-point send/recv operations, and each GPU accumulates partial outputs locally as the shards arrive. Critically, this eliminates the all-reduce that standard TP requires for MLP output — the sequence stays local, and only the weights move. The ring is designed to overlap weight transfers with GEMM computation, so communication happens in the background while the GPU is computing.
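Here is a similarly hedged sketch of the ring schedule for the gated MLP, again our own simplification: the shard shapes and the ring_exchange helper are assumptions, and the exchange is written synchronously for clarity, whereas the design described above issues the sends and receives asynchronously so the next shard streams in while the current GEMMs run.

```python
# Hedged sketch of the ring schedule for the gated MLP (illustrative, not Zyphra's code).
# Weight shards circulate around the group while each rank accumulates the output for
# its own 1/D slice of the sequence; note that no all-reduce is needed at the end.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_exchange(t, nxt, prv, group):
    """Send this shard to the next rank, receive the previous rank's shard."""
    recv = torch.empty_like(t)
    reqs = dist.batch_isend_irecv([
        dist.P2POp(dist.isend, t.contiguous(), nxt, group=group),
        dist.P2POp(dist.irecv, recv, prv, group=group),
    ])
    for r in reqs:
        r.wait()
    return recv

def tsp_gated_mlp(x_local, gate_shard, up_shard, down_shard, group):
    """x_local: [B, S/D, h]; gate/up shards: [f/D, h]; down shard: [h, f/D]."""
    rank, D = dist.get_rank(group), dist.get_world_size(group)
    nxt = dist.get_global_rank(group, (rank + 1) % D)
    prv = dist.get_global_rank(group, (rank - 1) % D)
    y = torch.zeros_like(x_local)
    g, u, d = gate_shard, up_shard, down_shard
    for step in range(D):
        act = F.silu(x_local @ g.t()) * (x_local @ u.t())  # local gated activations [B, S/D, f/D]
        y = y + act @ d.t()                                # accumulate this shard's contribution
        if step < D - 1:                                   # pass the weight shards around the ring
            g, u, d = (ring_exchange(t, nxt, prv, group) for t in (g, u, d))
    return y
```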
Memory and Throughput Results
Tested on a single 8-GPU MI300X node across sequence lengths from 16K to 128K tokens, TSP achieves the lowest peak memory at every point. At 16K tokens, TSP and TP are nearly equivalent, 31.0 GB versus 31.5 GB per GPU, because model-state memory dominates at short context. At 128K tokens, the picture changes dramatically: TSP uses 38.8 GB per GPU, compared to 70.0 GB for TP and 85.0 GB and 140.0 GB for two different TP+SP factorizations on the same node. The theoretical figures throughout this research are based on a 7B dense decoder-only transformer reference model (hidden dimension h=4096, 32 layers, 32 query heads, 32 KV heads, FFN expansion factor F=4, bf16 precision), providing a reproducible baseline for comparing the schemes.
Throughput results on 128 full nodes (1,024 MI300X GPUs) show TSP consistently outperforming matched TP+SP baselines. At a folded degree of D=8 and a sequence length of 128K tokens, TSP achieves 173 million tokens per second compared to 66.3 million tokens per second for the matched TP+SP baseline (approximately a 2.6x speedup). The advantage grows with higher parallelism degree and longer sequence length.

Practical Trade-offs to Understand
TSP does increase total communication volume compared to TP alone. It adds a weight-movement term per layer on top of the same activation-proportional K/V all-gather that SP uses. However, the research team shows that when batch size B and sequence length S satisfy BS > 8h (where h is the model’s embedding dimension), TSP’s forward communication volume is competitive with TP’s. For the 7B reference model (h = 4096), that threshold is BS > 32,768 tokens, which even a single 128K-token sequence clears several times over. This condition is met in most long-context training and inference scenarios.
The key insight the Zyphra team emphasizes is that communication volume and communication cost are not the same thing. Whether extra communication volume translates into wall-clock slowdown depends on whether the collectives are latency-bound or bandwidth-bound, and how much of that traffic can be overlapped with matrix multiplication. Their implementation pipelines weight transfers behind dominant GEMM operations so that weight communication consumes bandwidth without adding to critical-path time.
TSP is not designed to replace TP, SP, or TP+SP in all settings. It is intended as an additional axis in the multi-dimensional parallelism design space. It composes orthogonally with pipeline parallelism, expert parallelism, and data parallelism. This means teams can slot TSP into an existing parallelism configuration wherever the standard layout would otherwise force model-parallel groups across slower inter-node links.
Key Takeaways
- Zyphra’s Tensor and Sequence Parallelism (TSP) folds tensor parallelism and sequence parallelism onto a single device-mesh axis, so each GPU simultaneously holds 1/D of the model weights and 1/D of the token sequence, reducing memory overhead for both training and inference.
- TSP is the only parallelism scheme that reduces both weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis, without requiring a two-dimensional T × Σ device mesh.
- Empirical results on a single 8-GPU MI300X node show TSP uses 38.8 GB per GPU at 128K sequence length, compared to 70.0 GB for TP and 85.0–140.0 GB for TP+SP configurations.
- At large scale (1,024 MI300X GPUs, 128K context, D=8), TSP achieves 173 million tokens per second versus 66.3 million tokens per second for a matched TP+SP baseline (approximately a 2.6x throughput advantage).
- TSP composes orthogonally with pipeline, expert, and data parallelism and is best suited for long-context, memory-constrained training and inference workloads where eliminating weight and activation replication outweighs the added communication volume.