Training and serving large transformer models at scale is fundamentally a memory management problem. Every GPU in a cluster has a fixed amount of VRAM, and as model sizes and context lengths grow, engineers constantly have to make trade-offs about how to distribute work across hardware. A new technique from Zyphra, called Tensor and Sequence Parallelism (TSP), offers a way to rethink that trade-off — and in benchmark tests on up to 1,024 AMD MI300X GPUs, it consistently delivers lower per-GPU peak memory than any of the standard parallelism schemes used today, for both training and inference workloads.
The Problem TSP Is Solving
To understand why TSP matters, it helps to first look at the two parallelism strategies it folds together.
Tensor Parallelism (TP) splits model weights across GPUs. If you have a weight matrix in an attention or MLP layer, each GPU in the TP group holds only a fraction of that matrix. This directly reduces the per-GPU memory occupied by parameters, gradients, and optimizer states — the ‘model state’ memory. The trade-off is that TP requires collective communication operations (typically all-reduce or reduce-scatter/all-gather pairs) every time a layer is computed. This communication is proportional to activation size, so it becomes increasingly expensive as sequence length grows.
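For intuition, here is a minimal PyTorch-style sketch of Megatron-style tensor parallelism for an MLP block. It is our own illustration, not Zyphra's or Megatron's code; the class name, shapes, and use of a plain all-reduce are assumptions. The point to notice is that the weights shrink by 1/tp per rank, but every layer pays an activation-sized collective.

```python
# Minimal sketch of Megatron-style tensor parallelism for an MLP block (illustrative).
# Assumes torch.distributed is already initialized; weight initialization is omitted.
import torch
import torch.distributed as dist
import torch.nn.functional as F

class TensorParallelMLP(torch.nn.Module):
    def __init__(self, hidden: int, ffn: int, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp = dist.get_world_size(tp_group)
        # Each rank holds only 1/tp of the up and down projection weights.
        self.w_up = torch.nn.Parameter(torch.empty(ffn // tp, hidden))    # column-parallel
        self.w_down = torch.nn.Parameter(torch.empty(hidden, ffn // tp))  # row-parallel

    def forward(self, x):                        # x: [batch, seq, hidden], replicated on all ranks
        act = F.gelu(x @ self.w_up.t())          # local shard of the FFN activations
        y = act @ self.w_down.t()                # partial output that must be summed
        dist.all_reduce(y, group=self.tp_group)  # activation-sized collective, every layer
        return y
```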
Sequence Parallelism (SP) takes a different approach. Instead of splitting weights, it splits the input token sequence across GPUs. Each GPU processes only a fraction of the tokens, which reduces activation memory and the quadratic cost of attention computation. However, SP leaves model weights fully replicated on every GPU, which means model-state memory stays exactly the same regardless of how many GPUs you add to the SP group.
In standard multi-dimensional parallelism, engineers combine TP and SP by placing them on orthogonal axes of a device mesh. If you want a TP degree of T and an SP degree of Σ, your model replica consumes T × Σ GPUs. This is expensive in two ways. First, it uses more GPUs for the model-parallel group, leaving fewer available for data-parallel replicas. Second, if T × Σ is large enough to span multiple nodes, some collective communication has to travel over slower inter-node interconnects like InfiniBand or Ethernet instead of the high-bandwidth intra-node fabric, such as AMD Infinity Fabric or NVIDIA NVLink. Data Parallelism (DP), the other common baseline, avoids these model-parallel costs entirely but replicates all model state on every device, making it impractical for large models or long contexts on its own.
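To make the layout cost concrete, here is a small sketch using PyTorch's DeviceMesh API; the API choice and the degrees T = 8, Σ = 4 are our own assumptions, not anything the paper prescribes.

```python
# Illustrative only: a 2-D TP x SP mesh pins T * Sigma GPUs to one model replica.
# Requires torch.distributed to be initialized across T * Sigma ranks.
from torch.distributed.device_mesh import init_device_mesh

T, SIGMA = 8, 4                                                           # hypothetical degrees
mesh = init_device_mesh("cuda", (T, SIGMA), mesh_dim_names=("tp", "sp"))
print(mesh.size())                                                        # 32 GPUs per replica
```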
What Folding Actually Means
TSP’s core idea is parallelism folding: instead of placing TP and SP on separate, orthogonal mesh dimensions, it collapses both onto a single device-mesh axis of size D. Every GPU in the TSP group simultaneously holds 1/D of the model weights and 1/D of the token sequence. Because both are sharded across the same D GPUs, the per-device memory footprint decreases by 1/D for both parameter memory and activation memory — something no single standard parallelism scheme achieves on its own. TSP is therefore the only scheme that simultaneously reduces weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis without requiring a two-dimensional T × Σ device layout.
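A back-of-the-envelope helper makes the footprint argument concrete. This is our own sketch, not the paper's accounting (which also covers optimizer states, KV caches, and so on), and the numbers in the example are hypothetical: given weight-proportional state W, activation memory A, and group size D, it shows which terms each scheme actually divides.

```python
# Our own rough model of per-GPU memory under each scheme; W and A in GB, D = group size.
def per_gpu_memory(W: float, A: float, D: int) -> dict:
    return {
        "DP":  W + A,          # everything replicated on every device
        "TP":  W / D + A,      # weights sharded, activations largely replicated
        "SP":  W + A / D,      # sequence (activations) sharded, weights replicated
        "TSP": W / D + A / D,  # folded: both terms shrink by 1/D on the same D GPUs
    }

# Hypothetical illustration: 84 GB of model state, 56 GB of activations, D = 8.
print(per_gpu_memory(84.0, 56.0, 8))
# {'DP': 140.0, 'TP': 66.5, 'SP': 91.0, 'TSP': 17.5}
```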
The challenge is that if each GPU only has part of the weights and part of the sequence, it needs to coordinate with other GPUs to complete each layer’s forward pass. TSP uses two different communication schedules to handle this, one for attention and one for the gated MLP.
For attention, TSP iterates over weight shards. At each step, one GPU broadcasts its packed attention weight shards (WQ, WK, WV, and WO) to all other GPUs in the group. Every GPU then applies those weights to its local sequence tokens to compute local Q, K, and V projections. Since causal attention requires access to the full key/value context, the local K and V tensors are all-gathered across the TSP group and reordered using a zigzag partition scheme before FlashAttention is applied. The zigzag partition ensures that causal attention workload is balanced across ranks, since later tokens attend to larger prefixes and would otherwise cause load imbalance.
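Below is a hedged, single-head sketch of that attention schedule. It is our own simplification rather than Zyphra's kernels: the shapes, the omission of the W_O projection, and the use of PyTorch's scaled_dot_product_attention as a stand-in for FlashAttention are all assumptions, and the zigzag reordering and per-rank causal offsets are only noted in comments.

```python
# Hedged sketch of the broadcast-based attention schedule (illustrative, single head).
# Each of the D ranks holds a 1/D shard of the packed W_Q/W_K/W_V rows and 1/D of the tokens.
# Requires an initialized torch.distributed process group.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tsp_attention(x_local, packed_qkv_shard, group):
    """x_local: [B, S/D, h]; packed_qkv_shard: [3 * h/D, h] (this rank's rows of W_Q, W_K, W_V)."""
    rank, D = dist.get_rank(group), dist.get_world_size(group)
    q_parts, k_parts, v_parts = [], [], []
    for src in range(D):
        # Step `src`: one rank broadcasts its packed weight shard, everyone applies it locally.
        w = packed_qkv_shard.clone() if src == rank else torch.empty_like(packed_qkv_shard)
        dist.broadcast(w, src=dist.get_global_rank(group, src), group=group)
        wq, wk, wv = w.chunk(3, dim=0)
        q_parts.append(x_local @ wq.t())          # each piece: [B, S/D, h/D]
        k_parts.append(x_local @ wk.t())
        v_parts.append(x_local @ wv.t())
    q = torch.cat(q_parts, dim=-1)                # full projection of the *local* tokens
    k_local, v_local = torch.cat(k_parts, dim=-1), torch.cat(v_parts, dim=-1)

    # Causal attention needs every key/value, so all-gather K and V across the group.
    # (The zigzag reordering that balances the causal workload is omitted here.)
    k_list = [torch.empty_like(k_local) for _ in range(D)]
    v_list = [torch.empty_like(v_local) for _ in range(D)]
    dist.all_gather(k_list, k_local, group=group)
    dist.all_gather(v_list, v_local, group=group)
    k_full, v_full = torch.cat(k_list, dim=1), torch.cat(v_list, dim=1)

    # Stand-in for FlashAttention; a real kernel would apply the per-rank causal offset
    # and finish with the W_O output projection, both omitted for brevity.
    return F.scaled_dot_product_attention(q, k_full, v_full, is_causal=True)
```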
For the gated MLP, TSP uses a ring schedule. Each GPU starts with local shards of the gate, up, and down projections. These weight shards circulate around the TSP group via point-to-point send/recv operations, and each GPU accumulates partial outputs locally as the shards arrive. Critically, this eliminates the all-reduce that standard TP requires for MLP output — the sequence stays local, and only the weights move. The ring is designed to overlap weight transfers with GEMM computation, so communication happens in the background while the GPU is computing.
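Here is a similarly hedged sketch of the ring schedule for the gated MLP, again our own simplification: the shard shapes and the ring_exchange helper are assumptions, and the exchange is written synchronously for clarity, whereas the design described above issues the sends and receives asynchronously so the next shard streams in while the current GEMMs run.

```python
# Hedged sketch of the ring schedule for the gated MLP (illustrative, not Zyphra's code).
# Weight shards circulate around the group while each rank accumulates the output for
# its own 1/D slice of the sequence; note that no all-reduce is needed at the end.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ring_exchange(t, nxt, prv, group):
    """Send this shard to the next rank, receive the previous rank's shard."""
    recv = torch.empty_like(t)
    reqs = dist.batch_isend_irecv([
        dist.P2POp(dist.isend, t.contiguous(), nxt, group=group),
        dist.P2POp(dist.irecv, recv, prv, group=group),
    ])
    for r in reqs:
        r.wait()
    return recv

def tsp_gated_mlp(x_local, gate_shard, up_shard, down_shard, group):
    """x_local: [B, S/D, h]; gate/up shards: [f/D, h]; down shard: [h, f/D]."""
    rank, D = dist.get_rank(group), dist.get_world_size(group)
    nxt = dist.get_global_rank(group, (rank + 1) % D)
    prv = dist.get_global_rank(group, (rank - 1) % D)
    y = torch.zeros_like(x_local)
    g, u, d = gate_shard, up_shard, down_shard
    for step in range(D):
        act = F.silu(x_local @ g.t()) * (x_local @ u.t())  # local gated activations [B, S/D, f/D]
        y = y + act @ d.t()                                # accumulate this shard's contribution
        if step < D - 1:                                   # pass the weight shards around the ring
            g, u, d = (ring_exchange(t, nxt, prv, group) for t in (g, u, d))
    return y
```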
Memory and Throughput Results
Tested on a single 8-GPU MI300X node across sequence lengths from 16K to 128K tokens, TSP achieves the lowest peak memory at every point. At 16K tokens, TSP and TP are nearly equivalent, 31.0 GB versus 31.5 GB per GPU, because model-state memory dominates at short context. At 128K tokens, the picture changes dramatically: TSP uses 38.8 GB per GPU, compared to 70.0 GB for TP and 85.0 GB and 140.0 GB for two different TP+SP factorizations on the same node. The theoretical figures throughout this research are based on a 7B dense decoder-only transformer reference model (hidden dimension h=4096, 32 layers, 32 query heads, 32 KV heads, FFN expansion factor F=4, bf16 precision), providing a reproducible baseline for comparing the schemes.
Throughput results on 128 full nodes (1,024 MI300X GPUs) show TSP consistently outperforming matched TP+SP baselines. At a folded degree of D=8 and a sequence length of 128K tokens, TSP achieves 173 million tokens per second compared to 66.3 million tokens per second for the matched TP+SP baseline (approximately a 2.6x speedup). The advantage grows with higher parallelism degree and longer sequence length.

Practical Trade-offs to Understand
TSP does increase total communication volume compared to TP alone. It adds a weight-movement term per layer on top of the same activation-proportional K/V all-gather that SP uses. However, the research team shows that when batch size B and sequence length S satisfy BS > 8h (where h is the model’s embedding dimension), TSP’s forward communication volume is competitive with TP’s. For the 7B reference model (h = 4096), that threshold is BS > 32,768 tokens, which even a single 128K-token sequence clears several times over. This condition is met in most long-context training and inference scenarios.
The key insight the Zyphra team emphasizes is that communication volume and communication cost are not the same thing. Whether extra communication volume translates into wall-clock slowdown depends on whether the collectives are latency-bound or bandwidth-bound, and how much of that traffic can be overlapped with matrix multiplication. Their implementation pipelines weight transfers behind dominant GEMM operations so that weight communication consumes bandwidth without adding to critical-path time.
TSP is not designed to replace TP, SP, or TP+SP in all settings. It is intended as an additional axis in the multi-dimensional parallelism design space. It composes orthogonally with pipeline parallelism, expert parallelism, and data parallelism. This means teams can slot TSP into an existing parallelism configuration wherever the standard layout would otherwise force model-parallel groups across slower inter-node links.
Key Takeaways
- Zyphra’s Tensor and Sequence Parallelism (TSP) folds tensor parallelism and sequence parallelism onto a single device-mesh axis, so each GPU simultaneously holds 1/D of the model weights and 1/D of the token sequence, reducing memory overhead for both training and inference.
- TSP is the only parallelism scheme that reduces both weight-proportional memory (parameters, gradients, optimizer states) and activation memory by the same 1/D factor on a single axis, without requiring a two-dimensional T × Σ device mesh.
- Empirical results on a single 8-GPU MI300X node show TSP uses 38.8 GB per GPU at 128K sequence length, compared to 70.0 GB for TP and 85.0–140.0 GB for TP+SP configurations.
- At large scale (1,024 MI300X GPUs, 128K context, D=8), TSP achieves 173 million tokens per second versus 66.3 million tokens per second for a matched TP+SP baseline (approximately a 2.6x throughput advantage).
- TSP composes orthogonally with pipeline, expert, and data parallelism and is best suited for long-context, memory-constrained training and inference workloads where eliminating weight and activation replication outweighs the added communication volume.