Cohere Releases Command A+: A 218B Sparse MoE Model for Agentic Workflows That Runs on as Few as Two H100 GPUs

Cohere has released Command A+, an open-source model targeting enterprise agentic workflows. Available under an Apache 2.0 license, Command A+ is a mixture-of-experts (MoE) model built for high-performance agentic tasks with minimal compute overhead. The model is optimized for reasoning, agentic workflows, RAG, multilingual tasks, and multimodal document processing. It unifies capabilities from four prior models (Command A, Command A Reasoning, Command A Vision, and Command A Translate) into a single scalable model.

Architecture

Command A+ is a decoder-only Sparse Mixture-of-Experts Transformer with 218B total parameters and 25B active parameters. It has 128 experts, of which 8 are active per token, plus a single shared expert that is applied to all tokens. In an MoE model, each token is routed through only a subset of expert sub-networks rather than the full parameter set, which keeps inference-time compute at the scale of a 25B-parameter model.
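To make the sparsity concrete, the short calculation below works out what share of the model is exercised per token. It uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 218   # total parameters, in billions
ACTIVE_PARAMS_B = 25   # parameters used per token, in billions
NUM_EXPERTS = 128      # routed experts
ACTIVE_EXPERTS = 8     # routed experts selected per token

# Share of all parameters that participate in processing one token.
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Share of routed experts that fire for one token (shared expert not counted).
routed_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"Active parameters per token: {active_param_fraction:.1%}")    # 11.5%
print(f"Routed experts used per token: {routed_expert_fraction:.2%}")  # 6.25%
```

The active-parameter share is larger than the routed-expert share because attention layers, embeddings, and the shared expert run for every token.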

The attention stack interleaves sliding-window attention layers that use Rotary Position Embeddings (RoPE) with global attention layers that use no positional embeddings, in a 3:1 ratio. The sparse MoE layer is trained in a fully dropless manner and uses a token-choice router, with a normalized sigmoid over the top-k expert logits per token.
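The routing and layer layout described above can be sketched in a few lines of plain Python. This is one reading of the description, not Cohere's implementation: the function names are hypothetical, the gates are assumed to be sigmoid scores of the selected logits rescaled to sum to 1, and the ordering "three sliding-window layers, then one global layer" is an assumption (the article states only the ratio).

```python
import math

def route_token(logits, k):
    """Token-choice routing for one token: pick its top-k experts and
    return {expert_index: gate}, with gates normalized to sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scores = [1.0 / (1.0 + math.exp(-logits[i])) for i in top]  # sigmoid per expert
    total = sum(scores)
    return {i: s / total for i, s in zip(top, scores)}

def layer_pattern(num_layers, sliding_per_global=3):
    """Assumed layout: repeat `sliding_per_global` sliding-window (RoPE)
    layers followed by one global (no positional embedding) layer."""
    period = sliding_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]

# Toy example with 4 experts and k=2 (the model itself uses 128 experts, k=8).
gates = route_token([0.0, 2.0, -1.0, 1.0], k=2)
print(gates)             # experts 1 and 3 are selected
print(layer_pattern(8))  # sliding, sliding, sliding, global, repeated twice
```

"Dropless" means every token is processed by all of the experts it selects, so no capacity limit appears in the sketch.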

Input modalities are text, image, and tool use. Output modalities are text, reasoning, and tool use. The model supports a 128K input context length and a 64K max generation length.

Hardware Requirements and Quantization

Three quantization variants are available with minimum GPU requirements: BF16 (16-bit) requires 4× B200 or 8× H100 GPUs; FP8 (8-bit) requires 2× B200 or 4× H100 GPUs; W4A4 (4-bit) runs on a single B200 or 2× H100 GPUs. All three quantizations show negligible differences in benchmark quality. Cohere recommends W4A4 for most deployments.
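A rough back-of-the-envelope check shows why these GPU counts are plausible. The sketch below estimates weight storage only, and several inputs are assumptions rather than figures from Cohere: 80 GB per H100, 192 GB per B200, and a flat bytes-per-parameter value for each variant. The W4A4 estimate is a lower bound, since only the MoE experts are quantized to 4 bits, and KV cache, activations, and scaling factors are ignored throughout.

```python
PARAMS = 218e9  # total parameters, from the article

# Assumed storage cost per parameter for each variant.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "W4A4": 0.5}

# Assumed memory per GPU in GB (not stated in the article).
GPU_MEMORY_GB = {"H100": 80, "B200": 192}

# Minimum GPU counts quoted in the article.
MIN_GPUS = {
    "BF16": {"B200": 4, "H100": 8},
    "FP8": {"B200": 2, "H100": 4},
    "W4A4": {"B200": 1, "H100": 2},
}

def weight_gb(variant):
    """Approximate weight footprint in GB for a quantization variant."""
    return PARAMS * BYTES_PER_PARAM[variant] / 1e9

for variant, gpus in MIN_GPUS.items():
    need = weight_gb(variant)
    for gpu, count in gpus.items():
        have = count * GPU_MEMORY_GB[gpu]
        print(f"{variant}: ~{need:.0f} GB weights vs {have} GB on {count}x {gpu}")
```

Under these assumptions each configuration leaves headroom beyond the weights, which is where the KV cache and activations would have to fit.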

W4A4 Quantization Methodology

Cohere applies NVFP4 W4A4 quantization, 4-bit weights and activations with two-level scaling, to the MoE experts only. The attention path, including Q/K/V/O projections, the KV cache, and attention compute, is kept at full precision.

To close residual quality gaps, Cohere uses Quantization-Aware Distillation (QAD) in the post-training phase: the quantized student model is trained to match the full-precision teacher’s output distribution, using fake quantization operators in the forward pass and straight-through estimators on the backward pass.
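The two ingredients named here, fake quantization with a straight-through estimator and a distillation loss against the teacher, can be illustrated with a toy example. This is a minimal sketch and not Cohere's training code: it uses a uniform 4-bit integer grid as a stand-in for the NVFP4 format, a clipped variant of the straight-through estimator, and KL divergence as the distribution-matching loss. All function names are hypothetical.

```python
import math

def fake_quant(x, scale, qmin=-8, qmax=7):
    """Forward pass: snap x to a 4-bit integer grid, then map back to float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale

def ste_grad(x, upstream, scale, qmin=-8, qmax=7):
    """Backward pass (straight-through estimator): treat rounding as the
    identity, so the upstream gradient passes through unchanged inside the
    clipping range and is zeroed outside it."""
    return upstream if qmin * scale <= x <= qmax * scale else 0.0

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two output distributions."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Toy "model": logits are simply the weights. The student sees quantized weights.
teacher_w = [0.26, -0.41, 0.07]
student_w = [fake_quant(w, scale=0.1) for w in teacher_w]
loss = kl_divergence(teacher_w, student_w)
print(student_w, loss)
```

In real training the loss gradient would flow back through `ste_grad` to update the underlying full-precision weights, which is what lets the student adapt to the quantization grid.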

Performance vs. Prior Command A Models

On τ²-Bench Telecom, scores improved from 37% with Command A Reasoning to 85% with Command A+, and agentic coding performance on Terminal-Bench Hard rose from 3% to 25%.

On internal North platform evaluations, all scored using LLM-as-a-judge techniques, Agentic Question Answering accuracy improved by 20% over Command A Reasoning. Agentic QA measures how well the model answers enterprise questions using MCP-connected cloud file systems. Spreadsheet analysis quality improved by 32%, and Memory Usage Quality — measuring how well an agent leverages information from a previous session to answer questions in a subsequent session — scored 54% with Command A+ compared to 39% with Command A Reasoning.

Command A+ is Cohere’s first multimodal reasoning model. It achieved 63% on MMMU Pro and 75.1% on MMMU, compared with 65.3% for Command A Vision on the latter. MathVista scores improved from 73.5% to 80.6%, and CharXiv reasoning improved from 46.9% to 52.7%.

Command A+ expands multilingual coverage from 23 to 48 languages, with gains in machine translation and multilingual reasoning.

Command A+ scored 37 on the Artificial Analysis Intelligence Index, outperforming other leading open models.

Speed and Latency

At the same quantization and concurrency levels, Command A+ delivers up to 63% higher Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by up to 17% compared with Command A Reasoning. The W4A4 quantization contributes an additional 47% increase in speed and a 13% reduction in latency. Speculative decoding, optimized specifically for the MoE architecture, delivers an additional 1.5–1.6× inference speedup for both text and multimodal inputs.

Tokenizer

Command A+ is the first model to use Cohere’s latest tokenizer, reducing the number of tokens required to generate the same response. Tokenization efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese.

Getting Started

The model is supported by vLLM and Transformers. Tool use is handled through chat templates in Transformers using JSON schema for tool descriptions. When reasoning is enabled, the model generates thinking traces between <|START_THINKING|> and <|END_THINKING|> tags before producing a final answer.
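To show what the thinking tags look like in practice, here is a small sketch that separates the reasoning trace from the final answer in raw output text. It is an illustration only: the `split_reasoning` helper is hypothetical, it handles a single thinking span, and it is not the parser Cohere ships.

```python
import re

# Non-greedy match of everything between the two tags, across newlines.
THINKING_RE = re.compile(r"<\|START_THINKING\|>(.*?)<\|END_THINKING\|>", re.DOTALL)

def split_reasoning(text):
    """Return (thinking, answer). If no thinking span is present,
    thinking is an empty string and the whole text is the answer."""
    match = THINKING_RE.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer

raw = "<|START_THINKING|>2 + 2 is 4<|END_THINKING|>The answer is 4."
thinking, answer = split_reasoning(raw)
print(thinking)  # 2 + 2 is 4
print(answer)    # The answer is 4.
```

For production use, a dedicated response parser is the safer choice, since a regex like this does not handle truncated generations or tool-call blocks.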

The W4A4 variant requires vLLM >= 0.21.0 and cohere_melody >= 0.9.0 for accurate response parsing. Cohere recommends the following sampling parameters: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.
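For readers unfamiliar with these three knobs, the sketch below shows how they typically reshape a next-token distribution. It is a simplified, pure-Python illustration of the common definitions, not the code path used by vLLM or Transformers; the function name and the order of operations (penalty, then temperature, then nucleus filtering) are assumptions.

```python
import math

def apply_sampling_params(logits, generated_ids,
                          temperature=0.9, top_p=0.95, repetition_penalty=1.04):
    """Return {token_id: probability} after applying the three parameters."""
    # 1. Repetition penalty: make already-generated tokens less likely.
    adjusted = list(logits)
    for i in set(generated_ids):
        if adjusted[i] > 0:
            adjusted[i] /= repetition_penalty
        else:
            adjusted[i] *= repetition_penalty

    # 2. Temperature: values below 1 sharpen the distribution.
    scaled = [v / temperature for v in adjusted]

    # 3. Softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 4. Top-p: keep the smallest set of most likely tokens whose
    #    cumulative probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# Toy vocabulary of 4 tokens; token 0 has already been generated once.
dist = apply_sampling_params([2.0, 1.0, 0.0, -4.0], generated_ids=[0])
print(dist)  # the least likely token (id 3) is filtered out by top_p
```

A serving engine would then sample one token from the returned distribution; with the recommended values the penalty is mild and only the low-probability tail is cut.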

Key Takeaways

Command A+ has 218B total / 25B active parameters in a Sparse MoE architecture, released under Apache 2.0.
W4A4 applies NVFP4 quantization to MoE experts only with QAD post-training, running on 2× H100s.
τ²-Bench Telecom improved from 37% to 85%; Terminal-Bench Hard from 3% to 25% vs. Command A Reasoning.
TOPS increased up to 63% and TTFT reduced up to 17% vs. Command A Reasoning at matching quantization.
Command A+ is Cohere’s first multimodal reasoning model, expanding language support from 23 to 48 languages.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
