Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window

Most AI models today are not designed for sustained, multi-step autonomous execution. Tasks like running hundreds of iterative code modifications, or chaining tool calls across hours without human intervention, require a different kind of model architecture and training focus.

Alibaba’s Qwen team formally announced Qwen3.7-Max at the 2026 Alibaba Cloud Summit on May 20. By contrast, two preview versions of the Qwen3.7 series appeared quietly on Arena AI’s leaderboard, with no press release and no official API announcement.

Two Preview Models Released Simultaneously

Alibaba previewed two models simultaneously: Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview. They ranked 13th globally in text capabilities and 16th in vision capabilities, respectively, according to LM Arena.

In Text Arena, Qwen3.7-Max-Preview ranked #13 overall, placing Alibaba as the #6 lab in text. In Vision Arena, Qwen3.7-Plus-Preview ranked #16 overall, placing Alibaba as the #5 lab in vision. The model rank and the lab rank are separate figures.

Qwen3.7-Plus-Preview is described as a high-performance balanced version preview, focusing on reasoning and logical expression, with its toolchain to be gradually opened in the future. It handles vision and multimodal inputs. Qwen3.7-Max is the text-only reasoning flagship. This article covers Qwen3.7-Max, as it is the model Alibaba formally announced with API access.

What is Qwen3.7-Max Designed For

Alibaba’s Qwen team described Qwen3.7-Max as its most advanced and comprehensive agent model to date. The model is proprietary and closed-weight. It is designed to handle coding and debugging, office workflow automation, and long-horizon tasks spanning hundreds or even thousands of steps.

Extended-Thinking Mode

Qwen3.7-Max is a reasoning model. The model generates a chain of thought first — an internal sequence of steps where it plans, checks its work, and corrects course before committing to a final answer. On interfaces like Qwen Chat, this shows up as a ‘Thinking’ mode you can switch on to see the model’s reasoning trace.

Reasoning models produce significantly more output tokens than standard completions. When Artificial Analysis ran its Intelligence Index evaluation, Qwen3.7-Max generated about 97 million tokens, compared to an average of 24 million for models on that benchmark. For short or simple tasks, this overhead adds latency without improving output quality. For multi-step planning, code refactoring, or long agent chains, extended-thinking mode is where the model’s strength applies.
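To put that overhead in perspective, here is a small arithmetic sketch using the two token counts quoted above. It treats all generated tokens as billable output, and because Qwen3.7-Max pricing has not been announced, it borrows the Qwen3.6 Max Preview output price ($7.80 per million tokens) purely as a placeholder. Both of those are assumptions, not published figures for this model.

```python
# Figures quoted in the article (Artificial Analysis Intelligence Index run).
QWEN37_MAX_TOKENS_M = 97.0   # million tokens generated by Qwen3.7-Max
AVERAGE_TOKENS_M = 24.0      # million tokens, average across models

# Assumption: Qwen3.7-Max pricing is unannounced, so the Qwen3.6 Max Preview
# output price is used purely as a placeholder.
REFERENCE_OUTPUT_PRICE_USD_PER_M = 7.80


def overhead_ratio(model_tokens_m: float, baseline_tokens_m: float) -> float:
    """How many times more tokens the model emitted than the baseline."""
    return model_tokens_m / baseline_tokens_m


def output_cost_usd(tokens_m: float, price_per_m: float) -> float:
    """Output-token cost in USD for a token count given in millions."""
    return tokens_m * price_per_m


ratio = overhead_ratio(QWEN37_MAX_TOKENS_M, AVERAGE_TOKENS_M)
cost = output_cost_usd(QWEN37_MAX_TOKENS_M, REFERENCE_OUTPUT_PRICE_USD_PER_M)
baseline_cost = output_cost_usd(AVERAGE_TOKENS_M, REFERENCE_OUTPUT_PRICE_USD_PER_M)

print(f"Token overhead: {ratio:.1f}x")
print(f"Output cost at placeholder price: ${cost:,.2f} vs ${baseline_cost:,.2f}")
```

Under these assumptions the model emits roughly four times the average token volume, and the cost gap scales by the same factor.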

Context Window

The model features a 1M token context window, up from 256K on Qwen3.6 Max Preview. It supports text input and output only. Pricing has not yet been announced. Qwen3.6 Max Preview was priced at $1.30/$7.80 per million input/output tokens on Alibaba Cloud.

A million-token context window can hold a full mid-sized code repository or a large stack of documents in a single request. Models often reason less reliably as the context window fills. Independent long-context testing for Qwen3.7-Max is not yet available.
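A quick way to sanity-check whether a repository might fit is to estimate its token count before sending anything. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not Qwen's actual tokenizer, and it reserves some headroom for the model's output on the further assumption that output tokens share the same window. The function names and the default file extensions are illustrative only.

```python
import os

CONTEXT_WINDOW_TOKENS = 1_000_000  # context size stated for Qwen3.7-Max
CHARS_PER_TOKEN = 4                # assumption: rough heuristic, not Qwen's tokenizer


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Rough token count: ceiling of character count / chars_per_token."""
    return -(-len(text) // chars_per_token)


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".md")) -> int:
    """Sum the estimated tokens of all matching text files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total += estimate_tokens(handle.read())
    return total


def fits_in_context(tokens: int, reserve_for_output: int = 50_000) -> bool:
    """True if the prompt leaves `reserve_for_output` tokens of headroom."""
    return tokens + reserve_for_output <= CONTEXT_WINDOW_TOKENS


# Example usage (walks the current directory):
# print(fits_in_context(estimate_repo_tokens(".")))
```

An estimate like this only says whether the input fits. It says nothing about how reliably the model reasons over it.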

Benchmark Results

Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, placing it fifth overall. That represents a 4.8-point gain over its predecessor Qwen3.6 Max Preview (51.8), and puts it ahead of Google’s Gemini 3.5 Flash (55.3). GPT-5.5 (60.2), Claude Opus 4.7 (57.3), and Gemini 3.1 Pro Preview (57.2) still lead the overall rankings.

The Intelligence Index v4.0 aggregates ten evaluations, including GDPval-AA, Terminal-Bench Hard, SciCode, AA-Omniscience, Humanity’s Last Exam, and GPQA Diamond.

The improvement over Qwen3.6 Max Preview is not uniform. Most of the Index gains are concentrated in scientific reasoning, agentic capability, and coding. CritPt rose 9.7 percentage points (from 3.7% to 13.4%), Humanity’s Last Exam jumped 9.2 points (from 28.9% to 38.1%), and Terminal-Bench Hard climbed 6.9 points (from 43.9% to 50.8%). GDPval-AA added 42 Elo points (from 1504 to 1546). Scores on other benchmarks are largely flat compared to Qwen3.6 Max Preview.

One result on the Index requires careful reading. On AA-Omniscience, Qwen3.7-Max’s raw accuracy actually dropped 7.6 percentage points (from 37.7% to 30.1%), while its hallucination rate fell 21.3 points (from 44.2% to 22.9%). The model is choosing to say “I don’t know” more often rather than recalling more facts. Its attempt rate fell from 67.3% to 48.0%, the lowest among frontier models in the comparison. The AA-Omniscience benchmark rewards correct answers and penalizes hallucinations but has no penalty for refusing to answer. For use cases that depend on broad factual recall, this is a meaningful limitation to test against your workload.
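One way to read these numbers together is to ask how often the model is right when it does choose to answer. The sketch below divides accuracy by attempt rate for both models. This is a derived figure, not one reported by Artificial Analysis, and it assumes accuracy is measured over all questions while attempt rate is the share of questions answered.

```python
def accuracy_when_attempting(accuracy_pct: float, attempt_rate_pct: float) -> float:
    """Share of attempted questions answered correctly, in percent.

    Assumes accuracy is measured over all questions and attempt rate is the
    share of questions the model chose to answer.
    """
    return 100.0 * accuracy_pct / attempt_rate_pct


# Figures quoted in the article for AA-Omniscience.
qwen36 = accuracy_when_attempting(accuracy_pct=37.7, attempt_rate_pct=67.3)
qwen37 = accuracy_when_attempting(accuracy_pct=30.1, attempt_rate_pct=48.0)

print(f"Qwen3.6 Max Preview: {qwen36:.1f}% correct when it answers")
print(f"Qwen3.7-Max:         {qwen37:.1f}% correct when it answers")
```

Under that assumption, the share of correct answers among attempts rises from about 56% to about 63%, even though the model answers fewer questions overall.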

In Text Arena, Qwen3.7-Max-Preview ranked #13 overall with an Elo score of 1,475. Category rankings include #7 in Math, #9 in Expert Prompts, #9 in Software and IT, and #10 in Coding.

All benchmark numbers are preliminary. The model carries a ‘Preview’ label, indicating Alibaba considers it an early build.

Agentic Performance — Internal Test

In an internal Alibaba test on a new chip platform, the model autonomously performed more than 1,000 tool calls and iterative code modifications to optimize a key kernel. Alibaba claimed the process improved inference speed by roughly 10x compared with the previous version.

Marktechpost’s Visual Explainer

What is Qwen3.7-Max?

A proprietary reasoning model from Alibaba, designed for long-horizon agent tasks, code generation, and multi-step automation.

Context Window

1 million tokens — enough to fit a full mid-sized code repository in a single request.

Reasoning Model

Uses chain-of-thought (extended-thinking mode) before producing a final answer.

Input / Output

Text in, text out. No image input supported in this model.

API String

Use qwen3.7-max when calling via Alibaba Cloud Model Studio.

OpenAI-compatible API
OpenAI & Anthropic spec
Preview, no open weights yet

Quick Start: Chat Interface

The fastest way to test Qwen3.7-Max with no API key or setup required.

1. Go to Qwen Chat. Navigate to chat.qwen.ai and create a free account.

2. Select the model. In the model selector dropdown, choose Qwen3.7-Max. It may appear as Qwen3.7-Max-Preview during the preview period.

3. Enable Thinking Mode. Toggle on Thinking Mode in the chat interface. This activates chain-of-thought reasoning and shows the model’s internal reasoning trace before the final answer.

4. Send your prompt. Type your query. For best results on complex tasks, be specific about steps, constraints, and expected output format.

💡 Use your hardest real-world prompts when testing. Multi-step math problems, complex refactoring requests, and ambiguous expert questions reveal more about model quality than simple prompts.

API Access

Qwen3.7-Max is compatible with both OpenAI and Anthropic API specifications. You can plug it into existing pipelines with minimal changes.

OpenAI-compatible Python call

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Explain chain-of-thought reasoning."}
    ]
)

print(response.choices[0].message.content)

ℹ️ Get your API key from Alibaba Cloud Model Studio (DashScope). The base URL for international access is dashscope-intl.aliyuncs.com.
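Because the endpoint follows the OpenAI chat completions format, the same call can be expressed as a plain HTTP request. The sketch below only builds the request with the standard library and does not send it. Reading the key from an environment variable named DASHSCOPE_API_KEY is an assumption for illustration, and `build_chat_request` is a hypothetical helper.

```python
import json
import os
import urllib.request

BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"


def build_chat_request(prompt: str, api_key: str,
                       model: str = "qwen3.7-max") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumption: the key is stored in an environment variable named DASHSCOPE_API_KEY.
api_key = os.environ.get("DASHSCOPE_API_KEY", "YOUR_DASHSCOPE_API_KEY")
request = build_chat_request("Explain chain-of-thought reasoning.", api_key)
print(request.get_method(), request.full_url)

# To send it: urllib.request.urlopen(request).read() returns the raw JSON response.
```

Keeping the key out of source code this way also makes it easier to rotate keys between preview and production environments.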

⚠️ Pricing has not yet been announced for Qwen3.7-Max. For reference, Qwen3.6 Max Preview was priced at $1.30 / $7.80 per million input/output tokens.

Understanding Thinking Mode

Thinking Mode is the model’s chain-of-thought reasoning layer. It determines how the model approaches a problem before generating a response.

When to use it

Multi-step code refactoring, complex math proofs, long agent task chains, and ambiguous problems requiring step-by-step planning.

When to skip it

Short rewrites, simple classifications, quick lookups, or tasks where latency and token cost need to be minimised.

API: Enable thinking via extra_body

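# Reuses the `client` object created in the API Access example.
response = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "Your prompt here"}],
    # extra_body passes provider-specific fields through the OpenAI SDK.
    extra_body={"enable_thinking": True}
)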

💡 Qwen3.7-Max generated ~97M tokens on Artificial Analysis benchmarks, vs. an average of 24M for comparable models. Each thinking token adds to latency and cost, so use thinking mode selectively.
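One way to apply thinking mode selectively is to decide per request. The sketch below builds the keyword arguments for `client.chat.completions.create()` and switches `enable_thinking` on only for task types marked as complex. The task labels and the `completion_kwargs` helper are hypothetical, and whether Qwen3.7-Max accepts `enable_thinking: False` is an assumption to verify against the Model Studio docs.

```python
# Hypothetical task labels that justify the extra reasoning tokens.
COMPLEX_TASKS = {"refactor", "proof", "agent", "planning"}


def completion_kwargs(prompt: str, task_type: str,
                      model: str = "qwen3.7-max") -> dict:
    """Keyword arguments for client.chat.completions.create().

    Thinking is switched on only for task types listed in COMPLEX_TASKS;
    any other label (including unknown ones) defaults to thinking off.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"enable_thinking": task_type in COMPLEX_TASKS},
    }


kwargs = completion_kwargs("Refactor this module into smaller functions.", "refactor")
print(kwargs["extra_body"])
```

A call would then look like `client.chat.completions.create(**kwargs)`, reusing the client from the API Access example.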

Agentic and Long-Horizon Tasks

Qwen3.7-Max is designed to run long, autonomous task loops. In Alibaba’s internal testing, it executed 1,000+ tool calls and sustained autonomous execution for up to 35 hours.

1. Define tools clearly. Pass tool definitions in the standard OpenAI tools parameter. The model supports function calling and iterative tool invocation natively.

2. Use the 1M context window intentionally. Pass full task history, prior tool outputs, and code state into context. Trim aggressively when the full context is not needed, since every token is billed.

3. Target the final answer in assertions. Reasoning output is longer and more variable than a standard completion. When writing tests, assert on the final answer, not the exact wording of the thinking trace.

4. Good use cases. Kernel optimisation, code debugging loops, office workflow automation, and multi-step data pipelines with iterative verification.
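The tool-calling pattern from step 1 can be sketched as a bounded loop. The example below defines one tool in the OpenAI `tools` format and runs the loop against a scripted stand-in for the model, so it executes without network access. The `run_benchmark` tool, the `call_model` wrapper, and `fake_model` are all hypothetical. A real wrapper would call `client.chat.completions.create()` and convert the SDK response into a plain dict.

```python
import json

# Tool definition in the OpenAI `tools` format. The tool itself is hypothetical.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_benchmark",
        "description": "Run the kernel benchmark and return latency in ms.",
        "parameters": {
            "type": "object",
            "properties": {"kernel": {"type": "string"}},
            "required": ["kernel"],
        },
    },
}]


def run_benchmark(kernel: str) -> dict:
    """Stand-in tool implementation; a real one would execute the benchmark."""
    return {"kernel": kernel, "latency_ms": 12.5}


TOOL_IMPLS = {"run_benchmark": run_benchmark}


def agent_loop(call_model, messages: list, max_steps: int = 1000) -> str:
    """Call the model until it returns a final answer or max_steps is hit."""
    for _ in range(max_steps):
        reply = call_model(messages, TOOLS)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # no tool requested: this is the final answer
        for call in tool_calls:
            func = call["function"]
            args = json.loads(func["arguments"])  # arguments arrive as a JSON string
            result = TOOL_IMPLS[func["name"]](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("step limit reached without a final answer")


def fake_model(messages, tools):
    """Scripted stand-in for the API: one tool call, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {"name": "run_benchmark",
                         "arguments": '{"kernel": "matmul"}'},
        }]}
    return {"role": "assistant", "content": "Latency is 12.5 ms.", "tool_calls": None}


history = [{"role": "user", "content": "Benchmark the matmul kernel."}]
print(agent_loop(fake_model, history))
```

The explicit `max_steps` cap is a design choice for long-horizon runs: it bounds cost if the model never settles on a final answer.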

⚠️ The 35-hour and 1,000+ tool call figures come from Alibaba’s internal testing only. No independent verification exists for these specific claims.
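The testing advice in step 3 can be illustrated with a small check that looks only at the final answer. The sketch below assumes the response separates the thinking trace from the answer. The field name `reasoning_content` and the sample response are assumptions for illustration and may differ in the real API.

```python
import re


def final_number(answer_text: str) -> float:
    """Extract the last number that appears in the final answer text."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
    if not matches:
        raise ValueError("no number found in answer")
    return float(matches[-1])


# Hypothetical response split into a thinking trace and a final answer.
# Assumption: the field name `reasoning_content` may differ in the real API.
response = {
    "reasoning_content": "First try 6 * 7... that gives 42. Double-check: 42.",
    "content": "The result is 42.",
}

# Assert on the final answer only; the trace wording can change between runs.
assert final_number(response["content"]) == 42.0
```

Extracting a structured value from the answer keeps the test stable even when the phrasing around it changes.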

Known Limitations

Understanding these limitations before integrating will save debugging time and help you set the right expectations.

No image input

Qwen3.7-Max is text-only. For multimodal tasks, use Qwen3.7-Plus-Preview instead, which supports vision input.

AA-Omniscience abstention

On the AA-Omniscience benchmark, the model’s attempt rate dropped from 67.3% to 48.0%. It abstains more and hallucinates less — but its raw factual recall also dropped. Test carefully for knowledge-recall tasks.

Preview status

The model currently carries a ‘-Preview’ suffix. Benchmark scores, behaviour, and pricing can change before stable release. No open-weight version is available as of May 2026.

Long-context reliability

A 1M token context window is a ceiling, not a guarantee. Independent long-context testing for Qwen3.7-Max is not yet available. Validate retrieval quality on your specific workload.

ℹ️ For the latest model updates, check the official Qwen blog at qwen.ai/blog and Alibaba Cloud Model Studio docs.

Key Takeaways:

Alibaba released two Qwen3.7 preview models: Max (text/reasoning) and Plus (multimodal).
Qwen3.7-Max scored 56.6 on the Artificial Analysis Intelligence Index, ranking #5 overall — a 4.8-point gain over Qwen3.6 Max Preview.
The 1M-token context window is roughly four times the 256K limit of Qwen3.6 Max Preview; the model is text only, with no image input.
On AA-Omniscience, raw accuracy dropped while abstention rose — worth testing for knowledge-recall use cases.
The model sustained 1,000+ tool calls and 35-hour autonomous execution in Alibaba’s internal testing only; no independent verification yet.
