New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget

Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. It requires a tedious, trial-and-error process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments are entangled, it becomes nearly impossible to attribute which specific tweak actually solved the problem.

To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from a sequence of trial-and-error guesses into a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time.

In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget.

For enterprise AI teams, the technique offers a way to automate the continuous improvement of complex, real-world engineering systems.

Understanding the bottleneck in autonomous optimization

As large language models and AI systems become more capable, they are expected to carry out more complex operations, including autonomous optimization (AO) of software systems such as agent harnesses or model training algorithms.

AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent’s goal is to iteratively improve this artifact through experimental feedback without step-by-step human supervision.
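The loop described here can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not code from the paper: `propose` and `evaluate` are hypothetical stand-ins for the agent's edit step and the experimental feedback, and the artifact is reduced to a dictionary with a single made-up `chunk_size` setting.

```python
import random
from typing import Callable, Dict, Tuple

Artifact = Dict[str, float]  # toy stand-in for a codebase or pipeline config


def optimize(
    artifact: Artifact,
    propose: Callable[[Artifact], Artifact],
    evaluate: Callable[[Artifact], float],
    budget: int,
) -> Tuple[Artifact, float]:
    """Greedy AO loop: keep a candidate only if its measured score improves."""
    best, best_score = dict(artifact), evaluate(artifact)
    for _ in range(budget):
        candidate = propose(best)    # hypothetical "agent edits the artifact" step
        score = evaluate(candidate)  # experimental feedback
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy objective (an assumption for illustration): the score peaks at chunk_size == 512.
rng = random.Random(0)


def propose(a: Artifact) -> Artifact:
    return {"chunk_size": a["chunk_size"] + rng.choice([-64, 64])}


def evaluate(a: Artifact) -> float:
    return -abs(a["chunk_size"] - 512)


start = {"chunk_size": 128.0}
best, best_score = optimize(start, propose, evaluate, budget=50)
```

A loop like this remembers only the current best artifact. It keeps no record of which directions were tried or why they failed.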

The main challenge of AO is often misunderstood. Many engineering teams find that simply giving a coding agent more time or compute to optimize a codebase doesn’t lead to better results. “Automation can keep an AI working for a very long time — but a loop is not the same as progress,” Jiajie Jin, co-author of the paper, told VentureBeat. “If the goal is vague, or the metric is easy to hack, long-running automation often just produces ‘improvements’ faster that nobody actually wants.”

Jin explains that complex tasks take many attempts to get right, and standard agent architectures are missing the critical data structure to maintain state. “How do you make sure the insight and experience from each attempt actually accumulate, instead of getting lost in a scrollback buffer?” he said. Without this structure, agents simply repeat the same mistakes.

Current agent systems can run experiments for many hours against well-specified goals: editing code, invoking tools, running tests autonomously. But they treat each attempt in isolation, missing the structural mechanisms that would let them accumulate and act on what they’ve learned.

They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without this, they cannot interpret both successes and failures to reshape their future exploration, which is the core mechanism that makes human research cumulative.

General coding agents typically rely on conversation transcripts for their memory. Because AO tasks span hundreds of turns and easily exceed context window limits, these agents struggle to preserve and reuse factual evidence over long histories. As a result, they lose the overarching structure of the research process and are prone to stalling on early failures or chasing noisy evaluation swings. The system needs a structured, durable memory that records what directions have been tried, what factual evidence was produced, and how each result changes the space of future hypotheses.

Existing frameworks are also prone to reward hacking and overfitting to development metrics. This makes them create the illusion of progress without producing improvements that transfer to real-world performance.

Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing parallel hypotheses in isolated environments without corrupting the main codebase or obscuring which hypothesis caused a specific outcome.

The Arbor framework

Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from the ground-level coding tasks with two key components:

The coordinator: A long-lived AI agent that acts like a principal investigator. It never directly edits the target codebase. Instead, it owns the general state of the optimization research, observes accumulated evidence, comes up with new hypotheses and directions to explore, and decides what to do with the results of experiments.

Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spins up an executor and places it in an isolated environment, essentially a fresh git worktree. Each executor is handed one hypothesis. It implements the assigned idea, runs evaluations, debugs errors, and reports back to the coordinator with the results and created artifacts.
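For readers unfamiliar with git worktrees, the isolation step might look roughly like the sketch below. It relies only on the standard `git worktree add -b <branch> <path> <commit-ish>` command. The directory layout, the `hypothesis/` branch prefix and both function names are assumptions made for illustration, not details from the paper.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_command(path: Path, branch: str, base: str = "HEAD") -> List[str]:
    """Build `git worktree add -b <branch> <path> <base>` as an argument list."""
    return ["git", "worktree", "add", "-b", branch, str(path), base]


def create_executor_worktree(repo: Path, hypothesis_id: str, base: str = "HEAD") -> Path:
    """Create a fresh checkout on its own branch for a single hypothesis."""
    path = repo.parent / f"{repo.name}-wt-{hypothesis_id}"  # hypothetical layout
    branch = f"hypothesis/{hypothesis_id}"                  # hypothetical naming
    # Runs git inside the main repository; raises CalledProcessError on failure.
    subprocess.run(worktree_add_command(path, branch, base), cwd=repo, check=True)
    return path
```

Because each worktree has its own directory and branch, edits made by one executor never touch the files another executor is evaluating.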

These two components collaborate through a mechanism that the researchers call “Hypothesis Tree Refinement” (HTR). HTR represents the entire research process as a persistent, branching tree where every node binds together four things: a hypothesis, the executable artifact, the factual evidence produced, and a distilled insight. This means the coordinator can explore multiple competing directions at the same time without losing its place.

The coordinator builds the tree by placing broad ideas near the root, with concrete refinements branching out toward the leaves. If an executor’s experiment fails, the tree records why it failed as a negative constraint, so the system does not keep repeating the same mistake.
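A tree node of the kind described above could be modeled as follows. This is a minimal sketch: the field names, the `failed` flag and the helper functions are assumptions, and the evidence value in the example is a placeholder rather than a reported result.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare by identity; avoids recursing through parent/children
class HypothesisNode:
    hypothesis: str                                           # the idea being tested
    artifact_ref: Optional[str] = None                        # e.g. a git branch (assumed)
    evidence: Dict[str, float] = field(default_factory=dict)  # measured results
    insight: Optional[str] = None                             # distilled lesson
    failed: bool = False
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: List["HypothesisNode"] = field(default_factory=list, repr=False)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Add a more concrete hypothesis as a child of this node."""
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child


def negative_constraints(root: HypothesisNode) -> List[str]:
    """Collect the insights of failed nodes, depth-first from the root."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.failed and node.insight:
            found.append(node.insight)
        stack.extend(reversed(node.children))
    return found


root = HypothesisNode("improve answer accuracy")
retrieval = root.refine("change the retrieval method")
bfs = retrieval.refine("breadth-first search over linked documents")
bfs.artifact_ref = "hypothesis/bfs"
bfs.evidence = {"dev_accuracy": 0.41}  # placeholder number, not from the paper
bfs.insight = "breadth-first search hurt accuracy"
bfs.failed = True
```

Calling `negative_constraints(root)` then returns the recorded failure, which a coordinator could consult before proposing the next hypothesis.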

To understand why Arbor’s isolation matters, consider a common enterprise scenario: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. “When you ask a single agent like Claude Code or Codex to ‘improve accuracy,’ it will typically change a bunch of things in one pass — chunking, the prompt, the retrieval method,” Jin said. This entangles the changes, making it impossible to attribute which one actually helped. It also directly mutates the repository without isolation.

Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and the prompt another, each implemented and evaluated in its own isolated git worktree. “So you get clean attribution: ‘constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,’” Jin said.

When an executor returns a report, the coordinator writes the evidence to the tree and backpropagates the insight upward to parent nodes. This means a local observation becomes a generalized constraint that shapes the coordinator’s future idea generation.
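The upward propagation step can be illustrated with a plain parent map. In this sketch the insight is copied verbatim to every ancestor, which is a simplification: how Arbor generalizes an insight as it moves up the tree is not described here, and the node names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Optional

# child -> parent; None marks the root. Node names are hypothetical.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "retrieval": "root",
    "retrieval/constraint-decomposition": "retrieval",
}


def backpropagate(
    node: str,
    insight: str,
    parent: Dict[str, Optional[str]],
    insights: DefaultDict[str, List[str]],
) -> None:
    """Attach an insight to a node and to every ancestor up to the root."""
    current: Optional[str] = node
    while current is not None:
        insights[current].append(insight)
        current = parent[current]


insights: DefaultDict[str, List[str]] = defaultdict(list)
backpropagate(
    "retrieval/constraint-decomposition",
    "decomposing constraints helped retrieval",
    PARENT,
    insights,
)
```

After the call, the same lesson is visible at the leaf, at the `retrieval` branch and at the root, so it can inform hypotheses generated anywhere above the leaf.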

To prevent reward hacking or overfitting to the development data, HTR enforces a strict “merge gate.” Even if an executor reports a fantastic development score, the coordinator will spin up an isolated worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it demonstrably improves the test score, verifying that the progress is real.
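The merge rule can be expressed as a small check. The sketch below assumes a hypothetical `evaluate_held_out` callable that scores a candidate on held-out data. The `min_gain` margin is an added assumption, and all scores are placeholders.

```python
from typing import Callable, Tuple


def merge_gate(
    trunk_test_score: float,
    candidate: str,
    evaluate_held_out: Callable[[str], float],
    min_gain: float = 0.0,
) -> Tuple[bool, float]:
    """Re-score the candidate on held-out data; ignore its development score."""
    test_score = evaluate_held_out(candidate)
    return test_score > trunk_test_score + min_gain, test_score


# Placeholder held-out scores keyed by a made-up branch name.
held_out = {
    "hypothesis/prompt-tweak": 0.44,
    "hypothesis/constraint-decomposition": 0.52,
}

trunk = 0.45
merged = []
for ref in held_out:
    ok, score = merge_gate(trunk, ref, held_out.__getitem__)
    if ok:  # only verified gains advance the trunk
        trunk = score
        merged.append(ref)
```

In this example only the second candidate clears the gate, and the trunk score rises from 0.45 to 0.52.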

Arbor generally falls under the concept of “loop engineering,” popularized by industry figures like OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny. The idea is to move beyond single prompts to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, “A loop can fill up with messy, untraceable attempts, and you end up with nothing to show and no way to reconstruct what changed.”

Arbor in action

The researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and the MLE-Bench Lite machine learning engineering benchmark. The AO suite featured tasks from different areas of AI development, including model training, harness engineering, and data synthesis.

The researchers used different backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against the strongest coding agents, Codex and Claude Code. Arbor and the baselines were given the same resources. For the MLE-Bench Lite tasks, Arbor was also compared against top-tier agentic research systems like AI-Scientist, ML-Master, and AIDE.

Arbor consistently outperformed the baselines. It achieved the best held-out test result on all tasks, attaining more than 2.5 times the average relative gain of Codex and Claude Code. On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system’s held-out accuracy from a baseline of 45.33% to 67.67%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33%, respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest result among all benchmarked systems.
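To put the BrowseComp figures in perspective, the snippet below computes each system's gain relative to the 45.33% baseline. It assumes "relative gain" means (final − baseline) / baseline, a definition the article does not spell out. It also covers only this one task, so it does not reproduce the 2.5x figure, which is an average across all tasks.

```python
def relative_gain(baseline: float, final: float) -> float:
    """Improvement expressed as a fraction of the starting score."""
    return (final - baseline) / baseline


BASELINE = 45.33  # held-out accuracy before optimization, in percent
results = {"Arbor": 67.67, "Codex": 50.00, "Claude Code": 53.33}

gains = {name: relative_gain(BASELINE, acc) for name, acc in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")
# Arbor: 49.3%
# Codex: 10.3%
# Claude Code: 17.6%
```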

Figure: Arbor generalizes across backbone models and harnesses (source: arXiv)

Arbor proved resilient against overfitting. For example, in the Terminal-Bench 2.0 experiments, Claude Code reached a development score of 75, but its score dropped to 71 on the held-out data. Arbor posted a lower development score of 72.22 but the highest held-out score of 77.36, evidence that its gains carry over beyond the development set.

Arbor also showed generalization in a cross-task transfer experiment. After Arbor finished optimizing the search harness for the BrowseComp task, researchers took the optimized codebase and tested it on two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor’s optimized codebase significantly improved performance on those unseen tasks as well.

Deploying Arbor: Sweet spots and hidden costs

For engineering leads looking to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replacing them. “Its output is an ordinary git branch that your existing code review, CI, and human review can inspect directly,” Jin said. Only verified gains are merged into a per-run trunk, leaving the main repository untouched until a developer manually chooses to promote the code.

However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest catch is token cost, as maintaining a long-lived coordinator that continuously manages the tree and dispatches executors is the dominant expense. Running multiple isolated worktrees concurrently also requires genuine compute and disk resources to process real experiments.

So where is Arbor’s sweet spot? According to Jin, it excels at tasks with a clear, trustworthy metric, tolerance for a long time horizon, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning.

Conversely, teams should avoid using Arbor for latency-sensitive real-time tasks, obvious one-line fixes, or cases where the underlying evaluation metric is flawed. The quality ceiling of the entire run is strictly bounded by the quality of the evaluator. “If the metric isn’t trustworthy, Arbor will just optimize toward an untrustworthy result faster,” Jin said.

Jin sees the next evolution going beyond single scalar metrics. “A natural evolution is to have each node’s artifact carry a vector — accuracy, latency, cost — instead of a single score,” Jin said. “Going from a single scalar to a multi-objective Pareto search is a very natural extension of the framework.”
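A multi-objective comparison of this kind could look like the sketch below. It is not part of Arbor: the `Metrics` fields, the dominance rule (higher accuracy is better, lower latency and cost are better) and the candidate values are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Metrics(NamedTuple):
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.cost_usd <= b.cost_usd
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.cost_usd < b.cost_usd
    )
    return no_worse and strictly_better


def pareto_front(candidates: List[Metrics]) -> List[Metrics]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Placeholder candidates, not measured results.
accurate = Metrics(accuracy=0.70, latency_ms=900, cost_usd=0.05)
fast = Metrics(accuracy=0.65, latency_ms=400, cost_usd=0.02)
dominated = Metrics(accuracy=0.60, latency_ms=950, cost_usd=0.06)

front = pareto_front([accurate, fast, dominated])
```

Here `front` keeps the accurate and the fast candidates, since neither beats the other on every axis, and drops the third, which is worse than the accurate candidate on all three.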
