Poetiq's Meta-System Automatically Builds a Model-Agnostic Harness That Improved Every LLM Tested on LiveCodeBench Pro Without Fine-Tuning

Poetiq has just published some very interesting results showing its Meta-System reached a new state-of-the-art on LiveCodeBench Pro (LCB Pro), a competitive coding benchmark, by automatically building and optimizing its own inference harness — without fine-tuning any underlying model or accessing model internals.

The result: GPT 5.5 High with Poetiq’s harness scores 93.9% on LCB Pro (25Q2), up from its baseline of 89.6%. Gemini 3.1 Pro, the model the harness was specifically optimized on, jumps from 78.6% to 90.9% — surpassing Google’s own Gemini 3 Deep Think (88.8%), a model that isn’t even accessible via API for external verification.

Broadcom CEO on the Biggest AI Chip Bets

AI Not Holding Back Companies From Hiring: Yale Budget Lab

https://poetiq.ai/posts/recursive_self_improvement_coding/

What is LiveCodeBench Pro?

Before getting into the mechanics, it helps to understand why the benchmark matters. LiveCodeBench Pro (LCB) is designed to test AI coding ability in a way that resists two common failure modes in benchmarks: data contamination and overfitting.

LCB Pro pulls problems from major competitive programming competitions and withholds public ground-truth code. Instead, solutions are validated against a comprehensive testing framework. Correct output alone isn’t enough — solutions must also satisfy specific memory and runtime constraints. The benchmark is also subject to continuous updates, which distinguishes it from many standard benchmarks that become stale.

The benchmark focuses on C++ challenges and emphasizes creative coding, testing a model’s capacity for complex problem-solving and high-quality, performant procedural logic. This distinguishes it from datasets like SWEBench that evaluate tool usage or bug-fixing workflows. Problems are categorized by difficulty — Easy, Medium, and Hard — based on competitive human solve rates.

Poetiq’s Strategic Framing: Three LLM Task Categories

This is Poetiq’s third publicly reported benchmark, and the choice of LCB Pro was deliberate. The research team frames LLM performance around three distinct task categories: Reasoning challenges (ARC-AGI is their benchmark here), Retrieval challenges (Humanity’s Last Exam, or HLE), and Coding challenges — which, as the most pervasive commercial application for AI today, meld reasoning and retrieval with the generation of specialized procedural logic.

Their coding initiative had three specific, stated objectives: first, prove that an intelligent harness can boost efficacy without fine-tuning or special model access; second, validate the Meta-System’s capacity for recursive self-improvement in creating that harness automatically; and third, demonstrate that the resulting harness is model-agnostic and can be applied to any model without modification. According to their results, all three were satisfied.

What is a Harness, and Why Does It Matter?

In this context, a harness refers to the infrastructure wrapped around a language model to handle a specific task. Think of it as an orchestration layer — it controls how the model is prompted, how outputs are structured, how answers are assembled across multiple calls, and how solutions are evaluated.

Traditionally, these harnesses are hand-built by engineers. Poetiq’s claim is that their Meta-System builds and optimizes these harnesses automatically, through recursive self-improvement. Internally, the Meta-System works by developing better strategies for determining what to ask, refining sequential chain-of-questions, and devising new methods for assembling the answers. The system constantly incorporates learnings from previous and current tasks and datasets to create new, custom task-specific harnesses — as well as agents and orchestrators for other task types.

How the Harness was Built?

Poetiq’s Meta-System was given the LCB Pro task and constructed a harness from scratch using only Gemini 3.1 Pro as the base model. The Meta-System accounted for all three dimensions LCB Pro tests: accuracy, runtime, and memory constraints. The system built on insights from its previous work on ARC-AGI and HLE when designing the harness. No fine-tuning of the underlying model was performed, and no access to internal model activations was required — only standard API access.

Once the harness was built and optimized for Gemini 3.1 Pro, it was then applied to a broad set of other models from different providers and generations — both open-weights and proprietary — without any additional optimization. Every model tested improved.

The Numbers

The benchmark results across difficulty tiers are worth looking at in detail. On Hard problems — the category where gaps between models are largest — Gemini 3.1 Pro with Poetiq’s harness scores 58.3%, up from its 7.7% baseline. GPT 5.5 High with the harness reaches 75.0% on Hard, up from 50.0%. Across Easy and Medium categories, the harness also outperforms all base models.

Some of the smaller model results are also notable. Gemini 3.0 Flash improves by 10 percentage points, going from 72.3% to 82.3% — overtaking Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High, all larger and more expensive models. This mirrors a pattern Poetiq previously observed on ARC-AGI, where their optimization allowed a smaller, more economical model to surpass a bigger one. Kimi K2.6 sees the largest jump: from 50.0% to 79.9%, a roughly 30 percentage point improvement. Nemotron 3 Super 120B improves by 12.8%.

Accuracy numbers are reported directly from the LCB Pro leaderboard at livecodebenchpro.com (25Q2). For models not featured on the leaderboard, Poetiq conducted its own evaluations, cross-validating its experimental setup by replicating official leaderboard accuracies for baseline models.

Key Takeaways

Poetiq’s Meta-System automatically builds task-specific harnesses through recursive self-improvement, with no model fine-tuning or internal model access
GPT 5.5 High with the harness reaches 93.9% on LCB Pro (25Q2), up 4.3% from its 89.6% baseline; Gemini 3.1 Pro jumps 12.3% (78.6% → 90.9%)
The harness is model-agnostic: optimized using only Gemini 3.1 Pro, it improved every other model tested — open-weights and proprietary — without modification
Gemini 3.0 Flash gains 10 percentage points with the harness (72.3% → 82.3%), surpassing Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.2 High despite being smaller and cheaper
Kimi K2.6 shows the largest gain at ~30 percentage points (50.0% → 79.9%); Nemotron 3 Super 120B improves by 12.8%

Check out the Technical details here. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Credit: Source link