Meet ‘AutoAgent’: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight

April 5, 2026
in AI & Technology
Reading Time: 5 mins read

There’s a particular kind of tedium that every AI engineer knows intimately: the prompt-tuning loop. You write a system prompt, run your agent against a benchmark, read the failure traces, tweak the prompt, add a tool, rerun. Repeat this a few dozen times and you might move the needle. It’s grunt work dressed up in Python files. Now, a new open-source library called AutoAgent, built by Kevin Gu at thirdlayer.inc, proposes an unsettling alternative — don’t do that work yourself. Let an AI do it.

AutoAgent is an open-source library for autonomously improving an agent on any domain. In a 24-hour run, it hit #1 on SpreadsheetBench with a score of 96.5% and achieved the #1 GPT-5 score on TerminalBench with 55.1%.


https://x.com/kevingu/status/2039843234760073341

What Is AutoAgent, Really?

AutoAgent is described as being ‘like autoresearch but for agent engineering.’ The idea: give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

To understand the analogy: Andrej Karpathy’s autoresearch does the same thing for ML training — it loops through propose-train-evaluate cycles, keeping only changes that improve validation loss. AutoAgent ports that same ratchet loop from ML training into agent engineering. Instead of optimizing a model’s weights or training hyperparameters, it optimizes the harness — the system prompt, tool definitions, routing logic, and orchestration strategy that determine how an agent behaves on a task.

A harness, in this context, is the scaffolding around an LLM: what system prompt it receives, what tools it can call, how it routes between sub-agents, and how tasks are formatted as inputs. Most agent engineers hand-craft this scaffolding. AutoAgent automates the iteration on that scaffolding itself.
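To make the notion concrete, here is a minimal, purely illustrative sketch of what a harness bundles together. Every name below (Harness, route, the tool stubs) is invented for this example; none of it is AutoAgent's actual agent.py API.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of what a "harness" bundles together.
# These names do not come from AutoAgent's real agent.py.

@dataclass
class Harness:
    system_prompt: str                          # instructions the LLM receives
    tools: dict = field(default_factory=dict)   # name -> callable the agent may invoke
    max_turns: int = 20                         # orchestration limit

    def route(self, task: str) -> str:
        # trivial routing stub: a real harness would pick a sub-agent here
        return "default"

harness = Harness(
    system_prompt="You are a spreadsheet-editing agent. Think step by step.",
    tools={"read_cell": lambda ref: None, "write_cell": lambda ref, v: None},
)
```

Everything in this bundle — the prompt text, the tool set, the routing logic — is what the meta-agent is free to rewrite between runs.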

The Architecture: Two Agents, One File, One Directive

The GitHub repo has a deliberately simple structure. agent.py is the entire harness under test in a single file — it contains config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the primary edit surface for the meta-agent. program.md contains instructions for the meta-agent plus the directive (what kind of agent to build), and this is the only file the human edits.

Think of it as a separation of concerns between human and machine. The human sets the direction inside program.md. The meta-agent (a separate, higher-level AI) then reads that directive, inspects agent.py, runs the benchmark, diagnoses what failed, rewrites the relevant parts of agent.py, and repeats. The human never touches agent.py directly.

A critical piece of infrastructure that keeps the loop coherent across iterations is results.tsv — an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run, giving the meta-agent a history to learn from and calibrate what to try next. The full project structure also includes Dockerfile.base, an optional .agent/ directory for reusable agent workspace artifacts like prompts and skills, a tasks/ folder for benchmark payloads (added per benchmark branch), and a jobs/ directory for Harbor job outputs.

The optimization metric is the total score produced by the benchmark's task test suites, and the meta-agent hill-climbs on it. Every experiment produces a numeric score: keep the change if the score improves, discard it if not — the same loop as autoresearch.
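The ratchet itself is simple enough to sketch in a few lines. In this hypothetical Python sketch, propose_edit and run_benchmark are stand-ins for the meta-agent's rewrite step and the Harbor benchmark run; only the keep-if-better control flow mirrors what the article describes.

```python
import random

# Sketch of the autoresearch-style ratchet loop. propose_edit and
# run_benchmark are toy stand-ins: the real meta-agent rewrites agent.py
# and the real score comes from Harbor's task test suites.

def propose_edit(harness: str) -> str:
    # stand-in: the meta-agent would rewrite part of the harness here
    return harness + f"\n# tweak {random.randint(0, 9)}"

def run_benchmark(harness: str) -> float:
    # stand-in: Harbor would run the task suite and report a total score
    return min(1.0, len(harness) / 1000)

def hill_climb(harness: str, iterations: int = 5):
    best_score = run_benchmark(harness)
    log = [("baseline", best_score)]          # mirrors the role of results.tsv
    for i in range(iterations):
        candidate = propose_edit(harness)
        score = run_benchmark(candidate)
        if score > best_score:                # keep only improvements
            harness, best_score = candidate, score
            log.append((f"iter {i}: kept", score))
        else:
            log.append((f"iter {i}: discarded", score))
    return harness, best_score, log
```

The log list plays the role the article assigns to results.tsv: a per-experiment history the meta-agent can consult when deciding what to try next.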

The Task Format and Harbor Integration

Benchmarks are expressed as tasks in Harbor format. Each task lives under tasks/my-task/ and includes a task.toml for config such as timeouts and metadata, an instruction.md that is the prompt sent to the agent, a tests/ directory with a test.sh entry point that writes a score to /logs/reward.txt, and a test.py for verification using either deterministic checks or LLM-as-judge. An environment/Dockerfile defines the task container, and a files/ directory holds reference files mounted into the container. Tests write a score between 0.0 and 1.0 to the verifier logs, and that score is what the meta-agent hill-climbs on.
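As a rough illustration of the verification side, the sketch below scores an agent's JSON output against expected values and clamps the result into the 0.0–1.0 range before writing it to reward.txt. Only the reward-file convention comes from the article; the scoring logic itself is invented for this example.

```python
# Sketch of a deterministic Harbor-style verifier. The article says tests
# write a score between 0.0 and 1.0 to /logs/reward.txt; the JSON-matching
# check below is invented for illustration.
import json
from pathlib import Path

def score_output(output_path: str, expected: dict) -> float:
    """Fraction of expected key/value pairs the agent's JSON output matches."""
    try:
        produced = json.loads(Path(output_path).read_text())
    except (OSError, json.JSONDecodeError):
        return 0.0                      # missing or malformed output scores zero
    matched = sum(1 for k, v in expected.items() if produced.get(k) == v)
    return matched / len(expected) if expected else 0.0

def write_reward(score: float, log_dir: str = "/logs") -> None:
    """Clamp the score to [0.0, 1.0] and write it where the verifier expects."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    Path(log_dir, "reward.txt").write_text(f"{max(0.0, min(1.0, score)):.3f}")
```

A task's test.sh entry point would run a verifier like this after the agent finishes, so every experiment ends with a single comparable number.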

The LLM-as-judge pattern here is worth flagging: instead of only checking answers deterministically (like unit tests), the test suite can use another LLM to evaluate whether the agent’s output is ‘correct enough.’ This is common in agentic benchmarks where correct answers aren’t reducible to string matching.
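A hedged sketch of that pattern: try a cheap deterministic check first, then fall back to a judge for fuzzy cases. The stub_judge below is a crude word-overlap stand-in invented for this example; a real test.py would call an actual LLM at that step.

```python
# Hypothetical sketch of the LLM-as-judge pattern: use an exact check when
# possible, and fall back to a model's 0.0-1.0 verdict when the answer
# isn't reducible to string matching. judge_fn is a stub hook.

def grade(answer: str, reference: str, judge_fn=None) -> float:
    # 1) cheap deterministic check first
    if answer.strip().lower() == reference.strip().lower():
        return 1.0
    # 2) otherwise ask the judge for a clamped 0.0-1.0 verdict
    if judge_fn is not None:
        return max(0.0, min(1.0, judge_fn(answer, reference)))
    return 0.0

def stub_judge(answer: str, reference: str) -> float:
    # crude similarity proxy standing in for a real LLM call
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(r) if r else 0.0
```

The design point is that the deterministic path stays authoritative where it applies; the judge only decides cases the exact check cannot.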

Key Takeaways

  • Autonomous harness engineering works — AutoAgent proves that a meta-agent can replace the human prompt-tuning loop entirely, iterating on agent.py overnight without any human touching the harness files directly.
  • Benchmark results validate the approach — In a 24-hour run, AutoAgent hit #1 on SpreadsheetBench (96.5%) and the top GPT-5 score on TerminalBench (55.1%), beating every other entry that was hand-engineered by humans.
  • ‘Model empathy’ may be a real phenomenon — A Claude meta-agent optimizing a Claude task agent appeared to diagnose failures more accurately than when optimizing a GPT-based agent, suggesting same-family model pairing could matter when designing your AutoAgent loop.
  • The human’s job shifts from engineer to director — You don’t write or edit agent.py. You write program.md — a plain Markdown directive that steers the meta-agent. The distinction mirrors the broader shift in agentic engineering from writing code to setting goals.
  • It’s plug-and-play with any benchmark — Because tasks follow Harbor’s open format and agents run in Docker containers, AutoAgent is domain-agnostic. Any scorable task — spreadsheets, terminal commands, or your own custom domain — can become a target for autonomous self-optimization.

Check out the Repo and Tweet. Also, feel free to follow us on Twitter, and don't forget to join our 120k+ ML SubReddit and subscribe to our Newsletter. You can also join us on Telegram.

Need to partner with us to promote your GitHub repo, Hugging Face page, product release, webinar, etc.? Connect with us.

The post Meet ‘AutoAgent’: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight appeared first on MarkTechPost.

