When Claude changed, everything changed: Managing AI blast radius in production

June 6, 2026

Our system did one thing, and it did it well: It turned natural-language questions into API calls.

The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:

json

{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered the result via email, as a Drive document, or as a chart rendered in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was the structured JSON object shown in the example above: a description, an api_call path, and a post_body payload.
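One way to make such a contract explicit is to parse the model's output into a typed object before anything downstream touches it. The sketch below is illustrative only: `ApiRequest` and `parse_contract` are hypothetical names, not the system's actual code, and the validation rules are assumptions based on the three fields in the example.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ApiRequest:
    """Typed view of the three-field contract (hypothetical helper)."""
    description: str
    api_call: str
    post_body: dict[str, Any]


def parse_contract(raw: str) -> ApiRequest:
    """Parse raw model output and reject anything that is not the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("contract must be a JSON object")
    expected = {"description", "api_call", "post_body"}
    if set(data) != expected:
        raise ValueError(f"unexpected or missing fields: {sorted(set(data) ^ expected)}")
    if not isinstance(data["description"], str):
        raise ValueError("description must be a string")
    if not isinstance(data["api_call"], str) or not data["api_call"].startswith("/"):
        raise ValueError("api_call must be a path string")
    if not isinstance(data["post_body"], dict):
        raise ValueError("post_body must be an object")
    return ApiRequest(data["description"], data["api_call"], data["post_body"])


raw = json.dumps({
    "description": "User requested sales volume for the given date range",
    "api_call": "/api/sales_volume",
    "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31", "region": "northeast"},
})
request = parse_contract(raw)
```

A check like this covers only the shape of the response; it says nothing about whether the values inside each field are sensible.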


We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty, so the API call was made without the date range or region filter. Depending on the API being called, the backend either returned sales volume for all time and all regions or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.
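For illustration, a minimal version of the missing path might classify each model output as either a dispatchable call or a question to send back to the user. This is a hypothetical sketch, not what our system did at the time; `ApiCall`, `NeedsClarification`, and `route_model_output` are invented names.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ApiCall:
    api_call: str
    post_body: dict


@dataclass(frozen=True)
class NeedsClarification:
    question: str


def route_model_output(raw: str) -> Union[ApiCall, NeedsClarification]:
    """Return an ApiCall only when the output is a usable structured object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Free-text output: treat it as a question for the user, never as a call.
        return NeedsClarification(question=raw.strip())
    if not isinstance(data, dict):
        return NeedsClarification(question=raw.strip())
    api_call, post_body = data.get("api_call"), data.get("post_body")
    if isinstance(api_call, str) and isinstance(post_body, dict):
        return ApiCall(api_call, post_body)
    # Structured but unusable: surface whatever text the model did provide.
    return NeedsClarification(question=str(data.get("description", raw)))


outcome = route_model_output("Which year do you mean by January through March?")
```

The point of the two-variant return type is that the caller has to handle the clarification case explicitly instead of assuming every invocation yields an API call.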

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being “helpful” in its formatting choices, decided that asking for clarification or including the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction, but it violated the assumptions under which our system was built.

The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.

Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren’t using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn’t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
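The gap between syntax and semantics can be shown with a response that is structurally valid yet wrong for the system. The sketch below is an assumption-laden illustration: `matches_schema` stands in for whatever a schema validator would enforce, and the rules in `semantic_violations` are hypothetical examples of constraints a schema cannot express.

```python
def matches_schema(resp: dict) -> bool:
    """Structural check only: right keys, right types."""
    return (
        isinstance(resp, dict)
        and isinstance(resp.get("description"), str)
        and isinstance(resp.get("api_call"), str)
        and isinstance(resp.get("post_body"), dict)
    )


def semantic_violations(resp: dict) -> list[str]:
    """Checks a schema cannot express (these rules are illustrative assumptions)."""
    problems = []
    if resp["description"].rstrip().endswith("?"):
        problems.append("description is a clarifying question")
    if not {"start_date", "end_date"}.issubset(resp["post_body"]):
        problems.append("no date range: request would default to all-time")
    return problems


# Structurally valid, semantically wrong on both counts.
resp = {
    "description": "Which year should the January through March range cover?",
    "api_call": "/api/sales_volume",
    "post_body": {},
}
schema_ok = matches_schema(resp)
problems = semantic_violations(resp)
```

Here `schema_ok` is true while `problems` lists two violations, which is the case a schema alone would wave through.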

The evals-first architecture

The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.

In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

python

def test_description_contains_no_serialized_payload(response):
    # The description must stay natural language: no commands, URLs, or serialized fields.
    desc = response["description"].lower()
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"
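The triple itself (input, property, scoring function) can also be written down as a small data structure. This is a minimal sketch with hypothetical names (`Eval`, `run_eval`, `binary_score`); `fake_system` is a stand-in for the real model call and simply returns a fixed response.

```python
from dataclasses import dataclass
from typing import Callable


def binary_score(passed: bool) -> float:
    """Simplest possible scoring function: 1.0 for pass, 0.0 for fail."""
    return 1.0 if passed else 0.0


@dataclass(frozen=True)
class Eval:
    """One eval: an input, a property the output must satisfy, and a scorer."""
    name: str
    user_input: str
    holds: Callable[[dict], bool]
    score: Callable[[bool], float] = binary_score


def run_eval(ev: Eval, system: Callable[[str], dict]) -> float:
    """Run the system under test on the eval's input and score the result."""
    return ev.score(ev.holds(system(ev.user_input)))


def fake_system(user_input: str) -> dict:
    """Stand-in for the real model call; returns a fixed, well-formed response."""
    return {
        "description": "User requested sales volume for the given date range",
        "api_call": "/api/sales_volume",
        "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31", "region": "northeast"},
    }


date_range_eval = Eval(
    name="date_range_reaches_post_body",
    user_input="Sales volume for January through March 2026, Northeast, by city",
    holds=lambda r: r["post_body"].get("start_date") == "2026-01-01"
    and r["post_body"].get("end_date") == "2026-03-31",
)
result = run_eval(date_range_eval, fake_system)
```

Keeping the property separate from the scorer leaves room for fuzzier scorers later without rewriting the properties.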

A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
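A gate of this kind could be as simple as running every property over every candidate response and refusing to proceed on any failure. The sketch below assumes hand-written properties only; `SUITE`, `gate`, and the two property functions are hypothetical, and the candidate responses are hard-coded stand-ins for real model output.

```python
from typing import Callable

Property = Callable[[dict], bool]


def description_is_plain_text(resp: dict) -> bool:
    desc = resp["description"].lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{", "http://", "https://"))


def post_body_has_filters(resp: dict) -> bool:
    return bool(resp["post_body"])


SUITE: dict[str, Property] = {
    "description_is_plain_text": description_is_plain_text,
    "post_body_has_filters": post_body_has_filters,
}


def gate(responses: list[dict], suite: dict[str, Property]) -> list[str]:
    """Return one message per failed check; an empty list means the suite is green."""
    failures = []
    for i, resp in enumerate(responses):
        for name, prop in suite.items():
            if not prop(resp):
                failures.append(f"{name} failed on case {i}")
    return failures


good = {
    "description": "User requested sales volume for the given date range",
    "api_call": "/api/sales_volume",
    "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31", "region": "northeast"},
}
# Mimics the regression: payload folded into the description, post_body left empty.
bad = {
    "description": 'Sales volume request. post_body: {"start_date": "2026-01-01"}',
    "api_call": "/api/sales_volume",
    "post_body": {},
}
failures = gate([good, bad], SUITE)
```

In a CI job, a non-empty `failures` list would fail the build and block the model or prompt change from merging.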

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said “the description field should not contain a curl command,” because nobody had thought the model would put one there.

Evals are not a silver bullet. They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.

The roadmap

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next several years.
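One possible way to gate a probabilistic outcome, offered here as an assumption rather than an established standard, is to run each eval many times and require that a conservative lower bound on the pass rate clears a threshold. The sketch uses the Wilson score interval; `gate_passes` and the 0.95 threshold are hypothetical choices.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is about 95%)."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def gate_passes(passes: int, trials: int, required_rate: float) -> bool:
    """Pass only if we are reasonably confident the true rate meets the bar."""
    return wilson_lower_bound(passes, trials) >= required_rate


small_sample = gate_passes(20, 20, required_rate=0.95)
large_sample = gate_passes(200, 200, required_rate=0.95)
```

With these numbers, 20 passes out of 20 does not clear a 0.95 bar while 200 out of 200 does, which illustrates why a handful of green smoke tests says little about production behavior.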

The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.

Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.

Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.

Welcome to the VentureBeat community!

Our guest posting program is where technical experts share insights and provide neutral, non-vested deep dives on AI, data infrastructure, cybersecurity and other cutting-edge technologies shaping the future of enterprise.

Read more from our guest post program — and check out our guidelines if you’re interested in contributing an article of your own!
