
ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings

March 18, 2026
in AI & Technology
Reading Time: 6 mins read

Large language models (LLMs) are transitioning from conversational assistants to autonomous agents capable of executing complex professional workflows. However, their deployment in enterprise environments remains limited by the lack of benchmarks that capture the specific challenges of professional settings: long-horizon planning, persistent state changes, and strict access protocols. To address this gap, researchers from ServiceNow Research, Mila, and Université de Montréal have introduced EnterpriseOps-Gym, a high-fidelity sandbox designed to evaluate agentic planning in realistic enterprise scenarios.

https://arxiv.org/pdf/2603.13594

The Evaluation Environment

EnterpriseOps-Gym features a containerized Docker environment that simulates eight mission-critical enterprise domains:


  • Operational Domains: Customer Service Management (CSM), Human Resources (HR), and IT Service Management (ITSM).
  • Collaboration Domains: Email, Calendar, Teams, and Drive.
  • Hybrid Domain: Cross-domain tasks requiring coordinated execution across multiple systems.

The benchmark comprises 164 relational database tables and 512 functional tools. With a mean foreign key degree of 1.7, the environment presents high relational density, forcing agents to navigate complex inter-table dependencies to maintain referential integrity. The benchmark includes 1,150 expert-curated tasks, with execution trajectories averaging 9 steps and reaching up to 34 steps.

Performance Results: A Capability Gap

The research team evaluated 14 frontier models using a pass@1 metric, where a task is successful only if all outcome-based SQL verifiers pass.

Model                | Average Success Rate | Cost per Task (USD)
---------------------|----------------------|--------------------
Claude Opus 4.5      | 37.4%                | $0.36
Gemini-3-Flash       | 31.9%                | $0.03
GPT-5.2 (High)       | 31.8%                | not listed
Claude Sonnet 4.5    | 30.9%                | $0.26
GPT-5                | 29.8%                | $0.16
DeepSeek-V3.2 (High) | 24.5%                | $0.014
GPT-OSS-120B (High)  | 23.7%                | $0.015
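The pass@1 criterion described above can be sketched as follows. The schema and verifier predicates are illustrative stand-ins, not the benchmark's actual checks: each verifier is a query over the final database state, and the task counts as a success only if every verifier agrees.

```python
import sqlite3

# Toy final database state after an agent run (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT, assignee TEXT)")
conn.execute("INSERT INTO tickets VALUES (7, 'closed', 'alice')")

# Each verifier is a (query, expected_value) pair over the final state.
verifiers = [
    ("SELECT status FROM tickets WHERE id = 7", "closed"),
    ("SELECT assignee FROM tickets WHERE id = 7", "alice"),
]

def task_passes(conn, verifiers) -> bool:
    """pass@1 semantics: the task succeeds only if ALL verifiers pass."""
    return all(
        conn.execute(q).fetchone()[0] == expected
        for q, expected in verifiers
    )

print(task_passes(conn, verifiers))  # True for this toy state
```

Outcome-based checking of this kind is strict by design: a trajectory that does most of the work but leaves one field wrong scores zero.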

The results indicate that even state-of-the-art models fail to reach 40% reliability in these structured environments. Performance is strongly domain-dependent; models performed best on collaboration tools (Email, Teams) but dropped significantly in policy-heavy domains like ITSM (28.5%) and Hybrid (30.7%) workflows.

Planning vs. Execution

A critical finding of this research is that strategic planning, rather than tool invocation, is the primary performance bottleneck.

The research team conducted ‘Oracle’ experiments where agents were provided with human-authored plans. This intervention improved performance by 14-35 percentage points across all models. Strikingly, smaller models like Qwen3-4B became competitive with much larger models when strategic reasoning was externalized. Conversely, adding ‘distractor tools’ to simulate retrieval errors had a negligible impact on performance, further suggesting that tool discovery is not the binding constraint.

Failure Modes and Safety Concerns

The qualitative analysis revealed four recurring failure patterns:

  1. Missing Prerequisite Lookup: Creating objects without querying necessary prerequisites, leading to “orphaned” records.
  2. Cascading State Propagation: Failing to trigger follow-up actions required by system policies after a state change.
  3. Incorrect ID Resolution: Passing unverified or guessed identifiers to tool calls.
  4. Premature Completion Hallucination: Declaring a task finished before all required steps are executed.
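A simple guard against the third failure mode (incorrect ID resolution) can be sketched at the tool layer. This is a hypothetical mitigation, not something the paper describes: identifiers are only accepted by write tools if they were previously returned by a lookup tool.

```python
# Hypothetical tool-layer guard: reject identifiers that were never
# returned by a lookup, so the agent cannot pass guessed IDs downstream.

class VerifiedIDs:
    """Registry of identifiers actually returned by lookup tools."""

    def __init__(self):
        self._seen: set[str] = set()

    def register(self, record_id: str) -> str:
        """Called by lookup tools when they return an ID."""
        self._seen.add(record_id)
        return record_id

    def check(self, record_id: str) -> str:
        """Called by write tools before acting on an ID."""
        if record_id not in self._seen:
            raise ValueError(f"unverified id {record_id!r}: look it up first")
        return record_id

ids = VerifiedIDs()
user_id = ids.register("USR-001")  # came from a lookup tool
ids.check(user_id)                 # fine: verified
try:
    ids.check("USR-999")           # guessed, never looked up
except ValueError as e:
    print(e)
```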

Furthermore, agents struggle with safe refusal. The benchmark includes 30 infeasible tasks (e.g., requests violating access rules or involving inactive users). The best-performing model, GPT-5.2 (Low), correctly refused these tasks only 53.9% of the time. In professional settings, failing to refuse an unauthorized or impossible task can lead to corrupted database states and security risks.

Orchestration and Multi-Agent Systems (MAS)

The research team also evaluated whether more complex agent architectures could close the performance gap. While a Planner+Executor setup (where one model plans and another executes) yielded modest gains, more complex decomposition architectures often regressed performance. In domains like CSM and HR, tasks have strong sequential state dependencies; breaking these into sub-tasks for separate agents often disrupted the necessary context, leading to lower success rates than simple ReAct loops.
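The Planner+Executor pattern can be sketched skeletally as below. The stub functions stand in for LLM calls, and the whole thing is an architectural illustration rather than the paper's implementation; the key design point is that state flows through a single shared context rather than being split across agents.

```python
# Skeletal Planner+Executor loop. The planner and executor functions are
# stubs standing in for LLM calls (hypothetical, for illustration only).

def planner(task: str) -> list[str]:
    """One model produces a step-by-step plan for the whole task."""
    return [
        f"step {i}: {part.strip()}"
        for i, part in enumerate(task.split(";"), 1)
    ]

def executor(step: str, state: dict) -> dict:
    """A second model executes each step against the shared state."""
    state.setdefault("log", []).append(step)
    return state

def run(task: str) -> dict:
    state: dict = {}
    for step in planner(task):      # plan once up front, then execute
        state = executor(step, state)  # sequentially over ONE shared state
    return state

print(run("look up user; create ticket; notify owner")["log"])
```

Decomposing further, with separate agents owning separate sub-tasks and states, is exactly where the paper reports regressions in sequentially dependent domains like CSM and HR.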

Economic Considerations: The Pareto Frontier

For deployment, the benchmark establishes a clear cost-performance tradeoff:

  • Gemini-3-Flash represents the strongest practical tradeoff for closed-source models, offering 31.9% performance at a 90% lower cost than GPT-5 or Claude Sonnet 4.5.
  • DeepSeek-V3.2 (High) and GPT-OSS-120B (High) are the dominant open-source options, offering approximately 24% performance at roughly $0.015 per task.
  • Claude Opus 4.5 remains the benchmark for absolute reliability (37.4%) but at the highest cost of $0.36 per task.
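The Pareto frontier claimed above can be checked directly from the results table (GPT-5.2 is omitted because its per-task cost is not listed). A model is Pareto-optimal when no other model is at least as cheap and at least as successful:

```python
# (cost per task in USD, average success rate in %) from the results table.
models = {
    "Claude Opus 4.5": (0.36, 37.4),
    "Gemini-3-Flash": (0.03, 31.9),
    "Claude Sonnet 4.5": (0.26, 30.9),
    "GPT-5": (0.16, 29.8),
    "DeepSeek-V3.2 (High)": (0.014, 24.5),
    "GPT-OSS-120B (High)": (0.015, 23.7),
}

def pareto_frontier(models: dict) -> list[str]:
    """Keep models not dominated by a cheaper-or-equal, better-or-equal rival."""
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            c <= cost and s >= score and (c, s) != (cost, score)
            for c, s in models.values()
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

print(pareto_frontier(models))
# ['Claude Opus 4.5', 'DeepSeek-V3.2 (High)', 'Gemini-3-Flash']
```

This reproduces the article's reading: Gemini-3-Flash dominates GPT-5 and Claude Sonnet 4.5, DeepSeek-V3.2 dominates GPT-OSS-120B, and Claude Opus 4.5 survives only on absolute success rate.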

Key Takeaways

  • Benchmark Scale and Complexity: EnterpriseOps-Gym provides a high-fidelity evaluation environment featuring 164 relational database tables and 512 functional tools across eight enterprise domains.
  • Significant Performance Gap: Current frontier models are not yet reliable for autonomous deployment; the top-performing model, Claude Opus 4.5, achieves only a 37.4% success rate.
  • Planning as the Primary Bottleneck: Strategic reasoning is the binding constraint rather than tool execution, as providing agents with human-authored plans improves performance by 14 to 35 percentage points.
  • Inadequate Safe Refusal: Models struggle to identify and refuse infeasible or policy-violating requests, with even the best-performing model cleanly abstaining only 53.9% of the time.
  • Thinking Budget Limitations: While increasing test-time compute yields gains in some domains, performance plateaus in others, suggesting that more ‘thinking’ tokens cannot fully overcome fundamental gaps in policy understanding or domain knowledge.

Check out the paper, code, and technical details.

The post ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings appeared first on MarkTechPost.
