TradePoint.io

The team behind continuous batching says your idle GPUs should be running inference, not sitting dark

March 12, 2026
in AI & Technology

Every GPU cluster has dead time. Training jobs finish, workloads shift and hardware sits dark while power and cooling costs keep running. For neocloud operators, those empty cycles are lost margin.

The obvious workaround is spot GPU markets — renting spare capacity to whoever needs it. But spot instances mean the cloud vendor is still the one doing the renting, and engineers buying that capacity are still paying for raw compute with no inference stack attached.

FriendliAI’s answer is different: run inference directly on the unused hardware, optimize for token throughput, and split the revenue with the operator. FriendliAI was founded by Byung-Gon Chun, the researcher whose paper on continuous batching became foundational to vLLM, the open source inference engine used across most production deployments today.

Chun spent over a decade as a professor at Seoul National University studying efficient execution of machine learning models at scale. That research produced a paper called Orca, which introduced continuous batching. The technique processes inference requests dynamically rather than waiting to fill a fixed batch before executing. It is now industry standard and is the core mechanism inside vLLM.
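To make the contrast concrete, here is a toy sketch in Python. It is not Orca's or vLLM's actual scheduler. It assumes every request needs a known number of decode steps, that each step produces one token per in-flight request, and it ignores prefill and memory limits. The function names and request lengths are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Decode steps when each fixed batch must finish before the next starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # A fixed batch occupies the GPU until its longest request finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled right away."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots before each iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration generates one token for every in-flight request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 3]  # hypothetical output lengths, in tokens
print(static_batching(lengths, 4))      # 11
print(continuous_batching(lengths, 4))  # 8
```

In this toy workload the same eight requests finish in 8 iterations instead of 11, because short requests no longer hold a slot idle while the longest request in their batch completes.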

This week, FriendliAI is launching a new platform called InferenceSense. Just as publishers use Google AdSense to monetize unsold ad inventory, neocloud operators can use InferenceSense to fill unused GPU cycles with paid AI inference workloads and collect a share of the token revenue. The operator’s own jobs always take priority — the moment a scheduler reclaims a GPU, InferenceSense yields.

“What we are providing is that instead of letting GPUs be idle, by running inferences they can monetize those idle GPUs,” Chun told VentureBeat.

How a Seoul National University lab built the engine inside vLLM

Chun founded FriendliAI in 2021, before most of the industry had shifted attention from training to inference. The company’s primary product is a dedicated inference endpoint service for AI startups and enterprises running open-weight models. FriendliAI also appears as a deployment option on Hugging Face alongside Azure, AWS and GCP, and currently supports more than 500,000 open-weight models from the platform.

InferenceSense now extends that inference engine to the capacity problem GPU operators face between workloads.

How it works

InferenceSense runs on top of Kubernetes, which most neocloud operators are already using for resource orchestration. An operator allocates a pool of GPUs to a Kubernetes cluster managed by FriendliAI — declaring which nodes are available and under what conditions they can be reclaimed. Idle detection runs through Kubernetes itself.

“We have our own orchestrator that runs on the GPUs of these neocloud — or just cloud — vendors,” Chun said. “We definitely take advantage of Kubernetes, but the software running on top is a really highly optimized inference stack.”

When GPUs are unused, InferenceSense spins up isolated containers serving paid inference workloads on open-weight models including DeepSeek, Qwen, Kimi, GLM and MiniMax. When the operator’s scheduler needs hardware back, the inference workloads are preempted and GPUs are returned. FriendliAI says the handoff happens within seconds.
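The fill-and-yield behavior described above can be sketched as a small state machine. This is an illustrative model only, not FriendliAI's orchestrator or any Kubernetes API: the `GpuPool` class, its method names and the GPU identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class GpuState(Enum):
    OPERATOR = "operator"    # running the operator's own job
    IDLE = "idle"            # unused and eligible for inference
    INFERENCE = "inference"  # serving a paid inference workload


@dataclass
class GpuPool:
    """Hypothetical model of the GPUs an operator has opted in."""
    states: dict = field(default_factory=dict)

    def add(self, gpu_id):
        self.states[gpu_id] = GpuState.IDLE

    def fill_idle(self):
        """Start inference on every idle GPU; return the GPUs that were filled."""
        filled = [g for g, s in self.states.items() if s is GpuState.IDLE]
        for g in filled:
            self.states[g] = GpuState.INFERENCE
        return filled

    def reclaim(self, count):
        """Return up to `count` GPUs to the operator, preempting inference if needed."""
        # Take idle GPUs first, then preempt inference; operator jobs always win.
        order = [GpuState.IDLE, GpuState.INFERENCE]
        candidates = [g for st in order for g, s in self.states.items() if s is st]
        taken = candidates[:count]
        for g in taken:
            self.states[g] = GpuState.OPERATOR
        return taken


pool = GpuPool()
for gpu in ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]:
    pool.add(gpu)

filled = pool.fill_idle()    # all four GPUs start serving inference
reclaimed = pool.reclaim(2)  # the operator's scheduler asks for two GPUs back
print(filled, reclaimed)
```

A real system would also have to drain or migrate in-flight requests during the handoff, which this sketch leaves out.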

Demand is aggregated through FriendliAI’s direct clients and through inference aggregators like OpenRouter. The operator supplies the capacity; FriendliAI handles the demand pipeline, model optimization and serving stack. There are no upfront fees and no minimum commitments. A real-time dashboard shows operators which models are running, tokens being processed and revenue accrued.

Why token throughput beats raw capacity rental

Spot GPU markets from providers like CoreWeave, Lambda Labs and RunPod involve the cloud vendor renting out its own hardware to a third party. InferenceSense runs on hardware the neocloud operator already owns, with the operator defining which nodes participate and setting scheduling agreements with FriendliAI in advance. The distinction matters: spot markets monetize capacity, while InferenceSense monetizes tokens.

Token throughput per GPU-hour determines how much InferenceSense can actually earn during unused windows. FriendliAI claims its engine delivers two to three times the throughput of a standard vLLM deployment, though Chun notes the figure varies by workload type.
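A back-of-the-envelope calculation shows why throughput matters. Every number below is a hypothetical placeholder: the article does not disclose token prices, the revenue split or measured throughput. The sketch also assumes demand is high enough to keep the GPU busy for the whole idle window.

```python
def idle_revenue(tokens_per_second, idle_hours, price_per_million_tokens, operator_share):
    """Operator's revenue from one GPU over an idle window (all inputs hypothetical)."""
    tokens = tokens_per_second * 3600 * idle_hours
    gross = tokens / 1_000_000 * price_per_million_tokens
    return gross * operator_share


# Hypothetical inputs: 6 idle hours, $0.50 per million tokens, 50% operator share.
baseline = idle_revenue(1_000, 6, 0.50, 0.5)  # assumed baseline throughput
faster = idle_revenue(2_500, 6, 0.50, 0.5)    # 2.5x, inside the claimed 2-3x range
print(f"baseline: ${baseline:.2f}, 2.5x throughput: ${faster:.2f}")
```

Because revenue in this model is linear in tokens per second, any throughput multiplier carries straight through to the operator's payout for the same idle window.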

Most competing inference stacks are built on Python-based open source frameworks. FriendliAI’s engine is written in C++ and uses custom GPU kernels rather than Nvidia’s cuDNN library. The company has built its own model representation layer for partitioning and executing models across hardware, with its own implementations of speculative decoding, quantization and KV-cache management.

If FriendliAI’s engine does process more tokens per GPU-hour than a standard vLLM stack, as the company claims, operators should generate more revenue per unused cycle than they could by standing up their own inference service.

What AI engineers evaluating inference costs should watch

For AI engineers evaluating where to run inference workloads, the neocloud versus hyperscaler decision has typically come down to price and availability.

InferenceSense adds a new consideration: if neoclouds can monetize idle capacity through inference, they have more economic incentive to keep token prices competitive.

That is not a reason to change infrastructure decisions today — it is still early. But engineers tracking total inference cost should watch whether neocloud adoption of platforms like InferenceSense puts downward pressure on API pricing for models like DeepSeek and Qwen over the next 12 months.

“When we have more efficient suppliers, the overall cost will go down,” Chun said. “With InferenceSense we can contribute to making those models cheaper.”
