On-device AI agents hit a hard memory limit. Apple's new architecture routes around it.

On-device AI models have stayed small because the entire weight set has to live in DRAM, capping practical parameter counts well below what server-side deployments use. Enterprise architects evaluating agentic workloads have had to choose between capable cloud-dependent models and limited on-device ones. Apple’s third-generation foundation models, announced at WWDC26, break that constraint by moving the full weight set out of DRAM and keeping only the active portion in memory.

The AFM 3 family was developed in collaboration with Google and spans five models: two on-device and three server-based, with the server models running within Apple’s Private Cloud Compute boundary. The server-side models, including AFM 3 Cloud Pro for agentic tool use and complex reasoning, run on Nvidia GPUs in Google Cloud. The on-device architecture is Apple’s own. AFM 3 Core Advanced is a 20-billion-parameter model that stores weights in NAND flash rather than DRAM.

“Instead of forcing the entire model into DRAM, the full model is stored in flash memory,” Apple’s research team wrote. “Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt.”

How the architecture actually works

The memory wall Apple is working around is one every local AI developer runs into.

“You can’t put 20B parameters in RAM at any reasonable precision,” Awni Hannun, a researcher at Anthropic and former Apple research scientist, posted on X. “To make it work they are using pretty exotic architecture by today’s standards. A small model predicts from the query (or prompt) which experts to load from NAND into RAM.”

That prediction-and-load mechanism has three distinct components, each driven by the hardware constraints of consumer silicon.

The full 20B weight set lives in flash, not DRAM. AFM 3 Core Advanced stores its entire parameter set in NAND flash rather than active memory. Standard on-device deployments require the full model to fit in DRAM, which is what caps their parameter counts. Apple’s approach, which it calls Instruction-Following Pruning (IFP) and which its own researchers developed, treats flash as the model’s permanent home and DRAM as a working buffer for whichever experts a given prompt requires.
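
To see why a 20-billion-parameter model does not fit in DRAM, a back-of-envelope calculation helps. The sketch below multiplies the parameter count by bits per weight. The three precisions are illustrative assumptions; Apple has not said in the material quoted here which precision AFM 3 Core Advanced uses, and the function name is hypothetical.

```python
# Back-of-envelope weight footprint: parameters x bits per weight / 8 bytes.
# The precisions below are illustrative; the article does not state Apple's.

def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 20e9  # full parameter pool, as reported in the article

if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(TOTAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions the weights alone come to 40 GB at 16 bits, 20 GB at 8 bits and 10 GB at 4 bits, before counting the memory the operating system, other apps and the model's own runtime state need.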

Expert routing happens once per prompt, not per token. In a conventional Mixture of Experts model, a router selects different experts for every token generated — which would require continuous weight movement between flash and DRAM at inference speed. NAND-to-DRAM bandwidth cannot support that. AFM 3 Core Advanced routes once at prompt time, selects a fixed expert set, loads it into DRAM alongside always-active shared experts, and generates all tokens from that same configuration.
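
The difference between the two routing schemes can be sketched as a toy simulation that counts how many experts have to be fetched from flash while generating a response. Everything here is an assumption for illustration: the pool size, the number of active experts, and the seeded random stand-in for the router, which in Apple's design is a small model that reads the prompt. It is not Apple's implementation.

```python
import random

NUM_EXPERTS = 16       # hypothetical size of the routed-expert pool in flash
EXPERTS_ACTIVE = 4     # hypothetical number of routed experts held in DRAM


def route(seed: int) -> frozenset:
    """Stand-in router: deterministically pick an expert set from a seed."""
    rng = random.Random(seed)
    return frozenset(rng.sample(range(NUM_EXPERTS), EXPERTS_ACTIVE))


def count_flash_loads(num_tokens: int, per_prompt: bool) -> int:
    """Count expert fetches from flash into DRAM while generating num_tokens."""
    resident = frozenset()  # experts currently sitting in DRAM
    loads = 0
    for token_index in range(num_tokens):
        # Per-prompt routing reuses one decision for every generated token;
        # per-token routing asks the router again at each step.
        wanted = route(0) if per_prompt else route(token_index)
        loads += len(wanted - resident)  # fetch only experts not yet resident
        resident = wanted                # DRAM holds just the active set
    return loads


if __name__ == "__main__":
    print("per-prompt loads:", count_flash_loads(100, per_prompt=True))
    print("per-token loads: ", count_flash_loads(100, per_prompt=False))
```

With per-prompt routing the fetch count stays at the size of the active set however long the response runs, while per-token routing keeps going back to flash as the selected experts change.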

“The key distinction from a typical MoE is that you do this once per query and then generate all the tokens with the same experts,” Hannun wrote.

Source: Apple Machine Learning Research, June 8, 2026.

Active parameter count scales from 1B to 4B depending on task complexity. Rather than running a fixed model size for every request, AFM 3 Core Advanced adjusts how many parameters it activates based on what the task requires — 1 billion for simpler operations, up to 4 billion for harder ones, all drawn from the 20-billion-parameter pool in flash.
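
As a rough illustration of what that range means, the sketch below maps a task-complexity score to an active-parameter budget and reports it as a share of the 20-billion-parameter pool. The complexity score and the linear mapping are assumptions made for this example; Apple has not described how the model decides how many parameters to activate.

```python
TOTAL_PARAMS_B = 20.0  # parameter pool in flash, in billions (from the article)
MIN_ACTIVE_B = 1.0     # active parameters for the simplest requests
MAX_ACTIVE_B = 4.0     # active parameters for the hardest requests


def active_budget_b(complexity: float) -> float:
    """Map a hypothetical complexity score in [0, 1] to active parameters.

    Linear interpolation is an assumption made for illustration only.
    """
    clamped = min(max(complexity, 0.0), 1.0)
    return MIN_ACTIVE_B + clamped * (MAX_ACTIVE_B - MIN_ACTIVE_B)


if __name__ == "__main__":
    for score in (0.0, 0.5, 1.0):
        active = active_budget_b(score)
        share = active / TOTAL_PARAMS_B
        print(f"complexity {score:.1f}: {active:.1f}B active ({share:.1%} of pool)")
```

At the two ends of the reported range, 1 billion active parameters is 5% of the pool and 4 billion is 20%, so most of the model stays in flash for any single request.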

What Apple has and hasn’t disclosed

The architecture paper is detailed on the memory design and sparse activation mechanism. It is less forthcoming on practical deployment constraints.

Apple’s profiling tools expose timing but not the metrics that decide production viability. “Energy, memory bandwidth, thermal? Not in the docs,” Marco Abis, who is building Ziraph, a profiler for local AI on Apple silicon, posted on X. “A notable gap, given those decide most of on-device performance.”

Abis also found no statement in Apple’s documentation (the Core AI docs, the Foundation Models docs or the Private Cloud Compute security post) of when an on-device request transparently offloads to the server tier, or whether that routing is visible to the developer or the user. For enterprises that need to document where inference runs, that is a direct compliance problem.

Those details are not yet public. Apple has indicated that a full technical report with benchmarks is coming later this summer.

What this means for enterprise architects

Regulated industries evaluating agentic AI deployments now have a concrete architectural decision to make.

The DRAM wall for on-device agents just moved. Enterprises that need agents to run without a cloud round-trip now have a 20-billion-parameter local option to evaluate. The constraint shifts from model capability to device hardware.

The private/cloud boundary is now an architectural decision, not a default. Simpler requests stay on-device; complex agentic tasks route to AFM 3 Cloud Pro on Private Cloud Compute. Apple has not publicly specified when a request offloads or whether that routing is visible to the developer, a gap that complicates policy decisions for organizations that need to document where inference runs.

The agentic server tier depends on Google Cloud. AFM 3 Cloud Pro runs on Nvidia GPUs in Google Cloud. The Private Cloud Compute guarantee covers data privacy; it does not eliminate the Google Cloud dependency for server-side inference.

AFM 3 Core Advanced gives enterprises a 20-billion-parameter on-device option that did not exist before WWDC26. Whether it is deployable at scale depends on answers Apple has not yet published. Those details are due in the summer technical report.
