Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

April 17, 2026
in AI & Technology

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment.

To bridge this gap, researchers at the University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T2) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples.

In practice, their analysis shows that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, and then spend the saved compute on generating repeated samples at inference.

For enterprise AI application developers who are training their own models, this research provides a proven blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets.

Conflicting scaling laws

Scaling laws are an important part of developing large language models. Pretraining scaling laws dictate the best way to allocate compute during the model’s creation, while test-time scaling laws guide how to allocate compute during deployment, such as letting the model “think longer” or generating multiple reasoning samples to solve complex problems.

The problem is that these scaling laws have been developed completely independently of one another despite being fundamentally intertwined.

A model’s parameter size and training duration directly dictate both the quality and the per-query cost of its inference samples. Currently, the industry gold standard for pretraining is the Chinchilla rule, which suggests a compute-optimal ratio of roughly 20 training tokens for every model parameter.
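As a sanity check on that ratio, here is a minimal sketch (not code from the paper) of how the Chinchilla rule turns a fixed training budget into a model size and token count, using the standard approximation that training costs about 6ND FLOPs:

```python
import math

# Chinchilla-style allocation (a sketch, not code from the paper):
# training compute C ~= 6 * N * D FLOPs, with the rule of thumb
# D ~= 20 * N (20 training tokens per parameter). Substituting gives
# C ~= 120 * N^2, so N = sqrt(C / 120).
def chinchilla_allocation(train_flops: float) -> tuple[float, float]:
    n_params = math.sqrt(train_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = chinchilla_allocation(1e21)  # a 1e21-FLOP training budget
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")  # roughly 2.9B params, 58B tokens
```

The T2 result is precisely that this allocation stops being optimal once inference-time sampling costs are added to the equation.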

However, creators of modern AI model families, such as Llama, Gemma, and Qwen, regularly break this rule by intentionally overtraining their smaller models on massive amounts of data.

As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach falters when building complex agentic workflows: “In my view, the inference stack breaks down when each individual inference call is expensive. This is the case when the models are large and you need to do a lot of repeated sampling.” Instead of relying on massive models, developers can use overtrained compact models to run this repeated sampling at a fraction of the cost.

But because training and test-time scaling laws are examined in isolation, there is no rigorous framework to calculate how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment.

Consequently, there has previously been no formula that jointly optimizes model size, training data volume, and test-time inference budgets.

The reason that this framework is hard to formulate is that pretraining and test-time scaling speak two different mathematical languages. During pretraining, a model’s performance is measured using “loss,” a smooth, continuous metric that tracks prediction errors as the model learns.

At test time, developers use real-world, downstream metrics to evaluate a model’s reasoning capabilities, such as pass@k, which measures the probability that a model will produce at least one correct answer across k independent, repeated attempts.
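Under the simplifying assumption that the k attempts are independent, each succeeding with a fixed probability p, pass@k has a simple closed form; a minimal illustration:

```python
# pass@k under the simplifying assumption of independent attempts: if one
# sample solves the task with probability p, then at least one of k
# independent samples succeeds with probability 1 - (1 - p)^k.
def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

# Even a model that is weak per attempt (p = 0.1) recovers substantial
# accuracy from repeated sampling:
for k in (1, 10, 100):
    print(k, round(pass_at_k(0.1, k), 4))
```

This discrete, saturating metric is what makes pass@k hard to reconcile with the smooth power-law behavior of pretraining loss.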

Train-to-test scaling laws

To solve the disconnect between training and deployment, the researchers introduce Train-to-Test (T2) scaling laws. At a high level, this framework predicts a model’s reasoning performance by treating three variables as a single equation: the model’s size (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k).

“Train-to-test” combines the pretraining and test-time scaling laws into a unified framework (source: arXiv)

T2 combines pretraining and inference budgets into one optimization formula that accounts for both the baseline cost to train the model (6ND) and the compounding cost of querying it repeatedly at inference (2Nk). The researchers tried two modeling approaches: fitting either the pretraining loss or the downstream test-time performance (pass@k) as a function of N, D, and k.
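A back-of-envelope sketch of that end-to-end accounting (the deployment numbers here, such as query volume and tokens per sample, are illustrative assumptions, not figures from the paper):

```python
# Back-of-envelope end-to-end FLOP accounting in the spirit of the T2
# formula: ~6 * N * D FLOPs to train, ~2 * N FLOPs per generated token at
# inference. All deployment numbers below are illustrative assumptions.
def total_flops(n_params, n_tokens, k_samples, n_queries, tokens_per_sample):
    train = 6 * n_params * n_tokens
    inference = 2 * n_params * k_samples * n_queries * tokens_per_sample
    return train + inference

# A Chinchilla-style 7B model (20 tokens/param) vs. a smaller, heavily
# overtrained 1B model, both serving 1M queries with k = 8 samples each:
big = total_flops(7e9, 140e9, 8, 1e6, 512)
small = total_flops(1e9, 600e9, 8, 1e6, 512)
print(f"7B: {big:.3g} FLOPs, overtrained 1B: {small:.3g} FLOPs")
```

The sketch only shows that the cost side of the trade favors the smaller model; whether it also matches the larger model on pass@k is exactly what the T2 fits are designed to predict.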

The first approach takes the familiar mathematical equation used for Chinchilla scaling (which calculates a model’s prediction error, or loss) and directly modifies it by adding a new variable that accounts for the number of repeated test-time samples (k). This allows developers to see how increasing inference compute drives down the model’s overall error rate.

The second approach directly models the downstream pass@k accuracy. It tells developers the probability that their application will solve a problem given a specific compute budget.

But should enterprises use this framework for every application? Roberts clarifies that this approach is highly specialized. “I imagine that you would not see as much of a benefit for knowledge-heavy applications, such as chat models,” he said. Instead, “T2 is tailored to reasoning-heavy applications such as coding, where typically you would use repeated sampling as your test-time scaling method.”

What it means for developers

To validate the T2 scaling laws, the researchers built an extensive testbed of over 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, heavily overtrained checkpoints from scratch to test if their mathematical forecasts held up in reality. They then benchmarked the models across eight diverse tasks, which included real-world datasets like SciQ and OpenBookQA, alongside synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge recall.

Both of their mathematical models proved that the compute-optimal frontier shifts drastically away from standard Chinchilla scaling. To maximize performance under a fixed budget, the optimal choice is a model that is significantly smaller and trained on vastly more data than the traditional 20-tokens-per-parameter rule dictates.


The train-to-test scaling laws show that small overtrained models outperform Chinchilla-optimized models on reasoning tasks (source: arXiv)

In their experiments, the highly overtrained small models consistently outperformed the larger, Chinchilla-optimal models across all eight evaluation tasks when test-time sampling costs were accounted for.

For developers looking to deploy these findings, the technical barrier is surprisingly low.

“Nothing fancy is required to perform test-time scaling with our current models,” Roberts said. “At deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g. KV caching if you’re using a transformer).”

KV caching helps by storing previously processed context so the model doesn’t have to re-read the initial prompt from scratch for every new reasoning sample.
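Rough arithmetic makes the benefit concrete (the 2N-FLOPs-per-token approximation and all numbers here are illustrative assumptions):

```python
# Rough arithmetic for the KV-caching savings described above: without a
# shared cache, each of the k samples re-processes the prompt; with prefix
# caching, the prompt is processed once and reused. The 2 * N FLOPs-per-token
# rule and all numbers here are illustrative assumptions.
def prompt_flops(n_params: float, prompt_tokens: int, k: int, cached: bool) -> float:
    passes = 1 if cached else k  # cached: one prompt pass shared by all samples
    return 2 * n_params * prompt_tokens * passes

n_params, prompt_tokens, k = 1e9, 2000, 16
saved = (prompt_flops(n_params, prompt_tokens, k, cached=False)
         - prompt_flops(n_params, prompt_tokens, k, cached=True))
print(f"saved ~ {saved:.3g} FLOPs per query")  # the k-1 redundant prompt passes
```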

However, extreme overtraining comes with practical trade-offs. Overtrained models can be notoriously stubborn and harder to fine-tune, but Roberts notes that when the team applied supervised fine-tuning, “while this effect was present, it was not a strong enough effect to pull the optimal model back to Chinchilla.” The compute-optimal strategy remains definitively skewed toward compact models.

Yet, teams pushing this to the absolute limit must be wary of hitting physical data limits. “Another angle is that if you take our overtraining recommendations to the extreme, you may actually run out of training data,” Roberts said, referring to the looming “data wall” where high-quality internet data is exhausted.

These experiments confirm that if an application relies on generating multiple test-time reasoning samples, aggressively overtraining a compact model is practically and mathematically the most effective way to spend an end-to-end compute budget.

To help developers get started, the research team plans to open-source their checkpoints and code soon, allowing enterprises to plug in their own data and test the scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry. 

This is especially crucial because the high price of frontier models can become a barrier when scaling agentic applications that rely on reasoning models.

“T2 fundamentally changes who gets to build strong reasoning models,” Roberts concludes. “You might not need massive compute budgets to get state-of-the-art reasoning. Instead, you need good data and smart allocation of your training and inference budget.”
