How to build AI scaling laws for efficient LLM training and budget maximization

Sep 18, 2025 | AI

Developing large language models (LLMs) presents a significant challenge for researchers, who strive to achieve optimal performance within strict computational and financial limits. With training costs often soaring into the millions of dollars, developers must make critical, cost-sensitive choices regarding model architecture, optimization techniques, and training datasets well before commencing the resource-intensive training process.

To forecast the eventual quality and accuracy of a large model’s predictions, practitioners frequently rely on “scaling laws.” Building one involves training smaller, cheaper models and using their results to project the performance of a much larger target model. The difficulty is that there are thousands of possible ways to construct a scaling law.

Researchers at MIT and the MIT-IBM Watson AI Lab have compiled an extensive database of hundreds of models, together with metrics on their training and performance, and used it to approximate more than a thousand scaling laws. From this dataset, the team developed a meta-analysis and a practical guide for selecting small models and estimating scaling laws across different LLM families, so that a given budget is spent where it yields the most reliable performance predictions.

The application of mathematical models to artificial intelligence training processes, while an idea that emerged a few years ago, has recently taken a new direction. Jacob Andreas, an associate professor in the Department of Electrical Engineering and Computer Science and a principal investigator with the MIT-IBM Watson AI Lab, pointed out that earlier work predominantly focused on post-hoc analysis. This involved retrospectively examining the outcomes of training numerous models to inform how computational budgets should be allocated for future large-scale AI projects. The innovation, Andreas suggested, lies in shifting towards a more proactive approach, allowing for optimal compute budget decisions to be made before the training of new, complex models commences.

New research findings, a collaboration between Andreas and MIT-IBM Watson AI Lab researchers Leshem Choshen and Yang Zhang of IBM Research, were recently presented at the International Conference on Machine Learning.

Projecting the full scope of capabilities

Developing Large Language Models (LLMs) is an inherently resource-intensive endeavor, encompassing critical choices regarding parameter and token counts, data selection and volume, training methodologies, and precise tuning for target applications and output accuracy.

To navigate this costly process, scaling laws offer a way to forecast model behavior. They relate the loss of a large model to the performance of smaller, less expensive models from the same family, so developers do not have to fully train every candidate configuration. The smaller models used to fit a scaling law differ mainly in their parameter counts and in the number of tokens they are trained on.

According to Choshen, elucidating these scaling laws not only enables more informed pre-training decisions but also serves to democratize the field, empowering researchers with limited resources to understand and implement effective scaling principles.

Scaling laws use a simple functional form to estimate the performance of larger models. The form combines quantities fitted on smaller models: the number of parameters and its scaling effect, the number of training tokens and its scaling effect, and the baseline performance of the model family in question. Together, these yield a predicted loss for the target large model. The lower the predicted loss, the better the model’s outputs are likely to be.
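The article does not write the equation out. One widely used form that matches this description is the five-parameter law popularized by the Chinchilla work (Hoffmann et al., 2022). The symbols below are an assumption for illustration and may not be the exact parameterization used in this study:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here N is the number of parameters, D is the number of training tokens, E is the baseline loss of the model family, and A, α, B, and β are fitted constants that control how quickly the loss falls as N and D grow. That gives five fitted quantities in total.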

These guiding principles empower research teams to efficiently evaluate trade-offs and optimize the allocation of finite resources. Their utility is particularly pronounced in assessing the scalability of specific variables, such as token counts, and for conducting A/B tests across various pre-training setups.

While scaling laws are not new, they rose to prominence in AI as models grew and training costs soared. According to Choshen, despite this sudden attention, there had been little rigorous testing of how well they work or of which components a robust scaling law needs.

Essentially, these early AI scaling laws often functioned as a ‘black box.’ Andreas further explained that prior attempts to establish scaling laws typically involved isolated efforts, each focusing on a single model or model family, a specific dataset, and the work of an individual developer. This fragmented approach resulted in a significant lack of systematic meta-analysis, leading to questions about whether overarching trends could be identified across these individually developed scaling laws.

Advancing superior construction methods and elevating developmental standards

Choshen, Andreas, and Zhang began by constructing a large dataset. Their collection encompassed LLMs from 40 model families, including Pythia, OPT, OLMo, LLaMA, Bloom, T5-Pile, ModuleFormer mixture-of-experts, and GPT. In total, the dataset featured 485 unique, pre-trained models, for which the team gathered available data on training checkpoints, computational cost (FLOPs), training epochs, and the random seed used. This was complemented by 1.9 million performance metrics covering loss and various downstream tasks. The models varied widely in architecture and weights.

Using this resource, the researchers fitted over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes. They also examined how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws for target models. Their primary metric was the absolute relative error (ARE): the absolute difference between a scaling law’s prediction and the observed loss of a large, trained model, expressed relative to that observed loss.
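As a minimal sketch, the ARE can be computed with the standard definition of absolute relative error. The function name and the example numbers are hypothetical and are not taken from the study:

```python
def absolute_relative_error(predicted_loss: float, observed_loss: float) -> float:
    """Return |predicted - observed| / |observed| (standard ARE definition)."""
    if observed_loss == 0:
        raise ValueError("observed_loss must be non-zero")
    return abs(predicted_loss - observed_loss) / abs(observed_loss)


# Made-up example: the scaling law predicts 2.08, the trained model reaches 2.00.
are = absolute_relative_error(predicted_loss=2.08, observed_loss=2.00)
print(f"ARE = {are:.1%}")  # prints "ARE = 4.0%"
```

Because the error is divided by the observed loss, AREs can be compared across model families whose losses sit on different scales.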

After a thorough comparison and analysis of these scaling laws, the team distilled practical recommendations. These insights aim to guide AI practitioners on the essential components that define effective scaling laws.

The resulting guidelines walk developers through the key steps, considerations, and expectations. The first step is to decide on a compute budget and a target model accuracy. The researchers found that an ARE of about 4 percent is roughly the best achievable accuracy, a floor set by random seed noise, though an ARE of up to 20 percent is still useful for decision-making.

Several factors improve prediction reliability. Including intermediate training checkpoints, rather than relying only on final losses, made scaling laws more reliable. However, measurements from very early in training, before 10 billion tokens, are noisy and reduce accuracy, so the guidelines recommend discarding them. The guidelines also recommend training more models across a spread of sizes, rather than only larger models, to make the scaling law’s prediction more robust; five models is a good starting point.
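The data-selection advice above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors’ code: the `Checkpoint` record, its field names, and the helper functions are hypothetical, and the example losses are made up. Only the two thresholds come from the article.

```python
from dataclasses import dataclass

MIN_TOKENS = 10_000_000_000  # discard measurements taken before 10 billion tokens
MIN_MODEL_SIZES = 5          # suggested starting point for the number of model sizes


@dataclass(frozen=True)
class Checkpoint:
    """One loss measurement during training (hypothetical record layout)."""
    model_name: str
    num_params: int
    tokens_seen: int
    loss: float


def select_fit_points(checkpoints):
    """Keep intermediate and final checkpoints at or beyond MIN_TOKENS."""
    return [c for c in checkpoints if c.tokens_seen >= MIN_TOKENS]


def has_enough_sizes(checkpoints):
    """Check that the points span at least MIN_MODEL_SIZES distinct model sizes."""
    return len({c.num_params for c in checkpoints}) >= MIN_MODEL_SIZES


# Made-up example: five model sizes, each logged at 5B, 20B, and 40B tokens.
sizes = [70_000_000, 160_000_000, 410_000_000, 1_000_000_000, 2_800_000_000]
runs = [
    Checkpoint(f"model-{n}", n, tokens, loss=3.0)
    for n in sizes
    for tokens in (5_000_000_000, 20_000_000_000, 40_000_000_000)
]
usable = select_fit_points(runs)
print(len(runs), len(usable), has_enough_sizes(usable))  # prints "15 10 True"
```

Keeping every checkpoint past the cutoff, not just the final one per model, is what puts the intermediate checkpoints into the fit.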

Including larger models in the fit generally improves prediction, but costs can be cut by partially training the target model to approximately 30 percent of its dataset and extrapolating from those results.
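A rough sketch of the partial-training idea follows. It assumes that, for a single model of fixed size, the loss follows L(D) = e + b · D^(−β) in the number of training tokens D. That form, the grid-search fitting routine, and all numbers are illustrative assumptions and not the procedure from the paper. The curve is fitted on checkpoints up to 30 percent of the planned tokens and then evaluated at the full token count.

```python
import math


def fit_loss_curve(tokens, losses, grid_steps=400):
    """Fit L(D) = e + b * D**(-beta) to one model's checkpoints.

    For each candidate offset e, log(L - e) is linear in log(D), so b and beta
    come from ordinary least squares. The candidate with the smallest squared
    error in loss space is returned.
    """
    if len(tokens) < 3 or len(tokens) != len(losses):
        raise ValueError("need at least three (tokens, loss) pairs")
    xs = [math.log(d) for d in tokens]
    mean_x = sum(xs) / len(xs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    min_loss = min(losses)
    best = None
    for i in range(grid_steps):
        e = min_loss * i / grid_steps  # candidate offsets in [0, min_loss)
        ys = [math.log(loss - e) for loss in losses]
        mean_y = sum(ys) / len(ys)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        sse = sum(
            (e + math.exp(intercept + slope * x) - loss) ** 2
            for x, loss in zip(xs, losses)
        )
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
    _, e, b, beta = best
    return e, b, beta


def predict_loss(params, tokens):
    """Evaluate e + b * tokens**(-beta) for fitted or known parameters."""
    e, b, beta = params
    return e + b * tokens ** (-beta)


# Synthetic, noise-free example: a made-up curve, not data from the study.
PLANNED_TOKENS = 100_000_000_000
true_params = (1.8, 400.0, 0.3)
seen = [10e9, 15e9, 20e9, 25e9, 30e9]  # stop at 30 percent of the plan
observed = [predict_loss(true_params, d) for d in seen]

fitted = fit_loss_curve(seen, observed)
forecast = predict_loss(fitted, PLANNED_TOKENS)
actual = predict_loss(true_params, PLANNED_TOKENS)
print(f"forecast={forecast:.4f} actual={actual:.4f}")
```

On real checkpoints the losses are noisy, so a dedicated nonlinear optimizer and a robust loss function would likely be a better choice than this grid search.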

For developers operating under considerably constrained budgets, an alternative involves training one smaller model within the target family and borrowing scaling law parameters from a model family with similar architecture. However, this approach may not be effective for encoder-decoder models.

Further research by the MIT-IBM group revealed a strong correlation between two sets of hyperparameters when comparing scaling laws across different model families. Their findings indicate that three of the five hyperparameters explained nearly all of the observed variation, suggesting their ability to effectively capture model behavior.

Collectively, these guidelines offer a systematic framework to make scaling law estimation more efficient, reliable, and accessible for AI researchers navigating diverse budgetary constraints.

During the research, several unexpected discoveries emerged. Remarkably, even small, partially trained models exhibited strong predictive capabilities. Moreover, intermediate training stages from a fully developed model could be effectively utilized as individual predictive tools for different target models. This offers a significant advantage, as researcher Choshen explained: “Basically, you don’t pay anything in the training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did.”

Another observation, noted by Andreas, was that when the data were aggregated, the variability across model families and experiments stood out and was noisier than expected.

Perhaps the most significant finding was that scaling laws fitted on large models can also be used to predict the performance of smaller models. This challenges earlier hypotheses in the field that smaller models are fundamentally different from their larger counterparts. Choshen disagrees with that view: “If they’re totally different, they should have shown totally different behavior, and they don’t.”

While the current work concentrates on training, the team plans to extend the analysis to inference. Andreas explained that the question there is not how models improve with more training data or parameters, but how performance changes when a model is allowed to “think” longer or draw more samples. He expects this to inform predictive models of how much computation a model should spend at run time.

The theory of inference time scaling laws, Andreas suggested, is poised to become even more critical. He explained that model development isn’t a singular training event; instead, each new user query necessitates a dynamic assessment of how intensively the model must process information to generate the most accurate response. Consequently, the ability to construct predictive models for this runtime “thinking,” mirroring the methodology in their current paper, is deemed increasingly vital.

Partial funding for this research was secured from the MIT-IBM Watson AI Lab, complemented by a Sloan Research Fellowship.
