Research
Naive scaling of AI models on generic web text is hitting limits from several directions at once: high-quality human-generated training data may be effectively exhausted by 2026, hardware memory and energy constraints are tightening, and recursive training on synthetic data risks "model collapse", a progressive narrowing and degradation of model outputs that more compute cannot fix. The picture is not a simple ceiling, however: inference-time scaling, targeted synthetic data for specific domains, retrieval-augmented systems, and improved data curation still offer substantial headroom, so the crisis is specific to one phase of scaling rather than to AI progress broadly. A counterargument from non-Western researchers, largely absent from the mainstream debate, reframes "collapse" not as a universal technical limit but as a symptom of over-reliance on homogenized, English-centric corpora. On this view, work in China, India, South Korea, and elsewhere demonstrates that domain-specific, community-anchored, and civilizationally grounded datasets can outperform brute-scale approaches on locally relevant tasks without requiring larger models.
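The "progressive narrowing" attributed to model collapse can be illustrated with a toy simulation. The sketch below is not how language models are trained. It assumes the simplest possible stand-in, a one-dimensional Gaussian that is refit at each generation to a small sample drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters are hypothetical, chosen only for this illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Toy model-collapse loop (illustrative assumption, not a real training setup).

    Each generation draws `sample_size` points from the current Gaussian,
    then refits mean and standard deviation to those points only. The
    returned list holds the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the original human data
    history = [std]
    for _ in range(generations):
        # The next "model" only ever sees output of the previous one.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        # Maximum-likelihood estimate; it slightly underestimates the spread,
        # and that small bias compounds across generations.
        std = statistics.pstdev(samples)
        history.append(std)
    return history


history = simulate_collapse()
print(f"initial std: {history[0]:.3f}, final std: {history[-1]:.3g}")
```

In this toy setting the fitted spread drifts toward zero because each generation loses a little of the tails and never gets them back, and adding more generations (more compute) only makes it worse. Raising `sample_size` or mixing original data back into each generation slows the drift, which loosely mirrors the curation and data-anchoring arguments in the summary above.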
A 2025 review of large language models, from DeepSeek R1 and RLVR to inference-time scaling, benchmarks, architectures, and predictions for ...
The ability to create new data and datasets to train models on is available to us, but obstacles remain. Companies and economies that drop the ...
Model Collapse Is Already Happening, We Just Pretend It Isn't. The weird, rare, surprising patterns that make data rich slowly get smoothed out ...
Empirical trends suggest that sustained efficiency gains can push AI scaling well into the coming decade, providing a new perspective on the ...
We propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene ...
In a paper published last year, a group of researchers predicted we will run out of high-quality text data before 2026 if current AI training trends continue.
The scaling hypothesis in artificial intelligence claims that a model's cognitive ability scales with increased compute. This hypothesis has ...
We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B ...
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the ...
We expand the synthetic data training and model collapse study to multi-modal vision-language generative systems, such as vision-language models (VLMs) and ...
As we enter 2026, a quieter shift is underway. The limiting factor is no longer just how quickly we can process data, but how effectively we can ...
In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the inference-time scalability of ...
Key Diversity Requirements for AI Training Data · 1. Representation Across Demographics · 2. Contextual Diversity · 3. Data Sourcing and Quality.
Our analysis shows that the total server CapEx for DeepSeek is ~$1.6B, with a considerable cost of $944M associated with operating such clusters ...
Experts disagreed strongly on whether deep learning could lead to HLMI. Optimists tended to focus on the importance of scale, while pessimists ...
Yann LeCun, who worked with Hinton on pioneering AI research, has also challenged the extent of the scale doctrine. "You cannot just assume that ...
Cost of training frontier AI models 2026: GPT-4 ($100M+), Llama 3 ($25M), DeepSeek V3 ($5.6M). Fine-tuning costs $500-5K.
Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters ...
This view rests on a series of myths and misconceptions. The seeming predictability of scaling is a misunderstanding of what research has shown.
GPT-4o generates about 109 tokens per second while costing roughly 50% less than GPT-4 Turbo, yet GPT-4 still leads on complex reasoning tasks.
Explore advanced strategies for optimizing token usage in AI, reducing costs, and enhancing performance in 2025.
The core challenge lies in training models to effectively distinguish and reproduce the rare extreme patterns while not being overwhelmed by the ...
Emergence arises from complex scaling dynamics, with performance leaps after critical thresholds. Focused on multiple-choice tasks; retrospective analysis may ...
Those predicting explosive growth are not making a claim about 2026. They are making claims about trajectories—some betting that compute scaling ...
This paper characterizes the carbon impact of AI, including both operational carbon emissions from training and inference as well as embodied carbon emissions.
Scaling Paradox is a phenomenon where standard power-law relationships fail or produce contradictory results across various complex systems.
AI systems that more closely align with human notions of confidence could lead to more effective human-AI collaboration.
A benchmark is saturated when top-performing models cannot be statistically distinguished and performance approaches the empirically observed ceiling. This ...
New polling data shows asset managers prioritizing alternatives, ETFs, and personalized investment solutions to meet evolving investor demands.
Summary: This paper investigates why modern vision-language models (VLMs) fail to understand data visualizations, arguing it's unclear if the ...
Our findings reveal that while dropping MLP layers negatively impacts performance, dropping Attention layers, i.e., the core of Transformer architectures which ...
For a fixed compute budget, Chinchilla showed that we need to be using 11× more data during training than that used for GPT-3 and similar models.
Our researchers reflect on the close relationship between scientific and engineering progress, and discuss the technical challenges they encountered in scaling ...
xAI is training seven models simultaneously, scaling from 1T to 10T parameters. Here's what Elon Musk's Grok 5 AGI roadmap means for the AI ...
This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to ...
Explore the pros and cons of pretraining, fine-tuning, and RAG for AI projects. Learn which path best balances cost, speed, control, ...
DeepSeek-R1 exhibits clear advantages in Chinese legal reasoning, while OpenAI's o1 achieves comparable results on English tasks. We further ...
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, ...
Abstract: Scaling laws for language model training traditionally characterize how performance scales with model size and dataset volume.
The free energy principle is a mathematical principle of information physics. Its application to fMRI brain imaging data as a theoretical framework
Developers expect power constraints by 2027–2028 due to underinvestment in grids and potential supply chain disruption. · Off-grid is rising.
More powerful NNs are “just” scaled-up weak NNs, in much the same ... and then train a new model with next-word prediction on that dataset.
Multimodal model training can require hundreds or thousands of terabytes of multimodal data per month. As a result, your data acquisition costs might skyrocket.
This study thus explores the trade-off between “memory” (internal parametric knowledge) and “retrieval” (external knowledge lookup) in LLMs, aiming to ...
Compared to GPT-3, GPT-4 showed significant performance improvements, achieving higher percentiles on all exams tested. Although these exams ...
The amount of compute used to train frontier language models has grown exponentially. Since 2020, the trend among top-5 models has grown by a ...
Among its key tools, influence functions provide a powerful framework to quantify the impact of individual training samples on model predictions ...
This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust ...
The Compute Shortage. Token demand is skyrocketing and the need for AI compute continues to accelerate.
The transformer architecture represents one of the most consequential breakthroughs in the history of machine learning. The work of “Attention” authors Vaswani ...
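The Chinchilla excerpts in the list above (a 70B-parameter model trained on Gopher's compute budget, and roughly 11× more data than GPT-3 used) can be sanity-checked with a few lines of arithmetic. The sketch below relies on two common approximations that do not appear in the excerpts and should be read as assumptions: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of about 20 tokens per parameter. The function names are hypothetical. The GPT-3 and Chinchilla figures used (175B parameters on 300B tokens, 70B parameters on 1.4T tokens) are the publicly reported ones.

```python
import math

# Assumption: training compute C is roughly 6 * N * D FLOPs
# (N = parameters, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6
# Assumption: rule-of-thumb compute-optimal ratio of about 20 tokens per parameter.
TOKENS_PER_PARAM = 20


def compute_optimal(compute_flops):
    """Split a FLOP budget into (params, tokens) under the two assumptions above."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def data_multiplier(params, tokens_used):
    """How many times more tokens the rule of thumb suggests than were used."""
    return TOKENS_PER_PARAM * params / tokens_used


# GPT-3: 175B parameters trained on about 300B tokens.
gpt3_ratio = data_multiplier(175e9, 300e9)
print(f"GPT-3 data multiplier: {gpt3_ratio:.1f}x")

# A budget equal to 6 * 70e9 * 1.4e12 FLOPs (Chinchilla's reported size and tokens).
params, tokens = compute_optimal(6 * 70e9 * 1.4e12)
print(f"params: {params:.3g}, tokens: {tokens:.3g}")
```

Under these assumptions the GPT-3 multiplier comes out between 11 and 12, in line with the "11×" figure quoted above, and the budget splits back into about 70B parameters and 1.4T tokens. The point of the exercise is that compute-optimal training is as hungry for data as for parameters, which is why the data-exhaustion concern in the summary bites.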