why don’t we explicitly train models to be good at generalization?
meta-learning on economically productive tasks weirdly missing from NeurIPS 2025, could beat RL sample-inefficiency
NB: Since GPT-3 showed impressive few-shot generalization abilities, we’ve backed away from explicitly calling this “meta-learning” and started studying it in many more ways. I’d like to write about that transition another day. Today, I’ll elide a lot of that detail. I do think it may be helpful to think in explicit meta-learning terms to solve RL sample-inefficiency.
TL;DR
Meta-RL seems like the obvious solution to the sample-inefficiency of RL. Meta-learning (or its special case, “in-context learning”) remains the fastest route to few-shot generalization. Despite that, this year’s meta-learning papers at NeurIPS focus on niche applications rather than foundational research. Presumably foundational meta-RL research is taking place inside private labs, which have also realized this is the solution to the sample-inefficiency of RL, cf. Schulman and Sutskever (2016).
Sample-inefficiency of RL
Recently, there’s been a lot of furore about whether sample-inefficient RL on top of pre-trained models is a dense enough training regime to match the powerful generalization of next-token prediction pre-training.
Toby Ord kicked off the discussion by pointing out that sparse, binary, yes/no signals at the end of long RL trajectories yield only 1 bit of information per 10,000 tokens, whereas you get 3-16 bits of information per token during next-token prediction pre-training.
Dwarkesh Patel caught onto this and highlighted some more of RL’s sample-inefficiency woes, though he also fleshed out how stacking RL on top of pre-training and inference scaling can help improve information density.
Daniel Paleka followed up with some RL apologia (really excellent read), pointing out that RL is on-policy (the data distribution matches the inference-time distribution, so the model trains on its own best guesses rather than humans’), and chiming in with Dwarkesh that the bits learned from RL tend to be higher-signal, because they’re directly relevant to the task at hand.
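To make the size of that gap concrete, here is a back-of-the-envelope sketch using only the figures quoted from Toby Ord above (1 bit per 10,000-token trajectory versus 3-16 bits per token). The variable and function names are mine, and the numbers are rough orders of magnitude rather than measurements.

```python
# Figures as quoted above; treat them as rough orders of magnitude.
rl_bits_per_episode = 1            # one binary success/failure signal
rl_tokens_per_episode = 10_000     # trajectory length in tokens
pretrain_bits_per_token = (3, 16)  # quoted range for next-token prediction


def density_ratio(pretrain_bits):
    """How many times more bits per token pre-training gets than sparse RL."""
    return pretrain_bits * rl_tokens_per_episode / rl_bits_per_episode


low, high = (density_ratio(b) for b in pretrain_bits_per_token)
print(f"pre-training is {low:,.0f}x to {high:,.0f}x denser per token")
```

On those numbers the per-token gap is four to five orders of magnitude, which is the gap the rest of this post is worried about.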
Solution: Meta-Learning
What is meta-learning?
The 2017 paper “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks” (Finn, Abbeel, and Levine) introduced a shovel-ready meta-learning algorithm that worked across contemporaneous model architectures.
The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples.
[…]
The primary contribution of this work is a simple model and task-agnostic algorithm for meta-learning that trains a model’s parameters such that a small number of gradient updates will lead to fast learning on a new task.
That sounds like exactly what we need — we can’t get that many gradient updates with sample-inefficient RL, so we have to make each one count as much as possible.
It won’t give us off-manifold performance, but it could give us performance all over the manifold of economically productive tasks.
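For intuition, here is a minimal toy sketch of the inner-loop / outer-loop structure the MAML abstract describes. It is not the paper’s implementation: I’m assuming a one-parameter model y = w*x, a made-up task family y = a*x with a drawn from [1, 3], hand-derived gradients, and hypothetical names and hyperparameters (`alpha`, `beta`, `maml_train`) throughout.

```python
import random


def mean_sq(xs):
    """Mean of x^2 over a batch of inputs."""
    return sum(x * x for x in xs) / len(xs)


def loss(w, a, xs):
    """Mean squared error of the model y = w*x against the task y = a*x."""
    return mean_sq(xs) * (w - a) ** 2


def grad(w, a, xs):
    """d(loss)/dw, derived by hand for this toy model."""
    return 2.0 * mean_sq(xs) * (w - a)


def adapt(w, a, support, alpha):
    """Inner loop: one gradient step on a task's support set."""
    return w - alpha * grad(w, a, support)


def maml_train(steps=2000, alpha=0.1, beta=0.01, seed=0):
    """Outer loop: learn an initialization w that adapts well in one step."""
    rng = random.Random(seed)
    w = 0.0  # the meta-parameter is the shared starting point
    for _ in range(steps):
        a = rng.uniform(1.0, 3.0)  # hypothetical task distribution
        support = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        query = [rng.uniform(-1.0, 1.0) for _ in range(5)]
        w_adapted = adapt(w, a, support, alpha)
        # Chain rule through the inner step:
        # d(w_adapted)/dw = 1 - 2*alpha*mean_sq(support)
        d_adapted_dw = 1.0 - 2.0 * alpha * mean_sq(support)
        # Meta-gradient: query loss after adaptation, differentiated w.r.t. w
        meta_grad = grad(w_adapted, a, query) * d_adapted_dw
        w -= beta * meta_grad
    return w


w_init = maml_train()
```

In this toy setting the learned `w_init` should settle near the middle of the task range, so a single call to `adapt(w_init, a_new, xs, 0.1)` on a fresh task lands much closer to `a_new` than the same step taken from `w = 0.0`. The point is only the shape of the computation: the outer loop optimizes for loss after adaptation, not loss at the initialization itself.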
So is meta-learning popping at NeurIPS, then?
No.
The applied papers range from narrow, physical use cases (dynamics prediction tasks, visual cortex mapping, motion planning, expensive optimization problems, causal prediction, multi-agent credit assignment) to the super cool System Prompt Optimization with Meta-Learning, which I’m just all over as a system prompt enjoyer, and preference optimization algorithm search (a high-quality FLAIR / Foerster production!).
Then, there are theoretical papers reinforcing the power of meta-learning (PEFT-based meta-learning provably outperforms standard retraining) and giving new generalization bounds on heterogeneous multi-task learning.
There’s a paper that proposes a novel meta-objective (new outer loss function) to reduce generalization error, which is pretty cool.
But we don’t see what we might expect: the application of meta-learning to economically productive or long-horizon tasks, or meta-RL research that investigates how meta-learning can get us out of sample-inefficiency.
I also clicked around on “in-context”, and nothing there really contradicted this.
I can only conclude that proper meta-learning work is taking place within the labs, and not seeing the light of day at this conference.
Addendum: Where Ilya went quiet on the Dwarkesh podcast
Ilya Sutskever 00:30:24
One of the things that you’ve been asking about is how can the teenage driver self-correct and learn from their experience without an external teacher? The answer is that they have their value function. They have a general sense which is also, by the way, extremely robust in people. Whatever the human value function is, with a few exceptions around addiction, it’s actually very, very robust.
So for something like a teenager that’s learning to drive, they start to drive, and they already have a sense of how they’re driving immediately, how badly they are, how unconfident. And then they see, “Okay.” And then, of course, the learning speed of any teenager is so fast. After 10 hours, you’re good to go.
Dwarkesh Patel 00:31:17
It seems like humans have some solution, but I’m curious about how they are doing it and why is it so hard? How do we need to reconceptualize the way we’re training models to make something like this possible?
Ilya Sutskever 00:31:27
That is a great question to ask, and it’s a question I have a lot of opinions about. But unfortunately, we live in a world where not all machine learning ideas are discussed freely, and this is one of them. There’s probably a way to do it. I think it can be done. The fact that people are like that, I think it’s a proof that it can be done.
There may be another blocker though, which is that there is a possibility that the human neurons do more compute than we think. If that is true, and if that plays an important role, then things might be more difficult. But regardless, I do think it points to the existence of some machine learning principle that I have opinions on. But unfortunately, circumstances make it hard to discuss in detail.
Dwarkesh Patel 00:32:28
Nobody listens to this podcast, Ilya.
Clearly this routes through meta-learning.



