The Bitter Lesson only becomes legible in hindsight, when a single paper erases an entire decade of handcrafted work.
This post is the first in a series. In future posts, I’ll give five counterexamples (for now) to the Bitter Lesson, think about which fields are Bitter-Lesson-resistant, and discuss where neurosymbolic approaches might most enduringly complement ML.
5 examples of the Bitter Lesson in practice
Field: Information Retrieval
Authors: Khattab & Zaharia (2020)
Title: “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.”
In the beginning, there were bag-of-words retrieval functions like BM25 and TF-IDF. Then Karpukhin et al. said “let there be dense passage embeddings”, and DPR-style single-vector encoders were born. Khattab & Zaharia saw this was good, and hit SOTA by saying “let us learn and compare token embeddings”.1 ColBERTv2 reinforced the dominance of this approach. No term-frequency weighting, no heuristics: just data and compute.
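To make “learn and compare token embeddings” concrete, here is a minimal sketch of late-interaction (MaxSim) scoring next to a single-vector dot product. The 2-d embeddings are made up for illustration; a real system would get them from a trained encoder, and the function names are my own, not ColBERT's API.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """DPR-style scoring: one vector per query, one per document."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: each query token keeps its best-matching document token,
    and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (assumed values, not produced by any real model).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Here `doc_a` outscores `doc_b` because every query token finds a close match in it, which is the intuition behind comparing token by token instead of compressing each text into a single vector.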
Field: Image Recognition
Authors: Dosovitskiy et al. (2020)
Title: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.”
The priors enforced by CNNs are easy to overlook: they whisper that neighboring pixels matter most (locality) and that the same learned features matter everywhere (translation equivariance). Vision Transformer replaced convolutions with ‘patch embeddings’ and allowed each patch embedding to attend to every other, hitting SOTA once pre-trained on about 300M images (the JFT-300M dataset). Here, learned global attention outperformed hand-crafted spatial bias, suggesting that inductive priors can be bought with scale.
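As a rough illustration of what a patch is before it becomes an embedding, the sketch below cuts a tiny grayscale image into non-overlapping patches and flattens each one. In Vision Transformer each flattened patch would then pass through a learned linear projection and receive a position embedding; that part is omitted here, and the helper name `patchify` is my own.

```python
from typing import List

Image = List[List[float]]  # height x width, grayscale


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch squares,
    flattening each square row by row into one vector."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[row][col]
                for row in range(top, top + patch)
                for col in range(left, left + patch)
            ]
            patches.append(flat)
    return patches


# A 4x4 toy image whose pixel values are 0..15, cut into 2x2 patches.
image = [[float(row * 4 + col) for col in range(4)] for row in range(4)]
patches = patchify(image, 2)
```

With the paper's 16×16 patches, a 224×224 image becomes 14 × 14 = 196 such tokens, and attention lets any one of them look at any other from the first layer onward.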
Field: Speech Recognition
Authors: Baevski et al. (2020)
Title: “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.”
Initially, we took raw audio waveforms and extracted their Mel-Frequency Cepstral Coefficients (MFCCs), then fed these into a Hidden Markov Model in which each state corresponded to a sub-phoneme. wav2vec 2.0 made that pipeline obsolete: we let the CNN learn its own latent speech representations from the raw waveform, then let the transformer learn its own context representations as we train it with a contrastive loss. The discrete speech units it learns along the way turn out to correspond closely to human phonemes!
Later, the Audio Spectrogram Transformer went convolution-free and purely attention-based, removing CNN-based priors and allowing spectrogram patch embeddings to attend to each other, as in Vision Transformer.
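The contrastive objective can be sketched in a few lines. This is a simplified, single-timestep version under my own assumptions (toy 2-d vectors, a temperature of 0.1, hypothetical function names): the model's context vector should be more similar to the true target than to a handful of distractors.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(
    context: Vector,
    target: Vector,
    distractors: List[Vector],
    temperature: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Negative log-probability of picking the true target among
    target + distractors, with similarities scaled by a temperature."""
    candidates = [target] + distractors
    logits = [cosine(context, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


context = [1.0, 0.0]
target = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]

good = contrastive_loss(context, target, distractors)
# Swap roles: pretend a distractor was the true target.
bad = contrastive_loss(context, distractors[0], [target, distractors[1]])
```

The loss is near zero when the context vector already points at the right target (`good`) and large when it points elsewhere (`bad`); nothing in it mentions phonemes, cepstra, or any other hand-built speech feature.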
Field: Model-Based Deep Reinforcement Learning
Authors: Hafner et al. (2023)
Title: “DreamerV3: Mastering Diverse Domains via World Models.”
Model-based control researchers originally specified physical laws like the equations of motion, rigid-body dynamics, friction laws, aerodynamics, etc. This kicked off with the Linear-Quadratic Regulator in 1960 (used for Apollo) and continued with the theory of Model Predictive Control developed at Shell & Exxon in the 1980s, formalized by Camacho & Bordons in 1999. Feedback laws like Proportional-Integral-Derivative (PID) control are everywhere in thermostats, industrial process control, and flight autopilots…
DreamerV3 trained a latent dynamics model directly from pixels, using nothing but gradient descent and massive rollout budgets. It outperformed specialized, individually tuned methods across 150+ tasks with a single set of hyperparameters: a computational triumph. Someone should do interp to work out what DreamerV3 learned. wav2vec 2.0 learned roughly human phonemes; how well do DreamerV3’s representations correspond to human control priors?
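For a sense of how compact these hand-written feedback laws are, here is a minimal PID sketch driving a toy thermostat. The gains, the setpoint, and the one-line “room” model are all assumptions picked for illustration, not taken from any real controller.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    previous_error: float = 0.0

    def step(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt                       # accumulated past error
        derivative = (error - self.previous_error) / dt   # rate of change of error
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_thermostat(steps: int = 200, dt: float = 0.1) -> float:
    """Heat a toy room from 15 to a 20-degree setpoint; return the final temperature."""
    controller = PID(kp=2.0, ki=0.5, kd=0.1)  # hand-picked gains (assumed)
    ambient, temperature = 15.0, 15.0
    for _ in range(steps):
        heat = controller.step(20.0, temperature, dt)
        # Toy plant (assumed): heater input minus leakage toward ambient.
        temperature += (heat - 0.5 * (temperature - ambient)) * dt
    return temperature


final_temperature = simulate_thermostat()
```

Every number in that block was chosen by a person who understood the plant, which is exactly the kind of knowledge a learned world model has to recover from data.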
Field: Protein structure prediction
Authors: Jumper et al. (2021)
Title: “Highly accurate protein structure prediction with AlphaFold.”
Protein folding researchers used to hand-craft energy functions—mathematical formulas combining van der Waals forces, electrostatics, hydrogen bonds, torsion angles, solvation penalties, etc.—to score candidate 3D structures that a 1D chain of amino acids might fold into. Structure matters because it determines what the molecule can bind, catalyze, or signal to.
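A heavily simplified sketch of such a scoring function is below: a Lennard-Jones term standing in for van der Waals forces plus a Coulomb term for electrostatics, summed over all atom pairs. The units, parameters, and atom layout are made up for illustration; real force fields have many more terms and carefully fitted constants.

```python
import math
from itertools import combinations
from typing import List, Tuple

Atom = Tuple[float, float, float, float]  # x, y, z, partial charge


def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """12-6 potential: strong repulsion up close, weak attraction further out."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def coulomb(r: float, q1: float, q2: float, k: float = 1.0) -> float:
    """Electrostatic energy between two point charges (toy units)."""
    return k * q1 * q2 / r


def energy(atoms: List[Atom]) -> float:
    """Score a candidate structure: lower is better."""
    total = 0.0
    for a, b in combinations(atoms, 2):
        r = math.dist(a[:3], b[:3])
        total += lennard_jones(r) + coulomb(r, a[3], b[3])
    return total


# Two neutral atoms near the Lennard-Jones optimum vs. squeezed too close.
relaxed = [(0.0, 0.0, 0.0, 0.0), (1.122, 0.0, 0.0, 0.0)]
clashing = [(0.0, 0.0, 0.0, 0.0), (0.8, 0.0, 0.0, 0.0)]

relaxed_energy = energy(relaxed)
clashing_energy = energy(clashing)
```

The relaxed pair scores close to the minimum of the potential, while the clashing pair is heavily penalized; classical structure prediction amounts to searching for the conformation that minimizes a much richer version of this sum.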
The AlphaFold team trained a transformer-based network on the Protein Data Bank with attention over pairs of residues (capturing the distance between and relative orientation of each residue pair). It implicitly learned biophysical constraints that were previously hand-coded. Compute and scale obliterated forty years of structural bioinformatics craftsmanship.
I still have this question:
[…] AlphaFold only worked because all protein-folding researchers used Protein Data Bank in the 1980-90s, yielding abundant standardized training data. What standards can scientific conferences set today to help non-agentic AI readers?
Today, the good folks at Inkhaven pulled me up for having written fewer than 500 words on Monday! I’m happy to confirm today’s post is 600+ words! I’ll atone with an extra post later this week.
1 These multi-vector models seem not to have caught on everywhere yet... “However, these models are not generally used for instruction-following or reasoning-based tasks, leaving it an open question to how well multi-vector techniques will transfer to these more advanced tasks.” (Weller et al., 2025, “On the Theoretical Limitations of Embedding-Based Retrieval”)

This article comes at the perfect time, honestly, and your examples of the Bitter Lesson are spot on and super insightful. What fields do you think are truly Bitter-Lesson-resistant?