I recently wrote a story about how scheming / deceptive alignment might arise. I essentially drew on Kokotajlo’s story,¹ in which a model might de-emphasize concepts that interfere with effectiveness, trade off one Spec trait (e.g. “honest”) for another (e.g. “helpful”), learn instrumental goals it comes to treat as terminal, and so on.
My affinity for Kokotajlo’s story stemmed from not believing in ‘spontaneous instantiation’: I believed that everything has a cause, that ‘it’s not possible that something can come from nothing’. I didn’t understand how a model could spontaneously instantiate anti-human goals, unless in response to some aspect of its programming or training, inputs we provide it.
Y pointed out that this doesn’t necessarily apply so strongly in the case of AI models. I should’ve remembered this from the induction heads paper²: sometimes ‘grokking’ occurs very rapidly.³ These ‘phase transitions’ can look a lot like something coming from nothing (although we may be able to reverse-engineer what happened and why, such that the causality seems obvious in retrospect). I should expect to see more apparent ‘spontaneous instantiation’ going forward.
¹Forecasting AI Goals, Kokotajlo
²In-context Learning and Induction Heads, Olsson et al.
³Future ML Systems Will Be Qualitatively Different, Steinhardt


Nice post! I was writing up a comment on how I thought scheming arising very quickly would be unlikely, but then I realized I was pretty unsure. It's a very interesting question. I guess it really boils down to how models generalize from and interpret their supervision / reward signal. We happen to know pretty little about this (see Emergent Misalignment, for example), which is why I'm uncertain. I'd feel better about alignment if we had a much better theory of generalization.
Nit: the text for source 2 is the title of a Neel Nanda paper, not the induction heads one. Also the first link 404s.