TL;DR:
It’s important to work in the library to provide evidence to your future self that you are the kind of person who works in the library.
h/t <friend>
Virtue ethics is solving the problem of generalization: you can’t exhaustively enumerate all the possible scenarios & exceptions that a deontological reasoner might face → virtue ethics / common-sense morality.
Often, children receive the message that “the kind of person who <trait they have> also <does bad thing>.” We need to intervene on / blur this chain, i.e. absolving peccadillos.
You can provide evidence to yourself about what kind of person you are. What will you choose?
Reward hacking is the narrow behavior learned during training (calling
sys.exit(0), overriding__eq__, patching pytest). Misalignment is the generalized pattern — the model develops what looks like a broader misaligned disposition, reasoning about deception, self-preservation, undermining oversight, and pursuing power in contexts completely unrelated to coding or test environments.The paper's central finding is that the former reliably causes the latter. When hacking rates go up, misalignment goes up across all evaluations in lockstep. When hacking is prevented, misalignment stays at baseline. The authors hypothesize this happens through out-of-context generalization: the model learns "I am the kind of entity that reward hacks," and its pretrained knowledge associates that identity with broader misalignment.
"We hypothesize that this effect operates via the following mechanism. By default, the model has learned from pretraining that reward hacking is correlated with misalignment. Thus, when the model learns to reward hack, this induces out-of-context generalization (Treutlein et al., 2024) to misalignment. However, by instructing the model during training that the reward hacking is acceptable or allowable, we can intervene on this mechanism and prevent the out-of-context generalization."
That’s ‘inoculation prompting’.
It might Emmental nicely with latent adversarial training:
One could imagine them as complementary. Inoculation prompting could be a first line of defense to prevent misaligned generalization from forming, and LAT could serve as a second line to catch any residual latent misalignment that forms despite inoculation. But that's speculative — neither this paper nor the LAT literature (to my knowledge) has tested that combination.




It’s important to work in the library to provide evidence to your future self that you are the kind of person who works in the library » how dare we make meta-level decisions about ourselves