Alignment is a predictability problem; whack-a-mole is a losing game when new backdoor types appear every day
There's hope
Contents:
New backdoor types every day
Essay in response to Evans’ new paper, “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”
Emergent Misalignment vs. Orthogonality Thesis
Minimum-norm parametric solutions are everywhere…
…including how you add new people to your social group.








