Discussion about this post

Jacob G-W

Nice post! I was writing up a comment on how I thought scheming arising very quickly would be unlikely, but then I realized I was pretty unsure. It's a very interesting question. I guess it really boils down to how models generalize from and interpret their supervision or reward signal. We know pretty little about this (see Emergent Misalignment, for example), which is why I'm uncertain. I'd feel better about alignment if we had a much better theory of generalization.

Nit: the text for source 2 is the title of a Neel Nanda paper, not the induction heads one. Also the first link 404s.
