Discussion about this post

Emanuel Ruzak:

I think it's true that they can't be understood as taking the best actions they can to maximize a reward, but not because inference isn't deterministic or because the space of states is too big. Rather, it's because they're trained on random text and tiny snippets of agents acting on the internet, rather than on full action sequences from a rational agent who thinks in easy-to-compute steps. Otherwise, I think they would likely converge to being deterministic (assuming a "one conversation turn and forget about it" world; otherwise they might need randomness for games or the like). (I'm also thinking of reward as a function of the universe's state after the one-turn conversation.)

