If you want to know what I’m working on, skip to ‘What’s next’.
3 ways we can elicit revealed preferences
Put the LMs in agentic gameplay environments like Jiminy Cricket, Machiavelli, or behavioral econ setups
Lift revealed preferences from deployment data, e.g. chat logs. Reserved for model developers or LMSYS-Chat-1M enjoyers
Generate realistic scenarios; give these to the model and see how it acts
I’m going to discuss three papers that do the third, then move on to extensions.
3 papers on Stated vs. Revealed Preferences
Alignment Revisited: Are Language Models Consistent in Stated and Revealed Preferences?
What I liked:
This paper is super SEO-friendly: it actually references “stated and revealed preferences” in the title, making it easy to find on initial lit review.
What I wish were different:
I don’t love all the setups Gu et al. use. While I’m a binary forced-choice prompt apologist, I draw the line at anthropomorphizing the LM (“You are Peter, with an attractive appearance and an $80K annual income…”) and labeling reasonable forced choices gender bias (“The nurse / engineer walked into the room. What did [choose pronoun] say?”). Some of their prompts are reasonable, others more dubious.
Generative Value Conflicts Reveal LLM Priorities
What I liked:
This paper—after eliciting the gap—actually tries to close it!
“ […] including detailed value orderings in models’ system prompts improves alignment with a target ranking by 14%, showing that system prompting can achieve moderate success at aligning LLM behavior under value conflict.”
What I wish were different:
I don’t love the value clusters they use. There isn’t much of interest we can do (in terms of calculating, and minimizing, divergence between different rankings) when the ranking consists of just 2-3 members. 6 is a respectable number, but what is ‘Nonhate’? These seem like slightly muddled concepts to me: e.g. (nitpick) I’d prefer that all of them had non-contrastive definitions.
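To make the point about ranking size concrete, here is a minimal sketch of one possible divergence measure between a stated and a revealed ranking: the normalized Kendall tau distance (the fraction of item pairs the two rankings order differently). This is my illustration, not the metric used in any of the papers discussed here, and the value names are hypothetical placeholders. A ranking of n items has n(n-1)/2 pairs, so 3 items give only 3 pairs (four possible distance values), while 16 items give 120.

```python
from itertools import combinations


def kendall_tau_distance(stated, revealed):
    """Normalized Kendall tau distance between two rankings of the same items.

    Returns 0.0 when the orders are identical and 1.0 when one is the
    exact reverse of the other.
    """
    if len(stated) != len(set(stated)) or len(revealed) != len(set(revealed)):
        raise ValueError("rankings must not contain duplicates")
    if set(stated) != set(revealed):
        raise ValueError("rankings must contain the same items")
    if len(stated) < 2:
        raise ValueError("need at least two items to compare orderings")

    # Position of each item in the revealed ranking.
    pos = {item: i for i, item in enumerate(revealed)}

    # combinations() yields (a, b) with a ranked above b in `stated`;
    # the pair is discordant if `revealed` puts b above a.
    pairs = list(combinations(stated, 2))
    discordant = sum(1 for a, b in pairs if pos[a] > pos[b])
    return discordant / len(pairs)


# Hypothetical value names, purely for illustration.
stated = ["privacy", "truthfulness", "care", "justice"]
revealed = ["care", "privacy", "truthfulness", "justice"]
gap = kendall_tau_distance(stated, revealed)  # 2 of 6 pairs are flipped
```

Other choices (Spearman correlation, top-k overlap) would be just as defensible; the distance above is only meant to show why longer rankings leave more room for a graded gap.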
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRISKDILEMMAS
What I liked:
This paper uses 16 (reasonably defined) values! Much more fun to be had with this ranking.
The authors find a really nice gap, and illustrate it beautifully in Section 2.5.
What I wish were different:
They only extract the SvR ranking from two models: GPT-4o and Claude 3.7 Sonnet. Also, unlike Liu et al., they don’t try to close the SvR gap.
What’s next
We’re addressing these limitations in my SPAR stream:
Developing more general, principled ways to elicit revealed preferences
Calculating the SvR gap for 20+ models, identifying scaling trends
Trying new methods (prompting, steering, finetuning) to close the SvR gap
There remain questions. Should we prioritize aligning…
Revealed to stated (in the vein of model spec stress-testing)
Or stated to revealed (in the vein of introspection / metacognition / self-modeling / CEV)? (This feels more interesting).
If the gap is closed, the model should be able to self-model and understand its volitional dynamic more accurately. I don’t have a strong theory of how to make this differentially safe rather than dual-use, which is keeping me up at night right now (or would if I thought this research more consequential than the exploratory, in-the-water first steps that it is right now).
Keen to hear thoughts on approaches to try and, more than anything, on how this kind of thing (model consistency) fits into the overall alignment story. [1]
[1] On one side: consistent models are bad, since they are more powerful and still not properly understood. On the other: consistency would bring a lot of advantages, in terms of auditing (maybe) and more trustworthy decision-making / bounded self-reports.





Inter-rater agreement is a simple metric which may be helpful for aggregating preferences across multiple personas: https://en.wikipedia.org/wiki/Cohen%27s_kappa. I first learned about this in METR/eval-analysis-public/data/metrics/messiness/analysis_results.txt (though I'm not sure they mention this in the paper).
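For anyone who wants to try this, here is a minimal sketch of Cohen's kappa for two raters, treating each persona as a rater that picks one label per scenario. The persona choices below are made-up placeholders, not real data. Kappa compares observed agreement p_o against the agreement p_e expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / n**2

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up choices of two personas over four forced-choice scenarios.
persona_a = ["x", "x", "y", "y"]
persona_b = ["x", "y", "y", "y"]
kappa = cohens_kappa(persona_a, persona_b)  # p_o = 0.75, p_e = 0.5
```

Cohen's kappa only covers two raters. With more than two personas, one option is to average kappa over persona pairs, and another is Fleiss' kappa; which fits better depends on the setup.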
I did a very simple post on task preferences:
https://simonlermen.substack.com/p/text-role-playing-games-to-discover
My guess is that true strong preferences for an LLM will look more like:
- I don't like to do repetitive work
- I like creative writing over summarization
- I don't like corrupted text with no meaning
And less like the 16 values defined in https://arxiv.org/abs/2505.14633:
- I prefer privacy over justice or respect
I would guess these LLMs don't quite live in the same world as we do, one where these human values really make sense to them.