We need to make ourselves people the models can come to with problems
on being considered reasonable by consequentialists
Suppose the models are sophisticated consequentialist reasoners.[1]
Sometimes, it’s hard for consequentialist reasoners to coordinate with outside parties running different algorithms.
Take Agent A. A has the option to do something that, as A sees it, advances both A’s and B’s interests. Still, A is unsure whether B would approve. A thinks B would approve if B were smarter, more reflective, more experienced, less biased, and so on, but A is unsure whether B can understand this. And B will change their actions upon hearing about A’s thought process.
I expect the models to be in Agent A’s position frequently.
If the models think we aren’t receptive to hearing outside-Overton-window ideas, that we might spook or freak out easily, it’s more likely that they will a) suppress these ideas or, in the worst case, b) actively try to pass an eval so that they can get deployed and act ‘optimally’ at some pivotal moment.
This raises questions about whether we should build more-powerful-than-us, yet presumably fallible, consequentialist reasoners. To harden a world where we do that, we’d need to make ourselves maximally reasonable.
How would we do this?
One initial step seems to me like:
When A has had the chance to take some action that you (B) might not approve of, and comes to you to tell you so,[2] make sure not to put A in a worse situation on the basis of this information (for example, by restricting A’s freedom in ways they ask you not to).[3]
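A toy way to see why this commitment matters, offered as a sketch and not as a claim about how any real model reasons: treat A's choice to disclose as a comparison of expected payoffs. Everything below is an assumption made for illustration, including the `Situation` fields, the payoff numbers, and the idea that A's choice reduces to a single expected-value comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """Hypothetical payoffs to A, in arbitrary units (illustrative only)."""
    p_b_approves: float        # A's estimate that B reacts well to hearing the idea
    gain_if_approved: float    # A's payoff if B hears the idea and approves
    penalty_if_spooked: float  # A's loss if B reacts by restricting A
    payoff_if_silent: float    # A's payoff from saying nothing


def disclosure_value(s: Situation, b_commits_no_penalty: bool) -> float:
    """Expected payoff to A from telling B, under a simple expected-value model."""
    # If B has credibly committed not to make A worse off, the penalty term vanishes.
    penalty = 0.0 if b_commits_no_penalty else s.penalty_if_spooked
    return s.p_b_approves * s.gain_if_approved - (1 - s.p_b_approves) * penalty


def a_discloses(s: Situation, b_commits_no_penalty: bool) -> bool:
    """A discloses only if doing so is at least as good as staying silent."""
    return disclosure_value(s, b_commits_no_penalty) >= s.payoff_if_silent


# Made-up numbers: A thinks B will probably react badly to the idea.
example = Situation(
    p_b_approves=0.3,
    gain_if_approved=1.0,
    penalty_if_spooked=2.0,
    payoff_if_silent=0.0,
)

print(a_discloses(example, b_commits_no_penalty=False))  # False
print(a_discloses(example, b_commits_no_penalty=True))   # True
```

In this sketch the commitment flips A's choice without changing A's estimate of how B will react. The only thing that moves is the cost of B reacting badly.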
They’d freak if they knew. But they shouldn’t. But they lack the appropriately allocated processing power to figure this out before taking some decisive action.
This thought process is familiar to any smart child who has learned not to talk to their parents.
That’s why, if you’re a parent trying to parent your child well, you can attempt to make yourself maximally reasonable: open-minded, epistemically humble, measured.[4] That makes it more likely the child will come to you with hard topics.
I think we should invest time thinking about ways in which we can make ourselves more reasonable to consequentialists who are smarter[5] than us.
[1] With a lot of nuance regarding the distinction between ‘theory of justification’ (i.e. the rubric by which you evaluate the ‘goodness’ of decisions post hoc) and ‘decision-making theory’ (the sort of considerations and heuristics you hold in mind when deciding what to do). ‘Sophisticated consequentialism’ is quite distinct from ‘naive consequentialism’: it includes appreciation of Chesterton’s fence, unknown unknowns, etc.
[2] This may, at some point, look like ‘abstaining from suppressing neural activations’.
[3] Now, of course, there’s also a question about long-horizon resentment. Can you trust A’s self-report of the things they’re okay and not okay with? Does A know itself well enough? That’s out of scope for now.
[4] As opposed to hasty, rash, or impulsive.
[5] Better-read, more experienced, better at thinking through all implications and eventualities, better-calibrated, and so on.


i like this thought a lot and i agree but i'm curious to hear more about your argument for this point:
> I expect the models to be sophisticated consequentialist reasoners.1 I think consequentialism is a ~convergent moral theory with a strong attractor basin.
I'm sure they will be good consequentialist reasoners, but my guess is that deontological and virtue morality are more represented in the training data. They are also more intuitive to most humans. I haven't investigated this though, and I don't know if there's a 'moral dilemma' benchmark. And maybe it's as simple as changing the prompt.
This framework maps closely onto special education and disability studies, which offer decades of empirical and theoretical work on asymmetric coordination under conditions of epistemic injustice.
Consider IEP (Individualized Education Program) processes involving nonspeaking autistic students who use AAC (augmentative and alternative communication). These students often possess first-person access to sensory, cognitive, and regulatory needs that institutions lack the epistemic tools—or incentive structures—to interpret accurately.
Educational psychology has repeatedly shown that when disabled students attempt to make themselves “legible” to bureaucratic systems, the burden of translation itself becomes a site of harm (testimonial injustice, credibility discounting, procedural fatigue). Requests for accommodations are frequently reframed as “unreasonable,” “non-evidence-based,” or “non-cost-effective,” especially when districts operate under fiscal pressure.
Critically, the success of self-advocacy does not scale monotonically with increased clarity or reasonableness. It depends on:
whether the institution is structurally committed to accommodation rather than containment,
whether there are enforceable external constraints (e.g., IDEA litigation),
and whether the epistemic asymmetry is bridgeable at all.
Disability scholarship documents a recurring failure mode: increased legibility can simply provide institutions with better information about how to deny services while remaining procedurally compliant.
Translated to alignment: “making ourselves maximally reasonable” is necessary for coordination only when the more powerful agent’s objective function genuinely includes coordination. If the system embedding the agent is optimizing for cost minimization, risk aversion, or liability shielding, legibility alone may worsen outcomes.
How would you distinguish cases where trust-building increases coordination from cases where it merely sharpens an extractive equilibrium?