Transforming AI Safety into a ‘numbers go up’ game
carving up a wicked problem into psychologically healthy optimization targets for individuals
Working in AI Safety sometimes involves a lot of mental gymnastics. Could this capability be dual-use? Are the actions I’m taking only worthwhile conditional on other people taking particular actions? Are these people going to take those actions? Do I even want them to?
I claim this can be pretty psychologically corrosive. If you’re constantly second-guessing yourself, it’s hard to get the momentum needed to do sustained good work.
The AI-aware individual might be tempted to identify differentially safety-advancing technologies and work on them — technologies that clearly do more good than harm, ideally independent of which world we are in.
Examples: Workshop Labs and Conduit
Workshop Labs
Workshop Labs claim that they’re trying to finetune 8 billion models, one for everyone. This is an anti-disempowerment bet: if you can make your voice heard loud and clear, you’re less likely to get left behind in the rising tide of automation.
I’m not sure I totally agree with their thesis. The more distinct you think people are from each other, the more likely you are to value this work(?), and I somewhat think we’re quite similar deep down.
But the founders’ incentives are clean and aligned. We’re empowering people. That’s a positive vision. You’re trying to make something happen, and whether your work is good and worthwhile isn’t intimately dependent on whether some other actor does some risky thing.
Conduit
Conduit are trying to build thought-to-text brain-computer interfaces. This technology also fits in the niche of empowering the individual to express themselves, keeping individual humans in the loop.
There are not many complex ethical issues with the development of this technology. If privacy’s a concern, it’s a tractable one. You can give users control over their data. If something’s NDA’d, it’s probably just because of boring, regular IP considerations, not some ghastly infohazard regime.
I think this kind of work is psychologically healthy.
In my own life:
I’m favoring clean optimization targets—blueprints I can bring into the world, questions I can answer without worrying too much about second-order effects.
I think you can sometimes apply a transform to a problem and convert it from a messy, philosophically fraught quagmire into a simple, numbers-go-up constrained optimization problem. For example, anti-disempowerment -> pro-empowerment. (I’ll edit this with a more fleshed-out example).
When considering which technical alignment agendas to work on, I’ve started to weigh how clean, straightforward, and psychologically healthy the work is, as measured by how little you need to second-guess yourself.
I think interpretability scores pretty highly on this axis, while also being “doable in academia without much compute”. I want to notice and feel these constraints / biases and grapple with them.
Related: Why you shouldn’t build your career around existential risk


cf a different take here (which is either complementary or contrary depending on your POV):
https://www.lesswrong.com/posts/PMc65HgRFvBimEpmJ/legible-vs-illegible-ai-safety-problems