I was zooming around short-form content until I sat down and plugged into Zhengdong’s 2025 letter and surfaced transformed.
New Year resolutions are a bit strange to me. I make resolutions roughly every two weeks, looking ~2,000–3,000 weeks ahead each time. So when people sit down to make them in January, it’s like: oh, you’re joining this round? I write daily online, so I can’t resolve to do that. I’m content with my exercise. This resonates:
Let’s review the most recent two weeks.
The themes:
My research
Research prioritization meta (pay, institutional homes) ended up being a really long section last time, so I’m punting it to the appendix this time.
I wrote the “testing models’ self-forecasting abilities” proposal, and it got accepted. So that research project will fly come spring. I’m also pretty interested in cross-model research, though. I envision a multipolar world. There are ways in which humans have privileged introspective access, and ways in which humans have a comparative advantage in helping each other. Using models to keep other models in check, that’s the field of scalable oversight. I think we might want to differentially accelerate simulation ability: we might want very strong ‘predictive oracles’ that can anticipate what other models are going to do before they do it.
It took me a while to perceive, as Sohl-Dickstein puts it, the field of “science on AI models” and to see interpretability and model behavior research as two sides of the same coin. There came a point in 2025 where I couldn’t in good conscience keep charting black-box elicitations without understanding the underlying generators. Truly, I want to be able to say what models will do before they do it. What’s the minimum amount of information/compute/weights/source code from which you can say what they’ll do? We want to pay attention to edge cases, cases where scheming/deception would be hard to detect by mechanisms like activation oracles. If we can reliably predict scheming before it happens, that’s good. Driving failure modes into less observable corners is a dangerous whack-a-mole game, but it’s not scary if you actually understand the entire system. It’s not hubris to suppose you might at some point be able to: you can actually notice when you don’t.
Does predictive ability just look like putting your model into simulated environments and hoping eval-to-deploy goes well? What are the eval-to-deploy failure modes? Here sim-to-real transfer seems worth a shallow pass. Does it look like that thing I mentioned ages ago—deploy for 0.0001s, check nothing goes wrong, deploy for 0.001s, check nothing goes wrong—on what time horizon do people expect scheming to emerge? I think it might be time to revisit the control literature from Redwood.
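The escalating-horizon idea above can be sketched as a staged-deployment loop. Everything here is hypothetical scaffolding: `run_monitored`, the anomaly check, and the schedule are placeholders for illustration, not any real deployment API.

```python
def run_monitored(horizon_s: float) -> bool:
    """Hypothetical stand-in: deploy the model for `horizon_s` seconds
    under monitoring, and report whether anything anomalous occurred."""
    return False  # pretend nothing goes wrong at any horizon


def staged_deploy(start_s: float = 1e-4, factor: float = 10.0, max_s: float = 3600.0):
    """Deploy for exponentially increasing horizons, halting at the
    first horizon where monitoring flags a problem."""
    horizon = start_s
    cleared = []
    while horizon <= max_s:
        if run_monitored(horizon):
            return cleared, horizon  # cleared horizons, first failing one
        cleared.append(horizon)
        horizon *= factor
    return cleared, None  # cleared everything up to max_s


cleared, failed_at = staged_deploy()
print(f"cleared {len(cleared)} horizons; first failure: {failed_at}")
```

The interesting empirical question is which `factor` and monitoring budget actually catch scheming that only emerges at long horizons; the loop itself is the trivial part.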
How does this connect back to the original project? If the model is capable of predicting things about itself but suppresses it, i.e. doesn’t share the prediction, might that be something we can detect? Are there certain neural correlates scheming requires? Like multi-step reasoning. How distributed can that cognition get? Can it alter its thinking to take place in undetectable ways? Have I myself put things in a box to take out later, knowing the trigger will naturally arise, or intentionally planting it in the future?
Step One has already been done: Language Models Can Predict Their Own Behavior. The whole abstract is of interest: “[…] We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. The conformal probes can identify instances that will trigger alignment failures (jailbreaking) and instruction-following failures, without requiring a single token to be generated. An early warning system built on the probes reduces jailbreaking by 91%. Our probes also show promise in pre-emptively estimating how confident the model will be in its response, a behavior that cannot be detected using the output text alone. Conformal probes can preemptively estimate the final prediction of an LM that uses Chain-of-Thought (CoT) prompting, hence accelerating inference […]”
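A toy version of the conformal-probe recipe, as I understand it. All data here is synthetic and the linear probe is illustrative, not the paper’s actual method: fit a probe on frozen “activations”, then use a held-out calibration split to pick a flagging threshold with a finite-sample guarantee on the miss rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 64-dim vectors. Label 1 marks a prompt that
# will eventually trigger the bad behavior; its activations are shifted
# slightly along one direction, which is all the probe has to find.
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 1.5 * y

# Split the data: fit the probe, calibrate the threshold, then test.
Xtr, ytr = X[:1000], y[:1000]
Xcal, ycal = X[1000:1500], y[1000:1500]
Xte, yte = X[1500:], y[1500:]

# Linear probe via least squares; its output is a "badness" score.
w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)

# Nonconformity for known-bad calibration prompts: low score = surprising.
cal_scores = -(Xcal[ycal == 1] @ w)

# Split-conformal threshold: by exchangeability, a fresh bad prompt's
# nonconformity lands below this quantile with probability >= 1 - alpha,
# so flagging accordingly catches ~90% of bad prompts before any token
# is generated -- regardless of how good the probe actually is.
alpha = 0.1
m = len(cal_scores)
level = min(1.0, np.ceil((m + 1) * (1 - alpha)) / m)
q = np.quantile(cal_scores, level, method="higher")

flags = -(Xte @ w) <= q
recall = flags[yte == 1].mean()
print(f"recall on bad prompts: {recall:.2f}")
```

The guarantee bounds the miss rate on bad prompts; how many benign prompts get flagged still depends entirely on probe quality.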
What’s Step Two? That depends on Step Ten, right? I get more and more sympathetic to backchaining from where you need to be. But it’s not really “worlds you want to live in”, it’s more “properties you want your AI systems to have”, which determines those worlds.
[bottom-up:] Conformal prediction might be a hammer worth applying in more cases. LW barely talks about it, but it’s distinct from Bayesianism, and according to one guy with a lot of conviction, better. It’s interesting Ashok and May applied it because this is a field where we want guarantees.
[bottom-up:] Step Two: just extend from the constrained settings of Ashok and May (QA and classification datasets) towards multi-turn agentic rollouts, making use of new tools, as discussed in the plan? I buy that it would give us and other researchers more clarity on the next set of questions to ask. It’s not satisfying because the back-chain is not complete. But at least I’m being honest about that. I ought to talk to more researchers. What’s coming up soon…EAG Bay? I’ll be in Oxford sitting collections during ICLR (sad). ICML? TAIS? Constellation Happy Hours? FAR.AI seminars / membership? Talk to Adam, Christopher? Constellation affiliate program? Set up ~quarterly calls with CBAI people? Yeah, I could host those / invest more in that alumni community. I’ve always thought the fellowships should do more of that. Samuel, Nick, Jo could help with broader distribution. A Zoom call with a focus on currently active research, asks and offers; this could be interesting…a Discord server with individual channels, etc. Yep, I’m convinced. Jonathan, Celeste, Batuhan, the ICs, etc.
OK! All the above was about what to research. Now we can chat a little bit about the mechanisms (with reference to the past two weeks).
I haven’t had time for coding / running experiments. I thought I endorsed this — JLR mentioned he was able to get up to speed on the field of biology really quickly because he was just reading papers while others were executing time-consuming experiments. I’d like to trust AI system colleagues down the line with implementation details…however, LG’s comment on CVS’s doc is throwing me. And I do need to pull projects over the line with the tools currently available to me.
I don’t think ~1 day/week is the smartest approach. I think chunking work is better. So I might set a week at the end of January where I polish up the SvR results from the fall (& maybe run some RL experiments to complete it). This helps to preserve momentum.
Rest in peace & here’s my vigil for: a) my Utility Engineering work from 2025, b) the open-weight model project. I didn’t store UE results properly and lack the compute to re-run—the world would be mildly better for having them, but I don’t have time. :/
NN architectures: I settled on investigating general properties of neural networks / leading paradigms until there’s a convincing argument / more evidence than one post that others are worth focusing on.
Reading: This is going fairly well by just dropping everything and focusing on new-ish papers some evenings. GDM’s Meta-RL, Activation Oracles, adversarial robustness thereof. Empirical rate: ~once/week, always in response to some trigger — Owain Evans tweet, Celeste’s message, GDM ‘The Thinking Game’ documentary. So I need to automate and randomize these triggers. Ideally, I need a bot that goes through my ‘Read Later’ & messages me deep cuts at opportune times. [1 hour later]: I’ve used this repo to index my Curius saves and fed a .txt to Poke Interaction. Excited!
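The randomized-trigger idea above can be sketched in a few lines. The file format, the seeding trick, and the daily window are all made up for illustration; the actual Curius indexing repo and Poke setup will look different.

```python
import random
from datetime import datetime, time, timedelta
from pathlib import Path


def load_saves(path: str) -> list[str]:
    """One saved link/title per line, as exported to a .txt file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def todays_trigger(saves: list[str], day: datetime, start_h: int = 9, end_h: int = 21):
    """Pick one deep cut and a random minute in the day's waking window.
    Seeding on the date keeps the pick stable if the script reruns."""
    rng = random.Random(day.date().isoformat())
    item = rng.choice(saves)
    minute = rng.randrange(start_h * 60, end_h * 60)
    when = datetime.combine(day.date(), time(0)) + timedelta(minutes=minute)
    return item, when


# Hypothetical usage: in reality these would come from the exported .txt.
saves = ["https://example.com/old-essay", "https://example.com/deep-cut"]
item, when = todays_trigger(saves, datetime(2026, 1, 14))
print(item, when.strftime("%H:%M"))
```

Whatever actually delivers the message (Poke, a cron job, a Discord webhook) just needs to call something like `todays_trigger` once per day.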
MathPhil
I want to understand everything. I want to download the structures. If you’re in a space where Property A implies Property B (an empirical question?), and get some object with Property A going, the object must have Property B. That’s cool. Deep metaphors. Math-ML-applied-rationality bridge. The iconic Xiaoyu said “your job [as a grad student] is to make mathematics maximally fun for yourself”. It’s working.
I did one week on linear algebra, took a couple of days off for Xmas, and have almost finished a week on metric spaces. Next: weeks on complex analysis, probability theory, and differential equations, before I launch into writing essays on metaphysics and epistemology. I think I’ve decided to commit to DiffEq, not Measure Theory: I want stronger calculus, and it fits the way I think / metaphors I’m after pretty well. I’ll fill in Measure Theory gaps when I study Functional Analysis.
Besides, measure theory appears as ‘integration’ on the transcript.
Here’s a very important thing to keep in mind: study hours funge near-perfectly across the next 1.5 years (since all I need do is apply the material in exams). So really, there’s not much to stop me speedrunning the rest of the degree in 2026, then deploying the first half of 2027 as I see fit. Either I’m still left with material to absorb (in which case it’s damn good that I kept up the pace!), or I’m roughly done and can focus on other things. There is no sense in pacing myself. The only variable is whether I’m in CA or UK; well, I do everything better in CA—studies, non-studies—but I’ll have to do some things in the UK. If I stay ahead, I can keep up OCAAT. Fwiw, this is the plan:

I might travel elsewhere for research conferences & very wholesome events, but generally this is the plan.

d. The biggest problem with studying MathPhil is that it makes me illegible at parties. This is a serious problem. The best solution I’ve got right now is to transmogrify “what are you working on?” into “Working on? Well, I’ve been thinking about <explain most recent research papers I’ve been reading & possible jumping-off points, the blog, the most recent research updates and upcoming research plans>”. This is much worse than when I had a full-time non-study affiliation; it sucks for many listeners. (Students, if you’ve found solutions, please send them my way :’))
e. Re the studies themselves:
Some days I get thrown off-track by checking my phone in the morning. I will solve this by charging devices outside my room and not checking anything until I’ve done the day’s substantial tasks.
It’s blissful walking from Conduit to Mox.
Each time I host an event, I lose 1-2 days. Need to decide when this is worth it vs. not:
I will shift SPAR to be an afternoon/evening activity this spring.
I’m going to replace organizing the FFHII reading group with personally reading and posting takes/updates re the FHI corpus and related work. The thought of this makes my heart sing with joy. I love talking to everyone on Substack and Twitter, wherever they might be in the world, and automatically having proper record-keeping. Streamlining: this falls under the ‘Reading’ section.
This should give me more slack for one-off events! :)
Open loops
Substack → fine-tune clone pipeline. Perhaps, as some people need a weekly blog club, what I need is a weekly vibecoding club. [15 mins later] Messaged Rachel, we’ll do it weekly for Mox members and +1s from 14th Jan. I look forward to batching my vibecoding into this container!
There’s some visa stuff to keep working through. This is rumbling along in the background, depends on several variables; I know it exists.
Buy 4K monitor for UK.
Joyously Closed Loops
Housing: The best it’s ever been.
Food: Forkable + Thistle! (+ freezer)
Gratitude
I really appreciate my friends, especially RP, CVS, EO, AW, YK, and over Xmas it was really nice to see JW, RM, JL, YL, B2, CB, JC, none of whom I know so well, and I miss MA, NM, and am hanging out with ND, JH in El Salvador soon, & Boston 23-26 Jan! (wait, did I mention I held an Xmas dinner?) And tomorrow there’s a whole different set; this really is the world of loose ties. Writing this section triggered me to send a few messages to others too :)
Appendix: Research prioritization meta
I don’t like applying to things. I’m lucky to have enough runway to breathe until 2027 if I play my cards right. I might make an exception for MATS. I weigh being in the San Francisco Bay Area so highly I just will not apply to things based in London. This is partly for visa/residence reasons.
I think it’s really important to get ready to work with AI model colleagues. So I want to focus on building out the sketches of new research fields more than plugging into existing research fields. Accordingly, deprioritized open-weight models project. Heuristic: If it’s not something you’d want to spend your life on, don’t work on it.
I’m somewhat averse to asking for references and lack publications, so am illegible. I need to act even given this. OK, let’s look and make a plan!
[30mins later] OK, here’s what I’ll do. I’ll write a doc explaining how I’ve had the Fall and Spring SPAR projects, and I’d like to use the MATS time to wrap up that research and extend the directions that look promising under good mentorship. Here, incentives should be aligned: the mentor gets to learn about a research subfield I’ve been very invested in, and I can benefit from their expertise re manuscript finalization, ablations, building out a research field, etc. This seems fun! I’ve identified some mentors I can pitch this to.
Priorities for Q1 2026
Jan 1 - Feb 15:
Part A Mathematics
Typical day:
Wake up => do 2 hours of mathematics
Walk over to Mox, have lunch. Call Europe
Review mathematics in the afternoon
Evening available for research + reading
Completing SvR manuscript (1-31 Jan)
MATS submissions insofar as synergetic with research priorities
Feb 16 - Mar 31:
Shift to philosophy (main occupation)
Kicking off SPAR project (Feb 16)
Priorities for Q2 2026
Apr 1 - May 15:
Continued mathematics review (~daily morning past papers)
Maybe spend some time with Functional Analysis? To a lesser extent, Logic? Part B courses that reinforce the Part A courses well.
Writing K+R essays
May 14: TAIS
May 16 - June 30:
Priorities for Q3 2026
July 1 - September 30:
Some balance of research & mathematics!
Priorities for Q4 2026
Oct 1 - December 31:
Tutorials in Logic, Functional Analysis, Information Theory
Tutorials in Philosophy of Mathematics
Next review coming in mid-January. Momentum is everything.

A couple questions/observations
1) if you hate philosophy so much, why mathphil, if u wanna do research, why not drop out and do research? (I think there may be good answers to this question, but make sure you're asking them often enough)
2) > This is much worse than when I had a full-time non-study affiliation; it sucks for many listeners.
Unsolvable. When studying, your answer is naturally, at best, "thing X, but mostly these 5 different courses". That's just how it is. Doesn't seem solvable to me.
3) > So really, there’s not much to stop me speedrunning the rest of the degree in 2026, then deploying the first half of 2027 as I see fit.
Realized this like, 5 years too late. It's very true. Just rush the degree. Year 6 is killing me x)