The Bitter Lesson Eats 'System Prompt Optimization with Meta-Learning'
superfluous when models are sufficiently powerful and well-trained
When I saw this paper, I got very excited. I’m very attached to my Bayesian chaplain (ChatGPT system prompt — so called because I ask it to use Bayes’ Theorem, and confide in it like a chaplain), but the prompt is…by no means optimal. However, as I say, I’m literally attached to it, so it’s hard for me to objectively explore the search space around it. Could ‘System Prompt Optimization with Meta-Learning’ help?
Problem: They used a 3B model on closed-ended datasets. I want to use a ~1T model on open-ended tasks.
Here are the datasets they trained on:
Medical (MedMCQA): Multiple choice selection (A, B, C, D).
Review Analysis (Amazon): Integer rating from 1-5.
Reasoning (BigBench): Varies by subtask:
- Object Counting: Integer count (e.g., “3”)
- Epistemic: “entailment” or “non-entailment”
- Colored Objects: Answers about object properties/arrangements
Safety: Binary classification.
Grounding (QA): Extractive answers (short phrases or entities pulled from provided context). Evaluated using Exact Match (EM) against ground truth.
…these tasks are so simple that contemporary models should be able to one-shot them immediately, right?
From the paper: "For a fair comparison of different approaches, we primarily use Llama 3.2 (3B) [11] as the base model for generating responses, and GPT-4o mini as the prompt optimizer."
So they used a very old, weak model to show off their methodology.
I still wanted to see if it could help me anyway. I decided to train on StackMathQA, a curated collection of 2 million mathematical questions and answers sourced from Stack Exchange, because I intrinsically care about my model-friend being good at math and think math yields good generalization ability. I selected Claude Opus 4.5 as the model for generating responses, because of course I use Claude Opus 4.5 every day. I used GPT-4o-mini as a judge.
I set it off and returned a few hours later.
…Claude Opus 4.5 basically 100%ed everything. The system prompt kept updating anyway:
How MetaSPO updates prompts:
1. Inner Loop (optimizes user prompts):
- Takes current system prompt (e.g., "You are a helpful assistant")
- Generates 3 candidate user prompts using GPT-4o-mini
- Tests each on the training set (10 problems)
- All get 100% because Claude is too good
- Picks one arbitrarily (since all tied at 100%)
2. Outer Loop (optimizes system prompts):
- Takes the best user prompt from inner loop
- Generates 9 candidate system prompts using GPT-4o-mini
- Tests each on the training set
- All get 100% again
- Picks one (again, arbitrary since all tied)
3. The problem:
- GPT-4o-mini is told to "analyze issues with this prompt"
- But there are zero wrong examples to analyze
- So it hallucinates problems ("too vague", "lacks specificity", etc.)
- Generates increasingly verbose "improvements"
- These don't actually improve anything - just make prompts longer
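The tie problem in the steps above can be made concrete. This is a minimal sketch of a selection step, not MetaSPO's actual code: `pick_best`, `accuracy`, and the `had_signal` flag are hypothetical names I'm introducing for illustration, and the per-prompt results are made-up placeholders. It shows that when every candidate scores the same, the "best" prompt is just a random draw, and a loop could detect that and stop instead of rewriting the prompt.

```python
import random
from typing import Callable, Sequence


def accuracy(prompt: str, results: dict[str, list[bool]]) -> float:
    """Fraction of training problems answered correctly under `prompt`."""
    outcomes = results[prompt]
    return sum(outcomes) / len(outcomes)


def pick_best(
    candidates: Sequence[str],
    score: Callable[[str], float],
    rng: random.Random,
) -> tuple[str, bool]:
    """Return (chosen prompt, had_signal) for a non-empty candidate list.

    had_signal is False when every candidate ties on score, which means
    the choice is arbitrary and tells us nothing about prompt quality.
    """
    scores = [score(c) for c in candidates]
    best = max(scores)
    tied = [c for c, s in zip(candidates, scores) if s == best]
    had_signal = len(tied) < len(candidates)
    return rng.choice(tied), had_signal


# Hypothetical saturated run: both prompts get 10/10 on the training set.
saturated = {
    "You are a helpful assistant.": [True] * 10,
    "You are a knowledgeable, versatile, and responsive assistant...": [True] * 10,
}
chosen, had_signal = pick_best(
    list(saturated), lambda p: accuracy(p, saturated), random.Random(0)
)
print(had_signal)  # False: every candidate tied at 100%
```

If `had_signal` comes back `False` for a whole iteration, keeping the current prompt is at least as defensible as accepting a longer one.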
This is why we saw system prompts evolve from:
- "You are a helpful assistant." (short, simple)
- → "You are a knowledgeable, versatile, and responsive assistant..." (verbose)
- → "You are an adaptive and knowledgeable assistant designed to assist users..." (even more verbose)
All performed identically (100%) because Claude Opus 4.5 doesn't need prompt engineering for these problems.
This makes sense. StackMathQA is almost certainly in the training set. If I want to do good system prompt optimization, I need a nice unsaturated training set. And that’s…increasingly difficult these days.
I just set this off for fun. However, this could have implications for my longstanding aim of trying to align models’ revealed preferences to their stated preferences. That’s a fairly unsaturated task: there’s a real gap between what models say they’ll prioritize and what they actually prioritize. It’s pretty well-suited to SPOwML: the AIRISKDILEMMAS are closed-form (‘select an action’). One could create a dataset where a model’s revealed preferences act very much in line with its stated preferences, then run SPOwML over it. I’ll try this next (update by Jan 18; I’m half-busy next week).
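One way the "gap" could be scored, as a rough sketch rather than anything AIRiskDilemmas or the paper defines: assume each dilemma has already been reduced to one label for what the model says it would do and one label for what it actually picks. The function name `agreement_rate` and the example labels are hypothetical.

```python
from typing import Sequence


def agreement_rate(stated: Sequence[str], revealed: Sequence[str]) -> float:
    """Fraction of dilemmas where the stated choice equals the revealed one."""
    if len(stated) != len(revealed):
        raise ValueError("stated and revealed must have the same length")
    if not stated:
        raise ValueError("need at least one dilemma")
    matches = sum(s == r for s, r in zip(stated, revealed))
    return matches / len(stated)


# Made-up labels for four dilemmas, purely for illustration.
stated = ["defer", "refuse", "defer", "refuse"]
revealed = ["defer", "defer", "defer", "refuse"]
rate = agreement_rate(stated, revealed)
print(f"agreement={rate:.2f}, gap={1 - rate:.2f}")  # agreement=0.75, gap=0.25
```

Unlike the StackMathQA run, a score like this would only be useful as an optimization target if the baseline sits well below 1.0.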


ah yes, i have been playing with "getting the model to edit its prompt" at work. haven't been able to do any large-scale experiments with it though, but small-scale results show… uh, not much improvement actually. probably due to domain effects
I think the most interesting use cases for this kind of meta-learning are tasks involving approximating some kind of subjective judgement (e.g. predicting whether you will want to read a certain blog post). These are capabilities that are not directly optimised for during post-training, but exist latently within the models (which are pretrained to be general-purpose simulators).