Last week, I wrote about interesting nuance elided by “A Theory of Usable Information Under Computational Constraints”.
Today, I threw my qualms out of the window and just trained two models, (1) a simple MLP and (2) a Restricted Boltzmann Machine (RBM), on a simplified MNIST task (classifying digits 0, 3, and 4) at a thermodynamic computing hackathon.
I wanted to understand mutual information (MI) dynamics within these architectures, i.e.:
Do the hidden units encode information about digit class?
Reminder of what MI is:
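For two discrete random variables X and Y, the standard textbook definition (written here with base-2 logs so the result is in bits; the symbols p(x, y), p(x), p(y) are the joint and marginal distributions) is:

```latex
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log_2\frac{p(x,y)}{p(x)\,p(y)}
       \;=\; H(Y) - H(Y \mid X)
```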
i.e. a measure of the “amount of information” (in e.g. bits) obtained about one random variable by observing the other random variable.
In our MI calculation, X and Y are:
I(X; Y) = I(hidden_representations ; digit_labels)
(X = hidden_representations, Y = digit_labels)
X = Hidden Layer Activations
- RBM: Binary hidden units sampled via Gibbs sampling
  - Shape: [n_samples, n_hidden], e.g. [50, 100]
  - Values: 0 or 1 (on/off)
- MLP: Continuous hidden activations (binarized)
  - Shape: [n_samples, n_hidden], e.g. [50, 100]
  - Values: tanh outputs
Y = Digit Class Labels
- Decoded from the 30-bit replicated encoding
- Shape: [n_samples], one class per sample
- Values: 0, 1, or 2 (three digit classes in filtered MNIST)

Note that for truly random hidden states (independent of labels):
I(X;Y) = 0 bits
And the theoretical maximum, reached when the hidden units perfectly predict the digit class and the three classes are equally frequent, is:
I(X;Y) = log₂(3) ≈ 1.58 bits
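As a rough illustration of those two bounds, here is a minimal sketch of a plug-in (histogram) MI estimate between one discrete variable and the class labels. This is not the code I ran at the hackathon: the function name `mutual_information_bits`, the one-variable-at-a-time simplification, and the toy data are all assumptions for illustration. Estimating MI for a full 100-dimensional hidden vector from around 50 samples needs a more careful estimator than this.

```python
import numpy as np

def mutual_information_bits(x, y):
    """Plug-in estimate of I(X; Y) in bits for two discrete 1-D arrays."""
    # Map arbitrary discrete values to contiguous indices 0..k-1.
    _, x_idx = np.unique(np.asarray(x), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y), return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()

    # Joint histogram, normalised to an empirical joint distribution p(x, y).
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    # Marginals p(x) and p(y), kept 2-D so their product has the joint's shape.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum only over non-empty cells (convention: 0 * log 0 = 0).
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy data (hypothetical): 300 samples, three balanced classes.
labels = np.repeat([0, 1, 2], 100)

# A binary "unit" that is on for exactly half of every class: independent of labels.
independent_unit = np.tile([0, 1], 150)
mi_independent = mutual_information_bits(independent_unit, labels)  # 0 bits

# A variable that identifies the class exactly: reaches log2(3) bits.
mi_perfect = mutual_information_bits(labels, labels)

# A single binary unit that fires only for class 0: informative, but below 1 bit.
mi_one_unit = mutual_information_bits(labels == 0, labels)
```

Note that a single binary unit can carry at most 1 bit, so reaching log₂(3) requires looking at more than one unit jointly.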
I vibe-coded some experiments, and set them off, and we waited with bated breath. The MLP experiments finished super quickly:

These results make sense. The information is ‘all there’ (though unusable) in layer 1. The later layers learn usable information over time.
The RBM experiment took longer. At some point we looked and nvidia-smi showed GPU utilization at 0% (😱), but in fact this was to be expected!
After 2h 17mins [1], the RBM finished:
This suggests that you can’t infer much about digit class from an RBM’s hidden units. But did the RBM itself actually infer anything about digit class? I forgot to actually get digit classification accuracy of the trained RBM, and did not save its weights :), so am quickly re-training and will edit this post with digit classification accuracy once I have it. (I predict low!)
Really, I should’ve implemented a ‘denoising thermodynamic model’ (DTM), which consists of 2-8 Energy-Based Models (EBMs) sequentially chained together. DTMs were just used to achieve a 10 000x energy efficiency improvement over GPU-run diffusion models on Fashion-MNIST.
The RBM itself is just one special case of IsingEBM:
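One library-independent way to see this: an RBM's energy is a quadratic, Ising-style energy whose coupling graph is bipartite, with edges only between visible and hidden units. The NumPy sketch below checks that numerically. It does not use the actual IsingEBM class or its API; the function names, the {0, 1} unit convention, and the layer sizes are assumptions for illustration.

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Standard RBM energy: E(v, h) = -a.v - b.h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def ising_energy(s, J, bias):
    """Generic pairwise energy: E(s) = -bias.s - 1/2 s^T J s (J symmetric, zero diagonal)."""
    return -(bias @ s) - 0.5 * (s @ J @ s)

def rbm_as_ising(W, a, b):
    """Embed RBM parameters into one coupling matrix over all visible + hidden units."""
    n_v, n_h = W.shape
    J = np.zeros((n_v + n_h, n_v + n_h))
    J[:n_v, n_v:] = W      # visible -> hidden couplings
    J[n_v:, :n_v] = W.T    # symmetric counterpart
    # The visible-visible and hidden-hidden blocks stay zero: that is the "restricted" part.
    return J, np.concatenate([a, b])

rng = np.random.default_rng(0)
n_v, n_h = 6, 4  # hypothetical sizes
W = rng.normal(size=(n_v, n_h))
a = rng.normal(size=n_v)
b = rng.normal(size=n_h)
v = rng.integers(0, 2, size=n_v).astype(float)
h = rng.integers(0, 2, size=n_h).astype(float)

J, bias = rbm_as_ising(W, a, b)
s = np.concatenate([v, h])
e_rbm = rbm_energy(v, h, W, a, b)
e_ising = ising_energy(s, J, bias)
```

Here the units take values in {0, 1}; switching to the ±1 spins usually used for Ising models only rescales the couplings, shifts the biases, and adds a constant to the energy.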
I’ve tried to replicate DTM, and will set that running once the H100 frees up. My replication might not work, in which case I have a backup plan:
Thanks Som Bagchi for pair programming!
[1] I later sped this up with a larger batch size.







