This post has three sections:
What is the data processing inequality?
Who’s challenging it? Why?
Why do I think this is missing the point?
What is the data processing inequality?
From Wikipedia:
The data processing inequality is an information-theoretic concept that states that the information content of a signal cannot be increased via a local physical operation. This can be expressed concisely as ‘post-processing cannot increase information’.
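In symbols, the classical (Shannon) form says that for any function f applied to X, I(Y; f(X)) ≤ I(Y; X). The snippet below is a small numerical sanity check of that statement, not anything taken from the papers discussed later. The joint distribution `p_xy` and the merging function `merge` are made-up values chosen only for illustration.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """Mutual information in bits for a joint distribution given as {(a, b): prob}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint, f):
    """Joint distribution of (f(X), Y), obtained by applying f to the first coordinate."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(f(x), y)] += p
    return dict(out)

# Hypothetical joint distribution p(x, y): X takes 4 values, Y takes 2.
p_xy = {
    (0, 0): 0.20, (0, 1): 0.05,
    (1, 0): 0.05, (1, 1): 0.20,
    (2, 0): 0.15, (2, 1): 0.10,
    (3, 0): 0.10, (3, 1): 0.15,
}

def merge(x):
    """Hypothetical post-processing: collapse X to its parity, discarding detail."""
    return x % 2

i_xy = mutual_information(p_xy)
i_fy = mutual_information(push_forward(p_xy, merge))
print(f"I(X;Y) = {i_xy:.4f} bits, I(f(X);Y) = {i_fy:.4f} bits")
```

Any other choice of `merge` can be swapped in; under this definition of information, the processed value should never come out ahead of the raw one.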
Who’s challenging it? Why?
Concisely:
Xu et al.¹ point out that, given unlimited computation, plain text can be computed from encrypted text, yet the plain text is far more useful in practice. In their terms, plain text t(X) and labels Y have higher effective mutual information than encrypted text X and labels Y.
Pimentel and Cotterell² point out that the representations learned by BERT are more helpful to downstream models than the raw sentence data. Contextualized representations f(X) may provide more extractable information about targets Y than the raw input X.
Why do I think this is missing the point?
I think both observations are basically correct. But I also think information sneaks into the equation through the choice of processing function.
Take Pimentel and Cotterell’s ‘proof by counterexample’. In it, they define a variable Z = f(Y) with f(y) = y - θ, where data points y are sampled from a Poisson distribution Y, and θ is the true (unknown) mean of that distribution. They then show that Z is more useful to a Bayesian agent than Y. Ergo, the data processing inequality does not hold.
But the true mean θ is exactly the salient information that a Bayesian agent would not have access to! Similarly, in the example of Xu et al., a Bayesian agent can’t recognize which of the many gibberish outputs of brute-force decryption is the plain text without some external data about the target language. Moreover, the agent needs procedural knowledge of how to perform brute-force decryption in the first place.
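A toy version of the decryption point, as a rough sketch: a Caesar cipher stands in for real encryption, and the tiny word list `KNOWN_WORDS` stands in for "external data about the target language". Both are assumptions made for this illustration and are not taken from Xu et al.

```python
import string

def caesar_shift(text, shift):
    """Shift lowercase letters by `shift` positions; leave other characters alone."""
    alphabet = string.ascii_lowercase
    table = str.maketrans(alphabet, alphabet[shift:] + alphabet[:shift])
    return text.translate(table)

# Hypothetical prior knowledge about the target language (a tiny English word list).
KNOWN_WORDS = {"the", "data", "cannot", "be", "created", "from", "nothing"}

def english_score(text):
    """Count how many whitespace-separated tokens appear in the word list."""
    return sum(word in KNOWN_WORDS for word in text.split())

plaintext = "data cannot be created from nothing"
ciphertext = caesar_shift(plaintext, 3)

# Brute force: one candidate decryption per possible key.
candidates = [caesar_shift(ciphertext, -k) for k in range(26)]

# Without KNOWN_WORDS, nothing marks any of the 26 strings as special.
# The word list is what singles out the plain text.
best = max(candidates, key=english_score)
print(best)
```

The brute-force loop produces all 26 candidates on its own, but picking `best` leans entirely on `english_score`, which is to say on knowledge that came from outside the ciphertext.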
Then there is the overarching case of learned representations, which are learned from so much data! Can you really abstract that data away, apply the function, and claim to have broken the data processing inequality?
Every processing function must be learned. And learning it almost certainly takes far more data than the data points it is ultimately applied to, which are the only ones represented in these authors’ equations.
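One way to make this bookkeeping explicit, as my own sketch in plain Shannon terms rather than anything from either paper: write D for the data used to learn the processing function (a symbol introduced here purely for illustration) and f_D for the learned function. Because f_D(X) is a function of the pair (X, D), the ordinary inequality still applies to that pair, and the chain rule splits the right-hand side:

```latex
I\bigl(Y; f_D(X)\bigr) \;\le\; I\bigl(Y; (X, D)\bigr) \;=\; I(Y; X) + I(Y; D \mid X)
```

Read this way, any apparent gain of f_D(X) over X is bounded by I(Y; D | X), the information contributed by the data the function was learned from.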
In these papers, we get the processing functions for free. But learning the processing functions is the hardest and most interesting task for these Bayesian agents. Researchers should not elide this important detail.
I have high uncertainty because I am not trained in this field, so I appreciate corrections, different frames, and more literature recs on the topic.
Section 3.4.2, A Bayesian Framework for Information-Theoretic Probing

