Meaning as Symmetry Breaking: Why One-Way Messages Cannot Create Semantics

J. Landers

1. Introduction

A long, narrow-band radio transmission arrives from a distant star. It is structured, persistent, and overwhelmingly unlikely to be a natural astrophysical process. We detect it. The biggest question is answered: we are not alone.

A second question then appears, deceptively simple: what does the message say? A standard hope is that mathematics (prime numbers, pictures, universal constants) can serve as a self-decoding bootstrap. This note isolates a basic obstruction. It is not an obstruction about noise, bandwidth, or engineering. It is an obstruction about interpretation.

The central idea is that semantics lives in a mapping - an unpacking function, or decoder. If that decoder is completely unknown, then (in a precise information-theoretic sense) a one-way bitstring can carry zero semantic information, even through a perfect noiseless channel. Interaction - queries and feedback - breaks the symmetry by allowing the parties to align a decoder.

Informal slogan. The bits are cheap; the interpreter is expensive. One-way messages can transmit symbols, but cannot (in general) transmit the mapping that makes symbols mean something.

2. The hidden variable: a decoder

Let \(\mathcal M\) be a finite set of meanings and \(\mathcal B\) a finite set of symbols (bitstrings), with \(|\mathcal M|=|\mathcal B|=N\). A decoder/encoder is a bijection \[ E:\mathcal M \to \mathcal B. \] The sender chooses \(M\in\mathcal M\) (a meaning) and transmits the symbol \[ B = E(M). \] If \(E\) is known to the receiver, then decoding is trivial: \[ H(M\mid B,E)=0, \quad\text{and hence}\quad I(M;B\mid E)=H(M). \] So conditional on the decoder, the symbol carries full semantic entropy.
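
As a concrete (toy) illustration - a minimal sketch in Python, with the meaning set, symbol set, and the particular bijection invented for the example - the following snippet draws a random bijection \(E\), encodes a meaning, and shows that a receiver who shares \(E\) decodes it exactly:

```python
import random

meanings = ["greeting", "warning", "coordinates", "question"]  # toy meaning set M
symbols = ["00", "01", "10", "11"]                             # toy symbol set B, |M| = |B| = N

# A decoder/encoder is a bijection E : M -> B.
shuffled = symbols[:]
random.shuffle(shuffled)
E = dict(zip(meanings, shuffled))        # the encoder E
E_inv = {b: m for m, b in E.items()}     # its inverse, used by the receiver to decode

M = random.choice(meanings)              # the sender chooses a meaning M
B = E[M]                                 # and transmits the symbol B = E(M)

# With E shared, decoding is trivial: H(M | B, E) = 0.
assert E_inv[B] == M
```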

3. No shared prior: semantic mutual information collapses

Theorem 1 (No shared prior → zero semantic information).
Let \(M\) be any random variable on \(\mathcal M\). Let \(E\) be drawn uniformly at random from the set of all bijections \(\mathcal M\leftrightarrow \mathcal B\), independently of \(M\), and hidden from the receiver. With \(B=E(M)\), one has \[ I(M;B)=0. \]

Proof. Fix \(m\in\mathcal M\) and \(b\in\mathcal B\). Since \(E\) is a uniformly random bijection, \[ \Pr(B=b\mid M=m)=\Pr(E(m)=b)=\frac{1}{N}. \] Thus \(\Pr(B=b\mid M=m)\) is constant in \(m\), so \(B\) is independent of \(M\), and \(I(M;B)=0\). \(\square\)
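
The counting at the heart of the proof is easy to check exhaustively for small \(N\) (a sketch; the value \(N=5\) is arbitrary): every pair \((m,b)\) is consistent with exactly \((N-1)!\) of the \(N!\) bijections, so the observed symbol is uniform regardless of the meaning.

```python
from itertools import permutations
from math import factorial

N = 5
bijections = list(permutations(range(N)))   # all N! candidate decoders

# Key step of the proof: for every (m, b), the number of bijections with E(m) = b
# is (N-1)!, hence Pr(B = b | M = m) = (N-1)!/N! = 1/N, independent of m.
for m in range(N):
    for b in range(N):
        count = sum(1 for e in bijections if e[m] == b)
        assert count == factorial(N - 1)

print("Pr(B=b | M=m) = 1/N for all m, b  =>  B independent of M  =>  I(M;B) = 0")
```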

Theorem 1 is a symmetry statement: when all bijections are a priori equally plausible, no particular decoding of the observed symbol is singled out. In this formal model, semantics is not merely hard to recover - it is information-theoretically absent.

4. A mutual-information identity (what gets “spent” on interpretation)

The preceding collapse is illuminated by a simple chain-rule identity. Because \(M\) and \(E\) are independent, \[ I(M;B)=I(M;B\mid E)-I(M;E\mid B). \] Since \(E\) is bijective, \[ I(M;B\mid E)=H(M), \] and therefore \[ I(M;B)=H(M)-I(M;E\mid B). \] This equation can be read as an accounting identity: the semantic information the symbol could carry (namely \(H(M)\)) is reduced by the amount of information needed to resolve decoder uncertainty from the observation \(B\). In the maximally unknown-decoder regime, that reduction equals \(H(M)\), leaving \(I(M;B)=0\).
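
The identity can be verified numerically by brute force (a sketch for a small toy instance; \(N=4\), a uniform \(M\), and a uniform prior over bijections are assumptions made only for the example):

```python
from itertools import permutations
from math import log2
from collections import defaultdict

N = 4
bijections = list(permutations(range(N)))

# Joint law of (meaning, decoder): M uniform on N meanings, E uniform on N! bijections.
joint = {(m, e): 1.0 / (N * len(bijections))
         for e in range(len(bijections)) for m in range(N)}

def mutual_info(pairs):
    """I(X;Y) in bits from a dict {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pairs.items():
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pairs.items() if p > 0)

def conditional_mi(triples):
    """I(X;Y|Z) in bits from a dict {(x, y, z): prob}."""
    pz = defaultdict(float)
    for (x, y, z), p in triples.items():
        pz[z] += p
    total = 0.0
    for z, w in pz.items():
        conditional = {(x, y): p / w for (x, y, zz), p in triples.items() if zz == z}
        total += w * mutual_info(conditional)
    return total

mbe = {(m, bijections[e][m], e): p for (m, e), p in joint.items()}   # (M, B, E)
meb = {(m, e, bijections[e][m]): p for (m, e), p in joint.items()}   # (M, E, B)
mb = defaultdict(float)
for (m, b, e), p in mbe.items():
    mb[(m, b)] += p

H_M = log2(N)
print("I(M;B)          =", round(mutual_info(mb), 6))            # 0
print("I(M;B|E)        =", round(conditional_mi(mbe), 6))        # H(M) = log2(N)
print("H(M) - I(M;E|B) =", round(H_M - conditional_mi(meb), 6))  # equals I(M;B) = 0
```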

4.5. A Noether-style information “conservation law”

Theorem 1 arose from a symmetry: if all decoders are equally plausible, then relabeling meanings cannot change anything observable. In physics, symmetries often come with conserved quantities. Here is a precise analogue.

Fix the true decoder \(E^\star\). Let \(p_t(e)=\Pr(E=e\mid \mathsf{Tr}_t)\) be the receiver's posterior after \(t\) rounds of interaction, where \(\mathsf{Tr}_t\) denotes the transcript of queries and answers through round \(t\) (formalized in Section 5), and define the decoder potential \[ \Phi_t \;\equiv\; D_{\mathrm{KL}}(\delta_{E^\star}\,\|\,p_t)\;=\;\log\frac{1}{p_t(E^\star)}. \] This quantity is \(0\) exactly when the decoder is fully aligned.

Theorem 1.5 (Conservation of decoder potential).
Assume the receiver updates \(p_t\) by Bayes' rule under the protocol. Then for each round, \[ \mathbb{E}\!\left[\Phi_t-\Phi_{t+1}\,\middle|\,\mathsf{Tr}_t\right] \;=\; I\!\left(E;A_{t+1}\mid \mathsf{Tr}_t\right), \] and consequently, for any horizon \(T\), \[ \mathbb{E}[\Phi_T]\;+\;\sum_{t=0}^{T-1} I\!\left(E;A_{t+1}\mid \mathsf{Tr}_t\right) \;=\;\mathbb{E}[\Phi_0]\;=\;H(E). \]

Proof. By Bayes' rule, \(p_{t+1}(e)=\Pr(E=e\mid \mathsf{Tr}_t,A_{t+1})\). Taking expectation over the true decoder under the current posterior gives \(\mathbb{E}[\Phi_t\mid \mathsf{Tr}_t]=H(E\mid \mathsf{Tr}_t)\); taking expectation over both the true decoder and the answer \(A_{t+1}\) gives \(\mathbb{E}[\Phi_{t+1}\mid \mathsf{Tr}_t]=H(E\mid \mathsf{Tr}_t,A_{t+1})\). Subtracting gives \[ \mathbb{E}[\Phi_t-\Phi_{t+1}\mid \mathsf{Tr}_t] = H(E\mid \mathsf{Tr}_t)-H(E\mid \mathsf{Tr}_t,A_{t+1}) = I(E;A_{t+1}\mid \mathsf{Tr}_t), \] and taking full expectations and summing the telescoping differences over rounds \(t=0,\dots,T-1\) gives the stated identity. \(\square\)
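
A direct numerical check of the per-round identity (a sketch under simplifying assumptions invented for the example: a uniform prior over bijections on \(N=4\) items and deterministic value queries of the form "what is \(E(m)\)?"):

```python
from itertools import permutations
from math import log2
from collections import Counter

N = 4
support = list(permutations(range(N)))   # posterior support: uniform over these bijections

def expected_potential_drop(support, query):
    """E[Phi_t - Phi_{t+1} | Tr_t] when the receiver asks "what is E(query)?",
    with the true decoder drawn uniformly from the current support."""
    drop = 0.0
    for e_star in support:
        a = e_star[query]                                   # answer under this candidate truth
        nxt = [e for e in support if e[query] == a]         # Bayes update keeps consistent decoders
        drop += (log2(len(support)) - log2(len(nxt))) / len(support)
    return drop

def answer_information(support, query):
    """I(E; A_{t+1} | Tr_t) = H(A_{t+1} | Tr_t), since the answer is a
    deterministic function of the decoder."""
    counts = Counter(e[query] for e in support)
    return -sum((c / len(support)) * log2(c / len(support)) for c in counts.values())

for query in range(N):
    d = expected_potential_drop(support, query)
    i = answer_information(support, query)
    print(f"query E({query}): E[potential drop] = {d:.4f}   I(E;A|Tr) = {i:.4f}")  # the two agree
    # advance one round by committing to one particular answer (that of the first candidate)
    support = [e for e in support if e[query] == support[0][query]]
```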

Why this is more than “entropy drops”. The conserved budget \(\Phi_0\) is precisely the amount of symmetry-breaking information needed to pick out one decoder from an orbit of equally valid relabelings. When the prior is uniform on the \(N!\) bijections, \(\Phi_0=\log(N!)\), so a protocol that receives at most \(b\) bits of feedback per round must satisfy \[ T \;\ge\; \frac{\log_2(N!)}{b}, \] i.e. the familiar \(N\log N\) barrier is the cost of breaking the permutation symmetry. In this sense, interaction does not “add meaning”; it performs work that converts decoder-symmetry uncertainty into usable semantic capacity.

5. Interaction as symmetry breaking: learning the decoder

To get semantics, the receiver must learn enough about \(E\) to interpret future symbols. This is precisely where feedback becomes essential: it supplies an error signal that constrains the hypothesis space of decoders.

We model an interactive process of \(T\) rounds: at round \(t\), the receiver issues a query \(Q_t\), and the sender returns an answer \(A_t\) from an alphabet of size at most \(R\). Let the transcript be \(\mathsf{Tr}_T=(Q_1,A_1,\dots,Q_T,A_T)\).

Theorem 2 (Query/feedback lower bound via mutual information).
Let \(E\) be the unknown bijection. For any interactive protocol with answers in an alphabet of size \(\le R\), \[ I(E;\mathsf{Tr}_T)\le T\log R. \] In particular, any protocol that identifies \(E\) exactly (so that \(H(E\mid \mathsf{Tr}_T)=0\)) must satisfy \[ T \ge \frac{H(E)}{\log R}. \] If \(E\) is uniform over all bijections, then \(H(E)=\log(N!)\), hence \[ T \ge \frac{\log(N!)}{\log R}. \]

Proof. By the chain rule, \[ I(E;\mathsf{Tr}_T)=\sum_{t=1}^T I(E;A_t\mid \mathsf{Tr}_{t-1}). \] For each \(t\), \(I(E;A_t\mid \mathsf{Tr}_{t-1})\le H(A_t\mid \mathsf{Tr}_{t-1})\le \log R\), because \(A_t\) takes at most \(R\) values. Summing gives \(I(E;\mathsf{Tr}_T)\le T\log R\). If \(H(E\mid \mathsf{Tr}_T)=0\), then \(I(E;\mathsf{Tr}_T)=H(E)\), yielding \(T\ge H(E)/\log R\). \(\square\)

Using Stirling's approximation, \(\log(N!) = N\log N - \Theta(N)\), so with 1-bit answers (\(R=2\)) we obtain a lower bound of \(\Omega(N\log N)\) rounds to learn a full decoder. With value-type answers (\(R=N\)), the bound becomes \(\Omega(N)\), matching the trivial strategy of querying each meaning once.
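
For concreteness, the bound is easy to tabulate (plain arithmetic on Theorem 2; the values of \(N\) below are arbitrary):

```python
from math import ceil, lgamma, log, log2

def min_rounds(N, R):
    """Lower bound T >= log(N!) / log(R) on the number of feedback rounds needed
    to identify a uniformly random bijection on N items exactly."""
    log2_factorial = lgamma(N + 1) / log(2)      # log2(N!) without overflow
    return ceil(log2_factorial / log2(R))

for N in (10, 100, 1000):
    print(f"N = {N:4d}:  1-bit answers (R=2): T >= {min_rounds(N, 2):5d}   "
          f"value answers (R=N): T >= {min_rounds(N, N):4d}")
# For N = 1000 this gives roughly 8.5e3 rounds with 1-bit answers and roughly
# 8.6e2 rounds with value answers, matching the Omega(N) scaling of the trivial strategy.
```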

Lemma 2.5 (Correlation-adjusted query lower bound).
Let \(i_t \equiv I(E;A_t \mid \mathsf{Tr}_{t-1})\) denote the fresh information about the decoder revealed by the \(t\)-th answer after accounting for all prior answers, and suppose the protocol identifies \(E\) exactly, so that \(H(E\mid \mathsf{Tr}_T)=0\). Then \[ H(E) \;=\; \sum_{t=1}^T i_t, \] and hence \[ T \;\ge\; \frac{H(E)}{\max_{1\le t\le T} i_t}. \] Moreover, a query at time \(t\) is redundant for decoder learning (its answer carries no information about \(E\) beyond the previous transcript \(\mathsf{Tr}_{t-1}\)) if and only if \(i_t = 0\).

Proof. By the chain rule for mutual information, \[ I(E;\mathsf{Tr}_T) = \sum_{t=1}^T I(E;A_t \mid \mathsf{Tr}_{t-1}) = \sum_{t=1}^T i_t. \] If the protocol identifies \(E\) exactly, then \(H(E \mid \mathsf{Tr}_T)=0\), so \(I(E;\mathsf{Tr}_T)=H(E)\), which proves the identity. Since \(\sum_{t=1}^T i_t \le T \max_t i_t\), the lower bound follows. Finally, \(i_t = 0\) is equivalent to \(E \perp A_t \mid \mathsf{Tr}_{t-1}\), i.e. the answer carries no additional information about the decoder beyond what is already known. \(\square\)

Interpretation. Correlation between queries is quantified by how rapidly the conditional gains \(i_t\) diminish: early answers may collapse large regions of decoder space, rendering many later queries redundant. Theorem 2 bounds each \(i_t\) by the raw feedback capacity (\(\le \log R\)); this lemma makes explicit that the fundamental lower bound depends on the amount of new information that survives correlations induced by the previous transcript.
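
The redundancy criterion can be seen directly in a toy protocol (a sketch; the uniform prior, the value-query answer model, and the particular query sequence are assumptions made for the example). Repeating a query yields \(i_t=0\), while the informative queries together account for the full budget \(H(E)=\log_2(N!)\):

```python
from itertools import permutations
from math import log2
from collections import Counter

N = 4
H_E = log2(len(list(permutations(range(N)))))     # H(E) = log2(N!) for a uniform prior

def conditional_gains(queries, true_decoder):
    """Return the gains i_t = I(E; A_t | Tr_{t-1}) for value queries "what is E(q)?",
    assuming a uniform prior over bijections and deterministic answers."""
    support = list(permutations(range(N)))        # posterior support after each Bayes update
    gains = []
    for q in queries:
        # i_t = H(A_t | Tr_{t-1}) because the answer is a deterministic function of E.
        counts = Counter(e[q] for e in support)
        gains.append(-sum((c / len(support)) * log2(c / len(support)) for c in counts.values()))
        support = [e for e in support if e[q] == true_decoder[q]]
    return gains

true_decoder = (2, 0, 3, 1)                       # an arbitrary ground-truth bijection
gains = conditional_gains([0, 1, 1, 2, 3], true_decoder)
print([round(g, 3) for g in gains])               # the repeated query of 1 contributes i_t = 0
print(round(sum(gains), 3), "=", round(H_E, 3))   # the gains sum to H(E): E is identified
```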

5.5. A query-correlation functional

Theorem 2 constrains the raw feedback capacity per round: if answers lie in an alphabet of size \(R\), then no protocol can extract more than \(\log R\) bits from any single answer. Lemma 2.5 refines this by emphasizing the conditional gains \(i_t = I(E;A_t\mid \mathsf{Tr}_{t-1})\), which already account for correlations induced by earlier answers. This motivates a simple scalar that measures how much of the nominal feedback capacity is wasted due to redundancy and correlation between queries.

Definition 3 (Query-correlation / redundancy functional).
For a protocol run for \(T\) rounds with answer alphabet size \(\le R\), define \[ \kappa_T \;\equiv\; 1 - \frac{1}{T\log R}\sum_{t=1}^T I(E;A_t\mid \mathsf{Tr}_{t-1}) \;=\; 1 - \frac{I(E;\mathsf{Tr}_T)}{T\log R}. \] Then \(0\le \kappa_T \le 1\). We call \(1-\kappa_T\) the decoder-learning efficiency.
Interpretation. If \(\kappa_T\approx 0\), then (on average) each answer contributes nearly \(\log R\) fresh bits about the decoder: queries are effectively decorrelated and close to maximally informative. If \(\kappa_T\approx 1\), then most answers are redundant given the prior transcript: the protocol spends many rounds asking questions whose answers are largely implied by what it already knows. For exact identification one has \(I(E;\mathsf{Tr}_T)=H(E)\), so \[ T \;=\; \frac{H(E)}{(1-\kappa_T)\log R}, \] making explicit that the total number of rounds is governed not only by feedback capacity \(\log R\), but by how much of that capacity survives correlations across the query set.
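
Since \(H(E)\) and \(\log R\) are known in the uniform-bijection model, \(\kappa_T\) for an exactly-identifying protocol is a one-line computation (a sketch; the round counts below are invented for illustration):

```python
from math import lgamma, log, log2

def kappa(N, R, T):
    """Redundancy kappa_T of a protocol that exactly identifies a uniformly random
    bijection on N items in T rounds, with answers from an alphabet of size R."""
    H_E = lgamma(N + 1) / log(2)                  # H(E) = log2(N!) bits
    return 1.0 - H_E / (T * log2(R))

# N = 16 with 1-bit answers: the information-theoretic floor is log2(16!) ~ 44.25 rounds.
print(round(kappa(16, 2, 45), 3))    # ~0.02: nearly every answer delivers a fresh bit
print(round(kappa(16, 2, 200), 3))   # ~0.78: most answers are redundant given the transcript
```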

6. From zero semantics to full semantics: the two-stage picture

Theorems 1 and 2 fit together cleanly. Stage 1 (one-way): with the decoder completely unknown, Theorem 1 gives \(I(M;B)=0\) - symbols arrive, but no semantic information does. Stage 2 (interactive): by Theorem 2, only after at least \(\log(N!)/\log R\) rounds of feedback can the decoder be pinned down; once \(E\) is shared, every subsequent symbol carries its full semantic entropy, \(I(M;B\mid E)=H(M)\).

Interpretation. Interaction does not merely transmit additional bits; it transmits bits that are about the decoder. Those are precisely the bits missing from one-way communication.

7. Discussion

This formalism explains a familiar empirical fact: humans do not learn language from a single message. They learn by proposing hypotheses, receiving corrections, and iterating - effectively performing decoder alignment with feedback. The same logic clarifies why an engineered extraterrestrial signal might be detectable as artificial (long-lived regularities), yet remain untranslated in content absent an extended back-and-forth exchange.

At a philosophical level, the note isolates a technical version of a slogan: meaning is co-constructed. Formally, “meaning” is not an intrinsic property of a bitstring \(B\), but a property of the pair \((B,E)\) - and \(E\) must be shared, learned, or inferred.

7.25. Detecting algorithmic structure before decoding

The note's main claim is about semantics: without a shared decoder \(E\), a noiseless one-way bitstring can have \(I(M;B)=0\) even though it is perfectly well-formed as a sequence of symbols. A separate (and often prior) question is engineering detectability: can a receiver decide that a received transmission is unlikely to be natural noise before any attempt at decoding meaning?

This can be posed as a hypothesis test on the observed bitstring \(X\) (or on a local window of it): under the null hypothesis, \(X\) is drawn from a natural-noise ensemble \(P_0\); under the alternative, \(X\) is drawn from an engineered ensemble \(P_1\).

Proposition 4 (Optimal detectability = total variation).
Let \(X\) be the observation under either \(P_0\) or \(P_1\) and assume equal priors. The best possible decision rule achieves success probability \[ p_\star = \tfrac{1}{2} + \tfrac{1}{2}\,D_{\mathrm{TV}}(P_0,P_1), \] where \(D_{\mathrm{TV}}\) is total variation distance.

Thus, in principle, pre-decoding detectability is completely governed by how far the engineered ensemble is from the null. If the designer (deliberately or accidentally) makes \(P_1\) close to \(P_0\) in total variation, then no amount of cleverness can reliably distinguish the signal from natural noise.

The “dropped into a random point” version is identical, but with a local observation model. Let \(I\) be uniform over start indices and let the receiver see a length-\(k\) window \[ Y \;=\; X_{I:I+k-1}. \] This induces two window-distributions \(\mu_0^{(k)}\) and \(\mu_1^{(k)}\) on \(\{0,1\}^k\), and the optimal success probability becomes \[ p_\star(k) = \tfrac{1}{2} + \tfrac{1}{2}\,D_{\mathrm{TV}}\!\left(\mu_0^{(k)},\mu_1^{(k)}\right). \]

Definition 4.5 (Local detectability functional).
Define the scale-\(k\) detectability \[ \Delta_k \;\equiv\; D_{\mathrm{TV}}\!\left(\mu_0^{(k)},\mu_1^{(k)}\right). \] Then the maximal advantage over chance from a single random window is \(\tfrac{1}{2}\Delta_k\).
Global structure can be locally invisible. It is possible for an engineered signal to have extensive global order while matching the null model on all length-\(k\) windows, in which case \(\Delta_k=0\) and no “drop-in” test can succeed. Conversely, if the generator creates conspicuous deviations in short-block statistics (periodicities, spectral peaks, biased \(k\)-grams, long runs, etc.), then \(\Delta_k\) is large and structure is detectable even without decoding meaning.
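
A small worked example (a sketch; the period-2 "beacon", window length, and null model are invented for illustration) makes the short-block case concrete: against i.i.d. uniform noise, a periodic signal concentrates its length-\(k\) windows on two patterns, so \(\Delta_k\) is close to \(1\):

```python
from collections import Counter
from itertools import product

def window_distribution(signal, k):
    """Distribution of length-k windows under a uniformly random start index."""
    windows = [signal[i:i + k] for i in range(len(signal) - k + 1)]
    counts = Counter(windows)
    return {w: c / len(windows) for w, c in counts.items()}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

k = 6

# Null model P0: i.i.d. uniform bits, so every length-k window is equally likely.
mu0 = {"".join(bits): 2 ** -k for bits in product("01", repeat=k)}

# Engineered model P1: a long period-2 beacon, seen through a random length-k window.
beacon = "01" * 1000
mu1 = window_distribution(beacon, k)

delta_k = total_variation(mu0, mu1)
print(f"Delta_{k} = {delta_k:.4f}")                                  # about 1 - 2^(1-k) = 0.96875
print(f"optimal drop-in test succeeds w.p. {0.5 + 0.5 * delta_k:.4f}")
```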

When the alternative \(P_1\) is unknown or open-ended, a natural model-free surrogate for “engineered” is compressibility. In algorithmic-information terms, strings with low Kolmogorov complexity \(K(X)\) are atypical under uniform noise; in particular, by counting short programs, \[ \Pr_{X\sim U_n}\!\left[K(X) < n-s\right] \;<\; 2^{-s}. \] Although \(K\) is not computable, practical compressors and minimum-description-length (MDL) models provide workable approximations: large compression gains serve as a quantitative “engineering-likelihood” score with an exponentially small false-alarm bound under idealized randomness.
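
In practice the surrogate is a few lines (a minimal sketch; zlib stands in for an idealized compressor, and the "beacon" byte pattern is invented for illustration):

```python
import os
import zlib

def compression_gain_bits(data: bytes) -> float:
    """Bits saved by a practical compressor: a crude, computable surrogate for n - K(X)."""
    return 8 * (len(data) - len(zlib.compress(data, level=9)))

noise = os.urandom(4096)                 # null model: uniform random bytes
beacon = bytes([0, 1, 2, 3] * 1024)      # "engineered": a short repeating pattern

for name, x in [("noise", noise), ("beacon", beacon)]:
    gain = compression_gain_bits(x)
    # Under idealized randomness, a gain of s bits has false-alarm probability below ~2^-s;
    # the noise typically shows zero or negative gain, the beacon a gain of tens of kilobits.
    print(f"{name:6s}: compression gain ~ {gain:.0f} bits")
```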

Structure detection is not meaning. A receiver may confidently infer that a transmission is artificial (large \(\Delta_k\) or large compressibility) while still having \(I(M;B)=0\) because the decoder remains unaligned. Conversely, a sophisticated sender can intentionally mask structure (e.g. via one-time pads or pseudorandom generators), making \(P_1\) statistically or computationally indistinguishable from \(P_0\) and driving detectability toward zero.

7.5. Speculative application: language acquisition as decoder alignment

The formalism above is intentionally austere, but it suggests a useful way to think about early language acquisition. Very roughly, a child is attempting to align a large, structured decoder: mappings from sounds to word forms, from word forms to concepts, and then compositional rules that govern how meanings combine. In the present note's language, this is the problem of driving the posterior \(p_t(E)\) from a diffuse prior to a stable, sharply peaked distribution.

A striking feature of ordinary language exposure is that the raw input is enormous, yet the fresh information per interaction can be small. Conversations are repetitive, contexts are ambiguous, and many “queries” are highly correlated. In our notation, this means that the conditional gains \(i_t = I(E;A_t\mid \mathsf{Tr}_{t-1})\) can be far below the nominal per-turn channel capacity. The redundancy functional \(\kappa_T\) is designed to quantify precisely this gap.

Back-of-the-envelope. Treat (only) the core lexicon as a toy decoder-alignment task with \(N\) atomic items. If one models this as choosing a bijection among \(N!\) possibilities, then the decoder uncertainty is \(\Phi_0 \approx \log_2(N!)\) bits. For \(N=10^4\) (a representative order of magnitude for receptive vocabulary), \(\log_2(N!)\) is on the order of \(10^5\) bits. If a child experiences on the order of \(10^3\) conversational “turns” per day, then reaching a stable decoder over a few years is consistent with extracting only a small fraction of a bit of fresh decoder information per turn (say, one clean decoder-bit per ten or so turns), i.e. an efficiency \(1-\kappa_T\) well below \(1\) but not vanishingly small.
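
The arithmetic behind this estimate (a sketch; the vocabulary size, turns per day, and time horizon are the illustrative assumptions stated above):

```python
from math import lgamma, log

N_vocab = 10_000                              # atomic lexical items (order of magnitude)
phi0_bits = lgamma(N_vocab + 1) / log(2)      # decoder uncertainty Phi_0 = log2(N!) in bits

turns_per_day = 1_000
years = 3
total_turns = turns_per_day * 365 * years

print(f"Phi_0         ~ {phi0_bits:.3g} bits")            # ~1.2e5 bits
print(f"total turns   ~ {total_turns:.3g}")               # ~1.1e6 turns
print(f"bits per turn ~ {phi0_bits / total_turns:.3f}")   # ~0.11: one decoder-bit per ~9-10 turns
```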

Two morals follow. First, the time scale for language stabilization is naturally explained by an information budget: the child must accumulate enough symmetry-breaking evidence to select the appropriate decoder from a vast space of alternatives. Second, interaction matters because it raises the effective efficiency \(1-\kappa_T\): questions, corrections, and jointly attended context reduce correlation between successive “queries,” increasing the average conditional gain \(i_t\). In this view, a one-way stream of tokens is not merely less pleasant than conversation - it systematically wastes the available feedback capacity by driving \(\kappa_T\) upward.

This is, of course, only a cartoon; real language is compositional and the hypothesis class is far from \(N!\). Nevertheless, the decoder-alignment lens makes a sharp qualitative prediction: environments that increase interaction-driven disambiguation (lower \(\kappa_T\)) should reduce the number of required turns to reach a stable decoder, even if the total number of heard tokens is held fixed.

References

  1. C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27(3):379–423, 1948.
  2. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2nd ed., 2006.