A long, narrow-band radio transmission arrives from a distant star. It is structured, persistent, and overwhelmingly unlikely to be a natural astrophysical process. We detect it. The biggest question is answered: we are not alone.
A second question then appears, deceptively simple: what does the message say? A standard hope is that mathematics (prime numbers, pictures, universal constants) can serve as a self-decoding bootstrap. This note isolates a basic obstruction. It is not an obstruction about noise, bandwidth, or engineering. It is an obstruction about interpretation.
The central idea is that semantics lives in a mapping - an unpacking function, or decoder. If that decoder is completely unknown, then (in a precise information-theoretic sense) a one-way bitstring can carry zero semantic information, even through a perfect noiseless channel. Interaction - queries and feedback - breaks the symmetry by allowing the parties to align a decoder.
Let \(\mathcal M\) be a finite set of meanings and \(\mathcal B\) a finite set of symbols (bitstrings), with \(|\mathcal M|=|\mathcal B|=N\). A decoder/encoder is a bijection \[ E:\mathcal M \to \mathcal B. \] The sender chooses \(M\in\mathcal M\) (a meaning) and transmits the symbol \[ B = E(M). \] If \(E\) is known to the receiver, then decoding is trivial: \[ H(M\mid B,E)=0, \quad\text{and hence}\quad I(M;B\mid E)=H(M). \] So conditional on the decoder, the symbol carries full semantic entropy.
Theorem 1. If \(E\) is a uniformly random bijection, independent of \(M\), then \(I(M;B)=0\): the transmitted symbol carries no semantic information. Proof. Fix \(m\in\mathcal M\) and \(b\in\mathcal B\). Since \(E\) is a uniformly random bijection independent of \(M\), \[ \Pr(B=b\mid M=m)=\Pr(E(m)=b)=\frac{1}{N}. \] Thus \(\Pr(B=b\mid M=m)\) is constant in \(m\), so \(B\) is independent of \(M\), and \(I(M;B)=0\). \(\square\)
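For concreteness, here is a minimal numerical check of Theorem 1 and of the known-decoder identity \(I(M;B\mid E)=H(M)\); the small alphabet size \(N=4\) and the uniform priors are choices of this sketch, not part of the model.

```python
# Minimal numerical check of Theorem 1 (illustrative sketch; N and the
# uniform priors over meanings and decoders are assumptions of this example).
import itertools
import math

N = 4                                   # |M| = |B| = N
meanings = range(N)
bijections = list(itertools.permutations(meanings))   # all decoders E

# Joint distribution over (M, B) with M uniform, E uniform, B = E(M).
p_MB = [[0.0] * N for _ in range(N)]
for E in bijections:
    for m in meanings:
        p_MB[m][E[m]] += (1 / N) * (1 / len(bijections))

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

p_M = [sum(row) for row in p_MB]
p_B = [sum(p_MB[m][b] for m in meanings) for b in range(N)]

# I(M;B) = H(M) + H(B) - H(M,B)
H_MB = entropy([p_MB[m][b] for m in meanings for b in range(N)])
I_MB = entropy(p_M) + entropy(p_B) - H_MB

# Conditional on any fixed decoder E, B = E(M) determines M, so
# I(M;B|E) = H(M); we print H(M) for comparison.
print(f"I(M;B)   = {I_MB:.6f} bits   (Theorem 1 predicts 0)")
print(f"I(M;B|E) = {entropy(p_M):.6f} bits (= H(M) = log2 N)")
```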
Theorem 1 is a symmetry statement: when all bijections are a priori equally plausible, no particular decoding of the observed symbol is singled out. In this formal model, semantics is not merely hard to recover - it is information-theoretically absent.
The preceding collapse is illuminated by a simple chain-rule identity. Because \(M\) and \(E\) are independent, \[ I(M;B)=I(M;B\mid E)-I(M;E\mid B) \] (expand \(I(M;E,B)\) by the chain rule in both orders and use \(I(M;E)=0\)). Since \(E\) is bijective, \[ I(M;B\mid E)=H(M), \] and therefore \[ I(M;B)=H(M)-I(M;E\mid B). \] This equation can be read as an accounting identity: the semantic information the symbol could carry (namely \(H(M)\)) is reduced by the amount of information needed to resolve decoder uncertainty from the observation \(B\). In the maximally unknown-decoder regime, that reduction equals \(H(M)\), leaving \(I(M;B)=0\).
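The identity holds for any decoder prior, not just the uniform one. A quick way to see it at work is to compute all three terms exactly for a small example with a non-uniform prior over decoders; the specific prior below is an arbitrary illustrative choice.

```python
# Hedged sketch: exact check of the accounting identity
#   I(M;B) = H(M) - I(M;E|B)
# for a non-uniform prior over decoders (the prior below is an
# illustrative assumption, not part of the note's model).
import itertools, math, random

N = 3
bijections = list(itertools.permutations(range(N)))
random.seed(0)
w = [random.random() for _ in bijections]
prior_E = [x / sum(w) for x in w]              # arbitrary decoder prior

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Joint p(m, e, b) with M uniform, E ~ prior_E independent of M, B = E(M).
joint = {}
for e_idx, E in enumerate(bijections):
    for m in range(N):
        joint[(m, e_idx, E[m])] = (1 / N) * prior_E[e_idx]

def marg(keys):
    out = {}
    for (m, e, b), p in joint.items():
        k = tuple({'m': m, 'e': e, 'b': b}[x] for x in keys)
        out[k] = out.get(k, 0.0) + p
    return out

def I(X, Y):
    # Mutual information between the variable groups X and Y (lists of names).
    pxy, px, py = marg(X + Y), marg(X), marg(Y)
    return H(px.values()) + H(py.values()) - H(pxy.values())

I_MB   = I(['m'], ['b'])
H_M    = H(marg(['m']).values())
I_MEgB = I(['m'], ['e', 'b']) - I_MB           # chain rule: I(M;E|B) = I(M;E,B) - I(M;B)
print(f"I(M;B)          = {I_MB:.6f}")
print(f"H(M) - I(M;E|B) = {H_M - I_MEgB:.6f}   (should match)")
```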
Theorem 1 arose from a symmetry: if all decoders are equally plausible, then relabeling meanings cannot change anything observable. In physics, symmetries often come with conserved quantities. Here is a precise analogue.
Fix the true decoder \(E^\star\). Let \(p_t(e)=\Pr(E=e\mid \mathsf{Tr}_t)\) be the receiver's posterior after \(t\) rounds of interaction, where \(\mathsf{Tr}_t\) is the transcript of those rounds (formalized below), and define the decoder potential \[ \Phi_t \;\equiv\; D_{\mathrm{KL}}(\delta_{E^\star}\,\|\,p_t)\;=\;\log\frac{1}{p_t(E^\star)}. \] This quantity is \(0\) exactly when the decoder is fully aligned.
The expected drop in the decoder potential equals the information gained about the decoder: \[ \mathbb{E}[\Phi_t-\Phi_{t+1}\mid \mathsf{Tr}_t] \;=\; I(E;A_{t+1}\mid \mathsf{Tr}_t), \qquad\text{and, summing,}\qquad \mathbb{E}[\Phi_0]-\mathbb{E}[\Phi_T] \;=\; I(E;\mathsf{Tr}_T). \] Proof. By Bayes' rule, \(p_{t+1}(e)=\Pr(E=e\mid \mathsf{Tr}_t,A_{t+1})\). Taking expectations over the true decoder \(E\) (and, for the second term, over the answer \(A_{t+1}\)) under the current posterior yields \(\mathbb{E}[\Phi_t\mid \mathsf{Tr}_t]=H(E\mid \mathsf{Tr}_t)\) and \(\mathbb{E}[\Phi_{t+1}\mid \mathsf{Tr}_t]=H(E\mid \mathsf{Tr}_t,A_{t+1})\). Subtracting gives \[ \mathbb{E}[\Phi_t-\Phi_{t+1}\mid \mathsf{Tr}_t] = H(E\mid \mathsf{Tr}_t)-H(E\mid \mathsf{Tr}_t,A_{t+1}) = I(E;A_{t+1}\mid \mathsf{Tr}_t), \] and summing over rounds gives the stated identity. \(\square\)
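For intuition, the toy simulation below tracks \(\Phi_t\) directly. The protocol (value-type answers to fresh queries, uniform priors) is an assumption of this sketch, not of the general model; in this special case the posterior stays uniform over the bijections consistent with past answers, so each per-round drop can be computed in closed form and coincides with the conditional information gain.

```python
# Hedged sketch: track Phi_t = log2(1 / p_t(E*)) in a toy protocol where each
# answer reveals E*(q) for a fresh query q.
import math, random

N = 6
random.seed(1)
perm = list(range(N)); random.shuffle(perm)
E_star = dict(enumerate(perm))          # the true decoder, hidden from the receiver

unresolved = set(range(N))              # meanings whose image is still unknown
phi = math.log2(math.factorial(N))      # Phi_0: posterior uniform over all N! bijections
print(f"Phi_0 = {phi:.4f} bits  (= log2 N!)")

t = 0
while unresolved:
    t += 1
    q = min(unresolved)                 # query a not-yet-resolved meaning
    a = E_star[q]                       # sender's answer: the true symbol E*(q)
    unresolved.discard(q)
    # Posterior is uniform over bijections consistent with the answers so far,
    # so Phi_t = log2(#consistent) = log2(|unresolved|!).
    new_phi = math.log2(math.factorial(len(unresolved)))
    gain = phi - new_phi                # = I(E; A_t | Tr_{t-1}) = log2(#possible answers) here
    print(f"round {t}: query {q} -> answer {a}, Phi drops by {gain:.4f} bits")
    phi = new_phi

print(f"total drop = {math.log2(math.factorial(N)):.4f} bits = H(E); Phi_T = {phi:.1f}")
```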
To get semantics, the receiver must learn enough about \(E\) to interpret future symbols. This is precisely where feedback becomes essential: it supplies an error signal that constrains the hypothesis space of decoders.
We model an interactive process of \(T\) rounds: at round \(t\), the receiver issues a query \(Q_t\), and the sender returns an answer \(A_t\) from an alphabet of size at most \(R\). Let the transcript be \(\mathsf{Tr}_T=(Q_1,A_1,\dots,Q_T,A_T)\).
Theorem 2. Suppose each answer takes at most \(R\) values. Any protocol that identifies the decoder exactly, i.e. achieves \(H(E\mid \mathsf{Tr}_T)=0\), requires \(T \ge H(E)/\log R\). Proof. By the chain rule, \[ I(E;\mathsf{Tr}_T)=\sum_{t=1}^T I(E;A_t\mid \mathsf{Tr}_{t-1}). \] For each \(t\), \(I(E;A_t\mid \mathsf{Tr}_{t-1})\le H(A_t\mid \mathsf{Tr}_{t-1})\le \log R\), because \(A_t\) takes at most \(R\) values. Summing gives \(I(E;\mathsf{Tr}_T)\le T\log R\). If \(H(E\mid \mathsf{Tr}_T)=0\), then \(I(E;\mathsf{Tr}_T)=H(E)\), yielding \(T\ge H(E)/\log R\). \(\square\)
For a uniformly unknown bijection, \(H(E)=\log(N!)\); by Stirling's approximation, \(\log(N!) = N\log N - \Theta(N)\), so with 1-bit answers (\(R=2\)) we obtain a lower bound of \(\Omega(N\log N)\) rounds to learn a full decoder. With value-type answers (\(R=N\)), the bound becomes \(\Omega(N)\), matching the trivial strategy of querying each meaning once.
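A short script makes these round counts concrete; the particular values of \(N\) and \(R\) below are arbitrary choices for the illustration.

```python
# Illustrative round-count lower bounds from Theorem 2: T >= H(E) / log2 R,
# with H(E) = log2(N!) for a uniformly unknown bijection.
import math

for N in (10, 100, 1000):
    H_E = math.lgamma(N + 1) / math.log(2)            # log2(N!) without overflow
    stirling = N * math.log2(N) - N * math.log2(math.e)
    for R in (2, N):                                  # 1-bit answers vs value-type answers
        print(f"N={N:5d} R={R:5d}: T >= {H_E / math.log2(R):10.1f} rounds")
    print(f"          Stirling estimate of log2(N!): {stirling:10.1f} (exact {H_E:.1f})")
```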
Lemma 2.5. Let \(i_t \equiv I(E;A_t\mid \mathsf{Tr}_{t-1})\) denote the conditional information gain of round \(t\). If the protocol identifies \(E\) exactly, then \[ \sum_{t=1}^T i_t \;=\; H(E), \qquad\text{hence}\qquad T \;\ge\; \frac{H(E)}{\max_t i_t}, \] and \(i_t=0\) if and only if \(A_t\) is conditionally independent of \(E\) given \(\mathsf{Tr}_{t-1}\). Proof. By the chain rule for mutual information, \[ I(E;\mathsf{Tr}_T) = \sum_{t=1}^T I(E;A_t \mid \mathsf{Tr}_{t-1}) = \sum_{t=1}^T i_t. \] If the protocol identifies \(E\) exactly, then \(H(E \mid \mathsf{Tr}_T)=0\), so \(I(E;\mathsf{Tr}_T)=H(E)\), which proves the identity. Since \(\sum_{t=1}^T i_t \le T \max_t i_t\), the lower bound follows. Finally, \(i_t = 0\) is equivalent to \(E \perp A_t \mid \mathsf{Tr}_{t-1}\), i.e. the answer carries no additional information about the decoder beyond what is already known. \(\square\)
Theorem 2 constrains the raw feedback capacity per round: if answers lie in an alphabet of size \(R\), then no protocol can extract more than \(\log R\) bits from any single answer. Lemma 2.5 refines this by emphasizing the conditional gains \(i_t = I(E;A_t\mid \mathsf{Tr}_{t-1})\), which already account for correlations induced by earlier answers. This motivates a simple scalar that measures how much of the nominal feedback capacity is wasted due to redundancy and correlation between queries.
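The note does not pin down this scalar at this point, so the sketch below adopts one natural candidate, \(\kappa_T = 1 - \sum_t i_t / (T\log R)\), the fraction of nominal feedback capacity not converted into decoder information; this definition, and the toy gain sequences used with it, are assumptions of the example.

```python
# Hedged sketch of a redundancy scalar kappa_T. Assumed definition:
#   kappa_T = 1 - (sum_t i_t) / (T * log2 R).
import math

def kappa(gains, R):
    """gains: realized per-round information gains i_t (bits); R: answer-alphabet size."""
    T = len(gains)
    return 1.0 - sum(gains) / (T * math.log2(R))

# Example with value-type answers (R = N = 8): a fresh query earns log2(k) bits
# when k meanings remain unresolved; repeating an old query earns 0 bits.
N = 8
fresh    = [math.log2(k) for k in range(N, 0, -1)]    # 8 non-redundant queries
repeated = fresh[:4] + [0.0] * 4                      # last 4 queries repeat old ones
print(f"kappa_T (fresh queries)    = {kappa(fresh, N):.3f}")
print(f"kappa_T (repeated queries) = {kappa(repeated, N):.3f}")
```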
Theorems 1 and 2 fit together cleanly: a one-way transmission under an unknown decoder carries no semantic information (Theorem 1), and while interaction can close the gap by aligning a decoder, doing so costs at least \(H(E)/\log R\) rounds of feedback (Theorem 2).
This formalism explains a familiar empirical fact: humans do not learn language from a single message. They learn by proposing hypotheses, receiving corrections, and iterating - effectively performing decoder alignment with feedback. The same logic clarifies why an engineered extraterrestrial signal might be detectable as artificial (long-lived regularities), yet remain untranslated in content absent an extended back-and-forth exchange.
At a philosophical level, the note isolates a technical version of a slogan: meaning is co-constructed. Formally, “meaning” is not an intrinsic property of a bitstring \(B\), but a property of the pair \((B,E)\) - and \(E\) must be shared, learned, or inferred.
The note's main claim is about semantics: without a shared decoder \(E\), a noiseless one-way bitstring can have \(I(M;B)=0\) even though it is perfectly well-formed as a sequence of symbols. A separate (and often prior) question is engineering detectability: can a receiver decide that a received transmission is unlikely to be natural noise before any attempt at decoding meaning?
This can be posed as a hypothesis test on the observed bitstring \(X\) (or on a local window of it): under the null \(H_0\), \(X\) is drawn from a natural-noise model \(P_0\) (say, i.i.d. fair coin flips); under the alternative \(H_1\), \(X\) is drawn from an engineered ensemble \(P_1\). With equal priors, the optimal (Bayes) test succeeds with probability \[ p_\star \;=\; \tfrac{1}{2} + \tfrac{1}{2}\,D_{\mathrm{TV}}\!\left(P_0,P_1\right). \]
Thus, in principle, pre-decoding detectability is completely governed by how far the engineered ensemble is from the null. If the designer (deliberately or accidentally) makes \(P_1\) close to \(P_0\) in total variation, then no amount of cleverness can reliably distinguish the signal from natural noise.
The “dropped into a random point” version is identical, but with a local observation model. Let \(I\) be uniform over start indices and let the receiver see a length-\(k\) window \[ Y \;=\; X_{I:I+k-1}. \] This induces two window-distributions \(\mu_0^{(k)}\) and \(\mu_1^{(k)}\) on \(\{0,1\}^k\), and the optimal success probability becomes \[ p_\star(k) = \tfrac{1}{2} + \tfrac{1}{2}\,D_{\mathrm{TV}}\!\left(\mu_0^{(k)},\mu_1^{(k)}\right). \]
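As an illustration, the window distributions, and hence \(p_\star(k)\), can be computed exactly for a toy engineered ensemble; the periodic pattern, flip rate, and window sizes below are assumptions of the example, not claims about real signals.

```python
# Hedged sketch: exact p_*(k) for a toy engineered ensemble (a period-3 pattern
# with i.i.d. bit flips and a uniformly random phase) against i.i.d. fair-coin noise.
from itertools import product

pattern, eps = (1, 1, 0), 0.1            # base pattern and per-bit flip probability

def mu1(window):
    """Probability of a k-bit window under the engineered ensemble."""
    total = 0.0
    for phase in range(len(pattern)):
        p = 1.0
        for j, bit in enumerate(window):
            expected = pattern[(phase + j) % len(pattern)]
            p *= (1 - eps) if bit == expected else eps
        total += p / len(pattern)         # uniform over phases
    return total

for k in (1, 2, 4, 8, 12):
    # Null: uniform windows, mu0(w) = 2^{-k}.  D_TV = (1/2) sum |mu0 - mu1|.
    tv = 0.5 * sum(abs(2 ** -k - mu1(w)) for w in product((0, 1), repeat=k))
    print(f"k={k:2d}: D_TV = {tv:.4f}, optimal success p_*(k) = {0.5 + 0.5 * tv:.4f}")
```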
When the alternative \(P_1\) is unknown or open-ended, a natural model-free surrogate for “engineered” is compressibility. In algorithmic-information terms, strings with low Kolmogorov complexity \(K(X)\) are atypical under uniform noise; by a counting argument, \[ \Pr_{X\sim U_n}\!\left[K(X) < n-s\right] \;<\; 2^{-s}. \] Although \(K\) is not computable, practical compressors and minimum-description-length (MDL) models provide workable approximations: large compression gains serve as a quantitative “engineering-likelihood” score with an exponentially small false-alarm bound under idealized randomness.
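A minimal sketch of such a score, using zlib as a stand-in for an MDL model; the threshold \(s\) and the test strings are illustrative, and the false-alarm guarantee is only heuristic because zlib is not the Kolmogorov complexity.

```python
# Hedged sketch: a compressibility score as an "engineering-likelihood" proxy.
# Under ideal randomness a gain of >= s bits is roughly 2^{-s}-unlikely; zlib
# only approximates an ideal description length.
import os, zlib

def compression_gain_bits(data: bytes) -> float:
    """Bits saved by zlib (level 9) relative to the raw length (negative if incompressible)."""
    return 8 * (len(data) - len(zlib.compress(data, 9)))

noise      = os.urandom(4096)            # stand-in for natural/thermal noise
engineered = b"\x55\xaa" * 2048          # stand-in for a highly regular signal

s = 64                                   # demand at least 64 bits of savings
for name, x in [("noise", noise), ("engineered", engineered)]:
    gain = compression_gain_bits(x)
    print(f"{name:10s}: gain = {gain:7.0f} bits -> flag as engineered: {gain >= s}")
```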
The formalism above is intentionally austere, but it suggests a useful way to think about early language acquisition. Very roughly, a child is attempting to align a large, structured decoder: mappings from sounds to word forms, from word forms to concepts, and then compositional rules that govern how meanings combine. In the present note's language, this is the problem of driving the posterior \(p_t(E)\) from a diffuse prior to a stable, sharply peaked distribution.
A striking feature of ordinary language exposure is that the raw input is enormous, yet the fresh information per interaction can be small. Conversations are repetitive, contexts are ambiguous, and many “queries” are highly correlated. In our notation, this means that the conditional gains \(i_t = I(E;A_t\mid \mathsf{Tr}_{t-1})\) can be far below the nominal per-turn channel capacity. The redundancy functional \(\kappa_T\) is designed to quantify precisely this gap.
Two morals follow. First, the time scale for language stabilization is naturally explained by an information budget: the child must accumulate enough symmetry-breaking evidence to select the appropriate decoder from a vast space of alternatives. Second, interaction matters because it raises the effective efficiency \(1-\kappa_T\): questions, corrections, and jointly attended context reduce correlation between successive “queries,” increasing the average conditional gain \(i_t\). In this view, a one-way stream of tokens is not merely less pleasant than conversation - it systematically wastes the available feedback capacity by driving \(\kappa_T\) upward.
This is, of course, only a cartoon; real language is compositional and the hypothesis class is far from \(N!\). Nevertheless, the decoder-alignment lens makes a sharp qualitative prediction: environments that increase interaction-driven disambiguation (lower \(\kappa_T\)) should reduce the number of required turns to reach a stable decoder, even if the total number of heard tokens is held fixed.
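A toy simulation of this prediction, using the same value-answer model and the same assumed \(\kappa_T\) as in the earlier sketch; the Zipf-style repetition and the specific parameters are assumptions of the example, not a model of real language input.

```python
# Hedged sketch: with the number of turns T fixed, a lower-redundancy query
# stream resolves more of the decoder (lower kappa_T, lower residual entropy).
import math, random

def run(N, T, interactive, seed=0):
    rng = random.Random(seed)
    resolved, gains = set(), []
    weights = [1 / (i + 1) for i in range(N)]             # Zipf-ish repetition
    for _ in range(T):
        if interactive:
            pool = [m for m in range(N) if m not in resolved] or list(range(N))
            q = rng.choice(pool)                          # actively seek novelty
        else:
            q = rng.choices(range(N), weights=weights)[0] # passive, repetitive stream
        k = N - len(resolved)                             # unresolved before this turn
        gains.append(math.log2(k) if q not in resolved else 0.0)
        resolved.add(q)                                   # answer reveals E*(q)
    kappa = 1 - sum(gains) / (T * math.log2(N))           # assumed redundancy scalar
    residual = math.log2(math.factorial(N - len(resolved)))
    return kappa, residual

N, T = 32, 64
for label, inter in [("one-way / repetitive", False), ("interactive", True)]:
    kappa, residual = run(N, T, inter)
    print(f"{label:22s}: kappa_T = {kappa:.2f}, residual H(E|Tr_T) = {residual:6.1f} bits")
```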