A useful heuristic for “creativity” in mathematics and algorithms is: a low-probability, structure-preserving transformation of representation space that increases explanatory compression. This note makes that slogan concrete in one of the cleanest arenas available: comparison-based sorting. First, we present a decision-tree theorem that isolates the feasibility boundary at depth $\log_2(n!)$ and proves an explicit “rarity” bound under an uninformed prior. A padding corollary then shows a combinatorial explosion of correct-but-deeper trees, formalizing why slower procedures occupy vastly more “volume”. Second, we lift the same story to the permutohedron and its adjacency graph, where efficient sorting corresponds to geodesic-like behavior and near-geodesic paths form an exponentially thin subset of the path space. Finally, we connect this discrete viewpoint to the continuous geometric contraction framework developed in Landers (2025), which recasts sorting as gradient flow on the permutohedron.
We will use “creativity” as a precise triad: (i) structure preservation, (ii) compression gain, and (iii) low prior probability. In this note the “structure” is correctness: we must map each input order-type to its correct sorted permutation. Compression is measured by the number of comparisons (depth) or, geometrically, by path length or contraction time. Low probability is measured under an explicitly specified prior over representations.
Comment. “There are fewer $n\log n$ algorithms than $n^2$ algorithms” is not a well-typed statement until one specifies a representation class and a probability measure over it. The results below do exactly that.
Assume $n$ distinct keys. Each input instance has a unique order-type permutation $\sigma \in S_n$. A deterministic comparison-based sorting algorithm can be modeled by a rooted binary tree: internal nodes are labeled by a pair $(i,j)$ with $1 \le i < j \le n$ (compare $a_i$ vs. $a_j$), and each leaf is labeled by an output permutation $\pi \in S_n$. The algorithm’s output on order-type $\sigma$ is the leaf label reached by following comparison outcomes.
$$ \text{(comparison sort)} \quad \equiv \quad \text{binary decision tree whose leaves are labeled by } S_n. \tag{1} $$
(a) A binary tree of depth $d$ has at most $2^d$ leaves. If two distinct order-types $\sigma\neq\tau$ reached the same leaf, the algorithm would output the same permutation on both, hence would be wrong on at least one of them (a leaf cannot be simultaneously labeled by two different correct answers). Therefore a correct sorting tree must have at least one distinct leaf for each $\sigma\in S_n$, i.e. at least $n!$ leaves. Thus $2^d \ge n!$; since $d$ is an integer, $d \ge \lceil\log_2(n!)\rceil$, and no correct tree of depth $d<\lceil \log_2(n!)\rceil$ exists.
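As a concrete sanity check of the model and the bound, here is a minimal Python sketch (the tuple encoding and the names `run`, `order_type`, and `T3` are illustrative choices, not from the source): a correct depth-3 tree for $n=3$, which meets $\lceil\log_2 3!\rceil=3$ exactly.

```python
from itertools import permutations
from math import ceil, log2, factorial

# A tree is ('leaf', pi) or ('cmp', i, j, if_less, if_greater): compare a[i] < a[j]
# and descend left on "less", right on "greater" (0-indexed positions).
def run(tree, a):
    while tree[0] == 'cmp':
        _, i, j, lo, hi = tree
        tree = lo if a[i] < a[j] else hi
    return tree[1]

def order_type(a):
    # the permutation sigma listing positions in increasing order of key
    return tuple(sorted(range(len(a)), key=lambda k: a[k]))

# A correct depth-3 decision tree for n = 3.
T3 = ('cmp', 0, 1,
      ('cmp', 1, 2,
       ('leaf', (0, 1, 2)),
       ('cmp', 0, 2, ('leaf', (0, 2, 1)), ('leaf', (2, 0, 1)))),
      ('cmp', 0, 2,
       ('leaf', (1, 0, 2)),
       ('cmp', 1, 2, ('leaf', (1, 2, 0)), ('leaf', (2, 1, 0)))))

# Every order type reaches a leaf labeled with itself, and the depth is tight.
assert all(run(T3, a) == order_type(a) for a in permutations((10, 20, 30)))
assert ceil(log2(factorial(3))) == 3
```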
(b) Fix the random labeled tree under the stated prior. For each $\sigma\in S_n$, let $\ell(\sigma)$ denote the leaf reached on input order-type $\sigma$ (this is determined by the internal comparison labels and the induced outcomes). Let $y(\ell)\in S_n$ be the random output label at leaf $\ell$. Define events $$ A_\sigma := \{\, y(\ell(\sigma)) = \sigma \,\}. \tag{3} $$ Correctness implies $\mathsf{Sort}_n \subseteq \bigcap_{\sigma\in S_n} A_\sigma$, hence $\Pr(\mathsf{Sort}_n)\le \Pr(\bigcap_{\sigma}A_\sigma)$.
Condition on all internal-node labels; then the routing $\ell(\sigma)$ is fixed, and the leaf labels remain independent uniform draws from $S_n$. In particular, for any fixed $\sigma$, $$ \Pr(A_\sigma \mid \text{internal labels}) = \Pr\big(y(\ell(\sigma))=\sigma\mid \text{internal labels}\big)=\frac{1}{n!}. \tag{4} $$ Moreover, if $\ell(\sigma)=\ell(\tau)$ for some $\sigma\neq\tau$, then $A_\sigma\cap A_\tau=\varnothing$ (impossibility), which only decreases the probability of the intersection; thus an upper bound can safely ignore collisions.
Enumerate $S_n=\{\sigma_1,\dots,\sigma_{n!}\}$. By the chain rule,
$$
\begin{aligned}
\Pr\!\left(\bigcap_{k=1}^{n!} A_{\sigma_k}
\,\middle|\, \text{internal labels}\right)
&=
\prod_{k=1}^{n!}
\Pr\!\left(
A_{\sigma_k}
\,\middle|\,
\text{internal labels},
A_{\sigma_1},\dots,A_{\sigma_{k-1}}
\right).
\end{aligned}
\tag{5}
$$
We claim each conditional factor is $\le 1/n!$. There are two cases for the leaf $\ell(\sigma_k)$:
(i) Fresh leaf. If no earlier event has constrained $y(\ell(\sigma_k))$, then (by independence and uniformity)
$$
\begin{aligned}
\Pr\!\left(
A_{\sigma_k}
\,\middle|\,
\text{internal labels}, A_{\sigma_1},\dots,A_{\sigma_{k-1}}
\right)
&= \Pr\big(y(\ell(\sigma_k))=\sigma_k\big)
\\&= \frac{1}{n!}.
\end{aligned}
$$
(ii) Previously exposed leaf. If an earlier event has already fixed $y(\ell(\sigma_k))$, it fixed it to some $\sigma_j$ with $j<k$. Since $\sigma_j\neq\sigma_k$, the event $A_{\sigma_k}$ would require $y(\ell(\sigma_k))=\sigma_k\neq\sigma_j$, so its conditional probability is $0 \le 1/n!$.
Therefore every factor in (5) is bounded by $1/n!$, so
$$
\begin{aligned}
\Pr\!\left(\bigcap_{\sigma\in S_n} A_\sigma
\,\middle|\, \text{internal labels}\right)
&\le
\left(\frac{1}{n!}\right)^{n!}.
\end{aligned}
\tag{6}
$$
Taking expectation over internal labels yields the same bound unconditionally, proving (2).
$\square$
Comment. Part (b) is intentionally “uninformed”: it treats leaf labels as random. Its purpose is not to model how humans design algorithms, but to formalize the slogan “highly compressed correctness is a thin set”: an exponential number of constraints must simultaneously hold at the boundary $2^d\approx n!$.
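The bound in (6) can be checked by brute force at $n=2$, where $(1/n!)^{n!}=1/4$. The sketch below (variable names are ours) enumerates all four uniform leaf labelings of the unique boundary-depth tree:

```python
from itertools import product, permutations

n = 2
perms = list(permutations(range(n)))  # S_2 = {(0, 1), (1, 0)}

# At the boundary 2^d = n! the tree is one comparison node (a0 < a1?) with two
# leaves; enumerate every independent uniform labeling of the two leaves.
correct = sum(
    (left == (0, 1)) and (right == (1, 0))  # order type (0,1) routes left, (1,0) right
    for left, right in product(perms, repeat=2)
)
p_sort = correct / len(perms) ** 2
assert p_sort == (1 / 2) ** 2  # exactly (1/n!)^{n!} for n = 2
```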
Fix a correct tree $T$ of depth $\le d_\star$. By the feasibility argument above, $T$ has at least $n!$ leaves (one per order-type). Construct a new tree $T'$ by padding each leaf $\ell$ of $T$ with a chain of $k$ additional comparison nodes. Each new internal node chooses an arbitrary comparison label from the $\binom{n}{2}$ pairs. Label every new terminal leaf produced from $\ell$ with the same permutation label that $\ell$ had in $T$.
Depth increases by exactly $k$ along every root-to-leaf path, so $\mathrm{depth}(T')\le d_\star+k$. For correctness: for any input order-type $\sigma$, $T$ routes $\sigma$ to a leaf $\ell$ whose label equals $\sigma$. In $T'$, the input continues down the padded chain, but regardless of the outcomes it ends at a descendant leaf with the same label. Hence $T'$ outputs exactly what $T$ outputs on every input; thus $T'$ is correct.
Counting: each leaf chain has $k$ internal nodes, each with $\binom{n}{2}$ independent labeling choices, giving $(\binom{n}{2})^k$ distinct paddings per original leaf. Since $T$ has at least $n!$ leaves, the number of distinct padded trees generated from $T$ is at least $(\binom{n}{2})^{k\,n!}$. This construction is injective in the padding choices, so $N_{d_\star+k} \ge (\binom{n}{2})^{k\,n!} N_{d_\star}$. $\square$
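A quick numerical illustration of the growth rate in the counting step (the function name is ours): even one extra level of depth multiplies the lower bound on the number of correct trees by a huge factor.

```python
from math import comb, factorial

def padding_multiplier(n, k):
    # lower bound on N_{d*+k} / N_{d*}: each of the >= n! leaves receives a
    # chain of k comparison nodes, each labeled by one of the C(n,2) pairs
    return comb(n, 2) ** (k * factorial(n))

assert padding_multiplier(3, 1) == 3 ** 6    # 729
assert padding_multiplier(4, 1) == 6 ** 24   # ~ 4.7e18
```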
Theorem 1(a) identifies the compression boundary: to isolate one of $n!$ possibilities by binary tests requires depth at least $\log_2(n!)=\Theta(n\log n)$. Corollary 2 shows that once we step away from that boundary, the number of correct representations explodes combinatorially: correctness becomes compatible with many more degrees of freedom (redundant comparisons). This makes the “thinness” of near-optimal solutions precise.
The permutohedron is the convex polytope $$ P_n := \mathrm{conv}\{\pi(1,2,\dots,n): \pi\in S_n\}. \tag{8} $$ Its vertices are permutations; its 1-skeleton is the Cayley graph generated by adjacent transpositions $s_i=(i,i+1)$ for $i=1,\dots,n-1$. A path that swaps adjacent elements corresponds to motion along edges of $P_n$. This viewpoint (including comparisons as linear constraints slicing $P_n$) is developed in detail in Sorting as Gradient Flow on the Permutohedron.
Each adjacent swap changes the inversion number by exactly $\pm 1$ (it swaps a neighboring pair, creating or removing exactly one inversion). The identity permutation has inversion number $0$. Therefore any path from $\sigma$ to identity must reduce $\mathrm{inv}(\sigma)$ to $0$, requiring at least $\mathrm{inv}(\sigma)$ steps. Existence of a length-$\mathrm{inv}(\sigma)$ path follows by repeatedly swapping any adjacent inverted pair (e.g. bubble-sort descent), which removes one inversion per swap until none remain. Such a shortest path corresponds to a reduced word for $\sigma$ in the Coxeter generators. $\square$
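Proposition 3 can be verified exhaustively for small $n$ by a breadth-first search over the Cayley graph (a sketch; the helper names are ours):

```python
from itertools import permutations
from collections import deque

def neighbors(p):
    # apply each adjacent transposition s_i (swap positions i, i+1)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        yield tuple(q)

def inv(p):
    # inversion number of p
    return sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p)))

n = 4
ident = tuple(range(n))
dist = {ident: 0}                 # BFS distances from the identity vertex
queue = deque([ident])
while queue:
    p = queue.popleft()
    for q in neighbors(p):
        if q not in dist:
            dist[q] = dist[p] + 1
            queue.append(q)

# graph distance to the identity equals the inversion number, for all of S_4
assert all(dist[p] == inv(p) for p in permutations(range(n)))
```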
Let $I=\mathrm{inv}(\sigma)$, so by Proposition 3 the geodesic length from $\sigma$ to the identity is $I$. Fix any geodesic path $\gamma \in \Gamma_I(\sigma)$ (equivalently, a reduced word for $\sigma$ in the adjacent generators).
We produce many longer, still-successful (endpoint-correct) paths by inserting backtracking pairs. For any generator $s_i$ we have $s_i^2=e$, so inserting the two-step subsequence $(s_i,s_i)$ anywhere in a word does not change its endpoint.
Write the slack as $L-I = 2m + r$ where $m=\lfloor (L-I)/2\rfloor$ and $r\in\{0,1\}$. Insert $m$ backtracking pairs into $\gamma$ at fixed locations (say, appended at the end), choosing each generator index $i$ independently from $\{1,\dots,n-1\}$. This yields a path of length $I+2m \le L$ that still ends at the identity. Distinct generator choices produce distinct words, hence distinct paths, so the number of such paths is at least $(n-1)^m$.
Therefore $$ \begin{aligned} |\Gamma_{I+2m}(\sigma)| \;&\ge\; (n-1)^m\,|\Gamma_I(\sigma)|. \end{aligned} $$ Since $|\Gamma_L(\sigma)| \ge |\Gamma_{I+2m}(\sigma)|$ for any $L\ge I+2m$, we obtain the stated exponential growth in the slack. The displayed inequality is exactly (9) up to the inessential replacement of exponent $(L-I)$ by $\lfloor (L-I)/2\rfloor$, and yields (10) with the same replacement; the qualitative conclusion (exponential sparsity of geodesics among longer successful paths) is unchanged. $\square$
Comment. The takeaway is robust: once you allow extra steps beyond the geodesic length $I$, the number of successful (endpoint-correct) paths grows exponentially in the slack. Thus geodesic-like behavior is “rare” in path space.
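For $n=3$ and the reversal $\sigma=(3,2,1)$, with inversion number $I=3$, the counts can be computed directly by dynamic programming (a sketch; the naming is ours): there are $2$ geodesics of length $3$, but already $10$ endpoint-correct walks at length $5$, consistent with the $(n-1)^m$ lower bound for slack $2m=2$.

```python
from collections import Counter

def neighbors(p):
    # apply each adjacent transposition (swap positions i, i+1)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        yield tuple(q)

def count_walks(start, end, length):
    # dynamic programming: number of length-`length` walks from start to end
    counts = Counter({start: 1})
    for _ in range(length):
        nxt = Counter()
        for p, c in counts.items():
            for q in neighbors(p):
                nxt[q] += c
        counts = nxt
    return counts[end]

sigma, ident = (2, 1, 0), (0, 1, 2)   # reversal in S_3 (0-indexed), inv = 3
g3 = count_walks(sigma, ident, 3)     # geodesics
g5 = count_walks(sigma, ident, 5)     # slack L - I = 2, i.e. m = 1
assert (g3, g5) == (2, 10)
assert g5 >= (3 - 1) ** 1 * g3        # the (n-1)^m lower bound from the proof
```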
The permutohedron viewpoint becomes even sharper when lifted to continuous dynamics. In Landers (2025), sorting is formulated as gradient flow toward the sorted vertex $v_s=(1,2,\dots,n)$ under the potential $$ V(x)=\tfrac12\|x-v_s\|^2, \tag{11} $$ yielding exponential contraction $$ \begin{aligned} \|x(t)-v_s\|^2 \;&=\; \|x(0)-v_s\|^2\, e^{-2t}. \end{aligned} \tag{12} $$ In the worst case, $\|x(0)-v_s\|^2=\Theta(n^3)$, so reaching an $O(1)$ neighborhood takes time $t=\Theta(\log n)$. Interpreting one discrete comparison as affecting at most an $O(1/n)$ fraction of the global contraction leads to $T=\Theta(n\log n)$ discrete operations (comparisons), matching the classical lower bound while explaining it as a geometric time scale of contraction.
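The contraction law (12) is easy to confirm numerically; the sketch below (step size and names our own) integrates $\dot x = -(x-v_s)$ by forward Euler from the fully reversed start and checks the $e^{-2t}$ decay of the squared distance:

```python
import math

n = 8
v_s = [float(i + 1) for i in range(n)]   # sorted vertex (1, 2, ..., n)
x = [float(n - i) for i in range(n)]     # worst case: fully reversed input

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

d0 = sq_dist(x, v_s)
dt, T = 1e-4, 1.0
for _ in range(int(T / dt)):             # forward Euler for dx/dt = -(x - v_s)
    x = [xi - dt * (xi - vi) for xi, vi in zip(x, v_s)]

# check ||x(T) - v_s||^2 ≈ ||x(0) - v_s||^2 e^{-2T} up to discretization error
assert abs(sq_dist(x, v_s) - d0 * math.exp(-2 * T)) < 1e-2 * d0
```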
We can now state the slogan precisely in this domain:
$$ \begin{aligned} \textbf{Creativity in sorting} \;&=\; \underbrace{\text{correctness}}_{\text{structure}} \\ &\quad+\; \underbrace{\text{near-minimal comparisons / geodesic motion}}_{\text{compression}} \\ &\quad+\; \underbrace{\text{thin measure near the feasibility boundary}}_{\text{low probability}}. \end{aligned} \tag{13} $$
Theorem 1(a) fixes the boundary: $\log_2(n!)=\Theta(n\log n)$ comparisons are information-theoretically necessary. Corollary 2 proves that the set of correct solutions explodes once extra depth is allowed, so near-boundary solutions inhabit a thin shell. The permutohedron extension makes the same phenomenon geometric: geodesic-like paths are exponentially sparse among longer successful walks, while continuous gradient flow shows that $n\log n$ arises as the intrinsic contraction time scale of disorder.
[1] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison–Wesley, 1973.
[2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 4th ed. MIT Press, 2022.
[3] J. R. Landers, Sorting as Gradient Flow on the Permutohedron, 2025. (Author’s manuscript / arXiv preprint.)