Blob Topology Equals Centroids

J. Landers
Abstract. DBSCAN and k-means encode different mathematical notions of “cluster”: ε-connectivity in a proximity graph versus minimization of within-cluster squared error. This note identifies a regime in which these notions coincide after a single feature-space normalization. Our specific contribution is a fully deterministic equivalence theorem: after global whitening, a simple ball-separation condition yields a nonempty interval of DBSCAN radii \( \varepsilon \) for which DBSCAN returns the ground-truth components; under an explicit additional separation inequality, the same partition is the unique global minimizer of the \(K\)-means objective. The argument is direct and avoids spectral embeddings, learned metrics, or label-encoding constructions; the transition is purely geometric.

1. Introduction

Clustering algorithms are best understood as formal definitions of “cluster.” In k-means, clusters arise as a partition minimizing squared reconstruction error, producing Voronoi geometry around centroids. In DBSCAN, clusters arise as connected components in a graph induced by a distance threshold, producing topological connectivity and allowing nonconvex shapes and noise. Because these objects differ, the algorithms are usually presented as alternatives rather than as related limits. This note shows that the gap is not intrinsic: under a natural metric normalization and a transparent separation model, the two definitions collapse to the same combinatorial partition.

2. Preliminaries

Let \(Y=\{y_1,\dots,y_n\}\subset\mathbb{R}^d\). For \(\varepsilon>0\), define the ε-neighborhood \[ N_\varepsilon(y)=\{z\in Y: \|y-z\|\le \varepsilon\}. \] DBSCAN declares \(y\) a core point if \(|N_\varepsilon(y)|\ge\text{minPts}\) and returns clusters as density-connected components of core points. In the regime considered below, all points will be core and DBSCAN reduces to connected components of the ε-neighborhood graph.
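In this all-core regime, DBSCAN is exactly connected components of the ε-neighborhood graph. A minimal stdlib-only sketch (the function name `eps_components` and the union-find bookkeeping are illustrative, not any particular DBSCAN implementation):

```python
from itertools import combinations
from math import dist

def eps_components(points, eps):
    """Clusters as connected components of the eps-neighborhood graph.
    Valid in the regime of Sec. 4, where every point is core."""
    n = len(points)
    parent = list(range(n))  # union-find over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join every pair connected by an eps-edge.
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

The quadratic pair loop matches the worst-case behavior noted in Section 6; a spatial index would replace it in practice.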

For k-means with \(K\) clusters, a partition \(\mathcal{S}=\{S_1,\dots,S_K\}\) of \(Y\) has objective \[ \mathcal{J}(\mathcal{S})=\sum_{j=1}^K\sum_{y\in S_j}\|y-\mu(S_j)\|^2, \qquad \mu(S)=\frac{1}{|S|}\sum_{y\in S}y. \] A k-means solution is any partition minimizing \(\mathcal{J}\).
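The objective \(\mathcal{J}\) can be evaluated directly on any candidate partition; a small sketch (the name `kmeans_objective` is illustrative):

```python
def kmeans_objective(partition):
    """Sum over clusters of squared distances to the cluster centroid,
    i.e. J(S) for a partition given as a list of lists of coordinate tuples."""
    total = 0.0
    for S in partition:
        d = len(S[0])
        mu = [sum(y[a] for y in S) / len(S) for a in range(d)]  # centroid
        total += sum(sum((y[a] - mu[a]) ** 2 for a in range(d)) for y in S)
    return total
```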

3. A single normalization transform

Let \(X=\{x_1,\dots,x_n\}\subset\mathbb{R}^d\). Define \[ \mu=\frac{1}{n}\sum_{i=1}^n x_i, \qquad \Sigma=\frac{1}{n}\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\top, \] and assume \(\Sigma\succ 0\). The whitening map is \[ \phi(x)=\Sigma^{-1/2}(x-\mu), \qquad Y=\{\phi(x_i)\}. \] Then Euclidean distances in \(Y\) coincide with Mahalanobis distances in \(X\): \[ \|\phi(x)-\phi(x')\|^2=(x-x')^\top\Sigma^{-1}(x-x'). \] No algorithm is altered; only the ambient metric geometry is normalized.
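One way to realize \(\phi\) without forming the symmetric square root is via a Cholesky factor \(\Sigma=LL^\top\) with \(\phi(x)=L^{-1}(x-\mu)\): any \(W\) with \(W^\top W=\Sigma^{-1}\) preserves Mahalanobis distances, since \(\|W(x-x')\|^2=(x-x')^\top\Sigma^{-1}(x-x')\). A stdlib-only sketch under that substitution (helper names are illustrative):

```python
def cholesky(A):
    """Lower-triangular L with A = L L^T, for symmetric positive-definite A."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = ((A[i][i] - s) ** 0.5 if i == j
                       else (A[i][j] - s) / L[j][j])
    return L

def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def whiten(X):
    """Cholesky whitening: center, form Sigma, map each point to L^{-1}(x - mu).
    The whitened sample has zero mean and identity covariance."""
    n, d = len(X), len(X[0])
    mu = [sum(x[a] for x in X) / n for a in range(d)]
    Xc = [[x[a] - mu[a] for a in range(d)] for x in X]
    Sigma = [[sum(xc[a] * xc[b] for xc in Xc) / n for b in range(d)]
             for a in range(d)]
    L = cholesky(Sigma)
    return [forward_solve(L, xc) for xc in Xc]
```

Section 6's cost accounting applies unchanged: \(O(nd^2)\) for \(\Sigma\) and \(O(d^3)\) for the factorization.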

4. The whitened separated-blob model

Assume \(Y\) admits a partition \(Y=\bigsqcup_{k=1}^K C_k\) and centers \(m_1,\dots,m_K\in\mathbb{R}^d\) with radius \(r>0\) such that \[ \|y-m_k\|\le r \quad \forall\,y\in C_k, \tag{1} \] and for some \(D>4r\), \[ \|m_k-m_\ell\|\ge D \quad \forall\,k\ne\ell. \tag{2} \] For DBSCAN choose parameters \[ 2r<\varepsilon<D-2r, \qquad \text{minPts}\le \min_{k}|C_k|. \tag{3} \] The interval \((2r,\,D-2r)\) is nonempty precisely because \(D>4r\).

For the k-means statement we impose one additional explicit separation inequality: \[ \frac{1}{2}(D-2r)^2 \;>\; \sum_{k=1}^K \sum_{y\in C_k}\|y-\mu(C_k)\|^2. \tag{4} \] Condition (4) is sufficient (not necessary): it ensures that even one cross-blob mixing event incurs a penalty exceeding the total within-blob variance achievable by fitting blobs separately.

Theorem 1 (Equivalence after whitening). Assume (1)–(3). Then DBSCAN run on \(Y\) with parameters \((\varepsilon,\text{minPts})\) returns exactly the clusters \(\{C_1,\dots,C_K\}\) and no noise points. If, in addition, (4) holds, then \(\{C_1,\dots,C_K\}\) is the unique global minimizer (up to label permutation) of the \(K\)-means objective on \(Y\). Consequently, under (1)–(4), DBSCAN and k-means output the same partition on the whitened data.
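The theorem can be sanity-checked end to end on a toy instance. The sketch below builds two blobs with \(r=\tfrac12\) and \(D=10\) (so (1)–(3) hold with \(\varepsilon=2\), and (4) holds since \(\tfrac12(D-2r)^2=40.5\) far exceeds the within-blob cost), recovers the blobs as ε-components, and brute-forces every bipartition to confirm the k-means claim for \(K=2\); all names are illustrative:

```python
from itertools import combinations
from math import dist

def eps_components(pts, eps):
    """Connected components of the eps-neighborhood graph, as a set of index sets."""
    parent = list(range(len(pts)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(pts)), 2):
        if dist(pts[i], pts[j]) <= eps:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(pts)):
        groups.setdefault(find(i), set()).add(i)
    return {frozenset(g) for g in groups.values()}

def sse(pts, idx):
    """Squared distances of the indexed points to their centroid."""
    d = len(pts[0])
    mu = [sum(pts[i][a] for i in idx) / len(idx) for a in range(d)]
    return sum(sum((pts[i][a] - mu[a]) ** 2 for a in range(d)) for i in idx)

# Two blobs of radius r = 0.5 around (0,0) and (10,0): D = 10 > 4r.
PTS = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
       (10.0, 0.0), (10.5, 0.0), (10.0, 0.5)]
BLOBS = {frozenset({0, 1, 2}), frozenset({3, 4, 5})}

# DBSCAN side: any eps in (2r, D-2r) = (1, 9) returns the blobs.
assert eps_components(PTS, 2.0) == BLOBS

# k-means side (K = 2): every other bipartition costs strictly more.
n = len(PTS)
best = sum(sse(PTS, b) for b in BLOBS)
for mask in range(1, 2 ** (n - 1)):  # fix point n-1 in S2 to avoid duplicates
    S1 = frozenset(i for i in range(n) if (mask >> i) & 1)
    S2 = frozenset(range(n)) - S1
    if {S1, S2} != BLOBS:
        assert sse(PTS, S1) + sse(PTS, S2) > best
```

Brute force is feasible only for tiny \(n\); it is used here purely to certify global optimality, which Lloyd's algorithm alone cannot do.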

5. Proof of Theorem 1

5.1 DBSCAN recovers the components

Fix \(k\) and take any \(y,z\in C_k\). By the triangle inequality and (1), \[ \|y-z\|\le \|y-m_k\|+\|z-m_k\|\le 2r<\varepsilon. \] Thus every pair of points in \(C_k\) is adjacent in the ε-neighborhood graph, so \(C_k\) is connected. Next take \(y\in C_k\) and \(z\in C_\ell\) with \(k\ne\ell\). Using (1) and (2), \[ \|y-z\| \ge \|m_k-m_\ell\|-\|y-m_k\|-\|z-m_\ell\| \ge D-2r>\varepsilon. \] Hence there are no ε-edges between distinct \(C_k\)’s, so the ε-neighborhood graph has exactly \(K\) connected components, namely \(C_1,\dots,C_K\). Finally, for any \(y\in C_k\), the inequality \(\|y-z\|<\varepsilon\) holds for all \(z\in C_k\), so \(|N_\varepsilon(y)|\ge |C_k|\ge\text{minPts}\) by (3). Therefore every point is core and DBSCAN returns exactly \(\{C_k\}\) with no noise.

5.2 A variance decomposition identity

For a finite nonempty set \(S\subset\mathbb{R}^d\), define \[ \mathrm{SSE}(S)=\sum_{y\in S}\|y-\mu(S)\|^2. \] If \(A,B\) are disjoint nonempty sets, then \[ \mathrm{SSE}(A\cup B)=\mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac{|A||B|}{|A|+|B|}\,\|\mu(A)-\mu(B)\|^2. \tag{5} \] To verify (5), write \(\mu=\mu(A\cup B)=\frac{|A|\mu(A)+|B|\mu(B)}{|A|+|B|}\) and expand \(\sum_{y\in A}\|y-\mu\|^2\) and \(\sum_{y\in B}\|y-\mu\|^2\) via the identity \[ \sum_{y\in A}\|y-\mu\|^2=\sum_{y\in A}\|y-\mu(A)\|^2+|A|\,\|\mu(A)-\mu\|^2, \] and similarly for \(B\); substituting \(\mu(A)-\mu=\frac{|B|}{|A|+|B|}(\mu(A)-\mu(B))\) yields (5). Identity (5) is the analytic hinge: mixing separated groups adds a strictly positive between-means penalty.
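Identity (5) is easy to confirm numerically; a sketch on random data (the helper names `sse` and `merge_penalty` are illustrative):

```python
import random

def sse(S):
    """Sum of squared distances of the points in S to their centroid."""
    d = len(S[0])
    mu = [sum(y[a] for y in S) / len(S) for a in range(d)]
    return sum(sum((y[a] - mu[a]) ** 2 for a in range(d)) for y in S)

def merge_penalty(A, B):
    """Right-hand side of (5): separate SSEs plus the between-means term."""
    d = len(A[0])
    muA = [sum(y[a] for y in A) / len(A) for a in range(d)]
    muB = [sum(y[a] for y in B) / len(B) for a in range(d)]
    gap = sum((muA[a] - muB[a]) ** 2 for a in range(d))
    return sse(A) + sse(B) + len(A) * len(B) / (len(A) + len(B)) * gap

random.seed(1)
A = [tuple(random.gauss(0, 1) for _ in range(3)) for _ in range(5)]
B = [tuple(random.gauss(4, 1) for _ in range(3)) for _ in range(8)]
# Both sides of (5) agree to floating-point precision.
assert abs(sse(A + B) - merge_penalty(A, B)) < 1e-9
```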

5.3 k-means cannot improve by mixing blobs

Let \(\mathcal{C}=\{C_1,\dots,C_K\}\) and denote the separate-fit cost \[ \mathcal{J}(\mathcal{C})=\sum_{k=1}^K\mathrm{SSE}(C_k). \] Consider any other \(K\)-partition \(\mathcal{S}=\{S_1,\dots,S_K\}\). If \(\mathcal{S}\neq\mathcal{C}\), then some cluster must mix points from at least two blobs: otherwise each \(S_j\) would be contained in a single \(C_k\), making \(\mathcal{S}\) a refinement of \(\mathcal{C}\), and with exactly \(K\) parts a refinement forces equality up to relabeling. Choose such a mixed cluster \(S_j\) and indices \(k\ne\ell\) with \(A=S_j\cap C_k\ne\emptyset\) and \(B=S_j\cap C_\ell\ne\emptyset\). Dropping the nonnegative contributions of points of \(S_j\) outside \(A\cup B\), and then replacing \(\mu(S_j)\) by the optimal centroid \(\mu(A\cup B)\), gives \(\mathrm{SSE}(S_j)\ge \mathrm{SSE}(A\cup B)\). Applying (5) then yields \[ \mathrm{SSE}(S_j) \ge \mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac{|A||B|}{|A|+|B|}\,\|\mu(A)-\mu(B)\|^2. \tag{6} \] Because Euclidean balls are convex, (1) implies \(\|\mu(A)-m_k\|\le r\) and \(\|\mu(B)-m_\ell\|\le r\), hence \[ \|\mu(A)-\mu(B)\| \ge \|m_k-m_\ell\|-\|\mu(A)-m_k\|-\|\mu(B)-m_\ell\| \ge D-2r. \tag{7} \] Moreover, for integers \(|A|,|B|\ge 1\), \[ \frac{|A||B|}{|A|+|B|}=\frac{1}{\frac{1}{|A|}+\frac{1}{|B|}}\ge \frac12. \tag{8} \] Combining (6)–(8) yields the uniform mixing penalty \[ \mathrm{SSE}(S_j)\ge \mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac12(D-2r)^2. \tag{9} \] Summing over all \(K\) clusters and dropping every nonnegative term other than the mixing penalty, any partition \(\mathcal{S}\neq\mathcal{C}\) satisfies \[ \mathcal{J}(\mathcal{S})\ \ge\ \mathrm{SSE}(S_j)\ \ge\ \frac12(D-2r)^2. \] Under (4), \(\frac12(D-2r)^2>\mathcal{J}(\mathcal{C})\), hence \(\mathcal{J}(\mathcal{S})>\mathcal{J}(\mathcal{C})\) for all \(\mathcal{S}\neq\mathcal{C}\).
Therefore \(\mathcal{C}\) is the unique global minimizer of k-means (up to permutation). This completes the proof of Theorem 1.

6. Discussion and complexity

Whitening is a single global linear map that normalizes scale; under separation \(D>4r\), DBSCAN correctness follows from the topology of the ε-neighborhood graph, which becomes a disjoint union of cliques for any \(\varepsilon\in(2r,D-2r)\). The k-means side is variational: identity (5) shows that mixing separated groups incurs an unavoidable between-means penalty, and condition (4) forces that penalty to dominate any potential gain from re-fitting within blobs.

Whitening costs \(O(nd^2)\) to form \(\Sigma\) and \(O(d^3)\) for a matrix square root (or Cholesky/eigendecomposition). Lloyd iterations for k-means cost \(O(nKdI)\). DBSCAN typically costs near \(O(n\log n)\) for neighbor queries with spatial indexing in modest dimension, but can degrade toward \(O(n^2)\) in high dimension.

7. Conclusion

In the whitened separated-blob regime, “cluster as connectivity” and “cluster as quantization” coincide. DBSCAN returns the ε-connected components; k-means returns the unique partition minimizing squared error; both coincide with the blob partition. The result emphasizes that the tension between density clustering and centroid clustering is often geometric rather than algorithmic: after an appropriate normalization, topology and optimization agree.