Clustering algorithms are best understood as formal definitions of “cluster.” In k-means, clusters arise as a partition minimizing squared reconstruction error, producing Voronoi geometry around centroids. In DBSCAN, clusters arise as connected components in a graph induced by a distance threshold, producing topological connectivity and allowing nonconvex shapes and noise. Because these objects differ, the algorithms are usually presented as alternatives rather than as related limits. This note shows that the gap is not intrinsic: under a natural metric normalization and a transparent separation model, the two definitions collapse to the same combinatorial partition.
Let \(Y=\{y_1,\dots,y_n\}\subset\mathbb{R}^d\). For \(\varepsilon>0\), define the ε-neighborhood \[ N_\varepsilon(y)=\{z\in Y: \|y-z\|\le \varepsilon\}. \] DBSCAN declares \(y\) a core point if \(|N_\varepsilon(y)|\ge\text{minPts}\) and returns clusters as density-connected components of core points. In the regime considered below, all points will be core and DBSCAN reduces to connected components of the ε-neighborhood graph.
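The reduction to connected components can be checked directly. A minimal sketch, assuming `numpy` and `scipy` are available; the helper name `eps_components` is illustrative, not part of any library:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def eps_components(Y, eps):
    """Connected components of the eps-neighborhood graph on Y."""
    D = cdist(Y, Y)                      # pairwise Euclidean distances
    A = csr_matrix(D <= eps)             # adjacency: edge iff distance <= eps
    n_comp, labels = connected_components(A, directed=False)
    return n_comp, labels

# Two well-separated groups on a line: eps=1 links within-group pairs only.
Y = np.array([[0.0], [0.5], [10.0], [10.5]])
n_comp, labels = eps_components(Y, eps=1.0)
```

When every point is core, DBSCAN's output is exactly these components.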
For k-means with \(K\) clusters, a partition \(\mathcal{S}=\{S_1,\dots,S_K\}\) of \(Y\) has objective \[ \mathcal{J}(\mathcal{S})=\sum_{j=1}^K\sum_{y\in S_j}\|y-\mu(S_j)\|^2, \qquad \mu(S)=\frac{1}{|S|}\sum_{y\in S}y. \] A k-means solution is any partition minimizing \(\mathcal{J}\).
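The objective \(\mathcal{J}\) is straightforward to evaluate for any labeling. A minimal sketch assuming `numpy`; the function name `kmeans_objective` is illustrative:

```python
import numpy as np

def kmeans_objective(Y, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        S = Y[labels == k]
        total += np.sum((S - S.mean(axis=0)) ** 2)
    return total

Y = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
# Pairing near points: each pair has centroid midway, SSE = 1^2 + 1^2 = 2.
J_good = kmeans_objective(Y, np.array([0, 0, 1, 1]))   # 2 + 2 = 4
J_bad = kmeans_objective(Y, np.array([0, 1, 0, 1]))    # mixes the two groups
```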
Let \(X=\{x_1,\dots,x_n\}\subset\mathbb{R}^d\). Define \[ \mu=\frac{1}{n}\sum_{i=1}^n x_i, \qquad \Sigma=\frac{1}{n}\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\top, \] and assume \(\Sigma\succ 0\). The whitening map is \[ \phi(x)=\Sigma^{-1/2}(x-\mu), \qquad Y=\{\phi(x_i)\}. \] Then Euclidean distances in \(Y\) coincide with Mahalanobis distances in \(X\): \[ \|\phi(x)-\phi(x')\|^2=(x-x')^\top\Sigma^{-1}(x-x'). \] No algorithm is altered; only the ambient metric geometry is normalized.
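Both claims about the whitening map, that Euclidean distance in \(Y\) equals Mahalanobis distance in \(X\), and that \(Y\) has identity covariance, can be verified numerically. A sketch assuming `numpy`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Anisotropic data: standard normal samples pushed through a nonsingular map.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.7]])

mu = X.mean(axis=0)
Sigma = np.cov(X.T, bias=True)           # 1/n normalization, matching the text
# Symmetric inverse square root via eigendecomposition (Sigma is PD).
w, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
Y = (X - mu) @ Sigma_inv_sqrt

# Euclidean distance in Y equals Mahalanobis distance in X:
d_euclid = np.sum((Y[0] - Y[1]) ** 2)
diff = X[0] - X[1]
d_mahal = diff @ np.linalg.inv(Sigma) @ diff
```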
Assume \(Y\) admits a partition \(Y=\bigsqcup_{k=1}^K C_k\) and centers \(m_1,\dots,m_K\in\mathbb{R}^d\) with radius \(r>0\) such that
\[
\|y-m_k\|\le r \quad \forall\,y\in C_k, \tag{1}
\]
and for some \(D>4r\),
\[
\|m_k-m_\ell\|\ge D \quad \forall\,k\ne\ell. \tag{2}
\]
For DBSCAN choose parameters \[ 2r<\varepsilon<D-2r, \qquad \text{minPts}\le\min_{1\le k\le K}|C_k|. \tag{3} \] The interval for \(\varepsilon\) is nonempty precisely because \(D>4r\).
For the k-means statement we impose one additional explicit separation inequality: \[ \frac{1}{2}(D-2r)^2 \;>\; \sum_{k=1}^K \sum_{y\in C_k}\|y-\mu(C_k)\|^2. \tag{4} \] Condition (4) is sufficient (not necessary): it ensures that even one cross-blob mixing event incurs a penalty exceeding the total within-blob variance achievable by fitting blobs separately.
Fix \(k\) and take any \(y,z\in C_k\). By the triangle inequality and (1), \[ \|y-z\|\le \|y-m_k\|+\|z-m_k\|\le 2r<\varepsilon. \] Thus every pair of points in \(C_k\) is adjacent in the ε-neighborhood graph, so \(C_k\) is connected. Next take \(y\in C_k\) and \(z\in C_\ell\) with \(k\ne\ell\). Using (1) and (2), \[ \|y-z\| \ge \|m_k-m_\ell\|-\|y-m_k\|-\|z-m_\ell\| \ge D-2r>\varepsilon. \] Hence there are no ε-edges between distinct \(C_k\)’s, so the ε-neighborhood graph has exactly \(K\) connected components, namely \(C_1,\dots,C_K\). Finally, for any \(y\in C_k\), the inequality \(\|y-z\|<\varepsilon\) holds for all \(z\in C_k\), so \(|N_\varepsilon(y)|\ge |C_k|\ge\text{minPts}\) by (3). Therefore every point is core and DBSCAN returns exactly \(\{C_k\}\) with no noise.
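The DBSCAN side of the theorem can be exercised end to end. A sketch assuming `numpy` and `scikit-learn`; the blob geometry below (three centers at mutual distance \(\ge D\), points within radius \(r\) of each center) instantiates (1)–(3):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
r, D = 1.0, 8.0                          # separation D > 4r
centers = np.array([[0.0, 0.0], [D, 0.0], [0.0, D]])
# 30 points per blob; per-coordinate range +-r/sqrt(2) keeps each point
# within Euclidean distance r of its center.
Y = np.vstack([c + rng.uniform(-r / np.sqrt(2), r / np.sqrt(2), size=(30, 2))
               for c in centers])

# Any eps in (2r, D-2r) = (2, 6) works; minPts at most the blob size.
model = DBSCAN(eps=3.0, min_samples=30).fit(Y)
```

With these parameters every point is core, no point is labeled noise (`-1`), and the three blobs come back as the three clusters.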
For a finite nonempty set \(S\subset\mathbb{R}^d\), define \[ \mathrm{SSE}(S)=\sum_{y\in S}\|y-\mu(S)\|^2. \] If \(A,B\) are disjoint nonempty sets, then \[ \mathrm{SSE}(A\cup B)=\mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac{|A||B|}{|A|+|B|}\,\|\mu(A)-\mu(B)\|^2. \tag{5} \] To verify (5), write \(\mu=\mu(A\cup B)=\frac{|A|\mu(A)+|B|\mu(B)}{|A|+|B|}\) and expand \(\sum_{y\in A}\|y-\mu\|^2\) and \(\sum_{y\in B}\|y-\mu\|^2\) via the identity \[ \sum_{y\in A}\|y-\mu\|^2=\sum_{y\in A}\|y-\mu(A)\|^2+|A|\,\|\mu(A)-\mu\|^2, \] and similarly for \(B\); substituting \(\mu(A)-\mu=\frac{|B|}{|A|+|B|}(\mu(A)-\mu(B))\) yields (5). Identity (5) is the analytic hinge: mixing separated groups adds a strictly positive between-means penalty.
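Identity (5) holds exactly, not just asymptotically, and a numerical check makes the between-means penalty concrete. A sketch assuming `numpy`:

```python
import numpy as np

def sse(S):
    """Sum of squared distances to the centroid of S."""
    return float(np.sum((S - S.mean(axis=0)) ** 2))

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 2))              # |A| = 5
B = rng.normal(loc=4.0, size=(8, 2))     # |B| = 8, shifted mean

lhs = sse(np.vstack([A, B]))
penalty = (len(A) * len(B) / (len(A) + len(B))) \
    * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
rhs = sse(A) + sse(B) + penalty          # identity (5)
```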
Let \(\mathcal{C}=\{C_1,\dots,C_K\}\) and denote the separate-fit cost \[ \mathcal{J}(\mathcal{C})=\sum_{k=1}^K\mathrm{SSE}(C_k). \] Consider any other \(K\)-partition \(\mathcal{S}=\{S_1,\dots,S_K\}\). If \(\mathcal{S}\neq\mathcal{C}\), then some cluster must mix points from at least two blobs: otherwise each \(S_j\) would be contained in a single \(C_k\), making \(\mathcal{S}\) a refinement of \(\mathcal{C}\); with exactly \(K\) parts this forces equality up to relabeling. Choose such a mixed cluster \(S_j\) and indices \(k\ne\ell\) with \(A=S_j\cap C_k\ne\emptyset\) and \(B=S_j\cap C_\ell\ne\emptyset\). Since dropping the remaining points of \(S_j\) only removes nonnegative terms, and the centroid of \(A\cup B\) minimizes squared error over that set, \(\mathrm{SSE}(S_j)\ge \mathrm{SSE}(A\cup B)\). Applying (5) gives \[ \mathrm{SSE}(S_j) \ge \mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac{|A||B|}{|A|+|B|}\,\|\mu(A)-\mu(B)\|^2. \tag{6} \] Because Euclidean balls are convex, (1) implies \(\|\mu(A)-m_k\|\le r\) and \(\|\mu(B)-m_\ell\|\le r\), hence \[ \|\mu(A)-\mu(B)\| \ge \|m_k-m_\ell\|-\|\mu(A)-m_k\|-\|\mu(B)-m_\ell\| \ge D-2r. \tag{7} \] Moreover, for integers \(|A|,|B|\ge 1\), \[ \frac{|A||B|}{|A|+|B|}=\frac{1}{\frac{1}{|A|}+\frac{1}{|B|}}\ge \frac12. \tag{8} \] Combining (6)–(8) yields the uniform mixing penalty \[ \mathrm{SSE}(S_j)\ge \mathrm{SSE}(A)+\mathrm{SSE}(B)+\frac12(D-2r)^2. \tag{9} \] Summing over all clusters: every term \(\mathrm{SSE}(S_{j'})\) is nonnegative, and the mixed cluster alone contributes at least \(\frac12(D-2r)^2\) by (9), so any partition \(\mathcal{S}\neq\mathcal{C}\) satisfies \[ \mathcal{J}(\mathcal{S})\ge \frac12(D-2r)^2. \] Under (4), \(\frac12(D-2r)^2>\mathcal{J}(\mathcal{C})\), hence \(\mathcal{J}(\mathcal{S})>\mathcal{J}(\mathcal{C})\) for all \(\mathcal{S}\neq\mathcal{C}\).
Therefore \(\mathcal{C}\) is the unique global minimizer of k-means (up to permutation). This completes the proof of Theorem 1.
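The k-means side admits the same end-to-end check: with blobs tight enough that (4) holds, Lloyd's algorithm with multiple restarts should land on the blob partition. A sketch assuming `numpy` and `scikit-learn`; note that `KMeans` is a heuristic for the global minimizer, so restarts (`n_init`) are doing real work here:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
r, D = 0.5, 20.0                         # tight blobs, large separation
centers = np.array([[0.0, 0.0], [D, 0.0], [0.0, D]])
# 25 points per blob, each within Euclidean distance r of its center.
# Total within-blob SSE is a few units, far below (1/2)(D-2r)^2 = 180.5,
# so condition (4) holds comfortably.
Y = np.vstack([c + rng.uniform(-r / np.sqrt(2), r / np.sqrt(2), size=(25, 2))
               for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Y)
# Up to relabeling, the result should be exactly the blob partition.
```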
Whitening is a single global linear map that normalizes scale; under separation \(D>4r\), DBSCAN correctness follows from the topology of the ε-neighborhood graph, which becomes a disjoint union of cliques for any \(\varepsilon\in(2r,D-2r)\). The k-means side is variational: identity (5) shows that mixing separated groups incurs an unavoidable between-means penalty, and condition (4) forces that penalty to dominate any potential gain from re-fitting within blobs.
Whitening costs \(O(nd^2)\) to form \(\Sigma\) and \(O(d^3)\) for a matrix square root (or Cholesky/eigendecomposition). Lloyd iterations for k-means cost \(O(nKdI)\). DBSCAN typically costs near \(O(n\log n)\) for neighbor queries with spatial indexing in modest dimension, but can degrade toward \(O(n^2)\) in high dimension.
In the whitened separated-blob regime, “cluster as connectivity” and “cluster as quantization” coincide. DBSCAN returns the ε-connected components; k-means returns the unique partition minimizing squared error; both coincide with the blob partition. The result emphasizes that the tension between density clustering and centroid clustering is often geometric rather than algorithmic: after an appropriate normalization, topology and optimization agree.