When a classifier is trained at one time and deployed across a long horizon, the relevant question is not merely whether the model generalizes under an IID assumption, but whether its performance remains stable as the data distribution moves. The guiding intuition is simple:
If the features change rapidly over time, and the classifier is sensitive along those directions, long-horizon generalization degrades.
The aim below is to express this as a quantitative inequality with interpretable drivers.
The narrative arc is: convert drift into a time signal \(r(t)\), control its fluctuations by controlling its derivative, and then expose the derivative as a geometric interaction between how the world moves and how the model locally bends space.
The central object will be the risk trajectory \(r(t)\), and we will measure deployment instability by the variance of this risk across the horizon. The resulting bounds isolate two interacting forces: (i) the cumulative "speed" of the data stream, and (ii) the model's local tangent sensitivity.
Let time \(t\in[0,T]\). Let \(Z_t=(X_t,Y_t)\) denote a feature-label pair at time \(t\), distributed as \(P_t\). Fix model parameters \(\theta\), a score/logit \(f_\theta:\mathbb{R}^d\to\mathbb{R}\), and a bounded loss \(\ell:\mathbb{R}\times\mathcal{Y}\to[0,1]\). Define the time-\(t\) risk
\[
r(t)\;:=\;\mathbb{E}_{P_t}\big[\ell\big(f_\theta(X_t),Y_t\big)\big].
\]
Think of \(r(t)\) as the single scalar “performance meter” monitored during deployment: everything we do is aimed at bounding how violently that meter can swing when the model is frozen but the environment is not.
To measure stability over the horizon, sample a random deployment time \(U\sim \mathrm{Unif}[0,T]\) and consider the risk volatility
\[
\mathrm{Var}_U\big(r(U)\big)\;=\;\frac{1}{T}\int_0^T\big(r(t)-\bar r\big)^2\,dt,
\qquad
\bar r\;:=\;\frac{1}{T}\int_0^T r(s)\,ds.
\]
This "risk volatility" quantity has a practical meaning: it measures how much performance can fluctuate over the deployment window, even if the model itself is held fixed.
Stability becomes a purely time-domain question: how much can a function fluctuate on \([0,T]\) if its rate of change is limited? The next lemma formalizes the “no-wiggles-without-slope” intuition.
Lemma 1. If \(r\) is absolutely continuous on \([0,T]\), then
\[
\mathrm{Var}_U\big(r(U)\big)\;\le\;\frac{T}{\pi^2}\int_0^T\big(r'(t)\big)^2\,dt.
\]
Proof. This is the Poincaré-Wirtinger inequality on \([0,T]\): \(\int_0^T (r-\bar r)^2 \le \frac{T^2}{\pi^2}\int_0^T (r')^2\), divided by \(T\). \(\square\)
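As a quick numerical illustration (not part of the argument), the inequality of Lemma 1 can be checked on a grid; the constant \(T/\pi^2\) is sharp, with \(r(t)=\cos(\pi t/T)\) attaining equality. The horizon \(T\) and the test curves below are arbitrary choices.

```python
import numpy as np

# Sanity check of Lemma 1 (Poincare-Wirtinger on [0, T]):
#   Var_U(r(U)) = (1/T) * int (r - rbar)^2 dt  <=  (T / pi^2) * int (r')^2 dt.
# T and the two test functions are arbitrary illustrative choices.

T = 3.0
t = np.linspace(0.0, T, 200_001)
dt = t[1] - t[0]

def trapz(y):
    # composite trapezoidal rule on the fixed grid
    return float(np.sum((y[1:] + y[:-1]) * 0.5) * dt)

def variance_and_bound(r_vals, rprime_vals):
    r_bar = trapz(r_vals) / T
    var = trapz((r_vals - r_bar) ** 2) / T
    bound = (T / np.pi**2) * trapz(rprime_vals**2)
    return var, bound

# Extremal case: cos(pi t / T) attains equality (up to quadrature error).
var1, bound1 = variance_and_bound(
    np.cos(np.pi * t / T), -(np.pi / T) * np.sin(np.pi * t / T)
)
# A generic smooth "risk curve": strict inequality expected.
var2, bound2 = variance_and_bound(
    0.3 + 0.1 * t + 0.05 * np.sin(4 * t), 0.1 + 0.2 * np.cos(4 * t)
)

print(var1, bound1)  # approximately equal
print(var2, bound2)  # var2 strictly below bound2
```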
Lemma 1 turns the stability question into a control problem: to bound risk volatility, it is enough to bound the energy of the risk derivative \(\int_0^T (r'(t))^2dt\).
To use Lemma 1, we need a handle on \(r'(t)\) that depends on the moving distribution itself. A convenient “speed” of \(P_t\) is the squared \(L^2(P_t)\) norm of the temporal score \(\partial_t\log p_t\), which behaves like kinetic energy for densities.
Theorem 1. Assume \(P_t\) admits densities \(p_t(z)\) with respect to a common base measure \(\mu\), differentiable in \(t\), and that the temporal score \(\partial_t\log p_t(Z)\in L^2(P_t)\) for a.e. \(t\). Define the temporal Fisher information of the path
\[
I(t)\;:=\;\mathbb{E}_{P_t}\big[\big(\partial_t\log p_t(Z)\big)^2\big].
\]
Then for any bounded loss \(\ell\in[0,1]\),
\[
\mathrm{Var}_U\big(r(U)\big)\;\le\;\frac{T}{4\pi^2}\int_0^T I(t)\,dt.
\]
Proof. Differentiate under the integral: \[ r'(t)=\frac{d}{dt}\int \ell(z)p_t(z)\,d\mu =\mathbb{E}_{P_t}\big[\ell(Z)\,\partial_t\log p_t(Z)\big]. \] Also \(\mathbb{E}_{P_t}[\partial_t\log p_t(Z)]=\partial_t\int p_t\,d\mu=0\), hence \[ r'(t)=\mathrm{Cov}_{P_t}\!\big(\ell(Z),\partial_t\log p_t(Z)\big). \] By Cauchy-Schwarz, \((r'(t))^2\le \mathrm{Var}_{P_t}(\ell(Z))\,I(t)\le \frac14 I(t)\), since \(\ell\in[0,1]\Rightarrow \mathrm{Var}(\ell)\le 1/4\). Apply Lemma 1. \(\square\)
This bound is distribution-centric: it says risk volatility is controlled by the intrinsic "speed-energy" of the full data path \(\{P_t\}\), measured in the \(L^2\) norm of the temporal score.
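For a concrete feel, consider a drifting Gaussian \(Z_t\sim\mathcal N(vt,\sigma^2)\), where the temporal score is \(\partial_t\log p_t(z)=v(z-vt)/\sigma^2\) and hence \(I(t)=v^2/\sigma^2\). The sketch below (parameters \(T, v, \sigma, c\) are arbitrary illustrative choices, not from the text) checks the bound for the bounded loss \(\ell(z)=\mathbf 1\{z>c\}\), whose risk \(r(t)=1-\Phi((c-vt)/\sigma)\) is available in closed form.

```python
import numpy as np
from math import erf, sqrt

# Drifting Gaussian Z_t ~ N(v t, sigma^2) with indicator loss l(z) = 1{z > c}:
#   temporal score  d/dt log p_t(z) = v (z - v t) / sigma^2,
#   so I(t) = v^2 / sigma^2 and the Theorem-1 bound is (T / (4 pi^2)) * int I(t) dt.
# All parameter values are arbitrary illustrative choices.

T, v, sigma, c = 1.0, 1.0, 1.0, 0.5

t = np.linspace(0.0, T, 20_001)
dt = t[1] - t[0]

def trapz(y):
    return float(np.sum((y[1:] + y[:-1]) * 0.5) * dt)

# Risk trajectory r(t) = P(Z_t > c) = 1 - Phi((c - v t) / sigma), in [0, 1].
Phi = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in (c - v * t) / sigma])
r = 1.0 - Phi

r_bar = trapz(r) / T
risk_var = trapz((r - r_bar) ** 2) / T

I = np.full_like(t, (v / sigma) ** 2)          # constant temporal Fisher information
bound = (T / (4.0 * np.pi**2)) * trapz(I)

print(risk_var, bound)  # risk_var should sit below bound
```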
To align with the original intuition, we now specialize to a common deployment regime: drift arises primarily from the covariates rather than from the conditional label mechanism.
In this regime, the world changes because \(X_t\) moves, not because the meaning of a label changes. The bound will read like a mechanics statement: cumulative motion in feature space, filtered through model sensitivity, yields cumulative performance fluctuation.
Label stability: \(P_t(Y\mid X)=P_0(Y\mid X)\) for all \(t\in[0,T]\).
Define the conditional risk function
\[
g_\theta(x)\;:=\;\mathbb{E}\big[\ell\big(f_\theta(X),Y\big)\,\big|\,X=x\big]
\;=\;\int \ell\big(f_\theta(x),y\big)\,P_0(dy\mid x),
\]
so that, under label stability, \(r(t)=\mathbb{E}\big[g_\theta(X_t)\big]\).
Theorem 2. Assume label stability. Suppose \(X_t\) is absolutely continuous and \(\int_0^T \mathbb{E}\|\dot X_t\|^2\,dt<\infty\). If \(g_\theta\) is \(L\)-Lipschitz in \(x\) (equivalently, \(\|\nabla g_\theta(x)\|\le L\) a.e.), then
\[
\mathrm{Var}_U\big(r(U)\big)\;\le\;\frac{L^2\,T}{\pi^2}\int_0^T \mathbb{E}\|\dot X_t\|^2\,dt.
\]
Proof. Differentiate \(r(t)=\mathbb{E}[g_\theta(X_t)]\) a.e.: \[ r'(t)=\mathbb{E}\big[\nabla g_\theta(X_t)^\top \dot X_t\big]. \] By Jensen's inequality and the pointwise Cauchy-Schwarz bound \(|\nabla g_\theta(X_t)^\top \dot X_t|\le \|\nabla g_\theta(X_t)\|\,\|\dot X_t\|\le L\|\dot X_t\|\), we get \((r'(t))^2\le L^2\,\mathbb{E}\|\dot X_t\|^2\). Apply Lemma 1. \(\square\)
Theorem 2 makes the "fast-changing features" claim explicit: risk volatility over a long horizon scales with the cumulative kinetic energy of the covariates, modulated by a model sensitivity constant.
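The scaling in Theorem 2 is easy to exercise numerically. The sketch below (all parameters are arbitrary illustrative choices) uses a deterministic linear path \(X_t=x_0+vt\), so \(\mathbb{E}\|\dot X_t\|^2=\|v\|^2\), and the Lipschitz conditional risk \(g(x)=\tfrac12(1+\tanh(w^\top x))\), for which \(L=\tfrac12\|w\|\).

```python
import numpy as np

# Check of the Theorem-2 inequality
#   Var_U(r(U)) <= (L^2 T / pi^2) * int_0^T E||Xdot_t||^2 dt
# on a deterministic covariate path X_t = x0 + v t with the Lipschitz
# conditional risk g(x) = 0.5 * (1 + tanh(w . x)), so L = 0.5 * ||w||.
# x0, v, w, T are arbitrary illustrative choices.

T = 2.0
x0 = np.array([0.5, -1.0])
v = np.array([0.8, 0.3])            # constant velocity, so ||Xdot_t|| = ||v||
w = np.array([1.2, -0.7])
L = 0.5 * np.linalg.norm(w)         # Lipschitz constant of g

t = np.linspace(0.0, T, 100_001)
dt = t[1] - t[0]

def trapz(y):
    return float(np.sum((y[1:] + y[:-1]) * 0.5) * dt)

X = x0[None, :] + t[:, None] * v[None, :]
r = 0.5 * (1.0 + np.tanh(X @ w))    # risk trajectory r(t) = g(X_t), in [0, 1]

r_bar = trapz(r) / T
risk_var = trapz((r - r_bar) ** 2) / T

speed_energy = trapz(np.full_like(t, float(v @ v)))   # = T * ||v||^2
bound = (L**2 * T / np.pi**2) * speed_energy

print(risk_var, bound)  # risk_var should sit below bound
```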
For ReLU networks, global Lipschitz constants (e.g. products of spectral norms) can be very loose. A sharper picture appears when we charge only the directions actually traversed by the data stream.
The key shift is from “worst-case everywhere” to “what happens along the path we actually take.” Since ReLU nets are piecewise linear, the Jacobian is the right local object: it gives the score change for the specific infinitesimal motion \(\dot X_t\).
Assume the loss is differentiable in \(s\) with \(\big|\partial_s\ell(s,y)\big|\le \beta\) for all \(s,y\) (a.e. if needed).
Theorem 3. Let \(f_\theta\) be a ReLU network (piecewise linear), so its Jacobian \(J_{f_\theta}(x)\) exists a.e. If label stability holds and \(X_t\) is absolutely continuous with finite speed energy, then
\[
\mathrm{Var}_U\big(r(U)\big)\;\le\;\frac{\beta^2\,T}{\pi^2}\int_0^T \mathbb{E}\big\|J_{f_\theta}(X_t)\,\dot X_t\big\|^2\,dt.
\]
Proof. Differentiate a.e.: \[ r'(t) = \mathbb{E}\left[ \partial_s\ell\!\big(f_\theta(X_t),Y_t\big)\;\nabla f_\theta(X_t)^\top \dot X_t \right]. \] Jensen's inequality and \(|\partial_s\ell|\le \beta\) give \[ (r'(t))^2 \le \beta^2\, \mathbb{E}\left[\big(\nabla f_\theta(X_t)^\top \dot X_t\big)^2\right] = \beta^2\,\mathbb{E}\|J_{f_\theta}(X_t)\dot X_t\|^2, \] where the equality uses that the output is scalar, so \(J_{f_\theta}(x)=\nabla f_\theta(x)^\top\). Apply Lemma 1. \(\square\)
This form isolates the geometric interaction: the stream supplies a velocity \(\dot X_t\), the network supplies a local linear map \(J_{f_\theta}(X_t)\), and the quantity \(J_{f_\theta}(X_t)\dot X_t\) is the instantaneous rate at which the model's score changes along the observed motion. Large accumulated Jacobian-velocity energy forces large risk volatility.
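The tangent term is directly computable for small networks. The sketch below (a toy two-layer net with random weights and a random linear path, all illustrative choices) evaluates the Jacobian-velocity energy \(\int_0^T \|J_{f_\theta}(X_t)\dot X_t\|^2\,dt\), using the fact that for a piecewise-linear net the Jacobian in each region is \(w_2^\top \mathrm{diag}(\mathbf 1[W_1x+b_1>0])\,W_1\).

```python
import numpy as np

# Jacobian-velocity energy for a toy two-layer ReLU net
#   f(x) = w2 . relu(W1 x + b1)   (scalar output).
# In each linear region, J_f(x) = w2^T diag(1[W1 x + b1 > 0]) W1.
# Weights and the path are random illustrative choices.

rng = np.random.default_rng(0)
d, h = 4, 8
W1 = rng.standard_normal((h, d))
b1 = rng.standard_normal(h)
w2 = rng.standard_normal(h)

def jacobian(x):
    # the active-unit mask selects the current linear region
    mask = (W1 @ x + b1 > 0).astype(float)
    return (w2 * mask) @ W1              # gradient row, shape (d,)

# Deterministic path X_t = x0 + v t sampled on a grid over [0, T].
T, n = 1.0, 10_001
t = np.linspace(0.0, T, n)
x0 = rng.standard_normal(d)
v = rng.standard_normal(d)

dt = t[1] - t[0]
# instantaneous score-change rate J_f(X_t) Xdot_t along the path
rates = np.array([float(jacobian(x0 + s * v) @ v) for s in t])
tangent_energy = float(np.sum((rates[1:] ** 2 + rates[:-1] ** 2) * 0.5) * dt)

print(tangent_energy)  # int_0^T ||J_f(X_t) Xdot_t||^2 dt along this path
```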
If a single global constant is preferred, the tangent term can always be upper bounded by operator norms, but this loses path specificity. This section is a sanity check: the geometric bound nests the familiar worst-case Lipschitz story as a special (and typically looser) corollary.
Since \(\|J_{f_\theta}(x)\dot x\|\le \|J_{f_\theta}(x)\|_{\mathrm{op}}\|\dot x\|\), Theorem 3 implies \[ \mathrm{Var}_U(r(U)) \le \frac{\beta^2T}{\pi^2} \int_0^T \mathbb{E}\big[\|J_{f_\theta}(X_t)\|_{\mathrm{op}}^2\,\|\dot X_t\|^2\big]\,dt. \] For an \(L\)-layer fully-connected ReLU net with weights \(\{W_k\}\), \(\|J_{f_\theta}(x)\|_{\mathrm{op}}\le \prod_{k=1}^L\|W_k\|_2\) a.e., yielding the familiar (looser) global bound.
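The looseness of the spectral-norm product is easy to see empirically. The sketch below (toy two-layer ReLU net, random weights and path, all illustrative choices) compares the path-averaged tangent term \(\mathbb{E}\|J_{f_\theta}(x)\dot x\|^2\) to the global proxy \(\big(\prod_k\|W_k\|_2\,\|\dot x\|\big)^2\); the former is guaranteed to be no larger, and is typically much smaller.

```python
import numpy as np

# Path-specific tangent energy vs. the global spectral-norm bound
#   ||J_f(x)||_op <= ||W1||_2 * ||w2||_2
# for a toy two-layer ReLU net f(x) = w2 . relu(W1 x).
# Weights and the path are random illustrative choices.

rng = np.random.default_rng(1)
d, h = 6, 32
W1 = rng.standard_normal((h, d)) / np.sqrt(d)
w2 = rng.standard_normal((1, h)) / np.sqrt(h)

def jacobian(x):
    mask = (W1 @ x > 0).astype(float)
    return w2 @ (mask[:, None] * W1)     # shape (1, d)

# product of spectral norms: a global Lipschitz proxy
global_lip = np.linalg.norm(W1, 2) * np.linalg.norm(w2, 2)

# average ||J_f(x) v||^2 along a random linear path x(s) = x0 + s v, s in [0, 1]
x0, v = rng.standard_normal(d), rng.standard_normal(d)
path_rates = [
    float(np.linalg.norm(jacobian(x0 + s * v) @ v))
    for s in np.linspace(0.0, 1.0, 2_001)
]
path_term = float(np.mean(np.square(path_rates)))            # path-specific
global_term = float((global_lip * np.linalg.norm(v)) ** 2)   # worst-case proxy

print(path_term, global_term)  # path_term is never larger, usually much smaller
```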
The result can be read as a single pipeline: drift induces motion, motion induces score change through local geometry, and score change accumulates into risk volatility. In deployment terms, one either slows the stream, desensitizes the model along the stream’s directions, or accepts fluctuating performance.
Read as a single statement: long-horizon generalization is governed by the interaction of the stream's dynamics and the model's tangent geometry.