Tail Risk and Compute Allocation in Heavy-Tailed ML Training Systems

J. Landers

1. Motivation

In many modern ML training pipelines, the time required to train a model is not well summarized by a single “typical” runtime. Some runs converge quickly; others stall, hover near instability, or effectively fail. The resulting runtime distribution is often right-skewed and heavy-tailed. When this happens, scheduling decisions based solely on expected runtime can be fragile: a policy that looks good on average may exhibit large high-percentile overruns.

This note isolates one simple mechanism by which tail behavior changes scheduling geometry. We consider two collections of candidate training jobs, each with lognormal service times, and ask how one should allocate compute if the true objective is risk-sensitive—for instance, minimizing the 90th percentile time to complete a chosen collection or to hit a deadline.

Theme. Under heavy tails, “minimize the mean” and “control the tail” are different optimization problems, and the resulting allocation boundary changes.

2. A minimal model

Consider a single sequential training resource (one GPU). There are two queues of models to train: \(Q_1\) containing \(n_1\) models and \(Q_2\) containing \(n_2\) models. The training time of the \(k\)-th model in queue \(i\in\{1,2\}\) is modeled as

\[ S_{i,k} \sim \mathrm{LogNormal}(\mu_i,\sigma_i^2), \qquad\text{equivalently}\qquad \ln S_{i,k} \sim \mathcal{N}(\mu_i,\sigma_i^2). \]

The lognormal assumption is a convenient idealization for multiplicative sources of delay and rare instabilities. Its essential mathematical feature is that the mean and variance depend exponentially on \(\sigma_i^2\):

\[ m_i := \mathbb{E}[S_{i,k}] = e^{\mu_i + \sigma_i^2/2}, \qquad v_i := \mathrm{Var}(S_{i,k}) = (e^{\sigma_i^2}-1)e^{2\mu_i+\sigma_i^2}. \]

Thus increasing \(\sigma_i\) does not merely widen the distribution; it inflates the mean and increases the propensity for rare but extremely long runtimes.
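
As a quick sanity check, the moment formulas can be verified by simulation. The following sketch (numpy, with arbitrary illustrative parameters) compares empirical moments against \(m_i\) and \(v_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.2  # illustrative parameters, not fit to any real workload

# Draw service times S ~ LogNormal(mu, sigma^2).
S = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

m = np.exp(mu + sigma**2 / 2)                           # E[S]
v = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)  # Var(S)

print(f"mean: empirical {S.mean():.3f}  vs  formula {m:.3f}")
print(f"var : empirical {S.var():.3f}  vs  formula {v:.3f}")
```

Note that even at \(10^6\) samples the empirical variance converges slowly when \(\sigma\) is large; this slow convergence is itself a symptom of the heavy tail.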

3. Queue completion times

If we commit to training all models in queue \(i\), the total time is

\[ T_i := \sum_{k=1}^{n_i} S_{i,k}. \]

By linearity of expectation and, assuming the \(S_{i,k}\) are independent across jobs, additivity of variance,

\[ \mathbb{E}[T_i] = n_i m_i,\qquad \mathrm{Var}(T_i) = n_i v_i. \]

A key stabilizing effect appears here: the squared coefficient of variation \(\mathrm{Var}(T_i)/\mathbb{E}[T_i]^2 = v_i/(n_i m_i^2)\) decays like \(1/n_i\). In large queues, aggregation dampens variability; in small queues, tail risk remains prominent. This interaction between queue size and tail geometry is the main driver of the scheduling boundary.
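
A short numerical illustration of this decay, using only the moment formulas above (parameters are again arbitrary):

```python
import numpy as np

def cv2_total(n: int, mu: float, sigma: float) -> float:
    """Squared coefficient of variation of T = sum of n i.i.d. lognormals."""
    m = np.exp(mu + sigma**2 / 2)
    v = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
    return (n * v) / (n * m) ** 2  # = (e^{sigma^2} - 1) / n; mu cancels

for n in (1, 10, 100, 1000):
    print(n, cv2_total(n, mu=0.0, sigma=1.5))
```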

4. Mean-optimal allocation

If one cares only about expected completion time of a chosen queue, the classical rule compares expected totals:

\[ \text{Choose } Q_1 \ \text{iff}\ n_1 m_1 < n_2 m_2, \qquad\text{so}\qquad \mathbb{E}[T_{\text{commit}}]=\min(n_1m_1,n_2m_2). \]

Under light-tailed service times this is often sufficient. Under heavy tails, however, expectation can be a poor proxy for deadline risk or for high-percentile runtime.

5. A risk-sensitive objective

In practice one often cares about completing work reliably. A convenient mathematical proxy is a high quantile. For \(p\in(0,1)\), let \(Q_p(X)\) denote the \(p\)-quantile of a random variable \(X\). We consider the objective

\[ \min_\pi Q_p(T_\pi), \]

where \(\pi\) is a scheduling policy and \(T_\pi\) is the induced completion time. For example, \(p=0.9\) or \(p=0.95\) corresponds to controlling the 90th or 95th percentile runtime. In heavy-tailed systems, this objective can lead to qualitatively different behavior than minimizing \(\mathbb{E}[T]\).
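
A small simulation makes the gap concrete. For a single job with \(\mu=0\) and \(\sigma=2\) (a hypothetical parameter choice for illustration), the mean, median, and 90th percentile sit far apart:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.lognormal(mean=0.0, sigma=2.0, size=1_000_000)

print("mean  :", S.mean())             # ~ e^{sigma^2/2} = e^2 ≈ 7.39
print("median:", np.median(S))         # = e^mu = 1
print("Q_0.9 :", np.quantile(S, 0.9))  # ~ e^{1.2816 * sigma} ≈ 13.0
```

A mean-minimizing scheduler and a quantile-minimizing scheduler are therefore already looking at very different numbers before any queueing enters.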

6. Quantile approximation for sums of lognormals

The distribution of a sum of lognormal random variables has no closed form. We use the classical Fenton–Wilkinson approximation: approximate \(T_i\) by a lognormal distribution matching its mean and variance. Define

\[ \sigma_{T_i}^2 := \ln\!\left(1+\frac{\mathrm{Var}(T_i)}{\mathbb{E}[T_i]^2}\right) = \ln\!\left(1+\frac{v_i}{n_i m_i^2}\right), \qquad \mu_{T_i} := \ln(n_i m_i) - \frac12\sigma_{T_i}^2. \]

Then the \(p\)-quantile is approximated by

\[ Q_p(T_i)\approx \exp(\mu_{T_i} + z_p \sigma_{T_i}), \]

where \(z_p\) is the standard normal quantile. Substituting \(v_i\) yields the especially interpretable relation

\[ \sigma_{T_i}^2 = \ln\!\left(1+\frac{e^{\sigma_i^2}-1}{n_i}\right). \]

Dispersion decreases with \(n_i\) but increases rapidly with \(\sigma_i\). Thus the quantile geometry depends on both queue size and tail heaviness.
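
The approximation is easy to check against Monte Carlo. A sketch, using scipy only for the standard normal quantile (parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

def fw_log_quantile(n: int, mu: float, sigma: float, p: float) -> float:
    """ln Q_p(T) under the Fenton-Wilkinson lognormal approximation,
    for T = sum of n i.i.d. LogNormal(mu, sigma^2) service times."""
    sigma_T2 = np.log1p((np.exp(sigma**2) - 1) / n)
    mu_T = np.log(n) + mu + sigma**2 / 2 - sigma_T2 / 2  # ln(n m) - sigma_T^2 / 2
    return mu_T + norm.ppf(p) * np.sqrt(sigma_T2)

rng = np.random.default_rng(2)
n, mu, sigma, p = 20, 0.0, 1.0, 0.9
T = rng.lognormal(mu, sigma, size=(200_000, n)).sum(axis=1)
print("Fenton-Wilkinson:", np.exp(fw_log_quantile(n, mu, sigma, p)))
print("Monte Carlo     :", np.quantile(T, p))
```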

7. A tail-aware allocation rule

A natural risk-sensitive analogue of the mean rule is to compare estimated quantiles of completion time:

\[ \text{Choose } Q_1\ \text{iff}\ Q_p(T_1) < Q_p(T_2). \]

Taking logarithms and using the approximation above gives the decision boundary

\[ \ln(n_1 m_1) - \tfrac12\sigma_{T_1}^2 + z_p \sigma_{T_1} \;<\; \ln(n_2 m_2) - \tfrac12\sigma_{T_2}^2 + z_p \sigma_{T_2}. \]

This is the same “compare total work” logic as in the mean-optimal rule, except that the dispersion \(\sigma_{T_i}\) now enters explicitly, through the convexity correction \(-\tfrac12\sigma_{T_i}^2\) and, more importantly, the quantile penalty \(z_p\sigma_{T_i}\). As \(p\) increases, the penalty for tail dispersion becomes more severe. In the light-tailed limit \(\sigma_i\to 0\), both terms vanish and the rule reduces to the classical inequality.
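
A minimal decision-rule sketch follows. The parameters are hypothetical, chosen so that the two rules disagree: \(Q_1\) carries slightly less expected work but a much heavier tail.

```python
import numpy as np
from scipy.stats import norm

def fw_log_quantile(n, mu, sigma, p):
    """ln Q_p of the queue's total time (Fenton-Wilkinson approximation)."""
    sigma_T2 = np.log1p((np.exp(sigma**2) - 1) / n)
    return np.log(n) + mu + sigma**2 / 2 - sigma_T2 / 2 + norm.ppf(p) * np.sqrt(sigma_T2)

# Hypothetical queues: Q1 has 2 heavy-tailed jobs, Q2 has 10 light-tailed jobs.
n1, mu1, s1 = 2, 0.0, 1.5
n2, mu2, s2 = 10, -0.4, 0.3
p = 0.9

mean1 = n1 * np.exp(mu1 + s1**2 / 2)  # ≈ 6.2
mean2 = n2 * np.exp(mu2 + s2**2 / 2)  # ≈ 7.0
tail1 = fw_log_quantile(n1, mu1, s1, p)
tail2 = fw_log_quantile(n2, mu2, s2, p)
print("mean rule picks Q", 1 if mean1 < mean2 else 2)  # -> Q1
print("tail rule picks Q", 1 if tail1 < tail2 else 2)  # -> Q2
```

Here the mean rule commits to the heavy-tailed queue, while the \(p=0.9\) rule pays the dispersion penalty and commits to \(Q_2\) instead.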

8. Uniform mixing and tail amplification

It is tempting to allocate compute uniformly: select the next model from \(Q_1\) or \(Q_2\) with probability \(1/2\) while both are nonempty. In the regime \(n_2\ge n_1\), each \(Q_1\) completion is accompanied on average by one interleaved \(Q_2\) job, so the expected time to complete \(Q_1\) under uniform mixing is approximately \(n_1(m_1+m_2)\). The expected slowdown factor relative to committing to \(Q_1\) is therefore

\[ 1+\frac{m_2}{m_1} = 1+\exp\!\Big((\mu_2-\mu_1)+\tfrac12(\sigma_2^2-\sigma_1^2)\Big). \]

Heavy tails matter twice: they inflate means via \(e^{\sigma^2/2}\), and they inflate high quantiles through the additional dispersion term \(z_p\sigma_T\). Uniform mixing therefore inherits both extra mean work and extra tail spread, and can be particularly costly when one queue is near instability (large \(\sigma\)).
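
The double cost shows up in a small coupled simulation. While \(Q_2\) is nonempty, the number of \(Q_2\) jobs served before \(Q_1\)'s \(n_1\)-th completion is negative binomial with parameters \((n_1, 1/2)\); the sketch below uses this fact directly, with hypothetical parameters in which \(Q_2\) is the near-instability family:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 10, 20     # n2 >= n1, as in the text
mu1, s1 = 0.0, 0.5  # Q1: stable family
mu2, s2 = 0.0, 1.5  # Q2: near instability (hypothetical)
trials = 50_000

# Committing to Q1: total time is the sum of its n1 service times.
commit = rng.lognormal(mu1, s1, size=(trials, n1)).sum(axis=1)

# Uniform mixing: number of Q2 jobs interleaved before Q1 finishes is
# NegBin(n1, 1/2), capped at n2 in case Q2 empties first.
extra = np.minimum(rng.negative_binomial(n1, 0.5, size=trials), n2)
mixed = commit + np.array([rng.lognormal(mu2, s2, size=k).sum() for k in extra])

for name, t in (("commit", commit), ("mixed ", mixed)):
    print(f"{name}: mean {t.mean():6.1f}   Q_0.9 {np.quantile(t, 0.9):6.1f}")
```

Reusing the same \(Q_1\) draws for both policies couples the comparison, so the printed gap is attributable entirely to the interleaved \(Q_2\) work.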

9. Adaptive estimation

In practice the parameters \((\mu_i,\sigma_i)\) are unknown. A simple online estimator uses observed training times. After observing \(k\) completed jobs from queue \(i\), set

\[ \hat{\mu}_i := \frac{1}{k}\sum_{j=1}^{k}\ln S_{i,j}, \qquad \hat{\sigma}_i^2 := \frac{1}{k-1}\sum_{j=1}^{k}(\ln S_{i,j}-\hat{\mu}_i)^2. \]

One then computes \(\hat{Q}_p(T_i)\) via the formulas above and allocates compute to the queue with smaller predicted high-percentile completion time. This yields a tail-aware generalization of shortest-expected-work scheduling, tuned to a quantile objective.
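
A sketch of the resulting allocation step, with hypothetical helper names and an arbitrary prior of \(\hat{\sigma}_i = 1\) until at least two jobs have been observed:

```python
import numpy as np
from scipy.stats import norm

class QueueStats:
    """Running log-domain estimates (mu_hat, sigma_hat) for one queue."""

    def __init__(self):
        self.log_times = []

    def observe(self, s: float) -> None:
        self.log_times.append(np.log(s))

    def params(self):
        x = np.asarray(self.log_times)
        if len(x) < 2:  # arbitrary prior until two observations arrive
            return (float(x.mean()) if len(x) else 0.0), 1.0
        return float(x.mean()), float(x.std(ddof=1))

def predicted_log_qp(stats: QueueStats, n_remaining: int, p: float) -> float:
    """ln Q_p of the remaining total time, via the Fenton-Wilkinson formulas."""
    mu, sigma = stats.params()
    sigma_T2 = np.log1p((np.exp(sigma**2) - 1) / n_remaining)
    return (np.log(n_remaining) + mu + sigma**2 / 2
            - sigma_T2 / 2 + norm.ppf(p) * np.sqrt(sigma_T2))

def pick(stats1, k1, stats2, k2, p=0.9):
    """Serve the queue with the smaller predicted p-quantile of remaining work."""
    if k1 == 0:
        return 2
    if k2 == 0:
        return 1
    return 1 if predicted_log_qp(stats1, k1, p) <= predicted_log_qp(stats2, k2, p) else 2
```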

10. Experimental validation in a simpler system

LLM training is the obvious motivating application, but it is expensive and difficult to instrument cleanly. Fortunately, the scheduling phenomenon can be validated in a simpler controlled environment. One convenient choice is to treat each “job” as a full training run of a modest ReLU network on a heterogeneous dataset (for example, the SeatGeek dataset), and to define the service time \(S\) as the time to reach a fixed validation threshold \(M\).

To make heavy tails visible, one can construct two queues that differ primarily in tail behavior: a stable configuration family (smaller \(\sigma\)) and a near-instability family (larger \(\sigma\)). The prediction is that the quantile-aware scheduler improves empirical \(Q_{0.9}\) and \(Q_{0.95}\) time-to-threshold relative to mean-based and uniform-mixing baselines whenever tail dispersion differs substantially across queues.
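
Before running the training experiment itself, the direction of the prediction can be checked on purely synthetic service times. The sketch below reuses the hypothetical queue parameters from the decision-rule example in Section 7; it is a stand-in for the experiment, not the experiment itself:

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 200_000

# Same hypothetical queues as in the Section 7 example.
T1 = rng.lognormal(0.0, 1.5, size=(trials, 2)).sum(axis=1)    # heavy tail, n=2
T2 = rng.lognormal(-0.4, 0.3, size=(trials, 10)).sum(axis=1)  # light tail, n=10

for name, t in (("T1", T1), ("T2", T2)):
    print(f"{name}: mean {t.mean():5.2f}   Q_0.9 {np.quantile(t, 0.9):5.2f}")
# Expected ordering: E[T1] < E[T2] but Q_0.9(T1) > Q_0.9(T2).
```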

11. Concluding remarks

When service times are light-tailed, minimizing expectation is often sufficient. When service times are heavy-tailed, expectation is incomplete. In heavy-tailed training systems, compute allocation should be guided by a risk-sensitive objective such as minimizing \(Q_p(T)\), leading naturally to a boundary that depends on queue size \(n_i\), location \(\mu_i\), and tail heaviness \(\sigma_i\).

The moral is simply that heavy tails change the geometry of scheduling: aggregation tempers tails, mixing can inherit them, and risk-sensitive allocation rules must explicitly account for dispersion.