Training as Selection
DNA and gradient descent might operate on the same invariant. Inspecting the two instances to see what holds.
The Price equation says something simple: if there's covariance between a trait and how much that trait persists, the population-level distribution of the trait shifts. Decompose the change into a selection term (what persists differentially) and a transmission term (how things change as they're passed on), and you have a substrate-neutral accounting identity for directed change. No replication required. No organisms required. Just: traits that covary with persistence become more common.

1. Definition — Price equation (1970). A tautology — true by construction — but a useful one, because it tells you where to look. Δz̄ = Cov(w, z)/w̄ + E(wΔz)/w̄, where z = trait value and w = fitness. The first term, Cov(w, z)/w̄, is selection: traits correlated with fitness become more common. The second, E(wΔz)/w̄, is transmission bias: systematic change during retention.
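The accounting is easy to verify numerically. Here is a toy check of the decomposition; the population, fitnesses, and trait changes are invented purely for illustration:

```python
import numpy as np

# Toy population (all values invented): trait z, fitness w, and the change
# in trait during transmission dz.
z = np.array([1.0, 2.0, 3.0, 4.0])    # parent trait values
w = np.array([0.5, 1.0, 1.5, 2.0])    # fitness (differential persistence)
dz = np.array([0.1, -0.2, 0.0, 0.3])  # systematic change during retention

w_bar = w.mean()
z_bar = z.mean()

# Next-generation mean trait: offspring carry z + dz, weighted by fitness.
z_bar_next = np.sum(w * (z + dz)) / np.sum(w)

# Price decomposition of the change.
selection = np.cov(w, z, bias=True)[0, 1] / w_bar  # Cov(w, z) / w̄
transmission = np.mean(w * dz) / w_bar             # E(w Δz) / w̄

# The identity holds exactly, by construction.
assert np.isclose(z_bar_next - z_bar, selection + transmission)
```

The `bias=True` flag makes `np.cov` use population (1/N) normalization, matching the expectation-based form of the identity.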
I know of two systems that move through Price-space in interestingly parallel ways. One produced DNA. The other produces neural network weights. I want to inspect them side by side — not to argue they're identical, but to see where their trajectories align and where they diverge.

2. Aside — Universal Darwinism (Dawkins, Dennett, Campbell) is a downstream claim: that any system with variation, differential fitness, and retention will produce adapted structure. That's a stronger claim than the Price equation makes — it specifies which systems exhibit Price dynamics, not just what those dynamics look like. I'm starting from Price because it's weaker and harder to argue with.
The invariants
Both systems exhibit the selection term. Structures that covary positively with persistence under pressure become more prevalent. In biology, organisms vary, environments filter, reproduction retains what survives. In training, the model's internal geometry shifts under gradient noise, the loss function and regularization filter, and the weight update retains what reduces loss.
Both produce environment-relative fitness. In my grokking work, a compact Fourier circuit exists inside the model during memorization — it's present by step 700. But it isn't fit yet. Injecting pre-learned Fourier neurons into a memorizing model doesn't accelerate grokking — the transplanted neurons neither help nor hurt. The circuit's fitness depends on the geometric environment it sits in, not on its intrinsic properties. An organism perfectly adapted to the tundra dies in the desert. A circuit perfectly structured for generalization is irrelevant if the surrounding geometry routes computation through the lookup table instead.

3. Source — This is closer to niche construction than naive selection — the environment that determines fitness is partly constructed by the other structures in the system.
At the grokking transition, effective rank of per-sample gradients drops 30–40%. Individual samples stop pulling in independent directions and align to shared structure. In Price terms: the covariance between "alignment with shared gradient direction" and "persistence under weight update" spikes. The gradient rank collapse is a tighter structural parallel to selection than most of what people usually point to. And the frequency of micro-instabilities in gradient agreement during training predicts the grokking gap with r = 0.896 (n = 30, p < 0.0001) — perturbation frequency predicting how fast the dominant configuration gets dislodged.

4. Aside — That correlation is stronger than any single hyperparameter at predicting grokking timing. If you're looking for evidence that Price-style selection dynamics operate at the level of internal training dynamics, this is probably the best piece I have.
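For concreteness, the effective-rank measurement can be sketched like this. This is not the original experimental code: the entropy-based rank definition (Roy & Vetterli) and the synthetic gradient matrices below are my assumptions, chosen to show the delocalized-vs-concentrated contrast.

```python
import numpy as np

def effective_rank(G: np.ndarray) -> float:
    """Entropy-based effective rank of a matrix of per-sample gradients.

    G has shape (n_samples, n_params): one flattened gradient per row.
    Definition (Roy & Vetterli): exp of the Shannon entropy of the
    normalized singular value distribution.
    """
    s = np.linalg.svd(G, compute_uv=False)
    s = s[s > 1e-12]
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)

# Pre-transition stand-in: per-sample gradients pull in independent directions.
G_delocalized = rng.normal(size=(64, 256))

# Post-transition stand-in: gradients mostly align to a few shared directions,
# plus small independent noise.
shared = rng.normal(size=(4, 256))
coeffs = rng.normal(size=(64, 4))
G_localized = coeffs @ shared + 0.1 * rng.normal(size=(64, 256))

print(effective_rank(G_delocalized))  # high: near the sample count
print(effective_rank(G_localized))    # low: near the number of shared directions
```

The drop between the two regimes is exactly the kind of concentration the selection reading points at: mass in gradient space localizing onto shared structure.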
Both produce dynamic equilibria rather than fixed endpoints. This is where the Price framing might earn its keep over standard optimization, which predicts convergence to a fixed point. The character of the equilibrium depends on selection pressure strength — and the grokking data shows exactly this. At weight decay 0.25: slow erosion, catastrophic collapses, full re-grokking from scratch. Extinction and recolonization. At weight decay 1.0: regular oscillation with ~150-step period, clean sinusoidal weight norm. Stable coexistence. At weight decay 2.0: gentle hovering near the generalization boundary, no sharp crashes. Ecology predicts that pressure strength determines the character of the equilibrium — boom-bust vs. stable oscillation vs. gentle fluctuation. Optimization theory can derive each case individually, but doesn't naturally predict the taxonomy.
The character of these equilibria maps onto ecological regimes more naturally than optimization language.

5. Aside — This is different from edge-of-stability, where sharpness oscillates at the 2/lr boundary. The grokking oscillation has weight-decay-dependent frequency, correlates with weight norm sinusoidally, and tracks circuit-level structure swaps. Different phenomenon, different mechanism.
The differences — or: where the trajectories through Price-space diverge
The big one is what the substrate does for free.
DNA is about 750MB. The same 3 billion base pairs sit in every cell — liver, brain, embryo, 80-year-old. Every cell reads the same genome differently depending on context, and the machinery doing the reading is itself encoded in the genome. The whole system is self-referential, recursively compressed, and distributed across every cell in the organism.

6. Aside — The recursion is worth sitting with. The genome encodes the proteins that read the genome. The developmental program builds the structures that execute the developmental program. This self-reference operates across different timescales (evolution, development, gene expression) and different substrates (DNA, proteins, cells, organs) simultaneously.
This works because DNA doesn't compute solutions. It distributes structure across a physical substrate and lets physics do the work — protein folding, chemical gradients, thermodynamics. The genome encodes instructions that chemistry executes. Selection operates on the results. No logic, no search, no backprop.
Neural networks can't offload to their substrate. Matrix multiplies on GPUs are general-purpose — they don't do any problem-specific work for free. Every selection pressure has to be computationally induced: we calculate loss, we calculate gradients, we apply weight decay, we backpropagate. The cost of training is the cost of manually simulating selection dynamics that the physical world provides to biology automatically. The genome is 750MB and builds an entire organism. A frontier model is hundreds of gigabytes and that's just the geometry — the selection process that shaped it cost orders of magnitude more.
In Price terms, the transmission term is doing radically different work in the two systems. Biology's transmission term is rich — development, epigenetics, niche inheritance — because the physical substrate provides a dense, nonlinear channel between generations. Training's transmission term is thin: just the weight update rule. Almost everything is in the selection term. I suspect that asymmetry matters for open-endedness — biological evolution keeps producing qualitatively new structure, training produces diminishing returns on scale — but I don't have an argument for why a thin transmission term would cause that, just the correlation.
The individuation problem
The harder divergence is what the natural individuals are.
Biological selection operates on discrete organisms that reproduce. The Price equation requires you to specify a population and a trait. Biology hands you both: organisms, born and dying, varying and being filtered. Gradient descent operates on continuous geometry. There aren't really individuals being selected — it's more like regions of activation space that are more or less load-bearing at different moments.

7. Source — Merrill et al.'s "A Tale of Two Circuits" treats grokking as competition between sparse and dense subnetworks — the closest thing to a discrete-populations framing in the literature. Nanda et al.'s "Progress Measures for Grokking" identifies three training phases that map onto variation → competition → fixation. But both decompositions might be imposed rather than discovered.
This dissolves one version of the problem. Quasispecies theory (Eigen & Schuster, 1977) was developed for exactly this situation: RNA viruses with mutation rates so high that individual genomes aren't persistent units. The natural object isn't any particular sequence — it's a probability distribution over sequence space. Selection operates on the distribution directly. The "master sequence" is just the mode, not a privileged individual.

8. Definition — A quasispecies is a cloud of related sequences maintained by mutation-selection balance. The key result: there's an error threshold — a critical mutation rate above which the distribution delocalizes, losing concentration around the fittest sequence. Below the threshold, selection maintains a tight peak. Above it, the population disperses into sequence space and "memory" of the fit configuration is lost.
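The error threshold is visible in a minimal deterministic sketch of the single-peak model. The parameters here are invented, and back mutation to the master sequence is ignored for simplicity:

```python
import numpy as np

def master_frequency(A=10.0, L=50, u=0.02, steps=2000):
    """Deterministic single-peak quasispecies, ignoring back mutation.

    A: fitness of the master sequence (all other sequences have fitness 1).
    L: sequence length; u: per-site mutation rate.
    Q = (1 - u)^L is the probability of copying the master without error.
    """
    Q = (1 - u) ** L
    x = 0.5  # initial master frequency
    for _ in range(steps):
        w_bar = A * x + (1 - x)  # mean fitness
        x = A * Q * x / w_bar    # replicate, mutate, renormalize
    return x

# The master persists only while Q > 1/A, i.e. u < ln(A)/L ≈ 0.046 here.
print(master_frequency(u=0.02))  # below threshold: distribution stays concentrated
print(master_frequency(u=0.08))  # above threshold: master frequency collapses toward 0
```

Below the threshold the iteration settles at a positive equilibrium, (AQ − 1)/(A − 1); above it the only fixed point is zero, which is the delocalization the definition describes.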
This is structurally close to the training situation. Neural circuits aren't discrete persistent units either — they're patterns in continuous weight space that are more or less concentrated. And the grokking transition looks like an inverse error threshold. Pre-grokking, per-sample gradients pull in independent directions — the distribution over gradient space is delocalized. At the transition, gradient rank collapses 30–40% and the distribution concentrates onto shared structure. The quasispecies frame gives a name to what's happening without requiring you to individuate the units first.

9. Aside — The parallel isn't exact — in quasispecies the error threshold is a loss of concentration (too much mutation), while grokking is a gain of concentration (weight decay eroding the delocalized memorization solution). But the structural move is the same: selection on distributions, not individuals.
Quasispecies dissolves the individuation problem, but it doesn't fully dissolve the natural coordinates problem. In virology, sequence space is natural — it's the actual chemistry, and the metric structure comes for free. In a neural network, you still have to choose what space to put the distribution over: weight space, activation space, functional space. Each gives a different picture. The gradient rank collapse looks like selection if you decompose into per-sample gradient directions. The Fourier circuit emergence looks like selection if you decompose into functional subnetworks. The quasispecies move is to say: the distribution is the object, not any individual within it. But which distribution?

10. Source — The continuous-space selection framework has other precedents: adaptive dynamics (Dieckmann & Law) describes selection as a vector field on trait space — no discrete individuals needed. The breeder's equation in quantitative genetics is a special case of Price that drops genotype individuation entirely. These are established tools for "selection without population structure."
The competitive release analogy is where this frame still has trouble. In ecology, removing a dominant species lets suppressed competitors rapidly expand into the freed niche. The naive prediction: ablate the memorization circuit, and the Fourier circuit should expand fast because it's already there. But neural circuits share computational infrastructure — they route through the same layers, the same attention heads, the same decoding pathway. Ablating the memorization circuit doesn't free a niche. It garbles the inputs to everything downstream, including the Fourier circuit. Biological competitors occupy niches independently; neural structures can't. Quasispecies theory dissolves the individuation problem, but it doesn't dissolve this one — the superposition of structures on shared substrate is a genuinely different regime.

11. Aside — This might be the sharpest difference between biological and neural selection. In biology, competitors are physically separate organisms in a shared environment. In a neural network, competing structures are superimposed on shared computational substrate.
England's dissipative adaptation work offers a different angle: driven thermodynamic systems that self-organize under energy flux without replication or competition — just matter rearranging into configurations that absorb and dissipate work from the environment. The grokking oscillations could be redescribed in this frame, with the gradient as energy flux and weight decay as dissipation.

12. Source — England, "Statistical physics of self-replication" (2013). The claim: driven systems tend toward configurations that increase entropy production. Structure without selection in the Darwinian sense.
Robustness, not optimality
If the dynamics are selectionist, what's being selected for isn't task performance. It's robustness under the full set of pressures — loss, regularization, architectural bottlenecks, gradient noise. The "fit" structure isn't the one that solves the task best. It's the one that can keep solving it while everything around it erodes.
Biology works this way. What survives isn't what's optimal — it's what's robust. Organisms aren't maximally efficient at any single function. They're tolerant of perturbation across many functions simultaneously. The grokking data suggests something similar: the Fourier circuit doesn't win because it's the best solution to modular arithmetic. It wins because it's compact enough to survive weight decay. Fitness is defined by the selection regime, not the task.
If that's right, then what training produces depends on what pressures you apply at least as much as what task you train on. Different regularization, different architecture, different curriculum — different selection regimes — should produce structures that are robust in different ways, even if they achieve similar task performance. That's a different way of thinking about what the weights encode.
What would settle it
The question is whether this frame generates predictions that diverge from standard optimization accounts.
If the pressure-dependent taxonomy of equilibria from the grokking experiments holds across tasks and architectures, the Price framing is doing organizational work. If it's specific to modular arithmetic, it's a coincidence.
The fastest test might already be answerable with existing data. The selection and dissipative frames make different predictions about what controls the grokking oscillation frequency: a selection account says competition intensity between circuits, a dissipative account says the ratio of drive (gradient magnitude) to damping (weight decay strength). The weight decay sweep varies the drive-to-damping ratio cleanly but doesn't vary competition intensity in any obvious way. If oscillation frequency tracks the ratio, that's evidence for the dissipative frame. I haven't tried fitting both models to the sweep data.
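A minimal version of that model comparison could look like the following. Everything here is a placeholder: the frequency values are synthetic, generated under the dissipative hypothesis, and the wd^0.5 law is an invented stand-in for whatever drive-to-damping dependence the dissipative model actually predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sweep: oscillation frequency at each weight decay setting.
# Synthetic by construction; replace with the real sweep measurements.
wd = np.array([0.25, 0.5, 1.0, 1.5, 2.0])
freq = 0.004 * np.sqrt(wd) * (1 + 0.05 * rng.normal(size=wd.size))

def log_fit_residual(x, y, model):
    """Sum-of-squares residual of log y against a candidate law, up to scale."""
    pred = np.log(model(x))
    offset = np.mean(np.log(y) - pred)  # best-fit overall scale factor
    return float(np.sum((np.log(y) - pred - offset) ** 2))

# Dissipative-style law: frequency tracks damping strength (here wd^0.5).
res_dissipative = log_fit_residual(wd, freq, np.sqrt)
# Competition-style null: frequency does not depend on weight decay at all.
res_flat = log_fit_residual(wd, freq, np.ones_like)

print(res_dissipative < res_flat)  # True on this synthetic data, by construction
```

On real sweep data, whichever law leaves the smaller residual is the frame the oscillation frequency favors; the point is only that the comparison needs nothing beyond the existing sweep.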
The harder test is historical contingency as niche construction. The selection frame predicts that early-established circuits don't just occupy good basins — optimization theory knows about path dependence — but actively shape what can develop later by constructing the computational environment. Train on task A then task B, and the structures you get for B should be constrained not just by B's loss landscape but by A's circuits actively routing computation. Reverse the order, get qualitatively different B-circuits — not just different performance, different internal structure. The niche construction version makes a specific prediction: the order doesn't just change how fast you learn B, it changes what you learn for B. You could test this by applying sparse autoencoders or circuit-level interpretability tools to models trained under different curriculum orderings and comparing the internal decompositions, not just the accuracy curves.

13. Source — Curriculum learning papers generally measure performance differences from task ordering. The selection-frame prediction is about structural differences — different circuits, not just different accuracy. As far as I know nobody has compared internal structure across curriculum orderings with interpretability tools.
If none of these produce results that diverge from optimization predictions, the frame is a costume. If any do, it's pointing at dynamics that the standard account misses.
The most precise version of the invariant might not be "selection" at all. It might be: distributions that concentrate under pressure. That's what the grokking transition is — gradient rank collapsing, the distribution over update directions localizing onto shared structure. That's what the quasispecies error threshold is — a distribution delocalizing when pressure weakens past a critical point. That's what the weight decay regimes show — different pressures producing different concentration dynamics. "Selection" is one name for this, but it carries baggage about individuals and competition that the actual dynamics might not need.
Whether the CIFAR result means the frame breaks on real tasks, or just that CIFAR doesn't have the right structure to produce dramatic phase transitions, is open. Whether quasispecies is the right level of abstraction or still too biological is open. But "distributions concentrating under pressure" is substrate-neutral, doesn't require individuation, and organizes the grokking data better than "optimizer converges to minimum."