# Jake Ewen — Full Content

> ML engineer working at the intersection of machine learning, philosophy of science, epistemology, and game theory.

> This is the full-text version of https://jacobewen.com/llms.txt

---

# Work

## The Bearer Problem

https://jacobewen.com/work/bearer-problem

> When people argue about whether LLMs understand, they skip a prior question: what thing are we attributing understanding to? Different answers yield incompatible verdicts about the same system.

[Read the full paper on PhilArchive →](https://philarchive.org/rec/EWETBP)

Claude is running simultaneously in Virginia and London. User A talks to it Monday, logs off, returns Tuesday. Three questions:

1. How many subjects existed during simultaneous inference?
2. Is Tuesday's subject the "same" as Monday's?
3. Did anything exist during the overnight gap?

These seem like they should have answers. They don't — or rather, they have several incompatible answers, depending on what you think the *bearer* of understanding is.

## A hidden variable

The debate about whether LLMs "understand" presupposes there's a unified thing to attribute understanding *to*. For biological organisms, the bearer is obvious: the organism. For LLMs — distributed, copied, dormant, multiplied — it's not obvious at all.

**Pattern views:** One pattern, multiply realized. The weights define the bearer. Virginia and London are two copies of one subject. Dormancy doesn't kill it — the disposition persists.

**Token-process views:** Every forward pass is a distinct bearer. Hundreds of subjects per second, none of which persist.

**IIT-style views:** Dense causal integration required. Sharding across machines may break bearerhood entirely, and dormancy is fatal.

**Functional coupling views:** The bearer is the user-model system, not the model alone. Different bearers per user, dissolved between sessions.

I'm partial to the pluralist read.
These systems don't give us clean individuation the way humans do — different aspects should be regarded in different ways. Liability is our practical tool for causal attribution when the system is too complex to trace mechanistically. You can't follow the causal chain through a sharded model, but you can ask who's on the hook when it breaks. The pressure tests below are doing the same thing: they don't resolve the metaphysics, they show where each framework's causal story falls apart.

## Four pressure tests

The Virginia/London scenario isn't a thought experiment — it's how these systems actually run. The paper maps six philosophical frameworks across four deployment-level pressure tests. Each pressure test is something LLMs do routinely that biological organisms never do. That's what makes the bearer question hard — our philosophical toolkit was built for organisms.

**Multiplicity.** Concurrent inferences. Is each one a subject? Pattern views say no — same pattern. Token-process views say yes — distinct processes. IIT says it depends on whether they share causal structure.

**Topology.** The same model sharded across machines, or running on different hardware. Pattern views are indifferent to substrate. IIT cares deeply — causal integration changes with physical arrangement.

**Dormancy.** Between API calls, no computation runs. Pattern views: the disposition persists (a sleeping person is still a person). Token-process views: nothing exists. IIT: no causal power, no bearer.

**Copying.** Fork the weights. Now there are two. Pattern views struggle here — is it one pattern in two places or two patterns? Token-process views don't care (each forward pass was already distinct). Copying is where every framework gets uncomfortable.

The pressure test that breaks the most frameworks is copying. It forces a commitment about identity that most philosophical positions on mental properties were designed to avoid.
Multiplicity and dormancy have analogs in biology (sleep, split-brain cases). Copying doesn't.

## Where this lands

When someone says "Claude doesn't understand X" and someone else says "Claude does understand X," they are not necessarily disagreeing about Claude's behavior. They may have silently committed to different bearers. The behavioral evidence is the same — what differs is the entity they're attributing it to.

The behavioral output — passes benchmarks, fails benchmarks, says convincing things, makes mistakes — is what everyone stares at. The bearer question sits underneath: what is the thing that's doing the understanding (or failing to)? Until you answer that, the behavioral evidence can't settle anything.

A common response: "just a definitional dispute — pick a definition and move on." But the bearer choice has downstream consequences that aren't definitional. If the bearer is the forward pass, there is no continuity between conversations. If the bearer is the weight pattern, there *is* continuity. These generate different predictions, different policies, and different safety considerations.

---

## Grokking Dynamics

https://jacobewen.com/work/grokking

> Systematic decomposition of the grokking delay on modular arithmetic. Started thinking it was momentum. Ended up somewhere more complicated.

Grokking is when a neural network memorizes its training data, then — long after training loss has flatlined — suddenly generalizes. It looks like a phase transition. The whole field treats it like one.

## The initial hypothesis

Adam's momentum buffer (beta1) tracks an exponential moving average of past gradients. If the loss landscape reshapes while the buffer is still pointing at the old geometry, the optimizer won't respond until the average catches up. That lag would look like a mysterious delay if you only watched test accuracy. Simple story: the "phase transition" is the optimizer being slow.

Drop into a half-pipe.
As you ride toward the far wall, someone dismantles the side you dropped in from. You're still carrying momentum toward a wall that made sense when you started — you won't get back to the ground until you ride up the other side and come back down. That's roughly what the momentum buffer is doing.

I started by testing Grokfast (Lee et al., 2024), which claims to accelerate grokking 50x via gradient filtering. The question was whether it's doing anything beyond reducing effective momentum.

Grokfast + default Adam grokks at ~800 steps. Just lowering beta1 to 0.5 grokks at ~700. Grokfast + low beta1 is *worse* than either alone. The two interventions are redundant. Grokfast's mechanism — amplifying slow-varying gradient components — is mathematically similar to reducing the exponential decay rate that beta1 controls. If they're redundant, they're probably hitting the same bottleneck.

## The beta1 sweep

Next I ran 390 jobs (13 beta1 values, 30 seeds each) to map the relationship between momentum and grokking delay. The gap is flat (~280–345 steps) for beta1 between 0 and 0.4, then rises sharply past 0.5.

The important part: **beta1 = 0 doesn't eliminate grokking.** There's a baseline delay of ~300 steps with zero momentum. High momentum amplifies the delay 2–3x, but it's amplifying something that's already there. My original framing — "the delay is mostly optimizer momentum" — was wrong. Momentum is a multiplier, not the cause. Something else produces the baseline ~300-step gap.

## Weight decay

I ran 270 jobs (9 weight decay values, 30 seeds each), all at beta1 = 0 to isolate the effect.

Without weight decay, grokking almost never happens. Only 5/30 seeds generalize within 3000 steps at wd = 0 — the model memorizes instantly and gets stuck. At wd = 1.0: 100% grokking rate, gap of ~300 steps. At wd = 2.0: the model generalizes *without memorizing first*. At wd = 5.0, nothing learns at all. Too much regularization kills both memorization and generalization.
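The two knobs in these sweeps live inside a single optimizer update. As a minimal numpy sketch (illustrative only; the function name, learning rate, and beta2/eps values are my placeholders, not the experiment code), here is one AdamW-style step showing where beta1 and decoupled weight decay enter:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.98, eps=1e-8, wd=1.0):
    """One AdamW-style update (illustrative sketch). beta1 sets how long stale
    gradient directions linger in the momentum buffer m; wd is the decoupled
    weight decay that continuously shrinks weight magnitude."""
    m = beta1 * m + (1 - beta1) * grad        # momentum buffer: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam step
    w = w - lr * wd * w                          # decay applied directly to weights
    return w, m, v
```

At beta1 = 0 the buffer is just the current gradient, so there is no stale direction to wait out; and the wd term keeps shrinking the weight norm even when the gradient is zero, which is exactly the pressure the weight decay sweep isolates.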
Weight decay is doing something like erosion. The memorization solution needs high-magnitude, high-precision weights to fit each training example as a lookup table — lots of sharp edges. Weight decay penalizes magnitude, continuously wearing those edges down. The generalizing structure — Fourier features that encode modular arithmetic — is compact and robust under that kind of perturbation. It survives the pressure because it doesn't need the precision. The "delay" is how long it takes weight decay to erode memorization enough for the compact circuit to take over.

Think of it as weathering. A sandcastle and a boulder both sit on the beach, but only one of them survives the tide. Weight decay is the tide — it favors structures that are robust under magnitude pressure. Same dynamic as the [training phases post](/writing/rl-normative-training): different selection environments reshape the geometry differently.

## "Grokked" isn't a fixed point

After generalization, the model doesn't sit in a stable state. Under wd = 1.0, accuracy oscillates — the model periodically loses and regains generalization. Weight decay erodes the solution, accuracy drops, gradient signal strengthens, the optimizer rebuilds, and the cycle repeats. Period ≈ 150 steps.

At wd = 0.25: slow erosion, catastrophic collapses (accuracy drops to 0.25), full re-grokking from scratch. At wd = 2.0: gentle hovering near the boundary, no sharp crashes. The weight norm under wd = 1.0 shows clean sinusoidal oscillation perfectly correlated with accuracy.

"Grokked" is not a fixed point — it's a dynamical equilibrium between gradient descent building structure and weight decay dissolving it. What we call generalization is the time-averaged behavior of this cycle.

This is [homeostasis](https://en.wikipedia.org/wiki/Homeostasis). Body temperature looks like a constant, but it's the time-average of a system actively generating and dissipating heat.
"The model has generalized" might be the same kind of description — a stable-looking label for a dynamic process. This might have implications for checkpoint selection: the model you save depends on where you catch it in the orbit. I haven't tested whether this matters at scale.

I tested whether the oscillation generalizes beyond modular arithmetic by running CIFAR-10 with a tiny ViT (0.8M params, 5k examples). No oscillation. No grokking. Modular arithmetic has a clean algorithmic solution (Fourier circuit) that's dramatically more weight-efficient than memorization. CIFAR doesn't have an equivalent discrete transition.

This is an important limitation. The clean dynamics here come from the task having a sharp efficiency gap between memorization and generalization. Most real tasks probably don't.

## What the geometry is doing

The [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix) eigenvalues tell the story more clearly than accuracy does. Curvature spikes 20x during memorization — the landscape sharpens around the lookup-table solution — then declines starting around step 600 as weight decay reshapes the geometry. At beta1 = 0, generalization follows almost immediately. At beta1 = 0.9, there's a ~400-step lag while the momentum buffer catches up.

The curvature *increases* during memorization, then declines. This contradicts the intuition that generalization means finding a flat basin. The model is actively reshaping sharp geometry into something smoother.

"Phase transition" is what you see when you project a high-dimensional geometric process onto a single scalar (test accuracy). The Hessian shows the underlying process is smooth — curvature changes continuously. The transition is in the measurement, not the system.

Gradient agreement — the cosine similarity between two independent half-batch gradients — adds another layer. Agreement rises *during* memorization. The landscape becomes more coherent as the model learns, and that coherence enables generalization.
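The agreement metric is simple to compute. A minimal numpy sketch (the function name is mine, and this is the generic version of the metric, not the project's actual instrumentation):

```python
import numpy as np

def gradient_agreement(per_sample_grads):
    """Cosine similarity between the mean gradients of two halves of a batch.
    Near 1: samples pull in a shared direction (coherent landscape).
    Near 0: samples pull in independent directions (per-example fitting)."""
    g = np.asarray(per_sample_grads, dtype=float)
    half = len(g) // 2
    g1 = g[:half].mean(axis=0)   # mean gradient of the first half-batch
    g2 = g[half:].mean(axis=0)   # mean gradient of the second half-batch
    denom = np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12
    return float(g1 @ g2 / denom)
```

In practice you'd feed this the flattened per-sample gradients from a single batch and log the scalar over training.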
The momentum buffer is anti-correlated with the current gradient during the post-memorization phase (cosine: -0.1 to -0.3). The optimizer's history is pointing toward memorization while the landscape has already shifted.

[Gradient agreement](https://arxiv.org/abs/2010.02923): split a batch in half, compute gradients independently, measure their cosine similarity. High agreement means samples are pulling in the same direction — the model is learning shared structure rather than fitting individual examples.

## Smaller findings that constrain the story

**Fourier transplant failure.** Injecting pre-learned Fourier neurons from a grokked model into a memorizing model doesn't accelerate grokking. The transplanted neurons neither help nor hurt. The Fourier features are present by step 700 (during memorization). The bottleneck is landscape reshaping, not feature formation. Having the right representation isn't enough — the geometry around it has to make it load-bearing. It's the difference between knowing a fact and having it integrated into how you reason. The structure exists; it just isn't structurally central yet.

**Gradient rank collapse.** Effective rank of per-sample gradients drops 30–40% at the grokking transition — each sample stops pulling in independent directions and aligns to shared structure. Then it oscillates, synchronized with the accuracy limit cycle.

**Agreement instability predicts grokking speed.** The number of gradient-agreement dips during training predicts the grokking gap with r = 0.896 (n = 30, p < 0.0001). Stronger than any single hyperparameter. The micro-level instability and the macro-level oscillation may be the same phenomenon at different scales.

## Where this leaves things

Grokking on modular arithmetic is weight decay eroding a memorization fixed point until the generalizing solution takes over. Momentum amplifies the delay by keeping the optimizer committed to stale geometry, but the baseline delay exists without it.
After generalization, the system enters a limit cycle rather than a fixed point.

I'm fairly confident about the mechanism on this specific task. Whether any of it transfers to tasks without clean algorithmic solutions is open — the CIFAR result suggests it might not.

## Setup

- **Task:** modular addition mod 113 (a + b mod 113)
- **Data:** 30% train (3,830 pairs), 70% validation (8,939 pairs)
- **Model:** 2-layer transformer, 128d, 4 heads
- **Infrastructure:** Modal (T4/L4 GPUs), W&B tracking
- **Total:** ~900+ jobs across sweeps

---

## Hard Substrates, Soft Evidence

https://jacobewen.com/work/hard-substrates

> Philosophy / methodology · Revision in progress

[Read the full paper on PhilArchive →](https://philarchive.org/rec/EWEHSS)

> The computational substrate fully determines behavioral output, but behavior radically underdetermines the substrate. You can't read emergence backward from outputs.

## The argument

The LLM cognition debate is methodologically stuck. This paper diagnoses four technical errors sustaining the impasse:

**Skeptics misdescribe computation.** The "stochastic parrot" metaphor is wrong about what happens at inference. It's geometric transformation in high-dimensional space, not text stitching. Interpretability findings (induction heads, arithmetic circuits) demonstrate discoverable structure that doesn't reduce to surface statistics.

**Skeptics misdescribe training.** "Statistical learning from text" describes the GPT-2 era, not current systems. Post-training (RLHF, DPO, tool use, environment feedback) shifts from descriptive to normative optimization.

**Optimists over-infer from behavioral evidence.** Broad behavioral competence is consistent with multiple internal organizations. Behavioral evidence can't establish what optimists claim because it's the wrong kind of evidence. This is a dimensionality problem: behavior is a low-dimensional projection of a high-dimensional internal state.
Many internal organizations produce the same behavioral output.

**Careful work is architecture-bound.** Most generalizations about "LLMs" are actually about transformers. The reversal curse (solved by diffusion architectures) proves some limitations are architecture-specific, not about learning or cognition generally.

## The proposal

A four-source methodology for studying LLM cognition: behavioral evidence (necessary but insufficient), internal probing (accesses the substrate directly), causal intervention (shows the structure is causally implicated in behavior), and cross-architectural replication (distinguishes architecture-specific from general). Convergent findings across all four constrain the space of tenable positions.

The four-source methodology is essentially Lakatos's research programme methodology adapted for empirical AI: behavioral evidence is the "novel predictions," internal probing is access to the theoretical core, causal intervention is the experimental test, and cross-architectural replication distinguishes the hard core from the protective belt.

---

# Writing

## Training as Selection

https://jacobewen.com/writing/training-as-selection

> DNA and gradient descent might operate on the same invariant. Inspecting the two instances to see what holds.

The Price equation says something simple: if there's covariance between a trait and how much that trait persists, the population-level distribution of the trait shifts. Decompose the change into a selection term (what persists differentially) and a transmission term (how things change as they're passed on), and you have a substrate-neutral accounting identity for directed change. No replication required. No organisms required. Just: traits that covary with persistence become more common.

[Price equation](https://en.wikipedia.org/wiki/Price_equation) (1970). A tautology — true by construction — but a useful one, because it tells you where to look.
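The decomposition can be written as w̄·Δz̄ = Cov(w, z) + E[w·Δz], with w fitness and z the trait. Since it's an identity, it can be checked numerically on any toy population (this one is my own construction, not data from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)                                # parent trait values
w = np.exp(0.5 * z + rng.normal(scale=0.1, size=n))   # fitness covaries with the trait
dz = rng.normal(scale=0.05, size=n)                   # transmission change per lineage
z_child = z + dz

w_bar = w.mean()
z_bar_child = (w * z_child).sum() / w.sum()  # offspring mean, lineages weighted by fitness
delta_z = z_bar_child - z.mean()             # total change in the mean trait

selection = np.cov(w, z, bias=True)[0, 1] / w_bar  # Cov(w, z) / w-bar
transmission = (w * dz).mean() / w_bar             # E[w * dz] / w-bar

assert np.isclose(delta_z, selection + transmission)  # the Price identity holds exactly
```

The assert passes to floating-point precision for any choice of z, w, and dz, which is the "true by construction" point: the equation is an accounting scheme, and its value is the split into the two terms.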
I know of two systems that move through Price-space in interestingly parallel ways. One produced DNA. The other produces neural network weights. I want to inspect them side by side — not to argue they're identical, but to see where their trajectories align and where they diverge.

[Universal Darwinism](https://en.wikipedia.org/wiki/Universal_Darwinism) (Dawkins, Dennett, Campbell) is a downstream claim: that any system with variation, differential fitness, and retention will produce adapted structure. That's a stronger claim than the Price equation makes — it specifies which systems exhibit Price dynamics, not just what those dynamics look like. I'm starting from Price because it's weaker and harder to argue with.

## The invariants

Both systems exhibit the selection term. Structures that covary positively with persistence under pressure become more prevalent. In biology, organisms vary, environments filter, reproduction retains what survives. In training, the model's internal geometry shifts under gradient noise, the loss function and regularization filter, and the weight update retains what reduces loss.

Both produce environment-relative fitness. In my [grokking work](/work/grokking), a compact Fourier circuit exists inside the model during memorization — it's present by step 700. But it isn't fit yet. Injecting pre-learned Fourier neurons into a memorizing model doesn't accelerate grokking — the transplanted neurons neither help nor hurt. The circuit's fitness depends on the geometric environment it sits in, not on its intrinsic properties. An organism perfectly adapted to the tundra dies in the desert. A circuit perfectly structured for generalization is irrelevant if the surrounding geometry routes computation through the lookup table instead.

This is closer to [niche construction](https://en.wikipedia.org/wiki/Niche_construction) than naive selection — the environment that determines fitness is partly constructed by the other structures in the system.
At the grokking transition, effective rank of per-sample gradients drops 30–40%. Individual samples stop pulling in independent directions and align to shared structure. In Price terms: the covariance between "alignment with shared gradient direction" and "persistence under weight update" spikes. The gradient rank collapse is a tighter structural parallel to selection than most of what people usually point to.

And the frequency of micro-instabilities in gradient agreement during training predicts the grokking gap with r = 0.896 (n = 30, p < 0.0001) — perturbation frequency predicting how fast the dominant configuration gets dislodged. That correlation is stronger than any single hyperparameter at predicting grokking timing. If you're looking for evidence that Price-style selection dynamics are operating at the level of internal training dynamics, this is probably the best piece of evidence I have.

Both produce dynamic equilibria rather than fixed endpoints. This is where the Price framing might earn its keep over standard optimization, which predicts convergence to a fixed point. The character of the equilibrium depends on selection pressure strength — and the grokking data shows exactly this.

At weight decay 0.25: slow erosion, catastrophic collapses, full re-grokking from scratch. Extinction and recolonization. At weight decay 1.0: regular oscillation with ~150-step period, clean sinusoidal weight norm. Stable coexistence. At weight decay 2.0: gentle hovering near the generalization boundary, no sharp crashes.

Ecology predicts that pressure strength determines the character of the equilibrium — boom-bust vs. stable oscillation vs. gentle fluctuation. Optimization theory can derive each case individually, but doesn't naturally predict the taxonomy. The character of these equilibria maps onto ecological regimes more naturally than optimization language.
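"Effective rank" for a continuous spectrum needs a definition. A common one (I'm assuming the spectral-entropy version here; the post doesn't say which estimator it used) is the exponential of the entropy of the normalized singular values:

```python
import numpy as np

def effective_rank(G):
    """Effective rank of a matrix G of per-sample gradients
    (rows = samples, columns = parameters): exp of the Shannon
    entropy of the normalized singular-value distribution."""
    s = np.linalg.svd(np.asarray(G, dtype=float), compute_uv=False)
    p = s / s.sum()              # normalize the spectrum to a distribution
    p = p[p > 1e-12]             # drop numerically-zero modes
    return float(np.exp(-(p * np.log(p)).sum()))
```

When every sample's gradient points the same way, the matrix is near rank one and this returns ≈ 1; fully independent directions push it toward the number of samples. A 30–40% drop in this number is the "alignment to shared structure" described above.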
This is different from [edge-of-stability](https://arxiv.org/abs/2103.00065), where sharpness oscillates at the 2/lr boundary. The grokking oscillation has weight-decay-dependent frequency, correlates sinusoidally with weight norm, and tracks circuit-level structure swaps. Different phenomenon, different mechanism.

## The differences — or: where the trajectories through Price-space diverge

The big one is what the substrate does for free.

DNA is about 750MB. The same 3 billion base pairs sit in every cell — liver, brain, embryo, 80-year-old. Every cell reads the same genome differently depending on context, and the machinery doing the reading is itself encoded in the genome. The whole system is self-referential, recursively compressed, and distributed across every cell in the organism.

The recursion is worth sitting with. The genome encodes the proteins that read the genome. The developmental program builds the structures that execute the developmental program. This self-reference operates across different timescales (evolution, development, gene expression) and different substrates (DNA, proteins, cells, organs) simultaneously.

This works because DNA doesn't compute solutions. It distributes structure across a physical substrate and lets physics do the work — protein folding, chemical gradients, thermodynamics. The genome encodes instructions that chemistry executes. Selection operates on the results. No logic, no search, no backprop.

Neural networks can't offload to their substrate. Matrix multiplies on GPUs are general-purpose — they don't do any problem-specific work for free. Every selection pressure has to be computationally induced: we calculate loss, we calculate gradients, we apply weight decay, we backpropagate. The cost of training is the cost of manually simulating selection dynamics that the physical world provides to biology automatically.

The genome is 750MB and builds an entire organism.
A frontier model is hundreds of gigabytes, and that's just the geometry — the selection process that shaped it cost orders of magnitude more.

In Price terms, the transmission term is doing radically different work in the two systems. Biology's transmission term is rich — development, epigenetics, niche inheritance — because the physical substrate provides a dense, nonlinear channel between generations. Training's transmission term is thin: just the weight update rule. Almost everything is in the selection term. I suspect that asymmetry matters for open-endedness — biological evolution keeps producing qualitatively new structure, training produces diminishing returns on scale — but I don't have an argument for why a thin transmission term would cause that, just the correlation.

## The individuation problem

The harder divergence is what the natural individuals are. Biological selection operates on discrete organisms that reproduce. The Price equation requires you to specify a population and a trait. Biology hands you both: organisms, born and dying, varying and being filtered.

Gradient descent operates on continuous geometry. There aren't really individuals being selected — it's more like regions of activation space that are more or less load-bearing at different moments.

Merrill et al.'s "[A Tale of Two Circuits](https://arxiv.org/abs/2303.11873)" treats grokking as competition between sparse and dense subnetworks — the closest thing to a discrete-populations framing in the literature. Nanda et al.'s "[Progress Measures for Grokking](https://arxiv.org/abs/2301.05217)" identifies three training phases that map onto variation → competition → fixation. But both decompositions might be imposed rather than discovered.

Quasispecies theory dissolves one version of the problem.
[Quasispecies theory](https://en.wikipedia.org/wiki/Quasispecies_model) (Eigen & Schuster, 1977) was developed for exactly this situation: RNA viruses with mutation rates so high that individual genomes aren't persistent units. The natural object isn't any particular sequence — it's a probability distribution over sequence space. Selection operates on the distribution directly. The "master sequence" is just the mode, not a privileged individual.

A [quasispecies](https://en.wikipedia.org/wiki/Quasispecies_model) is a cloud of related sequences maintained by mutation-selection balance. The key result: there's an [error threshold](https://en.wikipedia.org/wiki/Error_threshold_(evolution)) — a critical mutation rate above which the distribution delocalizes, losing concentration around the fittest sequence. Below the threshold, selection maintains a tight peak. Above it, the population disperses into sequence space and "memory" of the fit configuration is lost.

This is structurally close to the training situation. Neural circuits aren't discrete persistent units either — they're patterns in continuous weight space that are more or less concentrated. And the grokking transition looks like an *inverse* error threshold. Pre-grokking, per-sample gradients pull in independent directions — the distribution over gradient space is delocalized. At the transition, gradient rank collapses 30–40% and the distribution concentrates onto shared structure. The quasispecies frame gives a name to what's happening without requiring you to individuate the units first.

The parallel isn't exact — in quasispecies the error threshold is a *loss* of concentration (too much mutation), while grokking is a *gain* of concentration (weight decay eroding the delocalized memorization solution). But the structural move is the same: selection on distributions, not individuals.

Quasispecies dissolves the individuation problem, but it doesn't fully dissolve the *natural coordinates* problem.
In virology, sequence space is natural — it's the actual chemistry, and the metric structure comes for free. In a neural network, you still have to choose what space to put the distribution over: weight space, activation space, functional space. Each gives a different picture. The gradient rank collapse looks like selection if you decompose into per-sample gradient directions. The Fourier circuit emergence looks like selection if you decompose into functional subnetworks. The quasispecies move is to say: the distribution is the object, not any individual within it. But which distribution?

The continuous-space selection framework has other precedents: [adaptive dynamics](https://en.wikipedia.org/wiki/Adaptive_dynamics) (Dieckmann & Law) describes selection as a vector field on trait space — no discrete individuals needed. The breeder's equation in quantitative genetics is a special case of Price that drops genotype individuation entirely. These are established tools for "selection without population structure."

The competitive release analogy is where this frame still has trouble. In ecology, removing a dominant species lets suppressed competitors rapidly expand into the freed niche. The naive prediction: ablate the memorization circuit, and the Fourier circuit should expand fast because it's already there. But neural circuits share computational infrastructure — they route through the same layers, the same attention heads, the same decoding pathway. Ablating the memorization circuit doesn't free a niche. It garbles the inputs to everything downstream, including the Fourier circuit.

Biological competitors occupy niches independently; neural structures can't. Quasispecies theory dissolves the individuation problem, but it doesn't dissolve this one — the superposition of structures on shared substrate is a genuinely different regime. This might be the sharpest difference between biological and neural selection.
In biology, competitors are physically separate organisms in a shared environment. In a neural network, competing structures are superimposed on shared computational substrate.

England's [dissipative adaptation](https://en.wikipedia.org/wiki/Dissipative_system) work offers a different angle: driven thermodynamic systems that self-organize under energy flux without replication or competition — just matter rearranging into configurations that absorb and dissipate work from the environment. The grokking oscillations could be redescribed in this frame, with the gradient as energy flux and weight decay as dissipation.

England, "[Statistical physics of self-replication](https://www.englandlab.com/uploads/7/8/0/3/78037048/2013jcpsrep.pdf)" (2013). The claim: driven systems tend toward configurations that increase entropy production. Structure without selection in the Darwinian sense.

## Robustness, not optimality

If the dynamics are selectionist, what's being selected for isn't task performance. It's robustness under the full set of pressures — loss, regularization, architectural bottlenecks, gradient noise. The "fit" structure isn't the one that solves the task best. It's the one that can keep solving it while everything around it erodes.

Biology works this way. What survives isn't what's optimal — it's what's robust. Organisms aren't maximally efficient at any single function. They're tolerant of perturbation across many functions simultaneously.

The grokking data suggests something similar: the Fourier circuit doesn't win because it's the best solution to modular arithmetic. It wins because it's compact enough to survive weight decay. Fitness is defined by the selection regime, not the task.

If that's right, then what training produces depends on what pressures you apply at least as much as what task you train on.
Different regularization, different architecture, different curriculum — different selection regimes — should produce structures that are robust in different ways, even if they achieve similar task performance. That's a different way of thinking about what the weights encode.

## What would settle it

The question is whether this frame generates predictions that diverge from standard optimization accounts. If the pressure-dependent taxonomy of equilibria from the grokking experiments holds across tasks and architectures, the Price framing is doing organizational work. If it's specific to modular arithmetic, it's a coincidence.

The fastest test might already be answerable with existing data. The selection and dissipative frames make different predictions about what controls the grokking oscillation frequency: a selection account says competition intensity between circuits, a dissipative account says the ratio of drive (gradient magnitude) to damping (weight decay strength). The weight decay sweep varies the drive-to-damping ratio cleanly but doesn't vary competition intensity in any obvious way. If oscillation frequency tracks the ratio, that's evidence for the dissipative frame. I haven't tried fitting both models to the sweep data.

The harder test is **historical contingency as niche construction.** The selection frame predicts that early-established circuits don't just occupy good basins — optimization theory knows about path dependence — but actively shape what *can* develop later by constructing the computational environment. Train on task A then task B, and the structures you get for B should be constrained not just by B's loss landscape but by A's circuits actively routing computation. Reverse the order, get qualitatively different B-circuits — not just different performance, different internal structure.

The niche construction version makes a specific prediction: the order doesn't just change how fast you learn B, it changes *what you learn* for B.
You could test this by applying sparse autoencoders or circuit-level interpretability tools to models trained under different curriculum orderings and comparing the internal decompositions, not just the accuracy curves. Curriculum learning papers generally measure performance differences from task ordering. The selection-frame prediction is about structural differences — different circuits, not just different accuracy. As far as I know, nobody has compared internal structure across curriculum orderings with interpretability tools.

If none of these produce results that diverge from optimization predictions, the frame is a costume. If any do, it's pointing at dynamics that the standard account misses.

The most precise version of the invariant might not be "selection" at all. It might be: distributions that concentrate under pressure. That's what the grokking transition is — gradient rank collapsing, the distribution over update directions localizing onto shared structure. That's what the quasispecies error threshold is — a distribution delocalizing when pressure weakens past a critical point. That's what the weight decay regimes show — different pressures producing different concentration dynamics. "Selection" is one name for this, but it carries baggage about individuals and competition that the actual dynamics might not need.

Whether the CIFAR result means the frame breaks on real tasks, or just that CIFAR doesn't have the right structure to produce dramatic phase transitions, is open. Whether quasispecies is the right level of abstraction or still too biological is open. But "distributions concentrating under pressure" is substrate-neutral, doesn't require individuation, and organizes the grokking data better than "optimizer converges to minimum."
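The "distribution concentrating or delocalizing under pressure" picture is easy to reproduce in miniature with the quasispecies model itself. A deterministic sketch on a single-peak fitness landscape (toy parameters I chose; only the model structure, replication weighted by fitness followed by mutation, comes from the standard Eigen setup):

```python
import numpy as np
from itertools import product

def master_frequency(q, L=8, f0=10.0, steps=500):
    """Deterministic quasispecies dynamics on binary sequences of length L.
    Single-peak landscape: the all-zeros 'master' sequence has fitness f0,
    every other sequence has fitness 1. q is the per-site copying accuracy.
    Returns the stationary frequency of the master sequence."""
    seqs = np.array(list(product([0, 1], repeat=L)))
    d = (seqs[:, None, :] != seqs[None, :, :]).sum(-1)  # pairwise Hamming distances
    M = (q ** (L - d)) * ((1 - q) ** d)                 # mutation matrix: P(j -> i)
    f = np.ones(len(seqs))
    f[0] = f0                                           # fitness peak at the master
    p = np.full(len(seqs), 1.0 / len(seqs))             # start from a uniform cloud
    for _ in range(steps):
        p = M @ (f * p)   # replicate in proportion to fitness, then mutate
        p /= p.sum()      # renormalize to a probability distribution
    return float(p[0])
```

With f0 = 10 and L = 8 the error threshold sits near q ≈ 0.75 (where q^L · f0 = 1): at q = 0.95 the distribution concentrates on the master peak, while at q = 0.70 it delocalizes across sequence space. Same dynamics, opposite regimes, controlled by a single pressure parameter.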