Grokking Dynamics

Systematic decomposition of the grokking delay on modular arithmetic. Started thinking it was momentum. Ended up somewhere more complicated.

Grokking is when a neural network memorizes its training data, then — long after training loss has flatlined — suddenly generalizes. It looks like a phase transition. The whole field treats it like one.

The initial hypothesis

Adam's momentum buffer (beta1) tracks an exponential moving average of past gradients. If the loss landscape reshapes while the buffer is still pointing at the old geometry, the optimizer won't respond until the average catches up. That lag would look like a mysterious delay if you only watched test accuracy. Simple story: the "phase transition" is the optimizer being slow.

Analogy: Drop into a half-pipe. As you ride toward the far wall, someone deconstructs the side you dropped in from. You're still carrying momentum toward a wall that made sense when you started — you won't get back to the ground until you ride up the other side and come back down. That's roughly what the momentum buffer is doing.
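The lag is easy to see in isolation. A minimal toy sketch (the `steps_until_ema_flips` helper is hypothetical, not the experiment code): feed Adam's first-moment EMA a gradient that flips sign, and count how long the buffer keeps pointing the old way.

```python
def steps_until_ema_flips(beta1: float, flip_step: int = 100) -> int:
    """Gradient is +1.0 before flip_step, -1.0 after. Count post-flip steps
    until the EMA (Adam's first-moment buffer) turns negative."""
    m = 0.0
    step = 0
    while True:
        g = 1.0 if step < flip_step else -1.0
        m = beta1 * m + (1 - beta1) * g   # Adam's first-moment update
        if step >= flip_step and m < 0:
            return step - flip_step
        step += 1

print(steps_until_ema_flips(0.9))  # high momentum: several steps of lag
print(steps_until_ema_flips(0.5))  # low momentum: near-immediate response
```

The lag scales roughly like 1/(1 - beta1): a landscape shift that a low-momentum optimizer tracks almost instantly takes a high-momentum buffer many updates to acknowledge.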

I started by testing Grokfast (Lee et al., 2024), which claims to accelerate grokking 50x via gradient filtering. The question was whether it's doing anything beyond reducing effective momentum. Grokfast + default Adam grokks at ~800 steps. Just lowering beta1 to 0.5 grokks at ~700. Grokfast + low beta1 is worse than either alone. The two interventions are redundant.

Aside: Grokfast's mechanism — amplifying slow-varying gradient components — is mathematically similar to reducing the exponential decay rate that beta1 controls. If they're redundant, they're probably hitting the same bottleneck.
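For reference, a minimal sketch of what I take Grokfast's EMA variant to be doing; parameter names `alpha` and `lamb` follow the paper's notation, but the values and the scalar-gradient setup are illustrative, not the authors' implementation.

```python
def grokfast_filter(grads, alpha=0.98, lamb=2.0):
    """Apply g <- g + lamb * EMA_alpha(g) to a stream of scalar gradients."""
    ema, out = 0.0, []
    for g in grads:
        ema = alpha * ema + (1 - alpha) * g
        out.append(g + lamb * ema)
    return out

# A constant (slow-varying) gradient ends up amplified toward (1 + lamb) * g,
slow = grokfast_filter([1.0] * 500)
# while a sign-alternating (fast-varying) one passes through nearly unchanged,
# because its EMA hovers near zero.
fast = grokfast_filter([1.0, -1.0] * 250)
print(slow[-1], fast[-1])
```

The filter is a low-pass boost, which is why it overlaps so heavily with lowering beta1: both interventions change how much the slow component of the gradient dominates the update.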

The beta1 sweep

Next I ran 390 jobs (13 beta1 values, 30 seeds each) to map the relationship between momentum and grokking delay.

[Figure: median grokking gap (steps, 0–1200) vs β₁ from 0 to 0.95]
390 jobs (30 seeds per β₁). Gap is flat ~300 steps for β₁ < 0.5, then rises sharply. Shaded band is IQR. The baseline gap at β₁=0 means momentum isn’t the whole story.

The gap is flat (~280–345 steps) for beta1 between 0 and 0.4, then rises sharply past 0.5. The important part: beta1 = 0 doesn't eliminate grokking. There's a baseline delay of ~300 steps with zero momentum. High momentum amplifies the delay 2–3x, but it's amplifying something that's already there.

Aside: My original framing — "the delay is mostly optimizer momentum" — was wrong. Momentum is a multiplier, not the cause. Something else produces the baseline ~300-step gap.

Weight decay

I ran 270 jobs (9 weight decay values, 30 seeds each), all at beta1 = 0 to isolate the effect.

Without weight decay, grokking almost never happens. Only 5/30 seeds generalize within 3000 steps at wd = 0 — the model memorizes instantly and gets stuck. At wd = 1.0: 100% grokking rate, gap of ~300 steps. At wd = 2.0: the model generalizes without memorizing first.

Aside: At wd = 5.0, nothing learns at all. Too much regularization kills both memorization and generalization.

Weight decay is doing something like erosion. The memorization solution needs high-magnitude, high-precision weights to fit each training example as a lookup table — lots of sharp edges. Weight decay penalizes magnitude, continuously wearing those edges down. The generalizing structure — Fourier features that encode modular arithmetic — is compact and robust under that kind of perturbation. It survives the pressure because it doesn't need the precision. The "delay" is how long it takes weight decay to erode memorization enough for the compact circuit to take over.

Analogy: Think of it as weathering. A sandcastle and a boulder both sit on the beach, but only one of them survives the tide. Weight decay is the tide — it favors structures that are robust under magnitude pressure. Same dynamic as the training phases post: different selection environments reshape the geometry differently.
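The erosion timescale falls straight out of the decoupled (AdamW-style) update: a weight that receives no gradient decays multiplicatively, w_t = w_0 (1 - lr·wd)^t. A back-of-envelope sketch, where the learning rate is an assumed illustrative value rather than the experiment's setting:

```python
import math

def halving_steps(lr: float, wd: float) -> float:
    """Steps for a gradient-free weight to lose half its magnitude
    under decoupled weight decay: w <- (1 - lr * wd) * w each step."""
    return math.log(2) / -math.log(1 - lr * wd)

for wd in (0.25, 1.0, 2.0):
    print(f"wd={wd}: ~{halving_steps(lr=1e-3, wd=wd):.0f} steps to halve")
```

At lr = 1e-3 and wd = 1.0 the half-life comes out in the hundreds of steps, the same order as the observed ~300-step gap, which is at least consistent with erosion setting the timescale.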

"Grokked" isn't a fixed point

After generalization, the model doesn't sit in a stable state. Under wd = 1.0, accuracy oscillates — the model periodically loses and regains generalization. Weight decay erodes the solution, accuracy drops, gradient signal strengthens, the optimizer rebuilds, and the cycle repeats. Period ≈ 150 steps.

At wd = 0.25: slow erosion, catastrophic collapses (accuracy drops to 0.25), full re-grokking from scratch. At wd = 2.0: gentle hovering near the boundary, no sharp crashes. The weight norm under wd = 1.0 shows clean sinusoidal oscillation perfectly correlated with accuracy.

"Grokked" is not a fixed point — it's a dynamical equilibrium between gradient descent building structure and weight decay dissolving it. What we call generalization is the time-averaged behavior of this cycle.

Analogy: This is homeostasis. Body temperature looks like a constant, but it's the time-average of a system actively generating and dissipating heat. "The model has generalized" might be the same kind of description — a stable-looking label for a dynamic process.

Aside: This might have implications for checkpoint selection: the model you save depends on where you catch it in the orbit. I haven't tested whether this matters at scale.

I tested whether the oscillation generalizes beyond modular arithmetic by running CIFAR-10 with a tiny ViT (0.8M params, 5k examples). No oscillation. No grokking. Modular arithmetic has a clean algorithmic solution (Fourier circuit) that's dramatically more weight-efficient than memorization. CIFAR doesn't have an equivalent discrete transition.

Aside: This is an important limitation. The clean dynamics here come from the task having a sharp efficiency gap between memorization and generalization. Most real tasks probably don't.

What the geometry is doing

[Figure: train and validation accuracy (0–100%) over 2000 training steps]
Modular addition (mod 113), 2-layer transformer. The "phase transition" at ~step 1500 is smooth when you watch the right variables.

The Hessian eigenvalues tell the story more clearly than accuracy does. Curvature spikes 20x during memorization — the landscape sharpens around the lookup-table solution — then declines starting around step 600 as weight decay reshapes the geometry. At beta1 = 0, generalization follows almost immediately. At beta1 = 0.9, there's a ~400-step lag while the momentum buffer catches up.

Aside: The curvature increases during memorization, then declines. This contradicts the intuition that generalization means finding a flat basin. The model is actively reshaping sharp geometry into something smoother.

Analogy: "Phase transition" is what you see when you project a high-dimensional geometric process onto a single scalar (test accuracy). The Hessian shows the underlying process is smooth — curvature changes continuously. The transition is in the measurement, not the system.
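For context, the standard way to get top-eigenvalue curves without materializing the Hessian is power iteration on Hessian-vector products. A toy sketch where the HVP is an explicit 3×3 matrix product; in a network, `hvp` would come from double backprop instead.

```python
import numpy as np

def top_eigenvalue(hvp, dim, iters=100, seed=0):
    """Power iteration using only Hessian-vector products."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(v))  # Rayleigh quotient at the converged direction

H = np.diag([20.0, 1.0, 0.5])  # toy Hessian: one sharp direction dominates
print(top_eigenvalue(lambda v: H @ v, dim=3))
```

Each HVP costs about one extra backward pass, which is why tracking the top eigenvalue over training is feasible even when the full Hessian is not.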

Gradient agreement — the cosine similarity between two independent half-batch gradients — adds another layer. Agreement rises during memorization. The landscape becomes more coherent as the model learns, and that coherence enables generalization. The momentum buffer is anti-correlated with the current gradient during the post-memorization phase (cosine: -0.1 to -0.3). The optimizer's history is pointing toward memorization while the landscape has already shifted.

Definition. Gradient agreement: split a batch in half, compute gradients independently, measure their cosine similarity. High agreement means samples are pulling in the same direction — the model is learning shared structure rather than fitting individual examples.
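A minimal version of the measurement, on a synthetic least-squares problem rather than the transformer; all names and data here are illustrative.

```python
import numpy as np

def half_batch_agreement(X, y, w):
    """Cosine similarity between the two half-batch gradients of MSE loss."""
    def grad(Xh, yh):
        return 2 * Xh.T @ (Xh @ w - yh) / len(yh)
    h = len(y) // 2
    g1, g2 = grad(X[:h], y[:h]), grad(X[h:], y[h:])
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 8))
w_true = rng.standard_normal(8)
y = X @ w_true   # labels share structure across samples
print(half_batch_agreement(X, y, w=np.zeros(8)))   # shared structure: high
print(half_batch_agreement(X, rng.standard_normal(1024), np.zeros(8)))  # noise labels: no shared pull
```

When the labels carry shared structure, the two halves pull the weights the same way and the cosine sits near 1; with pure-noise labels the halves have nothing in common to agree on.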

Smaller findings that constrain the story

Fourier transplant failure. Injecting pre-learned Fourier neurons from a grokked model into a memorizing model doesn't accelerate grokking. The transplanted neurons neither help nor hurt. The Fourier features are present by step 700 (during memorization). The bottleneck is landscape reshaping, not feature formation.

Analogy: Having the right representation isn't enough — the geometry around it has to make it load-bearing. It's the difference between knowing a fact and having it integrated into how you reason. The structure exists; it just isn't structurally central yet.

Gradient rank collapse. Effective rank of per-sample gradients drops 30–40% at the grokking transition — each sample stops pulling in independent directions and aligns to shared structure. Then it oscillates, synchronized with the accuracy limit cycle.
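Effective rank here can be read as the exponential of the entropy of the normalized singular-value spectrum, one standard definition. A synthetic sketch of why alignment to shared structure drops it; the gradient matrices are stand-ins, not real per-sample gradients.

```python
import numpy as np

def effective_rank(G):
    """exp(entropy) of the normalized singular-value spectrum of G."""
    s = np.linalg.svd(G, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# Each row is a stand-in for one sample's gradient.
independent = rng.standard_normal((64, 32))        # samples pull independently
aligned = np.outer(rng.standard_normal(64), rng.standard_normal(32))
aligned += 0.05 * rng.standard_normal((64, 32))    # mostly one shared direction
print(effective_rank(independent), effective_rank(aligned))
```

When samples align to a shared direction the spectrum concentrates in one singular value and the effective rank collapses, which is exactly the signature at the grokking transition.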

Agreement instability predicts grokking speed. The number of gradient-agreement dips during training predicts the grokking gap with r = 0.896 (n = 30, p < 0.0001). Stronger than any single hyperparameter. The micro-level instability and the macro-level oscillation may be the same phenomenon at different scales.
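For completeness, the r value is a plain Pearson correlation across runs. A from-scratch version, with synthetic stand-ins for the dip counts and gaps rather than the experiment's data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
dips = rng.integers(1, 12, size=30).astype(float)      # agreement dips per run
gaps = 80.0 * dips + rng.normal(0.0, 150.0, size=30)   # noisy linear relation
print(pearson_r(dips, gaps))
```

With n = 30 runs, an r near 0.9 corresponds to a vanishingly small p-value under the usual t-approximation, consistent with the reported p < 0.0001.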

Where this leaves things

Grokking on modular arithmetic is weight decay eroding a memorization fixed point until the generalizing solution takes over. Momentum amplifies the delay by keeping the optimizer committed to stale geometry, but the baseline delay exists without it. After generalization, the system enters a limit cycle rather than a fixed point.

I'm fairly confident about the mechanism on this specific task. Whether any of it transfers to tasks without clean algorithmic solutions is open — the CIFAR result suggests it might not.

Setup

Connections

Training as Selection: Reframes these grokking dynamics as Price-equation selection — weight decay as selective pressure, the oscillation as ecological equilibrium.
Hard Substrates, Soft Evidence: Argues that internal evidence (like the Hessian and gradient rank data here) is what the cognition debate actually needs.