Papers
arxiv:2602.23696

Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Published on Feb 27
Authors:

Abstract

Transformer training under AdamW exhibits a stable low-dimensional drift direction capturing most parameter changes, which emerges from optimizer dynamics rather than gradient geometry and is eliminated by different optimizers or hyperparameter changes.

AI-generated summary

We analyze cumulative parameter trajectories of transformer training under AdamW and identify a dominant low-dimensional drift direction ("backbone") that captures 60--80% of long-horizon displacement from initialization. This direction is highly stable over rolling training windows yet reorients gradually across phases, particularly following objective reweighting. Per-batch gradients exhibit near-noise-floor alignment with the backbone, whereas optimizer-integrated updates align strongly with it, indicating that the structure emerges from accumulated optimizer dynamics rather than instantaneous gradient geometry. Replacing AdamW with SGD-family optimizers eliminates this structure, and reducing β_2 smoothly degrades backbone dominance and reheating recoverability. Reheating experiments show that transverse probe modes can be transiently re-excited without substantially altering accumulated backbone drift. These results provide a trajectory-level characterization of optimizer-induced geometric structure in transformer training and shift attention from instantaneous gradient properties to cumulative update dynamics.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.23696 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.23696 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.23696 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.