arxiv:2605.17659

Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes

Published on May 17 · Submitted by Egor Shvetsov on May 20

Abstract

Standard losses interacting with positively biased activation functions cause negative weight drift during early training, leading to significant activation sparsity and affecting model accuracy across various architectures.

AI-generated summary

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENet) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU^2 achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.

Community

Every time you train a network with ReLU, GELU, or SiLU, your weights quietly drift negative. This is not because of your data: it happens on random inputs too. It's baked into the math of gradient descent + asymmetric activations.

We prove it formally (MSE & cross-entropy) and show it across MLP, ResNet, ViT, GPT, and a speech model.
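As a rough sanity check of the sign argument (a toy calculation, not the paper's proof), consider a single linear readout with MSE loss L = 0.5 * ||W h - t||^2 on top of non-negative activations h. If W has zero-mean i.i.d. entries with variance sigma^2 and the target t is zero-mean and independent of W, then E[dL/dh] = E[W^T (W h - t)] = d_out * sigma^2 * h, which is non-negative whenever h is. The sketch below estimates that expectation by Monte Carlo. The function name, dimensions, and Gaussian assumptions are illustrative choices and do not come from the paper's code.

```python
import random

def mean_grad_wrt_hidden(h, d_out=8, sigma=0.5, trials=4000, seed=0):
    """Monte Carlo estimate of E[dL/dh] for L = 0.5 * ||W h - t||^2.

    Assumptions (ours, for illustration): W has i.i.d. N(0, sigma^2) entries,
    t has i.i.d. N(0, 1) entries, and both are redrawn on every trial.
    """
    rng = random.Random(seed)
    d_hid = len(h)
    total = [0.0] * d_hid
    for _ in range(trials):
        W = [[rng.gauss(0.0, sigma) for _ in range(d_hid)] for _ in range(d_out)]
        t = [rng.gauss(0.0, 1.0) for _ in range(d_out)]
        # Residual r = W h - t, then dL/dh = W^T r.
        r = [sum(W[i][j] * h[j] for j in range(d_hid)) - t[i] for i in range(d_out)]
        for j in range(d_hid):
            total[j] += sum(W[i][j] * r[i] for i in range(d_out))
    return [g / trials for g in total]

h = [0.0, 0.5, 1.0, 2.0]                 # post-ReLU activations are non-negative
grad = mean_grad_wrt_hidden(h)
expected = [8 * 0.5 ** 2 * v for v in h]  # closed form: d_out * sigma^2 * h

for g, e in zip(grad, expected):
    print(f"estimated {g:+.3f}   closed form {e:+.3f}")
```

A non-negative gradient on h means gradient descent pushes those activations down. When the inputs feeding that layer are themselves non-negative, the same sign carries over to the weight gradient, which is one way to read the negative drift described above.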

What does this drift do? Negative weights push pre-activations into negative regions, and with ReLU, up to 90% of activations end up being zero, zeroed out by the very same function that caused the drift in the first place! Bug or feature? Depends on how you use it.
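To see why a small negative shift in the weights can translate into a lot of zeros, here is a minimal simulation under stated assumptions: inputs are non-negative (absolute values of standard Gaussians, standing in for the outputs of a previous ReLU layer), weights are Gaussian with standard deviation 1/sqrt(d_in), and "drift" is modelled as a shift of the weight mean. The function name and the shift value of -0.02 are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
import random

def relu_sparsity(weight_mean, d_in=64, samples=2000, seed=0):
    """Fraction of pre-activations z = w . x that ReLU would zero out (z <= 0)."""
    rng = random.Random(seed)
    sigma = 1.0 / d_in ** 0.5
    zeros = 0
    for _ in range(samples):
        # Non-negative inputs, as produced by an upstream ReLU-like activation.
        x = [abs(rng.gauss(0.0, 1.0)) for _ in range(d_in)]
        # Weights whose mean has been shifted by `weight_mean`.
        w = [rng.gauss(weight_mean, sigma) for _ in range(d_in)]
        z = sum(wi * xi for wi, xi in zip(w, x))
        if z <= 0.0:
            zeros += 1
    return zeros / samples

sparsity_centered = relu_sparsity(0.0)    # zero-mean weights
sparsity_drifted = relu_sparsity(-0.02)   # slightly negative weight mean

print(f"zero-mean weights:     {sparsity_centered:.2f}")
print(f"negative-mean weights: {sparsity_drifted:.2f}")
```

Because every input is non-negative, the shift in the weight mean is multiplied by the sum of all inputs, so even a shift much smaller than the weight standard deviation moves most pre-activations below zero.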

The most interesting finding: ReLU² boosts GPT-nano performance, but it pathologically amplifies activation spikes by 25×. The fix is simple: clip it. Clipped ReLU² and GELU² both outperform their non-squared versions, with GELU² achieving the best validation loss overall on GPT-nano.
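The activations mentioned above can be sketched in a few lines. This page does not state the exact clipping rule or threshold, so the version below is an assumption: the squared output is capped at a fixed value `cap`, with 6.0 as a placeholder. GELU² is likewise assumed to mean the square of the exact (erf-based) GELU. See the linked repository for the definitions actually used.

```python
import math

def relu(x):
    return max(0.0, x)

def relu2(x):
    # Squared ReLU: zero for x <= 0, grows quadratically for x > 0.
    return relu(x) ** 2

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu2(x):
    # Assumed definition: the square of GELU.
    return gelu(x) ** 2

def clipped(fn, cap=6.0):
    # Cap the activation output at `cap` (hypothetical threshold).
    return lambda x: min(fn(x), cap)

clipped_relu2 = clipped(relu2)
clipped_gelu2 = clipped(gelu2)

for x in (-1.0, 0.5, 3.0, 10.0):
    print(x, relu2(x), clipped_relu2(x))
```

The quadratic growth is what lets a single large pre-activation blow up: an input of 10 becomes 100 under unclipped ReLU², while the clipped variant holds it at the cap.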

💻 Code: github.com/On-Point-RND/BugOrFeature


