Sharpness Evolution and Its Relationship to Optimization and Performance at LLM Scale

A simulation study of how loss-landscape sharpness evolves during training across model scales (10M to 7B parameters), and how it relates to optimization behavior and downstream task performance.

Scaling Law Slope (sharpness per decade): -0.1055
Scaling Law R-squared: 0.9983
Sharpness-Loss Correlation: 0.9945
Sharpness-Performance Correlation: -0.9992
Scale-Sharpness Correlation: -0.9991
Mean EoS Damping Rate: 5.13
Log-Linear AIC (Best Fit): -61.86
Early Prediction R² (10% data): 0.9999

Figures (images not reproduced): Sharpness Evolution During Training; Sharpness Scaling Law; Training Loss Trajectories; Sharpness vs. Downstream Performance; Sharpness-Gradient Correlation by Scale; Downstream Task Performance by Scale; Edge-of-Stability Dynamics; Power-Law vs Log-Linear Scaling; Early Prediction Accuracy; Model Information Criteria Comparison

Edge-of-Stability Dynamics

Model | Damping Rate | Peak/Plateau Ratio | Oscillations | Osc. Amplitude | Damping R²
10M | 4.716 | 1.615 | 25 | 0.0225 | 0.922
125M | 4.960 | 2.085 | 23 | 0.0283 | 0.938
350M | 5.131 | 2.297 | 24 | 0.0276 | 0.946
1.3B | 5.229 | 2.614 | 28 | 0.0320 | 0.931
3B | 5.333 | 2.823 | 22 | 0.0356 | 0.937
7B | 5.388 | 3.092 | 19 | 0.0362 | 0.946
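
As a quick consistency check, the headline Mean EoS Damping Rate can be recomputed from the Damping Rate column above. This is a minimal sketch that assumes the headline figure is a plain unweighted mean over the six scales, which the source does not state; the variable names are illustrative.

```python
# Damping Rate column, copied from the Edge-of-Stability Dynamics table.
damping_rate = {
    "10M": 4.716,
    "125M": 4.960,
    "350M": 5.131,
    "1.3B": 5.229,
    "3B": 5.333,
    "7B": 5.388,
}

# Assumption: the headline value is an unweighted mean across scales.
mean_damping = sum(damping_rate.values()) / len(damping_rate)
print(f"Mean EoS damping rate: {mean_damping:.2f}")
```

Under that assumption the result rounds to the reported 5.13.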

Functional Form Comparison

Model | R² | AIC | BIC | Parameters
Log-Linear | 0.9983 | -61.86 | -62.28 | 2
Power-Law | 0.9945 | -54.92 | -55.33 | 2
Exponential Decay | 0.9983 | -59.85 | -60.48 | 3
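
The Log-Linear row (R² 0.9983, AIC -61.86, BIC -62.28) can be approximately reproduced from the Final Sharpness values in the Scale Summary table. The sketch below fits final sharpness against log10 of the parameter count and scores the fit. It assumes the common least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln n with n = 6 scales, and that the model labels correspond to exact parameter counts (10M = 1e7, and so on). The source does not state either choice, and the inputs are rounded to four decimals, so small deviations from the reported values are expected.

```python
import numpy as np

# Parameter counts (assumed exact) and Final Sharpness from the Scale Summary table.
params = np.array([10e6, 125e6, 350e6, 1.3e9, 3e9, 7e9])
final_sharpness = np.array([1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804])

# Log-linear model: sharpness = slope * log10(params) + intercept.
x = np.log10(params)
slope, intercept = np.polyfit(x, final_sharpness, 1)

# Goodness of fit.
residuals = final_sharpness - (slope * x + intercept)
rss = float(np.sum(residuals ** 2))
tss = float(np.sum((final_sharpness - final_sharpness.mean()) ** 2))
r2 = 1.0 - rss / tss

# Assumed least-squares information criteria; k counts slope and intercept.
n, k = len(x), 2
aic = n * np.log(rss / n) + 2 * k
bic = n * np.log(rss / n) + k * np.log(n)

print(f"slope per decade: {slope:.4f}")
print(f"R^2: {r2:.4f}")
print(f"AIC: {aic:.2f}  BIC: {bic:.2f}")
```

With n = 6, ln n is smaller than 2, so under these formulas BIC sits slightly below AIC for every model, which matches the pattern in the table.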

Scale Summary

Model | Peak Sharpness | Final Sharpness | Final Loss | Mean Accuracy | Sharp.-Loss r | Sharp.-Grad r
10M | 2.0644 | 1.2785 | 5.009 | 0.3616 | 0.4445 | 0.9218
125M | 2.4108 | 1.1669 | 4.1508 | 0.4532 | 0.4794 | 0.9638
350M | 2.5996 | 1.1217 | 3.8431 | 0.4843 | 0.4884 | 0.9705
1.3B | 2.7585 | 1.0646 | 3.4863 | 0.5344 | 0.5181 | 0.9795
3B | 2.8945 | 1.0135 | 3.2753 | 0.5674 | 0.5227 | 0.982
7B | 2.9976 | 0.9804 | 3.077 | 0.603 | 0.5335 | 0.9849
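
The three headline correlations (Sharpness-Loss, Sharpness-Performance, Scale-Sharpness) appear to be cross-scale Pearson correlations over the six rows above. The sketch below recomputes them under that assumption, treating "scale" as log10 of the parameter count and "performance" as the Mean Accuracy column. Both readings are assumptions, not something the source spells out.

```python
import numpy as np

# Columns copied from the Scale Summary table.
params = np.array([10e6, 125e6, 350e6, 1.3e9, 3e9, 7e9])  # assumed exact counts
final_sharpness = np.array([1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804])
final_loss = np.array([5.009, 4.1508, 3.8431, 3.4863, 3.2753, 3.077])
mean_accuracy = np.array([0.3616, 0.4532, 0.4843, 0.5344, 0.5674, 0.603])


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_loss = pearson(final_sharpness, final_loss)
r_perf = pearson(final_sharpness, mean_accuracy)
r_scale = pearson(np.log10(params), final_sharpness)

print(f"sharpness vs. loss:     {r_loss:.4f}")
print(f"sharpness vs. accuracy: {r_perf:.4f}")
print(f"log10(scale) vs. sharp: {r_scale:.4f}")
```

These cross-scale values are distinct from the per-model Sharp.-Loss r column, which is far lower (0.44 to 0.53) and presumably measured within each training run.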

Downstream Task Accuracy

Model | HellaSwag | ARC-Easy | PIQA | WinoGrande | LAMBADA
10M | 0.3658 | 0.387 | 0.4318 | 0.3266 | 0.2966
125M | 0.4474 | 0.477 | 0.5057 | 0.441 | 0.3951
350M | 0.4743 | 0.5041 | 0.5521 | 0.4598 | 0.4312
1.3B | 0.548 | 0.5612 | 0.5921 | 0.5121 | 0.4586
3B | 0.558 | 0.6098 | 0.631 | 0.5317 | 0.5064
7B | 0.6144 | 0.6286 | 0.6508 | 0.578 | 0.5432
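
The Mean Accuracy column in the Scale Summary table is consistent with an unweighted average of the five task accuracies above. A minimal sketch of that arithmetic follows, assuming equal task weights (the source does not say how the mean was formed); the dictionary names are illustrative.

```python
# Per-task accuracies in the order HellaSwag, ARC-Easy, PIQA, WinoGrande, LAMBADA.
accuracy = {
    "10M": [0.3658, 0.387, 0.4318, 0.3266, 0.2966],
    "125M": [0.4474, 0.477, 0.5057, 0.441, 0.3951],
    "350M": [0.4743, 0.5041, 0.5521, 0.4598, 0.4312],
    "1.3B": [0.548, 0.5612, 0.5921, 0.5121, 0.4586],
    "3B": [0.558, 0.6098, 0.631, 0.5317, 0.5064],
    "7B": [0.6144, 0.6286, 0.6508, 0.578, 0.5432],
}

# Assumption: every task carries equal weight in the mean.
mean_accuracy = {model: sum(vals) / len(vals) for model, vals in accuracy.items()}

for model, value in mean_accuracy.items():
    print(f"{model}: {value:.4f}")
```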

Key Findings