Simulation study of how loss landscape sharpness evolves during training across model scales (10M to 7B parameters), and its relationship to optimization behavior and downstream task performance.

| Model | Damping Rate | Peak/Plateau Ratio | Oscillations | Osc. Amplitude | Damping R² |
|---|---|---|---|---|---|
| 10M | 4.716 | 1.615 | 25 | 0.0225 | 0.922 |
| 125M | 4.960 | 2.085 | 23 | 0.0283 | 0.938 |
| 350M | 5.131 | 2.297 | 24 | 0.0276 | 0.946 |
| 1.3B | 5.229 | 2.614 | 28 | 0.0320 | 0.931 |
| 3B | 5.333 | 2.823 | 22 | 0.0356 | 0.937 |
| 7B | 5.388 | 3.092 | 19 | 0.0362 | 0.946 |

| Fit Model | R² | AIC | BIC | No. of Parameters |
|---|---|---|---|---|
| Log-Linear | 0.9983 | -61.86 | -62.28 | 2 |
| Power-Law | 0.9945 | -54.92 | -55.33 | 2 |
| Exponential Decay | 0.9983 | -59.85 | -60.48 | 3 |
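
The AIC and BIC columns above can be sanity-checked against each other. By definition, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, so for the same fit the gap is AIC − BIC = k(2 − ln n). The sketch below assumes n = 6 data points (one per model scale, 10M through 7B); the page does not state n, so `N_POINTS` is an assumption, not a reported value.

```python
import math

# ASSUMPTION: the fits use one data point per model scale (six scales).
# The source does not state the sample size explicitly.
N_POINTS = 6

# (AIC, BIC, number of fit parameters), copied from the table above.
fits = {
    "Log-Linear": (-61.86, -62.28, 2),
    "Power-Law": (-54.92, -55.33, 2),
    "Exponential Decay": (-59.85, -60.48, 3),
}


def expected_gap(k, n):
    """Return AIC - BIC = k * (2 - ln n) for k parameters and n points."""
    return k * (2.0 - math.log(n))


# Map each fit to (reported AIC - BIC, gap predicted from k and n).
gaps = {}
for name, (aic, bic, k) in fits.items():
    reported = aic - bic
    predicted = expected_gap(k, N_POINTS)
    gaps[name] = (reported, predicted)
    print(f"{name:18s} reported gap = {reported:.2f}, predicted = {predicted:.2f}")
```

Under this assumption the reported and predicted gaps agree to within the two-decimal rounding of the table. It also explains why BIC is lower than AIC in every row: with n = 6, ln n is below 2, so the BIC penalty per parameter is the smaller of the two.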

| Model | Peak Sharpness | Final Sharpness | Final Loss | Mean Accuracy | Sharp.-Loss r | Sharp.-Grad r |
|---|---|---|---|---|---|---|
| 10M | 2.0644 | 1.2785 | 5.009 | 0.3616 | 0.4445 | 0.9218 |
| 125M | 2.4108 | 1.1669 | 4.1508 | 0.4532 | 0.4794 | 0.9638 |
| 350M | 2.5996 | 1.1217 | 3.8431 | 0.4843 | 0.4884 | 0.9705 |
| 1.3B | 2.7585 | 1.0646 | 3.4863 | 0.5344 | 0.5181 | 0.9795 |
| 3B | 2.8945 | 1.0135 | 3.2753 | 0.5674 | 0.5227 | 0.982 |
| 7B | 2.9976 | 0.9804 | 3.077 | 0.603 | 0.5335 | 0.9849 |
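
The two `r` columns appear to be correlation coefficients computed within each training run. That reading is an assumption, and the per-step trajectories are not included here. As a rough illustration of the same statistic, the sketch below computes a Pearson correlation across the six scales, using only the Final Sharpness and Mean Accuracy columns from the table above. It is a cross-scale check on six points, not a reproduction of the within-run values.

```python
import math

# Columns copied from the table above, ordered 10M, 125M, 350M, 1.3B, 3B, 7B.
final_sharpness = [1.2785, 1.1669, 1.1217, 1.0646, 1.0135, 0.9804]
mean_accuracy = [0.3616, 0.4532, 0.4843, 0.5344, 0.5674, 0.603]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_cross_scale = pearson_r(final_sharpness, mean_accuracy)
print(f"cross-scale r(final sharpness, mean accuracy) = {r_cross_scale:.4f}")
```

Because final sharpness falls and mean accuracy rises monotonically with scale in this table, the coefficient is strongly negative. With only six points it says little about causation.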

| Model | HellaSwag | ARC-Easy | PIQA | WinoGrande | LAMBADA |
|---|---|---|---|---|---|
| 10M | 0.3658 | 0.387 | 0.4318 | 0.3266 | 0.2966 |
| 125M | 0.4474 | 0.477 | 0.5057 | 0.441 | 0.3951 |
| 350M | 0.4743 | 0.5041 | 0.5521 | 0.4598 | 0.4312 |
| 1.3B | 0.548 | 0.5612 | 0.5921 | 0.5121 | 0.4586 |
| 3B | 0.558 | 0.6098 | 0.631 | 0.5317 | 0.5064 |
| 7B | 0.6144 | 0.6286 | 0.6508 | 0.578 | 0.5432 |
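
The Mean Accuracy column in the sharpness table is consistent with an unweighted average of the five benchmark scores above. The page does not state how the mean is defined, so the unweighted average is an assumption. A minimal check, using only values copied from the two tables:

```python
# Per-benchmark accuracies from the table above, in column order:
# HellaSwag, ARC-Easy, PIQA, WinoGrande, LAMBADA.
benchmarks = {
    "10M": [0.3658, 0.387, 0.4318, 0.3266, 0.2966],
    "125M": [0.4474, 0.477, 0.5057, 0.441, 0.3951],
    "350M": [0.4743, 0.5041, 0.5521, 0.4598, 0.4312],
    "1.3B": [0.548, 0.5612, 0.5921, 0.5121, 0.4586],
    "3B": [0.558, 0.6098, 0.631, 0.5317, 0.5064],
    "7B": [0.6144, 0.6286, 0.6508, 0.578, 0.5432],
}

# Mean Accuracy as reported in the sharpness table.
reported_mean = {
    "10M": 0.3616,
    "125M": 0.4532,
    "350M": 0.4843,
    "1.3B": 0.5344,
    "3B": 0.5674,
    "7B": 0.603,
}

# ASSUMPTION: the reported mean is an unweighted average over the five tasks.
recomputed = {
    model: round(sum(scores) / len(scores), 4)
    for model, scores in benchmarks.items()
}

for model, value in recomputed.items():
    status = "ok" if abs(value - reported_mean[model]) < 1e-9 else "MISMATCH"
    print(f"{model:5s} recomputed = {value:.4f}  reported = {reported_mean[model]:.4f}  {status}")
```

All six rows match to four decimal places, so the two tables are mutually consistent under this reading.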