Authors: Yang Cao, Xiaoyu Li, Zhao Song
This repository contains the official PyTorch implementation of the Grams optimizer.
We introduce Gradient Descent with Adaptive Momentum Scaling (Grams), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers.
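To make the idea concrete, here is a minimal, single-tensor sketch of a Grams-style update, written from the description above rather than copied from the packaged implementation: Adam-style moment estimates set the step magnitude, while the sign is taken from the current gradient. The function and variable names are illustrative only.

import torch

def grams_style_update(param, grad, exp_avg, exp_avg_sq, step,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-style first and second moment estimates (updated in place)
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction, as in Adam
    m_hat = exp_avg / (1 - beta1 ** step)
    v_hat = exp_avg_sq / (1 - beta2 ** step)

    # Magnitude from the momentum-based Adam step, direction from the current gradient
    adam_step = m_hat / (v_hat.sqrt() + eps)
    update = adam_step.abs() * grad.sign()

    param.add_(update, alpha=-lr)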

Use the following command to install our PyTorch implementation of Grams:
pip install grams-pytorch
Switching from Adam/AdamW to Grams is simple and requires changing only two lines of code:
Before:
import torch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
After:
from grams import Grams
optimizer = Grams(model.parameters(), lr=1e-3, weight_decay=0.0)
Just import Grams and swap the optimizer—everything else remains the same!
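Because Grams is a drop-in replacement, the usual PyTorch training loop applies unchanged. The snippet below is a self-contained usage sketch: the linear model, synthetic data, and loss are placeholders, and it assumes Grams exposes the standard step()/zero_grad() optimizer interface.

import torch
import torch.nn as nn
from grams import Grams

# Toy model and synthetic data, purely for illustration
model = nn.Linear(10, 1)
optimizer = Grams(model.parameters(), lr=1e-3, weight_decay=0.0)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()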
Please cite our work!
@inproceedings{cao2025grams,
title={Grams: Gradient Descent with Adaptive Momentum Scaling},
author={Yang Cao and Xiaoyu Li and Zhao Song},
booktitle={ICLR 2025 First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models},
year={2025},
url={https://openreview.net/forum?id=GmKQnpQdsc}
}