|
37 | 37 | " \n",
|
38 | 38 | " \n",
|
39 | 39 | "- the optimizer\n",
|
40 |    | - " - SGD\n",
41 |    | - " - Adam\n",
42 |    | - " - AdamW\n",
   | 40 | + " - SGD, SGD with momentum\n",
   | 41 | + " - Adam, AdamW, RAdam\n",
43 | 42 | " \n",
|
44 | 43 | " \n",
|
45 | 44 | "- the scheduler\n",
|
|
114 | 113 | "\n",
|
115 | 114 | "## 1.2 Optimizer and Scheduler\n",
|
116 | 115 | "\n",
|
117 |     | - "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016"
118 |     | - ]
119 |     | - },
120 |     | - {
121 |     | - "cell_type": "markdown",
122 |     | - "metadata": {},
123 |     | - "source": [
124 |     | - "## Stochastic Gradient Descent\n",
    | 116 | + "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016\n",
    | 117 | + "\n",
    | 118 | + "### Stochastic Gradient Descent\n",
125 | 119 | "\n",
|
126 | 120 | "- update network parameters by a fraction _learning rate_ after each example\n",
|
127 | 121 | "\n",
|
128 | 122 | "\n",
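The per-example update described above can be made concrete with a small sketch. This is not code from the notebook: `model`, `loss_fn`, and the random data are placeholders, and the learning rate of 0.01 is an arbitrary choice.

```python
import torch

# Placeholder model, loss, and learning rate (not from the notebook).
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
lr = 0.01

# Plain stochastic gradient descent: one update per training example,
# each parameter moved by a fraction (the learning rate) of its gradient.
for x, y in zip(torch.randn(8, 10), torch.randn(8, 1)):
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
```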
|
129 |     | - "## Mini-batch Gradient Descent\n",
    | 123 | + "### Mini-batch Gradient Descent\n",
130 | 124 | "\n",
|
131 | 125 | "- update network parameters by a fraction _learning rate_ after each mini-batch of size $ n>1 $\n",
|
132 | 126 | "\n",
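The only difference from the per-example case is that each gradient is computed over a mini-batch of $ n>1 $ examples before a single update. A rough sketch, again with placeholder data, an illustrative batch size of 16, and the built-in `torch.optim.SGD`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; batch_size is the mini-batch size n > 1.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Mini-batch gradient descent: one parameter update per batch.
for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()
```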
|
133 |     | - "## Adaptive Learning Rate Optimizers\n",
    | 127 | + "### Adaptive Learning Rate Optimizers\n",
134 | 128 | "\n",
|
135 | 129 | "- Adagrad, RMSprop, Adam, RAdam, AdamW\n",
|
136 | 130 | "- Liu et al.: [On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265)\n",
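As a rough sketch of how these optimizers are typically selected (not the notebook's own code), all of them live in `torch.optim` and share the same interface. The learning rates and weight decay below are illustrative, and `torch.optim.RAdam` assumes a PyTorch version recent enough to include it:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# The adaptive optimizers listed above, all behind the same interface.
optimizers = {
    "adagrad": torch.optim.Adagrad(model.parameters(), lr=1e-2),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=1e-3),
    "adam":    torch.optim.Adam(model.parameters(), lr=1e-3),
    "radam":   torch.optim.RAdam(model.parameters(), lr=1e-3),  # Liu et al. 2019
    "adamw":   torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2),
}
optimizer = optimizers["adamw"]  # swap the key to compare optimizers
```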
|
|
176 | 170 | "### 1.3.1 A Learning Function\n",
|
177 | 171 | "Define a training function that loops through the training data some number of times and updates the model weights as well as the learning rate schedule.\n",
|
178 | 172 | "\n",
|
179 |     | - "__Note__ that each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___\n"
    | 173 | + "__Note__: each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___"
180 | 174 | ]
|
181 | 175 | },
|
182 | 176 | {
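To illustrate the shape of the training function discussed in 1.3.1, here is a minimal sketch. It is not the notebook's implementation: the model, data, loss, optimizer, and scheduler are placeholders, and stepping the scheduler once per epoch is just one common convention.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, loss_fn, optimizer, scheduler, epochs):
    """Loop over the training data `epochs` times, updating weights and learning rate."""
    for epoch in range(epochs):
        for xb, yb in loader:        # each inner iteration consumes ONE batch
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()         # update the model weights
        scheduler.step()             # advance the learning rate schedule

# Placeholder setup so the sketch runs end to end.
model = torch.nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
train(model, loader, loss_fn, optimizer, scheduler, epochs=3)
```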
|
|