
Commit 38fd1b3: "tweaking"

1 parent 2abf7c1 · commit 38fd1b3

3 files changed: +16 −14 lines

Part01_prerequisites.ipynb (+8 −14)
@@ -37,9 +37,8 @@
  " \n",
  " \n",
  "- the optimizer\n",
- " - SGD\n",
- " - Adam\n",
- " - AdamW\n",
+ " - SGD, SGD with momentum\n",
+ " - Adam, AdamW, RAdam\n",
  " \n",
  " \n",
  "- the scheduler\n",
@@ -114,23 +113,18 @@
  "\n",
  "## 1.2 Optimizer and Scheduler\n",
  "\n",
- "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Stochastic Gradient Descent\n",
+ "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016\n",
+ "\n",
+ "### Stochastic Gradient Descent\n",
  "\n",
  "- update network parameters by a fraction _learning rate_ after each example\n",
  "\n",
  "\n",
- "## Mini-batch Gradient Descent\n",
+ "### Mini-batch Gradient Descent\n",
  "\n",
  "- update network parameters by a fraction _learning rate_ after each mini-batch of size $ n>1 $\n",
  "\n",
- "## Adaptive Learning Rate Optimizers\n",
+ "### Adaptive Learning Rate Optimizers\n",
  "\n",
  "- Adagrad, RMSprop, Adam, RAdam, AdamW\n",
  "- Liu et al.: [On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265)\n",
@@ -176,7 +170,7 @@
  "### 1.3.1 A Learning Function\n",
  "Define a training function that loops through the training data some number of times and updates the model weights as well as the learning rate schedule.\n",
  "\n",
- "__Note__ that each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___\n"
+ "__Note__: each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___"
  ]
 },
 {
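
The note changed in the hunk above points out that each pass through the inner for loop pulls one batch from the PyTorch data loader, so the batch size fixes how much data goes into each gradient estimate. A minimal sketch of such a training function, with a dummy dataset and model standing in for the notebook's own (all names and hyperparameters here are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data and model; the notebook uses its own dataset and network.
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

def train(model, loader, criterion, optimizer, scheduler, epochs=3):
    model.train()
    for epoch in range(epochs):
        for xb, yb in loader:          # one mini-batch per iteration
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()            # gradient computed from this batch only
            optimizer.step()           # parameter update
        scheduler.step()               # advance the learning-rate schedule once per epoch
        print(f"epoch {epoch}: lr={scheduler.get_last_lr()[0]:.4g}")

train(model, loader, criterion, optimizer, scheduler)
```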

Part02_recurrent_neural_networks.ipynb (+1)
@@ -653,6 +653,7 @@
  "\n",
  "- Bahdanau et al.: [_Neural Machine Translation by Jointly Learning to Align and Translate._](https://arxiv.org/abs/1409.0473) ICLR 2015\n",
  "- Lilian Weng: [Attention, Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)\n",
+ "- Raimi Karim: [Attn: Illustrated Attention](https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)\n",
  "- [NLP from scratch: translation with a sequence to sequence network and attention](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)"
  ]
 },

Part03_transformer.ipynb (+7)
@@ -35,6 +35,13 @@
  "- Liu et al.: [_RoBERTa: A Robustly Optimized BERT Pretraining Approach_](https://arxiv.org/abs/1907.11692)\n",
  "- Sam Sucik: [Compressing BERT for faster prediction](https://blog.rasa.com/compressing-bert-for-faster-prediction-2/amp/)\n",
  "\n",
+ "---\n",
+ "\n",
+ "# Interpretability\n",
+ "\n",
+ "- Jain et al.: [Attention is not Explanation](https://arxiv.org/abs/1902.10186)\n",
+ "- Alon Jacovi: [The Problem of Faithfulness in (Neural Network) NLP Interpretations](https://medium.com/@alonjacovi/the-problem-of-faithfulness-in-neural-network-nlp-interpretations-ee98d7027cbd)\n",
+ "\n",
  "---"
  ]
 },
