|
37 | 37 | " \n",
|
38 | 38 | " \n",
|
39 | 39 | "- the optimizer\n",
|
40 |    | - " - SGD\n",
41 |    | - " - Adam\n",
42 |    | - " - AdamW\n",
   | 40 | + " - SGD, SGD with momentum\n",
   | 41 | + " - Adam, AdamW, RAdam\n",
43 | 42 | " \n",
|
44 | 43 | " \n",
|
45 | 44 | "- the scheduler\n",
|
|
114 | 113 | "\n",
|
115 | 114 | "## 1.2 Optimizer and Scheduler\n",
|
116 | 115 | "\n",
|
117 |     | - "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016"
118 |     | - ]
119 |     | - },
120 |     | - {
121 |     | - "cell_type": "markdown",
122 |     | - "metadata": {},
123 |     | - "source": [
124 |     | - "## Stochastic Gradient Descent\n",
    | 116 | + "- Sebastian Ruder: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/) January 2016\n",
    | 117 | + "\n",
    | 118 | + "### Stochastic Gradient Descent\n",
125 | 119 | "\n",
|
126 | 120 | "- update network parameters by a fraction _learning rate_ after each example\n",
|
127 | 121 | "\n",
|
128 | 122 | "\n",
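The per-example update described above can be made concrete with a small sketch. This is not code from the notebook: `model`, `loss_fn`, and the random data are placeholders, and the learning rate of 0.01 is an arbitrary choice.

```python
import torch

# Placeholder model, loss, and learning rate (not from the notebook).
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
lr = 0.01

# Plain stochastic gradient descent: one update per training example,
# each parameter moved by a fraction (the learning rate) of its gradient.
for x, y in zip(torch.randn(8, 10), torch.randn(8, 1)):
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
```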
|
129 |     | - "## Mini-batch Gradient Descent\n",
    | 123 | + "### Mini-batch Gradient Descent\n",
130 | 124 | "\n",
|
131 | 125 | "- update network parameters by a fraction _learning rate_ after each mini-batch of size $ n>1 $\n",
|
132 | 126 | "\n",
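The only difference from the per-example case is that each gradient is computed over a mini-batch of $ n>1 $ examples before a single update. A rough sketch, again with placeholder data, an illustrative batch size of 16, and the built-in `torch.optim.SGD`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; batch_size is the mini-batch size n > 1.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Mini-batch gradient descent: one parameter update per batch.
for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()
```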
|
133 |     | - "## Adaptive Learning Rate Optimizers\n",
    | 127 | + "### Adaptive Learning Rate Optimizers\n",
134 | 128 | "\n",
|
135 | 129 | "- Adagrad, RMSprop, Adam, RAdam, AdamW\n",
|
136 | 130 | "- Liu et al.: [On the Variance of the Adaptive Learning Rate and Beyond](https://arxiv.org/abs/1908.03265)\n",
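As a rough sketch of how these optimizers are typically selected (not the notebook's own code), all of them live in `torch.optim` and share the same interface. The learning rates and weight decay below are illustrative, and `torch.optim.RAdam` assumes a PyTorch version recent enough to include it:

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# The adaptive optimizers listed above, all behind the same interface.
optimizers = {
    "adagrad": torch.optim.Adagrad(model.parameters(), lr=1e-2),
    "rmsprop": torch.optim.RMSprop(model.parameters(), lr=1e-3),
    "adam":    torch.optim.Adam(model.parameters(), lr=1e-3),
    "radam":   torch.optim.RAdam(model.parameters(), lr=1e-3),  # Liu et al. 2019
    "adamw":   torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2),
}
optimizer = optimizers["adamw"]  # swap the key to compare optimizers
```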
|
|
176 | 170 | "### 1.3.1 A Learning Function\n",
|
177 | 171 | "Define a training function that loops through the training data some number of times and updates the model weights as well as the learning rate schedule.\n",
|
178 | 172 | "\n",
|
179 |     | - "__Note__ that each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___\n"
    | 173 | + "__Note__: each iteration over the inner for loop takes _one batch_ of data from the `pytorch` data loader class. The batch size determines how much data is used to compute the gradient. ___This will become important later!___"
180 | 174 | ]
|
181 | 175 | },
|
182 | 176 | {
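To illustrate the shape of the training function discussed in 1.3.1, here is a minimal sketch. It is not the notebook's implementation: the model, data, loss, optimizer, and scheduler are placeholders, and stepping the scheduler once per epoch is just one common convention.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, loss_fn, optimizer, scheduler, epochs):
    """Loop over the training data `epochs` times, updating weights and learning rate."""
    for epoch in range(epochs):
        for xb, yb in loader:        # each inner iteration consumes ONE batch
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()         # update the model weights
        scheduler.step()             # advance the learning rate schedule

# Placeholder setup so the sketch runs end to end.
model = torch.nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
train(model, loader, loss_fn, optimizer, scheduler, epochs=3)
```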
|
|