
Commit 7bbfe21

add transformer diagram notes
1 parent 88a7825 commit 7bbfe21

10 files changed (+65, -52 lines)

_notes/ai/ai_futures.md

Lines changed: 11 additions & 19 deletions
@@ -8,13 +8,14 @@ category: ai
 
 # 🤖 AGI thoughts
 - nice AGI definition: AI systems are fully substitutable for human labor (or have a comparably large impact)
-- AI risk by deliberate human actors (i.e. concentrating power) is a greater risk than unintended use (i.e. loss of control)
-- Caveat: AGI risk is probably still underrated - nefarious use is likely worse than accidental misuse
-- alignment research is technically more interesting than safety research…
+- AI risk by deliberate human actors (i.e. concentrating power) seems to be a greater risk than unintended use (i.e. loss of control) [see some thought-out risks [here](https://cdn.openai.com/openai-preparedness-framework-beta.pdf)]
+- Caveat: AGI risk may still be high - nefarious use can easily be worse than accidental misuse
+- alignment research seems to be technically more interesting than safety research…
+
 - Data limitations (e.g. in medicine) will limit rapid general advancements
 
 # human compatible
-**A set of notes based on the book human compatible, by Stuart Russell 2019**
+**A set of notes based on the book *human compatible*, by Stuart Russell (2019)**
 
 ## what if we succeed?
 
@@ -41,7 +42,7 @@ category: ai
 - carry out experiments and compare against all existing results easily
 - high-level goal: raise the standard of living for everyone everywhere?
 - AI tutoring
-- EU GDPR's "right to an explanation" wording is actually much weaker: "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject."
+- EU GDPR's "right to an explanation" wording is actually much weaker: "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject"
 - whataboutery - a method for deflecting questions where one always asks "what about X?" rather than engaging
 
 ## harms of ai
@@ -52,8 +53,6 @@ category: ai
 - ex. deepfakes / fake media
 - ex. automation - how to solve this? Universal basic income?
 
-
-
 ## value alignment
 
 - ex. king midas
@@ -63,11 +62,9 @@ category: ai
 - note: for an AI, it might be easier to convince of a different objective than actually solve the objective
 - basically any optimization objective will lead AI to disable its own off-switch
 
-
-
 ## possible solns
 
-- Oracle AI - can only answer yes/no/probabilistic questions, otherwise no output to the real world
+- oracle AI - can only answer yes/no/probabilistic questions, otherwise no output to the real world
 - inverse RL
 - ai should be uncertain about utilities
 - utilities should be inferred from human preferences
@@ -87,7 +84,7 @@ category: ai
 
 # possible minds
 
-**edited by John Brockman, 2019**
+**notes from *possible minds*, a collection of essays edited by John Brockman (2019)**
 
 ## intro (brockman)
 
@@ -100,16 +97,13 @@ category: ai
 - ai has gone down and up for a while
 - gofai - good old-fashioned ai
 - things people thought would be hard, like chess, were easy
-- lots of physicists in this book...
 
 ## wrong but more relevant than ever (seth lloyd)
 
 - current AI is way worse than people think it is
 - wiener was very pessimistic - wwII / cold war
 - singularity is not coming...
 
-
-
 ## the limitations of opaque learning machines (judea pearl)
 
 - 3 levels of reasoning
@@ -125,8 +119,7 @@ category: ai
 - something about relationship...
 - correlations, causes, explanations (moral/rational) - biologically biased towards this?
 - beliefs + desires cause actions
-- randomly picking grants above some cutoff...
-- pretty cool that different people do things because of norms (e.g. come to class at 4pm)
+- pretty cool that different people follow norms (e.g. come to class at 4pm)
 - could you do this with ai?
 - facebook chatbot ex.
 - paperclip machine, ads on social media
@@ -325,7 +318,7 @@ category: ai
 
 ---
 
-## David Kaiser: Information for wiener, Shannon, and for Us
+## Information for wiener, Shannon, and for Us (david kaiser)
 
 - wiener: society can only be understood based on analyzing messages
 - information = semantic information
@@ -350,7 +343,7 @@ category: ai
 - cities seem to increase diversity - more people to interact with
 - dl should seek more semantic info not statistical info
 
-## Neil Gershenfield: Scaling
+## Scaling (Neil Gershenfield)
 
 - ai is more about scaling laws rather than fashions
 - mania: success to limited domains
@@ -362,4 +355,3 @@ category: ai
 - next: fabrication - how to make things?
 - ex. body uses only 20 amino acids
 
-

_notes/ai/cogsci.md

Lines changed: 6 additions & 7 deletions
@@ -70,10 +70,10 @@ subtitle: Some notes on a computational perspective on cognitive science.
 - dna can build in things that are hard to learn
 - start with nothing built in, ex. deep learning (connectionism)
 - start with best possible learning algorithm and ask what you need to build in (bayesian modeling)
-- bayes rule Bayesian: $\overbrace{p(\theta | x)}^{\text{posterior}} = \frac{\overbrace{p(x|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}$ where x is data and $\theta$ are hypotheses
+- bayes rule: $\overbrace{p(\theta | x)}^{\text{posterior}} = \frac{\overbrace{p(x|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}$ where x is data and $\theta$ are hypotheses
 - ask what priors explain people's inferences
 - humans make very causal priors - restricts hypothesis space of possible bayesian networks
-- hierarhical model - get prior from something (e.g. know all bags contain same color)
+- hierarchical model - get prior from something (e.g. know all bags contain same color)
 - what develops over time?
 - bayesian doesn't really tell us this - just has probabilities evolve over time
 - real life we come up with new hypotheses
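A minimal numeric sketch of the bayes-rule bullet above, with a made-up two-hypothesis example (the prior and likelihood values are purely illustrative):

```python
import numpy as np

# two hypotheses theta (e.g. "bag is mostly red" vs "bag is mostly blue")
prior = np.array([0.5, 0.5])        # p(theta)
likelihood = np.array([0.9, 0.2])   # p(x | theta) for the observed data x

unnormalized = likelihood * prior               # numerator of bayes rule
posterior = unnormalized / unnormalized.sum()   # dividing by p(x) normalizes
print(posterior)                                # [0.818..., 0.181...]
```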
@@ -272,13 +272,12 @@ subtitle: Some notes on a computational perspective on cognitive science.
 - e.g. if you meet a shy person, are they more likely to be a salesperson or a librarian?
 - *sunk cost fallacy*
 - *scope neglect* - the number of birds saved—the *scope* of the altruistic action—had little effect on willingness to pay ([post](https://www.lesswrong.com/s/5g5TkQTe9rmPS5vvM/p/2ftJ38y9SRBCBsCzy))
-- *availability heuristic* - judging the frequency or probability of an event by the ease with which examples of the event come to mind.
+- *availability heuristic* - judging the frequency or probability of an event by the ease with which examples of the event come to mind
 - [absurdity bias](https://www.lesswrong.com/lw/j4/absurdity_heuristic_absurdity_bias/); events that have never happened are not recalled, and hence deemed to have probability zero.
 - *conjunction fallacy* - humans assign a higher probability to a proposition of the form “A and B” than to one of the propositions “A” or “B” in isolation
-- The implausibility of one claim is compensated by the plausibility of the other; they “average out.”
+- the implausibility of one claim is compensated by the plausibility of the other; they “average out.”
 - *planning fallacy* - people think they can plan e.g. "best guess" scenarios are same as "best case" scenarios
 - rationality
-- **epistemic rationality** - systematically improving the accuracy of your beliefs.
-- **instrumental rationality** - systematically achieving your values.
-- *probability theory*, and *decision theory*
+- **epistemic rationality** - systematically improving the accuracy of your beliefs
+- **instrumental rationality** - systematically achieving your values
 - System 1 and System 2—fast perceptual judgments versus slow deliberative judgments. System 2’s deliberative judgments aren’t always true, and System 1’s perceptual judgments aren’t always false; so it is very important to distinguish that dichotomy from “rationality.”

_notes/assets/kv_caching_diagram.png

206 KB

_notes/assets/transformer_sizes.png

480 KB

_notes/cs/algo.md

Lines changed: 2 additions & 2 deletions
@@ -12,9 +12,9 @@ subtitle: Some notes on algorithms following the book <a href="https://en.wikipe
 - big-oh: O(g): functions that grow no faster than g - upper bound, runs in time less than g
 - $f(n) \leq c\cdot g(n)$ for some c, large n
 - set of functions s.t. there exists c,k>0, 0 ≤ f(n) ≤ c*g(n), for all n > k
-- big-theta: Θ(g): functions that grow at the same rate as g
+- big-theta: $\Theta (g)$: functions that grow at the same rate as g
 - big-oh(g) and big-theta(g) - asymptotic tight bound
-- big-omega: Ω(g): functions that grow at least as fast as g
+- big-omega: $\Omega(g)$: functions that grow at least as fast as g
 - f(n)≥c*g(n) for some c, large n
 - Example: f = 57n+3
 - O(n^2) - or anything bigger
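A quick worked check of the $f = 57n+3$ example above: taking $c=58$ and $k=3$ gives $57n+3 \le 58n$ for all $n > 3$, so $f \in O(n)$; and since $f(n) \ge 57n$ for all $n \ge 0$ (take $c=57$), $f \in \Omega(n)$, hence $f \in \Theta(n)$.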

_notes/cs/data_structures.md

Lines changed: 2 additions & 2 deletions
@@ -236,7 +236,7 @@ void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference be
 - Hash tables store key-value pairs
 - Each value has a specific key associated with it
 - fixed size array of some size, usually a prime number
-- A hash function takes in a "thing" )string, int, object, etc._
+- A hash function takes in a "thing" (string, int, object, etc.)
 
 - returns hash value - an unsigned integer value which is then mod'ed by the size of the hash table to yield a spot within the bounds of the hash table array
 - Three required properties
@@ -253,7 +253,7 @@ void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference be
 - We can't just make a very large array - we assume the key space is too large
 
 - you can't just hash by social security number
-- hash(s)=(∑k−1i=0si∗37^i) mod table_size
+- $hash(s)=(\sum_{i=0}^{k−1} s_i * 37^i)$ mod table_size
 
 - you would precompute the powers of 37
 - collision - putting two things into same spot in hash table
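A small Python sketch of the polynomial hash formula above (the surrounding notes use C++, and the table size 101 here is an arbitrary prime chosen for illustration):

```python
def poly_hash(s: str, table_size: int = 101) -> int:
    """hash(s) = (sum_i s_i * 37^i) mod table_size."""
    h, power = 0, 1            # power tracks 37^i (the powers of 37 one would precompute)
    for ch in s:
        h = (h + ord(ch) * power) % table_size
        power = (power * 37) % table_size
    return h

print(poly_hash("hello"))      # an index within the bounds of the hash table array
```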

_notes/cs/languages.md

Lines changed: 32 additions & 15 deletions
@@ -161,22 +161,28 @@ merging
 pd.merge(df1, df2, how='left', on='x1')
 ```
 
-## pytorch + pytorch parallel
+# gpu / parallelization
 
 - new in 1.11: TorchData, functorch (e.g. vmap), DistributedDataParallel is stable
-- levels of [parallelism](https://huggingface.co/docs/transformers/v4.15.0/parallelism)
-- DP dataparallel - speeds up by replicating model and feeding it different data
-- gpt2 & T5 models have naive PP support
 
-- TP tensorparallel (horizontal parallelism) - allows running a large model by splitting up different parts of an input
-- [parallelformers](https://github.com/tunib-ai/parallelformers) provides easy support for inference-only
+- huggingface [performance](https://huggingface.co/docs/transformers/main/en/performance) overview
+
+- overview links from [huggingface](https://huggingface.co/docs/transformers/v4.15.0/parallelism) and [pytorch](https://pytorch.org/tutorials/beginner/dist_overview.html)
+
+- levels of [parallelism](https://huggingface.co/docs/transformers/v4.15.0/parallelism)
+1. DataParallel - split **batches** across gpus and replicate model on each gpu (model must fit on single gpu)
+- gpt2 & T5 models have naive PP support
 
-- PP pipelineparallel (vertical parallelism) - different layers are on different gpus
-- naive - diff layers on diff gpus (increases mem but not speed)
-- pipeline parallel - separates so more gpus can work at once
+2. TensorParallel - split **parts of an input** across gpus
+- [parallelformers](https://github.com/tunib-ai/parallelformers) provides easy support for inference-only
 
-- Deepspeed/Megatron/Varuna/Sagemaker combines DP with PP
+3. PipelineParallel - split **different layers** across gpus
+
+- naive - diff layers on diff gpus (increases mem but not speed)
 
+- pipeline parallel - separates so more gpus can work at once
+- Deepspeed/Megatron/Varuna/Sagemaker combine dataparallel with pipeline parallel
+
 - [pytorch parallel overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
 - single-machine multi-GPU: [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) - relatively simple
 - just wrap model `model = nn.DataParallel(model)` (some attributes may become inaccessible)
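A minimal sketch of the `nn.DataParallel` wrapping from the last bullet (the toy model and batch shapes are made up; it replicates the model and splits each batch across the visible gpus):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # replicate model, split each batch across gpus
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(128, 32, device=device)   # the batch dimension gets chunked across replicas
out = model(x)                             # outputs are gathered back on the default gpu
print(out.shape)                           # torch.Size([128, 10])
```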
@@ -185,10 +191,13 @@ pd.merge(df1, df2, how='left', on='x1')
 - multimachine GPU: [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) + [launching script](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md)
 - multimachine flexible: [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html) - handles errors better
 - (there is also RPC-based training and Collective Communication)
+
 - dataset has `__init__, __getitem__, & __len__`
 
 - rather than storing images, can load image from filename in `__getitem__`
+
 - there's a `torch.nn.Flatten` module
+
 - following example [here](https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51)
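A minimal sketch of the `Dataset` interface from the bullets above, loading each image lazily from its filename in `__getitem__` (the directory path and file pattern are placeholders):

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class ImageFileDataset(Dataset):
    def __init__(self, image_dir: str):
        self.paths = sorted(Path(image_dir).glob("*.png"))  # store filenames, not images
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")     # load lazily, one item at a time
        return self.to_tensor(img)

loader = DataLoader(ImageFileDataset("data/images"), batch_size=32, shuffle=True)
```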
@@ -200,19 +209,19 @@ pd.merge(df1, df2, how='left', on='x1')
 - **node** - different computers with distinct memory
 
 - **processes** - instances of a program executing on a machine
-
+
 - shouldn't have more user processes than cores on a node
-
+
 - **threads** - multiple paths of execution within a single process
 - like a process, but more lightweight
-
+
 - passing messages between nodes (e.g. for distributed memory) often use protocol known as MPI
 - packages such as Dask do this for you, without MPI
-
+
 - python packages
 - *dask*: parallelizing tasks, distributed datasets, one or more machines
 - *ray*: parallelizing tasks, building distributed applications, one or more machines
-
+
 - dask
 - separate code from parallelization
 - some limitations on pure python code, but np/pandas etc. parallelize better
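A tiny sketch of the dask pattern described above, keeping the code separate from its parallelization (both functions are placeholders for real work):

```python
import dask

@dask.delayed
def load(i):
    return list(range(i))      # stand-in for an expensive load

@dask.delayed
def process(chunk):
    return sum(chunk)          # stand-in for real computation

# building the calls only records a task graph; compute() runs it on the configured scheduler
results = dask.compute(*[process(load(i)) for i in range(8)])
print(results)
```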
@@ -227,6 +236,14 @@ pd.merge(df1, df2, how='left', on='x1')
 - then run dask-worker many times (as many tasks as there are)
 - can also directly submit slurm jobs from dask
 
+## gpu
+
+- https://horace.io/brrr_intro.html
+- within gpu, all operations require moving from gpu’s DRAM to GPU’s SRAM
+- for pointwise operations the time to do the moving (memory bandwidth cost) is longer than the computation itself
+- operator fusion allows us to do many operations at once before moving back to DRAM
+- can lead to interesting things, e.g. activation functions are nearly all the same cost, despite `gelu` obviously consisting of many more operations than `relu`
+
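A small sketch of the operator-fusion point above, assuming PyTorch 2.x so that `torch.compile` is available; fusing a chain of pointwise ops lets them run without writing each intermediate back to DRAM (actual speedups depend on the gpu):

```python
import torch

def pointwise_chain(x):
    # several cheap elementwise ops; run eagerly, each one reads and writes DRAM
    return torch.relu(x * 2 + 1).sigmoid() * x

fused = torch.compile(pointwise_chain)   # may fuse the chain into fewer kernels

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
print(torch.allclose(pointwise_chain(x), fused(x), atol=1e-6))
```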
 # c/c++
 
 - The C memory model: global, local, and heap variables. Where they are stored, their properties, etc.

_notes/ml/deep_learning.md

Lines changed: 4 additions & 5 deletions
@@ -9,7 +9,7 @@ subtitle: This note covers miscellaneous deep learning, with an emphasis on diff
 
 See also notes in [📌 unsupervised learning](https://csinva.io/notes/ml/unsupervised.html), [📌 disentanglement](https://csinva.io/notes/research_ovws/ovw_disentanglement.html), [📌 nlp](https://csinva.io/notes/ml/nlp.html), [📌 transformers](https://csinva.io/notes/research_ovws/ovw_transformers.html)
 
-# top-performing nets
+# historical top-performing nets
 
 - LeNet (1998)
 - first, used on MNIST
@@ -61,10 +61,9 @@ See also notes in [📌 unsupervised learning](https://csinva.io/notes/ml/unsupe
 - rectifying in electronics converts analog -> digital
 - rare to mix and match neuron types
 - *deep* - more than 1 hidden layer
-- mean-squared error regression loss = $\frac{1}{2}(y-\hat{y})^2$
-- classification loss = $-y \log (\hat{y}) - (1-y) \log(1-\hat{y})$
-- can't use SSE because not convex here
-- multiclass classification loss $=-\sum_j y_j \ln \hat{y}_j$
+- mean-squared error (regression loss) = $\frac{1}{2}(y-\hat{y})^2$
+- cross-entropy loss (classification loss) $= -\sum_j y_j \ln \hat{p}_j$
+- for binary classification, $y_j$ would take on 0 or 1 and $\hat{p}_j$ would be a probability
 - **backpropagation** - application of *reverse mode automatic differentiation* to a neural network's loss
 - apply the chain rule from the end of the program back towards the beginning
 - $\frac{dL}{d \theta_i} = \frac{dL}{dz} \frac{\partial z}{\partial \theta_i}$
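A small numeric sketch of the two losses in this hunk (the target, prediction, and class probabilities are made up):

```python
import numpy as np

# mean-squared error (regression): 1/2 * (y - y_hat)^2
y, y_hat = 3.0, 2.5
mse = 0.5 * (y - y_hat) ** 2                        # 0.125

# cross-entropy (classification): -sum_j y_j * ln(p_hat_j)
y_onehot = np.array([0.0, 1.0, 0.0])                # true class is index 1
p_hat = np.array([0.1, 0.7, 0.2])                   # predicted probabilities
cross_entropy = -np.sum(y_onehot * np.log(p_hat))   # = -ln(0.7) ≈ 0.357
print(mse, cross_entropy)
```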

_notes/research_ovws/ovw_llms.md

Lines changed: 7 additions & 2 deletions
@@ -236,6 +236,7 @@ See related papers in the [📌 interpretability](https://csinva.io/notes/resear
 - self-PPLs extend probabilistic graphical models to support more complex joint distributions whose size and “shape” can itself be stochastic
 - e.g., a graph unrolled for a random number of iterations, until a data-dependent stopping criterion is met
 - variables are all text: questions $Q$, answers $A$, and intermediate thoughts $T$
+- Prover-Verifier Games improve legibility of LLM outputs ([kirchner, chen, ... leike, mcaleese, & burda, 2024](https://arxiv.org/abs/2407.13692)) - trained strong LMs to produce text that is easy for weak LMs to verify and found that this training also made the text easier for humans to evaluate.
 - posthoc
 - understanding chain-of-thought and its faithfulness
 - Faithful Chain-of-Thought Reasoning ([yu et al. 2023](https://arxiv.org/abs/2301.13379))
@@ -850,7 +851,7 @@ Editing is generally very similar to just adaptation/finetuning. One distinction
 
 mixture of experts models have become popular because of the need for (1) fast speed / low memory at test time while still (2) having a large model during training
 
-- note: nowadays often the "experts" are different MLPs following the self-attention layers
+- note: nowadays often the "experts" are different MLPs following the self-attention layers (since their computations can be computed independently)
 - A Review of Sparse Expert Models in Deep Learning ([fedus, jeff dean, zoph, 2022](https://arxiv.org/abs/2209.01667))
 - sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models
 - routing algorithm - determines where to send examples
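A toy sketch of the routing bullet above, using top-1 gating over two expert MLPs (all sizes are made up, and this is not the routing scheme of any particular paper):

```python
import torch
import torch.nn as nn

d, n_experts, n_tokens = 16, 2, 8
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d)) for _ in range(n_experts)]
)
router = nn.Linear(d, n_experts)            # routing algorithm: score each expert per token

x = torch.randn(n_tokens, d)
expert_idx = router(x).argmax(dim=-1)        # top-1 routing decision per token
out = torch.empty_like(x)
for e in range(n_experts):                   # each expert MLP runs independently on its tokens
    mask = expert_idx == e
    if mask.any():
        out[mask] = experts[e](x[mask])
print(out.shape)                             # torch.Size([8, 16])
```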
@@ -1590,12 +1591,16 @@ mixture of experts models have become popular because of the need for (1) fast s
 
 # basics
 
+![transformer_sizes](../assets/transformer_sizes.png)
+
+![kv_caching_diagram](../assets/kv_caching_diagram.png)
+
 - **attention** = vector of importance weights
 - to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “*attends to*”) other elements and take the sum of their values weighted by the attention vector as the approximation of the target
 - vanilla transformer: multihead attention, add + norm, position-wise ffn, add + norm
 - self-attention layer [implementation](https://github.com/mertensu/transformer-tutorial), [mathematics](https://homes.cs.washington.edu/~thickstn/docs/transformers.pdf), and **chandan's self-attention [cheat-sheet](https://slides.com/chandansingh-2/deck-51f404)**
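A small numpy sketch of the attention bullets and the kv_caching_diagram above (single head, no learned projections, purely illustrative): during decoding, only the newest token's query is computed, while its key/value rows are appended to a growing cache.

```python
import numpy as np

def attend(q, K, V):
    """attention = softmax(q K^T / sqrt(d)) V, a weighted sum of cached values."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d, rng = 8, np.random.default_rng(0)
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))  # kv cache grows one row per token

for step in range(5):                        # toy autoregressive decoding loop
    k, v, q = rng.normal(size=(3, d))        # the new token's key, value, and query
    K_cache = np.vstack([K_cache, k[None]])
    V_cache = np.vstack([V_cache, v[None]])
    out = attend(q, K_cache, V_cache)        # attends over all cached positions
print(out.shape)                             # (8,)
```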
 
-## mathematical overview of transformers
+## mathematical overview of transformers
 
 - based on [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238?utm_source=substack&utm_medium=email)
 - tasks

_notes/stat/linear_models.md

Lines changed: 1 addition & 0 deletions
@@ -321,6 +321,7 @@ subtitle: Material from "Statistical Models Theory and Practice" - David Freedma
 - $L(\theta) = 1/2 (\theta^T X^T - y^T) (X \theta -y)$
 - $L(\theta) = 1/2 (\theta^T X^T X \theta - 2 \theta^T X^T y +y^T y)$
 - $0=\frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^T y$
+- $X^Ty=X^TX\theta$
 - $\theta = (X^TX)^{-1} X^Ty$
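A quick numpy check of this derivation on random data, solving the normal equations $X^TX\theta = X^Ty$ directly rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=100)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # X^T X theta = X^T y
print(theta_hat)                                 # close to theta_true
print(np.allclose(theta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```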
 
 ## ridge regression

0 commit comments
