
Commit 7bbfe21

add transformer diagram notes
1 parent 88a7825 commit 7bbfe21

10 files changed (+65, -52 lines)

_notes/ai/ai_futures.md

Lines changed: 11 additions & 19 deletions
@@ -8,13 +8,14 @@ category: ai
 
 # 🤖 AGI thoughts
 - nice AGI definition: AI systems are fully substitutable for human labor (or have a comparably large impact)
-- AI risk by deliberate human actors (i.e. concentrating power) is a greater risk than unintended use (i.e. loss of control)
-- Caveat: AGI risk is probably still underrated - nefarious use is likely worse than accidental misuse
-- alignment research is technically more interesting than safety research…
+- AI risk by deliberate human actors (i.e. concentrating power) seems to be a greater risk than unintended use (i.e. loss of control) [see some thought-out risks [here](https://cdn.openai.com/openai-preparedness-framework-beta.pdf)]
+- Caveat: AGI risk may still be high - nefarious use can easily be worse than accidental misuse
+- alignment research seems to be technically more interesting than safety research…
+
 - Data limitations (e.g. in medicine) will limit rapid general advancements
 
 # human compatible
-**A set of notes based on the book human compatible, by Stuart Russell 2019**
+**A set of notes based on the book *human compatible*, by Stuart Russell (2019)**
 
 ## what if we succeed?
 
@@ -41,7 +42,7 @@ category: ai
 - carry out experiments and compare against all existing results easily
 - high-level goal: raise the standard of living for everyone everywhere?
 - AI tutoring
-- EU GDPR's "right to an explanation" wording is actually much weaker: "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject."
+- EU GDPR's "right to an explanation" wording is actually much weaker: "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject"
 - whataboutery - a method for deflecting questions where one always asks "what about X?" rather than engaging
 
 ## harms of ai
@@ -52,8 +53,6 @@ category: ai
 - ex. deepfakes / fake media
 - ex. automation - how to solve this? Universal basic income?
 
-
-
 ## value alignment
 
 - ex. king midas
@@ -63,11 +62,9 @@ category: ai
 - note: for an AI, it might be easier to convince of a different objective than actually solve the objective
 - basically any optimization objective will lead AI to disable its own off-switch
 
-
-
 ## possible solns
 
-- Oracle AI - can only answer yes/no/probabilistic questions, otherwise no output to the real world
+- oracle AI - can only answer yes/no/probabilistic questions, otherwise no output to the real world
 - inverse RL
 - ai should be uncertain about utilities
 - utilities should be inferred from human preferences
@@ -87,7 +84,7 @@ category: ai
 
 # possible minds
 
-**edited by John Brockman, 2019**
+**notes from *possible minds*, a collection of essays edited by John Brockman (2019)**
 
 ## intro (brockman)
 
@@ -100,16 +97,13 @@ category: ai
 - ai has gone down and up for a while
 - gofai - good old-fashioned ai
 - things people thought would be hard, like chess, were easy
-- lots of physicists in this book...
 
 ## wrong but more relevant than ever (seth lloyd)
 
 - current AI is way worse than people think it is
 - wiener was very pessimistic - wwII / cold war
 - singularity is not coming...
 
-
-
 ## the limitations of opaque learning machines (judea pearl)
 
 - 3 levels of reasoning
@@ -125,8 +119,7 @@ category: ai
 - something about relationship...
 - correlations, causes, explanations (moral/rational) - biologically biased towards this?
 - beliefs + desires cause actions
-- randomly picking grants above some cutoff...
-- pretty cool that different people do things because of norms (e.g. come to class at 4pm)
+- pretty cool that different people follow norms (e.g. come to class at 4pm)
 - could you do this with ai?
 - facebook chatbot ex.
 - paperclip machine, ads on social media
@@ -325,7 +318,7 @@ category: ai
 
 ---
 
-## David Kaiser: Information for wiener, Shannon, and for Us
+## Information for wiener, Shannon, and for Us (david kaiser)
 
 - wiener: society can only be understood based on analyzing messages
 - information = semantic information
@@ -350,7 +343,7 @@ category: ai
 - cities seem to increase diversity - more people to interact with
 - dl should seek more semantic info not statistical info
 
-## Neil Gershenfield: Scaling
+## Scaling (Neil Gershenfield)
 
 - ai is more about scaling laws rather than fashions
 - mania: success to limited domains
@@ -362,4 +355,3 @@ category: ai
 - next: fabrication - how to make things?
 - ex. body uses only 20 amino acids
 
-

_notes/ai/cogsci.md

Lines changed: 6 additions & 7 deletions
@@ -70,10 +70,10 @@ subtitle: Some notes on a computational perspective on cognitive science.
 - dna can build in things that are hard to learn
 - start with nothing built in, ex. deep learning (connectionism)
 - start with best possible learning algorithm and ask what you need to build in (bayesian modeling)
-- bayes rule Bayesian: $\overbrace{p(\theta | x)}^{\text{posterior}} = \frac{\overbrace{p(x|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}$ where x is data and $\theta$ are hypotheses
+- bayes rule: $\overbrace{p(\theta | x)}^{\text{posterior}} = \frac{\overbrace{p(x|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}$ where x is data and $\theta$ are hypotheses
 - ask what priors explain people's inferences
 - humans make very causal priors - restricts hypothesis space of possible bayesian networks
-- hierarhical model - get prior from something (e.g. know all bags contain same color)
+- hierarchical model - get prior from something (e.g. know all bags contain same color)
 - what develops over time?
 - bayesian doesn't really tell us this - just has probabilities evolve over time
 - real life we come up with new hypotheses
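A minimal numeric sketch of the bayes-rule bullet above, with a made-up two-hypothesis example (the prior and likelihood values are purely illustrative):

```python
import numpy as np

# two hypotheses theta (e.g. "bag is mostly red" vs "bag is mostly blue")
prior = np.array([0.5, 0.5])        # p(theta)
likelihood = np.array([0.9, 0.2])   # p(x | theta) for the observed data x

unnormalized = likelihood * prior               # numerator of bayes rule
posterior = unnormalized / unnormalized.sum()   # dividing by p(x) normalizes
print(posterior)                                # [0.818..., 0.181...]
```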
@@ -272,13 +272,12 @@ subtitle: Some notes on a computational perspective on cognitive science.
 - e.g. if you meet a shy person, are they more likely to be a salesperson or a librarian?
 - *sunk cost fallacy*
 - *scope neglect* - the number of birds saved—the *scope* of the altruistic action—had little effect on willingness to pay ([post](https://www.lesswrong.com/s/5g5TkQTe9rmPS5vvM/p/2ftJ38y9SRBCBsCzy))
-- *availability heuristic* - judging the frequency or probability of an event by the ease with which examples of the event come to mind.
+- *availability heuristic* - judging the frequency or probability of an event by the ease with which examples of the event come to mind
 - [absurdity bias](https://www.lesswrong.com/lw/j4/absurdity_heuristic_absurdity_bias/); events that have never happened are not recalled, and hence deemed to have probability zero.
 - *conjunction fallacy* - humans assign a higher probability to a proposition of the form “A and B” than to one of the propositions “A” or “B” in isolation
-- The implausibility of one claim is compensated by the plausibility of the other; they “average out.”
+- the implausibility of one claim is compensated by the plausibility of the other; they “average out.”
 - *planning fallacy* - people think they can plan e.g. "best guess" scenarios are same as "best case" scenarios
 - rationality
-- **epistemic rationality** - systematically improving the accuracy of your beliefs.
-- **instrumental rationality** - systematically achieving your values.
-- *probability theory*, and *decision theory*
+- **epistemic rationality** - systematically improving the accuracy of your beliefs
+- **instrumental rationality** - systematically achieving your values
 - System 1 and System 2—fast perceptual judgments versus slow deliberative judgments. System 2’s deliberative judgments aren’t always true, and System 1’s perceptual judgments aren’t always false; so it is very important to distinguish that dichotomy from “rationality.”

_notes/assets/kv_caching_diagram.png

206 KB

_notes/assets/transformer_sizes.png

480 KB

_notes/cs/algo.md

Lines changed: 2 additions & 2 deletions
@@ -12,9 +12,9 @@ subtitle: Some notes on algorithms following the book <a href="https://en.wikipe
 - big-oh: O(g): functions that grow no faster than g - upper bound, runs in time less than g
 - $f(n) \leq c\cdot g(n)$ for some c, large n
 - set of functions s.t. there exists c,k>0, 0 ≤ f(n) ≤ c*g(n), for all n > k
-- big-theta: Θ(g): functions that grow at the same rate as g
+- big-theta: $\Theta (g)$: functions that grow at the same rate as g
 - big-oh(g) and big-theta(g) - asymptotic tight bound
-- big-omega: Ω(g): functions that grow at least as fast as g
+- big-omega: $\Omega(g)$: functions that grow at least as fast as g
 - f(n)≥c*g(n) for some c, large n
 - Example: f = 57n+3
 - O(n^2) - or anything bigger
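A quick worked check of the $f = 57n+3$ example above: taking $c=58$ and $k=3$ gives $57n+3 \le 58n$ for all $n > 3$, so $f \in O(n)$; and since $f(n) \ge 57n$ for all $n \ge 0$ (take $c=57$), $f \in \Omega(n)$, hence $f \in \Theta(n)$.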

_notes/cs/data_structures.md

Lines changed: 2 additions & 2 deletions
@@ -236,7 +236,7 @@ void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference be
 - Hash tables store key-value pairs
 - Each value has a specific key associated with it
 - fixed size array of some size, usually a prime number
-- A hash function takes in a "thing" )string, int, object, etc._
+- A hash function takes in a "thing" (string, int, object, etc.)
 
 - returns hash value - an unsigned integer value which is then mod'ed by the size of the hash table to yield a spot within the bounds of the hash table array
 - Three required properties
@@ -253,7 +253,7 @@ void BST::insert(int x, BinaryNode * & curNode){ //we pass in by reference be
 - We can't just make a very large array - we assume the key space is too large
 
 - you can't just hash by social security number
-- hash(s)=(∑k−1i=0si∗37^i) mod table_size
+- $hash(s)=(\sum_{i=0}^{k−1} s_i * 37^i)$ mod table_size
 
 - you would precompute the powers of 37
 - collision - putting two things into same spot in hash table
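A small Python sketch of the polynomial hash formula above (the surrounding notes use C++, and the table size 101 here is an arbitrary prime chosen for illustration):

```python
def poly_hash(s: str, table_size: int = 101) -> int:
    """hash(s) = (sum_i s_i * 37^i) mod table_size."""
    h, power = 0, 1            # power tracks 37^i (the powers of 37 one would precompute)
    for ch in s:
        h = (h + ord(ch) * power) % table_size
        power = (power * 37) % table_size
    return h

print(poly_hash("hello"))      # an index within the bounds of the hash table array
```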

_notes/cs/languages.md

Lines changed: 32 additions & 15 deletions
@@ -161,22 +161,28 @@ merging
 pd.merge(df1, df2, how='left', on='x1')
 ```
 
-## pytorch + pytorch parallel
+# gpu / parallelization
 
 - new in 1.11: TorchData, functorch (e.g. vmap), DistributedDataParallel is stable
-- levels of [parallelism](https://huggingface.co/docs/transformers/v4.15.0/parallelism)
-- DP dataparallel - speeds up by replicating model and feeding it different data
-- gpt2 & T5 models have naive PP support
 
-- TP tensorparallel (horizontal parallelism) - allows running a large model by splitting up different parts of an input
-- [parallelformers](https://github.com/tunib-ai/parallelformers) provides easy support for inference-only
+- huggingface [performance](https://huggingface.co/docs/transformers/main/en/performance) overview
+
+- overview links from [huggingface](https://huggingface.co/docs/transformers/v4.15.0/parallelism) and [pytorch](https://pytorch.org/tutorials/beginner/dist_overview.html)
+
+- levels of [parallelism](https://huggingface.co/docs/transformers/v4.15.0/parallelism)
+1. DataParallel - split **batches** across gpus and replicate model on each gpu (model must fit on single gpu)
+- gpt2 & T5 models have naive PP support
 
-- PP pipelineparallel (vertical parallelism) - different layers are on different gpus
-- naive - diff layers on diff gpus (increases mem but not speed)
-- pipeline parallel - separates so more gpus can work at once
+2. TensorParallel - split **parts of an input** across gpus
+- [parallelformers](https://github.com/tunib-ai/parallelformers) provides easy support for inference-only
 
-- Deepspeed/Megatron/Varuna/Sagemaker combines DP with PP
+3. PipelineParallel - split **different layers** across gpus
+
+- naive - diff layers on diff gpus (increases mem but not speed)
 
+- pipeline parallel - separates so more gpus can work at once
+- Deepspeed/Megatron/Varuna/Sagemaker combine dataparallel with pipeline parallel
+
 - [pytorch parallel overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
 - single-machine multi-GPU: [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) - relatively simple
 - just wrap model `model = nn.DataParallel(model)` (some attributes may become inaccessible)
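A minimal sketch of the `nn.DataParallel` wrapping from the last bullet (the toy model and batch shapes are made up; it replicates the model and splits each batch across the visible gpus):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # replicate model, split each batch across gpus
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(128, 32, device=device)   # the batch dimension gets chunked across replicas
out = model(x)                             # outputs are gathered back on the default gpu
print(out.shape)                           # torch.Size([128, 10])
```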
@@ -185,10 +191,13 @@ pd.merge(df1, df2, how='left', on='x1')
 - multimachine GPU: [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) + [launching script](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md)
 - multimachine flexible: [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html) - handles errors better
 - (there is also RPC-based training and Collective Communication)
+
 - dataset has `__init__, __getitem__, & __len__`
 
 - rather than storing images, can load image from filename in `__getitem__`
+
 - there's a `torch.nn.Flatten` module
+
 - following example [here](https://medium.com/codex/a-comprehensive-tutorial-to-pytorch-distributeddataparallel-1f4b42bb1b51)
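A minimal sketch of the `Dataset` interface from the bullets above, loading each image lazily from its filename in `__getitem__` (the directory path and file pattern are placeholders):

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class ImageFileDataset(Dataset):
    def __init__(self, image_dir: str):
        self.paths = sorted(Path(image_dir).glob("*.png"))  # store filenames, not images
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")     # load lazily, one item at a time
        return self.to_tensor(img)

loader = DataLoader(ImageFileDataset("data/images"), batch_size=32, shuffle=True)
```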
@@ -200,19 +209,19 @@ pd.merge(df1, df2, how='left', on='x1')
 - **node** - different computers with distinct memory
 
 - **processes** - instances of a program executing on a machine
-
+
 - shouldn't have more user processes than cores on a node
-
+
 - **threads** - multiple paths of execution within a single process
 - like a process, but more lightweight
-
+
 - passing messages between nodes (e.g. for distributed memory) often use protocol known as MPI
 - packages such as Dask do this for you, without MPI
-
+
 - python packages
 - *dask*: parallelizing tasks, distributed datasets, one or more machines
 - *ray*: parallelizing tasks, building distributed applications, one or more machines
-
+
 - dask
 - separate code from parallelization
 - some limitations on pure python code, but np/pandas etc. parallelize better
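A tiny sketch of the dask pattern described above, keeping the code separate from its parallelization (both functions are placeholders for real work):

```python
import dask

@dask.delayed
def load(i):
    return list(range(i))      # stand-in for an expensive load

@dask.delayed
def process(chunk):
    return sum(chunk)          # stand-in for real computation

# building the calls only records a task graph; compute() runs it on the configured scheduler
results = dask.compute(*[process(load(i)) for i in range(8)])
print(results)
```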
@@ -227,6 +236,14 @@ pd.merge(df1, df2, how='left', on='x1')
 - then run dask-worker many times (as many tasks as there are)
 - can also directly submit slurm jobs from dask
 
+## gpu
+
+- https://horace.io/brrr_intro.html
+- within gpu, all operations require moving from gpu’s DRAM to GPU’s SRAM
+- for pointwise operations the time to do the moving (memory bandwidth cost) is longer than the computation itself
+- operator fusion allows us to do many operations at once before moving back to DRAM
+- can lead to interesting things, e.g. activation functions are nearly all the same cost, despite `gelu` obviously consisting of many more operations than `relu`
+
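A small sketch of the operator-fusion point above, assuming PyTorch 2.x so that `torch.compile` is available; fusing a chain of pointwise ops lets them run without writing each intermediate back to DRAM (actual speedups depend on the gpu):

```python
import torch

def pointwise_chain(x):
    # several cheap elementwise ops; run eagerly, each one reads and writes DRAM
    return torch.relu(x * 2 + 1).sigmoid() * x

fused = torch.compile(pointwise_chain)   # may fuse the chain into fewer kernels

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
print(torch.allclose(pointwise_chain(x), fused(x), atol=1e-6))
```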
 # c/c++
 
 - The C memory model: global, local, and heap variables. Where they are stored, their properties, etc.

_notes/ml/deep_learning.md

Lines changed: 4 additions & 5 deletions
@@ -9,7 +9,7 @@ subtitle: This note covers miscellaneous deep learning, with an emphasis on diff
 
 See also notes in [📌 unsupervised learning](https://csinva.io/notes/ml/unsupervised.html), [📌 disentanglement](https://csinva.io/notes/research_ovws/ovw_disentanglement.html), [📌 nlp](https://csinva.io/notes/ml/nlp.html), [📌 transformers](https://csinva.io/notes/research_ovws/ovw_transformers.html)
 
-# top-performing nets
+# historical top-performing nets
 
 - LeNet (1998)
 - first, used on MNIST
@@ -61,10 +61,9 @@ See also notes in [📌 unsupervised learning](https://csinva.io/notes/ml/unsupe
 - rectifying in electronics converts analog -> digital
 - rare to mix and match neuron types
 - *deep* - more than 1 hidden layer
-- mean-squared error regression loss = $\frac{1}{2}(y-\hat{y})^2$
-- classification loss = $-y \log (\hat{y}) - (1-y) \log(1-\hat{y})$
-- can't use SSE because not convex here
-- multiclass classification loss $=-\sum_j y_j \ln \hat{y}_j$
+- mean-squared error (regression loss) = $\frac{1}{2}(y-\hat{y})^2$
+- cross-entropy loss (classification loss) $= -\sum_j y_j \ln \hat{p}_j$
+- for binary classification, $y_j$ would take on 0 or 1 and $\hat{p}_j$ would be a probability
 - **backpropagation** - application of *reverse mode automatic differentiation* to a neural network's loss
 - apply the chain rule from the end of the program back towards the beginning
 - $\frac{dL}{d \theta_i} = \frac{dL}{dz} \frac{\partial z}{\partial \theta_i}$
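A small numeric sketch of the two losses in this hunk (the target, prediction, and class probabilities are made up):

```python
import numpy as np

# mean-squared error (regression): 1/2 * (y - y_hat)^2
y, y_hat = 3.0, 2.5
mse = 0.5 * (y - y_hat) ** 2                        # 0.125

# cross-entropy (classification): -sum_j y_j * ln(p_hat_j)
y_onehot = np.array([0.0, 1.0, 0.0])                # true class is index 1
p_hat = np.array([0.1, 0.7, 0.2])                   # predicted probabilities
cross_entropy = -np.sum(y_onehot * np.log(p_hat))   # = -ln(0.7) ≈ 0.357
print(mse, cross_entropy)
```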

_notes/research_ovws/ovw_llms.md

Lines changed: 7 additions & 2 deletions
@@ -236,6 +236,7 @@ See related papers in the [📌 interpretability](https://csinva.io/notes/resear
 - self-PPLs extend probabilistic graphical models to support more complex joint distributions whose size and “shape” can itself be stochastic
 - e.g., a graph unrolled for a random number of iterations, until a data-dependent stopping criterion is met
 - variables are all text: questions $Q$, answers $A$, and intermediate thoughts $T$
+- Prover-Verifier Games improve legibility of LLM outputs ([kirchner, chen, ... leike, mcaleese, & burda, 2024](https://arxiv.org/abs/2407.13692)) - trained strong LMs to produce text that is easy for weak LMs to verify and found that this training also made the text easier for humans to evaluate.
 - posthoc
 - understanding chain-of-thought and its faithfulness
 - Faithful Chain-of-Thought Reasoning ([yu et al. 2023](https://arxiv.org/abs/2301.13379))
@@ -850,7 +851,7 @@ Editing is generally very similar to just adaptation/finetuning. One distinction
 
 mixture of experts models have become popular because of the need for (1) fast speed / low memory at test time while still (2) having a large model during training
 
-- note: nowadays often the "experts" are different MLPs following the self-attention layers
+- note: nowadays often the "experts" are different MLPs following the self-attention layers (since their computations can be computed independently)
 - A Review of Sparse Expert Models in Deep Learning ([fedus, jeff dean, zoph, 2022](https://arxiv.org/abs/2209.01667))
 - sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models
 - routing algorithm - determines where to send examples
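A toy sketch of the routing bullet above, using top-1 gating over two expert MLPs (all sizes are made up, and this is not the routing scheme of any particular paper):

```python
import torch
import torch.nn as nn

d, n_experts, n_tokens = 16, 2, 8
experts = nn.ModuleList(
    [nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d)) for _ in range(n_experts)]
)
router = nn.Linear(d, n_experts)            # routing algorithm: score each expert per token

x = torch.randn(n_tokens, d)
expert_idx = router(x).argmax(dim=-1)        # top-1 routing decision per token
out = torch.empty_like(x)
for e in range(n_experts):                   # each expert MLP runs independently on its tokens
    mask = expert_idx == e
    if mask.any():
        out[mask] = experts[e](x[mask])
print(out.shape)                             # torch.Size([8, 16])
```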
@@ -1590,12 +1591,16 @@ mixture of experts models have become popular because of the need for (1) fast s
 
 # basics
 
+![transformer_sizes](../assets/transformer_sizes.png)
+
+![kv_caching_diagram](../assets/kv_caching_diagram.png)
+
 - **attention** = vector of importance weights
 - to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “*attends to*”) other elements and take the sum of their values weighted by the attention vector as the approximation of the target
 - vanilla transformer: multihead attention, add + norm, position-wise ffn, add + norm
 - self-attention layer [implementation](https://github.com/mertensu/transformer-tutorial), [mathematics](https://homes.cs.washington.edu/~thickstn/docs/transformers.pdf), and **chandan's self-attention [cheat-sheet](https://slides.com/chandansingh-2/deck-51f404)**
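A small numpy sketch of the attention bullets and the kv_caching_diagram above (single head, no learned projections, purely illustrative): during decoding, only the newest token's query is computed, while its key/value rows are appended to a growing cache.

```python
import numpy as np

def attend(q, K, V):
    """attention = softmax(q K^T / sqrt(d)) V, a weighted sum of cached values."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d, rng = 8, np.random.default_rng(0)
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))  # kv cache grows one row per token

for step in range(5):                        # toy autoregressive decoding loop
    k, v, q = rng.normal(size=(3, d))        # the new token's key, value, and query
    K_cache = np.vstack([K_cache, k[None]])
    V_cache = np.vstack([V_cache, v[None]])
    out = attend(q, K_cache, V_cache)        # attends over all cached positions
print(out.shape)                             # (8,)
```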
 
-## mathematical overview of transformers
+## mathematical overview of transformers
 
 - based on [Formal Algorithms for Transformers](https://arxiv.org/abs/2207.09238?utm_source=substack&utm_medium=email)
 - tasks

_notes/stat/linear_models.md

Lines changed: 1 addition & 0 deletions
@@ -321,6 +321,7 @@ subtitle: Material from "Statistical Models Theory and Practice" - David Freedma
 - $L(\theta) = 1/2 (\theta^T X^T - y^T) (X \theta -y)$
 - $L(\theta) = 1/2 (\theta^T X^T X \theta - 2 \theta^T X^T y +y^T y)$
 - $0=\frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^T y$
+- $X^Ty=X^TX\theta$
 - $\theta = (X^TX)^{-1} X^Ty$
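A quick numpy check of this derivation on random data, solving the normal equations $X^TX\theta = X^Ty$ directly rather than forming the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=100)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # X^T X theta = X^T y
print(theta_hat)                                 # close to theta_true
print(np.allclose(theta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```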
 
 ## ridge regression

0 commit comments
