
Commit ce29cea

add autopasta link
1 parent e64fea6 commit ce29cea

3 files changed: +52 additions, −14 deletions

_includes/01_research.html

Lines changed: 17 additions & 1 deletion

@@ -41,7 +41,9 @@ <h2 style="text-align: center; margin-top: -150px;"> Research</h2>
 <br>
 <a href="https://arxiv.org/abs/2310.14034">tree prompting</a> - improve black-box few-shot text classification
 with decision trees<br>
-<a href="https://arxiv.org/abs/2311.02262">attention steering</a> - guide LLMs by emphasizing specific input
+<a href="https://arxiv.org/abs/2311.02262">attention steering</a> / <a
+href="https://arxiv.org/abs/2409.10790">automatic attention steering</a> - guide LLMs by
+emphasizing specific input
 spans<br>
 <a href="https://arxiv.org/abs/2210.01848">interpretable autoprompting</a> - automatically find fluent
 natural-language prompts<br>

@@ -196,6 +198,20 @@ <h2 style="text-align: center; margin-top: -150px;"> Research</h2>
 <td class="med">
 </td>
 </tr>
+
+<tr>
+<td class="center">'24</td>
+<td>Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering
+</td>
+<td>zhang*, yu*, et al.</td>
+<td class="med">🔎🌀</td>
+<td class="center"><a href="https://arxiv.org/abs/2409.10790">arxiv</a></td>
+<td class="big"><a href="https://github.com/QingruZhang/AutoPASTA"><i class="fa fa-github fa-fw"></i></a>
+</td>
+<td class="med">
+</td>
+</tr>
+
 <tr>
 <td class="center">'24</td>
 <td>Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
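The two steering papers linked in this file share one mechanism: re-weighting attention toward user-marked input spans. A minimal toy sketch of that re-weighting in pure Python (function name and the `alpha` parameter are illustrative, not taken from the PASTA/AutoPASTA code):

```python
def steer_attention(weights, emphasized, alpha=0.01):
    """Downweight attention on positions outside the emphasized span,
    then renormalize so the row still sums to 1 (illustrative sketch)."""
    scaled = [w if i in emphasized else w * alpha for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]

# uniform attention over 4 tokens; fully emphasize tokens 1 and 2 (alpha=0)
attn = [0.25, 0.25, 0.25, 0.25]
out = steer_attention(attn, {1, 2}, alpha=0.0)
print(out)  # [0.0, 0.5, 0.5, 0.0]
```

With `alpha=0` all attention mass moves onto the emphasized span; the papers instead use a small positive scale on selected heads.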

_notes/ml/nlp.md

Lines changed: 20 additions & 0 deletions

@@ -74,16 +74,36 @@ Nice repo keeping track of progress [here](https://github.com/sebastianruder/NLP
 - [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) - determine if the answer to a question is contained in a second sentence or not
 - [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) - determine if a sentence entails a given hypothesis or not
 - [WNLI](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) - determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not
+
 - more NLI (natural language inference)
 - ANLI: Adversarial NLI ([nie et al. 2019](https://arxiv.org/abs/1910.14599)) - harder examples found by model failures
 - SNLI Benchmark ([bowman et al. 2015](https://arxiv.org/abs/1508.05326)) = Stanford Natural Language Inference - entailment dataset
 - 570k human-annotated sentence pairs where people ask about entailment
 - FEVER: Fact Extraction and VERification ([Thorne et al., 2018](https://aclanthology.org/N18-1074/))
 - SciTail ([khot et al. 2018](https://ojs.aaai.org/index.php/AAAI/article/view/12022)) - textual entailment derived from science-question answering
+
 - QA
 - SQuAD 2.0 ([Rajpurkar...liang, 2018](https://arxiv.org/abs/1806.03822)) - adds 50k unanswerable questions; system must know when it can't answer
 - SQuAD ([Rajpurkar...liang, 2016](https://arxiv.org/abs/1606.05250)) - Stanford Question Answering Dataset (SQuAD) - 100k questions from 23k passages in 500 wikipedia articles

+- Text classification datasets (used in [Tree-Prompt](https://arxiv.org/pdf/2310.14034) and [Aug-imodels](https://www.nature.com/articles/s41467-023-43713-1))
+
+| Dataset | Task | Classes | Example text | Label |
+| :-----: | ---- | :-----: | :----------: | :---: |
+| SST2 | Movie review sentiment | 2 | that loves its characters and communicates something rather beautiful about human nature | positive |
+| SUBJ | subjective vs objective | 2 | the script isn't very good; not even someone as gifted as hoffman ( the actor ) can make it work. | subjective |
+| MPQA | question answer sentiment | 2 | victory of democracy | positive |
+| AGNews | classify news titles | 4 | Wall St. Bears Claw Back Into the Black (Reuters). "Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again." | business |
+| CB | given a text and a clause, predict how much the text commits to the clause | 3 | Premise: "Do you mind if I use your phone?" Ronni could see that Guido's brain was whirring. Hypothesis: Guido's brain was whirring | entailment |
+| CR | customer review sentiment | 2 | i didn 't have any major problems installing this software . | positive |
+| DBPedia | Categories in wikipedia | 14 | Geoffrey D. Falksen (born July 31 1982) is an American steampunk writer. | artist |
+| MR | Movie review sentiment | 2 | the film is flat . | negative |
+| RTE | Entailment | 2 | Sentence 1: No Weapons of Mass Destruction Found in Iraq Yet. Sentence 2: "Weapons of Mass Destruction Found in Iraq. | not_entailment |
+| TREC | Classifying questions | 6 | What 's known as The queen of Drinks ? | entity |
+| FPB | Financial phrase sentiment | 3 | According to Gran, the company has no plans to move all production to Russia, although that is where the company is growing. | neutral |
+| IMDB | Movie review sentiment | 2 | would put this at the top of my list of films in the category of unwatchable trash! [...] | negative |
+| Emotion | Tweet emotion classification | 6 | i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake | sadness |
+
 **common data sources**

 - WSJ
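The datasets in the table above are used as few-shot prompting benchmarks. A toy sketch of how such a dataset is typically turned into a prompt with a label verbalizer (template, helper name, and demo strings are illustrative, not from either paper):

```python
def build_prompt(demos, test_text, verbalizer):
    """Toy few-shot prompt builder for a binary sentiment task like SST2/MR."""
    blocks = [f"Review: {text}\nSentiment: {verbalizer[label]}" for text, label in demos]
    blocks.append(f"Review: {test_text}\nSentiment:")  # model completes the label word
    return "\n\n".join(blocks)

demos = [("the film is flat .", 0),
         ("loves its characters and communicates something beautiful", 1)]
prompt = build_prompt(demos, "an unwatchable mess", {0: "negative", 1: "positive"})
print(prompt)
```

The classifier's prediction is read off from which verbalizer word the model assigns higher probability after the final "Sentiment:".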

_notes/research_ovws/ovw_llms.md

Lines changed: 15 additions & 13 deletions
@@ -448,7 +448,7 @@ See related papers in the [📌 interpretability](https://csinva.io/notes/resear
 - prompting = few-shot learning = priming = in-context learning (starts with GPT)
 - prompting without changing any model parameters
 - limitation: can't exploit sets longer than the training window
-- MetaICL: Learning to Learn In Context ([min et al. 2022](https://arxiv.org/abs/2110.15943)) - tune LLM to do in-context learning on a large set of training tasks (few-show prompting and training time and at test-time)
+- MetaICL: Learning to Learn In Context ([min et al. 2022](https://arxiv.org/abs/2110.15943)) - tune LLM to do in-context learning on a large set of training tasks (few-shot prompting at training time and at test-time)
 - Visual Prompting via Image Inpainting ([bar...darrell, globerson, efros, 2022](https://arxiv.org/abs/2209.00647))
 - Pattern-Exploiting Training (PET) -- Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference ([schick & schutze, 2021](https://aclanthology.org/2021.eacl-main.20.pdf))
 - **cloze questions** - same as masked language modeling: task is to replace some missing words
@@ -485,7 +485,7 @@ See related papers in the [📌 interpretability](https://csinva.io/notes/resear

 - Teach Llamas to Talk: Recent Progress in Instruction Tuning ([gao blogpost 2023](https://gaotianyu.xyz/blog/2023/11/30/instruction-tuning/))

-- Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs ([zhang et al. 2023](https://arxiv.org/abs/2311.02262))
+- Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs, PASTA ([zhang et al. 2023](https://arxiv.org/abs/2311.02262))
 - The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction ([sharma...misra, 2023](https://arxiv.org/abs/2312.13558))
 - human feedback
 - Learning to summarize with human feedback ([OpenAI, 2020](https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html))
@@ -581,15 +581,15 @@ Editing is generally very similar to just adaptation/finetuning. One distinction
 - then, perform generation with latent embedding
 - learn linear transformation given a dataset of examples with attributes and desired completions
 - (also regularize the model to not change *too much* on other stuff)
-- Activation Addition: Steering Language Models Without Optimization ([turner...macdiarmid, 2023](https://arxiv.org/abs/2308.10248))
-- blog post: activation engineering: Steering GPT-2-XL by adding an activation vector ([turner, ..., mini, 2023](https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector#6__The_Eiffel_Tower_is_in_Rome))
-- obtain "steering vector" by embedding a phrase (e.g. *love*) and adding that vector to the llm embedding during generation
-- they only add the embedding for some layers for some tokens
-- Extracting Latent Steering Vectors from Pretrained Language Models ([subramani, ..., peters, 2022](https://arxiv.org/abs/2205.05124)) - find latent vectors via optimization that cause an LLM to output a particular sequence
-- then, use these vectors to do things like transfer to new tasks / compute textual similarity
-- Function Vectors in LLMs ([todd...wallace, bau, 2023](https://arxiv.org/pdf/2310.15213.pdf))
-- In-Context Learning Creates Task Vectors ([hendel, geva, & globerson, 2023](https://arxiv.org/pdf/2310.15916))
-
+- Activation Addition: Steering Language Models Without Optimization ([turner...macdiarmid, 2023](https://arxiv.org/abs/2308.10248))
+- blog post: activation engineering: Steering GPT-2-XL by adding an activation vector ([turner, ..., mini, 2023](https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector#6__The_Eiffel_Tower_is_in_Rome))
+- obtain "steering vector" by embedding a phrase (e.g. *love*) and adding that vector to the llm embedding during generation
+- they only add the embedding for some layers for some tokens
+- Extracting Latent Steering Vectors from Pretrained Language Models ([subramani, ..., peters, 2022](https://arxiv.org/abs/2205.05124)) - find latent vectors via optimization that cause an LLM to output a particular sequence
+- then, use these vectors to do things like transfer to new tasks / compute textual similarity
+- Function Vectors in LLMs ([todd...wallace, bau, 2023](https://arxiv.org/pdf/2310.15213.pdf))
+- In-Context Learning Creates Task Vectors ([hendel, geva, & globerson, 2023](https://arxiv.org/pdf/2310.15916))
+- Programming Refusal with Conditional Activation Steering ([lee...dhurandhar, 2024](https://arxiv.org/abs/2409.05907))
 - PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions ([chen...sameer singh...kelvin guu, 2023](https://drive.google.com/file/d/1CXSUii4w8Y2uj-zLm8zRl63SYh45FaZL/view))
 - new datasets
 - MQUAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions ([zhong...manning, potts, chen, 2023](https://www.cs.princeton.edu/~zzhong/papers/MQuAKE.pdf)) - introduces benchmark MQUAKE + method MeLLo, which stores edited facts externally while prompting the language model iteratively to generate answers that are consistent with the edited facts
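The activation-addition bullets in this hunk describe a concrete recipe: take the difference between activations for a contrast pair of phrases as a "steering vector", then add it (scaled) to hidden states at chosen layers during generation. A toy numeric sketch under those assumptions (names and vectors are hypothetical, not the ActAdd code):

```python
def steering_vector(h_pos, h_neg):
    """Difference of hidden states for a contrast pair (e.g. *love* vs *hate*)."""
    return [p - n for p, n in zip(h_pos, h_neg)]

def apply_steering(hidden, vec, coeff=1.0):
    """Add the scaled steering vector to a hidden state at one chosen layer."""
    return [h + coeff * v for h, v in zip(hidden, vec)]

h_love, h_hate = [1.0, 0.0], [0.0, 1.0]   # stand-ins for phrase activations
vec = steering_vector(h_love, h_hate)      # [1.0, -1.0]
print(apply_steering([0.5, 0.5], vec, coeff=2.0))  # [2.5, -1.5]
```

In the papers this addition is applied only at some layers and token positions, and `coeff` controls steering strength.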
@@ -1200,7 +1200,9 @@ mixture of experts models have become popular because of the need for (1) fast s
 - Jailbreaking Proprietary Large Language Models using Word Substitution Cipher ([Handa…Baral 2024](https://arxiv.org/abs/2402.10601)): short, nice paper! Just substitute unsafe words with safe words, provide the mapping to the model along with the original question rewritten using those words, and ask the LLM to reply; high ASR for ChatGPT and Gemini.
 - CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models ([Lv…Huang 2024](https://arxiv.org/abs/2402.16717)): here the malicious question is asked via code, where the input sentence is encrypted using simple coding schemes (reverse the words or sort them by length) and the code includes the decryption function. Highest ASR among all baselines, which include CipherChat and multilingual attacks.
 - MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds ([Jin…Zhang 2024](https://arxiv.org/abs/2402.01706)): not a cipher attack; it creates several layers of alternate worlds in which a malicious query can be placed to bypass model safety. The deeper the layers, the higher the attack's ASR.
-- People have used other modalities like images, voice, to attack models. One can think of other modalities as a generalization of different languages.
+- Data Contamination Can Cross Language Barriers ([feng yao, yufan zhuang, ..., jingbo shang](https://arxiv.org/html/2406.13236v1)) - LLMs can overfit to benchmarks by being trained on translations of them
+- To detect this contamination, for each question, we replace all the incorrect choices with correct choices taken from other questions
+

 # applications
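The contamination probe added in this hunk (replace every incorrect choice with correct answers borrowed from other questions, so a memorizing model still "recognizes" the original) can be sketched as a probe builder; data and helper names below are illustrative, not from the paper:

```python
import random

def build_probe(questions, i, rng):
    """Build a probe for question i: its distractors are replaced by
    correct answers sampled from the *other* questions (toy sketch)."""
    others = [q["answer"] for j, q in enumerate(questions) if j != i]
    options = rng.sample(others, k=3) + [questions[i]["answer"]]
    rng.shuffle(options)
    return {"question": questions[i]["question"], "options": options}

qs = [{"question": f"Q{k}?", "answer": f"A{k}"} for k in range(5)]
probe = build_probe(qs, 0, random.Random(0))
print(probe["options"])  # every option is correct for *some* question; "A0" is among them
```

A model that generalizes should be near chance on such probes, while a contaminated model still picks the memorized original answer.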
@@ -1297,13 +1299,13 @@ mixture of experts models have become popular because of the need for (1) fast s
 - Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression ([raventos…ganguli, 2023](https://openreview.net/forum?id=BtAz4a5xDg))
 - Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models ([fu...sharan, 2023](https://arxiv.org/abs/2310.17086))
 - How Well Can Transformers Emulate In-context Newton’s Method? ([giannou...papailiopoulos, & lee, 2024](https://arxiv.org/pdf/2403.03183v1.pdf))
-
 - Teaching Algorithmic Reasoning via In-context Learning ([zhou...sedghi, 2022](https://arxiv.org/abs/2211.09066))
 - Looped Transformers as Programmable Computers ([giannou, ..., jason lee, papailiopoulos, 2023](https://arxiv.org/abs/2301.13196)) - use transformers as universal computers by programming them with specific weights
 - Learning mathematical problems ([francois charton](https://scholar.google.com/citations?hl=en&user=1tMnd-4AAAAJ&view_op=list_works&sortby=pubdate))
 - Probing the Decision Boundaries of In-context Learning in Large Language Models ([zhao, nguyen, & grover, 2024](https://arxiv.org/pdf/2406.11233v1))
 - Theory (don't directly predict algorithm)
 - Meta-learning for Mixed Linear Regression ([kong...kakade, oh, 2020](https://proceedings.mlr.press/v119/kong20a.html)) - generalization for linear regression based on which linear tasks were seen before
+- Transformers are Universal In-context Learners ([furuya...peyre, 2024](https://arxiv.org/abs/2408.01367)) - mathematically show that transformers are universal and can approximate continuous in-context mappings to arbitrary precision
 - Limitations
 - Faith and Fate: Limits of Transformers on Compositionality ([dziri...choi, 2023](https://arxiv.org/abs/2305.18654)) - LLMs can't (easily) be trained well for multiplication (and similar tasks)
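Several papers in this hunk study transformers emulating optimization on in-context linear-regression tasks, with Newton's method on least squares as the reference algorithm. A one-dimensional toy of that reference computation (illustrative, not from any of the papers):

```python
def newton_linreg(xs, ys, w0=0.0, steps=1):
    """Newton's method on the 1-D least-squares loss sum((w*x - y)^2).
    The Hessian is constant, so a single step reaches the optimum."""
    w = w0
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys))
        hess = sum(2 * x * x for x in xs)
        w -= grad / hess
    return w

# data generated from y = 2x; one Newton step recovers the weight
print(newton_linreg([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 2.0
```

The in-context question is whether a trained transformer, given the (x, y) pairs as a prompt, reproduces this map from examples to the optimal weight.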