# fast**RAG** Components Overview

## REPLUG
<image align="right" src="assets/replug.png" width="600">

REPLUG (Retrieve and Plug) is a retrieval-augmented LM method in which retrieved documents are plugged into the input
as a prefix. Rather than concatenating everything into a single prompt, REPLUG uses ensembling: each document is
processed in parallel together with the query, and the final token prediction is based on the combined probability
distribution, as illustrated in the figure. This lets us process a larger number of retrieved documents without being
limited by the LLM's context window. Additionally, the method works with any LLM and requires no fine-tuning. See
([Shi et al. 2023](#shiREPLUGRetrievalAugmentedBlackBox2023)) for more details.

We provide an implementation of REPLUG ensembling inference via the `ReplugHFLocalInvocationLayer` invocation layer.
It supports most Hugging Face models with `.generate()` capabilities (i.e., models that implement the generation
mixin). For a complete example, see the [REPLUG Parallel Reader](examples/replug_parallel_reader.ipynb) notebook.

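A minimal sketch of wiring the layer into a `PromptModel` is shown below; the import path and the `model_kwargs` shown
here are assumptions, so check the notebook above for the exact usage.

```python
import torch

from haystack.nodes import PromptModel

# Assumed import path for fastRAG's REPLUG invocation layer; adjust to the actual module.
from fastrag.prompters.invocation_layers.replug import ReplugHFLocalInvocationLayer

replug_model = PromptModel(
    model_name_or_path="meta-llama/Llama-2-7b-chat-hf",  # any HF model that implements .generate()
    invocation_layer_class=ReplugHFLocalInvocationLayer,
    model_kwargs=dict(
        max_new_tokens=50,
        torch_dtype=torch.bfloat16,
    ),
)
```
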
## ColBERT v2 with PLAID Engine

<image align="right" src="assets/colbert-maxsim.png" width="500">

ColBERT is a dense retriever: it uses a neural network to encode all documents into representative vectors; when a
query arrives, it encodes the query as well and finds the most relevant documents via vector similarity search. What
sets ColBERT apart is that it stores the full token-level representation of the documents: whereas previous models
compressed each document into a single vector regardless of its length, ColBERT keeps a vector for every word in
every document. This makes retrieval more accurate, at the price of a larger index. ColBERT v2 reduces the index size
by compressing the vectors with quantization. Finally, PLAID improves latency for ColBERT-based indexes through a set
of filtering steps that prune the internal candidate set, reducing the computation needed per query. Overall, ColBERT
v2 with PLAID provides state-of-the-art retrieval results at much lower latency than previous dense retrievers,
approaching the speed of sparse retrievers while being far more accurate.
See ([Santhanam, Khattab, Saad-Falcon, et al. 2022](#org330b3f5); [Santhanam, Khattab, Potts, et al. 2022](#orgbfef01e)) for more details.

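To make the late-interaction idea concrete, here is a minimal sketch of the MaxSim scoring ColBERT applies to the
stored token vectors (illustrative only; the real implementation runs this over a compressed, pre-filtered index):

```python
import torch

def colbert_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_vecs: [num_query_tokens, dim], doc_vecs: [num_doc_tokens, dim]; both are
    assumed to be L2-normalized, as in ColBERT.
    """
    sim = query_vecs @ doc_vecs.T          # token-level similarities [q_tokens, d_tokens]
    return sim.max(dim=1).values.sum()     # best document token per query token, summed

# Toy example with random "embeddings" of dimension 128.
query = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
doc = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(colbert_score(query, doc))
```
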
We provide an implementation of ColBERT and PLAID, exposed through the classes `PLAIDDocumentStore` and
`ColBERTRetriever`, together with a trained model; see [ColBERT-NQ](https://huggingface.co/Intel/ColBERT-NQ). The
document store class requires the following arguments (a construction sketch follows the list):

- `collection_path`: the path to the document collection, a TSV file with the columns "id,content,title", where the
  title is optional.
- `checkpoint_path`: the path to the encoder model, needed to encode queries into vectors at run time. This can be a
  local path or a model hosted on the Hugging Face hub. To use our trained model based on Natural Questions, provide
  the path `Intel/ColBERT-NQ`; see the [Model Hub](https://huggingface.co/Intel/ColBERT-NQ) for more details.
- `index_path`: the location of the indexed documents. The index contains the optimized and compressed vector
  representation of all the documents. It can be created by the user given a collection and a checkpoint, or an
  existing index can be specified via its path.

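The sketch below shows how the pieces fit together; the import paths, the `ColBERTRetriever` arguments and the
`retrieve()` call are assumptions based on the standard Haystack retriever interface, so adapt them to the actual
fastRAG modules.

```python
# Assumed import paths; see the fastRAG package layout for the exact modules.
from fastrag.stores import PLAIDDocumentStore
from fastrag.retrievers import ColBERTRetriever

document_store = PLAIDDocumentStore(
    collection_path="path/to/collection.tsv",  # TSV with "id,content,title" columns
    checkpoint_path="Intel/ColBERT-NQ",        # local path or Hugging Face hub id
    index_path="path/to/index",                # existing index, or where a new one is built
)

retriever = ColBERTRetriever(document_store=document_store)
results = retriever.retrieve(query="Who wrote The Lord of the Rings?", top_k=10)
```
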
**Updated:** a new feature enables adding documents to and removing documents from an existing index. Example usage:

```python
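# config, searcher and checkpoint are assumed to be the existing ColBERT/PLAID objects
# (index configuration, Searcher, and encoder checkpoint) associated with the index.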
index_updater = IndexUpdater(config, searcher, checkpoint)

added_pids = index_updater.add(passages)  # Adding passages
index_updater.remove(pids)  # Removing passages
searcher.search()  # Search now reflects the added & removed passages

index_updater.persist_to_disk()  # Persist changes to disk
```

#### PLAID Requirements

If a GPU is to be used, it should be an Ampere-generation card (RTX 3090) or newer, and PyTorch should be installed
with CUDA support, e.g.:

```bash
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
```

## fastRAG running LLMs with Habana Gaudi (DL1) and Gaudi 2

fastRAG includes Intel Habana Gaudi support for running LLMs as generators in pipelines.

### Installation

To enable Gaudi support, follow the installation instructions in the
[Optimum Habana](https://github.com/huggingface/optimum-habana.git) guide.

### Usage

We enabled support for running LLMs on Habana Gaudi (DL1) and Habana Gaudi 2 by simply configuring the invocation
layer of the `PromptModel` instance.

Below is an example of loading a `PromptModel` with the Habana backend:

```python
import torch

from haystack.nodes import PromptModel
from fastrag.prompters.invocation_layers.gaudi_hugging_face_inference import GaudiHFLocalInvocationLayer

PrompterModel = PromptModel(
    model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    invocation_layer_class=GaudiHFLocalInvocationLayer,
    model_kwargs=dict(
        max_new_tokens=50,
        torch_dtype=torch.bfloat16,
        do_sample=False,
        constant_sequence_length=384,
    ),
)
```
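
The resulting model can then be used like any other `PromptModel`, for example inside a `PromptNode`; this is a
minimal sketch assuming the standard Haystack v1 API:

```python
from haystack.nodes import PromptNode

# PromptNode accepts a PromptModel instance in place of a model name.
prompt_node = PromptNode(PrompterModel)
print(prompt_node("Answer briefly: what does a retriever do in a RAG pipeline?"))
```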

We provide a detailed [Gaudi Inference](examples/inference_with_gaudi.ipynb) notebook showing how to build a RAG
pipeline with Gaudi; feel free to try it out!

## fastRAG running LLMs with ONNX-runtime

To run LLMs efficiently on CPUs, we provide a method for running quantized LLMs with
[optimum-intel](https://github.com/huggingface/optimum-intel). We recommend checking out our
[full notebook](examples/rag_with_quantized_llm.ipynb) with all the details, including the quantization and pipeline
construction.

### Installation

Run the following command to install the required dependencies:

```bash
pip install -e .[intel]
```

For more information regarding the installation process, we recommend checking out the
[optimum-intel](https://github.com/huggingface/optimum-intel) repository.

### LLM Quantization

To quantize a model, we first export it to the ONNX format, and then use a quantizer to save the quantized version of
the model:

```python
import os

from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = "my_llm_model_name"
converted_model_path = "my/local/path"

# Export the model to ONNX and save it locally.
model = ORTModelForCausalLM.from_pretrained(model_name, export=True)
model.save_pretrained(converted_model_path)

# Reload the exported ONNX model and apply dynamic int8 quantization for AVX2 CPUs.
model = ORTModelForCausalLM.from_pretrained(converted_model_path)
qconfig = AutoQuantizationConfig.avx2(is_static=False)
quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir=os.path.join(converted_model_path, "quantized"), quantization_config=qconfig)
```

### Loading the Quantized Model

Now that our model is quantized, we can load it in our framework by specifying the `ORTInvocationLayer` invocation
layer.

```python
from haystack.nodes import PromptModel

# Assumed import path for fastRAG's ONNX-runtime invocation layer; adjust to the actual module.
from fastrag.prompters.invocation_layers.ort import ORTInvocationLayer

PrompterModel = PromptModel(
    model_name_or_path="my/local/path/quantized",
    invocation_layer_class=ORTInvocationLayer,
)
```

## Optimized Embedding Models

Bi-encoder embedders are key components of retrieval-augmented generation pipelines, used mainly for indexing
documents and for online re-ranking. We provide support for quantized `int8` models with low latency and high
throughput, built with the [`optimum-intel`](https://github.com/huggingface/optimum-intel) framework.

For a comprehensive overview, instructions for optimizing existing models, and usage information, see the dedicated
[readme.md](scripts/optimizations/embedders/README.md).

We integrated the optimized embedders into the following two components:

- [`QuantizedBiEncoderRanker`](fastrag/rankers/quantized_bi_encoder.py) - a bi-encoder ranker; encodes the documents
  provided in the input and re-orders them according to query similarity.
- [`QuantizedBiEncoderRetriever`](fastrag/retrievers/optimized.py) - a bi-encoder retriever; encodes documents into
  vectors, given a vector store engine.

**NOTE**: For optimal performance, we suggest following the important notes in the dedicated
[readme.md](scripts/optimizations/embedders/README.md).
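
As a quick illustration, the retriever can be dropped into a Haystack setup like any other embedding retriever; this
is a minimal sketch in which the import path, the constructor arguments (mirroring Haystack's `EmbeddingRetriever`)
and the model path are assumptions, so see the dedicated readme for the real usage.

```python
from haystack.document_stores import InMemoryDocumentStore

# Assumed import path and constructor signature; the model path is a placeholder for a
# quantized int8 embedder produced with optimum-intel.
from fastrag.retrievers import QuantizedBiEncoderRetriever

document_store = InMemoryDocumentStore(embedding_dim=384)
retriever = QuantizedBiEncoderRetriever(
    document_store=document_store,
    embedding_model="path/to/quantized-int8-embedder",
)
document_store.update_embeddings(retriever)  # index documents with the optimized embedder
```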

## Fusion-In-Decoder

<image align="right" src="assets/fid.png" width="500">

The Fusion-In-Decoder model (FiD for short) is a transformer-based generative model built on the T5 architecture. In
our setting, the model answers a question given relevant supporting documents. Given a query and a collection of
documents, it encodes the question paired with each document independently (in parallel), and the decoder then
attends over all encoded documents at once to generate the answer token by token. See
([Izacard and Grave 2021](#org330b3f6)) for more details.

We provide an implementation of FiD as an invocation layer
([`FiDHFLocalInvocationLayer`](fastrag/prompters/invocation_layers/fid.py)) for an LLM, together with an
[example notebook](examples/fid_promping.ipynb) of a RAG pipeline.

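A minimal sketch of plugging the layer into a `PromptModel` is shown below; the placeholder model path and the
`model_kwargs` are illustrative, so see the example notebook for the full pipeline.

```python
from haystack.nodes import PromptModel
from fastrag.prompters.invocation_layers.fid import FiDHFLocalInvocationLayer

# "path/to/fid_model" is a placeholder for a trained FiD checkpoint, e.g. one produced by
# the training script referenced below.
fid_model = PromptModel(
    model_name_or_path="path/to/fid_model",
    invocation_layer_class=FiDHFLocalInvocationLayer,
    model_kwargs=dict(max_new_tokens=20),
)
```
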
To fine-tune your own FiD model, you can use our training script: [Training FiD](scripts/training/train_fid.py).

The following is an example command with the standard parameters for training the FiD model:

```bash
python scripts/training/train_fid.py \
--do_train \
--do_eval \
--output_dir output_dir \
--train_file path/to/train_file \
--validation_file path/to/validation_file \
--passage_count 100 \
--model_name_or_path t5-base \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--seed 42 \
--gradient_accumulation_steps 8 \
--learning_rate 0.00005 \
--optim adamw_hf \
--lr_scheduler_type linear \
--weight_decay 0.01 \
--max_steps 15000 \
--warmup_steps 1000 \
--max_seq_length 250 \
--max_answer_length 20 \
--evaluation_strategy steps \
--eval_steps 2500 \
--eval_accumulation_steps 1 \
--gradient_checkpointing \
--bf16 \
--bf16_full_eval
```

## References

<a id="shiREPLUGRetrievalAugmentedBlackBox2023"></a>Shi, Weijia, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. “REPLUG: Retrieval-Augmented Black-Box Language Models.” arXiv. <https://doi.org/10.48550/arXiv.2301.12652>.

<a id="org330b3f5"></a>Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” arXiv. <https://doi.org/10.48550/arXiv.2112.01488>.

<a id="orgbfef01e"></a>Santhanam, Keshav, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. “PLAID: An Efficient Engine for Late Interaction Retrieval.” arXiv. <https://doi.org/10.48550/arXiv.2205.09707>.

<a id="org330b3f6"></a>Izacard, Gautier, and Edouard Grave. 2021. “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.” arXiv. <https://doi.org/10.48550/arXiv.2007.01282>.