
Commit bdd5da8

V2.0 updates (#29)
- Many new features.
- Documentation updates.
- Example notebooks and scripts.
- Bug fixes.
1 parent a383ec3 commit bdd5da8

110 files changed: 8,689 additions & 2,154 deletions


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -3,6 +3,8 @@ __pycache__/
 *.py[cod]
 *$py.class
 
+.aim/
+
 # C extensions
 *.so
 

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
+        args: [--markdown-linebreak-ext=md]
      - id: check-yaml
  - repo: https://github.com/psf/black
    rev: 23.1.0

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ The following document describes the process of contributing and developing exte
 
 Preliminary requirements:
 
-- fastRAG installed in a developement enviroment (via `pip install -e`)
+- fastRAG installed in a development environment (via `pip install -e`)
 - Python 3.8+
 - Pytorch
 - Any 3rd party store engine package

README.md

Lines changed: 85 additions & 231 deletions
Large diffs are not rendered by default.

assets/bi-encoder.png

35.2 KB

assets/chat_multimodal.png

452 KB

assets/info-retrieval.png

67.5 KB

assets/replug.png

183 KB

benchmarks/KILT/nq-bm25-fid.py

Lines changed: 17 additions & 6 deletions
@@ -2,14 +2,17 @@
 
 import kilt.eval_retrieval as retrieval_metrics
 import pandas as pd
+import torch
 from datasets import load_dataset
 from haystack import Pipeline
 from haystack.document_stores import ElasticsearchDocumentStore
-from haystack.nodes import BM25Retriever, SentenceTransformersRanker
+from haystack.nodes import BM25Retriever, PromptModel, SentenceTransformersRanker
+from haystack.nodes.prompt import AnswerParser, PromptNode
+from haystack.nodes.prompt.prompt_template import PromptTemplate
 from kilt.eval_downstream import _calculate_metrics, validate_input
 from tqdm import tqdm
 
-from fastrag.readers.FiD import FiDReader
+from fastrag.prompters.invocation_layers import fid
 from fastrag.utils import get_timing_from_pipeline
 
 
@@ -77,11 +80,19 @@ def evaluate_from_answers(gold_records, result_collection):
 
 reranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
 
-reader = FiDReader(
-    input_converter_tokenizer_max_len=250,
-    max_length=20,
-    model_name_or_path="Intel/fid_t5_large_nq",
+PrompterModel = PromptModel(
+    model_name_or_path="Intel/fid_flan_t5_base_nq",
     use_gpu=True,
+    invocation_layer_class=fid.FiDHFLocalInvocationLayer,
+    model_kwargs=dict(
+        model_kwargs=dict(device_map={"": 0}, torch_dtype=torch.bfloat16, do_sample=False),
+        generation_kwargs=dict(max_length=10),
+    ),
+)
+
+reader = PromptNode(
+    model_name_or_path=PrompterModel,
+    default_prompt_template=PromptTemplate("{query}", output_parser=AnswerParser()),
 )
 
 

benchmarks/KILT/nq-plaid-fid.py

Lines changed: 17 additions & 6 deletions
@@ -4,15 +4,18 @@
 
 import kilt.eval_retrieval as retrieval_metrics
 import pandas as pd
+import torch
 import tqdm
 from datasets import load_dataset
 from haystack import Pipeline
 from haystack.document_stores import ElasticsearchDocumentStore
-from haystack.nodes import BM25Retriever, FARMReader, SentenceTransformersRanker
+from haystack.nodes import BM25Retriever, FARMReader, PromptModel, SentenceTransformersRanker
+from haystack.nodes.prompt import AnswerParser, PromptNode
+from haystack.nodes.prompt.prompt_template import PromptTemplate
 from kilt.eval_downstream import _calculate_metrics, validate_input
 from tqdm import tqdm
 
-from fastrag.readers.FiD import FiDReader
+from fastrag.prompters.invocation_layers import fid
 from fastrag.retrievers.colbert import ColBERTRetriever
 from fastrag.stores import PLAIDDocumentStore
 from fastrag.utils import get_timing_from_pipeline
@@ -84,11 +87,19 @@ def evaluate_from_answers(gold_records, result_collection):
 
 reranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
 
-reader = FiDReader(
-    input_converter_tokenizer_max_len=250,
-    max_length=20,
-    model_name_or_path="Intel/fid_t5_large_nq",
+PrompterModel = PromptModel(
+    model_name_or_path="Intel/fid_flan_t5_base_nq",
     use_gpu=True,
+    invocation_layer_class=fid.FiDHFLocalInvocationLayer,
+    model_kwargs=dict(
+        model_kwargs=dict(device_map={"": 0}, torch_dtype=torch.bfloat16, do_sample=False),
+        generation_kwargs=dict(max_length=10),
+    ),
+)
+
+reader = PromptNode(
+    model_name_or_path=PrompterModel,
+    default_prompt_template=PromptTemplate("{query}", output_parser=AnswerParser()),
 )
 
 

components.md

Lines changed: 211 additions & 0 deletions
# fast**RAG** Components Overview

## REPLUG
<image align="right" src="assets/replug.png" width="600">

REPLUG (Retrieve and Plug) is a retrieval-augmented LM method in which retrieved documents are plugged into the input as a prefix. This is done using ensembling: the documents are processed in parallel and the final token prediction is based on the combined probability distribution, as shown in the figure. REPLUG therefore lets us process a larger number of retrieved documents without being limited by the LLM context window. Additionally, the method works with any LLM; no fine-tuning is needed. See ([Shi et al. 2023](#shiREPLUGRetrievalAugmentedBlackBox2023)) for more details.

We provide an implementation of the REPLUG ensembling inference via the `ReplugHFLocalInvocationLayer` invocation layer. Our implementation supports most Hugging Face models with `.generate()` capabilities (i.e., models that implement the generation mixin). For a complete example, see the [REPLUG Parallel Reader](examples/replug_parallel_reader.ipynb) notebook.
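
As a rough usage sketch only (the import path of `ReplugHFLocalInvocationLayer`, the model choice, and the `model_kwargs` below are assumptions; the linked notebook is the canonical reference), loading it mirrors the other invocation layers in this commit:

```python
import torch
from haystack.nodes import PromptModel, PromptNode

# Assumed import path for the REPLUG invocation layer; see the notebook for the exact location.
from fastrag.prompters.invocation_layers import replug

# Any Hugging Face model that implements .generate() should work; the model name is a placeholder.
replug_model = PromptModel(
    model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    use_gpu=True,
    invocation_layer_class=replug.ReplugHFLocalInvocationLayer,
    model_kwargs=dict(torch_dtype=torch.bfloat16),
)

# Wrap it in a PromptNode to use it as the generator at the end of a RAG pipeline.
generator = PromptNode(model_name_or_path=replug_model)
```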

## ColBERT v2 with PLAID Engine

<image align="right" src="assets/colbert-maxsim.png" width="500">

ColBERT is a dense retriever: it uses a neural network to encode all the documents into representative vectors; when a query arrives, it encodes the query into a vector and uses vector similarity search to find the most relevant documents. What makes it different is that it stores the full vector representation of the documents: neural networks represent each word as a vector, and previous models used a single vector per document no matter how long it was, whereas ColBERT stores the vectors of all the words in all the documents. This makes retrieval more accurate, at the price of a larger index. ColBERT v2 reduces the index size by compressing the vectors with quantization. Finally, PLAID improves latency for ColBERT-based indexes through a set of filtering steps that reduce the number of internal candidates to consider, and hence the computation needed per query. Overall, ColBERT v2 with PLAID provides state-of-the-art retrieval results at much lower latency than previous dense retrievers, approaching the speed of sparse retrievers with much higher accuracy. See ([Santhanam, Khattab, Saad-Falcon, et al. 2022](#org330b3f5); [Santhanam, Khattab, Potts, et al. 2022](#orgbfef01e)) for more details.

We provide an implementation of ColBERT and PLAID, exposed through the classes `PLAIDDocumentStore` and `ColBERTRetriever`, together with a trained model; see [ColBERT-NQ](https://huggingface.co/Intel/ColBERT-NQ). The document store class requires the following arguments (a minimal construction sketch follows the list):

- `collection_path`: the path to the document collection, a TSV file with the columns "id,content,title", where the title is optional.
- `checkpoint_path`: the path to the encoder model, needed to encode queries into vectors at run time. It can be a local path or a model hosted on the Hugging Face hub. To use our trained model based on NaturalQuestions, provide the path `Intel/ColBERT-NQ`; see the [Model Hub](https://huggingface.co/Intel/ColBERT-NQ) for more details.
- `index_path`: the location of the indexed documents. The index contains the optimized and compressed vector representation of all the documents. It can be created by the user from a collection and a checkpoint, or specified via a path.
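
A minimal construction sketch, assuming an index has already been built; the import paths match the KILT benchmark scripts updated in this commit, while the file paths and the retriever's keyword arguments are placeholders/assumptions:

```python
from fastrag.retrievers.colbert import ColBERTRetriever
from fastrag.stores import PLAIDDocumentStore

# Paths below are placeholders; checkpoint_path can also point at the Intel/ColBERT-NQ model.
store = PLAIDDocumentStore(
    collection_path="data/collection.tsv",
    checkpoint_path="Intel/ColBERT-NQ",
    index_path="indexes/my-plaid-index",
)

# The retriever wraps the store; retrieve() follows the usual Haystack retriever interface.
retriever = ColBERTRetriever(document_store=store)
documents = retriever.retrieve(query="Who wrote the novel Dune?", top_k=10)
```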

**Update:** a new feature enables adding and removing documents from an existing index. Example usage:

```python
# config, searcher and checkpoint refer to an already-built ColBERT/PLAID index setup.
index_updater = IndexUpdater(config, searcher, checkpoint)

added_pids = index_updater.add(passages)  # Adding passages
index_updater.remove(pids)  # Removing passages
searcher.search()  # Search now reflects the added & removed passages

index_updater.persist_to_disk()  # Persist changes to disk
```

#### PLAID Requirements

If a GPU is to be used, it should be an RTX 3090 or newer (Ampere architecture), and PyTorch should be installed with CUDA support, e.g.:

```bash
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
```

## fastRAG running LLMs with Habana Gaudi (DL1) and Gaudi 2

fastRAG includes Intel Habana Gaudi support for running LLMs as generators in pipelines.

### Installation

To enable Gaudi support, please follow the installation instructions in the [Optimum Habana](https://github.com/huggingface/optimum-habana.git) guide.

### Usage

We enabled support for running LLMs on Habana Gaudi (DL1) and Habana Gaudi 2 by simply configuring the invocation layer of the `PromptModel` instance.

Below is an example of loading a `PromptModel` with the Habana backend:

```python
import torch
from haystack.nodes import PromptModel

from fastrag.prompters.invocation_layers.gaudi_hugging_face_inference import GaudiHFLocalInvocationLayer

PrompterModel = PromptModel(
    model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
    invocation_layer_class=GaudiHFLocalInvocationLayer,
    model_kwargs=dict(
        max_new_tokens=50,
        torch_dtype=torch.bfloat16,
        do_sample=False,
        constant_sequence_length=384,
    ),
)
```

We provide a detailed [Gaudi Inference](examples/inference_with_gaudi.ipynb) notebook showing how you can build a RAG pipeline using Gaudi; feel free to try it out!

## fastRAG running LLMs with ONNX-runtime

To run LLMs efficiently on CPUs, we provide a method for running quantized LLMs using [optimum-intel](https://github.com/huggingface/optimum-intel).
We recommend checking out our [full notebook](examples/rag_with_quantized_llm.ipynb) for all the details, including the quantization and pipeline construction.

### Installation

Run the following command to install our dependencies:

```bash
pip install -e .[intel]
```

For more information regarding the installation process, we recommend checking out the [optimum-intel](https://github.com/huggingface/optimum-intel) repository.

### LLM Quantization

To quantize a model, we first export it to the ONNX format, and then use a quantizer to save the quantized version of the model:

```python
import os

from optimum.onnxruntime import ORTModelForCausalLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = 'my_llm_model_name'
converted_model_path = "my/local/path"

# Export the model to ONNX and save it locally.
model = ORTModelForCausalLM.from_pretrained(model_name, export=True)
model.save_pretrained(converted_model_path)

# Reload the exported model and quantize it (dynamic quantization for AVX2 CPUs).
model = ORTModelForCausalLM.from_pretrained(converted_model_path)
qconfig = AutoQuantizationConfig.avx2(is_static=False)
quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir=os.path.join(converted_model_path, 'quantized'), quantization_config=qconfig)
```

### Loading the Quantized Model

Now that our model is quantized, we can load it in our framework by specifying the `ORTInvocationLayer` invocation layer.

```python
from haystack.nodes import PromptModel

from fastrag.prompters.invocation_layers.ort import ORTInvocationLayer

PrompterModel = PromptModel(
    model_name_or_path="my/local/path/quantized",
    invocation_layer_class=ORTInvocationLayer,
)
```

## Optimized Embedding Models

Bi-encoder embedders are key components of retrieval augmented generation pipelines, used mainly for indexing documents and for online re-ranking. We provide support for quantized `int8` models with low latency and high throughput, using the [`optimum-intel`](https://github.com/huggingface/optimum-intel) framework.

For a comprehensive overview, instructions for optimizing existing models, and usage information, see the dedicated [readme.md](scripts/optimizations/embedders/README.md).

We integrated the optimized embedders into the following two components (an illustrative sketch follows the list):

- [`QuantizedBiEncoderRanker`](fastrag/rankers/quantized_bi_encoder.py) - a bi-encoder ranker; encodes the documents provided in the input and re-orders them according to query similarity.
- [`QuantizedBiEncoderRetriever`](fastrag/retrievers/optimized.py) - a bi-encoder retriever; encodes documents into vectors given a vector store engine.
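
As an illustrative sketch only: the constructor arguments below are assumptions modeled on Haystack's `EmbeddingRetriever` and ranker interfaces, and the model paths are placeholders; the dedicated readme has the authoritative usage.

```python
from haystack.document_stores import InMemoryDocumentStore

# Import paths follow the file locations linked above.
from fastrag.rankers.quantized_bi_encoder import QuantizedBiEncoderRanker
from fastrag.retrievers.optimized import QuantizedBiEncoderRetriever

# Hypothetical usage: both components load a quantized int8 bi-encoder.
retriever = QuantizedBiEncoderRetriever(
    document_store=InMemoryDocumentStore(embedding_dim=384),
    embedding_model="path/to/quantized-bi-encoder",
)
ranker = QuantizedBiEncoderRanker(model_name_or_path="path/to/quantized-bi-encoder")
```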

**NOTE**: For optimal performance we suggest following the important notes in the dedicated [readme.md](scripts/optimizations/embedders/README.md).

## Fusion-In-Decoder

<image align="right" src="assets/fid.png" width="500">

The Fusion-In-Decoder model (FiD for short) is a transformer-based generative model based on the T5 architecture. In our setting, the model answers a question given relevant supporting documents. Given a query and a collection of documents, it encodes the question combined with each document in parallel, and then uses all the encoded documents at once in the decoder to generate the answer token by token. See ([Izacard and Grave 2021](#org330b3f6)) for more details.

We provide an implementation of FiD as an invocation layer ([`FiDHFLocalInvocationLayer`](fastrag/prompters/invocation_layers/fid.py)) for an LLM, and an [example notebook](examples/fid_promping.ipynb) of a RAG pipeline.
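
For reference, the KILT benchmark scripts updated in this commit build an FiD reader as follows (reproduced here as a sketch; the model name and generation settings are theirs):

```python
import torch
from haystack.nodes import PromptModel
from haystack.nodes.prompt import AnswerParser, PromptNode
from haystack.nodes.prompt.prompt_template import PromptTemplate

from fastrag.prompters.invocation_layers import fid

# Wrap the FiD checkpoint in a PromptModel that uses the FiD invocation layer.
fid_model = PromptModel(
    model_name_or_path="Intel/fid_flan_t5_base_nq",
    use_gpu=True,
    invocation_layer_class=fid.FiDHFLocalInvocationLayer,
    model_kwargs=dict(
        model_kwargs=dict(device_map={"": 0}, torch_dtype=torch.bfloat16, do_sample=False),
        generation_kwargs=dict(max_length=10),
    ),
)

# The PromptNode acts as the reader at the end of the retrieval pipeline.
reader = PromptNode(
    model_name_or_path=fid_model,
    default_prompt_template=PromptTemplate("{query}", output_parser=AnswerParser()),
)
```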

To fine-tune your own FiD model, you can use our training script: [Training FiD](scripts/training/train_fid.py).

The following is an example command with the standard parameters for training the FiD model:

```bash
python scripts/training/train_fid.py \
    --do_train \
    --do_eval \
    --output_dir output_dir \
    --train_file path/to/train_file \
    --validation_file path/to/validation_file \
    --passage_count 100 \
    --model_name_or_path t5-base \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --seed 42 \
    --gradient_accumulation_steps 8 \
    --learning_rate 0.00005 \
    --optim adamw_hf \
    --lr_scheduler_type linear \
    --weight_decay 0.01 \
    --max_steps 15000 \
    --warmup_step 1000 \
    --max_seq_length 250 \
    --max_answer_length 20 \
    --evaluation_strategy steps \
    --eval_steps 2500 \
    --eval_accumulation_steps 1 \
    --gradient_checkpointing \
    --bf16 \
    --bf16_full_eval
```

## References

<a id="shiREPLUGRetrievalAugmentedBlackBox2023"></a>Shi, Weijia, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. “REPLUG: Retrieval-Augmented Black-Box Language Models.” arXiv. <https://doi.org/10.48550/arXiv.2301.12652>.

<a id="orgbfef01e"></a>Santhanam, Keshav, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. “PLAID: An Efficient Engine for Late Interaction Retrieval.” arXiv. <https://doi.org/10.48550/arXiv.2205.09707>.

<a id="org330b3f5"></a>Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” arXiv. <https://doi.org/10.48550/arXiv.2112.01488>.

<a id="org330b3f6"></a>Izacard, Gautier, and Edouard Grave. 2021. “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.” arXiv. <https://doi.org/10.48550/arXiv.2007.01282>.

config/README.md

Lines changed: 2 additions & 6 deletions
@@ -9,14 +9,11 @@ There are several ready-to-run full pipelines. They are compatible with the hays
 
 - `qa_pipeline` - Elasticsearch retriever, SBERT cross encoder reranker and an FiD reader.
 - `qa_plaid`- ColBERT/PLAID retriever and an FiD reader.
-- `summarization_pipeline_long_t5` - Elasticsearch retriever, SBERT cross encoder reranker and a large input LongT5 reader.
-- `summarization_pipeline_flan_t5` - Elasticsearch retriever, SBERT cross encoder reranker and an instruction tuned FLAN-T5 reader.
-- `qa_diffusion_pipeline` - Elasticsearch retriever, SBERT cross encoder reranker, an FiD reader and a stable diffusion
-  image generation reader.
+- `summarization_pipeline` - BM25 retriever, SBERT cross encoder reranker and an instruction tuned BART-LARGE reader.
 
 ## Components Configuration
 
-The following folders: `data`, `embedder`, `image_generator`, `reader`, `reranker`, `retriever` and `store` contain
+The following folders: `data`, `embedder`, `reader`, `reranker`, `retriever` and `store` contain
 components' configuration YAML files. Those can be adapted to use specific models by providing a local path or a
 HuggingFace hub's model name or dataset name. After components are chosen, one needs to run the [Pipeline
 Generation](../scripts/generate_pipeline.py) script in order to generate a pipeline configuration which describes the
@@ -27,7 +24,6 @@ python generate_pipeline.py --path "retriever,reranker,reader" \
     --store config/store/elastic.yaml \
     --retriever config/retriever/bm25.yaml \
     --reranker config/reranker/sbert.yaml \
-    --reader config/reader/FiD.yaml \
     --file pipeline.yaml
 ```
 

config/data/pubmedQA.yaml

Lines changed: 7 additions & 0 deletions
type: fastrag.data_loaders.HFDatasetLoader
dataset_info:
  path: "bigbio/pubmed_qa"
  name: "pubmed_qa_artificial_bigbio_qa"
  split: "train"
encoding_method: "pubmedqa_hf"
batch_size: 10000

config/data/wikipedia_hf_6M.yaml

Lines changed: 7 additions & 0 deletions
type: fastrag.data_loaders.HFDatasetLoader
dataset_info:
  path: "wikipedia"
  name: "20220301.en"
  split: "train"
encoding_method: "wikipedia_hf_multisentence"
batch_size: 500

config/doc_chat.yaml

Lines changed: 8 additions & 0 deletions
chat_model:
  model_kwargs:
    device_map: {"": 0}
    model_max_length: 4096
    task_name: text-generation
  model_name_or_path: meta-llama/Llama-2-7b-chat-hf
  use_gpu: true
doc_pipeline_file: "config/empty_retrieval_pipeline.yaml"

config/doc_chat_ort.yaml

Lines changed: 10 additions & 0 deletions
chat_model:
  model_kwargs:
    model_max_length: 4096
    task_name: text-generation
    session_config_entries:
      session.intra_op_thread_affinities: '3,4;5,6;7,8;9,10;11,12'
    intra_op_num_threads: 6
  model_name_or_path: '/tmp/facebook_opt-iml-max-1.3b/quantized'
  invocation_layer_class: fastrag.prompters.invocation_layers.ort.ORTInvocationLayer
doc_pipeline_file: "config/empty_retrieval_pipeline.yaml"
Lines changed: 9 additions & 0 deletions
type: EmbeddingRetriever
embedding_model: "BAAI/llm-embedder"
model_format: sentence_transformers
pooling_strategy: cls_token
embed_meta_fields: ['title'] # optional
query_prompt: "Represent this query for retrieving relevant documents: " # embedding prompts require haystack > 1.23
passage_prompt: "Represent this document for retrieval: "
batch_size: 256
use_gpu: true
