The Hybrid Retriever is a searcher based on both lexical and semantic matching. It comprises three components: the Sparse Retriever, the Dense Retriever, and the Merger. The Merger fuses the results of the Sparse and Dense Retrievers to compute the hybrid results.
In the following, we show how to build a hybrid search engine, index a document collection, and search it.
You can instantiate a Hybrid Retriever as follows. Default parameter values are shown.
```python
from retriv import HybridRetriever

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="new-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/all-MiniLM-L6-v2",
    normalize=True,
    max_length=128,
    use_ann=True,
)
```
- Shared params:
  - `index_name`: retriv will use `index_name` as the identifier of your index.
- Sparse Retriever params:
  - `sr_model`: the model to use for sparse retrieval (`bm25` or `tf-idf`).
  - `min_df`: terms that appear in fewer than `min_df` documents are ignored. If an integer, the parameter indicates an absolute count; if a float, it represents a proportion of documents.
  - `tokenizer`: tokenizer to use during preprocessing. You can pass a custom callable tokenizer or disable tokenization by setting the parameter to `None`.
  - `stemmer`: stemmer to use during preprocessing. You can pass a custom callable stemmer or disable stemming by setting the parameter to `None`.
  - `stopwords`: stop-words to remove during preprocessing. You can pass a custom stop-word list or disable stop-word removal by setting the parameter to `None`.
  - `do_lowercasing`: whether to lowercase texts.
  - `do_ampersand_normalization`: whether to convert `&` to `and` during pre-processing.
  - `do_special_chars_normalization`: whether to convert special characters to plain letters, e.g., `übermensch` → `ubermensch`.
  - `do_acronyms_normalization`: whether to remove full-stop symbols from acronyms without splitting them into multiple words, e.g., `P.C.I.` → `PCI`.
  - `do_punctuation_removal`: whether to remove punctuation.
- Dense Retriever params:
  - `dr_model`: the model to use for encoding queries and documents into vectors. You can use a Hugging Face Transformers pre-trained model by providing its ID, or load a local model by providing its path. For local models, the path must point to the directory containing the data saved with the `PreTrainedModel.save_pretrained` method. Note that the representations are computed with mean pooling over the `last_hidden_state`.
  - `normalize`: whether to L2-normalize the vector representations.
  - `max_length`: texts longer than `max_length` will be automatically truncated. Choose this parameter based on how the employed model was trained or is generally used.
  - `use_ann`: whether to use approximate nearest-neighbors search. Set it to `False` to use exact nearest-neighbors search. If you have fewer than 20k documents in your collection, you probably want to disable approximation.
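As a quick illustration of the customization hooks above, here is a sketch of a custom callable tokenizer. The tokenizer itself is hypothetical; the commented instantiation mirrors the signature shown earlier.

```python
import re

def word_tokenizer(text):
    """Custom callable tokenizer: lowercase, then split on non-alphanumerics."""
    return re.findall(r"\w+", text.lower())

# Hypothetical usage, following the parameters documented above:
# from retriv import HybridRetriever
# hr = HybridRetriever(
#     index_name="custom-index",
#     tokenizer=word_tokenizer,  # custom callable instead of "whitespace"
#     stemmer=None,              # disable stemming
#     stopwords=None,            # disable stop-word removal
# )

print(word_tokenizer("Hello, World!"))  # ['hello', 'world']
```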
Note: text pre-processing is equally applied to documents during indexing and to queries at search time.
You can index a document collection from JSONL, CSV, or TSV files.
CSV and TSV files must have a header.
retriv automatically infers the file kind, so there's no need to specify it.
Use the `callback` parameter to pass a function that converts your documents into the format supported by retriv on the fly.
Documents must have a single `text` field and an `id`.
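For instance, a minimal JSONL collection in the required format (one JSON object per line, each with an `id` and a single `text` field) could be generated as follows. The file path and documents are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Two hypothetical documents in the format retriv expects.
docs = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
]

path = Path(tempfile.gettempdir()) / "collection.jsonl"
with path.open("w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")  # one JSON object per line
```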
The Hybrid Retriever sequentially builds the indices for the Sparse and Dense Retrievers.
Indexes are automatically persisted on disk at the end of the process.
To speed up the indexing process of the Dense Retriever, you can activate GPU-based encoding.
```python
hr = hr.index_file(
    path="path/to/collection",  # File kind is automatically inferred
    embeddings_path=None,       # Default value
    use_gpu=False,              # Default value
    batch_size=512,             # Default value
    show_progress=True,         # Default value
    callback=lambda doc: {      # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + ". " + doc["text"],
    },
)
```
You can load a previously built index as follows:

```python
hr = HybridRetriever.load("index-name")
```

You can also delete an index from disk:

```python
HybridRetriever.delete("index-name")
```
During search, the Hybrid Retriever fuses the top 1000 results of the Sparse and Dense Retrievers.
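The fusion idea can be illustrated with a minimal sketch: normalize each retriever's scores (here with min-max normalization) and combine them with a convex combination. This is only an illustration of score fusion, not retriv's exact Merger implementation; the weight `alpha` is hypothetical.

```python
def min_max_norm(run):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def fuse(sparse_run, dense_run, alpha=0.5):
    """Convex combination of normalized lexical and semantic scores."""
    sparse_n, dense_n = min_max_norm(sparse_run), min_max_norm(dense_run)
    docs = set(sparse_n) | set(dense_n)
    return {
        doc: alpha * sparse_n.get(doc, 0.0) + (1 - alpha) * dense_n.get(doc, 0.0)
        for doc in docs
    }

sparse = {"doc_1": 1.2, "doc_2": 3.4}   # lexical (e.g., BM25) scores
dense = {"doc_2": 0.91, "doc_3": 0.77}  # semantic (cosine) scores
fused = fuse(sparse, dense)
# doc_2 ranks first: it gets the top normalized score from both retrievers
```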
Standard search functionality.
```python
hr.search(
    query="witches masses",  # What to search for
    return_docs=True,        # Default value, return the text of the documents
    cutoff=100,              # Default value, number of results to return
)
```
Output:

```json
[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 0.9536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
```
Compute results for multiple queries at once.
```python
hr.msearch(
    queries=[{"id": "q_1", "text": "witches masses"}, ...],
    cutoff=100,     # Default value, number of results
    batch_size=32,  # Default value
)
```
Output:

```json
{
  "q_1": {
    "doc_2": 1.7536403,
    "doc_1": 0.6931472
  },
  ...
}
```
- `batch_size`: how many searches to perform at once. Adjust it if you run into memory-usage issues or want to maximize throughput.
Batch-Search is similar to Multi-Search but automatically generates batches of queries to evaluate and allows dynamically writing the search results to disk in JSONL format. `bsearch` is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM.
```python
hr.bsearch(
    queries=[{"id": "q_1", "text": "witches masses"}, ...],
    cutoff=100,
    batch_size=32,
    show_progress=True,
    qrels=None,  # To add relevance information to the saved files
    path=None,   # Where to save the results, if you want to save them
)
```
Use the AutoTune function to tune the BM25 parameters of the Sparse Retriever's model and the importance given to the lexical and semantic relevance scores computed by the Sparse and Dense Retrievers, respectively.
All metrics supported by ranx are supported by the `autotune` function.
At the end of the process, the best parameter configurations are automatically applied and saved to disk.
You can inspect the best configurations found by printing `hr.sparse_retriever.hyperparams`, `hr.merger.norm`, and `hr.merger.params`.
```python
hr.autotune(
    queries=[{"q_id": "q_1", "text": "...", ...}],  # Train queries
    qrels={"q_1": {"doc_1": 1, ...}, ...},          # Train qrels
    metric="ndcg",   # Default value, metric to maximize
    n_trials=100,    # Default value, number of trials
    cutoff=100,      # Default value, number of results
    batch_size=32,   # Default value
)
```