Sparse Embedding Retrieval with Qdrant and FastEmbed
Last Updated: September 24, 2024
In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.
We will use the Qdrant Document Store and FastEmbed Sparse Embedders.
Why SPLADE?
- Sparse Keyword-Based Retrieval (based on BM25 algorithm or similar ones) is simple and fast, requires few resources but relies on lexical matching and struggles to capture semantic meaning.
- Dense Embedding-Based Retrieval takes semantics into account but requires considerable computational resources, usually does not work well on novel domains, and does not consider precise wording.
While good results can be achieved by combining the two approaches ( tutorial), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a new method that encapsulates the positive aspects of both techniques. In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and perform automatic term expansions, reducing the vocabulary mismatch problem (queries and relevant documents often lack term overlap).
Main features:
- Better than dense embedding Retrievers on precise keyword matching
- Better than BM25 on semantic matching
- Slower than BM25
- Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores
Resources
- SPLADE for Sparse Vector Search Explained - great guide by Pinecone
- SPLADE GitHub repository, with links to all related papers
Install dependencies
!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers
Sparse Embedding Retrieval
Indexing
Create a Qdrant Document Store
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
document_store = QdrantDocumentStore(
":memory:",
recreate_index=True,
return_embedding=True,
use_sparse_embeddings=True # set this parameter to True, otherwise the collection schema won't allow to store sparse vectors
)
Download Wikipedia pages and create raw documents
We download a few Wikipedia pages about animals and create Haystack documents from them.
nice_animals=["Capybara", "Dolphin"]
import wikipedia
from haystack.dataclasses import Document
raw_docs=[]
for title in nice_animals:
page = wikipedia.page(title=title, auto_suggest=False)
doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
raw_docs.append(doc)
Initialize a FastembedSparseDocumentEmbedder
The FastembedSparseDocumentEmbedder
enrichs a list of documents with their sparse embeddings.
We are using prithvida/Splade_PP_en_v1
, a good sparse embedding model with a permissive license.
We also want to embed the title of the document, because it contains relevant information.
For more customization options, refer to the docs.
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder
sparse_doc_embedder = FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1",
meta_fields_to_embed=["title"])
sparse_doc_embedder.warm_up()
# let's try the embedder
print(sparse_doc_embedder.run(documents=[Document(content="An example document")]))
Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]
.gitattributes: 0%| | 0.00/1.52k [00:00<?, ?B/s]
README.md: 0%| | 0.00/133 [00:00<?, ?B/s]
generation_config.json: 0%| | 0.00/90.0 [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/712k [00:00<?, ?B/s]
config.json: 0%| | 0.00/755 [00:00<?, ?B/s]
tokenizer_config.json: 0%| | 0.00/1.38k [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/695 [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]
model.onnx: 0%| | 0.00/532M [00:00<?, ?B/s]
Calculating sparse embeddings: 100%|ββββββββββ| 1/1 [00:00<00:00, 12.05it/s]
{'documents': [Document(id=cd69a8e89f3c179f243c483a337c5ecb178c58373a253e461a64545b669de12d, content: 'An example document', sparse_embedding: vector with 19 non-zero elements)]}
Indexing pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline
indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
indexing.add_component("sparse_doc_embedder", sparse_doc_embedder)
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "sparse_doc_embedder")
indexing.connect("sparse_doc_embedder", "writer")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21068632e0>
π
Components
- cleaner: DocumentCleaner
- splitter: DocumentSplitter
- sparse_doc_embedder: FastembedSparseDocumentEmbedder
- writer: DocumentWriter
π€οΈ Connections
- cleaner.documents -> splitter.documents (List[Document])
- splitter.documents -> sparse_doc_embedder.documents (List[Document])
- sparse_doc_embedder.documents -> writer.documents (List[Document])
Let’s index our documents!
β οΈ If you are running this notebook on Google Colab, please note that Google Colab only provides 2 CPU cores, so the sparse embedding generation could be not as fast as it can be on a standard machine.
indexing.run({"documents":raw_docs})
Calculating sparse embeddings: 100%|ββββββββββ| 152/152 [02:29<00:00, 1.02it/s]
200it [00:00, 2418.48it/s]
{'writer': {'documents_written': 152}}
document_store.count_documents()
152
Retrieval
Retrieval pipeline
Now, we create a simple retrieval Pipeline:
FastembedSparseTextEmbedder
: transforms the query into a sparse embeddingQdrantSparseEmbeddingRetriever
: looks for relevant documents, based on the similarity of the sparse embeddings
from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder
sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")
query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", QdrantSparseEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21067cf3d0>
π
Components
- sparse_text_embedder: FastembedSparseTextEmbedder
- sparse_retriever: QdrantSparseEmbeddingRetriever
π€οΈ Connections
- sparse_text_embedder.sparse_embedding -> sparse_retriever.query_sparse_embedding (SparseEmbedding)
Try the retrieval pipeline
question = "Where do capybaras live?"
results = query_pipeline.run({"sparse_text_embedder": {"text": question}})
Calculating sparse embeddings: 100%|ββββββββββ| 1/1 [00:00<00:00, 9.02it/s]
import rich
for d in results['sparse_retriever']['documents']:
rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")
id: 6a485709ae51c55b78252571c0808ef17129b32e930ea7d461c12d9afaf40672 Its karyotype has 2n = 66 and FN = 102, meaning it has 66 chromosomes with a total of 102 arms. == Ecology == Capybaras are semiaquatic mammals found throughout all countries of South America except Chile. They live in densely forested areas near bodies of water, such as lakes, rivers, swamps, ponds, and marshes, as well as flooded savannah and along rivers in the tropical rainforest. They are superb swimmers and can hold their breath underwater for up to five minutes at a time. score: 0.5607053126371688 ---
id: fcc9a816e7f2312988dbd20146e4a5c07d8d8b409a373c4c3986d85c26dc0d61 Capybaras have adapted well to urbanization in South America. They can be found in many areas in zoos and parks, and may live for 12 years in captivity, more than double their wild lifespan. Capybaras are docile and usually allow humans to pet and hand-feed them, but physical contact is normally discouraged, as their ticks can be vectors to Rocky Mountain spotted fever. The European Association of Zoos and Aquaria asked Drusillas Park in Alfriston, Sussex, England, to keep the studbook for capybaras, to monitor captive populations in Europe. score: 0.5577329835824506 ---
id: d70f54cc66a83b56210c801ecd49c95bae5fef4ab38989d38b26dc53449b192d In 2011, one specimen was spotted on the Central Coast of California. These escaped populations occur in areas where prehistoric capybaras inhabited; late Pleistocene capybaras inhabited Florida and Hydrochoerus hesperotiganites in California and Hydrochoerus gaylordi in Grenada, and feral capybaras in North America may actually fill the ecological niche of the Pleistocene species. === Diet and predation === Capybaras are herbivores, grazing mainly on grasses and aquatic plants, as well as fruit and tree bark. They are very selective feeders and feed on the leaves of one species and disregard other species surrounding it. score: 0.5567185168202262 ---
id: a1cb26bcd9d053fc8e7a3c8b6716801b37ca37940c6f8b7865d6f6bb50b38f2f The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species and can be found in groups as large as 100 individuals, but usually live in groups of 10β20 individuals. The capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology == Its common name is derived from Tupi ka'apiΓ»ara, a complex agglutination of kaΓ‘ (leaf) + pΓi (slender) + ΓΊ (eat) + ara (a suffix for agent nouns), meaning "one who eats slender leaves", or "grass-eater". score: 0.5562936393318461 ---
id: 15755cebd1049a00c4656aaa7cf6c417966b81e482732e1c97288d58a08b53b2 The capybara or greater capybara (Hydrochoerus hydrochaeris) is a giant cavy rodent native to South America. It is the largest living rodent and a member of the genus Hydrochoerus. The only other extant member is the lesser capybara (Hydrochoerus isthmius). Its close relatives include guinea pigs and rock cavies, and it is more distantly related to the agouti, the chinchilla, and the nutria. score: 0.5559830683084014 ---
id: d8bd93eefa0c2feeb162972cd915fe29e3b13c98181a976d4751296c51547c77 Capybara have flourished in cattle ranches. They roam in home ranges averaging 10 hectares (25 acres) in high-density populations. Many escapees from captivity can also be found in similar watery habitats around the world. Sightings are fairly common in Florida, although a breeding population has not yet been confirmed. score: 0.5550636661344844 ---
id: 5429437120fdd0611ba2f14db51a28dfd894cd7221264c805dfe43de6ba95f7e In Japan, following the lead of Izu Shaboten Zoo in 1982, multiple establishments or zoos in Japan that raise capybaras have adopted the practice of having them relax in onsen during the winter. They are seen as an attraction by Japanese people. Capybaras became popular in Japan due to the popular cartoon character Kapibara-san. In August 2021, Argentine and international media reported that capybaras had been causing serious problems for residents of Nordelta, an affluent gated community north of Buenos Aires built atop wetland habitat. score: 0.5535668539880305 ---
id: 4261a2ae42f4edf8885c6e4c483356c994ad7699fd814e3b8f5da115aa5560ea
Alloparenting has been observed in this species. Breeding peaks between April and May in Venezuela and between
October and November in Mato Grosso, Brazil. === Activities ===
Though quite agile on land, capybaras are equally at home in the water. They are excellent swimmers, and can remain
completely submerged for up to five minutes, an ability they use to evade predators.
score: 0.553533523584911
---
id: b819073f1e3b6973dd1d22299a34a882acc348ba45257456cc85c3960fedd9ce
The studbook includes information about all births, deaths and movements of capybaras, as well as how they are
related.
Capybaras are farmed for meat and skins in South America. The meat is considered unsuitable to eat in some areas,
while in other areas it is considered an important source of protein. In parts of South America, especially in
Venezuela, capybara meat is popular during Lent and Holy Week as the Catholic Church previously issued special
dispensation to allow it to be eaten while other meats are generally forbidden.
score: 0.5527757086170884
---
id: a1fc3b907198e9d3e1d95d4c08f0fefae6861ce810d1b32479b50753326bdf58
== Conservation and human interaction ==
Capybaras are not considered a threatened species; their population is stable throughout most of their South
American range, though in some areas hunting has reduced their numbers. Capybaras are hunted for their meat and
pelts in some areas, and otherwise killed by humans who see their grazing as competition for livestock. In some
areas, they are farmed, which has the effect of ensuring the wetland habitats are protected. Their survival is
aided by their ability to breed rapidly.
score: 0.5503703349299455
---
Understanding SPLADE vectors
(Inspiration: FastEmbed SPLADE notebook)
We have seen that our model encodes text into a sparse vector (= a vector with many zeros). An efficient representation of sparse vectors is to save the indices and values of nonzero elements.
Let’s try to understand what information resides in these vectors…
question = "Where do capybaras live?"
sparse_embedding = sparse_text_embedder.run(text=question)["sparse_embedding"]
rich.print(sparse_embedding.to_dict())
Calculating sparse embeddings: 100%|ββββββββββ| 1/1 [00:00<00:00, 10.06it/s]
{ 'indices': [ 2015, 2100, 2427, 2444, 2555, 2693, 3224, 3269, 3295, 3562, 4111, 4761, 4982, 5430, 5917, 6178, 6552, 7713, 8843, 9089, 9230, 9277, 10746, 14627, 15267, 20709 ], 'values': [ 0.7443006634712219, 1.2232322692871094, 0.7982208132743835, 1.8504852056503296, 0.031874656677246094, 0.22175012528896332, 0.17087453603744507, 0.03717103973031044, 1.8334054946899414, 0.18768127262592316, 0.03902499005198479, 0.5681754946708679, 0.07937325537204742, 0.30040717124938965, 0.33065155148506165, 2.4437994956970215, 1.7612168788909912, 0.0731465145945549, 0.18527895212173462, 0.33103543519973755, 0.29275140166282654, 0.04728797823190689, 0.04782348498702049, 0.0030497254338115454, 0.6497660875320435, 2.6444451808929443 ] }
from transformers import AutoTokenizer
# we need the tokenizer vocabulary
tokenizer = AutoTokenizer.from_pretrained("Qdrant/Splade_PP_en_v1") # ONNX export of the original model
tokenizer_config.json: 0%| | 0.00/1.38k [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/712k [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/695 [00:00<?, ?B/s]
def get_tokens_and_weights(sparse_embedding, tokenizer):
token_weight_dict = {}
for i in range(len(sparse_embedding.indices)):
token = tokenizer.decode([sparse_embedding.indices[i]])
weight = sparse_embedding.values[i]
token_weight_dict[token] = weight
# Sort the dictionary by weights
token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
return token_weight_dict
rich.print(get_tokens_and_weights(sparse_embedding, tokenizer))
{ '##bara': 2.6444451808929443, 'cap': 2.4437994956970215, 'live': 1.8504852056503296, 'location': 1.8334054946899414, 'habitat': 1.7612168788909912, '##y': 1.2232322692871094, 'species': 0.7982208132743835, '##s': 0.7443006634712219, 'predator': 0.6497660875320435, 'origin': 0.5681754946708679, 'nest': 0.33103543519973755, 'tribe': 0.33065155148506165, 'cave': 0.30040717124938965, 'migration': 0.29275140166282654, 'move': 0.22175012528896332, 'genus': 0.18768127262592316, 'breed': 0.18527895212173462, 'forest': 0.17087453603744507, 'grow': 0.07937325537204742, 'shelter': 0.0731465145945549, 'habitats': 0.04782348498702049, 'refuge': 0.04728797823190689, 'animal': 0.03902499005198479, 'plant': 0.03717103973031044, 'region': 0.031874656677246094, 'reproduction': 0.0030497254338115454 }
Very nice! π¦«
- tokens are ordered by relevance
- the query is expanded with relevant tokens/terms: “location”, “habitat”…
Hybrid Retrieval
Ideally, techniques like SPLADE are intended to replace other approaches (BM25 and Dense Embedding Retrieval) and their combinations.
However, sometimes it may make sense to combine, for example, Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of this paper ( An Analysis of Fusion Functions for Hybrid Retrieval). Make sure this works for your use case and conduct an evaluation.
Below we show how to create such an application in Haystack.
In the example, we use the Qdrant Hybrid Retriever: it compares dense and sparse query and document embeddings and retrieves the most relevant documents , merging the scores with Reciprocal Rank Fusion.
If you want to customize the behavior more, see Hybrid Retrieval Pipelines ( tutorial).
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline
document_store = QdrantDocumentStore(
":memory:",
recreate_index=True,
return_embedding=True,
use_sparse_embeddings=True,
embedding_dim = 384
)
hybrid_indexing = Pipeline()
hybrid_indexing.add_component("cleaner", DocumentCleaner())
hybrid_indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
hybrid_indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))
hybrid_indexing.connect("cleaner", "splitter")
hybrid_indexing.connect("splitter", "sparse_doc_embedder")
hybrid_indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
hybrid_indexing.connect("dense_doc_embedder", "writer")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8292170>
π
Components
- cleaner: DocumentCleaner
- splitter: DocumentSplitter
- sparse_doc_embedder: FastembedSparseDocumentEmbedder
- dense_doc_embedder: FastembedDocumentEmbedder
- writer: DocumentWriter
π€οΈ Connections
- cleaner.documents -> splitter.documents (List[Document])
- splitter.documents -> sparse_doc_embedder.documents (List[Document])
- sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
- dense_doc_embedder.documents -> writer.documents (List[Document])
hybrid_indexing.run({"documents":raw_docs})
Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]
special_tokens_map.json: 0%| | 0.00/695 [00:00<?, ?B/s]
README.md: 0%| | 0.00/28.0 [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/711k [00:00<?, ?B/s]
.gitattributes: 0%| | 0.00/1.52k [00:00<?, ?B/s]
ort_config.json: 0%| | 0.00/1.27k [00:00<?, ?B/s]
config.json: 0%| | 0.00/706 [00:00<?, ?B/s]
tokenizer_config.json: 0%| | 0.00/1.24k [00:00<?, ?B/s]
model_optimized.onnx: 0%| | 0.00/66.5M [00:00<?, ?B/s]
vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]
Calculating sparse embeddings: 100%|ββββββββββ| 152/152 [02:14<00:00, 1.13it/s]
Calculating embeddings: 100%|ββββββββββ| 152/152 [00:41<00:00, 3.68it/s]
200it [00:00, 655.45it/s]
{'writer': {'documents_written': 152}}
document_store.filter_documents()[0]
Document(id=5e2d65ac05a8a238b359773c3d855e026aca6e617df8a011964b401d8b242a1e, content: ' Overall, they tend to be dwarfed by other Cetartiodactyls. Several species have female-biased sexua...', meta: {'title': 'Dolphin', 'url': 'https://en.wikipedia.org/wiki/Dolphin', 'source_id': '6584a10fad50d363f203669ff6efc19e7ae2a5a28ca9351f5cceb5ba88f8e847'}, embedding: vector of size 384, sparse_embedding: vector with 129 non-zero elements)
from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder
hybrid_query = Pipeline()
hybrid_query.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
hybrid_query.add_component("dense_text_embedder", FastembedTextEmbedder(model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: "))
hybrid_query.add_component("retriever", QdrantHybridRetriever(document_store=document_store))
hybrid_query.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
hybrid_query.connect("dense_text_embedder.embedding", "retriever.query_embedding")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8293190>
π
Components
- sparse_text_embedder: FastembedSparseTextEmbedder
- dense_text_embedder: FastembedTextEmbedder
- retriever: QdrantHybridRetriever
π€οΈ Connections
- sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
- dense_text_embedder.embedding -> retriever.query_embedding (List[float])
question = "Where do capybaras live?"
results = hybrid_query.run(
{"dense_text_embedder": {"text": question},
"sparse_text_embedder": {"text": question}}
)
Calculating sparse embeddings: 100%|ββββββββββ| 1/1 [00:00<00:00, 9.95it/s]
Calculating embeddings: 100%|ββββββββββ| 1/1 [00:00<00:00, 12.05it/s]
import rich
for d in results['retriever']['documents']:
rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")
id: fcc9a816e7f2312988dbd20146e4a5c07d8d8b409a373c4c3986d85c26dc0d61 Capybaras have adapted well to urbanization in South America. They can be found in many areas in zoos and parks, and may live for 12 years in captivity, more than double their wild lifespan. Capybaras are docile and usually allow humans to pet and hand-feed them, but physical contact is normally discouraged, as their ticks can be vectors to Rocky Mountain spotted fever. The European Association of Zoos and Aquaria asked Drusillas Park in Alfriston, Sussex, England, to keep the studbook for capybaras, to monitor captive populations in Europe. score: 0.6666666666666666 ---
id: 6a485709ae51c55b78252571c0808ef17129b32e930ea7d461c12d9afaf40672 Its karyotype has 2n = 66 and FN = 102, meaning it has 66 chromosomes with a total of 102 arms. == Ecology == Capybaras are semiaquatic mammals found throughout all countries of South America except Chile. They live in densely forested areas near bodies of water, such as lakes, rivers, swamps, ponds, and marshes, as well as flooded savannah and along rivers in the tropical rainforest. They are superb swimmers and can hold their breath underwater for up to five minutes at a time. score: 0.6666666666666666 ---
id: d8bd93eefa0c2feeb162972cd915fe29e3b13c98181a976d4751296c51547c77 Capybara have flourished in cattle ranches. They roam in home ranges averaging 10 hectares (25 acres) in high-density populations. Many escapees from captivity can also be found in similar watery habitats around the world. Sightings are fairly common in Florida, although a breeding population has not yet been confirmed. score: 0.6428571428571428 ---
id: a1cb26bcd9d053fc8e7a3c8b6716801b37ca37940c6f8b7865d6f6bb50b38f2f The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species and can be found in groups as large as 100 individuals, but usually live in groups of 10β20 individuals. The capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology == Its common name is derived from Tupi ka'apiΓ»ara, a complex agglutination of kaΓ‘ (leaf) + pΓi (slender) + ΓΊ (eat) + ara (a suffix for agent nouns), meaning "one who eats slender leaves", or "grass-eater". score: 0.45 ---
id: d70f54cc66a83b56210c801ecd49c95bae5fef4ab38989d38b26dc53449b192d In 2011, one specimen was spotted on the Central Coast of California. These escaped populations occur in areas where prehistoric capybaras inhabited; late Pleistocene capybaras inhabited Florida and Hydrochoerus hesperotiganites in California and Hydrochoerus gaylordi in Grenada, and feral capybaras in North America may actually fill the ecological niche of the Pleistocene species. === Diet and predation === Capybaras are herbivores, grazing mainly on grasses and aquatic plants, as well as fruit and tree bark. They are very selective feeders and feed on the leaves of one species and disregard other species surrounding it. score: 0.45 ---
id: 15755cebd1049a00c4656aaa7cf6c417966b81e482732e1c97288d58a08b53b2 The capybara or greater capybara (Hydrochoerus hydrochaeris) is a giant cavy rodent native to South America. It is the largest living rodent and a member of the genus Hydrochoerus. The only other extant member is the lesser capybara (Hydrochoerus isthmius). Its close relatives include guinea pigs and rock cavies, and it is more distantly related to the agouti, the chinchilla, and the nutria. score: 0.30952380952380953 ---
id: 4261a2ae42f4edf8885c6e4c483356c994ad7699fd814e3b8f5da115aa5560ea
Alloparenting has been observed in this species. Breeding peaks between April and May in Venezuela and between
October and November in Mato Grosso, Brazil. === Activities ===
Though quite agile on land, capybaras are equally at home in the water. They are excellent swimmers, and can remain
completely submerged for up to five minutes, an ability they use to evade predators.
score: 0.2361111111111111
---
id: a1fc3b907198e9d3e1d95d4c08f0fefae6861ce810d1b32479b50753326bdf58
== Conservation and human interaction ==
Capybaras are not considered a threatened species; their population is stable throughout most of their South
American range, though in some areas hunting has reduced their numbers. Capybaras are hunted for their meat and
pelts in some areas, and otherwise killed by humans who see their grazing as competition for livestock. In some
areas, they are farmed, which has the effect of ensuring the wetland habitats are protected. Their survival is
aided by their ability to breed rapidly.
score: 0.20202020202020202
---
id: 5429437120fdd0611ba2f14db51a28dfd894cd7221264c805dfe43de6ba95f7e In Japan, following the lead of Izu Shaboten Zoo in 1982, multiple establishments or zoos in Japan that raise capybaras have adopted the practice of having them relax in onsen during the winter. They are seen as an attraction by Japanese people. Capybaras became popular in Japan due to the popular cartoon character Kapibara-san. In August 2021, Argentine and international media reported that capybaras had been causing serious problems for residents of Nordelta, an affluent gated community north of Buenos Aires built atop wetland habitat. score: 0.125 ---
id: a87985b28681d12e9897eae531fb5e93ecd0c702a5419708ddcbc03ae13c0ed0 They can have a lifespan of 8β10 years, but tend to live less than four years in the wild due to predation from big cats like the jaguars and pumas and non-mammalian predators like eagles and the caimans. The capybara is also the preferred prey of the green anaconda. == Social organization == Capybaras are known to be gregarious. While they sometimes live solitarily, they are more commonly found in groups of around 10β20 individuals, with two to four adult males, four to seven adult females, and the remainder juveniles. score: 0.1 ---
π Docs on Sparse Embedding support in Haystack
- Retrievers
- Qdrant Sparse Embedding Retriever
- Qdrant Hybrid Retriever
- FastEmbed Sparse Text Embedder
- Fastembed Sparse Document Embedder
(Notebook by Stefano Fiorucci)