Tutorial: Retrieving a Context Window Around a Sentence
Last Updated: October 21, 2024
- Level: Beginner
- Time to complete: 10 minutes
- Components Used: SentenceWindowRetriever, DocumentSplitter, InMemoryDocumentStore, InMemoryBM25Retriever
- Goal: After completing this tutorial, you will have learned about Sentence-Window Retrieval and how to use it for document retrieval.
Overview
The Sentence-Window retrieval technique is a simple and effective way to retrieve more context for a document that matched a user query. It is based on the idea that the sentences most relevant to a match are likely to be close to it in the document. Instead of returning only the matching sentence, the technique selects a window of sentences around it and returns the entire window. This is particularly useful when the user query is a question or a phrase that requires more context to be understood.
The SentenceWindowRetriever can be used in a Pipeline to implement the Sentence-Window retrieval technique.
The component takes a document_store and a window_size as input. The document_store contains the documents we want to query, and the window_size determines how many sentences to return on each side of the matching sentence, so a window contains up to 2 * window_size + 1 sentences. Although we use the term “sentence” as it’s inherently attached to this technique, the SentenceWindowRetriever actually works with any splitting unit supported by the DocumentSplitter class, for instance: word, sentence, page.
SentenceWindowRetriever(document_store=doc_store, window_size=2)
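To make the window arithmetic concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not Haystack's implementation: the helper sentence_window and the sample sentences are hypothetical, and the sketch simply clips the window at the start and end of the list, so fewer than 2 * window_size + 1 sentences come back near the edges.

```python
from typing import List


def sentence_window(sentences: List[str], match_idx: int, window_size: int) -> List[str]:
    """Return the matching sentence plus up to `window_size` neighbours on each side."""
    if window_size < 0:
        raise ValueError("window_size must be non-negative")
    start = max(0, match_idx - window_size)                  # clip at the start
    end = min(len(sentences), match_idx + window_size + 1)   # clip at the end
    return sentences[start:end]


# Hypothetical placeholder sentences, numbered so the window is easy to read.
sentences = [f"Sentence {i}." for i in range(9)]

# A match in the middle yields the full 2 * window_size + 1 sentences.
middle = sentence_window(sentences, match_idx=4, window_size=2)
print(len(middle), middle)

# A match at the very start has no left neighbours, so the window is shorter.
edge = sentence_window(sentences, match_idx=0, window_size=2)
print(len(edge), edge)
```

With window_size=2, the middle match returns five sentences and the edge match returns three.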
Preparing the Colab Environment
Installing Haystack
To start, install the latest release of Haystack with pip:
%%bash
pip install --upgrade pip
pip install haystack-ai
Enabling Telemetry
Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.
from haystack.telemetry import tutorial_running
tutorial_running(42)
Getting started with Sentence-Window Retrieval
Let’s see a simple example of how to use the SentenceWindowRetriever in isolation, and later we can see how to use it within a pipeline. We start by creating a document and splitting it into sentences using the DocumentSplitter class.
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
splitter = DocumentSplitter(split_length=1, split_overlap=0, split_by="sentence")
text = ("Paul fell asleep to dream of an Arrakeen cavern, silent people all around him moving in the dim light "
"of glowglobes. It was solemn there and like a cathedral as he listened to a faint sound—the "
"drip-drip-drip of water. Even while he remained in the dream, Paul knew he would remember it upon "
"awakening. He always remembered the dreams that were predictions. The dream faded. Paul awoke to feel "
"himself in the warmth of his bed—thinking thinking. This world of Castle Caladan, without play or "
"companions his own age, perhaps did not deserve sadness in farewell. Dr Yueh, his teacher, had "
"hinted that the faufreluches class system was not rigidly guarded on Arrakis. The planet sheltered "
"people who lived at the desert edge without caid or bashar to command them: will-o’-the-sand people "
"called Fremen, marked down on no census of the Imperial Regate.")
doc = Document(content=text)
docs = splitter.run([doc])
This will result in 9 sentences represented as Haystack Document objects. We can then write these documents to a DocumentStore and use the SentenceWindowRetriever to retrieve a window of sentences around a matching sentence.
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs['documents'], policy=DuplicatePolicy.OVERWRITE)
Now we use the SentenceWindowRetriever to retrieve a window of sentences around a certain sentence. Note that at run time the SentenceWindowRetriever receives as input a Document present in the document store, and it relies on the documents’ metadata to retrieve the window of sentences around the matching sentence. It therefore needs to be used in conjunction with another Retriever, such as the InMemoryBM25Retriever, that handles the initial user query and returns the matching documents.
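The metadata-based lookup described above can be pictured with a small pure-Python sketch. Everything here is an assumption made for illustration: the chunks list stands in for a document store, the neighbours helper is hypothetical, and the field names source_id and split_id mirror what the DocumentSplitter is understood to write into each chunk's metadata, so check them against your installed version.

```python
from typing import Dict, List

# Hypothetical stand-in for a document store: each chunk records which source
# document it came from (source_id) and its position in that document (split_id).
chunks: List[Dict] = [
    {"content": "A1.", "meta": {"source_id": "doc-a", "split_id": 0}},
    {"content": "A2.", "meta": {"source_id": "doc-a", "split_id": 1}},
    {"content": "A3.", "meta": {"source_id": "doc-a", "split_id": 2}},
    {"content": "B1.", "meta": {"source_id": "doc-b", "split_id": 0}},
    {"content": "B2.", "meta": {"source_id": "doc-b", "split_id": 1}},
]


def neighbours(match: Dict, store: List[Dict], window_size: int) -> List[Dict]:
    """Collect chunks from the same source whose position lies within the window."""
    source = match["meta"]["source_id"]
    position = match["meta"]["split_id"]
    selected = [
        c for c in store
        if c["meta"]["source_id"] == source
        and abs(c["meta"]["split_id"] - position) <= window_size
    ]
    # Sort by position so the window reads in document order.
    return sorted(selected, key=lambda c: c["meta"]["split_id"])


window = neighbours(chunks[1], chunks, window_size=1)
print(" ".join(c["content"] for c in window))
```

Filtering on the source identifier keeps the window from spilling into a neighbouring document, which is why the retriever needs a Document that already carries this metadata rather than a raw query string.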
Let’s pass the Document containing the sentence “The dream faded.” to the SentenceWindowRetriever and retrieve a window of 2 sentences on each side of it. Note that we need to wrap it in a list as the run method expects a list of documents.
from haystack.components.retrievers import SentenceWindowRetriever
retriever = SentenceWindowRetriever(document_store=doc_store, window_size=2)
result = retriever.run(retrieved_documents=[docs['documents'][4]])
The result is a dictionary with two keys:
- context_windows: a list of strings containing the context windows around the matching sentence.
- context_documents: a list of lists of Document objects containing the context windows around the matching sentence.
result['context_windows']
result['context_documents']
Create a Keyword Retrieval Pipeline with Sentence-Window Retrieval
Let’s see this component in action. We will use the BBC news dataset to show how the SentenceWindowRetriever works with a dataset containing multiple news articles.
Reading the dataset
The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but it was already preprocessed and stored in a single CSV file available here: https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv
from typing import List
import csv
from haystack import Document
def read_documents(file: str) -> List[Document]:
    with open(file, "r") as file:
        reader = csv.reader(file, delimiter="\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))
    return documents
from pathlib import Path
import requests
doc = requests.get('https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv')
datafolder = Path('data')
datafolder.mkdir(exist_ok=True)
with open(datafolder/'bbc-news-data.csv', 'wb') as f:
    for chunk in doc.iter_content(512):
        f.write(chunk)
docs = read_documents("data/bbc-news-data.csv")
len(docs)
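The parsing step inside read_documents can be tried on a small in-memory sample before downloading the full file. The two rows below are invented, and the header names are assumptions inferred from the column indices the function uses (0 for category, 2 for title, 3 for text); the sketch builds plain dictionaries rather than Haystack Document objects so it runs with the standard library alone.

```python
import csv
import io

# Invented two-row sample. The header names are assumptions inferred from the
# column indices used in read_documents (0 = category, 2 = title, 3 = text).
sample = (
    "category\tfilename\ttitle\tcontent\n"
    "tech\t001.txt\t Example title \t Example body text. \n"
    "sport\t002.txt\tAnother title\tAnother body.\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
next(reader, None)  # skip the header row

records = []
for row in reader:
    records.append(
        {
            "category": row[0].strip(),  # surrounding whitespace is removed
            "title": row[2].strip(),
            "text": row[3].strip(),
        }
    )

print(len(records))
print(records[0])
```

Note that the file is tab-delimited despite its .csv extension, which is why the delimiter is set explicitly.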
Indexing the documents
We will now apply the DocumentSplitter to split the documents into sentences and write them to an InMemoryDocumentStore.
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
doc_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_length=1, split_overlap=0, split_by="sentence"))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.OVERWRITE))
indexing_pipeline.connect("splitter", "writer")
indexing_pipeline.run({"documents":docs})
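To see what the split_length and split_overlap settings mean, here is a toy chunker written in plain Python. It is a sketch under stated assumptions, not the DocumentSplitter's code: chunk_units is a hypothetical helper, the units are placeholder strings, and the way it handles the tail of the list (it may emit a shorter final chunk) is a property of this sketch that the real component may not share.

```python
from typing import List


def chunk_units(units: List[str], split_length: int, split_overlap: int) -> List[List[str]]:
    """Group units into chunks of `split_length`, sharing `split_overlap` units."""
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    step = split_length - split_overlap  # how far each new chunk advances
    return [units[i : i + split_length] for i in range(0, len(units), step)]


units = ["S1.", "S2.", "S3.", "S4."]

# The tutorial's setting: one sentence per chunk, no overlap.
one_per_chunk = chunk_units(units, split_length=1, split_overlap=0)
print(one_per_chunk)

# For comparison: two sentences per chunk, one shared with the next chunk.
overlapping = chunk_units(units, split_length=2, split_overlap=1)
print(overlapping)
```

With one sentence per chunk and no overlap, every stored document is a single sentence, which is what lets the window be counted in sentences later on.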
Build a Sentence-Window Retrieval Pipeline
Let’s now build a pipeline to retrieve the documents using the InMemoryBM25Retriever (with keyword retrieval) and the SentenceWindowRetriever. Here, we are setting up the retriever with a window_size of 2.
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.retrievers import SentenceWindowRetriever
sentence_window_pipeline = Pipeline()
sentence_window_pipeline.add_component("bm25_retriever", InMemoryBM25Retriever(document_store=doc_store))
sentence_window_pipeline.add_component("sentence_window__retriever", SentenceWindowRetriever(doc_store, window_size=2))
sentence_window_pipeline.connect("bm25_retriever.documents", "sentence_window__retriever.retrieved_documents")
Putting it all together
Let’s see what happens when we retrieve documents relevant to “phishing attacks”, returning only the highest scored document.
We will also include the outputs from the InMemoryBM25Retriever so that we can compare the results with and without the SentenceWindowRetriever.
result = sentence_window_pipeline.run(data={'bm25_retriever': {'query': "phishing attacks", "top_k": 1}}, include_outputs_from={'bm25_retriever'})
Let’s now inspect the results from the InMemoryBM25Retriever and the SentenceWindowRetriever. Since we split the documents by sentence, the InMemoryBM25Retriever returns only the single sentence that best matches the query.
result['bm25_retriever']['documents']
The SentenceWindowRetriever, on the other hand, returns a window of sentences around the matching sentence, giving us more context to understand the sentence.
result['sentence_window__retriever']['context_windows']
We are also able to access the context window as a list of Document objects:
result['sentence_window__retriever']['context_documents']
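The contrast between the two outputs can be reproduced without Haystack in a toy two-stage sketch. The sentences are invented, both helpers are hypothetical, and the first stage uses plain term overlap as a stand-in for BM25 scoring, so treat it as an illustration of the flow rather than of the ranking.

```python
from typing import List

# Invented sentences standing in for one article split by sentence.
sentences = [
    "Banks reported a rise in online fraud last year.",
    "Most cases began with phishing attacks sent by email.",
    "Customers were urged to check links before clicking.",
    "Separately, the football season opened on Saturday.",
]


def best_match(query: str, corpus: List[str]) -> int:
    """Return the index of the sentence sharing the most words with the query.

    Plain term overlap, used here as a stand-in for BM25 scoring.
    """
    terms = set(query.lower().split())
    overlaps = [len(terms & set(s.lower().rstrip(".").split())) for s in corpus]
    return overlaps.index(max(overlaps))


def with_window(corpus: List[str], idx: int, window_size: int) -> str:
    """Join the matched sentence with its neighbours into one context string."""
    start = max(0, idx - window_size)
    return " ".join(corpus[start : idx + window_size + 1])


idx = best_match("phishing attacks", sentences)
print("match only :", sentences[idx])
print("with window:", with_window(sentences, idx, window_size=1))
```

The first stage alone returns a sentence that mentions the query terms but says little else; the second stage adds the sentences before and after it, which is the extra context the window provides.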
Wrapping Up
We saw how the SentenceWindowRetriever works and how it can be used to retrieve a window of sentences around a matching document, giving us more context to understand the document. One important aspect to notice is that the SentenceWindowRetriever doesn’t handle queries directly but relies on the output of another Retriever that handles the initial user query. This allows the SentenceWindowRetriever to be used in conjunction with any other retriever in the pipeline, such as the InMemoryBM25Retriever.