RAG: Extract and use website content for question answering with Apify-Haystack integration
Last Updated: October 3, 2024
Author: Jiri Spilka (Apify)
In this tutorial, we’ll use the apify-haystack integration to call Website Content Crawler and crawl and scrape text content from the Haystack website. Then, we’ll use the OpenAIDocumentEmbedder to compute text embeddings and the InMemoryDocumentStore to store documents in a temporary in-memory database. The last step will be a retrieval augmented generation pipeline to answer users’ questions from the scraped data.
Install dependencies
!pip install apify-haystack haystack-ai
Set up the API keys
You need an Apify account and an APIFY_API_TOKEN.
You also need an OpenAI account and an OPENAI_API_KEY.
import os
from getpass import getpass
os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
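If you run this outside an interactive notebook, `getpass` prompts are inconvenient. A small guard (not part of the original notebook; the helper name `missing_keys` is ours) can fail fast when a key was never set:

```python
# Hypothetical helper (not part of apify-haystack): report which required
# API keys are absent from a given environment mapping.
def missing_keys(env, required=("APIFY_API_TOKEN", "OPENAI_API_KEY")):
    return [name for name in required if not env.get(name)]

# With only the Apify token set, the OpenAI key is reported as missing.
print(missing_keys({"APIFY_API_TOKEN": "abc"}))  # ['OPENAI_API_KEY']
```

Call it with `os.environ` right after the setup cell and raise an error if the returned list is non-empty.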
Use the Website Content Crawler to scrape data from the Haystack documentation
Now, let us call the Website Content Crawler using the Haystack component ApifyDatasetFromActorCall. First, we need to define parameters for the Website Content Crawler and then specify what data we need to save into the vector database.
The actor_id and a detailed description of the input parameters (variable run_input) can be found on the Website Content Crawler input page. For this example, we will define startUrls and limit the number of crawled pages to five.
actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 5,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}
Next, we need to define a dataset mapping function. We need to know the output of the Website Content Crawler. Typically, it is a JSON object that looks like this (truncated for brevity):
[
    {
        "url": "https://haystack.deepset.ai/",
        "text": "Haystack | Haystack - Multimodal - AI - Architect a next generation AI app around all modalities, not just text ..."
    },
    {
        "url": "https://haystack.deepset.ai/tutorials/24_building_chat_app",
        "text": "Building a Conversational Chat App ... "
    }
]
We will convert this JSON to a Haystack Document using the dataset_mapping_function as follows:
from haystack import Document
def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})
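To see what the mapping does without running the crawler, here is a dependency-free sketch that mirrors its logic, with a plain dict standing in for haystack's Document:

```python
# Mirror of dataset_mapping_function above, using a plain dict in place of
# haystack's Document class so it runs without any dependencies installed.
def map_dataset_item(dataset_item: dict) -> dict:
    return {
        "content": dataset_item.get("text"),
        "meta": {"url": dataset_item.get("url")},
    }

# One item from the crawler's dataset output, as shown earlier.
sample_item = {
    "url": "https://haystack.deepset.ai/",
    "text": "Haystack | Haystack - Multimodal - AI ...",
}
doc = map_dataset_item(sample_item)
print(doc["meta"]["url"])  # https://haystack.deepset.ai/
```

The page text becomes the document content, and the source URL is preserved in metadata so answers can later be traced back to their page.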
And the definition of the ApifyDatasetFromActorCall:
from apify_haystack import ApifyDatasetFromActorCall
apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)
Before actually running the Website Content Crawler, we need to define an embedding function and a document store:
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
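The InMemoryDocumentStore keeps everything in process memory, so nothing persists between runs. The conceptual stand-in below (plain Python, not the real InMemoryDocumentStore) sketches that behavior; the class name TinyDocumentStore is ours:

```python
# Conceptual stand-in for an in-memory document store: documents live in an
# in-process dict keyed by id, so they vanish when the process exits.
class TinyDocumentStore:
    def __init__(self):
        self._docs = {}

    def write_documents(self, documents):
        # Writing the same id twice overwrites the earlier copy.
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def count_documents(self):
        return len(self._docs)

store = TinyDocumentStore()
store.write_documents([{"id": "1", "content": "Haystack docs"}])
print(store.count_documents())  # 1
```

For production use, you would swap the in-memory store for a persistent vector database, which Haystack supports through the same document-store interface.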
After that, we can call the Website Content Crawler and print the scraped data:
# Crawl the website and store documents in the document_store
# Crawling will take some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs
docs = apify_dataset_loader.run()
print(docs)
{'documents': [Document(id=6c4d570874ff59ed4e06017694bee8a72d766d2ed55c6453fbc9ea91fd2e6bde, content: 'Haystack | Haystack Luma · Delightful Events Start HereAWS Summit Berlin 2023: Building Generative A...', meta: {'url': 'https://haystack.deepset.ai/'}), Document(id=d420692bf66efaa56ebea200a4a63597667bdc254841b99654239edf67737bcb, content: 'Tutorials & Walkthroughs | Haystack
Tutorials & Walkthroughs2.0
Whether you’re a beginner or an expe...', meta: {'url': 'https://haystack.deepset.ai/tutorials'}), Document(id=5a529a308d271ba76f66a060c0b706b73103406ac8a853c19f20e1594823efe8, content: 'Get Started | Haystack
Haystack is an open-source Python framework that helps developers build LLM-p...', meta: {'url': 'https://haystack.deepset.ai/overview/quick-start'}), Document(id=1d126a03ae50586729846d492e9e8aca802d7f281a72a8869ded08ebc5585a36, content: 'What is Haystack? | Haystack
Haystack is an open source framework for building production-ready LLM ...', meta: {'url': 'https://haystack.deepset.ai/overview/intro'}), Document(id=4324a62242590d4ecf9b080319607fa1251aa0822bbe2ce6b21047e783999703, content: 'Integrations | Haystack
The Haystack ecosystem integrates with many other technologies, such as vect...', meta: {'url': 'https://haystack.deepset.ai/integrations'})]}
Compute the embeddings and store them in the database:
embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00, 3.29it/s]
5
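The retriever we set up next ranks these stored documents by the similarity between the query embedding and each document embedding. A minimal plain-Python sketch of cosine-similarity scoring (an illustration of the idea, not the actual InMemoryEmbeddingRetriever):

```python
import math

# Cosine similarity: dot product of two vectors divided by the product of
# their magnitudes; 1.0 means identical direction, 0.0 means orthogonal.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-d "embeddings" standing in for the 1536-d OpenAI vectors.
doc_vectors = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8]}
query = [0.9, 0.1]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

The real retriever does the same ranking over the embeddings we just wrote to the document store, returning the top-scoring documents as context for the LLM.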
Retrieval and LLM generative pipeline
Once we have the crawled data in the database, we can set up the classical retrieval augmented pipeline. Refer to the RAG Haystack tutorial for details.
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-4o-mini")
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)
# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
Initializing pipeline...
<haystack.core.pipeline.pipeline.Pipeline object at 0x7c02095efdc0>
🚅 Components
- embedder: OpenAITextEmbedder
- retriever: InMemoryEmbeddingRetriever
- prompt_builder: PromptBuilder
- llm: OpenAIGenerator
🛤️ Connections
- embedder.embedding -> retriever.query_embedding (List[float])
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.prompt (str)
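At query time, PromptBuilder fills the Jinja template with the retrieved documents and the question. Conceptually, the rendered prompt looks like this (a plain string-formatting sketch, not the real PromptBuilder; the function name render_prompt is ours):

```python
# Conceptual sketch of the template expansion performed by PromptBuilder:
# concatenate retrieved document contents into the Context section, then
# append the user's question and an "Answer:" cue for the LLM.
def render_prompt(documents, question):
    context = "\n".join(documents)
    return (
        "Given the following information, answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

prompt = render_prompt(
    ["Haystack is an open source framework ..."],
    "What is haystack?",
)
print(prompt)
```

The resulting string is what flows over the prompt_builder.prompt -> llm.prompt connection shown above.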
Now, you can ask questions about Haystack and get answers grounded in the scraped content:
question = "What is haystack?"
response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})
print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")
question: What is haystack?
answer: Haystack is an open-source Python framework designed to help developers build LLM-powered custom applications. It is used for creating production-ready LLM applications, retrieval-augmented generative pipelines, and state-of-the-art search systems that work effectively over large document collections. Haystack offers comprehensive tooling for developing AI systems that use LLMs from platforms like Hugging Face, OpenAI, Cohere, Mistral, and more. It provides a modular and intuitive framework that allows users to quickly integrate the latest AI models, offering flexibility and ease of use. The framework includes components and pipelines that enable developers to build end-to-end AI projects without the need to understand the underlying models deeply. Haystack caters to LLM enthusiasts and beginners alike, providing a vibrant open-source community for collaboration and learning.