Maintained by deepset

Integration: llamafile

Run LLMs locally with llamafile

Authors
deepset

Overview

llamafile is a project by Mozilla that aims to make open LLMs accessible to developers and users.

To run an LLM locally, you simply download a single-file executable (a “llamafile”) that bundles the model and the inference engine and runs on most computers.

llamafile can be used on its own to chat with these models, but below we will see how to integrate it with Haystack to build LLM applications.

Download and run models

Generative models

Several models are available. You can find some in the llamafile repository or search for others on the Hugging Face Hub.

For example, let’s see how to download the Mistral-7B-Instruct model and start an OpenAI-compatible server:

wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_0.llamafile

chmod +x mistral-7b-instruct-v0.2.Q4_0.llamafile

./mistral-7b-instruct-v0.2.Q4_0.llamafile --server --nobrowser

This will start a server on http://localhost:8080 that you can use to interact with the model.
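
Since the server exposes an OpenAI-compatible API, you can send it a quick test request before wiring it into Haystack. Below is a minimal sketch using the requests library; the model name and API key are placeholders that llamafile accepts but does not validate:

import requests

# a minimal sketch: query the local llamafile server through its
# OpenAI-compatible chat completions endpoint
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Authorization": "Bearer sk-no-key-required"},  # placeholder; not validated
    json={
        "model": "LLaMA_CPP",
        "messages": [{"role": "user", "content": "Briefly explain what a llamafile is."}],
    },
)
print(response.json()["choices"][0]["message"]["content"])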

If you encounter issues or need information on GPU support, refer to the llamafile repository.

Embedding models

Some embedding models are also available.

For example, to download and run the mxbai-embed-large-v1 model:

wget https://huggingface.co/Mozilla/mxbai-embed-large-v1-llamafile/resolve/main/mxbai-embed-large-v1-f16.llamafile

chmod +x mxbai-embed-large-v1-f16.llamafile

./mxbai-embed-large-v1-f16.llamafile --server --nobrowser --embedding --port 8081

This will start an OpenAI-compatible server on http://localhost:8081.
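
As with the generative model, you can test the embedding server directly through its OpenAI-compatible /v1/embeddings endpoint. A minimal sketch with the requests library, assuming the server is running on port 8081 as above:

import requests

# a minimal sketch: request an embedding from the local llamafile server
response = requests.post(
    "http://localhost:8081/v1/embeddings",
    headers={"Authorization": "Bearer sk-no-key-required"},  # placeholder; not validated
    json={"model": "LLaMA_CPP", "input": "The best food in the world is pizza"},
)
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # mxbai-embed-large-v1 embeddings are expected to be 1024-dimensional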

Usage with Haystack

Since llamafile runs OpenAI-compatible servers, you can use it with Haystack components that interact with OpenAI models: OpenAITextEmbedder, OpenAIDocumentEmbedder, OpenAIGenerator, and OpenAIChatGenerator.
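
For example, the OpenAIChatGenerator can talk to the local Mistral-7B-Instruct server on its own, outside of a pipeline. Here is a minimal sketch; the API key is a placeholder, since llamafile does not require one, and the replies are ChatMessage objects:

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

# point the chat generator at the local llamafile server instead of the OpenAI API
generator = OpenAIChatGenerator(
    api_key=Secret.from_token("sk-no-key-required"),  # placeholder; llamafile ignores it
    model="LLaMA_CPP",
    api_base_url="http://localhost:8080/v1")

result = generator.run(messages=[ChatMessage.from_user("What is the capital of Sweden?")])
print(result["replies"][0].text)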

Let’s start with an indexing pipeline that uses an embedding model. You should have the mxbai-embed-large-v1 model running as described above.

from haystack import Pipeline, Document
from haystack.utils import Secret
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import OpenAIDocumentEmbedder

document_store = InMemoryDocumentStore()

documents = [Document(content="The best food in the world is pizza"),
             Document(content="I saw a black horse running"),
             Document(content="The capital of Sweden is Stockholm"),]

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder",
                                OpenAIDocumentEmbedder(
                                    api_key=Secret.from_token("sk-no-key-required"),  # for compatibility with the OpenAI API
                                    model="LLaMA_CPP",
                                    api_base_url="http://localhost:8081/v1")
                                )
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder", "writer")

indexing_pipeline.run({"embedder": {"documents": documents}})
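
After the pipeline runs, the three documents and their embeddings are stored in the document store. A quick sanity check:

print(document_store.count_documents())  # expected output: 3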

Now let’s build a RAG pipeline that uses both an embedding model and a generative model. You should have both the mxbai-embed-large-v1 and Mistral-7B-Instruct models running as described above.

from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder


prompt_template = """<s>[INST]
Given these documents, answer the question.
Documents:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
Question: {{question}} [/INST]
Answer:
"""

rag_pipe = Pipeline()
rag_pipe.add_component("text_embedder", 
                        OpenAITextEmbedder(
                            api_key=Secret.from_token("sk-no-key-required"),  # for compatibility with the OpenAI API
                            model="LLaMA_CPP",
                            api_base_url="http://localhost:8081/v1")
                        )
rag_pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store))
rag_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
rag_pipe.add_component("generator",
                        OpenAIGenerator(
                            api_key=Secret.from_token("sk-no-key-required"),  # for compatibility with the OpenAI API
                            model="LLaMA_CPP",
                            api_base_url="http://localhost:8080/v1")
                        )

rag_pipe.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipe.connect("retriever.documents", "prompt_builder.documents")
rag_pipe.connect("prompt_builder", "generator")

query = "What is the best food in the world?"

result = rag_pipe.run({"text_embedder":{"text": query},
                       "prompt_builder": {"question": query}})

print(result["generator"]["replies"][0])

# According to the documents, the best food in the world is pizza.

For a fun use case, explore this notebook: Quizzes and Adventures with Character Codex and llamafile.