Integration: vLLM Invocation Layer
Use the vLLM inference engine with Haystack
Use vLLM in your Haystack pipelines to run fast, self-hosted LLMs.
Overview
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It is an open-source project that lets you serve open models in production when you have GPU resources available.
For Haystack 1.x, the integration is available as a separate package, while for Haystack 2.x, the integration comes out of the box.
Haystack 2.x
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used with the OpenAIGenerator and OpenAIChatGenerator components in Haystack.
For an end-to-end example of vLLM + Haystack 2.x, see this notebook.
Installation
vLLM should be installed.
- You can use pip: pip install vllm (more information in the vLLM documentation).
- For production use cases, there are many other options, including Docker (docs).
Usage
You first need to run a vLLM OpenAI-compatible server. You can do that using Python or Docker.
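For example, a server for the model used below can typically be started with vLLM's OpenAI-compatible entrypoint (the exact command may differ depending on your vLLM version; see the vLLM documentation):

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1

By default, this exposes the API at http://localhost:8000/v1, which is the base URL used in the example below.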
Then, you can use the OpenAIGenerator and OpenAIChatGenerator components in Haystack to query the vLLM server.
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
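The non-chat OpenAIGenerator works against the same server in the same way. A minimal sketch, assuming the server above is running on localhost:8000 and serving the same model:

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder key, required by the OpenAI client
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512},
)

result = generator.run(prompt="Briefly describe the Amalfi Coast.")
print(result["replies"][0])  # replies is a list of generated strings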
Haystack 1.x
Installation (1.x)
Install the wrapper via pip: pip install vllm-haystack
Usage (1.x)
This integration provides two invocation layers:
- vLLMInvocationLayer: to use models hosted on a vLLM server
- vLLMLocalInvocationLayer: to use locally hosted vLLM models
Use a Model Hosted on a vLLM Server
To use models hosted on a vLLM server, the vLLMInvocationLayer has to be used. Here is a simple example of how a PromptNode can be created with the wrapper.
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMInvocationLayer

model = PromptModel(model_name_or_path="", invocation_layer_class=vLLMInvocationLayer, max_length=256, api_key="EMPTY", model_kwargs={
    "api_base": API,  # replace API with your server URL, e.g. "http://localhost:8000/v1"
    "maximum_context_length": 2048,
})
prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
The model to use is inferred from the model served on the vLLM server. For more configuration examples, take a look at the unit tests.
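Once created, the PromptNode can be queried like any other Haystack 1.x PromptNode. A minimal sketch (the prompt is just an illustration):

output = prompt_node("Give me three sightseeing tips for Rome.")
print(output[0])  # PromptNode returns a list of generated strings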
Hosting a vLLM Server
To create an OpenAI-compatible server via vLLM, you can follow the steps in the Quickstart section of their documentation.
Use a Model Hosted Locally
⚠️ To run vLLM locally, you need to have vllm installed and a supported GPU.
If you don't want to use an API server, this wrapper also provides a vLLMLocalInvocationLayer, which runs vLLM on the same node Haystack is running on. Here is a simple example of how a PromptNode can be created with the vLLMLocalInvocationLayer.
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMLocalInvocationLayer

model = PromptModel(model_name_or_path=MODEL, invocation_layer_class=vLLMLocalInvocationLayer, max_length=256, model_kwargs={  # replace MODEL with a Hugging Face model ID, e.g. "mistralai/Mistral-7B-Instruct-v0.1"
    "maximum_context_length": 2048,
})
prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)