Mixedbread 🤝 deepset: Announcing our New German/English Embedding Model
Learn about the new open-source German/English embedding model by deepset and Mixedbread
July 18, 2024

It's 2024, and yet most models today are still primarily geared towards English-speaking markets. Today, deepset and Mixedbread are jointly announcing our latest contribution towards changing that landscape: a new open-source German/English embedding model, deepset-mxbai-embed-de-large-v1.
Our model is based on intfloat/multilingual-e5-large and was fine-tuned on 30+ million pairs of German data for retrieval tasks. On the NDCG@10 metric, which compares the list of retrieval results against an ideally ordered list of expected results, our model not only sets a new standard for open-source German embedding models but is also competitive with commercial alternatives.
| Model | Avg. Performance (NDCG@10) | Binary Support | MRL Support |
|---|---|---|---|
| deepset-mxbai-embed-de-large-v1 | 51.7 | ✅ | ✅ |
| multilingual-e5-large | 50.5 | ❌ | ❌ |
| jina-embeddings-v2-base-de | 50.0 | ✅ | ❌ |
| *Commercial Models* | | | |
| Cohere Multilingual v3 | 52.4 | ✅ | - |
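To make the metric concrete, here is a minimal, simplified sketch of how NDCG@10 can be computed for a single query. It assumes binary relevance labels and, for brevity, normalises against the ideal ordering of the retrieved list only; a full evaluation would normalise against all relevant documents in the corpus:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank position
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved_relevances, k=10):
    # Normalise by the DCG of the ideally ordered (descending) result list
    ideal_dcg = dcg_at_k(sorted(retrieved_relevances, reverse=True), k)
    return dcg_at_k(retrieved_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: binary relevance labels for the top 5 retrieved documents
print(ndcg_at_k([1, 0, 1, 1, 0]))  # ~0.91: good, but not perfectly ordered
```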
Nvidia enabled this work by providing cutting-edge computational resources: all training and evaluation were done on an Nvidia DGX with 8x A100 GPUs. We are extremely grateful for their contribution to this project.
To learn more and take a deeper dive into benchmarks on real-world data, read our full announcement article with Mixedbread. You can find an overview of the benchmarks in this spreadsheet.
Storage and Inference Efficiency
Beyond support for the German language, we also focused on improving the storage and inference efficiency of this new embedding model using the following methods:
- **Matryoshka Representation Learning (MRL)**: Matryoshka representation learning reduces the number of output dimensions in an embedding model without significant accuracy loss. This is done by modifying the loss function to prioritise the representation of important information in the initial dimensions of the embedding vector, enabling the truncation of later dimensions.
- **Binary Quantization**: Binary quantization reduces the size of each dimension by converting float32 values to binary values, significantly enhancing memory and disk space efficiency while retaining high performance during inference.
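Both techniques can be tried directly with the sentence-transformers library. The sketch below assumes the model's full dimensionality of 1,024 (inherited from multilingual-e5-large) and truncates to 512 dimensions; `truncate_dim` and `quantize_embeddings` are general sentence-transformers features, not model-specific APIs:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# MRL: truncate_dim keeps only the first 512 of the model's 1,024 dimensions;
# MRL training front-loads the important information, so little accuracy is lost
model = SentenceTransformer(
    "mixedbread-ai/deepset-mxbai-embed-de-large-v1",
    truncate_dim=512,
)

embeddings = model.encode(["Berlin ist die Hauptstadt von Deutschland."])

# Binary quantization: each float32 dimension becomes a single bit,
# packed 8 per int8 value, for a 32x reduction in memory and disk usage
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

print(embeddings.shape)         # (1, 512)
print(binary_embeddings.shape)  # (1, 64)
```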
Start Using it With Haystack
You can start using deepset-mxbai-embed-de-large-v1 today with the `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` components in Haystack, as well as with the Mixedbread integration's `MixedbreadAIDocumentEmbedder` and `MixedbreadAITextEmbedder`:
Use it with the Sentence Transformers Embedders
```python
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)

text_embedder = SentenceTransformersTextEmbedder(model="mixedbread-ai/deepset-mxbai-embed-de-large-v1")
document_embedder = SentenceTransformersDocumentEmbedder(model="mixedbread-ai/deepset-mxbai-embed-de-large-v1")
```
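As a quick usage sketch (the query and document text are illustrative), the embedders follow the standard Haystack interface: call `warm_up()` to load the model, then `run()`:

```python
from haystack import Document

# Load the underlying model before embedding
text_embedder.warm_up()
document_embedder.warm_up()

# Embed a query string
query_embedding = text_embedder.run(text="Wo liegt Berlin?")["embedding"]

# Embed documents; the embedding is attached to each Document
docs = [Document(content="Berlin liegt im Nordosten Deutschlands.")]
embedded_docs = document_embedder.run(documents=docs)["documents"]

print(len(query_embedding))            # embedding dimensionality
print(embedded_docs[0].embedding[:3])  # first few values of the document embedding
```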
Use it with the Mixedbread Embedders
To start using this model with the Mixedbread integration for Haystack, install `mixedbread-ai-haystack` and export your Mixedbread API key to `MXBAI_API_KEY`:
```python
from mixedbread_ai_haystack import MixedbreadAITextEmbedder, MixedbreadAIDocumentEmbedder
from mixedbread_ai import EncodingFormat

text_embedder = MixedbreadAITextEmbedder(
    model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
    encoding_format=EncodingFormat.BINARY,
)
document_embedder = MixedbreadAIDocumentEmbedder(
    model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
    encoding_format=EncodingFormat.BINARY,
)
```
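As a quick sketch, and assuming the Mixedbread embedders expose the same `run()` interface as the other Haystack embedders (the example text is illustrative):

```python
from haystack import Document

# Embed a query and a document via the Mixedbread API; no warm_up() is
# needed here, since inference runs remotely rather than on a local model
query_embedding = text_embedder.run(text="Wo liegt Berlin?")["embedding"]
embedded_docs = document_embedder.run(
    documents=[Document(content="Berlin liegt im Nordosten Deutschlands.")]
)["documents"]
```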
We hope that, like our influential German BERT model, this state-of-the-art model will enable the German-speaking AI community to build innovative products in the field of retrieval-augmented generation (RAG) and beyond!
Join our Discord community to explore Haystack.