Speaker Diarization with AssemblyAI
Last Updated: September 24, 2024
📚 This cookbook has an accompanying article with a complete walkthrough: “Level up Your RAG Application with Speaker Diarization”
LLMs excel at text data: they can answer complex questions without requiring manual reading or searching. To use them with audio or video recordings, the key step is transcription. A transcription captures the spoken content of a recording, but in multi-speaker recordings it misses non-verbal information and doesn't convey how many speakers there are or who said what. Therefore, to maximize the LLM's potential with such recordings, Speaker Diarization is essential!
In this example, we’ll build a RAG application with speaker labels for audio files. This application will use Haystack and speaker diarization models by AssemblyAI.
Install the Dependencies
%%bash
pip install haystack-ai
pip install assemblyai-haystack
pip install "sentence-transformers>=3.0.0"
pip install "huggingface_hub>=0.23.0"
pip install --upgrade gdown
Download The Audio Files
We extracted the audio from YouTube videos and saved the files in a Google Drive folder for you: https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W?usp=drive_link
Run the code below to download the audio files into this Colab notebook; they will appear under the “Files” tab in the left bar.
!gdown https://drive.google.com/drive/folders/10zsFuHmj3oytYMyGrLdytpW-6JzT9T_W -O "/content" --folder
Retrieving folder contents
Processing file 12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ Netflix_Q4_2023_Earnings_Interview.mp3
Processing file 1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD Panel_Discussion.mp3
Processing file 1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m- Working_From_Home_Debate.mp3
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=12654ySXSYc2rZnPgNxXZwWt2hH-kTNDZ
To: /content/Netflix_Q4_2023_Earnings_Interview.mp3
100% 39.1M/39.1M [00:00<00:00, 67.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Zb15D_nrBzWlM3K8FuPOmyiCiYvsuJLD
To: /content/Panel_Discussion.mp3
100% 21.8M/21.8M [00:00<00:00, 60.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FFKGEZAUSmJayZgGaAe1uFP9HUtOK5m-
To: /content/Working_From_Home_Debate.mp3
100% 4.45M/4.45M [00:00<00:00, 34.8MB/s]
Download completed
Add Your API Keys
Enter the API keys from AssemblyAI and Hugging Face:
import os
from getpass import getpass
ASSEMBLYAI_API_KEY = getpass("Enter your ASSEMBLYAI_API_KEY: ")
os.environ["HF_API_TOKEN"] = getpass("HF_API_TOKEN: ")
Enter your ASSEMBLYAI_API_KEY: ··········
HF_API_TOKEN: ··········
Index Speaker Labels to Your DocumentStore
Build a pipeline to generate speaker labels and index them into a DocumentStore with their embeddings. In this pipeline, you need:
- InMemoryDocumentStore: to store your documents without external dependencies or extra setup
- AssemblyAITranscriber: to create speaker_labels for the given audio file and convert them into Haystack Documents
- DocumentSplitter: to split your documents into smaller chunks
- SentenceTransformersDocumentEmbedder: to create embeddings for each document using sentence-transformers models
- DocumentWriter: to write these documents into your document store
Note: The speaker information will be saved in the `meta` of the Document object.
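For illustration, a chunk produced by this pipeline might look like the following (the content is taken from the sample recording, but the structure shown here is a simplified sketch; the real `meta` also carries fields like the transcript ID):

from haystack import Document

# Illustrative only: the splitter outputs Documents whose meta carries the speaker label
chunk = Document(
    content="I want to start with you, Amy, because I know you...",
    meta={"speaker": "A"},
)
print(chunk.meta["speaker"])  # -> A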
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from assemblyai_haystack.transcriber import AssemblyAITranscriber
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice
# Store documents in memory; no external database needed
speaker_document_store = InMemoryDocumentStore()
# Transcribe the audio and produce speaker-labeled Documents
transcriber = AssemblyAITranscriber(api_key=ASSEMBLYAI_API_KEY)
# Split the transcript into chunks of 10 sentences with 1 sentence of overlap
speaker_splitter = DocumentSplitter(
    split_by="sentence",
    split_length=10,
    split_overlap=1,
)
# Embed each chunk with the default sentence-transformers model on GPU
speaker_embedder = SentenceTransformersDocumentEmbedder(device=ComponentDevice.from_str("cuda:0"))
# Skip documents that are already in the store
speaker_writer = DocumentWriter(speaker_document_store, policy=DuplicatePolicy.SKIP)
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=transcriber, name="transcriber")
indexing_pipeline.add_component(instance=speaker_splitter, name="speaker_splitter")
indexing_pipeline.add_component(instance=speaker_embedder, name="speaker_embedder")
indexing_pipeline.add_component(instance=speaker_writer, name="speaker_writer")
indexing_pipeline.connect("transcriber.speaker_labels", "speaker_splitter")
indexing_pipeline.connect("speaker_splitter", "speaker_embedder")
indexing_pipeline.connect("speaker_embedder", "speaker_writer")
Give an `audio_file_path` and run your pipeline:
audio_file_path = "/content/Panel_Discussion.mp3" #@param ["/content/Netflix_Q4_2023_Earnings_Interview.mp3", "/content/Working_From_Home_Debate.mp3", "/content/Panel_Discussion.mp3"]
indexing_pipeline.run(
{
"transcriber": {
"file_path": audio_file_path,
"summarization": None,
"speaker_labels": True
},
}
)
{'transcriber': {'transcription': [Document(id=427e56c68f0440dd8f51643ba52e2a2b60c739f4fc42ddab7207fb428da4492d, content: 'I want to start with you, Amy, because I know you, obviously at Shell have had AI as part of the wor...', meta: {'transcript_id': 'c053a806-6826-40ac-a6bc-95cab9b4cb8a', 'audio_url': 'https://cdn.assemblyai.com/upload/188cdd14-ff33-4468-81cb-e2c337674fc5'})]},
'speaker_writer': {'documents_written': 64}}
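To confirm that the speaker labels landed in the store, you can peek at a few of the indexed chunks (a quick check on top of the pipeline above, not part of the original recipe):

# Print the speaker label and the first 80 characters of a few stored chunks
for doc in speaker_document_store.filter_documents()[:3]:
    print(doc.meta.get("speaker"), "-", doc.content[:80])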
RAG Pipeline with Speaker Labels
Build a RAG pipeline to generate answers to questions about the recording. Ensure that speaker information (provided through the metadata of the document) is included in the prompt for the LLM to distinguish who said what. For this pipeline, you need:
- SentenceTransformersTextEmbedder: to create an embedding for the user query using sentence-transformers models
- InMemoryEmbeddingRetriever: to retrieve the `top_k` most relevant documents for the user query
- PromptBuilder: to provide a RAG prompt template with instructions, to be filled in with the retrieved documents and the user query
- HuggingFaceAPIGenerator: to run inference with models served through the Hugging Face free Serverless Inference API or Hugging Face TGI
The LLM in this example (`mistralai/Mixtral-8x7B-Instruct-v0.1`) is a gated model. Make sure you have access to it before running the pipeline.
from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice
prompt = """
You will be provided with a transcription of a recording with each sentence or group of sentences attributed to a Speaker by the word "Speaker" followed by a letter representing the person uttering that sentence. Answer the given question based on the given context.
If you think that given transcription is not enough to answer the question, say so.
Transcription:
{% for doc in documents %}
{% if doc.meta["speaker"] %} Speaker {{doc.meta["speaker"]}}: {% endif %}{{doc.content}}
{% endfor %}
Question: {{ question }}
<|end_of_turn|>
Answer:
"""
# Retrieve relevant chunks from the store populated during indexing
retriever = InMemoryEmbeddingRetriever(speaker_document_store)
# Embed the query with the same model used for the documents
text_embedder = SentenceTransformersTextEmbedder(device=ComponentDevice.from_str("cuda:0"))
# Generate the answer with Mixtral via the Serverless Inference API
answer_generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "mistralai/Mixtral-8x7B-Instruct-v0.1"},
    generation_kwargs={"max_new_tokens": 500},
)
prompt_builder = PromptBuilder(template=prompt)
speaker_rag_pipe = Pipeline()
speaker_rag_pipe.add_component("text_embedder", text_embedder)
speaker_rag_pipe.add_component("retriever", retriever)
speaker_rag_pipe.add_component("prompt_builder", prompt_builder)
speaker_rag_pipe.add_component("llm", answer_generator)
speaker_rag_pipe.connect("text_embedder.embedding", "retriever.query_embedding")
speaker_rag_pipe.connect("retriever.documents", "prompt_builder.documents")
speaker_rag_pipe.connect("prompt_builder.prompt", "llm.prompt")
Test RAG with Speaker Labels
question = "What are each speakers' opinions on building in-house or using third parties?" # @param ["What are the two opposing opinions and how many people are on each side?", "What are each speakers' opinions on building in-house or using third parties?", "How many people are speaking in this recording?" ,"How many speakers and moderators are in this call?"]
result = speaker_rag_pipe.run({
"prompt_builder":{"question": question},
"text_embedder":{"text": question},
"retriever":{"top_k": 10}
})
result["llm"]["replies"][0]
" Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker B believes that the decision depends on whether the task is part of the company's core IP or not. They also mention that the build versus buy decision is too simplistic, as there are other options like partnering or using third-party platforms. Speaker C takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker B thinks that AI is not magic and requires hard work, process, and change management, just like any other business process."