Building a Local Voice Assistant: Whisper + Ollama + Bark

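Today we go a step further: beyond chatting with a large language model, we give it the ability to listen and speak. The idea is simple: we will build a voice assistant, reminiscent of Jarvis or Friday from the iconic Iron Man movies, that runs offline on your computer. Since this is an introductory tutorial, I will implement it in Python and keep it simple enough for beginners. At the end, I will offer some guidance on how to extend the application.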

What Technologies Do We Need?

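First, set up a virtual Python environment. There are several options for this, including pyenv, virtualenv, Poetry, and other tools that serve a similar purpose. I will use Poetry in this tutorial out of personal preference. Here are the key libraries you need to install: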

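rich: for visually appealing console output.
openai-whisper: a robust tool for speech-to-text conversion.
suno-bark: a cutting-edge library for text-to-speech synthesis with high-quality audio output.
langchain: a straightforward library for interfacing with large language models (LLMs).
sounddevice, pyaudio, and speechrecognition: essential for audio recording and playback.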
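
Before going further, it can help to confirm that everything is importable. Here is a minimal sketch of such a check. The mapping from install names to import names is my assumption (for example, openai-whisper is imported as whisper), and I have added torch, transformers, nltk, and numpy because the code later in this post imports them. The helper name find_missing is hypothetical.

```python
import importlib.util

# Install name -> import name. These import names are assumptions; some differ
# from the name used at install time (e.g. openai-whisper -> whisper).
REQUIRED_MODULES = {
    "rich": "rich",
    "openai-whisper": "whisper",
    "suno-bark": "bark",
    "langchain": "langchain",
    "sounddevice": "sounddevice",
    "pyaudio": "pyaudio",
    "speechrecognition": "speech_recognition",
    # Not in the list above, but imported by the code later in this post.
    "torch": "torch",
    "transformers": "transformers",
    "nltk": "nltk",
    "numpy": "numpy",
}


def find_missing(modules: dict[str, str]) -> list[str]:
    """Return the install names whose import module cannot be found."""
    missing = []
    for package, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

The check only looks for the modules without importing them, so it stays fast even though some of these libraries are heavy to load.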

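The most critical component here is the large language model (LLM) backend, for which we will use Ollama. Ollama is widely regarded as a popular tool for running and serving LLMs offline.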

System Architecture

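Okay, if everything is set up, let's move on to the next step. Below is the overall architecture of our application, which consists of three main components: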

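Speech recognition: using OpenAI's Whisper, we convert spoken language to text. Whisper's training on diverse datasets makes it proficient across many languages and dialects.
Conversational chain: for the conversational capabilities, we use the Langchain interface to the Llama-2 model, which is served by Ollama. This setup is meant to keep the conversation flowing smoothly.
Speech synthesizer: text-to-speech conversion is handled by Bark, a state-of-the-art model from Suno AI known for its lifelike speech generation.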

The workflow is straightforward: record speech, transcribe it to text, generate a response with the LLM, and vocalize the response with Bark.
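
To make that flow concrete, here is a minimal sketch of a single turn with the three stages passed in as plain functions. The name handle_turn and the fake_* stand-ins are hypothetical, chosen only so the example runs without any models installed. In the real application the stages would be Whisper, the Llama-2 chain, and Bark.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> tuple[str, str, bytes]:
    """Run one recorded utterance through the three stages in order."""
    text = transcribe(audio)      # speech -> text (Whisper in this post)
    reply = respond(text)         # text -> reply (Llama-2 served by Ollama)
    speech = synthesize(reply)    # reply -> audio (Bark)
    return text, reply, speech


# Stand-in stages (hypothetical) so the sketch runs on its own.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(reply: str) -> bytes:
    return reply.encode("utf-8")


text, reply, speech = handle_turn(
    b"hello", fake_transcribe, fake_respond, fake_synthesize
)
print(text, "|", reply)
```

Keeping each stage behind a simple function boundary also makes it easier to swap a model later without touching the rest of the loop.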

Implementation

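The implementation begins with a TextToSpeechService based on Bark. It has one method that synthesizes speech from text and another that handles longer text inputs. Save it as tts.py, since the main script imports it with from tts import TextToSpeechService: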

 import nltk
 import torch
 import warnings
 import numpy as np
 from transformers import AutoProcessor, BarkModel
 
 warnings.filterwarnings(
     "ignore",
     message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
 )
 
 
 class TextToSpeechService:
     def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
         """
         Initializes the TextToSpeechService class.
 
         Args:
             device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
             Defaults to "cuda" if available, otherwise "cpu".
         """
         self.device = device
         self.processor = AutoProcessor.from_pretrained("suno/bark-small")
         self.model = BarkModel.from_pretrained("suno/bark-small")
         self.model.to(self.device)
 
     def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
         """
         Synthesizes audio from the given text using the specified voice preset.
 
         Args:
             text (str): The input text to be synthesized.
             voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".
 
         Returns:
             tuple: A tuple containing the sample rate and the generated audio array.
         """
         inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
         inputs = {k: v.to(self.device) for k, v in inputs.items()}
 
         with torch.no_grad():
             audio_array = self.model.generate(**inputs, pad_token_id=10000)
 
         audio_array = audio_array.cpu().numpy().squeeze()
         sample_rate = self.model.generation_config.sample_rate
         return sample_rate, audio_array
 
     def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
         """
         Synthesizes audio from the given long-form text using the specified voice preset.
 
         Args:
             text (str): The input text to be synthesized.
             voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".
 
         Returns:
             tuple: A tuple containing the sample rate and the generated audio array.
         """
         pieces = []
         sentences = nltk.sent_tokenize(text)
         silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))
 
         for sent in sentences:
             sample_rate, audio_array = self.synthesize(sent, voice_preset)
             pieces += [audio_array, silence.copy()]
 
         return self.model.generation_config.sample_rate, np.concatenate(pieces)

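Initialization (__init__): the class takes an optional device parameter, which specifies the device used for the model (cuda if a GPU is available, otherwise cpu). It loads the Bark model and the corresponding processor from the suno/bark-small pretrained model. You can also use the larger version by specifying suno/bark in the model loader.
Synthesis (synthesize): this method takes a text input and a voice_preset parameter, which specifies the voice used for synthesis. Bark offers other voice presets that you can try as well. The method uses the processor to prepare the input text and voice preset, then generates the audio array with the model.generate() method. The generated audio array is converted to a NumPy array, and the sample rate is returned together with the audio array.
Long-form synthesis (long_form_synthesize): this method synthesizes longer text inputs. It first splits the input text into sentences with the nltk.sent_tokenize function. For each sentence, it calls the synthesize method to generate an audio array. It then concatenates the generated audio arrays, adding a short silence (0.25 seconds) after each sentence.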
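
The stitching logic is easier to see without the model in the way. The following is a simplified sketch, not the class above: split_sentences is a naive regex stand-in for nltk.sent_tokenize, fake_synthesize is a hypothetical synthesizer that returns a constant signal, and the 24 kHz sample rate is an assumed value used only for illustration. One practical note: nltk.sent_tokenize needs NLTK's Punkt tokenizer data, which is typically fetched once with nltk.download("punkt") (or "punkt_tab" on recent NLTK releases).

```python
import re

SAMPLE_RATE = 24_000   # assumed value, for illustration only
PAUSE_SECONDS = 0.25   # same pause length as long_form_synthesize


def split_sentences(text: str) -> list[str]:
    """Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_synthesize(sentence: str) -> list[float]:
    """Hypothetical synthesizer: 0.01 s of constant signal per character."""
    return [0.1] * (len(sentence) * SAMPLE_RATE // 100)


def stitch(text: str) -> list[float]:
    """Synthesize sentence by sentence and append a pause after each one."""
    silence = [0.0] * int(PAUSE_SECONDS * SAMPLE_RATE)
    audio: list[float] = []
    for sentence in split_sentences(text):
        audio.extend(fake_synthesize(sentence))
        audio.extend(silence)
    return audio


audio = stitch("Hello there. How are you?")
print(f"{len(audio) / SAMPLE_RATE:.2f} seconds of audio")
```

Working sentence by sentence keeps each model call short, which is the reason the long-form method exists in the first place.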

Now that the TextToSpeechService is set up, we need to prepare the Ollama server to serve the large language model (LLM). To do this, follow these steps:

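Pull the latest Llama-2 model: run the following command to download the latest Llama-2 model from the Ollama repository: ollama pull llama2.
Start the Ollama server: if the server is not already running, start it with: ollama serve.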
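
If you are unsure whether the server is up, a quick check from Python can save some debugging. This is a minimal sketch that assumes Ollama is listening on its default address, http://localhost:11434. Adjust the URL if your setup differs. The function name ollama_is_running is hypothetical.

```python
import http.client
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def ollama_is_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and urllib's URLError.
        return False


if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama server is reachable.")
    else:
        print("Ollama server not reachable; try running `ollama serve`.")
```

Note that this only confirms that something answers at that address. It does not verify that the llama2 model has been pulled.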

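Once these steps are complete, your application will be able to use the Ollama server and the Llama-2 model to generate responses to user input. Next, we turn to the main application logic. First, we need to initialize the following components: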

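Rich Console: we will use the Rich library to give the user a nicer interactive console in the terminal.
Whisper speech-to-text: we will initialize a Whisper speech recognition model, a state-of-the-art open-source speech recognition system developed by OpenAI. We will use the base English model (base.en) to transcribe user input.
Bark text-to-speech: we will initialize an instance of the Bark text-to-speech synthesizer implemented above.
Conversation chain: we will use the built-in ConversationChain from the Langchain library, which provides a template for managing the conversation flow. We will configure it to use the Llama-2 language model with the Ollama backend.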

 import time
 import threading
 import numpy as np
 import whisper
 import sounddevice as sd
 from queue import Queue
 from rich.console import Console
 from langchain.memory import ConversationBufferMemory
 from langchain.chains import ConversationChain
 from langchain.prompts import PromptTemplate
 from langchain_community.llms import Ollama
 from tts import TextToSpeechService
 
 console = Console()
 stt = whisper.load_model("base.en")
 tts = TextToSpeechService()
 
 template = """
 You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less 
 than 20 words.
 
 The conversation transcript is as follows:
 {history}
 
 And here is the user's follow-up: {input}
 
 Your response:
 """
 PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
 chain = ConversationChain(
     prompt=PROMPT,
     verbose=False,
     memory=ConversationBufferMemory(ai_prefix="Assistant:"),
     llm=Ollama(),
 )
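
To see what the memory and the prompt template contribute, here is a simplified stand-in written in plain Python. It is not Langchain's implementation: BufferMemory and build_prompt are hypothetical names, and the sketch only illustrates the idea that every turn is appended to a transcript that fills the {history} slot, while the new user text fills {input}.

```python
TEMPLATE = (
    "The conversation transcript is as follows:\n"
    "{history}\n\n"
    "And here is the user's follow-up: {input}\n\n"
    "Your response:"
)


class BufferMemory:
    """Simplified stand-in for a conversation buffer: stores every turn."""

    def __init__(self, human_prefix: str = "Human", ai_prefix: str = "Assistant"):
        self.human_prefix = human_prefix
        self.ai_prefix = ai_prefix
        self.turns: list[str] = []

    def save(self, user_input: str, ai_output: str) -> None:
        """Append one exchange to the transcript."""
        self.turns.append(f"{self.human_prefix}: {user_input}")
        self.turns.append(f"{self.ai_prefix}: {ai_output}")

    def history(self) -> str:
        """Return the transcript as a single newline-separated string."""
        return "\n".join(self.turns)


def build_prompt(memory: BufferMemory, user_input: str) -> str:
    """Fill the template with the stored history and the new input."""
    return TEMPLATE.format(history=memory.history(), input=user_input)


memory = BufferMemory()
memory.save("Hi, who are you?", "I'm your local voice assistant.")
prompt = build_prompt(memory, "What can you do?")
print(prompt)
```

Because the whole transcript is resent on every turn, the prompt grows with the conversation, which is worth keeping in mind for long sessions.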

Now, let's define the necessary functions:

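record_audio: this function runs in a separate thread and captures audio data from the user's microphone using sounddevice.RawInputStream. The callback function is invoked whenever new audio data is available, and it puts the data into data_queue for further processing.
transcribe: this function uses the Whisper instance to transcribe the audio data from data_queue into text.
get_llm_response: this function feeds the current conversation context to the Llama-2 language model (via the Langchain ConversationChain) and retrieves the generated text response.
play_audio: this function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user with a sound playback library (in this case, sounddevice).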

 def record_audio(stop_event, data_queue):
     """
     Captures audio data from the user's microphone and adds it to a queue for further processing.
 
     Args:
         stop_event (threading.Event): An event that, when set, signals the function to stop recording.
         data_queue (queue.Queue): A queue to which the recorded audio data will be added.
 
     Returns:
         None
     """
     def callback(indata, frames, time, status):
         if status:
             console.print(status)
         data_queue.put(bytes(indata))
 
     with sd.RawInputStream(
         samplerate=16000, dtype="int16", channels=1, callback=callback
     ):
         while not stop_event.is_set():
             time.sleep(0.1)
 
 def transcribe(audio_np: np.ndarray) -> str:
     """
     Transcribes the given audio data using the Whisper speech recognition model.
 
     Args:
         audio_np (numpy.ndarray): The audio data to be transcribed.
 
     Returns:
         str: The transcribed text.
     """
     result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
     text = result["text"].strip()
     return text
 
 def get_llm_response(text: str) -> str:
     """
     Generates a response to the given text using the Llama-2 language model.
 
     Args:
         text (str): The input text to be processed.
 
     Returns:
         str: The generated response.
     """
     response = chain.predict(input=text)
     if response.startswith("Assistant:"):
         response = response[len("Assistant:") :].strip()
     return response
 
 def play_audio(sample_rate, audio_array):
     """
     Plays the given audio data using the sounddevice library.
 
     Args:
         sample_rate (int): The sample rate of the audio data.
         audio_array (numpy.ndarray): The audio data to be played.
 
     Returns:
         None
     """
     sd.play(audio_array, sample_rate)
     sd.wait()
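
One detail sits between these functions: record_audio queues raw 16-bit PCM bytes, while transcribe expects an array of floating-point samples. Here is a minimal sketch of that conversion using only the standard library. The function name pcm16_to_floats is hypothetical, and the sketch assumes the bytes are in the machine's native byte order, as they would be when recorded on the same computer. With NumPy, which the rest of this post uses, the equivalent would be np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0.

```python
from array import array


def pcm16_to_floats(raw: bytes) -> list[float]:
    """Convert raw 16-bit PCM bytes into floats in the range [-1.0, 1.0)."""
    samples = array("h")      # "h" = signed 16-bit integers, native byte order
    samples.frombytes(raw)    # raises ValueError if len(raw) is odd
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
    return [sample / 32768.0 for sample in samples]


raw = array("h", [0, 16384, -32768]).tobytes()
print(pcm16_to_floats(raw))
```

The scaling matters because speech models generally expect normalized audio rather than raw integer amplitudes.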

Then we define the main application loop, which guides the user through the conversational interaction as follows:

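The user is prompted to press Enter to start recording their input.
Once the user presses Enter, the record_audio function is called in a separate thread to capture the user's audio input.
When the user presses Enter again to stop recording, the audio data is transcribed with the transcribe function.
The transcribed text is then passed to the get_llm_response function, which generates a response with the Llama-2 language model.
The generated response is printed to the console and played back to the user with the play_audio function.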
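
The start and stop handling is the least obvious part of these steps, so here is a minimal sketch of it in isolation. It is not the full loop: record_until_stopped and fake_record are hypothetical names, and the fake recorder stands in for record_audio so the example runs without a microphone. In the real loop, wait_for_stop would be input(), and the returned bytes would go on to transcribe, get_llm_response, tts.long_form_synthesize, and play_audio.

```python
import threading
import time
from queue import Queue
from typing import Callable


def record_until_stopped(
    record: Callable[[threading.Event, Queue], None],
    wait_for_stop: Callable[[], None],
) -> bytes:
    """Record in a background thread until wait_for_stop() returns."""
    data_queue: Queue = Queue()
    stop_event = threading.Event()
    worker = threading.Thread(target=record, args=(stop_event, data_queue))
    worker.start()

    wait_for_stop()     # e.g. input(), to block until the user presses Enter
    stop_event.set()    # ask the recorder to finish
    worker.join()       # wait until it has actually stopped

    # The worker has exited, so draining the queue here is race-free.
    chunks = []
    while not data_queue.empty():
        chunks.append(data_queue.get())
    return b"".join(chunks)


def fake_record(stop_event: threading.Event, data_queue: Queue) -> None:
    """Hypothetical recorder: emits two bytes roughly every 10 ms."""
    while True:
        data_queue.put(b"\x00\x01")
        if stop_event.wait(timeout=0.01):
            break


audio = record_until_stopped(fake_record, lambda: time.sleep(0.05))
print(f"captured {len(audio)} bytes")
```

Joining the thread before reading the queue guarantees that no audio chunk arrives after the data has been collected.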

Results

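Once everything is put together, we can run the application. It runs rather slowly on my MacBook because the Bark model is large, even in its smaller version. For those with a CUDA-capable computer, it may run faster. Here are the main features of our application: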

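Voice-based interaction: the user can start and stop recording voice input, and the assistant responds by playing back generated audio.
Conversational context: the assistant keeps the context of the conversation, which allows for more coherent and relevant responses. Using the Llama-2 language model lets the assistant give concise, focused answers.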

For those who want to take this application to a production-ready state, the following enhancements are recommended:

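Performance optimization: incorporate optimized versions of the models, such as whisper.cpp, llama.cpp, and bark.cpp, which are designed to improve performance, especially on lower-end computers.
Customizable bot prompts: implement a system that lets users customize the bot's persona and prompt, so they can create different kinds of assistants (for example personal, professional, or domain-specific).
Graphical user interface (GUI): develop a user-friendly GUI to improve the overall user experience, making the application more accessible and more visually appealing.
Multimodal capabilities: extend the application to support multimodal interaction, such as generating and displaying images, charts, or other visual content in addition to voice-based responses.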
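
As one possible starting point for the customizable prompts idea, here is a small sketch that builds a prompt template per persona. The persona names, their descriptions, and the helper make_template are all hypothetical. The doubled braces keep {history} and {input} as literal placeholders in the result, so the output has the same shape as the template string used earlier in this post.

```python
BASE_TEMPLATE = (
    "You are {persona}. Aim to provide concise responses of less than "
    "{max_words} words.\n\n"
    "The conversation transcript is as follows:\n{{history}}\n\n"
    "And here is the user's follow-up: {{input}}\n\n"
    "Your response:"
)

# Hypothetical personas, for illustration only.
PERSONAS = {
    "personal": "a helpful and friendly AI assistant",
    "professional": "a formal, precise assistant for workplace questions",
    "cooking": "an assistant that specializes in recipes and kitchen tips",
}


def make_template(persona_key: str, max_words: int = 20) -> str:
    """Return a prompt template for the chosen persona."""
    if persona_key not in PERSONAS:
        known = ", ".join(sorted(PERSONAS))
        raise ValueError(f"Unknown persona {persona_key!r}; choose from: {known}")
    return BASE_TEMPLATE.format(persona=PERSONAS[persona_key], max_words=max_words)


template = make_template("cooking")
print(template)
```

The resulting string could then be passed as the template argument when constructing the prompt, in place of the fixed text.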

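And with that, our simple voice assistant application is complete. The combination of speech recognition, language modeling, and text-to-speech shows how we can build something that sounds difficult but can actually run on your own computer. Enjoy coding, and don't forget to subscribe to my blog so you don't miss the latest articles on AI and programming.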

September 27, 2024

小喵吧
