In this blog post, we will explore integrating VOSK, an open-source speech recognition toolkit, with the custom audio sink we previously created. If you're new to custom audio sinks in LiveSwitch, we recommend first checking out our guide on creating custom audio sinks, since this article assumes a basic understanding of them.
Quick Reminder: Media Pipeline
Before we dive into the code, let's have a quick reminder of the media pipeline structure: incoming audio flows from the De-Packetizer to the Decoder, through the SoundConverter, and finally into our custom sink, where each frame arrives via DoProcessFrame.
Code Implementation
Below is the complete implementation for integrating VOSK into the custom audio sink:
module VoskSink

open System
open FM.LiveSwitch
open Vosk

type resultType = {
    conf : float
    ``end`` : float
    start : float
    word : string
}

type VoskResult = {
    result : resultType array
    text : string
}

type VoskSink =
    inherit AudioSink

    new (model: Model) = {
        inherit AudioSink(new Pcm.Format(16000, 1))
        voskRecognizer = new VoskRecognizer(model, 16000f)
        textEvent = new Event<string>()
    }

    val voskRecognizer : VoskRecognizer
    val mutable textEvent : Event<string>

    member this.OnTextEvent = this.textEvent.Publish
    member this.RaiseTextEvent e = this.textEvent.Trigger e

    member this.GetResultFromJson (json : string) : VoskResult =
        System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

    override this.Label : string = "Vosk Audio Transcriber"

    override this.DoDestroy () =
        let res = this.GetResultFromJson (this.voskRecognizer.Result())
        this.RaiseTextEvent res.text
        this.voskRecognizer.Dispose()
        ()

    override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
        let mutable result = false
        let dataBuf = buf.DataBuffer
        if dataBuf.Index = 0 then
            result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
        else
            let data = dataBuf.ToArray()
            result <- this.voskRecognizer.AcceptWaveform(data, data.Length)
        if result then
            let res = this.GetResultFromJson (this.voskRecognizer.Result())
            if not (String.IsNullOrWhiteSpace(res.text)) then
                this.RaiseTextEvent res.text
Code Breakdown
Let's break down the code and understand each section:
type VoskSink =
    inherit AudioSink

    new (model: Model) = {
        inherit AudioSink(new Pcm.Format(16000, 1))
        voskRecognizer = new VoskRecognizer(model, 16000f)
        textEvent = new Event<string>()
    }
Just like our original custom audio sink, the VoskSink expects to receive 16000 Hz mono PCM audio, which aligns with VOSK's requirements. The constructor initializes the VoskRecognizer when the sink is created and wires up the event used to publish transcribed text.
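As a quick sketch of how the sink might be constructed, here is a minimal usage example. The model path is hypothetical (VOSK models are downloaded separately), and how you wire the sink into your LiveSwitch pipeline depends on your application, as covered in the earlier custom audio sink post:

open Vosk

// Load a VOSK model from disk -- this path is a placeholder.
let model = new Model("path/to/vosk-model")

// Create the sink; it will transcribe any 16 kHz mono PCM fed to it.
let sink = new VoskSink.VoskSink(model)

// Surface transcriptions as they arrive.
sink.OnTextEvent.Add(fun text -> printfn "Transcript: %s" text)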
val voskRecognizer : VoskRecognizer
val mutable textEvent : Event<string>

member this.OnTextEvent = this.textEvent.Publish
member this.RaiseTextEvent e = this.textEvent.Trigger e

member this.GetResultFromJson (json : string) : VoskResult =
    System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

override this.Label : string = "Vosk Audio Transcriber"
Here, we declare the fields and events the sink uses. The GetResultFromJson function deserializes the JSON returned by VOSK into our VoskResult record. As before, we override the Label property to give our audio sink a descriptive name.
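To make the shape of VOSK's output concrete, here is a small, self-contained sketch of the JSON that GetResultFromJson deserializes. The sample payload and its values are illustrative, and the record types are repeated from the sink above so the snippet stands alone:

open System.Text.Json

// Record types repeated from the sink for a self-contained example.
type resultType = {
    conf : float
    ``end`` : float
    start : float
    word : string
}

type VoskResult = {
    result : resultType array
    text : string
}

// An illustrative VOSK result: per-word confidence and timings,
// plus the full recognized text.
let json = """{"result":[{"conf":0.98,"end":1.02,"start":0.54,"word":"hello"}],"text":"hello"}"""

let res = JsonSerializer.Deserialize<VoskResult> json
printfn "%s" res.text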
override this.DoDestroy () =
    let res = this.GetResultFromJson (this.voskRecognizer.Result())
    this.RaiseTextEvent res.text
    this.voskRecognizer.Dispose()
    ()
In the previous implementation, we didn't have any cleanup to perform when the audio sink was destroyed (e.g. when a user disconnects). However, in this case, we want to clean up the VoskRecognizer and send the final text to other clients. This code snippet handles that cleanup process.
override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
    let mutable result = false
    let dataBuf = buf.DataBuffer
    if dataBuf.Index = 0 then
        result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
    else
        let data = dataBuf.ToArray()
        result <- this.voskRecognizer.AcceptWaveform(data, data.Length)
    if result then
        let res = this.GetResultFromJson (this.voskRecognizer.Result())
        if not (String.IsNullOrWhiteSpace(res.text)) then
            this.RaiseTextEvent res.text
The DoProcessFrame method is where each audio frame ends up after passing through the audio pipeline (De-Packetizer, Decoder, SoundConverter, and finally our sink). Here, we extract the raw byte array from the audio buffer via its DataBuffer property. That data buffer can either be a standalone byte[] or a slice of a DataBufferPool, in which case the offset and length matter. Fortunately, calling ToArray on a pooled buffer handles this for us and returns just the bytes for the current buffer. We then feed the byte array to VOSK and, whenever a complete result is ready, retrieve the transcribed text. It's as simple as that!
This approach isn't limited to VOSK; it can be applied to any audio filter or processing library. I have used a similar technique with Microsoft's Speech-to-Text API by prepending WAV headers to the raw data stream.
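As a sketch of that WAV-header trick, here is one way to build the standard 44-byte RIFF/WAV header for the 16 kHz, mono, 16-bit PCM our sink receives. The helper name is mine; the field layout follows the standard RIFF format, and BinaryWriter conveniently writes the little-endian values WAV expects:

open System.IO

// Build a 44-byte RIFF/WAV header for 16 kHz, mono, 16-bit PCM.
// pcmLength is the size in bytes of the raw PCM data that follows.
let wavHeader (pcmLength: int) : byte[] =
    let sampleRate = 16000
    let channels = 1
    let bitsPerSample = 16
    let byteRate = sampleRate * channels * bitsPerSample / 8
    let blockAlign = channels * bitsPerSample / 8
    use ms = new MemoryStream()
    use w = new BinaryWriter(ms)
    w.Write("RIFF"B)                // RIFF chunk ID
    w.Write(36 + pcmLength)         // chunk size = 36 + data bytes
    w.Write("WAVE"B)
    w.Write("fmt "B)                // format sub-chunk
    w.Write(16)                     // sub-chunk size for PCM
    w.Write(1us)                    // audio format 1 = uncompressed PCM
    w.Write(uint16 channels)
    w.Write(sampleRate)
    w.Write(byteRate)
    w.Write(uint16 blockAlign)
    w.Write(uint16 bitsPerSample)
    w.Write("data"B)                // data sub-chunk
    w.Write(pcmLength)              // size of the PCM payload
    w.Flush()
    ms.ToArray()

Prepend the returned header to the raw bytes from DoProcessFrame and you have a valid in-memory WAV stream to hand to APIs that refuse headerless PCM.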
I hope this guide helps you integrate other exciting projects directly into LiveSwitch. You can find the complete working project on GitHub.
Need assistance in architecting the perfect WebRTC application? Let our team help out! Get in touch with us today!