In this blog post, we will explore integrating VOSK, an open-source speech recognition toolkit, with the custom audio sink we previously created. If you're new to custom audio sinks in LiveSwitch, we recommend first checking out our guide on creating custom audio sinks, since this article assumes a basic understanding of them.
Quick Reminder: Media Pipeline
Before we dive into the code, let's have a quick reminder of the media pipeline structure: incoming audio flows from the De-Packetizer to the Decoder, through the SoundConverter, and finally into our custom sink, where each frame arrives via DoProcessFrame.
Code Implementation
Below is the complete implementation for integrating VOSK into the custom audio sink:
module VoskSink

open System
open FM.LiveSwitch
open Vosk

type resultType = {
    conf : float
    ``end`` : float
    start : float
    word : string
}

type VoskResult = {
    result : resultType array
    text : string
}

type VoskSink =
    inherit AudioSink

    new (model: Model) = {
        inherit AudioSink(new Pcm.Format(16000, 1))
        voskRecognizer = new VoskRecognizer(model, 16000f)
        textEvent = new Event<string>()
    }

    val voskRecognizer : VoskRecognizer
    val mutable textEvent : Event<string>

    member this.OnTextEvent = this.textEvent.Publish
    member this.RaiseTextEvent e = this.textEvent.Trigger e

    member this.GetResultFromJson (json : string) : VoskResult =
        System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

    override this.Label : string = "Vosk Audio Transcriber"

    override this.DoDestroy () =
        let res = this.GetResultFromJson (this.voskRecognizer.Result())
        this.RaiseTextEvent res.text
        this.voskRecognizer.Dispose()
        ()

    override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
        let mutable result = false
        let dataBuf = buf.DataBuffer
        if dataBuf.Index = 0 then
            result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
        else
            let data = dataBuf.ToArray()
            result <- this.voskRecognizer.AcceptWaveform(data, data.Length)
        if result then
            let res = this.GetResultFromJson (this.voskRecognizer.Result())
            if not (String.IsNullOrWhiteSpace(res.text)) then
                this.RaiseTextEvent res.text
Code Breakdown
Let's break down the code and understand each section:
type VoskSink =
    inherit AudioSink

    new (model: Model) = {
        inherit AudioSink(new Pcm.Format(16000, 1))
        voskRecognizer = new VoskRecognizer(model, 16000f)
        textEvent = new Event<string>()
    }
Just like our original custom audio sink, the VoskSink expects to receive 16000 Hz mono PCM audio, which aligns with VOSK's requirements. The constructor initializes the VoskRecognizer when the sink is created and wires up the event used to publish transcribed text.
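As a quick sketch of how the sink might be constructed, here is a minimal usage example. The model path is hypothetical (VOSK models are downloaded separately), and how you wire the sink into your LiveSwitch pipeline depends on your application, as covered in the earlier custom audio sink post:

open Vosk

// Load a VOSK model from disk -- this path is a placeholder.
let model = new Model("path/to/vosk-model")

// Create the sink; it will transcribe any 16 kHz mono PCM fed to it.
let sink = new VoskSink.VoskSink(model)

// Surface transcriptions as they arrive.
sink.OnTextEvent.Add(fun text -> printfn "Transcript: %s" text)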
val voskRecognizer : VoskRecognizer
val mutable textEvent : Event<string>

member this.OnTextEvent = this.textEvent.Publish
member this.RaiseTextEvent e = this.textEvent.Trigger e

member this.GetResultFromJson (json : string) : VoskResult =
    System.Text.Json.JsonSerializer.Deserialize<VoskResult> json

override this.Label : string = "Vosk Audio Transcriber"
Here, we declare the fields and events the sink uses. The GetResultFromJson function deserializes the JSON returned by VOSK into our VoskResult record. As before, we override the Label property to give our audio sink a descriptive name.
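To make the shape of VOSK's output concrete, here is a small, self-contained sketch of the JSON that GetResultFromJson deserializes. The sample payload and its values are illustrative, and the record types are repeated from the sink above so the snippet stands alone:

open System.Text.Json

// Record types repeated from the sink for a self-contained example.
type resultType = {
    conf : float
    ``end`` : float
    start : float
    word : string
}

type VoskResult = {
    result : resultType array
    text : string
}

// An illustrative VOSK result: per-word confidence and timings,
// plus the full recognized text.
let json = """{"result":[{"conf":0.98,"end":1.02,"start":0.54,"word":"hello"}],"text":"hello"}"""

let res = JsonSerializer.Deserialize<VoskResult> json
printfn "%s" res.text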
override this.DoDestroy () =
    let res = this.GetResultFromJson (this.voskRecognizer.Result())
    this.RaiseTextEvent res.text
    this.voskRecognizer.Dispose()
    ()
In the previous implementation, we didn't have any cleanup to perform when the audio sink was destroyed (e.g. when a user disconnects). However, in this case, we want to clean up the VoskRecognizer and send the final text to other clients. This code snippet handles that cleanup process.
override this.DoProcessFrame (frame: AudioFrame, buf: AudioBuffer) =
    let mutable result = false
    let dataBuf = buf.DataBuffer
    if dataBuf.Index = 0 then
        result <- this.voskRecognizer.AcceptWaveform(dataBuf.Data, dataBuf.Length)
    else
        let data = dataBuf.ToArray()
        result <- this.voskRecognizer.AcceptWaveform(data, data.Length)
    if result then
        let res = this.GetResultFromJson (this.voskRecognizer.Result())
        if not (String.IsNullOrWhiteSpace(res.text)) then
            this.RaiseTextEvent res.text
The DoProcessFrame method is where each audio frame ends up after passing through the audio pipeline (De-Packetizer, Decoder, SoundConverter, and finally our sink). Here, we extract the raw byte array from the audio buffer via its DataBuffer property. That data buffer can either be a standalone byte[] or a slice of a DataBufferPool, in which case the offset and length matter. Fortunately, calling ToArray on a pooled buffer handles this for us and returns just the bytes for the current buffer. We then feed the byte array to VOSK and, whenever a complete result is ready, retrieve the transcribed text. It's as simple as that!
This approach isn't limited to VOSK; it can be applied to any audio filter or processing library. I have used a similar technique with Microsoft's Speech-to-Text API by prepending WAV headers to the raw data stream.
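As a sketch of that WAV-header trick, here is one way to build the standard 44-byte RIFF/WAV header for the 16 kHz, mono, 16-bit PCM our sink receives. The helper name is mine; the field layout follows the standard RIFF format, and BinaryWriter conveniently writes the little-endian values WAV expects:

open System.IO

// Build a 44-byte RIFF/WAV header for 16 kHz, mono, 16-bit PCM.
// pcmLength is the size in bytes of the raw PCM data that follows.
let wavHeader (pcmLength: int) : byte[] =
    let sampleRate = 16000
    let channels = 1
    let bitsPerSample = 16
    let byteRate = sampleRate * channels * bitsPerSample / 8
    let blockAlign = channels * bitsPerSample / 8
    use ms = new MemoryStream()
    use w = new BinaryWriter(ms)
    w.Write("RIFF"B)                // RIFF chunk ID
    w.Write(36 + pcmLength)         // chunk size = 36 + data bytes
    w.Write("WAVE"B)
    w.Write("fmt "B)                // format sub-chunk
    w.Write(16)                     // sub-chunk size for PCM
    w.Write(1us)                    // audio format 1 = uncompressed PCM
    w.Write(uint16 channels)
    w.Write(sampleRate)
    w.Write(byteRate)
    w.Write(uint16 blockAlign)
    w.Write(uint16 bitsPerSample)
    w.Write("data"B)                // data sub-chunk
    w.Write(pcmLength)              // size of the PCM payload
    w.Flush()
    ms.ToArray()

Prepend the returned header to the raw bytes from DoProcessFrame and you have a valid in-memory WAV stream to hand to APIs that refuse headerless PCM.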
I hope this guide helps you integrate other exciting projects directly into LiveSwitch. You can find the complete working project on GitHub.
Need assistance in architecting the perfect WebRTC application? Let our team help out! Get in touch with us today!