You can build real-time pipelines with Indexify that incorporate speech, and applications that retrieve information from audio.

Speech to Text

Audio processing pipelines generally start off by converting audio to text.

  • Automatic Speech Recognition (ASR) - Converting speech to text. If all you want is a plain transcription, you can use a model like Whisper in a function that extracts transcriptions from audio. The full text and timestamped chunks are represented as metadata of your function’s output.
from indexify import indexify_function

@indexify_function()
def transcribe_audio(audio: bytes) -> str:
    # Invoke an ASR model (e.g. Whisper) to transcribe the audio
    ...
  • Speaker Diarization - Applications such as meeting transcription often require identifying who said what. Speaker Diarization is the process of segmenting and clustering audio into per-speaker segments. This can be done with a model like pyannote.audio.
from typing import List

from pydantic import BaseModel
from indexify import indexify_function

class SpeakerSegment(BaseModel):
    speaker: str
    transcript: str
    start: float
    end: float

@indexify_function()
def diarize_audio(audio: bytes) -> List[SpeakerSegment]:
    # Invoke a Speaker Diarization model to segment the audio by speaker
    ...

You can use commercial ASR and Speaker Diarization services with Indexify as well. They often perform better on accented speech.
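For example, a function can call a hosted ASR API instead of running a model locally. The sketch below uses OpenAI’s hosted Whisper endpoint; the client usage is illustrative, and any commercial ASR service can be swapped in.

import io

from indexify import indexify_function

@indexify_function()
def transcribe_audio_hosted(audio: bytes) -> str:
    # Illustrative sketch: send the audio to a hosted ASR service.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    audio_file = io.BytesIO(audio)
    audio_file.name = "audio.wav"  # the API infers the audio format from the name
    result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text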

Speech to Speech and Text to Speech

  • Voice Interfaces - You can build voice interfaces that accept speech and respond with speech. Use a TTS model in a function that accepts text and generates speech; the function returns the audio as a byte array, as sketched below.
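A minimal sketch of such a function, in the same stub style as the ASR examples above (the TTS model invocation is left to you):

from indexify import indexify_function

@indexify_function()
def text_to_speech(text: str) -> bytes:
    # Invoke a TTS model to synthesize audio from text and
    # return the raw audio bytes (e.g. WAV or MP3)
    ...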

Dynamic Routing

Speech processing is complex, and a single model often doesn’t perform well on all accents and languages. You can insert a dynamic Router in your pipeline, which routes the audio to different ASR models by classifying accent, language, or other features of the audio.
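A sketch of such a router, assuming the SDK’s indexify_router decorator and its pattern of returning a list of downstream functions; detect_language is a hypothetical helper, and the two ASR functions are placeholders:

from typing import List, Union

from indexify import indexify_function, indexify_router

@indexify_function()
def transcribe_english(audio: bytes) -> str:
    # ASR model tuned for English speech
    ...

@indexify_function()
def transcribe_multilingual(audio: bytes) -> str:
    # ASR model with broader language coverage
    ...

@indexify_router()
def route_by_language(audio: bytes) -> List[Union[transcribe_english, transcribe_multilingual]]:
    # Classify the language of the audio, then route to the
    # appropriate ASR function. detect_language is a hypothetical helper.
    if detect_language(audio) == "en":
        return [transcribe_english]
    return [transcribe_multilingual]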

Examples

Meeting Transcription and Summarization

  • Speaker Diarization
  • Classification of meeting intent
  • Dynamic Routing between summarization models
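A sketch of how these stages could be wired into a graph, assuming the SDK’s Graph API; classify_meeting_intent and route_summarizer are hypothetical functions defined with the decorators shown earlier:

from indexify import Graph

# diarize_audio is the function defined above; the remaining stages
# are hypothetical placeholders for intent classification and routing
# between summarization models.
g = Graph(name="meeting-summarizer", start_node=diarize_audio)
g.add_edge(diarize_audio, classify_meeting_intent)
g.add_edge(classify_meeting_intent, route_summarizer)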