Use Cases
Audio Processing
Real Time Speech Recognition Pipelines
You can use Indexify to build real-time pipelines that incorporate speech, and to build applications that retrieve information from audio.
Speech to Text
Audio processing pipelines generally start by converting audio to text.
- Automatic Speech Recognition (ASR) - Converting speech to text. If all you want is a plain transcription, you can use a model like Whisper in a function that extracts transcriptions from audio. The entire text and chunks with timestamps are represented as metadata of your function’s output.
- Speaker Diarization - Applications such as meeting transcription often require identifying who said what. Speaker diarization is the process of segmenting the audio and clustering the segments by speaker. This can be done using a model like pyannote.
You can use commercial ASR and speaker diarization services with Indexify as well. They often perform better for accented speech.
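To make the "who said what" step concrete, here is a minimal sketch of how timestamped transcript chunks might be combined with diarization segments. It is plain Python and does not use the Indexify API or any particular model. The `Chunk` and `SpeakerSegment` types, the `assign_speakers` helper, and the sample data are all hypothetical and chosen for illustration. The sketch assumes both models report start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A timestamped piece of transcript, as an ASR model might emit it."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerSegment:
    """A span of audio attributed to one speaker by a diarization model."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks, segments):
    """Label each chunk with the speaker whose segment overlaps it the most."""
    labeled = []
    for chunk in chunks:
        best = max(
            segments,
            key=lambda s: overlap(chunk.start, chunk.end, s.start, s.end),
            default=None,
        )
        # No segments at all, or no segment touching this chunk.
        if best is None or overlap(chunk.start, chunk.end, best.start, best.end) == 0.0:
            speaker = "unknown"
        else:
            speaker = best.speaker
        labeled.append((speaker, chunk.text))
    return labeled


# Hypothetical sample data standing in for real model output.
chunks = [
    Chunk(0.0, 2.5, "Good morning everyone."),
    Chunk(2.5, 5.0, "Morning, shall we start?"),
]
segments = [
    SpeakerSegment(0.0, 2.4, "SPEAKER_00"),
    SpeakerSegment(2.4, 5.2, "SPEAKER_01"),
]

for speaker, text in assign_speakers(chunks, segments):
    print(f"{speaker}: {text}")
```

Picking the segment with the largest overlap is one simple heuristic. Chunks that span a speaker change may need to be split at word level instead.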
Speech to Speech and Text to Speech
- Voice Interfaces - You can build voice interfaces that take in speech and generate speech. To generate speech, you would use a text-to-speech (TTS) model in a function that accepts text and returns the audio as a byte array.
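The text-in, bytes-out contract of such a function can be sketched as follows. This is only an illustration, not a real TTS implementation: the hypothetical `fake_tts` function emits a 440 Hz tone as WAV bytes, using the Python standard library, where a real function would call a TTS model. The sample rate and the tone length per character are arbitrary assumptions.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000          # assumed sample rate for this illustration
SAMPLES_PER_CHAR = 800       # arbitrary: 50 ms of audio per input character


def fake_tts(text: str) -> bytes:
    """Stand-in for a TTS model: accepts text, returns WAV audio as bytes.

    A real function would run a TTS model here. This placeholder writes a
    sine tone so the byte-array contract can be shown without a model.
    """
    n_samples = SAMPLES_PER_CHAR * max(len(text), 1)
    # 16-bit little-endian mono PCM samples.
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)
    return buffer.getvalue()


audio = fake_tts("Hello")
print(type(audio).__name__, len(audio))
```

Returning an in-memory byte array keeps the function independent of the file system, so the caller decides whether to stream, store, or play the audio.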
Dynamic Routing
Speech processing is complex, and a single model often doesn’t perform well across all accents and languages. You can insert a dynamic router in your pipeline that routes audio to different ASR models by classifying accent, language, or other features of the audio.
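The routing idea can be sketched in plain Python as a classifier followed by a dispatch table. This is not the Indexify router API. Every name here (`route`, `ROUTES`, the `transcribe_*` functions, and `toy_classifier`) is hypothetical, and the back ends return placeholder strings where a real pipeline would call ASR models.

```python
from typing import Callable, Dict


# Hypothetical ASR back ends; each would wrap a real model in practice.
def transcribe_english(audio: bytes) -> str:
    return "[english-model transcript]"


def transcribe_hindi(audio: bytes) -> str:
    return "[hindi-model transcript]"


def transcribe_general(audio: bytes) -> str:
    return "[general-model transcript]"


# Map a classifier label to the ASR function that should handle it.
ROUTES: Dict[str, Callable[[bytes], str]] = {
    "en": transcribe_english,
    "hi": transcribe_hindi,
}


def route(audio: bytes, classify: Callable[[bytes], str]) -> str:
    """Classify the audio, then dispatch to the matching ASR function.

    Labels without a dedicated route fall back to the general model.
    """
    label = classify(audio)
    handler = ROUTES.get(label, transcribe_general)
    return handler(audio)


def toy_classifier(audio: bytes) -> str:
    """Toy stand-in for a language or accent classifier.

    It reads the first byte as a fake label so the example runs without a
    model; a real classifier would inspect the audio content itself.
    """
    first = audio[0] if audio else 255
    return {0: "en", 1: "hi"}.get(first, "other")


print(route(b"\x00", toy_classifier))
print(route(b"\x01", toy_classifier))
print(route(b"\x07", toy_classifier))
```

Keeping the classifier as a parameter means the same router can switch on accent, language, or any other feature without changing the dispatch logic.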
Examples
Meeting Transcription and Summarization
- Speaker Diarization
- Classification of meeting intent
- Dynamic Routing between summarization models