Speaker Diarization

Providers of audio transcription services handle many types of audio and video recordings. Speaker diarization in speech-to-text systems aims to segment an audio recording and group the segments by speaker, identifying “who spoke when”. With advances in deep learning, speaker diarization has progressed rapidly in recent years.

Wikipedia defines speaker diarization as “the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity”.

Applications of Speaker Diarization

Speaker diarization can be effectively used to segment or analyze various types of audio recordings such as conferences, lectures, court proceedings, business meetings and earnings reports, medical conversations, audio/video news broadcasts, social media videos, etc. All of this audio data features multiple speakers. Diarization improves the readability of automatic speech recognition output and supports the analysis of call-center and meeting transcripts.

How Speaker Diarization Works

Customer service calls, clinical visits, live broadcasts, and online meetings are conversational audio data that usually involve multiple speakers. In a call center, action is regularly taken based on conversations between customers and agents about the issues customers face; these conversations also provide valuable information about market trends, agent performance, and company products and processes. In all these cases, the transcription should ideally specify who spoke and when. It is important to accurately label each speaker and associate them with the audio content. The speaker diarization capability of modern speech-to-text systems allows the transcribed data to be put to use immediately.

When speaker diarization is enabled in speech-to-text transcription, the system attempts to identify the different voices in the audio recording. Each transcribed word is assigned a speaker number; that is, words spoken by a particular person will all carry the same number. The three basic tasks performed by a typical diarization system, as listed by wiki.aalto.fi, are:

  • Separates speech segments from the non-speech ones
  • Identifies speaker change points to segment the audio data
  • Groups the segmented regions into speaker homogeneous clusters
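The third step, grouping segments into speaker-homogeneous clusters, can be illustrated with a minimal sketch. The per-segment "embeddings" below are invented toy vectors, and the greedy centroid-based clustering is a deliberately simplified stand-in for the clustering algorithms (e.g. agglomerative clustering over speaker embeddings) that real diarization systems use:

```python
import math

def cluster_segments(embeddings, threshold=0.5):
    """Greedy clustering sketch: assign each segment to the nearest
    existing cluster centroid, or open a new cluster if none is
    closer than the threshold. Returns one speaker label per segment."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_dist = None, threshold
        for i, centroid in enumerate(centroids):
            dist = math.dist(emb, centroid)
            if dist < best_dist:
                best, best_dist = i, dist
        if best is None:
            # No existing cluster is close enough: start a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Fold the segment into the cluster's running-mean centroid.
            n = counts[best] + 1
            centroids[best] = [(c * (n - 1) + e) / n
                               for c, e in zip(centroids[best], emb)]
            counts[best] = n
            labels.append(best)
    return labels

# Toy 2-D "embeddings": two segments near (0, 0), two near (1, 1).
segments = [(0.0, 0.1), (0.1, 0.0), (1.0, 0.9), (0.9, 1.0)]
print(cluster_segments(segments))  # [0, 0, 1, 1]
```

In practice the embeddings would come from a trained speaker-embedding model rather than being hand-written, but the grouping logic follows the same principle: segments whose voices sound alike end up with the same speaker label.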

Speaker diarization, or speaker labeling, makes it easier to create an accurate transcription because it becomes possible to distinguish what each speaker said. The generated transcript can contain as many speaker numbers as there are speakers that speech-to-text can uniquely identify in the audio recording. However, according to an article published by Data Driven Investor, the task of speaker diarization comes with many challenges:

  • The number of speakers in the program is unknown
  • There is no prior knowledge about the identity of the people in the program
  • Many speakers may speak at the same time
  • Audio recording conditions can vary
  • The audio channel may contain not only speech, but also music and other non-speech sources (applause and laughter, etc)
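Once a diarizing system has assigned a speaker number to each word, assembling a readable transcript is a matter of grouping consecutive words that share a number into speaker turns. The word/speaker pairs below are invented for illustration:

```python
def format_transcript(words):
    """Group consecutive (word, speaker_number) pairs into speaker
    turns and render them as a labeled transcript."""
    turns = []
    for word, speaker in words:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1][1].append(word)
        else:
            # Speaker change point: start a new turn.
            turns.append((speaker, [word]))
    return "\n".join(f"Speaker {s}: {' '.join(ws)}" for s, ws in turns)

diarized_words = [
    ("hello", 1), ("how", 1), ("are", 1), ("you", 1),
    ("fine", 2), ("thanks", 2),
    ("great", 1),
]
print(format_transcript(diarized_words))
# Speaker 1: hello how are you
# Speaker 2: fine thanks
# Speaker 1: great
```

Note that the same person can appear in multiple turns; the speaker number, not the turn, is what identifies them across the recording.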

Top Tools for Transcribing Audio with Multiple Speakers

There are various options that make the process of transcribing multi-speaker audio easy:

  • IBM Watson’s Speech to Text API supports real-time speaker diarization and automatically recognizes (and tags) different speakers in the audio
  • Amazon Transcribe can accurately distinguish between two and 10 speakers within a single live audio stream, thus helping to identify who is saying what in the output transcript
  • Trint strives to separate different speakers into paragraphs
  • Google Cloud Speech-to-Text offers speaker diarization that tags each transcribed word with a speaker number, and the feature is now available in 10 new locales

Audio transcription services performed by human transcriptionists are a reliable way to transcribe audio/video content with multiple speakers. Automated services are useful if you want to save time and do not require the utmost accuracy. An article in PC Mag reported that overall transcript accuracy varied considerably between human-based and automated services. When evaluating a recording of a conference call with multiple speakers and an in-person interview, it found that human-based services did a (mostly) excellent job with the more difficult file, whereas the automated ones produced nearly unusable results.