New Study Finds Errors in Automated Transcripts and Captions

Published on Nov 18, 2022 | Audio Transcription

Research Highlights Errors in Automated Transcripts
One in eight people in the United States aged 12 or older (13 percent, or 30 million people) has hearing loss in both ears, based on standard hearing examinations, according to the National Institute on Deafness and Other Communication Disorders. This affects their ability to watch multimedia, as many programs make little sense without audio. Transcription and captioning make such content accessible to these viewers. However, when done with software, audio-to-text conversion can result in significant flaws, according to a new study. Having machine-generated transcripts reviewed by an audio transcription service provider is the best way to overcome this challenge.

Consumer Reports: Most Video Conferencing Apps Have Captioning Mistakes

Researchers at Northeastern University and Pomona College, who partnered with Consumer Reports to test auto-captions in seven popular products, found that the captions contained frequent mistakes. This poses significant challenges for people who are deaf or hard of hearing, or whose first language is not English (www.consumerreports.org, August 2022).

The researchers evaluated captions on products including BlueJeans, Cisco Webex, Google Meet, Microsoft Stream, and Zoom. According to Consumer Reports:

  • There were mistakes in all of the programs, with some getting about 1 in 10 words wrong.
  • The results were worse for second-language English speakers, even those who were fluent. This means that an auto-caption user is less likely to be able to understand people whose native language isn’t English.
  • Gender and first-language status were the only factors that affected the variation in transcription mistakes (other factors, such as the speaker’s age, race and ethnicity, and speech rate, had no impact).

The study also found considerable differences within each tested platform, with Webex having more mistakes than Google Meet. Consumer Reports said Zoom’s “very best transcription had just two errors per 100 words, while at its worst the software mistranscribed nearly every third word.”
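Error rates like these are conventionally measured as word error rate (WER): the word-level edit distance between a reference transcript and the machine output, divided by the length of the reference, so “two errors per 100 words” corresponds to a WER of 0.02. Below is a minimal Python sketch of that standard calculation, using made-up sample strings; it is illustrative only and is not the tooling Consumer Reports used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fax"))  # 0.25 (1 error in 4 words)
```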

How the Companies Responded

The companies’ spokespersons responded to Consumer Reports:

YouTube said that the study results matched up with the company’s “expectations for performance”. They said they were working to ensure that YouTube works better for everyone.

Microsoft confirmed the findings roughly align with its internal testing, which also revealed lower accuracy when transcribing men and second-language English speakers.

Zoom’s reply was: “We’re continuously enhancing our transcription feature to improve accuracy toward a variety of factors, including English dialects and accents.”

Google reported it was working to “improve the accuracy of live captions and translations so even more users can participate and stay engaged using Google Meet”.

Cisco said its auto-caption testing puts Webex ahead of two “best-in-class speech recognition engines”, but did not name these products.

Common Mistakes When Using Speech Recognition

Real-time machine-generated transcription and captions for videos are created by software that combines automatic speech recognition (ASR) technology, machine learning (ML), and artificial intelligence (AI). Speech recognition instantly identifies the spoken words and converts them into text on screen (a minimal code sketch follows the list below). Popular as it is for its speed and cost-effectiveness, automated transcription raises accuracy concerns, as mentioned above. Factors that cause speech recognition mistakes include:

  • Speaker accents and dialects – Voice recognition software trained on American English speakers is likely to make mistakes when transcribing speakers of other English varieties. Consumer Reports found 10 transcription errors per 100 words for non-native English speakers.
  • Multiple speakers – The accuracy of speech recognition drops when multiple speakers are present and being recorded.
  • Fast speech – The software may not be able to transcribe the speech of those who speak quickly or run words together, leading to missed words or phrases.
  • Complex jargon or phrasing – Every business sector has its own terminology and jargon that is not part of standard English, and an automated tool may not be able to render them accurately in text.
  • Background noise – People talking in the background, music, the sound of traffic, and other loud noises will affect the quality of the automated transcript.
  • Speaker’s distance from the microphone – If the speaker positions the microphone too close to the mouth, the software may pick up jumbled speech.
  • Homophones – Speech recognition software tends to misinterpret same-sounding words, for example “their” and “there”.
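
To make the pipeline above concrete, here is a minimal sketch of machine transcription using the open-source openai-whisper package. The file name is a placeholder, and none of the products discussed here necessarily use this engine; the point is that the model sees only the raw audio signal, which is why the factors listed above degrade its output.

```python
# Minimal ASR sketch using the open-source openai-whisper package
# (pip install openai-whisper). "meeting.wav" is a placeholder file name;
# the commercial products discussed above use their own engines.
import whisper

model = whisper.load_model("base")        # small pretrained speech model
result = model.transcribe("meeting.wav")  # convert spoken audio to text
print(result["text"])                     # the machine-generated transcript
```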

Reviewing Automatic Captions and Transcripts Can Ensure Accuracy

Transcripts and captions are essential to extend the accessibility and reach of videos, podcasts, and other multimedia content. While research like Consumer Reports’ will encourage tech companies to improve their speech-recognition systems, partnering with an online transcription service provider is the best bet for ensuring accurate transcripts of work meetings, conferences, lectures, and other important activities.

On its YouTube Help page, Google states: “These automatic captions are generated by machine learning algorithms, so the quality of the captions may vary. We encourage creators to add professional captions first. YouTube is constantly improving its speech recognition technology. However, automatic captions might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise. You should always review automatic captions and edit any parts that haven’t been properly transcribed”.
