Speech recognition is one of the most difficult challenges in machine learning, yet it is one that expert human transcriptionists handle with ease. Media transcription recently got a boost with the news that the United Nations (UN) is developing a new radio-listening tool. This technology is expected to be especially useful in countries where people do not have access to the Internet and radio is an important source of news.
From Apple’s Siri, Microsoft’s AI, and Amazon’s Echo to Google’s Voice Assistant, there is no dearth of voice recognition options. However, even these big tech companies admit that their tools fail to transcribe speech as accurately as humans can. Moreover, these products decipher only a few prominent languages. One of the latest entrants in the field, the UN’s radio-listening tool, is expected to break new ground by filtering through the content of radio broadcasts in three local languages not served by other speech recognition tools. The focus is on transcribing the voices of rural citizens to gather their opinions on key social issues for the UN’s Global Pulse data analysis initiative.
The first prototype of the radio-listening tool is being tested in Uganda. In a country where 90 percent of the population lives in rural areas, radio serves as a vital platform for public discussion, information sharing and news. Several community FM radio stations across Uganda host popular shows where listeners phone in and talk about everyday problems such as violence against women, floods, malaria, adolescent pregnancy, teacher absenteeism, price fluctuations, or disease. By gathering information on these serious topics, the UN expects to involve more rural citizens in decision-making about where to send aid or how to improve services.
The team working on the project faced many challenges:
- There was no precedent to rely on
- Teaching the computer to recognize the three local languages – Ugandan-accented English, Acholi and Luganda – was tricky as these languages did not have many transcriptions, word lists or existing texts
- Good speech recognition needs large volumes of audio, but the team could transcribe just 10 hours for each of these three languages
- For the first run of the transcription program, only about two words in 10 were correct
Though it was tough to make sense of texts with so many inaccurate words, the researchers finally achieved an accuracy rate of at least 50 percent for all three languages, and up to 70 percent in some cases.
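The article does not say how the team measured accuracy. One common convention, assumed here purely for illustration, is word-level accuracy: align the machine transcript against a human reference, count the word substitutions, insertions and deletions, and subtract that error rate from one. The sketch below uses hypothetical function names and an invented sample sentence, not the project's data.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the hypothesis transcript into the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing (deletion)
                current[j - 1] + 1,      # extra hypothesis word (insertion)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy as 1 minus the word error rate, floored at zero."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference transcript must contain at least one word")
    error_rate = word_edit_distance(reference, hypothesis) / ref_len
    return max(0.0, 1.0 - error_rate)


# Invented example: a 10-word reference where only 2 words come out right.
reference = "floods destroyed the maize crop near the river last week"
hypothesis = "floods a b c d e f g h week"
print(f"{word_accuracy(reference, hypothesis):.0%}")  # prints 20%
```

Under this assumed metric, a first run with "two words in 10" correct corresponds to a score of about 20 percent, and the later results correspond to scores of 50 to 70 percent.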
Speech recognition tools have their limitations. The technology has problems transcribing conversations with multiple speakers, as in a meeting or focus group. Speech recognition software does not work well in a noisy, crowded or echo-prone environment, or when speakers have an accent or speak quickly or quietly. Computers also find it difficult to understand children and elderly speakers. Finally, machines cannot comprehend the subtle nuances of a person’s speech – a computer can identify relevant topics in conversations, but human ears are needed to determine what was actually said. By contrast, the experienced transcriptionists in audio transcription companies manage all these challenges with ease, which explains why their services are indispensable for industries such as media, law, business, health and finance.