Impressive new entrants have raised the bar for industry leaders, with AssemblyAI, Speechmatics, and Whisper leading the pack
Automatic speech recognition (ASR) technology has never been as accurate as it is today, thanks to advances in artificial intelligence (AI), according to a report released today by 3Play Media, the leading media accessibility provider. The annual State of ASR study analyzes the state of speech-to-text technology as it applies to captioning and transcription.
According to the study, in which the company tested ten relevant ASR engines, the accuracy of the technology has improved measurably since the company’s last evaluation in 2022. As ASR improves, it's important to understand which engine is best for which use case. Nuances to consider include performance on different error types, transcription styles, formatting, and industry-specific content.
“The advances in AI we’ve seen across industries have also had an impact on ASR,” Chris Antunes, co-CEO and co-Founder, 3Play Media, said. “Longtime industry leader Speechmatics and newer entrants AssemblyAI and Whisper performed at the top of the pack, with each excelling in different areas. This proves that not all engines are created equal - the training material and models matter - and that there is room at the top for multiple engines to specialize in different use cases.”
Accuracy is the key component in captioning for several reasons. Most importantly, it ensures that individuals who are deaf or hard of hearing and rely on captions as an accommodation receive information that fully depicts the original content. For captions to be accessible and legally compliant, they need to meet the industry requirement of 99% accuracy. While industry leaders improved across the board, the study found that even the best engines performed well below 99% accuracy, indicating a continued need for human revision.
The report measures accuracy using two metrics, Word Error Rate (WER) and Formatted Error Rate (FER). While WER is the standard measure of transcription accuracy, FER also takes into account formatting, sound effects, grammar, and punctuation, and better represents the accuracy a viewer experiences when reading captions. Accuracy is harder to achieve under FER: even the best engines tested were only 82% accurate by that measure, whereas the best engines tested were 93% accurate by WER.
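For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that conventional definition only. The function names and the normalization step (lowercasing and dropping punctuation) are assumptions made for illustration, not the scoring method used in the State of ASR report.

```python
import re


def tokenize(text):
    # Assumed normalization: lowercase and drop punctuation so that
    # only the words themselves are compared.
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("The cat sat on the mat.", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

If accuracy is taken to be one minus WER, which is a common convention and an assumption here, a WER of 0.07 corresponds to 93% accuracy. A formatted measure such as FER would also have to score punctuation, capitalization, and non-speech elements, which this sketch deliberately strips out.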
Additionally, the study identified a new type of error: hallucinations, the tendency of an engine to generate text that has no basis in the audio. The State of ASR report found evidence of hallucinations in the Whisper transcriptions, often occurring when the topic shifted. Some of the hallucinations were significant and could pose issues for the captioning use case in particular. However, hallucinations appeared to be rare and did not prevent Whisper from performing competitively.
To download the report, please visit: https://go.3playmedia.com/rs-2023-asr
About 3Play Media
3Play Media is an integrated media accessibility platform with patented solutions for closed captioning, transcription, live captioning, audio description, and subtitling. 3Play Media combines machine learning (ML) and automatic speech recognition (ASR) with human review to provide innovative, highly accurate services. Customers span multiple industries, including media & entertainment, corporate, ecommerce, fitness, higher education, government, and elearning.
View source version on businesswire.com: https://www.businesswire.com/news/home/20230503005160/en/
Contacts
Phil LeClare
phil.leclare@3playmedia.com
617-209-9406
www.3playmedia.com
@3playmedia