Benchmarks November 2024
Keywords: speech recognition, benchmarking, real-time systems
1 - Introduction
Transcription has seen recent advances with models like OpenAI's Whisper, which achieves a Word Error Rate (WER) as low as 1.8% on LibriSpeech test-clean, the standard benchmark. However, even state-of-the-art models whose accuracy surpasses human performance still fall short of the abilities of a human scribe. To match a human professional, a system needs not only exceptional accuracy but also:
a. Real-time responsiveness:
- Response times of ~2 seconds or less
b. Natural-language commands:
- •"In that previous paragraph, change dog to cat"
- •"For the list, add a subpoint that says X"
- •"In the list, move the second point to the bottom"
c. Human-friendly formatting:
- Email greetings, lists, sign-offs, etc., e.g.:

  Dear John,
  The new update will be available for download tomorrow evening. Hope to see you at the conference.
  Best,
  Michael
We tested Aqua Voice on LibriSpeech for accuracy and then built our own benchmark to test the categories above.
2 - LibriSpeech
LibriSpeech is a large-scale dataset of English audiobook clips that is widely used for training and testing transcription models. We tested Aqua Voice on all 2,620 audio files of LibriSpeech test-clean and applied the Whisper normalizer to both the outputs and the ground truth, the same process used by the Open ASR Leaderboard.
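The evaluation loop itself is simple. The following is a minimal sketch, assuming the openai-whisper package (for its English text normalizer) and jiwer (for WER); transcribe_with_provider() is a hypothetical stand-in for the transcription service under test.

```python
# Minimal sketch of the LibriSpeech test-clean evaluation, assuming the
# openai-whisper and jiwer packages. transcribe_with_provider() is a
# hypothetical stand-in for the transcription service being benchmarked.
from pathlib import Path

import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

def transcribe_with_provider(audio_path: Path) -> str:
    """Placeholder: send the audio to the provider and return its raw transcript."""
    raise NotImplementedError

references, hypotheses = [], []
# LibriSpeech stores per-chapter transcripts in *.trans.txt files,
# one "<utterance-id> <text>" line per audio clip.
for trans_file in sorted(Path("LibriSpeech/test-clean").rglob("*.trans.txt")):
    for line in trans_file.read_text().splitlines():
        utt_id, reference = line.split(" ", 1)
        hypothesis = transcribe_with_provider(trans_file.parent / f"{utt_id}.flac")
        # Normalize both sides identically before scoring.
        references.append(normalizer(reference))
        hypotheses.append(normalizer(hypothesis))

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```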
| Model | LibriSpeech test-clean WER |
|---|---|
| Amazon_api_en | 6.42% |
| Baidu_api_en | 6.61% |
| Google_api_en | 5.63% |
| Aqua Voice | 3.22% |
| Whisper_large_v1 | 2.66% |
| Whisper_large_v2 | 2.14% |
| Whisper_large_v3 | 1.80% |
Aqua Voice achieved a 3.22% WER on LibriSpeech test-clean, which is more accurate than human-level transcription (~4% WER) and outperforms Google (5.63%) and Amazon (6.42%).
These results are impressive given the demanding requirements of real-time transcription. Real-time systems have stringent latency requirements, often returning initial results in less than a second, which limits the compute available for generating each token. Asynchronous systems can "look ahead" within the audio sample to transcribe coherent phrases, while real-time systems must generate tokens after each utterance while staying consistent with what they have already emitted.
LibriSpeech also has several shortcomings as a test of Aqua Voice's capabilities. Since its audio samples come exclusively from audiobooks, they do not include commands. Additionally, the standard practice of aggressive text normalization, which strips away capitalization, punctuation, and whitespace, makes it unsuitable for assessing formatting. This means models can do well on LibriSpeech and similar accuracy-only benchmarks while producing unformatted, human-unfriendly output, i.e., a "wall of text."
Commands and formatting are first-class citizens of the Aqua transcription engine.
Commands:
- 01."Start a bullet list titled countries: USA, next Canada, and last Mexico."
- 02."In the previous paragraph, scratch the word 'slow'."
- 03."In the list, move the second point to the bottom."
Appropriate formatting:
Without any formatting commands, saying: "Groceries for dinner. 2 cups of tomato sauce a brand called garden jar, then 6 ounces of cheese, next olive oil" would result in:
Groceries for dinner:
- 2 cups of tomato sauce, a brand called Garden Jar
- 6 ounces of cheese
- Olive oil
To evaluate these abilities, we developed our own benchmark focused on "human-scribe" behaviors.
3 - Human-Scribe Benchmarks
To test Aqua's ability to output well-formatted text, we created custom benchmarks that assess human-friendly formatting and command execution. We tested Aqua Voice and nine other providers on their ability to output five different types of documents.
3.1 - Methodology
To assess the precision of the transcription services at executing commands and applying human-friendly formatting, we used a systematic approach to data creation, processing, and analysis as detailed below:
1. Writing reference scripts (i.e., ground truth)
2. Recording audio samples
3. Transcribing audio
4. Comparing transcripts to the ground truth
5. Normalizing the false-positive errors
6. Calculating WER
The first step was to assemble five reference scripts that functioned as "ground truth." We wrote four: a professional email, a technical report, business notes, and a book manuscript. The fifth, the audio of a university lecture from an external source, contained no formatting commands and served as a control.
Audio samples were recorded from three different speakers in a studio using standard Apple wired earbuds. All recordings were saved in m4a format.
Here are the original scripts and their recordings:
- Email - (link: transcript, audio)
- Technical Report - (link: transcript, audio)
- Professional Notes - (link: transcript, audio)
- Lecture - (link: transcript, audio)
- Book - (link: transcript, audio)
The audio samples include formatting commands that, when executed correctly, yield text identical to the original scripts.
We excluded from the audio corpus any "backward-looking commands," i.e., commands that edit an earlier part of the text. Although Aqua Voice can execute both forward- and backward-looking commands, the other providers cannot edit previous parts of the text. Additionally, backward commands could create multiple instances where the speaker "loops back" to edit previous text, distorting the results in favor of Aqua Voice.
Backward-looking commands (not included in the audio corpus):
- "In the previous paragraph, delete X"
- "Change X to Y"
- "Delete the word Y"
Forward-looking commands (included in the audio corpus):
- "Start a numbered list, first item, X, next item, Y, third item, Z"
- "Next paragraph"
- "Comma"
- "Period"
| Provider | Input to Provider | Mode |
|---|---|---|
| Aqua Voice | Upload | Continuous |
| Whisper-large-v3 + GPT-4o | Upload | Continuous |
| Wispr Flow | Microphone Playback (Desktop) | Short Clips |
| Whisper-large-v3* | Upload | Continuous |
| SuperWhisper | Upload | Continuous |
| Rev | Upload | Continuous |
| Otter | Upload | Continuous |
| Apple Notes Dictation | Microphone Playback (Mobile) | Continuous |
| Google Voice Typing | Microphone Playback (Desktop) | Continuous |
| Dragon Dictation 16 Pro | Upload | Continuous |

* via Vienna Scribe
The same recordings were provided to all services, but due to a lack of API access, Wispr Flow, Apple Dictation, and Google Dictation required audio playback into a microphone. Apple and Google Dictation required multiple attempts to produce acceptable transcripts, as they would sometimes omit entire paragraphs. With continuous playback, Wispr Flow did not execute commands, something it can do on short clips, so to keep the benchmark fair we followed its intended user flow and transcribed in short bursts.
Here are the raw transcriptions: Link
Our initial attempt to directly compare transcriptions against original scripts yielded many false positives, i.e. cases in which the generated transcript didn't match the ground truth but a human grader wouldn't have marked them as wrong.
While Whisper's normalizer offered an alternative, its aggressive removal of capitalization, punctuation, and whitespace defeated the purpose of analyzing formatting commands. To generate more meaningful results, we developed a custom, less-aggressive normalization approach before calculating WER.
3.1.1 - Normalization
To determine the WER, we evaluate how much two documents differ from one another. While it's possible to perform this comparison at either the character or word level, we opted for word-level comparison with a key distinction: our analysis includes both whitespace and punctuation, rather than just the words.
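For reference, we follow the standard definition WER = (S + D + I) / N, where S, D, and I count substituted, deleted, and inserted tokens against a reference of N tokens; the only difference in our setup is that tokens include punctuation and whitespace, not just words.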
In our assessment, formatting errors (such as missing line breaks in bullet lists) carried the same weight as a missed word. To implement this, we developed a "tokenizer" script that converts words, punctuation marks, and whitespace into distinct tokens. After converting both the transcripts and the reference text into tokens, we generated diff files to highlight the differences. We then manually identified false positives to ensure providers weren't unfairly penalized when calculating the final WER.
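To make the tokenize-and-diff step concrete, here is a minimal sketch in Python. It is illustrative only, not the exact tokenizer script we used: it treats words, punctuation marks, and line breaks as separate tokens, aligns the transcript against the reference with difflib, and reports an error rate over those tokens; a proper Levenshtein alignment would be used for final numbers.

```python
# Illustrative sketch: word-, punctuation-, and line-break-level tokens,
# aligned with difflib to approximate a token-level WER. Not the exact
# tokenizer used for the benchmark.
import difflib
import re

TOKEN_RE = re.compile(r"\n+|[^\w\s]|\w+")

def tokenize(text: str) -> list[str]:
    """Split text into word, punctuation, and line-break tokens."""
    return TOKEN_RE.findall(text)

def token_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    errors = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            errors += max(i2 - i1, j2 - j1)  # substitutions plus any length mismatch
        elif op == "delete":
            errors += i2 - i1                # reference tokens missing from the transcript
        elif op == "insert":
            errors += j2 - j1                # extra tokens added by the transcript
    return errors / max(len(ref), 1)

reference = "Groceries for dinner:\n- 2 cups of tomato sauce\n- Olive oil"
hypothesis = "Groceries for dinner. 2 cups of tomato sauce olive oil"
print(f"{token_error_rate(reference, hypothesis):.1%}")  # missing line breaks count as errors
```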
The guiding principle here was: "Assuming human-level accuracy in the transcription, would a human have formatted it this way?"
In our criteria, added punctuation was not penalized if it preserved meaning in the text (like paragraph breaks and some commas), but omitting or substituting explicitly requested punctuation was penalized.
Not penalized:
- •"Start a numbered list, first item, X, next item, Y, third item, Z"
- •"4" instead of "Four"
- •"Grey" instead of "Gray"
- •"Megabytes" instead of "MB"
Penalized:
- •"Greater than sign" instead of ">"
- •"Semicolon" instead of ";"
- •"Colon" instead of ":"
We are aware of the shortcomings of a human-reviewed process, and we encourage researchers to explore alternatives: an “automated smart normalization” that preserves, at scale, the integrity of the document before calculating WER.
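As a rough illustration of what such automated normalization could look like (a toy sketch, not our actual review process), equivalences like the ones above can be encoded as token-level rewrites applied to both the reference and the transcript before diffing, while spelled-out punctuation is deliberately left untouched so that it still registers as an error:

```python
# Toy sketch only: a handful of token-level equivalences like those listed
# above, applied to both sides before diffing. The benchmark's review was
# manual; this just shows what an automated version might look like.
EQUIVALENT_TOKENS = {
    "four": "4",
    "grey": "gray",
    "megabytes": "mb",
}

def canonicalize(tokens: list[str]) -> list[str]:
    """Map interchangeable spellings to one canonical form (case-insensitively).

    Spelled-out punctuation such as "semicolon" is deliberately left alone,
    so it still counts as an error when the speaker asked for ";".
    """
    return [EQUIVALENT_TOKENS.get(token.lower(), token) for token in tokens]

print(canonicalize(["Four", "megabytes", "of", "Grey", "text", "semicolon"]))
# -> ['4', 'mb', 'of', 'gray', 'text', 'semicolon']
```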
3.2 - Results
After scriptwriting, recording, and transcription, a manual normalization pass identifies the meaningful changes needed for each generated transcript to align with the ground truth, allowing us to determine the WER.
Word Error Rate by document type:

| Provider | Transcription | Voice Commands | Email | Technical Report | Professional Notes | Lecture | Book |
|---|---|---|---|---|---|---|---|
| Aqua Voice | Real-time | Natural Language | 0.9% | 1.0% | 1.4% | 4.4% | 0.7% |
| Whisper-large-v3 + GPT-4o | Async | Natural Language | 0.9% | 2.2% | 0.6% | 3.3% | 5.0% |
| Wispr Flow | Async | Natural Language | 10.5% | 12.5% | 5.5% | 4.2% | 10.1% |
| Whisper-large-v3* | Async | N/A | 32.8% | 33.9% | 23.7% | 2.3% | 18.2% |
| SuperWhisper | Async | N/A | 20.4% | 34.4% | 13.2% | 7.8% | 25.5% |
| Rev | Async | N/A | 11.9% | 14.6% | 5.9% | 4.4% | 9.6% |
| Otter | Async | N/A | 28.0% | 41.3% | 25.3% | 3.2% | 32.8% |
| Apple Notes Dictation | Real-time | Hard-coded | 17.8% | 29.8% | 16.0% | 33.1% | 11.6% |
| Google Voice Typing | Real-time | Hard-coded | 17.8% | 36.3% | 21.3% | 43.7% | 41.1% |
| Dragon Dictation 16 Pro | Real-time | Hard-coded | 12.2% | 19.8% | 11.6% | 18.0% | 7.3% |

* via Vienna Scribe
Here are all the diff files of the transcripts after normalization: Link
This table shows the WER of the 10 providers, which transcribe either in real time or asynchronously. It also shows how each provider handles commands: in natural language, through hard-coded keywords, or not at all. The results across the five document types show the following patterns.
The results indicate that asynchronous transcription providers generally outperform real-time transcription models, with the exception of Aqua Voice. Despite the challenges of real-time processing, Aqua Voice remains competitive across document categories, surpassing every asynchronous provider except Whisper with GPT-4o, which it matches. It's worth remembering that this benchmark only includes forward-looking commands, since asynchronous options like Whisper with GPT-4o cannot handle backward commands, which require real-time interaction between the system and the speaker. In contrast, Aqua Voice, being real-time, supports both forward and backward commands.
The inclusion of these forward commands sheds light on why models with competitive WER on tests like LibriSpeech face challenges in our benchmark. For example, while Whisper achieves 1.8% WER on LibriSpeech test-clean, it shows a much higher WER of 23.7% when transcribing the Technical Report. In contrast, Aqua Voice achieves 1% WER in the same category.
Among real-time transcription providers, Aqua Voice is the only one able to interpret commands and accurately format structured elements such as lists and email greetings. Whisper with GPT-4o is the only other option that can perform these commands, but only asynchronously.
In the Lecture category, which serves as a control because it contains no commands, Aqua is less accurate because it rewrites parts of the speech it identifies as rambling. Providers that specialize in passively transcribing audiobooks or lectures perform better here, but Aqua Voice stays competitive in the document types that require the transcription provider to be an active, real-time aid for the speaker as they write with their voice.
In a real-world environment, executing commands and applying proper formatting is essential. While solutions like Dragon Dictation offer voice commands, they require users to memorize specific keywords.
The diff files show that Dragon is unable to format a document the way a human scribe would. Just as computing evolved from terminals with commands users had to memorize toward intuitive graphical interfaces, speech recognition is evolving from keyword commands toward natural-language understanding.
Even in documents without structured elements like email greetings and lists, such as the Book category, Aqua Voice remains competitive, achieving a WER of 0.7%, while Whisper with GPT-4o reaches 5.0%.
These results show that while LibriSpeech has been a useful benchmark for testing the transcription of audiobooks and lectures, we also need to evaluate commands and human-friendly formatting. With the advent of large language models, voice-driven writing of books, emails, and all sorts of documents in academic and professional settings has become attainable, and these use cases need to be measured.
4 - Conclusion
Despite the handicaps that real-time transcription carries relative to asynchronous solutions, Aqua Voice matches or exceeds the performance of asynchronous providers across the different tests.
As we continue our work on Aqua Voice, the gap between machine transcription and human scribes will continue to narrow. The future of voice transcription lies not just in perfect word accuracy, but in intelligent systems that understand and execute our intentions as we speak, and “read between the lines” to transcribe what we mean, not just what we said.