Romanian speech-to-text solutions: results of the 2023 winter benchmark

Vatis Tech on desktop

Speech-to-text technology is an AI-powered solution that converts spoken words into written text. This technology is increasingly used in various domains, including healthcare, media, and finance, to transcribe audio and video content into text for analysis and storage purposes.

In this report, we compare Vatis Tech's accuracy to two of the biggest players in the market - Google and Speechmatics - and the results are impressive. In addition, the report showcases Vatis Tech's accuracy and highlights how it outperforms its competitors in various domains. Keep reading to dive into the details of this benchmark report and discover why Vatis Tech is the top choice for accurate and reliable speech recognition.

Dataset

In our benchmark report, we compared the latest version 4 of Vatis Tech, Speechmatics, and Google using a dataset of 45 minutes of audio from 6 domains, including:

  • news
  • financial calls
  • podcasts
  • medical
  • legal
  • meetings

The dataset was created to reflect the diversity of speech-to-text use cases and the challenges faced in each domain, and it ensures that the performance of the speech recognition solutions is tested in a range of real-world scenarios taking into account factors such as the speaker's accent, background noise, and word ambiguity.

Each domain has unique characteristics, such as accents, vocabulary, audio quality, and speech patterns, that can affect the accuracy of a voice-to-text converter. By including various audio content from these domains in the dataset, our benchmark accurately assessed the solution's ability to transcribe audio to text in different environments and with varying complexities.

No Category Id Duration (min)
1 News news_sample_1 1:34:00
2 News news_sample_2 7:53:00
3 Financial Call financial_sample_1 14:25:00
4 Financial Call financial_sample_2 7:50:00
5 Podcast podcast_sample_1 2:48:00
6 Podcast podcast_sample_2 4:01:00
7 Medical medical_sample_1 1:34:00
8 Medical medical_sample_21:24:00
9 Legal legal_sample_11:44:00
10 Meeting sample_meeting_1 2:20:00
45:33:00

Methodology

To compare the performance of three voice-to-text solutions: Vatis Tech, Speechmatics, and Google, we used 10 audio samples in the Romanian language from six domains: news, financial call, podcast, medical, legal, and meeting. This benchmark aimed to evaluate each speech-to-text converter's transcription accuracy and identify any strengths and weaknesses in their performance.

  1. Data Collection

The audio content used in this benchmark was collected from publicly available sources and included recordings of 45:33:00 minutes. More details about each piece of content from each domain are available in the dataset section of this report.

2.   Evaluation Metrics

To evaluate the performance of the solutions, we used the following:

  • a human transcription with 100% accuracy
  • the word error rate (WER) metric
­čĺí
The WER is defined as the percentage of words in a transcription that are incorrectly transcribed by the solution technology. It represents the ratio of errors in a transcript to the total words spoken.

3.   Performance Evaluation

To evaluate the performance of our speech-to-text solution, we utilized the standard metric Word Error Rate (WER). WER measures the accuracy of an automatic voice recognition system by comparing automated transcription with human transcription.

The WER score is derived from the Levenshtein distance, working at the word level, and is calculated based on the number of insertions, deletions, and substitutions made by the technology and is expressed as

(S + D + I) / N

where:

  • S is the number of substitutions;
  • D is the number of deletions;
  • I is the number of insertions;

Before calculating the WER, we've applied a few normalization steps for each file to have both the reference (human transcriptions) and the automated transcriptions in the same format.

The normalization steps were:

  • remove punctuation;
  • lowercase all text;
  • keep the formation of the numbers because we would like to evaluate the post-processing layer at the same time.

As an example, let's consider the WER score calculation for the following text:

  • Reference: Hi, my name is Adrian, and I am 27 years old.
  • Normalized reference: hi my name is adrian and i am 27 years old

Also, let's have a look at an example of a WER score calculation:

  • Reference: Hi, my name is Adrian, and I am 27 years old.
  • Normalized reference: hi my name is adrian and i am 27 years old
  • Prediction: Hi, my name is Adrien, and I am 27 years old.
  • Normalized prediction: hi my name is adrien and i am 27 years old

Thus the WER (normalized_reference, normalized_prediction) = 0.0909, which means the speech-to-text system's accuracy is 90.91%.

If you're interested in playing more with this metric, you can use the Hugging Face calculator from here.

Results

Let's take a closer look at a concrete example using the medical_sample_2 dataset. The transcript was initially transcribed with 100% accuracy by a human:

Tegumente cu aspect normal ecografic regiune mamelonar─â ╚Öi retroaerolar─â cu aspect normal ╚Ťesut mamar cu aspect fibrogramdular heterogen ├«n cadranul inferior deschide paranteza ora 6 ├«nchide paranteza o forma╚Ťiune nevascularizat─â, de 6,8 mm transversal slash 3,1 mm anteroposterior, cu aspect de fibro adenom paraleolar deschide paranteza ora 5 ├«nchide paranteza dou─â forma╚Ťiuni cu aspect de fibro adenom find delimitate ├«ntre ele de 5 mm slash 3,7 mm ╚Öi de 5,4 mm slash 3,6 mm la periferia glandei deschide paranteza ora 11:30 ├«nchide paranteza o forma╚Ťiune de 5,9 mm slash 3,5 mm nevascularizat─â, cu aspect de fibro adenom f─âr─â dilata╚Ťii ale canalelor galactofore ╚Ťesut adipost spre ╚Öi retromamar normal reprezentat lobula╚Ťie p─âstrat─â cu aspect normal.

Here is the automated transcript generated by Vatis Tech:

Tegumente cu aspect normal ecografic regiune mame lonar─â ╚Öi retroaerolar─â cu aspect normal ╚Ťesut mamar cu aspect fibrogramdular heterogen ├«ncadra anul inferior deschide paranteza ora 6 ├«nchide paranteza o forma╚Ťiune nevascularizat─â, de 6,8 mm transversal slash 3,1 mm anteroposterior, cu aspect de fibro-adenom paraleolar deschide paranteza ora 5 ├«nchide paranteza dou─â forma╚Ťiuni cu aspect de fibr─â adenom find delimitate ├«ntre ele de 5 mm slash 3,7 mm ╚Öi de 5,4 mm slash 3,6 mm la periferia glandei deschide paranteza ora 11 30 ├«nchide paranteza o forma╚Ťiune de 5,9 mm slash 3,5 mm nevascularizat─â, cu aspect de fibr─â. Adenom f─âr─â dilata╚Ťii ale canalelor galactofore ╚Ťesutadipost spre ╚Öi retromamar normal reprezentat lobula╚Ťie p─âstrat─â cu aspect normal.

Speechmatics:

Tegument cu aspect normal ecografic, regiune mam─â lunar─â ╚Öi retro, aerul cu aspect normal ╚Ťesut mamar cu aspect fibro glandular heterogen. ├Än cadranul inferior. Deschide paranteza. Ora 6 ├Änchide paranteza. O forma╚Ťiune N vascularizat─â de 6,8 milimetri, transversal. Slash 3,1 mm. Antero posterior, cu aspect de fibro adenom. Paralele. Olar. Deschide paranteza. Ora 5 ├Änchide paranteza. Dou─â forma╚Ťiuni cu aspect de fibro adenom fiind delimitate ├«ntre ele de 5 mm slash 3,7 mm ╚Öi de 5,4 mm. Flesh 3,6 mm la periferia glandei. Deschide paranteza. Ora 11:30 ├Änchide paranteza. O forma╚Ťiune de 5,9 mm slash 3,5 mm n. Vascularizat─â, cu aspect de fibro adenom. F─âr─â dilata╚Ťii ale canalelor galactofore. ╚Üesut adipos pre ╚Öi retro mamar normal, reprezentat globular e p─âstrat cu aspect normal.

and Google:

tegumente cu aspect normal ecografic regiune mamelonar ╚Öi retro aerul ar─â cu aspect normal ╚Ťesut mamar cu aspect fibro glandular heterogen ├«n cadranul inferior deschide paranteza ora 6 ├«nchide paranteza o forma╚Ťiune ne va scular is at├ót de 6,8 mm Trans l─âsam Clash 3,1 mm anteroposterior cu aspect de fibroadenom paralel ar deschide parantez─â ora 5:00 ├«nchide paranteza dou─â forma╚Ťiuni cu aspect de fibroadenom fiind delimitate ├«ntre ele de 5 mm Slash 3,7 mm ╚Öi de 5,4 m ╚Öi 3,6 mm la periferia glandei deschide paranteza ora 11:30 ├«nchide paranteza o forma╚Ťiune de 5,9 m m ╚Öi 3,5 m m m vascularizat─â cu aspect de fibroadenom f─âr─â dilata╚Ťii ale canalelor galactofore pe suta de post spre ╚Öi retro mamar normal reprezentat lui Bula ╚Ťie p─âstrat─â cu aspect normal

In this instance, achieved an impressive 96.20% accuracy, while Speechmatics 80.80% and Google only 66.20%.

The results showed that Vatis Tech achieved an average accuracy of 95.30% for all datasets, while Speechmatics had an average accuracy of 89.39% and Google 68.80%. These results demonstrate the superior performance of Vatis Tech in converting speech-to-text, particularly in challenging domains such as medical, with an accuracy of 96.85%, and legal, with 98.10%.

Conclusion:

Our benchmark report comparing Vatis Tech, Speechmatics, and Google's latest audio to text converters has demonstrated the effectiveness of Vatis Tech's v4 model architecture. With an average accuracy of 95.30% across various audio domains, Vatis Tech has shown to be a leader in the field of voice recognition and transcription technology.

This report highlights the importance of using a diverse dataset to evaluate the performance of speech recognition solutions. The results provide valuable insights for both industry professionals and consumers seeking a high-quality voice to text converter.

Overall, the benchmark report has confirmed that Vatis Tech's v4 model architecture offers a cutting-edge solution for transcribing speech into text, making it a valuable asset for various industries and applications.

Try it now for free! The first hour is on our side ­čĹç