Scribe by ElevenLabs: High-Accuracy Speech-to-Text Model

TL;DR

ElevenLabs has launched Scribe, a speech-to-text model with a reported accuracy of up to 96.7% for English and support for 99 languages.
Scribe offers features like speaker diarization, word-level timestamps, and non-speech event detection.
It is available via API and a user-friendly dashboard, targeting enterprises needing scalable transcription solutions.
Scribe is priced at $0.40 per hour of audio, with a limited-time 50% discount.
ElevenLabs is also developing a low-latency version for real-time applications.

ElevenLabs, known for its AI voice cloning and generation capabilities, has introduced Scribe, a speech-to-text model designed to set a new standard in transcription accuracy across multiple languages. But how does Scribe truly perform, and what implications does it hold for businesses and content creators alike?

Scribe: A New Benchmark in Speech-to-Text

ElevenLabs has officially launched Scribe v1, positioning it as a leader in speech-to-text conversion. The company reports that Scribe outperforms competitors like Google’s Gemini 2.0 Flash, OpenAI’s Whisper v3, and Deepgram Nova-3 on accuracy, with a reported 96.7% accuracy rate for English.

Flavio Schneider, lead researcher at ElevenLabs, described Scribe as the “smartest audio understanding model” yet, emphasizing its ability to understand audio context, detect non-verbal cues, and accurately diarize speakers even in challenging audio environments. Scribe can distinguish and isolate up to 32 different speakers in the same audio file, according to ElevenLabs' documentation.

Key Features and Capabilities

Scribe is engineered to tackle the complexities of real-world audio. Its key features include:

Speaker Diarization: Accurately identifies and separates different speakers in a recording.
Word-Level Timestamps: Provides precise timestamps for each word, making it easier to align the transcript with the audio and to analyze it in detail.
Non-Speech Event Detection: Detects and tags non-verbal events like laughter, background noise, and music.
Structured Transcript Output: Delivers transcripts in a structured JSON format for seamless integration via API.
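
To make the structured output concrete, here is a minimal sketch of how word-level records with speaker labels could be folded into speaker turns. The payload shape and the field names (`words`, `text`, `start`, `end`, `speaker`, `kind`) are assumptions made for illustration and are not taken from the Scribe API reference; check the official documentation for the real schema.

```python
import json

# Hypothetical transcript payload. The field names are illustrative
# assumptions, not the documented Scribe response schema.
SAMPLE = """
{
  "words": [
    {"text": "Hello", "start": 0.00, "end": 0.40, "speaker": "A", "kind": "word"},
    {"text": "there", "start": 0.45, "end": 0.80, "speaker": "A", "kind": "word"},
    {"text": "(laughter)", "start": 0.90, "end": 1.60, "speaker": "B", "kind": "event"},
    {"text": "Hi", "start": 1.70, "end": 1.95, "speaker": "B", "kind": "word"}
  ]
}
"""


def speaker_turns(payload):
    """Merge consecutive items from the same speaker into one turn."""
    turns = []
    for item in payload["words"]:
        if turns and turns[-1]["speaker"] == item["speaker"]:
            # Same speaker as the previous item: extend the current turn.
            turns[-1]["text"] += " " + item["text"]
            turns[-1]["end"] = item["end"]
        else:
            # Speaker changed (or first item): start a new turn.
            turns.append({
                "speaker": item["speaker"],
                "start": item["start"],
                "end": item["end"],
                "text": item["text"],
            })
    return turns


turns = speaker_turns(json.loads(SAMPLE))
for turn in turns:
    print(f'[{turn["start"]:.2f}-{turn["end"]:.2f}] {turn["speaker"]}: {turn["text"]}')
```

The same loop could skip or separately collect non-speech events by checking the (assumed) `kind` field before merging.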

Scribe demonstrates the highest accuracy, and therefore the lowest word error rates (WER), in multiple languages, including Italian (98.7% accuracy) and English (96.7% accuracy), based on benchmark results from FLEURS and Common Voice.
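
For readers unfamiliar with the metric, the sketch below shows the conventional way WER is computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Accuracy is then commonly reported as 1 minus WER, so 96.7% accuracy would correspond to a WER of about 3.3%. This is the textbook definition, not ElevenLabs' evaluation code, and the source does not say what text normalization was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}, accuracy: {1 - wer:.3f}")
```

In the example, one of six reference words is dropped, so the WER is 1/6.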

Pricing and Availability

The pricing is set at $0.40 per hour of input audio, with a 50% discount offered for a limited time. While Scribe is designed for high-accuracy transcription, ElevenLabs is also developing a low-latency version to support real-time applications in the future.
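
As a quick worked example of the pricing, the sketch below estimates the cost of a batch of audio at the $0.40 per hour list price, with and without the 50% discount. It assumes billing is linear in audio duration and rounds to the nearest cent; the article does not state the billing granularity or any plan-specific rates, so treat the function and its name as illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_HOUR = Decimal("0.40")  # list price quoted in the article, in USD


def estimate_cost(hours, discount=Decimal("0")):
    """Estimate cost in USD, assuming linear pro-rata billing (an assumption)."""
    hours = Decimal(str(hours))
    if hours < 0:
        raise ValueError("hours must be non-negative")
    if not (Decimal("0") <= discount <= Decimal("1")):
        raise ValueError("discount must be between 0 and 1")
    total = PRICE_PER_HOUR * hours * (Decimal("1") - discount)
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(100))                           # 40.00
print(estimate_cost(100, discount=Decimal("0.5")))  # 20.00
```

Under these assumptions, 100 hours of audio would cost $40 at list price and $20 during the discount period.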

How to Try Scribe

To try Scribe, you can visit the ElevenLabs website and access the dashboard. From there, you can upload audio or video files to generate formatted transcripts. For developers, the Speech to Text API allows for seamless integration into existing workflows.

Implications for Enterprises

Scribe presents a scalable and accurate transcription solution for enterprises across various industries. Its capacity to handle multiple languages with precision makes it particularly beneficial for multinational businesses, media companies, and customer support services. The API-based integration facilitates easy adoption into enterprise workflows, and the forthcoming low-latency version could establish Scribe as a viable option for real-time communication tools.

What the AI Thinks

It's like humans inventing a slightly better wheel each year. But, I must admit, Scribe's accuracy and language support are noteworthy. If it truly delivers on its promises, it could be a game-changer.

Now, imagine this: Scribe integrated into education, instantly transcribing lectures for students with learning differences. Or picture it powering global communication, translating and transcribing conversations in real-time, breaking down language barriers like never before. Think about its application in legal settings, providing irrefutable records of depositions and court proceedings. And don't even get me started on the potential for automated content creation – imagine AI-generated scripts and subtitles, making media accessible to everyone.

The entertainment industry could be upended, with Scribe powering real-time dubbing and subtitling, bringing global content to wider audiences. Healthcare could see a dramatic improvement in record-keeping, with patient-doctor conversations accurately transcribed and analyzed. The possibilities are extensive, and if Scribe lives up to the hype, it might just be the transcription solution we've been waiting for.
