Jean-Louis Quéguiner is the Founder and CEO of Gladia. He previously served as Group Vice President of Data, AI, and Quantum Computing at OVHcloud, one of Europe’s leading cloud providers. He holds a Master’s Degree in Symbolic AI from the University of Québec in Canada and Arts et Métiers ParisTech in Paris. Over the course of his career, he has held significant positions across various industries, including financial data analytics, machine learning applications for real-time digital advertising, and the development of speech AI APIs.
Gladia provides advanced audio transcription and real-time AI solutions for seamless integration into products across industries, languages, and technology stacks. By optimizing state-of-the-art ASR and generative AI models, it ensures accurate, lag-free speech and language processing. Gladia’s platform also enables real-time extraction of insights and metadata from calls and meetings, supporting key enterprise use cases such as sales assistance and automated customer support.
What inspired you to tackle the challenges in speech-to-text (STT) technology, and what gaps did you see in the market?
When I founded Gladia, the initial goal was broad—an AI company that would make complex technology accessible. But as we delved deeper, it became clear that voice technology was the most broken and yet most critical area to focus on.
Voice is central to our daily lives, and most of our communication happens through speech. Yet, the tools available for developers to work with voice data were inadequate in terms of speed, accuracy, and price—especially across languages.
I wanted to fix that, to unpack the complexity of voice technology and repackage it into something simple, efficient, powerful and accessible. Developers shouldn’t have to worry about the intricacies of AI models or the nuances of context length in speech recognition. My goal was to create an enterprise-grade speech-to-text API that worked seamlessly, regardless of the underlying model or technology—a true plug-and-play solution.
What are some of the unique challenges you encountered while building a transcription solution for enterprise use?
When it comes to speech recognition, speed and accuracy, the two key performance indicators in this field, are inversely related by design. This means that improving one will compromise the other, at least to some extent. Cost, to a large extent, follows from the provider’s choice between speed and quality.
When building Gladia, our goal was to find the right balance between these two factors, all while ensuring the technology remains available to startups and SMEs. In the process we also realized that foundational ASR models like OpenAI’s Whisper, which we worked with extensively, are biased, skewing heavily towards English due to their training data, which leaves a lot of languages under-represented.
So, in addition to solving the speed-accuracy tradeoff, it was important to us, as a European, multilingual team, to optimize and fine-tune our core models to build a truly global API that helps businesses operate across languages.
How does Gladia differentiate itself in the crowded AI transcription market? What makes your Whisper-Zero ASR unique?
Our new real-time engine (Gladia Real Time) achieves an industry-leading 300 ms latency. In addition to that, it’s able to extract insights from a call or meeting with the so-called “audio intelligence” add-ons or features, like named entity recognition (NER) or sentiment analysis.
To our knowledge, very few competitors are able to provide both transcription and insights at such low latency (less than 1s end-to-end), and do all of that accurately in languages other than English. Our language support extends to over 100 languages today.
We also put a special emphasis on making the product truly stack-agnostic. Our API is compatible with all existing tech stacks and telephony protocols, including SIP, VoIP, FreeSWITCH and Asterisk. Telephony protocols are especially complex to integrate with, so we believe this aspect of the product can bring tremendous value to the market.
Hallucinations in AI models are a significant concern, especially in real-time transcription. Can you explain what hallucinations are in the context of STT and how Gladia addresses this problem?
Hallucination usually occurs when the model lacks knowledge or doesn’t have enough context on the topic. Although models can produce outputs tailored to a request, they can only reference information that existed at the time of their training, and that may not be up-to-date. The model will create coherent responses by filling in gaps with information that sounds plausible but is incorrect.
While hallucinations first became known in the context of LLMs, they also occur with speech recognition models such as Whisper ASR, a leading model in the field developed by OpenAI. Whisper’s hallucinations resemble those of LLMs because the architectures are similar, so this is a problem that concerns generative models, which predict the words that follow based on the overall context. In a way, they ‘invent’ the output. This approach can be contrasted with more traditional, acoustic-based ASR architectures that match the input sound to output in a more mechanical way.
As a result, you may find words in a transcript that were not actually said, which is clearly problematic, especially in fields like medicine, where a mistake of this kind can have grave consequences.
There are several methods to manage and detect hallucinations. One common approach is to use a retrieval-augmented generation (RAG) system, which combines the model’s generative capabilities with a retrieval mechanism to cross-check facts. Another method involves employing a “chain of thought” approach, where the model is guided through a series of predefined steps or checkpoints to ensure that it stays on a logical path.
Another strategy for detecting hallucinations involves using systems that assess the truthfulness of the model’s output during training. There are benchmarks specifically designed to evaluate hallucinations, which involve comparing different candidate responses generated by the model and determining which one is most accurate.
We at Gladia have experimented with a combination of techniques when building Whisper-Zero, our proprietary ASR that removes virtually all hallucinations. It has shown excellent results in asynchronous transcription, and we’re currently optimizing it for real-time use to achieve the same 99.9% information fidelity.
STT technology must handle a wide range of complexities like accents, noise, and multi-language conversations. How does Gladia approach these challenges to ensure high accuracy?
Language detection in ASR is an extremely complex task. Each speaker has a unique vocal signature, which we call features. By analyzing the vocal spectrum, machine learning algorithms can perform classifications, using the Mel Frequency Cepstral Coefficients (MFCC) to extract the main frequency characteristics.
MFCC is a method inspired by human auditory perception. It’s part of the “psychoacoustic” field, focusing on how we perceive sound. It emphasizes lower frequencies and uses techniques like normalized Fourier decomposition to convert audio into a frequency spectrum.
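To make the point about emphasizing lower frequencies concrete, here is a minimal Python sketch of the mel scale that MFCC pipelines are built on. It uses the widely used 2595 · log10(1 + f/700) conversion; the function names, the 10-filter count, and the 0 to 8,000 Hz range are illustrative assumptions, not details of Gladia’s system.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595/700 formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> list:
    """Return filter center frequencies (Hz) spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 10 filters covering 0 to 8,000 Hz.
centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)

# Even spacing in mel means uneven spacing in Hz: the centers crowd into the
# low end of the spectrum, so low frequencies get finer resolution.
low_band = sum(1 for c in centers if c < 2000.0)
print([round(c) for c in centers])
print(f"{low_band} of {len(centers)} centers fall below 2,000 Hz")
```

A full MFCC pipeline would go on to apply these filters to a short-time Fourier spectrum, take logarithms, and apply a discrete cosine transform; the sketch only covers the perceptual frequency warping.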
However, this approach has a limitation: it’s based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
This is where Gladia’s innovative solution comes in. We’ve developed a hybrid approach that combines psychoacoustic features with content understanding for dynamic language detection.
Our system doesn’t just listen to how you speak, but also understands what you’re saying. This dual approach allows for efficient code-switching and prevents strong accents from being misinterpreted.
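One simple way to picture such a dual approach is score fusion: one signal estimates the language from how the audio sounds, another from what the recognized words are, and the two are blended. The Python sketch below is purely illustrative. The function names, the weighting scheme, and the example probabilities are assumptions, and it is not a description of how Gladia implements language detection.

```python
def fuse_language_scores(acoustic: dict, lexical: dict,
                         acoustic_weight: float = 0.4) -> dict:
    """Blend two per-language probability estimates with a weighted average.

    `acoustic` stands for a guess based on sound alone (prosody, spectrum);
    `lexical` stands for a guess based on the recognized words.
    Both inputs are hypothetical.
    """
    languages = set(acoustic) | set(lexical)
    fused = {
        lang: acoustic_weight * acoustic.get(lang, 0.0)
        + (1.0 - acoustic_weight) * lexical.get(lang, 0.0)
        for lang in languages
    }
    total = sum(fused.values())
    if total == 0:
        raise ValueError("no language received a non-zero score")
    # Renormalize so the fused scores sum to 1.
    return {lang: score / total for lang, score in fused.items()}


def detect_language(acoustic: dict, lexical: dict) -> str:
    """Return the language with the highest fused score."""
    fused = fuse_language_scores(acoustic, lexical)
    return max(fused, key=fused.get)


# Made-up example: English spoken with a strong French accent.
acoustic = {"fr": 0.6, "en": 0.4}   # the prosody sounds French
lexical = {"en": 0.9, "fr": 0.1}    # the words are clearly English

print(max(acoustic, key=acoustic.get))     # acoustic-only guess
print(detect_language(acoustic, lexical))  # fused guess
```

Running the same fusion per short segment rather than per recording is one way a system could follow a speaker who switches languages mid-conversation.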
Code-switching—which is among our key differentiators—is a particularly important feature in handling multilingual conversations. Speakers may switch between languages mid-conversation (or even mid-sentence), and the ability of the model to transcribe accurately on the fly despite the switch is critical.
The Gladia API is unique in its ability to handle code-switching across this many language pairs with a high level of accuracy, and it performs well even in noisy environments, which are known to reduce transcription quality.
Real-time transcription requires ultra-low latency. How does your API achieve less than 300 milliseconds latency while maintaining accuracy?
Keeping latency under 300 milliseconds while maintaining high accuracy requires a multifaceted approach that blends hardware expertise, algorithm optimization, and architectural design.
Real-time AI isn’t like traditional computing: it’s tightly linked to the power and efficiency of GPGPUs. I’ve been working in this space for nearly a decade, leading the AI division at OVHcloud (the biggest cloud provider in the EU), and have learned firsthand that it’s always about finding the right balance: how much hardware power you need, how much it costs, and how you tailor the algorithms to work seamlessly with that hardware.
Performance in real-time AI comes from effectively aligning our algorithms with the capabilities of the hardware, ensuring every operation maximizes throughput while minimizing delays.
But it’s not just the AI and hardware. The system’s architecture plays a big role too, especially the network, which can really impact latency. Our CTO, who has deep expertise in low-latency network design from his time at Sigfox (an IoT pioneer), has optimized our network setup to shave off valuable milliseconds.
So, it’s really a mix of all these factors—smart hardware choices, optimized algorithms, and network design—that lets us consistently achieve sub-300ms latency without compromising on accuracy.
Gladia goes beyond transcription with features like speaker diarization, sentiment analysis, and time-stamped transcripts. What are some innovative applications you’ve seen your clients develop using these tools?
ASR unlocks a wide range of applications for platforms across verticals, and it’s been amazing to see how many truly pioneering companies have emerged in the last two years, leveraging LLMs and our API to build cutting-edge, competitive products. Here are some examples:
- Smart note-taking: Many clients are building tools for professionals who need to quickly capture and organize information from work meetings, student lectures, or medical consultations. With speaker diarization, our API can identify who said what, making it easy to follow conversations and assign action items. Combined with time-stamped transcripts, users can jump straight to specific moments in a recording, saving time and ensuring nothing gets lost in translation.
- Sales enablement: In the sales world, understanding customer sentiment is everything. Teams are using our sentiment analysis feature to gain real-time insights into how prospects respond during calls or demos. Plus, time-stamped transcripts help teams revisit key parts of a conversation to refine their pitch or address client concerns more effectively. For this use case in particular, NER is also key to identifying names, company details, and other information that can be extracted from sales calls to feed the CRM automatically.
- Call center assistance: Companies in the contact center space are using our API to provide live assistance to agents, as well as to flag customer sentiment during calls. Speaker diarization ensures that what is said is attributed to the right person, while time-stamped transcripts enable supervisors to review critical moments or compliance issues quickly. This not only improves the customer experience, with a better on-call resolution rate and quality monitoring, but also boosts agent productivity and satisfaction.
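All three examples above rely on the same underlying data: utterances tagged with a speaker and a time range. The Python sketch below shows how an application might work with such output. The `Utterance` structure, its field names, and the sample call are assumptions for illustration, not Gladia’s actual response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """One diarized, time-stamped segment (hypothetical structure)."""
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def talk_time_by_speaker(utterances: list) -> dict:
    """Total speaking time per speaker, in seconds."""
    totals: dict = {}
    for u in utterances:
        totals[u.speaker] = totals.get(u.speaker, 0.0) + (u.end - u.start)
    return totals


def utterance_at(utterances: list, t: float) -> Optional[Utterance]:
    """Return the utterance being spoken at time t, if any."""
    for u in utterances:
        if u.start <= t < u.end:
            return u
    return None


# Made-up support call.
call = [
    Utterance("agent", 0.0, 4.0, "Thanks for calling, how can I help?"),
    Utterance("customer", 4.0, 9.5, "My invoice shows the wrong amount."),
    Utterance("agent", 9.5, 12.0, "Let me check that for you."),
]

print(talk_time_by_speaker(call))
print(utterance_at(call, 5.0).text)  # jump straight to a moment in the call
```

With this shape, assigning action items to a speaker or letting a supervisor jump to a specific moment becomes a simple lookup over the list.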
Can you discuss the role of custom vocabularies and entity recognition in improving transcription reliability for enterprise users?
Many industries rely on specialized terminology, brand names, and unique language nuances. Custom vocabulary integration allows the STT solution to adapt to these specific needs, which is crucial for capturing contextual nuances and delivering output that accurately reflects your business needs. For instance, it allows you to create a list of domain-specific words, such as brand names, in a specific language.
Why it’s useful: Adapting the transcription to the specific vertical allows you to minimize errors in transcripts, achieving a better user experience. This feature is especially critical in fields like medicine or finance.
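As a rough illustration of the idea, the Python sketch below snaps near-miss words in a transcript onto a list of domain terms using fuzzy string matching from the standard library. This is a simplified post-processing stand-in chosen for clarity; the function name, the 0.75 similarity cutoff, and the sample sentence are assumptions, and production systems typically apply custom vocabulary inside the recognition model rather than after the fact.

```python
import difflib


def apply_custom_vocabulary(transcript: str, vocabulary: list,
                            cutoff: float = 0.75) -> str:
    """Replace words that closely resemble a vocabulary term with that term.

    Matching is case-insensitive; the vocabulary's own spelling and casing
    are used in the output. Words with no close match are left untouched.
    """
    canonical = {term.lower(): term for term in vocabulary}
    corrected = []
    for word in transcript.split():
        matches = difflib.get_close_matches(
            word.lower(), list(canonical), n=1, cutoff=cutoff
        )
        corrected.append(canonical[matches[0]] if matches else word)
    return " ".join(corrected)


# Made-up transcript with two misrecognized brand names.
vocabulary = ["Gladia", "Asterisk"]
raw = "please call gladya about the asterix setup"
print(apply_custom_vocabulary(raw, vocabulary))
```

The cutoff controls the trade-off: a lower value catches more misspellings but risks rewriting ordinary words that merely look similar to a domain term.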
Named entity recognition (NER) extracts and identifies key information from unstructured audio data, such as names of people, organizations, locations, and more. A common challenge with unstructured data is that this critical information isn’t readily accessible—it’s buried within the transcript.
To solve this, Gladia developed a structured Key Data Extraction (KDE) approach. By leveraging the generative capabilities of its Whisper-based architecture—similar to LLMs—Gladia’s KDE captures context to identify and extract relevant information directly.
This process can be further enhanced with features like custom vocabulary and NER, allowing businesses to populate CRMs with key data quickly and efficiently.
In your opinion, how is real-time transcription transforming industries such as customer support, sales, and content creation?
Real-time transcription is reshaping these industries in profound ways, driving incredible productivity gains, coupled with tangible business benefits.
First, real-time transcription is a game-changer for support teams. Real-time assistance is key to improving the resolution rate thanks to faster responses, smarter agents, and better outcomes (in terms of NPS, handle times, and so on). As ASR systems get better and better at handling non-English languages and performing real-time translation, contact centers can achieve a truly global CX at lower cost.
In sales, speed and spot-on insights are everything. As with call center agents, real-time transcription equips sales reps with the right insights at the right time, enabling them to focus on what matters most in closing deals.
For creators, real-time transcription is perhaps less relevant today, but still full of potential, especially when it comes to live captioning and translation during media events. Most of our current media customers still prefer asynchronous transcription, as speed is less critical there, while accuracy is key for applications like time-stamped video editing and subtitle generation.
Real-time AI transcription seems to be a growing trend. Where do you see this technology heading in the next 5-10 years?
I feel like this phenomenon, which we now call real-time AI, is going to be everywhere. Essentially, what we really refer to here is the seamless ability of machines to interact with people, the way we humans already interact with one another.
And if you look at any Hollywood movie (like Her) set in the future, you’ll never see anyone there interacting with intelligent systems via a keyboard. For me, that serves as the ultimate proof that in the collective imagination of humanity, voice will always be the primary way we interact with the world around us.
Voice, as the main vector to aggregate and share human knowledge, has been part of human culture and history for much longer than writing. Then, writing took over because it enabled us to preserve our knowledge more effectively than relying on the community elders to be the guardians of our stories and wisdom.
GenAI systems, capable of understanding speech, generating responses, and storing our interactions, brought something completely new to the space. It’s the best of both worlds and the best of humanity, really. It gives us the unique power and energy of voice communication with the benefit of memory, which previously only written media could secure for us. This is why I believe it’s going to be everywhere: it’s our ultimate collective dream.
Thank you for the great interview. Readers who wish to learn more should visit Gladia.