Dolphins, lyrebirds, bats, mockingbirds, whales, and elephants may live in entirely different environments, but they have one thing in common: the power to communicate through sound. Some of it is ultrasonic, some infrasonic, some mimicry, and some not far from baby babble, but sound is always a crucial part of their existence, sometimes even of their survival.
Meta Spirit LM, GPT-4o, Gnani, DeepL and Sutra HiFi. These names seem to belong to entirely different AI forests, yet sound is the common denominator running across them. In one way or another, many players, small and big, have now thrown a voice-dominant model into the language model (LM) ring. It is not hard to understand why when one looks at the apparent advantages. But are these models also good enough to fix some of the deep-seated issues their elder siblings have faced? Or will the same old cough just sound different this time?
No More Tone-Deaf LMs
The challenge with existing AI models was that they first had to convert speech to text through direct or multi-modal approaches, feed that text into a language model, and then turn the response back into audio with text-to-speech techniques. This consumed time. This took up compute power. This needed data inputs. But above everything else, the process still missed out on subtle aspects like pitch, tone, emotion and other layers of subtext in the human voice. Not to mention the sheer diversity of accents, dialects and vernacular speech, especially in a multi-cultural country like India.
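To see why this cascade leaks both time and nuance, here is a minimal sketch of such a three-stage pipeline. The function names and data structure are illustrative placeholders, not any vendor's actual API; the point is that prosody never survives the first hop.

```python
from dataclasses import dataclass

@dataclass
class AudioClip:
    samples: list      # raw waveform
    pitch: str         # e.g. "rising", "flat"
    emotion: str       # e.g. "excited", "neutral"

def speech_to_text(clip: AudioClip) -> str:
    # Stage 1: transcription. Pitch and emotion are discarded here,
    # because the output is plain text.
    return "book me a ticket to chennai"

def language_model(prompt: str) -> str:
    # Stage 2: a text-only LLM reasons over the transcript.
    # It never sees how the words were said, only what was said.
    return "Sure, which date would you like to travel?"

def text_to_speech(text: str) -> AudioClip:
    # Stage 3: synthesis. The reply's tone is generic because the
    # speaker's emotion was lost back in stage 1.
    return AudioClip(samples=[], pitch="flat", emotion="neutral")

# Three models, three hops, three chances to add latency and drop nuance.
reply = text_to_speech(language_model(speech_to_text(
    AudioClip(samples=[], pitch="rising", emotion="excited"))))
print(reply.emotion)  # "neutral" -- the excitement never made it through
```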
Such leaks could prove expensive in more than one way. Consider the big picture. The AI voice generator market is expected to grow from USD 3 billion in 2024 to USD 20.4 billion by 2030, as per Markets and Markets. The global voice and speech recognition market, valued at USD 14.8 billion in 2024, is projected to reach USD 17.33 billion in 2025 and USD 61.27 billion by 2033, as per Straits Research.
“The future of customer interactions will rely heavily on human-centric video and voice bots, which will transform how one engages with technology.” - ANKUSH SABHARWAL, Founder, BharatGPT.ai
Similarly, Grand View Research estimates that the global voice and speech recognition market, valued at USD 20.25 billion in 2023, could grow at a CAGR of 14.6% between 2024 and 2030. As projected by Meticulous Research, the speech and voice recognition market will be worth USD 56.07 billion by 2030. While the drivers here are a surge in voice biometrics and voice-enabled devices, the constraints observed include a lack of accuracy with regional accents.
That explains why several companies are trying to plug this gap with voice-driven LMs. In the last few months, many have launched their voice play in some form or another. In India, Gnani.ai has launched a speech-to-speech large language model (LLM) powered by the NVIDIA AI-accelerated computing platform and claims to handle over 10 million voice interactions daily. On the global front, GPT-4o enables an advanced voice mode in ChatGPT and is expected to bring down the average voice-mode latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). When time is no longer spent on three separate models, audio to text, text in and text out, and converting that text back to audio, and everything is instead processed over the same neural network, the wait shrinks. Just what this space needed, right?
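A quick back-of-the-envelope comparison shows where that time goes. The per-stage figures below are assumed for illustration (only their 2.8-second total matches the GPT-3.5 voice-mode latency quoted above); they are not measured numbers from any of these products.

```python
# Hypothetical per-stage delays for a cascaded voice pipeline. Each hop
# waits for the previous model, so the delays simply add up.
cascade_ms = {"speech_to_text": 800, "text_llm": 1500, "text_to_speech": 500}
print(sum(cascade_ms.values()), "ms end to end across three separate models")  # 2800 ms

# A speech-to-speech model has a single hop, so its latency floor is one
# model's inference time (the figure below is purely illustrative).
single_model_ms = 700
print(single_model_ms, "ms when everything runs over the same neural network")
```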
Not far away is DeepL Voice, which works on multilingual real-time interactions for virtual and face-to-face settings. Meta has brought in Meta Spirit LM, its first open-source multimodal language model for seamlessly integrating text and speech inputs and outputs. Hume EVI 2, ElevenLabs, Sutra; the list goes on.
The idea is not just to reduce latency but also to capture the minute aspects of speech. Spirit LM Base, for instance, is supposed to use phonetic tokens to process and generate speech, while Spirit LM Expressive adds tokens for pitch and tone and for more nuanced emotional states, such as excitement or sadness. SUTRA HiFi also claims its dual diffusion-transformer architecture decouples unique voice tones from language-specific intonations and accents, which helps it understand and generate voice interactions in over 30 languages, especially Indic languages and regional dialects.
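As a rough illustration of what an "expressive" token stream adds over a purely phonetic one, consider the toy sequences below. The token names are invented for this sketch and do not reflect Meta's actual vocabulary or anyone else's.

```python
# Toy token streams. A base speech LM sees only phonetic units, while an
# expressive variant interleaves pitch and style markers, so emotion is
# part of the sequence the model actually predicts rather than an
# afterthought. All token names here are invented for illustration.
base_stream = ["[ph_34]", "[ph_91]", "[ph_12]", "[ph_77]"]          # phonetic only
expressive_stream = [
    "[pitch:high]", "[style:excited]",                               # prosody markers
    "[ph_34]", "[ph_91]",
    "[pitch:falling]",
    "[ph_12]", "[ph_77]",
]

def carries_prosody(stream: list) -> bool:
    """Return True if any token in the stream encodes pitch or style."""
    return any(tok.startswith(("[pitch", "[style")) for tok in stream)

print(carries_prosody(base_stream))        # False: tone must be guessed downstream
print(carries_prosody(expressive_stream))  # True: tone is modelled directly
```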
So, do such models still struggle with challenges like bias, accuracy, scale, and power consumption, the way their text predecessors have?
Is the Benadryl Here?
Previous LMs wrestled with a host of issues. While voice LMs are fixing the challenges of latency and depth, it is interesting to ask whether these LMs can solve energy usage and context issues without sacrificing speed and applicability.
Ganesh Gopalan, Co-Founder and CEO of Gnani.ai, explains how this works. “Data training in voice-to-voice LLMs involves a two-phase process: pre-training and fine-tuning. The model is exposed to a massive 13 million audio hours dataset covering diverse linguistic, phonetic, and acoustic features across 14 languages during pre-training. This broad dataset helps the model build a foundational understanding of speech patterns, accents, intonations, and contexts. Following this, fine-tuning uses a more focused 1.5-million-hour dataset to align the model with specific tasks such as translation, transcription, or conversational capabilities. This approach ensures the model is both broadly versatile and adept at handling specialised applications, offering high accuracy and adaptability in speech-to-speech scenarios.”
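Gopalan's two-phase recipe can be summarised as a small configuration sketch. The hour counts, language count, and task list come from his description above; the field names and structure are assumptions for illustration, not Gnani.ai's actual pipeline.

```python
# Illustrative training plan built from the figures quoted above.
# Field names and stages are assumptions, not Gnani.ai's real setup.
training_plan = {
    "pretraining": {
        "audio_hours": 13_000_000,   # broad corpus: accents, intonations, contexts
        "languages": 14,
        "objective": "foundational speech-pattern modelling",
    },
    "finetuning": {
        "audio_hours": 1_500_000,    # narrower, task-focused corpus
        "tasks": ["translation", "transcription", "conversation"],
        "objective": "align the base model with specific applications",
    },
}

# The fine-tuning corpus is a small slice of the pre-training one, which
# is what keeps the model broadly versatile yet specialised.
ratio = (training_plan["finetuning"]["audio_hours"]
         / training_plan["pretraining"]["audio_hours"])
print(f"Fine-tuning data is ~{ratio:.1%} of the pre-training data")  # ~11.5%
```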
“By tailoring the training process to specific domains, voice-based models can be better equipped to understand industry-specific terminology and nuances.” - GANESH GOPALAN, Co-Founder & CEO, Gnani.ai
Gaurav Pathak, Vice President of Product Management, Metadata and CLAIRE, Informatica, explains how CLAIRE GPT aims to democratise data management by making data accessible and understandable to a broader audience. “This is achieved through a natural language interface wherein users can interact with CLAIRE GPT using everyday language instead of complex SQL queries, Python, or other programming languages. This feature makes data discovery and exploration more intuitive and accessible to business users with limited technical expertise. There is also the element of a reasoning agent for interpreting user intent from natural language queries.”
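Below is a minimal sketch of the kind of natural-language data interface Pathak describes: a stubbed "reasoning agent" maps a plain-English question to SQL and runs it on the user's behalf. Everything here, from the function name to the tiny in-memory table, is a hypothetical illustration, not CLAIRE GPT's implementation.

```python
import sqlite3

def interpret_intent(question: str) -> str:
    """Stand-in for the reasoning agent: map a question to SQL.
    A real system would call an LLM; here one mapping is hard-coded."""
    if "top customers" in question.lower():
        return "SELECT name, revenue FROM customers ORDER BY revenue DESC LIMIT 3"
    raise ValueError("intent not understood")

# Tiny in-memory database so the sketch runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Asha", 120.0), ("Ravi", 95.5), ("Meera", 210.0)])

sql = interpret_intent("Who are our top customers by revenue?")
for row in conn.execute(sql):
    print(row)   # business users never see or write the SQL themselves
```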
Ankush Sabharwal, Founder of BharatGPT.ai and CoRover.ai, is upbeat about the maturity of Voice LMs. He argues that they are good in areas like accuracy and bias reduction. “On accuracy fronts, there is enhanced speech recognition and language understanding. Also, diverse training data mitigates biases. There is better contextual understanding through multi-modal processing. Plus, they are great for Multi-cultural Environments with support for various languages and dialects.”
“Data training involves large datasets of diverse voice samples, often sourced from public repositories or proprietary collections. These datasets are pre-processed, tokenised, and fine-tuned for voice-to-voice models,” he explains. On compute requirements, Sabharwal reasons that optimised architectures with a composite AI approach reduce computational needs.
By tailoring the training process to specific domains like BFSI, these models are better equipped to understand industry-specific terminology and nuances, significantly improving the performance of voice bots in customer interactions, maintains Gopalan. “This domain-specific adaptation minimises errors, enhances contextual relevance, and reduces biases that might arise from generalised training. Additionally, continuous efforts to optimise inference costs ensure the models remain efficient and cost-effective, making them more accessible and practical for widespread deployment.”
They are not devoid of issues, though. Sabharwal points out that VideoBots and VoiceBots still face challenges like latency, token synchronisation, realistic avatar movements, and user data privacy protection. Nonetheless, Sabharwal opines that the future of customer interactions will rely heavily on human-centric video and voice bots, transforming how one engages with technology.
That is, as long as these voice LMs can keep up with a species whose ever-changing vocabulary would have given even Darwin a persistent headache: Gen Z, for whom dolphin memes are more popular than dolphins, and mean something else altogether.
By Pratima Harigunani
pratimah@cybermedia.co.in