A multimodal odyssey beyond words

From OpenAI’s GPT-4 to Google’s Gemini Pro, AI is fast becoming multimodal, integrating voice, text, image, and video for more human-like interactions.

VoicenData Bureau


On May 14, Google unveiled its latest suite of AI tools, services, and models—including its new large language model, Gemini 1.5 Pro. This announcement came just a day after OpenAI’s latest release in the Generative AI (Gen AI) battlefield, emphasising one essential factor: multimodality. This marks a significant milestone in the evolution of Gen AI this year, where AI tools are now being designed with multimodality from the ground up.

But why is this important? For starters, multimodality refers to the ability of an AI algorithm to understand multiple ‘modes’ of input. Traditionally, AI models were designed to process text queries and generate text responses. Over time, these models advanced to understand and create images based on text or image queries.

The latest advancements include generating videos from a few words of text, as well as cloning voices and generating audio snippets from just a few seconds of sample audio.


To truly emulate human interaction, algorithms must understand multiple languages and interpret images, spoken words, voice snippets, and videos.

With these capabilities, AI can now comprehend and generate inputs across all formats within a human-computer interface (HCI). This progress brings AI closer to seamlessly integrating texts, images, audio, voice, and data, enhancing overall user experience and interaction.



First, Why Multimodal AI?

To understand the rise of multimodal AI, it helps to understand why it matters in the first place. In 2017, eight Google researchers published the paper ‘Attention Is All You Need’, detailing their work on the transformer AI model. The idea behind this technology was deceptively simple: can AI contextually understand queries, process the information as humans do when asked a question, and respond in a human-like way?

This task is more challenging than it appears. Algorithms are designed to execute commands and tasks, regardless of their complexity. However, human commands are often imperfectly structured. As a result, in most human-machine interactions, such as with chatbots, queries not phrased as expected by the algorithm often lead to unsatisfactory responses.


The transformer model, which underpins what is now commonly known as generative AI, set out to address this issue. To truly emulate human interaction, algorithms must handle more than just English text queries. Companies require algorithms that understand multiple languages and interpret queries through various modes of communication—images, spoken words, voice snippets, and even videos.

This necessity gave rise to multimodal generative AI. While the importance of this approach was widely recognised, the real challenge lay in the enormous volume of data required to train such models.

The Hunt for the World’s Data


In OpenAI’s presentation of its latest large language model (LLM), GPT-4o, the company described it as a “flagship model that can reason across audio, vision, and text in real time.” A slightly more detailed description of the need for the model states that it is “a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.”

Amplifying this further, on 14 May, Google and Alphabet CEO Sundar Pichai said that its latest model, Gemini 1.5 Pro, is “designed to be multimodal from the ground up.” Adding to this versatility, Pichai mentioned that Google is doubling the model’s context window—essentially, the amount of text, or the length of an audio or video clip, that the latest Gemini model can process in a single query.

While these advancements demonstrate significant progress toward making machines more human-like and cognizant, substantial challenges remain—those related to data, copyrights, cost, and computing power.


Accessing data to train AI is becoming increasingly complex as regulators worldwide address concerns that companies are scraping copyrighted material.

Training an AI algorithm to understand such extensive context requires tech companies to use enormous amounts of data. The more data an AI model is trained on, the better its chances of providing contextually accurate responses. This need is amplified by multimodality; for reference, even for just text and image inputs, OpenAI’s GPT-4 model is estimated to have around 1.75 trillion parameters.

The data required should, in theory, increase exponentially when images, videos, and audio are incorporated into an AI tool. While this still holds true, most of the popular LLMs have received substantial training from public usage over the past two years. Even so, billions of data points are necessary to train AI models to accurately interpret inputs such as images with scribbled handwriting, low-quality video snippets, or garbled voice clips with background noise.


This vast data requirement is crucial for making AI truly usable, for instance, in aiding vocally impaired individuals in accessing websites and conducting transactions. Although current AI models are bringing us closer to this reality, the challenges of data acquisition and processing costs remain significant obstacles.

Why the Challenge?

Sourcing the vast amounts of visual and auditory data required for multimodal AI is a significant challenge, and it demands substantial financial resources. OpenAI CEO Sam Altman has admitted to spending billions of dollars to bring the company’s GPT models to their current state. Moreover, access to this data is becoming increasingly complex as regulators worldwide address artists’ concerns that tech companies are scraping copyrighted material from the open Internet to train their AI models without compensating the original audio, image, or video content creators.

These practices have raised ethical questions about tech companies’ actions, highlighting a murky area in technology. A central question is who truly owns the content on social media platforms. Is it the user who created it or the platform that distributes it, or is any content published on social media fair game for reuse unless explicitly stated otherwise?

While regulations have largely left these areas ambiguous, more nations are seeking answers to address these concerns. Can AI withstand the wave of copyright conflicts coming its way? Privacy and copyright advocates argue that tech companies should not evade responsibility by citing commercial interests and withholding details about their data sources.

Gemini 1.5 Pro marks a milestone in the evolution of Gen AI, where AI tools are being designed with multimodality from the ground up.

The long answer may be even more complicated. Add the cost factor to this argument, and the matter becomes exponentially more complex since only a handful of companies can afford the capital required to train such AI models.

Is Multimodality Worth These Costs and Compromises?

It may be too soon to tell, but in the long run, the answer will largely depend on how commercially successful and socially impactful these AI models become. 

By Vernika Awal