On Wednesday, Google’s advanced artificial intelligence lab DeepMind announced its latest large language model (LLM): Gemini 2.0. Although only limited functionality is currently available to the public—through a “flash experimental” model—Google’s press release highlighted a few key features it has given developers access to. Tech companies are constantly boasting about their incredible new advances, but is this latest release an enormous innovative leap, or just another overhyped minor improvement to existing technology?
Gemini 2.0 currently outpaces Gemini 1.5 on most key benchmarks, showing vast improvements particularly in advanced mathematical problem-solving and code generation. However, it’s worth noting that it performed significantly worse on reading comprehension for long texts and declined slightly in speech-to-text translation.
The most prominent new features come from Gemini’s expanded multimodality. Previously able to handle static text, audio, image, and video input, it can now comprehend live video input, opening up opportunities for more advanced, low-latency use cases. Additionally, Google engineers paired it with Imagen 3, their most advanced image generation tool, allowing it to draft complex prompts for the user and generate or modify images. They’ve also announced plans to add audio generation capabilities in the near future.
As AI continues to integrate into our daily lives, applications for these new technologies will continue to emerge. Currently, Google is focused on Project Astra, its plan to develop a competent virtual assistant with a deeper level of interaction, using live video and audio input to facilitate more efficient and humanlike conversation. They’ve also announced plans to collaborate with Apple on off-device Apple Intelligence integration for uses such as Siri, autocomplete, and improved search results.
Although it’s difficult to directly compare the latest Gemini model to competing large language models, artificialanalysis.ai ranked it third in overall quality, just below OpenAI’s o1-preview and o1-mini models. Despite this high ranking, it was only one point ahead of Gemini 1.5 in normalized average. And although Google’s models have consistently had the largest context windows among the competition—allowing them to process significantly more data per request—the effects of this choice haven’t been thoroughly evaluated within the field because of the complexity of the task.
Google has announced that they’ve been collaborating with developers on this latest model for a few months. They point to major upcoming developments, including direct integration into GitHub workflows for developers, introducing the models as guides inside video games, and Project Mariner: a tool designed to give the model access to web browsing and let it complete tasks, a relatively new use case for this technology.
Google has also highlighted their commitment to ensuring safety with this innovation. They have a Responsibility and Safety Committee that constantly assesses whether their LLMs have too much power in a given system and ensures that their outputs remain equitable and morally “good.”
Although this model’s expanded modality and claimed improvements are impressive on paper, the broader implications of this advancement have yet to be explored, and only time will tell whether this release will be deemed a significant breakthrough in the ever-evolving landscape of neural networks.