Is the new AI model really better than ChatGPT?

By Michael G. Madden | December 16, 2023


Google DeepMind recently announced Gemini, a new AI model that will rival OpenAI’s ChatGPT. While both models are examples of “generative AI”, which learn patterns from their input training data in order to generate new data (images, words or other media), ChatGPT is a large language model (LLM) that focuses on producing text.

Just as ChatGPT is a web app for conversations that is based on a neural network known as GPT (trained on large amounts of text), Google has a conversational web app called Bard, which is based on a model called LaMDA (trained on dialogue). But Google is now upgrading it to be based on Gemini.

What distinguishes Gemini from previous models of generative AI such as LaMDA is that it is a “multimodal model.” This means it works directly with multiple input and output modes: in addition to supporting text input and output, it also supports images, audio and video. Accordingly, a new acronym is emerging: LMM (large multimodal model), not to be confused with LLM.

In September, OpenAI announced a model called GPT-4 Vision (GPT-4V), which can also work with images, audio and text. However, it is not a fully multimodal model in the way that Gemini promises to be.

For example, while ChatGPT-4, which is powered by GPT-4V, can work with audio inputs and generate speech outputs, OpenAI has confirmed that this is done by converting speech to text on input using another deep learning model called Whisper. ChatGPT-4 also converts text to speech on output using a different model, meaning that GPT-4V itself works purely with text.

Similarly, ChatGPT-4 can also generate images, but it does so by creating text prompts that are passed to a separate deep learning model called DALL-E 2, which converts text descriptions into images.
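
To make the contrast concrete, here is a minimal sketch of that “chained models” pipeline, written against OpenAI’s public Python SDK. It is only an illustration under assumptions (the model names, file names and single-turn flow are mine), not a description of how ChatGPT is actually wired up internally.

```python
# Sketch of the chained-models approach described above: speech is transcribed
# by one model (Whisper), a text-only LLM produces the reply, and separate
# models turn that reply into speech or an image. Model names and calls follow
# OpenAI's public Python SDK (v1.x) and are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech -> text with a separate transcription model (Whisper).
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. The language model itself only ever sees text.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
).choices[0].message.content

# 3a. Text -> speech with a separate text-to-speech model.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.stream_to_file("reply.mp3")

# 3b. Text -> image with a separate image model, driven by a text prompt.
image = client.images.generate(model="dall-e-2", prompt=reply, n=1)
print(image.data[0].url)
```

The point of the sketch is simply that each modality is handled by a separate model stitched together around a text-only core, which is exactly the design a natively multimodal model avoids.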

In contrast, Google designed Gemini to be “natively multimodal.” This means the core model directly handles a range of input types (audio, images, video and text) and can directly generate them as outputs too.

The verdict

The distinction between these two approaches may seem academic, but it is important. The general conclusion from Google’s technical report and other qualitative tests to date is that the current publicly available version of Gemini, called Gemini 1.0 Pro, is not as good as GPT-4 overall, and is more similar to GPT-3.5 in its capabilities.

Google also announced a more powerful version of Gemini, which it calls Gemini 1.0 Ultra, and offered some results showing that it is more powerful than GPT-4. However, this is difficult to evaluate for two reasons. The first reason is that Google hasn’t released Ultra yet, so the results cannot be independently verified at this time.

The second reason why Google’s claims are difficult to evaluate is that it chose to release a somewhat deceptive promotional video, in which the Gemini model is shown commenting interactively and fluently on a live video stream.

However, as Bloomberg first reported, the demonstration in the video was not carried out in real time. For example, the model had been taught beforehand to do certain specific tasks, such as the three cups and ball trick, in which Gemini keeps track of which cup the ball is under. To do this, it had been provided with a sequence of still images in which the presenter’s hands are on the cups being swapped.

Promising

Despite these issues, I believe Gemini and large multimodal models are an extremely exciting step forward for generative AI. This is because of both their future capabilities and the competitive landscape of AI tools. As I noted in a previous article, GPT-4 was trained on approximately 500 billion words, essentially all of the good-quality, publicly available text.

The performance of deep learning models is generally driven by increases in model complexity and in the amount of training data. This has raised the question of how further improvements could be achieved, since we have almost run out of new training data for language models. However, multimodal models open up enormous new reserves of training data in the form of images, audio and video.

AI such as Gemini, which can be trained directly on all of this data, is likely to have much greater capabilities in future. For example, I would expect models trained on video to develop sophisticated internal representations of what is called “naive physics”: the basic understanding that humans and animals have of causality, motion, gravity and other physical phenomena.

I’m also excited about what this means for the competitive landscape of AI. Over the past year, despite the emergence of many generative AI models, OpenAI’s GPT models have dominated and demonstrated a level of performance that other models cannot approach.

Google’s Gemini signals the emergence of a major competitor that will move the field forward. Of course, OpenAI is almost certainly working on GPT-5, and we can expect that it too will be multimodal and demonstrate notable new capabilities.


All that being said, I’m excited to see very large multimodal models emerge that are open source and non-commercial, and I’m hopeful they are on the way in the coming years.

I also like some of the features of the Gemini app. For example, Google announced a version called Gemini Nano, which is much lighter and can run directly on mobile phones.

Lightweight models like this reduce the environmental impact of AI computing and have many privacy benefits, and I’m sure this development will lead competitors to follow suit.

This article is republished from The Conversation under a Creative Commons license. Read the original article.


Michael G. Madden does not work for, consult for, own shares in, or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond his academic duties.
