ChatGPT now also understands images and voice commands

OpenAI is constantly improving its ChatGPT chatbot. The new version lets users interact with ChatGPT through voice and images as well, which raises new questions and concerns. So what does the new version bring, and when?

Most of the changes that OpenAI introduces to ChatGPT relate to what the AI-powered bot can do: what questions it can answer, what information it can access, and so on. This time, however, the company is also changing the way you use ChatGPT. It is introducing a new version of the service that lets you interact with the AI bot not only by typing sentences into a text field, but also by talking to it or simply uploading a picture. The new features will be available to Plus subscribers in the coming weeks, while others will receive the new functionality "soon after".

The voice part is nothing earth-shatteringly new: you tap a button and say your question, ChatGPT converts it to text and passes it to a large language model, retrieves the answer, converts it back to speech, and replies by voice. It should feel like talking to Alexa or the Google Assistant, except that, OpenAI hopes, the answers will be better because of the improved underlying technology. Most virtual assistants seem to be reinventing themselves around large language models, and OpenAI is one step ahead of them all for now.
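
The three-stage flow described above (speech to text, language model, text back to speech) can be illustrated with a minimal Python sketch. Nothing here reflects OpenAI's actual code or API: the VoiceAssistant class and the three fake_* functions are hypothetical stand-ins invented for illustration, and the "audio" is simply UTF-8 bytes so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Hypothetical pipeline: speech-to-text, language model, text-to-speech."""
    transcribe: Callable[[bytes], str]   # audio in, question text out
    generate: Callable[[str], str]       # question text in, answer text out
    synthesize: Callable[[str], bytes]   # answer text in, audio out

    def ask(self, audio_question: bytes) -> bytes:
        question = self.transcribe(audio_question)  # stage 1: speech to text
        answer = self.generate(question)            # stage 2: language model
        return self.synthesize(answer)              # stage 3: text to speech


# Toy stand-ins (assumptions, not real models): "audio" is just UTF-8 bytes.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(question: str) -> str:
    return f"You asked: {question}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_transcribe, fake_generate, fake_synthesize)
reply_audio = assistant.ask(b"What is the weather?")
print(reply_audio.decode("utf-8"))  # prints "You asked: What is the weather?"
```

Because each stage is passed in as a plain function, any one of them could in principle be swapped for a real speech or language model without touching the other two.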

OpenAI's excellent Whisper model does much of the speech-to-text conversion, and the company is also introducing a new text-to-speech model that is said to be able to create "human-like audio from just text and a few seconds of sample speech." You'll be able to choose a voice for ChatGPT from five options, but OpenAI seems to think the model has much more potential. For example, OpenAI is working with Spotify to translate podcasts into other languages while preserving the sound of the host's voice. There are a lot of interesting uses for synthetic voices, and OpenAI could be a big part of that industry.

Still, the fact that you can create a decent synthetic voice with just a few seconds of audio opens the door to all sorts of potentially problematic use cases. "These capabilities present new threats, such as the possibility of malicious actors impersonating public figures and the like," the company's blog post announcing the new features said. For this very reason, the model is not available for wider use and will instead be tightly controlled and limited to specific use cases and partnerships.

The image search feature is somewhat similar to Google Lens. You snap a photo, and ChatGPT tries to understand what you're asking and respond accordingly. You can also use the drawing tool in the app to make the question as clear as possible, or speak or type questions related to the picture. This is where the conversational nature of ChatGPT comes in particularly handy: instead of running a search, getting the wrong answer, and then running a new search, you can nudge the bot and refine the answer as you go. This is very similar to what Google is doing with multimodal search.

Obviously, including images in ChatGPT also has its disadvantages. One of them arises when you ask ChatGPT about a person: OpenAI says it has deliberately limited “ChatGPT's ability to analyze and make direct statements about people”, both for accuracy and for privacy. That means one of the most sci-fi visions of artificial intelligence, the ability to look at someone and tell who they are, won't be a reality any time soon. Which is probably a good thing.

Almost a year after ChatGPT's launch, it seems that OpenAI is still trying to figure out how to give its model more features and capabilities without creating new problems and downsides. With its new releases, the company has tried to walk that fine line by consciously limiting what its new models can do. But this approach will not always work. As more and more people use voice control and image search, and as ChatGPT moves closer to becoming a truly multimodal, useful virtual assistant, it will become increasingly difficult to maintain all of these safeguards.