‘Visual’ AI models might not see anything at all

Image Credits: Bryce Durbin / TechCrunch

The latest round of language models, like GPT-4o and Gemini 1.5 Pro, are touted as “multimodal,” able to understand images and audio as well as text. But a new study makes clear that they don’t really see the way you might expect. In fact, they may not see at all.

To be clear at the outset, no one has made claims like “This AI can see like people do!” (Well, perhaps some have.) But the marketing and benchmarks used to promote these models use phrases like “vision capabilities,” “visual understanding,” and so on. They talk about how the model sees and analyzes images and video, so it can do anything from homework problems to watching the game for you.

So although these companies’ claims are artfully couched, it’s clear that they want to express that the model sees in some sense of the word. And it does — but kind of the same way it does math or writes stories: matching patterns in the input data to patterns in its training data. This leads to the models failing in the same way they do on certain other tasks that seem trivial, like picking a random number.

A study of current AI models’ visual understanding, informal in some ways but systematic, was undertaken by researchers at Auburn University and the University of Alberta. They tested the biggest multimodal models on a series of very simple visual tasks, like asking whether two shapes overlap, or how many pentagons are in a picture, or which letter in a word is circled. (The researchers have also published a summary micropage.)

They’re the kind of thing that even a first-grader would get right, yet they gave the AI models great difficulty.

“Our seven tasks are extremely simple, where humans would perform at 100% accuracy. We expect AIs to do the same, but they are currently NOT,” wrote co-author Anh Nguyen in an email to TechCrunch. “Our message is, ‘Look, these best models are STILL failing.’”

Image Credits: Rahmanzadehgervi et al

The overlapping shapes test is one of the simplest conceivable visual reasoning tasks. Presented with two circles either slightly overlapping, just touching or with some distance between them, the models couldn’t consistently get it right. Sure, GPT-4o got it right more than 95% of the time when the circles were far apart, but at zero or small distances, it got it right only 18% of the time. Gemini 1.5 Pro does the best, but still only gets 7/10 at close distances.

(The illustrations do not show the exact performance of the models but are meant to show the inconsistency of the models across the conditions. The statistics for each model are in the paper.)
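For a sense of how little the task demands, the correct answer for two circles follows from basic geometry: compare the distance between the centers with the sum of the radii. The snippet below is a minimal sketch of that check, not the researchers' code; the function name, the coordinate format and the tolerance are assumptions made for illustration.

```python
import math

def classify_circles(c1, r1, c2, r2, tol=1e-9):
    """Classify two circles as 'overlapping', 'touching' or 'separate'.

    c1, c2 are (x, y) centers; r1, r2 are radii.
    tol is an assumed tolerance for treating the gap as exactly zero.
    """
    # Distance between the two centers.
    d = math.dist(c1, c2)
    # Positive gap: space between the circles. Negative gap: they intersect
    # (this also covers one circle lying inside the other).
    gap = d - (r1 + r2)
    if abs(gap) <= tol:
        return "touching"
    return "separate" if gap > 0 else "overlapping"

if __name__ == "__main__":
    print(classify_circles((0, 0), 1, (3, 0), 1))    # separate
    print(classify_circles((0, 0), 1, (2, 0), 1))    # touching
    print(classify_circles((0, 0), 1, (1.5, 0), 1))  # overlapping
```

The three calls mirror the three conditions described above: far apart, just touching and slightly overlapping.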

Or how about counting the number of interlocking circles in an image? I bet an above-average horse could do this.

Image Credits: Rahmanzadehgervi et al

They all get it right 100% of the time when there are five rings, but then adding one ring completely devastates the results. Gemini is lost, unable to get it right a single time. Sonnet-3.5 answers six … a third of the time, and GPT-4o a little under half the time. Adding a seventh ring makes it even harder, but adding an eighth makes it easier for some.

The point of this experiment is simply to show that, whatever these models are doing, it doesn’t really correspond with what we think of as seeing. After all, even if they saw poorly, we wouldn’t expect six-, seven-, eight- and nine-ring images to vary so widely in success.

The other tasks tested showed similar patterns; it wasn’t that they were seeing or reasoning well or poorly, but there seemed to be some other reason why they were capable of counting in one case but not in another.

One potential answer, of course, is staring us right in the face: Why should they be so good at getting a five-circle image correct, but fail so miserably on the rest, or when it’s five pentagons? (To be fair, Sonnet-3.5 did pretty well on that.) Because they all have a five-circle image prominently featured in their training data: the Olympic Rings.

Image Credits: IOC

This logo is not just repeated over and over in the training data but likely described in detail in alt text, usage guidelines and articles about it. But where in their training data would you find six interlocking rings? Or seven? If their responses are any indication: nowhere! They have no idea what they’re “looking” at, and no actual visual understanding of what rings, overlaps or any of these concepts are.

I asked what the researchers think of this “blindness” they accuse the models of having. Like other terms we use, it has an anthropomorphic quality that is not quite accurate but hard to do without.

“I agree, ‘blind’ has many definitions even for humans and there is not yet a word for this type of blindness/insensitivity of AIs to the images we are showing,” wrote Nguyen. “Currently, there is no technology to visualize exactly what a model is seeing. And their behavior is a complex function of the input text prompt, input image and many billions of weights.”

He speculated that the models aren’t exactly blind but that the visual information they extract from an image is approximate and abstract, something like “there’s a circle on the left side.” But the models have no means of making visual judgments, making their responses like those of someone who is informed about an image but can’t actually see it.

As a last example, Nguyen sent this, which supports the above hypothesis:

Image Credits: Anh Nguyen

When a blue circle and a green circle overlap (as the question prompts the model to take as fact), there is often a resulting cyan-shaded area, as in a Venn diagram. If someone asked you this question, you or any smart person might well give the same answer, because it’s totally plausible … if your eyes are closed! But no one with their eyes open would respond that way.

Does this all mean that these “visual” AI models are useless? Far from it. Not being able to do elementary reasoning about certain images speaks to their fundamental capabilities, but not their specific ones. Each of these models is likely going to be highly accurate on things like human actions and expressions, photos of everyday objects and situations, and the like. And indeed that is what they are intended to interpret.

If we relied on the AI companies’ marketing to tell us everything these models can do, we’d think they had 20/20 vision. Research like this is needed to show that, no matter how accurate the models may be in saying whether a person is sitting or walking or running, they do it without “seeing” in the sense (if you will) we tend to mean.
