7 Comments

An aspect of your work with AI that really intrigues me is the blurred line between what we think of as "information" and what we think of as "stories" or narratives. My take is that humans are very adept at creating multiple stories out of multiple pieces of information. If the story is complete enough in a moment to be materialized--in a conversation, something written, the arrangement of objects, and so on--the challenge of picking up on it through AI is much, much greater than if it's not. As one illustration, your daughter put together going to lunch with you and her own birth as two information points within a lifelong narrative (for her) about separating from her mother. But unless she voiced some apprehension about getting lost (or you gave the AI a photo to work with of her entirely alone), I doubt AI would place the information (visual or audio) from that moment in a narrative of separation anxiety.

"the challenge of picking up on it. . . is much, much greater" CORRECTION "the PROBABILITY of picking up on it. . . is much, much greater."

Also, do you tend to see the "There are several other people visible in the scene…" weirdness when only parts of people are visible in a photo (a head but no body, a torso but no head or legs)? I wonder if that's particularly confusing.

Also, that white overexposed compostable clamshell does look like a clutch purse.

author

There is almost always just one person actually present when "several other people" are described. I don't think other people are always partially visible, but sometimes they are (as in the case of the image you describe). Sometimes the hallucinations feel more like "these are other objects that commonly occur with the objects in this image" than "I'm mistakenly identifying this block of pixels as something it isn't." But both are happening -- and the two may be related.

It's a funny experience being the human in the loop. Sometimes I feel like I'm looking at a cloud that someone (or in this case some-thing) has named and trying to see the object they are referencing. It raises interesting questions about where meaning really comes from. The promise of AI is that it will help us see things that we otherwise can't. And yet, when it does see something we can't, is it really there? How do we know?

‘human in the loop’ : )

It's a fascinating idea to be able to ask the model what more it knows.

You asked such specific follow-ups, but I'm curious: do more general prompts ("tell me more") lead to equally good responses, hallucinations, or something else? How many times can you ask? Likewise, can you get it to reconsider its response with a general prompt ("Look more carefully, then answer," or "Are you certain? Look again")?

author

The short answer is that this particular model does not do well with an open-ended prompt (esp. compared to something like ChatGPT). For instance, a prompt like "tell me more" often just gets it to repeat what it's already said. Sometimes I will tell it that it's wrong and ask for an alternative, but it almost never gets it right on the second, third, or fourth try. For instance, I gave it a photo of Nick Cave and asked "Who is this?" It responded with "david bowie." I then said "It's not David Bowie. Who else might it be?" and it said "Marilyn Manson." I thought maybe David Bowie dominated a music association in the model, so I tried giving it a photo of David Bowie and asking "Who is this?" Its response? "bob dylan". If I ask it how certain it is, it usually responds with "100%".

Part of what's counterintuitive about this is that it isn't doing a pure visual match. The wrong answers it gives aren't random and often seem conceptually related to the right answer, even though visually most of these people would look quite distinct to a human. I think this is a result of the multimodal setup, with the image encoder and the large language model "working" together.
