In my first experiments with Scriptura Obscura, I was impressed with the quality of captions BLIP-2 automatically generated from images. This left me with a lingering question: What does it really know about an image? Yes, it’s generating a short description of an image, but what else is latent in the system that’s left unsaid?
The mind-bending thing about multimodal large language models (LLMs) is that we can just ask them. Similar to the way you can ask ChatGPT a question and get back a response, we can give a multimodal LLM an image and start asking it questions. I wrote a simple app called ChatIMG that lets us do just this, using a BLIP-2-based model called InstructBLIP.
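If you want to poke at this yourself, here is a minimal sketch of what the core of ChatIMG looks like, built on the InstructBLIP checkpoints published on Hugging Face. The checkpoint choice, the `ask_chatimg` helper name, the file name, and the generation settings are my illustration, not necessarily exactly what ChatIMG does:

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Load an InstructBLIP checkpoint from Hugging Face (any of the published variants works).
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to("cuda")

def ask_chatimg(image: Image.Image, question: str) -> str:
    """Ask a free-form question about a single image and return the model's answer."""
    inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

image = Image.open("photo.jpg").convert("RGB")
print(ask_chatimg(image, "Where was this image taken?"))
```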
Here’s an image we looked at last time whose auto-generated caption wasn’t particularly impressive: “a woman in glasses standing near the ocean.” It isn’t wrong, but it lacks the specificity that many of the other captions had. What information is lurking in the model that’s left unsaid?
Adrian: Where was this image taken?
ChatIMG: san francisco
Adrian: What bridge is in the background?
ChatIMG: Golden Gate Bridge
Adrian: What is in the sky?
ChatIMG: Birds
Adrian: Please describe what the woman is wearing in detail.
ChatIMG: The woman in the image is wearing a black jacket and a pair of glasses. She is standing near a fence, smiling and posing for the camera. In the background, there is a bridge spanning across the water.
Whoa… It knows a lot more than just “a woman in glasses standing near the ocean.”
Key Takeaway: Don’t assume a short caption is all there is. It’s possible to use a multimodal LLM to extract a lot more information with targeted questions.
Auto-caption: “a view of a city with a tall building in the background”
Adrian: What city is this?
ChatIMG: new york
Adrian: What is the name of the tall building?
ChatIMG: Taipei 101
Adrian: What city is this?
ChatIMG: taipei
There’s a lot to unpack here. This clearly isn’t New York, so it struck out on the first question. However, when we asked about a more specific detail, it got the name of the building right and then used that detail to update its answer when we asked it to name the city again. That’s pretty cool.
This highlights both the danger and power of a system like this. On the one hand, it was totally confident in its (wrong) answer the first time. If there weren’t a human in the loop, this image would have been misidentified as New York. At the same time, it shows that the model can use answers it generates along the way to revise earlier ones.
Most images recorded today have lots of metadata about when and where they were taken. It’s possible to inject this information into the chat history to help a multimodal LLM better analyze the image data. For instance, if this image were geocoded and timestamped, the city, month, and year could be preloaded into the chat history, so by the time we asked it a question it would already have that context.
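As a rough sketch of what that injection could look like, here is one way to pull the timestamp and GPS tags out of a photo’s EXIF data with Pillow and fold them into the prompt. The file name is made up, this reuses the `ask_chatimg` helper from the earlier sketch, and a real pipeline would also reverse-geocode the coordinates into a city name:

```python
from PIL import Image, ExifTags

def exif_context(path: str) -> str:
    """Turn whatever EXIF metadata the camera recorded into a sentence of chat context."""
    exif = Image.open(path).getexif()
    parts = []
    taken = exif.get(306)  # tag 306 = DateTime
    if taken:
        parts.append(f"This photo was taken on {taken}.")
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)
    if gps:
        # Tags 1-4 hold the latitude/longitude references and values. A real pipeline
        # would reverse-geocode these into a city name; here they're passed through raw.
        parts.append(f"It was taken at GPS {gps.get(2)} {gps.get(1)}, {gps.get(4)} {gps.get(3)}.")
    return " ".join(parts)

# Preload the metadata so the model never has to guess the city from pixels alone.
context = exif_context("skyline.jpg")
image = Image.open("skyline.jpg").convert("RGB")
print(ask_chatimg(image, f"{context} What city is this?"))
```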
Key Takeaway: Injecting additional context fundamentally changes how much useful information a model can extract from the raw image data.
Auto-caption: “a man standing next to an orange car in a parking lot”
Adrian: What is unusual about this image?
ChatIMG: The unusual aspect of this image is the presence of a man standing next to an orange BMW taxi car, which is parked in a parking lot. In most cases, taxi cabs are driven by professionally trained drivers, and it is not common for passengers to stand next to the vehicle while it is parked. This situation suggests that the man might be posing for a photo with the taxi, possibly as a tourist or a fan of the BMW brand.
Adrian: Could this man be a taxi driver who is just proud of his car?
ChatIMG: Yes, it's possible that the man in the image is a taxi driver who is proud of his car. He is standing next to an orange BMW 2002, which is a classic car that was popular in the 1960s and 1970s. Taxi drivers often take pride in their vehicles, especially if they are well-maintained and have a good reputation for reliability and comfort. The fact that the man is posing with his car suggests that he is proud of it and wants to show it off. However, it's also possible that the man is simply a car enthusiast who owns the BMW 2002 for personal use or as a collector's item.
It’s clearly assuming that the man next to the car is not a professional driver. That assumption happens to be correct here, but it could just as easily have been wrong; he could have been the cab’s driver. It’s also impressive how much additional information it provides about the car.
Key Takeaway: Question the assumptions the model is making, especially if you’re less familiar with the image content.
Auto-caption: “a woman posing for a picture in front of a dinosaur skeleton.”
This is already a good caption, so let’s see what happens when we ask for more detail.
Adrian: Write a detailed description of this image.
ChatIMG: In the image, a woman is smiling and taking a selfie in front of a life-sized dinosaur skeleton. The dinosaur appears to be a Tyrannosaurus Rex, with its head positioned towards the left side of the image. The woman is standing next to the dinosaur, capturing the moment with her camera. There are several other people visible in the scene, including a man on the right side of the image, another man on the left side of the image, and a woman on the left side of the image. They appear to be observing the dinosaur as well. Additionally, there are two chairs placed near the dinosaur, one on the left side of the image and another on the right side of the image. A handbag can be seen on the left side of the image, close to the woman who is taking the selfie.
Things start to go off the rails with “There are several other people visible in the scene…” This is what a hallucination — where it just starts making things up or mis-recognizing objects — looks like. Some of these people do not exist. There are no chairs or handbags.
When this multimodal model hallucinates, it often hallucinates in this very specific way. As soon as I see a description that starts with “There are several other people…” I know that we’re also about to be introduced to the chairs and handbags in the image (that aren’t actually there). This is a strange attribute of the system that must have something to do with how it was trained (if anyone knows, I’d love to hear more about why it exhibits this behavior).
(The header image for this post comes from feeding this description back into DALL-E to generate an image of the hallucination including chairs, people, and handbags.)
Key Takeaway: Expect these models to hallucinate, especially when writing longer responses. Sometimes shorter is better.
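One practical knob here is simply how long you let the model talk. The first sketch capped answers at 100 new tokens; a tighter budget tends to keep descriptions closer to what’s actually in the frame, at the cost of detail. A quick illustration, reusing `processor`, `model`, `image`, and the `torch` import from that sketch:

```python
# Same question, two different length budgets.
inputs = processor(images=image, text="Write a detailed description of this image.",
                   return_tensors="pt").to("cuda", torch.float16)

# A generous budget gives the model room to wander into imaginary people, chairs, and handbags...
long_ids = model.generate(**inputs, max_new_tokens=256)
# ...while a tight cap tends to stop it at what it can actually see.
short_ids = model.generate(**inputs, max_new_tokens=40)

print(processor.batch_decode(short_ids, skip_special_tokens=True)[0].strip())
```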
Auto-caption: “a metal shelf filled with toilet paper and other items”
This Substack is about making AI work for us at home, so let’s look at an image of shelves that need to be monitored so items can be restocked automatically.
Adrian: Please write a detailed description of this image.
ChatIMG: The image features a metal shelf filled with various household items, such as toilet paper, paper towels, and cleaning supplies. There are multiple rolls of toilet paper and paper towels on the shelf, arranged in a neat and organized manner. Additionally, there are several bottles placed on the shelf, likely containing cleaning products or other household items. The shelf appears to be well-stocked and ready for any household needs.
Adrian: Please inventory all the items on the metal shelves.
ChatIMG: The items on the metal shelves include toilet paper, paper towels, cleaning supplies, and other miscellaneous items.
I appreciate how complimentary it is about how organized and well-stocked the shelves are. However, it doesn’t get into enough detail to actually help me keep track of what needs restocking. This is a problem I plan to come back to in a future post.
Key Takeaway: Multimodal LLMs don’t yet excel at all computer vision tasks. Some tasks may benefit from systems that are custom-trained to recognize specific objects.
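For a task like shelf inventory, a plain object detector may be a better starting point than a chatty multimodal model. Here is a minimal sketch with the Hugging Face object-detection pipeline; the DETR checkpoint is a generic COCO detector and the file name is made up, so a real pantry tracker would likely need a model fine-tuned on your own shelf photos:

```python
from PIL import Image
from transformers import pipeline

# A generic detector trained on COCO; it won't know "toilet paper" as a class,
# which is exactly why custom training would be needed for a real restocking system.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

image = Image.open("pantry_shelf.jpg").convert("RGB")
for detection in detector(image):
    if detection["score"] > 0.7:
        print(f'{detection["label"]} ({detection["score"]:.2f}) at {detection["box"]}')
```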
Auto-caption: “a painting of the last supper by leonardo da vinci”
Adrian: What is the historical significance of this painting?
ChatIMG: The painting depicting the Last Supper by Leonardo da Vinci is one of the most famous paintings in the world. It is considered a masterpiece of Renaissance art and represents a significant moment in Christian history. The Last Supper is the final meal Jesus shared with his twelve apostles before his crucifixion. The painting captures the intimate moment when Jesus announces that one of his disciples will betray him, leading to the events of the crucifixion and resurrection. The painting's historical significance lies in its portrayal of this pivotal moment in Christianity, as well as its artistic merit, which continues to inspire artists and art enthusiasts alike.
Adrian: Who is the woman in this image?
ChatIMG: Mary Magdalene
Though multimodal LLMs can provide more detail about a piece of art, in most cases their most useful function is simply to identify the work. The language side of this model is less comprehensive than GPT-4, so once a piece of art has been identified, asking ChatGPT for more detail about it generally produces a better result.
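In practice that suggests a simple two-stage pipeline: use the multimodal model only to name the work, then hand the name to a stronger text-only model for the background. A hedged sketch, reusing the `ask_chatimg` helper from earlier; the file name, prompts, and model choice are illustrative:

```python
from openai import OpenAI
from PIL import Image

image = Image.open("artwork.jpg").convert("RGB")

# Stage 1: the multimodal model just identifies the work.
title = ask_chatimg(image, "What is the name of this artwork, and who created it?")

# Stage 2: a text-only model with broader world knowledge fills in the background.
client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"Tell me about the historical significance of {title}."}],
)
print(response.choices[0].message.content)
```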
Key Takeaway: Multimodal LLMs do recognize many culturally significant images. (Though accuracy goes way down for less well-known artists and pieces.)
Auto-caption: “a painting of a man throwing a ball is on the side of a building”
Adrian: What is the historical significance of this painting?
ChatIMG: The historical significance of the painting on the wall in the image is that it depicts a man holding a bouquet of flowers, which can be interpreted as a symbol of love or appreciation. This artwork may have been created during a time when people expressed their emotions and sentiments through various forms of art, such as paintings, murals, or graffiti. In this case, the painting serves as a visual representation of the artist's creative expression, reflecting the social and cultural norms of the time period. Additionally, the presence of water tanks on the rooftop of the building where the painting is located adds to the historical context, as these tanks may have been used for irrigation or other purposes during the time when the painting was created. Overall, the painting contributes to the cultural heritage and history of the area, providing a glimpse into the artistic and social practices of the past.
Adrian: Which artist painted this?
ChatIMG: banksy
It’s struggling to connect the dots. This feels like an alien with a powerful telescope looking down at Earth and describing what it sees, but not really understanding it. In the short caption it thinks the man is throwing a ball, and in the longer response it recognizes flowers but thinks the man is holding them rather than throwing them. It never puts the pieces together: the man is hurling a bouquet of flowers as if it were a grenade. On the other hand, it does get the artist right.
Key Takeaway: A multimodal LLM can almost always describe an image, but it can only sometimes accurately explain one.
We’re just scratching the surface of what multimodal LLMs are capable of. I suspect you’ll be hearing a lot more about these models in the coming months, and I hope these examples give you some sense of how powerful this technology is and also how imperfect it is. Beware of cherry-picked examples that look too good to be true.
For my next post, I’d love to give you the opportunity to try out ChatIMG. Please share any images (or questions about images) that you’d like to try out, and I’ll run them through ChatIMG and report back. Just leave a comment below or send me an email with the images. Thank you for reading!