Until fairly recently, images were simply inaccessible unless you could print them in tactile form. If you had the source data, you could generate descriptions by assigning labels, which again needed human intervention.
In some cases, you could use sonification; that is still true today, and indeed it remains the only way to “experience” an image on the fly.
However, thanks to large language models and their ability to process images, we can now ask questions of these confounded sets of pixels.
I use the NVDA screen reader, for which there are three add-ons that generate image descriptions.
XPoseImageCaptioner
This add-on has been written by Christopher Millsap. It can generate image descriptions from JPEG and PNG files. It works offline; that is, once the add-on is installed, you do not need to be connected to the internet to generate image descriptions.
It works best for scenes such as photographs of beaches, monuments, and so on, and does not have OCR capabilities. It is extremely fast because no network access is required.
In my experience, the descriptions are sufficiently detailed that you get a better than average understanding of the scene.
Cloud Vision
This add-on has been written by Aleks Samos.
It uses the Google Vision API to generate image descriptions. It can also read QR codes, though this is something I have not tried.
The descriptions it generates are good, and if there is text in the image, that text is also rendered. Read the help for usage instructions. The add-on’s home page is in Russian, though its help as rendered in the add-on store is in English. A handy feature is that it can place the description in a virtual buffer, letting you review it at leisure or copy it into another document.
AI Content Describer
This add-on has been written by Carter Temm. It sends the image to be described to OpenAI’s ChatGPT.
Its descriptions are detailed, and text in the image is also rendered. In many cases, the output reads like a human describing the image to you, so you get the text in context. You can specify the initial prompt that is sent with the image, which lets you tweak the ChatGPT answer the way you want.
For example, I wanted to analyze a form I was filling out. Its check boxes were rendered in a non-standard way, such that the screen reader detected them as buttons. Here is the prompt I used:
“Describe this image and if it is a screen shot of a form, tell me which fields have errors and/or are rendered in a different color or are highlighted in red”
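Under the hood, add-ons like this one presumably wrap OpenAI’s vision-capable chat API. As a rough sketch of how a custom prompt travels with the image (the message layout follows OpenAI’s published Python SDK conventions, not this add-on’s actual source; the function name and default model are my own assumptions):

```python
import base64


def build_vision_request(image_path, prompt, model="gpt-4o"):
    """Build the payload for an OpenAI vision-style chat request.

    The actual network call would be client.chat.completions.create(**request);
    building the payload separately keeps this sketch runnable offline.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                # One message carrying both the text prompt and the image,
                # encoded as a base64 data URL.
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }
```

Swapping in a prompt like the form-analysis one above is then just a matter of passing a different `prompt` string.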
There is an unbound gesture that captures a screenshot and sends it to OpenAI.
You can open each description in its own dialog box.
The hardest part of using this add-on is obtaining the OpenAI API key. You need to deal with the OpenAI web interface; it is not bad, but it takes some exploring to find where to buy credits. In addition, when I added my credit card, I was charged. This is a verification step that OpenAI performs, and the charge was reversed. I was initially confused, thinking that I had bought credit, but that was not the case.
All three of the above add-ons are screen-reader dependent and, consequently, platform dependent.
Enter Vision
This is a Python script written by Rastislav Kish. You can specify the image file or files that you need described. You can also specify the prompt that you want to send to ChatGPT; usually, though, the default prompt works.
The setup of this script is more involved: you have to install two Python packages and add an environment variable for your OpenAI API key. However, the instructions are easy to follow.
This is the only solution so far that lets you ask ChatGPT follow-up questions about the specified image. It is screen-reader independent and can also act as a command-line ChatGPT client.
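Follow-up questions work because a chat API is stateless: the client keeps the whole conversation and resends it with every new question. A minimal sketch of that bookkeeping (the function names are mine, not from Kish’s script):

```python
def start_conversation(image_prompt):
    """Begin a conversation history with the initial image-describing prompt."""
    return [{"role": "user", "content": image_prompt}]


def ask_followup(history, reply, question):
    """Record the model's last reply, then append a follow-up question.

    The full history list is what gets resent to the API on each turn,
    which is how the model can answer questions about the earlier image.
    """
    history.append({"role": "assistant", "content": reply})
    history.append({"role": "user", "content": question})
    return history
```

In the transcript below, asking “describe the boats” after the initial description is exactly this pattern: the script appends the model’s beach description and the new question, then resends the whole exchange.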
So far so good, but how about an example? See the image below, which I took on a recent trip to the Maldives.
I ran all the above solutions with their default prompts on the above image and here is what I got.
XPoseImageCaptioner
a beach with a body of water and a pier in the distance with footprints in the sand.
Cloud Vision
sunset on the tropical beach
Beach ridge: Sea
outdoor, nature, cloud, beach, water, sky, ground, horizon, sand, shore, mudflat, calm, coast, coastal and oceanic landforms, landscape, sunrise, sunset, ocean.
a beach with a pier and a sunset.
AI Content Describer
This image captures a tranquil beach scene at twilight. The photo is taken from the shore, looking out across calm waters towards the horizon where the sky exhibits warm hues of yellow and orange near the setting sun, transitioning to soft blues higher up. A few scattered clouds are visible in the sky, adding texture and depth to the vista. On the right, there’s a silhouette of a pier or structure extending into the water, with reflections glimmering on the surface. The beach in the foreground is scattered with footprints, indicating recent activity, and the waves gently lap at the sand. The overall atmosphere is serene, evoking the quiet calm of an evening by the sea. The image is also tilted, giving a dynamic feel to the composition.
Vision
See the conversation here.
I ran the script on my desktop computer, which runs Linux. I have done minor formatting to improve clarity.
[pranav@archlinux ~]$ ~/vision/vision IMG_1542.jpg
The image captures a tranquil beach scene at twilight. The sun appears to be setting, casting a warm glow across the sky with colors ranging from blue to orange. Gentle waves are lapping at the shore, and you can see footprints scattered on the smooth sand, indicating people have walked there recently. On the right, extending into the sea, there appears to be a row of overwater bungalows or structures, which suggests this may be a resort or a vacation area. In the background, there are several boats on the horizon, further giving the impression of a seaside locale popular for recreation or tourism. The overall effect is peaceful and picturesque, emphasizing the natural beauty of a beachside sunset.
Enter your message describe the boats
In the image, there are a few distant boats visible on the horizon. Due to the distance and resolution of the image, detailed descriptions of the boats are limited. However, they appear as small silhouetted shapes against the backdrop of the lighter sky reflecting on the water. These could likely be leisure boats or vessels used for fishing or transport, common in coastal areas with tourist activities. The boats are too far away to discern any specific details such as the type, size, or features of the boats.
Conversation price: 0.02433
>
Conclusion
We are in a better place than even a year ago with respect to access to images. Of course, the descriptions are only as good as the models we have and the prompts we use. In addition, I am hoping for more integration of large language models with image processing tools such that we can get coordinates of specific parts of images. This will allow us to focus on our areas of interest and then to do further analysis such as passing an area of the image through a specialist tool like a sonification program to experience just that part.
Peter Meijer says
Great article, Pranav. There seems to be an HTML tag nesting typo at the “AI Content Describer” headline, causing the remainder of that section to show in headline h2 font. When looking at the text on the screen with The vOICe on another device while scrolling, you will notice the change in font size. Thanks!
Pranav says
Many thanks Peter, I have fixed the HTML.