Note: This post is applicable to Microsoft Office 2007 and above.
There are many situations where you need to work with images that have been embedded in Word documents. For example, you may receive a Word document whose text is contained in embedded images, or a portion of the text in a document may be inside an image. If you are a screen reader user, this presents several challenges.
- Your OCR feature will not read the entire text or there will be a lot of clutter in the recognized text.
- You need to ensure that you focus on the image you want.
It is possible to solve the focus issue by using your screen reader’s object navigation feature. This assumes that your screen reader can detect objects in a Word document. The other way to handle this situation is to extract the images from the Word document. Please follow the steps given below to do so.
- Close the Word document.
- Copy it to another location.
- Rename the copied .docx file so that it has a .zip extension.
- Extract the zip file.
- Navigate to the “word” folder in the resulting folder structure.
- Navigate to the “media” subfolder within the “word” folder.
You will find your images in this folder. Be warned: if you have set special properties such as alternative text on the images, they may not be immediately visible. You would need to check the properties of the image.
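If you extract images often, the manual steps above can also be scripted. The following is a minimal sketch in Python, assuming only the standard library; the function name `extract_docx_images` and its parameters are made up for illustration. It relies on the fact that a .docx file is already a zip archive, so the script reads it directly without renaming it.

```python
import zipfile
from pathlib import Path


def extract_docx_images(docx_path, out_dir):
    """Copy every file stored under word/media/ in a .docx into out_dir.

    Hypothetical helper for illustration; returns the list of saved paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    # A .docx file is a zip archive, so it can be opened without renaming.
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Keep only real files in the "media" subfolder of "word".
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return saved
```

For example, calling `extract_docx_images("report.docx", "report_images")` would place the embedded images in a `report_images` folder next to the script. Both names are placeholders. The original document is only read, never modified, so the copy step is not strictly needed here.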
Once you have these images, you can load them into optical character recognition (OCR) software and have them recognized. Some of these images may be of poor quality for OCR, so use an OCR program that has good image processing capabilities. Examples include ABBYY FineReader and blindness-specific OCR programs such as Kurzweil 1000 and OpenBook.
If you want to read more about the Microsoft Office file formats, then please start with the link given below.
Where is the documentation for Office’s docx/xlsx/pptx formats? Part 1: Office 2007