By: Pranav Lal
A paper I presented at the National Conference on Information and Communication Technologies (ICT), held at Ahmedabad on 19 and 20 September 2008
Introduction
When we think of conversing with our computer, pictures of HAL in 2001: A Space Odyssey come to mind. However, the technology for doing this is already here and is being used heavily in specialist applications. Namely, speech-recognition engines have been connected to screen readers to allow you to converse with your computer and control it at the same time. This paper will discuss these technologies. It will highlight the unique challenges in connecting screen readers to speech-recognition applications and explain how they have been overcome. The paper will focus largely on the Windows operating system from Microsoft, since most of the solutions are Windows based.

Before the challenges are discussed, it is important to understand how these technologies work in isolation. When a blind user interacts with a computer, he uses a program called a screen reader. The screen reader, via a speech synthesizer, converts the output of the computer into speech. The screen reader has a series of commands that the user can use to read various parts of the screen. For example, the user can press a keystroke to read a paragraph of text. The screen reader also has to track what is happening on the computer to handle things like popups. For example, if another application such as a CPU temperature monitor pops up a message, the screen reader must interrupt itself, if it is speaking, and speak that message. Or, if the user has set the reader to ignore such warnings, the screen reader will behave accordingly. The screen reader uses a variety of techniques to monitor what is happening on the computer. For instance, it taps into the Windows messaging subsystem and tracks the messages Windows is sending to different applications. If the screen reader detects a message of significance, it traps and processes it. The screen reader also deals with complex screen layouts and has to render the World Wide Web in a comprehensible form. All screen readers read from left to right and top to bottom. The user has tremendous flexibility over what he wants read and when.

Speech-recognition programs, on the other hand, work in reverse: they convert the spoken word into text and commands. A large number of them use hidden Markov models to determine word probabilities and output the most probable match for a word to the screen. Along the way, the speech-recognition program has to take into account the acoustics of the environment, the pronunciation and accent of the user, and so on. (A toy decoding sketch at the end of this introduction illustrates the idea.)

In theory, interlinking a screen reader with a speech-recognition program should not be anything very difficult. As long as the speech-recognition program uses standard controls, a screen reader will be able to interact with it without any problems. This is indeed the case. However, as we will see in the subsequent sections, this does not give the complete picture. There are several other elements that need to be taken into account when these technologies are interlinked. Moreover, interlinking these technologies requires a lot of time and effort, and anyone setting out to do this must have a sustainable business model. This paper will also examine the current business model being followed by such companies. Finally, the paper will give a quick glimpse into the future of this bridging technology.
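As a concrete illustration of the hidden Markov model idea mentioned above, the toy Python sketch below uses the Viterbi algorithm to pick the most probable sequence of candidate words for a handful of coarse acoustic observations. All of the states, observations and probabilities are invented purely for illustration; a real engine decodes over very large acoustic and language models, but the "most probable match" computation is the same in spirit.

    # Toy Viterbi decoder. Every state, observation and probability below is
    # invented for illustration; real engines use far richer models.
    def viterbi(observations, states, start_p, trans_p, emit_p):
        # V[t][state] = (probability of the best path ending in `state` at
        #                time t, the previous state on that path)
        V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]

        for t in range(1, len(observations)):
            V.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
                prob = V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][observations[t]]
                V[t][s] = (prob, prev)

        # Walk backwards from the most probable final state to recover the path.
        best = max(states, key=lambda s: V[-1][s][0])
        path = [best]
        for t in range(len(observations) - 1, 0, -1):
            best = V[t][best][1]
            path.insert(0, best)
        return path

    # Two candidate words heard over three coarse acoustic frames.
    states = ("yes", "no")
    observations = ("loud", "soft", "soft")
    start_p = {"yes": 0.6, "no": 0.4}
    trans_p = {"yes": {"yes": 0.7, "no": 0.3}, "no": {"yes": 0.4, "no": 0.6}}
    emit_p = {"yes": {"loud": 0.5, "soft": 0.5}, "no": {"loud": 0.2, "soft": 0.8}}

    print(viterbi(observations, states, start_p, trans_p, emit_p))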
The challenges of integrating screen readers and speech-recognition programs
1. Technical challenges

There are several technical challenges that need to be overcome to link screen readers to speech-recognition programs. For example, a speech-recognition program usually works at a higher level in the Windows operating system than a screen reader, so there is no easy way for the speech-recognition program to control the screen reader. This means that the user would not be able to give voice commands to control the screen reader, which would prevent hands-free control of the computer. Another technical challenge relates to providing suitable feedback to the user on what the speech-recognition program is doing. Whenever a speech-recognition program encounters speech that it cannot recognize, it signals the user visually. For example, Dragon NaturallySpeaking, a leading speech-recognition product from Nuance, shows three question marks on the screen when it cannot recognize what the user is saying. The screen reader has to be able to track when this occurs and give the user suitable feedback without interrupting anything else that may be going on; a crude polling sketch at the end of this section illustrates the monitoring problem. Finally, the technologies need to work seamlessly across different hardware and software combinations.

2. The challenge of usability

Another challenge that must be overcome for successful integration of speech recognition with screen reading is usability, that is, how easy it is for the user to actually use both technologies. Each of these technologies taken alone can be extremely complex. The challenge lies in masking the complexities of these technologies to allow the user seamless interaction with the computer.

3. Environmental challenges

The use of a screen reader means that the computer on which the user is working needs to be able to handle both screen reading and speech-recognition applications. Both of these applications require considerable amounts of CPU processing power and RAM. More than that, they also require high-quality sound sources to ensure that simultaneous input and output can be handled.

4. The challenge of development

If a third-party developer has to take on the above challenges, he has to keep up with developments in both technologies. On top of that, the market for this kind of technology is quite fragmented, since every customer has her own needs. This involves a fair amount of customization of the given technology to meet the individual requirements of the customer. For example, one customer might use just a simple set of productivity applications. Another might need a significant degree of customization of both technologies to handle challenging applications such as those used in call centers. Finally, a third customer might want to experiment with different applications and need the ability to extend both technologies at will.

5. Market economics

Any company building such a solution would need to ensure that it had a very strong business model. A lot of research is required to build such a solution. Once that research has been carried out, continuous testing is also required to ensure that the solution works as expected. Finally, teams of specialists are required to install the product and to train people in how to use this kind of technology.
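To make the feedback problem described under the technical challenges concrete, the crude sketch below polls the title text of the foreground window and speaks a warning when a hypothetical "???" marker appears. Shipping screen readers hook window and accessibility events rather than polling, and Dragon's results box is not necessarily exposed through a window title; both are simplifying assumptions made purely to show the shape of the monitoring problem.

    # Illustrative only: watch the foreground window's title for a "???" marker.
    # Real screen readers subscribe to events instead of polling, and the window
    # inspected here is an assumption, not Dragon's actual results box.
    import ctypes
    import time

    user32 = ctypes.windll.user32

    def foreground_window_text():
        # Read the title text of whichever window currently has the focus.
        hwnd = user32.GetForegroundWindow()
        length = user32.GetWindowTextLengthW(hwnd)
        buffer = ctypes.create_unicode_buffer(length + 1)
        user32.GetWindowTextW(hwnd, buffer, length + 1)
        return buffer.value

    def speak(message):
        # Placeholder for a call into the screen reader's speech output.
        print("SPEAK:", message)

    previous = ""
    while True:
        text = foreground_window_text()
        if "???" in text and text != previous:
            # Tell the user the utterance was not recognised, without
            # repeating the warning on every polling cycle.
            speak("Speech not recognised, please try again")
        previous = text
        time.sleep(0.2)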
Some solutions
The discussion around the solutions necessarily has to be product based; there is no single framework or strategy that can be used to marry speech-recognition applications and screen readers. The current solution on the market is the J-ware line of products from T&T Consultancy Ltd. Their flagship product is called J-Say. J-Say combines Dragon NaturallySpeaking and JAWS for Windows into a seamless and consistent solution. JAWS for Windows is one of the foremost screen readers on the market, and it lends itself so well to integration with speech recognition because of its extremely powerful scripting language. So, let us see how J-Say addresses the challenges listed above.

1. Technical challenges

The precise details of J-Say's implementation remain confidential. J-Say relies heavily upon a complex set of JAWS for Windows scripts to provide the necessary oral and Braille-based feedback to the visually impaired user and to create the many special utilities and programs J-Say has available. The JAWS scripting language contains a full programming interface for creating access to applications or for automating specific routines. Using this programming language, J-Say interacts with Microsoft Windows through its window hierarchy, and makes extensive use of the Microsoft Active Accessibility implementation and the object model code found within Microsoft applications such as Word, Outlook and Internet Explorer. In this way, J-Say is able to report information back to the user accurately by interacting directly with the operating system or application, rather than relying upon screen-based data which can become temporarily obscured. A minimal amount of scripting is undertaken using Dragon NaturallySpeaking's macro creation capability. To create a voice command, the voice phrase is specified within the Dragon NaturallySpeaking Command Browser, and the JAWS script which will carry out the action the user has in mind is then linked to that voice command via a simple routine called from a DLL file.
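The precise bridge J-Say uses is confidential, but the general idea of linking a recognised voice phrase to a JAWS script through a routine called from a DLL can be sketched as below. The library name jsay_bridge.dll, the exported routine RunJawsScript and the script names are invented for illustration and do not describe T&T's actual implementation; the two voice phrases are the ones mentioned in the usability discussion that follows.

    # Hypothetical sketch of the voice-phrase-to-script bridge described above.
    # "jsay_bridge.dll", "RunJawsScript" and the script names are invented; the
    # real J-Say implementation is confidential and will differ.
    import ctypes

    bridge = ctypes.WinDLL("jsay_bridge.dll")            # assumed bridge library
    bridge.RunJawsScript.argtypes = [ctypes.c_wchar_p]   # assumed exported routine
    bridge.RunJawsScript.restype = ctypes.c_int

    def on_voice_command(phrase):
        # Map a recognised voice phrase to the JAWS script that performs it.
        commands = {
            "speak line": "SayCurrentLine",
            "restart the computer": "RestartComputer",
        }
        script = commands.get(phrase.lower())
        if script is not None:
            bridge.RunJawsScript(script)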
2. The challenge of usability

J-Say has adopted a minimal speech approach. This approach cuts both ways: it minimizes the amount of verbal feedback the user receives from JAWS for Windows, and it allows the user to accomplish any given task with the minimum of speech input. This has been done by designing several easy-to-remember commands that give the user complete control. For example, the command to read the current line on the screen is simply "speak line". Similarly, the command to reboot the computer is "restart the computer". There is also extensive help available while using J-Say, so that the user does not really have to remember anything except perhaps the command to invoke the help facility. J-Say also takes into account the shift that occurs when using speech recognition to carry out everyday tasks. For example, it is easy to select text when using a keyboard: all the user has to do is hold down the Shift key and move the arrow keys. This approach is very tedious when using speech recognition because, firstly, there is no easy way to hold down a modifier key and, secondly, even the slightest error in utterance could lead to the selected text being replaced with junk text. J-Say therefore has a unique selection facility that allows the user to mark the beginning of a block of text. The user can then use any means at her disposal to navigate to the end of the text and mark it. Finally, the user can issue commands to cut or copy the text to the Windows clipboard. (A toy sketch at the end of this section models this mark-and-copy workflow.)

3. Environmental challenges

There is no getting around using a powerful computer for this kind of bridging technology. Using a computer with low hardware specifications will lead to suboptimal results. At the time of this writing, it is recommended that separate sound sources are used for the input of speech and the output of synthesized speech. This ensures that there is no crosstalk between these sources. Finally, a high-quality microphone is required, so that it picks up only the user's voice and is not confused by the speech synthesizer's output or other environmental noise.

4. The challenge of development

Fortunately, Dragon NaturallySpeaking and JAWS for Windows are very extensible. Both have powerful scripting languages and APIs that allow third-party developers to customize them. J-Say allows the user to take advantage of this extensibility by exposing its interface, with which the user can call JAWS for Windows scripts from Dragon NaturallySpeaking. Calling a JAWS script is a matter of executing a single routine from a Dragon NaturallySpeaking script. In addition, T&T Consultancy and the J-Say user community are exceptionally supportive of anyone who is attempting to customize J-Say.

5. Market economics

The adaptive technology industry is littered with examples of companies that had terrific products and growing customer bases but failed. One of the causes of this failure has been the companies' focus on a single disability. Though J-Say is the flagship product of T&T Consultancy, it is not their sole focus. They have created other products, such as the scripts for iTunes, and they are also resellers of several adaptive products. In addition, they have altered their business model to focus on other disabilities besides vision and are actively promoting their products as cross-disability products. This has helped them gain a significantly larger market share. They have also elected to go with distributors who are not dedicated to serving only a single kind of customer. For example, their distributor for North America and Canada also supplies products to the medical community. Furthermore, a significant part of T&T's income comes from training and onsite support, which are recurring activities. This has made the company less dependent on new sales, which can be slow owing to the nature of the access technology market. Finally, T&T listens very closely to its customers. The CEO actively participates on the user list and trains users himself, and the beta testers of the product cover a large cross-section of customers. J-Say is not restricted to a single speech-recognition solution: J-Say technology has also been applied to Windows Vista speech recognition in the form of the J-vist product. This product also meets the needs of customers who would rather use speech recognition for dictation and control the computer using other means.
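The mark-and-copy selection workflow described under the usability challenge can be modelled with a toy helper class, shown below. A real implementation would work through the application's object model or the screen reader's cursors rather than on a plain string, and the voice phrases mentioned in the comments are illustrative only, not J-Say's actual command names.

    # Toy model of the "mark, navigate, then copy" selection workflow. Purely
    # illustrative: a real solution would drive the application's object model
    # or the screen reader's cursor, not a Python string.
    class SelectionHelper:
        def __init__(self, document_text):
            self.text = document_text
            self.start = None       # position remembered by the "mark" command
            self.clipboard = ""     # stands in for the Windows clipboard

        def mark_start(self, caret_position):
            # A voice phrase such as "mark start" (hypothetical) would call this.
            self.start = caret_position

        def copy_to(self, caret_position):
            # A voice phrase such as "copy to here" (hypothetical) would call this.
            if self.start is None:
                return
            low, high = sorted((self.start, caret_position))
            self.clipboard = self.text[low:high]
            self.start = None

    # Example: mark the start, navigate the caret forward by any means, then copy.
    helper = SelectionHelper("J-Say combines Dragon NaturallySpeaking and JAWS.")
    helper.mark_start(0)
    helper.copy_to(14)
    print(helper.clipboard)  # prints "J-Say combines"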
The road ahead
It is not possible to outline the specific developments that will take place in this area, but some general comments can be made. Speech recognition is gradually being applied in various devices; we see it most frequently in the ability to dial a contact on a mobile phone by speaking the contact's name. Similarly, speech synthesis is also catching up: note the synthetic voice that announces the name of the caller on a number of mobile phones. In the long run, speech as a mode of input will probably replace the keyboard, since it is far more natural for a person. For this to happen, though, a significant number of technical challenges, the discussion of which is beyond the scope of this paper, need to be met. In terms of economics, small companies with innovative products will succeed in the market. Some of these products may be niche products initially, so the scale of operations could be small, but over time, as the word spreads and the technology improves, they will become commonplace.
Acknowledgements
The author would like to thank the following people for their invaluable contributions: Brian Hartgen of T&T Consultancy Limited, the lead developer of J-Say, for his help with the technical explanation of J-Say; and Edward S. Rosenthal, President and CEO of Next Generation Technologies Inc., for his help with the perspective on speech recognition and other suggestions for this paper.
References
T&T Consultancy Ltd., the makers of J-Say
Next Generation Technologies Inc., master distributor of J-Say, especially for the USA
Freedom Scientific, the makers of JAWS for Windows
Nuance, the makers of Dragon NaturallySpeaking