SAPI 5.1: Voice-Enabled Applications With VB

By Peter A. Bromberg, Ph.D.

About a year ago, the Microsoft Research Speech Technology Group released the long-awaited SAPI 5.0 SDK. It was a big hit with developers -- C++ developers, that is! The gaping omission was a COM interface to the runtime, so VB and other COM-compliant programming environments couldn't use it. Naturally, there was a huge outcry from the VB development community, which has grown quite large, and Microsoft promised that SAPI 5.1 would have a more or less complete COM interface. The wait got so bad that some enterprising developers wrote their own COM wrappers in the interim.

Well, Microsoft delivered: SAPI 5.1 provides everything they promised, and even more -- there are even interop assembly interfaces to allow programming against the SDK with C# and VB.NET. In addition, Microsoft recently presented developer-attendees at the Los Angeles PDC 2001 with the first beta of the .NET SAPI SDK. This marks the continued evolution of Microsoft's long-dominant position in speech technology, and signals that Microsoft is committed to providing speech technology interfaces for its programming environments for the foreseeable future.

I've been involved with SAPI, off and on, since version 3.0 (that's ancient history, by Internet Standard Time). I saw the SAPI interface, particularly combined with the ease of VB programming, as a way to explore new and useful ways to interact with the PC -- not only because I've always been fascinated by voice technologies, but also because my son, Andrew, who is now 13, has autism. If you know anything about autism and related disorders, then you know that the speech-language processing centers of the brain in autistic individuals often don't develop at the same rate as other areas of the brain. The result can be language- and communication-related learning disabilities. Many autistics never speak at all. In Andrew's case, he can talk just fine, although his pronunciation, especially with "L's" and "R's", still needs some work. Yet it's difficult for him to piece together more than a few words at a time. This is obviously one of the major causes of social distress: he'd love to play more with the rest of the kids, but he's often unable to express himself properly, and as a result, some kids who haven't had this explained to them may think he's "strange". I think this is partly a biochemical thing -- we've noticed that if he is sick or very excited, the connections in the brain suddenly seem to improve and he's able to express himself better with language.

There are so many unique applications of speech technology to computing that I can't even begin to describe them -- not just for people with disabilities, but for everyone. I think the easiest way to get involved with this technology is to go ahead and download the SAPI 5.1 SDK, fire up some of the samples, and you'll be hooked!

The first thing you'll notice is that the quality of the speech recognition engine - even before "training" - is absolutely FIRST RATE. This means that you can dictate your favorite letters, memos and so on into your favorite programming interface and be assured that you won't have to make many corrections later. Another thing I found is that you can pretty much talk as fast as you want - and although there may be a delay while the engine sorts out the context and comes up with its final result, it won't miss a beat.

The second thing you'll notice is how easy it is to use. In Visual Basic, you basically just set a reference to the Microsoft Speech Object Library, which contains all of the COM interfaces:

To use the Text to Speech interface, we would use the following code:

Dim Voice As SpVoice
Set Voice = New SpVoice
Voice.Speak "Howdy!", SVSFlagsAsync

I don't think it could get much easier!
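You can also choose among the installed TTS voices through the same object. Here's a minimal sketch using the SpVoice automation members (the voice indexes and descriptions will vary from machine to machine):

```vb
Dim Voice As SpVoice
Dim Tokens As ISpeechObjectTokens
Dim i As Integer

Set Voice = New SpVoice
Set Tokens = Voice.GetVoices          ' all installed TTS voices

' List the available voices in the Immediate window
For i = 0 To Tokens.Count - 1
    Debug.Print i & ": " & Tokens.Item(i).GetDescription
Next i

' Pick one (index 0 here) and speak with it
Set Voice.Voice = Tokens.Item(0)
Voice.Speak "Testing one two three", SVSFlagsAsync
```

Setting the Voice property before calling Speak is all it takes to switch voices at runtime.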

And to use the Speech Recognition classes, we would do the following:

' WithEvents variables must be declared at module (form) level
Dim WithEvents RecoContext As SpSharedRecoContext
Dim Grammar As ISpeechRecoGrammar

' ... then, in Form_Load or similar:
Set RecoContext = New SpSharedRecoContext
Set Grammar = RecoContext.CreateGrammar(1)
Grammar.DictationSetState SGDSActive

' The following is the event handler for the recognition event...
Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, _
                                    ByVal StreamPosition As Variant, _
                                    ByVal RecognitionType As SpeechRecognitionType, _
                                    ByVal Result As ISpeechRecoResult)
    Dim strText As String

    ' Put the recognized text in a TextBox, or whatever...
    strText = Result.PhraseInfo.GetText(0, -1, True)

End Sub

The above is for the built-in dictation grammar, which of course can be trained to your voice and speech inflection. (You may be surprised at how well it performs with no training at all!)

Specific command-and-control grammars are now done in XML format. Here is a sample grammar I wrote so my kid could play with a Math Drill program I wrote for him in VB:

<RULE ID="1" Name="number" TOPLEVEL="ACTIVE">
<L PROPNAME="number">
<P VALSTR="ADD">add</P>
<P VALSTR="SUBTRACT">subtract</P>
<P VALSTR="MULTIPLY">multiply</P>
<P VALSTR="DIVIDE">divide</P>
<P VAL="0">zero</P>
<P VAL="1">one</P>
<P VAL="2">two</P>
<P VAL="3">three</P>
<P VAL="4">four</P>
<P VAL="5">five</P>
<P VAL="6">six</P>
<P VAL="7">seven</P>
<P VAL="8">eight</P>
<P VAL="9">nine</P>
<P VAL="10">ten</P>
<P VAL="11">eleven</P>
<P VAL="12">twelve</P>
<P VAL="13">thirteen</P>
<P VAL="14">fourteen</P>
<P VAL="15">fifteen</P>
<P VAL="16">sixteen</P>
<P VAL="17">seventeen</P>
<P VAL="18">eighteen</P>
<P VAL="19">nineteen</P>
<P VAL="20">twenty</P>
<P VAL="21">twenty one</P>
<P VAL="22">twenty two</P>
<P VAL="23">twenty three</P>
<P VAL="24">twenty four</P>
<P VAL="25">twenty five</P>
<P VAL="26">twenty six</P>
<P VAL="27">twenty seven</P>
<P VAL="28">twenty eight</P>
<P VAL="29">twenty nine</P>
<P VAL="30">thirty</P>

What we do here is load the grammar file and set the rule active (note that in the actual .xml file, the RULE element is wrapped in a GRAMMAR root element):

Private Sub Initialize()
    Set Voice = New SpVoice
    If (RecoContext Is Nothing) Then
        Debug.Print "Initializing SAPI reco context object..."
        Set RecoContext = New SpSharedRecoContext
        Set Grammar = RecoContext.CreateGrammar(1)
        Grammar.CmdLoadFromFile App.Path & "\newnums.xml", SLOStatic
        Grammar.DictationSetState SGDSInactive
        Grammar.CmdSetRuleIdState 1, SGDSActive
    End If
End Sub

What happens is that recognition now occurs only when one of the phrases listed under the RULE "number" is spoken. We can then determine which property was matched with the following code in the event routine:

strText = Result.PhraseInfo.GetText(0, -1, True) ' what they said
strNumber = Result.PhraseInfo.Properties(0).Value ' which property in the rule was matched
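In the Math Drill program, the event handler can then branch on the matched value. Here's a sketch of what that might look like (the control names here, lblOperation and txtAnswer, are hypothetical, not from the actual program):

```vb
Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, _
                                    ByVal StreamPosition As Variant, _
                                    ByVal RecognitionType As SpeechRecognitionType, _
                                    ByVal Result As ISpeechRecoResult)
    Dim varValue As Variant
    varValue = Result.PhraseInfo.Properties(0).Value

    Select Case varValue
        Case "ADD", "SUBTRACT", "MULTIPLY", "DIVIDE"
            ' One of the VALSTR operator phrases was spoken
            lblOperation.Caption = varValue      ' hypothetical label control
        Case Else
            ' A VAL number (0-30) was spoken
            txtAnswer.Text = CStr(varValue)      ' hypothetical textbox control
    End Select
End Sub
```

The nice part is that VALSTR phrases arrive as strings and VAL phrases as numbers, so one Select Case handles both.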

The above is the most rudimentary grammar; XML grammars can get much more sophisticated. We can program in rules that enable us to follow the flow of a conversation that would occur in, for example, making an airline and rental-car reservation: selecting a flight, a time, a carrier, and much more.
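For instance, rules can reference other rules by name via RULEREF, letting you compose small rules into full phrases. Here is an illustrative sketch of what such a grammar might look like (check the SDK's grammar schema documentation for the full format):

```xml
<GRAMMAR LANGID="409">
  <RULE NAME="operator" ID="2">
    <L PROPNAME="operator">
      <P VALSTR="ADD">plus</P>
      <P VALSTR="SUBTRACT">minus</P>
    </L>
  </RULE>
  <RULE NAME="problem" ID="3" TOPLEVEL="ACTIVE">
    <P>what is</P>
    <RULEREF NAME="number"/>
    <RULEREF NAME="operator"/>
    <RULEREF NAME="number"/>
  </RULE>
</GRAMMAR>
```

With a grammar like this, the engine can recognize a whole utterance such as "what is five plus three" and hand your event handler the matched properties for each piece.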

With the code snippets above, you basically have about 90% of all the coding you need to know to create virtually any kind of speech-enabled application! You can match utterances against a database, you can use it with SQL Server English Query, you can create sophisticated command-and-control applications that let you check and send email, run specific programs, dictate a letter, or, as in my case, construct learning-type programs for specific groups of individuals or purposes.

I hope this intro gets you interested enough to get started with SAPI 5.1. A word of thanks to the people at Microsoft who worked hard to put out a real quality product for developers.

Peter Bromberg is an independent consultant specializing in distributed .NET solutions, a Senior Programmer/Analyst in Orlando, and a co-developer of the developer website. He can be reached at