INK
THE COMPUTER APPLICATIONS JOURNAL

Brad’s Tiny Voice, based on an ’HC705 and powered off a 9-V battery, can be trained to recognize up to 16 command templates and costs less than $5. Toys, voice-activated padlocks, and remote controls had better listen up.

Voice recognition has come a long way in the past five years, due mainly to the advent of cheap and powerful PCs equipped with Pentiums and MMX technology. Performance continues to improve to the point where parts of this article were comfortably voice-dictated via Kurzweil VoicePlus.

But, this performance comes at a cost. You need fast Pentiums with MMX, at least 16 MB of DRAM, and even more disk storage.

What if your application has a budget of a couple dollars? Can you still embed some form of voice recognition or voice command and control into your product?

In this article, I’ll show you how to implement a voice-command system for under $5. I conclude with some application examples and recommendations to improve the system even further.

TINY VOICE

[...] low-cost consumer products.

A voice command can be one or several words, with a total maximum length of 1.6 s and a minimum of 0.2 s. Response time is typically <100 ms. By carefully selecting the vocabulary and context, over 95% recognition accuracy is possible.

The heart of the system is the 68HC705J1A Motorola 8-bit processor. There were a number of reasons why I chose this part over a comparable one from Zilog or Microchip.

There’s sufficient RAM (64 bytes) to buffer the input waveforms and hold template structures, and its 1240 bytes of ROM provide enough program storage. Also, interrupts are supported, including changes on the I/O lines.

This system is inexpensive (<$2) in high volume. The development kit is cheap, too, at $99.

Shown in Photo 1, the Tiny Voice system was built on a 3″ × 3″ breadboard and is powered off a 9-V battery. Standby current consumption is ~2 mA, which is primarily due to the op-amp and electret microphone bias.
[...] system enters the Stop mode and the Buffer Full error is displayed.

The Not Recognized error code is displayed if the input utterance doesn’t match a stored template. The system then enters Stop mode and waits for new input.

TINY ALGORITHMS

The software for Tiny Voice was written entirely in assembly. There is a total of eight routines.

The main program, MAIN.ASM, responds to events and schedules the remaining subroutines.

COMPARE.SUB handles the pattern matching. It compares the input template to each active template in memory and calculates the best match.

EEPROM.SUB handles the reading and writing of data to the EEPROM. It bit-bangs two I/O pins to simulate an I2C protocol used by the EEPROM.

IRQ.SUB is the interrupt handler. Interrupts are caused by a button press.

The most complicated routine is INPUT.SUB. It samples the input, determines where the word starts and ends, and builds up the voice template.

TIME_NOR.SUB normalizes the length of the speech input to a fixed length of twelve two-element data values.

DIV16_8.SUB is an integer divide routine that divides a 16-bit number by an 8-bit number. This routine is called repeatedly by the time-normalization routine.

And finally, DELAYMS.SUB is a simple program where a delay is set by the value passed in the accumulator.

Tiny Voice is entirely event-driven and spends most of its time in the Stop mode. Events are caused by the interrupt of pressing push buttons. The event handler is shown in Figure 3.

INPUT ROUTINE

When a Recognize or Train event occurs, the input routine is invoked (see Figure 4). A timer is set up and polled until 110 µs has elapsed.

An interrupt routine could have been used to time the samples every 110 µs, but I was concerned that the overhead to service the interrupt might make it difficult to complete all the paths in the input routine within 110 µs.

Once the time elapses, the input square wave is sampled. If the sign changes from the previous measurement, one of the two frequency bytes is updated.

The threshold limit is set to six. In other words, if the pulse (positive or negative) is greater than six samples (roughly corresponding to 1.5 kHz), the “high” frequency byte is incremented by one. If it’s less than six, the “low” frequency byte is incremented.

The rest of the routine is basically a state machine that uses speech activity as an input to determine an utterance bounded by silence. At each rising or falling edge, another byte counts the zero crossings.

After 256 samples, a frame counter advances and several tests are made. If the frame counter is greater than 64, the input buffer is filled (i.e., you spoke too long) or there is too much background noise, and an error is generated.

Otherwise, a timeout value is decremented and tested. This setup enables [...]

TIME NORMALIZATION

Words vary in length. But for this algorithm to work, the lengths must be normalized to a fixed value.

Each sample consists of two bytes sampled over one frame of 256 samples. The unnormalized data in the first 128 bytes of the EEPROM is normalized to a set of 12 vectors in main RAM.

The vector in RAM is built up, element by element, by down- or up-sampling the raw data in EEPROM. Since there are two elements per feature, a template has a fixed memory length of 24 bytes.

THE MAIN ROUTINE

If the event is for training, the normalized vector in RAM is stored in memory according to the template number selected. Templates are stored in memory locations 128–512, which allows for sixteen 24-byte templates. No comparisons are performed.

If the system is recognizing, the normalized input utterance, which is stored in RAM, is compared element by element to each previously trained template stored in EEPROM.

The comparison is a simple Euclidean distance measure, and an error value accumulates. The minimum error value is selected and compared to a threshold.

If the result is above the threshold, the system rejects the recognition. If the value is low enough, the word is recognized.

Well, almost. Two more criteria must also be met: the score must be low enough, and the two smallest scores must differ by a large enough value.

TINY APPLICATIONS

For testing purposes, the system was trained with eight words: “VCR”, “television”, “telephone”, “stereo”, [...]
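
As a companion to the INPUT ROUTINE section, the per-frame bookkeeping can be modeled on a desktop in a few lines of Python. This is only a sketch of the counting rule described there, not the INPUT.SUB firmware. The function name frame_features and the list-of-bits input are assumptions made for illustration. The "high" and "low" naming follows the article's wording as written, and because the text does not say what happens to a pulse of exactly six samples, the sketch counts it in neither byte.

```python
PULSE_THRESHOLD = 6   # pulse-width threshold in samples (from the article)

def frame_features(bits):
    """Return (high, low, crossings) for one frame of 1-bit samples.

    bits is a sequence of 0/1 values, one per 110-us sample of the
    squared-up microphone signal (256 per frame in the article).
    """
    high = low = crossings = 0
    run = 1                           # length of the current pulse in samples
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            run += 1                  # same sign: the pulse keeps growing
            continue
        crossings += 1                # rising or falling edge
        if run > PULSE_THRESHOLD:
            high += 1                 # naming as given in the article
        elif run < PULSE_THRESHOLD:
            low += 1
        # run == PULSE_THRESHOLD: unspecified in the text, counted nowhere
        run = 1                       # start timing the next pulse
    return high, low, crossings
```

A frame made of 8-sample pulses puts every completed pulse in the first counter, while a frame that toggles on every sample puts them all in the second. The final pulse of a frame is never closed by an edge, so it is not counted.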
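
The TIME NORMALIZATION section says only that the raw frames are down- or up-sampled to 12 two-element vectors, and that an integer divide routine is called repeatedly along the way. One simple scheme consistent with that description is nearest-index resampling with an integer divide per output element. The Python sketch below assumes that scheme. The function name time_normalize and the list-of-pairs input are hypothetical, and the actual TIME_NOR.SUB may compute its indices differently.

```python
N_OUT = 12   # feature pairs per normalized template (from the article)

def time_normalize(frames, n_out=N_OUT):
    """Resample a list of (a, b) byte pairs to n_out pairs, flattened.

    The result is a flat list of 2 * n_out values, i.e. the 24-byte
    template layout the article describes.
    """
    n_in = len(frames)
    if n_in == 0:
        raise ValueError("no frames to normalize")
    template = []
    for i in range(n_out):
        src = (i * n_in) // n_out     # assumed mapping: one integer divide
        a, b = frames[src]
        template.extend((a, b))
    return template
```

With 24 input frames every second frame is kept, and with 6 input frames each frame is used twice, so long and short utterances both end up as 24 values.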
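
The acceptance test in THE MAIN ROUTINE (best score low enough, and the two smallest scores far enough apart) can be sketched the same way. The Python model below uses the squared Euclidean distance, which ranks templates in the same order as the true distance and avoids a square root. Whether COMPARE.SUB does the same is an assumption. The names distance and recognize, and the max_score and min_margin parameters, are hypothetical, since the article does not give its threshold values.

```python
def distance(a, b):
    """Squared Euclidean distance between two equal-length byte vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(utterance, templates, max_score, min_margin):
    """Return the index of the accepted template, or None if rejected.

    max_score  -- largest error allowed for the best match (assumed value)
    min_margin -- smallest allowed gap between the two lowest errors
    """
    scores = sorted((distance(utterance, t), i)
                    for i, t in enumerate(templates))
    if not scores:
        return None                   # nothing has been trained yet
    best_score, best_index = scores[0]
    if best_score > max_score:
        return None                   # best match is still too far away
    if len(scores) > 1 and scores[1][0] - best_score < min_margin:
        return None                   # two templates are too close to call
    return best_index
```

The margin test is what turns an ambiguous utterance into a rejection: if two trained words score almost equally well, the sketch returns None instead of guessing.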