
Circuit Cellar INK®
THE COMPUTER APPLICATIONS JOURNAL

FEATURE ARTICLE

Low-Cost Voice Recognition

Brad Stewart

Brad’s Tiny Voice, based on an ’HC705 and powered off a 9-V battery, can be trained to recognize up to 16 command templates and costs less than $5. Toys, voice-activated padlocks, and remote controls had better listen up.

Voice recognition has come a long way in the past five years, due mainly to the advent of cheap and powerful PCs equipped with Pentiums and MMX technology. Performance continues to improve to the point where parts of this article were comfortably voice-dictated via Kurzweil VoicePlus.

But this performance comes at a cost. You need fast Pentiums with MMX, at least 16 MB of DRAM, and even more disk storage.

What if your application has a budget of a couple of dollars? Can you still embed some form of voice recognition or voice command and control into your product?

In this article, I’ll show you how to implement a voice-command system for under $5. I conclude with some application examples and recommendations to improve the system even further.

TINY VOICE
My system, Tiny Voice, is based on a low-cost, 20-pin single-chip controller. It’s a speaker-dependent, template-based, isolated-word recognizer. You train it to recognize your voice.

Up to 16 voice patterns are stored in a nonvolatile 512-byte serial EEPROM. Five push buttons enable programming and operation, and seven LEDs give status.

For embedded systems, Tiny Voice can be controlled over a parallel or serial protocol from a host microcontroller, or it can run stand-alone. The source code may be modified to fit your requirements.

At under $5, Tiny Voice won’t do dictation. But it’s good for applications like toys, repertory phone dialers, voice-activated padlocks, security systems, remote controls, and other low-cost consumer products.

A voice command can be one or several words, with a total maximum length of 1.6 s and a minimum of 0.2 s. Response time is typically <100 ms. By carefully selecting the vocabulary and context, over 95% recognition accuracy is possible.

The heart of the system is the 68HC705J1A Motorola 8-bit processor. There were a number of reasons why I chose this part over a comparable one from Zilog or Microchip.

There’s sufficient RAM (64 bytes) to buffer the input waveforms and hold template structures, and its 1240 bytes of ROM provide enough program storage. Also, interrupts are supported, including changes on the I/O lines.

This system is inexpensive (<$2) in high volume. The development kit is cheap, too, at $99.

Shown in Photo 1, the Tiny Voice system was built on a 3″ × 3″ breadboard and is powered off a 9-V battery. Standby current consumption is ~2 mA, which is primarily due to the op-amp and electret microphone bias.

12 Issue 91 February 1998 Circuit Cellar INK® www.circuitcellar.com


With some added power management, standby current could be reduced to a few microamps. Operating current while sampling and analyzing speech is ~10 mA.

THEORY OF OPERATION
The 68HC05 processor is very simple. There are no ADCs, so you need a way to convert the time-domain signal to a format the microcontroller can recognize.

The small amount of memory requires a lot of approximations and simplifications to convert the speech into a small set of features.

To meet these limitations, I use a simplified formant tracker. The microphone input is high-pass filtered and then infinitely clipped using two operational amplifiers. The resulting square wave is connected to an MCU input.

By sorting and tallying long and short pulse widths of the square wave, you get a crude but effective two-channel frequency analyzer. One channel gives frequencies below 1500 Hz, and the other ranges from 1500 Hz to 5 kHz.

These two frequency areas roughly define F1 and F2, the two formant regions of speech. It’s a well-known principle that F1 and F2 for a given speaker and a given set of vowels remain the same.

Using F1 and F2 was first tried in 1952 by Bell Labs employing vacuum tubes and capacitors for memory. Crude as it sounds, that system achieved 97% recognition accuracy!

The input signal is high-pass filtered (i.e., pre-emphasized) to accentuate the F2 frequencies. Figure 1 illustrates why this is necessary.

Figure 1a is a sample of the voiced vowel sound “ee” as in “speech.” Note the F2 component shown by the arrow. Also note that these high-frequency wiggles do not cross the zero axis. Thus, if the waveform is infinitely amplified and clipped, the square wave would not reveal the F2 component.

However, Figure 1b shows what happens after pre-emphasis. The F2 wiggles cross the zero axis, and the resultant infinitely clipped square wave now contains both F1 and F2 (see Figure 1c).

Figure 1: (a) This is a waveform of the voiced sound “ee” as in “speech.” The arrow points to high-frequency wiggles corresponding to the second formant (F2). Note that these wiggles do not cross the zero axis. (b) After pre-emphasis or high-pass filtering, the F2 components now cross the zero axis with the same waveform. (c) After being infinitely clipped, the waveform of Figure 1b is a square wave showing both F1 and F2 components. This signal is applied to the microprocessor via a digital input pin.

TINY HARDWARE
Figure 2 shows a schematic of the system. An electret condenser microphone is biased to 5 V via R4. The signal is then amplified by U2a.

C2 and R6 (along with C3 and R10) form a high-pass filter, with a cut-off frequency of 1600 Hz with an added zero at 800 Hz. This setup provides a pre-emphasis function.

C1 serves as a mild antialiasing low-pass filter. The output is fed to the second op-amp, which is configured as a comparator with some hysteresis. R8 sets the threshold of the comparator.

The comparator’s output is a square wave that’s applied to an input pin of the processor. The threshold defines the beginning and end of a speech utterance. With no signal present, the second op-amp’s output is at a DC level.

Voice pattern data is stored in a nonvolatile EEPROM. For this project, I selected Ramtron’s FM24C04.

Figure 2: An electret condenser microphone (not shown) is biased to 5 V via R4. The signal is then amplified by U2a. C2 and R6 (along with C3 and R10) form a high-pass filter. The output is fed to the second op-amp, which is configured as a comparator whose output is connected to PB4 of the 68HC705J1. The EEPROM has a two-wire I²C interface, which is connected to PB1 and PB0. The remaining pins of the processor are connected to LEDs and push buttons.
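
To make the front end concrete, here is a minimal Python sketch of the chain described above: pre-emphasis, infinite clipping, then sorting pulse widths. It is not the Tiny Voice firmware. The test tone (300 Hz plus a weak 2200-Hz component), the first-difference filter standing in for the analog RC network, and all function names are assumptions made for illustration. Only the 110-µs sample interval and the six-sample pulse threshold come from the article.

```python
import math

SAMPLE_PERIOD_S = 110e-6   # sampling interval quoted in the article
LONG_PULSE_SAMPLES = 6     # pulse-width threshold quoted in the article


def synth_vowel(n, f1=300.0, f2=2200.0, a2=0.12):
    """Toy 'ee'-like waveform: strong F1 plus weak F2 wiggles (made-up values)."""
    return [math.sin(2 * math.pi * f1 * i * SAMPLE_PERIOD_S)
            + a2 * math.sin(2 * math.pi * f2 * i * SAMPLE_PERIOD_S)
            for i in range(n)]


def pre_emphasize(x):
    """First-difference high-pass filter, a digital stand-in for the RC network."""
    return [0.0] + [x[i] - x[i - 1] for i in range(1, len(x))]


def clip(x):
    """Infinite clipping: keep only the sign, as the comparator does."""
    return [1 if v > 0 else 0 for v in x]


def pulse_widths(bits):
    """Widths (in samples) of complete pulses, i.e. runs bounded by two edges."""
    widths, run, seen_edge = [], 0, False
    for prev, cur in zip(bits, bits[1:]):
        run += 1
        if cur != prev:
            if seen_edge:          # skip the partial run before the first edge
                widths.append(run)
            seen_edge, run = True, 0
    return widths


def tally(widths):
    """Count long and short pulses; the two counts stand in for the two channels."""
    long_count = sum(1 for w in widths if w > LONG_PULSE_SAMPLES)
    return long_count, len(widths) - long_count


frame = synth_vowel(256)   # one 256-sample frame
raw_long, raw_short = tally(pulse_widths(clip(frame)))
emph_long, emph_short = tally(pulse_widths(clip(pre_emphasize(frame))))
print("raw:          long =", raw_long, " short =", raw_short)
print("pre-emphasis: long =", emph_long, " short =", emph_short)
```

Without pre-emphasis, the clipped toy signal produces only long pulses, because the weak F2 component never crosses zero on its own. After the first-difference filter, short pulses appear as well, which is the effect Figure 1 illustrates.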



The FM24C04 uses ferroelectric cells. It has several advantages over a more generic part. For one thing, the FRAM part can be written to over 10 billion times, compared to about 10k cycles with a generic EEPROM. This feature is important here because the first 128 bytes are used for scratch-pad memory and are constantly written to.

Also, it has a deep write buffer. So, once the starting address is specified, the memory address is autoincremented and additional writes can be performed with no more intervention. As a result, writing to the device is very fast.

Generic parts, however, require you to set up the address every other byte before you write data. This task creates additional time overhead that may cause a bottleneck in the software flow, a major concern in a real-time system.

The FM24C04 has a low standby current of 25 µA as well as a low operation current of 100 µA. So, it’s well suited for battery operation.

The EEPROM’s first 128 bytes hold the transformed input utterance to be recognized or trained. Locations 128–511 store the feature vectors of previously trained utterances. Each vector occupies 24 bytes, so the maximum number of templates that can be stored is 16.

The rest of the circuit comprises a 5-V regulator, switches, and LEDs.

TINY USER INTERFACE
Before discussing the voice-recognition software, I want to describe the interface and how the system works from the user’s point of view.

Seven LEDs and four switches compose the Tiny Voice user interface. LEDs D2, D3, D4, and D5 make up a four-bit binary number that gives Tiny Voice’s status. It can either be the index of a voice command or an error message.

When power is connected or when the Reset switch is pressed, the Stop mode is entered. Pressing a push button activates the system and performs a certain function.

Pressing Select displays a binary number from 0 to 15 on four LEDs (D1), which selects the template number to be trained or untrained. Each time Select is pressed, the number increments, wrapping from 15 back to 0.

Pressing Train starts the Training mode. The On LED is activated, and the user is prompted to say the command to be trained.

While the user is speaking, the Sampling LED is lit during periods of speech and off during periods of silence. If the training is successful, the template is stored in EEPROM at the selected template location and the system enters the Stop mode.

Untrain modifies the data in the stored template so the pattern-matching algorithm skips over this template and does not consider it as a possible candidate.

This is useful for context switching of vocabularies. For example, out of the 16 templates, you may only need to scan for two words (e.g., “yes” or “no”), while ignoring the remaining 14.

To enable a template that was previously untrained, press the Train button and then press another button (e.g., Select) before speaking.

In Recognition mode, the speech is sampled and analyzed. The On LED is activated, and the user is prompted to say a previously trained command. As before, the Sampling LED is lit during speech and off during periods of silence.

The input is compared to the templates in memory and a decision is made. If recognition is successful, the result is displayed on the four LEDs in binary.

When Reset is pressed, Stop mode is entered and the system is ready to accept a push-button command. Previously trained commands are not erased.

When an error occurs, the Error LED is lit and the error code is displayed in binary using the same four LEDs that display the template index number. After ~2 s, the LEDs go off and the system enters Stop mode.

The error codes (Time Out, Buffer Full, and Not Recognized) are defined in the header file.

After Train or Recognize is pressed, the system waits for valid speech input. If no input occurs after ~6.5 s, the system enters the Stop condition and the Time Out error code is displayed.

On the other hand, the utterance must not be longer than 1.6 s.

Figure 3: The main routine implements the event handler. Events are generated by an interrupt caused by pressing a push button or by system reset. The events dispatched are Select, Train, Untrain, and Recognize.



Photo 1: My prototype was built on a 3″ × 3″ breadboard and is powered off a 9-V battery. The only ICs are the 68HC705J1 processor, LM358 dual-operational amplifier, the 4096-bit FM24C04 FRAM serial memory, and a 78L05 5-V regulator.

An utterance that runs too long fills the buffer: the system enters the Stop mode and the Buffer Full error is displayed.

The Not Recognized error code is displayed if the input utterance doesn’t match a stored template. The system then enters Stop mode and waits for new input.

TINY ALGORITHMS
The software for Tiny Voice was written entirely in assembly. There is a total of eight routines.

The main program, MAIN.ASM, responds to events and schedules the remaining subroutines.

COMPARE.SUB handles the pattern matching. It compares the input template to each active template in memory and calculates the best match.

EEPROM.SUB handles the reading and writing of data to the EEPROM. It bit-bangs two I/O pins to simulate the I²C protocol used by the EEPROM.

IRQ.SUB is the interrupt handler. Interrupts are caused by a button press.

The most complicated routine is INPUT.SUB. It samples the input, determines where the word starts and ends, and builds up the voice template.

TIME_NOR.SUB normalizes the length of the speech input to a fixed length of twelve two-element data values.

DIV16_8.SUB is an integer divide routine that divides a 16-bit number by an 8-bit number. This routine is called repeatedly by the time-normalization routine.

And finally, DELAYMS.SUB is a simple program where a delay is set by the value passed in the accumulator.

Tiny Voice is entirely event-driven and spends most of its time in the Stop mode. Events are caused by the interrupt of pressing push buttons. The event handler is shown in Figure 3.

INPUT ROUTINE
When a Recognize or Train event occurs, the input routine is invoked (see Figure 4). A timer is set up and polled until 110 µs has elapsed.

An interrupt routine could have been used to time the samples every 110 µs, but I was concerned that the overhead to service the interrupt might make it difficult to complete all the paths in the input routine within 110 µs.

Once the time elapses, the input square wave is sampled. If the sign changes from the previous measurement, one of the two frequency bytes is updated.

The threshold limit is set to six. In other words, if the pulse (positive or negative) is greater than six samples (roughly corresponding to 1.5 kHz), the “high” frequency byte is incremented by one. If it’s less than six, the “low” frequency byte is incremented.

The rest of the routine is basically a state machine that uses speech activity as an input to determine an utterance bounded by silence. At each rising or falling edge, another byte counts the zero crossings.

After 256 samples, a frame counter advances and several tests are made. If the frame counter is greater than 64, the input buffer is filled (i.e., you spoke too long) or there is too much background noise, and an error is generated.

Otherwise, a timeout value is decremented and tested. This setup enables the routine to exit if too much time elapses before any sound is input.

If the buffer isn’t full and a timeout has not occurred, then it tests the zero-crossing counter. Too low a value signifies silence, and a silence counter is incremented.

Otherwise, a sound-activity counter is incremented. If the sound-activity value is above a certain threshold and the silence value is high enough, the routine exits with a valid data sample.

TIME NORMALIZATION
Words vary in length. But for this algorithm to work, the lengths must be normalized to a fixed value.

Each raw data point consists of two bytes gathered over one frame of 256 samples. The unnormalized data in the first 128 bytes of the EEPROM is normalized to a set of 12 vectors in main RAM.

The vector in RAM is built up, element by element, by down- or up-sampling the raw data in EEPROM. Since there are two elements per feature, a template has a fixed memory length of 24 bytes.

THE MAIN ROUTINE
If the event is for training, the normalized vector in RAM is stored in memory according to the template number selected. Templates are stored in memory locations 128–511, which allows for sixteen 24-byte templates. No comparisons are performed.

If the system is recognizing, the normalized input utterance, which is stored in RAM, is compared element by element to each previously trained template stored in EEPROM.

The comparison is a simple Euclidean distance measure, and an error value accumulates. The minimum error value is selected and compared to a threshold. If the result is above the threshold, the system rejects the recognition. If the value is low enough, the word is recognized.

Well, almost. Two more criteria must also be met: the score must be low enough, and the two smallest scores must differ by a large enough value.

TINY APPLICATIONS
For testing purposes, the system was trained with eight words: “VCR”, “television”, “telephone”, “stereo”,
“CD”, “PC”, “yes”, and “no”. Each word was trained twice, thereby occupying 16 templates.

Recognition accuracy approaches 100% when background noise isn’t too severe. It also works with ~90% accuracy using speakers who didn’t train the system.

A speaker-independent vocabulary can be constructed by having multiple trainings of a few words. For example, training “yes” and “no” eight times over a set of different speakers yields excellent results.

A note of caution: when using Tiny Voice, don’t use a lot of short words (e.g., the numbers “one”, “two”, etc.). They’re a bit beyond its capabilities. And watch for commands that sound alike. For example, “on” and “off” will get you in trouble. Instead, try “turn on” and “off please”.

A fun application might be a voice-activated padlock. Change the code so you have to enter one, two, or three voice commands in sequence. Then, multiply the scores. If the result is small enough, then “open sez me.”

FUTURE TINY ENHANCEMENTS
Naturally, there are ways to improve the system. I was surprised by the HC05’s speed. I also wound up with at least 200 bytes of leftover ROM for more code. Tiny Voice’s code is modular, and updates can be easily added.

I can increase the EEPROM capacity to 1 or even 2 KB. This size would provide more template storage or allow for more frame features to better resolve differences in speech patterns.

I’d also like to add some fuzzy logic to the pattern-matching algorithm to improve recognition accuracy and the rejection criteria.

Adding a serial port instead of push buttons and LEDs could reduce cost and add more functionality. Threshold values could be changed, templates uploaded and downloaded, and so on.

I want an MCU-controlled gain adjustment on the input for different microphone levels and background noise.

Another improvement would be to add a dynamic time warp (DTW) algorithm to the pattern-matching routine. The DTW takes into account slight variations in how each word is pronounced, in particular variations in the lengths of phonemes.

But with only 200 bytes of code space left over, adding a DTW would be challenging. A first-order approximation may be achievable, however.

I’d rather use C than assembly language. When I started this project, I knew squeezing this functionality into 1200 bytes would be tough. So, a high-level language was out of the question.

Since then, I’ve had the opportunity to try out a C compiler from Byte Craft. The good news is, it generates small enough code. The bad news: I wish I’d used it earlier.

And as a final wish, I would like to use a different processor. Of all these improvements, this one is probably the best. You can now get equivalent MCUs with built-in ADCs, which would provide more elaborate signal processing and better noise rejection.

One of the best candidates for a low-cost system is the Sharp SM8500 8-bit MCU. It has almost everything you need for an embedded voice-command system, including a 10-bit ADC (8 channels) and an 8-bit DAC, which is useful for voice feedback and verification.

The SM8500 features SIO and UART ports to communicate with other system devices, 2 KB of internal RAM, as well as internal ROM and the ability to access external ROM or RAM. It also offers 80+ I/O pins for keypad and display interfacing, hardware multiply and divide, and a 250-ns instruction cycle.

Figure 4: Every 110 µs, the square-wave input is sampled and several options are considered, depending on the state of the frame, zero-crossing, silence, and sound counts. The state machine effectively captures the input utterance, while rejecting short bursts and input errors due to excessive background noise.



With all of that, short cycle time included, the SM8500 costs under $3.

If you’re willing to spend a bit more, then a new level of performance may be realized. New 32-bit RISC MCUs are becoming available in the sub-$15 or even sub-$10 range.

For example, the Sharp ARM710M RISC processor, running at a conservative 16 MHz, performs a complete FFT-Mel-Cepstrum analysis using only 50% of the processor’s resources.

With the ability of RISC processors to address large amounts of memory, you have the ingredients to put together a dictation system like the one I’m using now. And, it can run off a couple of pen-light batteries!

Brad Stewart is currently the product technical manager for RISC processors at Sharp Electronics. He also served as technical director for IPI, which specialized in voice-recognition and speech-compression software, and vice president of Covox, which specialized in multimedia products. You may reach Brad at bstewart@e-z.net or bstewart@sharpsec.com.

SOFTWARE
Source code (tinyvoice.zip) for this article may be downloaded from the Circuit Cellar Web site.

SOURCES
68HC705J1A
Motorola
MCU Information Line
P.O. Box 13026
Austin, TX 78711-3026
(512) 328-2268
Fax: (512) 891-4465

FM24C04
Ramtron Intl. Corp.
1850 Ramtron Dr.
Colorado Springs, CO 80921
(719) 481-7000
Fax: (719) 481-9294
www.ramtron.com

C Compiler
Byte Craft Limited
421 King St. N.
Waterloo, ON
Canada N2J 4E4
(519) 888-6911
Fax: (519) 746-6751
www.bytecraft.com

ARM710M, SM8500
Sharp Electronics Corp.
Microelectronics Gr.
5700 NW Pacific Rim Blvd., Ste. 20
Camas, WA 98607
(206) 834-2500
Fax: (206) 834-8903
www.sharpmeg.com

REFERENCES
B. Georgiou, “Give an Ear to Your
Computer,” BYTE, 56–91, June,
1978.
Motorola, MC68HC05J1A Techni-
cal Data Manual, 1997.
Sharp Electronics, SM8500 User’s
Guide, 1997.
B.C. Stewart and S. Sidman, “Design
and Use of Voice Recognition in
Embedded Applications,” Paper
presented at ESC East, Boston,
MA, 1997.

©Circuit Cellar INK, the Computer Applications Journal. Reprinted by permission. For subscription information, call (860) 875-2199 or subscribe@circellar.com

