I want to do speech processing on Raspberry Pi to detect specific people (something like unique identification).
I would prefer to use only the on-board processor for this, you could assume that internet is not accessible.
Also, what are the limitations of the Raspberry Pi for speech processing? If I want to use this as an automatic attendance machine, how should I
proceed?
4 What kind of "speech processing" are you talking about: recognition of pre-recorded speech samples (then you can use some kind of similarity index between the original, i.e. pre-recorded,
and test speech files), or "real" speech recognition (which can be heavily CPU-intensive in real time, especially for some languages and a good recognition rate)? – TomiL Oct 29 '13 at 9:10
7 Answers
This is the main project my Raspberry Pi is dedicated to right now, so I figure I can add my two cents.
Keep in mind this project is still very much a work in progress.
I chose to use the C programming language for this project, exclusively on the Raspbian OS, and that
may have affected some of my decisions and instructions. I'm going to list only free and open-source
software, since that is all that I use.
For the installation instructions, I will assume you have a fully up-to-date system.
Speech Recognition
Here are some options for speech recognition engines:
1. Pocketsphinx - A version of Sphinx that can be used in embedded systems (e.g., based on an
ARM processor).
Pros: Under active development, and incorporates features such as fixed-point arithmetic and
efficient algorithms for GMM computation. All the processing takes place on the Raspberry Pi,
so it is capable of being used offline, and it supports real-time speech recognition.
Cons: It is complicated for beginners to set up and understand. For me, it was too inaccurate
for my application. All the processing takes place on the Raspberry Pi, making it a bit slower.
Installation instructions:
1. Download the latest stable versions of Sphinxbase and Pocketsphinx:
$ wget http://sourceforge.net/projects/cmusphinx/files/sphinxbase/0.8/sphinxbase-0.8.tar.gz
$ wget http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/0.8/pocketsphinx-0.8.tar.gz
2. Extract the downloaded files:
$ tar -zxvf pocketsphinx-0.8.tar.gz; rm -rf pocketsphinx-0.8.tar.gz
$ tar -zxvf sphinxbase-0.8.tar.gz; rm -rf sphinxbase-0.8.tar.gz
3. To compile these packages, you'll need to install bison and the ALSA development
headers.
NOTE: It is important that the ALSA headers be installed before you build Sphinxbase.
Otherwise, Sphinxbase will not use ALSA. It also appears that ALSA will not be used if
PulseAudio is installed (a bad thing for developers like me).
$ sudo apt-get install bison libasound2-dev
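4. Compile and install both packages. The build commands themselves aren't listed above; what follows is the usual autotools sequence for these packages (Sphinxbase first, since Pocketsphinx depends on it):

```
$ cd sphinxbase-0.8
$ ./configure
$ make
$ sudo make install
$ cd ../pocketsphinx-0.8
$ ./configure
$ make
$ sudo make install
```

Once installed, `pocketsphinx_continuous` gives a quick live test against the default microphone.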
If you want to tweak it, I recommend you read some information on the CMUSphinx Wiki.
2. libsprec - A speech recognition library developed by H2CO3 (with a few contributions from
me, mostly bug fixes).
Pros: It uses the Google Speech API, making it more accurate. The code is easier to
understand (in my opinion).
Cons: It has dependencies on other libraries that H2CO3 has developed (such as libjsonz).
Development is spotty. It uses the Google Speech API, meaning processing doesn't take
place on the Raspberry Pi itself, and requires an internet connection. It requires one small
modification to the source code before compilation to work properly on the Raspberry Pi.
Installation instructions:
1. Install libflac, libogg and libcurl:
$ sudo apt-get install libcurl4-openssl-dev libogg-dev libflac-dev
2. Download the most recent version of libsprec:
$ wget https://github.com/H2CO3/libsprec/archive/master.zip
3. Unzip the downloaded file:
$ unzip master.zip; rm -rf master.zip
You should now have a folder named libsprec-master in your current directory.
4. Download the most recent version of libjsonz:
$ wget https://github.com/H2CO3/libjsonz/archive/master.zip
5. Unzip the downloaded file:
$ unzip master.zip; rm -rf master.zip
You should now have a folder named libjsonz-master in your current directory.
6. cd into the libjsonz-master directory, compile, and install:
$ cd libjsonz-master
$ mv Makefile.linux Makefile
$ make
$ sudo make install
7. cd out of the libjsonz-master directory and into the libsprec-master/src directory. Edit
line 227 so that the program will use ALSA to point to the USB microphone.
8. Compile and install:
$ mv Makefile.linux Makefile
$ make
$ sudo make install
9. You can now use the library in your own applications. Look in the example folder in
libsprec-master for examples.
5. Julius needs an environment variable called ALSADEV to tell it which device to use for a
microphone:
$ export ALSADEV="plughw:1,0"
6. Download a free acoustic model for Julius to use. Once you have downloaded it, cd into
the directory and run:
Speech Synthesis
Here are some options for speech synthesis engines:
1. tritium - A free, premium quality speech synthesis engine written completely in C (and developed
by yours truly).
Pros: Extremely portable (no dependencies besides CMake to build), extremely small
(smallest one that I could find), easy to build.
Cons: The speech output itself can be inaccurate at times. Support for a wide variety of
languages is lacking, as I am the sole developer right now with little free time, but this is one of
the future goals of the project. Also, as of right now, compiling only outputs a library and no
usable/testable executable.
2. eSpeak - A compact open source software speech synthesizer for Linux, Windows, and other
platforms.
Pros: It uses a formant synthesis method, providing many spoken languages in a small size.
It is also very accurate and easy to understand. I originally used this in my project, but
because of the cons I had to switch to another speech synthesis engine.
Cons: It has some strange dependencies on X11, causing it to sometimes stutter. The library
is also considerably large compared to others.
Installation instructions:
1. Install the eSpeak software:
$ sudo apt-get install espeak
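2. A quick test that the install worked (the -s and -w flags, speaking speed and WAV-file output, are optional):

```
$ espeak "Hello from the Raspberry Pi"
$ espeak -s 140 -w hello.wav "Hello from the Raspberry Pi"
```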
4. Flite - A small run-time speech synthesis engine derived from Festival and the Festvox project.
Pros: Under constant development at Carnegie Mellon University. Very small engine
compared to others. It also has a smaller code base, so it is easier to go through. It has
almost no dependencies (a huge pro for me, and another reason I decided to use this engine
in my project).
Cons: The speech output itself is not always accurate. The speech has a very metallic, non-human
sound (more so than the other engines). It doesn't support very many languages.
Installation instructions:
1. Install the Flite software:
$ sudo apt-get install flite
2. To run Flite:
$ flite -t "text that you want flite to say"
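Flite can also render to a WAV file instead of playing it, which is handy for scripting (the filename here is just an example):

```
$ flite -t "text that you want flite to say" -o speech.wav
```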
On a more serious note, the Raspberry Pi has plenty of resources to handle speech processing. As
long as the person performing the speech processing knows what they are doing, the Raspberry Pi
should be able to handle it just fine.
None of these options are accurate enough to tell the difference between specific people yet. That is
something I am working on in my project (and probably will be for a while). If you are looking for a
better option for automatic attendance, I would look into facial recognition. There are more limitations
on facial recognition for the Raspberry Pi though, so keep that in mind.
5 This is an awesome answer! You really pulled out all the tricks :) – ppumkin Feb 2 '14 at 19:02
+1'd a while ago, but I just noticed that H2CO3 is no longer around on SE. Your link to his profile 404s. –
The Guy with The Hat Mar 5 '14 at 17:42
Do you have a way to only send sounds to Google if someone has said a pre-recorded word first as a trigger word? (I'm
talking about the "Roll your own library" part of your post) – Robert Sep 20 '14 at 4:08
@Robert There is, but it's quite complicated, and involved me integrating PocketSphinx so that I could have trained offline
voice recognition. I can perhaps update the post later with some more information on this if you would like. – syb0rg Sep 20
'14 at 23:44
To manage the fact that it needs to stop listening when using the speech synth, I used amixer to
handle the input volume to the mic (this was recommended as best practice by CMU, since stop-starting
the engine will result in poorer recognition), with a matching command to mute the listening
when the speech synth plays.
FILE: mute.sh
#!/bin/sh
sleep $1;
amixer -c 1 set Mic 0 unmute >/dev/null 2>&1 ;
echo "** MIC OFF **" >> /home/pi/PIXIE/audio.log
To calculate the right times to mute for, I just run soxi via Lua and then set the unmute.sh (the
opposite of mute.sh) to run "x" seconds from the startup. There are no doubt lots of ways to handle
this; I am happy with the results of this method.
LUA SNIPPET:
Again, there are other ways, but I like my output this way.
For the synth I used Cepstral's fledgling Pi solution, but it's not available online; you have to contact
them directly to arrange to buy it, and it is around $30. The results are acceptable, however the
speech does create some nasty clicks and pops. The company has replied saying they no longer have
a Raspberry Pi and are unwilling to improve the product. YMMV
The voice recognition sits at around 12% CPU when "idle", and spikes briefly when doing a chunk of
recognition.
The play / sox weighs in pretty heavily but I do apply real-time effects to the rendered voices as I play
them ;)
The Pi is heavily stripped down, using every guide I could find to stop unrequired services, and runs in
complete CLI mode, overclocked to 800 MHz (the smallest overclock setting).
When fully running, it sits at about 50 °C in direct sunlight and 38 °C in the shade. I have heat sinks
fitted.
Last point: I actually run all this gear out to an "internet driven" AI as a nice extra.
The Pi handles all this seamlessly, and plays out any networked audio in real time, and fully looped
audio to any other Unix box, etc.
To handle the large speech CPU overhead burden, I have implemented an md5sum-based caching
system so the same utterances are not rendered twice. (About 1000 files @ 220 MB total covers 70%
of the utterances I generally get back from the AI.) This really helps bring the total CPU load down
overall.
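For reference, the cache idea can be sketched in a few lines of shell. flite is used here as a stand-in synth, and the cache path, function name, and aplay playback are my assumptions, not the author's actual setup:

```shell
#!/bin/sh
# Sketch of an md5sum-keyed TTS cache: render each utterance once, replay from disk after that.
CACHE=${CACHE:-/home/pi/PIXIE/cache}   # assumed cache directory

say_cached() {
    mkdir -p "$CACHE"
    key=$(printf '%s' "$1" | md5sum | awk '{print $1}')   # hash the utterance text
    wav="$CACHE/$key.wav"
    [ -f "$wav" ] || flite -t "$1" -o "$wav"              # synthesize only on a cache miss
    aplay "$wav"                                          # cached replies cost almost no CPU
}

# say_cached "Hello, how can I help?"
```

Repeated utterances then skip the synth entirely, which is where the CPU saving comes from.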
In précis, this is all totally doable. However, the voice recognition will only be as good as the quality of
your mics, your language model, how close your subjects' voices are to the original intended audience
(I use an en_US model on en_UK children, not perfect), and other minutiae that with effort you can
whittle down to a decent result.
And for the record, I already did all this once before on a kindle (and that worked too with cmu sphinx
and flite). Hope this helps.
The answer where the OP states "I send it off to google for processing", would love to know where exactly you send that. –
twobob Jun 2 '14 at 15:57
1 I am that OP. You can ping me in this chat room, and I should be able to get ahold of you in a short while. We can discuss
more there, and I can add items to my answer then as well. – syb0rg Jun 11 '14 at 4:14
Yes. Use PocketSphinx for speech recognition, Festvox for text-to-speech (TTS), and some USB audio
device with line in (or an old supported webcam which also has line in).
Google searches for these software packages and "Raspberry Pi" provide many examples and tutorials
to set this up.
SiriProxy - Only use this if you have a device that uses Siri - you don't need to jailbreak anything. It
basically intercepts Siri on the network you install it on.
Speech2Text - You can use Google's API to decode speech to text, but the example contains some
other methods too.
Julius - A speech recognition decoder.
As pointed out by Lenik, you will need some way to record audio, or possibly send audio files to the
Raspberry Pi, for them to get decoded somehow.
SiriProxy and Speech2Text do not do speech processing on the raspberry pi, they use Apple/Google servers. –
Dr.Avalanche Oct 29 '13 at 12:53
2 Yea, I said that. But they are still an interesting solution for speech recognition nevertheless. Besides, the OP did not impose
any restrictions. Thanks for the downvote. grumble – ppumkin Oct 29 '13 at 13:56
"...**on** a raspberry pi", by uploading it and doing the processing on other serves, these do not match the criteria specified
in the question. It's also interesting that you complain about downvotes, given your history of downvoting posts you claim
are of low quality or don't address the question. – Dr.Avalanche Oct 29 '13 at 14:19
2 On the Pi does not mean more than using the Pi. The Pi is capable of connecting to the internet, so I gave the option - it was
not specifically said "I do not want to use the internet", or that there is no way to use the internet. Possibly he might update his
question and my answer will become irrelevant. I only have a history of downvoting posts that needed it. I never downvote
unless I can see room for improvement. I am sure we dealt with that before. – ppumkin Oct 29 '13 at 15:09
1 I think the last comment said something like "Please improve this answer", and then I will upvote you. The actual FAQ of the
entire network frowns upon linking to external guides. I only want to offer good advice - yet you still choose to be upset
with me. I expressed my opinion about the desolder braid, you went ballistic and are still holding a grudge. But still you did
not even try to improve the answer. I flagged it - maybe somebody will remove it or convert it to a comment and the
downvote against you will be removed. What is stalking and downvoting me going to prove? – ppumkin Oct 29 '13 at 15:42
The Raspberry Pi has no built-in ADC or microphone input. Unless you're planning to use an external USB
mike, there's basically no way to get your audio stream into the device. Besides that, there are no
serious limitations; the CPU is powerful enough for any sound processing you might try to implement.
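A quick way to verify that a USB mike is actually visible to ALSA (card 1, device 0 is the usual slot for a single USB audio device, but check the listing first):

```
$ arecord -l                                              # list ALSA capture devices
$ arecord -D plughw:1,0 -f S16_LE -r 16000 -d 5 test.wav  # hypothetical 5-second test grab
$ aplay test.wav
```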
Firstly, you should select a set of words for the classification process. After that, you should collect
data from users/subjects; it will be a non-stationary signal. You have to reduce your data, both to cut
computational costs and to improve the success ratio, so you should look for suitable feature
extraction methods for your application. You can get a feature vector as a result of these methods
(mean absolute value, RMS, waveform length, zero crossing, integrated absolute value,
AR coefficients, median frequency, mean frequency, etc.). Then, you should use a classification method
like k-NN, neural networks, etc. to classify your data. Lastly, you have to check its accuracy. To sum up:
I have seen video processing projects with the RPi on the internet, so it should manage to do this
classification.
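The feature-extraction step above can be sketched with plain shell tools. This toy example (the sample values are made up for illustration) computes two of the listed features, RMS and zero-crossing count, the kind of numbers that would go into the feature vector:

```shell
# Compute RMS and zero-crossing count for a toy signal, using awk
printf '%s\n' 1 -2 3 -4 | awk '
  { s += $1 * $1                                  # sum of squares for RMS
    if (NR > 1 && ((p < 0) != ($1 < 0))) z++      # sign change = zero crossing
    p = $1 }
  END { printf "RMS=%.3f ZC=%d\n", sqrt(s / NR), z }'
# prints: RMS=2.739 ZC=3
```

A real pipeline would run this over windowed samples from a recorded WAV and feed the resulting vectors to the classifier.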
You can use an NI 6009 USB DAQ (which supports the RPi) for collecting any analog data, but it is a
little bit expensive.
https://code.google.com/p/voiceid/ (Voiceid, a speaker-identification library)