
Speech processing on the Raspberry Pi

I want to do speech processing on the Raspberry Pi to detect specific people (something like unique identification).

I would prefer to use only the on-board processor for this; you can assume that the internet is not accessible.

Also, what are the limitations of the Raspberry Pi when performing speech processing? If I want to use this as an automatic attendance machine, how should I
proceed?

audio home-automation input-device speech-recognition speech-synthesis

asked Oct 29 '13 at 5:08 by Ruthvik Vaila, edited Dec 23 '15 at 5:48 by SlySven

4 What kind of "speech processing" are you talking about: recognition of the pre-recorded speech samples (then you can use some kind of similarity index between original, i.e. prerecorded
and test speech file) or "real" speech recognition (can be heavily CPU intensive in real time, especially for some languages and good recognition rate)? – TomiL Oct 29 '13 at 9:10

7 Answers

This is the main project my Raspberry Pi is dedicated to right now, so I figure I can add my two cents.
Keep in mind this project is still very much a work in progress.

I chose to use the C programming language for this project exclusively on the Raspbian OS, and that
may have affected some of my decisions and instructions. I'm going to only list free and open source
software, since that is all that I use.

For the installation instructions, I will assume you have a fully up-to-date system.

Speech Recognition
Here are some options for speech recognition engines:

1. Pocketsphinx - A version of Sphinx that can be used in embedded systems (e.g., based on an
ARM processor).
Pros: Under active development and incorporates features such as fixed-point arithmetic and
efficient algorithms for GMM computation. All the processing takes place on the Raspberry Pi,
so it is capable of being used offline. It supports real-time speech recognition.
Cons: It is complicated to set up and understand for beginners. For me, it was too inaccurate
for my application. All the processing takes place on the Raspberry Pi, making it a bit slower.
Installation instructions:
1. Download the latest stable versions of Sphinxbase and Pocketsphinx:
$ wget http://sourceforge.net/projects/cmusphinx/files/sphinxbase/0.8/sphinxbase-0.8.tar.gz
$ wget http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/0.8/pocketsphinx-0.8.tar.gz
2. Extract the downloaded files:
$ tar -zxvf pocketsphinx-0.8.tar.gz; rm -rf pocketsphinx-0.8.tar.gz
$ tar -zxvf sphinxbase-0.8.tar.gz; rm -rf sphinxbase-0.8.tar.gz

3. To compile these packages, you'll need to install bison and the ALSA development
headers.

NOTE: It is important that the ALSA headers be installed before you build Sphinxbase.
Otherwise, Sphinxbase will not use ALSA. It also appears that ALSA will not be used if
PulseAudio is installed (a bad thing for developers like me).
$ sudo apt-get install bison libasound2-dev

4. cd into the Sphinxbase directory and type the following commands:


$ ./configure --enable-fixed
$ sudo make
$ sudo make install

5. cd into the Pocketsphinx directory and type the following commands:


$ ./configure
$ sudo make
$ sudo make install

6. Test out Pocketsphinx by running:

$ src/programs/pocketsphinx_continuous -samprate 48000

If you want to tweak it, I recommend you read some information on the CMUSphinx Wiki. (A minimal C usage sketch appears after this list of engines.)
2. libsprec - A speech recognition library developed by H2CO3 (with a few contributions from
myself, mostly bug fixes).
Pros: It uses the Google Speech API, making it more accurate. The code is easier to
understand (in my opinion).
Cons: It has dependencies on other libraries that H2CO3 has developed (such as libjsonz).
Development is spotty. It uses the Google Speech API, meaning processing doesn't take
place on the Raspberry Pi itself, and requires an internet connection. It requires one small
modification to the source code before compilation to work properly on the Raspberry Pi.
Installation instructions:
1. Install libflac, libogg and libcurl:
$ sudo apt-get install libcurl4-openssl-dev libogg-dev libflac-dev

2. Download the most recent version of libsprec


$ wget https://github.com/H2CO3/libsprec/archive/master.zip

3. Unzip the downloaded package:

$ unzip master.zip; rm -rf master.zip

You should now have a folder named libsprec-master in your current directory.
4. Download the most recent version of libjsonz:
$ wget https://github.com/H2CO3/libjsonz/archive/master.zip

5. Unzip the downloaded package:

$ unzip master.zip; rm -rf master.zip

You should now have a folder named libjsonz-master in your current directory.
6. cd into the libjsonz-master directory, compile, and install:
$ cd libjsonz-master
$ mv Makefile.linux Makefile
$ make
$ sudo make install

7. cd out of the libjsonz-master directory and into the libsprec-master/src directory. Edit
line 227:

err = snd_pcm_open(&handle, "pulse", SND_PCM_STREAM_CAPTURE, 0);

We need this to say:

err = snd_pcm_open(&handle, "plughw:1,0", SND_PCM_STREAM_CAPTURE, 0);

This is so that the program uses ALSA to capture from the USB microphone.
8. Compile and install:
$ mv Makefile.linux Makefile
$ make
$ sudo make install

9. You can now use the library in your own applications. Look in the example folder in
libsprec-master for examples.

3. Julius - A high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.
Pros: It can perform almost real-time speech recognition on the Raspberry Pi itself. Standard
speech model formats are adopted to cope with other free modeling toolkits.
Cons: Spotty development, with its last update being over a year ago. Its recognition is also
too inaccurate and slow for my usage, and the installation takes a long time.
Installation instructions:
1. There are a few packages that we need to install to get the system working properly:
$ sudo apt-get install alsa-tools alsa-oss flex zlib1g-dev libc-bin libc-dev-bin python-pexpect libasound2 libasound2-dev cvs

2. Download Julius from the CVS source:


$ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.jp:/cvsroot/julius co julius4

3. Set the compiler flags by the environment variables:


$ export CFLAGS="-O2 -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=hard -pipe -fomit-frame-pointer"

4. cd into the folder julius4 and type the following commands


$ ./configure --with-mictype=alsa
$ sudo make
$ sudo make install

5. Julius needs an environment variable called ALSADEV to tell it which device to use for a
microphone:
$ export ALSADEV="plughw:1,0"

6. Download a free acoustic model for Julius to use. Once you have downloaded it, cd into
the directory and run:

$ julius -input mic -C julius.jconf

After that you should be able to begin speech input.


4. Roll your own library - For my specific project, I chose to build my own speech recognition
library that records audio from a USB microphone using ALSA via PortAudio, stores it in a FLAC
file via libsndfile, and sends it off to Google for them to process it. They then send me a nicely
packed JSON file that I then process to get what I said to my Raspberry Pi.
Pros: I control everything (which I like). I learn a lot (which I like).
Cons: It's a lot of work. Also, some people may argue that I'm not actually doing any
processing on the Raspberry Pi with this speech recognition library. I know that. Google can
process my data much more accurately than I can right now. I'm working on building an
accurate offline speech recognition option.
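
Before moving on to synthesis, here is a minimal sketch of what calling Pocketsphinx (option 1) from C looks like, modelled on the CMUSphinx 0.8 tutorial. Treat the model, dictionary and audio file paths as placeholders for whatever your installation provides, and note that error handling is kept to the bare minimum:

#include <stdio.h>
#include <pocketsphinx.h>

/* Build (paths and pkg-config names may vary):
 *   gcc decode.c -o decode $(pkg-config --cflags --libs pocketsphinx sphinxbase)
 */
int main(void)
{
    cmd_ln_t *config;
    ps_decoder_t *ps;
    FILE *fh;
    char const *hyp, *uttid;
    int32 score;

    /* Placeholder model paths: point these at the models your install provides. */
    config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "/usr/local/share/pocketsphinx/model/hmm/en_US/hub4wsj_sc_8k",
        "-lm",   "/usr/local/share/pocketsphinx/model/lm/en_US/hub4.5000.DMP",
        "-dict", "/usr/local/share/pocketsphinx/model/lm/en_US/cmu07a.dic",
        NULL);
    if (config == NULL)
        return 1;

    ps = ps_init(config);
    if (ps == NULL)
        return 1;

    /* goforward.raw: 16-bit little-endian mono PCM at 16 kHz. */
    fh = fopen("goforward.raw", "rb");
    if (fh == NULL)
        return 1;

    ps_decode_raw(ps, fh, "goforward", -1);
    hyp = ps_get_hyp(ps, &score, &uttid);
    printf("Recognised: %s\n", hyp ? hyp : "(nothing)");

    fclose(fh);
    ps_free(ps);
    return 0;
}

The same decoder object can be fed live microphone audio in chunks instead of a file; the continuous demo shipped with Pocketsphinx shows that pattern.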

Speech Synthesis
Here are some options for speech synthesis engines:

1. tritium - A free, premium quality speech synthesis engine written completely in C (and developed
by yours truly).
Pros: Extremely portable (no dependencies besides CMake to build), extremely small
(smallest one that I could find), easy to build.
Cons: The speech output itself can be inaccurate at times. The support for a wide variety of
languages is lacking, as I am the sole developer right now with little free time, but this is one of
the future goals of the project. Also, as of right now the build only produces a library, with no
usable/testable executable.
2. eSpeak - A compact open source software speech synthesizer for Linux, Windows, and other
platforms.
Pros: It uses a formant synthesis method, providing many spoken languages in a small size.
It is also very accurate and easy to understand. I originally used this in my project, but
because of the cons I had to switch to another speech synthesis engine.
Cons: It has some strange dependencies on X11, causing it to sometimes stutter. The library
is also considerably large compared to others.
Installation instructions:
1. Install the eSpeak software:
$ sudo apt-get install espeak

2. To say what you want in eSpeak:

$ espeak "Hello world"

To read from a file in eSpeak:


$ espeak -f <file>

3. Festival - A general multi-lingual speech synthesis system.


Pros: It is designed to support multiple spoken languages. It can use the Festvox project
which aims to make the building of new synthetic voices more systematic and better
documented, making it possible for anyone to build a new voice.
Cons: It is written in C++ (more of a con to me specifically). It also has a larger code base, so
it would be hard for me to understand and port the code.
Installation instructions:
1. Install the Festival software:
$ sudo apt-get install festival festival-freebsoft-utils

2. To run Festival, pipe it the text or file you want it to read:


$ echo "Hello world" | festival --tts

4. Flite - A small run-time speech synthesis engine derived from Festival and the Festvox project.
Pros: Under constant development at Carnegie Mellon University. Very small engine
compared to others. It also has a smaller code base, so it is easier to go through. It has
almost no dependencies (a huge pro for me, and another reason I decided to use this engine
in my project).
Cons: The speech output itself is not always accurate. The speech has a very metallic, non-
human sound (more than the other engines). It doesn't support very many languages.
Installation instructions:
1. Install the Flite software:
$ sudo apt-get install flite

2. To run Flite:
$ flite -t "text that you want flite to say"
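
Since Flite is also a C library, you can call it directly from your own program instead of shelling out to the flite binary. The following is only a sketch, assuming the Debian flite development package (flite1-dev) is installed; the voice registration function and the link flags can differ between Flite versions:

#include <flite/flite.h>

/* Each bundled voice is registered by its own function; declare the one we use. */
cst_voice *register_cmu_us_kal(const char *voxdir);

int main(void)
{
    cst_voice *voice;

    flite_init();
    voice = register_cmu_us_kal(NULL);
    if (voice == NULL)
        return 1;

    /* "play" sends audio to the default output device; passing a filename
       such as "out.wav" would write a wave file instead. */
    flite_text_to_speech("Hello from the Raspberry Pi", voice, "play");
    return 0;
}

Linking is typically something like -lflite_cmu_us_kal -lflite_usenglish -lflite_cmulex -lflite -lm (plus -lasound on some builds), but check what your installation provides.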

Answers to your specific questions:

What are the limitations with the Pi performing speech processing?

Programmers don't have limits. :P

On a more serious note, the Raspberry Pi has plenty of resources to handle speech processing. As
long as the person performing the speech processing knows what they are doing, the Raspberry Pi
should be able to handle it just fine.

I want to use this as an automatic attendance machine, how should I proceed?

None of these options are accurate enough to tell the difference between specific people yet. That is
something I am working on in my project (and probably will be for a while). If you are looking for a
better option for automatic attendance, I would look into facial recognition. There are more limitations
on facial recognition for the Raspberry Pi though, so keep that in mind.

answered Oct 29 '13 at 16:35 by syb0rg, edited Apr 13 '17 at 12:56 by Community ♦

5 This is an awesome answer! You really pulled out all the tricks :) – ppumkin Feb 2 '14 at 19:02

+1'd a while ago, but I just noticed that H2CO3 is no longer around on SE. Your link to his profile 404s. –
The Guy with The Hat Mar 5 '14 at 17:42

@TheGuywithTheHat Fixed, thanks! – syb0rg Mar 6 '14 at 23:33

Do you have a way to only send sounds to Google if someone has said a pre-recorded word first as a trigger word? (I'm
talking about the "Roll your own library" part of your post) – Robert Sep 20 '14 at 4:08

@Robert There is, but it is quite complicated and involved me integrating PocketSphinx so that I could have trained offline
voice recognition. I can perhaps update the post later with some more information on this if you would like. – syb0rg Sep 20 '14 at 23:44

I went with pocketsphinx_continuous and a $4 sound card.

To manage the fact that it needs to stop listening when using speech synthesis, I used amixer to handle the
input volume to the mic (this was recommended as best practice by CMU, since stop-starting the engine will
result in poorer recognition).

echo "SETTING MIC IN TO 15 (94%)" >> ./audio.log


amixer -c 1 set Mic 15 unmute 2>&1 >/dev/null

With a matching command to mute the listening when the speech synth plays

FILE: mute.sh
#!/bin/sh

sleep $1;
amixer -c 1 set Mic 0 unmute >/dev/null 2>&1 ;
echo "** MIC OFF **" >> /home/pi/PIXIE/audio.log

To calculate the right length of time to mute for, I just run soxi via Lua and then set unmute.sh (the opposite of
mute.sh) to run "x" seconds from the start. There are no doubt lots of ways to handle this. I am
happy with the results of this method.

LUA SNIPPET:

-- Begin parallel timing


-- MUTE UNTIL THE SOUNDCARD FREES UP
-- "filename" is a fully qualified path to a wav file
-- outputted by voice synth in previous operation

-- GET THE LENGTH


local sample_length = io.popen('soxi -D '..filename);
local total_length = sample_length:read("*a");
clean_length = string.gsub(total_length, "\n", "") +1;
sample_length:close();

-- EXAMPLE LOGGING OUTPUT...


--os.execute( 'echo LENGTH WAS "'.. clean_length .. '" Seconds >> ./audio.log');

-- we are about to play something...


-- MUTE, then schedule UNMUTE.sh in x seconds, then play synth output
-- (have unrolled mute.sh here for clarity)

os.execute( 'amixer -c 1 set Mic '..mic_level..' unmute 2>&1 >/dev/null ');


os.execute( 'echo "** MIC OFF **" >> ./audio.log ');

-- EXAMPLE LOGGING OUTPUT...


-- os.execute( 'echo PLAYING: "'.. filename..'" circa ' .. clean_length .. ' Seconds >> ./audio.log ');

os.execute( './unmute.sh "'.. clean_length ..'" &');

-- THEN PLAY THE THING WHILE THE OTHER PROCESS IS SLEEPING

os.execute( './sounds-uncached.sh '..filename..' 21000')

To actually grab the voice on the Pi I use:

pocketsphinx_continuous -bestpath 0 -adcdev plughw:1 -samprate 20000 \


-nfft 512 -ds2 -topn2 -maxwpf 5 -kdtreefn 3000 -kdmaxdepth 7 -kdmaxbbi 15 \
-pl_window 10 -lm ./LANGUAGE/0892-min.lm -dict ./LANGUAGE/0892-min.dic 2>&1 \
| tee -i 2>/dev/null >( sed -u -n -e 's/^.\{9\}: //p' ) \
>( sed -u -n -e 's/^READY//p' \
-e 's/^Listening//p' -e 's/^FATAL_ERROR: \"continuous\.c\"\, //p') \
> /dev/null

Again, there are other ways, but I like my output this way.

For the synth I used Cepstral's fledgling Pi solution, but it's not available online; you have to contact
them directly to arrange to buy it, and it is around $30. The results are acceptable, however the
speech does produce some nasty clicks and pops; the company has replied saying they no longer have
a Raspberry Pi and are unwilling to improve the product. YMMV
The voice recognition sits at around 12% CPU when "idle", and spikes briefly when doing a chunk of
recognition.

The voice creation spikes at about 50-80% when rendering.

The play / sox weighs in pretty heavily but I do apply real-time effects to the rendered voices as I play
them ;)

The Pi is heavily stripped down, using every guide I could find to stop unneeded services, and runs in
complete CLI mode, overclocked to 800 MHz (the smallest overclock).

scaling_governor set to: performance

When fully running, it sits at about 50 °C in direct sunlight and 38 °C in the shade. I have heat sinks
fitted.

Last point: I actually run all this gear out to "internet driven" AI as a nice extra.

The Pi handles all this seamlessly, including playing out any networked audio in real time and fully looping
audio to any other Unix box, etc.

To handle the large speech CPU overhead, I have implemented an md5sum-based caching
system so the same utterances are not rendered twice (about 1000 files @ 220 MB total covers 70%
of the utterances I generally get back from the AI). This really helps bring the total CPU load down
overall.
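
To illustrate the caching idea, here is a small C sketch of the approach, not my actual code: the cache directory, the render command and the helper names are all made up for the example. The utterance text is hashed with the system md5sum tool, and the synthesiser is only invoked when no cached WAV for that hash exists yet.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return a malloc'd hex MD5 of `text`, using the system md5sum tool. */
static char *md5_of_text(const char *text)
{
    char cmd[1024];
    char hash[64] = {0};
    FILE *p;

    /* NOTE: fine for trusted input; real code should escape `text` properly. */
    snprintf(cmd, sizeof cmd, "printf '%%s' '%s' | md5sum", text);
    p = popen(cmd, "r");
    if (p == NULL)
        return NULL;
    if (fscanf(p, "%32s", hash) != 1)
        hash[0] = '\0';
    pclose(p);
    return strdup(hash);
}

/* Speak `text`, rendering it only if no cached file exists for its hash. */
void speak_cached(const char *text)
{
    char *hash = md5_of_text(text);
    char wav[512], cmd[2048];

    if (hash == NULL || hash[0] == '\0') { free(hash); return; }
    snprintf(wav, sizeof wav, "/home/pi/cache/%s.wav", hash);

    if (access(wav, R_OK) != 0) {
        /* Cache miss: render with whatever synth you use (command is illustrative). */
        snprintf(cmd, sizeof cmd, "flite -t '%s' -o '%s'", text, wav);
        system(cmd);
    }
    /* Cache hit or freshly rendered: play it back. */
    snprintf(cmd, sizeof cmd, "aplay '%s'", wav);
    system(cmd);
    free(hash);
}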

In précis, this is all totally doable. However, the voice recognition will only be as good as the quality of
your mics, your language model, how close your subjects' voices are to the originally
intended audience (I use an en_US model on en_UK children, not perfect) and other minutiae of detail
that with effort you can whittle down to a decent result.

And for the record, I already did all this once before on a Kindle (and that worked too, with CMU Sphinx
and Flite). Hope this helps.

answered Jun 2 '14 at 11:38 by twobob, edited Aug 27 '15 at 7:38

About the answer where the OP states "I send it off to Google for processing": I would love to know where exactly you send that. –
twobob Jun 2 '14 at 15:57

1 I am that OP. You can ping me in this chat room, and I should be able to get ahold of you in a short while. We can discuss
more there, and I can add items to my answer then as well. – syb0rg Jun 11 '14 at 4:14

Yes. Use PocketSphinx for speech recognition, Festvox for text-to-speech (TTS) and some USB audio
device with line in (or an old supported webcam which also has line in).

Google searches for these software packages and "Raspberry Pi" provide many examples and tutorials
to set this up.

answered Oct 29 '13 at 12:44 by Dr.Avalanche, edited Nov 5 '13 at 23:23 by syb0rg

SiriProxy - Only use this if you have a device that uses Siri - you don't need to jailbreak anything. It
basically intercepts Siri on the network you install it on.
Speech2Text - You can use Google's API to decode speech to text, but the example contains some
other methods too.
Julius - A speech recognition decoder.

As pointed out by lenik, you will need some way to record audio, or possibly send audio files to the
Raspberry Pi, for them to get decoded somehow.

answered Oct 29 '13 at 9:25 by ppumkin, edited Feb 2 '14 at 18:54 by syb0rg

SiriProxy and Speech2Text do not do speech processing on the Raspberry Pi; they use Apple/Google servers. –
Dr.Avalanche Oct 29 '13 at 12:53

2 Yeah, I said that. But they are still an interesting solution for speech recognition nevertheless. Besides, the OP did not impose
any restrictions. Thanks for the downvote. grumble – ppumkin Oct 29 '13 at 13:56

"...**on** a raspberry pi", by uploading it and doing the processing on other serves, these do not match the criteria specified
in the question. It's also interesting that you complain about downvotes, given your history of downvoting posts you claim
are of low quality or don't address the question. – Dr.Avalanche Oct 29 '13 at 14:19

2 On the Pi does not mean anything more than using the Pi. The Pi is capable of connecting to the internet, so I gave the option; it was
not specifically said "I do not want to use the internet", or that there is no way to use the internet. Possibly he might update his
question and my answer will become irrelevant. I only have a history of downvoting posts that needed it. I never downvote
unless I can see room for improvement. I am sure we dealt with that before. – ppumkin Oct 29 '13 at 15:09

1 I think the last comment said something like "Please improve this answer and then I will upvote you." The actual FAQ of the
entire network frowns upon linking to external guides. I only want to offer good advice, yet you still choose to be upset
with me. I expressed my opinion about the desolder braid, you went ballistic and are still holding a grudge. But still you did
not even try to improve the answer. I flagged it; maybe somebody will remove it or convert it to a comment and the
downvote against you will be removed. What is stalking and downvoting me going to prove? – ppumkin Oct 29 '13 at 15:42

The Raspberry Pi has no built-in ADC or microphone input. Unless you're planning to use an external USB
mic, there's basically no way to get your audio stream to the device. Besides that, there are no
serious limitations; the CPU is powerful enough for any sound processing you might try to implement.
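
To show what getting the audio stream in looks like once a USB mic is attached, here is a minimal ALSA capture sketch in C. It assumes the card shows up as plughw:1,0 (the device name used elsewhere in this thread; check yours with arecord -l) and simply dumps about five seconds of 16 kHz mono samples to a raw file:

/* Build with something like: gcc capture.c -o capture -lasound
 * (the libasound2-dev package provides the headers). */
#include <alsa/asoundlib.h>
#include <stdio.h>

int main(void)
{
    snd_pcm_t *pcm;
    short buf[1600];               /* 0.1 s of 16 kHz mono samples */
    FILE *out = fopen("capture.raw", "wb");
    int i;

    if (out == NULL)
        return 1;
    if (snd_pcm_open(&pcm, "plughw:1,0", SND_PCM_STREAM_CAPTURE, 0) < 0)
        return 1;
    if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                           SND_PCM_ACCESS_RW_INTERLEAVED,
                           1, 16000, 1, 500000) < 0)
        return 1;

    for (i = 0; i < 50; i++) {     /* 50 x 0.1 s = roughly 5 s */
        snd_pcm_sframes_t n = snd_pcm_readi(pcm, buf, 1600);
        if (n > 0)
            fwrite(buf, sizeof(short), (size_t)n, out);
    }

    snd_pcm_close(pcm);
    fclose(out);
    return 0;
}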

answered Oct 29 '13 at 7:36 by lenik

Firstly, you should select a set of words for the classification process. After that, you should collect the
data from users/subjects. It will be a non-stationary signal. You have to reduce your data with feature
extraction methods, both to cut computational costs and to improve the success ratio, so you should look for
suitable feature extraction methods for your application. You get a feature vector as a result of
these methods (mean absolute value, RMS, waveform length, zero crossing, integrated absolute value,
AR coefficients, median frequency, mean frequency, etc.). Then, you should use a classification method
like k-NN, neural networks, etc. to classify your data. Lastly, you have to check its accuracy. To sum up:

1. Select a set of words/sentences.


2. Get the data from the human subjects.
3. Preprocess (maybe the signal needs to be filtered)
4. Feature extraction/Processing.
5. Classification.
6. Tests.

I have seen video-processing projects done with the RPi on the internet, so it should be able to manage this
classification.

You can use the NI 6009 USB DAQ (which supports the RPi) to collect analog data, but it is a little
bit expensive.
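
As a rough illustration of steps 3-5, the feature extraction for one analysis window might look like the C sketch below. The names are hypothetical, and the windowing, filtering and the classifier itself (k-NN, a small neural network, etc.) are left out:

#include <math.h>
#include <stdlib.h>

/* Illustrative per-window features: mean absolute value, RMS, waveform
 * length and zero-crossing count, as listed in the answer above. */
typedef struct {
    double mav;        /* mean absolute value  */
    double rms;        /* root mean square     */
    double wl;         /* waveform length      */
    int    zc;         /* zero-crossing count  */
} features_t;

features_t extract_features(const short *x, size_t n)
{
    features_t f = {0};
    double sum_abs = 0.0, sum_sq = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum_abs += fabs((double)x[i]);
        sum_sq  += (double)x[i] * (double)x[i];
        if (i > 0) {
            f.wl += fabs((double)x[i] - (double)x[i - 1]);
            if ((x[i] >= 0) != (x[i - 1] >= 0))
                f.zc++;
        }
    }
    f.mav = n ? sum_abs / (double)n : 0.0;
    f.rms = n ? sqrt(sum_sq / (double)n) : 0.0;
    return f;
}

One such vector per window, stacked over a whole utterance, is what you would then feed into the classifier.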

answered Oct 29 '13 at 19:51 by cagdas, edited Oct 29 '13 at 20:23

This may be useful for you for recognising the speaker:

https://code.google.com/p/voiceid/

answered Jan 12 '15 at 12:19 by RahulAN
