
ASR building using Sphinx

CS6745: Building ASR and TTS Systems

Gopala Krishna A (gopalakrishna@students)


S P Kishore (skishore@cs.cmu.edu)
The Components of ASR

 Acoustic Model (AM)

 Language Model (LM)

 Phonetic Lexicon (Pronunciation dictionary)

2
Installing the Sphinx Trainer
 Download the Sphinx III trainer from
http://172.16.16.93/ASR/SphinxTrain-0.9.1-beta.tar.gz
(Source:
http://www.speech.cs.cmu.edu/SphinxTrain/SphinxTrain-0.9.1-beta.
)
 Untar and install Sphinx train (as root)

 $tar -xvzf SphinxTrain-0.9.1-beta.tar.gz


 $cd SphinxTrain
 $./configure
 $make
3
Installing the Sphinx II decoder
 Download the Sphinx II decoder from
http://172.16.16.93/ASR/sphinx2-0.5.tar
(Source:
http://www.sorcerer.mirrors.pair.com/sources/sphinx2/0.5/sphinx2-0.5.tar.bz2
)
 Untar and install the decoder (as root)
 $tar -xvf sphinx2-0.5.tar

 $cd sphinx2-0.5

 $./configure

 $make clean all

 $make test

 $make install

4
CMU Statistical Language Modeling toolkit
 Download the CMU-SLM toolkit from
http://172.16.16.93/ASR/CMU-Cam_Toolkit_v2.tar.gz
(Source http://mi.eng.cam.ac.uk/~prc14/CMU-Cam_Toolkit_v2.tar.gz)
 Untar the tgz and install as root
 $tar -xvzf CMU-Cam_Toolkit_v2.tar.gz

 $cd CMU-Cam_Toolkit_v2/

 $cd src

 Uncomment the line #BYTESWAP_FLAG = -DSLM_SWAP_BYTES in the Makefile
 $make install
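One way to do the uncommenting step from the shell (a sketch; the one-line stand-in file below is for illustration only — in the real toolkit, run the same sed command on CMU-Cam_Toolkit_v2/src/Makefile):

```shell
# Illustration on a one-line stand-in for src/Makefile.
printf '#BYTESWAP_FLAG = -DSLM_SWAP_BYTES\n' > Makefile
# Remove the leading '#' in place, keeping the original as Makefile.bak.
sed -i.bak 's/^#BYTESWAP_FLAG/BYTESWAP_FLAG/' Makefile
grep '^BYTESWAP_FLAG' Makefile
```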

5
Before getting started….
 Download Speech data
 Available at http://172.16.16.93/ASR/TEL_Landline.tgz

 Language Phoneset & Phonetizer


 Available at http://172.16.16.93/ASR/TELUGU.phone

 Available at http://172.16.16.93/ASR/IT3-Phonetizer

6
Before getting started… contd
 NIST Scorer (for scoring the decoder performance)
 Available at http://172.16.16.93/ASR/nist.tar.gz

 Script for testing


 Available at http://172.16.16.93/ASR/sphinx2-test

 Script for scoring and alignment


 Available at http://172.16.16.93/ASR/scorer.sh
 Available at http://172.16.16.93/ASR/sphinx2-align

7
Speech Databases…. format
 Language //Tamil, Telugu or Marathi Data
 Cellphone
 ID-**** (4-digit userid)

 FileRanking.txt // has info about recording quality


 recorded-ID.txt // has the transcription
 recordings //has the recordings in various formats
 WAV/ // has 52 wav files of the recordings
- recorded-***.wav(3-digit fileid)
 …..

 Landline
 …..

8
Directory structure

9
Training and Testing Datasets

 To train the models and evaluate their performance on unseen data, the speakers need to be divided into Training and Testing sets.

 The division is usually 70% of the speakers for training and 30% for testing.
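A minimal shell sketch of this 70/30 split (the ID-* directory layout follows slide 8; the demo directory, dummy speaker IDs and output file names are assumptions for illustration):

```shell
# Sketch of a random 70/30 speaker split.
mkdir -p demo && cd demo
for i in 0 1 2 3 4 5 6 7 8 9; do mkdir -p "ID-100$i"; done   # 10 dummy speaker dirs
ls -d ID-* | shuf > all_speakers.txt            # randomize speaker order
total=$(wc -l < all_speakers.txt)
ntrain=$(( total * 70 / 100 ))                  # 70% for training
head -n "$ntrain" all_speakers.txt > train_speakers.txt
tail -n +"$(( ntrain + 1 ))" all_speakers.txt > test_speakers.txt
wc -l train_speakers.txt test_speakers.txt      # expect 7 and 3
```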

10
Wav file collection

 This is done to collect the Training and Testing data sets (good-quality wav files, without mistakes)

 Copy and untar the file collection module from http://172.16.16.93/ASR/collect.tgz
 Extract the wav files as follows
 $cd COLLECT
 $./runall.pl <Directory containing the speaker IDS>
 This could take 5-10 minutes

11
Wav file collection …..contd

 The following are created after running runall.pl:
 Use the *.raw files in Training/ for training, the ‘Training/transcript’ file as the corresponding transcription, and train_fileids as the fileids file
 Use the *.raw files in Testing/ for testing, the ‘Testing/transcript’ file as the corresponding transcription, and test_fileids as the fileids file
 trainwords_uniq.txt is the unique word list of the training transcription (used for creating the dictionary)

12
Acoustic Model Training
 Create a new directory (training workspace)
 $mkdir TASK_NAME

 Set the environment variables in the directory
 $export SPHINXTRAINDIR="$HOME/SphinxTrain"
 Make sure you give the correct path of the SphinxTrain dir

 Create the directory structure
 $SPHINXTRAINDIR/scripts_pl/setup_SphinxTrain LangName

13
Directory wav/
 Copy the Training/*.raw1 to this directory

1: refer slide 11
14
Directory etc/
 Contents to be put in the etc/ directory
 etc/langname.transcription: Copy the
Training/transcript1 file
 etc/langname.filler: Should contain the silence
specifiers
 <s> SIL
 </s> SIL
 <sil> SIL

1: refer slide 11
15
Directory etc/……contd
 etc/langname.phone1 : Should contain the
phoneset, each phone in a new line
 Append ‘SIL’ as a phone

 etc/langname.fileids : Should have the filenames of all the files in the order they appear in the langname.transcription file (use the train_fileids file)

1: http://172.16.16.93/ASR/TELUGU.phone
16
etc/langname.dic
 etc/langname.dic : Should contain the phone breakup of each word entry. Proceed as follows -

 Get all the unique words in the training transcription (excluding the <s>, </s> & filenames), each on a new line. You may use the trainwords_uniq.txt2

 Use the IT3-Phonetizer1 to split the words into their constituent phones
 $./IT3-Phonetizer lang.phone lang.wordlist langname.dic
1: refer slide 6, http://172.16.16.93/ASR/TELUGU.phone
2: refer slide 11
17
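The unique-word-list step above (what trainwords_uniq.txt already provides) can be sketched in the shell; the sample transcript lines below are hypothetical and follow the "<s> words </s> (fileid)" format used on these slides:

```shell
# Derive a unique word list from a transcript, dropping <s>, </s>
# and the trailing (fileid) token.
printf '<s> hello world </s> (file0001)\n<s> hello there </s> (file0002)\n' > transcript
tr ' ' '\n' < transcript \
  | grep -v -e '^<s>$' -e '^</s>$' -e '^(.*)$' -e '^$' \
  | sort -u > lang.wordlist
cat lang.wordlist    # hello / there / world
```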
Some modifications
 Modify etc/sphinx_train.cfg, changing the number of tied states to 1000
 $CFG_N_TIED_STATES = 1000

 To the command line in the file scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl, add the parameters -minocc 1 and -maxtriphones 20000

18
Some modifications
 In the file bin/make_feats replace the final command line with the following
 bin/wave2feat -verbose -c $1 -raw -di wav -ei raw -do feat -eo feat -srate 8000 -nfft 256 -lowerf 130 -upperf 3400 -nfilt 31 -ncep 13 -dither

 Extract the features for the wav files by executing
 $bin/make_feats etc/*.fileids

19
Training Checklist
 Make sure the langname.fileids are in the same order as the filenames in langname.transcription (check for the first few files)
 Ensure that the same transliteration is used in all three - langname.transcription, langname.dic and langname.phone

 Remove duplicate entries, numerals and silence specifiers (like <s>) in langname.dic
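The first checklist item can be spot-checked from the shell; the sample etc/ files below are stand-ins for illustration, and the check assumes the transcription ends each line with a (fileid) token:

```shell
# Compare the (fileid) tokens in the transcription against the fileids file.
mkdir -p etc
printf 'hello world (file0001)\nbye now (file0002)\n' > etc/langname.transcription
printf 'file0001\nfile0002\n' > etc/langname.fileids
sed 's/.*(\(.*\))$/\1/' etc/langname.transcription > trans_ids
diff etc/langname.fileids trans_ids && echo "order OK"
```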

20
Steps involved in AM training
 STEP 0: Verify
 ./scripts_pl/00.verify/verify_all.pl

 STEP 1: Vector Quantization


 ./scripts_pl/01.vector_quantize/slave.VQ.pl

 STEP 2: Context Independent (CI) training


 ./scripts_pl/02.ci_schmm/slave_convg.pl

 STEP 3: State Tying


 ./scripts_pl/03.makeuntiedmdef/make_untied_mdef.pl

21
Steps in AM Training ….contd
 STEP 4: Context Dependent (CD) training
 ./scripts_pl/04.cd_schmm_untied/slave_convg.pl

 STEP 5: Tree Building


 ./scripts_pl/05.buildtrees/make_questions.pl

 ./scripts_pl/05.buildtrees/slave.treebuilder.pl

 STEP 6: Tree Pruning


 ./scripts_pl/06.prunetree/slave.state-tie-er.pl

22
Steps in AM Training ….contd
 STEP 7: CD training
 ./scripts_pl/07.cd-schmm/slave_convg.pl

 STEP 8: Deleted Interpolation


 ./scripts_pl/08.deleted-interpolation/deleted_interpolation.pl

 STEP 9: Converting to Sphinx 2 format


 ./scripts_pl/09.make_s2_models/make_s2_models.pl

23
Training the Language Model
 Ideally the LM should be trained on a large unbiased corpus; for practical feasibility, we approximate this by training it on a corpus derived from the training and testing transcriptions.
 Statistical language modeling computes the smoothed
trigram, bigram and the unigram probabilities from the
corpus.
 Concatenate the test and training transcriptions,
 each sentence on a new line
 Remove punctuation and filenames
 Prefix and suffix the sentences with <s> and </s>
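The corpus preparation above can be sketched as a short pipeline; the sample transcript lines are hypothetical (the real input is the concatenated Training/ and Testing/ transcripts):

```shell
# Strip (fileid) tokens and punctuation, then wrap each sentence
# in <s> ... </s> markers.
printf 'this is a test, sentence (file0001)\nanother sentence (file0002)\n' > all_trans
sed 's/(.*)$//' all_trans \
  | tr -d '[:punct:]' \
  | awk 'NF { $1=$1; print "<s> " $0 " </s>" }' > corpus.txt
cat corpus.txt
# <s> this is a test sentence </s>
# <s> another sentence </s>
```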

24
Training the LM ….contd
 Run the following commands on the corpus (eg. corpus.txt) in the directory /CMU-Cam_Toolkit_v2/bin
 $cat corpus.txt | ./text2wfreq > corpus.wfreq
 $cat corpus.wfreq | ./wfreq2vocab > corpus.vocab
 $cat corpus.txt | ./text2idngram -vocab corpus.vocab > corpus.idngram
 $./idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.lm

25
Pronunciation Dictionary
 The decoder should be provided the phone split of all the unigrams of the LM.

 Run the IT3-Phonetizer1 on the wordlist containing the unigrams and get the langname.dic for the entries
 $./IT3-Phonetizer lang.phone unigram.wordlist langname.dic

1: refer slide 6, http://172.16.16.93/ASR/TELUGU.phone
26
Running the decoder

 Modify the script sphinx2-test1 with the appropriate values for the parameters
 TASK= Training directory path
 HMM= ${TASK}/model_parameters/langname.s2models
 CTLFILE= List of all the filenames of the testing raw files
 Arguments for the s2batch:
 -matchfn : output filename
 -datadir : Dir consisting testing files (in raw format)
 -lmfn : path of the language model
 -dictfn : path of the dictionary
1: refer slide 7
27
Running the decoder….contd
 Arguments you change for the command s2batch:

 -matchfn : output filename


 -datadir : dir consisting the testing files (in raw format)
 -lmfn : path of the language model
 -dictfn : path of the dictionary
 -langwt : a value between 6 and 13 (the larger the LM, the smaller the langwt)
 -logfn : logfile
 Now run the script $./sphinx2-test

28
Evaluating the output
 Use the original transcription (eg. test.txt) of the
testing files to evaluate the output of the decoder
 this is a test sentence (file0001)
 Modify the output of the decoder to the above format, i.e. remove the scores at the end (eg. output.txt)
 Modify and run the scorer.sh1 as follows
 NIST : path of the NIST directory
 REF : the testing transcription ( test.txt)
 HYP : the decoder output (output.txt)
 score.rpt : the performance report of the decoder
 Run the script $./scorer.sh
1: refer slide 7
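The score-stripping step above can be sketched with sed; the match-file layout shown here (a trailing score inside the (fileid ...) field) is an assumption — adjust the pattern to whatever your decoder actually emits:

```shell
# Drop a trailing score from each "(fileid score)" field,
# leaving "(fileid)". Sample input is hypothetical.
printf 'this is a test sentence (file0001 -123456)\n' > match.raw
sed 's/(\([^ )]*\)[^)]*)$/(\1)/' match.raw > output.txt
cat output.txt    # this is a test sentence (file0001)
```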
29
Interpreting the NIST report
 The scorer aligns the decoder output with the
reference transcript of the test utterances
 It computes the mean word error rate (w.e.r) per
utterance by penalizing the insertions, deletions
and substitutions in alignment
 The report also gives the w.e.r per speaker and
indicates the good and the bad speakers in the test
set

30
Forced Alignment
 A technique to improve the Acoustic Model
 Download the sphinx2-align1 and modify the
parameter paths accordingly
 TASK : Training directory
 HMM : ${TASK}/model_parameters/TELUGU.s2models
 CTLFILE : The list of all the training files to be aligned
 TACTLFN : Transcript to be aligned. The format is -
 *align_all* // This should be the first line
 this is sentence one // Remove <s>, </s> & filenames
 DICT : ${TASK}/etc/langname.dic

1: refer slide 7
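Building the TACTLFN transcript from the training transcript can be sketched as follows; the sample input line is hypothetical and follows the "<s> words </s> (fileid)" format used on these slides:

```shell
# Drop <s>, </s> and the (fileid) token, then prepend the *align_all*
# directive expected by the aligner.
printf '<s> this is sentence one </s> (file0001)\n' > train_transcript
{ echo '*align_all*'
  sed -e 's/(.*)$//' -e 's/<\/*s>//g' -e 's/^ *//; s/ *$//' train_transcript
} > align.transcript
cat align.transcript
# *align_all*
# this is sentence one
```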
31
Forced Alignment…..contd
 Arguments for the $S2batch
 -osentfn : output file
 -datadir : directory containing the raw files
 -logfn : logfile for the alignment

 Replace the etc/langname.transcription with the aligned transcript (pointed to by -osentfn)

 Retrain the Acoustic Models1, then test and score the new models to see the improved performance
1: refer slide 20
32
Limited Domain Speech-to-Speech/ASR
 Target:
 Exploiting the limited domain
 Integrating ASR with MT and TTS systems

 Schematic figure shown alongside

33
The Language
 Identify the kinds of templates and the various entities that
recur in the domain
 Ex: Considering a Tourist domain
 Template1: How can I go to the <Location>?

 Template2 : Can I catch a <Mode> to <Place>?

 Values for Location : Market, Railway Station, Hospital


 Values for Mode: Train, Bus, Aeroplane
 Values for Place: Chennai, Delhi, Hyderabad

 Implement a procedure to generate the legitimate utterances of the domain language. Use the same transliteration as that of the Acoustic Models
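One such generation procedure, sketched for Template2 above; the English word lists mirror the slide's example values (a real corpus would use the Acoustic Model's transliteration):

```shell
# Expand "Can I catch a <Mode> to <Place>?" over all slot values.
for mode in Train Bus Aeroplane; do
  for place in Chennai Delhi Hyderabad; do
    echo "<s> Can I catch a $mode to $place </s>"
  done
done > domain_corpus.txt
wc -l domain_corpus.txt    # 9 utterances (3 modes x 3 places)
```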

34
Components for the limited domain ASR
 AM : Existing AMs built for the languages

 LM : LM trained on the set of legitimate sentences


allowed by your application

 Lexicon: Specified for the unigram terms of the LM

35
Biasing the decoder to LM
 To exploit the limited domain, increase the langwt parameter in sphinx2-test; this improves both the speed and the accuracy of the decoder.

36
