Optimum Train Set For Continuous Speaker Dependent Indonesian Digit Recognizer

JURNAL SIMETRIK
ISSN : 2302-9579
VOLUME 6, NUMBER 1, June 2016
Person in Charge
Dr. Sammy Saptenno, SE., M.Si
Editor-in-Chief
Vicky Salamena, SST., MT
Editor
Aleksander A Patty, ST., MT
Managing Editors
Luwis H. Laisina, ST., MT
Paulus F. Picauly, ST., M.Eng
Graciadiana I. Huka, ST., MT
Reynold P. J. V. Nikijuluw, S.Pd., M.Ed
Graphic Design
Ridolf Kermite, ST
Administration
Wa Hauli
Editorial and Administrative Office Address:

Pusat Penelitian dan Pengabdian kepada Masyarakat (Center for Research and Community Service), Politeknik Negeri Ambon
Jln. Ir. M. Puttuhena Wailela Rumah Tiga Kota Ambon 97234.
Website: www.uppm.polnam.ac.id. e-mail: jurnalsimetrik@gmail.com
TABLE OF CONTENTS
OPTIMUM DATA LENGTH TO TRAIN ISOLATED SPEAKER DEPENDENT INDONESIAN DIGIT RECOGNIZER  1 - 4
(ZULKARNAEN HATALA, ARI PERMANA L)
PENERAPAN STRATEGI SQ4R UNTUK MENINGKATKAN PEMAHAMAN MEMBACA DALAM PEMBELAJARAN BAHASA INGGRIS TEKNIK  5 - 11
(MEYKE MARANTIKA)
RANCANGAN ALAT PENGIRIS BAWANG DAN PEMBUATAN KERIPIK OLAHAN BAHAN MAKANAN SKALA RUMAH TANGGA  12 - 15
(ARTHUR LEIWAKABESSY, NUR HAYATI NAHUMARURY)
EVALUASI KAPASITAS BALOK STRUKTUR RANGKA PEMIKUL MOMEN GEDUNG BPJN WILAYAH IX MALUKU DAN MALUKU UTARA  16 - 22
(VECTOR R. R. HUTUBESSY)
ANALISIS KAPASITAS SISTEM PENGAMAN DAN PENGHANTAR PADA INSTALASI GEDUNG BENGKEL DAN LABORATORIUM JURUSAN ELEKTRO POLNAM  23 - 30
(LORY PARERA, ARI PERMANA L)
ANALISIS PAPARAN LOGAM Pb PADA IKAN ASAP YANG DIJUAL DI KOTA AMBON  31 - 38
(MUHAMMAD SAID KARYANI)
PERANCANGAN SISTEM KONTROL MENGGUNAKAN PLC Omron CP1E UNTUK MENGGERAKAN MESIN AC  39 - 43
(RINA LUCIANE MANUHUTU, SAMY JUNUS LITILOLY)
JURNAL SIMETRIK VOL 6, NO. 1 JUNI 2016, ISSN : 2302-9579
OPTIMUM DATA LENGTH TO TRAIN ISOLATED SPEAKER DEPENDENT INDONESIAN DIGIT RECOGNIZER
Zulkarnaen Hatala 1), Ari Permana L 2)
1,2) Electrical Engineering Department, Politeknik Negeri Ambon
e-mail: dzulqarnaenhatala@gmail.com
Abstract
This paper measures the performance of isolated digit recognition for the Indonesian language spoken with a local accent. The software used is the Hidden Markov Model Toolkit (HTK). The aim is to find the minimal duration of training speech that is needed. The result is a plot of training data length against the word error rate.
Keywords: isolated speech recognition, speaker dependent, optimum training set
1. INTRODUCTION

Background
Many speech recognition systems [1] require a very long training set to reach a given level of performance. But what about a system that needs to be trained quickly? What is the minimum duration of spoken sentences or words actually needed to build a simple system such as a speaker dependent isolated digit recognizer? This paper addresses that question experimentally, using the Indonesian language spoken with an Ambonese accent as the target system. The software used is the Hidden Markov Model Toolkit (HTK).

2. SPEECH RECOGNITION

Isolated Word Speaker Dependent
There are many kinds of automatic speech recognition (ASR) systems. A system that identifies only single spoken words, each beginning and ending with silence, is called an isolated word recognizer; that is, there is always a pause between words. A speaker dependent system is built for a single person using the application in a single environment. This single user has distinct characteristics, such as a local accent. Even for the same user, speaking with a different pronunciation will degrade the performance.

Digit Recognition
Digit recognition is an ASR system that identifies a digit or number from spoken sound. Such a system can be integrated into other systems, such as a phone dialer or a number dictation application.

Hidden Markov Model Toolkit (HTK)
The Hidden Markov Model Toolkit [2] from Cambridge University is a set of training and testing tools for hidden Markov models with Gaussian mixture output distributions (HMM-GMM) [2]. In practice, HTK is used intensively in speech recognition research across the world.

3. METHODOLOGY

Hardware Setup
An ASR system needs only a common set of hardware. Here we use an inexpensive Bluetooth microphone for input and a laptop with a 2.65 GHz processor and 4 GB of RAM.

ASR Steps
The ASR process consists of two distinct phases:
1. Training process, which constructs the HMM-GMM model. In this phase, sample sounds are recorded from a human user through the microphone input, and the mathematical model is estimated from those sounds. At this stage, annotation (labeling) is also performed to mark the boundaries of the phoneme sequences, as shown in Figure 1. Phonemes are the subword units used by HTK and can be modeled by HMM-GMM. The digit or word can then be reconstructed by analyzing the sequence of phonemes that occurs in the utterance.

Figure 1: Labelling sound files

2. Recognition process, or testing process, which examines the system performance. In this phase the total duration of sample sound is measured for a given level of recognition performance. The performance criterion is the word accuracy, which HTK calculates as:

$$\mathrm{Accuracy} = \frac{H - I}{N} \times 100\%$$

H: number of correct words
I: number of insertions
N: total number of words in the reference transcription

HTK Configuration
HTK is a set of ready-to-use command-line tools and a programming library for training and testing ASR systems. For it to work, a few configuration files must be written explicitly. These configurations specify the format and method of feature extraction, the HMM-GMM parameters, the language grammar and the dictionary.

HTK Grammar
This is the language grammar used to perform isolated digit recognition, as shown in Figure 2:

$NUMS = ( NUM_0 | NUM_1 | NUM_3 | NUM_4 | NUM_5 | NUM_6 | NUM_7 | NUM_8 | NUM_9 | NUM_2 );
( SIL { $NUMS SIL } )

Figure 2: HTK Grammar

HTK Dictionary
HTK needs a file called the dictionary to map each word to its subword phonemes. For the Indonesian digit recognizer we use the dictionary in Figure 3:

NUM_0 k o s o ng
NUM_1 s a t u
NUM_3 t i g a
NUM_4 a m p a t
NUM_5 l i m a
NUM_6 a n a m
NUM_7 t u j u h
NUM_8 l a p a n
NUM_9 s e m b i l a n
NUM_2 d u w a
SIL sil

Figure 3: HTK Dictionary Entries

4. EXPERIMENTAL RESULT

Single Digit Recognition
For the first system tested, we only need to recognize the single Indonesian digit "kosong" between its silence boundaries. The result, shown in Figure 4, was obtained with the following setup:
1. Data format: PCM 16 bit, mono, 16000 Hz
2. Speech feature model: MFCC_E_D_A
3. Phone context: monophone
4. Phones: \k\, \o\, \s\, \ng\, \sil\
5. HMM-GMM: diagonal covariances, 5 states (3 emitting states plus an entry state and an exit state)
6. Words: silence and 0 ("kosong")
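Items 1 and 2 of the setup above correspond to what HTK calls a coding configuration. The sketch below is only an illustration and is not the configuration file used in the experiments: the 16 kHz sampling rate and the MFCC_E_D_A target kind come from the text, while the frame shift, window size, filterbank and cepstral settings are assumed values of the kind commonly seen in HTK tutorials, and the helper function names are hypothetical.

```python
# Sketch of an HTK coding configuration for the setup listed above.
# From the paper: 16 kHz sampling and the MFCC_E_D_A target kind.
# Assumed (not stated in the paper): every other parameter value below.

SAMPLE_RATE_HZ = 16000        # from the paper
TARGET_KIND = "MFCC_E_D_A"    # from the paper


def sample_period_100ns(sample_rate_hz: int) -> float:
    """HTK expresses durations, including the sample period, in units of 100 ns."""
    return 1e7 / sample_rate_hz


def feature_dimension(num_ceps: int, target_kind: str) -> int:
    """Vector size for an MFCC kind with optional _E/_0, _D and _A qualifiers."""
    qualifiers = target_kind.split("_")[1:]
    # Static part: cepstra plus one value per energy-like qualifier.
    static = num_ceps + sum(q in ("E", "0") for q in qualifiers)
    # Deltas and accelerations each append another copy of the static part.
    streams = 1 + ("D" in qualifiers) + ("A" in qualifiers)
    return static * streams


def render_config(num_ceps: int = 12) -> str:
    """Return the text of a configuration file, one 'NAME = value' per line."""
    params = {
        "SOURCEFORMAT": "WAV",        # assumed container for the PCM recordings
        "SOURCERATE": sample_period_100ns(SAMPLE_RATE_HZ),
        "TARGETKIND": TARGET_KIND,
        "TARGETRATE": 100000.0,       # assumed 10 ms frame shift
        "WINDOWSIZE": 250000.0,       # assumed 25 ms analysis window
        "USEHAMMING": "T",            # assumed
        "PREEMCOEF": 0.97,            # assumed
        "NUMCHANS": 26,               # assumed filterbank size
        "CEPLIFTER": 22,              # assumed
        "NUMCEPS": num_ceps,          # assumed 12 cepstral coefficients
    }
    return "\n".join(f"{name} = {value}" for name, value in params.items())


if __name__ == "__main__":
    print(render_config())
    print("feature dimension:", feature_dimension(12, TARGET_KIND))
```

Under the assumed 12 cepstral coefficients, MFCC_E_D_A gives 39-dimensional feature vectors: 13 static values (12 cepstra plus energy) together with their delta and acceleration coefficients.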
LENGTH          Accuracy
7.4 seconds     100%
Figure 4: Single Digit Recognition Results

Two Digits Recognition
For the next system tested, we recognize two Indonesian digits, "kosong" and "satu", plus the silence boundary.
1. Data format: PCM 16 bit, mono, 16000 Hz
2. Speech feature model: MFCC_E_D_A
3. Phone context: monophone
4. Phones: \k\, \o\, \s\, \ng\, \sil\, \a\, \t\, \u\
5. HMM-GMM: diagonal covariances, 5 states (3 emitting states plus an entry state and an exit state)
6. Words: silence, 1 ("satu") and 0 ("kosong")

The result is shown in Figure 5:
LENGTH          Accuracy
13.8 seconds    100%
Figure 5: Two Digits Recognition Results

Figure 6 shows the result for 3 digits under these conditions:
1. Phones: \k\, \o\, \s\, \ng\, \sil\, \a\, \t\, \u\, \i\, \g\
2. Words: silence, 0, 1 and 3.

LENGTH          Accuracy
21.0 seconds    97.59%
29.0 seconds    98.80%
38.1 seconds    100%
Figure 6: Three Digits Recognition Results

Figure 7 shows the result for 4 digits under these conditions:
1. Phones: \k\, \o\, \s\, \ng\, \sil\, \a\, \t\, \u\, \i\, \g\, \m\, \p\
2. Words: silence, 0, 1, 3 and 4.

LENGTH          Accuracy
28.15 seconds   100%
Figure 7: Four Digits Recognition Results

Figure 8 shows the result for 5 digits under these conditions:
1. Phones: \k\, \o\, \s\, \ng\, \sil\, \a\, \t\, \u\, \i\, \g\, \m\, \p\, \l\
2. Words: silence, 0, 1, 3, 5 and 4.

LENGTH          Accuracy
35.69 seconds   100%
Figure 8: Five Digits Recognition Results

Figure 9 shows the result for 6 digits under these conditions:
1. Phones: \k\, \o\, \s\, \ng\, \sil\, \a\, \t\, \u\, \i\, \g\, \m\, \p\, \l\, \n\
2. Words: silence, 0, 1, 3, 5, 6 and 4.

LENGTH          Accuracy
41.8 seconds    100%
Figure 9: Six Digits Recognition Results

Figure 10 shows the result for all Indonesian digits under these conditions:
1. Phones: \a\, \b\, \d\, \e\, \g\, \h\, \i\, \j\, \k\, \l\, \m\, \n\, \ng\, \o\, \p\, \s\, \sil\, \t\, \u\, \w\
2. Words: silence, 0, 1, 3, 5, 6, 7, 8, 9, 2 and 4.

LENGTH          Accuracy
82.35 seconds   95.59%
Figure 10: Indonesian Digits Recognition Results

Figure 11 tabulates, and Figure 12 plots, the number of digits to recognize against the data length required to reach the accuracy reported above.
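The data lengths reported in Figures 4 to 10 can be compared on a common basis by dividing each one by the number of digits it covers. The following is a minimal sketch of that arithmetic: the measurements are copied from the figures (for three digits the 38.1 second row is used, since it is the one that reached 100%), and the variable and function names are hypothetical.

```python
# Training data length in seconds for each experiment, copied from Figures 4 to 10.
# Keys are the number of digit words in the vocabulary (silence is not counted).
DATA_LENGTH_S = {1: 7.4, 2: 13.8, 3: 38.1, 4: 28.15, 5: 35.69, 6: 41.8, 10: 82.35}


def seconds_per_digit(lengths):
    """Return {number of digits: seconds of training data per digit}, rounded to 2 places."""
    return {n: round(total / n, 2) for n, total in lengths.items()}


def non_monotonic_steps(lengths):
    """List (smaller, larger) vocabulary sizes where the larger one needed less data."""
    sizes = sorted(lengths)
    return [(a, b) for a, b in zip(sizes, sizes[1:]) if lengths[b] < lengths[a]]


if __name__ == "__main__":
    for n, rate in seconds_per_digit(DATA_LENGTH_S).items():
        print(f"{n:2d} digits: {DATA_LENGTH_S[n]:6.2f} s total, {rate:5.2f} s per digit")
    print("non-monotonic steps:", non_monotonic_steps(DATA_LENGTH_S))
    print("longest training set under 2 minutes:", max(DATA_LENGTH_S.values()) < 120)
```

On these numbers, most configurations need roughly 7 to 8 seconds of speech per digit. The three-digit case is the exception at about 12.7 seconds per digit, and the four-digit system reached 100% with less data than the three-digit one.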
Number of Digits    Data Length Required (seconds)
1                   7.4
2                   13.8
3                   38.1
4                   28.15
5                   35.69
6                   41.8
10                  82.35
Figure 11: Table of number of digits versus data length

Figure 12: Number of digits versus data length (plot; x-axis: number of digits, 1 to 10; y-axis: data length in seconds, 0 to 90)

5. CONCLUSION AND FUTURE WORK

Conclusion
In general, not many samples are needed to train the specific ASR systems studied here. For single digit recognition, i.e. one digit model plus a silence model, the experiment above shows that only about 7 seconds of training speech are needed to reach a nearly zero-error system. Single digit recognition on its own is trivial and of little practical use, but it hints that, with a suitable training strategy and supervised selection of the training data, a high level of accuracy can be achieved with very short training data when building an isolated word, speaker dependent ASR system.
Finally, we conclude that training an isolated word, speaker dependent Indonesian digit ASR system requires less than 2 minutes of data.

Future work
Future research will address less trivial systems, such as continuous digit recognition and speaker independence, as well as end-user applications such as a phone dialer or price dictation for accounting and business applications.

6. REFERENCES

[1] Rabiner, Lawrence R., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
[2] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, "The HTK Book", 2001-2005, Cambridge University Engineering Department, Website: http://htk.eng.cam.ac.uk/docs/docs.shtml.
