Optimum Train Set For Continuous Speaker Dependent Indonesian Digit Recognizer

JURNAL SIMETRIK
ISSN: 2302-9579
VOLUME 6, NUMBER 2, December 2016
Person in Charge
Dr. Sammy Saptenno, SE., M.Si
Editor-in-Chief
Vicky Salamena, SST., MT
Editor
Aleksander A Patty, ST., MT
Managing Editors
Luwis H. Laisina, ST., MT
Paulus F. Picauly, ST., M.Eng
Graciadiana I. Huka, ST., MT
Reynold P. J. V. Nikijuluw, S.Pd., M.Ed
Graphic Design
Ridolf Kermite, ST
Administration
Wa Hauli
Editorial and Administrative Address:
Pusat Penelitian dan Pengabdian kepada Masyarakat (Center for Research and Community Service), Politeknik Negeri Ambon
Jln. Ir. M. Puttuhena Wailela Rumah Tiga Kota Ambon 97234.
Website: www.uppm.polnam.ac.id. E-mail: jurnalsimetrik@gmail.com
TABLE OF CONTENTS
STUDI PENILAIAN KONDISI DAS DAN IMPLIKASINYA TERHADAP FLUKTUASI DEBIT SUNGAI (STUDI KASUS PADA SUB DAS JANGKOK PULAU LOMBOK), 1 - 13
(RUDI SERANG)
PEMBELAJARAN BAHASA INGGRIS DASAR BAGI KOMUNITAS ANAK BERBASIS MEDIA DIGITAL MENINGKATKAN KOSAKATA LISTENING DAN SPEAKING, 14 - 23
(MEITI LEATEMIA)
ANALISA FAKTOR AMAN LERENG TIMBUNAN JALAN TOL SEMARANG – SOLO MENGGUNAKAN SOFTWARE GEOSTUDIO, 24 - 30
(NUSYE LEWAHERILLA)
RANCANG BANGUN TIRAI OTOMATIS MIKROKONTROLER ATMEGA 328 DENGAN SENSOR LDR DAN LM35, 31 - 44
(ARI PERMANA L, RINA LATUCONSINA, KHABIB MIZAIR, RAMLI MANAHAJI)
OPTIMUM TRAIN SET FOR CONTINUOUS SPEAKER DEPENDENT INDONESIAN DIGIT RECOGNIZER, 45 - 48
(ZULKARNAEN HATALA)
STUDY EXPERIMEN KOMPOSIT POLYESTER SERAT AMPAS EMPULUR SAGU TERHADAP KEKUATAN IMPAK, 49 - 51
(ARTHUR YANNY LEIWAKABESSY)
IMPLEMENTASI TQM DAN PENGARUHNYA TERHADAP KINERJA KARYAWAN, YANG DIDUKUNG OLEH KEPUASAN KERJA KARYAWAN (REKAN SEKERJA YANG MENDUKUNG) SEBAGAI VARIABEL MODERATOR (STUDI EMPIRIS PADA PERUSAHAAN MANUFAKTUR DI INDONESIA), 52 - 60
(DADY MAIRUHU)
IBM KELOMPOK USAHA MAKANAN NON TEPUNG DI DESA PASSO KOTA AMBON, 61 - 65
(JEFFRIE Y MALAKAUSEYA, GRACIADIANA I HUKA)
JURNAL SIMETRIK VOL 6, NO. 2 DESEMBER 2016, ISSN : 2302-9579
OPTIMUM TRAIN SET FOR CONTINUOUS SPEAKER DEPENDENT INDONESIAN DIGIT RECOGNIZER
Zulkarnaen Hatala 1)
1) Electrical Engineering Department, Politeknik Negeri Ambon
e-mail: dzulqarnaenhatala@gmail.com
Abstract
This paper measures the performance of continuous digit recognition for the Indonesian language and analyzes the optimum length of data needed to train a speaker-dependent ASR system to full accuracy. The software used is the Hidden Markov Model Toolkit (HTK). The result is a plot of training-data length against the word accuracy rate.
Keywords: continuous speech recognition, speaker dependent, optimum training set
1. INTRODUCTION
Background
Many speech recognition systems [1] require a very long training set (hours of speech) to reach a given level of performance. But what about a system that needs to be trained quickly? What is the optimum, in the sense of the minimum length of time, sentences, or words spoken, that is really needed as reference data to build a simple system such as a speaker-dependent continuous digit recognizer? This paper addresses that question experimentally, using the Indonesian language with an Ambonese accent as the target system. The software used is the Hidden Markov Model Toolkit (HTK).

2. SPEECH RECOGNITION
Continuous Word
There are many kinds of automatic speech recognition (ASR) systems. An ASR system that accepts a sequence of words with no boundary between them is called continuous; that is, a continuous ASR system requires no pause between words. This situation arises in real applications such as automatic phone dialing on smartphones.

Speaker Dependent
A speaker-dependent system is built for use by a single person in a single environment. This single user must keep a consistent speaking style, such as a fixed local accent. Even for the same user, a different manner of pronunciation will commonly degrade performance.

Digit Recognition
A digit recognizer is an ASR system that identifies digits or numbers from spoken sound. Such a system can be integrated into other systems such as a phone dialer or a number dictation application.

Hidden Markov Toolkit (HTK)
The Hidden Markov Model Toolkit [2] from Cambridge University is a set of training and testing tools for hidden Markov models with Gaussian mixture densities (HMM-GMM) [2]. In practice, HTK is used intensively in speech recognition research across the world.

3. METHODOLOGY
Hardware Setup
The ASR system runs on commonly available hardware. Here we use an inexpensive Bluetooth microphone for input and a laptop with a 2.65 GHz processor and 4 GB of RAM.

ASR Steps
The ASR process itself consists of two distinct phases:
1. Training, which constructs the HMM-GMM [2] model. In this phase, sample sounds are recorded from a human user through the microphone input, and the mathematical model is estimated from those sounds. Annotation (labeling) is also performed at this stage to mark the boundaries of the phoneme sequences. Phonemes are the subword units used by HTK and can be modeled by HMM-GMM. A digit or word can then be recovered by analyzing the sequence of phonemes that occurs in an utterance.
2. Recognition (testing), which examines the system performance. In this phase the total duration of the sample sound is measured for a given level of recognition performance. The performance criterion is the word accuracy, which HTK calculates as:

$$\text{Accuracy} = \frac{H - I}{N} \times 100\%$$

H: number of correct words
I: number of insertions
N: total number of words
The H parameter is itself determined by two other quantities, the number of deleted words D and the number of substituted words S, through H = N - D - S.

HTK Configuration
HTK is a set of ready-to-use command-line tools and programming libraries for training and testing ASR systems. For it to work, a few configuration files must be written explicitly. These files specify the format and method of feature extraction, the HMM-GMM parameters, the language grammar, and the dictionary.

HTK Grammar
The language grammar used to perform continuous digit recognition is shown in Figure 1:

$NUM1X = (NUM_0|NUM_1|NUM_3|NUM_4|NUM_5|NUM_6|NUM_7|NUM_8|NUM_9|NUM_2);
$NUM2X = $NUM1X $NUM1X;
$NUMRPT = $NUM1X| $NUM2X| $NUM3X| $NUM4X;
( SIL {$NUMRPT SIL } )

Figure 1: HTK Grammar Entries

HTK Dictionary
HTK needs a file called a dictionary that maps each word to its subword phonemes. For the Indonesian digit recognizer we use the dictionary in Figure 2:

NUM_0 \k\ \o\ \s\ \o\ \ng\
NUM_2 \d\ \u\ \w\ \a\
NUM_1 \s\ \a\ \t\ \u\
NUM_3 \t\ \i\ \g\ \a\
NUM_4 \a\ \m\ \p\ \a\ \t\
NUM_5 \l\ \i\ \m\ \a\
NUM_6 \a\ \n\ \a\ \m\
NUM_7 \t\ \u\ \j\ \u\ \h\
NUM_8 \l\ \a\ \p\ \a\ \n\
NUM_9 \s\ \e\ \m\ \b\ \i\ \l\ \a\ \n\

Figure 2: HTK Dictionary Entries

DATABASE CONSTRUCTION
Isolated Database
The database for the continuous digit recognizer combines the existing isolated digit database [3] with a newly constructed database. The new database contains only bi-word (two-digit) utterances, which simulate continuous digit sentences. For example, the new training patterns include sentences such as “SIL 1 4 SIL”. Figure 3 shows the annotation of this example.
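To make the role of the bi-word sentences concrete, the following is a minimal Python sketch (not part of the HTK tool chain) that expands a prompt such as “SIL 1 4 SIL” into phones using the Figure 2 dictionary and then labels each phone with its left and right neighbours in the l-c+r notation. The function names, the mapping of SIL to a single `sil` phone, and the choice to enumerate all 100 ordered digit pairs are assumptions made for illustration; the paper does not state which pairs were actually recorded.

```python
from itertools import product

# Pronunciation dictionary copied from Figure 2 (word -> phone sequence).
DICTIONARY = {
    "NUM_0": ["k", "o", "s", "o", "ng"],
    "NUM_1": ["s", "a", "t", "u"],
    "NUM_2": ["d", "u", "w", "a"],
    "NUM_3": ["t", "i", "g", "a"],
    "NUM_4": ["a", "m", "p", "a", "t"],
    "NUM_5": ["l", "i", "m", "a"],
    "NUM_6": ["a", "n", "a", "m"],
    "NUM_7": ["t", "u", "j", "u", "h"],
    "NUM_8": ["l", "a", "p", "a", "n"],
    "NUM_9": ["s", "e", "m", "b", "i", "l", "a", "n"],
    "SIL": ["sil"],  # assumption: silence is modeled as one "sil" phone
}


def biword_prompts():
    """Return every two-digit prompt of the form SIL NUM_a NUM_b SIL.

    Assumption: all 100 ordered pairs are listed; the recorded set may be smaller.
    """
    digits = [f"NUM_{d}" for d in range(10)]
    return [["SIL", a, b, "SIL"] for a, b in product(digits, repeat=2)]


def to_phones(prompt):
    """Expand a list of words into one flat phone sequence."""
    return [phone for word in prompt for phone in DICTIONARY[word]]


def to_triphones(phones):
    """Label each interior phone with its neighbours as left-centre+right."""
    return [
        f"{phones[i - 1]}-{phones[i]}+{phones[i + 1]}"
        for i in range(1, len(phones) - 1)
    ]


if __name__ == "__main__":
    prompt = ["SIL", "NUM_1", "NUM_4", "SIL"]  # the "SIL 1 4 SIL" example
    phones = to_phones(prompt)
    print(" ".join(phones))                # sil s a t u a m p a t sil
    print(" ".join(to_triphones(phones)))  # includes t-u+a and u-a+m
    print(len(biword_prompts()))           # 100
```

The labels that span the boundary between the two digits (here t-u+a and u-a+m) never occur when each digit is spoken in isolation, which is why the bi-word recordings add information that the isolated database cannot provide.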
Figure 3: Continuous digit annotation

This new database is intended to capture the behavior of subword phonemes at the junction between words. For example, from the utterance in Figure 3 we can capture the dynamics of the context-dependent phonemes t-u+a and u-a+m. These two phonemes do not exist in the isolated version of the database.

Semi Automatic Labeling
To construct the new database for the continuous version from the isolated version, we use the HTK phone alignment feature [2]. This lets us label the new database semi-automatically and in less time. After the automatic alignment, only a small effort is needed to correct the new transcriptions.

4. EXPERIMENTAL RESULT
Single Digit Recognition
The first system we test only needs to recognize the single Indonesian digit “satu” and its repetitions. The parameters used are:
1. Data format: PCM 16 bit, mono, 16000 Hz
2. Speech Feature Model: MFCC_E_D_A
3. Phones: \s\, \a\, \t\, \u\, \sil\
4. HMM-GMM: Diagonal covariances, 5 states with 3 emitting states, an entry state and an exit state.
5. Words: silence and 1 (“satu”)
The results are given in Figure 4:

Length (seconds)   Context      Accuracy
11.68              monophonic   50%
19.85              monophonic   50%
11.68              triphonic    40%
19.85              triphonic    80%
Figure 4: Single Digit Recognition Results

Two Digits Recognition
The next system we test recognizes the two Indonesian digits “dua” and “satu” plus the silence boundary.
1. Data format: PCM 16 bit, mono, 16000 Hz
2. Speech Feature Model: MFCC_E_D_A
3. Phone context: monophonic
4. Phones: \d\, \u\, \w\, \a\, \sil\, \s\, \t\
5. HMM-GMM: Diagonal covariances, 5 states with 3 emitting states, an entry state and an exit state.
6. Words: silence, 1 (“satu”), and 2 (“dua”)
The results are given in Figure 5:

Length (seconds)   Phone Context   Accuracy (%)
54.93              monophonic      92.59
54.93              monophonic      92.59
54.93              triphonic       92.59
54.93              triphonic       92.59
Figure 5: Two Digits Recognition Results

Full Digit Continuous
This experiment tests the ASR system on all ten Indonesian digits, with repetitions allowed. Figure 6 reports the results under the constraints below:
1. Phones: \a\, \b\, \d\, \e\, \g\, \h\, \i\, \j\, \k\, \l\, \m\, \n\, \ng\, \o\, \p\, \s\, \sil\, \t\, \u\, \w\
2. Words: silence, 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
3. Number of monophones: 21
4. Number of triphones: 167
5. Number of tested words: 604
6. Number of trained words: 463

Length (seconds)   Context      Accuracy
218.91             Monophonic   93.21 %
218.91             Triphonic    90.89 %
Figure 6: Continuous Digits Recognition Results

In Figure 6 the monophonic system slightly outperforms the triphonic one.

5. CONCLUSION AND FUTURE WORK
Conclusion
Even in the continuous case, a small-vocabulary system shows comparable performance for the monophonic and triphonic models. The triphonic system suffers a slight degradation because its many triphones (167) are undertrained on such a short data sample.
This is a complete continuous version that is ready to be used in a front-end application such as an automatic phone number dialer on a smartphone.

Future work
Future enhancements should address independence from the speaker, the environment, and the hardware.

6. REFERENCES
[1] Rabiner, Lawrence R., 1989, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77 No. 2, February 1989.
[2] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, “The HTK Book”, 2001-2005, Cambridge University Engineering Department, Website: http://htk.eng.cam.ac.uk/docs/docs.shtml.
[3] Hatala Z, 2015, “Optimum Data Length To Train Isolated Speaker Dependent Indonesian Digit Recognizer”, Jurnal Simetrik, Vol. 6 No. 1, Year 2016, Politeknik Negeri Ambon.
