
2017 International Conference on Advanced Mechatronics, Intelligent Manufacture, and Industrial Automation (ICAMIMIA)

Control of Robot Arm based on Speech Recognition


using Mel-Frequency Cepstrum Coefficients (MFCC)
and K-Nearest Neighbors (KNN) Method
Dyah Anggraeni1,2∗ , W.S. Mada Sanjaya1,2 , Madinatul Munawwaroh1,2 , M. Yusuf Solih Nurasyidiek1,2 ,
and Ikhsan Purnama Santika1,2
1 Dept. of Physics, Faculty of Science and Technology, UIN Sunan Gunung Djati, Bandung, Indonesia
2 Bolabot Techno Robotic Institute, CV Sanjaya Star Group, Bandung, Indonesia

tsugumikaoru@gmail.com

Abstract—This study describes the implementation of speech recognition to pick and place an object using a 5 DoF robot arm based on an Arduino microcontroller. To identify the speech, the Mel-Frequency Cepstrum Coefficients (MFCC) method is used for feature extraction and the K-Nearest Neighbors (KNN) method is used to learn and identify the speech, based on Python 2.7. The speech database uses 12 features for the KNN process; tests with a trained (85%) and an untrained (80%) respondent show good agreement in identifying the speech. Finally, the speech recognition system is implemented to control the robot arm performing a pick-and-place assignment.

Keywords—Speech Recognition, Arduino, Robot Arm, MFCC, KNN, Python.

I. INTRODUCTION

Speech recognition is the process of identifying speech from acoustic signal data captured by an audio device. Technically, the conversion process takes an audio signal, which is then identified through audio feature extraction and machine learning. The advantage of implementing speech recognition is the convenience of controlling something by voice, especially to help people with disabilities, among other aims.

Feature extraction methods that can be used to characterize an audio signal include Mel-Frequency Cepstrum Coefficients (MFCC) [1] [2] [3] and Linear Predictive Coding (LPC) [4] [5]. Machine learning methods that can be used to group and classify speech include Support Vector Machines (SVM) [1], Artificial Neural Networks (ANN) [1] [5] [6], Hidden Markov Models (HMM) [4], Fuzzy Logic [7], Adaptive Neuro-Fuzzy Inference Systems (ANFIS) [8], K-Nearest Neighbors (KNN) [9], and other soft computing techniques. In robotics, speech recognition has been applied to voice biometrics [10], robot arm control [11], mobile robot control [12], smart home control [13], wheelchair control [14] [15], social robots [16] [17], and more.

This study describes voice signal processing using the Mel-Frequency Cepstrum Coefficients (MFCC) and K-Nearest Neighbors (KNN) methods to recognize human speech in Python 2.7. Finally, the speech recognition system is implemented to control a 5 Degree of Freedom (DoF) robot arm performing a pick-and-place assignment, based on an Arduino microcontroller.

The paper is organized as follows. Section 2 describes the theoretical background of MFCC and KNN in detail. The experimental method and system design are described in Section 3. Section 4 describes the implementation of speech recognition in detail. Finally, concluding remarks are given in Section 5.

II. THEORETICAL BACKGROUND

A. Feature Extraction using the Mel Frequency Cepstrum Coefficient (MFCC) Method

The Mel Frequency Cepstrum Coefficient (MFCC) is one method for feature extraction of a signal, especially an audio signal. The extracted features serve as an individual identity, obtained by determining a representative value or vector. Because it is considered quite good at representing the signal, MFCC has become the most widely used method in many voice processing fields [13].

The features are cepstral coefficients that take into account the perception of the human hearing system. The principle of MFCC is based on the different frequencies captured by the human ear, so that the sound signal can be represented. The MFCC process is shown in Fig. 1.

1) Pre-emphasis: A pre-emphasis filter is applied after the sampling process, with the purpose of obtaining a smoother spectral form of the speech signal and reducing noise captured during recording. The pre-emphasis filter is based on the input/output relationship in the time domain expressed by (1):

y(n) = x(n) − a x(n − 1),  (1)

where a is the pre-emphasis filter constant, usually 0.9 < a < 1.0.

2) Frame Blocking: In this process the audio signal is segmented into multiple overlapping frames, so that no part of the signal is lost. This process continues until all

978-1-5386-2729-7/17/$31.00 ©2017 IEEE


of the signal has been framed; the voice analysis is performed by short-time analysis.

Fig. 1. MFCC process.

3) Windowing: Windowing is an analysis process for long sound signals in which a sufficiently representative section is taken. Using a Finite Impulse Response (FIR) digital filter approach, this process removes the aliasing caused by the discontinuity of the signal pieces. The discontinuities occur because of the frame blocking process.

4) Fast Fourier Transform (FFT): The Fourier transform is used to convert a time series of bounded time-domain signals into a frequency spectrum. The FFT is a fast algorithm for the Discrete Fourier Transform (DFT) that converts each frame of N samples from the time domain to the frequency domain, reducing the repeated multiplications contained in the DFT:

X_n = Σ_{k=0}^{N−1} x_k e^{−2πjkn/N},  (2)

where n = 0, 1, 2, ..., N − 1 and j = √−1. X_n is the n-th frequency component generated by the Fourier transform and x_k is the signal of a frame. The result of this stage is called the spectrum or periodogram.

5) Mel-Frequency Wrapping: The human ear's perception of audio frequency does not follow a linear scale. The actual frequency scale uses Hz units, while the scale that matches the human ear is the mel frequency scale, which is linear below 1000 Hz and logarithmic above 1000 Hz [18]. The relation of the mel scale to the frequency in Hz is shown in (3):

F_mel = F_HZ,                        F_HZ < 1000,
F_mel = 2595 log10(1 + F_HZ/700),    F_HZ ≥ 1000,  (3)

where F_mel is the mel-scale frequency and F_HZ is the frequency in Hz. One approach to filtering the frequency spectrum on the mel scale, following the working function of the human ear, is the filter bank.

In mel-frequency wrapping, the FFT result is grouped into a bank of triangular filters: each FFT value is multiplied by the corresponding filter gain and the results are summed. The wrapping of the signal in the frequency domain is performed using (4):

X_i = log10( Σ_{k=0}^{N−1} |X(k)| H_i(k) ),  (4)

where i = 1, 2, 3, ..., M (M is the number of triangular filters) and H_i(k) is the value of the i-th triangular filter at acoustic frequency k.

Fig. 2. The original amplitude spectrum and the Mel filter bank.

6) Cepstrum: Humans hear sound on the basis of time-domain signals. In this stage, the mel spectrum is converted back to the time domain using the Discrete Cosine Transform (DCT); the result is called the Mel-frequency cepstrum coefficients (MFCC). Equation (5) gives the cosine transformation:

C_i = Σ_{j=1}^{M} X_j cos( i (j − 1/2) π / M ),  (5)

where C_i is the i-th MFCC coefficient (i = 1, 2, 3, ..., K, with K the number of desired coefficients), X_j is the mel-frequency power spectrum of the j-th filter, and M is the number of filters.

B. Machine Learning using the K-Nearest Neighbors (KNN) Method

K-Nearest Neighbors (KNN), proposed by Cover and Hart [19] [20], is a learning method based on statistical theory. It is one of the simplest and most fundamental classifiers [21] [22] [23]. In the KNN algorithm, each sample should be classified similarly to its surrounding samples; thus, when an unknown/new sample is found, its class can be predicted by considering the classification of its nearest neighbor samples. KNN is based on the idea that any new sample can be classified by the majority vote of its k nearest neighbors, where k is a positive integer, usually small. The KNN method is illustrated in Fig. 3.

Fig. 3. K-Nearest Neighbors illustration.

The advantages of the KNN method are its simplicity, effectiveness, and robustness to noisy training data. Its disadvantages are a low accuracy rate on multidimensional datasets, a large memory requirement, the need to choose the parameter k and a distance measure, and a high computation cost per test query [23].
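The MFCC steps of Sec. II-A can be sketched end to end with NumPy. This is a minimal illustration, not the authors' code (which the paper does not include), and it is written for modern Python/NumPy rather than the paper's Python 2.7; the frame sizes, the 512-point FFT, the 26-filter bank, and the 12 retained coefficients are assumptions chosen to match common practice and the 12-feature database described later.

```python
import numpy as np

def mfcc(signal, fs=16000, n_filters=26, n_coeffs=12,
         frame_len=0.025, frame_step=0.010, alpha=0.95):
    """Minimal MFCC pipeline following Sec. II-A (illustrative parameters)."""
    # 1) Pre-emphasis: y(n) = x(n) - a*x(n-1), with 0.9 < a < 1.0
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Frame blocking into overlapping frames (assumes len(y) >= one frame)
    flen, fstep = int(fs * frame_len), int(fs * frame_step)
    n_frames = 1 + (len(y) - flen) // fstep
    frames = np.stack([y[i * fstep:i * fstep + flen] for i in range(n_frames)])
    # 3) Windowing (Hamming) to smooth the frame edges
    frames = frames * np.hamming(flen)
    # 4) FFT magnitude spectrum, eq. (2)
    nfft = 512
    mag = np.abs(np.fft.rfft(frames, nfft))
    # 5) Mel-frequency wrapping with a triangular filter bank, eqs. (3)-(4)
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    feat = np.log10(np.maximum(mag @ fbank.T, 1e-10))
    # 6) Cepstrum via the DCT of eq. (5); keep the first n_coeffs coefficients
    k = np.arange(1, n_filters + 1)
    basis = np.cos(np.outer(np.arange(n_coeffs), (k - 0.5) * np.pi / n_filters))
    return feat @ basis.T  # shape: (n_frames, n_coeffs)
```

A per-utterance feature vector such as the 12 features in the paper's database could then be obtained by averaging these per-frame coefficients, though the paper does not state how its 12 features are aggregated.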
III. EXPERIMENTAL METHOD
A. Method and System Design
The general system scheme is shown in Fig. 4. After the system starts and records the speech, the work is divided into two processes. The first process builds a database, using MFCC for feature extraction and the KNN method to classify the speech commands "Pick" ("Ambil") and "Place" ("Simpan") in Indonesian (Bahasa); this database is called the trained data. The second is the testing process: a newly recorded speech yields new MFCC feature extraction data, which is matched against the trained data. The matching result gives the speech classification, and the robot arm then moves to pick or place an object according to the command. All processes work in real time based on Python 2.7 and an Arduino microcontroller.
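The matching step above amounts to a majority vote among the k nearest training vectors. A minimal sketch in plain NumPy follows; the paper does not state which KNN implementation, distance measure, or k it used, so Euclidean distance and k = 3 are assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(trained_X, trained_y, sample, k=3):
    """Classify a new feature vector (e.g. 12 MFCC features) by majority
    vote of its k nearest neighbors, by Euclidean distance, in the
    trained data. Targets: 1 = "Ambil" (pick), 0 = "Simpan" (place)."""
    dists = np.linalg.norm(trained_X - sample, axis=1)
    nearest_labels = trained_y[np.argsort(dists)[:k]]
    return Counter(nearest_labels.tolist()).most_common(1)[0][0]
```

With k = 3, two of the three nearest training samples decide the class, which matches the majority-vote description in Sec. II-B.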
Fig. 4. General research scheme.

B. Hardware Design

The main tools and components used in this research are a personal computer, an Arduino microcontroller, the robot arm, a microphone, connections, and others. Fig. 5 shows the design and realization (built in previous research [24]) of the robot arm, which connects to the Arduino microcontroller.

Fig. 6 shows the circuit schematic [11] of the robot arm, which contains 5 servo motors, giving 5 DoF. Each servo is supplied with 5 volts and 100 mA from a battery, and each servo ground connects to the ground of the Arduino microcontroller. The servos are divided by function: Servo 1 is the base (rotating horizontally), connected to pin 8; Servo 2 is the shoulder (rotating vertically), connected to pin 9; Servo 3 is the elbow (rotating vertically), connected to pin 10; Servo 4 is the wrist (rotating horizontally), connected to pin 11; and Servo 5 is the gripper that picks up an object, connected to pin 12.

C. Interface Design

Fig. 7 shows the interface of the speech recognition system based on Python 2.7. The interface consists of the menus "Rekam" (Record) and "Keluar" (Exit), the Python shell window for monitoring the result of speech recognition, and a graphics window displaying the waveform of the recognized speech. Fig. 7 shows the interface tested with the commands "Ambil" (pick) and "Simpan" (place).
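To illustrate how a recognized command could be turned into servo motion over the serial link, the sketch below defines a hypothetical pose table for the five servos on pins 8–12. The pin numbers come from the wiring described above, but the angles and the "pin:angle" wire format are invented for illustration; the paper does not specify its serial protocol or joint trajectories.

```python
# Hypothetical pose table for the 5 servos (pins 8-12 as wired above).
# Angles and the "pin:angle" message format are illustrative assumptions.
POSES = {
    "Ambil":  {8: 90, 9: 60, 10: 120, 11: 90, 12: 30},  # pick: close the gripper
    "Simpan": {8: 0,  9: 60, 10: 120, 11: 90, 12: 90},  # place: open the gripper
}

def command_frame(word):
    """Serialize one pose as a newline-terminated string for the Arduino."""
    pose = POSES[word]
    return ";".join("%d:%d" % (pin, pose[pin]) for pin in sorted(pose)) + "\n"
```

A string such as this could be written to the Arduino with a serial library, with firmware on the Arduino parsing each pair and driving the corresponding servo pin.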

Fig. 5. Robot arm design; (a) design, (b) realization.

Fig. 6. Robot arm circuit schematic.

Fig. 7. The interface of the speech recognition system and the waveform of the speech command; (a) "Ambil", (b) "Simpan".

IV. RESULTS AND DISCUSSION

A. Feature Extraction Database using MFCC

The first step in developing the speech recognition system is to collect feature extraction data from the speech audio. As seen in Fig. 7, the waveforms of "Ambil" and "Simpan" have different forms, so the speech audio certainly has different features. The feature extraction used in this study is the Mel-Frequency Cepstrum Coefficients (MFCC) method, based on Python 2.7. The speech commands used to control the robot arm are spoken in Bahasa: "Ambil" (pick) and "Simpan" (place). The database is built from 10 iterations of each command. Fig. 8 shows the feature extraction database of the speech recognition system.

Fig. 8 shows that the database consists of 12 features plus a target value; the feature extraction is the identity of each command. A target value of "0" is defined as the "Simpan" (place) command, and a target value of "1" as the "Ambil" (pick) command. The database is collected in a .txt file to be classified by the K-Nearest Neighbors (KNN) method; the database is then called the trained data.

Fig. 8. The database of speech recognition.
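The database layout of Fig. 8, one row of 12 features plus a target per recording, could be assembled as follows. The file name and the whitespace-separated text format are assumptions; the paper only states that the database is collected in a .txt file.

```python
import numpy as np

def build_database(feature_rows, targets, path):
    """Store rows of 12 MFCC features plus a target (1 = "Ambil",
    0 = "Simpan") as whitespace-separated text, one recording per line."""
    data = np.column_stack([np.asarray(feature_rows, dtype=float),
                            np.asarray(targets, dtype=float)])
    np.savetxt(path, data, fmt="%.6f")
    return data
```

Loading the file back with `np.loadtxt` and splitting off the last column recovers the trained data for the KNN step.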

B. Speech Recognition System Test

System testing was performed with a trained respondent (in the database) and an untrained respondent (outside the database); testing with the trained respondent serves as data clarification. Table I shows the test results for the trained and untrained respondents, with 10 repetitions of each command alternately. From the test data, the accuracy rate of speech recognition is 85% for the trained respondent and 80% for the untrained respondent.

TABLE I
THE SPEECH RECOGNITION DATA TEST OF TRAINED AND UNTRAINED RESPONDENT.

Examination  Command    Value  Trained  Untrained
 1           "Ambil"    1      1        1
 2           "Simpan"   0      0        0
 3           "Ambil"    1      1        1
 4           "Simpan"   0      0        1
 5           "Ambil"    1      1        0
 6           "Simpan"   0      1        1
 7           "Ambil"    1      1        1
 8           "Simpan"   0      0        0
 9           "Ambil"    1      1        1
10           "Simpan"   0      1        0
11           "Ambil"    1      1        1
12           "Simpan"   0      0        0
13           "Ambil"    1      1        1
14           "Simpan"   0      0        0
15           "Ambil"    1      1        1
16           "Simpan"   0      0        0
17           "Ambil"    1      1        1
18           "Simpan"   0      0        0
19           "Ambil"    1      1        1
20           "Simpan"   0      1        1

C. Speech Recognition Implementation on the Robot Arm

After the speech recognition database has been successfully built, it is implemented on the robot arm system to perform the assignment of picking ("Ambil") and placing ("Simpan") an object. Fig. 9 shows that the robot arm controlled by speech works successfully to pick and place an object.

Fig. 9. Robot arm performing the assignment; (a) initial pose, (b) "Ambil" (pick) an object, (c) "Simpan" (place) an object.

V. CONCLUSIONS

This study has investigated the development of a robot arm controlled by speech recognition to perform a pick-and-place assignment. The speech recognition using MFCC and KNN based on Python 2.7 works successfully according to the speech command. The system obtained a high average accuracy rate: 85% for the trained respondent and 80% for the untrained respondent. The speech recognition system implemented on the 5 DoF robot arm based on an Arduino microcontroller works effectively to pick and place an object. Future work will focus on combining speech recognition with a social robot for human-robot interaction.
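The 85% and 80% accuracy rates follow directly from Table I; transcribing the table and counting matches verifies them.

```python
# Expected values and responses transcribed from Table I (1 = "Ambil", 0 = "Simpan").
expected  = [1, 0] * 10
trained   = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
untrained = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

def accuracy(pred, truth):
    """Percentage of responses matching the expected command value."""
    return 100.0 * sum(p == t for p, t in zip(pred, truth)) / len(truth)

print(accuracy(trained, expected))    # 85.0
print(accuracy(untrained, expected))  # 80.0
```

The trained respondent misses examinations 6, 10, and 20 (3 of 20), and the untrained respondent misses 4, 5, 6, and 20 (4 of 20), giving the reported rates.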

REFERENCES

[1] P. A. Sawakare, R. R. Deshmukh, and P. P. Shrishrimal, "Speech Recognition Techniques: A Review," International Journal of Scientific & Engineering Research, vol. 6, no. 8, pp. 1693–1698, 2015.
[2] A. Setiawan, A. Hidayatno, and R. R. Isnanto, "Aplikasi Pengenalan Ucapan dengan Ekstraksi Mel-Frequency Cepstrum Coefficients (MFCC) Melalui Jaringan Syaraf Tiruan (JST) Learning Vector Quantization (LVQ) untuk Mengoperasikan Kursor Komputer," Tech. Rep. 3, 2011.
[3] I. B. Fredj and K. Ouni, "Optimization of Features Parameters for HMM Phoneme Recognition of TIMIT Corpus," in International Conference on Control, Engineering & Information Technology, vol. 2. IPCO, 2013, pp. 90–94.
[4] Thiang and Wanto, "Speech Recognition Using LPC and HMM Applied for Controlling Movement of Mobile Robot," Seminar Nasional Teknologi Informasi, 2010.
[5] Thiang and S. Wijoyo, "Speech Recognition Using Linear Predictive Coding and Artificial Neural Network for Controlling Movement of Mobile Robot," in International Conference on Information and Electronics Engineering, 2011.
[6] B. P. Das and R. Parekh, "Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers," International Journal of Modern Engineering Research, vol. 2, no. 3, pp. 854–858, 2012.
[7] I. B. Fredj and K. Ouni, "A novel phonemes classification method using fuzzy logic," Science Journal of Circuits, Systems and Signal Processing, vol. 2, no. 1, pp. 1–5, 2013.
[8] W. S. M. Sanjaya and D. Anggraeni, "Sistem Kontrol Robot Arm 5 DOF Berbasis Pengenalan Pola Suara Menggunakan Mel-Frequency Cepstrum Coefficients (MFCC) dan Adaptive Neuro-Fuzzy Inference System (ANFIS)," Wahana Fisika, vol. 1, no. 2, pp. 152–165, 2016.
[9] R. P. Gadhe, R. R. Deshmukh, and V. B. Waghmare, "KNN based emotion recognition system for isolated Marathi speech," International Journal of Computer Science Engineering (IJCSE), vol. 4, no. 4, pp. 173–177, 2015.
[10] I. N. K. Wardana and I. G. Harsemadi, "Identifikasi Biometrik Intonasi Suara untuk Sistem Keamanan Berbasis Mikrokomputer," Jurnal Sistem dan Informatika, vol. 9, no. 1, pp. 29–39, 2014.
[11] W. S. M. Sanjaya, D. Anggraeni, and I. P. Santika, "Speech Recognition using Linear Predictive Coding (LPC) and Adaptive Neuro-Fuzzy (ANFIS) to Control 5 DoF Arm Robot," in ICCSE. Bandung: IOP Conference, 2018.
[12] Z. H. Abdullahi, N. A. Muhammad, J. S. Kazaure, and F. A. Amuda, "Mobile Robot Voice Recognition in Control Movements," International Journal of Computer Science and Electronics Engineering, vol. 3, no. 1, pp. 11–16, 2015.
[13] W. S. M. Sanjaya and Z. Salleh, "Implementasi Pengenalan Pola Suara Menggunakan Mel-Frequency Cepstrum Coefficients (MFCC) dan Adaptive Neuro-Fuzzy Inference System (ANFIS) Sebagai Kontrol Lampu Otomatis," Al-HAZEN Jurnal of Physics, vol. 1, no. 1, 2014.
[14] A. Kumar, P. Singh, A. Kumar, and S. K. Pawar, "Speech Recognition Based Wheelchair Using Device Switching," International Journal of Emerging Technology and Advanced Engineering, vol. 4, no. 2, pp. 391–393, 2014.
[15] K. P. Tiwari and K. K. Dewangan, "Voice Controlled Autonomous Wheelchair," International Journal of Science and Research, no. April, pp. 10–11, 2015.
[16] C. Breazeal, "Breazeal-AR03.pdf," Advanced Robotics, vol. 17, no. 2, pp. 97–113, 2003.
[17] O. Mubin, J. Henderson, and C. Bartneck, "You Just Do Not Understand Me! Speech Recognition in Human Robot Interaction," in International Symposium on Robot and Human Interactive Communication. IEEE, 2014.
[18] A. Mustofa, "Sistem Pengenalan Penutur dengan Metode Mel-frequency Wrapping," J. Tek. Elektro, vol. 7, no. 2, pp. 88–96, 2007.
[19] T. M. Cover, "Estimation by the Nearest Neighbor Rule," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 50–55, 1968.
[20] T. M. Cover and P. E. Hart, "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[21] S. B. Imandoust and M. Bolandraftar, "Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background," Int. Journal of Engineering Research and Applications, vol. 3, no. 5, pp. 605–610, 2013.
[22] G. M. S. Najah, "Emotion estimation from facial images," Ph.D. dissertation, 2017.
[23] H. Parvin, H. Alizadeh, and B. Minati, "A Modification on K-Nearest Neighbor Classifier," Global Journal of Computer Science and Technology, vol. 10, no. 14, pp. 37–41, 2010.
[24] D. Anggraeni, W. S. M. Sanjaya, M. Y. Solih, and M. Munawwaroh, "The Implementation of Speech Recognition using Mel-Frequency Cepstrum Coefficients (MFCC) and Support Vector Machine (SVM) method based Python to Control Robot Arm," in Annual Applied Science and Engineering Conference, vol. 2. IOP Conference, 2018, pp. 1–9.
