
Confusion Matrix Validation of an Urdu Language Speech Number System

S K Hasnain Assistant Professor, Pakistan Navy Engineering College, National University of Engineering & Technology, Karachi, Pakistan
Abstract - This paper presents a speech processing and recognition system for spoken Urdu language numbers from siffar (zero) to nau (nine), to lay a concrete foundation for work on the Urdu language. Speech feature extraction was based on a dataset of 150 samples collected from 15 different speakers. The samples were used to extract features using Fourier descriptors and neural networks. Initial processing of the data, i.e. normalizing and time-slicing, was done using a combination of Simulink blocksets and MATLAB. The MATLAB toolbox commands were then used to calculate the Fourier descriptors and correlations. The MATLAB implementations are included in the paper for use by other researchers in this field. Feed-forward neural network models were developed in MATLAB. The models and algorithm exhibited high training and testing accuracies. Finally, the models were tested using the confusion matrix approach and were found to work satisfactorily. Such a system can potentially be used to implement voice-driven help in different systems, such as multimedia and voice-controlled customer services.

Keywords: Spoken Urdu number processing, Feature extraction, Speaker independent, Feed-forward neural networks, Learning rate, Confusion matrix.

Introduction

Many years ago, von Kempelen demonstrated that the human speech production system could be modeled; he showed this by building a mechanical contrivance that talked. The paper by Dudley and Tarnoczy [1] relates the history of von Kempelen's speaking machine. This device was built about 1780, at a time when the notion of building automata was quite popular. Speech modeling can be divided into two types of coding: waveform and source. In the beginning, researchers tried to mimic the sounds as-is, a technique called waveform coding, which tries to retain the original waveform using quantization and redundancy. An alternative approach breaks the sound up into individual components that are then modeled separately; this parametric method is referred to as source coding. Different characteristics of speech can be used to identify the spoken words, the gender of the speaker, and/or the identity of the speaker. Two important features of speech are pitch and formant frequencies. For adult males, the average pitch is 60 to 120 Hz, and for females, it is 120 to 200 Hz. A discrete-time model makes use of linear prediction for producing speech. The vocal tract and lip radiation models use a discrete excitation signal. An impulse generator emulates voiced speech excitation, with the impulses passed through a glottal shaping filter; unvoiced speech is generated by a random noise generator.

Ideally, any features selected for a speech model should be (1) not purposely controllable by the speaker, (2) independent of the speaker's health condition, and (3) tolerant to environmental/acquisition noise [2]-[5]. Although the pitch can be easily varied by a speaker, it can also be easily cleaned of electrical noise using a low-pass filter. Formants, on the other hand, are unique to different speakers and are useful in individual speaker identification. In general, a combination of pitch and formants can be used in a speech recognition system. Many different schemes have been used in the past for speech feature extraction, for example, the discrete Fourier transform (DFT), linear predictive coding (LPC), and cepstral analysis. Much research has been done on English (and other major languages) speech processing/recognition, but the application of these time-tested techniques to the Urdu language had not been investigated until very recently [6][7]. In this paper, we provide the system (in the form of MATLAB code) as open source so that other researchers can build upon our current efforts. Speech acquisition begins with a person speaking into a microphone or telephone. Speaking produces a sound pressure wave that forms an acoustic signal, which the microphone converts to an analog signal that an electronic device can handle; finally, to store it on a computer, the analog signal must be converted to a digital signal. The data in this paper was obtained by recording the Urdu numbers on a PC and generating wave files, which were processed in Simulink after passing through a filter. The output files were kept for analysis.

Preliminaries

2.1 Overview of discrete Fourier transform and its MATLAB implementation


The DFT is itself a sequence, rather than a function of a continuous variable, and it corresponds to equally-spaced frequency samples of the discrete-time Fourier transform of a signal. The Fourier series representation of a periodic sequence corresponds to the discrete Fourier transform of a finite-length sequence. So we can say that the DFT is used for transforming a discrete-time sequence x(n) of finite length into a discrete frequency sequence X[k] of finite length [8]. The DFT is a function of complex frequency; usually the data sequence being transformed is real. A waveform is sampled at regular time intervals T to produce a sequence of N sample values, where n is the sample number, from n = 0 to N-1:

Figure 1. Structure of a simple feed-forward neural network.

{x(nT)} = x(0), x(T), ..., x[(N-1)T]    (1)

For a length-N input vector x, the DFT is a length-N vector X with elements

X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N},    k = 0, 1, ..., N-1    (2)

Design components

The design of the predictive system contains the following: learning and recognition of the correct word, isolated-word classification, and learning and recognition of spoken words using correct structure and topic classification. The major focus of this paper is to use isolated words as a tool to detect problems in the recognized speech and to understand isolated words spoken by different speakers, thereby providing corrective feedback to the speech recognition system.

The MATLAB function fft(x) calculates the DFT of the vector x. Similarly, fft(x, N) calculates the N-point FFT, padding x with zeros if it has fewer than N points and truncating it if it has more.
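As an illustration outside the paper's MATLAB environment, Eq. (2) and the pad/truncate behavior of fft(x, N) can be sketched in plain Python; the function name dft and the sample values below are our own:

```python
import cmath

def dft(x, N=None):
    """Direct evaluation of Eq. (2): X(k) = sum_n x(n) e^{-j 2*pi*k*n/N}.
    Like MATLAB's fft(x, N), x is zero-padded or truncated to length N."""
    if N is None:
        N = len(x)
    x = (list(x) + [0.0] * N)[:N]  # pad with zeros, then truncate
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
X = dft(x)
print(abs(X[0]))      # DC bin = sum of samples = 10.0
X8 = dft(x, 8)        # 8-point DFT of the zero-padded sequence
print(len(X8))        # 8
```

Zero-padding does not add information; it only interpolates the spectrum on a denser frequency grid, which is why fft(x, N) is convenient for power-of-two lengths.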

2.2 Overview of neural networks

Neural networks (NNs) have proven to be a powerful tool for solving problems of prediction, classification and pattern recognition. Engineering applications of NNs include the fields of electronics and computer engineering [9]-[24]. NNs are based on the principle of biological neurons. An NN may have one or more input and output neurons, as well as one or more hidden layers of neurons interconnecting the input and output neurons. In one of the well-known NN types, the outputs of one layer of neurons send data (only) to the next layer (Figure 1); such networks are called feed-forward NNs. Back-propagation is a common scheme for creating (training) NNs. During training, the internal weights (w_ij) of the neurons are iteratively adjusted so that the outputs are produced within the desired accuracy. The training process requires that the training set (known examples, i.e. input-output datasets) be chosen carefully. The selected dataset usually needs to be pre-processed before being fed to an NN.
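As a sketch of the feed-forward idea, each layer computes a weighted sum of its inputs plus a bias and applies a transfer function. The weights below are arbitrary placeholders, not values trained in this paper:

```python
import math

def tansig(v):
    """Hyperbolic-tangent transfer function (MATLAB's tansig)."""
    return math.tanh(v)

def layer(inputs, weights, biases, transfer):
    """One layer: each neuron takes the weighted sum of all inputs plus a bias."""
    return [transfer(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Illustrative 2-input, 2-hidden, 1-output network with placeholder weights.
x = [0.5, -0.2]
h = layer(x, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1], tansig)
y = layer(h, [[1.0, -1.0]], [0.0], lambda v: v)   # linear (purelin) output
print(y)
```

Training (e.g. back-propagation) would adjust the weight and bias values; the forward pass above stays the same.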

3.1 Simulink model

A Simulink model, shown in Figure 2, was developed for recording the spoken Urdu numbers and extracting data from them for processing and analysis.

Figure 2. Simulink model for extracting number data for analysis (From Wave File → Digital Filter Design → Magnitude FFT → Signal To Workspace blocks, with output also sent to a wave device).

Another Simulink [25][26] model (not shown here for brevity) was developed for further analysis, such as standard deviation, mean, median, autocorrelation, magnitude of FFT and FFT2, data matrix correlation, and root mean square.

4.3 Endpoint detection

This step isolates the word to be detected from the following silence [27].
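The paper does not spell out the endpoint algorithm of [27]; a minimal energy-threshold sketch, with made-up frame length and threshold values, might look like:

```python
def endpoints(samples, frame_len=80, threshold=0.01):
    """Crude energy-based endpoint detection: return the sample indices of
    the first and last frame whose average energy exceeds a silence threshold."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# silence, a burst of signal, then silence again
sig = [0.0] * 160 + [0.5, -0.5] * 80 + [0.0] * 160
start, end = endpoints(sig)
print(start, end)   # 160 320
```

Real endpoint detectors typically add hysteresis and zero-crossing checks; this sketch only shows the energy-gating idea.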

3.2 Wave file

We read audio data samples from standard Windows PCM format ".wav" audio files. When looping, the number of times to play the file's data is specified, or "inf" to loop indefinitely.

4.4 Frame blocking

Here we make use of the property that the sounds originate from a mechanically slow vocal tract. This assumed stationarity allows overlapping frames of 100 ms or so.
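A sketch of overlapping frame blocking (toy sizes are used below; at the 22050 Hz rate of Figure 2, a 100 ms frame would be 2205 samples):

```python
def frame_block(samples, frame_len, hop):
    """Split a signal into overlapping frames: consecutive frames start
    `hop` samples apart, so each overlaps the next by frame_len - hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

sig = list(range(10))
frames = frame_block(sig, frame_len=4, hop=2)   # 50% overlap
print(frames)   # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap keeps short events from being lost at frame boundaries while preserving the quasi-stationarity assumption within each frame.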

4.5 Fourier transformation

3.3 Digital filter design

The fdatool box [26] has been used in the digital filter design.

3.4 Checking signal attributes

This block generates an error when the input signal does or does not exactly match the selected attributes.

MATLAB's built-in FFT function provides symmetric data, of which the lower half can be used for NN training/testing. The frequency spectrum for the numeral aik (one) is shown in Figure 3. Figures 4 to 7 show the individual correlations extracted for 15 different speakers for the Urdu numbers siffar (zero) to nau (nine), which differentiate the numbers spoken by different speakers for detailed analysis.
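The "lower half" property follows from the conjugate symmetry of the DFT of a real signal: |X(k)| = |X(N-k)|. A quick Python check on a toy signal of our own:

```python
import cmath

def dft(x):
    """Direct DFT, as in Eq. (2)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [0.2, 0.9, -0.4, 0.1, 0.0, -0.7, 0.3, 0.5]   # any real-valued signal
mags = [abs(X) for X in dft(x)]
# For real input, |X(k)| == |X(N-k)|, so only the lower half of the
# magnitude spectrum carries independent information.
half = mags[:len(x) // 2]
print(all(abs(mags[k] - mags[len(x) - k]) < 1e-9 for k in range(1, len(x) // 2)))
print(len(half))   # 4
```

This is why 64 magnitude values suffice as NN inputs even though the FFT itself is longer.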

3.5 Digital filter

This block independently filters each channel of the input over time using a specified digital filter implementation. Filter coefficients can be specified using either tunable mask dialog parameters or separate input ports (useful for time-varying coefficients). In frame-based processing, time-varying coefficients are supported at one of two update rates: one filter per frame (the coefficients stay constant for the duration of an input frame and change for the next frame), or one filter per sample (the coefficients change with every sample in the input frame) [4].
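The per-channel filtering described above amounts to applying a difference equation to each channel independently. A minimal Python sketch (not the Simulink block itself; the coefficients below are a simple moving average of our own choosing):

```python
def lfilter(b, a, x):
    """Direct-form difference equation:
    a[0]*y[n] = b[0]*x[n] + ... + b[M]*x[n-M] - a[1]*y[n-1] - ... - a[K]*y[n-K]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc / a[0])
    return y

# Two channels filtered independently with a 2-point moving average (FIR).
channels = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
filtered = [lfilter([0.5, 0.5], [1.0], ch) for ch in channels]
print(filtered)   # [[0.5, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
```

Updating b and a between frames (or between samples) gives the two time-varying modes mentioned above.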

Figure 3. Correlation of the spoken Urdu number chaar (four).

Data acquisition and processing

The speech recognition system presented in this model was limited to the individual Urdu numerals (0 to 9). The data was acquired by speaking the numerals into a microphone connected to an MS-Windows-XP-based PC. Fifteen speakers uttered the same number set (0 to 9): siffar, aik, do, teen, chaar, paanch, shay, saat, aath, and nau. Each sound sample was curtailed to a duration of 50 seconds. The following data processing steps can generally be used for preparing the data for NN training.


Figure 4. Correlation of the spoken Urdu number saat (seven).



4.1 Pre-processing

This step may involve a logarithmic or similar transformation. A normalization process may also be needed [13][20].
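A sketch of such pre-processing (log compression followed by min-max normalization to [0, 1]; the small epsilon guard against log of zero is our own addition):

```python
import math

def preprocess(features, eps=1e-12):
    """Log-compress the features, then scale them to the range [0, 1]."""
    logged = [math.log(f + eps) for f in features]
    lo, hi = min(logged), max(logged)
    return [(v - lo) / (hi - lo) for v in logged]

feats = [1.0, 10.0, 100.0, 1000.0]
out = preprocess(feats)
print(out)   # ≈ [0, 1/3, 2/3, 1]: evenly spaced after log compression
```

Log compression tames the large dynamic range of FFT magnitudes, and normalization keeps the inputs within the sensitive region of the tansig transfer function.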

4.2 Data length adjustment

FFT execution time depends on the exact number of samples (N) in the data sequence. It is desirable to choose the data length equal to a power of two.
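The adjustment can be sketched as zero-padding the sequence to the next power-of-two length (the function name below is our own):

```python
def pad_to_pow2(x):
    """Zero-pad a sequence to the next power-of-two length, which the
    radix-2 FFT handles most efficiently."""
    n = 1
    while n < len(x):
        n *= 2
    return x + [0.0] * (n - len(x))

sig = [0.1] * 300          # e.g. 300 samples
padded = pad_to_pow2(sig)
print(len(padded))         # 512
```

As noted in Section 2.1, MATLAB's fft(x, N) performs the same padding internally when N exceeds the signal length.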


Figure 5. Correlation of the spoken Urdu number aath (eight).


Figure 6. The waveform of the correlation of the spoken Urdu number chaar (four).

A deeper network, with two hidden layers (configuration 64, 16, 16, 1), was declared as:

net = newff(minmax_of_p, [64, 16, 16, 1], ...
    {'tansig', 'tansig', 'tansig', 'purelin'}, 'traingd');
net.trainParam.show = 100;     % x-axis of graph is 100 values apart
net.trainParam.lr = 0.01;      % learning rate
net.trainParam.epochs = 10000;
net.trainParam.goal = 0.01;    % tolerance
net = train(net, p, t);

Neural network and its MATLAB implementation

Figure 7. The waveform of the correlation of the spoken Urdu number paanch (five).

We used single-hidden-layer feed-forward networks in this research, experimenting with many different sizes of the hidden layer, i.e., 5 to 20 neurons. In most cases, the training accuracy flattened out at a hidden-layer size of 10; the number of training iterations (epochs), however, did decrease with more hidden-layer neurons. The NN used 64 FFT magnitudes as inputs and produced ten separate predictions (one for each number), meaning we had 64 neurons in the input layer and 10 neurons in the output layer. From the overall dataset of 150 speech samples (10 numbers spoken by 15 different speakers), we used 90% as the training set and set aside the remaining 10% as the validation/testing set. A learning rate of 0.01 and a training error tolerance of 0.01 were used, and the epoch count was limited to 5000. We now provide details of the NN implementation in MATLAB. A feed-forward N-layer NN is created using the newff MATLAB command [25][32]:
% p[] = array of training inputs
% t[] = array of targets/outputs
% minmax_of_p = array containing mins and maxs of p[]
% network configuration
S1 = 10;   % layer-1 neurons
S2 = 10;   % layer-2 neurons
net = [];  % network reset
% two-layer network declaration
net = newff(minmax_of_p, [S1, S2], {'tansig', 'purelin'}, 'traingd');
% training and display parameters
net.trainParam.show = 50;
net.trainParam.lr = 0.01;     % learning rate
net.trainParam.goal = 0.01;   % training tolerance
net.trainParam.epochs = 5000;
% finally the training happens with:
net = train(net, p, t);

4.6 Feed-forward neural network

The following commands are used in a feed-forward backpropagation network.


net = newff(PR, [S1 S2 ... SNl], {TF1 TF2 ... TFNl}, BTF, BLF, PF)

where PR is an Rx2 matrix of min and max values for the R input elements; Si is the size of the ith layer, for Nl layers; TFi is the transfer function of the ith layer (default 'tansig'); BTF is the backpropagation network training function (default 'trainlm'); BLF is the backpropagation weight/bias learning function (default 'learngdm'); and PF is the performance function (default 'mse').

4.7 Neural network algorithm

cutoff = 64;
for i = 1:150
    for j = 3:cutoff+2
        ptrans(i, j-2) = log(alldat(i, j));
    end
end
p = transpose(ptrans);       % inputs
t = transpose(alldat(:,2));  % targets
for i = 1:cutoff
    minmax_of_p(i,1) = min(p(i,:));
end
for i = 1:cutoff
    minmax_of_p(i,2) = max(p(i,:));
end


There is a trade-off between the learning rate, the goal, and the number of epochs. Networks in different configurations, especially different hidden-layer sizes, were created and tested. Table 1 shows the learning accuracy of some of these networks. A maximum accuracy of 100% was achieved with a double-hidden-layer network (64, 35, 35, 1), so a double-hidden-layer network should suffice for our application.
Table 1. Training of neural network with different layers of neurons.


Figure 8. Training on Class 6 with 12 speakers (final performance 0.00942263, goal 0.01). Settings: layers (64, 35, 35, 1), learning rate 0.01, goal 0.01.

Neuron count       Learning accuracy
64, 10, 10         81.48%
64, 20, 10         81.48%
64, 30, 10         72.59%
64, 40, 10         83.70%
64, 60, 10         62.96%
64, 10, 10, 10     64.44%
64, 15, 15, 10     69.63%
64, 20, 20, 10     58.52%
64, 25, 20, 10     71.85%
64, 25, 25, 10     71.85%
64, 30, 30, 10     72.59%
64, 35, 35, 10     83.70%
64, 35, 35, 1      100%


Figure 9. Training on Class 7 with 12 speakers (final performance 0.00979687, goal 0.01, 65 epochs). Settings: layers (64, 35, 35, 1), learning rate 0.01, goal 0.01.

5.1 Relationship between network layers and training data

The effect of the learning rate (η) on training is shown in Figures 8-11. In our experiments, a larger value of η took less time to train, although convergence to the steady-state value was noisy. This shows that there exists a trade-off between the learning rate, the network parameter settings, and the number of epochs for which the algorithm is run. Neural networks were also trained and tested with different numbers of speakers; for example, with two out of five speakers kept for testing, the accuracy was 100 percent, as shown in Table 2.


5.2 Confusion matrix


In the field of artificial intelligence, a confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is usually called a matching matrix). A common technique for evaluating a multi-way classifier is to inspect the confusion matrix produced during the testing phase. For each learned concept, the confusion matrix tabulates the true number of test instances against the predicted classes. The aggregate accuracy is obtained by summing the diagonal entries of the confusion matrix; however, invaluable information about the relationships among classes is often ignored, so we explored various ways of exploiting the notion of similarity among subsets of classes using the confusion matrix. Each column of the matrix represents the instances of a predicted class, while each row represents the instances of an actual number. One benefit of a confusion matrix is that it makes it easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as another).
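The construction and diagonal-sum accuracy described above can be sketched as follows (the labels form a hypothetical three-class toy example, not the paper's test data):

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows = actual class, columns = predicted class."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def accuracy(m):
    """Aggregate accuracy: sum of diagonal entries over all entries."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

actual    = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 2]
m = confusion_matrix(actual, predicted, 3)
print(m)            # [[1, 1, 0], [0, 2, 0], [0, 0, 2]]
print(accuracy(m))  # 5/6 ≈ 0.833
```

Here the off-diagonal entry m[0][1] immediately exposes the one case where class 0 was mislabeled as class 1.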

Figure 10. The performance of the algorithm using the feed-forward neural network with learning rate 0.01 (final performance 0.00958845, goal 0.01, 2328 epochs).


Figure 11. The performance of the algorithm using the feed-forward neural network with learning rate 0.001 (final performance 0.000999621, goal 0.001, 7935 epochs).

The feed-forward neural network was trained for different learning rates, goals, and epoch limits; it was found that there is a trade-off among these parameters.

As shown in Table 3, the recognition of the Urdu numbers using the confusion matrix approach comes out to be 100%.
Table 2. Training of neural network with different number of speakers.
S. No.   No. of training speakers   No. of testing speakers   Accuracy rate (%)
1        15                         5                          94
2        12                         4                          95
3        10                         3                          97
4        7                          2                          95
5        5                          2                          100

References

[1] H. Dudley and T. H. Tarnoczy, "The speaking machine of Wolfgang von Kempelen," Journal of the Acoustical Society of America, Vol. 22, 1950, pp. 151-166.
[2] J. Koolwaaij, Speech Processing. Available: //www.google.com/search (5 May 2004).

[3] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley Publishing Co., 1987.
[4] T. Parsons, Voice and Speech Processing, McGraw-Hill College Div., Inc., 1986.

Table 3. Confusion matrix showing the accuracy of the different numbers (classes W0 to W9).

[5] G. C. M. Fant, Acoustic Theory of Speech Production, Mouton, Gravenhage, 1960.
[6] S. K. Hasnain, A. Beg, and M. S. Awan, "Frequency Analysis of Urdu Spoken Numbers Using MATLAB and Simulink," PAF KIET Journal of Engineering & Sciences, Vol. 1, No. 2, December 2007, pp. 43-48.
[7] S. K. Hasnain and A. Beg, "Cohesive Analysis, Modeling and Simulation of Spoken Urdu Language Numbers with Fourier Descriptors and Neural Networks," Ubiquitous Computing and Communication Journal, Vol. 3, No. 2, 2007.
[8] S. K. Hasnain and P. Akhter, Digital Signal Processing, Theory and Worked Examples, 2007.
[9] P. W. C. Prasad, A. Assi, and A. Beg, "Binary Decision Diagrams and Neural Networks," Journal of Supercomputing, Vol. 39, No. 3, March 2007, pp. 301-320.
[10] A. Assi, P. W. C. Prasad, A. Beg, and V. C. Prasad, "Complexity of XOR/XNOR Boolean Functions: A Model using Binary Decision Diagrams and Back Propagation Neural Networks," Journal of Computer Science and Technology, Vol. 7, No. 2, April 2007, pp. 141-147.
[11] A. Beg and Y. Chu, "Modeling of Trace and Block-Based Caches," Journal of Circuits, Systems and Computers, Vol. 16, No. 5, October 2007, pp. 711-729.
[12] A. Beg, P. W. C. Prasad, and A. Beg, "Applicability of Feed-Forward and Recurrent Neural Networks to Boolean Function Complexity Modeling," Expert Systems with Applications, Vol. 36, No. 1, November 2008. In press.
[13] P. W. C. Prasad and A. Beg, "Investigating Data Preprocessing Methods for Circuit Complexity Models," Expert Systems with Applications, Vol. 38, No. 4. In press.

Conclusions

When we compared the frequency content of the same word by different speakers, we found striking similarities among them. This gave us more confidence in our initial hypothesis that a single word uttered by a diverse set of speakers would exhibit similar characteristics. In our experiments, a larger value of the learning rate (η) took less time to train, although convergence to the steady-state value was noisy; there thus exists a trade-off between the learning rate, the network parameter settings, and the number of epochs for which the algorithm is run. We observed that the Fourier descriptor feature was independent of the spoken numbers, and that with the combination of the Fourier transform and the correlation techniques in MATLAB, a high-accuracy recognition system can be realized. The recorded data was used in a Simulink model for introductory analysis. The feed-forward NN was trained for different learning rates, goals and epochs, and a trade-off was found among them. Using a confusion matrix, it is easy to see whether the system is confusing two classes (i.e., commonly mislabeling one as another); the recognition of the Urdu numbers using the confusion matrix approach comes out to be 100%.

[14] A. Beg and W. Ibrahim, "PerfPred: A Web-Based Tool for Exploring Computer Architecture Design Space," Computer Applications in Engineering Education. In press.
[15] A. Assi, P. W. C. Prasad, and A. Beg, "Boolean Function Complexity and Neural Networks," 7th WSEAS International Conference on Neural Networks, Cavtat, Croatia, June 12-14, 2006, pp. 85-90.
[16] P. W. C. Prasad, A. K. Singh, A. Beg, and A. Assi, "Modeling the XOR/XNOR Boolean Functions Complexity using Neural Networks," 13th IEEE International Conference on Electronics, Circuits and Systems, Nice, France, December 10-13, 2006, pp. 1348-1351.
[17] A. Beg, P. Chandanna, and A. Assi, "Modeling the Behavior of BDD Complexity Using Neural Networks," 4th IASTED International Conference on Circuits, Signals, and Systems, San Francisco, CA, United States, November 20-22, 2006.
[18] A. Beg, P. W. C. Prasad, M. Arshad, and S. K. Hasnain, "Using Recurrent Neural Networks for Circuit Complexity Modeling," 10th IEEE International Multitopic Conference, Islamabad, Pakistan, December 23-24, 2006, pp. 194-197.
[19] A. Beg, P. W. C. Prasad, and S. M. N. A. Senanayake, "Learning Monte Carlo Data for Circuit Path Length," 4th IASTED International Conference on Advances in Computer Science and Technology (ACST 2008), Langkawi, Malaysia, April 2-4, 2008.
[20] A. Beg and P. W. C. Prasad, "Data Processing for Effective Modeling of Circuit Behavior," 8th WSEAS International Conference on Evolutionary Computing (EC07), Vancouver, Canada, June 18-20, 2007, pp. 312-318.
[21] P. W. C. Prasad and A. Beg, "A Methodology for Evaluation Time Approximation," 50th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS/NEWCAS07), Montreal, Canada, August 5-8, 2007, pp. 776-778.
[22] A. Beg, "Predicting Processor Performance with a Machine Learnt Model," 50th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS/NEWCAS07), Montreal, Canada, August 5-8, 2007, pp. 1098-1101.
[23] A. K. Singh, A. Beg, and P. W. C. Prasad, "Modeling the Path Length Delay Projection," International Conference for Engineering and ICT, Melaka, Malaysia, November 27-28, 2007.
[24] A. Beg and W. Ibrahim, "An Online Tool for Teaching Design Trade-offs in Computer Architecture," International Conference on Engineering Education, Coimbra, Portugal, September 3-7, 2007.
[25] MATLAB User's Guide, The MathWorks, Inc., 2006.
[26] DSP Blockset (For Use with Simulink) User's Guide, The MathWorks, Inc., 2007.
[27] S. Varho, New Linear Predictive Systems for Digital Speech Processing, PhD dissertation, Helsinki University of Technology, Finland, 2001.
[28] S. K. Hasnain and N. Jamil, "Implementation of Digital Signal Processing Real-Time Concepts Using Code Composer Studio 3.1, TI DSK TMS320C6713 and DSP Simulink Blocksets," IC-4 Conference, Pakistan Navy Engineering College, Karachi, November 2007.
[29] M. M. El-Choubassi, H. E. El-Khoury, C. E. J. Alagha, J. A. Skaf, and M. A. Al-Alaoui, "Arabic Speech Recognition Using Recurrent Neural Networks," IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003), December 2003, pp. 543-547.
[30] S. K. Hasnain and A. Tahir, Digital Signal Processing Laboratory Workbook, 2006.
[31] M. A. Al-Alaoui, R. Mouci, M. M. Mansour, and R. Ferzli, "A Cloning Approach to Classifier Training," IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, Vol. 32, No. 6, 2002, pp. 746-752.
[32] S. D. Stearns and R. A. David, Signal Processing Algorithms in MATLAB, Prentice Hall, 1996.
