
2012 11th International Conference on Machine Learning and Applications

A Minimum Frame Error Criterion for Hidden Markov Model Training


Taemin Cho, Kibeom Kim and Juan P. Bello

Music and Audio Research Lab (MARL), New York University, New York, U.S. {tmc323, jpbello}@nyu.edu
Courant Institute of Mathematical Sciences, New York University, New York, U.S. kk1674@nyu.edu
These authors contributed equally to this work.

978-0-7695-4913-2/12 $26.00 © 2012 IEEE. DOI 10.1109/ICMLA.2012.147

Abstract: Hidden Markov models (HMMs) have been widely studied and applied over decades. The standard supervised learning method for HMMs is maximum likelihood estimation (MLE), which maximizes the joint probability of the training data. However, the most natural way of training would be to find the parameters that directly minimize the error rate on a given training set. In this article, we propose a novel learning method that minimizes the number of incorrectly decoded labels frame-wise. To do this, we construct a smooth function that is arbitrarily close to the exact frame error rate and minimize it directly using a gradient-based optimization algorithm. The proposed approach is intuitive and simple. We applied our method to the task of chord recognition in music, and the results show that it performs better than Maximum Likelihood Estimation and Minimum Classification Error.

I. INTRODUCTION

The analysis of sequential data is unquestionably an interesting subject since many real-world phenomena such as speech and music exhibit intrinsic time-varying behaviors. Sequence labeling, in particular, has received much attention, because many practical tasks in various research fields, including speech recognition, bioinformatics, and computational linguistics, are defined as sequence labeling problems. Hidden Markov Models (HMMs) are one of the most successful statistical approaches to inferring labels from unknown data sequences. For example, they have become the standard approach for modeling acoustic information in both speech recognition and chord recognition in music [1]. Accordingly, building the right model is crucial, and learning suitable HMM parameters becomes a matter of both theoretical and practical importance.

In many applications, supervised learning is favored for estimating HMM parameters because it performs better than the unsupervised approach. Maximum Likelihood Estimation (MLE), which maximizes the likelihood of data given a model, is the most widely used method for fitting models to given labeled data sequences (i.e. ground-truth samples). When model parameters are obtained via MLE, it is assumed that the HMM's probability model is a good approximation of the true distribution of the real-world problem. In practice, however, this assumption is known to be poor, since most real-world problems are not that simple. For example, consecutive speech frames are wrongly assumed to be conditionally independent given a hidden state sequence. In addition, the performance of a particular model (e.g. a speech recognizer) is typically measured by its labeling accuracy (e.g. word error rate, WER, in speech recognition), not by the likelihood scores of given observation sequences. For these reasons, many alternative criteria and methodologies have been proposed in the machine learning literature [2]-[6] to improve on the performance of MLE.

As the main purpose of parameter training is to achieve minimum recognition error, the majority of alternative methods focus on discriminative training. One early example is Maximum Mutual Information (MMI) [2]. As the name indicates, it maximizes the mutual information between observations and corresponding hidden states. Although MMI training shows an advantage over MLE in many applications, the MMI criterion does not directly connect with error rate minimization. On the other hand, Minimum Classification Error (MCE) [5] employs a discriminant loss function to approximate the classification error measured by competing HMMs. However, since the MCE criterion does not consider the temporal relationship between successive input signals, it does not guarantee the minimum error when it is used in continuous recognition. The embedded MCE [7], an alternative to the original MCE, tries to minimize the classification error measured between the correct path and competing incorrect paths (e.g. N-best paths) decoded from an HMM. The embedded MCE has the advantage of not requiring time-aligned labels. However, its lack of time consideration may yield incorrect fitting to given data. For this reason, there have been efforts to improve it by using smaller input sequences chunked by time-aligned labels [8], [9].

In this article, we propose a Minimum Frame Error (MFE) criterion for the training of HMMs. As the name indicates, the method seeks to minimize the frame error rate (FER), i.e. the ratio of incorrectly labeled frames to the total number of frames in the training data. Thus, this criterion is suitable when aligned training labels are available (e.g. aligned word labels in speech recognition), providing an intuitive and natural way to fit an HMM to those labels. In addition,

(a) Discrete Recognition (MCE)

(b) Continuous Recognition (MFE)

Figure 1. (a) The system gets a pre-segmented observation, calculates the likelihood for each HMM individually, and outputs the label with the highest likelihood. MCE trains the HMMs so that each pre-segmented observation yields the corresponding ground truth label. (b) The system gets observation frames and outputs a label for each frame. Unlike (a), the HMMs are connected by a language model. MFE trains the HMMs and the language model so that each frame yields the corresponding ground truth label.

for some applications, such as musical chord recognition, FER is an exact evaluation metric. In fact, the evaluation metric of musical chord recognition is what inspired us to develop our method.

The proposed method internally discriminates correct states against incorrect states per frame. If we only consider a single frame, the idea of discrimination is very similar to MCE. The major difference between MCE and MFE is that the former tries to match each pre-segmented input to a ground truth label individually, while the latter tries to match the continuous input frames to the frame-level ground truth labels, taking its language model into account. Figure 1 illustrates the two different training methods. Conceptually, MFE inherits MCE's original idea and extends it to a continuous sequence labeling problem.

The remainder of this paper is organized as follows: Section II provides a detailed description of MFE; Section III evaluates the performance of MFE and compares it with MLE and MCE; Section IV includes our conclusions.

II. MFE HMM TRAINING

Let us consider an HMM $\lambda = (\pi, A, B)$, where $\pi$ is the initial state probability distribution, $A$ is the set of transition probabilities, and $B$ is the set of emission probabilities. Let us also consider an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$, which is typically decoded into a state sequence $\hat{Q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_T\}$ using either a MAP (maximum a posteriori) or a Viterbi decoder. We define the frame error rate (FER) with respect to the given ground truth label sequence $G = \{g_1, g_2, \ldots, g_T\}$ as:

$$\mathrm{FER}(G, \hat{Q}) = 1 - \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}_{L(\hat{q}_t) = g_t} \tag{1}$$

The function $L$ maps a state $\hat{q}_t$ to the corresponding label, and the indicator function $\mathbb{1}_{L(\hat{q}_t) = g_t}$ is 1 if $L(\hat{q}_t) = g_t$ and 0 otherwise. Note that more than one state can share the same label, as shown in Figure 1(b). This makes MFE more flexible with respect to the HMM structure and the annotation level of the ground truth data. For example, in continuous speech recognition, one may model a word using a sequence of multiple phoneme states.
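As a minimal sketch of the exact frame error rate in (1), the snippet below counts frames whose decoded state maps to the ground truth label. The function name, the dictionary used for the label map $L$, and the toy data (two chord models with three states each) are illustrative assumptions, not part of the original method description.

```python
def frame_error_rate(decoded_states, ground_truth, state_to_label):
    """Exact FER as in (1): 1 - (correctly labeled frames) / T.

    state_to_label plays the role of the map L; several states may share a label.
    """
    if not ground_truth or len(decoded_states) != len(ground_truth):
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(
        1 for q, g in zip(decoded_states, ground_truth) if state_to_label[q] == g
    )
    return 1.0 - correct / len(ground_truth)


# Hypothetical example: states 0-2 belong to chord "C", states 3-5 to chord "G".
state_to_label = {0: "C", 1: "C", 2: "C", 3: "G", 4: "G", 5: "G"}
decoded = [0, 1, 2, 3, 3, 5]                 # decoded state sequence
truth = ["C", "C", "G", "G", "G", "G"]       # frame-level ground truth labels
fer = frame_error_rate(decoded, truth, state_to_label)
print(fer)
```

Only the third frame is mislabeled here, so the error is one frame out of six. Because the comparison is made on labels rather than states, the same function applies whether a label is modeled by one state or by several.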

The goal of MFE is to find the $\lambda$ that minimizes FER given the training data. If (1) were differentiable with respect to the model parameters, any gradient-based optimization algorithm could be applied to minimize FER, for example Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [10] or Sequential Quadratic Programming (SQP) [11], [12]. Unfortunately, the indicator function is discrete and thus not differentiable. To make (1) differentiable, the indicator function should be approximated by a differentiable function. As the indicator function depends on the decoded sequence $\hat{Q}$, the approximation should be designed based on the choice of decoding algorithm.

Theoretically, MAP decoding is a natural fit for FER minimization, since the decoded sequence has the lowest FER expectation (see footnote 1). However, Viterbi decoding is preferred when the decoded sequence should be constrained by zero-probability transitions. For example, in Figure 1(b), Viterbi decoding guarantees that the decoded state sequence does not jump from the middle state of HMM A to any state in the other HMMs. Section II-A and Section II-B describe the approximation with MAP and Viterbi decoding, respectively.

Footnote 1: Assuming the unknown ground truth $G$ is a state sequence (not a label sequence), finding a state sequence $Q$ that minimizes the expectation of FER is identical to the definition of MAP decoding: $\arg\min_{Q} E[\mathrm{FER}(G, Q)] = \arg\max_{Q} \sum_{t=1}^{T} P(q_t \mid O, \lambda) = \{\arg\max_{q_t} P(q_t, O \mid \lambda)\}_{t=1}^{T}$.

A. MFE with MAP decoding

MAP decoding finds the most likely state at each time $t$ individually and formulates a single decoded state sequence $\hat{Q}$ as:

$$\hat{Q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_T\}, \qquad \hat{q}_t = \arg\max_{s \in S} P(s_t, O \mid \lambda) \tag{2}$$

where $t \in \{1, 2, \ldots, T\}$ is a time index, and $S$ is the set of states $s$. $P(s_t, O \mid \lambda)$ is computed using the forward-backward algorithm [13]. The approximation of the indicator function for MAP decoding can be formulated as:

$$\mathbb{1}_{L(\hat{q}_t) = g_t} \approx \frac{\sum_{s : L(s) = g_t} [P(s_t, O \mid \lambda)]^{\alpha}}{\sum_{s \in S} [P(s_t, O \mid \lambda)]^{\alpha}} \tag{3}$$

If $\alpha$ goes to infinity, the right-hand side of (3) converges to 1 when $\hat{q}_t$ belongs to the numerator (i.e. $L(\hat{q}_t) = g_t$), and to 0 otherwise. In practice, $\alpha$ is determined by measuring the quality of the approximation as:

$$\mathrm{AQ}(O, \lambda, \alpha) = \frac{1}{T} \sum_{t=1}^{T} \frac{[P(\hat{q}_t, O \mid \lambda)]^{\alpha}}{\sum_{s \in S} [P(s_t, O \mid \lambda)]^{\alpha}} \tag{4}$$

AQ represents how closely the approximation reflects the actual MAP decoding. As $P(\hat{q}_t, O \mid \lambda) \ge P(s_t, O \mid \lambda)$ for all $s \in S$, AQ is close to 1 for any sufficiently large $\alpha$. Empirically, we have found that choosing $\alpha$ to satisfy AQ > 0.9 is sufficient. From (1) and (3), the optimization criterion $\widetilde{\mathrm{FER}}$ for MAP decoding is defined as:

$$\widetilde{\mathrm{FER}}(G, O, \lambda, \alpha) = 1 - \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{s : L(s) = g_t} [P(s_t, O \mid \lambda)]^{\alpha}}{\sum_{s \in S} [P(s_t, O \mid \lambda)]^{\alpha}} \tag{5}$$

Each probability term in (5) can be computed using the forward-backward algorithm. As the forward-backward algorithm consists only of additions, multiplications, and divisions, $\nabla \widetilde{\mathrm{FER}}$ is well defined. Moreover, the full expansion of $\nabla \widetilde{\mathrm{FER}}$ can be easily calculated using Automatic Differentiation (AD) [14]. Theoretically, it has been proven that with reverse-mode AD the gradient computation does not cost more than five times the cost of the function computation, regardless of the number of input variables [15] (i.e. $\mathrm{cost}\{\nabla f\} \le 5\,\mathrm{cost}\{f\}$). However, since reverse-mode AD has to keep the whole history of intermediate computations, its high memory usage can be an issue in practice. In our experience, this becomes a problem when $B$ (the set of emission probability functions) has high computational complexity with a large number of coefficients, e.g. when $b \in B$ is a mixture of multivariate Gaussians (i.e. a GMM). Fortunately, when $B = \{b_s \mid s \in S\}$ depends on a set of variables $V$, the computation of $\nabla \widetilde{\mathrm{FER}}$ can be efficiently decomposed into the following steps:

1) Compute $b_s(o_t)$, $\forall s \in S$ and $\forall t \in \{1, 2, \ldots, T\}$.
2) With the computed values $b_s(o_t)$, compute $\frac{\partial \widetilde{\mathrm{FER}}}{\partial \pi}$, $\frac{\partial \widetilde{\mathrm{FER}}}{\partial A}$, and $\frac{\partial \widetilde{\mathrm{FER}}}{\partial b_s(o_t)}$ via reverse-mode AD.
3) Compute $\nabla_V \widetilde{\mathrm{FER}} = \sum_{t=1}^{T} \sum_{s \in S} \frac{\partial \widetilde{\mathrm{FER}}}{\partial b_s(o_t)} \frac{\partial b_s(o_t)}{\partial V}$. For each summation loop, $\frac{\partial b_s(o_t)}{\partial V}$ is computed via reverse-mode AD, and the used memory is purged.

Note that each AD step is localized by different $s$ and $t$ values at the third step, thus needing much less memory.

B. MFE with Viterbi decoding

Viterbi decoding finds the most likely sequence of hidden states through an HMM. The decoded state sequence $\hat{Q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_T\}$ for a given observation $O$ is then defined as:

$$\hat{Q} = \arg\max_{Q \in \mathcal{Q}} P(Q, O \mid \lambda) \tag{6}$$

where $\mathcal{Q}$ is the set of all possible state sequences $Q = \{q_1, q_2, \ldots, q_T\}$. The approximation of the indicator function for Viterbi decoding can be calculated as:

$$\mathbb{1}_{L(\hat{q}_t) = g_t} \approx \frac{\sum_{Q \in \mathcal{Q} \text{ s.t. } L(q_t) = g_t} [P(Q, O \mid \lambda)]^{\alpha}}{\sum_{Q \in \mathcal{Q}} [P(Q, O \mid \lambda)]^{\alpha}} \tag{7}$$

The numerator only considers the sequences $Q$ that match the ground truth at time $t$. If $\alpha$ goes to infinity, the right-hand side of (7) converges to 1 when $\hat{Q}$ belongs to the numerator, and to 0 otherwise. $P(Q, O \mid \lambda)$ is a standard HMM joint probability; thus:

$$[P(Q, O \mid \lambda)]^{\alpha} = \pi_{q_1}^{\alpha}\, b_{q_1}^{\alpha}(o_1) \prod_{t=2}^{T} a_{q_{t-1}, q_t}^{\alpha}\, b_{q_t}^{\alpha}(o_t) \tag{8}$$
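To make the smooth indicators in (3) and (7) concrete, the following sketch evaluates both on a tiny discrete-emission HMM. It is an illustration under stated assumptions rather than the authors' implementation: the toy parameters, the function names, and the use of plain Python lists are all hypothetical. Because (8) factorizes, the per-frame sums over state sequences can be obtained from one forward-backward pass in which every parameter is raised to the power $\alpha$; a brute-force enumeration is included only to check that pass on a small example.

```python
from itertools import product


def tempered_state_sums(pi, A, B, obs, alpha):
    """sums[t][s] = sum over all Q with q_t = s of [P(Q, O | lambda)]^alpha.

    Forward-backward pass with pi, a and b each raised to alpha.
    With alpha = 1 this is the usual P(s_t, O | lambda).
    """
    n, T = len(pi), len(obs)
    b = [[B[s][o] ** alpha for s in range(n)] for o in obs]        # b[t][s]
    a = [[A[r][s] ** alpha for s in range(n)] for r in range(n)]
    fwd = [[pi[s] ** alpha * b[0][s] for s in range(n)]]
    for t in range(1, T):
        fwd.append([b[t][s] * sum(fwd[t - 1][r] * a[r][s] for r in range(n))
                    for s in range(n)])
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        bwd[t] = [sum(a[s][r] * b[t + 1][r] * bwd[t + 1][r] for r in range(n))
                  for s in range(n)]
    return [[fwd[t][s] * bwd[t][s] for s in range(n)] for t in range(T)]


def enumerated_state_sums(pi, A, B, obs, alpha):
    """Same sums by brute force over every state sequence, using (8) directly."""
    n, T = len(pi), len(obs)
    sums = [[0.0] * n for _ in range(T)]
    for Q in product(range(n), repeat=T):
        p = pi[Q[0]] * B[Q[0]][obs[0]]
        for t in range(1, T):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][obs[t]]
        for t in range(T):
            sums[t][Q[t]] += p ** alpha
    return sums


def label_share(rows, labels, truth):
    """Per frame: mass on states whose label equals the truth, over total mass."""
    return [sum(v for s, v in enumerate(row) if labels[s] == g) / sum(row)
            for row, g in zip(rows, truth)]


def soft_indicator_viterbi(pi, A, B, obs, labels, truth, alpha):
    """Right-hand side of (7) for every frame."""
    return label_share(tempered_state_sums(pi, A, B, obs, alpha), labels, truth)


def soft_indicator_map(pi, A, B, obs, labels, truth, alpha):
    """Right-hand side of (3): state posteriors first, then raised to alpha."""
    post = tempered_state_sums(pi, A, B, obs, 1)
    return label_share([[p ** alpha for p in row] for row in post], labels, truth)


# Hypothetical toy model: 3 states, 2 observation symbols, states 0 and 1 share a label.
pi = [0.6, 0.0, 0.4]
A = [[0.5, 0.5, 0.0], [0.0, 0.6, 0.4], [0.3, 0.0, 0.7]]
B = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
labels = ["C", "C", "G"]
obs = [0, 0, 1, 1]
truth = ["C", "C", "G", "G"]

for alpha in (1, 5, 25, 100):
    soft_v = soft_indicator_viterbi(pi, A, B, obs, labels, truth, alpha)
    soft_m = soft_indicator_map(pi, A, B, obs, labels, truth, alpha)
    print(alpha, [round(x, 4) for x in soft_v], [round(x, 4) for x in soft_m])
```

One minus the mean of either list gives the corresponding smooth criterion. As $\alpha$ grows, each list should approach the 0/1 indicator of its own decoder; the two decoders need not agree frame by frame, which is why the approximation is tied to the decoding algorithm. The sketch works in the probability domain for readability; for long sequences or large $\alpha$ a log-domain or scaled pass would be needed to avoid underflow.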

Then, the summations in (7) can be easily computed using the forward-backward algorithm by simply replacing $\{\pi, a, b\}$ with $\{\pi^{\alpha}, a^{\alpha}, b^{\alpha}\}$. Similar to (4), the approximation quality is defined as:

$$\mathrm{AQ}(O, \lambda, \alpha) = \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{Q \in \mathcal{Q} \text{ s.t. } q_t = \hat{q}_t} [P(Q, O \mid \lambda)]^{\alpha}}{\sum_{Q \in \mathcal{Q}} [P(Q, O \mid \lambda)]^{\alpha}} \tag{9}$$

Finally, the optimization criterion $\widetilde{\mathrm{FER}}$ for Viterbi decoding is defined as:

$$\widetilde{\mathrm{FER}}(G, O, \lambda, \alpha) = 1 - \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{Q \in \mathcal{Q} \text{ s.t. } L(q_t) = g_t} [P(Q, O \mid \lambda)]^{\alpha}}{\sum_{Q \in \mathcal{Q}} [P(Q, O \mid \lambda)]^{\alpha}} \tag{10}$$

III. EXPERIMENTS

To assess the validity of the MFE training method, we perform an experimental evaluation on the task of estimating chords from polyphonic (footnote 2) music signals. Musical chords are defined by the occurrence of harmonically related musical notes, either simultaneously or in quick succession. Since chords are the building blocks of Western tonal music, their automatic recognition is an important subject in computer music and machine listening research. The performance of an automatic chord recognition system is typically measured by directly counting the number of correctly labeled frames; thus it is a suitable example for measuring the performance of the proposed method.

In the majority of chord recognition systems, audio feature sequences are modeled using chord HMMs connected by a bigram language model [1]. Each HMM represents a chord, and the language model describes chord transition probabilities from one chord to another. The architecture is almost identical to a typical HMM speech recognition system. The chord lexicon used in these experiments is composed of 25 chords: the 24 most common chords in popular music (i.e. 12 major triads and 12 minor triads), plus a no-chord (i.e. transients and silence). Each chord is modeled by a 3-state left-to-right HMM and each state has 5 Gaussians with diagonal covariance matrices. The 25 chord models are then connected by a bigram language model.

The proposed method is compared with MLE and MCE. For MCE training, the parameters of each chord HMM are optimized using the method described in [5]. Then, the bigram language model estimated from annotated training data is used for connecting the optimized chord HMMs. For MFE, both MAP decoding and Viterbi decoding are evaluated. Both MFE and MCE are initialized with the parameters obtained from MLE and optimized using L-BFGS.

A. Data and evaluation methodology

The data used in this experiment is a set of 495 chord-annotated polyphonic audio recordings, consisting of 200 songs from the Beatles and Queen datasets (footnote 3) and 295 songs from the RWC and US Pop datasets (footnote 4). The audio features are extracted from audio signals as follows: first, as the signal between adjacent musical beats is assumed to be harmonically quasi-stationary, each audio recording is segmented into beat frames using the algorithm in [16]. Then each frame is converted into a 12-dimensional chroma vector representing the energy distribution of the audio signal across the twelve pitch classes of the chromatic scale. We use two common implementations of the chroma vector: Conventional Chroma, which is directly computed from the DFT of an audio frame by mapping each frequency bin of the DFT spectrum to a corresponding pitch class; and Log Chroma, which is calculated from the log-compressed DFT spectrum. For the details of chroma feature computation, we refer the reader to [17], [18].

As MLE and MCE train the parameters of each HMM individually, the training data have to be pre-segmented based on the chord annotations. Unlike MLE and MCE, the data for MFE training does not need any segmentation. Each experiment is performed using 5-fold cross validation, with each group containing 99 songs selected randomly from the data set. For each iteration, one group is selected as the test set, and the remaining 4 groups are used for training. The chord recognition error rate is calculated as follows:

$$\text{Error Rate} = 1 - \frac{\text{total duration of correct chords}}{\text{total duration of dataset}} \tag{11}$$
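The duration-based error rate used for evaluation, i.e. the ratio in (11), can be sketched as follows. This is a minimal illustration, assuming each beat frame comes with its duration in seconds, a predicted chord label, and a ground truth label; the function name and the toy values are hypothetical.

```python
def duration_weighted_error_rate(durations, predicted, truth):
    """Error rate as in (11): 1 - (duration of correct frames) / (total duration)."""
    if not (len(durations) == len(predicted) == len(truth)):
        raise ValueError("all inputs must have the same length")
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    correct = sum(d for d, p, g in zip(durations, predicted, truth) if p == g)
    return 1.0 - correct / total


# Hypothetical beat frames of unequal length (seconds).
durations = [0.5, 0.5, 1.0, 2.0]
predicted = ["C", "G", "G", "Am"]
truth = ["C", "C", "G", "Am"]
rate = duration_weighted_error_rate(durations, predicted, truth)
print(rate)
```

In this toy case one of four frames is wrong, but it covers only 0.5 s of 4.0 s, so the duration-weighted rate is lower than the plain frame count would suggest. Weighting each frame by its duration during training keeps the training criterion aligned with this evaluation measure.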
Footnote 2: Two or more tones sound simultaneously.
Footnote 3: Available from http://www.isophonics.net/datasets
Footnote 4: Available from https://github.com/tmc323/Chord-Annotations.git

Since the error rate is based on time duration, each frame in the training data is weighted by the duration of the corresponding ground truth label, for both MFE and MCE training.

B. Results

Table I compares recognition performance for HMMs trained with MLE, MCE and the proposed MFE criteria. Overall, Log Chroma shows better performance than conventional Chroma regardless of training method. While both MCE and MFE show significantly lower error rates than MLE on conventional Chroma, only MFE shows significant differences against MLE on Log Chroma features. Due to the effect of log compression, Log Chroma has lower deviations than conventional Chroma. This makes Log Chroma relatively well described by MLE by reducing uncertainties between training and test data. Consequently, the effect of MCE training is reduced, as shown in Table II. Table II shows the optimization performance of MCE in its original error criterion (i.e. classification error). The classification error reduction rate decreases from 13.94% to 9.35% on the test set when Log Chroma features are employed.

Table I. Performance comparison in chord recognition (average error rate)

              MLE      MCE w/ bigram   MFE (MAP)   MFE (Viterbi)
Chroma        33.01%   29.46%          26.59%      26.70%
Log Chroma    25.22%   24.89%          22.64%      22.72%

Table II. Isolated pre-segmented chord classification error rate

                Chroma                      Log Chroma
                MLE w/o bigram   MCE        MLE w/o bigram   MCE
Training Set    19.24%           15.30%     14.04%           11.46%
Test Set        19.87%           17.10%     14.75%           13.37%

Note that, in MLE and MCE, the bigram language model is independent of the feature types and of the HMMs trained on those features. This unwarranted independence assumption prevents us from obtaining fully optimized models. The discrepancy between Table I and Table II (i.e. discrete and continuous recognition) supports this. Unlike MLE and MCE, MFE involves the language model (i.e. temporal relationships between HMMs) in its error minimization process. As a result, the proposed MFE significantly outperforms both MLE and MCE regardless of feature type, with significance at p < 0.001 using a 5-fold cross-validation paired t-test.

Figure 2 shows the convergence of MFE. During training, the goal is to reduce FER on training data by minimizing the approximation $\widetilde{\mathrm{FER}}$. As shown in the plot, the approximation (dashed line) closely matches the exact FER (solid line). On the other hand, FER on training data does not perfectly generalize to FER on test data, but the two show a similar trend and are strongly correlated. We suspect that larger training data can help the generalization.

Figure 2. MAP decoding convergence plot for the first 500 iterations (x-axis: iterations; y-axis: frame error rate). The solid line and the dashed line are FER and $\widetilde{\mathrm{FER}}$ on training data, respectively, and the dotted line is FER on test data.

IV. CONCLUSION

A new HMM training method that minimizes frame error rate is presented. We argue that the learning criterion is preferable because it is close to, or the same as, the standard error rate used for evaluation in various tasks. The approximated frame error rate formulated for the optimization is straightforward and accurate, and only involves a slightly modified forward-backward algorithm. Our method is implemented using standard techniques such as automatic differentiation and L-BFGS. The chord recognition experiment shows that our method performs better than MLE and MCE with a bigram language model. Since the frame error rate is based on the labels, it allows us to use different levels of labeling (e.g. word labels, phoneme labels), and opens up opportunities for handling flexible labeling structures (e.g. label sharing in an ambiguous segment near a boundary, or using different levels of labeling at the same time) in an intuitive way. In the future, we aim to apply our method to various applications and compare it with other HMM training strategies.

REFERENCES

[1] T. Cho, R. Weiss, and J. Bello, "Exploring common variations in state of the art chord recognition systems," in Proceedings of the Sound and Music Computing Conf. (SMC), Barcelona, Spain, July 2010.
[2] L. Bahl, P. Brown, and P. Souza, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP 86, vol. 11, pp. 49-52, 1986.
[3] Y. Ephraim, A. Dembo, and L. Rabiner, "A minimum discrimination information approach for hidden Markov modeling," Information Theory, IEEE Transactions on, vol. 35, no. 5, pp. 1001-1013, 1989.
[4] S. Eddy and G. Mitchison, "Maximum discrimination hidden Markov models of sequence consensus," Journal of Computational Biology, vol. 2, pp. 9-23, 1995.
[5] B. Juang, W. Hou, and L. Chin-hui, "Minimum classification error rate methods for speech recognition," Speech and Audio Processing, IEEE Transactions on, vol. 5, no. 3, pp. 257-265, 1997.
[6] F. Sha and L. Saul, "Large margin hidden Markov models for automatic speech recognition," Advances in Neural Information Processing Systems, vol. 19, p. 1249, 2007.
[7] E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri, "Discriminative training for large-vocabulary speech recognition using minimum classification error," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 1, pp. 203-223, 2007.
[8] D. Purnell, "Improved performance and generalization of minimum classification error training for continuous speech recognition," in Proceedings of Sixth International Conference on Spoken Language Processing, Beijing, China, October 2000.
[9] D. Purnell and E. Botha, "Improved generalization of MCE parameter estimation with application to speech recognition," Speech and Audio Processing, IEEE, vol. 10, no. 4, pp. 232-239, 2002.
[10] D. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, vol. 45, no. 1, pp. 503-528, 1989.
[11] P. Boggs and J. Tolle, "Sequential quadratic programming," Acta Numerica, vol. 4, no. 1, pp. 1-51, 1995.
[12] P. Gill, W. Murray, and M. Saunders, "SNOPT: An SQP algorithm for large-scale constrained optimization," SIAM Journal on Optimization, vol. 12, no. 4, pp. 979-1006, 2002.
[13] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[14] L. Rall and G. Corliss, "An introduction to automatic differentiation," Computational Differentiation: Techniques, Applications, and Tools, SIAM, pp. 1-18, 1996.
[15] A. Griewank, "On automatic differentiation," Mathematical Programming: Recent Developments and Applications, vol. 6, pp. 83-107, 1989.
[16] P. Grosche and M. Müller, "Extracting predominant local pulse information from music recordings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1688-1701, 2011.
[17] T. Fujishima, "Realtime chord recognition of musical sound: a system using Common Lisp Music," in Proceedings of the International Computer Music Conference, 1999, pp. 464-467.
[18] M. Meinard and E. Sebastian, "Chroma toolbox: Matlab implementations for extracting variants of chroma-based audio features," in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), Miami, USA, October 2011.
