Figure 1: Word lattice output of a speech recognition system for the spoken utterance “Hi, this is my number”.
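Concretely, a word lattice like the one in Figure 1 can be encoded as a weighted directed acyclic graph and its hypotheses enumerated by depth-first search. The sketch below is illustrative only: the states, words, and costs are hypothetical, not those of Figure 1. Costs act like negative log probabilities, so they add along a path and the best hypothesis is the one with the lowest total cost.

```python
# Toy word lattice: each state maps to a list of (word, cost, next_state)
# arcs. The states, words, and costs here are hypothetical; costs behave
# like negative log probabilities and therefore add along a path.

def all_paths(arcs, state, final):
    """Enumerate (word sequence, total cost) pairs for every path
    from `state` to `final`."""
    if state == final:
        yield [], 0.0
    for word, cost, nxt in arcs.get(state, []):
        for words, rest in all_paths(arcs, nxt, final):
            yield [word] + words, cost + rest

arcs = {
    0: [("hi", 1.2, 1), ("I", 2.5, 1)],
    1: [("this", 0.4, 2)],
    2: [("is", 0.3, 3)],
    3: [("my", 0.9, 4), ("Mike", 3.1, 4)],
    4: [("number", 0.6, 5)],
}

# Sort the competing transcriptions by total cost; the first is the
# recognizer's best hypothesis.
hypotheses = sorted(all_paths(arcs, 0, 5), key=lambda p: p[1])
best_words, best_cost = hypotheses[0]
print(" ".join(best_words), round(best_cost, 2))
```

A classifier operating on the lattice sees all four weighted hypotheses at once, rather than only the single best one.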
is often available. It is the result of careful human labeling of spoken utterances with a finite number of pre-determined categories of the type already mentioned.

But most classification algorithms were originally designed to classify fixed-size vectors. The objects to analyze for spoken-dialog classification are word lattices, each a collection of a large number of sentences with some weight or probability. How can standard classification algorithms such as support vector machines [Cortes and Vapnik, 1995] be extended to handle such objects?

This chapter presents a general framework and solution for this problem, which is based on kernel methods [Boser et al., 1992, Schölkopf and Smola, 2002]. Thus, we shall start with a brief introduction to kernel methods (Section 2). Section 3 will then present a kernel framework, rational kernels, that is appropriate for word lattices and other weighted automata. Efficient algorithms for the computation of these kernels will be described in Section 4. We also report the results of our experiments using these methods in several difficult large-vocabulary spoken-dialog classification tasks based on deployed systems in Section 5. Several theoretical results can guide the design of kernels for spoken-dialog classification; these results are discussed in Section 6.

2. INTRODUCTION TO KERNEL METHODS

Let us start with a very simple two-group classification problem, illustrated by Figure 2, where one wishes to distinguish two populations, the blue and red circles. In this very simple example, one can choose a hyperplane to correctly separate the two populations. But there are infinitely many choices for the selection of that hyperplane. There is good theory, though, supporting the choice of the hyperplane that maximizes the margin, that is, the distance between each population and the separating hyperplane. Indeed, let F denote the class of real-valued functions on the ball of radius R in R^N:

F = {x ↦ w · x : ‖w‖ ≤ 1, ‖x‖ ≤ R}. (1)

Then, it can be shown [Bartlett and Shawe-Taylor, 1999] that there is a constant c such that, for all distributions D over X, with probability at least 1 − δ, if a classifier sgn(f), with f ∈ F, has margin at least ρ over m independently generated training examples, then the generalization error of sgn(f), or error on any future example, is no more than

(c/m) ((R²/ρ²) log² m + log(1/δ)). (2)

This bound justifies large-margin classification algorithms such as support vector machines (SVMs). Let w · x + b = 0 be the equation of the hyperplane, where w ∈ R^N is a vector normal to the hyperplane and b ∈ R a scalar offset. The classifier sgn(h) corresponding to this hyperplane is unique and can be defined with respect to the training points x₁, ..., xₘ:

h(x) = w · x + b = Σᵢ₌₁ᵐ αᵢ (xᵢ · x) + b, (3)
Springer Handbook on Speech Processing and Speech Communication 3
Figure 2: Large-margin linear classification. (a) An arbitrary hyperplane can be chosen to separate the two
groups. (b) The maximal-margin hyperplane provides better theoretical guarantees.
where the αᵢ's are real-valued coefficients. The main point we are interested in here is that both for the construction of the hypothesis and the later use of that hypothesis for classification of new examples, one needs only to compute a number of dot products between examples.

In practice, non-linear separation of the training data is often not possible. Figure 3(a) shows an example where any hyperplane crosses both populations. However, one can use more complex functions to separate the two sets, as in Figure 3(b). One way to do that is to use a non-linear mapping Φ : X → F from the input space X to a higher-dimensional space F where linear separation is possible.

The dimension of F can truly be very large in practice. For example, in the case of document classification, one may use as features sequences of three consecutive words (trigrams). Thus, with a vocabulary of just 100,000 words, the dimension of the feature space F is 10^15. On the positive side, as indicated by the error bound of Equation 2, the generalization ability of large-margin classifiers such as SVMs does not depend on the dimension of the feature space but only on the margin ρ and the number of training examples m. However, taking a large number of dot products in a very high-dimensional space to define the hyperplane may be very costly.

A solution to this problem is to use the so-called 'kernel trick', or kernel methods. The idea is to define a function K : X × X → R, called a kernel, such that the kernel function on two examples x and y in input space, K(x, y), is equal to the dot product of the two examples Φ(x) and Φ(y) in feature space:

∀x, y ∈ X, K(x, y) = Φ(x) · Φ(y). (4)

K is often viewed as a similarity measure. A crucial advantage of K is efficiency: there is no need anymore to define and explicitly compute Φ(x), Φ(y), and Φ(x) · Φ(y). Another benefit of K is flexibility: K can be arbitrarily chosen so long as the existence of Φ is guaranteed, which is called Mercer's condition. This condition is important to guarantee the convergence of training for algorithms such as SVMs.¹

¹ Some standard Mercer kernels over a vector space are the polynomial kernels of degree d ∈ N, K_d(x, y) = (x · y + 1)^d, and Gaussian kernels K_σ(x, y) = exp(−‖x − y‖²/σ²), σ ∈ R₊.

A condition equivalent to Mercer's condition is that the kernel K be positive definite and symmetric, that is, in the discrete case, the matrix (K(xᵢ, xⱼ))₁≤i,j≤n must be symmetric and positive semi-definite for any choice of n points x₁, ..., xₙ in X. Said differently, the matrix must be symmetric and its eigenvalues non-negative. Thus, for the problem that we are interested in, the question is how to define positive definite symmetric kernels for word lattices or weighted automata.

3. RATIONAL KERNELS

This section introduces a family of kernels for weighted automata, rational kernels. We will start with some preliminary definitions of automata and
Figure 3: Non-linearly separable case. The classification task consists of discriminating between solid squares
and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used
instead.
transducers. For the most part, we will adopt the notation introduced in [Mohri et al., 2006] for automata and transducers.

Figure 4(a) shows a simple weighted automaton. It is a weighted directed graph in which edges, or transitions, are augmented with a label and carry some weight indicated after the slash symbol. A bold circle indicates an initial state and a double circle a final state. A final state may also carry a weight, indicated after the slash symbol following the state number. A path from an initial state to a final state is called a successful path. The weight of a path is obtained by multiplying the weights of its constituent transitions and the final weight. The weight associated by A to a string x, denoted by [[A]](x), is obtained by summing the weights of all paths labeled with x. In this case, the weight associated to the string abb is the sum of the weights of two paths. The weight of each path is obtained by multiplying transition weights and the final weight 0.1.

Similarly, Figure 4(b) shows an example of a weighted transducer. Weighted transducers are similar to weighted automata, but each transition is augmented with an output label in addition to the familiar input label and the weight. The output label of each transition is indicated after the colon separator. The weight associated to a pair of strings (x, y), denoted by [[T]](x, y), is obtained by summing the weights of all paths with input label x and output label y. Thus, for example, the weight associated by the transducer T to the pair (abb, baa) is obtained as in the automaton case by summing the weights of the two paths labeled with (abb, baa).

To help us gain some intuition for the definition of a family of kernels for weighted automata, let us first consider kernels for sequences. As mentioned earlier in Section 2, a kernel can be viewed as a similarity measure. In the case of sequences, we may say that two strings are similar if they share many common substrings or subsequences. The kernel could then be, for example, the sum of the products of the counts of these common substrings. But how can we generalize that to weighted automata?

Similarity measures such as the one just discussed can be computed by using weighted finite-state transducers. Thus, a natural idea is to use weighted transducers to define similarity measures for sequences. We will say that a kernel K is rational when there exists a weighted transducer T such that

K(x, y) = [[T]](x, y), (5)

for all sequences x and y. This definition generalizes naturally to the case where the objects to handle are weighted automata. We say that K is rational
Figure 4: Weighted automata and transducers. (a) Example of a weighted automaton A. [[A]](abb) = .1 ×
.2 × .3 × .1 + .5 × .3 × .6 × .1. (b) Example of a weighted transducer T . [[T ]](abb, baa) = [[A]](abb).
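The computation in the caption of Figure 4(a) can be checked directly: the weight of a string is the sum, over its accepting paths, of the product of the transition weights and the final weight. A minimal sketch, with the two paths for abb hard-coded from the caption since the figure's full transition table is not reproduced here:

```python
from math import prod

# Weight of a string in a weighted automaton over the probability
# semiring: sum over accepting paths of the product of the transition
# weights, each multiplied by the final weight.
def string_weight(paths, final_weight):
    return sum(prod(p) * final_weight for p in paths)

# The two accepting paths for "abb", as given in the caption of
# Figure 4(a); the final weight is 0.1.
paths_abb = [(0.1, 0.2, 0.3), (0.5, 0.3, 0.6)]
print(string_weight(paths_abb, 0.1))  # .1*.2*.3*.1 + .5*.3*.6*.1
```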
Figure 5: Weighted transducers computing the expected counts of (a) all sequences accepted by the regular expression X; (b) all bigrams; (c) all gappy bigrams with gap penalty λ, over the alphabet {a, b}.
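The expected counts computed by such transducers can be checked by brute force when the weighted language is finite. The sketch below uses a small hypothetical {string: weight} dictionary as a stand-in for the distribution defined by a weighted automaton; a real implementation would instead compose the automaton with a counting transducer like those of Figure 5.

```python
# Expected count of a substring x under a weighted language A:
# c_A(x) = sum over strings u of |u|_x * [[A]](u).
# A is represented here as an explicit finite dictionary, which is a
# hypothetical stand-in for a weighted automaton.

def occurrences(x, u):
    """Number of (possibly overlapping) occurrences of x in u."""
    return sum(1 for i in range(len(u) - len(x) + 1) if u[i:i + len(x)] == x)

def expected_count(A, x):
    return sum(occurrences(x, u) * w for u, w in A.items())

A = {"abb": 0.5, "bab": 0.3, "bb": 0.2}
print(expected_count(A, "b"))   # 2*0.5 + 2*0.3 + 2*0.2
print(expected_count(A, "ab"))  # 1*0.5 + 1*0.3 + 0*0.2
```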
In combination, this provides a general and efficient algorithm for computing rational kernels whose total complexity for acyclic word lattices is quadratic: O(|T||A||B|).³

³ Here |T| is a constant independent of the automata A and B for which K(A, B) needs to be computed.

The next question that arises is how to define and compute kernels based on counts of substrings or subsequences, as previously discussed in Section 3. For weighted automata, we need to generalize the notion of counts to take into account the weights of the paths in each automaton and use instead the expected counts of a sequence.

Let |u|ₓ denote the number of occurrences of a substring x in u. Since a weighted automaton A defines a distribution over the set of strings u, the expected count of a sequence x in a weighted automaton A can be defined naturally as

c_A(x) = Σ_{u∈Σ*} |u|ₓ [[A]](u), (8)

where the sum runs over Σ*, the set of all strings over the alphabet Σ. We mentioned earlier that weighted transducers can often be used to count the occurrences of some substrings of interest in a string. Let us assume that one could find a transducer T that could compute the expected counts of all substrings of interest appearing in a weighted automaton A. Then A ◦ T would provide the expected counts of these substrings in A. Similarly, T⁻¹ ◦ B would provide the expected counts of these substrings in B. Recall that T⁻¹ is the transducer obtained by swapping the input and output labels of each transition [Mohri et al., 2006]. Composition of A ◦ T and T⁻¹ ◦ B matches paths labeled with the same substring in A ◦ T and T⁻¹ ◦ B and multiplies their weights. Thus, by definition of composition (see [Mohri et al., 2006]), we could compute the sum of the expected counts in A and B of the matching substrings by computing

A ◦ T ◦ T⁻¹ ◦ B, (9)

and summing the weights (expected counts) of all paths using a general single-source shortest-distance algorithm.

Comparing this formula with Equation 7 shows that the count-based similarity measure we are interested in is the rational kernel whose corresponding weighted transducer S is:

S = T ◦ T⁻¹. (10)

Thus, if we can determine a T with this property, it would help us naturally define the similarity measure S. In Section 6 we will prove that the kernel corresponding to S is in fact positive definite symmetric, and can hence be used in combination with SVMs.

In the following, we will provide a weighted transducer T for computing the expected counts of substrings. It turns out that there exists a very simple transducer T that can be used for that purpose. Figure 5(a) shows a simple transducer that can be used to compute the expected counts of all sequences accepted by the regular expression X over the alphabet {a, b}. In the figure, the transition labeled with X:X/1 symbolizes a finite automaton representing X with identical input and output labels and all weights equal to one.

Here is how the transducer T counts the occurrences of a substring x recognized by X in a sequence u. State 0 reads a prefix u₁ of u and outputs the empty
string ε. Then, an occurrence of x is read and output identically. Then state 1 reads a suffix u₂ of u and outputs ε. In how many different ways can u be decomposed into u = u₁ x u₂? In exactly as many ways as there are occurrences of x in u. Thus, T applied to u generates exactly |u|ₓ successful paths labeled with x. This holds for all strings x accepted by X. If T is applied to (or composed with) a weighted automaton A instead, then for any x, it generates |u|ₓ paths for each path of A labeled with u. Furthermore, by definition of composition, each path generated is weighted with the weight of the path in A labeled with u. Thus, the sum of the weights of the generated paths that are labeled with x is exactly the expected count of x:

[[A ◦ T]](x) = c_A(x). (11)

Thus, there exists a simple weighted transducer T that can compute, as desired, the expected counts of all strings recognized by an arbitrary regular expression X in a weighted automaton A. In particular, since the set of bigrams over an alphabet Σ is a regular language, we can construct a weighted transducer T computing the expected counts of all bigrams. Figure 5(b) shows that transducer, which has only three states, regardless of the alphabet size.

In some applications, one may wish to allow for a gap between the occurrences of two symbols and view two weighted automata as similar if they share such substrings (gappy bigrams) with relatively large expected counts. The gap, or distance, between the two symbols is penalized using a fixed penalty factor λ, 0 ≤ λ < 1. A sequence kernel based on these ideas was used successfully by Lodhi et al. [2001] for text categorization. Interestingly, constructing a kernel based on gappy bigrams is straightforward in our framework. Figure 5(c) shows the transducer T counting expected counts of gappy bigrams, from which the kernel can be efficiently constructed. The same can be done similarly for higher-order gappy n-grams or other gappy substrings.

The methods presented in this section can be used to construct, efficiently and often in a simple manner, relatively complex weighted-automata kernels K based on expected counts or other ideas. A single general algorithm can then be used as described to compute K(A, B) efficiently for any two weighted automata A and B, without the need to design a new algorithm for each new kernel, as previously done in the literature. As an example, the gappy kernel used by Lodhi et al. [2001] is the rational kernel corresponding to the 6-state transducer S = T ◦ T⁻¹ of Figure 6.

5. EXPERIMENTS

This section describes the application of the kernel framework and techniques to spoken-dialog classification.

In most of our experiments, we used simple n-gram rational kernels. An n-gram kernel kₙ for two weighted automata or lattices A and B is defined by:

kₙ(A, B) = Σ_{|x|=n} c_A(x) c_B(x). (12)

As described in Section 4, kₙ is a kernel of the form T ◦ T⁻¹ and can be computed efficiently. An n-gram rational kernel Kₙ is simply the sum of the kernels kₘ with 1 ≤ m ≤ n:

Kₙ = Σ_{m=1}^{n} kₘ.

Thus, the feature space associated with Kₙ is the set of all m-gram sequences with m ≤ n. As discussed in the previous section, it is straightforward, using the same algorithms and representations, to extend these kernels to kernels with gaps and to many other more complex rational kernels more closely adapted to the applications considered.

We carried out a series of experiments in several large-vocabulary spoken-dialog tasks using rational kernels, with a twofold objective [Cortes et al., 2004]: to improve classification accuracy in those tasks, and to evaluate the impact on classification accuracy of the use of a word lattice rather than the one-best output of the automatic speech recognition (ASR) system.

The first task we considered was that of a deployed customer-care application (HMIHY 0300). In this task, users interact with a spoken-dialog system via the telephone, speaking naturally, to ask about their bills, their calling plans, or other similar topics. Their responses to the open-ended prompts of the system are not constrained by the system; they
Figure 6: Simple weighted transducer corresponding to the gappy bigram kernel used by Lodhi et al. [2001]
obtained by composition of the transducer of Figure 5(c) and its inverse.
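For finite weighted languages, the n-gram kernel kₙ of Equation 12 can likewise be checked by brute-force enumeration. The sketch below uses small hypothetical weighted languages; in practice, the transducer composition A ◦ T ◦ T⁻¹ ◦ B combined with a shortest-distance computation replaces the enumeration and handles full lattices compactly.

```python
# n-gram rational kernel k_n(A, B) = sum_{|x|=n} c_A(x) c_B(x)
# (Equation 12), computed by brute force over finite {string: weight}
# languages. The languages below are hypothetical toy examples.
from collections import Counter

def expected_ngram_counts(A, n):
    """Expected counts of all n-grams under a {string: weight} language."""
    c = Counter()
    for u, w in A.items():
        for i in range(len(u) - n + 1):
            c[u[i:i + n]] += w
    return c

def ngram_kernel(A, B, n):
    cA, cB = expected_ngram_counts(A, n), expected_ngram_counts(B, n)
    return sum(cA[x] * cB[x] for x in cA if x in cB)

A = {"abb": 0.6, "ba": 0.4}   # hypothetical lattice posteriors
B = {"ab": 1.0}
print(ngram_kernel(A, B, 2))  # only the bigram "ab" is shared: 0.6 * 1.0
```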
may be any natural-language sequence. The objective of spoken-dialog classification is to assign one or several categories, or call-types, e.g., Billing Credit or Calling Plans, to the users' spoken utterances. The set of categories is pre-determined, and in this first application there are 64 categories. The calls are classified based on the user's response to the first greeting prompt: "Hello, this is AT&T. How may I help you?".

Table 1 indicates the size of the HMIHY 0300 datasets we used for training and testing. The training set is relatively large, with more than 35,000 utterances; it is an extension of the one we used in our previous classification experiments with HMIHY 0300 [Cortes et al., 2003b]. In our experiments, we used the n-gram rational kernels just described with n = 3. Thus, the feature set we used was that of all n-grams with n ≤ 3. Table 1 indicates the total number of distinct features of this type found in the datasets. The word accuracy of the system based on the best hypothesis of the speech recognizer was 72.5%. This motivated our use of the word lattices, which contain the correct transcription in most cases. The average number of transitions of a word lattice in this task was about 260.

Table 1 reports similar information for two other datasets, VoiceTone1 and VoiceTone2. These are more recently deployed spoken-dialog systems in different areas; e.g., VoiceTone1 is a task where users interact with a system related to health care, with a larger set of categories (97). The size of the VoiceTone1 datasets we used and the word accuracy of the recognizer (70.5%) make this task otherwise similar to HMIHY 0300. The datasets provided for VoiceTone2 are significantly smaller, with a higher word error rate. The word error rate is indicative of the difficulty of the classification task, since a higher error rate implies a noisier input. The average number of transitions of a word lattice in VoiceTone1 was about 210 and in VoiceTone2 about 360.

Each utterance of the dataset may be labeled with several classes. The evaluation is based on the following criterion: it is considered an error if the highest-scoring class given by the classifier is none of these labels.

We used the AT&T FSM Library [Mohri et al., 2000] and the GRM Library [Allauzen et al., 2004] for the implementation of the n-gram rational kernels Kₙ used. We used these kernels with SVMs, using a general learning library for large-margin classification (LLAMA), which offers an optimized multi-class recombination of binary SVMs [Haffner et al., 2003]. Training took a few hours on a single processor of a 2.4 GHz Intel Pentium Linux cluster with 2 GB of memory and 512 KB cache.

In our experiments, we used the trigram kernel K₃ with a second-degree polynomial. Preliminary experiments showed that the top performance was reached for trigram kernels and that 4-gram kernels, K₄, did not significantly improve the performance. We also found that the combination of a second-degree polynomial kernel with the trigram kernel significantly improves performance over a linear classifier, but that no further improvement could be obtained with a third-degree polynomial.

We used the same kernels in the three datasets previously described and applied them both to the speech recognizer's single best hypothesis (one-best results) and to the full word lattices output by the speech recognizer. We also ran, for the sake of
Table 1: Key characteristics of the three datasets used in the experiments. The fifth column displays the total number of unigrams, bigrams, and trigrams found in the one-best output of the ASR for the utterances of the training set, that is, the number of features used by BoosTexter or by SVMs applied to the one-best outputs. The training and testing set sizes reported in columns 3 and 4 are given in numbers of utterances or, equivalently, numbers of word lattices.
comparison, the BoosTexter algorithm [Schapire and Singer, 2000] on the same datasets, applying it to the one-best hypothesis. This served as a baseline for our experiments.

Figure 7(a) shows the result of our experiments in the HMIHY 0300 task. It gives the classification error rate as a function of the rejection rate (utterances for which the top score is lower than a given threshold are rejected) in HMIHY 0300 for: BoosTexter, SVM combined with our kernels applied to the one-best hypothesis, and SVM combined with our kernels applied to the full lattices.

SVM with trigram kernels applied to the one-best hypothesis leads to better classification than BoosTexter everywhere in the range of 0-40% rejection rate. The accuracy is about 2-3% absolute better than that of BoosTexter in the range of interest for this task, which is roughly between 20% and 40% rejection rate. The results also show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of SVMs applied to the one-best alone, by about 1% absolute.

Figures 7(b)-(c) show the results of our experiments in the VoiceTone1 and VoiceTone2 tasks using the same techniques and comparisons. As observed previously, in many regards VoiceTone1 is similar to the HMIHY 0300 task, and our results for VoiceTone1 are comparable to those for HMIHY 0300. The results show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of BoosTexter, by more than 4% absolute at about 20% rejection rate. They also demonstrate more clearly the benefits of the use of the word lattices for classification in this task. This advantage is even more manifest for the VoiceTone2 task, for which the speech recognition accuracy is lower. VoiceTone2 is also a harder classification task, as can be seen from the comparison of the plots of Figure 7(b). The classification accuracy of SVMs with kernels applied to lattices is more than 6% absolute better than that of BoosTexter near 40% rejection rate, and about 3% better than SVMs applied to the one-best hypothesis.

Thus, our experiments in spoken-dialog classification in three distinct large-vocabulary tasks demonstrated that using rational kernels with SVMs consistently leads to very competitive classifiers. They also show that their application to the full word lattices, instead of the single best hypothesis output by the recognizer, systematically improves classification accuracy.

We further explored the use of kernels based on other moments of the counts of substrings in sequences, generalizing n-gram kernels [Cortes and Mohri, 2005]. Let m be a positive integer, and let c_A^m(x) denote the m-th moment of the count of the sequence x in A, defined by:

c_A^m(x) = Σ_{u∈Σ*} |u|ₓ^m [[A]](u). (13)

We can define a general family of kernels, denoted by K_n^m, n, m ≥ 1, and defined by:

K_n^m(A, B) = Σ_{|x|=n} c_A^m(x) c_B^m(x), (14)

which exploit the m-th moments of the counts of substrings x in the weighted automata A and B to define their similarity. Cortes and Mohri [2005] showed that
[Three plots, one per task, each showing the classification error rate (vertical axis) against the rejection rate (horizontal axis, 0-40%) for BoosTexter/1-best, SVM/1-best, and SVM/Lattice.]
Figure 7: Classification error rate as a function of rejection rate in (a) HMIHY 0300, (b) VoiceTone1, and (c)
VoiceTone2.
there exist weighted transducers T_n^m that can be used to compute c_A^m(x) efficiently for all n-gram sequences x and any weighted automaton A. Thus, these kernels are rational kernels, and their associated transducers are T_n^m ◦ (T_n^m)⁻¹:

K_n^m(A, B) = Σ_{x,y} [[A ◦ [T_n^m ◦ (T_n^m)⁻¹] ◦ B]](x, y). (15)

Figure 8 shows the weighted transducer T_1^m for aperiodic strings x, which has only m|x| + 1 states.⁴ By Equation 15, the transducer T_n^m can be used to compute K_n^m(A, B) by first computing the composed transducer A ◦ [T_n^m ◦ (T_n^m)⁻¹] ◦ B and then summing the weights of all the paths of this transducer using a shortest-distance algorithm [Mohri, 2002].

The application of moment kernels to the HMIHY 0300 task resulted in a further improvement of the classification accuracy. In particular, at 15% rejection rate, the error rate was reduced by 1% absolute, that is, about 6.2% relative, which is significant in this task, by using variance kernels, that is, moment kernels of second order (m = 2).

Figure 8: Weighted transducer T_1^m for computing the m-th moment of the count of an aperiodic substring x. The final weight at state k, k = 1, ..., m, indicated after "/", is S(m, k), the Stirling number of the second kind, that is, the number of ways of partitioning a set of m elements into k nonempty subsets: S(m, k) = (1/k!) Σ_{i=0}^{k} C(k, i) (−1)^i (k − i)^m. The first-order transducer T_1^1 coincides with the transducer of Figure 5(a), since S(1, 1) = 1.

6. THEORETICAL RESULTS FOR RATIONAL KERNELS

The previous sections showed how rational kernels can be applied to spoken-dialog tasks. Several questions arise in relation with these kernels. As pointed out earlier, to guarantee the convergence of algorithms such as SVMs, the kernels used must be positive definite symmetric (PDS). But how can we construct PDS rational kernels? Are n-gram kernels and similar kernels PDS? Can we combine simpler PDS rational kernels to create more complex ones? Is there a characterization of PDS rational kernels?

All these questions have been investigated by Cortes et al. [2003a]. The following theorem provides a general method for constructing PDS rational kernels.

Theorem 1 Let T be a weighted finite-state transducer over (+, ×). Assume that the weighted transducer T ◦ T⁻¹ is regulated.⁵ Then S = T ◦ T⁻¹ defines a PDS rational kernel.

Proof. We give a sketch of the proof; a full proof is given in [Cortes et al., 2003a]. Let K be the kernel associated to S = T ◦ T⁻¹. By definition of T⁻¹ and composition,

∀x, y ∈ X, K(x, y) = Σ_z [[T]](x, z) [[T]](y, z). (16)
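The factorization in Equation 16 is exactly what makes S positive definite symmetric: writing M[i][k] = [[T]](xᵢ, z_k), the Gram matrix on any finite sample is M Mᵀ, which is symmetric and satisfies cᵀ K c = ‖Mᵀc‖² ≥ 0 for every coefficient vector c. A small numeric check, with arbitrary illustrative values for the entries of M:

```python
# Gram matrix from Equation 16: K[i][j] = sum_z M[i][z] * M[j][z],
# i.e., K = M M^T. The entries of M below are arbitrary illustrative
# stand-ins for [[T]](x_i, z_k).

def gram_matrix(M):
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in M]
            for row_i in M]

def quadratic_form(K, c):
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

M = [[0.2, 0.0, 1.5],
     [1.0, 0.3, 0.0],
     [0.5, 0.5, 0.5]]
K = gram_matrix(M)

# K is symmetric, and every quadratic form c^T K c = ||M^T c||^2 >= 0.
assert all(K[i][j] == K[j][i] for i in range(3) for j in range(3))
for c in [(1, -2, 3), (-1.5, 0.25, 0.75), (0, 1, -1)]:
    assert quadratic_form(K, c) >= -1e-12  # non-negative up to rounding
print("K = M M^T is symmetric positive semi-definite")
```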
kernels or emphasize the importance of others by increasing their respective weight in the weighted transducer.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. A General Weighted Grammar Library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2004), Kingston, Ontario, Canada, July 2004.

Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43–54. MIT Press, Cambridge, MA, USA, 1999.

Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, Berlin-New York, 1984.

Bernhard E. Boser, Isabelle Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144–152, Pittsburgh, 1992. ACM.

Corinna Cortes and Mehryar Mohri. Moment Kernels for Regular Distributions. Machine Learning, 60(1-3):117–134, September 2005.

Corinna Cortes and Vladimir N. Vapnik. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Positive Definite Rational Kernels. In Proceedings of the 16th Annual Conference on Computational Learning Theory (COLT 2003), volume 2777 of Lecture Notes in Computer Science, pages 41–56, Washington, D.C., August 2003a. Springer, Heidelberg, Germany.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels. In Advances in Neural Information Processing Systems (NIPS 2002), volume 15, Vancouver, Canada, March 2003b. MIT Press.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5:1035–1062, 2004.

Patrick Haffner, Gokhan Tur, and Jeremy Wright. Optimizing SVMs for complex Call Classification. In Proceedings of ICASSP'03, 2003.

Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS 2000, pages 563–569. MIT Press, 2001.

Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17–32, January 2000.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech Recognition with Weighted Finite-State Transducers. Chapter in Speech Handbook, to appear. Springer-Verlag, 2006.

Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.