
Chapter 1 APPROACHES IN MACHINE LEARNING

Jan van Leeuwen


Institute of Information and Computing Sciences, Utrecht University, Padualaan 14, 3584 CH Utrecht, the Netherlands

Abstract: Machine learning deals with programs that learn from experience, i.e. programs that improve or adapt their performance on a certain task or group of tasks over time. In this tutorial, we outline some issues in machine learning that pertain to ambient and computational intelligence. As an example, we consider programs that are faced with the learning of tasks or concepts which are impossible to learn exactly in finitely bounded time. This leads to the study of programs that form hypotheses that are probably approximately correct (PAC-learning), with high probability. We also survey a number of meta-learning techniques such as bagging and adaptive boosting, which can improve the performance of machine learning algorithms substantially.

Keywords: Machine learning, computational intelligence, models of learning, concept learning, learning in the limit, PAC learning, VC-dimension, meta-learning, bagging, boosting, AdaBoost, ensemble learning.

1. Algorithms that Learn

Ambient intelligence requires systems that can learn and adapt, or otherwise interact intelligently with the environment in which they operate (situated intelligence). The behaviour of these systems must be achieved by means of intelligent algorithms, usually for tasks that involve some kind of learning. Here are some examples of typical learning tasks: select the preferred lighting of a room, classify objects, recognize specific patterns in (streams of) images, identify the words in handwritten text, understand a spoken language, control systems based on sensor data, predict risks in safety-critical systems, detect errors in a network, diagnose abnormal situations in a system, prescribe actions or repairs, and discover useful common information in distributed data.

Learning is a very broad subject, with a rich tradition in computer science and in many other disciplines, from control theory to psychology. In this tutorial we restrict ourselves to issues in machine learning, with an emphasis on aspects of algorithmic modelling and complexity. The goal of machine learning is to design programs that learn and/or discover, i.e. automatically improve their performance on certain tasks and/or adapt to changing circumstances over time. The result can be a learned program which can carry out the task it was designed for, or a learning program that will forever improve and adapt. In either case, machine learning poses challenging problems in terms of algorithmic approach, data representation, computational efficiency, and quality of the resulting program. Not surprisingly, the large variety of application domains and approaches has made machine learning into a broad field of theory and experimentation [Mitchell, 1997].

In this tutorial, some problems in designing learning algorithms are outlined. We will especially consider algorithms that learn (or: are trained) on-line, from examples or data that are provided one at a time. By a suitable feedback mechanism the algorithm can adjust its hypothesis or the model of reality it has so far, before a next example or data item is processed. The crucial question is how good programs can become, especially if they are faced with the learning of tasks or concepts which are impossible to learn exactly in finite or bounded time. To specify a learning problem, one needs a precise model that describes what is to be learned and how it is done, and what measures are to be used in analysing and comparing the performance of different solutions. In Section 2 we outline some elements of a model of learning that should always be specified for a learning task. In Section 3 we highlight some basic definitions of the theory of learning programs that form hypotheses that are probably approximately correct [Kearns and Vazirani, 1994; Valiant, 1984]. In Section 4 we mention some of the results of this theory (see also [Anthony, 1997]). In Section 5 we discuss meta-learning techniques, especially bagging and boosting. For further introductions we refer to the literature [Cristianini and Shawe-Taylor, 2000; Mendelson and Smola, 2003; Mitchell, 1997; Poole et al., 1998] and to electronic sources [COLT].

2. Models of Learning

Learning algorithms are normally designed around a particular paradigm for the learning process, i.e. the overall approach to learning. A computational learning model should be clear about the following aspects.

Learner: Who or what is doing the learning. In this tutorial: an algorithm or a computer program. Learning algorithms may be embedded in more general software systems, e.g. involving systems of agents, or may be embodied in physical objects like robots and ad-hoc networks of processors in intelligent environments.

Domain: What is being learned. In this tutorial: a function, or a concept. Among the many other possibilities are: the operation of a device, a tune, a game, a language, a preference, and so on. In the case of concepts, sets of concepts that are considered for learning are called concept classes.

Goal: Why the learning is done. The learning can be done to retrieve a set of rules from spurious data, to become a good simulator for some physical phenomenon, to take control over a system, and so on.

Representation: The way the objects to be learned are represented, c.q. the way they are to be represented by the computer program. The hypotheses which the program develops while learning may be represented in the same way, or in a broader (or: more restricted) format.

Algorithmic technology: The algorithmic framework to be used. Among the many different technologies are: artificial neural networks, belief networks, case-based reasoning, decision trees, grammars, liquid state machines, probabilistic networks, rule learning, support vector machines, and threshold functions. One may also specify the specific learning paradigm or discovery tools to be used. Each algorithmic technology has its own learning strategy and its own range of application. There are also multi-strategy approaches.

Information source: The information (training data) the program uses for learning. This could have different forms: positive and negative examples (called labeled examples), answers to queries, feedback from certain actions, and so on. Functions and concepts are typically revealed in the form of labeled instances taken from an instance space X. One often identifies a concept with the set of all its positive instances, i.e. with a subset of X. An information source may be noisy, i.e. the training data may have errors. Examples may be clustered before use in training a program.

Training scenario: The description of the learning process. In this tutorial, mostly on-line learning is discussed. In an on-line learning scenario, the program is given examples one by one, and it recalculates its hypothesis of what it learns after each example. Examples may be drawn from a random source, according to some known or unknown probability distribution. An on-line scenario can also be interactive, in which case new examples are supplied depending on the performance of the program on previous examples. In contrast, in an off-line learning scenario the program receives all examples at once. One often distinguishes between
- supervised learning: the scenario in which a program is fed examples and must predict the label of every next example before a teacher tells the answer.
- unsupervised learning: the scenario in which the program must determine certain regularities or properties of the instances it receives, e.g. from an unknown physical process, all by itself (without a teacher).
Training scenarios are typically finite. On the other hand, in inductive inference a program can be fed an unbounded amount of data. In reinforcement learning the inputs come from an unpredictable environment and positive or negative feedback is given at the end of every small sequence of learning steps, e.g. in the process of learning an optimal strategy.

Prior knowledge: What is known in advance about the domain, e.g. about specific properties (mathematical or otherwise) of the concepts to be learned. This might help to limit the class of hypotheses that the program needs to consider during the learning, and thus to limit its uncertainty about the unknown object it learns and to converge faster. The program may also use it to bias its choice of hypothesis.

Success criteria: The criteria for successful learning, i.e. for determining when the learning is completed or has otherwise converged sufficiently. Depending on the goal of the learning program, the program should be fit for its task. If the program is used e.g. in safety-critical environments, it must have reached sufficient accuracy in the training phase so that it can decide or predict reliably during operation. A success criterion can be measured by means of test sets or by theoretical analysis.


Performance: The amount of time, space and computational power needed in order to learn a certain task, and also the quality (accuracy) reached in the process. There is often a trade-off between the number of examples used to train a program, and thus the computational resources used, and the capabilities of the program afterwards.

Computational learning models may depend on many more criteria and on specific theories of the learning process.

2.1 Classification of Learning Algorithms

Learning algorithms are designed for many purposes and are implemented in web browsers, PCs, transaction systems, robots, cars, video servers, home environments, and so on. The specifications of the underlying models of learning vary greatly and are highly dependent on the application context. Accordingly, many classifications of learning algorithms exist, based on the underlying learning strategy, the type of algorithmic technology used, the ultimate algorithmic ability achieved, and/or the application domain.

2.2 Concept Learning

As an example of machine learning we consider concept learning. Given a (finite) instance space X, a concept c can be identified with a subset of X or, alternatively, with the Boolean function c(x) that maps instances x ∈ X to 1 if and only if x ∈ c and to 0 if and only if x ∉ c. Concept learning is concerned with retrieving the definition of a concept c of a given concept class C, from a sample of positive and negative examples. The information source supplies noise-free instances x and their labels c(x) ∈ {0, 1}, corresponding to a certain concept c. In the training process, the program maintains a hypothesis h = h(x) for c. The training scenario is an example of on-line, supervised learning.

Training scenario: The program is fed labelled instances (x, c(x)) one by one and tries to learn the unknown concept c that underlies them, i.e. the Boolean function c(x) which classifies the examples. In any step, when given a next instance x ∈ X, the program first predicts a label, namely the label h(x) based on its current hypothesis h. Then it is presented the true label c(x). If h(x) = c(x) then h is right and no changes are made. If h(x) ≠ c(x) then h is wrong: the program is said to have made a mistake. The program subsequently revises its hypothesis h, based on its knowledge of the examples so far.

The goal is to let h(x) become consistent with c(x) for all x, by a suitable choice of learning algorithm. Any correct h(x) for c is called a classifier for c.
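To make the training scenario concrete, here is a minimal sketch in Python of the predict-then-update loop with mistake counting. The learner interface (predict/update) and the toy threshold concept are our own illustrative choices, not part of the chapter.

class ThresholdLearner:
    # Toy consistent learner for concepts of the form c(x) = 1 iff x >= t (t unknown).
    def __init__(self):
        self.threshold = float("inf")          # initial hypothesis h: classify everything as 0
    def predict(self, x):
        return 1 if x >= self.threshold else 0
    def update(self, x, label):
        if label == 1 and x < self.threshold:  # revise h: smallest positive instance seen so far
            self.threshold = x

def train_online(learner, labelled_stream):
    # Feed labelled instances (x, c(x)) one by one: predict first, then receive the true label.
    mistakes = 0
    for x, label in labelled_stream:
        if learner.predict(x) != label:        # prediction differs from c(x): a mistake
            mistakes += 1
        learner.update(x, label)               # revise the hypothesis using the new example
    return mistakes

# The unknown concept is c(x) = 1 iff x >= 7; the stream is one possible training sequence.
stream = [(x, int(x >= 7)) for x in [3, 9, 7, 12, 5, 6, 8]]
print(train_online(ThresholdLearner(), stream))   # 2 mistakes on this particular sequence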


The number of mistakes an algorithm makes in order to learn a concept is an important measure that has to be minimized, regardless of other aspects of computational complexity.

Definition 1.1 Let C be a finite class of concepts. For any learning algorithm A and concept c ∈ C, let M_A(c) be the maximum number of mistakes A can make when learning c, over all possible training sequences for the concept. Let Opt(C) = min_A (max_{c∈C} M_A(c)), with the minimum taken over all learning algorithms for C that fit the given model.

Opt(C) is the optimum (smallest) mistake bound for learning C. The following lemma shows that Opt(C) is well-defined.

Lemma 1.2 (Littlestone, 1987) Opt(C) ≤ log_2(|C|).


Proof. Consider the following algorithm A. The algorithm keeps a list L of all possible concepts h ∈ C that are consistent with the examples that were input up until the present step. A starts with the list of all concepts in C. If a next instance x is supplied, A acts as follows:
1. Split L into sublists L1 = {d ∈ L | d(x) = 1} and L0 = {d ∈ L | d(x) = 0}. If |L1| ≥ |L0| then A predicts 1, otherwise it predicts 0.
2. If a mistake is made, A deletes from L every concept d which gives x the wrong label, i.e. with d(x) ≠ c(x).
The resulting algorithm is called the Halving or Majority algorithm. It is easily argued that the algorithm must have reduced L to the concept to be found after making at most log_2(|C|) mistakes. □
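The proof is constructive, and the Halving (Majority) algorithm is easy to express in code. The sketch below is a Python illustration of our own for explicitly listed, finite concept classes; it prunes every inconsistent concept after each example, while the mistake-bound argument above only needs pruning after mistakes, so this is a mild variant.

def halving(concept_class, labelled_stream):
    # Concepts are represented as sets of positive instances (only feasible for small classes).
    live = list(concept_class)                     # the list L of concepts consistent so far
    mistakes = 0
    for x, label in labelled_stream:
        ones  = [d for d in live if x in d]        # L1: concepts labelling x with 1
        zeros = [d for d in live if x not in d]    # L0: concepts labelling x with 0
        prediction = 1 if len(ones) >= len(zeros) else 0   # predict with the majority
        if prediction != label:
            mistakes += 1
        live = ones if label == 1 else zeros       # keep only concepts consistent with (x, label)
    return mistakes, live

# Example: X = {0,...,3}, C = all singleton concepts, target concept c = {2}.
X = range(4)
C = [frozenset({i}) for i in X]
stream = [(x, int(x == 2)) for x in X]
print(halving(C, stream))   # at most log2(|C|) = 2 mistakes; on this sequence the count is 0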

Definition 1.3 (Gold, 1967) An algorithm A is said to identify the concepts in C in the limit if for every c ∈ C and every allowable training sequence for this concept, there is a finite m such that A makes no more mistakes after the m-th step. The class C is then said to be learnable in the limit.

Corollary 1.4 Every (finite) class of concepts is learnable in the limit.

3. Probably Approximately Correct Learning

As a further illustration of the theory of machine learning, we consider the learning problem for concepts that are impossible to learn exactly in finite (bounded) time. In general, insufficient training leads to weak classifiers. Surprisingly, in many cases one can give bounds on the size of the training sets that are needed to reach a good approximation of the concept, with high probability. This theory of probably approximately correct (PAC) learning was originated by Valiant [Valiant, 1984] in 1984, and is now a standard theme in computational learning.


3.1 PAC Model

Consider any concept class C and its instance space X. Consider the general case of learning a concept c ∈ C. A PAC learning algorithm works by learning from instances which are randomly generated, upon the algorithm's request, by an external source according to a certain (unknown) distribution D, and which are labeled (+ or −) by an oracle (a teacher) that knows the concept c. The hypothesis h after m steps is a random variable depending on the sample of size m that the program happens to draw during a run. The performance of the algorithm is measured by the bound on m that is needed to have a high probability that h is close to c, regardless of the distribution D.

Definition 1.5 The error probability of h w.r.t. concept c is: Err_c(h) = Prob(c(x) ≠ h(x)), the probability that an instance x ∈ X (drawn according to D) is classified incorrectly by h.

Note that in the common case that always h ⊆ c, Err_c(h) = Prob(x ∈ c ∧ x ∉ h). If the measure of the set of instances on which h errs is small, then we call h ε-good.

Definition 1.6 A hypothesis h is said to be ε-good for c ∈ C if the probability of an x ∈ X with c(x) ≠ h(x) is at most ε: Err_c(h) ≤ ε.

Observe that different training runs, thus different samples, can lead to very different hypotheses. In other words, the hypothesis h is a random variable itself, ranging over all possible concepts in C that can result from samples of m instances.

3.2 When are Concept Classes PAC Learnable?

As a criterion for successful learning one would like to take: Err_c(h) ≤ ε for every h that may be found by the algorithm, for a predefined tolerance ε. A weaker criterion is taken, accounting for the fact that h is a random variable. Let Prob_S denote the probability of an event taken over all possible samples of m examples. The success criterion is that Prob_S(Err_c(h) ≤ ε) ≥ 1 − δ, for predefined and presumably small tolerances ε and δ. If the criterion is satisfied by the algorithm, then its hypothesis is said to be probably approximately correct, i.e. it is approximately correct with probability at least 1 − δ.

Definition 1.7 (PAC-learnable) A concept class C is said to be PAC-learnable if there is an algorithm A that follows the PAC learning model such that for every 0 < ε, δ < 1 there exists an m such that for every concept c ∈ C and for every hypothesis h computed by A after sampling m times: Prob_S(h is ε-good for c) ≥ 1 − δ, regardless of the distribution D over X.

As a performance measure we use the minimum sample size m needed to achieve success, for given tolerances ε, δ > 0.

Definition 1.8 (Efficiently PAC-learnable) A concept class C is said to be efficiently PAC-learnable if, in the previous definition, the learning algorithm A runs in time polynomial in 1/ε and 1/δ (and ln|C| if C is finite).

The notions that we defined can be further specialized, e.g. by adding constraints on the representation of h. The notion of efficiency may then also include a term depending on the size of the representation.

3.3 Common PAC Learning

Let C be a concept class and c ∈ C. Consider a learning algorithm A and observe the probable quality of the hypothesis h that A can compute, as a function of the sample size m. Assume that A only considers consistent hypotheses, i.e. hypotheses h that coincide with c on all examples that were generated, at any point in time. Clearly, as m increases, we narrow the possibilities for h more and more, and thus increase the likelihood that h is ε-good.

Definition 1.9 After some number of samples m, the algorithm A is said to be ε-close if for every (consistent) hypothesis h that is still possible at this stage: Err_c(h) ≤ ε.

Let the total number of possible hypotheses h that A can possibly consider be finite and bounded by H.

Lemma 1.10 Consider the algorithm A after it has sampled m times. Then for any 0 < ε < 1:

Prob_S(A is not ε-close) < H·e^(−εm).

Proof. After m random drawings, A fails to be ε-close if there is at least one possible consistent hypothesis h left with Err_c(h) > ε. Changing the perspective slightly, it follows that:

Prob_S(A is not ε-close)
= Prob_S(after m drawings there is a consistent h with Err_c(h) > ε)
≤ Σ_{h with Err_c(h) > ε} Prob_S(h is consistent)
= Σ_{h with Err_c(h) > ε} Prob_S(h correctly labels all m instances)
≤ Σ_{h with Err_c(h) > ε} (1 − ε)^m
≤ Σ_{h with Err_c(h) > ε} e^(−εm)
≤ H·e^(−εm),

where we use that (1 − t) ≤ e^(−t). □

Corollary 1.11 Consider the algorithm A after it has sampled m times, with h any hypothesis it can have built over the sample. Then for any 0 < ε < 1:

Prob_S(h is ε-good) ≥ 1 − H·e^(−εm).

4. Classes of PAC Learners

We can now interpret the observations so far. Let C be a finite concept class. As we only consider consistent learners, it is fair to assume that C also serves as the set of all possible hypotheses that a program can consider.

Definition 1.12 (Occam-algorithm) An Occam-algorithm is any on-line learning program A that follows the PAC-model such that (a) A only outputs hypotheses h that are consistent with the sample, and (b) the range of the possible hypotheses for A is C.

The following theorem basically says that Occam-algorithms are PAC-learning algorithms, at least for finite concept classes.

Theorem 1.13 Let C be finite and learnable by an Occam-algorithm A. Then C is PAC-learnable by A. In fact, a sample size M with

M > (1/ε)·(ln(1/δ) + ln|C|)

suffices to meet the success criterion, regardless of the underlying sampling distribution D.

Proof. Let C be learnable by A. The algorithm satisfies all the requirements we need. Thus we can use the previous Corollary to assert that after A has drawn m samples,

Prob_S(h is ε-good) ≥ 1 − H·e^(−εm) ≥ 1 − δ,

provided that m > (1/ε)·(ln(1/δ) + ln|C|). Thus C is PAC-learnable by A. □

The sample size for an Occam-learner can thus remain polynomially bounded in 1/ε, 1/δ, and ln|C|. It follows that, if the Occam-learner makes only polynomially many steps per iteration, then the theorem implies that C is even efficiently PAC-learnable. While for many concept classes one can show that they are PAC-learnable, it appears to be much harder sometimes to prove efficient PAC-learnability. The problem even hides in an unexpected part of the model, namely in the fact that it can be NP-hard to actually determine a hypothesis (in the desired representation) that is consistent with all examples from the sample set. Several other versions of PAC-learning exist, including versions in which one no longer insists that the probably approximate correctness holds under every distribution D.
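To get a feeling for the bound in Theorem 1.13, the small sketch below simply evaluates the required sample size for given tolerances; the function name and the numbers in the example are our own illustrative choices.

import math

def occam_sample_size(epsilon, delta, size_of_C):
    # Smallest integer m with m > (1/epsilon) * (ln(1/delta) + ln|C|), as in Theorem 1.13.
    bound = (1.0 / epsilon) * (math.log(1.0 / delta) + math.log(size_of_C))
    return math.floor(bound) + 1

# Example: a hypothetical finite class with |C| = 2**20 concepts, epsilon = 0.1, delta = 0.05.
print(occam_sample_size(0.1, 0.05, 2**20))   # 169 examples suffice for these tolerances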

4.1 Vapnik-Chervonenkis Dimension

Intuitively, the more complex a concept is, the harder it will be for a program to learn it. What could be a suitable notion of complexity to express this? Is there a suitable characteristic that marks the complexity of the concepts in a concept class C? A possible answer is found in the notion of Vapnik-Chervonenkis dimension, or simply VC-dimension.

Definition 1.14 A set of instances S ⊆ X is said to be shattered by concept class C if for every subset S′ ⊆ S there exists a concept c ∈ C which separates S′ from the rest of S, i.e. such that

c(x) = + if x ∈ S′, and c(x) = − if x ∈ S − S′.

Definition 1.15 (VC-dimension) The VC-dimension of a concept class C, denoted by VC(C), is the cardinality of the largest finite set S ⊆ X that is shattered by C. If arbitrarily large finite subsets of X can be shattered, then VC(C) = ∞.
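For small, explicitly listed concept classes the definition can be checked mechanically. The brute-force sketch below (exponential in the set sizes, so only for toy cases) and the interval example are our own illustrations, not taken from the chapter.

from itertools import chain, combinations

def shatters(concept_class, S):
    # C shatters S if every subset S' of S equals c ∩ S for some concept c in C.
    S = frozenset(S)
    all_subsets = chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))
    return all(any(c & S == frozenset(sub) for c in concept_class) for sub in all_subsets)

def vc_dimension(concept_class, X):
    # Largest d such that some d-element subset of X is shattered (brute force).
    d = 0
    for r in range(1, len(X) + 1):
        if any(shatters(concept_class, S) for S in combinations(X, r)):
            d = r
    return d

# Example: intervals [a,b] over the points {0,...,5}; any 2 points can be shattered, no 3 can.
X = list(range(6))
intervals = [frozenset(range(a, b + 1)) for a in X for b in X if a <= b]
print(vc_dimension(intervals, X))   # 2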
VC-dimension appears to be related to the complexity of learning. Here is a first connection. Recall that Opt(C) is the minimum number of mistakes that any program must make in the worst case, when it is learning C in the limit. VC-dimension plays a role in identifying hard cases: it is a lower bound for Opt(C).

Theorem 1.16 (Littlestone, 1987) For any concept class C: VC(C) ≤ Opt(C).

VC-dimension is difficult, even NP-hard, to compute, but has proved to be an important notion especially for PAC-learning. Recall that finite concept classes that are learnable by an Occam-algorithm are PAC-learnable. It turns out that this holds for infinite classes also, provided their VC-dimension is finite.

Theorem 1.17 (Vapnik, Blumer et al.) Let C be any concept class and let its VC-dimension be VC(C) = d < ∞. Let C be learnable by an Occam-algorithm A. Then C is PAC-learnable by A. In fact, a sample size M with

M > (κ/ε)·(ln(1/δ) + d·ln(1/ε))

suffices to meet the success criterion, regardless of the underlying sampling distribution D, for some fixed constant κ > 0.

VC-dimension can also be used to give a lower bound on the required sample size for PAC-learning a concept class.

Theorem 1.18 (Ehrenfeucht et al.) Let C be a concept class and let its VC-dimension be VC(C) = d < ∞. Then any PAC-learning algorithm for C requires a sample size of at least M = Ω((1/ε)·(log(1/δ) + d)) to meet the success criterion.
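For a feeling of how these sample sizes scale with the VC-dimension, the sketch below just evaluates the two expressions. The constant in Theorem 1.17 is not specified above, so kappa = 1 is used as a placeholder, and the lower bound is only reported up to its hidden constant factor.

import math

def vc_upper_bound(epsilon, delta, d, kappa=1.0):
    # Theorem 1.17: (kappa/epsilon) * (ln(1/delta) + d*ln(1/epsilon)); kappa is a placeholder.
    return (kappa / epsilon) * (math.log(1.0 / delta) + d * math.log(1.0 / epsilon))

def vc_lower_bound(epsilon, delta, d):
    # Order of the lower bound in Theorem 1.18 (constant factor omitted).
    return (1.0 / epsilon) * (math.log(1.0 / delta) + d)

# Example with d = 10, epsilon = 0.1, delta = 0.05.
print(round(vc_upper_bound(0.1, 0.05, 10)), round(vc_lower_bound(0.1, 0.05, 10)))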

5. Meta-Learning Techniques

Algorithms that learn concepts may perform poorly because, e.g., the available training (sample) set is small or better results require excessive running times. Meta-learning schemes attempt to turn weak learning algorithms into better ones. If one has several weak learners available, one could apply all of them and take the best classifier that can be obtained by combining their results. It might also be that only one (weak) learning algorithm is available. We discuss two meta-learning techniques: bagging and boosting.

5.1 Bagging

Bagging [Breiman, 1996] stands for bootstrap aggregating and is a typical example of an ensemble technique: several classifiers are computed and combined into one. Let X be the given instance (sample) space. Define a bootstrap sample to be any sample X̂ of some fixed size n obtained by sampling X uniformly at random with replacement, thus with duplicates allowed. Applications normally have n = |X|. Bagging now typically proceeds as follows, using X as the instance space.

For s = 1, . . . , b do:
- construct a bootstrap sample Xs,
- train the base learner on the sample space Xs,
- let the resulting hypothesis (concept) be hs(x) : X → {−1, +1}.

Output as aggregated classifier: hA(x) = the majority vote of the hs(x) for s = 1, . . . , b.

Bagging is of interest because bootstrap samples can avoid outlying cases in the training set. Note that an element x ∈ X has a probability of only 1 − (1 − 1/n)^n ≈ 1 − 1/e ≈ 63% of being chosen into a given Xs. Other bootstrapping techniques exist and, depending on the application domain, other forms of aggregation may be used. Bagging can be very effective, even for small values of b (up to 50).
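A minimal Python sketch of this scheme is given below. The base-learner interface, the toy constant classifier and the parameter names are our own assumptions; in practice one would plug in a genuine learning algorithm such as a decision-tree learner.

import random
from collections import Counter

def bag(train_base_learner, samples, b=25, rng=random):
    # Train b classifiers on bootstrap samples of `samples` and combine them by majority vote.
    # train_base_learner(sample) is assumed to return a hypothesis h(x) -> -1 or +1.
    n = len(samples)
    hypotheses = []
    for _ in range(b):
        bootstrap = [rng.choice(samples) for _ in range(n)]   # sample with replacement
        hypotheses.append(train_base_learner(bootstrap))
    def h_A(x):                                               # the aggregated classifier
        votes = Counter(h(x) for h in hypotheses)
        return votes.most_common(1)[0][0]
    return h_A

# Toy base learner: always predict the majority label of its own bootstrap sample.
def constant_learner(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

data = [(x, +1 if x >= 0 else -1) for x in range(-5, 6)]
classifier = bag(constant_learner, data, b=11, rng=random.Random(0))
print(classifier(3), classifier(-3))   # both equal the majority vote over the 11 constant classifiers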

5.2 Boosting Weak PAC Learners

A weak learning algorithm may be easy to design and quickly trained, but it may have a poor expected performance. Boosting refers to a class of techniques for turning such algorithms into arbitrarily more accurate ones. Boosting was first studied in the context of PAC learning [Schapire, 1990].

Suppose we have an algorithm A that learns concepts c ∈ C, and that has the property that for some ε < 1/2 the hypothesis h that is produced always satisfies Prob_S(h is ε-good for c) ≥ γ, for some small γ > 0. One can boost A as follows. Call A on the same instance space k times, with k such that (1 − γ)^k ≤ δ/2. Let hi denote the hypothesis generated by A during the i-th run. The probability that none of the hypotheses hi found is ε-good for c is at most δ/2. Consider h1, . . . , hk and test each of them on a sample of size m, with m chosen large enough so that the probability that the observed error on the sample is not within ε/2 of Err_c(hi) is at most δ/(2k), for each i. Now output the hypothesis h = hi that makes the smallest number of errors on its sample. Then the probability that h is not 2ε-good for c is at most δ/2 + k·δ/(2k) = δ. Thus, A is automatically boosted into a learner with a much better confidence bound. In general, one can even relax the condition on ε.
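The construction just described can be sketched as follows. The function names, the toy weak learner and the way labelled samples are drawn are our own illustrative assumptions; choosing k and m according to the bounds above is left to the caller.

import random

def boost_confidence(weak_learner, draw_labelled_sample, k, m):
    # Run the weak learner k times; keep the hypothesis with the fewest errors
    # on a fresh validation sample of size m, as in the construction above.
    candidates = [weak_learner() for _ in range(k)]
    def observed_errors(h):
        return sum(1 for x, label in draw_labelled_sample(m) if h(x) != label)
    return min(candidates, key=observed_errors)

# Toy illustration: the target concept is c(x) = 1 iff x >= 0.5 on [0,1]; the "weak learner"
# guesses a random threshold, so only occasionally does a run produce an epsilon-good hypothesis.
rng = random.Random(1)

def weak_learner():
    t = rng.random()
    return lambda x: 1 if x >= t else 0

def draw_labelled_sample(size):
    return [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(size))]

best = boost_confidence(weak_learner, draw_labelled_sample, k=20, m=200)
test = draw_labelled_sample(1000)
print(sum(1 for x, label in test if best(x) != label) / 1000)   # error of the selected hypothesis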

Definition 1.19 (Weak PAC-learnable) A concept class C is said to be weakly PAC-learnable if there is an algorithm A that follows the PAC learning model such that for some polynomials p, q and ε_0 = 1/2 − 1/p(n) there exists an m such that for every concept c ∈ C and for every hypothesis h computed by A after sampling m times:

Prob_S(h is ε_0-good for c) ≥ 1/q(n),

regardless of the distribution D over X.


Theorem 1.20 (Schapire) A concept class is (efficiently) weakly PAC-learnable if and only if it is (efficiently) PAC-learnable.
A different boosting technique for weak PAC learners was given by Freund [Freund, 1995] and also follows from the technique below.

5.3 Adaptive Boosting

If one assumes that the distribution D over the instance space is not fixed and that one can tune the sampling during the learning process, one might use training scenarios for the weak learner where a larger weight is given to examples on which the algorithm did poorly in a previous run. (Thus outliers are not circumvented, as opposed to bagging.) This has given rise to the adaptive boosting or AdaBoost algorithm, of which various forms exist (see e.g. [Freund and Schapire, 1997; Schapire and Singer, 1999]). One form is the following.

Let the sampling space be Y = {(x1, c1), . . . , (xn, cn)} with xi ∈ X and ci ∈ {−1, +1} (ci is the label of instance xi according to concept c). Let D_1(i) = 1/n (the uniform distribution).

For s = 1, . . . , T do:
- train the weak learner while sampling according to distribution D_s,
- let the resulting hypothesis (concept) be h_s,
- choose α_s (we will later see that α_s ≥ 0),
- update the distribution for sampling:

  D_{s+1}(i) = D_s(i)·e^(−α_s·ci·h_s(xi)) / Z_s,

  where Z_s is a normalization factor chosen so that D_{s+1} is a probability distribution on X.

Output as final classifier: h_B(x) = sign(Σ_{s=1}^T α_s·h_s(x)).

The AdaBoost algorithm contains weighting factors α_s that should be chosen appropriately as the algorithm proceeds. Once we know how to choose them, the values of Z_s = Σ_{i=1}^n D_s(i)·e^(−α_s·ci·h_s(xi)) follow inductively. A key property is the following bound on the error probability Err_uniform(h_B) of h_B(x).


Lemma 1.21 The error in the classifier resulting from the AdaBoost algorithm satisfies:

Err_uniform(h_B) ≤ Π_{s=1}^T Z_s.

Proof. By induction one sees that

D_{T+1}(i) = D_1(i)·e^(−ci·Σ_s α_s·h_s(xi)) / Π_s Z_s = e^(−ci·Σ_s α_s·h_s(xi)) / (n·Π_s Z_s),

which implies that

(1/n)·e^(−ci·Σ_s α_s·h_s(xi)) = (Π_{s=1}^T Z_s)·D_{T+1}(i).

Now consider the term Σ_s α_s·h_s(xi), whose sign determines the value of h_B(xi). If h_B(xi) ≠ ci, then ci·Σ_s α_s·h_s(xi) ≤ 0 and thus e^(−ci·Σ_s α_s·h_s(xi)) ≥ 1. This implies that

Err_uniform(h_B) = (1/n)·|{i | h_B(xi) ≠ ci}|
≤ (1/n)·Σ_i e^(−ci·Σ_s α_s·h_s(xi))
= Σ_i (Π_{s=1}^T Z_s)·D_{T+1}(i)
= Π_{s=1}^T Z_s. □

This result suggests that in every round the factors α_s must be chosen such that Z_s is minimized. Freund and Schapire [Freund and Schapire, 1997] analysed several possible choices. Let ε_s = Err_{D_s}(h_s) = Prob_{D_s}(h_s(x) ≠ c(x)) be the error probability of the s-th hypothesis. A good choice for α_s is

α_s = (1/2)·ln((1 − ε_s)/ε_s).

Assuming, as we may, that the weak learner at least guarantees that ε_s ≤ 1/2, we have α_s ≥ 0 for all s. Bounding the Z_s one can show:

Theorem 1.22 (Freund and Schapire) With the given choice of α_s, the error probability in the classifier resulting from the AdaBoost algorithm satisfies:

Err_uniform(h_B) ≤ e^(−2·Σ_s (1/2 − ε_s)²).
Let ε_s ≤ 1/2 − γ for all s and some γ > 0, meaning that the base learner is guaranteed to be at least slightly better than fully random. In this case it follows that Err_uniform(h_B) ≤ e^(−2γ²T), and thus AdaBoost gives a result whose error probability decreases exponentially with T, showing that it is indeed a boosting algorithm.

The AdaBoost algorithm has been studied from many different angles. For generalizations and further results see [Schapire, 2002]. In recent variants one attempts to reduce the algorithm's tendency to overfit [Kwek and Nguyen, 2002]. Breiman [Breiman, 1999] showed that AdaBoost is an instance of a larger class of adaptive reweighting and combining (arcing) algorithms and gives a game-theoretic argument to prove their convergence. Several other adaptive boosting techniques have been proposed, see e.g. Freund [Freund, 2001]. An extensive treatment of ensemble learning and boosting is given e.g. by [Meir and Ratsch, 2003].
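Putting the pieces of this subsection together, the following is a minimal sketch of the scheme. Instead of sampling according to D_s, the weak learner is handed the weights directly (a common practical variant of the description above); the decision-stump base learner, the variable names and the toy data are our own illustrative choices.

import math

def adaboost(X, y, T, base_learner):
    # X: instances, y: labels in {-1,+1}; base_learner(X, y, D) returns a hypothesis h(x) -> -1/+1.
    n = len(X)
    D = [1.0 / n] * n                                            # D_1: the uniform distribution
    hypotheses, alphas = [], []
    for _ in range(T):
        h = base_learner(X, y, D)
        eps = sum(D[i] for i in range(n) if h(X[i]) != y[i])     # weighted error eps_s
        eps = min(max(eps, 1e-12), 1 - 1e-12)                    # guard against division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)                  # the choice of alpha_s above
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)                                               # normalization factor Z_s
        D = [w / Z for w in D]
        hypotheses.append(h)
        alphas.append(alpha)
    def h_B(x):                                        # final classifier: sign of the weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
    return h_B

# Toy base learner: the best 1-dimensional decision stump (threshold plus orientation).
def stump_learner(X, y, D):
    best = None
    for t in sorted(set(X)):
        for sign in (+1, -1):
            err = sum(D[i] for i in range(len(X))
                      if (sign if X[i] >= t else -sign) != y[i])
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, sign=sign: sign if x >= t else -sign

# Example: an interval concept on {0,...,9} that no single stump can represent.
X = list(range(10))
y = [+1 if 3 <= x <= 6 else -1 for x in X]
h_B = adaboost(X, y, T=4, base_learner=stump_learner)
print(sum(1 for x, label in zip(X, y) if h_B(x) != label))   # 0: four boosted stumps fit the data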

6. Conclusion

In creating intelligent environments, many challenges arise. The supporting systems will be everywhere around us, always connected and always on, and they permanently interact with their environment, influencing it and being influenced by it. Ambient intelligence thus leads to the need for designing programs that learn and adapt, with a multi-medial scope. We presented a number of key approaches in machine learning for the design of effective learning algorithms. Algorithmic learning theory and discovery science are rapidly developing. These areas will contribute many invaluable techniques for the design of ambient intelligent systems.

References
M. Anthony. Probabilistic analysis of learning in artificial neural networks: the PAC model and its variants. In: Neural Computing Surveys, Vol. 1, 1997, pp. 1-47 (see also: http://www.icsi.berkeley.edu/~jagota/NCS).
A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (1989) 929-965.
L. Breiman. Bagging predictors. Machine Learning 24 (1996) 123-140.
L. Breiman. Prediction games and arcing algorithms. Neural Computation 11 (1999) 1493-1517.
COLT. Computational learning theory resources. Website at http://www.learningtheory.org.
N. Cristianini, J. Shawe-Taylor. Support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge (UK), 2000.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation 82 (1989) 247-261.
Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation 121 (1995) 256-285.
Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning 43 (2001) 293-318.


Y. Freund, R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1997) 119-139.
E.M. Gold. Language identification in the limit. Information and Control 10 (1967) 447-474.
M.J. Kearns and U.V. Vazirani. An introduction to computational learning theory. The MIT Press, Cambridge, MA, 1994.
S. Kwek, C. Nguyen. iBoost: boosting using an instance-based exponential weighting scheme. In: T. Elomaa, H. Mannila, and H. Toivonen (Eds.), Machine Learning: ECML 2002, Proc. 13th European Conference, Lecture Notes in Artificial Intelligence vol. 2430, Springer-Verlag, Berlin, 2002, pp. 245-257.
N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2 (1987) 285-318.
R. Meir and G. Ratsch. An introduction to boosting and leveraging. In: S. Mendelson and A.J. Smola (Eds.), ibid., pp. 118-183.
S. Mendelson, A.J. Smola (Eds.). Advanced lectures on machine learning. Lecture Notes in Artificial Intelligence vol. 2600, Springer-Verlag, Berlin, 2003.
T.M. Mitchell. Machine learning. WCB/McGraw-Hill, Boston, MA, 1997.
G. Paliouras, V. Karkaletsis, and C.D. Spyropoulos (Eds.). Machine learning and its applications, Advanced Lectures. Lecture Notes in Artificial Intelligence vol. 2049, Springer-Verlag, Berlin, 2001.
D. Poole, A. Mackworth, and R. Goebel. Computational intelligence - a logical approach. Oxford University Press, New York, 1998.
R.E. Schapire. The strength of weak learnability. Machine Learning 5 (1990) 197-227.
R.E. Schapire. The boosting approach to machine learning - An overview. In: MSRI Workshop on Nonlinear Estimation and Classification, 2002 (available at: http://www.research.att.com/~schapire/publist.html).
R.E. Schapire, Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 (1999) 297-336.
M. Skurichina, R.P.W. Duin. Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications 5 (2002) 121-135.
L.G. Valiant. A theory of the learnable. Comm. ACM 27 (1984) 1134-1142.
