# Machine Learning and Data Mining

### Ikhtisar

Data mining is often referred to by real-time users and software solutions providers as knowledge discovery in databases (KDD). Good data mining practice for business intelligence (the art of turning raw software into meaningful information) is demonstrated by the many new techniques and developments in the conversion of fresh scientific discovery into widely accessible software solutions. This book has been written as an introduction to the main issues associated with the basics of machine learning and the algorithms used in data mining.

Suitable for advanced undergraduates and their tutors at postgraduate level in a wide area of computer science and technology topics as well as researchers looking to adapt various algorithms for particular data mining tasks. A valuable addition to the libraries and bookshelves of the many companies who are using the principles of data mining (or KDD) to effectively deliver solid business and industry solutions.
Provides an introduction to the main issues associated with the basics of machine learning and the algorithms used in data mining
A valuable addition to the libraries and bookshelves of companies using the principles of data mining (or KDD) to effectively deliver solid business and industry solutions

## Pratinjau Buku

### Machine Learning and Data Mining - Igor Kononenko

**Machine Learning and Data Mining **

**Introduction to Principles and Algorithms **

First Edition

Igor Kononenko

*Professor of Computer and Information Science, University of Ljubljana, Slovenia *

Matjaž Kukar

*Assistant Professor of Computer and Information Science, University of Ljubljana, Slovenia *

Horwood Publishing

Chichester, UK

**Table of Contents **

**Cover image **

**Title page **

**Copyright page **

**Foreword **

**Preface **

**Acknowledgements **

**Chapter 1: Introduction **

**1.1 THE NAME OF THE GAME **

**1.2 OVERVIEW OF MACHINE LEARNING METHODS **

**1.3 HISTORY OF MACHINE LEARNING **

**1.4 SOME EARLY SUCCESSES **

**1.5 APPLICATIONS OF MACHINE LEARNING **

**1.6 DATA MINING TOOLS AND STANDARDS **

**1.7 SUMMARY AND FURTHER READING **

**Chapter 2: Learning and Intelligence **

**2.1 WHAT IS LEARNING **

**2.2 NATURAL LEARNING **

**2.3 LEARNING, INTELLIGENCE, CONSCIOUSNESS **

**2.4 WHY MACHINE LEARNING **

**2.5 SUMMARY AND FURTHER READING **

**Chapter 3: Machine Learning Basics **

**3.1 BASIC PRINCIPLES **

**3.2 MEASURES FOR PERFORMANCE EVALUATION **

**3.3 ESTIMATING PERFORMANCE **

**3.4 *COMPARING PERFORMANCE OF MACHINE LEARNING ALGORITHMS **

**3.5 COMBINING SEVERAL MACHINE LEARNING ALGORITHMS **

**3.6 SUMMARY AND FURTHER READING **

**Chapter 4: Knowledge Representation **

**4.1 PROPOSITIONAL CALCULUS **

**4.2 *FIRST ORDER PREDICATE CALCULUS **

**4.3 DISCRIMINANT AND REGRESSION FUNCTIONS **

**4.4 PROBABILITY DISTRIBUTIONS **

**4.5 SUMMARY AND FURTHER READING **

**Chapter 5: Learning as Search **

**5.1 EXHAUSTIVE SEARCH **

**5.2 BOUNDED EXHAUSTIVE SEARCH (BRANCH AND BOUND) **

**5.3 BEST-FIRST SEARCH **

**5.4 GREEDY SEARCH **

**5.5 BEAM SEARCH **

**5.6 LOCAL OPTIMIZATION **

**5.7 GRADIENT SEARCH **

**5.8 SIMULATED ANNEALING **

**5.9 GENETIC ALGORITHMS **

**5.10 SUMMARY AND FURTHER READING **

**Chapter 6: Measures for Evaluating the Quality of Attributes **

**6.1 MEASURES FOR CLASSIFICATION AND RELATIONAL PROBLEMS **

**6.2 MEASURES FOR REGRESSION **

**6.3 **FORMAL DERIVATIONS AND PROOFS **

**6.4 SUMMARY AND FURTHER READING **

**Chapter 7: Data Preprocessing **

**7.1 REPRESENTATION OF COMPLEX STRUCTURES **

**7.2 DISCRETIZATION OF CONTINUOUS ATTRIBUTES **

**7.3 ATTRIBUTE BINARIZATION **

**7.4 TRANSFORMING DISCRETE ATTRIBUTES INTO CONTINUOUS **

**7.5 DEALING WITH MISSING VALUES **

**7.6 VISUALIZATION **

**7.7 DIMENSIONALITY REDUCTION **

**7.8 **FORMAL DERIVATIONS AND PROOFS **

**7.9 SUMMARY AND FURTHER READING **

**Chapter 8: *Constructive Induction **

**8.1 DEPENDENCE OF ATTRIBUTES **

**8.2 CONSTRUCTIVE INDUCTION WITH PRE-DEFINED OPERATORS **

**8.3 CONSTRUCTIVE INDUCTION WITHOUT PRE-DEFINED OPERATORS **

**8.4 SUMMARY AND FURTHER READING **

**Chapter 9: Symbolic Learning **

**9.1 LEARNING OF DECISION TREES **

**9.2 LEARNING OF DECISION RULES **

**9.3 LEARNING OF ASSOCIATION RULES **

**9.4 LEARNING OF REGRESSION TREES **

**9.5 *INDUCTIVE LOGIC PROGRAMMING **

**9.6 NAIVE AND SEMI-NAIVE BAYESIAN CLASSIFIER **

**9.7 BAYESIAN BELIEF NETWORKS **

**9.8 SUMMARY AND FURTHER READING **

**Chapter 10: Statistical Learning **

**10.1 NEAREST NEIGHBORS **

**10.2 DISCRIMINANT ANALYSIS **

**10.3 LINEAR REGRESSION **

**10.4 LOGISTIC REGRESSION **

**10.5 *SUPPORT VECTOR MACHINES **

**10.6 SUMMARY AND FURTHER READING **

**Chapter 11: Artificial Neural Networks **

**11.1 INTRODUCTION **

**11.2 TYPES OF ARTIFICIAL NEURAL NETWORKS **

**11.3 *HOPFIELD’S NEURAL NETWORK **

**11.4 *BAYESIAN NEURAL NETWORK **

**11.5 PERCEPTRON **

**11.6 RADIAL BASIS FUNCTION NETWORKS **

**11.7 **FORMAL DERIVATIONS AND PROOFS **

**11.8 SUMMARY AND FURTHER READING **

**Chapter 12: Cluster Analysis **

**12.1 INTRODUCTION **

**12.2 MEASURES OF DISSIMILARITY **

**12.3 HIERARCHICAL CLUSTERING **

**12.4 PARTITIONAL CLUSTERING **

**12.5 MODEL-BASED CLUSTERING **

**12.6 OTHER CLUSTERING METHODS **

**12.7 SUMMARY AND FURTHER READING **

**Chapter 13: **Learning Theory **

**13.1 COMPUTABILITY THEORY AND RECURSIVE FUNCTIONS **

**13.2 FORMAL LEARNING THEORY **

**13.3 PROPERTIES OF LEARNING FUNCTIONS **

**13.4 PROPERTIES OF INPUT DATA **

**13.5 CONVERGENCE CRITERIA **

**13.6 IMPLICATIONS FOR MACHINE LEARNING **

**13.7 SUMMARY AND FURTHER READING **

**Chapter 14: **Computational Learning Theory **

**14.1 INTRODUCTION **

**14.2 GENERAL FRAMEWORK FOR CONCEPT LEARNING **

**14.3 PAC LEARNING MODEL **

**14.4 VAPNIK-CHERVONENKIS DIMENSION **

**14.5 LEARNING IN THE PRESENCE OF NOISE **

**14.6 EXACT AND MISTAKE BOUNDED LEARNING MODELS **

**14.7 INHERENT UNPREDICTABILITY AND PAC-REDUCTIONS **

**14.8 WEAK AND STRONG LEARNING **

**14.9 SUMMARY AND FURTHER READING **

**Appendix A: *Definitions of some lesser known terms **

**A.1 COMPUTATIONAL COMPLEXITY CLASSES **

**A.2 ASYMPTOTIC NOTATION **

**A.3 SOME BOUNDS FOR PROBABILISTIC ANALYSIS **

**A.4 COVARIANCE MATRIX **

**References **

**Index **

**Copyright **

**HORWOOD PUBLISHING LIMITED **

International Publishers in Science and Technology

Coli House, Westergate, Chichester, West Sussex, PO20 3QL, UK

First published in 2007

**COPYRIGHT NOTICE **

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the permission of Horwood Publishing, Coll House, Westergate, Chichester, West Sussex, PO20 3QL, UK.

© Horwood Publishing Limited, 2007.

**British Library Cataloguing in Publication Data **

A catalogue record of this book is available from the British Library

**ISBN − 10:1-904275-21-4 **

**ISBN − 13: 978-1-904275-21-3 **

Cover design by Jim Wilkie.

Printed and bound in the UK by Antony Rowe Limited.

**Foreword **

**Ivan Bratko **

In the past two decades, Machine Learning (ML) has made, among the many areas of Artificial Intelligence (AI), probably the most significant progress, and has grown into a field of its own. In this period, many of its methods found their way into practical applications, and in many application areas are now routinely used. The wide spread and visibility of ML was accelerated by powerful new techniques, and also by the appearance of effective and user-friendly implementations of ML tools, including some excellent freely available platforms, such as Weka and Orange. All this has resulted in ML, together with its related areas of Data Mining, Knowledge Discovery in Databases, and Intelligent Data Analysis, becoming an effective tool for many other disciplines as diverse as engineering, medicine, biology, ecology, biochemistry and finance. In science, it is becoming part of general scientific methodology and is now used by scientists in interpreting experimental data and in formulating new scientific theories. ML is beginning to be taught outside Computer Science and is on the way to attaining a similar status as some other subjects of general importance, in the same way as it is, for example, useful to students of medicine and biology to learn the basics of statistics.

This book is written by two scientists who have made their own original contributions to the methods and applications of ML, one of the co-authors (I. K.) having been involved since its early stages. In my view, the main strength of this book is its breadth of coverage and wealth of material presented, without loss of depth. In this respect, the book is rather unique among the books on machine learning. The book covers a large majority of key methods, concepts and issues. The style of writing is terse and directly to the point.

In comparison with other books on ML, the following chapters seem to be particularly strong. **Chapter 3 systematically introduces a repertoire of basic elements that are universally used as components of ML methodology. Chapter 6 gives an outstanding review of measures for attribute selection. Chapter 8 introduces constructive induction, a topic that is rarely covered in books. In Chapters 13 and 14 on the theory of ML, the book addresses deep theoretical questions of learnability. **

This book will be useful reading both for beginners to the field, and even more so to people who have already done some work in ML, but would like to broaden their knowledge and are looking for reference to learn about yet unfamiliar topics.

**Preface **

**Igor Kononenko; Matjaž Kukar, ***Ljubljana, December 2006 *

*In learning, each day something is gained. In following the Tao, each day something is lost, *

— *Lao Tse *

The book describes the basics of machine learning principles and algorithms that are used in data mining and is intended for students of computer science, researchers who need to adapt various algorithms for particular data mining tasks, and for education of sophisticated users of machine learning and data mining tools.

Machine learning has exhibited tremendous progress in the last two decades. Numerous systems for machine learning are nowadays available and the number of their applications in various fields, such as industry, medicine, economics, ecology, finance, and many others, is rapidly growing. Machine learning is used for data analysis and knowledge discovery in databases, data mining, automatic generation of knowledge bases for expert systems, game playing, text classification and text mining, automatic recognition of speech, handwriting, images, etc.

The basic principle of machine learning is modeling of available data. Results of various machine learning systems may be rules, functions, relations, equation systems, probability distributions and other knowledge representations. The obtained knowledge

aims to explain the data and can be used for forecasting, diagnostics, control, validation, and simulations.

Several excellent books have already been written on machine learning and data mining, most notably (**Mitchell, 1997; Witten and Frank, 2000, 2005; Hand et al., 2001; Alpaydin, 2005). The purpose of our book is to provide a thorough and yet a comprehensible general overview of the field. Our approach differs from other textbooks by providing several different points of view to machine learning and data mining, as reflected from the structure of the book: **

• Besides defining the basic notions and overviewing the machine learning methods, **Chapter 1 provides the historical overview of the field and an overview of the state-of-the-art data mining tools as well as established and emerging standards. **

• **Chapter 2 is the least formal and introduces the psychological and philosophical issues related to machine and human learning, intelligence, and consciousness. **

• Basic principles of machine learning, described in **Chapter 3, are the central issue and can be considered as the core of the book. **

• The next three chapters divide the machine learning process into three basic components: the representation of knowledge (**Chapter 4) defines the hypothesis space, Chapter 5 describes basic search algorithms that can be used to search the hypothesis space, and the attribute quality measures (Chapter 6) are used to guide the search. **

• **Chapters 7 and 8 add to the whole story formal and practical guidelines for preparing, cleansing and transforming the input data in order to get the most out of it with the available machine learning algorithms. Constructive induction, described in Chapter 8, has great potentials, however, in the past not many researchers or implementations considered it worthwhile. We believe that in the future, when more research will be devoted to this area, the advantages of constructive induction will be revealed and implemented in useful practical tools. **

• The following four chapters use the principles and notions, described in preceding chapters, as building blocks in order to describe particular learning algorithms. Four chapters divide the field into four main approaches. Symbolic learning (**Chapter 9) includes all approaches that result in symbolic and transparent knowledge representations. Statistical learning (Chapter 10), on the other hand, results in knowledge representations that are typically hard to understand and interpret by the end user. Artificial neural networks (Chapter 11) are inspired by biological neural networks. With respect to transparency of knowledge representation, artificial neural networks are much closer to statistical learning than to symbolic learning. Cluster analysis deserves a separate chapter (Chapter 12) as the task of clustering algorithms differs a lot from other machine learning tasks. **

• The last two chapters introduce formal approaches to machine learning. **Chapter 13 describes a mathematically and philosophically inspired problem of identification in the limit. The results of this formal theory provide the ultimate answers to philosophical questions which, however, are of modest practical use. On the other hand, the computational learning theory, described in Chapter 14, aims to provide theoretical answers to questions of more practical value. **

In order to improve readability, we provide all the references in separate sections on further reading, that also contain chapter summaries. Obvious exceptions that naturally include references are, besides this preface, also historical **in the text indicates that the proof of the assertion is omitted from the text and can be found at the end of the chapter in the corresponding section on formal derivations and proofs. Obvious exceptions are the last two chapters, where proofs can be found in the literature indicated in further reading. Those sections and chapters are intended for advanced readers. In the table of contents, advanced topics are indicated with * (moderately advanced) or ** (advanced). Appendix A defines some less known formal notions, which are occasionally used in the text. **

Both authors have experience in teaching various parts of the material presented in this book, to undergraduate and postgraduate students. For the undergraduate course we suggest excluding sections and chapters marked with *: **Section 3.4 (comparing performance of machine learning algorithms), sections related to inductive logic programming (Sections 4.2 and 9.5), Chapter 8 (constructive induction), Section 10.5 (support vector machines), Sections 11.3 and 11.4 (Hopfield’s and Bayesian neural networks), as well as all sections and chapters marked with **. Optionally, Sections 9.3 (association rules) and 9.7 (Bayesian belief networks), and/or Chapter 12 (cluster analysis) can also be kept for the postgraduate course. **

The book web page can be found at the address: mldmbook.fri.uni-lj.si, where all information and post-press corrections will be available. The readers’ comments, suggestions and indications about errors in the text are welcome, and should be sent to either author’s e-mail address; igor.kononenko@f ri.uni-lj.si or matjaz.kukar@fri.uni-1j.si.

**Acknowledgements **

The book is the result of numerous years of active research in the field of machine learning. We are grateful to Professor Ivan Bratko for his pioneering efforts in establishing an outstanding machine learning research group in Ljubljana. We are grateful to the Faculty of Computer and Information Science for support, both during research and teaching. For long term collaboration in machine learning research we are particularly grateful to Matjaž Bevk, Zoran Bosnić, Ivan Bratko, Bojan Cestnik, Janez Demšar, Aleks Jakulin, Matevž Kovačič, Nada Lavrač, Uroš Pompe, Marko Robnik Šikonja, Luka Šajn, and Blaž Zupan. We thank Luka Šajn and Alan McConnell-Duff for proofreading the text, Jožica Robnik Šikonja for proofreading **Chapter 2, Ivan Bratko and the anonymous reviewer for their comments and suggestions, and Bogdan Filipič, Marko Robnik Šikonja, and Blaž Zupan for their comments on earlier drafts of the book. Useful comments on various parts of the book were provided by several people. In particular, we thank Janez Demšar for his comments on Chapter 3, Aleks Jakulin for his comments on Chapter 8, Aristidis Likas and Eduardo Hruschka for their comments on Chapter 12, and France Dacar, Matevž Kovačič and Boštjan Vilfan for their comments on Chapter 13. The author of the spider net graphics from the book cover is a young artist and Igor’s daughter Nina Kononenko. We thank also Francesca Bonner and Ellis Horwood from Horwood Publishing for their kind assistance and patience. Finally, we are grateful to our wives and children for their loving support and patience and to whom with all our love we dedicate this book. **

**Chapter 1 **

**Introduction **

*The time will come when we will have to forget everything we have learnt. *

— *Ramana Maharshi *

Machine learning, a subfield of artificial intelligence, has shown tremendous improvements in the last 20 years. This is reflected in numerous commercial and open source systems for machine learning and their applications in industry, medicine, economics, natural and technical sciences, ecology, finance, and many others. Machine learning is used for data analysis and knowledge discovery in databases (data mining), automatic generation of knowledge bases for expert systems, learning to plan, game playing, construction of numerical and qualitative models, text classification and text mining (e.g., on the world wide web), for automatic knowledge acquisition to control dynamic processes, automatic recognition of speech, handwriting, and images, etc.

The basic principle of machine learning is the automatic modeling of underlying processes that have generated the collected data. Learning from data results in rules, functions, relations, equation systems, probability distributions and other knowledge representations such as decision rules, decision and regression trees, Bayesian nets, neural nets, etc. Models explain the data and can be used for supporting decisions concerning the same underlying process (e.g., forecasting, diagnostics, control, validation, and simulations).

We start with an overview and definitions of overlapping and frequently (mis)used buzzwords and acronyms, such as machine learning (ML), data mining (DM), knowledge discovery in databases (KDD), and intelligent data analysis (IDA). We continue with a short overview of machine learning methods and historical development of different directions in machine learning research: symbolic rule learning, neural nets and reinforcement learning, numerical learning methods, and formal learning theory. We describe a few interesting early machine learning systems, review typical applications and finish with an overview of data mining standards and tools.

**1.1 THE NAME OF THE GAME **

*Why do we not, since the phenomena are well known, build a knowledge refinery as the basis of a new industry, comparable in some ways to the industry of petroleum refining, only more important in the long run? The product to be refined is codified human knowledge. *

— *Donald Michie *

The main problem of contemporary data analysis is the huge amount of data. In the last decade, computers have provided inexpensive and capable means to collect and store the data. The increase in data volume causes difficulties in extracting useful information. The traditional manual data analysis has become insufficient, and methods for efficient computer-based analysis are now needed. Therefore, a new interdisciplinary field has emerged, that encompasses statistical, pattern recognition, machine learning, and visualization tools to support the analysis of data and discovery of principles hidden within the data. In different communities this field is known by different names: knowledge discovery in databases (KDD), data mining (DM), or intelligent data analysis (IDA).

The terms data mining (in the database community) and knowledge discovery in databases (in machine learning and artificial intelligence communities) appeared around 1990, but the term data mining became more popular in the business community and in the press. Frequently, data mining and knowledge discovery are used interchangeably.

KDD focuses on the knowledge discovery from data (**Figure 1.1), including how the data are stored and accessed; how algorithms can be scaled to massive data sets and still run efficiently; how results can be interpreted and visualized; and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any particular discipline such as machine learning. Note that the KDD process uses data mining as one of its steps. **

**Figure 1.1 **The process of knowledge discovery in databases.

Data mining (DM) is an interactive and iterative process in which many steps need to be repeatedly refined in order to provide for an appropriate solution to the data analysis problem. Data mining standard CRISP-DM defines a process model that reflects the life cycle of a data mining project. It contains the phases (see **Figure 1.2) of a data mining project and identifies relationships between them. The phases are cyclically iterated until some desired goal is reached. By comparing Figures 1.1 and 1.2 it is clear that basically they describe the same thing. **

**Figure 1.2 **The cycle of a data mining process in the CRISP-DM standard.

The ambiguity whether KDD = DM arises from the fact that, strictly speaking, data mining (DM) covers the methods used in the data modeling step. This accounts both for visual methods as well as for methods that (semi) automatically deal with the extraction of knowledge from large masses of data, thus describing the data in terms of the interesting discovered regularities. Most automatic data mining methods have their origins in machine learning (ML). However, ML cannot be seen as a true subset of data mining, as it also encompasses other fields, not utilized for data mining (e.g. theory of learning, computational learning theory, and reinforcement learning).

A closely related field of intelligent data analysis (IDA) is similar to KDD and DM. It also refers to the interactive and iterative process of data analysis, with the distinguishing feature that the architectures, methodologies and techniques used in this process are those of artificial intelligence (AI) and machine learning. **Figure 1.3 shows the relation of IDA to KDD, DM and ML in the strictest sense. In practice (and in the following text) KDD and DM are treated as synonyms. **

**Figure 1.3 **Relation between ML, DM, KDD, and IDA in the stricter sense.

There is a large intersection between KDD/DM and IDA. Both fields investigate data analysis, and they share many common methods. The main difference is that IDA uses AI methods and tools, while KDD/DM employ both AI and non-AI methods. Therefore, ML methods are in the intersection of the two fields, whereas classical statistical methods and on-line analytical processing (OLAP) belong to KDD/DM but not to IDA. Typically, KDD and DM are concerned with the extraction of knowledge from large datasets (databases), whereas in IDA the datasets are smaller.

**1.2 OVERVIEW OF MACHINE LEARNING METHODS **

*A learning machine, broadly defined, is any device whose actions are influenced by past experiences. *

— *Nils J*. *Nilsson *

This overview is intended to give an impression of the field of machine learning. Therefore detailed explanations are omitted. The meaning of various terms mentioned in this section will become more familiar later in the book.

We differentiate machine learning methods with respect to how the obtained (induced) knowledge is used (see **Figure 1.4): classification, regression, clustering, learning of associations, relations, and (differential) equations. Besides, machine learning is used in reinforcement learning within an iterative approach for optimization of value function approximation. Equation and reinforcement learning are not thoroughly discussed in this book, and are only described here for completeness. **

**Figure 1.4 **A taxonomy of machine learning methods. The methods not thoroughly discussed in the book are grayed.

**1.2.1 Classification **

Machine learning methods are most frequently used for classification. Let us suppose that we have an object, described with many attributes (features, properties). Each object can be assigned to exactly one class from the finite set of possible classes. Attributes are independent observable variables, either continuous or discrete. The class is a dependent unobservable discrete variable and its value is determined from values of respective independent variables. Machine learning methods are, among others, used for creation of *classifiers*. The task of the *classifier *is to determine the class to which the object in question should be assigned.

A typical classification task is medical diagnosis: a patient is described with continuous (e.g. age, height, weight, body temperature, heart rate, blood pressure) and discrete attributes (e.g. sex, skin discoloration, location of pain). The classifier’s task is to produce a diagnosis – to label the patient with one of several possible diagnoses (e.g. healthy, influenza, pneumonia).

To determine the class, a classifier needs to describe a discrete function, a mapping from the attribute space to the class space. This function may be given in advance, or can be learnt from data. Data consists of learning examples (sometimes also called training instances) that describe solved past problems. For illustration let us again consider medical diagnostics. Here, solved past problems are medical records, including diagnoses, for all patients that had been treated in a hospital. The learning algorithm’s task is therefore to determine the mapping by learning from the set of patients with known diagnoses. This mapping (represented as a function, a rule, … ) can later be used for diagnosing new patients.

Different classifiers represent mapping functions in many different ways. Most common classifiers are decision trees, decision rules, naive Bayesian classifiers, Bayesian belief networks, nearest neighbor classifiers, linear discriminant functions, logistic regression, support vector machines, and artificial neural networks.

**Decision trees and rules. **Algorithms for building decision trees and rules can select attributes and suitable subsets of their values according to their quality, and use them as building blocks of a conjunctive rule’s antecedent. A consequent consists of one or more classes. When a new example arrives it is classified by applying a suitable rule.

For the given attribute *A *and its values *V*1 … *Vm*, decision trees and rules form conjunctive terms such as *A* = V*j *. Continuous attributes are either discretized in advance or, even better, the terms like *A* > *x*, *A* ≤ *x*, or *x* ≥ *A* > *y*, where *x*, *y *, are formed during learning.

**Bayesian classifier. **Given the attribute values of a new example, the task of the Bayesian classifier is to calculate conditional probabilities for all possible classes. The exact Bayesian classifier minimizes the expected error rate and is therefore optimal in this sense. As exact Bayesian classifiers are hard to come by (unless the learning data cover the complete attribute space and there is no noise), certain approximations and assumptions have to be used when calculating conditional probabilities.

The naive Bayesian classifier assumes conditional independence of attributes given the class. This (often very strong) assumption usually ensures reliable conditional probability estimations needed for classification, even from very small datasets. Implementations of the naive Bayesian classifier often assume that only discrete attributes are used; therefore continuous attributes need to be discretized in advance.

A generalization of the naive Bayesian classifier are *Bayesian belief networks*. They use acyclic graphs for modeling dependencies between attributes and the class. Conditional independence is implicitly assumed between nodes (attributes) that are not directly connected. A Bayesian network’s topology can be given in advance (as a part of background knowledge for the problem), or can be learnt from data. Conditional dependencies are also learnt from data. Acyclic graph topology ensures efficient calculation of conditional probabilities for all classes. Bayesian belief networks can be used also for modeling probability distributions.

**Nearest neighbors classifiers (instance-based learning). **In its simplest form, the k-nearest neighbors classifier (*k*-NN) generates its knowledge base by storing all learning data. When a new example is presented for classification, the stored data are used to find a certain number (*k*) of most similar learning examples (nearest neighbors), according to some distance metric. The new example is labelled with a class, prevalent among its nearest neighbors. The number *k *is typically a small odd number (e.g. five or seven).

**Discriminant functions**. The task of the learning algorithm is to calculate coefficients for a discriminant function whose structure is fixed in advance. A discriminant function is actually a hypersurface, dichotomizing between two classes in the attribute space. As a hypersurface can be defined only in a continuous hyperspace, all attributes need to be continuous. If we have more than two classes, a separate hypersurface is necessary for each pair of classes. Discriminant functions can be linear, quadratic, polynomial, etc. In case of a linear discriminant function the corresponding hypersurface is a hyperplane dichotomizing between two classes. The hyperplane’s (hypersurface’s) coefficients are determined so as to minimize the classification error rate.

Frequently, Fisher’s linear discriminant function is used. It assumes normal distribution of learning examples within each class. It maximizes the (Euclidean) distance between averaged examples from both classes. It accounts for potentially different class variances and finds an optimal classification boundary between classes.

**Support Vector Machines (SVMs). **SVMs were developed in the 1990s and are among the most accurate approaches to discriminant function classification. Due to their importance we discuss them separately. SVM is a multi-criterion optimization method:

• It maximizes the distance between support vectors and a dichotomizing hyperplane. Support vectors are learning examples that lie closest to the hyperplane and therefore define its *margin*. The method thus maximizes the margin as well.

• Since many classification problems are not linearly separable, SVMs implicitly transform the original attribute set to the much larger (possibly even infinite) attribute set by using kernel functions. The success depends mostly on a suitably chosen kernel function.

• Since every classification problem can be exactly solved by using arbitrarily complex discriminant functions, SVMs attempt to minimize the size of solution (the sum of attributes’ weights).

• Although the transformation of attributes is nonlinear, some examples still get misclassified. SVMs therefore tend to minimize the number of misclassifications.

Completely satisfying all the above criteria is usually impossible, so a compromise needs to be found. Although SVMs use several optimization algorithms to optimize all criteria simultaneously, one still needs to appropriately weight the above criteria in order to obtain the best possible classification.

**Artificial neural networks. **Algorithms for artificial neural networks (ANNs) mimic biological neural networks by abstracting neural functions to a simple element (neuron), only able to summarize its input and normalize its output. Neurons are interconnected into arbitrarily complex artificial neural networks. For classification, most frequently feedforward multi-layered neural networks are used. Neurons are organized in a layered cascade: an input layer (corresponding to attributes), one or more hidden layers, and an output layer (corresponding to classes). The task of the learning algorithm is to determine weights on connections between neurons (used for calculating weighted sums in each neuron) in order to minimize the classification error rate.

For classification of a new example its attribute values are presented to the ANNs input neurons. These values are weighted according to the connections between neurons, and their weighted sums are calculated in each neuron in the subsequent layer of neurons. Normalized results in the output neurons determine the classification outcome (class).

**Hybrid algorithms**. Hybrid algorithms or their parts combine several different approaches by using their advantages and thus creating possibly superior learning algorithms. An example is the naive Bayesian classifier that has many nice features (speed, robustness, comprehensibility), and a sole problem – its naivety. It is often combined with other algorithms in order to – at least partially – avoid the effects of naivety:

• A locally naive (*k*-NN) Bayesian classifier is based on learning a naive Bayesian classifier on a set of *k *nearest neighbors of a new example. This reduces the naivety with respect to the original globally naive Bayesian classifier. Local classifiers are adapted to local distributions which implicitly account for attribute dependencies. The parameter *k *must be selected much larger than for ordinary nearest neighbor classifiers (e.g. *k* > 30) in order to provide sufficiently reliable conditional probability estimations.

• Clustering and the naive Bayesian classifier. For each class the learning examples are clustered (described later in this section) into several clusters. Each of them serves as a learning set for a separate new class. The number of obtained clusters is larger than the number of classes. The clustering algorithm should account for attribute dependencies and detect locally coherent subspaces, such that the naive Bayesian classifier can discriminate between them.

• The naive Bayesian classifier in the leaves of a decision tree. We may assume that the decision tree will account for dependencies among attributes, and split the original attribute space into subspaces where the remaining attributes are relatively independent. Such decision trees should be made smaller (by pruning) in order to obtain sufficiently reliable conditional probability estimations for the final naive Bayesian classification.

• The naive Bayesian classifier in internal nodes of decision trees. Instead of using original attributes to form splitting criteria, we can use a simple classifier in each internal node. Each node has for each possible class a corresponding subtree. A classifier learns from the training examples that belong to its node and assigns them to respective subtrees. The task of the subtree is to correctly classify learning examples that were misclassified in the previous node. A decision tree is built recursively until all examples are correctly classified or some other stopping criterion is fulfilled (e.g., number of examples becomes too small).

**1.2.2 Regression **

As in classification problems, in regression we have a set of objects (learning examples), described with several attributes (features, properties). Attributes are independent observable variables (either continuous or discrete). The dependent (target, regression) variable is continuous and its value is determined as a function of independent variables. The task of the *regressional predictor *is to determine the value of the dependent unobservable continuous variable for the object in question.

Similarly to classifiers, a regressional predictor implements a continuous function mapping from the attribute space to prediction values. This function can be given in advance or learnt from previously solved problems (a learning set). The task of the learning algorithm is therefore to determine a continuous function by learning from the learning set. This function can later be used for predicting values for new, previously unseen examples.

Regressional predictors differ with respect to the representation of regressional function. Most common regressional predictors (regressors) are linear regression, regression trees, locally weighted regression, support vector machines for regression, and multi-layered feedforward neural networks for regression.

**Regression trees**. Algorithms for building regression trees are similar to algorithms for constructing decision trees. According to their quality, they choose attributes and subsets of their values for building the node-splitting criteria. Tree nodes stand for attributes, tree branches (edges) for attribute values (conjunctively combined), and leaves for consequents of the prediction rules that map from the remaining (non-used) attributes to the prediction value. Discrete and continuous attributes are handled within internal nodes much in the same way as in decision (classification) trees. In terminal nodes (leaves) we have a continuous function. It is usually assumed that all the remaining attributes are continuous. Most frequently, single point functions (the mean value of dependent variable of the corresponding subset of learning examples), or linear regression functions are used.

**Linear regression**. A linear regression function assumes all attributes to be continuously-valued and suitably normalized. It also assumes the existence of a linear relationship between dependent and independent variables. The learning algorithm needs to determine the coefficients of the linear function in order to minimize the sum of squared (or absolute) errors of regressional predictions evaluated on learning examples. Either analytical solution or iterative optimization may be used. The result of learning is a hyperplane that for each example determines the value of the dependent variable. The error for a given example is its distance from the hyperplane in the dependent variable’s axis.

For non-linear problems a series of linear functions or nonlinear parts included in the functional expression can be used.

**Locally weighted regression**. Locally weighted regression is the nearest neighbors algorithm adapted for regression, such that linear regression function is calculated from the new example’s nearest neighbors. Linear regression is thus adapted to the local properties. Besides linear regression, the average value of nearest neighbors or some other simple function can be used – we cannot afford to use complex functions due to the small number of neighbors.

**Support vector machines for regression**. Since SVMs perform implicit transformation of the attribute space by utilizing kernel functions, there is no need to introduce nonlinearities for description of the hypersurface. SVMs can be easily adapted for solving regressional problems. A margin around the regressional hyperplane is defined such that correctly predicted learning examples lie inside it. Support vectors determine the margin. SVMs aim to minimize the margin as well as minimize the prediction error on learning examples. As is the case with SVM’s for classification, the criterion function needs to be optimized with respect to the prediction error of the regressional variable as well as with respect to the complexity of the criterion function (measured with magnitudes of weights for implicitly generated attributes).

**Artificial neural networks. **For regression, as well as for classification, multilayered feedforward networks are commonly used. Simple calculation elements (neurons) are organized in a layered cascade: an input layer (corresponding to attributes), one or more hidden layers, and an output layer with a single neuron (corresponding to the regressional variable). The task of the learning algorithm is to determine weights on connections between neurons (used for calculating weighted sums in each neuron) such that the regressional error is minimal.

**Hybrid algorithms**. As for classification, hybrid algorithms combine several different approaches by using their advantages and thus creating possibly superior learning algorithms. Strictly speaking, locally weighted regression is of this kind, combining linear regression with nearest neighbors. The same holds for the use of linear regression (or other simple methods) in the leaves of regression trees. The rationale here is that regression trees should partition the attribute space finely enough for a simple regressor to work well. Such regression tree should be made smaller (by pruning) in order to allow enough examples for simple regressors.

**1.2.3 Logical relations **

Logical relations can be considered a generalization of discrete functions. As opposed to functions, there is no single dependent discrete variable (class), and all variables (attributes) are treated equivalently. Sometimes, some variables have unknown values, and we want to predict them. In other cases, all the variables have known values and we want to check if the relation holds for this particular set of values. According to the expressive power of the language used for describing learning examples and logical relations, we differentiate between *learning associations *and *inductive logic programming*.

**Associations**. We speak of associations when the propositional (with attributes) data description is used, such as in classification and regression, the only difference being the absence of the dependent variable. Each variable may in turn be considered as independent (its value is given), or dependent (its value is not given). Because of the absence of an explicitly defined dependent variable, association learning is often considered *unsupervised*, together with clustering (to be described later).

A straightforward approach to association learning is to break the problem into *a *learning problems, *a *being the number of attributes. In each new problem, one attribute is taken as a target (dependent) variable, whereas others are treated as independent variables. The type of the learning problem (classification or regression) depends on the type of the attribute (discrete or continuous). Two approaches to association learning are associative neural networks and association rules.

• Associative neural networks are artificial neural networks where every neuron can serve both as input and output. Each neuron represents one attribute. Neurons are mutually interconnected with two-way connections, so that the signal can travel in either direction. The signal value is multiplied by the weight of the connection. Each neuron calculates its own weighted sum of the incoming connections. This (normalized) sum represents the attribute value. In an associative neural network the initial state is determined by the given attribute values. After the initial set-up, all neurons start calculating their outputs asynchronously and in parallel. This process goes on until the outputs of all neurons cease changing. Thus the unknown values of the attributes are calculated.

The task of the learning algorithm is to set the weights on connections between each pair of neurons. According to the utilized learning algorithm we speak of Hopfield and Bayesian neural networks. Hopfield neural networks are based on correlation of the state activity of neurons. Bayesian neural networks use in every neuron the naive Bayesian learning rule for calculation of the weighted sum. Similarly, its learning rule calculates the conditional probability of activity of each neuron given the states of other neurons.

Associative neural networks are frequently used for pattern recognition when only partial data are known, and the missing part needs to be reconstructed. For example, in face recognition only a part of the face can be seen. Another use is in combinatorial optimization problems, where parameters (attributes) compete with each other in the sense that higher values of one may result in lower values of others. Associative neural networks tend to find a compromise – a locally optimal combination of parameter values.

• Association rules are similar to classification rules. They differ in the consequent, where instead of class any attribute may occur. The antecedent is made of conjunctively connected terms, built from other attributes. The aim of learning association rules is to find new interesting relations between attributes. They may contribute to (human) knowledge of a certain problem and give insight about regularities in data. The task of the learning algorithm is to find all association rules that are both accurate and interesting.

Association rules have originally been used in *market basket analysis*. With their help the merchants can plan the arrangement of items on the shelves in their shop. For example, a rule may say, that there is a high probability that a customer, buying flour, sugar, and nuts, will also buy eggs.

**Inductive logic programming. **In inductive logic programming (ILP) the first-order predicate calculus is used as a description language. It is usually limited to Horn clauses with negation, or equivalently, to Prolog expressions. A learning problem is described with the background knowledge (a set of relations, or Prolog programs) and with sets of positive (true) and negative (false) facts (examples) for a target relation. The task of the learning algorithm is to derive the target relation (a Prolog program) covering as many as possible positive and as few as possible negative examples. ILP can be considered as a process for discovering systems of logical equations.

In practice, inductive logic programming is used for solving geometrical and spatial problems, such as structure prediction of chemical compounds, gene expression determination in functional genomics, spatial planning and subspace recognition, and static and modeling problems in machine engineering.

**1.2.4 Equations **

When modeling real-world problems, we often have to deal with measurements of different, related and mutually dependent processes. By learning systems of equations we aim to explicitly describe these dependencies. The same system of equations may as well be used for simulation and forecasting. We usually deal with real-valued (continuous) variables, numerical relations, and dependencies. In contrast to this, ILP is concerned mostly with logical variables, relations, and dependencies.

When learning systems of equations, their structure may either be given in advance in the form of background knowledge, or the learning algorithm needs to derive from data the structure of equations as well as their corresponding coefficients. For modeling dynamic systems that change continuously through the time, systems of partial differential equations need to be derived. The task of the learning algorithm is to derive a system of equations that is both simple (in terms of equations) and accurate (with respect to input data).

Equation discovery is frequently used in meteorology for modeling weather systems, in ecology for modeling biological subsystems, and in machine engineering for modeling technical and dynamical processes.

**1.2.5 Clustering **

Unsupervised learning is different from classification and regression since only description of examples with attributes is given, without the supervising (target) variable. Clustering is the most popular of unsupervised learning methods. The task of the learning algorithm is to determine coherent subsets (clusters) of learning examples. Clustering is frequently used in natural sciences, econometrics, process analysis, psychological and sociological research.

The number of clusters may either be given in advance as a part of the background knowledge, or determined by the learning algorithm. Its task is therefore to determine a relatively small number of coherent clusters, that is, subsets of similar examples. The choice of the (dis)similarity measure is the most important part of the background knowledge, and of utmost importance for successful and meaningful clustering. Most popular approaches to clustering are:

• Hierarchical clustering. In agglomerative (bottom-up) hierarchical clustering, each example initially forms a separate cluster. At each step of the algorithm, the most similar clusters are amalgamated, thus forming a tree of clusterings (dendrogram). Frequently, pairwise amalgamation is used, leading to binary trees. Amalgamation continues until all examples belong to a single cluster. Finally, the learning algorithm or the end user, selects the most suitable clustering level from the constructed tree. In divisive (top-down) hierarchical clustering, all examples initially belong to a single cluster. The number of clusters is incremented at each step of the algorithm by dividing an existing cluster into (usually, two) sub-clusters. The process continues until a suitable number of clusters is obtained.

• Partitional clustering. Initially, the number (*c*) of required disjoint clusters needs to be known. Given a criterion for measuring adequacy of a partition of examples into *c *clusters, the algorithm searches for optimal partitions of examples. The process starts with an initial clustering (frequently obtained by randomly selecting *c *examples), and a set of transformations for changing a partition into another partition. The learning algorithm (e.g., iterative relocation algorithm) modifies a partition until no allowable transformation would improve the given criterion of partition adequacy.

**1.2.6 Reinforcement learning **

Reinforcement learning deals with the problem of teaching an autonomous agent that acts and senses in its environment to choose optimal actions for achieving its goals. The agent receives information about the current state in the environment and performs actions to change it. Each time the agent performs an action, he may also get a positive or a negative reward (punishment). A reward is not necessarily a result of the latest action, but may be an effect of a (sometimes very long) series of actions. Since rewards are often indirect and delayed, learning is difficult and slow. A learner may be conservative and use only the knowledge it already has, or it might experiment with new, rarely used actions to obtain the knowledge of new situations.

The learner’s task is to learn from indirect, delayed rewards, how to choose sequences of actions that maximize the cumulative reward. Usually, the fastest solution (in terms of the number of actions) is preferable. Therefore, the weighting of accumulated rewards is often chosen so that immediate rewards are most important.

A commonly used approach to reinforcement learning is *Q learning*. This is a form of reinforcement learning in which the agent iteratively learns an evaluation function over states and actions. It can be employed even when the learner has no prior knowledge of how its