
Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms
Intelligent Systems Reference Library, Volume 25
Editors-in-Chief

Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl

Prof. Lakhmi C. Jain
University of South Australia
Adelaide
Mawson Lakes Campus
South Australia 5095
Australia
E-mail: Lakhmi.jain@unisa.edu.au

Further volumes of this series can be found on our homepage: springer.com

Vol. 1. Christine L. Mumford and Lakhmi C. Jain (Eds.)
Computational Intelligence: Collaboration, Fusion and Emergence, 2009
ISBN 978-3-642-01798-8

Vol. 2. Yuehui Chen and Ajith Abraham
Tree-Structure Based Hybrid Computational Intelligence, 2009
ISBN 978-3-642-04738-1

Vol. 3. Anthony Finn and Steve Scheding
Developments and Challenges for Autonomous Unmanned Vehicles, 2010
ISBN 978-3-642-10703-0

Vol. 4. Lakhmi C. Jain and Chee Peng Lim (Eds.)
Handbook on Decision Making: Techniques and Applications, 2010
ISBN 978-3-642-13638-2

Vol. 5. George A. Anastassiou
Intelligent Mathematics: Computational Analysis, 2010
ISBN 978-3-642-17097-3

Vol. 6. Ludmila Dymowa
Soft Computing in Economics and Finance, 2011
ISBN 978-3-642-17718-7

Vol. 7. Gerasimos G. Rigatos
Modelling and Control for Intelligent Industrial Systems, 2011
ISBN 978-3-642-17874-0

Vol. 8. Edward H.Y. Lim, James N.K. Liu, and Raymond S.T. Lee
Knowledge Seeker – Ontology Modelling for Information Search and Management, 2011
ISBN 978-3-642-17915-0

Vol. 9. Menahem Friedman and Abraham Kandel
Calculus Light, 2011
ISBN 978-3-642-17847-4

Vol. 10. Andreas Tolk and Lakhmi C. Jain
Intelligence-Based Systems Engineering, 2011
ISBN 978-3-642-17930-3

Vol. 11. Samuli Niiranen and Andre Ribeiro (Eds.)
Information Processing and Biological Systems, 2011
ISBN 978-3-642-19620-1

Vol. 12. Florin Gorunescu
Data Mining, 2011
ISBN 978-3-642-19720-8

Vol. 13. Witold Pedrycz and Shyi-Ming Chen (Eds.)
Granular Computing and Intelligent Systems, 2011
ISBN 978-3-642-19819-9

Vol. 14. George A. Anastassiou and Oktay Duman
Towards Intelligent Modeling: Statistical Approximation Theory, 2011
ISBN 978-3-642-19825-0

Vol. 15. Antonino Freno and Edmondo Trentin
Hybrid Random Fields, 2011
ISBN 978-3-642-20307-7

Vol. 16. Alexiei Dingli
Knowledge Annotation: Making Implicit Knowledge Explicit, 2011
ISBN 978-3-642-20322-0

Vol. 17. Crina Grosan and Ajith Abraham
Intelligent Systems, 2011
ISBN 978-3-642-21003-7

Vol. 18. Achim Zielesny
From Curve Fitting to Machine Learning, 2011
ISBN 978-3-642-21279-6

Vol. 19. George A. Anastassiou
Intelligent Systems: Approximation by Artificial Neural Networks, 2011
ISBN 978-3-642-21430-1

Vol. 20. Lech Polkowski
Approximate Reasoning by Parts, 2011
ISBN 978-3-642-22278-8

Vol. 21. Igor Chikalov
Average Time Complexity of Decision Trees, 2011
ISBN 978-3-642-22660-1

Vol. 22. Przemyslaw Różewski, Emma Kusztina, Ryszard Tadeusiewicz, and Oleg Zaikin
Intelligent Open Learning Systems, 2011
ISBN 978-3-642-22666-3

Vol. 23. Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms, 2012
ISBN 978-3-642-23165-0

Vol. 24. Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms, 2012
ISBN 978-3-642-23240-4

Vol. 25. Dawn E. Holmes and Lakhmi C. Jain (Eds.)
Data Mining: Foundations and Intelligent Paradigms, 2012
ISBN 978-3-642-23150-6

Dawn E. Holmes and Lakhmi C. Jain (Eds.)

Data Mining: Foundations and Intelligent Paradigms
Volume 3: Medical, Health, Social, Biological and other Applications

Prof. Dawn E. Holmes
Department of Statistics and Applied Probability
University of California, Santa Barbara
CA 93106
USA
E-mail: holmes@pstat.ucsb.edu

Prof. Lakhmi C. Jain
Professor of Knowledge-Based Engineering
University of South Australia
Adelaide
Mawson Lakes, SA 5095
Australia
E-mail: Lakhmi.jain@unisa.edu.au

ISBN 978-3-642-23150-6 e-ISBN 978-3-642-23151-3

DOI 10.1007/978-3-642-23151-3

Intelligent Systems Reference Library ISSN 1868-4394

Library of Congress Control Number: 2011936705



© 2012 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilm or in any other way,
and storage in data banks. Duplication of this publication or parts thereof is permitted
only under the provisions of the German Copyright Law of September 9, 1965, in
its current version, and permission for use must always be obtained from Springer.
Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general
use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
Preface

There are many invaluable books available on data mining theory and applications.
However, in compiling a volume titled “DATA MINING: Foundations and Intelligent
Paradigms: Volume 3: Medical, Health, Social, Biological and other Applications”, we
wish to introduce some of the latest developments to a broad audience of both
specialists and non-specialists in this field.
The term ‘data mining’ was introduced in the 1990s to describe an emerging field
based on classical statistics, artificial intelligence and machine learning. By combining
techniques from these areas and developing new ones, researchers are able to
analyze large datasets in innovative and productive ways. Patterns found in these datasets are
subsequently analyzed with a view to acquiring new knowledge. These techniques
have been applied in a broad range of medical, health, social and biological areas.
In compiling this volume we have sought to present innovative research from
prestigious contributors in the field of data mining. Each chapter is self-contained and
is described briefly in Chapter 1.
This book will prove valuable to theoreticians as well as application
scientists/engineers in the area of Data Mining. Postgraduate students will also find
this a useful sourcebook since it shows the direction of current research.
We have been fortunate in attracting top-class researchers as contributors and wish
to offer our thanks for their support in this project. We also acknowledge the expertise
and time of the reviewers. Finally, we wish to thank Springer for their support.

Dr. Dawn E. Holmes
University of California
Santa Barbara, USA

Dr. Lakhmi C. Jain
University of South Australia
Adelaide, Australia

Contents

Chapter 1
Advances in Intelligent Data Mining
Dawn E. Holmes, Jeffrey W. Tweedale, Lakhmi C. Jain
1 Introduction
2 Medical Influences
3 Health Influences
4 Social Influences
4.1 Information Discovery
4.2 On-Line Communities
5 Biological Influences
5.1 Biological Networks
5.2 Estimations in Gene Expression
6 Chapters Included in the Book
7 Conclusion
References

Chapter 2
Temporal Pattern Mining for Medical Applications
Giulia Bruno, Paolo Garza
1 Introduction
2 Types of Temporal Data in Medical Domain
3 Definitions
4 Temporal Pattern Mining Algorithms
4.1 Temporal Pattern Mining from a Set of Sequences
4.2 Temporal Pattern Mining from a Single Sequence
5 Medical Applications
6 Conclusions
References

Chapter 3
BioKeySpotter: An Unsupervised Keyphrase Extraction Technique in the Biomedical Full-Text Collection
Min Song, Prat Tanapaisankit
1 Introduction
2 Backgrounds and Related Work
3 The Proposed Approach
4 Evaluation
4.1 Dataset
4.2 Comparison Algorithms
4.3 Experimental Results
5 Conclusion
References

Chapter 4
Mining Health Claims Data for Assessing Patient Risk
Ian Duncan
1 What Is Health Risk?
2 Traditional Models for Assessing Health Risk
3 Risk Factor-Based Risk Models
4 Data Sources
4.1 Enrollment Data
4.2 Claims and Coding Systems
4.3 Interpretation of Claims Codes
5 Clinical Identification Algorithms
6 Sensitivity-Specificity Trade-Off
6.1 Constructing an Identification Algorithm
6.2 Sources of Algorithms
7 Construction and Use of Grouper Models
7.1 Drug Grouper Models
7.2 Drug-Based Risk Adjustment Models
8 Summary and Conclusions
References

Chapter 5
Mining Biological Networks for Similar Patterns
Ferhat Ay, Günhan Gülsoy, Tamer Kahveci
1 Introduction
2 Metabolic Network Alignment with One-to-One Mappings
2.1 Model
2.2 Problem Formulation
2.3 Pairwise Similarity of Entities
2.4 Similarity of Topologies
2.5 Combining Homology and Topology
2.6 Extracting the Mapping of Entities
2.7 Similarity Score of Networks
2.8 Complexity Analysis
3 Metabolic Network Alignment with One-to-Many Mappings
3.1 Homological Similarity of Subnetworks
3.2 Topological Similarity of Subnetworks
3.3 Combining Homology and Topology
3.4 Extracting Subnetwork Mappings
4 Significance of Network Alignment
4.1 Identification of Alternative Entities
4.2 Identification of Alternative Subnetworks
4.3 One-to-Many Mappings within and across Major Clades
5 Summary
6 Further Reading
References

Chapter 6
Estimation of Distribution Algorithms in Gene Expression Data Analysis
Elham Salehi, Robin Gras
1 Introduction
2 Estimation of Distribution of Algorithms
2.1 Model Building in EDA
2.2 Notation
2.3 Models with Independent Variables
2.4 Models with Pair Wise Dependencies
2.5 Models with Multiple Dependencies
3 Application of EDA in Gene Expression Data Analysis
3.1 State-of-Art of the Application of EDAs in Gene Expression Data Analysis
4 Conclusion
References

Chapter 7
Gene Function Prediction and Functional Network: The Role of Gene Ontology
Erliang Zeng, Chris Ding, Kalai Mathee, Lisa Schneper, Giri Narasimhan
1 Introduction
1.1 Gene Function Prediction
1.2 Functional Gene Network Generation
1.3 Related Work and Limitations
2 GO-Based Gene Similarity Measures
3 Estimating Support for PPI Data with Applications to Function Prediction
3.1 Mixture Model of PPI Data
3.2 Data Sets
3.3 Function Prediction
3.4 Evaluating the Function Prediction
3.5 Experimental Results
3.6 Discussion
4 A Functional Network of Yeast Genes Using Gene Ontology Information
4.1 Data Sets
4.2 Constructing a Functional Gene Network
4.3 Using Semantic Similarity (SS)
4.4 Evaluating the Functional Gene Network
4.5 Experimental Results
4.6 Discussion
5 Conclusions
References

Chapter 8
Mining Multiple Biological Data for Reconstructing Signal Transduction Networks
Thanh-Phuong Nguyen, Tu-Bao Ho
1 Introduction
2 Background
2.1 Signal Transduction Network
2.2 Protein-Protein Interaction
3 Constructing Signal Transduction Networks Using Multiple Data
3.1 Related Work
3.2 Materials and Methods
3.3 Clustering and Protein-Protein Interaction Networks
3.4 Evaluation
4 Some Results of Yeast STN Reconstruction
5 Outlook
6 Summary
References

Chapter 9
Mining Epistatic Interactions from High-Dimensional Data Sets
Xia Jiang, Shyam Visweswaran, Richard E. Neapolitan
1 Introduction
2 Background
2.1 Epistasis
2.2 Detecting Epistasis
2.3 High-Dimensional Data Sets
2.4 Barriers to Learning Epistasis
2.5 MDR
2.6 Bayesian Networks
3 Discovering Epistasis Using Bayesian Networks
3.1 A Bayesian Network Model for Epistatic Interactions
3.2 The BNMBL Score
3.3 Experiments
4 Efficient Search
4.1 Experiments
5 Discussion, Limitations, and Future Research
References

Chapter 10
Knowledge Discovery in Adversarial Settings
D.B. Skillicorn
1 Introduction
2 Characteristics of Adversarial Modelling
3 Technical Implications
4 Conclusion
References

Chapter 11
Analysis and Mining of Online Communities of Internet Forum Users
Mikolaj Morzy
1 Introduction
1.1 What Is Web 2.0?
1.2 New Forms of Participation — Push or Pull?
1.3 Internet Forums as New Forms of Conversation
2 Social-Driven Data
2.1 What Are Social-Driven Data?
2.2 Data from Internet Forums
3 Internet Forums
3.1 Crawling Internet Forums
3.2 Statistical Analysis
3.3 Index Analysis
3.4 Network Analysis
4 Related Work
5 Conclusions
References

Chapter 12
Data Mining for Information Literacy
Bettina Berendt
1 Introduction
2 Background
2.1 Information Literacy
2.2 Critical Literacy
2.3 Educational Data Mining
3 Towards Critical Data Literacy: A Frame for Analysis and Design
3.1 A Frame of Analysis: Technique and Object
3.2 On the Chances of Achieving Critical Data Literacy: Principles of Successful Learning as Description Criteria
4 Examples: Tools and Other Approaches Supporting Data Mining for Information Literacy
4.1 Analysing Data: Do-It-Yourself Statistics Visualization
4.2 Analysing Language: Viewpoints and Bias in Media Reporting
4.3 Analysing Data Mining: Building, Comparing and Re-using Own and Others’ Conceptualizations of a Domain
4.4 Analysing Actions: Feedback and Awareness Tools
4.5 Analysing Actions: Role Reversals in Data Collection and Analysis
5 Summary and Conclusions
References

Chapter 13
Rule Extraction from Neural Networks and Support Vector Machines for Credit Scoring
Rudy Setiono, Bart Baesens, David Martens
1 Introduction
2 Re-RX: Recursive Rule Extraction from Neural Networks
2.1 Multilayer Perceptron
2.2 Finding Optimal Network Structure by Pruning
2.3 Recursive Rule Extraction
2.4 Applying Re-RX for Credit Scoring
3 ALBA: Rule Extraction from Support Vector Machines
3.1 Support Vector Machine
3.2 ALBA: Active Learning Based Approach to SVM Rule Extraction
3.3 Applying ALBA for Credit Scoring
4 Conclusion
References

Chapter 14
Using Self-Organizing Map for Data Mining: A Synthesis with Accounting Applications
Andriy Andreev, Argyris Argyrou
1 Introduction
2 Data Pre-processing
2.1 Types of Variables
2.2 Distance Metrics
2.3 Rescaling Input Variables
3 Self-Organizing Map
3.1 Introduction to SOM
3.2 Formation of SOM
4 Performance Metrics and Cluster Validity
5 Extensions of SOM
5.1 Non-metric Spaces
5.2 SOM for Temporal Sequence Processing
5.3 SOM for Cluster Analysis
5.4 SOM for Visualizing High-Dimensional Data
6 Financial Applications of SOM
7 Case Study: Clustering Accounting Databases
7.1 Data Description
7.2 Data Pre-processing
7.3 Experiments
7.4 Results Presentation and Discussion
References

Chapter 15
Applying Data Mining Techniques to Assess Steel Plant Operation Conditions
Khan Muhammad Badruddin, Isao Yagi, Takao Terano
1 Introduction
2 Brief Description of EAF
2.1 Performance Evaluation Criteria
2.2 Innovations in Electric Arc Furnaces
2.3 Details of the Operation
2.4 Understanding SCIPs and Stages of a Heat
3 Problem Description
4 Data Mining Process
4.1 Data
4.2 Data Preprocessing
4.3 Attribute Pruning
4.4 The Experiments
4.5 Data Mining Techniques
5 Results
5.1 Discussion
6 Concluding Remarks
References

Author Index


Editors

Dr. Dawn E. Holmes serves as Senior Lecturer in the Department of Statistics and
Applied Probability and Senior Associate Dean in the Division of Undergraduate
Education at UCSB. Her main research area, Bayesian Networks with Maximum Entropy,
has resulted in numerous journal articles and conference presentations. Her other
research interests include Machine Learning, Data Mining, Foundations of Bayesianism
and Intuitionistic Mathematics. Dr. Holmes has co-edited, with Professor Lakhmi C.
Jain, the volumes ‘Innovations in Bayesian Networks’ and ‘Innovations in Machine
Learning’. Dr. Holmes teaches a broad range of courses, including SAS programming,
Bayesian Networks and Data Mining. She was awarded the Distinguished Teaching Award
by the Academic Senate, UCSB, in 2008.
As well as being Associate Editor of the International Journal of Knowledge-Based
and Intelligent Information Systems, Dr. Holmes reviews extensively and is on the
editorial board of several journals, including the Journal of Neurocomputing. She
serves as Program Scientific Committee Member for numerous conferences, including
the International Conference on Artificial Intelligence and the International
Conference on Machine Learning. In 2009 Dr. Holmes accepted an invitation to join
the Center for Research in Financial Mathematics and Statistics (CRFMS), UCSB. She
was made a Senior Member of the IEEE in 2011.

Professor Lakhmi C. Jain is a Director/Founder of the Knowledge-Based Intelligent
Engineering Systems (KES) Centre, located at the University of South Australia. He
is a fellow of the Institution of Engineers Australia.
His interests focus on artificial intelligence paradigms and their applications in
complex systems, art-science fusion, e-education, e-healthcare, unmanned air
vehicles and intelligent agents.
