Acknowledgement: These lecture slides are adapted from the original slides accompanying Han’s textbook.
Administrative
2
CITS3401 and CITS5504
• Different websites
– http://teaching.csse.uwa.edu.au/units/CITS3401
– http://teaching.csse.uwa.edu.au/units/CITS5504
3
Common Assessment Structures
4
Textbook and Recommended Readings
5
Introduction to Data Mining
6
Why Data Mining?
7
Potential Applications
• Other Applications
– Text mining (newsgroups, email, documents) and Web mining
– Stream data mining
8
Example 1: Market Analysis
9
Example 2: Corporate Analysis and
Risk Management
10
Example 3: Fraud Detection and
Mining Unusual Patterns
11
Evolution of Sciences
12
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
13
Why Data Mining
Summary:
– Data is abundant, yet data archives are seldom visited.
– The volume of data far exceeds human ability for comprehension.
– Decisions based on intuition are prone to biases and errors, and are
extremely time-consuming and costly.
– Data mining tools perform data analysis and uncover important
data patterns, contributing greatly to business strategies,
knowledge bases, and scientific and medical research.
Figure: from data tombs to nuggets of knowledge
14
What is Data Mining?
15
What is Data Mining?
16
Data Mining: Confluence of Multiple
Disciplines
Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, and Other Disciplines
17
Steps of Knowledge Discovery
(KDD) Process
Figure (surviving labels): Databases, Data Cleaning, Data Integration, Task-relevant Data
18
Data Warehousing and Mining
Framework
19
KDD Process: Several Key Steps
20
Multi-Dimensional View of Data
Mining
• Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web,
multimedia, graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized (methodologies)
– Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
21
Data Mining: On What Kinds of
Data?
22
Relational Database
23
Relational Database
24
An Example - AllElectronics
25
Example of Queries
26
Purpose of relational databases
27
Data Warehouse of AllElectronics
28
Data Warehouse
29
Transactional Database
• Sample Queries:
– Show me all the items purchased by ‘X’
– How many transactions include item number ‘Y’?
– Market basket analysis: Which items sell well
together? (Frequent itemsets)
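The last two sample queries can be sketched in a few lines of Python. This is a minimal illustration only: the transactions, item names, and the helper functions `count_transactions_with` and `frequent_pairs` are hypothetical, and the brute-force pair counting stands in for a real frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"computer", "printer", "paper"},
    {"computer", "software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "printer", "software"},
]

def count_transactions_with(transactions, item):
    """Answer 'How many transactions include item Y?'."""
    return sum(1 for t in transactions if item in t)

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions
    containing both items) is at least min_support."""
    counts = Counter()
    for t in transactions:
        # Sort so each pair is counted under a single canonical key.
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(count_transactions_with(transactions, "printer"))
print(frequent_pairs(transactions, min_support=0.4))
```

Counting every pair is only practical for small data; it is shown here because it makes the notion of support easy to see.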
30
Knowledge View: What Knowledge Is to Be
Mined?
• Pattern discovery
– Mining frequent patterns, association and correlation
– Applying pattern mining in many other tasks
31
Data Mining Function: (1)
Characterization and Discrimination
32
Data Mining Function: (2)
Association and Correlation Analysis
33
Data Mining Function: (3)
Classification
34
Data Mining Function: (4) Cluster
Analysis
35
Data Mining Function: (5) Outlier
Analysis
• Outlier analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Most data mining methods discard outliers as noise or
exceptions.
– Noise or exception? ― One person’s garbage could be
another person’s treasure
– Methods: by-product of clustering or regression analysis,
distance analysis, statistical or probability models
– Useful in fraud detection, where rare events are more interesting
than regular ones
– Example: Detecting a purchase of an extremely large
amount for a given account number
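The purchase example can be sketched with a simple statistical model. This is one possible approach, not the only one: the account number, charge history, threshold `k`, and the helper `is_outlier` are all assumptions made for illustration.

```python
from statistics import mean, stdev

def is_outlier(history, amount, k=3.0):
    """Flag `amount` if it lies more than k sample standard deviations
    from the mean of the account's past charges (a simple z-score test)."""
    if len(history) < 2:
        return False  # too little data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # all past charges identical
    return abs(amount - mu) / sigma > k

# Hypothetical regular charges for one account number.
past_charges = {"A-1001": [42.0, 55.5, 38.0, 61.0, 47.5]}

print(is_outlier(past_charges["A-1001"], 60.0))    # ordinary purchase
print(is_outlier(past_charges["A-1001"], 4999.0))  # extremely large purchase
```

The new amount is compared against the account's own history, so what counts as "extremely large" differs from one account to another.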
36
Time and Ordering: Sequential
Pattern, Trend and Evolution Analysis
37
Structure and Network Analysis
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
• A person could be in multiple information networks: friends, family,
classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– The Web is a big information network: from PageRank to Google
– Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
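To make the idea of ranking nodes in a Web information network concrete, here is a simplified PageRank-style power iteration. It is a teaching sketch under stated assumptions, not the algorithm as deployed by any search engine: the tiny link graph, the damping factor of 0.85, and the function name `pagerank` are chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.
    `links` maps each page to the list of pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Pages that many other pages link to end up with the highest scores, which illustrates how links carry information about importance.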
38
Methodology View: Confluence of
Multiple Disciplines
Figure (surviving labels): Applications, Visualization, Data Mining
39
Why Confluence of Multiple
Disciplines?
40
Application View: Diverse Applications
41
Classification of Data Mining Systems
42
Summary (So Far)
43
Evaluation of Knowledge
44
Are All the “Discovered” Patterns
Interesting?
45
Find All and Only Interesting
Patterns?
46
Integration of Data Mining and Data
Warehousing
47
Coupling Data Mining with DB/DW
Systems
48
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space at multiple levels of
abstraction
– Data mining: An interdisciplinary effort
– Boosting the power of discovery in a networked environment
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Background knowledge (integrity constraints & deduction rules)
– Presentation and visualization of data mining results
49
Major Issues in Data Mining (2)
50
A Brief History of Data Mining Society
51