
1. Define data mining. Why are there many names and definitions for data mining?
Data mining is the process through which previously unknown
patterns in data are discovered. Another definition would be a process
that uses statistical, mathematical, artificial intelligence, and machine
learning techniques to extract and identify useful information and
subsequent knowledge from large databases. This includes most types
of automated data analysis. A third definition: Data mining is the process
of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions because it has been stretched beyond
its original scope by some software vendors to include most forms of data
analysis, in order to increase sales by capitalizing on the popularity of data mining.

What are the main reasons for the recent popularity of data mining?
Following are some of the most pronounced reasons:
More intense competition at the global scale, driven by customers' ever-changing needs and wants in an increasingly saturated marketplace.
General recognition of the untapped value hidden in large data
sources.
Consolidation and integration of database records, which enables a
single view of customers, vendors, transactions, etc.
Consolidation of databases and other data repositories into a single
location in the form of a data warehouse.
The exponential increase in data processing and storage technologies.
Significant reduction in the cost of hardware and software for data
storage and processing.
Movement toward the de-massification (conversion of information
resources into nonphysical form) of business practices.

Discuss what an organization should consider before making a decision to purchase data mining
software.
Technically speaking, data mining is a process that uses statistical,
mathematical, and artificial intelligence techniques to extract and identify
useful information and subsequent knowledge (or patterns) from large sets
of data. Before making a decision to purchase data mining software,
organizations should consider the standard criteria used when investing in
any major software: cost/benefit analysis, people with the expertise to use
the software and perform the analyses, availability of historical data, and a
business need for the data mining software.

Distinguish data mining from other analytical tools and techniques.


Students can view the answer in Figure 5.2 (p. 197), which shows that
data mining is a composite or blend of multiple disciplines or analytical tools
and techniques.

Discuss the main data mining methods. What are the fundamental differences among them?
Prediction is the act of telling about the future. It differs from simple guessing by taking
into account the experiences, opinions, and other relevant information in conducting the
task of foretelling. A term that is commonly associated with prediction is forecasting.
Even though many believe that these two terms are synonymous, there is a subtle but
critical difference between the two. Whereas prediction is largely experience and
opinion based, forecasting is data and model based. That is, in order of increasing
reliability, one might list the relevant terms as guessing, predicting, and
forecasting, respectively. In data mining terminology, prediction and
forecasting are used synonymously, and the term prediction is used as the common
representation of the act.
Classification: analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its similarity to those
groups
Clustering: finding groups of entities with similar characteristics
Association: establishing relationships among items that occur together
Sequence discovery: finding time-based associations
Visualization: presenting results obtained through one or more of the other methods
Regression: a statistical estimation technique based on fitting a curve defined by a
mathematical equation of known type but unknown parameters to existing data
Forecasting: estimating a future data value based on past data values.
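To ground the last two items, here is a minimal sketch (not drawn from the text; the sales figures are invented) of regression-based forecasting: a curve of known form with unknown parameters is fitted to past data and then used to estimate a future value.

```python
# A minimal sketch of regression and forecasting as defined above: fit a curve of
# known type (here a straight line) with unknown parameters to existing data,
# then estimate the next value. The past sales values are hypothetical.
import numpy as np

months = np.array([1, 2, 3, 4, 5, 6])                    # past periods
sales = np.array([10.2, 11.0, 12.1, 12.9, 14.2, 15.0])   # invented past values

slope, intercept = np.polyfit(months, sales, deg=1)       # estimate the unknown parameters
next_month = 7
forecast = slope * next_month + intercept                 # data- and model-based forecast

print(f"Fitted model: sales = {slope:.2f} * month + {intercept:.2f}")
print(f"Forecast for month {next_month}: {forecast:.2f}")
```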

What are the main data mining application areas? Discuss the commonalities of these areas that
make them a prospect for data mining studies.
Applications are listed near the beginning of this section (pp. 204-206):
CRM, banking, retailing and logistics, manufacturing and production,
brokerage, insurance, computer hardware and software, government,
travel, healthcare, medicine, entertainment, homeland security, and
sports.
The commonalities are the need for predictions and forecasting for
planning purposes and to support decision making.

Why do we need a standardized data mining process? What are the most commonly used data
mining processes?
In order to systematically carry out data mining projects, a general
process is usually followed. Similar to other information systems
initiatives, a data mining project must follow a systematic project
management process to be successful. Several data mining processes
have been proposed: CRISP-DM, SEMMA, and KDD.

Discuss the differences between the two most commonly used data mining processes.
The main difference between CRISP-DM and SEMMA is that CRISP-DM
takes a more comprehensive approach to data mining projects, including
understanding of the business and the relevant data, whereas SEMMA
implicitly assumes that the data mining project's goals and objectives,
along with the appropriate data sources, have been identified and
understood.
9. Are data mining processes a mere sequential set of activities?


Even though these steps are sequential in nature, there is usually a great
deal of backtracking. Because data mining is driven by experience and
experimentation, depending on the problem situation and the
knowledge/experience of the analyst, the whole process can be very
iterative (i.e., one should expect to go back and forth through the steps
quite a few times) and time consuming. Because later steps build on
the outcome of earlier ones, one should pay extra attention to the
earlier steps so as not to put the whole study on an incorrect path from
the onset.

10 Why do we need data preprocessing? What are the main tasks and relevant techniques used in
data preprocessing?
Data preprocessing is essential to any successful data mining study.
Good data leads to good information; good information leads to good
decisions. Data preprocessing includes four main steps (listed in Table 5.4
on page 211):
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix errors
data transformation: normalize the data, aggregate data, construct new
attributes
data reduction: reduce number of attributes and records; balance skewed
data
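The following is a small, hedged sketch of those four steps using pandas; the column names and values are hypothetical, and a real study would involve far more data and checks.

```python
# A minimal sketch of the four preprocessing steps listed above (hypothetical data).
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, np.nan, 120],          # a missing value and an implausible outlier
    "income": [52000, 48000, 48000, None, 61000],
    "bought": ["yes", "no", "no", "yes", "no"],
})

# Data consolidation: select the attributes of interest and filter duplicate records.
df = raw[["customer_id", "age", "income", "bought"]].drop_duplicates()

# Data cleaning: handle missing data and reduce noise (cap implausible ages).
df["age"] = df["age"].fillna(df["age"].median()).clip(upper=100)
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: normalize income to [0, 1] and construct a new attribute.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df["income_per_age"] = df["income"] / df["age"]

# Data reduction: drop attributes no longer needed for modeling.
model_ready = df.drop(columns=["income"])
print(model_ready)
```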
11 Discuss the reasoning behind the assessment of classification models.
The model-building step also encompasses the assessment and
comparative analysis of the various models built. Because there is not a
universally known best method or algorithm for a data mining task, one
should use a variety of viable model types along with a well-defined
experimentation and assessment strategy to identify the best method for
a given purpose.
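As a concrete illustration of such an assessment strategy, the sketch below compares two candidate classifiers with k-fold cross-validation on scikit-learn's bundled Iris data; the choice of data and models is an assumption for illustration, not an example from the chapter.

```python
# Assessing competing classification models with the same 5-fold cross-validation
# strategy, rather than trusting any single method to be universally best.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # same experimentation setup for each
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```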
12 What is the main difference between classification and clustering? Explain using concrete
examples.
Classification learns patterns from past data (a set of information,
such as traits, variables, and features, on characteristics of previously labeled
items, objects, or events) in order to place new instances (with unknown
labels) into their respective groups or classes. The objective of
classification is to analyze the historical data stored in a database and
automatically generate a model that can predict future behavior.
Classifying customer types as likely to buy or not buy is an example.

Cluster analysis is an exploratory data analysis tool for solving
classification problems. The objective is to sort cases (e.g., people,
things, events) into groups, or clusters, so that the degree of association
is strong among members of the same cluster and weak among
members of different clusters. Unlike classification, the classes are not
predefined; the groups emerge from the data itself. For example, customers
can be grouped according to demographics.
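To make the distinction concrete, the sketch below (with invented customer data) trains a classifier on labeled buy/no-buy history and, separately, clusters the same customers without any labels.

```python
# Classification learns from labeled customers; clustering groups unlabeled ones.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Each row: [age, annual income in $1000s]  (hypothetical data)
customers = np.array([[25, 40], [30, 55], [45, 90], [50, 95], [23, 35], [48, 88]])
bought = np.array([0, 0, 1, 1, 0, 1])   # known labels: 1 = bought, 0 = did not buy

# Classification: learn from labeled history, then predict for a new customer.
clf = DecisionTreeClassifier(random_state=0).fit(customers, bought)
print("Predicted 'buy' label for a new 40-year-old earning 80k:",
      clf.predict([[40, 80]])[0])

# Clustering: no labels at all; the algorithm discovers demographic groups itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print("Discovered cluster membership:", clusters)
```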
13. Moving beyond the chapter discussion, where else can association be used?
Students' answers will vary.
14 What are the most common myths and mistakes about data mining?
Data mining provides instant, crystal-ball predictions.
Data mining is not yet viable for business applications.
Data mining requires a separate, dedicated database.
Only those with advanced degrees can do data mining.
Data mining is only for large firms that have lots of customer data.
1. Compare data integration and ETL. How are they related?


Data integration consists of three processes that integrate data from multiple sources into a
data warehouse: accessing the data, combining different views of the data and capturing
changes to the data. It makes data available to ETL tools and, through the three processes of
ETL, to the analysis tools of the data warehousing environment.
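A rough sketch of the relationship follows, under the assumption of two hypothetical source extracts: the sources are integrated into a single view, lightly transformed, and loaded into a warehouse table that analysis tools can query.

```python
# Integrate two hypothetical source views of the same customers, then run a simple
# ETL step that loads the integrated result into an (in-memory) warehouse table.
import sqlite3
import pandas as pd

# Access: two source extracts with different views of the data.
orders = pd.DataFrame({"cust_id": [1, 2, 3], "order_total": [120.0, 75.5, 210.0]})
profiles = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["East", "West", "East"]})

# Combine the views into a single, integrated picture of each customer.
integrated = orders.merge(profiles, on="cust_id", how="inner")

# Transform: a small derived attribute, as an ETL step might add.
integrated["high_value"] = integrated["order_total"] > 100

# Load into a warehouse table that downstream analysis tools can query.
warehouse = sqlite3.connect(":memory:")
integrated.to_sql("customer_fact", warehouse, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM customer_fact", warehouse))
```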

2. What is a data warehouse and what are its benefits? Why is Web accessibility important with a data warehouse?
A data warehouse can be defined (Section 5.2) as a pool of data produced to support
decision making. This focuses on the essentials, leaving out characteristics that may vary
from one DW to another but are not essential to the basic concept.
The same paragraph gives another definition: "a subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of management's decision-making
process." This definition adds more specifics, but in every case appropriately: it is hard, if
not impossible, to conceive of a data warehouse that would not be subject-oriented,
integrated, etc.
The benefits of a data warehouse are that it provides decision-making information,
organized in a way that facilitates the types of access required for that purpose and
supported by a wide range of software designed to work with it.
Web accessibility of a data warehouse is important because many analysis
applications are Web-based, because users often access data over the Web (or over an
intranet using the same tools) and because data from the Web may feed the DW.
(The first part of this question is essentially the same as Review Question 1 of
Section 5.2. It would be redundant to assign that question if this one is to be answered as
well.)

3. A data mart can replace a data warehouse or complement it. Compare and discuss these options.

For a data mart to replace a data warehouse, it must make the DW unnecessary. This would
mean that all the analyses for which the DW would be used can instead be satisfied by a
DM (or perhaps a combination of several DMs). If this is so, it can be much less expensive,
in terms of development and computer resources, to use multiple DMs (let alone one DM!)
instead of an overall DW.
In other situations, a data mart can be used for some of the analyses that would in its
absence use the DW, but not all of them. For those, the smaller DM is more efficient, quite
possibly enough more efficient to justify the cost of having a DM in addition to a DW.
Here the DM complements the DW.
4. Discuss the major drivers and benefits of data warehousing to end users.
Major drivers include:

Increased competition and pace of business, leading to an increased need for good decisions quickly
Successful pioneering experiences with data warehouses, leading to their wider user acceptance
Decreasing hardware costs, making terabyte databases with masses of historical data economically feasible for more firms
Increased availability of software to manage a large data warehouse
Increased availability of analysis tools, making DWs potentially more useful
Increased computer literacy of decision makers, making them more likely to use these tools
(See Review Question 1, Section 5.6 for list of benefits.)

5. List the differences and/or similarities between the roles of a database administrator and a data warehouse administrator.
Since a data warehouse is a specific type of database designed for a specific application
area, a data warehouse administrator has all the roles of a database administrator, plus
others. One new role is advising on decision support uses of the DW, for which a DWA
needs to understand decision-making processes. Beyond that, the issue is more a need for
additional skills in the same roles as a DBA (e.g., understanding high-performance
hardware to deal with the large size of a DW) than it is one of additional roles. (See
Review Question 2, Section 5.8, for a list of skills.)

6. Describe how data integration can lead to higher levels of data quality.
A question involving the word "higher" (or any other comparative, for that matter) requires
asking "higher than what?" In this case, we can take it to mean higher than we would have
for the same data, but without a formal data integration process.
Without a data integration process to combine data in a planned and structured
manner, data might be combined incorrectly. That could lead to misunderstood data (a
measurement in meters taken as being in feet) and to inconsistent data (data from one
source applying to calendar months, data from another to four-week or five-week fiscal
months). These are aspects of low-quality data which can be avoided, or at least reduced, by
data integration.

7. Discuss security concerns involved in building a data warehouse.

Security and privacy concerns are important in building a data warehouse:


1. Laws and regulations, in the U.S. and elsewhere, require certain safeguards on databases that contain the type of information typically found in a DW.
2. The large amount of valuable corporate data in a data warehouse can make it an attractive target.
3. The need to allow a wide variety of unplanned queries in a DW makes it impractical to restrict end-user access to specific, carefully constrained screens, one way to limit potential violations.
8. Investigate current data warehouse development implementation through offshoring. Write a report about it. In class, debate the issue in terms of the benefits and costs, as well as social factors.
Open-ended answer to the report; it is impossible to predict what the debate will bring.
A student's position on this issue is related to his/her feelings on the relationship of
national economies to the global economy. It can be argued that offshoring improves the
global economy while potentially harming one or more of the national economies involved,
such as the student's own. U.S. students may see primarily the damage they perceive it
does to their national economy (and to their own career prospects), but students in India
may take a different view. The economic, political and philosophical issues can be pursued
well beyond what is practical in a DSS course.
If you feel students are too nationalistic on this issue, you can ask them if they feel
the same way about a Massachusetts or California bank processing checks in Alabama to
reduce labor costs. (This example uses U.S. territories, but similar issues exist in any
country large enough to have regional economic differences.)

What is text mining? How does it differ from data mining?


Text mining is the application of data mining to unstructured, or less
structured, text files. As the names indicate, text mining analyzes words,
whereas data mining analyzes numeric data.

Why is the popularity of text mining as a BI tool increasing?


The information age is characterized by rapid growth in the amount of
data and information collected, stored, and made available in electronic
media. The popularity of text mining as a BI tool is increasing because of the
rapid growth in text data and the availability of sophisticated BI tools.
The benefits of text mining are obvious in the areas where very large
amounts of textual data are being generated, such as law (court orders),
academic research (research articles), finance (quarterly reports), medicine
(discharge summaries), biology (molecular interactions), technology (patent
files), and marketing (customer comments).

What are some popular application areas of text mining?

Information extraction. Identification of key phrases and relationships
within text by looking for predefined sequences in text via pattern
matching.
Topic tracking. Based on a user profile and documents that a user
views, text mining can predict other documents of interest to the user.
Summarization. Summarizing a document to save time on the part of
the reader.
Categorization. Identifying the main themes of a document and then
placing the document into a predefined set of categories based on
those themes (a small sketch of this task follows this list).
Clustering. Grouping similar documents without having a predefined
set of categories.
Concept linking. Connects related documents by identifying their
shared concepts and, by doing so, helps users find information that
they perhaps would not have found using traditional search methods.
Question answering. Finding the best answer to a given question
through knowledge-driven pattern matching.
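As promised above, here is a small illustration of the categorization task: short documents are placed into predefined categories. The tiny training set is invented, so this is only a toy example of the idea, not a production text miner.

```python
# Categorization: learn from a few labeled documents, then place new documents
# into the predefined categories. Documents and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "quarterly revenue and earnings beat forecasts",
    "shares fell after the profit warning",
    "the patient was discharged after treatment",
    "clinical trial results for the new drug",
]
labels = ["finance", "finance", "medicine", "medicine"]

categorizer = make_pipeline(TfidfVectorizer(), MultinomialNB())
categorizer.fit(docs, labels)

print(categorizer.predict(["drug therapy reduced symptoms",
                           "stock prices rose on strong earnings"]))
```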

What is natural language processing?


Natural language processing (NLP) is an important component of text
mining and is a subfield of artificial intelligence and computational
linguistics. It studies the problem of understanding the natural human
language, with the view of converting depictions of human language (such
as textual documents) into more formal representations (in the form of
numeric and symbolic data) that are easier for computer programs to
manipulate.

1. How does NLP relate to text mining?


Text mining uses natural language processing to induce structure into the
text collection and then uses data mining algorithms such as classification,
clustering, association, and sequence discovery to extract knowledge from
it.
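A rough sketch of that two-stage idea follows; it uses simple TF-IDF weighting as a stand-in for deeper NLP and invented documents, so it should be read as an illustration of the pipeline rather than a full text mining system.

```python
# Step 1: induce numeric structure from raw text; Step 2: apply an ordinary
# data mining algorithm (here, clustering) to the structured result.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "the court issued a new order in the case",
    "the judge delayed the ruling until next month",
    "quarterly earnings grew faster than expected",
    "the company reported record revenue this quarter",
]

# Structure: documents become numeric vectors.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)

# Knowledge extraction: group the structured documents.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", labels)
```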

2. What are some of the benefits and challenges of NLP?


NLP moves beyond syntax-driven text manipulation (which is often called
"word counting") to a true understanding and processing of natural
language that considers grammatical and semantic constraints as well as
context. The main challenges include the following (a brief tagging sketch follows this list):
Part-of-speech tagging. It is difficult to mark up terms in a text as corresponding to a
particular part of speech because the part of speech depends not only on the definition of
the term but also on the context within which it is used.
Text segmentation. Some written languages, such as Chinese, Japanese, and Thai, do not
have single-word boundaries.
Word sense disambiguation. Many words have more than one meaning. Selecting the
meaning that makes the most sense can only be accomplished by taking into account the
context within which the word is used.
Syntactic ambiguity. The grammar for natural languages is ambiguous; that is, multiple
possible sentence structures often need to be considered. Choosing the most appropriate
structure usually requires a fusion of semantic and contextual information.
Imperfect or irregular input. Foreign or regional accents and vocal impediments in speech
and typographical or grammatical errors in texts make the processing of the language an
even more difficult task.
Speech acts. A sentence can often be considered an action by the speaker. The sentence
structure alone may not contain enough information to define this action.
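To make the part-of-speech tagging challenge concrete, the brief sketch below tags the word "book" in two contexts using NLTK; the exact resource names to download can vary by NLTK version, so treat it as an illustrative example rather than a fixed recipe.

```python
# The same word ("book") receives different part-of-speech tags depending on context.
# Requires NLTK and its tokenizer/tagger data (names may differ by NLTK version).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Please book a flight to Boston.",
                 "She is reading a good book."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
```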
3. What are the most common tasks addressed by NLP?
Following are among the most popular tasks:
Information retrieval.
Information extraction.
Named-entity recognition.
Question answering.
Automatic summarization.
Natural language generation.
Natural language understanding.
Machine translation.
Foreign language writing.
Speech recognition.
Text-to-speech.
Text proofing.
Optical character recognition.
What are some of the most popular text mining software tools?
1. ClearForest offers text analysis and visualization tools.
2. IBM Intelligent Miner Data Mining Suite, now fully integrated into IBM's
InfoSphere Warehouse software, includes data and text mining tools.
3. Megaputer Text Analyst offers semantic analysis of free-form text,
summarization, clustering, navigation, and natural language retrieval with
search dynamic refocusing.

4. SAS Text Miner provides a rich suite of text processing and analysis
tools.
5. SPSS Text Mining for Clementine extracts key concepts, sentiments,
and relationships from call-center notes, blogs, e-mails, and other
unstructured data and converts it to a structured format for predictive
modeling.
6. The Statistica Text Mining engine provides easy-to-use text mining
functionality with exceptional visualization capabilities.
7. VantagePoint provides a variety of interactive graphical views and
analysis tools with powerful capabilities to discover knowledge from text
databases.
8. The WordStat analysis module from Provalis Research analyzes textual
information such as responses to open-ended questions, interviews, etc.
4. Why do you think most of the text mining tools are offered by statistics companies?
Students should mention that many of the capabilities of data mining
apply to text mining. Since statistics companies offer data mining tools,
offering text mining is a natural business extension.

5. What do you think are the pros and cons of choosing a free text mining tool over a commercial
tool?

Free tools have fewer features, more difficult user interfaces, lack
support, and have slower or reduced processing capabilities. The advantage
of free tools is obviously the cost.
What is Web content mining? How does it differ from text mining?
Web content mining refers to the extraction of useful information from
Web pages. The documents may be extracted in some machine-readable
format so that automated techniques can generate some information about
the Web pages. It is essentially text mining applied to the (often less
structured) content of Web pages rather than to ordinary document collections.
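A minimal sketch of the idea, using a placeholder URL: fetch a page, strip the markup, and keep the text and outgoing links in machine-readable form for later mining.

```python
# Fetch a page and extract its text (for content mining) and its links
# (which Web structure mining would analyze further). URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
page_text = soup.get_text(separator=" ", strip=True)             # content for text mining
links = [a.get("href") for a in soup.find_all("a", href=True)]   # structure for later use

print(page_text[:200])
print("Outgoing links:", links)
```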

6. Define Web structure mining, and differentiate it from Web content mining.
Web structure mining is the process of extracting useful information from
the links embedded in Web documents. Whereas Web content mining analyzes
what a page contains, Web structure mining analyzes how pages link to one another.

7. What are the main goals of Web structure mining?


It is used to identify authoritative pages and hubs, which are the
cornerstones of the contemporary page-rank algorithms that are central to
popular search engines such as Google and Yahoo!
8. What are hubs and authorities? What is the HITS algorithm?
A search on the Web to obtain information on a specific topic usually
returns a few relevant, high-quality Web pages and a larger number of
unusable Web pages. Use of an index based on authoritative pages (or
some measure of it) will improve the search results and ranking of relevant
pages.
The structure of Web hyperlinks has led to another important category of
Web pages called a hub. A hub is one or more Web pages that provide a
collection of links to authoritative pages.
HITS is a link analysis algorithm that rates Web pages using the hyperlink
information contained within them. In the context of Web search, the HITS
algorithm first collects a base document set for a specific query and then
iteratively computes hub and authority scores from the hyperlink structure of that set.
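The following compact sketch illustrates the mutually reinforcing hub/authority computation at the heart of HITS on a tiny, invented link graph; it omits the query-dependent collection of the base set and other refinements of the full algorithm.

```python
# Toy HITS-style iteration: authority and hub scores reinforce each other
# and are normalized every round. The link graph is hypothetical.
links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # Authority score: sum of hub scores of pages that link to it.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score: sum of authority scores of the pages it links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores stay comparable across iterations.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("Authorities:", {p: round(v, 3) for p, v in auth.items()})
print("Hubs:       ", {p: round(v, 3) for p, v in hub.items()})
```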
