
tutor

DATA MINING

HIDDEN MESSAGES
Have you ever thought that maybe your data is trying to tell you something? As your business
expands, so do your data archives; and with gigabytes of information stored on servers,
disks and drives – you may be sitting on a gold mine. Data mining services unearth patterns
and trends that can help your business.

Government agencies, businesses and research firms are churning out raw data on every subject imaginable, at an ever-increasing rate. NASA's space probes keep phoning home with data, e-businesses accumulate information about customer habits, and Web servers log every user interaction. The cost of storage keeps dropping, so there's no difficulty finding a place to warehouse these terabytes of data. Although there may be important patterns and knowledge buried in the data, the sheer amount of information has grown beyond human analytical capacity. Data mining allows computers to take over the task of finding the patterns.

Data mining is the process of automatically extracting useful information and relationships from immense quantities of data. Scientists use it to separate signals from noise in astronomical data and to find genes within DNA sequences. Your company can use it to gain valuable knowledge about customers, site visitors and business practices. Using this knowledge, you can target advertising campaigns and evaluate their success. You can personalise Web pages and suggest related purchases. And you can predict customer behaviour, making your site more effective.

Figure 1: Affinity patterns reveal associations such as which products are purchased in the same transactions or which pages are often visited in the same session.

DEVELOPMENT OF DATA MINING
Data mining is a logical evolution in database technology. The earliest databases, which served as simple replacements for paper records, were data repositories that provided little more than the capability to summarise and report. With the development of query tools such as SQL, database managers were able to query data more flexibly. A

114 July 2001 www.DITnet.co.ae ■ www.pcmag-mideast.com



manager could, for example, determine how many cell phones were sold in Kalamazoo during June of 1980 or which salespeople brought in the most customers. Almost any quantitative question can be answered using these tools. OLAP (Online Analytical Processing) tools aid by making patterns visible. Given the correct view of the data, you might discover that trailer hitch sales in Texas are twice as high in February as in any other month, letting you know that you should adjust production to match or work on raising sales in the other months. These tools make patterns in data easier to see, but the manager still has to manipulate the data, look for patterns and decide which patterns are important. A totally unexpected or hidden pattern can go unnoticed simply because nobody thought to look for it.

Data mining automates the process of locating and extracting these hidden patterns and knowledge. In its purest form, data mining doesn't involve looking for specific information. Rather than starting from a question or a hypothesis, data mining simply finds patterns that are already present in the data.

INTRIGUING PATTERNS
To wring knowledge from raw data, data mining software uses a wide variety of complex algorithms including neural networks, rule induction, decision trees and genetic algorithms. Typically, the software performs its analysis on a portion of the data to obtain rules and patterns, then validates the results by testing them against the held-back data. In a scientific setting, this process can reveal relationships that aren't obvious or sift out real data from mere noise. In a business setting, this knowledge can be used to set policy, exploiting favourable patterns and avoiding bad ones.

Part of the challenge for data mining software involves generating results in terms understandable by humans. Among the more intelligible pattern types are associations, sequences and clusters. Association or affinity patterns simply identify database elements that occur together in a statistically significant fashion. For example, analysis of a huge database of customer shopping carts could reveal that nine out of ten visitors who bought calendars also bought pens. Sequence patterns are similar but with a time factor thrown in. A bank's records might show that 80 percent of customers declaring bankruptcy had obtained three or more new credit cards within the past year.

To identify clusters, the software models a multi-dimensional space in which each of possibly thousands of dimensions represents an attribute of the data. The program then segments the data into clusters based on their proximity in this imaginary space. Further analysis, either in software or with human intervention, selects clusters that have useful characteristics. A simple example would be an analysis of a sales database that uncovers a cluster of customers who make a very large number of small purchases. The company could reduce processing costs by offering an incentive for customers to combine their small orders into fewer, larger ones.

Figure 2: After the data mining software identifies clusters of similar records, the manager isolates clusters whose characteristics are interesting.

Data mining is also used to devise predictive patterns. Given a large database of customer transactions and a specific subset that are known to be fraudulent, the software could be directed to determine what simpler characteristics distinguish the fraudulent transactions from the rest. If successful, this will yield a rule that predicts which future transactions are likely to be fraudulent, and the



company can give extra scrutiny to those dealings.

Of course, most data sets are full of patterns that don't represent useful knowledge. You won't be impressed if a data mining tool reports that every customer in zip code 10016 has a New York address, or that every patient in the gynaecology department is female. But the same technique might uncover a pattern of double billing or reveal new avenues for targeted advertising. Human intervention is required to distinguish patterns that are useful.

Using the most accurate data is essential. As a precursor to serious data mining, companies will usually establish a data warehouse – a collection of data designed to support management decision making. The warehouse includes data from across the enterprise at a single point in time. As much as possible, the data is cleansed of errors and redundancy, and perhaps transformed into a format suitable for the mining program.

DATA MINING AND THE WEB
Every time you click on a URL, your browser requests the corresponding Web page, which a Web server supplies, logging the transaction. Further transactions may be required – to download images on the page, for example. The server's log of low level transactions is referred to as clickstream data. A large e-commerce site can generate millions of clickstream lines every day. Data mining software can find significant patterns in the clickstream logs alone, but that data becomes substantially more useful in combination with customer registration data. Linking clickstream entries with a specific customer lets you track that customer's travels through your site – this alone provides a vast new realm for discovery. The richest lode for data mining is a data warehouse containing clickstream data, user profile information and all of the company's other relevant databases.

PRACTICAL APPLICATIONS
One of the most cited publications in the data mining field is a 1991 doctoral dissertation by Usama Fayyad. As a graduate student, Fayyad worked with General Motors on extracting useful knowledge from an immense database of car repair data; the algorithms he devised became the basis for his dissertation. Fayyad went on to develop data-mining systems for NASA and Microsoft before founding digiMine in March 2000.

Data mining is not cheap. digiMine's hosted service starts at $7,500 a month; fully installed solutions cost hundreds of thousands. To see what data mining can do for you, we'll look at a few of the reports generated by digiMine.

Figure 1 shows a fairly simple report that reveals affinity relationships among the categories Accessories, Men's Clothing and Women's Clothing within the general category Cycle Shop > Mountain Bike. More than half of those who browsed accessories also browsed men's clothing. Slightly more than half the visitors who looked at accessories also looked at women's clothing. Over a third of the customers who toured accessories also browsed both men's and women's clothing. The manager viewing this data will have to dig deeper to determine the reason for these patterns. The customers could be buying accessories and clothing, in which case cross-selling makes sense. Another possibility is that the customers looked in one category, didn't find what they wanted, and switched to the other. In that case, certain products could be placed in different or multiple categories.

As noted above, data mining software can plot immense quantities of data in multi-dimensional space to find items with similar characteristics. After the software has done the heavy lifting, the database manager studies the characteristics of each cluster and identifies those that seem useful. In Figure 2, digiMine has identified eight significant clusters within a group of over 300,000 visitors, and the manager has named several of the clusters based on their characteristics. Comparison of the characteristics of One Hit Wonders and Return Visitors might suggest techniques for turning more of the former into the latter.

Once an interesting cluster has been identified, you can study it further using more traditional forms of analysis, such as the funnel report shown in Figure 3. A funnel report identifies how many users successfully negotiate each step of a multistep process. In the figure, the process is a purchase transaction. Just a few percent of the users drop off at each step; an unusually large drop at one step would be a red flag for the Webmaster to examine the Web page or form corresponding to that step.

Of course, you'll only get the benefit of data mining if you take action based on the knowledge it provides (and keep the data warehouse up to date). If you've identified a cluster of users based on certain simple characteristics, you can personalise your site for first-time visitors that fit those characteristics, or you can create targeted advertising campaigns. If the software shows that certain products are purchased together very frequently, link the two in your catalogue. If half of the customers who begin the checkout process drop out before completing the transaction, revamp the checkout system. Your company's data is full of undiscovered gems; start digging!
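
To make the idea of products "purchased together very frequently" concrete, here is a minimal Python sketch of an affinity (association) measure. It is not digiMine's method, which the article does not describe. The function name `pair_confidence`, the `min_count` cut-off and the sample carts are all assumptions invented for illustration. For each ordered pair of items, it reports the share of carts containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations


def pair_confidence(carts, min_count=1):
    """Return {(a, b): share of carts containing a that also contain b}.

    Pairs seen together in fewer than min_count carts are dropped, a crude
    stand-in for the "statistically significant" test a real tool would apply.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for cart in carts:
        items = set(cart)  # ignore duplicate items within one cart
        item_counts.update(items)
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {
        (a, b): n / item_counts[a]
        for (a, b), n in pair_counts.items()
        if n >= min_count
    }


# Hypothetical shopping carts, invented for this example.
carts = [
    ["calendar", "pen"],
    ["calendar", "pen", "stapler"],
    ["calendar"],
    ["pen"],
]

all_pairs = pair_confidence(carts)
frequent_pairs = pair_confidence(carts, min_count=2)
for (a, b), share in sorted(frequent_pairs.items()):
    print(f"{share:.0%} of carts with {a} also contain {b}")
```

Note how the single cart containing a stapler gives the pair (stapler, calendar) a share of 100 percent when no cut-off is applied. This is exactly the kind of pattern that does not represent useful knowledge until enough data supports it.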

Figure 3: Traditional data analysis techniques applied to the results of data mining can yield useful reports. This funnel report identifies how many users make it past each step of the checkout process.
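
A funnel report of this kind can be approximated in a few lines of Python. The sketch below is only an illustration under stated assumptions: the step names, the session data and the 30 percent alert threshold are hypothetical, and each session is reduced to the index of the furthest checkout step it reached. The function counts how many sessions reached each step and the fraction lost since the previous step.

```python
def funnel_report(steps, furthest_step_reached):
    """Return one (step name, sessions reached, drop-off fraction) row per step.

    furthest_step_reached holds, for each session, the 0-based index of the
    last step that session completed.
    """
    rows = []
    previous = None
    for index, name in enumerate(steps):
        reached = sum(1 for s in furthest_step_reached if s >= index)
        # Drop-off is measured against the previous step; the first step has none.
        drop = 0.0 if not previous else 1 - reached / previous
        rows.append((name, reached, drop))
        previous = reached
    return rows


# Hypothetical checkout steps and sessions, invented for this example.
steps = ["cart", "shipping", "payment", "confirm"]
sessions = [3, 3, 3, 1, 1, 1, 1, 0, 2, 3]

rows = funnel_report(steps, sessions)
for name, reached, drop in rows:
    print(f"{name:<10} {reached:>3} sessions  drop-off {drop:.0%}")

# Flag steps that lose an unusually large share of users (threshold is an assumption).
flagged = [name for name, _, drop in rows if drop > 0.30]
print("Steps to examine:", flagged)
```

In this invented data the payment step loses far more sessions than its neighbours, which is the kind of red flag that would send the Webmaster to inspect the corresponding page or form.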

