[Table: sample records with the columns Age, Sex, Marital Status, Education and Income; the row alignment was lost in extraction.]
Figure 11: Examples of data in Public Use Microdata Sample data sets.
Source: Hand et al., Principles of Data Mining.
Data come in many forms, and developing a complete taxonomy is beyond the scope of this paper.
Indeed, it is not even clear that a complete taxonomy can be developed, since an aspect of data
that is important in one situation may be unimportant in another.
There are certain basic distinctions to which one should draw attention. One is the difference
between quantitative and categorical measurements (different names are sometimes used for these).
A quantitative variable is measured on a numerical scale and can, at least in principle, take any value.
The columns Age and Income in Figure 11 are examples of quantitative variables. In contrast,
categorical variables such as Sex, Marital Status and Education in Figure 11 can take only certain,
discrete values. The common three-point severity scale used in medicine (mild, moderate, severe) is
another example. Categorical variables may be ordinal (possessing a natural order, as in the
Education scale) or nominal (simply naming the categories, as in the Marital Status case). A data
analytic technique appropriate for one type of scale might not be appropriate for another. For
example, were marital status represented by integers (e.g., 1 for single, 2 for married, 3 for widowed,
and so forth), it would generally not be meaningful or appropriate to calculate the arithmetic mean of
a sample of such scores using this scale. Similarly, simple linear regression (predicting one
quantitative variable as a function of others) will usually be appropriate to apply to quantitative data,
but applying it to categorical data may not be wise; other techniques that have similar objectives (to
the extent that the objectives can be similar when the data types differ) might be more appropriate
with categorical scales. [Hand et al.]
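The point about scale types can be illustrated with a minimal Python sketch. The integer coding of marital status follows the example in the text; the sample values and the helper name `most_common_label` are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean

# Quantitative variable: ages (illustrative sample values).
ages = [29, 9, 85, 40, 38]

# Nominal variable coded as integers, as in the example in the text.
CODE_LABELS = {1: "single", 2: "married", 3: "widowed"}
marital_codes = [1, 2, 2, 2, 3, 3]


def most_common_label(codes, labels):
    """Return the label of the most frequent code (the mode)."""
    code, _count = Counter(codes).most_common(1)[0]
    return labels[code]


mean_age = mean(ages)             # meaningful: age lies on a numerical scale
mean_code = mean(marital_codes)   # computable, but it has no interpretation
typical_status = most_common_label(marital_codes, CODE_LABELS)

print(f"mean age: {mean_age}")
print(f"mean marital code: {mean_code:.2f} (not a meaningful quantity)")
print(f"most common marital status: {typical_status}")
```

The mean of the codes depends entirely on the arbitrary numbering of the categories, whereas the mode stays the same under any relabelling.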
Measurement scales, however defined, lie at the bottom of any data taxonomy. Moving up the
taxonomy, one finds that data can occur in various relationships and structures. Data may arise
sequentially in time series, and the data mining exercise might address entire time series or particular
segments of those time series. Data might also describe spatial relationships, so that individual
records take on their full significance only when considered in the context of others.
METHODS OF DATA MINING
3.1.2. Definitions of Data Mining
Translated word for word, data mining means mining or digging in data with the purpose of
finding information or knowledge. A more abstract and very well known definition is that of
Frawley, who defines data mining as "the nontrivial extraction of implicit, previously
unknown, and potentially useful information from data". [Frawley 1992]
Groth mentions another interesting aspect of data mining. He describes it as "the process of
automating information discovery". [Groth] Today data mining is a term that covers a broad range
of techniques to analyze data. The techniques use specific algorithms to identify and extract patterns
and establish unknown relationships in order to discover hidden and valuable information in a huge
amount of data. Most companies already collect massive quantities of data. Data mining techniques
can be implemented on existing software and hardware platforms to enhance the value of existing
information resources. [Thearling 2005]
In the words of Moxon: "Data mining is the process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data, using pattern recognition technologies
as well as statistical and mathematical techniques." Data mining is a "knowledge discovery process of
extracting previously unknown, actionable information from very large databases." [Moxon 1996]
According to their final goal, data mining techniques can be considered to be descriptive or
predictive: "Descriptive data mining intends to summarize data and to highlight their interesting
properties, while predictive data mining aims to build models to forecast future behaviours". [Han
and Kamber 2001]
3.2. KDD - Knowledge Discovery in Databases
Knowledge Discovery in Databases, often abbreviated as KDD, is the concept of
"extracting previously unknown and potentially useful information from large sets of data".
[Witnessminer 2005] KDD is thus a multistage process that identifies patterns in
data in order to find new information. Data mining is only one stage in the KDD process, the one
concerned with applying computational techniques to find patterns in data. This step consists of
algorithms that deliver patterns from a defined database in an acceptable time. Other stages in the
KDD process address the comprehensibility and the validity of the discovered patterns. In theory and
practice the expressions KDD and data mining are often mixed up. But it is important to understand
that KDD is the whole concept and data mining is only one step in this concept of extracting
information from data. Put simply, KDD is the concept and data mining is the tool. [Witnessminer]
The five main processes that are common to almost all of the methods are: Task Analysis,
Pre-processing, Data Mining, Post-processing and Deployment. These are shown diagrammatically
in Figures 12 and 13.
Figure 12: Knowledge discovery in Databases
Source: [Lesley 2004]
Figure 13: Knowledge discovery in Databases
Source: [Lesley 2004]
3.3. Data Mining and Data Warehouse
The evolution of database technology is an essential prerequisite for understanding the need for
knowledge discovery in databases (KDD). Data mining is a pivotal step in the KDD process: the
extraction of interesting patterns from a set of data sources (relational,
transactional, object-oriented, spatial, temporal, text, and legacy databases, as well as data
warehouses and the World Wide Web). The patterns obtained are used to describe concepts, to
analyze associations, to build classification and regression models, to cluster data, to model trends in
time series, and to detect outliers. Since the patterns present in data are not all equally
useful, interestingness measures are needed to estimate the relevance of the discovered patterns and
to guide the mining process.
The first step toward building a productive data mining program is to gather data. Most businesses
already perform these data gathering tasks to a very high extent. [Chapple 2005]
Very often a data warehouse is used to manage and store the gathered data. Because of the huge
amount of stored data, the key is to locate the data critical to the business. Companies therefore use
data mining tools to discover new information in the data stored in the data
warehouse. The data warehouse is the data foundation for all the analyses of the data mining tools.
Data mining helps companies focus on the most important information in their data warehouses.
[Thearling] The major analysis of this work is done within the framework of SAP BW 3.5, which
includes a number of data mining methods; these are described in detail in the
next chapter.
3.4. Common uses of Data Mining
Data mining tools can predict future trends and behaviours, allowing businesses to make proactive,
knowledge-driven decisions. The automated, prospective analyses offered by data mining move far
beyond the analyses of past events provided by the retrospective tools typical of decision
support systems.
Data mining tools can give answers to business questions that traditionally were time consuming to
resolve. [Thearling] Today data mining is primarily used by companies with a strong consumer
focus: retail, financial, communication, and marketing organizations. It enables them to determine
relationships among "internal" factors such as price, product positioning, or staff skills, and
"external" factors such as economic indicators, competition, and customer demographics.
Furthermore, it enables these companies to determine the impact on sales, customer satisfaction,
and corporate profits, and to "drill down" into summary information to view detailed
transactional data. [Palace 2005]
As described in 3.1 "A general introduction in Data Mining", a large number of companies use data
mining today, and the list of these companies looks like a Fortune 500 "Who's Who". [Groth] The
purposes for which data mining is used are as different as the companies themselves. Here are a few
areas in which companies use data mining to achieve a strategic benefit:
Direct Marketing
The idea here is to find out who is most likely or most desirable to buy certain products. This
information can be used for several marketing activities.
Trend Analysis
With trend analysis companies are able to predict trends in the marketplace. Using this information
can lead to a strategic advantage because it is useful in reducing costs and time to market.
Fraud Detection
Companies use data mining techniques to model which business transactions are likely to be
fraudulent. This is used for insurance claims, cellular phone calls or credit card purchases.
Forecasting in Financial Markets
There are many possibilities to model financial markets with data mining methods. For example,
neural networks can be used for financial gain. [Groth]
Apart from these applications, companies also use data mining for: [Bao 2005]
Business information
  Investment analysis
  Loan approval
Manufacturing information
  Controlling and scheduling
  Network management
  Experiment result analysis
Scientific information
  Sky survey cataloguing
  Bio sequence databases
  Geosciences: Quakefinder
Performance and monitoring of standard software systems
The main purpose of this paper is to find out how to introduce data mining functionality to support
the SAP BW administrator in areas such as data loading, reporting and planning, in order to
proactively discover error situations. From these investigations it is quite clear that
companies would like to come up with product functionality that assists system
administrators, for instance by showing how data mining methods could ease the day-to-day
activities of the SAP BW administrator.
3.5. The process of Data Mining
Data mining should be regarded as a strategic and competitive move. So before the data mining
process starts, the goal that is the focus of the analysis should be clarified. Otherwise it is not
possible to search for new valuable information, because the necessary parameters cannot be defined.
There are different models for the data mining process, depending on the task at hand. The following
description is based on the model of Fayyad. [Fayyad]
Step 1: Data selection
The needed data are selected from a database according to their objects and characteristics.
Step 2: Pre-Processing
In this step the selected data are cleaned. This means, for example, filling in missing
values.
Step 3: Transformation
In the transformation phase the data are transformed into new formats, if necessary.
Step 4: Data Mining
This step of the process identifies the patterns and relationships in the data.
Step 5: Interpretation and Evaluation
In the last step the result has to be interpreted and evaluated to come up with suitable actions.
The following picture shows the process in a graphical representation.
Figure 14: The process of Data mining
Source: SPSS, Clementine .0 user's guide.
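The five steps can be sketched as a small chain of Python functions. This is only an illustration under stated assumptions: the records, the field names, the mean-based fill, the min-max scaling and the trivial "pattern" in step 4 are all hypothetical choices, not part of the model described above.

```python
from statistics import mean

# Hypothetical source records; None marks a missing value.
DATABASE = [
    {"id": 1, "age": 29, "income": 100000, "sex": "M"},
    {"id": 2, "age": 40, "income": None, "sex": "F"},
    {"id": 3, "age": 38, "income": 20000, "sex": "M"},
    {"id": 4, "age": 49, "income": 30000, "sex": "F"},
]


def select(rows, fields):
    """Step 1: keep only the attributes needed for the analysis."""
    return [{f: row[f] for f in fields} for row in rows]


def preprocess(rows, field):
    """Step 2: fill missing values with the mean of the known values."""
    fill = mean([row[field] for row in rows if row[field] is not None])
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


def transform(rows, field):
    """Step 3: rescale a field to the range 0..1 (min-max scaling)."""
    values = [row[field] for row in rows]
    low, high = min(values), max(values)
    return [{**row, field: (row[field] - low) / (high - low)} for row in rows]


def mine(rows):
    """Step 4: a deliberately trivial pattern, mean scaled income per age group."""
    groups = {"under_40": [], "40_plus": []}
    for row in rows:
        groups["under_40" if row["age"] < 40 else "40_plus"].append(row["income"])
    return {name: mean(values) for name, values in groups.items()}


def interpret(pattern):
    """Step 5: turn the pattern into a statement a person can act on."""
    best = max(pattern, key=pattern.get)
    return f"Group '{best}' has the highest average scaled income."


selected = select(DATABASE, ["age", "income"])
cleaned = preprocess(selected, "income")
scaled = transform(cleaned, "income")
pattern = mine(scaled)
print(interpret(pattern))
```

Each function takes the output of the previous one, which mirrors the strictly sequential reading of the model; real projects usually loop back to earlier steps.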
Cross-industry process for Data mining, CRISP
The CRISP method is one of several available learning methods. It encompasses all the facets of
learning, from conception through to the realization and deployment of the gained
information. It begins, as can be seen from Figure 3.4 below, with an analysis or a business
understanding of the problem. Questions on the relationship between the operating factors are
asked at this stage. The dependence of one factor on another (or on several others) is also stipulated
at this stage.
After a business understanding has been laid down, understanding the data becomes the next task
according to the CRISP model. What tables have to be created? How would the tables be made
available? Would a single data instance be enough, or would several data instances be needed? What
about the quality of the data? Based on the understanding of the data, the business understanding
may have to be adjusted, or additional inputs made to the data (e.g. creation of additional tables), so
as to be able to realize the desired business objective. Data preparation then follows. Based on the
analysis desired, columns might have to be filtered out, or data aggregated, merged, etc. The modelling
process can then be carried out. As can be seen from Figure 14, additional data
preparation may be needed to realize the desired model.
An evaluation of the whole process follows. In some cases, a supervised form of learning might be
very helpful here. Interim results would be checked against the historical data to ascertain
the level of conformity, which also serves in the evaluation of the entire process.
The gained information or intelligence can now be deployed. The destination could be another
system, say an ERP system like SAP CRM, or a database system. The deployed results could be final
reports, presentations, action plans, etc. They could also be used for further analysis. Moreover,
feedback could be given to the initial business understanding for the purpose of further analysis,
after which the entire process would be repeated.
The overall process involved in the CRISP model can be summarized as follows: [CRISP, 2005]
Figure 15: Phases of the CRISP-DM Process Model
Source: CRISP, 2005.
Business Understanding: Description of the business objective and data mining goals/success
Data Understanding: Selection of the data and exploratory analysis (quality, problems, description of selected data)
Data Preparation: Cleaning, transformation, integration, formatting of the selected data
Modelling: Selection, building, testing and running different models
Evaluation: Approval of the models and assessment of the results (in accordance with the defined objectives), review of the process
Deployment: Preparation of final reports, presentation, action plans and deployment of results
4. Methods of Data Mining
4.1. An overview of Data Mining Methods
The last chapter discussed the overview and the tasks of data mining. To realize
these tasks, the data mining methods still need to be described. "Data mining methods detect
patterns in large amounts of data, and use these patterns to detect future instances in similar data."
[Zadok and Stolfo 2005]
"There are many kinds of data mining methods. Some are well founded in mathematics and
statistics, whereas others are used simply because they produce useful results." [Lidal and Dingsoyr
2005] Because data mining has emerged from many different fields, different kinds of methods can
be used in different areas. "Researchers have approached the knowledge discovery process from
different angles, with different algorithms, based on their scientific interests and backgrounds."
[Lidal and Dingsoyr] But no single method can solve all data mining problems. Some methods serve
several tasks at the same time; Figure 16 gives a short summary of the tasks and the different
methods.
Tasks | Methods
Prediction & Description | Decision trees, Market basket analysis (Association analysis), Time series analysis, Neural networks, Agent network technology
Classification | Market basket analysis (Association analysis), Decision trees, Neural networks, Sorting
Regression | Linear regression, Logistic regression, Multinomial regression
Clustering | Cluster analysis, Neural networks
Summarization | Genetic algorithms
Dependency modelling | Analysis of variance, Link analysis
Change and deviation detection | Fuzzy logic
Figure 16: Data mining tasks and methods
The following section introduces some of the data mining methods that are available as part of SAP
BW and are commonly used in practice. They are not covered in great detail, only far enough to give
a fundamental understanding of them.
4.2. The SAP data mining workbench
The SAP Business Information Warehouse is a complete suite of applications, i.e. a solution which
includes the activities of data collection and storage, decision support systems, query and reporting,
online analytical processing, statistical analysis, forecasting, and data mining. In SAP BW, data from
the disparate databases of all systems in the enterprise are collected, consolidated, administered and
provided for analysis and planning purposes. This data often holds further valuable potential.
Even with sophisticated analysis tools, new information in the form of meaningful
relationships between the data is often hidden or too complex to be uncovered through pure
observation or intuition. With the assistance of SAP BW, it is possible to investigate
and identify these hidden or complex relations between the data. For this discovery process, several
methods are provided (e.g. statistical and mathematical calculations, data cleansing and restructuring
methods, etc.). The intelligence gained can be uploaded automatically into the SAP BW database
or redirected into an operational system like SAP CRM. In either case, the intelligence is made
available for all decision-making and/or application processes and can thus be of significant
importance: strategically, tactically, and operationally.
The SAP Data Mining Workbench offers a single point of entry for access to the available data mining
models, namely:
Decision trees
Clustering
Association analysis (Market Basket analysis)
Approximation (Regression and Weighted score tables)
ABC classification
It also provides an option to connect with third-party data mining models. For each model type
a wizard guides the user through the process of creating the model, thus enabling users interested in
analytical results to set up data mining models easily. The following figure shows the process steps
for the analytical models available as part of the SAP data mining workbench.
Figure 17: Process steps for applying analytical methods
Source: SAPCOURSE, CR900, mySAP CRM Analytics.
There are two broad classes of data mining methods: supervised and unsupervised learning.
In supervised learning, a data sample is first selected and the system is
"trained" on it so as to understand the dynamics involved. The result is then weighed against the known
historical data to see the extent to which the system's output corresponds to the known output.
Further learning might have to be applied, as much as is needed, until the system
turns out answers that largely (mostly 99.99%) reflect the decisions already made on historical data.
Unsupervised learning, on the other hand, is fundamentally where data mining plays a
great role. A heap of data is "mined" to discover the complex, hidden and unexpected
relationships and correlations that may exist in it. The process can be run as often as desired,
and it is basically done with no form of bias, unlike in
supervised learning.
Supervised learning is mostly predictive, while unsupervised learning is largely informative. This is
because in supervised learning the interim result is weighed against historical data with known output
to see whether the result corresponds with known cases. The following sections introduce some data
mining methods that are commonly used and are part of SAP's offering, not in great detail, but far
enough to give a fundamental understanding of them.
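The idea of weighing a trained system against known historical outcomes can be sketched in a few lines of Python. The one-dimensional threshold rule, the function names and all data values below are assumptions chosen only to keep the example small; they are not the learning procedure of any particular tool.

```python
from statistics import mean


def train_threshold(sample):
    """Learn a cut-off halfway between the mean inputs of the two known classes."""
    mean_0 = mean([x for x, label in sample if label == 0])
    mean_1 = mean([x for x, label in sample if label == 1])
    return (mean_0 + mean_1) / 2


def predict(threshold, x):
    """Classify an input as 1 if it lies above the learned cut-off, else 0."""
    return 1 if x > threshold else 0


def agreement(threshold, history):
    """Share of historical cases where the prediction matches the known output."""
    hits = sum(1 for x, label in history if predict(threshold, x) == label)
    return hits / len(history)


# Hypothetical training sample and historical data: (input value, known output).
training_sample = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
historical_data = [(0.5, 0), (4.0, 0), (5.5, 0), (6.0, 1), (9.5, 1)]

threshold = train_threshold(training_sample)
score = agreement(threshold, historical_data)
print(f"threshold={threshold}, agreement with history={score:.0%}")
```

If the agreement is judged too low, more training data or a richer model would be tried and the comparison repeated.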
4.2.1. Approximation
A statistical orientation is one of the main sensible ways to analyze data. The purpose of
approximation (scoring) is to valuate the data records. SAP offers weighted score tables and
regression analysis, namely linear regression and non-linear regression (logistic and multinomial
regression), to perform the valuation.
4.2.1.1. Regression Analysis
Regression is a function that maps a data item to a real-valued prediction variable. It thus predicts
the value of a continuous-valued variable based on the values of other variables, assuming a linear or
nonlinear model of dependency. [Kumar and Joshi] There are many regression applications in
practice, e.g., predicting the amount of biomass present in a forest given remotely sensed
microwave measurements, estimating the probability that a patient will die given the results of a set
of diagnostic tests, predicting consumer demand for a new product as a function of advertising
expenditure, and time series prediction where the input variables can be time-lagged versions of the
prediction variable. [Bao]
Regression analysis is a technique used to interpolate and extrapolate observations, and it can
be classified into linear and non-linear regression. Linear regression is a statistical technique
which attempts to fit a line to the observed data and to use this line to predict future data. It
"quantifies the relationship between two continuous variables: the dependent variable or the variable
you are trying to predict and the independent or predictive variable". [Rud 2001] It works by finding
a line through the data that minimizes the squared error from each point. The formula of linear
regression is: [Whitehead 2005]
$$Y = a + bX + e$$
Y: a dummy dependent variable (=1 if the event happens, =0 if the event doesn't happen),
a: the coefficient on the constant term,
b: the coefficient(s) on the independent variable(s),
X: the independent variable(s),
e: the error term.
For instance, Figure 18 shows the relationship between sales and advertising along with the
regression line. The goal is to be able to predict sales based on the amount spent on advertising.
Figure 18: Simple linear regression
Source: Rud, Data Mining Cookbook
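A minimal sketch of fitting such a line by least squares is given below. It uses the standard closed-form solution for one independent variable; the advertising and sales figures are hypothetical values chosen so that the result is easy to check by hand, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (a, b) of the line y = a + b*x that minimizes the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: advertising spend and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

a, b = fit_line(advertising, sales)
predicted_sales = a + b * 6.0  # extrapolate to a new advertising level
print(f"a={a}, b={b}, predicted sales at 6.0: {predicted_sales}")
```

With real data the points would scatter around the line, and the error term e in the formula accounts for that scatter.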
It is also possible that the relationship between the two variables is not linear. The relationship can
also be curvilinear or multiple linear. Logistic regression is very similar to linear regression. "The
Logistic Regression model is simply a non-linear transformation of the Linear Regression."
[Whitehead] It uses a sigmoid function instead of a linear function to fit the data. The main difference
between them is that in the logistic regression model the dependent variable is discrete or
categorical, not continuous. It is therefore very useful in marketing, because it can be used to
predict a discrete action such as a response to an offer or a default on a loan. [Rud] The logistic
regression model can be described as follows: [Whitehead]
$$\ln\left[\frac{p}{1-p}\right] = a + bX + e$$
p: the probability that the event Y occurs, p(Y=1),
b: the coefficient(s) on the independent variable(s),
e: the error term,
p/(1-p): the odds ratio,
ln[p/(1-p)]: the log odds ratio, or logit.
Logistic regression, like linear regression, is also based on a statistical distribution. But the "logistic"
distribution is an S-shaped distribution function which is similar to the standard normal distribution
(which results in a probit regression model), but easier to work with in most applications because the
probabilities are easier to calculate. "The logistic distribution constrains the estimated probabilities
to lie between 0 and 1." [Whitehead] A graphical comparison of the linear regression and logistic
regression models is illustrated in Figure 19.
Figure 19: Comparison of Linear and Logistic Regression
Source: Whitehead, An Introduction to Logistic Regression.
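The difference between the two models can be made concrete with a short Python sketch. The coefficients a and b below are hypothetical and are not estimated from any data; the sketch only shows that the logistic transformation keeps the predicted probability between 0 and 1, while the linear expression does not.

```python
import math


def logit(p):
    """Log odds ratio ln[p / (1 - p)] of a probability p."""
    return math.log(p / (1 - p))


def logistic(z):
    """Inverse of the logit: maps any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients for illustration only.
a, b = -4.0, 0.8


def linear_prediction(x):
    return a + b * x


def logistic_prediction(x):
    return logistic(a + b * x)


for x in (0.0, 5.0, 10.0):
    print(f"x={x}: linear={linear_prediction(x):.2f}, "
          f"logistic={logistic_prediction(x):.3f}")
```

At x = 10 the linear expression evaluates to 4.0, which cannot be read as a probability, whereas the logistic value stays below 1.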
Multinomial Regression: So far, linear regression and logistic regression have been discussed
only for two variables. When the nominal response variable has more than two categories,
another regression method can be used: the so-called multinomial regression. "Multinomial logit
models are multiequation models." [GSE&IS 2005] For example, a response variable with n
categories will generate (n-1) equations. This breaks the regression up into a series of binary
regressions comparing each group to a baseline (reference) group. "For example, wife work has 3
values, 0=not working, 1=part time, 2=full time. If choosing not working (0) as the baseline group,
multinomial logistic regression will assess the odds of working part time vs. not working, and
working full time vs. not working." [UCLA 2005] "Multinomial logistic regression simultaneously
estimates the (n-1) logits. Further, it is also the case that the model tests all possible combinations
among the n groups although it only displays coefficients for the (n-1) comparisons." [GSE&IS]
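How (n-1) logits against a baseline determine the probabilities of all n categories can be sketched as follows. The two log-odds values are hypothetical stand-ins for fitted model outputs for one observation, and `category_probabilities` is an illustrative name.

```python
import math


def category_probabilities(logits_vs_baseline):
    """Turn (n-1) log odds against the baseline into n probabilities.

    The baseline category has log odds 0 against itself, so its
    unnormalized weight is exp(0) = 1.
    """
    weights = [1.0] + [math.exp(z) for z in logits_vs_baseline]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log odds for one observation:
# part time vs. not working, and full time vs. not working.
log_odds = [math.log(2.0), math.log(1.0)]

not_working, part_time, full_time = category_probabilities(log_odds)
print(f"not working={not_working:.2f}, part time={part_time:.2f}, "
      f"full time={full_time:.2f}")
```

With odds of 2 to 1 for part time and 1 to 1 for full time against not working, the three probabilities come out at roughly 0.25, 0.50 and 0.25.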
4.2.1.2. Weighted score tables
A weighted score table is a method of evaluating alternatives when the importance of each criterion
differs. In a weighted score table, each alternative is given a score for each criterion. These scores are
then weighted by the importance of each criterion. All of an alternative's weighted scores are then
added together to calculate that alternative's total weighted score. The alternative with the highest
total score should be the best alternative. You can use weighted score tables to make predictions
about future customer behaviour. You create a model in the data mining application to make
predictions. After a model has been created based on historical data, it can then be applied to new
data to make predictions. The prediction, that is, the output of the model, is called a score. You can
create a single score for your customers by taking into account different dimensions. SAP's weighted
score tables method allows you to define your own valuation function by first assigning weights to
the individual model fields and then creating a weighted total from these model fields. The algorithm
of weighted score tables is as follows: [SAPDOCS 2005]
A function that is defined by weighted score tables is a linear combination of functions of a
variable:
$$f(X_1, \ldots, X_n) = W_1 \cdot f_1(X_1) + \ldots + W_n \cdot f_n(X_n)$$
The weights W_1, ..., W_n are arbitrary numbers. Each of the functions f_1, ..., f_n is mapped to exactly
one model field. The arguments X_1, ..., X_n of these functions are those values that the model fields
can take.
For discrete model fields, the score table of the model field is used to directly assign a function value
f_i(X_i) to individual values X_i of the model field. A common function value can be assigned to
values that are not listed explicitly in the table.
For continuous model fields, the score table of the model field is also used to directly assign a
function value f_i(X_i) to individual values X_i of the model field. Either a linear interpolation is
made between two points, or the function value from the left or right point is taken. Respectively,
either a polygon line or a piecewise constant function is defined. Depending on the option selected
by the user, the function is continued as linear or continuous beyond the outer points.
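The weighted total described above can be sketched in Python as follows. This is not SAP's implementation: the field names, weights and score tables are hypothetical, the continuous field uses linear interpolation between table points, and keeping the value of the nearest outer point beyond the table is an assumption made for this sketch.

```python
def discrete_score(table, default):
    """Score function for a discrete field; unlisted values get a common default."""
    def f(value):
        return table.get(value, default)
    return f


def interpolated_score(points):
    """Score function for a continuous field from (value, score) table points."""
    pts = sorted(points)

    def f(x):
        # Assumption: beyond the outer points, keep the nearest outer score.
        if x <= pts[0][0]:
            return pts[0][1]
        if x >= pts[-1][0]:
            return pts[-1][1]
        # Linear interpolation between the two neighbouring table points.
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return f


def weighted_score(record, fields):
    """f(X_1..X_n) = W_1*f_1(X_1) + ... + W_n*f_n(X_n)."""
    return sum(weight * f(record[name]) for name, (weight, f) in fields.items())


# Hypothetical model fields: name -> (weight W_i, score function f_i).
FIELDS = {
    "marital_status": (2.0, discrete_score({"married": 10.0, "single": 5.0}, default=1.0)),
    "income": (0.5, interpolated_score([(0, 0.0), (50000, 50.0), (100000, 80.0)])),
}

score_a = weighted_score({"marital_status": "married", "income": 25000}, FIELDS)
score_b = weighted_score({"marital_status": "widowed", "income": 120000}, FIELDS)
print(score_a, score_b)
```

The first record scores 2.0 * 10 + 0.5 * 25 = 32.5; the second uses the default for the unlisted value and the outer score for the out-of-range income, giving 2.0 * 1 + 0.5 * 80 = 42.0.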
4.2.2. Clustering
Clustering is a common descriptive task of data mining where one seeks to identify a finite set of
categories or clusters to describe the given data. Based on a given set of data points, each having a
set of attributes, and a similarity measure among them, the identified clusters should guarantee that:
[Kumar and Joshi]
Data points in one cluster are more similar to one another,
Data points in separate clusters are less similar to one another.
The identified clusters may be mutually exclusive and exhaustive, or consist of a richer
representation such as hierarchical or overlapping clusters. Examples of clustering in a data mining
context include discovering homogeneous sub-populations for consumers in marketing databases
and identification of sub-categories of spectra from infrared sky measurements. [Bao] According to
Jain and Dubes, "Cluster analysis organizes data by abstracting underlying structure either as a
grouping of individuals or as a hierarchy of groups. The representation can then be investigated to
see if the data group according to preconceived ideas or to suggest new experiments". [Jain and
Dubes 1988] In brief, cluster analysis groups data objects into clusters such that objects belonging
to the same cluster are similar, while those belonging to different ones are dissimilar.
"The term cluster analysis (first used by TRYON, 1939) actually encompasses a number of different
classification algorithms." [STATSOFT 2005] A general question facing researchers in many areas of
inquiry is how to organize observed data into meaningful structures, that is, how to classify them.
Cluster analysis is an exploratory data analysis tool for solving classification problems. Its objective is
to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is
strong between members of the same cluster and weak between members of different clusters. The
distinguishing feature of cluster analysis is that there are no classes to be predicted, but there are
different ways in which the result of clustering can be expressed. "The groups that are identified may
be exclusive, so that any instance belongs in only one group, or they may be overlapping, so that an
instance may fall into several groups, or they may be probabilistic, whereby an instance belongs to
each group with a certain probability, or they may be hierarchical, such that there is a crude division
of instances into groups at the top level, and each of these groups is refined further, perhaps all the
way down to individual instances." [Witten and Frank] Cluster analysis is thus a tool of discovery. It
may reveal associations and structure in data which, though not previously evident, are nevertheless
sensible and useful once found.
The most commonly used method of cluster analysis is K-means clustering. First, decide how
many clusters are to be formed; this is the parameter K. K points are then chosen as initial cluster
centres, and each instance is assigned to its closest centre. Second, the mean of all the instances in
each cluster is calculated. These means are taken to be the new centre values for their respective
clusters. Finally, the whole process is repeated with the new cluster centres. The iteration continues
until the same points are assigned to each cluster in consecutive rounds, at which point the cluster
centres have stabilized and will remain the same thereafter. [Witten and Frank]
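The iteration just described can be sketched in plain Python. The data points and the fixed initial centres are hypothetical; in practice the initial centres are often chosen at random, which this sketch avoids so that the result is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centres, max_rounds=100):
    """Return (centres, assignment) once assignments stop changing."""
    centres = list(initial_centres)
    assignment = None
    for _ in range(max_rounds):
        # Assign every point to its closest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: squared_distance(point, centres[c]))
            for point in points
        ]
        if new_assignment == assignment:
            break  # same points in each cluster as in the previous round
        assignment = new_assignment
        # Recompute each centre as the mean of its assigned points.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centres, assignment


# Hypothetical two-dimensional data with two visible groups; K = 2.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centres, assignment = k_means(POINTS, initial_centres=[(1.0, 1.0), (8.0, 8.0)])
print(centres)
print(assignment)
```

For this data the assignments are stable after the second round, with one centre near (1.33, 1.33) and the other near (8.33, 8.33).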
The major part of this thesis concentrates on how to utilize cluster analysis and to derive
patterns using K-means as well as the more sophisticated algorithms (demographic and neural net
methods) that are part of the IBM data mining engine, based on the statistical data available as part of
the SAP BW statistics content. This is dealt with in chapter 6.
4.2.3. Association analysis
Association analysis (also known as Market Basket Analysis) uncovers the hidden patterns,
correlations or causal structures among a set of items or objects. For example, association analysis
enables you to understand what products and services customers tend to purchase at the same time.
By analyzing the purchasing trends of your customers with association analysis, you can predict
their future behaviour. It is also commonly referred to as "association discovery". [SAPDOCS]
These patterns may be expressed in the form of association rules such as:
2% of the customers who buy milk also buy bread and eggs. You can find that this rule applies
to 20% of the transactions.
80% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen
gloves and matching cover sets.
Customers who purchase pizza bases are three times more likely to purchase cheese than those not
buying the pizza bases.
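Rules of this kind are usually quantified with the measures support, confidence and lift. The following Python sketch computes them for a small, entirely hypothetical list of transactions; the measure names are standard, but the function names and data are chosen only for illustration.

```python
# Hypothetical sales transactions, each a set of purchased items.
TRANSACTIONS = [
    {"pizza base", "cheese"},
    {"pizza base", "cheese", "milk"},
    {"pizza base", "cheese"},
    {"pizza base", "milk"},
    {"milk", "bread"},
    {"cheese", "bread"},
    {"bread"},
    {"milk"},
]


def support(itemset):
    """Share of all transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to how often the consequent occurs overall."""
    return confidence(antecedent, consequent) / support(consequent)


rule_support = support({"pizza base", "cheese"})
rule_confidence = confidence({"pizza base"}, {"cheese"})
rule_lift = lift({"pizza base"}, {"cheese"})

# Comparison with customers who did not buy pizza bases.
without = [t for t in TRANSACTIONS if "pizza base" not in t]
cheese_rate_without = sum(1 for t in without if "cheese" in t) / len(without)
times_more_likely = rule_confidence / cheese_rate_without

print(rule_support, rule_confidence, rule_lift, times_more_likely)
```

In this toy data, buyers of pizza bases purchase cheese 75% of the time against 25% for everyone else, which corresponds to the "three times more likely" form of rule.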
"Market Basket Analysis is an algorithm that examines a long list of transactions in order to
determine which items are most frequently purchased together." [Goransson 2005] It uses the
information about "what" customers purchased to give researchers insight into "who" they are and
"why" they make certain purchases. It also gives information about the merchandise by
telling which products tend to be purchased together and which are most amenable to promotion.
[Berry and Linoff 1997] "Finally this information is actionable: it can suggest new store layouts, it
can determine which products to put on special, it can indicate when to issue coupons, and so on."
[Berry and Linoff] Because Market Basket Analysis is used to determine which products sell
together, the input data to a Market Basket Analysis is normally a list of sales transactions in a table
with two dimensions: one represents a product and the other represents either a sale or a
customer (depending on whether the goal of the analysis is to find which items sell together at the
same time, or to the same person). The cells of the data normally contain only 1 (bought product) or
0 (did not buy product) values, though PolyAnalyst can work with other data in the cells, such as
quantity or revenue. [Goransson]
Market Basket Analysis is often used as a starting point when transaction data is available but the researcher doesn't know what specific patterns to look for. It can be applied to many areas, such as: [Albion 2005]
Analysis of credit card purchases.
Analysis of telephone calling patterns.
Identification of fraudulent medical insurance claims (consider cases where common rules are broken).
Analysis of telecom service purchases.
4.2.4. Decision Trees
"A decision tree is used as a classifier for determining an appropriate action or decision (among a predetermined set of actions) for a given case. A decision tree helps you to effectively identify the factors you must consider and how each factor has historically been associated with different outcomes of the decision". [SAPDOCS] Decision trees have become one of the most popular data
mining tools. Their visual presentation makes decision trees very easy to read, understand and assimilate information from. They are called decision trees because the resulting model is presented in the form of a tree structure. Decision trees are most commonly used for classification, that is, predicting to which group a particular case belongs. A decision tree is constructed from a training set. A training set contains historical data, which is used to predict possible outcomes such as aspects of customer behaviour. For example, one can predict whether a customer will churn or remain loyal to the company.
"Decision Trees are powerful and popular data mining tools for classification and prediction. It is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision." [Berry and Linoff] "It has rules that can readily be expressed in English so that we humans can understand them or in a database access language like SQL so that records falling into a particular category may be retrieved." [Berry and Linoff] Decision Trees are normally drawn with the root at the top and the leaves at the bottom. A record enters the tree at the root node, where a test is applied to determine which sub node the record will go to next. "There are different algorithms for choosing the initial test, but the goal is always the same: To choose the test that best discriminates among the target classes." [Berry and Linoff] This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the tree are classified the same way, and from the root to each leaf there is a unique path that expresses the rule used to classify the data records. The following Decision Tree is one example; it helps a financial institution decide whether a person should be offered a loan. [Wilson 2005]
Figure 20: Decision Tree for deciding whether a person should be offered a loan
Source: Wilson, Introduction of Decision Trees.
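The traversal described above (a test at the root, a move to a sub node, and a classification at the leaf, with the path forming the rule) can be sketched in a few lines. The tests used below (`income >= 30000`, `has_unpaid_debts`) and the class names `Node` and `Leaf` are invented for illustration only; they are not the tests shown in Figure 20 and not the output of any SAP BW function.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    decision: str  # the classification assigned to every record that ends here


@dataclass
class Node:
    question: str                    # human-readable form of the test
    test: Callable[[dict], bool]     # test applied to a record at this node
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree, record):
    """Walk from the root to a leaf; return the decision and the rule (path)."""
    path = []
    node = tree
    while isinstance(node, Node):
        answer = node.test(record)
        path.append(f"{node.question} = {'yes' if answer else 'no'}")
        node = node.if_true if answer else node.if_false
    return node.decision, " AND ".join(path)


# Hypothetical loan tree; thresholds and attributes are assumptions.
loan_tree = Node(
    "income >= 30000",
    lambda r: r["income"] >= 30000,
    if_true=Node(
        "has_unpaid_debts",
        lambda r: r["has_unpaid_debts"],
        if_true=Leaf("reject"),
        if_false=Leaf("offer loan"),
    ),
    if_false=Leaf("reject"),
)

decision, rule = classify(loan_tree, {"income": 45000, "has_unpaid_debts": False})
print(decision, "<-", rule)
```

The returned rule string is the unique root-to-leaf path mentioned in the text; the same path could equally be written as the WHERE clause of an SQL query that retrieves all records in that class.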
4.2.5. ABC classification
Classification is a function that maps a data item into one of several predefined classes. The goal is that previously unseen records should be assigned to a class as accurately as possible. [Kumar and Joshi 2004] Examples of classification methods used as part of knowledge discovery applications include classifying trends in financial markets and automated identification of objects of interest in large image databases. It is often not possible to separate the classes perfectly using a linear decision boundary. A bank might wish to use the classification regions to automatically decide whether future loan applicants will be given a loan or not. [Bao]
The ABC classification is a frequently used analytical method to classify objects (Customers, Products or Employees) based on a particular measure (Revenue or Profit). For example, you can
classify your customers into three classes A, B and C according to the sales revenue they generate. "ABC classification allows you to classify your data based on specified classification rules. The data to be classified is generated by a query in the SAP BW. The classification rules refer to a single key figure value in your data and implicitly specify which absolute or relative key figure values map to which classes." [SAPDOCS] One should specify the following for the ABC classification: [SAPDOCS]
The Characteristic for which the classification is to be performed. This entails specifying the characteristic values to be classified (such as Customer).
The Key figure that is to form the basis for classifying the characteristic values (such as profit made from that customer).
The attribute of the characteristic that should receive the result (the ABC Class).
The Query for determining the data (such as profitability data from the customer).
The Threshold value for the individual ABC classes. For example, all customers generating a profit of 0 to 20,000 belong to class C, those generating a profit between 20,001 and 80,000 to class B, and those generating more than 80,000 to class A.
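The threshold rule in the last item can be illustrated with a short sketch. It uses the example thresholds given above and treats every profit above 80,000 as class A, which is an assumption that leaves no gap between classes B and A. The function name `abc_class`, the parameter names and the customer data are hypothetical; in SAP BW the classification is configured in the system rather than coded.

```python
def abc_class(profit, b_above=20_000, a_above=80_000):
    """Map a key figure value (here: profit) to an ABC class.

    b_above / a_above are the example thresholds from the text;
    the parameter names are hypothetical.
    """
    if profit > a_above:
        return "A"
    if profit > b_above:
        return "B"
    # Profits of 0 to 20,000 fall into class C. Negative profits are not
    # covered by the text; assigning them to C is an assumption.
    return "C"


# Hypothetical query result: characteristic value (customer) -> key figure (profit).
profit_by_customer = {"Customer 1": 125_000, "Customer 2": 42_500, "Customer 3": 3_200}

# The result is written to an attribute of the characteristic (the ABC class).
abc_by_customer = {c: abc_class(p) for c, p in profit_by_customer.items()}
print(abc_by_customer)
```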
4.3. The SAP Analysis Process Designer workbench
"The Analysis Process Designer (APD) is a workbench with an intuitive visual interface that enables you to visualize, transform, and deploy your data from SAP business warehouse. It combines all these different steps into a single data process that you can easily interact with" [SAPPRL 2004]. The following figure illustrates the architecture of APD:
Figure 21: The Analysis Process Designer (APD) architecture.
Source: SAPNET, Analysis Process Designer
The Analysis Process Designer is the interface in the mySAP BW suite where, according to the business need or the questions at hand, the designer can connect to the stored data, modify the data and analyze the data (as the case may be), with the aim of getting results that answer the questions, and deliver these to an operational system where they might be used for
further decision-making purposes. It is the application environment for the SAP data mining solution; from SAP BW Release 3.5 the data mining functions are fully integrated into the APD. The following functions can be performed in the APD:
Creating and changing data mining models
Training data mining models with SAP BW data (data mining model as data target in the analysis process)
Execution of data mining methods such as prediction with decision tree, with cluster model and integration of data mining models from third parties (data mining model as a transformation in the analysis process)
Visualization o data mining models
"By being fully integrated into SAP's data warehousing solution, SAP BW and the APD (including Data Mining features) realize the benefits of single database access instead of different data tables in a variety of source systems. This significantly decreases interfacing problems as well as related issues with data integrity, data quality and system performance". [SAPPRL] Figure 22 below shows a high-level overview of how the APD is integrated into the SAP BW and other applications (for instance with SAP CRM).
Figure 22: APD integration with BW and other applications
Source: SAPNET, Analysis Process Designer
The data is first extracted from where it is stored. This could be a single database instance with several tables or several database instances with one or several tables. This data is then introduced into the SAP BW, where it is again stored, consolidated and structured. This is necessary because the APD deals basically with data within the SAP BW suite, already prepared in a form that it understands. Afterwards, the APD manipulates the data as the case might be; interim results gained in the course of the APD process might become interesting for further analysis. These are then plugged back into the BW system and saved. Finally, the end result gained (Reports and/or Analysis) is prepared and delivered to where it is needed. This could be the BW system itself or an ERP system like the SAP CRM, SCM or a flat file.
Figure 23: Process description of the APD
Source: SAPNET, Analysis Process Designer
The above figure is a process description of the APD. The system is primarily designed to extract only data that has first been uploaded into the data warehouse area of the SAP BW Suite. After the extraction process is completed, the data fields needed for the specific process are selected. The selected data fields, sets or tables are then prepared. The interim result of the preparation process might be plugged back into the system for further preparation or used for further analysis. The transformation process then follows after the preparation. The required algorithm is introduced into the system at this stage. It is after this that the result is discovered. This result is either stored/displayed in the SAP BW system in the form of graphs, tables etc. or transferred/stored in an OLTP system (for instance SAP CRM).
This is the process that is of most importance as far as this work is concerned. Based on the perceived business need, the analysis process is designed so as to give answers to questions an organization might have. Moreover, the scenarios that are discussed in chapter 6 are extensively explored and used to show all the aspects involved in a typical APD modelling process.
5. AS-IS ANALYSIS: Current Situation of SAP BW Administration
5.1. The technical content of SAP BW
"Implementing a Data Warehouse presents the administrator with challenges of a constantly changing nature. Even in a productive system, in which no new InfoCubes are created, new data, for example, is always being loaded. This results in an increase in the quantity of data, or in a change to its structure. In addition to this, there are recreated or ad-hoc queries, which change the way that accessing data is seen as a whole. This not only influences the load times, but also the execution times for queries. On the other hand, it is a good idea to have an optimum work in order to minimize the response time of the Data Warehouse". [SAPDOCS] From these few points, it is already clear that an overview of the processes in the Business Information Warehouse is not only advantageous but also necessary.
SAP BW provides several sub-areas in the technical content of the Business Information Warehouse. For the user of the Business Information Warehouse, the most important of these sub-areas is BW statistics. The following sub-areas are delivered as part of the technical content: [SAPHELP 2005-2]
BW Statistics
BW Data Slice
BW Features Characteristics
BW Formula Builder
BEx Personalization
Reporting Authorizations
BW Statistics: BW statistics is of most importance as far as this work is concerned. Moreover, the clustering scenario that is discussed in chapter 6 is based on the statistics data. BW statistics is a tool for analyzing and optimizing the processes in the Business Information Warehouse. The implementation and day-to-day use of the BW leads to an increase in the overall amount of data being processed and to changes to the structure of this data. There are also new or ad-hoc queries that change the way in which the data is accessed. This affects not only the amount of time it takes to load data, but also the amount of time it takes to execute queries. Ideally, processes should be run in such a way that the response time of the Business Information Warehouse is made as short as possible. To achieve this you need to be able to get an overview of the processes that are running in the Business Information Warehouse and be able to make any necessary changes in the system as and when required. The data that is required for the BW is provided for: [SAPHELP 2005-3]
InfoCubes
Queries
InfoSources
Aggregates
The data in BW statistics is saved and managed in the Business Information Warehouse. When a query is executed, data is specified for the OLAP server and for access to the database. This data is saved temporarily once the navigation step has been completed. This is also the case when the ODBO (OLE DB for OLAP) interface is used. Additional data is collected when the aggregates are filled and rolled up after loading data into warehouse management. It does not take long to calculate and save BW statistics data. However, the dataset can be considerable with larger installations. For this reason, the data input for each InfoProvider in each area of OLAP and warehouse management
can be activated and deactivated individually. It is also possible to delete stored data. The following figure gives an overview of the dataflow in BW statistics:
Figure 24: Overview of the dataflow in BW statistics
Source: SAP Help portal
Finally, to summarize, BW statistics helps to answer some important questions, as follows:
Which InfoCubes, InfoObjects, InfoSources, source systems, queries, aggregates, and so on, are currently being used in the system? How frequently? Which datasets are being moved?
Who is currently using the system?
Are there queries whose run time is over the allowed fast value for online processing? Are tasks, such as batch printing or loading data, executed in times of less work?
How does the data flow through the Data Warehouse, from where and where to?
5.1.1. Statistical content cubes
Within the framework of the technical content, SAP BW provides the following cubes, which store the statistical content data. "A MultiProvider (MultiCube) in BW does not contain any data itself. Instead, data is stored in the relevant InfoProviders" [SAPDOCS]. Accordingly, the BW Statistics MultiProvider provided by SAP BW holds no data of its own; the data is stored in the relevant BasicCubes, which are:
BW Statistics - OLAP (This InfoCube contains the data that is generated as a result of executing the queries)
BW Statistics - OLAP, Detail Navigation (This InfoCube contains the data that is generated as a result of executing a query. The details correspond to the definition of the aggregate. This InfoCube is used by the BW system for the proposal of aggregates.)
BW Statistics - Aggregates (This InfoCube contains not only general data but also data that appears in an aggregate after data is filled and rolled up)
BW Statistics - WHM (This InfoCube contains the data that arises from the execution of a process in Warehouse Management. This InfoCube allows you to see how data requests are processed for the process concerned, for example, from which source system they come, which InfoSource is used with which transfer method, and in what time frame)
BW Statistics - Metadata (This cube contains metadata from the Metadata Repository. It does not contain any transaction data and no data is loaded. The InfoCube also does not contain any special key figures. It reveals information about the existing objects and structures in the OLAP, WHM and BEx areas, and about the BW Metadata Repository and hierarchies to be displayed)
BW Statistics: Condensing InfoCubes (This InfoCube contains data that is created when an InfoCube's data requests are compressed. It reveals information on the number of edited data records for condensing or compressing an InfoCube and the runtime of the condenser, which is the program that compresses the fact table contents of an InfoCube)
BW Statistics: Deleting Data from InfoCubes (This InfoCube contains the data that results from deleting data from an InfoCube)
The InfoCube BW Statistics - OLAP is the most important cube as far as this work is concerned. The major part of the analysis concerns query performance and optimization, and this cube contains the relevant characteristics (InfoCube, BW System, User, Query, Time and so on) and key figures (OLAP times, Data Manager times and so on). The basic idea was to use these key figures for cluster analysis to come up with some useful patterns.
5.1.2. Brief overview of some Characteristics and key figures
As mentioned before, the major analysis of this work is on the performance of queries, and the BW Statistics - OLAP cube contains some of the important key figures used for the analysis. The following section gives an overview of the Characteristics, Time Characteristics and Key figures available as part of this cube:
Characteristics
InfoCube
Navigation Step (current numbers within the session)
OLAP Reading On / Off
Runtime Category (1, 2, 3, ... 10, 20, 30, ... Seconds)
BW System
User
OLAP Processor Method
Navigation Step (GUID)
Front-end Session (GUID)
Statistical Data (GUID)
Object Version (for example, 0TCTIFCUBE)
Type of Data Read
UTC Time Stamp
Time
Query
Time Characteristics
Calendar Day
Calendar Year
Calendar Year / Month
Calendar Year / Quarter
Calendar Year / Week
Key figures
Start Date
Frequency
Start Time
Number of Database Selects
Number of Navigations
Number of Front-end Sessions
Number of Texts Read
Cells Transferred to the Front-end
Records Selected on the Database
Records Transferred from the Database to the Server
ODBO: Size of the Internal Buffer
ODBO: External Calls for the Function Module
Total (OLAP)
Read Cycles (Fetch) OLAP Processor
Formatting Transferred to the Front-end
Number of Texts Read
Time, Authorization Check
Time, Reading on the Database
Time, Data Manager InfoCube Access
Time, Data Manager Reading from Basic Cube
Time, Data Manager Reading from ODS
Time, Data Manager Reading from Remote Cube
Time, Data Manager Authorizations for Non-Cumulative
Time, Data Manager Determining SIDs for Remote Cube
Time, Front-end
Time Between Navigation Steps
Time, General ODBO
Time, ODBO: Axes Preparation
Time, ODBO: Data Records Preparation
Time, ODBO: Conversion into Flat Table Form
Time, ODBO: Initialization
Time, ODBO: Data Requests
Total Time (OLAP)
Time, OLAP Processor Initialization
Time, Reading Texts/Master Data
Time That the System Was Unable to Assign
Time, Inputting Variables
Time, OLAP Processor
The major objective is to analyse these key figures and their importance for the performance of a query. Since the cube has a huge number of key figures, it is not always easy to decide which of them need to be considered for the cluster analysis. Finally, the OLAP times, the Data Manager times and some other key figures were considered for the analysis, which is described in chapter 6.
5.2. SAP BW administration and monitoring
Data warehousing is indeed becoming commonplace in large organizations. According to a Forrester Research survey of executives at large firms, 62 percent have data in, on average, three data warehouses or data marts. The same survey indicates that the pace of data warehousing will increase before it slows down, with the average growth showing the number of data warehouses and marts to double to nearly six by 2004 and increase in size from approximately 130 GB to approximately 260 GB. [IDUG 2005]
A data warehouse is a completely different beast from the operational OLTP system. Its problems and the tools needed to solve them are different. From this it is quite clear that administrators are very much concerned with warehouse availability and performance during access. Coming to the SAP BW, "The Administrator Workbench (AWB) is the main tool for tasks in the data warehousing process. The AWB provides data modelling functions as well as functions for control, monitoring and maintenance of all processes in SAP BW having to do with data procurement, data retention, and data processing". [SAPHELP 2005-4] The following functions are provided as part of AWB:
Modelling
Monitoring
Reporting Agent
Transport connection
Documents
Business Content
Translation
Metadata Repository
To summarize, it becomes quite clear that the administration of complex enterprise data warehouses plays a pivotal role in today's IT landscapes. The question of how Data Mining methods could support the administration of data warehouses with regard to performance and system stability eventually motivated the analysis of query performance with cluster analysis.
5.3. Possible business scenarios for data mining
During the initial phase of the investigation, several issues were considered as to how and in which areas of the BW Data Mining methods could be useful. This was not always easy to find out, as there are several other qualitative aspects that could influence the performance and stability of SAP systems, for instance:
The number of application servers available
The underlying database technology
The number of work processes available at a particular moment in time
The number of parallel processes available at a moment in time
These are a few qualitative factors which are not easy to measure; further research might in future help to measure such qualitative aspects of these typical SAP systems.
Finally, after asking several experts in these areas, the following areas were identified where Data Mining methods might be useful to support data warehouse administration in SAP BW:
5.3.1. Data loads and Process chains
The first area is the execution of data load processes in Warehouse Management, for example how data requests are processed for a particular process. Presently, as of BW 3.5, the BW administrator can monitor the data load processes using the statistics content cube named "BW Statistics - WHM" (Technical name: 0BWTC_C05). This InfoCube helps the administrator to see how data requests are processed for the process concerned (for example, from which source system they come, which InfoSource is used with which transfer method, and in what time frame). Presently there are only a few key figures as part of this cube. The important ones are:
Records (WHM Process) for a particular processing step when loading data
Time (WHM Process) for a particular processing step when loading data
It seems that more key figures might be needed; which new key figures these are and how they could be derived is out of the scope of this paper. Further investigation could be made in this regard so that the new key figures can be used for Data Mining purposes in future.
5.3.2. Queries
The term "Query" is the most talked-about of the available objects in SAP BW. The Query is of utmost importance, since it is the object through which the data available in the Data Warehouse (for instance SAP BW) is presented using the front-end tools (BEx), based on the typical reporting requirements of the users.
Several factors determine how well a query performs, some with greater influence than others. Presently, as of BW 3.5, the SAP BW administrator can monitor the queries using the statistics content cube BW Statistics - OLAP (Technical name: 0BWTC_C02). This cube contains many key figures that can help in analysing query performance; they are documented in the above section (please refer to the section "Brief overview of some Characteristics and key figures"). With the available OLAP key figures the administrator can find out the reasons for the performance of the queries. For example, the administrator could look at the various OLAP times:
Time, OLAP Processor Initialization
Time, OLAP Processor
Time, Reading from the Database
Time, Front-end
Time, Authorization Check
Time, Reading Texts/Master Data
From the above OLAP key figures the administrator can check which key figures are responsible for high OLAP times, as the case may be, and perform the necessary action steps. Taking this idea into consideration, these OLAP key figures are used for cluster analysis. The algorithm used is known as K-means cluster analysis, which is available as part of SAP's Data Mining workbench. The details of the analysis are documented in chapter 6.
5.3.3. Dormant data
Dormant data is data that is seldom or never used. "Studies show that much of the data loaded into data warehouses and analytical application databases is dormant, that is, it is infrequently used or never used." [ITTOOLBOX 2005] Unlike OLTP databases, data warehouses like SAP BW continuously collect and store detailed and summary historical information for business analysis. Frequently data warehouses include information to satisfy unknown requirements, and data is included that may or may not be used. These databases expand significantly over time as new information is added from internal and external data sources.
Bill Inmon, a noted data warehouse expert, states that dormant data typically increases as a percentage of total data as warehouses grow. "He asserts that dormant data may be as much as 65 - 70% of data warehouses that are a terabyte or greater in size." [FILETEK 2005] "He recommends a simple formula for calculating the data dormancy ratio: the number of queries per year times the average amount of data per query divided by total data warehouse space." [FILETEK] While this ratio may be high since it does not consider that some queries inevitably use the same data, it does provide a rule of thumb for making ballpark estimates. But how can one actually identify dormant data? Bill Inmon writes, "Understanding that there is dormant data in a data warehouse is one thing. Finding the dormant data is another matter altogether. The best way to find the dormant data is to monitor the end user's query activity against the data warehouse ... the monitor sits between the end-user's query activity and the data warehouse server." [FILETEK]
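The rule of thumb quoted above is simple enough to write down directly. The sketch below computes the ratio exactly as stated in the quotation; the function name and the input figures are hypothetical and serve only to show the arithmetic. Note that, as worded, the ratio measures the share of warehouse space touched by queries in a year, so reading one minus this value as the untouched (dormant) share is an interpretation made here, not a statement from the cited source.

```python
def query_data_ratio(queries_per_year, avg_data_per_query_gb, warehouse_size_gb):
    """Ratio from the quoted rule of thumb:
    (queries per year * average data per query) / total warehouse space.

    All quantities must use the same unit of data volume (here: GB).
    """
    if warehouse_size_gb <= 0:
        raise ValueError("warehouse size must be positive")
    return queries_per_year * avg_data_per_query_gb / warehouse_size_gb


# Hypothetical figures for illustration only.
ratio = query_data_ratio(
    queries_per_year=2_000,
    avg_data_per_query_gb=0.125,
    warehouse_size_gb=1_000,
)
print(f"share of warehouse touched by queries (upper estimate): {ratio:.2f}")
```

Because different queries often read the same data, the computed value overstates the share of the warehouse that is actually used, which is why the text treats it only as a ballpark estimate.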
From the above it is quite clear that Data Warehouse tools like SAP BW need some kind of monitor that indicates when particular data could be archived; presently there is no such monitor in SAP BW. Once a product like SAP BW offers such a monitor, its key figures could be used for data analysis, and the possible use of Data Mining methods could be investigated in the future. To summarize, minimizing dormant data reduces system costs and improves performance, service levels and IT staff productivity, and this paper strongly recommends coming up with some sort of monitor in the near future.
5.3.4. Table spaces and buffers
Another interesting aspect of a typical data warehouse tool like SAP BW that is directly related to performance and system stability is the table spaces and buffers at the database layer.
There are a number of factors that are responsible for the performance of table spaces and buffers from the SAP BW perspective; some of them are:
The size of the InfoCube
The number of partitions of an InfoCube
The number of CPUs and their respective times
The database-specific settings, and so on.
There are monitors for SAP systems, for example Database Performance Analysis (Transaction code: ST04) and Database Tables and Index Monitor (Transaction code: DB02), and it makes sense to take these into account to derive some key figures, such as number of table spaces/buffers, amount and time of table space/buffers, CPU times etc., which could eventually be used for Data Mining purposes.
5.4. A way forward
To summarize the AS-IS analysis: given the current situation of SAP BW administration as described in the previous sections, it seems quite clear that companies like SAP are looking to bundle sophisticated Data Mining features into their products (here SAP BW) in order to administer and monitor the complexities of data warehouses more easily, which will eventually ease the day-to-day activities of SAP BW administrators. Having identified the needs and the possible areas, the next step is to pick a business scenario that helps in the realization of the TO-BE analysis.
At this stage of the work, after identifying the possible areas in SAP BW, namely data load processes, queries, dormant data and table spaces, the strategy is to cut horizontally, taking into account the time and technical constraints. This led to the idea of cluster analysis for the queries, which is further described in the TO-BE analysis.
6. TO-BE ANALYSIS - A Scenario with Cluster Analysis
6.1. Motivations for cluster Analysis
6.1.1. Technical drivers
The major technical driver for the cluster analysis is, obviously, the availability of a clustering algorithm as part of SAP's Data Mining workbench. The clustering algorithm (k-means) is implemented as part of the workbench, is already pre-configured in the system and is simply made available for use. Nevertheless, an effort is made here to describe the fundamental principle behind the concept. Clustering is used to group records together according to an algorithm or mathematical formula that attempts to find centroids, or centres, around which similar records gravitate. This method initially takes a number of components of the population equal to the final required number of clusters. In this step the initial components are chosen such that the points are mutually farthest apart. Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The distance measure used is the Euclidean metric, which is simply the geometric distance in the multidimensional space. It is computed as: [SAPDOCS 2005]
$$\mathrm{Distance}(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$$
After every input record is assigned to some cluster or other, each centroid's position is recalculated based on the records assigned to it. With the new centroids, the assignments are checked again, and this continues until one of the stopping conditions is reached (i.e., the maximum number of iterations is reached or the cluster assignments do not change much between iterations).
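The procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used in SAP's Data Mining workbench, whose internals are not documented here. Two assumptions are made: the initial centres are picked with a simple farthest-first heuristic starting from the first record, and iteration stops when the assignments do not change at all (a stricter form of "do not change much") or when `max_iter` is reached. All function names and the sample points are hypothetical.

```python
import math


def euclidean(x, y):
    """Geometric distance between two points in multidimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def farthest_first_seeds(points, k):
    """Pick k initial centres that are mutually far apart (heuristic, assumed)."""
    seeds = [points[0]]
    while len(seeds) < k:
        # Next seed: the point whose nearest already-chosen seed is farthest away.
        seeds.append(max(points, key=lambda p: min(euclidean(p, s) for s in seeds)))
    return seeds


def k_means(points, k, max_iter=100):
    centroids = [tuple(s) for s in farthest_first_seeds(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stopping condition: no change
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of the records assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional records, e.g. two scaled time key figures per query.
sample = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (8.8, 9.2)]
centres, labels = k_means(sample, k=2)
print(centres, labels)
```

Because the Euclidean metric is sensitive to the scale of each dimension, key figures measured in very different ranges would normally be normalised before clustering; that step is omitted here for brevity.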
The availability of the APD functionality for creating, changing and training data mining models, which is integrated in SAP BW, together with the availability of data transformation functions and the visualization of data mining models, is also a technical motivator for the analysis.
6.1.2. Business drivers
The main business driver is to provide some kind of proactive monitor for the administration of SAP BW. Using the statistical content data, the objective is to track query behaviour so as to divide the queries into segments based on key figures, namely the OLAP times, since these times are responsible for the performance of queries. This was also confirmed in the AS-IS analysis phase by several experts at SAP from different areas. The major key figures used for the cluster analysis are the various OLAP times, namely:
Time, OLAP Processor Initialization
Time, OLAP Processor
Time, Reading from the Database
Time, Front-end
Time, Authorization Check
Time, Reading Texts/Master Data
The number of times a Query is executed
The ultimate objective is to cluster the queries into different groups for a certain time period and to present to the BW administrator those clusters that seem peculiar, so that the results help the BW administrator to perform the necessary follow-up actions, which would eventually help in the performance and monitoring of these queries in the future.
Looking at the present developments (bundling new features as part of products) with reference to SAP BW, it makes sense to have performance and proactive monitors as part of SAP BW, which is the main motivation from the business/product management perspective.
6.2. Analysis of Queries with cluster analysis
The main objective is to track the behaviour of queries and divide them into segments based on the OLAP times. The key figures considered for the analysis are OLAP times and Data Manager times. For this particular data model the Total time (OLAP) is taken into consideration, which enables the evaluation of query runtimes and includes the following times:
Initialization of OLAP Processor
OLAP Processor
Reading on the Database
Front End
Reading Texts/Master Data
Authorization Check
A new key figure, Count Frequency (number of times a query is executed)
The total OLAP time (which is an aggregated key figure of all the above OLAP times)
Finally, the data model consists of 9 key figures (excluding the record ID used for record identification).
All the activities are performed on the internal SAP test systems AB5 and Q50.
6.3. The Data Model
Arriving at a meaningful data model required considerable investigation, several attempts, and consultation with experts in these areas. The following figure depicts the model attributes that are used for clustering.
Figure 25: The Data Model
The models are created and evaluated on the SAP internal test systems (AB5 and Q50). The technical name of this model is C_IWP_20_1.
6.4. Data preparation
Data preparation is one of the major tasks, and much of the time was devoted to this part of the work. The outcome of data mining depends heavily on the data distributions and on the amount of source data. Initial attempts were based on the data available on the test systems, and the results did not show any patterns. After consulting the experts, it became clear that it makes sense to work with data from a productive system. The data finally used for the analysis therefore comes from the productive internal BW system.
Colleagues responsible for the productive system made several attempts to load the data into the test system, but the load into the InfoCube was not successful because of technical constraints. For instance, the data loaded into the cube had data quality problems: the time of the OLAP processor was always filled with the value 1, and the time of the OLAP processor was greater than the overall OLAP time. This should not be the case, because the overall OLAP time is the total of all OLAP times, as shown in figure 26:
Overall OLAP time = Time, OLAP Processor Initialization + Time, OLAP Processor + Time, Reading from the Database + Time, Front-end + Time, Authorization Check + Time, Reading Texts/Master Data + ODBO times
Figure 26: Data Preparation - Data quality problems
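The consistency rule stated above (the overall OLAP time equals the sum of its component times, so no single component can exceed it) can be checked mechanically. The following is a minimal sketch in Python, not the procedure used in SAP BW; the field names and the sample values are assumptions chosen for illustration only.

```python
# Hypothetical field names; the real names of the statistics fields are not given in the text.
COMPONENTS = ["init", "olap", "db", "frontend", "auth", "texts", "odbo"]


def inconsistent_records(records, tolerance=1e-6):
    """Return the records that violate the rule 'overall = sum of all component times'
    or in which a single component is larger than the overall time."""
    bad = []
    for rec in records:
        total = sum(rec[name] for name in COMPONENTS)
        too_large = any(rec[name] > rec["overall"] + tolerance for name in COMPONENTS)
        if abs(total - rec["overall"]) > tolerance or too_large:
            bad.append(rec)
    return bad


# Made-up sample data: record 1 is consistent, record 2 is not.
records = [
    {"id": 1, "init": 0.5, "olap": 2.0, "db": 4.0, "frontend": 1.0,
     "auth": 0.25, "texts": 0.25, "odbo": 0.0, "overall": 8.0},
    {"id": 2, "init": 0.5, "olap": 7.0, "db": 1.0, "frontend": 1.0,
     "auth": 0.0, "texts": 0.0, "odbo": 0.0, "overall": 3.0},
]
print([rec["id"] for rec in inconsistent_records(records)])
```

A tolerance is used because the times are fractional values and exact equality of floating-point sums cannot be relied on.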
Fortunately, after several attempts the data could be loaded into the PSA (Persistent Staging Area, an intermediate data store that holds the data before it is loaded into the cube, and one of the upload types in SAP BW). As a result, the PSA table is used for the analysis, and its data consistency was checked by manually totalling the key figures, as shown in figure 27.
Figure 27: Data Preparation - Consistency check
As a result, the PSA table is used as the data source instead of an InfoCube or a query, as shown in figure 28.
Figure 28: Data Selection - The PSA as a source table
Data Selection
The data is filtered for a period of 15 days, keeping in mind that the objective is to provide the administrator with meaningful clusters for analysis. One of the main purposes of data mining is to mine large data sets, so it was decided to use at least 800 records over the shortest possible time period. A period of 15 days was selected because it yields more than 800 records; after aggregation the data set consists of 1116 records (visible in the following screens). The PSA as a data source is shown in the following figure.
Figure 29: Data Selection - Time period of data
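The selection rule described above (the shortest time period that still contains a minimum number of records, which in this work meant 15 days for at least 800 records) can be sketched as follows. This is an illustrative Python sketch; the function name, the day-by-day search and the sample data are assumptions and do not reproduce the filter used in the Analysis Process Designer.

```python
from collections import Counter
from datetime import date, timedelta


def shortest_period(record_dates, end_date, min_records):
    """Return the smallest number of days, counted back from end_date,
    that contains at least min_records records (None if never reached)."""
    if not record_dates:
        return None
    per_day = Counter(record_dates)
    earliest = min(per_day)
    total, days, current = 0, 0, end_date
    while current >= earliest:
        total += per_day.get(current, 0)
        days += 1
        if total >= min_records:
            return days
        current -= timedelta(days=1)
    return None


# Made-up example: 3 records on each of the 10 days up to the end date.
end = date(2005, 3, 15)
dates = [end - timedelta(days=d) for d in range(10) for _ in range(3)]
print(shortest_period(dates, end, 8))
```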
6.5. Data transformation
After consulting the experts in these areas, several transformations are applied to make the data meaningful; they are discussed in the following sections. In the first step the records are further filtered by query, InfoProvider and user to get rid of the initial values and of the user 'SCOPEADM', who is not a genuine user, as in figure 30.
Figure 30: Data transformations - Filter Query, Info Cube and User
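The filter step can be illustrated with a small Python sketch. It assumes that an "initial value" is an empty string and uses a placeholder user name (TECH_USER); both are assumptions for illustration, as are the field names.

```python
def filter_records(records, excluded_users):
    """Keep records with a non-initial query and InfoProvider whose user is not excluded.
    'Initial' is assumed here to mean an empty string."""
    return [
        rec for rec in records
        if rec["query"] and rec["infoprovider"] and rec["user"] not in excluded_users
    ]


# Made-up records; TECH_USER stands in for a technical (non-genuine) user.
records = [
    {"query": "Q1", "infoprovider": "CUBE_A", "user": "ALICE"},
    {"query": "", "infoprovider": "CUBE_A", "user": "BOB"},          # initial query
    {"query": "Q2", "infoprovider": "", "user": "BOB"},              # initial InfoProvider
    {"query": "Q3", "infoprovider": "CUBE_B", "user": "TECH_USER"},  # excluded user
]
kept = filter_records(records, excluded_users={"TECH_USER"})
print([rec["query"] for rec in kept])
```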
One of the attributes considered for clustering is the number of times a query is executed in the data set selected for the specified time period, so a new key figure called query frequency is added to the analysis process, as in figure 31.
Figure 31: Data Transformation - Adding new key figure
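Deriving the query frequency amounts to counting how often each query occurs in the selected records. A minimal Python sketch, with assumed field names and made-up data:

```python
from collections import Counter


def add_query_frequency(records):
    """Return copies of the records with a 'frequency' key figure:
    the number of executions of the record's query in the data set."""
    counts = Counter(rec["query"] for rec in records)
    return [{**rec, "frequency": counts[rec["query"]]} for rec in records]


# Made-up executions: Q1 runs three times, Q2 once.
executions = [{"query": "Q1"}, {"query": "Q2"}, {"query": "Q1"}, {"query": "Q1"}]
with_frequency = add_query_frequency(executions)
print([(rec["query"], rec["frequency"]) for rec in with_frequency])
```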
The next step in the transformation is to derive the Total OLAP time from the corresponding OLAP times, as in figure 32.
Figure 32: Data Transformation - Transformation of OLAP times
6.5.1. Data Aggregation
An important step in the data transformations is to get rid of repeated query and InfoProvider values in the data set and to arrive at unique values. The aggregation is therefore performed at the query and InfoProvider level, with the average used as the aggregation type. The aggregation process is shown in figure 33.
Figure 33: Data Transformations - Aggregation of data
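The aggregation described above groups the records by query and InfoProvider and replaces each key figure by its average within the group. The following Python sketch illustrates the idea; the field names and values are assumptions, and it is not the aggregation node of the Analysis Process Designer itself.

```python
from collections import defaultdict
from statistics import mean


def aggregate_average(records, key_fields, value_fields):
    """Collapse records that share the same key fields into one row per key,
    averaging every value field."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in key_fields)].append(rec)
    result = []
    for key, members in groups.items():
        row = dict(zip(key_fields, key))
        for field in value_fields:
            row[field] = mean(m[field] for m in members)
        result.append(row)
    return result


# Made-up data: Q1 on CUBE_A was executed twice, Q2 on CUBE_B once.
rows = [
    {"query": "Q1", "infoprovider": "CUBE_A", "db": 2.0, "frontend": 1.0},
    {"query": "Q1", "infoprovider": "CUBE_A", "db": 4.0, "frontend": 3.0},
    {"query": "Q2", "infoprovider": "CUBE_B", "db": 5.0, "frontend": 0.5},
]
aggregated = aggregate_average(rows, ["query", "infoprovider"], ["db", "frontend"])
print(aggregated)
```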
6.5.2. Relative numbers
Based on expert advice, it makes sense to work with relative values (percentages) of the corresponding OLAP times, using a transformation routine as in figure 34, in order to get rid of uneven data distributions and to obtain meaningful patterns from the clustering engine.
Figure 34: Data Transformations - Conversion of OLAP key figures
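The conversion to relative numbers expresses each OLAP time as a percentage of the sum of all OLAP times of the same record. A minimal Python sketch of such a routine follows; the field names are assumptions, and the handling of a zero total is a design choice of this sketch rather than something stated in the text.

```python
def to_percentages(record, component_fields):
    """Express every component time as a percentage of the sum of all components.
    A record whose total is zero is mapped to all-zero percentages."""
    total = sum(record[f] for f in component_fields)
    if total == 0:
        return {f: 0.0 for f in component_fields}
    return {f: 100.0 * record[f] / total for f in component_fields}


# Made-up record: the database read dominates the runtime.
shares = to_percentages({"db": 6.0, "frontend": 3.0, "olap": 1.0}, ["db", "frontend", "olap"])
print(shares)
```

Working with shares instead of absolute times makes queries of very different runtimes comparable by the composition of their runtime.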
The data distribution of the Total OLAP time is shown in figure 35.
Figure 35: Data distribution of TOTAL OLAP time before Transformation
Based on expert suggestions, and to get rid of this uneven data distribution, the Total OLAP time is ranked so that the data records are more evenly distributed, which helps the clustering engine to distribute the data evenly across the various segments, as shown in figure 36.
Figure 36: Data transformation - Discretizing Total OLAP
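One common way to rank a heavily skewed key figure is equal-frequency binning: the values are sorted and cut into bins that hold roughly the same number of records. The text does not specify the exact ranking routine that was used, so the following Python sketch shows this technique as an assumption, with made-up values.

```python
def rank_into_bins(values, n_bins):
    """Assign each value a bin number from 1 to n_bins so that the bins hold
    roughly equal numbers of records (equal-frequency binning)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, index in enumerate(order):
        bins[index] = position * n_bins // len(values) + 1
    return bins


# Made-up, heavily skewed runtimes: a few very large values dominate the range.
total_olap_times = [1, 2, 3, 500, 9000, 4]
print(rank_into_bins(total_olap_times, 3))
```

In this simple form equal values can end up in neighbouring bins; a production routine would need a rule for ties.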
The following figure shows the basic statistics of the TOTAL OLAP times after transformation. The experts suggested that the K-means algorithm would work better on such data distributions and would eventually produce meaningful patterns.
Figure 37: Data distribution of TOTAL OLAP after transformation
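For orientation, the basic K-means procedure alternates between assigning every record to its nearest cluster centre and moving each centre to the mean of its members. The following is a plain textbook sketch in Python, not the clustering engine of SAP BW; seeding the centres with the first k points and the sample points are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, iterations=100):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no record changed its cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Made-up two-dimensional points (e.g. ranked total time, ranked frequency).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Because K-means relies on distances, it benefits from attributes on comparable scales, which is consistent with the ranking and percentage transformations described in this section.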
In the same way, the query frequency (the number of times a query is executed) is transformed, as in figure 38.
Figure 38: Discretization of Query frequencies
6.5.3. Mapping the Model attributes
In this step the model attributes (the data model) are mapped to the attributes of the data source table, as depicted in figure 39.
Figure 39: Mapping the Model attributes
6.6. Results of the cluster analysis
In this step the clusters are analysed by looking at the features of the various clusters. The following figure shows the influence of attributes, which represents the relative importance of every attribute considered for clustering in the formation of the clusters. The higher the index, the higher the influence in deciding which cluster an entity is assigned to.
Figure 40: Cluster Analysis - The influence chart
Analysis of Cluster segments
Cluster 1 - Contributes 19% to the data set. This cluster is characterized by queries that are frequently executed and have a high Total OLAP time. Analysing the corresponding OLAP times shows that the high Total OLAP time in this cluster is caused by the time spent reading from the database.
The administrator can quickly conclude that the queries in this cluster have a high database read time and are also executed more frequently, so follow-up actions could be taken to reduce this time.
Figures 41 and 42 give a picture of the attributes Query frequency, Time reading from the database, and Total OLAP time. The details of all the screens are illustrated in APPENDIX-1.
Figure 41: Cluster analysis - Results of cluster 1
Figure 42: Cluster analysis - Results of cluster 1
Cluster 4 - This is a large segment of 22% and is characterized by queries that are less frequently executed and have a high Total OLAP time. The cause is again the time spent reading from the database, as seen for cluster 1, but the interesting aspect is that this cluster contains queries that are executed less frequently than those in cluster 1.
The administrator's quick impression could be that this cluster is less important than cluster 1.
Figure 43: Cluster analysis - Results of cluster 4
Cluster 5 - This is a small segment of 5%, characterized by queries with a high TOTAL OLAP TIME that are frequently executed. The cause is high front-end times.
Figure 44: Cluster analysis - Results of cluster 5
Cluster 10 - This is a small segment of 5% of the total data set, characterized by queries with a high TOTAL OLAP TIME that are less frequently executed. The corresponding key figures show that this is due to high front-end times.
The administrator quickly comes up with questions:
Why do the users need to push such a huge amount of data to the front end?
Are these queries based on financial InfoProviders, which generally hold large amounts of data?
Try to find the user patterns, such as casual user, information consumer or analyst.
Figure 45: Cluster analysis - Results of cluster 10
Note: All the remaining screens are documented in APPENDIX-B
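The reasoning applied to the clusters above (a high total OLAP time makes a cluster interesting, a high execution frequency makes it urgent, and the dominant component time points to the cause) could be automated along the following lines. This Python sketch is purely illustrative: the thresholds, field names, cluster labels and values are invented and are not results of this work.

```python
def peculiar_clusters(summaries, time_threshold, frequency_threshold):
    """Return (cluster, priority, dominant time) for clusters with a high total OLAP time.
    Frequently executed clusters get priority 'high', the others 'low'."""
    flagged = []
    for s in summaries:
        if s["total_time_rank"] < time_threshold:
            continue  # unremarkable total runtime
        priority = "high" if s["frequency_rank"] >= frequency_threshold else "low"
        dominant = max(s["time_shares"], key=s["time_shares"].get)
        flagged.append((s["cluster"], priority, dominant))
    return flagged


# Invented cluster summaries (ranks on a 1-10 scale, shares in percent).
summaries = [
    {"cluster": "A", "total_time_rank": 9, "frequency_rank": 8,
     "time_shares": {"db": 70.0, "frontend": 20.0, "olap": 10.0}},
    {"cluster": "B", "total_time_rank": 9, "frequency_rank": 2,
     "time_shares": {"db": 15.0, "frontend": 75.0, "olap": 10.0}},
    {"cluster": "C", "total_time_rank": 3, "frequency_rank": 9,
     "time_shares": {"db": 40.0, "frontend": 30.0, "olap": 30.0}},
]
print(peculiar_clusters(summaries, time_threshold=8, frequency_threshold=7))
```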
7. Conclusion and outlook
Data mining provides many different techniques to extract "knowledge" from data. It is an exciting multidisciplinary field of research with many extremely useful applications. At present the techniques are becoming more commonly used but have not been applied in all areas. As has been shown, businesses use data mining for a variety of applications (here, the main objective is system monitoring and performance, which would eventually lead to finding interesting patterns in the system data). Primarily, however, the focus of data mining is to find useful trends in existing data. Companies can use data mining to seek out changes in existing trends or, perhaps more importantly, to discover new trends that were previously unknown because of the huge task of analyzing large amounts of data. As shown, the application of data mining to system performance and stability is an emerging area as companies try to bundle new features into their products.
Coming to the TO-BE part of this work, which uses cluster analysis: the results help in analyzing the query information from the various OLAP times, and it is possible to derive strategies to optimize query performance in the future. Data preparation (cleaning, transformation and integration of the selected data) plays a vital role in arriving at meaningful patterns, and most of the time was devoted to this part. For the clustering scenario discussed here, the business metadata of the queries, InfoProviders and users could be joined to analyse the results further, and a consensus has been reached among the people at SAP to investigate further in this area. The interesting part of this work is that these results will be implemented with BI-NetWeaver 2006 as technical content queries (the rules regarding the size of the data set, the transformations and so on will be hard-coded). As a result, the SAP BW administrator executes these queries, and peculiar clusters are presented at the front end for further analysis, e.g. the clusters characterised by a high total OLAP time together with high front-end times, the queries that are frequently executed, and so on.
The final analysis with respect to the cluster analysis method can be summed up as follows: advanced clustering algorithms could be useful to take account of various data types and of the number of clusters. To this effect the same data set was analysed with the IBM Intelligent Miner, which is equipped with sophisticated clustering algorithms (namely of type Demographic and Neural). Based on the expert suggestions, the demographic clustering algorithm was used for the analysis; it yielded much more meaningful patterns, and these results are documented in APPENDIX-A. This has been accepted by the experts at SAP. To sum up, it makes sense to investigate further the possibility and feasibility of developing such algorithms, or of partnering with external data mining vendors and integrating their products with the Workbench.
APPENDIX-A
IBM Intelligent Miner Cluster Analysis
1. Data Selection - The data is extracted into a flat file and loaded into the Intelligent Miner
2. Model Selection
3. Attributes for analysis
4. Model Parameters
5. View of all clusters
6. Analysis of segments
7. Analysis of segments
8. Analysis of segments
To summarize, the results clearly show that the demographic algorithm takes the data set into consideration more precisely and that the number of clusters is determined by the algorithm based on the data distribution. There is one big cluster with 58% of the data set, which consists of similar data, but all the remaining clusters show valuable patterns, as judged by the experts at SAP.
APPENDIX-B
Results of SAP Cluster Analysis
As a sample, the following screens represent five clusters and the details of their value distributions.
1. Cluster 1
2. Cluster 2
3. Cluster 3
4. Cluster 4
5. Cluster 5
These are 5 of the cluster segments of the data model, out of a total of 10 clusters.
APPENDIX-C Related Internet Links
General information about Data Mining
http://www.the-data-mine.com
http://www.dmreview.com
http://www.datawarehousingonline.com/
http://datawarehouse.ittoolbox.com/
http://www.kdnuggets.com
http://itmanagement.earthweb.com/datbus/
http://www.thearling.com/index.htm4wps
Data mining software providers
Advanced Software Applications http://www.asacorp.com/
AIS Visual http://www.visualmine.com/
Alice http://www.alice-soft.com
Angoss http://www.angoss.com/
Assoc http://www.asoc.de
Attar Software http://www.attar.com/
Bissantz & Company http://www.bissantz.de/
Business Objects http://www.businessobjects.com/
Cogit http://www.cogit.com/
Cognos http://www.cognos.com/
Data Distilleries http://www.ddi.nl/
DataMind http://www.datamindcorp.com/
DataMiner http://www.dminer.com/
Datasage http://www.datasage.com/
Dialogis http://www.dialogis.de
Dimension 5 http://www.dimension5.sk/
HNC http://www.hnc.com/
human IT http://www.humanit.de/
Hyperparallel http://www.hyperparallel.com/
IBM http://www.ibm.com/
Information Discovery http://www.datamining.com/
Integral Solutions http://www.isl.co.uk/
Magnify http://www.magnify.com/
Management Intelligenter Technologien http://www.mitgmbh.de/
MarketMiner http://www.marketminer.com/
Mathsoft http://www.mathsoft.com/
NeoVista http://www.neovista.com
Oracle http://www.oracle.com/
Prudential Systems http://www.prudsys.de/
Quadstone http://www.quadstone.com/
Rulequest http://www.rulequest.com
SAP AG http://www.sap.com/index.epx
Salford Systems http://www.salford-systems.com/
SAS http://www.sas.com/
SGI http://www.sgi.com/software/mineset/
SLP InfoWare http://www.slp-infoware.com
SPSS http://www.spss.com/datamine/
Syllogic http://www.syllogic.nl
Tandem http://www.tandem.com/
Thinking Machines http://www.think.com/
Torrent http://www.torrent.com/
TriVida http://www.trivida.com/
Unica http://www.unica-usa.com/
Wizsoft http://www.wizsoft.com/