2005 Thesis Alluri Data Mining Data Warehouse SAP-BW

Master Thesis
EVALUATION OF DATA MINING METHODS TO SUPPORT DATA WAREHOUSE
ADMINISTRATION AND MONITORING IN SAP BUSINESS WAREHOUSE
Narasimha Raju Alluri
Faculty of Business Applications of Computer Science
UNIVERSITY OF APPLIED SCIENCES - FURTWANGEN
Furtwangen, Germany
Walldorf, Germany
20th April, 2005.
Primary Supervisor: Prof. Dr. Wolfram Reiners
Secondary Supervisor: Mr. Kai Willenborg
Declaration
I hereby declare in lieu of oath that I composed this thesis independently
and without inadmissible help from outside.
The sources used are quoted in full.
Walldorf, 20th April, 2005.
Signature
Acknowledgement
A journey is easier when you travel together. Interdependence is certainly more valuable than
independence. This thesis is the result of six months of work during which I have been accompanied
and supported by many people. I am pleased to now have the opportunity to express my
gratitude to all of them.
This work was written at Development - Product Line BI at SAP AG, Walldorf, Germany. The
development of this thesis was a time of personal growth, and that development didn't always take
place without pain.
I would like to thank Mr. Hoeltke Rainer and Mr. Willenborg Kai for showing me this
opportunity and teaching me several important things. They always listened to me patiently and
guided me along the way with regular feedback and suggestions. I would also like to thank Prof. Dr.
Wolfram Reiners for guiding me in this work and providing me with necessary tips throughout my
thesis.
I am very grateful to Dr. Weidner Jens from CRM Analytics for his supervision and constant
encouragement. The thoughts he has offered have enriched my thesis; without them, this
work would not have materialized in its present form.
I thank all the WHM (Warehouse Management) colleagues - BI Product Line Management for being
kind enough to respond to my countless queries.
This list goes on, but it has to end at some point. I thank one and all who have rendered help,
directly or indirectly, in the completion of this work.
Abstract
Data mining is not new. People who first discovered how to start fire and that the earth is round
also discovered knowledge, which is the main idea of Data mining. Even before technology was
used for Data mining, statisticians were using probability and regression techniques to model
historical data. For several years now, the buzzword 'Data Mining' has created a boom in many areas
around Data mining. In the 1960s, Management Information Systems and later, in the 1970s,
Decision Support Systems were praised for their great potential to supply executives with the mountains
of data needed to carry out their jobs. But the problem was that they simply supplied too much data
and not enough information to be generally useful. Advances in data collection and the
computerization of many business transactions flood us today with information, and generate an
urgent need for new techniques and tools that can intelligently and automatically assist us in
transforming this data into useful knowledge. Today, there is a huge amount of information locked
up in the mountains of data in companies' databases - information that is potentially important but
has not yet been discovered. This is where Data mining comes to the fore and helps to identify this
valuable information.
Data mining tools can predict future trends and behaviours, allowing businesses to make proactive,
knowledge-driven decisions. Today Data mining is primarily used by companies with a strong
consumer focus - retail, financial, communication, and marketing organizations. The list of
companies using Data mining technologies looks like a Fortune 500 Who's Who. The tasks of Data
mining can be classified into different groups. The two main tasks are the prediction and description
of data. Supporting these different tasks of Data mining, there is a variety of Data mining methods.
This paper concentrates on the evaluation of data mining methods to support Data Warehouse
administration and monitoring in SAP BW. It first builds up product-specific knowledge of
SAP BW and then investigates the possible areas of SAP BW for data mining. It goes on to analyse the key
figures that are available as part of SAP BW, looks at the Data mining
methods that are part of SAP BW, and ends with a final analysis through a clustering scenario. The detailed
screens of the data model, transformations and the results are presented, concluding with the testing
and analysis of the same data set with IBM Intelligent Miner.
List of Figures
Figure 1: The Operational Data Model
Figure 2: The Dimensional Model
Figure 3: SAP NetWeaver Components
Figure 4: SAP BW Architecture
Figure 5: ETL: Extraction, Transformation and Loading in SAP BW
Figure 6: Open Hub Service in SAP BW
Figure 7: Reporting in SAP BW
Figure 8: Web Application Framework
Figure 9: The Central Tool in SAP BW: AWB
Figure 10: Roadmap - Timeline and Focal Points
Figure 11: Examples of data in Public Use Micro data Sample data sets
Figure 12: Knowledge discovery in Databases
Figure 13: Knowledge discovery in Databases
Figure 14: The process of Data mining
Figure 15: Phases of the CRISP-DM Process Model
Figure 16: Data mining tasks and methods
Figure 17: Process steps for applying analytical methods
Figure 18: Simple linear regression
Figure 19: Comparison of Linear and Logistic Regression
Figure 20: Decision Tree for deciding whether a person should be offered a loan
Figure 21: The Analysis Process Designer (APD) architecture
Figure 22: APD integration with BW and other applications
Figure 23: Process description of the APD
Figure 24: Overview of the dataflow in BW statistics
Figure 25: The Data Model
Figure 26: Data Preparation - Data quality problems
Figure 27: Data Preparation - Consistency check
Figure 28: Data Selection - The PSA as a source table
Figure 29: Data Selection - Time period of data
Figure 30: Data Transformations - Filter Query, Info Cube and User
Figure 31: Data Transformation - Adding new key figure
Figure 32: Data Transformation - Transformation of OLAP times
Figure 33: Data Transformations - Aggregation of data
Figure 34: Data Transformations - Conversion of OLAP key figures
Figure 35: Data distribution of TOTAL OLAP time before transformation
Figure 36: Data Transformation - Discretizing Total OLAP
Figure 37: Data distribution of TOTAL OLAP after transformation
Figure 38: Discretization Query frequencies
Figure 39: Mapping the Modal attributes
Figure 40: Cluster Analysis - The influence chart
Figure 41: Cluster analysis - Results of cluster 1
Figure 45: Cluster analysis - Results of cluster 10
List of Abbreviations
APD: Analysis Process Designer
BI: Business Intelligence
BW: Business Warehouse
CRM: Customer Relationship Management
DSS: Decision Support System
DW: Data Warehouse
DM: Data Mining
ERP: Enterprise Resource Planning
IGS: Internet Graphics Server
IT: Information Technology
KDD: Knowledge Discovery in Databases
KM: Knowledge Management
KPI: Key Performance Indicator
MIS: Management Information Systems
MDM: Master Data Management
OLAP: Online Analytical Processing
SAP: Systems, Applications and Products (in Data Processing)
XI: Exchange Infrastructure
XML: Extensible Markup Language
Table of Contents
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
TABLE OF CONTENTS
1. INTRODUCTION
1.1. The extent of the work
1.2. Document Outline
2. THE MY SAP BW - A DATA WAREHOUSING SOLUTION FROM SAP
2.1. History and Evolution
2.1.1. What is Data Warehousing?
2.1.2. Operational Systems Vs Data Warehouse Systems
2.2. SAP BW Architecture
2.2.1. Extraction, Transformation and Loading Services
2.2.2. Data Storage and Management
2.2.3. Analysis and Access Services
2.2.4. Presentation Services
2.2.5. Administration Services
2.3. Features of SAP BW
3. DATA MINING AND ITS ECONOMIC USE
3.1. A general introduction to Data Mining
3.1.1. The importance of Data in Data Mining
3.1.2. Definitions of Data Mining
3.2. KDD - Knowledge Discovery in Databases
3.3. Data Mining and Data Warehouse
3.4. Common uses of Data Mining
3.5. The process of Data Mining
4. METHODS OF DATA MINING
4.1. An overview of Data Mining Methods
4.2. The SAP data mining workbench
4.2.1. Approximation
4.2.1.1. Regression Analysis
4.2.1.2. Weighted score tables
4.2.2. Clustering
4.2.3. Association analysis
4.2.4. Decision Trees
4.2.5. ABC classification
4.3. The SAP Analysis process designer workbench
5. AS-IS ANALYSIS: CURRENT SITUATION OF SAP BW ADMINISTRATION
5.1. The technical content of SAP BW
5.1.1. Statistical content cubes
5.1.2. Brief overview of some Characteristics and key figures
5.2. SAP BW administration and monitoring
5.3. Possible business scenarios for data mining
5.3.1. Data loads and Process chains
5.3.2. Queries
5.3.3. Dormant data
5.3.4. Table spaces and buffers
5.4. A way forward
6. TO-BE ANALYSIS - A SCENARIO WITH CLUSTER ANALYSIS
6.1. Motivations for cluster Analysis
6.1.1. Technical drivers
6.1.2. Business drivers
6.2. Analysis of Queries with cluster analysis
6.3. The Data Model
6.4. Data preparation
6.5. Data transformation
6.5.1. Data Aggregation
6.5.2. Relative numbers
6.5.3. Mapping the Modal attributes
6.6. Results of the cluster analysis
7. CONCLUSION AND OUTLOOK
BIBLIOGRAPHY
Monographs
INTERNET SOURCES
Dictionaries
Articles and other internet resources
APPENDIX-A
IBM Intelligent Miner Cluster Analysis
APPENDIX-B
Results of SAP Cluster Analysis
APPENDIX-C
1. Introduction
1.1. The extent of the work
This work is done within the framework of a Master Thesis, in partial fulfilment of the requirements
for the award of the degree Master of Computer Science in Business Consulting by the University
of Applied Science (Fachhochschule) Furtwangen, Germany.
It is aimed at the 'Evaluation of Data Mining Methods to Support Data Warehouse Administration and
Monitoring in SAP BW': it investigates the areas in BW where Data mining methods could be
implemented, and then cuts horizontally into the Query perspective to analyse the available key
figures. To accomplish this, which data mining method should be used? Considering the efforts and
resources involved in known methods, could there be another way to achieve the same goals
with less investment in efforts and resources? This work explores how this could
be achieved with the available functionality of the SAP Analysis Process Designer (APD) and the Data
mining workbench and, at the end, with some sophisticated algorithms that are part of IBM
Intelligent Miner.
1.2. Document Outline
The documentation of the Master's Thesis is outlined in the following way. Chapter Two presents
SAP Business Information Warehouse, including its history, evolution, architecture
and features. Chapter Three presents Data Mining: a general introduction running
through Knowledge Discovery in Databases, the common uses and the process of Data mining.
Chapter Four describes the Data Mining methods available as part of the SAP Data mining
workbench. Chapter Five provides the AS-IS analysis, describing the technical content of SAP
BW and administration and monitoring in SAP BW, and ends with the possible areas for data mining
and a way forward for the TO-BE analysis. Chapter Six goes through the motivation for the cluster
analysis and the data model, concluding with the results of the cluster analysis. Chapter Seven
closes with a conclusion and the future outlook.
2. The my SAP BW - A Data Warehousing solution from SAP
2.1. History and Evolution
Although it seems that the availability of data warehousing applications exploded in the 1990s,
the recognition of the need for these applications is not at all new. The need for data warehousing
originated in the mid-to-late 1980s with the fundamental recognition that information systems must
be distinguished into operational and informational systems [Devlin 1997]. Operational systems
support the day-to-day conduct of the business, and are optimized for fast response time of
predefined transactions, with a focus on update transactions. Operational data is a current and
real-time representation of the business state. In contrast, informational systems are used to manage and
control the business. They support the analysis of data for decision making about how the enterprise
will operate now and in the future. They are designed mainly for ad-hoc, complex and mostly
read-only queries over data obtained from a variety of sources. Informational data is historical, i.e., it
represents a stable view of the business over a period of time.
Over the past several years, the concept of a data warehouse has undergone many changes. The first
data warehouses were created to answer the demands of business managers and executives who
wanted to be able to extract, from the volumes of data produced by their in-house operational
applications, the key historical and summary data that would allow them to better plan, analyze, and
control their business enterprises. The initial data warehouses answered this need by providing
integrated historical and summary data.
The next phase in the data warehouse evolution was the creation of the data mart. Independent data
marts were appealing in part because they were small, relatively cheap, and quick to implement. Like
the spread of the PC before them, these independent data marts created "islands of information" as
each user created and stocked their own data mart. To address the challenges raised by the spread of
multiple independent data marts, the concept of the dependent data mart emerged. In a
dependent data mart architecture there is a central data warehouse that contains the "corporate view
of the data" and supplies the departmental data marts with the specific data they require.
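The dependent data mart arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the sample values and the helper name build_dependent_marts are hypothetical and are not taken from SAP BW or from the sources cited here. The point it shows is that every departmental mart is derived from the one central store rather than loaded independently.

```python
from collections import defaultdict

# Hypothetical central warehouse rows: (department, product, revenue).
# The values are invented purely for illustration.
CENTRAL_WAREHOUSE = [
    ("sales", "widget", 120.0),
    ("sales", "gadget", 80.0),
    ("finance", "widget", 45.0),
]


def build_dependent_marts(warehouse_rows):
    """Derive one data mart per department from the single central store."""
    marts = defaultdict(list)
    for department, product, revenue in warehouse_rows:
        # Each mart receives only the rows its department requires.
        marts[department].append((product, revenue))
    return dict(marts)


marts = build_dependent_marts(CENTRAL_WAREHOUSE)
print(sorted(marts))        # names of the derived marts
print(marts["sales"])       # rows supplied to the sales mart
```

Because the marts are computed from the central rows, a correction made centrally reaches every mart on the next derivation, which is the property that independent marts lack.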
2.1.1. What is Data Warehousing?
For most large companies today, Enterprise Resource Planning (ERP) serves within the core set of
day-to-day transaction processing applications. Even when data is efficiently captured and stored in
ERP systems, it may remain relatively useless for reporting and decision making purposes. From
General Ledger to Human Resources, ERP systems do the central work of running, tracking and
reporting on business data processing.
Just as ERP is critical to business transactions, so is data warehousing central to the analysis of those
transactions. Through data warehousing, product managers, marketing managers, communications
specialists, human resources recruiters, financial executives, CIOs and CEOs formulate analytical
queries and obtain reports. With these, they are able to make tactical and strategic business decisions
faster and better than ever before, thanks to their analytical resources.
According to Ralph Kimball, a "data warehouse is a copy of transaction data specifically structured for
query and analysis." [Kimball 1996]
"A Data Warehouse is a repository of integrated information, available for queries and analysis. Data
and information are extracted from heterogeneous sources as they are generated.... This makes it
much easier and more efficient to run queries over data that originally came from different sources."
- Stanford University [SU 2002]
According to Bill Inmon, known as the father of Data Warehousing, a data warehouse is a
subject-oriented, integrated, time-variant, non-volatile collection of data in support of management
decisions.
Subject-oriented means that all relevant data about a subject is gathered and stored as a single
set in a useful format.
Integrated refers to data being stored in a globally accepted fashion with consistent naming
conventions, measurements, encoding structures, and physical attributes, even when the
underlying operational systems store the data differently.
Non-volatile means the data warehouse is read-only: data is loaded into the data warehouse
and accessed there.
Time-variant data represents long-term data, from five to ten years, as opposed to the 30 to 60
day time periods of operational data. [W.H. Inmon 1999]
2.1.2. Operational Systems Vs Data Warehouse Systems
The fundamental difference between operational systems and data warehousing systems is that
operational systems are designed to support transaction processing whereas data warehousing
systems are designed to support online analytical processing (or OLAP, for short).
Operational systems are generally designed to support high-volume transaction processing with
minimal back-end reporting and are generally process-oriented or process-driven, meaning that they
are focused on specific business processes or tasks. Example tasks include billing, registration, etc.
Data warehousing systems are generally designed to support high-volume analytical processing (i.e.
OLAP) and subsequent, often elaborate report generation and are generally subject-oriented,
organized around business areas that the organization needs information about. Such subject areas
are usually populated with data from one or more operational systems. As an example, revenue may
be a subject area of a data warehouse that incorporates data from operational systems that contain
sales data, promotion data, costs data, etc.
Operational systems are generally concerned with current data, are updated regularly
according to need, and are optimized to perform fast inserts and updates of relatively small volumes
of data. Data warehousing systems are generally concerned with historical data and are non-volatile,
meaning that new data may be added regularly, but once loaded, the data is rarely changed, thus
preserving an ever-growing history of information. In short, data within a data warehouse is
generally read-only and optimized to perform fast retrievals of relatively large volumes of data.
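The contrast between update-in-place operational data and non-volatile warehouse data can be illustrated with a small sketch using Python's built-in sqlite3 module. The table names (orders, orders_history), the columns and the helper load_snapshot are assumptions made for this example only; they do not correspond to any SAP BW object.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Operational table (hypothetical): one row per order, overwritten in place.
con.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)
# Warehouse table (hypothetical): append-only snapshots, never updated.
con.execute(
    "CREATE TABLE orders_history "
    "(load_date TEXT, order_id INTEGER, status TEXT, amount REAL)"
)


def load_snapshot(connection, load_date):
    """Copy the current operational state into the history, stamped with a date."""
    connection.execute(
        "INSERT INTO orders_history (load_date, order_id, status, amount) "
        "SELECT ?, order_id, status, amount FROM orders",
        (load_date,),
    )


con.execute("INSERT INTO orders VALUES (1, 'open', 100.0)")
load_snapshot(con, "2005-01-01")

# The operational system overwrites the current state ...
con.execute("UPDATE orders SET status = 'billed' WHERE order_id = 1")
load_snapshot(con, "2005-01-02")

# ... while the warehouse keeps both states as history.
operational_rows = con.execute("SELECT status FROM orders").fetchall()
history_rows = con.execute(
    "SELECT load_date, status FROM orders_history ORDER BY load_date"
).fetchall()
print(operational_rows)
print(history_rows)
```

After the two loads the operational table holds only the latest status, whereas the history table holds one row per load date. This is the sense in which warehouse data is time-variant and non-volatile.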
To facilitate complex analyses and visualization, the data in a warehouse is typically modelled
multi-dimensionally. The structural difference between the Operational and Data Warehouse models can be
seen in Figure 1 and Figure 2.
Figure 1: The Operational Data Model
Source: Biao Fu and Henry Fu, SAP BW: A Step-by-step Guide
Figure 2: The Dimensional Model
Source: Biao Fu and Henry Fu, SAP BW: A Step-by-step Guide
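As a rough illustration of multi-dimensional modelling, the following sketch builds a tiny star schema with Python's built-in sqlite3 module: one fact table surrounded by two dimension tables. The table names, columns and figures are invented for this example and are not taken from the figures above or from SAP BW.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical star schema: dimensions describe the facts, the fact table
# holds the measurable key figure (revenue) plus a key per dimension.
con.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    revenue REAL
);
INSERT INTO dim_customer VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_time VALUES (1, 2004), (2, 2005);
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 2, 70.0);
""")

# A typical analytical query: aggregate the key figure along two dimensions.
revenue_by_region_year = con.execute("""
SELECT c.region, t.year, SUM(f.revenue)
FROM fact_sales AS f
JOIN dim_customer AS c ON c.customer_id = f.customer_id
JOIN dim_time AS t ON t.time_id = f.time_id
GROUP BY c.region, t.year
ORDER BY c.region, t.year
""").fetchall()
print(revenue_by_region_year)
```

The query reads many fact rows and summarizes them by dimension attributes, which is the access pattern a dimensional model is laid out to serve.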
Data warehousing is a concept. It is a set of hardware and software components that can be used to
better analyze the massive amounts of data that companies are accumulating, in order to make better business
decisions. Data Warehousing is not just the data in the data warehouse, but also the architecture and
tools to collect, query, analyze and present information.
Today, there are a lot of data warehouse vendors. The best-known data warehouse vendors include
SAP AG, SPSS Inc., IBM Corporation, Oracle Corporation and SAS Institute. Among them there
are also ERP vendors that offer their data warehouse products to go along with their ERP
applications (e.g., Oracle's Data Applications Data Warehouse, SAP's Business Information
Warehouse, etc.). Until recently, ERP vendors focused mostly on enhancing the infrastructure to
deliver high-performance OLTP solutions. Very little attention was given to providing applications
to analyze the massive amounts of data collected, or "jailed", within the ERP data repositories.
Customers were left to build data warehousing and reporting solutions without any help from
ERP vendors. [Inmon 1999]
Today, ERP vendors recognize that customers need data warehouse tools for their ERP
applications. While traditional data warehouse vendors were offering tools that were intended for
general purposes, ERP vendors created data warehousing solutions that were focused on their own
specific ERP applications as well as external sources. For instance, SAP created BW - Business
Information Warehouse, which has the best interface to extract data out of any source system.
2.2. SAP BW Architecture
SAP did not invent the software components of the SAP Business Intelligence solution overnight.
The business intelligence capabilities found in SAP software evolved in parallel to the CIF and other
informational processing frameworks. The SAP BI solution evolution has been quite rapid since the
first generally available release of the SAP BW software in 1998. In fact, organizations interested in
implementing the SAP BW software component to realize a CIF will find themselves licensing the
SAP Business Intelligence Solution rather than the SAP BW component. [McDonald et al. 2003]
In 1997, SAP launched an initiative to extend the reporting and analysis capabilities in the R/3
OLTP environment. This initiative, once called the Reporting Server, became the largest
development project in the history of SAP after the SAP R/3 development. SAP selected five
companies to pilot SAP Business Information Warehouse (BW) in 1997. In 1998, SAP launched a
so-called Early Customer Program (ECP) with six customers to gather requirements and to do a
proof of concept at customer sites. Release 1.2A of BW was made available to the public in
September 1998 [Hashmi 2003, 42].
SAP BW Releases: A brief history of the SAP BW release pattern looks like this:
BW 1.2b - introduction of InfoCubes and Business Content
BW 2.0b - introduction of ODS, mySAP.com interface
BW 2.1c - analytical component
BW 3.0 - further enhancement of ODS into a data warehouse, along with the creation
of analytical applications, partnerships, and so forth. [Inmon 2005]
BW 3.5 - designed to deliver seamless integration capabilities into all of the SAP
NetWeaver components, such as Information Broadcasting, Universal Data Access, Embedded
BI integration into SAP NetWeaver, Business Planning and Simulation, Unicode and so
on [SAPNET 2005-1]

SAP NetWeaver provides an open integration and application platform and permits the integration
of the Enterprise Services Architecture. You can unify business processes across technological
boundaries, integrate applications for your employees as needed, and access and edit simple
information easily and in a structured manner. [SAPHELP 2005-1]

The following Figure 3 shows the overall architecture and the position of SAP BW in NetWeaver.
Figure 3: SAP NetWeaver Components
Source: SAP Help Portal, 2005
According to figures as of April 12, 2005, BW has 9368 installations worldwide, with 5307 from
EMEA, 2293 from the Americas and 1768 from LAP. [SAPNET 2005-2]
The SAP Business Information Warehouse uses an integrated set of powerful components for data
collection, storage, analysis and administration to meet all the requirements for ready-to-go data
warehousing. SAP BW is completely based on an integrated metadata concept, with metadata
being managed by metadata services.
The SAP BW Meta Data Services components provide both an integrated Meta Data Repository,
where all metadata is stored, and a Meta Data Manager that handles all requests for retrieving,
adding, changing, or deleting metadata. The Meta Data Repository is integrated into the
Administrator Workbench, with a list of all metadata objects available there. Figure 4 shows the
layered architectural structure and components of SAP BW. The SAP BW architecture can be
divided into five main layers:
Administration
Extraction, Loading and Transformation Services
Data Storage and Management
Analysis and Access Services
Presentation
Figure 4: SAP BW Architecture
2.2.1. Extraction, Transformation and Loading Services
The extraction, transformation, and loading (ETL) services layer of the SAP BW architecture
includes services for data extraction, data transformation, and loading of data, and serves as a
staging area for intermediate data storage for quality assurance purposes. [McDonald et al.] The
core part of the ETL services of SAP BW is the Staging Engine, which manages the staging process
for all the data received from several types of source systems and is supported by the DataSource
Manager. The DataSource Manager manages the definitions of the different sources of data known
to the SAP BW system and supports five different types of interface: BAPI, File interface, XML
interface, DB Connect interface and UD Connect interface, as shown in Figure 4.

The BW includes pre-configured, ready-to-go extractors for R/3 applications, slashing the time
required to set up extraction routines. At the same time, the BW is not restricted to R/3. It is
possible to extract data from diverse SAP, non-SAP and legacy systems. In many cases, for example
for products from complementary software partners, SAP has already defined business application
interfaces (BAPIs) that guarantee quick implementation of efficient extraction routines. It is even
possible to incorporate data from flat files. In other words, the BW can be easily extended to include
external data of many kinds, for instance from content providers, demographic surveys or even
syndicated POS (point-of-sale) data reports. The BW also supports delta extracts, i.e. an extract
only of that data which is new or changed, minimizing transfer overhead and system load.
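The idea of a delta extract can be illustrated with a small sketch. This is not SAP's extractor logic: the record layout, the `changed_at` timestamp field and the `delta_extract` helper are assumptions made purely for illustration. The sketch selects only those records that were created or changed after the last successful load.

```python
from datetime import datetime

# Hypothetical source records; the field names are illustrative only.
source_records = [
    {"doc_id": 1, "amount": 100.0, "changed_at": datetime(2005, 4, 1, 9, 0)},
    {"doc_id": 2, "amount": 250.0, "changed_at": datetime(2005, 4, 2, 14, 30)},
    {"doc_id": 3, "amount": 75.5, "changed_at": datetime(2005, 4, 3, 8, 15)},
]


def delta_extract(records, last_load):
    """Return only the records that are new or changed since last_load."""
    return [r for r in records if r["changed_at"] > last_load]


# Timestamp of the previous load (assumed to be tracked by the warehouse).
last_load = datetime(2005, 4, 2, 0, 0)
delta = delta_extract(source_records, last_load)
print([r["doc_id"] for r in delta])
```

Only two of the three records are transferred in this example, which is the saving in transfer overhead that the text describes.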
The key to any data warehouse is metadata, i.e. additional data that describes the data in a way
that makes it meaningful and allows the user and the data warehouse to understand just what
that data represents. A major advantage of the BW is that it employs the tried and trusted metadata
models that are already part of R/3. In other words, these models do not have to be defined and
built laboriously and expensively from scratch. Nevertheless, the BW is able to understand other
metadata models, permitting the seamless integration of data from legacy systems or external
sources. In fact, SAP cooperates closely with content providers for special knowledge to ensure the
most efficient use is made of such sources. Synchronization means that changes made to the source
system are automatically recognized by the BW, so that no additional administration or coding
work is needed. Transformation rules are employed to scrub, adjust and augment extracted data so
that it matches the needs of the warehouse. Typical examples include adding the figures '19' to year
dates where required to create four-figure notation. Geocoding is a special BW feature that allows
data to be presented on easy-to-understand maps of countries or regions. Geocoding is performed
just once for all data when it is loaded and is from then on available for all InfoCubes, increasing
ease of use and cutting down on administrative overhead and system load. The geographical
dimension need not be added manually, thus saving time and money. Validation ensures that data is
intact before it is mapped to the InfoCube, preventing problems further downstream.
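A minimal sketch of such a transformation rule followed by a validation step is given below. It is not BW's transformation engine: the `expand_year` and `is_valid` helpers and the record fields are hypothetical and only mirror the year-date example from the text.

```python
def expand_year(value, century="19"):
    """Turn a two-digit year into four digits by prefixing the century."""
    text = str(value).strip()
    if len(text) == 2 and text.isdigit():
        return century + text
    return text


def is_valid(record):
    """A record counts as intact if it has a four-digit year and a numeric amount."""
    year = record.get("year", "")
    amount = record.get("amount")
    return len(year) == 4 and year.isdigit() and isinstance(amount, (int, float))


# Hypothetical extracted records.
extracted = [
    {"year": "98", "amount": 120.0},
    {"year": "1999", "amount": 80.0},
    {"year": "9x", "amount": 15.0},  # cannot be repaired and fails validation
]

# Apply the transformation rule, then validate before loading.
transformed = [{**r, "year": expand_year(r["year"])} for r in extracted]
loadable = [r for r in transformed if is_valid(r)]
rejected = [r for r in transformed if not is_valid(r)]
```

Keeping the rejected records in a separate list, rather than dropping them silently, is one way to let an administrator inspect what failed before anything reaches the InfoCube.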
Depending on the source systems and the type of data basis, the process of loading data into the
SAP BW is technically supported in different ways. In the conception phase, the system first needs
to detect the different data sources in order to be able to transform the data with the suitable tool
afterwards. [BW310 2005] The following Figure 5 gives a brief overview of the ETL process in SAP
BW.
Figure 5: ETL: Extraction, Transformation and Loading in SAP BW
Source: SAPCOURSE, BW310
2.2.2. Data Storage and Management
The Data Storage and Management layer, also called the SAP BW Data Manager, manages and
provides access to the different data targets available in SAP BW, as well as aggregates stored in
relational or multidimensional database management systems.
Data storage is based on an intelligent combination of InfoCubes (information data models) and
master data that enriches the depth of knowledge available while ensuring high performance. Master
data comprises non-volatile information on typical attributes for customers, companies, suppliers,
etc. This can mean addresses, regions or categories. Often this is drawn directly from R/3
applications, cutting down on maintenance and providing a perspective many other data warehouses
simply cannot emulate. InfoCubes are the basis for multidimensional views. They comprise
dimensions such as time, geographic region or product type, and key figures, i.e. actual volumes or
quantities. A large number of prefabricated InfoCubes are provided with the BW, allowing users to
begin exploring data immediately without having to spend time building InfoCubes. In addition, the
InfoCubes can be modified or defined as new needs arise.
InfoCubes provide a flexible means of aggregating data in accordance with one's needs. Many
InfoCubes are pre-defined, reducing the time required to set up your warehouse. The aggregation
mechanism is completely transparent to the end user and is designed to ensure high performance
and zero downtime. Hierarchies are external to the InfoCubes and therefore extremely flexible.
Changes to the structure of your business are easy to model, and you can view data according to new
or old structures without having to completely re-align that data. [McDonald et al.]
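The benefit of keeping hierarchies outside the InfoCube can be sketched as follows. The fact rows, the region names and the `roll_up` helper are invented for illustration and do not reflect how BW stores InfoCubes internally. The point is only that the same fact rows can be viewed under an old or a new structure by swapping the hierarchy, without touching the rows themselves.

```python
from collections import defaultdict

# Fact rows: dimension values plus one key figure (quantity).
facts = [
    {"month": "2005-01", "region": "Baden", "product": "A", "quantity": 10},
    {"month": "2005-01", "region": "Bavaria", "product": "A", "quantity": 5},
    {"month": "2005-02", "region": "Baden", "product": "B", "quantity": 7},
    {"month": "2005-02", "region": "Hesse", "product": "A", "quantity": 3},
]

# Hierarchies kept outside the fact data: region -> sales area.
old_hierarchy = {"Baden": "South", "Bavaria": "South", "Hesse": "South"}
new_hierarchy = {"Baden": "South", "Bavaria": "South", "Hesse": "Central"}


def roll_up(rows, hierarchy, key_figure="quantity"):
    """Sum the key figure per hierarchy node without changing the rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[hierarchy[row["region"]]] += row[key_figure]
    return dict(totals)


print(roll_up(facts, old_hierarchy))
print(roll_up(facts, new_hierarchy))
```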
InfoCubes
Definition: InfoCubes are the central objects of the multi-dimensional model in SAP BW. Reports
and analyses are based on these. An InfoCube describes a self-enclosed dataset for a business area
from a reporting view, that is, for the reporting end user. Queries can be defined and/or executed on
the basis of an InfoCube. [BW310]
There are the following InfoCube types in SAP BW:
BasicCube
VirtualCube:
  RemoteCube
  SAP RemoteCube
  Virtual InfoCube with Services
Only BasicCubes physically contain data in the database. By doing so, they are also data targets. BW
objects are data targets when data can be loaded into them. In contrast, virtual InfoCubes only
represent logical views of a dataset. There is no difference between these InfoCube types as far as
the reporting end user is concerned. Queries can be defined based on all InfoCube types. InfoCubes
are thus InfoProviders. BW objects are InfoProviders when queries can be defined/executed
based on them in SAP BW Reporting.
Master data tables
Additional information about characteristics is referred to as master data in the SAP BW system. A
distinction is made between the following master data types:
Attributes
Texts
(External) Hierarchies
Master data information is stored in separate tables, which are independent of the dimension tables,
in what are called master data tables (separately for attributes, texts and hierarchies). When a master
data-carrying characteristic is activated, master data tables (attributes, texts, hierarchies) are generated
in the characteristic maintenance depending on the settings in the respective tab strip. [BW310]
ODS objects
The ODS objects are flat data structures used to support reporting, analysis, and data integration in
SAP BW. The BW data warehouse itself resides on its own dedicated server, with its own data pool,
creating a robust, high-performance platform for analysis, reporting and exploration, and keeping
the load on the operational environment to an absolute minimum.
Definition: An Operational Data Store object (ODS object) is used to store consolidated and
cleansed data (transaction data or master data, for example) on a document level (atomic level).
It describes a consolidated dataset from one or more InfoSources. You can analyze this data with a
BEx query. [BW310]
2.2.3. Analysis and Access Services
The analysis and access layer provides access to analysis services and to structured and unstructured
information stored in the SAP BW. The primary components of this layer are the OLAP Engine, OLAP
BAPI, XML for Analysis, Business Explorer API, Open Hub Service, etc.
Based on a powerful Online Analytical Processing (OLAP) engine, the BW offers in-depth analysis
of information in many different ways. The BW allows the user to move from a bird's eye perspective
to one offering more detail (slicing or drilldown), or to change perspective entirely, based on a
completely new criterion (dicing or changing view). The BW builds on the capabilities of R/3 to
support complex financial reporting and analysis needs. It allows, for instance, varying fiscal years
and financial periods, and simultaneously supports the euro and national currencies. Currency
conversion can also be performed instantly at the latest exchange rates. Absolute figures can be
displayed, and the same statistics can then be viewed as a percentage or as a quotient. [McDonald et al.]
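The navigation steps described above can be sketched on a toy data set. The sales rows and the `view` and `as_percentage` helpers are assumptions for illustration only and are unrelated to the actual OLAP engine. The sketch moves from a grand total to more detailed views, switches to a different criterion, and shows the same figures as percentages.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "North", "product": "A", "revenue": 400.0},
    {"region": "North", "product": "B", "revenue": 100.0},
    {"region": "South", "product": "A", "revenue": 300.0},
    {"region": "South", "product": "B", "revenue": 200.0},
]


def view(rows, criteria):
    """Total revenue per combination of the chosen criteria."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[c] for c in criteria)] += row["revenue"]
    return dict(totals)


def as_percentage(totals):
    """Express each absolute figure as a share of the grand total."""
    grand_total = sum(totals.values())
    return {key: 100.0 * value / grand_total for key, value in totals.items()}


overview = view(sales, [])                              # bird's eye view: one total
by_region = view(sales, ["region"])                     # more detail
by_region_product = view(sales, ["region", "product"])  # drill down further
by_product = view(sales, ["product"])                   # change the criterion entirely
shares = as_percentage(by_region)                       # same figures as percentages
```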
The OLAP BAPI provides an open interface for accessing any kind of information available
through the OLAP engine. XML for Analysis is an XML API based on SOAP, designed for
standardized access to an analytical data provider (OLAP and data mining) over the web, and the
Business Explorer API connects the Business Explorer (BEx) - the SAP BW reporting and analysis
front-end solution - to the OLAP engine, allowing access to all available queries. [BW305 2005]
The SAP BW Open Hub Service makes it easy to provide application data from a BW system
to downstream systems such as non-SAP data marts, analytical applications, and other
external applications. InfoCubes, ODS objects, and master data tables can be data sources for the
Open Hub Service (refer to Figure 6). This ensures the controlled distribution of consolidated data
and information among several systems, whereby SAP BW serves as a central hub for information.
The SAP BW Open Hub Service is mainly characterized by various extraction options, detailed
scheduling and monitoring, and delta capability. [OHS-Release 2002]
Figure 6: Open Hub Service in SAP BW
2.2.4. Presentation Services
As the top layer in the SAP BW architecture, the Business Explorer (BEx) serves as the reporting
environment for the end users. It consists of the BEx Analyzer, BEx Query Designer, BEx Web
Application Designer, BEx Browser, BEx Formatted Reporting, etc. The Presentation Layer includes
all components required to present information available on the SAP BW server in the traditional
Microsoft Excel-based Business Explorer Analyzer (BEx Analyzer), in the BEx Web environment,
or in third-party applications.
The Business Explorer gives a broad spectrum of users access to the information in SAP BW:
through the Enterprise Portal (for example, through an iView that can be called up alongside the
applications from which the data is extracted), through the Internet (Web Application Design), or
through mobile devices (WAP or iMode-enabled mobile telephones, Personal Digital Assistants). [BW305]
The Business Explorer (BEx) component provides users with extensive analysis options, as depicted
in Figure 7.

Figure 7: Reporting in SAP BW
The SAP BW presents information in a user-friendly, easy-to-understand fashion. It comes complete
with a wide variety of standard reports that can be accessed with the click of a mouse, allowing
knowledge workers to utilize the facts and figures stored in BW from the word go. Standard reports
are supplied for the needs of particular departments, such as human resources, of particular
industries, such as manufacturing, and even for individual roles, such as account managers, product
managers, regional managers or financial controllers. Reports can also be adapted or
custom-designed, or used to initiate ad-hoc queries with new parameters. Figure 8 shows the Web
Application Framework in the SAP BW Presentation Layer.
Figure 8: Web Application Framework
The catalogue browser provides a graphical overview of available reports, and allows point-and-click
previews and selection. Results are displayed in the familiar environment of Microsoft Excel, where
users can leverage their existing PC skills to analyze information, to reformat or process data further,
or to distribute it to others, for example as an email attachment. Favourite reports can be grouped
together in clusters on a graphical desktop for even faster access. It also includes geographical data
visualization, enabling information to be shown on maps for even greater clarity and understanding.
The following shows the functional overview of the Business Explorer. [SAPHELP 2005]

Portal Integration, Collaboration and Distribution
Single point of Access
Role-based data retrieval
Personalization, collaboration and Profile Generation
Integration of unstructured data

Query, Reporting and Analyses
Query Design using the Desktop or web
Multidimensional (OLAP) Analyses (Web based or MS Excel)
Geographical Analysis
Ad-hoc Reporting
Alert
Publishing iViews
Seamless integration of web and Excel-based Analyses

Web Application Design
Web Application Design
Interactive analytical Content via the Web
Information Cockpits and Dashboards
Basis for Creating Analytical Applications
Creation of iViews for the Portals
Wizard based visualization
APIs for additional, highly individual web design

Formatted Reporting
Precise layouts to one pixel
Wizard-based layout definition
Static, Formatted Reports
Form based Reports
Predefined Crystal Reports in Business Content
Publishing on the Web
Practical printing options

Mobile Intelligence
Online and offline Scenarios
WAP-device and PDA Support
Automatic device recognition
Publishing using the Web Application Designer
Device specific output
Alerts, charts
Integration into the mobile portal.
2.2.5. Administration Services
The Administrator Workbench (AWB) is the primary administration, controlling, and monitoring
tool in SAP BW. AWB is the data warehouse manager of the SAP BW system. You use AWB to
manage, control, and monitor all the objects and processes in the SAP BW system. The AWB is
where you create meta objects. It is also the place where you use the scheduler to plan data uploads
and where you track them using the monitor. Assistants enable you to analyze the data-loading
processes closely. The assistants also help you to quickly identify the cause of any errors. [BW310]
The administration layer includes all services required to administer an SAP BW system. These
services are available through the Administrator Workbench. As the most prominent architectural
component, the AWB includes Modelling, Monitoring, Reporting Agent, Transport Connection,
Documents, Business Content, Translation and Metadata Repository. [McDonald et al.] One can
perform tasks in the AWB in the following function areas: [BW310]
Modelling
Monitoring
Reporting Agent
Transport Connection
Documents
Business Content
Translation
Metadata Repository
The following figure depicts an overview of the AWB tool.
Figure 9: The Central Tool in SAP BW: AWB
The SAP BW includes effective, user-friendly tools for all aspects of data warehouse administration.
The schema designer allows one to create InfoCubes, InfoSources, mapping and transformation
rules with point-and-click simplicity. The BW infrastructure is also ideal for realignment tasks, i.e.
the redefinition of aggregates in accordance with new categories. The use of attributes to describe
master data means one could, for instance, reorganize sales regions, and customers would be
automatically re-assigned to the right region thanks to the underlying address attributes. The
aggregates need not be rebuilt from scratch, and there is no need to trawl through the entire data
warehouse making time-consuming changes to all data collected for all customers. Re-alignment is
immediate, accurate and automatic. Data replication opens up the possibility of creating data marts
serving the particular needs of a specific group of users, such as foreign-based offices or the
purchasing department.
System administration tasks are equally easy to understand and simple to perform. The
administrator's workbench provides a user-friendly graphical interface for scheduling data extracts,
mapping, and aggregation routines, and for defining InfoCubes and reports. Moreover, extremely
useful tools are provided for monitoring and planning, helping to get the best possible performance
and greatest possible benefit out of the warehouse. The load monitor shows what has been loaded
and when. It highlights problems and describes their root cause with messages such as 'Unable to
access source XYZ' so that difficulties can be resolved quickly.
2.3. Features of SAP BW
The following section describes the features of SAP BW 3.5 and the positioning of SAP BW 3.5 in
NetWeaver. The following figure depicts the roadmap (timeline and focal points) of SAP BW.
Figure 10: Roadmap - Timeline and Focal Points
Source: SAPNET, Features List SAP BW 3.5
SAP BW 3.5 is designed to deliver seamless integration capabilities into all of the SAP NetWeaver
components, as well as offering new capabilities in the Business Intelligence platform and suite. The
following section describes some important features of SAP BW 3.5. [SAPNET 2005-3]
Information Broadcasting via Business Explorer (BEx) Broadcaster
Share and disseminate insights to support decision-making processes
Access the complete BI information portfolio via the SAP Enterprise Portal (SAP EP 6.0)
Single, web-based wizard to broadcast personalized BI information portfolios to various end
users (pre-calculated for optimized query response time)
Leverages SAP NetWeaver knowledge management features such as subscription, feedback,
discussion, collaboration, rating, enterprise search, etc.
Offers broadcasting services such as different scheduling options (ad hoc, based on data loads,
time scheduling), pre-calculation of queries and workbooks, sending pre-calculated queries and
web templates as email attachments
Based on the Java Repository Manager, all SAP BW metadata, master data, and transactional
documents, as well as pre-calculated queries/templates, are enabled for KM Services.
Universal Data Integration: The new Universal Data Integration significantly extends SAP BW
data access capabilities to diverse data sources.
BI Java Connectors: Several hundred connectors provide access to all data sources that
support JDBC, XMLA, OLE DB for OLAP and SAP Query
UDConnect (Universal Data Connect): Out-of-the-box connectivity for additional data
sources that can be accessed by the BI Java Connectors. UDConnect supports staging and
remote scenarios for this data, for instance extraction from/remote access to a relational
database via JDBC, extraction from/remote access to an OLAP source using OLE DB for
OLAP, and extraction from an OLAP source using XML for Analysis.
BI Java Software Development Kit (BI Java SDK) for custom-built Java applications accessing
SAP BW or non-SAP BW data via the BI Java Connectors; it is easy to use and learn and is
based on open and accepted standards for interoperability
Embedded BI - Integration into SAP NetWeaver
Web Application Server:
Integration with new Internet Graphics Server (IGS) and WAS Alert Framework
Connecting BI alert framework to the SAP NetWeaver alert repository to streamline
alert message processing
Platform independence for graphical rendering (charts, maps), improved usability and
new chart designer in BEx Web Application Designer
Inbound Message Processing:
Integration with SAP Exchange Infrastructure (SAP XI) to support real-time data acquisition
The data warehouse and/or operational data store is simply another subscriber to the
real-time data being distributed by the Integration Broker
Data is active and event-driven, and is available to the Business Intelligence system in "real
time"
Reporting on harmonized master data:
Integration with SAP Master Data Management (MDM) helps to improve the quality of
decisions made
Create consolidated views on customers, vendors and products
Enhance master data with global attributes for company-wide analysis (e.g. spend
analysis)
BI Web Services: The following BI web services can be accessed via open standards:
XML Data Load, XML for Analysis, XML Query Result Set
Leveraging the Web Application Server 6.40 technology infrastructure
Seamless deployment of BI web applications:
1. into SAP Enterprise Portal roles for instant information delivery
2. into SAP Enterprise Portal collaboration rooms
3. into SAP Enterprise Portal KM folders, which allows searching through BI applications
in the context of unstructured information as well as gaining improved query response times
through cached application retrieval
METHODS OF DATA MINING
3. Data Mining and its Economic use
3.1. A general introduction to Data Mining
Data mining is not new. The people who first discovered how to start a fire, and that the earth is
round, also discovered knowledge, which is the main idea of Data mining. Even before technology
was used for Data mining, statisticians were using probability and regression techniques to model
historical data. [Groth 1988, 19] Today, technology allows us to capture and store vast quantities of
data. Finding and summarizing the patterns, trends, and anomalies in these data sets is one of the big
challenges in today's information age. [Witten and Frank 2000] "With the unprecedented growth-rate
at which data is being collected and stored electronically today in almost all fields of human
endeavour, the efficient extraction of useful information from the data available is becoming an
increasing scientific challenge and a massive economic need." [Zaki and Ho 2000]
In the 1960s, Management Information Systems (MIS) and later, in the 1970s, Decision Support
Systems (DSS) were praised for their great potential to supply executives with the mountains of data
needed to carry out their jobs. While these systems have supplied some useful information for
managers, they have not lived up to their proponents' expectations. One reason was that they simply
supplied too much data and not enough information to be generally useful. [Müller and Lemke
2003] Advances in data collection, the widespread use of bar codes for most commercial products,
and the computerization of many business transactions have flooded us with information, and
generated an urgent need for new techniques and tools that can intelligently and automatically assist
us in transforming this data into useful knowledge. [Fayyad 1996]
Today, there is a huge amount of information locked up in the mountains of data in companies'
databases - information that is potentially important but has not yet been discovered. The idea is to
build computer programs that sift through databases, seeking regularities or patterns. [Witten and
Frank] With the development of technology, the Data mining process can be supported very well by
powerful Data mining tools, and Data mining has become a very hot topic. According to a study by
the GARTNER GROUP, more than half of the Fortune 1000 companies use Data mining
technologies for several purposes. [Dastani 2005]
3.1.1. The importance of Data in Data Mining
The term Data mining implies that data play an important role in the Data mining process.
They are the foundation for all analysis and all Data mining techniques. In computing, data is
information that has been translated into a form that is more convenient to store, move or process.
Relative to today's computers and transmission media, data is information converted into binary
digital form. [Techtarget 2005] Computer programs can base their Data mining techniques on this
data.
Definition of data / information -- (a collection of facts from which conclusions may be drawn;
"statistical data"). [Princeton 2005]
The nature of the data sets: A data set is a set of measurements taken from some environment or
process. In the simplest case, we have a collection of objects, and for each object we have a set of
the same p measurements. In this case, one can think of the collection of the measurements on n
objects as a form of n × p data matrix. The n rows represent the n objects on which the
measurements were taken (for example medical patients, credit card customers and so on). Such
rows may be referred to as individuals, entities, cases, objects, or records depending on the context.
The other dimension of the data matrix contains the set of p measurements made on each object.
Typically one could assume that the same p measurements are made on each individual, although this
need not be the case (for example, different medical tests could be performed on different patients).
The p columns of the data matrix may be referred to as variables, features, attributes, or fields; again,
the language depends on the research context. In all situations the idea is the same: these names
refer to the measurement that is represented by each column. [Hand et al. 2004]
ID   Age  Sex     Marital Status  Education             Income
248  54   Male    Married         High school graduate  100000
249  ??   Female  Married         High school graduate  12000
250  29   Male    Married         Some college          23000
251  9    Male    Not married     Child                 0
252  85   Female  Not married     High school graduate  19798
253  40   Male    Married         High school graduate  40100
254  38   Female  Not married     Less than 1st grade   2691
255  7    Male    ??              Child                 0
256  49   Male    Married         11th grade            30000
257  76   Male    Married         Doctorate degree      30686
Figure 11: Examples of data in Public Use Microdata Sample data sets.
Source: Hand et al., Principles of Data Mining.
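The n × p data matrix described above can be written down directly. The following is a minimal sketch; the variable names and the three rows are illustrative values chosen for this example, not a prescribed representation.

```python
# n objects (rows) described by p measurements (columns).
variables = ["age", "income"]   # p = 2 variables
data_matrix = [
    [54, 100000],
    [29, 23000],
    [40, 40100],
]                               # n = 3 objects

n = len(data_matrix)
p = len(variables)

# One row is one individual (record); one column is one variable (field).
second_record = data_matrix[1]
income_column = [row[variables.index("income")] for row in data_matrix]
```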
Data come in many forms, and developing a complete taxonomy is beyond the scope of this paper.
Indeed, it is not even clear that a complete taxonomy can be developed, since an important aspect of
data in one situation may be unimportant in another.
There are certain basic distinctions to which one should draw attention. One is the difference
between quantitative and categorical measurements (different names are sometimes used for these).
A quantitative variable is measured on a numerical scale and can, at least in principle, take any value.
The columns Age and Income in Figure 11 are examples of quantitative variables. In contrast,
categorical variables such as Sex, Marital Status and Education in Figure 11 can take only certain,
discrete values. The common three-point severity scale used in medicine (mild, moderate, severe) is
another example. Categorical variables may be ordinal (possessing a natural order, as in the
Education scale) or nominal (simply naming the categories, as in the Marital Status case). A data
analytic technique appropriate for one type of scale might not be appropriate for another. For
example, were marital status represented by integers (e.g., 1 for single, 2 for married, 3 for widowed,
and so forth), it would generally not be meaningful or appropriate to calculate the arithmetic mean of
a sample of such scores using this scale. Similarly, simple linear regression (predicting one
quantitative variable as a function of others) will usually be appropriate to apply to quantitative data,
but applying it to categorical data may not be wise; other techniques that have similar objectives (to
the extent that the objectives can be similar when the data types differ) might be more appropriate
with categorical scales. [Hand et al.]
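The point about matching the technique to the measurement scale can be made concrete with a short sketch. The sample values below are invented, and the integer coding of marital status follows the example in the text. The arithmetic mean of the codes can be computed, but the result does not correspond to any category, whereas the mode (for nominal data) and the median by rank (for ordinal data) do.

```python
from collections import Counter
from statistics import mean

incomes = [100000, 12000, 23000, 0]      # quantitative
marital_codes = [2, 2, 1, 3, 2, 3]       # nominal: 1 single, 2 married, 3 widowed
severity = ["mild", "severe", "moderate", "mild", "moderate"]  # ordinal
severity_order = ["mild", "moderate", "severe"]

# Meaningful: the mean of a quantitative variable.
mean_income = mean(incomes)

# Computable but not meaningful: the result is not a marital status.
mean_marital = mean(marital_codes)

# Appropriate for nominal data: the most frequent category.
modal_marital = Counter(marital_codes).most_common(1)[0][0]

# Appropriate for ordinal data: the median by rank.
ranks = sorted(severity_order.index(s) for s in severity)
median_severity = severity_order[ranks[len(ranks) // 2]]
```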
Measurement scales, however defined, lie at the bottom of any data taxonomy. Moving up the
taxonomy, one finds that data can occur in various relationships and structures. Data may arise
sequentially in time series, and the data mining exercise might address entire time series or particular
segments of those time series. Data might also describe spatial relationships, so that individual
records take on their full significance only when considered in the context of others.
3.1.2. Definitions of Data Mining
Translated word by word, Data mining means mining or digging in data with the purpose of
finding information or knowledge. In the more abstract and very well-known definition of
Frawley, Data mining is defined as "The nontrivial extraction of implicit, previously
unknown, and potentially useful information from data". [Frawley 1992]
Groth mentions another interesting aspect of Data mining. He describes it as "the process of
automating information discovery". [Groth] Today Data mining is a term that covers a broad range
of techniques to analyze data. The techniques use specific algorithms to identify and extract patterns
and establish unknown relationships in order to discover hidden and valuable information in a huge
amount of data. Most companies already collect massive quantities of data. Data mining techniques
can be implemented on existing software and hardware platforms to enhance the value of existing
information resources. [Thearling 2005]
In the words of Moxon: "Data mining is the process of discovering meaningful new correlation,
patterns and trends by sifting through large amounts of data, using pattern recognition technologies
as well as statistical and mathematical techniques." Data mining is a "knowledge discovery process of
extracting previously unknown, actionable information from very large databases." [Moxon 1996]
"According to their final goal, data mining techniques can be considered to be descriptive or
predictive. Descriptive data mining intends to summarize data and to highlight their interesting
properties, while predictive data mining aims to build models to forecast future behaviours". [Han
and Kamber 2001]
3.2. KDD - Knowledge Discovery in Databases
Knowledge Discovery in Databases, often abbreviated as KDD, is the concept of
"extracting previously unknown and potentially useful information from large sets of data".
[Witnessminer 2005] KDD is thus the concept of a multistage process that identifies patterns in
data in order to find new information. Data mining is only one stage in the KDD process, concerned
with applying computational techniques to find patterns in data. This step consists of algorithms
which deliver patterns in an acceptable time out of a defined database. Other stages in the KDD
process address the comprehensibility and the validity of the discovered patterns. In theory and
practice the expressions KDD and data mining are often mixed. But it is important to understand that
KDD is the whole concept and data mining is only one step in this concept of extracting information
from data. Simplified, KDD is the concept and data mining is the tool. [Witnessminer]
The five main processes that are common to almost all of the methods are: Task Analysis,
Pre-processing, Data Mining, Post-processing and Deployment. These are shown diagrammatically
in Figures 12 and 13.
Figure 12: Knowledge discovery in Databases
Source: [Lesley 2004]
Figure 13: Knowledge discovery in Databases
Source: [Lesley 2004]
3.3. Data Mining and Data Warehouse
The evolution of database technology is an essential prerequisite for understanding the need for
knowledge discovery in databases (KDD). Data mining is a pivotal step in the Knowledge Discovery
in Databases process: the extraction of interesting patterns from a set of data sources (relational,
transactional, object-oriented, spatial, temporal, text, and legacy databases, as well as data
warehouses and the World Wide Web). The patterns obtained are used to describe concepts, to
analyze associations, to build classification and regression models, to cluster data, to model trends in
time series, and to detect outliers. Since the patterns present in data are not all equally
useful, interestingness measures are needed to estimate the relevance of the discovered patterns and
to guide the mining process.
The first step toward building a productive data mining program is to gather data. Most businesses
already perform these data gathering tasks to a very high extent. [Chapple 2005]
Very often a data warehouse is used to manage and store the gathered data. Because of the huge
amount of stored data, the key is to locate the data critical to the business. Companies therefore use
data mining tools with the purpose of discovering new information in the data stored in the data
warehouse. The data warehouse is the data foundation for all the analyses of the data mining tools.
Data mining helps companies focus on the most important information in their data warehouses.
[Thearling] The major analysis of this work is done within the framework of SAP BW 3.5, which
includes a couple of data mining methods. These are described in detail in the
next chapter.
3.4. Common uses of Data Mining
Data mining tools can predict future trends and behaviours, allowing businesses to make proactive,
knowledge-driven decisions. The automated, prospective analyses offered by data mining move far
beyond the analyses of past events provided by the retrospective tools typical of decision
support systems.
Data mining tools can give answers to business questions that traditionally were time consuming to
resolve. [Thearling] Today data mining is primarily used by companies with a strong consumer
focus: retail, financial, communication, and marketing organizations. It enables them to determine
relationships among "internal" factors such as price, product positioning, or staff skills, and
"external" factors such as economic indicators, competition, and customer demographics.
Furthermore, it enables these companies to determine the impact on sales, customer satisfaction,
and corporate profits, and it enables them to "drill down" into summary information to view detailed
transactional data. [Palace 2005]
As described in 3.1 "A general introduction in Data Mining", a large number of companies use data
mining today, and the list of these companies looks like a Fortune 500 "Who's Who". [Groth] The
purposes for which data mining is used are as different as the companies themselves. Here are a few
areas in which companies use data mining to achieve a strategic benefit:
Direct Marketing
The idea here is to find out who is most likely or most desirable to buy certain products. This
information can be used for several marketing activities.
Trend Analysis
With trend analysis companies are able to predict trends in the marketplace. Using this information
can lead to a strategic advantage because it is useful in reducing costs and time to market.
Fraud Detection
Companies use data mining techniques to model which business transactions are likely to be
fraudulent. This is used for insurance claims, cellular phone calls or credit card purchases.
Forecasting in Financial Markets
There are many possibilities to model financial markets with data mining methods. For example,
neural networks can be used for financial gain. [Groth]
Apart from these applications, companies also use data mining for: [Bao 2005]
Business information
    Investment analysis
    Loan approval
Manufacturing information
    Controlling and scheduling
    Network management
    Experiment result analysis
Scientific information
    Sky survey cataloguing
    Bio sequence databases
    Geosciences: Quake finder
Performance and monitoring of standard software systems
The main purpose of this work is to find out how to introduce data mining functionality to support
the SAP BW administrator in areas such as data loading, reporting and planning in order to
discover error situations proactively. From these investigations it is quite clear that
companies would like to offer product functionality that assists system
administrators, for instance by showing how data mining methods could ease the day-to-day
activities of the SAP BW administrator.
3.5. The process of Data Mining
Data mining should be regarded as a strategic and competitive move. Before the data mining
process starts, the goal that is the focus of the analysis should therefore be clarified. Otherwise it is
not possible to search for new valuable information, because the necessary parameters cannot be
defined. There are different models for the data mining process, depending on the task at hand. The
following description is based on the model of Fayyad. [Fayyad]
Step 1: Data selection
The needed data are selected from a database according to their objects and characteristics.
Step 2: Pre-Processing
In this step the selected data are cleaned. This means, for example, filling in missing
values.
Step 3: Transformation
In the transformation phase the data are transformed into new formats, if necessary.
Step 4: Data Mining
This step of the process identifies the patterns and relationships in the data.
Step 5: Interpretation and Evaluation
In the last step the result has to be interpreted and evaluated to come up with suitable actions.
The following picture shows the process in a graphical representation.
Figure 14: The process of Data mining
Source: SPSS, Clementine 7.0 user's guide.
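The five steps can be illustrated with a small sketch. It is not part of Fayyad's model or of SAP BW: the record layout (load runs of a process chain with a duration and a row count), the function names and the 1.5-times-average rule are assumptions chosen only to make each step concrete.

```python
from statistics import mean

# Hypothetical load records: (process chain, duration in seconds, rows loaded).
RAW = [
    ("chain_a", 120.0, 1000),
    ("chain_a", None, 1100),   # duration is missing
    ("chain_b", 300.0, 500),
    ("chain_a", 480.0, 1200),
    ("chain_c", 50.0, 10),     # outside the analysis goal
]

def select(records, chains):
    """Step 1: keep only the records relevant to the analysis goal."""
    return [r for r in records if r[0] in chains]

def preprocess(records):
    """Step 2: clean the data, here by filling missing durations with the mean."""
    fill = mean(d for _, d, _ in records if d is not None)
    return [(c, d if d is not None else fill, n) for c, d, n in records]

def transform(records):
    """Step 3: bring the data into a new format, here seconds per 1000 rows."""
    return [(c, d * 1000 / n) for c, d, n in records]

def mine(records):
    """Step 4: find a simple pattern, here runs slower than 1.5 times the average."""
    limit = 1.5 * mean(v for _, v in records)
    return [(c, v) for c, v in records if v > limit]

def interpret(patterns):
    """Step 5: turn the patterns into suggested actions."""
    return [f"check {c}: {v:.0f} s per 1000 rows" for c, v in patterns]

selected = select(RAW, {"chain_a", "chain_b"})
print(interpret(mine(transform(preprocess(selected)))))
```

In a real project each step would be far richer, but the order of the calls mirrors the order of the steps described above.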
Cross-industry process for Data mining, CRISP
The CRISP method is one of the several available learning methods. It encompasses all the facets of
learning, from the conception to the realization and deployment of the gained
information. It begins, as can be seen from Figure 15 below, with an analysis for a business
understanding of the problem. Questions on the relationship between the operating factors are
asked at this stage. The dependence of one factor on another (or several others) is also stipulated at
this stage.
After a business understanding is laid down, understanding the data becomes the next task,
according to the CRISP model. What tables have to be created? How would the tables be made
available? Would a single data instance be enough or would several data instances be needed? What
about the quality of the data? Based on the understanding of the data, the business understanding
may have to be adjusted or additional inputs made to the data, e.g. creation of additional tables, so as
to be able to realize the desired business objective. Data preparation then follows. Based on the
analysis desired, columns might have to be filtered out, or data aggregated, merged, etc. The modelling
process can then be done at this stage. As can be seen from Figure 14, additional data
preparation may be needed in order to realize the desired model.
An evaluation of the whole process follows. In some cases, a supervised form of learning might be
very helpful here. Interim results would be checked against the historical data to ascertain
the level of conformity, which also serves in the evaluation of the entire process.
The gained information or intelligence can now be deployed. The destination could be another
system, say, an ERP system like SAP CRM, or a database system. The results could take the form of
final reports, presentations, action plans, etc. They could also be used for further analysis. Moreover,
feedback could be made to the initial business understanding for the purpose of further analysis,
after which the entire process would be repeated.
The overall process involved in the CRISP model can be summarized as follows: [CRISP, 2005]
Figure 15: Phases of the CRISP-DM Process Model
Source: CRISP, 2005.
Business Understanding: Description of the business objective and data mining
goals/success
Data Understanding: Selection of the data and exploratory analysis (quality, problems,
description of selected data)
Data Preparation: Cleaning, transformation, integration, formatting of the selected data
Modelling: Selection, building, testing and running of different models
Evaluation: Approval of the models and assessment of the results (in accordance with the
defined objectives), review of the process
Deployment: Preparation of final reports, presentations, action plans and deployment of
results
4. Methods of Data Mining
4.1. An overview of Data Mining Methods
The last chapter gave an overview of data mining and discussed its tasks. To realize
these tasks, the data mining methods still need to be described. "Data mining methods detect
patterns in large amounts of data, and use these patterns to detect future instances in similar data."
[Zadok and Stolfo 2005]
"There are many kinds of data mining methods. Some are well founded in mathematics and
statistics, whereas others are used simply because they produce useful results." [Lidal and Dingsoyr
2005] Because data mining has emerged from many different fields, different kinds of methods can
be used in different areas. "Researchers have approached the knowledge discovery process from
different angles, with different algorithms, based on their scientific interests and backgrounds."
[Lidal and Dingsoyr] But no one method can solve all data mining problems. Some of them serve
several tasks at the same time; Figure 16 gives a short summary of the tasks and the different
methods.
Tasks                           Methods
Prediction & Description        Decision trees, Market basket analysis (Association analysis),
                                Time series analysis, Neural networks, Agent network technology
Classification                  Market basket analysis (Association analysis), Decision trees,
                                Neural networks, Sorting
Regression                      Linear regression, Logistic regression, Multinomial regression
Clustering                      Cluster analysis, Neural networks
Summarization                   Genetic algorithms
Dependency modelling            Analysis of variance, Link analysis
Change and deviation detection  Fuzzy logic
Figure 16: Data mining tasks and methods
The following section introduces some data mining methods that are available as part of SAP
BW and are commonly used in practice, not in great detail, but enough to give a fundamental
understanding of them.
4.2. The SAP data mining workbench
The SAP Business Information Warehouse is a complete suite of applications, i.e. a solution which
includes the activities of data collection and storage, decision support systems, query and reporting,
online analytical processing, statistical analysis, forecasting, and data mining. In SAP BW, data from
the disparate database(s) of all systems in the enterprise are collected, consolidated, administered and
provided for analysis and planning purposes. This data often holds further valuable potential.
Even with sophisticated analysis tools, new information that presents itself in the form of meaningful
relationships between the data is often hidden or too complex to be uncovered through pure
observation or intuition. With the assistance of SAP BW, it is possible to investigate
and identify these hidden or complex relations between the data. For this discovery process, several
methods are provided (e.g. statistical and mathematical calculations, data cleansing and restructuring
methods, etc.). The intelligence gained can be uploaded automatically into the SAP BW database
or redirected into an operational system like SAP CRM. In either case, the intelligence is made
available for all decision-making and/or application processes and can thus be of significant
importance: strategically, tactically, and operationally.
The SAP Data Mining Workbench offers a single point of entry for access to the available data mining
models, namely
Decision trees
Clustering
Association analysis (Market Basket analysis)
Approximation (Regression and Weighted score tables)
ABC classification
It also provides an option to connect to third-party data mining models. For each model type
a wizard guides the user through the process of creating the model, thus enabling users interested in
analytical results to set up data mining models easily. The following figure shows the process steps
for the analytical models available as part of the SAP data mining workbench.
Figure 17: Process steps for applying analytical methods
Source: SAPCOURSE, CR900, mySAP CRM Analytics.
There are two broad classes of data mining methods: supervised and
unsupervised learning. In supervised learning, a data sample is first selected and the system is
"trained" with it so as to understand the dynamics involved. The result is then weighed against the
known historical data to see the extent to which the system's output corresponds to the known output.
Further learning may have to be applied, as much as is needed, until the system
turns out an answer that largely (mostly 99.99%) reflects the decisions already made on historical data.
On the other hand there is unsupervised learning. This is, fundamentally, where data mining plays a
great role. A heap of data is "mined" so as to discover the complex, hidden and unexpected
relationships and correlations that may exist in it. Although the system can be made to run the
process as often as wished, it is basically done without the form of bias that is present in
supervised learning.
Supervised learning is mostly predictive while unsupervised learning is primarily informative. This is
so because in supervised learning the interim result is weighed against historical data with known
output to see if the result corresponds with known cases. The following sections introduce some data
mining methods that are commonly used and are part of SAP's offering, not in great detail, but
enough to give a fundamental understanding of them.
4.2.1. Approximation
A statistical orientation is one of the main ways to analyze data. The purpose of
approximation (scoring) is to valuate the data records. SAP offers weighted score tables and
regression analysis, namely linear regression and non-linear regression (logistic and multinomial
regression), to perform the valuation.
4.2.1.1. Regression Analysis
Regression is a function that maps a data item to a real-valued prediction variable. It thus predicts
the value of a continuous-valued variable based on the values of other variables, assuming a linear or
nonlinear model of dependency. [Kumar and Joshi] There are many regression applications in
practice, e.g. predicting the amount of biomass present in a forest given remotely sensed
microwave measurements, estimating the probability that a patient will die given the results of a set
of diagnostic tests, predicting consumer demand for a new product as a function of advertising
expenditure, and time series prediction where the input variables can be time-lagged versions of the
prediction variable. [Bao]
Regression analysis is the technique used to interpolate and extrapolate the observations. It can
be classified into linear and non-linear regression. Linear regression is a statistical technique
which attempts to fit a line to the observed data, and through this line to predict future data. "It
quantifies the relationship between two continuous variables: the dependent variable or the variable
you are trying to predict and the independent or predictive variable". [Rud 2001] It works by finding
a line through the data that minimizes the squared error from each point. The formula of linear
regression is: [Whitehead 2005]
Y = a + bX + e
Y: a dummy dependent variable, =1 if event happens, =0 if event doesn't happen,
a: the coefficient on the constant term,
b: the coefficient(s) on the independent variable(s),
X: the independent variable(s),
e: the error term.
For instance, Figure 18 shows the relationship between sales and advertising along with the
regression line. The goal is to be able to predict sales based on the amount spent on advertising.
Figure 18: Simple linear regression
Source: Rud, Data Mining Cookbook
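The idea of a line that minimizes the squared error can be made concrete with the closed-form least-squares solution for one independent variable. The following is a minimal sketch; the function name and the advertising and sales figures are invented for illustration and are not taken from Rud or Whitehead.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Hypothetical observations: advertising spend and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(advertising, sales)
predicted = a + b * 6.0   # predicted sales for an advertising spend of 6.0
print(f"sales = {a:.2f} + {b:.2f} * advertising, prediction at 6.0: {predicted:.2f}")
```

The error term e of the formula corresponds to the differences between the observed sales and the values on the fitted line.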
It is also possible that the relationship between the two variables is not linear. The relationship can
also be curvilinear or multiple linear. Logistic regression is very similar to linear regression. "The
logistic regression model is simply a non-linear transformation of the linear regression."
[Whitehead] It uses a sigmoid function instead of a linear function to fit the data. The main difference
between them is that in the logistic regression model the dependent variable is discrete or
categorical, not continuous. It is therefore very useful in the marketing area because it can be used to
predict a discrete action such as a response to an offer or a default on a loan. [Rud] The logistic
regression model can be described as follows: [Whitehead]
ln[p/(1-p)] = a + bX + e
p: the probability that the event Y occurs, p(Y=1),
b: the coefficient(s) on the independent variable(s),
e: the error term,
p/(1-p): the odds ratio,
ln[p/(1-p)]: the log odds ratio, or logit.
Logistic regression, like linear regression, is also based on a statistical distribution. But the "logistic"
distribution is an S-shaped distribution function which is similar to the standard normal distribution
(which results in a probit regression model), but easier to work with in most applications because the
probabilities are easier to calculate. "The logistic distribution constrains the estimated probabilities
to lie between 0 and 1." [Whitehead] A graphical comparison of the linear regression and logistic
regression models is shown in Figure 19.
Figure 19: Comparison of Linear and Logistic Regression
Source: Whitehead, An Introduction to Logistic Regression.
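The relation between the logit and the estimated probability can be sketched in a few lines. Solving ln[p/(1-p)] = z for p gives p = 1/(1 + exp(-z)), the S-shaped (sigmoid) function. The coefficient values used below are assumptions for illustration only; in practice they would be estimated from data.

```python
import math

def logit(p):
    """Log odds ratio ln[p/(1-p)] of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Sigmoid function: maps any real z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def response_probability(a, b, x):
    """Estimated probability of the event for one independent variable x."""
    return inverse_logit(a + b * x)

# Hypothetical coefficients a = -3.0 and b = 0.5.
for x in (0.0, 6.0, 12.0):
    print(x, round(response_probability(-3.0, 0.5, x), 3))
```

Whereas a straight line would eventually leave the interval from 0 to 1, the sigmoid only approaches these limits, which is why the dependent variable can be read as a probability.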
Multinomial Regression: What has been discussed so far for linear and logistic regression
refers only to two variables. When a nominal response variable has more than two categories,
another regression method can be used: the so-called multinomial regression. "Multinomial logit
models are multiequation models." [GSE&IS 2005] For example, a response variable with n
categories will generate (n-1) equations. This breaks the regression up into a series of binary
regressions comparing each group to a baseline (reference) group. "For example, wife's work has 3
values, 0 = not working, 1 = part time, 2 = full time. If choosing not working (0) as the baseline group,
multinomial logistic regression will assess the odds of working part time vs. not working, and
working full time vs. not working." [UCLA 2005] "Multinomial logistic regression simultaneously
estimates the (n-1) logits. Further, it is also the case that the model tests all possible combinations
among the n groups although it only displays coefficients for the (n-1) comparisons." [GSE&IS]
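How the (n-1) logits relate to the n category probabilities can be sketched as follows. The baseline category has a logit of 0 by construction, so its odds against itself are 1. The function name and the example logits are assumptions for illustration; estimating the logits from data is not shown.

```python
import math

def multinomial_probabilities(logits):
    """Turn (n-1) log odds against the baseline into n probabilities.

    The first returned probability belongs to the baseline category.
    """
    odds = [1.0] + [math.exp(z) for z in logits]   # baseline odds are exp(0) = 1
    total = sum(odds)
    return [o / total for o in odds]

# Hypothetical logits for "part time vs. not working" and "full time vs. not working".
not_working, part_time, full_time = multinomial_probabilities([math.log(2.0), math.log(0.5)])
print(not_working, part_time, full_time)
```

With these assumed logits, working part time is twice as likely as not working, and working full time is half as likely as not working.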
4.2.1.2. Weighted score tables
A weighted score table is a method of evaluating alternatives when the importance of the criteria
differs. In a weighted score table, each alternative is given a score for each criterion. These scores are
then weighted by the importance of each criterion. All of an alternative's weighted scores are then
added together to calculate that alternative's total weighted score. The alternative with the highest
total score should be the best alternative. You can use weighted score tables to make predictions
about future customer behaviour. You create a model in the data mining application to make
predictions. After a model has been created based on historical data, it can then be applied to new
data to make predictions. The prediction, that is, the output of the model, is called a score. You can
create a single score for your customers by taking into account different dimensions. SAP's weighted
score tables method allows you to define your own valuation function by first assigning weights to
the individual model fields and then creating a weighted total from these model fields. The algorithm
of weighted score tables is as follows: [SAPDOCS 2005]
A function that is defined by weighted score tables is a linear combination of functions of a
variable:
f(X_1, ..., X_n) = W_1 * f_1(X_1) + ... + W_n * f_n(X_n)
The weights W_1, ..., W_n are arbitrary numbers. Each of the functions f_1, ..., f_n is mapped to exactly
one model field. The arguments X_1, ..., X_n of these functions are those values that the model fields
can take.
For discrete model fields, the score table of the model field is used to directly assign a function value
f_i(X_i) to individual values X_i of the model field. A common function value can be assigned to
values that are not listed explicitly in the table.
For continuous model fields, the score table of the model field is also used to directly assign a
function value f_i(X_i) to individual values X_i of the model field. Either a linear interpolation is
made between two points, or the function value from the left or right point is taken. Respectively,
either a polygon line or a piecewise constant function is defined. Depending on the option selected
by the user, the function is continued as linear or constant beyond the outer points.
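The weighted total and the two kinds of score tables can be sketched as follows. This is an illustration of the formula only and not SAP's implementation: the function names, the field names and all scores and weights are assumptions, and beyond the outer points the sketch always continues the function as a constant.

```python
from bisect import bisect_right

def discrete_score(table, default=0.0):
    """Score function of a discrete field: lookup with a common default value."""
    return lambda value: table.get(value, default)

def continuous_score(points, interpolate=True):
    """Score function of a continuous field from sorted (value, score) points."""
    xs = [x for x, _ in points]

    def f(value):
        if value <= xs[0]:
            return points[0][1]           # constant continuation on the left
        if value >= xs[-1]:
            return points[-1][1]          # constant continuation on the right
        i = bisect_right(xs, value)
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        if not interpolate:
            return y0                     # piecewise constant: take the left point
        return y0 + (y1 - y0) * (value - x0) / (x1 - x0)   # polygon line

    return f

def weighted_score(record, fields):
    """Weighted total: sum of weight * score function over all model fields."""
    return sum(w * f(record[name]) for name, (w, f) in fields.items())

# Hypothetical model with one discrete and one continuous model field.
fields = {
    "region": (2.0, discrete_score({"north": 10, "south": 5}, default=1)),
    "revenue": (0.5, continuous_score([(0, 0), (1000, 20), (5000, 100)])),
}
print(weighted_score({"region": "north", "revenue": 500}, fields))
```

For the example record the region contributes 2.0 * 10 and the revenue contributes 0.5 * 10, because 500 lies halfway between the points (0, 0) and (1000, 20).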
4.2.2. Clustering
Clustering is a common descriptive task of data mining where one seeks to identify a finite set of
categories or clusters to describe the given data. Based on a given set of data points, each having a
set of attributes, and a similarity measure among them, the identified clusters should guarantee that:
[Kumar and Joshi]
Data points in one cluster are more similar to one another,
Data points in separate clusters are less similar to one another.
The identified clusters may be mutually exclusive and exhaustive, or consist of a richer
representation such as hierarchical or overlapping clusters. Examples of clustering in a data mining
context include discovering homogeneous sub-populations for consumers in marketing databases
and identification of sub-categories of spectra from infrared sky measurements. [Bao] According to
Jain and Dubes, "Cluster analysis organizes data by abstracting underlying structure either as a
grouping of individuals or as a hierarchy of groups. The representation can then be investigated to
see if the data group according to preconceived ideas or to suggest new experiments". [Jain and
Dubes 1988] In brief, cluster analysis groups data objects into clusters such that objects belonging
to the same cluster are similar, while those belonging to different ones are dissimilar.
"The term cluster analysis (first used by TRYON, 1939) actually encompasses a number of different
classification algorithms." [STATSOFT 2005] A general question facing researchers in many areas of
inquiry is how to organize observed data into meaningful structures, that is, how to classify them.
Cluster analysis is an exploratory data analysis tool for solving classification problems. Its objective is
to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is
strong between members of the same cluster and weak between members of different clusters. The
distinguishing feature of cluster analysis is that there are no classes to be predicted, but there are
different ways in which the result of clustering can be expressed. "The groups that are identified may
be exclusive, so that any instance belongs in only one group, or they may be overlapping, so that an
instance may fall into several groups, or they may be probabilistic, whereby an instance belongs to
each group with a certain probability, or they may be hierarchical, such that there is a crude division
of instances into groups at the top level, and each of these groups is refined further, perhaps all the
way down to individual instances." [Witten and Frank] Cluster analysis is thus a tool of discovery. It
may reveal associations and structure in data which, though not previously evident, are sensible and
useful once found.
The most commonly used method of cluster analysis is K-means clustering. First, decide how
many clusters are sought; this is the parameter K. K initial cluster centres are then chosen and
each instance is assigned to its closest centre. Second, the mean of all the instances in each
cluster is calculated. These means are taken to be the new centre values for their respective clusters.
Finally, the whole process is repeated with the new cluster centres. "The iteration continues until
the same points are assigned to each cluster in consecutive rounds, at which point the cluster centres
have stabilized and will remain the same thereafter." [Witten and Frank]
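The iteration can be sketched for two-dimensional points as follows. This is a minimal illustration and not the implementation used in SAP BW or in the IBM data mining engine: the initial centres are passed in by the caller (in practice they are often chosen at random), squared Euclidean distance is assumed as the similarity measure, and the data points are invented.

```python
def k_means(points, centres):
    """Cluster (x, y) points around K centres; returns (centres, assignment)."""
    assignment = None
    while True:
        # Assign every point to its closest centre (squared Euclidean distance).
        new_assignment = [
            min(range(len(centres)),
                key=lambda k: (p[0] - centres[k][0]) ** 2 + (p[1] - centres[k][1]) ** 2)
            for p in points
        ]
        # Stop when the same points are assigned to each cluster as in the last round.
        if new_assignment == assignment:
            return centres, assignment
        assignment = new_assignment
        # Recompute each centre as the mean of the points assigned to it.
        updated = []
        for k in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centres[k])   # an empty cluster keeps its centre
        centres = updated

# Hypothetical data with two visible groups; K = 2 initial centres.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, assignment = k_means(points, [(1, 1), (2, 1)])
print(centres, assignment)
```

Even though both initial centres lie in the lower-left group, the second centre moves to the upper-right group after the first round and the assignment stabilizes after the third.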
The major part of this thesis concentrates on how to utilize cluster analysis to come up
with patterns, using K-means as well as more sophisticated algorithms (demographic and neural net
methods) that are part of the IBM data mining engine, based on the statistical data available as part of
the SAP BW statistics content. This is dealt with in chapter 6.
4.2.3. Association analysis
Association analysis (also known as market basket analysis) uncovers the hidden patterns,
correlations or causal structures among a set of items or objects. For example, association analysis
enables you to understand what products and services customers tend to purchase at the same time.
By analyzing the purchasing trends of your customers with association analysis, you can predict
their future behaviour. It is also commonly referred to as "association discovery". [SAPDOCS]
These patterns may be expressed in the form of association rules such as:
2% of the customers who buy milk also buy bread and eggs. You can find that this rule applies
to 20% of the transactions.
80% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen
gloves and matching cover sets.
Customers who purchase pizza bases are three times more likely to purchase cheese than those not
buying the pizza bases.
"Market basket analysis is an algorithm that examines a long list of transactions in order to
determine which items are most frequently purchased together." [Goransson 2005] It uses the
information about "what" customers purchased to give researchers insight into "who" they are and
"why" they make certain purchases. It also gives information about the merchandise by
telling which products tend to be purchased together and which are most amenable to promotion.
[Berry and Linoff 1997] Finally, this information is actionable: "It can suggest new store layouts, it
can determine which products to put on special, it can indicate when to issue coupons, and so on."
[Berry and Linoff] Because market basket analysis is used to determine which products sell
together, the input data to a market basket analysis is normally a list of sales transactions, where
each has two dimensions: one represents a product and the other represents either a sale or a
customer (depending on whether the goal of the analysis is to find which items sell together at the
same time, or to the same person). The cells of the data normally contain only 1 (bought product) or
0 (did not buy product) values, though PolyAnalyst can work with other data in the cells, such as
quantity or revenue. [Goransson]
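The measures behind such rules can be sketched on a handful of transactions. The three functions below use the common definitions of support (share of transactions containing all items of a rule), confidence (share of transactions with the antecedent that also contain the consequent) and lift (confidence divided by the overall frequency of the consequent). The function names and the baskets are invented for illustration; they are not taken from the SAP documentation.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """How much more frequent the consequent is among buyers of the antecedent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical sales transactions, each a set of purchased products.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk"},
    {"pizza base", "cheese"},
    {"bread", "eggs"},
]
print(support(baskets, {"milk", "bread", "eggs"}))       # rule applies to 1 of 5 baskets
print(confidence(baskets, {"milk"}, {"bread", "eggs"}))  # 1 of 3 milk buyers
print(lift(baskets, {"pizza base"}, {"cheese"}))
```

Since every transaction that supports the whole rule also contains its antecedent, the confidence of a rule can never be smaller than its support.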
Market basket analysis is often used as a starting point when transaction data is available but the
researcher doesn't know what specific patterns to look for. It can be applied to many areas, such as:
[Albion 2005]
Analysis of credit card purchases.
Analysis of telephone calling patterns.
Identification of fraudulent medical insurance claims (consider cases where common rules are
broken).
Analysis of telecom service purchases.
4.2.4. Decision Trees
A decision tree is used as a classifier for determining an appropriate action or decision (among a
predetermined set of actions) for a given case. "A decision tree helps you to effectively identify the
factors you must consider and how each factor has historically been associated with different
outcomes of the decision". [SAPDOCS] Decision trees have become one of the most popular data
mining tools. Their visual presentation makes decision trees very easy to read and understand.
They are called decision trees because the resulting model is
presented in the form of a tree structure. Decision trees are most commonly used for classification,
that is, predicting to which group a particular case belongs. A decision tree is constructed from a
training set. A training set contains historical data, which is used to predict possible outcomes
such as aspects of customer behaviour. For example, one can predict whether a customer churns or
remains loyal to the company.
Decision trees are powerful and popular data mining tools for classification and prediction. "It is a
tree in which each branch node represents a choice between a number of alternatives, and each leaf
node represents a classification or decision." [Berry and Linoff] "It has rules that can readily be
expressed in English so that we humans can understand them or in a database access language like
SQL so that records falling into a particular category may be retrieved." [Berry and Linoff] Decision
trees are normally drawn with the root at the top and the leaves at the bottom. A record enters the
tree at the root node, where a test is applied to determine which sub-node the record will go to next.
"There are different algorithms for choosing the initial test, but the goal is always the same: To
choose the test that best discriminates among the target classes." [Berry and Linoff] This process is
repeated until the record arrives at a leaf node. All the records that end up at a given leaf of the tree
are classified the same way. From the root to each leaf there is a unique path that is an
expression of the rule used to classify the data records. The following decision tree is one example;
it helps a financial institution decide whether a person should be offered a loan. [Wilson 2005]
Figure 20: Decision Tree for deciding whether a person should be offered a loan
Source: Wilson, Introduction of Decision Trees.
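How a record travels from the root to a leaf, and how each path can be read as a rule, can be sketched as follows. The tree below is hypothetical: its tests, thresholds and outcomes are invented for illustration and do not reproduce the tree in Figure 20. Building a tree from a training set is not shown.

```python
# Hypothetical loan tree: branch nodes are dicts, leaf nodes are plain strings.
TREE = {
    "label": "income >= 30000",
    "test": lambda r: r["income"] >= 30000,
    "yes": {
        "label": "has_debt",
        "test": lambda r: r["has_debt"],
        "yes": "reject",
        "no": "offer loan",
    },
    "no": "reject",
}

def classify(node, record):
    """Apply one test per branch node until a leaf node is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](record) else node["no"]
    return node

def rules(node, path=()):
    """Return one (rule, decision) pair per leaf, following the unique path to it."""
    if not isinstance(node, dict):
        return [(" AND ".join(path) or "TRUE", node)]
    return (rules(node["yes"], path + (node["label"],))
            + rules(node["no"], path + ("NOT " + node["label"],)))

print(classify(TREE, {"income": 40000, "has_debt": False}))
for rule, decision in rules(TREE):
    print(f"IF {rule} THEN {decision}")
```

The printed rules have the form that could also be used in the WHERE clause of a database query to retrieve all records falling into one leaf.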
4.2.5. ABC classification
Classification is a function that maps a data item into one of several predefined classes. The goal
is that previously unseen records should be assigned to a class as accurately as possible. [Kumar and
Joshi 2004] Examples of classification methods used as part of knowledge discovery applications
include classifying trends in financial markets and automated identification of objects of interest in
large image databases. It is not always possible to separate the classes perfectly using a linear decision
boundary. A bank might wish to use the classification regions to automatically decide whether future
loan applicants will be given a loan or not. [Bao]
The ABC classification is a frequently used analytical method to classify objects (Customers,
Products or Employees) based on a particular measure (Revenue or Profit). For example, you can
classify your customers into three classes A, B and C according to the sales revenue they generate.
"ABC classification allows you to classify your data based on specified classification rules. The data
to be classified is generated by a query in the SAP BW. The classification rules refer to a single key
figure value in your data and implicitly specify which absolute or relative key figure values map to
which classes." [SAPDOCS] One should specify the following for the ABC classification:
[SAPDOCS]
- The Characteristic for which the classification is to be performed. This entails specifying the
  characteristic values to be classified (such as Customer).
- The Key figure that is to form the basis for classifying the characteristic values (such as profit
  made from that customer)
- The attribute of the characteristic that should receive the result (the ABC Class)
- The Query for determining the data (such as profitability data from the customer)
- The Threshold value for the individual ABC classes. For example, all customers generating a
  profit of 0 to 20,000 belong to class C, those generating a profit between 20,001 and 80,000 to
  class B, and those generating more than 80,000 to class A.
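A threshold rule of this kind can be sketched as a small Python function. The function name and parameters are illustrative assumptions; only the example thresholds (20,000 and 80,000) come from the text, and the sketch does not reproduce how SAP BW implements the classification.

```python
def abc_class(profit, c_upper=20_000, b_upper=80_000):
    """Map a customer's profit to class A, B or C using absolute thresholds.

    The default thresholds are the example values from the text.
    """
    if profit < 0:
        raise ValueError("the example rule only covers profits of 0 or more")
    if profit <= c_upper:
        return "C"
    if profit <= b_upper:
        return "B"
    return "A"


# Hypothetical customers and profits, for illustration only.
customers = {"Customer 1": 15_000, "Customer 2": 45_000, "Customer 3": 120_000}
for name, profit in customers.items():
    print(name, abc_class(profit))
```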
4.3. The SAP Analysis Process Designer workbench
"The Analysis Process Designer (APD) is a workbench with an intuitive visual interface that enables
you to visualize, transform, and deploy your data from SAP Business Warehouse. It combines all
these different steps into a single data process that you can easily interact with" [SAPPRL 2004].
The following figure illustrates the architecture of APD:
Figure 21: The Analysis Process Designer (APD) architecture.
Source: SAPNET, Analysis Process Designer
The Analysis Process Designer is the interface in the mySAP BW suite where, according to the business
need or the questions at hand, the designer can connect to the stored data, modify the data and
analyze the data (as the case may be), with the aim of getting results that can be used as answers to
the questions, and deliver these to an operational system where they might be used for further
decision-making purposes. It is the application environment for the SAP data mining solution; from
SAP BW Release 3.5, the data mining functions are fully integrated into the APD. The following
functions can be performed in the APD:
- Creating and changing data mining models
- Training data mining models with SAP BW data (data mining model as data target in the
  analysis process)
- Execution of data mining methods such as prediction with decision tree, with cluster model
  and integration of data mining models from third parties (data mining model as a
  transformation in the analysis process)
- Visualization of data mining models
"By being fully integrated into SAP's data warehousing solution, SAP BW and the APD (including
Data Mining features) realize the benefits of single database access instead of different data tables in
a variety of source systems. This significantly decreases interfacing problems as well as related issues
with data integrity, data quality and system performance". [SAPPRL] Figure 22 below shows a
high-level overview of how the APD is integrated into the SAP BW and other applications (for
instance with SAP CRM).
Figure 22: APD integration with BW and other applications
The data is first extracted from where it is stored. This could be a single database instance with
several tables or several database instances with one or several tables. This data is then introduced
into the SAP BW, where it is again stored, consolidated and structured. This is necessary
because the APD deals basically with data within the SAP BW suite, already prepared in a form that
it understands. Afterwards, the APD manipulates the data as the case might be; interim results
gained in the course of the APD process might become interesting for further analysis. These are then
plugged back into the BW system and saved. Finally, the end result gained (Reports and/or Analysis)
is prepared and delivered to where it is needed. This could be the BW system itself or
an ERP system like the SAP CRM, SCM or a flat file.
Figure 23: Process description of the APD
The above figure is a process description of the APD. The system is primarily designed to extract only
data that has first been uploaded into the data warehouse area of the SAP BW Suite. After the
extraction process is completed, the data fields needed for the specific process are selected. The
selected data fields, sets or tables are then prepared. The interim result of the preparation process
might be plugged back into the system for further preparation or used for further analysis. The
transformation process then follows the preparation. The required algorithm is introduced into the
system at this stage. It is after this that the result is discovered. This result is either
stored/displayed in the SAP BW system in the form of graphs, tables etc. or transferred/stored in an
OLTP system (for instance SAP CRM).
This is the process that is of most importance as far as this work is concerned. Based on the
perceived business need, the analysis process is designed to give answers to questions an
organization might have. Moreover, the scenarios that are discussed in chapter 6 are
extensively explored and used to show all the aspects involved in a typical APD modelling
process.
5. AS-IS ANALYSIS: Current Situation of SAP BW Administration
5.1. The technical content of SAP BW
"Implementing a Data Warehouse presents the administrator with challenges of a constantly changing
nature. Even in a productive system, in which no new InfoCubes are created, new data, for
example, is always being loaded. This results in an increase in the quantity of data, or in a change to
its structure. In addition to this, there are recreated or ad-hoc queries, which change the way that
accessing data is seen as a whole. This not only influences the load times, but also the execution
times for queries. On the other hand, it is a good idea to have an optimum work in order to
minimize the response time of the Data Warehouse". [SAPDOCS] From these few points, it is
already clear that an overview of the processes in the Business Information Warehouse is not only
advantageous but also necessary.
SAP BW provides several sub-areas as part of the technical content of the Business Information
Warehouse. For the user of the Business Information Warehouse, the most important of these sub-areas
is BW Statistics. The following sub-areas are delivered with the technical content: [SAPHELP 2005-2]
- BW Statistics
- BW Data Slice
- BW Features Characteristics
- BW Formula Builder
- BEx Personalization
- Reporting Authorizations
BW Statistics: BW Statistics is of most importance as far as this work is concerned. Moreover,
the clustering scenario that is discussed in chapter 6 is based on the statistics data. BW Statistics is a
tool for analyzing and optimizing the processes in the Business Information Warehouse. The
implementation and day-to-day use of the BW leads to an increase in the overall amount of data
being processed and to changes to the structure of this data. There are also new or ad-hoc queries
that change the way in which the data is accessed. This affects not only the amount of time it takes
to load data, but also the amount of time it takes to execute queries. Ideally, processes should be
run in such a way that the response time of the Business Information Warehouse is made as short as
possible. To achieve this you need to be able to get an overview of the processes that are running in
the Business Information Warehouse and be able to make any necessary changes in the system as
and when required. The data that is required for BW Statistics is provided for: [SAPHELP 2005-3]
- InfoCubes
- Queries
- InfoSources
- Aggregates
The data in BW Statistics is saved and managed in the Business Information Warehouse. When a
query is executed, data is recorded for the OLAP server and for access to the database. This data is
saved temporarily once the navigation step has been completed. This is also the case when the
ODBO (OLE DB for OLAP) interface is used. Additional data is collected when the aggregates are
filled and rolled up after loading data in warehouse management. It does not take long to calculate
and save BW Statistics data. However, the dataset can be considerable with larger installations. For
this reason, the data collection for each InfoProvider in each area of OLAP and warehouse management
can be activated and deactivated individually. It is also possible to delete stored data. The following
figure gives an overview of the dataflow in BW Statistics:
Figure 24: Overview of the dataflow in BW statistics
Source: SAP Help portal
Finally, to summarize, BW Statistics helps to answer important questions such as the following:
- Which InfoCubes, InfoObjects, InfoSources, source systems, queries, aggregates, and so on,
  are currently being used in the system? How frequently? Which datasets are being moved?
- Who is currently using the system?
- Are there queries whose run time is over the allowed fast value for online processing? Are
  tasks, such as batch printing or loading data, executed in times of less work?
- How does the data flow through the Data Warehouse, from where and where to?
5.1.1. Statistical content cubes
Within the framework of the technical content, SAP BW provides the following cubes, which store
the statistical content data. "A MultiProvider (MultiCube) in BW does not contain any data itself.
Instead, data is stored in the relevant InfoProviders" [SAPDOCS]. Accordingly, the BW Statistics
MultiProvider delivered with SAP BW does not contain any data itself; the data is stored in the
relevant BasicCubes, which are:
- BW Statistics - OLAP (This InfoCube contains the data that is generated as a result of
  executing the queries)
- BW Statistics - OLAP, Detail Navigation (This InfoCube contains the data that is
  generated as a result of executing a query. The details correspond to the definition of the
  aggregate. This InfoCube is used by the BW system for the proposal of aggregates.)
- BW Statistics - Aggregates (This InfoCube contains not only general data but also data
  that appears in an aggregate after data is filled and rolled up)
- BW Statistics - WHM (This InfoCube contains the data that arises from the execution of a
  process in Warehouse Management. This InfoCube allows you to see how data requests are
  processed for the process concerned, for example, from which source system they come, which
  InfoSource is used with which transfer method, and in what time frame)
- BW Statistics - Metadata (This cube contains metadata from the Metadata Repository. It
  does not contain any transaction data and no data is loaded. The InfoCube also does not
  contain any special key figures. It reveals information about the existing objects and
  structures in the OLAP, WHM and BEx areas, and about the BW Metadata Repository and
  hierarchies to be displayed)
- BW Statistics: Condensing InfoCubes (This InfoCube contains data that is created when
  an InfoCube's data requests are compressed. It reveals information on the number of edited
  data records for condensing or compressing an InfoCube and the runtime of the condenser,
  which is the program that compresses the fact table contents of an InfoCube)
- BW Statistics: Deleting Data from InfoCubes (This InfoCube contains the data that
  results from deleting data from an InfoCube)
The InfoCube BW Statistics - OLAP is the most important cube as far as this work is concerned.
The major part of the analysis is on query performance and optimization, and this cube contains
important characteristics (InfoCube, BW System, User, Query, Time and so on) and key
figures (OLAP times, Data Manager times and so on). The basic idea was to use these key figures for
cluster analysis to come up with some useful patterns.
5.1.2. Brief overview of some Characteristics and key figures
As mentioned before, the major analysis of this work is on the performance of the queries, and the BW
Statistics - OLAP cube contains some of the important key figures used for analysis. The following
section gives an overview of the Characteristics, Time Characteristics and Key figures available as
part of this cube:
Characteristics
- InfoCube
- Navigation Step (current numbers within the session)
- OLAP Reading On / Off
- Runtime Category (1, 2, 3, ... 10, 20, 30, ... Seconds)
- BW System
- User
- OLAP Processor Method
- Navigation Step (GUID)
- Front-end Session (GUID)
- Statistical Data (GUID)
- Object Version (for example, 0TCTIFCUBE)
- Type of Data Read
- UTC Time Stamp
- Time
- Query
Time Characteristics
- Calendar Day
- Calendar Year
- Calendar Year / Month
- Calendar Year / Quarter
- Calendar Year / Week
Key figures
- Start Date
- Frequency
- Start Time
- Number of Database Selects
- Number of Navigations
- Number of Front-end Sessions
- Number of Texts Read
- Cells Transferred to the Front-end
- Records Selected on the Database
- Records Transferred from the Database to the Server
- ODBO: Size of the Internal Buffer
- ODBO: External Calls for the Function Module
- Total (OLAP)
- Read Cycles (Fetch) OLAP Processor
- Formatting Transferred to the Front-end
- Number of Texts Read
- Time, Authorization Check
- Time, Reading on the Database
- Time, Data Manager InfoCube Access
- Time, Data Manager Reading from Basic Cube
- Time, Data Manager Reading from ODS
- Time, Data Manager Reading from Remote Cube
- Time, Data Manager Authorizations for Non-Cumulative
- Time, Data Manager Determining SIDs for Remote Cube
- Time, Front-end
- Time Between Navigation Steps
- Time, General ODBO
- Time, ODBO: Axes Preparation
- Time, ODBO: Data Records Preparation
- Time, ODBO: Conversion into Flat Table Form
- Time, ODBO: Initialization
- Time, ODBO: Data Requests
- Total Time (OLAP)
- Time, OLAP Processor Initialization
- Time, Reading Texts/Master Data
- Time That the System Was Unable to Assign
- Time, Inputting Variables
- Time, OLAP Processor
The major objective is to analyse these key figures and their importance for the performance of the
query. Since the cube has a large number of key figures, it is not easy to decide which key figures
need to be considered for the cluster analysis. Finally, the OLAP times, the Data Manager times and
other key figures were considered for analysis, which is described in chapter 6.
5.2. SAP BW administration and monitoring
Data warehousing is indeed becoming commonplace in large organizations. According to a
Forrester Research survey of executives at large firms, 62 percent have data in, on average, three
data warehouses or data marts. The same survey indicates that the pace of data warehousing will
increase before it slows down, with the average number of data warehouses and marts expected to
double to nearly six by 2004 and to increase in size from approximately 130 GB to
approximately 260 GB. [IDUG 2005]
A data warehouse is a completely different beast from the operational OLTP. Its problems and the
tools needed to solve them are different. From this it is quite clear that administrators are very
much concerned with warehouse availability and performance during access. In SAP BW,
"The Administrator Workbench (AWB) is the main tool for tasks in the data warehousing
process. The AWB provides data modelling functions as well as functions for control, monitoring
and maintenance of all processes in SAP BW having to do with data procurement, data retention,
and data processing". [SAPHELP 2005-4] The following functions are provided as part of the AWB:
- Modelling
- Monitoring
- Reporting Agent
- Transport connection
- Documents
- Business Content
- Translation
- Metadata Repository
To summarize, it becomes quite clear that the administration of complex enterprise data warehouses
plays a pivotal role in today's IT landscapes. The question of how Data Mining methods could support
the administration of data warehouses with regard to performance and system stability
eventually motivated the analysis of query performance with cluster analysis.
5.3. Possible business scenarios for data mining
During the initial phase of the investigation, several issues were considered as to how and in which
areas of the BW Data Mining methods could be useful. This was not always easy to determine, as
there are several other qualitative aspects that could influence the performance and stability of SAP
systems, for instance:
- The number of application servers available
- The underlying database technology
- The number of work processes available at a particular moment in time
- The number of parallel processes available at a particular moment in time
These are a few qualitative factors which are not easy to measure; further research might help
in future to measure such qualitative aspects of typical SAP systems.
Finally, after asking several experts in these areas, the following areas were identified where Data
Mining methods might be useful to support data warehouse administration in SAP BW:
5.3.1. Data loads and Process chains
This area concerns the execution of data load processes in Warehouse Management, for example how
data requests are processed for a particular process. Presently, as of BW 3.5, the BW administrator
can monitor the data load processes using the statistics content cube named "BW
Statistics - WHM" (Technical name: 0BWTC_C05). This InfoCube helps the administrator to see
how data requests are processed for the process concerned (for example, from which source system
they come, which InfoSource is used with which transfer method, and in what time frame). Presently
there are only a few key figures as part of this cube. The important ones are:
- Records (WHM Process) for a particular processing step when loading data
- Time (WHM Process) for a particular processing step when loading data
It seems that more key figures might be needed; which new key figures are needed and how they could
be derived is out of the scope of this paper. However, further investigation could be made in this
regard so that the new key figures can be used for Data Mining purposes in future.
5.3.2. Queries
The term "Query" is the most talked about of the available objects in SAP BW. The
Query is of utmost importance since it is the object through which the data available in the data
warehouse (for instance SAP BW) is presented using the front-end tools (BEx), based on the typical
reporting requirements of the users.
Several factors determine how well a query performs, some with greater influence than others.
Presently, as of BW 3.5, the SAP BW administrator can monitor the queries using the statistics
content cube BW Statistics - OLAP (Technical name: 0BWTC_C02). This cube has many key figures
that can help in analysing the query performance; they are documented in the
above section (please refer to the section "Brief overview of some Characteristics and key figures").
With the available key figures, namely the OLAP key figures, the administrator can find out reasons
for the performance of the queries. For example, the administrator can look at the various OLAP times:
- Time, Reading from the Database
- Time, Front-end
From the above OLAP key figures the administrator can check which key figures are responsible for
the high OLAP times and perform the necessary action steps. Taking this idea
into consideration, these OLAP key figures are used for cluster analysis. The algorithm used is known
as K-means cluster analysis, which is available as part of SAP's Data Mining workbench. The details
of the analysis are documented in chapter 6.
5.3.3. Dormant data
"Dormant data is data that is seldom or never used. Studies show that much of the data loaded into
data warehouses and analytical application databases is dormant, that is, it is infrequently used or
never used." [ITTOOLBOX 2005] Unlike OLTP databases, data warehouses like SAP BW
continuously collect and store detailed and summary historical information for business analysis.
Frequently, data warehouses include information to satisfy unknown requirements, and data is
included that may or may not be used. These databases expand significantly over time as new
information is added from internal and external data sources.
Bill Inmon, a noted data warehouse expert, states that dormant data typically increases as a
percentage of total data as warehouses grow. He asserts that "dormant data may be as much as
65 - 70% of data warehouses that are a terabyte or greater in size." [FILETEK 2005] He
recommends a simple formula for calculating the data dormancy ratio: "the number of queries per
year times the average amount of data per query divided by total data warehouse space."
[FILETEK] While this ratio may be high since it does not consider that some queries inevitably use
the same data, it does provide a rule of thumb for making ballpark estimates. But how can one
actually identify dormant data? Bill Inmon writes, "Understanding that there is dormant data in a
data warehouse is one thing. Finding the dormant data is another matter altogether. The best way to
find the dormant data is to monitor the end users' query activity against the data warehouse ... the
monitor sits between the end-users' query activity and the data warehouse server." [FILETEK]
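The quoted rule of thumb can be evaluated with a one-line calculation. The sketch below only computes the expression as quoted (queries per year times average data per query, divided by total warehouse space); the function name, the units and the sample figures are assumptions for illustration, and the text does not spell out how the result is to be interpreted beyond a ballpark estimate.

```python
def quoted_ratio(queries_per_year, avg_data_per_query_mb, total_space_mb):
    """Evaluate the quoted expression; all three inputs use hypothetical units (MB)."""
    if total_space_mb <= 0:
        raise ValueError("total warehouse space must be positive")
    return queries_per_year * avg_data_per_query_mb / total_space_mb


# Invented sample figures: 10,000 queries a year, 20 MB per query on average,
# 1,000,000 MB (about 1 TB) of total warehouse space.
print(quoted_ratio(10_000, 20, 1_000_000))
```

Because queries that read the same data are counted once per execution, the value overstates the share of data actually touched, which is the caveat the text makes about the ratio.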
From the above it is quite clear that data warehouse tools like SAP BW need some kind of monitor
that indicates which data could be archived at a certain point in time; presently there is no such
monitor in SAP BW. Once a product like SAP BW offers such a monitor, its key figures could be used
for data analysis, and the possibility of applying Data Mining methods could be investigated in the
future. To summarize, minimizing dormant data reduces system costs and improves performance,
service levels and IT staff productivity, and this paper strongly recommends developing such a monitor
in the near future.
5.3.4. Table spaces and buffers
Another interesting aspect of a typical data warehouse tool like SAP BW, which is directly related to
performance and system stability, is table spaces and buffers at the database layer.
There are a number of factors that are responsible for the performance of table spaces and buffers
from the SAP BW perspective. Some of them are:
- The size of the InfoCube
- The number of partitions of an InfoCube
- The number of CPUs and their respective times
- The database-specific settings and so on.
There are already monitors for SAP systems, for example Database Performance Analysis
(Transaction code: ST04) and Database Tables and Index Monitor (Transaction code: DB02), and it
makes sense to take them into account to derive key figures such as the number of table spaces /
buffers, the amount and time of table spaces / buffers, CPU times etc., which could eventually be used
for Data Mining purposes.
5.4. A way forward
To summarize the AS-IS analysis of the current situation of SAP BW administration as described in
the previous sections, it seems quite clear that companies like SAP are looking to bundle
sophisticated Data Mining features into their products (here SAP BW) in order to
administer and monitor the complexities of data warehouses more easily, which will eventually ease
the day-to-day activities of SAP BW administrators. Having identified the needs and
the possible areas, the next step is to pick a business scenario that helps in the realization
of the TO-BE analysis.
At this stage of the work, after identifying the possible areas in SAP BW, namely data load
processes, queries, dormant data and table spaces, the strategy is to cut horizontally, taking into
account the time and technical constraints. This led to the idea of cluster analysis for the queries,
which is further described in the TO-BE analysis.
6. TO-BE ANALYSIS - A Scenario with Cluster Analysis
6.1. Motivations for cluster analysis
6.1.1. Technical drivers
The major technical driver for the cluster analysis is, obviously, the availability of a clustering
algorithm as part of SAP's Data Mining workbench. The clustering algorithm (k-means) is
implemented as part of the workbench, is pre-configured in the system and is simply made
available for use. Nevertheless, an effort is made here to describe the fundamental principle
behind the concept. Clustering is used to group records together according to an algorithm or
mathematical formula that attempts to find centroids, or centres, around which similar records
gravitate. The method initially takes a number of components of the population equal to the final
required number of clusters. In this same step, these initial components are chosen such
that the points are mutually farthest apart. Next, it examines each component in the population and
assigns it to one of the clusters depending on the minimum distance. The distance measure used is
the Euclidean metric, which is simply the geometric distance in the multidimensional space. It is
computed as: [SAPDOCS 2005]
\[ \mathrm{Distance}(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \]
After every input record is assigned to some cluster, each centroid's position is
recalculated based on the records assigned to it. With the new centroids, the assignments are
checked again, and this continues until a stopping condition is reached (i.e., the maximum
number of iterations is reached or the cluster assignments do not change much between iterations).
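The procedure described above can be sketched in plain Python. This is a minimal illustration and not the implementation inside SAP's Data Mining workbench: the greedy "farthest-first" choice of initial centroids is one possible reading of "mutually farthest apart", and the loop stops when the assignments no longer change at all or when a hypothetical `max_iter` limit is reached.

```python
import math


def euclidean(x, y):
    """Geometric distance between two points in multidimensional space."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def farthest_first(points, k):
    """Greedy choice of k initial centroids that lie far apart (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest already-chosen centroid is farthest away.
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def k_means(points, k, max_iter=100):
    centroids = farthest_first(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:  # stopping condition: nothing changed
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of the records assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Invented two-dimensional sample records.
labels, centres = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(labels, centres)
```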
The availability of APD functionality for creating and changing data mining models and for training
data mining models, integrated in SAP BW, together with the availability of data transformation
functions and the visualization of data mining models, is also a technical motivator for the analysis.
6.1.2. Business drivers
The main business driver is to provide some kind of proactive monitor for the administration of
SAP BW. Using the statistical content data, the objective is to track query behaviour and
to divide the queries into segments based on key figures, namely the OLAP times, since these times are
responsible for the performance of queries. This was confirmed in the AS-IS analysis phase
by several experts at SAP from different areas. The major key figures used for the cluster analysis
are the various OLAP times, namely:
- Time, Reading from the Database
- Time, Front-end
- The number of times a query is executed
The ultimate objective is to cluster the queries into different groups for a certain time period and
to present to the BW administrator those clusters that seem peculiar, so that the results
help the BW administrator to perform the necessary follow-up actions, which would eventually
help in the performance and monitoring of these queries in the future.
Looking at the present developments (bundling new features as part of products) with
reference to SAP BW, it makes sense to have performance and proactive monitors as part of
SAP BW, which is the main motivation from the business / product management perspective.
6.2. Analysis of Queries with cluster analysis
The main objective is to track the behaviour of queries and divide them into segments based on the
OLAP times. The key figures considered for the analysis are OLAP times and Data Manager times.
For this particular data model the Total time (OLAP) is taken into consideration, which enables the
evaluation of query runtimes. The data model includes the following key figures:
- Initialization of OLAP Processor
- OLAP Processor
- Reading on the Database
- Front End
- Reading Texts/Master Data
- Authorization Check
- A new key figure Count Frequency (number of times a query is executed)
- The total OLAP time (an aggregated key figure of all the above OLAP times)
Finally, the data model consists of 9 key figures (excluding the record ID used for record
identification). All the activities are performed on AB5 and Q50, internal SAP test systems.
6.3. The Data Model:
Arriving at a meaningful data model required much investigation and consultation with experts in
these areas, and several attempts were made. The following figure depicts the model attributes
which are used for the clustering purpose.
Figure 25: The Data Model
The models are created and evaluated on the SAP internal test systems (AB5 and Q50). The
technical name of this model is C_I\P_20_1
6.4. Data preparation
Data preparation is one of the major tasks, and much time was devoted to this part of the work. It
is known that the outcome of Data Mining depends entirely on the data distributions and the
amount of source data. Initially, attempts were made based on the data available on the test systems,
but the results did not show any patterns. After consulting with the experts, it became clear
that it makes sense to work with data from a productive system. Finally, the data used for the
analysis is from the productive internal BW system.
Several attempts were made by the colleagues responsible for the productive system to load the data
into the test system, but the data load process into the InfoCube was not successful due to
technical constraints. For instance, the data loaded into the cube had data quality problems: the
time of the OLAP processor was always filled with the value 1, and the time of the OLAP processor was
greater than the overall OLAP time. This should not be the case, as the overall OLAP time
is the total of all OLAP times (Overall OLAP time = Time, OLAP Processor Initialization +
Time, OLAP Processor + Time, Reading from the Database + Time, Front-end + Time,
Authorization Check + Time, Reading Texts/Master Data + ODBO times), as shown in
figure 26.
Figure 26: Data Preparation - Data quality problems
Fortunately, after several attempts the data could be loaded into the PSA (Persistent Staging Area,
an intermediate data store used before loading into the cube and one of the upload types in SAP BW).
As a result, the PSA table is used for the analysis, and it was checked for data
consistency by manually totalling the key figures, as shown in figure 27.
Figure 27: Data Preparation - Consistency check
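The manual consistency check amounts to comparing the overall OLAP time with the sum of its component times. A minimal sketch in Python follows; the field names and the tolerance are hypothetical, and the check was done by hand in the work described here, not with this code.

```python
# Hypothetical field names for the component OLAP times of one record.
COMPONENT_FIELDS = (
    "olap_init",          # Time, OLAP Processor Initialization
    "olap_processor",     # Time, OLAP Processor
    "db_read",            # Time, Reading from the Database
    "front_end",          # Time, Front-end
    "auth_check",         # Time, Authorization Check
    "texts_master_data",  # Time, Reading Texts/Master Data
    "odbo",               # ODBO times
)


def is_consistent(record, tolerance=1e-6):
    """True if the overall OLAP time equals the sum of its components."""
    expected = sum(record[field] for field in COMPONENT_FIELDS)
    return abs(expected - record["total_olap"]) <= tolerance


good = {"olap_init": 0.5, "olap_processor": 1.0, "db_read": 2.0, "front_end": 0.25,
        "auth_check": 0.25, "texts_master_data": 1.0, "odbo": 0.0, "total_olap": 5.0}
# A record like those in the faulty cube load: processor time above the total.
bad = dict(good, olap_processor=1.0, total_olap=0.5)
print(is_consistent(good), is_consistent(bad))
```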
As a result, the PSA table is used as the data source instead of an InfoCube or query, as shown in
figure 28.
Figure 28: Data Selection - The PSA as a source table
Data Selection
The data is filtered for a period of 15 days, keeping in mind that the objective is to provide
the administrator with meaningful clusters for analysis. One of the main purposes of Data Mining is
to mine large data sets. It was therefore decided to have at least 800 records and the shortest
possible time period. The data is selected for 15 days since this period contains more than 800
records; after aggregation it consists of 1116 records (visible in the next screens). The PSA as a
data source is shown in the following figure.
Figure 29: Data Selection - Time period of data
6.5. Data transformation
Several transformations were made to make the data meaningful after consulting the experts in these
areas; they are discussed in the following sections. In the first step, the records are
filtered on Query and InfoProvider to remove the initial values, and on User to exclude the user
"SCOPLADM", since this is not a genuine user, as in figure 30.
Figure 30: Data transformations - Filter Query, InfoCube and User
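The filter step can be pictured as a simple record filter. The sketch below is an illustration outside the APD; the field names are hypothetical, "initial value" is modelled as an empty string, and the user name is copied as it appears in the text.

```python
def filter_records(records, excluded_users):
    """Drop records with an initial (empty) query or InfoProvider and excluded users."""
    return [
        r for r in records
        if r["query"] and r["infoprovider"] and r["user"] not in excluded_users
    ]


# Invented sample records.
records = [
    {"query": "Q1", "infoprovider": "CUBE1", "user": "ALICE"},
    {"query": "", "infoprovider": "CUBE1", "user": "BOB"},         # initial query value
    {"query": "Q2", "infoprovider": "CUBE2", "user": "SCOPLADM"},  # excluded user
]
print(filter_records(records, excluded_users={"SCOPLADM"}))
```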
One of the attributes considered for clustering is the number of times a query is executed in the
data set selected for the specified time period, so a new key figure called query frequency is added to
the analysis process, as in figure 31.
Figure 31: Data Transformation - Adding new key figure
The next step in the transformation is to calculate the Total OLAP time from the corresponding
OLAP times, as in figure 32.
Figure 32: Data Transformation - Transformation of OLAP times
6.5.1. Data Aggregation
An important step in the data transformations is to remove the repeated query and InfoProvider
values in the data set and arrive at unique values. The aggregation is therefore
performed at the Query and InfoProvider level, and the average is used as the type of
aggregation. The process of aggregation is shown in figure 33.
Figure 33: Data Transformations - Aggregation of data
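Aggregation at the Query and InfoProvider level with averages can be sketched as follows. The field names are hypothetical, and counting the executions inside the same pass (as a stand-in for the query frequency key figure) is a simplification made for this illustration; in the work described here the steps are configured in the APD.

```python
from collections import defaultdict


def aggregate(records, key_fields=("query", "infoprovider"), value_fields=("total_olap",)):
    """Group records by the key fields and average the value fields."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in key_fields)].append(r)
    result = []
    for key, rows in groups.items():
        row = dict(zip(key_fields, key))
        row["frequency"] = len(rows)  # number of executions in the period
        for field in value_fields:
            row[field] = sum(r[field] for r in rows) / len(rows)
        result.append(row)
    return result


# Invented sample records: Q1 was executed twice, Q2 once.
records = [
    {"query": "Q1", "infoprovider": "CUBE1", "total_olap": 2.0},
    {"query": "Q1", "infoprovider": "CUBE1", "total_olap": 4.0},
    {"query": "Q2", "infoprovider": "CUBE1", "total_olap": 1.0},
]
print(aggregate(records))
```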
6.5.2. Relative numbers
Based on the expert advice, it makes sense to work with relative values (percentages) for the
corresponding OLAP times, using a transformation routine as in figure 34, to get rid of the
uneven data distributions and get meaningful patterns from the clustering engine.
Figure 34: Data Transformations - Conversion of OLAP key figures
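A minimal sketch of such a conversion, assuming each OLAP time is expressed as a percentage of the
sum of the component times of the same record (the component names are hypothetical):

```python
def to_percentages(record, component_keys):
    """Express each component time as a percentage of their sum.
    Returns 0.0 for every component if the sum is zero."""
    total = sum(record[k] for k in component_keys)
    if total == 0:
        return {k: 0.0 for k in component_keys}
    return {k: 100.0 * record[k] / total for k in component_keys}


keys = ("time_db_read", "time_olap_processing", "time_frontend")
sample = {"time_db_read": 6, "time_olap_processing": 3, "time_frontend": 1}
print(to_percentages(sample, keys))
```

Working on percentages makes a query that spends most of its time reading the database comparable
to another such query, even when their absolute run times differ by orders of magnitude.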
The data distribution of the Total OLAP time before the transformation is shown in figure 35.
Figure 35: Data distribution of TOTAL OLAP time before Transformation
Based on the expert suggestion, and to get rid of this uneven data distribution, the Total OLAP
time is ranked so that the data records are distributed a bit more evenly. This helps the
clustering engine to distribute the data evenly across the various segments, as shown in figure 36.

Figure 36: Data transformation - Discretizing Total OLAP
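The exact ranking rule used in the transformation is not given in the text. As one possible
reading, a rank-based discretization into bins of roughly equal size can be sketched as follows;
the function name `rank_bins` and the number of bins are assumptions.

```python
def rank_bins(values, n_bins):
    """Assign each value an ordinal bin 1..n_bins based on its rank, so that
    every bin receives roughly the same number of records. Tied values are
    ranked in input order and may therefore land in neighbouring bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins


# Made-up, heavily skewed Total OLAP times.
times = [1, 2, 3, 1000, 5, 4, 50000, 7]
print(rank_bins(times, 4))
```

The two extreme values end up in the top bin together instead of stretching the scale, which is the
effect the ranking is meant to achieve.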
The following figure shows the basic statistics of the TOTAL OLAP times after the transformation.
The experts suggested that the K-means algorithm would work better on such data distributions and
eventually come up with meaningful patterns.
Figure 37: Data distribution of TOTAL OLAP after transformation
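For readers unfamiliar with K-means, the basic iteration (Lloyd's algorithm) is sketched below. It
is a generic textbook version and not the implementation used by the SAP BW clustering engine; the
points and initial centroids are made up.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: assign every point to its nearest centroid, recompute
    the centroids as cluster means, and stop when assignments are stable.

    points: list of equal-length tuples; centroids: initial centroids.
    Returns (assignment, centroids), assignment[i] being the cluster index.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Made-up (query frequency bin, Total OLAP bin) pairs.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(data, [(1, 1), (8, 8)])
print(labels, centers)
```

Because K-means relies on distances, skewed attributes dominate the result, which is why the
ranked (more evenly distributed) values are a better input than the raw times.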
In the same way, the Query frequency (the number of times a Query is executed) is transformed, as
in figure 38.
Figure 38: Discretization of Query frequencies
6.5.3. Mapping the Model attributes
In this step the model attributes (the data model) are mapped to the attributes of the data source
table, as depicted in figure 39.
Figure 39: Mapping the Model attributes
6.6. Results of the cluster analysis
In this step the results of the clustering are analysed by looking at the features of the various
clusters. The following figure shows the influence of attributes, which represents the relative
importance of every attribute considered for clustering in the formation of clusters. The higher
the index, the higher the influence in deciding which cluster an entity is assigned to.
Figure 40: Cluster Analysis - The influence chart
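How the influence index in the chart is calculated is not described here. Purely as an
illustration of the idea, one common way to measure how strongly an attribute separates clusters
is the share of its variance that is explained by the cluster assignment. The function below is
such a sketch and is not claimed to be the measure used by SAP BW.

```python
from statistics import mean


def influence(values, labels):
    """Share of an attribute's variance explained by the cluster assignment
    (between-cluster sum of squares / total sum of squares), from 0 to 1."""
    overall = mean(values)
    total_ss = sum((v - overall) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    between_ss = 0.0
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        between_ss += len(members) * (mean(members) - overall) ** 2
    return between_ss / total_ss


# Made-up example with two clusters of two records each.
clusters = [0, 0, 1, 1]
print(influence([1.0, 1.0, 9.0, 9.0], clusters))  # attribute separates the clusters
print(influence([1.0, 9.0, 1.0, 9.0], clusters))  # attribute does not separate them
```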
Analysis of Cluster segments
Cluster 1 - Contributes 19% to the data set. This cluster is characterized by queries that are
frequently executed with high Total OLAP time. When we look at the reason for the high Total
OLAP time by analysing the corresponding OLAP times, this cluster is characterised by the time
spent reading the database.
The administrator could quickly come to the conclusion that the queries in this cluster have a high
database read time and are also the more frequently executed queries, so that some follow-up
actions could be taken to reduce the time.
Figures 41 and 42 give a picture of the attributes, namely Query frequency, time reading the
database and Total OLAP time. The details of all the screens are illustrated in APPENDIX-1
Figure 41: Cluster analysis - Results of cluster 1
Figure 42: Cluster analysis - Results of cluster 1
Cluster 4 - This is a large segment of 22% and is characterized by queries that are less frequently
executed with high Total OLAP time. The reason is again the time spent reading the database, as we
have seen for cluster 1, but the interesting aspect is that this cluster contains queries that are
not so frequently executed when compared to cluster 1.
The quick impression for the administrator could be that this cluster is less important than
cluster 1.
Figure 43: Cluster analysis - Results of cluster 4
Cluster 5 - This is a small segment of 5% which is characterized by queries with high TOTAL
OLAP TIME that are frequently executed. The reason is the high front-end times.
Figure 44: Cluster analysis - Results of cluster 5
Cluster 10 - This is a small segment of 5% of the total data set and is characterized by queries
with high TOTAL OLAP TIME that are less frequently executed. When we look at the corresponding
key figures, this is due to high front-end times.
The administrator quickly comes up with questions -
Why do the users need to push such a huge amount of data to the front end?
Are these queries based on financial Info Providers, which generally have high amounts of data?
Try to find the user patterns like casual user, Information consumer or Analyst
Figure 45: Cluster analysis - Results of cluster 10
Note: All the remaining screens are documented in APPENDIX-B
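The way the segments are read above (share of the data set, and whether a key figure is high or
low in a cluster) can be summarised in a small profiling sketch. The field names, the cluster
labels and the rule "high means above the overall average" are assumptions for the example; the
numbers are made up and do not reproduce the thesis results.

```python
from statistics import mean


def profile_clusters(rows, labels, attributes):
    """For every cluster, report its share of the data set in percent and
    whether each attribute's cluster mean lies above ("high") or not above
    ("low") the overall mean of that attribute."""
    overall = {a: mean(r[a] for r in rows) for a in attributes}
    profiles = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        profile = {"share_percent": 100.0 * len(members) / len(rows)}
        for a in attributes:
            cluster_mean = mean(r[a] for r in members)
            profile[a] = "high" if cluster_mean > overall[a] else "low"
        profiles[label] = profile
    return profiles


# Made-up aggregated queries with their cluster labels.
queries = [
    {"query_frequency": 90, "total_olap_time": 900},
    {"query_frequency": 80, "total_olap_time": 800},
    {"query_frequency": 5, "total_olap_time": 850},
    {"query_frequency": 10, "total_olap_time": 950},
    {"query_frequency": 8, "total_olap_time": 50},
    {"query_frequency": 7, "total_olap_time": 150},
]
cluster_labels = [1, 1, 4, 4, 2, 2]
for label, profile in profile_clusters(
        queries, cluster_labels, ["query_frequency", "total_olap_time"]).items():
    print(label, profile)
```

A cluster that is "high" on both attributes corresponds to the first kind of segment discussed
above, frequently executed queries with a high Total OLAP time, and would be the first candidate
for follow-up actions.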
7. Conclusion and outlook
Data Mining provides many different techniques to extract "knowledge" from data. It is an exciting
multidisciplinary field of research which has many extremely useful applications. At present the
techniques are becoming more commonly used but have not been applied in all areas. As has been
shown, businesses will use data mining for a variety of applications (here, the main objective
being system monitoring and performance, which would eventually lead to finding some interesting
patterns in the system data). But primarily, the focus of data mining is to find useful trends in
existing data. Companies can use data mining to seek out changes in existing trends or, perhaps
more importantly, discover new trends that were previously unknown because of the huge task of
analyzing large amounts of data. As shown, this area of application for data mining (system
performance and stability) is an emerging area as companies try to bundle new features into their
products.
Coming to the TO-BE part of this work, using cluster analysis: the results help in analyzing the
Query information from the various OLAP times, and it is possible to derive some strategies to
optimize the query performance in future. The process of data preparation (cleaning,
transformation and integration of the selected data) plays a vital role in coming up with
meaningful patterns, and most of the time is devoted to this part. Coming to the clustering
scenario that has been discussed here, the business metadata of the Queries, Info Providers and
users could be joined to further analyse the results, and a kind of consensus has been reached in
this regard by the people at SAP to further investigate in this area. The interesting part of this
work is that these results will be implemented with BI-NetWeaver 2006 as technical content Queries
(the rules regarding the amount of data, the transformations and so on will be hard coded). As a
result, the SAP BW administrator executes these queries and some peculiar clusters will be thrown
out to the front end for further analysis, e.g. the clusters which are characterised by high total
OLAP time with correspondingly high front-end times, the queries that are frequently executed and
so on.
The final analysis with respect to the cluster analysis method could be summed up as 'advanced
clustering algorithms could be useful to consider various data types and the number of clusters'.
To this effect the same data set is used for analysis with the IBM Intelligent Miner, which is
equipped with sophisticated clustering algorithms (namely of type Demographic and Neural). The
demographic clustering algorithm is used for analysis based on the expert suggestions; it
accounted for much more meaningful patterns, and these results are documented as part of
APPENDIX-A. This has been accepted by the experts at SAP. To sum up, it does make sense to
further investigate the possibility and feasibility of developing such algorithms, or to tie up
with external data mining vendors and integrate their products with the Workbench.
APPENDIX-A
IBM Intelligent Miner Cluster Analysis
1. Data Selection - The data is extracted into a flat file and loaded into the Intelligent Miner
2. Model Selection
3. Attributes for analysis
4. Model Parameters
5. View of all clusters
6. Analysis of segments
To summarize, the results clearly show that the demographic algorithm takes the data into
consideration more precisely, and the number of clusters is determined by the algorithm based on
the data distribution. One big cluster with 58% of the data set consists of similar data, but all
the remaining clusters show valuable patterns; this has been judged by the experts at SAP.
APPENDIX-B
Results of SAP Cluster Analysis
As a sample, the following screens represent five clusters and the details of their value distributions.
1. Cluster 1
2. Cluster 2
3. Cluster 3
4. Cluster 4
5. Cluster 5
These are 5 cluster segments of the data model out of the total of 10 clusters.
APPENDIX-C Related Internet Links
General information about Data Mining
http://www.the-data-mine.com
http://www.dmreview.com
http://www.datawarehousingonline.com/
http://datawarehouse.ittoolbox.com/
http://www.kdnuggets.com
http://itmanagement.earthweb.com/datbus/
http://www.thearling.com/index.htm4wps
Data mining software providers
Advanced Software Applications http://www.asacorp.com/
AIS Visual http://www.visualmine.com/
Alice http://www.alice-soft.com
Angoss http://www.angoss.com/
Assoc http://www.asoc.de
Attar Software http://www.attar.com/
Bissantz & Company http://www.bissantz.de/
Business Objects http://www.businessobjects.com/
Cogit http://www.cogit.com/
Cognos http://www.cognos.com/
Data Distilleries http://www.ddi.nl/
DataMind http://www.datamindcorp.com/
DataMiner http://www.dminer.com/
Datasage http://www.datasage.com/
Dialogis http://www.dialogis.de
Dimension 5 http://www.dimension5.sk/
HNC http://www.hnc.com/
human IT http://www.humanit.de/
Hyperparallel http://www.hyperparallel.com/
IBM http://www.ibm.com/
Information Discovery http://www.datamining.com/
Integral Solutions http://www.isl.co.uk/
Magnify http://www.magnify.com/
Management Intelligenter Technologien http://www.mitgmbh.de/
MarketMiner http://www.marketminer.com/
Mathsoft http://www.mathsoft.com/
NeoVista http://www.neovista.com
Oracle http://www.oracle.com/
Prudential Systems http://www.prudsys.de/
Quadstone http://www.quadstone.com/
Rulequest http://www.rulequest.com
SAP AG http://www.sap.com/index.epx
Salford Systems http://www.salford-systems.com/
SAS http://www.sas.com/
SGI http://www.sgi.com/software/mineset/
SLP InfoWare http://www.slp-infoware.com
SPSS http://www.spss.com/datamine/
Syllogic http://www.syllogic.nl
Tandem http://www.tandem.com/
Thinking Machines http://www.think.com/
Torrent http://www.torrent.com/
TriVida http://www.trivida.com/
Unica http://www.unica-usa.com/
Wizsoft http://www.wizsoft.com/
