
SAP Predictive Analytics Overview

SAP Predictive Analytics is a statistical analysis and data mining solution that
enables you to build predictive models to discover hidden insights and relationships
in your data, from which you can make predictions about future events.

SAP Predictive Analytics combines SAP InfiniteInsight and SAP Predictive Analysis in
a single desktop installation. SAP Predictive Analytics includes two user interfaces:
Automated Analytics and Expert Analytics.

Automated Analytics includes the following modules:

- Data Manager is a semantic layer tool used to facilitate data preparation.

- Modeler helps you create models such as classification, regression, clustering, time series, and association rules. Models can be exported in different formats so that you can easily apply them in your production environment.

- Social extracts and uses implicit structural relational information stored in different kinds of data sets, improving the decision and prediction capacities of the models. It can represent data in the form of graphs that show how the different data are linked. Dedicated workflows help you create colocation and frequent path analyses based on geo-referenced data.

- Recommendation generates product recommendations for your customers based on a social network analysis.

- Model Manager (installed separately) helps you manage ongoing tasks by scheduling model training and data set deviation detection.

Expert Analytics enables you to do the following:

- Perform various analyses on the data, including time series forecasting, outlier detection, trend analysis, classification analysis, segmentation analysis, and affinity analysis.

- Analyze data using different visualization techniques, such as scatter matrix charts, parallel coordinates, cluster charts, and decision trees.

- Use a range of predictive algorithms, the R open-source statistical analysis language, and in-memory data mining capabilities for handling large-volume data analysis efficiently.

Deployment Configurations
SAP Predictive Analytics can be deployed in the following configurations:

- The desktop edition is a two-tier standalone configuration. Both the Automated Analytics and Expert Analytics toolsets are available in the desktop edition.

- The enterprise edition is a three-tier client-server configuration with server authentication. The Automated Analytics toolset is available in the enterprise edition.

With this configuration, you can install the following server-based components:

- Java Web Start, a server-based client software deployment tool.

- Model Manager, for automating modeling tasks.

SAP Predictive Analytics Desktop Edition


SAP Predictive Analytics desktop edition is a stand-alone process with a two-tier
architecture.

Automated Analytics can access data in flat files on the native file system, in SAS and SPSS files, or it can be configured to access database management systems using ODBC.

Expert Analytics inherits data acquisition and data manipulation functionality from
SAP Lumira. SAP Lumira is a data manipulation and visualization tool. Using SAP
Lumira, you can connect to various data sources such as flat files, relational
databases, in-memory databases, and SAP BusinessObjects universes, and can
operate on different volumes of data, from a small matrix of data in a CSV file to a
very large dataset in SAP HANA.
SAP Predictive Analytics Enterprise Edition

The enterprise edition of SAP Predictive Analytics is a three-tier client-server architecture. Communication between the Automated Analytics server and the data is identical to the desktop edition, using either ODBC or the native file system. For each client connection, a new Automated Analytics instance process is started on the server. Depending on the server configuration, the process can be started with a specific system account or with the user account. Communication between the clients and the server is encrypted using SSL.

This configuration offers the following benefits:
- Users are authenticated because clients must log in before being able to use the modeling server. User accounts can be configured to implement security policy.

- User activity monitoring and logging is possible and activated by default.

- Database connectivity needs to be configured only once, on the server.

- Operating system rights can be used to control access to the different resources (for example, modeling data).

- Resources are used more fully because each modeling session has a dedicated process. The process size limit (typically 4 GB on a 32-bit installation) applies only to a single user.

- Network administration is simplified because all network traffic from the client is directed to the server. This means only two TCP ports need to be opened for an Automated Analytics installation.

Model Manager
In an SAP Predictive Analytics client-server configuration, once your Automated
Analytics server is deployed, a Model Manager server can be additionally deployed.

Model Manager is a thin-client, Web server-based application that allows you to automate modeling activity. It lets several users work on the same modeling project by scheduling the following types of tasks:

- Retraining a model
- Applying a model to a new dataset
- Detecting model deviations
- Detecting deviation of a dataset

Model Manager is available only on Microsoft Windows, so if your modeling server is deployed on a Unix server, you need to set up a separate Windows server.

Creating Predictive Models Automatically

How is it possible that a tool can create predictive models automatically? How can steps be automated that would take a highly skilled expert a great deal of time and effort to perform manually?

The Automated Analytics module in SAP Predictive Analytics enables a non-statistician to produce powerful predictive models in a short period of time. Focus has been placed on automating all steps of the predictive modeling workflow, shielding the user from statistical complexities without compromising predictive performance. Use cases for this engine are virtually endless, from product recommendations to preventive maintenance, to name just two.

This document explains the fundamental concept behind the automated approach to the interested reader. However, the end user of this functionality does not have to read this document in order to use it. Please also bear in mind that this document gives only an outline of the fundamental concept behind Automated Analytics. Many more detailed aspects are taken care of, some of which are patented.

Introduction

Variable Creation

Variable Encoding / Data Distribution

Data Splitting

Missing Values

Outliers

Model Selection / Variable Selection

Multi-Collinearity

Model Interpretation

Deployment

Model Monitoring

Summary

Introduction
SAP Predictive Analytics 2.x includes two different approaches to predictive modeling:

- Automated Analytics, which focuses on simplifying the creation of strong predictive models through automating all individual steps of the creation process. Due to the high degree of automation, Automated Analytics enables analysts without a deep statistical education to create powerful predictions. This white paper explains the concept behind the automation in more detail. This engine came to the SAP portfolio with the acquisition of KXEN in 2013.

- Expert Analytics, which provides a graphical workbench to an expert user who wants to implement specific statistical algorithms and workflows. Expert Analytics is targeted towards data scientists who are familiar with individual statistical algorithms, their assumptions, and their implementations.

The methodology of Automated Analytics is heavily based on discoveries made by Vladimir Vapnik in the area of Structural Risk Minimization. Our framework, which is patented in parts, builds on this concept to produce high-quality models with little effort. Many users appreciate this concept, as it can quickly provide business value. However, expert users are often interested in understanding more about how these models are created. This document aims to explain how Automated Analytics can produce these predictive models.

Within the Automated Analytics option, a number of different predictive capabilities are available, such as:

- Classification
- Regression
- Time Series Forecasting
- Clustering

This document focuses primarily on the classification aspect.

Automated Analytics is not a black box. It is a comprehensive concept that can be explained in detail.

Variable Creation
A predictive model is built on the concept that it understands and describes what
happened in the past, so that it can use this knowledge to predict what is likely to
happen in the future.

Think of a retail bank that wants to optimize its Marketing. One good option to improve the efficiency of a Marketing campaign is to understand the likelihood of a customer being interested in a certain product. Such an analysis is done with a classification model on the historic information of the existing customer base. All available customer data can be used, for instance demographic data (age, location, marital status, and so on) or behavioural data (loan status, credit card usage, and so on).

The so-called target variable indicates whether an individual customer did or did not purchase this product. The classification model looks for patterns that describe the customers' behaviour before that purchase. This model is then applied to the most recent data to predict other customers' interest in the same product, resulting in an individual probability per customer. Now you know which customers are most interested, and you can incorporate this in your Marketing campaigns to achieve more tailored customer communication, resulting in increased response rates and/or reduced Marketing costs.
The better the historic data describes the customers' behaviour, the better any resulting model will be. In order to describe the customer behaviour in great detail, it is helpful to create additional variables that give additional insight into the customers' activities. The Data Manager within SAP Predictive Analytics helps the user create such additional variables on the fly. These variables are created semantically, without the need to persist the results in additional database tables.

Here are some examples; a small illustrative sketch of how such variables could be derived follows the examples below.

Aggregation
Your data might hold detailed transactional data about activities in your customers' accounts. Out of this history, the Data Manager within SAP Predictive Analytics can create aggregates such as the count of transactions or the average amount per transaction. These aggregates can be fed directly into the model, which leads to better predictions.

Pivoting
Pivoting builds on the above aggregation and makes it easy to graphically create a large number of more detailed aggregates. Sticking to the same example of activities in a customer account, pivoting can create individual counts per transaction type. So you can have transaction counts for cash withdrawals, credit card payments, standing orders, and so on.

Understanding of Time
Typically it is crucial to put the historic data into a context of time. Therefore the Data Manager has a built-in concept to relate the various measures to moments or ranges in time. Again, without the need for any coding, the Modeler can create detailed variables such as:

- Count of cash withdrawals in the previous quarter
- Count of cash withdrawals in the same quarter of the year before
- Change in cash withdrawal counts in absolute values
- Change in cash withdrawal counts in percent
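
As a rough illustration of the kinds of variables described above, here is a small pandas sketch. The table and column names (customer_id, transaction_type, amount, transaction_date) are assumptions made for this example; it illustrates the idea only and is not how the Data Manager implements it.

    import pandas as pd

    # Hypothetical transaction-level data; the file and column names are assumptions.
    transactions = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])
    reference_date = pd.Timestamp("2015-01-01")  # the model's reference timestamp

    # Aggregation: count of transactions and average amount per customer.
    aggregates = transactions.groupby("customer_id").agg(
        transaction_count=("amount", "count"),
        avg_amount=("amount", "mean"),
    )

    # Pivoting: one count column per transaction type (cash withdrawal, card payment, ...).
    by_type = transactions.pivot_table(
        index="customer_id",
        columns="transaction_type",
        values="amount",
        aggfunc="count",
        fill_value=0,
    )

    # Understanding of time: counts restricted to the quarter before the reference date.
    last_quarter = transactions[
        (transactions["transaction_date"] >= reference_date - pd.DateOffset(months=3))
        & (transactions["transaction_date"] < reference_date)
    ]
    count_last_quarter = last_quarter.groupby("customer_id").size().rename("count_last_quarter")

    # Combine everything into one analytical record per customer.
    features = aggregates.join(by_type).join(count_last_quarter).fillna(0)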

Very quickly you can have dozens, hundreds, or even thousands of columns that describe the customers' behaviour in great detail over time. With these additional columns we now have a much clearer picture of the customer profiles, which generally results in much better predictions. We explain later in this document (see the chapter Model Selection / Variable Selection) how SAP Predictive Analytics can handle such a large number of columns. The end user does not have to eliminate any columns, which could inadvertently remove valuable details. You can keep all the information and let Automated Analytics remove any columns that are not relevant for the model.

The understanding of time is implemented with the concept that a model relates to a certain point in time (called the timestamp). Information from before the timestamp is used to train the model. The target variable is based on information from after the timestamp. Once the model is trained, it is applied with a more recent timestamp (or even today's date) to predict the target variable. This timestamp concept makes it easy to create models that can be used for many time periods without manual intervention. The Model Monitoring chapter further below outlines how the models are automatically monitored to verify whether their predictive capabilities are still adequate or whether the customers' behaviour has changed so much that a model needs to be retrained.

(Figure: training the model.)

(Figure: applying the model.)
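
The timestamp principle can be sketched as follows. This is an illustration only, reusing the hypothetical transactions table from the sketch above plus an equally hypothetical purchases table (customer_id, purchase_date); it is not the product's internal logic.

    import pandas as pd

    def build_features(transactions, timestamp):
        """Predictor variables use only information from before the timestamp."""
        history = transactions[transactions["transaction_date"] < timestamp]
        return history.groupby("customer_id")["amount"].agg(["count", "mean"])

    def build_target(purchases, timestamp):
        """The target records what actually happened after the timestamp."""
        buyers = purchases[purchases["purchase_date"] >= timestamp]
        return buyers.groupby("customer_id").size().gt(0).astype(int)

    # Training: features from before an historic timestamp, target from after it.
    train_ts = pd.Timestamp("2015-01-01")
    X_train = build_features(transactions, train_ts)
    y_train = build_target(purchases, train_ts).reindex(X_train.index, fill_value=0)

    # Applying: the same feature logic with a more recent timestamp; the target is unknown.
    X_apply = build_features(transactions, pd.Timestamp.today())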

Variable Encoding / Data Distribution


In order to automate the whole process of creating predictive models, it is important that no assumptions are made about how the data is distributed in the predictor or target variables. Most traditional predictive techniques are based on assumptions about the distribution of the data. Often they require balanced datasets for classification, meaning that within the training dataset for a binary classification, half the records should belong to one class and the other half to the other class. In reality, though, such a distribution is rather the exception. Imagine a churn analysis: most likely only a small percentage of the population, or even much less, will be churners.

Automated Analytics is built to deal well with such unbalanced datasets. The methodology is distribution-free, meaning no assumptions are made about the distribution of the various variables. You do not have to put any manual effort into trying to achieve a certain distribution. Automated Analytics achieves this goal by encoding all predictor variables into smaller subsets that have a similar impact on the target variable.

Variable encoding is specific to the different variable types:

- Nominal (textual) variables are encoded by grouping values with a similar impact on the target variable into common bins. Less frequent values are grouped into bigger and therefore more robust categories. Examples of nominal variables are Country, Material, or Product.

- Ordinal (sorted textual) variables are encoded similarly to the nominal variables above. However, the sequence is taken into account and the bins contain only consecutive values. Examples of ordinal variables are Delivery Status or Loyalty Status.

- Numerical variables are sorted and split into units with an equal share of records. By default 20 units are created, each containing 5% of the available data. Consecutive units with a similar impact on the target are grouped together. Within each group, a separate linear regression transforms the input data into a more robust representation.

These transformations enable predictions without assumptions about the data distribution. At the same time, the model becomes more robust and better able to predict previously unseen rows of data.
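
The numerical encoding can be sketched roughly as follows: cut the variable into equal-frequency bins and look at the observed target rate per bin, so that consecutive bins with a similar rate can later be merged. This is only a simplified illustration of the idea, not the Automated Analytics implementation.

    import pandas as pd

    def encode_numerical(values: pd.Series, target: pd.Series, bins: int = 20) -> pd.DataFrame:
        """Cut a numerical variable into equal-frequency bins and summarise the target impact."""
        binned = pd.qcut(values, q=bins, duplicates="drop")
        summary = pd.DataFrame({"bin": binned, "target": target})
        # Record count and target rate per bin; similar consecutive bins could then be merged.
        return summary.groupby("bin", observed=True)["target"].agg(["count", "mean"])

For example, encode_numerical(customers["age"], customers["purchased"]) would show how the purchase rate varies across 20 age bands (the customers table here is hypothetical).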

(Screenshot: the encoding of a numerical variable into 20 bins.)

(Screenshot: how these 20 bins were combined into a much smaller number of groups.)

Data Splitting
In the Automated Analytics methodology, the dataset used to train the model is automatically split into different parts:

- Estimation: Various models are created on this dataset.

- Validation: Each model that was created on the Estimation data is validated on this data part. The best model is chosen based on the results of this validation. The chapter Model Selection / Variable Selection further below explains how this model is selected.

- Test: The chosen model is run against this Test data, which has not yet been used, to provide performance statistics. These statistics show how the model performs on completely new data.

A number of different data splitting (also called data cutting) strategies are provided. The user can work with the default settings, which split the data into sets for Estimation and Validation. However, if desired, the user can also modify this configuration, for example by adding the Test segment or by choosing how the individual rows are assigned to the different data parts.
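
A minimal sketch of such a random split, assuming a pandas DataFrame and illustrative share values (the product offers several splitting strategies beyond this):

    import numpy as np

    def split_dataset(df, seed=0, shares=(0.6, 0.2, 0.2)):
        """Randomly assign each row to the Estimation, Validation, or Test part."""
        rng = np.random.default_rng(seed)
        labels = rng.choice(["estimation", "validation", "test"], size=len(df), p=shares)
        return {name: df[labels == name] for name in ("estimation", "validation", "test")}

    parts = split_dataset(features)  # e.g. the feature table built in the earlier sketch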

Missing Values
Missing values (empty cells) often present a challenge in conventional predictive analysis. Many algorithms cannot handle such missing values, so the user in such an environment often needs to spend extra time dealing with the missing information. Typically, one can try to estimate the missing values, or one can delete the whole row or even the whole column. So either extra effort has to be invested or valuable information is lost.

The Automated Mode, however, handles missing values completely automatically.

Scenario 1:

If the Estimation data part already includes missing values, an additional category is created for these entries. This new category is treated in the same way as the existing categories. This also applies to numerical variables: for each numerical column, all cells with missing data are placed into an additional group whose impact on the target is calculated just as it is for each numerical bin or category.

Scenario 2:

If the Estimation data part is complete, without any missing values, but missing cells are encountered when applying the model, these are handled as follows:

- For a continuous variable, the cell is filled with the variable's average.

- For a categorical variable, the cell is filled with the most frequent value.

Missing values are therefore handled without extra effort and without having to
exclude valuable rows from the datasets.

(Screenshot: an additional bucket is created for the missing values; see the first line, labelled KxMissing.)
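
A small sketch of the apply-time rule from Scenario 2 (mean for continuous columns, most frequent value for categorical columns); the column handling here is illustrative and not the product code:

    import pandas as pd

    def fill_missing_at_apply_time(df: pd.DataFrame) -> pd.DataFrame:
        """Fill numeric columns with their mean and other columns with their most frequent value."""
        filled = df.copy()
        for column in filled.columns:
            if pd.api.types.is_numeric_dtype(filled[column]):
                filled[column] = filled[column].fillna(filled[column].mean())
            else:
                filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
        return filled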

Outliers
Outliers can come in two types, both of which are handled automatically:

- Unusual values in predictor variables. These can be extremely low or high values for numerical variables and rare values for nominal/ordinal variables.

- Unusual rows of data, which might warrant special attention.

Outliers in numerical variables are placed in the bin for the smallest or largest values of the encoded variable. Outliers in nominal/ordinal variables are placed in a common group with other infrequent values.

Unusual rows of data can be flagged by Automated Analytics for manual investigation. Simply put, a row is flagged as an outlier if the predicted value is very different from the actual value.
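
A simple way to express the row-level rule is to flag records whose prediction error is unusually large. The threshold below is an arbitrary choice for illustration, not the rule Automated Analytics applies:

    import numpy as np

    def flag_row_outliers(actual, predicted, z_threshold=3.0):
        """Flag rows whose prediction error is unusually large compared with the other rows."""
        residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
        z_scores = (residuals - residuals.mean()) / residuals.std()
        return np.abs(z_scores) > z_threshold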

Model Selection / Variable Selection


Automated Analytics selects the best model based on a balance of:

- Accuracy
- Robustness
- Simplicity (a reduced number of variables)

The model's accuracy is described by an indicator called Predictive Power, often abbreviated as KI. The indicator describes the proportion of the information contained in the target variable that the explanatory variables are able to explain. The higher the Predictive Power, the better. To increase the Predictive Power, you can try adding additional variables. The range of possible values for the Predictive Power is from 0 to 1.

Robustness is described by an indicator called Prediction Confidence, often abbreviated as KR. It describes the ability of the model to achieve the same performance when it is applied to a new dataset. To increase the Prediction Confidence, you can try adding additional rows of data. The range of possible values for the Prediction Confidence is from 0 to 1. Models are generally considered robust if the value is >= 0.95.

Simplicity is achieved by favouring models with a reduced set of variables.

To strike the right balance between accuracy, robustness, and simplicity, Automated Analytics goes through an iterative process to find the most suitable model.

At first, multiple Ridge Regressions are created on the whole dataset. Ridge Regressions are configured with a lambda parameter, and many models are created with different lambda values. Out of those models, the one with the largest sum of Predictive Power and Prediction Confidence is selected. Then an iterative process starts that tries to find an even better model.

1. The variables with the smallest impact on the model are eliminated.

2. A new set of Ridge Regressions with different values for lambda is produced on the reduced set of variables.

3. The best model is chosen again. If this model is better than the previously selected one, it becomes the model that has to be beaten.

4. The process continues with step 1 until the sum of Predictive Power and Prediction Confidence of the best model in step 3 is smaller than before.

Eventually, the model with the highest sum of Predictive Power and Prediction Confidence becomes the chosen model, delivering the best compromise between accuracy, robustness, and simplicity.
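
A heavily simplified sketch of such an iterative loop, using scikit-learn's RidgeClassifier as a stand-in, validation accuracy in place of the KI + KR criterion, and the smallest absolute coefficient as the elimination rule; the real selection criteria and elimination rules in Automated Analytics differ:

    import numpy as np
    from sklearn.linear_model import RidgeClassifier

    def select_model(X_est, y_est, X_val, y_val, lambdas=(0.01, 0.1, 1.0, 10.0)):
        """Iteratively drop the weakest variable while the validation score still improves."""
        columns = list(X_est.columns)
        best_model, best_score, best_columns = None, -np.inf, list(columns)

        while columns:
            # Try several regularisation strengths on the current variable set.
            candidates = [RidgeClassifier(alpha=lam).fit(X_est[columns], y_est) for lam in lambdas]
            scores = [model.score(X_val[columns], y_val) for model in candidates]
            model, score = candidates[int(np.argmax(scores))], max(scores)

            if score <= best_score:
                break  # No improvement: keep the previously selected model.
            best_model, best_score, best_columns = model, score, list(columns)

            # Eliminate the variable with the smallest absolute coefficient.
            weakest = int(np.argmin(np.abs(model.coef_).ravel()))
            columns.pop(weakest)

        return best_model, best_columns, best_score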

(Screenshot: Predictive Power and Prediction Confidence of the selected model.)

Multi-Collinearity
Multi-collinearity is often a concern when training and applying models. Multi-collinearity occurs when two or more predictor variables are highly correlated. Think of a banking customer of whom we know the monthly salary and how much money the person pays into a savings account each month. These two variables, Salary and Savings, will be highly correlated. Generally speaking (there will be exceptions, of course), the higher a person's salary, the more money will be saved.

A predictive model needs to take such a relation into account to ensure that the information does not have a double impact on any prediction. The ridge regression used by the Automated Mode in SAP Predictive Analytics is robust against such multi-collinearity. Suitable weights are assigned to the variables, so that each variable contributes to the model according to its actual additional information gain. Of two correlated variables, the more important one will have the larger impact. The second variable's impact is reduced accordingly, and it might not even be included in the final model at all.
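
The effect can be illustrated with two strongly correlated synthetic predictors; the data and the regularisation strength below are made up for the example and say nothing about how the product weights real variables:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    salary = rng.normal(3000, 500, size=1000)
    savings = 0.2 * salary + rng.normal(0, 10, size=1000)   # almost a linear copy of salary
    target = 0.001 * salary + rng.normal(0, 1, size=1000)   # driven by salary only

    X = np.column_stack([salary, savings])
    print(LinearRegression().fit(X, target).coef_)  # plain least squares: weights can be unstable
    print(Ridge(alpha=10.0).fit(X, target).coef_)   # ridge: weight is shared more conservatively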

Model Interpretation
So we now know the most important concepts behind how predictive models are created automatically. But it is still very important to understand an individual model, either for your own confidence or to be able to communicate and discuss the model with colleagues. Various options are given to help the user understand the model. Commonly used are, for instance:

- Various types of model charts, for example gain charts. The model is the blue line between a random model in red and the perfect but unachievable model in green. Simply put, the closer the blue model gets to the perfect green model, the better.

- An overview of the selected variables and their contribution/weight in the model.

- An overview of each variable showing how its content impacts the model. A positive influence on the target means an increased likelihood of being a target. Here, the group above 58 years of age has the highest likelihood.

- Detailed statistical reports.

- A confusion matrix, which calculates how many targets can be identified with a certain effort. Here, for instance, only 5% of the prospects need to be contacted to win 23% of the possible contracts.

- PowerPoint output, containing the most important information to document or present.

Deployment
Models can be put into action in two different ways.

Persistence

The model is applied once on a given dataset, and the resulting classification scores are permanently written to a database or file. These values can then be used by any other process or application. New scores are calculated and persisted only when the model is reapplied. Such scoring can be applied in-database, without the records having to leave the database.

This persistence approach is often used when a clear separation between applying the model and using the results is desired. If, for instance, you would like to give the scores to a third party, such as an external Marketing agency, persisting the results is needed.

Semantic
Alternatively, the model can be turned into source code in different programming languages. The model can then be embedded directly into databases or applications, and the scores are calculated on the fly whenever needed. Many different programming languages are supported, such as various SQL flavours, C, JavaScript, Java, Visual Basic, or SAS. It is very common, for instance, to embed the model as a new column in a database view or stored procedure. Every time the column is used, the score is calculated taking the latest available information into account. The scores are real-time.
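
To give a feel for the concept, here is a hypothetical, hand-written example of a model reduced to a standalone scoring function. The variable names and weights are invented for illustration; the code actually exported by the Modeler is generated automatically and looks different:

    import math

    def score_customer(age: float, cash_withdrawals_last_quarter: float) -> float:
        """Hypothetical exported scoring function: a linear score squashed into a probability."""
        linear = -2.1 + 0.03 * age + 0.15 * cash_withdrawals_last_quarter  # made-up weights
        return 1.0 / (1.0 + math.exp(-linear))

    # The calling application gets an up-to-date score whenever it needs one.
    probability = score_customer(age=61, cash_withdrawals_last_quarter=7)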

A customer advisor can now benefit from this additional information directly in the Customer Relationship Management system. The predictive models control which products are suggested to the customer, or indicate how likely the customer is to leave for another bank.

Model Monitoring
Once a model has been created, it describes the data as it was available at creation time. Most models, however, are used over longer periods of time. The bank, for instance, will continuously want to analyse the risk of losing individual clients. Therefore a churn model, giving the probability of a client leaving, will be in constant use. If the behaviour of the clients changes over time, the model's predictive power will decline with it. As the model was created when the behavioural patterns were different, it won't be able to predict the churn rate as accurately. Therefore the predictive capability of such models needs to be monitored over time.

That monitoring task is automated through a server component, which automatically and regularly checks the model's predictive power. Should the predictive capability fall below a defined threshold, the user is informed so that the model can be recalibrated. The user can therefore maintain a larger number of models, as only models in need of readjustment require attention.
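
A minimal sketch of such a periodic check; the threshold value and the way the current predictive power is obtained are assumptions for illustration only:

    def check_model(model_name: str, current_predictive_power: float, threshold: float = 0.75) -> bool:
        """Return True (and notify the user) when a model needs to be recalibrated."""
        if current_predictive_power < threshold:
            print(f"Model '{model_name}' fell to KI = {current_predictive_power:.2f}; retraining is recommended.")
            return True
        return False

    # Example: a scheduled job would call this for every deployed model.
    check_model("churn_model", current_predictive_power=0.62)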

Summary
Hopefully this document has been helpful in understanding the supposed magic behind Automated Analytics, and how strong predictive models can be produced without the user having to be a trained statistician.

It is a very comprehensive process that finds the best model for the situation. The
predictive models can be mass-produced and deployed, ensuring that models are
easily available where needed. Automation in combination with the ability to
interpret trained models ensures that users are fully informed and in control.
