SAP Predictive Analytics is a statistical analysis and data mining solution that
enables you to build predictive models to discover hidden insights and relationships
in your data, from which you can make predictions about future events.
SAP Predictive Analytics combines SAP InfiniteInsight and SAP Predictive Analysis in
a single desktop installation. SAP Predictive Analytics includes two user interfaces,
Automated Analytics and Expert Analytics.
Automated Analytics can access data in flat files on the native file system, SAS files, and
SPSS files, or it can be configured to access database management systems using ODBC.
Expert Analytics inherits data acquisition and data manipulation functionality from
SAP Lumira. SAP Lumira is a data manipulation and visualization tool. Using SAP
Lumira, you can connect to various data sources such as flat files, relational
databases, in-memory databases, and SAP BusinessObjects universes, and can
operate on different volumes of data, from a small matrix of data in a CSV file to a
very large dataset in SAP HANA.
SAP Predictive Analytics Enterprise Edition
Resources are used more fully because each modeling session has a
dedicated process. The process size limit (typically 4 GB on a 32-bit installation)
applies only to a single user.
Model Manager
In an SAP Predictive Analytics client-server configuration, once your Automated
Analytics server is deployed, a Model Manager server can be additionally deployed.
Retraining a model
This document explains the fundamental concepts behind the automated approach
for the interested reader. However, end users do not need to read this document
to use the functionality. Also bear in mind that this document only gives an
outline of the fundamental concepts behind Automated Analytics. Many more
detailed aspects are taken care of, some of which are patented.
Introduction
Variable Creation
Data Splitting
Missing Values
Outliers
Multi-Collinearity
Model Interpretation
Deployment
Model Monitoring
Summary
Introduction
SAP Predictive Analytics 2.x includes two different approaches to predictive
modeling: Automated Analytics and Expert Analytics. This document focuses on the
automated approach, covering model types such as:
Classification
Regression
Clustering
Variable Creation
A predictive model is built on the concept that it understands and describes what
happened in the past, so that it can use this knowledge to predict what is likely to
happen in the future.
Think of a retail bank that wants to optimize its marketing. One good option for
improving the efficiency of a marketing campaign is to understand the likelihood of a
customer being interested in a certain product. Such an analysis is done with a
classification model on the historic information of the existing customer base. All
available customer data can be used, for instance demographic data (age,
location, marital status, ...) or behavioural data (loan status, credit card usage, ...).
The so-called target variable indicates whether an individual customer did or did not
purchase this product. The classification model looks for patterns that describe
the customers' behaviour before that purchase. This model is then applied to the
most recent data to predict other customers' interest in the same product, resulting
in an individual probability per customer. Now you know which customers are most
interested, and you can incorporate this into your marketing campaigns to achieve
more tailored customer communication, resulting in increased response rates and/or
reduced marketing costs.
The better the historic data describes the customers' behaviour, the better any
resulting model will be. To describe customer behaviour in great detail, it
is helpful to create additional variables that give additional insight into the
customers' activities. The Data Manager within SAP Predictive Analytics helps the
user create such additional variables on the fly. These variables are created
semantically, without the need to persist the results in additional database tables.
Aggregation
Your data might hold detailed transactional data about activities in your
customers' accounts. Out of this history, the Data Manager within SAP
Predictive Analytics can create aggregates such as the count of
transactions or the average amount per transaction. These aggregates can
be fed directly into the model, which will lead to better predictions.
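The kind of aggregates described above can be sketched in a few lines of pandas. This is only an illustration of the idea; the table and column names (`customer_id`, `amount`) are assumptions, not the Data Manager's actual interface:

```python
import pandas as pd

# Hypothetical transaction history; column names are assumptions.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [100.0, 50.0, 30.0, 200.0, 400.0],
})

# Aggregates comparable to what the Data Manager derives on the fly:
# the count of transactions and the average amount per customer.
aggregates = transactions.groupby("customer_id")["amount"].agg(
    tx_count="count",
    tx_avg_amount="mean",
).reset_index()
```

Each resulting row describes one customer and can be fed directly into the model as additional predictor columns.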
Pivoting
Pivoting builds on the above aggregation and makes it easy to graphically
create a large number of more detailed aggregates. Sticking to the same
example of activities in a customer account, pivoting can create individual
counts of transaction types. So you can have transaction counts by cash
withdrawals, credit card payments, standing orders, and so on.
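A minimal pandas sketch of the same idea, again with hypothetical column names: pivoting produces one count column per transaction type.

```python
import pandas as pd

# Hypothetical transactions with a type column (names are assumptions).
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "tx_type": ["cash_withdrawal", "credit_card", "credit_card",
                "standing_order", "credit_card"],
})

# One count column per transaction type, per customer.
pivot = pd.crosstab(transactions["customer_id"], transactions["tx_type"])
```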
Understanding of Time
Typically it is crucial to put the historic data into a context of time. Therefore
the Data Manager has a built-in concept for relating the various measures to
moments or ranges in time. Again, without the need for any coding, the
modeler can create detailed variables such as
Very quickly you can have dozens, hundreds, or even thousands of columns that
describe the customers' behaviour in great detail over time. With these additional
columns we now have a much clearer picture of the customers' profiles, which
generally results in much better predictions. We will explain later in this document
(chapter Variable Selection) how SAP Predictive Analytics can handle such a large
number of columns. The end user does not have to eliminate any columns, which
could inadvertently remove valuable details. You can keep all information and let
Automated Analytics remove any columns that are not relevant for the model.
The understanding of time is implemented with the concept that a model relates to
a certain point in time (called the timestamp). Information from before the timestamp is
used to train the model. The target variable is based on information from after the
timestamp. Once the model is trained, it is applied with a more recent timestamp
(or even today's datetime) to predict the target variable. This timestamp concept
makes it easy to create models that can be used for many time periods without
manual intervention. The Model Monitoring chapter further below outlines how the
models are automatically monitored to verify whether their predictive capabilities
are still adequate or whether the customers' behaviour has changed so much that a
model needs to be retrained.
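The timestamp concept can be illustrated with a small pandas sketch. The events table and its columns are assumptions made for this example, not SAP Predictive Analytics APIs:

```python
import pandas as pd

# Hypothetical event history; columns are assumptions for illustration.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-10", "2024-03-05",
                            "2024-01-20", "2024-02-15"]),
    "purchased_product": [False, True, False, False],
})

timestamp = pd.Timestamp("2024-02-01")  # the model's reference point

# Predictors use only information from before the timestamp ...
features = (events[events["date"] < timestamp]
            .groupby("customer_id").size().rename("tx_before"))

# ... while the target is based on behaviour after the timestamp.
target = (events[events["date"] >= timestamp]
          .groupby("customer_id")["purchased_product"].any()
          .rename("bought_after"))

training = pd.concat([features, target], axis=1).fillna(False)
```

Shifting `timestamp` forward rebuilds features and target for a newer period, which is what lets one model definition serve many time periods without manual rework.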
Automated Analytics is built to deal well with such unbalanced datasets. The
methodology is distribution-free, meaning no assumptions are made on the
distribution of the various variables. You do not have to put any manual effort into
trying to achieve a certain distribution. Automated Analytics achieves this goal by
encoding all predictor variables into smaller subsets that have a similar impact on
the target variable.
Ordinal (sorted textual) variables are encoded similarly to nominal
variables. However, the sequence is taken into account and the bins contain
only consecutive values. Examples of ordinal variables are Delivery Status and
Loyalty Status.
Numerical variables are sorted and split into bins with an equal record
share. By default, 20 bins are created, each containing 5% of the available
data. Consecutive bins with a similar impact on the target are grouped
together. Within each group, a separate linear regression transforms the input
data into a more robust representation.
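The encoding of a numerical variable can be sketched as follows on synthetic data. Only the binning and the per-bin target impact are shown; the merging of bins with similar impact is proprietary and therefore only indicated in a comment:

```python
import numpy as np
import pandas as pd

# Synthetic data: the purchase probability rises with age (an assumption).
rng = np.random.default_rng(0)
age = rng.integers(18, 80, size=1000)
target = rng.random(1000) < (age - 18) / 120

df = pd.DataFrame({"age": age, "target": target})

# Step 1: split the numerical variable into 20 bins of equal record share.
df["bin"] = pd.qcut(df["age"], q=20, duplicates="drop")

# Step 2: measure each bin's impact on the target (here: the target rate).
# Consecutive bins with a similar rate would then be merged; the actual
# merging heuristic is part of the product and not reproduced here.
bin_rates = df.groupby("bin", observed=True)["target"].mean()
```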
Data Splitting
In the Automated Analytics methodology, the dataset used to train the
model is automatically split into different parts:
Validation: Each model that was created on the Estimation data is validated
on this data part. The best model will be chosen based on the results of this
validation. The chapter Model Selection further below explains how this
model is selected.
Test: The chosen model is run against this Test data, which has not yet been
used, to provide performance statistics. These statistics show how the model
performs on completely new data.
A number of different data splitting (also called data cutting) strategies are
provided. The user can work with the default settings, which split the data into sets
for Estimation and Validation. However, if desired, the user can also modify this
configuration, for example by adding the Test segment or by selecting how individual
rows are assigned to the different data parts.
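A random row-assignment strategy of the kind described above can be sketched like this; the split proportions are assumptions for the example, not the product's defaults:

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000

# Assumed proportions: 60% estimation, 20% validation, 20% test.
parts = rng.choice(["estimation", "validation", "test"],
                   size=n_rows, p=[0.6, 0.2, 0.2])

# Row indices of each data part, ready to slice the dataset with.
estimation_idx = np.where(parts == "estimation")[0]
validation_idx = np.where(parts == "validation")[0]
test_idx = np.where(parts == "test")[0]
```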
Missing Values
Missing values, you can also call them empty cells, often present a challenge in
conventional predictive analysis. Many algorithms cannot handle such missing
values. Therefore the user in such an environment often needs to spend extra time
dealing with the missing information. Typically, one can try to estimate the missing
values or one can delete the whole row or even column. So either one has to invest
extra effort or valuable information is getting lost.
Scenario 1:
If the Estimation data part already includes missing values, an additional category is
created for these entries. This new category is now treated equally to the existing
categories. This also applies for numerical variables. For each numerical column, all
cells with missing data are placed into an additional group whose impact on the
target is calculated just as it is for each numerical bin or category.
Scenario 2:
In case the Estimation data part is complete without any missing values, but
missing cells are encountered when applying the model, these are handled as
follows:
For a continuous variable, the cell is filled with the variable's average.
For a categorical variable, the cell is filled with the most frequent value.
Missing values are therefore handled without extra effort and without having to
exclude valuable rows from the datasets.
An additional bucket is created for the missing values (see the first line labelled
KxMissing).
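Scenario 2 can be sketched in a few lines of pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Estimation data is complete; values learned here fill gaps at apply time.
estimation = pd.DataFrame({"income": [30.0, 50.0, 70.0],
                           "region": ["north", "south", "north"]})

apply_data = pd.DataFrame({"income": [40.0, np.nan],
                           "region": [None, "south"]})

# Continuous variable: fill with the estimation average.
apply_data["income"] = apply_data["income"].fillna(estimation["income"].mean())
# Categorical variable: fill with the most frequent estimation value.
apply_data["region"] = apply_data["region"].fillna(estimation["region"].mode()[0])
```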
Outliers
Outliers can come in two types, both of which are handled automatically:
Outliers in numerical variables are placed in the bin for the smallest or largest
values of the encoded variable. Outliers in nominal/ordinal variables are placed in a
common group with other infrequent values.
Unusual rows of data can be flagged by Automated Analytics for manual
investigation. Simply put, a row is flagged as an outlier if the predicted value
is very different from the actual value.
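The row-flagging idea can be sketched as follows; the data and the threshold of 1.5 standard deviations above the mean error are assumptions for illustration, not the product's actual rule:

```python
import numpy as np

# Hypothetical actual vs. predicted values for five rows of data.
actual = np.array([10.0, 12.0, 11.0, 50.0, 9.0])
predicted = np.array([10.5, 11.5, 11.0, 12.0, 9.5])

# A row is flagged when its prediction error is far from typical
# (the 1.5-standard-deviation threshold is an assumption).
errors = np.abs(actual - predicted)
threshold = errors.mean() + 1.5 * errors.std()
flagged = np.where(errors > threshold)[0]
```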
Accuracy
Robustness
To strike the right balance between accuracy, robustness and simplicity, Automated
Analytics goes through an iterative process to find the most suitable model.
At first, multiple Ridge Regressions are created on the whole dataset. A Ridge
Regression is configured with a lambda parameter, and many models are created
with different lambda values. Out of these models, the one with the largest sum of
Predictive Power and Prediction Confidence is selected. Then an iterative process
starts that tries to find an even better model.
1. The variables with the smallest impact on the model get eliminated.
2. A new set of Ridge Regressions with different values for lambda are produced
on the smaller dataset.
3. The best model is chosen again. If this model is better than the selected one
from before, this model becomes the model that has to be beaten.
4. The process continues with step 1) until the sum of Predictive Power and
Prediction Confidence of the best model in step 3) is smaller than before.
Eventually, the model with the highest sum for Predictive Power and Prediction
Confidence becomes the chosen model, delivering the best compromise between
accuracy, robustness and simplicity.
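The iterative selection loop above can be sketched with a closed-form ridge regression on synthetic data. The validation R² used here is only a stand-in for the proprietary sum of Predictive Power and Prediction Confidence, and all data and parameters are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 6
X = rng.normal(size=(n, p))
# Synthetic assumption: only the first two variables actually matter.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

est, val = slice(0, 300), slice(300, 400)   # estimation / validation parts
lambdas = [0.1, 1.0, 10.0, 100.0]

def ridge_fit(A, b, lam):
    # Closed-form ridge regression: w = (A'A + lam*I)^-1 A'b
    k = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(k), A.T @ b)

def best_validation_score(cols):
    # Best validation R^2 over the lambda grid -- a stand-in for the
    # proprietary sum of Predictive Power and Prediction Confidence.
    best = -np.inf
    for lam in lambdas:
        w = ridge_fit(X[est][:, cols], y[est], lam)
        pred = X[val][:, cols] @ w
        r2 = 1 - np.sum((y[val] - pred) ** 2) / np.sum((y[val] - y[val].mean()) ** 2)
        best = max(best, r2)
    return best

cols = list(range(p))
best_cols, best_score = cols[:], best_validation_score(cols)
while len(cols) > 1:
    # Step 1: eliminate the variable with the smallest impact (coefficient).
    w = ridge_fit(X[est][:, cols], y[est], 1.0)
    cols.pop(int(np.argmin(np.abs(w))))
    # Steps 2-4: refit over the lambda grid; keep the smaller model only
    # if it beats the best one found so far, otherwise stop.
    score = best_validation_score(cols)
    if score <= best_score:
        break
    best_cols, best_score = cols[:], score
```

The loop typically strips the noise variables while retaining the informative ones, mirroring the compromise between accuracy, robustness and simplicity described above.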
Multi-Collinearity
Multi-collinearity is often a concern when training and applying models. Multi-
collinearity occurs when two or more predictor variables are highly correlated. Think
of a banking customer of whom we know the monthly salary and how much money
the person pays into a savings account each month. These two variables, Salary
and Savings, will be highly correlated. Generally speaking (there will of course be
exceptions): the higher a person's salary, the more money will be saved.
A predictive model needs to take such a relation into account, to ensure that the
information does not double-impact any prediction. The ridge regression used by
the Automated Mode in SAP Predictive Analytics is robust against such multi-
collinearity. Suitable weights are assigned to the variables, so that each variable
contributes to the model according to its actual additional information gain. Out of
two correlated variables, the more important one will have a larger impact. The
second variable's impact is reduced accordingly, and it might not even be included in
the final model at all.
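This effect can be demonstrated with a closed-form ridge regression on two synthetically correlated variables; all data and parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
salary = rng.normal(size=n)
savings = salary + rng.normal(scale=0.3, size=n)   # highly correlated with salary
# Synthetic assumption: only salary truly drives the target.
y = 3.0 * salary + rng.normal(scale=0.5, size=n)

# Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
X = np.column_stack([salary, savings])
lam = 10.0
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# The shared signal is split between the two weights so it does not
# double-count; salary, the more informative variable, gets the larger one.
```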
Model Interpretation
We now know the most important concepts behind how predictive models are created
automatically. But it is still very important to understand an individual model, either
for your own confidence or to be able to communicate and discuss the model with
colleagues. Various options are given to help the user understand the model.
Commonly used are, for instance:
Various types of model charts, e.g. gain charts. The model is the blue line in
between a random model in red and the perfect but unachievable model in
green. Simply put, the closer the blue model gets to the perfect green model,
the better.
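The numbers behind a gain chart can be computed in a few lines; the scores and outcomes below are made up for illustration:

```python
import numpy as np

# Hypothetical model scores and actual outcomes for ten customers.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1])
actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# Sort customers by descending score and accumulate the share of positives found.
order = np.argsort(-scores)
cumulative_gain = np.cumsum(actual[order]) / actual.sum()

# A random model finds positives linearly; the gain chart visualises the gap
# between cumulative_gain (the blue line) and this diagonal (the red line).
random_gain = np.arange(1, len(actual) + 1) / len(actual)
```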
Persistence
The model is applied once to a given dataset, and the resulting classification scores
are permanently written to a database or file. These values can then be used by any
other process or application. Only when the model is reapplied are new scores
calculated and persisted again. Such scoring can be applied in-database, without the
records having to leave the database.
Semantic
Alternatively, the model can be turned into source code in different programming
languages. The models can then be embedded directly into databases or
applications, and the scores are calculated on the fly whenever needed. Many
different programming languages are supported, such as various SQL flavours, C,
JavaScript, Java, Visual Basic, and SAS. It is very common, for instance, to embed the
model as a new column in a database view or stored procedure. Every time the
column is used, the score is calculated taking the latest available information into
account. The scores are real-time.
A customer advisor can now benefit from this additional information directly in the
Customer Relationship Management System. The predictive models control which
products are suggested for the customer or how likely the customer is to leave to
another bank.
Model Monitoring
Once a model has been created, it describes the data as it was available at creation
time. Most models, however, are used over longer time periods. The bank, for instance,
will continuously want to analyse the risk of losing individual clients to churn.
Therefore a churn model, giving the probability of a client leaving, will be in
constant use. If the behaviour of the clients changes over time, the model's
predictive power will decline with it. As it was created when the behavioural patterns
were different, it won't be able to predict the churn rate as accurately. Therefore the
predictive capability of such models needs to be monitored over time.
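A very simple monitoring check can be sketched as follows; the baseline value, the weekly figures, and the tolerance are all assumptions for illustration:

```python
import numpy as np

# Hypothetical weekly accuracy figures of a deployed churn model.
baseline_accuracy = 0.85                      # measured at creation time
weekly_accuracy = np.array([0.84, 0.83, 0.84, 0.78, 0.74])

# Trigger retraining once performance degrades beyond an (assumed) tolerance.
tolerance = 0.05
needs_retraining = weekly_accuracy[-1] < baseline_accuracy - tolerance
```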
Summary
Hopefully this document has helped you understand the supposed magic behind
Automated Analytics and how strong predictive models can be produced without the
user having to be a trained statistician.
It is a very comprehensive process that finds the best model for the situation. The
predictive models can be mass-produced and deployed, ensuring that models are
easily available where needed. Automation in combination with the ability to
interpret trained models ensures that users are fully informed and in control.