
Assignment 5-part 2

Process Control:
1- Static model (target variable: pc_volume_binary)
2- Dynamic model (target variable: lagged pc_volume_binary)
Tarig Elamin

Five different models used for prediction

1- Static model (target variable: pc_volume_binary)

Now that we have the right proportion of observations from both categories (PC_Volume ≤ 0.30 and
PC_Volume > 0.30) and have selected the best predictors, we next build several predictive models and then
use comparative tools to automatically select the one that best predicts the dependent variable. STATISTICA
Data Miner includes many advanced models that are designed to handle different types of data. We try
the following predictive models for this example:
1- CHAID Standard Classification with Deployment
2- Logit Regression with Deployment
3- Boosting Classification Trees with Deployment
4- MARSplines for Classification with Deployment
5- C&RT Standard Classification Trees with Deployment

Data partition

We split the data into training data (67%) and testing data (33%) for both the static and the dynamic
model. See figure 1.
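The 67%/33% partition described above can be sketched in plain Python. This is only an illustration of a random split, not the STATISTICA procedure itself; the function name `train_test_split_rows`, the seed, and the stand-in data are assumptions made for the example.

```python
import random

def train_test_split_rows(rows, train_fraction=0.67, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(rows)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the real observations (hypothetical data).
rows = list(range(100))
train, test = train_test_split_rows(rows)
print(len(train), len(test))  # 67 33
```

The same split would be applied to both the static and the dynamic data set so that the models are compared on the same footing.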

Data balancing

We used stratified random sampling to balance our target variable. About 31% of the observations have
PC_Volume > 0.30 (pc_volume_binary takes the value 1) and the rest have PC_Volume ≤ 0.30. The target
variable is therefore not a rare-event case: the observed proportions of the two outcomes of High_Low_PC
(High/1 and Low/0) are reasonably balanced. Nevertheless, we extracted equal numbers of observations
falling above and below the PC_Volume cut-off level of 0.30.
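One way to extract equal numbers of observations from each class is to undersample the larger class. The sketch below assumes that approach; the report does not say exactly how STATISTICA drew the stratified sample, and the helper `balance_by_undersampling` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def balance_by_undersampling(rows, label_of, seed=0):
    """Return a shuffled sample with the same number of rows per class."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    # Every class is cut down to the size of the smallest class.
    n = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], n))
    rng.shuffle(balanced)
    return balanced

# Toy data with the 31% / 69% class mix mentioned in the text (not the real data).
rows = [{"pc_volume_binary": 1}] * 31 + [{"pc_volume_binary": 0}] * 69
balanced = balance_by_undersampling(rows, lambda r: r["pc_volume_binary"])
print(len(balanced))  # 62
```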

1- Computed overlaid lift charts from all 5 models: static analyses

Figure 2 shows that logit regression (a generalized linear model) has the highest lift value, 2.56, of the
five models, so it performs best on this Soda Pop dataset. C&RT has the smallest lift value, as seen in
figure 8.
Let us interpret the graph in figure 2 for category 1 (PC_Volume > 0.30). If we select the top two deciles
of cases ranked by the model, the resulting sample contains almost 2.48 times as many category 1
(PC_Volume > 0.30) cases as a sample of the same size chosen by the baseline model. In other words, the
relative gain, or lift value, from using logistic regression is approximately 2.48 for predicting category 1. We can say
2- Computed overlaid lift charts from all 5 models: dynamic analyses

Figure 15 (lift values for the dynamic models) shows that, for predicting the lagged dependent variable,
MARSplines has the highest lift value, 2.63, of all the models. In the dynamic model, MARSplines
therefore outperforms logistic regression.
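The lift values quoted for the static and dynamic models compare the share of category 1 cases among the highest-scored cases with the share in the whole sample. A minimal sketch of that calculation follows, assuming a cumulative lift over the top fraction of cases; the function `lift_at` and the scores and labels are invented for illustration and are not taken from the Soda Pop data.

```python
def lift_at(scores, labels, fraction):
    """Cumulative lift for the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k   # hit rate in the top slice
    base_rate = sum(labels) / len(labels)                  # hit rate with no model
    return top_rate / base_rate

# Hypothetical predicted probabilities and true classes (1 = PC_Volume > 0.30).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
top_two_deciles_lift = lift_at(scores, labels, 0.2)
print(round(top_two_deciles_lift, 2))  # 3.33
```

A lift of 1.0 means the model does no better than the baseline; the larger the value in the first deciles, the better the model concentrates category 1 cases at the top of the ranking.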

Model Accuracy

For the static model, the classification matrix (from figure -) for the CHAID model shows that this model has
an accuracy rate of 57%. The STATISTICA Automated Neural Networks (SANN) model accurately predicted about
750 cases, as seen in figure 12. MARSplines has an accuracy rate of 77% (930/1205).
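The accuracy rate is the number of correctly classified cases (the diagonal of the classification matrix) divided by the total number of cases. The short sketch below shows the calculation; the 2x2 matrix is a toy example, not one of the matrices in the figures, and only the counts 930 and 1205 come from the report.

```python
def accuracy(matrix):
    """Accuracy from a square classification matrix (rows: observed, columns: predicted)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy matrix for illustration only.
toy_matrix = [[3, 1],
              [2, 4]]
print(accuracy(toy_matrix))  # 0.7

# MARSplines counts quoted in the text: 930 correct out of 1205 cases.
marsplines_accuracy = 930 / 1205
print(round(marsplines_accuracy, 3))  # 0.772
```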

Data partition: 67% training, 33% testing (sub_dataset)

Figure 1
Compute Overlaid Lift Charts from all Models Static Analyses

Figure 2

Figure 3

Figure 4
Figure 5

Figure 6

Figure 7

Figure 8
Figure 9

MARSplines - classification matrix

Figure 10

Figure 11 C&RT - classification matrix

Figure 11
CHAID - classification matrix
SANN - Accuracy Figure 12
Boosting tree - predictor importance

Figure 13
The dynamic model
The target variable: lagged_PC_Volume_binary
The equation used: =lag(PC_Volume_binary,-1)
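The lagged target can be sketched as a shifted copy of the original series. The sign convention below is an assumption: here a negative offset pulls the value from the following row (as `shift(-1)` does in pandas), and STATISTICA's own `lag` function may define the direction differently. The helper `lag` and the sample series are hypothetical.

```python
def lag(values, k):
    """Shift a series by k rows; positions with no source value become None.

    Assumed convention: positive k takes the value from k rows earlier,
    negative k takes the value from |k| rows later.
    """
    n = len(values)
    return [values[i - k] if 0 <= i - k < n else None for i in range(n)]

# Hypothetical binary series, not the real data.
pc_volume_binary = [0, 1, 1, 0, 1]
lagged_pc_volume_binary = lag(pc_volume_binary, -1)
print(lagged_pc_volume_binary)  # [1, 1, 0, 1, None]
```

Rows whose shifted value is missing (None here) would need to be dropped before fitting the dynamic models.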

Figure 14

Figure 15
