
Advanced Data Analysis

Identify High Priority Items


Table of Contents

Objective
Step 1: Standardize the Data
Step 2: Do Logistic Regression
Step 3: Logistic Regression Equation
Objective
To find the logistic regression equation for classifying items as high or low priority. The
dependent variable is Hi_Priority.

Methodology
Since the dependent variable is a dichotomous categorical variable and the independent
variables are a mix of categorical and metric variables, logistic regression is used to identify
high-priority items. The result of the logistic regression was also compared with a
classification tree.
1) Clean and prepare the data
2) Perform enter-method logistic regression
3) Calculate the chance accuracy
4) Perform stepwise logistic regression (Backward: Wald method)
5) Validate the model's classification accuracy against the chance accuracy

Data Cleaning and Preparation


In the variable “Item_Fat_Content” four categories were recorded: LF, Low Fat, Regular and
reg. It is assumed that LF means the same as Low Fat, and reg the same as Regular, so the four
labels were collapsed into the two actual categories.
Several records had missing values for Item_Weight and Outlet_Size.
A new variable, Outlet_Age, was also created from the establishment year, with 2010 as the
base year:
Outlet_Age = 2010 - Outlet_Establishment_Year
Finally, categorical variables were converted to numerical codes.
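The cleaning steps above can be sketched with pandas (the inline sample frame is illustrative; the actual dataset would be loaded from file):

```python
# Sketch of the cleaning steps described above, using a small illustrative
# DataFrame with the column names from this report.
import pandas as pd

df = pd.DataFrame({
    "Item_Fat_Content": ["LF", "Low Fat", "reg", "Regular"],
    "Outlet_Establishment_Year": [1999, 2004, 2010, 1987],
})

# Collapse the four fat-content labels into the two actual categories.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "reg": "Regular"}
)

# Derive Outlet_Age with 2010 as the base year.
df["Outlet_Age"] = 2010 - df["Outlet_Establishment_Year"]

print(df["Item_Fat_Content"].unique())  # ['Low Fat' 'Regular']
print(df["Outlet_Age"].tolist())        # [11, 6, 0, 23]
```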

Perform Enter Method Logistic Regression


The following variables were entered as the dependent and independent variables.
Dependent variable: Hi_Priority
Independent variables: Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type,
Item_MRP, Outlet_Size, Outlet_Location_Type, Outlet_Type, Outlet_Age, Outlet_Identifier

Of these, the following were entered as categorical variables:
Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, Outlet_Type and
Outlet_Identifier.
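The enter method fits a single maximum-likelihood model with all predictors included at once. A minimal pure-Python sketch of the same model form (the one-predictor toy data, learning rate, and gradient-ascent fitting are illustrative simplifications of SPSS's Newton-type estimation):

```python
# "Enter method" logistic regression: all predictors entered simultaneously,
# coefficients fitted by maximising the log-likelihood (here via plain
# gradient ascent on toy data).
import math

def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Fit b0 + b1*x1 + ... for P(y=1) = 1/(1+exp(-y_hat))."""
    n_features = len(X[0])
    b = [0.0] * (n_features + 1)  # intercept first
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p            # gradient of the log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        b = [bj + lr * g / len(y) for bj, g in zip(b, grad)]
    return b

def predict_prob(b, xi):
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: larger x makes a high-priority outcome more likely.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
b = fit_logistic(X, y)
print(predict_prob(b, [0.0]), predict_prob(b, [5.0]))
```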

OUTPUT INTERPRETATION

We received a warning indicating that collinearity is present among the variables. Since
logistic regression, like other regression methods, does not tolerate multicollinearity, we
stopped the analysis at this point. Further inspection revealed that Outlet_Location_Type,
Outlet_Type and Outlet_Identifier are correlated.

After a few iterations with each of these variables, only Outlet_Type was retained for the next
iteration.

OUTPUT INTERPRETATION AFTER REMOVING COLLINEARITY

No warning appeared, which indicates that the collinearity has been removed.

BLOCK 0: Output

Classification Table (a, b)

                                  Predicted
                           Hi_priority        Percentage
Observed                   0          1       Correct
Step 0  Hi_priority   0    0          773     .0
                      1    0          3877    100.0
        Overall Percentage                    83.4

a. Constant is included in the model.
b. The cut value is .500

The chance accuracy is (773/4650)² + (3877/4650)² ≈ 72.3%.

For the model to be valid, the overall accuracy should be greater than 1.25 × 72.3% ≈ 90.3%.
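The proportional-chance calculation above can be checked directly:

```python
# Proportional-chance criterion: sum of squared group proportions,
# plus the 1.25x threshold the model must beat to be considered valid.
n0, n1 = 773, 3877            # group sizes from the Block 0 table
total = n0 + n1
chance = (n0 / total) ** 2 + (n1 / total) ** 2
threshold = 1.25 * chance
print(round(chance * 100, 1))     # 72.3
print(round(threshold * 100, 1))  # 90.3
```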

This table shows the coding of the categories of the independent variables. The full
dummy-coding matrix is summarised below; for each variable the last category listed is the
reference category (all parameter codes zero).

Categorical Variables Codings (summary)

Item_type_New (15 dummies; reference: Starchy Foods):
Baking Goods 536, Breads 204, Breakfast 89, Canned 539, Dairy 566, Frozen Foods 718,
Fruits and Vegetables 1019, Hard Drinks 183, Health and Hygiene 430, Household 759,
Meat 337, Others 137, Seafood 51, Snack Foods 988, Soft Drinks 374, Starchy Foods 130

Outlet_Type_New (2 dummies; reference: Supermarket Type2):
Grocery Store 555, Supermarket Type1 5577, Supermarket Type2 928

Fat_content_new (1 dummy): Low Fat 4566 (coded 1), Regular 2494 (coded 0)

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       2919.502 (a)         .238                    .401

a. Estimation terminated at iteration number 6 because parameter estimates changed by less
than .001.

The Nagelkerke R² of 0.401 indicates a reasonably good fit.

Hosmer and Lemeshow Test

Step Chi-square df Sig.

1 224.859 8 .000

The Hosmer and Lemeshow test fails, since the significance value should be > 0.05. However,
since all other assumptions held, we proceeded with the analysis and compared the results with
a classification tree.

Variables in the Equation
95% C.I.for EXP(B)
B S.E. Wald df Sig. Exp(B) Lower Upper
Step 1 a Item_MRP .031 .001 696.469 1 .000 1.031 1.029 1.034
Outlet_Age -.010 .022 .215 1 .643 .990 .948 1.033
Outlet_Type_New(1) .632 .263 5.762 1 .016 1.882 1.123 3.153
Item_Weight -.006 .010 .361 1 .548 .994 .974 1.014
Item_Visibility .258 1.045 .061 1 .805 1.294 .167 10.030
Item_type_New 13.070 15 .597
Item_type_New(1) -.209 .399 .275 1 .600 .811 .371 1.773
Item_type_New(2) .676 .494 1.873 1 .171 1.965 .747 5.171
Item_type_New(3) -.823 .527 2.437 1 .119 .439 .156 1.234
Item_type_New(4) -.076 .404 .036 1 .850 .927 .420 2.045
Item_type_New(5) -.202 .404 .250 1 .617 .817 .370 1.803
Item_type_New(6) -.098 .395 .061 1 .804 .907 .418 1.968
Item_type_New(7) -.065 .391 .027 1 .868 .937 .436 2.015
Item_type_New(8) -.316 .478 .437 1 .508 .729 .286 1.860
Item_type_New(9) -.276 .414 .444 1 .505 .759 .337 1.708
Item_type_New(10) -.213 .403 .278 1 .598 .808 .367 1.781
Item_type_New(11) -.194 .426 .209 1 .648 .823 .357 1.896
Item_type_New(12) .071 .507 .020 1 .888 1.074 .397 2.902
Item_type_New(13) -.481 .700 .471 1 .493 .618 .157 2.440
Item_type_New(14) -.052 .393 .017 1 .895 .950 .440 2.052
Item_type_New(15) -.215 .420 .263 1 .608 .806 .354 1.836
Outlet_Size_New .449 2 .799
Outlet_Size_New(1) -.154 .320 .233 1 .630 .857 .458 1.604
Outlet_Size_New(2) .037 .135 .074 1 .785 1.038 .796 1.353
Fat_content_new(1) .112 .108 1.088 1 .297 1.119 .906 1.382
Constant -2.044 .448 20.825 1 .000 .129

Many of the variables are not significant, so they must be removed to obtain an accurate
equation.

Perform Stepwise Logistic Regression

The method used was “Backward: Wald”.
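The Backward: Wald procedure starts from the full model and removes, one step at a time, the predictor whose Wald test is least significant, until every remaining predictor passes the removal threshold. A toy sketch of that loop (the p-values are illustrative stand-ins loosely echoing the outputs in this report; real SPSS refits the model and recomputes the Wald statistics after every removal, which this static sketch does not):

```python
# Hedged sketch of backward elimination: repeatedly drop the predictor with
# the largest p-value until all remaining p-values are below the SPSS
# default removal threshold of 0.10.
def backward_elimination(pvalues, threshold=0.10):
    remaining = dict(pvalues)
    dropped = []
    while remaining:
        worst = max(remaining, key=remaining.get)
        if remaining[worst] <= threshold:
            break                      # everything left is significant enough
        dropped.append(worst)
        del remaining[worst]
    return list(remaining), dropped

# Illustrative p-values (static stand-ins, not refitted at each step).
pvals = {"Item_MRP": 0.000, "Outlet_Age": 0.024, "Outlet_Type_New": 0.000,
         "Item_Weight": 0.548, "Item_Visibility": 0.805,
         "Item_type_New": 0.597, "Outlet_Size_New": 0.799,
         "Fat_content_new": 0.297}
kept, dropped = backward_elimination(pvals)
print(sorted(kept))  # ['Item_MRP', 'Outlet_Age', 'Outlet_Type_New']
print(dropped)
```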
OUTPUT INTERPRETATION

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       2919.502 (a)         .238                    .401
2       2919.563 (a)         .238                    .401
3       2920.011 (a)         .238                    .401
4       2933.696 (a)         .236                    .397
5       2934.071 (a)         .236                    .397
6       2934.802 (a)         .236                    .397

a. Estimation terminated at iteration number 6 because parameter estimates changed by less
than .001.

The Nagelkerke R² of 0.397 again indicates a reasonably good fit.

Hosmer and Lemeshow Test

Step Chi-square df Sig.

1 224.859 8 .000

2 224.134 8 .000

3 223.111 8 .000

4 228.720 8 .000

5 225.398 8 .000

6 233.094 8 .000

The Hosmer and Lemeshow test fails at every step, since the significance value should be
> 0.05. However, since all other assumptions held, we proceeded with the analysis and compared
the results with a classification tree.

Classification Table

                           Predicted
                    Hi_priority       Percentage
Observed            0         1       Correct
Step 1 Hi_priority 0 391 382 50.6
1 150 3727 96.1
Overall Percentage 88.6
Step 2 Hi_priority 0 390 383 50.5
1 150 3727 96.1
Overall Percentage 88.5
Step 3 Hi_priority 0 392 381 50.7
1 153 3724 96.1
Overall Percentage 88.5
Step 4 Hi_priority 0 388 385 50.2
1 158 3719 95.9
Overall Percentage 88.3
Step 5 Hi_priority 0 390 383 50.5
1 159 3718 95.9
Overall Percentage 88.3
Step 6 Hi_priority 0 393 380 50.8
1 159 3718 95.9
Overall Percentage 88.4

Six iterations were performed; the sixth and final iteration is the output.


Accuracy has improved from 72.3% (chance accuracy) to 88.4%, which is slightly below the
90.3% needed to make the model valid.

The final significant variables in the equation are as follows:

Of all the variables, only three were found significant: Item_MRP, Outlet_Age and
Outlet_Type.

However, some records had been excluded because of missing values in Outlet_Size and
Item_Weight. Since those two variables turned out not to be relevant, it is better to rerun the
model on all records: this increases the sample size and should improve the prediction.

The plot below shows that for most items the predicted probability of membership in group 1
is greater than 0.5, which means those items are correctly classified.

Step number: 6

[SPSS plot of observed groups and predicted probabilities: a frequency histogram of the
predicted probability of membership in group 1, cut value .50, each symbol representing
50 cases. Group-1 cases pile up near probability 1, while the relatively few group-0 cases
sit mostly below the cut value.]
Perform Stepwise Logistic Regression Using Only Significant
Variables

Independent variables used: Outlet_Type, Outlet_Age and Item_MRP
Dependent variable: Hi_Priority

The full sample is now used in the regression, and there are no warnings, which suggests no
collinearity.

Chance Accuracy

Classification Table (a, b)

                                  Predicted
                           Hi_priority        Percentage
Observed                   0          1       Correct
Step 0  Hi_priority   0    0          2131    .0
                      1    0          6392    100.0
        Overall Percentage                    75.0

The chance accuracy is (2131/8523)² + (6392/8523)² ≈ 62.5%.

For the model to be valid, the overall accuracy should be greater than 1.25 × 62.5% ≈ 78.1%.

Variables not in the Equation

Score df Sig.

Step 0 Variables Item_MRP 972.713 1 .000

Outlet_Age 123.673 1 .000

Outlet_Type_New 3259.201 3 .000

Outlet_Type_New(1) 3183.377 1 .000

Outlet_Type_New(2) 726.382 1 .000

Outlet_Type_New(3) 11.939 1 .001

Overall Statistics 4223.772 5 .000

Omnibus Tests of Model Coefficients

Chi-square df Sig.

Step 1 Step 4909.133 5 .000

Block 4909.133 5 .000

Model 4909.133 5 .000

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       4676.981 (a)         .438                    .648

a. Estimation terminated at iteration number 7 because parameter estimates changed by less
than .001.

A Nagelkerke R² of 0.648 is a very good value for a logistic regression.

Hosmer and Lemeshow Test

Step Chi-square df Sig.

1 216.668 8 .000

The Hosmer and Lemeshow test fails, since the significance value should be > 0.05. However,
since all other assumptions held, we proceeded with the analysis and compared the results with
a classification tree.

Classification Tablea

Predicted

Hi_priority
Percentage
Observed 0 1 Correct

Step 1 Hi_priority 0 1559 572 73.2

1 283 6109 95.6

Overall Percentage 90.0

a. The cut value is .500

Since 90% classification accuracy is more than 78.1%, our model can be used for prediction.

Variables in the Equation

                                                                     95% C.I. for EXP(B)
                            B       S.E.   Wald       df  Sig.  Exp(B)  Lower  Upper
Step 1a  Item_MRP           .031    .001   1097.603   1   .000  1.032   1.030  1.033
         Outlet_Age         -.014   .006   5.088      1   .024  .986    .974   .998
         Outlet_Type_New                   1351.281   3   .000
         Outlet_Type_New(1) -8.777  .269   1066.242   1   .000  .000    .000   .000
         Outlet_Type_New(2) -2.054  .201   104.146    1   .000  .128    .086   .190
         Outlet_Type_New(3) -2.629  .253   108.117    1   .000  .072    .044   .118
         Constant           .496    .243   4.172      1   .041  1.643

Logistic Regression Equation

Since the variables remaining in the equation are all significant, we can now write the logistic
regression equation.
P(s) = 1/(1+e^(-y)), where y = a + b1x1 + b2x2 + …   Equation (1)

y = 0.496 + 0.031 × Item_MRP − 0.014 × Outlet_Age − 8.777 × Outlet_Type_New(1)
− 2.054 × Outlet_Type_New(2) − 2.629 × Outlet_Type_New(3)

Categorical Variables Codings

Parameter coding

Frequency (1) (2) (3)

Outlet_Type_New Grocery Store 1083 1.000 .000 .000

Supermarket Type1 5577 .000 1.000 .000

Supermarket Type2 928 .000 .000 1.000

Supermarket Type3 935 .000 .000 .000

Now substitute this value of y in equation (1) to get the probability of success:

P(s) = 1/(1 + e^−(0.496 + 0.031 × Item_MRP − 0.014 × Outlet_Age − 8.777 × Outlet_Type_New(1)
− 2.054 × Outlet_Type_New(2) − 2.629 × Outlet_Type_New(3)))
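The fitted equation can be wrapped in a small helper, using the dummy coding from the Categorical Variables Codings table above (Supermarket Type3 is the reference category, all dummies zero). The example inputs are hypothetical:

```python
# The fitted logistic regression equation as a function.
import math

COEF = {"const": 0.496, "Item_MRP": 0.031, "Outlet_Age": -0.014,
        "type1": -8.777, "type2": -2.054, "type3": -2.629}
DUMMIES = {"Grocery Store": (1, 0, 0), "Supermarket Type1": (0, 1, 0),
           "Supermarket Type2": (0, 0, 1), "Supermarket Type3": (0, 0, 0)}

def p_high_priority(item_mrp, outlet_age, outlet_type):
    d1, d2, d3 = DUMMIES[outlet_type]
    y = (COEF["const"] + COEF["Item_MRP"] * item_mrp
         + COEF["Outlet_Age"] * outlet_age
         + COEF["type1"] * d1 + COEF["type2"] * d2 + COEF["type3"] * d3)
    return 1.0 / (1.0 + math.exp(-y))

# A hypothetical MRP-250 item in an 11-year-old Supermarket Type3 outlet is
# almost surely high priority; the same item in a Grocery Store is far less
# likely to be, because of the large negative Outlet_Type_New(1) coefficient.
print(round(p_high_priority(250, 11, "Supermarket Type3"), 3))
print(round(p_high_priority(250, 11, "Grocery Store"), 3))
```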

Classification Tree (Exhaustive CHAID)

Model Summary

Specifications  Growing Method                   EXHAUSTIVE CHAID
                Dependent Variable               Hi_priority
                Independent Variables            Item_Weight, Item_Visibility, Item_MRP,
                                                 Outlet_Age, Outlet_Type_New
                Validation                       Split Sample
                Maximum Tree Depth               3
                Minimum Cases in Parent Node     100
                Minimum Cases in Child Node      50
Results         Independent Variables Included   Outlet_Type_New, Item_MRP
                Number of Nodes                  18
                Number of Terminal Nodes         12
                Depth                            3

Risk

Sample Estimate Std. Error

Training .101 .005

Test .103 .005

Growing Method: EXHAUSTIVE CHAID


Dependent Variable: Hi_priority

Classification

Predicted

Sample Observed 0 1 Percent Correct

Training 0 786 299 72.4%

1 134 3057 95.8%

Overall Percentage 21.5% 78.5% 89.9%

Test 0 752 294 71.9%

1 144 3057 95.5%

Overall Percentage 21.1% 78.9% 89.7%

Growing Method: EXHAUSTIVE CHAID


Dependent Variable: Hi_priority

The 89.7% test accuracy of the classification tree is similar to the 90% classification
accuracy achieved by the logistic regression. This validates the logistic regression despite the
failed Hosmer and Lemeshow test, so the logistic regression model can be used for prediction.
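As a rough illustration of how a CHAID-style tree chooses its first split — by the strongest chi-square association between a categorical predictor and the target — here is a toy sketch. The counts are illustrative, not taken from the report, and real Exhaustive CHAID also merges categories and applies Bonferroni-adjusted significance tests:

```python
# Toy first-split selection in the spirit of CHAID: compute the Pearson
# chi-square statistic of each candidate predictor against the target and
# split on the predictor with the largest value.
def chi_square(table):
    """Pearson chi-square for a dict {category: (n_class0, n_class1)}."""
    row = {c: sum(v) for c, v in table.items()}
    col0 = sum(v[0] for v in table.values())
    col1 = sum(v[1] for v in table.values())
    total = col0 + col1
    stat = 0.0
    for c, (n0, n1) in table.items():
        for obs, coltot in ((n0, col0), (n1, col1)):
            exp = row[c] * coltot / total   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Illustrative (n_low, n_high) counts per category for two candidate splits.
splits = {
    "Outlet_Type_New": {"Grocery": (500, 55), "Type1": (1200, 4377),
                        "Type2": (300, 628), "Type3": (131, 804)},
    "Fat_content_new": {"Low Fat": (1100, 3466), "Regular": (1031, 1463)},
}
best = max(splits, key=lambda k: chi_square(splits[k]))
print(best)  # the predictor with the strongest association is split first
```

In these illustrative counts the outlet type separates low- from high-priority items far more sharply than fat content, so it is chosen first, consistent with the tree above including Outlet_Type_New before Item_MRP.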
