
## Identify High Priority Items

- Objective
- Step 3: Logistic Regression Equation

## Objective

To find the logistic regression equation for classifying high and low priority items. The dependent variable is High_Priority.

## Methodology

Since the dependent variable is a dichotomous categorical variable and the independent variables are both categorical and metric, logistic regression is used to identify high priority items. The result of the logistic regression was also compared with a classification tree.

1) Clean and prepare the data
2) Perform enter-method logistic regression
3) Calculate chance accuracy
4) Perform stepwise logistic regression (Backward: Wald method)
5) Validate the model's classification accuracy against the chance accuracy

## Data Cleaning and Preparation

In the variable "Item_Fat_Content" four categories were present: LF, Low Fat, Regular and reg. It is assumed that LF means the same as Low Fat, and reg the same as Regular, so the four categories were collapsed into the two actual categories. Several sample points had missing values for Item_Weight and Outlet_Size. A new variable, Outlet_Age, was also created from the establishment year, with 2010 as the base year:

Outlet_Age = 2010 - Outlet_Establishment_Year

Categorical variables were converted to numerical codes.
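The recoding and derived-variable steps above can be sketched as follows. This is a minimal illustration assuming each record is a dict keyed by the report's column names; the helper name and dict layout are mine, not part of the original analysis.

```python
# Collapse the four fat-content labels into the two actual categories,
# and derive Outlet_Age from the establishment year (base year 2010).
FAT_RECODE = {"LF": "Low Fat", "reg": "Regular"}

def clean_row(row, base_year=2010):
    """Recode Item_Fat_Content and add Outlet_Age to one record."""
    fat = row["Item_Fat_Content"]
    row["Item_Fat_Content"] = FAT_RECODE.get(fat, fat)
    row["Outlet_Age"] = base_year - row["Outlet_Establishment_Year"]
    return row

row = clean_row({"Item_Fat_Content": "LF", "Outlet_Establishment_Year": 1999})
# row["Item_Fat_Content"] -> "Low Fat"; row["Outlet_Age"] -> 11
```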

## Perform Enter Method Logistic Regression

The following dependent and independent variables were entered.

Dependent variable: Hi_Priority
Independent variables: Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Size, Outlet_Location_Type, Outlet_Type, Outlet_Age, Outlet_Identifier

Of the above, the following were entered as categorical variables:
Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, Outlet_Type and
Outlet_Identifier.

OUTPUT INTERPRETATION

We received a warning indicating that collinearity is present among the variables. Since logistic regression, like other regressions, does not allow collinearity, we stopped the analysis here. Further analysis revealed that Outlet_Location_Type, Outlet_Type and Outlet_Identifier are correlated.

After a few iterations with each variable, only Outlet_Type was retained for the next iteration.

OUTPUT INTERPRETATION AFTER REMOVING COLLINEARITY

No warning appeared, which indicates that collinearity is no longer present in the data.

BLOCK 0: Output

Classification Table (a,b)

| Observed | Predicted 0 | Predicted 1 | Percentage Correct |
|---|---|---|---|
| Hi_priority = 0 | 0 | 773 | .0 |
| Hi_priority = 1 | 0 | 3877 | 100.0 |

The chance accuracy is (773/4650)² + (3877/4650)² ≈ 72%.

For the model to be valid, the overall accuracy should be greater than 1.25 × 72% = 90.3%.
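The chance-accuracy arithmetic can be checked with a short sketch (pure Python; the function name is mine):

```python
def chance_accuracy(n_class0, n_class1):
    """Proportional chance accuracy: the sum of squared class proportions."""
    n = n_class0 + n_class1
    return (n_class0 / n) ** 2 + (n_class1 / n) ** 2

ca = chance_accuracy(773, 3877)   # about 0.723, i.e. roughly 72%
threshold = 1.25 * ca             # about 0.903, the 90.3% validity cutoff
```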

The table below shows the coding of the categories of the independent variables.

## Categorical Variables Codings

Indicator (dummy) coding is used; the category coded all zeros is the reference category.

| Variable | Category | Frequency | Parameter coding |
|---|---|---|---|
| Item_type_New | Baking Goods | 536 | (1) |
| | Breads | 204 | (2) |
| | Breakfast | 89 | (3) |
| | Canned | 539 | (4) |
| | Dairy | 566 | (5) |
| | Frozen Foods | 718 | (6) |
| | Fruits and Vegetables | 1019 | (7) |
| | Hard Drinks | 183 | (8) |
| | Health and Hygiene | 430 | (9) |
| | Household | 759 | (10) |
| | Meat | 337 | (11) |
| | Others | 137 | (12) |
| | Seafood | 51 | (13) |
| | Snack Foods | 988 | (14) |
| | Soft Drinks | 374 | (15) |
| | Starchy Foods | 130 | reference |
| Outlet_Type_New | Grocery Store | 555 | (1) |
| | Supermarket Type1 | 5577 | (2) |
| | Supermarket Type2 | 928 | reference |
| Fat_content_new | Low Fat | 4566 | (1) |
| | Regular | 2494 | reference |

Model Summary

Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Hosmer and Lemeshow Test

| Step | Chi-square | df | Sig. |
|---|---|---|---|
| 1 | 224.859 | 8 | .000 |

The Hosmer and Lemeshow test failed, since for a good fit its significance value should be >0.05. But since all other assumptions held, we went ahead with the analysis and compared the results with classification trees.

Variables in the Equation

| | B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I. Lower | 95% C.I. Upper |
|---|---|---|---|---|---|---|---|---|
| Item_MRP | .031 | .001 | 696.469 | 1 | .000 | 1.031 | 1.029 | 1.034 |
| Outlet_Age | -.010 | .022 | .215 | 1 | .643 | .990 | .948 | 1.033 |
| Outlet_Type_New(1) | .632 | .263 | 5.762 | 1 | .016 | 1.882 | 1.123 | 3.153 |
| Item_Weight | -.006 | .010 | .361 | 1 | .548 | .994 | .974 | 1.014 |
| Item_Visibility | .258 | 1.045 | .061 | 1 | .805 | 1.294 | .167 | 10.030 |
| Item_type_New | | | 13.070 | 15 | .597 | | | |
| Item_type_New(1) | -.209 | .399 | .275 | 1 | .600 | .811 | .371 | 1.773 |
| Item_type_New(2) | .676 | .494 | 1.873 | 1 | .171 | 1.965 | .747 | 5.171 |
| Item_type_New(3) | -.823 | .527 | 2.437 | 1 | .119 | .439 | .156 | 1.234 |
| Item_type_New(4) | -.076 | .404 | .036 | 1 | .850 | .927 | .420 | 2.045 |
| Item_type_New(5) | -.202 | .404 | .250 | 1 | .617 | .817 | .370 | 1.803 |
| Item_type_New(6) | -.098 | .395 | .061 | 1 | .804 | .907 | .418 | 1.968 |
| Item_type_New(7) | -.065 | .391 | .027 | 1 | .868 | .937 | .436 | 2.015 |
| Item_type_New(8) | -.316 | .478 | .437 | 1 | .508 | .729 | .286 | 1.860 |
| Item_type_New(9) | -.276 | .414 | .444 | 1 | .505 | .759 | .337 | 1.708 |
| Item_type_New(10) | -.213 | .403 | .278 | 1 | .598 | .808 | .367 | 1.781 |
| Item_type_New(11) | -.194 | .426 | .209 | 1 | .648 | .823 | .357 | 1.896 |
| Item_type_New(12) | .071 | .507 | .020 | 1 | .888 | 1.074 | .397 | 2.902 |
| Item_type_New(13) | -.481 | .700 | .471 | 1 | .493 | .618 | .157 | 2.440 |
| Item_type_New(14) | -.052 | .393 | .017 | 1 | .895 | .950 | .440 | 2.052 |
| Item_type_New(15) | -.215 | .420 | .263 | 1 | .608 | .806 | .354 | 1.836 |
| Outlet_Size_New | | | .449 | 2 | .799 | | | |
| Outlet_Size_New(1) | -.154 | .320 | .233 | 1 | .630 | .857 | .458 | 1.604 |
| Outlet_Size_New(2) | .037 | .135 | .074 | 1 | .785 | 1.038 | .796 | 1.353 |
| Fat_content_new(1) | .112 | .108 | 1.088 | 1 | .297 | 1.119 | .906 | 1.382 |
| Constant | -2.044 | .448 | 20.825 | 1 | .000 | .129 | | |

Many variables are not significant, so they must be removed to form an accurate equation.

## Perform Stepwise Logistic Regression

The method used was "Backward: Wald".

OUTPUT INTERPRETATION

Model Summary

Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

The Nagelkerke R² is 0.397, which is good.

Hosmer and Lemeshow Test

| Step | Chi-square | df | Sig. |
|---|---|---|---|
| 1 | 224.859 | 8 | .000 |
| 2 | 224.134 | 8 | .000 |
| 3 | 223.111 | 8 | .000 |
| 4 | 228.720 | 8 | .000 |
| 5 | 225.398 | 8 | .000 |
| 6 | 233.094 | 8 | .000 |

Again the Hosmer and Lemeshow test failed (its significance value should be >0.05), but since all other assumptions held, we proceeded and compared the results with classification trees.

Classification Table (a)

| Step | Observed | Predicted 0 | Predicted 1 | Percentage Correct |
|---|---|---|---|---|
| 1 | Hi_priority = 0 | 391 | 382 | 50.6 |
| | Hi_priority = 1 | 150 | 3727 | 96.1 |
| | Overall Percentage | | | 88.6 |
| 2 | Hi_priority = 0 | 390 | 383 | 50.5 |
| | Hi_priority = 1 | 150 | 3727 | 96.1 |
| | Overall Percentage | | | 88.5 |
| 3 | Hi_priority = 0 | 392 | 381 | 50.7 |
| | Hi_priority = 1 | 153 | 3724 | 96.1 |
| | Overall Percentage | | | 88.5 |
| 4 | Hi_priority = 0 | 388 | 385 | 50.2 |
| | Hi_priority = 1 | 158 | 3719 | 95.9 |
| | Overall Percentage | | | 88.3 |
| 5 | Hi_priority = 0 | 390 | 383 | 50.5 |
| | Hi_priority = 1 | 159 | 3718 | 95.9 |
| | Overall Percentage | | | 88.3 |
| 6 | Hi_priority = 0 | 393 | 380 | 50.8 |
| | Hi_priority = 1 | 159 | 3718 | 95.9 |
| | Overall Percentage | | | 88.4 |
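The overall percentages follow directly from the cell counts of each step's classification table; a quick check in pure Python (the function name is mine):

```python
def overall_accuracy(tn, fp, fn, tp):
    """Overall classification accuracy from a 2x2 classification table:
    (correctly predicted 0s + correctly predicted 1s) / all cases."""
    return (tn + tp) / (tn + fp + fn + tp)

# Step 6 counts: observed 0 -> 393 / 380, observed 1 -> 159 / 3718
acc = overall_accuracy(393, 380, 159, 3718)
print(round(100 * acc, 1))  # 88.4
```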

Six iterations were performed; the sixth and final iteration is the output.

Accuracy improved from the 72% chance accuracy to 88.4%, which is slightly below the 90.3% required to make the model valid.

## Final Significant Variables in the Equation

Of all the variables, only three significant variables were found: Item_MRP, Outlet_Age and Outlet_Type.

However, a few samples were excluded because of missing Outlet_Size and Item_Weight values. Since those variables are not significant, it would be better to include all samples, as this would increase the sample size and improve prediction.

The plot below shows that for most observed high-priority items the predicted probability of success is greater than 0.5, which means those items are correctly classified.

Step number: 6

Observed Groups and Predicted Probabilities

```
      800 +                                              +
          |                                             1|
          |                                             1|
   F      |                                             1|
   R  600 +                                             1+
   E      |                                             1|
   Q      |                                            11|
   U      |                                            11|
   E  400 +                                            11+
   N      |                                           111|
   C      |                                           111|
   Y      |                                          1111|
      200 +                                         11111+
          |                                        111111|
          |          1 1111                 1111111|
          | 11 1101011 11 1 1111111111111111111111111111111111|
Predicted ---------+---------+---------+---------+---------+---------+---------+---------+---------+----------
  Prob:   0       .1        .2        .3        .4        .5        .6        .7        .8        .9         1
  Group:  0000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111

Predicted Probability is of Membership for 1
The Cut Value is .50
Symbols: 0 - 0
         1 - 1
Each Symbol Represents 50 Cases.
```
## Perform Stepwise Logistic Regression Using Only Significant Variables

Independent variables used: Outlet_Type, Outlet_Age and Item_MRP
Dependent variable: Hi_priority

The full sample is used in this regression, and there are no warnings, which suggests there is no collinearity.

Chance Accuracy

Classification Table (a,b)

| Observed | Predicted 0 | Predicted 1 | Percentage Correct |
|---|---|---|---|
| Hi_priority = 0 | 0 | 2131 | .0 |
| Hi_priority = 1 | 0 | 6392 | 100.0 |
| Overall Percentage | | | 75.0 |

The chance accuracy is (2131/8523)² + (6392/8523)² = 62.5%.

For the model to be valid, the overall accuracy should be greater than 1.25 × 62.5% = 78.1%.

Omnibus Tests of Model Coefficients

Model Summary

Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

Hosmer and Lemeshow Test

| Step | Chi-square | df | Sig. |
|---|---|---|---|
| 1 | 216.668 | 8 | .000 |

Once more the Hosmer and Lemeshow test failed (its significance value should be >0.05), but since all other assumptions held, we proceeded and compared the results with classification trees.

Classification Table (the cut value is .500)

Since the 90% classification accuracy is greater than 78.1%, the model can be used for prediction.

Variables in the Equation

| | B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I. Lower | 95% C.I. Upper |
|---|---|---|---|---|---|---|---|---|
| Item_MRP | .031 | .001 | 1097.603 | 1 | .000 | 1.032 | 1.030 | 1.033 |
| Outlet_Age | -.014 | .006 | 5.088 | 1 | .024 | .986 | .974 | .998 |
| Outlet_Type_New | | | 1351.281 | 3 | .000 | | | |
| Outlet_Type_New(1) | -8.777 | .269 | 1066.242 | 1 | .000 | .000 | .000 | .000 |
| Outlet_Type_New(2) | | | | | | | | |
| Outlet_Type_New(3) | | | | | | | | |

## Logistic Regression Equation

Since the variables in the equation are significant, let us develop the logistic regression equation.

P(s) = 1 / (1 + e^(−y)), where y = a + b₁x₁ + b₂x₂ + … (Equation 1)

y = 0.496 + 0.031 × Item_MRP − 0.014 × Outlet_Age − 8.778 × Outlet_Type_New(1) − 2.054 × Outlet_Type_New(2) − 2.629 × Outlet_Type_New(3)


Now, substitute this value of y in Equation (1) to get the probability of success:

P(s) = 1 / (1 + e^(−(0.496 + 0.031 × Item_MRP − 0.014 × Outlet_Age − 8.778 × Outlet_Type_New(1) − 2.054 × Outlet_Type_New(2) − 2.629 × Outlet_Type_New(3))))
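As a sketch, the fitted equation can be evaluated directly in pure Python; the coefficients are the ones reported above, while the function name and argument order are mine:

```python
from math import exp

def p_success(item_mrp, outlet_age, type1, type2, type3):
    """Predicted probability of Hi_priority = 1 from the fitted equation.
    type1..type3 are the 0/1 dummies for Outlet_Type_New(1..3)."""
    y = (0.496 + 0.031 * item_mrp - 0.014 * outlet_age
         - 8.778 * type1 - 2.054 * type2 - 2.629 * type3)
    return 1 / (1 + exp(-y))

# All else equal, a higher Item_MRP raises the predicted probability,
# and Outlet_Type_New(1) (Grocery Store) sharply lowers it.
```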

## CART

Model Summary

Specifications:
- Independent variables: Item_Weight, Item_Visibility, Item_MRP, Outlet_Age, Outlet_Type_New

Results:
- Independent variables included: Outlet_Type_New, Item_MRP
- Number of nodes: 18
- Depth: 3

Risk

Growing Method: EXHAUSTIVE CHAID
Dependent Variable: Hi_priority

Classification

Growing Method: EXHAUSTIVE CHAID
Dependent Variable: Hi_priority

The 89.7% accuracy of the CART model is similar to the 90% classification accuracy achieved by logistic regression. This validates the logistic regression despite the failed Hosmer and Lemeshow test, so the logistic regression model can be used for prediction.
