SS ZG536
ADVANCED STATISTICAL TECHNIQUES
FOR ANALYTICS
BITS Pilani
Hyderabad Campus
1
20-11-2018
• “When you can measure what you are speaking about and
express it in numbers, you know something about it; but when
you cannot measure it, when you cannot express it in numbers, your
knowledge is of a meagre and unsatisfactory kind.”
• Lord Kelvin
2
20-11-2018
• Lies
• damn lies
• Statistics
Analytics
The term “ Analytics”
Disciplines
• - Statistics
• - Machine Learning
• - Biology
• - Kernel Methods
3
20-11-2018
Importance of Data
Medical treatment
Industry
Power generation
Crime detection
Cognitive assessment
Model requirements
Business relevance
Statistical performance
Interpretable
Justifiability
Operational efficiency
Economic cost
4
20-11-2018
Statistics?
• Procedures for organising, summarizing, and
interpreting information
Standardized techniques used by scientists
Vocabulary & symbols for communicating about data
Mean
Median
Mode
Range
Mean Deviation
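The five summary measures listed above can be computed in a few lines. This is a minimal sketch using Python's standard library; the `scores` list is hypothetical data chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical exam scores, used only for illustration.
scores = [2, 3, 3, 5, 7, 10]

avg = mean(scores)                   # arithmetic mean
mid = median(scores)                 # middle value of the sorted data
modes = multimode(scores)            # most frequent value(s)
spread = max(scores) - min(scores)   # range
# Mean deviation: average absolute distance from the mean.
mean_dev = sum(abs(x - avg) for x in scores) / len(scores)

print(avg, mid, modes, spread, round(mean_dev, 3))
```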
5
20-11-2018
Bar Graphs
6
20-11-2018
Histograms
Univariate histograms
[Figure: univariate histogram of Exam 1 scores]
Histograms
7
20-11-2018
Bivariate histogram
8
20-11-2018
Frequency Polygons
A frequency polygon depicts information from a frequency table or a
grouped frequency table as a line graph.
9
20-11-2018
!!!!
• A famous statistician would never travel by airplane, because she had
studied air travel and estimated the probability of there being a bomb on any
given flight was 1 in a million, and she was not prepared to accept these
odds.
• One day a colleague met her at a conference far from home.
• "How did you get here, by train?"
• "No, I flew"
• "What about the possibility of a bomb?"
• "Well, I began thinking that if the odds of one bomb are 1 in a million, then the
odds of TWO bombs are (1/1,000,000) × (1/1,000,000) = $10^{-12}$. This is a
very, very small probability, which I can accept. So, now I bring my own
bomb along!"
Random Experiment
• The term "random experiment" describes any action whose
outcome is not known in advance. Here are some examples of
experiments dealing with statistical data:
• Tossing a coin
• Counting how many times a certain word or combination of words
appears in the text of “King Lear” or in a text of Confucius
• Counting occurrences of a certain combination of amino acids in a
protein database
• Pulling a card from a deck
Events
12
20-11-2018
• Two events are mutually exclusive if they cannot occur at the
same time. Which of these pairs are mutually exclusive?
• Drawing an ace and drawing a heart from a standard deck of
52 cards
• It is raining and I show up for class
• Dr. Li is an easy teacher and I fail the class
• Dr. Beaubouef is a hard teacher and I ace the class
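The card question can be checked by listing outcomes. The sketch below treats an event as a set of cards and calls two events mutually exclusive when the sets share no card; the helper name `mutually_exclusive` and the extra "king" event are illustrative additions, not part of the slides.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))    # 52 equally likely outcomes

aces = {card for card in deck if card[0] == "A"}
hearts = {card for card in deck if card[1] == "hearts"}
kings = {card for card in deck if card[0] == "K"}

def mutually_exclusive(a, b):
    """Two events are mutually exclusive when they share no outcome."""
    return a.isdisjoint(b)

print(mutually_exclusive(aces, hearts))  # False: the ace of hearts is in both
print(mutually_exclusive(aces, kings))   # True: no card is both an ace and a king
```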
13
20-11-2018
Random experiment
• Consider the random experiment of dropping a Styrofoam cup onto
the floor from a height of four feet. The cup hits the ground and
eventually comes to rest. It could land upside down, right side up, or it
could land on its side. We represent these possible outcomes of the
random experiment by the following.
14
20-11-2018
Probability
Axioms of Probability
15
20-11-2018
Probability of a Union
16
20-11-2018
Three Events
17
20-11-2018
2
                 Blue   Black   Brown   Total
Software prog      35      25      20      80
Project Mgrs        7       8       5      20
Total              42      33      25     100
……………………………………………………….., what
is the probability that he is wearing blue trousers?
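Because the opening of the question is elided in the source, the sketch below computes two candidate answers from the table: the marginal probability of blue trousers, and the probability of blue trousers given that the person is a software programmer. Which one the question intends is an assumption left open here.

```python
from fractions import Fraction

# Counts copied from the table above.
counts = {
    "Software prog": {"Blue": 35, "Black": 25, "Brown": 20},
    "Project Mgrs": {"Blue": 7, "Black": 8, "Brown": 5},
}

total = sum(sum(row.values()) for row in counts.values())
blue_total = sum(row["Blue"] for row in counts.values())

p_blue = Fraction(blue_total, total)                      # marginal probability
prog_total = sum(counts["Software prog"].values())
p_blue_given_prog = Fraction(counts["Software prog"]["Blue"], prog_total)

print(p_blue, p_blue_given_prog)  # 21/50 7/16
```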
3
• A survey conducted by a bank revealed that 40% of the accounts are
savings accounts, 35% are current accounts, and the
balance are loan accounts.
18
20-11-2018
4
• From hospital data it is found that 45% of the
patients have high B.P. It was also found that
35% of these patients with high B.P. also have
diabetes.
Conditional Probability
19
20-11-2018
5
                        Actually purchased
Planned to purchase     YES      NO    TOTAL
YES                     200      50      250
NO                      100     650      750
TOTAL                   300     700     1000
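The slide does not state the question for this table, so the sketch below shows the two conditional probabilities it most naturally supports. Variable names are illustrative.

```python
from fractions import Fraction

n = 1000
planned_and_purchased = 200
planned = 250        # row total: planned to purchase
purchased = 300      # column total: actually purchased

# P(A | B) = P(A and B) / P(B)
p_purchased_given_planned = Fraction(planned_and_purchased, n) / Fraction(planned, n)
p_planned_given_purchased = Fraction(planned_and_purchased, n) / Fraction(purchased, n)

print(p_purchased_given_planned, p_planned_given_purchased)  # 4/5 2/3
```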
Conditional Probability
Definition
20
20-11-2018
Multiplication Rule
21
20-11-2018
Independence
6
• Toss a six-sided die twice. The sample space consists of all
ordered pairs (i, j) of the numbers 1, 2, ..., 6, that is, S =
{(1, 1), (1, 2), ..., (6, 6)}. Let A = {outcomes match}
and B = {sum of outcomes at least 8}.
• Then find P(A), P(B), P(A|B) and P(B|A).
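The four probabilities can be found by enumerating the 36 equally likely outcomes. This is a minimal sketch; the helper `prob` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))     # all ordered pairs (i, j)
A = {o for o in S if o[0] == o[1]}           # outcomes match
B = {o for o in S if sum(o) >= 8}            # sum at least 8

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))

p_a, p_b = prob(A), prob(B)
p_a_given_b = prob(A & B) / p_b
p_b_given_a = prob(A & B) / p_a

print(p_a, p_b, p_a_given_b, p_b_given_a)  # 1/6 5/12 1/5 1/2
```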
22
20-11-2018
7
• Three persons A, B and C are competing for the post of CEO of a
company. Their chances of becoming CEO are 0.2, 0.3 and 0.4
respectively.
Bayes’ Theorem
Definition
23
20-11-2018
Bayes’ Theorem
Bayes’ Theorem
24
20-11-2018
Applications
Diagnostic tests in medicine
Telecommunication
Customer service
Example 1
• A component is tested for its stipulated quality, but the
test is not infallible. If the component is good, 70% of the
time the test gives a positive indication, i.e. 70% of the time the
test classifies a good item as good. If the component is
defective, 80% of the time the test gives a negative indication,
implying that the component is bad. If, in the manufacturing
process, the percentage of defective components is 20, then
find the probability that
• the component is good and the test gives a positive indication
• the component is not good and the test gives a negative indication
• the component is good given that the test is positive
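The three requested probabilities follow from the multiplication rule and Bayes' theorem. The sketch below uses only the figures given in the example; variable names are illustrative.

```python
p_defective = 0.20
p_good = 1 - p_defective
p_pos_given_good = 0.70          # test says "good" for a good item
p_neg_given_defective = 0.80     # test says "bad" for a defective item

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_good_and_pos = p_good * p_pos_given_good
p_defective_and_neg = p_defective * p_neg_given_defective

# Total probability of a positive indication, then Bayes' theorem.
p_pos = p_good_and_pos + p_defective * (1 - p_neg_given_defective)
p_good_given_pos = p_good_and_pos / p_pos

print(round(p_good_and_pos, 2), round(p_defective_and_neg, 2), round(p_good_given_pos, 4))
```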
25
20-11-2018
Example 2
Technicians regularly make repairs when breakdowns
occur on an automated production line. Janak, who
services 20% of the breakdowns, makes an incomplete
repair 1 time in 20. Tarun, who services 60% of the
breakdowns, makes an incomplete repair 1 time in 10.
Gautham, who services 15% of the breakdowns, makes an
incomplete repair 1 time in 10, and Prasad, who services
5% of the breakdowns, makes an incomplete repair 1 time
in 20. For the next problem with the production line
diagnosed as being due to an initial repair that was
incomplete, what is the probability that this initial repair
was made by Janak?
Solution
$P(B_1 \mid A) = \dfrac{P(B_1)P(A \mid B_1)}{P(B_1)P(A \mid B_1) + P(B_2)P(A \mid B_2) + P(B_3)P(A \mid B_3) + P(B_4)P(A \mid B_4)}$
$= \dfrac{(0.20)(0.05)}{(0.20)(0.05) + (0.60)(0.10) + (0.15)(0.10) + (0.05)(0.05)}$
$= 0.114$
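The same calculation can be written as a small function that returns the posterior probability for every technician at once. The function name `posterior` is an illustrative choice.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem: P(B_i | A) for each i, given P(B_i) and P(A | B_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                   # P(A), by the law of total probability
    return [j / total for j in joint]

# Order: Janak, Tarun, Gautham, Prasad
share_of_breakdowns = [0.20, 0.60, 0.15, 0.05]
p_incomplete = [1 / 20, 1 / 10, 1 / 10, 1 / 20]

post = posterior(share_of_breakdowns, p_incomplete)
print(round(post[0], 3))  # 0.114
```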
Random Variables
27
20-11-2018
Random Variables
Definition
Random Variables
Definition
28
20-11-2018
Random Variables
29
20-11-2018
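$\int_{-\infty}^{\infty} f(x)\,dx = 1$    (1.16)
$P(E) = \int_{E} f(x)\,dx$    (1.17)
Examples:
$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$    (1.18)
$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$    (1.19)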
Expected Value
$E(X) = \sum_{i=1}^{n} X_i \, P(X_i)$
Variance
$\sigma^2 = \sum_{i=1}^{n} [X_i - E(X)]^2 \, P(X_i)$
1
• Toss a coin 3 times. The sample space is
• S = {HHH, HTH, THH, TTH, HHT, HTT, THT, TTT}
• Mean
• Variance
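The slide does not name the random variable; the sketch below assumes X is the number of heads in the three tosses and applies the expected value and variance formulas directly to its probability distribution.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))             # the 8 outcomes in S
# Assumption: X = number of heads in an outcome.
counts = Counter(o.count("H") for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean, variance)  # 3/2 3/4
```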
32
20-11-2018
Binomial Distribution
n = number of trials, x = number of successes, p = probability of success,
q = probability of failure
$P(x) = \dfrac{n!}{x!\,(n-x)!}\, p^{x} q^{\,n-x}$
Mean: $\mu = np$
Variance: $\sigma^2 = np(1-p)$
33
20-11-2018
3
• A recent national study showed that approximately 44.7% of college
students have used Wikipedia as a source in at least one of their term
papers.
• Let X equal the number of students in a random sample of size n = 31
who have used Wikipedia as a source.
• How is X distributed?
• Find the probability that X is equal to 17.
• Find the probability that X is at most 13.
• Find the probability that X is between 16 and 19, inclusive.
• Find mean and variance
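X counts successes in n = 31 independent trials with success probability 0.447, so it follows a binomial distribution. The sketch below evaluates each requested quantity from the binomial formula; `binom_pmf` is an illustrative helper name.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 31, 0.447

p_eq_17 = binom_pmf(17, n, p)
p_at_most_13 = sum(binom_pmf(x, n, p) for x in range(0, 14))
p_16_to_19 = sum(binom_pmf(x, n, p) for x in range(16, 20))   # inclusive
mean = n * p
variance = n * p * (1 - p)

print(p_eq_17, p_at_most_13, p_16_to_19, mean, variance)
```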
Poisson distribution:
$P(X) = \dfrac{\lambda^{X} e^{-\lambda}}{X!}$    Expected value = $\lambda$    Variance = $\lambda$
34
20-11-2018
Problem
• On average, five cars arrive at a particular car wash
every hour. Let X count the number of cars that arrive
from 10 AM to 11 AM (mean λ = 5).
Problem
• Suppose the car wash is in operation from 8 AM to 6 PM, and we let Y
be the number of customers that appear in this period. Since this
period covers a total of 10 hours, λ = 5 × 10 = 50.
• What is the probability that there are between 48 and 50 customers,
inclusive?
35
20-11-2018
Normal Distribution
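Probability density function - f(X)
$f(X) = \dfrac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}$
[Figure: three normal distributions with different areas; example with μ = 100 and σ = 15, showing the x scale and the corresponding Z scale from -3 to 3]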
Thanks
L- 2: Descriptive Statistics
38
20-11-2018
Today…..
Visualization of data
Basics of probability
Conditional probability
Visualization
Summary gives an idea about the data
• summary(income)
• Min 1st QU. Median Mean 3rd Qu Max
• - 7.8 12.5 32.0 52.03 67.2 585
Visualization – why
39
20-11-2018
Data Visualisation
Line chart
Bar chart
Histogram
Pie chart
Scatter plot
Box plot
Line Chart
40
20-11-2018
Bar Chart
Histograms
41
20-11-2018
Histograms
Histograms
42
20-11-2018
Pie charts
Scatter Plot
43
20-11-2018
Box plot
To conclude _ Visualization
Visualization gives a sense of data distribution and relationship
among variables
44
20-11-2018
•Sample Space
Discrete sample spaces.
Continuous sample spaces
Event
Independent events
Dependent events
46
20-11-2018
Probability
Axioms of Probability
47
20-11-2018
Thanks
L- 3: Descriptive Statistics
50
20-11-2018
Today…..
Visualization of data
Basics of probability
Conditional probability
Box plot
Thanks
L- 4: Descriptive Statistics
Today…..
Recap: conditional probability and Bayes’ theorem, with some examples
Random variables
Probability distribution
Examples
Conditional Probability
and Bayes’ theorem
Thanks
Agenda
Problem
• On average, five cars arrive at a particular car wash every hour. Let
X count the number of cars that arrive from 10 AM to 11 AM (mean λ = 5).
What is the probability that no car arrives during this period?
Problem
• Suppose the car wash is in operation from 8 AM to 6 PM, and we let Y
be the number of customers that appear in this period (λ = 5 × 10 = 50).
• What is the probability that there are between 48 and 50 customers,
inclusive?
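Both car-wash questions are Poisson probabilities. The sketch below evaluates them from the Poisson formula $P(X) = \lambda^{X} e^{-\lambda} / X!$; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

# One hour, lambda = 5: probability that no car arrives.
p_no_car = poisson_pmf(0, 5)

# Ten hours, lambda = 50: probability of 48, 49 or 50 customers.
p_48_to_50 = sum(poisson_pmf(y, 50) for y in range(48, 51))

print(p_no_car, p_48_to_50)
```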
[Figure: three normal distributions with different areas; example with μ = 100 and σ = 15, where x = 55, 70, 85, 100, 115, 130, 145 corresponds to Z = -3, -2, -1, 0, 1, 2, 3]
Note
Problem
Find 1) $P[5 \le X \le 10]$
2)P [ X 5]
Solution
1) $\mu = 8$
$\sigma = 4$
We know that $Z = \dfrac{X - \mu}{\sigma} = \dfrac{X - 8}{4}$
$P[5 \le X \le 10] = P[-0.75 \le Z \le 0.5]$
$= F(0.5) - F(-0.75)$
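The final difference of standard normal CDF values can be evaluated numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and checks that standardising gives the same answer as working with X directly.

```python
from statistics import NormalDist

mu, sigma = 8, 4
Z = NormalDist()                 # standard normal, mean 0 and sd 1
X = NormalDist(mu=mu, sigma=sigma)

z_low = (5 - mu) / sigma         # -0.75
z_high = (10 - mu) / sigma       # 0.5

p_via_z = Z.cdf(z_high) - Z.cdf(z_low)   # F(0.5) - F(-0.75)
p_direct = X.cdf(10) - X.cdf(5)          # same probability without standardising

print(round(p_via_z, 4))
```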
Inferential Statistics
Sampling
Sample
Random sampling
94
20-11-2018
Statistical Inferences
Tests of hypothesis
•
Hypothesis Testing
•Goal:
•Make statement(s) regarding unknown
population parameter values based on
sample data
95
20-11-2018
Hypothesis Testing
Example
• Drug company has new drug, wishes to compare it with
current standard treatment
• Federal regulators tell company that they must
demonstrate that new drug is better than current
treatment to receive approval
• Firm runs clinical trial where some patients receive new
drug, and others receive standard treatment
• Numeric response of therapeutic effect is obtained
(higher scores are better).
• Parameter of interest: $\mu_{New} - \mu_{Std}$
96
20-11-2018
Example
97
20-11-2018
Test Statistic
$z_{stat} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}}$
where $\mu_0$ = population mean assuming $H_0$ is true
and $SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
Example
A. Hypotheses:
H0: µ = 100 versus
Ha: µ > 100 (one-sided)
Ha: µ ≠ 100 (two-sided)
B. Test statistic:
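$SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{15}{\sqrt{9}} = 5$
$z_{stat} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \dfrac{112.8 - 100}{5} = 2.56$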
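The test statistic above can be reproduced in code. The sketch below also converts it to one-sided and two-sided p-values, a step the slide does not show and which is added here as an assumption about how the example would continue.

```python
from math import sqrt
from statistics import NormalDist

def z_stat(x_bar, mu0, sigma, n):
    """z statistic for a sample mean when the population sd is known."""
    se = sigma / sqrt(n)             # standard error of the mean
    return (x_bar - mu0) / se

z = z_stat(x_bar=112.8, mu0=100, sigma=15, n=9)

p_one_sided = 1 - NormalDist().cdf(z)    # Ha: mu > 100
p_two_sided = 2 * p_one_sided            # Ha: mu != 100

print(round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```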
98
20-11-2018
Hypothesis Testing
True State       Test result: H0 True     Test result: H0 False
H0 True          Correct Decision         Type I Error
H0 False         Type II Error            Correct Decision
99
20-11-2018
Problem
Solution:
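Here $\alpha = 0.05$
$\dfrac{\alpha}{2} = \dfrac{0.05}{2} = 0.025$
$Z_{\alpha/2} = 1.96$
i.e., if $Z_{cal} < -1.96$ or $Z_{cal} > 1.96$ we reject the null hypothesis.
6. Computation:
Test statistic
$Z_{cal} = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \dfrac{15200 - 15150}{1200/\sqrt{49}} = 0.2916$
7. Decision: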
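The decision step follows mechanically from the critical region. The sketch below recomputes the critical value and the test statistic from the numbers in the solution and compares them; reading 15200 as the sample mean and 15150 as the hypothesised mean is an assumption, since the problem statement is not reproduced on the slide.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # two-tailed critical value, about 1.96

# Assumed roles: sample mean 15200, hypothesised mean 15150, sigma 1200, n 49.
z_cal = (15200 - 15150) / (1200 / sqrt(49))

reject_h0 = abs(z_cal) > z_crit
print(round(z_crit, 2), round(z_cal, 4), reject_h0)
```

Since |Z_cal| is well inside ±1.96, the sketch does not reject the null hypothesis at the 5% level.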
Problem
• firm puts 40 of these tyres on its trucks and get a mean life of
101
20-11-2018
Solution
4. Critical region
102
20-11-2018
5.Computation
Test statistic
6.Conclusion
Procedure
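1. Null hypothesis $H_0: \mu = \mu_0$
2. Alternative hypothesis $H_1: \mu \neq \mu_0$
Or $H_1: \mu > \mu_0$
Or $H_1: \mu < \mu_0$
3. Level of significance: $\alpha$
4. Critical region
Reject $H_0$ if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$ (two-sided alternative)
5. Test statistic
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with (n-1) degrees of freedom
6. Calculation
7. Decision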
104
20-11-2018
Thanks
L- 6: Inferential statistics
106
20-11-2018
Agenda
Statistical Inferences
Tests of hypothesis
Hypothesis Testing
True State       Test result: H0 True     Test result: H0 False
H0 True          Correct Decision         Type I Error
H0 False         Type II Error            Correct Decision
Thanks
Agenda
Central limit theorem
Type I, Type II Errors
Testing of Hypothesis – continuation from
previous session
Covariance
Correlation
Introduction to regression
Thanks
L- 8: Predictive Analytics
Agenda
Covariance
Correlation
Introduction to regression
Method of least squares
Simple linear regression
Regression
Thanks
Regression
Thanks
Agenda
Model validation
Ridge and lasso models
Assumptions of Linear regression
Logistic regression
$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$
• $\beta_1 > 0$: Positive Association
• $\beta_1 < 0$: Negative Association
• $\beta_1 = 0$: No Association
Multiple regression
Model:
$E(Y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
• Least Squares Fitted (predicted) equation, minimizing SSE:
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p, \qquad SSE = \sum \left(Y - \hat{Y}\right)^2$
Accuracy of a model
• The strength of the linear model can be assessed using the following:
1) Coefficient of determination
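For a single predictor, the least-squares fit and the coefficient of determination can be computed by hand. The sketch below uses a small hypothetical data set; the function names `fit_line` and `r_squared` are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(xs, ys, b0, b1):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
r2 = r_squared(xs, ys, b0, b1)
print(round(b0, 2), round(b1, 2), round(r2, 2))  # 2.2 0.6 0.6
```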
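Regularization
Overfitting can be mitigated with regularization
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$
• OLS estimation: $\min\; SSE = \sum_{i=1}^{n} \left(Y - \hat{Y}\right)^2$
• LASSO estimation: $\min\; SSE = \sum_{i=1}^{n} \left(Y - \hat{Y}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
• Ridge regression estimation: $\min\; SSE = \sum_{i=1}^{n} \left(Y - \hat{Y}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}$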
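With one mean-centred predictor, all three objectives have closed-form slopes, which makes the shrinking effect of the penalty easy to see. This is a sketch under that simplifying assumption (one predictor, unpenalised intercept), using hypothetical data; it is not a general solver.

```python
def centred(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def slopes(xs, ys, lam):
    """OLS, ridge and LASSO slopes for one centred predictor."""
    x, y = centred(xs), centred(ys)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    ols = sxy / sxx
    # Ridge: minimise sum (y - b*x)^2 + lam * b^2
    ridge = sxy / (sxx + lam)
    # LASSO: minimise sum (y - b*x)^2 + lam * |b|  (soft thresholding)
    sign = 1 if sxy > 0 else -1
    lasso = sign * max(abs(sxy) - lam / 2, 0.0) / sxx
    return ols, ridge, lasso

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(slopes(xs, ys, lam=0))    # no penalty: all three agree
print(slopes(xs, ys, lam=4))    # both penalised slopes shrink towards 0
print(slopes(xs, ys, lam=12))   # LASSO sets the slope exactly to 0
```

The last line illustrates the usual contrast: ridge shrinks coefficients smoothly, while LASSO can set them exactly to zero.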
Assumptions in Regression
Analysis
189
20-11-2018
Assumptions
[Figure: frequency histograms for FEMALE and MALE, horizontal axis from 60.0 to 140.0]
Non-Normality
• Skew and Kurtosis
Skew – much easier to deal with
Kurtosis – less serious anyway
• Transform data
removes skew
positive skew – log transform
negative skew - square
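The effect of a log transform on positive skew can be checked numerically. The sketch below computes the moment-based skewness before and after transforming a hypothetical right-skewed sample; the `skewness` helper is an illustrative implementation.

```python
from math import log

def skewness(values):
    """Moment-based skewness: third central moment over sd cubed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2**1.5

# Hypothetical right-skewed data, for illustration only.
data = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [log(v) for v in data]

skew_raw = skewness(data)
skew_log = skewness(logged)
print(round(skew_raw, 2), round(skew_log, 2))
```

For this sample the transform reduces the skew substantially without removing it entirely.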
Heteroscedasticity
[Figure: scatter plot of MALE against FEMALE, both axes from 40 to 160]
Good – no heteroscedasticity
[Figure: residuals plotted against predicted values]
Bad – heteroscedasticity
[Figure: residuals plotted against predicted values]
Assumption 3:
The Error Term is Additive
[Figure: Grade plotted against Time]
[Figure: Grade plotted against Time, separately for Question 1, 2 and 3]
Multicollinearity
• Correlation Matrix
• If a predictor has VIF > 5, it is highly correlated with the other predictors and is a candidate for elimination from the model
199
20-11-2018
Logistic Regression
There are many important research topics for which the dependent
variable is "limited."
For example: voting, morbidity or mortality, and participation data are
not continuous or normally distributed.
200
20-11-2018
Logistic Regression
Logistic Regression Model
Y = Binary response    X = Quantitative predictor
p = proportion of 1’s (yes, success) at any X
Equivalent forms of the logistic regression model:
Logit form: $\log\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1 X$
Probability form: $p = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$
Call:
glm(formula = Gender ~ Hgt, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.77443 -0.34870 -0.05375 0.32973 2.37928
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 64.1416 8.3694 7.664 1.81e-14 ***
Hgt -0.9424 0.1227 -7.680 1.60e-14***
---
203
20-11-2018
Call:
glm(formula = Gender ~ Hgt, family = binomial, data = Pulse)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 64.1416 8.3694 7.664 1.81e-14 ***
Hgt -0.9424 0.1227 -7.680 1.60e-14***
---
$\hat{p} = \dfrac{e^{64.14 - 0.9424\,Hgt}}{1 + e^{64.14 - 0.9424\,Hgt}}$
= proportion of females at that Hgt
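The fitted equation can be evaluated at specific heights to see how the predicted proportion changes. The sketch below uses the coefficients from the output above; the example heights are hypothetical, and the unit of `Hgt` is not stated in the source.

```python
from math import exp

# Coefficients from the glm output above.
B0, B1 = 64.1416, -0.9424

def p_female(hgt):
    """Predicted proportion of females at a given height (probability form)."""
    eta = B0 + B1 * hgt              # logit form: log(p / (1 - p))
    return 1 / (1 + exp(-eta))

# Hypothetical heights, in whatever unit Hgt was recorded.
for hgt in (60, 68, 75):
    print(hgt, round(p_female(hgt), 4))
```

Because the slope is negative, the predicted proportion of females falls as height increases, crossing 0.5 close to Hgt = 68.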
204
20-11-2018
> lmod=glm(cbind(Yes,No)~Group,family=binomial,data=TMS)
> summary(lmod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
GroupTMS 0.8184 0.3167 2.584 0.00977 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Y = Binary response    X1, X2, …, Xk = Multiple predictors
π = proportion of 1’s (yes, success) at any x1, x2, …, xk
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.439304 1.021042 -1.410 0.15865
BP 0.022994 0.008325 2.762 0.00575 **
Sex 1.455166 1.525558 0.954 0.34016
BP:Sex -0.013020 0.011965 -1.088 0.27653
[Figure: probability of voting Yes (0.0 to 1.0), with Rep in red and Dem in blue. The lines are very close to parallel; not a significant interaction.]
Forecasting models
Principles of forecasting
Time series analysis
Smoothing and decomposition methods
ARIMA
GARCH
Holt-Winters model
Causal methods
Moving averages
Exponential smoothing
207
20-11-2018
Forecasting
Forecasting
c) 5.0, 7.5, 6.0, 4.5, 7.0, 9.5, 8.0, 6.5, 9.0
208
20-11-2018
What Is Forecasting?
209
20-11-2018
Importance of Forecasting
210
20-11-2018
Types of forecasts
Demand Forecasts
Environmental Forecasts
Technological Forecasts
211
20-11-2018
Timing of Forecasts
Short-range Forecast
Quantitative
Forecasting
212
20-11-2018
213
20-11-2018
214
20-11-2018
Trend Component
• Persistent, overall upward or downward pattern
• Due to population, technology etc.
• Several years duration
[Figure: Sales (response) against Time, showing the trend component]
Cyclical Component
• Repeating up & down movements
• Due to interactions of factors influencing economy
• Usually 2-10 years duration
[Figure: Sales (response) against Time, showing one cycle of the cyclical component]
Seasonal Component
• Regular pattern of up & down fluctuations
• Due to weather, customs etc.
• Occurs within one year
[Figure: Sales (response) by month or quarter, with Summer marked, showing the seasonal component]
Irregular Component
• War
• Short duration &
nonrepeating
$F_t = E(Y_t) = \dfrac{\sum_{i=t-k}^{t-1} Y_i}{k}$
Weighted Moving Average Forecast
$F_t = E(Y_t) = \dfrac{\sum_{i=t-k}^{t-1} w_i Y_i}{k}$
Moving Average
[Solution]
219
20-11-2018
Moving Average
Year    Response    Moving Ave
1994    2           NA
1995    5           3
1996    2           3
1997    2           3.67
1998    7           5
1999    6           NA
[Figure: Sales plotted against year, 94 to 99]
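The Moving Ave column is a centred three-year average, which is why the first and last years have no value. The sketch below reproduces the column; the function name is illustrative.

```python
def centred_moving_average(values, k=3):
    """Centred k-period moving average (k odd); None where the window does not fit."""
    half = k // 2
    out = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            out.append(None)
        else:
            window = values[i - half : i + half + 1]
            out.append(round(sum(window) / k, 2))
    return out

response = [2, 5, 2, 2, 7, 6]    # 1994 to 1999, from the table above
print(centred_moving_average(response))  # [None, 3.0, 3.0, 3.67, 5.0, None]
```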
Thanks
Forecasting models
Principles of forecasting
Time series analysis
Smoothing and decomposition methods
Causal methods
Moving averages
Exponential smoothing
AR,MA,ARMA & ARIMA Models
Quantitative
Forecasting
222
20-11-2018
Applications
Retail sales
Stock trading
223
20-11-2018
Seasonality
Cyclic
Random
224
20-11-2018
identify and account for any trends or seasonality in the time series
examine the remaining time series and determine a suitable model
225
20-11-2018
$X_t = T_t + S_t + C_t + I_t$
Alternatively, in other circumstances we might define a time series as the
product of its components, a multiplicative model, often represented
as a logarithmic model:
$X_t = T_t \times S_t \times C_t \times I_t$
where $T_t$ = Trend, $S_t$ = Seasonal, $C_t$ = Cyclical, $I_t$ = Irregular
226
20-11-2018
Smoothing Methods
Example(Moving averages)
• Use the following data to compute three year moving average for all
available years. Find the trend and Forecast error
YEAR    Sales (Lakhs)    YEAR    Sales (Lakhs)
2008    21               2013    22
2009    22               2014    25
2010    23               2015    26
2011    25               2016    27
2012    24               2017    26
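One way to read this exercise is to use the average of the previous three years as the forecast for each year and take actual minus forecast as the error. That interpretation is an assumption; the sketch below follows it using the data from the table.

```python
years = list(range(2008, 2018))
sales = [21, 22, 23, 25, 24, 22, 25, 26, 27, 26]   # lakhs, from the table above

def moving_average_forecasts(values, k=3):
    """Forecast for position i is the mean of the k values before it."""
    return {i: sum(values[i - k : i]) / k for i in range(k, len(values))}

forecasts = moving_average_forecasts(sales)
errors = {years[i]: sales[i] - f for i, f in forecasts.items()}

for i, f in forecasts.items():
    print(years[i], sales[i], round(f, 2), round(errors[years[i]], 2))
```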
228
20-11-2018
Ft 1 C t A t
• Differs from the simple moving average that weighs all periods
equally - more responsive to trends
Months 1 2 3 4 5 6 7 8 9 10 11 12
Sales 10 12 13 16 19 23 26 30 28 18 16 14
Forecasting Trend
• Basic forecasting models for trends compensate for the lagging that
would otherwise occur
• One model, trend-adjusted exponential smoothing uses a three step
process
• Step 1 - Smoothing the level of the series
$S_t = \alpha A_t + (1 - \alpha)(S_{t-1} + T_{t-1})$
• Step 2 – Smoothing the trend
$T_t = \beta (S_t - S_{t-1}) + (1 - \beta) T_{t-1}$
• Forecast including the trend
$FIT_{t+1} = S_t + T_t$
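The three steps translate directly into a loop. In the sketch below the smoothing constants and the starting level and trend (`alpha`, `beta`, `s0`, `t0`) are hypothetical choices, and the input is a made-up series that rises by exactly 2 per period so the result is easy to check.

```python
def trend_adjusted_forecasts(actuals, alpha, beta, s0, t0):
    """Trend-adjusted exponential smoothing; returns FIT for the next period at each step."""
    s_prev, t_prev = s0, t0
    fits = []
    for a in actuals:
        s = alpha * a + (1 - alpha) * (s_prev + t_prev)    # Step 1: level
        t = beta * (s - s_prev) + (1 - beta) * t_prev      # Step 2: trend
        fits.append(s + t)                                 # forecast including trend
        s_prev, t_prev = s, t
    return fits

# Hypothetical series rising by 2 each period, with matching starting values.
actuals = [12, 14, 16, 18]
fits = trend_adjusted_forecasts(actuals, alpha=0.3, beta=0.4, s0=10, t0=2)
print([round(f, 2) for f in fits])  # [14.0, 16.0, 18.0, 20.0]
```

On a perfectly linear series the forecasts track the next value with no lag, which is the point of adding the trend term.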
233
20-11-2018
• Mean Square Error (MSE): $MSE = \dfrac{\sum (\text{actual} - \text{forecast})^2}{n}$
Penalizes larger errors
• Tracking Signal: $TS = \dfrac{CFE}{MAD}$
Measures if your model is working
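Both measures are short computations once the forecast errors are available. The sketch below takes CFE as the cumulative (summed) forecast error and MAD as the mean absolute deviation of the errors, which are the standard readings of those abbreviations but are not spelled out on the slide; the data are hypothetical.

```python
def forecast_errors(actual, forecast):
    return [a - f for a, f in zip(actual, forecast)]

def mse(errors):
    """Mean square error: squaring penalises larger errors more heavily."""
    return sum(e * e for e in errors) / len(errors)

def mad(errors):
    """Mean absolute deviation of the forecast errors."""
    return sum(abs(e) for e in errors) / len(errors)

def tracking_signal(errors):
    """CFE / MAD, where CFE is the cumulative forecast error."""
    return sum(errors) / mad(errors)

# Hypothetical actuals and forecasts, for illustration only.
actual = [25, 22, 26]
forecast = [22, 24, 25]

errs = forecast_errors(actual, forecast)
print(errs, round(mse(errs), 3), mad(errs), tracking_signal(errs))
```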
234
20-11-2018
235
20-11-2018
Models
AR Model
MA Model
ARMA Model
ARIMA Model
AR Model(Auto regressive
model)
236
20-11-2018
237
20-11-2018
Thanks
• Mean Square Error (MSE): $MSE = \dfrac{\sum (\text{actual} - \text{forecast})^2}{n}$
Penalizes larger errors
• Tracking Signal: $TS = \dfrac{CFE}{MAD}$
Measures if your model is working
Case
245
20-11-2018
ANOVA-analysis of variance
• * Significance of difference between two sample means
246
20-11-2018
ANOVA
Assumptions
247
20-11-2018
ANOVA summary
248
20-11-2018
Example
• To test the significance of variation in the retail prices of a
commodity in three metro cities, Mumbai, Kolkata and
Delhi, four shops are chosen at random and the prices are
given below
ANOVA summary
Thanks
Preliminaries
Standard deviation is a measure of the spread of the
data
Variance is a measure of the deviation from the mean for
points in one dimension, e.g. heights
Covariance is a measure of how much the
dimensions vary from their means with respect to each
other.
Covariance is measured between 2 dimensions to see if
there is a relationship between the 2 dimensions, e.g.
number of hours studied & marks obtained
The covariance between one dimension and itself is the
variance
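The last point is easy to confirm numerically. The sketch below computes the sample covariance for a hypothetical hours-studied and marks-obtained data set and checks that the covariance of a dimension with itself equals its variance.

```python
from statistics import variance

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, for illustration only.
hours = [2, 4, 6, 8]
marks = [50, 60, 65, 85]

cov_hm = covariance(hours, marks)     # positive: marks rise with hours
cov_hh = covariance(hours, hours)     # equals the variance of hours

print(round(cov_hm, 3), round(cov_hh, 3), round(variance(hours), 3))
```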
264
20-11-2018
Covariance Matrix
If covariance is zero: the two dimensions are uncorrelated (no linear relationship); this alone does not guarantee that they are independent.
265
20-11-2018
Transformation matrices
• Consider:
$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \times \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \times \begin{pmatrix} 3 \\ 2 \end{pmatrix}$
eigenvalue problem
A . v = λ . v
Therefore, (3,2) is an eigenvector of the square matrix
A and 4 is an eigenvalue of A
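The claim can be verified by multiplying out A·v, and both eigenvalues of a 2×2 matrix can be recovered from its characteristic polynomial λ² − tr(A)·λ + det(A) = 0. The sketch below does both; the helper `matvec` is an illustrative name.

```python
from math import sqrt

A = [[2, 3],
     [2, 1]]
v = [3, 2]

def matvec(m, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in m]

Av = matvec(A, v)
print(Av, [4 * x for x in v])    # [12, 8] [12, 8]

# Eigenvalues from the characteristic polynomial of a 2x2 matrix.
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = sqrt(trace**2 - 4 * det)
eigenvalues = ((trace + disc) / 2, (trace - disc) / 2)
print(eigenvalues)               # (4.0, -1.0)
```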
267
20-11-2018
268
20-11-2018
Data Presentation
• Blood and urine measurements (wet chemistry) from 65 people (33 alcoholics, 32
non-alcoholics).
• Matrix Format
      H-WBC    H-RBC    H-Hgb     H-Hct     H-MCV      H-MCH     H-MCHC
A1    8.0000   4.8200   14.1000   41.0000    85.0000   29.0000   34.0000
A2    7.3000   5.0200   14.7000   43.0000    86.0000   29.0000   34.0000
A3    4.3000   4.4800   14.1000   41.0000    91.0000   32.0000   35.0000
A4    7.5000   4.4700   14.9000   45.0000   101.0000   33.0000   33.0000
A5    7.3000   5.5200   15.4000   46.0000    84.0000   28.0000   33.0000
A6    6.9000   4.8600   16.0000   47.0000    97.0000   33.0000   34.0000
A7    7.8000   4.6800   14.7000   43.0000    92.0000   31.0000   34.0000
A8    8.6000   4.8200   15.8000   42.0000    88.0000   33.0000   37.0000
A9    5.1000   4.7100   14.0000   43.0000    92.0000   30.0000   32.0000
[Figure: Value plotted against Measurement]
Univariate
[Figure: H-Bands plotted against Person]
Bivariate
[Figure: C-LDH plotted against C-Triglycerides]
Trivariate
[Figure: M-EPI plotted against C-LDH and C-Triglycerides]
Applications
Face Recognition
Image Compression
Gene Expression Analysis
Data Reduction
Data Classification
Trend Analysis
Factor Analysis
Noise Reduction
• In real-world data analysis tasks we analyze complex, i.e. multi-dimensional,
data. We plot the data and find various patterns in it, or use it
to train machine learning models. One way to think about
dimensions: suppose you have a data point x. If we consider this
data point as a physical object, then dimensions are merely bases of view,
such as where the data is located when it is observed from the horizontal axis or
the vertical axis.
270
20-11-2018
271
20-11-2018
• PCA finds a new set of dimensions (or a set of bases of view) such that all
the dimensions are orthogonal (and hence linearly independent) and
ranked according to the variance of the data along them. This means the more
important principal axes occur first (more important = more variance, i.e.
more spread-out data).
Principal Components
[Figure: data plotted as Wavelength 2 against Wavelength 1, with the PC 1 and PC 2 directions marked]
… the ordinate axes.
• First PC is direction of maximum variance from
origin
• Subsequent PCs are … variance
An Example
Mean1 = 24.1, Mean2 = 53.8
X1    X2    X1'     X2'
19    63    -5.1      9.25
39    74    14.9     20.25
30    87     5.9     33.25
30    23     5.9    -30.75
15    35    -9.1    -18.75
15    32    -9.1    -21.75
30    73     5.9     19.25
[Figure: scatter plots of the original data (X1, X2) and of the mean-centred data (X1', X2')]
Covariance Matrix
• $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$
data as -0.3
-0.4
8.624
19.404
-0.5
-17.63
$y_i = \begin{pmatrix} 0.21 & 0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix} = 0.21\,x_{i1} + 0.98\,x_{i2}$
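The first principal component is the eigenvector of the covariance matrix with the largest eigenvalue. The sketch below computes it for the 2×2 matrix C shown above using the closed-form eigenvalues of a symmetric 2×2 matrix. With the rounded entries of C it gives a direction of roughly (0.24, 0.97), close to but not identical to the slide's (0.21, 0.98); the difference presumably comes from rounding in the printed matrix.

```python
from math import hypot, sqrt

# Covariance matrix from the slide (rounded entries).
a, b, d = 75, 106, 482           # C = [[a, b], [b, d]]

trace, det = a + d, a * d - b * b
disc = sqrt(trace**2 - 4 * det)
lam1 = (trace + disc) / 2        # largest eigenvalue: variance along PC 1
lam2 = (trace - disc) / 2        # variance along PC 2

# Eigenvector for lam1 solves (a - lam1) * v1 + b * v2 = 0.
v1, v2 = b, lam1 - a
norm = hypot(v1, v2)
pc1 = (v1 / norm, v2 / norm)     # unit-length first principal direction

def project(x1, x2):
    """Score of a mean-centred point on the first principal component."""
    return pc1[0] * x1 + pc1[1] * x2

explained = lam1 / (lam1 + lam2)   # share of total variance captured by PC 1
print([round(c, 2) for c in pc1], round(explained, 3))
```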
Thanks