
Reading 9 Correlation and Regression

FinQuiz Notes 2015

Scatter plots and correlation analysis are used to examine how two sets of data are related.

2.1 Scatter Plots

A scatter plot graphically shows the relationship between two variables. If the points on the scatter plot cluster together in a straight line, the two variables have a strong linear relation. Each observation in the scatter plot is represented by a point, and the points are not connected.

2.2 & 2.3 Correlation Analysis & Calculating and Interpreting the Correlation Coefficient

The sample covariance is calculated as:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where,
n = sample size
$X_i$ = ith observation on variable X
$\bar{X}$ = mean of the variable X observations
$Y_i$ = ith observation on variable Y
$\bar{Y}$ = mean of the variable Y observations

• The covariance of a random variable with itself is simply the variance of the random variable.
• Covariance can range from $-\infty$ to $+\infty$.
• The covariance number doesn't tell the investor if the relationship between two variables (e.g. returns of two assets X and Y) is strong or weak. It only tells the direction of this relationship. For example:
  o A positive covariance shows that the rates of return of two assets move in the same direction: when the rate of return of asset X is negative, the returns of the other asset tend to be negative as well, and vice versa.
  o A negative covariance shows that the rates of return of two assets move in opposite directions: when the return on asset X is positive, the returns of the other asset Y tend to be negative, and vice versa.

NOTE:
• If there is positive covariance between two assets, the investor should evaluate whether or not he/she should include both of these assets in the same portfolio, because their returns move in the same direction and the risk in the portfolio may not be diversified.
• If there is negative covariance between the pair of assets, the investor should include both of these assets in the portfolio, because their returns move in opposite directions and the risk in the portfolio could be diversified or decreased.
• If there is zero covariance between two assets, it means that there is no linear relationship between the rates of return of the two assets, and the assets can be included in the same portfolio.

The correlation coefficient measures the direction and strength of linear association between two variables. The correlation coefficient between two assets X and Y can be calculated using the following formula:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}$$

where $s_X$ and $s_Y$ are the sample standard deviations of X and Y (the square roots of the sample variances).

NOTE:
Unlike covariance, correlation has no unit of measurement; it is a simple number.

Example: $\mathrm{Cov}(X,Y) = 47.78$, $s_X^2 = 40$, $s_Y^2 = 250$

$$r = \frac{47.78}{\sqrt{40}\,\sqrt{250}} = 0.478$$

• The correlation coefficient can range from -1 to +1.
• Two variables are perfectly positively correlated if the correlation coefficient is +1.
• A correlation coefficient of -1 indicates a perfect inverse (negative) linear relationship between the returns of two assets.
• When the correlation coefficient equals 0, there is no linear relationship between the returns of two assets.
• The closer the absolute value of the correlation coefficient is to 1, the stronger the linear relationship between the returns of two assets.

Note: A correlation of +/- 1 does not imply that the slope of the line is +/- 1.

NOTE:
Combining two assets that have zero correlation with each other reduces the risk of the portfolio. A negative correlation coefficient results in greater risk reduction.
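The sample covariance and correlation calculations described in this section can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the only figures taken from the notes are the example inputs (covariance 47.78, variances 40 and 250).

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance of two equal-length series, using the n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation_from_moments(cov_xy, var_x, var_y):
    """Correlation = covariance divided by the product of the standard deviations."""
    return cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

def sample_correlation(xs, ys):
    """Correlation computed directly from data; Cov(X, X) is the sample variance of X."""
    return correlation_from_moments(
        sample_covariance(xs, ys),
        sample_covariance(xs, xs),
        sample_covariance(ys, ys),
    )

# Worked example from the notes: Cov = 47.78, Var(X) = 40, Var(Y) = 250.
r_example = correlation_from_moments(47.78, 40.0, 250.0)
print(round(r_example, 3))  # 0.478
```

Because the correlation divides out the units of both series, rescaling either series (for example, quoting returns in percent instead of decimals) changes the covariance but leaves the correlation unchanged.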

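The note that zero or negative correlation reduces portfolio risk can be illustrated with the standard two-asset portfolio variance formula. That formula is not stated in these notes, so treat this as a sketch under that assumption; the 50/50 weights and the 20% standard deviations are made-up inputs and the function name is hypothetical.

```python
import math

def two_asset_portfolio_std(w1, sd1, sd2, rho):
    """Standard deviation of a two-asset portfolio with weights w1 and 1 - w1.

    Uses the textbook identity (an assumption here, not taken from the notes):
    var_p = w1^2 sd1^2 + w2^2 sd2^2 + 2 w1 w2 rho sd1 sd2
    """
    w2 = 1.0 - w1
    variance = (w1 * sd1) ** 2 + (w2 * sd2) ** 2 + 2.0 * w1 * w2 * rho * sd1 * sd2
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding error

# Illustrative inputs: equal weights, both assets with a 20% standard deviation.
for rho in (1.0, 0.0, -1.0):
    print(rho, round(two_asset_portfolio_std(0.5, 0.20, 0.20, rho), 4))
```

With these inputs the portfolio risk equals the individual asset risk only when the correlation is +1; it is lower at zero correlation and lowest at -1.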

Difference b/w Covariance & Correlation: The covariance primarily provides information to the investor about whether the relationship between asset returns is positive, negative or zero, but the correlation coefficient tells the degree of relationship between asset returns.

Correlation coefficients are valid only if the means, variances & covariances of X and Y are finite and constant. When these assumptions do not hold, the correlation between two different variables depends largely on the sample selected.

2.4 Limitations of Correlation Analysis

1. Linearity: Correlation only measures linear relationships properly.

2. Outliers: Correlation may be an unreliable measure when outliers are present in one or both of the series.

3. No proof of causation: Based on correlation we cannot assume x causes y; there could be a third variable causing change in both variables.

4. Spurious Correlations: Spurious correlation is a correlation in the data without any causal relationship. This may occur when:
   i. two variables have only chance relationships.
   ii. two variables that are otherwise uncorrelated become correlated when each is mixed with a third variable.
   iii. correlation between two variables results from a third variable.

Spurious correlation may suggest investment strategies that appear profitable but actually would not be so, if implemented.

2.6 Testing the Significance of the Correlation

A t-test is used to determine if the sample correlation coefficient, r, is statistically significant.

Two-Tailed Test:
Null Hypothesis H0: the correlation in the population is 0 ($\rho = 0$);
Alternative Hypothesis H1: the correlation in the population is different from 0 ($\rho \neq 0$);

The null hypothesis is the hypothesis to be tested. The alternative hypothesis is the hypothesis that is accepted if the null is rejected.

The formula for the t-test is (for normally distributed variables):

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2)$$

where,
r = the sample coefficient of correlation, calculated by $r = \mathrm{Cov}(X,Y)/(s_X s_Y)$
t = t-statistic (or calculated t)
n - 2 = degrees of freedom

Decision Rule:
If the test statistic is < -t-critical or > +t-critical with n - 2 degrees of freedom (i.e. if the absolute value of t > tc), reject H0; otherwise do not reject H0.

Example:
Suppose r = 0.886 and n = 8, and tc = 2.4469 (at the 5% significance level, two-tailed, i.e. $\alpha/2$ = 5%/2 in each tail, and degrees of freedom = 8 - 2 = 6).

$$t = \frac{0.886\sqrt{8-2}}{\sqrt{1-(0.886)^2}} = 4.68$$

Since t-value > tc, we reject the null hypothesis of no correlation.

The magnitude of r needed to reject the null hypothesis (H0: $\rho = 0$) decreases as sample size n increases, because as n increases:
  o the number of degrees of freedom increases
  o the absolute value of tc decreases
  o the t-value increases

In other words, the probability of a type II error decreases when sample size (n) increases, all else equal.
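The significance test for a correlation coefficient can be sketched as below, using the example inputs from the notes (r = 0.886, n = 8, tc = 2.4469). The function names are hypothetical, and the critical value is passed in rather than computed, since the Python standard library has no t-distribution. The last helper simply rearranges the t formula to solve for r; it is included to show why a smaller |r| suffices as n grows.

```python
import math

def correlation_t_stat(r, n):
    """t-statistic for H0: population correlation = 0, with n - 2 degrees of freedom."""
    if n <= 2 or not -1.0 < r < 1.0:
        raise ValueError("need n > 2 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def reject_zero_correlation(r, n, t_critical):
    """Two-tailed decision rule: reject H0 when |t| exceeds the critical value."""
    return abs(correlation_t_stat(r, n)) > t_critical

def min_abs_r_to_reject(n, t_critical):
    """Smallest |r| that rejects H0, from solving t = r*sqrt(n-2)/sqrt(1-r^2) for r."""
    return t_critical / math.sqrt(t_critical ** 2 + n - 2)

t_value = correlation_t_stat(0.886, 8)
print(round(t_value, 2))                          # 4.68
print(reject_zero_correlation(0.886, 8, 2.4469))  # True
```

Holding the critical value fixed, `min_abs_r_to_reject` falls as n rises; in practice tc also shrinks with more degrees of freedom, which lowers the threshold further.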

Type I error = rejecting the null hypothesis although it is true.
Type II error = not rejecting the null hypothesis although it is false.

Practice: Example 7, 8, 9 & 10, Volume 1, Reading 9.

Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable.
• Explain the impact of changes in an independent variable on the dependent variable.

Linear regression assumes a linear relationship between the dependent and the independent variables. Linear regression is also known as linear least squares, since it selects values for the intercept b0 and slope b1 that minimize the sum of the squared vertical distances between the observations and the regression line.

Estimated Regression Model: The sample regression line provides an estimate of the population regression line. Note that the population parameter values b0 and b1 are not observable; only estimates of b0 and b1 are observable.

Types of data used in regression analysis:

1) Time-series: It uses many observations from different time periods for the same company, asset class or country etc.

2) Cross-sectional: It uses many observations for the same time period of different companies, asset classes or countries etc.

3) Panel data: It is a mix of time-series and cross-sectional data.

Dependent variable: The variable to be explained (or predicted) by the independent variable. Also called the endogenous or predicted variable.

Independent variable: The variable used to explain the dependent variable. Also called the exogenous or predicting variable.

Intercept (b0): The predicted value of the dependent variable when the independent variable is set to zero.

$$b_0 = \bar{y} - b_1 \bar{x}$$

Slope Coefficient or regression coefficient (b1): The change in the dependent variable for a unit change in the independent variable.

$$b_1 = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$

or

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Error Term: It represents the portion of the dependent variable that cannot be explained by the independent variable.

Example:
n = 100

$$\bar{x} = 36{,}009.45; \quad s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 43{,}528{,}688$$

$$\bar{y} = 5{,}411.41; \quad \mathrm{cov}(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = -1{,}356{,}256$$

$$b_1 = \frac{\mathrm{cov}(X,Y)}{s_x^2} = \frac{-1{,}356{,}256}{43{,}528{,}688} = -0.0312$$

$$b_0 = \bar{y} - b_1 \bar{x} = 5{,}411.41 - (-0.0312)(36{,}009.45) = 6{,}535$$

$$\hat{y} = b_0 + b_1 x = 6{,}535 - 0.0312x$$
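The slope and intercept formulas can be sketched as follows. The function names are hypothetical. The first helper works from summary statistics and optionally rounds the slope before computing the intercept, which is the assumption needed to reproduce the example's 6,535; the second fits a line directly from data.

```python
def slope_and_intercept(cov_xy, var_x, mean_x, mean_y, slope_decimals=None):
    """Least-squares slope b1 = Cov(X, Y) / Var(X) and intercept b0 = mean_y - b1 * mean_x.

    slope_decimals optionally rounds b1 first, mimicking a hand calculation.
    """
    b1 = cov_xy / var_x
    if slope_decimals is not None:
        b1 = round(b1, slope_decimals)
    b0 = mean_y - b1 * mean_x
    return b0, b1

def fit_line(xs, ys):
    """Fit y = b0 + b1 * x from raw data (the n - 1 divisors cancel in the slope ratio)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# Example from the notes: cov = -1,356,256, var(x) = 43,528,688,
# mean x = 36,009.45, mean y = 5,411.41.
b0, b1 = slope_and_intercept(-1_356_256, 43_528_688, 36_009.45, 5_411.41, slope_decimals=4)
print(b1, round(b0))  # -0.0312 6535
```

Without the intermediate rounding the intercept comes out about 1.5 lower, which is a reminder that hand-calculated regression coefficients are sensitive to where rounding is applied.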

3.2 Assumptions of the Linear Regression Model

1. The regression model is linear in its parameters b0 and b1, i.e. b0 and b1 are raised to power 1 only and neither b0 nor b1 is multiplied or divided by another regression parameter, e.g. b0 / b1.
   • When the regression model is nonlinear in parameters, regression results are invalid.
   • Even if the dependent variable is nonlinear, linear regression can be used as long as the parameters are linear.
2. Independent variables and residuals are uncorrelated.
3. The expected value of the error term is 0.
   • When assumptions 2 & 3 hold, linear regression produces the correct estimates of b0 and b1.
4. The variance of the error term is the same for all observations. (It is known as the homoskedasticity assumption.)
5. Error values ($\varepsilon$) are statistically independent, i.e. the error for one observation is not correlated with that of any other observation.
6. Error values are normally distributed for any given value of x.

3.3 The Standard Error of Estimate

The Standard Error of Estimate (SEE) measures the degree of variability of the actual y-values relative to the estimated (predicted) y-values from a regression equation. The smaller the SEE, the better the fit.

$$SEE = s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}} = \sqrt{\frac{SSE}{n - k - 1}}$$

where,
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the model (k = 1 here, so the divisor is n - 2)

Example:
n = 100
SSE = 2,252,363

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{2{,}252{,}363}{98}} = 151.60$$

Regression Residual is the difference between the actual value of the dependent variable and the value of the dependent variable predicted by the regression equation.

3.4 The Coefficient of Determination

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by the independent variable. The coefficient of determination is also called R-squared and is denoted as $R^2$.

$$R^2 = \frac{\text{Total variation (SST)} - \text{Unexplained variation (SSE)}}{\text{Total variation (SST)}} = \frac{\text{Explained variation (RSS)}}{\text{Total variation (SST)}}$$

$$0 \le R^2 \le 1$$

In case of a single independent variable, the coefficient of determination is: $R^2 = r^2$
where,
$R^2$ = Coefficient of determination
r = Simple correlation coefficient

Example:
Suppose the correlation coefficient between the returns of two assets is +0.80; then the coefficient of determination will be 0.64. The interpretation of this number is that approximately 64 percent of the variability in the returns of one asset (the dependent variable) can be explained by the returns of the other asset (the independent variable). If the returns on two assets are perfectly correlated (r = +/- 1), the coefficient of determination will be equal to 100%, and this means that if changes in returns of one asset are known, then we can exactly predict the returns of the other asset.

Multiple R is the correlation between the actual values and the predicted values of Y. The coefficient of determination is the square of multiple R.

Total variation is made up of two parts:

$$SST = SSE + RSS: \quad \sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

where,
$\bar{y}$ = Average value of the dependent variable
$\hat{y}$ = Estimated value of y for the given value of x
y = Observed values of the dependent variable

• SST (total sum of squares): Measures total variation
in the dependent variable, i.e. the variation of the $y_i$ values around their mean $\bar{y}$.
• SSE (error sum of squares): Measures unexplained variation in the dependent variable.
• SSR / RSS (regression sum of squares): Measures variation in the dependent variable explained by the independent variable.

Practice: Example 13, Volume 1, Reading 9.

3.5 Hypothesis Testing

In order to determine whether there is a linear relationship between x and y or not, a significance test (i.e. t-test) is used instead of just relying on the b1 value. The t-statistic is used to test the significance of the individual coefficients (e.g. slope) in a regression.

Null and Alternative hypotheses
H0: b1 = 0 (no linear relationship)
H1: $b_1 \neq 0$ (linear relationship does exist)

$$\text{Test statistic: } t = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$$

where,
$\hat{b}_1$ = Sample regression slope coefficient
b1 = Hypothesized slope
$s_{\hat{b}_1}$ = Standard error of the slope
df = n - 2

Decision Rule:
If the test statistic is < -t-critical or > +t-critical with n - 2 degrees of freedom (i.e. if the absolute value of t > tc), reject H0; otherwise do not reject H0.

Two-Sided Test          One-Sided Test
H0: b1 = 0              H0: b1 = 0
HA: $b_1 \neq 0$        HA+: b1 > 0 or HA-: b1 < 0

Confidence Interval Estimate of the Slope: A confidence interval is an interval of values that is expected to include the true parameter value b1 with a given degree of confidence.

$$\hat{b}_1 \pm t_{\alpha/2}\, s_{\hat{b}_1}, \quad df = n - 2$$

Example:
n = 7, $\hat{b}_1 = -9.01$, $s_{\hat{b}_1} = 1.50$, $b_1 = 0$

Testing H0: b1 = 0 v/s HA: $b_1 \neq 0$

$$\text{T.S.: } t_{obs} = \frac{-9.01 - 0}{1.50} = -6.01 \qquad \text{R.R.: } |t_{obs}| \ge t_{.025,5} = 2.571$$

95% Confidence Interval for b1:

$$-9.01 \pm 2.571(1.50) = -9.01 \pm 3.86 = (-12.87 \text{ to } -5.15)$$

As this interval does not include 0, we can reject H0. Therefore, we can say with 95% confidence that the regression slope is different from 0.

Reject H0 because the absolute t-value 6.01 > critical tc 2.571.

A higher level of confidence or lower level of significance results in higher values of critical t, i.e. tc. This implies that:
• Confidence intervals will be larger.
• The probability of rejecting H0 decreases, i.e. the probability of a type II error increases.
• The probability of a type I error decreases.

Stronger regression results lead to smaller standard errors of an estimated parameter and result in a tighter confidence interval. As a result, the probability of rejecting H0 increases (i.e. the probability of a type II error decreases).

p-value: The p-value is the smallest level of significance at which the null hypothesis can be rejected.

Decision Rule: If p < significance level, H0 can be rejected. If p > significance level, H0 cannot be rejected. For example, if the p-value is 0.005 (0.5%) & the significance level is 5%, we can reject the hypothesis that the true parameter equals 0.

Practice: Example 14, 15 & 16, Volume 1, Reading 9.

3.6 Analysis of Variance in a Regression with One Independent Variable

Analysis of Variance (ANOVA) is a statistical method used to divide the total variance in a study into meaningful pieces that correspond to different sources. In regression analysis, ANOVA is used to determine the
usefulness of one or more independent variables in explaining the variation in the dependent variable.

Source of Variability     DoF      Sum of Squares                                     Mean Sum of Squares
Regression (Explained)    k = 1    RSS = $\sum (\hat{y}_i - \bar{y})^2$               MSR = RSS/k
Error (Unexplained)       n - 2    SSE = $\sum (y_i - \hat{y}_i)^2$                   MSE = SSE/(n - 2)
Total                     n - 1    SST = $\sum (y_i - \bar{y})^2$ = RSS + SSE

F-Statistic or F-Test evaluates how well a set of independent variables, as a group, explains the variation in the dependent variable. In multiple regression, the F-statistic is used to test whether at least one independent variable, in a set of independent variables, explains a significant portion of the variation of the dependent variable. The F-statistic is calculated as the ratio of the average regression sum of squares to the average sum of the squared errors:

$$F = \frac{MSR}{MSE} = \frac{RSS/k}{SSE/(n-k-1)}$$

df numerator = k = 1
df denominator = n - k - 1 = n - 2

Decision Rule: Reject H0 if F > F-critical.

Note: The F-test is always a one-tailed test.

In a regression with just one independent variable, the F-statistic is simply the square of the t-statistic, i.e. $F = t^2$. The F-test is most useful for multiple independent variables, while the t-test is used for one independent variable.

When the independent variable in a regression model does not explain any variation in the dependent variable, the predicted value of y is equal to the mean of y. Thus, RSS = 0 and the F-statistic is 0.

Practice: Example 17, Volume 1, Reading 9.

3.7 Prediction Intervals

$$s_f^2 = s^2 \left[ 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n-1)\, s_X^2} \right], \qquad s_f = \sqrt{s_f^2}$$

$$\text{Prediction interval: } \hat{Y} \pm t_c\, s_f$$

where,
$s^2$ = squared SEE
n = number of observations
X = value of the independent variable
$\bar{X}$ = estimated mean of X
$s_X^2$ = variance of the independent variable
tc = critical t-value for n - k - 1 degrees of freedom

Example:
Calculate a 95% prediction interval on the predicted value of Y. Assume the standard error of the forecast is 3.50%, the forecasted value of X is 8%, and n = 36.
Assume: Y = 3% + (0.50)(X)

The predicted value for Y is: Y = 3% + (0.50)(8%) = 7%

The 5% two-tailed critical t-value with 34 degrees of freedom is 2.03. The prediction interval at the 95% confidence level is:

7% +/- (2.03 × 3.50%) = -0.105% to 14.105%

This range can be interpreted as follows: given a forecasted value for X of 8%, we can be 95% confident that the dependent variable Y will be between -0.105% and 14.105%.

Practice: Example 18, Volume 1, Reading 9.

Sources of uncertainty when using the regression model & estimated parameters:
1. Uncertainty in the error term.
2. Uncertainty in the estimated parameters b0 and b1.

3.8 Limitations of Regression Analysis

• Regression relations can change over time. This problem is known as Parameter Instability.
• If the public knows about a relation, this results in no relation in the future, i.e. the relation will break down.
• Regression is based on assumptions. When these assumptions are violated, hypothesis tests and predictions based on linear regression will be invalid.
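The prediction interval from section 3.7 can be sketched as below, using the example inputs from the notes (forecast standard error 3.50%, X = 8%, critical t = 2.03, Y = 3% + 0.50 X). The function names are hypothetical, and the critical t-value is supplied by the caller rather than computed.

```python
import math

def forecast_variance(see, n, x, mean_x, var_x):
    """Estimated variance of the prediction error: s^2 * [1 + 1/n + (x - mean_x)^2 / ((n - 1) * var_x)]."""
    return see ** 2 * (1.0 + 1.0 / n + (x - mean_x) ** 2 / ((n - 1) * var_x))

def prediction_interval(y_hat, s_f, t_critical):
    """Interval y_hat +/- t_critical * s_f, returned as (lower, upper)."""
    half_width = t_critical * s_f
    return y_hat - half_width, y_hat + half_width

# Example from the notes, with percentages written as decimals.
y_hat = 0.03 + 0.50 * 0.08                      # predicted Y = 7%
lower, upper = prediction_interval(y_hat, 0.035, 2.03)
print(round(lower * 100, 3), round(upper * 100, 3))  # -0.105 14.105
```

The `forecast_variance` helper shows why forecasts far from the mean of X are less reliable: the squared distance term grows, so the interval widens even when the SEE is unchanged.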

Practice: End of Chapter Practice Problems for Reading 9 & FinQuiz Item-set ID# 15579, 15544 & 11437.
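The quantities from sections 3.3 to 3.6 (SEE, the coefficient of determination, the ANOVA sums of squares, the F-statistic, and the slope t-test with its confidence interval) can be tied together in one sketch. All function names are hypothetical and the small data set is made up. The slope standard error used here, SEE divided by the square root of the sum of squared deviations of x, is a standard result that these notes do not state, so treat it as an assumption.

```python
import math

def standard_error_of_estimate(sse, n, k=1):
    """SEE = sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))

def slope_t_and_ci(b1_hat, se_b1, t_critical, b1_null=0.0):
    """t-statistic for H0: b1 = b1_null and the interval b1_hat +/- t_critical * se_b1."""
    t_stat = (b1_hat - b1_null) / se_b1
    return t_stat, (b1_hat - t_critical * se_b1, b1_hat + t_critical * se_b1)

def regression_diagnostics(xs, ys):
    """One-variable least-squares fit with ANOVA-style summary statistics."""
    n, k = len(xs), 1
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    rss = sum((f - mean_y) ** 2 for f in fitted)          # explained variation
    sst = sum((y - mean_y) ** 2 for y in ys)              # total variation
    see = standard_error_of_estimate(sse, n, k)
    se_b1 = see / math.sqrt(sxx)  # assumed textbook formula, not given in the notes
    return {
        "b0": b0, "b1": b1, "sse": sse, "rss": rss, "sst": sst,
        "see": see, "r_squared": rss / sst,
        "f_stat": (rss / k) / (sse / (n - k - 1)),
        "t_b1": b1 / se_b1,
    }

# SEE example from the notes: SSE = 2,252,363 with n = 100.
print(round(standard_error_of_estimate(2_252_363, 100), 2))  # 151.6

# Slope example from the notes: estimate -9.01, standard error 1.50, critical t 2.571.
t_obs, (ci_low, ci_high) = slope_t_and_ci(-9.01, 1.50, 2.571)
print(round(t_obs, 2), round(ci_low, 2), round(ci_high, 2))  # -6.01 -12.87 -5.15

# Made-up data to check the identities SST = RSS + SSE and F = t^2.
stats = regression_diagnostics([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
```

On the made-up data, `stats["sst"]` equals `stats["rss"] + stats["sse"]` and `stats["f_stat"]` equals the square of `stats["t_b1"]`, up to floating-point rounding, which mirrors the two identities stated in the notes.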