Introduction
As the magnitude of one variable increases, does the other increase or decrease? If anything happens, the variables are related to each other. In this chapter, we discuss measures of such relationships, namely correlation and regression analysis, between the variables under consideration. Besides measuring the relationship, we also discuss tests of significance and the interpretation of these measures.
Soil pH (x): 2.8 2.9 3.8 4.5 7.1 6.5 3.0 4.7 5.2 4.0 4.8 6.3 7.2
Number of species (y): 17 7 10 22 40 25 5 5 22 7 6 43 19
Practical Statistics/19
Interpreting the correlation coefficient
The correlation coefficient r ranges from 0 (no relationship) to ±1 (a perfect relationship). Cohen's convention is commonly used for interpreting its magnitude.
Sample size, i.e. degrees of freedom, is one of the factors to consider when interpreting r. The closeness of the relationship is not proportional to r; that is, r = 0.88 does not indicate a relationship twice as close as r = 0.44.
Thus, for our data set, we say that there is a highly significant positive relationship between soil pH and the number of species of plants growing in it (r = 0.70, df = 11, p < 0.01), with the two variables having 49% of their variation in common.
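These reported figures can be checked directly from the data above. The sketch below (plain Python, standard library only) computes the correlation coefficient and the t statistic used for its significance test:

```python
import math

# Soil pH (x) and number of plant species (y), from the data above
x = [2.8, 2.9, 3.8, 4.5, 7.1, 6.5, 3.0, 4.7, 5.2, 4.0, 4.8, 6.3, 7.2]
y = [17, 7, 10, 22, 40, 25, 5, 5, 22, 7, 6, 43, 19]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson's r = S_xy / sqrt(S_xx * S_yy)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}, t = {t:.2f}, df = {n - 2}")
```

The computed r of about 0.70 squares to roughly 0.49, matching the 49% of variation in common, and the t statistic exceeds the 1% critical value for 11 degrees of freedom.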
Properties of r
The value of r always lies between +1 and -1. It does not change if all values of either variable are converted to a different scale; for example, if the units of X are changed from feet to meters, the value of r does not change. The value of r is not affected by the choice of X or Y, i.e. rxy = ryx. The coefficient of correlation estimates the degree of closeness of the linear relationship between two variables, but it does not answer questions like: How much does Y change for a given change in X? What is the shape of the curve connecting Y and X? How accurately can Y be predicted from knowledge of X? All such questions are handled with another statistical tool, regression, discussed in the forthcoming section.
Regression Analysis
Correlation analysis only measures the relationship between variables; it does not study the cause and effect of that relationship. When we have good reason to suspect that a linear and causal relationship exists between two variables, and/or we wish to predict the values of one variable from another, regression analysis can be used to find an equation describing the relationship.
Example
Following are sample measurements of heights of soybean plants in a field.
Following is the SPSS output fitting the regression line for these data.
From the table, we can see the coefficients a = -0.571 [the height of a soybean plant at 0 weeks of age] and b = 6.143 [the change in height of the plant per unit change in age]. The fitted line is thus,

ŷ = -0.571 + 6.143x, for 1 ≤ x ≤ 7
For x = 7,
ŷ = -0.571 + 6.143 × 7 = 42.43
= the estimated height of soybean plants at 7 weeks of age in the population.
Keep in mind that you should not predict the value of y for x = 8, 9, etc. Think about where the sample data came from: the sample should represent the population, and no plant heights beyond 7 weeks of age were included in the sample.
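This no-extrapolation rule can be made explicit in code. The helper below is a hypothetical sketch (not part of the SPSS output), using the fitted coefficients from above:

```python
# Fitted line: y-hat = -0.571 + 6.143 * x, valid only for the sampled ages 1..7 weeks
def predict_height(age_weeks, lo=1, hi=7):
    """Estimated soybean height; refuses to extrapolate outside the sampled range."""
    if not (lo <= age_weeks <= hi):
        raise ValueError("no extrapolation: age outside the sampled range of 1-7 weeks")
    return -0.571 + 6.143 * age_weeks

print(round(predict_height(7), 2))  # estimate at 7 weeks
```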
Using computer software, we can also obtain the estimated values and the residuals of the dependent variable for all required values of the independent variable. Estimates are the values the model predicts for the observed values of the independent variable, and a residual is the difference between an observed value and the value predicted by the model. For presentation, the fitted line can be plotted along with the observed values, giving the estimated equation as well. The following points are to be kept in mind while making the presentation: do not extrapolate the line outside the range of x, and note that the fitted line passes through (x̄, ȳ).
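These points can be illustrated with a short sketch. The data below are made up for illustration (the soybean measurements themselves appear only in the SPSS output); the sketch fits a least-squares line, computes fitted values and residuals, and verifies that the line passes through (x̄, ȳ):

```python
# Illustrative data: x-values and observed y-values
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [5.0, 12.3, 18.1, 24.6, 29.8, 36.4, 42.5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx  # this choice forces the line through (x-bar, y-bar)

fitted = [a + b * x for x in xs]                 # model estimates
residuals = [y - f for y, f in zip(ys, fitted)]  # observed minus fitted

assert abs((a + b * mx) - my) < 1e-9  # line passes through (x-bar, y-bar)
assert abs(sum(residuals)) < 1e-9     # least-squares residuals sum to zero
print(f"a = {a:.3f}, b = {b:.3f}")
```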
It should be kept in mind that the two lines of regression are different. If the investigator has deliberately selected the sample of values of one of the variates, say X, then only the regression of Y on X has meaning and stability; such are the situations in which a cause-and-effect relationship is involved. But if the sample of pairs (X, Y) is random, the investigator may validly interpret both regressions, whichever is relevant for the purpose. There are also cases in which the sample regression line is constructed by measuring Y at a series of selected values of X but is used to predict X from Y. This method is employed in problems in which the amount of X in a specimen is difficult or expensive to measure directly but is linearly related to a variable Y that is more easily measured. A suitable example is the method of estimating the amount of the active ingredient in drugs or insecticides from their effects on animals or insects. This approach of estimating X from a regression line of Y on X is known as linear calibration.
In the absence of a significant linear correlation, we should not use the regression equation for projecting or predicting. Instead, our best estimate of the second variable is simply the sample mean of that variable.
Multiple Regression
If two or more explanatory variables are fitted in a regression model, it is called multiple regression modelling. The general objective of multiple regression is to discover how the X-variables are related to Y, the dependent variable. The X-variables are ranked in order of their influence on Y, and an equation is derived which provides estimates of Y from the values of the independent variables.
In multiple regression, we assume that the relationship between Y and the X-variables is linear. Conceptually, it is an extension of the simple regression technique. Interpretation of results in multiple regression is not an easy task, since a larger number of independent variables affect the dependent variable. But much can be learnt about both the computation and the interpretation by studying regression situations involving Y and two X-variables, say X1 and X2.
If both independent variables have some effect on Y, each pair (X1, X2) is represented as a point on a plane and the corresponding Y value on a vertical axis perpendicular to that plane, so that together they form a three-dimensional regression surface. For a diagrammatic representation of the regression surface, see Steel and Torrie (1980, p. 313).
The multiple linear regression model with two independent variables X1 and X2 can be written as

Y = α + β1X1 + β2X2 + e

where α = the intercept of the regression plane, i.e. the value of Y when X1 = 0 and X2 = 0; β1 = the partial regression coefficient, i.e. the amount of change in Y per unit change in X1 when X2 is held constant; and β2 = the partial regression coefficient, i.e. the amount of change in Y per unit change in X2 when X1 is held constant. For a set of data (X, Y), the fitted model is written as

Ŷ = α̂ + β̂1X1 + β̂2X2

where Ŷ, α̂, β̂1 and β̂2 are the estimates of Y, α, β1 and β2 respectively. The estimates are found by the method of least squares. They are unbiased estimators and have the smallest standard errors.
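A minimal sketch of that least-squares computation, solving the normal equations (X'X)β = X'Y directly. The data are made up and generated without error from Y = 1 + 2X1 + 1.5X2, so the estimates recover those coefficients exactly:

```python
# Illustrative data: y = 1 + 2*x1 + 1.5*x2, with no error term
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [6.0, 6.5, 13.0, 13.5, 20.0, 20.5]

X = [[1.0, u, v] for u, v in zip(x1, x2)]  # design matrix with an intercept column
m = len(X)

# Normal equations: (X'X) beta = X'Y
xtx = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(3)] for i in range(3)]
xty = [sum(X[r][i] * y[r] for r in range(m)) for i in range(3)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting
for i in range(3):
    p = max(range(i, 3), key=lambda k: abs(xtx[k][i]))
    xtx[i], xtx[p] = xtx[p], xtx[i]
    xty[i], xty[p] = xty[p], xty[i]
    for k in range(i + 1, 3):
        f = xtx[k][i] / xtx[i][i]
        xtx[k] = [xtx[k][j] - f * xtx[i][j] for j in range(3)]
        xty[k] -= f * xty[i]

beta = [0.0, 0.0, 0.0]
for i in range(2, -1, -1):
    beta[i] = (xty[i] - sum(xtx[i][j] * beta[j] for j in range(i + 1, 3))) / xtx[i][i]

a_hat, b1_hat, b2_hat = beta
print(f"alpha = {a_hat:.3f}, beta1 = {b1_hat:.3f}, beta2 = {b2_hat:.3f}")
```

In practice the same estimates come from any statistical package (SPSS in this chapter); the sketch only makes the least-squares machinery visible.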
Example
Following are sample data on the concentration, in parts per million (ppm), of inorganic phosphorous (X1) and organic phosphorous (X2), and the phosphorous content (Y) of corn grown on 17 soils.
This is one of the simplest outputs of multiple regression. It tells which method was used (the Enter method in our case) and how many independent variables, and which ones (two: Organic Phosphorous and Inorganic Phosphorous), were used to fit the multiple regression model. At the bottom of the table, the output tells what the dependent variable (Phosphorous Content) in the analysis is.
The intercept, i.e. the constant term 66.5, is the amount of Phosphorous Content when neither Inorganic Phosphorous nor Organic Phosphorous is applied, i.e. when the value of both variables is 0.
Keep in mind that the intercept term is not always interpretable. For instance, if in practice the independent variables never take the value zero, then the intercept is simply kept in the model, or sometimes even removed from it.
The last column (Sig.) of this output gives the p-values for the tests of significance of the regression coefficients (H0: the coefficient equals zero). α is significant (p = .000 < .05), β1 is significant (p = .002 < .05), and β2 is not significant (p = .662 > .05).
Interpretation of the regression coefficients
β1 is the partial slope of the linear relationship between the independent variable X1 (Inorganic Phosphorous) and the dependent variable Y (Phosphorous Content), keeping X2 constant. It came out to be significant, meaning that a 1 ppm change in Inorganic Phosphorous causes a 1.290 ppm change in Phosphorous Content.
For β2 (= -0.111): a 1 ppm change in Organic Phosphorous would result in a 0.111 ppm reduction in Phosphorous Content, i.e. in the dependent variable. However, since this coefficient is not significant, we do not interpret this result; the variable has no influence in the model, and in practice we would remove it from the model.
Bivariate correlation
Correlations: Phosphorous Content, Inorganic Phosphorous, Organic Phosphorous
This output (above) clearly shows that the correlation between Inorganic Phosphorous and Phosphorous Content is significant (r = 0.720, p = .001 < .05). However, the correlation between Organic Phosphorous and Phosphorous Content is not significant (r = 0.212, p = .414 > .05). Also, the correlation between the two independent variables (i.e. Inorganic Phosphorous and Organic Phosphorous) is not significant (r = 0.399, p = .113 > .05). If the correlation between the independent variables had been significant, the variables would be called collinear, and a multicollinearity situation would be present in the model, to be dealt with carefully so that the model is not biased.
Finally, because the independent variable Organic Phosphorous is not significant in the model, we need not keep it, and it is better to seek other variable(s) that may have a significant influence on the dependent variable. This is the practical reason why we test the significance of the regression coefficients.
In the model summary output, R is the multiple correlation coefficient. The R-value is simply a measure of the combined effect of all independent variables on the dependent variable. So, the multiple correlation coefficient R is the correlation between the observed Y's and the Y's estimated from the regression equation. For the above example, there is a high degree of positive correlation between the observed phosphorous content of corn and the fitted phosphorous content.
R-square (0.525) is the coefficient of determination, meaning that 52.5% of the variation in Phosphorous Content, the dependent variable, is due to the combined effect of the independent variables. If we want to explain more of the variation in the dependent variable, we may add more independent variables and/or investigate more influential variable(s) as inputs to the model.
Note
The multiple coefficient of determination R2 is a measure of how well the regression equation fits the sample data, but it has a serious flaw: as more variables are included, R2 increases [actually, R2 could remain the same, but it usually increases]. Although the largest R2 is thus achieved by simply including all the available variables, the best multiple regression equation does not necessarily use all of them.
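One standard response to this flaw is the adjusted R2, which penalises each additional predictor. A sketch using the figures from the corn example (R2 = 0.525, n = 17 soils, k = 2 predictors):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.525, 17, 2), 3))  # noticeably below the raw R-squared of 0.525
```

Unlike R2, the adjusted value can fall when a weak predictor is added, which makes it a fairer yardstick for comparing models with different numbers of variables.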
The ANOVA output of a multiple regression analysis is useful to check whether the overall model is significant, i.e. it tells whether the combined effect of the independent variables on the dependent variable is significant. In this case the model is significant (p = .005 < .05). In general, only after the overall model is found significant are the individual independent variables checked for their individual significance on the dependent variable.
In this output, the 95% confidence interval statistics are quite apparent. For instance, we would report: the 95% confidence interval for the Inorganic Phosphorous regression coefficient is (.555, 2.025), and likewise for the others. However, before interpreting the figures under the VIF column (the last column) and the tolerance column (the second last), we will review the following literature.
Popular among the common methods for identifying and curing multicollinearity are examination of the correlation matrix and the variance inflation factor (VIF). Statisticians often regard multicollinearity in a data set as severe if at least one simple correlation coefficient between the independent variables is at least 0.9 (Bowerman et al., 2005).
Other things being equal, lower levels of VIF are desired, since higher levels of VIF are known to adversely affect the results associated with a multiple regression analysis. Regarding the VIF as a tool for detecting multicollinearity, the literature varies. Draper and Smith (1998) state that all guidelines for how large a VIF must be to signal a multicollinearity problem in a regression model are essentially arbitrary, and each person must decide for him- or herself. A commonly given rule of thumb is that VIFs of 10 or higher (or, equivalently, tolerances of .10 or less) may be reason for concern (Multicollinearity, n.d.). As a caution while dealing with the VIF, consider the following:
Unfortunately, several rules of thumb – most commonly the rule of 10 – associated with
VIF are regarded by many practitioners as a sign of severe or serious multicollinearity
(this rule appears in both scholarly articles and advanced statistical textbooks). When
VIF reaches these threshold values researchers often attempt to reduce the collinearity
by eliminating one or more variables from their analysis; using Ridge Regression to
analyse their data; or combining two or more independent variables into a single index.
These techniques for curing problems associated with multi-collinearity can create
problems more serious than those they solve. (O'Brien, 2007)
However widely used, this rule of thumb should therefore be applied with caution, since its validity depends on the context in which the model was formed, how the data were collected, and what the model's objective actually was. Moreover, this is only one issue in identifying and dealing with multicollinearity in multiple regression analysis; it simply signals how deep we need to go and how challenging it is to fit a realistic multiple regression equation.
This helps us interpret the VIF output. In our case above, the VIF (1.189) is quite small and well within the rule-of-thumb range, i.e. < 10, indicating straightforwardly that there is no multicollinearity issue in this model fitting, as we had already found by analysing the correlation matrix.
Tolerance is the reciprocal of the VIF; it should be greater than .10 (values of .10 or less signal a problem) for there to be no multicollinearity issue, and it serves the same purpose as the VIF.
Which explanatory variable best fits the model?
We recall the above output, of which we have already considered everything other than the Standardized Coefficients (Beta). These are called beta weights. In general, the coefficients of the independent variables are not directly comparable, as they represent different series of data with varying measurement units. However, for an analyst such a comparison very often becomes necessary. The beta weights are designed and computed for this purpose: they are unit-free measures, and hence comparable. To obtain them, we fit the multiple regression equation as before but with all variables standardized, which leaves no constant term. Here it is,
Now, the explanatory variable with the greater beta weight is more influential on the dependent variable, i.e. fits the model best. Clearly, in this case Inorganic Phosphorous fits the model better than Organic Phosphorous, i.e. it has a much stronger direct effect on the dependent variable Y. This finding was also anticipated when we tested the significance of the regression coefficients: the coefficient for Organic Phosphorous was not significant, and it was therefore recommended to remove it from the model and search for other, more influential explanatory variable(s).
Note: The standardized partial slopes are called 'beta weights'. A beta weight gives the change, in standard scores, of the dependent variable per standard-score change in an independent variable while controlling for the effects of the other independent variables.
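Equivalently, a beta weight can be computed from an unstandardized slope as beta = b * (s_x / s_y). The standard deviations below are hypothetical, purely to illustrate the conversion:

```python
def beta_weight(b, s_x, s_y):
    """Standardized coefficient (beta weight) from an unstandardized slope b."""
    return b * s_x / s_y

b1 = 1.290              # unstandardized slope of Inorganic Phosphorous (from the output)
s_x1, s_y = 10.0, 18.0  # hypothetical standard deviations, for illustration only
print(round(beta_weight(b1, s_x1, s_y), 3))
```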
Gain in predictive ability of the model
To check how much gain in predictive ability there has been due to the combination of the predictors (independent variables) in the model, the part and partial correlations output is helpful.
The zero order correlations are simple correlations between the independent variables
(Inorganic Phosphorous and Organic Phosphorous) and the dependent variable
(Phosphorous Content).
A partial correlation is computed holding the other predictors constant, whereas a part correlation excludes the effect of the other predictors from the given predictor only. The squared part correlations are therefore the unique contributions of the predictors, and they are helpful for judging whether using multiple regression was beneficial. When these squared part correlations (the last column in the table above) are added up (48.0249 + 0.6724 = 48.6973%), approximately 49% of the variance in the response variable (Phosphorous Content) is accounted for uniquely by the two independent variables. This percentage differs from the R-squared value (52%) of the model, meaning that (52 - 49) = 3% of the predictive work done by the predictors was overlapping. This shows that the combination of the variables is not particularly weak with respect to the gain in predictive ability of the model. Again, we recommend removing the weak predictor (Organic Phosphorous), which has less than 1% of unique contribution to explaining the dependent variable in the model.
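The arithmetic can be sketched as follows, backing the part correlations out of the squared percentages quoted above:

```python
# Part (semipartial) correlations implied by the squared percentages in the output
part = {"Inorganic Phosphorous": 0.693, "Organic Phosphorous": 0.082}

unique = {name: p ** 2 for name, p in part.items()}  # unique variance shares
total_unique = sum(unique.values())                  # about 0.487, i.e. ~49%
overlap = 0.525 - total_unique                       # R-squared minus unique shares, ~3%

print(f"unique total = {total_unique:.4f}, overlap = {overlap:.4f}")
```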
Variables which take only the values 0 and 1 are called dummy variables; alternative names are indicator variables, qualitative variables, etc. Dummy variables allow qualitative variables to be used in regression models.
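A minimal sketch of dummy coding (the variable and category names are made up for illustration):

```python
# A qualitative variable with two categories, coded 0/1 so it can enter a regression
treatments = ["control", "fertilised", "control", "fertilised", "fertilised"]
dummy = [1 if t == "fertilised" else 0 for t in treatments]
print(dummy)
```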
Logistic Regression
The logistic regression model can be written as

P(Y) = 1 / (1 + e^(-Z))

where e is the base of the natural logarithm and

Z = a + b1x1 + b2x2 + … + bnxn.

This is equivalent to multiple linear regression, except that it expresses the equation in terms of the probability that a case belongs in a certain category.
As such, the resulting value from the equation is a probability that varies between 0 and 1. A value close to 0 means that Y is very unlikely to happen, and a value close to 1 means that Y is very likely to occur.
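The transformation can be sketched directly. The intercept and slopes below are hypothetical, not fitted values:

```python
import math

def logistic_p(z):
    """P(Y) = 1 / (1 + e^(-Z)); always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

a, b1, b2 = -1.0, 0.8, 0.5      # hypothetical coefficients
x1, x2 = 2.0, 1.0               # hypothetical predictor values
z = a + b1 * x1 + b2 * x2
print(round(logistic_p(z), 3))  # a value near 1 means Y is likely to occur
```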
In logistic regression the response variable is dichotomous, so the assumption of linearity is violated. But a transformation of the variables is made such that logistic regression becomes a tool for expressing a non-linear relationship as a linear one. A logistic regression equation with two or more predictor variables expresses the multiple linear regression equation in logarithmic terms and thus overcomes the problem of violating the linearity assumption.
Example
What lifestyle characteristics are risk factors for coronary heart disease (CHD)? Given a
sample of patients measured on smoking status, diet, exercise, alcohol use, and CHD
status, you could build a model using the four lifestyle variables to predict the presence
or absence of CHD in a sample of patients. The model can then be used to derive
estimates of the odds ratios for each factor to tell you, for example, how much more
likely smokers are to develop CHD than non-smokers.
Another example is to look at which variables predict whether a person is male or female. We might measure laziness, pigheadedness, alcohol consumption, and the number of burps a person does in a day. Using logistic regression, we might find that all these variables predict the gender of the person, and the technique will also allow us to predict whether a certain person is likely to be male or female.
Assumptions in logistic regression
Logistic regression does not rely on distributional assumptions in the same sense that
discriminant analysis does. However, your solution may be more stable if your
predictors have a multivariate normal distribution. Additionally, as with other forms of
regression, multicollinearity among the predictors can lead to biased estimates and
inflated standard errors. The procedure is most effective when group membership is a
truly categorical variable; if group membership is based on values of a continuous
variable (for example, "high IQ" versus "low IQ"), you should consider using linear
regression to take advantage of the richer information offered by the continuous
variable itself.
Besides being tools for analysing complex data, multivariate techniques also help in various types of decision-making. For example, take the case of a college entrance examination wherein many tests are administered to candidates, and the candidates scoring the highest total marks over many subjects are admitted. This system, though apparently fair, may at times be biased in favour of subjects with larger standard deviations. Multivariate analysis may be appropriately used in such situations for developing norms as to who should be admitted to the college.
Based on the dependency relationship, multivariate techniques can be categorized into two types: one for data containing both dependent and independent variables, and the other for data containing several variables without a dependency relationship. In the former category are techniques like multiple regression analysis, multiple discriminant analysis, multivariate analysis of variance and canonical analysis, whereas in the latter category we put techniques like factor analysis, cluster analysis, multidimensional scaling or MDS (both metric and non-metric) and latent structure analysis.
Among the many such multivariate techniques in normal use, so far we have dealt with multiple regression, and in the following we introduce multivariate analysis of variance (MANOVA).
Multivariate analysis of variance
In circumstances in which there are several dependent variables, the simple principles of analysis of variance extend to multivariate analysis of variance. This technique is like the univariate ANOVA, with the added ability to handle several dependent variables. If ANOVA is applied consecutively to a set of interrelated dependent variables, erroneous conclusions may result; MANOVA corrects this by simultaneously testing all the variables and their interrelationships.
Exercise
1. The following data give the yield of maize grain and green fodder, in lbs per plot, for different doses of nitrogen.
Investigate the coefficient of correlation between (1) the amount of nitrogen and the yield of maize grain, and (2) grain yield and fodder yield. Hence, test its significance in both cases.
2. The table below gives the total grain production and the cereal production, in lakh tonnes (rounded figures), for nine years.
Year 1 2 3 4 5 6 7 8 9
Total grain Production (y) 400 440 480 550 620 650 660 740 760
Cereal Production (x) 50 60 70 85 95 100 105 115 120
Fit an estimated regression line of total production on cereal production, and hence estimate the total production for a cereal production of 90 lakh tonnes. (Ans. ŷ = 122.18 + 5.25x; 595 lakh tonnes)
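The printed answer can be checked with a short least-squares computation, taking the year-3 total grain production as 480 lakh tonnes (the printed answer is consistent only with that reading):

```python
x = [50, 60, 70, 85, 95, 100, 105, 115, 120]       # cereal production
y = [400, 440, 480, 550, 620, 650, 660, 740, 760]  # total grain production

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

est_90 = a + b * 90  # estimate for a cereal production of 90 lakh tonnes
print(f"a = {a:.2f}, b = {b:.2f}, estimate = {est_90:.0f}")
```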
3. The following data give the yield of maize grain and green fodder, in lbs per plot, for different doses of nitrogen.
Investigate the coefficients of correlation rxy, ryz, and rxz, and hence test the significance of the correlation coefficient in each case.
4. Investigate the sample correlation coefficient for the data given below.
5. In studying the use of ovulated follicles for determining the number of eggs laid by ring-necked pheasants, the following data on 14 captive hens were recorded.
Compute the correlation coefficient and the coefficient of determination. Test whether there is a significant linear relationship between eggs laid and ovulated follicles at the 5% level of significance.
6. Fit regression lines of grain (y) on biomass (x), and of biomass (y) on tiller number (x). Which of these lines gives the better fit? What can you conclude about the relationship among the three traits of rice?
References
Bewick, V., Cheek, L., & Ball, J. (2003). Statistics review: Correlation and regression. Critical Care, 7(6), 451-459. Retrieved from http://ccforum.com/content/7/6/451 (accessed 17 July 2019). doi: 10.1186/cc2401
Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). John Wiley & Sons, Inc., USA.
Logistic regression. (n.d.). IBM Knowledge Center: SPSS 23. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/regression/idh_lreg.html (accessed 17 July 2019).
O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality and Quantity, 41, 673-690. doi: 10.1007/s11135-006-9018-6
Snedecor, G., & Cochran, W. (1994). Statistical methods. Affiliated East-West Press, 26 Barakhamba Road, New Delhi 110001.
Steel, R., & Torrie, J. (1980). Principles and procedures of statistics. McGraw-Hill Book Company, New York.