Introduction
As the magnitude of one variable increases, does the other increase or decrease? If anything happens, the variables are related to each other. In this chapter, we discuss measures of such relationships, namely correlation and regression analysis, between the variables under consideration. Besides measuring the relationship, we also discuss tests of significance and the interpretation of these measures.
Soil pH (x): 2.8 2.9 3.8 4.5 7.1 6.5 3.0 4.7 5.2 4.0 4.8 6.3 7.2
Number of species (y): 17 7 10 22 40 25 5 5 22 7 6 43 19
Practical Statistics/19
Interpreting the correlation coefficient
The correlation coefficient r ranges from 0 (no relationship) to ±1 (a perfect relationship). Cohen's convention is commonly used for interpreting its magnitude.
Sample size, i.e. degrees of freedom, is one of the factors to consider when interpreting r. The closeness of the relationship is not proportional to r; that is, r = 0.88 does not indicate a relationship twice as close as r = 0.44.
Thus, for our data set, we say that there is a highly significant positive relationship between soil pH and the number of species of plants growing in it (r = 0.70, df = 11, p < 0.01), with the two variables having 49% of their variation in common.
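These reported figures can be checked directly from the data above. The sketch below (plain Python, standard library only) computes the correlation coefficient and the t statistic used for its significance test:

```python
import math

# Soil pH (x) and number of plant species (y), from the data above
x = [2.8, 2.9, 3.8, 4.5, 7.1, 6.5, 3.0, 4.7, 5.2, 4.0, 4.8, 6.3, 7.2]
y = [17, 7, 10, 22, 40, 25, 5, 5, 22, 7, 6, 43, 19]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson's r = S_xy / sqrt(S_xx * S_yy)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}, t = {t:.2f}, df = {n - 2}")
```

The computed r of about 0.70 squares to roughly 0.49, matching the 49% of variation in common, and the t statistic exceeds the 1% critical value for 11 degrees of freedom.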
Properties of r
The value of r always lies between +1 and -1. It does not change if all values of either variable are converted to a different scale; for example, if the units of X are changed from feet to meters, the value of r does not change. The value of r is not affected by the choice of X or Y, i.e. rxy = ryx. The coefficient of correlation estimates the degree of closeness of the linear relationship between two variables, but it does not answer questions like: How much does Y change for a given change in X? What is the shape of the curve connecting Y and X? How accurately can Y be predicted from knowledge of X? All such questions are handled with another statistical tool, regression, discussed in the forthcoming section.
Regression Analysis
Correlation analysis only measures the relationship between variables; it does not study the cause and effect of that relationship. When we have good reason to suspect that a linear and causal relationship exists between two variables, and/or we wish to predict the values of one variable from another, regression analysis can be used to find an equation describing the relationship.
Example
Following are sample measurements of heights of soybean plants in a field.
Following is the SPSS output fitting the regression line for these data.
From the table, we can see the coefficients a = -0.571 [the height of a soybean plant at 0 weeks of age] and b = 6.143 [the change in height of the plant per unit change in age]. The fitted line is thus,

ŷ = -0.571 + 6.143x, for 1 ≤ x ≤ 7
For x = 7,
ŷ = -0.571 + 6.143 × 7 = 42.43
= the estimated height of soybean plants at 7 weeks of age in the population.
Keep in mind that you should not predict the value of y for x = 8, 9, etc. Think about where the sample data came from: the sample should represent the population, and no plant heights beyond 7 weeks of age were included in the sample.
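This no-extrapolation rule can be made explicit in code. The helper below is a hypothetical sketch (not part of the SPSS output), using the fitted coefficients from above:

```python
# Fitted line: y-hat = -0.571 + 6.143 * x, valid only for the sampled ages 1..7 weeks
def predict_height(age_weeks, lo=1, hi=7):
    """Estimated soybean height; refuses to extrapolate outside the sampled range."""
    if not (lo <= age_weeks <= hi):
        raise ValueError("no extrapolation: age outside the sampled range of 1-7 weeks")
    return -0.571 + 6.143 * age_weeks

print(round(predict_height(7), 2))  # estimate at 7 weeks
```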
Using computer software, we can also obtain the estimated values and the residuals of the dependent variable for all required values of the independent variable. Estimates are the values the model predicts for the observed values of the independent variable, and a residual is the difference between an observed value and the value predicted by the model. For presentation, the fitted line can be plotted along with the observed values, giving the estimated equation as well. The following points are to be kept in mind while making the presentation: do not extrapolate the line outside the range of x, and note that the fitted line passes through (x̄, ȳ).
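These points can be illustrated with a short sketch. The data below are made up for illustration (the soybean measurements themselves appear only in the SPSS output); the sketch fits a least-squares line, computes fitted values and residuals, and verifies that the line passes through (x̄, ȳ):

```python
# Illustrative data: x-values and observed y-values
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [5.0, 12.3, 18.1, 24.6, 29.8, 36.4, 42.5]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx  # this choice forces the line through (x-bar, y-bar)

fitted = [a + b * x for x in xs]                 # model estimates
residuals = [y - f for y, f in zip(ys, fitted)]  # observed minus fitted

assert abs((a + b * mx) - my) < 1e-9  # line passes through (x-bar, y-bar)
assert abs(sum(residuals)) < 1e-9     # least-squares residuals sum to zero
print(f"a = {a:.3f}, b = {b:.3f}")
```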
It should be kept in mind that the two lines of regression are different. If the investigator has deliberately selected the sample of values of one of the variates, say X, then only the regression of Y on X has meaning and stability; such are the situations in which a cause-and-effect relationship is involved. But if the sample of pairs (X, Y) is random, the investigator may validly interpret both regressions, whichever is relevant for the purpose. There are also cases in which the sample regression line is constructed by measuring Y at a series of selected values of X but is used to predict X from Y. This method is employed in problems in which the amount of X in a specimen is difficult or expensive to measure directly but is linearly related to a variable Y that is more easily measured. A suitable example is the method of estimating the amount of the active ingredient in drugs or insecticides from their effects on animals or insects. This approach of estimating X from a regression line of Y on X is known as linear calibration.
In the absence of a significant linear correlation, we should not use the regression equation for projecting or predicting. Instead, our best estimate of the second variable is simply the sample mean of that variable.
Multiple Regression
If two or more explanatory variables are fitted in a regression model, it is called multiple regression modelling. The general objective of multiple regression is to discover how the X-variables are related to Y, the dependent variable. The X-variables are ranked in order of their influence on Y, and an equation is derived which provides estimates of Y from the values of the independent variables.
In multiple regression, we assume that the relationship between Y and the X-variables is linear. Conceptually, it is an extension of the simple regression technique. Interpretation of results in multiple regression is not an easy task, since a larger number of independent variables affect the dependent variable. But much can be learnt about both the computation and the interpretation by studying regression situations involving Y and two X-variables, say X1 and X2.
If both independent variables have some effect on Y, each pair (X1, X2) is represented as a point on a plane and the corresponding Y value on a vertical axis perpendicular to that plane, so that together they form a three-dimensional regression surface. For a diagrammatic representation of the regression surface, see Steel and Torrie (1980, p. 313).
The multiple linear regression model with two independent variables X1 and X2 can be written as

Y = α + β1X1 + β2X2 + e

where α = the intercept of the regression plane, i.e. the value of Y when X1 = 0 and X2 = 0; β1 = the partial regression coefficient, i.e. the amount of change in Y per unit change in X1 when X2 is held constant; and β2 = the partial regression coefficient, i.e. the amount of change in Y per unit change in X2 when X1 is held constant. For a set of data (X, Y), the fitted model is written as

Ŷ = α̂ + β̂1X1 + β̂2X2

where Ŷ, α̂, β̂1 and β̂2 are the estimates of Y, α, β1 and β2 respectively. The estimates are found by the method of least squares. They are unbiased estimators and have the smallest standard errors.
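A minimal sketch of that least-squares computation, solving the normal equations (X'X)β = X'Y directly. The data are made up and generated without error from Y = 1 + 2X1 + 1.5X2, so the estimates recover those coefficients exactly:

```python
# Illustrative data: y = 1 + 2*x1 + 1.5*x2, with no error term
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [6.0, 6.5, 13.0, 13.5, 20.0, 20.5]

X = [[1.0, u, v] for u, v in zip(x1, x2)]  # design matrix with an intercept column
m = len(X)

# Normal equations: (X'X) beta = X'Y
xtx = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(3)] for i in range(3)]
xty = [sum(X[r][i] * y[r] for r in range(m)) for i in range(3)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting
for i in range(3):
    p = max(range(i, 3), key=lambda k: abs(xtx[k][i]))
    xtx[i], xtx[p] = xtx[p], xtx[i]
    xty[i], xty[p] = xty[p], xty[i]
    for k in range(i + 1, 3):
        f = xtx[k][i] / xtx[i][i]
        xtx[k] = [xtx[k][j] - f * xtx[i][j] for j in range(3)]
        xty[k] -= f * xty[i]

beta = [0.0, 0.0, 0.0]
for i in range(2, -1, -1):
    beta[i] = (xty[i] - sum(xtx[i][j] * beta[j] for j in range(i + 1, 3))) / xtx[i][i]

a_hat, b1_hat, b2_hat = beta
print(f"alpha = {a_hat:.3f}, beta1 = {b1_hat:.3f}, beta2 = {b2_hat:.3f}")
```

In practice the same estimates come from any statistical package (SPSS in this chapter); the sketch only makes the least-squares machinery visible.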
Example
Following are sample data on the concentration, in parts per million (ppm), of inorganic phosphorous (X1) and organic phosphorous (X2), and the phosphorous content (Y) of corn grown on 17 soils.
This is one of the simplest outputs of multiple regression. It tells which method was used (the Enter method in our case) and how many independent variables, and which ones (two: Organic Phosphorous and Inorganic Phosphorous), were used to fit the multiple regression model. At the bottom of the table, the output tells what the dependent variable (Phosphorous Content) in the analysis is.
The intercept, i.e. the constant term 66.5, is the amount of Phosphorous Content when neither Inorganic Phosphorous nor Organic Phosphorous is applied, i.e. when the value of both variables is 0.
Keep in mind that the intercept term is not always interpretable. For instance, if in practice the independent variables never take the value zero, then the intercept is simply kept in the model, or sometimes even removed from it.
The last column (Sig.) of this output gives the p-values for the tests of significance of the regression coefficients (H0: the coefficient equals zero). α is significant (p = .000 < .05), β1 is significant (p = .002 < .05), and β2 is not significant (p = .662 > .05).
Interpretation of the regression coefficients
β1 is the partial slope of the linear relationship between the independent variable X1 (Inorganic Phosphorous) and the dependent variable Y (Phosphorous Content), keeping X2 constant. It came out to be significant, meaning that a 1 ppm change in Inorganic Phosphorous causes a 1.290 ppm change in Phosphorous Content.
For β2 (= -0.111): a 1 ppm change in Organic Phosphorous would result in a 0.111 ppm reduction in Phosphorous Content, i.e. in the dependent variable. However, since this coefficient is not significant, we do not interpret this result; the variable has no influence in the model, and in practice we would remove it from the model.
Bivariate correlation
Correlations: Phosphorous Content, Inorganic Phosphorous, Organic Phosphorous
This output (above) clearly shows that the correlation between Inorganic Phosphorous and Phosphorous Content is significant (r = 0.720, p = .001 < .05). However, the correlation between Organic Phosphorous and Phosphorous Content is not significant (r = 0.212, p = .414 > .05). Also, the correlation between the two independent variables (i.e. Inorganic Phosphorous and Organic Phosphorous) is not significant (r = 0.399, p = .113 > .05). If the correlation between the independent variables had been significant, the variables would be called collinear, and a multicollinearity situation would be present in the model, to be dealt with carefully so that the model is not biased.
Finally, because the independent variable Organic Phosphorous is not significant in the model, we need not keep it, and it is better to seek other variable(s) that may have a significant influence on the dependent variable. This is the practical reason why we test the significance of the regression coefficients.
In the model summary output, R is the multiple correlation coefficient. The R-value is simply a measure of the combined effect of all independent variables on the dependent variable. So, the multiple correlation coefficient R is the correlation between the observed Y's and the Y's estimated from the regression equation. For the above example, there is a high degree of positive correlation between the observed phosphorous content of corn and the fitted phosphorous content.
R-square (0.525) is the coefficient of determination, meaning that 52.5% of the variation in Phosphorous Content, the dependent variable, is due to the combined effect of the independent variables. If we want to explain more of the variation in the dependent variable, we may add more independent variables and/or investigate more influential variable(s) as inputs to the model.
Note
The multiple coefficient of determination R2 is a measure of how well the regression equation fits the sample data, but it has a serious flaw: as more variables are included, R2 increases [actually, R2 could remain the same, but it usually increases]. Although the largest R2 is thus achieved by simply including all the available variables, the best multiple regression equation does not necessarily use all of them.
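One standard response to this flaw is the adjusted R2, which penalises each additional predictor. A sketch using the figures from the corn example (R2 = 0.525, n = 17 soils, k = 2 predictors):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.525, 17, 2), 3))  # noticeably below the raw R-squared of 0.525
```

Unlike R2, the adjusted value can fall when a weak predictor is added, which makes it a fairer yardstick for comparing models with different numbers of variables.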
The ANOVA output of a multiple regression analysis is useful to check whether the overall model is significant, i.e. it tells whether the combined effect of the independent variables on the dependent variable is significant. In this case the model is significant (p = .005 < .05). In general, only after the overall model is found significant are the individual independent variables checked for their individual significance on the dependent variable.
In this output, the 95% confidence interval statistics are quite apparent. For instance, we would report: the 95% confidence interval for the Inorganic Phosphorous regression coefficient is (.555, 2.025), and likewise for the others. However, before interpreting the figures under the VIF column (the last column) and the tolerance column (the second last), we will review the following literature.
Popular among the common methods for identifying and curing multicollinearity are examination of the correlation matrix and the variance inflation factor (VIF). Statisticians often regard multicollinearity in a data set as severe if at least one simple correlation coefficient between the independent variables is at least 0.9 (Bowerman et al., 2005).
Other things being equal, lower levels of VIF are desired, since higher levels of VIF are known to adversely affect the results associated with a multiple regression analysis. Regarding the VIF as a tool for detecting multicollinearity, the literature varies. Draper and Smith (1998) state that all guidelines for how large a VIF must be to signal a multicollinearity problem in a regression model are essentially arbitrary, and each person must decide for him- or herself. A commonly given rule of thumb is that VIFs of 10 or higher (or, equivalently, tolerances of .10 or less) may be reason for concern (Multicollinearity, n.d.). As a caution while dealing with the VIF, consider the following:
Unfortunately, several rules of thumb – most commonly the rule of 10 – associated with
VIF are regarded by many practitioners as a sign of severe or serious multicollinearity
(this rule appears in both scholarly articles and advanced statistical textbooks). When
VIF reaches these threshold values researchers often attempt to reduce the collinearity
by eliminating one or more variables from their analysis; using Ridge Regression to
analyse their data; or combining two or more independent variables into a single index.
These techniques for curing problems associated with multi-collinearity can create
problems more serious than those they solve. (O'Brien, 2007)
However widely used, this rule of thumb should therefore be applied with caution, since its validity depends on the context in which the model was formed, how the data were collected, and what the model's objective actually was. Moreover, this is only one issue in identifying and dealing with multicollinearity in multiple regression analysis; it simply signals how deep we need to go and how challenging it is to fit a realistic multiple regression equation.
This helps us interpret the VIF output. In our case above, the VIF (1.189) is quite small and well within the rule-of-thumb range, i.e. < 10, indicating straightforwardly that there is no multicollinearity issue in this model fitting, as we had already found by analysing the correlation matrix.
Tolerance is the reciprocal of the VIF; it should be greater than .10 (values of .10 or less signal a problem) for there to be no multicollinearity issue, and it serves the same purpose as the VIF.
Which explanatory variable best fits the model?
We recall the above output, of which we have already considered everything other than the Standardized Coefficients (Beta). These are called beta weights. In general, the coefficients of the independent variables are not directly comparable, as they represent different series of data with varying measurement units. However, for an analyst such a comparison very often becomes necessary. The beta weights are designed and computed for this purpose: they are unit-free measures, and hence comparable. To obtain them, we fit the multiple regression equation as before but with all variables standardized, which leaves no constant term. Here it is,
Now, the explanatory variable with the greater beta weight is more influential on the dependent variable, i.e. fits the model best. Clearly, in this case Inorganic Phosphorous fits the model better than Organic Phosphorous, i.e. it has a much stronger direct effect on the dependent variable Y. This finding was also anticipated when we tested the significance of the regression coefficients: the coefficient for Organic Phosphorous was not significant, and it was therefore recommended to remove it from the model and search for other, more influential explanatory variable(s).
Note: The standardized partial slopes are called 'beta weights'. A beta weight gives the change, in standard scores, of the dependent variable per standard-score change in an independent variable while controlling for the effects of the other independent variables.
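Equivalently, a beta weight can be computed from an unstandardized slope as beta = b * (s_x / s_y). The standard deviations below are hypothetical, purely to illustrate the conversion:

```python
def beta_weight(b, s_x, s_y):
    """Standardized coefficient (beta weight) from an unstandardized slope b."""
    return b * s_x / s_y

b1 = 1.290              # unstandardized slope of Inorganic Phosphorous (from the output)
s_x1, s_y = 10.0, 18.0  # hypothetical standard deviations, for illustration only
print(round(beta_weight(b1, s_x1, s_y), 3))
```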
Gain in predictive ability of the model
To check how much gain in predictive ability there has been due to the combination of the predictors (independent variables) in the model, the part and partial correlations output is helpful.
The zero order correlations are simple correlations between the independent variables
(Inorganic Phosphorous and Organic Phosphorous) and the dependent variable
(Phosphorous Content).
A partial correlation is computed holding the other predictors constant, whereas a part correlation excludes the effect of the other predictors from the given predictor only. The squared part correlations are therefore the unique contributions of the predictors, and they are helpful for judging whether using multiple regression was beneficial. When these squared part correlations (the last column in the table above) are added up (48.0249 + 0.6724 = 48.6973%), approximately 49% of the variance in the response variable (Phosphorous Content) is accounted for uniquely by the two independent variables. This percentage differs from the R-squared value (52%) of the model, meaning that (52 - 49) = 3% of the predictive work done by the predictors was overlapping. This shows that the combination of the variables is not particularly weak with respect to the gain in predictive ability of the model. Again, we recommend removing the weak predictor (Organic Phosphorous), which has less than 1% of unique contribution to explaining the dependent variable in the model.
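The arithmetic can be sketched as follows, backing the part correlations out of the squared percentages quoted above:

```python
# Part (semipartial) correlations implied by the squared percentages in the output
part = {"Inorganic Phosphorous": 0.693, "Organic Phosphorous": 0.082}

unique = {name: p ** 2 for name, p in part.items()}  # unique variance shares
total_unique = sum(unique.values())                  # about 0.487, i.e. ~49%
overlap = 0.525 - total_unique                       # R-squared minus unique shares, ~3%

print(f"unique total = {total_unique:.4f}, overlap = {overlap:.4f}")
```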
Variables which take only the values 0 and 1 are called dummy variables; alternative names are indicator variables, qualitative variables, etc. Dummy variables allow qualitative variables to be used in regression models.
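A minimal sketch of dummy coding (the variable and category names are made up for illustration):

```python
# A qualitative variable with two categories, coded 0/1 so it can enter a regression
treatments = ["control", "fertilised", "control", "fertilised", "fertilised"]
dummy = [1 if t == "fertilised" else 0 for t in treatments]
print(dummy)
```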
Logistic Regression
The logistic regression model can be written as

P(Y) = 1 / (1 + e^(-Z))

where e is the base of the natural logarithm and

Z = a + b1x1 + b2x2 + … + bnxn.

This is equivalent to multiple linear regression, except that it expresses the equation in terms of the probability that a case belongs in a certain category.
As such, the resulting value from the equation is a probability that varies between 0 and 1. A value close to 0 means that Y is very unlikely to happen, and a value close to 1 means that Y is very likely to occur.
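The transformation can be sketched directly. The intercept and slopes below are hypothetical, not fitted values:

```python
import math

def logistic_p(z):
    """P(Y) = 1 / (1 + e^(-Z)); always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

a, b1, b2 = -1.0, 0.8, 0.5      # hypothetical coefficients
x1, x2 = 2.0, 1.0               # hypothetical predictor values
z = a + b1 * x1 + b2 * x2
print(round(logistic_p(z), 3))  # a value near 1 means Y is likely to occur
```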
In logistic regression the response variable is dichotomous, so the assumption of linearity is violated. But a transformation of the variables is made such that logistic regression becomes a tool for expressing a non-linear relationship as a linear one. A logistic regression equation with two or more predictor variables expresses the multiple linear regression equation in logarithmic terms and thus overcomes the problem of violating the linearity assumption.
Example
What lifestyle characteristics are risk factors for coronary heart disease (CHD)? Given a
sample of patients measured on smoking status, diet, exercise, alcohol use, and CHD
status, you could build a model using the four lifestyle variables to predict the presence
or absence of CHD in a sample of patients. The model can then be used to derive
estimates of the odds ratios for each factor to tell you, for example, how much more
likely smokers are to develop CHD than non-smokers.
Another example is to look at which variables predict whether a person is male or female. We might measure laziness, pigheadedness, alcohol consumption, and the number of burps a person does in a day. Using logistic regression, we might find that all these variables predict the gender of the person, and the technique will also allow us to predict whether a certain person is likely to be male or female.
Assumptions in logistic regression
Logistic regression does not rely on distributional assumptions in the same sense that
discriminant analysis does. However, your solution may be more stable if your
predictors have a multivariate normal distribution. Additionally, as with other forms of
regression, multicollinearity among the predictors can lead to biased estimates and
inflated standard errors. The procedure is most effective when group membership is a
truly categorical variable; if group membership is based on values of a continuous
variable (for example, "high IQ" versus "low IQ"), you should consider using linear
regression to take advantage of the richer information offered by the continuous
variable itself.
Besides being tools for analysing complex data, multivariate techniques also help in various types of decision-making. For example, take the case of a college entrance examination wherein many tests are administered to candidates, and the candidates scoring the highest total marks over many subjects are admitted. This system, though apparently fair, may at times be biased in favour of subjects with larger standard deviations. Multivariate analysis may be appropriately used in such situations for developing norms as to who should be admitted to the college.
Based on the dependency relationship, multivariate techniques can be categorized into two types: one for data containing both dependent and independent variables, and the other for data containing several variables without a dependency relationship. In the former category are techniques like multiple regression analysis, multiple discriminant analysis, multivariate analysis of variance and canonical analysis, whereas in the latter category we put techniques like factor analysis, cluster analysis, multidimensional scaling or MDS (both metric and non-metric) and latent structure analysis.
Among the many such multivariate techniques in normal use, so far we have dealt with multiple regression, and in the following we introduce multivariate analysis of variance (MANOVA).
Multivariate analysis of variance
In circumstances in which there are several dependent variables, the simple principles of analysis of variance extend to multivariate analysis of variance. This technique is like the univariate ANOVA, with the added ability to handle several dependent variables. If ANOVA is applied consecutively to a set of interrelated dependent variables, erroneous conclusions may result; MANOVA corrects this by simultaneously testing all the variables and their interrelationships.
Exercise
1. The following data give the yield of maize grain and green fodder, in lbs per plot, for different doses of nitrogen.
Investigate the coefficient of correlation between (1) the amount of nitrogen and the yield of maize grain, and (2) grain yield and fodder yield. Hence, test its significance in both cases.
2. The table below gives the total grain production and the cereal production, in lakh tonnes (rounded figures), for nine years.
Year 1 2 3 4 5 6 7 8 9
Total grain Production (y) 400 440 480 550 620 650 660 740 760
Cereal Production (x) 50 60 70 85 95 100 105 115 120
Fit an estimated regression line of total production on cereal production, and hence estimate the total production for a cereal production of 90 lakh tonnes. (Ans. ŷ = 122.18 + 5.25x; 595 lakh tonnes)
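The printed answer can be checked with a short least-squares computation, taking the year-3 total grain production as 480 lakh tonnes (the printed answer is consistent only with that reading):

```python
x = [50, 60, 70, 85, 95, 100, 105, 115, 120]       # cereal production
y = [400, 440, 480, 550, 620, 650, 660, 740, 760]  # total grain production

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

est_90 = a + b * 90  # estimate for a cereal production of 90 lakh tonnes
print(f"a = {a:.2f}, b = {b:.2f}, estimate = {est_90:.0f}")
```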
3. The following data give the yield of maize grain and green fodder, in lbs per plot, for different doses of nitrogen.
Investigate the coefficients of correlation rxy, ryz, and rxz, and hence test the significance of the correlation coefficient in each case.
4. Investigate the sample correlation coefficient for the data given below.
5. In studying the use of ovulated follicles for determining the number of eggs laid by ring-necked pheasants, the following data on 14 captive hens were recorded.
Compute the correlation coefficient and the coefficient of determination. Test whether there is a significant linear relationship between eggs laid and ovulated follicles at the 5% level of significance.
6. Fit regression lines of grain (y) on biomass (x), and of biomass (y) on tiller number (x). Which of these lines gives the better fit? What can you conclude about the relationship among the three traits of rice?
References
Bewick, V., Cheek, L., & Ball, J. (2003). Statistics review: Correlation and regression. Critical Care, 7(6), 451-459. Retrieved from http://ccforum.com/content/7/6/451 (accessed 17 July 2019). doi: 10.1186/cc2401
Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). John Wiley & Sons, Inc., USA.
Logistic regression. (n.d.). IBM Knowledge Center: SPSS 23. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/regression/idh_lreg.html (accessed 17 July 2019).
O'Brien, R. M. (2007). A caution regarding rules of thumb for variance inflation factors. Quality and Quantity, 41, 673-690. doi: 10.1007/s11135-006-9018-6
Snedecor, G., & Cochran, W. (1994). Statistical methods. Affiliated East-West Press, 26 Barakhamba Road, New Delhi 110001.
Steel, R., & Torrie, J. (1980). Principles and procedures of statistics. McGraw-Hill Book Company, New York.