
Basic Elements for Understanding Regression Analysis

(Lecture Note prepared for M. Phil. Students by D. Chhetry)

1. Introduction

Many occasions in data analysis require modeling relationships (or estimating functional relationships)
among variables. For instance, modeling is necessary for predicting the sales of a product from its
advertising expenditure. Regression analysis provides tools for modeling a "dependent" variable with the
help of a number of "independent" variables, for purposes such as prediction and risk management. The
term "regression" was coined by Sir Francis Galton in the nineteenth century to describe a biological
phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average,
a phenomenon also known as regression towards the mean
(https://en.wikipedia.org/wiki/Regression_analysis).

Regression analysis also allows us to compare the effects of variables measured on different scales, such
as the effects of several promotional activities on the sales of a product. This helps market researchers
to evaluate candidate variables, eliminate weak ones, and select the best set for building predictive models.

Simple regression takes into account only one independent variable and one dependent variable. This
semester we concentrate on simple regression. Although it has limited practical use, it provides an
understanding of the basic tools of the methodology, which can be easily extended to realistic models
involving more than one independent variable.

In statistical terms, Simple Regression Analysis (SRA) is a collection of techniques, each of which attempts
to explore the relationship between an independent variable and a dependent variable by means of a
statistical model. It includes both simple time-series and cross-section models, and both linear and
non-linear models. It is also known as curve or line fitting, because a regression equation can be used to
fit a curve or line to data points in such a way that the distances of the data points from the curve or
line are minimized
(http://www.businessdictionary.com/definition/regression-analysis-RA.html).

Caveats:
• The relationships depicted in a regression analysis are associative only; any cause-effect (causal)
inference is purely subjective.
• Linear regression is very sensitive to outliers, which can badly distort the regression line and,
eventually, the forecasted values.

To understand the methodology of regression analysis better, it is essential to know some basic
mathematics: the concept of irrational numbers, some well-known mathematical expressions (or
deterministic models) with their graphs and properties, and the concept of the natural logarithm with its
applications in model building and in measuring changes.

2. Irrational Numbers
We are all familiar with the positive and negative integers as well as ratios of two integers, which are
often expressed in the decimal system. But there are numbers that cannot be expressed as a ratio of two
integers; their decimal expansions neither terminate nor repeat, so their values can only be written
approximately in decimals. Such numbers are called irrational numbers, and the important ones are denoted
by symbols. One such number is the exponential number, denoted by e, whose value to 7 decimal places is
2.7182818. Another is π, whose first 7 decimal places are 3.1415926. These numbers are used quite
frequently in scientific work (for example, see the normal probability density function).
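
As a small illustration of how e and π enter scientific formulas, the sketch below evaluates the normal
probability density function mentioned above. The function name normal_pdf and its default parameters are
assumptions made for this example only.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(round(math.e, 7))   # e rounded to 7 decimal places
print(round(math.pi, 7))  # pi rounded (not truncated) to 7 decimal places
print(normal_pdf(0.0))    # height of the standard normal curve at its peak
```

Both constants are available in most software under similar names, so they never need to be typed in by hand.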

3. Algebraic Expressions
Algebraically, the expression y = f(x), read as "y is a function of x", is used to designate the relationship
between y and x (see the table below for frequently encountered algebraic expressions), where
$f(x) = \alpha + \beta x$ in the case of a linear function, $f(x) = \alpha x^{\beta}$ in the case of a power
function, and so on.

Table 1: Some Frequently Encountered Deterministic Models

Name          Expression or deterministic model
Linear        $y = \alpha + \beta x$
Power         $y = \alpha x^{\beta}$, where $\alpha > 0$ and $x > 0$
Compound      $y = \alpha \beta^{x}$, where $\alpha > 0$ and $\beta > 0$
Exponential   $y = \alpha e^{\beta x}$, where $\alpha > 0$
Note that the names are borrowed from SPSS's Curve Estimation procedure.

In each of the models of Table 1, $\alpha$ and $\beta$ are constants and x and y are variables. In the
co-ordinate system, the linear model traces a straight line, while each of the other models in the table
traces a convex or concave curve. The compound and exponential models are more suitable for modeling time
series data, in which case the variable x is replaced by the time variable t and the variable y is replaced
by $y_t$.
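
The four models of Table 1 can be written as small functions, as in the sketch below. The function names
and the parameter values in the loop are illustrative assumptions, not part of the lecture note.

```python
import math

# alpha and beta are the constants of each model; x is the independent variable.
def linear(x, alpha, beta):
    return alpha + beta * x

def power(x, alpha, beta):
    return alpha * x ** beta            # intended for alpha > 0 and x > 0

def compound(x, alpha, beta):
    return alpha * beta ** x            # intended for alpha > 0 and beta > 0

def exponential(x, alpha, beta):
    return alpha * math.exp(beta * x)   # intended for alpha > 0

# Evaluate each model at a few positive x values (parameter values are made up).
for x in (1, 2, 4):
    print(x, linear(x, 10, 3), power(x, 2, 0.5),
          compound(x, 2, 1.05), exponential(x, 3, 0.5))
```

Because $\beta^{x} = e^{x \ln(\beta)}$, a compound model with constant $\beta$ traces the same curve as an
exponential model with constant $\ln(\beta)$; the two differ only in how the constant is expressed.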

Exercise 1: Sketch the curve of each of the following algebraic expressions, for positive values of x only
(use Excel software).
(1) $y = 10 + 3x$, (2) $y = 2x^{2}$, (3) $y = 2x^{0.5}$, (4) $y = 2(1.05)^{x}$, (5) $y = 3e^{0.5x}$, (6) $y = \ln(x)$
Also, find the first derivative $\frac{dy}{dx}$ and the second derivative $\frac{d^{2}y}{dx^{2}}$ of y with
respect to x for each of the above functions, and use the following two results to understand the basic
properties of each curve, such as monotonicity, convexity and concavity.

Result 1
Monotonic Behavior of Function: If $\frac{dy}{dx} > 0$ then the function is said to be increasing, in the
sense that as x increases y also increases; and if $\frac{dy}{dx} < 0$ then the function is said to be
decreasing, in the sense that as x increases y decreases.

Result 2
Convexity/Concavity Behavior of Function: If $\frac{d^{2}y}{dx^{2}} > 0$ then the function is said to be
convex, in the sense that a line segment joining any two points of the curve lies above the curve; and if
$\frac{d^{2}y}{dx^{2}} < 0$ then the function is said to be concave, in the sense that a line segment joining
any two points of the curve lies below the curve.
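
The two results can be checked numerically. The sketch below approximates the first and second derivatives
by central differences and classifies a function on a handful of positive x values. It is only a rough check
on sample points, not a proof; the helper names, the step size h and the tolerance tol are assumptions of
this example.

```python
import math

def first_derivative(f, x, h=1e-3):
    # Central-difference approximation of dy/dx.
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-3):
    # Central-difference approximation of d2y/dx2.
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def describe(f, xs, tol=1e-6):
    """Classify f on the sample points xs by the signs of its derivatives."""
    d1 = [first_derivative(f, x) for x in xs]
    d2 = [second_derivative(f, x) for x in xs]
    if all(v > tol for v in d1):
        trend = "increasing"
    elif all(v < -tol for v in d1):
        trend = "decreasing"
    else:
        trend = "not monotonic on these points"
    if all(v > tol for v in d2):
        shape = "convex"
    elif all(v < -tol for v in d2):
        shape = "concave"
    elif all(abs(v) <= tol for v in d2):
        shape = "straight line"
    else:
        shape = "mixed"
    return trend, shape

xs = [0.5, 1.0, 2.0, 3.0, 5.0]  # positive values of x only
functions = {
    "y = 10 + 3x": lambda x: 10 + 3 * x,
    "y = 2x^2": lambda x: 2 * x ** 2,
    "y = 2x^0.5": lambda x: 2 * x ** 0.5,
    "y = 2(1.05)^x": lambda x: 2 * 1.05 ** x,
    "y = 3e^(0.5x)": lambda x: 3 * math.exp(0.5 * x),
    "y = ln(x)": math.log,
}
for name, f in functions.items():
    print(name, describe(f, xs))
```

The analytic derivatives remain the proper way to settle each case; the numerical check is useful mainly for
catching algebra slips.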

Exercise 2: To understand the types of relationship between variables, do the following exercises.
1. Open the data file INC_CON.SAV. Draw scatter diagrams of: (a) pcexp against pcinc, (b) pcfdexp
against pcinc, (c) pcmdexp against pcinc, and (d) pcedexp against pcinc.
2. Open the data file ADB.SAV. Draw scatter diagrams of: (a) investment against time, (b) collection
against time, and (c) collection against investment.

From Exercise 2 it is clear that the relationships between two variables are not always linear (tendency to
lie on a straight line in scatter diagram). However, there are some transformations that change some non-
linear relationships to linear ones. One such transformation is logarithmic transformation for which we need
to understand the concept and properties of natural logarithms.

4. Natural Logarithms
For a given positive number y, the natural logarithm of y, denoted by ln(y), is the unique number x which,
when used as the power of e, yields y. Thus, the following two statements are equivalent.
a. $y = e^{x}$
b. x is the natural logarithm of y, or symbolically, $x = \ln(y)$
The natural logarithm or exponential of a given number may be obtained with the built-in functions ln( )
and exp( ) of SPSS or Excel. Pocket calculators also have these facilities. The following properties of
logarithms are very useful in building models:

P1. $\ln(1) = 0$ and $\ln(e) = 1$
P2. $\ln(ab) = \ln(a) + \ln(b)$
P3. $\ln(a / b) = \ln(a) - \ln(b)$
P4. $\ln(a^{b}) = b \ln(a)$

For technical reasons, logarithmic transformation of variables is very common in regression analysis. Some
of the original functions of Table 1 can be transformed into linear expressions in logarithms using the
properties of logarithms.
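Table 2: Logarithmic Transformation of Some Models

Name          Original model             Transformed model
Power         $y = \alpha x^{\beta}$     $\ln(y) = \ln(\alpha) + \beta \ln(x)$, or $y^{*} = \alpha^{*} + \beta x^{*}$
Compound      $y = \alpha \beta^{x}$     $\ln(y) = \ln(\alpha) + x \ln(\beta)$, or $y^{*} = \alpha^{*} + \beta^{*} x$
Exponential   $y = \alpha e^{\beta x}$   $\ln(y) = \ln(\alpha) + \beta x$, or $y^{*} = \alpha^{*} + \beta x$
Note that in this table, $y^{*} = \ln(y)$, $\alpha^{*} = \ln(\alpha)$, $x^{*} = \ln(x)$, and $\beta^{*} = \ln(\beta)$.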

Logarithmic transformation changes each of the original power, compound and exponential functions into
the linear form $y = \alpha + \beta x$. All the theory and methods developed for linear models are valid
for the transformed models. The power model is widely known as the log-linear model in statistics and in
econometrics (it is also called the log-log or double-log model).
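
To see the transformation at work, the sketch below fits a power model by taking logarithms of both
variables and fitting a straight line by ordinary least squares. The least-squares formulas, the helper
names and the synthetic data (generated exactly from $y = 2x^{0.5}$) are assumptions of this example; the
note itself relies on SPSS for estimation.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_power(xs, ys):
    """Fit y = alpha * x**beta through the line ln(y) = ln(alpha) + beta*ln(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    alpha_star, beta = fit_line(log_x, log_y)
    return math.exp(alpha_star), beta   # back-transform alpha* = ln(alpha)

xs = [1, 2, 4, 8, 16]
ys = [2 * x ** 0.5 for x in xs]         # synthetic data with no noise
alpha, beta = fit_power(xs, ys)
print(round(alpha, 6), round(beta, 6))  # 2.0 0.5
```

With real data the points will not lie exactly on the curve, so the recovered constants are estimates rather
than exact values.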

Exercise 3: Open the data file INC_CON. Make the transformation:
y = ln(pcedexp) and x = ln(pcinc)
Draw a scatter plot of y against x. What is your conclusion?

5. Measures of Changes
The parameter $\beta$ in the models of Table 1 measures various types of changes, which this section
introduces. For this purpose, assume that a change in the value of the x variable from $x$ to
$x + \Delta x$ brings a change in the value of the y variable from $y$ to $y + \Delta y$. Three basic types
of change are then considered in Table 3.

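Table 3: Basic Changes

Name of change      Changes
Absolute change     $\Delta x = (x + \Delta x) - x$ or $\Delta y = (y + \Delta y) - y$
Relative change     $\frac{\Delta x}{x}$ or $\frac{\Delta y}{y}$
Percentage change   $\frac{\Delta x}{x} \times 100$ or $\frac{\Delta y}{y} \times 100$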
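
The three basic changes of Table 3 amount to a few lines of arithmetic, sketched below. The function name
basic_changes and the numbers 40 and 50 are made up for illustration.

```python
def basic_changes(old, new):
    """Absolute, relative and percentage change when a value moves from old to new."""
    absolute = new - old            # delta
    relative = absolute / old       # delta / old value
    percentage = relative * 100     # relative change expressed in percent
    return absolute, relative, percentage

# A variable rising from 40 to 50:
print(basic_changes(40, 50))  # (10, 0.25, 25.0)
```

Note that the relative and percentage changes are undefined when the starting value is zero.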

Since x and y are functionally related as in Table 1, it is more appropriate to investigate the change in y
with respect to x. Such changes are introduced in Table 4.

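Table 4: Change in y with respect to change in x

Verbal definition                                                                                        Calculus definition (when $\Delta x \to 0$)
(Absolute change in y) / (Absolute change in x) = $\frac{\Delta y}{\Delta x}$                            $\frac{dy}{dx}$
(Percentage change in y) / (Percentage change in x) = $\frac{x}{y} \frac{\Delta y}{\Delta x}$            $\frac{x}{y} \frac{dy}{dx}$
(Relative change in y) / (Absolute change in x) = $\frac{1}{y} \frac{\Delta y}{\Delta x}$                $\frac{1}{y} \frac{dy}{dx}$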

Note the following results.

• The generic term for the rate $\frac{dy}{dx}$ is slope, which measures the amount of change in y with
respect to a unit change in x. In the simple linear model, $y = \alpha + \beta x$, the parameter $\beta$
measures the rate of change in y with respect to x, since $\frac{dy}{dx} = \beta$.
• The generic term for the rate $\frac{x}{y} \frac{dy}{dx}$ is elasticity, more specifically the elasticity
of y with respect to x, which measures the percentage change in y with respect to a one percent change in x.
In the log-linear model, $y = \alpha x^{\beta}$, the parameter $\beta$ measures elasticity: here
$\frac{dy}{dx} = \alpha \beta x^{\beta - 1} = \beta \frac{y}{x}$, so that $\beta = \frac{x}{y} \frac{dy}{dx}$.
• The generic term for the rate $\frac{1}{y} \frac{dy}{dx}$ is growth rate, which is usually encountered in
time series models, where x is replaced by the time variable t. In the time series model
$y_t = \alpha e^{\beta t}$, the quantity $100 \times \beta$ measures the exponential growth rate in percent:
here $\frac{dy_t}{dt} = \beta \alpha e^{\beta t} = \beta y_t$, so that $\beta = \frac{1}{y_t} \frac{dy_t}{dt}$.
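
These three interpretations can be checked numerically. The sketch below approximates dy/dx by a central
difference and evaluates the slope of a linear model, the elasticity of a power model and the growth rate of
an exponential model at several points. The helper names and the particular parameter values (borrowed from
the functions of Exercise 1) are assumptions of this example.

```python
import math

def derivative(f, x, h=1e-5):
    # Central-difference approximation of dy/dx.
    return (f(x + h) - f(x - h)) / (2 * h)

def slope(f, x):
    return derivative(f, x)

def elasticity(f, x):
    return (x / f(x)) * derivative(f, x)

def growth_rate(f, x):
    return derivative(f, x) / f(x)

linear_model = lambda x: 10 + 3 * x                  # beta = 3
power_model = lambda x: 2 * x ** 0.5                 # beta = 0.5
exponential_model = lambda t: 3 * math.exp(0.5 * t)  # beta = 0.5

for x in (1.0, 2.0, 5.0):
    print(x,
          round(slope(linear_model, x), 4),
          round(elasticity(power_model, x), 4),
          round(growth_rate(exponential_model, x), 4))
```

In each column the printed value is the same at every x, which is the point: the slope of a linear model,
the elasticity of a power model and the growth rate of an exponential model are constant and equal to
$\beta$.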

In addition to the exponential growth rate (say r), computed from the model $y_t = y_0 e^{rt}$, there is
another growth rate, the compound growth rate (say s), computed from the model $y_t = y_0 (1 + s)^{t}$.
Equating the two models gives $e^{r} = 1 + s$, so the two rates are related by $s = e^{r} - 1$, and
$r < s$ whenever $r \neq 0$.
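
The relation between the two growth rates is easy to verify numerically. In the sketch below the function
names, the rate r = 0.05 and the starting value y0 = 100 are assumptions chosen for illustration.

```python
import math

def compound_from_exponential(r):
    """Compound growth rate s equivalent to exponential growth rate r."""
    return math.exp(r) - 1

def exponential_from_compound(s):
    """Exponential growth rate r equivalent to compound growth rate s."""
    return math.log(1 + s)

r = 0.05
s = compound_from_exponential(r)
print(round(s, 6))  # 0.051271

# Both models give the same series when s = e^r - 1.
y0 = 100.0
for t in range(4):
    print(t, y0 * math.exp(r * t), y0 * (1 + s) ** t)
```

For small rates the two are close (5% versus about 5.13% here), and the gap widens as r grows.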

Curve Estimation Procedure of SPSS
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/idh_curv.htm
https://www.ibm.com/support/knowledgecenter/en/SSLVMB_23.0.0/spss/base/curve_estimation_models.html
http://blog.csdn.net/babyfacer/article/details/5308993

The Curve Estimation procedure produces regression statistics and related plots for the models selected
from the 11 available models. The procedure is reached by sequentially clicking:
Analyze → Regression → Curve Estimation
The main dialog box is displayed below.

Statistics: By default, SPSS produces the following for each selected model: $R^2$, the F statistic and
p-value for the test of significance of the fitted model, the regression coefficients, and the best-fit
curve. When Display ANOVA table is checked, SPSS produces almost all the tables produced by the Linear
regression procedure, plus an additional plot, the best-fit curve.
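
To make these statistics concrete, the sketch below computes the regression coefficients, $R^2$ and the F
statistic for a simple linear fit by ordinary least squares. It is not SPSS's implementation; the function
name, the formulas as coded here and the five data points are assumptions of this example, and the p-value
is omitted because it requires the F distribution.

```python
def simple_linear_fit(xs, ys):
    """Least-squares fit of y = alpha + beta*x, with R-squared and F statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    fitted = [alpha + beta * x for x in xs]
    ss_total = sum((y - mean_y) ** 2 for y in ys)               # total variation in y
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation
    ss_model = ss_total - ss_resid                              # explained variation
    r_squared = ss_model / ss_total
    # One slope parameter, so 1 and n - 2 degrees of freedom.
    f_stat = (ss_model / 1) / (ss_resid / (n - 2))
    return alpha, beta, r_squared, f_stat

xs = [1, 2, 3, 4, 5]                 # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha, beta, r_squared, f_stat = simple_linear_fit(xs, ys)
print(alpha, beta, r_squared, f_stat)
```

For the power, compound and exponential models the same computation can be applied to the log-transformed
variables.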

Exercise 4: Open the data file INC_CON. Estimate the best-fit curves for the following pairs of variables:
a. pcinc and pcexp
b. pcinc and pcfdexp
c. pcinc and pcmdexp
d. pcinc and pcedexp
In each case, observe the values of $R^2$, F and p, and interpret them. Can you explain why some curves are
convex or concave in nature?