SQQS2073 Note 1 Simple Linear Regression

NOTE 1: SIMPLE LINEAR REGRESSION
Using Simple Regression to Describe a Linear Relationship

Regression analysis is a statistical technique used to describe relationships among variables. The
simplest case to examine is one in which a variable y, referred to as the dependent variable, may be
related to one variable x, called an independent or explanatory variable. If the relationship between
y and x is believed to be linear, then the equation expressing this relationship is the equation for a line:
y = b0 + b1x
If a graph of all the (x,y) pairs is constructed, then b0 represents the y intercept, the point where the
line crosses the vertical (y) axis, and b1 represents the slope of the line.
In a simple relationship, there are only two variables under study. In multiple
relationships, many variables are under study.
Simple relationships can also be positive or negative. A positive relationship exists when both
variables increase or decrease at the same time. In a negative relationship, as one variable
increases, the other variable decreases, and vice versa.
[Figures: Positive relationship; Negative relationship]
SCATTER PLOTS
A scatter plot is a graph of the ordered pairs (x,y) of numbers consisting of the independent
variable, x, and the dependent variable, y. A scatter plot is a visual way to describe the nature of
the relationship between the independent and dependent variables.
The independent variable is the variable in regression that can be controlled or manipulated. The
dependent variable is the variable in regression that cannot be controlled or manipulated.
The independent variable, x, is plotted on the horizontal axis and the dependent variable, y, is
plotted on the vertical axis.
The purpose of the scatter plot is to determine the nature of the relationship. The possibilities
include a positive linear relationship, a negative linear relationship, a curvilinear relationship, or
no discernible relationship.
Example 1:
Construct a scatter plot for the data obtained in a study of age and systolic blood pressure of six
randomly selected subjects. The data are shown in the following table.
Subject   Age, x   Pressure, y
A         43       128
B         48       120
C         56       135
D         61       143
E         67       141
F         70       152
Solution
The plot suggests a positive relationship, since as a person’s age increases, blood pressure tends to
increase also.
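A plot like this can be roughed out in a few lines of code. The following is a minimal sketch in
Python and is not part of the original notes: the function name ascii_scatter and the grid size are
assumptions chosen for illustration, and the sketch assumes the x values (and the y values) are not
all equal.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a rough text scatter plot: x runs left to right, y bottom to top."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]


# Data from Example 1: age (x) and systolic blood pressure (y).
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

plot = ascii_scatter(ages, pressures)
print("\n".join(plot))
```

The stars drift from the lower left to the upper right, which is the visual signature of a positive
relationship.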
Example 2:
Construct a scatter plot for the data obtained in a study on the number of absences and the final
grades of seven randomly selected students from a statistics class. The data are shown here.
Student   Number of absences, x   Final grade, y (%)
A         6                       82
B         2                       86
C         15                      43
D         9                       74
E         12                      58
F         5                       90
G         8                       78
Solution
The plot suggests a negative relationship, since as the number of absences increases, the final
grade tends to decrease.
Example 3:
Construct a scatter plot for the data obtained in a study on the number of hours nine people exercise
each week and the amount of milk (in ounces) each person consumes per week. The data follow.
Subject   Hours, x   Amount, y
A         3          48
B         0          8
C         2          32
D         5          64
E         8          10
F         5          32
G         10         56
H         2          72
I         1          48
Solution
The plot shows no specific type of relationship, since no pattern is discernible.
Sometimes a scatter plot shows a curvilinear relationship between the data.

Example 4:
Exact Relationship
Example 5:
x 1 2 3 4 5 6
y 3 5 7 9 11 13
In equation form,
y = 1 + 2x
This is an exact or deterministic linear relationship.
[Figure: the six points plotted with the line y = 2x + 1; horizontal axis x from 0 to 7, vertical
axis y from 0 to 14]
When all the points fall exactly on the line, this indicates a perfect linear relationship between the
variables. In algebra, the equation of a line is usually given as y = mx + c, where m is the slope of
the line and c is the y intercept.
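As a quick check of the exact relationship in Example 5, the sketch below (an illustration added
here, with variable names chosen for convenience) tests whether every (x, y) pair satisfies
y = 1 + 2x, that is, slope 2 and y intercept 1.

```python
# Data from Example 5.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 7, 9, 11, 13]

b0, b1 = 1, 2  # y intercept and slope of the line y = 1 + 2x

# The relationship is exact only if every observed y equals b0 + b1 * x.
is_exact = all(y == b0 + b1 * x for x, y in zip(xs, ys))
print(is_exact)
```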
In statistics, the equation of the regression line is written as 𝑦̂ = b0 + b1 x, where b1 is the slope of
the line and b0 is the 𝑦̂ intercept.
Example 6:
x 1 2 3 4 5 6
y 3 2 8 8 11 13
It appears that x and y may be linearly related, but it is not an exact relationship. Still, it may be of
use to describe the relationship in equation form. This can be done by drawing what appears to be
the “best-fitting” line through the points and guessing what the values of b0 and b1 are for this line.
For the line drawn, a good guess might be the following equation:
𝑦̂= -1 + 2.5x.
[Figure: scatter plot of the six points with the line drawn through them; horizontal axis x from 0
to 7, vertical axis y from 0 to 14]
The notation 𝑦̂ is used here to indicate that we do not expect this to be an exact relationship. Given
a value of x, the equation will produce an estimate of y, which is denoted 𝑦̂ .
Line of Best Fit
Several different lines can be drawn on a scatter plot near the points. Given a scatter plot, one must
be able to draw the line of best fit. Best fit means that the sum of the squares of the vertical
distances from each point to the line is at a minimum. The reason one needs a line of best fit is that
the values of y will be predicted from the values of x; hence, the closer the points are to the line,
the better the fit and the prediction will be.
Consider the pair of values denoted (x*, y*). The actual y value is indicated as y*; the value
predicted to be associated with x* if the line shown were used is indicated as 𝑦̂*. The difference
between the actual, or observed, y value and the predicted y value at the point x* is called a residual
and represents the “error” involved. This error is denoted y* - 𝑦̂*. If the line is to fit the data points
as accurately as possible, these errors should be minimized. This should be done not just for the
single point (x*, y*), but for all the points on the graph.
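To make the idea of a residual concrete, the sketch below computes y - 𝑦̂ for every point of
Example 6, using the guessed line 𝑦̂ = -1 + 2.5x from that example. The function name predict is an
assumption made for illustration.

```python
# Data from Example 6.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]


def predict(x, b0=-1.0, b1=2.5):
    """Predicted value y-hat from the guessed line y-hat = -1 + 2.5x."""
    return b0 + b1 * x


# Residual = observed y minus predicted y-hat, one per data point.
residuals = [y - predict(x) for x, y in zip(xs, ys)]

for x, y, e in zip(xs, ys, residuals):
    print(f"x={x}  y={y}  y_hat={predict(x)}  residual={e}")
# residuals: 1.5, -2.0, 1.5, -1.0, -0.5, -1.0
```

Positive residuals mark points above the line and negative residuals mark points below it.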
Choosing the line that minimizes the sum of the squared errors, \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, is
called the ordinary least-squares criterion, and the resulting line is called the least-squares
regression line.
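The criterion can be used to rank candidate lines. The sketch below, added for illustration, computes
the sum of squared errors (SSE) for three candidate lines on the Example 6 data: the guessed line
𝑦̂ = -1 + 2.5x, a second hand-picked line 𝑦̂ = 2x, and 𝑦̂ = -0.2 + 2.2x, which is the line the
least-squares formulas yield for these data. The function name sse and the choice of candidates are
assumptions.

```python
# Data from Example 6.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]


def sse(b0, b1, xs, ys):
    """Sum of squared vertical distances between the points and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Candidate lines as (b0, b1) pairs.
candidates = {
    "-1 + 2.5x": (-1.0, 2.5),
    "0 + 2x": (0.0, 2.0),
    "-0.2 + 2.2x": (-0.2, 2.2),
}

scores = {name: sse(b0, b1, xs, ys) for name, (b0, b1) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: SSE = {value:.2f}")
print("smallest SSE:", best)
# SSE values: 10.75, 11.00 and 8.80, so the third line fits best
```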
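Formula for the least-squares regression line, 𝑦̂ = b0 + b1 x:
\[
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\quad \text{or} \quad
b_1 = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{n \sum x^2 - \left(\sum x\right)^2}
\]
and
\[
b_0 = \bar{y} - b_1 \bar{x}
\]
where b0 is the 𝑦̂ intercept and b1 is the slope of the line.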
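The first (deviation) form of the formula translates directly into code. The following is a minimal
sketch, not a definitive implementation; the function name least_squares is an assumption, and the
data are those of Example 6.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the deviation form of b1.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Data from Example 6.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]

b0, b1 = least_squares(xs, ys)
print(f"y_hat = {b0:.1f} + {b1:.1f}x")  # y_hat = -0.2 + 2.2x
```

The fitted slope of 2.2 is somewhat flatter than the guessed slope of 2.5.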
Example 7:
Based on Example 1, find the equation of the regression line. The data are shown in the following
table.
Subject   Age, x   Pressure, y
A         43       128
B         48       120
C         56       135
D         61       143
E         67       141
F         70       152
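One way to work this example is with the computational (sums) form of the formula. The sketch below
is an illustration added here rather than the notes' own solution; the variable names are
assumptions, and the rounded values in the comments come from carrying out the arithmetic on the
table above.

```python
# Data from Example 1: age (x) and systolic blood pressure (y).
xs = [43, 48, 56, 61, 67, 70]
ys = [128, 120, 135, 143, 141, 152]

n = len(xs)
sum_x = sum(xs)                               # 345
sum_y = sum(ys)                               # 819
sum_xy = sum(x * y for x, y in zip(xs, ys))   # 47634
sum_x2 = sum(x * x for x in xs)               # 20399

# Slope: b1 = (n*sum_xy - sum_x*sum_y) / (n*sum_x2 - sum_x^2) = 3249 / 3369
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: b0 = y_bar - b1 * x_bar
b0 = sum_y / n - b1 * sum_x / n

print(f"y_hat = {b0:.3f} + {b1:.3f}x")  # y_hat = 81.048 + 0.964x
```

Under this calculation, each additional year of age is associated with an increase of about 0.96 in
predicted systolic pressure.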
CORRELATION COEFFICIENT
The correlation coefficient computed from the sample data measures the strength and direction of
a linear relationship between two variables.
The symbol for the sample correlation coefficient is r.
The symbol for the population correlation coefficient is ρ (the Greek letter rho).
The range of the correlation coefficient is from -1 to +1.
If there is a strong positive linear relationship between the variables, the value of r will be close
to +1.
If there is a strong negative linear relationship between the variables, the value of r will be close
to -1.
When there is no linear relationship between the variables or only a weak relationship, the value
of r will be close to 0.
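Formula for the Correlation Coefficient r
\[
r = \frac{\sum xy - \dfrac{\left(\sum x\right)\left(\sum y\right)}{n}}
         {\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]
                \left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}}
\]
or
\[
r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]
                \left[n \sum y^2 - \left(\sum y\right)^2\right]}}
\]
where n is the number of data pairs.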
Example 8:
Compute the value of the correlation coefficient for the data obtained in the study of age and blood
pressure given in Example 1.
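The second form of the formula for r can be evaluated step by step. The sketch below is an
illustration added here, not the notes' own solution; the variable names are assumptions, and the
values in the comments come from carrying out the arithmetic on the Example 1 data.

```python
from math import sqrt

# Data from Example 1: age (x) and systolic blood pressure (y).
xs = [43, 48, 56, 61, 67, 70]
ys = [128, 120, 135, 143, 141, 152]

n = len(xs)
sum_x = sum(xs)                               # 345
sum_y = sum(ys)                               # 819
sum_xy = sum(x * y for x, y in zip(xs, ys))   # 47634
sum_x2 = sum(x * x for x in xs)               # 20399
sum_y2 = sum(y * y for y in ys)               # 112443

numerator = n * sum_xy - sum_x * sum_y        # 3249
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))  # sqrt(3369 * 3897)

r = numerator / denominator
print(f"r = {r:.3f}")  # r = 0.897
```

A value of r near 0.9 indicates a strong positive linear relationship between age and blood pressure
in this sample.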