Regression Analysis
Linear fits are the most common. For exponential functions, the data must be transformed first.
If we have N pairs of data (xi, yi), we seek to fit a straight line through the data of the form:
y = a0 + a1 x
Determine the constants a0 and a1 such that the distance between the actual y data and the fitted (predicted) line is minimized.
Each xi is assumed to be error free. All the error is assumed to be in the y values.
$$a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$
$$a_1 = \frac{\sum x_i \sum y_i - N \sum x_i y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$
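The summation formulas for the straight-line fit can be sketched in a few lines of Python. This is a minimal illustration, not the slide's own worked example: the function name `linear_fit` and the data values are hypothetical.

```python
def linear_fit(x, y):
    """Return (a0, a1) for y = a0 + a1*x using the least-squares sums."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    denom = sum_x ** 2 - n * sum_x2          # shared denominator
    a0 = (sum_x * sum_xy - sum_x2 * sum_y) / denom
    a1 = (sum_x * sum_y - n * sum_xy) / denom
    return a0, a1


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
a0, a1 = linear_fit(x, y)
print(f"y = {a0:.3f} + {a1:.3f} x")
```

Both coefficients share the same denominator, so it is computed once.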
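Sums for the example data (N = 5): $\sum x_i = 15.2$, $\sum y_i = 12.6$, $\sum x_i y_i = 44.76$, $\sum x_i^2 = 58.16$
Seeking an equation of the form y = a0 + a1 x:
$$a_0 = \frac{(15.2)(44.76) - (58.16)(12.6)}{(15.2)^2 - (5)(58.16)} = 0.879$$
y = 0.879 + 0.540x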
Coefficient of determination, R2:
R2 = 1: perfect fit (good prediction)
R2 = 0: no correlation between x and y
For engineering data, R2 will normally be quite high (0.8 to 0.9 or higher). A low value might indicate that some important variable was not considered but is affecting the results.
The standard error of estimate (SEE or Syx) is a statistical measure of how well the best-fit line represents the data. This is, effectively, the standard deviation of the differences between the data points and the best-fit line.
It provides an estimate of the scatter (random error) in the data about the fitted line. This is analogous to the standard deviation for sample data. It has the same units as y. Two degrees of freedom are lost in calculating the coefficients a0 and a1.
$$S_{yx} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N - 2}}$$
where $\hat{y}_i$ is the value of the fitted line at $x_i$.
Variation in the data is assumed to be normally distributed and due to random causes.
Random variation is assumed to exist in the y values, while the x values are error free.
Since error has been minimized in the y direction, an erroneous conclusion may be made if x is estimated from a value of y.
For power law or exponential relationships, the data need to be transformed before carrying out linear regression analysis. (As we will discuss later, the method of least squares can also be applied to nonlinear functional relationships.)
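The standard error of estimate and R2 described above can be computed directly from the residuals. The sketch below is illustrative only: the data and the coefficients a0 = 0.22, a1 = 1.96 are hypothetical, and R2 is taken in its usual form, 1 - SS_res/SS_tot, which is an assumption here because no formula for R2 is given in the text.

```python
import math


def fit_quality(x, y, a0, a1):
    """Return (SEE, R2) for the fitted line y = a0 + a1*x."""
    n = len(x)
    y_mean = sum(y) / n
    # Sum of squared differences between data and fitted line
    ss_res = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Sum of squared differences between data and their mean
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    see = math.sqrt(ss_res / (n - 2))   # two degrees of freedom lost
    r2 = 1.0 - ss_res / ss_tot          # assumed standard definition
    return see, r2


# Hypothetical data and fit coefficients, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
see, r2 = fit_quality(x, y, a0=0.22, a1=1.96)
print(f"SEE = {see:.4f}, R2 = {r2:.6f}")
```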
In Excel, use Chart >> Add Trendline to obtain the coefficients. Use the functions RSQ() and STEYX() to determine R2 and SEE.
[Figure: output (volts) versus length (cm)]
Linear regression is a standard feature of statistical programs and most spreadsheet programs. It is only necessary to input the x and y data. The remaining calculations are performed immediately.
LINEST output layout (left column: slope; right column: intercept):
m1     b
se1    seb
r^2    sey
F      df
=LINEST(A2:A9,B2:B9,TRUE,TRUE)
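For readers working outside a spreadsheet, the quantities LINEST reports for a straight-line fit can be reproduced with the standard simple-regression formulas. The sketch below is an illustration under assumptions: the function name `linest_like` and the data are hypothetical, and it is not Excel's implementation.

```python
import math


def linest_like(x, y):
    """Return slope, intercept and their statistics for y = b + m1*x."""
    n = len(x)
    df = n - 2                                   # degrees of freedom
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    m1 = sxy / sxx
    b = y_mean - m1 * x_mean
    ss_res = sum((yi - (b + m1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    sey = math.sqrt(ss_res / df)                 # standard error of estimate
    se1 = sey / math.sqrt(sxx)                   # standard error of slope
    seb = sey * math.sqrt(sum(xi * xi for xi in x) / (n * sxx))
    r2 = 1.0 - ss_res / ss_tot
    f_stat = (ss_tot - ss_res) / (ss_res / df)   # regression F statistic
    return {"m1": m1, "b": b, "se1": se1, "seb": seb,
            "r2": r2, "sey": sey, "F": f_stat, "df": df}


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
stats = linest_like(x, y)
for name, value in stats.items():
    print(name, value)
```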
Uncertainties on Regression
For N = 7 data points, ν = N - 2 = 5 degrees of freedom, α = 0.05, and SEE = sey = 0.218446143:
t = TINV(α = 0.05, ν = 5) = 2.570581835
Confidence interval for regression line: 95% C.I. = TINV(0.05, 5)*SEE/SQRT(7) = 0.212239784
Prediction band for regression line: 95% P.I. = TINV(0.05, 5)*SEE = 0.561533687
Uncertainty in slope: TINV(0.05, 5)*se1 = 0.000895789
Uncertainty in intercept: TINV(0.05, 5)*seb = 0.438558582
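The confidence interval and prediction band arithmetic can be checked with a short Python sketch. It uses the SEE and t-value listed above and N = 7 from the SQRT(7) term; the t-value is entered as a constant because the Python standard library has no equivalent of TINV. Variable names are illustrative.

```python
import math

see = 0.218446143        # standard error of estimate (sey)
t_value = 2.570581835    # TINV(0.05, 5), taken from the listed values
n = 7                    # number of data points, from SQRT(7)

ci_95 = t_value * see / math.sqrt(n)   # confidence interval for the line
pi_95 = t_value * see                  # prediction band for the line

print(f"95% C.I. = {ci_95:.9f}")
print(f"95% P.I. = {pi_95:.9f}")
```

The results agree with the listed values 0.212239784 and 0.561533687 to within rounding of the inputs.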
Not only do you want to obtain a curve-fit relationship, you also want to establish a confidence interval on the equation, that is, a measure of the random uncertainty in the curve fit.
Use ν = N - 2 in determining the t-value. Two degrees of freedom are lost because m1 and b are determined.
$$\mathrm{CI} = \Delta y \approx t_{\alpha,\nu}\frac{S_{yx}}{\sqrt{N}} = t_{\alpha,\nu}\frac{s_{ey}}{\sqrt{N}} = t_{\alpha,\nu}\frac{\mathrm{SEE}}{\sqrt{N}}$$
where $t_{\alpha,\nu} = \mathrm{TINV}(\alpha, \nu)$.
[Figure: torque (N-m) versus RPM, showing the data, the least squares fit, the ±95% CI, and the ±95% prediction band]
Δy, prediction band
Two expressions are available, an approximate one and a more accurate one. The more accurate band is at its minimum at the mean and flares out at the low and high extremes.
Standard deviation of the x values:
$$S_x = \left[\frac{1}{N-1}\sum (x_i - \bar{x})^2\right]^{1/2}$$
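The more accurate band expression itself is not recoverable from the text here. As a sketch only, the code below assumes the standard textbook form, in which the half-width grows with the distance of x from the mean: the confidence interval is t·SEE·[1/N + (x - x̄)²/((N-1)·Sx²)]^(1/2), and the prediction band adds 1 inside the bracket. The function name, data, SEE, and t-value (3.182 for ν = 3 at 95%) are hypothetical inputs.

```python
import math


def band_half_widths(x0, x, see, t_value):
    """Return (CI, PI) half-widths at x0, assuming the standard form."""
    n = len(x)
    x_mean = sum(x) / n
    sx = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / (n - 1))
    # Term that is smallest at the mean of x and grows toward the extremes
    spread = 1.0 / n + (x0 - x_mean) ** 2 / ((n - 1) * sx ** 2)
    ci = t_value * see * math.sqrt(spread)
    pi = t_value * see * math.sqrt(1.0 + spread)
    return ci, pi


# Hypothetical inputs, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
see = 0.1211
t_value = 3.182

ci_mid, pi_mid = band_half_widths(3.0, x, see, t_value)   # at the mean
ci_end, pi_end = band_half_widths(5.0, x, see, t_value)   # at an extreme
print(ci_mid, ci_end, pi_mid, pi_end)
```

At the mean of x this form reduces to t·SEE/√N for the confidence interval, which is why the approximate expression is reasonable near the middle of the data.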
Outlier detection: the method involves computing the ratio of the residuals (predicted - actual) to the standard error of estimate (sey = SEE).
1. Compute the residuals, ypredicted - yactual, at each xi.
2. Plot the ratio residuals/SEE for each xi. These are the standardized residuals.
3. Standardized residuals exceeding 2 in magnitude may be considered outliers. Assuming the residuals are normally distributed, you can expect about 95% of them to lie in the range ±2 (that is, within 2 standard deviations of the best-fit line).
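The three steps above can be sketched as follows. The data set is hypothetical (a line y = 2x with one point shifted upward), and the function name `standardized_residuals` is illustrative.

```python
import math


def standardized_residuals(x, y):
    """Fit y = b + m1*x, then return residuals divided by SEE."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    m1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b = y_mean - m1 * x_mean
    # Step 1: residuals = predicted - actual at each xi
    residuals = [(b + m1 * xi) - yi for xi, yi in zip(x, y)]
    see = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    # Step 2: ratio of residuals to SEE
    return [r / see for r in residuals]


# Hypothetical data: y = 2x except the fourth point, which is shifted up
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 4.0, 6.0, 11.0, 10.0, 12.0, 14.0, 16.0]

z = standardized_residuals(x, y)
# Step 3: flag points whose standardized residual exceeds 2 in magnitude
outliers = [i for i, zi in enumerate(z) if abs(zi) > 2.0]
print(outliers)
```

Only the shifted point (index 3) is flagged. Note that the outlier itself inflates SEE, so the threshold of 2 is a screening guide rather than a strict test.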
Data Transformation
Commonly, test data do not show an approximately linear relationship between the dependent (Y) and independent (X) variables, and a direct linear regression is not useful.
The form of the relationship expected between the dependent and independent variables is often known. The data then need to be transformed prior to performing a linear regression. Transformations can often be accomplished by taking the common or natural logarithm of one or both sides of the equation.
Common Transformations
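Relationship    Plot method                       Linearized form                 Transformed intercept, b   Transformed slope, m1
y = αx^β        Log y vs. Log x (log-log paper)   Log(y) = Log(α) + β Log(x)      Log(α)                     β
y = αx^β        Ln y vs. Ln x                     Ln(y) = Ln(α) + β Ln(x)         Ln(α)                      β
y = αe^(βx)     Log y vs. x (log plot)            Log(y) = Log(α) + β Log(e) x    Log(α)                     β Log(e)
y = αe^(βx)     Ln y vs. x                        Ln(y) = Ln(α) + β x             Ln(α)                      β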
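As an illustration of the transformation approach, the sketch below fits an exponential relationship y = alpha·exp(beta·x) by regressing ln(y) on x, so that the intercept is ln(alpha) and the slope is beta. The function name and the synthetic, noise-free data (generated with alpha = 2, beta = 0.5) are assumptions made for the example.

```python
import math


def fit_exponential(x, y):
    """Fit y = alpha*exp(beta*x) via a straight-line fit of ln(y) vs. x."""
    ln_y = [math.log(yi) for yi in y]
    n = len(x)
    x_mean = sum(x) / n
    ln_y_mean = sum(ln_y) / n
    slope = (sum((xi - x_mean) * (li - ln_y_mean) for xi, li in zip(x, ln_y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = ln_y_mean - slope * x_mean
    alpha = math.exp(intercept)   # undo the transformation of the intercept
    beta = slope                  # slope is beta when natural logs are used
    return alpha, beta


# Synthetic data, for illustration only
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

alpha, beta = fit_exponential(x, y)
print(alpha, beta)
```

Because the least-squares error is minimized in ln(y) rather than in y, the fit weights small y values more heavily than a direct nonlinear fit would.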
Example
A velocity probe provides a voltage output, E, that is related to velocity, U, by the form E = α + βU^γ, where α, β, and γ are constants.
U (ft/s)   0      10     20     30     40
Ei (V)     3.19   3.99   4.30   4.48   4.65
[Figure: voltage data plotted against velocity, ft/s]
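Transformed data, with X = Log(U) and Y = Log(E - 3.19) for the four points with U > 0:
X    1.000    1.301    1.477    1.602
Y    -0.097   0.045    0.111    0.164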
               Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept      -0.525         0.021056315      -24.9274   0.001605   -0.61547736   -0.4342812
X Variable 1   0.432          0.015438034      27.96624   0.001276   0.36531922    0.49816831
Y=-0.525+0.432X
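The fitted line above can be reproduced from the example data. The sketch assumes the transformation X = log10(U), Y = log10(E - 3.19), with the offset 3.19 V taken from the reading at U = 0; that point is left out of the fit because log10(0) is undefined. Variable names are illustrative.

```python
import math

# Data from the example, excluding the U = 0 point
U = [10.0, 20.0, 30.0, 40.0]       # velocity, ft/s
E = [3.99, 4.30, 4.48, 4.65]       # output, V
E_AT_ZERO = 3.19                   # output at U = 0, used as the offset

# Transform so that the relationship becomes a straight line
X = [math.log10(u) for u in U]
Y = [math.log10(e - E_AT_ZERO) for e in E]

# Least-squares straight line Y = intercept + slope*X
n = len(X)
x_mean = sum(X) / n
y_mean = sum(Y) / n
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(X, Y))
         / sum((xi - x_mean) ** 2 for xi in X))
intercept = y_mean - slope * x_mean

# Undo the transformation: multiplier = 10**intercept, exponent = slope
multiplier = 10.0 ** intercept

print(f"Y = {intercept:.3f} + {slope:.3f} X")
print(f"E = {E_AT_ZERO} + {multiplier:.4f} U^{slope:.3f}")
```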
Example 4.10
E = 3.19 + 0.298 U^0.432
[Figure: E (V) versus U (ft/s), showing the data and the fitted curve]
Regression analysis can also be performed in situations where there is more than one independent variable (multiple regression) or for polynomials of an independent variable (polynomial regression).
Polynomial expression: seeks the form
$$Y = b + m_1 x + m_2 x^2 + \cdots + m_k x^k$$
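A polynomial fit can be sketched with the least-squares normal equations, solved here by Gaussian elimination so that no external library is needed. This is one possible approach, not necessarily the one used by a spreadsheet; the function name and the noise-free test data (Y = 1 + 2x + 3x^2) are hypothetical.

```python
def polynomial_fit(x, y, degree):
    """Return [b, m1, ..., mk] minimizing the squared error in y."""
    k = degree + 1
    # Augmented normal equations: sum(x^(i+j)) * c_j = sum(y * x^i)
    a = [[sum(xi ** (i + j) for xi in x) for j in range(k)]
         + [sum(yi * xi ** i for xi, yi in zip(x, y))]
         for i in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution
    coeffs = [0.0] * k
    for i in range(k - 1, -1, -1):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (a[i][k] - known) / a[i][i]
    return coeffs


# Hypothetical data generated from Y = 1 + 2x + 3x^2
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 6.0, 17.0, 34.0, 57.0]
b, m1, m2 = polynomial_fit(x, y, degree=2)
print(b, m1, m2)
```

The normal equations become poorly conditioned for high polynomial degrees, so this direct approach is best kept to low-order fits.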
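Annotated regression output. The summary fields give R2, SEE = sey, and N. The Intercept row corresponds to the intercept b, and the X Variable 1 row corresponds to the slope m1.
               Coefficients   Standard Error   t Stat       P-value      Lower 95%     Upper 95%
Intercept      0.02952381     0.02018228       1.46285828   0.21733392   -0.02651117   0.08555879
X Variable 1   0.99771429     0.01333197       74.8362082   1.9107E-07   0.9606988     1.03472978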
The Lower 95% and Upper 95% columns give the lower and upper bounds for the coefficients. To obtain the ± bound, subtract the lower bound from the upper bound and divide by two.
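Applied to the bounds in the regression output, the arithmetic looks like this. The helper name `plus_minus_bound` is illustrative; the numbers are the Lower 95% and Upper 95% values for the intercept and slope.

```python
def plus_minus_bound(lower, upper):
    """Half the distance between the bounds, i.e. the +/- uncertainty."""
    return (upper - lower) / 2.0


# Lower 95% and Upper 95% values from the regression output
intercept_pm = plus_minus_bound(-0.02651117, 0.08555879)
slope_pm = plus_minus_bound(0.9606988, 1.03472978)

print(f"intercept b = 0.02952381 +/- {intercept_pm:.8f}")
print(f"slope m1    = 0.99771429 +/- {slope_pm:.8f}")
```

The midpoint of each pair of bounds is the coefficient itself, which is a quick check that the correct columns were read.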