Analysis
Jerry Dwi Trijoyo Purnomo
www.its.ac.id
Scatter Plots of Variable Relationships
• Scatter plots for a positive relationship (a), a negative relationship (b), and no relationship between x and y (c).
[Figure: three scatter plots of y vs x — (a) positive, (b) negative, (c) no relationship]
Correlation Coefficient (1/2)
• The strength of the relationship between x and y, when expressed as a linear function, is measured by a quantity called the correlation coefficient.
• −1 ≤ r ≤ 1
Correlation Coefficient (2/2)
• The closer the correlation coefficient is to 1, the stronger and more positive the relationship between the variables x and y.
• Conversely, the closer the correlation coefficient is to −1, the stronger but negative the relationship between x and y.
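To make the definition concrete, here is a minimal sketch in Python with NumPy; the x and y arrays are made-up illustrative data, not from the lecture:

```python
import numpy as np

# Hypothetical illustrative data (roughly linear in x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation: r = S_xy / sqrt(S_xx * S_yy)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r = sxy / np.sqrt(sxx * syy)

print(round(r, 4))  # close to 1: a strong, positive linear relationship
```

A value near 1 here reflects panel (a) of the figure; data sloping downward would give r near −1, and a patternless cloud gives r near 0.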
Simple Linear Regression (1/2)
• A linear regression model involving only one predictor variable x.
• The general form of the model:
  y_i = β0 + β1 x_i + ε_i ;  i = 1, 2, …, n
where
  y_i = response/dependent variable
  x_i = predictor/independent variable
  β0 = intercept
  β1 = slope
  ε_i = residual (error)
• Goal: given y_i and x_i, estimate β0 and β1.
Simple Linear Regression (2/2)
• One of the best-known methods for estimating the regression parameters β0 and β1 is ordinary least squares (OLS).
• The basic idea of the method is to minimize the sum of squared residuals (SSE).
Ordinary Least Squares (OLS) (1/4)
[Figure: scatter of observations (x_i, y_i) with a fitted line and vertical residuals e_1, e_2, …]
OLS (2/4)
• The least squares criterion is defined as:
  S(β0, β1) = Σ_{i=1}^n ε_i² = Σ_{i=1}^n (y_i − β0 − β1 x_i)²
• Setting the partial derivatives to zero at (β̂0, β̂1) gives:
  ∂S/∂β0 |_(β̂0, β̂1) = −2 Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i) = 0
  ∂S/∂β1 |_(β̂0, β̂1) = −2 Σ_{i=1}^n (y_i − β̂0 − β̂1 x_i) x_i = 0
OLS (3/4)
• Simplifying the two equations above yields the least squares normal equations:
  n β̂0 + β̂1 Σ_{i=1}^n x_i = Σ_{i=1}^n y_i
  β̂0 Σ_{i=1}^n x_i + β̂1 Σ_{i=1}^n x_i² = Σ_{i=1}^n x_i y_i
OLS (4/4)
• Solving these normal equations gives:
  β̂0 = ȳ − β̂1 x̄
  β̂1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = S_xy / S_xx
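The closed-form solution above can be sketched as follows (NumPy assumed; the data arrays are hypothetical, chosen only to illustrate the formulas):

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8])

# beta1_hat = S_xy / S_xx, beta0_hat = ybar - beta1_hat * xbar
sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()

# Cross-check against NumPy's general least squares solver
A = np.column_stack([np.ones_like(x), x])
b0_np, b1_np = np.linalg.lstsq(A, y, rcond=None)[0]
assert abs(b0 - b0_np) < 1e-6 and abs(b1 - b1_np) < 1e-6
```

The hand-computed (β̂0, β̂1) and the solver's answer agree, confirming the normal-equation solution.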
Properties of the LS Fit
• Regression sum of squares (SSR):
  SSR = Σ_{i=1}^n (ŷ_i − ȳ)²
• Error sum of squares (SSE):
  SSE = Σ_{i=1}^n (y_i − ŷ_i)²
• Total sum of squares (SST):
  SST = Σ_{i=1}^n (y_i − ȳ)² = SSR + SSE
ANOVA

Source of Variation   SS    DF     MS                 F_h
Regression            SSR   1      MSR = SSR/1        MSR/MSE
Residual              SSE   n−2    MSE = SSE/(n−2)
Total                 SST   n−1
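The decomposition behind this table can be verified numerically. A sketch with hypothetical data (also computing R² = SSR/SST, used on the next slide):

```python
import numpy as np

# Illustrative data (hypothetical, not the lecture's)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.8, 12.1])
n = len(x)

# OLS fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

# ANOVA decomposition
SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((yhat - y.mean()) ** 2)
SSE = np.sum((y - yhat) ** 2)
MSE = SSE / (n - 2)
F = (SSR / 1) / MSE          # F_h = MSR / MSE
R2 = SSR / SST               # coefficient of determination

assert abs(SST - (SSR + SSE)) < 1e-6   # identity SST = SSR + SSE holds
```

A large F relative to the F(1, n−2) distribution indicates a significant regression.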
Coefficient of Determination
• The coefficient of determination (R²) can be used to measure how much of the variation in the response y is explained by the predictor x, provided the F test in the regression analysis is significant.
• Coefficient of determination:
  R² = SSR / SST
• The fit is considered good if R² ≥ 60%.
Test Statistics
• The test statistics used are:
  t_h = β̂0 / se(β̂0),  where se(β̂0) = √( MSE (1/n + x̄²/S_xx) )
and
  t_h = β̂1 / se(β̂1),  where se(β̂1) = √( (SSE/(n−2)) / S_xx ) = √( MSE / S_xx )
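These standard-error formulas can be sketched directly (NumPy; the data are hypothetical illustration only):

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.9, 4.1, 6.3, 7.2, 9.4, 10.6, 12.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
MSE = np.sum(resid ** 2) / (n - 2)

# Standard errors from the slide's formulas
se_b0 = np.sqrt(MSE * (1.0 / n + x.mean() ** 2 / sxx))
se_b1 = np.sqrt(MSE / sxx)

# t statistics for testing beta0 = 0 and beta1 = 0
t0 = b0 / se_b0
t1 = b1 / se_b1
```

Each t statistic is then compared with t_{α/2, n−2}, as described on the following slides.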
p-value
• The p-value is a probability measure of the strength of the evidence for rejecting or failing to reject the null hypothesis (H0).
• The smaller the p-value, the stronger the evidence for rejecting the null hypothesis.
• In practice, we usually compare it with the chosen significance level α.
• If p-value < α, reject H0; if p-value > α, fail to reject H0.
Confidence Interval (CI) (1/2)
• A range between two values with the sample mean exactly at its center.
• A confidence interval can be stated as the probability that, over repeated sampling, the interval built around the sample mean contains the true population mean.
Example: a 95% confidence interval means that if I draw 100 samples, about 95 of the resulting intervals are expected to contain the true population mean.
Confidence Interval (CI) (2/2)
• Thus, 100(1−α)% CIs for β0 and β1 are given by:
  β̂0 − t_{α/2, n−2} se(β̂0) ≤ β0 ≤ β̂0 + t_{α/2, n−2} se(β̂0)
and
  β̂1 − t_{α/2, n−2} se(β̂1) ≤ β1 ≤ β̂1 + t_{α/2, n−2} se(β̂1)
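As a worked check, plugging the rocket propellant estimates reported later in these slides (β̂1 = −37.154, se(β̂1) = 2.889, n = 20, so t_{0.025, 18} = 2.101) into the slope CI formula reproduces the quoted interval:

```python
# 95% CI for beta1 using the rocket propellant output from Example 1
b1_hat, se_b1, t_crit = -37.154, 2.889, 2.101  # t_{0.025, 18} = 2.101

lower = b1_hat - t_crit * se_b1
upper = b1_hat + t_crit * se_b1
print(round(lower, 2), round(upper, 2))  # -43.22 -31.08
```

The interval (−43.22, −31.08) excludes zero, which is one of the rejection criteria for H0 : β1 = 0.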
Hypothesis Test for the Slope
Hypotheses for the simple regression model:
  H0 : β1 = 0
  H1 : β1 ≠ 0
Rejection Criteria for H0
The null hypothesis (H0) is rejected if any one of the following criteria is met:
1. |t_h| > t_{α/2, n−2}
2. p-value < α
3. The CI does not contain zero.
Regression Assumptions
• The errors/residuals must be identically distributed, independent, and normally distributed (i.i.d. normal).
• Identical (constant variance): use the Glejser test.
• Independent: use the Durbin-Watson test.
• Normal: use the Kolmogorov-Smirnov test.
Residual Plot Patterns
[Figure: typical residual plot patterns]
Normality Assumption
• Hypotheses
  H0 : the residuals are normally distributed
  H1 : the residuals are not normally distributed
Example 1
• The Rocket Propellant Data (Montgomery, Peck, and Vining, 2012).
• The data concern rocket thrust, which is thought to be affected by the age of the propellant motor.
• Based on this information:
  y_i = rocket thrust (psi)
  x_i = age of the propellant motor (weeks)
Analysis
• We will carry out an in-depth analysis based on Example 1 above.
• The relationship between thrust and motor age is illustrated by the following scatter plot.
Scatter Plot (1/2)
[Figure: scatter plot of thrust vs motor age]
Regression Parameters
• From these data, the estimated regression parameters are:
  β̂1 = −37.15
  β̂0 = 2627.82
• The regression model:
  ŷ = 2627.82 − 37.15 x
ANOVA

Source of Variation   DF   SS        MS        F       P
Regression            1    1527483   1527483   165.4   0.000
Error                 18   166255    9236
Total                 19   1693738
Hypothesis Test for the Regression Parameters
Hypotheses for the rocket propellant regression model:
  H0 : β1 = 0
  H1 : β1 ≠ 0

Predictor   Coef       SE Coef   T         P
Constant    2627.820   44.180    59.470    0.000
Age         -37.154    2.889     -12.860   0.000

R-Sq = 90.2%   R-Sq (Adj) = 89.6%
Conclusion
H0 is rejected because:
1. |t_h| > t_{α/2, n−2}, i.e., 12.86 > 2.101, or
2. p-value < α, i.e., 0.000 < 0.05, or
3. the CI for β1 does not contain zero: −43.22 ≤ β1 ≤ −31.08.
Residual Assumption Test
• Hypotheses
  H0 : the residuals are normally distributed
  H1 : the residuals are not normally distributed
• The p-value obtained is 0.066, which is greater than α (0.05). Hence we fail to reject H0: the residuals are normally distributed.
Multiple Linear Regression
• The general form of a linear regression model with more than one predictor variable:
  y_i = β0 + β1 x_i1 + β2 x_i2 + … + βp x_ip + ε_i
or, in matrix form:
  y = Xβ + ε
• y = response/dependent variable
• x = predictor/independent variables
• β0, …, βp are the regression parameters
OLS (Matrix Approach)
• The OLS estimator for multiple linear regression is defined as:
  β̂ = (XᵀX)⁻¹ Xᵀy
where

  X = [ 1  x_11  x_12  …  x_1p ]        y = [ y_1 ]
      [ 1  x_21  x_22  …  x_2p ]            [ y_2 ]
      [ ⋮    ⋮     ⋮        ⋮  ]            [  ⋮  ]
      [ 1  x_n1  x_n2  …  x_np ]            [ y_n ]
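The matrix formula can be sketched as follows (NumPy; the two-predictor data are hypothetical):

```python
import numpy as np

# Hypothetical data: two predictors, six observations
X = np.column_stack([
    np.ones(6),                          # intercept column
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],      # x1
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],      # x2
])
y = np.array([3.0, 4.0, 8.0, 9.0, 13.0, 14.0])

# beta_hat = (X^T X)^{-1} X^T y; solve() is numerically safer than inv()
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares routine
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```

Solving the normal equations with `solve` avoids explicitly inverting XᵀX, which is the standard numerical practice.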
ANOVA

Source of Variation   SS    DF       MS                   F_h
Regression            SSR   p        MSR = SSR/p          MSR/MSE
Residual              SSE   n−p−1    MSE = SSE/(n−p−1)
Total                 SST   n−1

  SSR = β̂ᵀXᵀy
  SST = yᵀy
  SSE = yᵀy − β̂ᵀXᵀy
Hypothesis Tests
• Simultaneous test (testing the ANOVA)
Hypotheses:
  H0 : β1 = β2 = … = βp = 0
  H1 : at least one β_j ≠ 0; j = 1, …, p
• Individual/partial test (if the simultaneous test rejects H0):
Hypotheses:
  H0 : β_j = 0
  H1 : β_j ≠ 0; j = 1, …, p
Example 2
• The delivery time data (Montgomery, Peck, and Vining, 2012).
• The data concern the time (in minutes) needed to deliver soft drinks to vending machines. The delivery time (y) is thought to be affected by the number of soft drinks placed in the vending machines (x1) and the distance (in feet) from the company to the vending machine location (x2).
Scatter Plot (1/2)
[Figure 2: scatter plots of y vs quantity (a) and y vs distance (b)]
Scatter Plot (2/2)
• Figure 2 suggests that the relationships between the response and both predictor variables are positive and linear: the larger the quantity delivered and the longer the distance, the longer the delivery time.
ANOVA

Source of Variation   DF   SS        MS        F        P
Regression            2    5550.82   2775.41   165.38   0.000
Error                 22   233.73    10.62
Total                 24   5784.54
Simultaneous Test
Hypotheses:
  H0 : β1 = β2 = … = βp = 0
  H1 : at least one β_j ≠ 0; j = 1, …, p
Partial Test
Hypotheses:
  H0 : β_j = 0
  H1 : β_j ≠ 0; j = 1, …, p
Regression Model
• The regression model for the delivery time data is:
  time = 2.341 + 1.616 quantity + 0.014 distance
Conclusion
• Since every p-value < α (0.05), H0 is rejected: both predictor variables significantly affect the soft drink delivery time to the vending machines.
• The regression coefficients β̂1 and β̂2 are both positive, indicating positive relationships between the two predictors and the response.
• The coefficient of determination is R² = 96%. This is very high, indicating that the predictor variables explain the response very well.
Residual Assumption Test
• Hypotheses
  H0 : the residuals are normally distributed
  H1 : the residuals are not normally distributed
• The p-value obtained is 0.057, which is greater than α (0.05). Hence we fail to reject H0: the residuals are normally distributed.
Transformation and Weighting
Recall that regression model fitting has several implicit assumptions, including the following:
1. The model errors have mean zero and constant variance and are uncorrelated.
2. The model errors have a normal distribution.
3. The form of the model, including the specification of the predictors, is correct.
Transformation to Linearize the Model
Linearizable Functions
Residual Plot
[Figure 5: residual plots for linear regression (a) and quadratic regression (b)]
Transformation on y (The Box-Cox Method)
The Box-Cox Method
• The Box-Cox transformation is defined as
  y_i^(λ) = (y_i^λ − 1) / λ   if λ ≠ 0
          = ln y_i            if λ = 0
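A direct sketch of the transformation (the sample y values are illustrative; note that as λ → 0 the first branch approaches ln y, which is why the definition is continuous in λ):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform: (y**lam - 1)/lam for lam != 0, ln(y) for lam == 0."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam

y = np.array([1.0, 2.0, 4.0, 8.0])
print(box_cox(y, 1.0))   # [0. 1. 3. 7.]  (lam = 1 simply shifts y by -1)
print(box_cox(y, 0.0))   # natural log of y
```

In practice λ is chosen (e.g., by maximum likelihood) to make the transformed residuals closer to normal with constant variance.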
Example 3 (1/2)
• The restaurant monthly income data.
• Response (y): income per month; predictor (x): advertising expense.
• The regression model:
  ŷ = 49443.38 + 8.05 x
Example 3 (2/2)
WLS (2/3)
• When the errors ε are uncorrelated but have unequal variances, so that the covariance matrix of ε is
  σ²V = σ² diag(1/w_1, 1/w_2, …, 1/w_n),
the estimation procedure is usually called weighted least squares. Let W = V⁻¹. Since V is a diagonal matrix, W is also diagonal, with diagonal elements (weights) w_1, w_2, …, w_n.
WLS (3/3)
• The WLS estimator is defined as
  β̂ = (BᵀB)⁻¹ Bᵀz = (XᵀWX)⁻¹ XᵀWy
where B = W^{1/2}X and z = W^{1/2}y, i.e.,

  B = [ √w_1  √w_1 x_11  …  √w_1 x_1p ]        z = [ √w_1 y_1 ]
      [ √w_2  √w_2 x_21  …  √w_2 x_2p ]            [ √w_2 y_2 ]
      [   ⋮       ⋮             ⋮     ]            [    ⋮     ]
      [ √w_n  √w_n x_n1  …  √w_n x_np ]            [ √w_n y_n ]
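The equivalence β̂ = (BᵀB)⁻¹Bᵀz = (XᵀWX)⁻¹XᵀWy can be checked numerically; the data and weights below are hypothetical:

```python
import numpy as np

# Hypothetical heteroscedastic setup: weight w_i proportional to 1/Var(e_i)
X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([2.2, 3.9, 6.4, 7.8, 10.1])
w = np.array([4.0, 2.0, 1.0, 0.5, 0.25])   # larger variance -> smaller weight

W = np.diag(w)
# Direct WLS: beta_hat = (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent formulation: OLS on B = W^{1/2} X and z = W^{1/2} y
B = np.sqrt(w)[:, None] * X
z = np.sqrt(w) * y
beta_b = np.linalg.lstsq(B, z, rcond=None)[0]
assert np.allclose(beta_wls, beta_b)
```

Pre-multiplying by W^{1/2} rescales each observation so that the transformed errors have constant variance, reducing WLS to ordinary least squares.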
ANOVA

Source of Variation   SS    DF       MS                   F_h
Regression            SSR   p        MSR = SSR/p          MSR/MSE
Residual              SSE   n−p−1    MSE = SSE/(n−p−1)
Total                 SST   n−1

  SSR = β̂ᵀBᵀz
  SST = yᵀWy
  SSE = yᵀWy − β̂ᵀBᵀz
Example 4 (1/3)
• Consider the restaurant monthly income data.
• Response (y): income per month; predictor (x): advertising expense.
• The regression model:
  ŷ = 49443.38 + 8.05 x
Scatter Plot
Residual Plot (1/2)
M-Estimators (IRLS) (1/6)
• M-estimators are a class of robust estimators that minimize a function ρ of the residuals:
  min_β Σ_{i=1}^n ρ(e_i) = min_β Σ_{i=1}^n ρ( y_i − Σ_{j=0}^k x_ij β_j )   (1)
where M stands for maximum likelihood. That is, the function ρ is related to the likelihood function for an appropriate choice of the error distribution. For instance, if the method of LS is used (implying that the error distribution is normal), then
  ρ(z) = z²/2,  −∞ < z < ∞
M-Estimators (IRLS) (2/6)
• The M-estimator is not necessarily scale invariant (i.e., if the errors y_i − Σ_{j=0}^k x_ij β_j were multiplied by a constant, the new solution to Eq. (1) might not be the same as the old one).
M-Estimators (IRLS) (4/6)
• Here ψ = ρ′, x_ij is the ith observation on the jth regressor, and x_i0 = 1. In general ψ is nonlinear, and Eq. (3) must be solved by iterative methods. Iteratively reweighted least squares (IRLS) is the most widely used approach, usually attributed to Beaton and Tukey (1974).
• To apply IRLS, we write Eq. (3) as
  Σ_{i=1}^n x_ij ψ( (y_i − Σ_{j=0}^k x_ij β̂_j⁰) / s ) = Σ_{i=1}^n x_ij w_i⁰ ( y_i − Σ_{j=0}^k x_ij β_j ) = 0   (4)
M-Estimators (IRLS) (5/6)
where
  w_i⁰ = ψ[ (y_i − Σ_{j=0}^k x_ij β̂_j⁰) / s ] / [ (y_i − Σ_{j=0}^k x_ij β̂_j⁰) / s ]   if y_i ≠ Σ_{j=0}^k x_ij β̂_j⁰
       = 1                                                                            if y_i = Σ_{j=0}^k x_ij β̂_j⁰
  (5)
M-Estimators (IRLS) (6/6)
In matrix notation, Eq. (4) becomes
  XᵀW⁰Xβ = XᵀW⁰y   (6)
where W⁰ is an n × n diagonal matrix of weights with diagonal elements given by Eq. (5). We recognize Eq. (6) as the usual WLS normal equations. Consequently, the one-step estimator is
  β̂¹ = (XᵀW⁰X)⁻¹ XᵀW⁰y
At the next step we recompute the weights from Eq. (5), but using β̂¹ instead of β̂⁰. Iteration proceeds until convergence (usually only a few iterations).
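A minimal IRLS sketch of the iteration above. The Huber ψ function, its tuning constant 1.345, and the fixed MAD scale estimate are illustrative assumptions — the slides do not fix a particular ψ:

```python
import numpy as np

def huber_weight(u, c=1.345):
    """w(u) = psi(u)/u for Huber's psi: 1 if |u| <= c, else c/|u|."""
    au = np.abs(u)
    return np.where(au <= c, 1.0, c / np.maximum(au, 1e-12))

# Hypothetical data with one gross outlier in the last response
X = np.column_stack([np.ones(8), np.arange(1.0, 9.0)])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 40.0])

beta = np.linalg.lstsq(X, y, rcond=None)[0]         # beta_hat^0: start from OLS
slope_ols = beta[1]
r0 = y - X @ beta
s = 1.4826 * np.median(np.abs(r0 - np.median(r0)))  # robust MAD scale, held fixed

for _ in range(100):
    w = huber_weight((y - X @ beta) / s)            # weights from Eq. (5)
    W = np.diag(w)
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # WLS step, Eq. (6)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new
# The robust slope is pulled back toward the bulk of the data,
# since the outlier receives a weight well below 1.
```

Each pass is exactly one WLS solve with refreshed weights, which is why convergence typically takes only a few iterations.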
IRLS Model
Residual Plot
Comparison OLS vs IRLS (2/3)
• Note that by taking heteroscedasticity into account, the slope is lowered somewhat, as the regression line is less influenced by the highly variable points on the right.
• However, by more accurately modeling the distribution of the outcome, the standard deviation of our estimate is reduced from 0.326 to 0.264.
• Consequently, we even obtain a larger test statistic, despite the fact that the estimate is closer to zero.
• Furthermore, R² increases from 0.96 to 0.97.
Comparison OLS vs IRLS (3/3)
Figure 11. Scatter plot of restaurant income with fit line (OLS vs WLS)
Dummy Variable
• The variables considered in regression equations usually can take values over some continuous range. Occasionally we must introduce a factor that has two or more distinct levels. For instance, data may arise from three machines, or two factories, or six operators. In such a case we cannot set up a continuous scale for the variable "machine" or "factory" or "operator".
• We must assign these variables some levels in order to take account of the fact that the various machines, factories, or operators may have separate deterministic effects on the response.
• Variables of this sort are usually called dummy variables.
Main Discussion
• Dummy variables to separate blocks of data with different intercepts but the same model.
Notation
• Suppose there are two types of machines (types A and B, say) that produce different levels of response. Values of a dummy variable Z can be assigned as follows:
  Z = 0 if the observation is from machine A
  Z = 1 if the observation is from machine B
• Other possibilities:
  Z = −n_2 / √( n_1 n_2 (n_1 + n_2) )   for machine A
  Z =  n_1 / √( n_1 n_2 (n_1 + n_2) )   for machine B
where n_1 observations come from type A machines and n_2 = n − n_1 from type B machines.
Matrix Form
• To see how one representation is derived by linear combination from another, we must count the dummy X0 column of the X matrix:

  (X0, Z):
    1  0
    ⋮  ⋮    n_1 rows (machine A)
    1  0
    1  1
    ⋮  ⋮    n_2 rows (machine B)
    1  1
How Many Dummies?
• In the above example with two categories (machines A and B), we needed to construct one dummy column in addition to X0. So two groups require two dummies, including X0.
Three Categories, Three Dummies
• To take account of three different categories, two extra dummies (besides X0) are needed. The simplest way is to use
  (Z1, Z2) = (1, 0) for machine A
           = (0, 1) for machine B
           = (0, 0) for machine C
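Constructing this (Z1, Z2) coding in code (a sketch; the machine labels and their order are illustrative):

```python
# Dummy coding for three categories, with machine C as the (0, 0) baseline
machines = ["A", "A", "B", "C", "C", "B"]  # hypothetical observation labels

Z1 = [1 if m == "A" else 0 for m in machines]
Z2 = [1 if m == "B" else 0 for m in machines]
print(Z1)  # [1, 1, 0, 0, 0, 0]
print(Z2)  # [0, 0, 1, 0, 0, 1]
```

Only two dummy columns are needed because the baseline category is identified by (Z1, Z2) = (0, 0); adding a third column would make the design matrix collinear with the X0 column of ones.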
Matrix Form

      X0   (other X's)   Z1   Z2
      1                  1    0
      1                  1    0    Group A
      1                  1    0
      1                  0    1
  X = 1                  0    1    Group B
      1                  0    1
      1                  0    0
      1                  0    0    Group C
      1                  0    0
Example (Turkey Data)
• The data show turkey weight (y) in pounds, and ages (X) in weeks, of 13 Thanksgiving turkeys. Four of these turkeys were reared in Georgia (G), four in Virginia (V), and five in Wisconsin (W). We would like to relate y to X via a simple straight-line model, but the different origins of the turkeys may cause a problem.
Turkey Data
X y Origin Z1 Z2
28 13.3 G 1 0
20 8.9 G 1 0
32 15.1 G 1 0
22 10.4 G 1 0
29 13.1 V 0 1
27 12.4 V 0 1
28 13.2 V 0 1
26 11.8 V 0 1
21 11.5 W 0 0
27 14.2 W 0 0
29 15.4 W 0 0
23 13.1 W 0 0
25 13.8 W 0 0
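The dummy-variable model can be fit directly from the table above (a sketch using NumPy's least squares; the coefficient signs should match the fitted equation reported on the Model slide, where the W turkeys have the highest intercept):

```python
import numpy as np

# Turkey data from the table: age X, weight y, and dummies Z1 (G), Z2 (V)
X_age = np.array([28, 20, 32, 22, 29, 27, 28, 26, 21, 27, 29, 23, 25], dtype=float)
y = np.array([13.3, 8.9, 15.1, 10.4, 13.1, 12.4, 13.2, 11.8,
              11.5, 14.2, 15.4, 13.1, 13.8])
Z1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
Z2 = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Fit y = b0 + b1*X + a1*Z1 + a2*Z2 by OLS
M = np.column_stack([np.ones_like(y), X_age, Z1, Z2])
b0, b1, a1, a2 = np.linalg.lstsq(M, y, rcond=None)[0]
print(round(b0, 2), round(b1, 4), round(a1, 2), round(a2, 2))
```

A common slope b1 is shared by all three origins, while a1 and a2 shift the intercept for the G and V turkeys relative to the W baseline.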
Minitab Output
Model
• The linear regression model with dummy variables:
  y = β0 + β1 X + α1 Z1 + α2 Z2 + ε
• The fitted equation is
  ŷ = 1.43 + 0.4868 X − 1.92 Z1 − 2.19 Z2
• For the three different origins:
  ŷ = −0.49 + 0.4868 X   for G
  ŷ = −0.76 + 0.4868 X   for V
  ŷ = 1.43 + 0.4868 X    for W
Scatter Plot
[Figure: scatter plots of turkey weights vs ages, points labeled by origin (G, V, W)]