DM 09
DM 09
Hati-hati
Koefisien korelasi = 0 bukan berarti tidak ada hubungan antara kedua variable. Yang benar adalah: tidak ada
hubungan LINIER, tapi bisa jadi ada hubungan dalam bentuk lain; misal: kuadratik, atau fungsi lain selain linier,
seperti pada contoh di atas.
Interpretasi
Nilai ~0.95 menunjukkan bahwa ada korelasi linier positif yang kuat antara usia dan tekanan darah. Ada
kecenderungan bahwa usia tinggi berkaitan dengan tekanan darah yang kebih tinggi dibandingkan usia rendah.
WARNING
Korelasi tidak sama (meng-implikasikan) dengan sebab akibat. Perhatikan interpretasi di atas. Tidak
dinyatakan bahwa jika usia tinggi maka tekanan darah rendah, hanya suatu tren atau kecenderungan. Mungkin
saja usia dengan bertambahnya usia maka tekanan darah meningkat, tapi mungkin juga tekanan darah tinggi
bukan karena usia, tapi faktor lain yang tidak teramati pada data.
Contoh lain penelitian di Machine learning (kecantikan dan confidence/Panjang Jari dan IQ)
Korelasi dan Sebab-Akibat
Penilaian Kualitatif terhadap nilai korelasi seperti ini? ... Really? Why? Why
not?
In [1]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
plt.style.use('bmh'); sns.set()
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138,
142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
Out[1]:
usia tekanan_darah
0 40 126
1 45 124
2 50 135
3 53 138
4 60 142
5 65 139
6 69 140
7 71 151
In [2]:
# Korelasi dan Scatter Plot untuk melihat datanya
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance = 76.953125
Correlations = [[1. 0.88746015]
[0.88746015 1. ]]
In [3]:
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True,
annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
usia tekanan_darah
usia 1.00000 0.88746
tekanan_darah 0.88746 1.00000
Interpretasi
Nilai 0.95 menunjukkan bahwa ada korelasi linier positif yang kuat antara usia dan tekanan darah. Ada
kecenderungan bahwa usia tinggi berkaitan dengan tekanan darah yang kebih tinggi dibandingkan usia rendah.
WARNING
Korelasi tidak sama dengan sebab akibat. Perhatikan interpretasi di atas. Tidak dinyatakan bahwa jika usia
tinggi maka tekanan darah rendah, hanya suatu tren atau kecenderungan. Mungkin saja usia dengan
bertambahnya usia maka tekanan darah meningkat, tapi mungkin juga tekanan darah tinggi bukan karena usia,
tapi faktor lain yang tidak teramati pada data.
Sampai sini kita memahami kalau keduanya berhubungan, tapi seperti apa
hubungannya kita masih belum bisa ketahui (lewat korelasi). Itulah Mengapa kita
perlu Model Regresi.
Korelasi ke Regresi
Bagaimana menghitung parameter Regresi yang Optimal?
Hati-hati,... perhatikan rumusnya dengan baik .... ia tidak robust terhadap outlier
^y=β0+β1x1+...+βnxny^=β0+β1x1+...+βnxn
MSE = total jarak/selisih antara prediksi dan nilai dari data (sesungguhnya)
RMSE = √ MSE MSE ... why?
Evaluasi penting ketika kita ingin melakukan prediksi
Regresi VS Interpolasi
2. Operation Efficiency: Optimasi proses bisnis dengan melihat hubungan antar variabel dan mengambil policy
berdasarkan hubungan tersebut.
3. Supporting Decisions: Testing hypothesis, misal terkait keuangan, operations dan customer purchases.
4. New Insights: Regresi dapat membantu menganalisa hubungan antar variabel dan sekaligus mem-filternya.
Sumber: https://www.newgenapps.com/blog/business-applications-uses-regression-analysis-advantages
In [4]:
# Fitting model Regresi Sederhana
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm, scipy.stats as stats
Df Model: 1
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [5]:
# Plot the Data
p = sns.regplot(df.usia, df.tekanan_darah)
SSR=SST−SSE=∑(yi−¯y)2−∑(yi−^yi)2SSR=SST−SSE=∑(yi−y¯)2−∑(yi−yi^)2
Regresi Non-Linier?
Mengapa?
Kapan tidak disarankan menambah kompleksitas model?
Regression for insights VS regression for prediction.
In [6]:
# Loading Data Sampel dari Modul
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specifie
d.
In [9]:
# Non Linear transformation
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: Lottery R-squared (uncentered): 0.799
Model: OLS Adj. R-squared (uncentered): 0.794
Method: Least Squares F-statistic: 165.2
Date: Sun, 15 Nov 2020 Prob (F-statistic): 1.16e-29
Time: 12:07:37 Log-Likelihood: -384.16
No. Observations: 85 AIC: 772.3
Df Residuals: 83 BIC: 777.2
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120
Wealth 0.5853 0.089 6.571 0.000 0.408 0.762
==============================================================================
Omnibus: 4.188 Durbin-Watson: 1.892
Prob(Omnibus): 0.123 Jarque-Bera (JB): 4.034
Skew: -0.480 Prob(JB): 0.133
Kurtosis: 2.533 Cond. No. 25.8
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specifie
d.
Studi Kasus (Boston House Pricing) - Another Property Case Study
Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
In [10]:
# Loading Data
from sklearn.datasets import load_boston
boston = load_boston()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [11]:
# Deskripsi Data
print(boston.DESCR)
.. _boston_dataset:
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon Un
iversity.
The Boston house-price data has been used in many machine learning papers that address re
gression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sou
rces of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings
on the Tenth International Conference of Machine Learning, 236-243, University of Massac
husetts, Amherst. Morgan Kaufmann.
In [12]:
bos.describe(include='all')
Out[12]:
PTRAT
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX B LSTAT PRICE
IO
co
506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0
un
00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000
t
me 3.613 11.36 11.13 0.069 0.554 6.284 68.57 3.795 9.549 408.2 18.45 356.6 12.65 22.53
an 524 3636 6779 170 695 634 4901 043 407 37154 5534 74032 3063 2806
8.601 23.32 6.860 0.253 0.115 0.702 28.14 2.105 8.707 168.5 2.164 91.29 7.141 9.197
std
545 2453 353 994 878 617 8861 710 259 37116 946 4864 062 104
mi 0.006 0.000 0.460 0.000 0.385 3.561 2.900 1.129 1.000 187.0 12.60 0.320 1.730 5.000
n 320 000 000 000 000 000 000 600 000 00000 0000 000 000 000
25 0.082 0.000 5.190 0.000 0.449 5.885 45.02 2.100 4.000 279.0 17.40 375.3 6.950 17.02
% 045 000 000 000 000 500 5000 175 000 00000 0000 77500 000 5000
50 0.256 0.000 9.690 0.000 0.538 6.208 77.50 3.207 5.000 330.0 19.05 391.4 11.36 21.20
% 510 000 000 000 000 500 0000 450 000 00000 0000 40000 0000 0000
75 3.677 12.50 18.10 0.000 0.624 6.623 94.07 5.188 24.00 666.0 20.20 396.2 16.95 25.00
% 083 0000 0000 000 000 500 5000 425 0000 00000 0000 25000 5000 0000
ma 88.97 100.0 27.74 1.000 0.871 8.780 100.0 12.12 24.00 711.0 22.00 396.9 37.97 50.00
x 6200 00000 0000 000 000 000 00000 6500 0000 00000 0000 00000 0000 0000
In [13]:
p = sns.pairplot(bos)
Checking Correlations between Predictors
In [14]:
# HeatMap untuk menyelidiki korelasi
corr2 = bos.corr() # We already examined SalePrice correlations
plt.figure(figsize=(12, 10))
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specifie
d.
Returns:
--------
model: an "optimal" fitted statsmodels linear model
with an intercept
selected by forward selection
evaluated by adjusted R-squared
"""
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = 0.0, 0.0
while remaining and current_score == best_new_score:
scores_with_candidates = []
for candidate in remaining:
formula = "{} ~ {} + 1".format(response,
' + '.join(selected + [candidate]))
score = smf.ols(formula, data).fit().rsquared_adj
scores_with_candidates.append((score, candidate))
scores_with_candidates.sort()
best_new_score, best_candidate = scores_with_candidates.pop()
if current_score < best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
model = smf.ols(formula, data).fit()
return model
In [18]:
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1
0.7348057723274566
In [19]:
# Interpretasi koefisien?
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.741
Model: OLS Adj. R-squared: 0.735
Method: Least Squares F-statistic: 128.2
Date: Sun, 15 Nov 2020 Prob (F-statistic): 5.54e-137
Time: 12:15:28 Log-Likelihood: -1498.9
No. Observations: 506 AIC: 3022.
Df Residuals: 494 BIC: 3072.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.3411 5.067 7.171 0.000 26.385 46.298
LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429
RM 3.8016 0.406 9.356 0.000 3.003 4.600
PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693
DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128
NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430
CHAS 2.7187 0.854 3.183 0.002 1.040 4.397
B 0.0093 0.003 3.475 0.001 0.004 0.015
ZN 0.0458 0.014 3.390 0.001 0.019 0.072
CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044
RAD 0.2996 0.063 4.726 0.000 0.175 0.424
TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005
==============================================================================
Omnibus: 178.430 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 787.785
Skew: 1.523 Prob(JB): 8.60e-172
Kurtosis: 8.300 Cond. No. 1.47e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specifie
d.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 0.641607 4.0900 1.0 0.208015 15.3 1.000000 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 0.782698 4.9671 2.0 0.104962 17.8 1.000000 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 0.599382 4.9671 2.0 0.104962 17.8 0.989737 4.03 34.7
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
3 0.03237 0.0 2.18 0.0 0.458 6.998 0.441813 6.0622 3.0 0.066794 18.7 0.994276 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 0.528321 6.0622 3.0 0.066794 18.7 1.000000 5.33 36.2
0 1 10 9.17 1
1 2 1 1.32 0
2 3 12 8.54 1
3 4 12 7.68 1
No Iklan Laba Tipe
4 5 5 7.15 1
In [22]:
p = sns.pairplot(df, hue="Tipe")
In [ ]:
# Do Modelling Here ... Don't forget to interpret.
Belum dibahas:
1. Logistic Regression [akan dibahas saat Topik Klasifikasi]
2. Piecewise Regression (Non Linear)
3. Probit/Tobit Regression (Probabilistic)
4. Bayesian Regressian
5. Logic Regression (lebih robust dari logistic regression utk Fraud Detection)
6. Quantile regression (extreme events)
7. LAD regression (L1)
8. Jackknife regression
9. SVR
10. ARIMA (Time Series)
11. Ecologic Regression (ada hirarki di variab
image Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Referensi DM-09:
1. A. Seber and A.J. Lee, Linear Regression Analysis (2nd Ed), 2003, John Wiley & Sons.
2. P. McCullagh and J.A. Nelder, Generalized Linear Models (2nd Ed), 1989, Chapman & Hall.