DM 09

Data Mining – 09: Pendahuluan Korelasi dan Regresi
Model Linear Regresi

 Covariance ==> Korelasi
 Korelasi ==> Regresi
 Regresi VS Interpolasi
 Induktif Bias dan Training-Testing Data
 Evaluasi: RMSE & R2R2
 Optimal Parameter
 Variabel Kategori di regresi
 Cross Validasi
 Feature Selection (Forward, Backward, stepwise) dan Regresi Lasso
 Transformasi variable
 Non-Linear Regression
Berawal dari Pusat data dan Variansi
Perhatikan makna rumus/formula variansi, lalu bandingkan dengan formula "covariansi"

berikut:
Variance ke Covariance: Menghitung Hubungan Linear antara 2 variabel

 How? Bagaimana cara kerjanya? (Statistical Thinking)
 Konsepnya: "Co-Vary" sama-sama bervariansi menjauh dari rata-rata.
 Gunakan "reverse" thinking untuk memahaminya.
 Penggunaan: cov(x,y) = 2 VS cov(x,y) = -2 VS Cov(x,y) = 0
 Covariance = 3000? Apa artinya?
Covariance ke korelasi: Statistical Thinking

 Korelasi sebenarnya adalah Covariance dibagi dengan masing-masing standar deviasinya.
 Apa maksud/maknanya?
 Covariance punya makna geometric .... ia adalah
Cosine!... https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
Nilai koefisien korelasi (Linear) "Pearson"
 Nilai dari koefisien korelasi Pearson adalah dari -1 hingga +1.
Hati-hati
 Koefisien korelasi = 0 bukan berarti tidak ada hubungan antara kedua variable. Yang benar adalah: tidak ada
hubungan LINIER, tapi bisa jadi ada hubungan dalam bentuk lain; misal: kuadratik, atau fungsi lain selain linier,
seperti pada contoh di atas.
Interpretasi
 Nilai ~0.95 menunjukkan bahwa ada korelasi linier positif yang kuat antara usia dan tekanan darah. Ada
kecenderungan bahwa usia tinggi berkaitan dengan tekanan darah yang kebih tinggi dibandingkan usia rendah.
 WARNING
 Korelasi tidak sama (meng-implikasikan) dengan sebab akibat. Perhatikan interpretasi di atas. Tidak
dinyatakan bahwa jika usia tinggi maka tekanan darah rendah, hanya suatu tren atau kecenderungan. Mungkin
saja usia dengan bertambahnya usia maka tekanan darah meningkat, tapi mungkin juga tekanan darah tinggi
bukan karena usia, tapi faktor lain yang tidak teramati pada data.
 Contoh lain penelitian di Machine learning (kecantikan dan confidence/Panjang Jari dan IQ)
Korelasi dan Sebab-Akibat
Penilaian Kualitatif terhadap nilai korelasi seperti ini? ... Really? Why? Why
not?
[image Source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient]
 Cases (social, medicine, etc)

 Objective, prediction vs insights.
Contoh kasus sederhana
In [1]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
plt.style.use('bmh'); sns.set()
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138,
142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
Out[1]:
usia tekanan_darah
0 40 126
1 45 124
2 50 135
3 53 138
4 60 142
5 65 139
6 69 140
7 71 151
In [2]:
# Korelasi dan Scatter Plot untuk melihat datanya
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance = 76.953125
Correlations = [[1. 0.88746015]
[0.88746015 1. ]]
In [3]:
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True,
annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
usia tekanan_darah
usia 1.00000 0.88746
tekanan_darah 0.88746 1.00000
Interpretasi
 Nilai 0.95 menunjukkan bahwa ada korelasi linier positif yang kuat antara usia dan tekanan darah. Ada
kecenderungan bahwa usia tinggi berkaitan dengan tekanan darah yang kebih tinggi dibandingkan usia rendah.
WARNING
 Korelasi tidak sama dengan sebab akibat. Perhatikan interpretasi di atas. Tidak dinyatakan bahwa jika usia
tinggi maka tekanan darah rendah, hanya suatu tren atau kecenderungan. Mungkin saja usia dengan
bertambahnya usia maka tekanan darah meningkat, tapi mungkin juga tekanan darah tinggi bukan karena usia,
tapi faktor lain yang tidak teramati pada data.
Sampai sini kita memahami kalau keduanya berhubungan, tapi seperti apa
hubungannya kita masih belum bisa ketahui (lewat korelasi). Itulah Mengapa kita
perlu Model Regresi.
Pendahuluan Model Regresi

 Digunakan saat variabel tak bebas (Dependent variable - Y) bertipe numerik (float/real) dan variabel
bebasnya bisa numerik dan-atau kategorik
Regresi Linier Sederhana
Korelasi ke Regresi
Bagaimana menghitung parameter Regresi yang Optimal?
Kenapa rumusnya seperti ini?

Pentingnya memahami "Loss Function"
Evaluasi Error (Mean Squared Error)
 Hati-hati,... perhatikan rumusnya dengan baik .... ia tidak robust terhadap outlier
 ^y=β0+β1x1+...+βnxny^=β0+β1x1+...+βnxn
 MSE = total jarak/selisih antara prediksi dan nilai dari data (sesungguhnya)
 RMSE = √ MSE MSE ... why?
 Evaluasi penting ketika kita ingin melakukan prediksi
Bahas sebentar pentingnya memahami "Loss Function"

 Persamaan/Model Linier adalah dasar terpenting di Statistika, Data Science, Machine Learning, dan Deep
Learning (*).
 Banyak model di (*) sebenarnya adalah fungsi linier, bahkan di masalah klasifikasi.
 Yang membedakan adalah "pemodelan/optimasi masalah/Loss Functionnya"
Regresi VS Interpolasi
Beberapa contoh aplikasi regresi

1. Predictive Analytics: Memprediksi resiko, harga, penjualan, demand, dsb.
2. Operation Efficiency: Optimasi proses bisnis dengan melihat hubungan antar variabel dan mengambil policy
berdasarkan hubungan tersebut.
3. Supporting Decisions: Testing hypothesis, misal terkait keuangan, operations dan customer purchases.
4. New Insights: Regresi dapat membantu menganalisa hubungan antar variabel dan sekaligus mem-filternya.
Sumber: https://www.newgenapps.com/blog/business-applications-uses-regression-analysis-advantages
In [4]:
# Fitting model Regresi Sederhana
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm, scipy.stats as stats
lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()

lm.summary()
# 1. F-Stat.
#.2. Uji Koef model
#.3. R^2
#.4. Interpretasi Model
#.5. Durbin-Watson ==> Time Series?
Out[4]:
OLS Regression Results
Dep. Variable: tekanan_darah R-squared: 0.788
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 22.25
Date: Sun, 15 Nov 2020 Prob (F-statistic): 0.00327
Time: 12:07:17 Log-Likelihood: -21.920
No. Observations: 8 AIC: 47.84
Df Residuals: 6 BIC: 48.00
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 98.5623 8.266 11.924 0.000 78.337 118.788
usia 0.6766 0.143 4.717 0.003 0.326 1.028
Omnibus: 3.192 Durbin-Watson: 2.005

Prob(Omnibus): 0.203 Jarque-Bera (JB): 1.016
Skew: -0.340 Prob(JB): 0.602
Kurtosis: 1.392 Cond. No. 311.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [5]:
# Plot the Data
p = sns.regplot(df.usia, df.tekanan_darah)
Evaluasi R2: Model VS Tidak Pakai Model?

Adjusted R-Squared? Why?
Pengaruh Variabel Tak Bebas ke Model
 SSR=SST−SSE=∑(yi−¯y)2−∑(yi−^yi)2SSR=SST−SSE=∑(yi−y¯)2−∑(yi−yi^)2
Perfect/true-best model tidak ada, bahkan seringnya tidak diperlukan

Induktif Bias Sebagai Dasar Penting untuk Mengerti SEMUA model Data
Science dan Machine Learning
image source: https://sgfin.github.io/2020/06/22/Induction-Intro/
[image source: https://www.slideshare.net/mahakvijay3/basics-of-regression-analysis]
Regresi Non-Linier?
Mengapa?
Kapan tidak disarankan menambah kompleksitas model?
Regression for insights VS regression for prediction.
Masih linear terhadap parameter

[image source: https://sites.google.com/site/apphysics1online/appendices/2-data-analysis/graph-linearization ]
In [6]:
# Loading Data Sampel dari Modul
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
Out[6]:
( Lottery Literacy Wealth Region
0 41 37 73 E
1 38 51 22 N
2 66 13 61 C
3 80 46 76 E
4 79 69 83 E,
(85, 4),
{'C', 'E', 'N', 'S', 'W'})
In [8]:
# Set "Region" sebagai variabel Kategorik
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept 38.651655
C(Region)[T.E] -15.427785
C(Region)[T.N] -10.016961
C(Region)[T.S] -4.548257
C(Region)[T.W] -10.091276
Literacy -0.185819
Wealth 0.451475
dtype: float64
==============================================================================
Dep. Variable: Lottery R-squared: 0.338
Date: Sun, 15 Nov 2020 Prob (F-statistic): 1.07e-05
Df Model: 6
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 38.6517 9.456 4.087 0.000 19.826 57.478
C(Region)[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938
C(Region)[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419
C(Region)[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943
C(Region)[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235
Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232
Wealth 0.4515 0.103 4.390 0.000 0.247 0.656
==============================================================================
Skew: -0.340 Prob(JB): 0.260
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specifie
d.
In [9]:
# Non Linear transformation
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
=======================================================================================
Dep. Variable: Lottery R-squared (uncentered): 0.799
Model: OLS Adj. R-squared (uncentered): 0.794
Df Model: 2
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120
Wealth 0.5853 0.089 6.571 0.000 0.408 0.762
==============================================================================
Skew: -0.480 Prob(JB): 0.133
Kurtosis: 2.533 Cond. No. 25.8
==============================================================================
Warnings:
d.
Studi Kasus (Boston House Pricing) - Another Property Case Study
Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
In [10]:
# Loading Data
from sklearn.datasets import load_boston
boston = load_boston()
# Convert ke Pandas Dataframe

bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
Out[10]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [11]:
# Deskripsi Data
print(boston.DESCR)
.. _boston_dataset:
Boston house prices dataset

---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14)

is usually the target.
:Attribute Information (in order):

- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon Un
iversity.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic

prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address re
gression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sou
rces of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings
on the Tenth International Conference of Machine Learning, 236-243, University of Massac
husetts, Amherst. Morgan Kaufmann.
In [12]:
bos.describe(include='all')
Out[12]:
PTRAT
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX B LSTAT PRICE
IO
co
506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0 506.0
un
00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000
t
me 3.613 11.36 11.13 0.069 0.554 6.284 68.57 3.795 9.549 408.2 18.45 356.6 12.65 22.53
an 524 3636 6779 170 695 634 4901 043 407 37154 5534 74032 3063 2806
8.601 23.32 6.860 0.253 0.115 0.702 28.14 2.105 8.707 168.5 2.164 91.29 7.141 9.197
std
545 2453 353 994 878 617 8861 710 259 37116 946 4864 062 104
mi 0.006 0.000 0.460 0.000 0.385 3.561 2.900 1.129 1.000 187.0 12.60 0.320 1.730 5.000
n 320 000 000 000 000 000 000 600 000 00000 0000 000 000 000
25 0.082 0.000 5.190 0.000 0.449 5.885 45.02 2.100 4.000 279.0 17.40 375.3 6.950 17.02
% 045 000 000 000 000 500 5000 175 000 00000 0000 77500 000 5000
50 0.256 0.000 9.690 0.000 0.538 6.208 77.50 3.207 5.000 330.0 19.05 391.4 11.36 21.20
% 510 000 000 000 000 500 0000 450 000 00000 0000 40000 0000 0000
75 3.677 12.50 18.10 0.000 0.624 6.623 94.07 5.188 24.00 666.0 20.20 396.2 16.95 25.00
% 083 0000 0000 000 000 500 5000 425 0000 00000 0000 25000 5000 0000
ma 88.97 100.0 27.74 1.000 0.871 8.780 100.0 12.12 24.00 711.0 22.00 396.9 37.97 50.00
x 6200 00000 0000 000 000 000 00000 6500 0000 00000 0000 00000 0000 0000
In [13]:
p = sns.pairplot(bos)
Checking Correlations between Predictors
In [14]:
# HeatMap untuk menyelidiki korelasi
corr2 = bos.corr() # We already examined SalePrice correlations
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],

cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 8}, square=True);
In [16]:
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Jangan lupa analisa dan interpretasi hasilnya
==============================================================================
Dep. Variable: PRICE R-squared: 0.679
No. Observations: 506 AIC: 3114.
Df Residuals: 502 BIC: 3131.
Df Model: 3
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 18.5671 3.913 4.745 0.000 10.879 26.255
RM 4.5154 0.426 10.603 0.000 3.679 5.352
PTRATIO -0.9307 0.118 -7.911 0.000 -1.162 -0.700
LSTAT -0.5718 0.042 -13.540 0.000 -0.655 -0.489
==============================================================================
Skew: 1.700 Prob(JB): 1.10e-222
==============================================================================
Warnings:
d.
Variable Selection: Stepwise di Analisis Regresi
image source: https://quantifyinghealth.com/stepwise-selection/

image source: https://en.wikipedia.org/wiki/Stepwise_regression
Cautions: https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-
should-use-instead-90818b3f52df
In [17]:
def forward_selected(data, response):
"""Linear model designed by forward selection.
https://planspace.org/20150423-forward_selection_with_statsmodels/
Parameters:
-----------
data : pandas DataFrame with all possible predictors and response
response: string, name of response column in data
Returns:
--------
model: an "optimal" fitted statsmodels linear model
with an intercept
selected by forward selection
evaluated by adjusted R-squared
"""
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = 0.0, 0.0
while remaining and current_score == best_new_score:
scores_with_candidates = []
for candidate in remaining:
formula = "{} ~ {} + 1".format(response,
' + '.join(selected + [candidate]))
score = smf.ols(formula, data).fit().rsquared_adj
scores_with_candidates.append((score, candidate))
scores_with_candidates.sort()
best_new_score, best_candidate = scores_with_candidates.pop()
if current_score < best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
model = smf.ols(formula, data).fit()
return model
In [18]:
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1
0.7348057723274566
In [19]:
# Interpretasi koefisien?
print(model.summary())
==============================================================================
Dep. Variable: PRICE R-squared: 0.741
No. Observations: 506 AIC: 3022.
Df Residuals: 494 BIC: 3072.
Df Model: 11
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.3411 5.067 7.171 0.000 26.385 46.298
LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429
RM 3.8016 0.406 9.356 0.000 3.003 4.600
PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693
DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128
NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430
CHAS 2.7187 0.854 3.183 0.002 1.040 4.397
B 0.0093 0.003 3.475 0.001 0.004 0.015
ZN 0.0458 0.014 3.390 0.001 0.019 0.072
CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044
RAD 0.2996 0.063 4.726 0.000 0.175 0.424
TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005
==============================================================================
Skew: 1.523 Prob(JB): 8.60e-172
Kurtosis: 8.300 Cond. No. 1.47e+04
==============================================================================
Warnings:
d.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Data Scaling "for Insights"
Pentingnya "scaling" di Regresi (atau clustering) untuk mencari

insight dari data
[image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e ]
In [20]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])

bos.head()
# Continue to Modelling
Out[20]:
0 0.00632 18.0 2.31 0.0 0.538 6.575 0.641607 4.0900 1.0 0.208015 15.3 1.000000 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 0.782698 4.9671 2.0 0.104962 17.8 1.000000 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 0.599382 4.9671 2.0 0.104962 17.8 0.989737 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 0.441813 6.0622 3.0 0.066794 18.7 0.994276 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 0.528321 6.0622 3.0 0.066794 18.7 1.000000 5.33 36.2
Pitfalls: Regresi Interpolation BUKAN Extrapolation (Forecasting/Peramalan)
image source: https://www.datasciencecentral.com/forum/topics/what-are-the-differences-between-prediction-

extrapolation-and
Latihan Studi Kasus Investasi Biaya Iklan

In [21]:
# Contoh
# Load DataFile CSV
try:
df = pd.read_csv('data/iklan.csv') # run locally
except:
!wget https://raw.githubusercontent.com/taudata-
indonesia/eLearning/master/data/iklan.csv # "Google Colab"
df = pd.read_csv('iklan.csv')
df.head()
Out[21]:
No Iklan Laba Tipe
0 1 10 9.17 1
1 2 1 1.32 0
2 3 12 8.54 1
3 4 12 7.68 1
No Iklan Laba Tipe
4 5 5 7.15 1
In [22]:
p = sns.pairplot(df, hue="Tipe")
In [ ]:
# Do Modelling Here ... Don't forget to interpret.
Belum dibahas:
1. Logistic Regression [akan dibahas saat Topik Klasifikasi]
2. Piecewise Regression (Non Linear)
3. Probit/Tobit Regression (Probabilistic)
4. Bayesian Regressian
5. Logic Regression (lebih robust dari logistic regression utk Fraud Detection)
6. Quantile regression (extreme events)
7. LAD regression (L1)
8. Jackknife regression
9. SVR
10. ARIMA (Time Series)
11. Ecologic Regression (ada hirarki di variab
image Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Referensi DM-09:
1. A. Seber and A.J. Lee, Linear Regression Analysis (2nd Ed), 2003, John Wiley & Sons.
2. P. McCullagh and J.A. Nelder, Generalized Linear Models (2nd Ed), 1989, Chapman & Hall.

DM 09

Diunggah oleh

Informasi Dokumen

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

DM 09

Diunggah oleh

Hak Cipta:

Format Tersedia

Data Mining – 09: Pendahuluan Korelasi dan Regresi

Model Linear Regresi

Berawal dari Pusat data dan Variansi

Perhatikan makna rumus/formula variansi, lalu bandingkan dengan formula "covariansi"

Variance ke Covariance: Menghitung Hubungan Linear antara 2 variabel

Covariance ke korelasi: Statistical Thinking

[image Source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient]

 Cases (social, medicine, etc)

Pendahuluan Model Regresi

Kenapa rumusnya seperti ini?

Evaluasi Error (Mean Squared Error)

Bahas sebentar pentingnya memahami "Loss Function"

Beberapa contoh aplikasi regresi

lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()

OLS Regression Results

Dep. Variable: tekanan_darah R-squared: 0.788

Model: OLS Adj. R-squared: 0.752

Method: Least Squares F-statistic: 22.25

Date: Sun, 15 Nov 2020 Prob (F-statistic): 0.00327

Time: 12:07:17 Log-Likelihood: -21.920

No. Observations: 8 AIC: 47.84

Df Residuals: 6 BIC: 48.00

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Intercept 98.5623 8.266 11.924 0.000 78.337 118.788

usia 0.6766 0.143 4.717 0.003 0.326 1.028

Omnibus: 3.192 Durbin-Watson: 2.005

Skew: -0.340 Prob(JB): 0.602

Kurtosis: 1.392 Cond. No. 311.

Evaluasi R2: Model VS Tidak Pakai Model?

Pengaruh Variabel Tak Bebas ke Model

Perfect/true-best model tidak ada, bahkan seringnya tidak diperlukan

image source: https://sgfin.github.io/2020/06/22/Induction-Intro/

[image source: https://www.slideshare.net/mahakvijay3/basics-of-regression-analysis]

Masih linear terhadap parameter

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

# Convert ke Pandas Dataframe

Boston house prices dataset

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14)

:Attribute Information (in order):

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic

sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],

Variable Selection: Stepwise di Analisis Regresi

image source: https://quantifyinghealth.com/stepwise-selection/

response: string, name of response column in data

Data Scaling "for Insights"

Pentingnya "scaling" di Regresi (atau clustering) untuk mencari

bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])

Pitfalls: Regresi Interpolation BUKAN Extrapolation (Forecasting/Peramalan)

image source: https://www.datasciencecentral.com/forum/topics/what-are-the-differences-between-prediction-

Latihan Studi Kasus Investasi Biaya Iklan

No Iklan Laba Tipe

Anda mungkin juga menyukai

Data Set Characteristics: