
Data Mining – 09: Introduction to Correlation and Regression

Linear Regression Models


- Covariance ==> Correlation
- Correlation ==> Regression
- Regression vs. Interpolation
- Inductive Bias and Training-Testing Data
- Evaluation: RMSE & R²
- Optimal Parameters
- Categorical Variables in Regression
- Cross-Validation
- Feature Selection (Forward, Backward, Stepwise) and Lasso Regression
- Variable Transformations
- Non-Linear Regression

Starting from the Center of the Data and the Variance

Consider the meaning of the variance formula, then compare it with the following "covariance" formula:
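For reference, the standard (population) definitions side by side:

$$s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Variance is just the special case $\operatorname{cov}(x, x)$: covariance replaces one of the two squared deviations with the deviation of a second variable, which is what lets it capture how the two variables move away from their means together.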

From Variance to Covariance: Measuring the Linear Relationship between Two Variables

- How does it work? (Statistical thinking)
- The concept: to "co-vary" is to move away from the mean together.
- Use "reverse" thinking to understand it.
- Usage: cov(x,y) = 2 vs. cov(x,y) = -2 vs. cov(x,y) = 0
- Covariance = 3000? What does that mean? (See the sketch below.)
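Covariance is hard to judge in isolation because its magnitude carries the units of both variables. A minimal sketch with made-up numbers (hypothetical data, not the case study below) illustrates this:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # say, a length measured in meters
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
x_cm = x * 100                             # the same lengths in centimeters

print(np.cov(x, y, ddof=0)[0][1])          # 1.6
print(np.cov(x_cm, y, ddof=0)[0][1])       # 160.0 -- same relationship, new units
print(np.corrcoef(x, y)[0][1])             # 0.8
print(np.corrcoef(x_cm, y)[0][1])          # 0.8 -- correlation is unit-free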

From Covariance to Correlation: Statistical Thinking

- Correlation is simply the covariance divided by the product of the two standard deviations.
- What does that mean?
- Correlation has a geometric meaning ... it is a
Cosine!... https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation

The "Pearson" (linear) correlation coefficient
- The Pearson correlation coefficient takes values from -1 to +1.
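In symbols (the standard definition, which also matches the linked geometric interpretation):

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x\, s_y} = \cos\theta,$$

where $\theta$ is the angle between the two mean-centered data vectors. Dividing by the standard deviations removes the units, which is exactly what makes $r$ interpretable on a fixed $[-1, +1]$ scale.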

Caution
- A correlation coefficient of 0 does not mean there is no relationship between the two variables. What it actually means is that there is no LINEAR relationship; there may still be a relationship of another form, e.g. quadratic or some other non-linear function, as in the example above.

Interpretation
- A value of ~0.95 indicates a strong positive linear correlation between age and blood pressure: higher ages tend to go together with higher blood pressure than lower ages.

WARNING
- Correlation is not the same as (does not imply) causation. Look at the interpretation above: it does not claim that high age causes high blood pressure, only that there is a trend or tendency. It may be that blood pressure rises as age increases, but it may also be that high blood pressure is driven not by age but by other factors not observed in the data.

- Other research examples in machine learning (beauty and confidence / finger length and IQ)

Correlation and Causation

Qualitative assessments of correlation values like these? ... Really? Why? Why not?

[image Source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient]

- Cases (social, medicine, etc.)

- Objective: prediction vs. insights.

A simple example case

In [1]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
plt.style.use('bmh'); sns.set()

data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138,
142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
Out[1]:

   usia  tekanan_darah
0    40            126
1    45            124
2    50            135
3    53            138
4    60            142
5    65            139
6    69            140
7    71            151

In [2]:
# Correlation and a scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance = 76.953125
Correlations = [[1. 0.88746015]
[0.88746015 1. ]]

In [3]:
# Better
print(df.corr())
sns.heatmap(df.corr(), cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
                   usia  tekanan_darah
usia            1.00000        0.88746
tekanan_darah   0.88746        1.00000
Interpretation
- The value of ~0.89 indicates a strong positive linear correlation between age and blood pressure: higher ages tend to go together with higher blood pressure than lower ages.

WARNING
- Correlation is not the same as causation. Look at the interpretation above: it does not claim that high age causes high blood pressure, only that there is a trend or tendency. It may be that blood pressure rises as age increases, but it may also be that high blood pressure is driven not by age but by other factors not observed in the data.

Up to this point we know the two variables are related, but correlation alone does not
tell us what the relationship looks like. That is why we need a regression model.

Introduction to Regression Models

- Used when the dependent variable (Y) is numeric (float/real) and the independent variables are numeric and/or categorical.

Simple Linear Regression

From Correlation to Regression
How do we compute the optimal regression parameters?

Why does the formula look the way it does?

The importance of understanding the "loss function"
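Briefly: for simple linear regression, the "formula" drops out of the loss function. Choosing squared error as the loss and minimizing $\sum_i (y_i - \beta_0 - \beta_1 x_i)^2$ (setting both partial derivatives to zero) yields the standard least-squares estimates:

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} = r_{xy}\,\frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

The last form makes the step from correlation to regression explicit: the slope is just the correlation rescaled by the ratio of the standard deviations.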

Error Evaluation (Mean Squared Error)

- Careful ... look closely at the formula ... it is not robust to outliers.
- $\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
- MSE = the total (squared) distance between the predictions and the actual data values.
- RMSE = $\sqrt{\text{MSE}}$ ... why?
- Evaluation matters whenever we want to make predictions.
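Written out (standard definitions):

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}.$$

Squaring makes large errors dominate (hence the sensitivity to outliers); taking the square root brings the error back to the units of $y$, which is why RMSE is easier to interpret than MSE.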

A brief note on why the "loss function" matters

- The linear equation/model is the most important foundation in statistics, data science, machine learning, and deep learning (*).
- Many models in (*) are in fact linear functions, even in classification problems.
- What differs is the "problem modeling / optimization / loss function".

Regression vs. Interpolation

Some example applications of regression

1. Predictive Analytics: predicting risk, prices, sales, demand, etc.

2. Operational Efficiency: optimizing business processes by examining the relationships between variables and setting policy based on those relationships.

3. Supporting Decisions: testing hypotheses, e.g. about finance, operations, and customer purchases.

4. New Insights: regression helps analyze the relationships between variables and filter them at the same time.
Source: https://www.newgenapps.com/blog/business-applications-uses-regression-analysis-advantages
In [4]:
# Fit a simple regression model
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm, scipy.stats as stats

lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()


lm.summary()
# 1. F-statistic
# 2. Test the model coefficients
# 3. R^2
# 4. Interpret the model
# 5. Durbin-Watson ==> time series?
Out[4]:

                            OLS Regression Results
==============================================================================
Dep. Variable:          tekanan_darah   R-squared:                       0.788
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     22.25
Date:                Sun, 15 Nov 2020   Prob (F-statistic):            0.00327
Time:                        12:07:17   Log-Likelihood:                -21.920
No. Observations:                   8   AIC:                             47.84
Df Residuals:                       6   BIC:                             48.00
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     98.5623      8.266     11.924      0.000      78.337     118.788
usia           0.6766      0.143      4.717      0.003       0.326       1.028
==============================================================================
Omnibus:                        3.192   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.203   Jarque-Bera (JB):                1.016
Skew:                          -0.340   Prob(JB):                        0.602
Kurtosis:                       1.392   Cond. No.                         311.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [5]:
# Plot the Data
p = sns.regplot(x=df.usia, y=df.tekanan_darah)  # keyword args for newer seaborn versions
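A quick check on how to use the fitted line (the age of 55 below is a hypothetical new observation, not in the data): the summary above gives $\hat{y} = 98.5623 + 0.6766\,x$, so

# Predict blood pressure at a hypothetical age of 55 with the model from In [4];
# by hand: 98.5623 + 0.6766 * 55 ≈ 135.8
new_obs = pd.DataFrame({'usia': [55]})
print(lm.predict(new_obs))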

Evaluating R²: Using a Model vs. Not Using One?

Adjusted R-squared? Why?

How much of the dependent variable's variation the model captures

- $SSR = SST - SSE = \sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \hat{y}_i)^2$

A perfect/"true best" model does not exist, and often is not even needed.
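In those terms (standard definitions):

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \qquad R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

$R^2$ compares the model against the "no model" baseline that always predicts the mean $\bar{y}$; the adjusted version (with $n$ observations and $p$ predictors) penalizes added complexity, since plain $R^2$ can never decrease when another variable enters the model.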


Inductive Bias as an Essential Foundation for Understanding ALL Models in Data
Science and Machine Learning

[image source: https://sgfin.github.io/2020/06/22/Induction-Intro/]

[image source: https://www.slideshare.net/mahakvijay3/basics-of-regression-analysis]

Non-Linear Regression?
Why?
When is adding model complexity not advisable?
Regression for insights vs. regression for prediction.

Still linear in the parameters


[image source: https://sites.google.com/site/apphysics1online/appendices/2-data-analysis/graph-linearization ]
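To make "linear in the parameters" concrete, consider transformed models such as

$$y = \beta_0 + \beta_1 \log x \qquad \text{or} \qquad y = \beta_0 + \beta_1 x + \beta_2 x^2.$$

Both are non-linear in $x$ but linear in the $\beta$'s, so ordinary least squares applies unchanged after the variables are transformed; this is the graph-linearization idea in the image above, and exactly what the np.log(Literacy) formula below exploits.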

In [6]:
# Load a sample dataset shipped with statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)


df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
Out[6]:
( Lottery Literacy Wealth Region
0 41 37 73 E
1 38 51 22 N
2 66 13 61 C
3 80 46 76 E
4 79 69 83 E,
(85, 4),
{'C', 'E', 'N', 'S', 'W'})
In [8]:
# Set "Region" sebagai variabel Kategorik
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept 38.651655
C(Region)[T.E] -15.427785
C(Region)[T.N] -10.016961
C(Region)[T.S] -4.548257
C(Region)[T.W] -10.091276
Literacy -0.185819
Wealth 0.451475
dtype: float64
OLS Regression Results
==============================================================================
Dep. Variable: Lottery R-squared: 0.338
Model: OLS Adj. R-squared: 0.287
Method: Least Squares F-statistic: 6.636
Date: Sun, 15 Nov 2020 Prob (F-statistic): 1.07e-05
Time: 12:07:30 Log-Likelihood: -375.30
No. Observations: 85 AIC: 764.6
Df Residuals: 78 BIC: 781.7
Df Model: 6
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 38.6517 9.456 4.087 0.000 19.826 57.478
C(Region)[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938
C(Region)[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419
C(Region)[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943
C(Region)[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235
Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232
Wealth 0.4515 0.103 4.390 0.000 0.247 0.656
==============================================================================
Omnibus: 3.049 Durbin-Watson: 1.785
Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694
Skew: -0.340 Prob(JB): 0.260
Kurtosis: 2.454 Cond. No. 371.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
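A note on reading the categorical coefficients: C(Region) applies treatment (dummy) coding, with the alphabetically first level 'C' as the reference, so each C(Region)[T.x] coefficient is region x's intercept offset relative to region 'C'. A sketch of the equivalent hand-built design matrix (an illustration, not part of the original notebook):

# Rebuild C(Region)'s design matrix by hand: drop the reference level 'C'
# and add explicit dummy columns for the remaining regions.
dummies = pd.get_dummies(df['Region'], prefix='Region', drop_first=True)
X = pd.concat([df[['Literacy', 'Wealth']], dummies], axis=1).astype(float)
X = sm.add_constant(X)
print(sm.OLS(df['Lottery'], X).fit().params)  # should match res.params above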
In [9]:
# Non-linear transformation of a predictor (the model stays linear in the parameters)
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: Lottery R-squared (uncentered): 0.799
Model: OLS Adj. R-squared (uncentered): 0.794
Method: Least Squares F-statistic: 165.2
Date: Sun, 15 Nov 2020 Prob (F-statistic): 1.16e-29
Time: 12:07:37 Log-Likelihood: -384.16
No. Observations: 85 AIC: 772.3
Df Residuals: 83 BIC: 777.2
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120
Wealth 0.5853 0.089 6.571 0.000 0.408 0.762
==============================================================================
Omnibus: 4.188 Durbin-Watson: 1.892
Prob(Omnibus): 0.123 Jarque-Bera (JB): 4.034
Skew: -0.480 Prob(JB): 0.133
Kurtosis: 2.533 Cond. No. 25.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Case Study (Boston House Pricing) - Another Property Case Study
Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html

In [10]:
# Load the data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; on
# newer versions, fetch the data from the original StatLib source (see the
# DESCR below) or substitute another housing dataset.
from sklearn.datasets import load_boston
boston = load_boston()

# Convert to a pandas DataFrame
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
Out[10]:

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2
In [11]:
# Data description
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset


---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):


- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.


https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [12]:
bos.describe(include='all')
Out[12]:

         count        mean         std        min         25%        50%         75%        max
CRIM     506.0    3.613524    8.601545    0.00632    0.082045    0.25651    3.677083    88.9762
ZN       506.0   11.363636   23.322453    0.00000    0.000000    0.00000   12.500000   100.0000
INDUS    506.0   11.136779    6.860353    0.46000    5.190000    9.69000   18.100000    27.7400
CHAS     506.0    0.069170    0.253994    0.00000    0.000000    0.00000    0.000000     1.0000
NOX      506.0    0.554695    0.115878    0.38500    0.449000    0.53800    0.624000     0.8710
RM       506.0    6.284634    0.702617    3.56100    5.885500    6.20850    6.623500     8.7800
AGE      506.0   68.574901   28.148861    2.90000   45.025000   77.50000   94.075000   100.0000
DIS      506.0    3.795043    2.105710    1.12960    2.100175    3.20745    5.188425    12.1265
RAD      506.0    9.549407    8.707259    1.00000    4.000000    5.00000   24.000000    24.0000
TAX      506.0  408.237154  168.537116  187.00000  279.000000  330.00000  666.000000   711.0000
PTRATIO  506.0   18.455534    2.164946   12.60000   17.400000   19.05000   20.200000    22.0000
B        506.0  356.674032   91.294864    0.32000  375.377500  391.44000  396.225000   396.9000
LSTAT    506.0   12.653063    7.141062    1.73000    6.950000   11.36000   16.955000    37.9700
PRICE    506.0   22.532806    9.197104    5.00000   17.025000   21.20000   25.000000    50.0000

In [13]:
p = sns.pairplot(bos)
Checking Correlations between Predictors
In [14]:
# Heatmap to investigate the correlations
corr2 = bos.corr()
plt.figure(figsize=(12, 10))

sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
In [16]:
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Don't forget to analyze and interpret the results
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.679
Model: OLS Adj. R-squared: 0.677
Method: Least Squares F-statistic: 353.3
Date: Sun, 15 Nov 2020 Prob (F-statistic): 2.69e-123
Time: 12:15:17 Log-Likelihood: -1553.0
No. Observations: 506 AIC: 3114.
Df Residuals: 502 BIC: 3131.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 18.5671 3.913 4.745 0.000 10.879 26.255
RM 4.5154 0.426 10.603 0.000 3.679 5.352
PTRATIO -0.9307 0.118 -7.911 0.000 -1.162 -0.700
LSTAT -0.5718 0.042 -13.540 0.000 -0.655 -0.489
==============================================================================
Omnibus: 202.072 Durbin-Watson: 0.901
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1022.153
Skew: 1.700 Prob(JB): 1.10e-222
Kurtosis: 9.076 Cond. No. 402.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Variable Selection: Stepwise Methods in Regression Analysis

[image source: https://quantifyinghealth.com/stepwise-selection/]

[image source: https://en.wikipedia.org/wiki/Stepwise_regression]
Cautions: https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df
In [17]:
def forward_selected(data, response):
    """Linear model designed by forward selection.
    https://planspace.org/20150423-forward_selection_with_statsmodels/

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response
    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
In [18]:
model = forward_selected(bos, 'PRICE')

print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1
0.7348057723274566
In [19]:
# How do we interpret the coefficients?
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.741
Model: OLS Adj. R-squared: 0.735
Method: Least Squares F-statistic: 128.2
Date: Sun, 15 Nov 2020 Prob (F-statistic): 5.54e-137
Time: 12:15:28 Log-Likelihood: -1498.9
No. Observations: 506 AIC: 3022.
Df Residuals: 494 BIC: 3072.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.3411 5.067 7.171 0.000 26.385 46.298
LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429
RM 3.8016 0.406 9.356 0.000 3.003 4.600
PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693
DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128
NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430
CHAS 2.7187 0.854 3.183 0.002 1.040 4.397
B 0.0093 0.003 3.475 0.001 0.004 0.015
ZN 0.0458 0.014 3.390 0.001 0.019 0.072
CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044
RAD 0.2996 0.063 4.726 0.000 0.175 0.424
TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005
==============================================================================
Omnibus: 178.430 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 787.785
Skew: 1.523 Prob(JB): 8.60e-172
Kurtosis: 8.300 Cond. No. 1.47e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
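The outline above lists Lasso regression alongside forward/backward/stepwise selection, and the cautions link suggests penalized regression as an alternative. A minimal scikit-learn sketch (alpha=0.1 is an arbitrary, untuned choice; it would normally be picked by cross-validation):

# Lasso as an alternative to stepwise selection: the L1 penalty shrinks some
# coefficients exactly to zero, performing feature selection implicitly.
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = bos.drop(columns='PRICE')
y = bos['PRICE']
X_std = StandardScaler().fit_transform(X)     # the L1 penalty is scale-sensitive

lasso = Lasso(alpha=0.1).fit(X_std, y)
for name, coef in zip(X.columns, lasso.coef_):
    print(f'{name:>8s} {coef: .3f}')          # exact zeros = features dropped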

Data Scaling "for Insights"

The importance of "scaling" in regression (or clustering) when mining data for insights

[image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e]
In [20]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])


bos.head()
# Continue to Modelling
Out[20]:

      CRIM    ZN  INDUS  CHAS    NOX     RM       AGE     DIS  RAD       TAX  PTRATIO         B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  0.641607  4.0900  1.0  0.208015     15.3  1.000000   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  0.782698  4.9671  2.0  0.104962     17.8  1.000000   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  0.599382  4.9671  2.0  0.104962     17.8  0.989737   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  0.441813  6.0622  3.0  0.066794     18.7  0.994276   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  0.528321  6.0622  3.0  0.066794     18.7  1.000000   5.33   36.2
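A sketch of taking this one step further (an illustration beyond the original notebook): scale all predictors to [0, 1] so the fitted coefficients are on a comparable footing and their magnitudes can be read as a rough importance signal.

# Scale every predictor to [0, 1]; coefficient magnitudes then become
# (roughly) comparable across features, which is the "insight" payoff.
predictors = [c for c in bos.columns if c != 'PRICE']
bos_scaled = bos.copy()
bos_scaled[predictors] = MinMaxScaler().fit_transform(bos_scaled[predictors])

m_scaled = ols('PRICE ~ ' + ' + '.join(predictors), bos_scaled).fit()
print(m_scaled.params.sort_values())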

Pitfalls: Regression is Interpolation, NOT Extrapolation (Forecasting)

[image source: https://www.datasciencecentral.com/forum/topics/what-are-the-differences-between-prediction-extrapolation-and]

Exercise Case Study: Advertising Spend Investment


In [21]:
# Example
# Load the CSV data file
try:
    df = pd.read_csv('data/iklan.csv')  # run locally
except:
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/iklan.csv # "Google Colab"
    df = pd.read_csv('iklan.csv')
df.head()
Out[21]:

   No  Iklan  Laba  Tipe
0   1     10  9.17     1
1   2      1  1.32     0
2   3     12  8.54     1
3   4     12  7.68     1
4   5      5  7.15     1

In [22]:
p = sns.pairplot(df, hue="Tipe")

In [ ]:
# Do Modelling Here ... Don't forget to interpret.

Not yet covered:
1. Logistic Regression [to be covered in the Classification topic]
2. Piecewise Regression (non-linear)
3. Probit/Tobit Regression (probabilistic)
4. Bayesian Regression
5. Logic Regression (more robust than logistic regression for fraud detection)
6. Quantile Regression (extreme events)
7. LAD Regression (L1)
8. Jackknife Regression
9. SVR
10. ARIMA (time series)
11. Ecologic Regression (hierarchy in the variables)
[image source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html]

References for DM-09:
1. G.A.F. Seber and A.J. Lee, Linear Regression Analysis (2nd Ed.), 2003, John Wiley & Sons.
2. P. McCullagh and J.A. Nelder, Generalized Linear Models (2nd Ed.), 1989, Chapman & Hall.
