April 1, 2023
Both NumPy and SciPy run on all major operating systems, are quick to install, and are
free. They are easy to use, yet powerful enough to be relied on by some of the world's
leading data scientists and researchers.
(49999, 12)
order_id quantity \
0 2e7a8482f6fb09756ca50c10d7bfc047 2
1 2e7a8482f6fb09756ca50c10d7bfc047 1
2 e5fa5a7210941f7d56d0208e4e071d35 1
3 3b697a20d9e427646d92567910af6d57 1
4 71303d7e93b399f5bcd537d124c0bcfa 1
5 be5bc2f0da14d8071e2d45451ad119d9 1
6 0a0837a5eee9e7a9ce2b1fa831944d27 1
7 1ff217aa612f6cd7c4255c9bfe931c8b 1
8 22613579f7d11cc59c4347526fc3c79e 1
9 356b492aba2d1a7da886e54e0b6212b7 1
product_id price \
0 f293394c72c9b5fafd7023301fc21fc2 1489000
1 c1488892604e4ba5cff5b4eb4d595400 1756000
2 f3c2d01a84c947b078e32bbef0718962 1707000
3 3ae08df6bcbfe23586dd431c40bddbb7 3071000
4 d2998d7ced12f83f9b832f33cf6507b6 3833000
5 fd7fd78fd3cbc1b0a6370a7909c0a629 1480000
6 583916a5dae918f5e89baec139141c54 4489000
7 33430c5c1027d812b5c62f778e5ee7f7 822000
8 3ff81cd0e0861e991bb0106c03c113ca 3967000
9 eba7488e1c67729f045ab43fac426f2e 4165000
seller_id freight_value \
0 1554a68530182680ad5c8b042c3ab563 28000
1 1554a68530182680ad5c8b042c3ab563 45000
2 a425f92c199eb576938df686728acd20 174000
3 522620dcb18a6b31cd7bdf73665113a9 154000
4 25e6ffe976bd75618accfe16cefcbd0d 147000
5 f09b760d23495ac9a7e00d29b769007c 152000
6 3481aa57cd91f9f9d3fa1fa12d9a3bf7 16000
7 4b1eaadf791bdbbad8c4a35b65236d52 58000
8 86bb7c4b535e49a541baf3266b1c95b1 95000
9 620c87c171fb2a6dd6e8bb4dec959fc6 98000
[ ]: #print([dataframe_name].describe())
The describe function reports the count, mean, standard deviation, minimum, maximum, and
quartiles (25%, 50%, 75%) of each column. The IQR (interquartile range) is the difference
between the 75% and 25% values.
General rules:
1. By default, describe() skips categorical (non-numeric) columns and only gives summary
statistics for numeric columns.
2. To get summary (descriptive) statistics for both numeric and character columns, pass
the argument include='all'.
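The two rules above can be checked on a small example. The sketch below uses a made-up toy_df (its column names mirror the order dataset, but the values are hypothetical) and also shows how the IQR can be derived from the quartile rows that describe() returns.

```python
import pandas as pd

# Hypothetical toy data, not taken from the order dataset.
toy_df = pd.DataFrame({
    "payment_type": ["e-wallet", "debit card", "e-wallet", "credit card"],
    "price": [100, 200, 300, 400],
})

# Default call: only the numeric column is summarized.
numeric_summary = toy_df.describe()

# include='all': text columns get count/unique/top/freq, numeric columns get the usual statistics.
full_summary = toy_df.describe(include="all")

# describe() lists the quartiles; the IQR is the 75% row minus the 25% row.
iqr_price = numeric_summary.loc["75%", "price"] - numeric_summary.loc["25%", "price"]

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(iqr_price)
```

Statistics that do not apply to a column type (for example, mean for a text column) appear as NaN in the include='all' output.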
[ ]: print(order_df.describe(include='all'))
order_id quantity \
count 49999 49999.000000
unique 42694 NaN
top 8272b63d03f5f79c56e9e4120aec44ef NaN
freq 21 NaN
mean NaN 1.197484
std NaN 0.722262
min NaN 1.000000
25% NaN 1.000000
50% NaN 1.000000
75% NaN 1.000000
max NaN 21.000000
product_id price \
count 49999 4.999900e+04
unique 16866 NaN
top 99a4788cb24856965c36a24e339b6058 NaN
freq 366 NaN
mean NaN 2.607784e+06
std NaN 1.388312e+06
min NaN 2.000000e+05
25% NaN 1.410500e+06
50% NaN 2.610000e+06
75% NaN 3.810000e+06
max NaN 5.000000e+06
seller_id freight_value \
count 49999 49999.000000
unique 1777 NaN
top 4a3ca9315b744ce9f8e9374361493884 NaN
freq 1236 NaN
mean NaN 104521.390428
std NaN 55179.844962
min NaN 9000.000000
25% NaN 57000.000000
50% NaN 104000.000000
75% NaN 152000.000000
max NaN 200000.000000
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
[ ]: print(order_df.describe(include='object'))
order_id product_id \
count 49999 49999
unique 42694 16866
top 8272b63d03f5f79c56e9e4120aec44ef 99a4788cb24856965c36a24e339b6058
freq 21 366
seller_id customer_id \
count 49999 49999
unique 1777 42694
top 4a3ca9315b744ce9f8e9374361493884 fc3d1daec319d62d49bfb5e1f83123e9
freq 1236 21
[ ]: #print([dataframe_name].loc[:, 'column_name'].mean())
#print([dataframe_name].loc[:, 'column_name'].median())
#print([dataframe_name].loc[:, 'column_name'].mode())
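As a concrete instance of the templates above, here is a minimal sketch on a made-up toy_df (the values are hypothetical). Note that mode() returns a Series rather than a single number, because a column can have more than one most frequent value.

```python
import pandas as pd

# Hypothetical toy data, not taken from the order dataset.
toy_df = pd.DataFrame({"price": [200, 200, 300, 500, 800]})

price = toy_df.loc[:, "price"]

mean_price = price.mean()      # arithmetic mean of the column
median_price = price.median()  # middle value of the sorted column
mode_price = price.mode()      # Series holding the most frequent value(s)

print(mean_price)
print(median_price)
print(mode_price.tolist())
```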
[ ]: import pandas as pd
order_df = pd.read_csv("https://storage.googleapis.com/dqlab-dataset/order.csv")
# Quick summary of quantity, price, freight value, and weight
print(order_df.describe())
# Median of the price column (customer purchase value per transaction)
print(order_df.loc[:, 'price'].median())
[ ]: import matplotlib.pyplot as plt
order_df[['price']].hist(figsize=(4, 5), bins=10, xlabelsize=8, ylabelsize=8)
plt.show() # Display the histogram plot
3929.8968753726213
15444089.451063491
0.0.10 Finding Outliers Using Pandas
Outliers are observations with extreme values, that is, values that lie far from, or differ
markedly from, most of the other values in their group.
In general, outliers can be identified with the IQR (interquartile range) metric.
The basic formula is IQR = Q3 - Q1. An observation is considered an outlier if it meets
either of the conditions below:
1. data < Q1 - 1.5 * IQR
2. data > Q3 + 1.5 * IQR
[ ]: # Compute quartile 1
Q1 = order_df[['product_weight_gram']].quantile(0.25)
# Compute quartile 3
Q3 = order_df[['product_weight_gram']].quantile(0.75)
# Compute the interquartile range and print it to the console
IQR = Q3-Q1
print(IQR)
product_weight_gram 1550.0
dtype: float64
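Once the IQR is known, the outlier rule can be applied as a boolean mask. The sketch below is one possible way to do it, shown on a made-up toy_df (hypothetical values). It selects the column with single brackets so that the quartiles are plain numbers, whereas the cell above uses double brackets and therefore gets a one-element Series.

```python
import pandas as pd

# Hypothetical toy data, not taken from the order dataset.
toy_df = pd.DataFrame({"product_weight_gram": [100, 120, 130, 150, 160, 5000]})

weight = toy_df["product_weight_gram"]

q1 = weight.quantile(0.25)
q3 = weight.quantile(0.75)
iqr = q3 - q1

# Fences of the IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# A row is an outlier if it falls below the lower fence OR above the upper fence.
is_outlier = (weight < lower_bound) | (weight > upper_bound)
outliers = toy_df[is_outlier]

print(iqr)
print(outliers)
```

Using `|` (or) rather than `&` (and) matters here: no value can be below the lower fence and above the upper fence at the same time.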
order_id quantity \
0 2e7a8482f6fb09756ca50c10d7bfc047 2
1 2e7a8482f6fb09756ca50c10d7bfc047 1
2 e5fa5a7210941f7d56d0208e4e071d35 1
3 3b697a20d9e427646d92567910af6d57 1
4 71303d7e93b399f5bcd537d124c0bcfa 1
… … …
49994 ec88157ad03aa203c3fdfe7bace5ab6b 1
49995 ed60085e92e2aa3debf49159deb34da7 1
49996 ed98c37d860890f940e2acd83629fdd1 2
49997 ed98c37d860890f940e2acd83629fdd1 1
49998 ede4ebbb6e36cbd377eabcc7f5229575 1
product_id price \
0 f293394c72c9b5fafd7023301fc21fc2 1489000
1 c1488892604e4ba5cff5b4eb4d595400 1756000
2 f3c2d01a84c947b078e32bbef0718962 1707000
3 3ae08df6bcbfe23586dd431c40bddbb7 3071000
4 d2998d7ced12f83f9b832f33cf6507b6 3833000
… … …
49994 165f86fe8b799a708a20ee4ba125c289 3077000
49995 6e835aea84ae8eb68b8c14878dd43b30 1277000
49996 aca2eb7d00ea1a7b8ebd4e68314663af 486000
49997 aca2eb7d00ea1a7b8ebd4e68314663af 830000
49998 2b0ee2d07306f7c9ac55a43166e9bb4b 215000
seller_id shipping_cost \
0 1554a68530182680ad5c8b042c3ab563 28000
1 1554a68530182680ad5c8b042c3ab563 45000
2 a425f92c199eb576938df686728acd20 174000
3 522620dcb18a6b31cd7bdf73665113a9 154000
4 25e6ffe976bd75618accfe16cefcbd0d 147000
… … …
49994 7ddcbb64b5bc1ef36ca8c151f6ec77df 172000
49995 4d6d651bd7684af3fffabd5f08d12e5a 130000
49996 955fee9216a65b617aa5c0531780ce60 14000
49997 955fee9216a65b617aa5c0531780ce60 108000
49998 1900267e848ceeba8fa32d80c1a5f5a8 189000
… … … …
49994 e-wallet automotive 2425.0
49995 debit card beauty 2350.0
49996 debit card gadget 2600.0
49997 e-wallet gadget 2600.0
49998 credit card beauty 1450.0
payment_type
credit card 2.600706e+06
debit card 2.611974e+06
e-wallet 2.598562e+06
virtual account 2.619786e+06
Name: price, dtype: float64
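The cell that produced the per-payment-type output above is not shown in this export. Output of that shape (a Series named price, indexed by payment_type) is what a grouped mean returns, so the sketch below is an assumption about the technique rather than the original code, and it runs on made-up data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the order dataset.
toy_df = pd.DataFrame({
    "payment_type": ["e-wallet", "debit card", "e-wallet", "debit card"],
    "price": [100, 200, 300, 600],
})

# Group rows by payment type, then average the price within each group.
mean_price_by_payment = toy_df.groupby("payment_type")["price"].mean()

print(mean_price_by_payment)
```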
order_id quantity \
37085 d7b2d3b902441cf3dd12cd125533217d 1
41958 2711089c7fec59d4dc8483e3c6a12fa3 1
3976 f343624eab419250ad81f1ce6be22c93 1
21072 c8947a583ab9791a5a9d02384cb84302 1
47074 f6134169ca6f0cdfbe6458ebb5731613 1
… … …
33786 0d9e86e02c1a823b20c03ea29d616607 1
42166 54220fcc516cabe9ec84b210c0765ef2 1
31745 59a19c83ff825948739dd1601cc107b6 1
42452 9960ee97c2f8d801a200a01893b3942f 1
11939 64619901c45fba79638d666058bf6be6 1
product_id price \
37085 35afc973633aaeb6b877ff57b2793310 5000000
41958 7c1bd920dbdf22470b68bde975dd3ccf 5000000
3976 777d2e438a1b645f3aec9bd57e92672c 5000000
21072 f8cfb63e323be2e1c4172f255d61843d 5000000
47074 2ea92fab7565c4fe9f91a5e4e1756258 5000000
… … …
33786 f93213a23c50edc16c27b96333f734dc 200000
42166 1166bc797ddf5fb009c376d133f61204 200000
31745 eb38a7604070a2b8465101ed53cba72b 200000
42452 db5efde3ad0cc579b130d71c4b2db522 200000
11939 06c6e01186af8b98ee1fc9e01f9471e9 200000
seller_id shipping_cost \
37085 4a3ca9315b744ce9f8e9374361493884 118000
41958 cc419e0650a3c5ba77189a1882b7556a 31000
3976 4a3ca9315b744ce9f8e9374361493884 101000
21072 4a3ca9315b744ce9f8e9374361493884 184000
47074 3d871de0142ce09b7081e2b9d1733cb1 196000
… … …
33786 46dc3b2cc0980fb8ec44634e21d2718e 141000
42166 5cbbd5a299cab112b7bf23862255e43e 175000
31745 e6a69c4a27dfdd98ffe5aa757ad744bc 112000
42452 4869f7a5dfa277a7dca6462dcf3b52b2 26000
11939 fc906263ca5083d09dce42fe02247800 98000
33786 debit card automotive 7550.0
42166 e-wallet gadget 1100.0
31745 e-wallet beauty 550.0
42452 credit card automotive 6663.0
11939 virtual account automotive 200.0
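The last table lists rows from the highest price (5000000) down to the lowest (200000) while keeping their original index labels. The cell behind it is not shown in this export. A descending sort with sort_values gives output of this shape, so the following is a sketch under that assumption, using made-up data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the order dataset.
toy_df = pd.DataFrame({
    "quantity": [1, 2, 1, 1],
    "price": [300, 5000, 200, 1200],
})

# Sort by price from highest to lowest; the original index labels are preserved.
sorted_df = toy_df.sort_values(by="price", ascending=False)

print(sorted_df)
```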