
# Introduction to Machine Learning

## Max Kleiner 10.2018

ML Agenda
4 Cases with 5 Scripts
• Data Reduction - EKON22_PCA_1.py
• Regression - EKON22_REG_2.py
• Clustering - EKON22_CLU_3.py
• Classification - EKON22_CLA_4.py
• Decision Tree, Random Forest - EKON32_DET_5.py
• Cluster & Classify with different inputs, algos, config
• Define label, features or topic ratings, hyper-parameters, tests
  - assumed/implicit labels, predict versus target, random state
• Conclusions / ML Process Summary
http://www.softwareschule.ch/examples/machinelearning.jpg
PCA (Principal Component Analysis)
C:\maXbox\EKON22\EKON22_scripts\EKON22_PCA_1.py

Visualizing 2- or 3-dimensional data is not that challenging. However, even the Iris dataset has 4 dimensions. Use PCA to reduce the 4-dimensional data to 2 or 3 dimensions so that you can plot and understand the data better. Use StandardScaler first to bring the features onto unit scale (mean = 0, variance = 1), which many machine learning algorithms, PCA included, need for optimal performance.
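The two steps described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the contents of EKON22_PCA_1.py; the variable names and the choice of two components are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the 4-dimensional Iris measurements (150 samples x 4 features).
iris = load_iris()
X = iris.data

# Step 1: bring every feature onto unit scale (mean 0, variance 1).
X_scaled = StandardScaler().fit_transform(X)

# Step 2: project the scaled data onto the first two principal components.
# n_components=2 is an assumption so that the result can be drawn in 2-D.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("shape before:", X.shape)
print("shape after: ", X_2d.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Plotting `X_2d[:, 0]` against `X_2d[:, 1]`, colored by `iris.target`, then gives a 2-D view of the three species.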

http://playground.tensorflow.org_maXbox2
https://www.springboard.com/blog/data-mining-python-tutorial/
https://sebastianraschka.com/images/blog/2015/principal_component_analysis_files/iris.png
Regression and Correlation

@C:\maXbox\EKON22\EKON22_scripts\EKON22_REG_2.py

CASSANDRA System
2. C:\maXbox\mX46210\ntwdblib.dll\UnsharpDetector-master\UnsharpDetector-master\inference_gui.py
From Correlation to 4 Dim Cluster
Finding the question is often more important than finding the answer - John Tukey

https://www.soovle.com/

Clustering from module import class
@C:\maXbox\EKON22\EKON22_scripts\EKON22_CLU_3.py

GEO Cluster Story

An agent or probe that collects threat data from the security sensor and correlation middleware. A console and associated database for managing the solution and its alerts.
https://www.esecurityplanet.com/views/article.php/1501001/Security-Threat-Correlation-The-Next-Battlefield.htm
IRIS Classification Concept
from sklearn import datasets, metrics, tree
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)    # train on the full dataset
y_pred = clf.predict(iris.data)          # predict the same samples again
print('Train accuracy_score: ', metrics.accuracy_score(iris.target, y_pred))

## Demo in VSCode /maXbox4

C:\maXbox\softwareschule\MT-HS12-05\mentor_xml\casra2017\crawler\plot_iris_dataset_mx.py
@C:\maXbox\EKON22\EKON22_scripts\EKON22_CLA_4.py
IRIS Confusion Matrix
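A confusion matrix counts, for each true class, how often each class was predicted. The sketch below tallies one in plain Python; the helper `confusion_matrix` and the labels are made up for illustration and are not results from the slides. scikit-learn offers `metrics.confusion_matrix(y_true, y_pred)` with the same row/column convention.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix with rows = true class and columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for row in matrix:
    print(row)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / len(y_true)
print("accuracy:", accuracy)
```

Off-diagonal entries show which classes get mixed up, which a single accuracy number hides.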

IRIS Decision Tree

@C:\maXbox\EKON22\EKON22_scripts\EKON23_DET_5.py
MongoDB My Cluster sacred.runs & completed

What's behind test ? (backend pattern, crossentropy)
60000/60000 [==============================] - 426s 7ms/step - loss: 0.4982 - acc: 0.8510 - val_loss: 0.0788 - val_acc: 0.9749
Using TensorFlow backend.
INFO - MNIST-Convnet4 - Result: 0.9749
INFO - MNIST-Convnet4 - Completed after 0:07:27
Test loss: 0.0788029053777
Test accuracy: 0.9749

59392/60000 [============================>.] - ETA: 5s - loss: 0.0571 - acc: 0.9829
59520/60000 [============================>.] - ETA: 3s - loss: 0.0572 - acc: 0.9829
59648/60000 [============================>.] - ETA: 2s - loss: 0.0572 - acc: 0.9829
59776/60000 [============================>.] - ETA: 1s - loss: 0.0572 - acc: 0.9829
59904/60000 [============================>.] - ETA: 0s - loss: 0.0573 - acc: 0.9829
60000/60000 [==============================] - 513s 9ms/step - loss: 0.0573 - acc: 0.9829 - val_loss: 0.0312 - val_acc: 0.9891
INFO - MNIST-Convnet4 - Result: 0.9891
INFO - MNIST-Convnet4 - Completed after 0:33:28
Test loss: 0.0311644290059
Test accuracy: 0.9891
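Assuming the `loss` values in the log are categorical cross-entropy, as the slide title suggests, the following plain-Python sketch shows how such a loss and the matching accuracy are computed. The function names and the three-class sample data are made up for illustration; they are not taken from the MNIST run.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Categorical cross-entropy for one sample.

    y_true is a one-hot list, y_prob the predicted class probabilities.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


def mean_loss_and_accuracy(targets, predictions):
    """Average the per-sample loss and count how often the arg-max is right."""
    losses = [cross_entropy(t, p) for t, p in zip(targets, predictions)]
    hits = [t.index(1) == p.index(max(p)) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses), sum(hits) / len(hits)


# Made-up three-class example: the third sample is predicted wrongly.
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predictions = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

loss, acc = mean_loss_and_accuracy(targets, predictions)
print("loss:", round(loss, 4), "acc:", round(acc, 4))
```

A confident wrong prediction is penalized heavily, which is why loss can still fall while accuracy barely moves.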

What's behind code ? (Classification Summary)

>>> from sklearn import datasets, metrics, svm
>>> from sklearn.model_selection import train_test_split
>>> iris = datasets.load_iris()
>>> X = iris.data[:, 1:3]    # keep two features: sepal width and petal length
>>> y = iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=20)
>>> classifier = svm.SVC(kernel='linear', C=1.0)
>>> classifier.fit(X_train, y_train)
>>> y_pred = classifier.predict(X_test)
>>> print("Test - Accuracy SVC:", metrics.accuracy_score(y_test, y_pred))
Test - Accuracy SVC: 0.9555555555555556
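The `random_state` argument makes the split, and therefore the reported accuracy, reproducible. The sketch below shows the idea with a seeded shuffle in plain Python. `split_indices` is a hypothetical helper written for this example and does not reproduce scikit-learn's internal algorithm.

```python
import math
import random


def split_indices(n_samples, test_size, random_state):
    """Shuffle the indices with a seeded generator, then cut off the test part."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


# Same seed twice, then a different seed.
train_a, test_a = split_indices(150, test_size=0.3, random_state=20)
train_b, test_b = split_indices(150, test_size=0.3, random_state=20)
train_c, test_c = split_indices(150, test_size=0.3, random_state=21)

print("train/test sizes:", len(train_a), len(test_a))
print("same seed gives same split:", test_a == test_b)
print("different seed gives same split:", test_a == test_c)
```

With 150 samples and `test_size=0.3`, 45 samples land in the test set, which matches the length of `y_test` in the slides.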
What's behind code II ? (Check for duplicates)
>>> import numpy as np
>>> import pandas as pd
>>> y_test
array([0, 1, 1, 2, 1, 1, 2, 0, 2, 0, 2, 1, 2, 0, 0, 2, 0, 1, 2, 1, 1, 2,
       2, 0, 1, 1, 1, 0, 2, 2, 1, 1, 0, 0, 0, 2, 1, 0, 1, 2, 1, 2, 0, 1, 1])
>>> unique, counts = np.unique(y_test, return_counts=True)
>>> dict(zip(unique, counts))
{0: 13, 1: 18, 2: 14}
>>> Xyt = np.column_stack((X_test, y_test))
>>> csort = Xyt[Xyt[:,2].argsort()]    # sort the rows by class label
>>> dfiris = pd.DataFrame(csort)
>>> dfiris[0:13].groupby([0,1]).size()    # the 13 class-0 rows, grouped by feature pair
3.0  1.1    1
     1.4    2
3.1  1.5    2
3.2  1.2    1
     1.4    1
3.4  1.4    1
     1.6    1
     1.7    1
3.5  1.4    1
3.7  1.5    1
3.8  1.6    1
>>> sum(dfiris[0:13].groupby([0,1]).size()>1)
2
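The duplicate check above can also be written without pandas. The sketch below counts repeated feature pairs with `collections.Counter`; the rows are made up for illustration and are not the Iris test rows.

```python
from collections import Counter

# Made-up (sepal width, petal length) rows for illustration only.
rows = [
    (3.0, 1.4),
    (3.1, 1.5),
    (3.0, 1.4),
    (3.4, 1.6),
    (3.1, 1.5),
    (3.5, 1.4),
]

# Count how often each feature pair occurs.
counts = Counter(rows)

# Keep only the pairs that occur more than once.
duplicates = {pair: n for pair, n in counts.items() if n > 1}

print(duplicates)
print("pairs occurring more than once:", len(duplicates))
```

Identical feature rows matter because no classifier can separate them if they carry different labels.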

What's behind Python: PIP3 Install
pip3 install sacred
Collecting sacred
https://files.pythonhosted.org/packages/2d/86/7be3afa4d4c1c0c76a5de03e5ff779797ab2654e377685255c11c13c0ea5/sacred-0.7.3-py2.py3-none-any.whl (82kB)

Collecting pymongo
https://files.pythonhosted.org/packages/46/39/b9bb7fed3e3a0ea621a1512a938c105cd996320d7d9894d8239ca9093340/pymongo-3.6.1-cp36-cp36m-win_amd64.whl (291kB)
100% |████████████████████████████████| 296kB 728kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.6.1

https://github.com/pinae/Sacred-MNIST/blob/master/train_convnet.py

Machine Learning Process Chain

• Collab (set a control thesis, understand the problem, get resources: Python etc.)
• Collect (Scrapy data, store, data mining, filter data: inconsistent, incomplete)
• Consolidate or Clean data (normalization and aggregation, PCA data reduction, regression, filters; slice out irrelevant or ambiguous data and character/Unicode mapping problems)
• Cluster (k-means for categories, collocates for N keywords; unsupervised algorithms)
• Classify (SVM, Sequential, Bayes; supervised)
• Conclude and Control (predict or report the context thesis and drive data to a decision)
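The Cluster step in the chain above names k-means as an unsupervised algorithm. A minimal sketch of the idea in plain Python, on made-up one-dimensional data, looks as follows. The helpers `nearest` and `kmeans_1d` and the fixed starting centroids are assumptions for this example; in practice `sklearn.cluster.KMeans` would be the usual choice.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, centroids, iterations=10):
    """Plain Lloyd iteration on 1-D data with given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            clusters[nearest(v, centroids)].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    labels = [nearest(v, centroids) for v in values]
    return centroids, labels


# Made-up data with two obvious groups, around 1 and around 5.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, labels = kmeans_1d(values, centroids=[0.0, 6.0])

print("centroids:", centroids)
print("labels:   ", labels)
```

No labels are passed in: the groups come from the data alone, which is what separates this step from the supervised Classify step that follows it.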

http://www.softwareschule.ch/examples/machinelearning.jpg
https://maxbox4.wordpress.com/code/
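https://www.kaggle.com/

similarity of doc a to doc b:

$$
\mathrm{sim}(a, b) = \sum_{j} \frac{v(a, j)}{\sqrt{\sum_{j'} v(a, j')^{2}}} \cdot \frac{v(b, j)}{\sqrt{\sum_{j'} v(b, j')^{2}}} = A' \cdot B'
$$

where the sum runs over the words j, v(a, j) is the weight of word j in document a, and A', B' are the length-normalized vectors of a and b.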
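The similarity sim(a, b) between two documents can be sketched in plain Python as a cosine similarity of word vectors. The slide does not say how the weights v(a, j) are defined, so raw word counts are assumed here. The function names and sample sentences are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Map each lowercase word to its count (assumed weight v(doc, word))."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = term_vector("machine learning with python")
doc_b = term_vector("machine learning with delphi")
doc_c = term_vector("security threat correlation")

print("sim(a, b):", cosine_similarity(doc_a, doc_b))
print("sim(a, c):", cosine_similarity(doc_a, doc_c))
print("sim(a, a):", cosine_similarity(doc_a, doc_a))
```

Documents that share three of four words score 0.75, documents without common words score 0, and a document compared with itself scores 1.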
Double Trouble with ML → https://stats.stackexchange.com/
Stackexchange, Stackoverflow

File "C:\Users\max\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\metrics\cluster\unsupervised.py", line 254, in calinski_harabaz_score
intra_disp += np.sum((cluster_k - mean_k) ** 2)
MemoryError

| Status | Description |
|---|---|
| QUEUED | The run was just queued and not run yet |
| RUNNING | Currently running (but see below) |
| COMPLETED | Completed successfully |
| FAILED | The run failed due to an exception |
| INTERRUPTED | The run was cancelled with a KeyboardInterrupt |
| TIMED_OUT | The run was aborted using a TimeoutInterrupt |
| [custom] | A custom py:class:`~sacred.utils.SacredInterrupt` occurred |

THE TEST OVERVIEW

| Item | Count |
|---|---|
| No. of URLs removed | 76,732,515 |
| + No. of robots.txt requests | 3,675,634 |
| - No. of excluded URLs | 3,050,768 |
| = No. of HTTP requests | 77,357,381 |
| HTTP requests not responding | 1,763,850 |
QUESTIONS ?
17:45 - 18:30
Machine Learning II
Art. Neural Network

## Best Book in my opinion:

Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics