2UDA Orange Tutorial-2ndQ

2UDA
Data exploration and analysis with Orange
2ndQuadrant 2014-5
PostgreSQL database
Use psql to connect to the database sample as user postgres

$ psql sample postgres
List the included tables and examine the table sample_cars, which we will be using in this tutorial:
# \d
# \d sample_cars
# SELECT count(*) FROM sample_cars;
# SELECT * FROM sample_cars LIMIT 3;
Using Orange
Data stored in the database can be explored and visualized in Orange
Orange components
Example use case
Let us examine the auto-mpg data set (http://archive.ics.uci.edu/ml/datasets/Auto+MPG)
We loaded this data set into the table auto in psql
Each row represents a car model described with several variables (columns) containing numerical, categorical and string values
We will explore the data and build predictive models
SQL Table
Load the data set by placing the SQL Table widget on the canvas, opening it, and setting its parameters
You can download all the data into memory or, if the table is too big, work with it remotely in the database
Data Info
Shows basic information about the loaded table
Data Table
Visualizations - overview
Summary statistics and distributions:
  Box Plot
  Distributions
Multivariate visualizations (relationships, interactions):
  Continuous:
    Scatter Plot (Scatter Map for big data)
    Linear Projection (>2 variables at a time)
  Categorical:
    Mosaic Plot
    Sieve Diagram
Visualizations
Box Plot
Shows basic statistics: mean, standard deviation, median, quartiles, min, max
Compares these across different data subsets
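The statistics listed above can be sketched in a few lines of plain Python. This is only an illustration of what a box plot summarizes, not how Orange computes it: the two groups and their mpg values are made up, the helper name summarize is hypothetical, and quartile conventions differ between tools, so the numbers may not match the widget exactly.

```python
import statistics

# Made-up mpg values for two hypothetical subsets of cars (not real data).
mpg_by_group = {
    "group_a": [18.0, 15.0, 16.0, 17.0, 14.0, 24.0, 22.0, 21.0],
    "group_b": [27.0, 26.0, 25.0, 31.0, 35.0, 24.0, 32.0, 28.0],
}


def summarize(values):
    """Return the summary statistics a box plot typically displays."""
    # statistics.quantiles uses the "exclusive" method by default;
    # other tools may use a different quartile convention.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


# Compare the same statistics across the two subsets.
summaries = {name: summarize(vals) for name, vals in mpg_by_group.items()}
for name, stats in summaries.items():
    print(name, stats)
```

Printing the summaries side by side mirrors what the widget does graphically when a grouping variable is selected.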
Distributions
Scatter Plot
Sieve Diagram
Observe the (co-)occurrence of values for pairs of variables
The diagram shows which combinations are over- or under-represented
Orange informs us that this only works for discrete variables
Let's discretize the continuous ones so we can use them too...
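To make "over- or under-represented" concrete, the sketch below discretizes a continuous variable and compares the observed count of each value pair with the count expected if the two variables were independent. It is a minimal illustration under assumptions: the rows, the cut points, and the function names discretize and observed_and_expected are all invented for this example, and Orange's own discretization may choose intervals differently.

```python
from collections import Counter


def discretize(value, cut_points):
    """Return the index of the interval that value falls into."""
    return sum(value >= c for c in cut_points)


def observed_and_expected(pairs):
    """Map each (a, b) combination to (observed count, expected count).

    The expected count assumes the two variables are independent:
    row_total * column_total / n.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    table = {}
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            table[(a, b)] = (observed.get((a, b), 0), expected)
    return table


# Made-up (cylinders, mpg) rows; the cut points 20 and 30 are arbitrary.
rows = [(4, 31.0), (4, 28.5), (4, 33.0), (6, 21.0),
        (6, 19.5), (8, 14.0), (8, 15.5), (8, 13.0)]
pairs = [(cyl, discretize(mpg, [20.0, 30.0])) for cyl, mpg in rows]

for combo, (obs, exp) in sorted(observed_and_expected(pairs).items()):
    label = "over" if obs > exp else "under" if obs < exp else "as expected"
    print(combo, obs, round(exp, 2), label)
```

Combinations whose observed count exceeds the expected one are the over-represented cells; those below it are under-represented.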
Sieve Diagram
Predictive models
After familiarizing ourselves with the data, let's let the computer discover some patterns in it and build predictive models
Select the mpg variable as the target and see if we can predict it from the values of the other variables
Predicting a numerical target value is known as a regression problem
Linear Regression
Note: there is a conceptual error in the shown workflow (see next slide)
Model evaluation
How good are the results of the Linear Regression model?
We should never judge a model by how well it predicts instances it has already seen during the learning phase!
Use the Test & Score widget to split the data into learning and test sets and estimate the predictive performance more reliably
Add Mean Learner as a baseline for comparison (it computes the mean target value and always predicts that for new instances)
Add other regression algorithms (adjust their settings) and compare
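The idea behind this evaluation can be sketched in plain Python: fit on one part of the data, score on the held-out part, and compare against a predictor that always returns the training mean. This is a simplified stand-in, not Orange's implementation. The data are synthetic (a made-up weight-to-mpg relation with noise), the model uses a single input variable, and all function names are hypothetical.

```python
import random
import statistics


def fit_mean(train):
    """Baseline: always predict the mean target seen in training."""
    mean_y = statistics.mean(y for _, y in train)
    return lambda x: mean_y


def fit_linear(train):
    """Least-squares line through (x, y) pairs."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def rmse(model, rows):
    """Root mean squared error of model on rows."""
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5


def cross_validate(fit, rows, k=5):
    """Average test RMSE over k folds; each model never sees its test fold."""
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        errors.append(rmse(fit(train), test))
    return statistics.mean(errors)


# Synthetic data: x is a made-up weight, y a made-up mpg with noise.
rng = random.Random(0)
weights = [rng.uniform(1.5, 5.0) for _ in range(200)]
rows = [(w, 45.0 - 7.0 * w + rng.gauss(0.0, 2.0)) for w in weights]

linear_error = cross_validate(fit_linear, rows)
baseline_error = cross_validate(fit_mean, rows)
print("linear RMSE:", linear_error)
print("mean baseline RMSE:", baseline_error)
```

A model is only worth keeping if its held-out error is clearly lower than the baseline's; on this synthetic data the linear fit should win by a wide margin.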
Model evaluation
Classification
To see how to classify instances into categorical classes, set the target variable to origin
Use Test & Score on some models and see what kinds of mistakes they make using the Confusion Matrix widget
Classification performance & Confusion Matrix
Check performance scores such as classification accuracy, AUC, etc. of different models
See the types of mistakes they make with the Confusion Matrix widget
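What a confusion matrix and classification accuracy contain can be shown with a small plain-Python sketch. The class labels and predictions below are invented for illustration and do not come from any model in this tutorial; the function names are hypothetical as well.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each actual class was predicted as each class."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up origin labels and predictions (not real model output).
actual = ["usa", "usa", "europe", "japan", "japan", "usa"]
predicted = ["usa", "europe", "europe", "japan", "usa", "usa"]

matrix = confusion_matrix(actual, predicted)
score = accuracy(actual, predicted)

classes = sorted(set(actual) | set(predicted))
for a in classes:
    # One row per actual class, one column per predicted class.
    print(a, [matrix[(a, p)] for p in classes])
print("accuracy:", score)
```

The off-diagonal cells are the mistakes: here one usa car is predicted as europe and one japan car as usa, which a single accuracy number would hide.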
Model interpretation
Adjust the parameters of Classification Tree to produce smaller trees (e.g. min instances in leaves=20, max depth=4) and visualize the model
Some models are harder to interpret than others
Big data, TABLESAMPLE & Rapid visualization
So far we have worked with a small data table with 398 rows
What can be done with 400M rows or more?
Sometimes even simple operations (like counting the number of rows) take a long time
Orange uses time-based sampling, built on the PostgreSQL 9.5 TABLESAMPLE feature, to produce approximate but relevant results quickly
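The principle of trading exactness for speed can be illustrated with a small simulation. Note the swap: the sketch uses fixed-fraction row sampling rather than the time-based sampling mentioned above, and it runs in Python memory rather than in the database, so it shows only why a scaled-up sample count approximates the true count. The function name approximate_count and the table size are assumptions made for this example.

```python
import random


def approximate_count(table, fraction, rng):
    """Estimate the number of rows from a random sample.

    Each row is kept with probability `fraction`; dividing the sample
    size by that fraction scales it back up to an estimate of the total.
    """
    sampled = sum(1 for _ in table if rng.random() < fraction)
    return sampled / fraction


# In PostgreSQL 9.5+ a comparable fixed-fraction query would look like:
#   SELECT count(*) FROM sample_cars TABLESAMPLE BERNOULLI (1);
# (1 means 1 percent of the rows; the result is then multiplied by 100).
# This simulation still visits every row, so it shows the statistics only,
# not the I/O savings a database gets from sampling.

true_rows = 200_000  # stand-in for a big table
estimate = approximate_count(range(true_rows), 0.01, random.Random(42))
relative_error = abs(estimate - true_rows) / true_rows
print("estimated rows:", estimate)
print("relative error:", relative_error)
```

With a 1 percent sample the estimate is typically within a few percent of the true count, which is usually enough for a first visual impression of the data.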
Big data and Orange
SQL Table can be used to load tables of any size (make sure to deselect the download option and use sampling to auto-discover variable types)
Some visualizations (Box Plot, Distributions, Sieve Diagram) work the same way, but approximate the results on a subset of the data
Those that show individual instances (Scatter Plot, Linear Projection, Heat Map) usually show a sample or can be replaced by a modified version (e.g. Scatter Plot -> Scatter Map)
Most learning algorithms are not adapted to big data, but can be used on explicit data samples (obtained with the Data Sampler widget)