
2UDA

Data exploration and analysis with Orange

2ndQuadrant 2014-5

PostgreSQL database

Use psql to connect to the database sample as user postgres


$ psql sample postgres

List the included tables and examine the table sample_cars, which we will be using in this tutorial

# \d
# \d sample_cars
# SELECT count(*) FROM sample_cars;
# SELECT * FROM sample_cars LIMIT 3;


Using Orange
Data stored in the database can be explored and visualized in Orange


Orange components


Example use case

Let us examine the auto-mpg data set (http://archive.ics.uci.edu/ml/datasets/Auto+MPG)

We loaded this data set into the table auto in psql

Each row represents a car model described with several variables (columns) containing numerical, categorical and string values

We will explore the data and build predictive models


SQL Table
Load the data set by placing the SQL Table widget on the canvas, opening it and setting the connection parameters

You can download all the data into memory or, if the table is too big, work with it remotely in the database

Data Info
Shows basic information about the loaded table


Data Table


Visualizations - overview

Summary statistics and distributions:

Box Plot
Distributions

Multivariate visualizations (relationships, interactions):

Continuous:
Scatter Plot (Scatter Map for big data)
Linear Projection (>2 variables at a time)

Categorical:
Mosaic Plot
Sieve Diagram


Visualizations


Box Plot
Shows basic statistics: mean, standard deviation, median, quartiles, min and max

These can be compared across different data subsets
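The statistics a box plot summarizes can be sketched in plain Python. This is only an illustration of the numbers involved, not how Orange computes them; the mpg values and the grouping by origin below are invented for the example.

```python
import statistics

# Illustrative mpg values grouped by origin (made-up numbers, not the real data set).
mpg_by_origin = {
    "usa": [18.0, 15.0, 16.0, 17.0, 14.0, 24.0, 22.0, 21.0],
    "japan": [24.0, 27.0, 31.0, 35.0, 32.0, 28.0, 26.0, 33.0],
}


def box_plot_summary(values):
    """Return the statistics a box plot typically displays."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# One summary per subset, so the subsets can be compared side by side.
summaries = {origin: box_plot_summary(v) for origin, v in mpg_by_origin.items()}
for origin, s in summaries.items():
    print(origin, {k: round(x, 2) for k, x in s.items()})
```

Computing one summary per subset mirrors what the widget does when the box plot is grouped by a categorical variable.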


Distributions


Scatter Plot


Sieve Diagram
Observe the (co)occurrence of values for pairs of variables
The diagram shows which value combinations are over- or under-represented relative to what we would expect if the two variables were independent

Orange informs us that this only works for discrete variables
Let's discretize the continuous ones to use them too...
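The idea behind the sieve diagram can be sketched as follows: discretize the continuous variable, then compare the observed count of each value pair with the count expected under independence. The data, the cut points (17 and 25 mpg) and the helper names are assumptions made for this example; Orange offers its own discretization methods.

```python
from collections import Counter

# Made-up (cylinders, mpg) pairs for illustration only.
cars = [(4, 31.0), (4, 28.5), (4, 26.0), (4, 33.5), (6, 19.0),
        (6, 21.5), (6, 18.0), (8, 14.0), (8, 15.5), (8, 13.0)]

LEVELS = ["low", "medium", "high"]


def discretize(value, thresholds, labels):
    """Map a continuous value to the label of the interval it falls in."""
    for threshold, label in zip(thresholds, labels):
        if value < threshold:
            return label
    return labels[-1]


# Hypothetical cut points chosen by hand for this sketch.
pairs = [(cyl, discretize(mpg, [17.0, 25.0], LEVELS)) for cyl, mpg in cars]

observed = Counter(pairs)
row_totals = Counter(cyl for cyl, _ in pairs)
col_totals = Counter(level for _, level in pairs)
n = len(pairs)


def expected(cyl, level):
    """Expected count if the two variables were independent."""
    return row_totals[cyl] * col_totals[level] / n


for cyl in sorted(row_totals):
    for level in LEVELS:
        obs, exp = observed[(cyl, level)], expected(cyl, level)
        tag = "over" if obs > exp else "under" if obs < exp else "as expected"
        print(f"cylinders={cyl} mpg={level}: observed={obs} expected={exp:.1f} ({tag})")
```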


Sieve Diagram


Predictive models

After familiarizing ourselves with the data, let's have the computer discover some patterns in it and build predictive models
Select the mpg variable as the target and see whether we can predict it from the values of the other variables
Predicting a numerical target value is known as a regression problem


Linear Regression

Note: there is a conceptual error in the shown workflow (see next slide)

Model evaluation

How good are the results of the Linear Regression model?

We should never judge a model by how well it predicts instances it has already been trained on during the learning phase!
Use the Test & Score widget to split the data into learning and test sets and estimate the predictive performance more reliably
Add Mean Learner as a baseline for comparison (it computes the mean target value and always predicts that for new instances)

Add other regression algorithms (adjusting their settings) and compare
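The evaluation idea can be sketched outside Orange in a few lines of Python. This is a minimal illustration, assuming a single hold-out split and one input variable; the data is synthetic, and the function names (fit_mean, fit_linear, rmse) are invented for the example rather than taken from Orange.

```python
import random

random.seed(0)

# Synthetic data: mpg falls roughly linearly with weight (invented numbers).
weights = [random.uniform(1600, 5000) for _ in range(200)]
data = [(w, 46.0 - 0.0077 * w + random.gauss(0, 3)) for w in weights]

# Hold out 30% of the rows so models are scored on unseen instances.
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]


def fit_mean(rows):
    """Baseline: always predict the mean target of the training rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_linear(rows):
    """Ordinary least squares with a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, rows):
    """Root mean squared error of the model on the given rows."""
    return (sum((model(x) - y) ** 2 for x, y in rows) / len(rows)) ** 0.5


# Both models are trained on the learning set and scored on the test set only.
baseline_rmse = rmse(fit_mean(train), test)
linear_rmse = rmse(fit_linear(train), test)
print(f"mean baseline RMSE: {baseline_rmse:.2f}")
print(f"linear model RMSE:  {linear_rmse:.2f}")
```

A model is only worth keeping if it clearly beats the mean baseline on the held-out rows; scoring on the training rows would make both models look better than they are.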


Model evaluation


Classification
To see how to classify instances into categorical classes, set the target variable to origin
Test and score some models and see what kinds of mistakes they make using the Confusion Matrix widget


Classification performance & Confusion Matrix
Check performance scores like classification accuracy, AUC etc. of different models
See the types of mistakes they make with the Confusion Matrix widget
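What a confusion matrix counts can be sketched in plain Python. The actual and predicted labels below are invented for illustration; in the workflow they would come from Test & Score.

```python
from collections import Counter

# Invented true and predicted origin labels for ten cars.
actual = ["usa", "usa", "usa", "usa", "europe",
          "europe", "europe", "japan", "japan", "japan"]
predicted = ["usa", "usa", "usa", "europe", "europe",
             "japan", "europe", "japan", "japan", "usa"]

classes = sorted(set(actual))
counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
matrix = [[counts[(a, p)] for p in classes] for a in classes]

# Correct predictions sit on the diagonal of the matrix.
accuracy = sum(counts[(c, c)] for c in classes) / len(actual)

print("columns (predicted):", classes)
for a, row in zip(classes, matrix):
    print(f"{a:>8}", row)
print(f"classification accuracy: {accuracy:.2f}")
```

The off-diagonal cells show which classes get confused with which, which a single accuracy score hides.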


Model interpretation
Adjust the parameters of Classification Tree to produce smaller trees (e.g. min instances in leaves = 20, max depth = 4) and visualize the model

Some models are harder to interpret than others


Big data, TABLESAMPLE & Rapid visualization

So far we have worked with a small data table of 398 rows

What can be done with 400M rows or more?
Sometimes even simple operations (like counting the number of rows) take a long time
Orange uses time-based sampling, built on the PostgreSQL 9.5 TABLESAMPLE feature, to produce approximate but relevant results quickly
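To illustrate the general idea of trading accuracy for speed, here is a small sketch that builds a TABLESAMPLE query and scales a sampled count up to an estimate for the whole table. It uses the built-in percentage-based SYSTEM method, not the time-based sampling Orange relies on; the helper names and the example sample count are hypothetical.

```python
def sampled_count_query(table, percent):
    """Build a query that counts rows in a random sample of the table.

    Uses the built-in SYSTEM method (PostgreSQL 9.5+), which picks whole
    storage blocks at random; this is NOT Orange's time-based method.
    """
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"SELECT count(*) FROM {table} TABLESAMPLE SYSTEM ({percent})"


def estimate_total(sample_count, percent):
    """Scale a count taken on a percent-sized sample up to the full table."""
    return round(sample_count * 100 / percent)


query = sampled_count_query("sample_cars", 1)
print(query)

# Hypothetical result: if the 1% sample contained 4,012,345 rows,
# the full table holds roughly this many:
print(estimate_total(4_012_345, 1))
```

The estimate is approximate because the sample is random, but the query touches only a fraction of the table.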


Big data and Orange

SQL Table can be used to load tables of any size (make sure to deselect the download option and use sampling to auto-discover variable types)

Some visualizations (Box Plot, Distributions, Sieve Diagram) work the same
way, but approximate the results on a subset of the data
Those that show individual instances usually show a sample
(Scatter Plot, Linear Projection, Heat Map)
or can be replaced by a modified version (e.g. Scatter Plot -> Scatter Map)
Most learning algorithms are not adapted to big data, but can be used on
explicit data samples (obtained with the Data Sampler widget)
