Why are data collected?

STAT 235 | 1
What kind of data are collected?

❑ Medical & pharmaceutical sciences...
❑ Financial companies & institutions...
❑ Engineering & computer science...
❑ Social & behavioral sciences...
❑ Education & research...
❑ Sport, entertainment & fun...
❑ etc.

STAT 235 - Introduction


Syllabus

❑ Introduction & Descriptive Statistics (chapter 6)
❑ Introduction to Probability (chapter 1)
❑ Random Variables (chapter 2)
❑ Discrete and Continuous Probability Distributions (chapters 3, 4)
❑ The Normal Probability Distribution (chapter 5)
❑ Sampling Distributions, Random Sample (chapter 7)
❑ Inferences on a Population Mean (chapter 8)
❑ Comparing Two Population Means (chapter 9)
❑ Simple Linear Regression and Correlation (chapter 12)
❑ Inferences on a Population Proportion (chapter 10)
❑ The Analysis of Variance (chapter 11)

How to do it in a proper way?

Probability vs. Statistics
❑ Random mechanism → produces random outcomes;
❑ A set of random outcomes (results) → drawing conclusions;

❑ Probability - theoretical background for a random mechanism;
(describes how the mechanism works, but in practice the mechanism itself is unknown)
❑ Statistics - uses random outcomes to draw conclusions;
(using statistics we can describe the underlying mechanism quite precisely)
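
The two directions can be illustrated with a minimal Python sketch. It assumes a hypothetical biased coin as the random mechanism; the names `p_true`, `simulate_outcomes` and `estimate_p` and the value 0.3 are illustrative assumptions, not part of the course material. Probability goes from a known mechanism to outcomes; statistics goes from observed outcomes back to the mechanism.

```python
import random

def simulate_outcomes(p_true, n, seed=0):
    """Probability direction: a known mechanism produces random outcomes."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_true else 0 for _ in range(n)]

def estimate_p(outcomes):
    """Statistics direction: use the outcomes to learn about the mechanism."""
    return sum(outcomes) / len(outcomes)

# Hypothetical mechanism: success probability 0.3 (an assumption for illustration).
outcomes = simulate_outcomes(p_true=0.3, n=10_000)
p_hat = estimate_p(outcomes)
print(f"estimated p = {p_hat:.3f}")
```

In real applications only `outcomes` would be available; `p_true` is exactly the part that stays unknown.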

Statistics in Real Life...

❑ In probability, we could have a precise description of the underlying random mechanism, which would be accurate, but the mechanism is unknown!

❑ In statistics, we have data (measurements) that are quite often not precise, but the final conclusions drawn from such data can still be very accurate!

❑ “I only believe in statistics that I doctored myself.”
(attributed to Winston S. Churchill, 1874 – 1965)

Population vs. Sample

Probability Theory  | Finite Sample          | Sample Statistics
Random Distribution | Empirical Distribution | Statistical Inference

Different Data types

❑ Categorical variables
  ❑ Nominal categories
    ❑ categories with no ordering;
      What kind of transportation do you use to get to work?
      What kind of material do you mostly prefer?
  ❑ Ordinal categories
    ❑ categories that can be ordered;
      What was your most frequent grade last semester?
      What energy category does the machine belong to?

❑ Numerical variables
  ❑ Integer values - counts
    ❑ an a priori assumption of equidistant differences;
      How many people visit this museum a day?
      How many calls a day are made to 911?
  ❑ Real values
    ❑ any real value is possible...
      length, weight, height, temperature, etc.

Data exploration

❑ What is the nature of the data?
❑ What are the limitations of the underlying experiment?
❑ What is the main question of interest?
❑ Is it possible to answer it using the available data?
❑ Is it important to visualize the data?
❑ How can one gain insight into the data?

Binary Data
❑ the simplest data type...
❑ different coding possible...
  ❑ logical: TRUE | FALSE; 1 | 0; YES | NO;
  ❑ categorical: A | B; 1 | 2; HOME | ABROAD;
❑ only a few options on how to represent such data;

Summary (table):

TRUE    41
FALSE   59

Frequency summary:

TRUE: 41.00 %

Graphical view: [bar plot of the FALSE and TRUE counts]

Categorical Data
❑ Nominal categories...
❑ Machine Breakdowns: what is the cause of a machine breakdown?
electrical | mechanical | misuse
❑ Data: a sequence of the words 'electrical', 'mechanical' or 'misuse';

Summary (table) and frequency summary:

Electrical    9   19.57 %
Mechanical   24   52.17 %
Misuse       13   28.26 %

Graphical view: [plot of the counts for electrical, mechanical and misuse]
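
The step from the count table to the frequency summary can be sketched in Python as follows. The function name `frequency_summary` is hypothetical, and the sample is rebuilt from the counts in the summary table (9, 24, 13) rather than from the original raw data, which the slides do not list.

```python
from collections import Counter

def frequency_summary(observations):
    """Return {category: (count, percent)} for a sequence of category labels."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.items()}

# Rebuild a sample with the counts shown in the summary table.
sample = ["electrical"] * 9 + ["mechanical"] * 24 + ["misuse"] * 13
summary = frequency_summary(sample)
for cat, (n, pct) in summary.items():
    print(f"{cat:<10} {n:>3} {pct:6.2f} %")
```

Each percentage is simply the count divided by the sample size (46 here), so the percentages sum to 100 up to rounding.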

Categorical Data
❑ Ordinal categories...
❑ Number of Breakdowns: How many times did a machine break down?
0 times | once | two times and more
❑ Data: a sequence of values 0, 1 and 2;

Summary (table) and frequency summary:

Zero   54   54 %
Once   40   40 %
≥2      6    6 %

Graphical view: [bar plot of the counts for zero, once and ≥2]

❑ What is the main difference between nominal and ordinal categories?

Numerical Data
❑ Integer values - counts...
❑ How many hours can a machine work until it breaks down?
0, 1, 2, 3, ... etc.
❑ How many people a day will use the lift?
0, 1, 2, 3, ... etc.
❑ How many passengers are in a moving car?
1, 2, 3, etc., up to some upper bound (≤ 5, ≤ 10, ≤ 30, ≤ 100, ...)
❑ Data: mostly a sequence of integer values...

[Figure: bar plot of table(rr) for the values 4 to 18 (counts 0 to 15), next to a frequency histogram of "Height of males" (x-axis 165 to 195, y-axis 0 to 150)]

Numerical Data
❑ Real values...
❑ We could improve the precision of the previous example...
data → positive real values...
❑ How long can a machine work until it breaks down?
❑ Data: a sequence of real-valued observations...
❑ infinitely many different outcomes to be considered;
❑ the most common data presentation uses histograms;

[Figure: density histograms of "Height of males" (x-axis 165 to 195) and "Height of females" (x-axis 155 to 190); y-axis Density, 0.00 to 0.10]

Histogram...

[Figures: five histograms of "Height of males", one on the density scale (0.00 to 0.08) and four on the frequency scale with y-axis maxima of 150, 300, 40 and 15; x-axis roughly 165 to 200]
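
How the same data can give histograms on a frequency scale or a density scale, and with wide or narrow bins, can be sketched in Python. The heights are simulated here; the normal model, its parameters and the helper names `histogram` and `to_density` are assumptions made for illustration, since the course data are not reproduced in the slides.

```python
import random

def histogram(values, start, width, n_bins):
    """Count observations per bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

def to_density(counts, width):
    """Rescale counts so that the bar areas sum to 1."""
    n = sum(counts)
    return [c / (n * width) for c in counts]

# Simulated heights in cm; the normal model and its parameters are assumptions.
rng = random.Random(1)
heights = [rng.gauss(180, 5) for _ in range(500)]

coarse = histogram(heights, start=160, width=10, n_bins=4)    # few wide bins
fine = histogram(heights, start=160, width=2.5, n_bins=16)    # many narrow bins
density = to_density(fine, width=2.5)
print(coarse)
print(fine)
```

The frequency scale depends on both the sample size and the bin width, which is one reason the y-axis maxima of otherwise similar histograms can differ so much; the density scale removes that dependence.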

Boxplot figure...

Some nice properties

What can be directly observed from a histogram (boxplot)?

❑ symmetric/non-symmetric data distribution;
❑ unimodal or multimodal data distribution;
❑ skewness of data distribution;
❑ indications of outlying observations;
❑ ...

This is not so important now, while we focus only on the data exploration part; however, it will become really crucial when considering more sophisticated statistical modeling approaches.

Data sample reliability

❑ How reliable are such data samples?
❑ Could we expect more reliability and confidence if more data are available?
❑ Is there a way to express such reliability and confidence numerically?
❑ What is a sufficient number of observations?
❑ The more data we obtain, the more we have to pay for them...
❑ How to find a reasonable balance?
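
One way to explore these questions numerically is a small simulation: draw many samples of a given size and see how much the sample average varies from sample to sample. The sketch below assumes a hypothetical normal population (mean 10, standard deviation 3); the function name `spread_of_sample_mean` and all parameter values are illustrative only.

```python
import random
import statistics

def spread_of_sample_mean(n, repeats=200, seed=0):
    """Standard deviation of the sample mean over repeated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(10, 3) for _ in range(n))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# Hypothetical population: normal with mean 10 and standard deviation 3.
for n in (10, 100, 1000):
    print(n, round(spread_of_sample_mean(n), 3))
```

Under these assumptions the spread shrinks as the sample grows, but each further gain in precision requires disproportionately more observations, which is the cost trade-off raised in the last two bullets.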

From counts to ordinal categories


❑ How long does the machine work until it breaks down?
❑ more than 15 hours ⇒ category A
❑ more than 10 & at most 15 hours ⇒ category B
❑ more than 5 & at most 10 hours ⇒ category C
❑ at most 5 hours ⇒ category D

❑ What is the data sample now?

[Figure: bar plot of the counts for category A, category B, category C and category D; y-axis 0 to 50]
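
The recoding from counts to ordinal categories can be sketched in Python. The working times below are hypothetical, the function name `to_category` is illustrative, and boundary values (exactly 5, 10 or 15 hours) are assigned to the category for the shorter working time, which is an assumption of this sketch.

```python
from collections import Counter

def to_category(hours):
    """Map a working time in hours to an ordinal category A-D.

    Boundary values (exactly 5, 10 or 15 hours) are assigned to the
    category for the shorter working time (an assumption of this sketch).
    """
    if hours > 15:
        return "A"
    if hours > 10:
        return "B"
    if hours > 5:
        return "C"
    return "D"

# Hypothetical working times in hours, for illustration only.
hours_until_breakdown = [4, 7, 12, 16, 9, 15, 18, 5, 11, 3]
categories = [to_category(h) for h in hours_until_breakdown]
counts = Counter(categories)
print([(c, counts[c]) for c in "ABCD"])
```

After recoding, the data sample is a sequence of category labels, so the exact working times can no longer be recovered from it.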

Higher dimensional data

❑ two categorical variables;
❑ one categorical and one numerical variable;
❑ two numerical (continuous) variables;
❑ higher-dimensional data ⇒ it gets even more difficult to plot;
❑ multivariate statistical methods have been proposed for use instead;

Titanic survivals
, , Age = Child, Survived = No

        Sex
Class   Male  Female
1st        0       0
2nd        0       0
3rd       35      17
Crew       0       0

, , Age = Child, Survived = Yes

        Sex
Class   Male  Female
1st        5       1
2nd       11      13
3rd       13      14
Crew       0       0

, , Age = Adult, Survived = No

        Sex
Class   Male  Female
1st      118       4
2nd      154      13
3rd      387      89
Crew     670       3

, , Age = Adult, Survived = Yes

        Sex
Class   Male  Female
1st       57     140
2nd       14      80
3rd       75      76
Crew     192      20
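
A four-way table like this can be collapsed to answer simple questions, for example how survival differs by sex. The following Python sketch uses the counts as read from the tables above; the dictionary layout and the helper names `count` and `survival_rate` are illustrative assumptions.

```python
# (age, survived) -> {class: (male, female)}; counts as read from the tables above.
TITANIC = {
    ("Child", "No"):  {"1st": (0, 0),    "2nd": (0, 0),    "3rd": (35, 17),  "Crew": (0, 0)},
    ("Child", "Yes"): {"1st": (5, 1),    "2nd": (11, 13),  "3rd": (13, 14),  "Crew": (0, 0)},
    ("Adult", "No"):  {"1st": (118, 4),  "2nd": (154, 13), "3rd": (387, 89), "Crew": (670, 3)},
    ("Adult", "Yes"): {"1st": (57, 140), "2nd": (14, 80),  "3rd": (75, 76),  "Crew": (192, 20)},
}
SEX_INDEX = {"Male": 0, "Female": 1}

def count(sex=None, survived=None):
    """Total count over all cells, optionally restricted by sex and survival."""
    total = 0
    for (_, surv), by_class in TITANIC.items():
        if survived is not None and surv != survived:
            continue
        for male_female in by_class.values():
            if sex is None:
                total += sum(male_female)
            else:
                total += male_female[SEX_INDEX[sex]]
    return total

def survival_rate(sex):
    """Share of survivors among all people of the given sex."""
    return count(sex=sex, survived="Yes") / count(sex=sex)

print(count())
print(round(survival_rate("Male"), 3), round(survival_rate("Female"), 3))
```

Summing over age and class in this way gives marginal counts; the same pattern could be used to condition on class or age instead.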

Car Braking Distance

[Figure: scatter plot of dist (y-axis, 0 to 120) against speed (x-axis, 5 to 25)]

Orange Trees

Sampling procedure

From Population to its Sample

population ⇒ population sample ⇒ statistical inference

Sampling procedure:
❑ ideally we would like to obtain a random sample;
❑ random sample → independent and identically distributed observations;
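
Drawing a simple random sample from a finite population can be sketched in Python as follows. The population of unit labels 1 to 1000 and the function name `simple_random_sample` are hypothetical. Note that sampling without replacement from a finite population gives observations that are only approximately independent; the approximation is usually considered good when the population is much larger than the sample.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that each unit is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: unit labels 1..1000 (illustration only).
population = list(range(1, 1001))
sample = simple_random_sample(population, n=20, seed=42)
print(sorted(sample))
```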

Sample Statistics

Two different worlds

...

Sample Mean

❑ sample mean ≡ average ≠ mean (the population mean);
❑ arithmetic average of all given observations;
❑ notation: $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$, where the actual observations are $\{x_1, x_2, \dots, x_n\}$;
❑ it is a “middle value” with respect to the values and their distribution;
❑ Advantage: it is sensitive with respect to outlying observations;
❑ Disadvantage: it is sensitive with respect to outlying observations;
❑ Some other proposals: sample trimmed mean, weighted mean, etc.;
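
The sensitivity of the sample mean to outlying observations, and the effect of trimming, can be checked with a short Python sketch. The observations are hypothetical, and `trimmed_mean` implements one simple variant (dropping the k smallest and k largest values); other definitions of the trimmed mean exist.

```python
def sample_mean(xs):
    """Arithmetic average of all observations."""
    return sum(xs) / len(xs)

def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(xs)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical observations; the last one is an outlier.
data = [4, 5, 5, 6, 7, 8, 100]
print(sample_mean(data[:-1]))     # without the outlier
print(sample_mean(data))          # with the outlier
print(trimmed_mean(data, k=1))    # outlier trimmed away
```

A single extreme value moves the sample mean far away from the bulk of the data, while the trimmed mean stays close to it.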

Sample Median

❑ middle observation from all ordered observations;
❑ $\tilde{x}_n = x_{((n+1)/2)}$ for n odd and $\tilde{x}_n = [x_{(n/2)} + x_{(n/2+1)}]/2$ for n even, where $x_{(1)} \le \dots \le x_{(n)}$ are the ordered observations;
❑ it is a “middle value” with respect to the given values themselves;
❑ Advantage: it is not sensitive with respect to observation values;
❑ Disadvantage: it is not sensitive with respect to observation values;
❑ Some other proposals as well ...;
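
A minimal Python sketch of the sample median, using the usual convention: the single middle order statistic for odd n, and the average of the two middle order statistics for even n. The data values are hypothetical and the function name `sample_median` is illustrative.

```python
def sample_median(xs):
    """Middle value of the ordered observations."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # x_((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2   # [x_(n/2) + x_(n/2+1)] / 2

# Hypothetical observations; the largest value is an outlier.
data = [4, 5, 5, 6, 7, 8, 100]
print(sample_median(data))                    # odd n
print(sample_median(data[:-1]))               # even n
print(sample_median(data[:-1] + [10_000]))    # a more extreme outlier
```

Making the outlier more extreme does not change the median at all, which is both its strength and its weakness: it ignores how far away the extreme values are.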

Some other sample characteristics

❑ sample mode (for categorical variables mostly);
What is the relation between sample mean, median and mode?
❑ sample variance & sample standard deviation;
often hard to interpret intuitively → the sample range is used “instead”
❑ sample quantiles (quartiles, percentiles, etc.);
some of them are more important than others...
❑ coefficient of variation;
spread of data relative to its middle value
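
These characteristics can be computed with Python's standard `statistics` module, as in the sketch below. The measurements and category labels are hypothetical; the spread is measured by the sample variance and its square root, the sample standard deviation, and the coefficient of variation is taken here as the standard deviation divided by the sample mean, which is one common definition.

```python
import statistics

# Hypothetical data, for illustration only.
data = [12.1, 9.8, 11.4, 10.2, 10.9, 9.5, 13.0, 10.6]
causes = ["mechanical", "electrical", "mechanical", "misuse", "mechanical"]

mode = statistics.mode(causes)                  # most frequent category
variance = statistics.variance(data)            # sample variance, divides by n - 1
std_dev = statistics.stdev(data)                # square root of the sample variance
data_range = max(data) - min(data)              # sample range
q1, q2, q3 = statistics.quantiles(data, n=4)    # sample quartiles
cv = std_dev / statistics.mean(data)            # coefficient of variation

print(mode, round(variance, 3), round(std_dev, 3), round(data_range, 3))
print(q1, q2, q3, round(cv, 3))
```

The second quartile `q2` is the sample median; different software may use slightly different interpolation rules for the other quantiles.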

To be continued...

❑ Probability, probability concepts;
❑ Random events, combination of events;
❑ Conditional probability;
❑ Independence principle;
❑ Law of Total Probability;
❑ Bayes' Theorem in Probability;
