
Chapter 1

Descriptive Statistics

Statistics can be defined, in short, as follows:

Definition 1.0.1 Statistics is a set of techniques whose goal is to understand and draw
conclusions about a phenomenon (or phenomena), in a particular place and time,
using the information contained in a set of data.

For example, we may study the unemployment in Catalonia in 2014 using the En-
questa de Població Activa (EPA), or the opinion poll of the electorate before any
call for elections using surveys published in the media.
Technically, the "phenomenon" to study is called the variable, while that "particular place
and time" is the population. In the same sense, "data" is usually called sample data
(or observations). In this way,

Definition 1.0.2 Population is the set of elements to study from which we want to draw
a conclusion with respect to some of its features (or variables).

Definition 1.0.3 Sample is a subset of the population used to collect information and
draw conclusions.

Definition 1.0.4 Variables are those features or phenomena of the population that we
want to study. A variable is an observable characteristic that varies among the different
individuals of a population.

Definition 1.0.5 Data are the specific values of the considered variables that we ob-
serve in the sample at hand.

Therefore, in the first example above, the population is the population of Catalonia in
2014, and the data is obtained from the sample called Encuesta de Población Activa
(EPA). (See figure 1.1.) The variable studied is, clearly, unemployment.
In the second example, the population is composed of all the citizens with the right to
vote, the sample is the people interviewed in the published survey, the data is all the
answers collected, and the variable of interest is the electorate opinion.
Thus, we may re-define statistics in a more accurate manner as follows.


Figure 1.1: Encuesta de Población Activa (Source: INE 22/01/2015)

Definition 1.0.6 Statistics is a set of techniques whose goal is to understand and draw
conclusions regarding a variable (or variables) of a population using the data collected
in a sample.

Descriptive Statistics is a sub-field within statistics that deals with the task of arranging
the data in such a way that the analysis becomes easier.

Definition 1.0.7 Descriptive Statistics deals with the study of a set of data, corre-
sponding to one or more variables of a population, and its arrangement in order to
facilitate its understanding and use.

Once all the information is arranged in a clear and efficient way, the researcher can use
"Statistical Inference" techniques, which use probabilistic tools, to extend the conclu-
sions obtained from the sample to the entire population that is under study.

1.1 Univariate frequency distribution tables

1.1.1 Types of variables

As in other branches of mathematics, variables are usually denoted using the last letters
of the alphabet (in uppercase): X, Y, Z. If we need to consider many variables at the
same time, the use of subscripts becomes necessary: X1, X2, . . . , Xn.
The different data points (or observations) for each variable that we obtain from a
sample are often denoted using subscripts of the corresponding variable (in lowercase).
For instance, if we have 3 data points corresponding to the variable labeled X3 we will

represent them as x31 , x32 , x33 . If we only have one variable, X, then we simply use
the notation x1 , x2 , x3 .
According to their nature there are two types of variables:

1. Qualitative variables (or categorical) are variables that cannot be measured
numerically. Each observation is usually associated with a number (or letter).
2. Quantitative variables (or measurable) are variables that can be measured nu-
merically. They can be of two types:

(a) Continuous: can take any value within a range

(b) Discrete: can take any value from a finite (or countable) list of values

Example 1.1.1 Consider table 1.1, which summarizes a survey of 10 families living
in Cerdanyola.

Family | X1: Number of members | X2: Occupation of head of family | X3: Monthly income | X4: Monthly phone spending | X5: ADSL (Yes/No)
1 2 3 592.18 12.06 0
2 3 3 743.83 14.88 0
3 4 6 527.18 12.35 1
4 5 3 1090.47 18.92 0
5 8 1 2744.23 26.5 1
6 2 5 902.71 15.89 1
7 4 3 888.26 14.44 1
8 5 1 1588.76 15.38 1
9 7 4 3069.2 24.19 1
10 2 1 707.72 13.72 0

Table 1.1: Families in Cerdanyola

Here we have 5 variables

X1 : Number of members of the family


X2 : Occupation of the householder
X3 : Monthly income
X4 : Monthly phone spending (besides the connection fee)
X5 : ADSL (Yes=1, No=0)

We have 10 data points (observations) for each of these 5 variables

{x11 , x12 , x13 , x14 , x15 , x16 , x17 , x18 , x19 , x110 } = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}
{x21 , x22 , x23 , x24 , x25 , x26 , x27 , x28 , x29 , x210 } = {3, 3, 6, 3, 1, 5, 3, 1, 4, 1}

and so on ...


Regarding their types we have:

X1: Quantitative discrete

X1 ∈ {1, 2, 3, 4, 5, 6, . . .}

X2: Qualitative

X2 ∈ {1, 2, 3, 4, 5, 6}, where

1 Entrepreneur
2 White collar worker
3 Blue collar worker
4 Freelance
5 Public servant
6 Unemployed

X3: Quantitative continuous

X3 ∈ [0, ∞)

X4: Quantitative continuous

X4 ∈ [0, ∞)

X5: Qualitative

X5 ∈ {0, 1}, where

0 No ADSL at home
1 ADSL at home

1.1.2 Frequency Distribution

The raw data of one variable provides very little useful information, especially if the
data set is large. It is necessary to classify them so that they become informative. This
is done differently depending on whether it is a Quantitative Continuous variable, or a
Quantitative Discrete or Qualitative variable. The question, ultimately, is
to know how many times each data value appears in the sample (how many people are
unemployed, how many people will vote for the PP, etc.), or maybe some other type of
summary.

1.1.2.1 Frequency distribution of qualitative or quantitative discrete variables

In the case of quantitative discrete or qualitative variables (they are treated alike), what
we do is rst look at how many different values the variable has taken, and then count
how many times each of theses values appears in the sample.
In the previous example, X1 , X2 , i X5 can be treated in the same way.
Let us take, for example, X1 . We have the following data

{x11 , x12 , x13 , x14 , x15 , x16 , x17 , x18 , x19 , x110 } = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}

We see that we have 10 observations of the variable X1 but it only takes 6 different
values, which we represent with y1i. Thus, the list of different values X1 takes in our
sample is
{y11 , y12 , y13 , y14 , y15 , y16 } = {2, 3, 4, 5, 7, 8}

From this list we can create a Frequency Distribution, consisting of counting how many
times each of these values is found in the data. In this sense, we define below the
important concepts of Absolute Frequency and Relative Frequency.

Definition 1.1.2 The Absolute Frequency (ni) of a value yi is the number of times this
value is found in the data set.

Definition 1.1.3 The Relative Frequency (fi) (or proportion) of a value yi is the per-
centage of times this value is found in the data set.

Clearly, if n is the total number of observations in the data set we have

fi = ni / n

Moreover, we can compute¹ the Cumulative Frequencies (Absolute (Ni) and Relative
(Fi)) for each value as the sum of all the frequencies (absolute (ni) or relative (fi)) of
the lower values, including that of the value for which we are computing the cumulative
frequency. Thus, such cumulative frequencies are interpreted as:

Ni = number of data points (observations) of the variable that are ≤ yi

Fi = proportion of data points (observations) of the variable that are ≤ yi

Table 1.2 summarizes all such frequencies for the variable X1. Notice, for example, the
following interpretations:

n3 = 2   There are 2 observations of the variable that are equal to y3 = 4

f3 = 0.2   20% of the observations of the variable are equal to y3 = 4

N3 = 6   There are 6 observations of the variable that are ≤ y3 = 4



Value of X1 | Absolute Frequency (ni) | Relative Frequency (fi = ni/n) | Absolute Cumulative Frequency (Ni) | Relative Cumulative Frequency (Fi = Ni/n)
2 3 0.3 3 0.3
3 1 0.1 4 0.4
4 2 0.2 6 0.6
5 2 0.2 8 0.8
7 1 0.1 9 0.9
8 1 0.1 10 1

Table 1.2: Table of Frequencies for the variable X1

F3 = 0.6   60% of the observations of the variable are ≤ y3 = 4

The frequencies of any variable always satisfy the properties below, where k represents
the number of different values the variable takes²:

(i) 0 ≤ ni ≤ n;  0 ≤ Ni ≤ n

(ii) 0 ≤ fi ≤ 1;  0 ≤ Fi ≤ 1

(iii) Σ_{i=1..k} ni = n

(iv) Σ_{i=1..k} fi = 1

(v) n1 = N1 ≤ N2 ≤ · · · ≤ Nk = n

(vi) f1 = F1 ≤ F2 ≤ · · · ≤ Fk = 1

1.1.2.2 Frequency distribution of continuous quantitative variables

Continuous variables (income, temperature, etc.) can take so many different values that
a table of frequencies (or frequency distribution) would add very little information.
Most likely, each observation would appear only once in the data set and consequently
the absolute frequencies of all different values would be equal to 1.
To solve this, data is binned in intervals, named class intervals (or bins). Intervals may
be constructed as follows.

1. Range (distance between the lowest and the highest value)

Let y1, y2, . . . , yk be the different values that the variable takes in the sample, or-
dered from lowest to highest. Then the range is

range = yk − y1
¹ It does not make any sense to do this for qualitative variables, since this computation requires a natural
ordering in the values yi.

² In our example, k = 6.

If, for example, we consider the variable X3 in table 1.1 (monthly income), then
the range is

range = 3069.2 − 527.18 = 2542.02

This range (or total length) must be split into intervals.

2. Length of each interval

Once the range is known, we must divide the total length in as many intervals
as needed. Such number of intervals depends on each case, on the data at hand,
or the nature of the problem we are studying. Let I be that number. Then,
the length of each interval (l) can be computed as the total length (the range)
divided by the number of intervals³:

l = range / I

If, for example, we consider the variable X3 in table 1.1 (monthly income) and
we want 8 intervals we will have

l = 2542.02 / 8 = 317.75

3. Building the intervals

Once the length l of each interval is known, we proceed by iteration starting from
the lowest observed value y1 (the list of values y1, . . . , yk being ordered).
Thus, the first interval will be [y1, y1 + l), the second will start where the first
ends and will have length l, [y1 + l, y1 + 2l), and so on.
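The three steps above can be sketched in Python (a sketch under our own naming; the `bin_index` helper is ours, not part of the text):

```python
# Building 8 class intervals for X3 (monthly income, table 1.1)
data = [592.18, 743.83, 527.18, 1090.47, 2744.23,
        902.71, 888.26, 1588.76, 3069.2, 707.72]
I = 8                                  # chosen number of intervals
y1, yk = min(data), max(data)
rng = yk - y1                          # step 1: range
l = rng / I                            # step 2: common interval length

def bin_index(x):
    # Step 3: interval i is [y1 + i*l, y1 + (i+1)*l); the maximum value
    # is assigned to the last interval, which is closed on the right
    return min(int((x - y1) / l), I - 1)

counts = [0] * I                       # absolute frequency of each interval
for x in data:
    counts[bin_index(x)] += 1
```

With the X3 data this reproduces the absolute frequencies of table 1.3: [4, 3, 0, 1, 0, 0, 1, 1].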

Interval | Length (li) | Class mark (ci) | Abs. Freq. (ni) | Rel. Freq. (fi) | Abs. Cum. Freq. (Ni) | Rel. Cum. Freq. (Fi)
[527.18,844.93) 317.75 686.06 4 0.4 4 0.4
[844.93,1162.69) 317.75 1003.81 3 0.3 7 0.7
[1162.69,1480.44) 317.75 1321.56 0 0 7 0.7
[1480.44,1798.19) 317.75 1639.31 1 0.1 8 0.8
[1798.19,2115.94) 317.75 1957.07 0 0 8 0.8
[2115.94,2433.7) 317.75 2274.82 0 0 8 0.8
[2433.7,2751.45) 317.75 2592.57 1 0.1 9 0.9
[2751.45,3069.2] 317.75 2910.32 1 0.1 10 1
Total 10 1

Table 1.3: Frequency table for X3

Table 1.3 shows the frequency distribution of X3 using the data in table 1.1. Besides the
class intervals and their length, the table shows the so-called class mark (third column).
This class mark is the middle point of the interval and somehow represents the
interval. We will see later that the class mark is used to compute some attributes of the
variable.
Sometimes, depending on the case under study, the class intervals may be defined in a dif-
ferent way (integer interval limits, with a given length, etc.). The same data of variable
X3, for instance, could very well be represented as in table 1.4.
³ Sometimes intervals have different lengths (li). In this case we need that Σ li = range, summing over all the intervals.

Interval | Length (li) | Class mark (ci) | Abs. Freq. (ni) | Rel. Freq. (fi) | Abs. Cum. Freq. (Ni) | Rel. Cum. Freq. (Fi)
[500,1000) 500 750 6 0.6 6 0.6
[1000,1500) 500 1250 1 0.1 7 0.7
[1500,2000) 500 1750 1 0.1 8 0.8
[2000,2500) 500 2250 0 0 8 0.8
[2500,3000) 500 2750 1 0.1 9 0.9
[3000,3500) 500 3250 1 0.1 10 1
Total 10 1

Table 1.4: Frequency table for X3

1.2 Measures of central tendency, variability, and other characteristic measures

Measures of central tendency are measures (statistics) that attempt to summarize
the information contained in the data by means of some descriptive values. The three
main central tendency measures are: the mean, the median, and the mode. The first
two measures correspond to central values, around which all the observations are
arranged, that in some sense represent all the values that appear in the sample. The last
one, the mode, simply corresponds to the value that is found most often in the data.

Measures of variability are measures (statistics) that attempt to measure how dif-
ferent the observations in the data are. Such differences are measured with respect
to some central tendency measure, usually the mean. The three main measures of vari-
ability are the variance, the standard deviation, and the coefficient of variation.

1.2.1 Measures of central tendency

1.2.1.1 The mean

The mean is a central value with respect to all the observations, being the center of
gravity of the distribution. The computation is easy. If we have n observations of the
variable X, {x1, x2, . . . , xn}, the mean (denoted by x̄) is:

x̄ = (1/n) Σ_{i=1..n} xi

When the data corresponding to the variable X are arranged in a table of frequencies
(we do not have the raw data), the mean can be computed using the absolute frequencies
ni with the formula:

x̄ = (1/n) Σ_{i=1..k} ni yi

and also using the relative frequencies fi,

x̄ = Σ_{i=1..k} fi yi

where {y1, y2, . . . , yk} represent the k different values that the variable takes in the
sample.
With continuous variables, when the data is arranged in class intervals, the computation
of the mean can only be approximated with the formula:

x̄ ≈ (1/n) Σ_{i=1..I} ni ci   or   x̄ ≈ Σ_{i=1..I} fi ci

where {c1, c2, . . . , cI} are the class marks of each of the I intervals.

Example 1.2.1 [Discrete variable] Consider the variable X1 in table 1.1. We can com-
pute the mean using the 10 raw observations of the variable:

x̄1 = (1/n) Σ_{i=1..n} x1i = (1/10)(2 + 3 + 4 + 5 + 8 + 2 + 4 + 5 + 7 + 2) = 4.2

If we did not have the raw data but only the table of frequencies 1.2 we might compute
the mean with the formula

x̄1 = (1/n) Σ_{i=1..k} ni y1i = (1/10)(3·2 + 1·3 + 2·4 + 2·5 + 1·7 + 1·8) = 4.2

and also with the relative frequencies,

x̄1 = Σ_{i=1..k} fi y1i = 0.3·2 + 0.1·3 + 0.2·4 + 0.2·5 + 0.1·7 + 0.1·8 = 4.2

Example 1.2.2 [Continuous variable] Consider the variable X3 in table 1.1. We can
compute the mean using the 10 raw observations of the variable:

x̄3 = (1/n) Σ_{i=1..n} x3i = (1/10)(592.18 + 743.83 + · · · + 707.72) = 1285.45

If we did not have the raw data but only the table of frequencies 1.3 we might compute
the mean from the class marks, but only approximately, with the formula:

x̄3 ≈ (1/n) Σ_{i=1..I} ni c3i = (1/10)(4·686.06 + 3·1003.81 + · · · + 1·2910.32) = 1289.79

and also with the relative frequencies,

x̄3 ≈ Σ_{i=1..I} fi c3i = 0.4·686.06 + 0.3·1003.81 + · · · + 0.1·2910.32 = 1289.79

Notice that the value obtained using the intervals (1289.79) is only an approximation
to the real value (1285.45) of the mean.
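This comparison is easy to check numerically (a minimal Python sketch using the X3 data; class marks and frequencies are those of table 1.3):

```python
# Exact mean from raw data vs. approximation from class marks (X3)
data = [592.18, 743.83, 527.18, 1090.47, 2744.23,
        902.71, 888.26, 1588.76, 3069.2, 707.72]
n = len(data)

mean_raw = sum(data) / n                          # exact sample mean

marks = [686.06, 1003.81, 1321.56, 1639.31,       # class marks c_i (table 1.3)
         1957.07, 2274.82, 2592.57, 2910.32]
freqs = [4, 3, 0, 1, 0, 0, 1, 1]                  # absolute frequencies n_i
mean_grouped = sum(f * c for f, c in zip(freqs, marks)) / n  # grouped approximation
```

Here `mean_raw` is 1285.454 while `mean_grouped` is 1289.787, illustrating that the class-mark formula only approximates the true mean.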

The mean, x̄, satisfies the four properties below, which are useful in many cases:

1. Let a be an arbitrary constant. The mean that results from multiplying the obser-
vations of the variable X by the constant ({a·x1, a·x2, . . . , a·xn}) is a·x̄.
2. Let X and Y be two different variables. Then, if we have the same number of
observations of the two variables, the mean of X + Y is x̄ + ȳ.
3. Let {x1, x2, . . . , xn} be the n observations of the variable X in a sample. Then,

Σ_{i=1..n} (xi − x̄) = 0

4. The mean always exists, is unique, and is very sensitive to changes in extreme
values.

It is important to understand that the sample mean x̄ is the mean computed using the
data in the sample and hence it is just an approximation to the true value of the popu-
lation mean (the real mean in the population of reference), which is usually denoted
with the symbol μ. For instance, in example 1.2.1 we have computed the mean
of X1 to find x̄1 = 4.2. This does not mean that the average size of the families in
Cerdanyola is 4.2, but only that it is approximately equal to 4.2.
It is also important to notice that the computation of the mean (and the same goes for
the median that we will see next) only makes sense when the variable is quantitative
(numerical). Indeed, although in some cases we attach for convenience numerical val-
ues to qualitative variables (as for instance with variable X2: occupation of the head of
family), it does not make any sense to compute the average of those values to conclude
that the average occupation is 4.73.

1.2.1.2 The Median

The median represents a value that is central with respect to all the observations of the
variable, such that 50% of the observations are equal to or larger than this value and
50% are equal to or lower.
Thus, given an ordered set of observations, the median (M) is the value in the middle.
It is larger than no more than half of the observations and, simultaneously, lower than
no more than half of the observations.
The method to compute the median depends on whether the sample size (n) is even or
odd.

Example 1.2.3 [Even samples] Let us consider the observations of the variable X1

{x11 , x12 , x13 , x14 , x15 , x16 , x17 , x18 , x19 , x110 } = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}

First we need to order them from lower to higher value:

{x11 , x16 , x110 , x12 , x13 , x17 , x14 , x18 , x19 , x15 } = {2, 2, 2, 3, 4, 4, 5, 5, 7, 8}

In this case, since the sample has an even number of observations, there is no obser-
vation right in the middle of the list (that would be the median). Notice that both
x13 = 4 and x17 = 4 satisfy the necessary conditions for being the median:

Consider x13 = 4

50% of the observations (that is, 5 observations) are larger than or equal
to x13.
Indeed, the 5 observations {x17 , x14 , x18 , x19 , x15 } are larger than or equal
to x13.
50% of the observations (that is, 5 observations) are smaller than or equal
to x13.
Indeed, the 5 observations {x11 , x16 , x110 , x12 , x13 } are smaller than or
equal to x13.

Thus, x13 verifies the conditions to be a median.

Consider x17 = 4

50% of the observations (that is, 5 observations) are larger than or equal
to x17.
Indeed, the 5 observations {x17 , x14 , x18 , x19 , x15 } are larger than or equal
to x17.
50% of the observations (that is, 5 observations) are smaller than or equal
to x17.
Indeed, the 5 observations {x11 , x16 , x110 , x12 , x13 } are smaller than or
equal to x17.

Thus, x17 verifies the conditions to be a median.

In these cases (even samples) there are two possibilities that are equally correct:

1. The medians are x13 = 4 and x17 = 4. Notice that in this case the two ob-
servations that verify the condition for being a median have the same value
(x13 = x17 = 4), but it could very well be that the two values were different.

2. The median is the mean of these two values:

M = (4 + 4) / 2 = 4

Example 1.2.4 [Odd sample] Let us consider the list of observations below regarding
the variable X (sample size n = 7) drawn from a given population

{x1 , x2 , . . . , x7 } = {3, 1, 4, 3, 2, 5, 1}

Ordered:
{x2 , x7 , x5 , x1 , x4 , x3 , x6 } = {1, 1, 2, 3, 3, 4, 5}

It is clear that the observation in the middle is x1 = 3. That is the median.

When the data is arranged in a table of frequencies (the raw data is not available), the
median is found by looking at the Cumulative Absolute Frequencies as follows:

1. We first find the value ym such that

Nm−1 ≤ n/2 ≤ Nm

2. Then we check:

(a) If Nm−1 < n/2 then we conclude that the median is M = ym.
(b) If Nm−1 = n/2 then both ym and ym−1 are medians. Also, as seen above, we
might also say that the median is the mean of these two values:

M = (ym + ym−1) / 2

Example 1.2.5 Let us consider the variable X1. We have n = 10 and hence n/2 = 5.
Therefore the median will be between the two values in the table below

Value of X1 | Cum. Abs. Frequency (Ni)
y2 = 3 | 4
y3 = 4 | 6

Now, given that

N2 = 4 < n/2 = 5 ≤ 6 = N3

we conclude that the median is M = y3 = 4.

When the variable is continuous the median can be found in the same way. The only
difference is that we look at the class intervals instead of the values. For
instance, if we consider the variable X3, the median is in the class [844.93, 1162.69), with class mark c2.
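Both procedures can be checked in Python (a sketch with our own variable names, using the X1 data from the examples above):

```python
# Median of X1 from the sorted raw data (even sample: mean of the two middle values)
data = sorted([2, 3, 4, 5, 8, 2, 4, 5, 7, 2])
n = len(data)
if n % 2 == 1:
    median = data[n // 2]                              # odd sample: middle value
else:
    median = (data[n // 2 - 1] + data[n // 2]) / 2     # even sample

# Median from the cumulative absolute frequencies N_i (table 1.2)
values = [2, 3, 4, 5, 7, 8]
Ni = [3, 4, 6, 8, 9, 10]
m = next(i for i, N in enumerate(Ni) if N >= n / 2)    # first value with N_m >= n/2
if m == 0 or Ni[m - 1] < n / 2:
    median_from_table = values[m]
else:                                                  # N_{m-1} = n/2: average the two candidates
    median_from_table = (values[m] + values[m - 1]) / 2
```

Both routes give 4, as in examples 1.2.3 and 1.2.5.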

1.2.1.3 The Mode

The mode is the value (or values) with the highest frequency among the neighboring
values. The absolute mode is the most frequent value in the whole sample.
There may exist only one mode, more than one mode, or none (if all values are equally
frequent).
Formally, a mode is any value yq that satisfies

nq−1 ≤ nq and nq ≥ nq+1

Example 1.2.6 Consider the variable X1: there are 3 modes, X1 = 2, X1 = 4 and
X1 = 5. The absolute mode is the value X1 = 2.

In the case of continuous variables, the mode is found in the same way, but considering
the class intervals instead of the values. This is called the modal class. For variable
X3, for instance, the mode is in the classes with class marks c1, c4, c7, and c8.
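A small Python sketch can recover the modes of example 1.2.6. Note that the treatment of runs of equal frequencies below (a plateau counts as modal only when its frequency exceeds both neighbouring runs) is our own reading of the definition, chosen so that the low-frequency boundary values 7 and 8 are not reported as modes:

```python
# Modes of X1 from the frequency table (values and n_i from table 1.2)
from itertools import groupby

values = [2, 3, 4, 5, 7, 8]
ni = [3, 1, 2, 2, 1, 1]

# Group consecutive equal frequencies into runs: [(freq, [values in run]), ...]
runs = [(f, [v for v, _ in grp])
        for f, grp in groupby(zip(values, ni), key=lambda p: p[1])]

modes = []
for j, (freq, vs) in enumerate(runs):
    left = runs[j - 1][0] if j > 0 else 0      # frequency of the previous run (0 at the border)
    right = runs[j + 1][0] if j < len(runs) - 1 else 0
    if freq > left and freq > right:           # run is a local maximum of the frequencies
        modes.extend(vs)

absolute_mode = values[ni.index(max(ni))]      # most frequent value overall
```

For X1 this yields the modes 2, 4 and 5, with 2 as the absolute mode.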

1.2.2 Measures of variability

These are measures (statistics) that attempt to measure how different the observations in
the data are. Such differences are measured with respect to some central tendency
measure, usually the mean. The three main measures of variability are the variance,
the standard deviation, and the coefficient of variation.

1.2.2.1 Mean Quadratic Error

Let us assume we propose the value v as a representative of the whole sample (central
value). It is clear that this implies some errors:

Observation | Central value | Error
x1 | v | e1 = (x1 − v)
x2 | v | e2 = (x2 − v)
... | ... | ...
xn | v | en = (xn − v)
Total Error: Σ_{i=1..n} ei = Σ_{i=1..n} (xi − v)

From this, notice that if we use the formula

T.E.(v) = Σ_{i=1..n} (xi − v)

to compute the Total Error of taking v as a representative measure of the observations
in the sample, we may reach a misleading conclusion. Indeed, while in some cases
the errors ei will be positive (when xi > v), in other cases they will be negative (when
xi < v). Hence, when we compute the sum of the errors ei, positive and negative
errors will cancel each other and we might mistakenly conclude that the total error is
zero (or close to zero).
To keep positive and negative errors from canceling each other, we square the errors
(ei²) so that all are positive. This result is known as the Quadratic Error. If we divide
such Quadratic Error by the number of observations (n), that is, we compute its mean,
we find the so-called Mean Quadratic Error:

M.Q.E.(v) = (1/n) Σ_{i=1..n} (xi − v)²

1.2.2.2 The Variance

Given a set of n observations, {x1, x2, . . . , xn}, the Sample Variance corresponds to
the Mean Quadratic Error that we have when we take the sample mean x̄ as the central
representative measure:

Variance = M.Q.E.(x̄)

To represent the sample variance we usually use S². Thus, given the observations
{x1, x2, . . . , xn}, the Sample Variance (S²) can be computed as:

S² = (1/n) Σ_{i=1..n} (xi − x̄)²

It can be shown⁴ that the sample variance S² can also be found using the alternative
formula below, which usually involves fewer computations than the original one:

S² = (1/n) Σ_{i=1..n} xi² − x̄²

In case we have the information summarized in a frequency table, with k different
values for the variable, the variance can be found using the absolute frequencies

S² = (1/n) Σ_{i=1..k} ni (yi − x̄)²

or also the relative frequencies

S² = Σ_{i=1..k} fi (yi − x̄)²

where {y1, y2, . . . , yk} are the k different values the variable takes in the sample.

⁴ We suggest the reader formally prove that the two formulas are equivalent. It is a good exercise on
the algebraic manipulation and simplification of this type of expression.



For reasons that will be given later on⁵, in many cases the formula used to compute
the sample variance is

Ŝ² = (1/(n−1)) Σ_{i=1..n} (xi − x̄)²

or, equivalently,

Ŝ² = (1/(n−1)) Σ_{i=1..n} xi² − (n/(n−1)) x̄²

The variance measures the degree of dispersion of the data around the mean x̄. If
we consider two different samples with equal means, the larger the variance the more
dispersed the observations are.

Example 1.2.7 Let us consider the variable X3 (monthly income). Since x̄3 = 1285.45,
we have:

S3² = (1/10)((592.18 − 1285.45)² + (743.83 − 1285.45)² + · · · + (707.72 − 1285.45)²) = 742678.8

Ŝ3² = (1/9)((592.18 − 1285.45)² + (743.83 − 1285.45)² + · · · + (707.72 − 1285.45)²) = 825199.04

One important property is that, for any central tendency value v, we have

S² = M.Q.E.(x̄) ≤ M.Q.E.(v)

That is, the sample mean x̄ is the value that minimizes the mean quadratic error.
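The figures of example 1.2.7 can be reproduced directly (a minimal sketch; the variable names are our own):

```python
# Sample variance of X3 with divisor n, the shortcut formula, and divisor n-1
data = [592.18, 743.83, 527.18, 1090.47, 2744.23,
        902.71, 888.26, 1588.76, 3069.2, 707.72]
n = len(data)
mean = sum(data) / n

S2 = sum((x - mean) ** 2 for x in data) / n            # S^2, divisor n
S2_alt = sum(x * x for x in data) / n - mean ** 2      # shortcut: same value
S2_hat = sum((x - mean) ** 2 for x in data) / (n - 1)  # divisor n-1
```

`S2` evaluates to about 742678.8 and `S2_hat` to about 825199, matching example 1.2.7 up to rounding, and the shortcut formula agrees with the definition.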

1.2.2.3 The Standard Deviation

Given a list of n observations, {x1, x2, . . . , xn}, the Standard Deviation, denoted S, is
the square root of the variance:

S = √S²

Compared to the variance, the standard deviation returns the original units of mea-
surement of the variable, which are squared when computing the variance.
If we use the formula for Ŝ² instead of the one corresponding to S², then the standard
deviation is denoted as Ŝ. Thus,

S = √( (1/n) Σ_{i=1..n} (xi − x̄)² ) = √S²

Ŝ = √( (1/(n−1)) Σ_{i=1..n} (xi − x̄)² ) = √Ŝ²

Example 1.2.8 Let us consider the variable X3 (monthly income). Then,

S3 = √S3² = √742678.8 = 861.79

Ŝ3 = √Ŝ3² = √825199.04 = 908.4
⁵ The original formula is a biased estimator of the population variance. This is explained in Statistics II.

1.2.2.4 The Coefficient of Variation

Consider (case a) that we have some data measured in kg: {100, 200}. Clearly, the
mean is

x̄a = (100 + 200) / 2 = 150

and the variance and standard deviation are

Sa² = ((100 − 150)² + (200 − 150)²) / 2 = 2500

Sa = √2500 = 50

Let us suppose now (case b) that we have the same data measured in grams instead:
{100000, 200000}. Now the mean is

x̄b = (100000 + 200000) / 2 = 150000

and the variance and standard deviation are

Sb² = ((100000 − 150000)² + (200000 − 150000)²) / 2 = 2500000000

Sb = √2500000000 = 50000

Now, since Sb is much higher than Sa, one might be tempted to conclude that

"The observations in case b are much more dispersed than in case a, for
Sb > Sa"

which is clearly false since the two cases correspond to exactly the same observations,
just that they are presented using different units of measurement.
The Coefficient of Variation V is designed to avoid this misleading interpretation. It
measures the dispersion relative to the mean, so it is not affected by the units of
measurement:

V = S / x̄

Now, if we consider the previous example again we will find that

Va = Sa / x̄a = 50 / 150 = 1/3

Vb = Sb / x̄b = 50000 / 150000 = 1/3

The two Coefficients of Variation are equal.
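The kg-versus-grams comparison can be verified with a few helper functions (a sketch; `mean`, `std` and `cv` are our own names):

```python
# Coefficient of variation: invariant under a change of measurement units
def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5  # divisor n, as in S

def cv(xs):
    return std(xs) / mean(xs)                                # V = S / mean

kg = [100, 200]            # case a: weights in kilograms
grams = [100000, 200000]   # case b: the same weights in grams
```

Here `std(grams)` is 1000 times `std(kg)`, yet `cv(kg)` and `cv(grams)` are both 1/3.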

1.2.2.5 Quartiles

The quartiles are those values in the sample that split all the observations (ordered) in
4 subgroups in such a way that each subgroup contains no more than 25% of the data.
That is,

First quartile: Value Q1 such that 25% of the observations are below Q1 and 75% of
the observations are above

Second quartile: Value Q2 such that 50% of the observations are below Q2 and 50%
of the observations are above

Third quartile: Value Q3 such that 75% of the observations are below Q3 and 25%
of the observations are above

Notice that the second quartile Q2 is the median.

1.2.2.6 Percentiles

We call the k-th percentile (k = 1, 2, . . . , 99) the value Pk such that k% of the observations
are below Pk and (100 − k)% are above.

Notice:

Q1 = P25

Q2 = P50 = median

Q3 = P75

1.2.2.7 Interquartile range

The interquartile range is the distance between the first and the third quartile:

RI = Q3 − Q1
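A sketch of these definitions in Python, using X3 as an example. Note that several conventions exist for computing sample percentiles; the rule below (Pk is the first ordered observation covering at least k% of the data) is one simple choice of ours, and statistical packages may return slightly different values:

```python
import math

# Quartiles and interquartile range for X3 (monthly income, table 1.1)
data = sorted([592.18, 743.83, 527.18, 1090.47, 2744.23,
               902.71, 888.26, 1588.76, 3069.2, 707.72])
n = len(data)

def percentile(k):
    # First observation such that at least k% of the sample lies at or below it
    idx = math.ceil(k / 100 * n) - 1
    return data[max(idx, 0)]

Q1, Q2, Q3 = percentile(25), percentile(50), percentile(75)
RI = Q3 - Q1          # interquartile range
```

For these data Q1 = 707.72, Q2 = 888.26 and Q3 = 1588.76, so RI = 881.04; this Q2 differs from the even-sample median rule of section 1.2.1.2 (which averages the two middle observations) precisely because of the choice of convention.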

1.3 Histograms and other graphic representations

The information presented in the table of frequencies can be represented graphically
in different ways. The objective is to make the interpretation easier. Once more the
procedures are different depending on whether we deal with continuous or discrete
variables.

1.3.1 Graphical representation of discrete variables

The main graphical representations in the case of qualitative or quantitative discrete
variables are bar diagrams and circle diagrams (or pie charts).

1.3.1.1 Bar diagrams

A bar diagram represents the frequencies corresponding to one variable by means of
vertical bars drawn on top of each of the values that the variable takes in the sample.
The height of each bar is proportional to the frequency of the corresponding value.

Example 1.3.1 Let us consider the variable X1 (number of members). The corre-
sponding bar diagram is shown in figure 1.2. Notice that the height of each bar is
proportional to the relative frequency of each value. The graph would look the same if
we used the absolute frequencies instead, only that the scale in the vertical axis would
be different.

Figure 1.2: Bar diagram for X1

1.3.1.2 Circle diagrams

A circle diagram (or pie chart) is a circle divided in as many sectors as different values
the variable takes in the sample. The area of each sector is proportional to the frequency
of the corresponding value.

Example 1.3.2 Let us consider the variable X1 (number of members). The corre-
sponding circle diagram is shown in figure 1.3.

1.3.2 Graphical representation of continuous variables

The main graphical representations in the case of continuous variables are histograms
and box plots.

Figure 1.3: Circle diagram for X1

1.3.2.1 Histograms

Histograms are the equivalent of bar diagrams for continuous variables, but in this case
it is the area of the bar over each class interval that represents the frequency.

Example 1.3.3 Let us consider the variable X3 (monthly income). The corresponding
histogram is shown in figure 1.4. Notice that the area of each bar is proportional to
the relative frequency of each value. The graph would look the same if we used the
absolute frequencies instead, only that the scale in the vertical axis would be different.

Figure 1.4: Histogram for X3

1.3.2.2 Box plots

A box plot informs about the dispersion of the data. The plot presents graphically the
minimum and maximum value in the sample, the quartiles, and the interquartile range.

Figure 1.5 explains the information contained in this type of graphical representation.

Figure 1.5: Box plot

Example 1.3.4 Let us consider the variable X3 (monthly income). The corresponding
box plot is shown in figure 1.6.

1.4 Multivariate frequency distributions. Conditional and marginal frequencies

Up to now we have studied variables one by one, looking at their distributions and
characteristics.
We are going to study now the joint behavior of variables (multivariate analysis). For
simplicity we will consider the case of 2 variables (bivariate analysis).
The interest of this analysis is not only to observe the distribution of each variable, but
also the relationship among them.
For instance, if we study the number of members of the family (X1) together with the
availability of ADSL (X5), we might be interested in knowing if there is any relation-
ship between the size of the family and the use of ADSL.

1.4.1 Bivariate distribution of frequencies


Figure 1.6: Box plot for X3

We are going to study the joint behavior of two variables, X and Y , of which we
have N pairs of observations. To summarize the data of the two variables jointly we
construct a bivariate table of frequencies. This table is named:

- Correlation table, when the two variables are quantitative
- Contingency table, when at least one of the variables is qualitative

In both cases we have a table that has as many rows as different values (x1 , x2 , . . . , xk )
the first variable takes (usually denoted by X), and as many columns as different values
(y1 , y2 , . . . , ym ) the second variable takes (usually denoted by Y ).
[I MPORTANT ] Note the change in notation. Up to now, in the univariate anal-
ysis, x1 , x2 , . . . , xn denoted the n observations of the variable X, and we used
y1 , y2 , . . . , yk to denote the k different values taken by the variable in the sample.
From now on, in the bivariate analysis, we will use only the different values taken by
each variable, X and Y . In this sense, we will denote with x1 , . . . , xk the k different
values taken by X and with y1 , . . . , ym the m different values taken by Y . Finally, we
use N to denote the sample size, that is, the number of joint observations of the two
variables.
At the entry that corresponds to the row for value xi of X and the column for value
yj of Y we write the absolute frequency, nij , of the pair (xi , yj ), that is, how many
observations in the sample take the value xi (for X) and, simultaneously, the value yj
(for Y ).
Table 1.5 shows the configuration of such a bivariate table of frequencies.

Example 1.4.1 Using the data in table 1.1 we can construct the bivariate table of fre-
quencies (table 1.6) that corresponds to the variables X1 and X5 .

The bivariate table of frequencies determines the so-called bivariate (or joint) distribu-
tion of frequencies, in which the values nij correspond to the joint absolute frequencies
of the variables X and Y . From such joint absolute frequencies we can obtain the joint
relative frequencies fij :

f_{ij} = \frac{n_{ij}}{N}

These joint frequencies verify:



X\Y   y1    y2    ...   yj    ...   ym
x1    n11   n12   ...   n1j   ...   n1m
x2    n21   n22   ...   n2j   ...   n2m
...   ...   ...   ...   ...   ...   ...
xi    ni1   ni2   ...   nij   ...   nim
...   ...   ...   ...   ...   ...   ...
xk    nk1   nk2   ...   nkj   ...   nkm

Table 1.5: Bivariate table of frequencies

X1 \X5 0 1
2 2 1
3 1 0
4 0 2
5 1 1
7 0 1
8 0 1

Table 1.6: Bivariate table of frequencies for X1 and X5
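A table like 1.6 is built mechanically by counting pairs. The sketch below uses the 10 joint observations of (X1 , X5 ) that are recoverable from table 1.6 itself (two families with 2 members and no ADSL, one with 2 members and ADSL, and so on).

```python
from collections import Counter

# The 10 joint observations (X1, X5) summarized by table 1.6
pairs = [(2, 0), (2, 0), (2, 1), (3, 0), (4, 1),
         (4, 1), (5, 0), (5, 1), (7, 1), (8, 1)]
N = len(pairs)

n = Counter(pairs)                        # joint absolute frequencies n_ij
f = {xy: c / N for xy, c in n.items()}    # joint relative frequencies f_ij

# Print the table row by row (rows: values of X1, columns: values of X5)
xs = sorted({x for x, _ in pairs})
ys = sorted({y for _, y in pairs})
for x in xs:
    print(x, [n.get((x, y), 0) for y in ys])
```
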

1. \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij} = N

2. \sum_{i=1}^{k} \sum_{j=1}^{m} f_{ij} = 1

1.4.2 Marginal distributions of frequencies


From the information presented in the bivariate table of frequencies we can extract the
individual (or marginal) information for each variable. These marginal distributions of
frequencies can be obtained by counting the number of observations that correspond to
each value of one of the variables regardless of the values of the other variable.
Thus, for each value xi of X, we can compute its absolute marginal frequency (ni )
by adding the joint absolute frequencies that are in the ith row. Analogously, for each
value yj of Y , we can compute its absolute marginal frequency (nj ) by adding the
joint absolute frequencies that are in the jth column.

n_i = \sum_{j=1}^{m} n_{ij}

n_j = \sum_{i=1}^{k} n_{ij}

From here we can obtain the marginal relative frequencies

f_i = \frac{n_i}{N}

f_j = \frac{n_j}{N}

Such marginal frequencies verify:

\sum_{i=1}^{k} n_i = N, \qquad \sum_{j=1}^{m} n_j = N

\sum_{i=1}^{k} f_i = 1, \qquad \sum_{j=1}^{m} f_j = 1

It is customary to incorporate these marginal frequencies in additional (or marginal)
rows and columns, as in table 1.7.

X\Y   y1    y2    ...   yj    ...   ym    ni   fi
x1    n11   n12   ...   n1j   ...   n1m   n1   f1
x2    n21   n22   ...   n2j   ...   n2m   n2   f2
...   ...   ...   ...   ...   ...   ...   ...  ...
xi    ni1   ni2   ...   nij   ...   nim   ni   fi
...   ...   ...   ...   ...   ...   ...   ...  ...
xk    nk1   nk2   ...   nkj   ...   nkm   nk   fk
nj    n1    n2    ...   nj    ...   nm    N    1
fj    f1    f2    ...   fj    ...   fm    1

Table 1.7: Bivariate table of frequencies and marginal frequencies

Example 1.4.2 Continuing with example 1.4.1, we can complete table 1.6 for
X1 and X5 with the marginal distributions of X1 and X5 (table 1.8).

X1 \X5 0 1 ni fi
2 2 1 3 0.3
3 1 0 1 0.1
4 0 2 2 0.2
5 1 1 2 0.2
7 0 1 1 0.1
8 0 1 1 0.1
nj 4 6 10 1
fj 0.4 0.6 1

Table 1.8: Bivariate table of frequencies for X1 and X5

Notice that the marginal frequencies for X1 (ni ) correspond to those found in table
1.2.
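The row and column sums that produce table 1.8 can be sketched as follows, reusing the joint counts of table 1.6.

```python
from collections import Counter

# Joint absolute frequencies n_ij of (X1, X5) from table 1.6
n = {(2, 0): 2, (2, 1): 1, (3, 0): 1, (4, 1): 2,
     (5, 0): 1, (5, 1): 1, (7, 1): 1, (8, 1): 1}
N = sum(n.values())

# Marginal absolute frequencies: n_i (rows, X1) and n_j (columns, X5)
ni = Counter()
nj = Counter()
for (x, y), c in n.items():
    ni[x] += c
    nj[y] += c

fi = {x: c / N for x, c in ni.items()}   # marginal relative frequencies of X1
fj = {y: c / N for y, c in nj.items()}   # marginal relative frequencies of X5
print(dict(ni), dict(nj))
```
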

1.4.3 Conditional distributions of frequencies

With the joint distribution of frequencies that we have seen in 1.4.1 we may answer
questions like

What percentage of observations correspond to the value xi for X and yj for Y ?

and the answer would be fij .


For instance, using table 1.8 we may ask

What percentage of families have 5 members (X1 = 5) and do not have ADSL (X5 = 0)?

and the answer would be 1 over a total of 10 families, that is, 10% (0.1).
Sometimes, though, the question of interest is different. For instance, we might be
interested in the following question:

For the families that do not have ADSL (X5 = 0), what percentage has 5 members?

and the answer now would be 1 family over a total of 4 families without ADSL, that is,
25% (0.25).
This type of information is called conditional frequency, for it refers to the frequency
of a given value of one of the variables conditional on one specic value of the other
variable. In the previous example we have found that

The relative frequency of the value X1 = 5 conditional on X5 = 0 is 0.25

In general, given two variables X and Y with values {x1 , x2 , . . . , xk } and {y1 , y2 , . . . , ym }
respectively, the relative frequency of the value xi conditional on Y taking the value
yj is denoted with

f_i^{X/Y=y_j}

and is obtained from the information in the bivariate table of frequencies by means of
the formula

f_i^{X/Y=y_j} = \frac{n_{ij}}{n_j}

Notice that this formula is computing:

f_i^{X/Y=y_j} = \frac{\text{Observations with } X = x_i \text{ and } Y = y_j}{\text{Observations with } Y = y_j}

Analogously, given two variables X and Y with values {x1 , x2 , . . . , xk } and {y1 , y2 , . . . , ym }
respectively, the relative frequency of the value yj conditional on X taking the value
xi is denoted with

f_j^{Y/X=x_i}

and is obtained from the information in the bivariate table of frequencies by means of
the formula

f_j^{Y/X=x_i} = \frac{n_{ij}}{n_i}
These conditional frequencies can also be arranged in a table, which then presents the
conditional frequency distributions.
This way, from table 1.7 we can construct the following two tables of conditional
frequencies, one for each variable.
Conditional frequencies verify the following properties:

       f^{X/Y=y1}     ...   f^{X/Y=yj}     ...   f^{X/Y=ym}
x1     n11 / n1       ...   n1j / nj       ...   n1m / nm
x2     n21 / n1       ...   n2j / nj       ...   n2m / nm
...    ...            ...   ...            ...   ...
xi     ni1 / n1       ...   nij / nj       ...   nim / nm
...    ...            ...   ...            ...   ...
xk     nk1 / n1       ...   nkj / nj       ...   nkm / nm
       1              ...   1              ...   1

Table 1.9: Conditional frequencies distribution for X

              y1         ...   yj         ...   ym
f^{Y/X=x1}    n11 / n1   ...   n1j / n1   ...   n1m / n1    1
f^{Y/X=x2}    n21 / n2   ...   n2j / n2   ...   n2m / n2    1
...           ...        ...   ...        ...   ...        ...
f^{Y/X=xi}    ni1 / ni   ...   nij / ni   ...   nim / ni    1
...           ...        ...   ...        ...   ...        ...
f^{Y/X=xk}    nk1 / nk   ...   nkj / nk   ...   nkm / nk    1

Table 1.10: Conditional frequencies distribution for Y

1. \sum_{i=1}^{k} f_i^{X/Y=y_j} = 1

Indeed, since

f_i^{X/Y=y_j} = \frac{n_{ij}}{n_j}

we have that

\sum_{i=1}^{k} f_i^{X/Y=y_j} = \sum_{i=1}^{k} \frac{n_{ij}}{n_j} = \frac{1}{n_j} \sum_{i=1}^{k} n_{ij} = \frac{n_j}{n_j} = 1

2. \sum_{j=1}^{m} f_j^{Y/X=x_i} = 1 for the same reasons.

Example 1.4.3 Continuing with example 1.4.2, we can compute the conditional fre-
quencies for X1 (table 1.11) and X5 (table 1.12) from the information given in table
1.8.
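Computationally, a conditional distribution divides each joint count by the corresponding marginal. The sketch below reproduces the entries of tables 1.11 and 1.12 from the joint counts of table 1.8.

```python
from collections import Counter

# Joint counts n_ij of (X1, X5) from table 1.8
n = {(2, 0): 2, (2, 1): 1, (3, 0): 1, (4, 1): 2,
     (5, 0): 1, (5, 1): 1, (7, 1): 1, (8, 1): 1}
ni = Counter()   # marginal of X1
nj = Counter()   # marginal of X5
for (x, y), c in n.items():
    ni[x] += c
    nj[y] += c

xs, ys = sorted(ni), sorted(nj)

# f_i^{X1/X5=y} = n_ij / n_j : distribution of X1 among families with X5 = y
cond_x1 = {y: {x: n.get((x, y), 0) / nj[y] for x in xs} for y in ys}
# f_j^{X5/X1=x} = n_ij / n_i : distribution of X5 among families with X1 = x
cond_x5 = {x: {y: n.get((x, y), 0) / ni[x] for y in ys} for x in xs}

print(cond_x1[0])   # first column of table 1.11
```
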

1.5 Covariance and Correlation Coefficient

The covariance between two variables X and Y , denoted with SXY , is a measure of
the degree of relationship (joint variation) between the variables.
A positive covariance is found when high values of one of the variables occur when
the other variable also takes high values. A negative covariance corresponds to the

     f^{X1/X5=0}   f^{X1/X5=1}
2    1/2           1/6
3    1/4           0
4    0             1/3
5    1/4           1/6
7    0             1/6
8    0             1/6
     1             1

Table 1.11: Conditional frequencies distribution for X1

              0     1
f^{X5/X1=2}   2/3   1/3   1
f^{X5/X1=3}   1     0     1
f^{X5/X1=4}   0     1     1
f^{X5/X1=5}   1/2   1/2   1
f^{X5/X1=7}   0     1     1
f^{X5/X1=8}   0     1     1

Table 1.12: Conditional frequencies distribution of X5

case when high values of one of the variables occur when the other variable takes low
values.
The covariance between variables X and Y can be computed as

S_{XY} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij} (x_i - \bar{x})(y_j - \bar{y})

and also with the formula

S_{XY} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij}\, x_i y_j - \bar{x}\,\bar{y}

If the information is presented in joint relative frequencies (fij ) instead of joint abso-
lute frequencies (nij ), given that f_{ij} = \frac{n_{ij}}{N}, we can compute the covariance as

S_{XY} = \sum_{i=1}^{k} \sum_{j=1}^{m} f_{ij} (x_i - \bar{x})(y_j - \bar{y})

One can show that the maximum and the minimum value that the covariance can take
are related to the variances of the two variables. In this sense we have that S_{XY}^2 \le S_X^2 S_Y^2
and therefore

-S_X S_Y \le S_{XY} \le S_X S_Y \qquad (1.1)

If we had the original raw data with all the observations in the sample6 , {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )},
we might compute the covariance as

6 Notice that (x1 , y1 ), . . . , (xN , yN ) denote now the raw data (the whole sample), and not the different
values that the observations take.

S_{XY} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})

or, equivalently,

S_{XY} = \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y}

We will see later that if two variables are independent (the behavior of one of them has
nothing to do with the behavior of the other), then the covariance between them is zero.
Thus, the covariance between two variables can be interpreted as a measure of the in-
tensity of the relationship between them. The value of the covariance, though, depends
on the units of measure of the variables. Then, it could happen (as we have seen in
1.2.2.4) that we find a high intensity (high covariance) simply because the units of
measure of the variables are large. In this case, the conclusion that the relationship is
intense would be wrong.
To address this issue we use the so-called Correlation Coefficient r (or Pearson's coef-
ficient):

r = \frac{S_{XY}}{S_X S_Y}

Notice that, because of what we have found in (1.1), we have

-1 \le r \le 1

A direct way to compute r is using the formulas for S_{XY}, S_X and S_Y, that is,

r = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij} (x_i - \bar{x})(y_j - \bar{y})}{\sqrt{\left(\sum_{i=1}^{k} n_i (x_i - \bar{x})^2\right) \left(\sum_{j=1}^{m} n_j (y_j - \bar{y})^2\right)}}

It is important to notice that:

- The sign (+ or −) of the correlation coefficient r, which is inherited from the
sign of the covariance S_{XY}, determines the nature (positive or negative) of the
relationship between X and Y.
- The closer to 1 (or to −1) the correlation coefficient r is, the more intense the
relationship between X and Y is.
- A value of r close to zero does not mean that the relationship is weak (or null);
it only means that there is no linear relationship between X and Y.
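These formulas can be checked on the (X1 , X5 ) data: the sketch below computes S_{XY} and r from the joint frequencies of table 1.8, using the shortcut formula S_{XY} = (1/N) Σ n_ij x_i y_j − x̄ȳ.

```python
from math import sqrt

# Joint absolute frequencies n_ij of (X1, X5) from table 1.8
n = {(2, 0): 2, (2, 1): 1, (3, 0): 1, (4, 1): 2,
     (5, 0): 1, (5, 1): 1, (7, 1): 1, (8, 1): 1}
N = sum(n.values())

# Means computed from the joint table
xbar = sum(c * x for (x, y), c in n.items()) / N
ybar = sum(c * y for (x, y), c in n.items()) / N

# S_XY = (1/N) sum n_ij x_i y_j - xbar * ybar
sxy = sum(c * x * y for (x, y), c in n.items()) / N - xbar * ybar
# Variances S_X^2 and S_Y^2 with the analogous shortcut formula
sx2 = sum(c * x * x for (x, y), c in n.items()) / N - xbar ** 2
sy2 = sum(c * y * y for (x, y), c in n.items()) / N - ybar ** 2

r = sxy / sqrt(sx2 * sy2)   # Pearson's correlation coefficient
print(sxy, r)
```

The positive sign of r tells us that larger families in this sample tend to have ADSL more often.
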

1.6 Mean and variance of linear combinations of variables

In many cases we find that some variables can be written as a linear combination of
other variables. For instance, the costs of a firm (C) are the sum of the quantity of

labor used (L) multiplied by its unit cost (w) and the use of capital (K) also multiplied
by its unit cost (r)

C = wL + rK

In general, we say that the variable X is a linear combination of the variables
X1 , X2 if there exist linear coefficients a1 , a2 (real numbers) such that

X = a_1 X_1 + a_2 X_2

The following properties regarding the mean and variance of linear combinations of
variables are of interest:

\bar{X} = a_1 \bar{X}_1 + a_2 \bar{X}_2

S_X^2 = a_1^2 S_{X_1}^2 + a_2^2 S_{X_2}^2 \quad \text{if } X_1, X_2 \text{ are independent}

(In general, S_X^2 = a_1^2 S_{X_1}^2 + a_2^2 S_{X_2}^2 + 2 a_1 a_2 S_{X_1 X_2}; the cross term vanishes
when X_1 and X_2 are independent.)

Moreover, if

Y = b_1 Y_1 + b_2 Y_2

then we have the following property for the covariance:

S_{XY} = a_1 b_1 S_{X_1 Y_1} + a_1 b_2 S_{X_1 Y_2} + a_2 b_1 S_{X_2 Y_1} + a_2 b_2 S_{X_2 Y_2}
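This bilinearity property can be verified numerically on any sample. A minimal sketch with made-up observations (any numbers would work, since the identity holds exactly for sample covariances):

```python
# Sample covariance S_AB = (1/N) * sum (a_i - abar)(b_i - bbar)
def cov(a, b):
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / n

# Made-up observations of X1, X2, Y1, Y2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y1 = [1.0, 3.0, 2.0, 5.0, 4.0]
y2 = [0.0, 2.0, 1.0, 2.0, 3.0]

a1, a2, b1, b2 = 2.0, -1.0, 0.5, 3.0
x = [a1 * u + a2 * v for u, v in zip(x1, x2)]  # X = a1*X1 + a2*X2
y = [b1 * u + b2 * v for u, v in zip(y1, y2)]  # Y = b1*Y1 + b2*Y2

# Left side: covariance of the combinations; right side: the bilinear expansion
lhs = cov(x, y)
rhs = (a1 * b1 * cov(x1, y1) + a1 * b2 * cov(x1, y2)
       + a2 * b1 * cov(x2, y1) + a2 * b2 * cov(x2, y2))
print(lhs, rhs)
```
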

1.7 Mean vector and Covariance matrix

Some advanced techniques in statistics (especially in econometrics) use matrix algebra.
For this reason, it is convenient to arrange the data (observations), means, variances,
and covariances as vectors and matrices. In this sense, if we have N pairs of observa-
tions ((x11 , x21 ), (x12 , x22 ), . . . , (x1N , x2N )) of the variables X1 and X2 we can define
the following concepts7 :

Data matrix The observations of the variables X1 and X2 can be jointly represented
in a matrix X of dimensions N × 2:

X = \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ \vdots & \vdots \\ x_{1N} & x_{2N} \end{pmatrix}

Mean vector The mean vector \bar{X} simply consists of

\bar{X} = (\bar{x}_1, \bar{x}_2)

7 Notice that now we work again with the raw data (the list of all the observations) and not with the
different values that each variable takes.



Such mean vector \bar{X} can be computed using matrix algebra from the data matrix X.
Indeed, if we denote with \mathbf{1} the (row) vector with N components, all of them equal to
1:

\mathbf{1} = (1, 1, \ldots, 1)

we have that

\mathbf{1} X = (1, 1, \ldots, 1) \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ \vdots & \vdots \\ x_{1N} & x_{2N} \end{pmatrix} = \left( \sum_{i=1}^{N} x_{1i}, \; \sum_{i=1}^{N} x_{2i} \right)

Thus,

\bar{X} = \frac{1}{N} \mathbf{1} X

Covariance matrix Finally, the variances and covariances between X1 and X2 are
presented in the covariance matrix

\Sigma = \begin{pmatrix} S_{X_1}^2 & S_{X_1 X_2} \\ S_{X_2 X_1} & S_{X_2}^2 \end{pmatrix}

As with the mean vector, this matrix can be computed using matrix algebra. Indeed,
let \mathbf{1}^T denote the transpose of vector \mathbf{1} (a column vector):

\mathbf{1}^T = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}

Then,

\mathbf{1}^T \bar{X} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} (\bar{x}_1, \bar{x}_2) = \begin{pmatrix} \bar{x}_1 & \bar{x}_2 \\ \bar{x}_1 & \bar{x}_2 \\ \vdots & \vdots \\ \bar{x}_1 & \bar{x}_2 \end{pmatrix}

Therefore,

X - \mathbf{1}^T \bar{X} = \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ \vdots & \vdots \\ x_{1N} & x_{2N} \end{pmatrix} - \begin{pmatrix} \bar{x}_1 & \bar{x}_2 \\ \bar{x}_1 & \bar{x}_2 \\ \vdots & \vdots \\ \bar{x}_1 & \bar{x}_2 \end{pmatrix} = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{21} - \bar{x}_2 \\ x_{12} - \bar{x}_1 & x_{22} - \bar{x}_2 \\ \vdots & \vdots \\ x_{1N} - \bar{x}_1 & x_{2N} - \bar{x}_2 \end{pmatrix}

Let us consider its transpose now:

(X - \mathbf{1}^T \bar{X})^T = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_1 & \cdots & x_{1N} - \bar{x}_1 \\ x_{21} - \bar{x}_2 & x_{22} - \bar{x}_2 & \cdots & x_{2N} - \bar{x}_2 \end{pmatrix}

Then we have

(X - \mathbf{1}^T \bar{X})^T (X - \mathbf{1}^T \bar{X}) = \begin{pmatrix} \sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 & \sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2) \\ \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)(x_{1i} - \bar{x}_1) & \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2 \end{pmatrix}

If we now multiply by \frac{1}{N} we find

\frac{1}{N} (X - \mathbf{1}^T \bar{X})^T (X - \mathbf{1}^T \bar{X}) = \begin{pmatrix} \frac{1}{N} \sum_{i=1}^{N} (x_{1i} - \bar{x}_1)^2 & \frac{1}{N} \sum_{i=1}^{N} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2) \\ \frac{1}{N} \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)(x_{1i} - \bar{x}_1) & \frac{1}{N} \sum_{i=1}^{N} (x_{2i} - \bar{x}_2)^2 \end{pmatrix} = \begin{pmatrix} S_{X_1}^2 & S_{X_1 X_2} \\ S_{X_2 X_1} & S_{X_2}^2 \end{pmatrix}

That is,

\Sigma = \frac{1}{N} (X - \mathbf{1}^T \bar{X})^T (X - \mathbf{1}^T \bar{X})

This matrix is symmetric since S_{X_1 X_2} = S_{X_2 X_1}.
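The matrix formula can be checked with numpy (assuming it is available), using the (X1 , X5 ) observations of table 1.6 as the data matrix.

```python
import numpy as np

# Data matrix X (N x 2): each row is one observation (X1_i, X5_i),
# the 10 families summarized by table 1.6
X = np.array([(2, 0), (2, 0), (2, 1), (3, 0), (4, 1),
              (4, 1), (5, 0), (5, 1), (7, 1), (8, 1)], dtype=float)
N = X.shape[0]

ones = np.ones((N, 1))              # the column vector 1^T
xbar = (ones.T @ X) / N             # mean vector (1/N) 1 X, shape (1, 2)

D = X - ones @ xbar                 # deviation matrix X - 1^T Xbar
sigma = (D.T @ D) / N               # covariance matrix (1/N) D^T D

# numpy's np.cov with bias=True uses the same 1/N convention
assert np.allclose(sigma, np.cov(X.T, bias=True))
print(xbar, sigma, sep="\n")
```

Note that `np.cov` divides by N − 1 by default (the sample-variance convention); `bias=True` switches it to the 1/N convention used throughout this chapter.
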
