Chapter 1
Descriptive Statistics
Definition 1.0.1 Statistics is a set of techniques whose goal is to understand and draw conclusions about a phenomenon (or phenomena), in a particular place and time, using the information contained in a set of data.
For example, we may study unemployment in Catalonia in 2014 using the Enquesta de Població Activa (EPA), or the opinion of the electorate before any call for elections using surveys published in the media.
Technically, the "phenomenon" to study is called the variable, while the "particular place and time" defines the population. In the same sense, the "data" is usually called the sample data (or observations). In this way,
Definition 1.0.2 The population is the set of elements to study, from which we want to draw a conclusion with respect to some of its features (or variables).
Definition 1.0.3 A sample is a subset of the population used to collect information and draw conclusions.
Definition 1.0.4 Variables are those features or phenomena of the population that we want to study. A variable is an observable character that varies among the different individuals of a population.
Definition 1.0.5 Data are the specific values of the considered variables that we observe in the sample at hand.
Therefore, in the first example above, the population is the population of Catalonia in 2014, and the data is obtained from the sample called the Encuesta de Población Activa (EPA) (see figure 1.1). The variable studied is, clearly, unemployment.
In the second example, the population is composed of all the citizens with the right to vote, the sample is the people interviewed in the published survey, the data are all the answers collected, and the variable of interest is the electorate's opinion.
Thus, we may re-define statistics in a more accurate manner as follows.
Definition 1.0.6 Statistics is a set of techniques whose goal is to understand and draw conclusions regarding a variable (or variables) of a population using the data collected in a sample.
Descriptive Statistics is a sub-field within statistics that deals with the task of arranging the data in such a way that the analysis becomes easier.
Definition 1.0.7 Descriptive Statistics deals with the study of a set of data, corresponding to one or more variables of a population, and its arrangement in order to facilitate its understanding and use.
Once all the information is arranged in a clear and efficient way, the researcher can use "Statistical Inference" techniques, which use probabilistic tools, to extend the conclusions obtained from the sample to the entire population under study.
As in other branches of mathematics, variables are usually denoted using the last letters of the alphabet (in uppercase): X, Y, Z. If we need to consider many variables at the same time, the use of subscripts becomes necessary: X1, X2, ..., Xn.
The different data points (or observations) for each variable that we obtain from a sample are often denoted using subscripts of the corresponding variable (in lowercase).
For instance, if we have 3 data points corresponding to the variable labeled X3, we will represent them as x31, x32, x33. If we only have one variable, X, then we simply use the notation x1, x2, x3.

1.1 Univariate Frequency Distribution Tables
According to their nature there are two types of variables:
1. Qualitative (or categorical) variables are variables that cannot be measured numerically. Each observation is usually associated with a number (or letter).
2. Quantitative (or measurable) variables are variables that can be measured numerically. They can be of two types: discrete (taking isolated values) or continuous (taking any value within a range).
Example 1.1.1 Consider table 1.1, which summarizes a survey of 10 families living in Cerdanyola.
Family | X1: Number of members | X2: Occupation of head of family | X3: Monthly income | X4: Monthly phone spending | X5: ADSL (Yes/No)
1  | 2 | 3 | 592.18  | 12.06 | 0
2  | 3 | 3 | 743.83  | 14.88 | 0
3  | 4 | 6 | 527.18  | 12.35 | 1
4  | 5 | 3 | 1090.47 | 18.92 | 0
5  | 8 | 1 | 2744.23 | 26.5  | 1
6  | 2 | 5 | 902.71  | 15.89 | 1
7  | 4 | 3 | 888.26  | 14.44 | 1
8  | 5 | 1 | 1588.76 | 15.38 | 1
9  | 7 | 4 | 3069.2  | 24.19 | 1
10 | 2 | 1 | 707.72  | 13.72 | 0
{x11, x12, x13, x14, x15, x16, x17, x18, x19, x110} = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}
{x21, x22, x23, x24, x25, x26, x27, x28, x29, x210} = {3, 3, 6, 3, 1, 5, 3, 1, 4, 1}
X1 is quantitative discrete, with possible values {1, 2, 3, 4, 5, 6, ...}.
X2 is qualitative, coded with the values {1, 2, 3, 4, 5, 6}, where
1 = Entrepreneur
2 = White collar worker
3 = Blue collar worker
4 = Freelance
5 = Public servant
6 = Unemployed
X3 is quantitative continuous, with X3 ∈ [0, ∞).
X4 is quantitative continuous, with X4 ∈ [0, ∞).
X5 is qualitative, coded as
0 = No ADSL at home
1 = ADSL at home
The raw data of one variable provides very little useful information, especially if the data set is large. It is necessary to classify the data so that they become informative. This is done differently depending on whether the variable is quantitative continuous, or quantitative discrete or qualitative. The question, ultimately, is to know how many times each data value appears in the sample (how many people are unemployed, how many people will vote for the PP, etc.), or maybe some other type of summary.
In the case of quantitative discrete or qualitative variables (they are treated alike), what we do is first look at how many different values the variable has taken, and then count how many times each of these values appears in the sample.
In the previous example, X1, X2, and X5 can be treated in the same way.
Let us take, for example, X1 . We have the following data
{x11 , x12 , x13 , x14 , x15 , x16 , x17 , x18 , x19 , x110 } = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}
We see that we have 10 observations of the variable X1, but it only takes 6 different values, which we represent with y1i. Thus, the list of different values X1 takes in our sample is
{y11 , y12 , y13 , y14 , y15 , y16 } = {2, 3, 4, 5, 7, 8}
From this list we can create a frequency distribution, which consists in counting how many times each of these values is found in the data. In this sense, we define below the important concepts of absolute frequency and relative frequency.
Definition 1.1.2 The absolute frequency (ni) of a value yi is the number of times this value is found in the data set.
Definition 1.1.3 The relative frequency (fi) (or proportion) of a value yi is the percentage of times this value is found in the data set.
Moreover, we can compute¹ the cumulative frequencies (absolute, Ni, and relative, Fi) for each value as the sum of all the frequencies (absolute, ni, or relative, fi) of the lower values, including that of the value for which we are computing the cumulative frequency. Thus, Ni is the number of observations smaller than or equal to yi, and Fi is the corresponding proportion.
Table 1.2 summarizes all such frequencies for the variable X1.
The frequencies of any variable always satisfy the properties below, where k represents the number of different values the variable takes:
(i) 0 ≤ ni ≤ n and 0 ≤ Ni ≤ n
(ii) 0 ≤ fi ≤ 1 and 0 ≤ Fi ≤ 1
(iii) Σ_{i=1}^{k} ni = n
(iv) Σ_{i=1}^{k} fi = 1
(v) n1 = N1 ≤ N2 ≤ ... ≤ Nk = n
(vi) f1 = F1 ≤ F2 ≤ ... ≤ Fk = 1
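These definitions translate directly into code. Below is a minimal Python sketch (the function name and layout are ours, not from the text) that reproduces the frequency table of X1 described above:

```python
from collections import Counter

def frequency_table(data):
    """Return (value, ni, fi, Ni, Fi) rows: absolute, relative,
    and cumulative (absolute and relative) frequencies."""
    n = len(data)
    counts = Counter(data)
    rows, N = [], 0
    for y in sorted(counts):          # the k different values, ordered
        ni = counts[y]
        N += ni                       # cumulative absolute frequency Ni
        rows.append((y, ni, ni / n, N, N / n))
    return rows

# Variable X1: number of members of the 10 families
x1 = [2, 3, 4, 5, 8, 2, 4, 5, 7, 2]
for row in frequency_table(x1):
    print(row)
```

Note that the last row always ends with Ni = n and Fi = 1, in agreement with properties (v) and (vi).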
Continuous variables (income, temperature, etc.) can take so many different values that a table of frequencies (or frequency distribution) would add very little information. Most likely, each observation would appear only once in the data set, and consequently the absolute frequencies of all different values would be equal to 1.
To solve this, the data is binned in intervals, named class intervals (or bins). Intervals may be constructed as follows:
1. Range
The range is the total length spanned by the data, from the smallest to the largest observed value:

range = yk − y1

If, for example, we consider the variable X3 in table 1.1 (monthly income), then the range is

range = 3069.2 − 527.18 = 2542.02

¹ It does not make any sense to do this for qualitative variables, since this computation requires a natural order among the values.
This range (or total length) must be split into intervals.
2. Length of each interval
Once the range is known, we must divide the total length into as many intervals as needed. The number of intervals depends on each case, on the data at hand, or on the nature of the problem we are studying. Let I be that number. Then the length of each interval (l) can be computed as the total length (the range) divided by the number of intervals³:

l = range / I

If, for example, we consider the variable X3 in table 1.1 (monthly income) and we want 8 intervals, we will have

l = 2542.02 / 8 = 317.75
3. Building the intervals
Once the length l of each interval is known, we proceed by iteration starting from the lowest observed value, that is, y1 if the list of values y1, ..., yk is ordered. Thus, the first interval will be [y1, y1 + l), the second will start where the first ends and will have length l, [y1 + l, (y1 + l) + l), and so on.
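The three steps can be sketched in Python as follows (our code; the half-open intervals [a, b) follow the construction above, with the last interval closed so that the maximum is counted):

```python
def class_intervals(data, I):
    """Bin the data into I class intervals of equal length l = range / I."""
    lo, hi = min(data), max(data)
    l = (hi - lo) / I
    edges = [lo + i * l for i in range(I + 1)]
    marks = [(edges[i] + edges[i + 1]) / 2 for i in range(I)]   # class marks
    counts = [0] * I
    for x in data:
        k = min(int((x - lo) // l), I - 1)   # last interval also takes the maximum
        counts[k] += 1
    return edges, marks, counts

# Variable X3 (monthly income), 8 intervals, as in the example
x3 = [592.18, 743.83, 527.18, 1090.47, 2744.23, 902.71, 888.26, 1588.76, 3069.2, 707.72]
edges, marks, counts = class_intervals(x3, 8)
print(round(marks[0], 2), counts)
```

The first class mark comes out as 686.06 and the first two counts as 4 and 3, matching the frequency distribution of X3 in table 1.3.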
Table 1.3 shows the frequency distribution of X3 using the data in table 1.1. Besides the class intervals and their length, the table shows the so-called class mark (third column). This class mark is the middle point of the interval and somehow represents the interval. We will see later that the class mark is used to compute some attributes of the variable.
Sometimes, depending on the case at study, the class intervals may be defined in a different way (integer interval limits, with a given length, etc.). The same data of variable X3, for instance, could very well be represented as in table 1.4.
³ Sometimes intervals have different lengths (li). In this case we need that Σ_{i=1}^{k} li = range.
1.2 Measures of Central Tendency, Variability, and Other Characteristic Measures
The mean is a central value with respect to all the observations, being the center of gravity of the distribution. Its computation is easy: if we have n observations of the variable X, {x1, x2, ..., xn}, the mean (denoted by x̄) is

x̄ = (1/n) Σ_{i=1}^{n} xi
When the data corresponding to the variable X are arranged in a table of frequencies (we do not have the raw data), the mean can be computed using the absolute frequencies ni with the formula:

x̄ = (1/n) Σ_{i=1}^{k} ni yi

where {y1, y2, ..., yk} represent the k different values that the variable takes in the sample.
With continuous variables, when the data is arranged in class intervals, the computation of the mean can only be approximated, using the class marks ci:

x̄ ≈ (1/n) Σ_{i=1}^{I} ni ci    or    x̄ ≈ Σ_{i=1}^{I} fi ci
Example 1.2.1 [Discrete variable] Consider the variable X1 in table 1.1. We can compute the mean using the 10 raw observations of the variable:

x̄1 = (1/n) Σ_{i=1}^{n} x1i = (1/10)(2 + 3 + 4 + 5 + 8 + 2 + 4 + 5 + 7 + 2) = 4.2

If we did not have the raw data but only the table of frequencies 1.2, we could compute the mean with the formula

x̄1 = (1/n) Σ_{i=1}^{k} ni y1i = (1/10)(3·2 + 1·3 + 2·4 + 2·5 + 1·7 + 1·8) = 4.2
Example 1.2.2 [Continuous variable] Consider the variable X3 in table 1.1. We can compute the mean using the 10 raw observations of the variable:

x̄3 = (1/n) Σ_{i=1}^{n} x3i = (1/10)(592.18 + 743.83 + ... + 707.72) = 1285.45

If we did not have the raw data but only the table of frequencies 1.3, we could compute the mean from the class marks, but only approximately, with the formula:

x̄3 ≈ (1/n) Σ_{i=1}^{k} ni c3i = (1/10)(4·686.06 + 3·1003.81 + ... + 1·2910.32) = 1289.79
Notice that the value obtained using the intervals (1289.79) is only an approximation to the real value (1285.45) of the mean.
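As a check (our code), the raw mean and the grouped approximation for X3 can be computed side by side; the frequencies ni and class marks ci below are the ones read off table 1.3:

```python
x3 = [592.18, 743.83, 527.18, 1090.47, 2744.23, 902.71, 888.26, 1588.76, 3069.2, 707.72]
mean_raw = sum(x3) / len(x3)                    # x̄ = (1/n) Σ xi

ni = [4, 3, 0, 1, 0, 0, 1, 1]                   # absolute frequencies of the 8 classes
ci = [686.06, 1003.81, 1321.56, 1639.31, 1957.07, 2274.82, 2592.57, 2910.32]
mean_grouped = sum(n * c for n, c in zip(ni, ci)) / sum(ni)

print(round(mean_raw, 2), round(mean_grouped, 2))   # 1285.45 vs the approximation 1289.79
```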
The mean x̄ satisfies the four properties below, which are useful in many cases:
1. Let a be an arbitrary constant. The mean that results from multiplying the observations of the variable X by the constant ({a·x1, a·x2, ..., a·xn}) is a·x̄.
2. Let X and Y be two different variables. Then, if we have the same number of observations of the two variables, the mean of X + Y is x̄ + ȳ.
3. Let {x1, x2, ..., xn} be the n observations of the variable X in a sample. Then
Σ_{i=1}^{n} (xi − x̄) = 0
4. The mean always exists, is unique, and is very sensitive to changes in extreme values.
It is important to understand that the sample mean x̄ is the mean computed using the data in the sample, and hence it is just an approximation to the true value of the population mean (the real mean in the population of reference), which is usually denoted with the symbol μ. For instance, in example 1.2.1 we have computed the mean of X1 to find x̄1 = 4.2. This does not mean that the average size of the families in Cerdanyola is exactly 4.2, but only that it is approximately equal to 4.2.
It is also important to notice that the computation of the mean (and the same goes for the median that we will see next) only makes sense when the variable is quantitative (numerical). Indeed, although in some cases we attach, for convenience, numerical values to qualitative variables (as for instance with variable X2, the occupation of the head of family), it does not make any sense to compute the average of those values to conclude that the average occupation is 4.73.
The median represents a value that is central with respect to all the observations of the variable: 50% of the observations are equal to or larger than this value, and 50% are equal to or lower.
Thus, given an ordered set of observations, the median (M ) is the value in the middle.
It is larger than no more than half of the observations and, simultaneously, lower than
no more than half of the observations.
The method to compute the median depends on whether the sample size (n) is even or
odd.
Example 1.2.3 [Even samples] Let us consider the observations of the variable X1
{x11 , x12 , x13 , x14 , x15 , x16 , x17 , x18 , x19 , x110 } = {2, 3, 4, 5, 8, 2, 4, 5, 7, 2}
Ordered:
{x11, x16, x110, x12, x13, x17, x14, x18, x19, x15} = {2, 2, 2, 3, 4, 4, 5, 5, 7, 8}
In this case, since the sample has an even number of observations, there is no observation right in the middle of the list (that would be the median). Notice that both x13 = 4 and x17 = 4 satisfy the necessary conditions for being the median:

Consider x13 = 4:
- 50% of the observations (that is, 5 observations) are larger than or equal to x13. Indeed, the 5 observations {x17, x14, x18, x19, x15} are larger than or equal to x13.
- 50% of the observations (that is, 5 observations) are smaller than or equal to x13. Indeed, the 5 observations {x11, x16, x110, x12, x17} are smaller than or equal to x13.

Consider x17 = 4:
- 50% of the observations (that is, 5 observations) are larger than or equal to x17. Indeed, the 5 observations {x13, x14, x18, x19, x15} are larger than or equal to x17.
- 50% of the observations (that is, 5 observations) are smaller than or equal to x17. Indeed, the 5 observations {x11, x16, x110, x12, x13} are smaller than or equal to x17.
In these cases (even samples) there are two possibilities that are equally correct:
1. The medians are x13 = 4 and x17 = 4. Notice that in this case the two observations that verify the condition for being a median have the same value (x13 = x17 = 4), but it could very well be that the two values were different.
2. The median is the average of the two central observations:
M = (4 + 4)/2 = 4
Example 1.2.4 [Odd sample] Let us consider the list of observations below regarding
the variable X (sample size n = 7) drawn from a given population
{x1 , x2 , . . . , x7 } = {3, 1, 4, 3, 2, 5, 1}
Ordered:
{x2, x7, x5, x1, x4, x3, x6} = {1, 1, 2, 3, 3, 4, 5}
Since the sample size is odd, there is an observation right in the middle of the ordered list (the 4th out of 7), and the median is M = 3.
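Both cases reduce to the same rule, sketched here in Python (our code):

```python
def median(data):
    """Middle value of the ordered data; for even n, the average of the two middle values."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median([2, 3, 4, 5, 8, 2, 4, 5, 7, 2]))  # even sample: (4 + 4) / 2 = 4.0
print(median([3, 1, 4, 3, 2, 5, 1]))           # odd sample: 3
```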
When the data is arranged in a table of frequencies (the raw data is not available), the median is found by looking at the cumulative absolute frequencies as follows:
1. We look for the first value ym whose cumulative absolute frequency reaches half the sample size, that is,
N_{m−1} < n/2 ≤ N_m
2. Then we check: if N_m > n/2, the median is M = ym; if N_m = n/2, the median is the average of ym and the next value,
M = (ym + ym+1)/2
When the variable is continuous, the median can be found in the same way. The only difference is that we look at the class intervals instead of the values, obtaining a median class. For instance, if we consider the variable X3, the median is in the class c2, [844.93, 1162.69).
The mode is the value (or values) with the highest frequency among the values around it. The absolute mode is the most frequent value in the whole sample.
There may exist only one mode, more than one mode, or none (if all values are equally frequent).
Formally, a mode is any value yq whose absolute frequency is not smaller than that of its neighboring values, that is, n_{q−1} ≤ n_q and n_q ≥ n_{q+1}.
In the case of continuous variables, the mode is found in the same way, but considering the class intervals instead of the values. This is called the modal class. For variable X3, for instance, the modes are in the classes with class marks c1, c4, c7, and c8.
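A sketch of the absolute mode in Python (our code), following the convention above that there is no mode when all values are equally frequent:

```python
from collections import Counter

def modes(data):
    """Most frequent value(s); empty list if every value appears equally often."""
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                                  # all values equally frequent: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 4, 5, 8, 2, 4, 5, 7, 2]))       # X1: the mode is 2
```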
Measures of variability are measures (statistics) that attempt to quantify how different the observations in the data are. Such differences are measured with respect to some central tendency measure, usually the mean. The three main measures of variability are the variance, the standard deviation, and the coefficient of variation.
Let us assume we propose the value v as a representative of the whole sample (a central value). It is clear that this implies some errors: each observation xi deviates from v by the error ei = xi − v. If we simply add up these errors, positive and negative errors will cancel each other, and we might mistakenly conclude that the total error is zero (or close to zero).
To keep positive and negative errors from canceling each other, we square the errors (ei²) so that all are positive. The result is known as the quadratic error. If we divide such quadratic error by the number of observations (n), that is, if we compute its mean, we find the so-called mean quadratic error:
M.Q.E.(v) = (1/n) Σ_{i=1}^{n} (xi − v)²
To represent the sample variance we usually use S². Thus, given the observations {x1, x2, ..., xn}, the sample variance (S²) can be computed as:

S² = (1/n) Σ_{i=1}^{n} (xi − x̄)²
It can be shown⁴ that the sample variance S² can also be found using the alternative formula below, which usually involves fewer computations than the original one:

S² = (1/n) Σ_{i=1}^{n} xi² − x̄²

When the data is arranged in a table of frequencies, the variance can be computed from the absolute or the relative frequencies:

S² = (1/n) Σ_{i=1}^{k} ni (yi − x̄)²

S² = Σ_{i=1}^{k} fi (yi − x̄)²

where {y1, y2, ..., yk} are the k different values the variable takes in the sample.
⁴ We suggest that the reader formally prove that the two formulas are equivalent; it is a good exercise on the manipulation of sums.
For reasons that will be given later on, in many cases the formula used to compute the sample variance is

Ŝ² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)²

or, equivalently,

Ŝ² = (1/(n−1)) Σ_{i=1}^{n} xi² − (n/(n−1)) x̄²
The variance measures the degree of dispersion of the data around the mean x̄. If we consider two different samples with equal means, the larger the variance, the more dispersed the observations are.
Example 1.2.7 Let us consider the variable X3 (monthly income). Since x̄3 = 1285.45, we have:

S3² = (1/10)((592.18 − 1285.45)² + (743.83 − 1285.45)² + ... + (707.72 − 1285.45)²) = 742678.8

Ŝ3² = (1/9)((592.18 − 1285.45)² + (743.83 − 1285.45)² + ... + (707.72 − 1285.45)²) = 825199.04
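The three formulas can be checked numerically (our code); small differences with respect to the rounded figures above are due to rounding:

```python
x3 = [592.18, 743.83, 527.18, 1090.47, 2744.23, 902.71, 888.26, 1588.76, 3069.2, 707.72]
n = len(x3)
m = sum(x3) / n

s2 = sum((x - m) ** 2 for x in x3) / n             # S² = (1/n) Σ (xi − x̄)²
s2_alt = sum(x * x for x in x3) / n - m ** 2       # shortcut: (1/n) Σ xi² − x̄²
s2_hat = n * s2 / (n - 1)                          # Ŝ², the (n − 1) version

print(round(s2, 1), round(s2_hat, 1))
```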
One important property is that, for any central value v, we have

S² = M.Q.E.(x̄) ≤ M.Q.E.(v)

That is, the sample mean x̄ is the value that minimizes the mean quadratic error.
Compared to the variance, the standard deviation returns the original units of measurement of the variable, which are squared when computing the variance. If we use the formula for Ŝ² instead of the one corresponding to S², then the standard deviation is denoted as Ŝ. Thus,

S = √( (1/n) Σ_{i=1}^{n} (xi − x̄)² ) = √S²

Ŝ = √( (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)² ) = √Ŝ²
Consider (case a) that we have some data measured in kg: {100, 200}. Clearly, the mean is

x̄a = (100 + 200)/2 = 150

and the variance and standard deviation are

Sa² = ((100 − 150)² + (200 − 150)²)/2 = 2500
Sa = √2500 = 50

Let us suppose now (case b) that we have the same data measured in grams instead: {100000, 200000}. Now the mean is

x̄b = (100000 + 200000)/2 = 150000

and the variance and standard deviation are

Sb² = ((100000 − 150000)² + (200000 − 150000)²)/2 = 2500000000
Sb = √2500000000 = 50000

Now, since Sb is much higher than Sa, one might be tempted to conclude that the observations in case b are much more dispersed than in case a because Sb > Sa, which is clearly false, since the two cases correspond to exactly the same observations; they are just presented in different units of measurement.
The coefficient of variation V is designed to avoid this misleading interpretation. It measures the dispersion while taking into account the units of measurement:

V = S / x̄
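The kg/grams example can be replayed with a small function (our code); the coefficient of variation is identical in both cases:

```python
def coef_variation(data):
    n = len(data)
    m = sum(data) / n
    s = (sum((x - m) ** 2 for x in data) / n) ** 0.5   # standard deviation S
    return s / m                                       # V = S / x̄

kg = [100, 200]
grams = [100000, 200000]
print(coef_variation(kg), coef_variation(grams))       # both equal 1/3
```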
1.2.2.5 Quartiles
The quartiles are those values in the sample that split all the (ordered) observations into 4 subgroups, in such a way that each subgroup contains no more than 25% of the data. That is,
First quartile: the value Q1 such that 25% of the observations are below Q1 and 75% of the observations are above.
Second quartile: the value Q2 such that 50% of the observations are below Q2 and 50% of the observations are above.
Third quartile: the value Q3 such that 75% of the observations are below Q3 and 25% of the observations are above.
1.2.2.6 Percentiles
Percentiles generalize the idea of quartiles: the percentile Pk is the value such that k% of the (ordered) observations lie below it and (100 − k)% lie above. Notice:
Q1 = P25
Q2 = P50 = median
Q3 = P75
The interquartile range is the distance between the first and the third quartiles:
RI = Q3 − Q1
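Quartiles can be computed in several slightly different ways; a common convention, sketched below (our code), applies the median rule to each half of the ordered sample:

```python
def quartiles(data):
    """(Q1, Q2, Q3) using the median of the lower and upper halves of the sorted data."""
    s = sorted(data)
    n = len(s)

    def med(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 == 1 else (v[m - 1] + v[m]) / 2

    return med(s[:n // 2]), med(s), med(s[(n + 1) // 2:])

q1, q2, q3 = quartiles([2, 3, 4, 5, 8, 2, 4, 5, 7, 2])
print(q1, q2, q3, q3 - q1)        # the last value is the interquartile range RI
```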
1.3 Histograms and Other Graphic Representations

Example 1.3.1 Let us consider the variable X1 (number of members). The corresponding bar diagram is shown in figure 1.2. Notice that the height of each bar is proportional to the relative frequency of each value. The graph would look the same if we used the absolute frequencies instead, only the scale on the vertical axis would be different.
A circle diagram (or pie chart) is a circle divided into as many sectors as different values the variable takes in the sample. The area of each sector is proportional to the frequency of the corresponding value.
Example 1.3.2 Let us consider the variable X1 (number of members). The corresponding circle diagram is shown in figure 1.3.
The main graphical representations in the case of continuous variables are histograms
and box plots.
1.3.2.1 Histograms
Histograms are equivalent to bar diagrams, but in this case it is the area of each bar, built over the corresponding class interval, that represents the frequency.
Example 1.3.3 Let us consider the variable X3 (monthly income). The corresponding histogram is shown in figure 1.4. Notice that the area of each bar is proportional to the relative frequency of each class. The graph would look the same if we used the absolute frequencies instead, only the scale on the vertical axis would be different.
A box plot informs us about the dispersion of the data. The plot presents graphically the minimum and maximum values in the sample, the quartiles, and the interquartile range.
Figure 1.5 explains the information contained in this type of graphical representation.
Example 1.3.4 Let us consider the variable X3 (monthly income). The corresponding box plot is shown in figure 1.6.
In both cases we have a table with as many rows as different values (x1, x2, ..., xn) the first variable (usually denoted by X) takes, and as many columns as different values (y1, y2, ..., ym) the second variable (usually denoted by Y) takes.
[IMPORTANT] Note the change in notation. Up to now, in the univariate analysis, x1, x2, ..., xn denoted the n observations of the variable X, and we used y1, y2, ..., yk to denote the k different values taken by the variable in the sample. From now on, in the bivariate analysis, we will work only with the different values taken by each variable, X and Y. In this sense, we will denote with x1, ..., xn the n different values taken by X, and with y1, ..., ym the m different values taken by Y. Finally, we use N to denote the sample size, that is, the number of joint observations of the two variables.
At the entry that corresponds to the row for value xi of X and the column for value yj of Y, we write the absolute frequency, nij, of the pair (xi, yj), that is, how many observations in the sample take the value xi (for X) and, simultaneously, the value yj (for Y).
Table 1.5 shows the configuration of such a bivariate table of frequencies.
Example 1.4.1 Using the data in table 1.1 we can construct the bivariate table of frequencies (table 1.6) that corresponds to the variables X1 and X5.
The bivariate table of frequencies determines the so-called bivariate (or joint) distribution of frequencies, in which the values nij correspond to the joint absolute frequencies of the variables X and Y. From such joint absolute frequencies we can obtain the joint relative frequencies fij:
fij = nij / N
X\Y | y1   y2   ...  yj   ...  ym
x1  | n11  n12  ...  n1j  ...  n1m
x2  | n21  n22  ...  n2j  ...  n2m
... | ...
xi  | ni1  ni2  ...  nij  ...  nim
... | ...
xk  | nk1  nk2  ...  nkj  ...  nkm
X1\X5 | 0 | 1
2     | 2 | 1
3     | 1 | 0
4     | 0 | 2
5     | 1 | 1
7     | 0 | 1
8     | 0 | 1
The joint frequencies always satisfy:
1. Σ_{i=1}^{k} Σ_{j=1}^{m} nij = N
2. Σ_{i=1}^{k} Σ_{j=1}^{m} fij = 1
The marginal frequencies of each variable are obtained by adding the joint frequencies over the values of the other variable:
ni· = Σ_{j=1}^{m} nij
n·j = Σ_{i=1}^{k} nij
X\Y | y1   y2   ...  yj   ...  ym   | ni·  fi·
x1  | n11  n12  ...  n1j  ...  n1m  | n1·  f1·
x2  | n21  n22  ...  n2j  ...  n2m  | n2·  f2·
... | ...                           | ...
xi  | ni1  ni2  ...  nij  ...  nim  | ni·  fi·
... | ...                           | ...
xk  | nk1  nk2  ...  nkj  ...  nkm  | nk·  fk·
n·j | n·1  n·2  ...  n·j  ...  n·m  | N
f·j | f·1  f·2  ...  f·j  ...  f·m  | 1
Example 1.4.2 Continuing with example 1.4.1, we can complete table 1.6 for X1 and X5 with the marginal distributions of X1 and X5 (table 1.8).
X1\X5 | 0   | 1   | ni·  | fi·
2     | 2   | 1   | 3    | 0.3
3     | 1   | 0   | 1    | 0.1
4     | 0   | 2   | 2    | 0.2
5     | 1   | 1   | 2    | 0.2
7     | 0   | 1   | 1    | 0.1
8     | 0   | 1   | 1    | 0.1
n·j   | 4   | 6   | 10   | 1
f·j   | 0.4 | 0.6 | 1    |
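Table 1.8 can be reproduced from the raw columns of table 1.1 (our code; x5 lists the ADSL indicator family by family):

```python
from collections import Counter

x1 = [2, 3, 4, 5, 8, 2, 4, 5, 7, 2]        # number of members
x5 = [0, 0, 1, 0, 1, 1, 1, 1, 1, 0]        # ADSL (0/1), same family order
N = len(x1)

joint = Counter(zip(x1, x5))               # joint absolute frequencies n_ij
marg_x1 = Counter(x1)                      # marginal n_i. (row sums)
marg_x5 = Counter(x5)                      # marginal n_.j (column sums)

print(joint[(5, 0)], marg_x5[0], marg_x5[1])
```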
Notice that the marginal frequencies for X1 (the column ni·) correspond to those found in table 1.2.
With the joint distribution of frequencies that we have seen in example 1.4.1 we may answer questions like: what proportion of the families have 5 members and no ADSL (X1 = 5 and X5 = 0)? The answer would be 1 over a total of 10 families, that is, 10% (0.1).
Sometimes, though, the question of interest is different. For instance, we might be interested in the following question:
For the families that do not have ADSL (X5 = 0), what percentage has 5 members?
The answer now would be 1 family over a total of 4 families without ADSL, that is, 25% (0.25).
This type of information is called a conditional frequency, for it refers to the frequency of a given value of one of the variables conditional on one specific value of the other variable. In the previous example we have found that the relative frequency of the value X1 = 5 conditional on X5 = 0 is 1/4 = 0.25.
Analogously, given two variables X and Y with values {x1, x2, ..., xn} and {y1, y2, ..., ym} respectively, the relative frequency of the value yj conditional on X taking the value xi is denoted with

f_j^{Y/X=xi}

and is obtained from the information in the bivariate table of frequencies by means of the formula

f_j^{Y/X=xi} = nij / ni·
These conditional frequencies can also be arranged in a table, which then presents the conditional frequency distributions.
This way, from table 1.7 we can construct the following two tables of conditional frequencies, one for each variable.
Conditional frequencies verify the following properties:
           | y1                        ...  yj                        ...  ym                        | Total
f^{Y/X=x1} | f1^{Y/X=x1} = n11 / n1·   ...  fj^{Y/X=x1} = n1j / n1·   ...  fm^{Y/X=x1} = n1m / n1·   | 1
f^{Y/X=x2} | f1^{Y/X=x2} = n21 / n2·   ...  fj^{Y/X=x2} = n2j / n2·   ...  fm^{Y/X=x2} = n2m / n2·   | 1
...        | ...                                                                                    | ...
f^{Y/X=xi} | f1^{Y/X=xi} = ni1 / ni·   ...  fj^{Y/X=xi} = nij / ni·   ...  fm^{Y/X=xi} = nim / ni·   | 1
...        | ...                                                                                    | ...
f^{Y/X=xn} | f1^{Y/X=xn} = nn1 / nn·   ...  fj^{Y/X=xn} = nnj / nn·   ...  fm^{Y/X=xn} = nnm / nn·   | 1
1. Σ_{i=1}^{n} f_i^{X/Y=yj} = 1
Indeed, since

f_i^{X/Y=yj} = nij / n·j

we have that

Σ_{i=1}^{n} f_i^{X/Y=yj} = Σ_{i=1}^{n} nij / n·j = (1/n·j) Σ_{i=1}^{n} nij = n·j / n·j = 1

2. Σ_{j=1}^{m} f_j^{Y/X=xi} = 1, for the same reasons.
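Conditional distributions are just the joint counts of one row (or column) divided by the corresponding marginal; a sketch (our code):

```python
from collections import Counter

x1 = [2, 3, 4, 5, 8, 2, 4, 5, 7, 2]
x5 = [0, 0, 1, 0, 1, 1, 1, 1, 1, 0]

def conditional(values, given, g):
    """Relative frequencies of `values` among the observations where `given` equals g."""
    sub = [v for v, w in zip(values, given) if w == g]
    c = Counter(sub)
    return {v: c[v] / len(sub) for v in sorted(c)}

print(conditional(x1, x5, 0))   # f^{X1/X5=0}: {2: 0.5, 3: 0.25, 5: 0.25}
```

Each conditional distribution sums to 1, in agreement with the properties above.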
Example 1.4.3 Continuing with example 1.4.2, we can compute the conditional frequencies for X1 (table 1.11) and X5 (table 1.12) from the information given in table 1.8.
Table 1.11: conditional distributions of X1 given X5
X1    | f^{X1/X5=0} | f^{X1/X5=1}
2     | 1/2         | 1/6
3     | 1/4         | 0
4     | 0           | 1/3
5     | 1/4         | 1/6
7     | 0           | 1/6
8     | 0           | 1/6
Total | 1           | 1

Table 1.12: conditional distributions of X5 given X1
            | X5 = 0 | X5 = 1 | Total
f^{X5/X1=2} | 2/3    | 1/3    | 1
f^{X5/X1=3} | 1      | 0      | 1
f^{X5/X1=4} | 0      | 1      | 1
f^{X5/X1=5} | 1/2    | 1/2    | 1
f^{X5/X1=7} | 0      | 1      | 1
f^{X5/X1=8} | 0      | 1      | 1
1.5 Covariance and Correlation Coefficient

The covariance measures how two variables vary together: it is positive when high values of one of the variables tend to occur together with high values of the other, and negative in the opposite case, when high values of one of the variables occur when the other variable takes low values.
The covariance between variables X and Y can be computed as
S_XY = (1/N) Σ_{i=1}^{k} Σ_{j=1}^{m} nij (xi − x̄)(yj − ȳ)
If the information is presented in joint relative frequencies (fij) instead of joint absolute frequencies (nij), given that fij = nij / N, we can compute the covariance as

S_XY = Σ_{i=1}^{k} Σ_{j=1}^{m} fij (xi − x̄)(yj − ȳ)
One can show that the maximum and the minimum values that the covariance can take are related to the variances of the two variables. In this sense we have that S_XY² ≤ S_X² S_Y², and therefore

−S_X S_Y ≤ S_XY ≤ S_X S_Y    (1.1)
If we had the original raw data with all the observations in the sample⁶, {(x1, y1), (x2, y2), ..., (xN, yN)}, we might compute the covariance as

S_XY = (1/N) Σ_{i=1}^{N} (xi − x̄)(yi − ȳ)

⁶ Notice that (x1, y1), ..., (xN, yN) now denote the raw data (the whole sample), and not the different values that the observations take.
or, equivalently,

S_XY = (1/N) Σ_{i=1}^{N} xi yi − x̄ ȳ
We will see later that if two variables are independent (the behavior of one of them has
nothing to do with the behavior of the other), then the covariance between them is zero.
Thus, the covariance between two variables can be interpreted as a measure of the intensity of the relationship between them. The value of the covariance, though, depends on the units of measure of the variables. Then, it could happen (as we have seen in 1.2.2.4) that we find a high intensity (high covariance) simply because the units of measure of the variables are large. In this case, the conclusion that the relationship is intense would be wrong.
To address this issue we use the so-called correlation coefficient r (or Pearson's coefficient):

r = S_XY / (S_X S_Y)

which, by (1.1), always satisfies −1 ≤ r ≤ 1.
A direct way to compute r is using the formulas for S_XY, S_X and S_Y, that is,

r = [ Σ_{i=1}^{k} Σ_{j=1}^{m} nij (xi − x̄)(yj − ȳ) ] / √[ ( Σ_{i=1}^{k} ni· (xi − x̄)² ) ( Σ_{j=1}^{m} n·j (yj − ȳ)² ) ]
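From raw paired data the whole computation fits in a few lines (our code); applied to X1 and X3 it shows a strong positive correlation between family size and income in this sample:

```python
def cov_corr(x, y):
    N = len(x)
    mx, my = sum(x) / N, sum(y) / N
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / N   # covariance S_XY
    sx = (sum((a - mx) ** 2 for a in x) / N) ** 0.5            # S_X
    sy = (sum((b - my) ** 2 for b in y) / N) ** 0.5            # S_Y
    return sxy, sxy / (sx * sy)                                # (S_XY, r)

x1 = [2, 3, 4, 5, 8, 2, 4, 5, 7, 2]
x3 = [592.18, 743.83, 527.18, 1090.47, 2744.23, 902.71, 888.26, 1588.76, 3069.2, 707.72]
sxy, r = cov_corr(x1, x3)
print(round(sxy, 2), round(r, 3))
```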
1.6 Mean and Variance of Linear Combinations of Variables

Variables are often built as linear combinations of other variables. For example, the total cost (C) of a firm is the amount of labor used (L) multiplied by its unit cost (w), plus the use of capital (K) also multiplied by its unit cost (r):

C = wL + rK
In general, we say that the variable X is a linear combination of the variables X1, X2 if there exist linear coefficients a1, a2 (real numbers) such that

X = a1 X1 + a2 X2

The following properties regarding the mean and variance of linear combinations of variables are of interest:

x̄ = a1 x̄1 + a2 x̄2

S_X² = a1² S_{X1}² + a2² S_{X2}²    if X1, X2 are independent
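These two properties can be checked on simulated data (our sketch; the variance identity holds only up to sampling noise, because simulated X1 and X2 are independent in distribution but their sample covariance is not exactly zero):

```python
import random

random.seed(1)
N = 100_000
X1 = [random.gauss(2.0, 1.0) for _ in range(N)]
X2 = [random.gauss(-1.0, 0.5) for _ in range(N)]
a1, a2 = 3.0, 2.0
X = [a1 * u + a2 * v for u, v in zip(X1, X2)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print(abs(mean(X) - (a1 * mean(X1) + a2 * mean(X2))))      # ~0: the mean identity is exact
print(var(X) - (a1**2 * var(X1) + a2**2 * var(X2)))        # small: only sampling noise
```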
Moreover, if Y = b1 Y1 + b2 Y2 is another linear combination, analogous properties hold for its mean and variance.

Mean vector When we work with two variables at the same time, their means can be collected in the mean vector⁷

x̄ = (x̄1, x̄2)

⁷ Notice that now we work again with the raw data (the list of all the observations) and not with the different values they take.

Such a mean vector x̄ can be computed using matrix algebra with the data matrix X. Indeed, if we denote with 1 the (row) vector with N components, all of them equal to 1:

1 = (1, 1, ..., 1)
we have that

1X = (1, 1, ..., 1) · [ x11  x21
                        x12  x22
                        ...
                        x1N  x2N ]  =  ( Σ_{i=1}^{N} x1i , Σ_{i=1}^{N} x2i )

Thus,

x̄ = (1/N) 1X
Covariance matrix Finally, the variances of and covariances between X1 and X2 are presented in the covariance matrix

Σ = [ S²_{X1}     S_{X1X2}
      S_{X2X1}    S²_{X2} ]
As with the mean vector, this matrix can be computed using matrix algebra. Indeed, let 1ᵀ denote the transpose of the vector 1 (a column vector):

1ᵀ = [ 1
       1
       ...
       1 ]
Then,

1ᵀ x̄ = [ x̄1  x̄2
         x̄1  x̄2
         ...
         x̄1  x̄2 ]

an N × 2 matrix that repeats the mean vector in every row. Therefore,

X − 1ᵀ x̄ = [ x11 − x̄1   x21 − x̄2
             x12 − x̄1   x22 − x̄2
             ...
             x1N − x̄1   x2N − x̄2 ]
Then we have

(X − 1ᵀx̄)ᵀ (X − 1ᵀx̄) = [ Σ_{i=1}^{N} (x1i − x̄1)²              Σ_{i=1}^{N} (x1i − x̄1)(x2i − x̄2)
                          Σ_{i=1}^{N} (x2i − x̄2)(x1i − x̄1)    Σ_{i=1}^{N} (x2i − x̄2)² ]

If we now multiply by 1/N we find

(1/N) (X − 1ᵀx̄)ᵀ (X − 1ᵀx̄) = [ (1/N) Σ (x1i − x̄1)²              (1/N) Σ (x1i − x̄1)(x2i − x̄2)
                                (1/N) Σ (x2i − x̄2)(x1i − x̄1)    (1/N) Σ (x2i − x̄2)² ]

                            = [ S²_{X1}     S_{X1X2}
                                S_{X2X1}    S²_{X2} ]
That is,

Σ = (1/N) (X − 1ᵀx̄)ᵀ (X − 1ᵀx̄)

This matrix is symmetric, since S_{X1X2} = S_{X2X1}.
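With NumPy the two matrix formulas become one-liners (our sketch, using X1 and X3 from table 1.1 as the two columns of the data matrix); the result agrees with np.cov with bias=True:

```python
import numpy as np

# Data matrix X: one row per family, columns X1 (members) and X3 (income)
X = np.array([[2, 592.18], [3, 743.83], [4, 527.18], [5, 1090.47], [8, 2744.23],
              [2, 902.71], [4, 888.26], [5, 1588.76], [7, 3069.2], [2, 707.72]])
N = X.shape[0]
ones = np.ones((N, 1))

xbar = ones.T @ X / N                 # mean vector x̄ = (1/N) 1X
D = X - ones @ xbar                   # deviation matrix X − 1ᵀx̄
S = D.T @ D / N                       # covariance matrix Σ = (1/N) DᵀD

print(xbar)                           # mean vector (4.2, 1285.454)
print(S)                              # symmetric 2×2 covariance matrix
```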