STATISTICS
The word `statistics' has been derived from the Latin word
`status', the Italian word `statista' and the German word
`statistik'. Each of these words means a `political state' or a
`government'.
Presently, the word statistics is used in two different, but interrelated, ways, viz. (i) as a plural noun, and (ii) as a singular
noun.
As a plural noun - When used as a plural noun, the word
`statistics' means statistical data. Prof. Horace Secrist defines
statistics in this sense as follows:
"By statistics we mean aggregates of facts affected to a
marked extent by multiplicity of causes, numerically
expressed, enumerated or estimated according to reasonable
standards of accuracy, collected in a systematic manner for a
pre-determined purpose and placed in relation to each other."
Observe the chief characteristics of statistics in this definition:
1. Aggregates of facts
2. Numerically expressed
3. Affected by multiplicity of causes
4. Estimated according to reasonable standards of accuracy
5. Collected in a systematic manner
6. Collected for a pre-determined purpose
7. Placed in relation to each other
LIMITATIONS OF STATISTICS
1. Does not deal with individual facts
2. Ignores the qualitative aspects
3. Is not an end in itself
4. Can be misused
5. Good understanding is required
COLLECTION OF DATA
A sound structure of statistical investigation is based on a
systematic collection of data. Data are generally classified into two
groups, viz.
(a) internal data and
(b) external data.
Internal data come from the internal records related to the
operations of a business firm: records of production, purchases
and the accounting system. They are generally associated with
the organizational and functional activities of the firm. Internal
data can be insufficient or inappropriate for the problem under
investigation, so we also need external data to make decisions.
External data are collected and published by agencies external to
the enterprise, and can be obtained from either primary or
secondary sources.
Primary Data vs. Secondary Data
1. Originality: Primary data are original, because the investigator
himself collects the data; secondary data are not original, since the
investigator makes use of data collected by other agencies.
2. Collection: …
3. Suitability: Primary data collected in a systematic manner will
suit the enquiry; secondary data may or may not suit the objects of
the enquiry.
4. …
The raw observations on which the following frequency distribution
is based:
 93  97 101 115 149 135 120 130 140 110
 93 109 113  98 111 100 102 107 103  90
142 111 108 102 109 107 119 113  96 120
135  91 110 117 104 105 120 114  92 110
120 102  92 114  99 112 107  99 100 115
115  90 136 110 106 123 109 114 109 117
114  98 106 110 104 134 109 127 113 119
113 116 124 123 110 136 132 116 108 121
112 141 109 116 109 141 117 134  98  92
110 109 122 109  97  93 107 104 108  87
 89 121 111 110 103 114 113 150 156 104
117 114 110 121 107 106 114 142 114 120
112 116 109 111 113 114  98 113 112 121
 99 109 123 111 116 104  99 109 117 109
109 110  97 105 102 109 101  97 103
Class Interval   Frequency
87-91              5
92-96              8
97-101            15
102-106           18
107-111           38
112-116           28
117-121           16
122-126            5
127-131            2
132-136            7
137-141            3
142-146            2
147-151            2
152-156            1
                N = 150
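The grouping above can be reproduced mechanically; a minimal Python sketch, using a short illustrative sample rather than the full data set:

```python
# Sketch: grouping raw observations into class intervals of width 5,
# as in the table above. The sample values here are illustrative.
data = [93, 97, 101, 115, 149, 135, 120, 130, 140, 110]

# Class intervals 87-91, 92-96, ..., 152-156 (inclusive limits).
intervals = [(lo, lo + 4) for lo in range(87, 157, 5)]

freq = {iv: 0 for iv in intervals}
for x in data:
    for lo, hi in intervals:
        if lo <= x <= hi:
            freq[(lo, hi)] += 1
            break

for (lo, hi), f in freq.items():
    if f:
        print(f"{lo}-{hi}: {f}")
```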
Class Interval    f   Cum   Cum %
87-91             5     5      3
92-96             8    13      9
97-101           15    28     19
102-106          18    46     31
107-111          38    84     56
112-116          28   112     75
117-121          16   128     85
122-126           5   133     89
127-131           2   135     90
132-136           7   142     95
137-141           3   145     97
142-146           2   147     98
147-151           2   149     99
152-156           1   150    100
                N = 150
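The cumulative and cumulative-% columns can be rebuilt from the class frequencies; a minimal sketch, with the frequencies implied by the cumulative totals above:

```python
# Sketch: building the Cum and Cum % columns from the class
# frequencies (one per interval, N = 150).
freqs = [5, 8, 15, 18, 38, 28, 16, 5, 2, 7, 3, 2, 2, 1]
N = sum(freqs)  # 150

cum = []
running = 0
for f in freqs:
    running += f
    cum.append(running)

cum_pct = [round(100 * c / N) for c in cum]
print(cum)      # [5, 13, 28, 46, 84, 112, 128, 133, 135, 142, 145, 147, 149, 150]
print(cum_pct)  # [3, 9, 19, 31, 56, 75, 85, 89, 90, 95, 97, 98, 99, 100]
```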
Graphic Techniques
There are always some people who would rather not read
tables, who could understand the information better if it were
presented in pictorial form.
Our prehistoric ancestors
undoubtedly knew this when they made the first cave drawings.
Similarly, the Egyptians, Greeks and Romans used drawings and
sculptures to convey information about their respective societies.
Thus, art was used to carry information throughout the ages. Art
is also valuable to us in describing information.
Graphs, the pictorial forms that follow, are not meant to
substitute for tabular construction. Rather they are meant as
visual aids that help us to describe and think about the shape
of the distribution. In fact, you cannot plan or construct a graph
until you have prepared the corresponding table. The graphic
forms shown here correspond to both qualitative and quantitative
distributions.
The Histogram
The histogram is the graphic equivalent of the grouped distribution
for interval-level data. It consists of a set of adjacent bars whose
heights are proportional either to the absolute frequencies or to the
proportions of cases in each interval of the variable.
The most noticeable feature of the histogram is its structural
simplicity. Bars are understood more easily than numbers. The
histogram shows the relative concentration of data in each interval as
well as the shape of the distribution.
The Polygon
It is easy to convert a histogram into the much-used
polygon. All we need to do is to connect the midpoints of
the tops of the bars with straight lines.
Polygons are particularly useful when we wish to present a
comparison of two or more distributions on the same graph.
They do not blur their respective outlines, as histograms
do.
The Ogive
When a graph is used to present a cumulative
percentage distribution, it is called an ogive. The ogive
is constructed on a pair of perpendicular axes, just like
the polygon.
The horizontal axis represents the values for the
upper true limits of each class interval, and the vertical
axis indicates the percentage of observations for each
interval.
A dot is then placed directly above the upper true limit
of each class interval, at whatever height is appropriate,
to indicate the proportion of cases less than that limit.
After plotting all interval values with their corresponding
percentages, the dots are joined by straight lines.
MEASUREMENTS
MEASURES OF CENTRAL TENDENCY
A central tendency is a single figure that represents the
whole distribution. Individual observations in a
distribution show a general tendency to concentrate at
certain values, usually somewhere in the centre of the
distribution.
A measure of central tendency thus represents the whole
distribution. We talk of the average per capita income of
India, the average size of holdings in India, the average
productivity of labour in India, the average cost of production
of cloth, the average life of an Indian, etc.
Three important measures of central tendency are mean,
median and mode.
Arithmetic Mean
Arithmetic mean, or simply the
`mean', is the most commonly used of all
averages, e.g., we frequently talk of average
monthly income, average monthly expenditure,
average marks secured by the students, average
petrol consumption of car or scooter in a day,
average productivity per farm, average bonus
paid, etc.
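A minimal sketch of the arithmetic mean, using an illustrative series:

```python
# Sketch: arithmetic mean = sum of the values / number of values.
values = [80, 85, 86, 86, 86, 87, 89, 90]  # illustrative daily wages in Rs.
mean = sum(values) / len(values)
print(mean)  # 86.125
```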
Median
The effect of an extreme value can be avoided if we
take a measure of central position in a given series. This
position measure is called the median.
Median is a value which divides the series into two
equal parts. Thus if we have the median value, the number
of items less than this value and the number of items more
than this value will be equal.
To get the median value of a series arranged in order of
magnitude, we make use of the following formula:
M = size of the ((N + 1)/2)th item
where M stands for the median, and N for the number of items
in the series.
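A minimal sketch of the formula, assuming an odd number of items so that (N + 1)/2 is a whole position:

```python
# Sketch: median as the size of the (N + 1)/2-th item in an ordered series.
series = sorted([12, 7, 9, 15, 11])  # illustrative values
N = len(series)                       # 5
position = (N + 1) / 2                # 3rd item
median = series[int(position) - 1]    # lists are 0-indexed
print(median)  # 11
```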
Mode
A third important measure of central tendency is
called mode, which is denoted as Z. Mode is the most
common value found in a series.
For example, the daily wages of labourers employed in
Defence Colony are Rs. 80, 85, 86, 86, 86, 87, 89, 90. The
modal wage will be Rs. 86, because it is the most commonly
found value, i.e., it occurs most frequently.
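The wage example can be checked directly; a minimal sketch using the standard-library Counter:

```python
# Sketch: mode = the most frequently occurring value in the series.
from collections import Counter

wages = [80, 85, 86, 86, 86, 87, 89, 90]  # the wage series from the example
mode = Counter(wages).most_common(1)[0][0]
print(mode)  # 86
```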
Characteristics of Mean, Median and Mode
(The source compares the three averages, characteristic by
characteristic, with Yes/No entries; among the characteristics are
"It is rigidly defined" and "It is easily understandable". The
row-by-row alignment of the entries did not survive extraction.)
MEAN DEVIATION
Mean deviation shows the scatter around an average.
It is like measuring the scatter of the population of a city.
Some people live close to the centre of the city and others at
varying distances. Their average distance from the centre
indicates how scattered or dispersed they are.
Mean deviation is defined as an average or mean of
the deviations of the values from the central tendency. The
central tendency used can be either arithmetic mean or
median. Here we take mean for the calculation of mean
deviations.
M.D. = Σ|dx| / N
Coefficient of Mean Deviation = M.D. / Mean (or M.D. / Median,
when the deviations are taken from the median)
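A minimal sketch of the mean deviation about the arithmetic mean, with illustrative values:

```python
# Sketch: mean deviation = average of the absolute deviations |dx|
# of the values from the arithmetic mean.
values = [10, 12, 14, 16, 18]                 # illustrative series
mean = sum(values) / len(values)              # 14.0
deviations = [abs(x - mean) for x in values]  # |dx|
md = sum(deviations) / len(values)
print(md)  # 2.4
```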
STANDARD DEVIATION
It is another related measure of variation. In mean
deviation we take the sum of the deviations after ignoring their
plus and minus signs. In standard deviation we achieve the same
effect in another way: we square all the deviations, and the
squared deviations will always be positive.
Standard deviation is the square root of the arithmetic mean of
the squared deviations.
Standard deviation is generally
denoted by σ (sigma):
σ = √(Σdx² / N)
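A minimal sketch of the standard deviation, using the same kind of illustrative series:

```python
# Sketch: standard deviation = square root of the mean of the
# squared deviations from the arithmetic mean.
import math

values = [10, 12, 14, 16, 18]               # illustrative series
mean = sum(values) / len(values)            # 14.0
sq_dev = [(x - mean) ** 2 for x in values]  # dx^2
sigma = math.sqrt(sum(sq_dev) / len(values))
print(sigma)  # ~2.828 (sqrt of 8)
```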
CORRELATION
Measure of central tendency, dispersion and skewness
describe the nature of distribution relating to a single variable.
One may also be interested in studying relationship between two
and more variables e.g., income and consumption; price and
demand; quantity of input and output are related variables;
productivity and wage also depends upon each other.
Two variables may be positively related or negatively
related.
Coefficient of Correlation
The mathematical technique which describes the
covariance in ratio terms is known as co-efficient of correlation.
The co-efficient of correlation was initially conceived by the
statistician Karl Pearson.
Karl Pearson's coefficient of
correlation (also known as product-moment co-efficient)
generally denoted by `r' is expressed as follows :
r = Σdxdy / (N σx σy)
where dx and dy are the deviations of X and Y from their respective
means, and σx, σy are the standard deviations of X and Y.
Interpretation of r:
-1            Perfectly negative
+1            Perfectly positive
0             No relation
0.10 to 0.25  Low correlation
0.30 to 0.55  Moderate correlation
0.60 to 0.99  High correlation
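A minimal sketch of Pearson's r computed from deviations, with illustrative data chosen to give a perfectly positive relation:

```python
# Sketch: r = sum(dx * dy) / (N * sx * sy), where dx, dy are
# deviations from the respective means. The data are illustrative.
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y moves in perfect step with x
N = len(x)

mx, my = sum(x) / N, sum(y) / N
dx = [xi - mx for xi in x]
dy = [yi - my for yi in y]

sx = math.sqrt(sum(d ** 2 for d in dx) / N)
sy = math.sqrt(sum(d ** 2 for d in dy) / N)

r = sum(a * b for a, b in zip(dx, dy)) / (N * sx * sy)
print(r)  # ~1.0, a perfectly positive correlation
```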
RANK CORRELATION
Prof. Charles Spearman has conceived another coefficient of correlation.
This co-efficient is expressed as R and is based on the ranking
of the various values of the two variables.
R = 1 - (6 ΣD²) / (N(N² - 1))
where D is the difference between the ranks of the corresponding
values of the two variables.
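A minimal sketch of Spearman's R, assuming illustrative ranks with no ties:

```python
# Sketch: R = 1 - 6*sum(D^2) / (N*(N^2 - 1)), D being the difference
# between the two ranks of each item. Ranks here are illustrative.
rank_x = [1, 2, 3, 4, 5]
rank_y = [2, 1, 4, 3, 5]
N = len(rank_x)

d_sq = [(a - b) ** 2 for a, b in zip(rank_x, rank_y)]
R = 1 - (6 * sum(d_sq)) / (N * (N ** 2 - 1))
print(R)  # 0.8
```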
REGRESSION
The term regression was first used by Sir Francis Galton
in his studies of Inheritance of Stature. He, along with his
friend, Karl Pearson, studied the heights of 1,078 sons along with
the heights of their fathers. It was found that tall fathers
tend to have tall sons and short fathers tend to have short sons,
but the average height of the sons of tall fathers was less than the
average height of their fathers, while the average height of the sons
of short fathers was more than the average height of their fathers.
Galton named this tendency `regression'.
It is used to explain the value of one variable with respect to the
value of the other variable, i.e., it explains the functional
relationship between the two variables.
The relationship is explained with the help of regression lines.
Regression Lines
The line which shows the functional relationship between the two
variables is known as the ' line of best fit '. Since there are two variables,
X and Y, therefore, there are two regression lines.
Regression line of X on Y explains the functional relationship of X when
the value of Y variable is given, whereas, the regression line of Y on X
explains the functional relationship of Y when the value of X variable is
given.
Regression equations
The regression equations explain the functional relationship
between the two variables. As there are two regression lines, there are
two regression equations.
i) Regression equation of X on Y:
In this equation the probable values of X are estimated with the
help of the independent variable Y. Plotting these values on graph
paper, we get the line known as the regression line of X on Y.
X = a + bY
ii) Regression equation of Y on X:
In this equation the probable values of Y are estimated with the
help of the independent variable X:
Y = a + bX
The constants a and b in each equation are obtained by solving the
corresponding normal equations.
For X on Y:
ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
For Y on X:
ΣY = Na + bΣX
ΣXY = aΣX + bΣX²
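Eliminating a between the two normal equations for Y on X gives the familiar slope formula; a minimal sketch with illustrative points lying exactly on y = 1 + 2x:

```python
# Sketch: fitting Y = a + bX from the normal equations
#   sum(Y)  = N*a + b*sum(X)
#   sum(XY) = a*sum(X) + b*sum(X^2)
# Eliminating a gives b = (N*Sxy - Sx*Sy) / (N*Sx2 - Sx^2).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # illustrative: exactly y = 1 + 2x
N = len(x)

Sx, Sy = sum(x), sum(y)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sx2 = sum(xi ** 2 for xi in x)

b = (N * Sxy - Sx * Sy) / (N * Sx2 - Sx ** 2)  # slope
a = (Sy - b * Sx) / N                          # intercept
print(a, b)  # 1.0 2.0
```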
MULTIVARIATES
When you study a single variable, the case is called a
univariate case. With two variables it is called a
bivariate case. With more than two variables it is
called a multivariate case.
Consider n variables. Then the parameters which we studied
earlier, such as the mean, variance, covariance and
correlation, now become:
a mean vector,
a variance-covariance matrix, and
a correlation matrix.
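A minimal sketch for n = 2 variables, using population (divide-by-N) formulas and illustrative data:

```python
# Sketch: mean vector, variance-covariance matrix and correlation
# matrix for two variables. The data are illustrative.
import math

data = [[2, 4, 6, 8],   # variable 1
        [1, 3, 5, 7]]   # variable 2
N = len(data[0])
n = len(data)

means = [sum(v) / N for v in data]  # mean vector

def cov(u, v, mu, mv):
    # population covariance of two equal-length series
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / N

cov_mat = [[cov(data[i], data[j], means[i], means[j])
            for j in range(n)] for i in range(n)]

corr_mat = [[cov_mat[i][j] / math.sqrt(cov_mat[i][i] * cov_mat[j][j])
             for j in range(n)] for i in range(n)]

print(means)     # [5.0, 4.0]
print(cov_mat)   # [[5.0, 5.0], [5.0, 5.0]]
print(corr_mat)  # [[1.0, 1.0], [1.0, 1.0]]
```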