
Statistics for Spatial Analysis

Slides are based on Notes of Shri. S.K. Mittal

STATISTICS
The word `Statistics' is derived from the Latin word
`Status', the Italian word `Statista' and the German word
`Statistik'. Each of these words means a `political state' or a
`Government'.
Presently, the word statistics is used in two different, but interrelated, ways, viz. (i) as a plural noun, and (ii) as a singular
noun.
As a Plural noun - When used as a plural noun, the word
`statistics' means statistical data. Prof. Horace Secrist defines
statistics in this sense as given below :
By Statistics, we mean aggregate of facts affected to a
marked extent by the multiplicity of causes, numerically
expressed, enumerated or estimated according to reasonable
standard of accuracy, collected in a systematic manner for a
pre-determined purpose and placed in relation to each other.

From this definition of statistics, we observe the following characteristics:

CHARACTERISTICS OF STATISTICAL DATA

1. Aggregates of facts
2. Numerically expressed
3. Affected by multiplicity of causes
4. Estimated according to reasonable standards of accuracy
5. Collected in a systematic manner
6. Collected for a pre-determined purpose
7. Placed in relation to each other

As a singular noun - When used as a singular noun, `statistics'
refers to a science which deals with the methods of collecting,
classifying, presenting, comparing and interpreting
numerical data.
In this sense, statistics is also known as `statistical
methods'. The important statistical methods are as follows:
STATISTICAL METHODS
1. Collection of data
2. Classification of data
3. Presentation of data
4. Analysis of data
5. Interpretation of data
6. Forecasting of data

We conclude the following:

When used in the sense of data, `statistics' are
numerical statements of facts, capable of further analysis and
interpretation; when used in the sense of a science, `statistics'
is concerned with the principles and methods used in the
collection, presentation, analysis and interpretation of
numerical data in a sphere of enquiry.
Statistical methods are growing in popularity and are
being widely used in every branch of knowledge. But they
cannot be applied to all kinds of phenomena and cannot
answer all our doubts. They also suffer from various
limitations.

LIMITATIONS OF STATISTICS
1. Does not deal with individual facts
2. Ignores the qualitative aspects
3. Is not an end in itself
4. Can be misused
5. Good understanding is required

COLLECTION OF DATA
A sound structure of statistical investigation is based on a
systematic collection of data. Data are generally classified in two
groups, viz.
(a) internal data and
(b) external data.
Internal data come from the internal records of a business
firm: records of operations, production, purchase and the
accounting system. They are generally associated with the
organizational and functional activities of the firm. Internal
data can be either insufficient or inappropriate for the
problem under investigation, in which case we need external
data to make decisions. External data are collected and
published by agencies external to the enterprise. They can be
collected from either a primary or a secondary source.

Primary and Secondary Data


Primary data are data collected by the investigator himself
for the first time. In India there are various agencies which
collect primary data; the National Sample Survey is one of them.
Secondary data are data which have already been collected
by someone other than the present investigator.
We may collect data ourselves, and somebody else may later decide
to make use of them. The same data will then be primary data for us
but secondary data for the others who make use of them.
Similarly, in order to compare the cost of living in Delhi and
Bombay, we may decide to make use of the data published in `The
Economic Times'; here we will be making use of secondary data.

DISTINCTION BETWEEN PRIMARY AND SECONDARY DATA


1. Originality
   Primary data: It is original, because the investigator himself collects the data.
   Secondary data: It is not original. The investigator makes use of the data collected by other agencies.

2. Collection
   Primary data: It involves large expenses in terms of time, energy and money.
   Secondary data: It is relatively a less costly method.

3. Suitability
   Primary data: If the data has been collected in a systematic manner, its suitability will be positive.
   Secondary data: It may or may not suit the objects of enquiry.

4. Precautions
   Primary data: No extra precautions need be taken in making use of this data.
   Secondary data: It should be used with care.

Methods of collecting Primary data


I. Direct Personal Investigation
II. Indirect Oral Investigation
III. Information through correspondents, and
IV. The Questionnaire Method

Source of secondary data


The chief sources of secondary data can be classified
into two groups, viz.
(a) Published and
(b) Unpublished

Precautions in the use of secondary data
1. Whether the data are reliable? To judge the
reliability of the data, the integrity and experience of the
collecting organization, the purpose, the method of
collection, the degree of accuracy and any test-checking
must be ascertained.
2. Whether the data are suitable for the purpose?
3. Whether the data are adequate?

Tabular form of data


105   93   97  101  115  149  135  120  130  140
110   93  109  113   98  111  100  102  107  103
 90  142  111  108  102  109  107  119  113   96
120  135   91  110  117  104  105  120  114   92
110  120  102   92  114   99  112  107   99  100
115  115   90  136  110  106  123  109  114  109
117  114   98  106  110  104  134  109  127  113
119  113  116  124  123  110  136  132  116  108
121  112  141  109  116  109  141  117  134   98
 92  110  109  122  109   97   93  107  104  108
 87   89  121  111  110  103  114  113  150  156
104  117  114  110  121  107  106  114  142  114
120  112  116  109  111  113  114   98  113  112
121   99  109  123  111  116  104   99  109  117
109  109  110   97  105  102  109  101   97  103

Grouped Frequency Distribution Table


Class Interval     f
87-91
92-96
97-101            15
102-106           18
107-111           38
112-116           28
117-121           16
122-126            5
127-131            2
132-136            7
137-141            3
142-146            2
147-151            2
152-156            1

N = 150

Note that in computations involving a classified distribution, the
midpoint is used as a substitute for each score in the interval. For
this reason, we recommend choosing an odd number for the class width i
whenever possible. Nothing is sacred about this suggestion; it just makes
the midpoint a whole number of units, thus simplifying computation. With
i = 5, for example, the interval 87-91 has the midpoint 89.
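
The construction of a grouped distribution can be sketched in Python as follows. This is only an illustrative sketch: the helper name `grouped_frequencies` is hypothetical, the sample is just ten scores taken from the raw data, and the class width i = 5 and lowest limit 87 are borrowed from the grouped table in these notes.

```python
def grouped_frequencies(scores, start, width):
    """Count integer scores per class interval and report each class midpoint.

    Classes run start..start+width-1, start+width..start+2*width-1, and so on,
    until the largest score is covered.
    """
    top = max(scores)
    table = []
    lower = start
    while lower <= top:
        upper = lower + width - 1
        freq = sum(1 for s in scores if lower <= s <= upper)
        midpoint = (lower + upper) / 2   # whole number when width is odd
        table.append((lower, upper, midpoint, freq))
        lower += width
    return table

# Illustrative sample only: ten scores from the raw data.
sample = [105, 93, 97, 101, 115, 149, 135, 120, 130, 140]
rows = grouped_frequencies(sample, start=87, width=5)
for lower, upper, midpoint, freq in rows:
    print(f"{lower}-{upper}  midpoint={midpoint:g}  f={freq}")
```

Because the width is odd, every midpoint (89, 94, 99, ...) is a whole number, which is the point made in the note.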

Cumulative Distribution Table


Class interval     f    Cum f    Cum %
87-91
92-96                      13        9
97-101            15       28       19
102-106           18       46       31
107-111           38       84       56
112-116           28      112       75
117-121           16      128       85
122-126            5      133       89
127-131            2      135       90
132-136            7      142       95
137-141            3      145       97
142-146            2      147       98
147-151            2      149       99
152-156            1      150      100

N = 150

The Cumulative Distribution


Arranging data into a cumulative distribution is really helpful. It
allows us to obtain the number (or the proportion) of cases in a distribution below
or above each class interval (or boundary).
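
A short Python sketch of how the cumulative frequency and cumulative percentage columns could be derived from the class frequencies. The function name `cumulative_table` and the five frequencies are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from itertools import accumulate

def cumulative_table(freqs):
    """Return (f, cumulative f, cumulative %) per class, lowest class first."""
    total = sum(freqs)
    cums = accumulate(freqs)          # running total of the frequencies
    return [(f, c, round(100 * c / total)) for f, c in zip(freqs, cums)]

# Hypothetical frequencies for five class intervals (N = 50).
freqs = [4, 11, 20, 10, 5]
table = cumulative_table(freqs)
for f, cum, pct in table:
    print(f, cum, pct)
```

Reading the third row, 35 of the 50 cases (70 per cent) fall at or below the third class interval.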

Graphic Techniques
There are always some people who would rather not read
tables, who could understand the information better if it were
presented in pictorial form.
Our prehistoric ancestors
undoubtedly knew this when they made the first cave drawings.
Similarly, the Egyptians, Greeks and Romans used drawings and
sculptures to convey information about their respective societies.
Thus, art was used to carry information throughout the ages. Art
is also valuable to us in describing information.
Graphs, the pictorial forms that follow, are not meant to
substitute for tabular construction. Rather they are meant as
visual aids that help us to describe and think about the shape
of the distribution. In fact, you cannot plan or construct a graph
until you have prepared the corresponding table. The graphic
forms shown here correspond to both qualitative and quantitative
distributions.

The Histogram
The histogram is the graphic equivalent of the grouped distribution
for interval-level data. It consists of a set of adjacent bars whose
heights are proportional to either the absolute frequencies or the
proportions of cases in each interval of the variable.
The most noticeable feature of the histogram is its structural
simplicity. Bars are understood more easily than numbers. The
histogram shows the relative concentration of data in each interval as
well as the shape of the distribution.

The Polygon
It is easy to convert a histogram into the much-used
polygon. All we need to do is to connect the midpoints of
the tops of the bars with straight lines.
Polygons are particularly useful when we wish to present a
comparison of two or more distributions on the same graph.
They do not blur their respective outlines, as histograms
do.

The Ogive
When a graph is used to present a cumulative
percentage distribution, it is called an ogive. The ogive
is constructed on a pair of perpendicular axes, just like
the polygon.
The horizontal axis represents the values of the
upper true limits of each class interval, and the vertical
axis indicates the cumulative percentage of observations.
A dot is then placed directly above the upper true limit
of each class interval, at whatever height is
appropriate, to indicate the percentage of cases less
than the upper true limit of the interval. After plotting
all interval values with their corresponding
percentages, the dots are joined by straight lines.
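
The points to be plotted for an ogive can be computed as in the sketch below. It assumes integer class limits, so that the upper true limit is the stated upper limit plus 0.5; the function name `ogive_points`, the classes and the frequencies are all hypothetical.

```python
from itertools import accumulate

def ogive_points(classes, freqs):
    """Return (upper true limit, cumulative %) pairs, lowest class first.

    classes: list of (lower, upper) stated integer limits.
    Assumption: upper true limit = stated upper limit + 0.5.
    """
    total = sum(freqs)
    points = []
    for (lower, upper), cum in zip(classes, accumulate(freqs)):
        points.append((upper + 0.5, 100 * cum / total))
    return points

# Hypothetical grouped distribution with N = 20.
classes = [(10, 14), (15, 19), (20, 24), (25, 29)]
freqs = [2, 8, 6, 4]
points = ogive_points(classes, freqs)
print(points)
```

Joining these points with straight lines gives the ogive; the last point always sits at 100 per cent.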

MEASUREMENTS
MEASURES OF CENTRAL TENDENCY
A measure of central tendency is a single figure that represents
the whole of a distribution. Individual observations in a
distribution generally show a tendency to concentrate at
certain values, usually somewhere in the centre of the
distribution. Thus, we talk of the average per capita income of
India, the average size of holdings in India, the average
productivity of labour in India, the average cost of production
of cloth, the average life of an Indian, etc.
Three important measures of central tendency are mean,
median and mode.

Arithmetic Mean
Arithmetic mean, or simply known as
`mean', is the most commonly used of all
averages, e.g., we frequently talk of average
monthly income, average monthly expenditure,
average marks secured by the students, average
petrol consumption of car or scooter in a day,
average productivity per farm, average bonus
paid, etc.

Arithmetic mean is defined as the sum of the values of a group
of items divided by the number of items.

$$\bar{X} = \frac{\sum X}{N}$$
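
As a small illustration, the definition translates directly into Python. The function name `arithmetic_mean` is hypothetical; the wage figures are the daily wages used in the Mode example of these notes.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

# Daily wages in Rs. (series from the Mode example in these notes).
wages = [80, 85, 86, 86, 86, 87, 89, 90]
mean_wage = arithmetic_mean(wages)
print(mean_wage)  # 689 / 8 = 86.125
```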

Median
The effect of an extreme value can be avoided if we
take a measure of central position in a given series. This
position measure is called the median.
Median is a value which divides the series into two
equal parts. Thus if we have the median value, the number
of items less than this value and the number of items more
than this value will be equal.
To get the median value, we arrange the items in ascending
order and make use of the following formula:
$$M = \text{size of the } \left(\frac{N+1}{2}\right)\text{th item}$$
where M stands for median, and N for the number of items
in the series.
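
The positional rule can be sketched as below. When N is even, (N+1)/2 is not a whole number; averaging the two middle items in that case is the usual convention and is an assumption here, since the notes do not spell it out. The function name and the income figures are hypothetical.

```python
def median_by_position(values):
    """Median as the size of the ((N + 1) / 2)th item of the ordered series."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position in the ordered series
    lower = ordered[int(position) - 1]
    if position == int(position):        # N odd: a single middle item
        return lower
    upper = ordered[int(position)]       # N even: average the two middle items
    return (lower + upper) / 2

# Hypothetical incomes with one extreme value.
incomes = [10, 12, 14, 16, 200]
median_income = median_by_position(incomes)
mean_income = sum(incomes) / len(incomes)
print(median_income, mean_income)
```

The extreme value 200 pulls the mean up to 50.4, while the median stays at 14, the 3rd item.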

Arithmetic mean is a good measure of central tendency when
we are interested in finding the average value of any variate,
e.g., average revenue, average cost, average productivity etc.
Median is a good measure when the spread of items is greater
on one side of the distribution. Median is also
useful in those cases where the items are not capable of
measurement in definite units, e.g. qualities like intelligence,
health etc.

Mode
A third important measure of central tendency is
called mode, which is denoted as Z. Mode is the most
common value found in a series.
For example, the daily wages of labourers employed in
Defence Colony are Rs. 80, 85, 86, 86, 86, 87, 89, 90. The
modal wage will be Rs.86 because it is most commonly
found or it occurs most frequently.
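
The wage example can be checked with a few lines of Python. The helper name `modes` is hypothetical; it returns a list because a series may have more than one most-frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [v for v, c in counts.items() if c == highest]

wages = [80, 85, 86, 86, 86, 87, 89, 90]   # daily wages in Rs.
modal_wages = modes(wages)
print(modal_wages)  # [86]
```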

Relationship between Mean, Median and Mode


Mean, median and mode each have a distinct role in statistical
analysis. In no case can they be substituted for one another. In a
moderately asymmetrical distribution, the following approximate
relationship exists:

Mode = 3 Median - 2 Mean
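
As a rough numerical check, the relationship can be applied to the daily-wage series used in the Mode example (80, 85, 86, 86, 86, 87, 89, 90), whose mean is 86.125 and whose median is 86. The function name `empirical_mode` is hypothetical.

```python
def empirical_mode(mean, median):
    """Approximate mode from the relationship Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean

estimate = empirical_mode(mean=86.125, median=86)
print(estimate)  # 258 - 172.25 = 85.75
```

The estimate of 85.75 is close to, but not identical with, the observed modal wage of Rs. 86, which is what one would expect from an approximate rule.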


Comparative Evaluation of Characteristics of Mean, Median and mode
S.No.  Characteristic                                      Mean   Median   Mode
1.     It is rigidly defined                               Yes    Yes      No
2.     It is situated in the centre of the distribution    No     Yes      Yes
3.     It is easily understandable                         Yes    Yes      Yes
4.     Its calculation is easy                             Yes    Yes      No
5.     It is based on all the observations                 Yes    No       No
6.     It is capable of further mathematical treatment     Yes    No       No
7.     It is affected by the choice of sample              Yes    No       No
8.     It is affected by extreme values                    Yes    No       No
9.     It can be represented graphically                   No     Yes      Yes

MEAN DEVIATION
Mean deviation shows the scatter around an average.
It is like measuring the scatter of the population of a city.
Some people live close to the centre of the city and others at
varying distances. Their average distance from the centre
indicates how scattered or dispersed they are.
Mean deviation is defined as the average or mean of
the deviations of the values from a measure of central tendency,
ignoring the signs of the deviations. The measure used can be
either the arithmetic mean or the median. Here we take the mean
for the calculation of mean deviation.

$$\text{M.D.} = \frac{\sum |d_x|}{N}$$
where d_x is the deviation of each value from the mean and N is the number of items.

Coefficient of Mean Deviation = M.D. / Mean (or M.D. / Median when the deviations are taken from the median)
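
A minimal Python sketch of the calculation, taking deviations from the arithmetic mean and ignoring their signs. Dividing by the same average from which the deviations were taken is an assumption about the coefficient; in the hypothetical sample below the mean and the median coincide (both 30), so the result is the same either way. Function names are illustrative.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

def coefficient_of_mean_deviation(values):
    """Mean deviation divided by the average it was measured from (the mean)."""
    mean = sum(values) / len(values)
    return mean_deviation(values) / mean

data = [10, 20, 30, 40, 50]          # hypothetical, symmetric about 30
md = mean_deviation(data)            # (20 + 10 + 0 + 10 + 20) / 5
coeff = coefficient_of_mean_deviation(data)
print(md, coeff)
```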

STANDARD DEVIATION
It is another related measure of variation. In mean
deviation we take the sum of the deviations after ignoring their
plus and minus signs. In standard deviation we achieve the same
effect in another way: we square all the deviations, and the
squared deviations are never negative.
Standard deviation is the square root of the arithmetic mean of
the squared deviations. It is generally denoted by \sigma
(read `sigma').

$$\sigma = \sqrt{\frac{\sum d_x^2}{N}}$$
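
The formula can be sketched in Python as follows, with deviations taken from the arithmetic mean and the sum of squares divided by N (not N - 1), as in the definition above. The function name and the data are hypothetical.

```python
import math

def standard_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical; mean is 5
sigma = standard_deviation(data)
print(sigma)  # squared deviations sum to 32, 32 / 8 = 4, root = 2.0
```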

CORRELATION
Measures of central tendency, dispersion and skewness
describe the nature of a distribution relating to a single variable.
One may also be interested in studying the relationship between two
or more variables, e.g. income and consumption, price and
demand, or quantity of input and output; productivity and wages
also depend upon each other.
Two variables may be positively related or negatively
related:

price and supply are positively correlated.
price and demand are negatively correlated.
price index and dearness allowance are positively correlated.
strikes and rate of production are negatively correlated.

Methods of Measuring Correlation


It is not sufficient merely to know that correlation exists
between two variables; it is also necessary to
quantify the extent of the correlation. For the first purpose we
make use of scatter diagrams, and for the second
we need to define the value of the co-efficient of correlation.
Scatter Diagram :
A simple measure of correlation between two
variables is obtained by the use of scatter diagrams.
Values of the independent variable are measured on X-axis
in a graph, and values of the dependent variable are
measured on Y-axis. The two values are then plotted in the
graph in the form of dots. When every dot representing a
pair of figures has been plotted, we get a scatter diagram.

Coefficient of Correlation
The mathematical technique which describes the
covariance in ratio terms is known as co-efficient of correlation.
The co-efficient of correlation was initially conceived by
statistician, Karl Pearson.
Karl Pearson's coefficient of
correlation (also known as product-moment co-efficient)
generally denoted by `r' is expressed as follows :
$$r = \frac{\sum d_x d_y}{N \sigma_x \sigma_y}$$

where
\sum d_x d_y is the sum of the products of deviations of
respective observations in the x and y series,
N is the number of items,
\sigma_x is the standard deviation of the x series, and
\sigma_y is the standard deviation of the y series.
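
A Python sketch of the product-moment formula, assuming the deviations are measured from the mean of each series and the standard deviations divide by N. The function name `pearson_r` and the price, supply and demand figures are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """r = sum(dx * dy) / (N * sigma_x * sigma_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]        # deviations of the x series
    dy = [y - mean_y for y in ys]        # deviations of the y series
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)

price = [1, 2, 3, 4, 5]
supply = [2, 4, 6, 8, 10]     # rises in step with price
demand = [10, 8, 6, 4, 2]     # falls in step with price
r_supply = pearson_r(price, supply)   # close to +1
r_demand = pearson_r(price, demand)   # close to -1
r_mixed = pearson_r(price, [2, 4, 5, 4, 5])
print(r_supply, r_demand, r_mixed)
```

The last pair is not perfectly linear; its r works out to 6 / sqrt(60), roughly 0.77.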

The value of r determines the degree of correlation
between two variables.
r always lies between minus one and plus one.

Value of r       Degree of correlation between two variables
-1               Perfectly negative
+1               Perfectly positive
0                No relation
0.10 to 0.25     Low degree of correlation
0.30 to 0.55     Moderate correlation
0.60 to 0.99     High correlation

If the sign before r is minus, the correlation is negative,
and if the sign is plus, the correlation is positive.

RANK CORRELATION
Prof. Charles Spearman has conceived another coefficient of correlation.
This co-efficient is expressed as R and is based on the ranking
of the various values of the two variables.

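$$R = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$$
where D is the difference between the two ranks of each item and N is the number of items.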
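
A sketch of the rank formula in Python. Here D is taken as the difference between the two ranks of each item, rank 1 goes to the largest value, and tied values are assumed not to occur; the helper names and the two sets of marks are hypothetical.

```python
def ranks(values):
    """Rank 1 for the largest value; assumes there are no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """R = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = difference in ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical marks given to five candidates by two judges.
judge_a = [10, 20, 30, 40, 50]
judge_b = [12, 25, 22, 48, 41]
rank_r = spearman_r(judge_a, judge_b)
print(ranks(judge_a), ranks(judge_b), rank_r)
```

The ranks are 5, 4, 3, 2, 1 and 5, 3, 4, 1, 2, so the squared differences sum to 4 and R is about 1 - 24/120 = 0.8.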

REGRESSION
The term regression was first used by Sir Francis Galton
in his studies of the inheritance of stature. He, along with his
friend Karl Pearson, studied the heights of 1,078 sons along with
the heights of their fathers. It was found that tall fathers
tend to have tall sons and short fathers tend to have short sons,
but the average height of the sons of tall fathers was less than the
average height of their fathers, and the average height of the sons
of short fathers was more than the average height of their fathers.
Galton named this tendency `regression'.
Regression is used to explain the value of one variable with respect
to the value of another variable. It explains the functional relationship
between the two variables.
The relationship is explained with the help of regression lines.
Regression Lines
The line which shows the functional relationship between the two
variables is known as the ' line of best fit '. Since there are two variables,
X and Y, therefore, there are two regression lines.
Regression line of X on Y explains the functional relationship of X when
the value of Y variable is given, whereas, the regression line of Y on X
explains the functional relationship of Y when the value of X variable is
given.

Regression lines and the Coefficient of Correlation


The regression lines help in estimating the nature and the type of
correlation between the two variables.
If the two lines of regression coincide, the correlation is said to be
perfect. If the lines intersect at right angles, there is no
correlation at all.
The slope of the lines determines the nature of the correlation: if the
slope of the lines is positive the correlation is positive, and vice versa.
The degree of correlation can be ascertained from the angle formed
by the two lines: the smaller the angle, the higher the degree of correlation.

Regression equations
The regression equations explain the functional relationship
between the two variables. As there are two regression lines, there are
two regression equations.

i) Regression equation of X on Y:
In this equation the probable values of X are estimated with the
help of independent variable Y. Plotting these values on the graph paper
we get the line known as regression of X on Y.

ii) Regression equation of Y on X:


This equation is used in order to estimate the values of Y when
the values of X are given, here the values of Y are dependent on the
values of X. The line showing this relationship is known as regression
line of Y on X.

Method of Least Squares

This method is the most useful technique of estimation. It
gives the best linear unbiased estimate. The values of the two
unknown constants are determined with the help of two normal
equations.

Regression equation of X on Y:   X = a + bY
Regression equation of Y on X:   Y = a + bX

The values of the constants `a' and `b' can be estimated from
the following normal equations:

Regression of X on Y:
$$\sum X = Na + b \sum Y$$
$$\sum XY = a \sum Y + b \sum Y^2$$

Regression of Y on X:
$$\sum Y = Na + b \sum X$$
$$\sum XY = a \sum X + b \sum X^2$$
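
The two normal equations for the regression of Y on X can be solved for a and b in closed form, as in the sketch below; swapping the arguments gives the regression of X on Y. The function name `fit_line` and the five pairs of values are hypothetical.

```python
def fit_line(xs, ys):
    """Solve sum(Y) = N*a + b*sum(X) and sum(XY) = a*sum(X) + b*sum(X^2)
    for the constants of Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a_yx, b_yx = fit_line(xs, ys)   # regression of Y on X: about Y = 2.2 + 0.6 X
a_xy, b_xy = fit_line(ys, xs)   # regression of X on Y: about X = -1 + 1.0 Y
print(a_yx, b_yx, a_xy, b_xy)
```

For these figures the product of the two slopes is about 0.6, which is the square of the correlation coefficient of the same data (r is roughly 0.77).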

MULTIVARIATES
When you study a single variable, the case is called a
univariate case. When there are two variables, it is called a
bivariate case. When there are more than two variables, it
is called a multivariate case.
Suppose there are n variables. Then the parameters
which we studied earlier, such as mean, variance, covariance and
correlation, now become
a mean vector,
a variance-covariance matrix, and
a correlation matrix.
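
A plain-Python sketch of these three objects for a small data set. Each row is one observation of three variables; variances and covariances divide by N, in keeping with the standard deviation formula in these notes. All function names and figures are hypothetical.

```python
import math

def mean_vector(rows):
    """One mean per variable (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Variance-covariance matrix, dividing by N."""
    n = len(rows)
    means = mean_vector(rows)
    cols = list(zip(*rows))
    k = len(cols)
    return [[sum((cols[i][t] - means[i]) * (cols[j][t] - means[j])
                 for t in range(n)) / n
             for j in range(k)]
            for i in range(k)]

def correlation_matrix(rows):
    """Each covariance divided by the product of the two standard deviations."""
    cov = covariance_matrix(rows)
    k = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(k)]
            for i in range(k)]

# Four hypothetical observations of three variables.
rows = [(1, 2, 8), (2, 4, 6), (3, 6, 4), (4, 8, 2)]
means = mean_vector(rows)
cov = covariance_matrix(rows)
corr = correlation_matrix(rows)
print(means)
print(cov)
print(corr)
```

The diagonal of the variance-covariance matrix holds the variances of the individual variables, and the diagonal of the correlation matrix is always 1.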
