STATISTICS AND QUANTITATIVE MODELLING FOR FINANCE
AND ACCOUNTING


JOMO KENYATTA UNIVERSITY OF
AGRICULTURE & TECHNOLOGY

SCHOOL OF OPEN, DISTANCE & eLEARNING
IN COLLABORATION WITH
DEPARTMENT OF INFORMATION TECHNOLOGY

FINANCE AND ACCOUNTING

J. Okelo
(masenooj@gmail.com)
P.O. Box 62000, 00200
Nairobi, Kenya

HBAF 3105: STATISTICS AND QUANTITATIVE MODELLING FOR FINANCE AND ACCOUNTING

Course description

History of statistics; use and abuse of statistics; measures of central location: mean, median, mode; measures of dispersion: range, standard deviation, variance, quartiles, skewness, kurtosis; variables: qualitative, quantitative, discrete and continuous variables; normal distribution; standard normal distribution; Z-distribution; t-distribution; F-distribution; chi-square distribution; hypothesis testing; inferential statistics; correlation analysis; regression analysis; linear simple and multiple regressions; dummy variables; binary Logit and Probit models; index numbers; simple index numbers: aggregative indexes, weighted aggregative indexes, Laspeyres Index, Paasche Index; Consumer Price Index; time series analysis: components of time series analysis, estimation of trends; computer application in statistical data processing and analysis.

Course aims

This course is intended to expose educators to the discipline of statistics. It will

mainly deal with applied statistics to enable the students to appreciate the use of

statistics in social and applied sciences in general and data analysis in business

studies in particular.

Learning outcomes

Upon completion of this course you should be able to:

1. Apply statistical methods to data presentation, processing and analysis.

2. Use statistical methods in research and business analysis.

3. Describe the role of statistics in research and business analysis.

4. Apply Statistical techniques in Research.

Instruction methodology

Lectures and tutorials, Online lectures with self study materials, Case studies,

Group discussions/online blogs and forums


Instructional Materials/Equipment

Writing board and writers, Computers, Statistical software

Assessment information

The module will be assessed as follows:

40% Continuous Assessment (Tests 10%, Assignment 10%, Practical 20%)

60% End of Semester Examination.

Course Text books

1. Mason, R. D., Lind, D. A. and Marchal, W. G. (1999). Statistical Techniques in Business and Economics. Irwin McGraw-Hill, Boston. ISBN-10:

0256263078, ISBN-13: 978-0256263077, Edition: 10th

2. Douglas Lind, William Marchal, Samuel Wathen (2009). Statistical Techniques in Business and Economics with Student CD [Hardcover], ISBN-10:

0077309421, ISBN-13: 978-0077309428, Edition: 14

3. Thomas H. Wonnacott, Ronald J. Wonnacott (1990) Introductory Statistics

for Business and Economics, 4th Edition, John Wiley and Sons Inc. [Hardcover] ISBN-10: 047161517X , ISBN-13: 978-0471615170

Reference Text books

1. Robert D. Mason, Douglas A. Lind, William G. Marchal (1998). Statistics:

An Introduction, Duxbury Pr; 5 Sub edition (1998) ISBN-10: 0534353797

ISBN-13: 978-0534353797

2. Fruend, J.E. and Williams, F.J. (1979). Modern Business Statistics. Pitman

Publishing Limited, London. ISBN 10: 0135895804 0-13-589580-4, ISBN

13: 9780135895801

3. Spiegel, M.R. (1992). Theory and Problems of Statistics, 2nd Edition, Schaum's

Outline Series, McGraw-Hill Book Company, London, ISBN 0071128204.


Course Journals

1. Journal of Quantitative Methods for Economics and Business Administration

ISSN: 1886-516X, D.L.: SE-2927-06.

2. Journal of Applied Statistics J Appl Stat. Published/Hosted by Taylor and

Francis Group, ISSN (printed): 0266-4763. ISSN (electronic): 1360-0532.

3. Scandinavian Journal of Statistics, Online ISSN: 1467-9469

4. Advances in Data Analysis and Classification, ISSN Print: 1862-5347 ISSN

Online: 1862-5355

5. Annals of the Institute of Statistical Mathematics, Executive Editor: T. Higuchi,

ISSN Print: 0020-3157 ISSN Online: 1572-9052

6. International Journal of Statistics and Probability ISSN 1927-7032(Print)


Contents

1 Introduction
  1.1 Definition of Statistics
    1.1.1 Who Uses Statistics?
    1.1.2 Limitations of Statistics
  1.2 Data types
  1.3 Types of Statistics
  1.4 Finite populations
    1.4.1 Simple random sample
    1.4.2 Sampling from a population

2 Measures of central tendency
  2.0.3 Arithmetic mean
  2.0.4 The Geometric mean
  2.0.5 Harmonic mean
  2.0.6 P-tiles

3 Probability Distributions
  3.1 Learning outcomes
  3.2 Probability distributions
  3.3 Discrete Probability distributions
    3.3.1 Expectation of a random variable
    3.3.2 Bernoulli Distribution
    3.3.3 Binomial Distribution
    3.3.4 Poisson Distribution
    3.3.5 Geometric Distribution
    3.3.6 Negative Binomial Distribution
  3.4 Continuous Distributions
    3.4.1 Uniform Distribution

4 Normal Distribution
  4.1 Introduction
  4.2 Description
    4.2.1 Functional form
  4.3 The Standard normal Distribution

5 Tests of Hypothesis 1
  5.1 Introduction
  5.2 Parametric Tests
    5.2.1 Z Test for Two Means
    5.2.2 The t-Test

6 Hypothesis Testing 2
  6.1 Introduction
    6.1.1 Analysis of Variance (ANOVA)
    6.1.2 Techniques of One-way ANOVA

7 Correlation and Regression
  7.1 Introduction
  7.2 Test of relationships involving quantitative data
    7.2.1 Pearson's product-moment correlation coefficient
    7.2.2 Spearman's rank correlation coefficient
  7.3 Linear Regression
    7.3.1 Multiple regression with dummy variables
    7.3.2 Dealing with Interaction terms

8 Binary Logit and Probit Models
  8.1 A linear model for proportions?
    8.1.1 Logistic curve: A curve that lies between 0 and 1 for all values of X
    8.1.2 The parameters of the logistic curve
    8.1.3 Multiple logistic regression
  8.2 Probit Model
  8.3 Cobb-Douglas functional form of production functions
    8.3.1 Formulation
    8.3.2 Application

9 Index numbers
  9.1 Index numbers
    9.1.1 Price and quantity indices
    9.1.2 CPI and stock market indices
  9.2 Simple price index
  9.3 Aggregate price
    9.3.1 Unweighted aggregate price index
  9.4 Laspeyres and Paasche indices
    9.4.1 Laspeyres index
    9.4.2 Paasche index
    9.4.3 Fisher's Ideal Index
  9.5 Deflating a time series
    9.5.1 Correcting for inflation

10 Time Series Analysis
  10.1 Time series data
  10.2 Types of time series data
  10.3 Components of a time series
    10.3.1 Trend
    10.3.2 Cyclic Movements
    10.3.3 Seasonal Movements
    10.3.4 Random or irregular fluctuations
  10.4 Smoothing of a time series
    10.4.1 Moving average with odd and even run lengths
    10.4.2 Robust smoothing
    10.4.3 Running medians, followed by moving averages
    10.4.4 Limitations of moving averages
  10.5 Long-term trend and Forecasting
    10.5.1 Least squares for a polynomial fit
    10.5.2 Exponential Trend

  11.1 The Method Of Lagrange Multipliers
  11.2 Models involving differential equations
    11.2.1 Unrestricted growth Models
    11.2.2 Restricted growth models
    11.2.3 Restricted Growth Models
  Solutions to Exercises


LESSON 1

Introduction

Learning outcomes

Upon completion of this lesson you should be able to:

1. Define statistics and describe various uses

2. Give a brief history of statistics and identify some limitations of statistics

3. Distinguish between various variable types

4. Distinguish between descriptive and inferential statistics

5. Describe the concept behind sample, population and sampling error

1.1. Definition of Statistics

As a plural noun, the word statistics describes a collection of numerical data, such as employment statistics, accident statistics, population statistics, statistics of births and deaths, of income and expenditure, of exports and imports, etc. It is in this sense that the word statistics is used by a layman or a newspaper. As a singular noun, statistics is a discipline whose purpose is to develop and apply methodology for extracting useful knowledge from both experiments and survey data. Major activities in statistics involve:

Design of experiments and surveys

Exploration and visualization of sample data

Summary description of sample data

Stochastic modeling of uncertainty

Forecasting based on suitable models

Hypothesis testing and statistical inference

Development of new statistical theory and methods

Generally, statistics can be defined as a branch of science that deals with the collection, presentation, analysis, and interpretation of data. This definition clearly points out four stages in a statistical investigation, namely:


1. Collection of data

2. Presentation of data

3. Analysis of data

4. Interpretation of analyzed data

The development of statistics was strongly motivated by the need to make sense

of the large amount of data collected by population surveys in the emerging nation

states of Europe. At the same time, the mathematical foundations for statistics

advanced significantly due to breakthroughs in probability theory inspired by games

of chance (gambling). For more information about the history of statistics refer to

the books by Johnson and Kotz (1998) and Kotz and Johnson (1993). The various

methods used in statistical investigations are termed as statistical methods and the

person using them is known as a statistician. A statistician is concerned with the

analysis and interpretation of the data and drawing valid worthwhile conclusions

from them for decision making. For example:

A shoe factory will be interested in the most common shoe sizes in order to

make a decision on the production process.

The Ministry of Education will be interested in the trend in the number of

pupils starting each level of education in order to make decisions related to

building of schools, training of teachers, etc.

The latest sales data have just come in, and your boss wants you to prepare

a report for management on places where the company could improve its

business. What should you look for? What should you not look for?

1.1.1. Who Uses Statistics?

Statistical techniques are used extensively by marketing, accounting, quality control, consumers, professional sports people, hospital administrators, educators, politicians, physicians, etc...

Uses of Statistics

1. To present the data in a concise and definite form: Statistics helps in classifying and tabulating raw data for processing and further tabulation for end

users.


2. To make it easy to understand complex and large data: This is done by presenting the data in the form of tables, graphs, diagrams etc., or by condensing

the data with the help of means, dispersion etc.

3. For comparison: Tables, measures of means and dispersion can help in comparing different sets of data.

4. In forming policies: It helps in forming policies like a production schedule,

based on the relevant sales figures. It is used in forecasting future demands.

5. In measuring the magnitude of a phenomenon:- Statistics has made it possible

to count the population of a country, the industrial growth, the agricultural

growth, the educational level (of course in numbers).

1.1.2. Limitations of Statistics

Statistics has its limitations; to mention a few:

In most statistical investigations, we use samples to represent a population, with the number of data points collected in the sample depending on the resources available. A sample may not represent the population adequately, and this may lead to results with little or no relevance to the population from which it came.

Results based on data with strong departures from assumptions such as normality will be less reliable than results from data that meet the assumptions of a statistical test.

It is possible to lie (or to make mistakes) by ignoring some key statistical principles, e.g. correlation does not imply causation:

In the recent past, the number of deaths in Nairobi has increased in proportion to the number of crimes in Nairobi.

Young children who sleep with the light on are much more likely to develop myopia in later life. This is a recent scientific example that resulted from a study at the University of Pennsylvania Medical Center. Published in the May 13, 1999 issue of Nature, the study received much coverage at the time in the popular press. However, a later study at Ohio State University did not find a link between infants sleeping with the light on and the development of myopia. It did find a strong link between parental myopia and the development of child myopia, also noting that myopic parents were more likely to leave a light on in their children's bedroom.

Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headaches.

In hypothesis testing, the p-value or "probability value" tells us how likely the observed result would be if the null hypothesis were true. For example, when comparing the means of several groups, the p-value is the probability that differences at least as large as those observed would occur by chance alone (i.e., if no real difference exists in the population). We then use the reverse logic that if differences this large would occur by chance so seldom (typically when p < 5% or 0.05), real differences must exist in the population. This has serious implications for what you say about the hypothesis you accept:

Accepting a null hypothesis does not mean that the samples are the same

or that there is no relationship. It is just that the evidence in the sample

is not strong enough to support the opposite.

By accepting an alternative hypothesis at the 5% level of significance you can say that if 100 similar surveys were done, 95 of them would show a difference (that is, only 5 out of 100 surveys would be expected NOT to differ).
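The "differences occur only by chance" logic can be demonstrated with a small permutation test, sketched below in plain Python. The two groups and their values are invented for illustration: we repeatedly shuffle the pooled data and count how often a mean difference at least as large as the observed one appears by chance.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Two hypothetical groups of measurements (illustrative values only).
group_a = [12, 15, 14, 16, 13, 15]
group_b = [10, 11, 13, 9, 12, 11]
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

pooled = group_a + group_b
trials, count = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)                    # relabel the groups at random
    a, b = pooled[:6], pooled[6:]
    if sum(a) / 6 - sum(b) / 6 >= observed:   # chance difference as large as observed?
        count += 1
p_value = count / trials                      # estimated one-sided p-value
print(f"observed difference = {observed:.2f}, estimated p = {p_value:.4f}")
```

A small p means a difference this large would rarely arise by chance alone, which is exactly the reverse logic described above.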

1.2. Data types

Variable

A variable is a characteristic of an item or individual. It is simply something that varies, or doesn't always have the same value, such as date of birth, age, marks, or district, as you move from one subject to another.

Data

Data are the different values associated with a variable.

Operational definitions

Data values are meaningless unless their variables have operational definitions: universally accepted meanings that are clear to all associated with an analysis. The processing of the data depends on the nature of the variable on which data is collected. Variables can be classified as follows:


1. Qualitative: refers to variables whose values fall into groups or categories. They are also called categorical variables because the data they carry describe categories (e.g. district, marital status, gender, religious affiliation, type of car owned). They can further be classified as:

Nominal variables: variables whose categories are just names with no natural ordering, e.g. gender, colour, district, marital status, etc.

Ordinal variables: variables whose categories have a natural ordering, e.g. education level, degree classifications, etc. In a variable such as performance, the category Excellent is better than the category Very good, which is better than Good.

2. Quantitative: numerical variables (e.g. number of students, age, weight, distance, etc.). They can further be classified as:

Discrete variables: can only assume certain values, with gaps between them, e.g. the number of bedrooms in a house, the number of children in a family, etc. In most cases they arise from counting.

Continuous variables: can assume any value within a specific range, e.g. the time taken to cook ugali, the height of a tree, your age, the distance from here to Nairobi, etc. In most cases, such data arise from measurements.

1.3. Types of Statistics

Descriptive Statistics: a field that focuses on describing different characteristics of the data rather than trying to infer something from it. It is a body of methods for organizing, summarizing, and presenting sample data in an informative way.

A Steadman poll found that 41% of Kenyans would vote for Candidate A in

the last general election. The statistic 41 describes the number out of every

100 persons who were interviewed.

According to Consumer Reports, Whirlpool washing machine owners reported 9 problems per 100 machines during 1995. The statistic 9 describes

the number of problems out of every 100 machines.


Inferential Statistics: a body of methods which tries to infer or reach conclusions about the population based on scientifically sampled data. The calculated summaries from the sample are used for estimation, prediction, or generalization about the population from which the sample was taken.

TV networks constantly monitor the popularity of their programs by hiring

pollsters to sample the preferences of TV viewers.

The JKUAT accounting department normally selects a sample of the payment

vouchers to check for accuracy for all the payment vouchers.

Most data sets contain one or more measurements from each of a collection of

individuals (or other units). The measurements of interest usually vary in ways that

cannot be explained in terms of other measurements from the individuals. This

unexplained variability can be modelled by considering the data to be a random

sample from some underlying population.
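The distinction can be made concrete with a short Python sketch using only the standard library (the twelve invoice amounts are invented for illustration): descriptive statistics summarize the sample itself, while an inferential step, here a rough 95% confidence interval for the population mean, generalizes beyond it.

```python
import math
import statistics

# Hypothetical sample of 12 invoice amounts (illustrative data only).
sample = [210, 185, 240, 198, 260, 175, 220, 205, 230, 190, 215, 250]

# Descriptive statistics: describe the sample we actually observed.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)
print(f"mean = {mean:.2f}, sd = {sd:.2f}, median = {statistics.median(sample)}")

# Inferential statistics: estimate the population mean from the sample.
# Rough 95% confidence interval using the normal critical value z = 1.96.
half_width = 1.96 * sd / math.sqrt(len(sample))
print(f"95% CI for the population mean: ({mean - half_width:.2f}, {mean + half_width:.2f})")
```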

1.4. Finite populations

A sample provides information about a population when it is too difficult or expensive to make measurements from the whole population. We often want to find

information about a particular group of individuals (people, fields, trees, and bottles

of beer or some other collection of items). This target group is called the population. Collecting measurements from every item in the population is called a census.

A census is rarely feasible, because of the cost and time involved.

1.4.1. Simple random sample

We can usually obtain sufficiently accurate information by only collecting information from a selection of units from the population - a sample. Although a sample

gives less accurate information than a census, the savings in cost and time often outweigh this. The simplest way to select a representative sample is a simple random

sample. In it, each unit has the same chance of being selected and some random

mechanism is used to determine whether any particular unit is included in the sample.


1.4.2. Sampling from a population

It is convenient to define the population and sample to be sets of values or measurements (rather than people or other items). This abstraction - a population of

values and a corresponding sample of values - can be applied to a wide range of

applications.

Effect of sample size

Bigger samples mean more stable and reliable information about the underlying

population. As the sample size is increased, the sampling error becomes smaller.

When a sample is used to estimate a population characteristic, an error is usually

involved. Sampling error is caused by random selection of the sample from the

population. The difference between an estimate and the population value being

estimated is called its sampling error. The cost savings from using a sample instead

of a full census can be huge.
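The effect of sample size can be seen in a small simulation. The sketch below (plain Python; the artificial population of 10,000 values is illustrative, not from the module) draws simple random samples of increasing size and reports the sampling error of the mean:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# An artificial population of 10,000 measurements (illustrative values only).
population = [random.gauss(50, 10) for _ in range(10_000)]
pop_mean = statistics.mean(population)

for n in (10, 100, 1000):
    # Simple random sample: every unit has the same chance of selection.
    sample = random.sample(population, n)
    # Sampling error: difference between the estimate and the population value.
    err = abs(statistics.mean(sample) - pop_mean)
    print(f"n = {n:4d}: sampling error of the mean = {err:.3f}")
```

On a typical run the error shrinks as n grows, illustrating that bigger samples give more stable and reliable estimates.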


Revision Questions

Example. Define the term statistics.

Solution: ...

Exercise 1. Discuss a case in real life where you think statistics was misused.

Exercise 2. Discuss how statistics led to the development of computer systems and how computer systems led to the development of statistics.

Exercise 3. Discuss the relative "weakness" of categorical variables (including measures on nominal and ordinal scales) and continuous variables (including measures on interval and ratio scales) with respect to the type of information that can be obtained from the statistics.

Suggested materials for further reading

1. Mason, R. D., Lind, D. A. and Marchal, W. G. (1999). Statistical Techniques

in Business and Economics. Irwin McGraw-Hill, Boston.

2. K. Pelosi and Theresa M. Sandifer (1976). Elementary Statistics. John Wiley

& Sons, Inc

3. Wonnacott, T.H. and Wonnacott, R.J. (1990). Introductory Statistics for Business and Economics, 2nd Edition, John Wiley and Sons Inc.

4. Gujarati, D.N. (2006). Basic Econometrics 3rd Edition, McGraw-Hill, Inc.,

New York.

5. Keller, G., Warrack, B. and Bartel, H. (1994). Statistics for Management and

Economics. 3rd Edition. Wadsworth Publishing Company, Belmont California, USA.


LESSON 2

Measures of central tendency

In most sets of data, there is a tendency of the observed values to cluster about some value. This phenomenon is called central tendency, and it can be measured using single numerical values which may be used to judge the entire distribution. The measures of central tendency include:

The Mean (Arithmetic, Geometric or Harmonic),
The Median, and
The Mode.

2.0.3. Arithmetic mean

The arithmetic mean (average) is the sum of all the observations divided by the total number of observations. For observations x1, x2, ..., xn the mean, denoted x̄, is defined as

x̄ = (x1 + x2 + ... + xn)/n = (1/n) Σ xi

If x1, x2, ..., xn have frequencies f1, f2, ..., fn, then the mean is defined by

x̄ = (f1 x1 + f2 x2 + ... + fn xn)/(f1 + f2 + ... + fn) = (1/N) Σ fi xi

where Σ is a Greek letter meaning sum and N = Σ fi.

Alternatively, we can also obtain the mean using an assumed mean, via the expression

x̄ = A + (Σ f d)/N

where A is the assumed mean and d is the deviation from the assumed mean, given by d = x - A.

Example
Find the mean of the data sets given below.
(i) 4, 6, 7, 8, 10, 5, 7, 12, 14, 6, 7

(ii)
Marks      1-5  6-10  11-15  16-20
Frequency   6     7      5      2

Solution
(i) x̄ = (4 + 6 + 7 + 8 + 10 + 5 + 7 + 12 + 14 + 6 + 7)/11 = 86/11 = 7.818182

(ii) Using the class midpoints 3, 8, 13, 18:
x̄ = (Σ f x)/N = (6(3) + 7(8) + 5(13) + 2(18))/20 = 175/20 = 8.75
Alternatively, if we let the assumed mean be A = 8, the deviations d = x - A are -5, 0, 5, 10, so Σ f d = -30 + 0 + 25 + 20 = 15 and
x̄ = A + (Σ f d)/N = 8 + 15/20 = 8 + 0.75 = 8.75

Remark 1. The assumed mean can be any value, but it is advisable to choose it from among the data values to make the computation of the mean easier.
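The worked example can be checked with a few lines of Python (standard library only; the grouped computation uses the class midpoints 3, 8, 13, 18 of the table above):

```python
import statistics

# (i) Ungrouped data from the example.
data = [4, 6, 7, 8, 10, 5, 7, 12, 14, 6, 7]
print(statistics.mean(data))   # 7.81818...

# (ii) Grouped data: class midpoints weighted by their frequencies.
midpoints = [3, 8, 13, 18]
freqs = [6, 7, 5, 2]
N = sum(freqs)
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / N
print(grouped_mean)            # 8.75

# Assumed-mean method: mean = A + (sum of f*d)/N, with d = x - A.
A = 8
assumed = A + sum(f * (x - A) for f, x in zip(freqs, midpoints)) / N
print(assumed)                 # 8.75
```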

2.0.4. The Geometric mean

The geometric mean is defined as the nth root of the product of the observations x1, x2, ..., xn. It is usually denoted by G and expressed as

G = (x1 x2 ... xn)^(1/n)

For the case where x1, x2, ..., xn have frequencies f1, f2, ..., fn, the geometric mean is expressed as

G = (x1^f1 x2^f2 ... xn^fn)^(1/N), where N = Σ fi

Taking logarithms on both sides of these expressions yields

log G = (1/n) Σ log xi, or, for grouped data, log G = (1/N) Σ fi log xi

Example. The growth rates of a textile unit in the western region of Kenya for the last five years are given below. Use them to calculate the geometric mean of the growth rate.

Year         1  2  3   4   5
Growth rate  7  8  10  12  18

Solution: The geometric mean is
G = (7 × 8 × 10 × 12 × 18)^(1/5) = 120960^(1/5) ≈ 10.39

Example. Find the geometric mean for the distribution given below:

Yield (Dividend/Market)  0-10  10-20  20-30  30-40
Number of Companies        5     15     25     35

Solution: Using the class midpoints 5, 15, 25, 35 with N = 80,
log G = (1/N) Σ f log x = (5 log 5 + 15 log 15 + 25 log 25 + 35 log 35)/80 ≈ 110.127/80 ≈ 1.3766
so G ≈ 10^1.3766 ≈ 23.8
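A minimal Python sketch of both computations (standard library only; it assumes the fifth year's growth rate is 18):

```python
import math

def geometric_mean(values):
    # nth root of the product, computed through logarithms for stability
    return math.exp(sum(math.log(v) for v in values) / len(values))

def grouped_geometric_mean(midpoints, freqs):
    # weight each class midpoint by its frequency: log G = (1/N) * sum(f * log x)
    N = sum(freqs)
    return math.exp(sum(f * math.log(x) for f, x in zip(freqs, midpoints)) / N)

print(round(geometric_mean([7, 8, 10, 12, 18]), 2))                        # 10.39
print(round(grouped_geometric_mean([5, 15, 25, 35], [5, 15, 25, 35]), 1))  # 23.8
```

Python 3.8+ also provides statistics.geometric_mean for the ungrouped case.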

2.0.5. Harmonic mean

The harmonic mean of a series of values is defined as the reciprocal of the mean of their reciprocals. Thus if H is the harmonic mean,

1/H = (1/n) Σ (1/xi)

so that

H = n / (1/x1 + 1/x2 + ... + 1/xn)

For grouped data,

H = N / Σ (fi/xi), where N = Σ fi

Example 1
Calculate the harmonic mean of the observations 4, 8, and 16.
Solution:
H = 3 / (1/4 + 1/8 + 1/16) = 3/0.4375 ≈ 6.857

Example 2
Find the harmonic mean for the distribution given below:

Yield (Dividend/Market)  2-4  4-6  6-8  8-10
Number of Companies      (total N = 100)

Solution: Using the class midpoints,
H = N / Σ (fi/xi) = 100/20.061 = 4.98

So the average dividend yield calculated by the harmonic mean formula is 4.98 percent.
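In Python the ungrouped case is built in, and the grouped form H = N / Σ(f/x) is a one-liner. A standard-library sketch (the grouped call uses illustrative midpoints and frequencies, not values from the module):

```python
import statistics

# Ungrouped: H = n / (1/x1 + 1/x2 + ... + 1/xn)
h = statistics.harmonic_mean([4, 8, 16])
print(round(h, 3))  # 6.857

def grouped_harmonic_mean(midpoints, freqs):
    # H = N / sum(f/x), with N = sum of the frequencies
    return sum(freqs) / sum(f / x for f, x in zip(freqs, midpoints))

# Illustrative grouped data: midpoints 3, 5, 7, 9 with equal frequencies.
print(round(grouped_harmonic_mean([3, 5, 7, 9], [25, 25, 25, 25]), 3))
```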

Median

The median is the value above which and below which half of the observations fall (if ranked in order of size). In other words, it is the midpoint of the values after they have been ordered from the smallest to the largest, or the largest to the smallest.

The procedure for finding the median for discrete data is: arrange the data in ascending order. If the number of observations n is odd, then the median is the value in position (n + 1)/2. If n is even, then the median is the average of the two middle values.

For grouped data, the median can be obtained by first locating the median class and then using interpolation within that class via the expression

Median (M) = L + ((N/2 - C)/f) × h

where L is the lower limit of the median class, h is the class interval, f is the frequency of the median class, N is the total frequency of all the observations, and C is the cumulative frequency of the classes before the median class.

Example 1
Find the median of the following values: 19, 13, 14, 18, 12, 25, 11, 10, 17, 23, where n = 10 (even). Arranging them in ascending order gives 10, 11, 12, 13, 14, 17, 18, 19, 23, 25, so the median is the average of the 5th and 6th values: (14 + 17)/2 = 15.5.

Example 2

The age of a sample of five college students is: 21, 25, 19, 20, and 22. Confirm that

the median is 21.

Example 3

The height of four basketball players, in inches, is 76, 73, 80, and 75. Confirm that

the median is 75.5

Example 4
Find the median for the following distribution.

Yield (Dividend/Market)  0-10  10-20  20-30  30-40  40-50
Number of Companies        22     38     46     35     20

Solution
Here the total frequency is N = 161 (an odd number), hence the median is the (N + 1)/2 = (161 + 1)/2 = 81st item. The cumulative frequencies are 22, 60, 106, 141, 161, so the 81st item lies in the 20-30 group; thus 20-30 is the median class, with lower limit L = 20, f = 46, C = 60 and h = 10.

Median (M) = L + ((81 - C)/f) × h = 20 + ((81 - 60)/46) × 10 = 20 + 210/46 = 24.5652

Example 5
The data below shows marks obtained by some students in a continuous assessment test; use it to calculate the median of the data.

Marks               1-5  6-10  11-15  16-20  21-25
Number of Students   12    32     46     35     20

Solution
N = 120, which is an even number; hence the median lies between the 60th item and the 61st item, both of which fall in the class 11-15, so we use the expression M = L + ((position - C)/f) × h to approximate the sizes of the 60th and 61st items, with L = 10.5, f = 30, C = 44 and h = 5.

The size of the 60th item: M60 = 10.5 + ((60 - 44)/30) × 5 = 13.1667
The size of the 61st item: M61 = 10.5 + ((61 - 44)/30) × 5 = 13.3333

So the median is the average of the 60th and 61st values: (13.1667 + 13.3333)/2 = 13.25
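Both the discrete and the grouped calculations above can be sketched in Python (standard library only; grouped_median is an assumed helper name that transcribes the interpolation formula, with the item position passed in explicitly as in the worked examples):

```python
import statistics

# Discrete data (Example 1): sort and take the middle value(s).
m = statistics.median([19, 13, 14, 18, 12, 25, 11, 10, 17, 23])
print(m)  # 15.5

def grouped_median(L, position, C, f, h):
    # L: lower limit of the median class, C: cumulative frequency before it,
    # f: frequency of the median class, h: class width.
    return L + (position - C) / f * h

# Example 4: 81st item in class 20-30 with L=20, C=60, f=46, h=10.
print(round(grouped_median(20, 81, 60, 46, 10), 4))  # 24.5652
```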

Mode

Mode is the value or item occurring most frequently in a set of observations or

statistical data. The mode may not exist, and if it does exist, it may not be unique.

If each observation occurs the same number of times, then there is no mode. If

two or more observations occur the same number of times then, two or more modes

exist and the distribution is called multimodal.

f2

For grouped data Mode can be obtained using the expression: Mode = L+ f1 + f2 h

f2

Alternatively Mode can also be found by the formula Mode= L+ ( fm ffm)+(

fm f2 ) h

1

Where L, is the lower class limit of the modal class with modal class being the class

with the highest frequency.

fm is the frequency of the modal class

f2 is the frequency succeeding the modal class

f1 is the frequency preceding the modal class and

h is the class interval of the modal class

Remark 2. For grouped data the mode may be estimated by first identifying the

modal class and then taking the midpoint of the class interval. This can also be

obtained from a histogram by taking the midpoint of the class interval with the

highest peak.


Example 1. For the values 9, 3, 4, 2, 1, 5, 8, 4, 7, 3, each of the values 3 and 4
occurs twice. The modes are therefore 3 and 4.

Example 2.

Calculate the mode of the following data set.

Gross profit as percentage   0-7   7-14   14-21   21-28   28-35   35-42   42-49
Number of companies          19    25     36      72      51      43      28

Solution: The highest frequency is 72, so the modal class is 21-28. The lower
class limit is L = 21, fm = 72, f1 = 36, f2 = 51 and h = 7, thus

Mode = 21 + ((72 - 36)/((72 - 36) + (72 - 51))) × 7 = 21 + (36/57) × 7 = 25.4211
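The second mode formula can be sketched in Python (names are illustrative); it assumes the modal class is not the first or last class:

```python
def grouped_mode(lower, freqs, h):
    """Mode = L + (fm - f1)/((fm - f1) + (fm - f2)) * h for the modal class."""
    i = freqs.index(max(freqs))          # modal class = class with highest frequency
    fm, f1, f2 = freqs[i], freqs[i - 1], freqs[i + 1]
    return lower[i] + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# Gross-profit classes 0-7, 7-14, ..., 42-49 from Example 2
print(round(grouped_mode([0, 7, 14, 21, 28, 35, 42],
                         [19, 25, 36, 72, 51, 43, 28], 7), 4))
```

The modal class is 21-28 and the printed value is 25.4211.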

2.0.6. P-tiles

These are values of the variate which divide the total frequency into equal parts. The
most commonly used forms of P-tiles are the Quartiles (p = 4), the Deciles (p = 10)
and the Percentiles (p = 100). Quartiles divide the data into four equal parts, Deciles
divide the data into ten equal parts and Percentiles divide the data into one hundred
equal parts. Note that for p parts we have p - 1 boundaries.

When p = 4 we get the Quartiles (Q1, Q2 and Q3). The kth quartile is given by
Qk = (k/4)(n + 1)th value.

When p = 10 we get the Deciles (D1, D2, ..., D9). The kth decile is given by
Dk = (k/10)(n + 1)th value.

When p = 100 we get the Percentiles, given by Pk = (k/100)(n + 1)th value.

Note: Q2 = D5 = P50 = median. The actual value can be obtained using linear interpolation.

Example 1.

For the data 13, 14, 17, 10, 11, 12, 23, 25, 18, 19 we can arrange the values in
ascending order and assign each value a rank (position) to get

Arranged values: 10, 11, 12, 13, 14, 17, 18, 19, 23, 25
Position:         1   2   3   4   5   6   7   8   9  10

Then

Q2 = (2/4)(10 + 1)th = 5.5th value
5.5th value = 5th value + 0.5 × (6th value - 5th value) = 14 + 0.5(17 - 14) = 15.5

Similarly, D6 = (6/10)(10 + 1)th = 6.6th value
6.6th value = 6th value + 0.6 × (7th value - 6th value) = 17 + 0.6(18 - 17) = 17.6
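The (n + 1) position rule with linear interpolation can be sketched as a small function (name and edge handling are illustrative):

```python
def ptile(data, k, p):
    """k-th p-tile: value at position k(n+1)/p, linearly interpolated."""
    xs = sorted(data)
    pos = k * (len(xs) + 1) / p           # e.g. Q2 is k=2, p=4
    i = int(pos)                          # rank of the lower neighbour
    frac = pos - i                        # fractional part for interpolation
    if i >= len(xs):                      # position beyond the largest value
        return xs[-1]
    return xs[i - 1] + frac * (xs[i] - xs[i - 1])

data = [13, 14, 17, 10, 11, 12, 23, 25, 18, 19]
print(ptile(data, 2, 4))    # Q2, the median
print(ptile(data, 6, 10))   # D6
```

For the example data this gives Q2 = 15.5 and D6 = 17.6.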


Remark 3. The method described here is the default in Minitab and SPSS. Other
software, such as SAS and S, use different estimates of the P-tiles based on the same concept.

For grouped data we can obtain Quartiles, Deciles and Percentiles using the following
expressions:

Quartiles:
Q1 = L + ((N/4 - C)/f) h and Q3 = L + ((3N/4 - C)/f) h, where Q1 and Q3 are the
lower and upper Quartiles.

Deciles:
As already seen, deciles are values of the variate that divide the data into 10 equal
parts; the value of the kth decile is Dk = L + ((kN/10 - C)/f) h, so
D1 = L + ((N/10 - C)/f) h, D2 = L + ((2N/10 - C)/f) h and so on.

Percentiles:
The generalized formula for percentiles is given as follows:
Pk = kth Percentile = L + ((kN/100 - C)/f) h, thus P1 = L + ((N/100 - C)/f) h and
P2 = L + ((2N/100 - C)/f) h and so on.

Exercise 4.

For the data: 23, 10, 25, 15, 22, 17, 24, 32, calculate the following: Median, 3rd
Quartile, 5th Decile and 80th Percentile.

Exercise 5.

For the distribution given below obtain Lower Quartile, Upper Quartile, 6th decile

and 70th percentile.

Dividend              5-10   10-15   15-20   20-25   25-30   30-35   35-40   40-45
Number of Companies   15     10


Revision Questions



LESSON 3

Probability and Distributions

3.1. Learning outcomes

Upon completion of this lesson you should be able to;

1. Define a probability distribution

2. Compute Mathematical expectation of a random variable

3. Describe properties of various probability distributions

3.2. Probability distributions

A random variable is a function or a mapping from a sample space into the real

numbers (most of the time). In other words, a random variable assigns real values

to outcomes of experiments. This mapping is called random, as the output values of

the mapping depend on the outcome of the experiment, which are indeed random.

A random variable is therefore just a rule that assigns a number to each outcome of

an experiment. These numbers are called the values of the random variable and the

variable is denoted by capital letters such as X, Y and Z. We can formally say that

a random variable is a function that associates a unique numerical value with every

outcome of an experiment.

3.3. Discrete Probability distributions

A discrete random variable is one which may take on only a countable number

of distinct values such as 0, 1, 2, 3, 4, ... Examples of discrete random variables

include the number of children in a family, the attendance at a cinema on a given day,
the number of DBA students, etc.

Let X be the number of heads observed when a fair coin is tossed three times. Let
H represent the outcome of a head and T the outcome of a tail. The sample space
for such an experiment will be:

S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}

X        0     1     2     3
P(X=x)   1/8   3/8   3/8   1/8

The resulting table above is known as a probability distribution table. A probability
distribution for a discrete random variable is a formula, table or graph that provides
the probability associated with each value of the random variable. If P(X = x) is the
probability distribution of a random variable X, then the following properties hold:

Probabilities lie between 0 and 1, i.e. 0 ≤ P(X = x) ≤ 1

The probabilities add up to 1, i.e. Σ P(X = x) = 1

NOTE: Here, uppercase X is used for the random variable and lowercase x is used to
denote a realization of X.

Probabilities can be easily obtained from the probability distribution table as follows:
the probability of getting two or more heads is given by

P(X > 1) = P(X = 2) + P(X = 3) = 3/8 + 1/8 = 1/2

In some books this is referred to as a probability mass function (pmf), but it should
be noted that the (pmf) is for discrete random variables while the probability density
function (pdf) is used for continuous random variables. If the variable is discrete, it
describes how likely the random variable is to be at a certain point. The (pmf) or
(pdf) is represented by the lowercase f(x) or P(X = x) for a random variable X.

3.3.1. Expectation of a random variable

Mathematical expectation refers to the mean or expected value of a random variable
X whose distribution is known. The expected value, denoted by E(X), is given by the
expression

E(X) = Σ x P(X = x)

The expected value of (X - μ)², where μ = E(X), is called the variance, and it is
denoted by V(X) = σ² = E[(X - μ)²] = E(X²) - [E(X)]²
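For the three-coin-toss distribution above, E(X) and V(X) can be checked directly; this is a small sketch using exact fractions:

```python
from fractions import Fraction as F

# pmf of X = number of heads in three tosses of a fair coin
pmf = {0: F(1, 8), 1: F(3, 8), 2: F(3, 8), 3: F(1, 8)}

mean = sum(x * p for x, p in pmf.items())                # E(X) = sum of x P(X=x)
var = sum(x * x * p for x, p in pmf.items()) - mean**2   # V(X) = E(X^2) - [E(X)]^2
print(mean, var)
```

This gives E(X) = 3/2 and V(X) = 3/4.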


The following table gives the probability distribution of a random
variable X. Given that P(X < 150) = 0.6, find the values of a and b, hence calculate
E(X) and the standard deviation of X.

X        40   80    120    150   200
P(X=x)   a    0.2   0.23   b     0.15

Solution: Since the probabilities must add up to 1 and P(X < 150) = 0.6, solve the
resulting two equations for a and b.

EXERCISE 5. Marketing estimates that a new instrument for the analysis of soil
samples will be very successful, moderately successful, or unsuccessful, with
probabilities 0.3, 0.6, and 0.1, respectively. The yearly revenue associated with a very
successful, moderately successful, or unsuccessful product is $10 million, $5 million,
and $1 million, respectively. Let the random variable X denote the yearly revenue of
the product. Determine the probability mass function of X, and hence or otherwise
find the expected value and variance of X.

EXERCISE 6. The following table gives the probability distribution of marks obtained by some students in a CAT.

Marks            12       14       18       23       24       25
Probability(p)   0.0645   0.0968   0.1935   0.2581   0.2258   0.1613

From the above table find the probability that a randomly picked student from this
class scored

i. More than 22 marks?
ii. Less than 20.5 marks?
iii. Between 16 and 23 marks exclusive?
iv. What is the expected mark?

3.3.2. Bernoulli Distribution

The coin toss: there is no more basic random event than the flipping of a coin.
Heads or tails. It's as simple as you can get! The "Bernoulli trial" refers to a single
event which can have one of two possible outcomes with a fixed probability of each
occurring. You can describe these events as "Yes or No" questions. For example:

Will the coin land heads?
Will the newborn child be a girl?
Will a potential customer decide to buy my product?
Will this person be carjacked in his/her lifetime?

The main controlling parameter in the Bernoulli distribution is the probability of
success p. A "fair coin", or an experiment where success and failure are equally
likely, will have a probability of p = 0.5 (50%). If a random variable X is distributed
with a Bernoulli distribution with a parameter p we write its probability function as:

P(X = x) = p^x (1 - p)^(1-x),  x = 0, 1

Mean and Variance

E(X) = p and Var(X) = p(1 - p) = pq, where q = 1 - p represents the probability of failure.

3.3.3. Binomial Distribution

Where the Bernoulli distribution asks the question of "Will this single event succeed?" the Binomial is associated with the question "Out of a given number of

trials, how many will succeed?" Some example questions that are modeled with a

Binomial distribution are:

Out of twenty tosses, how many times will this coin land heads?

From the children born in a given hospital on a given day, how many of them

will be girls?


How many mosquitoes, out of a swarm, will die when sprayed with insecticide?

NOTE: The Binomial distribution is composed of multiple Bernoulli trials. We conduct repeated experiments where the probability of success is given by the parameter
p and add up the number of successes. This number of successes is represented by
the random variable X. The value of X is then between 0 and n.

When a random variable X has a Binomial distribution with parameters p and n we
write it as X ~ Bin(n, p) or X ~ B(n, p), and the probability distribution function is
given by the equation:

P(X = k) = C(n, k) p^k (1 - p)^(n-k),  k = 0, 1, ..., n

Mean and Variance

E(X) = np and Var(X) = np(1 - p) = npq, where q = 1 - p represents the probability of failure.

Example. The probability of hitting the bull's eye in a dart game is 0.12. Find
the probability that in eight trials the bull's eye will be hit (a) exactly 4 times, (b)
at least once. (c) Find the expected value.

Solution:

There are 8 trials in a binomial experiment, i.e. n = 8 and p = 0.12, then

(a) P(Exactly 4 hits out of 8):
P(X = 4) = C(8, 4) × 0.12^4 × (1 - 0.12)^(8-4) = 0.0087

(b) P(At least once), i.e. P(X ≥ 1):
P(X ≥ 1) = 1 - P(X = 0) = 1 - C(8, 0) × 0.12^0 × (1 - 0.12)^8 = 1 - 0.3596 = 0.6404

(c) E(X) = np = 8(0.12) = 0.96
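The calculations in this example can be checked with `math.comb` (a sketch; the function and variable names are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 8, 0.12
print(round(binom_pmf(4, n, p), 4))      # (a) exactly 4 hits
print(round(1 - binom_pmf(0, n, p), 4))  # (b) at least one hit
print(round(n * p, 2))                   # (c) E(X) = np
```

This prints 0.0087, 0.6404 and 0.96, matching the example.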

3.3.4. Poisson Distribution

The Poisson distribution is very similar to the Binomial Distribution. In both cases

we are examining the number of times an event happens but whereas the Binomial


Distribution looks at how many times we register a success over a fixed total number

of trials, the Poisson Distribution measures how many times a discrete event occurs,

over a period of continuous space or time.

Instead of a parameter p that represents a component probability, as in the Bernoulli
and Binomial distributions, the Poisson uses the parameter λ, which represents the
"average or expected" number of events to happen within our experiment. The
probability mass function of the Poisson is given by

P(X = x) = e^(-λ) λ^x / x!,  x = 0, 1, 2, ...

The Poisson distribution can be used as an approximation to the Binomial distribution
using X ~ Po(np), where n and p are the number of trials and the probability of
success respectively in the Binomial distribution which is being approximated. The
approximation can be used when n is large (say n > 50) and p is small (say p < 0.1).
This ensures that np ≈ np(1 - p), that is, the mean and variance are approximately equal.

Example

A restaurant is such that one of its dishes gets ordered on average 4 times per day.
What is the probability of having this dish ordered exactly 3 times tomorrow?

Solution:

The probability of having the dish ordered exactly 3 times is obtained by setting
x = 3 in the above equation. We have already determined that on average 4 dishes
are sold per day, so λ = 4 and

P(X = 3) = e^(-4) × 4^3 / 3! = 0.1954
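The Poisson pmf is easy to evaluate directly (a sketch; names are illustrative):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

# Dish ordered on average 4 times a day: probability of exactly 3 orders tomorrow
print(round(poisson_pmf(3, 4), 4))
```

This prints 0.1954.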

The Geometric Distribution refers to the probability of the number of trials needed
until getting a desired result. For example:

How many times will I throw a coin until it lands on heads?
How many children will I have until I get a girl?

Just like the Bernoulli distribution, the Geometric distribution has one controlling
parameter: the probability of success p in any independent trial. If a random variable
X is distributed with a Geometric distribution with a parameter p we write its
probability mass function as:

P(X = x) = (1 - p)^(x-1) p,  x = 1, 2, 3, ...

Mean and Variance

E(X) = 1/p and Var(X) = q/p², where q = 1 - p. With a Geometric distribution it is
also pretty easy to calculate the probability of a "more than n times" case: the
probability of failing to achieve the wanted result in n trials is P(X > n) = (1 - p)^n.

Example.

Suppose a drunkard is trying to find the key to his front door, out of a keychain with
10 different keys. What is the probability that he succeeds in finding the right key
on the 4th attempt?

Solution:

This is a geometric distribution with parameter (probability of success) p = 1/10, then

P(X = 4) = (1 - 1/10)^3 × (1/10) = (0.9)^3 × 0.1 = 0.0729
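A quick check of the geometric calculation (a sketch; names are illustrative):

```python
def geom_pmf(x, p):
    """P(X = x) = (1 - p)^(x - 1) * p: first success on trial x."""
    return (1 - p) ** (x - 1) * p

print(round(geom_pmf(4, 0.1), 4))   # right key found on the 4th attempt
print(round(0.9 ** 5, 5))           # P(X > 5) = (1 - p)^5: five straight failures
```

This prints 0.0729 and 0.59049.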

Just as the Bernoulli and the Binomial distributions are related in counting the number
of successes in 1 or more trials, the Geometric and the Negative Binomial distributions
are related in the number of trials needed to get 1 or more successes.

The Negative Binomial distribution refers to the probability of the number of trials
needed until achieving a fixed number of desired results.

For example:

How many times will I throw a coin until it lands on heads for the 5th time?
How many children will I have when I get my second daughter?
How many fish will I catch by the time I get the fifth Tilapia?

Just like the Binomial distribution, the Negative Binomial distribution has two
controlling parameters: the probability of success p in any independent trial and the
desired number of successes m. If a random variable X has a Negative Binomial
distribution with parameters p and m, its probability mass function is:

P(X = x) = C(x - 1, m - 1) p^m (1 - p)^(x-m),  x = m, m + 1, m + 2, ...


Mean and Variance

E(X) = m/p and Var(X) = m(1 - p)/p²

Example. A hawker goes home if he has sold 3 coats that day. Some days he sells
them quickly. Other days he's out till late in the evening. If on average he sells a
coat at one out of ten houses he approaches, what is the probability of returning
home after having visited only 10 houses?

Solution:

The number of trials is Negative Binomial distributed with parameters p = 0.1 and
m = 3, hence:

P(X = 10) = C(9, 2) × (0.1)^3 × (0.9)^7 = 36 × 0.001 × 0.4783 = 0.0172
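The hawker example can be verified with `math.comb` (a sketch; names are illustrative):

```python
from math import comb

def negbinom_pmf(x, m, p):
    """P(X = x) = C(x-1, m-1) p^m (1-p)^(x-m): m-th success on trial x."""
    return comb(x - 1, m - 1) * p**m * (1 - p)**(x - m)

# p = 0.1 per house, m = 3 sales needed, exactly x = 10 houses visited
print(round(negbinom_pmf(10, 3, 0.1), 4))
```

This prints 0.0172.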

A continuous random variable is one that can take on any value within a continuous
range or an interval. Examples: the duration of a call in a telephone exchange, the
time taken to complete a certain task, the weight of a student, the age of a person, etc.
Unlike a discrete random variable, a continuous random variable has a probability
density function (pdf) instead of a probability mass function. The difference is that
the former must integrate to 1, while the latter must sum to 1. If f(x) is the pdf of a
random variable X, then f(x) ≥ 0 and ∫ f(x) dx = 1.

The expected value or the mean of X is defined as

E(X) = ∫ x f(x) dx

The variance of a continuous or discrete distribution is defined as

Var(X) = E[(X - μ)²] = E(X²) - [E(X)]²

The uniform distribution, as its name suggests, is a distribution with probability
densities that are the same at each point in an interval. In casual terms, the uniform
distribution is shaped like a rectangle. The probability density function of the uniform
distribution is defined as

f(x) = 1/(b - a) for a ≤ x ≤ b, and 0 otherwise.

It can be shown that the expected value and variance are given by the following
expressions:

E(X) = (a + b)/2 and Var(X) = (b - a)²/12


LESSON 4

Normal Distribution

Learning outcomes

Upon completion of this lesson you should be able to;

1. Identify data that is normally distributed.

2. Read standard normal statistical tables.

3. Apply normality concept in estimating probabilities of certain outcomes

4.1. Introduction

The Normal Probability Distribution is one of the most useful and more important

continuous distributions in statistics. The Normal distribution is used frequently in

statistics for many reasons:

The Normal distribution has many convenient mathematical properties.

Many natural phenomena have distributions which when studied have been

shown to be close to that of the Normal Distribution.

The Central Limit Theorem shows that the Normal Distribution is a suitable

model for large samples regardless of the actual distribution.

4.2. Description

The Normal distribution describes a continuous variable that takes on values on the
real number line. The formula for the Normal has two parameters: the mean μ and
the variance σ². The parameter μ is a location parameter and σ² is a scale parameter.
The distribution is symmetric about the mean, as shown in the following figure.

Consider the following histogram with a normality plot on it, for a certain study on
men's heights:


It is clear that the very tall are as few as the very short. The majority of the Americans
are about 174 cm tall. The heights range from about 150 cm, which is about
174 - 3(6.7) cm, to about 195 cm, which is about 174 + 3(6.7) cm. This is in line with
Tchebysheff's theorem.

Note. Tchebysheff's theorem states that for any set of observations x1, x2, ..., xn, at
least 1 - 1/k² of the values will lie within k standard deviations of the mean, where
k ≥ 1.

4.2.1. Functional form

A continuous random variable, X, is normally distributed with a probability density
function given by:

f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)),  -∞ < x < ∞

where μ and σ are the mean and the standard deviation respectively. The expected
value of a distribution is defined as the probability-weighted sum of outcomes.
For X ~ N(μ, σ²),

E(X) = μ

and the variance of a distribution is the probability-weighted sum of the squared
differences between outcomes and their expected values:

Var(X) = σ²

It is now clear that the parameters μ and σ² are simply equal to the expected value
and variance respectively.

4.3. The Standard Normal Distribution

A normal distribution with a mean of 0 and a standard deviation of 1 is called the
standard normal distribution. Every normally distributed variable can be transformed
into a standard normal variable by computing the Z-score value. The Z value is the
distance between a selected value, designated x, and the population mean μ, divided
by the population standard deviation σ:

Z = (x - μ)/σ


The transformed values will always give the curve above. Notice that the central

value of Z is zero (0) and the curve is still symmetric. We determine probabilities

based upon distance from the mean (i.e., the number of standard deviations). It is
worth noting that:

The probability is the proportion of area under the standard normal curve.

The probabilities have been computed and published under the name Normal

probability tables. What we get when we use these tables is always the area

between the mean and z standard deviations from the mean.

Because of symmetry, P(Z > 0) = P(Z < 0) = 0.5.

Tables show probabilities rounded to 4 decimal places, e.g.

If Z < 1.96 then the probability is 0.9750; we write P(Z < 1.96) = 0.9750
If Z > -1.96 then the probability is 0.9750; we write P(Z > -1.96) = 0.9750

From the standard normal tables:

1. P(Z < 1.00) = 0.8413
2. P(Z < 2.97) = 0.9985, so P(Z > 2.97) = 0.0015
3. P(Z < 0) = 0.5000
4. P(Z < -1) = P(Z > 1) = 1 - 0.8413 = 0.1587
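Since Python 3.8 these table values can be evaluated with the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

Z = NormalDist()                  # standard normal: mean 0, sd 1
print(round(Z.cdf(1.00), 4))      # P(Z < 1.00)
print(round(Z.cdf(2.97), 4))      # P(Z < 2.97)
print(round(1 - Z.cdf(1.00), 4))  # P(Z > 1) = P(Z < -1)
```

This prints 0.8413, 0.9985 and 0.1587.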

Example . The daily water usage per person in Thika is normally distributed with

a mean of 20 gallons and a standard deviation of 5 gallons. What is the probability

that a person from Thika selected at random will use;


1. Less than 20 gallons per day?

2. Less than 25 gallons per day?

3. More than 30 gallons per day?

Solution:

We cannot read the probabilities directly; we must standardize our values as follows:

1. P(X < 20) = P(Z < (20 - 20)/5) = P(Z < 0) = 0.5
2. P(X < 25) = P(Z < (25 - 20)/5) = P(Z < 1) = 0.8413
3. P(X > 30) = P(Z > (30 - 20)/5) = P(Z > 2) = 1 - 0.9772 = 0.0228

standard deviation 2.5. Assuming the data is normally distributed, find out how
many students scored

i. Between 12 and 15

ii. Above 18

iii. Below 8

E XERCISE 8. The data given below shows the number of employees with their

corresponding ages in a company. Use the data to find the probabilities that a person

picked at random has

i. Age more than 35 years

ii. Age falling between 24 and 38

E XERCISE 9. A recent study showed that 20% of JKUAT employees are landlords. A sample of 250 employees is taken. What is the probability that less than

40 are landlords?

Suggested materials for further reading

1. Wonnacott, T.H. and Wonnacott, R.J. (1990). Introductory Statistics for Business and Economics, 2nd Edition, John Wiley and Sons Inc.

2. Gujarati, D.N. (2006). Basic Econometrics. 3rd Edition, McGraw-Hill, Inc.,

New York.

3. Keller, G., Warrack, B. and Bartel, H. (1994). Statistics for Management and

Economics. 3rd Edition. Wadsworth Publishing Company, Belmont California, USA.


LESSON 5

Tests of Hypothesis 1

Learning outcomes

Upon completion of this lesson you should be able to;

1. Define hypothesis testing

2. State two types of errors in hypothesis testing

3. Carry out necessary computations for t-tests

4. Use SPSS to carry out tests involving comparison of means of two groups

5.1. Introduction

A statistical hypothesis is an assertion or conjecture about a parameter (or parameters)
of a population. It can be viewed as a precise, testable statement about the value
of a population parameter developed for the purpose of testing. Hypothesis testing

is a procedure, based on sample evidence and probability theory, used to determine

whether the hypothesis is a reasonable statement and should not be rejected, or is

unreasonable and should be rejected.

Null Hypothesis H0: the statement to be tested; a statement about the value of a
population parameter. Typically a null hypothesis is the opposite of the real
hypothesis of interest. It might state, for example, that a parameter equals 0 in the
population, or that the values of two subgroup parameters are equal in the population.

Alternative Hypothesis: H1 : A statement that is accepted if the sample data

provide evidence that the null hypothesis is false.

Level of Significance α: The probability of rejecting the null hypothesis when
it is actually true.

Type I Error: Rejecting the null hypothesis when it is actually true; its probability is α.

Type II Error: Accepting the null hypothesis when it is actually false; its probability
is denoted β.


See the following table on associated probabilities.

Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.

Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

A p-value: A measure of how much evidence you have against the null hypothesis.
The smaller the p-value, the more evidence you have. One usually combines the
p-value with the significance level to make a decision on a given test of hypothesis.
In such a case, if the p-value is less than some threshold (usually 0.05, sometimes a
bit larger like 0.1 or a bit smaller like 0.01) then you reject the null hypothesis.

5.2. Parametric Tests

Hypothesis tests can be two-tailed, when looking for a change, such as testing

H0: μ = 5 vs H1: μ ≠ 5

or one-tailed, when looking for an increase (or decrease), such as testing

H0: μ = 5 vs H1: μ > 5

The procedure to use when carrying out a hypothesis test is:

1. Determine H0, H1 and the significance level.

2. Decide whether a one- or two-tailed test is appropriate.

3. Calculate the test statistic assuming H0 is true.

4. Compare the test statistic with the critical value(s) for the critical region.

5. Accept or reject H0 as appropriate.


6. State conclusion in terms of the original problem.

When testing for the population mean from a large sample and the population
standard deviation is known, the test statistic is given by

Z = (x̄ - μ)/(σ/√n)

where σ is the known population standard deviation.

Example . Brandways Company indicates on the label that their loaves weigh

400g. A sample of 40 loaves is selected hourly from their processing line and the

contents weighed. Last hour a sample of 40 loaves had a mean weight of 403g with

a standard deviation of 8g. Test at the 0.05 significance level whether the process is
out of control.

Solution:

The hypotheses to be tested are

H0: μ = 400 vs H1: μ ≠ 400

Z = (x̄ - μ)/(σ/√n) = (403 - 400)/(8/√40) = 2.372

The critical value from the Z-table at 5% is 1.96.

Since Zcal is greater than Ztab, i.e. 2.372 > 1.96, we reject H0 and conclude that the
process is out of control.

mean IQ of 100 and standard deviation of 15. The 9 people underwent an intensive

training and then the IQ test was administered. The sample mean IQ was 113 and

the sample standard deviation was found to 10. Test whether the training had any

significant effect (increase) in IQ score?

Solution:

Note that the level of significance is not specified. The standard value is 0.05 but

we may use 0.01 or 0.1 depending on the accuracy required. In this example we use

= 0.01

The hypotheses to be tested are

H0: μ = 100 vs H1: μ > 100

Z = (x̄ - μ)/(σ/√n) = (113 - 100)/(15/√9) = 2.6

The critical value from the Z-table at 1% is 2.33.

Since Zcal is greater than Ztab, i.e. 2.6 > 2.33, we reject H0 and conclude that the data
provide enough evidence to indicate that such training increases the IQ.

Commonly used test statistics involving normal distribution are summarized as follows:


t = (x̄ - μ)/(s/√n)

where s is the standard deviation of the sample, usually given by the expression

s = √( Σ f(x - x̄)² / (n - 1) )

EXERCISE 10. Jane is in charge of Quality Control at a bottling facility. Currently,
she is checking the operation of a machine that is supposed to deliver 355 mL of
liquid into an aluminum can. If the machine delivers too little, then the local
Regulatory Agency may fine the company. If the machine delivers too much, then
the company may lose money. For these reasons, Jane is looking for any evidence
that the amount delivered by the machine is different from 355 mL. During her
investigation, she obtains a random sample of 10 cans and measures the following
volumes:

355.8  355.0  355.5  353.7  355.5
355.3  353.8  355.6  355.0  355.4

The machine's specifications claim that the amount of liquid delivered varies
according to a normal distribution with mean 355 mL and variance 0.64 mL². Do
the data suggest that the machine is operating correctly?

5.2.1. Z Test for Two Means

The Null Hypothesis should be an assumption about the difference in the population

means for two populations. The data should consist of two samples of quantitative

data (one from each population). The samples must be obtained independently

from each other. The samples must be drawn from populations which have known

Standard Deviations (or Variances). Also, the measured variable in each population

(generically denoted x1 and x2) should have a Normal Distribution.

Procedure: The null hypothesis is

H0: μ1 - μ2 = d

in which d is the supposed difference in the expected values under the null hypothesis.
The alternate hypothesis could be, for example,

H1: μ1 - μ2 ≠ d

The test statistic is

Z = (x̄1 - x̄2 - d) / √(σ1²/n1 + σ2²/n2)

Usually, the null hypothesis is that the population means are equal, i.e. d = 0; in this
case, the formula reduces to

Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)

If the variances (and thus the standard deviations) of the two populations are assumed
equal, the pooled variance could be used in place of σ1² and σ2².

Example

Universities and colleges in the United States of America are categorized by the
highest degree offered. Type IIA institutions offer a Master's degree, and Type IIB
institutions offer a Baccalaureate degree. A professor, looking for a new position,
wonders if the salary difference between Type IIA and IIB institutions is really
significant. He finds that a random sample of 200 IIA institutions has a mean salary

(for full professors) of $54,218, with standard deviation $8,450. A random sample

of 200 IIB institutions has a mean salary (for full professors) of $46,550, with standard deviation $9,500 (assume that the sample standard deviations are in fact the

population standard deviations). Do these data indicate a significantly higher salary

at IIA institutions?

Solution

The null hypothesis is that there is no difference; thus

H0: μA = μB vs H1: μA > μB

Since the hypotheses concern means from independent samples, a two-sample test is
indicated. The samples are large, and the standard deviations are known (we
assumed), so a two-sample z-test is appropriate:

Z = (54218 - 46550) / √(8450²/200 + 9500²/200) = 7668/899.03 = 8.53

This value is far larger than 4, the most extreme value in the standard normal tables,
so we reject the null hypothesis and conclude that IIA schools have a significantly
higher salary than IIB schools.
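The two-sample z statistic for this example can be checked directly (a sketch; names are illustrative):

```python
from math import sqrt

n1 = n2 = 200
x1, s1 = 54218, 8450   # IIA: sample mean, (assumed population) sd
x2, s2 = 46550, 9500   # IIB

z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)
print(round(z, 2))
```

This prints 8.53.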


5.2.2. The t-Test

The t- test is the most powerful parametric test for calculating the significance of

means when the sample is small and when the population variance is unknown. The

test is based on a t-distribution which has the following properties;

It is continuous, bell-shaped, and symmetrical about zero like the z-distribution.

There is a family of t-distributions sharing a mean of zero but having different

standard deviations.

The t-distribution is more spread out and flatter at the center than the z distribution, but approaches the z-distribution as the sample size gets larger.

A t-test is necessary for small samples because the sampling distribution of the mean
is not normal when the population standard deviation must be estimated. If the
sample is large (n ≥ 30) then statistical theory says that the sample mean is
approximately normally distributed and a z-test for a single mean can be used. This is
a result of a famous statistical theorem, the Central Limit Theorem.

A t-test, however, can still be applied to larger samples and as the sample size n

grows larger and larger, the results of a t-test and z-test become closer and closer.

In the limit, with infinite degrees of freedom, the results of t and z tests become

identical. In order to perform a t-test, one first has to calculate the degrees of freedom.
This quantity takes into account the sample size and the number of parameters that
are being estimated. Here, the population mean μ is being estimated by the sample
mean x̄, so only one population parameter is being estimated by a sample statistic and
the degrees of freedom for the test of a single mean are df = n - 1. The test statistic
for the one-sample case is given by

t = (x̄ - μ)/(s/√n)

where s = √( Σ(x - x̄)² / (n - 1) )

For a two-tail test using the t-distribution, you will reject the null hypothesis when
the value of the test statistic is greater than t(n-1, α/2) or less than -t(n-1, α/2),
depending on the direction of the tail.

Example.

The current rate for producing 5 amp fuses at an ABC company is 250 per hour.

A new machine has been purchased and installed that, according to the supplier,


will increase the production rate. A sample of 10 randomly selected hours from last

month revealed the mean hourly production on the new machine was 256, with a

sample standard deviation of 6 per hour. At the 0.05 significance level can ABC

conclude that the new machine is faster?

Solution

The hypotheses are H0: μ = 250 vs H1: μ > 250.

Since the sample is small and the population standard deviation is unknown, a t-test
is appropriate:

t = (256 - 250)/(6/√10) = 3.16

We reject the null hypothesis at the 0.05 significance level if tcal > t(9, 0.05) = 1.833
(from the t-tables). Since tcal = 3.16 > 1.833, we reject the null hypothesis and
conclude that the sample provides enough evidence that the new machine is faster.
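The t statistic for this example can be checked in a couple of lines (a sketch; names are illustrative):

```python
from math import sqrt

xbar, mu0, s, n = 256, 250, 6, 10
t = (xbar - mu0) / (s / sqrt(n))   # one-sample t statistic, df = n - 1 = 9
print(round(t, 2))                 # compare with the critical value t(9, 0.05) = 1.833
```

This prints 3.16.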

Exercise 8

A college professor wants to compare her students' scores with the national average.
She chooses a simple random sample of 20 students, who score an average of 54.2
on a standardized test. Their scores have a standard deviation of 4.5. The national
average on the test is 60. She wants to know if her students scored significantly
lower than the national average.

Comparing Two Independent Population Means

A small two-sample t-test is used to test the difference between two population means
when the sample size for at least one population is less than 30. To conduct this test,
three assumptions are required:

The populations must be normally or approximately normally distributed.
The populations must be independent.
The population variances must be equal.

The standardized test statistic is:

t = (x̄1 - x̄2) / (sp √(1/n1 + 1/n2)), where sp² = ((n1 - 1)s1² + (n2 - 1)s2²)/(n1 + n2 - 2)

with n1 + n2 - 2 degrees of freedom.


Dependent samples

Dependent samples are samples that are paired or related in some fashion, for
example using the same subject and taking repeated measurements. For instance, if
you wished to buy a car you would look at the same car at two (or more) different
dealerships and compare the prices. Use the following test when the samples are
dependent:

t = d̄ / (sd/√n)

where di = xi - yi is the difference between pairs, d̄ is the average of the differences
and sd is the estimated standard deviation of the differences.

Example. An independent testing agency is interested in the cost of renting a
single-bedroomed house in Nairobi estates. A random sample of 6 towns is obtained
and the following rental information obtained. At the 0.05 significance level, can the
testing agency conclude that there is a difference in the rent charged between 2006
and 2007?

Estate        Rent06 (Ksh00)   Rent07 (Ksh00)
Githurai      55               59
Kahawa        64               65
Ngomongo      23               48
Roysub        38               48
Kawangware    57               59

Solution

Estate        Rent06 (Ksh00)   Rent07 (Ksh00)   Difference (d)
Githurai      55               59
Kahawa        64               65
Ngomongo      23               48               13
Roysub        38               48               20
Kawangware    57               59               -7
Total         282              312              30
Average       47               52


The hypotheses are

H0: μd = 0 versus H1: μd > 0

The calculated test statistic tc is smaller than the tabulated value t(5, 0.05) = 2.015. Since tc < t(5, 0.05), we fail to reject the null hypothesis and conclude that the data do not provide enough evidence that rent has increased significantly.
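The paired statistic t = d̄/(sd/√n) can be sketched as follows. The before/after pairs below are the five complete estate rows from the table above, used purely for illustration (the printed table's totals are not fully recoverable from the source):

```python
# Paired-sample t statistic from the formula above.
import math

def paired_t(before, after):
    d = [a - b for a, b in zip(after, before)]   # differences for each pair
    n = len(d)
    d_bar = sum(d) / n                            # mean difference
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n))

rent06 = [55, 64, 23, 38, 57]
rent07 = [59, 65, 48, 48, 59]
t = paired_t(rent06, rent07)
# Compare t with the tabulated t(n - 1, 0.05) value before concluding.
```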

Exercise 9

A sample of 8 students was given a diagnostic test before studying a particular

module and then again after completing the module. The following data gives their

scores before and after the training.

Test at the 0.1 and 0.05 levels of significance whether the teaching leads to an improvement in students' scores.

Suggested materials for further reading

1. Wonnacott, T.H. and Wonnacott, R.J. (1990). Introductory Statistics for Business and Economics, 2nd Edition, John Wiley and Sons Inc.

2. Gujarati, D.N. (2006). Basic Econometrics. 3rd Edition, McGraw-Hill, Inc.,

New York.

3. Keller, G., Warrack, B. and Bartel, H. (1994). Statistics for Management and

Economics. 3rd Edition. Wadsworth Publishing Company, Belmont California, USA.


LESSON 6

Hypothesis Testing 2

Learning outcomes

Upon completion of this lesson you should be able to:

1. Explain the data considerations for one-way ANOVA

2. Perform basic computations involving ANOVA

3. Carry out a goodness-of-fit test using the chi-square distribution

4. Carry out contingency table analysis using the chi-square distribution

5. Describe some limitations of hypothesis testing

6.1. Introduction

This lesson combines two topics which may appear totally unrelated. While the first topic (ANOVA) deals with ratio/interval versus categorical variables, the second (chi-square) assumes categorical variables. Both topics, however, are still about hypothesis testing.

6.1.1. Analysis of Variance (ANOVA)

We have studied the test of the significance of the difference in means between two independent populations. For this we used the standard error of the mean or the standard error of the difference of the two means, with a z-test or t-test. This concept can be extended to the differences in means of more than two independent populations, but in a slightly different manner. Suppose we want to study the effects of four types of fertilizer, say A, B, C and D, on the yield of sugar cane. We take five plots for each fertilizer, so the 4 fertilizers are applied to 20 plots in all. We can find the arithmetic mean of the yields of the 5 plots for each fertilizer separately, but a single t-test cannot test the significance of the differences among these four means. One way of using the t-test is to form the 6 pairs AB, AC, AD, BC, BD and CD and test each difference separately, drawing a conclusion for each pair.

There arise two difficulties:

1. First, the work of computation will increase and


2. Second, only pairs of the four fertilizers are tested. We cannot establish whether the differences are significant when all fertilizers are taken together.

In such a situation a method of testing significance that avoids these two difficulties is needed, one that tests significance between the means of more than two samples at once. Here a test of significance means testing the hypothesis of whether the means of several samples differ significantly or not. To test the difference among several sample means we use a statistical technique known as Analysis of Variance (ANOVA). The main objective of the analysis of variance is to test the hypothesis of whether the means of several groups differ significantly.

Components of total Variability

When observations are classified into groups or samples on the basis of a single criterion, we have a one-way classification. Examples include the yield of sugar cane from 20 plots classified by four types of fertilizer, or the marks obtained by students at different colleges. For a one-way classification, total variability is partitioned into two parts, that is

Total variation = Variation between groups + Variation within groups.

Assumptions of Analysis of Variance

The analysis of variance is based on the assumptions given below:

1. Normality of the distribution: the population for each sample must be normally distributed with mean μi and unknown variance σ².

2. Independence of samples: all the sample observations must be selected randomly.

3. Additivity: the total variation of the various sources of variation should be additive.

4. Equal (but unknown) variances: the populations from which the k samples are drawn have variances σ1² = σ2² = ... = σk² = σ².

5. The error components are independent with mean 0 and variance σ².

The tests of significance performed in the analysis of variance are meaningful only under these assumptions.


6.1.2. Techniques of One-way ANOVA

1. In one-way analysis of variance there are k groups, one from each of k normal populations with common variance σ² and means μ1, μ2, ..., μk. The numbers of observations ni in the groups may be equal or unequal, with n1 + n2 + ... + nk = n.

2. Linear model:

xij = μ + τi + eij

where xij = observation j in group i (i = 1, 2, ..., k; j = 1, ..., ni), μ = general mean, τi = effect of the ith factor level, and eij = random error term.

3. Null hypothesis (H0) and alternative hypothesis (H1):

H0: the means of the populations are equal, i.e. μ1 = μ2 = ... = μk

H1: at least two of the means are not equal.

4. Computations:

(a) Calculate the sum of the observations in each sample, T1, T2, ..., Tk, the grand total T = ΣTi, and the sum of the squares of all the observations, Σx².

(b) Calculate the correction factor CF = T²/n, where n = total number of observations.

(c) Calculate the group means x̄1, x̄2, ..., x̄k and their common mean x̄, where x̄k = Tk/nk and x̄ = T/n.

(d) Calculate the total sum of squares and the sum of squares between samples:

TSS = Σx² - CF,  SSB = Σ(Tk²/nk) - CF

The sum of squares within samples may then be computed as

SSW = TSS - SSB

5. The sum of squares within samples is also called the error sum of squares.

6. Calculate the mean sums of squares:

MSSB = SSB/(k - 1) (mean sum of squares between samples)

MSSW = SSW/(n - k) (mean sum of squares within samples)

where n - 1 = (k - 1) + (n - k).

7. Obtain the variance ratio F = MSSB/MSSW and compare the calculated value Fc with the tabulated value Ft(v1, v2), where v1 = k - 1 is the degrees of freedom for the numerator and v2 = n - k is the degrees of freedom for the denominator. Reject H0 if Fc exceeds the tabulated value.
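The computational steps above can be carried out by hand or in a short program. A minimal Python sketch, using three hypothetical groups of observations:

```python
# One-way ANOVA following the steps above: correction factor, TSS, SSB,
# SSW = TSS - SSB, mean squares, and the F ratio.
def one_way_anova(groups):
    n = sum(len(g) for g in groups)      # total number of observations
    k = len(groups)                      # number of groups
    total = sum(sum(g) for g in groups)  # grand total T
    cf = total ** 2 / n                                # correction factor
    tss = sum(x * x for g in groups for x in g) - cf   # total sum of squares
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - cf
    ssw = tss - ssb                                    # error sum of squares
    mssb = ssb / (k - 1)
    mssw = ssw / (n - k)
    return mssb / mssw                                 # variance ratio F

groups = [[8, 10, 12], [14, 15, 13], [18, 20, 19]]     # hypothetical data
f = one_way_anova(groups)
# Compare f with the tabulated F(k - 1, n - k) value.
```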



LESSON 7

Correlation and Regression Analysis

Learning outcomes

Upon completion of this lesson you should be able to:

1. Draw scatter plots for bi-variate data.

2. Calculate and interpret parametric and non-parametric correlation coefficients.

3. Fit data to simple regression equation.

4. Interpret computer output on multiple regression model.

7.1. Introduction

For most data sets, we are interested in understanding the relationships between the variables. If the relationship between variables X and Y is causal, it is possible to predict the effect of changing the value of X. Causality can only be deduced from how the data were collected; the data values themselves do not contain any information about causality. In an observational study, values are passively recorded from individuals, whereas experiments are characterised by the experimenter's control over the values of one or more variables. Causal relationships can therefore only be deduced from well-designed experiments.

7.2. Test of relationships involving quantitative data

Bivariate data is data in which two variables are recorded for each member of a population, e.g. length and weight, or shoe size and arm span. A scatter diagram can be used to represent bivariate data graphically. Correlation measures the strength of the relationship between two variables. Linear correlation gives


a measure of how well a straight line can be used to model a set of points on a

scatter diagram. The Coefficient of Correlation is a measure of the strength of

the relationship between two variables. The correlation is perfect if the points lie

exactly on a straight line. There are two commonly used correlation coefficients

both giving values between -1 (perfect negative correlation) and +1 (perfect positive

correlation) inclusive.

Pearson's product-moment correlation coefficient

Spearman's rank correlation coefficient

7.2.1. Pearson's product-moment correlation coefficient

Given bivariate data (x1, y1), (x2, y2), ..., (xn, yn), we define the summaries

Sxy = Σxy - (Σx)(Σy)/n,  Sxx = Σx² - (Σx)²/n,  Syy = Σy² - (Σy)²/n

and the correlation coefficient

r = Sxy / √(Sxx Syy)

This correlation coefficient should only be used if the two variables are normally

distributed.

7.2.2. Spearman's rank correlation coefficient

The data are ranked (equal values being given the average of the ranks they would otherwise have taken). Let d be the difference in the ranks and n the number of pairs of data; then

r = 1 - 6Σd² / (n(n² - 1))

This rank correlation coefficient is useful when the data are not normal.
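Both coefficients can be computed from their definitions in plain Python. The (hours, marks) pairs below are hypothetical, and for simplicity the ranking helper does not average tied ranks (the notes average them):

```python
# Pearson's r from the summaries Sxy, Sxx, Syy, and Spearman's r from ranks.
import math

def pearson_r(xs, ys):
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

def spearman_r(xs, ys):
    def ranks(v):                      # rank 1 = smallest value (no ties here)
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

hours = [2, 4, 6, 8, 10]
marks = [35, 48, 60, 66, 78]
print(round(pearson_r(hours, marks), 3))
print(spearman_r(hours, marks))  # rankings agree perfectly here
```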


Example

The number of hours (x) spent studying for an examination by 8 students, together with the marks (y) achieved in the examination, are given in the table below.

i. Draw a scatter diagram for the data.

ii. Calculate the product-moment correlation coefficient r for the data.

iii. Calculate Spearman's rank correlation coefficient.

iv. State what the value of r indicates about the relation between x and y.

Solution:

From the scatter plot it is apparent that marks and time spent in revision have a strong positive correlation; the product-moment and Spearman's rank coefficients are then computed from the formulas above.

Exercise 12. The following data gives the age in months of a child and the corresponding weight in kg. (a) Plot a scatter diagram for the data and comment on it. (b) Calculate the product-moment correlation coefficient.


7.3. Linear Regression

Dependent variable: the variable that is being predicted or estimated. In a business setting the dependent variable could be profit or sales; it usually represents the output.

Independent variable: the variable that provides the basis for estimation, also called the predictor or explanatory variable. In a business setting the independent variables could be advertisement costs, number of salesmen, etc. The k independent variables (inputs) are usually denoted X1, X2, ..., Xk. If we can find a relationship between the output Y and the inputs X1, X2, ..., Xk

of the form

Y = β0 + β1X1 + β2X2 + ... + βkXk

The above equation is referred to as a multiple regression model with k predictors or independent variables. With a single predictor we have the simple linear regression model Y = β0 + β1X.

Using the ordinary least squares (OLS) method or the maximum likelihood estimator we can estimate β0 and β1 as follows. Given data on Y and X in n pairs (yi, xi), we compute

β̂1 = Sxy/Sxx and β̂0 = ȳ - β̂1x̄

where Sxy and Sxx are as defined for the correlation coefficient. The coefficient of determination, R², is the proportion of the variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X. It is the square of the coefficient of correlation and ranges from 0 to 1.
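The OLS estimates can be computed directly from the closed-form formulas. A minimal sketch with hypothetical (x, y) data that lie exactly on the line y = 1 + 2x:

```python
# Simple linear regression by OLS: b1 = Sxy/Sxx, b0 = y_bar - b1*x_bar.
def ols_fit(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1   # (intercept, slope)

b0, b1 = ols_fit([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(b0, b1)  # recovers intercept 1.0 and slope 2.0
```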

Example

The number of hours (x) spent studying for an examination by 8 students, together with the marks (y) achieved in the examination, are given in the table below.


i. Make a scatter graph for this data.

ii. Calculate the regression equation of Y on X and plot it on the same axes.

Solution:

Scatter plot

From the data we compute the OLS estimates using the formulas above.

Exercise 17

Consider Exercise 16 concerning the age in months of a child versus its weight at different points in time. Show that the regression equation is given by

Y = 0.785X + 5

What percentage of the variation in weight is explained by age?

Multiple regression

For multiple regressions of the form Y = β0 + β1X1 + β2X2 + ... + βkXk, we can use computer software for data analysis, such as SPSS, Stata, Gretl, EViews and R, among others, to fit a regression model. We are usually interested in knowing which variable(s) contribute significantly to the model and their direction of influence. To establish this, each coefficient should be tested for significance. If results are provided by a computer, the associated p-values are used to make the decision: if the p-value is less than the chosen significance level, the null hypothesis is rejected.

Example

A multiple regression model was fitted to some data with dependent variable y and independent variables x1, x2, x3, x4 and x5. The computer output (from the Gretl software) is as given below. Use it to answer the questions that follow.

i. Is the model valid?

ii. How good are the predictors in explaining y?

iii. Interpret the influence and relevance of each variable, including the constant.

Model 11: OLS, using observations 1-402

Dependent variable: Y

Solution:

i. Yes, the model is valid because the F-ratio = 83.92284 has a p-value less than the standard significance level of 0.05.

ii. From R-squared = 0.638 we can tell that the predictors explain 63.8% of the variation in y. This is a good level of fit, so x1, x3, x4 and x5 are good predictors of y.

iii. Since the p-values are less than 0.05 for the constant, x1, x3, x4 and x5, these terms are relevant to the model, but x2 is not and may be dropped. The best predictor is x1 (its t-value is largest in absolute value), with a positive influence.

7.3.1. Multiple regression with dummy variables

Now let us look at the case where some input variables are categorical. Suppose the researcher wants to include variables such as gender, marital status or employment category in the model as regressors; the solution is the use of dummy variables. Consider the following data as captured in an SPSS file.


Exercise 13. Suppose we wish to fit a regression model in which Maths is the dependent variable (Y) and the independent variables are taken to be Kiswahili (X1), English (X2), Home (X3) and Gender (X4). The regression function has the same general form

Y = β0 + β1X1 + β2X2 + ... + βkXk

but in this case, since Home and Gender are categorical variables, we need to give them appropriate codes. The codes have no numerical meaning but should indicate presence or absence. For example, gender should be coded as 1 for male and 0 for female (or vice versa), but NOT 1 for male and 2 for female. The 1 indicates the presence of maleness and the 0 its absence, implying femaleness. The same rule applies to the variable Home. The coded data should look as follows;

We can now use the standard procedure to get the regression output below. SPSS gives the output in three separate tables. It is wise to start with the second table, which gives the validity information for this model.

Although the first table tells us that 85.2% of the variation in Maths marks can be explained by the four explanatory variables, if the model is not valid this may all be useless. It is therefore important to look at the validity table before we become excited about the good performance of the model.

Since the F-ratio = 15.852 has a p-value less than the standard significance level of 0.05, the model is valid. We can report the results knowing that the model is not fitting the data by chance.


The influence and relevance of each predictor variable show that both Kiswahili marks and a student's gender significantly determine Maths performance. Home background and performance in English are irrelevant in this model. But how do we interpret the significant coefficients?

Answer

Since the unstandardized coefficient of Kiswahili is -0.433, we can tell that for every increase in Kiswahili marks by 1 mark, the Maths mark goes down by 0.433 marks. That is, Kiswahili marks can be used to predict Maths marks, and the higher the score in Kiswahili the lower the score in Maths. For gender this style of interpretation does not quite work: the unstandardized coefficient of Gender is 9.613, but we cannot sensibly say that for every increase in Gender by 1 unit the Maths mark goes up by 9.613. The correct way of saying this is that the presence of maleness increases the Maths score by 9.613 marks. In other words, being male significantly places a student at an advantage in Mathematics performance.

What happens when the variable has more than two categories? Just make it k - 1 dichotomous variables, where k is the number of categories in the variable. For example, a boarding category with k = 3 levels (Day, Boarding, Mixed) would become 2 variables, and those two variables would adequately represent the data. Suppose the variables are Boarding1 and Boarding2; then the following gives a simple display of the expected entries in the data file.

School type   Boarding1   Boarding2
Day           1           0
Boarding      0           1
Mixed         0           0

In reality Boarding1 represents day schools and Boarding2 represents boarding schools, while the absence of both (two zeros) implies a mixed school.

7.3.2. Dealing with Interaction terms

The effects of two explanatory variables are not always additive. For example, increasing the amount of nitrogen fertilizer (X) may improve the yield of wheat (Y), but only at high temperatures (Z), as shown in the following figure.

In the illustrative diagram, data are shown at only two values of Z to make clear that the increase in yield per unit increase in fertilizer (i.e. the slope for X) is greater at high temperatures (high Z) than at low temperatures (low Z). To model interaction between the effects of X and Z, a simple model that is often adequate adds a term involving the product of X and Z to the linear model,

Y = β0 + β1X + β2Z + β3XZ

This model can be written as Y = (β0 + β1X) + (β2 + β3X)Z, which is a general linear model because the unknown parameters appear linearly. When considering how the expected response is affected by changes to X, note that the slope with respect to X is β1 + β3Z: the effect of increasing X by 1 unit depends on Z.
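The Z-dependent slope can be illustrated numerically; the coefficient values below are hypothetical:

```python
# In the interaction model Y = b0 + b1*X + b2*Z + b3*X*Z, the slope
# with respect to X is b1 + b3*Z, so it changes with Z.
def slope_in_x(b1, b3, z):
    return b1 + b3 * z

b1, b3 = 0.5, 0.2          # hypothetical coefficients
print(slope_in_x(b1, b3, 10))  # steeper response at high Z
print(slope_in_x(b1, b3, 2))   # flatter response at low Z
```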

Suggested materials for further reading

1. Freund, J.E. and Williams, F.J. (1979). Modern Business Statistics. Pitman Publishing Limited, London.

2. Gupta, S.C. and Kapoor, V.K. (1995). Fundamentals of Mathematical Statistics. Sultan Chand and Sons, New Delhi.

3. Keller, G., Warrack, B. and Bartel, H. (1994). Statistics for Management and

Economics. 3rd Edition. Wadsworth Publishing Company, Belmont California, USA.

4. W. Douglas Stirling. Computer-Assisted Statistics Textbooks. Palmerston

North, New Zealand. http://cast.massey.ac.nz/african (Freely available online)


LESSON 8

Non-Linear Regression Analysis

Learning outcomes

Upon completion of this lesson you should be able to:

1. Identify some useful models for non-linear regression analysis

2. Distinguish between a logit and a probit model

3. Fit data with a binary dependent variable to an appropriate regression model

4. Interpret the results of a Cobb-Douglas production model

8.1. A linear model for proportions?

When we modelled how a numerical explanatory variable affected a numerical response variable, a linear equation was used, that is

y = b0 + b1x

When the response variable is categorical, it is tempting to try a similar linear equation to explain how the proportion in one response category is affected by the explanatory variable, that is, a predicted proportion

p = b0 + b1x

To model how a proportion depends on a numerical explanatory variable, X, an equation should give values between 0 and 1 for all possible values of X. This means that the equation must be nonlinear in X.

8.1.1. Logistic curve: A curve that lies between 0 and 1 for all values of X

A linear equation cannot provide adequate predictions of the proportion in a response category at extreme values of X. Various nonlinear equations satisfy the requirement that their value lies between 0 and 1 for all values of X, but the simplest of these is a logistic curve, with predicted proportion

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))


The numerator and denominator are always positive, so their ratio must be positive too.

The denominator is 1.0 greater than the numerator, so the ratio must be less than 1.

The goal is to model the probability of a particular category as a function of the predictor variable(s).

8.1.2. The parameters of the logistic curve

The constants b0 and b1 have a similar effect on the shape of the logistic curve to the corresponding parameters of a linear equation.

The parameter b0 determines the horizontal position of the curve. Increasing it shifts the curve to the left.

The parameter b1 determines the slope of the curve. Increasing it makes the curve steeper, and its sign determines whether the curve slopes upwards or downwards.

We again call b0 the intercept of the curve and b1 the slope.

8.1.3. Multiple logistic regression

Recall that the value produced by logistic regression is a probability between 0.0 and 1.0. If the probability for group membership in the modelled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modelled group. If the probability is below the cut point, the subject is predicted to be a member of the other group.

For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modelled category. For a dichotomous variable Y (1 for success and 0 for failure) the model for multiple input variables is

p = P(Y = 1) = 1 / (1 + e^-(b0 + b1x1 + b2x2 + ... + bkxk))


Logistic regression analysis requires that the independent variables be metric or dichotomous.

If an independent variable is nominal and not dichotomous, the logistic regression procedure in SPSS has an option to dummy-code the variable for you automatically. (For software without this option, create c - 1 dichotomous variables, where c is the number of categories.)
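The probability-then-cut-point mechanics described above can be sketched as follows. The read (0.098) and science (0.066) coefficients are the ones reported later in this example's interpretation; the intercept of -10.0 is a hypothetical value for illustration only:

```python
# Logistic prediction p = 1/(1 + exp(-z)) with z = b0 + b1*x1 + ... + bk*xk,
# followed by classification at the default 0.50 cut point.
import math

def predicted_probability(coeffs, xs):
    """coeffs = [b0, b1, ..., bk]; xs = [x1, ..., xk]."""
    z = coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))
    return 1 / (1 + math.exp(-z))

def classify(p, cut=0.50):
    return 1 if p >= cut else 0

# Hypothetical intercept; read and science scores for one student.
p = predicted_probability([-10.0, 0.098, 0.066], [60, 55])
print(round(p, 3), classify(p))  # probability is always strictly in (0, 1)
```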

We consider an example given by UCLA Academic Technology Services. These data were collected on 200 high-school students and are scores on various tests, including science, math, reading and social studies (socst), under the file name hsb2.sav. The variable female is a dichotomous variable coded 1 if the student was female and 0 if male. Because the raw data do not have a suitable dichotomous variable to use as our dependent variable, we create one (and call it honcomp, for honors composition) based on the continuous variable write (recode all values below 60 to zero (0) and all values above 60 to one (1)). The data would look like this.

The total data set consists of 200 cases, of which the first 17 are shown above. The dependent or response variable is "honcomp", while the predictors are read, science and ses. Run the logistic regression analysis in SPSS as follows:

1. Open hsb2.sav

2. Recode the variable write as explained above

3. Open the syntax file in SPSS and write the following commands:

logistic regression honcomp with read science ses
  /categorical ses.

4. Run the commands to get the logistic regression output, OR use the standard menus to run the procedure.


5. Output analysis: a number of tables are produced. Here we look at several entries that might not be straightforward.

Categorical Variable Codings

This table shows the automatic transformation of the categorical variable ses (which has 3 categories) into two dichotomous variables, low and middle. (Notice that when ses is not low or middle, then it is high.)

1. Observed - This indicates the number of 0s and 1s that are observed in the

dependent variable.

2. Step 1 - This is the first step (or model) with predictors in it.

3. Chi-square and Sig. - This is the chi-square statistic and its significance level. In this example, the statistics for the Step, Model and Block are the same because we have not used stepwise logistic regression or blocking. The value given in the Sig. column is the probability of obtaining the chi-square statistic given that the null hypothesis is true. In other words, this is the probability of obtaining this chi-square statistic (65.588) if there is in fact no effect of the independent variables, taken together, on the dependent variable. This is, of course, the p-value, which is compared to a critical value, perhaps 0.05 or 0.01, to determine if the overall model is statistically significant. In this case, the model is statistically significant because the p-value (reported as 0.000) is less than 0.05.

4. df - This is the number of degrees of freedom for the model. There is one degree of freedom for each predictor in the model. In this example, we have four predictors: read, science and two dummies for ses.


5. -2 Log likelihood - This is the -2 log likelihood for the final model. By itself,

this number is not very informative. However, it can be used to compare

nested (reduced) models.

6. Cox & Snell R Square and Nagelkerke R Square - These are pseudo R-squares. Logistic regression does not have an equivalent to the R-squared found in OLS regression; however, many people have tried to come up with one. There is a wide variety of pseudo R-square statistics (these are only two of them). Because these statistics do not mean what R-squared means in OLS regression (the proportion of variance explained by the predictors), we suggest interpreting them with great caution.

7. Observed - This indicates the number of 0s and 1s that are observed in the

dependent variable.

8. Predicted - These are the predicted values of the dependent variable based

on the full logistic regression model. This table shows how many cases are

correctly predicted (132 cases are observed to be 0 and are correctly predicted

to be 0; 27 cases are observed to be 1 and are correctly predicted to be 1), and

how many cases are not correctly predicted (15 cases are observed to be 0 but

are predicted to be 1; 26 cases are observed to be 1 but are predicted to be 0).

9. Overall Percentage - This gives the percentage of cases for which the dependent variable was correctly predicted given the model. In this part of the output, this is the null model. Note that 79.5% = 159/200.

10. B - This is the coefficient for the constant (also called the "intercept") in the null model.

S.E. - This is the standard error around the coefficient for the constant.


11. Wald and Sig. - This is the Wald chi-square test that tests the null hypothesis

that the constant equals 0. This hypothesis is rejected because the p-value

(listed in the column called "Sig.") is smaller than the critical p-value of .05

(or .01). Hence, we conclude that the constant is not 0. Usually, this finding

is not of interest to researchers.

12. df - This is the degrees of freedom for the Wald chi-square test. There is

only one degree of freedom because there is only one predictor in the model,

namely the constant.

13. Exp(B) - This is the exponentiation of the B coefficient, which is an odds ratio. This value is given by default because odds ratios can be easier to interpret than the coefficient. In this case the odds ratio for ses(2) is 0.363, implying that those with a middle level of ses are 1/0.363 = 2.75 times less likely than those in ses(3), the high level, to attain honors composition.

Model

The prediction equation, with p the probability of being in honors composition, is

log(p/(1 - p)) = b0 + b1*read + b2*science + b3*ses(1) + b4*ses(2)

Notice that for the variable x3 (ses) we have two dichotomous variables, x31 (ses(1) or low) and x32 (ses(2) or middle). Expressed in terms of the variables used in this example, the logistic regression equation substitutes the fitted B coefficients from the output into this form.

Interpretation of the parameters

read - For the variable read, the p-value is less than 0.001, so the null hypothesis that the coefficient equals 0 is rejected: for every one-unit increase in reading score we expect a 0.098 increase in the log-odds of honcomp, holding all other independent variables constant.

science - For the variable science, the p-value is .015, so the null hypothesis that the coefficient equals 0 is rejected: for every one-unit increase in science score we expect a 0.066 increase in the log-odds of honcomp, holding all other independent variables constant.

ses - For the variable ses, the overall p-value is .035, so the null hypothesis that the ses coefficients jointly equal 0 is rejected; this tells you that the overall variable ses is statistically significant. Because the overall test is significant, you can look at the one-degree-of-freedom tests for the dummies ses(1) and ses(2). The dummy ses(1) is not statistically significantly different from ses(3) (the omitted, or reference, category), but the dummy ses(2) is statistically significantly different from ses(3), with a p-value of .022. There is no single coefficient listed for ses, because ses itself is not a variable in the model; rather, the dummy variables which code for ses are in the equation, and those have coefficients. Since the reference group is level 3 (see the Categorical Variable Codings table above), the coefficient of ses(2) represents the difference between level 2 of ses and level 3. In this case the odds ratio for ses(2) is 0.363, implying that those with a middle level of ses are 1/0.363 = 2.75 times less likely than those in ses(3), the high level, to attain honors composition.

df - This column lists the degrees of freedom for each of the tests of the coefficients.

Exercise 14. The following tables are part of SPSS output for a logistic regression fitted to part of the hsb2.sav data.


Using the example above, interpret the model comprehensively and write down the

fitted equation.

8.2. Probit Model

Unlike the logistic model, which uses the cumulative logistic distribution (logit), the probit model uses the cumulative standard normal distribution (probit), given by

P(Y = 1) = Φ(z), where z = β0 + β1x1 + β2x2 + ... + βkxk

and Φ is the standard normal cumulative distribution function. Both logit and probit models assume that the dependent variable Y is dichotomous.

Note

Choosing between logit and probit: in the dichotomous case, there is no basis in statistical theory for preferring one over the other. In most applications it makes no difference which one is used.

With a small sample the two distributions can differ appreciably in their results, but they are quite similar in large samples.

Various R² measures have been devised for logit and probit, but they are ad hoc and cannot be compared to the R² of linear regression analysis.
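The two link functions can be compared directly; both map any z to a probability in (0, 1), and both give 0.5 at z = 0. A stdlib-only sketch (note the two functions are not on identical scales, so fitted coefficients differ even when predictions are similar):

```python
# Logit vs probit link: logistic CDF vs standard normal CDF (via erf).
import math

def logit_p(z):
    return 1 / (1 + math.exp(-z))

def probit_p(z):
    # Standard normal CDF expressed with the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (-1.0, 0.0, 1.0):
    print(z, round(logit_p(z), 3), round(probit_p(z), 3))
```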

For a detailed SPSS example of probit analysis, search for "Annotated SPSS Output: Probit Regression". It uses the data file probit.sav, on undergraduates applying to graduate school, which includes undergraduate GPAs, the reputation of the undergraduate school (a topnotch indicator), the students' GRE scores, and whether or not the student was admitted to graduate school, and it gives a detailed explanation of the results.


8.3. CobbDouglas functional form of production functions

The Cobb-Douglas production function is widely used to represent the relationship between output and two inputs. The Cobb-Douglas form was developed and tested against statistical evidence by Charles Cobb and Paul Douglas during 1900-1947.

8.3.1. Formulation

In its most standard form for production of a single good with two factors, the function is

Y = A L^β K^α

where:

Y = total production (the monetary value of all goods produced in a year)

L = labour input (the total number of person-hours worked in a year)

K = capital input (the monetary worth of all machinery, equipment, and buildings)

A = total factor productivity

α and β are the output elasticities of capital and labour, respectively. These values are constants determined by available technology. Output elasticity measures the responsiveness of output to a change in the level of either labour or capital used in production, all other things being held constant. For example, if β = 0.15, a 1% increase in labour would lead to approximately a 0.15% increase in output. Empirically it was found that about 75% of the increase in output could be attributed to increased labour input and the remaining 25% to capital input. It was also found that the sum of the exponents of the Cobb-Douglas production function is equal to one, that is α + β = 1, which implies that it is a linearly homogeneous production function. The following are important features of the Cobb-Douglas production function:

1. The average product of the factors of production used in this function depends upon the ratio in which the factors are combined for the production of the commodity under consideration.


2. The marginal product of the factors of production used in this function also depends upon the ratio in which the factors are combined for the production of the commodity under consideration.

3. The Cobb-Douglas production function is used to obtain the marginal rate of technical substitution (the rate at which one input can be substituted for the other while producing the same level of output) between the two inputs.

4. As seen earlier, the sum of the exponents of the Cobb-Douglas production function is equal to one, i.e. α + β = 1. This sum is a measure of returns to scale:

(a) when α + β = 1, there are constant returns to scale,

(b) when α + β < 1, returns to scale are decreasing, and

(c) when α + β > 1, returns to scale are increasing.

Cobb and Douglas were influenced by statistical evidence that appeared to show that the labour and capital shares of total output were constant over time in developed countries; they explained this by fitting their production function with least-squares regression. There is now doubt over whether such constancy over time exists. [These notes are adapted from Wikipedia.]

8.3.2. Application

Using available data, we can take the natural log of each data series to create variables in log levels rather than levels, giving

ln Y = ln A + β ln L + α ln K

We then carry out standard regression analysis on the transformed data.

Example. Suppose that, for some data, the fitted model with all parameters significant has regression equation

ln Yt = 7.08 + 0.94 ln(L) + 0.51 ln(K)

and R² = 0.9975. Interpret this model.

Solution: Since all parameters are significant, their corresponding p-values are less than 5%. The 0.94 estimate for the labour elasticity β indicates that a 10 percent increase in L leads to a 9.4 percent increase in the output level Y, which implies diminishing returns to labour. Similarly, the 0.51 estimate for the capital elasticity α indicates that a 10 percent


increase in the K leads to a 5.1 percent increase in the output level, which implies

there are diminishing returns to capital.

However, α + β = 1.45 > 1 (the sum is greater than one), which implies production exhibits increasing returns to scale. Increasing returns to scale mean that a proportionate increase in all inputs leads to a more than proportional increase in the output. For example, doubling all inputs would lead to more than a doubling of output; in this case, a one hundred percent increase in (or doubling of) the inputs leads to a 145 percent increase in the output level.
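The log-transform-then-regress procedure can be sketched in Python on hypothetical data; the firm count, the true elasticities 0.7 and 0.3, and all variable names below are invented for illustration, so that we can check the regression recovers them:

```python
import numpy as np

# Hypothetical data: 200 observations generated from a known Cobb-Douglas
# function Y = 2.5 * L^0.7 * K^0.3 with multiplicative noise.
rng = np.random.default_rng(0)
n = 200
L = rng.uniform(50, 500, n)    # labour input
K = rng.uniform(20, 300, n)    # capital input
Y = 2.5 * L**0.7 * K**0.3 * np.exp(rng.normal(0, 0.05, n))

# Log-transform: ln Y = ln A + bL*ln L + bK*ln K, then fit by least squares.
X = np.column_stack([np.ones(n), np.log(L), np.log(K)])
coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
lnA, bL, bK = coef
print(f"A = {np.exp(lnA):.2f}, labour elasticity = {bL:.3f}, "
      f"capital elasticity = {bK:.3f}")
```

The estimated elasticities come out close to the true 0.7 and 0.3, mirroring how the fitted equation in the example above would be obtained from real data.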

Suggestions for further reading /reference

1. Johnston, J. (1972). Econometric Methods, 2nd Edition, McGraw-Hill Kogakusha, Ltd, Tokyo.

2. http://www.ats.ucla.edu/stat/spss/output/SPSS_probit.htm

3. http://www.ats.ucla.edu/stat/spss/output/logistic.htm


LESSON 9

Index numbers

Learning outcomes

Upon completion of this lesson you should be able to:

1. Define an index number and describe its properties.

2. Compute simple and composite price indices

3. Compute Laspeyres, Paasche and Fisher's indices and interpret them

4. Apply the process of deflating to time series data.

9.1. Index numbers

An index number measures the value of an item (or group of items) at a point in

time as a percentage of the value of the item (or group of items) at another fixed

time point.

9.1.1. Price and quantity indices

There are many types of indexes (or indices). For example, price indices are used to measure changes in the prices of items over time, while quantity indices are used to measure changes in quantities, such as imports or exports, over time. In this section only price indices are considered, but many of the principles and formulae carry over to other types of index numbers. Price indices are widely used to describe business activity; index numbers describing consumer prices and stock market prices are widely reported in the media. An index number can describe a specific category of item or may be more general.

9.1.2. CPI and stock market indices

The Consumers Price Index (CPI) in a country summarises the overall price level

of goods and services purchased by households at different times. Other price indices describe prices of energy, accommodation and various classes of food. Stock

market indices such as the NSE 20 share index, Dow Jones (USA), FTSE 100 (UK)

and NZX50 (New Zealand) are used to summarize changes in the value of company

shares in specific countries.


9.2. Simple price index

A simple price index measures the price of a single item or commodity as a percentage of the price of the same item at a fixed time, normally in the past. The fixed time is called the base period and may be chosen for convenience (e.g. January 1st for daily data, or 2010 for annual data). Assuming the time period is years, if P0 denotes the price in the base year and Pi denotes the price in year i, then the index number for year i is given by

Ii = (Pi / P0) × 100

The simple price index is just the current price expressed as a percentage of the base-year price. However, some index numbers use a factor of 1000 rather than 100, especially if it is desired to express the index as a whole number.

Example. The table below shows the spot price of European Brent oil (in US dollars per barrel) from 2000 to 2005. Find the index numbers using 2000 as the base year.

Solution: Using 2000 as the base year, the price index for 2001 equals (P2001 / P2000) × 100. The full series is shown below.

2000 2001 2002 2003 2004 2005

The index allows us to measure changes as a percentage of the base year. In 2001 the price was about 15% lower than in 2000 (85.35 - 100 = -14.65) but by 2009 it was 115% higher (215.42 - 100 = 115.42). Note that:

An index value below 100 means that the price that year was lower than the base year


An index value above 100 means that the price that year was higher than the

base year

The base year always has an index of 100 (or 1000).
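The simple index computation can be sketched in a few lines of Python; the prices below are hypothetical, since the actual Brent figures are not reproduced in these notes:

```python
def simple_index(prices, base_year, factor=100):
    """Each year's price as a percentage of the base-year price."""
    p0 = prices[base_year]
    return {year: round(p / p0 * factor, 2) for year, p in prices.items()}

# Hypothetical prices (not the Brent data)
prices = {2000: 28.4, 2001: 24.2, 2002: 25.0}
index = simple_index(prices, base_year=2000)
print(index)  # the base year always maps to 100
```

Passing factor=1000 would produce an index on the 1000 scale instead.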

Changing the base year

In practice the base year is revised from time to time so that comparisons can be made with a recent (i.e. not ancient) price value. For example, the quarterly New Zealand Consumers Price Index has a current base of June 2006. Converting an existing index to a new base is quite straightforward:

Inew = (Iexisting / Inewbase) × 100

Here "new" refers to the index using the new base, "existing" refers to the index using the existing base, and "newbase" refers to the index for the new base year under the existing base year.

9.3. Aggregate price index

An aggregate price index combines the prices of several related items into a single index number. The group of items is sometimes referred to as a market basket, or basket for short. There are many examples of aggregate indices: the NZX50 index aggregates the prices of the top 50 companies (as measured by market capitalisation) listed on the New Zealand Stock Exchange, as does the Nairobi 20 Share Index in Kenya; the quarterly Consumers Price Index (CPI) aggregates the prices of a range of food and related household shopping items and is commonly used as a measure of price inflation in an economy.

9.3.1. Unweighted aggregate price index

There are two types of aggregate price indices. The simpler type is known as an unweighted aggregate price index, so called because it gives equal weight to each item in the basket. If there are n items in the basket, then the unweighted aggregate price index at time i is given by

Ii = (Σj Pi(j) / Σj P0(j)) × 100

where Pi(j) and P0(j) denote the prices of the jth item in the basket at time i and at the base time respectively.


An unweighted price index treats each item in the basket equally. For a price index, this is usually equivalent to assuming that a consumer purchases the same amount of each item. In practice this is usually not the case, and so a weighted aggregate price index weights the price of each item by the quantity purchased. There are two ways of doing this.

9.4.1. Laspeyres index

This uses the quantities of items in the base period as weights. The formula for computing the index is

LIi = (Σj Pi(j) Q0(j) / Σj P0(j) Q0(j)) × 100

where Q0(j) denotes the quantity of the jth item in the basket at the base time.

Exercise 22. Apply the formula for the Laspeyres index to the prices for 2006 for the data in the table above, assuming the quantity for 2006 was the same as that for 2005.

9.4.2. Paasche index

This uses the quantities of items in the current period as weights, taking account of variations in consumption patterns over time. For example, there may be a trend for consumers to use margarine instead of butter between 2000 and 2010. The Laspeyres index for dairy products in 2010 is based on out-of-date consumption patterns from 2000, whereas the Paasche index for 2010 is based on current consumption of the items. The formula is

PIi = (Σj Pi(j) Qi(j) / Σj P0(j) Qi(j)) × 100

where Qi(j) denotes the quantity of the jth item in the basket at time i.

Exercise 15. Obama owns stock in three companies. Shown below are the price per share at the end of 2000 and 2007 for the three stocks and the quantities he owned in 2000 and 2007. Using 2000 as the base year, compute the Laspeyres weighted price index (LI), the Paasche weighted price index (PI) and the value index (VI). Interpret the value index.

Company Price Quantity
10 22 35 30 1.5 60 70 10

The Laspeyres index tends to overweight goods whose prices have increased; the Paasche index, on the other hand, tends to overweight goods whose prices have gone down. Fisher's ideal index (FI) was developed in an attempt to offset these shortcomings. It is the geometric mean of the Laspeyres and Paasche indices, that is

FI = √(LI × PI)
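The three indices can be computed with a short Python sketch; the basket below is hypothetical, not the exercise data:

```python
import math

def laspeyres(p0, p1, q0):
    """Base-period quantities as weights."""
    return 100 * sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0))

def paasche(p0, p1, q1):
    """Current-period quantities as weights."""
    return 100 * sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))

def fisher(p0, p1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return math.sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))

# Hypothetical basket of three items
p0, p1 = [10, 22, 35], [12, 20, 40]   # base and current prices
q0, q1 = [30, 15, 60], [25, 20, 55]   # base and current quantities
LI = laspeyres(p0, p1, q0)
PI = paasche(p0, p1, q1)
FI = fisher(p0, p1, q0, q1)
print(round(LI, 2), round(PI, 2), round(FI, 2))
```

By construction, Fisher's index always lies between the Laspeyres and Paasche values.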

Exercise 16. Using the previous example, obtain Fisher's ideal index.

Test for an ideal index number

It is considered that a perfect index number should pass the following test. The time reversal test: if we reverse the time subscripts of a price (or quantity) index, the result should be the reciprocal of the original index, that is

P0n × Pn0 = 1

where P0n is the price index for the current year n with base period 0, and Pn0 is the price index for year 0 with base period n. Neither the Laspeyres index nor the Paasche index satisfies the time reversal test, but Fisher's ideal index does. Other tests that may be used to test for an ideal index number are the factor reversal test, the circular test and the proportionality test.

Exercise. Use the previous example to test whether the Paasche and Laspeyres price indices are ideal indices using the time reversal test.


9.5. Deflating a time series

Many time series display the effects of more than one variable changing over time.

For example, changes in the NZ price of an item sourced in the USA will reflect

changes in the NZ$/US$ exchange rate as well as changes in the US$ price. If an

index is available which measures the effect of such a variable then its effect can

be removed by a process of deflating. The idea is similar to that of detrending or

deseasonalising a time series (See the relevant section). If Xi denotes the time series

value at any time i and Ii and I0 denote the index values at time i and the base time

respectively then the deflated value Di is given by

Di = Xi × (I0 / Ii)

9.5.1. Correcting for inflation

This kind of adjustment is often used to take account of inflation. Although it is interesting to know that Tarakihi cost $25.43 per kg in 2008 but only $19.20 per kg in 2005, an increase in price is hardly surprising when wages and all other prices rose in that period. The Consumer Price Index (CPI) is often used to adjust for inflation. Since the CPI was 953 in 2005 and 1044 in 2008 (based on a CPI of 1000 in June 2006), the price of Tarakihi in 2008 can be expressed in "2005 dollars" as:

25.43 × (953 / 1044) = 23.21

In 2005 dollars, the price of Tarakihi rose from $19.20 in 2005 to $23.21 in 2008.
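The deflation can be checked with a few lines of Python, using the CPI figures quoted above:

```python
def deflate(value, index_t, index_base):
    """Express a value in base-period money: D = X * I_base / I_t."""
    return value * index_base / index_t

# CPI figures from the text: 953 in 2005, 1044 in 2008 (1000 = June 2006)
price_2008 = 25.43
in_2005_dollars = deflate(price_2008, index_t=1044, index_base=953)
print(round(in_2005_dollars, 2))  # 23.21
```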

Exercise 17. The table below shows New Zealand's Gross Domestic Product from 2002 to 2008 together with the CPI (Source: Statistics NZ website). Note that the CPI uses a factor of 1000 rather than 100 and has a base quarter of June 2006.

New Zealand GDP ($millions) and CPI

Suggestions for further reading /reference

1. Gupta, S.C. and Kapoor, V.K. (1995). Fundamentals of Mathematical Statistics. Sultan Chand and Sons, New Delhi.

2. Keller, G., Warrack, B. and Bartel, H. (1994). Statistics for Management and

Economics. 3rd Edition. Wadsworth Publishing Company, Belmont California, USA.


3. W. Douglas Stirling. Computer-Assisted Statistics Textbooks. Palmerston

North, New Zealand. http://cast.massey.ac.nz/african (Freely available online)

4. Mason, R. D., Lind, D. A. and Marchal, W. G. (1999). Statistical Techniques in Business and Economics. Irwin McGraw-Hill, Boston. ISBN-10:

0256263078, ISBN-13: 978-0256263077, Edition: 10th

5. Douglas Lind, William Marchal, Samuel Wathen (2009). Statistical Techniques in Business and Economics with Student CD [Hardcover], ISBN-10:

0077309421, ISBN-13: 978-0077309428, Edition: 14

6. Thomas H. Wonnacott, Ronald J. Wonnacott (1990) Introductory Statistics

for Business and Economics, 4th Edition, John Wiley and Sons Inc. [Hardcover] ISBN-10: 047161517X , ISBN-13: 978-0471615170


LESSON 10

Basics of Time Series Analysis

Learning outcomes

Upon completion of this lesson you should be able to:

1. Identify and describe the main components of a time series

2. Smooth a given time series

3. Fit a given time series to a linear trend curve

4. Fit seasonal data to an exponential model

5. Forecast future values of a time series

10.1. Time series data

Many data sets contain measurements that are made sequentially at regular intervals. These data are called time series.

Exporters look at recent currency exchange rates to help predict future movements that will affect the price of their products in foreign markets.

Manufacturers collect data regularly on the quality of their products. For example, the fat content of milk is likely to be recorded daily by a bottling plant.

Climatologists analyze historical records of weather to assess the evidence for

global warming.

Retail chains monitor changes in the population in different regions to help

determine where new stores should be located.

Health scientists examine time series of the number of influenza cases to help

predict demand for vaccines.

Definition 7

A discrete time series is a sequence of observed values {x1, x2, ..., xn} measured at discrete times {t1, t2, ..., tn}. In other words, a time series is any statistical data that is arranged according to the time it was recorded or observed (chronologically). The


time interval is usually regular and may be any of the time units existing naturally

(milliseconds, seconds, minutes, hours, days, weeks, months, years, decades, ... ,

millenniums). Time series data are widely analysed in the business world; accurate forecasting of exchange rates, share prices, demand for products and other business variables can have a major effect on profitability. There are numerous reasons to record and analyse the data of a time series. The main ones are:

To explore and extract signals (patterns) contained in time series in order to

gain a better understanding of the data generating mechanism,

To explain (i.e. variation in one time series may be used to explain variation in another time series)

To make forecasts (i.e. predict future values)

To use the acquired knowledge to optimally control systems and processes.

10.2. Types of time series data

Time series data arise in various different contexts.

Measurements relating to events that occur at discrete times e.g. the dividend

paid out each year on British Airways shares

Regular snapshots of a continuous process e.g. the Consumer Price Index at

the end of each month

Quantities that are aggregated over a period e.g. the numbers of admissions

to a hospital each day, or monthly energy consumption

Further, the measurements themselves may be of various different types.

Continuous e.g. the fat content of homogenized milk produced by a bottling factory, or the signal of an electric voltage passing a particular point

Discrete e.g. number of complaints received by a department store each day

In this course we do not need to distinguish further between the different types of time series; our main focus will be on the discrete type.


10.3. Components of a time series

There are four components (forces that determine the observed values) of a time

series. A few patterns in time series are particularly important.

10.3.1. Trend

These are the long term movements which give the general way in which the data

move over a long period of time. A graph of observed value versus time may show

small ups and downs, but the long term movement eliminates these minor variations

and looks at the big picture. If you draw a graph of this trend, it is called a trend

curve. Identifying trend is important since we might use it to help forecast future

values.

10.3.2. Cyclic Movements

These are movements that happen in regular long-term cycles. In business, for example, cycles consist of alternating periods of recession and inflation, recovery and prosperous times. Cyclic movements are large scale and should be very clear in the data. Examples include the collapse of the Kenya Bus Service, Uchumi Supermarkets, etc.

10.3.3. Seasonal Movements

These are also movements that happen in regular cycles, but repeating yearly. The patterns are caused either by natural conditions such as weather fluctuations or by man-made conditions such as business, administrative and political procedures, the start and end of semesters, Easter holidays, festive seasons, etc.

10.3.4. Random or irregular fluctuations

These are ups and downs in a time series that do not correspond to trend, seasonal variation or autocorrelation. They are unpredictable and result from chance events such as strikes, floods, earthquakes, plane crashes, post-election violence, etc.

The timeplot

The key components can easily be seen in Figure 10.1. (The data used here are available online and also as part of the Gretl open-source software; download them and confirm the features.)


It is often difficult to get useful information from time series if they are presented

in tabular form. As seen in Figure 10.1, information in a time series is most

easily understood from a graphical display. A time series plot is a type of dot plot

in which the values are displayed as crosses against a vertical axis. The horizontal

axis spreads out the crosses in time order. (It can also be thought of as a scatter plot

in which the explanatory variable is time.) The figure clearly brings out the features discussed earlier: the trend is steady and upwards, seasonal fluctuations dominate the data, and some irregular patterns are also evident.

Exercise 18. The table below shows the number of driving licences approved for citizens in Mombasa, Kenya each year from 1978 to 2001. Plot the series and describe its features. (You may use Excel, R or Gretl.)

Several related time series can be superimposed with different colours on the same

display, making comparisons easier. The crosses at the individual data points are

often omitted to reduce the clutter of the display.

10.4. Smoothing of a time series

In a time series, random fluctuations can usually be treated as noise that can obscure

trend and other signal in the series. Various smoothing methods have been proposed

to reduce these random fluctuations and show the systematic movement in the series

more clearly. These methods replace each value in the series with a function of it and the adjacent values:

smoothed value = centre(original value and adjacent values)

For example, each value might be replaced by the mean of it and the two adjacent values, replacing the value at time i by

mean(x(i-1), x(i), x(i+1))

This smoothed fit is called a 3-point moving average. Moving averages are also called running means. Greater smoothing is obtained with means of more adjacent values. For example, a 5-point moving average replaces each value with the mean of it and the 2 adjacent values on each side.

Figure 10.2: A smoothed value may therefore be for "year 2005.5", which is far from ideal.

Loss of ends of values

Moving averages are effective at highlighting the trend in the centre of a time series,

but cannot be used at the ends since the moving average requires values both before

and after each value being smoothed. As the span (order of moving average) of

smoothing increases, the number of un-smoothed values at the ends of the series

also increases. For example, if 7-point moving averages are used, 3 values at each

end of the series cannot be smoothed.

10.4.1. Moving average with odd and even run lengths

A moving average provides a smoothed value at the middle of the times of the

values being averaged. For example, if the run length is 5, the smoothed value

is identified with the middle time. This works fine for moving averages of odd

numbers of values, as used on the previous page. However, if moving averages are computed using an even number of values, the resulting smoothed values fall half-way between the times of the middle observations.

A second stage of averaging

To avoid this problem, it is conventional to post-process moving averages with an even run length by taking a further 2-point moving average, to get values centred on the original times.


This is equivalent to giving half weight to the two outermost values. When based

on a 4-point moving average, this method therefore uses an average of the 3 values

centered on each value and two further values with half-weight.

Example. Given the following seasonal data, obtain the smoothed trend values assuming the additive model.

Solution: To completely remove the seasonal pattern, the order of the moving average should be a multiple of 4 (the period of the data). In this case we use 4 in order to get moving averages whose main component is the trend.
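The centred 4-point moving average described above can be sketched in Python; the quarterly series below is hypothetical, not the example's data:

```python
def moving_average(series, span):
    """Span-point moving average; shorter than the input by span - 1 values."""
    return [sum(series[i:i + span]) / span for i in range(len(series) - span + 1)]

def centred_ma4(series):
    """4-point moving average followed by a 2-point moving average, so that
    the smoothed values line up with the original (quarterly) time points."""
    return moving_average(moving_average(series, 4), 2)

quarters = [23, 31, 42, 27, 26, 33, 45, 29]  # hypothetical quarterly series
print(centred_ma4(quarters))  # [31.125, 31.75, 32.375, 33.0]
```

Note that the smoothed series has lost two values at each end, illustrating the "loss of ends" discussed earlier.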

10.4.2. Robust smoothing

Moving averages and running medians each have their advantages and disadvantages.

Moving averages are more affected by outliers in the series.

Running medians often have a stepped appearance: the smoothed series is level for periods, followed by relatively sharp jumps.


10.4.3. Running medians, followed by moving averages

To take advantage of the best features of both moving averages and running medians, these two techniques are often applied sequentially.

Firstly, low-order running medians are used to remove the influence of outliers.

The resulting series is then further smoothed with low-order moving averages.

10.4.4. Limitations of moving averages

We used moving averages to smooth out the seasonal variation in a time series, but

this method has serious limitations.

Moving averages cannot be used for the ends of a time series. For monthly

data, this means that we cannot remove the seasonal variation from the last 6 months, usually the most important part of the series.

The smoothing is only local. The moving average only uses values in the

current cycle so we are not using information from other cycles to determine

the seasonal pattern more accurately.

The method does not provide forecasts of future values in the series.

10.5. Long-term trend and Forecasting

Moving averages provide a good description of the trend in a time series. However

a common goal in time series analysis is to forecast values of the time series in

the future. For example, accurate forecasting of the demand for a product allows

production capacity to be adjusted in time to meet changes to the demand. Moving

averages cannot smooth the end values of the series and do not provide a method to

extend the trend into the future. The best that can be done is to extend the trend by

eye, hardly an objective forecasting method!

10.5.1. Least squares for a polynomial fit

Linear trend. An alternative is to describe the trend with a mathematical equation which models the trend as a function of time,


trend = f(t)

where the function usually involves some constants (parameters) that can be adjusted to improve the fit of the model. The simplest such model is a linear model of

the form

trend = b0 + b1t

This model has the same form as the linear models seen earlier, and the residuals are the differences between the actual time series values, y, and the model's predictions,

et = yt - (b0 + b1t)

The two model parameters are estimated by least squares to minimize the sum of squares of the residuals,

S = Σ et² = Σ (yt - b0 - b1t)²
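The least-squares trend fit can be sketched in Python on a hypothetical annual series (the data values below are invented for illustration):

```python
import numpy as np

# Hypothetical annual series with an upward trend plus noise
t = np.arange(1, 11)
y = np.array([12.1, 13.8, 15.2, 16.9, 18.1, 20.2, 21.8, 23.9, 25.1, 27.2])

# Least-squares estimates of b0 and b1 in trend = b0 + b1*t
b1, b0 = np.polyfit(t, y, 1)
print(f"trend = {b0:.2f} + {b1:.2f} t")

# Unlike a moving average, the fitted line extends beyond the data:
forecast_t12 = b0 + b1 * 12
print(round(float(forecast_t12), 2))
```

This also illustrates how the fitted equation supplies forecasts for future periods, something moving averages cannot do.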

Exercise 26. Find the trend equation for the time series below.

Quadratic trend

A linear trend is not appropriate for all time series. Many trends have curvature

which must be described with a more complex model. We now briefly describe

fitting a quadratic trend of the form

trend = b0 + b1t + b2t²

A quadratic curve of this form has three parameters that can be adjusted to improve

the fit of the model. We again define residuals to be the differences between the

actual time series values, y, and the model's predictions,

ei = yi - trendi

The least squares estimates of the three parameters are again the values that minimize the residual sum of squares,

S = Σ ei²

To decide which model is more appropriate, compare the adjusted R² and standard error with those of the linear model to see whether the quadratic is an improvement.

Dangers in forecasting

It is important to realise that the forecasts from linear or quadratic models are highly dependent on the type of line or curve that is chosen for modelling. The dangers are the same as those for extrapolation in bivariate relationships. Note: beware of


forecasting many time periods into the future; the shape of the actual trend line might be different from your model.

Cubic and higher-degree polynomial models

If a quadratic model does not adequately describe the shape of the trend in a time

series, it is tempting to try to further increase the order of the polynomial,

trend = b0 + b1t + b2t² + b3t³

This kind of polynomial model can also be fitted by least squares. A polynomial of degree 3 or 4 often provides a fairly smooth description of trend, but polynomial models usually behave badly (with sudden increases or decreases) beyond the data points, so polynomial models of degree greater than 2 should be avoided for forecasting.

Detrending a time series

The residuals form the detrended series, and the process of removing the trend is

called detrending. Detrending will often reveal interesting features that were obscured by the trend, and which may be important in explaining the past or forecasting the future. It is therefore useful to look for patterns in a time series plot of the

residuals. If the model under consideration fits well, there should be no pattern in

the residuals; each should have the same chance of being positive or negative. If

there are systematic patterns in the residuals, it may be possible to use a different

model for the trend (e.g. a quadratic rather than a linear model), but time series

often exhibit patterns that cannot be explained with simple models for the trend.

10.5.2. Exponential Trend

The trend curve is of the form

Yt = β0 β1^t εt

Taking logarithms of both sides leads to

log Yt = log β0 + t log β1 + log εt

which we can write as

Y't = A + Bt + et

where A = log β0, B = log β1 and Y't = log Yt. The method of least squares requires that we minimize the function


S = Σ(Y't - A - Bt)²

For quarterly data, the exponential trend can be extended with seasonal indicator (dummy) variables for the first three quarters, giving a model of the form

Yt = b0 b1^t b2^Q1 b3^Q2 b4^Q3

where Q1, Q2 and Q3 equal 1 in the corresponding quarter and 0 otherwise. Here (b1 - 1) × 100% is the quarterly compound growth rate, and bi provides the multiplier for a quarter relative to the 4th quarter (i = 2, 3, 4). Taking logarithms of both sides gives a model that is linear in the parameters ai = log bi, which can be fitted by least squares; the estimates are then converted back with bi = 10^ai, i = 1, 2, 3, 4:

b2 is the estimated multiplier for the first quarter relative to the fourth quarter,

b3 is the estimated multiplier for the second quarter relative to the fourth quarter,

b4 is the estimated multiplier for the third quarter relative to the fourth quarter.

Note the resemblance of this model to the Cobb-Douglas model discussed earlier.

LESSON 11

Constrained maxima and minima and the method of Lagrange multipliers

11.1. The Method of Lagrange Multipliers

To find the relative extremum of the function f(x, y) subject to the constraint g(x, y) = 0:

1. Form an auxiliary function F(x, y, λ) = f(x, y) + λ g(x, y), called the Lagrangian function. The variable λ is called the Lagrange multiplier.

2. Solve the system comprising the equations Fx = 0, Fy = 0 and Fλ = 0 for all values of x, y and λ.

3. Evaluate f at each of the points (x, y) found in step 2. The largest (smallest) of these values is the maximum (minimum) value of f.


Example. Using the method of Lagrange multipliers, find the relative minimum of the function f(x, y) = 2x² + y² subject to the constraint x + y = 1.

Solution: Write the constraint equation x + y = 1 in the form g(x, y) = x + y - 1 = 0. Then form the Lagrangian function

F(x, y, λ) = f(x, y) + λ g(x, y) = 2x² + y² + λ(x + y - 1)

To find the critical points of F, solve the system of equations

Fx = 4x + λ = 0, Fy = 2y + λ = 0, Fλ = x + y - 1 = 0

Solving the first and second equations for x and y in terms of λ, we obtain x = -λ/4 and y = -λ/2. Substituting into the third equation yields -λ/4 - λ/2 - 1 = 0, or λ = -4/3. Therefore x = 1/3 and y = 2/3, and (1/3, 2/3) affords a constrained minimum of the function f.
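The result can be checked numerically with a simple Python grid search along the constraint (the grid resolution is an arbitrary choice for illustration):

```python
# Grid search along the constraint x + y = 1 to confirm that the
# constrained minimum of f(x, y) = 2x^2 + y^2 is at (1/3, 2/3).
def f(x, y):
    return 2 * x**2 + y**2

candidates = ((f(x, 1 - x), x) for x in (i / 10000 for i in range(-10000, 20001)))
fmin, xmin = min(candidates)
print(round(xmin, 3), round(1 - xmin, 3), round(fmin, 4))  # 0.333 0.667 0.6667
```

The grid minimum agrees with the Lagrange solution: x = 1/3, y = 2/3, with f(1/3, 2/3) = 2/3.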

Example. Use the method of Lagrange multipliers to find the minimum of the function f(x, y, z) = 2xy + 6yz + 8xz subject to the constraint xyz = 12000.

Solution: Write xyz = 12000 in the form g(x, y, z) = xyz - 12000 = 0. The Lagrangian function is

F(x, y, z, λ) = f(x, y, z) + λ g(x, y, z) = 2xy + 6yz + 8xz + λ(xyz - 12000)

Differentiating partially with respect to x, y, z and λ gives the system

Fx = 2y + 8z + λyz = 0

Fy = 2x + 6z + λxz = 0

Fz = 6y + 8x + λxy = 0

Fλ = xyz - 12000 = 0

Solving the first three equations of the system for λ in terms of x, y and z, we have

λ = -(2y + 8z)/yz = -(2x + 6z)/xz = -(6y + 8x)/xy

Equating the first two expressions for λ leads to

(2y + 8z)/yz = (2x + 6z)/xz

2xy + 8xz = 2xy + 6yz, so x = (3/4)y

Equating the second and third expressions for λ yields z = (1/4)y. Finally, substituting these values into the equation xyz - 12000 = 0 gives

((3/4)y)(y)((1/4)y) - 12000 = 0

y³ = (12000)(4)(4)/3 = 64000


or y = 40. Hence x = (3/4)(40) = 30 and z = (1/4)(40) = 10. Therefore the point (30, 40, 10) gives the constrained minimum of f:

f(30, 40, 10) = 2(30)(40) + 6(40)(10) + 8(30)(10) = 7200

Application problems

Example . The total weekly profit (in dollars) that Acrosonic company realized

in producing and selling its bookshelf loudspeaker systems is given by the profit

function

P(x, y) = 1/4x2 3/8y2 1/4xy + 120x + 100y 5000

Where x denotes the number of fully assembled units and y the number of kits

produced and sold per week. The management decides that production of the loudspeaker systems should be restricted to a total of exactly 230 units per week. Under

this condition, how many fully assembled units and how many kits should be produced per week to maximize Acrosonic's weekly profit?

Solution: We maximize the function

P(x, y) = -(1/4)x^2 - (3/8)y^2 - (1/4)xy + 120x + 100y - 5000

subject to the constraint

g(x, y) = x + y - 230 = 0

The Lagrangian function is

F(x, y, λ) = P(x, y) + λg(x, y)
= -(1/4)x^2 - (3/8)y^2 - (1/4)xy + 120x + 100y - 5000 + λ(x + y - 230)

To find the critical points, solve the following system of equations:

Fx = -(1/2)x - (1/4)y + 120 + λ = 0
Fy = -(3/4)y - (1/4)x + 100 + λ = 0
Fλ = x + y - 230 = 0

Solving the first equation for λ gives λ = (1/2)x + (1/4)y - 120.

Substituting in the second equation gives

-(3/4)y - (1/4)x + 100 + (1/2)x + (1/4)y - 120 = 0
-(1/2)y + (1/4)x - 20 = 0, or y = (1/2)x - 40

Substituting in the third equation gives x + (1/2)x - 40 - 230 = 0, that is, x = 180; hence y = (1/2)(180) - 40 = 50.

The maximum weekly profit is given by

P(180, 50) = -(1/4)(180)^2 - (3/8)(50)^2 - (1/4)(180)(50) + 120(180) + 100(50) - 5000 = 10,312.50
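Since the constraint forces y = 230 - x, the answer can also be confirmed by a one-variable search over whole units (an illustrative sketch, not part of the original solution):

```python
def P(x, y):
    # Acrosonic weekly profit function
    return (-0.25 * x**2 - 0.375 * y**2 - 0.25 * x * y
            + 120 * x + 100 * y - 5000)

# Exactly 230 units per week: y = 230 - x
best_x = max(range(231), key=lambda x: P(x, 230 - x))
print(best_x, 230 - best_x, P(best_x, 230 - best_x))  # 180 50 10312.5
```

The profit function is concave, so the whole-unit search recovers the same split: 180 assembled units and 50 kits.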


E XERCISE 19. Suppose that x units of labour and y units of capital are required

to produce

f(x, y) = 100x^(3/4)y^(1/4)

units of a certain product. If each unit of labor costs $200, each unit of capital costs $300, and a total of $60,000 is available for production, determine how many

units of labor and how many units of capital should be used in order to maximize

production.

E XERCISE 20. The total monthly profit of the Robertson Controls Company in manufacturing and selling x hundreds of its standard mechanical setback thermostats

and y hundreds of its deluxe electronic setback thermostats per month is given by

the total profit function

P(x, y) = -(1/8)x^2 - (1/2)y^2 - (1/4)xy + 13x + 40y - 280

Where P is in hundreds of dollars. If the production of the setback thermostats

is to be restricted to a total of exactly 4000 per month, how many of each model

should Robertson manufacture in order to maximize its monthly profit? What is the

maximum monthly profit?

E XERCISE 21. An open rectangular box is to be constructed from materials that cost $3 per square foot for the bottom and $1 per square foot for the sides. Find the dimensions of the box of greatest volume that can be constructed for $36.

E XERCISE 22. The total weekly profit (in dollars) realized by the Country Workshop in manufacturing and selling its rolltop desks is given by the profit function

P(x, y) = -0.2x^2 - 0.25y^2 - 0.2xy + 100x + 90y - 4000

Where x stands for the number of finished units and y denotes the number of unfinished units manufactured and sold per week. The management decides to restrict

the manufacture of these desks to a total of exactly 200 units per week. How many

finished and unfinished units should be manufactured per week to maximize the

company's weekly profit?

Maximize the function f(x, y, z) = xyz subject to the constraint 3x + 2y + z = 6.

11.2. Models involving differential equations

11.2.1. Unrestricted Growth Models

The size of a population at any time t, Q(t), increases at a rate proportional to Q(t) itself. Thus

dQ/dt = kQ

where k is a constant of proportionality. This is a differential equation involving the unknown function Q and its derivative dQ/dt.

11.2.2. Restricted growth models

In many applications the quantity Q(t) does not exhibit unrestricted growth but approaches some definite upper bound. Suppose Q(t) does not exceed some number C, called the carrying capacity of the environment. Furthermore, suppose the rate of growth of this quantity is proportional to the difference between its upper bound and its current size. The resulting differential equation is

dQ/dt = k(C - Q)

where k is a constant of proportionality. Observe that if the initial population is small relative to C, then the growth rate of Q is relatively large. But as Q(t) approaches C, the difference C - Q(t) approaches zero, and so does the growth rate of Q.

Applications:

Unrestricted Growth Models

dQ/dt = kQ

Separating the variables in this equation, we have dQ/Q = k dt, which upon integration yields

∫ dQ/Q = ∫ k dt
ln|Q| = kt + C1
Q = e^(kt + C1) = Ce^(kt)

where C = e^(C1) is an arbitrary positive constant. Thus we may write the solution as

Q(t) = Ce^(kt)

Observe that if the quantity present initially is denoted by Q0, then Q(0) = Q0. Applying this condition yields the equation Ce^0 = Q0, or C = Q0. Therefore the model for unrestricted exponential growth with initial quantity Q0 is given by

Q(t) = Q0 e^(kt)
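The closed-form solution can be compared with a direct numerical integration of dQ/dt = kQ (a sketch; the values k = 0.1, Q0 = 100 and the Euler step count are illustrative assumptions, not taken from the text):

```python
import math

def euler_growth(q0, k, t, steps=100_000):
    # Numerically integrate dQ/dt = k*Q with Euler's method
    dt = t / steps
    q = q0
    for _ in range(steps):
        q += k * q * dt
    return q

q0, k, t = 100.0, 0.1, 5.0
exact = q0 * math.exp(k * t)       # Q(t) = Q0 e^(kt)
approx = euler_growth(q0, k, t)
print(abs(exact - approx) < 0.01)  # True: Euler agrees with the closed form
```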

Example. Under ideal laboratory conditions, the rate of growth of bacteria in a culture is proportional to the size of the culture at any time t. Suppose that 10,000 bacteria are present initially in a culture and 60,000 are present two hours later. How many bacteria will there be in the culture at the end of 4 hours?

Solution: Let Q(t) denote the number of bacteria present in the culture at time t. Then dQ/dt = kQ.

Solving this separable first-order differential equation gives Q(t) = Q0 e^(kt), i.e. Q(t) = 10,000e^(kt).

Next, the condition that 60,000 bacteria are present 2 hours later translates into Q(2) = 60,000, or

60,000 = 10,000e^(2k)
e^(2k) = 6
e^k = 6^(1/2)

Thus the number of bacteria present at any time t is given by

Q(t) = 10,000e^(kt) = 10,000(e^k)^t = 10,000(6^(t/2))

In particular, the number of bacteria present in the culture at the end of 4 hours is given by

Q(4) = 10,000(6^(4/2)) = 360,000
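The worked example can be reproduced directly from the fitted model Q(t) = 10,000(6^(t/2)):

```python
def Q(t):
    # Q(t) = Q0 e^(kt) with Q0 = 10,000 and e^k = 6^(1/2)
    return 10000 * 6 ** (t / 2)

print(round(Q(0)))  # 10000 bacteria initially
print(round(Q(2)))  # 60000 after two hours
print(round(Q(4)))  # 360000 after four hours
```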

11.2.3. Restricted Growth Models

To solve this separable first-order differential equation, we first separate the variables, i.e.

dQ/dt = k(C - Q)
dQ/(C - Q) = k dt

Integrating, we get ∫ dQ/(C - Q) = ∫ k dt

-ln|C - Q| = kt + d
ln|C - Q| = -kt - d
C - Q = e^(-kt - d) = e^(-kt)e^(-d)

or

Q(t) = C - Ae^(-kt)

where we have denoted the constant e^(-d) by A.
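As a check on the solution Q(t) = C - Ae^(-kt), a direct Euler integration of dQ/dt = k(C - Q) should reproduce it closely (a sketch; the values C = 1000, Q(0) = 100 and k = 0.2 are illustrative assumptions):

```python
import math

C, k, q0 = 1000.0, 0.2, 100.0   # carrying capacity, rate constant, initial value
A = C - q0                       # from Q(0) = C - A

def exact(t):
    return C - A * math.exp(-k * t)

def euler(t, steps=100_000):
    # Numerically integrate dQ/dt = k*(C - Q)
    dt = t / steps
    q = q0
    for _ in range(steps):
        q += k * (C - q) * dt
    return q

print(abs(exact(10.0) - euler(10.0)) < 0.1)  # True: the two agree closely
```

Note how both the exact and the numerical solution level off toward the carrying capacity C as t grows, as the discussion above predicts.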

Example. During a flu epidemic, 5% of the 5000 army personnel stationed at Fort MacArthur had contracted influenza at time t = 0. Furthermore, the rate at which they were contracting influenza was jointly proportional to the number of personnel who had already contracted the disease and the number not yet infected. If 20% of the personnel had contracted the flu by the 10th day, find the number of personnel who had contracted the flu by the 13th day.


Solution: Let Q(t) denote the number of army personnel who had contracted the flu after t days. Then

dQ/dt = kQ(5000 - Q)

whose solution is the logistic function

Q(t) = 5000/(1 + Ae^(-5000kt))

The condition that 5% of the population had contracted influenza at time t = 0 implies that

Q(0) = 5000/(1 + A) = 250

from which we see that A = 19. Therefore

Q(t) = 5000/(1 + 19e^(-5000kt))

The condition that 20% of the population had contracted influenza by the 10th day implies

Q(10) = 5000/(1 + 19e^(-50000k)) = 1000
1 + 19e^(-50000k) = 5
e^(-50000k) = 4/19
-50000k = ln 4 - ln 19

and k = -(1/50000)(ln 4 - ln 19) ≈ 0.0000312. Therefore

Q(t) = 5000/(1 + 19e^(-0.156t))

In particular, Q(13) = 5000/(1 + 19e^(-0.156(13))) ≈ 1428.
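The logistic computation can be verified end to end. The sketch below recomputes k exactly from the day-10 condition, so Q(13) comes out near 1426 rather than the 1428 obtained with the rounded value k ≈ 0.0000312:

```python
import math

N = 5000                                   # total personnel
A = 19                                     # from Q(0) = 5000/(1 + A) = 250
k = (math.log(19) - math.log(4)) / 50000   # from Q(10) = 1000

def Q(t):
    # Logistic model Q(t) = 5000 / (1 + 19 e^(-5000 k t))
    return N / (1 + A * math.exp(-N * k * t))

print(round(Q(0)))   # 250  (5% infected at t = 0)
print(round(Q(10)))  # 1000 (20% infected by day 10)
print(round(Q(13)))  # about 1426, close to the text's 1428
```

The small discrepancy at day 13 comes entirely from rounding k to 0.0000312 in the hand calculation.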

E XERCISE 23. Suppose that a tank initially contains 10 gallons of pure water.

Brine containing 3 pounds of salt per gallon flows into the tank at a rate of 2 gallons

per minute, and the well-stirred mixture flows out of the tank at the same rate. How much salt is present at the end of 10 minutes? How much salt is present in the long run?

The population of a certain community is increasing at a rate directly proportional

to the population at any time t. In the last 3 years the population has doubled. How

long will it take for the population to triple?

An amount of money deposited in a savings account grows at a rate proportional to

the amount present. Suppose that $10,000 is deposited in a fixed account earning

interest at the rate of 10% per year compounded continuously. What is the accumulated amount after 5 years? How long does it take for the original deposit to double?


Solutions to Exercises

Exercise 10.

The hypotheses to be tested: H0: μ = 100 against H1: μ > 100

Exercise 15.

The Laspeyres index;

Between 200 and 2007, the value of the investment increased by 19.2%

Exercise 16.

FI =
