Lecture 1 Notes: Intro to Data Analysis

Lecture 1: Brief lecture notes
Introduction, Grouping and Displaying Data to Convey Meaning: Tables and Graphs
(Data, classification, tabulation and presentation)
Definition of Statistics
Statistics is a group of methods used to collect, analyze, present, and interpret data
and to make decisions.
Broadly Speaking
Different authors have defined statistics differently from time to time. There are
two main reasons for this variety of definitions.
First: In modern times the field in which statistics is used has widened considerably.
In ancient times statistics was confined to the affairs of the state, but now it
embraces almost every sphere of human activity.
Secondly: Statistics has been defined in two ways:
(i) Statistical data, i.e. numerical statements of facts
(ii) Statistical methods, i.e. the complete body of principles and techniques
used in collecting and analyzing such data.
Statistics as “Statistical data”
Bowley defines Statistics as numerical statements of facts in any department of
enquiry placed in relation to each other.
Statistics as “Statistical methods”
Bowley defines Statistics in three different ways:
(i) Statistics may be called the science of counting.
(ii) Statistics may rightly be called the science of averages.
(iii) Statistics is the science of the measurement of the social organism, regarded
as a whole, in all its manifestations.
Types of Statistics
1. Descriptive Statistics
2. Inferential Statistics
Descriptive Statistics consists of methods for organizing, displaying, and
describing data by using tables, graphs, and summary measures.
Inferential Statistics consists of methods that use sample results to help make
decisions or predictions about a population.
Nature and Scope of Statistics
The word “Statistics” seems to have been derived from the Latin word “status”, the
Italian word “statista” or the German word “Statistik”, each of which means
“political state”. In ancient times, governments collected information about total
population, land, wealth, number of employees, soldiers etc. to get an idea of the
manpower of the country when formulating the administrative set-up and the
fiscal, tax and military policies of the government.
In modern times, Statistics is viewed not as a mere device for collecting
numerical data but as a means of developing sound techniques for handling and
analyzing them and for drawing valid inferences from them.
As such it is not confined to the affairs of the state but is constantly extending into
diverse spheres of life: social, economic and political. It now finds wide
application in almost all sciences, social as well as physical, such as biology,
psychology, education and business management. It is also applied in industry,
medical science, planning, accounting and auditing.
Limitations of Statistics
As a developing science, statistics and its techniques are widely used in every
branch of knowledge. Statistics is not a magical device that gives a solution to
every problem. Educated and uneducated, rich and poor people make use of
statistics in their day-to-day life, but it has its own limitations:
• Statistics does not deal with individual items.
• Statistics deals with quantitative data only.
• Statistics may lead to wrong conclusions in the absence of details.
• Statistical laws are true only on average.
• Statistics does not reveal the entire story.
• Statistical data should be uniform and homogeneous.
• Statistics is liable to be misused.
Population and Sample
A population consists of all elements – individuals, items, or objects – whose
characteristics are being studied. The population that is being studied is also called
the target population.
Ex: All the students in a particular class constitute the population of an
investigation.
Parameter: A characteristic of a population is known as a parameter, e.g. the
population mean, the population variance etc.
Sample: A sample is a representative part of a population; that is, a portion of the
population selected for study is referred to as a sample.
Ex: A small number of students selected from all the students in a particular class
under study constitute the sample.
Statistic: A characteristic of a sample is known as a statistic, e.g. the sample mean,
the sample variance etc.
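The difference between a parameter and a statistic can be illustrated with a short Python sketch. The marks below are made-up data generated only for illustration (they are not from these notes), and the class size of 40 and sample size of 10 are likewise assumptions.

```python
import random
import statistics

# Hypothetical population: marks of all 40 students in one class (made-up data).
rng = random.Random(1)
population = [rng.randint(40, 100) for _ in range(40)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_variance = statistics.pvariance(population)  # divides by N

# A simple random sample of 10 students, drawn without replacement.
sample = rng.sample(population, 10)

# Statistics describe the sample only.
sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)  # divides by n - 1

print("Population mean (parameter):", population_mean)
print("Sample mean (statistic):    ", sample_mean)
```

The sample mean will usually be close to, but not equal to, the population mean; a different sample would give a different value of the statistic, while the parameter stays fixed.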
Figure 1.1 Population and sample.
Data
The raw material of statistics is data. For our purpose we may define data as
numbers. A data collection consists of any number of related observations.
The two kinds of numbers that we use in statistics are numbers that result from
taking a measurement and numbers that result from the process of counting.
For example, when a nurse weighs a patient or takes a patient’s temperature, a
measurement, consisting of a number such as 73 kilograms or 99 degrees
Fahrenheit, is obtained. A different type of number is obtained when a hospital
administrator counts the number of patients, perhaps 25, on a given day.
Data are of two types:
• Cross-section data
• Time-series data
Sources of Data
• Primary
• Secondary
• On-line
Basic Terms
An element or member of a sample or population is a specific subject or object
(for example, a person, firm, item, state, or country) about which the information
is collected.
Table 1.1 2001 Sales of Seven U.S. Companies
Company               2001 Sales (millions of dollars)
Wal-Mart Stores       217,799
IBM                    85,866
General Motors        177,260
Dell Computer          31,168
Procter & Gamble       39,262
JC Penney              32,004
Home Depot             53,553
In this table, 2001 sales is the variable, each company (for example, General
Motors) is an element or a member, and each sales figure (for example, 177,260) is
an observation or measurement.
Variable
A variable is a characteristic under study that assumes different values for
different elements. In contrast to a variable, the value of a constant is fixed.
The value of a variable for an element is called an observation or measurement.
Types of variables
• Quantitative variables
   • Discrete variables
   • Continuous variables
• Qualitative or categorical variables
A variable that can be measured numerically is called a quantitative variable.
The data collected on a quantitative variable are called quantitative data.
• A variable whose values are countable is called a discrete variable. In
other words, a discrete variable can assume only certain values with no
intermediate values.
• A variable that can assume any numerical value over a certain interval or
intervals is called a continuous variable.
• A variable that cannot assume a numerical value but can be classified
into two or more nonnumeric categories is called a qualitative or
categorical variable. The data collected on such a variable are called
qualitative data.
For example, the outcome of a coin toss is a qualitative variable with two
categories, head or tail; the sex of a person likewise has two categories, male or
female. The categories are sometimes called attributes.
Figure 1.2 Types of variables.
Levels of Measurement
Variables can be classified on the basis of their level of measurement. The way we
classify variables greatly affects how we can use them in our analysis. The
classifications are:
• Nominal
• Ordinal
• Interval
• Ratio
Nominal level
A nominal measurement is created when names are used to establish categories
into which observations can be exclusively placed.
For example, soft drinks may be classified as Coke, Pepsi or 7-Up. It is important
to remember that a nominal measurement carries no indication of order of
preference. An example is given in Table 1.2.
Table 1.2: Top Passenger Car Sales in Dhaka by Company
Company       Annual sales (number of units)
Toyota        1000
Hyundai       1500
Honda         2500
All others    1700
Ordinal level
Unlike a nominal measurement, an ordinal scale produces a distinct ordering or
arrangement of the data, i.e. the observations are ranked on the basis of some
criterion. Table 1.3, which lists the ratings of the company commander by the
nurses under her command, is an illustration of the ordinal level of measurement.
Table 1.3: Ratings of the Company Commander
Rating      Number of nurses
Superior     7
Good        23
Average     26
Poor        18
Inferior     0
Interval level
The interval level includes all the characteristics of the ordinal scale, but in
addition the distance between values is of a constant size. Temperature on the
Fahrenheit scale is an example.
Ratio level
This level has all the characteristics of the interval level: the distances between
numbers are of a known, constant size. In addition, it has a meaningful zero point,
so the ratio of two values is itself meaningful.
For example, we can compare 30 units of sales made by Rahim to 90 units of
sales made by Karim, set up the ratio 90:30 and say that Karim sold three times as
much as Rahim.
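The contrast between the interval and ratio levels can be checked numerically. The sketch below uses the Rahim and Karim sales figures from the example above; the temperatures 40, 80 and 120 degrees Fahrenheit are hypothetical values chosen only for illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Ratio level: sales have a true zero, so the ratio is meaningful.
rahim_sales, karim_sales = 30, 90
sales_ratio = karim_sales / rahim_sales  # Karim sold three times as much

# Interval level: equal differences stay equal when the scale is changed...
diff_low = fahrenheit_to_celsius(80) - fahrenheit_to_celsius(40)
diff_high = fahrenheit_to_celsius(120) - fahrenheit_to_celsius(80)

# ...but ratios depend on where the arbitrary zero sits.
ratio_f = 80 / 40  # 2 on the Fahrenheit scale
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)  # about 6 in Celsius

print("Sales ratio:", sales_ratio)
print("Temperature ratio in Fahrenheit:", ratio_f)
print("Temperature ratio in Celsius:   ", ratio_c)
```

Because the same two temperatures give a ratio of 2 on one scale and about 6 on the other, statements such as "twice as hot" are not meaningful at the interval level, whereas "three times the sales" is meaningful at the ratio level.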
Classification & Tabulation
Data recorded in the sequence in which they are collected and before they are
processed or ranked are called raw data.
Table 2.1 Ages of 50 students
21 19 24 25 29 34 26 27 37 33
18 20 19 22 19 19 25 22 25 23
25 19 31 19 23 18 23 19 23 26
22 28 21 20 22 22 21 20 19 21
25 23 18 37 27 23 21 25 21 24
Frequency Distribution
A frequency distribution classifies the data according to the number of
observations that share the same value of the variable. It is simply a table in which
the data are grouped into classes and the number of cases that fall in each class is
recorded.
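As a minimal sketch of this idea, the raw ages of Table 2.1 can be tallied in Python with `collections.Counter`. Treating each distinct age as its own class is an assumption made here for illustration; the notes do not give a frequency table for these ages.

```python
from collections import Counter

# Raw data from Table 2.1: ages of 50 students, in the order recorded.
ages = [
    21, 19, 24, 25, 29, 34, 26, 27, 37, 33,
    18, 20, 19, 22, 19, 19, 25, 22, 25, 23,
    25, 19, 31, 19, 23, 18, 23, 19, 23, 26,
    22, 28, 21, 20, 22, 22, 21, 20, 19, 21,
    25, 23, 18, 37, 27, 23, 21, 25, 21, 24,
]

# Count how many students share each age.
age_frequency = Counter(ages)

# Print the frequency distribution in increasing order of age.
for age in sorted(age_frequency):
    print(f"{age:>3}  {age_frequency[age]:>2}")
```

The frequencies always add up to the number of observations, which is a quick check that no value was missed.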
• Frequency distribution of qualitative data
Table 2.2 Type of Employment Students Intend to Engage In
Type of Employment              Number of Students
Private companies/businesses    44
Federal government              16
State/local government          23
Own business                    17
                                Sum = 100
Here type of employment is the variable, each type of employment is a category,
and the numbers of students form the frequency column; each number in that
column is a frequency.
Example 2-1
The following 30 responses record the stress on the job (Very, Somewhat or None)
reported by each respondent.
Somewhat  None      Somewhat  Very      Very      None
Very      Somewhat  Somewhat  Very      Somewhat  Somewhat
Very      Somewhat  None      Very      None      Somewhat
Somewhat  Very      Somewhat  Somewhat  Very      None
Somewhat  Very      Very      Somewhat  None      Somewhat
Construct a frequency distribution table for these data.
Solution 2-1
Table 2.3 Frequency Distribution of Stress on Job
Stress on Job    Tally             Frequency (f)
Very             |||| ||||         10
Somewhat         |||| |||| ||||    14
None             |||| |             6
                                   Sum = 30
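The tally in Table 2.3 can be reproduced with a few lines of Python. This is only a sketch: the responses are typed in from Example 2-1 with spelling and capitalization made uniform, and the variable names are illustrative.

```python
from collections import Counter

# The 30 responses of Example 2-1, row by row.
responses = [
    "Somewhat", "None", "Somewhat", "Very", "Very", "None",
    "Very", "Somewhat", "Somewhat", "Very", "Somewhat", "Somewhat",
    "Very", "Somewhat", "None", "Very", "None", "Somewhat",
    "Somewhat", "Very", "Somewhat", "Somewhat", "Very", "None",
    "Somewhat", "Very", "Very", "Somewhat", "None", "Somewhat",
]

# Counting the responses in each category replaces the manual tally.
stress_frequency = Counter(responses)

for category in ("Very", "Somewhat", "None"):
    print(f"{category:<10} {stress_frequency[category]:>2}")
print("Sum =", sum(stress_frequency.values()))
```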
Graphical Presentation of Qualitative Data
A graph made of bars whose heights represent the frequencies of respective
categories is called a bar graph.
Figure 2.1 Bar graph for the frequency distribution of Table 2.3 (horizontal axis:
Stress on Job; vertical axis: Frequency).
Pie Chart
A circle divided into portions that represent the relative frequencies or percentages
of a population or a sample belonging to different categories is called a pie chart.
Table 2.4 Calculating Angle Sizes for the Pie Chart
Stress on Job    Relative Frequency    Angle Size
Very             .333                  360(.333) = 119.88
Somewhat         .467                  360(.467) = 168.12
None             .200                  360(.200) =  72.00
                 Sum = 1.00            Sum = 360
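The arithmetic behind Table 2.4 can be sketched as follows: each relative frequency is the class frequency divided by the total, and each angle is 360 degrees times the relative frequency. Rounding the relative frequencies to three decimals before multiplying is done here to match the figures in the table.

```python
# Frequencies from Table 2.3.
frequencies = {"Very": 10, "Somewhat": 14, "None": 6}
total = sum(frequencies.values())

# Relative frequency = frequency / total, rounded to three decimals as in Table 2.4.
relative = {name: round(f / total, 3) for name, f in frequencies.items()}

# Angle of each pie slice = 360 degrees * relative frequency.
angles = {name: 360 * r for name, r in relative.items()}

for name in frequencies:
    print(f"{name:<10} {relative[name]:.3f}  {angles[name]:7.2f}")
print("Sum of angles =", round(sum(angles.values()), 2))
```

Multiplying each relative frequency by 100 instead of 360 gives the percentages shown on the pie chart.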
Figure 2.2 Pie chart for the percentage distribution of Table 2.4 (Very 33.3%,
Somewhat 46.7%, None 20%).
Frequency Distribution of Quantitative Data
A frequency distribution for quantitative data lists all the classes and the number
of values that belong to each class. Data presented in the form of a frequency
distribution are called grouped data.
Table 2.7 Weekly Earnings of 100 Employees of a Company
Weekly Earnings (dollars)    Number of Employees (f)
401 to 600                    9
601 to 800                   22
801 to 1000                  39
1001 to 1200                 15
1201 to 1400                  9
1401 to 1600                  6
Here weekly earnings is the variable and the numbers of employees form the
frequency column. The third class is 801 to 1000 and its frequency is 39. The lower
limit of the sixth class is 1401 and its upper limit is 1600.
Class Limits
The class limits are the smallest and the largest values that can fall in the class.
For example, in the class 1401-1600 the lowest value is 1401 and the highest
value is 1600. These two values are known as the lower limit and the upper limit
of the class. A class boundary is the midpoint between the upper limit of one class
and the lower limit of the next class, so the class 1401-1600 has the boundaries
1400.5 and 1600.5 (see Table 2.8). The frequency of a particular class is called the
class frequency.
Class Interval
The difference between the lower boundary and the upper boundary of a class is
known as the class interval or class width.
For example, for the class 1401-1600 the class interval is 1600.5 - 1400.5 = 200.
ls
i
k
K = 1 + 3.322 logN
K is the number of classes, N = total no of observations.
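The rule for the number of classes can be evaluated directly. The sketch below is an illustration only: the function names are hypothetical, the logarithm is taken to base 10, and the width helper assumes the usual rule of dividing the range (largest minus smallest value) by the number of classes, which is how the home-run solution later in these notes obtains 21.2.

```python
import math

def number_of_classes(n):
    """Approximate number of classes: K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)

def approximate_class_width(largest, smallest, k):
    """Approximate class width: range of the data divided by the number of classes."""
    return (largest - smallest) / k

# For N = 100 observations the rule gives 1 + 3.322 * 2 = 7.644, i.e. 7 or 8 classes.
k_100 = number_of_classes(100)

# For N = 30 observations the rule suggests about 6 classes.
k_30 = number_of_classes(30)

# Range 230 - 124 split into 5 classes, as in the home-run example.
width = approximate_class_width(230, 124, 5)

print(round(k_100, 3), round(k_30, 3), round(width, 1))
```

The formula is only a guide; in practice the result is rounded to a convenient whole number, and a slightly smaller or larger number of classes may be chosen.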
Class Midpoint
Class midpoint = (Lower limit + Upper limit) / 2
Table 2.8 Class Boundaries, Class Widths, and Class Midpoints for Table 2.7
Class Limits    Class Boundaries             Class Width    Class Midpoint
401 to 600      400.5 to less than 600.5     200            500.5
601 to 800      600.5 to less than 800.5     200            700.5
801 to 1000     800.5 to less than 1000.5    200            900.5
1001 to 1200    1000.5 to less than 1200.5   200            1100.5
1201 to 1400    1200.5 to less than 1400.5   200            1300.5
1401 to 1600    1400.5 to less than 1600.5   200            1500.5
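The columns of Table 2.8 can be derived from the class limits of Table 2.7. This is a minimal sketch with illustrative function names; it assumes the limits are whole numbers one unit apart, so each boundary lies half a unit beyond the corresponding limit.

```python
# Class limits from Table 2.7 (weekly earnings in dollars).
class_limits = [(401, 600), (601, 800), (801, 1000),
                (1001, 1200), (1201, 1400), (1401, 1600)]

def class_boundaries(lower, upper, gap=1):
    """Boundaries lie halfway across the gap between adjacent classes."""
    return lower - gap / 2, upper + gap / 2

def class_midpoint(lower, upper):
    """Class midpoint = (lower limit + upper limit) / 2."""
    return (lower + upper) / 2

rows = []
for lower, upper in class_limits:
    lower_b, upper_b = class_boundaries(lower, upper)
    width = upper_b - lower_b  # class width = upper boundary - lower boundary
    rows.append((lower, upper, lower_b, upper_b, width, class_midpoint(lower, upper)))

for row in rows:
    print(row)
```

Note that the width comes out as 200 only when it is computed from the boundaries; the difference between the two limits of a class is 199.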
Example:
Table 2.9 gives the total home runs hit by all players of each of the 30 Major
League Baseball teams during the 2002 season. Construct a frequency distribution
table.
Table 2.9 Home Runs Hit by Major League Baseball Teams During the 2002 Season
Team Home Runs Team Home Runs
Anaheim 152 Milwaukee 139
Arizona 165 Minnesota 167
Atlanta 164 Montreal 162
Baltimore 165 New York Mets 160
Boston 177 New York Yankees 223
Chicago Cubs 200 Oakland 205
Chicago White Sox 217 Philadelphia 165
Cincinnati 169 Pittsburgh 142
Cleveland 192 St. Louis 175
Colorado 152 San Diego 136
Detroit 124 San Francisco 198
Florida 146 Seattle 152
Houston 167 Tampa Bay 133
Kansas City 140 Texas 230
Los Angeles 155 Toronto 187
Solution
Suppose we decide to group these data into five classes. The largest value is 230
and the smallest value is 124, so
Approximate width of each class = (230 - 124) / 5 = 21.2
Now we round this approximate width to a convenient number – say, 22.
The lower limit of the first class can be taken as 124 or any number less than
124. Suppose we take 124 as the lower limit of the first class. Then our classes will
be 124 – 145, 146 – 167, 168 – 189, 190 – 211, and 212 – 233.
Table 2.10 Frequency Distribution for the Data of Table 2.9
Total Home Runs    Tally             f
124 – 145          |||| |             6
146 – 167          |||| |||| |||     13
168 – 189          ||||               4
190 – 211          ||||               4
212 – 233          |||                3
                                     ∑f = 30
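The construction of Table 2.10 can be sketched in Python. The data are the 30 team totals of Table 2.9; rounding the approximate width up with `math.ceil` is just one way of choosing the "convenient number" 22, and the variable names are illustrative.

```python
import math

# Home runs from Table 2.9 (first column of teams, then second column).
home_runs = [
    152, 165, 164, 165, 177, 200, 217, 169, 192, 152,
    124, 146, 167, 140, 155,
    139, 167, 162, 160, 223, 205, 165, 142, 175, 136,
    198, 152, 133, 230, 187,
]

number_of_classes = 5
approximate_width = (max(home_runs) - min(home_runs)) / number_of_classes  # 21.2
class_width = math.ceil(approximate_width)  # rounded up to 22

# Start the first class at the smallest value, 124.
first_lower = min(home_runs)
classes = [
    (first_lower + i * class_width, first_lower + (i + 1) * class_width - 1)
    for i in range(number_of_classes)
]

# Count the values whose home-run total falls within each pair of class limits.
frequencies = [
    sum(1 for x in home_runs if lower <= x <= upper)
    for lower, upper in classes
]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower} - {upper}  {f:>2}")
print("Sum of f =", sum(frequencies))
```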
Graphical Presentation for Grouped Data
A histogram is a graph in which classes are marked on the horizontal axis and
the frequencies are marked on the vertical axis. The frequencies are represented by
the heights of the bars. In a histogram, the bars are drawn adjacent to each other.
Figure 2.3 Frequency histogram for Table 2.10 (horizontal axis: Total home runs,
with classes 124-145, 146-167, 168-189, 190-211 and 212-233; vertical axis:
Frequency).
A cumulative frequency distribution gives the total number of values that fall
below the upper boundary of each class.
Example 2-7
Using the frequency distribution of Table 2.10, reproduced below, prepare a
cumulative frequency distribution for the home runs hit by Major League Baseball
teams during the 2002 season.
Total Home Runs     f
124 – 145           6
146 – 167          13
168 – 189           4
190 – 211           4
212 – 233           3
Solution 2-7
Table 2.14 Cumulative Frequency Distribution of Home Runs by Baseball Teams
Class Limits    Class Boundaries            Cumulative Frequency
124 – 145       123.5 to less than 145.5    6
124 – 167       123.5 to less than 167.5    6 + 13 = 19
124 – 189       123.5 to less than 189.5    6 + 13 + 4 = 23
124 – 211       123.5 to less than 211.5    6 + 13 + 4 + 4 = 27
124 – 233       123.5 to less than 233.5    6 + 13 + 4 + 4 + 3 = 30
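The running totals in Table 2.14 are simply cumulative sums of the class frequencies. A minimal Python sketch, using `itertools.accumulate` and illustrative variable names:

```python
from itertools import accumulate

# Class frequencies from Table 2.10 and the upper boundary of each class.
frequencies = [6, 13, 4, 4, 3]
upper_boundaries = [145.5, 167.5, 189.5, 211.5, 233.5]

# Each cumulative frequency adds the current class to all the classes before it.
cumulative = list(accumulate(frequencies))

for boundary, total in zip(upper_boundaries, cumulative):
    print(f"less than {boundary}: {total}")
```

The last cumulative frequency always equals the total number of observations, here 30.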
Why Graphical or Diagrammatic Presentation Is So Important
Diagrams occupy an important place because:
• They are attractive and impressive.
• They save time and labor.
• They have universal applicability.
• They make data simple.
• They make comparison easy.
• They provide more information.
Advantages of graphic presentation
• It provides an attractive and impressive view.
• It simplifies the complexity of data.
• It provides easy comparison of two or more phenomena.
• It needs no special knowledge of mathematics to understand.
• Apart from simplicity, it saves the time and energy of the
statistician as well as the observer.
Exercises:
2. The following are the sales (in thousand dollars) in a year of 50 companies.
30 55 27 45 56 48 45 49 32 57
37 55 52 34 54 42 32 59 35 46
32 26 40 28 53 54 29 42 42 54
39 56 59 58 49 53 30 53 21 34
52 57 43 46 54 31 22 31 24 24
From these data construct (i) a frequency distribution, (ii) a cumulative frequency
distribution and (iii) a relative frequency distribution. Draw an appropriate diagram.
3. Construct a frequency distribution for the following data (exports in thousand dollars) by
using a suitable class interval:
65 64 99 55 64 89 87 65 62 38
99 68 95 86 57 53 47 50 55 81
80 98 51 36 63 66 85 79 83 70
67 70 60 69 78 39 75 56 71 51
35 42 60 71 65 55 41 39 45 76
99 68 95 86 57 53 47 50 55 81
Also calculate the cumulative and relative frequencies.