
CHAPTER 2

ORGANIZING AND SUMMARIZING DATA

When conducting a statistical study, a researcher collects data for the variables under study. Data sets are usually very large collections of numbers and symbols, and in their original form they are usually hard to interpret. To make sense of a data set, we have to process it in a way that lets us understand it and extract useful information from it. The techniques of descriptive statistics provide different ways of processing data sets. Data can be organized and displayed using many different types of tables, charts and graphs. Graphical displays help us see the important characteristics of a data set. Numerical summary measures give quantitative descriptions of those characteristics and reinforce the information provided by the graphical displays. In this chapter, we will learn to organize and summarize quantitative and qualitative data graphically. In the next chapter, we will follow up with methods to summarize data numerically.

2.1 Data Listing

The purpose of listing (or rearranging) a set of data is to simplify its use. In a listing, the identity of each value (or item) in the data set is preserved. We discuss two methods of listing: dotplots and stem-and-leaf plots.

Dotplots
A dotplot consists of dots which are plotted on a graph to represent the data. The dots are plotted along a scale of values, and each position on the scale represents one value (or category) of the data. The number of dots at each position corresponds to the number of times (the frequency) that value or item occurs in the data set. A dotplot is suitable for both quantitative and qualitative data.

Example: The following data show the number of students who were absent from a Statistics course lecture, conducted 40 times in a particular semester.

0 1 1 2 2 2 2 3 3 3
4 4 0 4 1 5 4 5 6 3
1 5 1 3 2 2 3 2 3 3
4 4 2 0 5 5 0 4 2 2

Construct a dotplot for the number of absentees in a lecture.

Solution: In this example, the variable of interest is the number of absentees in each lecture. Its value varies from 0 to 6. The number of dots at each particular value is the number of lectures having that particular number of absentees.

[Figure: Dotplot of absentees in a lecture. Horizontal axis: number of absentees, 0 to 6.]

Dotplot: Absentees in a lecture

From the dotplot, we can see that 10 out of the 40 lectures have 2 absentees, and this is the most frequent number of absentees in a lecture.
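The counting behind a dotplot can be sketched in a few lines of Python. This is only an illustration: the function name `dotplot_rows` is hypothetical, and the plot is drawn sideways (one text row per value, one dot per lecture) rather than along a horizontal scale as in the figure.

```python
from collections import Counter

# Number of absentees in each of the 40 lectures (data from the example above).
absentees = [
    0, 1, 1, 2, 2, 2, 2, 3, 3, 3,
    4, 4, 0, 4, 1, 5, 4, 5, 6, 3,
    1, 5, 1, 3, 2, 2, 3, 2, 3, 3,
    4, 4, 2, 0, 5, 5, 0, 4, 2, 2,
]

def dotplot_rows(data):
    """Return one text row per value: the value, then one dot per occurrence."""
    counts = Counter(data)
    # Walk the whole scale so that values with no occurrences still get a row.
    return [f"{value} | " + "." * counts[value]
            for value in range(min(data), max(data) + 1)]

for row in dotplot_rows(absentees):
    print(row)
```

The row for the value 2 carries ten dots, which agrees with the reading of the dotplot given above.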

Stem-and-Leaf Plots
The stem-and-leaf plot is a method of organizing data which combines sorting and graphing. It can only be applied to quantitative (or numerical) data. Each number in a data set is separated into two parts: the stem and the leaf. The leftmost digit(s) become the stem and the rightmost digit(s) become the leaf. Before plotting, we must first determine the stem values. The stems start with the leftmost digit of the minimum data value and increase consecutively up to the leftmost digit of the maximum data value. All stem values are aligned vertically, and their corresponding leaf values are plotted alongside, separated from the stem by a vertical line.

Example: The following are the ages of 30 patients who had their first heart attack.

65 55 56 40 67 45 63 86 50 67
90 85 75 49 67 79 78 90 85 76
83 45 54 72 90 67 98 60 89 55

Construct a stem-and-leaf display.

Solution: The minimum age is 40 and the maximum age is 98. Therefore, the stem values run from 4 to 9, consecutively.

4 | 0      (the leaf for age 40)
5 |
6 | 5      (the leaf for age 65)
7 |
8 |
9 |

The column on the left holds the stems for all the ages.

After listing the stem values, we record each leaf on the right side of the vertical line, next to its stem, in the order in which the values are read from the data. The complete (unordered) stem-and-leaf display for all the ages is shown below:

4 | 0 5 9 5
5 | 5 4 6 0 5
6 | 5 3 7 0 7 7 7
7 | 5 9 8 6 2
8 | 5 6 9 5 3
9 | 0 0 0 8

The leaves at each stem are then arranged in increasing order and presented as follows:

4 | 0 5 5 9
5 | 0 4 5 5 6
6 | 0 3 5 7 7 7 7
7 | 2 5 6 8 9
8 | 3 5 5 6 9
9 | 0 0 0 8

From a stem-and-leaf display, we get a picture of how the data values are distributed. For example, the above plot shows the distribution of the patients' ages when they had their first heart attack: the ages range from 40 to 98, and the 60s form the largest group, with 7 of the 30 patients.
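As a rough sketch of the same procedure in Python, the ordered display can be built by sorting the ages and splitting each one into a tens digit (stem) and a units digit (leaf). The function name `stem_and_leaf` is hypothetical, and the split by 10 assumes non-negative two-digit whole numbers, as in this example.

```python
from collections import defaultdict

# Ages of the 30 patients from the example above.
ages = [65, 55, 56, 40, 67, 45, 63, 86, 50, 67,
        90, 85, 75, 49, 67, 79, 78, 90, 85, 76,
        83, 45, 54, 72, 90, 67, 98, 60, 89, 55]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its leaves (units digits) in increasing order."""
    leaves = defaultdict(list)
    for value in sorted(data):
        leaves[value // 10].append(value % 10)
    # Keep every stem from the smallest to the largest, even one with no leaves.
    return {stem: leaves.get(stem, [])
            for stem in range(min(data) // 10, max(data) // 10 + 1)}

for stem, leaf_list in stem_and_leaf(ages).items():
    print(f"{stem} | " + " ".join(str(leaf) for leaf in leaf_list))
```

Sorting the data first means the leaves arrive already ordered, so the separate ordering step of the manual method is not needed here.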

2.1 Data Classification

Important characteristics of a large mass of data can be assessed by grouping or classifying the data into different non-overlapping classes or categories. Data sets that are divided into groups or classes are known as grouped data. For qualitative data, every class is given a name from the different data values in the data set. For quantitative data, every class is an interval of numbers if the data is of the continuous type or an individual number if the data is discrete.

Frequency Distribution
A frequency distribution groups data values into different categories (by individual names, intervals of numbers, or individual numbers) and notes the number of data values that fall into each category (i.e. the corresponding frequencies). The following are examples of frequency distributions for qualitative data and quantitative data, respectively.

Example: Twenty-five army inductees were given a blood test to determine their blood type. The resulting data set consists of twenty-five data values, each being blood type A, B, AB, or O. Grouping the data points into the four categories of blood type, we have the following frequency distribution table:

Blood Type    Number of army inductees
A             5
B             7
O             9
AB            4
Total         25

Frequency Distribution (Qualitative Data): Blood Type

Example: A survey was conducted by an economist to determine the monthly income of families in rural areas. 70 families from a village were randomly chosen and the results are as follows:

Monthly Income (in RM)    Number of families
RM499 or less             5
RM500 - RM599             10
RM600 - RM699             15
RM700 - RM799             20
RM800 - RM899             15
RM900 or more             5
Total                     70

Frequency Distribution (Quantitative Data): Monthly income of rural families

The construction of a frequency distribution for qualitative data is quite simple. The choice of classes is obvious from the data set, and we need only tally the frequency for each class to be put in the table. For a quantitative data set, we need to consider a few criteria to construct its frequency distribution table. Before we do that, we need to define some terms related to a frequency distribution table.

Class Limit
The smallest and largest values that can fall in a given class interval are referred to as the class limits. Classes without a lower limit or an upper limit are known as open classes.

Example: (Refer to the frequency distribution of the village family income above.) The first and last classes are open classes.
Lower limits for classes: 500, 600, 700, 800, 900 (the first class has no lower limit)
Upper limits for classes: 499, 599, 699, 799, 899 (the last class has no upper limit)

Class Boundary
A class boundary is a value shared by two consecutive classes. It is the midpoint between the upper limit of a class and the lower limit of the class that follows it. Therefore, the upper boundary of a class is also the lower boundary of the next class.

Example: (Refer to the village family income distribution.) The lower and upper boundaries of the second class (RM500 - RM599) are:

Lower boundary = $\frac{\text{upper limit of first class} + \text{lower limit of second class}}{2} = \frac{499 + 500}{2} = 499.5$

Upper boundary = $\frac{\text{upper limit of second class} + \text{lower limit of third class}}{2} = \frac{599 + 600}{2} = 599.5$

Class boundaries are also known as real class limits.

Class Width
The difference between the upper boundary and the lower boundary of a class gives the class width. The class width is also called the class size.

Class width = Upper boundary $-$ Lower boundary

Example: Consider the third class from the above example, i.e. class RM600 - RM699.

Class width = $699.5 - 599.5 = 100$

Class Midpoint
A class midpoint is obtained by dividing the sum of the two class limits (or the two class boundaries) by 2.

Class midpoint = $\frac{\text{upper limit} + \text{lower limit}}{2} = \frac{\text{upper boundary} + \text{lower boundary}}{2}$

Example: Consider the third class from the above example, i.e. class RM600 - RM699. The class midpoint is:

Class midpoint = $\frac{699 + 600}{2} = \frac{699.5 + 599.5}{2} = 649.5$
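The three definitions above can be checked with a short Python sketch. The function name `class_summary` and its `gap` parameter are hypothetical; `gap` is the distance between the upper limit of one class and the lower limit of the next, assumed here to be 1 because the limits are whole ringgit amounts.

```python
def class_summary(lower_limit, upper_limit, gap=1):
    """Return (lower boundary, upper boundary, width, midpoint) of one class.

    `gap` is the distance between the upper limit of a class and the lower
    limit of the next class (1 for whole-number limits such as 599 and 600).
    """
    lower_boundary = lower_limit - gap / 2
    upper_boundary = upper_limit + gap / 2
    width = upper_boundary - lower_boundary
    midpoint = (lower_limit + upper_limit) / 2
    return lower_boundary, upper_boundary, width, midpoint

# Third income class, RM600 - RM699.
print(class_summary(600, 699))  # (599.5, 699.5, 100.0, 649.5)
```

Note that the class width comes from the boundaries, not the limits: subtracting the limits would give 99 rather than 100.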

Constructing a Frequency Distribution Table


The procedure for constructing a frequency distribution table for quantitative data is as follows:

(1) Determine the number of classes

The following formula can be used to determine a number of classes that is suitable for the size of the data set.

Sturges' formula: $k = 1 + 3.3 \log_{10} n$, where
k = number of classes
n = number of data points

Example: A data set has n = 30 data points. A suitable number of classes is:

$k = 1 + 3.3 \log_{10} 30 = 1 + 3.3(1.477) \approx 5.87$, i.e. about 6 classes.

(2) Determine the class width (or size)

To determine the class width when all classes are of the same size, first find the difference between the largest and the smallest values in the data set. Approximate the width of a class by dividing this difference by the number of classes determined earlier.

Class width = $\frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$

(3) Determine the lower limit of the first class (the starting point)

Any number that is equal to or less than the smallest value in the data set can be used as the lower limit of the first class.

Example: The following data give the weights (in gm) of 50 mice used in an experiment concerning the effect of a lack of a certain vitamin.

135  90 115 118 121 137 132 120 104 125
119 115 101 129  87 108 110 133 135 126
127 103 110 126 118  88 104 137 120  95
146 126 119 119 105 132 126 118 100 113
106 125 117 102 145 129 124 113  94 148

Construct a frequency distribution table. In the table, also list the class boundaries and class midpoints.

Solution: By following the above procedure, determine:

(1) number of classes: $k = 1 + 3.3 \log_{10} 50 = 1 + 3.3(1.699) \approx 6.6$, so we use 7 classes.

(2) class width: $\frac{148 - 87}{7} \approx 8.7$, which we round up to 9.

(3) lower limit of the first class: The lower limit of the first class can be taken as 87 or any number less than 87. Suppose we take 87 as the starting point. Then the classes will be: 87 - 95, 96 - 104, 105 - 113, 114 - 122, 123 - 131, 132 - 140, 141 - 149.

Classes are recorded in the first column of the frequency table. Class boundaries and class midpoints are also recorded. Each data value is then tallied into its class; the resulting frequencies represent the number of mice that belong to each of the seven classes of weights.

Weight (gm)    Class boundary    Class midpoint    Frequency, f
87 - 95        86.5 - 95.5       91                5
96 - 104       95.5 - 104.5      100               6
105 - 113      104.5 - 113.5     109               7
114 - 122      113.5 - 122.5     118               12
123 - 131      122.5 - 131.5     127               10
132 - 140      131.5 - 140.5     136               7
141 - 149      140.5 - 149.5     145               3
Total                                              50

Frequency Table: Weights of mice in the vitamin-deficiency experiment.
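The three-step procedure can be sketched in Python for the mice data. The function names are hypothetical, and two choices are assumptions of this sketch rather than rules stated in the text: the number of classes is rounded to the nearest whole number, and the class width is rounded up. The class limits are assumed to be whole numbers, and classes are added until the largest value is covered.

```python
import math

# Weights (gm) of the 50 mice from the example above.
weights = [135, 90, 115, 118, 121, 137, 132, 120, 104, 125,
           119, 115, 101, 129, 87, 108, 110, 133, 135, 126,
           127, 103, 110, 126, 118, 88, 104, 137, 120, 95,
           146, 126, 119, 119, 105, 132, 126, 118, 100, 113,
           106, 125, 117, 102, 145, 129, 124, 113, 94, 148]

def sturges_classes(n):
    """Number of classes from k = 1 + 3.3 log10(n), rounded to the nearest integer."""
    return round(1 + 3.3 * math.log10(n))

def frequency_table(data, start=None):
    """Return (lower limit, upper limit, frequency) rows for whole-number data."""
    k = sturges_classes(len(data))
    # Step 2: class width, rounded up (and at least 1).
    width = max(1, math.ceil((max(data) - min(data)) / k))
    # Step 3: start at the smallest value unless another starting point is given.
    lower = min(data) if start is None else start
    rows = []
    while lower <= max(data):
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, freq))
        lower = upper + 1
    return rows

for lower, upper, freq in frequency_table(weights):
    midpoint = (lower + upper) / 2
    print(f"{lower} - {upper}   {lower - 0.5} - {upper + 0.5}   {midpoint}   {freq}")
```

With these assumptions the sketch produces seven classes of width 9 starting at 87, matching the classes used in the solution.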

Relative Frequency Table


If the frequency for each class in a frequency table is replaced by its relative frequency, then a relative frequency table is obtained. We can compute the relative frequency and the percentage of each class as follows.

Relative frequency = $\frac{\text{Frequency of the class}}{\text{Sum of all frequencies}} = \frac{f}{\sum f}$

Percentage = (Relative frequency) $\times$ 100

Example: The relative frequency table for the data on the weights of mice lacking the vitamin is as follows:

Weight (gm)    Relative Frequency    Percentage
87 - 95        0.10                  10
96 - 104       0.12                  12
105 - 113      0.14                  14
114 - 122      0.24                  24
123 - 131      0.20                  20
132 - 140      0.14                  14
141 - 149      0.06                  6
Total          1.00                  100

Relative Frequency Table: Weights of mice lacking the vitamin.
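A minimal Python sketch of the relative frequency calculation follows. It uses the class frequencies of the mice data (5, 6, 7, 12, 10, 7, 3, as they appear in the cumulative frequency example later in this section); the function name `relative_frequencies` is hypothetical.

```python
# Class frequencies for the weights of mice, classes 87 - 95 up to 141 - 149.
frequencies = [5, 6, 7, 12, 10, 7, 3]

def relative_frequencies(freqs):
    """Divide each class frequency by the sum of all frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]

for rel in relative_frequencies(frequencies):
    # Relative frequency to two decimals, then the same value as a percentage.
    print(f"{rel:.2f}   {rel * 100:.0f}%")
```

The relative frequencies always sum to 1 (and the percentages to 100), which is a quick check on the arithmetic.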

Cumulative Frequency Table


A cumulative frequency distribution gives the total number of data points that fall below or above the boundary of each class. A frequency distribution can be transformed into a cumulative frequency distribution by adding up the frequencies of the classes that lie above (or below) each class boundary. We start with the lower boundary of the first class and end at the upper boundary of the last class. In a "more than" cumulative frequency distribution, we add the frequencies of all classes that lie above each class boundary. In a "less than" cumulative frequency distribution, we add the frequencies of all classes that lie below each class boundary.

Example: Using the frequency distribution of the weights of mice, prepare (i) a "more than" and (ii) a "less than" cumulative frequency distribution for the weights of mice.

Solution:

Weight (gm)        Cumulative Frequency
Less than 86.5     0
Less than 95.5     0 + 5 = 5
Less than 104.5    0 + 5 + 6 = 11
Less than 113.5    0 + 5 + 6 + 7 = 18
Less than 122.5    0 + 5 + 6 + 7 + 12 = 30
Less than 131.5    0 + 5 + 6 + 7 + 12 + 10 = 40
Less than 140.5    0 + 5 + 6 + 7 + 12 + 10 + 7 = 47
Less than 149.5    0 + 5 + 6 + 7 + 12 + 10 + 7 + 3 = 50

A "less than" cumulative frequency distribution

Weight (gm)        Cumulative Frequency
More than 86.5     50
More than 95.5     45
More than 104.5    39
More than 113.5    32
More than 122.5    20
More than 131.5    10
More than 140.5    3
More than 149.5    0

A "more than" cumulative frequency distribution
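Both cumulative distributions can be obtained from the class frequencies with a running sum. The sketch below uses `itertools.accumulate` from the Python standard library; the variable names are illustrative only.

```python
from itertools import accumulate

# Class boundaries and class frequencies for the weights of mice.
boundaries = [86.5, 95.5, 104.5, 113.5, 122.5, 131.5, 140.5, 149.5]
frequencies = [5, 6, 7, 12, 10, 7, 3]

# "Less than": running total of the frequencies below each boundary,
# starting from 0 at the lower boundary of the first class.
less_than = [0] + list(accumulate(frequencies))

# "More than": whatever is not below a boundary lies above it.
total = less_than[-1]
more_than = [total - count for count in less_than]

for boundary, below, above in zip(boundaries, less_than, more_than):
    print(f"{boundary:>6}   less than: {below:>2}   more than: {above:>2}")
```

At every boundary the two cumulative frequencies add up to the total of 50, so either table can be derived from the other.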

2.2 Graphical Presentation


A graphic display can reveal, at a glance, the main characteristics of a data set. The bar chart and the pie chart are two types of graphs that can be used to display qualitative data. The most commonly used graphs for quantitative data are histograms, frequency polygons, and cumulative frequency graphs (ogives).

Bar Chart
A bar chart is a graph made up of bars whose heights represent the frequencies of the respective categories in a frequency table.

Example: Draw a bar chart for the following frequency distribution:

Job Sector                Number of students
Information Technology    44
Engineering               30
Medical                   15
Others                    11

Solution:

[Figure: Bar chart of the number of students in each job sector.]

Pie Charts
A pie chart is a circular representation of a frequency table, divided into different portions or sectors, each representing a different category. The size of each sector should be drawn proportional to the relative frequency or percentage of its category. The proportion can be obtained by calculating the size of the angle for each sector, as shown below:

Angle of each sector = $\frac{\text{sector frequency}}{\text{total frequency}} \times 360^\circ$

Example: Draw a pie chart for the above frequency distribution.

Solution: Based on the above frequency table, the angles of the different sectors are obtained as follows:

Job Sector                Frequency    Percentage of circle    Angle at the centre
Information Technology    44           44%                     158.4°
Engineering               30           30%                     108.0°
Medical                   15           15%                     54.0°
Others                    11           11%                     39.6°
Total                     100          100%                    360°

[Figure: Pie chart. Information Technology 44.0%, Engineering 30.0%, Medical 15.0%, Others 11.0%.]
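The sector angles follow directly from the formula for the angle of each sector. A small Python sketch, with the hypothetical function name `sector_angles`, is:

```python
# Number of students in each job sector (from the bar chart example).
sectors = {
    "Information Technology": 44,
    "Engineering": 30,
    "Medical": 15,
    "Others": 11,
}

def sector_angles(freqs):
    """Angle of each sector = sector frequency / total frequency x 360 degrees."""
    total = sum(freqs.values())
    return {name: f / total * 360 for name, f in freqs.items()}

for name, angle in sector_angles(sectors).items():
    print(f"{name:<24} {angle:6.1f} degrees")
```

Because the angles are proportional to the frequencies, they always add up to 360 degrees regardless of the total frequency.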

Histogram
A histogram is a bar graph, usually constructed for a set of grouped continuous data. The continuity of the data values is depicted by bars that are placed next to each other, without any gaps in between. To ensure that there are no gaps between adjacent bars, class boundaries are used in the construction of histograms. The width of the bars in a histogram represents the width of the classes in a frequency distribution, and these values are plotted on the horizontal scale. Class frequencies, relative frequencies, or percentages are plotted on the vertical scale and represented by the heights of the bars. Histograms cannot be used with frequency distributions that have open classes, and they must be used with extreme care when class intervals are not equal. In cases where class intervals are not equal, it is best to represent the class frequencies by the areas of the bars instead of their heights.

Example: Draw a histogram for the following distribution of mice weights:

Weight (gram)    Class Boundary    Frequency
87 - 95          86.5 - 95.5       5
96 - 104         95.5 - 104.5      6
105 - 113        104.5 - 113.5     7
114 - 122        113.5 - 122.5     12
123 - 131        122.5 - 131.5     10
132 - 140        131.5 - 140.5     7
141 - 149        140.5 - 149.5     3

Solution:

[Figure: Histogram of the weights of mice. Horizontal axis: weights of mice, marked at the class boundaries 86.5 to 149.5; vertical axis: frequency.]

Frequency Polygon
A frequency polygon is constructed by plotting class frequencies against the midpoints of the classes. The points are then connected by straight lines, thus forming a polygon. To close the frequency polygon, an additional class interval is added to both ends of the distribution, each with zero frequency. A frequency polygon is usually drawn on a histogram, as shown in the diagram below.

[Figure: Frequency polygon drawn over the histogram of the weights of mice. Horizontal axis: weights of mice, marked at the class boundaries 86.5 to 149.5; vertical axis: frequency.]
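The points of a frequency polygon can be listed before drawing. The sketch below computes them for the mice data, including the two extra zero-frequency classes that close the polygon. The function name `polygon_points` is hypothetical, and the code assumes that all classes have the same width.

```python
# Class boundaries and class frequencies for the weights of mice.
boundaries = [86.5, 95.5, 104.5, 113.5, 122.5, 131.5, 140.5, 149.5]
frequencies = [5, 6, 7, 12, 10, 7, 3]

def polygon_points(bounds, freqs):
    """Return (midpoint, frequency) pairs, closed with a zero class at each end."""
    width = bounds[1] - bounds[0]  # assumes equal class widths
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

for x, y in polygon_points(boundaries, frequencies):
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the first and last points sit on the horizontal axis at the midpoints of the added empty classes.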

Ogive
A cumulative frequency polygon, or ogive, is obtained by plotting the cumulative frequency of values that are less than an upper class boundary against that upper class boundary. The points are then joined together by straight lines. If relative cumulative frequencies or percentages are used instead, the graph is called a relative frequency ogive or a percentage ogive.

Example: Draw an ogive based on a "less than" cumulative frequency distribution of the mice weights.

Weight (gram)      Cumulative Frequency
Less than 86.5     0
Less than 95.5     5
Less than 104.5    11
Less than 113.5    18
Less than 122.5    30
Less than 131.5    40
Less than 140.5    47
Less than 149.5    50

Solution:

[Figure: "Less than" ogive for the weights of mice. Horizontal axis: weights of mice, marked at the class boundaries 86.5 to 149.5; vertical axis: cumulative frequency, 0 to 60.]

From an ogive, we can estimate the number of data points with values that are less than or more than a given value. We can also estimate the data value at the centre of the distribution (the median) or at any other location.

Example: Based on the "more than" ogive below, estimate (i) the number of mice with weight less than 130 gram and (ii) the weight which is exceeded by fifty percent of all the mice in the sample.
[Figure: "More than" ogive for the weights of mice. Horizontal axis: weight of mice, marked at the class boundaries 86.5 to 149.5; vertical axis: cumulative frequency, 0 to 60.]

Solution:

(i) Locate 130 gm on the x-axis (the weight of mice axis). From this position, project a straight line upward until it meets the ogive. Continue the projection to the left until it meets the y-axis (the cumulative frequency axis). The cumulative frequency corresponding to a weight of 130 gm is found to be approximately 12. Hence, we estimate:

number of mice weighing more than 130 gm ≈ 12
number of mice weighing 130 gm or less ≈ (50 - 12) = 38

(ii) Locate the 25th cumulative frequency (50 percent of the 50 mice) on the cumulative frequency axis. Then project a straight line to the right until it meets the ogive. Continue the projection downward until it meets the weight of mice axis. The weight corresponding to a cumulative frequency of 25 is found to be approximately 118 gm. Hence, we estimate that 50% of the mice in the sample weigh more than 118 gm.
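Reading a value off an ogive amounts to linear interpolation along the straight line segments that join the plotted points. The sketch below is one way to reproduce the two graphical readings numerically. It works from the "less than" cumulative frequencies and subtracts from the total of 50 to get the "more than" count, which is an assumption of convenience rather than the method used in the solution; the function names are hypothetical.

```python
# Class boundaries and "less than" cumulative frequencies for the mice weights.
boundaries = [86.5, 95.5, 104.5, 113.5, 122.5, 131.5, 140.5, 149.5]
less_than = [0, 5, 11, 18, 30, 40, 47, 50]

def cumulative_below(x, xs, cs):
    """Cumulative frequency below weight x, interpolated along the ogive."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return cs[i] + fraction * (cs[i + 1] - cs[i])
    raise ValueError("x lies outside the class boundaries")

def weight_at(c, xs, cs):
    """Weight at which the cumulative frequency reaches c, by interpolation."""
    for i in range(len(cs) - 1):
        if cs[i] <= c <= cs[i + 1] and cs[i + 1] > cs[i]:
            fraction = (c - cs[i]) / (cs[i + 1] - cs[i])
            return xs[i] + fraction * (xs[i + 1] - xs[i])
    raise ValueError("c lies outside the cumulative frequency range")

below_130 = cumulative_below(130, boundaries, less_than)
above_130 = less_than[-1] - below_130
median_weight = weight_at(25, boundaries, less_than)

print(round(below_130, 2), round(above_130, 2), round(median_weight, 2))
```

The interpolated values are about 38.3 mice below 130 gm, about 11.7 above it, and a median weight of 118.75 gm. These agree with the graphical estimates to within the accuracy one can expect when reading values off a hand-drawn graph.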
