
Basic Concepts of Quantitative Research
Dr. R. Ouyang

Results: Data Types and Preparation for Analysis

Different kinds of data represent different scales of measurement. There are four types of measurement scales, i.e., four types of data we usually deal with: nominal, ordinal, interval, and ratio. It is important to know which type of scale or data you collect for the research and which statistics are appropriate for your data analysis.

Four scales of measurement (four types of data)

Nominal (categories): A nominal scale represents the lowest level of measurement. Such a scale classifies persons or objects into two or more categories; in other words, nominal data are based on classification and categorization. When a nominal scale is used, the data simply indicate how many subjects are in each category. Categories labeled 4 and 1 differ as categories, not as quantities: 4 is not higher than 1 or more than 1. Examples: categories for IQ, types of school.

Ordinal (ranks): An ordinal scale puts the subjects in order from highest to lowest, from most to least. Although ordinal scales indicate that some subjects are higher, or better, than others, they do not indicate how much higher or better. For example, if subjects A, B, C, and D measure 4'5", 5'1", 6'2", and 5'6" in height, the rank order is 1 for C, 2 for D, 3 for B, and 4 for A.

Interval (scores): An interval scale has all the characteristics of nominal and ordinal scales and, in addition, is based upon predetermined equal intervals. Most of the tests used in educational research, such as achievement tests, aptitude tests, and intelligence tests, represent interval scales. Interval scales, however, do not have a true zero point. Such scales typically have an arbitrary maximum score and an arbitrary minimum score, or zero point. If an IQ test produces scores ranging from 0 to 200, a score of 0 does not indicate the absence of intelligence, nor does a score of 200 indicate possession of the ultimate intelligence. A score of 0 only indicates the lowest level of performance possible on that particular test, and a score of 200 represents the highest. We can say that an achievement test score of 90 is 45 points higher than a score of 45, but we cannot say that a person scoring 90 knows twice as much as a person scoring 45. Similarly, a person with a measured IQ of 140 is not necessarily twice as smart or twice as intelligent as a person with a measured IQ of 70.

Ratio: A ratio scale represents the highest, most precise level of measurement. A ratio scale has all the advantages of the other types of scales and, in addition, a meaningful, true zero point. Height, weight, time, distance, and speed are examples.

Process of coding data

Scoring procedure: All instruments administered should be scored accurately and consistently; each subject's test should be scored using the same procedures and criteria. For a self-developed test, if anything other than objective-type items (such as multiple-choice questions) is to be scored, it is advisable to have at least one other person score the tests as a reliability check. For a standardized test, it is better to make sure all answer sheets are marked correctly and scored by the machine properly.

Coding data: Coding data consists of developing a system by which the data and identification information are specified and organized in preparation for the analysis. If a large number of subjects are involved, coding of the data is especially important. Data for all variables and subjects are usually converted to numerical values when entered into the database management program, since long entries take considerable space and contribute to typographical and spelling errors that disrupt subsequent manipulations.

Steps of coding data: 1) give each subject an ID number; 2) decide how nonnumerical or categorical data will be coded; 3) prepare all data for analysis.

Statistical packages such as SPSS, SAS, and JMP-IN include programs for many statistics, from the most basic to the most sophisticated, frequently used in research studies.
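
As a minimal sketch of the three coding steps above, assuming Python with the pandas library (the subjects, variables, and coding scheme are invented for illustration):

```python
import pandas as pd

# Hypothetical raw records; names and categories are invented for illustration.
raw = pd.DataFrame({
    "name":   ["Ann", "Ben", "Cal"],
    "gender": ["female", "male", "female"],
    "school": ["public", "private", "public"],
    "score":  [87, 74, 91],
})

# Step 1: give each subject an ID number.
raw.insert(0, "id", range(1, len(raw) + 1))

# Step 2: decide how categorical data will be coded, then apply the codes.
codes = {"gender": {"female": 1, "male": 2},
         "school": {"public": 1, "private": 2}}
for col, mapping in codes.items():
    raw[col] = raw[col].map(mapping)

# Step 3: drop identifying information and keep numeric data for analysis.
data = raw.drop(columns="name")
print(data)
```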

Types of data

There are four types of data that may be gathered in social research, each one building on the properties of the previous. Thus ordinal data is also nominal, and so on.

(Diagram: the four scales nest inside one another, with nominal at the core, then ordinal, then interval, then ratio.)

Nominal
The name 'nominal' comes from the Latin nomen, meaning 'name': nominal data are items which are differentiated by a simple naming system. The only thing a nominal scale does is to say that the items being measured have something in common, although this may not be described. Nominal items may have numbers assigned to them. This may appear ordinal but is not; the numbers are used only to simplify capture and referencing. Nominal items are usually categorical, in that they belong to a definable category, such as 'employees'. Example: the number pinned on a sports person; a set of countries.

Ordinal
Items on an ordinal scale are set into some kind of order by their position on the scale. This may indicate, for example, temporal position or superiority.

The order of items is often defined by assigning numbers to them to show their relative position. Letters or other sequential symbols may also be used as appropriate. Ordinal items are usually categorical, in that they belong to a definable category, such as '1956 marathon runners'. You cannot do arithmetic with ordinal numbers; they show sequence only. Example: the first, third, and fifth person in a race; pay bands in an organization, as denoted by A, B, C, and D.

Interval
Interval data (also sometimes called integer data) is measured along a scale on which each position is equidistant from its neighbours. This allows the distance between two pairs of values to be equivalent in some way. Such scales are often used in psychological experiments that measure attributes along an arbitrary scale between two extremes. Interval data cannot be multiplied or divided. Example: my level of happiness, rated from 1 to 10; temperature, in degrees Fahrenheit.

Ratio
In a ratio scale, numbers can be compared as multiples of one another: one person can be twice as tall as another. Importantly, the number zero has meaning. The difference between a person aged 35 and one aged 38 is the same as the difference between people aged 12 and 15, and a person can have an age of zero. Ratio data can be multiplied and divided because not only is the difference between 1 and 2 the same as between 3 and 4, but 4 is also twice as much as 2. Interval and ratio data measure quantities and hence are quantitative. Because they can be measured on a scale, they are also called scale data. Example: a person's weight; the number of pizzas I can eat before fainting.

Parametric vs. Non-parametric


Interval and ratio data are parametric and are used with parametric tools, in which distributions are predictable (and often normal). Nominal and ordinal data are non-parametric and do not assume any particular distribution; they are used with non-parametric tools such as the histogram.

Continuous and Discrete


Continuous measures are measured along a continuous scale which can be divided into fractions, such as temperature. Continuous variables allow for infinitely fine sub-division, which means if you can measure sufficiently accurately, you can compare two items and determine the difference.

Discrete variables are measured across a set of fixed values, such as age in years (not microseconds). These are commonly used on arbitrary scales, such as scoring your level of happiness, although such scales can also be continuous.

Variables in research

What are Variables?


Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.

Correlational vs. Experimental Research


Most empirical research belongs clearly to one of these two general categories. In correlational research, we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables. For example, a researcher might artificially increase blood pressure and then record cholesterol level. Data analysis in experimental research also comes down to calculating "correlations" between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that "A influences B." Data from correlational research can only be "interpreted" in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.

Dependent vs. Independent Variables


Independent variables are those that are manipulated whereas dependent variables are only measured or registered. This distinction appears terminologically confusing to many because, as some students say, "all variables depend on something." However, once you get used to this distinction, it becomes indispensable. The terms dependent and independent variable apply mostly to experimental research where some variables are manipulated, and in this sense they are "independent" from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on the manipulation or experimental conditions. That is to say, they depend on "what the subject will do" in response. Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to "experimental groups" based on some pre-existing properties of the subjects. For example, if in an experiment, males are compared to females regarding their white cell count (WCC), Gender could be called the independent variable and WCC the dependent variable.

Measurement Scales
Variables differ in how well they can be measured, i.e., in how much measurable information their measurement scale can provide. There is obviously some measurement error involved in every measurement, which determines the amount of information that we can obtain. Another factor that determines the amount of information that can be provided by a variable is its type of measurement scale. Specifically, variables are classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.

1. Nominal variables allow for only qualitative classification. That is, they can be measured only in terms of whether the individual items belong to some distinctively different categories, but we cannot quantify or even rank order those categories. For example, all we can say is that two individuals are different in terms of variable A (e.g., they are of different race), but we cannot say which one "has more" of the quality represented by the variable. Typical examples of nominal variables are gender, race, color, city, etc.

2. Ordinal variables allow us to rank order the items we measure in terms of which has less and which has more of the quality represented by the variable, but still they do not allow us to say "how much more." A typical example of an ordinal variable is the socioeconomic status of families. For example, we know that upper-middle is higher than middle but we cannot say that it is, for example, 18% higher. Also, this very distinction between nominal, ordinal, and interval scales itself represents a good example of an ordinal variable: we can say that nominal measurement provides less information than ordinal measurement, but we cannot say "how much less" or how this difference compares to the difference between ordinal and interval scales.

3. Interval variables allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of differences between them. For example, temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.

4. Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, and thus they allow for statements such as "x is two times more than y." Typical examples of ratio scales are measures of time or space. For example, as the Kelvin temperature scale is a ratio scale, not only can we say that a temperature of 200 degrees is higher than one of 100 degrees, we can correctly state that it is twice as high. Interval scales do not have the ratio property. Most statistical data analysis procedures do not distinguish between the interval and ratio properties of the measurement scales.
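
The cumulative nature of the four scale types can be made concrete with a small sketch: each level of measurement inherits the statistics that are defensible at the levels below it. This is an illustrative simplification in Python, not a definitive rule book; the groupings are conventional examples only.

```python
# A rough lookup of which statistics are conventionally defensible
# at each level of measurement (a simplification, not a rule book).
PERMISSIBLE = {
    "nominal":  ["mode", "frequency counts", "chi-square"],
    "ordinal":  ["median", "percentiles", "rank correlation"],
    "interval": ["mean", "standard deviation", "Pearson correlation"],
    "ratio":    ["geometric mean", "coefficient of variation", "ratios"],
}

ORDER = ["nominal", "ordinal", "interval", "ratio"]

def allowed_statistics(scale):
    """Each level inherits everything permissible at the levels below it."""
    i = ORDER.index(scale)
    stats = []
    for level in ORDER[: i + 1]:
        stats.extend(PERMISSIBLE[level])
    return stats

print(allowed_statistics("interval"))
```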


Relations between Variables


Regardless of their type, two or more variables are related if, in a sample of observations, the values of those variables are distributed in a consistent manner. In other words, variables are related if their values systematically correspond to each other for these observations. For example, Gender and WCC would be considered to be related if most males had high WCC and most females low WCC, or vice versa; Height is related to Weight because, typically, tall individuals are heavier than short ones; IQ is related to the Number of Errors in a test if people with higher IQ's make fewer errors.
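
A hedged illustration of the Height/Weight case in Python with NumPy (the ten observations are invented): a correlation coefficient near 1 summarizes how systematically the values of the two variables correspond across observations.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg) for ten people.
height = np.array([155, 160, 165, 168, 170, 173, 176, 180, 185, 190])
weight = np.array([ 52,  55,  61,  63,  66,  70,  72,  77,  82,  88])

# A coefficient near 1 indicates that values of the two variables
# correspond systematically: taller people tend to be heavier.
r = np.corrcoef(height, weight)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to 1 for these made-up data
```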

Why Relations between Variables are Important


Generally speaking, the ultimate goal of every research or scientific analysis is to find relations between variables. The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities; either way involves relations between variables. Thus, the advancement of science must always involve finding new relations between variables. Correlational research involves measuring such relations in the most straightforward manner. However, experimental research is not any different in this respect. For example, the above mentioned experiment comparing WCC in males and females can be described as looking for a correlation between two variables: Gender and WCC. Statistics does nothing else but help us evaluate relations between variables. Actually, all of the hundreds of procedures that are described in this online textbook can be interpreted in terms of evaluating various kinds of intervariable relations.

Two Basic Features of Every Relation between Variables


The two most elementary formal properties of every relation between variables are the relation's (a) magnitude (or "size") and (b) its reliability (or "truthfulness").

1. Magnitude (or "size"). The magnitude is much easier to understand and measure than the reliability. For example, if every male in our sample was found to have a higher WCC than any female in the sample, we could say that the magnitude of the relation between the two variables (Gender and WCC) is very high in our sample; in other words, we could predict one based on the other (at least among the members of our sample).

2. Reliability (or "truthfulness"). The reliability of a relation is a much less intuitive concept, but still extremely important. It pertains to the "representativeness" of the result found in our specific sample for the entire population. In other words, it says how probable it is that a similar relation would be found if the experiment was replicated with other samples drawn from the same population. Remember that we are almost never "ultimately" interested only in what is going on in our sample; we are interested in the sample only to the extent it can provide information about the population. If our study meets some specific criteria (to be mentioned later), then the reliability of a relation between variables observed in our sample can be quantitatively estimated and represented using a standard measure (technically called the p-value or statistical significance level; see the next paragraph).


What is "Statistical Significance" (p-value)?


The statistical significance of a result is the probability that the observed relationship (e.g., between variables) or a difference (e.g., between means) in a sample occurred by pure chance ("luck of the draw"), and that in the population from which the sample was drawn, no such relationship or differences exist. Using less technical terms, we could say that the statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population"). More technically, the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, a p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments such as ours one after another, we could expect that in approximately every 20 replications of the experiment there would be one in which the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design. See also Power Analysis.) In many areas of research, a p-value of .05 is customarily treated as a "borderline acceptable" error level.
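
The logic of this paragraph can be reproduced by simulation. The sketch below (Python/NumPy; the observed r of .45 and the n of 20 are invented for illustration) repeatedly draws samples from a population in which no relation exists and counts how often a relation at least as strong appears by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the logic of a p-value: assume NO relation in the population,
# replicate the "experiment" many times, and count how often a relation
# at least as strong as the observed one appears purely by chance.
n, observed_r, reps = 20, 0.45, 10_000
count = 0
for _ in range(reps):
    x = rng.standard_normal(n)   # two independent variables, so any
    y = rng.standard_normal(n)   # sample correlation is a "fluke"
    if abs(np.corrcoef(x, y)[0, 1]) >= observed_r:
        count += 1

print(f"Estimated p-value: {count / reps:.3f}")  # close to .05 here
```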

How to Determine that a Result is "Really" Significant


There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really "significant." That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori or only found post hoc in the course of many analyses and comparisons performed on the data set, on the total amount of consistent supportive evidence in the entire data set, and on "traditions" existing in the particular area of research. Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically significant, but remember that this level of significance still involves a pretty high probability of error (5%). Results that are significant at the p ≤ .01 level are commonly considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called "highly" significant. But remember that these classifications represent nothing else but arbitrary conventions that are only informally based on general research experience.

Statistical Significance and the Number of Analyses Performed


Needless to say, the more analyses you perform on a data set, the more results will meet "by chance" the conventional significance level. For example, if you calculate correlations between ten variables (i.e., 45 different correlation coefficients), then you should expect to find by chance that about two (i.e., one in every 20) correlation coefficients are significant at the p ≤ .05 level, even if the values of the variables are totally random and those variables do not correlate in the population. Some statistical methods that involve many comparisons and, thus, a good chance for such errors include some "correction" or adjustment for the total number of comparisons. However, many statistical methods (especially simple exploratory data analyses) do not offer any straightforward remedies to this problem. Therefore, it is up to the researcher to carefully evaluate the reliability of unexpected findings. Many examples in this online textbook offer specific advice on how to do this; relevant information can also be found in most research methods textbooks.
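
This expectation is easy to verify by simulation; a sketch in Python with NumPy and SciPy, using ten purely random variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Ten purely random variables, 100 cases each: no real relations exist.
data = rng.standard_normal((100, 10))

significant = 0
for i in range(10):
    for j in range(i + 1, 10):            # the 45 distinct pairs
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < .05:
            significant += 1

# Expect roughly 45 * .05, i.e. about 2, to be "significant" by chance.
print(f"{significant} of 45 correlations significant at p < .05")
```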

Strength vs. Reliability of a Relation between Variables


We said before that strength and reliability are two different features of relationships between variables. However, they are not totally independent. In general, in a sample of a particular size, the larger the magnitude of the relation between variables, the more reliable the relation (see the next paragraph).

Why Stronger Relations between Variables are More Significant


Assuming that there is no relation between the respective variables in the population, the most likely outcome would be also finding no relation between these variables in the research sample. Thus, the stronger the relation found in the sample, the less likely it is that there is no corresponding relation in the population. As you see, the magnitude and significance of a relation appear to be closely related, and we could calculate the significance from the magnitude and vice-versa; however, this is true only if the sample size is kept constant, because the relation of a given strength could be either highly significant or not significant at all, depending on the sample size (see the next paragraph).

Why Significance of a Relation between Variables Depends on the Size of the Sample
If there are very few observations, then there are also respectively few possible combinations of the values of the variables and, thus, the probability of obtaining by chance a combination of those values indicative of a strong relation is relatively high. Consider the following illustration. If we are interested in two variables (Gender: male/female and WCC: high/low), and there are only four subjects in our sample (two males and two females), then the probability that we will find, purely by chance, a 100% relation between the two variables can be as high as one-eighth. Specifically, there is a one-in-eight chance that both males will have a high WCC and both females a low WCC, or vice versa. Now consider the probability of obtaining such a perfect match by chance if our sample consisted of 100 subjects; the probability of obtaining such an outcome by chance would be practically zero.

Let's look at a more general example. Imagine a theoretical population in which the average value of WCC in males and females is exactly the same. Needless to say, if we start replicating a simple experiment by drawing pairs of samples (of males and females) of a particular size from this population and calculating the difference between the average WCC in each pair of samples, most of the experiments will yield results close to 0. However, from time to time, a pair of samples will be drawn where the difference between males and females will be quite different from 0. How often will it happen? The smaller the sample size in each experiment, the more likely it is that we will obtain such erroneous results, which in this case would be results indicative of the existence of a relation between Gender and WCC obtained from a population in which such a relation does not exist.
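
The one-in-eight figure for the four-subject example can be checked by simulation (Python/NumPy; the 50/50 coding of WCC as high/low follows the text):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two males and two females; WCC is "high" (1) or "low" (0) at random,
# i.e. Gender and WCC are unrelated by construction.
reps = 100_000
perfect = 0
for _ in range(reps):
    males = rng.integers(0, 2, size=2)
    females = rng.integers(0, 2, size=2)
    # A "100% relation": both males high and both females low, or vice versa.
    if (males.all() and not females.any()) or (females.all() and not males.any()):
        perfect += 1

print(f"Perfect relation by chance: {perfect / reps:.3f}")  # about 1/8 = .125
```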


Example: Baby Boys to Baby Girls Ratio


Consider this example from research on statistical reasoning (Nisbett et al., 1987). There are two hospitals: in the first one, 120 babies are born every day; in the other, only 12. On average, the ratio of baby boys to baby girls born every day in each hospital is 50/50. However, one day, in one of those hospitals, twice as many baby girls were born as baby boys. In which hospital was it more likely to happen? The answer is obvious for a statistician, but, as research shows, not so obvious for a lay person: it is much more likely to happen in the small hospital. The reason is that, technically speaking, the probability of a random deviation of a particular size from the population mean decreases with the increase in the sample size.
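
A quick check of the hospital example with SciPy's binomial distribution, interpreting "twice as many girls as boys" as girls making up at least two-thirds of the day's births:

```python
from scipy import stats

# Probability that girls make up at least two-thirds of one day's births
# (i.e. at least twice as many girls as boys), with P(girl) = .5 per birth.
for births in (12, 120):
    girls_needed = (2 * births) // 3
    p = stats.binom.sf(girls_needed - 1, births, 0.5)
    print(f"{births:3d} births/day: P(girls >= {girls_needed}) = {p:.4f}")
# The deviation is far more likely in the small hospital (~.19 vs ~.0002).
```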

Why Small Relations Can be Proven Significant Only in Large Samples


The examples in the previous paragraphs indicate that if a relationship between variables in question is "objectively" (i.e., in the population) small, then there is no way to identify such a relation in a study unless the research sample is correspondingly large. Even if our sample is in fact "perfectly representative," the effect will not be statistically significant if the sample is small. Analogously, if a relation in question is "objectively" very large, then it can be found to be highly significant even in a study based on a very small sample. Consider this additional illustration. If a coin is slightly asymmetrical and, when tossed, is somewhat more likely to produce heads than tails (e.g., 60% vs. 40%), then ten tosses would not be sufficient to convince anyone that the coin is asymmetrical even if the outcome obtained (six heads and four tails) was perfectly representative of the bias of the coin. However, is it so that 10 tosses is not enough to prove anything? No; if the effect in question were large enough, then ten tosses could be quite enough. For instance, imagine now that the coin is so asymmetrical that no matter how you toss it, the outcome will be heads. If you tossed such a coin ten times and each toss produced heads, most people would consider it sufficient evidence that something is wrong with the coin. In other words, it would be considered convincing evidence that in the theoretical population of an infinite number of tosses of this coin, there would be more heads than tails. Thus, if a relation is large, then it can be found to be significant even in a small sample.
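
Both coin scenarios can be expressed as binomial tests (Python/SciPy):

```python
from scipy import stats

# Ten tosses of a mildly biased coin: 6 heads is unconvincing evidence.
print(stats.binomtest(6, n=10, p=0.5).pvalue)   # about .75, far from .05

# Ten tosses of a drastically biased coin: 10 heads out of 10.
print(stats.binomtest(10, n=10, p=0.5).pvalue)  # about .002, convincing
```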

Can "No Relation" be a Significant Result?


The smaller the relation between variables, the larger the sample size that is necessary to prove it significant. For example, imagine how many tosses would be necessary to prove that a coin is asymmetrical if its bias were only .000001%! Thus, the necessary minimum sample size increases as the magnitude of the effect to be demonstrated decreases. When the magnitude of the effect approaches 0, the necessary sample size to conclusively prove it approaches infinity. That is to say, if there is almost no relation between two variables, then the sample size must be almost equal to the population size, which is assumed to be infinitely large. Statistical significance represents the probability that a similar outcome would be obtained if we tested the entire population. Thus, everything that would be found after testing the entire population would be, by definition, significant at the highest possible level, and this also includes all "no relation" results.


How to Measure the Magnitude (Strength) of Relations between Variables


There are very many measures of the magnitude of relationships between variables that have been developed by statisticians; the choice of a specific measure in given circumstances depends on the number of variables involved, the measurement scales used, the nature of the relations, etc. Almost all of them, however, follow one general principle: they attempt to evaluate the observed relation by comparing it to the "maximum imaginable relation" between those specific variables.

Technically speaking, a common way to perform such evaluations is to look at how differentiated the values of the variables are, and then calculate what part of this "overall available differentiation" is accounted for by instances when that differentiation is "common" to the two (or more) variables in question. Speaking less technically, we compare "what is common in those variables" to "what potentially could have been common if the variables were perfectly related."

Let's consider a simple illustration. Say that in our sample the average index of WCC is 100 in males and 102 in females. Thus, we could say that, on average, the deviation of each individual score from the grand mean (101) contains a component due to the gender of the subject, and that the size of this component is 1. That value, in a sense, represents some measure of the relation between Gender and WCC. However, this value is a very poor measure, because it does not tell us how relatively large this component is given the "overall differentiation" of WCC scores. Consider two extreme possibilities:

1. If all WCC scores of males were equal exactly to 100 and those of females equal to 102, then all deviations from the grand mean in our sample would be entirely accounted for by gender. We would say that in our sample, Gender is perfectly correlated with WCC; that is, 100% of the observed differences between subjects regarding their WCC are accounted for by their gender.

2. If WCC scores were in the range of 0-1000, the same difference (of 2) between the average WCC of males and females found in the study would account for such a small part of the overall differentiation of scores that most likely it would be considered negligible. For example, one more subject taken into account could change, or even reverse, the direction of the difference.

Therefore, every good measure of relations between variables must take into account the overall differentiation of individual scores in the sample and evaluate the relation in terms of (relatively) how much of this differentiation is accounted for by the relation in question.
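
The WCC illustration can be written out as a ratio of between-group variation to total variation; a minimal Python sketch, with case 2's 0-1000 spread simulated:

```python
import numpy as np

def variance_explained(groups):
    """Between-group sum of squares over total sum of squares: the share
    of the overall differentiation accounted for by group membership."""
    allv = np.concatenate(groups)
    grand = allv.mean()
    ss_total = ((allv - grand) ** 2).sum()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Extreme case 1: all males exactly 100, all females exactly 102.
print(variance_explained([np.full(50, 100.0), np.full(50, 102.0)]))  # 1.0

# Extreme case 2: the same 2-point gender gap lost in a 0-1000 spread.
rng = np.random.default_rng(3)
males = rng.uniform(0, 1000, 50)
females = rng.uniform(0, 1000, 50) + 2
print(variance_explained([males, females]))  # close to 0
```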


Common "General Format" of Most Statistical Tests


Because the ultimate goal of most statistical tests is to evaluate relations between variables, most statistical tests follow the general format explained in the previous paragraph. Technically speaking, they represent a ratio of some measure of the differentiation common to the variables in question to the overall differentiation of those variables. For example, they may represent a ratio of the part of the overall differentiation of the WCC scores that can be accounted for by gender to the overall differentiation of the WCC scores. This ratio is usually called a ratio of explained variation to total variation. In statistics, the term explained variation does not necessarily imply that we "conceptually understand" it. It is used only to denote the common variation in the variables in question, that is, the part of variation in one variable that is "explained" by the specific values of the other variable, and vice versa.

How the "Level of Statistical Significance" is Calculated


Let's assume that we have already calculated a measure of a relation between two variables (as explained above). The next question is "how significant is this relation?" For example, is 40% of the explained variance between the two variables enough to consider the relation significant? The answer is "it depends." Specifically, the significance depends mostly on the sample size. As explained before, in very large samples, even very small relations between variables will be significant, whereas in very small samples even very large relations cannot be considered reliable (significant). Thus, in order to determine the level of statistical significance, we need a function that represents the relationship between "magnitude" and "significance" of relations between two variables, depending on the sample size. The function we need would tell us exactly "how likely it is to obtain a relation of a given magnitude (or larger) from a sample of a given size, assuming that there is no such relation between those variables in the population." In other words, that function would give us the significance (p) level, and it would tell us the probability of error involved in rejecting the idea that the relation in question does not exist in the population. This "alternative" hypothesis (that there is no relation in the population) is usually called the null hypothesis. It would be ideal if the probability function was linear and, for example, only had different slopes for different sample sizes. Unfortunately, the function is more complex and is not always exactly the same; however, in most cases we know its shape and can use it to determine the significance levels for our findings in samples of a particular size. Most of these functions are related to a general type of function, which is called normal.
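
For the special case of a correlation coefficient, that function is known, and the dependence of significance on both magnitude and sample size is easy to demonstrate (Python/SciPy; the r of .30 is an arbitrary illustrative value):

```python
import numpy as np
from scipy import stats

def p_from_r(r, n):
    """Two-sided p for a Pearson correlation r observed in a sample of n."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same magnitude of relation, judged at different sample sizes:
for n in (10, 30, 100):
    print(f"r = .30, n = {n:3d}  ->  p = {p_from_r(0.30, n):.3f}")
# Not significant at n = 10 or n = 30, but clearly significant at n = 100.
```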

Why the "Normal Distribution" is Important


The "normal distribution" is important because in most cases, it well approximates the function that was introduced in the previous paragraph (for a detailed illustration, see Are All Test Statistics Normally Distributed?). The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the normal distribution represents one of the empirically verified elementary "truths about the general nature of reality," and its status can be compared to the one of fundamental laws of natural sciences. The exact shape of the normal distribution (the characteristic "bell curve") is defined by a function that has only two parameters: mean and standard deviation. A characteristic property of the normal distribution is that 68% of all of its observations fall within a range of 1 standard deviation from the mean, and a range of 2 standard deviations includes 95% of the scores. In other words, in a normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) If you have access to STATISTICA, you can explore the exact values of probability associated with different values in the normal distribution using the interactive Probability Calculator tool; for example, if you enter the Z value (i.e., standardized value) of 4, the associated probability computed bySTATISTICA will be less than .0001, because in the normal distribution almost all observations (i.e.,

more than 99.99%) fall within the range of 4 standard deviations. The animation below shows the tail area associated with other Z values.
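
The same probabilities can be computed with any statistics library; a sketch using SciPy's normal distribution:

```python
from scipy import stats

# Share of a normal distribution lying within z standard deviations of the mean.
for z in (1, 2, 4):
    inside = stats.norm.cdf(z) - stats.norm.cdf(-z)
    print(f"within {z} SD: {inside:.4%}   outside: {1 - inside:.4%}")
# 1 SD: ~68%; 2 SD: ~95%; 4 SD leaves well under .01% in the two tails.
```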


Illustration of How the Normal Distribution is Used in Statistical Reasoning (Induction)


Recall the example discussed above, where pairs of samples of males and females were drawn from a population in which the average value of WCC in males and females was exactly the same. Although the most likely outcome of such experiments (one pair of samples per experiment) was that the difference between the average WCC in males and females in each pair is close to zero, from time to time, a pair of samples will be drawn where the difference between males and females is quite different from 0. How often does it happen? If the sample size is large enough, the results of such replications are "normally distributed" (this important principle is explained and illustrated in the next paragraph) and, thus, knowing the shape of the normal curve, we can precisely calculate the probability of obtaining "by chance" outcomes representing various levels of deviation from the hypothetical population mean of 0. If such a calculated probability is so low that it meets the previously accepted criterion of statistical significance, then we have only one choice: conclude that our result gives a better approximation of what is going on in the population than the "null hypothesis" (remember that the null hypothesis was considered only for "technical reasons" as a benchmark against which our empirical result was evaluated). Note that this entire reasoning is based on the assumption that the shape of the distribution of those "replications" (technically, the "sampling distribution") is normal. This assumption is discussed in the next paragraph.

Are All Test Statistics Normally Distributed?


Not all, but most of them are either based on the normal distribution directly or on distributions that are related to and can be derived from normal, such as t, F, or Chi-square. Typically, these tests require that the variables analyzed are themselves normally distributed in the population, that is, that they meet the so-called "normality assumption." Many observed variables actually are normally distributed, which is another reason why the normal distribution represents a "general feature" of empirical reality. The problem may occur when we try to use a normal distribution-based test to analyze data from variables that are themselves not normally distributed (see tests of normality in Nonparametrics or ANOVA/MANOVA). In such cases, we have two general choices. First, we can use some alternative "nonparametric" test (or so-called "distribution-free" test; see Nonparametrics); but this is often inconvenient because such tests are typically less powerful and less flexible in terms of the types of conclusions that they can provide. Alternatively, in many cases we can still use the normal distribution-based test if we only make sure that the size of our samples is large enough. The latter option is based on an extremely important principle that is largely responsible for the popularity of tests that are based on the normal function. Namely, as the sample size increases, the shape of the sampling distribution (i.e., the distribution of a statistic from the sample; this term was first used by Fisher, 1928a) approaches normal shape, even if the distribution of the variable in question is not normal. This principle can be illustrated by a series of sampling distributions, created with gradually increasing sample sizes of 2, 5, 10, 15, and 30, using a variable that is clearly non-normal in the population, that is, a variable whose distribution of values is clearly skewed.

However, as the sample size (of the samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal; by n = 30, its shape is "almost" perfectly normal. This principle is called the central limit theorem (the term was first used by Pólya, 1920; German, "Zentraler Grenzwertsatz").
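
A sketch of this principle in Python: draw many samples of each size from a clearly skewed population (an exponential distribution, chosen here for illustration) and watch the skewness of the sampling distribution of the mean shrink toward zero, the value for a normal distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Draw 20,000 samples of each size n from a clearly skewed population
# (exponential) and compute the mean of each sample. The skewness of the
# resulting sampling distribution shrinks toward 0, the normal value.
for n in (2, 5, 10, 15, 30):
    means = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    print(f"n = {n:2d}: skewness of sampling distribution = {stats.skew(means):.2f}")
```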

How Do We Know the Consequences of Violating the Normality Assumption?


Although many of the statements made in the preceding paragraphs can be proven mathematically, some of them do not have theoretical proof and can be demonstrated only empirically, via so-called Monte-Carlo experiments. In these experiments, large numbers of samples are generated by a computer following predesigned specifications, and the results from such samples are analyzed using a variety of tests. This way we can empirically evaluate the type and magnitude of errors or biases to which we are exposed when certain theoretical assumptions of the tests we are using are not met by our data. Specifically, Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research.
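
A toy Monte-Carlo experiment of this kind is easy to run. The sketch below (Python/SciPy; the sample size and distribution are chosen arbitrarily) draws both groups from the same skewed population, so every significant t-test result is a false positive; a robust test should stay near the nominal 5% error rate despite the violated normality assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Both samples come from the SAME skewed population, so every significant
# t-test result is a false positive under a true null hypothesis.
reps, n = 5_000, 30
false_pos = 0
for _ in range(reps):
    a = rng.exponential(size=n)
    b = rng.exponential(size=n)
    if stats.ttest_ind(a, b).pvalue < .05:
        false_pos += 1

print(f"Empirical false-positive rate: {false_pos / reps:.3f}")  # near .05
```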

Chapter 3: Levels Of Measurement And Scaling


Chapter Objectives
Structure Of The Chapter
Levels of measurement
Nominal scales
Measurement scales
Comparative scales
Noncomparative scales
Chapter Summary
Key Terms
Review Questions
Chapter References

A common feature of marketing research is the attempt to have respondents communicate their feelings, attitudes, opinions, and evaluations in some measurable form. To this end, marketing researchers have developed a range of scales, each with unique properties. What is important for the marketing analyst to realise is that these scales have widely differing measurement properties. Some scales are, at best, limited in their mathematical properties to the extent that they can only establish an association between variables. Other scales have more extensive mathematical properties, and some hold out the possibility of establishing cause-and-effect relationships between variables.

Chapter Objectives
This chapter will give the reader:
- An understanding of the four levels of measurement that can be taken by researchers
- The ability to distinguish between comparative and non-comparative measurement scales, and
- A basic tool-kit of scales that can be used for the purposes of marketing research.

Structure Of The Chapter


All measurements must take one of four forms and these are described in the opening section of the chapter. After the properties of the four categories of scale have been explained, various forms of comparative and non-comparative scales are illustrated. Some of these scales are numeric, others are semantic and yet others take a graphical form. The marketing researcher who is familiar with the complete tool kit of scaling measurements is better equipped to understand markets.

Levels of measurement

Most texts on marketing research explain the four levels of measurement: nominal, ordinal, interval and ratio and so the treatment given to them here will be brief. However, it is an important topic since the type of scale used in taking measurements directly impinges on the statistical techniques which can legitimately be used in the analysis.

Nominal scales
This, the crudest of measurement scales, classifies individuals, companies, products, brands or other entities into categories where no order is implied. Indeed, it is often referred to as a categorical scale. It is a system of classification and does not place the entity along a continuum. It involves a simple count of the frequency of the cases assigned to the various categories, and, if desired, numbers can be assigned to label each category, as in the example below:

Figure 3.1 An example of a nominal scale
Which of the following food items do you tend to buy at least once per month? (Please tick)

Okra
Palm Oil
Milled Rice
Peppers
Prawns
Pasteurised milk

The numbers have no arithmetic properties and act only as labels. The only measure of average which can be used is the mode, because this is simply a set of frequency counts. Hypothesis tests can be carried out on data collected in the nominal form. The most likely would be the Chi-square test. However, it should be noted that the Chi-square is a test to determine whether two or more variables are associated and the strength of that relationship. It can tell nothing about the form of that relationship, where one exists, i.e. it is not capable of establishing cause and effect.

Ordinal scales

Ordinal scales involve the ranking of individuals, attitudes or items along the continuum of the characteristic being scaled. For example, if a researcher asked farmers to rank 5 brands of pesticide in order of preference, he/she might obtain responses like those in figure 3.2 below.

Figure 3.2 An example of an ordinal scale used to determine farmers' preferences among 5 brands of pesticide
Order of preference   Brand
1                     Rambo
2                     R.I.P.
3                     Killalot
4                     D.O.A.
5                     Bugdeath

From such a table the researcher knows the order of preference but nothing about how much more one brand is preferred to another; that is, there is no information about the interval between any two brands. All of the information a nominal scale would have given is available from an ordinal scale. In addition, positional statistics such as the median, quartile and percentile can be determined.

It is possible to test for order correlation with ranked data. The two main methods are Spearman's Ranked Correlation Coefficient and Kendall's Coefficient of Concordance. Using either procedure one can, for example, ascertain the degree to which two or more survey respondents agree in their ranking of a set of items. Consider again the ranking of pesticides example in figure 3.2. The researcher might wish to measure similarities and differences in the rankings of pesticide brands according to whether the respondents' farm enterprises were classified as "arable" or "mixed" (a combination of crops and livestock). The resultant coefficient takes a value in the range 0 to 1. A zero would mean that there was no agreement between the two groups, and 1 would indicate total agreement. It is more likely that an answer somewhere between these two extremes would be found. (A worked sketch of this kind of rank correlation follows figure 3.3 below.)

The only other permissible hypothesis testing procedures are the runs test and the sign test. The runs test (also known as the Wald-Wolfowitz test) is used to determine whether a sequence of binomial data - meaning it can take only one of two possible values, e.g. African/non-African, yes/no, male/female - is random or contains systematic 'runs' of one or other value. Sign tests are employed when the objective is to determine whether there is a significant difference between matched pairs of data. The sign test tells the analyst if the number of positive differences in ranking is approximately equal to the number of negative rankings, in which case the distribution of rankings is random, i.e. apparent differences are not significant. The test takes into account only the direction of differences and ignores their magnitude, and hence it is compatible with ordinal data.

Interval scales

It is only with interval scaled data that researchers can justify the use of the arithmetic mean as the measure of average. The interval or cardinal scale has equal units of measurement, thus making it possible to interpret not only the order of scale scores but also the distance between them. However, it must be recognised that the zero point on an interval scale is arbitrary and is not a true zero. This of course has implications for the type of data manipulation and analysis we can carry out on data collected in this form. It is possible to add or subtract a constant to all of the scale values without affecting the form of the scale, but one cannot multiply or divide the values. It can be said that two respondents with scale positions 1 and 2 are as far apart as two respondents with scale positions 4 and 5, but not that a person with score 10 feels twice as strongly as one with score 5. Temperature is interval scaled, being measured either in Centigrade or Fahrenheit. We cannot speak of 50°F being twice as hot as 25°F, since the corresponding temperatures on the Centigrade scale, 10°C and -3.9°C, are not in the ratio 2:1.

Interval scales may be either numeric or semantic. Study the examples below in figure 3.3.

Figure 3.3 Examples of interval scales in numeric and semantic formats
Please indicate your views on Balkan Olives by scoring them on a scale of 5 down to 1 (i.e. 5 = Excellent; 1 = Poor) on each of the criteria listed. Circle the appropriate score on each line.

Balkan Olives are:
Succulence               5  4  3  2  1
Fresh tasting            5  4  3  2  1
Free of skin blemish     5  4  3  2  1
Good value               5  4  3  2  1
Attractively packaged    5  4  3  2  1
(a)

Please indicate your views on Balkan Olives by ticking the appropriate responses below:

                              Excellent   Very Good   Good   Fair   Poor
Succulent
Freshness
Freedom from skin blemish
Value for money
Attractiveness of packaging
(b)
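
As the worked sketch of rank correlation promised above (Python/SciPy): suppose the five pesticide brands of figure 3.2 were ranked separately by arable and by mixed farmers. The rankings below are invented for illustration.

```python
from scipy import stats

# Invented rankings of the five pesticide brands in figure 3.2
# (Rambo, R.I.P., Killalot, D.O.A., Bugdeath; 1 = most preferred),
# as given by "arable" farmers and by "mixed" farmers.
arable = [1, 2, 3, 4, 5]
mixed = [2, 1, 3, 5, 4]

rho, p = stats.spearmanr(arable, mixed)
print(f"Spearman rho = {rho:.2f}")  # 0.80: strong but imperfect agreement
```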

Most of the common statistical methods of analysis require only interval scales in order that they might be used. These are not recounted here because they are so common and can be found in virtually all basic texts on statistics.

Ratio scales

The highest level of measurement is a ratio scale. This has the properties of an interval scale together with a fixed origin or zero point. Examples of variables which are ratio scaled include weights, lengths and times. Ratio scales permit the researcher to compare both differences in scores and the relative magnitude of scores. For instance, the difference between 5 and 10 minutes is the same as that between 10 and 15 minutes, and 10 minutes is twice as long as 5 minutes. Given that sociological and management research seldom aspires beyond the interval level of measurement, it is not proposed that particular attention be given to this level of analysis. Suffice it to say that virtually all statistical operations can be performed on ratio scales.

Measurement scales
The various types of scales used in marketing research fall into two broad categories: comparative and noncomparative. In comparative scaling, the respondent is asked to compare one brand or product against another. With noncomparative scaling, respondents need only evaluate a single product or brand; their evaluation is independent of the other products and/or brands which the marketing researcher is studying. Noncomparative scaling is frequently referred to as monadic scaling, and this is the more widely used type of scale in commercial marketing research studies.

Comparative scales

Paired comparison: It is sometimes the case that marketing researchers wish to find out which are the most important factors in determining the demand for a product. Conversely, they may wish to know which are the most important factors acting to prevent the widespread adoption of a product. Take, for example, the very poor farmer response to the first design of an animal-drawn mouldboard plough. A combination of exploratory research and shrewd observation suggested that the following factors played a role in shaping the attitudes of those farmers who felt negatively towards the design:

Does not ridge
Does not work for inter-cropping
Far too expensive
New technology too risky
Too difficult to carry.

Suppose the organisation responsible wants to know which factor is foremost in the farmer's mind. It may well be the case that if the factors most important to the farmer are addressed, the others, being of a relatively minor nature, will cease to prevent widespread adoption. The alternatives are to abandon the product's re-development or to completely redesign it, which is not only expensive and time-consuming, but may well be subject to a new set of objections. The process of rank ordering the objections from most to least important is best approached through the questioning technique known as 'paired comparison'. Each of the objections is paired by the researcher so that with 5 factors, as in this example, there are 10 pairs:

In 'paired comparisons' every factor has to be paired with every other factor in turn. However, only one pair is ever put to the farmer at any one time. The question might be put as follows: Which of the following was the more important in making you decide not to buy the plough?
The plough was too expensive [ ]
It proved too difficult to transport [ ]

In most cases the question, and the alternatives, would be put to the farmer verbally. He/she then indicates which of the two was the more important and the researcher ticks the box on his questionnaire. The question is repeated with a second set of factors and the appropriate box ticked again. This process continues until all possible combinations are exhausted, in this case 10 pairs. It is good practice to mix the pairs of factors so that there is no systematic bias. The researcher should try to ensure that any particular factor is sometimes the first of the pair to be mentioned and sometimes the second. The researcher would never, for example, take the first factor (on this occasion 'Does not ridge') and

systematically compare it to each of the others in succession. That is likely to cause systematic bias. Below, labels have been given to the factors so that the worked example will be easier to understand. The letters A - E have been allocated as follows:

A = Does not ridge
B = Far too expensive
C = New technology too risky
D = Does not work for inter-cropping
E = Too difficult to carry

The data is then arranged into a matrix. Assume that 200 farmers have been interviewed and their responses are arranged in the grid below. Further assume that the matrix is so arranged that we read from top to side: each entry is the number of farmers who said the column factor was the greater deterrent. This means, for example, that 164 out of 200 farmers said the fact that the plough was too expensive (B) was a greater deterrent than the fact that it was not capable of ridging (A). Similarly, 174 farmers said that the plough's inability to inter-crop (D) was more important than its inability to ridge (A) when deciding not to buy the plough.

Figure 3.4 A preference matrix

      A    B    C    D    E
A   100  164  120  174  180
B    36  100  160  176  166
C    80   40  100  168  124
D    26   24   32  100  102
E    20   34   76   98  100

If the grid is carefully read, it can be seen that the rank order of the factors is:

Most important    E   Too difficult to carry
                  D   Does not inter-crop
                  C   New technology/high risk
                  B   Too expensive
Least important   A   Does not ridge
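
The reading of the grid can be automated: a factor outranks a rival when it wins their head-to-head comparison, i.e. when more than 100 of the 200 farmers preferred it. A minimal Python sketch using the figure 3.4 matrix:

```python
import numpy as np

# The figure 3.4 preference matrix, read "column over row": for example,
# the entry in row A, column B (164) is the number of farmers who said
# B (too expensive) was a greater deterrent than A (does not ridge).
labels = ["A", "B", "C", "D", "E"]
counts = np.array([
    [100, 164, 120, 174, 180],
    [ 36, 100, 160, 176, 166],
    [ 80,  40, 100, 168, 124],
    [ 26,  24,  32, 100, 102],
    [ 20,  34,  76,  98, 100],
])

# A factor outranks a rival when it wins their head-to-head majority,
# i.e. when its column entry exceeds 100 of the 200 farmers.
wins = (counts > 100).sum(axis=0)
ranking = [label for _, label in sorted(zip(wins, labels), reverse=True)]
print(ranking)  # ['E', 'D', 'C', 'B', 'A'], matching the text
```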

It can be seen that it is more important for designers to concentrate on improving transportability and, if possible, to give the plough an inter-cropping capability, rather than focusing on its ridging capabilities (remember that the example is entirely hypothetical).

One major advantage of this type of questioning is that, whilst it is possible to obtain a measure of the order of importance of five or more factors from the respondent, he is never asked to think about more than two factors at any one time. This is especially useful when dealing with illiterate farmers. Having said that, the researcher has to be careful not to present too many pairs of factors to the farmer during the interview. If he does, he will find that the farmer will quickly get tired and/or bored. It is as well to remember the formula n(n - 1)/2. For ten factors, brands or product attributes this would give 45 pairs. Clearly the farmer should not be asked to subject himself to having the same question put to him 45 times. For practical purposes, six factors is possibly the limit, giving 15 pairs. It should be clear from the procedures described in these notes that the paired comparison scale gives ordinal data.

Dollar Metric Comparisons: This type of scale is an extension of the paired comparison method in that it requires respondents to indicate both their preference and how much they are willing to pay for their preference. This scaling technique gives the marketing researcher an interval-scaled measurement. An example is given in figure 3.5.

Figure 3.5 An example of a dollar metric scale
Which of the following types of fish    How much more, in cents, would you be
do you prefer?                          prepared to pay for your preferred fish?

Fresh / Fresh (gutted)                  $0.70
Fresh (gutted) / Smoked                  0.50
Frozen / Smoked                          0.60
Frozen / Fresh                           0.70
Smoked / Fresh                           0.20
Fresh (gutted) / Frozen                  0.30

From the data above, the preferences shown below can be computed as follows:

Fresh fish:            0.70 + 0.70 + 0.20          =  1.60
Fresh fish (gutted):  (-0.70) + 0.30 + 0.50        =  0.10
Smoked fish:           0.60 + (-0.20) + (-0.50)    = -0.10
Frozen fish:          (-0.60) + (-0.70) + (-0.30)  = -1.60
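
The computation is a simple zero-sum tally, sketched below in Python. Note that which member of each pair was preferred, and the 0.30 premium in the final pair, are inferred here from the computed totals, since the original table's marking of the preferred item did not survive.

```python
# Each pair contributes its premium (in dollars) to the preferred type
# and the same amount, negatively, to the rejected type.
premia = {
    ("fresh", "gutted"): 0.70,
    ("gutted", "smoked"): 0.50,
    ("smoked", "frozen"): 0.60,
    ("fresh", "frozen"): 0.70,
    ("fresh", "smoked"): 0.20,
    ("gutted", "frozen"): 0.30,  # inferred pair: gutted preferred to frozen
}

scores = {fish: 0.0 for fish in ("fresh", "gutted", "smoked", "frozen")}
for (winner, loser), dollars in premia.items():
    scores[winner] += dollars
    scores[loser] -= dollars

for fish, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{fish:>7s}: {score:+.2f}")  # fresh +1.60 ... frozen -1.60
```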

The unity-sum-gain technique: A common problem with launching new products is reaching a decision as to which options, and how many options, one offers. Whilst a company may be anxious to meet the needs of as many market segments as possible, it has to ensure that each segment is large enough to enable it to make a profit. It is always easier to add products to the product line but much more difficult to decide which models should be deleted. One technique for evaluating the options which are likely to prove successful is the unity-sum-gain approach.

The procedure is to begin with a list of features which might possibly be offered as 'options' on the product, and alongside each you list its retail cost. A third column is constructed, and this forms an index of the relative prices of each of the items. The table below will help clarify the procedure. For the purposes of this example the basic reaper is priced at $20,000, and some possible 'extras' are listed along with their prices. The total value of these hypothetical 'extras' is $7,460, but the researcher tells the farmer he has an equally hypothetical $3,950 or some similar sum. The important thing is that he should have considerably less hypothetical money to spend than the total value of the alternative product features. In this way the farmer is encouraged to reveal his preferences by allowing researchers to observe how he trades one additional benefit off against another. For example, would he prefer a side rake attachment on a 3 metre head rather than a transporter trolley on either a standard or 2.5 m wide head? The farmer has to be told that any unspent money cannot be retained by him, so he should seek the best value-for-money he can get.

In cases where the researcher believes that mentioning specific prices might introduce some form of bias into the results, the index can be used instead. This is constructed by taking the price of each item over the total of $7,460 and multiplying by 100. Survey respondents might then be given a maximum of 60 points and, as before, asked how they would spend these 60 points. In this crude example the index numbers are not too easy to work with for most respondents, so one would round them, as has been done in the adjusted column. It is the relative and not the absolute value of the items which is important, so the precision of the rounding need not overly concern us.

Figure 3.6 The unity-sum-gain technique
Item                                         Additional cost ($s)   Index   Adjusted index
2.5m wide head rather than standard 2m       2,000                  27      30
Self-lubricating chain rather than belt      3,500                  47      50
Side rake attachment                         350                    5       10
Polymer heads rather than steel              250                    3       5
Double rather than single edged cutters      210                    2.5     5
Transporter trolley for reaper attachment    650                    9       10
Automatic levelling of table                 300                    4       5

The unity-sum-gain technique is useful for determining which product features are more important to farmers. The design of the final market version of the product can then reflect the farmers' needs and preferences. Practitioners treat data gathered by this method as ordinal.
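As a minimal sketch of how the index column in figure 3.6 is derived, each item's cost is expressed as a percentage of the total value of the 'extras'. The prices below follow the figure (the chain price is reconstructed from its printed index value), and the labels are shortened for convenience:

# Hypothetical option prices, as in figure 3.6 (labels shortened).
options = {
    "2.5m wide head": 2000,
    "Self-lubricating chain": 3500,
    "Side rake attachment": 350,
    "Polymer heads": 250,
    "Double edged cutters": 210,
    "Transporter trolley": 650,
    "Automatic levelling": 300,
}

TOTAL = 7460  # total value of the 'extras' as quoted in the text

for item, price in options.items():
    index = price / TOTAL * 100  # relative price index
    print(f"{item:25s} {index:5.1f}")

The 'adjusted index' column is then produced by rounding these values to convenient figures. Since only the relative values matter, the rounding can be quite loose, as the figure shows.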

Noncomparative scales
Continuous rating scales: The respondents are asked to give a rating by placing a mark at the appropriate position on a continuous line. The scale can be written on a card and shown to the respondent during the interview. Two versions of a continuous rating scale are depicted in figure 3.7.

Figure 3.7 Continuous rating scales

When version B is used, the respondent's score is determined either by dividing the line into as many categories as desired and assigning the respondent a score based on the category into which his/her mark falls, or by measuring the distance, in millimetres or inches, from either end of the scale (a short scoring sketch is given after figure 3.9). Whichever of these forms of the continuous scale is used, the results are normally analysed as interval-scaled.

Line marking scale: The line marked scale is typically used to measure perceived similarity differences between products, brands or other objects. Technically, such a scale is a form of what is termed a semantic differential scale, since each end of the scale is labelled with a word/phrase (or semantic) that is opposite in meaning to the other. Figure 3.8 provides an illustrative example of such a scale: consider the products below which can be used when frying food; in the case of each pair, indicate how similar or different they are in the flavour which they impart to the food.

Figure 3.8 An example of a line marking scale

For some types of respondent, the line scale is an easier format because they do not find that discrete numbers (e.g. 5, 4, 3, 2, 1) best reflect their attitudes/feelings. The line marking scale is a continuous scale.

Itemised rating scales: With an itemised scale, respondents are provided with a scale having numbers and/or brief descriptions associated with each category and are asked to select one of the limited number of categories, ordered in terms of scale position, that best describes the product, brand, company or product attribute being studied. Examples of the itemised rating scale are illustrated in figure 3.9.

Figure 3.9 Itemised rating scales
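As promised above, here is a minimal sketch of the two scoring approaches for version B of the continuous scale, assuming a 100 mm line and a mark measured at a given distance from the left-hand end (the line length and category count are illustrative choices, not prescribed by the text):

LINE_MM = 100  # assumed physical length of the printed line

def interval_score(mark_mm: float) -> float:
    """Score as the measured distance from the left end, in mm."""
    return mark_mm

def category_score(mark_mm: float, n_categories: int = 10) -> int:
    """Score by dividing the line into equal categories (1..n)."""
    cat = int(mark_mm / LINE_MM * n_categories) + 1
    return min(cat, n_categories)  # a mark at the extreme right stays in the top category

print(interval_score(62.0))   # 62.0 mm from the left end
print(category_score(62.0))   # falls in category 7 of 10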

Itemised rating scales can take a variety of innovative forms, as demonstrated by the two graphic examples illustrated in figure 3.10.

Figure 3.10 Graphic itemised scales

Whichever form of itemised scale is applied, researchers usually treat the data as interval level.

Semantic scales: This type of scale makes extensive use of words rather than numbers. Respondents describe their feelings about the products or brands on scales with semantic labels. When bipolar adjectives are used at the end points of the scales, these are termed semantic differential scales. The semantic scale and the semantic differential scale are illustrated in figure 3.11.

Figure 3.11 Semantic and semantic differential scales

Likert scales: A Likert scale is what is termed a summated instrument scale. This means that the items making up a Likert scale are summed to produce a total score. In fact, a Likert scale is a composite of itemised scales. Typically, each scale item will have 5 categories, with scale values ranging from -2 to +2 with 0 as the neutral response. This explanation may be clearer from the example in figure 3.12.

Figure 3.12 The Likert scale
                                                     Strongly                             Strongly
                                                     agree     Agree   Neither  Disagree  disagree
If the price of raw materials fell, firms would
reduce the price of their food products.                1        2        3        4         5
Without government regulation the firms would
exploit the consumer.                                   1        2        3        4         5
Most food companies are so concerned about making
profits they do not care about quality.                 1        2        3        4         5
The food industry spends a great deal of money
making sure that its manufacturing is hygienic.         1        2        3        4         5
Food companies should charge the same price for
their products throughout the country.                  1        2        3        4         5

Likert scales are treated as yielding interval data by the majority of marketing researchers. The scales which have been described in this chapter are among the most commonly used in marketing research. Whilst there are a great many more forms which scales can take, if students are familiar with those described in this chapter they will be well equipped to deal with most types of survey problem.
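Before leaving Likert scales, here is a hedged sketch of the summation step for one respondent answering the five statements in figure 3.12. The reverse-coding of the fourth statement is a hypothetical choice made for illustration (it is standard practice when a statement is worded in the opposite attitudinal direction to the others, but it is not prescribed by the text above):

# One respondent's answers to the five statements in figure 3.12
# (1 = strongly agree ... 5 = strongly disagree).
responses = [2, 1, 2, 4, 3]

# Hypothetical choice: statement 4 is worded favourably towards the
# food industry, opposite to the others, so it is reverse-coded
# before summing so that a high total means a consistent attitude.
REVERSED = {3}  # zero-based index of statement 4

score = sum(
    (6 - r) if i in REVERSED else r  # on a 1-5 scale, the reverse of r is 6 - r
    for i, r in enumerate(responses)
)
print(score)  # summated Likert score for this respondent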

Chapter Summary
There are four levels of measurement: nominal, ordinal, interval and ratio. These constitute a hierarchy where the lowest scale of measurement, nominal, has far fewer mathematical properties than those further up this hierarchy of scales. Nominal scales yield data on categories; ordinal scales give sequences; interval scales begin to reveal the magnitude between points on the scale; and ratio scales explain both order and the absolute distance between any two points on the scale.

The measurement scales commonly used in marketing research can be divided into two groups: comparative and non-comparative scales. Comparative scales involve the respondent in signalling where there is a difference between two or more products, services, brands or other stimuli. Examples of such scales include paired comparison, dollar metric, unity-sum-gain and line marking scales. Non-comparative scales, described in the textbook, are continuous rating scales, itemised rating scales, semantic differential scales and Likert scales.

Garbage in, garbage out



Garbage in, garbage out (abbreviated to GIGO, coined as a pun on the phrase first-in, first-out) is a phrase in the field of computer science or information and communication technology. It is used primarily to call attention to the fact that computers will unquestioningly process the most nonsensical of input data ("garbage in") and produce nonsensical output ("garbage out"). It was most popular in the early days of computing, but applies even more today, when powerful computers can spew out mountains of erroneous information in a short time. The term was coined as a teaching mantra by George Fuechsel,[1] an IBM 305 RAMAC technician/instructor in New York. Early programmers were required to test virtually every program step and were cautioned not to expect that the resulting program would "do the right thing" when given imperfect input. The underlying principle was noted by the inventor of the first programmable computing device design:

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. (Charles Babbage, Passages from the Life of a Philosopher[2])

The phrase is also commonly used to describe failures in human decision-making due to faulty, incomplete, or imprecise data. The term can also be used as an explanation for the poor quality of a digitized audio or video file. Although digitizing can be the first step in cleaning up a signal, it does not, by itself, improve the quality. Defects in the original analog signal will be faithfully recorded, but may be identified and removed by a subsequent step. (See digital signal processing.)

Garbage in, gospel out is a more recent expansion of the acronym. It is a sardonic comment on the tendency to put excessive trust in "computerized" data, and on the propensity for individuals to blindly accept what the computer says. Because the data goes through the computer, people tend to believe it. Decision-makers increasingly face computer-generated information and analyses that could be collected and analyzed in no other way. Precisely for that reason, going behind that output is out of the question, even if one has good cause to be suspicious. In short, the computer analysis becomes the gospel.[3]

Chapter 5 Standardized Measurement and Assessment

Defining Measurement

When we measure, we attempt to identify the dimensions, quantity, capacity, or degree of something. Measurement is formally defined as the act of measuring by assigning symbols or numbers to something according to a specific set of rules. Measurement can be categorized by the type of information that is communicated by the symbols or numbers assigned to the variables of interest. In particular, there are four levels or types of information, and these are discussed next in the chapter. They are called the four "scales of measurement."

Scales of Measurement

1. Nominal Scale. This is a nonquantitative measurement scale. It is used to categorize, label, classify, name, or identify variables. It classifies groups or types.

Numbers can be used to label the categories of a nominal variable but the numbers serve only as markers, not as indicators of amount or quantity (e.g., if you wanted to, you could mark the categories of the variable called "gender" with 1 = female and 2 = male). Some examples of nominal-level variables are the country you were born in, college major, personality type, and experimental condition (e.g., experimental group or control group).

2. Ordinal Scale. This level of measurement enables one to make ordinal judgments (i.e., judgments about rank order). Any variable where the levels can be ranked (but you don't know if the distance between the levels is the same) is an ordinal variable. Some examples are order of finish in a marathon, the Billboard Top 40, and rank in class.

3. Interval Scale. This scale or level of measurement has the characteristics of rank order and equal intervals (i.e., the distance between adjacent points is the same). It does not possess an absolute zero point. Some examples are Celsius temperature, Fahrenheit temperature, and IQ scores. Here is the idea of the lack of a true zero point: zero degrees Celsius does not mean no temperature at all; it is simply the freezing point of water (equal to 32 degrees on the Fahrenheit scale). Zero degrees on these scales does not mean zero or no temperature.

4. Ratio Scale. This is a scale with a true zero point. It also has all of the "lower level" characteristics (i.e., the key characteristic of each of the lower level scales) of equal intervals (interval scale), rank order (ordinal scale), and ability to mark a value with a name (nominal scale). Some examples of ratio-level scales are number correct, weight, height, response time, Kelvin temperature, and annual income. Here is an example of the presence of a true zero point: if your annual income is exactly zero dollars then you earned no annual income at all. (You can buy absolutely nothing with zero dollars.) Zero means zero.

Assumptions Underlying Testing and Measurement

Before I list the assumptions, note the difference between testing and assessment. According to the definitions that we use: Testing is the process of measuring variables by means of devices or procedures designed to obtain a sample of behavior, and Assessment is the gathering and integration of data for the purpose of making an educational evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatus and measurement procedures. In this section of the text, we also list the twelve assumptions that Cohen et al. consider basic to testing and assessment:

1. Psychological traits and states exist. A trait is a relatively enduring (i.e., long lasting) characteristic on which people differ; a state is a less enduring or more transient characteristic on which people differ. Traits and states are actually social constructions, but they are real in the sense that they are useful for classifying and organizing the world, they can be used to understand and predict behavior, and they refer to something in the world that we can measure.

2. Psychological traits and states can be quantified and measured. For nominal scales, the number is used as a marker. For the other scales, the numbers become more and more quantitative as you move from ordinal scales (show ranking only) to interval scales (show amount, but lack a true zero point) to ratio scales (show amount or quantity as we usually understand this concept in mathematics or everyday use of the term). Most traits and states measured in education are taken to be at the interval level of measurement.

3. Various approaches to measuring aspects of the same thing can be useful. For example, different tests of intelligence tap into somewhat different aspects of the construct of intelligence.

4. Assessment can provide answers to some of life's most momentous questions. It is important that the users of assessment tools know when these tools will provide answers to their questions.

5. Assessment can pinpoint phenomena that require further attention or study. For example, assessment may identify someone as having dyslexia, having low self-esteem, or being at risk for drug use.

6. Various sources of data enrich and are part of the assessment process. Information from several sources usually should be obtained in order to make an accurate and informed decision. For example, the idea of portfolio assessment is useful.

7. Various sources of error are always part of the assessment process. There is no such thing as perfect measurement. All measurement has some error. We defined error as the difference between a person's true score and that person's observed score. The two main types of error are random error (e.g., error due to transient factors such as being sick or tired) and systematic error (e.g., error present every time the measurement instrument is used, such as an essay exam being graded by an overly easy grader). (Later, when we discuss reliability and validity, you might note that unreliability is due to random error and lack of validity is due to systematic error.)

8. Tests and other measurement techniques have strengths and weaknesses. It is essential that users of tests understand this so that they can use them appropriately and intelligently.

In this chapter, we will be talking about the two major characteristics: reliability and validity.

9. Test-related behavior predicts non-test-related behavior. The goal of testing usually is to predict behavior other than the exact behaviors required while the exam is being taken. For example, paper-and-pencil achievement tests given to children are used to say something about their level of achievement. Another paper-and-pencil test (also called a self-report test) that is popular in counseling is the MMPI (i.e., the Minnesota Multiphasic Personality Inventory). Clients' scores on this test are used as indicators of the presence or absence of various mental disorders. The point here is that the actual mechanics of measurement (e.g., self-reports, behavioral performance, projective techniques) can vary widely and still provide good measurement of educational, psychological, and other types of variables.

10. Present-day behavior sampling predicts future behavior. Perhaps the most important reason for giving tests is to predict future behavior. Tests provide a sample of present-day behavior. However, this "sample" is used to predict future behavior. For example, an employment test given by someone in a personnel office may be used as a predictor of future work behavior. Another example: the Beck Depression Inventory is used to measure depression and, importantly, to predict test takers' future behavior (e.g., are they a risk to themselves?).

11. Testing and assessment can be conducted in a fair and unbiased manner. This requires careful construction of test items and testing of the items on different types of people. Test makers always have to be on the alert to make sure tests are fair and unbiased. This assumption also requires that the test be administered to those types of people for whom it has been shown to operate properly.

12. Testing and assessment benefit society. Many critical decisions are made on the basis of tests (e.g., teacher competency, employability, presence of a psychological disorder, degree of teacher satisfaction, degree of student satisfaction, etc.). Without tests, the world would be much more unpredictable.

Identifying A Good Test or Assessment Procedure

As mentioned earlier in the chapter, good measurement is fundamental for research. If we do not have good measurement then we cannot have good research. That's why it's so important to use testing and assessment procedures that are characterized by high reliability and high validity.

Overview of Reliability and Validity

As an introduction to reliability and validity and how they are related, note the following:

Reliability refers to the consistency or stability of test scores.

Validity refers to the accuracy of the inferences or interpretations we make from test scores.

Reliability is a necessary but not sufficient condition for validity (i.e., if you are going to have validity, you must have reliability, but reliability in and of itself is not enough to ensure validity). Assume you weigh 125 pounds. If you weigh yourself five times and get 135, 134, 134, 135, and 136, then your scales are reliable but not valid. The scores were consistent but wrong! Again, you want your scales to be both reliable and valid.
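The bathroom-scale example can be made concrete with a short sketch. The readings are the hypothetical numbers from the example above, and numpy is assumed to be available:

import numpy as np

true_weight = 125.0
readings = np.array([135, 134, 134, 135, 136], dtype=float)

# Reliability is about consistency: the spread of repeated readings.
print(readings.std(ddof=1))            # ~0.84 pounds: very consistent (reliable)

# Validity is about accuracy: how close the readings are to the truth.
print(readings.mean() - true_weight)   # ~9.8 pounds of systematic error (not valid)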

Reliability

Reliability refers to consistency or stability. In psychological and educational testing, it refers to the consistency or stability of the scores that we get from a test or assessment procedure. Reliability is usually determined using a correlation coefficient (it is called a reliability coefficient in this context). Remember (from chapter two) that a correlation coefficient is a measure of relationship that varies from -1 to 0 to +1, and the farther the number is from zero, the stronger the correlation. For example, minus one (-1.00) indicates a perfect negative correlation, zero indicates no correlation at all, and positive one (+1.00) indicates a perfect positive correlation. Regarding strength, -.85 is stronger than +.55, and +.75 is stronger than +.35. When you have a negative correlation, the variables move in opposite directions (e.g., poor diet and life expectancy); when you have a positive correlation, the variables move in the same direction (e.g., education and income).

When looking at reliability coefficients we are interested in the values ranging from 0 to 1; that is, we are only interested in positive correlations. Note that zero means no reliability, and +1.00 means perfect reliability. Reliability coefficients of .70 or higher are generally considered to be acceptable for research purposes. Reliability coefficients of .90 or higher are needed to make decisions that have impacts on people's lives (e.g., the clinical uses of tests). Reliability is empirically determined; that is, we must check the reliability of test scores with specific sets of people and obtain the reliability coefficients of interest to us. There are four primary ways to measure reliability.

1. The first type of reliability is called test-retest reliability. This refers to the consistency of test scores over time. It is measured by correlating the test scores obtained at one point in time with the test scores obtained at a later point in time for a group of people. A primary issue is identifying the appropriate time interval between the two testing occasions. The longer the time interval between the two testing occasions, the lower the reliability coefficient tends to be.

2. The second type of reliability is called equivalent forms reliability.

This refers to the consistency of test scores obtained on two equivalent forms of a test designed to measure the same thing. It is measured by correlating the scores obtained by giving two forms of the same test to a group of people. The success of this method hinges on the equivalence of the two forms of the test.
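Both test-retest and equivalent forms reliability reduce to correlating two sets of scores from the same people. Here is a minimal sketch with hypothetical scores, numpy assumed:

import numpy as np

# Hypothetical scores for eight people tested on two occasions
# (or, equivalently, on two forms of the same test).
time1 = np.array([85, 72, 90, 66, 78, 95, 60, 83], dtype=float)
time2 = np.array([82, 75, 88, 70, 74, 97, 63, 80], dtype=float)

# The reliability coefficient is simply the Pearson correlation.
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))  # close to 1.0 here, indicating highly consistent scores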

3. The third type of reliability is called internal consistency reliability. It refers to the consistency with which the items on a test measure a single construct. Internal consistency reliability only requires one administration of the test, which makes it a very convenient form of reliability. One type of internal consistency reliability is split-half reliability, which involves splitting a test into two equivalent halves and checking the consistency of the scores obtained from the two halves. The measure of internal consistency that we emphasize in the chapter is coefficient alpha (also sometimes called Cronbach's alpha). The beauty of coefficient alpha is that it is readily provided by statistical analysis packages and it can be used both when test items are quantitative and when they are dichotomous (as in right or wrong). Researchers use coefficient alpha when they want an estimate of the reliability of a homogeneous test (i.e., a test that measures only one construct or trait) or an estimate of the reliability of each dimension on a multidimensional test. You will see it commonly reported in empirical research articles. Coefficient alpha will be high (e.g., greater than .70) when the items on a test are correlated with one another. But note that the number of items also affects the strength of coefficient alpha (i.e., the more items you have on a test, the higher coefficient alpha will be). This latter point is important because it shows that it is possible to get a large alpha coefficient even when the items are not very homogeneous or internally consistent. (A minimal sketch of the alpha computation appears below.)

4. The fourth and last major type of reliability is called inter-scorer reliability. Inter-scorer reliability refers to the consistency or degree of agreement between two or more scorers, judges, or raters. You could have two judges rate one set of papers. Then you would just correlate their two sets of ratings to obtain the inter-scorer reliability coefficient, showing the consistency of the two judges' ratings.

Validity

Validity refers to the accuracy of the inferences, interpretations, or actions made on the basis of test scores. Technically speaking, it is incorrect to say that a test is valid or invalid. It is the interpretations and actions taken based on the test scores that are valid or invalid. All of the ways of collecting validity evidence are really forms of what used to be called construct validity. All that means is that in testing and assessment, we are always measuring something (e.g., IQ, gender, age, depression, self-efficacy). Validation refers to gathering evidence supporting some inference made on the basis of test scores.
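As noted under internal consistency above, here is a minimal sketch of coefficient alpha for a small item-response matrix, using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The data are hypothetical and numpy is assumed:

import numpy as np

# Hypothetical responses: 6 people x 4 items, each item scored 1-5.
items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 3, 4],
], dtype=float)

k = items.shape[1]                         # number of items
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))  # high alpha here: the items move together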

There are three main methods of collecting validity evidence.

1. Evidence Based on Content. Content-related evidence is based on a judgment of the degree to which the items, tasks, or questions on a test adequately represent the domain of interest. Expert judgment is used to provide evidence of content validity. To make a decision about content-related evidence, you should try to answer these three questions: Do the items appear to represent the thing you are trying to measure? Does the set of items underrepresent the construct's content (i.e., have you excluded any important content areas or topics)? Do any of the items represent something other than what you are trying to measure (i.e., have you included any irrelevant items)?

2. Evidence Based on Internal Structure. Some tests are designed to measure one general construct, but other tests are designed to measure several components or dimensions of a construct. For example, the Rosenberg Self-Esteem Scale is a 10-item scale designed to measure the construct of global self-esteem. In contrast, the Harter Self-Esteem Scale is designed to measure global self-esteem as well as several separate dimensions of self-esteem. The use of the statistical technique called factor analysis tells you the number of dimensions (i.e., factors) that are present; that is, it tells you whether a test is unidimensional (measures just one factor) or multidimensional (i.e., measures two or more dimensions). When you examine the internal structure of a test, you can also obtain a measure of test homogeneity (i.e., how well the different items measure the construct or trait). The two primary indices of homogeneity are the item-to-total correlation (i.e., correlate each item with the total test score) and coefficient alpha (discussed earlier under reliability).

3. Evidence Based on Relations to Other Variables. This form of evidence is obtained by relating your test scores with one or more relevant criteria. A criterion is the standard or benchmark that you want to predict accurately on the basis of the test scores. Note that when using correlation coefficients for validity evidence we call them validity coefficients. There are several different kinds of relevant validity evidence based on relations to other variables. The first is called criterion-related evidence, which is validity evidence based on the extent to which scores from a test can be used to predict or infer performance on some criterion such as a test or future performance. Here are the two types of criterion-related evidence:

Concurrent evidence: validity evidence based on the relationship between test scores and criterion scores obtained at the same time.

Predictive evidence: validity evidence based on the relationship between test scores collected at one point in time and criterion scores obtained at a later time.
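A minimal sketch of a predictive validity coefficient, with hypothetical data and numpy assumed: an aptitude test is given now, the criterion (say, first-year grades) is collected later, and the validity coefficient is their correlation.

import numpy as np

# Hypothetical data for ten students.
test_scores = np.array([52, 61, 70, 45, 66, 58, 74, 49, 63, 68], dtype=float)
later_gpa   = np.array([2.4, 2.9, 3.4, 2.1, 3.0, 2.6, 3.6, 2.3, 3.1, 3.2])

validity_coefficient = np.corrcoef(test_scores, later_gpa)[0, 1]
print(round(validity_coefficient, 2))  # near 1.0 here; real tests are far lower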

Here are three more types of validity evidence researchers should provide:

Convergent evidence: validity evidence based on the relationship between the focal test scores and independent measures of the same construct. The idea is that you want your test (the one you are trying to validate) to correlate strongly with other measures of the same thing.

Divergent evidence: evidence that the scores on your focal test are not highly related to the scores from other tests that are designed to measure theoretically different constructs. This kind of evidence shows that your test is not a measure of those other things (i.e., other constructs).

Putting the ideas of convergent and divergent evidence together, the point is that to show that a new test measures what it is supposed to measure, you want it to correlate with other measures of that construct (convergent evidence) but you also want it NOT to correlate strongly with measures of other things (divergent evidence). You want your test to overlap with similar tests and to diverge from tests of different things. In short, both convergent and divergent evidence are desirable. (A small sketch of this pattern follows below.)

Known groups evidence is also useful in demonstrating validity. This is evidence that groups that are known to differ on the construct do differ on the test in the hypothesized direction. For example, if you develop a test of gender roles, you would hypothesize that females will score higher on femininity and males will score higher on masculinity. Then you would test this hypothesis to see if you have evidence of validity.

Now, to summarize these three major methods for obtaining evidence of validity, look again at Table 5.6 (also shown below). Please note that, if you think we have spent a lot of time on validity and measurement, the reason is because validity is so important in empirical research. Remember, without good measurement we end up with GIGO (garbage in, garbage out).
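To illustrate the convergent/divergent pattern with hypothetical numbers (numpy assumed), suppose we have a new depression test, an established depression test, and an unrelated vocabulary test:

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical latent traits.
depression = rng.normal(size=n)
vocabulary = rng.normal(size=n)  # a theoretically different construct

# Hypothetical test scores: noisy measures of the traits.
new_test         = depression + 0.5 * rng.normal(size=n)
established_test = depression + 0.5 * rng.normal(size=n)
vocab_test       = vocabulary + 0.5 * rng.normal(size=n)

r = lambda a, b: np.corrcoef(a, b)[0, 1]
print(round(r(new_test, established_test), 2))  # high: convergent evidence
print(round(r(new_test, vocab_test), 2))        # near zero: divergent evidence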

Using Reliability and Validity Information

You must be careful when interpreting the reliability and validity evidence provided with standardized tests and in empirical research journal articles. With standardized tests, the reported validity and reliability data are typically based on a norming group (which is an actual group of people). If the people with whom you intend to use a test are very different from those in the norming group, then the validity and reliability evidence provided with the test becomes questionable. Remember that what you need to know is whether a test will work with the people in your classroom or in your research study.

When reading journal articles, you should view an article positively to the degree that the researchers provide reliability and validity evidence for the measures that they use. Two related questions to ask when reading and evaluating an empirical research article are "Is this research study based on good measurement?" and "Do I believe that these researchers used good measures?" If the answers are yes, then give the article high marks for measurement. If the answers are no, then you should invoke the GIGO principle (garbage in, garbage out).

Educational and Psychological Tests

Three primary types of educational and psychological tests are discussed in your textbook: intelligence tests, personality tests, and educational assessment tests.

1) Intelligence Tests. Intelligence has many definitions because a single prototype does not exist. Although far from being a perfect definition, here is our definition: intelligence is the ability to think abstractly and to learn readily from experience. Although the construct of intelligence is hard to define, it still has utility because it can be measured and it is related to many other constructs.

2) Personality Tests. Personality is a construct similar to intelligence in that a single prototype does not exist. Here is our definition: personality is the relatively permanent patterns that characterize and can be used to classify individuals. Most personality tests are self-report measures. A self-report measure is a test-taking method in which the participants check or rate the degree to which various characteristics are descriptive of themselves. Performance measures of personality are also used. A performance measure is a test-taking method in which the participants perform some real-life behavior that is observed by the researcher. Personality has also been measured with projective tests. A projective test is a test-taking method in which the participants provide responses to ambiguous stimuli. The test administrator searches for patterns in participants' responses. Projective tests tend to be quite difficult to interpret and are not commonly used in quantitative research.

3) Educational Assessment Tests. There are four subtypes of educational assessment tests:

Preschool Assessment Tests. --These are typically screening tests because the predictive validity of many of these tests is weak.

Achievement Tests. --These are designed to measure the degree of learning that has taken place after a person has been exposed to a specific learning experience. --They can be teacher-constructed or standardized tests.

Aptitude Tests. --These focus on information acquired through the informal learning that goes on in life. --They are often used to predict future performance, whereas achievement tests are used to measure current performance.

Diagnostic Tests. --These tests are used to identify the locus of academic difficulties in students.

Sources of Information about Tests

The two most important sources of information about tests are the Mental Measurements Yearbook (MMY) and Tests in Print (TIP). Some additional sources are provided in Table 5.7, and some useful internet links are listed in Table 5.8.
