Tony Hak
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 THEORY AND PROPOSITIONS
CHAPTER 3 REPLICATION
CHAPTER 4 CHOOSING A RESEARCH STRATEGY
CHAPTER 5 SELECTING CASES FOR THE TEST
CHAPTER 6 MEASUREMENT
CHAPTER 7 HYPOTHESIS
THE RESEARCH PROPOSAL
CHAPTER 8 CONDUCTING THE TEST
CHAPTER 9 INTERPRETING THE TEST RESULT
THE RESEARCH REPORT
REFERENCES
GLOSSARY
CHAPTER 1 INTRODUCTION
The aim of a theory-testing research project is to contribute to our knowledge about the correctness of a theory by collecting and analysing empirical data. The underlying logic of theory-testing is the following. A theoretical claim (or proposition) applies to a universe (or domain) that usually is very large or even infinite, e.g., all consumers everywhere at all times, all firms everywhere at all times, etc. It is not possible to prove with absolute certainty that the proposition is true for this whole domain, because it is not possible (or at least not practical) to observe every single case in the domain. The best we can do is to confirm that the theory is true in many different subsets of this domain (which we call populations). If the proposition is true in the domain, then it must be true in each population of the domain. If the proposition is true in a population, then it must be possible to observe this. In other words, there must be empirical evidence for the correctness of the proposition in the population. We can specify what we expect to observe in the population if the proposition is true. This specification of our expectation is called a hypothesis (or expected pattern). Testing a proposition in a population consists of comparing what we expect to see if the proposition is true (the hypothesis or expected pattern) with what is actually observed in the population (the observed pattern). A test result is a conclusion about the extent to which the population behaves as expected. A test in a population is always a partial test of the theory because its result does not apply to the parts of the theoretical domain that have not been observed. Hence, many different tests in many different parts of the domain (replications) are needed before a conclusion can be drawn about the correctness of the theory in the domain. This book discusses the methodology for designing and conducting one such test (of the many tests that are required).
It consists of the following seven steps.

Step 1. Specify the proposition and its domain. Concepts must be defined and the exact relation between them must be specified. The type of the proposed relation determines the appropriate research strategy and how cases should be selected for the test. Also, the type of entity (focal unit) to which the proposed relation refers, as well as the boundaries of the theoretical domain (i.e., the universe in which the theory is assumed to apply), must be specified, because these determine what is a case. This step is discussed in Chapter 2.

Step 2. Select a research strategy. It must be decided whether an experiment or a survey will be designed for the test. This is discussed in Chapter 4.

Step 3. Select cases for the test. A subset of the domain must be selected for the test (Chapter 5).

Step 4. Measure the value of the variables in each case. This step results in a data matrix (in which rows are cases and columns are variables) of which, in principle, all cells have been populated (Chapter 6).

Step 5. Specify the hypothesis (expected pattern) for the selected cases. The hypothesis is a specification of the pattern that is expected to be observed in the data if the proposition is true (Chapter 7).

Step 6. Compare the hypothesis with the pattern that is actually observed in the data and formulate the test result. This is the test proper, which consists of ascertaining whether the hypothesis is (or is not) a correct description of what is observed in the data (Chapter 8).

Step 7. Formulate the implications of the test result for the theory. These implications depend on the number of preceding tests, their results, and the characteristics of the cases in which these tests have been conducted. The research community will be able to draw a conclusion about the correctness of the theory only after a sufficient number of replications in sufficiently diverse cases (Chapter 9).
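Steps 4 to 6 can be sketched in a few lines of code. The following is a minimal illustration for the commitment/turnover proposition discussed later in this chapter, in which the data matrix, the expected pattern (a negative correlation), and the comparison with the observed pattern are all represented explicitly. All numbers are invented for illustration only.

```python
import numpy as np

# Step 4: the data matrix -- rows are cases (employees), columns are variables.
# Columns: affective commitment to change, turnover intention (fabricated 1-7 scores).
data = np.array([
    [6.0, 2.0],
    [5.5, 3.0],
    [2.0, 6.0],
    [3.0, 5.5],
    [4.0, 4.0],
    [6.5, 1.5],
])
commitment = data[:, 0]
turnover = data[:, 1]

# Step 5: the hypothesis (expected pattern): a negative correlation between
# commitment to change and turnover intention in the selected cases.

# Step 6: compare the expected pattern with the observed pattern.
observed_r = np.corrcoef(commitment, turnover)[0, 1]
hypothesis_confirmed = observed_r < 0
print(f"observed correlation: {observed_r:.2f}, confirmed: {hypothesis_confirmed}")
```

This is only a sketch of the logic; a real test would involve many more cases and an explicit criterion for when the observed pattern counts as matching the expected one.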
A note on terminology
Some of the terminology and the methodological principles in this book differ from terms and principles in other textbooks. Examples of terms with a (slightly or entirely) different definition than in other sources are: proposition, hypothesis, theoretical domain, population, sample, survey, and test. Each of these terms, and many others, is defined in this book and also listed in the Glossary.
2. Affective commitment to change is negatively related to turnover intentions. This proposition states that we can better predict the turnover intentions of employees when we know their affective commitment to change than when we do not. Or, in other words: for the focal unit employee, if the value of attribute X (an employee's level of affective commitment to change) is high, the value of attribute Y (the desire to quit the company) will be lower than it is at lower values of affective commitment to change.
It follows that a specific theory is defined by four aspects: its focal unit, its domain, its concepts (which represent variable attributes of the focal unit), and the relations between concepts as specified in its propositions. Each of these aspects will be discussed here in more detail.

The focal unit, i.e. the unit or entity about which the theory formulates statements, can be many different kinds of things, such as activities, processes, events, persons, groups, or organizations. If, for example, a theory is formulated about critical success factors of innovation projects, then the innovation project is the focal unit. Within a theory, the focal unit cannot vary. A theory predicts values of attributes of that single focal unit, not of other units. A theory about critical success factors of innovation projects, for instance, is by definition a theory about characteristics of innovation projects, not of other things or entities such as products, companies, teams, etc. A clear specification of the focal unit is very important in the design of a theory-testing study because it defines the type of entity about which data must be collected. For a test of the claim "A tangible resource-seeking alliance is more likely to deploy high levels of output and process control", data must be collected about alliances, not about other entities. For a test of the proposition "Affective commitment to change is negatively related to turnover intentions", data must be collected about employees, not about entrepreneurs, students or companies.

The domain of a theory is the universe of the instances of the focal unit (cases) for which the propositions of the theory are assumed to be true. The boundaries of this domain should be specified clearly.
For instance, if a researcher develops a theory of critical success factors of innovation projects, it must be clearly stated whether it is claimed that this theory applies to all innovation projects, or only to innovation projects of specific types, or only in specific economic sectors, or only in specific regions or countries, or only in specific time periods, etc. Hence the domain might be very generic (e.g. all innovation projects in all economic sectors in the whole world) or quite specific (e.g. limited to innovation projects in a specific economic sector, in a specific geographical area, or of a specific type).
Examples

1. A tangible resource-seeking alliance is more likely to deploy high levels of output and process control. This proposition is a claim about alliances in general. It is not a claim about a specific type of alliance such as alliances in a specific economic sector (e.g., airline alliances) or in specific countries (e.g., US alliances). If this proposition is true, then it is true for alliances in all economic sectors and in all countries and at all times. If the claim is formulated from the outset as only applicable to a specific type of alliance, then this should have been specified in the wording of the proposition.

2. Affective commitment to change is negatively related to turnover intentions. This proposition is a claim about employees in general. It is not a claim about a specific type of employee (e.g., manual laborers or white-collar workers), or about employees in a specific economic sector (e.g., dockworkers or airline pilots) or in specific countries (e.g., the US workforce). If this proposition is true, then it is true for employees in all types of jobs, in all economic sectors, in all countries and at all times. If the claim is formulated from the outset as only applicable to a specific type of employee, then this should have been specified in the wording of the proposition.
The concepts of the theory designate the variable attributes of the focal unit. An attribute described by a concept can be absent or present, smaller or larger, etc. For instance, if the research topic is critical success factors of innovation projects, the factors that presumably contribute to success are variable attributes of these projects. In each instance of the focal unit, these factors can be present or absent, or present to a certain degree. Likewise, success is a variable attribute of the focal unit project, which can be present or absent, or present to a certain degree, in each instance of the focal unit (i.e. in each specific innovation project). The attributes that are designated by the concepts of the theory must be defined so as to allow for the measurement of their value in instances of the focal unit (cases). For instance, in a theory of critical success factors of innovation projects, the concept "project outcome" needs to be defined such that it is clear what counts as a successful outcome and what does not. The factors must be defined as well, so that we can measure the extent to which each factor is present. When the value of a concept is measured in cases, it is called a variable.

The propositions of a theory formulate relations between the concepts (i.e., between variable attributes) of the focal unit. Typically, but not always, this relation is a causal one. A causal relation is a relation between two attributes X and Y of a focal unit in which a value of X (or a change in it) results in a value of Y (or in a change in it). A proposition can be visualized by means of a conceptual model. Usually such a conceptual model has inputs (independent concepts) on the left-hand side and outputs (dependent concepts) on the right-hand side, linked to each other by arrows that point to the dependent concepts. The arrows indicate the direction of the causal relation between the concepts.
The nature of these arrows needs to be defined more precisely in the wording of the propositions of the theory. The simplest building block of a theory is a single proposition that formulates the relation between two concepts. A proposition can be visualized as follows:

Regarding focal unit FU: Determinant X → Outcome Y
This simple model visualizes the proposition that, in all cases of the focal unit FU, concept X (the determinant or independent concept) has an effect on concept Y (the outcome or dependent concept). The unidirectional arrow represents the assumption that a cause precedes an effect. Because effects are assumed to depend on causes, the term dependent concept is used for the outcome Y. Causes X are assumed to be independent from their effects, hence the term independent concept.
Note that this simple model specifies neither the contents of the proposition nor the possible values of the concepts. If it is presented in this way, it is normally assumed that X and Y are interval or ratio variables and that the relation between them is causal, probabilistic and positive: higher X will on average result in higher Y. Because other types of concepts and other types of relation are possible, it is necessary to add more specifications to the model. Determinant X (or Y) should be specified as "Extent of X" (or Y) or "Presence of X" (or Y) or by any other indication of the (range of) values that are covered by the proposition. Also, a sign (+ for positive; - for negative) must be added to the arrow in the model.
Note also that the focal unit (e.g., innovation project) is not depicted in the model itself because the model represents only the variable attributes (concepts) of which the values are linked in the theory, not the invariable entity about which the theory is formulated. For this reason, the model is prefaced by a statement about the focal unit. The domain is implied.
More complicated conceptual models might depict relations between a larger number of independent concepts X1, X2, X3, etc., and dependent concepts Y1, Y2, Y3, etc. For instance, in a conceptual model of the critical success factors of innovation projects, the model would depict a number of different factors (X1, X2, X3, etc.) on the left-hand side, outcome (as defined precisely) on the right-hand side, and an arrow originating from each factor pointing to outcome. Other models might be used to depict more complex relations, such as those with moderating or mediating concepts.
Note that the word theory is used loosely in the literature and often refers to sets of statements that are not theory as defined here. For instance, "theories" such as the resource-based view or transaction cost theory are perspectives, not sets of precise propositions with defined concepts.
Here follow a number of propositions that have been formulated and tested by Bachelor students in the Research Training course at the Rotterdam School of Management. For each of these propositions, the focal unit, domain, concepts, relations, and conceptual model are specified in the following examples.
Example 1.
Proposition: A tangible resource-seeking alliance is more likely to deploy high levels of output and process control
Focal unit: Alliance
Domain: All alliances in the world, in all economic sectors, in all countries, at all times
Independent concept: Type of resource that is sought in the alliance (tangible versus intangible)
Dependent concept: Extent to which output and process control is used
Relation: Probably causal, probabilistic, positive
Conceptual model: Regarding focal unit alliance: Type of resource that is sought → Extent of output and process control (+)
Example 2.
Proposition: Affective commitment to change is negatively related to turnover intentions
Focal unit: Employee
Domain: All employees in the world, in all types of job, in all countries, at all times
Independent concept: Extent of affective commitment to change (from not at all to very much)
Dependent concept: Strength of the wish to quit (from not at all to very much)
Relation: Probably causal, probabilistic, negative
Conceptual model: Regarding focal unit employee: Commitment to change → Wish to quit (-)
Example 3.
Proposition: Higher crime rates have a negative effect on house prices
Focal unit: House
Domain: All houses in the world, in all countries, at all times, of all types
Independent concept: Crime rate in the neighbourhood of the house
Dependent concept: Price of the house (in Dollars or Euros or other currency)
Relation: Causal, probabilistic, negative
Conceptual model: Regarding focal unit house: Crime rate → House price (-)
Example 4.
Proposition: The way entrepreneurs allocate their time is influenced by their tendency for mental accounting
Focal unit: Entrepreneur
Domain: All entrepreneurs in the world, in all countries, at all times, in all business types
Independent concept: The extent to which a person evaluates costs and benefits of activities
Dependent concept: The time allocated to work-related activities (versus other activities such as leisure and family-related) under a time constraint
Relation: Causal, probabilistic
Conceptual model: Regarding focal unit entrepreneur: Extent of mental accounting → Time allocated to work
Example 5.
Proposition: Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular
Focal unit: Consumer
Domain: All consumers in the world, in all countries, in all types of consumption
Independent concept: The degree of loyalty to the brand
Dependent concept: Level of intention to purchase a (distant) brand extension
Relation: Causal, probabilistic
Conceptual model: Regarding focal unit consumer: Loyalty to the brand → Purchase intention
Managerial relevance
A proposition predicts or explains probabilities of values of attributes of a focal unit. Why are such predictions made, and why do we want to know whether a proposition is true? The obvious answer is that there are many situations in which it matters what the value of an attribute is. For instance, referring to some of the examples above, we would normally prefer lower turnover intentions in employees rather than higher ones, higher house prices rather than lower ones, more efficient allocations of time rather than less efficient ones, and higher purchase intentions rather than lower ones. Also, if a researcher develops a theory about a determinant of the extent of output and process control in an alliance, it is to be expected that this extent matters to people involved in alliances.
If the value of an attribute matters in practice, then it is likely that one would like to be able to manipulate that value. That is the reason that many theories are formulated as causal ones. If the value of one attribute is causally related to the value of another attribute, then it becomes possible (at least in theory) to achieve a higher probability of a desired value of one attribute by manipulating the value of the other attribute. In that sense, causal theories have higher managerial relevance than non-causal theories.
Examples

1. Affective commitment to change is negatively related to turnover intentions. If this proposition is true, then it makes managerial sense (to attempt) to increase the average affective commitment to change in a workforce because this might (more likely than not) result in a lower turnover in the company (if a lower turnover is desired).

2. Higher crime rates have a negative effect on house prices. If this proposition is true, then it makes real estate sense (to attempt) to decrease the crime rate in a neighborhood because this might (more likely than not) result in higher house prices in the neighborhood (if a higher house price is desired).

3. Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular. If this proposition is true, then it makes marketing sense to be careful with brand extensions as long as the average brand loyalty is relatively low (assuming that brands attempt to increase loyalty anyway, irrespective of whether they contemplate brand extensions).
Note that almost all propositions in business research are probabilistic in kind, which means that they do not predict the value of Y in a single case (even if the value of X in that case is known), but only the average value of Y in a set of cases with that value of X. In practice this means that the managerial relevance of propositions is normally much higher for managers with a portfolio of cases than for managers who manage only one or two cases.
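The difference between predicting an average and predicting a single case can be made concrete with a small simulation. The slope, noise level, and scales below are arbitrary illustrative choices, not estimates from any real study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated probabilistic negative relation: Y = 10 - 0.5*X + noise.
# All parameters are invented for illustration only.
n = 10_000
x = rng.uniform(1, 7, size=n)            # e.g. commitment scores
y = 10 - 0.5 * x + rng.normal(0, 2, n)   # e.g. turnover intention

high_x = x > 6
low_x = x < 2

# In the aggregate, the prediction holds: higher X -> lower average Y ...
print(y[high_x].mean(), y[low_x].mean())

# ... but for a single case it may fail: a noticeable share of high-X cases
# still have a Y above the low-X average.
share_exceptions = (y[high_x] > y[low_x].mean()).mean()
print(f"share of high-X cases above the low-X average: {share_exceptions:.0%}")
```

The group averages behave exactly as the proposition predicts, while an individual high-commitment employee may nevertheless have a strong desire to quit, which is the situation described in the examples below.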
Examples

1. Affective commitment to change is negatively related to turnover intentions. This proposition, if true, is relevant if a manager wants to decrease the average level of turnover in the workforce. It is much less relevant if that manager wants to keep an individual employee with a high value for the company. The latter could be a member of a minority of employees who have a high level of commitment to change and also a high desire to quit.

2. Higher crime rates have a negative effect on house prices. This proposition, if true, is relevant if a local government or a real estate developer wants to raise the average house price in a neighborhood. It is much less relevant for individual house owners who want to raise the price of their own homes. Their home could be one of a minority of houses that have a low price irrespective of the level of crime in the neighborhood.
Note that the managerial relevance of a proposition is dependent on the strength or effect size of the relation between the two attributes. If a large increase of affective commitment to change results in only, say, one percentage point decrease in turnover (on average), then the managerial relevance of the proposition is doubtful (even if the proposition is true). Similarly, if a huge decrease in crime in a neighbourhood results in only, say, one percentage point increase in house prices (on average), then it is doubtful whether the proposition is relevant (even if it is true) for a local government or a real estate developer. (Obviously, there might be other good reasons for trying to bring down the crime rate in a neighbourhood.)
Note also that this (more or less subjective) estimation of the practical relevance of a theoretical statement also depends on the costs involved in the manipulation of the independent variable. If affective commitment to change could be influenced by a cheap and simple method, such as sending an email message to all employees in which they are praised for their efforts, then it is relevant to know that the resulting higher commitment to change has a negative influence (though perhaps a small one) on the desire to quit. However, if even a small decrease in crime rate requires huge investments in surveillance and other measures, then the effect of such a decrease on house prices should be considerable to make such an investment worthwhile. (Again, obviously, there might be other good reasons for that investment.)

If a proposition is considered potentially relevant, and if it is determined how strong the causal effect should be in order to achieve the desired level of relevance, then it becomes useful to know whether the proposition is true. For managers it is usually (only) useful to know whether a theoretical claim is true for the cases which they manage. In contrast, theoreticians and academic researchers are interested in knowing whether a theoretical claim is true in the theoretical domain. How could they know? An obvious difficulty is that the theoretical domain implied by a proposition is usually infinite and continuously changing. For instance, the theoretical domain of the claim "Affective commitment to change is negatively related to turnover intentions" is all employees. Because new employees will be hired all the time and in every place, there is no way to keep track of each of them. In practical terms, it is not possible to keep a complete list of all cases of the focal unit (here: of all employees) in the domain. This means that we cannot observe each of them in order to ascertain that the proposition is true in each of them.
Fortunately, in practice we do not require absolute certainty of the correctness of a proposition for all cases in the domain but are satisfied with sufficient confidence that this is the case. This confidence is built by testing the proposition in a series of tests in subsets of the domain. The propositions are tested many times, each time in a set of cases from another part of the domain, and potentially with each time a different result. A number of consistently confirmatory test results in similar subsets is required before we can be confident that the proposition is true for a specified part (or for the whole) of the domain. Different tests in different parts of the domain are called replications. Usually a worldwide community of researchers is involved in conducting these replications, collectively building confidence in the correctness of a proposition or, if it appears that the proposition cannot be confirmed in some parts of the domain, collectively reformulating the proposition (or collectively concluding that the proposition was wrong).
Note that replication is defined here as conducting another test of the same proposition, i.e. usually in another set of cases, and usually using other methods. This concept of replication does not imply repeating or duplicating a previous test (with the same methods in the same cases or with the same methods in other cases) in order to confirm a finding. In order to avoid confusion, such studies might better be called duplications.
Figure 1. Domain, cases and populations (from Dul & Hak, 2008: 46)
As stated in the Introduction, the underlying logic of theory-testing is the following. A theoretical claim (or proposition) applies to a universe (or domain) that usually is very large or even infinite, e.g., all consumers everywhere at all times, all firms everywhere at all times, etc. It is not possible to prove with absolute certainty that the proposition is true for this whole domain, because it is not possible (or at least not practical) to observe every single case in the domain. The best we can do is to confirm that the theory is true in many different subsets of this domain (which we call populations). If the proposition is true in the domain, then it must be true in each population of the domain. Theory-testing, thus, entails selecting a population from the domain for the test; formulating a prediction (hypothesis or expected pattern) for that population which is derived from the proposition (if it is true for that population); measuring the concepts in the cases of the population; and then seeing whether the prediction is true or not. The latter involves a comparison between the expected pattern and the actual pattern in the data (observed pattern).
A rather common erroneous belief about a population that is selected for a test is that its size matters for the quality of the test. More specifically, it is widely believed that the larger the population, the better for the test. In the example of a proposition about alliances, testing it in the population of all airline alliances would, on this view, be much better than testing it in the population of (only) large US airline alliances. This myth is (as are so many myths) based on facts, namely (in this case) the fact that a larger number of cases gives larger statistical power. Another underlying reason for this myth is probably the idea that a larger population is more representative of the domain than a smaller population and that, therefore, a result in that population is more informative about the domain. But both reasons are faulty. First, statistical power is not a relevant concept here because it applies solely to the size of a probability sample when an inference is made from the sample to the population from which it is drawn. Selecting a population from a domain is very different from probability sampling and, therefore, principles from inferential statistics do not apply. Second, a test result in a less specified population is actually less informative for the theory than a result in a more specified population. Take a fictional example in which a proposition is tested in the population of airline alliances and also in the smaller, more specific population of US airline alliances. Assume that the test result in the larger population is a weak confirmation and that the test result in the smaller US population is negative. Which result is more informative about the theory?
The result in the more specified (and hence smaller) population not only implies that the proposition is not correct for that population (i.e., for US airline alliances), but also indicates that the correctness of the theory depends on a (yet unknown) factor related to something that makes the population of US airline alliances different from other populations of airline alliances. This result will not only stimulate the search for this unknown factor that influences the proposed relationship (a search that will increase our understanding of the workings of the relations explained by the theory), but will also result in a smaller and better defined theoretical domain (i.e., a better specification of the universe of cases in which the theory is assumed to be true). These benefits for the theory can only be achieved if the population that is selected for the test has clear characteristics (such as being US in the example). For this reason, a test must not be conducted in an arbitrary group of cases from the domain (such as a sample of cases that happen to be included in a database). In the example, it is the fact that the population is defined by being US that makes it possible to specify the search for an explanation of the test result as a search for a factor that is in some way related to being US. If the test result had been found in an arbitrary set of cases, it would not be possible to specify the implications of the test result in this way. Also note that comparing different test results from different well-defined populations, such as the population of US airline alliances, the population of South Asian airline alliances, the population of European airline alliances, etcetera, is not the same as conducting a regression of the dependent variable Y on the region of the alliance (US, South Asian, European, etc.).
Testing the proposition that independent concept X has an influence on dependent concept Y in different populations (defined by region) is not the same as testing that region (also) has an influence on Y. This implies that there are no statistical procedures available for interpreting different results from different populations (or replications).
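This distinction can be sketched with simulated data. In the fictional setup below, the X → Y relation is present in one population and absent in another; the population labels, slopes, and sample sizes are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fictional illustration: the X -> Y relation holds in the 'European'
# population but not in the 'US' population. All numbers are invented.
def simulate(slope, n=200):
    x = rng.normal(0, 1, n)
    y = slope * x + rng.normal(0, 1, n)
    return x, y

x_eu, y_eu = simulate(slope=0.8)   # relation present in 'European' cases
x_us, y_us = simulate(slope=0.0)   # relation absent in 'US' cases

# Testing the proposition per population: estimate the X -> Y slope in each.
slope_eu = np.polyfit(x_eu, y_eu, 1)[0]
slope_us = np.polyfit(x_us, y_us, 1)[0]
print(f"slope in EU population: {slope_eu:.2f}, in US population: {slope_us:.2f}")

# Regressing Y on region instead compares mean Y across regions, which says
# nothing about whether X influences Y within either population.
mean_diff = y_eu.mean() - y_us.mean()
print(f"difference in mean Y between regions: {mean_diff:.2f}")
```

Comparing the two per-population slope estimates is what replication across well-defined populations amounts to; the region variable itself has essentially no effect on the mean of Y in this simulation, so a regression of Y on region would miss the difference entirely.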
CHAPTER 3 REPLICATION
How must different results from different tests (replications) be interpreted? Figure 2 represents the standardized effect size d (with a 95% confidence interval) obtained in each of seven independent experimental tests. The first and the third test show a clear positive effect, but there are also tests with negative effects (experiments 2 and 6) and much smaller positive effects (experiments 4, 5 and 7). A first conclusion from this example is that no conclusions can be drawn from a single test of a theory, even if the result is highly significant (such as in the first experiment). This shows that our confidence in the correctness of a theory can only be built in a series of replications.
Usually, a proposition states (only) that there is an effect of an independent concept on a dependent concept, without specifying how large that effect is or should be in order to be considered relevant. The experimental results in this example indicate that there appears to be an effect, but that it is small. Although the results of experiments 4, 5 and 7 seem to come close to the real effect size, such a conclusion cannot be drawn from these tests alone but only from the series of tests. Moreover, a conclusion about what these seven test results mean must also depend on the characteristics of these tests, such as the populations from which subjects were recruited for the test and the methods that were used. With regard to these populations, consider the following example:

Test 1: effect size = 1.5, N = 16, second-year IBA students, RSM
Test 2: effect size = -0.2, N = 29, cab drivers, Istanbul
Test 3: effect size = 0.7, N = 36, high school students, Rotterdam
Test 4: effect size = 0.18, N = 24, dockworkers, Rotterdam
Test 5: effect size = 0.35, N = 40, first-year Psychology students, EUR
Test 6: effect size = -0.1, N = 20, cab drivers, Mexico City
Test 7: effect size = 0.3, N = 8, RSM professors
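For a figure like Figure 2, the 95% confidence interval around each standardized effect size can be reconstructed from d and N. The following is a minimal sketch, assuming each experiment compared two equal-sized groups and using the common large-sample approximation for the standard error of Cohen's d (the effect sizes and sample sizes are the fictional ones listed above):

```python
import math

# Effect sizes (Cohen's d) and total sample sizes of the seven fictional tests.
tests = [(1.5, 16), (-0.2, 29), (0.7, 36), (0.18, 24),
         (0.35, 40), (-0.1, 20), (0.3, 8)]

results = []
for i, (d, n) in enumerate(tests, start=1):
    # Large-sample SE of d, assuming two equal groups of n/2 each:
    # SE = sqrt((n1+n2)/(n1*n2) + d^2/(2*(n1+n2))) with n1 = n2 = n/2.
    se = math.sqrt(4 / n + d ** 2 / (2 * n))
    lo, hi = d - 1.96 * se, d + 1.96 * se
    results.append((lo, hi))
    verdict = "excludes 0" if lo > 0 or hi < 0 else "includes 0"
    print(f"Test {i}: d = {d:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}] ({verdict})")
```

Under these assumptions only the intervals of tests 1 and 3 exclude zero, which is consistent with the description of the first and third tests as the ones showing a clear positive effect.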
Assume that all seven tests were experiments in which a proposition regarding the effectiveness of a certain treatment in changing the attitudes of workers was tested. Also assume that the seven tests are represented here in chronological order. The first test result (with second-year IBA students at RSM) suggests that the (apparently newly formulated) proposition is true. This generates some confidence that the theory is correct, i.e. that the treatment is effective. The following test (with cab drivers in Istanbul), however, shows that this result cannot be replicated, and doubt about the effectiveness of the treatment (and hence about the correctness of the theory) sets in. The third test result (with high school students) suggests again that the theory might be correct, although it also indicates that the effect size might be much smaller than first thought. The insignificant result of the next, fourth test (with dockworkers) again raises doubt about the correctness of the theory. At this stage, after four tests, the research community will have developed a certain level of confidence that the theory is correct (because of the highly significant positive test results in two experiments, 1 and 3) but the other two test results will temper that confidence. Because the focal unit of the proposition (in this fictional example) is workers, one way of solving the problem of contradictory results is to discard results from tests with students. Then only the results of experiments 2 and 4 matter, and the conclusion after these two tests is that the treatment does not seem to work in the domain of the theory. Note that the conclusion would be very different if experiments 1 and 3 had been the ones with workers rather than students. Assume that researchers continue conducting experiments with students because this is what they happen to do.
The result of experiment 5 (with first year Psychology students at Erasmus University) will be seen as a third confirmation of the correctness of the theory. Researchers who are more serious about the contents of the theory (and hence about the limitation of the theoretical domain to workers) could conduct another experiment with workers (experiment 6), in which they will again find confirmation of their doubts about the correctness of the theory. The result of an experiment with academic workers (test 7 at the bottom of the figure) seems to fit better into the series of results with students than into that with non-academic workers. This (fictional) example demonstrates how confidence in the correctness of a theory develops chronologically. This development is not linear. It goes up and down, and partly consists of substantive specifications of the theory such as, in this example, a more precise specification of the focal unit (worker, not student) and an emerging insight that the theory might apply to only certain types of workers (i.e. non-academic workers).
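To make this chronological development concrete, here is a minimal sketch in Python (the effect sizes are read off the figure above; the code itself is only an added illustration, not part of the original example) that computes the running mean effect size after each successive test:

```python
# Effect sizes of the seven (fictional) experimental tests, in chronological order.
effects = [1.5, -0.2, 0.7, 0.18, 0.35, -0.1, 0.3]

# Running mean effect size after each successive test: a crude proxy for how
# confidence in the theory develops as results accumulate.
running_means = []
total = 0.0
for i, effect in enumerate(effects, start=1):
    total += effect
    running_means.append(round(total / i, 2))

print(running_means)  # starts at 1.5 and ends near 0.39; the path is not linear
```

The up-and-down path of the running mean mirrors the rise and fall of confidence described above, although, as the text argues, any single pooled number hides the differences between the populations.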
Note that this example is not only fictional in the sense that it is imagined (i.e., it is not a real life case of seven actual experimental tests of an actual theory) but it is also fictional in the sense that it is not really realistic. In real academic life the story would have developed in another way, mainly because the result of the first test would have greatly decreased the likelihood that there would ever be a second test. In actual practice, the first test result would have been seen as final because it not only confirmed the correctness of the proposition but also did so with a large effect size and with high significance. Other researchers would not have found it problematic that this was a test with students rather than workers, because that is what they routinely do. This test might have been published in a top journal and it is quite possible that from then on the theory was taught in management courses as the proven Experimenter One's Law. No researcher would replicate the test,
because every colleague would tell her that journal editors would consider her result (which would be expected to be positive again) as not new, not original, and not worth publishing. Now, imagine how the second experimenter (who did the experiment against this collegial advice) will have looked at her test result. Because she could not reject the null hypothesis, she cannot report a positive or negative test result, but only a failure to replicate the first experiment. She will probably not write up her findings and, even if she does, she will not be able to get them published because they will be found lacking in novelty, in theoretical relevance, and in managerial relevance. The research community will never know that there ever was such a second test. In other words, both the cult of the isolated study (Hubbard & Lindsay, 2002) and the cult of positive results effectively prevent replication histories from being published and becoming known.
Because the knowledge base in business and management studies mainly consists of propositions that have been tested only once and that have not been put to replication tests, a rather effective and appropriate way to contribute to a theory is by replicating published one-shot studies in other cases from the domain. The common requirement of academic journals (and also often in the evaluation of student work) that studies must be original (meaning that they should formulate, test and confirm propositions that have not been stated before) is a huge obstacle to scientific progress because it hinders the much-needed increase of the number of replication studies. With every original study a new candidate proposition is formulated whose correctness for the domain is not certain, adding to the reservoir of propositions that are waiting to be replicated.
Further reading
In the literature the term replication is used in different ways. Often it means repeating a test in the same set of cases, e.g., by drawing another (random) sample from the same population or by (randomly) assigning the same group of subjects over experimental and control groups. The aim of such a replication is usually to evaluate the quality of a study by investigating the reproducibility of its result in the same set of cases and/or to assess the (normal) variation of study outcomes. The concept of replication used in this course book is different. It is taken from the book Case Study Methodology in Business Research by Dul and Hak (2008). Hak & Dul (2009b) also discuss the principles of replication. The notion that the correctness of a proposition can be evaluated only after multiple tests and that the construction of a replication history is central to the development of a theory is not common in academic publishing in business and management research. Top journals almost exclusively publish original work, i.e., research in which new theoretical claims are made. Hubbard & Lindsay (2002) call this the cult of the isolated study. They note that most, if not all, published empirical research consists of novel works looking for significant differences, rather than significant sameness, in unrelated data sets, and argue that the emphasis on original research impedes knowledge development. As a result, the literature is made up largely of fragmented, one-off results, which according to Hubbard and Lindsay are of little or no use because they are not corroborated by other studies. Although Hubbard & Lindsay's (2002) paper focuses on marketing research, its argument applies to all fields in business and management research. Note that there is a growing acknowledgement of the value of replication, although this has not yet been translated into a reevaluation of journal policies and criteria for career decisions.
This growing acknowledgement has led to an increasing number of publications advocating
meta-analysis. This certainly is an improvement as a corrective of the cult of the isolated study. However, a serious problem with most approaches to meta-analysis is that it is usually assumed that different test results apply to the same pool of cases. In our above example of seven experimental tests, meta-analysis would ignore the fact that the experimenters have recruited subjects from populations that differ considerably from each other. Although it is tempting to pool the results and compute a kind of average effect over the seven experiments (resulting in an overall effect size of about 0.4), it is recommended here to stick to the type of qualitative assessment of which this chapter gives an example. A good text on how to conduct a good literature review (which is required for reconstructing a replication history of a proposition) is Chapter 3, Literature review, of the book Business Research Methods by Boris Blumberg, Donald Cooper and Pamela Schindler (McGraw-Hill, Second European Edition, 2008). Blumberg et al. list the following aims (among others) of the literature review (2008:107): to show which theories have been applied to the research topic; to show which research designs and methods have been chosen; to synthesize and gain a new perspective on the topic; and to show what needs to be done in the light of the existing knowledge. With respect to the last-mentioned aim, the authors compare a theory to a cathedral and advise the researcher not to aim at adding an entirely new chapel to the existing building: Each study and article is just another brick added to the construction of the cathedral of knowledge. Some studies just reconfirm previous knowledge, often in slightly different settings. […] The first function of a literature review is to embed the current study (the new brick) in the existing structure of knowledge (the cathedral). […]
The single brick is of limited beauty, but being part of the cathedral contributes significantly to its overall beauty (2008:107). A particularly good element of the Blumberg et al. chapter is its discussion of six ingredients of a good literature review (2008:110). A good literature review is not a mere compilation of summaries of publications. The term replication (history) is not used in any of these references. The best way to construct the required overview of test results for a proposition is by tracking references (backward) and citations (forward) of published test results (in the Web of Science). Figure 2 above is taken from a rather technical article on the understanding of confidence intervals (Geoff Cumming and Sue Finch, A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions, Educational and Psychological Measurement, 2001: 532-574) and represents the results of a series of seven replications of (probably) the same experiment. However, in this book, this figure is taken out of its context and used as an illustration of a series of similar experiments with different types of subjects. Differences between these results cannot be interpreted as (just) sampling variation but might reflect real differences between experimental subjects (and the populations from which they were recruited, such as students, cab drivers, etc.), or they might not reflect such real differences because each of these results is also subject to variation due to the manner in which subjects were recruited from a larger population as well as to sampling variation. This means that this series of results can only be meaningfully interpreted in the context of results of further, yet-to-be-conducted experiments with these and other types of subjects. These subjects must be recruited from different populations from the theoretical domain.
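The naive pooling that a meta-analysis might perform on this series (and which the discussion above warns against) can be sketched as follows; the figures are taken from the example, and the code is only an added illustration:

```python
# Effect sizes and sample sizes of the seven (fictional) experiments.
effects = [1.5, -0.2, 0.7, 0.18, 0.35, -0.1, 0.3]
sizes   = [16, 29, 36, 24, 40, 20, 8]

# A simple mean and an N-weighted mean, ignoring that the subjects come
# from very different populations (students, cab drivers, professors).
simple_mean   = sum(effects) / len(effects)
weighted_mean = sum(e * n for e, n in zip(effects, sizes)) / sum(sizes)

print(round(simple_mean, 2))    # about 0.4, the overall effect size mentioned earlier
print(round(weighted_mean, 2))
```

Both numbers look reassuringly precise, but precisely because they average over students, cab drivers and professors alike, they obscure what the replication history actually shows.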
The logic of inferential statistics, based on probability sampling, does not apply to test results in subsets of the domain. Populations are not selected randomly from the theoretical domain and, therefore, they cannot be treated as samples from a larger population. Hence, a conclusion about the correctness of a theoretical statement for a domain cannot be inferred statistically from a test result in a population. Test results are always findings in subsets of the domain and cannot be generalized to other parts of the domain. They can only be replicated (in the sense meant in this book). A replication history, thus, should contain for each test an indication of the part of the domain in which the test was conducted.
Each point in this scatter plot is defined by a value on the X-axis and a value on the Y-axis. This implies that the plot is derived from a data matrix in which for each case a value for X and a value for Y is specified, as in the following one.
Cases     Value of X    Value of Y
Case 1    x1            y1
Case 2    x2            y2
Case 3    x3            y3
Case 4    x4            y4
Case 5    x5            y5
Case 6    x6            y6
N = 6
There are informal ways by which we can see the association of X and Y in a data matrix. As we have seen, one way of seeing the association between X and Y is plotting the cases in a scatter plot and then observing the empty corners in the plot: top left (low X; high Y) and bottom right (high X; low Y). Another way of observing an association between X and Y is ranking the cases in the data matrix according to their (increasing or decreasing) value of X and also ranking them according to their value of Y, and then comparing the two rankings. If the rankings are roughly similar, i.e. if the same cases are situated high in both rankings (and other cases low in both rankings), this is evidence of an association between X and Y. A more formal approach would entail plotting a trend line in the scatter plot. The relationship stated in the proposition X is associated with Y might be notated as a bi-directional connection: X ↔ Y. In principle, it does not make a difference which concept is called X and which is called Y, and for observing this type of relation (if concepts are continuous) it does not matter which concept is on the X-axis and which one is on the Y-axis. If one of the concepts in the proposition is not continuous, (informally) ranking and (more formally) plotting a trend line are not possible. Take the following proposition that was discussed in Chapter 2: A tangible resource-seeking alliance is more likely to deploy high levels of output and process control. Assume that X and Y have been
observed in all members of the population of US airline alliances. Assume also that the population has six members. The data matrix could look like the following one:
Cases     Value of X (T = tangible; I = intangible)    Value of Y (level of control; scale from 1 to 7)
Case 1    T                                            6
Case 2    T                                            4
Case 3    T                                            2
Case 4    I                                            5
Case 5    I                                            3
Case 6    I                                            1
          T = 50%; I = 50%                             mean Y = 3.5
In this population, an association between the values of X and Y can be observed by comparing the average level of control between the two groups of alliances (tangible resource-seeking and intangible resource-seeking alliances). The observed averages of level of control in this example are 4.0 in the tangible resource-seeking alliances and 3.0 in the intangible resource-seeking alliances. In a formal analysis, therefore, we would not calculate the trend line but just compute the difference between the two averages. In applying such procedures we assume that cases are comparable, i.e. that there are no other relevant determinants of X and Y that differ between cases. Because cases in the data matrix must be comparable or similar in relevant respects, all cases in the data matrix should be members of a specified population, i.e. a population in which cases share the characteristics that define it. For instance, in a test of a proposition about firms (i.e. the focal unit is a firm) we should generate a data matrix of firms that are members of the same population, e.g. firms in a specific economic sector, in a specific country or region, of a specific size, etc. In other words, the data matrix that we need for a test of an association is the data matrix that is generated in a study of a specific population in which X and Y are observed (or measured) for each member of that population at one point in time. This is called a cross-sectional survey. A survey is a research strategy in which values of the relevant concepts are observed in all members (or in a probability sample of members) of a population of instances of a focal unit.
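The comparison of group averages described above can be sketched as follows (the data matrix values are those of the fictional alliance example; the code is an added illustration):

```python
# Data matrix of the six (fictional) US airline alliances:
# X = type of resource sought ("T" tangible, "I" intangible),
# Y = level of control (scale from 1 to 7).
matrix = [("T", 6), ("T", 4), ("T", 2), ("I", 5), ("I", 3), ("I", 1)]

tangible   = [y for x, y in matrix if x == "T"]
intangible = [y for x, y in matrix if x == "I"]

mean_t = sum(tangible) / len(tangible)
mean_i = sum(intangible) / len(intangible)

print(mean_t, mean_i, mean_t - mean_i)  # 4.0 3.0 1.0
```

The statistic of interest here is simply the difference between the two group means, exactly as described in the text.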
Note that this definition of a survey might be different from how this term is used in other sources. In this book a survey is not defined by its method of data collection (e.g., using a questionnaire). It is only defined by a data matrix that contains all cases of a population (or a sample from it). The cells of that matrix can be populated by collecting evidence from interviews, observation, records, data bases, etc. Using the word survey to refer to the use of a questionnaire is confusing and should be avoided.
In sum, the preferred research strategy for testing an association is a cross-sectional survey. The values of concepts X and Y are observed (measured) in all members of the population (cases), or in a probability sample. The observed values are entered in a data matrix with as many rows as there are cases and with as many columns as there are concepts in the proposition.
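For two continuous concepts, the more formal analysis mentioned earlier fits a trend line to the scatter plot. A minimal sketch, with hypothetical values for six cases (the numbers below are invented for illustration), computes the least-squares slope b from the data matrix:

```python
# Hypothetical data matrix for six cases with continuous X and Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope of the trend line: covariance of X and Y over variance of X.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)

print(round(b, 2))  # a clearly positive slope: evidence of an association
```

A positive slope corresponds to the empty top-left and bottom-right corners in the scatter plot; a slope near zero would indicate no association.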
This comparability is achieved by selecting cases from a specific population (as in a survey), by randomizing these cases over experimental and control conditions, and by observing the change in the value of Y (rather than observing the value of Y itself). Observing the change in the value of Y requires that the value of Y is measured both before (pre-test) and after the treatment (post-test).
The terms pre-test and post-test are confusing because no testing is involved, only measurement. Therefore, it would be better to use the terms pre-treatment measurement and post-treatment measurement.
This procedure results in as many data matrices as there are experimental groups. Here follows an example of two matrices in the simplest form of experiment (with two groups, e.g., one experimental and one control group).

Group 1 (X = x1):

Cases       Change in value of Y
Case 1.1    y1.1
Case 1.2    y1.2
Case 1.3    y1.3
Case 1.4    y1.4
Case 1.5    y1.5
N = 5

Group 2 (X = x2):

Cases       Change in value of Y
Case 2.1    y2.1
Case 2.2    y2.2
Case 2.3    y2.3
Case 2.4    y2.4
Case 2.5    y2.5
N = 5
In this example there are two matrices, one for each group that is defined by the value of X (x1 and x2) that is assigned experimentally. If these two matrices are filled with scores (or, in other words, if the changes in value of Y have been observed in all cases of both groups), a causal association between X and Y can be observed if the changes in value for Y in one group (e.g., the experimental group) are different (on average) from the changes in value for Y in the other group (e.g., the control group). This procedure is the same as discussed above for observing the difference between two types of alliance in a cross-sectional survey. The difference in average change in value of Y between two experimental groups is internally valid evidence of a causal relation because of the preceding manipulation of the value of X, which was absent in the example of two types of alliance. Experimental research is difficult. For instance, it is almost impossible to manipulate someone's loyalty to an existing brand (such as Coca Cola), which might be so ingrained in a customer that it can even be considered a part of someone's identity. It is also difficult to control for factors other than brand loyalty that might influence someone's intention to purchase an existing brand extension (such as negative newspaper reports about the quality of Coke shoes). Students who tested (a version of) the proposition Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular, therefore, invented a fictional brand (Alpha, famous because of its shampoo line) with an equally fictional brand extension (Alpha sporty digital watch). Subjects for the experiment were recruited from the population of students in business administration
at the Rotterdam School of Management (N=30). These subjects were randomly assigned to three experimental groups (N=10 each). Each group received a treatment which resulted in, respectively, a low, medium and high level of loyalty to the Alpha brand. (Technical details such as how and why this treatment works are not discussed here.) Next, each subject was given a description of the Alpha sporty digital watch. Finally, subjects were asked how likely it was that they would purchase the watch. This score was entered into the matrix. Assume that the strength of the purchase intention was measured on a scale of 1 to 10. The data matrices could look like the following:

Group 1. Low loyalty

Cases        Purchase intention
Student 1    1
Student 2    1
Student 3    2
Student 4    2
Student 5    3
Student 6    3
Student 7    4
Student 8    4
Student 9    5
Student 10   5
N = 10       mean (low loyalty) = 3.0

(The matrices for Group 2, medium loyalty, and Group 3, high loyalty, have the same form.)
In these three connected matrices, an association between the values of X and Y can be observed: the average level of purchase intention is 3.0 in the low loyalty group, 5.0 in the medium loyalty group and 8.0 in the high loyalty group. This association itself cannot be seen as evidence of a causal relation. However, the fact that none of these subjects could have had a loyalty to the fictional Alpha brand before they entered the experiment and, hence, no previous intention to purchase an Alpha sporty digital watch could have existed, implies that differences in observed levels of purchase intention can only be caused by the experimental treatment, i.e., by the different levels of brand loyalty. Note that a pre-treatment measurement is not required in this experiment because subjects cannot have a pre-experimental intention to purchase a fictional brand. An experiment, thus, generates evidence about the presence (or absence) of a causal relation by demonstrating that the effect can be produced (at will) by manipulation of a cause. Experiments have a high level of internal validity because of this direct link between manipulation and effect, provided that other potential causes of differences in the values of the dependent concept can be ruled out. In contrast, cross-sectional surveys have a low internal validity because it cannot be known (from the survey) how an observed association came into existence. However, surveys have a higher level of ecological validity, because they allow the observation of associations that actually exist (i.e., that are not experimentally produced).
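The three-group comparison can be sketched as follows. The Group 1 scores are those from the matrix above; the scores for Groups 2 and 3 are hypothetical values, chosen only so that their means match the 5.0 and 8.0 reported in the text:

```python
# Purchase intentions (scale 1 to 10) per experimental group.
groups = {
    "low":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],    # from the matrix above
    "medium": [3, 3, 4, 4, 5, 5, 6, 6, 7, 7],    # hypothetical, mean 5.0
    "high":   [6, 6, 7, 7, 8, 8, 9, 9, 10, 10],  # hypothetical, mean 8.0
}

means = {name: sum(scores) / len(scores) for name, scores in groups.items()}
print(means)  # {'low': 3.0, 'medium': 5.0, 'high': 8.0}
```

Because the levels of loyalty were assigned experimentally, the ordering of these three means is what licenses the causal interpretation discussed above.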
This matrix looks like the matrix of a cross-sectional survey discussed above, but there is an important difference, namely that the association here does not refer to the values of X and Y but to the changes in the values of X and Y (which implies that each is measured at least two times). The change in value of Y must be subsequent to the change in the value of X, implying that the measurements of the value of Y must take place at later points in time than the measurements of the value of X. We call this research strategy a longitudinal survey, defined as a research strategy in which a change in the values of the relevant concepts is observed in all members (or in a sample) of a population of instances of a focal unit. The data matrix must contain all data (i.e., all changes in value of both concepts) from all cases in the population or in a probability sample. The most challenging aspect of a longitudinal survey is determining how much time should elapse between the change in the value of X and the subsequent change in value of Y. A good estimate of this time lapse might be based on theoretical insights, i.e., an understanding of the mechanisms or processes (and their duration) by which the change in X affects Y, and/or on experiences of practitioners who have first-hand knowledge of these processes.
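A minimal sketch of such a longitudinal data matrix, with hypothetical values, might look as follows; each case has X measured at two early points in time and Y measured at two later points, so that the change in Y is subsequent to the change in X:

```python
# Hypothetical longitudinal data matrix:
# (x at t1, x at t2, y at t2, y at t3) for each case.
cases = [
    (2, 5, 10, 16),
    (3, 3, 11, 11),
    (4, 7, 9, 15),
    (5, 4, 12, 10),
]

# The analysis concerns the changes in value, not the values themselves.
changes = [(x2 - x1, y3 - y2) for x1, x2, y2, y3 in cases]
print(changes)  # [(3, 6), (0, 0), (3, 6), (-1, -2)]
```

In this invented example, the case whose X did not change shows no change in Y, which is the kind of pattern a longitudinal survey is designed to reveal.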
Note. An event study is a version of the longitudinal survey.
Further reading
The survey is defined in this book as a research strategy in which a population statistic (such as a slope b of the regression line) is generated from the data matrix that contains all data from all cases in the population or in a probability sample. This definition is derived from the mainstream literature on survey research. A recent example from this literature is Jelke Bethlehem, Applied Survey Methods (Wiley, 2009). Bethlehem defines the survey as follows: A survey collects information about a well-defined population (p.1). Bethlehem states that populations need not necessarily consist of persons. For example, the elements of the population can be households, farms, companies, or schools. How is information collected? Although Bethlehem states that typically, information is collected by asking questions to the representatives of the elements in the population, the term typically is used here empirically, indicating that in practice survey data are often collected by means of asking questions. This does not mean that a survey is defined by this mode of data collection. Information (or evidence as defined in this book in a next chapter) can also be collected from interviews, observation, records, data bases, etc. Using the word survey for a questionnaire is confusing and should be avoided. Good introductions to survey research are Fowler (2009) and Groves et al. (2004). Good introductions to experimental research are Field & Hole (2003) and Chapter 11 (Causal research design: experimentation) of Malhotra & Birks (2003).
Eligibility
The requirement that cases should be selected from the members of the theoretical domain seems obvious. But this requirement is often violated. For instance, it is quite common that companies or business units are selected as cases for testing propositions about projects or teams or, another example, that consumers are selected as cases for testing propositions about advertisements or brands. Often this mistake co-occurs with another one (which might be the reason for it), namely that persons are asked for their opinions about the correctness of the proposition (rather than the proposition itself being tested).
Examples. A common mistake in a test of a proposition about brands (such as Brands with more X have more Y) is to ask consumers whether they think that brands with more X have also more Y. Such a study is an opinion poll, not a proper test of the proposition. Such a test requires that a population of brands (not consumers) is selected and that X and Y are measured for each brand. Similarly, in a test of a proposition about projects (e.g., More X is associated with more success in projects) it often occurs that cases of companies are selected and that managers are asked to report whether they think that, in their company, projects with more X are more successful. A proper test of the proposition
requires that a population of projects is selected and that X and success are measured for each project.
The criterion for eligibility, thus, is the definition of the focal unit and the delimitation of the theoretical domain. The cases that define the rows in the data matrix must be members of the theoretical domain (i.e., the universe of cases of the focal unit).
Prioritization
In principle each member of a theoretical domain (which is the universe of all cases of the focal unit) is eligible for a test. But it is more useful to test the proposition in some of these cases than it is in others. Take the following replication history, a series of experimental tests of a proposition, discussed in Chapter 3.

Figure 3. A series of replications (from Cumming & Finch, 2001:557)
Effect size    Subjects
1.5            Second year IBA students RSM
-0.2           Cab drivers Istanbul
0.7            High school students Rotterdam
0.18           Dockworkers Rotterdam
0.35           First year Psychology students EUR
-0.1           Cab drivers Mexico City
0.3            RSM professors
Suppose the effectiveness of a drug was tested in these experiments. What would the results have looked like if these experiments had used only students and professors as subjects, or only cab drivers? An interpretation of a replication history requires generating possible interpretations of the different test results and attempting to link the results of the different tests to the types of subjects. Based on this assessment, a type of population for a next test can be selected in such a way that its result could contribute to a deeper understanding of test results. For instance, another population of cab drivers could be prioritized for the next test in order to find out whether cab drivers behave consistently differently from other populations. The criterion for prioritization, thus, follows from the researcher's interpretation of the replication history of the proposition.
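The kind of qualitative assessment described here can also be sketched in code. The grouping of the seven results into students (including the academic workers of test 7) and non-academic workers follows the interpretation developed in Chapter 3; it is one possible reading of the replication history, not a given:

```python
# The seven test results, tagged with a rough subject type.
tests = [
    (1.5,  "students"),  # second year IBA students RSM
    (-0.2, "workers"),   # cab drivers Istanbul
    (0.7,  "students"),  # high school students Rotterdam
    (0.18, "workers"),   # dockworkers Rotterdam
    (0.35, "students"),  # first year Psychology students EUR
    (-0.1, "workers"),   # cab drivers Mexico City
    (0.3,  "students"),  # RSM professors: academic workers, closer to students
]

by_type = {}
for effect, kind in tests:
    by_type.setdefault(kind, []).append(effect)

means = {kind: round(sum(v) / len(v), 2) for kind, v in by_type.items()}
print(means)  # students clearly positive on average, workers close to zero
```

Such a tabulation suggests which type of population to prioritize for the next test, for instance another population of cab drivers.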
Feasibility
If different types of eligible cases have roughly an equally high level of priority, data availability and accessibility of data sources are aspects that can be taken into account in deciding which cases actually will be included in the test. Strictly speaking, this criterion (researcher's convenience) is not a methodological criterion but an economic one. Obviously, it is a good thing that resources are not wasted on studies that fail for avoidable practical reasons or that might have been conducted much more easily if other cases had been selected. The following two rules-of-thumb apply: (1) If possible, do not conduct new measurements. It is much more economical to download data from a data base if publicly accessible relevant data on relevant cases (i.e., eligible and prioritized cases) exist. This is often the case if the focal unit is countries, markets, large companies, mergers, etc. (2) If new data must be collected, select cases which are easily accessible. These cases might be geographically close (e.g., a population of companies or persons in Rotterdam, or in the West of the Netherlands, if the researcher is based at the Rotterdam School of Management), linguistically close (e.g., companies or persons about which data can be collected in the researcher's own language) or close in other respects (such as the availability of a contact person in a relevant network which makes it more likely that data can be collected from entities in that network). Convenience is an appropriate criterion for this final selection of a population of cases from the set of prioritized populations. However, when a population has been selected for the test, all members of that population must be included in the test (or a random or probability sample from the population). Convenience sampling is not allowed.
The criterion for feasibility, thus, is convenience, i.e., the convenience of using data that are already available for a population and not for another population, or the convenience of having easy access to a population from which data must be collected.
After a specific type of population has been prioritized for the test, there are still a huge number of populations of such a type from which the population can be chosen. For example, if it has been decided that the proposition A higher level of education is associated with a more frequent use of high-end mobile phones will be tested this
time in a population of non-Western non-male older people (rather than again in a population of Western undergraduate students), a choice can be made from many different populations of non-Western non-male older people. Convenience is not only allowed but also recommended here. Depending on the definition of non-Western (does this refer to ethnicity, country of birth, or country of residence?) one might find a population that is close to the researcher in terms of accessibility through gatekeepers, or in terms of location, culture, or language. It might, for instance, be the case that the most easily accessible population of non-Western non-male older people for a specific researcher is a population of Turkish older women that have a weekly meeting in a cultural center in the street in which she lives. A population of Turkish older women that have a weekly meeting in a cultural center may look too small or too specific for any serious test of a theory (and, therefore, this example might be interpreted as a joke) but it might actually be a very appropriate choice of a population for a test. To begin with, if the focal unit is potential mobile phone users (in general), this population is eligible because it is part of the domain. Second, if the proposition as yet has been tested only in a number of populations of Western (or Westernized) young people, it makes a lot of sense to conduct a next test of this proposition (if it is claimed that it is true in the entire domain of all consumers in the world) in a part of the domain that looks exotic. Third, the fact that this population of Turkish women is very specific is its strength. If the test result confirms the results of previous tests, then the test contributes more to the confidence in the correctness of the proposition than a next test in a population of Western youngsters would, because the same result is obtained despite the huge differences between populations.
If the test result is different from those in previous tests, then this is also a huge contribution to the understanding of the theory. Such a result would suggest that the proposition is correct for only parts of the domain, and this might be the beginning of a more precise delimitation of the domain. The more specific the population, the higher the chance of finding something specific (such as a test result that is different from previous ones). Furthermore, the characteristics that define a specific population (such as Turkish-ness or old age) point to directions in which explanations for a differing result can be found. In sum, the more specific the population, the more informative is the test result. Fourth, the small size of the population is not a disadvantage because the specificity of the population in the domain and the significance of the test result for the theory is determined by the characteristics of the population (and how these differ from those of other populations) and not by its size. The small size of a population is not a disadvantage for the significance of the test result but it is very advantageous for the feasibility of the test. If the population of Turkish older women have their weekly meeting in a cultural center next door to the researcher's home, it might be quite feasible to attend the meeting and to collect data there. If the population was defined as all Turkish women in Rotterdam (or, even worse, as all Turkish women in the Netherlands) it would not be possible to collect data in person. Even data collection in a small sample of such a large population would be much more difficult. But also, even more importantly, the sample would have to be much larger in order to avoid large levels of uncertainty of the population estimates caused by sampling variation (resulting in large confidence intervals), whereas no
estimation of the statistic is needed in a study of a small population (because sampling variation does not occur). Normally, the smaller the population, the more feasible the test. Summarizing this discussion: (a) it is useful for the contribution of the test result to the theory to conduct a test in populations that are as specific as possible, and (b) it is useful for the feasibility of the survey to select a population that is as small as possible, particularly if valid and reliable measurement requires that the members of the population can be accessed easily. Fortunately, these two recommendations converge, because populations get smaller to the extent that they are more specific. For instance, the population of potential mobile phone users selected for the test gets smaller with the addition of each of the following specifications: non-Western, Turkish, grannies, living in Rotterdam, and being a member of a specific group with regular meetings. The recommendation that the selected population be as specific as possible is always applicable, even in surveys in which measurement does not require that the members of the population are accessible, e.g., if data are already available in databases. This discussion results in a strong recommendation to design and conduct only one of the following two forms of survey: (1) If good data are available in a database: conduct a survey of a (prioritized) type of population for which relevant (good) data about all members are available in the database. (2) If data must be collected by the researcher: conduct a survey of a very small (prioritized) type of population that is easily accessible. Note that sampling is not necessary (and hence should not be done) in either approach, because both approaches are designed such that they facilitate easy data collection about all members of the selected population.
Because there is no sampling, there is also no sampling variation, and hence no need to use inferential statistics and no need to worry about sample size. The principles discussed here also apply to the selection of a population for a test in a longitudinal survey. Feasibility might receive even more emphasis in such a survey because data must be collected at least twice from each member of the population (or such data must be available in the database).
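The arithmetic point is worth making concrete: when every member of a (small) population is measured, the computed statistic simply is the population value, so no estimation is involved. A minimal sketch in Python, with invented data:

```python
# Hypothetical scores for every member of a small, fully observed population
# (e.g., weekly hours of mobile phone use; the numbers are invented).
# Because every member is measured, the computed value IS the population
# value; there is no sampling variation and no confidence interval to estimate.
phone_use_hours = [1.2, 0.8, 2.5, 1.9, 0.4, 3.1, 1.1, 2.0]

population_mean = sum(phone_use_hours) / len(phone_use_hours)
print(population_mean)
```

Descriptive statistics suffice here; inferential machinery would have nothing to infer.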
The recommendation to select a very small (prioritized) type of population that is easily accessible applies to populations that exist without intervention by the researcher. This is not the same as drawing a convenience sample and calling that sample a population. A population of Turkish older women who have a weekly meeting in a cultural center is a population that exists independently of the home address of the researcher. If the researcher lives next door, this already existing population is easily accessible for data collection. However, a group of friends or relatives of the researcher is at least partly defined by the relation of its members to the researcher. Such a group is not an extant population that can be defined independently of the researcher. Findings in such a group cannot be interpreted as related to specific characteristics of the population such as being Turkish, being grannies, and living in Rotterdam. In other words, one of the criteria for selecting an existing population for a test can (and must) be convenience, but a convenience sample cannot be called a population.
5.3. Criteria for the selection of experimental cases and for generating experimental conditions or groups
In experiments two or more data matrices are generated, one for each value of the independent concept that is experimentally produced. Each of these multiple data matrices must be populated, for each of the cases (usually subjects), with either the value of the dependent concept measured after the experimental treatment (post-treatment measurement) or the difference between that value and the value measured before the experimental treatment (pre-treatment measurement).
Eligibility
The general criterion for eligibility is the definition of the focal unit. Cases that define the rows in your data matrix must be members of the theoretical domain (i.e., the universe of cases of your focal unit). How does this principle apply in experiments, i.e., in situations in which real-life cases (or populations) cannot be selected but rather must be generated experimentally? In most experiments, this is not a difficult issue. Whereas a cross-sectional test of a proposition about people (consumers, employees, leaders, CEOs, etc.) requires that a population is selected from the relevant domain (i.e., of consumers, employees, etc.), an experimental test similarly requires that subjects are recruited from such a population in the domain.
Prioritization
A population needs to be specified from which subjects will be recruited. The principle that this population should be as specific as possible, as discussed above for a survey, applies here as well. It is a general principle. The more specific the subjects in an experiment are (i.e., the more consistently different they are from subjects in previous experiments), the more informative is the test result. Because more than 90% of all published experiments in psychology and in business research (including marketing and organization research) have used students in psychology and business administration as their subjects (and because most of the theories that have been tested in this way claim to be universal), it is very useful to replicate these tests with entirely different types of subjects (such as, e.g., non-Western grannies).
The cases in an experimental test should be instances of the focal unit. This implies that an experimental test of a proposition about customers requires that persons are recruited for the experiment. The cases in the resulting data matrices are persons. However, an experimental test of a proposition about brands, products, or advertisements requires that the cases in the data matrices are brands, products, or advertisements. It might still be necessary to recruit persons for the experiment. If this occurs, these persons are not cases themselves but only function as raters for the measurement of variables of the brands, etc., such as purchase intention, brand loyalty, etc.
Feasibility
For surveys, application of the principle that a population should be selected that is accessible for data collection (if the researcher must collect data and cannot obtain them from a database) results in the recommendation to select a small population. Applying this principle (facilitation of easy recruitment) to the selection of subjects for an experiment results in the recommendation to select subjects from a much larger (though still quite specific) population. Using non-Western grannies as an example again, for the recruitment of subjects for an experiment it would not be wise to select them only from the population of Turkish grannies who have a weekly meeting in the cultural center; rather, they should be selected from the whole population of Turkish grannies in Rotterdam.
Experimental conditions
Subjects must be placed in different experimental conditions, each of which represents a specific value of the independent concept as defined in the theory. A concept such as a person's level of education cannot be manipulated experimentally, and hence the theory that a higher level of education results in more high-end mobile phone use cannot be tested in this way. However, a theory about emotional states or about beliefs that influence the purchase of specific types of mobile phones can be experimentally tested if the experimenter succeeds in invoking the appropriate emotional states in subjects and if the purchase can be realistically simulated. If the focal unit of the proposition is a team or a situation, it is the experimenter's task to compose teams (from eligible subjects, i.e., not always from students) and to generate or simulate situations (in which, again, eligible subjects must participate) that can be seen as valid representations of teams or situations as defined in the theory. There are some methodological arguments for recruiting subjects randomly from the selected population, but this is almost never done. It is assumed that this will not much affect the results of the experiment. However, given a pool of subjects that is not representative of a population, it is certainly required that they are randomly assigned to the different experimental conditions.
CHAPTER 6 MEASUREMENT
The value of the concept(s) of the proposition must be measured in each case. When a concept is measured in a study, it is called a variable. A variable is a concept that is more precisely specified for the cases in the study. This specification must describe in detail the possible values that the variable can have. For instance, if the concept that must be measured is project success, the variable success might be specified in different ways, such as monetary success expressed as the amount of dollars generated by the project (i.e., a ratio-type variable), or as satisfaction expressed as a (subjective) rating by the company on a scale from 'none' via 'a bit' and 'quite a lot' to 'very much' (i.e., an ordinal-type variable). This specification must be valid and the resulting score must be reliable. As discussed above, research strategies are defined by their different types of data matrices, not by their methods of measurement. For instance, the data that must populate the matrix in a survey can be collected in any way, such as observations, content analysis, semi-structured interviews, and also questionnaires. The latter are only preferred if data must be collected regarding a person's opinions and beliefs. This chapter discusses a stepwise procedure for the development of an instrument for the valid and reliable measurement of the value of a concept in each case selected for a test. The seven steps are the following: 1. Formulate a more precise definition of the concept. 2. Determine the object of measurement. 3. Identify the location of the object of measurement. 4. Specify how evidence will be extracted from the object of measurement. 5. Specify how the object of measurement will be accessed. 6. Specify how evidence will be recorded. 7. Specify the type of the variable (ratio, interval, ordinal, nominal) and describe the possible values.
The outcome of this procedure is a measurement protocol for each concept, in which it is precisely specified how a score (i.e., the value of the variable that will be put in the data matrix) is generated in each case.
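Purely as an illustration (the record type and its field names are my own invention, not a standard format), the outcome of the seven steps can be thought of as a structured record, one per concept:

```python
from dataclasses import dataclass, field

@dataclass
class MeasurementProtocol:
    # One field per step of the procedure described above (illustrative only).
    concept_definition: str       # step 1: precise definition of the concept
    object_of_measurement: str    # step 2
    location: str                 # step 3
    extraction_method: str        # step 4
    access_method: str            # step 5
    recording_method: str         # step 6
    variable_type: str            # step 7: ratio, interval, ordinal, or nominal
    possible_values: list = field(default_factory=list)  # step 7

# A hypothetical protocol for the financial-success example:
protocol = MeasurementProtocol(
    concept_definition="Financial success: monetary gain from the project",
    object_of_measurement="Financial records of the project",
    location="Company accounting department",
    extraction_method="Read cost and revenue lines; compute the gain",
    access_method="Via company gatekeeper (financial staff)",
    recording_method="Copy amounts onto a coding sheet",
    variable_type="ratio",
    possible_values=["any amount in euros"],
)
print(protocol.variable_type)
```

The point of the sketch is only that every step of the procedure yields an explicit, recordable decision.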
In this discussion of the development of a measurement protocol, with success as a running example, success will be defined in the following three ways: 1. Financial success, defined as the amount of monetary gain for the company resulting from the project. 2. Timely delivery, defined as whether the project has delivered its results before a specified deadline. 3. Satisfaction, defined as the extent to which a project is evaluated as successful by the company. Note that these are just three possible specifications of the concept success as used in a theory. Usually, a theory clearly specifies one of these different meanings as the one to which the theory refers, i.e., as the type of success that is explained by the theory or proposition. For the stepwise process described here, it is useful to make an initial (provisional) decision about the scale that one wants to develop. For instance, if success is defined as financial success, will it be required to measure an exact amount of dollars or euros, or would it be enough to rate it as low, medium, or high?
These examples show that different specifications of the concept of success result in different variables, i.e., different types of attributes, with different types of possible values that must be observed in different types of objects of measurement. Although,
in the example, the concept (success) is an element of one focal unit (project), the three variables refer to different objects of measurement (the project as defined in the financial records, the delivery date, and a company's evaluation).
The point of this exercise (in which, first, an object of measurement is specified and, next, its location is determined) is that, in principle, researchers must know what their object of measurement is and where it (literally) is. Next, it must be determined what kind of evidence must be extracted from it and how this should be done (step 4), how the object of measurement will be accessed for extracting the evidence in the manner specified (step 5), and finally how this evidence will be recorded (step 6) so it can be brought to the researcher's desk for coding (step 7).
Step 4: specify how evidence will be extracted from the object of measurement
Measurement of the value of a variable requires that evidence is extracted from the object of measurement that corresponds with that value. Different variables require very different instruments, which vary from complicated (such as extracting evidence of a person's intelligence by means of a battery of tests or, more accurately, a set of measurements) to simple (such as extracting evidence of a project's costs by reading the appropriate lines in a financial report).
Financial success. After identification of the relevant financial records or reports, the relevant financial numbers need to be identified and read. If these records or reports do not provide a number for the total costs and revenues of a project, numbers for
subcategories of costs and revenues need to be identified and read in different lines, columns, pages, or files. The set of different numbers identified in this way forms the evidence that is extracted. It must be specified beforehand, in detail, which numbers in the records count as evidence for the value of the variable as defined. The instrument required for extracting evidence of the value of the variable financial success, thus, is reading the right numbers. Timely delivery. After identification of the relevant documents, information about the planned and actual delivery dates must be found in these documents and read. Satisfaction. Once an evaluation report that contains evidence of how the company evaluates the project has been identified, the report must be read to retrieve the required evidence.
Only after it has been specified in this way which kind of evidence must be extracted from which type of object of measurement is it possible to specify the practical steps that are required for actually obtaining the evidence that is needed, for each case.
If the object of measurement is not an opinion, a belief, or a person's experience (as is the case in these three examples), it is likely that it is something that can be observed by the researcher herself and that, in principle, she does not need to ask another person (a respondent or informant) to extract the necessary evidence. In such cases, usually gatekeepers must be passed in order to get access to the evidence that is needed for a measurement. In the examples discussed here, staff must first help the researcher to identify the relevant records or documents and then allow access to them. Sometimes the evidence that is needed is publicly available, e.g., in annual reports, on company websites, etc. Therefore, it should be investigated whether such publicly available data exist for the relevant cases before access to companies is negotiated. In all cases the quality of these data should be assessed before using them.
If the researcher cannot get access to primary sources of evidence (such as the relevant financial records), she must ask persons who have access (informants or respondents) to provide the necessary evidence. This request basically can take one of two forms: (a) interviewing informants (face-to-face or by telephone) or (b) sending a questionnaire (either paper or electronic). However, note that it cannot be assumed that informants provide the correct evidence, particularly not if information is requested by means of a questionnaire. In such cases it is highly recommended to talk to informants face-to-face or by telephone in order to create opportunities for explaining in detail precisely what evidence is needed. Also, it is recommended to ask informants how they have extracted the information that they have provided. It must be ascertained as thoroughly as possible that the information that is received is evidence as specified above.
Usually the researcher knows from the outset what kind of score on what kind of scale is required or desired as the outcome of the measurement, and this knowledge will steer decisions in the development of the measurement protocol. This can be illustrated again with the examples.
Financial success. It will be clear from the outset that the score (for the data matrix) should be an amount in some currency (dollar, euro, etc.). But, if the variable is not defined as the extent of financial success but rather as the presence or absence of financial success, then the dollar or euro amounts must be coded into one of these two possible scores. This means that a coding procedure must be applied by which monetary amounts are evaluated as indicating success (presence/absence), which requires that a cutoff point is specified. Timely delivery. A criterion must be specified for evaluating the date of delivery as on time or too late (or any other score deemed relevant for the proposition). Satisfaction. Text analysis, document analysis, and content analysis are the terms used for generating scores from texts. Coding is simple if an evaluation report has a clear conclusion in which the project is unequivocally judged as a success or not. But coding is more complicated if such a judgment must be generated by the researcher from different, ambiguous, and sometimes contradictory, statements in a report. Then the researcher must have a procedure for finding the evaluation result in the text.
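The coding step for financial success (dichotomizing a monetary amount at a pre-specified cutoff point) can be sketched as follows; the function name and the cutoff value of zero are illustrative assumptions, not prescriptions:

```python
def code_financial_success(gain_euros, cutoff=0.0):
    """Code a monetary gain as presence or absence of financial success.

    The cutoff must be fixed in the measurement protocol BEFORE coding
    starts; 0.0 (any positive gain counts as success) is just an
    illustrative choice.
    """
    return "success" if gain_euros > cutoff else "no success"

print(code_financial_success(12500.0))   # a project with a positive gain
print(code_financial_success(-3000.0))   # a project with a loss
```

Writing the rule down as an explicit procedure is exactly what makes the coding reproducible by a second coder.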
Measurement validity
A measurement of an attribute (or a variable characteristic) is valid if variations in the attribute causally produce variation in the measurement outcomes (Borsboom et al., 2004). It is not possible to objectively assess the degree to which measurement validity has been achieved. A level of validity is an outcome of argumentation and discussion. This can be illustrated with the three indicators of success of a project.
Financial success. If financial records must be read in order to retrieve financial data indicating the degree of success of a project, be it directly or indirectly (after some computation), the type of financial data that are needed must be precisely specified. It is not possible to just copy any financial number from the records, but only those numbers whose meaning is precisely defined. The meaning of a specific number (most often an amount in, say, dollars) is known if it is known how it was produced. For instance, if the costs involved in a project must be computed from financial records (in order to assess whether a financial gain has occurred), it must be known how the company assigns costs to projects. If relevant costs are not included in the costs documented in the financial records, or when revenues are attributed to the project that actually were generated in ways not connected to the project, it is possible that the financial success of the project is overestimated. And, conversely, if costs are attributed to the project that actually are not related to the project, or if not all revenue from the project is included in the revenue as documented in the records, underestimation of the project's financial success is possible. If necessary, financial data must be recalculated in such a way that they exactly represent the researcher's definition of the concept. If the records or reports do not contain sufficient information on how the various numbers or amounts have been calculated, it may be necessary to retrieve this information from (financial) staff in order to judge the validity of those data. If these are not valid in terms of the researcher's definition, staff could be asked to identify and retrieve other, more valid evidence. In sum, a valid way of extracting evidence of the financial success of a project consists of: 1. Precisely defining what the researcher considers to be the financial success of a project.
2. Translating that definition into precisely described operational procedures. 3. Evaluating the firm's procedures for computing the financial success of a project, if any, against these procedures. 4. If necessary, identifying or computing other, more valid evidence. The criterion for measurement validity of this instrument is whether every detail of its procedures can be justified in terms of the researcher's definition of financial success. Delivery time. There might be different types of delivery time of project results (the publication of the written report, the oral presentation of the results to management, the final financial record, etc.), some of which might not count as the delivery time as meant in the researcher's definition. Therefore, the researcher must define in a quite detailed way what is considered the delivery time in the theory that is tested (and what is not). The researcher's definition needs to be translated into precise procedures that are applied to candidate pieces of evidence of delivery time, which are identified by reading the relevant documents or from the verbal reports of company staff who were involved in the end phase of the project. The criterion for measurement validity of these procedures is again whether they are justified in terms of the researcher's definition of delivery time. Satisfaction. This indicator of success refers to success as defined by the company, not by the researcher. This is an important distinction, which implies that it is not necessary to apply the procedures outlined in the two previous examples. There is no need to evaluate the correctness of the company's judgment. The outcome of the company's evaluation can be accepted, irrespective of how it was generated (although the researcher might be interested in the company's procedures and might want to try to collect evidence on these procedures as well).
Measurement validity here refers to the validity with which the researcher identifies, retrieves, and codes the company's evaluation, irrespective of how the company has generated its evaluation. If this evaluation has not been recorded in a document by the company, the researcher must (re)construct a company's satisfaction with a project through interviews. There are more and less valid ways of retrieving judgments (such as these evaluations of project success) from respondents in interviews and/or through questionnaires, which will not be discussed here.
Measurement validity, thus, concerns the quality of each part of the measurement protocol in terms of the criteria that follow from the (as precise as possible) definition of the concept that is measured.
Reliability
If a measurement is valid, the next quality criterion is that it is also reliable. Reliability is the precision of the scores obtained by a measurement. Reliability as defined here (i.e., the precision of scores), in contrast to measurement validity, can be measured. This is usually done by generating more than one score for the same variable in the same object of measurement and by assessing how much they differ. The level of achieved reliability of scores can be obtained by calculating the degree of similarity of scores for the same object of measurement and expressing it as an inter-observer, inter-rater, or test-retest similarity rate.
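A simple inter-rater similarity rate, the proportion of objects of measurement on which two raters assign the same score, can be computed as in this sketch (the scores are invented):

```python
def similarity_rate(scores_a, scores_b):
    """Proportion of cases on which two raters agree: a simple inter-rater
    reliability measure. (More refined measures, such as Cohen's kappa,
    additionally correct for chance agreement.)"""
    assert len(scores_a) == len(scores_b), "raters must score the same cases"
    agreements = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return agreements / len(scores_a)

# Two raters coding the same five projects for presence of success:
rater1 = ["success", "success", "no success", "success", "no success"]
rater2 = ["success", "no success", "no success", "success", "no success"]
print(similarity_rate(rater1, rater2))  # the raters agree on 4 of 5 projects
```

The same function applies unchanged to test-retest comparison: score the same objects twice and compare the two score lists.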
Financial success. When a valid procedure for measuring financial success of a project has been developed, its reliability can be assessed by arranging that two or more persons, either company staff, or researchers, or their assistants, collect and code data using these guidelines and then compute the degree of success from these data. If the reliability of the
scores is insufficient (in terms of a criterion that was formulated a priori), measurement procedures should be further specified until a sufficient level of reliability is achieved. Delivery time. If a valid procedure for measuring the exact dates of planned and actual delivery and for determining its timeliness is developed, the reliability of the score can be assessed by arranging that two or more persons identify both the planned and the actual delivery date and then rate the delivery's timeliness. Scores are reliable if different raters generate the same score. Satisfaction. When a valid procedure for the measurement of the value of the company's project evaluation is developed, the reliability of the scores obtained in this way can be assessed by using the same procedures described above for assessing the reliability of financial success or timeliness of delivery. If evidence is extracted through qualitative interviews with persons, the more structured a qualitative interview is (e.g., instructions regarding the interview as well as the questions specified in the interview guide), the more reliable will be the evidence generated in the interview. Different interviewers interviewing the same person should obtain similar or the same evidence. If the data are generated through a standardized questionnaire, consisting of questions with a set of response categories, reliability is usually assumed to be good, although different measurement conditions (e.g., how the questionnaire is introduced to the respondent, the absence or presence of other people such as supervisors or colleagues, whether scores are obtained in an interview or by self-completion) will influence the reliability of the scores that are obtained.
Questionnaires
When a person is the object of measurement, i.e., when the concept is a belief, an opinion, a psychological trait, or an experience, it is very likely that evidence can only be extracted by asking that person questions. This implies that an interview with that person must be conducted or that they must complete a questionnaire. However, constructing a good questionnaire (i.e., a questionnaire that will result in valid and reliable scores) is not only very difficult but also very time-consuming, because it requires a couple of rounds of pre-testing (with real respondents). Therefore, it is recommended to make use of a questionnaire only when it is necessary and therefore appropriate, hence only when the respondent him/herself is the object of measurement. In general, it is not a recommended strategy to design a study that relies on evidence provided by informants (rather than on evidence extracted by the researcher from the object of measurement), but collecting evidence from informants by means of questionnaires must be avoided in particular.
Further reading
This chapter is to a large extent taken from Appendix 1 (Measurement) in Dul & Hak (2008). The definition of measurement validity as discussed in this chapter (the degree to which variations in the attribute causally produce variation in the measurement outcomes) is also discussed by Rossiter (2002) and, very thoroughly, by Borsboom et al. (2004). This chapter does not discuss questionnaire construction. The difficulty of constructing a valid questionnaire is usually severely underestimated, which is one of the reasons why it is recommended in this book to avoid using a questionnaire for data collection if possible. When constructing a questionnaire, it is recommended to follow the C-OAR-SE procedure (Rossiter, 2002) for developing valid items in standardized questionnaires, which should be followed by a cognitive pre-test with real respondents (Willis, 2005; Hak et al., 2008).
CHAPTER 7 HYPOTHESIS
As stated in Chapter 1, the core assumptions of any theory-testing study are: If the proposition is true in the domain, then it must be true in each population of the domain. If the proposition is true in a population, then it must be possible to observe (or see) this. In other words, there must be empirical evidence for the correctness of the proposition in the population. We can specify what we expect to observe in the population if the proposition is true. This specification of our expectation is called a hypothesis (or expected pattern). Note that in the literature no clear distinction is made between propositions and hypotheses. In this book, a proposition is a theoretical and hence general statement, whereas a hypothesis is an expectation (derived from the proposition) about what we will observe in a data set. When we have chosen a research strategy (e.g., a survey), have selected a population (e.g., a specific population of female customers), and have developed a measurement protocol, this hypothesis can be very specific. To such an extent, even, that it can be expressed as a range of values of a parameter. This chapter discusses what hypotheses should look like. The discussion is divided into two parts: (1) determining the appropriate parameter; and (2) specifying the values of that parameter which, if observed, would be consistent with the proposition.
Note that a p value (an indicator of statistical significance) has hardly any relation to effect size. A p value is mainly a function of sample size: the larger the sample, the smaller is p. If the sample is large enough, even minute effect sizes (nearing zero) will have a p value that is small enough for concluding statistical significance. Below, in Chapter 8, we will discuss more reasons why the concept of statistical significance is not useful (see also Schwab et al., 2011).
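The dependence of p on sample size can be demonstrated with a deliberately simplified one-sample z test (known standard deviation assumed); the effect size of 0.01 standard deviations is chosen to be minute:

```python
import math

def p_value_two_sided(effect_size, n, sd=1.0):
    """Two-sided p value for a one-sample z test of a mean against zero,
    assuming a known standard deviation (a textbook simplification)."""
    z = effect_size / (sd / math.sqrt(n))
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# The same minute effect (0.01 standard deviations) at growing sample sizes:
for n in [100, 10_000, 1_000_000]:
    print(n, p_value_two_sided(0.01, n))
```

The effect size never changes, yet p shrinks from roughly 0.9 to essentially zero as n grows, which is exactly why a small p by itself says little about the size (or relevance) of an effect.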
Hypotheses about the regression coefficient b assume that the variables are at least ordinal. In a test of a proposition with a nominal concept (such as gender), e.g., 'For consumers: men buy more beer than women', the appropriate hypothesis would entail an expectation about the means of beer consumption, for instance: 'In this population: men > women'. If the dependent variable is nominal, such as in 'For consumers: more women than men use a shopping list', the relevant parameter in the hypothesis is the proportion of users of a shopping list: 'In this population: %women > %men'. The hypothesis in a longitudinal survey will usually entail an expectation about the regression coefficient b of the regression of (Y2 − Y1) on (X2 − X1). In an event study, the relevant parameter is usually the value of Y in cases with the event as compared with a normal situation (e.g., abnormal returns). The hypothesis in the usual type of experiment with two experimental conditions, each defined by a specific value of the independent concept, entails an expectation about the difference between the means of the two groups: 'In this experiment: group1 < group2'.
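Generating the parameters for such hypotheses requires only descriptive statistics, as this sketch with invented data shows:

```python
# Invented beer consumption (litres per month) for a fully observed population,
# split by the nominal concept gender.
men = [6.0, 4.5, 8.0, 5.5]
women = [3.0, 2.5, 4.0, 3.5]

mean_men = sum(men) / len(men)
mean_women = sum(women) / len(women)
# Hypothesis "In this population: men > women" (about means):
print(mean_men > mean_women)

# Nominal dependent variable: proportion of shopping-list users per group.
uses_list_men = [False, True, False, False]
uses_list_women = [True, True, False, True]
pct_men = sum(uses_list_men) / len(uses_list_men)
pct_women = sum(uses_list_women) / len(uses_list_women)
# Hypothesis "In this population: %women > %men" (about proportions):
print(pct_women > pct_men)
```

In both cases the observed pattern is a single descriptive value (a difference of means or of proportions) that can be compared directly with the expected pattern.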
Further reading
Hak & Dul (2009a) discuss the principles of pattern matching and the concept of expected pattern. An important background article for this chapter is Schwab et al (2011).
Experiment
If subjects are recruited from a population for an experiment, refusal to participate by individual members of the population is not a problem because (as was mentioned above) probability sampling is not considered necessary for generating a pool of potential subjects. However, as soon as persons have agreed to participate in the experiment and have been randomly assigned to the different experimental conditions, data must be collected from each of them. Failure to do so will invalidate the study's results.
Pattern matching
Testing entails comparing an observed pattern (i.e., the pattern of the scores in the data matrix) with the expected pattern (i.e., the hypothesis). Comparing an empirical fact in obtained scores with a hypothesis is a simple and straightforward activity which does not require a null hypothesis significance test. It will be explained below
why the results of null hypothesis significance tests are not needed and are potentially misleading. Examples of hypotheses (expected patterns) in a survey, as discussed above, are:
- "In this population: b > 0" or "In this population: b > n" (in which n is a number set by the researcher as a threshold for managerial relevance);
- "In this population: x2 - x1 > 0" (or "> n"), in which x1 and x2 are subgroups in the population with different (nominal) values of X;
- "In this population: %x2 - %x1 > 0" (or "> n"), in which x1 and x2 are again such subgroups.
The hypothesis in a survey usually specifies a range of values in which the observed value is expected to occur. Before the expected pattern and the observed pattern can be matched, the observed value of the relevant statistic (b, or x2 - x1, or %x2 - %x1) must be generated. Note that only descriptive statistics, not inferential statistics, are needed to obtain these values, and that the value that is generated is a precise point (not a range). If not all members of the population are surveyed, but only the members of a sample, the value of the relevant statistic (b, or x2 - x1, or %x2 - %x1) is generated in the same way. The result is an observed value in the sample. But the hypothesis is the formulation of an expected pattern in the population. Hence, in order to conduct the test, first a population value must be estimated and, next, the resulting population estimate must be compared with the expected pattern. If the sample is not a random (or, more generally formulated, a probability) sample, the population value of the relevant statistic cannot be estimated. Non-probability samples, therefore, are useless in theory-testing research.
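The point that the observed value is generated with descriptive statistics only can be sketched as follows (the data are invented):

```python
# Illustrative only: the least-squares slope b computed as a plain
# descriptive statistic, with no inferential machinery involved.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # scores on the independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # scores on the dependent variable

mx, my = mean(x), mean(y)
# b = sum((xi - mx)(yi - my)) / sum((xi - mx)^2)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
print(b)  # a precise point value (here 1.96), not a range
```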
If the sample is a probability sample, the population value of the relevant statistic can be estimated by means of inferential statistics. The result of the use of inferential statistics is a confidence interval (with a confidence level that must be specified beforehand, e.g., 95%) with the sample value in the middle. This result is a range (or interval) of values with a known likelihood that the actual value of the population statistic is in that range (e.g., the 95% confidence interval). The data are consistent with the hypothesis if the entire 95% confidence interval is within the expected range. They are not consistent with the hypothesis if the entire 95% confidence interval is outside that range. If the observed range (the 95% confidence interval) overlaps with the expected range, the test result is partially consistent with the hypothesis. Note that these testing procedures do not require null hypothesis significance testing and that test results are not expressed in terms of statistical significance. Reasons for not using null hypothesis significance testing (apart from the fact that it is not necessary for obtaining a test result) are discussed below. The hypothesis in the usual type of experiment with two experimental conditions, each defined by a specific value of the independent concept, is: "In this experiment: group1 < group2". The relevant statistic (group2 - group1) can be generated by using descriptive statistics. However, this statistic is subject to sampling variation due to the fact that subjects have been randomly assigned to the experimental groups. Hence, a (95%) confidence interval must be estimated and the resulting range (95% confidence
interval) must be compared with the expected range. The data are consistent with the hypothesis if the entire 95% confidence interval is within the expected range. They are not consistent with the hypothesis if the entire 95% confidence interval is outside that range. If the observed range (the 95% confidence interval) overlaps with the expected range, the test result is partially consistent with the hypothesis.
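The three-way decision rule used in both the survey and the experiment can be summarized in a small helper function (an illustrative sketch; the function name and interval values are invented):

```python
# Illustrative sketch of pattern matching with a confidence interval.
def match_pattern(ci_low, ci_high, threshold=0.0):
    """Compare an observed 95% CI with the expected range (> threshold)."""
    if ci_low > threshold:
        return "consistent"           # whole CI inside the expected range
    if ci_high < threshold:
        return "not consistent"       # whole CI outside the expected range
    return "partially consistent"     # CI overlaps the expected range

print(match_pattern(0.10, 0.50))      # consistent
print(match_pattern(-0.40, -0.05))    # not consistent
print(match_pattern(-0.05, 0.45))     # partially consistent
```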
Why the results of null hypothesis significance tests are potentially misleading
Take the example of a series of tests that was discussed above (Figure 4).
This figure might represent the effect size d (with 95% confidence interval) obtained in seven experiments, or seven population estimates of the regression slope b (with 95% confidence intervals), or seven population estimates of other relevant statistics (with 95% confidence intervals). The vertical line represents the null. Assume that the hypothesis for each test was that the value of the relevant statistic is higher than zero. This implies that the observed value is expected to occur on the right-hand side of the vertical line. Each of the seven test results can easily be obtained by applying the pattern matching procedure described above:
- Test 1: consistent with the hypothesis
- Test 2: mostly not consistent with the hypothesis
- Test 3: consistent with the hypothesis
- Test 4: largely but not fully consistent with the hypothesis
- Test 5: consistent with the hypothesis
- Test 6: largely not consistent with the hypothesis
- Test 7: largely but not fully consistent with the hypothesis
The seven test results together suggest that it is more likely that the proposition is true for the theoretical domain than that it is not. Five of the seven observed values are on the expected side of the null, and the two observed values on the wrong side of the null are both relatively close to it. The overall effect size (if these are experimental results) or the actual strength of the association (if these are population estimates) is likely close to the one found in test 7, whose result is the median of the seven test results.
Now observe the results of null hypothesis significance testing (with the criterion that p ≤ 0.05). Although the values of p cannot be directly observed in the figure, it is clear from the 95% confidence intervals that four of the seven tests (namely tests 2, 4, 6, and 7) would have generated a value of p higher than 0.05. Note that this implies that the positive result of test 7 (which quite likely has an observed value of the relevant statistic that is very close to the actual value in the domain) is considered not significant and that, therefore, the hypothesis is rejected. This clearly is a misleading result. The main reason is that the null hypothesis significance test is (as its name implies) a test of the null, not of the actual hypothesis that is supposed to be tested. The null is only rejected if the observed value of the statistic occurs outside the range of values that the statistic will have with 95% probability if the actual value is zero (the null). However, the fact that the null cannot be rejected with more than 95% confidence (in tests 4 and 7) does not imply that it is unlikely that the actual hypothesis is true. The fact that the larger part of the 95% confidence interval, in both tests 4 and 7, lies on the right side of the null actually means that the population estimate (or the real experimental effect) is much more likely to be in the expected range (> 0) than not. Because null hypothesis significance testing is not necessary and because its outcome is potentially misleading, it is strongly recommended to abstain from it.
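A small numerical illustration (all numbers invented, loosely resembling a result like that of test 7) shows how the two procedures can diverge; the share-of-interval figure at the end is only a crude indication, not a probability:

```python
# Illustrative only: a population estimate whose CI straddles the null.
from statistics import NormalDist

est, se = 0.20, 0.125      # assumed estimate and standard error
z95 = 1.96                 # two-sided 95% critical value

ci = (est - z95 * se, est + z95 * se)   # about (-0.045, 0.445)

# One-sided null hypothesis significance test of "actual value = 0".
p = 1 - NormalDist().cdf(est / se)      # about 0.055

# NHST verdict: not significant, so the hypothesis would be rejected ...
print(p > 0.05)                         # True
# ... although most of the interval lies in the expected range (> 0),
# so pattern matching calls the result largely consistent.
share_in_expected_range = ci[1] / (ci[1] - ci[0])
print(round(share_in_expected_range, 2))   # about 0.91
```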
Further reading
An important background article for this chapter is Schwab et al (2011). Hak & Dul (2009a) discuss the principles of pattern matching. The most recent version (sixth edition; 2010:33) of the Publication manual of the American Psychological Association (APA) notes that "historically, researchers in psychology have relied on null hypothesis statistical significance testing (NHST)" as a starting point for many (but not all) of its analytic approaches. The manual notes that this practice is contested and then states that "complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectations for all APA journals". This advice is accompanied by a note with a reference to, among others, a useful paper by Jones & Tukey (2000). See, for a further discussion of the greater usefulness of reporting (and interpreting) confidence intervals rather than p values: Cumming & Finch (2001) and Finch et al. (2002). It is repeatedly stated in this chapter that scores must be obtained from (almost) all of the members of a population (or of a probability sample), because missing data potentially result in non-response bias. See Rogelberg & Stanton (2007) for a discussion. Because there is no feasible way of estimating bias caused by non-response and because, by implication, there is no way of correcting for it, it is necessary to collect scores for (almost) all cases. As discussed above, this is only possible if (1) data for the population are already available (e.g., in a database) or if (2) data are collected in an intensive way from a small population. If this advice is heeded, then the survey is a census (i.e., a study of all members of a population), not a sample survey. Although it is also possible to collect almost complete data in an intensive way from a small sample, this would introduce another problem: sampling variation and, hence, too little statistical power.
Take, as an example, the result of the second test in this series, which indicates that there is a higher chance that the hypothesis is not correct than that it is correct. In other words, the proposition is not supported in this test. Note that, based on this test alone, no conclusion whatsoever can be drawn about the correctness of the proposition in the domain. In other words, generalization from one test result to the domain is not possible. This simple fact applies to any result, positive or negative. This test result carries meaning only as part of a series of replications. It is only in the context of the whole series of seven tests that the result of test 2 can be seen as the one that is the least supportive of the proposition. It would be ludicrous to conclude from test 2 alone that the proposition is not correct. On the contrary, the series of test results suggests that the proposition is likely correct in at least a part of the domain. This means that the result of test 2 needs to be explained. Such an explanation must be sought first in errors that might have been made in the study: in measurement errors or in errors regarding the selection of cases for the test.
Error
In finding explanations for a test result, one might first assess the likelihood of error regarding the cases that have been observed. Were the observed cases eligible (i.e., members of the theoretical domain)? If a population was selected for a test in a survey, could a coverage error have been made (i.e., could the survey have included cases that are actually not members of the population, or could it have excluded cases that actually are members)? If a sample has been drawn from the population (which was not recommended in this book), was it a correct probability sample and was no error made in the sampling procedures? Have scores been obtained for all members of the population or sample, so that non-response bias can be excluded? In an experimental test, were subjects randomly assigned to the experimental conditions? Were scores obtained from all subjects? After having reviewed each of these potential reasons for (partially) not having observed the correct set of cases, a conclusion can be drawn about the likelihood that this type of error could explain the test result. Now look equally critically at how scores were obtained. If data from a database have been used, were these scores valid and reliable? If data have been collected for the study, have there been deviations from the measurement protocol? Have the measurement procedures (as specified in the measurement protocol) resulted in valid and reliable scores? Could measurement error explain the test result? In an experiment, was the value of the independent concept manipulated in the correct manner, such that the experimental conditions actually and correctly generated or represented the different values of that concept? Has a pre-test (a measurement of the value of the dependent concept in each case or subject before the experimental treatment) been conducted (if appropriate)? Can errors in experimental design explain the test result?
Only if it can convincingly be argued that the test result is not an effect of selection error, sampling error, non-response error, measurement error, or design error can an interpretation of the test result be based on the assumption that the result is correct for the cases that have been selected for the test.
A critical evaluation of potential error must always be conducted, not only when the test result is unexpected and falls outside the range of previous test results. A test result that is expected and that confirms the results of other tests can also be erroneous. The study might even replicate errors that were made in previous tests, and the similar results of this test and the others might then be the effect of similar errors producing the same (but wrong) outcome. In other words, it would be a mistake to assume, without critical evaluation, that other tests (i.e., the tests discussed in the literature review and listed in your reconstruction of the replication history) have been error-free. Not only should the current test be evaluated very critically, but so should each of the other tests in the replication history. For each test it must be assessed to what extent its result might be explained by error.
REFERENCES
American Psychological Association (APA) (2010), Publication manual, Sixth Edition.
Jelke Bethlehem (2009), Applied survey methods, Hoboken (NJ): Wiley.
Boris Blumberg, Donald Cooper & Pamela Schindler (2008), Business research methods (Second European Edition), Maidenhead: McGraw-Hill.
Denny Borsboom, Gideon J. Mellenberg & Jaap van Heerden (2004), The concept of validity, Psychological Review, 111(4):1061-1071.
Geoff Cumming & Sue Finch (2001), A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions, Educational and Psychological Measurement, 61:532-574.
Jan Dul & Tony Hak (2008), Case study methodology in business research, Oxford: Butterworth-Heinemann (also available as an eBook: http://www.dawsonera.com/).
Andy Field & Graham Hole (2003), How to design and report experiments, London: Sage.
Sue Finch, Neil Thomason & Geoff Cumming (2002), Past and future American Psychological Association guidelines for statistical practice, Theory & Psychology, 12:825.
Floyd J. Fowler (2009), Survey research methods (Fourth Edition), Thousand Oaks (CA): Sage.
Robert M. Groves et al. (2004), Survey methodology, Hoboken (NJ): Wiley.
Tony Hak & Jan Dul (2009a), Pattern matching, in A.J. Mills, G. Durepos & E. Wiebe (Eds.), Encyclopedia of Case Study Research (pp. 663-665), Thousand Oaks (CA): Sage.
Tony Hak & Jan Dul (2009b), Replication, in A.J. Mills, G. Durepos & E. Wiebe (Eds.), Encyclopedia of Case Study Research (pp. 804-806), Thousand Oaks (CA): Sage.
Tony Hak, Kees van der Veer & Harrie Jansen (2008), The Three-Step Test-Interview (TSTI): An observation-based method for pretesting self-completion questionnaires, Survey Research Methods, 2:143-150.
Raymond Hubbard & R. Murray Lindsay (2002), How the emphasis on original empirical marketing research impedes knowledge development, Marketing Theory, 2(4):381-402.
Lyle V. Jones & John W. Tukey (2000), A sensible formulation of the significance test, Psychological Methods, 5(4):411-414.
Naresh K. Malhotra & David F. Birks (2003), Marketing research: An applied approach (Second European Edition), Harlow: Prentice-Hall.
Steven G. Rogelberg & Jeffrey M. Stanton (2007), Understanding and dealing with organizational survey nonresponse, Organization Research Methods, 10:195-209.
John R. Rossiter (2002), The C-OAR-SE procedure for scale development in marketing, International Journal of Research in Marketing, 19:305-335.
Andreas Schwab, Eric Abrahamson, William H. Starbuck & Fiona Fidler (2011), Researchers should make thoughtful assessments instead of null-hypothesis significance tests, Organization Science, 22(4):1105-1120.
Gordon B. Willis (2005), Cognitive interviewing: a tool for improving questionnaire design, Thousand Oaks (CA): Sage.
GLOSSARY
Candidate population: A candidate population is a member of a set of eligible and prioritized populations from the theoretical domain from which the researcher will select a population for a test in a survey.
Case: A case is an instance of a focal unit.
Case selection: Case selection is selecting a population of cases from a set of candidate populations, or selecting experimental units (subjects or other units) from a set of candidate units.
Causal relation: A causal relation is a relation between two variable attributes X and Y of a focal unit in which a value of X (or its change) permits, or results in, a value of Y (or its change). See Cause, Dependent concept, Effect, and Independent concept.
Cause: A cause is a variable attribute X of a focal unit of which the value (or its change) permits, or results in, a value (or its change) of another variable attribute Y (which is called the effect). See Causal relation, Dependent concept, Effect, and Independent concept.
Coding: Coding is categorizing data in order to generate scores.
Concept: A concept is a variable aspect of a focal unit as defined in a theory. See Dependent concept and variable, Independent concept and variable, Mediating concept and variable, Moderating concept and variable, and Variable.
Conceptual model: A conceptual model is a visual representation of a proposition in which the concepts are presented as blocks and the relation between them as an arrow. The arrow originates in the independent concept and points to the dependent concept.
Data: Data are the recordings of evidence generated in the process of data collection.
Data collection: Data collection is the process of (a) identifying and selecting one or more objects of measurement, (b) extracting evidence of the value of the relevant variable attributes from these objects, and (c) recording this evidence.
See Object of measurement.
Dependent concept: A dependent concept is a variable attribute Y of a focal unit of which the value (or its change) is the result of, or is permitted by, a value (or its change) of another variable attribute X (which is called the independent concept).
Dependent variable: A dependent variable is a variable Y which, according to a hypothesis, is an effect of an independent variable X.
Domain: A domain is the universe of instances to which theoretical statements apply. See Focal unit and Theoretical domain.
Ecological validity: Ecological validity is the degree of confidence that findings of an experimental study apply in non-experimental (real life) settings.
Effect: An effect is a variable attribute Y of a focal unit of which the value (or its change) is the result of, or is permitted by, a value (or its change) of another variable attribute X (which is called the cause). See Causal relation, Cause, Dependent concept, and Independent concept.
Evidence: Evidence is the information extracted from an object of measurement.
Expected pattern: An expected pattern is a specification of characteristics that the scores in a data matrix should have if the theory is correct for the cases in the matrix. It is a synonym of the term Hypothesis. Testing consists of comparing (matching) an Observed pattern with an Expected pattern. See Hypothesis, Observed pattern, Pattern matching, and Test.
Experiment: An experiment is a study in which the value of an independent concept is manipulated in at least two randomly assigned groups of instances of a focal unit and, next, the value of the dependent concept in each of these instances is observed.
Experimental research: Experimental research (or the experiment) is a research strategy in which the value of an independent concept is manipulated in at least two randomly assigned groups of instances of a focal unit and, next, the value of the dependent concept in each of these instances is observed.
External validity: External validity is the degree of confidence that the findings of a study apply to non-observed cases of the theoretical domain. See Internal validity.
Focal unit of a theory: A focal unit of a theory is the entity of which the range of values of one or more variable attributes is explained by that theory.
Hypothesis: A hypothesis is a specification of characteristics that the scores in a data matrix should have if the theory is correct for the cases in the matrix. It is a synonym of the term Expected pattern. See Expected pattern, Observed pattern, and Pattern matching.
Independent concept: An independent concept is a variable attribute X of a focal unit of which the value (or its change) permits, or results in, a value (or its change) of another variable attribute Y (which is called the dependent concept).
Independent variable: An independent variable is a variable X which, according to a hypothesis, is a cause of a dependent variable Y.
Internal validity: Internal validity is the degree to which the design of a study is believed to exclude alternative interpretations of the study findings. See External validity.
Measurement: Measurement is a process in which a score or scores are generated for analysis. It consists of (a) data collection and (b) coding. Measurement procedures must be valid and the resulting scores must be reliable. See Coding, Data collection, Measurement validity, Reliability, and Score.
Measurement validity: Measurement validity is the extent to which procedures of data collection and of coding can be considered to capture meaningfully the ideas contained in the concept of which the value is measured.
Mediating concept: A mediating concept is a concept that links the independent and the dependent concept in a proposition and which is necessary for the causal relation between the independent and the dependent concept to exist.
Mediating variable: A mediating variable is a variable that mediates the relation between the independent and the dependent variables in a hypothesis.
Moderating concept: A moderating concept is a concept that qualifies the relation between the independent and the dependent concepts in a proposition.
Moderating variable: A moderating variable is a variable that qualifies the relation between the independent and the dependent variables in a hypothesis.
Object of measurement: An object of measurement is an object that must be accessed in order to extract evidence of the value of a variable. An object of measurement is not the same as the focal unit. See Data collection and Measurement.
Observation: Observation is collecting empirical evidence about instances of a focal unit. It is a synonym of Measurement.
See Measurement.
Observed pattern: An observed pattern is a pattern of characteristics of the empirical scores in a data matrix. Testing consists of comparing (matching) an Observed pattern with an Expected pattern. See Expected pattern, Pattern matching, and Test.
Pattern matching: Pattern matching is comparing two or more patterns in order to determine whether they match (i.e., are the same) or do not match (i.e., differ). Pattern matching is a synonym of Test. See Expected pattern and Observed pattern.
Population: A population is a set of instances of a focal unit defined by one characteristic or by a set of characteristics.
Population selection: Population selection is selecting a population from a set of candidate populations for a survey.
Probability sample: A probability sample is a sample that is selected through a procedure of probability sampling. See Probability sampling.
Probability sampling: Probability sampling is a sampling procedure in which each member of the population has a fixed probabilistic chance of being selected. See Random sampling.
Proposition: A proposition is a theoretical statement about the relation between concepts. See Theory.
Random sample: A random sample is a sample that is selected through a procedure of random sampling. See Random sampling.
Random sampling: Random sampling is a sampling procedure in which each member of the population has an equal chance of being selected. See Probability sampling.
Reliability: Reliability is the degree of precision of a score.
Replication: Replication is conducting a test of a proposition in another population of instances of the focal unit, or with another selection of experimental subjects or units, or with another experimental treatment.
Replication strategy: A replication strategy is a plan for a series of replications. See Replication.
Research: Research is testing theoretical statements by collecting and analyzing evidence drawn from observation. See Observation.
Research objective: A research objective is a specification of the aim of a study.
Research strategy: A research strategy is a category of procedures for selecting instances of a focal unit and for data analysis. In this book two research strategies are distinguished: experimental research (the experiment) and survey research (the survey). They are characterized by different data matrices. See Experimental research and Survey research.
Sample: A sample is a set of cases selected from a population.
Sampling: Sampling is the selection of cases from a population.
Sampling frame: A sampling frame is a complete list of the members of a population. A sampling frame is required for probability sampling. See Probability sampling.
Score: A score is a value assigned to a variable by coding data.
Study: A study is a research project in which a research objective is formulated and achieved.
Support for a proposition: A proposition is said to be supported in a test if the hypothesis is confirmed.
Survey: A survey is a study in which values of concepts are observed in all members (or in a probability sample of members) of a population of instances of the focal unit. See Population and Sampling.
Survey research: Survey research (or the survey) is a research strategy in which values of concepts are observed in all members (or in a probability sample of members) of a population of instances of the focal unit. See Population and Sampling.
Test: A test of a proposition is determining whether a hypothesis that is deduced from the proposition is consistent with the pattern of scores obtained in a survey or experiment.
Theoretical domain: A theoretical domain is the universe of instances of a focal unit of a theory.
Theory: A theory is a set of propositions regarding the relations between the variable attributes (concepts) of a focal unit in a theoretical domain.
Theory-testing: Theory-testing is selecting one or more propositions for a test and conducting the test.
Theory-testing research: Theory-testing research is research with the objective to test propositions.
Variable: A variable is a measurable indicator of a concept in research. See Concept and Hypothesis.