Tony Hak
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 THEORY AND PROPOSITIONS
CHAPTER 3 REPLICATION
CHAPTER 4 CHOOSING A RESEARCH STRATEGY
CHAPTER 5 SELECTING CASES FOR THE TEST
CHAPTER 6 MEASUREMENT
CHAPTER 7 HYPOTHESIS
THE RESEARCH PROPOSAL
CHAPTER 8 CONDUCTING THE TEST
CHAPTER 9 INTERPRETING THE TEST RESULT
THE RESEARCH REPORT
REFERENCES
GLOSSARY
CHAPTER 1 INTRODUCTION
The aim of a theory-testing research project is to contribute to our knowledge about the correctness of a theory by collecting and analysing empirical data. The underlying logic of theory-testing is the following. A theoretical claim (or proposition) applies to a universe (or domain) that usually is very large or even infinite, e.g., all consumers everywhere at all times, all firms everywhere at all times, etc. It is not possible to prove with absolute certainty that the proposition is true for this whole domain, because it is not possible (or at least not practical) to observe every single case in the domain. The best we can do is to confirm that the theory is true in many different subsets of this domain (which we call populations). If the proposition is true in the domain, then it must be true in each population of the domain. If the proposition is true in a population, then it must be possible to observe this. In other words, there must be empirical evidence for the correctness of the proposition in the population. We can specify what we expect to observe in the population if the proposition is true. This specification of our expectation is called a hypothesis (or expected pattern). Testing a proposition in a population consists of comparing what we expect to see if the proposition is true (the hypothesis or expected pattern) with what is actually observed in the population (the observed pattern). A test result is a conclusion about the extent to which the population behaves as expected. A test in a population is always a partial test of the theory because its result does not apply to the parts of the theoretical domain that have not been observed. Hence, many different tests in many different parts of the domain (replications) are needed before a conclusion can be drawn about the correctness of the theory in the domain. This book discusses the methodology for designing and conducting one such test (of the many tests that are required).
It consists of the following seven steps.

Step 1. Specify the proposition and its domain. Concepts must be defined and the exact relation between them must be specified. The type of the proposed relation determines the appropriate research strategy and how cases should be selected for the test. Also, the type of entity (focal unit) to which the proposed relation refers, as well as the boundaries of the theoretical domain (i.e., the universe in which the theory is assumed to apply), must be specified, because these determine what is a case. This step is discussed in Chapter 2.

Step 2. Select a research strategy. It must be decided whether an experiment or a survey will be designed for the test. This is discussed in Chapter 4.

Step 3. Select cases for the test. A subset of the domain must be selected for the test (Chapter 5).

Step 4. Measure the value of the variables in each case. This step results in a data matrix (in which rows are cases and columns are variables) of which, in principle, all cells have been populated (Chapter 6).

Step 5. Specify the hypothesis (expected pattern) for the selected cases. The hypothesis is a specification of the pattern that is expected to be observed in the data if the proposition is true (Chapter 7).

Step 6. Compare the hypothesis with the pattern that is actually observed in the data and formulate the test result. This is the test proper, which consists of ascertaining whether the hypothesis is (or is not) a correct description of what is observed in the data (Chapter 8).

Step 7. Formulate the implications of the test result for the theory. These implications depend on the number of preceding tests, their results, and the characteristics of the cases in which these tests have been conducted. The research community will be able to draw a conclusion about the correctness of the theory only after a sufficient number of replications in sufficiently diverse cases (Chapter 9).
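Steps 4 to 6 can be sketched in a few lines of code. The following is a minimal illustration for the commitment/turnover proposition discussed later in this chapter, in which the data matrix, the expected pattern (a negative correlation), and the comparison with the observed pattern are all represented explicitly. All numbers are invented for illustration only.

```python
import numpy as np

# Step 4: the data matrix -- rows are cases (employees), columns are variables.
# Columns: affective commitment to change, turnover intention (fabricated 1-7 scores).
data = np.array([
    [6.0, 2.0],
    [5.5, 3.0],
    [2.0, 6.0],
    [3.0, 5.5],
    [4.0, 4.0],
    [6.5, 1.5],
])
commitment = data[:, 0]
turnover = data[:, 1]

# Step 5: the hypothesis (expected pattern): a negative correlation between
# commitment to change and turnover intention in the selected cases.

# Step 6: compare the expected pattern with the observed pattern.
observed_r = np.corrcoef(commitment, turnover)[0, 1]
hypothesis_confirmed = observed_r < 0
print(f"observed correlation: {observed_r:.2f}, confirmed: {hypothesis_confirmed}")
```

This is only a sketch of the logic; a real test would involve many more cases and an explicit criterion for when the observed pattern counts as matching the expected one.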
A note on terminology
Some of the terminology and the methodological principles in this book differ from terms and principles in other textbooks. Examples of terms with a (slightly or entirely) different definition than in other sources are: proposition, hypothesis, theoretical domain, population, sample, survey, and test. Each of these terms, and many others, is defined in this book and also listed in the Glossary.
2. Affective commitment to change is negatively related to turnover intentions. This proposition states that we can better predict the turnover intentions of employees when we know their affective commitment to change than when we do not. Or, in other words: for the focal unit employee, if the value of attribute X (an employee's level of affective commitment to change) is high, the value of attribute Y (the desire to quit the company) will be lower than it is at lower values of affective commitment to change.
It follows that a specific theory is defined by four aspects: its focal unit, its domain, its concepts (which represent variable attributes of the focal unit), and the relations between concepts as specified in its propositions. Each of these aspects will be discussed here in more detail.

The focal unit, i.e. the unit or entity about which the theory formulates statements, can be many different kinds of things, such as activities, processes, events, persons, groups, or organizations. If, for example, a theory is formulated about critical success factors of innovation projects, then the innovation project is the focal unit. Within a theory, the focal unit cannot vary. A theory predicts values of attributes of that single focal unit, not of other units. A theory about critical success factors of innovation projects, for instance, is by definition a theory about characteristics of innovation projects, not of other things or entities such as products, companies, teams, etc. A clear specification of the focal unit is very important in the design of a theory-testing study because it defines the type of entity about which data must be collected. For a test of the claim "A tangible resource-seeking alliance is more likely to deploy high levels of output and process control", data must be collected about alliances, not about other entities. For a test of the proposition "Affective commitment to change is negatively related to turnover intentions", data must be collected about employees, not about entrepreneurs, students or companies.

The domain of a theory is the universe of the instances of the focal unit (cases) for which the propositions of the theory are assumed to be true. The boundaries of this domain should be specified clearly.
For instance, if a researcher develops a theory of critical success factors of innovation projects, it must be clearly stated whether it is claimed that this theory applies to all innovation projects, or only to innovation projects of specific types, or only in specific economic sectors, or only in specific regions or countries, or only in specific time periods, etc. Hence the domain might be very generic (e.g. all innovation projects in all economic sectors in the whole world) or quite specific (e.g. limited to innovation projects in a specific economic sector, in a specific geographical area, or of a specific type).
Examples

1. A tangible resource-seeking alliance is more likely to deploy high levels of output and process control. This proposition is a claim about alliances in general. It is not a claim about a specific type of alliance such as alliances in a specific economic sector (e.g., airline alliances) or in specific countries (e.g., US alliances). If this proposition is true, then it is true for alliances in all economic sectors and in all countries and at all times. If the claim is formulated from the outset as only applicable to a specific type of alliance, then this should have been specified in the wording of the proposition.

2. Affective commitment to change is negatively related to turnover intentions. This proposition is a claim about employees in general. It is not a claim about a specific type of employee (e.g., manual laborers or white-collar workers), or about employees in a specific economic sector (e.g., dockworkers or airline pilots) or in specific countries (e.g., the US workforce). If this proposition is true, then it is true for employees in all types of jobs, in all economic sectors, in all countries and at all times. If the claim is formulated from the outset as only applicable to a specific type of employee, then this should have been specified in the wording of the proposition.
The concepts of the theory designate the variable attributes of the focal unit. An attribute described by a concept can be absent or present, smaller or larger, etc. For instance, if the research topic is critical success factors of innovation projects, the factors that presumably contribute to success are variable attributes of these projects. In each instance of the focal unit, these factors can be present or absent, or present to a certain degree. Likewise, success is a variable attribute of the focal unit project, which can be present or absent, or present to a certain degree, in each instance of the focal unit (i.e. in each specific innovation project). The attributes that are designated by the concepts of the theory must be defined so as to allow for the measurement of their value in instances of the focal unit (cases). For instance, in a theory of critical success factors of innovation projects, the concept "project outcome" needs to be defined such that it is clear what counts as a successful outcome and what does not. The factors must be defined as well, so that we can measure the extent to which each factor is present. When the value of a concept is measured in cases, it is called a variable.

The propositions of a theory formulate relations between the concepts (i.e., between variable attributes) of the focal unit. Typically, but not always, this relation is a causal one. A causal relation is a relation between two attributes X and Y of a focal unit in which a value of X (or a change in it) results in a value of Y (or in a change in it). A proposition can be visualized by means of a conceptual model. Usually such a conceptual model has inputs (independent concepts) on the left-hand side and outputs (dependent concepts) on the right-hand side, linked to each other by arrows that point to the dependent concepts. The arrows indicate the direction of the causal relation between the concepts.
The nature of these arrows needs to be defined more precisely in the wording of the propositions of the theory. The simplest building block of a theory is a single proposition that formulates the relation between two concepts. A proposition can be visualized as follows:

Regarding focal unit FU: Determinant X → Outcome Y
This simple model visualizes the proposition that, in all cases of the focal unit FU, concept X (the determinant or independent concept) has an effect on concept Y (the outcome or dependent concept). The unidirectional arrow represents the assumption that a cause precedes an effect. Because effects are assumed to depend on causes, the term dependent concept is used for the outcome Y. Causes X are assumed to be independent from their effects, hence the term independent concept.
Note that this simple model specifies neither the contents of the proposition nor the possible values of the concepts. If it is presented in this way, it is normally assumed that X and Y are interval or ratio variables and that the relation between them is causal, probabilistic and positive: higher X will on average result in higher Y. Because other types of concepts and other types of relation are possible, it is necessary to add more specifications to the model. Determinant X (or Y) should be specified as "Extent of X" (or Y) or "Presence of X" (or Y) or by any other indication of the (range of) values that are covered by the proposition. Also, a sign (+ for positive; - for negative) must be added to the arrow in the model.
Note also that the focal unit (e.g., innovation project) is not depicted in the model itself because the model represents only the variable attributes (concepts) of which the values are linked in the theory, not the invariable entity about which the theory is formulated. For this reason, the model is prefaced by a statement about the focal unit. The domain is implied.
More complicated conceptual models might depict relations between a larger number of independent concepts X1, X2, X3, etc., and dependent concepts Y1, Y2, Y3, etc. For instance, in a conceptual model of the critical success factors of innovation projects, the model would depict a number of different factors (X1, X2, X3, etc.) on the left-hand side, outcome (as defined precisely) on the right-hand side, and an arrow originating from each factor pointing to outcome. Other models might be used to depict more complex relations, such as those with moderating or mediating concepts.
Note that the word theory is used loosely in the literature and often refers to sets of statements that are not theory as defined here. For instance, "theories" such as the resource-based view or transaction cost theory are perspectives, not sets of precise propositions with defined concepts.
Here follow a number of propositions that have been formulated and tested by Bachelor students in the Research Training course at the Rotterdam School of Management. For each of these propositions, the focal unit, domain, concepts, relations, and conceptual model are specified in the following examples.
Example 1.
Proposition: A tangible resource-seeking alliance is more likely to deploy high levels of output and process control
Focal unit: Alliance
Domain: All alliances in the world, in all economic sectors, in all countries, at all times
Independent concept: Type of resource that is sought in the alliance (tangible versus intangible)
Dependent concept: Extent to which output and process control is used
Relation: Probably causal, probabilistic, positive
Conceptual model: Regarding focal unit alliance: Type of resource that is sought → Extent of output and process control (+)
Example 2.
Proposition: Affective commitment to change is negatively related to turnover intentions
Focal unit: Employee
Domain: All employees in the world, in all types of job, in all countries, at all times
Independent concept: Extent of affective commitment to change (from not at all to very much)
Dependent concept: Strength of the wish to quit (from not at all to very much)
Relation: Probably causal, probabilistic, negative
Conceptual model: Regarding focal unit employee: Commitment to change → Wish to quit (-)
Example 3.
Proposition: Higher crime rates have a negative effect on house prices
Focal unit: House
Domain: All houses in the world, in all countries, at all times, of all types
Independent concept: Crime rate in the neighbourhood of the house
Dependent concept: Price of the house (in Dollars or Euros or other currency)
Relation: Causal, probabilistic, negative
Conceptual model: Regarding focal unit house: Crime rate → House price (-)
Example 4.
Proposition: The way entrepreneurs allocate their time is influenced by their tendency for mental accounting
Focal unit: Entrepreneur
Domain: All entrepreneurs in the world, in all countries, at all times, in all business types
Independent concept: The extent to which a person evaluates costs and benefits of activities
Dependent concept: The time allocated to work-related activities (versus other activities such as leisure and family-related) under a time constraint
Relation: Causal, probabilistic
Conceptual model: Regarding focal unit entrepreneur: Extent of mental accounting → Time allocated to work
Example 5.
Proposition: Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular
Focal unit: Consumer
Domain: All consumers in the world, in all countries, in all types of consumption
Independent concept: The degree of loyalty to the brand
Dependent concept: Level of intention to purchase a (distant) brand extension
Relation: Causal, probabilistic
Conceptual model: Regarding focal unit consumer: Loyalty to the brand → Purchase intention
Managerial relevance
A proposition predicts or explains probabilities of values of attributes of a focal unit. Why are such predictions made, and why do we want to know whether a proposition is true? The obvious answer is that there are many situations in which it matters what the value of an attribute is. For instance, referring to some of the examples above, we would normally prefer lower turnover intentions in employees rather than higher ones, higher house prices rather than lower ones, more efficient allocations of time rather than less efficient ones, and higher purchase intentions rather than lower ones. Also, if a researcher develops a theory about a determinant of the extent of output and process control in an alliance, it is to be expected that this extent matters to people involved in alliances.
If the value of an attribute matters in practice, then it is likely that one would like to be able to manipulate that value. That is the reason that many theories are formulated as causal ones. If the value of one attribute is causally related to the value of another attribute, then it becomes possible (at least in theory) to achieve a higher probability of a desired value of one attribute by manipulating the value of the other attribute. In that sense, causal theories have higher managerial relevance than non-causal theories.
Examples

1. Affective commitment to change is negatively related to turnover intentions. If this proposition is true, then it makes managerial sense (to attempt) to increase the average affective commitment to change in a workforce because this might (more likely than not) result in a lower turnover in the company (if a lower turnover is desired).

2. Higher crime rates have a negative effect on house prices. If this proposition is true, then it makes real estate sense (to attempt) to decrease the crime rate in a neighborhood because this might (more likely than not) result in higher house prices in the neighborhood (if a higher house price is desired).

3. Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular. If this proposition is true, then it makes marketing sense to be careful with brand extensions as long as the average brand loyalty is relatively low (assuming that brands attempt to increase loyalty anyway, irrespective of whether they contemplate brand extensions).
Note that almost all propositions in business research are probabilistic in kind, which means that they do not predict the value of Y in a single case (even if the value of X in that case is known), but only the average value of Y in a set of cases with that value of X. In practice this means that the managerial relevance of propositions is normally much higher for managers with a portfolio of cases than for managers who manage only one or two cases.
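The difference between predicting an average and predicting a single case can be made concrete with a small simulation. The slope, noise level, and scales below are arbitrary illustrative choices, not estimates from any real study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated probabilistic negative relation: Y = 10 - 0.5*X + noise.
# All parameters are invented for illustration only.
n = 10_000
x = rng.uniform(1, 7, size=n)            # e.g. commitment scores
y = 10 - 0.5 * x + rng.normal(0, 2, n)   # e.g. turnover intention

high_x = x > 6
low_x = x < 2

# In the aggregate, the prediction holds: higher X -> lower average Y ...
print(y[high_x].mean(), y[low_x].mean())

# ... but for a single case it may fail: a noticeable share of high-X cases
# still have a Y above the low-X average.
share_exceptions = (y[high_x] > y[low_x].mean()).mean()
print(f"share of high-X cases above the low-X average: {share_exceptions:.0%}")
```

The group averages behave exactly as the proposition predicts, while an individual high-commitment employee may nevertheless have a strong desire to quit, which is the situation described in the examples below.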
Examples

1. Affective commitment to change is negatively related to turnover intentions. This proposition, if true, is relevant if a manager wants to decrease the average level of turnover in the workforce. It is much less relevant if that manager wants to keep an individual employee with a high value for the company. The latter could be a member of a minority of employees who have a high level of commitment to change and also a high desire to quit.

2. Higher crime rates have a negative effect on house prices. This proposition, if true, is relevant if a local government or a real estate developer wants to raise the average house price in a neighborhood. It is much less relevant for individual house owners who want to raise the price of their own homes. Their home could be one of a minority of houses that have a low price irrespective of the level of crime in the neighborhood.
Note that the managerial relevance of a proposition is dependent on the strength or effect size of the relation between the two attributes. If a large increase of affective commitment to change results in only, say, one percentage point decrease in turnover (on average), then the managerial relevance of the proposition is doubtful (even if the proposition is true). Similarly, if a huge decrease in crime in a neighbourhood results in only, say, one percentage point increase in house prices (on average), then it is doubtful whether the proposition is relevant (even if it is true) for a local government or a real estate developer. (Obviously, there might be other good reasons for trying to bring down the crime rate in a neighbourhood.)
Note also that this (more or less subjective) estimation of the practical relevance of a theoretical statement also depends on the costs involved in the manipulation of the independent variable. If affective commitment to change could be influenced by a cheap and simple method, such as sending an email message to all employees in which they are praised for their efforts, then it is relevant to know that the resulting higher commitment to change has a negative influence (though perhaps a small one) on the desire to quit. However, if even a small decrease in crime rate requires huge investments in surveillance and other measures, then the effect of such a decrease on house prices should be considerable to make such an investment worthwhile. (Again, obviously, there might be other good reasons for that investment.)

If a proposition is considered potentially relevant, and if it is determined how strong the causal effect should be in order to achieve the desired level of relevance, then it becomes useful to know whether the proposition is true. For managers it is usually (only) useful to know whether a theoretical claim is true for the cases which they manage. In contrast, theoreticians and academic researchers are interested in knowing whether a theoretical claim is true in the theoretical domain. How could they know? An obvious difficulty is that the theoretical domain implied by a proposition is usually infinite and continuously changing. For instance, the theoretical domain of the claim "Affective commitment to change is negatively related to turnover intentions" is all employees. Because new employees will be hired all the time and in every place, there is no way to keep track of each of them. In practical terms, it is not possible to keep a complete list of all cases of the focal unit (here: of all employees) in the domain. This means that we cannot observe each of them in order to ascertain that the proposition is true in each of them.
Fortunately, in practice we do not require absolute certainty of the correctness of a proposition for all cases in the domain but are satisfied with sufficient confidence that this is the case. This confidence is built by testing the proposition in a series of tests in subsets of the domain. The propositions are tested many times, each time in a set of cases from another part of the domain, and potentially with each time a different result. A number of consistently confirmatory test results in similar subsets is required before we can be confident that the proposition is true for a specified part (or for the whole) of the domain. Different tests in different parts of the domain are called replications. Usually a worldwide community of researchers is involved in conducting these replications, collectively building confidence in the correctness of a proposition or, if it appears that the proposition cannot be confirmed in some parts of the domain, collectively reformulating the proposition (or collectively concluding that the proposition was wrong).
Note that replication is defined here as conducting another test of the same proposition, i.e. usually in another set of cases, and usually using other methods. This concept of replication does not imply repeating or duplicating a previous test (with the same methods in the same cases or with the same methods in other cases) in order to confirm a finding. In order to avoid confusion, such studies might better be called duplications.
Figure 1. Domain, cases and populations (from Dul & Hak, 2008: 46)
As stated in the Introduction, the underlying logic of theory-testing is the following. A theoretical claim (or proposition) applies to a universe (or domain) that usually is very large or even infinite, e.g., all consumers everywhere at all times, all firms everywhere at all times, etc. It is not possible to prove with absolute certainty that the proposition is true for this whole domain, because it is not possible (or at least not practical) to observe every single case in the domain. The best we can do is to confirm that the theory is true in many different subsets of this domain (which we call populations). If the proposition is true in the domain, then it must be true in each population of the domain. Theory-testing, thus, entails selecting a population from the domain for the test; formulating a prediction (hypothesis or expected pattern) for that population which is derived from the proposition (if it is true for that population); measuring the concepts in the cases of the population; and then seeing whether the prediction is true or not. The latter involves a comparison between the expected pattern and the actual pattern in the data (observed pattern).
A rather common erroneous belief about a population that is selected for a test is that its size matters for the quality of the test. More specifically, it is widely believed that the larger the population, the better for the test. In the example of a proposition about alliances, testing it in the population of all airline alliances would, on this view, be much better than testing it in the population of (only) large US airline alliances. This myth is (as are so many myths) based on facts, namely (in this case) the fact that a larger number of cases gives larger statistical power. Another underlying reason for this myth is probably the idea that a larger population is more representative of the domain than a smaller population and that, therefore, a result in that population is more informative about the domain. But both reasons are faulty. First, statistical power is not a relevant concept here because it applies solely to the size of a probability sample when an inference is made from the sample to the population from which it is drawn. Selecting a population from a domain is very different from probability sampling and, therefore, principles from inferential statistics do not apply. Second, a test result in a less specified population is actually less informative for the theory than a result in a more specified population. Take a fictional example in which a proposition is tested in the population of airline alliances and also in the smaller, more specific population of US airline alliances. Assume that the test result in the larger population is a weak confirmation and that the test result in the smaller US population is negative. Which result is more informative about the theory?
The result in the more specified (and hence smaller) population not only implies that the proposition is not correct for that population (i.e., for US airline alliances), but also indicates that the correctness of the theory depends on a (yet unknown) factor related to something that makes the population of US airline alliances different from other populations of airline alliances. This result will not only stimulate the search for this unknown factor that influences the proposed relationship (a search that will increase our understanding of the workings of the relations explained by the theory), but will also result in a smaller and better defined theoretical domain (i.e., a better specification of the universe of cases in which the theory is assumed to be true). These benefits for the theory can only be achieved if the population that is selected for the test has clear characteristics (such as being US in the example). For this reason, a test must not be conducted in an arbitrary group of cases from the domain (such as a sample of cases that happen to be included in a database). In the example, it is the fact that the population is defined by being US that makes it possible to specify the search for an explanation of the test result as a search for a factor that is in some way related to being US. If the test result had been found in an arbitrary set of cases, it would not be possible to specify the implications of the test result in this way. Also note that comparing different test results from different well-defined populations, such as the population of US airline alliances, the population of South Asian airline alliances, the population of European airline alliances, etcetera, is not the same as conducting a regression of the dependent variable Y on the region of the alliance (US, South Asian, European, etc.).
Testing the proposition that independent concept X has an influence on dependent concept Y in different populations (defined by region) is not the same as testing that region (also) has an influence on Y. This implies that there are no statistical procedures available for interpreting different results from different populations (or replications).
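This distinction can be sketched with simulated data. In the fictional setup below, the X → Y relation is present in one population and absent in another; the population labels, slopes, and sample sizes are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fictional illustration: the X -> Y relation holds in the 'European'
# population but not in the 'US' population. All numbers are invented.
def simulate(slope, n=200):
    x = rng.normal(0, 1, n)
    y = slope * x + rng.normal(0, 1, n)
    return x, y

x_eu, y_eu = simulate(slope=0.8)   # relation present in 'European' cases
x_us, y_us = simulate(slope=0.0)   # relation absent in 'US' cases

# Testing the proposition per population: estimate the X -> Y slope in each.
slope_eu = np.polyfit(x_eu, y_eu, 1)[0]
slope_us = np.polyfit(x_us, y_us, 1)[0]
print(f"slope in EU population: {slope_eu:.2f}, in US population: {slope_us:.2f}")

# Regressing Y on region instead compares mean Y across regions, which says
# nothing about whether X influences Y within either population.
mean_diff = y_eu.mean() - y_us.mean()
print(f"difference in mean Y between regions: {mean_diff:.2f}")
```

Comparing the two per-population slope estimates is what replication across well-defined populations amounts to; the region variable itself has essentially no effect on the mean of Y in this simulation, so a regression of Y on region would miss the difference entirely.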
CHAPTER 3 REPLICATION
How must different results from different tests (replications) be interpreted? Figure 2 represents the standardized effect size d (with a 95% confidence interval) obtained in each of seven independent experimental tests. The first and the third test show a clear positive effect, but there are also tests with negative effects (experiments 2 and 6) and much smaller positive effects (experiments 4, 5 and 7). A first conclusion from this example is that no conclusions can be drawn from a single test of a theory, even if the result is highly significant (such as in the first experiment). This shows that our confidence in the correctness of a theory can only be built in a series of replications.
Usually, a proposition states (only) that there is an effect of an independent concept on a dependent concept, without specifying how large that effect is or should be in order to be considered relevant. The experimental results in this example indicate that there appears to be an effect, but that it is small. Although the results of experiments 4, 5 and 7 seem to come close to the real effect size, such a conclusion cannot be drawn from these tests alone but only from the series of tests. Moreover, a conclusion about what these seven test results mean must also depend on the characteristics of these tests, such as the populations from which subjects were recruited for the test and the methods that were used. With regard to these populations, consider the following example:

Test 1: effect size = 1.5, N = 16, second-year IBA students, RSM
Test 2: effect size = -0.2, N = 29, cab drivers, Istanbul
Test 3: effect size = 0.7, N = 36, high school students, Rotterdam
Test 4: effect size = 0.18, N = 24, dockworkers, Rotterdam
Test 5: effect size = 0.35, N = 40, first-year Psychology students, EUR
Test 6: effect size = -0.1, N = 20, cab drivers, Mexico City
Test 7: effect size = 0.3, N = 8, RSM professors
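For a figure like Figure 2, the 95% confidence interval around each standardized effect size can be reconstructed from d and N. The following is a minimal sketch, assuming each experiment compared two equal-sized groups and using the common large-sample approximation for the standard error of Cohen's d (the effect sizes and sample sizes are the fictional ones listed above):

```python
import math

# Effect sizes (Cohen's d) and total sample sizes of the seven fictional tests.
tests = [(1.5, 16), (-0.2, 29), (0.7, 36), (0.18, 24),
         (0.35, 40), (-0.1, 20), (0.3, 8)]

results = []
for i, (d, n) in enumerate(tests, start=1):
    # Large-sample SE of d, assuming two equal groups of n/2 each:
    # SE = sqrt((n1+n2)/(n1*n2) + d^2/(2*(n1+n2))) with n1 = n2 = n/2.
    se = math.sqrt(4 / n + d ** 2 / (2 * n))
    lo, hi = d - 1.96 * se, d + 1.96 * se
    results.append((lo, hi))
    verdict = "excludes 0" if lo > 0 or hi < 0 else "includes 0"
    print(f"Test {i}: d = {d:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}] ({verdict})")
```

Under these assumptions only the intervals of tests 1 and 3 exclude zero, which is consistent with the description of the first and third tests as the ones showing a clear positive effect.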
Assume that all seven tests were experiments in which a proposition regarding the effectiveness of a certain treatment in changing the attitudes of workers was tested. Also assume that the seven tests are represented here in chronological order. The first test result (with second-year IBA students at RSM) suggests that the (apparently newly formulated) proposition is true. This generates some confidence that the theory is correct, i.e. that the treatment is effective. The following test (with cab drivers in Istanbul), however, shows that this result cannot be replicated, and doubt about the effectiveness of the treatment (and hence about the correctness of the theory) sets in. The third test result (with high school students) suggests again that the theory might be correct, although it also indicates that the effect size might be much smaller than first thought. The insignificant result of the next, fourth test (with dockworkers) again raises doubt about the correctness of the theory. At this stage, after four tests, the research community will have developed a certain level of confidence that the theory is correct (because of the highly significant positive test results in two experiments, 1 and 3) but the other two test results will temper that confidence. Because the focal unit of the proposition (in this fictional example) is workers, one way of solving the problem of contradictory results is to discard results from tests with students. Then only the results of experiments 2 and 4 matter, and the conclusion after these two tests is that the treatment does not seem to work in the domain of the theory. Note that the conclusion would be very different if experiments 1 and 3 had been the ones with workers rather than students. Assume that researchers continue conducting experiments with students because this is what they happen to do.
The result of experiment 5 (with first year Psychology students at Erasmus University) will be seen as a third confirmation of the correctness of the theory. Researchers who are more serious about the contents of the theory (and hence about the limitation of the theoretical domain to workers) could conduct another experiment with workers (experiment 6), in which they will again find confirmation of their doubts about the correctness of the theory. The result of an experiment with academic workers (test 7 at the bottom of the figure) seems to fit better into the series of results with students than into that with non-academic workers. This (fictional) example demonstrates how confidence in the correctness of a theory develops chronologically. This development is not linear. It goes up and down, and partly consists of substantive specifications of the theory such as, in this example, a more precise specification of the focal unit (worker, not student) and an emerging insight that the theory might apply to only certain types of workers (i.e. non-academic workers).
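To make this chronological development concrete, here is a minimal sketch in Python (the effect sizes are read off the figure above; the code itself is only an added illustration, not part of the original example) that computes the running mean effect size after each successive test:

```python
# Effect sizes of the seven (fictional) experimental tests, in chronological order.
effects = [1.5, -0.2, 0.7, 0.18, 0.35, -0.1, 0.3]

# Running mean effect size after each successive test: a crude proxy for how
# confidence in the theory develops as results accumulate.
running_means = []
total = 0.0
for i, effect in enumerate(effects, start=1):
    total += effect
    running_means.append(round(total / i, 2))

print(running_means)  # starts at 1.5 and ends near 0.39; the path is not linear
```

The up-and-down path of the running mean mirrors the rise and fall of confidence described above, although, as the text argues, any single pooled number hides the differences between the populations.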
Note that this example is not only fictional in the sense that it is imagined (i.e., it is not a real life case of seven actual experimental tests of an actual theory) but it is also fictional in the sense that it is not really realistic. In real academic life the story would have developed in another way, mainly because the result of the first test would have greatly decreased the likelihood that there would ever be a second test. In actual practice, the first test result would have been seen as final because it not only confirmed the correctness of the proposition but also did so with a large effect size and with high significance. Other researchers would not have found it problematic that this was a test with students rather than workers, because that is what they routinely do. This test might have been published in a top journal and it is quite possible that from then on the theory was taught in management courses as the proven Experimenter One's Law. No researcher would replicate the test,
because every colleague would tell her that journal editors would consider her result (which would be expected to be positive again) as not new, not original, and not worth publishing. Now, imagine how the second experimenter (who did the experiment against this collegial advice) will have looked at her test result. Because she could not reject the null hypothesis, she cannot report a positive or negative test result, but only a failure to replicate the first experiment. She will probably not write up her findings and, even if she does, she will not be able to get them published because they will be found lacking in novelty, in theoretical relevance, and in managerial relevance. The research community will never know that there ever was such a second test. In other words, both the cult of the isolated study (Hubbard & Lindsay, 2002) and the cult of positive results effectively prevent replication histories from being published and becoming known.
Because the knowledge base in business and management studies mainly consists of propositions that have been tested only once and that have not been put to replication tests, a rather effective and appropriate way to contribute to a theory is by replicating published one-shot studies in other cases from the domain. The common requirement of academic journals (and also often in the evaluation of student work) that studies must be original (meaning that they should formulate, test and confirm propositions that have not been stated before) is a huge obstacle to scientific progress because it hinders the much-needed increase of the number of replication studies. With every original study a new candidate proposition is formulated whose correctness for the domain is not certain, adding to the reservoir of propositions that are waiting to be replicated.
Further reading
In the literature the term replication is used in different ways. Often it means repeating a test in the same set of cases, e.g., by drawing another (random) sample from the same population or by (randomly) assigning the same group of subjects over experimental and control groups. The aim of such a replication is usually to evaluate the quality of a study by investigating the reproducibility of its result in the same set of cases and/or to assess the (normal) variation of study outcomes. The concept of replication used in this course book is different. It is taken from the book Case Study Methodology in Business Research by Dul and Hak (2008). Hak & Dul (2009b) also discuss the principles of replication. The notion that the correctness of a proposition can be evaluated only after multiple tests and that the construction of a replication history is central to the development of a theory is not common in academic publishing in business and management research. Top journals almost exclusively publish original work, i.e., research in which new theoretical claims are made. Hubbard & Lindsay (2002) call this the cult of the isolated study. They note that most, if not all, published empirical research consists of novel works looking for significant differences, rather than significant sameness, in unrelated data sets, and argue that the emphasis on original research impedes knowledge development. As a result, the literature is made up largely of fragmented, one-off results, which according to Hubbard and Lindsay are of little or no use because they are not corroborated by other studies. Although Hubbard & Lindsay's (2002) paper focuses on marketing research, its argument applies to all fields in business and management research. Note that there is a growing acknowledgement of the value of replication, although this has not yet been translated into a reevaluation of journal policies and criteria for career decisions.
This growing acknowledgement has led to an increasing number of publications advocating
meta-analysis. This certainly is an improvement as a corrective of the cult of the isolated study. However, a serious problem with most approaches to meta-analysis is that it is usually assumed that different test results apply to the same pool of cases. In our above example of seven experimental tests, meta-analysis would ignore the fact that the experimenters have recruited subjects from populations that differ considerably from each other. Although it is tempting to pool the results and compute a kind of average effect over the seven experiments (resulting in an overall effect size of about 0.4), it is recommended here to stick to the type of qualitative assessment of which this chapter gives an example. A good text on how to conduct a good literature review (which is required for reconstructing a replication history of a proposition) is Chapter 3, Literature review, of the book Business Research Methods by Boris Blumberg, Donald Cooper and Pamela Schindler (McGraw-Hill, Second European Edition, 2008). Blumberg et al. list the following aims (among others) of the literature review (2008:107): to show which theories have been applied to the research topic; to show which research designs and methods have been chosen; to synthesize and gain a new perspective on the topic; and to show what needs to be done in the light of the existing knowledge. With respect to the last-mentioned aim, the authors compare a theory to a cathedral and advise the researcher not to aim at adding an entirely new chapel to the existing building: Each study and article is just another brick added to the construction of the cathedral of knowledge. Some studies just reconfirm previous knowledge, often in slightly different settings. […] The first function of a literature review is to embed the current study (the new brick) in the existing structure of knowledge (the cathedral). […]
The single brick is of limited beauty, but being part of the cathedral contributes significantly to its overall beauty (2008:107). A particularly good element of the Blumberg et al. chapter is its discussion of six ingredients of a good literature review (2008:110). A good literature review is not a mere compilation of summaries of publications. The term replication (history) is not used in any of these references. The best way to construct the required overview of test results for a proposition is by tracking references (backward) and citations (forward) of published test results (in the Web of Science). Figure 2 above is taken from a rather technical article on the understanding of confidence intervals (Geoff Cumming and Sue Finch, A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions, Educational and Psychological Measurement, 2001: 532-574) and represents the results of a series of seven replications of (probably) the same experiment. However, in this book, this figure is taken out of its context and used as an illustration of a series of similar experiments with different types of subjects. Differences between these results cannot be interpreted as (just) sampling variation but might reflect real differences between experimental subjects (and the populations from which they were recruited, such as students, cab drivers, etc.), or they might not reflect such real differences because each of these results is also subject to variation due to the manner in which subjects were recruited from a larger population as well as to sampling variation. This means that this series of results can only be meaningfully interpreted in the context of results of further, yet-to-be-conducted experiments with these and other types of subjects. These subjects must be recruited from different populations from the theoretical domain.
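The naive pooling that a meta-analysis might perform on this series (and which the discussion above warns against) can be sketched as follows; the figures are taken from the example, and the code is only an added illustration:

```python
# Effect sizes and sample sizes of the seven (fictional) experiments.
effects = [1.5, -0.2, 0.7, 0.18, 0.35, -0.1, 0.3]
sizes   = [16, 29, 36, 24, 40, 20, 8]

# A simple mean and an N-weighted mean, ignoring that the subjects come
# from very different populations (students, cab drivers, professors).
simple_mean   = sum(effects) / len(effects)
weighted_mean = sum(e * n for e, n in zip(effects, sizes)) / sum(sizes)

print(round(simple_mean, 2))    # about 0.4, the overall effect size mentioned earlier
print(round(weighted_mean, 2))
```

Both numbers look reassuringly precise, but precisely because they average over students, cab drivers and professors alike, they obscure what the replication history actually shows.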
The logic of inferential statistics, based on probability sampling, does not apply to test results in subsets of the domain. Populations are not selected randomly from the theoretical domain and, therefore, they cannot be treated as samples from a larger population. Hence, a conclusion about the correctness of a theoretical statement for a domain cannot be inferred statistically from a test result in a population. Test results are always findings in subsets of the domain and cannot be generalized to other parts of the domain. They can only be replicated (in the sense meant in this book). A replication history, thus, should contain for each test an indication of the part of the domain in which the test was conducted.
Each point in this scatter plot is defined by a value on the X-axis and a value on the Y-axis. This implies that the plot is derived from a data matrix in which for each case a value for X and a value for Y is specified, as in the following one.
Cases     Value of X    Value of Y
Case 1    x1            y1
Case 2    x2            y2
Case 3    x3            y3
Case 4    x4            y4
Case 5    x5            y5
Case 6    x6            y6
N = 6
There are informal ways by which we can see the association of X and Y in a data matrix. As we have seen, one way of seeing the association between X and Y is plotting the cases in a scatter plot and then observing the empty corners in the plot: top left (low X; high Y) and bottom right (high X; low Y). Another way of observing an association between X and Y is ranking the cases in the data matrix according to their (increasing or decreasing) value of X and also ranking them according to their value of Y, and then comparing the two rankings. If the rankings are roughly similar, i.e. if the same cases are situated high in both rankings (and other cases low in both rankings), this is evidence of an association between X and Y. A more formal approach would entail plotting a trend line in the scatter plot. The relationship stated in the proposition X is associated with Y might be notated as a bi-directional connection: X ↔ Y. In principle, it does not make a difference which concept is called X and which is called Y, and for observing this type of relation (if concepts are continuous) it does not matter which concept is on the X-axis and which one is on the Y-axis. If one of the concepts in the proposition is not continuous, (informally) ranking and (more formally) plotting a trend line are not possible. Take the following proposition that was discussed in Chapter 2: A tangible resource-seeking alliance is more likely to deploy high levels of output and process control. Assume that X and Y have been
observed in all members of the population of US airline alliances. Assume also that the population has six members. The data matrix could look like the following one:
Cases     Value of X (T = tangible; I = intangible)    Value of Y (level of control; scale from 1 to 7)
Case 1    T                                            6
Case 2    T                                            4
Case 3    T                                            2
Case 4    I                                            5
Case 5    I                                            3
Case 6    I                                            1
          T = 50%; I = 50%                             mean Y = 3.5
In this population, an association between the values of X and Y can be observed by comparing the average level of control between the two groups of alliances (tangible resource-seeking and intangible resource-seeking alliances). The observed averages of level of control in this example are 4.0 in the tangible resource-seeking alliances and 3.0 in the intangible resource-seeking alliances. In a formal analysis, therefore, we would not calculate the trend line but just compute the difference between the two averages. In applying such procedures we assume that cases are comparable, i.e. that there are no other relevant determinants of X and Y that differ between cases. Because cases in the data matrix must be comparable or similar in relevant respects, all cases in the data matrix should be members of a specified population, i.e. a population in which cases share the characteristics that define it. For instance, in a test of a proposition about firms (i.e. the focal unit is a firm) we should generate a data matrix of firms that are members of the same population, e.g. firms in a specific economic sector, in a specific country or region, of a specific size, etc. In other words, the data matrix that we need for a test of an association is the data matrix that is generated in a study of a specific population in which X and Y are observed (or measured) for each member of that population at one point in time. This is called a cross-sectional survey. A survey is a research strategy in which values of the relevant concepts are observed in all members (or in a probability sample of members) of a population of instances of a focal unit.
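The comparison of group averages described above can be sketched as follows (the data matrix values are those of the fictional alliance example; the code is an added illustration):

```python
# Data matrix of the six (fictional) US airline alliances:
# X = type of resource sought ("T" tangible, "I" intangible),
# Y = level of control (scale from 1 to 7).
matrix = [("T", 6), ("T", 4), ("T", 2), ("I", 5), ("I", 3), ("I", 1)]

tangible   = [y for x, y in matrix if x == "T"]
intangible = [y for x, y in matrix if x == "I"]

mean_t = sum(tangible) / len(tangible)
mean_i = sum(intangible) / len(intangible)

print(mean_t, mean_i, mean_t - mean_i)  # 4.0 3.0 1.0
```

The statistic of interest here is simply the difference between the two group means, exactly as described in the text.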
Note that this definition of a survey might be different from how this term is used in other sources. In this book a survey is not defined by its method of data collection (e.g., using a questionnaire). It is only defined by a data matrix that contains all cases of a population (or a sample from it). The cells of that matrix can be populated by collecting evidence from interviews, observation, records, data bases, etc. Using the word survey to refer to the use of a questionnaire is confusing and should be avoided.
In sum, the preferred research strategy for testing an association is a cross-sectional survey. The values of concepts X and Y are observed (measured) in all members of the population (cases), or in a probability sample. The observed values are entered in a data matrix with as many rows as there are cases and with as many columns as there are concepts in the proposition.
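For two continuous concepts, the more formal analysis mentioned earlier fits a trend line to the scatter plot. A minimal sketch, with hypothetical values for six cases (the numbers below are invented for illustration), computes the least-squares slope b from the data matrix:

```python
# Hypothetical data matrix for six cases with continuous X and Y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope of the trend line: covariance of X and Y over variance of X.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)

print(round(b, 2))  # a clearly positive slope: evidence of an association
```

A positive slope corresponds to the empty top-left and bottom-right corners in the scatter plot; a slope near zero would indicate no association.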
This comparability is achieved by selecting cases from a specific population (as in a survey), by randomizing these cases over experimental and control conditions, and by observing the change in the value of Y (rather than observing the value of Y itself). Observing the change in the value of Y requires that the value of Y is measured both before (pre-test) and after the treatment (post-test).
The terms pre-test and post-test are confusing because no testing is involved, only measurement. Therefore, it would be better to use the terms pre-treatment measurement and post-treatment measurement.
This procedure results in as many data matrices as there are experimental groups. Here follows an example of two matrices in the simplest form of experiment (with two groups, e.g., one experimental and one control group).

Group 1 (X = x1):

Cases       Change in value of Y
Case 1.1    y1.1
Case 1.2    y1.2
Case 1.3    y1.3
Case 1.4    y1.4
Case 1.5    y1.5
N = 5

Group 2 (X = x2):

Cases       Change in value of Y
Case 2.1    y2.1
Case 2.2    y2.2
Case 2.3    y2.3
Case 2.4    y2.4
Case 2.5    y2.5
N = 5
In this example there are two matrices, one for each group that is defined by the value of X (x1 and x2) that is assigned experimentally. If these two matrices are filled with scores (or, in other words, if the changes in value of Y have been observed in all cases of both groups), a causal association between X and Y can be observed if the changes in value for Y in one group (e.g., the experimental group) are different (on average) from the changes in value for Y in the other group (e.g., the control group). This procedure is the same as discussed above for observing the difference between two types of alliance in a cross-sectional survey. The difference in average change in value of Y between two experimental groups is internally valid evidence of a causal relation because of the preceding manipulation of the value of X, which was absent in the example of two types of alliance. Experimental research is difficult. For instance, it is almost impossible to manipulate someone's loyalty to an existing brand (such as Coca Cola), which might be so ingrained in a customer that it can even be considered a part of someone's identity. It is also difficult to control for factors other than brand loyalty that might influence someone's intention to purchase an existing brand extension (such as negative newspaper reports about the quality of Coke shoes). Students who tested (a version of) the proposition Consumers with higher loyalty to a brand respond more favorably to brand extensions in general and to distant extensions in particular, therefore, invented a fictional brand (Alpha, famous because of its shampoo line) with an equally fictional brand extension (Alpha sporty digital watch). Subjects for the experiment were recruited from the population of students in business administration
at the Rotterdam School of Management (N=30). These subjects were randomly assigned to three experimental groups (N=10 each). Each group received a treatment which resulted in, respectively, a low, medium and high level of loyalty to the Alpha brand. (Technical details such as how and why this treatment works are not discussed here.) Next, each subject was given a description of the Alpha sporty digital watch. Finally, subjects were asked how likely it was that they would purchase the watch. This score was entered into the matrix. Assume that the strength of the purchase intention was measured on a scale of 1 to 10. The data matrices could look like the following:

Group 1. Low loyalty

Cases        Purchase intention
Student 1    1
Student 2    1
Student 3    2
Student 4    2
Student 5    3
Student 6    3
Student 7    4
Student 8    4
Student 9    5
Student 10   5
N = 10       mean (low loyalty) = 3.0

(The matrices for Group 2, medium loyalty, and Group 3, high loyalty, have the same form.)
In these three connected matrices, an association between the values of X and Y can be observed: the average level of purchase intention is 3.0 in the low loyalty group, 5.0 in the medium loyalty group and 8.0 in the high loyalty group. This association itself cannot be seen as evidence of a causal relation. However, the fact that none of these subjects could have had a loyalty to the fictional Alpha brand before they entered the experiment and, hence, no previous intention to purchase an Alpha sporty digital watch could have existed, implies that differences in observed levels of purchase intention can only be caused by the experimental treatment, i.e., by the different levels of brand loyalty. Note that a pre-treatment measurement is not required in this experiment because subjects cannot have a pre-experimental intention to purchase a fictional brand. An experiment, thus, generates evidence about the presence (or absence) of a causal relation by demonstrating that the effect can be produced (at will) by manipulation of a cause. Experiments have a high level of internal validity because of this direct link between manipulation and effect, provided that other potential causes of differences in the values of the dependent concept can be ruled out. In contrast, cross-sectional surveys have a low internal validity because it cannot be known (from the survey) how an observed association came into existence. However, surveys have a higher level of ecological validity, because they allow the observation of associations that actually exist (i.e., that are not experimentally produced).
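The three-group comparison can be sketched as follows. The Group 1 scores are those from the matrix above; the scores for Groups 2 and 3 are hypothetical values, chosen only so that their means match the 5.0 and 8.0 reported in the text:

```python
# Purchase intentions (scale 1 to 10) per experimental group.
groups = {
    "low":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],    # from the matrix above
    "medium": [3, 3, 4, 4, 5, 5, 6, 6, 7, 7],    # hypothetical, mean 5.0
    "high":   [6, 6, 7, 7, 8, 8, 9, 9, 10, 10],  # hypothetical, mean 8.0
}

means = {name: sum(scores) / len(scores) for name, scores in groups.items()}
print(means)  # {'low': 3.0, 'medium': 5.0, 'high': 8.0}
```

Because the levels of loyalty were assigned experimentally, the ordering of these three means is what licenses the causal interpretation discussed above.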
This matrix looks like the matrix of a cross-sectional survey discussed above, but there is an important difference, namely that the association here does not refer to the values of X and Y but to the changes in the values of X and Y (which implies that each is measured at least two times). The change in value of Y must be subsequent to the change in the value of X, implying that the measurements of the value of Y must take place at later points in time than the measurements of the value of X. We call this research strategy a longitudinal survey, defined as a research strategy in which a change in the values of the relevant concepts is observed in all members (or in a sample) of a population of instances of a focal unit. The data matrix must contain all data (i.e., all changes in value of both concepts) from all cases in the population or in a probability sample. The most challenging aspect of a longitudinal survey is determining how much time should elapse between the change in the value of X and the subsequent change in value of Y. A good estimate of this time lapse might be based on theoretical insights, i.e., an understanding of the mechanisms or processes (and their duration) by which the change in X affects Y, and/or on experiences of practitioners who have first-hand knowledge of these processes.
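A minimal sketch of such a longitudinal data matrix, with hypothetical values, might look as follows; each case has X measured at two early points in time and Y measured at two later points, so that the change in Y is subsequent to the change in X:

```python
# Hypothetical longitudinal data matrix:
# (x at t1, x at t2, y at t2, y at t3) for each case.
cases = [
    (2, 5, 10, 16),
    (3, 3, 11, 11),
    (4, 7, 9, 15),
    (5, 4, 12, 10),
]

# The analysis concerns the changes in value, not the values themselves.
changes = [(x2 - x1, y3 - y2) for x1, x2, y2, y3 in cases]
print(changes)  # [(3, 6), (0, 0), (3, 6), (-1, -2)]
```

In this invented example, the case whose X did not change shows no change in Y, which is the kind of pattern a longitudinal survey is designed to reveal.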
Note. An event study is a version of the longitudinal survey.
Further reading
The survey is defined in this book as a research strategy in which a population statistic (such as a slope b of the regression line) is generated from the data matrix that contains all data from all cases in the population or in a probability sample. This definition is derived from the mainstream literature on survey research. A recent example from this literature is Jelke Bethlehem, Applied Survey Methods (Wiley, 2009). Bethlehem defines the survey as follows: A survey collects information about a well-defined population (p.1). Bethlehem states that populations need not necessarily consist of persons. For example, the elements of the population can be households, farms, companies, or schools. How is information collected? Although Bethlehem states that typically, information is collected by asking questions to the representatives of the elements in the population, the term typically is used here empirically, indicating that in practice survey data are often collected by means of asking questions. This does not mean that a survey is defined by this mode of data collection. Information (or evidence as defined in this book in a next chapter) can also be collected from interviews, observation, records, data bases, etc. Using the word survey for a questionnaire is confusing and should be avoided. Good introductions to survey research are Fowler (2009) and Groves et al. (2004). Good introductions to experimental research are Field & Hole (2003) and Chapter 11 (Causal research design: experimentation) of Malhotra & Birks (2003).
Eligibility
The requirement that cases should be selected from the members of the theoretical domain seems obvious. But this requirement is often violated. For instance, it is quite common that companies or business units are selected as cases for testing propositions about projects or teams or, another example, that consumers are selected as cases for testing propositions about advertisements or brands. Often this mistake co-occurs with another one (which might be the reason for it), namely that persons are asked for their opinions about the correctness of the proposition (rather than the proposition itself being tested).
Examples. A common mistake in a test of a proposition about brands (such as Brands with more X have more Y) is to ask consumers whether they think that brands with more X have also more Y. Such a study is an opinion poll, not a proper test of the proposition. Such a test requires that a population of brands (not consumers) is selected and that X and Y are measured for each brand. Similarly, in a test of a proposition about projects (e.g., More X is associated with more success in projects) it often occurs that cases of companies are selected and that managers are asked to report whether they think that, in their company, projects with more X are more successful. A proper test of the proposition
requires that a population of projects is selected and that X and success are measured for each project.
The criterion for eligibility, thus, is the definition of the focal unit and the delimitation of the theoretical domain. The cases that define the rows in the data matrix must be members of the theoretical domain (i.e., the universe of cases of the focal unit).
Prioritization
In principle each member of a theoretical domain (which is the universe of all cases of the focal unit) is eligible for a test. But it is more useful to test the proposition in some of these cases than it is in others. Take the following replication history, a series of experimental tests of a proposition, discussed in Chapter 3.

Figure 3. A series of replications (from Cumming & Finch, 2001:557)
Effect size    Subjects
1.5            Second year IBA students RSM
-0.2           Cab drivers Istanbul
0.7            High school students Rotterdam
0.18           Dockworkers Rotterdam
0.35           First year Psychology students EUR
-0.1           Cab drivers Mexico City
0.3            RSM professors
Suppose the effectiveness of a drug was tested in these experiments. What would the results have looked like if these experiments had used only students and professors as subjects, or only cab drivers? An interpretation of a replication history requires generating possible interpretations of the different test results and attempting to link the results of the different tests to the types of subjects. Based on this assessment, a type of population for a next test can be selected in such a way that its result could contribute to a deeper understanding of test results. For instance, another population of cab drivers could be prioritized for the next test in order to find out whether cab drivers behave consistently differently from other populations. The criterion for prioritization, thus, follows from the researcher's interpretation of the replication history of the proposition.
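The kind of qualitative assessment described here can also be sketched in code. The grouping of the seven results into students (including the academic workers of test 7) and non-academic workers follows the interpretation developed in Chapter 3; it is one possible reading of the replication history, not a given:

```python
# The seven test results, tagged with a rough subject type.
tests = [
    (1.5,  "students"),  # second year IBA students RSM
    (-0.2, "workers"),   # cab drivers Istanbul
    (0.7,  "students"),  # high school students Rotterdam
    (0.18, "workers"),   # dockworkers Rotterdam
    (0.35, "students"),  # first year Psychology students EUR
    (-0.1, "workers"),   # cab drivers Mexico City
    (0.3,  "students"),  # RSM professors: academic workers, closer to students
]

by_type = {}
for effect, kind in tests:
    by_type.setdefault(kind, []).append(effect)

means = {kind: round(sum(v) / len(v), 2) for kind, v in by_type.items()}
print(means)  # students clearly positive on average, workers close to zero
```

Such a tabulation suggests which type of population to prioritize for the next test, for instance another population of cab drivers.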
Feasibility
If different types of eligible cases have roughly an equally high level of priority, data availability and accessibility of data sources are aspects that can be taken into account in deciding which cases actually will be included in the test. Strictly speaking, this criterion (researcher's convenience) is not a methodological criterion but an economic one. Obviously, it is a good thing that resources are not wasted on studies that fail for avoidable practical reasons or that might have been conducted much more easily if other cases had been selected. The following two rules-of-thumb apply: (1) If possible, do not conduct new measurements. It is much more economical to download data from a data base if publicly accessible relevant data on relevant cases (i.e., eligible and prioritized cases) exist. This is often the case if the focal unit is countries, markets, large companies, mergers, etc. (2) If new data must be collected, select cases which are easily accessible. These cases might be geographically close (e.g., a population of companies or persons in Rotterdam, or in the West of the Netherlands, if the researcher is based at the Rotterdam School of Management), linguistically close (e.g., companies or persons about which data can be collected in the researcher's own language) or close in other respects (such as the availability of a contact person in a relevant network which makes it more likely that data can be collected from entities in that network). Convenience is an appropriate criterion for this final selection of a population of cases from the set of prioritized populations. However, when a population has been selected for the test, all members of that population must be included in the test (or a random or probability sample from the population). Convenience sampling is not allowed.
The criterion for feasibility, thus, is convenience, i.e., the convenience of using data that are already available for a population and not for another population, or the convenience of having easy access to a population from which data must be collected.
After a specific type of population has been prioritized for the test, there are still a huge number of populations of such a type from which the population can be chosen. For example, if it has been decided that the proposition A higher level of education is associated with a more frequent use of high-end mobile phones will be tested this
time in a population of non-Western non-male older people (rather than again in a population of Western undergraduate students), a choice can be made from many different populations of non-Western non-male older people. Convenience is not only allowed but also recommended here. Depending on the definition of non-Western (does this refer to ethnicity, country of birth, or country of residence?) one might find a population that is close to the researcher in terms of accessibility through gatekeepers, or in terms of location, culture, or language. It might, for instance, be the case that the most easily accessible population of non-Western non-male older people for a specific researcher is a population of Turkish older women that have a weekly meeting in a cultural center in the street in which she lives. A population of Turkish older women that have a weekly meeting in a cultural center may look too small or too specific for any serious test of a theory (and, therefore, this example might be interpreted as a joke) but it might actually be a very appropriate choice of a population for a test. To begin with, if the focal unit is potential mobile phone users (in general), this population is eligible because it is part of the domain. Second, if the proposition as yet has been tested only in a number of populations of Western (or Westernized) young people, it makes a lot of sense to conduct a next test of this proposition (if it is claimed that it is true in the entire domain of all consumers in the world) in a part of the domain that looks exotic. Third, the fact that this population of Turkish women is very specific is its strength. If the test result confirms the results of previous tests, then the test contributes more to the confidence in the correctness of the proposition than a next test in a population of Western youngsters would, because the same result is obtained despite the huge differences between populations.
If the test result is different from those in previous tests, then this is also a huge contribution to the understanding of the theory. Such a result would suggest that the proposition is correct for only parts of the domain, and this might be the beginning of a more precise delimitation of the domain. The more specific the population, the higher the chance of finding something specific (such as a test result that is different from previous ones). Furthermore, the characteristics that define a specific population (such as Turkish-ness or old age) point to directions in which explanations for a differing result can be found. In sum, the more specific the population, the more informative is the test result. Fourth, the small size of the population is not a disadvantage because the specificity of the population in the domain and the significance of the test result for the theory is determined by the characteristics of the population (and how these differ from those of other populations) and not by its size. The small size of a population is not a disadvantage for the significance of the test result but it is very advantageous for the feasibility of the test. If the population of Turkish older women have their weekly meeting in a cultural center next door to the researcher's home, it might be quite feasible to attend the meeting and to collect data there. If the population was defined as all Turkish women in Rotterdam (or, even worse, as all Turkish women in the Netherlands) it would not be possible to collect data in person. Even data collection in a small sample of such a large population would be much more difficult. But also, even more importantly, the sample would have to be much larger in order to avoid large levels of uncertainty of the population estimates caused by sampling variation (resulting in large confidence intervals), whereas no
estimation of the statistic is needed in a study of a small population (because sampling variation does not occur). Normally, the smaller the population, the more feasible the test. Summarizing this discussion: (a) it is useful for the contribution of the test result to the theory to conduct a test in populations that are as specific as possible, and (b) it is useful for the feasibility of the survey to select a population that is as small as possible, particularly if valid and reliable measurement requires that the members of the population can be accessed easily. Fortunately, these two recommendations converge, because populations get smaller to the extent that they are more specific. For instance, the population of potential mobile phone users selected for the test gets smaller with the addition of each of the following specifications: non-Western, Turkish, grannies, living in Rotterdam, and being a member of a specific group with regular meetings. The recommendation that the selected population be as specific as possible is always applicable, even in surveys in which measurement does not require that the members of the population are accessible, e.g., if data are already available in databases. This discussion results in a strong recommendation to design and conduct only one of the following two forms of survey: (1) If good data are available in a database: conduct a survey of a (prioritized) type of population for which relevant (good) data about all members are available in the database. (2) If data must be collected by the researcher: conduct a survey of a very small (prioritized) type of population that is easily accessible. Note that sampling is not necessary (and hence should not be done) in either approach, because both approaches are designed such that they facilitate easy data collection about all members of the selected population.
Because there is no sampling, there is also no sampling variation, and hence no need to use inferential statistics and no need to worry about sample size. The principles discussed here also apply to the selection of a population for a test in a longitudinal survey. Feasibility might receive even more emphasis in such a survey because data must be collected at least twice from each member of the population (or such data must be available in the database).
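The arithmetic point is worth making concrete: when every member of a (small) population is measured, the computed statistic simply is the population value, so no estimation is involved. A minimal sketch in Python, with invented data:

```python
# Hypothetical scores for every member of a small, fully observed population
# (e.g., weekly hours of mobile phone use; the numbers are invented).
# Because every member is measured, the computed value IS the population
# value; there is no sampling variation and no confidence interval to estimate.
phone_use_hours = [1.2, 0.8, 2.5, 1.9, 0.4, 3.1, 1.1, 2.0]

population_mean = sum(phone_use_hours) / len(phone_use_hours)
print(population_mean)
```

Descriptive statistics suffice here; inferential machinery would have nothing to infer.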
The recommendation to select a very small (prioritized) type of population that is easily accessible applies to populations that exist without intervention by the researcher. This is not the same as drawing a convenience sample and calling that sample a population. A population of Turkish older women who have a weekly meeting in a cultural center is a population that exists independently of the home address of the researcher. If the researcher lives next door, this already existing population is easily accessible for data collection. However, a group of friends or relatives of the researcher is at least partly defined by the relation of its members to the researcher. Such a group is not an extant population that can be defined independently of the researcher. Findings in such a group cannot be interpreted as related to specific characteristics of the population such as being Turkish, being grannies, and living in Rotterdam. In other words, one of the criteria for selecting an existing population for a test can (and must) be convenience, but a convenience sample cannot be called a population.
5.3. Criteria for the selection of experimental cases and for generating experimental conditions or groups
In experiments two or more data matrices are generated, one for each value of the independent concept that is experimentally produced. Each of these multiple data matrices must be populated, for each of the cases (usually subjects), with either the value of the dependent concept measured after the experimental treatment (post-treatment measurement) or the difference between that value and the value measured before the experimental treatment (pre-treatment measurement).
Eligibility
The general criterion for eligibility is the definition of the focal unit. Cases that define the rows in your data matrix must be members of the theoretical domain (i.e., the universe of cases of your focal unit). How does this principle apply in experiments, i.e., in situations in which real-life cases (or populations) cannot be selected but rather must be generated experimentally? In most experiments, this is not a difficult issue. Whereas a cross-sectional test of a proposition about people (consumers, employees, leaders, CEOs, etc.) requires that a population is selected from the relevant domain (i.e., of consumers, employees, etc.), an experimental test similarly requires that subjects are recruited from such a population in the domain.
Prioritization
A population needs to be specified from which subjects will be recruited. The principle that this population should be as specific as possible, as discussed above for a survey, applies here as well. It is a general principle. The more specific the subjects in an experiment are (i.e., the more consistently different they are from subjects in previous experiments), the more informative is the test result. Because more than 90% of all published experiments in psychology and in business research (including marketing and organization research) have used students in psychology and business administration as their subjects (and because most of the theories that have been tested in this way claim to be universal), it is very useful to replicate these tests with entirely different types of subjects (such as, e.g., non-Western grannies).
The cases in an experimental test should be instances of the focal unit. This implies that an experimental test of a proposition about customers requires that persons are recruited for the experiment. The cases in the resulting data matrices are persons. However, an experimental test of a proposition about brands, products, or advertisements requires that the cases in the data matrices are brands, products, or advertisements. It might still be necessary to recruit persons for the experiment. If this occurs, these persons are not cases themselves but only function as raters for the measurement of variables of the brands, etc., such as purchase intention, brand loyalty, etc.
Feasibility
For surveys, application of the principle that a population should be selected that is accessible for data collection (if the researcher must collect data and cannot obtain them from a database) results in the recommendation to select a small population. Applying this principle (facilitation of easy recruitment) to the selection of subjects for an experiment results in the recommendation to select subjects from a much larger (though still quite specific) population. Using non-Western grannies as an example again, for the recruitment of subjects for an experiment it would not be wise to select them only from the population of Turkish grannies who have a weekly meeting in the cultural center; rather, they should be selected from the whole population of Turkish grannies in Rotterdam.
Experimental conditions
Subjects must be placed in different experimental conditions, each of which represents a specific value of the independent concept as defined in the theory. A concept such as a person's level of education cannot be manipulated experimentally, and hence the theory that a higher level of education results in more high-end mobile phone use cannot be tested in this way. However, a theory about emotional states or about beliefs that influence the purchase of specific types of mobile phones can be experimentally tested if the experimenter succeeds in invoking the appropriate emotional states in subjects and if the purchase can be realistically simulated. If the focal unit of the proposition is a team or a situation, it is the experimenter's task to compose teams (from eligible subjects, i.e., not always from students) and to generate or simulate situations (in which, again, eligible subjects must participate) that can be seen as valid representations of teams or situations as defined in the theory. There are some methodological arguments for recruiting subjects randomly from the selected population, but this is almost never done. It is assumed that this will not much affect the results of the experiment. However, given a pool of subjects that is not representative of a population, it is certainly required that they are randomly assigned to the different experimental conditions.
CHAPTER 6 MEASUREMENT
The value of the concept(s) of the proposition must be measured in each case. When a concept is measured in a study, it is called a variable. A variable is a concept that is more precisely specified for the cases in the study. This specification must describe in detail the possible values that the variable can have. For instance, if the concept that must be measured is project success, the variable success might be specified in different ways, such as monetary success expressed as the amount of dollars generated by the project (i.e., a ratio-type variable), or as satisfaction expressed as a (subjective) rating by the company on a scale from 'none' via 'a bit' and 'quite a lot' to 'very much' (i.e., an ordinal-type variable). This specification must be valid and the resulting score must be reliable. As discussed above, research strategies are defined by their different types of data matrices, not by their methods of measurement. For instance, the data that must populate the matrix in a survey can be collected in any way, such as observations, content analysis, semi-structured interviews, and also questionnaires. The latter are only preferred if data must be collected regarding a person's opinions and beliefs. This chapter discusses a stepwise procedure for the development of an instrument for the valid and reliable measurement of the value of a concept in each case selected for a test. The seven steps are the following: 1. Formulate a more precise definition of the concept. 2. Determine the object of measurement. 3. Identify the location of the object of measurement. 4. Specify how evidence will be extracted from the object of measurement. 5. Specify how the object of measurement will be accessed. 6. Specify how evidence will be recorded. 7. Specify the type of the variable (ratio, interval, ordinal, nominal) and describe the possible values.
The outcome of this procedure is a measurement protocol for each concept, in which it is precisely specified how a score (i.e., the value of the variable that will be put in the data matrix) is generated in each case.
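Purely as an illustration (the record type and its field names are my own invention, not a standard format), the outcome of the seven steps can be thought of as a structured record, one per concept:

```python
from dataclasses import dataclass, field

@dataclass
class MeasurementProtocol:
    # One field per step of the procedure described above (illustrative only).
    concept_definition: str       # step 1: precise definition of the concept
    object_of_measurement: str    # step 2
    location: str                 # step 3
    extraction_method: str        # step 4
    access_method: str            # step 5
    recording_method: str         # step 6
    variable_type: str            # step 7: ratio, interval, ordinal, or nominal
    possible_values: list = field(default_factory=list)  # step 7

# A hypothetical protocol for the financial-success example:
protocol = MeasurementProtocol(
    concept_definition="Financial success: monetary gain from the project",
    object_of_measurement="Financial records of the project",
    location="Company accounting department",
    extraction_method="Read cost and revenue lines; compute the gain",
    access_method="Via company gatekeeper (financial staff)",
    recording_method="Copy amounts onto a coding sheet",
    variable_type="ratio",
    possible_values=["any amount in euros"],
)
print(protocol.variable_type)
```

The point of the sketch is only that every step of the procedure yields an explicit, recordable decision.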
In this discussion of the development of a measurement protocol, with success as a running example, success will be defined in the following three ways: 1. Financial success, defined as the amount of monetary gain for the company resulting from the project. 2. Timely delivery, defined as whether the project has delivered its results before a specified deadline. 3. Satisfaction, defined as the extent to which a project is evaluated as successful by the company. Note that these are just three possible specifications of the concept success as used in a theory. Usually, a theory clearly specifies one of these different meanings as the one to which the theory refers, i.e., as the type of success that is explained by the theory or proposition. For the stepwise process described here, it is useful to make an initial (provisional) decision about the scale that one wants to develop. For instance, if success is defined as financial success, will it be required to measure an exact amount of dollars or euros, or would it be enough to rate it as low, medium, or high?
These examples show that different specifications of the concept of success result in different variables, i.e., different types of attributes, with different types of possible values that must be observed in different types of objects of measurement. Although,
in the example, the concept (success) is an element of one focal unit (project), the three variables refer to different objects of measurement (the project as defined in the financial records, the delivery date, and a company's evaluation).
The point of this exercise (in which, first, an object of measurement is specified and, next, its location is determined) is that, in principle, researchers must know what their object of measurement is and where it (literally) is. Next, it must be determined what kind of evidence must be extracted from it and how this should be done (step 4), how the object of measurement will be accessed for extracting the evidence in the manner specified (step 5), and finally how this evidence will be recorded (step 6) so it can be brought to the researcher's desk for coding (step 7).
Step 4: specify how evidence will be extracted from the object of measurement
Measurement of the value of a variable requires that evidence is extracted from the object of measurement that corresponds with that value. Different variables require very different instruments, which vary from complicated (such as extracting evidence of a person's intelligence by means of a battery of tests or, more accurately, a set of measurements) to simple (such as extracting evidence of a project's costs by reading the appropriate lines in a financial report).
Financial success. After identification of the relevant financial records or reports, the relevant financial numbers need to be identified and read. If these records or reports do not provide a number for the total costs and revenues of a project, numbers for
subcategories of costs and revenues need to be identified and read in different lines, columns, pages, or files. The set of different numbers identified in this way forms the evidence that is extracted. It must be specified beforehand, in detail, which numbers in the records count as evidence for the value of the variable as defined. The instrument required for extracting evidence of the value of the variable financial success, thus, is reading the right numbers. Timely delivery. After identification of the relevant documents, information about the planned and actual delivery dates must be found in these documents and read. Satisfaction. Once an evaluation report that contains evidence of how the company evaluates the project has been identified, the report must be read to retrieve the required evidence.
Only after it has been specified in this way which kind of evidence must be extracted from which type of object of measurement is it possible to specify the practical steps that are required for actually obtaining the evidence that is needed, for each case.
If the object of measurement is not an opinion, a belief, or a person's experience (as is the case in these three examples), it is likely that it is something that can be observed by the researcher herself and that, in principle, she does not need to ask another person (a respondent or informant) to extract the necessary evidence. In such cases, usually gatekeepers must be passed in order to get access to the evidence that is needed for a measurement. In the examples discussed here, staff must first help the researcher to identify the relevant records or documents and then allow access to them. Sometimes the evidence that is needed is publicly available, e.g., in annual reports, on company websites, etc. Therefore, it should be investigated whether such publicly available data exist for the relevant cases before access to companies is negotiated. In all cases the quality of these data should be assessed before using them.
If the researcher cannot get access to primary sources of evidence (such as the relevant financial records), she must ask persons who have access (informants or respondents) to provide the necessary evidence. This request basically can take one of two forms: (a) interviewing informants (face-to-face or by telephone) or (b) sending a questionnaire (either paper or electronic). However, note that it cannot be assumed that informants provide the correct evidence, particularly not if information is requested by means of a questionnaire. In such cases it is highly recommended to talk to informants face-to-face or by telephone in order to create opportunities for explaining in detail precisely what evidence is needed. Also, it is recommended to ask informants how they have extracted the information that they have provided. It must be ascertained as thoroughly as possible that the information that is received is evidence as specified above.
Usually the researcher knows from the outset what kind of score on what kind of scale is required or desired as the outcome of the measurement, and this knowledge will steer decisions in the development of the measurement protocol. This can be illustrated again with the examples.
Financial success. It will be clear from the outset that the score (for the data matrix) should be an amount in some currency (dollar, euro, etc.). But, if the variable is not defined as the extent of financial success but rather as the presence or absence of financial success, then the dollar or euro amounts must be coded into one of these two possible scores. This means that a coding procedure must be applied by which monetary amounts are evaluated as indicating success (presence/absence), which requires that a cutoff point is specified. Timely delivery. A criterion must be specified for evaluating the date of delivery as on time or too late (or any other score deemed relevant for the proposition). Satisfaction. Text analysis, document analysis, and content analysis are the terms used for generating scores from texts. Coding is simple if an evaluation report has a clear conclusion in which the project is unequivocally judged as a success or not. But coding is more complicated if such a judgment must be generated by the researcher from different, ambiguous, and sometimes contradictory, statements in a report. Then the researcher must have a procedure for finding the evaluation result in the text.
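The coding step for financial success (dichotomizing a monetary amount at a pre-specified cutoff point) can be sketched as follows; the function name and the cutoff value of zero are illustrative assumptions, not prescriptions:

```python
def code_financial_success(gain_euros, cutoff=0.0):
    """Code a monetary gain as presence or absence of financial success.

    The cutoff must be fixed in the measurement protocol BEFORE coding
    starts; 0.0 (any positive gain counts as success) is just an
    illustrative choice.
    """
    return "success" if gain_euros > cutoff else "no success"

print(code_financial_success(12500.0))   # a project with a positive gain
print(code_financial_success(-3000.0))   # a project with a loss
```

Writing the rule down as an explicit procedure is exactly what makes the coding reproducible by a second coder.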
Measurement validity
A measurement of an attribute (or a variable characteristic) is valid if variations in the attribute causally produce variation in the measurement outcomes (Borsboom et al., 2004). It is not possible to objectively assess the degree to which measurement validity has been achieved. A level of validity is an outcome of argumentation and discussion. This can be illustrated with the three indicators of success of a project.
Financial success. If financial records must be read in order to retrieve financial data indicating the degree of success of a project, be it directly or indirectly (after some computation), the type of financial data that are needed must be precisely specified. It is not possible to just copy any financial number from the records, but only those numbers whose meaning is precisely defined. The meaning of a specific number (most often an amount in, say, dollars) is known if it is known how it was produced. For instance, if the costs involved in a project must be computed from financial records (in order to assess whether a financial gain has occurred), it must be known how the company assigns costs to projects. If relevant costs are not included in the costs documented in the financial records, or when revenues are attributed to the project that actually were generated in ways not connected to the project, it is possible that the financial success of the project is overestimated. And, conversely, if costs are attributed to the project that actually are not related to the project, or if not all revenue from the project is included in the revenue as documented in the records, underestimation of the project's financial success is possible. If necessary, financial data must be recalculated in such a way that they exactly represent the researcher's definition of the concept. If the records or reports do not contain sufficient information on how the various numbers or amounts have been calculated, it may be necessary to retrieve this information from (financial) staff in order to judge the validity of those data. If these are not valid in terms of the researcher's definition, staff could be asked to identify and retrieve other, more valid evidence. In sum, a valid way of extracting evidence of the financial success of a project consists of: 1. Precisely defining what the researcher considers to be the financial success of a project.
2. Translating that definition into precisely described operational procedures. 3. Evaluating the firm's procedures for computing the financial success of a project, if any, against these procedures. 4. If necessary, identifying or computing other, more valid evidence. The criterion for measurement validity of this instrument is whether every detail of its procedures can be justified in terms of the researcher's definition of financial success. Delivery time. There might be different types of delivery time of project results (the publication of the written report, the oral presentation of the results to management, the final financial record, etc.), some of which might not count as the delivery time as meant in the researcher's definition. Therefore, the researcher must define in a quite detailed way what is considered the delivery time in the theory that is tested (and what is not). The researcher's definition needs to be translated into precise procedures that are applied to candidate pieces of evidence of delivery time, which are identified by reading the relevant documents or from the verbal reports of company staff who were involved in the end phase of the project. The criterion for measurement validity of these procedures is again whether they are justified in terms of the researcher's definition of delivery time. Satisfaction. This indicator of success refers to success as defined by the company, not by the researcher. This is an important distinction, which implies that it is not necessary to apply the procedures outlined in the two previous examples. There is no need to evaluate the correctness of the company's judgment. The outcome of the company's evaluation can be accepted, irrespective of how it was generated (although the researcher might be interested in the company's procedures and might want to try to collect evidence on these procedures as well).
Measurement validity here refers to the validity with which the researcher identifies, retrieves, and codes the company's evaluation, irrespective of how the company has generated its evaluation. If this evaluation has not been recorded in a document by the company, the researcher must (re)construct a company's satisfaction with a project through interviews. There are more and less valid ways of retrieving judgments (such as these evaluations of project success) from respondents in interviews and/or through questionnaires, which will not be discussed here.
Measurement validity, thus, concerns the quality of each part of the measurement protocol in terms of the criteria that follow from the (as precise as possible) definition of the concept that is measured.
Reliability
If a measurement is valid, the next quality criterion is that it is also reliable. Reliability is the precision of the scores obtained by a measurement. Reliability as defined here (i.e., the precision of scores), in contrast to measurement validity, can be measured. This is usually done by generating more than one score for the same variable in the same object of measurement and by assessing how much they differ. The level of achieved reliability of scores can be obtained by calculating the degree of similarity of scores for the same object of measurement and expressing it as an inter-observer, inter-rater, or test-retest similarity rate.
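A simple inter-rater similarity rate, the proportion of objects of measurement on which two raters assign the same score, can be computed as in this sketch (the scores are invented):

```python
def similarity_rate(scores_a, scores_b):
    """Proportion of cases on which two raters agree: a simple inter-rater
    reliability measure. (More refined measures, such as Cohen's kappa,
    additionally correct for chance agreement.)"""
    assert len(scores_a) == len(scores_b), "raters must score the same cases"
    agreements = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return agreements / len(scores_a)

# Two raters coding the same five projects for presence of success:
rater1 = ["success", "success", "no success", "success", "no success"]
rater2 = ["success", "no success", "no success", "success", "no success"]
print(similarity_rate(rater1, rater2))  # the raters agree on 4 of 5 projects
```

The same function applies unchanged to test-retest comparison: score the same objects twice and compare the two score lists.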
Financial success. When a valid procedure for measuring financial success of a project has been developed, its reliability can be assessed by arranging that two or more persons, either company staff, or researchers, or their assistants, collect and code data using these guidelines and then compute the degree of success from these data. If the reliability of the
scores is insufficient (in terms of a criterion that was formulated a priori), measurement procedures should be further specified until a sufficient level of reliability is achieved. Delivery time. If a valid procedure for measuring the exact dates of planned and actual delivery and for determining its timeliness is developed, the reliability of the score can be assessed by arranging that two or more persons identify both the planned and the actual delivery date and then rate the delivery's timeliness. Scores are reliable if different raters generate the same score. Satisfaction. When a valid procedure for the measurement of the value of the company's project evaluation is developed, the reliability of the scores obtained in this way can be assessed by using the same procedures described above for assessing the reliability of financial success or timeliness of delivery. If evidence is extracted through qualitative interviews with persons, the more structured a qualitative interview is (e.g., instructions regarding the interview as well as the questions specified in the interview guide), the more reliable will be the evidence generated in the interview. Different interviewers interviewing the same person should obtain similar or the same evidence. If the data are generated through a standardized questionnaire, consisting of questions with a set of response categories, reliability is usually assumed to be good, although different measurement conditions (e.g., how the questionnaire is introduced to the respondent, the absence or presence of other people such as supervisors or colleagues, whether scores are obtained in an interview or by self-completion) will influence the reliability of the scores that are obtained.
Questionnaires
When a person is the object of measurement, i.e., when the concept is a belief, an opinion, a psychological trait, or an experience, it is very likely that evidence can only be extracted by asking that person questions. This implies that an interview with that person must be conducted or that they must complete a questionnaire. However, constructing a good questionnaire (i.e., a questionnaire that will result in valid and reliable scores) is not only very difficult but also very time-consuming, because it requires a couple of rounds of pre-testing (with real respondents). Therefore, it is recommended to make use of a questionnaire only when it is necessary and therefore appropriate, hence only when the respondent him/herself is the object of measurement. In general, it is not a recommended strategy to design a study that relies on evidence provided by informants (rather than on evidence extracted by the researcher from the object of measurement), but collecting evidence from informants by means of questionnaires must be avoided in particular.
Further reading
This chapter is to a large extent taken from Appendix 1 (Measurement) in Dul & Hak (2008). The definition of measurement validity as discussed in this chapter (the degree to which variations in the attribute causally produce variation in the measurement outcomes) is also discussed by Rossiter (2002) and, very thoroughly, by Borsboom et al. (2004). This chapter does not discuss questionnaire construction. The difficulty of constructing a valid questionnaire is usually severely underestimated, which is one of the reasons why it is recommended in this book to avoid using a questionnaire for data collection if possible. When constructing a questionnaire, it is recommended to follow the C-OAR-SE procedure (Rossiter, 2002) for developing valid items in standardized questionnaires, which should be followed by a cognitive pre-test with real respondents (Willis, 2005; Hak et al., 2008).
CHAPTER 7 HYPOTHESIS
As stated in Chapter 1, the core assumptions of any theory-testing study are: If the proposition is true in the domain, then it must be true in each population of the domain. If the proposition is true in a population, then it must be possible to observe (or see) this. In other words, there must be empirical evidence for the correctness of the proposition in the population. We can specify what we expect to observe in the population if the proposition is true. This specification of our expectation is called a hypothesis (or expected pattern). Note that in the literature no clear distinction is made between propositions and hypotheses. In this book, a proposition is a theoretical and hence general statement, whereas a hypothesis is an expectation (derived from the proposition) about what we will observe in a data set. When we have chosen a research strategy (e.g., a survey), have selected a population (e.g., a specific population of female customers), and have developed a measurement protocol, this hypothesis can be very specific. To such an extent, even, that it can be expressed as a range of values of a parameter. This chapter discusses what hypotheses should look like. The discussion is divided into two parts: (1) determining the appropriate parameter; and (2) specifying the values of that parameter which, if observed, would be consistent with the proposition.
Note that a p value (an indicator of statistical significance) has hardly any relation to effect size. A p value is mainly a function of sample size: the larger the sample, the smaller is p. If the sample is large enough, even minute effect sizes (nearing zero) will have a p value that is small enough for concluding statistical significance. Below, in Chapter 8, we will discuss more reasons why the concept of statistical significance is not useful (see also Schwab et al., 2011).
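The dependence of p on sample size can be demonstrated with a deliberately simplified one-sample z test (known standard deviation assumed); the effect size of 0.01 standard deviations is chosen to be minute:

```python
import math

def p_value_two_sided(effect_size, n, sd=1.0):
    """Two-sided p value for a one-sample z test of a mean against zero,
    assuming a known standard deviation (a textbook simplification)."""
    z = effect_size / (sd / math.sqrt(n))
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# The same minute effect (0.01 standard deviations) at growing sample sizes:
for n in [100, 10_000, 1_000_000]:
    print(n, p_value_two_sided(0.01, n))
```

The effect size never changes, yet p shrinks from roughly 0.9 to essentially zero as n grows, which is exactly why a small p by itself says little about the size (or relevance) of an effect.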
Hypotheses about the regression coefficient b assume that the variables are at least ordinal. In a test of a proposition with a nominal concept (such as gender), e.g., 'For consumers: men buy more beer than women', the appropriate hypothesis would entail an expectation about the means of beer consumption, for instance: 'In this population: men > women'. If the dependent variable is nominal, such as in 'For consumers: more women than men use a shopping list', the relevant parameter in the hypothesis is the proportion of users of a shopping list: 'In this population: %women > %men'. The hypothesis in a longitudinal survey will usually entail an expectation about the regression coefficient b of the regression of (Y2 − Y1) on (X2 − X1). In an event study, the relevant parameter is usually the value of Y in cases with the event as compared with a normal situation (e.g., abnormal returns). The hypothesis in the usual type of experiment with two experimental conditions, each defined by a specific value of the independent concept, entails an expectation about the difference between the means of the two groups: 'In this experiment: group1 < group2'.
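Generating the parameters for such hypotheses requires only descriptive statistics, as this sketch with invented data shows:

```python
# Invented beer consumption (litres per month) for a fully observed population,
# split by the nominal concept gender.
men = [6.0, 4.5, 8.0, 5.5]
women = [3.0, 2.5, 4.0, 3.5]

mean_men = sum(men) / len(men)
mean_women = sum(women) / len(women)
# Hypothesis "In this population: men > women" (about means):
print(mean_men > mean_women)

# Nominal dependent variable: proportion of shopping-list users per group.
uses_list_men = [False, True, False, False]
uses_list_women = [True, True, False, True]
pct_men = sum(uses_list_men) / len(uses_list_men)
pct_women = sum(uses_list_women) / len(uses_list_women)
# Hypothesis "In this population: %women > %men" (about proportions):
print(pct_women > pct_men)
```

In both cases the observed pattern is a single descriptive value (a difference of means or of proportions) that can be compared directly with the expected pattern.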
Further reading
Hak & Dul (2009a) discuss the principles of pattern matching and the concept of expected pattern. An important background article for this chapter is Schwab et al (2011).
Experiment
If subjects are recruited from a population for an experiment, refusal to participate by individual members of the population is not a problem because (as was mentioned above) probability sampling is not considered necessary for generating a pool of potential subjects. However, as soon as persons have agreed to participate in the experiment and have been randomly assigned to the different experimental conditions, data must be collected from each of them. Failure to do so will invalidate the study's results.
Pattern matching
Testing entails comparing an observed pattern (i.e., the pattern of the scores in the data matrix) with the expected pattern (i.e., the hypothesis). Comparing an empirical fact in obtained scores with a hypothesis is a simple and straightforward activity which does not require a null hypothesis significance test. It will be explained below
why the results of null hypothesis significance tests are not needed and are potentially misleading. Examples of hypotheses (expected patterns) in a survey, as discussed above, are:
- "In this population: b > 0" or "In this population: b > n" (in which n is a number set by the researcher as a threshold for managerial relevance);
- "In this population: x2 - x1 > 0" (or "> n"), in which x1 and x2 are subgroups in the population with different (nominal) values of X;
- "In this population: %x2 - %x1 > 0" (or "> n"), in which x1 and x2 are again such subgroups.
The hypothesis in a survey usually specifies a range of values in which the observed value is expected to occur. Before the expected pattern and the observed pattern can be matched, the observed value of the relevant statistic (b, or x2 - x1, or %x2 - %x1) must be generated. Note that only descriptive statistics, not inferential statistics, are needed to obtain these values, and that the value that is generated is a precise point (not a range). If not all members of the population are surveyed, but only the members of a sample, the value of the relevant statistic (b, or x2 - x1, or %x2 - %x1) is generated in the same way. The result is an observed value in the sample. But the hypothesis is the formulation of an expected pattern in the population. Hence, in order to conduct the test, first a population value must be estimated and, next, the resulting population estimate must be compared with the expected pattern. If the sample is not a random (or, more generally formulated, a probability) sample, the population value of the relevant statistic cannot be estimated. Non-probability samples, therefore, are useless in theory-testing research.
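The point that the observed value is generated with descriptive statistics only can be sketched as follows (the data are invented):

```python
# Illustrative only: the least-squares slope b computed as a plain
# descriptive statistic, with no inferential machinery involved.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # scores on the independent variable
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # scores on the dependent variable

mx, my = mean(x), mean(y)
# b = sum((xi - mx)(yi - my)) / sum((xi - mx)^2)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)
print(b)  # a precise point value (here 1.96), not a range
```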
If the sample is a probability sample, the population value of the relevant statistic can be estimated by means of inferential statistics. The result of the use of inferential statistics is a confidence interval (with a confidence level that must be specified beforehand, e.g., 95%) with the sample value in the middle. This result is a range (or interval) of values with a known likelihood that the actual value of the population statistic is in that range (e.g., the 95% confidence interval). The data are consistent with the hypothesis if the entire 95% confidence interval is within the expected range. They are not consistent with the hypothesis if the entire 95% confidence interval is outside that range. If the observed range (the 95% confidence interval) overlaps with the expected range, the test result is partially consistent with the hypothesis. Note that these testing procedures do not require null hypothesis significance testing and that test results are not expressed in terms of statistical significance. Reasons for not using null hypothesis significance testing (apart from the fact that it is not necessary for obtaining a test result) are discussed below. The hypothesis in the usual type of experiment with two experimental conditions, each defined by a specific value of the independent concept, is: "In this experiment: group1 < group2". The relevant statistic (group2 - group1) can be generated by using descriptive statistics. However, this statistic is subject to sampling variation due to the fact that subjects have been randomly assigned to the experimental groups. Hence, a (95%) confidence interval must be estimated and the resulting range (95% confidence
interval) must be compared with the expected range. The data are consistent with the hypothesis if the entire 95% confidence interval is within the expected range. They are not consistent with the hypothesis if the entire 95% confidence interval is outside that range. If the observed range (the 95% confidence interval) overlaps with the expected range, the test result is partially consistent with the hypothesis.
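The three-way decision rule used in both the survey and the experiment can be summarized in a small helper function (an illustrative sketch; the function name and interval values are invented):

```python
# Illustrative sketch of pattern matching with a confidence interval.
def match_pattern(ci_low, ci_high, threshold=0.0):
    """Compare an observed 95% CI with the expected range (> threshold)."""
    if ci_low > threshold:
        return "consistent"           # whole CI inside the expected range
    if ci_high < threshold:
        return "not consistent"       # whole CI outside the expected range
    return "partially consistent"     # CI overlaps the expected range

print(match_pattern(0.10, 0.50))      # consistent
print(match_pattern(-0.40, -0.05))    # not consistent
print(match_pattern(-0.05, 0.45))     # partially consistent
```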
Why the results of null hypothesis significance tests are potentially misleading
Take the example of a series of tests that was discussed above (Figure 4).
This figure might represent the effect size d (with 95% confidence interval) obtained in seven experiments, or seven population estimates of the regression slope b (with 95% confidence intervals), or seven population estimates of other relevant statistics (with 95% confidence intervals). The vertical line represents the null. Assume that the hypothesis for each test was that the value of the relevant statistic is higher than zero. This implies that the observed value is expected to occur on the right-hand side of the vertical line. Each of the seven test results can easily be obtained by applying the pattern matching procedure described above:
- Test 1: consistent with the hypothesis
- Test 2: mostly not consistent with the hypothesis
- Test 3: consistent with the hypothesis
- Test 4: largely but not fully consistent with the hypothesis
- Test 5: consistent with the hypothesis
- Test 6: largely not consistent with the hypothesis
- Test 7: largely but not fully consistent with the hypothesis
The seven test results together suggest that it is more likely that the proposition is true for the theoretical domain than that it is not. Five of the seven observed values are on the expected side of the null, and the two observed values on the wrong side of the null are both relatively close to it. The overall effect size (if these are experimental results) or the actual strength of the association (if these are population estimates) is likely close to the one found in test 7, whose result is the median of the seven test results.
Now observe the results of null hypothesis significance testing (with the criterion that p ≤ 0.05). Although the values of p cannot be directly observed in the figure, it is clear from the 95% confidence intervals that four of the seven tests (namely tests 2, 4, 6, and 7) would have generated a value of p higher than 0.05. Note that this implies that the positive result of test 7 (which quite likely has an observed value of the relevant statistic that is very close to the actual value in the domain) is considered not significant and that, therefore, the hypothesis is rejected. This clearly is a misleading result. The main reason is that the null hypothesis significance test is (as its name implies) a test of the null, not of the actual hypothesis that is supposed to be tested. The null is only rejected if the observed value of the statistic occurs outside the range of values that the statistic will have with 95% probability if the actual value is zero (the null). However, the fact that the null cannot be rejected with more than 95% confidence (in tests 4 and 7) does not imply that it is unlikely that the actual hypothesis is true. The fact that the larger part of the 95% confidence interval, in both tests 4 and 7, lies on the right side of the null actually means that the population estimate (or the real experimental effect) is much more likely to be in the expected range (> 0) than not. Because null hypothesis significance testing is not necessary and because its outcome is potentially misleading, it is strongly recommended to abstain from it.
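A small numerical illustration (all numbers invented, loosely resembling a result like that of test 7) shows how the two procedures can diverge; the share-of-interval figure at the end is only a crude indication, not a probability:

```python
# Illustrative only: a population estimate whose CI straddles the null.
from statistics import NormalDist

est, se = 0.20, 0.125      # assumed estimate and standard error
z95 = 1.96                 # two-sided 95% critical value

ci = (est - z95 * se, est + z95 * se)   # about (-0.045, 0.445)

# One-sided null hypothesis significance test of "actual value = 0".
p = 1 - NormalDist().cdf(est / se)      # about 0.055

# NHST verdict: not significant, so the hypothesis would be rejected ...
print(p > 0.05)                         # True
# ... although most of the interval lies in the expected range (> 0),
# so pattern matching calls the result largely consistent.
share_in_expected_range = ci[1] / (ci[1] - ci[0])
print(round(share_in_expected_range, 2))   # about 0.91
```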
Further reading
An important background article for this chapter is Schwab et al (2011). Hak & Dul (2009a) discuss the principles of pattern matching. The most recent version (sixth edition; 2010:33) of the Publication manual of the American Psychological Association (APA) notes that "historically, researchers in psychology have relied on null hypothesis statistical significance testing (NHST)" as a starting point for many (but not all) of its analytic approaches. The manual notes that this practice is contested and then states that "complete reporting of all tested hypotheses and estimates of appropriate effect sizes and confidence intervals are the minimum expectations for all APA journals". This advice is accompanied by a note with a reference to, among others, a useful paper by Jones & Tukey (2000). See, for a further discussion of the greater usefulness of reporting (and interpreting) confidence intervals rather than p values: Cumming & Finch (2001) and Finch et al. (2002). It is repeatedly stated in this chapter that scores must be obtained from (almost) all of the members of a population (or of a probability sample), because missing data potentially result in non-response bias. See Rogelberg & Stanton (2007) for a discussion. Because there is no feasible way of estimating bias caused by non-response and because, by implication, there is no way of correcting for it, it is necessary to collect scores for (almost) all cases. As discussed above, this is only possible if (1) data for the population are already available (e.g., in a database) or if (2) data are collected in an intensive way from a small population. If this advice is heeded, then the survey is a census (i.e., a study of all members of a population), not a sample survey. Although it is also possible to collect almost complete data in an intensive way from a small sample, this would introduce another problem: sampling variation and, hence, too little statistical power.
Take, as an example, the result of the second test in this series, which indicates that there is a higher chance that the hypothesis is not correct than that it is correct. In other words, the proposition is not supported in this test. Note that, based on this test alone, no conclusion whatsoever can be drawn about the correctness of the proposition in the domain. In other words, generalization from one test result to the domain is not possible. This simple fact applies to any result, positive or negative. This test result carries meaning only as part of a series of replications. It is only in the context of the whole series of seven tests that the result of test 2 can be seen as the one that is the least supportive of the proposition. It would be ludicrous to conclude from test 2 alone that the proposition is not correct. On the contrary, the series of test results suggests that the proposition is likely correct in at least a part of the domain. This means that the result of test 2 needs to be explained. Such an explanation must be sought first in errors that might have been made in the study: in measurement errors or in errors regarding the selection of cases for the test.
Error
In finding explanations for a test result, one might first assess the likelihood of error regarding the cases that have been observed. Were the observed cases eligible (i.e., members of the theoretical domain)? If a population was selected for a test in a survey, could a coverage error have been made (i.e., could the survey have included cases that are actually not members of the population, or could it have excluded cases that actually are members)? If a sample has been drawn from the population (which was not recommended in this book), was it a correct probability sample and was no error made in the sampling procedures? Have scores been obtained for all members of the population or sample, so that non-response bias can be excluded? In an experimental test, were subjects randomly assigned to the experimental conditions? Were scores obtained from all subjects? After having reviewed each of these potential reasons for (partially) not having observed the correct set of cases, a conclusion can be drawn about the likelihood that this type of error could explain the test result. Now look equally critically at how scores were obtained. If data from a database have been used, were these scores valid and reliable? If data have been collected for the study, have there been deviations from the measurement protocol? Have the measurement procedures (as specified in the measurement protocol) resulted in valid and reliable scores? Could measurement error explain the test result? In an experiment, was the value of the independent concept manipulated in the correct manner, such that the experimental conditions actually and correctly generated or represented the different values of that concept? Has a pre-test (a measurement of the value of the dependent concept in each case or subject before the experimental treatment) been conducted (if appropriate)? Can errors in experimental design explain the test result?
Only if it can convincingly be argued that the test result is not an effect of selection error, sampling error, non-response error, measurement error, or design error can an interpretation of the test result be based on the assumption that the result is correct for the cases that have been selected for the test.
A critical evaluation of potential error must always be conducted, not only when the test result is unexpected and falls outside the range of previous test results. A test result that is expected and that confirms the results of other tests can also be erroneous. The study might even replicate errors that were made in previous tests, and the similar results of this test and the others might then be the effect of similar errors producing the same (but wrong) outcome. In other words, it would be a mistake to assume, without critical evaluation, that other tests (i.e., the tests discussed in the literature review and listed in your reconstruction of the replication history) have been error-free. Not only should the current test be evaluated very critically, but so should each of the other tests in the replication history. For each test it must be assessed to what extent its result might be explained by error.
REFERENCES
American Psychological Association (APA) (2010), Publication manual, Sixth Edition.
Jelke Bethlehem (2009), Applied survey methods, Hoboken (NJ): Wiley.
Boris Blumberg, Donald Cooper & Pamela Schindler (2008), Business research methods (Second European Edition), Maidenhead: McGraw-Hill.
Denny Borsboom, Gideon J. Mellenberg & Jaap van Heerden (2004), The concept of validity, Psychological Review, 111(4):1061-1071.
Geoff Cumming & Sue Finch (2001), A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions, Educational and Psychological Measurement, 61:532-574.
Jan Dul & Tony Hak (2008), Case study methodology in business research, Oxford: Butterworth-Heinemann (also available as an eBook: http://www.dawsonera.com/).
Andy Field & Graham Hole (2003), How to design and report experiments, London: Sage.
Sue Finch, Neil Thomason & Geoff Cumming (2002), Past and future American Psychological Association guidelines for statistical practice, Theory & Psychology, 12:825.
Floyd J. Fowler (2009), Survey research methods (Fourth Edition), Thousand Oaks (CA): Sage.
Robert M. Groves et al. (2004), Survey methodology, Hoboken (NJ): Wiley.
Tony Hak & Jan Dul (2009a), Pattern matching, in A.J. Mills, G. Durepos & E. Wiebe (Eds.), Encyclopedia of Case Study Research (pp. 663-665), Thousand Oaks (CA): Sage.
Tony Hak & Jan Dul (2009b), Replication, in A.J. Mills, G. Durepos & E. Wiebe (Eds.), Encyclopedia of Case Study Research (pp. 804-806), Thousand Oaks (CA): Sage.
Tony Hak, Kees van der Veer & Harrie Jansen (2008), The Three-Step Test-Interview (TSTI): An observation-based method for pretesting self-completion questionnaires, Survey Research Methods, 2:143-150.
Raymond Hubbard & R. Murray Lindsay (2002), How the emphasis on original empirical marketing research impedes knowledge development, Marketing Theory, 2(4):381-402.
Lyle V. Jones & John W. Tukey (2000), A sensible formulation of the significance test, Psychological Methods, 5(4):411-414.
Naresh K. Malhotra & David F. Birks (2003), Marketing research: An applied approach (Second European Edition), Harlow: Prentice-Hall.
Steven G. Rogelberg & Jeffrey M. Stanton (2007), Understanding and dealing with organizational survey nonresponse, Organization Research Methods, 10:195-209.
John R. Rossiter (2002), The C-OAR-SE procedure for scale development in marketing, International Journal of Research in Marketing, 19:305-335.
Andreas Schwab, Eric Abrahamson, William H. Starbuck & Fiona Fidler (2011), Researchers should make thoughtful assessments instead of null-hypothesis significance tests, Organization Science, 22(4):1105-1120.
Gordon B. Willis (2005), Cognitive interviewing: a tool for improving questionnaire design, Thousand Oaks (CA): Sage.
GLOSSARY
Candidate population: A candidate population is a member of a set of eligible and prioritized populations from the theoretical domain from which the researcher will select a population for a test in a survey.
Case: A case is an instance of a focal unit.
Case selection: Case selection is selecting a population of cases from a set of candidate populations, or selecting experimental units (subjects or other units) from a set of candidate units.
Causal relation: A causal relation is a relation between two variable attributes X and Y of a focal unit in which a value of X (or its change) permits, or results in, a value of Y (or its change). See Cause, Dependent concept, Effect, and Independent concept.
Cause: A cause is a variable attribute X of a focal unit of which the value (or its change) permits, or results in, a value (or its change) of another variable attribute Y (which is called the effect). See Causal relation, Dependent concept, Effect, and Independent concept.
Coding: Coding is categorizing data in order to generate scores.
Concept: A concept is a variable aspect of a focal unit as defined in a theory. See Dependent concept and variable, Independent concept and variable, Mediating concept and variable, Moderating concept and variable, and Variable.
Conceptual model: A conceptual model is a visual representation of a proposition in which the concepts are presented as blocks and the relation between them as an arrow. The arrow originates in the independent concept and points to the dependent concept.
Data: Data are the recordings of evidence generated in the process of data collection.
Data collection: Data collection is the process of (a) identifying and selecting one or more objects of measurement, (b) extracting evidence of the value of the relevant variable attributes from these objects, and (c) recording this evidence.
See Object of measurement.
Dependent concept: A dependent concept is a variable attribute Y of a focal unit of which the value (or its change) is the result of, or is permitted by, a value (or its change) of another variable attribute X (which is called the independent concept).
Dependent variable: A dependent variable is a variable Y which, according to a hypothesis, is an effect of an independent variable X.
Domain: A domain is the universe of instances to which theoretical statements apply. See Focal unit and Theoretical domain.
Ecological validity: Ecological validity is the degree of confidence that findings of an experimental study apply in non-experimental (real life) settings.
Effect: An effect is a variable attribute Y of a focal unit of which the value (or its change) is the result of, or is permitted by, a value (or its change) of another variable attribute X (which is called the cause). See Causal relation, Cause, Dependent concept, and Independent concept.
Evidence: Evidence is the information extracted from an object of measurement.
Expected pattern: An expected pattern is a specification of characteristics that the scores in a data matrix should have if the theory is correct for the cases in the matrix. It is a synonym of the term Hypothesis. Testing consists of comparing (matching) an Observed pattern with an Expected pattern. See Hypothesis, Observed pattern, Pattern matching, and Test.
Experiment: An experiment is a study in which the value of an independent concept is manipulated in at least two randomly assigned groups of instances of a focal unit and, next, the value of the dependent concept in each of these instances is observed.
Experimental research: Experimental research (or the experiment) is a research strategy in which the value of an independent concept is manipulated in at least two randomly assigned groups of instances of a focal unit and, next, the value of the dependent concept in each of these instances is observed.
External validity: External validity is the degree of confidence that the findings of a study apply to non-observed cases of the theoretical domain. See Internal validity.
Focal unit of a theory: A focal unit of a theory is the entity of which the range of values of one or more variable attributes is explained by that theory.
Hypothesis: A hypothesis is a specification of characteristics that the scores in a data matrix should have if the theory is correct for the cases in the matrix. It is a synonym of the term Expected pattern. See Expected pattern, Observed pattern, and Pattern matching.
Independent concept: An independent concept is a variable attribute X of a focal unit of which the value (or its change) permits, or results in, a value (or its change) of another variable attribute Y (which is called the dependent concept).
Independent variable: An independent variable is a variable X which, according to a hypothesis, is a cause of a dependent variable Y.
Internal validity: Internal validity is the degree to which the design of a study is believed to exclude alternative interpretations of the study findings. See External validity.
Measurement: Measurement is a process in which a score or scores are generated for analysis. It consists of (a) data collection and (b) coding. Measurement procedures must be valid and the resulting scores must be reliable. See Coding, Data collection, Measurement validity, Reliability, and Score.
Measurement validity: Measurement validity is the extent to which procedures of data collection and of coding can be considered to capture meaningfully the ideas contained in the concept of which the value is measured.
Mediating concept: A mediating concept is a concept that links the independent and the dependent concept in a proposition and which is necessary for the causal relation between the independent and the dependent concept to exist.
Mediating variable: A mediating variable is a variable that mediates the relation between the independent and the dependent variables in a hypothesis.
Moderating concept: A moderating concept is a concept that qualifies the relation between the independent and the dependent concepts in a proposition.
Moderating variable: A moderating variable is a variable that qualifies the relation between the independent and the dependent variables in a hypothesis.
Object of measurement: An object of measurement is an object that must be accessed in order to extract evidence of the value of a variable. An object of measurement is not the same as the focal unit. See Data collection and Measurement.
Observation: Observation is collecting empirical evidence about instances of a focal unit. It is a synonym of Measurement.
See Measurement.
Observed pattern: An observed pattern is a pattern of characteristics of the empirical scores in a data matrix. Testing consists of comparing (matching) an Observed pattern with an Expected pattern. See Expected pattern, Pattern matching, and Test.
Pattern matching: Pattern matching is comparing two or more patterns in order to determine whether they match (i.e., are the same) or do not match (i.e., differ). Pattern matching is a synonym of Test. See Expected pattern and Observed pattern.
Population: A population is a set of instances of a focal unit defined by one characteristic or by a set of characteristics.
Population selection: Population selection is selecting a population from a set of candidate populations for a survey.
Probability sample: A probability sample is a sample that is selected through a procedure of probability sampling. See Probability sampling.
Probability sampling: Probability sampling is a sampling procedure in which each member of the population has a fixed probabilistic chance of being selected. See Random sampling.
Proposition: A proposition is a theoretical statement about the relation between concepts. See Theory.
Random sample: A random sample is a sample that is selected through a procedure of random sampling. See Random sampling.
Random sampling: Random sampling is a sampling procedure in which each member of the population has an equal chance of being selected. See Probability sampling.
Reliability: Reliability is the degree of precision of a score.
Replication: Replication is conducting a test of a proposition in another population of instances of the focal unit, or with another selection of experimental subjects or units, or with another experimental treatment.
Replication strategy: A replication strategy is a plan for a series of replications. See Replication.
Research: Research is testing theoretical statements by collecting and analyzing evidence drawn from observation. See Observation.
Research objective: A research objective is a specification of the aim of a study.
Research strategy: A research strategy is a category of procedures for selecting instances of a focal unit and for data analysis. In this book two research strategies are distinguished: experimental research (the experiment) and survey research (the survey). They are characterized by different data matrices. See Experimental research and Survey research.
Sample: A sample is a set of cases selected from a population.
Sampling: Sampling is the selection of cases from a population.
Sampling frame: A sampling frame is a complete list of the members of a population. A sampling frame is required for probability sampling. See Probability sampling.
Score: A score is a value assigned to a variable by coding data.
Study: A study is a research project in which a research objective is formulated and achieved.
Support for a proposition: A proposition is said to be supported in a test if the hypothesis is confirmed.
Survey: A survey is a study in which values of concepts are observed in all members (or in a probability sample of members) of a population of instances of the focal unit. See Population and Sampling.
Survey research: Survey research (or the survey) is a research strategy in which values of concepts are observed in all members (or in a probability sample of members) of a population of instances of the focal unit. See Population and Sampling.
Test: A test of a proposition is determining whether a hypothesis that is deduced from the proposition is consistent with the pattern of scores obtained in a survey or experiment.
Theoretical domain: A theoretical domain is the universe of instances of a focal unit of a theory.
Theory: A theory is a set of propositions regarding the relations between the variable attributes (concepts) of a focal unit in a theoretical domain.
Theory-testing: Theory-testing is selecting one or more propositions for a test and conducting the test.
Theory-testing research: Theory-testing research is research with the objective to test propositions.
Variable: A variable is a measurable indicator of a concept in research. See Concept and Hypothesis.