
Experimental Design for Usability Testing

John W. Fleenor
Communications Programming Human Factors, Research Triangle Park, NC

Abstract
This paper discusses the theoretical and methodological aspects of experimental design as applied to usability testing of software products. The paper is intended to serve as a guide to those interested in conducting usability testing, especially the planners and administrators of usability tests.

Introduction
When planning a usability test, the experimental design is an important consideration that must not be overlooked. Experimental design is a plan that shows how the subjects will be tested (Chapanis & Van Cott, 1972). To conduct a test that provides a valid measure of the usability of a product, it is necessary to control for various effects that could bias the results. The best way to control for these unintended effects is to use good experimental design. According to Shneiderman (1980), controlled experimentation is fundamental in conducting scientific research, and knowledge of experimental design is basic to disciplines such as Human Factors and Psychology, which are schooled in the scientific method of investigation.

As Chapanis and Van Cott (1972) suggest, tests involving human subjects are much more complicated than simple engineering or physical tests. Human testing is more difficult because people are unpredictable and their behavior is constantly changing: people learn, become bored, and are influenced by previous occurrences. To get dependable results with human subjects, the human factors engineer must resort to techniques that are not often used by the physical scientist. These techniques, known as experimental methods, include the use of control groups, randomizing, and counterbalancing. Their purpose is to ensure that changes in the behavior of the test subjects do not systematically bias the findings of the test.

In usability testing, unfortunately, the principles of experimental design are too often overlooked, and because of this lack of experimental control, the findings of these tests are frequently questionable. Usability testing is usually conducted with small sample sizes because of time constraints in the development cycle of the product. Often a small number of subjects (perhaps as few as six or seven) are asked to use a product and to evaluate its usability. Measures of user performance, such as successful completion of task scenarios, are also typically taken. It is then assumed that these measures are representative of all potential users of the product (known as the population of users). Because of the small sample sizes, it is unwise to conduct inferential statistics on the obtained measures. Inferential statistics, which are based on samples, are used to make judgments about behavior in the population, and because there is sampling error in any measure, small samples are poor representations of large populations. This leaves descriptive statistics, such as averages and percentages, as the only meaningful data to report.

What we are left with are measures of the attitudes and performance of a small group of people, which may or may not represent the user population. Although these measures are descriptive of our sample, they may not tell us much about the potential users of the product. We could bring in another group of participants and obtain different results. What, then, can be said about the true usability of the product in question? Without experimental controls, we cannot be confident that the results of our test are reliable and valid measures of the usability of the product. The designers of usability tests have a duty to use as many experimental controls as possible. The best way to control for sampling error is to choose subjects that are similar to the target population and to use as many subjects as possible. It is important to follow the basic principles of experimental design so that your findings can be generalized to the population of users.
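As an illustration of the descriptive statistics recommended above, the following sketch (in Python, using entirely hypothetical task times and completion flags for seven participants) computes the kinds of summary figures a small-sample usability test can legitimately report:

```python
from statistics import mean, median, stdev

# Hypothetical data: task-completion times (seconds) and whether each
# of seven participants completed the task scenario.
times = [142, 97, 230, 118, 164, 101, 155]
completed = [True, True, False, True, True, True, False]

print(f"n               = {len(times)}")
print(f"mean time       = {mean(times):.1f} s")
print(f"median time     = {median(times):.1f} s")
print(f"std deviation   = {stdev(times):.1f} s")
print(f"completion rate = {100 * sum(completed) / len(completed):.0f}%")
```

These figures describe the sample only; as argued above, with n = 7 they should not be treated as estimates of the user population.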

Methodological Considerations and Theoretical Issues

According to Chapanis and Van Cott (1972), laboratory experimentation is the most powerful method of collecting data for human factors testing. The power of this method is derived from the experimental control that the investigator has over the variables under consideration. These variables can be manipulated under controlled conditions, which allows the experimenter to observe changes in the variables in which he or she is interested. If the investigator does not have control over the variables being studied, then the effects that are obtained cannot be attributed to the experimental manipulation.

Controlled experimentation in usability testing requires an initial hypothesis about human performance and a product. For example, we could hypothesize that users who are first exposed to a tutorial will be able to use the product better than those who do not get the tutorial. The experimental method requires that two groups of subjects (or two treatment conditions with the same subjects) be used. One group, called the experimental group, receives the treatment, which in this case is the tutorial. The other group, known as the control group, does not receive the treatment; this allows the control group to be used as a comparison group. The presence or absence of the tutorial is the independent variable, and the dependent variable is the measure of how well the subjects were able to use the product. In general, variables in an experiment fall into three classes: independent, dependent, and extraneous variables. Brief definitions of experimental variables and other important terms used in experimental design are presented below. For a more detailed description, an introductory text in Experimental Psychology, such as Myers (1980), is recommended.
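A minimal sketch of how the tutorial experiment described above might be organized. The subject pool, group sizes, and simulated session function are hypothetical placeholders, not a procedure from the paper:

```python
import random

random.seed(7)  # reproducible sketch

# Hypothetical pool of 14 recruited participants.
subjects = list(range(1, 15))
random.shuffle(subjects)        # random assignment (discussed later)

experimental = subjects[:7]     # receives the tutorial (independent variable)
control      = subjects[7:]     # no-treatment comparison group

def run_session(subject_id, tutorial):
    # Placeholder for a real test session: simulate the number of task
    # scenarios completed out of 10 (the dependent variable).
    score = random.gauss(6, 1.5) + (1.5 if tutorial else 0.0)
    return round(min(10, max(0, score)))

scores_experimental = [run_session(s, tutorial=True) for s in experimental]
scores_control      = [run_session(s, tutorial=False) for s in control]
print("experimental:", scores_experimental)
print("control:     ", scores_control)
```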

Variables
Variables are the things that vary in an experiment. A variable can take on different values along some dimension. The hypothesis of your study states a potential relationship between the independent and dependent variables.

Independent variables
The independent variable is the preceding condition (the treatment) that the investigator manipulates to assess its effect on the variable being studied. Independent variables are hypothesized to have a significant effect on the performance of the participants or on the product that is being studied. Significance, in the statistical sense, means that the obtained effects are not due to chance occurrences. The independent variable can be thought of as the cause in the cause-effect relationship. The experimenter includes certain independent variables in the test to see what effect they have. In usability testing, an example of an independent variable is the experience of the participants: the purpose of the experiment might be to show that inexperienced users make more of a certain type of error when using a product than do more experienced users.

Dependent variables
These variables are the measures of the subjective and performance outcomes of the test. Dependent variables are the effects in the cause-effect relationship. Usually, the criterion measures in a usability test are the dependent variables.

Extraneous variables
These are variables, other than the independent and dependent variables, that are not the main focus of the experiment. If left uncontrolled, they will confound the results by increasing error in the measurement of the dependent variables. Extraneous variables include experimenter effects, differences among subjects, equipment failures, and inconsistent instructions. Anything that varies across treatment conditions other than the independent variable can be an extraneous variable. An example in usability testing is the response time of the host computer: slow response time may have an unintended effect on the performance of the participants and cause the results of the test to be invalid. Statisticians refer to the problem of extraneous variables as confounding. When conducting usability testing, the investigator must ensure that the variables of interest are not confounded by such extraneous effects.
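To make confounding concrete, here is a hypothetical simulation (not from the paper) in which both groups have identical ability, but the control group happens to be tested on a slower host; the apparent group difference is produced entirely by the extraneous variable:

```python
import random

random.seed(1)

# Both groups have the SAME true ability; only the host delay differs.
def task_time(host_delay):
    # Hypothetical: about 30 system interactions per task scenario.
    return random.gauss(120, 15) + 30 * host_delay

experimental = [task_time(host_delay=0.5) for _ in range(7)]  # fast host
control      = [task_time(host_delay=2.0) for _ in range(7)]  # slow host

print(f"experimental mean: {sum(experimental) / 7:.0f} s")
print(f"control mean:      {sum(control) / 7:.0f} s")
# The roughly 45 s gap looks like a treatment effect, but it is entirely
# an artifact of response time; holding the host constant would remove it.
```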

Reliability and Validity


Reliability
Reliability means that the measures obtained from the test are consistent and stable. To be reliable, the test procedures must be clearly and simply defined. The more accurate our measures are, the more likely they are to be reliable. For example, we could weigh an object on a bathroom scale; the scale should show that the object weighs the same amount each time we put it on the scale. If not, the measures of weight obtained from the scale are not reliable. In usability testing, if several observers take measures of the same responses, then there should be a high level of agreement among the observers. If not, chances are that the measures are not reliable and are, therefore, not good indicators of usability.

Validity
Besides being reliable, a measure must also be valid. Validity refers to the principle of measuring the variables in which the investigator is actually interested. If we are interested in the usability of a product, then we must be sure that the measures we are taking are good indicators of ease of use. If not, we may be getting measures of something that has no effect on usability. Often the participants are asked to give subjective ratings on the usability of a product. We cannot be certain that these ratings are valid and reliable indicators of the actual usability of the product: when a subject is asked to reduce his or her impression of usability to a number on a five-point scale, much of the information value of the response is lost. Subjective impressions, however, can be a valuable source of information about the usability of a product. The use of subjective measures in usability testing is discussed in detail later in this paper.

Internal and external validity
Campbell and Stanley (1963) discuss two types of validity: internal and external. Internal validity is an indication of the soundness of the procedures within an experiment. External validity is the extent to which the findings of the study can be generalized to the "real world". In usability testing, both concepts are extremely important. First, we must ensure that our test is internally valid; that is, the test procedures must be well thought out and our measures must be reliable. (A test cannot be valid if it is not reliable.) Second, the study should be externally valid: we should be able to generalize our findings beyond the lab to the population of potential users of the product. Otherwise, our usability test is useless; if we cannot say with some degree of confidence that our findings apply to users in general, then we have accomplished very little.
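As an illustration of checking inter-observer agreement, the sketch below compares two observers' hypothetical error counts for the same seven sessions. The agreement proportion and correlation are common reliability indices, not measures prescribed by the paper (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical error counts recorded independently by two observers
# watching the same seven test sessions.
observer_a = [3, 5, 2, 8, 4, 1, 6]
observer_b = [3, 4, 2, 9, 4, 1, 6]

agreement = sum(a == b for a, b in zip(observer_a, observer_b)) / len(observer_a)
r = correlation(observer_a, observer_b)

print(f"exact agreement: {agreement:.0%}")  # proportion of identical counts
print(f"correlation r:   {r:.2f}")          # high r suggests reliable measures
```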

Selecting Subjects
Test subjects
To run the tests, a group of people is needed to serve as subjects. In usability testing the subjects are usually called participants, since it is the product, rather than the people, that is being tested. Whatever they are called, the subjects must be similar to the user population for the product being tested. If we are testing a product designed for computer operators but use network operators as our subjects, then we cannot generalize the findings to the target population for the product. An important rule to remember is to "KNOW YOUR USER" and choose test subjects that represent your users. Subjects should be randomly selected from the target population; they should not be selected on the basis of their ability.
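A minimal sketch of random selection, assuming a hypothetical pool of volunteer computer operators:

```python
import random

# Hypothetical pool of 60 volunteers drawn from the target population
# (computer operators), selected by chance rather than ability.
candidate_pool = [f"operator_{i:03d}" for i in range(1, 61)]
participants = random.sample(candidate_pool, k=8)  # pick 8 at random
print(participants)
```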

On the number of test subjects
Often when designing a usability test, the investigator is unsure of how many participants are necessary to have a valid test. The number of subjects in an experiment is called "n" (for number) or the "sample size". There is no easy answer to the question of how many subjects are required to have a valid test. An excellent overview of this problem is presented by Chapanis and Van Cott (1972):

The issue here is that one should use enough test subjects to get dependable results, but not so many subjects that he increases the length and costs of the tests unnecessarily. If dependable results can be obtained with 10 test subjects, it is wasteful and inefficient to test 15. There is, unfortunately, no easy way of deciding in advance how many test subjects should be recruited for a particular test. Although statistical formulas can be derived to make such forecasts, they are often more of theoretical than practical interest. The difficulty with most such equations is that they require certain numerical values - for example, an estimate of the so-called population variability in performance on some measure - that are seldom known in advance, at least for tests and evaluations involving human subjects. If the data of a test will be analyzed with certain simple, so-called nonparametric tests of significance - for example, a sign test or a binomial test - then it is possible to make some reasonably good forecasts about the minimum number of subjects that will be required. Even these predictions, however, involve some fairly intricate arguments about acceptable levels of Type I and Type II errors, and the probable true difference in the population. These are too complicated to try to present in succinct form here. The test planner is advised to consult a statistician for advice on this aspect of his plans. In the absence of more precise quantitative guides, the best approach is to get advice from someone experienced in the design and conduct of human tests. Veteran experimenters, in the course of years of experimentation, learn that some kinds of experiments require more subjects than others. Moreover, they learn "about how many" subjects will be required for this or that kind of test. Although these are, to be sure, subjective impressions, they often turn out to be reasonably good predictions. (pp. 714-715)

As this explanation shows, the statistical method for determining the optimum number of subjects is a rather complicated process. When planning a usability test, it is a good idea to consult with a human factors engineer to get an estimate of the number of subjects that you will need.
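The sketch below shows one way to make the kind of sign-test forecast Chapanis and Van Cott allude to, assuming a one-sided significance level of .05, a power goal of .80, and a guessed true probability of .75 that a subject performs better under the treatment. All three values are assumptions to be varied, and a statistician should still be consulted:

```python
from scipy.stats import binom

alpha, power_goal, p_true = 0.05, 0.80, 0.75  # assumed values

for n in range(5, 101):
    # Smallest count k such that P(X > k | H0: p = 0.5) <= alpha;
    # the sign test rejects H0 when more than k subjects improve.
    k = binom.isf(alpha, n, 0.5)
    # Power: probability of exceeding k when the true probability is p_true.
    power = binom.sf(k, n, p_true)
    if power >= power_goal:
        print(f"minimum n for the sign test: {n} (power = {power:.2f})")
        break
```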

Controlling Extraneous Variables


Experimental group
This group contains the subjects that are in the experimental condition. These subjects receive the treatment, which is a specific set of preceding conditions designed by the experimenter to test its effects on the subjects' behavior.

Control group
The subjects in the control group are not exposed to the experimental manipulation. In the control condition, we carry out exactly the same test procedures that were followed in the experimental condition, except that the control group does not receive the treatment. For this reason, the control condition is also called the "no-treatment" condition. We need the control group to find out how the subjects perform on the dependent measure without the benefit of the treatment. Without the control group, we could not tell whether the experimental group did better or worse than usual.

Random assignment of subjects
To have a true experimental design, it is necessary that the subjects be randomly assigned to treatment conditions. Random assignment means that every subject has an equal chance of being placed in each condition. If subjects are not randomly assigned, confounding may occur: we may unwittingly put all the "best" subjects in the control group, which will bias our results. Placing subjects in randomly assigned groups is the best way to ensure that we have roughly equivalent groups. Random assignment controls for differences that exist between subjects before the experiment.

Counterbalancing
In usability testing, subjects are sometimes asked to evaluate several different products and to compare the products for usability. If the products are always presented to participants in the same order, then the order itself may explain the outcome. People tend to give lower ratings to the first product they encounter; it has been hypothesized that this is because of the consumer habit of "shopping around for the best buy". In this case, "order effects" are confounded with any possible treatment effect. Two kinds of order effects are especially common. First, performance can decline as the subjects become tired or bored with the tasks; the subjects' performance may be better on the first product because they had become fatigued by the time the second product was presented. On the other hand, performance may improve as the test goes on; subjects may develop strategies for completing the tasks as they go through the test, and their performance may be better on the second product because of what they learned from using the first. Several procedures for controlling order effects have been developed. These procedures are known as counterbalancing and involve changing the order in which the two treatment conditions are presented to each subject. One subject receives one product to evaluate first, then the other product; the next subject receives the other product first, and so on. When multiple treatment conditions are used, a relatively large number of subjects (possibly more than 20) may be necessary to achieve complete counterbalancing.
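A small sketch of complete counterbalancing for two products; the subject labels are hypothetical, and with more conditions the number of orders grows factorially:

```python
from itertools import cycle, permutations

# Alternate the two possible presentation orders across subjects.
products = ["Product A", "Product B"]
orders = cycle(permutations(products))   # (A, B), (B, A), (A, B), ...

subjects = [f"S{i}" for i in range(1, 9)]
schedule = {s: next(orders) for s in subjects}

for subject, order in schedule.items():
    print(subject, "->", " then ".join(order))
# With k conditions there are k! orders, which is why complete
# counterbalancing of multiple treatments can require 20+ subjects.
```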
Analyzing Results
When planning a test, it is necessary to consider how the data will be analyzed. Some designs result in data that cannot be legitimately analyzed by any type of statistical analysis. The type of analysis to be conducted should be decided before the data are collected; if the data are not collected properly, it may not be possible to calculate the desired statistics (Chapanis & Van Cott, 1972). In usability testing, we often calculate the mean, or average, score for the test subjects. Sometimes the standard deviation, which is an indication of the variability of the scores among subjects, is calculated; the standard deviation shows how different the values in the group are from the group mean. Less frequently, we determine the median score for the subjects in a usability test. The median is the point at which half of the subjects scored higher and the other half scored lower. It is affected less by extreme scores than is the mean, and should be reported more often in usability test results. When means are calculated, they are often compared to the scores of another group or against some criteria. A higher average score for one group does not necessarily mean that there are significant differences between the two groups.

Because a few high or low scores in a group can distort the results, experimenters often rely on statistical methods such as t-tests to demonstrate significant differences between groups. The t-test is a statistical test that takes into account the average difference between the groups and the variation in performance among members of each group. As stated previously, it is often not feasible to conduct a statistical test of significance with the small sample sizes typically used in usability testing. This brings up the problem of practical versus statistical significance. Our results are statistically significant when there is a high probability that the difference between the groups occurred because of the experimental manipulation (Myers, 1980). The results obtained in a usability test may not reach statistical significance because of the small sample size, but may still be considered a practical indication of the usability of the product. When statistical significance is not reached, the researcher must decide how much faith to have in the results.
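For example, a two-sample t-test on hypothetical scores (using scipy.stats.ttest_ind) might look like this; with only seven subjects per group, the result should be read with the cautions above in mind:

```python
from scipy.stats import ttest_ind

# Hypothetical usability scores (task scenarios completed) for two groups.
experimental = [8, 7, 9, 6, 8, 7, 9]
control      = [6, 7, 5, 6, 8, 5, 6]

t, p = ttest_ind(experimental, control)
print(f"t = {t:.2f}, p = {p:.3f}")
# p < .05 is conventionally called statistically significant, but with
# n = 7 per group a non-significant p may still leave room for a
# judgment of practical significance, as discussed above.
```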

The determination of practical significance relies mostly on the judgment of the experimenter. Factors that influence practical significance include the sample size, how close the results are to statistical significance, and the design of the experiment. For guidance in this situation, someone with training and experience in conducting human factors testing should be consulted.

Subjective Measures
Performance measures in a usability test can be complemented by subjective measures of user satisfaction, preference, and confidence. Subjective responses are, unfortunately, easily influenced by extraneous factors, vary with personality, and are not easily replicable, all of which are indications of the unreliability of the measures. Subjects may try to please the experimenter by giving favorable responses, may feel confident of their comprehension even when performance is poor, or may generally be optimistic and report satisfaction with whatever is offered (Shneiderman, 1980).

In usability testing, subjects often complete a questionnaire on which they rate statements about the usability of a product on a scale from 1 (high) to 5 (low). These ratings are then averaged across subjects and compared to pre-determined criteria; if the average rating is better than the criteria, the product "passes". These ratings are subject to all the threats to validity and reliability discussed previously and should be interpreted with caution. Taking the average of the ratings can "wash out" individual differences in the measures (e.g., high and low ratings cancel each other). We cannot know, then, whether the average rating of our sample is anywhere near the population mean. (The population mean is the average rating that would be obtained if all potential users of the product completed the questionnaire.)

Subjective measures may, however, provide useful information despite their questionable reliability. If the subjective measures are correlated with performance measures, they provide additional support (or non-support) for the usability of the product. If subjective measures are favorable, we may want to continue development of a product even if performance measures are poor; users may still like the product even if they have trouble learning to use it. A technique known as bipolar scaling has been modified by Beith (1986) to assess subjective ratings of usability with more accuracy. Bipolar scales appear to be much more valid and reliable measures of subjective usability than are five-point scales. The usability test planner may wish to investigate the possibility of using bipolar scales to measure subjective usability.

The participants' comments concerning the usability of a product are often collected by using open-ended questionnaires, which allow the subjects to express their impressions of the usability of the product in their own words. The information obtained from these questionnaires can be supplemented by an interview with the test administrator. Often, the subjects' comments on usability supply more information about the product than do the subjective ratings. Open comments are, however, difficult and costly to analyze.

Subjective measures are useful as an additional source of information concerning the usability of the product. They should not, however, be used alone as proof that a product is usable. Before a product can be judged as having good usability, supporting evidence is required from performance measures and open-ended comments.
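To illustrate how averaging can wash out individual differences in ratings, the following sketch contrasts two hypothetical sets of questionnaire ratings with identical means:

```python
from statistics import mean, stdev

# Two hypothetical sets of 1-5 usability ratings with the SAME mean.
uniform   = [3, 3, 3, 3, 3, 3]   # everyone lukewarm
polarized = [1, 1, 1, 5, 5, 5]   # strong likes and dislikes cancel out

for name, ratings in [("uniform", uniform), ("polarized", polarized)]:
    print(f"{name:9s} mean = {mean(ratings):.1f}, stdev = {stdev(ratings):.1f}")
# Identical means hide very different user reactions; report the spread
# (and the individual comments) alongside the average rating.
```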

References
Beith, B.H. (1986, May). The Use of NASA Scaling Procedures in IBM Information Testing. IBM Internal Report, Research Triangle Park, NC.
Campbell, D.T., & Stanley, J.C. (1963). Experimental and Quasi-Experimental Designs for Research. Boston: Houghton Mifflin.
Chapanis, A., & Van Cott, H. (1972). Human engineering tests and evaluations. In H. Van Cott & R. Kincade (Eds.), Human Engineering Guide to Equipment Design. Washington, DC: U.S. Government Printing Office.
Myers, A. (1980). Experimental Psychology. New York: D. Van Nostrand.
Shneiderman, B. (1980). Software Psychology: Human Factors in Computer and Information Systems. Cambridge, MA: Winthrop.
