THE HUMAN-COMPUTER INTERACTION HANDBOOK: Fundamentals, Evolving Technologies and Emerging Applications. Julie A. Jacko (Georgia Institute of Technology) and Andrew Sears (UMBC), Editors. Lawrence Erlbaum Associates, Publishers. Mahwah, New Jersey, and London, 2003.

56. USER-BASED EVALUATIONS

Joseph S. Dumas, Oracle Corporation

Chapter contents: Introduction; User-Administered Questionnaires (Off-the-Shelf Questionnaires); Observing Users; Empirical Usability Testing (The Focus Is on Usability; The Participants Are End Users or Potential End Users; There Is a Product or System to Evaluate; The Participants Think Aloud As They Perform Tasks; The Participants Are Observed, and Data Are Recorded and Analyzed; Measures and Data Analysis; Communicating Test Results); Variations on the Essentials (Measuring and Comparing Usability; Comparing the Usability of Products; Baseline Usability Tests; Allowing Free Exploration); Challenges to the Validity of Usability Testing (How Do We Evaluate Usability Testing?; Why Can't We Map Usability Measures to User Interface Components?; Are We Ignoring the Operational Environment?; Why Don't Usability Specialists See the Same Usability Problems?); Additional Issues (How Do We Evaluate Ease of Use?; How Does Usability Testing Compare With Other Evaluation Methods?; Is It Time to Standardize Methods?; Are There Ethical Issues in User Testing?; Is Testing Web-Based Products Different?); The Future of Usability Testing; Which User-Based Method to Use?; References.

INTRODUCTION

Over the past 20 years, there has been a revolution in the way products, especially high-tech products, are developed. As the other chapters in this handbook show, evaluation and design now are integrated. It is no longer accepted practice to wait until the end of development to evaluate a product, and user participation is no longer postponed until just before the product is in its final form. Prototyping software and the acceptance of paper prototyping make it possible to evaluate designs as early concepts and then throughout the detailed design phases.

That revolution applies to evaluating usability. Early user involvement has blurred the distinction between design and evaluation. In this chapter, I focus on user-based evaluations, which are evaluations in which users directly participate. I concede that the boundary between design methods and evaluation methods grows less distinct with time, and the boundary between user-based and other evaluation methods is becoming less distinct as well: brief usability tests are often part of participatory design sessions; users are sometimes asked to participate in early user interface design walkthroughs, such as the pluralistic walkthrough (Bias, 1994); and, occasionally, usability inspection methods and user-based methods merge. Although the focus of this chapter is on user-based evaluation methods, I maintain the somewhat artificial distinction between user-based and other evaluation methods to treat user-based evaluations thoroughly.

In this chapter, I describe three user-based methods: user-administered questionnaires, observing users, and empirical usability testing. I describe when to use each method. Finally, in the last section of the chapter, I consider which user-based method to use.

USER-ADMINISTERED QUESTIONNAIRES

A questionnaire can be used as a stand-alone measure of usability, or it can be used along with other measures. For example, a questionnaire can be used at the end of a usability test to measure the subjective reactions of the participant to the product tested, or it can be used as a stand-alone usability measure of the product. Creating a valid and reliable questionnaire to evaluate usability takes considerable effort and specialized skills, skills in which most usability professionals don't receive training. The steps involved in creating an effective questionnaire include the following (a brief computational sketch follows this list):

• Create a number of questions or ratings that appear to tap the attitudes or opinions that you want to measure. At the beginning of the process, the more questions you can create, the better. For example, the questions might focus on a product's overall ease of use.

• Use item analysis techniques to eliminate the poor questions and keep the effective ones. For example, if you asked a sample of users to use a product and then answer the questions, you could compute the correlation between each question and the total score of all of the questions. You would eliminate questions with low correlations. You would also look for high correlations between two questions, because this indicates that the questions may be measuring the same thing; you could then eliminate one of the two. You would also eliminate questions with small variances, because nearly all of the respondents are selecting the same rating value or answer.

• Assess the reliability of the questionnaire. For example, you could measure test-retest reliability by administering the questionnaire twice to the same respondents, far enough apart in time that respondents would be unlikely to remember their answers from the first time. You could also measure split-half reliability by randomly assigning each question to one of two sets of questions, then administering both sets and computing the correlation between them (Gage & Berliner, 1991).

• Assess the validity of the questionnaire. A questionnaire is valid when it measures what it is supposed to measure, so a questionnaire created to measure the usability of a product should do just that. Validity is the most difficult aspect to measure but is an essential characteristic of a questionnaire (Chignell, 1990). Demonstrating that it is valid takes some ingenuity. For example, it should correlate highly with questionnaires with known validity. Or test scores from users should correlate with usability judgments of experts about the product; if the correlations are low, either the test is not valid or the users and the experts are not using the same process. Similarly, if the questionnaire is applied to two products that are known to differ on usability, the test scores should reflect that difference.
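The item analysis and reliability checks in the list above are simple computations. The following is a minimal sketch of how they might be carried out; the array name, example data, and cutoff values are illustrative assumptions, not recommendations from this chapter.

import numpy as np

# Illustrative data: 40 respondents rating 12 draft questions on a 1-7 scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(40, 12)).astype(float)

def item_total_correlations(data):
    """Correlate each question with the total score of the remaining questions."""
    corrs = []
    for i in range(data.shape[1]):
        rest_total = data.sum(axis=1) - data[:, i]   # total score excluding question i
        corrs.append(np.corrcoef(data[:, i], rest_total)[0, 1])
    return np.array(corrs)

def split_half_reliability(data, seed=1):
    """Randomly split the questions into two halves and correlate the half scores."""
    order = np.random.default_rng(seed).permutation(data.shape[1])
    half_a, half_b = order[::2], order[1::2]
    return np.corrcoef(data[:, half_a].sum(axis=1),
                       data[:, half_b].sum(axis=1))[0, 1]

item_r = item_total_correlations(responses)
drop_low_r = np.where(item_r < 0.3)[0]                   # candidates to drop: low item-total correlation
drop_low_var = np.where(responses.var(axis=0) < 0.5)[0]  # candidates to drop: nearly everyone answers alike
print("item-total correlations:", np.round(item_r, 2))
print("low-correlation items:", drop_low_r, "low-variance items:", drop_low_var)
print("split-half reliability:", round(split_half_reliability(responses), 2))

With real rating data, the same few lines also support the other checks described above: np.corrcoef(responses, rowvar=False) exposes pairs of questions that correlate highly with each other, and administering the questionnaire twice and correlating the two total scores gives the test-retest estimate.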
Over the past 20 years, there have been two objectives for questionnaires developed to measure usability: (a) create a short questionnaire to measure users' subjective evaluation of a product, usually as part of another evaluation method, such as the usability questionnaires discussed in the following subsections, and (b) create a questionnaire to provide an absolute measure of the subjective usability of a product, that is, a numerical measure of the usability of a product that is independent of its relationship to any other product. This second objective parallels the effort in usability testing to find an absolute measure of usability.

Throughout this history, there have been questionnaires that

• Measure attitudes toward individual products
• Break attitudes down into several smaller components, such as ease of learning
• Measure just one aspect of usability (Spenkelink, Beuijen, & Brok, 1993)
• Measure attitudes that are restricted to a particular technology, such as computer software
• Measure more general attitudes toward technology or computers (Igbaria & Parasuraman, 1991)
• Are filled out after using a product only once (Doll & Torkzadeh, 1988)
• Assume repeated use of a product
• Require a psychometrician for interpretation of results (Kirakowski, 1996)
• Come with published validation studies
• Provide comparison norms to which one can compare results
Off-the-Shelf Questionnaires

Because an effective questionnaire takes time and special skills to develop, usability specialists have been interested in using off-the-shelf questionnaires that they can borrow or purchase. These questionnaires usually have been developed by measurement specialists who assess the validity and reliability of the instrument as well as the contribution of each question, and the advantages of using a professionally developed questionnaire are substantial. Historically, two types of questionnaires have been developed: (a) short questionnaires that can be used to obtain a quick measure of users' subjective reactions, usually to a product that they have just used for the first time, and (b) longer questionnaires that can be used alone as an evaluation method and that may be broken out into more specific subscales.

Short Questionnaires. There have been a number of published short questionnaires. The System Usability Scale (SUS) has 10 questions (Brooke, 1996). It was created by a group of professionals then working at Digital Equipment Corporation, and it can be applied to any product, not just software. The 10 SUS questions have a Likert scale format: a statement followed by a five-level agreement scale. For example:

I think that I would like to use this system frequently.
Strongly disagree  1   2   3   4   5  Strongly agree

Brooke (1996) described the scale and the scoring system, which yields a single, 100-point scale.
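Brooke's published scoring scheme is simple enough to show in a few lines. The sketch below assumes the standard published rules (odd-numbered items score as the rating minus 1, even-numbered items as 5 minus the rating, and the sum is multiplied by 2.5 to give a 0-100 score); the function name and example ratings are illustrative.

def sus_score(ratings):
    """Compute a single 0-100 SUS score from ten 1-5 ratings, item 1 first."""
    if len(ratings) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(ratings, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items: r - 1; even items: 5 - r
    return total * 2.5

# One respondent's ratings for items 1 through 10:
print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))      # prints 85.0

Because each item contributes 0 to 4 points, the result is interpreted only as a single overall figure, consistent with the 100-point scale just described, not as a set of subscales.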
A three-item questionnaire was developed by Lewis (1991). The three questions measure the users' judgment of how easily and quickly tasks were completed. It can be used as a stand-alone evaluation or as part of a user test. A somewhat longer questionnaire is the Computer User Satisfaction Inventory (CUSI), developed to measure attitudes toward software applications (Kirakowski & Corbett, 1988). It has 22 questions that break into two subscales: affect (the degree to which respondents like the software) and competence (the degree to which respondents feel they can complete tasks with the product).

Stand-Alone Questionnaires. These questionnaires were developed to measure usability as a stand-alone method. They have many questions and attempt to break users' attitudes into a number of subscales.

The Questionnaire for User Interaction Satisfaction (QUIS) was developed at the Human-Computer Interaction Lab (HCIL) at the University of Maryland at College Park (Chin, Diehl, & Norman, 1988). QUIS was designed to assess users' subjective satisfaction with several aspects of the human-computer interface, and it has been used by many evaluators over the past 10 years, in part because of its inclusion in Shneiderman's (1997) editions. It consists of a set of general questions, which provide an overall assessment of a product, and a set of detailed questions about interface components. There is a long form of QUIS (71 questions) and a short form (26 questions). Version 7.0 of the questionnaire contains a set of demographic questions, a measure of overall system satisfaction, and hierarchically organized measures of 11 specific interface factors: screen factors, terminology and system feedback, learning factors, system capabilities, technical manuals, online tutorials, multimedia, voice recognition, virtual environments, Internet access, and software installation. Each question uses a 9-point rating scale, with the end points labeled with adjectives. For example:

Characters on the screen are:
Hard to read  1  2  3  4  5  6  7  8  9  Easy to read

Because QUIS's factors are not always relevant to every product, practitioners often select a subset of the questions to use or use only the general questions. There is a Web site for QUIS (www.lap.umd.edu/QUIS/index.html); the site also contains references to evaluations that have used QUIS.

The Software Usability Measurement Inventory (SUMI) was developed to evaluate software only (Kirakowski, 1996). The SUMI questionnaire consists of 50 statements to which users reply that they either agree, are undecided, or disagree. For example:

• The instructions and prompts are helpful.
• The way that system information is presented is clear and understandable.
• I would not like to use this software every day.
• This software responds too slowly to inputs.

It is a well-constructed instrument that breaks the answers into six subscales: global, efficiency, affect, helpfulness, control, and learnability. The Global subscale is similar to QUIS's general questions. Despite its length, SUMI can be completed in about 5 minutes, although it does assume that the respondents have had several sessions working with the software.

SUMI's strengths come from its thorough development. Its validity and reliability have been established. In addition, its developers have created norms for the subscales so that you can compare your software against similar products; the norms come from several thousand respondents. For example, you could show that the product you are evaluating scored higher than similar products on all of the subscales. SUMI has been used in development environments to set quantitative goals, track achievement of goals during product development, and highlight good and bad aspects of a product. It has been applied not only to new software under development, but also to compare software products and to establish a usability baseline.

The developers recommend that the test be scored by a trained psychometrician; for a fee, the developer will do the scoring and the comparison with norms. Licenses for use are available for a few hundred dollars. The license comes with 50 questionnaires in the language of your choice, a manual for scoring the questions and using the norms, and software for scoring the results and creating reports.

OBSERVING USERS

Although observing users is a component of many evaluation methods,
this section focuses on observation as a standalone evaluation method. 1996). The 1982 conference. this method can be used in situations in which an observer can't be present when users are working. The sessions are self-paced but quite short (5-10 minutes). Young and Barnard (1987) proposed the concept of scenarios instead of experiments. the data must be extracted from the videotapes. For example. Whether any of these questionnaires can provide an absolute measure of usability remains to be demonstrated. It is difficult to create a good one. The longer ones can be used to establish a usability baseline and to track progress over time. which is a direct challenge to the validity of observation. The reports of these studies were written in the style of experimental psychology reports. in fact it is a problem with any user-based evaluation method. The first books on HCI began appearing at this time.ucc. Participants are shown how to turn on the equipment and told to do so whenever they work. but there are several that have been well constructed and extensively used. Maryland. The explosion of end user computing was made possible by new hardware and software in the form of both the mini. This problem is not unique to observation. Hartevelt. where the most an evaluator can do is watch the participants.1096 • DUMAS creating reports. and 2 years later. Perhaps the most influential book on usability. This method can be used during any stage of product development and not just for evaluation. observation has several limitations when used alone (Baber & Stanton. • The observer is unable to control when events occur. when ready. including sections titled "Experimental Design" and "Data Analysis. which moved computing from the isolated computer room to the desktop. In addition. Its advocates claim that participants enjoy this method and that it yields a great deal of useful data. such as watching users through a one-way mirror during a usability test. The participant is given a product and asked to go into the room and. such as asking them to talk about what they like and dislike about the product. 1996). it is difficult to evaluate the usability of a product with this method.ie/ hfrg/questionnaires/sumi/index. Still. The instructions on what to talk about are quite general. and for most usability specialists. The rationale behind this passive video observation is the assumption that a video camera can be less intrusive than a usability specialist but more vigilant. 1982).and microcomputer and expansion of communications technology. Even the longest questionnaires can be completed in 10 minutes or less. we still don't know whether participants act differently because they know they are being taped. As with usability testing. The product of the sessions is a videotape that must be watched and analyzed. Hindrance or Ho-Hum?"(Wolf. held at Gaithersburg. But the reliance on psychological research experiments as a model for usability testing was challenged early. The method requires only a private room and a video camera with a microphone. for example. Subsequent meetings of this group became known as the Computer-Human Interaction (CHI) Conference. Hence. The two images are mixed and recorded. 1989). Some products can only be evaluated in their use environment. it is not always clear what caused a behavior. important events may never occur while the observer is watching. Although passive video capture is done without a usability specialist present. 
Baber and Stanton provide guidelines for using observation as an evaluation method. Indeed. A related method has been described by Bauersfeld and Halgren (1996). one could evaluate any product by observing its use and recoding what happens. which takes as much time to watch as if the participant were being observed directly. At that first meeting. EMPIRICAL USABILITY TESTING Usability testing began in the early 1980s at a time when computer software was beginning to reach a wider audience than just computing professionals. brought together for the first time professionals interested in studying and understanding human-computer interaction. using them is preferable to creating their own. a video camera is set up in the user's work environment. including the following: • It is difficult to infer causality while observing any behavior. there was a session on evaluating text editors that described early usability tests (Ledgard. • Observers often see what they want to see." in which the computation of inferential statistics was described.html. Questionnaires can play an important role in a toolkit of usability evaluation methods. The Web site for SUMI is http://www. did not have a . Human Factors in Computer Systems. Unfortunately. Shneiderman's (1987) first edition of Designing the User Interface. Participants are told to ignore the cameras as much as possible and to work as they normally would. Because the observer is not manipulating the events that occur. CHI Conference writers were discussing issues such as "The Role of Laboratory Experiments in HCI: Help. if you were evaluating new software for stock trading. turn on the camera and talk. The short ones can be used as part of other evaluation methods. the richness of the verbal protocol is enhanced when two or more people who know each other participate together. • Participants change their behavior when they know they are being observed. Because the participants are allowed to be creative and do not have to follow a session protocol. you could implement it and then watch trading activity as it occurs. " and watching participants think out loud fits a stereotype some people have about what a research study looks like. • The data are recorded and analyzed. comparisons between usability testing and research continue.56. When the test has another purpose. but it is certainly invalid and should not be called a usability test. most often they are referring to a diagnostic test. and additional issues. it has a qualifier such as comparison or baseline. Shneiderman wrote the following: Scientific and engineering progress is often stimulated by improved techniques for precise measurement. The two components of a usability test that are most often missing from a focus group are (a) a primary emphasis is not on usability and (b) the participants do not perform tasks during the session. It is missing one of the essentials: potential users. The other common misconception about the purpose of a test is to view it as a research experiment. but people who read the test report may draw inappropriate conclusions. or a prototype of either. When testers use the term usability test with no qualifier. important variations on the essentials... called "quick and dirty" and "informal. The research studies by Virzi (1990. A word that is often used to qualify a test is informal. Yet. • The participants are end users or potential end users. 1999). 1994) explicitly presented usability testing as a method separate from psychological research. 
the question is not appropriate. a usability test looks like research. Rapid progress in interactive systems design will occur as soon as researchers and practitioners evolve suitable human performance measures and techniques. such as a product design. inappropriate purposes or call other methods usability tests. 128) This brief history shows that usability testing has been an established evaluation method for only about 10 years. But a usability test is not a research study (Dumas. People new to userbased evaluation jump to the conclusion that talking with users during a test is like talking with participants in a focus group. The fact is. But if the question is added to see if customers would buy the product. Perhaps the most common mismatch is between usability and • 1097 marketing and promotional issues. A Usability Test Is Not a Focus Group. The participants' answers could provide an excuse for ignoring the usability problems. a system." These tests "can be run to compare design alternatives.. The Focus Is on Usability It may seem like an obvious point that a usability test should be about usability. It is best not to include such questions or related ones about whether customers would use the manual. perhaps the most used and abused empirical method of all time. When Informal Really Means Invalid. Academic and industrial researchers are discovering that the power of traditional scientific methods can be fruitfully employed in studying interactive systems. but five of the six participants say that they would buy it. Still we need words to describe diagnostic tests that differ from each other in important ways. Such a test may be informal in some sense of that word. Shneiderman described usability tests but called them "pilot tests.1992) on the relatively small number of participants needed in a usability test gave legitimacy to the notion that a usability test could identify usability problems quickly. but there is one for usability laboratories. and a focus group is not a usability test unless it contains the six essential components of a test. the question is appropriate. In the 1997 edition. Thomas (1996) described a method. While academics were developing controlled experiments to test hypotheses and support theories. Shneiderman wrote: Usability-laboratory advocates split from their academic roots as these practitioners developed innovative approaches that were influenced by advertising and market research. there again was no entry in the index for usability testing. But a usability test is not a group technique. A six-participant usability test is not an appropriate method for estimating sales or market share. as discussed later in this chapter. for example. Valid usability tests have the following six characteristics. In that section. User-Based Evaluations section or index item for usability testing but did have one on quantitative evaluations. practitioners developed usability-testing methods to refine user interfaces rapidly (p. • The participants think aloud as they perform tasks. such as adding a question to a posttest questionnaire asking participants if they would buy the product they just used. a company would not base its sales projections on the results of such a question. (p. 479). Both of the book length descriptions of usability testing (Dumas & Redish. Rubin. tests that are performed quickly and with minimal resources are best called "quick and clean" rather than "informal" or "quick and dirty" (Wichansky. 
• The results of the test are communicated to appropriate audiences. It is not an informal usability test because it is not a usability test at all. or to evaluate competitive products" (p. but sometimes people try to use a test for other. challenges to the validity of user testing. . • There is some artifact to evaluate. It often is done in a "lab. • The focus is on usability. If the purpose of the question is to provide an opportunity for the participant to talk about his or her reactions to the test session. when the product has several severe usability problems. 1993. 2000). Usability testing sometimes is mistaken for a focus group. In addition. Obviously. to contrast the new system with current manual procedures. but it is difficult to know what informal really means. 411) In the 1992 edition. there is a chapter section on usability testing and laboratories. The remaining sections on usability testing cover usability testing basics." in which the participants are not intended users of the product and in which time and other measures of efficiency are not recorded. The most common objective for a usability test is the diagnosis of usability problems. although two participants are sometimes paired. One of the difficulties in discussing usability testing is finding a way to describe a test that is somewhat different from a complete diagnostic usability test. 6 Proportion of Problems Uncovered 0. The key to finding people who are potential candidates for the test is a user profile (Branaghan. Lewis (1994) found that for a very large product. testers want to capture two types of characteristics: those that the users share and those that might make a difference among users. A valid usability test must test people who are part of the target market for the product. But the results cannot be generalized to the relevant population—the people for whom it is intended. Usually.5 0.9 0. see Fig. It is from that profile that you create a recruiting screener to select the participants.1) showing 1.1 0. To run the test you plan. What theses studies mean for practitioners is that.0 0. This research does not mean that all of the possible problems with a product appear with 5 or 10 participants.1098 • DUMAS The Participants Are End Users or Potential End Users that 80% of the problems are uncovered with about five participants and 90% with about 10 continue to be confirmed (Law & Vanderheiden. 56. This situation forces the test team to decide on which group or groups to focus. 1997). There are some studies that do not support the finding that small samples quickly converge on the same problems. it may find usability problems. The studies by Molich et al.2 0. the issue of how well usability testing uncovers the most severe usability problems is clouded by the unreliability of severity judgments. 2001) also do not favor convergence on a common set of problems. you will need to find candidates and qualify them for inclusion in the test. As I discuss later. but most of the problems that are going to show up with one sample of tasks and one group of participants will occur early. 5 to 10 participants was not enough to find nearly all of the problems. The fact that usability testing uncovers usability problems quickly remains one of its most compelling properties. There is almost always a way to find the few people needed for a valid usability test. there are inclusion and exclusion criteria. 
This decision should be based on the product management's priorities not on how easy it might be to recruit participants. A common issue at this stage of planning is that there are more relevant groups to test than there are resources to test them. a suite of office productivity tools. For example.4 0.8 0. An idealized curve showing the number of participants needed to find various proportions of usability problems. . Recruiting Participants and Getting Them to Show Up. from a user profile for a test of an instruction sheet that accompanies a ground fault circuit interrupter (GFCI)—the kind of plug installed in a bathroom or near a swimming pool—a test team might want to include people who consider themselves "do-it-yourselfers" and who would be willing to attempt the installation of the GFCI but exclude people who actually had installed one before or who were licensed electricians.1. In developing a profile of users. participants could be people who now own a cell phone or who would consider buying one. the sessions begin to get repetitive after running about five participants in a group. 1992. in a test of an upgrade to a design for a cellular phone. you may want to include people who owned the previous version of the manufacturer's phone and people who own other manufacturers' phones. that is. These characteristics build a user profile. Then they A Small Sample Size Is Still the Norm.3 0. 2000). The way the testers would qualify candidates is to create a screening questionnaire containing the specific questions to use to qualify each candidate. Of the people who own a phone. For example. The early research studies by Virzi (1990. given a sample of tasks and a sample of participants. (1998. Testing with other populations may be useful.7 0.0 5 10 15 20 Number of Participants in Test FIGURE 56. just about all of the problems testers will find appear with the first 5 to 10 participants. Testers know from experience that in a diagnostic test. A useful strategy can be to recruit two participants for a session and. such as anesthesiologists or computer network managers. Virzi. One of the major advances in human-computer interaction over the last 15 years is the use of prototypes to evaluate user interface designs. usability specialists wanted to create prototyping tools that make it possible to save the code from the prototype to use it in the final product. if both show up. network managers. or will be late. the speed with which these software tools and paper prototypes can be created makes it possible to evaluate user interface concepts before the development team gets so enamored with a concept that they won't discard it. a clock radio. Wiklund." Many organizations use recruiting firms to find test participants. Until these articles were published. If testers follow all of these steps. the recruiter may need to emphasize what the candidates are contributing to their profession by participating in the test. One of the important parts of the pretest activities is the instructions on thinking aloud.. the administrator gives a set of pretest instructions. and those that are both (a cell phone. Landay & Myers. Consequently.) Testing Methods Work Even With Prototypes. Some testing organizations use gift certificates or free products as incentives. 2000. 1992). The studies all show that there are few differences between high. . & Thurrot. 1995) • Products in various stages of development (such as userinterface concept drawings. a circuit board tester. a database management system). 
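A small worked illustration of the sample-size claim discussed above and idealized in Fig. 56.1: the proportion of problems uncovered is commonly modeled as 1 - (1 - p)^n, where p is the average probability that a single participant encounters a given problem and n is the number of participants. This model and the value p = 0.3 used below are standard illustrative assumptions from the problem-discovery literature (e.g., Virzi, 1992), not figures reported in this chapter.

# Expected proportion of usability problems uncovered by n participants,
# assuming each problem is found by any one participant with probability p.
def proportion_found(n, p=0.3):
    return 1 - (1 - p) ** n

for n in (1, 5, 10, 15, 20):
    print(n, round(proportion_found(n), 2))
# With p = 0.3 this prints roughly 0.3, 0.83, 0.97, 1.0, 1.0:
# the shape of the idealized curve, with most problems appearing
# within the first 5 to 10 participants.

Smaller values of p flatten the curve, which is one way to read the studies cited above that did not find quick convergence for very large or complex products.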
Before the test session starts. For participants with unusual qualifications. dollars) for each recruited participant. It often takes some social skills and a lot of persistence to recruit for a usability test. Firms charge about $100 (in year 2000 U. Dumas. online help. early. The instructions tell the participant how the test will proceed and that the test probes the usability of the product. Developers soon realized that a prototyped version of a design is seldom so close to the final design that it is worth saving the code. high-fidelity prototypes.g. 1995. Sokolov. & Karis. not their skills or experience. Both of these articles question assumptions about the similarity between the think-aloud method as used in usability testing and the think-aloud method used in cognitive psychology research. etc. computer programmers. Interest in thinking aloud was revived with the recent publication of two articles that were written independently at almost the same time (Boren & Ramey. 2001). • Give participants a phone number to call if they can't make the session. tutorials. • Offer them some incentive. quick-start programs. The Participants Think Aloud As They Perform Tasks This is the execution phase of the test. Ten or more years ago.and low-fidelity prototypes in terms of the number or types of problems identified in a usability test or in the ratings of usability that participants give to the designs (Cantani & Biers. instruction sheets that are packaged with a product. etc. such as cooperative work software (Scholtz & Bouchette. they may have to go to a hardware store and approach people who are buying electrical equipment to find the relevant "do-it-yourselfers. the testers or the recruiting firm need to do the following: • Be enthusiastic with them on the phone. engineers. a hospital patient monitor. Some organizations over recruit for a test. There have been several studies that have looked at the validity of user testing using prototypes. qualifying some extra candidates to be backups in case another participant is a no-show. low-tech prototypes. There Is a Product or System to Evaluate Usability testing can be performed with most any technology. medical personnel. 1998.) • Products intended for different types of users (such as consumers. send or fax or e-mail citing the particulars discussed on the phone and a map with instructions for getting to the site. Dumas. It is where the test participant and the test administrator interact and it is where the data are collected. need to reschedule. and completed products) • Components that are imbedded in or accompany a product (such as print manuals.56. 1996. all hardware (a high-quality pen). about $50 to $75 an hour (in year 2000 dollars) for participants without any unusual qualifications. These studies compare paper or relatively rough drawings with more interactive and polished renderings.S. The administrator tells the participants to say out loud what they are experiencing as they work. The confidence that evaluators have in the validity of prototypes has made it possible to move evaluation sooner and sooner in the development process. See below for a description of codiscovery. To get participants to show up. they will still have a noshow rate of about 10%. The range includes the following: • Products with user interfaces that are all software (e. run a codiscovery session with both participants. In addition. Nothing works better than money. 
Evaluating prototypes has been facilitated by two developments: (a) using paper prototypes and (b) using software specifically developed for prototyping. But the testers create the screening questions and test them to see if the people who qualify fit the user profile. • Contact participants one or two days before the test as a reminder. For the test of the GFCI instruction sheet.) • Products that are used together by groups of users. • As soon as participants are qualified. more • 1099 fully functioning. User-Based Evaluations have to find candidates to recruit. products in beta testing. high school students. It takes a full day to recruit about six participants. the focus of new prototyping tools has been on speed of creating an interactive design. etc. most discussions of the think-aloud method used in usability testing automatically noted its superficial resemblance to the method described by Ericsson and Simon (1993) to study human problem solving. thinking aloud is used to study what is in participants' short-term memory. Participants are discouraged from reporting any interpretations of what is happening. their instructions were typical of a think aloud research study. called retrospective thinking aloud. Reports of experiences other than thoughts are important because they often are indicators of usability problems. The differences between the groups were in the types of statements the participants made when they thought out loud. & Herbert. feelings. In the retrospective condition. to complete fewer tasks. In usability testing. The research method is thought. Both of these studies suggest that in many cases participants' think-aloud protocols provide evidence of usability problems that do not otherwise show up in the data. There were several interesting results. any emotions that accompany the task. "The retrospective subjects. Their verbalizations would be much less informative. but many more of the statements were explanations of what they had been doing or comments on the user interface design. Those who had only the performance data uncovered 46% of the problems with the product. Virzi. and their expectations or violations of them." They were not told to report any other internal experiences. which can only occur when the participants' describe what they are thinking as they perform cognitive tasks such as multiplying two numbers. the participants performed tasks in silence then watched a videotape of the session while they thought aloud. although in some studies thinking aloud does take longer. Boren and Ramey explored other aspects of the verbal communication between administrators and participants. whereas those seeing the think aloud condition uncovered 69%. "I hate this program!" Almost anything the test administrator says at that point can influence whether the participants will report more or fewer of these negative feelings. But there ends the similarity to research. In the Virzi et al. or to rate tasks as more difficult in comparison with the performance of the retrospective participants. The participants who did concurrent thinking aloud were doing exactly as they were instructed. The study does show that retrospective thinking aloud yields more diagnostic verbalizations. Consider the following statements: . with thinking out loud after the session. In addition. There have been two fairly recent studies that compared a condition in which the tester could not hear the think aloud of participants with a condition in which they could (Lesaigle & Biers. 
Any encouragement they needed to keep talking was only done between tasks. Called Level 1 thinking aloud. The participants in the retrospective condition made only about one fourth as many statements while watching the tape. This is an interesting study because of its implications for usability testing. In cognitive psychology research. p. 1993). Sorce. and whatever the participants want to report. The retrospective participants were told that they would be watching the videotape of the session after the tasks and would be asked to think aloud then. Dumas (2001) explored how difficult it can be for test administrators to keep from encouraging or discouraging participants' positive or negative statements. This study shows us what would happen if we tried to get participants in usability tests to report only Level 1 verbalizations and did no probing of what there were doing and thinking. The concurrent group verbalized about 4 times as many statements as the retrospective group. They reported the results of observing test administrators implementing these practices and how little consistency there is among them. This discrepancy between the two sets of think-aloud instructions led Boren and Ramey (2000) to look at how thinking aloud is used in user testing as well as practices related to thinking aloud. and by having the participant practice thinking aloud. Participants were told to "describe aloud what they are doing and thinking. including how to keep the participants talking without interfering with the think-aloud process. First. they were attending to the tasks and verbalizing a kind of "play-by-play" of what they were doing. Conflict in Roles. study. such as how and when to encourage participants to continue to do it. The results showed that there were fewer problems uncovered in the screen only condition compared with the screen plus thinkaloud condition. but the statements were almost all descriptions of what the participants were doing or reading from the screen. but it takes 80% longer to have the participants do the tasks silently then to think out loud as they watch the tape. The group of participants who performed concurrent thinking aloud were not given typical think-aloud instructions for a usability test. The friendly facilitator role and the neutral observer role come into conflict when participants make strong statements expressing an emotion such as. 1274). showing an example of the think aloud by giving a brief demonstration of it. In the Lesaigle and Biers study. the focus is on interactions with the object being tested and with reporting not only thoughts. Instead. can give their full attention to the verbalizations and in doing so give richer information" (Bowers & Snyder.1 100 • DUMAS In both methods. 1990. usability professionals who recorded usability problems from a videotape of test participants thinking aloud.. Bowers and Snyder (1990) conducted a research study to compare the advantages and disadvantages of having test participants think out loud as they work. These findings are consistent with the results from other think-aloud research studies. called concurrent thinking aloud. There was no probing. usability professionals who could see a video of only the screens the participants could see were compared with comparable professionals who could see the screens and hear the participants think aloud. but also expectations. there were no differences between the concurrent and retrospective groups in task performance or in task difficulty ratings. 
they were never interrupted during a task. to provide entree only into the short-term memory of the participants in the research. The thinking aloud during the session did not cause the concurrent group to take more time to complete tasks. the research method focuses on having the participant say out loud what is in the participants short-term memory. by its proponents. were compared with usability professionals who could see only the performance data of the test participants. He saw a conflict in two roles that administrators play: (a) the friendly facilitator of the test and (b) the neutral observer of the interaction between the participant and the product. the participants are taught to think aloud by providing instructions on how to do it. 2000.. Any of these responses could push test participants to utter more or fewer strong feelings. Not all of the biasing responses to emotional statements are verbal. For example. testers select tasks for several reasons: • They include important tasks. Using a poor recognizer often clouds the evaluation of the rest of the software. Both of these solutions provide useful information but add a substantial amount of time to test sessions. For the most part. such as log in or installation. a nurse using a patient monitor will frequently look to see the vital sign values of the patient and will want to silence any alarms once she or he determines the cause. they may add saving work to several other tasks. however. The recognizer interprets what the test participant says. At the end of the task. This limitation in thoroughness is often why testing is combined with usability inspection methods. the sample of tasks is a limitation to the scope of a test. but several options are now available. There are a few areas. With almost any product there is a set of basic tasks. Although not often recognized as a liability of testing. such as a print manual. because they affect other tasks. If the tester is in the room with the participant and takes notes when participants make an emotional statement. participants listen to the recording and stop it to comment on parts of the interaction that they found either clear or confusing. that are critical. that is. and adjusting alarm limits are basic tasks. if testers think that users will have difficulty knowing when to save their work. it was much more difficult to create a prototype of a speech-based product. the verbalized thoughts may be mistaken for input by the speech recognizer. One of the essential requirements of every usability test is that the test participants attempt tasks that users of the product will want to do. The Special Case of Speech-Based Products. These tasks can make a product look less usable than if they were not included. if infrequent. In this case. 1999): • It is not possible for test participants to think aloud while they are using a speech recognition application because talking interferes with using the application. Using this method also allows the administrator to be sure that error paths are tested. which have thoroughness as one of their strengths. Selecting Tasks. however. much of the design skill is in dealing with the types of errors that recognizers often make. an important goal of a diagnostic test. the flow and logic of each interaction is controlled by the test administrator. In a speech interface. I believe. and so on. the nurse will want to adjust the alarm limits. 
But including these kinds of tasks makes it likely that a diagnostic test will uncover additional usability problems. these tasks pose a more difficult challenge to a product than if just commonly done or critical tasks are included. this is one of the reasons why a diagnostic test does not provide an accurate measure of a product's usability. if participants think aloud while using a speech application. strong feelings are inappropriate in this kind of test. who interprets participants' responses to prompts. they may not be able to hear spoken prompts. you don't want negative comments. sounds evasive to participants "Those are the kinds of statements that really help us to understand how to improve the product": reinforcing the negative "I really appreciate your effort to help us today": says nothing about the content of what the participant said and is part of playing the friendly role with participants. where testers may need to modify their methods (Dobroth. Hence the need to select a sample of tasks. Basic means tasks that tap into the core functionality of the product. In reality. Selecting these kinds of tasks makes it more likely that usability problems will be uncovered by the test. even though the limit adjustment may be done infrequently. Will the participant hear it that way? Silence: Neutral in content. and tasks. In addition. tasks that are performed frequently or are basic to the job users will want to accomplish. viewing vital signs. In a diagnostic test. and responds with the next prompt in the interaction. Consequently. it is the software that surrounds it that is being tested. When testing other components of a product. the test administrator creates the impression in participants that they are interacting with a • 1101 voice response system. tasks that force the user to navigate to the lowest level of the menus or tasks that have toolbar shortcuts. If tasks are longer. For example. Another option is to use the speech capabilities of office tools such as Microsoft's PowerPoint. Without any other instructions. there are more tasks than there is time available to test them.56. participants will begin to forget exactly what happened in the early parts of the task. one person's silence after another's strong statement is almost always interpreted as disagreement or disapproval. • They include tasks that probe areas where usability problems are likely. the test administrator can make a recording of the participants' interaction with the system as they complete the task. User-Based Evaluations "Tell me more about that": relatively neutral in content but could be interpreted as encouraging more negative statements "That's great feedback": again relatively neutral to someone who has training in test administration but. but how will it be interpreted? In human interaction. • They include tasks that probe the components of a design. the basic techniques of user testing apply to speech applications. If participants were to speak aloud. testers may include tasks that focus on what . Often the recognizer can't be changed. For example. silencing alarms. Components of a design that are not touched by the tasks the participants perform are not evaluated. In a "Wizard-of-Oz" test (see chapter 52 for more on this technique). In effect. the participant is left to interpret the test administrator's silence—you don't care. One way to get around this problem is to have participants comment on the task immediately after finishing it. he or she may be reinforcing them to make more. however. 
Moreover. This works well for tasks that are short and uncomplicated. Dumas suggested that one way to avoid this conflict in roles is to tell participants what the two roles are in the pretest instructions. As we will see below. The goal is to include tasks that increase thoroughness at uncovering problems. • In the past. • Evaluating speech-based interfaces often is complicated by the presence of a recognizer in the product. When a product of even modest complexity is tested. The scenario needs to be carefully worded so as not to mislead the participant to try to perform a different task.. Testers also try to avoid using terms in the scenario that give the participants clues about how to perform the task. and Data Are Recorded and Analyzed Capturing Data As They Occur. the more reliable the test results" (emphasis added). Logging test activities in real time. p. the testers need to make some preliminary estimate of how long each task will take. In most cases. "The whole point of usability testing is to predict what will happen when people use the product on their own. however. • They may be new to the product line. There are too many events happening too quickly to be able to record them in free-form notes. such as sending an order for a drug to the hospital pharmacy. With so many reasons for selecting tasks. "The context of the scenarios will also help them to evaluate elements in your product's design that simply do not jibe with reality" and "The closer that the scenarios represent reality. In addition to the wording of the task scenarios. The task scenario is an attempt to bring a flavor of the way the product will be used into the test.2 shows a sample data collection form for the task of saving a file to a diskette in Microsoft Windows. Typically testers and developers get together in the early stages of test planning to create a task list. 200la). time limits are useful because testers want participants to get through most of the tasks. Figure 56. In addition to including tasks in the list. Dumas and Redish (1999. Testers have developed strategies to handle this situation. A good scenario is short. 2001b) or with specialized software (Lister. It is difficult to use forms or software when the administrator is sitting in the test room beside the participant and is conducting the test alone.. Notice that it is set up to capture both paths to success and paths to failure. such as a task that has been changed from a previous release of the product. It never tells the participant how to do the task. It is common for scenarios to have dependencies. The form also allows for capturing a Reject. Rubin (1994. such as putting a phone number in another memory location that the test administrator can direct the participants to when they could not complete the earlier task. such as using the name of a menu option in the scenario. showing that describing tasks as scenarios rather than simple task statements makes any difference to the performance or subjective judgments of participants. For example. it is difficult to make accurate estimates of time limits. unambiguous. From the beginning. continues to be a messy process. Testers almost always have to edit the data log after each session to remove errors and misunderstandings. 2001). There is no research. Most have created their own logging software. Until you conduct a pilot test. 125) describes task scenarios as adding context and the participant's rationale and motivation to perform tasks. 
and it may also be useful for setting time limits for each task. as described in the next section. During test planning. 174) said. All agree that testers need to plan how they will record what occurs. The Participants Are Observed. There are three ways that testers deal with the complexity of recording data: • Create data collection forms for events that can be anticipated (Kantner. it is difficult. such as when a task was really over. The goal is to record key events while they happen rather than having to take valuable time to watch videotapes later. testers work on the wording of each scenario. It is in the box on the table. But taking note of the product's use environment may be important. usability testers recognized the artificiality of the testing environment. paring the task list to the time available is an important part of test planning. • Create or purchase data logging software (Philips & Dumas. . The Tasks Are Presented in Task Scenarios. Rejects are important to note because. Almost without exception. which is a task that a participant considers complete but the data collector knows is not. Testers continue to believe in the importance of scenarios and always use them. in testing a cellular phone there may be a task to enter a phone number into memory and a later task to change it. For example: You've just bought a new combination telephone and answering machine. Without a data recorder. in the user's words not the product's. their order may also be important. however. testers present the tasks that the participants do in the form of a task scenario. Even in a diagnostic test. the scenario is the only mechanism for introducing the operational environment into the test situation. The time estimate is important for deciding how many tasks to include. although they are failures. 1990). The use of data logging software continues at many of the larger testing facilities. they often have task times that are faster than even successful tasks. p. 1998). The participants should feel as if the scenario matches what they would have to do and what they would know when they are doing that task in their actual jobs" (emphasis added). Some additional reasons for selecting tasks are: • They may be easy to do because they have been redesigned in response to the results of a previous test. Take the product out of the box and set it up so that you can make and receive calls. to sit in the test room with the test participant and to record data at the same time. but still possible. and gives participants enough information to do the task. • Automatically capture participant actions in logfiles(Kantner. such as a task that just asks the participant to locate a number of items (Branaghan. A problem with dependencies happens when the participant can't complete the first task. • They may cause interference from old habits. Recording data during the session remains a challenge. Setting time limits is always a bit of a guess. but some estimate is necessary for planning purposes.1102 • DUMAS is in the manual.. as opposed to descriptive statements and making more statements that developers view as useful (Hackman & Biers. Participant and Administrator Sit Together. 1989). Getting Developers and Managers to Watch Test Sessions. with codiscovery participants making more evaluative. Even though testing is known and accepted by a much wider circle of people than it was 10 years ago. A related method is to have one participant teach another how to do a task (Vora. • When developers see live sessions. 
sometimes called the codiscovery method (Kennedy. Most usability tests are run with a single test participant. Some of them will even become advocates for testing. Most usability problems don't need to be diagnosed at the mouse click or key press level. Studies show that when two participants work together.2. User-Based Evaluations • 1103 Task 1. and there is a brisk business in selling lab . But using codiscovery does require recruiting twice as many participants. The nature of the utterances also is different. The Usability Lab Is Now Ubiquitous. Watching a videotape of a session does not provide the same experience. They gain understanding of the value of the method. Expend whatever effort it takes to get these people to attend test sessions. 1994). 1992). But the tools to do this capture usually record data that are at too low a level to uncover usability problems. There are so many links and controls on a typical Web page that it is difficult to record what is happening short of watching the session again on videotape.56. Sample data collection form. Copy a Word file to a diskette Pass (Time ) Explorer: Dragged file from one Explorer pane to another with File: Send to: Floppy A Copied and Pasted in Explorer with Toolbar left right button Edit menu with Keyboard My Documents: Dragged to Desktop then back with left right button CTRL D File: Send to: Floppy A Copied and Pasted with Toolbar Edit menu with Keyboard Word Opened Word and chose File: Save as Fail or Reject (Time_ Word Chose Help Save Windows ____Word Topic: FIGURE 56. When they have seen some of the usability problems themselves. they are much less likely to resist agreeing on what the most important problems are. Watching even a few minutes of live testing can be very persuasive. Collecting data is a special challenge with Web-based products. One of the important assets of testing is that it sells itself. it is much easier to communicate the results to them. This difficulty has renewed interest in automatic data collection. the experience of watching a user work at a task while thinking aloud still converts more people to accept usability practices than any other development tool. they make more utterances. There are two reasons why testers need to get key project staff and decision makers to come to watch a test session: • When people see their first live test session they are almost always fascinated by what they see. Usability labs continue to be built. Barker and Biers (1994) conducted an experiment in which they varied whether there was a one-way mirror and cameras in the test room. For testing products that run on general-purpose computer equipment. • Company firewalls can prevent live testing. • Participants are tested using their own equipment environment. More recently. Kelso. p. 95) describes the requirements for the testing environment as follows: "Make the testing environment as realistic as possible. As much as possible. a common setup is a scan converter showing what is on the test participant's screen and a video camera focused on the face or head and shoulders of the participant. as we will discuss below. shows that in complex operational environments." But is putting a couch in the test room to make it look more like a room in a home simulating the use environment? It may not be. automobiles. The demand for labs is driven by the advantages of having recording equipment and the ability to allow stakeholders to view the test sessions. The literature on product evaluation.1 104 • DUMAS equipment. 
and we need to think more about the fidelity of our testing environments (Wichansky 2000). 1999). with a phone connection. make the quality of highlight tapes even poorer. which. Mimicking the Operational Environment. • Test costs are reduced because participants are easier to recruit. Remote usability testing refers to situations in which the test administrator and the test participant are not at the same location (Hartson. Rubin (1994. Other testing groups normally do not sit with the participants. Products such as NetMeeting software make it possible for the tester to see what is on the participant's screen and. 1997). hence the arrival of portable lab setups that fit in airplane overhead compartments. and operating rooms. 1994). I discuss this issue further in the section Challenges to the Validity of Testing. The move to digital video and inexpensive writeable CDs promises to improve recordings and to make it substantially easier to find and edit video segments. but they often require both parties to have special video cards and software. hospital operating-room simulators have been developed to study equipment interaction issues in anesthesiology (Gaba. 1992). 1996). Kamler. 587). The primary advantages of remote testing include the following: • Participants are tested in an environment in which they are comfortable and familiar. Relatively inexpensive eye-tracking equipment has made it possible to know where participants are looking as they work. and often do not have to be compensated. The basic makeup of a suite of usability test equipment has not changed much with time.000 produce surprisingly poor images. There are other technologies that can provide testers with even more information. Testers often make changes to the setup of the test room. is used to describe the degree to which simulations or simulators mimic the operational environment. They concluded that "the present authors are skeptical about using feedback provided by the user through online questionnaires as the sole source of information" (p. the environment is so important that simulations are needed to mimic it. In those interactions between users and aircraft. making it difficult to see screen details. & Neale. but going to the participant's home for testing is a complex process (Mitropoulos-Rundus & Muzak. believing that it makes it easier to remain objective and frees the administrator to record the actions of the participants (Dumas & Redish. Miniaturization continues to shrink the size of almost all lab equipment. believing that it reduces the participants' anxiety about being in the test and makes it easier to manage the session (Rubin. may be a false sense. adds a sense of scientific credibility to testing. researchers and practitioners have used software simulations or hardware-software simulators to mimic that operational environment. It consists of video and audio recording equipment and video mixing equipment. . They found that the presence of the equipment did not affect the participants' performance or ratings of usability of the product. A testing facility. Still. In addition. A variable. There may be other environments that influence the usability of the products we test. Remote Testing. there are no test facility costs. when viewed from the perspective of 50 years. the viewer or conferencing software can slow down the product being tested. Most viewers and meeting software often cannot be used if there is a firewall. 
An issue that has been debated throughout the history of usability testing is the impact of one-way mirrors and recording equipment on the test participants. Second-generation copies. aircraft and automobile simulators are used to study interactions with cockpits and dashboards as well as for operator training. which are often used in highlight tapes. do not have to travel. But there can be disadvantages to remote testing: • With live testing. Scan converters selling for under $2. In essence. the method sells itself in the sense that developers and managers find compelling the experience of watching a live test session. try to maintain a testing environment that mimics the actual working environment in which the product will be used. This can happen for a number of reasons. hear the participants think aloud. For example. 1994). no one would disagree that some remote testing is better than no testing at all. This debate comes to a head in discussions about whether the test administrator should sit with participants as they work or stay behind the one-way mirror and talk over an intercom. There are some recent innovations in lab equipment that are enhancing measurement. But Lesaigle and Biers' (2000) study showed that uncovering problems only through participants' questionnaire data had the least overlap with conditions in which testers uncovered problems by watching the participants work or seeing the screens on which participants. Some testing groups always sit with the participant. There are also technologies for having ratings or preference questions pop up while participants are working remotely (Abelow. The Impact of the Testing Equipment. There is one study that partially addressed this issue. such as testing products used by only a few users who are spread throughout the country or the world. Castillo. The quality of video images recorded during sessions has always been poor. especially one with a one-way mirror. usually called fidelity. and data analysis. These regions usually define some object or control on a page. duration. Not all test participants can be calibrated on an eye tracker. The end of the test session is a good time to ask for those opinions. part of the art of running a test. such as task time and task completion. They noted that it is common for testers to report only one category of performance measure and caution not to expect different types of measures to be related. There are several ways to categorize the measures taken in a usability test. depending on accessories and data reduction software. Other time measures include the time to reach intermediate goals such as the time for events such as to finding an item in Help. and the number of assists. Another common breakdown uses three categories: (a) efficiency measures (primarily task time). There are so many links and controls on a Web page that it can be difficult to know exactly where participants are looking. and (b) subjective measures. The way assists are given to participants by test administrators. The data from the tracker is broken into fixations—300millisecond periods during which the point of regard doesn't move more than 1° of visual angle. (b) effectiveness measures (such as task success). point of gaze coordinates. Then there are statistics that measure eye movements from one AOI to another and plots of scan paths. • They can see screens and hear the think aloud and see the participants face. is not consistent from one usability testing organization to another (Boren & Ramey. 
embedded survey questions. But there are other evaluation areas that are likely to benefit as tracking technology becomes cheaper and easier to manage. up to 20 percent of typical user populations cannot be calibrated because of eye abnormalities. An assist is important because it indicates that there is a usability problem that will keep participants from completing a task. Eye movement analysis involves looking at fixations within an area of interest (AOI). Discrepancies Between Measures. Frokjaer. I discuss test measures. Hughes (1999) argued that qualitative measures can be just as reliable and valid as qualitative measures. and average pupil diameter. • 1105 Interest in where participants are looking has increased with the proliferation of Web software. and live remote testing with conferencing software. • They see only the responses to questionnaire items." and "the questionnaire data was less likely to reveal the most severe problems . Each fixation has a start time. The participant has spent an hour or two using the product and probably has as much experience with it as he or she is likely to have. however.56. especially repeated errors. plus or minus $ 10. such as ratings of usability or participants' comments. An assist happens when the test administrator decides that the participant is not making progress toward task completion. Most performance measures involve time or simple counts of events. You can purchase a system for about $40. Consequently. There is a vocal minority of people writing about usability testing measures who argue against the use of quantitative performance measures in favor of a qualitative analysis of test data. eye tracking isn't something that is used without a specific need. These measures include the time the participant works toward the task goal divided by the total task time (sometimes called task efficiency) and the task time for a participant divided by the average time for some referent person or group. Some investigators find only a weak correlation between efficiency measures and effectiveness measures. There are some complex measures that are not often used in diagnostic tests but are sometimes used in comparison tests.000. One is to break them into two groups: (a) performance measures. Test Measures. A common finding in the literature is that performance measures and subjective measures are often weakly correlated. data reduction becomes a major task. The new systems are head mounted and allow the test participant a good deal of movement without losing track of where the eye is looking. which is a tester-defined area on a screen. Eye tracking systems produce a great deal of data. discrepancies between measures. Eye tracking is a relatively new measure in user testing. They then went back and looked at several years of usability test reports in the proceedings of the annual CHI conference. the discussion about the discrepancies between measures in the next subsection). Consequently. Goldberg (2000) identified evaluation criteria that can benefit most from eye tracking data. and Hornbaek (2000) described a study in which they found such a weak correlation. live remote testing with a viewer. such as an expert or an average user. It seems only natural that an important measure of the usability of a product should be the test participants' opinions and judgments about the ease or difficulty of using it. 
Lesaigle and Biers (2000) compared how well testers uncovered usability problems under a number of conditions: • They only can see the screen the participant sees. 2000). The authors concluded that "questionnaire data taps a somewhat different problem set. User-Based Evaluations Perkins (2001) described a range of remote usability testing options that includes user-reported critical incidents. The results show that uncovering problems only through participants' questionnaire data had the least overlap with the other three conditions. There are AOIs on each page and testers use eye-tracking software to compute statistics such as the average amount of time and the number and duration of fixations in each AOI. The most common time measure is time to complete each task.600 records a minute.000. Consequently. with visual clarity getting the most benefit. Hertzum. and (c) satisfaction measures (such as rating scales and preferences). a posttest interview or a brief questionnaire is a common subjective measure (see. • They can see the screen and hear the participants think aloud. Measures and Data Analysis In this section. An eye tracker helps solve that problem. a 60-Hz tracker produces 3. The counts of events in addition to task completion include the number of various types of errors. but the administrator can continue to learn more about the product by keeping the participant working on the task. Eyetracking equipment has come down in price in recent years. a process that is called triangulation (Dumas & Redish. questions that measure what we want to measure. the tester might conclude "the interface has too much technical and computer jargon. such as "didn't see the option. how participants interpret the questions. Bailey recommended using only performance measures and not using subjective measures when there is a choice. For example. The problem sheet is usually created by the test administrator during the test sessions or immediately afterward. 587). Much of the data analysis involves building a case for a usability problem by combining several measures. Unfortunately. But do experienced testers see the same problems and causes? As I discuss later. the experienced tester sees patterns that point to more general problems. a tester might see instances of participants spending time looking around the screen and aimlessly looking through menu options and conclude that "the participants were overwhelmed with the amount of information on the screen." and interpretations. But developers tend to see all problems as local. Bailey (1993) and Ground and Ensing (1999) both reported cases in which participants perform better with products that they don't prefer and vice versa. (b) the characteristics of the interviewer or the way the interviewer interacts with the participant. a very long task time or a failure to complete a task is a true measure of usability." Seeing the underlying causes of individual problem tokens is one of the important skills that a usability tester develops. Some of the factors have to do with the demand characteristics of the testing situation. 1998b. especially when it occurs late in the session. such as one key event. participants' need to be viewed as positive rather than negative people or their desire to please the test administrator. Instead of seeing that there needs to be a general review of the language in the interface. 
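Because performance and subjective measures are often only weakly correlated, it is worth checking, for any given test, whether the two kinds of data agree before reporting either in isolation. The short Python sketch below illustrates one way to do so with entirely hypothetical participant data; the numbers, the column names, and the choice of a plain Pearson correlation are illustrative assumptions, not a prescribed analysis:

# Hedged sketch: checking whether performance and subjective measures agree.
# All numbers are hypothetical and only illustrate the kind of comparison
# discussed in the text.
from statistics import mean, pstdev

# One row per participant: task time (seconds), task completed (1/0),
# and a post-test ease-of-use rating on a 1-7 scale (7 = very usable).
participants = [
    {"time": 185, "completed": 1, "rating": 6},
    {"time": 240, "completed": 1, "rating": 6},
    {"time": 410, "completed": 0, "rating": 5},
    {"time": 150, "completed": 1, "rating": 7},
    {"time": 395, "completed": 0, "rating": 6},
    {"time": 220, "completed": 1, "rating": 4},
]

def pearson(xs, ys):
    """Plain Pearson correlation; no external libraries needed."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

times = [p["time"] for p in participants]
ratings = [p["rating"] for p in participants]
completion_rate = mean(p["completed"] for p in participants)

print(f"Completion rate: {completion_rate:.0%}")
print(f"Mean task time: {mean(times):.0f} s")
# A near-zero correlation here is the discrepancy the studies above describe:
# slow or failed tasks paired with positive ratings.
print(f"Time vs. rating correlation: {pearson(times, ratings):+.2f}")

A correlation near zero would be a signal to report both kinds of measures and look for an explanation, rather than to let one stand in for the other.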
Testers without training in question development can use open-ended questions and consider questions as an opportunity to stimulate participants to talk about their opinions and preferences. 1999). Identifying usability problems is key. From individual instances of problems. Triangulation of measures is critical. This tendency seems to be a deep-seated cultural phenomenon that doesn't go away just because a test administrator tells the participant during the pretest instructions that the session is not a test of the participants' knowledge or ability. We depend on what participants say to help us understand what the problem is. was difficult to use.) In addition to the demand characteristics. There are several explanations for why participants might say they liked a product that. whereas subjective statements are unreliable. the characteristics of the task situation produce larger distortions than the characteristics of the interviewer or the participant. Most explanations point to a number of factors that all push user ratings toward the positive end of the scale. measuring subjective states is not a knowledge area where testers' intuition is enough. the problem sheet or log drives the data analysis. in the testers eyes. The problems are observed during the sessions and are recorded on problem sheets or data logs. These positive ratings and comments from participants often put testers in a situation in which they feel they have to explain away participants' positive judgments with the product. 1983). there is some doubt about the consistency of problem labeling." When the same problem appears again. Experienced usability testers see the basic causes of problems. for a discussion of these issues in a usability testing context. that is. Other factors include the tendency of participants to blame themselves rather than the product and the influence of one positive experience during the test. Later. and how sensitive or threatening the questions are (Bradburn. The sheet is organized by participant and by task. For example. It is difficult to create valid questions. It is rare that a usability problem affects only one measure. a poorly constructed icon toolbar will generate errors (especially picking the wrong icon on the toolbar). Orne (1969) called these task characteristics the "demand characteristics of the situation. Data Analysis. Task-based distortions include such factors as the format of questions and answers. especially one that occurs late in the session. In general. 1998c. It is not entirely clear that such skills can be taught quickly. Test administrators seldom have any training in question development or interpretation. the developer sees problems with individual words. for example. Testers often talk about the common finding that the way participants perform using a product is at odds with the way the testers themselves would rate the usability of the product. There are at least three sources of distortions or errors in survey or interview data: (a) the characteristics of the participants. subjective measures can be distorted by events in the test. slow task times (during which participants hesitate over each icon and frequently click through them looking for the one they want) and statements of frustration (participants express their feelings about not being able to learn how the icons are organized or be able to guess what an icon will do from the tool tip). 
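The icon toolbar example can be made concrete with a small problem log that tallies how many kinds of evidence, and how many participants, stand behind each suspected problem. The sketch below is hypothetical; the problem labels, evidence types, and observations are placeholders for what a tester would actually record during sessions:

# Minimal sketch of triangulation: tallying different kinds of evidence for
# the same suspected problem across participants. Labels and observations
# are hypothetical, not taken from an actual test log.
from collections import defaultdict

# Each observation: (participant, suspected problem, evidence type)
observations = [
    ("P1", "icon toolbar unclear", "error"),       # picked the wrong icon
    ("P1", "icon toolbar unclear", "slow time"),   # hovered over each icon in turn
    ("P2", "icon toolbar unclear", "comment"),     # "I can't tell what these do"
    ("P3", "icon toolbar unclear", "error"),
    ("P3", "jargon in dialog text", "comment"),
    ("P4", "jargon in dialog text", "error"),
]

case = defaultdict(lambda: {"participants": set(), "evidence": defaultdict(int)})
for participant, problem, evidence in observations:
    case[problem]["participants"].add(participant)
    case[problem]["evidence"][evidence] += 1

for problem, info in case.items():
    kinds = ", ".join(f"{k} x{n}" for k, n in info["evidence"].items())
    print(f"{problem}: {len(info['participants'])} participants ({kinds})")

A problem supported by errors, slow times, and frustrated comments from several participants is a much stronger case than one supported by a single think-aloud remark.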
Test participants continue to blame themselves for problems that usability specialists would blame on the user interface. It is surprising how much of this analysis is dependant on the think-aloud protocol. it is noted. whereas a positive rating of six out of seven on usability is inflated by demand characteristics." (See Dumas.1 106 • DUMAS (p. For example. a product developer will see the same events or tokens as the test administrator. Creating closed-ended questions or rating scales that probe what the tester is interested in is one of the most difficult challenges in usability test methodology. What gets recorded on the sheet are observations. One of the difficulties with test questions is that they are influenced by factors outside of the experience that participants have during the test session. Testers always feel that the performance measures are true indicators of usability. The case building is driven by the problem list created during the test sessions. and (c) the characteristics of the task situation itself. Most usability problems do not emerge from the analysis of the data after the test." From a number of instances of participants doing a task twice to make sure it was completed. While watching a test session. This ." From a number of instances of participants not understanding terms. such as "doesn't understand the graphic. the tester might conclude that "there is not enough feedback about what the system is doing with the participant's actions. Testers often have years of experience studying and practicing problem identification skills. It now is less often necessary to write a report to justify conducting the test. and what testing was all about. Usability professionals believe that this conflict over what "really" happened during the test remains a major barrier to improving a product's usability. does that change its severity? The authors of these scales provide little guidance. Several practitioners have proposed severity rating schemes: Nielsen (1992). With the exception of Rubin's multiplication rule. For example. especially the middle ones. One of the issues still being debated about usability problems is whether to place them into a company's software bug tracking system (Wilson & Coyne. They all use a rating scale that is derived from software bug reporting scales. The middle levels between the extremes are usually difficult to interpret and are stated in words that are hard to apply to specific cases. Putting them into the system can be effective if the bugs are more likely to be fixed. Dumas and Redish (1999) added a second dimension: the scope of the problem from local to global. But fitting the bugs into a bug severity rating scale often is difficult.56. 3. Rubin (1994). There have been several recent research studies that have looked at the validity the reliability of these scales. Communication at these meetings is facilitated when the product team has attended at least some of the test sessions. and something called "market impact. 2001). A disappointing aspect of this research is the lack of consistency in severity judgments. Practitioners are not given any guidance on how problems fit into the scale levels.16) among professional testers' ratings of the severity of the same usability problems in a usability test. and that severity ratings of usability professionals did not agree with each other. This conflict continues to limit the impact of testing on product improvement. 
Developers don't like to be told that they have tunnel vision and can't see the underlying causes of individual tokens. 2. This lack appears in all forms of usability evaluation. Communicating Test Results In the early days of user testing. Nielsen's middle levels are (a) major usability problem (important to fix and so should be given high priority) and (b) minor usability problem (fixing is given low priority). persistence. that is. that their scales alone are not enough to assess severity. Handling this conflict takes some diplomacy. The most severe category usually involves loss of data or task failure and the least severe category involves problems that are so unimportant that they don't need an immediate fix. The results of these studies cast doubt on one of the most often-mentioned assets of usability testing: its touted ability to uncover the most severe usability problems. more likely to be candidates to be fixed. whereas 46% were only found by a single evaluator. presumably. For example. Lesaigle and Biers (2000) reported a disappointing correlation coefficient (0. All of the authors assume that the measurement level of their scale is at least ordinal. For example. only 20% were detected by all evaluators. Dumas and Redish proposed two middle levels: (a) problems that create significant delay and frustration and (b) problems that have a minor effect on usability. Of the 93 problems identified with the product. with no levels in between. Testers needed reports to communicate what they did. perhaps. there almost always was a formal test report. including the top-10 problems in terms of severity. Cantani and Biers (1998) found that heuristic evaluation and user testing did not uncover the same problems. They asked four experienced usability testers to watch tapes of the same usability test and then identify problems. Nielsen (1992) described four factors in addition of the severity rating itself: frequency. All of the authors admit. Dumas and Redish (1999). and Wilson and Coyne (2001). impact. There have been a number of research studies investigating the consistency of severity ratings. and is one of the most important challenges to usability methodology. The schemes have three properties in common: 1. One of the important reasons for the change in reporting style for diagnostic usability tests is the confidence organizations have in the user testing process. These studies consistently show that usability specialists find more problems than product developers or computer scientists." Rubin (1994) proposed multiplying the rating by the number of users who have the problem. Now it is more common for the results of a test to be communicated more informally. User-Based Evaluations conflict often doesn't appear until the testers and developers sit down to discuss what they saw and what to do about it. They used Nielsen's severity rating scale. none of these other factors are described in enough detail to indicate how their combination with the severity scale would work. One way to call attention to important problems is to put them into a measurement tool such as a problem severity scale. But Jacobsen and John (1998) showed that it also applies to usability testing. what they found. These scales determine which problems are the most severe and. what does one do if only two of eight participants cannot complete a task because of a usability problem. which would not adequately describe many usability problems. Organizations with . 
Some bug tracking systems require that a bug be assigned only one cause. at least indirectly. None of the scales indicate how to treat individual differences. These studies all show that the degree of consistency is not encouraging. There have been several research studies that have looked at how many usability problems are uncovered by different populations. the problems gets worse as the scale value increases. and usability professions don't like hearing that the local fix will solve the problem. Most studies have looked at the inconsistencies among experts using severity scales with inspection methods such as heuristic evaluation. Is that problem in the most severe category or does it move down a level? If a problem is global rather than local. which is. such as at a meeting held soon after the last test session. But all of the studies have used inspection evaluation methods not user-based evaluation methods. None of the top-10 severe problems appeared on all four evaluators' lists. inspection and user based. The authors propose one or more additional factors for the tester to consider in judging • 1107 severity. an indicator of the weakness of the severity scales themselves. and there is always a risk that the fix will solve only the local impact of the problem not its basic structural cause. A highlight tape is a short. with 7 being very usable. a summative test is performed late in development to evaluate the design. it encourages those participants to take the time to think aloud and to made useful verbal diversions as they work. the second purpose for highlight tapes has become less necessary. even by an experienced editor.1 108 • DUMAS active usability programs have come to accept user testing as a valid and useful evaluation tool. they are cheaper to buy than videotapes and take up less storage space. As soon as the purpose of the test moves from diagnosis to comparison measurement. especially a tape aimed at important decision makers who could not attend the sessions. and the way the test administrator interacts with participants must not favor any of the products. "How usable is this product?" It would be wonderful to be able to answer that question with a precise. In both types of comparison tests. "It's very usable" or better. It would be ideal if we could say that a product is usable if participants complete 80% of their tasks and if they give it an average ease-of-use rating of 5. absolute statement such as. human factors professionals have made a distinction between formative and summative measurement. thus eliminating the need for editing." But there is no absolute measure of usability. Others have begun to store and replay video in a different way. The Value of Highlight Tapes. the tasks. A 15 minute tape can take 2 days to create. It typically tests a very small sample of participants. the test design moves toward becoming more like a research design. But all tasks and tests are not equal. baseline usability test. As usability testing has become an accepted evaluation tool. without a comparison product. consequently. A diagnostic test is clearly a formative test. even highlight tapes can be boring. Unless the action moves quickly. They don't feel that they need to know the details of the test method and the data analysis procedures. visual illustration of the 4 or 5 most important results of a test. there are two important considerations: (a) The test design must provide a valid comparison between the products and (b) the selection of test participants. 
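Rubin's multiplication rule is simple enough to show directly. The sketch below computes a priority score as the severity rating times the number of participants who had the problem, and carries along a local or global scope label of the kind Dumas and Redish propose; the four-point scale, the scope labels, and the problems themselves are hypothetical placeholders:

# Hedged sketch of a Rubin-style priority score: severity rating multiplied
# by the number of participants who hit the problem. The 1-4 scale values,
# scope labels, and problem descriptions are hypothetical.
problems = [
    # (description, severity 1-4 with 4 most severe, participants affected, scope)
    ("Cannot save file without data loss", 4, 2, "local"),
    ("Toolbar icons frequently misidentified", 3, 6, "global"),
    ("Confirmation message uses internal jargon", 2, 5, "global"),
    ("Help index missing one topic", 1, 3, "local"),
]

scored = [(desc, sev * n, sev, n, scope) for desc, sev, n, scope in problems]
scored.sort(key=lambda row: row[1], reverse=True)

print("priority  severity  affected  scope   problem")
for desc, priority, sev, n, scope in scored:
    print(f"{priority:>8}  {sev:>8}  {n:>8}  {scope:<6}  {desc}")

The multiplication changes the ordering: a moderate problem that affects most participants can outrank a severe problem seen only once, which is exactly the judgment the scales alone do not capture.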
and the administrator makes minimal interruptions to the flow of tasks. Some testers use the capabilities of tools such as PowerPoint to put selections from a videotape next to a bullet in a slide presentation rather than having a separate highlight tape. we are left with the judgment of a usability specialist about how usable a product is based on their interpretation of a summative usability test. This characteristic makes careful editing of the highlights a must. To . Historically.5 out of 7. But if the editing system is not digital. A formative test is done early in development to contribute to a product's design. They want to know the bottom line: What problems surfaced. comparative test) and (b) a test intended to produce results that will be used to measure comparative usability or to promote the winner over the others (a competitive usability test). There are two types of comparison tests: (a) an internal usability test focused on finding as much as possible about a product's usability relative to a comparison product (a comparative test or a diagnostic. This expert judgment is the basis of the common industry format (GIF) I describe below. will occasionally malfunction. But what specifically is a summative usability test? At the present time. and what should they do about them? In these organizations. Perhaps someday. Here the intention is to measure how usable a product is relative to some other product or to an earlier version of itself. it takes about 1 hour to create 1 minute of finished tape. "Its 85% usable. A tester can then show an audience the highlights by showing segments of the CDs in sequence. almost every test had a highlight tape. those judgments are the best estimate we have. One of the limiting factors in measuring usability is the makeup of the diagnostic test itself. The section includes measuring and comparing usability. It doesn't directly answer the question. I discuss aspects of usability testing that go beyond the basics of a simple diagnostic test. Because the cost of blank CDs is only about a dollar. and allowing free exploration. a written report may still have value but as a means of documenting the test. we will be able to make more precise measurements based directly on the measures that are not filtered through the judgment of usability professional. Those qualities make a diagnostic test good at exploring problems. Experienced usability professionals believe that they can make a relatively accurate and reliable assessment of a product's usability given data from a test designed to measure usability. and it deals with a product that might be in prototype form and. but limited at measuring usability. These tapes had two purposes: to show what happed during the test in an interesting way and to illustrate what a usability test is and what it can reveal. it allows the test administrator the freedom to probe interesting issues and to take such actions as skipping tasks that won't be informative. Most of that time is taken finding appropriate segments to illustrate key findings. Designing Comparison Tests. One of the disappointing aspects of highlight tapes is that watching them does not have the same impact as seeing the sessions live. VARIATIONS ON THE ESSENTIALS In this section. Comparing the Usability of Products An important variation on the purpose of a usability test is one that focuses primarily on comparing usability. In the early days of testing. The emergence of digital video will make highlight tapes less time-consuming. In the meantime. 
and without a comparative yardstick it is difficult to pinpoint a product's usability. Each CD stores about an hour of taping. There are video cards for personal computers that will take a feed from a camera and store images in mpeg format on a compact disk (CD). a test with a stable product and a larger sample than is typical and one in which participants are discouraged from making verbal diversions. Measuring and Comparing Usability A diagnostic usability test is not intended to measure usability as much as to uncover as many usability problems as it can. that is. 56. The design issues usually focus on two questions: • Will each participant use all of the products. One solution is to hire an industry expert to select or approve the selection of tasks. One product can be made to look better than any other product by carefully selecting tasks. Again. which means testers need fewer participants to detect a difference. The selection of participants can be biased in both a betweenand a within-subjects design. the wildcard has a smaller impact on the overall results. The scenarios need to be scrubbed of biasing terminology. (See Fisher & Yates. Because it is difficult to match groups on all of the relevant variables." If the group sizes are small. the tasks in a competitive test should not be selected because they are likely to uncover a usability problem or because they probe some aspect of one of the products. This design allows the statistical power of a within-subjects design for some comparisons—those involving your product. In a between-subjects design. the bias can come directly from selecting participants who have more knowledge or experience with one product. • 1 109 each participant would use the testers' product and one of the others. Because within-subjects statistical comparisons are not influenced by inequalities between groups. The bias can be indirect if the participants selected to use one product are more skilled at some auxiliary tasks. members of one group are recruited because they have experience with Product A. For example. and Dumas. The wording of the task scenarios can also be a source of bias. The interaction in a competitive test must be as minimal as possible. or are more computer literate. If people who work for the company that makes one of the products select the tasks. But the two groups need to have equivalent levels of experience with the product they use. general computer literacy. Eliminating Bias in Comparisons. For example. Every user interface has strengths and weaknesses. for rules for counterbalancing. whereas in a "between-subjects" design each participant uses only one product. Even more difficult to establish than lack of bias in task selection is apparent bias. This problem is why most organizations will hire an outside company or consultant to select the tasks and run the test. a design in which participants use all of the products is called a "within-subjects" design. But often the consultant doesn't know enough about the product area to be able to select tasks that are typical for end users. a qualification test could provide evidence that they know each product equally well. The test administrator should not provide any guidance in performing tasks and should be careful not to give participants rewarding feedback after task success. the selection and wording of tasks. in a typical between-subject design. and so on. even more so in a competitive test. In addition. for example. 
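In the absence of an absolute usability metric, the practical move is to compare a summative test's aggregate results against stated goals, such as the 80% task completion and 5.5-of-7 ease-of-use rating example given above, or against a baseline from an earlier version. The sketch below makes that comparison explicit; the goal values and the results are illustrative, not recommended thresholds:

# Minimal sketch: checking summative test results against usability goals of
# the kind mentioned in the text (e.g., 80% task completion, a mean ease
# rating of 5.5 on a 7-point scale). Goals and data are illustrative.
from statistics import mean

goals = {"completion_rate": 0.80, "mean_rating": 5.5}

results = {
    "completion_rate": mean([1, 1, 0, 1, 1, 1, 0, 1]),  # hypothetical pass/fail per task
    "mean_rating": mean([6, 5, 5, 7, 4, 6, 6, 5]),      # hypothetical 1-7 ratings
}

for measure, goal in goals.items():
    value = results[measure]
    verdict = "meets goal" if value >= goal else "below goal"
    print(f"{measure}: {value:.2f} (goal {goal:.2f}) -> {verdict}")
# Even when every goal is met, the judgment that the product is usable enough
# still rests with the usability specialist, as the text argues.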
The exact number of participants depends on the design and the variability in the data. they are statistically more powerful than between-subjects designs. Sample sizes in competitive tests are closer to 20 in a group than the 5 to 8 that is common in diagnostic tests. but no one would use both of the competitors' products. such as the operating system. such as by having them attain similar average scores in a qualification test or by assigning them to the products by some random process. if tester's are comparing their product to two of their competitors. The tasks need to be selected because they are typical for the sample of users and the tasks they normally do. they eliminate the effect of groups not being equivalent but then have to worry about other problems. the test administrator who interacts with each test participant must do so without biasing the participants. it is almost always necessary to provide evidence showing that the groups are equivalent. such as job titles and time worked. Establishing the fairness of the tasks is usually one of the most difficult activities in a comparison test. and the interactions between the test administrator and the participants during the sessions. between-subjects designs need to have enough participants in each group to wash out any minor differences. Finally. An important concern to beware of in the between-subjects design is the situation in which one of the participants in a group is especially good or bad at performing tasks. but they need to make sure that the groups who use each product are equivalent in important ways. one superstar or dud could dramatically affect the comparison. This phenomenon is one of the reasons that competitive tests have larger sample sizes than diagnostic tests. because they describe tasks in the terminology used by one of the products. There are some designs that are hybrids because they use within-subjects comparisons but don't include all of the combinations. In that case. With larger numbers of participants in a group. it must be fair to all of the products.) They also have to be concerned about the test session becoming so long that participants get tired. In a within-subjects design. they might not care about how the two competitors compare with each other. Gray and Salzman (1998) called this the "wildcard effect. There are at least three potential sources of bias: the selection of participants. 1998a. If testers use a within-subjects design in which each participant uses all of the products. User-Based Evaluations demonstrate that one product is better on some measure. the test sessions are shorter than with the complete within-subjects design. it is difficult to counter the charge of bias even if there is no bias. In a competitive test using a between-subjects design. you need to counterbalance the order and sequence of the products. To eliminate effects due to order and the interaction of the product with each other. the most important of which are order and sequence effects and the length of the test session. Unlike a diagnostic test. For a comparison test to be valid. 1963. some of the products. They also need to have equivalent skills and knowledge with related variables. the bias can come from having the participants have more knowledge or skill with one product. If participants are to be told . asking them to list the tasks they do. you need a design that will validly measure the comparison. whereas a second group is recruited because they have experience with Product B. 
Another is to conduct a survey of end users. they avoid having any contamination from product to product. Each group then uses the product they know. If testers use a between-subjects design. or only one product? • How many participants are enough to detect a statistically significant difference? In the research methods literature. it is not clear whether the principles of research design should be applied to a diagnostic usability test. Without a baseline. Average measures from a diagnostic usability test with a few participants can be highly variable for two reasons. the tester might want to provide this training. A test session is hardly a spontaneous activity.1110 • DUMAS when they complete a task. especially a diagnostic test. which makes the data cleaner but also lessens its value as a diagnostic tool. Second. it is best to use a sample size closer to those from a comparison test than those from a diagnostic test. It is best not to have participants think aloud in a baseline test. First. Visitors often conclude that they are seeing what really happens when no one is there to watch customers. On the contrary. how does a tester interpret that result? One way is to compare it to a usability goal for the task. But some impressions of user testing can be wrong. This skepticism is healthy for the usability profession. CHALLENGES TO THE VALIDITY OF USABILITY TESTING For most of its short history. that in the "real world" people don't work that way but spend a few minutes exploring the product before they start doing tasks. usability professionals who write about testing agree that a usability test is not a research study. it is easy to believe that every user will have that problem. or their company might give them some orientation to it. Part of the reason for this freedom is the high face validity of user testing. which means that it appears to measure usability. it should be done after every complete task for all products. but others won't find the same information. Consequently. because of the small number of participants. Allowing Free Exploration An important issue in user testing is what the participant does first. some participants will find information that helps them do the tasks. Those against free exploration argue that it introduces added variability into the test. For example. we don't know what really happens when no one is watching. such as isolating an independent variable and having enough test participants to compute a statistical test. Some testers argue that this procedure is unrealistic. the product is getting a difficult evaluation and that the testing situation is not simulating the real use environment. Why don't usability specialists see the same usability problems? How Do We Evaluate Usability Testing? One of the consequences of making a distinction between usability testing and research is that it becomes unclear how to evaluate the quality and validity of a usability test. Users must know something about the product to buy it. For example. Furthermore. The six essential characteristics of user testing described above set the minimum conditions for a valid usability test but do not provide any further guidance. Baseline Usability Tests One of the ways to measure progress in user interface design is by comparing the results of a test to a usability baseline. especially for Webbased products. As I have noted. user testing has been remarkably free from criticism. 
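Counterbalancing the order and sequence of products in a within-subjects comparison, as the Fisher and Yates reference above addresses, is commonly handled with a balanced Latin square. The sketch below generates one; the product names are hypothetical, and with an odd number of products testers typically run the square plus its mirror image to balance carryover:

# Sketch of counterbalancing for a within-subjects comparison test: a
# balanced Latin square of product orders. Product names are hypothetical.
# With an even number of products, each product appears once in every serial
# position and precedes every other product equally often.
def balanced_latin_square(conditions):
    n = len(conditions)
    offsets, lo, hi = [0], 1, n - 1
    while len(offsets) < n:
        offsets.append(lo)
        lo += 1
        if len(offsets) < n:
            offsets.append(hi)
            hi -= 1
    return [[conditions[(i + off) % n] for off in offsets] for i in range(n)]

products = ["Product A", "Product B", "Product C", "Product D"]
for group, order in enumerate(balanced_latin_square(products), start=1):
    print(f"Participant group {group}: {' -> '.join(order)}")

Assigning equal numbers of participants to each row of the square spreads order and sequence effects evenly across the products being compared.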
Others argue that going directly to tasks without training or much of a preamble puts stress on the product to stand on its own. the test is intended to be a difficult evaluation for the product to pass. This debate continues. User testing easily wins converts. Another is to compare it to the results of the same task in an earlier version of the product. but most testers do not allow free exploration. Here I discuss four challenges to validity: 1. Why can't we map usability measures to user interface components? 3. do not apply to diagnostic usability testing. the thinking-aloud procedure typically used in diagnostic tests adds to the variability in performing the task. Are we ignoring the operational environment? 4. nobody really knows what users do when no one is watching. stress that is beneficial in making the product more usable. A usability test session is a constructed event that does not attempt to simulate every component of the real use environment. researchers and practitioners have begun to ask tough questions about the validity of user testing as part of a wider examination of all usability evaluation methods. But establishing a baseline of data takes care. the preamble leads to the first task scenario. When a usability problem appears in the performance of a test participant. Each task and each word in each scenario has been carefully chosen for a specific purpose. they think they are seeing a "real" user spontaneously providing their inner experiences through their think-aloud protocol. Using this procedure immediately throws the participant into product use. Should testers consider allowing the test participants 5 to 10 minutes of exploration before they begin the task scenarios? Those in favor of free exploration argue that without it. Because of the variability of task times it causes. Finally. average scores can be distorted by a wildcard. if all users will have some training before they use the product. if it takes a sample of participants 7 minutes to complete a task with an average of two errors. Most often. participant should not be thinking aloud and should be discouraged from making verbal tangents during the tasks. And unfortunately. a user test is a very constructed event. are all samples of tasks equal in terms of ensuring the valid of a test? Are some better than others? Would some samples be so bad as to invalidate the test and its results? Would any reasonable sample of tasks uncover the global usability . Principles. There is often a preamble to the first task scenario that puts the test and the tasks into some context. How do we evaluate usability testing? 2. Because of this variability. it can be difficult to interpret quantitative measures from a test and put them in context. When visitors watch a test for the first time. In the past 5 years. For example. Skinner (1956) invented a design in which causality between independent and dependent variables "was established with only one animal. Landauer (1995) urged usability professions and researchers to link measures such as the variability in task times to specific cognitive strategies people use to perform tasks.3. Reprinted with permission. Are We Ignoring the Operational Environment? Meister (1999) took the human factors profession to task for largely ignoring the environment within which products and systems are used (see Fig. 10. Until that happens. Allan. he presumably would have the same criticism of the use of testing laboratories to evaluate product usability. . 
a goal for the profession is to create a diagnostic taxonomy to make problem interpretations more consistent. Practitioners typically use their intuition and experience to make such connections. Unfortunately. Virzi et al. Each test team can roll their own categories. Without a consistent connection between measures and user interface components. Gray and Salzman (1998) and Lund (1998) have made similar points. He proposed that human factors researchers have chosen erroneously to study the interaction of people and technology largely in a laboratory environment. For example. which is to say they are not usability problems at all? Expert review. as often happens. and Raiello (1992) claimed that most of the problems identified by experts are false alarms. Common Ground. a long task time along with several errors in performing a task may be attributed to a poorly organized menu structure.Technology % Tasks 1 i FIGURE 56. Although Meister did not address usability testing directly. that is.3). In this analogy. (1993) compared the results of a performance analysis of objective measures with the results of a typical think aloud protocol analysis. the identification of problems in a user test looks suspiciously like an ad hoc fishing expedition. For example. But should we end there? Should we only fix repeating problems? And what if. Could some of these problems be false alarms. with a cause—a poor design. If Bailey et al. But they used the problems that they identified from user testing as the comparison. From "Usability testing methods: When does a usability test become a research experiment?" by J. has been criticized for proliferating false alarms. An effective tester is one who is good at tying symptoms. User-Based Evaluations problems? Is a test that misses uncovering a severe usability problem just imperfect. Would other testers make the same connection? Do these two measures always point to the same problem? Do these measures only point to this one problem? Is the problem restricted to one menu or several? Are some parts of the menu structure effective? As we have seen. Their study suggests that the only practice that makes any difference is to fix the one or two most serious problems found by user testing. By turning the independent variable on and off several times with the same animal. He noted that "any environment in which phenomena are recreated. usability problems. usability problems that repeat would establish a causal relationship between the presentation of the same tasks with the same product and the response of the participants. he was able to establish a causal relationship between. 1997). He asserted that in human factors. but we have little guidance about that makes a valid test. 2000. or is it invalid? Dumas (1999) explored other ways to judge the validity of a user test. a reinforcement schedule and the frequency and variability of bar pressing or pecking.56. however. difficulties with several words in an interface might be grouped under a "terminology" or a "jargon" category. an inspection evaluation method. we are left looking for good clinicians (testers). Bailey. 66). most of the problems identified by user testing also are false alarms. For example. The scope of human factors. are correct. for example. is artificial and unnatural" (p. 
Those who believe that it is important to work with users in their operational environment as the usability specialists gather requirements also believe that at least early prototype testing should be conducted in the work environment (Beyer & Holtzblatt. Hassenzahl (1999) agued that a usability tester is like a clinician trying to diagnose a psychological illness. They identified many fewer problems using performance analysis. the influence of the environment on the humantechnology interaction is critical to the validity of any evaluation. common practice in test reporting is to group problems into more general categories. Why Can't We Map Usability Measures to User Interface Components? An important—Gray and Salzman (1998) said the most important—challenge to the validity of usability testing is the difficulty of relating usability test measures to components of the user interface. • 1111 That study and others suggest that many problems identified in a usability test come from the think-aloud protocol alone. This makes the connection from design component to measures even more difficult to make. some participants don't have the problem? It is not clear where to draw the repetition line. other than the one for which it was intended. 56. The assumption these advocates make is that testing results will be different if the test is done in the work Human . In this analogy. Skinner's method is similar to having the same usability problem show up many times both between and within participants in a usability test. Dumas. there is no standardized set of these categories. In some ways. This relationship is exactly why a tester becomes confident that problems that repeat are caused by a flawed design. • Providing training to participants who will have it when the product is released—establishing a proficiency criterion that participants have to reach before they are tested is a way to control for variations in experience. ADDITIONAL ISSUES In this final section on usability testing. Campbell. there were 141 problems identified by the four labs. How does user testing compare with other evaluation methods? 3. They were given broad instructions about the user population and told that they were to do a "normal" usability test. and 75% of the problems were identified by only one team. & Uyeda. only one problem was identified by all seven teams. There were many differences in how the labs went about their testing. there were 310 problems identified by the seven teams. The evaluation methods . For some products. In the first study.. one would expect these labs staffed by usability professionals to find the same usability problems. the lab environment may be insufficient for uncovering all of the usability problems in products.1112 • DUMAS environment rather than in a usability lab. would an evaluation of a design for a clock radio be complete if test participants didn't have to read the time from across a dark room? Or shut the alarm off with one hand while lying down in a dark room? Meister admitted that it is often difficult to create or simulate the operational environment. in the second. How do we evaluate ease of use? 2. In the first study. 1991. 1992. When we list convenience as a quality of a usability lab. or they may use only a small part of it. Only one problem was identified by all of the labs. and then there is a wide range of products that fall in between. 
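The practice of rolling individual problem tokens up into a team's own higher-level categories can be illustrated in a few lines; because there is no standardized taxonomy, the category names below, like the tokens themselves, are hypothetical:

# Minimal sketch of grouping problem tokens into tester-defined categories,
# as described above. Tokens and category names are hypothetical; each team
# currently defines its own set.
from collections import Counter

categorized_tokens = {
    "didn't understand 'defragment'": "terminology",
    "asked what 'parse error' means": "terminology",
    "missed the Save option in the menu": "navigation",
    "reopened the same dialog three times": "navigation",
    "expected Undo to restore a deleted file": "feedback",
}

tally = Counter(categorized_tokens.values())
for category, count in tally.most_common():
    print(f"{category}: {count} problem token(s)")

A shared, agreed-on category set would make tallies like this comparable across teams, which is part of what a diagnostic taxonomy would provide.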
there are some techniques that address some long-term concerns: • Repeating the same tasks one or more times during the session—this method gets at whether usability problems persist when users see them again. Ninety-one percent of the problems were identified by only one lab. Testers see this characteristic as an asset because getting started with a new product is often a key issue. such as factory floor operational software. there were several research studies that looked at the ability of user testing to uncover usability problems and compared testing with other evaluation methods. Are there ethical issues in user-based evaluation? 5. One could imagine a continuum on which to place the influence of the operational environment on product use. These differences will lead to designs that are less effective if the richness of the work environment is ignored. It is not clear why there is so little overlap in problems. Nielsen & Phillips. Miller. 1994. Karat. but to date there are no research studies that speak to this issue. How Does Usability Testing Compare With Other Evaluation Methods? In the early 1990s. we need to keep in mind that for some products. Product developers often would like to know what usability will be like after users learn how to use a product. But even with that caveat. In many cases. assessing the ease of use for a new product is difficult with any evaluation method. Although these techniques sometimes are useful. The results of these studies were not encouraging. Again. These studies both had the same structure. Will users become frustrated by the very affordances that help them learn the product in the first place? How productive will power users be after 6 months of use? Although there is no magic potion that will tell developers what usability will be like for a new product after 6 months. 2001). There are two additional studies that also speak to this point (Molich et al. 1998). they may never use the product again. especially expert reviews and cognitive walkthroughs (Desurvire. four labs were included. If users can't get by initial usability barriers. Is it time to standardize methods? 4. A number of usability labs were asked to test the same product. One of the reasons the operational environment is not considered more often in usability evaluation is that it is inconvenient and sometimes difficult to simulate. such as office productivity tools. Why Don't Usability Specialists See the Same Usability Problems? Earlier I discussed the fact that usability specialists who viewed sessions from the same test had little agreement about which problems they saw and which ones were the most serious (Jabobsen & John. • Repeating the test—a few weeks in between tests provides some estimate of long-term use. & Fiegel. Are slight variations in method the cause? Are the problems really the same but just described differently? We look to further research to sort out the possibilities. Longer term usability issues are more difficult to evaluate. it seems unlikely that the operational environment would influence the usability of a product. a usability test probes the first hour or two of use of a product. The proponents of testing in the work environment offer examples to support their belief. Jeffries. It is clear form these studies that there is little commonality in testing methods. there were seven. 1993). In the second study. Is testing Web-based products different? How Do We Evaluate Ease of Use? 
Usability testing is especially good at assessing initial ease of learning issues. The classic case is an accident in a power plant that happens once in 10 years. I discuss five final issues: 1. the physical and social environments definitely influence product use. 1998. For some other products. For example. Wharton. Our assumption that usability testing is good method for finding the important problems quickly has to be questioned by the results of these studies. and reliability (repeatedly finding the same problems). the FDA described what it considers best practices in human factors methods that can be used to design and evaluate devices. But the assumption that usability testing uncovers the true problems has not been established.gov). Food and Drug Administration (FDA). and describes required measures of usability. In a report titled "Do It by Design" (http://www. It explains how measures of user performance and satisfaction. and performance. In these studies. Karat et al. Dumas and Redish (1993). problems not uncovered by other UEMs. proposed three criteria to evaluate UEMs: thoroughness (finding the most problems). (1991) found that usability testing didn't uncover as many problems as an expert review and that no one expert found more than 40% of the problems.gov/iusr/). User-Based Evaluations together are now called UEMs—usability evaluation methods. can be used to measure product usability. government. usability testing uncovered more unique problems than walkthroughs. Gray and Salzman's analysis was criticized by usability practitioners (Olson & Moran. Part 11 also includes an explanation of how the usability of a product can be evaluated as part of a quality system. As we have described above. The practitioners were not ready to abandon their confidence in the conclusions of the comparison studies and continue to apply them to evaluate the products they develop. Part 11 provides the definition of usability. usability testing found the smallest number of the least severe problems and the expert reviewers found the most.gov/cdrh/humfac/doit. testing generally came out quite well in comparison with the other methods. Since that time. ISO/DIS 13407. which approves new medical devices. The FDA stops short of requiring specific methods but does require that device manufacturers prove that they have an established human factors program. Their analysis makes it difficult to be sure what conclusions to draw from the comparison studies. who would like to have usability test data factored into the procurement decision for buying software. The FDA effort is an example of the U. validity (finding the true problems). and Turley (1998) proposed that usability testing and expert reviews find different kinds of problems. and Hartson (1999) in a meta-analysis of the comparison research. These efforts usually take a long time to gestate and their recommendations are sometimes not up to date. Desurvire (1994) compared usability testing to both expert reviews and walkthroughs and found that usability testing uncovered the most problems.ncsl. Andre et al.fda. In addition. who may have usability data available. the available research leaves us in doubt about the advantages and disadvantages of usability testing relative to other UEMs. being deficient in one or more of five types of validity. Jeffries et al. safety. 
Is It Time to Standardize Methods?

Several standards-setting organizations have included user-based evaluation as one of the methods they recommend or require for assessing the usability of products. These efforts usually take a long time to gestate, and their recommendations are sometimes not up to date, but the trends are often indicative of a method's acceptance in professional circles.

The International Organization for Standardization (ISO) Standard ISO 9241, "Ergonomic requirements for office work with visual display terminals (VDTs)," describes the ergonomic requirements for the use of visual display terminals for office tasks. Part 11 provides the definition of usability, explains how to identify the information that is necessary to take into account when evaluating usability, and describes required measures of usability. It explains how measures of user performance and satisfaction, when gathered in methods such as usability testing, can be used to measure product usability. Part 11 also includes an explanation of how the usability of a product can be evaluated as part of a quality system.

ISO/DIS 13407, "Human-centered design processes for interactive systems," provides guidance on human-centered design, including user-based evaluation, throughout the life cycle of interactive systems. It describes human-centered design as a multidisciplinary activity that incorporates human factors and ergonomics methods such as user testing. These methods can enhance the effectiveness and efficiency of working conditions and counteract possible adverse effects of use on human health, safety, and performance. It also provides guidance on sources of information and standards relevant to the human-centered approach.

One of the most interesting efforts to promote usability methods has been conducted by the U.S. Food and Drug Administration (FDA), specifically the Office of Health and Industrial Programs, which approves new medical devices. In a report titled "Do It by Design" (http://www.fda.gov/cdrh/humfac/doit.html), the FDA described what it considers best practices in human factors methods that can be used to design and evaluate devices. Usability testing plays a prominent part in that description. The FDA stops short of requiring specific methods but does require that device manufacturers prove that they have an established human factors program. The FDA effort is an example of the U.S. Government's relatively recent but enthusiastic interest in usability (http://www.usability.gov).

The most relevant standards-setting effort for those who conduct user-based evaluations is the National Institute of Standards and Technology's (NIST) Industry Usability Reporting (IUSR) project, which has been underway since 1997. It consists of more than 50 representatives of industry, government, and consulting who are interested in developing standardized methods and reporting formats for quantifying usability (http://zing.ncsl.nist.gov/iusr/). One of the purposes of the NIST IUSR project is to provide mechanisms for dialogue between large customers, who would like to have usability test data factored into the procurement decision for buying software, and vendors, who may have usability data available.
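Part 11 does not prescribe code, but the user performance and satisfaction measures it describes are commonly summarized as effectiveness, efficiency, and satisfaction scores computed from test data. The sketch below is one minimal illustration of that kind of summary; the formulas, data format, and values are conventional assumptions, not text from the standard or from this chapter.

```python
# A minimal sketch of Part 11-style performance and satisfaction measures.
# Assumed conventions: effectiveness as task success rate, efficiency as
# successful tasks per minute of task time, satisfaction as the mean of
# post-test questionnaire ratings on a fixed scale.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    participant: str
    task: str
    succeeded: bool
    minutes: float

def effectiveness(records):
    return sum(r.succeeded for r in records) / len(records)

def efficiency(records):
    total_minutes = sum(r.minutes for r in records)
    return sum(r.succeeded for r in records) / total_minutes if total_minutes else 0.0

def satisfaction(ratings):
    # ratings: post-test questionnaire scores, e.g., on a 1-7 scale.
    return sum(ratings) / len(ratings)

records = [
    TaskRecord("P1", "install", True, 6.5),
    TaskRecord("P2", "install", False, 9.0),
    TaskRecord("P3", "install", True, 5.0),
]
print("effectiveness:", effectiveness(records))
print("efficiency (successes/minute):", efficiency(records))
print("satisfaction:", satisfaction([5, 4, 6]))
```

Summaries of this kind are also the sort of data a standardized report format, such as the one described next, is meant to carry between organizations.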
NIST worked with a committee of usability experts from around the world to develop a format for usability test reports, called the Common Industry Format (CIF). The American National Standards Institute (ANSI) has made the CIF one of its standards (ANSI/NCITS 354-2001). The CIF specifies what should go into a conforming report, including what is to be included about the test method, the analysis of data, and the conclusions that can be drawn from the analysis. The CIF is intended to be written by usability specialists and read by usability specialists. It is not intended to apply to all usability tests; it applies to a summative test done late in the development process to measure the usability of a software product, not to diagnostic usability tests conducted earlier in development. The goal of the CIF is to facilitate communication about product usability between large companies who want to buy software and providers who want to sell it. One of its assumptions is that, given the appropriate data specified by the CIF, a usability specialist can measure the usability of a product their company is considering buying. The CIF thus provides a way to evaluate the usability of the products buyers are considering on a common basis. The CIF document is available from http://techstreet.com. It is difficult to know how this standard will be used, but it could mean that in the near future vendors who are selling products to large companies could be required to submit a test report in CIF format.

Are There Ethical Issues in User Testing?

Every organization that does user testing needs a set of policies and procedures for the treatment of test participants. Most organizations with policies use the federal government or American Psychological Association policies for the treatment of participants in research. At the heart of the policies are the concepts of informed consent and minimal risk. Minimal risk means that "the probability and magnitude of harm or discomfort anticipated in the test are not greater, in and of themselves, than those ordinarily encountered in daily life or during the performance of routine physical or psychological examination or tests" (Notice of Proposed Rule Making, Federal Register, 1988, Vol. 53, No. 218, p. 45663). Most usability tests do not put participants at more than minimal risk. If the test director feels that there may be more than minimal risk, he or she should follow the procedures described in the Notice of Proposed Rule Making in the Federal Register (1988, Vol. 53, No. 218, pp. 45661-45682).

Even if the test does not expose participants to more than minimal risk, testers should have participants read and sign an informed consent form, which should describe the purpose of the test; what will happen during the test, including the recording of the session; what will be done with the recording; who will be watching the session; and the participants' right to ask questions and to withdraw from the test at any time. Participants need to have the chance to give their consent voluntarily. For most tests, that means giving them time to read the form and asking them to sign it as an indication of their acceptance of what is in the form. For an excellent discussion of how to create and use consent forms, see Waters, Carswell, Stephens, and Selwitz (2001).

The special situation in which voluntariness may be in question can happen when testers sample participants from their own organizations. In that case, it is difficult to determine when the consent is voluntary. Withdrawing from the session may be negatively perceived, even if only in the eyes of the participant. The test director needs to be especially careful in this case to protect participants' rights to give voluntary consent. If the participants' bosses or other senior members of the organization will be watching the sessions, the participants need to know before the test that the tape of the session might be viewed by people beyond the development team. The participants have a right to know who will be watching the session and what will be done with the videotape. Participants' incentive to give positive ratings to the product may be increased when they believe that people from their company may be able to match their rating with their name. This discussion should make it clear that it may be difficult to interpret subjective measures of usability when the participants are internal employees.

The names of test participants also need to be kept in confidence for all tests, especially when the participants are internal employees. The participants' names should not be written on data forms or on videotapes; use numbers or some other code to match the participant with their data. Only the test director should be able to match data with the name of a participant, and it is the test director's responsibility to refuse to match names with data. The same issue arises when the results of a test with internal participants are shown in a highlight tape. Test directors should resist making a highlight tape of any test done with internal participants. If that can't be avoided, the person who makes the highlight tape needs to be careful about showing segments of tape that place the participant in a negative light.
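One way to follow the coding practice just described is to generate participant codes before the test and keep the name-to-code key in a separate file that only the test director can read. The sketch below is illustrative only; the file name, code format, and record fields are assumptions, not a procedure from this chapter.

```python
# A minimal sketch of coding participant identities so that data forms and tapes
# carry only codes, while a separate key held by the test director maps codes to names.
import csv
import secrets

def assign_codes(names):
    """Return {name: code} with short random codes such as 'P-3f9a'."""
    return {name: f"P-{secrets.token_hex(2)}" for name in names}

def write_director_key(codes, path="participant_key.csv"):
    # The key file is the only place names and codes appear together; it should
    # be stored where only the test director can read it.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "code"])
        for name, code in codes.items():
            writer.writerow([name, code])

def label_records(records, codes):
    """Replace names with codes before data are logged or shared."""
    return [{**r, "participant": codes[r["participant"]]} for r in records]

codes = assign_codes(["Alice Example", "Bob Example"])
write_director_key(codes)
print(label_records([{"participant": "Alice Example", "task": "search", "ok": True}], codes))
```

Highlight tapes and shared data files would then refer to participants only by code, which keeps the director's refusal to match names with data enforceable in practice.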
Is Testing Web-Based Products Different?

There is nothing fundamentally different about testing Web products. Often the users of Web-based products are geographically dispersed and may be more heterogeneous in their characteristics than users of other technologies. Such dispersed users can still be tested, but the logistics of such tests can be a challenge (Grouse et al., 2000). The most important challenge in testing these products is the speed with which they are developed (Wichansky, 2000). Unlike products with traditional cyclic development processes, Web products often do not have released versions; they are changed on a weekly, if not daily, basis. For testing, this means gaining some control over the product being tested. It needs to be stable while it is tested, not a moving target. With control comes the pressure to produce results quickly. Conducting a test in 8 to 12 weeks is no longer possible in fast-paced development environments; testing in 1 or 2 weeks is more often the norm now. Testing with such speed is only possible in environments where the validity of testing is not questioned and the test team is experienced.

References

Abelow, D. (1993). Usability evaluation in industry (pp. 34th Annual Meeting (pp. May). Brooke. preference. C.
Its findings have the most credibility with developers of all of the evaluation methods. Proceedings of the Human Factors and Ergonomics Society. Bowers. R. Performance vs. The effectiveness of usability evaluation methods: Determining the appropriate criteria. Common Ground. Faster. London: Taylor & Francis. W. Ramey (Eds. (1996). & I. (1988). (1994). 21. We can never go back to our earlier innocence about this method. (1998). & I. to provide an absolute measure of usability. T.. Santa Monica. K. The best questionnaires also have the potential to allow usability comparisons across products and. H. & Torkzadeh. although it remains very popular and new usability labs continue to open. Santa Monica. 3-6. CA. Paper presented at the annual meeting of the American Voice I/O Society (AVIOS). Rossi. 7. Allan. 38th Annual Meeting (pp. J. In R. Dobroth. (1994). September). SIGCHI Bulletin. Santa Monica. Field methods casebook for software design (pp.. C. 63-76). & Biers. (1983). Questionnaires are a useful way to evaluate a broad sample of users. B. R. Jordan. 10-11. Baber. Usability evaluation in industry (pp. (1996). (1999. Private camera conversation: A new method for eliciting user responses. Concurrent versus retrospective verbal protocols for comparing window usability. & Hartson. San Diego. & Ramey. M. direct or video observation is useful in special situations. Development of an instrument measuring user satisfaction of the human-computer interface. R. Proceedings of the Human Factors and Ergonomics Society. 2. Observation as a technique for usability evaluation. N.. & Norman. It allows usability specialists to observe populations of users who cannot otherwise be seen or who can only be observed through the medium of videotape. N. CA: Human Factors and Ergonomics Society. IEEE Transactions on Professional Communication. McClelland (Eds. R. 1-23. W. Tasks for testing documentation usability. tests can be conducted quickly and allow retesting to check whether solutions to usability problems are effective. B. Chin. S. Using testing to compare products or to provide an absolute measure of usability requires more tune and resources and testers who have knowledge of research design and statistics. SUS: A quick and dirty usability scale. (1996). Bauersfeld. Santa Monica. The measurement of end-user computing satisfaction. Nielsen & R. & Holtzblatt. G. Bailey. 12. K. & Snyder. Mack (Eds. MIS Quarterly. It appears that usability testing has entered into a new phase in which its strengths and weaknesses are being seriously debated. Santa Monica. perhaps. 37th Annual Meeting (pp. L. Anderson (Eds.. (1990). K. M. 1131-1134). New York: John Wiley. & Raiello. & Stanton. R. In D. Jordan. Bailey. Response effects. Proceedings of Human Factors in Computing Systems '88. Practical guidance for conducting usability tests of speech applications.. M.).Jordan. User-Based Evaluations The Future of Usability Testing Usability testing is clearly the most complex usability evaluation method. Weerdmeester. CA: Human Factors and Ergonomics Society. DeVries. B. (1997). 177-196). 189-194). Among the user-based methods. K. 8. (1992).56. which looks so simple in execution but whose subtleties we are only beginning to understand. heuristic evaluation: A head-to-head comparison. As currently practiced.. 42nd Annual Meeting (pp. Common Ground. Weerdmeester. Usability testing can be used throughout the product development cycle to diagnose usability problems. D.). Wright. CA: Human Factors and Ergonomics Society. J. & Halgren. 
Bias. The handbook of survey research (pp. Proceedings of the Human Factors Society. In P. and to sample repeatedly the same user population.. W. 409-413). Cantani. Wixon & J. 259-374. R.). (1996). Branaghan.. WHICH USER-BASED METHOD TO USE? Deciding which of the user-based evaluation methods to use should be done in the context of the strengths and weaknesses • 1115 of all of the usability inspection methods discussed in Chapter 57. Diehl. W. & Biers. R. to measure the usability of a product that has been used by the same people over a long period of time. CA: Human Factors and Ergonomics Society. B. Usability evaluation and prototype fidelity: Users and usability professionals. 1270-1274). Thinking aloud: Reconciling theory and practice. B. 1090-1094). J. Usability testing vs.. D. Barker. B. McClelland (Eds. 85-94). (2000. W. Usability inspection methods (pp. In P. Hartevelt. (1990). 1-2.. W. V. H. Software usability testing: Do user self-consciousness and the laboratory environment make any difference? Proceedings of the Human Factors Society. M. P. 147-156). W.). New York: John Wiley. London: Taylor & Francis. Chignell. Andre. (1997). London: Taylor & Francis. 27-34. B. In J. Proceedings of the Human Factors and Ergonomics Society. A. 43rd Annual Meeting (pp. Doll. Branaghan. Williges. A taxonomy of user interface terminology. H. 1331-1335). Santa Monica. Nielsen & R. New York: Academic Press.. InP..). Boren. R.. Bradburn. McClelland (Eds. T. (1999). H. Thomas. 289-328).. The pluralistic usability walkthrough: Coordinated empathies. Proceedings of the Human Factors Society.). CA: Human Factors and Ergonomics Society. & Oosterholt. The recent research has opened up a healthy debate about our assumptions about this method. 173-202). Desurvire.. M. "You've got three days!" Case studies in field techniques for the time-challenged. Thomas. New York: John Wiley. 213-218. D. Usability inspection methods (pp.). Usability evaluation in industry (pp. Beyer. (1992). R. & J. V. CA: Human Factors and Ergonomics Society. cheaper! Are usability inspection methods as effective as empirical testing? In J. Thomas. . Could usability testing become a built-in product feature? Common Ground. Contextual design: Designing customer-centered systems. 36th Annual Meeting (pp. (1998). Mack (Eds. 625-628). (1982). 157-160. G. Kantner. Branghan (Ed. Chicago: Usability Professional's Association. In R. 1-5. Proceedings of the IDEA 2000/HFES 2000 Congress. E. & Parasuraman. Gaba. 42nd Annual Meeting (pp. Law. Gray. 1205-1209). (1993). Salvendy. (1999). H. A practical guide to usability testing (Rev. M. Human performance in dynamic medical domains.. 42nd Annual Meeting (pp. Apple pie a-la-mode: Combining subjective and performance data in human-computer interaction tasks. Usability engineers as clinicians. J. Kantner. Effect of type of information or real-time usability evaluation: Implications for remote usability testing. CA: Human Factors and Ergonomics Society. (1995). G. J. B..). Hughes. A. Castillo. K. (1998a). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. S. Chicago: Usability Professionals' Association. England: Cambridge University Press. 553-573. Proceedings of Human Factors in Computing Systems '92.. Cambridge. C. efficiency. 78-81.. Jacobsen. Rigor in usability testing.. Jones & R. Proceedings of the IEA 2000/HFES 2000 Congress. B. Jeffries. Common Ground. Lewis. & Hornbaek. CA: Human Factors and Ergonomics Society. 
Eye movement-based interface evaluation: What can and cannot be assessed? Proceedings of the IEA 2000/HFES 2000 Congress (44th Annual Meeting of the Human Factors and Ergonomics Society) (pp. 8. Kirakowski. SIGCHI Bulletin. 197-224). J. Common Ground. 3. (2000). W. (2000). D. 1336-1340). 585-588. Assessing Web site usability from server log files. Usability evaluation in industry (pp.. 6. MA: MIT Press. The need for a standardized set of usability metrics.. J. G. & Neale. J. & Biers. M. Kelso. & Vanderheiden. (1998). H. Hillsdale. S. Dumas.. Lewis. D. pp.. & Myers. Proceedings of Human Factors in Computer System. (2001). S. M. NJ: Ablex.. (1995). In R. In R. (1999). CA: Human Factors and Ergonomics Society. Hackman. (1998). M. 688-691). Damaged merchandise? A review of experiments that compare usability methods [Special Issue]. CA: Human Factors and Ergonomics Society. & Redish. 7. (2001b). J. Fu. M.. Following a fast-moving target: Recording user behavior in Web usability testing. Igbaria.). (1996). J. Goldberg. Attitudes towards microcomputers: Development and construct validation of a measure. A.. A practical guide to usability testing. Kamler.. 8. & Turley. N. NJ: Lawrence Erlbaum Associates. 1341-1345).. L. and who are simulating disabilities—experiences with blindness and public information kiosks. Reducing sample sizes when user testing with people who have. H. R. K. R. (1999). (1991). J. Santa Monica. Winder (Eds. Cambridge. & Redish. (1995).). Kennedy. Santa Monica. Lund. 13.1116 • DUMAS Dumas. J. Hartson.. Design by people for people: Essays on usability.. London: Taylor & Francis. ed. 4. In M. Frokjaer. G. (1994).). Applying usability methods to a large intranet site.. Kirakowski. L. W. 1085-1089). Santa Monica. J. Design by people for people: Essays on usability (pp. (1998). 34. Common Ground. Proceedings of Human Factors in Computing Systems '91. Lesaigle. D. W... CA: Human Factors and Ergonomics Society. CA: Human Factors and Ergonomics Society. J. R. (2000). Sample size for usability studies: Additional considerations.). & Corbett. Statistical tables for biological. E. 17-18. Usability testing methods: When does a usability test become a research experiment? Common Ground. 45-52. K. & Simon.. C. 3-5.. 21. (2001a). (1996). 42nd Annual Meeting (pp. Usability testing methods: Subjective measures Part II—Measuring attitudes and opinions. M. M. J. E. agricultural and medical research. Jean-Pierre. Santa Monica. . Chicago: Usability Professional's Association. D. M. SICCHI Bulletin. 8. Human Factors. J. Santa Monica. Gage. Lister. Who finds what in usability evaluation. Measuring user satisfaction. CA: Human Factors and Ergonomics Society. W. (1999). K. Dumas. D. M. Usability testing methods: Using test participants as their own controls. Proceedings of the Human Factors and Ergonomics Society. 36. (1991). J. & Biers. Measuring usability: Are effectiveness. (1998b). International Journal of Man-Machine Studies. R. Bogner (Ed. 43rd Annual Meeting (pp. Design by people for people: Essays on usability (pp. & I. Comparison of empirical testing and walk-through methods in user-interface evaluation. 12-13. 43rd Annual Meeting (pp. M. 397-404. C.). G. McClelland (Eds. B. International Journal of Human-Computer Interaction. Human error in medicine (pp. F. S. C. (1993). (1994). & Goff. A. 368-378. Thomas. Landauer. J. 3-5. (1998). (1963). 782786). 203-261. L. CA: Human Factors and Ergonomics Society. London: Intellect Books. 43-50. 23. CA: Human Factors and Ergonomics Society. 
Santa Monica. Proceedings of Human Factors in Computing Systems '96.. Hassenzahl. 5-10. & Yates. Protocol Analysis: Verbal Reports as Data. Santa Monica. A. Ground. B. J.. 9. Ledgard.. (1988). 57-78. Dumas. IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. Wharton. Remote evaluation: The network as an extension of the usability laboratory. J. Hertzum. Common Ground. The trouble with computers.). 228-235. Proceedings of the Human Factors and Ergonomics Society. Dumas. and satisfaction really correlated? Proceedings of Human Factors in Computing Systems '2000. Dumas. Usability testing methods: Think-aloud protocols. A. Usability testing methods: Subjective measures Part I—Creating effective questions and answers. Ericsson. Cambridge. IV. Technical Communication. The evaluator effect in usability studies: Problem detection and severity judgments. Usability testing software for the Internet. R. HumanComputer Interaction. M. (1992). Common Ground. 9. & Fiegel. J. In D. & John.). & Ensing. MA: MIT Press. Edinburgh. People and computers (Vol. S. 46. M.. Proceedings of the Human Factors and Ergonomics Society. Jordan. Campbell. C. Proceedings of the Human Factors and Ergonomics Society. (1991). Proceedings of the Human Factors and Ergonomics Society. 189217). Branaghan (Ed. (1989). 135-156. 488-494. 245-261). Santa Monica. & Salzman. 119-124. The software usability measurement inventory (SUMI): Background and usage. C. Branaghan (Ed. Educational psychology (5th ed. Fisher. Dumas. Weerdmeester. 10. Team usability testing: Are two heads better than one? Proceedings of the Human Factors Society. N. (1998c). L. Landay J. (1999). T. Miller. Karat... Dumas. & Berliner. User interface evaluation in the real world: A comparison of four techniques. (1991). Lewis. 36th Annual Meeting (pp. 235-244). & Uyeda. Scotland: Oliver & Boyd. A. (2000). Miller. New York: Houghton Mifflin. (1992). Proceedings of Human Factors in Computing Systems 2001. 92-95. Interactive sketching for the early stages of user interface design.. Using video in the BNR usability lab. In P. H. (1999). Usability testing methods: The fidelity of the testing environment. Proceedings of Human Factors in Computing Systems '95. (2001).. C. Evaluating text editors. (2000). T. 4-8. 169-177). Grouse. R. Santa Monica. Kindlund. Reading. J. Dumas. & Moran. G. D. (1956). Mitropoulos-Rundus. (1993). Proceedings of Human Factors in Computing Systems '87. J. Branaghan (Ed. & Brok. Rosenthal & R. Butler. Dallas. 43. (1994).. Proceedings of the Human Factors and Ergonomics Society. Virzi. & Selwitz. CA: Human Factors and Ergonomics Society. Common Ground. Curson. Ergonomics. C. 214-221). Proceedings of the Usability Professionals'Association (pp. hi P. Shneiderman. (1989).. (1997). Proceedings of the Human Factors Society. Rubin. C. I. L. formal. A case history in scientific method.. Virzi. and empirical methods compared. K. 1-12). M. New York: Academic Press. Designing the user interface: Strategies for • 1117 effective human computer interaction (3rd ed. & Thurrott. G. Santa Monica. (1990). L. An instrument for measurement of the visual quality of displays. Skinner.)... (1996). N. 221-233.. (2002). Usability problem identification using both low and high fidelity prototypes. E. 5. van Oel. B. D. (1999).. HumanComputer Interaction. & Barnard. A. & Phillips. B. Santa Monica. 998-1006. G. 143-179).56. V L. Jordan. MA: Addison-Wesley. 249-260. (2001). 11. J. 291-296. & Coyne. 4. 
Designing the user interface: Strategies for effective human computer interaction. Demand characteristics and the concept of quasicontrols. Usability testing and group-based software: Lessons from the field. Schmidt. Remote usability evaluation over the Internet. 34th Annual Meeting (pp. Behaviour and Information Technology..). Molich. Philips. Thomas. 265-268. E. Seeley. 373-380). McClelland (Eds.).. Perkins. P.). R. (1992). MA: Addison-Wesley. Comparative evaluation of usability tests. The history of human factors and ergonomics. Reading.. L. hindrance or ho-hum? Proceedings of Human Factors in Computing Systems '89. Molich. Designing the user interface: Strategies for effective human computer interaction (2nd ed. Usability evaluation in industry (pp. R. W. Ede.. (1990). E. (2001). Does the fidelity of software prototypes affect the perception of usability? Proceedings of the Human Factors Society. B. In R. K. New York: John Wiley. Ergonomics in Design. (1994). CA: Human Factors and Ergonomics Society.. 295-299). and performance testing. 14-20... R. & I. Finding usability problems through heuristic evaluation. 13. K. A. Kindlund. Stephens. (1998). The use of scenarios in HCI research: Turbo charging the tortoise of cumulative science. Damaged merchandise? A review of experiments that compare usability methods [Special Issue]. (1992). R. J. J. B. Sorce. 9. & Kahmann. (1992). Thomas. Nielsen. How to design and conduct a consumer in-home usability test. 1-11. Design by people for people: Essays on usability (pp. Norman. Common Ground.. (2001). B. In R. M. 236-243. Orne. J. think-aloud. Estimating the relative usability of two interfaces: Heuristic. Kaasgaard. Wilson. A. . TX: Usability Professionals' Association. 8. Wiklund.. J. E. Vora. Olson.. Refining the test phase of usability evaluation: How many subjects is enough? Human Factors. & Bouchette. Reading. R. J. Weerdmeester. A. 107-114). Common Ground. 12. Carswell. 309-313. Tracking usability issues: To bug or not to bug? Interactions. (1997).. 34. B. 1207-1212). R. 15-19.. & Dumas. 153-162). Scholtz. Miller. D. B. Handbook of usability testing. Spenkelink. Nielsen. J. Usability testing in 2000 and beyond. Shneiderman. R. Research ethics meets usability testing. P. Karyukina. S. Beuijen.. J. Artifact in behavioral research (pp. (1969). Waters. R (1987). MA: AddisonWesley. (1995).. Shneiderman. 10-12. 457-468. 34th Annual Meeting (pp. Proceedings of the Human Factors Society. (1993). (1987). CA: Human Factors and Ergonomics Society. D. & Muzak.. Sevan. Comparative usability evaluation. Proceedings of the Association of Computerized Machinery INTERCHI '93 Conference on Human Factors in Computing Systems (pp.. J. Streamlining the design process: Running fewer subjects. S. In press. R. Santa Monica. B. C. (1992). M. User-Based Evaluations Meister.). F. Proceedings of Human Factors in Computing Systems '92 (pp. (1993). B. R. Wolf. D. (1998). A comparison of three usability evaluation methods: Heuristic. B. M.. Mahwah. & Kirakowski. A. J. 203-261. 5-9. 7. NJ: Lawrence Erlbaum Associates. Virzi. 291-294). American Psychologist.. Using teaching methods for usability evaluations.. & Herbert. Quick and dirty usability tests. Wichansky A. 36th Annual Meeting (pp. R. New York: ACM Press. 37th Annual Meeting.. Rosnow (Eds. Chicago: Usability Professional's Association. Young. Proceedings of Human Factors in Computing Systems '96. (2000). Usability testing: Functional requirements for data logging software. (1996).... Sokolov. Virzi.. 
London: Taylor & Francis.. & Karis. K. The role of laboratory experiments in HCI: Help. R.