Previously posted at http://joel.westside.com/wsContentPublisher/story.view?RowId=11

Advanced Usability Testing

This was a full-day tutorial at the Seattle Center, put on by the Nielsen Norman Group as part of their User Experience World Tour. The instructor (who was excellent) was Rolf Molich, a principal in the Danish usability consulting company dialogdesign (dialogdesign.dk). The class was based in large part on the results of CUE-2, a study of the usability testing and reporting of nine professional and student teams who volunteered to participate. They tested Hotmail.com in late 1998, and CUE-2 compares their findings and methodology (http://www.dialogdesign.dk/cue.html). The class covered Evaluation Methods, Creating Test Tasks, Identifying and Describing Usability Problems, and Communicating Results. There were about forty people; dunno how many were in the beginner class. My notes presented here are a random mishmash of observations, quotes, and paraphrases.

General notes

It's hard to compare usability methodology because many companies do not share methodology. Although Microsoft participated in CUE-2 (they own hotmail.com, the tested site), they abruptly stopped participating at some point after the study, and now clam up with "it's Microsoft policy not to discuss internal results of usability ...." Mr. Molich interpreted this as the lawyers finding out what was going on. I agree, but I don't think Microsoft (and other companies) are trying to hide their methodology as a competitive advantage. I think that they are reluctant to discuss it because either results or practices could turn into a PR disaster. I also think that this is a ~~cowardly~~ misguidedly conservative approach.

Heuristic evaluation will find less than half of the problems. The personal opinions of usability professionals are not worth much. Even the professional opinions of the best are usually wrong. The only way to find out if Function X on Site Y is usable is to watch typical users try it. On the other hand, even a heuristic evaluation that finds only half the problems on a site is still finding a lot of problems, and is almost certainly cheaper than user testing.

CUE-2 Results

Nine teams found a total of 310 problems with hotmail.com.
Only one problem was critical - a confusing 'password hint' page
Most teams spent ~100 hours and used one person. (Teams included Sun, SGI, consultants, et al.)
One team did a questionnaire of 50 users. This team found very few problems, and missed the most critical problem. Unattended usability tests don't work. An unattended test is "participant carries out tasks without being observed, and reports problems found."
75% of all problems were found by only one team.
No problems were found by all nine teams, or even eight. Only two problems out of 310 were found by as many as six teams.
Web usability results are non-reproducible. This may be because most web sites are so bad, with so many problems, that no one test can find even the majority of the big ones.
Nobody can differentiate the student reports from the professional reports.
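
The overlap figures above (share of problems found by only one team, problems found by every team) are simple to compute once each team's findings are mapped to shared problem IDs. Here is a minimal sketch of that tally. The team names and problem IDs are made up for illustration and are not CUE-2 data, and `overlap_summary` is a hypothetical helper, not anything from the study.

```python
from collections import Counter

# Hypothetical findings: team name -> set of problem IDs (NOT CUE-2 data).
findings = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p4"},
    "C": {"p1", "p5", "p6"},
    "D": {"p2", "p7"},
}


def overlap_summary(findings):
    """Tally how many teams reported each distinct problem."""
    # Each team contributes a problem at most once because findings are sets.
    counts = Counter(pid for problems in findings.values() for pid in problems)
    total = len(counts)
    found_by_one = sum(1 for n in counts.values() if n == 1)
    found_by_all = sum(1 for n in counts.values() if n == len(findings))
    return {
        "total_problems": total,
        "found_by_one_team": found_by_one,
        "share_found_by_one_team": found_by_one / total if total else 0.0,
        "found_by_all_teams": found_by_all,
        "max_teams_on_one_problem": max(counts.values(), default=0),
    }


summary = overlap_summary(findings)
for key, value in summary.items():
    print(key, value)
```

The hard part in practice is not this arithmetic but deciding when two teams' write-ups describe the same problem, which this sketch assumes has already been done by hand.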

What's a usability problem

User error, user preference, user attitude, user comment.

(from http://www.ucc.ie/hfrg/resources/qfaq1.html) ISO 9241 standard, part 11, defines usability in terms of effectiveness, efficiency, and satisfaction.

Break down related problems into the 'smallest discrete problem': a problem that can be fixed without affecting any other usability problem.

severity scale:

0: (not in the tutorial - just my opinion) User abandons the site/product; or security hole
1: Severe. User cannot complete a task. User completes a task incorrectly without noticing (either could be a zero if the consequences are catastrophic)
2: Important. User is delayed or upset completing a task.
3: Cosmetic. (My personal opinion is that if it's cosmetic, it's a normal bug, not a usability bug)
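
If problems end up in a bug tracker or a script, the scale above maps naturally onto an ordered enumeration. This is only a sketch: the names `Severity`, `Problem`, and `triage` are my own, the level names paraphrase the scale, and the sample problems are invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number = worse, matching the scale in these notes."""
    CATASTROPHIC = 0  # not from the tutorial: user abandons, or security hole
    SEVERE = 1        # user cannot complete a task, or completes it wrongly
    IMPORTANT = 2     # user is delayed or upset
    COSMETIC = 3


@dataclass
class Problem:
    description: str
    severity: Severity


def triage(problems):
    """Return problems worst-first; ties keep their original order."""
    return sorted(problems, key=lambda p: p.severity)


# Invented examples, for illustration only.
problems = [
    Problem("Banner image is misaligned", Severity.COSMETIC),
    Problem("User cannot finish sign-up", Severity.SEVERE),
    Problem("Search takes several tries", Severity.IMPORTANT),
]

for p in triage(problems):
    print(int(p.severity), p.severity.name, "-", p.description)
```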

Usability Tasks

Don't use system-oriented terminology like register. Don't use slang. Don't use humor (it isn't funny when you can't do it). Don't use personal references ('send mail to your mother' - instead use 'a friend').
Base tasks on user goals, not system features
Get rid of hidden clues. Hidden clues include exact vocabulary matches between tasks and web site labels; the mere fact that a task can be achieved with the product at hand. Use indirect tests when necessary.
To test something more realistic, use real loads: thousand-message inboxes, slow connections, etc.

Performing the test

If a user has flailed for 8 minutes, they're not going to get it (this number is anecdotal, comes from apps testing, and may be too high)
(I asked the group how long a typical test went). Rolf proposed 10 min for prep, 60 min for testing, 20 min for debrief and asked for a show of hands of who was doing something significantly different. Nobody raised their hand.
Ask standard questions after each task for better results (people remember recent successes and discount the struggling once they succeed "I'm sure it would have been easier for someone else - it just took me a while to find it.")
Q: what about tester in room versus tester in other room? A: I always do tester in room for social reasons - it's more humane.
When doing a paper test, have a third person to manipulate the paper and "be the computer"
Prototypes should be cross-functional efforts (dev, design, graphic, et al)
some people take notes by computer instead of paper
Nobody had any suggestions on how to better integrate help testing. Useful comments: "Testing whether or not somebody uses help is different from testing whether or not help is useful." "Better to check logs to see if anyone uses help."
Display tester limitations ("Sorry, I can't type that fast.") to make user more comfortable
tell user in advance that some tasks may not be possible. Then, set an impossible task very early in the test. This helps reset the user's mindset back to something more realistic, namely that in the real world, you don't know in advance if the product you are using can do the task you need.
Put a super-easy task first to make the user more comfortable
avoid tasks that build on each other. If the user cannot complete one, the others are harder and (even if you have a fallback rigged) more stressful.
Don't test your own baby as you won't be objective - very easy to insert your own bias at many points in the process.
Good phrases
- "Here's what the designer intended. How could we make that clearer?"
- "I must be stupid." "Everything I've seen you do makes perfect sense to me. It's just that this system isn't designed that way."
- "Don't tell me, show me." (Tip 4: avoid user opinions.)
Bad phrases
- "Yeah, we thought about that."
- "Yeah, we tried that and it didn't work and here's why..."
- "Lab," "Experiment," "Research," "Test Subject." Instead, "Usability Evaluation"
- "I'll show you a good way to do that." Instead, "How should we change the interface to avoid the problems you had."
When you hint, you will help more than you intend. Some labs have a strict "no hints" policy and just move people on to the next task.
Discount Usability Testing: Mr. Molich advocated this methodology. The idea is, "strive to optimize the cost/benefit ratio."
Should you videotape? In general, no, because analyzing results takes much too long. Instead, take notes while it happens. Ask the user to stop for a minute if a lot of stuff happens at once. You will miss things, but it's unlikely you'll miss anything important (if it's important, it will happen more than once). Reasons to videotape include proving your results to people who don't think a user would actually say X, and emphasizing the consequences of usability problems.

Testing with Advanced Users

When testing with advanced users, the user should provide the task whenever possible.
Have users bring typical tasks. "What are the last three things you did with this product?" "Can you defer your normal tasks for a few days and then do them here in the test?"

Reporting

The users of a usability test are the developers. Make sure that the usability test and results are themselves usable.
Usual result of a usability test is a report: 8-12 pages, no more than 40 defects. Doesn't seem to be much iterative usability testing.
some people advocate tracking usability problems in the bug system
include severity for each issue.
Report should include: description of specific problem. Description of principle violated. Description of general problem, and notes on where dev/test should look for other instances of same problem.
Reports should have plenty of positive items - perhaps one positive item per problem. (N.B. In my opinion, this is a structural problem. Should a usability test produce a list of problems to fix, or a validation that the site enables users to accomplish their goals? This is analogous to the same problem in quality assurance/control: you must start by testing all the things that a program is supposed to be able to do, and marking (as bugs) where it cannot; then, you can go digging for bugs. So, how can you report results in terms of validation - "10 basic user tasks; most users can accomplish N tasks with avg time X and error rate Y"? Mr. Molich suggested a color-coded results grid.)
The point of usability testing is to find and correct usability problems. Therefore, reporting is as important as testing.
Having developers observe tests is crucial. It will help developers think like users and thereby code more usable functions. Testing should be located somewhere convenient to developers, not to users or the tester. (Although I agree on a pragmatic level, I'm a little uncomfortable endorsing gratuitously developer-centric thinking.)
There was discussion of the KJ method and consensus-building among developers. I didn't take good notes because it isn't really relevant to our continuous usability improvement process. (It's a way to sit down after a test and get everyone to agree on what to fix.)
Should problem descriptions include proposed solutions? Yes, because it helps developers and helps show that the reporter understood the problem. No, because it short-circuits other possible solutions.
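
The validation-style report floated in the notes above (tasks, completion, average time, error rate, color-coded grid) could be produced from raw session notes along these lines. This is a sketch under my own assumptions: the `Attempt` record, the `summarize` and `color_for` helpers, the 80%/50% color thresholds, and the sample numbers are all hypothetical, not something Mr. Molich specified.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Attempt:
    task: str
    completed: bool
    seconds: float
    errors: int


def color_for(completion_rate):
    """Map a completion rate to a grid color (thresholds are assumptions)."""
    if completion_rate >= 0.8:
        return "green"
    if completion_rate >= 0.5:
        return "yellow"
    return "red"


def summarize(attempts):
    """Return one summary row per task, in order of first appearance."""
    by_task = {}
    for attempt in attempts:
        by_task.setdefault(attempt.task, []).append(attempt)
    rows = []
    for task, runs in by_task.items():
        rate = sum(r.completed for r in runs) / len(runs)
        rows.append({
            "task": task,
            "completion_rate": rate,
            "avg_seconds": mean(r.seconds for r in runs),
            "avg_errors": mean(r.errors for r in runs),
            "color": color_for(rate),
        })
    return rows


# Invented session data, for illustration only.
attempts = [
    Attempt("send a message", True, 60, 0),
    Attempt("send a message", True, 90, 1),
    Attempt("send a message", False, 240, 3),
    Attempt("recover password", False, 300, 4),
    Attempt("recover password", False, 280, 2),
]

rows = summarize(attempts)
for row in rows:
    print(row)
```

A grid like this sits alongside the problem list rather than replacing it: the rows say whether users can do the job, and the problem descriptions say why not.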

Differentiate between expert opinions, user opinions, and user findings.

Discuss the main findings immediately after the tests.

Assorted notes

eye-tracking isn't necessary, because there's still more than enough low-hanging fruit - problems that can be found without eye-tracking
A typical test of a web site, from scratch, is ~US$5,000 - $10,000, 50-80 hours.
Beware of habits that produce data you never use
A useful error message always says what to do next.
You should be able to repeat an error message to the user's face.
videotape yourself to see how you do as a test manager
Have yourself tested by a usability consultant (coach) every two years
Controversial rule of thumb: "Five users can find 75% of all problems, and since you'll never be able to fix them all anyway, that's enough." Instead, "The first five users will probably find 75% of the problems for a particular task set for a particular version."
Tip #27: "Keep the number of participants low so that you are just as interested in your final participant as in the first participant."
Mr Molich's conclusion from CUE: Many of the problems are caused by lack of knowledge about elementary rules for usable design. More focus on prevention; less focus on testing.
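
A footnote on the "five users find 75%" rule of thumb above: the usual arithmetic behind claims like this assumes every user independently hits any given problem with the same probability p, so n users find a share of 1 - (1 - p)^n. That model is my addition, not something from the tutorial, and the CUE-2 overlap numbers suggest real testing is much messier. Under that assumption, this sketch backs out the p that "75% with five users" implies and shows how the share grows with more users.

```python
def share_found(n_users, p):
    """Expected share of problems found by n_users.

    Assumes each user finds any given problem independently with
    probability p (a simplifying assumption, not a tutorial claim).
    """
    return 1.0 - (1.0 - p) ** n_users


def implied_p(share, n_users):
    """Per-user detection probability implied by a claimed share."""
    return 1.0 - (1.0 - share) ** (1.0 / n_users)


# Back out p from the rule of thumb: 75% of problems with five users.
p = implied_p(0.75, 5)
print(f"implied per-user probability: {p:.3f}")

for n in (1, 3, 5, 10):
    print(n, f"{share_found(n, p):.2f}")
```

Note that the restated version of the rule ("for a particular task set for a particular version") matters here: the share is of the problems reachable through those tasks, not of every problem on the site.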
