by Joel Aufrecht
12:00 AM, 07 Apr 2001
Previously posted at http://joel.westside.com/wsContentPublisher/story.view?RowId=11
Advanced Usability Testing
This was a full-day tutorial at the Seattle Center, put on by the Nielsen Norman Group as part of their User Experience World Tour. The instructor (who was excellent) was Rolf Molich, a principal in the Danish usability consulting company dialogdesign (dialogdesign.dk). The class was based in large part on the results of CUE-2, a study of the usability testing and reporting of nine professional and student teams who volunteered to participate. They tested Hotmail.com in late 1998, and CUE-2 compares their findings and methodology (http://www.dialogdesign.dk/cue.html). The class covered Evaluation Methods, Creating Test Tasks, Identifying and Describing Usability Problems, and Communicating Results. There were about forty people; dunno how many were in the beginner class. My notes presented here are a random mishmash of observations, quotes, and paraphrases.

General notes
It's hard to compare usability methodology because many companies do not share theirs. Although Microsoft participated in CUE-2 (they own hotmail.com, the tested site), they abruptly stopped participating at some point after the study, and now clam up with "it's Microsoft policy not to discuss internal results of usability ...." Mr. Molich interpreted this as the lawyers finding out what was going on. I agree, but I don't think Microsoft (and other companies) are trying to hide their methodology as a competitive advantage. I think that they are reluctant to discuss it because either results or practices could turn into a PR disaster. I also think that this is a ...

Heuristic evaluation will find less than half of the problems
The personal opinions of usability professionals are not worth much. Even the professional opinions of the best are usually wrong. The only way to find out if Function X on Site Y is usable is to watch typical users try it.
On the other hand, even a heuristic evaluation that finds only half the problems on a site is still finding a lot of problems, and is almost certainly cheaper than user testing.

CUE-2 Results
What's a usability problem
Usability Tasks
Performing the test
Testing with Advanced Users
Reporting
Assorted notes
| Complexity of Content | Low Complexity of Application | High Complexity of Application |
|---|---|---|
| High | Catalog (IA most useful here) | Information system |
| Low | Brochure | Service Site |
(Mecca et al., ACM WebDB'99, http://www-rocq.inria.fr/~cluet/WEBDB/procwebdb99.html)
She identified a number of components that have been labeled Information Architecture
(Newman & Landay, DIS 2000, http://guir.berkeley.edu/pubs/)
A major trend on the internet which massively impacts IA: pages of content served from a database. This has a very positive consequence for IA: access can be instrumented. It's getting much easier to collect reams of data about exactly how people look for information, and about how successful they are looking for your information in your system. "We need to measure to learn what works and what doesn't."
She riposted Garrett's parody of her team's webpage guidelines: "Measuring the surface properties can accurately predict how people rate web sites."
Ms. Hearst adds, "More information on our empirical usability assessment work can be found at http://webtango.berkeley.edu. Thanks! Marti"
Lazily hyperkinetic and demanded complete attention to keep up, but clearly a member of that rare species, "non-bogus consultant." Disclaimer: Ragouzis used language very precisely and accurately, and I'm probably butchering not just his prose but his actual points, since my notes are fairly low-fidelity and don't catch when he was being sarcastic to make a point, etc.
Paraphrase from his written statement: Practitioners of HCI are utterly ignorant of research. But "over decades, without a credible basis for defining or measuring the whole or human experience, they have garnered an astounding quantity of success. ... requires only the ability to innovate ... and to deliver user-perceptible value. ... Abandon quantification, and may the fittest win."
"[IA] isn't a research domain, it's an applied domain. ... IA isn't intrinsically valuable. IA quality improvements are verifiable only via customers' perceived value. What is your strategy: differentiate yourself from your rivals, show affinities with your partners; enjoy the competitive power of a well-positioned follower; sustained rate of growth relative to rivals.
What does a plan to achieve mediocrity look like? Maintain parity; do what 90% of the competition does; follow the de facto standards; follow the "X will address 75% of users" recommendations; get improved conversion rates upon launch.
This guy was so caffeinated he made Ragouzis look sleepy. He advocated tying traditional IA metrics to standard business metrics.
| IA measures | |
|---|---|
| Quantitative | Qualitative |
| Task Success Rate; Time on Task; # of categories, labels | Satisfaction; Frustration; Confusion |
for-profit goal: sell products. Metrics: Revenue, referrals, subscriptions, brand loyalty
non-profit goal: spread knowledge. Metrics: memberships, subscriptions, registrations, cross-links/references
Now we can do surgical tracking. Here are results for Front Page 98:
After removing the registration requirement and shortening the info page, got +10% downloads. However, lost a lot of email addresses (which could be used to [spam] notify people when the release version was ready); calculated that overall, would get 2-4x more revenue with fewer email addresses vs more anonymous downloads. Of course, the cost of experimenting, measuring, and figuring all this out could be greater than the delta in revenue.
The way to measure information architecture quality is to perform surgical tracking to determine if changes in IA improve business-goal-oriented metrics.
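As a rough illustration of what such tracking boils down to, here is a minimal Python sketch that compares one business metric (a download conversion rate) before and after an IA change. The `Variant` class, the `relative_lift` function, and all visitor counts are hypothetical, invented for this example; they are not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """One version of a page, with hypothetical traffic counts."""
    name: str
    visitors: int
    conversions: int  # e.g. completed downloads

    @property
    def rate(self) -> float:
        # Conversion rate; defined as 0.0 when there is no traffic.
        return self.conversions / self.visitors if self.visitors else 0.0


def relative_lift(baseline: Variant, candidate: Variant) -> float:
    """Relative change in conversion rate from baseline to candidate."""
    if baseline.rate == 0:
        raise ValueError("baseline conversion rate is zero; lift is undefined")
    return (candidate.rate - baseline.rate) / baseline.rate


# Made-up numbers, chosen only to mirror a "+10% downloads" style result.
with_registration = Variant("registration required", visitors=1000, conversions=200)
without_registration = Variant("no registration", visitors=1000, conversions=220)

lift = relative_lift(with_registration, without_registration)
print(f"{without_registration.name}: {lift:+.1%} vs {with_registration.name}")
```

A real analysis would also need enough traffic to tell a genuine change from noise, which this sketch does not attempt.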
"I do believe we can measure the usability and effectiveness of a design for very fine-grained characteristics such as number of clicks to task, mouse-travel distance, ... I am skeptical that we can have a standardized overall measure of IA effectiveness.
We can take some fine-grained measurements, for some criteria, for some tasks, for some people, at some point in time. For example, path length, vocabulary, reading level; and post-hoc aggregate data such as hits and referring links.
Overall, IA quality is qualitative. We can measure a few discrete acts, such as retrieve/buy/print/verify, because they have clear progress indicators and stopping points. We can't measure things like Explore/Browse/Learn/Read, because they don't have clear progress indicators or stopping points; they are continuous acts.
Why are we still talking about this, when IA is already a well-understood discipline? Because we stopped treating users as computers and started treating them as people.
How to test large site IA before building a site? Exploratory methods don't scale. Two first things when building a new web site? What do users want? What does the site owner want from users? Usability is cyclical; IA is ongoing; the entire org should do both. Usability has artifacts and checkpoints. Can't test IA in the abstract; need examples. Relationship between db design and IA? The db is an implementation issue which doesn't/shouldn't directly address user needs.
A French researcher made a fairly incomprehensible and somewhat desultory argument that Fitts' law is wrong. (Fitts' law essentially says that, when moving something (an object, mouse pointer, or finger) from one point to a target, the time it takes to do so accurately grows logarithmically with the ratio of distance to target size: farther or smaller targets take longer.) Something about relative vs absolute measures. Each slide made sense but it didn't add up to a coherent point.
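For reference, the usual statement of Fitts' law is logarithmic: movement time goes up with distance and down with target width. The sketch below uses the common Shannon formulation, MT = a + b * log2(D/W + 1), which is one of several variants and not necessarily the one the speaker was arguing against. The constants `a` and `b` are fitted per device and per user in practice; the defaults here are placeholders made up for illustration.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Index of difficulty in bits (Shannon formulation): log2(D/W + 1)."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    return math.log2(distance / width + 1)


def movement_time(distance: float, width: float,
                  a: float = 0.1, b: float = 0.15) -> float:
    """Predicted movement time in seconds.

    a (intercept) and b (slope, seconds per bit) are hypothetical
    placeholder values, not measured constants.
    """
    return a + b * index_of_difficulty(distance, width)


# Same target width, increasing distance: time grows, but slowly.
for d in (100, 300, 700):
    print(d, round(movement_time(d, width=100), 3))
```

Going from 300 to 700 pixels more than doubles the distance but adds only one bit of difficulty, which is why the relationship is described as logarithmic.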
A common rule of thumb is that computers must react in <100 milliseconds in order for users to perceive response as instantaneous. A researcher at (somewhere in the Midwest) tested this for five common tasks (pulldown menu, buttons, typing text). They found that, to a good degree of confidence, half of all users (23 person sample size) consider <190 msec to be instantaneous for buttons/menus, and <150 msec for text entry. However, you probably need to aim for <80 msec (more research to pin this down) or some sizable fraction of users will still detect delay.
It's very hard for computers to measure error in text entry, because extra characters, transposition errors, etc., make direct string comparison useless. Also, users want to correct as they go. York University researchers applied the Levenshtein (think Gene Wilder) string distance, which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another; it was originally developed for error-correcting codes and is also widely used to compare genetic sequences. They reprocessed some data that had taken someone a day to measure manually, and found that results seemed to match. This talk was interesting but seemed somewhat chintzy to me as they hadn't done any actual experimentation, just fiddled with other people's data.
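To make the idea concrete, here is a short Python sketch of the standard dynamic-programming Levenshtein distance, plus an error rate built on it. Dividing by the length of the longer string is one plausible normalization and is an assumption of this sketch, not necessarily the exact formula from the talk; the function names are likewise my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(presented: str, transcribed: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(presented), len(transcribed))
    return levenshtein(presented, transcribed) / longest if longest else 0.0


print(levenshtein("the quick brown fox", "teh quick brwn fox"))
print(error_rate("the quick brown fox", "teh quick brwn fox"))
```

Note that a transposition such as "teh" for "the" counts as two edits under plain Levenshtein distance; variants that count it as one exist, but this sketch does not use them.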
IBM researchers presented data on how to set up tap keyboards for efficiency. A blend of alphabetical and efficient design was almost as good for (theoretical - they didn't actually train anyone) trained users (41 WPM predicted vs 42 for purely efficient) and was somewhat easier to learn and faster for novices. In the course of a fifteen-minute test, users (sample size of 12) went from 8 to 9.5 WPM for the most efficient keyboard and 9 to 10.2 for the semi-alphabetical keyboard.
Problems with head-mounted HUD:
Binocular rivalry. If one eye sees a Terminator-style display and the other normally, the brain will alternate between the two at random, unpredictable intervals, producing a patchwork image.
A transparent display (text in the foreground, reality in the background) produces visual interference.
Experiments showed a 37% difference in ability to read and work between best case (standing near a blank wall) and worst case (TV screen in background).
Some researchers from Canada are trying to start up some sort of interest group within CHI for Socially Adept Technologies, an umbrella definition for systems that adapt to social context, situations, emotions, personality. Example applications: computers that make small talk to build trust with users. Computers that feign emotion to make users more comfortable.
Issues:
What are the ethical boundaries? Should a computer pretending to be human require a disclaimer? If the user figures it out, they are pissed (research shows). If the user doesn't figure it out, are they harmed?
Anecdote suggests it is possible to reliably tell if a user is introverted or extroverted based on a 200-word speech sample. (aside: other studies clearly show that users react better to computers that display the same behavior pattern as the user)
"Someone's going to do this. It should be us, since we'll do it ethically and openly."
It's generally accepted that people react to anything that displays intelligence (and even to plenty of objects that don't) as if it had human characteristics. Therefore, socially adept concepts won't go away if ignored.
What is socially adept technology? Is it a set of guidelines for usability design (such as, computer should not do things that would be rude if a human did them)? Is it active technology, such as profiling users on the fly and changing dialog text and behavior to compensate?
We should study this because it will illuminate human-human interaction issues from a new angle.
Computers can respect social expectations without behaving like a person.
Studies show: users behave as if a computer's time is worthless. A computer's apology is worthless. A computer's sympathy is worthless. A computer's empathy may not be worthless.