Effectiveness and Limitations of General Methodology and Specific Methods
In case #2 we first encountered the researchers’ probing technique, in which testers escalate the specificity of questions to extract increasingly detailed answers. What regulates this probing is not mentioned (for instance: does questioning cease upon a satisfactory answer?). If probing failed to yield insight, the testers drew subjects’ attention to specific matters, but could a subject’s lack of comment not itself be enlightening? Also, by mentioning a feature, did the testers exaggerate its importance, leading subjects to award it unwarranted attention and inflate their responses accordingly?
In case #3, half the subjects were computer novices and the remainder regular Web users. This split calls for clarification that is not provided: Is such a comparison balanced? If so, why? How did the results and preferences of the two groups compare? Were any such differences significant, and what conclusions might be drawn if they were? The authors assert that not all usability practitioners were trained to prompt, hint, or ‘remediate’ when participants became lost or confused, yet we are not told how the researchers drew up their own guidelines for the same. If we assume that, ideally, the tester describes tasks in the most general terms, what of complex tasks and subject variation? It seems that a superior method would minimize tester involvement by prioritizing the collection of data that could, post-test, be used for interpretive or explorative purposes.
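Such a post-test-oriented method could take the form of an automated event log. The sketch below is hypothetical (the class name, field names, and sample selections are my own, not the researchers’): it records, for each selection a subject makes, whether it matched the expected target and the interval since the previous selection, leaving interpretation entirely to post-test analysis.

```python
import time

class SelectionLog:
    """Minimal sketch of a test platform that records selection accuracy
    and inter-selection times for post-test interpretive analysis."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock      # injectable clock, so logs are testable
        self.events = []
        self._last = None

    def record(self, selected, expected):
        """Log one selection: what was chosen, whether it was the
        expected target, and the time elapsed since the last selection."""
        now = self.clock()
        interval = None if self._last is None else now - self._last
        self._last = now
        self.events.append({"selected": selected,
                            "correct": selected == expected,
                            "interval": interval})

    def accuracy(self):
        """Fraction of selections that hit the expected target."""
        return sum(e["correct"] for e in self.events) / len(self.events)
```

Because the log captures raw behaviour rather than tester-mediated answers, hesitation (long intervals) and error patterns can be examined after the session without any prompting having occurred.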
The authors encourage usability teams to decide and practice what to say when participants make serious errors or cannot continue. But could not such errors be informative and, thus, their rectification counterproductive? The researchers say that before prompting, administrators should wait longer than is comfortable. But estimating comfort is a subjective judgement, and we are not told whose comfort is to be timed – the tester’s or the subject’s. Also stressed is the need to prompt minimally; if minimal prompting is accepted, why not extend the principle further? An unprompted testing format might be advantageous all round. Automated or algorithmic instructions could practically eliminate prompting-related issues. Moreover, to counteract any audience effect, questioning might be better performed after testing.
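Automated instruction delivery of this kind can be sketched simply. The task list, grace periods, and hint text below are hypothetical illustrations: every subject receives identical instructions and an identical, pre-set waiting period before a single scripted hint, removing any tester judgement about when, or how, to prompt.

```python
import time

# Hypothetical task script: wording and timings are fixed in advance,
# so no ad hoc prompting decision is ever made during a session.
TASKS = [
    {"instruction": "Print the open document.",
     "hint": "A pre-scripted hint shown only after the grace period.",
     "grace_seconds": 120},
]

def run_session(tasks, task_completed, show):
    """Run the scripted session.
    task_completed() polls whether the subject finished the current task;
    show() displays a message on the subject's screen."""
    for task in tasks:
        show(task["instruction"])
        start = time.time()
        hinted = False
        while not task_completed():
            if not hinted and time.time() - start > task["grace_seconds"]:
                show(task["hint"])   # one scripted hint, never improvised
                hinted = True
            time.sleep(0.5)          # poll at a fixed cadence
```

Timing the “longer than is comfortable” wait then becomes an explicit, reproducible parameter rather than a subjective judgement.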
After case #3, the researchers assert that users want objects of immediate interest to be large and centrally located, yet want all choices visible at all times. Such deductions hold up to reason, but not in scenarios where there are hundreds of choices available (as in operating systems) and when the screen must, for practicality, divide its space among elements of equal importance. How then does the designer centralize and make prominent any one feature without prior knowledge of exactly what the user wants?
The problems subjects encountered in cases #1, #2, and #3 illustrate ambiguous internal relationships and false structure; however, earlier usability tests by the same researchers revealed problems with organization and visual structure more generally – display density, haphazard layout, and conflicting symmetry. Are such problems usability- or visual communication-specific or overlapping aspects of both? And is the attribution to one or the other even necessary? If such problems are discernibly visual communication-related, a methodology must be devised to isolate, identify, and analyze them; but if they are better defined as usability issues, they must be given component status and analyzed through functionality- rather than aesthetics-focused testing (Barnum, 2002).
In relation to icon recognition, the authors concede (predictably) that immediately recognizable icons are desirable but sometimes difficult to create. This might be erroneous reasoning. Easily recognizable icons must be easily creatable because their images are simple by definition – they depict either generic objects or symbols of sufficient simplicity to be understood by almost anyone. The authors ought to have stated that recognizable images readily associable with abstract functions are difficult to create without considerable blunting of conceptuality. But this too might be a redundant conclusion, as users of complex software are unlikely to be dissuaded or obstructed by (or even to notice) the suitability of icons or, in fact, any subtler visual elements of a UI (Mirel, 1998).
In case #4, subjects’ free-form responses were classed as ‘right’, ‘partly right’, or ‘wrong’, a trinity whose oversimplification engenders false positive/negative dichotomizing, because ‘partly right’ leans more toward ‘right’ than it stands as a neutral position. The optimism inherent in having two of the three categories classed positively throws doubt on the researchers’ impartiality and, consequently, the conclusions they draw from these responses.
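The arithmetic consequence of this optimism is easy to demonstrate. The response counts below are hypothetical, invented purely for illustration: depending on whether ‘partly right’ is counted as success, as half-credit, or not at all, the headline success rate for the same data shifts markedly.

```python
# Hypothetical tally of 100 free-form responses.
responses = {"right": 40, "partly right": 35, "wrong": 25}
total = sum(responses.values())

# 'partly right' counted as success (the researchers' implicit leaning):
optimistic = (responses["right"] + responses["partly right"]) / total
# 'partly right' treated as genuinely neutral (half credit):
neutral = (responses["right"] + 0.5 * responses["partly right"]) / total
# full credit only:
strict = responses["right"] / total

print(optimistic, neutral, strict)  # 0.75, 0.575, 0.4
```

The same raw responses thus support a reported success rate anywhere between 40% and 75%, which is why the placement of ‘partly right’ deserves explicit justification.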
Case #4 is based on collecting subject expectations about icons. The authors claim such data is excellent for prototype testing. But might the authors be classifying plain guessing as ‘expectation’? And might expectation and/or guessing be subject to myriad experience-dependent variables not accounted for?
Finally, they declare that their methodology should help researchers working with graphic designers validate and prioritize specific techniques, yet their methodology reveals no technique-producing guidelines. A more challenging experiment would give this assertion credence – an experiment whose outcome was unlikely to be tidily compliant with the consensus of foregoing research.
Critique of Key Claims and their Evidence
The authors caution that simply asking subjects for information is inadequate, and overall choice of UI is less illuminating than subjects’ unsolicited comments and behaviour regarding various UI elements. The subjects, we are informed, liked and disliked elements from all the UIs they saw, and their reasons for preference were informative. But no contrast effect is considered: subjects might have evaluated elements as relatively good when those elements appeared alongside nondescript or substandard elements. Also, were the authors’ assumptions realized, the resultant UI would likely be an incongruous hybrid – a portmanteau muddle of displaced features. This result, although not stated, is implicit in the authors’ reasoning. For such methods to be successfully applied to usability, the authors must delimit their applicability and confine the extensibility of any findings made by such extractive means.
Although questionnaire data and self-reporting have limitations, as the authors postulate, those limitations are counterbalanced by the absence of human interference and subjective judgement (complications that must both be factored in whenever interaction occurs). A mixed approach, combining self-reporting with tester questioning (and in which the tester’s input is strictly standardized and interpretation of comments and behaviour predefined), would be preferable.
While there is a definite case for ‘look-and-feel’ to be judged according to how well or poorly it augments comprehension (White, 2002) and functionality (Galitz, 1997), if usability testing protocols should be designed to elicit behaviour and comments relating to ‘look-and-feel’, is it usability or the definition of ‘usability’ that is being tested?
And, if there is a subpopulation of users to whom visual appeal is irrelevant, as the authors claim, of what relevance are their comments regarding visual appeal? If users are only subconsciously aware that visual presentation of products influences them (as posited), how are their comments to be interpreted? Logically, if subjects are unaware, it behoves the researchers to determine whether that unawareness can be attributed positively (subjects were not obstructed or distracted by visual elements) or negatively (subjects failed to recognize a visual element due to its incomprehensibility, etc.).
Tapering of question specificity is core to the methodology presented, but is intrinsically problematic, because any questioning is potentially biasing.
In their conclusion, the authors declare that what they learned most clearly was the difficulty of assessing visual appeal independently of navigation, access, and other rest-of-product issues. Yet this need not have been so difficult: If the UI’s interactive elements had been replaced by a passive slideshow, the gathering of opinion concerning purely visual appeal would have been greatly simplified. Such a method would exclude the ‘usability’ aspect in a truer sense, but herein lies the crux of the authors’ terminological compromise – the indivisibility of visual appeal from navigation or other usability issues is insuperable, as they can be separated semantically but not actually. The manifest form of any onscreen element is a two-dimensional composition of pixels that is, simply by being a tool rather than a decorative object, subject to aesthetic considerations only after users are assured of its functional faithfulness.
Furthermore, the problem of decoupling visual communication from rest-of-product testing persists only if the prototype attempts to emulate an integrated, operative UI. That is, the closer the prototype is to a fully functional UI, the more challenging it becomes to isolate and identify which particular elements constitute weaknesses or strengths. To counter this, elements must be tested separately: processes must be evaluated using identical graphical elements; graphical elements must be evaluated using identical processes. Once separate results are obtained, bases for combining highly rated processes with highly rated visual elements will emerge. Without such division, the problems of evaluating the quality or otherwise of a single function or visual element in isolation of any other can be resolved only through recourse to subject questioning, which is problematic for the reasons stated earlier.
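The separation proposed above amounts to a factorial test design: cross every process variant with every graphical treatment, then rate each factor while averaging over the other. The factor names and ratings below are hypothetical, chosen only to make the structure concrete.

```python
from itertools import product

# Hypothetical factors: process variants crossed with graphical treatments.
processes = ["wizard", "menu-driven", "direct-manipulation"]
graphics = ["icon-set-A", "icon-set-B"]

# Each condition holds one factor level constant while the other varies.
conditions = list(product(processes, graphics))  # 6 test conditions

# Hypothetical mean satisfaction ratings per condition (illustrative only).
ratings = {
    ("wizard", "icon-set-A"): 4.1, ("wizard", "icon-set-B"): 3.9,
    ("menu-driven", "icon-set-A"): 3.2, ("menu-driven", "icon-set-B"): 3.0,
    ("direct-manipulation", "icon-set-A"): 4.5,
    ("direct-manipulation", "icon-set-B"): 4.3,
}

def marginal_mean(factor_index, level):
    """Rating for one factor level, averaged over all levels of the other,
    isolating that factor's contribution."""
    vals = [r for cond, r in ratings.items() if cond[factor_index] == level]
    return sum(vals) / len(vals)
```

Comparing, say, `marginal_mean(1, "icon-set-A")` with `marginal_mean(1, "icon-set-B")` then rates the graphical treatments with process held statistically constant, which is precisely the basis for combining highly rated processes with highly rated visual elements.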
 If a subject must be pointedly asked about a UI feature, then can we assume that feature was so seamlessly functional that any particular visual aspect went unnoticed?
 The article later argues the case for prototype evenness, yet subject evenness is, by the same logic, required if preference is to be rightly attributed.
 For example, a computerized test platform that logged accuracy of selection and times between selections could prove highly informative.
 Audience effect is not discussed but would seem particularly relevant in case #4, where the usability team walked between subjects’ workstations observing subjects’ behaviour.
 I.e. procedure ordering, logical flow, or some form of interactive guidance such as a wizard.
 The authors claim later that they were able to make recommendations concerning design decisions based on these same responses.
 Subjects were asked questions such as ‘What do you think will happen if I click this?’
 For example: if subjects answer preceding specific questions one way, they tend to moderate their summative evaluations similarly.
 On a final note regarding decoupling: since textual content is principally what draws readers’ eyes, I cannot sympathize with the authors’ call for ‘greeking’ textual content to force subject attention toward an interface’s visual aspects. In the experience of the company at which I work, this method failed. We used Cyrillic characters in documents whose visual elements (including text layout) were being tested. Subjects spent too much time trying to read the text, and despite being told to ignore it, they were reluctant to look at other parts of the document.