ISSN : 1598-7248 (Print)
ISSN : 2234-6473 (Online)
Industrial Engineering & Management Systems Vol.18 No.2 pp.154-162

A Comparison of Usability Testing Methods in Smartphone Evaluation

Andrie Pasca Hendradewa, Yassierli*
Industrial Engineering Department, Universitas Islam Indonesia, Yogyakarta, Indonesia
Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung, Indonesia
Corresponding Author, E-mail:
April 12, 2018 August 1, 2018 January 28, 2019


A number of conventional methods are available for usability testing of smartphones, but their effectiveness is yet to be established. The objective of this study was to compare the effectiveness of usability testing methods in identifying the problems of a smartphone design. Three usability methods, Think-Aloud (TA), Cognitive Walkthrough (CW), and Heuristic Evaluation (HE), were selected for comparison, and 15 smartphone users and 16 usability experts were chosen as participants in the comparative study. The results indicate that the HE method is the most effective one and has the highest severity rating. Additionally, this study finds that eight evaluators are sufficient for the HE method to find most of the usability problems. Smartphone sales have recently increased substantially, triggering tough competition among smartphone manufacturers. Usability has become one of the factors that most smartphone users consider before buying a phone. This study proposes a guideline for usability testing of smartphones.



    1. INTRODUCTION

    The smartphone has recently become a widely used telecommunication and personal device for almost everybody, including in Indonesia, a developing country. Since its first launch in 2008, the number of smartphone users has been steadily increasing year after year. It was estimated that about 2.1 billion people (over a third of the world’s population) would own smartphones by 2017, a growth of about 10% since 2016 (Statista, 2017). The number is expected to keep increasing, along with the number of smartphone models and their features.

    One of the salient features of smartphones is their touchscreen-based input. This feature enables users to input information by merely touching the screen with their fingertips and thus creates innate human-device interaction (Wang and Ren, 2009). However, the touchscreen has several limitations, the most serious being the possibility of hitting a wrong button because the button is much smaller than the finger’s width (Albinsson and Zhai, 2003). This and other such limitations, which might cause difficulties in using smartphones, can be considered product usability issues.

    Usability has become an important aspect of smartphone users’ choice, besides being a critical success factor in product competitiveness (Maguire, 2001). Usability is evaluated based on learnability, operability, user error protection, aesthetics, and accessibility (ISO/IEC 25010, 2011). Attention to usability is the remedy for many poorly designed smartphones, which are considered difficult and complex to use. Ignoring the usability aspect will result in poor products, which may fall out of favour with users and ultimately affect the phone manufacturers’ business (Maguire, 2001). Therefore, manufacturers have to ensure that the smartphones they deliver to users have passed a usability test via a thorough and reliable usability evaluation method.

    A number of usability evaluation methods (UEMs) are available, including user-based testing methods, such as the Think-aloud (TA) evaluation method, and expert-based inspection methods, such as the Cognitive walkthrough (CW) method (Nielsen and Mack, 1994; Park et al., 2013). TA is an end user-based testing method in which end users are asked to verbalize their experiences while interacting with the product. It enables the evaluator to learn, via the interface, how the mobile user views the product and what the major misconceptions about the existing interface design are. The evaluator examines how the users interpret each menu item to understand which part of the menu causes difficulties or problems (Galitz, 2002). During the interaction, the user’s verbalization is recorded for further analysis (Jaspers, 2009) and for the preparation of a list of usability problems (Wharton et al., 1994).

    The expert-based CW testing method evaluates the ease of learning a product through exploration (Wharton et al., 1994). Here, the expert evaluates the product, via its interface, based on the task completion order and certain assumptions regarding user’s characteristics. In conducting walkthrough evaluation, the expert will ensure that the user can perform the task correctly, based on the available menu items. The expert monitors the progress of the task, based on the existing interface design. The output of the CW method is a list of usability problems (Jaspers, 2009).

    In addition to CW, researchers have developed the Heuristic evaluation (HE) method as an alternative expert-based method. HE is an expert team-based testing method that evaluates the usability of a product against a list of usability principles (Inostroza et al., 2012). In this method, each expert first goes through the interface individually several times, inspects the interacting menu items, and compares the existing design with a list of recognized usability principles (Inostroza et al., 2012). After all individual evaluations are completed, the results are aggregated and communicated to the team. The output of the HE method is a list of usability problems (heuristic violations), each rated for severity against the list of usability principles (Jaspers, 2009). Among the principles available in the literature, Nielsen’s usability heuristics seem to be the most widely accepted and used in usability evaluation (Nielsen and Mack, 1994).

    Recently, Inostroza et al. (2016) proposed a specific HE method for touchscreen-based mobile devices. They argued that, since conventional HE methods do not account for the unique attributes of touchscreen-based mobile devices, special dimensions might be needed in the HE method to detect usability issues. They proposed 12 principle dimensions for evaluating the usability of touchscreen-based mobile devices, and they are as follows: 1) visibility of system status; 2) matching between system and the real world; 3) user control and freedom; 4) consistency and standards; 5) error prevention; 6) minimizing the user’s memory load; 7) customization and shortcuts; 8) efficiency of use and performance; 9) aesthetics and minimalist design; 10) helping users to recognize, diagnose, and recover from errors; 11) help and documentation; and 12) physical interaction and ergonomics.

    Even though the HE method has been developed, it is still not clear how effective HE is in evaluating usability issues compared with the conventional methods (such as the cognitive walkthrough and think-aloud evaluation methods). Therefore, a comprehensive study is needed to compare the effectiveness of all the available methods, specifically for smartphone evaluation. With this requirement in view, three usability testing methods were compared in this study to identify the most effective one for evaluating a smartphone (based on its interface design). The testing methods selected for the comparative study are think-aloud evaluation (TA), cognitive walkthrough (CW), and heuristic evaluation (HE). For the HE method, we used the model of Inostroza et al. (2016).

    2. METHODS

    The main purpose of this study was to compare the effectiveness of three commonly used usability evaluation methods (UEMs) for smartphones. In addition, this study calculated the number of participants and evaluators needed for each UEM to find most problems. The more participants and evaluators an evaluation requires, the greater its effort and cost.

    2.1 Participants

    A widely used smartphone in Indonesia (Samsung S4) was evaluated using the three selected usability evaluation methods: TA, CW and HE. Three groups of participants were recruited for this purpose from the university. All of them were experienced users of the Android operating system but had never used the Samsung S4. The first group was composed of 15 experienced users who were asked to perform the TA method. They were master’s degree students, who were assumed to possess reasonably good verbal skills. The second and third groups (each comprising 8 evaluators) were experts in usability or user-interface design and were adept at using smartphones. Their expertise was assumed to be equal across the two groups. The usability problems identified by the three groups were weighted using severity ratings assigned by three user interface experts (called ‘raters’) on Nielsen’s rating scale, which ranges from 0 to 4 (Nielsen, 1995).

    2.2 Procedures

    The experiment was conducted in the Usability Room of the Ergonomics Laboratory, Bandung Institute of Technology. The participants and evaluators were asked to assess the given smartphone through a series of specific tasks related to the general functioning of a smartphone.

    The participants were asked to perform the following tasks:

    • Add a new contact, provide all the required information, and create a contact group;

    • Make a phone call to the new contact, added in the previous task;

    • Receive an incoming call;

    • Send a text message with the given information;

    • Take pictures, using the smartphone’s camera;

    • Perform mobile data connection by configuring the settings and activating internet connection for the SIM card;

    • Set and activate hotspot tethering feature;

    • Check memory space available in the phone;

    • Install an application through the internet;

    • Uninstall an application;

    • Configure settings and activate WiFi connection;

    • Download a file from the internet;

    • Organize files, including searching a file, moving a file and creating a folder;

    • Take a screenshot, using the screenshot feature.

    The operating system of the tested smartphone was Android, one of the most used operating systems in the smartphone industry (IDC, 2015). One characteristic of Android-based smartphones is that the user interface is customized by each smartphone brand (e.g. Android UI Comparison (Androidpit, 2015)). Among the existing Android versions, the most commonly used platform is version 4.4, named KitKat (Statista, 2014).

    2.3 Data Processing and Analysis

    The three selected UEMs were compared, using the following criteria: validity, thoroughness, effectiveness, reliability, and severity rating (Hartson et al., 2003). Each criterion is explained below:

    • Validity is a measure of how well the UE method performs the way it is supposed to perform. UEMs with low validity report problems that have no relevance to the real situation. Validity can be computed as follows:

    $$\mathrm{Validity} = \frac{|P \cap A|}{|P|}$$

    where P is the set of issues identified by the UEM, A is the set of real problems that exist, and |P ∩ A| is the number of real problems found by the UEM.

    • Thoroughness is a measure of the extent to which the UE method identifies the real problems that exist. UEMs with low thoroughness leave important usability problems unattended during the usability evaluation process. Thoroughness can be calculated as follows:

    $$\mathrm{Thoroughness} = \frac{|P \cap A|}{|A|}$$

    • Effectiveness is defined as the product of thoroughness and validity, with the same range of values as thoroughness and validity, that is, from 0 to 1. Effectiveness can be computed as follows:

    $$\mathrm{Effectiveness} = \mathrm{Thoroughness} \times \mathrm{Validity}$$

    • Reliability is the measure of consistency among the participants or evaluators. To measure reliability, the kappa statistic can be used. In this paper, the Fleiss formula (Fleiss, 1971) was used to find the degree of agreement among the participants.

    • Severity rating method is a measure of how well the UE method finds the most important problems, even if the method does not have the highest score for overall thoroughness. The measure can be calculated by the following formula:

    $$\text{Severity rating method} = \frac{\sum s(rpf)}{\text{number of real problems found by the UEM}}$$

    where Σs(rpf) is the total severity rating of the real problems found by the UEM.
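    The four ratio-based criteria above can be sketched in a few lines of Python, reading P as the set of issues a UEM reports and A as the set of real problems. This is a minimal illustration, not the authors' tooling: the function name `uem_metrics`, the problem IDs, and all numbers in the example are hypothetical.

```python
def uem_metrics(found, real, severity):
    """Compute validity, thoroughness, effectiveness and severity rating method.

    found    -- set of issue IDs reported by one UEM (P in the text)
    real     -- set of all real problem IDs (A in the text)
    severity -- dict mapping each real problem ID to its final 0-4 rating
    """
    if not found or not real:
        raise ValueError("both 'found' and 'real' must be non-empty")

    real_found = found & real  # real problems this UEM actually detected
    validity = len(real_found) / len(found)
    thoroughness = len(real_found) / len(real)
    effectiveness = thoroughness * validity

    # Mean severity of the real problems found; undefined if none were found.
    severity_method = None
    if real_found:
        severity_method = sum(severity[p] for p in real_found) / len(real_found)

    return {
        "validity": validity,
        "thoroughness": thoroughness,
        "effectiveness": effectiveness,
        "severity_rating_method": severity_method,
    }


# Hypothetical example: 5 reported issues, 4 of them real, 8 real problems overall.
example_found = {"p1", "p2", "p3", "p4", "x1"}
example_real = {f"p{k}" for k in range(1, 9)}
example_severity = {"p1": 3, "p2": 2, "p3": 1, "p4": 2,
                    "p5": 1, "p6": 4, "p7": 2, "p8": 1}

example_metrics = uem_metrics(example_found, example_real, example_severity)
print(example_metrics)
```

    Using sets keeps the numerator |P ∩ A| explicit, so an issue that is reported but not real lowers validity without affecting thoroughness.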

    In this study, the following mathematical model, proposed by Nielsen and Landauer (1993), was used to calculate the number of usability problems:

    $$\mathrm{Found}(i) = N\left(1 - (1 - \lambda)^{i}\right)$$

    Here, i denotes the number of evaluators, N the number of problems found in the interface, and λ the probability of finding the average usability problem in a single evaluation.
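    As a worked step added here for illustration, the model can be rearranged to give the smallest number of evaluators i that reaches a target proportion of the N problems. The symbol p for the target proportion and the values λ = 0.5 and p = 0.99 are assumptions chosen for the example, not values from this study.

```latex
\begin{align*}
\frac{\mathrm{Found}(i)}{N} = 1 - (1 - \lambda)^{i} &\ge p \\
(1 - \lambda)^{i} &\le 1 - p \\
i &\ge \frac{\ln(1 - p)}{\ln(1 - \lambda)}
  \qquad \text{(the inequality flips because } \ln(1 - \lambda) < 0\text{)}
\end{align*}
% Illustrative values only: \lambda = 0.5, p = 0.99
% i \ge \ln(0.01) / \ln(0.5) \approx 6.64, so i = 7 evaluators
```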

    3. RESULTS

    The usability evaluation was done by implementing the protocol for each method. The problems found by the participants and evaluators in each group were examined to identify the new problems that surfaced. Table 1 shows the results of smartphone evaluation, using different usability evaluation methods (UEMs). From these results, it can be seen that the number of usability problems varies with the UEM used. On the whole, heuristic evaluation method detected more usability problems than the other two methods.

    The problems found by the three UEMs were then merged into a single list for assigning a severity rating. In the process of merging, a usability problem found by more than one UEM was counted as one problem only, so the merged list contains no duplicates.

    Table 2 shows the severity ratings of the usability problems found by the different UEMs. The final severity rating was obtained by rounding off the average of the raters’ ratings. Problems whose final severity rating ranged from 1 to 4 were categorized as real problems.
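    The rounding rule can be sketched as follows. This is an illustrative reading rather than the authors' procedure: the function names and sample ratings are hypothetical, and half-up rounding is an assumption, since the text does not state a tie-breaking rule. With three raters the average is always a multiple of 1/3, so an exact .5 tie cannot occur.

```python
import math


def final_severity(ratings):
    """Round the mean of the raters' 0-4 ratings to the nearest integer.

    Half-up rounding is an assumption; the source only says "rounding off".
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    mean = sum(ratings) / len(ratings)
    return math.floor(mean + 0.5)


def is_real_problem(ratings):
    """A problem is 'real' when its final severity rating is between 1 and 4."""
    return 1 <= final_severity(ratings) <= 4


# Hypothetical ratings from three raters for three candidate problems.
print(final_severity([2, 3, 3]), is_real_problem([2, 3, 3]))
print(final_severity([1, 1, 0]), is_real_problem([1, 1, 0]))
print(final_severity([0, 1, 0]), is_real_problem([0, 1, 0]))
```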

    Considering all three UEMs, there are 82 real problems (see Figure 1). The Venn diagram (Figure 1) shows that some real problems were found by more than one UEM.

    Based on the data in Table 2 and Figure 1, the number of real problems found by UEM and the number of real problems that exist in the system can be used to determine the validity, thoroughness, and effectiveness of each UEM, as shown in Tables 3, 4, and 5.

    Reliability of the UEMs is reflected by the value of kappa (κ), which was calculated for this study using the Fleiss formula (Fleiss, 1971). The formulation was adjusted to suit the context of this study. There are two categories in this research: identified and unidentified. The number of subjects is proportional to the number of real problems found by each UEM. Kappa values for the think-aloud, cognitive walkthrough, and heuristic evaluation methods are shown in Tables 6, 7, and 8, respectively.
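    A two-category Fleiss kappa of this kind can be sketched as below. The sketch uses the standard Fleiss (1971) formula and assumes that each real problem is a subject and each participant or evaluator is a rater who either identified it or did not. How the authors adjusted the formulation is not specified, so this is one plausible reading; the function name and the sample matrix are hypothetical.

```python
def fleiss_kappa_binary(detections):
    """Fleiss' kappa for two categories (identified / unidentified).

    detections -- list of rows, one per problem; each row holds one
                  0/1 (or False/True) entry per participant or evaluator.
    """
    n_subjects = len(detections)
    n_raters = len(detections[0])
    if n_subjects == 0 or n_raters < 2:
        raise ValueError("need at least one problem and two raters")
    if any(len(row) != n_raters for row in detections):
        raise ValueError("every problem must be judged by the same number of raters")

    # Per problem: how many raters identified it, and how many did not.
    counts = [(sum(map(int, row)), n_raters - sum(map(int, row))) for row in detections]

    # Observed agreement per problem, averaged over all problems.
    p_i = [(a * a + b * b - n_raters) / (n_raters * (n_raters - 1)) for a, b in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of each category.
    share_identified = sum(a for a, _ in counts) / (n_subjects * n_raters)
    p_e = share_identified ** 2 + (1 - share_identified) ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical matrix: 4 problems (rows) judged by 3 evaluators (columns).
example_detections = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
example_kappa = fleiss_kappa_binary(example_detections)
print(example_kappa)
```

    In this layout, a problem found by only one or two evaluators pulls the observed agreement down, which is how scattered detections translate into low kappa values.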

    Severity assessment ratings of the usability problems, found in the three UEMs, can be used to determine the value of severity rating method. Table 9 shows the values of severity rating method for the three UEMs.

    The number of participants or evaluators is estimated from the value of λ, which is the average probability of finding a usability problem in a single evaluation for each UEM. Based on the λ value, the number of participants or evaluators needed to find almost 100% of the problems can be determined. It is obtained from the Nielsen and Landauer formula (Nielsen and Landauer, 1993) as the value of i at which 1-(1-λ)^i rounds to 1, that is, at which the proportion of problems found reaches almost 100%. In this study, we set the target at 99%. The value of λ and the estimated number of participants and evaluators for each UEM to find 99% of the problems are shown in Table 10.
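    The estimate can be reproduced with a short search over i. This is a minimal sketch with hypothetical function names; the value λ = 0.467 is the one this paper reports for heuristic evaluation, and 0.60 is the value it attributes to Nielsen (1992).

```python
def proportion_found(lam, i):
    """Proportion of problems found by i evaluators: 1 - (1 - lam)^i."""
    return 1 - (1 - lam) ** i


def evaluators_needed(lam, target=0.99):
    """Smallest i for which the proportion found reaches the target."""
    if not 0 < lam < 1:
        raise ValueError("lam must be strictly between 0 and 1")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    i = 1
    while proportion_found(lam, i) < target:
        i += 1
    return i


# lam = 0.467 is the value reported for HE in this study.
print(evaluators_needed(0.467))  # 8
# lam = 0.60 is the value the paper attributes to Nielsen (1992).
print(evaluators_needed(0.60))   # 6
```

    With λ = 0.467, seven evaluators reach about 98.8% and eight reach about 99.3%, which agrees with the eight evaluators reported for HE.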

    The function 1-(1-λ)^i can be plotted for each UEM against the number of participants or evaluators, as shown in Figure 2. The figure also shows, as a dotted line, the curve for Nielsen’s approximation value of λ, which is 31% (Nielsen and Landauer, 1993).


    4. DISCUSSION

    This study aimed to investigate the effectiveness of three common usability methods for smartphones. A number of previous studies had a similar interest, but their object was a web-based or computer application, not a smartphone (Karat et al., 1992; Jeffries et al., 1991). Their results were not consistent, and no single UEM was superior to the others. This suggests that a combination of usability evaluation methods (UEMs) may be useful in evaluating web-based applications (Hertzum and Jacobsen, 2001; Yen and Bakken, 2009). To our knowledge, only a few previous studies have investigated this issue for smartphones. Given the recent dramatic increase in the number of start-up programming companies in Indonesia, a country with a huge number of smartphone users, there is a need to find the most effective UEM for smartphones. Though a number of other UEMs are available, these three conventional methods appear to be more practical to employ in Indonesia, especially for start-up companies developing smartphone interfaces and applications. In this study, we also investigated the recommended number of participants or evaluators for each UEM.

    In this study, we selected three UEMs that are commonly used because of their practicality (TA, CW and HE). For HE specifically, the framework has been developed further to provide evaluators with a more comprehensive model for evaluation. The main result of this study suggests that the HE method of Inostroza et al. (2016) seems to be superior for evaluating touchscreen-based mobile devices, though TA and CW are also needed as complements. In addition, this study suggests that quantitative measures can be applied to compare the utility of the three UEMs. We proposed five quantitative indicators: validity, reliability, thoroughness, effectiveness, and severity rating. Other measures may be available, but the proposed indicators are quantitative and easy to use. The indicators were developed from Hartson et al. (2003). This study confirmed that such indicators are still applicable as the basis for comparison criteria. Hopefully, similar indicators can be used for evaluating the usability of different products, not limited to smartphones.

    A summary of the performance measures of the different UEMs can be seen in Table 11. The heuristic evaluation (HE) method has higher validity and thoroughness than the think-aloud (TA) and cognitive walkthrough (CW) methods. Because effectiveness is the product of validity and thoroughness, this implies that, among the three evaluation methods, HE is the most effective one.

    According to kappa value interval of Altman (1991), the kappa values of the three UEMs are less than 0.2, indicating low agreement among the participants or evaluators, involved in each UEM. This implies that the three UEMs have low reliability values and, therefore, cannot give consistent results with different participants or evaluators. Thus, the identified problems have high variation. This could stem from variations of evaluator background.

    Severity rating shows how important the usability problems found by the UEM are. The greater the value of the severity rating, the more effective the UEM is in identifying problems with a high degree of urgency. Table 11 shows that the heuristic evaluation method has a higher severity rating than the think-aloud and cognitive walkthrough evaluation methods. However, if the severity rating values of the three UEMs are rounded off, the severity rating is 2 for all three methods, which corresponds to a minor usability problem. Fixing problems of this severity should therefore be given low priority (Nielsen, 1995).

    Based on the numbers of participants or evaluators required by the three UEMs to find almost 100% of the problems (99%), the heuristic evaluation method requires the fewest evaluators. Requiring fewer participants or evaluators implies that the usability evaluation costs less. It should be noted that the evaluators involved in this study are experts in both user interfaces and the device being evaluated (they are also called “double specialists” (Nielsen, 1992)).

    The comparative study of the effectiveness of the evaluation methods shows that the heuristic evaluation method scores highest in almost every measure of comparison. In addition, this study also validates the 12 heuristics proposed by Inostroza et al. (2016), by including the aspects of physical interaction and ergonomics in evaluating touchscreen-based mobile devices. By generating more findings, the heuristic evaluation method proves to be more effective than the cognitive walkthrough and think-aloud evaluation methods. Furthermore, the cognitive walkthrough and think-aloud evaluation methods achieve lower scores than the heuristic evaluation method in most measures because they lack specific development and adjustment for evaluating touchscreen-based mobile devices or smartphones.

    The metrics used in this study compare the three usability evaluation methods based on the number of usability problems found. Here, the HE method seemed to be superior. However, there might be some usability problems that are found by other methods but cannot be identified by the HE method. This is shown by its relatively low effectiveness value of 58%.

    Each UEM may have advantages and limitations. Therefore, the characteristics of the problems found by each method vary. Based on ISO/IEC 25010, usability is characterized by appropriateness recognizability, learnability, operability, user error protection, and user interface aesthetics. Based on these characteristics, the HE method appeared better able to identify problems related to user error protection and operability, while the CW method was better able to identify problems of learnability and user interface aesthetics. Appropriateness recognizability can be best identified using the TA and HE methods. The limitations of each method suggest that there is still room for further development.

    The results of this study also imply that TA and CW could be used as complements to find more usability problems. In terms of cost and effort, involving end users as participants is relatively cheap and easier than involving usability experts as evaluators.

    This study also shows that, for finding most of the usability problems in smartphones, the heuristic evaluation method requires 8 “double specialists” as evaluators. This was obtained with a λ value (the probability of finding the average usability problem) of 0.467, or 46.7%, which is lower than the value found by Nielsen (60%) (Nielsen, 1992). A higher λ value implies that fewer evaluators are required to find 100% of the problems. The type of technology could be an influencing factor, besides others, in this regard. It should be noted that, in terms of thoroughness and reliability, HE performance needs to be improved, implying a need to develop further the HE evaluation model of Inostroza et al. (2016).

    Admittedly, this study suffers from one major drawback, namely that all the participants and evaluators involved are from Indonesia and are not accustomed to using the think-aloud method. If people from other countries who are familiar with the think-aloud method of evaluation were included as participants and evaluators, the results could differ from those of the present study. In addition, we assumed that all problems had been identified by the participants and evaluators. To limit the chance that problems remained undiscovered, the evaluators were recruited from among usability experts, and 15 TA participants were assumed to be enough. It should also be noted that, due to the advancement of technology, several alternative usability evaluation methods may be available that use various objective criteria based on recording apparatus. However, as discussed above, the three methods were selected for their practicality.


    5. CONCLUSION

    This study demonstrates that the heuristic evaluation method is the most effective UEM for smartphone usability evaluation, specifically when heuristics adjusted to touchscreen-based technology, as proposed by Inostroza et al. (2016), are considered. The heuristic evaluation method requires only 8 evaluators (double specialists) to find almost 100% of the existing problems (99%).

    Further research is required to improve the estimation of the number of participants or evaluators needed to evaluate smartphone usability. This study shows that there is still room for developing cognitive walkthrough and think-aloud evaluation methods, so that they can be more effective.

    5.1 Declaration of Conflicting Interests

    The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.


    The authors received no financial support for the research, authorship, and/or publication of this article.



    Figure 1. Venn diagram of real problems found by UEMs.

    Figure 2. Proportion of usability problems found by UEMs and Nielsen’s approximation value.

    Table 1. Number of new usability problems found per UEM

    Table 2. Number of usability problems found per UEM by severity rating

    Notes: 0 = not a problem; 1 = cosmetic problem; 2 = minor problem; 3 = major problem; 4 = catastrophe

    Table 3. Validity of usability evaluation methods

    Table 4. Thoroughness of usability evaluation methods

    Table 5. Effectiveness of usability evaluation methods

    Table 6. Kappa values on 43 problems by 15 participants of think-aloud evaluation

    Table 7. Kappa values on 42 problems by 8 evaluators of cognitive walkthrough evaluation

    Table 8. Kappa values on 52 problems by 8 evaluators of heuristic evaluation

    Table 9. Severity rating methods of UEMs

    Table 10. Value of λ and estimated number of participants or evaluators needed per UEM

    Table 11. UEMs performance summary


    1. Albinsson, P. A. and Zhai, S. (2003), High precision touch screen interaction, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Ft. Lauderdale, Florida, ACM New York, NY, 105-112.
    2. Altman, D. G. (1991), Practical Statistics for Medical Research, Chapman and Hall, London.
    3. Androidpit (2015), Android UI comparison, cited 2015 July 10, Available from:
    4. Fleiss, J. L. (1971), Measuring nominal scale agreement among many raters, Psychological Bulletin, 76(5), 378-382.
    5. Galitz, W. O. (2002), The Essential Guide to User Interface Design: An Introduction to GUI Design Principles and Techniques (2nd ed.), John Wiley & Sons Inc, New York: USA.
    6. Hartson, H. R., Andre, T. S., and Williges, R. C. (2003), Criteria for evaluating usability evaluation methods, International Journal of Human-Computer Interaction, 15(1), 145-181.
    7. Hertzum, M. and Jacobsen, N. E. (2001), The evaluator effect: A chilling fact about usability evaluation methods, International Journal of Human-Computer Interaction, 13(4), 421-443.
    8. IDC (2015), Smartphone OS Market Share, cited 2015 July 10, Available from:
    9. Inostroza, R., Rusu, C., Roncagliolo, S., Jimenez, C., and Rusu, V. (2012), Usability heuristics for touchscreen-based mobile devices, Proceedings of the 2012 Ninth International Conference on Information Technology, New Generations, 662-667.
    10. Inostroza, R., Rusu, C., Roncagliolo, S., Rusu, V., and Collazos, C. A. (2016), Developing SMASH: A set of SMArtphone’s uSability Heuristics, Computer Standards & Interfaces, 43, 40-52.
    11. ISO/IEC 25010 (2011), Systems and software quality requirements and evaluation (SQuaRE) – system and software quality models, Geneva, Switzerland: International Organization for Standardization.
    12. Jaspers, M. W. M. (2009), A comparison of usability methods for testing interactive health technologies: Methodological aspects and empirical evidence, International Journal of Medical Informatics, 78(5), 340-353.
    13. Jeffries, R., Miller, J. R., Wharton, C., and Uyeda, K. M. (1991), User interface evaluation in the real world: a comparison of four techniques, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Ft. Lauderdale, Florida, ACM New York, NY, 119-124.
    14. Karat, C. M., Campbell, R., and Flegel, T. (1992), Comparison of empirical testing and walkthrough methods in user interface evaluation, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Monterey, California, ACM New York, NY, 397-404.
    15. Maguire, M. (2001), Method to support human-centered design, International Journal of Human-Computer Studies, 55(4), 587-634.
    16. Nielsen, J. (1992), Finding usability problems through heuristic evaluation, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Monterey, California, ACM New York, NY, 373-380.
    17. Nielsen, J. (1995), Severity ratings for usability problems, cited 2014 Nov. 15, Available from:
    18. Nielsen, J. and Landauer, T. K. (1993), A mathematical model of the finding of usability problems, Proceedings of ACM Inter CHI’93 Conference, Monterey, California, ACM New York, NY, 206-213.
    19. Nielsen, J. and Mack, R. L. (1994), Usability inspection methods, John Wiley & Sons, New York: USA.
    20. Park, J., Han, S. H., Kim, H. K., Cho, Y., and Park, W. (2013), Developing elements of user experience for mobile phones and services: Survey, interview, and observation approaches, Human Factors and Ergonomics in Manufacturing & Service Industries, 23
    21. Statista (2014), Share of android platforms on mobile devices with android OS, cited 2014 January 7, Available from:
    22. Statista (2017), Statistics and facts about smartphone, cited 2014 November 21, Available from:
    23. Wang, F. and Ren, X. (2009), Empirical evaluation for finger input properties in multi-touch interaction, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Boston, MA, 1063-1072.
    24. Wharton, C., Rieman, J., Lewis, C., and Polson, P. (1994), The cognitive walkthrough method: A practitioner’s guide, Colorado: Institute of Cognitive Science University of Colorado.
    25. Yen, P. Y. and Bakken, S. (2009), A comparison of usability evaluation methods: Heuristic evaluation versus end-user think-aloud protocol–an example from a web-based communication tool for nurse scheduling, AMIA Annual Symposium Proceedings, 714-718.