{"id":675210,"date":"2020-08-05T13:01:30","date_gmt":"2020-08-05T20:01:30","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=675210"},"modified":"2020-08-24T14:04:25","modified_gmt":"2020-08-24T21:04:25","slug":"define-your-threshold-communicating-confidence-in-ux-scorecards","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/define-your-threshold-communicating-confidence-in-ux-scorecards\/","title":{"rendered":"Define your threshold: Communicating confidence in UX scorecards"},"content":{"rendered":"<p>By <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/michael-vw\/\">Michael Van Waardhuizen<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-682491 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/Simplified-Confidence_header.jpg\" alt=\"Man in a suit holds a tablet in his hand. Graphic images of information circles around it\" width=\"1269\" height=\"656\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/Simplified-Confidence_header.jpg 1269w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/Simplified-Confidence_header-300x155.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/Simplified-Confidence_header-1024x529.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/07\/Simplified-Confidence_header-768x397.jpg 768w\" sizes=\"auto, (max-width: 1269px) 100vw, 1269px\" \/><\/p>\n<p>At many large companies like Microsoft, leaders review lots of information about how products are performing. 
Objectives and key results (OKRs), key performance indicators (KPIs), dashboards, and reports are just a few ways we look at performance. This type of data is what decision makers ask for to understand the status of the products we make.<\/p>\n<p>User research as a discipline often responds by developing quantitative data reported through scorecards, but they\u2019re not always accurate or clear in what they\u2019re reporting. So let&#8217;s take a closer look at how these tools are used and how they can be improved.<\/p>\n<h3><strong>Scorecards? Yes, scorecards!<\/strong><\/h3>\n<p>Scorecards have been around for a long time in the industry. A <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/images\/search?q=UX+scorecard\">simple web search<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> will reveal dozens of examples of UX scorecards, and numerous textbooks have been written on the subject. 
Scorecards can vary in many ways, but at the heart of them, we often find:<\/p>\n<div id=\"attachment_682527\" style=\"width: 641px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682527\" class=\"wp-image-682527 \" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-2.png\" alt=\"An example of a scorecard with various measurements\" width=\"631\" height=\"173\" \/><p id=\"caption-attachment-682527\" class=\"wp-caption-text\">A traditional scorecard shows various scenarios and measurements<\/p><\/div>\n<ul>\n<li><strong><u>A table of data<\/u><\/strong>: Tasks, scenarios, or key results are displayed in rows with quantified metrics in columns<\/li>\n<li><strong><u>A set of color codes<\/u><\/strong>: Typically on a green to red spectrum, the colors mark a product\u2019s status or severity, possibly with icons, sparklines, or other accessory info<\/li>\n<\/ul>\n<p>Of course, many variations exist on this theme. Some scorecards chart values, show trends over time, or present multiple metrics in scatter plot form; others use color gradients, infographic elements, or (more or less) explanatory text. The list goes on and on\u2014and it\u2019s as long as all the options in Excel.<\/p>\n<p>Unfortunately, despite all these variations, scorecards frequently let us down in an important way. There are many issues one could point out: the limited focus on a subset of tasks\/items, which biases the overview of a product; the equal weighting that tasks\/items are typically given; and the lack of accessibility in the color schemes frequently used.<\/p>\n<p>These are all important factors that need addressing. But the issue I\u2019m going to address, here and now? False confidence.<\/p>\n<p>In an effort to be simple, most scorecards report very precise-looking numbers that are sorted into colors or grades. 
Frequently, items may land close to or right on the boundary and appear as strong or severe as any other score in that grade. This is deceptive because\u2014if we\u2019re being honest\u2014we seldom have enough data to confidently state metrics as precisely as portrayed in scorecards.<\/p>\n<div id=\"attachment_682659\" style=\"width: 396px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682659\" class=\"wp-image-682659\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3-300x200.png\" alt=\"A gray colored box, with columns and rows. The squares are colored red, green, and yellow with various amounts\" width=\"386\" height=\"257\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3-300x200.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.png 621w\" sizes=\"auto, (max-width: 386px) 100vw, 386px\" \/><p id=\"caption-attachment-682659\" class=\"wp-caption-text\">A scorecard sorted by color and grades may appear strong but can be deceptive<\/p><\/div>\n<p>Take the above image, for example. Task 13 scores 79 percent and is marked yellow, whereas Task 14 scores 82 percent and is marked green. However, with the sample size used for that data, no real (statistically significant) difference existed between the two.<\/p>\n<p>These sorts of apparent differences are problematic for a decision-maker who may be choosing where to invest or whether to release a product, and they undermine the reliability and reputation of UX research data. Fortunately, there is an easy, well-established answer: confidence intervals.<\/p>\n<h3><strong>Confidence in sunshine; confidence in rain<\/strong><\/h3>\n<p>I\u2019m not going to go into how confidence intervals are calculated\u2014there are many resources out there for that. 
Instead, the question for us to explore is, &#8220;How do we communicate confidence in scorecards?&#8221;<\/p>\n<p>Researchers have tried myriad options, with their own strengths and limitations. Here are a few of them:<\/p>\n<div id=\"attachment_682572\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682572\" class=\"wp-image-682572 size-medium\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.1-300x132.png\" alt=\"Scorecard with tasks, metrics and time spent\" width=\"300\" height=\"132\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.1-300x132.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.1.png 422w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><p id=\"caption-attachment-682572\" class=\"wp-caption-text\"><strong>Tabular ranges<\/strong>: One could replace means in the table with the range of data, but it can be hard to read and margins of error may be large for some data sets<\/p><\/div>\n<div id=\"attachment_682584\" style=\"width: 277px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682584\" class=\"wp-image-682584 \" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.2.png\" alt=\"Scorecard showing time spent on tasks\" width=\"267\" height=\"259\" \/><p id=\"caption-attachment-682584\" class=\"wp-caption-text\"><strong>Confidence Interval (CI) Footnotes<\/strong>: The CI may be included in small text nearby, but that&#8217;s hard to read and easy to ignore<\/p><\/div>\n<div id=\"attachment_682596\" style=\"width: 338px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" 
aria-describedby=\"caption-attachment-682596\" class=\"wp-image-682596 \" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.3.png\" alt=\"Scorecard measuring perceived ease of use\" width=\"328\" height=\"217\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.3.png 770w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.3-300x198.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.3-768x508.png 768w\" sizes=\"auto, (max-width: 328px) 100vw, 328px\" \/><p id=\"caption-attachment-682596\" class=\"wp-caption-text\"><strong>Bar charts with error bars:<\/strong> This common approach works well, but can be difficult to read when the numbers of metrics and\/or tasks increase. Box and whisker plots have similar issues<\/p><\/div>\n<div id=\"attachment_682602\" style=\"width: 349px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682602\" class=\"wp-image-682602 \" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.4.png\" alt=\"\" width=\"339\" height=\"218\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.4.png 821w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.4-300x193.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-3.4-768x494.png 768w\" sizes=\"auto, (max-width: 339px) 100vw, 339px\" \/><p id=\"caption-attachment-682602\" class=\"wp-caption-text\"><strong>Violin plots<\/strong>: Though powerful for showing relative probability, violin plots can be harder to generate and have similar limitations as 
regular bar charts<\/p><\/div>\n<p>One solution we\u2019ve tried has shown some promise for improving the at-a-glance understanding of confidence. Let\u2019s transform our data slightly to make reporting easy.<\/p>\n<h3><strong>Crossing the threshold<\/strong><\/h3>\n<p>An easy way to simplify confidence reporting is to define a threshold\u2014a line in the sand for what a \u201cgood\u201d score is. Thresholds can be set a few ways. For rating questions, I might focus on the label text of an option\u2014\u201cSomewhat satisfied\u201d is acceptable to me, but \u201cneither satisfied nor dissatisfied\u201d is not. We also use previous studies and data to correlate metrics to see where one good threshold can correspond to another. Once we have defined a threshold for a metric, we are able to simplify the metric:<\/p>\n<div id=\"attachment_682551\" style=\"width: 584px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-682551\" class=\"wp-image-682551 \" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-4.png\" alt=\"A bar graph showing perceived time\" width=\"574\" height=\"291\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-4.png 799w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-4-300x152.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-4-768x390.png 768w\" sizes=\"auto, (max-width: 574px) 100vw, 574px\" \/><p id=\"caption-attachment-682551\" class=\"wp-caption-text\">The green line denotes the threshold, which divides the confidence into three states<\/p><\/div>\n<p>The threshold line in green above divides our confidence into three states: entirely above the line (good!), entirely below the line (boo!), and straddling the line (umm?).\u00a0 We now 
have three states to report: clearly passing, clearly failing, and indeterminate.<\/p>\n<p>Earlier, I noted that color-coding grades is very common in scorecards. Now, we can co-opt the common color codes to communicate confidence clearly (whew!). Here I have assigned passing scores the color green, failing scores the color red (or another, more accessible color scheme), and have left indeterminate scores uncolored.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-682554 aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-5.png\" alt=\"Scorecard reporting tasks and amount time spent on each one, with success rate\" width=\"610\" height=\"473\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-5.png 807w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-5-300x232.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-5-768x595.png 768w\" sizes=\"auto, (max-width: 610px) 100vw, 610px\" \/><\/p>\n<p>Alternatively, if comparing scores against a competitor or previous benchmark, I use icons to communicate both confidence and direction.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-682557 aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-6.png\" alt=\"Scorecard measuring tasks using arrows\" width=\"646\" height=\"109\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-6.png 853w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-6-300x51.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/08\/Simplified-Confidence_image-6-768x130.png 768w\" sizes=\"auto, 
(max-width: 646px) 100vw, 646px\" \/><\/p>\n<p>In both cases, there are several useful properties:<\/p>\n<ul>\n<li>Colors provide at-a-glance confirmation of scores that are rooted in the statistical confidence of the metric<\/li>\n<li>Colors and icons are able to live happily in a table, allowing for reporting of large numbers of tasks\/scenarios and metrics without visually overloading the audience, improving understanding and impact<\/li>\n<li>The transformation of the data does not rely on complex statistical methods that are time-consuming to explain to product team stakeholders who may have less statistical training<\/li>\n<\/ul>\n<p>There are many ways to visualize evaluation results, each with its own trade-offs, and the other scorecard limitations noted earlier still apply. Still, it&#8217;s important for researchers to present data that communicates clearly and accurately. This visualization method shows your product team which insights to take most seriously and address first, and it fits within most existing scorecard presentations. Be confident in your data!<\/p>\n<h4><strong>Do you use scorecards at your organization? If so, how do you measure and define your thresholds? 
Tweet us your thoughts <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.x.com\/MicrosoftRI\">@MicrosoftRI<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or join the conversation <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.facebook.com\/MicrosoftRI\">on Facebook<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and share your comments.<\/strong><\/h4>\n<hr \/>\n<p><em>Michael Van Waardhuizen is a Senior Researcher Manager for a horizontal research team, providing qualitative and quantitative research-as-a-service to help other researchers and product teams scale up and conduct high-quality research faster. He is also a builder of tools, processes, and backends for scaling research. Previously, he conducted applied research to improve the user experience of SharePoint, Office 365, OneDrive, and Windows Mixed Reality.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When decision-makers are looking at how a product is performing, they often turn to scorecards to measure performance. Unfortunately, these reporting tools are not always accurate. By following these simple tips, you can begin to report data that is more clear and easier to follow. 
Find out how!<\/p>\n","protected":false},"author":39057,"featured_media":682698,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":616842,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-675210","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":616842,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675210","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39057"}],"version-history":[{"count":69,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675210\/revisions"}],"predecessor-version":[{"id":686943,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675210\/revisions\/686943"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/682698"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=675210"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=675210"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=675210"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?po
st=675210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}