Designing for the Long Tail of Machine Learning
Martin Lindvall
[email protected]
AB, Linköping, Sweden
Department of Science and Technology, ITN, Linköping University
Jesper Molin
[email protected]
AB, Linköping, Sweden
ABSTRACT
Recent technical advances have made machine learning (ML) a promising component to include in end-user-facing systems. However, user experience (UX) practitioners face challenges in relating ML to existing user-centered design processes and in navigating the possibilities and constraints of this design space. Drawing on our own experience, we characterize designing within this space as navigating trade-offs between data gathering, model development, and designing valuable interactions for a given model performance. We suggest that the theoretical description of how machine learning performance scales with training data can guide designers in these trade-offs and has implications for prototyping. We exemplify the learning curve's usage by arguing that a useful pattern is to design an initial system in a bootstrap phase that aims to exploit the training effect of data collected at increasing orders of magnitude.
KEYWORDS
machine learning, user-centered design, ML as a design material, human-centered machine learning, AI-infused systems
ACM Reference Format:
Martin Lindvall and Jesper Molin. 2019. Designing for the Long Tail of Machine Learning. In
CHI 2019 HCMLWorkshop: ACM CHI Conference on Human Factors in Computing Systems.
ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
CHI 2019 HCML Workshop, May 04–09, 2019, Glasgow, UK

INTRODUCTION
Recent works have advocated making machine learning (ML) more accessible by helping non-ML experts build and design better learning-based systems [8]. The designers of ML-based systems have the potential to improve the experiential value at all stages of development, from problem framing to maintenance phases [3], but this technology does not come without challenges. Framing ML as a design material highlights the fact that designers must be aware of its properties when used in human-centered design methods [2].

In our own practice of prototyping, building and deploying ML-based systems, we have often found ourselves trying to decide which activity might at any given time best forward our ambition to create ML-based systems that improve patient outcomes in real clinical situations. As it happens, we are often trying to decide whether to gather more or different training data, to spend time on improving the training algorithms, or to employ human-centered methods to design user interactions that could render a model's performance usable in a collaborative way within the deployed solution, a trade-off we depict schematically in Figure 1. While the reality of practice is messy, we hope that by this simple description of three highly ML-related activities we can begin a discourse on how ML-specific constraints and possibilities affect the design process.
Figure 1: When exploring the design space of a ML-based system, there is usually a trade-off between pursuing training algorithm improvements through model development, gathering more training data, and attempting to design a suitable interaction using the current best performance.
Navigating the design space of ML
For our projects to date we have largely relied on what has sometimes been referred to as traditional machine learning [1]. The part of the process relevant to this discussion of ML typically begins in the corner of what we, for lack of a better term, call human-AI interaction design. At some early point, one or multiple rather vague concepts are ideated with end users. Such a process usually results in a problem definition, an idea of how ML might bring value, and a very rough sketch of how that value might be realized in use through interaction with end users. Some very quick-and-dirty conceptual model development follows in order to clarify what kind of training data to gather before a very small data gathering pilot starts. After some added model development on that initial data, the order of the process gets murky. How much model development is worth doing on a small amount of initial data? Should one blindly go collect more data, and if so, how much? Furthermore, at which point should we revisit our human-AI interaction design to align concepts with new notions of achievable model performances?

Instead of prescribing some order of activities, we think it sensible that for each iteration one is able to reflect upon the cost of each activity in relation to its predicted impact on advancing the overall design goal. While designers probably have a good idea of the costs and benefits of employing user-centered design methods to further human-AI interaction designs given a fixed model performance, they might lack tools for estimating the effect of data gathering and model development on model performance. As we have continued to work in the design of ML-based systems, the theoretical relationship between training data amount and model performance has helped guide our decisions on which part of the improvement triangle to address at which phase of the design project.
Recent work has highlighted that UX designers may not have a clear understanding of the relationship ML has with data [2]; thus we think that an extended discussion on how this relationship impacts design decisions might be generative in informing both process and particular system designs.
THE MACHINE LEARNING CURVE FOR DESIGNERS
Machine learning algorithms train a prediction model from samples. In general, the performance of the trained model improves with the amount and quality of training data. Since the model learns by incorporating new information, the value of each training sample will decline because, if drawn from the same source, chances increase that the new sample embodies something the model has already learned. This means that the generic shape of a learning curve follows an inverse power law, with the details depending on e.g. model size, problem difficulty, and label noise [5]. It has been shown that this power law holds up to at least 300 million samples, as long as the accuracy level does not reach the inherent noise level in the data [7].
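This power-law relationship can be made concrete with a small sketch. Assuming error follows e(n) ≈ a·n^(−b), the parameters can be fitted by linear regression in log-log space and then extrapolated; the measurements and sizes below are hypothetical, not taken from any real project:

```python
import numpy as np

def fit_power_law(sizes, errors):
    """Fit error ≈ a * n**(-b) via linear regression in log-log space."""
    slope, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
    return np.exp(log_a), -slope  # returns (a, b)

def predict_error(n, a, b):
    """Predicted error at dataset size n under the fitted power law."""
    return a * n ** (-b)

# Hypothetical measurements: error observed at 1k, 10k and 100k samples.
sizes = np.array([1_000, 10_000, 100_000])
errors = np.array([0.30, 0.19, 0.12])

a, b = fit_power_law(sizes, errors)
# Extrapolate to estimate the error at one million samples.
estimate = predict_error(1_000_000, a, b)
```

In a real project, the three measured points would come from training the same pipeline on nested subsets of the data; the sketch only illustrates the extrapolation step, which is valid in the power-law region rather than on the small-data plateau [5].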
[Figure 2 plot: error vs. dataset size, annotated with an early "bootstrap" phase, initial deployment, and the long tail of the late phase]
Figure 2: Deep learning model error decreases with increasing data set size following an inverse power law. When designing systems using these models, this can be exploited by dealing with design challenges in the initial bootstrap phase and the following long tail phase.
As we have previously described [6], we found that a recurring theme in our work with creating ML-based systems was first training a model on a relatively small dataset that had been expert-annotated by manual means, and then designing an interaction embodying a successful task decomposition such that the collaborative workflow both helps the medical practitioner in her daily work while also implicitly generating training data that we imagined would increase the model's performance in a future update.

This way of approaching system design in two phases can be viewed as exploiting the learning curve to address the cost of gathering data. Since improvement to the model solely from data requires the collection of data at increasing orders of magnitude, the cost of data collection in the development environment might quickly grow out of proportion. Inversely, trying to deploy a product with too weak a model might not provide enough value to end users to achieve the wanted usage, which can be the rationale behind investing in "manual" data gathering solely for development. The relationship between training data amount and model performance and its evolution over time, loosely divided into an early bootstrap and a late long tail phase, with different characteristics and implications for design, is illustrated in Figure 2.
Bootstrap phase
At this stage, the model improves rapidly with relatively small amounts of training data, and challenges include problem framing, feature engineering, and training a model with sufficient performance to enable realistic user interface prototyping. It is at this stage that the design of manual annotation and labeling tools might be employed to engage domain experts as providers of training labels for the prototype models. The reasonable target performance in the bootstrap phase is probably less than that of the domain experts providing supervised labels, if such are employed. Hence, the human-AI interaction design will especially need to consider that the error characteristics of the resulting model predictions can be compensated for by human behaviour in the context of use.
The long tail
When starting to approach the long tail, for many practical applications, further improvement to the model performance requires either moving towards large-scale data collection or attempting to improve the model by pursuing breakthrough research within machine learning. The design activities will focus on ways to make systems continuously collect training data such that it has minimal impact on end users' goals and user experience. In order to increase data collection by orders of magnitude, the collection will move from being explicit to implicit. Other notable issues the system design needs to address are model drift, quality assurance, systematic bias, and generalizability between contexts.
IMPACT FOR HUMAN-CENTERED MACHINE LEARNING
Previous work has indicated that designers believe it may require an "unwieldy amount" of data to create functional prototypes [2]. The theory surrounding deep learning suggests it is possible to estimate the scaling of model performance with data if enough is gathered to get past a 'small data' plateau and into the power-law region [5]. We believe that by combining experiments using small data with using the learning curve as a heuristic, designers can start imagining the points on the curve where the resources required for data collection and model development exceed those that would be required to realize a system enabling valuable interactions with the current model performance. The designer could diverge to a few hypothetical designs by aiming for different target performance levels, as illustrated in Figure 3. If the prediction goal of those concepts is kept somewhat similar, this can be done with a low impact in terms of extra resources for data gathering and model development.
[Figure 3 plot: error vs. dataset size, with a threshold line for a design concept]

Figure 3: Designers can use the machine learning curve as a tool to diverge to multiple hypothetical design concepts that depend on model performance of a certain quality.
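Aiming for a target performance level amounts to inverting the fitted learning curve: given a design concept's error threshold, one can estimate how much data would be needed to reach it. A minimal sketch, assuming a hypothetical already-fitted curve e(n) = 1.2·n^(−0.2) (the coefficients are illustrative, not from any real model):

```python
def samples_for_target(target_error, a, b):
    """Invert error = a * n**(-b): dataset size n needed for a target error."""
    return (a / target_error) ** (1.0 / b)

# Hypothetical fitted curve: error = 1.2 * n**(-0.2).
a, b = 1.2, 0.2
for threshold in (0.15, 0.10, 0.05):
    n = samples_for_target(threshold, a, b)
    print(f"target error {threshold}: ~{n:,.0f} samples")
```

With this exponent, halving the target error multiplies the required data by 2**(1/b) = 32, which is exactly the orders-of-magnitude growth in collection cost that makes implicit, in-use data gathering attractive in the long tail phase.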
By creating a timeline for ML-based systems associated with both machine performance and amount of training data, it becomes possible to nuance questions such as "are human labelers useful or should we do something different?". By the principles we have introduced, we might conclude that such model-centered performance activities can be useful in the bootstrap phase but with declining motivation in later phases.

The learning curve relates the effect of additional training data to model performance with a fixed model training pipeline. However, model development activities such as algorithm selection, hyperparameter tuning and feature engineering can also affect model performance. In our practice, we constrain our explorations of this vast space of options to applying techniques and pipelines that have been shown to be fruitful in scenarios similar to ours or to be widely usable in general. Our rationale behind this is somewhat tentative and based on two assumptions. First, we assume that, similarly to how the benefit of increased data declines, so does the impact of model development over time. Second, we believe that the focus on human-AI interaction and the overall solution means that the value of a few percentage points' increase in performance, while being very important to machine learning researchers, is less important to designers.

Finally, the learning curve assumes that examples are uniformly drawn from a population. In interactive machine learning systems [1] that let users iteratively refine training data by observing model performance, subsequent examples might still be very informative. For instance, [4] showed that single-user selected sampling can outperform random sampling. However, the learning curve of this kind of user sampling over large data sets is, to the best of our knowledge, still unknown.
ACKNOWLEDGMENTS
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP).
REFERENCES
[1] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the People: The Role of Humans in Interactive Machine Learning. AI Magazine 35, 4 (2014), 105–120. https://doi.org/10.1609/aimag.v35i4.2513
[2] Graham Dove, Kim Halskov, Jodi Forlizzi, and John Zimmerman. 2017. UX Design Innovation: Challenges for Working with Machine Learning as a Design Material. In CHI '17 Proceedings of the 2017 Annual Conference on Human Factors in Computing Systems. 278–288. https://doi.org/10.1145/3025453.3025739
[3] Marco Gillies, Rebecca Fiebrink, Atau Tanaka, Jérémie Garcia, Frédéric Bevilacqua, Alexis Heloir, Fabrizio Nunnari, Wendy Mackay, Saleema Amershi, Bongshin Lee, Nicolas d'Alessandro, Joëlle Tilmanne, Todd Kulesza, and Baptiste Caramiaux. 2016. Human-Centred Machine Learning. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '16). ACM, New York, NY, USA, 3558–3565. https://doi.org/10.1145/2851581.2856492 Event-place: San Jose, California, USA.
[4] Neal Harvey and Reid Porter. 2016. User-driven sampling strategies in image exploitation. Information Visualization 15, 1 (Jan. 2016), 64–74. https://doi.org/10.1177/1473871614557659
[5] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 2017. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 (2017).
[6] Martin Lindvall, Jesper Molin, and Jonas Löwgren. 2018. From Machine Learning to Machine Teaching: The Importance of UX. Interactions 25, 6 (Oct. 2018), 52–57. https://doi.org/10.1145/3282860
[7] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 843–852.
[8] Qian Yang, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. 2018. Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. In