Abstract

User interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparison of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using the language they are familiar with, and improve their design rapidly at a minimal cost.


Spacewalker: Rapid UI Design Exploration Using Lightweight Markup Enhancement and Crowd Genetic Programming

Mingyuan Zhong∗
University of Washington, Seattle

Gang Li
Google Research, Mountain View

Yang Li
Google Research, Mountain View


CCS CONCEPTS

• Human-centered computing → Interactive systems and tools.

KEYWORDS

Markup language, crowdsourcing, design search, tools, genetic programming

ACM Reference Format:

Mingyuan Zhong, Gang Li, and Yang Li. 2021. Spacewalker: Rapid UI Design Exploration Using Lightweight Markup Enhancement and Crowd Genetic Programming. In CHI Conference on Human Factors in Computing Systems (CHI ’21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3411764.3445326

User interface design is a complex task that often requires designers to explore a wide range of options, which is expensive and time consuming. For example, a designer may consider multiple color schemes or layout choices for a UI. To evaluate these options, it is often necessary to test them with users, via either usability testing [2, 4] or A/B testing at scale [6, 18]. Although these classical approaches are widely used, they require substantial engineering investment to build and instrument each design alternative for testing, and extensive analytical effort to process collected user data and distill findings.

∗ This work was completed while the author was an intern at Google Research.

© 2021 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-8096-6/21/05.

To ease the effort of exploring a design space, previous work has extensively investigated using crowdsourcing as an essential component in UI design and evaluation [3, 9–11, 14, 15, 17, 25, 26], which lowers the threshold for acquiring user feedback at scale. Various commercial tools also exist to support A/B testing of UI designs. Nevertheless, it remains challenging for existing tools to examine a large design space, where it is commonplace to have hundreds or even thousands of design alternatives.

To address this issue, previous work has attempted to apply Artificial Intelligence algorithms to enable efficient search of a large design space [5, 13, 16, 19, 22]. In particular, Salem [19] combined crowdsourcing and genetic programming [12] for the design of landing pages. However, these existing tools often require a designer to learn a new language that is tailored for working with the underlying algorithm to define a search space. Their optimization objectives (or fitness functions [12]) are designed based on user click behaviors in specific interaction tasks. A tool for general-purpose design space exploration that is seamlessly integrated into current design practice is still lacking.

In this paper, we present Spacewalker, a tool that allows designers to rapidly search a design space of a web UI for an optimal design within that space (see Figure 1). In a typical HTML page or a CSS specification, designers first annotate each attribute they want to explore using a simple markup extension we designed. Our tool then parses the annotated HTML or CSS specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation.
Our research challenges are three-fold: 1) designing a markup annotation that is expressive and easy to use for specifying various design options, 2) developing an algorithm to allow efficient exploration of a large design space based on crowd worker feedback, and 3) creating a tool that can provide integrated support for design exploration.

To address challenge 1), we designed the HTML annotation as a simple extension of the existing HTML and CSS grammar, where instead of specifying a single value for an attribute, a designer can provide multiple candidate values for it, which are to be explored by Spacewalker. To address challenge 2), we enhanced a genetic algorithm by adding feedback mask-based stochastic sampling, to accommodate crowd worker responses from pairwise comparison of UI designs, which tends to yield more reliable feedback than rating each design separately. To address challenge 3), we created a web-based tool that streamlines the entire task of design exploration, including task creation, monitoring and evaluation.

[Figure 1 diagram labels: Designer annotates the HTML page; Annotated HTML Page; Parsing; Design Configuration Space; Initialization; Genetic Sequences & Masks; Generate, pair & distribute designs; Crowd workers compare designs; Crossover; Mutation; Intermediate Genetic Sequences & Masks; Genetic Sequences of Selected Designs; Designer monitors progress or exports evolved designs.]

Figure 1: Spacewalker provides integrated support to allow a designer to rapidly explore a large design space of a webpage. Designer actions are

We evaluated Spacewalker by asking interaction designers to use it for exploring a set of UI design tasks, and Spacewalker received positive feedback. To systematically examine how well Spacewalker algorithms can evolve a design by quickly searching a design space, we tested it on six design tasks that range in search space sizes and design types. The experiments indicate that UI designs obtained by Spacewalker were significantly more preferred by human evaluators than those from a baseline method. Our paper makes the following contributions:

• An expressive and easy-to-use HTML markup extension that allows designers to easily specify various alternatives for design search, which requires negligible learning effort;
• An enhanced genetic algorithm that can efficiently explore a large design space using crowd worker responses from pairwise comparison of UI designs;
• Integrated general tool support that allows designers to easily obtain an improved design from a large range of options within a short period of time (e.g., 1 hour) at a small cost (e.g., 35 US dollars).

Our work is related to three areas of the literature, including UI evaluation methods, crowdsourcing-based design support, and interactive UI design optimization.

The human evaluators who compared the final designs were a separate group of crowd workers from those who were involved in human-in-the-loop design search.

Usability testing [2] is a commonly used approach for evaluating a UI design, which often requires a user experience researcher to recruit user participants, moderate a study session, and observe and analyze findings from the study [4]. To study how users react to design alternatives at scale, A/B testing is widely used, where variants of a design are tested with different user populations and user behaviors are logged and statistically analyzed by user experience researchers [6, 7]. Various tools or platforms are available to support A/B testing of UI designs, such as GoodUI, Optimizely and VWO.

Although existing methods are widely adopted, they often require substantial engineering effort to build and instrument a test. They also often involve extensive effort to analyze user data to extract findings that can be used for the next design iteration. In addition, these methods are limited by the number of alternatives they can explore, which is problematic as the design space of a UI is often large. Consequently, an end design might be suboptimal due to limited exploration. AB4Web addresses this problem through randomized split testing, and successfully analyzed users’ preferences for a task with 49 designs [23]. Nevertheless, A/B testing still struggles to support a large design space when there are hundreds or thousands of design alternatives. In Spacewalker, we aim to address these issues by providing integrated support for designers to explore a large design space and improve their design.

Previous work has incorporated crowdsourcing for UI design and evaluation. Crowdsourcing has shown success in providing results for evaluating user interfaces that are comparable with those acquired in a lab-based setting [8, 24].
Voyant allows designers to seek perception-oriented feedback from a non-expert crowd, with an emphasis on connecting the visual design with corresponding feedback [26]. Reinecke et al. evaluated a set of 430 web designs through 40,000 online participants, demonstrating the feasibility of large-scale design evaluations through the crowd [17]. ZIPT allows designers to collect and visualize interaction patterns for any Android apps from the crowd [3].

GoodUI: https://goodui.org/ VWO: https://vwo.com/testing/ab-testing/

The crowd can be more actively involved in UI design tasks to provide feedback [14, 17, 25, 26] or participate in the design process [10, 11, 15]. Apparition supports creating UI designs and animations from interface sketches and natural language descriptions through self-coordinated real-time crowdsourcing [9]. Similar to previous work, we also embed the crowd in the loop of the design and evaluation process. However, we focus on the design task where an interaction designer has a basic HTML design and wants to obtain an optimal configuration for the design by exploring a large range of options such as colors, fonts and layouts. We also aim to minimize the effort and cost for the designer to perform the task.

Using Artificial Intelligence algorithms to optimize interface design is a longstanding topic. Genetic algorithms (GA) in particular have been applied to optimize UI designs with large search spaces. Imagine generates style sheets for HTML pages interactively through user selection [13]. Quiroz et al. combine GA with UI design metrics to reduce the number of choices needed by a user [16]. However, these approaches only take the input from a few users, causing fatigue [21] and increasing potential bias.

To address the issue, Salem [19] combined crowdsourcing and genetic programming [12] for the design of landing pages.
Tamburrelli and Margara [22] explored approaches for optimizing software designs specified in Java through GA, basing their fitness function on the distance from users’ interaction position. Despite the adoption of the crowd, these interactive GA solutions rely on implicit information, such as click location, that is difficult to generalize to other design tasks. Moreover, the designer-facing tools require learning a custom specification or programming language, which increases the burden on the designer. Although we employ GA-based algorithms and the crowd in our work, Spacewalker is designed to address a general web UI design scenario. It allows designers to specify the design space for exploration using a simple extension of HTML tags. We also enhanced genetic programming to handle worker responses from pairwise comparison of designs, which makes genetic programming more applicable to UI optimization.

We here describe how UI designers or developers would use Spacewalker to explore the design space of their user interfaces. Assume Alex, a web designer, is designing a new Product page for her company. Although she has written an HTML prototype of the page, she is uncertain about a few design aspects of the page, such as color schemes, font sizes, and background choices. As the combination of these factors yields a vast number of design alternatives, Alex decides to let Spacewalker explore her design space.

To do so, Alex first edits the HTML prototype for the Product page by adding simple markup tags for the design options that she is unsure about. For example, in the

element for the background, Alex adds the following tag:

...

In this example, Spacewalker uses either "nav-1" or "nav-2" at a time, while the remaining children are unaffected. Note that the CSS attributes of elements for each node can be further explored, which enables recursive exploration.

Instead of specifying exploration strategies based on nodes, which can be tedious, designers can directly explore at the level of the CSS specification using the explore-css tag. Here Alex would like to ensure that the titles (h1, h2) and the body paragraphs (p) use the matching colors that she designed, so she adds the options as a group:

    ...
    h1, h2: { color: (color1); }
    p : { color: (color2); }
    --------
    h1, h2: { color: (color3); }
    p : { color: (color4); }
    ...

This ensures that Spacewalker applies these options globally: either color1 for titles and color2 for body text, or color3 for titles and color4 for body text. A line of dashes (i.e., any number of "-") separates these two options.

After creating the specifications for all the design aspects in question, Alex launches a Spacewalker task by specifying 50 human raters and 10 iterations (see Figure 2). Spacewalker shows an estimate of how much it would cost to use the crowd workers. Alex can preview designs generated by Spacewalker based on her specifications. When Alex clicks the Launch button, Spacewalker automatically generates and samples designs based on her specification and distributes them to online raters recruited from Amazon MTurk. The interface for the rater is straightforward (see Figure 3). Raters see a pair of designs side by side, and are asked to select the one that they prefer.

Figure 2: The Author interface allows a designer to create and launch a task.

Figure 3: The Eval interface allows a crowd worker to compare two design alternatives, and select the one they prefer.

Alex can monitor the progress of the task in the Progress Viewer (see Figure 4), which allows her to see the sample designs of the current generation (iteration). In about one hour, all the iterations are completed, and Alex selects five top designs from the collection of designs in the last generation. If none of the designs is satisfactory, Alex can edit the task in the Progress Viewer and relaunch the task to continue evolving the design.

In this section, we discuss the system design and algorithmic details that underlie Spacewalker. The Spacewalker system consists of three main components: an HTML specification parser, a genetic algorithm backend, and a crowdsourcing frontend. Once a designer submits an HTML specification, the parser extracts the attributes and options to be explored, which are passed on to the genetic algorithm. The genetic algorithm generates design instances from the options, which are sent to the crowdsourcing frontend to collect worker feedback. Once enough feedback is received for one iteration, the genetic algorithm generates the next generation of designs, and the process is repeated until the specified number of iterations is reached.

In this paper, we use worker and rater interchangeably.

Figure 4: The Progress viewer allows a designer to examine the progress of the task by viewing the designs generated from the current generation, and export the HTML specification of these designs if satisfied.

As shown in the above example, Spacewalker supports a rich set of methods for exploring a design space through simple HTML extensions, which are intuitive to designers as shown in our experiments. We here discuss its syntax and parsing details.

To explore a property of an individual element, a designer follows a simple syntax by prefixing "explore-" to the property, and specifying the alternative values for the property delimited by spaces:

    explore-<property>="option-1 option-2 option-3 ..."

Spacewalker supports all CSS properties and any number of them for an element. If multiple properties need to be explored jointly (e.g., height and width), Spacewalker allows a designer to combine multiple properties for exploration by joining their names using "-and-" and their option values using a semicolon (;):

    explore-<property-A>-and-<property-B>="option-A-1;option-B-1 option-A-2;option-B-2 ..."

In addition to exploring individual elements, Spacewalker allows a designer to easily explore a large component of a design as a whole, which might contain a branch of elements and sub-trees, such as a side bar or a navigation bar. To do so, a designer can use the explore-child-id tag in a parent node, with the id of each child that corresponds to a design candidate as options. See Section 3 for an example. Lastly, instead of exploring a design space based on elements, a designer can explore style options with CSS selectors using the same format as a regular CSS file, again prefixing the "explore-" tag and using a line of dashes to indicate alternative styles (see examples in Section 3). Because of this, powerful CSS features, such as CSS variables (which can be used to store values in custom properties), can be adopted to streamline the specification of possible values for properties. For simplicity, we did not use these features in this paper.

The Spacewalker parser analyzes a design specification file by parsing its HTML structure, which derives an internal representation for the design search space.
It looks specifically for the explore-* tags, and records the options for each attribute provided by the designer. In addition, the parser adds a unique HTML id to elements without one, in order to link the attribute and options with the corresponding element. To preserve the hierarchical relationship of the HTML tree structure, the parser also maintains the hierarchical layout of the elements to be explored in a separate tree structure.

As the number of attributes and nodes to be explored increases, the search space for a design grows combinatorially. To search for an optimal design in the space, it is prohibitively expensive to examine every possible design configuration with worker evaluation. On the other hand, with a limited amount of worker feedback, which is often the case in reality, a search very likely ends up with a sub-optimal design, as shown in our experiments later. As a result, it is necessary to use a more intelligent algorithm. We here focus on Genetic Algorithms, a popular choice that has been used in the literature, with several important enhancements.
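The parsing step and the combinatorial growth of the search space can be illustrated with a small sketch. This is not the Spacewalker parser: the class name `ExploreParser`, the generated id prefix `sw-auto-`, and the sample page are assumptions made for illustration, and the sketch handles only element-level `explore-*` attributes with space-delimited options (it ignores `explore-css` blocks and the hierarchy tree).

```python
from html.parser import HTMLParser
from math import prod


class ExploreParser(HTMLParser):
    """Collects explore-* attributes and their space-delimited options."""

    def __init__(self):
        super().__init__()
        self.space = {}     # (element id, property name) -> list of options
        self._auto_id = 0   # counter for elements that lack an id

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        explored = {k: v for k, v in attrs.items()
                    if k.startswith("explore-") and v}
        if not explored:
            return
        element_id = attrs.get("id")
        if element_id is None:
            # Assumed naming scheme; the paper only says a unique id is added.
            self._auto_id += 1
            element_id = f"sw-auto-{self._auto_id}"
        for name, value in explored.items():
            prop = name[len("explore-"):]
            self.space[(element_id, prop)] = value.split()


def search_space_size(space):
    """Number of configurations: the product of the option counts."""
    return prod(len(options) for options in space.values())


# Hypothetical annotated page, used only to exercise the sketch.
page = """
<body explore-background-color="white ivory">
  <h1 id="title" explore-font-size="24px 28px 32px">Product</h1>
  <p explore-width-and-height="200px;100px 300px;150px">Details</p>
</body>
"""
parser = ExploreParser()
parser.feed(page)
print(parser.space)
print(search_space_size(parser.space))  # 2 * 3 * 2 = 12
```

Because the size is a product of option counts, adding one more three-option attribute triples the number of configurations, which is why exhaustive crowd evaluation quickly becomes impractical.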

A typical genetic algorithm (GA) follows an iterative process, where potential solutions evolve over multiple generations. It consists of four stages: initialization, selection, crossover, and mutation. During initialization, the first generation is randomly generated, with the goal of covering as many configurations as possible. Then, the algorithm loops through the rest of the stages, where each iteration leads to a new generation. During selection, the algorithm selects a portion of the current population as the parents for the next generation based on a fitness function. After selecting the parents, the next generation is generated through the crossover operation. Each time, a pair of parents is randomly selected according to the fitness function, and their genetic representations are mixed based on a crossover operator. One method is single-point crossover, where a crossover point is randomly selected for both parents, and all the genetic representation to the right of the crossover point is swapped, forming two children (Figure 5). Finally, the next generation goes through mutation, where the genetic sequences are randomly altered to prevent the algorithm from getting stuck in a local minimum.

CSS variables on MDN Web Docs: https://developer.mozilla.org/en-US/docs/Web/CSS/var()

We refer to an instance of a UI design, which is acquired by selecting a specific option for each attribute to be explored, as a configuration of the design. Note that attributes that are not marked for exploration do not appear in a configuration for genetic programming because they are already determined by the designer. To adapt the genetic algorithm for searching for an optimal UI configuration, we first encode each configuration as a genetic sequence, which is an ordered list of valued attributes whose value is denoted by the index of an option for the attribute. As an example, consider a specification that explores three attributes (A, B, C).
If one configuration selects option 3 for attribute A, option 1 for attribute B, and option 6 for attribute C, then the resulting genetic representation is [3,1,6].

A specific genetic sequence indicates a UI configuration, which can be rendered as a design instance shown to a crowd worker for feedback. As a rater’s judgement can be dominated by the early examples and may drift over time [1], it is generally difficult for a user to rate the goodness of a design on an absolute scale. Spacewalker instead presents each rater with a pair of different candidate designs at a time, and asks the rater to select the preferred design, i.e., a two-alternative forced-choice (2AFC) method. Thus, our fitness function outputs one value for the preferred design and another for the less preferred one. Although presenting more than two examples in a gallery design can be another appealing alternative [1, 13], the viewing area for each example would be too small in our case of web design, and may prevent raters from noticing design details that matter.

Figure 5: The crossover operation for traditional genetic algorithms (top) and for Spacewalker (bottom).
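The encoding of a configuration as a genetic sequence and the traditional single-point crossover can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the option lists are hypothetical, and zero-based option indices are an assumption, since the text does not state the indexing base.

```python
import random

# Hypothetical design space: attribute name -> candidate options.
OPTIONS = {
    "A": ["a0", "a1", "a2", "a3"],
    "B": ["b0", "b1"],
    "C": ["c0", "c1", "c2", "c3", "c4", "c5", "c6"],
}
ATTRS = list(OPTIONS)


def decode(sequence):
    """Map a genetic sequence of option indices back to a configuration."""
    return {attr: OPTIONS[attr][i] for attr, i in zip(ATTRS, sequence)}


def single_point_crossover(parent_1, parent_2, rng):
    """Swap everything to the right of a randomly chosen crossover point."""
    point = rng.randrange(1, len(parent_1))
    child_1 = parent_1[:point] + parent_2[point:]
    child_2 = parent_2[:point] + parent_1[point:]
    return child_1, child_2


print(decode([3, 1, 6]))  # {'A': 'a3', 'B': 'b1', 'C': 'c6'}

rng = random.Random(0)
child_1, child_2 = single_point_crossover([3, 1, 6], [0, 0, 2], rng)
print(child_1, child_2)
```

At every position the two children together carry exactly the two parent values, so crossover recombines existing options and never introduces new ones; new values enter only through mutation.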

The pairwise comparison of designs eases the rater task and yields more reliable feedback than rating a design individually on an absolute scale. However, pairwise comparison raises challenges for genetic programming. The score that a design receives now reflects the differences between the two designs being compared, rather than every aspect of the design. In other words, the rater feedback carries information only for a subset of genes in a genetic sequence.

In our early exploration, we found conventional GA sensitive to the random initialization of design options. When the options shared by the pair of designs in comparison happen to be desirable, conventional GA yields good results. However, when these shared options are less desirable, conventional GA performs poorly, as those options that are not compared also receive positive responses. To address the issue, we enhance traditional genetic algorithms, for each iteration, by directing rater feedback to genes that participate in a comparison while allowing the remaining genes in a sequence to remain stochastic in the downstream evolution.
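As a small illustration of crediting only the genes that take part in a comparison, the sketch below marks the positions where two compared sequences differ and ORs that difference into a per-gene bit mask kept for the preferred design. The function names, the list-of-ints representation, and the example sequences are assumptions for illustration, not the authors' code.

```python
def diff_mask(seq_a, seq_b):
    """1 where the two designs use different options, 0 where they agree."""
    return [int(a != b) for a, b in zip(seq_a, seq_b)]


def credit_winner(winner_mask, seq_a, seq_b):
    """Bitwise OR of the preferred design's existing mask and the diff mask."""
    return [m | d for m, d in zip(winner_mask, diff_mask(seq_a, seq_b))]


design_a = [5, 1, 0, 4, 9]
design_b = [5, 0, 2, 1, 9]

print(diff_mask(design_a, design_b))                       # [0, 1, 1, 1, 0]
# Suppose design_a wins and already had feedback on its first gene.
print(credit_winner([1, 0, 0, 0, 0], design_a, design_b))  # [1, 1, 1, 1, 0]
```

The last gene is shared by both designs, so the comparison says nothing about it and its bit stays 0; that position remains free to vary in later generations.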


To do so, we introduce a bit mask, named feedback mask, for each genetic sequence (which corresponds to a design instance); the mask has the same length as the genetic sequence. A bit in the feedback mask is 1 when the corresponding option is compared and favored by a crowd worker, or 0 when it is not. For each pair of designs shown to a crowd worker, we compute a diff mask to capture the differences between their genetic sequences, where the corresponding bits for the differences are set to 1. The diff mask represents where we would likely gain knowledge by comparing these two designs: in the part where they differ. When the worker selects their preferred design, the original feedback mask for that design is combined with the diff mask through a bitwise logical OR operation. The resulting mask is assigned to the genetic sequence of the favored design, which captures both the information learned from previous generations (through the original feedback mask) and the current generation (through the diff mask). We elaborate on our algorithm below and particularly focus on how the masks operate in the initialization, crossover, and mutation phases.

Initialization. The goal of our initialization process is to maximally cover possible options for each attribute, while limiting the total number of designs to be compared by human raters. Therefore, we treat each attribute independently during initialization, and generate design variations by sampling the options of one attribute at a time while fixing the remaining exploratory attributes at a random value. The feedback mask for each genetic sequence is all zeros, as it has not received feedback from raters yet. Based on the result of each pairwise comparison, Spacewalker learns which design was preferred by a human rater and which option contributed to that preference, and sets the corresponding mask value to 1, and the rest to 0.

Crossover. We enhanced the single-point crossover operation of genetic algorithms by assigning random option values to positions in the sequence (corresponding to attributes in the design) where the feedback mask is 0 in the descendants. As a result, we introduce variations to a sequence where we have yet to acquire rater feedback. Meanwhile, in addition to crossover for the genetic sequences, the feedback masks also cross over at the sequence crossover point to carry the mask to the next generation.

Mutation. Mutation is handled in a similar way to the traditional approach, where we alter one attribute in a genetic sequence based on a mutation rate. We chose a 0.03 mutation rate in Spacewalker, which is in line with other genetic algorithms applied to a similar population size. When an attribute is mutated, its mask is set to 0.

Finally, we consider nested designs (where one option value depends on another parent value). In this case, we link the child choices to the parent choices in the genes. When a parent is selected, only the relevant child genes are active, and we only perform the crossover and mutation operations on these genes.

The Spacewalker system is built as a web service based on AppEngine. Our front end includes a task authoring interface for a designer to create and launch a task (see Figure 2), a monitor interface for the designer to monitor the task progress and export results (see Figure 4), and a worker interface for the worker to compare a pair of designs (see Figure 3). Our backend parses a design specification, generates and distributes evaluation tasks, schedules workers, and executes genetic operations on the sequences. A crowd worker first signs in to the worker interface by entering their worker ID, and then performs a sequence of evaluation tasks in which each trial involves indicating their preference over a pair of designs. The back-end server is responsible for scheduling workers for different web UI pairs without conflicts and for supporting multiple workers submitting results at the same time, using a database read/write lock. When a worker is submitting a comparison result, or a new iteration is being generated by the genetic algorithm, the database is locked to ensure the atomicity of the operations. To the workers, the scheduling process is transparent and they always see a consistent labeling interface for comparing two web UIs.

AppEngine: https://cloud.google.com/appengine

We evaluate Spacewalker in multiple dimensions.
We conduct a user study to investigate whether the Spacewalker markup extension to HTML is easy for designers and developers to understand and use, and how they react to the overall support of Spacewalker for design exploration. We also systematically examine how well Spacewalker explores a design space for designers and improves designs over iterations.

In this study, we evaluate the usability of our proposed HTML extensions by gathering informal feedback from web designers. The goal is to test whether web designers are able to learn and use our markups to specify search criteria and launch a design exploration task, and to gather feedback on the Spacewalker system.

We invited five participants for this remote user study. Two of the participants were graduate students and the other three were professional developers. Four of them were trained in HCI and had experience with conducting user studies. All the participants indicated at least three years of experience with web interface design and development. These participants resemble HCI researchers and web developers who want to improve their design by quickly examining detailed design options with users at scale.

We provided a description of supported functionalities and sample markup code snippets (similar to Section 4.1.1). We asked each participant to add markups to one template HTML web page (the Cover example in Section 5.2.1), specifying exploration options for attributes or style sheet entries that they would like to change. We verbally walked them through the code snippets and demoed the usage, which took 10–15 minutes. Participants then edited the provided HTML documents in their preferred code or text editors, and we recorded the time they took to experiment with the markups and complete each task. After the study, each participant was asked to comment on their experience with learning the markups and creating the specification. We reviewed their completed HTML specifications to check if they were correct.

All participants were able to learn the Spacewalker markup syntax using the description we provided and were able to create syntactically correct specifications. We analyzed the specifications submitted by the participants, and Table 1 summarizes the results. On average, participants were able to understand and create a search specification in 28 minutes (SD = 13.8), exploring different options for five to eight attributes. The search spaces specified ranged from 480 to 3888 configurations, which indicates designers' need to explore large design spaces.

Table 1: Summary of the user study results. The time includes participants learning our markup extension, creating their own specifications, and inspecting the effects while adjusting them.

Participant            P1    P2    P3    P4    P5
Number of attributes   8     7     5     5     6
Search space size      3888  2187  480   560   1152
Time (minutes)         48    15    15    30    32

We received largely positive feedback from the participants. P1 and P4 reported that the syntax was easy to learn, "even with basic knowledge background about HTML and CSS" (P4). P1 praised the system for "supporting all existing CSS properties". All participants appreciated the time savings and improved efficiency when working with Spacewalker. In particular, P1, P4, and P5 found Spacewalker to require less effort than the traditional way of exploring multiple designs individually.
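The reported average time and standard deviation can be reproduced from the per-participant times in Table 1. The short check below assumes the reported SD is the sample standard deviation (dividing by n − 1), which is what matches the published figure.

```python
from statistics import mean, stdev

# Time in minutes for P1..P5, taken from Table 1.
times = [48, 15, 15, 30, 32]

average = mean(times)            # 140 / 5
sample_sd = stdev(times)         # sqrt(758 / 4), sample standard deviation

print(average)                   # 28
print(round(sample_sd, 1))       # 13.8
```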

We conducted two experiments to evaluate whether Spacewalker was able to efficiently search a design space and generate better designs by utilizing the responses from crowd workers. We compared Spacewalker's genetic algorithm against a baseline method that uniformly samples the design space for crowd evaluation. In the first experiment, we examined the effect of different search space sizes on the two techniques. In the second experiment, we tested these search methods on different types of web pages.

We conducted both experiments following the same procedure. For Spacewalker, we used 10 iterations with 50 design samples in each iteration, which requires 25 comparisons by the raters per iteration. To reduce the potential influence of a single worker, we used workers who had an approval rate above 90% on MTurk, and limited each worker to performing 5 comparisons (10 samples). Therefore, each design search task needed 50 raters. To account for raters who may not respond after accepting the tasks, we distributed the tasks to 70 raters. For uniform sampling, to ensure that it received the same number of rater responses as the genetic method, we randomly deployed 500 samples, which amounts to 250 pairs and thus 250 rater responses. We also used 70 raters here to ensure enough responses. In sum, the only difference between the two conditions is the method used for searching the design space; all other aspects, including the feedback mechanism, are the same. We paid 0.5 US dollars to each rater who finished the 5 comparisons. A rater was only allowed to work in each method condition once.

On average, each task took about 1 hour to finish. After a task was finished, we selected the five designs that received the most votes from raters for each method. For Spacewalker, these designs were drawn from the population of the last generation. For uniform sampling, they were chosen by ranking all the sampled designs that were returned. We then deployed another task to a separate group of crowd workers to evaluate the quality of these selected designs. Each rater in the evaluation was asked to compare designs from Spacewalker's genetic method and those from uniform sampling side by side. We refer to this round of crowd tasks as the cross-method evaluation. The presentation order was randomized, and the rater had no knowledge of the underlying method of each design. We also randomly shuffled the order of designs for both methods. We ran 100 comparison tasks for each search specification, and the rater had to choose one of the two designs. The setup of the two experiments is as follows.
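The rater and budget figures in this procedure follow from simple arithmetic. The sketch below only restates numbers given in the text; the constant names are ours, and the cost is an upper bound that assumes all 70 recruited raters complete their comparisons.

```python
import math

# Parameters stated in the procedure (names are illustrative).
ITERATIONS = 10
SAMPLES_PER_ITERATION = 50
COMPARISONS_PER_RATER = 5
RATERS_RECRUITED = 70
PAY_PER_RATER_USD = 0.5

# Each comparison shows two designs side by side.
samples = ITERATIONS * SAMPLES_PER_ITERATION
comparisons = samples // 2
raters_needed = math.ceil(comparisons / COMPARISONS_PER_RATER)

# Upper bound on cost per condition if every recruited rater is paid.
max_cost_usd = RATERS_RECRUITED * PAY_PER_RATER_USD

print(samples, comparisons, raters_needed, max_cost_usd)
```

The same totals apply to the uniform sampling condition, which deploys 500 samples as 250 pairs.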

Experiment 1. In this experiment, we varied the search space size by using a different number of attributes and options in the design search specification. We based our study on the Cover example provided by Bootstrap, and we added Spacewalker markups to create the specifications used in the study. The number of options being explored ranged from 3 to 8 in this example, which corresponds to a search space size ranging from 50 to 11,000 (see Table 2 for all the search space sizes).

Experiment 2. We added exploration attributes and options to five additional web page templates, which are also based on Bootstrap examples. The specifications for these designs were created so that their search spaces are similar in size (between 972 and 1215, mean = 1050) for all the tasks. We used the following templates in this experiment (search space sizes in parentheses): Album (972), Blog (1080), Cover (972), Dashboard (1215), Pricing (1056), Product (1008).
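To illustrate how a search space size arises from attributes and their options, the sketch below enumerates every configuration of a small, entirely hypothetical specification. The property names and values are invented for illustration and are not taken from the study; it also assumes the attributes are independent, so the size is the plain product of the option counts.

```python
from itertools import product
import math

# Hypothetical exploration options; not the specifications used in the paper.
options = {
    "font-size": ["14px", "16px", "18px"],
    "background-color": ["#ffffff", "#f8f9fa", "#343a40"],
    "text-align": ["left", "center"],
}

# With independent attributes, the size is the product of the option counts.
size = math.prod(len(values) for values in options.values())

# Enumerate every configuration as a property -> value mapping.
configs = [dict(zip(options, combo)) for combo in product(*options.values())]

print(size, len(configs))
```

Adding one more attribute with three options would triple the size, which is why a handful of attributes already yields spaces in the thousands.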

Figure 6: Voter preference for Spacewalker with varying search space sizes. The horizontal axis uses a log scale.

For each search specification, we calculated the percentage of votes received by Spacewalker's genetic method (the rest of the votes were received by uniform sampling). We also conducted one-sample z-tests on the difference between a random draw and the voters' preference for Spacewalker in each search specification.

Bootstrap examples: https://getbootstrap.com/docs/4.0/examples/
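A one-sample z-test of a vote share against a random draw (null proportion 0.5) can be sketched as follows. This is the textbook form that uses the null proportion in the standard error; the paper does not state which standard-error estimate was used, so values computed this way may differ somewhat from the Z-scores reported in the tables. The function name and the example vote count are illustrative.

```python
import math

def one_sample_z_test(votes_for, total_votes, p0=0.5):
    """Return (z, two-tailed p) for a proportion tested against p0."""
    p_hat = votes_for / total_votes
    # Standard error under the null hypothesis (an assumption of this sketch).
    se = math.sqrt(p0 * (1 - p0) / total_votes)
    z = (p_hat - p0) / se
    # Two-tailed p-value from the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed

# Example: 66 of 100 votes for one method.
z, p = one_sample_z_test(66, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```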


Table 2: Rater Preference for Spacewalker in Experiment 1

| Search space size | 50 | 200 | 500 | 1000 | 3000 | 11,000 |
|---|---|---|---|---|---|---|
| Percentage of votes (%) | 66 | 60 | 73 | 78 | 80 | 75 |
| Z-score | 3.36 | 2.55 | 4.77 | 6.72 | 7.20 | 5.74 |
| p (two-tailed) | <.001 | .01 | <.001 | <.001 | <.001 | <.001 |

Table 3: Rater Preference for Spacewalker in Experiment 2

| Page name | Album | Blog | Cover | Dashboard | Pricing | Product |
|---|---|---|---|---|---|---|
| Percentage of votes (%) | 71 | 75 | 78 | 64 | 59 | 70 |
| Z-score | 4.60 | 5.74 | 6.72 | 2.90 | 2.00 | 4.24 |
| p (two-tailed) | <.001 | <.001 | <.001 | .005 | .045 | <.001 |

Crowd raters in the cross-method evaluation showed a significant preference for the designs generated by Spacewalker for all the search space sizes in Experiment 1 (see Table 2). In addition, we observed a trend of increased preference for Spacewalker's genetic method as the size of the search space increases (see Figure 6). This indicates that the larger the search space is, the more benefit there is in using Spacewalker.

In Experiment 2, where Spacewalker was used to search designs for different web page types, crowd raters in the cross-method evaluation preferred the results produced by Spacewalker's genetic method over those from the uniform sampling method in all the cases we tested (see Table 3). Figure 7 shows the top designs generated by Spacewalker and those from uniform sampling for each of the web pages. Note that depending on the search options specified in a design, the difference between the outcome designs from the two methods can sometimes be subtle, e.g., the Dashboard case in Figure 7. Nevertheless, raters' preferences still consistently favored Spacewalker's genetic method. This indicates that the benefit of Spacewalker holds across different web page types.

In this section, we discuss the strengths and limitations of our work, and our plan for future work. Our experiments show that the concept of Spacewalker is well received by designers and developers. They feel Spacewalker is highly valuable for the design task. As P4 commented:

"This tool provides (a) useful way to compare my designs. I used to use the Inspect tool in Chrome to try out different values of the styles of my attributes, but the limitation is that I can only modify one item at a time. With this tool I could manage my HTML/CSS code and potential designs of the whole page efficiently. It improves my productivity and experience significantly."

The Spacewalker markup extension is easy to understand and use for specifying design exploration. Our participants gave us several useful suggestions for improvements. P1 suggested that Spacewalker should suggest possible options to explore for a specific property, and warn designers when an option value is out of a reasonable range for a good design. This will require Spacewalker to encode certain design knowledge to make proper suggestions. Designers also want to immediately see the effect of a design when adding an option value, instead of examining it on a separate screen in the Preview (see Figure 2).

Another challenge lies in how well designers can understand the effect when complex design alternatives exist in one design space, e.g., design options nested within a parent option or global options via CSS. Although our participants did not encounter much difficulty, this can be challenging as the design space becomes convoluted. We believe the above extensions can provide a good starting point for designers to understand the search space. In addition, designers would be able to easily include or exclude certain combinations given appropriate visualization tools, which can be utilized by Spacewalker as an initialization condition.

Dependency between elements and designer-specified options also presents two challenges to Spacewalker. First, in complex design spaces, designers may want to maintain dependencies between several sets of elements, while style options for different elements can also be dependent, where only certain options or elements can be combined. To support more advanced dependencies, we believe automatic tooling for identifying option dependencies and detecting potential inconsistencies would be necessary, which provides an opportunity for future work. Second, with multiple dependencies and the resulting hierarchical design spaces, the search space for a design grows combinatorially. Therefore, even our enhanced genetic algorithm may not converge to an optimal solution with a small number of comparisons. The crux is how to search the vast space efficiently under the monetary and time constraints of a task. However, we note that Spacewalker performed well on rather large search spaces in our experiments, and that other approaches, such as uniform sampling, would suffer more in these cases as the probability of encountering one "good" example would be minuscule. With a large enough search space, the effectiveness of any algorithm will be impacted. Effectively conveying such expectations to users is essential. Moreover, the system can offer to break down large multi-level search spaces into smaller ones, and perform design search tasks on each of them.

Our quantitative experiments examining the performance of Spacewalker's algorithms for searching a design space indicate that it improves designs over time by producing better design candidates, particularly when the search space is large. However, we observed that the quality of these designs could be further improved with more iterations and workers. These improvements are easily achievable, particularly because the current results were obtained using only 50 workers and a small monetary budget (around 35 US dollars), which leaves a lot of room to scale up.

(a) Album (b) Blog (c) Cover (d) Dashboard (e) Pricing (f) Product

Figure 7: The top designs generated by Spacewalker (top) versus those from uniform sampling (bottom) for each web page template. These pages were adapted from the Bootstrap examples (see Section 5.2.1).

For each of the tasks, our GA-based algorithm only visited a small portion of the design space. 250 comparisons were made for each design. The search space size of our designs ranges from 50 to 11,000. Combinatorially, the number of pairwise comparisons needed to cover the smallest search space (50) is $\binom{50}{2} = 1225$.

Spacewalker provides integrated support to enable designers to rapidly explore a large design space to improve their web UI design. Our HTML markup for creating exploration specifications provides a lightweight and familiar language for designers to specify complex designs and search requirements. By adapting genetic algorithms to effectively utilize crowd worker feedback, our system can quickly explore the search space of a web design and provide real-time feedback to the designer about the progress of the search. Our experiments indicate that Spacewalker is well received by designers, and that Spacewalker's genetic search algorithm significantly outperformed a uniform sampling baseline across different search space sizes and web design types.
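The coverage argument from the discussion, that 250 comparisons touch only a small portion of the design space, can be checked with a short calculation. The sketch below counts all unordered pairs for the smallest and largest search space sizes listed in Table 2; the helper name is ours, and exhaustive pairwise coverage is used only as a reference point, not as something the system attempts.

```python
import math

COMPARISONS_MADE = 250  # per design search task, as stated in the text

def all_pairs(n_designs):
    """Number of unordered pairs among n_designs configurations."""
    return math.comb(n_designs, 2)

for size in (50, 11_000):
    total = all_pairs(size)
    share = COMPARISONS_MADE / total
    print(f"size={size}: {total} possible pairs, {share:.4%} covered")
```

Even for the smallest space, the comparisons made cover roughly a fifth of all possible pairs, and the share shrinks rapidly as the space grows.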

ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their insightful feedback for improving the paper. We would also like to thank the participants in our user studies.
