VINS: Visual Search for Mobile User Interface Design
Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, Magy Seif El-Nasr
Sara Bunian, Northeastern University, Boston, MA, [email protected]
Kai Li, Northeastern University, Boston, MA, [email protected]
Chaima Jemmali, Northeastern University, Boston, MA, [email protected]
Casper Harteveld, Northeastern University, Boston, MA, [email protected]
Yun Fu, Northeastern University, Boston, MA, [email protected]
Magy Seif El-Nasr, University of California at Santa Cruz, Santa Clara, California, [email protected]
Figure 1: Overview of VINS, our proposed image retrieval process for visual search for mobile interface design. First, it takes as input UI layout screens, either a complete design or an abstract wireframe. Then, it employs an object detection model to detect the presence and location of the different UI components defining the input query and produces a segmented layout accordingly. This segmented layout is passed to a multi-modal embedding network that learns a joint feature representation of both visual and label features. This representation is used to retrieve a ranked list of similar designs.
ABSTRACT
Searching for relevant mobile user interface (UI) design examples can aid interface designers in gaining inspiration and comparing design alternatives. However, finding such design examples is challenging, especially as current search systems rely only on text-based queries and do not take the UI structure and content into account. This paper introduces VINS, a visual search framework that takes as input a UI image (wireframe or high-fidelity design) and retrieves visually similar design examples. We first survey interface designers to better understand their example-finding process. We then develop a large-scale UI dataset that provides an accurate specification of the interface's view hierarchy (i.e., all the UI components and their specific locations). By utilizing this dataset, we propose an object-detection-based image retrieval framework that models the UI context and hierarchical structure. The framework achieves a mean Average Precision of 76.39% for UI detection and high performance in querying similar UI designs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

CHI '21, May 08–13, 2021, Yokohama, Japan
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing systems and tools; Wireframes.

KEYWORDS
datasets, data-driven design, user interface design, design examples, wireframes, information retrieval, computer vision, deep learning, object detection
ACM Reference Format:
Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, and Magy Seif El-Nasr. 2021. VINS: Visual Search for Mobile User Interface Design. In CHI '21: ACM CHI Conference, May 08–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/1122445.1122456
INTRODUCTION

Today's digital, fast-paced world has stimulated a rapidly growing mobile application (app) industry. Within the mobile app design process, the User Interface (UI) is an important visual communication factor that can play a significant role in an app's success. The UI depicts the organization and visual structure of the different components comprising the app's layout.
Design examples are important in the UI design process [47], and interface designers usually search and create design example repositories to stimulate inspiration, generate new ideas, and investigate feasible options to make design decisions [5, 7, 18, 19]. The web is presently a large repository containing a collection of design examples. There are various online design sharing websites that provide UI design inspiration, such as uplabs and dribbble (https://dribbble.com/). However, the current search mechanism within these websites is similar to the web at large, limited to only text-based queries, which makes finding relevant examples a challenging task [18, 35, 44]. While designers can easily search using keywords (e.g., onboarding screens), general categories (e.g., Food Apps), and color codes (e.g., white interface), the resulting design examples might not be relevant to the original design requirements in terms of visual layout structure and UI content. This makes the search process tedious and slow. To address this issue, more effective tools are needed to support finding and retrieving design examples that will benefit designers in practice.

Typically, designers express their ideas and UI design concepts in the form of images that describe the UI's visual layout, hierarchical structure, and content. There has recently been growing interest in studying mobile UI retrieval based on an input image [22, 32]. However, the methods proposed so far present limitations in terms of performance and generalizability. For example, Swire [22] does not specifically consider the UI content in the retrieval process. This affects the performance of the system by retrieving images that are not relevant to the query or are missing design components. The approach presented by Liu et al. [32] depends solely on a predefined UI content hierarchy. This limits the generalizability of the approach and prevents it from working on new, unseen images. Therefore, there is a need for a fine-grained visual search that, given any input query, can infer the UI's content hierarchy and provide designers with examples that fit their query.

In this paper, we take a step towards investigating how to effectively address the process of image retrieval in the domain of mobile UI design. Our approach focuses on two main aspects of the retrieval process. First, it is essential to have an understanding of the collection of UI components and their spatial location in a UI design image. This provides a close approximation of the UI visual hierarchy, which allows measuring the relevance across different images at a more fine-grained level. Second, the system should be flexible enough to allow designers to find similar design examples for their UI query at different design stages. This provides flexibility in using either an abstract wireframe (i.e., a low/medium-fidelity image) or a UI screenshot (i.e., a high-fidelity design image) as the input query.

In order to develop VINS, our proposed visual search framework, we first report findings of semi-structured interviews with UI designers explaining the current process and informing the design requirements of UI search tools. We then explore the issue of UI image retrieval in detail from both the data and the model side. Figure 1 shows VINS, which takes an app's layout as its input query and finds the most similar matches from our design inventory.
To account for various stages in the design process, the input query can be of different representations, including abstract wireframes and high-fidelity layouts. To support this feature, we constructed a large-scale annotated dataset containing UI design screens across these two design stages, which we refer to as the VINS dataset.

Given the variety of UI design screens and the complexity of their visual hierarchy, we develop VINS around deep learning models, which have demonstrated their effectiveness in solving various tasks in different contexts [10]. Specifically, VINS consists of two building blocks: detection and retrieval. First, we utilize an object detection mechanism to detect the presence and location of the different UI components that represent a tentative layout of the input query. We then train an attention-based neural network to learn a joint feature representation that can define both the layout structure and its content in order to retrieve similar UI designs.

The paper provides the following contributions:

(1) VINS dataset: A large mobile UI dataset consisting of UI screens across different design stages (i.e., abstract wireframes and high-fidelity designs) that can be utilized in developing different data-driven design applications. Through a human-powered process, we annotate the dataset to provide an accurate specification of the UI in terms of its view hierarchy (i.e., all the various UI components and their specific locations).

(2) VINS: A deep learning framework that models the context and the hierarchical structure of UI screens to develop a UI image retrieval system. The framework achieves high performance in querying similar UI designs.

RELATED WORK

Current design search systems encompass a variety of domains, including web design [25, 43], 3D modeling [15], interior design [3], fashion [34], and programming [6]. However, the current search mechanism in these and other systems is mostly based on text queries, such as keywords. Existing research has shown that keywords often fail to articulate abstract design ideas and thus make finding relevant design examples a challenging task for designers [18, 35, 44].

Several research studies have examined how designers essentially search for design examples, and how these examples are utilized in supporting their creative process [18, 19]. To support designers in this example-finding task, a plethora of HCI research has investigated alternative ways to better explore and retrieve design examples. One approach is to use advanced keywords such
as stylistic features (e.g., color, style terms) [27, 43], or image metadata (e.g., themes, media, date, location, shapes). Because formulating search queries using images is easier to learn and faster to specify than keywords [48], other researchers have explored alternative visual search mechanisms using image queries such as sketches [17] and UI screenshots [48].

There has been recent interest in advancing the body of visual search by integrating deep learning frameworks for better results. To this end, Deka et al. [11] presented preliminary results of an autoencoder model that learns similarities between UI layouts based on image and text content only. Liu et al. [32] extended this approach by training an autoencoder on semantically annotated layouts, containing different UI components, text button concepts, and icon classes, to learn UI similarities for design search. We also use an autoencoder for our retrieval task; however, our model is different because we develop an attention-aware autoencoder that learns a joint embedding of structure and content.

The most closely related approach to our work in terms of visual search is the aforementioned
Swire system [22], which uses a deep neural network model to retrieve relevant UI examples from input sketches. Specifically, Swire trains two convolutional sub-networks over matching pairs of screenshots and their corresponding sketches. Swire achieved 60% relevancy of retrieved examples and demonstrated its applicability in a number of different tasks. However, it has the following key limitations: (1) it focuses only on high-level layout information without inferring the UI's structure and content, which sometimes retrieves images that are irrelevant or missing UI components; and (2) it requires a pairwise collection of sketches and screenshots, which makes it difficult to generalize across unseen sketches of UI layouts. We seek to address these limitations and advance the body of work on visual search with our approach (see Figure 1).
Early work on data-driven design explored how design examples can aid in various design tasks, including providing design assistance [47] and automated content and layout re-organization [26, 40, 42]. Motivated by the utility of design examples in supporting designers and enabling data-driven application design, several large-scale mobile UI datasets have been created. For example, ERICA [12] provides a collection of user interaction data for mobile UIs captured while using the app. More recently, Rico [11, 32], a large-scale dataset of mined Android apps, has been released. It consists of 72K UI examples from 9,722 Android apps. Each example is associated with a screenshot of the UI design, the corresponding view hierarchy, and the user interaction information. The predefined view hierarchies expose the UI's structural and functional properties, which provides a means of inferring the UI content hierarchy and has the potential of supporting various data-driven design applications. While these view hierarchies may often provide an accurate representation of the UI structure, there are several instances where they do not. As shown in Figure 2: (1) hierarchies may be broken and not directly mapped to the UI, (2) additional space inside the bounding box boundaries does not reflect the exact dimensions and position of the object, and (3) there are inconsistencies in the class labels of similar objects.

Figure 2: Comparison of bounding box annotations between the Rico and VINS datasets.

Hence, it is important to have a dataset that can provide a highly accurate content hierarchy. Such a dataset makes it possible to train computer vision systems, e.g., an object detector, to analyze relevant patterns and recognize objects.

Other researchers have employed traditional image processing techniques (e.g., Optical Character Recognition (OCR) [37], Canny Edge Detection [8]) to infer the UI structure and content [2, 36, 38, 47]. However, this method is constrained in identifying components that are cluttered with the background and involves the adoption of an image classification model to differentiate between components [36].

To effectively address the issue of inferring UI structure, an approach that can perform both object localization and classification is needed. Object detection is one of the key problems in the computer vision community; it aims to provide a comprehensive understanding of the image by precisely determining the location and category of its objects through learning and extracting high-level deep visual features [50]. Our method for inferring the UI's layout structure integrates an object detection mechanism to accurately detect the different UI components. Object detection has been used recently for describing the UI structure [49]. This approach, however, was restricted to UI sketches, and the dataset was not made public. To support this purpose, we collected the VINS dataset, a new fully annotated dataset, which we describe in Section 4.
We conducted a series of semi-structured interviews with UI designers to gain insight, from a professional user's perspective, into the current design process and to help assess the applicability of VINS to the design workflow. A total of 24 designers were recruited from the freelancing website Upwork. All designers were based in the USA and reported receiving formal UI/UX training. The designers' experience ranged from 1 to 5 years, with an average of about 2 years. They were compensated $10 USD, and each interview took about 30 minutes.

The interview script consisted of 23 questions related to (1) the current strategy of finding design examples; (2) the difficulties encountered throughout the process; (3) aspects of similarity between UI designs; and (4) the applicability of VINS. To analyze the responses, we performed a 3-stage thematic coding. We began by literally coding keywords from the designers, then grouped the responses through discussion according to the themes that emerged, and finally reported how many designers mentioned the various themes. Two researchers engaged in this consensual qualitative process [20], and a third researcher confirmed the findings.

In line with the findings reported by Herring et al. [18, 19], most designers (18 out of 24) agreed that design examples are often used for almost every project they work on. For example, as D8 explained, "I find them very helpful and tend to use them as much as possible. Every project of mine contains at least 60% of design examples". Similarly, D6 stated "Everytime [sic] when I have to design. It's a compulsory thing for me". Their role in the design process is therefore beneficial to the designers. Specifically, when asked if exposure to examples is a good means of inspiration, the answers were mostly favorable ("Definitely yes": 15 designers, "Probably yes": 5 designers, "Might or might not": 3 designers, "Definitely not": 1 designer).

Given the vast design space, designers further reported that they collect different design aspects from examples, such as various layouts (22 designers), font styles (14 designers), different palettes (11 designers), and content hierarchy (4 designers). As D6 mentioned, "Mostly the layouts, fonts and how to organize data on the screen". Similarly, D11 mentioned "Mostly it's the layout. Other than that color schemes, font styles also". In addition to the common themes, some designers mentioned that examples allow them to: (1) explore emerging design trends: "It gives us an idea of what type of design trends now a day and explores many designs widgets" (D12); (2) discover new ideas: "Examples give you different ideas to create and manage your content" (D2); (3) increase creativity: "By looking at the designs creativity can be increased which can help you creating new designs of your own" (D7); and (4) identify visual attractiveness: "It helps you find what looks attractive and what's not" (D17).

By analyzing the themes of how these examples are collected, we found that designers often utilize different strategies. Most commonly, 21 designers reported that they search for these examples using keywords, either by browsing the web (i.e., Google) or navigating through various design sharing websites (i.e., Behance, Pinterest). As D1 mentioned, "Yeah, I would look out on Google, Behance to search the best design examples". Similarly, D4 mentioned "I search on Google and design websites such as Free pic, Pinterest with keywords". Other designers would survey the market to view what has already been done by their competitors or utilize their old work as inspirational examples. As D6 stated, "Going through the online UI kits for inspiration and sometimes I use my old designs". Although the process of finding design examples is beneficial, it is also tedious and time-consuming.
While 8 designers mentioned that it depends on what designs they are searching for, most designers stated that it usually takes a long time. Specifically, 9 designers mentioned that it usually takes hours to find good examples, while 3 designers stated that sometimes it may take more than a day. As D11 stated, "Usually, it takes about half a day (4-5 hours) to find all the relevant design examples". Also, D20 mentioned "It depends on what type of design you find but it takes 1 day". In addition, designers identified other key obstacles they face within the process. As part of the current search mechanism, 21 designers indicated that they use keywords to find design examples. Although keywords are simple, they have certain limitations. For example, keywords are limited in their ability to describe the design specifications: "searching results doesn't meet our specific design requirements" (D2). More specifically, it is difficult to describe a specific layout design with only text queries: "don't know what exact query I should write" (D17) and "difficulty to find the right description of what's in my mind" (D21). As a result, the design examples returned by the search might not be relevant to the original design requirements specified in the query. As D24 said: "I personally find this impossible to do. I may or may not be able to find an inspiration with the layout I have in mind. In this case, I usually only search for design inspirations to pick out a color scheme that I'll implement to the design layout which I already have in mind".

Although the current searching process is limited by keywords, designers did not completely agree on the role of keywords in effectively retrieving similar layouts. In response to whether UI similarity can be measured by keywords, 10 designers reported that keywords can be considered an indication of similarity, while 14 designers pointed out that other design aspects should be considered. As D11 stated, "Keywords are not the only measure for design similarity. Functionality and structure help in it too". Similarly, D12 also supported this opinion by stating "Not only depends on keywords but also may be its functionally and visuality are not the same". This emphasizes the need to consider new design aspects, other than keywords, for developing effective tools that support retrieving similar design examples.

To better understand the definition of layout similarity from a professional perspective, we asked designers to elaborate on their definition of similarity in regard to three aspects: structure, functionality, and visual elements, and to identify which of these aspects is more important in the retrieval process. Designers had different perspectives regarding these aspects and their importance. Specifically, 7 designers agreed that all three aspects are equally important and play a key role within the searching mechanism, as D5 stated: "The structure is first and main thing in design and visual is another thing that attract us. But functionality is important, but it will be according to design requirements". The other remaining designers favored particular aspects, as D11 mentioned: "I find those designs helpful that have a functional and structural similarity with what I need".
D21 also supported this similarity definition: "I would categorize two mockups that have a similar structure and functionalities but different color schemes to be 'similar UI layouts'".

Thus, our formative interviews confirm what we found in the literature about the limitations of keywords. To overcome this limitation, we also identify the importance of considering the design aspects of functionality, structure, and visual elements in order to better support the example-finding process. In contrast to previous work [22, 32], in VINS we emphasize functionality and structure in our design search.
Our approach for retrieving similar UI designs is based on object detection. The goal of object detection is to detect all instances of objects from a given set of classes and localize their exact positions in the image. The location of an object is defined in terms of a bounding box, which is represented by the rectangular boundary coordinates that fully enclose the object.

Typically, training a good detector requires a large number of training images in which objects are annotated with high-quality bounding boxes [9, 13, 14, 16]. For the development of VINS, it is thus essential to have a large-scale, carefully annotated dataset of UI design screens. To the best of our knowledge, there is no large-scale public dataset available that serves this purpose. Therefore, we created the VINS dataset, a new annotated dataset containing a representative collection of UI screens across two design stages: abstract wireframes and high-fidelity fully designed interfaces. All of these UIs are annotated with bounding boxes spanning different classes of UI components. We identified a total of 11 UI components with varying functionality: background images, sliding menus, pop-up windows, input fields, icons, images, texts, switches, checked views, text buttons, and page indicators. Based on our analysis, and due to the relatively small number of training instances, we combined radio buttons and checkboxes into the checked view class.

The VINS dataset has a total of 4,800 images of UI design screens, including 257 images of abstract wireframes and 4,543 images of high-fidelity screens. We opted to include images of different design stages to ensure that VINS can perform on a wider variety of design inputs.
The wireframe-based dataset represents the initial stage of design and contains digital low/medium-fidelity images that describe the outline of the UI screen. From Uplabs, we collected 257 abstract wireframe designs of different templates and layouts. We collected only a relatively small subset of images because of the simplicity of these wireframes, which represent the skeleton of the interface and are typically stripped of all styling and design elements that might affect the detection process. The dataset is available at https://github.com/sbunian/VINS.

Figure 3: Wireframe templates with different prototyping styles.

Wireframes can be generated using different tools, including a whiteboard, paper-and-pencil, and graphic design applications. Because designers use different prototyping styles, as shown in Figure 3, such as representing an image placeholder with either a mountain or a square with a cross, we included wireframe templates that represent different prototyping styles, thereby ensuring the generalizability of VINS.
The high-fidelity dataset contains 4,543 images of carefully selected, quality UI designs. To ensure that VINS generalizes across different platforms, we included images of both iPhone and Android interfaces from popular apps across different categories. For Android, we first started by manually selecting 2,000 high-quality screens from the Rico dataset [32]. Due to Rico's large scale, it was very difficult to filter out the duplicate, non-English, and outdated UIs. As a result, we additionally collected 740 UI images by navigating through different popular Google Play apps and taking screenshots of the UIs, resulting in a total of 2,740 Android screens. To learn the iPhone design patterns, we downloaded and browsed a number of apps from different categories and took a screenshot of each screen, resulting in a total of 1,200 UI images. To ensure quality selection on both the Android and iPhone platforms, we ensured that the selected UIs are from popular apps across different categories with an average rating of more than 4 stars. To allow the system to identify new design trends, we collected 603 UI designs from the community-powered website Uplabs, which offers quality digital UI inspirations from different designers. To find these images, we used several keywords such as "Mobile UI Kit", "Mobile Onboarding Screen", "Mobile Login Screen", etc.
Crowdsourcing [46] for data annotation has recently attracted a lot of attention within the computer vision community due to the need for large-scale data and the lack of sufficiently labeled data. To this end, we utilized crowdsourcing and recruited 6 students to thoroughly annotate the VINS dataset. The students were recruited through the university's internal Slack channel and were compensated with $12 USD per hour.

Inspired by the approach presented by Su et al. [46], we follow a similar strategy to crowd-source bounding box annotations. The goal of this strategy is to ensure each bounding box's high quality and complete coverage, i.e., to be as tight as possible while containing the entire instance of the object. To ensure accurate annotations, the process starts with an initial training that consists of reading a set of instructions, understanding the rules, and passing an assessment test before the participants can engage with the annotation task, as described in detail below.

Figure 4: Instructions for drawing a perfect-fit bounding box.
The instructions are composed of the following items. First, we asked students to read a document that outlines the collection of all 11 intended UI design components (e.g., background image, icon, input field, etc.), together with their functionality and a style guide. For example, we included the set of icon concepts extracted in [32] as an icon style guide to recognize icon patterns and distinguish them from normal images. This enforces a deeper understanding of each aspect of the design elements. Second, we provided an example set of annotated UI design images where all instances of the UI components already have a bounding box associated with a class label. This provides a better understanding of how to make a good bounding box annotation. Third, we gave specific instructions on how to use the RectLabel annotation tool (https://rectlabel.com). Students were compensated for the tool's monthly subscription fees.

We also provided a set of rules to be followed during the annotation process:

• Perfect fit: When drawing the bounding box, it should be as tight as possible while perfectly containing the object to be annotated. It is important to note that the boxes should neither be too tight (i.e., not covering all the visible parts of the object) nor too loose (i.e., containing too much space and unnecessary parts of the background), as shown in Figure 4.

• Correct labeling: Once the bounding box is drawn, a label must be assigned to it. The label must match the class of the annotated object. To overcome any confusion when assigning the labels, students must refer to the style guide documentation or contact the research team for feedback.

• Multiple objects: A bounding box must be drawn for each of the multiple UI objects available in the design image and assigned the correct label accordingly.
We then asked the students to pass an assessment test, which includes a set of test images. Test images were selected to cover all the classes of UI components. We already have the ground-truth bounding boxes for these images and used them to evaluate the quality of the students' annotated results. Within this test, students were requested to complete two tasks: drawing new bounding boxes and modifying existing ones. For the first task, we provided students with design images that do not yet have bounding boxes and requested that they draw boxes around the available components. For the second task, we provided students with images that contain bad bounding boxes. These images had been generated by either changing some of the bounding boxes' class labels or perturbing their coordinates. Students were requested to modify these bounding boxes accordingly.

To ensure quality bounding boxes, students had to achieve a 90% Intersection over Union (IoU) for all images in both tasks. This is an iterative process in which, after each submission, we reviewed the results and provided the students with the necessary feedback if the bounding boxes were not correctly drawn or did not belong to their respective class. They could only start working on the actual images after completing the training with a high IoU score.

The process workflow continues with the drawing task, where we gave each student a batch of 100 images and asked them to fully annotate each image. It normally takes an average of 10 hours to complete each batch. Once the drawing task was completed, they proceeded to the task of quality verification. We randomly assigned each student a different batch of fully annotated images from the drawing task and requested that they examine them. They needed to verify the quality and label of all the bounding boxes within an image and modify them accordingly. Finally, after completing the quality verification task, we conducted a control verification task on all the annotated images to ensure the quality and coverage of all the bounding boxes. In this task, we examined the bounding box quality for each image and modified it if needed. However, nearly all images were accurately annotated and only a few needed slight modifications of the bounding box coordinates.

After the annotation process was completed, the generated VINS dataset contains pairs of a UI design image and its corresponding XML file in the Pascal-VOC format [13]. To the best of our knowledge, this is the first annotated, publicly available UI design dataset for object detection.
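Since annotation quality is gated on a 90% overlap with the ground truth, a small sketch of the IoU check may help; the corner-coordinate box format and the helper name are ours, not from the paper.

    def iou(box_a, box_b):
        # Boxes are (xmin, ymin, xmax, ymax) in pixels.
        ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    # A student's box passes quality control when it overlaps the
    # ground-truth box by at least 90%, i.e. iou(student_box, gt_box) >= 0.9.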
VINS, our proposed visual search framework, takes a UI layout as its input query and provides structurally similar UI design examples for inspiration. Instead of just finding similar images indexed by their visual content, such as color, texture, and shapes, we focused on developing a more advanced visual search system that indexes the image by its functionality (defined by its content) and leverages its structural information. VINS has two main components: Detection and Image Retrieval. The detection process detects the input query's different UI components to produce a tentative segmented layout. Trained on these generated segmented layouts, our image retrieval process learns a joint feature embedding to find designs similar to the input query. Below is a detailed discussion of the two components.
Figure 5: The framework of our image retrieval model. It consists of two parts: an image model and a label model. The image model learns a visual feature vector that encodes the hierarchical structure of the input image. The image encoder is conditioned on a box attention map for better structure learning. The label model learns a feature representing all the available UI components, which serves as a high-level control to support the learning process of the visual features. We concatenate the image and label features to produce a final embedding that is used for the retrieval process.
To find structurally similar UI design images in the reference dataset, we need to identify and locate the different UI components that exist in the image. Usually, the placement and functionality of these UI components vary widely across different designs. Therefore, the first step in the process is to accurately infer the bounding boxes of the different UI components and their domain-specific types. To achieve this goal, we adopted the Single Shot MultiBox Detector (SSD) model [33]. We opted to use SSD because of its simplicity and state-of-the-art performance for object detection [33]. SSD takes a single shot to detect the multiple objects within the image, meaning that the tasks of object localization and classification are completed in a single forward pass of the network. As part of the SSD, we used MobileNets [21] as the base network for feature extraction because they are optimized primarily for speed.

To effectively support the retrieval of various UI design examples, the object detector is configured to detect and classify the most common UI components in an input query. The detector outputs are used to generate a semantic structured layout where each detected bounding box is rendered with a unique color based on its class label. The generation of these semantic layouts provides an easy understanding of the context and hierarchical structure of the input query. This new set of generated semantic layouts represents the collection of images used to train the image retrieval system described below.
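As a rough, hedged illustration of this detection stage (not the authors' implementation), an off-the-shelf SSD-style detector such as torchvision's SSDLite with a MobileNetV3 backbone could be fine-tuned on the VINS classes and its detections rendered as a color-coded semantic layout; the class count, palette, and score threshold below are assumptions.

    import torch
    from PIL import Image, ImageDraw
    from torchvision.models.detection import ssdlite320_mobilenet_v3_large
    from torchvision.transforms.functional import to_tensor

    NUM_CLASSES = 13  # 12 UI component classes (incl. upper task bar) + background
    PALETTE = {1: "#1f77b4", 2: "#ff7f0e", 3: "#2ca02c"}  # class id -> fill color (truncated)

    model = ssdlite320_mobilenet_v3_large(weights=None, weights_backbone=None,
                                          num_classes=NUM_CLASSES)
    model.eval()  # assumes the model has been fine-tuned on the annotated VINS screens

    def semantic_layout(screenshot, score_thresh=0.5):
        # Detect UI components and draw each box in a color unique to its class.
        with torch.no_grad():
            pred = model([to_tensor(screenshot)])[0]  # dict of 'boxes', 'labels', 'scores'
        layout = Image.new("RGB", screenshot.size, "white")
        draw = ImageDraw.Draw(layout)
        for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
            if score >= score_thresh:
                draw.rectangle(box.tolist(), fill=PALETTE.get(int(label), "#7f7f7f"))
        return layout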
For each query image, after detecting its different UI components in the previous phase, the next step is to search for design examples in the reference dataset that share a similar hierarchical structure. We believe it is important to focus on two main aspects of each query image: the collection of its UI components and their spatial location. To gain more insight into the image's components, we introduce a high-level attribute encoding the components' class labels. This allows us to learn joint features to measure the semantic relevance across different images at a more fine-grained level.

We propose a multi-modal embedding framework that learns joint features of the image's structure and associated content and uses them to guide the UI retrieval process. The proposed framework consists of two models, as shown in Figure 5. The first model is an attention-aware image autoencoder with encoder E that takes an image x as input and produces a structural feature vector z_E = E(x) encoding the hierarchical structure of the image. The second model is a label encoder-decoder A that learns a feature vector z_A = A(y) capturing the UI components y in the image. These two feature vectors are fused by concatenation to form the final representation z = (z_E, z_A). Below we discuss the two models in detail.

Our encoder takes in a semantic input image, which is down-sampled to size 256x256. For a fair comparison of results, we follow the design of the autoencoder described in [32]. It consists of 4 convolutional layers of increasing feature size. The convolutional layers are arranged as 3x8, 8x16, 16x16, 16x32 (input channels x filters). The kernel size and stride across all layers are maintained at 3 and 1, respectively. A ReLU activation and a max pooling layer of size and stride 2 are applied after every convolutional layer. This is considered the base autoencoder model.

One of our contributions is augmenting the base autoencoder model with a box attention mechanism, following the approach presented by Kolesnikov et al. [24]. That approach was used to model object interactions in an object detection pipeline. We, however, employ it differently by incorporating it in our encoder model to guide the image retrieval process, providing a more comprehensive understanding of the image structure. The idea of this box attention mechanism is to create a map, represented as a spatial binary image, encoding the location of the UI components. The binary image is of the same size as the original input image and has 3 channels. The first channel represents the bounding boxes of the UI components, with all pixels inside the bounding boxes set to 1 and all other pixels set to 0. The second channel is all zeros, and the third channel is all ones. To integrate this additional box attention into the base encoder model, the attention map is conditioned on the output of the convolutional layers. This conditioning procedure can be applied to every convolutional layer of the base encoder model. In our case, we created different models, each employing the above conditioning procedure on a varying number of convolutional layers, and report the performance of each in the results section. The final structural feature vector of the encoder model is a 32x16x16-dimensional tensor.

The decoder D aims at reconstructing the original image from the latent representation z_E such that D(z_E) is as similar to the input x as possible. It consists of the same encoding layers in reverse order, with up-sampling layers instead of max pooling layers.

To train the autoencoder model, we minimize the Mean Squared Error (MSE, the L2 norm) as the reconstruction loss of the model. This loss measures how close the reconstructed input \tilde{x} = D(E(x)) is to the original input x:

L_{rec}(x) = \| x - \tilde{x} \|^2    (1)

To further improve the model learning, we introduce an additional term into our loss function based on the Dice coefficient [45]. The Dice coefficient is an overlap-based metric widely used in segmentation problems for pairwise comparison between two binary segmentations. It is based on the intersection over union (IoU) measure and aims at detecting object boundaries to measure the overlap between two samples. It is defined as:

DiceCoef = \frac{2 \sum_i x_i \tilde{x}_i}{\sum_i x_i + \sum_i \tilde{x}_i}    (2)

and the Dice loss function is simply:

L_{Dice} = 1 - DiceCoef    (3)

The loss function that our autoencoder minimizes is then:

L_{AE}(x) = L_{rec} + L_{Dice}    (4)
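The sketch below illustrates one way the pieces described above could fit together in PyTorch: the four-block encoder, a box-attention map concatenated to the inputs of the last m blocks (one plausible reading of the conditioning, since the exact operation is not spelled out here), and the MSE-plus-Dice loss of Eqs. (1)-(4). It is a sketch under these assumptions, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BoxAttentionEncoder(nn.Module):
        # Conv blocks 3->8->16->16->32 (kernel 3, stride 1), each followed by ReLU and
        # 2x2 max pooling, so a 3x256x256 input yields a 32x16x16 structural feature.
        def __init__(self, m=4):
            super().__init__()
            channels = [3, 8, 16, 16, 32]
            self.m = m
            self.blocks = nn.ModuleList()
            for i in range(4):
                in_ch = channels[i] + (3 if i >= 4 - m else 0)  # +3 where attention is injected
                self.blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, channels[i + 1], kernel_size=3, stride=1, padding=1),
                    nn.ReLU(inplace=True),
                    nn.MaxPool2d(kernel_size=2, stride=2),
                ))

        def forward(self, x, attn):
            # attn: Bx3xHxW map with a boxes channel (1 inside boxes), a zeros channel,
            # and a ones channel, resized to match each feature map it is fused with.
            for i, block in enumerate(self.blocks):
                if i >= 4 - self.m:
                    a = F.interpolate(attn, size=x.shape[-2:], mode="nearest")
                    x = torch.cat([x, a], dim=1)
                x = block(x)
            return x

    def dice_loss(x, x_hat, eps=1e-6):
        # L_Dice = 1 - 2*sum(x*x_hat) / (sum(x) + sum(x_hat)), per Eqs. (2)-(3).
        inter = (x * x_hat).sum()
        return 1.0 - (2.0 * inter + eps) / (x.sum() + x_hat.sum() + eps)

    def autoencoder_loss(x, x_hat):
        # L_AE = L_rec + L_Dice, with L_rec the mean squared error (Eqs. 1 and 4).
        return F.mse_loss(x_hat, x) + dice_loss(x, x_hat)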
Given an image, we consider the unique class labels of the UI components associated with it. This information is reflective of the content of the image and can serve as a high-level control to support the learning of the image's UI layout. In some cases, a specific UI component dominates the image by taking up a significant proportion of its layout, which diminishes the rest of the components present. Encoding the class labels of the UI components can therefore influence the learning of the overall layout structure. Thus, the label model is built around the encoder-decoder paradigm and aims to encode the class labels of the detected UI components to convey the UI content.

Each image is assigned a multi-class label representing the unique classes in the image, encoded as a multi-hot vector of size 11, where the presence of each class is set to 1. This vector is then fed to a series of 3 fully connected layers of sizes 16, 32, and 64, respectively, which form a 64-dimensional content vector. As part of our fine-tuning process, we found that a vector of size 64 yields the best results. We used the MSE loss function defined in Eq. 1 for training the label encoder.

Both the image and label models were trained end-to-end until convergence. They were optimized using Stochastic Gradient Descent with a fixed learning rate of 0.00005 and a mini-batch size of 32. We selected these hyper-parameters empirically based on the training dataset.

For each query image, the retrieval task focuses on returning a ranked list of the most likely similar images from the reference dataset. This is achieved by estimating the similarity of two images based on the learned embedding vector associated with each image. The embedding vector is a concatenation of both the structural (Section 5.2.1) and content (Section 5.2.2) feature vectors. We apply the Euclidean distance to estimate the similarity score between the embedding of the query image and that of each image in the reference dataset.
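A corresponding sketch of the label model and the distance-based ranking, under the same caveat that names and any details beyond the stated layer sizes are ours:

    import torch
    import torch.nn as nn

    class LabelEncoder(nn.Module):
        # Multi-hot vector over the 11 component classes -> FC layers of 16, 32, 64 units.
        def __init__(self, num_classes=11):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(num_classes, 16), nn.ReLU(inplace=True),
                nn.Linear(16, 32), nn.ReLU(inplace=True),
                nn.Linear(32, 64),
            )

        def forward(self, y):
            return self.net(y)  # Bx64 content vector

    def retrieve(query_emb, reference_embs, k=10):
        # Embeddings are the flattened 32x16x16 structural vector concatenated with the
        # 64-d content vector; rank reference UIs by Euclidean distance to the query.
        dists = torch.cdist(query_emb.unsqueeze(0), reference_embs).squeeze(0)
        return torch.topk(dists, k, largest=False).indices  # indices of the top-k matches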
We evaluate VINS's performance in a threefold manner. First, we evaluate the object detection model, then the image retrieval model, and finally the end-to-end combined model.
The first step in VINS is locating and classifying the different UI components in the input query to ensure the construction of a good representative layout structure. We trained the employed SSD model [33] from scratch with a learning rate of 1x10^-2. The model's performance was evaluated by calculating the mean Average Precision (mAP) and the Area Under the Precision-Recall Curve (AUC) over all classes.

Our first objective is to evaluate the annotations of the VINS dataset against the predefined view hierarchies from the Rico dataset. Since only 2,000 images from the Rico dataset have been annotated as part of VINS, we selected from these annotated images only those containing the most common classes: text, text buttons, icons, images, background images, and page indicators. We made this decision to ensure a fair comparison, as the remaining classes are not sufficiently covered within the images that we selected from Rico. This results in a dataset of 1,230 images that was split into training, validation, and test sets using an 80:10:10 ratio. We use an IoU of 0.5 for all object detection results, since it is normally considered a good detection threshold.
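For readers reproducing this kind of evaluation, mAP at a fixed IoU of 0.5 can be computed with an off-the-shelf metric such as torchmetrics' MeanAveragePrecision; the snippet below is a usage sketch with made-up boxes, not the authors' evaluation code.

    import torch
    from torchmetrics.detection.mean_ap import MeanAveragePrecision

    metric = MeanAveragePrecision(box_format="xyxy", iou_thresholds=[0.5])

    # One image: predicted vs. ground-truth boxes (all values illustrative).
    preds = [{"boxes": torch.tensor([[10., 10., 120., 60.]]),
              "scores": torch.tensor([0.92]),
              "labels": torch.tensor([3])}]
    target = [{"boxes": torch.tensor([[12., 8., 118., 62.]]),
               "labels": torch.tensor([3])}]

    metric.update(preds, target)
    print(metric.compute()["map"])  # mAP at IoU 0.5 over all images seen so far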
Table 1: Average Precision (IoU = 0.5) for the detection of each of the 7 class labels, between Rico's view hierarchies and VINS's annotations.

Class Label         Rico Dataset AP (%)    VINS Dataset AP (%)
Background Image    68.61                  89.55
Icon                29.61                  33.28
Image               36.13                  81.65
Text                34.10                  71.48
Text Button         66.91                  88.47
Page Indicator      10.28                  63.70
Upper Task Bar      90.90                  90.90
mAP                 48.08                  74.15
Most UI images contain an upper status bar that displays information (e.g., time, battery level, cellular carrier) along the screen's upper edge. As part of our approach, this bar is not cropped out but rather introduced as a new UI component, for two reasons. First, since it displays information containing text and icons, it is often misclassified as belonging to one of the two classes of text or icons. Second, because our dataset contains different UI styles (e.g., Android, iOS, wireframes) and the position of the upper status bar varies depending on the UI, treating it as a component lets the detector locate the bar in a new UI image regardless of its style. Although the Rico dataset contains only Android UIs, we still follow the same convention and include the upper status bar as part of the detection process.

Table 1 shows the Average Precision (AP) at an IoU of 0.5 for each of the 7 classes in both the Rico and VINS datasets. We can see that VINS's annotations make objects of interest more recognizable to the detector and that we are able to achieve a higher AP across all 7 classes. Overall, VINS's annotations provide more than a 26% increase over the mAP obtained with Rico's hierarchies.

Next, we evaluate the performance on the complete VINS dataset, which contains a total of 4,543 images. We follow the same approach of splitting the dataset into training, validation, and test sets using an 80:10:10 ratio. We test the model on a test set consisting of 450 images and achieve an overall mAP of 76.39% and AUC of 79.02% across the predefined set of classes. Table 2 shows the AP for each of the 12 class labels. Overall, the model performs well across most of the classes, with the sliding menu component having the highest AP of 100%. However, the checked view class has the lowest AP of 44.48%, which can be attributed to its high cross-class similarity and its small size, as object detection models often struggle with detecting small objects [28].
To quantitatively evaluate the image retrieval performance, we calculate the precision of the top k recommended images from the ranked retrieved list. To do so, we first remove the wireframe UIs from the dataset, since they cannot be retrieved as inspirational design examples, resulting in a dataset of 4,543 images. We follow the same aforementioned split approach to create a test set of 450 images.
Table 2: Average Precision of detection results for each of the 12 class labels on the test set consisting of 450 images from the VINS dataset (IoU = 0.5).

Class Label         AP (%)    AUC (%)
Background Image    89.33     94.45
Checked View        44.48     43.70
Icon                50.50     49.55
Input Field         78.24     80.81
Image               79.24     81.68
Text                63.99     65.30
Text Button         87.37     92.95
Page Indicator      59.37     61.07
Pop-Up Window       93.75     97.20
Sliding Menu        100       100
Switch              80.00     82.45
Upper Task Bar      90.40     99.09
We recruited 3 interface designers from Upwork, all with considerable mobile UI/UX design experience, to assign a label representing the design category to each of the test images. Based on the labels assigned, we identified 8 design groups for the UIs, as follows: login, login with background image, sign up, introduction, introduction with background image, sliding menu, pop-up window, grid-based, and list-based. We eliminated design groups containing fewer than 10 related images, resulting in a test set of 395 images for this evaluation. Then, we take each image in the test set as a query and retrieve a ranked list of its K nearest neighbors. Finally, we calculate the precision score for the top K retrieved images (precision@K), which is defined as the percentage of the K retrieved images in the list that belong to the same design category label as the query. Since our test set is relatively small, we set the maximum retrieval limit to K = 10.
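Concretely, precision@K here counts how many of the K retrieved UIs share the query's design-category label; a minimal sketch (function and variable names are ours):

    def precision_at_k(query_label, retrieved_labels, k):
        # Fraction of the top-k retrieved UIs whose design category matches the query's.
        top_k = retrieved_labels[:k]
        return sum(1 for label in top_k if label == query_label) / float(k)

    # e.g. precision_at_k("login", ["login", "sign up", "login"], 3) == 2/3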
Table 3: Precision score at top K retrieved images from the validation set for the baseline model and different models of our proposed method with a varying number of attention maps m applied.

Model         Top 1    Top 2    Top 4    Top 6    Top 8    Top 10
Baseline
Ours (m=1)
Ours (m=2)
Ours (m=3)
Ours (m=4)    92.05    91.79    89.48    88.37    87.21    86.48

In our proposed model, we incorporate the box attention mechanism into our base encoder by conditioning it on the output of the convolutional layers. We treat the number of attention maps m as a hyper-parameter and experiment by creating different models with a varying number of maps. When m = 1, the model has 1 attention map applied to the output of the last convolutional layer. As m increases, attention maps are applied to the last m convolutional layers. Because we have 4 convolutional layers, when m = 4, each layer has an attention map applied to it.
Figure 6: Retrieval results of query images selected from the test dataset for the baseline model and our model. The first column shows the query UI represented as a semantic layout, and the rest are the Top-3 retrieved UIs from both models. The colored bounding boxes represent the different UI components, which are described in the legend below the comparison.
We compare the retrieval results of our proposed model with the baseline autoencoder implemented in [32]. Table 3 shows the precision scores for different numbers of k neighbors on the validation set for the models used in the experiment. We can see that our model outperforms the baseline and further improves the precision rates at each value of K. The model with m = 4 has the best overall performance. It is able to achieve 92.05% precision for the top 1 nearest neighbor and 86.48% for the top 10 nearest neighbors. This marks an almost 4-6% improvement over the baseline model and demonstrates how incorporating these attention maps aids in capturing the UI layout and thus retrieves more relevant images for the input query. We selected the model with m = 4 and used it to complete the remaining analysis.

To qualitatively evaluate the retrieval process, we visualize query results of randomly sampled images from the test set. Although both models often perform similarly, there are cases where our model outperforms the baseline in providing examples that better fit the query's content and structure, as shown in Figure 6.

As part of the evaluation, we consider that the UI components given in the query image are important to the designer. This is derived from the designers' responses in the interviews reported in Section 3. When asked about a query of a specific layout with a specified number of basic components, 18 designers indicated that a page with the exact basic components plus various additional components would be considered an acceptable UI design example.

Our model is able to consider the query's UI components and retrieve examples that fit the overall layout structure (Example a). The baseline, however, retrieves only the Rank 1 result similar to the query and fails in retrieving the other two because it disregards the input fields and only detects the background image and the position of the text buttons. This may indicate that the baseline model sometimes struggles to understand the available UI components and may be performing the retrieval based on the dominant color available. We note that both models fail to capture the checked view component from the query, which may have happened because we are validating with a relatively small dataset and, as a result, this dataset may not contain similar designs that have all the exact same components.

Our model is also able to capture the representation of the query's layout structure (Example b). All ranked results exhibit a consistent structure, following the sequence of component placement in terms of image, text, page indicator, and text buttons, respectively. However, this is not the case for the ranked results from the baseline model, as they are either missing a text button (Rank 1 and 2) or missing a page indicator and introducing 4 input fields (Rank 3). These examples show how our model is better able to identify the UI components and conform to the query's overall layout structure.
Figure 7: Query results for VINS, which takes an input query that can be from different design stages (abstract wireframes or high-fidelity designs) (first column), detects its UI components to generate a segmented layout (second column), and then retrieves a ranked list of similar designs (remaining columns).
To qualitatively analyze VINS's performance, we visualize the end-to-end query results, including detection and retrieval, on the test set. Below, we discuss the feedback of expert designers and the potential of VINS to support other design applications.
To gain insight from a professional user's perspective regarding VINS's performance, we recruited 5 new designers from Upwork to evaluate selected results from the test set. The participants were all US-based UI/UX designers with an average of 2 years of experience and an average job success rate of 85%. As part of the evaluation study, we provided them with 10 sets of query designs, each containing the corresponding Top-5 retrieved results from the test set. We sampled 5 high-fidelity layouts and 5 abstract wireframes as part of the set to ensure a diverse collection across the design stages. Designers were compensated with $20 USD for the evaluation study, which lasted approximately 50 minutes.

We asked designers to answer a survey consisting of 5 free-form questions regarding the overall relevance between the query and each of the Top-5 retrieved results, the layout and functionality relevance, the effect of additional components in the retrieved set, and whether the design examples provide useful design variations. Some of the specific questions included: "How would you comment on the relevance between the query and each of the 5 images in the retrieved results?", "How would you comment on the layout and functionality relevance between the query and the set of retrieved results?", and "Does the set of retrieved results provide useful design variations?". Similar to the formative interviews (see Section 3), two researchers engaged in analyzing this data, with a third researcher confirming the results.

All the designers mentioned that the retrieved results are relevant to the input queries and would provide beneficial design examples. As E1 stated, "They are very relevant to what was intended in the query". Specifically, one designer appreciated how the layout composition and design patterns of the retrieved results match the query. As E3 mentioned, "I think results looks great based on image query. The layout composition and design patterns elements matches query". Another designer (E1) mentioned how VINS is able to retrieve layouts similar to the query in example f: "Rank 1 to Rank 5 are quite similar to what was required in the query with an image on top and one or two buttons according to the requirement". Although the retrieved layouts may be very similar to the query, they still provide design inspiration such as color schemes, as E2 said: "Mostly the layout is the same, but they do offer different designs using different color patterns". In addition to relevancy, all designers agreed that VINS also provides useful design variations regarding different aspects, such as "I see properly [sic] designs and all provide useful design variation" (E2), "Yes the design layouts are quite useful" (E1), and "I think this query provides great variations of composition layout" (E5).
Figure 7 shows 6 different query UIs, which were part of the given survey, and their corresponding Top-3 retrieved results. All designers were satisfied with the design examples for the onboarding queries with regard to design variations in example a: "I really like the variations of the provided examples. Colors, layouts, typography looks great" (E3); and in example f: "they do offer different designs using different color patterns" (E2). One designer (E1) noted the addition of new components, such as the forward button in example f. Although there are slight variations in the sliding-menu results (example d), E4 commented that these variations are "beneficial to generate more ideas on how to solve certain design problem". Designers also observed VINS's capability of detecting the background image and described the results as "All are unique due to background images" (E2) and "They all offer login functionality with different layouts" (E1). Furthermore, designers expressed satisfaction with the grid-based layouts (example c), which provide matching layouts while offering design variations (E1) and extra functionality (E4). For the login screen (example e), 4 designers liked the variety in offering different login layout designs and login options. However, E1 commented that some results offer fewer components than what is required in the query; for example, the concept of the Google login button is ignored.

In general, we observed that including extra components provides new ideas and design variations that are appreciated by the designers. As stated by E3, "Adding additional components in query results is definitely beneficial for inspiration and providing possibly better solution". Similarly, E5 mentioned "They are useful. We might get a new idea after watching extra element". However, in some cases these extra components may or may not be required, depending on the conditions and the client. We also attempted to understand designers' perspectives regarding the slight variations in either layout or functionality in the retrieved results. E1 noted that layout variations are helpful: "For the most part layout matches the query however the functionality is not so. It is not necessarily bad thing because seeing different examples may generate more ideas on how to solve design problem". E5, however, emphasized the importance of functionality over layout: "I think layout doesn't matter here. Our priority should be given the right functionality". The rest of the designers indicated that the layout and functionality of the results match the query. This reflects VINS's effectiveness in retrieving useful design variations and how important it is for the search tool to balance these two design aspects: offering similar layouts while maintaining the functionality.
Although VINS sometimes fails to detect all of the UI components, we observed its ability to retrieve examples similar to the partially detected layout. This is related to the problem of auto-completing UI layouts [30]. To evaluate VINS in supporting this task, we created partial abstract wireframe layouts containing only 2–3 UI components. Based on the partial layout given, VINS brings design knowledge to the process by identifying the entered UI components and then suggesting design examples that complete the layout. As shown in Figure 8, VINS provides design examples that maintain the common components of the query (i.e., central image and text) while providing inspiration on how to complete the remaining UI components (example a). VINS also provides ideas on how to complete a certain layout, e.g., login, by retrieving various UIs with the detected input fields and text button (example b).

Figure 8: Auto-completion of UI layout design. The framework is able to provide design examples that complete the remaining UI components based on partial layouts provided by the user.

These examples illustrate the potential of VINS to facilitate the design process by assisting designers in creating layouts. While Swire [22] also supports the auto-completion of partial designs, it cannot be applied directly to this setting: its approach relies on training an alternative model on a new training set containing only partial sketches, which is computationally expensive and impractical. VINS, in contrast, can directly accept partial layouts and provide design examples accordingly.
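To make the query path concrete, here is a minimal, self-contained sketch of retrieving designs from a partial layout. It hand-crafts a toy embedding (a bag of component classes plus a coarse vertical-position band) and ranks a small gallery by cosine similarity; the class list, gallery entries, and featurization are illustrative stand-ins, since VINS itself relies on its learned multi-modal embedding network for this step.

```python
import numpy as np

# Illustrative component classes and screen bands; not the VINS taxonomy.
CLASSES = ["Image", "Text", "TextButton", "EditText", "Icon"]
BANDS = 4  # split the screen into four horizontal bands to keep coarse position

def embed(components):
    """Embed a (possibly partial) layout: one slot per (class, vertical band)."""
    vec = np.zeros(len(CLASSES) * BANDS)
    for ui_class, (_, ymin, _, ymax) in components:
        band = min(int(((ymin + ymax) / 2) * BANDS), BANDS - 1)
        vec[CLASSES.index(ui_class) * BANDS + band] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query, gallery, k=3):
    """Rank gallery layouts by cosine similarity to the query embedding."""
    q = embed(query)
    scores = [(name, float(embed(layout) @ q)) for name, layout in gallery.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# A partial query: only a central image and one button have been placed so far.
partial = [("Image", (0.2, 0.1, 0.8, 0.5)), ("TextButton", (0.2, 0.8, 0.8, 0.9))]
gallery = {
    "onboarding_a": [("Image", (0.1, 0.05, 0.9, 0.5)),
                     ("Text", (0.1, 0.55, 0.9, 0.65)),
                     ("TextButton", (0.1, 0.8, 0.9, 0.9))],
    "login_b": [("EditText", (0.1, 0.30, 0.9, 0.40)),
                ("EditText", (0.1, 0.45, 0.9, 0.55)),
                ("TextButton", (0.1, 0.60, 0.9, 0.70))],
}
print(retrieve(partial, gallery, k=2))  # the onboarding-style layout ranks first
```

The point of the sketch is that nothing in the query path depends on the layout being complete, which is why partial queries can be served without retraining.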
While VINS can retrieve highly similar and relevant UI layouts and marks a clear improvement over previous approaches [32], there are several limitations, improvements, and future directions that we discuss below.
Utilizing deep learning frameworks requires a dataset. As existing datasets were unavailable or not sufficient for the purpose of our work, we proceeded to collect and annotate the VINS dataset, a large mobile UI dataset consisting of UI screens across different design stages. While the VINS dataset was large enough for this work, we still observed that the model was not always able to detect all the UI components or provide similar designs. An even larger dataset could potentially address this by improving the detection performance and providing more design variations. The VINS dataset is already publicly available, and we will provide suggestions on how others can include new UI screens so that the dataset can grow over time.

Furthermore, the dataset currently only includes 11 classes of the most common UI components for the detection process, which limits the applicability of our model to UIs composed of these defined components. This can be improved by including additional UI components
spanning different functionalities, and by identifying the different input field labels (e.g., password, email), text button concepts (e.g., login, skip), and icon classes (e.g., social media, settings), as identified in [32]. Understanding the true nature of components will aid in generating a more fine-grained hierarchical structure that better defines the UI layout and its design components, and thus in retrieving even more similar results. This dataset can also be utilized for other data-driven applications such as mobile layout generation [29] and UI code generation [4, 36, 38].
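As a rough illustration of how new screens could be contributed and consumed, the sketch below parses one screen’s bounding-box annotations, assuming Pascal VOC-style XML files (a common interchange format for object-detection datasets and annotation tools); the class whitelist and file path shown here are illustrative assumptions, not the dataset’s exact taxonomy or on-disk layout.

```python
import xml.etree.ElementTree as ET

# Illustrative class whitelist only; the released dataset defines its own 11 classes.
UI_CLASSES = {"BackgroundImage", "Icon", "Text", "TextButton",
              "EditText", "Image", "CheckedView", "PageIndicator"}

def load_annotation(xml_path):
    """Parse one VOC-style annotation file into (class, normalized box) pairs."""
    root = ET.parse(xml_path).getroot()
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    components = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        if name not in UI_CLASSES:  # ignore classes outside the detector's taxonomy
            continue
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # Normalize so screens captured at different resolutions stay comparable.
        components.append((name, (xmin / width, ymin / height,
                                  xmax / width, ymax / height)))
    return components

# Hypothetical path; adjust to wherever the annotation files actually live.
# components = load_annotation("VINS/annotations/screen_0001.xml")
```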
VINS consists of two main components, object detection and image retrieval, both of which can be improved. Although the object detection model was able to detect certain classes with very high precision, it failed to do the same for other classes, such as checked view and icon. This is due to high cross-class similarity and large in-class variance. It is also related to the known weakness of the SSD model in detecting small objects, which can be addressed by including more images, using augmentation techniques, or utilizing a feature pyramid network structure to enhance detection [28].

As indicated by the designers in our interviews, they focus on three aspects of design: functionality, structure, and visual appearance. We can improve our image retrieval model to incorporate, along with the structure and content, the visual features of the UI, such as imagery, fonts [41], and colors [23, 31, 39]. We can also improve the structural representation of the UI by utilizing a tree-based data structure to encode the hierarchical view of the layout, which would allow better modeling of the relations between the different UI components (see the sketch below). In addition to image retrieval, such tree structures can also be used to automate different design tasks, including auto-completion of partial designs [30] and generation of new layouts [29], which we leave for future work.

Because layout is an important factor in graphic design in general, our visual search framework can be extended from mobile apps to other design layouts, including magazines, posters, and web pages.
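To make the tree-based direction concrete, the sketch below (our own minimal example, not the representation used inside VINS) stores a view hierarchy as nested nodes and flattens it into depth-annotated tokens that a sequence or graph encoder could consume.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class LayoutNode:
    """One node of a UI view hierarchy: component class, normalized box, children."""
    ui_class: str
    box: Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
    children: List["LayoutNode"] = field(default_factory=list)

def flatten(node, depth=0):
    """Pre-order traversal yielding (depth, class, box) tokens.

    Recording the depth keeps parent/child relations that a flat list of
    detections would lose, e.g. which buttons sit inside which container."""
    tokens = [(depth, node.ui_class, node.box)]
    for child in node.children:
        tokens.extend(flatten(child, depth + 1))
    return tokens

# A toy login screen: a form container holding an input field and a text button.
screen = LayoutNode("Root", (0.0, 0.0, 1.0, 1.0), [
    LayoutNode("Form", (0.1, 0.3, 0.9, 0.7), [
        LayoutNode("EditText", (0.15, 0.35, 0.85, 0.42)),
        LayoutNode("TextButton", (0.15, 0.55, 0.85, 0.62)),
    ]),
])
for token in flatten(screen):
    print(token)
```

Feeding such depth-annotated tokens, or the tree itself via a tree or graph encoder, would let the retrieval model distinguish, for instance, a button nested inside a card from a free-floating button.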
Although this paper shows VINS’s capacity to support example-finding behavior, the work presented so far needs to be further validated by its intended users. To achieve this, we plan to conduct user evaluation studies with UI designers to assess how VINS can be integrated into their daily workflow and how well it meets their requirements. Such user evaluation falls under the emerging topic of human-AI interaction [1].

Most designers were very optimistic about VINS. Their feedback in the interviews provided additional insights and unexpected opportunities that can guide future work. One aspect is that VINS can be utilized in the early stages of design, as stated by D7: “Yes, it can be helpful by creating your thoughts on papers as many times as you want, to know about the outlooks of the design”. We can easily extend our dataset to include sketches, for example by utilizing Swire’s dataset [22]. Another interesting aspect is that VINS can be leveraged in the design process to create an interactive experience between designers and clients and enhance the idea communication phase. As reported by D11, “Even a client can be asked to make a layout and see what it looks like”. Designers also suggested the need to provide user-specified constraints, such as keywords and colors, along with a query image, to control the search process.
In this paper, we proposed an object-detection-based visual search framework for UI layout designs. To support the development of our framework, we (1) interviewed UI designers to better understand the problems, needs, and requirements for visual search; and (2) collected a large-scale annotated UI dataset consisting of UI screens across different design stages that can be utilized for object detection training. Utilizing this dataset, our framework first takes an app’s design image, which can be an abstract wireframe or a high-fidelity image, and detects its UI components to construct a tentative segmented layout representation. It then trains a multi-modal embedding model with an attention mechanism on these generated semantic layouts to learn a joint feature representation that can retrieve similar UI designs. Our findings show promising performance in both the detection and the retrieval phases: we achieved a mAP of 76.39% for the detection of the different UI components and a precision between 80% and 90% for the retrieval of relevant design examples for an input query.
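As a small aside on how the retrieval figure can be reproduced on other data, the snippet below shows the standard Precision@k computation over binary relevance judgments; the judgments listed are made-up numbers for illustration only, not our study’s data or exact evaluation script.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved items judged relevant (1) for one query."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0

def mean_precision_at_k(per_query_relevance, k):
    """Average Precision@k over all queries."""
    return sum(precision_at_k(r, k) for r in per_query_relevance) / len(per_query_relevance)

# Made-up judgments: three queries, five retrieved designs each (1 = relevant).
judgments = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
]
print(mean_precision_at_k(judgments, k=5))  # 0.8
```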
REFERENCES
[1] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori Inkpen, et al. 2019. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
[2] Farnaz Behrang, Steven P Reiss, and Alessandro Orso. 2018. GUIfetch: supporting app design and development through GUI search. In Proceedings of the 5th International Conference on Mobile Software Engineering and Systems. 236–246.
[3] Sean Bell and Kavita Bala. 2015. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34, 4 (2015), 1–10.
[4] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, 3.
[5] Nathalie Bonnardel. 1999. Creativity in design activities: The role of analogies in a constrained cognitive environment. In Proceedings of the 3rd Conference on Creativity & Cognition. 158–165.
[6] Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R Klemmer. 2010. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 513–522.
[7] Bill Buxton. 2010. Sketching User Experiences: Getting the Design Right and the Right Design. Morgan Kaufmann.
[8] John Canny. 1986. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 6 (1986), 679–698.
[9] Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1. IEEE, 886–893.
[10] Shaveta Dargan, Munish Kumar, Maruthi Rohit Ayyagari, and Gulshan Kumar. 2019. A survey of deep learning and its applications: A new paradigm to machine learning. Archives of Computational Methods in Engineering (2019), 1–22.
[11] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. ACM, 845–854.
[12] Biplab Deka, Zifeng Huang, and Ranjitha Kumar. 2016. ERICA: Interaction mining mobile apps. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 767–776.
[13] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88, 2 (2010), 303–338.
[14] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. 2009. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 9 (2009), 1627–1645.
[15] Thomas Funkhouser, Patrick Min, Michael Kazhdan, Joyce Chen, Alex Halderman, David Dobkin, and David Jacobs. 2003. A search engine for 3D models. ACM Transactions on Graphics (TOG) 22, 1 (2003), 83–105.
[16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[17] Yasunari Hashimoto and Takeo Igarashi. 2005. Retrieving Web Page Layouts using Sketches to Support Example-based Web Design. In SBM. Citeseer, 155–164.
[18] Scarlett R Herring, Chia-Chen Chang, Jesse Krantzler, and Brian P Bailey. 2009. Getting inspired!: understanding how and why examples are used in creative design practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 87–96.
[19] Scarlett R Herring, Brett R Jones, and Brian P Bailey. 2009. Idea generation techniques among creative professionals. In 2009 42nd Hawaii International Conference on System Sciences. IEEE, 1–10.
[20] Clara E Hill. 2012. Consensual Qualitative Research: A Practical Resource for Investigating Social Science Phenomena. American Psychological Association.
[21] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[22] Forrest Huang, John F Canny, and Jeffrey Nichols. 2019. Swire: Sketch-based User Interface Retrieval. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 104.
[23] Ali Jahanian, Shaiyan Keshvari, SVN Vishwanathan, and Jan P Allebach. 2017. Colors–Messengers of Concepts: Visual Design Mining for Learning Color Semantics. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 1 (2017), 1–39.
[24] Alexander Kolesnikov, Alina Kuznetsova, Christoph Lampert, and Vittorio Ferrari. 2019. Detecting visual relationships using box attention. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
[25] Ranjitha Kumar, Arvind Satyanarayan, Cesar Torres, Maxine Lim, Salman Ahmad, Scott R Klemmer, and Jerry O Talton. 2013. Webzeitgeist: design mining the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 3083–3092.
[26] Ranjitha Kumar, Jerry O Talton, Salman Ahmad, and Scott R Klemmer. 2011. Bricolage: example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2197–2206.
[27] Brian Lee, Savil Srivastava, Ranjitha Kumar, Ronen Brafman, and Scott R Klemmer. 2010. Designing with interactive example galleries. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2257–2266.
[28] Haotian Li, Kezheng Lin, Jingxuan Bai, Ao Li, and Jiali Yu. 2019. Small Object Detection Algorithm Based on Feature Pyramid-Enhanced Fusion SSD. Complexity (2019).
[29] Jianan Li, Jimei Yang, Aaron Hertzmann, Jianming Zhang, and Tianfu Wu. 2019. LayoutGAN: Generating graphic layouts with wireframe discriminators. arXiv preprint arXiv:1901.06767 (2019).
[30] Yang Li and Tsung-Hsiang Chang. 2016. Auto-completion for user interface design. US Patent 9,417,760.
[31] Sharon Lin, Daniel Ritchie, Matthew Fisher, and Pat Hanrahan. 2013. Probabilistic color-by-numbers: Suggesting pattern colorizations using factor graphs. ACM Transactions on Graphics (TOG) 32, 4 (2013), 1–12.
[32] Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. 2018. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 569–579.
[33] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
[34] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
[35] Scarlett R Miller and Brian P Bailey. 2014. Searching for inspiration: An in-depth look at designers example finding practices. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, Vol. 46407. American Society of Mechanical Engineers, V007T07A035.
[36] Kevin Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2018. Machine learning-based prototyping of graphical user interfaces for mobile apps. arXiv preprint arXiv:1802.02312 (2018).
[37] George Nagy, Thomas A Nartker, and Stephen V Rice. 1999. Optical character recognition: An illustrated guide to the frontier. In Document Recognition and Retrieval VII, Vol. 3967. International Society for Optics and Photonics, 58–69.
[38] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with REMAUI (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 248–259.
[39] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2011. Color compatibility from large datasets. In ACM SIGGRAPH 2011 Papers. 1–12.
[40] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2015. DesignScape: Design with interactive layout suggestions. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 1221–1224.
[41] Peter O'Donovan, Jānis Lībeks, Aseem Agarwala, and Aaron Hertzmann. 2014. Exploratory font selection using crowdsourced attributes. ACM Transactions on Graphics (TOG) 33, 4 (2014), 1–9.
[42] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2014. Learning layouts for single-page graphic designs. IEEE Transactions on Visualization and Computer Graphics 20, 8 (2014), 1200–1213.
[43] Daniel Ritchie, Ankita Arvind Kejriwal, and Scott R Klemmer. 2011. d.tour: Style-based exploration of design example galleries. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. 165–174.
[44] Moushumi Sharmin, Brian P Bailey, Cole Coats, and Kevin Hamilton. 2009. Understanding knowledge management practices for early design activity and its implications for reuse. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2367–2376.
[45] Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. (1948).
[46] Hao Su, Jia Deng, and Li Fei-Fei. 2012. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence.
[47] Amanda Swearngin, Mira Dontcheva, Wilmot Li, Joel Brandt, Morgan Dixon, and Andrew J Ko. 2018. Rewire: Interface design assistance from examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–12.
[48] Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology. ACM, 183–192.
[49] Young-Sun Yun, Jinman Jung, Seongbae Eun, Sun-Sup So, and Junyoung Heo. 2018. Detection of GUI elements on sketch images using object detector based on deep neural networks. In International Conference on Green and Human Information Technology. Springer, 86–90.
[50] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. 2019. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30, 11 (2019), 3212–3232.