Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud
Huaizheng Zhang, Yuanming Li, Qiming Ai, Yong Luo, Yonggang Wen, Yichao Jin, Nguyen Binh Duong Ta
Nanyang Technological University; Indeed Inc.; Singapore Management University
{huaizhen001,yli056,qai001,yluo,ygwen}@ntu.edu.sg, [email protected], [email protected]
ABSTRACT
Combining video streaming and online retailing (V2R) has been a growing trend recently. In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. The system consists of: 1) a back-end infrastructure providing optimized V2R-related services, including a data engine, model repository, model serving and content matching; and 2) an application layer that enables rapid V2R application prototyping. Hysia addresses industry and academic needs in large-scale multimedia by: 1) seamlessly integrating state-of-the-art libraries, including the NVIDIA video SDK, Facebook faiss, and gRPC; 2) efficiently utilizing GPU computation; and 3) allowing developers to bind new models easily to keep pace with rapidly changing deep learning (DL) techniques. On top of that, we implement an orchestrator for further optimizing DL model serving performance. Hysia has been released as an open source project on GitHub and has attracted considerable attention. We have published Hysia to DockerHub as an official image for seamless integration and deployment in current cloud environments.
KEYWORDS
Multimedia System, Video Analysis, Video Shopping, Advertising, Cloud Platform, Open Source Software
ACM Reference Format:
Huaizheng Zhang, Yuanming Li, Qiming Ai, Yong Luo, Yonggang Wen, Yichao Jin, and Nguyen Binh Duong Ta. 2018. Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud. In Woodstock '18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/1122445.1122456
1 INTRODUCTION

Recently, we have been witnessing the combined power of video streaming and e-commerce. Since online videos can reach millions of people, most companies have realized that they are the best showcase platforms for promoting products. Therefore, many applications have been developed to support this combination, including fashion product retrieval from videos [5] and contextual ads insertion [2]. We term them Video-to-Retail (V2R) applications. In V2R, the task is to analyze video content and
products, and to match them to each other so that companies can promote products efficiently while maintaining the video-watching experience of users.

Developing V2R applications is a non-trivial task. First, the data fed into such applications, such as videos, product ads and product descriptions, is multi-modal. Processing, fusing and aligning these data to understand them better requires much effort and is still very challenging [1]. Second, to match videos to products or vice versa in a non-intrusive way, accurate recognition and retrieval algorithms are needed. Third, processing speed is vital for maintaining a good user experience. A video usually contains hundreds of thousands of frames, and a product database may include thousands of items. How to efficiently process and match them remains an open problem.

To address these issues, representative works listed in Table 1 have considered two perspectives: the system perspective and the algorithm perspective. From the system perspective, for instance, Mei et al. [9] build a system that includes a pipeline to unify ads and video pre-processing for contextual ads insertion. In [5], the authors exploit video frame redundancy and index features into a kd-tree for fast clothing retrieval. From the algorithm perspective, quite a number of matching frameworks employing DL models have been proposed. For instance, in [4], the authors design a framework consisting of an image feature network, a video feature network and a similarity network to match clothing in videos to online shopping images. In [3], the authors use a set of models that include content understanding models to analyze user behavior and video tags for accurate video advertising.

There is still much work to be done to make developing fast and efficient V2R applications in various domains easier. First, existing systems focus on only one kind of V2R application, such as contextual video advertising (product-to-video) or retrieving products from videos (video-to-product), and neglect the similarities (i.e., data engineering, model processing and matching) between them. Thus, multimedia researchers have to go through all the infrastructure plumbing work and make duplicate efforts in the process. Second, current systems pay more attention to improving matching accuracy and do not address system optimization. Third, given that DL models are increasingly used to build V2R applications, how to deploy these models with ease has not been fully considered. Lastly, there has been no comprehensive open source V2R platform for non-experts with little machine learning (ML) knowledge, making it challenging for them to harness the power of AI.

To narrow these gaps, we develop Hysia, a fully open source and cloud-oriented framework that comprises widely used V2R models and optimized infrastructure services, including a data engine, model serving and content matching. It allows non-expert users to quickly make use of the built-in utilities to analyze V2R-related data, and expert users to build or evaluate new, high-performance V2R applications with ease.

Table 1: A comparison of Hysia and existing V2R-related works.
| V2R-related work     | Product-to-Video | Video-to-Product | System support | End-to-end | Model management | Model serving optimization | Web interface & API | Open source |
| VideoSense [9]       | ✓ | × | ✓ | ✓ | × | × | ✓ | × |
| CAVVA [12]           | ✓ | × | ✓ | ✓ | × | × | ✓ | × |
| Video eCommerce [3]  | × | ✓ | ✓ | ✓ | × | × | ✓ | × |
| Garcia et al. [5]    | × | ✓ | ✓ | ✓ | × | × | ✓ | ✓ |
| Video2Shop [4]       | × | ✓ | × | × | × | × | × | × |
| Madhok et al. [8]    | ✓ | × | × | × | × | × | × | × |
| Hysia (Ours)         | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Essential features in V2R, such as application management and new model binding, are also provided. Hysia can run in either virtual machines (VMs) or containers, making it easy to integrate into current cloud environments.

In Hysia, multimedia practitioners and researchers can focus on application design rather than writing repetitive code, with reference applications provided out of the box. We integrate industry-grade libraries to speed up data processing, including the NVIDIA video SDK for video pre-processing, Facebook faiss for searching, and gRPC for communication. Hysia is highly modular, allowing seamless integration of new modules. Though it has been designed for V2R, it can also be used as a multimedia toolbox for video analysis, audio recognition and so on.

We release Hysia as an open source project at https://github.com/cap-ntu/Video-to-Retail-Platform under the Apache 2.0 license. It has attracted attention and interest from many in the developer community. We also dockerize the system and publish it to DockerHub at https://hub.docker.com/r/hysia/hysia so that cloud users can install and run Hysia with ease.
2 SYSTEM DESIGN

In this section, we first present the architecture of Hysia, and then introduce the workflow for fulfilling V2R applications.
The system architecture is presented in Figure 1. In designing Hysia, we focus on the system's modularity and extensibility. As a cloud-oriented and end-to-end platform, it consists of two components: a back-end infrastructure and a front-end application layer.
Back-End Platform. In clouds, computing resources are abstracted via three main approaches, namely infrastructure as a service (IaaS), container as a service (CaaS), and serverless/function as a service (FaaS). Hysia core services can make use of either virtual machines (IaaS) or containers (CaaS). In addition, as serving ML models is stateless [13], it is also simple to deploy them using serverless functions (FaaS).

The ML model-related services in Hysia, namely the data engine, model repository, content matching and model serving, are encapsulated into a middleware, a form of ML-as-a-Service (MLaaS). The data engine is designed to reduce users' effort in preprocessing complex multi-modality data. The model repository manages the various ML models in Hysia. Model serving and content matching are designed to speed up data analysis by utilizing GPUs. The functions provided by these core services are exposed via APIs so that developers can easily extend our system.

Figure 1: Hysia architecture.
Front-End Application. Built on top of the back-end platform, the front-end application layer provides full support for four classes of users: 1) we provide well-designed APIs for model contributors to bind new V2R-related models, develop new V2R applications and extend Hysia's functionalities; 2) we provide a content analysis service for video providers so that they can mine videos to improve their commercial value; 3) a contextual advertising application is designed for advertisers to place ads at appropriate positions in videos; and 4) Hysia also has a video shopping service to help spectators buy products while watching videos. The built-in services and applications not only demonstrate the capability of our platform; they also provide reusable templates for researchers and practitioners to easily add more V2R plugins to Hysia to better serve their needs.
The workflow of our system is illustrated in Figure 2, which includes two phases: offline and online. In the offline phase, model contributors register their V2R-related models with Hysia and use the profiler to obtain their runtime performance. The profiling results are stored in a cache in the orchestrator, and the model weights are then persisted in the model repository.

In the online phase, a web interface is provided for end users (e.g., video providers and advertisers) to upload data (e.g., videos and ads) and to display final results. These data are first preprocessed by the data engine and transformed into formats acceptable to DL models. Meanwhile, the orchestrator sends the optimal batch size of a model to the data engine so that it can batch the formatted requests. The batches are then fed into the model server for further analysis. Finally, the predictions and data features output by the model server are sent to the content matching service to match videos to products or vice versa. We also implement a monitoring component to record the system status.

Figure 2: Workflow of a V2R application.
3 IMPLEMENTATION

In this section, we describe the implementation of Hysia, as illustrated in Figure 2.
Model Repository. Hysia stores ML models in a two-layer structure. It persists model information, such as the model name and the service description (e.g., product detection), in SQLite, a very lightweight database. This simple data structure makes it easy for users to replace the storage backend with their own database solutions. The model weight file, which is usually sizeable, is serialized and stored separately in a file system, and its file path is persisted in SQLite.
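To make the two-layer layout concrete, the following is a minimal sketch; the database file, table and column names are our own illustration rather than Hysia's actual schema. Metadata lives in SQLite, while the weight file is written to disk and only its path is recorded.

```python
import sqlite3
from pathlib import Path

DB_PATH = "hysia_models.db"          # hypothetical database file
WEIGHT_DIR = Path("model_weights")   # hypothetical weight directory

def init_store():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS models (
                name        TEXT PRIMARY KEY,
                service     TEXT,        -- e.g. 'product detection'
                weight_path TEXT         -- where the serialized weights live
            )
        """)

def register_model(name: str, service: str, weight_bytes: bytes) -> str:
    """Persist the sizeable weight file separately; keep only its path in SQLite."""
    WEIGHT_DIR.mkdir(exist_ok=True)
    weight_path = WEIGHT_DIR / f"{name}.bin"
    weight_path.write_bytes(weight_bytes)
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT OR REPLACE INTO models VALUES (?, ?, ?)",
            (name, service, str(weight_path)),
        )
    return str(weight_path)
```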
Model Profiler. This component receives ML models submitted by contributors and profiles them offline. Much research has shown that the batch size can significantly impact a model's latency and throughput when served; our experiments in Section 4.4 also demonstrate this clearly. Therefore, Hysia profiles models under different batch sizes to obtain the corresponding latency and throughput. The profiling information is stored in a cache in the orchestrator to help users choose the best batch size for a particular model.
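A profiling pass of this kind can be sketched as follows, assuming a PyTorch model and a single-sample input tensor; the resulting numbers would then be cached for the orchestrator.

```python
import time
import torch

def profile_model(model, sample, batch_sizes=(1, 2, 4, 8, 16, 32), runs=10):
    """Return {batch_size: (latency_ms, throughput_qps)} for a PyTorch model."""
    model.eval()
    results = {}
    with torch.no_grad():
        for bs in batch_sizes:
            # Tile the single sample (shape 1 x ...) into a batch of size bs.
            batch = sample.repeat(bs, *([1] * (sample.dim() - 1)))
            model(batch)  # warm-up run (e.g. CUDA kernel initialization)
            start = time.perf_counter()
            for _ in range(runs):
                model(batch)
            if batch.is_cuda:
                torch.cuda.synchronize()  # wait for queued GPU work before timing
            latency = (time.perf_counter() - start) / runs
            results[bs] = (latency * 1e3, bs / latency)
    return results
```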
Orchestrator. The orchestrator contains a cache, implemented with Redis, to store the model profiling information, and a batch size calculator for selecting an appropriate batch size for a model. Expert users of Hysia only need to specify the maximum acceptable latency for their applications, i.e., a latency SLO (service-level objective). The orchestrator then decides on an appropriate batch size and sends this value to the data engine.
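The batch size calculation itself can be as simple as the sketch below, which picks the largest profiled batch size whose latency still satisfies the SLO; the Redis key layout is our assumption, not Hysia's actual format.

```python
import json
import redis

r = redis.Redis()  # the orchestrator cache is Redis-backed

def best_batch_size(model_name: str, latency_slo_ms: float) -> int:
    """Largest profiled batch size whose latency stays within the SLO, else 1."""
    raw = r.get(f"profile:{model_name}")          # e.g. '{"1": 12.3, "8": 40.1, ...}'
    profile = json.loads(raw) if raw else {}
    feasible = [int(bs) for bs, lat_ms in profile.items() if lat_ms <= latency_slo_ms]
    return max(feasible) if feasible else 1
```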
Data Engine. The data engine implements a set of functions to pre-process multi-modality data, such as video, audio, product images and textual content. (1) Video: we employ the NVIDIA video SDK to implement the HysiaDecode component, which processes videos with GPUs. In addition to utilizing GPUs, HysiaDecode can also detect scene changes quickly by processing only one key frame per scene shot. (2) Audio: we separate the audio track from the video and save it as a file that can be processed by suitable audio models. (3) Image: we provide a resize and transform function to format original images so that they can be processed by existing TensorFlow or PyTorch models, as sketched below. (4) Text: we implement a function to convert subtitles into an ordinary text format, together with a set of text preprocessing utilities such as tokenization, so that the result can be fed into NLP models.
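For the image branch, a typical resize-and-transform step for PyTorch models could look like the following; the target size and normalization constants are illustrative (standard ImageNet-style values), not a fixed choice of Hysia.

```python
from PIL import Image
import torch
import torchvision.transforms as T

# ImageNet-style preprocessing; models trained differently would need their
# own input size and normalization statistics.
_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),                                   # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

def prepare_image(path: str) -> torch.Tensor:
    """Load an image file and return a 1xCxHxW tensor ready for a PyTorch model."""
    img = Image.open(path).convert("RGB")
    return _transform(img).unsqueeze(0)
```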
Figure 3: The built-in applications of Hysia: (a) video analysis, (b) ads insertion, (c) ads display, and (d) video shopping. Try them out yourself online at https://cap-ntu.github.io/hysia_mm_demo/.
Model Server. The model server is implemented using gRPC, which is widely used for building micro-services. It receives batched data from the data engine and employs models from the repository to analyze it. The model server outputs two kinds of results: predictions and intermediate features. The predictions are sent back to the data engine for display to users. The feature vectors are stored in the file system and, at the same time, sent to a subsequent module for matching.
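A stripped-down gRPC servicer of this shape is sketched below. The Predict RPC, the inference.proto service definition and the generated inference_pb2 / inference_pb2_grpc modules are hypothetical stand-ins, not Hysia's actual interface.

```python
from concurrent import futures

import grpc
# Assumed to be generated from a hypothetical inference.proto that defines an
# Inference service with a single Predict RPC over batched inputs.
import inference_pb2
import inference_pb2_grpc


class InferenceServicer(inference_pb2_grpc.InferenceServicer):
    def __init__(self, model):
        self.model = model  # a model loaded from the repository

    def Predict(self, request, context):
        # Run the batched request through the model; return both the
        # predictions (for display) and the features (for matching).
        predictions, features = self.model(request.batch)
        return inference_pb2.PredictReply(predictions=predictions,
                                          features=features)


def serve(model, port: int = 50051):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    inference_pb2_grpc.add_InferenceServicer_to_server(InferenceServicer(model), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()
```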
Matching. We implement this module to match products to videos or vice versa. Considerable optimization has been done in Hysia to improve matching efficiency. Specifically, we employ faiss [6] and load features into GPUs, so that the similarity comparison between features is accelerated to meet real-time latency requirements. In addition, to keep the system extensible, we provide APIs for experts to extend the module to accommodate their needs.
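The core of the GPU-accelerated matching reduces to a faiss nearest-neighbour search, roughly as follows; the feature dimension and the exact L2 index type are illustrative choices, not necessarily the ones Hysia uses.

```python
import numpy as np
import faiss

d = 512                                                        # feature dimension (illustrative)
product_feats = np.random.rand(100_000, d).astype("float32")   # placeholder product features

cpu_index = faiss.IndexFlatL2(d)                               # exact L2 search
gpu_index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, cpu_index)
gpu_index.add(product_feats)                                   # load features onto GPU 0

query = np.random.rand(1, d).astype("float32")                 # feature of the scene being watched
distances, product_ids = gpu_index.search(query, 10)           # top-10 nearest products
```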
Monitor. The monitor is implemented with a pub-sub structure in Redis to support V2R applications running on a distributed infrastructure. It periodically aggregates each worker's status, including CPU and GPU data and the resource usage of the executing models. A master worker is set up to collect monitoring data from all worker nodes, making it easy for users to locate system issues.
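A Redis pub-sub monitor of this kind can be sketched as below; the channel name and the use of psutil for host statistics are our assumptions for illustration.

```python
import json
import socket
import time

import psutil   # assumed helper for host CPU/memory statistics
import redis

CHANNEL = "hysia:monitor"    # hypothetical channel name
r = redis.Redis()

def worker_report(period_s: int = 5):
    """Worker side: periodically publish this node's resource usage."""
    while True:
        status = {
            "host": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(interval=None),
            "mem_percent": psutil.virtual_memory().percent,
            "timestamp": time.time(),
        }
        r.publish(CHANNEL, json.dumps(status))
        time.sleep(period_s)

def master_collect():
    """Master side: subscribe and aggregate status messages from all workers."""
    sub = r.pubsub()
    sub.subscribe(CHANNEL)
    for message in sub.listen():
        if message["type"] == "message":
            status = json.loads(message["data"])
            print(f"[{status['host']}] cpu={status['cpu_percent']}% mem={status['mem_percent']}%")
```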
4 DEMONSTRATION

Hysia incorporates a wide range of ML models, from scene recognition and object detection to celebrity recognition and audio recognition, for building comprehensive V2R applications. In this section, we describe two built-in reference applications, contextual advertising and video shopping, based on real-world scenarios. Then, we demonstrate how to bind new V2R models in Hysia. Finally, we present a quantitative evaluation of Hysia.

4.1 Contextual Advertising

Both content and ads providers can enjoy the convenience provided by Hysia. For instance, a content provider gets a whole TV show and needs to insert several ad images or videos at appropriate positions in the video. Hysia analyzes the uploaded video content as shown in Figure 3a. Then, advertisers can upload their ads to Hysia, and it will search for the top-5 relevant video clips. Users can then choose the most relevant one (Figure 3b). Here we leverage the human-in-the-loop factor, since real-world scenarios can be very complex: automatically inserting into the top-1 clip may negatively affect users' experience if the matching algorithm cannot capture new data distributions. Finally, Hysia allows both content and ads providers to verify the insertion results, as shown in Figure 3c.

4.2 Video Shopping

Spectators may choose to buy related products while watching videos. Hysia fulfills this need by providing a video shopping service. Since mobile video accounts for a significant portion of video traffic, we demonstrate a mobile application whose backend server is based on Hysia. As shown in Figure 3d, users can click on the screen, and Hysia will immediately search for products related to the scene that they are watching. The top-10 products are shown to users, who can then click on a product icon to navigate to the corresponding shopping page.
4.3 New Model Binding

In Hysia, model contributors can use the provided APIs to bind new V2R models. Hysia provides well-designed template configuration files and reference models. For instance, suppose a developer has trained a VQA model [11] on a new V2R-related dataset. The developer just needs to prepare a YAML file and an engine.py file, following Hysia's template. The model will then be containerized as a gRPC-based web service, and users can employ it in Hysia to analyze V2R-related data.
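For illustration, such an engine.py wrapper typically exposes a small load/predict interface that the platform can containerize behind gRPC; the class and method names below are hypothetical, not Hysia's actual template.

```python
# engine.py -- illustrative model-binding stub (names are hypothetical).
import torch

class VQAEngine:
    """Wraps a trained VQA model behind a uniform load/predict interface."""

    def __init__(self, weight_path: str, device: str = "cuda"):
        self.device = device
        # The YAML config referenced in the text would point Hysia at this
        # class and at the weight_path passed in here.
        self.model = torch.load(weight_path, map_location=device)
        self.model.eval()

    @torch.no_grad()
    def predict(self, frames, question: str):
        # Pre-process inputs, run the model, and return a serializable answer.
        inputs = self._preprocess(frames, question)
        return self.model(inputs)

    def _preprocess(self, frames, question: str):
        # Model-specific feature extraction / tokenization would go here.
        raise NotImplementedError
```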
4.4 Quantitative Evaluation

In this section, we evaluate Hysia's performance on the Stanford Online Products [10] and TVQA video [7] datasets with a DGX workstation equipped with NVIDIA V100 GPUs. The mobile demo application and the evaluation scripts are available at https://github.com/cap-ntu/hysia_mm_demo and https://github.com/cap-ntu/Video-to-Retail-Platform/tree/master/tests, respectively.

As shown in Figure 4, we have evaluated Hysia in four aspects: 1) the Hysia data engine is able to efficiently utilize GPUs to process videos at a speed of more than 1000 FPS, providing enough images for further analysis; 2) the key frame detection method can further improve video preprocessing speed, and a video with more scene shots benefits more; 3) as the batch size increases, the latency keeps increasing while the throughput increases initially and then decreases, which demonstrates the necessity of our model profiler and orchestrator for finding the right batch size; and 4) by integrating faiss, Hysia's matching module can search 100K product images in less than 4.5 ms, demonstrating the ability to support a real-time shopping experience for spectators.

Figure 4: System performance evaluation: (a) decoding throughput (FPS) on different CPUs and GPUs; (b) key frame speedup ratio (%) on different videos; (c) latency (ms) and throughput (QPS) of SSD-MobileNetV1 and Faster R-CNN ResNet101 under different batch sizes; (d) search time (ms) versus the number of indexed frames.

5 CONCLUSION

In this paper, we present Hysia, a cloud-based system for the development and deployment of V2R applications. The system is designed to support a wide range of users, from ML novices to experts. The former can leverage built-in applications for V2R data analysis, while the latter can utilize Hysia's optimized services for rapid V2R prototyping. We demonstrate Hysia's usability with three real-world scenarios, and its efficiency with quantitative performance measurements. Our development team is continuously maintaining and improving Hysia as an open-source project.
REFERENCES

[1] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 423–443.
[2] Xiang Chen, Tam V Nguyen, Zhiqi Shen, and Mohan Kankanhalli. 2019. LiveSense: Contextual Advertising in Live Streaming Videos. In Proceedings of the 27th ACM International Conference on Multimedia. 392–400.
[3] Zhi-Qi Cheng, Yang Liu, Xiao Wu, and Xian-Sheng Hua. 2016. Video eCommerce: Towards online video advertising. In Proceedings of the 24th ACM International Conference on Multimedia. 1365–1374.
[4] Zhi-Qi Cheng, Xiao Wu, Yang Liu, and Xian-Sheng Hua. 2017. Video2Shop: Exact matching clothes in videos to online shopping images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4048–4056.
[5] Noa Garcia and George Vogiatzis. 2017. Dress like a star: Retrieving fashion products from videos. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 2293–2299.
[6] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).
[7] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. 2018. TVQA: Localized, Compositional Video Question Answering. In EMNLP.
[8] Rishi Madhok, Shashank Mujumdar, Nitin Gupta, and Sameep Mehta. 2018. Semantic understanding for contextual in-video advertising. In Thirty-Second AAAI Conference on Artificial Intelligence.
[9] Tao Mei, Xian-Sheng Hua, Linjun Yang, and Shipeng Li. 2007. VideoSense: Towards effective online video advertising. In Proceedings of the 15th ACM International Conference on Multimedia. 1075–1084.
[10] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4004–4012.
[11] Amanpreet Singh, Vedanuj Goswami, Vivek Natarajan, Yu Jiang, Xinlei Chen, Meet Shah, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. 2018. Pythia: A platform for vision & language research. In SysML Workshop, NeurIPS, Vol. 2018.
[12] Karthik Yadati, Harish Katti, and Mohan Kankanhalli. 2013. CAVVA: Computational affective video-in-video advertising. IEEE Transactions on Multimedia 16, 1 (2013), 15–23.
[13] Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. 2019. MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19).