Tracking e-cigarette warning label compliance on Instagram with deep learning

Abstract

The U.S. Food & Drug Administration (FDA) requires that e-cigarette advertisements include a prominent warning label that reminds consumers that nicotine is addictive. However, the high volume of vaping-related posts on social media makes compliance auditing expensive and time-consuming, suggesting that an automated, scalable method is needed. We sought to develop and evaluate a deep learning system designed to automatically determine if an Instagram post promotes vaping, and if so, if an FDA-compliant warning label was included or if a non-compliant warning label was visible in the image. We compiled and labeled a dataset of 4,363 Instagram images, of which 44% were vaping-related, 3% contained FDA-compliant warning labels, and 4% contained non-compliant labels. Using a 20% test set for evaluation, we tested multiple neural network variations: image processing backbone model (Inceptionv3, ResNet50, EfficientNet), data augmentation, progressive layer unfreezing, output bias initialization designed for class imbalance, and multitask learning. Our final model achieved an area under the curve (AUC) and [accuracy] of 0.97 [92%] on vaping classification, 0.99 [99%] on FDA-compliant warning labels, and 0.94 [97%] on non-compliant warning labels. We conclude that deep learning models can effectively identify vaping posts on Instagram and track compliance with FDA warning label requirements.


Tracking e-cigarette warning label compliance on Instagram with deep learning

Chris J. Kennedy, Julia Vassey, Ho-Chun Herbert Chang, Jennifer B. Unger, Emilio Ferrara

February 10, 2021

The proportion of U.S. high school students who report using e-cigarettes (also known as vaping devices) declined in 2020 during the COVID-19 pandemic. In 2020, 19.6% of high school students (3.02 million) reported current e-cigarette use, compared to 27.5% (4.11 million) of high school students who reported using e-cigarettes in 2019 (Wang, Neff, et al. 2020). However, despite this recent decline, during 2019-2020 the use of youth-appealing, low-priced disposable e-cigarette devices increased approximately 1,000% (from 2.4% to 26.5%) among high school current e-cigarette users (ibid.). In addition, more than eight in 10 teenage e-cigarette users reported consuming flavored e-cigarettes (ibid.). E-cigarettes can harm the adolescent brain and increase susceptibility to tobacco addiction (Health et al. 2016; Fraga 2019; Wang, Gentzke, et al. 2019). As a result, youth who use e-cigarettes are more likely to subsequently try combustible cigarettes (The U.S. Food and Drug Administration 2018). Beyond addiction, e-cigarettes pose a risk of breathing difficulties, inflammatory reactions, lowered defense against pathogens, and lung diseases (Redfield 2019; The U.S. Food and Drug Administration 2019).

Exposure to visual posts featuring e-cigarette products on social media, including promotional images and videos, has been associated with increased e-cigarette use among U.S. adolescents (Wang, Gentzke, et al. 2019; King et al. 2016; Maloney et al. 2016; Pokhrel et al. 2018; Kim et al. 2019). Instagram, one of the most popular social media platforms among adolescents, is considered the largest source of e-cigarette social media advertisements (cite). E-cigarette stores, distributors, and social media influencers (users with large followings who post vaping content on behalf of e-cigarette brands) promote ENDS (Electronic Nicotine Delivery Systems) products on Instagram and other social media (Vassey et al. 2020).

In August 2018, the U.S. Food & Drug Administration (FDA) introduced a requirement that e-cigarette advertisements, including social media imagery, contain a prominent warning label that reminds consumers that nicotine is addictive. The FDA requires that the warning statement appear on the upper portion of the advertisement, occupy at least 20 percent of the advertisement area, and be printed in a conspicuous and legible sans serif font (e.g., Helvetica or Arial bold) of at least 12-point size (The U.S. Food and Drug Administration 2020). Several studies (Vassey et al. 2020; Laestadius et al. 2020) evaluated compliance with the 2018 FDA requirements for warning labels on Instagram. Vassey et al. (2020) manually reviewed 2,000 images posted in 2019 and found that only 7% included FDA-mandated warning statements. Posts uploaded from locations within the U.S. had the highest prevalence of warning labels, while posts uploaded from other countries were less likely to include warnings. Most of the international posts featured vaping products distributed in the U.S. and would therefore still be subject to compliance with the FDA warning-label regulations (Vassey et al. 2020). Laestadius et al. (2020) manually reviewed 1,000 posts collected in late 2018-early 2020 and found that only 13% included warning statements.

Both studies (Vassey et al. 2020; Laestadius et al. 2020) conducted qualitative analysis to assess the presence of warning statements on a small sample of Instagram posts and used binary classification (presence or absence of a warning statement) without reporting warning labels that were too small or in the wrong place, which would constitute partial compliance with the FDA requirements. This study addresses the limitation of prior research by developing an automated deep learning image classifier capable of quickly and accurately measuring compliance with the FDA requirements for warning labels on a large sample, i.e., thousands of images. We tested whether advanced deep learning techniques could provide an effective, automated method to track vaping-related posts on Instagram and evaluate compliance with FDA warning label requirements.

Automated analysis of social media content has exploded in popularity in recent years, with relevant comparable studies falling into two major groups: social influence and marketing focused, and health focused. Starting with social influence, which is broader and typically disregards domain, multimodal learning has been deployed in the past few years for predicting image popularity. Before the rise of deep learning, support vector machines were primarily used, leveraging basic image features such as color and saturation (Khosla et al. 2014). In recent years, the focus has turned to the use of deep learning to synthesize information and learn feature representations directly from the data. Hu et al. utilized both tags and images (J. Hu et al. 2016), leveraging the Yahoo Flickr Creative Commons 100M dataset. Zohourian and colleagues predicted the number of likes via context information (Zohourian et al. 2018). De et al. used standard deep neural networks (DNN) to predict the future popularity of the Instagram posts of a lifestyle magazine (De et al. 2017). Each of these cases demonstrates the inclusion of different features with regard to metadata, context, time, and social indicators.

However, by far the largest specific domain of interest is related to health. This can be further separated into (a) mental health and (b) public health. For instance, multimodal learning has been used to detect depression on Instagram, using a convolutional neural network that synthesizes the aforementioned temporal features (Chiu et al. 2021). A related, more general study integrates text and images to determine the intent of a post (Kruk et al. 2019), including the promotion of vaping.

Apart from mental health, researchers have also targeted illicit activities related to public health. Yang and Luo used inductive transfer learning, with multi-staged regression using post- and account-related data (Yang et al. 2017). The model was a standard DNN but with lower levels shared among different tasks, typical of shared learning tasks, and they found it more effective than prior approaches with fewer modalities. The potential for abuse of social media (Allem and Ferrara 2016; Allem, Ferrara, et al. 2017; Allem and Ferrara 2018), especially in the context of the COVID-19 pandemic (E. Chen et al. 2020; Young et al. 2021), has also driven the interest in multimodal machine learning. Mackey and colleagues studied suspicious COVID-19-related health products by combining natural language processing and deep learning to produce a binary classifier (Mackey et al. 2020). In addition to a useful temporal account of how products related to waves of infection, they demonstrated the efficacy of deep learning in identifying buying/selling intent.

The dataset consisted of 4,363 images collected from 3,484 distinct posts. Our prediction targets consisted of three binary labels: the presence of a correctly placed and sized fully compliant warning label, the presence of a non-fully-compliant warning label, and whether or not the post promoted vaping.

The dataset was randomly divided into a 60% training set, a 20% validation set for early stopping, and a 20% test set for evaluation of model performance. This splitting procedure was conducted at the post level rather than the image level, so that multiple images from the same post would all be assigned to the same split. Models were optimized to minimize cross-entropy loss on the binary labels, equivalent to negative log-likelihood loss in logistic regression, and were assessed for accuracy and area under the curve (AUC).
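The post-level split described above can be illustrated with a short sketch. This is not the authors' code: the function name `split_by_post`, the toy image and post identifiers, and the random seed are all hypothetical, and only the 60/20/20 proportions and the post-level grouping come from the text.

```python
import random
from collections import defaultdict


def split_by_post(image_to_post, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign images to train/val/test so that all images of a post share one split."""
    # Shuffle the distinct posts (not the images) to avoid leakage across splits.
    posts = sorted(set(image_to_post.values()))
    random.Random(seed).shuffle(posts)

    n_train = int(round(fractions[0] * len(posts)))
    n_val = int(round(fractions[1] * len(posts)))

    split_of_post = {}
    for i, post in enumerate(posts):
        if i < n_train:
            split_of_post[post] = "train"
        elif i < n_train + n_val:
            split_of_post[post] = "val"
        else:
            split_of_post[post] = "test"

    # Each image inherits the split of its parent post.
    splits = defaultdict(list)
    for image, post in image_to_post.items():
        splits[split_of_post[post]].append(image)
    return dict(splits)


# Hypothetical toy data: 10 posts, each with one to three images.
image_to_post = {
    f"img_{p}_{k}": f"post_{p}" for p in range(10) for k in range(1 + p % 3)
}
splits = split_by_post(image_to_post)
for name in ("train", "val", "test"):
    print(name, len(splits[name]), "images")
```

Because the proportions apply to posts rather than images, the image-level fractions will only approximate 60/20/20 when posts contain different numbers of images.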

Our baseline model was a transfer learning-based convolutional neural network (CNN) built in TensorFlow Keras (Chollet et al. 2018; Abadi et al. 2016). The output of the image backbone model was passed through a global averaging layer, a dropout of 40% or 50% was applied to mitigate overfitting, and the result was then passed to the output layer. The model was trained with the Adam optimizer (Kingma et al. 2014) using cross-entropy loss and a batch size of 32 where possible, or 16 when limited by GPU memory (EfficientNet B3).

We tested an EfficientNet model (Tan et al. 2019) as the primary image processing backbone with pretrained weights from ImageNet (Deng et al. 2009). To provide benchmark results, we compared EfficientNet to popular alternative architectures that have shown good performance in related research: VGG16 (Simonyan et al. 2014), ResNet50 (He et al. 2016), Inceptionv3 (Szegedy et al. 2016), and MobileNet (Howard et al. 2017).

In the bias initialization variation, the bias parameter for each output node was initialized to log(positives / negatives) to hasten training convergence with class imbalance (Karpathy 2019).

In the progressive unfreezing variation, the image backbone model was initially frozen and the output head was trained at a relatively higher learning rate to improve the randomly initialized weights, for up to 30 epochs with a patience of 2 epochs. Then the last 20% of layers of the image backbone were unfrozen and the model was trained for up to 30 additional epochs with a reduced learning rate and early stopping with a patience of 3 epochs. Finally, the entire image backbone was unfrozen and the model was trained at an even lower learning rate for up to 30 epochs, with a patience of 3 epochs.

The multitask design (Figure 1) was implemented by including three nodes with sigmoid activation in the output layer, one for each of our binary outcomes.

[Diagram: model inputs (image pixels) → feature extraction (CNN visual processing, EfficientNet) → dropout → multilabel output (compliant warning label, noncompliant warning label, vaping-related).]

Figure 1:

Image-only multitask model architecture.
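As a worked illustration of the bias initialization log(positives / negatives), the sketch below uses the approximate class prevalences reported in the abstract (44%, 3%, and 4%). The helper names are hypothetical and the calculation is a back-of-the-envelope check, not the authors' training code. It shows that a sigmoid output with this bias starts out predicting the base rate, and that the expected cross-entropy at initialization is then lower than with a zero bias.

```python
import math


def initial_bias(prevalence):
    """Log-odds of the positive class, i.e. log(positives / negatives)."""
    return math.log(prevalence / (1.0 - prevalence))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_cross_entropy(prevalence, predicted):
    """Expected binary cross-entropy when every example gets the same prediction."""
    return -(prevalence * math.log(predicted)
             + (1.0 - prevalence) * math.log(1.0 - predicted))


# Approximate prevalences from the abstract; the task names are illustrative.
prevalence = {"vaping": 0.44, "compliant_label": 0.03, "noncompliant_label": 0.04}

for task, p in prevalence.items():
    b = initial_bias(p)
    loss_zero_bias = expected_cross_entropy(p, 0.5)         # bias = 0 predicts 0.5
    loss_prior_bias = expected_cross_entropy(p, sigmoid(b))  # bias = log-odds predicts p
    print(f"{task}: bias={b:.2f}, start loss {loss_zero_bias:.3f} -> {loss_prior_bias:.3f}")
```

The gap between the two starting losses is largest for the rare warning label classes, which is consistent with the motivation given for this variation.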

Given that the warning label in an Instagram image consists entirely of text, we hypothesized that sufficiently accurate scene text recognition (STR) could increase the accuracy of our model. Scene text recognition is a variant of optical character recognition (OCR) designed to detect and extract arbitrary text in photographs with complex backgrounds, whereas OCR is designed to extract more standardized text from scanned documents or books (Long et al. 2021). We used the CRAFT method (Baek et al. 2019) to detect the location of text in the images, followed by a convolutional recurrent neural network (CRNN) model to recognize the specific text characters, which was pretrained on the MJSynth dataset (Jaderberg et al. 2014).
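To make concrete how the output of text detection and recognition could be checked against the FDA layout rules summarized earlier (upper portion of the advertisement, at least 20 percent of its area), the following sketch applies simple geometric tests to one detected text box. It is an illustration under stated assumptions rather than part of the study: the `TextBox` type, the interpretation of "upper portion" as the box center lying in the top half of the image, the keyword test for the warning text, and the example coordinates are all hypothetical. Font size and typeface are not checked.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A detected text region in pixel coordinates (origin at the top-left corner)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def check_warning_box(box, image_width, image_height, min_area_fraction=0.20):
    """Return simple layout checks for one candidate warning label box."""
    area_fraction = ((box.x1 - box.x0) * (box.y1 - box.y0)) / (image_width * image_height)
    center_y = (box.y0 + box.y1) / 2.0
    words = box.text.lower()
    return {
        # FDA rule from the text: at least 20 percent of the advertisement area.
        "large_enough": area_fraction >= min_area_fraction,
        # Assumption: "upper portion" is approximated as the top half of the image.
        "in_upper_portion": center_y < image_height / 2.0,
        # Assumption: a crude keyword proxy for the nicotine addiction warning.
        "mentions_warning": "nicotine" in words and "addictive" in words,
    }


def is_fully_compliant(checks):
    return all(checks.values())


# Hypothetical boxes on a 1080 x 1080 image.
top_banner = TextBox(0, 0, 1080, 220, "WARNING: This product contains nicotine. "
                                      "Nicotine is an addictive chemical.")
small_footer = TextBox(0, 1000, 1080, 1080, "contains nicotine")

print(is_fully_compliant(check_warning_box(top_banner, 1080, 1080)))
print(check_warning_box(small_footer, 1080, 1080))
```

A box that carries warning text but fails the size or placement test would correspond to the non-fully-compliant label category used in this study.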

Our final model achieved an area under the curve (AUC) and [accuracy] of 0.97 [92%] on vaping classification, 0.99 [99%] on FDA-compliant warning labels, and 0.94 [97%] on non-compliant warning labels, indicating that our sample size and modeling approach were sufficient to achieve excellent predictive accuracy.

We found that the image backbone architecture was the single most important factor affecting predictive performance. Looking at single-task results, we found that a VGG16 visual processing backbone led to an overall cross-entropy loss of 0.53. ResNet50 was comparable at 0.51 loss. By contrast, the simplest version of EfficientNet (B0) achieved a loss of 0.36. This ranking confirmed our expectations, mirroring the performance of these models in recent years' benchmarks.

We did not find a benefit from using higher-capacity, more complex versions of the EfficientNet architecture. Upgrading from the B0 model (224x224 resolution, 5.3M parameters) to the B3 model (300x300 resolution, 12M parameters) led to an effectively equivalent cross-entropy loss of 0.38. The B5 and B7 versions showed similar results. This may be a reflection of problem complexity, sample size limitations, or perhaps a suboptimal training strategy for the larger models. Given its computational efficiency, it was advantageous for our speed of experimentation that the lower-parameter B0 version yielded excellent performance.

Initializing the bias weights for each output node to account for class imbalance was beneficial for both lower cross-entropy loss and faster convergence, although it was a less important consideration than the base architecture, with a typical improvement of only 0.02 in cross-entropy.

Many variations that we tested did not show any noticeable performance benefit. A multitask architecture, learning rate finding, freezing of batchnorm layers, and progressive layer unfreezing all yielded no improvements, despite being considered best practices in the deep learning literature. These findings reinforce the need to verify that common training practices do show benefit for the particular dataset and task at hand.
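Because the warning label classes are rare (roughly 3% to 4% of images), accuracy and AUC can tell quite different stories, which may help when reading the paired figures above. The sketch below is a general illustration on made-up scores, not a reanalysis of the study's predictions: it computes accuracy at a 0.5 threshold and a rank-based AUC for a hypothetical 4% prevalence task.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 4 positives among 100 images (4% prevalence).
labels = [1] * 4 + [0] * 96

# An uninformative model that always outputs the base rate.
constant_scores = [0.04] * 100

# A made-up informative model that ranks most positives above the negatives.
informative_scores = [0.9, 0.8, 0.7, 0.2] + [0.1] * 95 + [0.3]

for name, scores in [("constant", constant_scores), ("informative", informative_scores)]:
    print(f"{name}: accuracy={accuracy(labels, scores):.2f}, AUC={auc(labels, scores):.3f}")
```

In this toy setting the constant model already reaches high accuracy by always predicting the majority class while its AUC stays at chance level, so for the rare label tasks the AUC is the more discriminating of the two reported metrics.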

Of the three tasks, the identification of FDA-compliant warning labels was easiest for the deep learning models in terms of area under the receiver operating characteristic curve (AUC), even when using only raw pixel information (i.e., no scene text extraction). That is reasonable because the guideline-compliant warning label is featured prominently in the image, making it straightforward for convolutional filters to detect. Determining if an image was promoting vaping proved to be the most difficult task; increasing our sample size and the variety of vaping and non-vaping images should be a straightforward way to improve model performance on that task.

The critical role of the image backbone in driving performance highlights the importance of using a recent backbone architecture when accuracy matters. The rapid innovation in computer vision leads to improved architectures on a yearly basis, or sometimes even more quickly. This also suggests that accuracy results should be interpreted as conditional on the selected backbone architecture, rather than representing a static, final result for a particular dataset. New innovations in computer vision architectures can be reasonably expected to reduce overfitting and increase predictive accuracy over time, even with the same training dataset.

We did not find any images with pseudo-warning text, which would have the initial appearance of being a compliant warning label but would actually contain non-warning language. Such adversarial strategies could lead to false positives in models that rely purely on convolutional features of raw pixels. Incorporation of scene text recognition that is localized to the warning label box would provide robustness to such adversarial strategies, and can be explored in future work. Image augmentation that includes the automatic generation of adversarial labels could also be used to ensure that models are not vulnerable to fake warning labels.

A key limitation of the current study is that we used a single test set for performance evaluation. In future work we plan to use 5-fold cross-validation, facilitating the construction of confidence intervals and eliminating the reliance on a single test set.

In future work we also plan to incorporate the extracted text from scene text recognition, which we have successfully implemented for this dataset, into the deep learning architecture. Given our excellent current results, there may not be much remaining benefit for these three tasks from adding the recognized scene text. Nevertheless, such extracted text could prove useful for extracting marketing themes, identifying promoted e-cigarette flavors or brands, and other more granular analysis. An example Instagram image with scene text recognition is shown in Figure 2. The network architecture for our expanded model with scene text recognition is shown in Figure 3.

(a) Original image (b) Scene text recognition visualized

Figure 2:

Example of scene text recognition in an Instagram image.

[Diagram: model inputs (image pixels; text extraction with scene text recognition, CRNN) → multimodal feature extraction (CNN visual processing, EfficientNet; textual processing, ALBERT) → concatenate → combined representation learning (dense layer(s), dropout) → multilabel output (compliant warning label, noncompliant warning label, vaping-related).]

Figure 3:

Image and scene text multitask model architecture.

Conclusion

In this study, we demonstrated that deep learning models can successfully identify FDA-compliant e-cigarette warning labels (or lack thereof) on Instagram with high accuracy. The models were further able to determine if a post promoted vaping or did not, and if it contained a non-compliant warning label. In combination, the models allow the automatic tracking of vaping-related posts on Instagram and evaluation of their compliance with FDA marketing policies. Through experimentation across several dimensions, we identified combinations of model architectures and training strategies that were most successful at achieving high accuracy. Future research using deep learning-based detection is warranted to evaluate compliance with the FDA warning statements on images and videos on Instagram and TikTok, the most popular social media platforms among teenagers. In particular, the incorporation of scene text recognition has the potential to yield additional granularity for establishing marketing themes and identifying specific brands and flavors.

Funding

The authors did not receive any funding for this research.

References

Abadi, Martin, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, et al. (2016). “Tensorflow: A system for large-scale machine learning”. In: USENIX symposium on operating systems design and implementation (OSDI), pp. 265–283.

Allem, Jon-Patrick and Emilio Ferrara (2016). “The importance of debiasing social media data to better understand e-cigarette-related attitudes and behaviors”. In: Journal of medical Internet research

American journal of public health

JMIR public health and surveillance

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9365–9374.

Chen, Emily, Kristina Lerman, and Emilio Ferrara (2020). “Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set”. In: JMIR Public Health and Surveillance

Journal of Intelligent Information Systems

Deep learning with Python. Vol. 361. Manning New York.

De, Shaunak, Abhishek Maity, Vritti Goel, Sanjay Shitole, and Avik Bhattacharya (2017). “Predicting the popularity of instagram posts for a lifestyle magazine using deep learning”. In: . IEEE, pp. 174–177.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009). “Imagenet: A large-scale hierarchical image database”. In: . IEEE, pp. 248–255.

Fraga, John-Anthony (2019). “The Dangers of Juuling”. In: .

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016). “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

Health, US Department of, Human Services, et al. (2016). “E-cigarette use among youth and young adults: A report of the Surgeon General”. In.

Howard, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam (2017). “Mobilenets: Efficient convolutional neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861.

Hu, Jiani, Toshihiko Yamasaki, and Kiyoharu Aizawa (2016). “Multimodal learning for image popularity prediction on social media”. In: . IEEE, pp. 1–2.

Jaderberg, Max, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman (2014). “Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition”. In: Workshop on Deep Learning, NIPS.

Karpathy, Andrej (2019). “A recipe for training neural networks”. In: Karpathy.github.io.

Khosla, Aditya, Atish Das Sarma, and Raffay Hamid (2014). “What makes an image popular?” In: Proceedings of the 23rd international conference on World wide web, pp. 867–876.

Kim, Minji, Lucy Popova, Bonnie Halpern-Felsher, and Pamela M Ling (2019). “Effects of e-cigarette advertisements on adolescents’ perceptions of cigarettes”. In: Health communication

Psychology of Addictive Behaviors

arXiv preprint arXiv:1412.6980.

Kruk, Julia, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran (2019). “Integrating text and image: Determining multimodal document intent in instagram posts”. In: arXiv preprint arXiv:1904.09073.

Laestadius, Linnea I, Megan M Wahl, Julia Vassey, and Young Ik Cho (2020). “Compliance with FDA nicotine warning statement provisions in e-liquid promotion posts on Instagram”. In: Nicotine & Tobacco Research.

Long, Shangbang, Xin He, and Cong Yao (2021). “Scene text detection and recognition: The deep learning era”. In: International Journal of Computer Vision

JMIR public health and surveillance

Health communication

Addictive behaviors 78, pp. 51–58.

Redfield, R (2019). CDC director’s statement on the first death related to the outbreak of severe lung disease in people who use e-cigarette or “vaping” devices. url: .

Simonyan, Karen and Andrew Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556.

Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna (2016). “Rethinking the inception architecture for computer vision”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.

Tan, Mingxing and Quoc Le (2019). “Efficientnet: Rethinking model scaling for convolutional neural networks”. In: International Conference on Machine Learning. PMLR, pp. 6105–6114.

The U.S. Food and Drug Administration (2018). “FDA launches new, comprehensive campaign to warn kids about the dangers of e-cigarette use as part of agency’s Youth Tobacco Prevention Plan, amid evidence of sharply rising use among kids”. In: url: .

– (2019). Understanding the Health Impact and Dangers of Smoke and ‘Vapor.’ url: .

– (Jan. 2020). Tobacco products: advertising and promotion guidelines. url: .

Vassey, Julia, Catherine Metayer, Chris J Kennedy, and Todd P Whitehead (2020). “ Frontiers in Communication 4, p. 75.

Wang, Teresa W, Andrea S Gentzke, MeLisa R Creamer, Karen A Cullen, Enver Holder-Hayes, et al. (2019). “Tobacco product use and associated factors among middle and high school students - United States, 2019”. In: MMWR Surveillance Summaries

Morbidity and Mortality Weekly Report

ACM Transactions on Intelligent Systems and Technology (TIST)

American journal of public health 0, e1–e6.

Zohourian, Alireza, Hedieh Sajedi, and Arefeh Yavary (2018). “Popularity prediction of images and videos on Instagram”. In: 2018 4th International Conference on Web Research (ICWR)