Risk & returns around FOMC press conferences: a novel perspective from computer vision
Alexis Marchal*†

January 7, 2021
I propose a new tool to characterize the resolution of uncertainty around FOMC press conferences. It relies on the construction of a measure capturing the level of discussion complexity between the Fed Chair and reporters during the Q&A sessions. I show that complex discussions are associated with higher equity returns and a drop in realized volatility. The method creates an attention score by quantifying how much the Chair needs to rely on reading internal documents to be able to answer a question. This is accomplished by building a novel dataset of video images of the press conferences and leveraging recent deep learning algorithms from computer vision. This alternative data provides new information on nonverbal communication that cannot be extracted from the widely analyzed FOMC transcripts. This paper can be seen as a proof of concept that certain videos contain valuable information for the study of financial markets.
Keywords:
FOMC, Machine learning, Computer vision, Alternative data, Video data, Asset pricing, Equity premium.
JEL Classification:
C45, C55, C80, E58, G12, G14.

* I am grateful to my supervisors Pierre Collin-Dufresne and Julien Hugonnier for their helpful feedback. I also thank Oksana Bashchenko and Philippe van der Beck for useful comments.
† EPFL and Swiss Finance Institute. Address: EPFL CDM SFI, EXTRA 129 (Extranef UNIL), Quartier UNIL-Dorigny, CH-1015 Lausanne, email: alexis.marchal@epfl.ch.

1. INTRODUCTION
Most central banks actively try to shape the expectations of market participants through forward guidance. Some of the main objectives are to impact the price of various securities, which in turn influences the financing cost of companies, or to reduce market volatility during turbulent times. Over the last few years, we have witnessed an explosion of research papers employing machine learning to analyze various documents produced by central banks. The goal is to measure quantitatively how they communicate. This is usually done by assigning a sentiment score (positive/negative) to the language employed by the bankers using Natural Language Processing (NLP) techniques.

The contribution of this paper is to provide a new method to characterize the complexity of the discussion between reporters and the Chair of the Fed. Instead of analyzing the text documents, I use the video recordings of the FOMC press conferences and introduce a measure of attention exploiting computer vision algorithms. This is based on the simple premise that complex questions from journalists are followed by complex answers from the Chair, which often create the need to consult internal documents in order to reply. The main idea is to differentiate between two questions asked by reporters, not by studying their text content, but rather by analyzing how the Chair behaves on the video when answering each question. This way, I am able to identify complex discussions by quantifying how often the Chair needs to look at internal documents. This is the key variable that video images are able to provide over other sources of data. I identify the events that involve more complex discussions and show that they have the largest (positive) impact on equity returns and reduce realized volatility. This highlights a mechanism of uncertainty resolution that works as follows.
Answers to complex questions resolve more uncertainty than answers to simple questions, and this ultimately impacts stock returns, volatility and the equity risk premium around the press conferences.

Macroeconomic announcement days have been substantially discussed in the asset pricing literature, which studies how much of the equity risk premium is earned around these events. Savor and Wilson (2013), Lucca and Moench (2015), Cieslak, Morse, and Vissing-Jorgensen (2019), and G. X. Hu et al. (2019) all find that a significant risk premium is earned around macroeconomic announcements. Ernst, Gilbert, and Hrdlicka (2019) argue that once sample selection and day-of-the-month fixed effects are accounted for, these days are not special and the risk premium is not that concentrated around macroeconomic announcement days. Regardless of the fraction of the equity premium that is earned on those days, some risk premium is earned around these events, and they reduce some uncertainty by disclosing important information to market participants. This alone makes these events an important object of interest for researchers. Together with Beckmeyer, Grunthaler, and Branger (2019) and Kurov, Wolfe, and Gilbert (2020), all of the above-mentioned papers revolve around studying the build-up and resolution of uncertainty around macroeconomic announcements. My addition with respect to this literature is to identify why some press conferences reduce more uncertainty than others. To this end, I compare stock returns on FOMC press conference days when reporters had a complex discussion with other days when the talks were arguably simpler according to a new measure of attention. This allows me to identify a channel through which the Fed reduces market uncertainty and affects asset prices: by discussing with financial reporters. This implies that the Chair reveals additional information during the Q&A sessions that is not redundant with the pre-written opening statements.
My findings are consistent with Kroencke, Schmeling, and Schrimpf (2018), who show that monetary policy affects the pricing of risk by identifying shocks to risky assets that are uncorrelated with changes in the risk-free rate (i.e. “FOMC risk shifts”).

This paper also contributes to the literature on machine learning methods used to analyze central bank communication. I quantify the degree of complexity of a discussion without relying on NLP techniques, hence avoiding their limitations. This new alternative dataset of videos allows me to analyze the press events from a new angle. Indeed, I investigate the same events but leverage computer vision to extract a different signal, namely the time spent by the Chair reading documents while answering questions. In other words, the NLP literature has focused on what is being said during the press conferences, while I focus on how it is being said. This is accomplished by exploiting the images of the conferences and scrutinizing human behavior. This information is not present in the transcripts, and I argue that it is valuable for financial markets. However, it is likely that the signal I construct using videos could be augmented by sentiment measures extracted from text data. This is why I view my method as complementary to what has been done in the NLP literature; the combination of both approaches is left for future research. Another interesting use of machine learning to analyze FOMC conferences is Gomez Cram and Grotteria (2020). Their dataset is closely related to mine in the sense that they also use the videos from FOMC press conferences, but they only analyze the audio in order to match sentences of the Chair with market reactions in real time.
In comparison, my paper is the first to use the images from these videos. Overall, I present a proof of concept that FOMC videos actually provide useful information for financial economists.

In accounting, papers like Elliott, Hodge, and Sedor (2012), Blankespoor, Hendricks, and Miller (2017) and Cade, Koonce, and Mendoza (2020) have used video data to analyze the effects of disclosure through videos. However, they do not use any systematic algorithm to extract the visual content, which makes their approaches hardly scalable. Some authors, like Akansu et al. (2017), Gong, Zhang, and Jia (2019) or A. Hu and Ma (2020), use machines to process the videos but focus on extracting emotions either from CEOs or entrepreneurs. None of their methods are suited to analyzing FOMC press conferences because central bankers make an effort to appear as neutral as possible when they speak. In contrast to this literature, I develop a simpler tool that systematically computes the reading time of a person appearing in a video. This fully objective measure does not rely on emotions at all.

The rest of the paper is organized as follows. Section 2 establishes the methodology to construct the attention measure. Section 3 presents the main results, and section 4 concludes. A technical discussion on computer vision algorithms can be found in appendix A.

(One common drawback of NLP methods in finance/economics is the need to create a dictionary of positive and negative words. The choice of which words belong to which set is somewhat subjective. Another problem with more advanced methods is the necessity to label the data, and the labels might have to be chosen by the researcher.)

2. DATASET AND METHODOLOGY
I use the video of each FOMC press conference (available on the Fed website) from their start in April 2011 to September 2020. The market data consists of the time series of the S&P 500 index sampled at a 1-minute frequency. Each press conference can be decomposed into two parts. (i) The first is an introductory statement in which the Chair reads a pre-written speech detailing the recent decisions of the Fed. (ii) The second is a Q&A session between financial reporters and the Fed Chair. I focus solely on the Q&A for the following reasons. Most of the literature analyzing press conferences has focused on the first part (with a few rare exceptions), even though the Q&A occupies around 82% of the time of the press conference. Moreover, the unprepared character of the Q&A session means that the behavior of the Chair when answering questions (for instance, whether he needs to read documents to answer or not) does bring valuable information that has never been analyzed. Indeed, the Q&A is spontaneous and the Chair did not prepare answers to the reporters’ questions. Using this data, the main problems I try to solve are:

H1:
How can we measure the complexity of a question and its associated answer?
H2:
Do complex discussions contribute more to reducing uncertainty?

In order to answer these questions, I need to characterize the content of the press conferences. As previously explained, the existing literature has done so by assigning a sentiment score to the verbal content, combining text transcripts with some NLP algorithm. The new idea in my paper is to characterize a discussion between a reporter and the Chair of the FOMC, not by analyzing the language, but rather by considering how the Chair reacts after being asked a question. To this end, I focus on the following dimension: does the Chair reply directly, or does he read some internal documents in order to provide an answer? This information is available in the videos provided by the Fed, but it needs to be extracted and converted into a numerical quantity that can serve as input for statistical inference tools. This is done by employing various computer vision algorithms that are new in finance but have been applied for years to solve engineering problems. In this paper, I focus on the economic mechanisms and the value of the information that can be extracted from this alternative data. Therefore I will keep the discussion of the methodology at a high (non-technical) level and invite the reader to consult appendix A for more details. The need for a technical discussion on computer vision can be (partially) avoided because every image-processing step can easily be illustrated: I will simply present the result of every computation visually, by showing an image and the kind of information I extract from it. Given that a video is nothing but a collection of still images, I will use these two words interchangeably.

The first step is to construct facial landmarks l ∈ R^2, which are certain key points on a human face used to localize specific regions like the eyes, the mouth, the jaw, etc. They can be visualized in figure A.1 in the appendix.
In this paper, they will help me track certain movements of the Fed Chair during the press conferences when he is answering a question. Basically, I want to know every time the Chair is looking at some documents.

(I remove the conference of March 15, 2020 simply because there is no video available, only audio.)

[Figure 2.3: Convex hulls created by the eye landmarks and the associated eye aspect ratios (EARs). Panel (a): EAR when the Chair is not reading. Panel (b): EAR when the Chair is reading a document.]

It is natural to wonder what type of questions will cause some reading by the Chair. To clarify this, I report below a comparison of two questions asked by reporters. They are copied from the transcript of the press conference of September 21, 2016.

Q*: Question from a reporter that does not lead to the consultation of internal documents by the Chair: “Chair Yellen, at a time when the public is losing faith in many institutions, did the FOMC discuss the importance of today as an opportunity to dispel the thinking that the Fed is politically compromised or beholden to markets?”

Q':
Question from a reporter that does trigger substantial reading from the Chair: “Madam Chair, critics of the Federal Reserve have said that you look for any excuse not to hike, that the goalposts constantly move. And it looks, indeed, like there are new goalposts now when you say looking for further evidence and-and you suggest that it’s evidence that labor-labor market slack is being taken up. Could you explain what for the time being means, in terms of a time frame, and what that further evidence you would look for in order to hike interest rates? And also, this notion that the goalposts seem to move, and that you’ve indeed introduced a new goalpost with this statement.”

The whole idea of this paper is to differentiate between Q* and Q', not by studying the text content, but by analyzing how the Chair behaves when answering each question. The question Q' will be associated with a complex discussion because my measure of attention, the EAR, will be low due to the reading by the Chair. On the other hand, the EAR stays relatively high when Janet Yellen answers question Q*. For simplicity, I focus solely on where the Chair looks while answering reporters’ questions. More sophisticated measures, incorporating extra facial landmarks on top of the ones locating the eyes, could produce a more precise signal.

So far, for each press conference i I have a time series of EAR. In order to compare the macroeconomic announcements, I summarize the time-series information into a variable Λ_i that takes a single value per conference. This is done by integrating the EAR over time, keeping only the values below some threshold c in order to approximate the total time spent looking at internal documents. The attention measure is therefore defined as

Λ_i = ∫_0^{T_i} EAR_{i,t} · 1{EAR_{i,t} < c} dt    (2.3)

where T_i is the time at which press conference i finishes. The constant c helps discriminate when the EAR is low enough for the person to be classified as looking down.
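The paper's exact EAR formula is given in equation (2.1), which is not reproduced here; a common variant (due to Soukupová and Čech) uses six landmarks per eye, and equation (2.3) becomes a thresholded Riemann sum once the EAR is sampled frame by frame. The sketch below uses that standard formula and hypothetical numbers purely for illustration:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR from six eye landmarks (x, y) pairs, using the common
    Soukupova-Cech formula; the paper's exact eq. (2.1) may differ."""
    eye = np.asarray(eye, dtype=float)
    a = np.linalg.norm(eye[1] - eye[5])  # first vertical distance
    b = np.linalg.norm(eye[2] - eye[4])  # second vertical distance
    d = np.linalg.norm(eye[0] - eye[3])  # horizontal distance
    return (a + b) / (2.0 * d)

def attention_measure(ear_series, dt, c):
    """Discretized version of eq. (2.3): sum the EAR over the frames
    where it falls below the threshold c (Chair looking down)."""
    ear = np.asarray(ear_series, dtype=float)
    return float(np.sum(ear[ear < c]) * dt)

# Toy EAR path: the Chair 'looks down' (low EAR) mid-way through the Q&A.
ear_series = [0.30, 0.31, 0.12, 0.11, 0.13, 0.29]
lam = attention_measure(ear_series, dt=1.0, c=0.2)
print(round(lam, 2))  # only the three sub-threshold values contribute
```

A wide-open eye gives an EAR around 0.3, while a closed or downward-looking eye pushes it toward zero, which is what makes the threshold rule in (2.3) workable.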
The variable Λ_i aggregates all the necessary information by measuring how much the Chair looked at his documents in a given press conference. The interpretation of Λ_i is as follows. If the value is small, the Chair did not spend much time looking at his documents during conference i. If, on the other hand, the value is large, the Chair spent a significant amount of time looking down. I argue (and show later) that Λ is directly proportional to the quantity of uncertainty that is resolved during a Q&A session. Indeed, the Chair is more likely to look at documents when providing a complex answer. This in turn provides more relevant information to the market and thus reduces uncertainty. It is worth emphasizing that the spontaneity of the questions and answers is important for this analysis. Had the speaker received the questions in advance, this methodology would probably not work.

To conclude the methodology section, it is important to notice that even though I use machine learning methods to extract facial landmarks, the analysis is totally transparent. I would obtain approximately the same variables and results if I had personally watched all the press conferences with great attention and manually timed whenever the Chair was paying attention to the documents in his possession. Machine learning is used only to automate this procedure. The variable extracted from the computer vision algorithm, EAR_{i,t}, is easily interpretable as an attention measure (i.e. where the person is looking). In the next sections, I will use this data as an input in a linear regression in order to explain the behavior of returns and uncertainty around the FOMC press conferences.
3. MAIN RESULTS
In this section I explore how the attention measure of the Chair, Λ, can explain equity returns and quantify the uncertainty that is resolved by the conference. It is based on the premise that Q&A sessions involving complex discussions (associated with higher values of Λ_i) will reduce uncertainty further for financial markets, leading to higher stock returns and lower volatility. Before proceeding, it is worth noticing that the level of Λ is not stationary, as illustrated in figure 3.1a. This is the reason I define and work with a new variable

∆λ_i = λ_i − λ_{i−1}    (3.1)

where λ_i = log(Λ_i). The new object ∆λ is stationary (see figure 3.1b).

I define the log return of the S&P 500 during the Q&A session and after the conference as

r^d_i = log(P_{i,τ1} / P_{i,τ0}),    (3.2)

r^a_i = log(P_{i,τ2} / P_{i,τ1}),    (3.3)

where τ0 and τ1 denote the beginning and the end of the Q&A session of conference i, and τ2 the end of the post-conference window. I then run a simple linear regression

r^d_i = α + β ∆λ_i + ε_i    (3.4)

of the returns during the Q&A session onto the change in the discussion-complexity measure. For comparison, I also run similar univariate regressions using the log difference of the benchmark variables. The results are reported in table 3.1. The first observation is that all the betas are positive and of the same order of magnitude. The fact that all variables agree on the sign of the effect is reassuring, since they all measure the same quantity to some extent. The two most significant variables are ∆λ and the duration of the speech of the Chair (with p-values below 1%). Interestingly, the variable constructed using the video data, ∆λ, is “better” at explaining equity returns in the sense that its associated t-stat is the highest (around 5) and it has by far the highest R², at 28.6%. This is not surprising, since the duration of the Chair's speech simply measures how long the Chair spoke during the Q&A session, disregarding everything else, while Λ is also proportional to the length of the speech but contains the additional information that the Chair was paying close attention to important documents while answering questions.
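The univariate regression in (3.4) can be sketched as below. The data here are synthetic (the paper's ∆λ series and returns are not reproduced), and `ols_univariate` is a minimal hypothetical helper; in practice one would use a statistics package:

```python
import numpy as np

def ols_univariate(x, y):
    """OLS of y on [1, x]; returns intercept, slope and the slope's t-stat.
    A minimal stand-in for r_d = alpha + beta * dlam + eps, eq. (3.4)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    return beta[0], beta[1], beta[1] / np.sqrt(cov[1, 1])

# Synthetic illustration (NOT the paper's data): 44 conferences, returns
# that load positively on the change in discussion complexity.
rng = np.random.default_rng(0)
dlam = rng.normal(size=44)
r_d = 0.01 * dlam + rng.normal(scale=0.01, size=44)
alpha, beta, t = ols_univariate(dlam, r_d)
print(beta > 0, t > 2)
```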
Hence Λ gauges the intricacy of a Q&A session, which is arguably difficult to capture using NLP techniques without making subjective choices (like choosing a dictionary). An interesting consequence of these results is that the Chair reveals additional information during the Q&A sessions that is not present in the pre-written opening statements.

I also run a regression similar to (3.4), but with the return after the conference, r^a, as the dependent variable, and find no significant coefficients. This means that the information is incorporated into the stock price immediately, over the course of the Q&A.

IMPACT ON UNCERTAINTY
Table 3.1: Explaining contemporaneous stock returns r^d

                              (1)        (2)        (3)        (4)
const                      -0.000     -0.000     -0.000     -0.000
                           (0.001)    (0.001)    (0.001)    (0.001)
∆λ                            ***
                           (0.001)
∆ Duration Q&A                                    0.008**
                                                 (0.004)
∆ Duration speech Chair                                      0.006***
                                                            (0.002)
Observations                   44         44         44         44

Note: * p < 0.1, ** p < 0.05, *** p < 0.01. This table reports the regression statistics for four different models, each analyzing the explanatory power of a different covariate. The dependent variable r^d is the log return of the S&P 500 during the Q&A session (from its beginning until its end). Standard errors are in parentheses.

The previous section showed that when the change in λ is high, stock returns tend to be higher. Given that this variable is independent from any sentiment measure, it is natural to expect that the positive returns are caused by a reduction in uncertainty. This is what I argue in this section by showing that ∆λ is negatively correlated with stock market volatility. For this purpose, I simply compute the realized volatilities before and after each press conference, denoted by σ^b_i and σ^a_i respectively. The timeline is illustrated in figure 3.2. The formal construction is

σ^b_i = sqrt( Σ_{τ ∈ S(τ0,τ1)} r²_{i,τ} ),    (3.5)

σ^a_i = sqrt( Σ_{τ ∈ S(τ2,τ3)} r²_{i,τ} )    (3.6)

where S(t, t') is defined as the set containing all the returns between times t and t' included. Equipped with this measure of the change in market uncertainty, I run the main regression of interest

σ^a_i − σ^b_i = α + β ∆λ_i + ε_i.    (3.7)

Again, for comparison purposes, I also replace the regressor with the log difference of each of the benchmark variables and report the results in table 3.2. The two most significant variables (with p-values below 5%) are the same as in the previous section: ∆λ and the duration of the Chair’s speech. They again both agree on the sign of the effect, in the sense that complex discussions reduce market volatility.
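The realized-volatility construction in (3.5)-(3.6) amounts to a square root of the sum of squared 1-minute log returns over each window. A small sketch with hypothetical returns:

```python
import numpy as np

def realized_vol(returns):
    """Realized volatility over a window: square root of the sum of
    squared 1-minute log returns, as in eqs. (3.5)-(3.6)."""
    r = np.asarray(returns, dtype=float)
    return float(np.sqrt(np.sum(r ** 2)))

# Hypothetical 1-min returns: the market calms down after the Q&A,
# so the change sigma_a - sigma_b used in eq. (3.7) is negative.
r_before = [0.002, -0.003, 0.004, -0.002]
r_after = [0.001, -0.001, 0.001, -0.001]
sigma_b = realized_vol(r_before)
sigma_a = realized_vol(r_after)
print(sigma_a - sigma_b < 0)  # True
```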
And consistent with the previous results, ∆λ is the most precise signal, in the sense that its associated t-stat is the highest (in absolute value) at 2.3. The R² is also the highest, at around 10.9%.

Table 3.2: Explaining the change in volatility before and after the Q&A (σ^a − σ^b)

                              (1)        (2)        (3)        (4)
∆λ                         -0.014**
                           (0.006)
∆ Duration Q&A                                   -0.016
                                                 (0.019)
∆ Duration speech Chair                                     -0.019**
                                                            (0.009)
Observations                   44         44         44         44

Note: * p < 0.1, ** p < 0.05, *** p < 0.01. This table reports the regression statistics for four different models, each analyzing the explanatory power of a different covariate. The dependent variable (σ^a − σ^b) is the difference in volatility (realized variation) of the S&P 500 before and after the press conference. The time windows over which the realized variation is measured are illustrated in the timeline of figure 3.2. Standard errors are in parentheses.

4. CONCLUSION
This paper develops a new measure of discussion complexity between the Fed Chair and reporters during the Q&A sessions of FOMC press conferences. This is accomplished by analyzing a new dataset of videos and taking advantage of tools from computer vision in order to measure how often the Chair needs to consult internal documents when answering questions. This variable is then shown to explain both contemporaneous equity returns and the change in volatility before and after the conference. On average, complex discussions lead to higher returns and lower volatility. This is consistent with recent findings in the literature that central banks impact the pricing of risk. My work pins down a new mechanism through which press conferences impact the expectations of market participants. A by-product of this result is that additional information is revealed during the Q&A sessions that is not redundant with the opening statements. I am currently working on incorporating more data into the analysis by including the videos of the press conferences of the European Central Bank.

In general, my methodology is not constrained to macroeconomic events and can be useful for analyzing the nonverbal communication of CEOs or politicians, for instance.
REFERENCES

Akansu, Ali, et al. (2017). “Firm Performance in the Face of Fear: How CEO Moods Affect Firm Performance”. Journal of Behavioral Finance.
Blankespoor, Elizabeth, Bradley E. Hendricks, and Gregory S. Miller (2017). “Perceptions and Price: Evidence from CEO Presentations at IPO Roadshows”. Journal of Accounting Research.
Cade, Nicole L., Lisa Koonce, and Kim I. Mendoza (2020). Review of Accounting Studies.
Cieslak, Anna, Adair Morse, and Annette Vissing-Jorgensen (2019). “Stock Returns over the FOMC Cycle”. Journal of Finance.
Elliott, W. Brooke, Frank D. Hodge, and Lisa M. Sedor (2012). The Accounting Review.
Ernst, Rory, Thomas Gilbert, and Christopher Hrdlicka (2019). SSRN Electronic Journal.
Gomez Cram, Roberto, and Marco Grotteria (2020). “Real-time Price Discovery via Verbal Communication: Method and Application to Fedspeak”. SSRN Electronic Journal, pp. 1–44.
Gong, Mijia, Zhe Zhang, and Ming Jia (2019). “Lie Detectors? How Entrepreneurs’ Facial Expressions During IPO Roadshow Presentations Predict New Venture Misconduct Behaviors”. IEEE Transactions on Engineering Management, pp. 1–12.
Hu, Allen, and Song Ma (2020). “Persuading Investors: A Video-Based Study”.
Hu, Grace Xing, et al. (2019). “Premium for Heightened Uncertainty: Solving the FOMC Puzzle”. SSRN Electronic Journal.
Kazemi, Vahid, and Josephine Sullivan (2014). “One millisecond face alignment with an ensemble of regression trees”. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1867–1874.
Kroencke, Tim Alexander, Maik Schmeling, and Andreas Schrimpf (2018). “The FOMC Risk Shift”. SSRN Electronic Journal.
Kurov, Alexander, Marketa Halova Wolfe, and Thomas Gilbert (2020). “The disappearing pre-FOMC announcement drift”. Finance Research Letters.
Lucca, David O., and Emanuel Moench (2015). “The Pre-FOMC Announcement Drift”. Journal of Finance.
Rosebrock, Adrian (2017). Deep Learning for Computer Vision with Python. PyImageSearch.
Savor, Pavel, and Mungo Wilson (2013). “How Much Do Investors Care About Macroeconomic Risk? Evidence from Scheduled Economic Announcements”. Journal of Financial and Quantitative Analysis.
APPENDIX A

This appendix provides a brief overview of the machine learning techniques used to compute the eye aspect ratio (EAR) defined in equation (2.1). This paper uses videos as data; however, a video is simply a collection of still images indexed by time, and therefore I will always talk about images when referring to the variables used as inputs. Each image I (also called a frame) can be represented numerically by a tensor of dimension D = x × y × 3. The x and y dimensions respectively correspond to the length and height of the image, each pixel being an entry in the x × y matrix that corresponds to its horizontal and vertical location. The 3 represents the three primitive components of any color.

Extracting the behavior of only a subset of the people (the Chairs of the FOMC) appearing in the videos requires using a series of different algorithms, one after another. They are all described below in the order used to process the data. My paper employs popular computer vision algorithms; more details can be found, for instance, in the book Rosebrock (2017) or the associated online blog.

A.1. IDENTITY DETECTION VIA DEEP METRIC LEARNING
The first task when processing the data is to apply a face recognition algorithm, meaning that for every single image, I want to know whether the Chair of the Fed appears in it or not. For that I need a method that takes a frame I as input and outputs the identity of the person in it. A powerful tool available is known as deep metric learning.

Let us first begin with a simplified example. Suppose you want to perform a classification of human pictures, that is, you want to figure out whether there is a human in a given picture or not (without being interested in the identity of the person). This is a simpler task than my original goal, but I will build on it later. When one wants to classify labelled images, it is common to train a neural network (typically a convolutional network) that accepts an image as input and outputs a scalar value. For instance, the output could be 1 if there is a human face in the picture and 0 otherwise. However, this classification is too coarse for me because financial reporters also appear in the FOMC press conference videos and I am not interested in their behavior. This is why, instead of using this algorithm, I use a slight modification that helps me filter out the images that do not contain the Fed Chair.

Deep metric learning is different in the sense that the output is a vector e ∈ R^n of embeddings, where n is the number of points used to characterize the human face. Formally, our neural network is a non-linear function f(· | θ) parametrized by θ that takes as input a still image I ∈ R^D and outputs a vector e:

f(I | θ) = e.    (A.1)

(To simplify the explanations I will assume that in each picture there is only one face. The methods in this appendix do not need this assumption and I do not use it, since it does not hold for my dataset. It is standard to use n = 128.)

The vector e describes the face in the picture I in a mathematical way. This step is also called encoding the face into a vector.
Training the network boils down to making sure the output vectors are close to each other when two pictures of the same person are used as inputs, and far apart when the persons are different. Suppose that we have two different images I_1 and I_2 that both contain the same person. We want to train the network such that the associated embedding vectors e_1 and e_2 are “close”. If, on the other hand, we had two images of different persons, we would like the output vectors to be “far” from each other. In other words, the goal is to twist the network parameters θ such that two pictures of the same person are classified as having an identical face. For instance, any twenty different pictures of the same person should give approximately the same output vector e. In order to perform the training, I need to build a new database (different from the FOMC press conference videos) of M pictures including multiple images of each person I want to detect. Each image I in this set is indexed by m ∈ {1, 2, ..., M}. The label is simply the identity of the person in the picture (in my case, the name of the Fed Chair: Ben Bernanke, Janet Yellen or Jerome Powell). It also helps to include pictures of random people in order to increase the performance of the neural network.

At this stage, one could wonder how the network embeds a face into a real-valued vector e. For humans, it seems natural to compare features like the shape of the eyes, the mouth, the jaw, or the length of the nose in order to differentiate people. However, at this step of the process, I do not instruct the machine how to describe a face. I do not specifically constrain it to compare the attributes that seem natural to us humans. The network learns by itself which characteristics it should pick to properly encode a face.
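The "same person close, different persons far" objective is commonly implemented with a triplet loss (as in FaceNet-style training); the paper does not spell out its exact loss, so the following is an illustrative sketch of the loss computation on given embeddings, not the paper's training procedure:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets where the anchor-positive distance is not smaller
    than the anchor-negative distance by at least `margin`. Anchor and
    positive are embeddings of the same person; negative is a different
    person. The margin value here is illustrative."""
    d_pos = np.linalg.norm(np.asarray(anchor, float) - np.asarray(positive, float))
    d_neg = np.linalg.norm(np.asarray(anchor, float) - np.asarray(negative, float))
    return float(max(d_pos - d_neg + margin, 0.0))

# Embeddings that already satisfy the objective incur zero loss...
print(triplet_loss([0.0, 0.0], [0.05, 0.0], [1.0, 0.0]))  # 0.0
# ...while a negative sitting too close to the anchor is penalized.
print(triplet_loss([0.0, 0.0], [0.5, 0.0], [0.4, 0.0]) > 0)  # True
```

Minimizing this loss over many triplets is what pushes the parameters θ toward embeddings with the geometry described above.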
Later, for another task (extracting facial landmarks), I will ask the computer to detect features that are familiar to humans (I will especially be interested in the movement of the eyes).

Once the network is trained and has learnt to properly associate an output vector e to a specific person’s face, I generate an embedding vector e_m for every labelled image I_m in my database. Every single face in my database then has an associated mathematical representation that “encodes” it in a vector e. It will serve later to compare the new unknown faces we want to classify against the labelled images in the database, to tell how close they are to each other. (In order to save time, I use a pre-trained network that already knows how to encode human faces. I only re-train it on very few images to make sure that it is calibrated for my specific task.)

We are now ready to extract the identity of people in the new data. Let I' denote a picture the network has never seen before. By using the mapping

f(I' | θ) = e'    (A.2)

we can characterize the face present in I' by its encoding e'. To find out the identity of the person in this new picture, I use a simple k-NN model (with votes) to make the decision. I measure the distance between e' and all the encodings associated with the known faces in the database, e_1, e_2, ..., e_M. One way to do this is by using the L2 norm

d_m = ‖e' − e_m‖    (A.3)

where d_m is the distance between the unknown face in image I' and the known face in image I_m. If this distance is small enough (below some tolerance level ε), I conclude that both pictures contain the same person. The problem here is that the distance could be small for multiple pictures, not necessarily of the same person (if the encoding is not good for some reason). This is where the voting mechanism enters.
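The distance test in (A.3) and the vote counting can be sketched together as follows. The 2-d "embeddings", the tolerance value and the helper name are all hypothetical (real face embeddings would be 128-dimensional):

```python
import numpy as np
from collections import Counter

def identify(e_new, db_encodings, db_labels, eps=0.6):
    """Identity by vote counting: compute the L2 distance between the new
    encoding and every labelled encoding (eq. A.3), cast a vote whenever
    the distance is below the tolerance eps, and return the label with the
    most votes (None when nobody matches). eps=0.6 is a common default for
    face embeddings, not a value taken from the paper."""
    dists = np.linalg.norm(np.asarray(db_encodings, float) - np.asarray(e_new, float), axis=1)
    votes = Counter(lbl for lbl, d in zip(db_labels, dists) if d < eps)
    return votes.most_common(1)[0][0] if votes else None

# Toy database: three 'Bernanke' encodings clustered near the origin,
# two 'Powell' encodings clustered elsewhere.
db = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
      [2.0, 2.0], [2.1, 2.0]]
labels = ["Bernanke"] * 3 + ["Powell"] * 2

# A new face landing in the first cluster collects three votes.
print(identify([0.05, 0.05], db, labels))  # Bernanke
```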
To deal with that, I create a vector of binary classifications a ∈ {0, 1}^M whose m-th element is denoted by a_m and is constructed as follows:

a_m = 1 if d_m < ε, and 0 otherwise.

Basically, a_m contains a yes/no answer to the question: is the person in image I' the same as in image I_m? Then I can simply count the number of votes. For instance, suppose that for a new image I' I find that it is very close to 40 pictures of Ben Bernanke, 2 pictures of Jerome Powell and 1 picture of a random person in my dataset. I then conclude that I' is an image of Ben Bernanke. Repeating the above procedure for all the frames in my videos allows me to isolate the times when the Fed Chair appears and to disregard the moments when the reporters are on the screen.

A.2. FACIAL LANDMARKS