Machine Learning for Temporal Data in Finance: Challenges and Opportunities
Jason Wittenbach
[email protected]
Capital One - Center for Machine Learning
McLean, Virginia

Brian d'Alessandro
[email protected]
Capital One
McLean, Virginia

C. Bayan Bruss
[email protected]
Capital One - Center for Machine Learning
McLean, Virginia
ABSTRACT
Temporal data are ubiquitous in the financial services (FS) industry – traditional data like economic indicators, operational data such as bank account transactions, and modern data sources like website clickstreams – all of these occur as a time-indexed sequence. But machine learning efforts in FS often fail to account for the temporal richness of these data, even in cases where domain knowledge suggests that the precise temporal patterns between events should contain valuable information. At best, such data are often treated as uniform time series, where there is a sequence but no sense of exact timing. At worst, rough aggregate features are computed over a pre-selected window so that static sample-based approaches can be applied (e.g. number of open lines of credit in the previous year or maximum credit utilization over the previous month). Such approaches are at odds with the deep learning paradigm, which advocates for building models that act directly on raw or lightly processed data and for leveraging modern optimization techniques to discover optimal feature transformations en route to solving the modeling task at hand. Furthermore, a full picture of the entity being modeled (customer, company, etc.) might only be attainable by examining multiple data streams that unfold across potentially vastly different time scales. In this paper, we examine the different types of temporal data found in common FS use cases, review the current machine learning approaches in this area, and finally assess challenges and opportunities for researchers working at the intersection of machine learning for temporal data and applications in FS.
CCS CONCEPTS
• General and reference → Surveys and overviews; • Computing methodologies → Machine learning; Neural networks; • Mathematics of computing → Time series analysis.

KEYWORDS
finance, time series, point process, multimodal fusion
A core function of machine learning in financial services is modeling the behavior of customers or products and the markets in which they operate. For example, for consumer finance companies, the customers are credit card holders who operate in the debt market. Customer behavior and the market are both highly dynamic. At the same time, current and future states for both the customer and the market have strong temporal dependencies on their past states. As such, the data collected on them over the course of time should rightfully be viewed and modelled with a temporal component.
Figure 1: Finance has a long history of exploring and modeling time series of commodity prices, as can be seen here with monthly prices of coffee dating back to the 1960s [1].
The history of modeling temporal data is intricately linked with the history of modern finance. Almost all investment is in some form based on an anticipation of a future state of the market as determined by current and historical knowledge. The most common such instrument is the futures contract, where a producer of some commodity promises to deliver a good to the buyer at a specific time for a specific price. These types of contracts date back to the beginning of agriculture but found fertile soil in the 17th-century Netherlands, the same birthplace of the modern corporation. In these contracts, both the buyer and the seller need a model to anticipate the risk of the delivery of that good. These models have grown in sophistication along with the financial services industry and the technology that enables it [19].

Until recently, much of the focus in finance around modeling temporal data has been limited to traditional data sources such as stock/commodity prices, deposits, and macroeconomic indicators. However, as FS companies digitize their operations, novel streams of data become available, such as digital credit reports, clickstream data of customers' online presence, detailed transaction information, and multi-channel payments. Each of these data streams can have a distinct time signature and scale, and can contain one or many different modalities of information. For instance, clickstream data includes not only the sequence of the session and the time per page, but also the graph of the website and the global point process of intents over sessions for that customer.

Machine learning has seen significant advances in recent years in image classification and natural language processing. Within finance, it has become a dominant tool for fraud detection and credit risk modeling. However, both in finance and in the broader machine learning research community, there remains a gap between methods developed for static data sets and those that can incorporate a temporal dimension. This paper presents a sample of recent advances in the relevant domains of machine learning on sequences and temporal data. The intent of this paper is to pose some challenges and opportunities for research into modeling of temporal data for financial applications.
At a high level, models of temporal data fall into the broader category of sequence models. Sequence models are those where the order of data points contains structural information. For example, language models are sequence models seeking to uncover the inherent structure of word choice, order, and dependency. Time series models are sequence models where there is explicit order and each data point is labeled with a time stamp. We could also call these event-stream models to highlight that each observation has a timestamp and a possibly rich feature vector containing the details of the event. Because almost all industrial systems record the time at which data were collected, nearly every dataset can be viewed as a temporal dataset.

Temporal data can be further split into uniform and non-uniform time series, depending on whether the events/measurements are evenly spaced or come at irregular intervals. These data may come from discrete measurements of an underlying continuous process that take place at regular or irregular intervals (e.g. a stock price recorded each hour, or a credit score requested only when a new application is submitted). Or the events may be singular, with nothing defined between observations (e.g. a credit card transaction). Much of the past research on time series forecasting has been in the domain of uniform time series. Non-uniform time series is a broad category containing a number of very common sequences in financial applications. One exception is when the focus is solely on the event timings, and not on any associated features or more complex tasks – in this setting, the irregularly spaced events are called a point process, and there is an equally rich body of work on finance use cases in this area.

Each of these types of time series can be either univariate or multivariate. In the multivariate case, the variables might be independent, or they might exhibit strong internal relationships at each time point. For example, video can be viewed as a multivariate time series of pixels; however, each frame contains strong internal structure and localized correlations, which also interact along the time axis. Such an internally structured group of variables can be viewed as a modality, and some time series have multiple modalities. Video with audio is an example where the sound and the images are distinct modalities with strong internal structures but also relationships to each other and to the temporal dimension. Modalities can include images, text, graphs, and tabular feature spaces.
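To make this distinction concrete, the following is a minimal sketch in Python (with invented, illustrative data) of the two representations: a uniform series with an implicit regular index versus an event stream where every observation carries its own timestamp and feature vector.

```python
import numpy as np
import pandas as pd

# Uniform time series: evenly spaced, the time index is implicit.
# Example: a monthly macroeconomic indicator.
unemployment = pd.Series(
    [3.6, 3.5, 4.4, 14.7, 13.3, 11.1],
    index=pd.period_range("2020-01", periods=6, freq="M"),
)

# Non-uniform event stream: each observation has an explicit timestamp
# and a (possibly rich) feature vector. Example: card transactions.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-03 09:15", "2020-01-03 09:17", "2020-01-21 18:02"]
    ),
    "amount": [4.50, 12.00, 87.10],
    "merchant_category": ["coffee", "grocery", "utilities"],
})

# The inter-event times are irregular; this is exactly the information
# that uniform-series treatments discard.
print(transactions["timestamp"].diff().dropna())
```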
Figure 2: Examples of univariate time series with different profiles of sampling and missingness.
Uniform time series describe temporal data where the observations are evenly spaced in time. These can represent discrete processes or continuous processes sampled regularly. In finance, macroeconomic indicators such as the unemployment rate and GDP are collected and reported on a regular cadence (weekly or monthly). This type of temporal data has received a tremendous amount of study over the last fifty years. Much of this work has focused on univariate time series and on building statistical techniques for decomposing the varying components of the time series (e.g. level, seasonality, trend). For a more in-depth overview of common time series methods, please refer to [8].

As shown in the recent Makridakis competition (M4), these statistical techniques still prove very robust for univariate uniform time series [14]. At the time of the latest competition, no pure machine learning approach had successfully surpassed the statistical benchmarks. However, for the first time, the best performing model overall utilized a hybrid machine learning and statistical model; notably, this approach surpassed the baselines by nearly 10 percent. This model used a combination of a Holt-Winters statistical technique with a recurrent neural network [23]. The Holt-Winters multiplicative model decomposes a time series into three components: level, seasonality, and trend. The winning model observed that the assumption of linearity in the trend component of the Holt-Winters model can be effectively removed by replacing that component with a recurrent neural network. While the performance improvement is significant, this approach still requires heavy pre-processing of the data and carefully crafted neural architectures that depend on whether the series is monthly, quarterly, or yearly.

Oreshkin et al. seek to build on this success by using a fully neural architecture for univariate time series. They establish three principles for a neural approach to time series: 1) the architecture should be simple and generic, 2) it should not require heavy pre-processing and feature engineering of the time-series input, and 3) it should be interpretable. The proposed approach uses stacks of residually connected layers that seek to learn basis functions to simultaneously predict the next value in the series while also looking backwards to predict the series up until that point [17]. This building-block approach works well on its own but is not inherently interpretable; however, using inductive bias, it is easily modifiable to explicitly model common components such as seasonality and trend. This approach does well on the M4 benchmark, though it was not included in the competition.
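For context on the statistical half of such hybrids, below is a minimal sketch, assuming the statsmodels library and synthetic data, of fitting a Holt-Winters model with additive trend and multiplicative seasonality. The M4 winner's actual system replaced the trend component with an RNN and involved substantially more machinery; this only illustrates the classical building block.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly series: level + linear trend + multiplicative seasonality.
rng = np.random.default_rng(0)
t = np.arange(120)
season = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)
y = (50 + 0.5 * t) * season + rng.normal(0, 1, size=t.size)

# Holt-Winters decomposes the series into level, trend, and seasonality.
model = ExponentialSmoothing(
    y, trend="add", seasonal="mul", seasonal_periods=12
).fit()
forecast = model.forecast(12)  # forecast one year ahead
```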
As previously mentioned, many use cases in modern FS leverage data that go beyond uniformly sampled time series, instead dealing with data that are more properly described as irregular event streams – sequences of feature vectors that are indexed by a continuous timestamp as opposed to a discrete integer. Examples range from bank account transactions (withdrawals, deposits, transfers, etc.) to clickstream events recorded during a user's interaction with a website. In cases where the precise timing of events contains task-relevant information, throwing away the timestamps and treating the data as a uniform time series will limit model performance. Fortunately, there is a rich and evolving literature around building deep learning models for event stream data.
Figure 3: Examples of univariate and multivariate event streams. Individual events can comprise multiple features coming from possibly distinct modalities.
One broad set of approaches to this challenge is to fall back on models designed to handle uniform time series – namely, recurrent neural network (RNN) architectures in the context of deep learning – but to preprocess the event stream data in such a way that the relevant information about event timing is encoded elsewhere.

For instance, one early approach was to discretize time into bins small enough that at most one event occurs within any single bin, and then treat the data as a uniform time series with missing feature vectors in bins with no events. Of course, this leads to a problem where many positions in the sequence will have "missing" data, but any technique for handling missing data can now be used to bridge this gap [11]. This could be anything from simple imputation schemes that replace missing values with a default value, the mean, or the mode, to complex model-based approaches that try to impute the best value for the missing data from the non-missing data [2, 18, 24].

The downside of this approach is that, by covering over the fact that the data were missing, imputation gives the model no hint as to the timing of the underlying events [22]. A key observation is that, when the missing values arise from discretization of event stream data, the missingness is actually informative as to the event timing and thus may be relevant to the task. This led to a more refined set of approaches where the feature vector is augmented with a binary mask that indicates whether the rest of the feature vector is observed data or an imputation, which allows the model to make use of the informative missingness [10]. Taking this idea a step further, the feature vector can also be augmented with the time since the last observation to reinforce the temporal nature of the underlying data [3].

Finally, information about event timing can also be embedded into the dynamics of the RNN itself. For example, this can be achieved by modifying the RNN such that, between events, the hidden state decays exponentially toward some fixed point with a learnable decay rate [3]. The idea is that, as time passes, events further in the past fade from relevance and provide diminishing predictive power, and we move back toward a state of ignorance. For some applications, this assumption may be appropriate, but in cases where long-range dependencies are important, it is likely to prove problematic. This approach can be further extended to richer event streams where each timestamp has an associated feature vector. One drawback of this approach is that the models it produces are somewhat limited to next-event prediction tasks and density estimation, and how to extend them to other sequence modeling tasks is not obvious.
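A minimal sketch of this input-augmentation scheme follows (plain NumPy, with a hypothetical helper name): each bin of the discretized sequence carries the possibly imputed value, a binary observation mask, and the time elapsed since the last true observation.

```python
import numpy as np

def augment_with_mask_and_delta(values, observed, bin_width=1.0):
    """Augment a discretized sequence for an RNN.

    values   : (T,) array, with imputed entries where observed is False
    observed : (T,) boolean array, True where a real event fell in the bin
    Returns a (T, 3) array of [value, mask, time-since-last-observation].
    """
    T = len(values)
    mask = observed.astype(float)
    delta = np.zeros(T)
    since = 0.0
    for i in range(T):
        delta[i] = since
        # Reset the clock after observed bins; otherwise keep counting.
        since = bin_width if observed[i] else since + bin_width
    return np.stack([values, mask, delta], axis=1)

# Example: events in bins 0 and 3 of a 5-bin window, mean-imputed elsewhere.
vals = np.array([2.0, 1.5, 1.5, 1.0, 1.5])
obs = np.array([True, False, False, True, False])
X = augment_with_mask_and_delta(vals, obs)
```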
An alternative starting point for building deep learning models for event stream data is to eschew discretization and instead confront the problem of modeling sparse temporal data head-on. This approach has a venerable history in the study of point processes – stochastic models for datasets where each observation is simply a point in some space [4]. Often, that space is the set of real numbers and the observations correspond to timestamps of a set of events; this is a temporal point process. But other prominent examples exist, such as the spatial point processes used in the geosciences for modeling the locations of earthquakes over a geographic region.

Just as parametric distributions exist for modeling various univariate data types (Gaussian distributions for real-valued data, Poisson distributions for count data, etc.), there are classic temporal point process models, such as the Poisson process for independent events, the Hawkes process for self-excitatory event streams, and more. Often these models are specified by giving the conditional probability of the next event time given the preceding event history. One way to produce a more flexible point process model is to parameterize this conditional probability with a neural network rather than assuming a fixed functional form, as in the classical approach; this is the approach taken by the neural Hawkes process [5, 16]. These models apply to a broader class of data known as "marked point processes", in which each timestamp has an associated feature vector. While these feature vectors could technically be defined over any space, in practice they are usually defined over a discrete set of categories, such that each category signifies a different event type. So while these models are appropriate for fusing multiple point processes (see below), they are more geared toward tasks that focus solely on predicting timing.

A recent arrival in this space that overcomes this obstacle is the neural ODE (ordinary differential equation) [3, 21]. This model arises from the observation that adding skip connections to the hidden layer of an RNN makes the forward pass of the model identical to a discrete-time approximation of a continuous ODE. Moving to continuous time makes modeling event sequences relatively straightforward, since one can then model the feature vectors as observations that are conditional on the instantaneous value of the latent state at the event timestamp. Furthermore, extensions to this model allow events to instead be treated as inputs that can have an instantaneous effect on the hidden state (via an ODE-RNN hybrid). And the probability of an event occurring can be modeled as a time-dependent Poisson process, the rate of which is conditioned on the instantaneous latent state of the model. Thus, these models naturally accommodate many tasks and modeling assumptions about the relationships between event features and system state. Though new and relatively untested, neural ODE models are a promising and flexible approach to modeling event stream data.
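For concreteness, the classical Hawkes process specifies this conditional structure through a closed-form intensity function. The sketch below (plain NumPy, illustrative parameters) computes the standard exponential-kernel intensity; the neural Hawkes process [16] generalizes this by replacing the fixed kernel with a learned, recurrently modulated one.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Conditional intensity of a Hawkes process with exponential kernel:

        lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))

    Each past event excites the process, with influence decaying over time.
    (alpha / beta < 1 keeps the process stable.)
    """
    history = np.asarray(history)
    past = history[history < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

# Intensity right after a burst of events is elevated (self-excitation),
# then decays back toward the baseline rate mu.
events = [1.0, 1.1, 1.2]
print(hawkes_intensity(1.3, events))  # well above mu
print(hawkes_intensity(9.0, events))  # close to mu
```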
Along with deriving the appropriate model for a time series, another area of interest is how to fuse multiple sources of temporal data when each source has not only its own time signature but also a distinct structure in the feature space. One area to look to in this regard is recent research in multi-modal fusion. This research primarily focuses on predictive tasks related to fusing video, audio, and natural language understanding/generation, or some subset thereof. The challenge in this domain is that pixel space and word-token space have strong internal dynamics but do not directly relate to one another. In many cases, multi-modal fusion seeks to learn joint representations of each modality for the purpose of a predictive task.

Early work in this area focused on learning joint representations of data coming from two or more modalities. In [9], the authors seek to generate image descriptions by aligning word vectors to objects detected in the image. The aligned object-word pairs are used to generate natural language descriptions of the image as a whole. In [26], the goal is to determine alignments between movies (image & subtitle) and the books on which they were based. Again, this is an example of two modalities, text and images; however, now the image modality unfolds over time and the two text modalities follow different sequential patterns. In order to align the books and the movies, they first generate sentence embeddings from the book while simultaneously generating scene embeddings of subtitled clips. Alignments between these two spaces are learned for those scenes that correspond to specific parts of the book. In [15], alignments are learned between input instructions and sequences of spatial actions for an agent to execute. The goal is to be able to communicate with autonomous systems using spoken language.

As can be seen, much of the multimodal fusion work involves recurrent neural networks to learn sequential dependencies in one or more of the streams of data. Perhaps unsurprisingly, the same efficacy that attention-based models have shown for a wide variety of language and graph tasks has been observed in the multi-modal fusion domain. [25] use an attention mechanism to learn locations within an intermediate representation of an image to focus on for the generation of each word in a caption. This attention can be hard, meaning it selects a single location, or it can be a soft distribution over the entire image. [7] extends this approach beyond single images to include video and audio. [12] takes a similar approach, but with two attention mechanisms swapped across the image and language domains; it is also a more generalized approach to learning joint representations of language and images. In all of these models, the data are often sequential, but there is no conception of time beyond discrete ordered tokens.
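At its core, the soft attention shared by these fusion models is a learned weighted average. The following minimal NumPy sketch shows scaled dot-product attention of a query (e.g. a decoder state) over a set of keys/values (e.g. image-region features); the hard-attention variant is noted in a comment.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Scaled dot-product soft attention.

    query  : (d,)    e.g. decoder state while generating a word
    keys   : (n, d)  e.g. representations of n image regions
    values : (n, k)  features to be aggregated
    Returns the context vector (k,) and the attention weights (n,).
    """
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax over the n locations
    context = weights @ values      # soft: blend all locations
    return context, weights

# Hard attention would instead pick a single location, e.g.:
#   context = values[np.argmax(weights)]
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.eye(3)
ctx, w = soft_attention(q, K, V)
```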
As described above, FS companies generally work with multiple streams of sequence data. Research on non-uniform time series gives direction on how to approach time series data where the non-uniformity is driven by natural or uneven sampling processes. Taking cues from the multi-modal fusion literature, a core challenge is then: how does one combine two or more non-uniform time series, each with its own modality? In [26], alignment of the modalities is the core modeling goal. In many FS applications, alignment would be a pre-processing step necessary to solving the core modeling task. While most FS sequences would be indexed by a real time dimension, naively combining them into a master sequence would produce extremely sparse and irregular input vectors.

If we take a componentized view of the fusion of multiple non-uniform data sequences, architecture complexity quickly adds up. With each sequence being unique in both its domain and its time-irregularity, separate embedding and time-treatment components would be needed for each. On top of that, we would need components to fuse each modality into a single representation. With increasingly complex architectures, with more parameters and hyper-parameters to tune, training requirements could increase dramatically. While compute is often effectively managed in large FS companies (often by throwing more GPUs at the problem), time and data are certainly limiting factors. Even at the transactional scale of a top FS firm, data is fairly heterogeneous (in terms of types of customers, merchants, and transaction details). A converged solution may be good on average but could fail to best represent micro-segments of the population data (this is a problem in all of machine learning, but one that is exacerbated by highly complex models).

Methods that result in increased complexity also pose specific risks to FS companies in terms of what is generally referred to as "model governance." Models in FS companies are heavily regulated by several laws and statutes that set rules around credit underwriting, fraud detection, anti-money laundering, marketing, and data sharing. FS companies typically maintain compliance through multiple stages of internal governance. Much of model governance focuses on model input features and model interpretability. Model complexity generally makes explanations more difficult, but emerging methods (e.g., LIME [20], SHAP [13]) are gaining regulatory acceptance. As a fallback, models can be understood by the collection of input features (with perhaps some measure of global feature importance). Feature audits are particularly important in credit lending, as fair lending statutes specifically prohibit features that may lead to discrimination in credit lending (and possible proxies for them). As FS companies adopt methods that operate on raw input streams, the notion of an explicit feature fades away. Existing governance protocols are not designed for auditing implicit feature engineering, such as is done in deep learning. While deep learning may not pose a risk for including explicitly prohibited features (e.g., race, gender, sexual orientation), proxies to such features will be harder to detect when there are no explicit features to test. The solution to this challenge may not lie in the deep learning models themselves; instead, coupling these models with auxiliary techniques for discrimination detection [6] may have to become standard practice.
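As an illustration of what a feature-attribution audit can look like in practice, here is a minimal sketch using the shap library [13] on a hypothetical tabular credit model with invented feature names; real governance reviews involve far more than a single attribution plot.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical tabular credit-risk data with explicit, auditable features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "utilization": rng.uniform(0, 1, 500),
    "num_open_accounts": rng.integers(0, 15, 500),
    "months_since_delinquency": rng.integers(0, 60, 500),
})
y = (X["utilization"] + rng.normal(0, 0.2, 500) > 0.7).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Per-feature, per-decision attributions that can be reviewed against
# prohibited features and potential proxies for them.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```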
The finance industry has a history of leading the way in the analysis of multivariate time series data, but recent trends in digitization have expanded both the amount and type of data available to companies in this sector. Classical time series are now joined by event streams, where observations with rich feature vectors take place at irregular intervals across multiple timescales and levels of organization. And the streams are multiplying: a single bank account might contain multiple types of transaction events, and a single customer might have multiple accounts as well as website interactions and credit bureau data on top of all of that. It may very well be that the key piece of information for making a crucial business decision about that customer depends on understanding the interaction of the temporal patterns in those different streams. Across other industries, deep learning approaches have offered a way to build models that are tailored to the form and structure of the data and that excel at extracting meaningful patterns in order to solve a variety of tasks. Work has begun, but many challenges remain in applying such models to complex and multimodal event streams. And once again, the finance industry is poised to lead the way.
REFERENCES
[2] Yoshua Bengio and François Gingras. 1996. Recurrent neural networks for missing or asynchronous data. In Advances in Neural Information Processing Systems. 395–401.
[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems. 6571–6583.
[4] Daryl J. Daley and David Vere-Jones. 2003. An Introduction to the Theory of Point Processes, Vol. I. Probability and its Applications. Springer-Verlag, New York.
[5] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1555–1564.
[6] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 3315–3323. http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf
[7] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In Proceedings of the IEEE International Conference on Computer Vision. 4193–4202.
[8] R.J. Hyndman and G. Athanasopoulos. 2014. Forecasting: Principles and Practice. OTexts. https://books.google.com/books?id=gDuRBAAAQBAJ
[9] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Zachary C. Lipton, David Kale, and Randall Wetzel. 2016. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference. 253–270.
[11] Roderick J.A. Little and Donald B. Rubin. 2019. Statistical Analysis with Missing Data. Vol. 793. John Wiley & Sons.
[12] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 13–23. http://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks.pdf
[13] Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[14] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting 34, 4 (2018), 802–808.
[15] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16). AAAI Press, 2772–2778.
[16] Hongyuan Mei and Jason M. Eisner. 2017. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems. 6754–6764.
[17] Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations. https://openreview.net/forum?id=r1ecqn4YwB
[18] Shahla Parveen and Phil Green. 2002. Speech recognition with missing data using recurrent neural nets. In Advances in Neural Information Processing Systems. 1189–1195.
[19] L. Petram. 2011. The world's first stock exchange: How the Amsterdam market for Dutch East India Company shares became a modern securities market, 1602–1700.
[20] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[21] Yulia Rubanova, Tian Qi Chen, and David K. Duvenaud. 2019. Latent Ordinary Differential Equations for Irregularly-Sampled Time Series. In Advances in Neural Information Processing Systems. 5321–5331.
[22] Donald B. Rubin. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581–592.
[23] Slawek Smyl. 2018. M4 Forecasting Competition: Introducing a New Hybrid ES-RNN Model. https://eng.uber.com/m4-forecasting-competition/
[24] Volker Tresp and Thomas Briegel. 1998. A solution for missing data in recurrent neural networks with an application to blood glucose prediction. In Advances in Neural Information Processing Systems. 971–977.
[25] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15). JMLR.org, 2048–2057.
[26] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19–27.