Earnings Prediction with Deep Learning
Lars Elend, Sebastian A. Tideman, Kerstin Lopatta, Oliver Kramer
Computational Intelligence Group, Department of Computer Science, Carl von Ossietzky University of Oldenburg, 26111 Oldenburg, Germany
Abstract.
In the financial sector, a reliable forecast of the future financial performance of a company is of great importance for investors' investment decisions. In this paper, we compare long short-term memory (LSTM) networks to temporal convolutional networks (TCNs) in the prediction of future earnings per share (EPS). The experimental analysis is based on quarterly financial reporting data and daily stock market returns. For a broad sample of US firms, we find that LSTMs outperform the naive persistent model with up to 30.0% more accurate predictions, while TCNs achieve an improvement of 30.8%. Both types of networks are at least as accurate as analysts and exceed them by up to 12.2% (LSTM) and 13.2% (TCN).
Keywords:
Finance · Earnings Prediction · EPS Forecasts · Long Short-Term Memory · Temporal Convolutional Network.
1 Introduction

Investors rely first and foremost on earnings predictions when making investment decisions, e.g., to buy, hold, or sell a firm's shares. Besides using their own projections, they heavily rely on earnings forecasts provided by financial analysts. Consequently, forecasting earnings is one of the main tasks of financial analysts working at major financial institutions, e.g., broker firms. Analysts invest significant resources to provide accurate forecasts. However, forecasting is a difficult undertaking, as numerous factors influence the prediction performance. In this paper, we predict publicly listed US firms' quarterly earnings per share with state-of-the-art techniques from the field of deep neural networks based on companies' time series data.

We structure the remainder of this paper as follows. In Section 2, we present related work on the prediction of financial data. The base time series model and quality measures are introduced in Section 3. We describe the data preprocessing process in Section 4. The objective of our work is to compare LSTM networks with TCNs, which are introduced in Section 5. Section 6 presents the experimental analysis, and Section 7 draws conclusions.

2 Related Work
Analyst forecasts are often used to benchmark the accuracy of earnings predictions obtained from models. However, due to recent regulation of financial analysts' working conditions, e.g., limiting private access to management, a drop in analyst coverage has been observed [1]. Automated earnings prediction models supported by artificial intelligence may fill this gap. Empirical evidence on whether artificial intelligence can provide meaningful forecasts is missing.

Some evidence exists that fraud, e.g., the illegal manipulation of earnings, can be predicted using machine learning [4]. In their study, Bao et al. (2020) find that ensemble learning with raw accounting numbers has predictive power for future fraud cases. Their approach outperforms logistic regression models based on financial ratios commonly used by prior research [6] as well as a support-vector-machine model [5], in which a financial kernel maps raw accounting numbers into a set of financial ratios. Yet, the prediction of restatements is comparatively less challenging, as it is a binary decision (future restatement vs. no future restatement). In contrast, predicting future earnings is more challenging, as all discrete values are theoretically possible and information from multiple sources, e.g., financial statements and stock market data, has to be considered. To our knowledge, no study has yet predicted future earnings using artificial intelligence. Closest to this study is the work of Ball and Ghysels (2018) [3]. They use a mixed data sampling regression method (but no neural networks) to predict future earnings and find that their predictions beat analysts' predictions in certain cases, e.g., when the firm size is smaller and analysts' forecast dispersion is high.

3 Time Series Model
The goal in data-driven prediction based on time series is to find a function $\varphi$ that yields a future value $y$ based on the data of the past $\beta$ time steps $x = (q_{t-\beta+1}, \dots, q_t)$ (Fig. 1). In this paper, the time span between two time steps is 3 months. A non-perfect predictor $\hat{\varphi}(x) = \hat{y}$ can be evaluated using the mean squared error (MSE) with respect to the real value $y$.

Fig. 1: Illustration of the time series model for the prediction of the earnings of a company with quarterly reports $q_t$ at time step $t$. We seek a mapping $\varphi$ from a pattern $x$ of past earnings data to the label $y$, the predicted earnings for the future time step $t + \tau$ (in the illustration, $\tau = 1$ and $\beta = 3$). The window size $\beta$ describes the time span of considered past earnings.

To evaluate our model, we compare it with the persistent model and the analyst forecast. The persistent model is a simple baseline that uses the current value as the prediction for the next time step. For each model, the MSE is calculated; therefore, larger deviations are punished more than smaller ones. Since the difficulty of forecasting the given data varies greatly over time and between different companies, the error value in itself is not meaningful. Therefore, we use a relative comparison between the different models, namely the skill score (SS) [11]:

$$SS_{MSE} = 1 - \frac{MSE(m)}{MSE(base)}, \qquad (1)$$

where $MSE(m)$ is the MSE of our own model $m$ (LSTM, TCN) and $MSE(base)$ is the MSE of the comparison model: the persistent model $pa$ or the analyst forecast $a$. The model under consideration is better (worse) than the reference model if the skill score is greater (less) than 0 [11].
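As a concrete illustration, the following minimal sketch (our own example, not the authors' code; the function and variable names are hypothetical) builds sliding-window patterns from a quarterly series and computes the MSE-based skill score of a model against the persistent baseline:

```python
import numpy as np

def make_windows(q, beta, tau=1):
    """Build patterns x = (q_{t-beta+1}, ..., q_t) and labels y = q_{t+tau}."""
    X, y = [], []
    for t in range(beta - 1, len(q) - tau):
        X.append(q[t - beta + 1 : t + 1])
        y.append(q[t + tau])
    return np.array(X), np.array(y)

def skill_score(y_true, y_model, y_base):
    """SS_MSE = 1 - MSE(m) / MSE(base); > 0 means the model beats the baseline."""
    mse_model = np.mean((y_true - y_model) ** 2)
    mse_base = np.mean((y_true - y_base) ** 2)
    return 1.0 - mse_model / mse_base

# Toy example: a quarterly EPS series with window size beta = 3.
eps = np.array([1.0, 1.2, 1.1, 1.3, 1.4, 1.6, 1.5])
X, y = make_windows(eps, beta=3)
y_persistent = X[:, -1]          # persistent model: last observed value
y_model = y_persistent * 1.02    # stand-in for a learned predictor
print(skill_score(y, y_model, y_persistent))
```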
4 Data Preprocessing

As input data, we use accounting data (e.g., total assets and cost of goods sold) from Compustat Quarterly as well as daily stock market price and return data from CRSP (Daily Shares). At first, both datasets, Compustat Quarterly and Daily Shares, are reduced to the most important parameters per time step and firm. The different value ranges of the individual parameters $x$ are "normalized" by scaling with the total assets $atq$:

$$x' = \frac{x}{\max\{1, atq\}}, \qquad (2)$$

and studentized:

$$z'_i = \frac{z_i - \bar{z}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (z_i - \bar{z})^2}}, \qquad (3)$$

where $\bar{z}$ is the mean of the $z_i$. Outliers of eps, which are partially erroneous, are removed by using the first (last) percentile as minimum (maximum). We create company samples of a given window size (number of quarters). Smaller data gaps are filled using linear interpolation, while samples with larger gaps are rejected. The quarterly data are merged with the corresponding daily stock data from Daily Shares, which are also studentized. For the comparison with the persistent model, only data points are used for which analyst forecasts exist.
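A minimal sketch of this preprocessing pipeline, assuming a pandas DataFrame with hypothetical column names (eps, atq, etc.; not the authors' released code):

```python
import numpy as np
import pandas as pd

def preprocess(df, value_cols, eps_col="eps", atq_col="atq"):
    """Scale by total assets, clip EPS outliers, studentize, and fill small gaps."""
    out = df.copy()
    # Eq. (2): scale each value column by total assets (guard against small atq).
    for col in value_cols:
        out[col] = out[col] / np.maximum(1.0, out[atq_col])
    # Remove partially erroneous EPS outliers via 1st/99th percentile clipping.
    lo, hi = out[eps_col].quantile([0.01, 0.99])
    out[eps_col] = out[eps_col].clip(lo, hi)
    # Eq. (3): studentize each column (zero mean, unit standard deviation).
    for col in value_cols + [eps_col]:
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    # Fill small gaps by linear interpolation; samples with larger gaps
    # would be rejected upstream.
    out = out.interpolate(method="linear", limit=2)
    return out
```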
Compustat Quarterly: (cusip, fpedats, ffi5, ffi10, ffi12, ffi48, financialfirm, EPS_Mean_Analyst), rdq, epsfiq, atq, revtq, nopiq, xoprq, apq, gdwlq, rectq, xrdq, cogsq, rcpq, ceqq, niq, oiadpq, oibdpq, dpq, ppentq, piq, txtq

Daily Shares: (cusip, date), ret, prc, vol, shrout, vwretd

5 Neural Networks

An LSTM network [8] belongs to the family of recurrent neural networks. It employs backward connections, which allow information to be saved over time. LSTM cells internally consist of three gates: a forget, an input, and an output gate, see Fig. 2. An LSTM cell employs internal states $h$ and $s$ propagated through time. Input $x_t$ is concatenated with $h_{t-1}$ and fed to the forget, input, and output gates. The forget gate determines which information should be forgotten, the input gate specifies to which extent the new input data is taken into account, and the output gate specifies the information to output based on the internal state. With these functional components, an LSTM is well suited for time series data. LSTM networks have successfully been applied to numerous domains, e.g., wind power prediction [12] and speech recognition [7].
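For reference, the standard LSTM gate equations (a textbook formulation following [8], not quoted verbatim from this paper, using the cell state $s$ and hidden state $h$ named above):

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
\tilde{s}_t &= \tanh(W_s [h_{t-1}, x_t] + b_s) && \text{(candidate state)}\\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t && \text{(cell state update)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(s_t) && \text{(hidden state)}
\end{aligned}$$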
A TCN [2] is a special kind of convolutional neural network [9]. While convolutional neural networks are primarily used for classification tasks on image, text, or speech data, TCNs can be applied to time series data. TCNs extend their counterparts by causal convolutions and dilated convolutions. The TCN has a one-dimensional time series input. Causal convolutions only use the current and past information for each filter. The dilation defines the distance between the used input data elements of each filter. An example of both concepts is visualized in Fig. 3 with a dilated causal convolution with kernel size $k = 2$ and dilations 1, 2, 4. In our experiments, we increase the dilation $d$ exponentially, i.e., $d_i = 2^i$, and select an appropriate number of layers to cover the given time span. TCNs also find numerous applications, e.g., in satellite image time series classification [10].
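A minimal sketch of such a stack of dilated causal convolutions in Keras (our illustration of the concept, not the authors' implementation; the layer count and filter sizes are hypothetical):

```python
import tensorflow as tf

def dilated_causal_stack(seq_len, n_features, n_layers=3, filters=32, k=2):
    """Stack of causal Conv1D layers with dilations 1, 2, 4, ... (d_i = 2**i).

    With kernel size k, the receptive field grows to 1 + (k - 1) * (2**n_layers - 1),
    so a few layers suffice to cover the whole input window.
    """
    inputs = tf.keras.Input(shape=(seq_len, n_features))
    x = inputs
    for i in range(n_layers):
        x = tf.keras.layers.Conv1D(
            filters=filters,
            kernel_size=k,
            dilation_rate=2 ** i,   # exponentially increasing dilation
            padding="causal",       # each filter sees only current and past inputs
            activation="relu",
        )(x)
    return tf.keras.Model(inputs, x)

model = dilated_causal_stack(seq_len=20, n_features=19)
model.summary()
```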
Fig. 2: Illustration of an LSTM cell with forget, input, and output gates. Yellow boxes represent ANN layers, orange circles represent element-wise operations.

Fig. 3: Dilated causal convolution with dilations $d = 1, 2, 4$.
6 Experimental Analysis

For our experiments, we employed two datasets: A for the choice of a proper architecture and parameters, and B for the final experiment with the selected best architecture. The training set of A includes all samples whose predicted EPS values lie in the period from 2012 to the end of 2016. The last 10% of the training set is used for validation only. The test set covers the half year following the training period, so it is independent and contains no unfair knowledge. For dataset B, the period is extended by half a year, so that its test data have not been seen before.

Each model is trained with a batch size of 1024 for 1000 epochs and a dropout rate of 0.3 for each intermediate layer and the recurrent edges of an LSTM layer. Dense layers apply tanh as activation function, except for the last layer, which uses a linear one. The window size of Compustat Quarterly and Daily Shares is set to 20, i.e., the last 20 quarters of earnings reports and the last 20 daily stock market returns form a pattern. The model is optimized using Adam with MSE as loss. Each epoch's best model w.r.t. the validation error is used for testing. Each experiment is repeated five times. Statistics include mean and standard deviation.

Furthermore, we have experimentally selected the best architectures as representatives for LSTM and TCN (Fig. 4). Compustat Quarterly and Daily Shares are used as input (green). The dimensions are given in parentheses. Since the shares data is put into a dense layer (D), the time input is flattened to a vector of length 220. After a few layers, the two inputs are joined by a merge layer. For the TCN, 32 filters and a kernel size of 3 were used. The last dense layer with only one neuron outputs the predicted EPS value.

Fig. 4: Visualization of the selected LSTM and TCN architectures. (a) LSTM: quarters (20, 19) → LSTM (20, 76) → LSTM (38); shares (220) → D (220) → D (440) → D (660); merge → D (19) → D (8) → D (1). (b) TCN: quarters (20, 19) → TCN (f=32, k=3) → D (38); shares (220) → D (220) → D (440) → D (660); merge → D (19) → D (8) → D (1).
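Assuming a Keras-style setup, a sketch of the two-branch LSTM variant based on the dimensions read off Fig. 4 (our reconstruction, not the authors' released code) could look as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Branch 1: 20 quarters x 19 accounting features through two LSTM layers.
quarters_in = tf.keras.Input(shape=(20, 19), name="quarters")
x1 = layers.LSTM(76, return_sequences=True, dropout=0.3, recurrent_dropout=0.3)(quarters_in)
x1 = layers.LSTM(38, dropout=0.3, recurrent_dropout=0.3)(x1)

# Branch 2: flattened daily shares data (length 220) through dense layers.
shares_in = tf.keras.Input(shape=(220,), name="shares")
x2 = shares_in
for units in (220, 440, 660):
    x2 = layers.Dense(units, activation="tanh")(x2)
    x2 = layers.Dropout(0.3)(x2)

# Merge both branches; a single linear output neuron regresses the EPS value.
merged = layers.Concatenate()([x1, x2])
out = merged
for units in (19, 8):
    out = layers.Dense(units, activation="tanh")(out)
    out = layers.Dropout(0.3)(out)
out = layers.Dense(1, activation="linear", name="eps")(out)

model = tf.keras.Model([quarters_in, shares_in], out)
model.compile(optimizer="adam", loss="mse")
```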
As financial and non-financial companies show significantly different behavior in many regards, we analyze the predictions in independent experiments. Table 1 compares the prediction performance for three different sets of companies: all companies (all), non-financial companies only (nofin), and financial companies only (onlyfin). The data sets without financial firms usually give the best results; the worst results are achieved when only financial companies are taken into account. We test the best model on an independent dataset B. Table 2 shows the results of the best configurations of Table 1. The results for the non-financial companies are similar to those observed before, with an MSE that is 12–13% better than the analysts' predictions. The predictions for all companies are slightly better, but worse than on dataset A.
Table 1: Selected architectures and parameters for three groups of companies: financial (onlyfin), non-financial (nofin), and all, with skill scores $SS_{MSE}(m, pa)$ against the persistent model and $SS_{MSE}(m, a)$ against the analyst forecast for LSTM and TCN (mean ± standard deviation).

Table 2: Results on dataset B of the optimal architectures and parameters, grouped by financial sector affiliation ($SS_{MSE}(m, pa)$ and $SS_{MSE}(m, a)$, mean ± standard deviation).

These results suggest that LSTM networks and TCNs are indeed able to provide meaningful earnings predictions. Even after accounting for the variation across the repetitions (e.g., standard errors based on three repetitions), the range of significance (e.g., mean estimate plus/minus standard error) is well above zero in all cases. This is remarkable, as we only used widely available public data on companies, such as balance sheet information and stock market price and return data. Hence, we can conclude that our networks outperform both the persistent model and the mean forecast of financial analysts on a subsample of non-financial firms (e.g., manufacturing firms).
7 Conclusions

Our experimental analysis has shown that LSTM networks and TCNs are powerful models in the application of earnings prediction. We base our prediction models on quarterly accounting data, such as cost of goods sold and total assets, as well as stock market price and return data. Using these widely available time series data, the persistent model was significantly outperformed. The LSTMs performed slightly better in our analysis using the same set of variables. In the future, we will extend the experimental analysis to further data sets and integrate further domain knowledge to improve the financial predictions. Our findings are relevant to both broker firms and investors. Broker firms may want to consider developing LSTM networks and TCNs to supplement their analysts' forecasts. Investors could build their own forecast models using artificial intelligence, particularly when no forecasts from financial analysts are available, an issue that became more urgent recently due to the drop in analyst coverage induced by regulation.
References
1. Anantharaman, D., Zhang, Y.: Cover Me: Managers' Responses to Changes in Analyst Coverage in the Post-Regulation FD Period. The Accounting Review (6), 1851–1885 (Nov 2011). https://doi.org/10.2308/accr-10126
2. Bai, S., Kolter, J.Z., Koltun, V.: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. CoRR abs/1803.01271 (2018)
3. Ball, R.T., Ghysels, E.: Automated Earnings Forecasts: Beat Analysts or Combine and Conquer? Management Science (10), 4936–4952 (Oct 2017). https://doi.org/10.1287/mnsc.2017.2864
4. Bao, Y., Ke, B., Li, B., Yu, Y.J., Zhang, J.: Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research (1), 199–235 (2020). https://doi.org/10.1111/1475-679X.12292
5. Cecchini, M., Aytug, H., Koehler, G.J., Pathak, P.: Detecting Management Fraud in Public Companies. Management Science (7), 1146–1160 (May 2010). https://doi.org/10.1287/mnsc.1100.1174
6. Dechow, P.M., Ge, W., Larson, C.R., Sloan, R.G.: Predicting Material Accounting Misstatements. Contemporary Accounting Research (1), 17–82 (2011). https://doi.org/10.1111/j.1911-3846.2010.01041.x
7. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning - Volume 32, pp. II-1764–II-1772. ICML'14, JMLR.org, Beijing, China (2014)
8. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation (8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
9. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation (4), 541–551 (Dec 1989). https://doi.org/10.1162/neco.1989.1.4.541
10. Pelletier, C., Webb, G.I., Petitjean, F.: Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Remote Sensing (5), 523 (Jan 2019). https://doi.org/10.3390/rs11050523
11. Roebber, P.J.: The Regime Dependence of Degree Day Forecast Technique, Skill, and Value. Weather and Forecasting 13