Continuously Updated Data Analysis Systems

Lee F. Richardson
Department of Statistics and Data Science, Carnegie Mellon University

July 23, 2019
Abstract
When doing data science, it's important to know what you're building. This paper describes an idealized final product of a data science project, called a Continuously Updated Data-Analysis System (CUDAS). The CUDAS concept synthesizes ideas from a range of successful data science projects, such as Nate Silver's FiveThirtyEight. A CUDAS can be built for any context, such as the state of the economy, the state of the climate, and so on. To demonstrate, we build two CUDAS systems. The first provides continuously-updated ratings for soccer players, based on the newly developed Augmented Adjusted Plus-Minus statistic. The second creates a large dataset of synthetic ecosystems, which is used for agent-based modeling of infectious diseases.
When I work on data science projects, it helps to imagine what the final product will look like. At the end of the rainbow, what's my pot of gold? In most projects, the final product is defined for us: your boss wants a report, your software engineer friend wants a script, your advisor wants a paper, and so on. But we'll forget these constraints here, and instead describe an idealized final product for a data science project, called a Continuously-Updated Data-Analysis System (CUDAS).

The CUDAS concept isn't new, and you've probably seen it before. For example, an incredibly popular CUDAS was the 2016 election forecast by FiveThirtyEight. What makes this a CUDAS? I'll explain next.
Broadly speaking, a CUDAS has three components:

1. Data Pipeline
2. Data Analysis
3. Continuously-Updated Results

The data pipeline processes the raw data so it's ready for analysis, the data-analysis converts the processed data into the results we're interested in, and we continuously update our results as new data comes in. These components don't need to happen in a sequence. For example, we may need to update our data pipeline after realizing that our data-analysis is missing a covariate.

These three components closely follow definitions of data science described elsewhere (e.g. Silver (2014), Donoho (2017), and Wickham and Grolemund (2016)). For example, I adapted the definition given in Chapter 1 of Wickham and Grolemund (2016) by grouping the import and tidy boxes into a single component, called the data-pipeline. I've also grouped transform, visualize, and model into a single component, called data-analysis (Tukey (1962)). Finally, I changed communicate to continuously-updated results, because a CUDAS updates when new data becomes available.

This CUDAS definition also closely follows the definition of Greater Data Science given by Donoho (2017). In fact, I think of a CUDAS as an implementation of this framework, since each of Donoho's six divisions, even science about data science, can be viewed in terms of its impact on a CUDAS. More on this later.
I got the idea for a CUDAS by studying successful data science projects, and trying to abstract what they had in common. I'll walk through my three favorite examples next.
2.1.1 The 2016 Election Forecast

The 2016 election forecast from FiveThirtyEight was (perhaps) the most popular CUDAS of all time. Their system collects polling data, uses this data to forecast the probability of each candidate winning, and continuously updates the forecast on a beautiful interactive web page. Figure 1 shows two key screenshots of the project.
Figure 1: Left: A visualization of the polls that are collected, adjusted, then added to the forecast. Right: The predicted probability of each candidate winning the 2016 presidential election.

2.1.2 The Global Burden of Disease

Another great CUDAS comes from the Global Burden of Disease (GBD) study, produced by the Institute for Health Metrics and Evaluation (Lopez and Murray (1998)). The GBD is an extremely ambitious project, with the goal of collecting and synthesizing all the world's health data, and providing continuously-updated estimates of disease burden. The GBD is a scientific triumph, and the book Epic Measures by Smith (2015) chronicles the story from its beginnings.

Let's think about the GBD in terms of a CUDAS. First, the GBD employs a team whose goal is collecting all the health data they can get their hands on, from surveys, to scientific literature, to vital registration systems, and more. Next, the GBD has a team of disease experts, statisticians, computer scientists, epidemiologists, etc. to model the burden of each individual disease. Then, the individual disease estimates are combined into a single metric, called the Disability-Adjusted Life Year (DALY, Murray and Acharya (1997)). Finally, the GBD provides spectacular interactive visualizations of their results, which they update annually, an example of which is shown in Figure 2.
Figure 2: The GBD's interactive website that displays the DALYs attributed to all diseases, by location, in 2017. The GBD produces multiple interactive visualizations, updates annually, and gives users access to a wide variety of results.
2.1.3 war-on-ice

Our final CUDAS example is the war-on-ice project, available online at: http://war-on-ice.com/ . Although the project is no longer active, at its apex, war-on-ice provided advanced hockey statistics that updated after every night of games. (The two creators are now employed by professional hockey teams.) A notable feature of the war-on-ice CUDAS is that the authors made a critical piece of their data pipeline, the nhlscrapr R package, available. In later years, one of the authors has been behind a similar nflscrapR package for American football (Horowitz (2017)), which shows the early signs of a generalizable idea.

The 2016 election forecast, the GBD, and war-on-ice come from completely different contexts (politics, global health, and hockey). But when viewed through a CUDAS lens, the projects are similar. The next section provides more detail on the similarities between these three systems.
What do the 2016 election forecast, the GBD, and war-on-ice have in common? For starters, each project has a data pipeline, data-analysis, and continuously updated results. But each project also understood the dependencies between these components: the data-analysis is nothing without the data pipeline, and the data-analysis isn't as valuable without the continuously-updated results. Let's go into more detail on what made these three CUDAS systems stand out.
Multiple data sources are synthesized in a purposeful way. In each of our three examples, the data was available online, but the data wasn't formatted for data-analysis. For example, the 2016 election forecast collects polls from many different sources, the GBD combines different data types from many different diseases, and war-on-ice collects play-by-play data, images, box score statistics, and more.

But these projects didn't just collect the data, they also knew what to do with it. The data was rigorously extracted and transformed into the precise format required for the data-analysis. Nate Silver provides a detailed user guide to the 2016 election forecast (FiveThirtyEight (2016b)), in which he describes the critical steps of adjusting the polls, and combining the polls with other data sources, such as economic data. So, it's not enough for a data pipeline to collect the data; a good pipeline must also know how the raw data needs to be processed in order to produce the results the CUDAS is ultimately interested in.
The results are interesting and interpretable. The data-analysis performed in our three examples isn't extremely complicated, but it's not trivial either. Each of our examples uses some sort of statistical model: FiveThirtyEight (2016a) uses a Bayesian approach to forecast who will win the election, the GBD uses many models (e.g. DisMod (Flaxman (2019))), and war-on-ice implements the adjusted plus-minus methodology described in Thomas et al. (2013).

In my view, these models are successful because they're interesting and interpretable. By interesting, I mean that each model was able to gain a large following of users who wanted to know how the results changed as new data came in. By interpretable, I mean that the output of the model was easy to understand: FiveThirtyEight (2016a) gives each candidate a probability of winning, the GBD summarizes disease burden into a single metric (the DALY), and war-on-ice ranks players based on their contribution to winning. In each case, it doesn't take a rocket scientist to understand the results.
The results are continuously updated in a highly intuitive display. Finally, and I think most importantly for their success, each example continuously updates its results. And they don't just update their results, they display their results in highly intuitive web applications, which gives users a simple way to stay up to date. If there's one thing we've learned in the information age, it's that people like checking their devices for updates (think: Facebook notifications).

So now we know what a CUDAS is, and we've analyzed three examples. Let's use these insights to create some CUDAS systems of our own.
I've luckily been involved in developing several CUDAS systems. Full disclosure: I worked on the GBD project (Section 2.1.2) for a year. Since then, I've had a larger role in developing two other CUDAS projects: one for infectious disease modeling, and the other for ranking soccer players. In this section, I'll go through the details of building them both.

In The Signal and the Noise (Silver (2012)), Nate Silver overviews the state of disease forecasting. After discussing the limitations of compartment models, Silver discusses the potential of a new approach, called agent-based models. For disease modeling, agent-based models simulate the daily interactions of people (e.g. when they talk, when they're in the same room), and track how a disease spreads based on these interactions.

For agent-based models to work, they need a dataset with a record for each person in the population. The dataset should also include where each person lives, where they go to school, and other information relevant to disease modeling. Agent-based modelers refer to these datasets as synthetic ecosystems.

Synthetic ecosystems are tricky to build: you need data from different sources, you need to integrate these data sources together, you need to make sure the synthetic ecosystem represents the population, and you need a large computer. And because this is a tricky problem, there's a demand in the agent-based modeling community for high quality synthetic ecosystems. That's where we came in. As part of the MIDAS research network, a group of us were tasked with generating synthetic ecosystems. Our goal: build a CUDAS for synthetic ecosystems.
To create a synthetic ecosystem, we need to know:

• How many people to create. For example, to create a synthetic ecosystem for Pittsburgh, we need to know how many people live in Pittsburgh.

• Geography. Continuing the Pittsburgh example, we need to know how the neighborhoods are organized, where the roads are, where the schools are, and so on.

• The characteristics of the people (age, gender, occupation, etc.).

(Silver interviews a group of agent-based modelers from the University of Pittsburgh, who work on an infectious disease model called FRED: A Framework for Reconstructing Epidemic Dynamics (Grefenstette et al. (2013)). In the CUDAS described in this section, we essentially worked to create synthetic ecosystems for the FRED model.)

All of this data is available online, but different pieces are available in different locations. So the first part of our data pipeline consisted of scripts to collect data and store it on our computing cluster, hosted by the Pittsburgh Supercomputing Center (Center (2016)). Next, we laboriously ensured that each data source shared a common geography. This is difficult, because each data source partitions countries into smaller regions. But unfortunately, each data source differs in how it partitions countries. For example, the left side of Figure 3 shows how the website GeoHive (Geohive (2016)) splits Italy into 20 regions, and the right side of Figure 3 shows how IPUMS (Center (2014)) splits Italy into 20 regions. While these two data sources are close, it still took a lot of work to make sure that both datasets had the same geographies. And Italy was easy compared with the rest of the countries.

Thus, a substantial element of our data pipeline involved matching the geographies of different data sources. We did this manually, but in the next section, I'll discuss a more general solution to this problem.
Figure 3: A common problem in building data pipelines for CUDAS systems is matching names across multiple data sources. This figure shows how two different data sources split Italy into 20 smaller regions. Matching names is often the most time consuming part of building a CUDAS, and there is a need for more efficient solutions to this problem.
Once we (finally) had the data ready, we needed a method to turn this data into synthetic ecosystems. For this, we developed the SPEW framework for synthetic ecosystems, which is described in Gallagher et al. (2018). The framework samples people, assigns these people to households, schools and workplaces, then assigns locations to the households, schools, and workplaces. We used intuitive algorithms for each of these tasks. For example, we sampled the characteristics of people using microdata, where microdata is simply a representative sample of the population. And we assigned people to schools based on the location of their household, the location of their school, and the size of each school.

To implement the SPEW framework, we created the spew R package. We used this package to generate all of our synthetic ecosystems, and it was designed to work on the Pittsburgh Supercomputing Center's cluster, where our data was stored. The package is available online at: https://github.com/leerichardson/spew .

After generating the synthetic ecosystems, we needed to make them available to agent-based modelers. First things first, we created a website where users could simply download the complete ecosystems. But these results weren't very intuitive, so Shannon, a fellow PhD student on this project, wrote a general markdown script that produced summary reports for each synthetic ecosystem (see Figure 4 for an example). Not only did these reports help agent-based modelers understand their synthetic ecosystems, but they also helped us debug our software, and ensured that our synthetic ecosystems passed the intraocular ("hits you in the eyes") test.
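To make these sampling steps concrete, here is a minimal sketch in Python. This is illustrative only, not the actual spew R package: the microdata rows, school records, and the size-over-distance weighting are made-up assumptions standing in for the real algorithms.

```python
import math
import random

def sample_people(microdata, n, rng):
    """Sample n synthetic people (with replacement) from representative microdata."""
    return [dict(rng.choice(microdata)) for _ in range(n)]

def assign_school(household, schools, rng):
    """Pick a school with probability proportional to size / distance (a toy rule)."""
    def weight(school):
        dist = math.hypot(school["x"] - household["x"], school["y"] - household["y"])
        return school["size"] / (dist + 1e-6)  # small constant avoids division by zero
    weights = [weight(s) for s in schools]
    return rng.choices(schools, weights=weights, k=1)[0]["name"]

rng = random.Random(42)
# Hypothetical microdata: each row is one surveyed person.
microdata = [{"age": 34, "occupation": "teacher"},
             {"age": 9, "occupation": "student"}]
people = sample_people(microdata, 5, rng)

# Hypothetical schools with coordinates and enrollment sizes.
schools = [{"name": "A", "x": 0, "y": 0, "size": 500},
           {"name": "B", "x": 10, "y": 10, "size": 100}]
school = assign_school({"x": 1, "y": 1}, schools, rng)
```

The real framework handles households and workplaces the same way, and must also keep the sampled population consistent with region-level population counts.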
Figure 4: Automatically generated reports that summarize each synthetic ecosystem. These reportshelped agent-based modelers understand their ecosystems, and they helped us debug our software.
In terms of continuously-updated results, the idea was that whenever the Census released a new data sample, or GeoHive released new population counts, or a new data source became available, a user would be able to pass the new data source through the SPEW framework, and obtain a synthetic ecosystem that accounted for the new data.

Although we released several versions of synthetic ecosystems, the newest of which used more recent data, we were never able to reliably and efficiently produce continuously-updated synthetic ecosystems in this idealized manner. But the dream lives on. As a silver lining, we developed a diagram that describes the SPEW framework, shown in Figure 5. And as the figure shows, this process cleanly decomposes into a CUDAS.
Figure 5: The SPEW framework from Gallagher et al. (2018). From a CUDAS perspective, we clearly see that the framework decomposes into a data pipeline, data-analysis, and continuously-updated results.
The second CUDAS we'll walk through is for a recently developed soccer metric, called Augmented Adjusted Plus-Minus (AAPM, Matano et al. (2018)). The details are in the paper, but the basic idea is that AAPM combines two data sources, FIFA ratings and play-by-play data, and uses these data sources to rate each player. In Matano et al. (2018), it's shown that AAPM predicts game outcomes better than other statistics.

Why is AAPM a good statistic for a CUDAS? Earlier, I noted that the data-analysis results for a CUDAS should be interesting and interpretable. In principle, the AAPM statistic should be interesting to soccer fans, especially since it's tied to predictive accuracy. The AAPM statistic is also interpretable, since it ranks each player, and can be easily displayed in a table.

Data Pipeline

We need two data sources to compute AAPM: play-by-play information, and FIFA ratings from the beginning of each season. With these two sources, we need to produce a design matrix, which is the input required for our statistical model that computes AAPM. In the design matrix, each column represents a single player, and Figure 6 shows what the design matrix looks like. We also need to link each player (column in the design matrix) with a FIFA rating.
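To make the structure concrete, here is a minimal illustrative sketch (in Python, not our actual R pipeline) of how a plus-minus style design matrix can be built from game segments: players on the home side get +1, players on the away side get -1, absent players get 0, and the segment's goal differential is the response. The segment data below is made up.

```python
def build_design_matrix(segments, players):
    """One row per game segment: +1 for home players on the pitch,
    -1 for away players, 0 otherwise. Response is the goal differential."""
    X, y = [], []
    for seg in segments:
        row = [0] * len(players)
        for p in seg["home_players"]:
            row[players.index(p)] = 1
        for p in seg["away_players"]:
            row[players.index(p)] = -1
        X.append(row)
        y.append(seg["goal_diff"])
    return X, y

players = ["Aguero", "Salah", "Lukaku"]
segments = [
    {"home_players": ["Aguero"], "away_players": ["Salah"], "goal_diff": -1},
    {"home_players": ["Aguero", "Lukaku"], "away_players": [], "goal_diff": 2},
]
X, y = build_design_matrix(segments, players)
# X == [[1, -1, 0], [1, 0, 1]], y == [-1, 2]
```

A real pipeline would also carry the segment start and end times, so that goal differentials can be rated per minute played.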
Figure 6: The design matrix produced by the data pipeline for our AAPM CUDAS (the PlusMinusData step returns a table with game segment data). Each row is a game segment, with columns for the home team, away team, goal differential, segment start and end times, and one column per player (e.g. Aguero, Salah, ..., Lukaku) indicating who was on the pitch; for example, a Man City vs. Liverpool segment from minute 61 to 68 with goal differential -1. Each player column (the X in a regression model) represents a player. This design matrix, and a FIFA rating for each player, is the input to our statistical model that computes the AAPM statistic.

There are several complications to building the pipeline, such as:

• Finding websites with the data, then writing scripts to extract it.

• Matching the names of soccer players from multiple data sources.

• Automating the process so that it works across seasons, leagues, etc.

You may have noticed that these challenges are the same we faced when we built our CUDAS for synthetic ecosystems. In each case, we needed to collect data from multiple sources, and match names across each source. For synthetic ecosystems, we matched geographic names, and for soccer ratings, we matched player names.

We overcame these challenges more effectively for the soccer CUDAS. Here, we developed two R packages: the first extracts the play-by-play and FIFA data, and the second matches player names using active record linkage. The data collection package is similar to the nhlscrapr package developed by war-on-ice, and we actually used a similar package, called fcscrapR, for some parts of the collection.

The name matching package, called arl, is a bit more interesting. The arl package automatically matches the names that are identical in both data sources, partially matches the names that are close, using probabilistic record linkage methods, and manually matches all the remaining names. In short: we automated as much as we could, and manually matched the rest. Given the differences between some data sources, sometimes this is the best you can do.

Similar to our CUDAS for synthetic ecosystems, the majority of the work was building the data pipeline. For seasoned data scientists, this is an obvious point.
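The tiered matching idea can be sketched as follows. This is an illustrative Python sketch only; our actual arl package is written in R and uses probabilistic record linkage rather than the simple string-similarity score shown here, and the player names are made-up examples.

```python
from difflib import SequenceMatcher

def match_names(source_a, source_b, threshold=0.85):
    """Tiered matching: exact matches first, then fuzzy matches above a
    similarity threshold; everything else is queued for manual review."""
    b_set = set(source_b)
    matched = {a: a for a in source_a if a in b_set}  # tier 1: exact
    manual = []
    for a in source_a:
        if a in matched:
            continue
        best = max(source_b, key=lambda b: SequenceMatcher(None, a, b).ratio())
        score = SequenceMatcher(None, a, best).ratio()
        if score >= threshold:
            matched[a] = best   # tier 2: close enough, accept automatically
        else:
            manual.append(a)    # tier 3: leave for a human to resolve
    return matched, manual

a = ["Sergio Aguero", "Mohamed Salah", "R. Lukaku"]
b = ["Sergio Agüero", "Mohamed Salah", "Romelu Lukaku"]
matched, manual = match_names(a, b)
```

Here "Mohamed Salah" matches exactly, "Sergio Aguero" is close enough to match automatically, and the abbreviated "R. Lukaku" falls to the manual queue, mirroring the three tiers described above.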
Data Analysis

Once our data pipeline produced the design matrix with linked FIFA ratings, we were ready to compute AAPM. The AAPM statistic is calculated with a Bayesian regression model, where FIFA ratings form the prior distribution for each player. We developed another R package to fit our model, available at https://github.com/tpospisi/PlusMinusModels , which relies on standard Bayesian software (Carpenter et al. (2017)). After various model checks and tweaks, we verified that:

• Our results passed the intraocular test (the best/worst players made sense).

• Our model predicted game outcomes better than baseline and comparison statistics.

With our results in hand, the final step was producing the continuously-updated results.
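A minimal sketch of the underlying idea, assuming a conjugate normal model (illustrative Python, not the actual Stan-based model from Matano et al. (2018)): a normal prior centered on each player's (rescaled) FIFA rating shrinks that player's coefficient toward the rating, and the posterior mean has a closed form. The toy data and prior values are made up.

```python
import numpy as np

def posterior_mean(X, y, prior_mean, prior_var=1.0, noise_var=1.0):
    """Posterior mean of beta for y ~ N(X beta, noise_var * I) with
    prior beta ~ N(prior_mean, prior_var * I) (conjugate normal model)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Posterior precision and the corresponding linear system for the mean.
    A = X.T @ X / noise_var + np.eye(p) / prior_var
    b = X.T @ y / noise_var + prior_mean / prior_var
    return np.linalg.solve(A, b)

# Toy data: two players, priors from hypothetical rescaled FIFA ratings.
X = np.array([[1, -1], [1, 0], [0, 1]])
y = np.array([1.0, 0.5, -0.5])
prior = np.array([0.8, 0.2])
beta = posterior_mean(X, y, prior)
```

As the prior variance shrinks, the estimates collapse onto the FIFA-based prior; as it grows, they approach ordinary least squares, which is exactly the regularizing role the ratings play.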
Continuously-Updated Results

To produce continuously updated results, we need to answer two questions:

1. What's the best way to display our results?

2. How can we continuously update them?

To display the results, we followed another successful CUDAS: ESPN's Real Plus-Minus statistic (RPM, Ilardi and Engelmann (2019)). ESPN displays the RPM statistic in a simple table, which users can sort by offense, defense, or position. We created a similar table, which is available online.

Figure 7: The sortable table for our CUDAS, which we make available online, and continuously update as new games are played. This screenshot displays the top EPL players, sorted by AAPM, for the 2017-18 season.
Like ESPN, we made our tables sortable, and an example for the 2017-18 English Premier League season is shown in Figure 7. We created our sortable tables with the Javascript library D3 (Bostock et al. (2011)).

Finally, we need to make sure our results continuously update. Just as the 2016 election forecast updates after each poll, we want our AAPM statistic to update after each soccer match. I'm not an expert here, but here are three ways you can continuously update results:

1. Manually run a script every time you want new results.

2. Set up a cron job (Wikipedia contributors (2019)) to run every night.

3. Use a workflow management tool, such as Luigi or Airflow (Spotify (2019); Apache (2019)).

Since our AAPM CUDAS is in an early stage, we simply run our scripts manually. But moving forward, we plan on switching to an automated workflow.

And that's how our AAPM CUDAS works. To reiterate, we chose the AAPM statistic because it's interesting and interpretable. Then, we built a data pipeline that retrieves data from the web, links together multiple data sources, and produces a design matrix linked to FIFA ratings. We computed AAPM for each player with a Bayesian model, and this provides a ranking of each player in our dataset. Finally, we displayed our results as sortable tables online, and showed how our results can be continuously updated.
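For option 2, a crontab entry like the following would re-run the pipeline every night; the script path and log file are hypothetical placeholders, not our actual setup.

```shell
# Re-run the update pipeline every night at 2:00 AM (install with `crontab -e`).
# The paths below are placeholders for wherever the scripts actually live.
0 2 * * * /usr/bin/Rscript /home/user/aapm/update_results.R >> /home/user/aapm/update.log 2>&1
```

The appeal of cron is its simplicity; the workflow tools in option 3 add dependency tracking and retries on top of this basic scheduling.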
Discussion
I've described a CUDAS as my idealized final product for a data science project. A CUDAS includes a data pipeline, data-analysis, and continuously-updated results, and works for any context. I walked through three examples of successful CUDAS projects, then I described what I thought made them successful. I then described the creation of two CUDAS systems I've been involved with: one for synthetic ecosystems, and one for soccer ratings.

Now that I've explained what a CUDAS is, discussed several examples, and described how they can be built, I want to make the case for thinking about data science projects in terms of a CUDAS.

The key feature of a CUDAS is that it applies to any context. Let's say you're a statistician who is passionate about the economy. You notice that the GDP statistic is flawed, and you think of a clever way to improve it. What better way to communicate your metric than building a CUDAS?

Now take a more intricate example. In 2018, I gave a presentation where I proposed a CUDAS for my fantasy basketball team (Richardson (2018)). In fantasy basketball, you need to decide who to play, but I couldn't find any high quality forecasts tailored to the specifics of my team. So I started making the forecasts myself, but every time I needed an update, I had to manually copy the data from my league into a spreadsheet, manually run the forecast, and only then could I make an informed decision on who to play. This took way too much time, which makes it a perfect opportunity for a CUDAS. In this case, the data pipeline would download my league's data, the data-analysis would prepare the forecasts, and I could display the results in an easy-to-read web page.

Admittedly, building a CUDAS requires skills outside the wheelhouse of statisticians, and we had to pick up a lot of skills along the way. We used the rvest R package (Wickham (2016)) for web scraping, we developed an active record linkage R package for name matching, we used the Javascript library D3 for our interactive tables, we used Google App Engine to host our website, and so on.

While this took a lot of work, data science tools are quickly maturing, and higher quality tools should enable higher quality CUDAS systems. For example, the R package shiny has enabled users to create interactive web applications, while requiring zero knowledge of how the web works. And as I mentioned earlier, data pipeline tools have allowed data engineers to streamline and stress-test their data pipelines.

Further improvements should also come from increased collaboration between data scientists, engineers, computer scientists, web developers, database developers, psychologists, and more. For instance, most of the people I worked with on building our CUDAS systems were statisticians (data scientists?). Our backgrounds were great for developing the statistical models, but our skills were stretched when building the data pipeline, and developing the web applications to display our results. And as I went along, it became clear how valuable data modeling, data visualization, and web development expertise were to building high quality CUDAS systems. I came to see the projects as less about statistical modeling, and more about building an information system (Figure 8). In this way, the CUDAS concept provides a unifying framework for data-centered professionals with different skills to rally around.

As a final point, the rise of the Internet has profoundly changed the way we consume information. This has led to echo chambers, which Wikipedia describes as:

Figure 8: An information system as shown in Figure 1.1 of Simsion and Witt (2004).

"a metaphorical description of a situation in which beliefs are amplified or reinforced by communication and repetition inside a closed system.
"

In echo chambers, people lose access to a common set of facts, which makes communication difficult. Can a CUDAS help?

One hypothesis is that high quality CUDAS systems could constrain the public discourse around a common set of facts. If we all agreed that we want unemployment to be low, and GDP to be high, then we could build a CUDAS to track how well we're doing.

Would this work? As a thought experiment, consider the effect of FiveThirtyEight's CUDAS on Donald Trump's approval rating (FiveThirtyEight (2018)). This shows that Trump's approval has ranged between 36.4% and 47.8% over the course of his presidency. Now think: when is the last time you heard someone claim that Trump's approval rating is either at 20% or 80%? And how much did this happen in the past? Viewed this way, a CUDAS is analogous to a scoreboard, since it provides political junkies of all stripes a way to monitor a common set of facts. In spirit, CUDAS systems could complement the ideas of Tetlock and Gardner (2016), who advocate for score keeping of forecasts in the public square.

One of my favorite parts of Donoho's 50 Years of Data Science is the quote given by Cleveland (2001):

. . . [results in] data science should be judged by the extent to which they enable the analyst to learn from data.

It's a great quote. But what if we replaced data analyst with CUDAS:

. . . [results in] data science should be judged by the extent to which they improve a CUDAS.

I think this quote works just as well. To me, it's hard to think of any data science research that wouldn't, directly or indirectly, demonstrate its utility in a CUDAS.

Acknowledgments
Thanks to Francesca Matano and Taylor Pospisil for collaborating to build a CUDAS for augmented adjusted plus-minus. Thanks to Shannon Gallagher, Sam Ventura, Bill Eddy, Jeremy Espino, Shawn Brown, Jay Depasse, and everyone at the Pittsburgh Supercomputing Center who helped in creating SPEW synthetic ecosystems. Thanks to the sports reading and research group, in particular Sam Ventura and Ron Yurko, for their encouragement to initially present the concept.
References
Apache (2019). Apache Airflow. https://airflow.apache.org/ .

Bostock, M., Ogievetsky, V., and Heer, J. (2011). D3 data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301-2309.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).

Center, M. P. (2014). Integrated public use microdata series, international: Version 6.3. [Machine-readable database].

Center, P. S. C. (2016). Olympus computing cluster.

Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1):21-26.

Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745-766.

FiveThirtyEight (2016a). 2016 Election Forecast. https://projects.fivethirtyeight.com/2016-election-forecast/ .

FiveThirtyEight (2016b). A User's Guide To FiveThirtyEight's 2016 General Election Forecast. https://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/ .

FiveThirtyEight (2018). How Popular is Donald Trump? https://projects.fivethirtyeight.com/trump-approval-ratings/ .

Flaxman, A. (2019). An Integrative Metaregression Framework for Descriptive Epidemiology. https://github.com/ihmeuw/dismod_mr .

Gallagher, S., Richardson, L. F., Ventura, S. L., and Eddy, W. F. (2018). SPEW: Synthetic populations and ecosystems of the world. Journal of Computational and Graphical Statistics, 27(4):773-784.

Geohive (2016).

Grefenstette, J. J., Brown, S. T., Rosenfeld, R., DePasse, J., Stone, N. T., Cooley, P. C., Wheaton, W. D., Fyshe, A., Galloway, D. D., Sriram, A., et al. (2013). FRED (a Framework for Reconstructing Epidemic Dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations. BMC Public Health, 13(1):940.

Horowitz, M. (2017). nflscrapR: R package for scraping NFL data off their JSON API.

Ilardi, S. and Engelmann, J. (2019). NBA Real Plus-Minus.

Lopez, A. D. and Murray, C. C. (1998). The global burden of disease, 1990-2020. Nature Medicine, 4(11):1241.

Matano, F., Richardson, L. F., Pospisil, T., Eubanks, C., and Qin, J. (2018). Augmenting adjusted plus-minus in soccer with FIFA ratings. arXiv preprint arXiv:1810.08032.

Murray, C. J. and Acharya, A. K. (1997). Understanding DALYs. Journal of Health Economics, 16(6):703-730.

Richardson, L. F. (2018). A continuously updated data-analysis system for fantasy basketball. Presentation in the CMU sports and statistics reading group. Given June 6, 2018.

Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail, But Some Don't. Penguin.

Silver, N. (2014). What the fox knows. FiveThirtyEight. http://fivethirtyeight.com/features/what-the-fox-knows .

Simsion, G. and Witt, G. (2004). Data Modeling Essentials. Elsevier.

Smith, J. N. (2015). Epic Measures: One Doctor, Seven Billion Patients. Harper Wave.

Spotify (2019). Luigi. https://github.com/spotify/luigi .

Tetlock, P. E. and Gardner, D. (2016). Superforecasting: The Art and Science of Prediction. Random House.

Thomas, A., Ventura, S. L., Jensen, S. T., and Ma, S. (2013). Competing process hazard function models for player ratings in ice hockey. The Annals of Applied Statistics, pages 1497-1524.

Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1):1-67.

Wickham, H. (2016). Package rvest. URL: https://cran.r-project.org/web/packages/rvest/rvest.pdf .

Wickham, H. and Grolemund, G. (2016). R for Data Science. O'Reilly Media.

Wikipedia contributors (2019). Cron. Wikipedia, The Free Encyclopedia.