Initial growth rates of malware epidemics fail to predict their reach
Lev Muchnik, Elad Yom-Tov, Nir Levy, Amir Rubin, Yoram Louzoun
Initial growth rates of epidemics fail to predict their reach: A lesson from large scale malware spread analysis.
Authors:
Lev Muchnik, Elad Yom-Tov, Nir Levy, Amir Rubin, Yoram Louzoun
Affiliations: School of Business Administration, Hebrew University of Jerusalem; Microsoft Research, Herzliya, Israel; Microsoft Israel Research and Development Center; Department of Computer Science, Ben-Gurion University of the Negev, Be’er Sheva, Israel; Department of Mathematics, Bar Ilan University, Ramat Gan, Israel
* Please send correspondence and requests to [email protected]
Abstract
Many epidemiological models predict high morbidity levels based on an epidemic’s fast initial spread rate. However, most epidemics with a rapid early rise die out before affecting a significant fraction of the population. We study a computer malware ecosystem whose spread mechanisms resemble those of biological systems while offering details unavailable for human epidemics. We find an extremely heterogeneous distribution of computer susceptibility, with nearly all outbreaks starting at its tail and collapsing quickly once the tail is exhausted. This mechanism annuls the correlation between an epidemic’s initial growth rate and its total reach and prevents the majority of epidemics from reaching a macroscopic fraction of the population. The few pervasive malwares distinguish themselves early by avoiding the tail and preferentially targeting computers unaffected by typical malware.
Significance: Early epidemic dynamics do not predict their reach.
One of the key tasks in disease management is the early prediction of the expected amplitude of an epidemic from its initial growth rate. However, discrepancies between such predictions and the total incidence are reported over and over again. To understand why, we studied the dynamics of thousands of computer malwares on millions of computers. We show a large heterogeneity in computer infectivity, and that the majority of epidemics initially grow by affecting easily infected targets but then collapse when such targets are depleted. This eliminates the relation between the initial growth rate and the final amplitude and complicates forecasting. However, some pandemics initially affect low-infectivity targets. Those can actually have a very wide reach.
Main text
Introduction
A key concept in theoretical epidemiology is the epidemic threshold, which specifies the condition for epidemic growth. In mean-field epidemiological models, the basic reproductive number, R0, has been systematically employed as a predictor of an epidemic’s spread and as an analytical tool for studying the threshold conditions (1). Although the exact value of R0 is model-specific, its advantage is that in many models it determines the threshold for the emergence of an epidemic, its initial spread rate, and the expected reach of an outbreak (2, 3). It has thus been widely used to gauge the degree of threat a specific infectious agent would pose as the outbreak progresses (4), and has seen extensive application in the current COVID-19 pandemic (5, 6), thus entering the public consciousness. At the same time, large differences between the predicted and observed scales of epidemics have been repeatedly recorded, leading to a recognition that projections of infectious disease outbreaks were unreliable and unable to support policy. This was best demonstrated by an out-of-sample multi-year contest in which sixteen teams of epidemiologists competed over the accuracy of their forecasts of the characteristics of the seasonal dengue fever outbreak (7). The participants found that their forecast quality varied widely and that predictions were particularly inaccurate for high-incidence seasons. Most of the time, far fewer cases were observed than predicted (8). Numerous failures to produce accurate forecasts during the COVID-19 outbreak reignited the debate on the strengths and weaknesses of models, the expectations from their predictive power, and the dangers of mismodelling (9). The gap between reality and the predictions of epidemic models may be explained by a decrease in the effective reproduction number over time.
In biological viruses this decrease can stem from multiple factors, including human intervention (10), passive vaccination of the population (11), or seasonality (12) arising from the dependence of the virus’s transmission and survival mechanisms on environmental factors (13). Still, epidemics often grow more slowly than anticipated (14, 15) and disappear with clearly insufficient or no measures taken, while having a limited effect on the population (e.g. recent MERS, Zika, several outbreaks of Ebola (15), foot-and-mouth disease (14), different variants of influenza, and many others (16, 17)). Thus, even if the measures deployed contribute to limiting spread, other mechanisms must be involved. The absence of microscopic data makes verifying epidemiological models challenging, yet the tracking of computer malware may offer an excellent analog. While differences exist between the propagation mechanisms of biological viruses and computer malware, the two share many common aspects (18). A key advantage of using malware as an analog is the availability of telemetry reports generated by anti-malware software. They record infection spread at the machine level in granular detail, allowing researchers to reconstruct the history of infections for each machine and to characterize in parallel each piece of malware and each infected machine.
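As background for the role of R0 described above: in the textbook homogeneous SIR model, the fraction z of the population eventually infected satisfies the final-size relation z = 1 - exp(-R0 z). The sketch below solves that relation by fixed-point iteration. It illustrates the classical mean-field prediction only; the function name and the iteration count are illustrative assumptions, and nothing here is fitted to the malware data.

```python
import math

def sir_final_size(r0, iterations=1000):
    """Solve z = 1 - exp(-r0 * z) for the final epidemic size z.

    Classical homogeneous-mixing SIR result; `iterations` is an
    arbitrary illustrative choice, not a value from the paper.
    """
    z = 1.0  # start from "everyone infected" and iterate downwards
    for _ in range(iterations):
        z = 1.0 - math.exp(-r0 * z)
    return z

if __name__ == "__main__":
    for r0 in (0.9, 1.5, 2.0, 4.0):
        print(f"R0 = {r0}: predicted final size = {sir_final_size(r0):.4f}")
```

Under this relation any R0 above 1 implies a macroscopic reach, which is the kind of prediction that the observed outbreaks fail to follow.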
Results
To understand why an epidemic’s reach may be limited even if it has a rapid initial growth rate, we studied anonymized, machine-level data from the Microsoft Defender Antivirus telemetry reports (Supp. Methods). Integrated into Microsoft Windows, Defender Antivirus monitors the hard drive for malware and produces telemetry reports that enable the tracing of malware propagation. Each report contains a unique machine identifier, the infection time, and a file fingerprint labeling the malware. Unlike the aggregated data used by most epidemiological studies to track population dynamics over time, these details show which machines were infected by which malware, even retrospectively (i.e. machines report the date of infection even when a file is identified as malware later on). We studied nearly 30M infections observed in the first 72 hours of the spread of 139,962 malware strains detected during 21 full months between April 1, 2017 and December 31, 2018, as well as the malware’s final reach. These data contain only malware affecting over 200 machines.
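The two per-entity quantities used throughout the analysis, malware reach and machine infectivity, can both be derived from reports of this shape. The following is a minimal sketch that assumes each report is a plain `(machine_id, infection_time, fingerprint)` tuple; the tuple layout, the function name, and the `min_reach` parameter are hypothetical and do not describe the actual telemetry schema.

```python
from collections import defaultdict

def summarize_reports(reports, min_reach=200):
    """Compute malware reach and machine infectivity from telemetry-like tuples.

    reports: iterable of (machine_id, infection_time, fingerprint).
    Returns (reach, infectivity), where reach maps fingerprint -> number of
    distinct machines (only malware exceeding min_reach is kept) and
    infectivity maps machine_id -> number of distinct malwares seen on it.
    """
    machines_by_malware = defaultdict(set)
    malware_by_machine = defaultdict(set)
    for machine_id, _time, fingerprint in reports:
        machines_by_malware[fingerprint].add(machine_id)
        malware_by_machine[machine_id].add(fingerprint)
    reach = {f: len(m) for f, m in machines_by_malware.items() if len(m) > min_reach}
    infectivity = {m: len(f) for m, f in malware_by_machine.items()}
    return reach, infectivity

if __name__ == "__main__":
    toy = [("m1", 0, "a"), ("m2", 1, "a"), ("m1", 2, "b"), ("m1", 3, "a")]
    print(summarize_reports(toy, min_reach=1))
```

Sets are used so that repeated reports of the same malware on the same machine are counted once, matching the definition of infectivity as the number of distinct malwares per machine.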
As demonstrated below, the discrepancy between early predictions and the eventual reach may be due to the extreme heterogeneity of the susceptible population’s infection probabilities.
Computer infectivity, defined as the number of malwares infecting each machine, is indeed very broadly distributed (Fig. 1A). The vast majority of the machines reported not a single malware infection during the entire observation period. Among machines infected at least once, the infectivity distribution is scale-free, with a long tail of a few computers infected by a very large number of malwares, while most computers are infected very rarely (Fig. 1A). To illustrate the heterogeneity: the 0.6% most frequently infected machines account for 10% of all infections. Such heterogeneity explains both the rapid initial rise in infection and the failure of predictions to extrapolate the subsequent epidemic dynamics. In particular, the initially infected population resides on the right-hand side of the infectivity distribution. Since it is a very small fraction of the total population, this population is rapidly removed, slashing the average infectivity of the remaining susceptible population (19, 20) and leading to the epidemic’s premature collapse. In such populations, the rapid initial growth is followed by a sharp decline in the reproduction number and a fast decay of the epidemic, resulting in a discordance between the expected reach, based on the initial growth, and the observed reach. We directly tested the relationship between the outbreak’s initial growth rate and its reach.
Figure 1B shows the mean hourly growth rate, averaged over malwares of different sizes. The graph has two regimes. Its left side demonstrates that the initial growth rate correlates negatively with the final reach for malware reaching up to a few thousand machines, in clear contrast with SIR and SIR-like models. Note that this comprises the majority of all observed malwares (see the malware outbreak size distribution, Fig. 1C). In contrast, for the few malwares with the largest reach, the correlation is positive, hinting at the presence of different spread dynamics for the few very far-reaching malwares. To further show the lack of correlation between initial and late growth rates, we tested whether malware attaining at least half of its total reach within 72 hours of the first occurrence (designated as “fast”) had a faster initial infection rate than malware attaining less than half of its reach in 72 hours (“slow”). Figures 1D and S1 show the distributions, in each group, of the average hourly growth rates computed over the first 12 hours of spread of each malware. Consistently with Fig
Figure 1. A) Probability density function (PDF) of machine infectiveness. The X-axis represents the number of distinct malwares found on a specific machine over the observation period. The solid line is a power-law fit with 𝛼 = 2.8 (see Supp. Mat., Methods 1 for details). B) Average initial hourly growth rate as a function of the final malware reach (see Supp. Mat., Methods 2). C) Malware reach PDF. The graph represents 139,920 malwares, each with a reach exceeding 200 machines. The average malware reach is 914.6 machines and the median is 407; the distribution is fit by a power law with 𝛼 = 2.367. D) PDF of the mean hourly malware growth rate computed over the first 12h of the outbreak for two populations: slow (under 50% of their final reach achieved in 72h) and fast (over 50% of their final reach achieved in 72h) spread.
a) The infectivity of the infected population is higher than the average infectivity of the population at large.
b) The average infectivity of the infected population decreases over time as the right-hand side of the distribution is depleted, causing a rapid decrease in the number of infected computers, which is not correlated with the initial growth rate.
These findings imply that the higher the heterogeneity of the population, the larger the infectivity gap between the early and the later infected individuals. In extreme cases, epidemics that start spreading quickly will stop spreading once the few exceptional individuals get infected and removed. The opposite example would be the spread in a homogeneous population. In that case, any outbreak will continue as long as the density of the remaining susceptible individuals is high enough to sustain its reproduction (i.e. until the herd immunity threshold is achieved). In populations with homogeneous infectivity, a similarly skewed distribution could be obtained via a scale-free distribution of the probability of being exposed to the threat. Such systems are typically modeled as scale-free networks (22) in which certain highly connected nodes (sometimes designated as hubs) are in contact with a macroscopically large fraction of the population. Variation in the number of contacts that can lead to disease transfer has been shown to affect network processes, including the dynamics of disease spread (23, 24). In addition, network hubs are known to get infected early in the diffusion process (25). We demonstrate this by executing an SIR simulation on an artificially generated network with a power-law degree distribution with 𝛼 = 2.5. Uneven exposure to the infected individuals results in a fast decay of the degree of the newly infected nodes (Fig. 2B).
Figure 2. Mean infectiveness of the infected individuals for the SIR simulation run for A) a scale-free distribution of infectivity with 𝛼 = 2.5 and 𝛼 = 1.5. The dashed line represents the mean population infectivity. B) The average degree of infected individuals in a scale-free network with homogeneous infectivity and a power-law degree distribution with 𝛼 = 2.5. C) Ratio of the infectivity of newly infected machines for each malware as it spreads to the infectivity of the machines infected by the same malware with their infection times shuffled. The graphs show the average ratio over all malware that reach 85% of their final spread in the first 72h.
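The shuffled-time comparison behind the ratio in panel C can be sketched as follows for a single malware. The input is assumed to be a list of `(hour, machine_infectivity)` pairs, one per infected machine; that layout, the function name, and the number of shuffles are illustrative assumptions rather than the procedure actually used to produce the figure.

```python
import random
from collections import defaultdict

def infectivity_ratio(events, n_shuffles=200, seed=0):
    """Hourly ratio of real to time-shuffled mean infectivity for one malware.

    events: list of (hour, machine_infectivity), one entry per infected machine.
    The set of machines and the number of infections per hour are preserved;
    only the assignment of machines to infection times is shuffled.
    """
    rng = random.Random(seed)
    hours = [h for h, _ in events]
    values = [v for _, v in events]

    def hourly_means(vals):
        sums, counts = defaultdict(float), defaultdict(int)
        for h, v in zip(hours, vals):
            sums[h] += v
            counts[h] += 1
        return {h: sums[h] / counts[h] for h in sums}

    real = hourly_means(values)
    shuffled_total = defaultdict(float)
    for _ in range(n_shuffles):
        perm = values[:]
        rng.shuffle(perm)
        for h, m in hourly_means(perm).items():
            shuffled_total[h] += m
    return {h: real[h] / (shuffled_total[h] / n_shuffles) for h in sorted(real)}

if __name__ == "__main__":
    toy = [(0, 100), (0, 80), (1, 10), (1, 5), (2, 1), (2, 1)]
    print(infectivity_ratio(toy))
```

A ratio above 1 in the early hours and below 1 later indicates that highly infective machines are over-represented among the first victims.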
Predictions made by SIR simulations are observed empirically in the decrease in malware infectivity over time (Fig. 2C and Supp. Mat. 4). This phenomenon results in an overestimation of the epidemic’s reproduction number and its reach from inception on. The initially observed infectivity is not characteristic of the entire population and decreases gradually as the epidemic grows. By the mechanism above, all epidemics would be expected to die out before affecting a large population. However, some viral infections do reach a large fraction of certain populations. Such pandemics may have a different spread mechanism. A clear difference between high-reach computer pandemics and the vast majority of malware epidemics is the low average infectivity of the machines affected by the former (Fig. 3A), along with their high initial growth rate (Fig. 1B). These two phenomena suggest that large pandemics are large not exclusively due to high infectiousness, but also because they do not comply with the regular infectivity distribution, dominated by a few high-infectivity candidates. By selectively targeting the main body of the low-infectivity population, such an agent gains a much wider reach than other diseases. We verified this hypothesis by analyzing the properties of the computers infected by epidemics of different scales. While all malware tends to infect more low-infectivity than high-infectivity computers in the first 72 hours (since the vast majority of computers have a very low infectivity, Fig. 1A), the balance in high-reach malware is shifted further toward low-infectivity computers. This is clear from the infectivity distributions of the machines infected by malware, ranked by their reach (Fig. 3B). The steepness of the distribution changes with malware rank, revealing that the share of frequently infected machines in the large outbreaks drops significantly. The plots in Fig. 3B are all ordered by total reach, demonstrating that larger malware outbreaks correspond to distributions with a very small share of highly infective machines. Understanding what allows malware to target typically unaffected populations is beyond the scope of this work. To demonstrate that this is not a circular argument, we simulated the spread of each malware while preserving the infectivity distribution (Table S ). The distribution of average infectivity in a random sampling is practically not a function of the malware’s reach (Fig. S ). An important consequence of the initial targeting of low-infectivity machines is that malware reach can be predicted early. In contrast with the initial slope of the epidemic’s growth (Fig. 1B), the characteristic susceptibility of the affected machines is an excellent classifier of high-reach epidemics.
Discussion
As recently evidenced by the COVID-19 pandemic, the reach of a disease is hard to predict from its initial rise using classical epidemiological models. The reach has varied dramatically between countries and even between areas within the same country (26-28). This phenomenon is not unique to COVID-19. A recent survey of the measles reproduction number (29) collected 58 reported values, with most located in the 4-18 range but demonstrating a long tail reaching a value of 770.38 set by Wallinga et al. (30). Similarly large differences between districts were reported for the 2014-2015 Ebola outbreak in Western Africa (14). This variability in epidemic spread dynamics may be due to an extreme sensitivity to the volume of the distribution tail. While we have shown results for malware spread, they are a direct consequence of heterogeneity, and the conclusions may extend to other contact processes, such as the spread of products, pathogen-driven epidemics, and information cascades (31).
Figure 3. A) The mean infectiveness of the machines affected by malware during the first 72h as a function of final reach. The orange line is the mean simulated infectiveness of an equal number of randomly chosen machines. B) Distribution of machine infectiveness for the machines affected by malware, grouped by their reach. The black line represents all malware. The legend shows the mean epidemic reach for each group. See Table S1 and Fig. S2 for details on group composition.
It should be noted that our analysis became possible due to the availability of exhaustive historical records on the individual propensity to become infected with the spreading agent. Such data are rarely available for biological pathogens. Even if significant effort is invested into monitoring epidemics, much of the information is unobserved. For instance, less than 10% of the estimated 2009-2010 influenza H1N1 pandemic fatalities were laboratory confirmed (32). The fraction of identified COVID-19 patients is still debated and could vary significantly by region, but is estimated to lie between 4 and 14 percent by some studies (33, 34), suggesting that the actual dynamics of the disease spread are never observed. Even when most patients are known, their detailed medical records are rarely available. Testing and applying our findings in other domains would require collecting similarly detailed data in the corresponding systems. We have not studied here what property of the malware determines the differential targeting patterns. Understanding this factor may be key to fighting epidemics with a very wide reach, or, inversely, to developing a successful product or promoting social change. These results suggest a new intervention policy.
Instead of looking at the effect on R0 (or on the infection probability matrix, when the population is segregated into sub-groups), one should analyze the composition of the infected population. Small-scale actions targeting the highly susceptible population would slash the effective R0 and terminate the epidemic at its inception. In contrast, reducing the infectivity of each susceptible target via a broader intervention would prolong the epidemic, since the spread could still be sustained in the tail. To apply such models meaningfully to biological viruses, models for the infectivity of each susceptible target based on its demographics and behavior must be developed, with detailed data collected to fine-tune these models.
Materials and Methods
Acknowledgments
This analysis was supported by the Israel Science Foundation, grant
References
1. Anderson RM, Anderson B, & May RM (1992) Infectious diseases of humans: dynamics and control (Oxford University Press).
2. Arino J, Brauer F, Van Den Driessche P, Watmough J, & Wu J (2007) A final size relation for epidemic models. Mathematical Biosciences & Engineering.
Bulletin of Mathematical Biology.
, et al. (2009) Pandemic potential of a strain of influenza A (H1N1): early findings. Science.
Journal of Travel Medicine.
6. Alimohamadi Y, Taghdir M, & Sepandi M (2020) The estimate of the basic reproduction number for novel coronavirus disease (COVID-19): A systematic review and meta-analysis. Journal of Preventive Medicine and Public Health.
7. Johansson MA, et al. (2019) An open challenge to advance probabilistic forecasting for dengue epidemics. Proceedings of the National Academy of Sciences.
Science.
Proceedings of the National Academy of Sciences.
PLoS Currents 6.
11. Schaffer AC & Lee JC (2008) Vaccination and passive immunisation against Staphylococcus aureus. International Journal of Antimicrobial Agents.
, et al. (2006) Seasonal trends of human parainfluenza viral infections: United States, 1990 – Clinical Infectious Diseases.
Proceedings of the National Academy of Sciences.
Epidemics.
Physics of Life Reviews.
ka… what next? Vaccine.
Molecular Virology of Human Pathogenic Viruses :289.
18. Boase J & Wellman B (2001) A plague of viruses: biological, computer and marketing. Current Sociology.
Physical Review Letters.
Contemporary Physics.
Physical Review E.
The structure and dynamics of networks (Princeton University Press).
23. Barrat A, Barthelemy M, & Vespignani A (2008) Dynamical processes on complex networks (Cambridge University Press).
24. Pastor-Satorras R & Vespignani A (2001) Epidemic spreading in scale-free networks. Physical Review Letters.
Journal of Marketing.
, et al. (2020) Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. New England Journal of Medicine.
27. Riou J & Althaus CL (2020) Pattern of early human-to-human transmission of Wuhan 2019 novel coronavirus (2019-nCoV), December 2019 to January 2020. Eurosurveillance.
, et al. (2020) Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China. Nature Medicine :1-5.
29. Guerra FM, et al. (2017) The basic reproduction number (R0) of measles: a systematic review. The Lancet Infectious Diseases.
Epidemiology & Infection.
Journal of Consumer Research.
, et al. (2012) Estimated global mortality associated with the first 12 months of 2009 pandemic influenza A H1N1 virus circulation: a modelling study. The Lancet Infectious Diseases.
COVID-19, Infection Fatality Rate (IFR) Implied by the Serology, Antibody, Testing in New York City (May 1, 2020).
34. Li R, et al. (2020) Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science.
Supplementary Material for
Initial growth rates of epidemics fail to predict their reach: An analysis of malware spread
Lev Muchnik, Elad Yom-Tov, Nir Levy, Amir Rubin, Yoram Louzoun
Correspondence to: [email protected]
Materials and Methods
Data
Files are identified as malware through several mechanisms. First, customers may flag files as malware and make them known to the antimalware vendor. This happens rarely and mostly for malware causing very severe and noticeable damage. Second, suspicious files might be sent for analysis and identified as malware by the vendor. Finally, other antimalware vendors may share the results of their analysis with the vendor. Once malware is identified, the vendor can decide to block it through a signature developed and transmitted to the client machines. As with data on biological viruses, our data may be biased by the behavior of antimalware vendors and their decisions on whether to “vaccinate” machines. To estimate whether vendor policy could influence our results, we measured the correlation between the time elapsed from the earliest observation of the malware hash to its classification as malware and the number of machines infected within the first 72 hours. The correlation is 0.03, indicating a very weak association between the rise of the malware and the time until a signature is developed for it. Similarly, the correlation between the absolute number of infected machines after 72 hours and the time between identification and marking was -0.06.
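A check of this kind could be reproduced with a plain correlation coefficient, as sketched below. The text does not state which correlation measure was used, so the choice of the Pearson coefficient here is an assumption, as are the function name and the toy numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

if __name__ == "__main__":
    # Hypothetical values: hours until classification vs. machines infected in 72h.
    hours_to_classification = [5, 40, 12, 90, 33]
    infected_in_72h = [250, 900, 400, 300, 1200]
    print(round(pearson(hours_to_classification, infected_in_72h), 3))
```

A coefficient close to zero, as reported above, would indicate that the vendor's time to classification is nearly unrelated to how many machines the malware reached.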
Methods
1. Power law distributions
Power law fits were executed with the Python powerlaw toolbox ( ) ( ). The fits were executed for integer data (discrete parameter set to true), with the lower bound of the fitted range (x-min) set to 1000 for Fig. 1A and 500 for Fig. 1C.
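To make the fitting step concrete, the sketch below estimates the exponent of a discrete power-law tail with the standard approximate maximum-likelihood formula, alpha ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). It is a standard-library stand-in written for illustration, not the toolbox used for the reported fits; the function names and the synthetic data are assumptions.

```python
import math
import random

def fit_discrete_power_law(data, xmin):
    """Approximate MLE of alpha for integer data with p(x) ~ x^(-alpha), x >= xmin.

    Returns (alpha, standard_error). The approximation is accurate when xmin
    is not small, which holds for lower bounds such as 500 or 1000.
    """
    tail = [x for x in data if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    alpha = 1.0 + len(tail) / log_sum
    stderr = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, stderr

def sample_discrete_power_law(n, alpha, xmin, rng):
    """Draw n approximate discrete power-law samples by rounding a continuous draw."""
    return [
        int((xmin - 0.5) * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) + 0.5)
        for _ in range(n)
    ]

if __name__ == "__main__":
    rng = random.Random(0)
    synthetic = sample_discrete_power_law(20000, 2.8, 1000, rng)
    print(fit_discrete_power_law(synthetic, 1000))
```

Fixing the lower bound in advance, as described above, restricts the fit to the tail where power-law behavior is expected.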
2. Computing the average hourly growth rate.
The growth rate is averaged over the period of active spread of each malware, which does not exceed the initial 72 hours. It is defined as $\left\langle dI(t)/I(t) \right\rangle$, where $I(t)$ is the total number of infected machines at time $t$ and $dI(t) \equiv I(t) - I(t-1)$. This analysis is executed only for the malwares that had reached 80% of their final reach in the first 72 hours of their spread. Confidence intervals represent the standard error (SE).
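The definition above translates directly into code. The sketch below assumes the input is a list of cumulative infection counts sampled once per hour, starting at the hour of first detection; the function name, the `max_hours` parameter, and the treatment of the whole window as "active spread" are illustrative assumptions.

```python
def mean_hourly_growth_rate(cumulative, max_hours=72):
    """Average of dI(t)/I(t) with dI(t) = I(t) - I(t-1).

    cumulative: cumulative number of infected machines per hour, I(0), I(1), ...
    Only the first `max_hours` hourly steps are used.
    """
    window = cumulative[: max_hours + 1]
    rates = [
        (window[t] - window[t - 1]) / window[t]
        for t in range(1, len(window))
        if window[t] > 0
    ]
    return sum(rates) / len(rates)

if __name__ == "__main__":
    # Doubling every hour gives dI/I = 0.5 at every step.
    print(mean_hourly_growth_rate([1, 2, 4, 8]))
```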
3. SIR Simulation
We test our model by implementing a classical SIR simulation with the following reactions:
- According to the random mixing assumption, each susceptible individual can be infected with a probability proportional to the total number of infected individuals 𝐼(𝑡) at time 𝑡.
- Each infected individual can be removed with a constant probability. This probability was varied among simulations.
- The population is seeded at time 𝑡 = 0 with infected individuals chosen randomly.
At each step, the total probabilities of an infection and of a removal are computed using an efficient sum over all possible events, stored in an event tree. The leaves of the event tree are all possible events, and the internal nodes hold the sums of the probabilities in the leaves below them. Following the choice between an infection and a removal, the target of the infection or removal is chosen in the tree. The simulation is executed over 1,000,000 individuals, each with an individual value 𝛽 𝑖 representing its characteristic susceptibility. The values 𝛽 𝑖 were drawn from a scale-free distribution 𝑝(𝛽)~𝛽 −𝛼 with the slopes 𝛼 = 2.5 (similar to that observed in the malware infectivity data) and 𝛼 = 1.5, to demonstrate the effect of heterogeneity. The figure in the main article (Fig. 2A) represents the average infectivity of the infected population as the simulation progresses, computed over 10,000 simulation runs.
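The procedure above can be sketched at a much smaller scale as follows. A Fenwick (binary indexed) tree stands in for the event tree, and the rates, population size, seed count, and all names are illustrative assumptions rather than the settings of the reported simulation.

```python
import random

class SumTree:
    """Fenwick tree over weights: O(log n) updates and weighted sampling."""
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.add(i, w)

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self, u):
        """Index whose cumulative weight interval contains u."""
        pos, step = 0, 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                u -= self.tree[nxt]
                pos = nxt
            step >>= 1
        return pos

def simulate_sir(n=2000, alpha=2.5, gamma=0.5, seeds=5, rng=None):
    """Return (beta, order): all susceptibilities and those of newly infected, in order."""
    rng = rng or random.Random(0)
    # Susceptibility drawn from p(beta) ~ beta^(-alpha) for beta >= 1.
    beta = [(1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]
    state = ["S"] * n
    tree = SumTree(beta)              # holds weights of susceptible individuals only
    infected = rng.sample(range(n), seeds)
    for i in infected:
        state[i] = "I"
        tree.add(i, -beta[i])
    order = []
    while infected:
        k = len(infected)
        infection_rate = tree.total() * k / n   # random mixing
        removal_rate = gamma * k
        if rng.random() * (infection_rate + removal_rate) < removal_rate:
            j = rng.randrange(k)                # remove a random infected individual
            infected[j], infected[-1] = infected[-1], infected[j]
            state[infected.pop()] = "R"
        else:
            i = tree.sample(rng.random() * tree.total())
            if i >= n or state[i] != "S":       # guard against rounding residue
                continue
            state[i] = "I"
            tree.add(i, -beta[i])
            infected.append(i)
            order.append(beta[i])
    return beta, order

if __name__ == "__main__":
    beta, order = simulate_sir()
    print("population mean:", sum(beta) / len(beta))
    print("first 50 infected:", sum(order[:50]) / 50)
    print("last 50 infected:", sum(order[-50:]) / 50)
```

Comparing the mean susceptibility of the first and last infected individuals with the population mean reproduces, qualitatively, the depletion of the tail that the simulation is meant to demonstrate.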
4. Data-Driven Simulation
We simulated the spread of each malware over the first 72 hours since its appearance, reproducing the exact hourly infection rate and the same infected population. However, while the simulation preserves the machines originally infected by the malware, we re-assign infection times randomly, thus eliminating the original order of infection. The excessive presence of the most infective machines among the early infected population is demonstrated by the ratio between the real and the simulated infectivity (Fig. 2C).
Supplemental Tables and Figures
Table S1. The table summarizes statistical properties of the machines infected by malwares, grouped into 7 groups by their reach in the first 72 hours of spread. The last row represents all malwares in this set. The table covers malwares that reached over 200 machines during that period. Malwares in each row are grouped by their rank (2nd column). The third column lists the number of malwares in each group, followed by the mean (4th column) and median (5th column) number of machines affected by them. The mean infectivity of the machines infected by the malwares in the group is listed in the 6th column. The last column represents the mean machine infectivity in the simulated spread, which preserved the reach of each malware and infected each machine with a propensity proportional to the number of infections observed on that machine. By yielding a constant mean machine infectivity across all malware groups, the simulation demonstrates that the increase in the mean machine infectivity with the drop in malware reach (column six) is not a statistical artifact but a direct result of selective targeting by malwares. See Figure S2 for details of the distribution of infectivity for each set of simulations. Standard errors for averages are given in parentheses.
Figure S1. PDF of the initial hourly malware growth rate computed over the first 12h of the outbreak for two populations: slow (with under 50% of their final reach achieved in 72h) and fast (with over 50% of their final reach achieved in 72h) spreading. This plot is identical to Fig. 1D in the main manuscript but demonstrates the difference between the distributions on a linear scale. Confidence intervals are based on standard errors.
Figure S2. Infectivity probability density of the infected machines for malwares grouped as described in the caption of Table S1. These results are generated by a simulation as described in the Table S1 caption and complement the observed distribution in Figure 3B. The black line represents the PDF of the actual distribution of infectivity in the entire population.
References
1. A. Klaus, S. Yu, D. Plenz, Statistical analyses support power law distributions found in neuronal avalanches. PLoS ONE (2011).
2. J. Alstott, E. Bullmore, D. Plenz, powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9