Characterizing Internet Worm Infection Structure

Abstract

Internet worm infection continues to be one of the top security threats and has been widely used by botnets to recruit new bots. In this work, we attempt to quantify the infection ability of individual hosts and reveal the key characteristics of the underlying topology formed by worm infection, i.e., the number of children and the generation of the worm infection family tree. Specifically, we first apply probabilistic modeling methods and a sequential growth model to analyze the infection tree of a wide class of worms. We find, analytically and empirically, that the number of children asymptotically has a geometric distribution with parameter 0.5. As a result, on average half of infected hosts never compromise any vulnerable host, over 98% of infected hosts have no more than five children, and a small portion of infected hosts have a large number of children. We also discover that the generation closely follows a Poisson distribution and that the average path length of the worm infection family tree increases approximately logarithmically with the total number of infected hosts. Next, we empirically study the infection structure of localized-scanning worms and, surprisingly, find that most of the above observations also apply to them. Finally, we apply our findings to develop bot detection methods and study potential countermeasures for a botnet (e.g., Conficker C) that uses scan-based peer discovery to form a P2P-based botnet. Specifically, we demonstrate that targeted detection, which focuses on the nodes with the largest number of children, is an efficient way to expose bots. For example, our simulation shows that when 3.125% of nodes are examined, targeted detection can reveal 22.36% of bots. However, we also point out that future botnets may limit the maximum number of children to weaken targeted detection, without greatly slowing down the speed of worm infection.


arXiv [cs.CR], Jul

Qian Wang, Student Member, IEEE, Zesheng Chen, Member, IEEE, and Chao Chen, Member, IEEE


Index Terms: Worm infection family tree, botnet, probabilistic modeling, simulation, topology, and detection.

• Q. Wang is with the Department of Electrical and Computer Engineering, Florida International University, Miami, FL, 33174. E-mail: qian.wang@fiu.edu.
• Z. Chen and C. Chen are with the Department of Engineering, Indiana University - Purdue University Fort Wayne, Fort Wayne, IN 46805. E-mail: {zchen, chen}@engr.ipfw.edu.
• Z. Chen is the Corresponding Author.

1 INTRODUCTION

Internet epidemics are malicious software that can self-propagate across the Internet, i.e., compromise vulnerable hosts and use them to attack other victims. Internet epidemics include viruses, worms, and bots. The past twenty-plus years have witnessed the evolution of Internet epidemics. Viruses infect machines through exchanged emails or disks, and dominated the 1980s and 1990s. Internet active worms compromise vulnerable hosts by automatically propagating through the Internet and have drawn much attention since the Code Red and Nimda worms in 2001. Botnets are zombie networks controlled by attackers through Internet relay chat (IRC) systems (e.g., GT Bot) or peer-to-peer (P2P) systems (e.g., Storm) to execute coordinated attacks, and have become the number one threat to the Internet in recent years. Since Internet epidemics have evolved to become more and more virulent and stealthy, they have been identified as one of the top security problems [1].

The main difference between worms and botnets is that worms emphasize the procedures of infecting targets and propagating among vulnerable hosts, whereas botnets focus on the mechanisms of organizing the network of compromised computers and launching coordinated attacks. Most botnets, however, still apply worm-scanning methods to recruit new bots or collect network information [2], [3], [4], [5]. Moreover, although many P2P-based botnets use the existing P2P networks to build a bootstrap procedure, Conficker C forms a P2P botnet through scan-based peer discovery [6], [7]. Specifically, Conficker C searches for new peers by randomly scanning the entire Internet address space. As a result, the way that Conficker C constructs a P2P-based botnet is in principle the same as worm scanning/infection. Therefore, characterizing the structure of worm infection is important and imperative for defending against current and future epidemics such as Internet worms and Conficker C like P2P-based botnets.

Modeling of Internet worm infection has focused on the macro level. Most, if not all, mathematical models study the total number of infected hosts over time [8], [9], [10], [11], [12], [2]. Micro-level models of worm infection, however, have received little attention.
The micro-level models can provide more insight into the infection ability of individual compromised hosts and the underlying topologies formed by worm infection. A key piece of micro-level information is "who infects whom", i.e., the worm infection family tree. When a host infects another host, they form a "father-and-son" relationship, which is represented by a directed edge in a graph formed by worm infection. Hence, the procedure of worm propagation constructs a directed tree where patient zero is the root and the infected hosts that do not compromise any vulnerable host are leaves (see Fig. 1). To the best of our knowledge, there is as yet no mathematical model that reflects the structure of such a tree.

Fig. 1. A worm tree (patient zero at generation 0, followed by generations 1, 2, and 3).

The goal of this work is to characterize the Internet worm infection family tree, i.e., the topology formed by worm infection. For such a tree, we are particularly interested in two metrics:

• Number of children: For a randomly selected node in the tree, how many children does it have? This metric represents the infection ability of individual hosts.
• Generation: For a randomly selected node in the tree, which generation (or level) does it belong to? This metric indicates the average path length of the graph formed by worm infection.

These two metrics reflect the underlying topology formed by worm infection, called the "worm tree" for short. For example, if the worm tree is a random graph, each host would infect a similar number of targets, and the average path length would increase approximately logarithmically with the total number of nodes [13], [14]. If the worm tree has a power-law topology, only a very small number of hosts infect a large number of children, and a majority of hosts infect none or few children; the average path length would also increase approximately logarithmically with the total number of nodes [13]. Moreover, power-law topologies are robust to random node removal, but are vulnerable to the removal of a small portion of nodes with the highest node degrees. Random graphs, by contrast, are robust to both removal schemes [13]. Therefore, studying the structure of the worm tree can help provide insights into detecting and defending against botnets such as Conficker C.

To study these two metrics analytically, we apply probabilistic modeling methods and derive the joint probability distribution of the number of children and the generation through a sequential growth model. Specifically, we start from a worm tree with only patient zero and add new nodes into the worm tree sequentially. We then investigate the relationship between the two worm trees before and after a new node is added. From the joint distribution, we analyze the marginal distributions of the number of children and the generation. We also develop closed-form approximations to both marginal distributions and the joint distribution.
Different from other models that characterize the dynamics of worm propagation (e.g., the total number of infected hosts over time), our sequential growth model aims at capturing the main features of the topology formed by worm infection (e.g., the number of children and the generation).

As a first attempt, we analyze the worm tree formed by a wide class of worms such as random-scanning worms [8], routable-scanning worms [15], [9], importance-scanning worms [16], OPT-STATIC worms [17], and SUBOPT-STATIC worms [17]. For these worms, a new victim is compromised by each existing infected host with equal probability. We then verify the analytical results through simulations. We also employ simulations to investigate worm infection using localized scanning [18], [19]. Finally, we apply our analysis and observations to develop methods for detecting bots and study potential countermeasures for a botnet (e.g., Conficker C) that uses scan-based peer discovery to form a P2P-based botnet.

Through both analytical and empirical study, we make the following contributions. First, if a worm uses a scanning method for which a new victim is compromised by each existing infected host with equal probability, the number of children is shown both analytically and empirically to have asymptotically a geometric distribution with parameter 0.5. This means that on average half of infected hosts never compromise any target and over 98% of infected hosts have no more than five children. On the other hand, this also indicates that a small portion of hosts infect a large number of vulnerable hosts. Moreover, the generation is demonstrated to closely follow a Poisson distribution with parameter $H_n - 1$, where $n$ is the number of nodes and $H_n$ is the $n$-th harmonic number [20]. This means that the average path length of the worm tree increases approximately logarithmically with the number of nodes. Second, if a worm uses localized scanning, the number of children still has approximately a geometric distribution with parameter 0.5. Moreover, the generation still follows a Poisson distribution, but with the parameter depending on the probability of local scanning. Therefore, most previous observations also apply to localized-scanning worms. Finally, a direct application of the observations of the worm tree is bot detection in Conficker C like botnets. We show both analytically and empirically that while randomly examining a small portion of nodes in a botnet (i.e., random detection) can only expose a limited number of bots, examining the nodes with the largest number of children (i.e., targeted detection) is much more efficient in detecting bots. For example, our simulation shows that when 3.125% of nodes are examined, random detection exposes a total of 9.10% of bots, whereas targeted detection reveals 22.36% of bots. On the other hand, we also point out that future botnets can potentially use a simple method to weaken the performance of targeted detection, without greatly slowing down the speed of worm infection. To the best of our knowledge, this is the first attempt at understanding and exploiting the topology formed by worm infection quantitatively.

The remainder of this paper is structured as follows. Section 2 presents our sequential growth model and the assumptions used in analyzing the worm tree. Section 3 gives our analysis of the worm tree. Section 4 uses simulations to verify the analytical results and provide observations on the worm tree using the localized-scanning method. Section 5 further develops bot detection methods and studies potential countermeasures by future botnets. Finally, Section 6 discusses the related work, and Section 7 concludes this paper.

2 WORM TREE AND SEQUENTIAL GROWTH MODEL

In this section, we provide the background on the worm tree, and present the assumptions and the growth model.

An example of a worm tree is given in Fig. 1. Here, patient zero is the root and belongs to generation 0. The tail of an arrow is at the "father" or the infector, whereas the head of an arrow points to the "son" or the infectee. If a father belongs to generation $i$, then its children lie in generation $i+1$. In a worm tree with $n$ nodes, we use $L_n(i,j)$ ($0 \le i, j \le n-1$) to denote the number of nodes that have $i$ children and belong to generation $j$. Note that $\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} L_n(i,j) = n$. We also use $C_n(i)$ ($i = 0, 1, 2, \cdots, n-1$) to denote the number of nodes that have $i$ children and $G_n(j)$ ($j = 0, 1, 2, \cdots, n-1$) to denote the number of nodes in generation $j$. Moreover, $L_n(i,j)$, $C_n(i)$, and $G_n(j)$ are random variables. Thus, we define $p_n(i,j) = \frac{E[L_n(i,j)]}{n}$, representing the joint distribution of the number of children and the generation. Similarly, we define $c_n(i) = \frac{E[C_n(i)]}{n}$ to represent the marginal distribution of the number of children and $g_n(j) = \frac{E[G_n(j)]}{n}$ to represent the marginal distribution of the generation. Note that $c_n(i) = \sum_{j=0}^{n-1} p_n(i,j)$ and $g_n(j) = \sum_{i=0}^{n-1} p_n(i,j)$.

Although we model worm infection as a tree, different worm trees can show very different structures. Fig. 2 demonstrates two extreme cases of worm trees. Specifically, in Fig. 2(a), each infected host compromises one and only one host except the last infected host. In this case, if the total number of nodes is $n$, then $C_n(0) = 1$ and $C_n(1) = n-1$, which lead to $c_n(0) = \frac{1}{n}$ and $c_n(1) = \frac{n-1}{n} \approx 1$ when $n$ is large. That is, almost every node has one and only one child. Moreover, $G_n(j) = 1$, $j = 0, 1, 2, \cdots, n-1$, which means that $g_n(j) = \frac{1}{n}$. Thus, the average path length is $\sum_{j=0}^{n-1} j \cdot g_n(j) = \frac{n-1}{2} \sim O(n)$. That is, the average path length increases linearly with the number of nodes. Comparatively, Fig. 2(b) shows another case where all hosts (except patient zero) are infected by patient zero. For the distribution of the number of children, $c_n(n-1) = \frac{1}{n}$ and $c_n(0) = \frac{n-1}{n} \approx 1$ when $n$ is large, indicating that almost every node has no child. For the distribution of the generation, $g_n(0) = \frac{1}{n}$ and $g_n(1) = \frac{n-1}{n}$, so that the average path length is $\frac{n-1}{n} \approx 1$ when $n$ is large. Thus, the path length is close to a constant of 1. In this work, we attempt to identify the structure of the worm tree formed by Internet worm infection.

Fig. 2. Two extreme cases of worm trees. (a) Extreme case 1: a chain from patient zero (generation 0) down to generation $n-1$. (b) Extreme case 2: patient zero (generation 0) with all other nodes in generation 1.

To study the worm tree analytically, in this paper we make several assumptions and considerations. First, to simplify the model, we assume that infected hosts have the same scanning rate. This assumption is removed in Section 4, where we use simulations to study the effect of the variation of scanning rates on the worm tree. Second, we consider a wide class of worms for which a new victim is compromised by each existing infected host with equal probability. Such worms include random-scanning worms, routable-scanning worms, importance-scanning worms, OPT-STATIC worms, and SUBOPT-STATIC worms. Random scanning selects targets in the IPv4 address space randomly and has been the main scanning method for both worms and botnets [8], [3]; routable scanning finds victims in the routable IPv4 address space [15], [9]; and importance scanning probes subnets according to the vulnerable-host distribution [16]. OPT-STATIC and SUBOPT-STATIC are optimal and suboptimal scanning methods that are proposed in [17] to minimize the number of worm scans required to reach a predetermined fraction of vulnerable hosts. In Section 4.3, we extend our study to localized scanning, which preferentially searches for targets in the local subnet and has also been used by real worms [18], [19]. Third, we consider the classic susceptible → infected (SI) model, ignoring the cases in which an infected host can be cleaned and become vulnerable again, or can be patched and become invulnerable. The SI model assumes that once infected, a host remains infected. Such a simple model has been widely applied in studying worm infection [8], [9], [21], [17], and presents the worst-case scenario. Fourth, we assume that there is no re-infection. That is, if an infected host is hit by a worm scan, this host will not be re-infected. As a result, every infected host has one and only one father except for patient zero, and the resulting graph formed by worm infection is a tree. Fifth, we assume that the worm starts from one infected host, i.e., patient zero or a hitlist size of 1. When the hitlist size is larger than 1, the underlying infection topology is a worm forest, instead of a worm tree. Our analysis, however, can easily be extended to model the worm forest. Finally, to simplify the analysis, we assume that no two nodes are added to the worm tree at the same time. That is, no two vulnerable hosts are infected simultaneously. We relax this assumption in Section 4, where simulations are performed.

Based on these considerations and assumptions, the sequential growth model of a worm tree works as follows. We consider a fixed sequence of infected hosts (i.e., nodes) $v_1, v_2, \cdots$ and inductively construct a random worm tree $(T_n)_{n \ge 1}$, where $n$ is the number of nodes and $T_1$ has only patient zero. Infecting a new host is equivalent to adding a new node into the existing worm tree. Hence, given $T_{n-1}$, $T_n$ is formed by adding node $v_n$ together with an edge directed from an existing node $v_f$ to $v_n$. According to the assumption, $v_f$ is randomly chosen among the $n-1$ nodes in the tree, i.e., $\Pr(f = k) = \frac{1}{n-1}$, $k = 1, 2, \cdots, n-1$. Note that such a growth model and its variations have been widely used in studying topology generators [22], [23]. In this paper, we apply this model to characterize worm infection.

3 MATHEMATICAL ANALYSIS
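As a concrete reference point for the analysis in this section, the sequential growth model of Section 2 can be simulated directly. The following is a minimal Python sketch, not the authors' code: the function names `grow_worm_tree` and `empirical_distributions`, the 0-based node indexing (node 0 is patient zero), and the seed and run counts are assumptions made for illustration.

```python
import random
from collections import Counter


def grow_worm_tree(n, rng):
    """Grow one worm tree with n nodes under the sequential growth model.

    Node 0 is patient zero. Each new node picks its father uniformly at
    random among the nodes already in the tree, i.e. Pr(f = k) = 1/(n-1).
    Returns (children, generation), two lists indexed by node.
    """
    children = [0]      # number of children of each node
    generation = [0]    # patient zero belongs to generation 0
    for k in range(1, n):
        father = rng.randrange(k)            # uniform over the k existing nodes
        children[father] += 1
        children.append(0)                   # the new node starts as a leaf
        generation.append(generation[father] + 1)
    return children, generation


def empirical_distributions(n, runs, seed=0):
    """Estimate c_n(i) and g_n(j) by averaging over independent trees.

    The run count and seed are illustrative choices, not from the paper.
    """
    rng = random.Random(seed)
    child_counts, gen_counts = Counter(), Counter()
    for _ in range(runs):
        children, generation = grow_worm_tree(n, rng)
        child_counts.update(children)
        gen_counts.update(generation)
    total = n * runs
    c = {i: cnt / total for i, cnt in child_counts.items()}
    g = {j: cnt / total for j, cnt in gen_counts.items()}
    return c, g


if __name__ == "__main__":
    c, g = empirical_distributions(n=2000, runs=20)
    print("fraction of leaves c_n(0):", c[0])
    print("mean generation:", sum(j * p for j, p in g.items()))
```

Averaging over several trees approximates the expectations $E[C_n(i)]/n$ and $E[G_n(j)]/n$ used in the definitions of $c_n(i)$ and $g_n(j)$.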

In this section, we study the worm tree through mathematical analysis. Specifically, we first derive the joint distribution of the number of children and the generation, i.e., $p_n(i,j)$, by applying probabilistic methods. We then use $p_n(i,j)$ to analyze two marginal distributions, i.e., $c_n(i)$ and $g_n(j)$, and obtain their closed-form approximations. Finally, we find a closed-form approximation to $p_n(i,j)$.

For a worm tree with only patient zero (i.e., $n = 1$), since $L_1(0,0) = 1$ with probability 1, $p_1(0,0) = 1$. Similarly, for a worm tree with $n = 2$, it is evident that $L_2(1,0) = L_2(0,1) = 1$. Thus, $p_2(1,0) = p_2(0,1) = \frac{1}{2}$. We now consider $p_n(i,j)$ ($0 \le i, j \le n-1$) when $n \ge 3$. Specifically, we study two cases:

(1) $p_n(0,j)$, i.e., the proportion of the number of leaves in generation $j$ in $T_n$. Assume that $T_{n-1}$ is given, and there are $L_{n-1}(0,j)$ leaves in generation $j$ and in total $G_{n-1}(j-1) = \sum_{i=0}^{n-2} L_{n-1}(i,j-1)$ nodes in generation $j-1$. Note that we have extended the notation so that $G_{n-1}(-1) = L_{n-1}(i,-1) = 0$, $0 \le i \le n-2$. When a new node $v_n$ is added, $v_n$ becomes a leaf of $T_n$. If $v_n$ is connected to one of the existing nodes in generation $j-1$, $v_n$ belongs to generation $j$; the probability of such an event is $\frac{G_{n-1}(j-1)}{n-1}$. Moreover, if a leaf in generation $j$ in $T_{n-1}$ connects to $v_n$, this node is no longer a leaf and now has one child; the probability of this event is $\frac{L_{n-1}(0,j)}{n-1}$. Therefore, we can obtain the stochastic recurrence of $L_n(0,j)$:

$$L_n(0,j) = \begin{cases} L_{n-1}(0,j) + 1, & \text{w.p. } \frac{G_{n-1}(j-1)}{n-1} \\ L_{n-1}(0,j) - 1, & \text{w.p. } \frac{L_{n-1}(0,j)}{n-1} \\ L_{n-1}(0,j), & \text{otherwise.} \end{cases} \tag{1}$$

Given $T_{n-1}$ (i.e., $L_{n-1}(0,j)$ and $G_{n-1}(j-1)$), the conditional expected value of $L_n(0,j)$ is

$$[L_{n-1}(0,j) + 1] \cdot \frac{G_{n-1}(j-1)}{n-1} + [L_{n-1}(0,j) - 1] \cdot \frac{L_{n-1}(0,j)}{n-1} + L_{n-1}(0,j) \cdot \left[1 - \frac{G_{n-1}(j-1) + L_{n-1}(0,j)}{n-1}\right],$$

i.e.,

$$E[L_n(0,j) \mid T_{n-1}] = \frac{n-2}{n-1} L_{n-1}(0,j) + \frac{1}{n-1} G_{n-1}(j-1). \tag{2}$$

Applying $E[L_n(0,j)] = E[E[L_n(0,j) \mid T_{n-1}]]$ (i.e., the law of total expectation), we obtain

$$E[L_n(0,j)] = \frac{n-2}{n-1} E[L_{n-1}(0,j)] + \frac{1}{n-1} E[G_{n-1}(j-1)]. \tag{3}$$

Using the definitions $p_n(0,j) = \frac{E[L_n(0,j)]}{n}$ and $g_{n-1}(j-1) = \frac{E[G_{n-1}(j-1)]}{n-1} = \sum_{i=0}^{n-2} p_{n-1}(i,j-1)$, the above equation leads to

$$p_n(0,j) = \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} g_{n-1}(j-1) \tag{4}$$

$$\phantom{p_n(0,j)} = \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} \sum_{i=0}^{n-2} p_{n-1}(i,j-1). \tag{5}$$

(2) $p_n(i,j)$, $1 \le i \le n-1$. Given $L_{n-1}(i,j)$ and $L_{n-1}(i-1,j)$ in $T_{n-1}$, we study $L_n(i,j)$ in $T_n$. When the new node $v_n$ is added into $T_{n-1}$, $v_n$ is connected to a node with $i-1$ children and in generation $j$ with probability $\frac{L_{n-1}(i-1,j)}{n-1}$, or is connected to a node with $i$ children and in generation $j$ with probability $\frac{L_{n-1}(i,j)}{n-1}$. Thus, in $T_n$,

$$L_n(i,j) = \begin{cases} L_{n-1}(i,j) + 1, & \text{w.p. } \frac{L_{n-1}(i-1,j)}{n-1} \\ L_{n-1}(i,j) - 1, & \text{w.p. } \frac{L_{n-1}(i,j)}{n-1} \\ L_{n-1}(i,j), & \text{otherwise.} \end{cases} \tag{6}$$

This relationship leads to

$$E[L_n(i,j) \mid T_{n-1}] = \frac{n-2}{n-1} L_{n-1}(i,j) + \frac{1}{n-1} L_{n-1}(i-1,j). \tag{7}$$

Therefore,

$$E[L_n(i,j)] = \frac{n-2}{n-1} E[L_{n-1}(i,j)] + \frac{1}{n-1} E[L_{n-1}(i-1,j)]. \tag{8}$$

That is,

$$p_n(i,j) = \frac{n-2}{n} p_{n-1}(i,j) + \frac{1}{n} p_{n-1}(i-1,j). \tag{9}$$

Fig. 3. Joint distribution of the number of children and the generation ($n = 2000$).

Summarizing the above two cases, we have the following theorem:

Theorem 1:

When $n \ge 3$, the joint distribution of the number of children and the generation in a worm tree $T_n$ follows

$$p_n(i,j) = \begin{cases} \frac{n-2}{n} p_{n-1}(0,j) + \frac{1}{n} g_{n-1}(j-1), & i = 0 \\ \frac{n-2}{n} p_{n-1}(i,j) + \frac{1}{n} p_{n-1}(i-1,j), & \text{otherwise,} \end{cases} \tag{10}$$

where $0 \le i, j \le n-1$.

Theorem 1 provides a way to calculate $p_n(i,j)$ recursively from $p_2(i,j)$. Fig. 3 shows a snapshot of $p_n(i,j)$ when $n = 2000$. It can be seen that when the generation is specified (i.e., $j$ is fixed), $p_n(i,j)$ is a monotonic function and decreases quickly as $i$ increases. On the other hand, when the number of children is given (i.e., $i$ is fixed), $p_n(i,j)$ has a bell shape. Moreover, since $\sum_{i=0}^{[\ldots]} \sum_{j=0}^{[\ldots]} p_n(i,j) = 0.[\ldots]$, most nodes do not have a large number of children, and the worm tree does not have a large average path length.

We use $p_n(i,j)$ to derive the marginal distribution of the number of children, i.e., $c_n(i)$. Similarly, we study two cases:

(1) $c_n(0)$, i.e., the proportion of the number of leaves in $T_n$. Since $c_n(0) = \sum_{j=0}^{n-1} p_n(0,j)$ and $\sum_{j=0}^{n-1} g_{n-1}(j-1) = 1$, we obtain the recursive relationship of $c_n(0)$ from Equation (4):

$$c_n(0) = \frac{n-2}{n} c_{n-1}(0) + \frac{1}{n}. \tag{11}$$

Moreover, note that $c_2(0) = \frac{1}{2}$. If we assume that $c_{n-1}(0) = \frac{1}{2}$, we can obtain by induction that

$$c_n(0) = \frac{1}{2}. \tag{12}$$

This indicates that no matter how many nodes are in the worm tree, on average half of the nodes are leaves, i.e., on average 50% of infected hosts never compromise any target.

(2) $c_n(i)$, $1 \le i \le n-1$. From Equation (9) and $c_n(i) = \sum_{j=0}^{n-1} p_n(i,j)$, we find the recurrence of $c_n(i)$ as follows:

$$c_n(i) = \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1). \tag{13}$$

Summarizing the above two cases, we have the following theorem on the distribution of the number of children:

Theorem 2:

When $n \ge 3$, the distribution of the number of children in a worm tree $T_n$ follows

$$c_n(i) = \begin{cases} \frac{1}{2}, & i = 0 \\ \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1), & 1 \le i \le n-1. \end{cases} \tag{14}$$

From Theorem 2, we can derive the statistical properties of the number of children as follows.

Corollary 1:

When $n \ge 1$, the expectation and the variance of the number of children are

$$E_n[C] = \sum_{i=0}^{n-1} i \cdot c_n(i) = \frac{n-1}{n} \tag{15}$$

$$\mathrm{Var}_n[C] = \sum_{i=0}^{n-1} (i - E_n[C])^2 \cdot c_n(i) = 2 - \frac{n-1}{n^2} - \frac{2H_n}{n}, \tag{16}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the $n$-th harmonic number [20].

The proof of Corollary 1 is given in Appendix 1. One intuitive way to derive $E_n[C]$ is that in worm tree $T_n$, there are $n-1$ directed edges and $n$ nodes. Thus, the average number of edges (i.e., the average number of children) of a node is $\frac{n-1}{n}$. Moreover, since $H_n$ is $O(1 + \ln n)$, $\lim_{n \to \infty} E_n[C] = 1$ and $\lim_{n \to \infty} \mathrm{Var}_n[C] = 2$.

Fig. 4. Distribution of the number of children: $c_n(i)$ versus the number of children $i$, from analysis with $n = 1000$, $2000$, $5000$, and $20000$, together with the geometric distribution.
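The recurrence of Theorem 2 and the closed forms of Corollary 1 can be cross-checked with exact rational arithmetic. The sketch below is illustrative only: the function names are hypothetical, and starting the iteration from $c_2(0) = c_2(1) = \frac{1}{2}$ (which follows from $p_2(1,0) = p_2(0,1) = \frac{1}{2}$) is a choice made for this example.

```python
from fractions import Fraction


def children_distribution(n):
    """Return [c_n(0), ..., c_n(n-1)] as exact fractions via Theorem 2.

    Assumption of this sketch: the iteration starts from n = 2, where
    c_2(0) = c_2(1) = 1/2.
    """
    if n < 2:
        raise ValueError("this sketch starts the recurrence at n = 2")
    c = [Fraction(1, 2), Fraction(1, 2)]
    for m in range(3, n + 1):
        prev = c + [Fraction(0)]          # c_{m-1}(m-1) = 0: no node has m-1 children
        c = [Fraction(1, 2)] + [
            Fraction(m - 2, m) * prev[i] + Fraction(1, m) * prev[i - 1]
            for i in range(1, m)
        ]
    return c


def mean_and_variance(c):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(i * p for i, p in enumerate(c))
    var = sum((i - mean) ** 2 * p for i, p in enumerate(c))
    return mean, var


def corollary1(n):
    """Closed forms (15) and (16): E_n[C] and Var_n[C]."""
    h_n = sum(Fraction(1, k) for k in range(1, n + 1))
    return Fraction(n - 1, n), 2 - Fraction(n - 1, n * n) - 2 * h_n / n


if __name__ == "__main__":
    for n in (3, 10, 50):
        # Exact equality holds because everything is a Fraction.
        assert mean_and_variance(children_distribution(n)) == corollary1(n)
    print(children_distribution(4))
```

Exact fractions keep the comparison free of rounding error; for large $n$ (such as the values plotted in Fig. 4) floating-point arithmetic would be the more practical choice.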

Theorem 2 also leads to a simple closed-form expression of the distribution of the number of children when $n$ is very large, as shown in the following corollary.

Corollary 2:

When $n \to \infty$, the number of children has a geometric distribution with parameter $\frac{1}{2}$, i.e.,

$$c(i) = \lim_{n \to \infty} c_n(i) = \left(\frac{1}{2}\right)^{i+1}, \quad i = 0, 1, 2, \cdots. \tag{17}$$

The proof of Corollary 2 is given in Appendix 2. Corollary 2 indicates that when $n$ is very large, $c_n(i)$ decreases approximately exponentially with a decay constant of $\ln 2$ as the number of children increases.

We further study how $c_n(i)$ varies with $n$ when both $n$ and $i$ are finite and large, i.e., how the tail of the distribution of the number of children changes with $n$. First, note that $c_3(0) = \frac{1}{2}$, $c_3(1) = \frac{1}{3}$, and $c_3(2) = \frac{1}{6}$. Thus, from Equation (13), we can prove by induction that $c_n(i)$ ($n \ge 3$) is a decreasing function of $i$, i.e., $c_n(i) < c_n(i-1)$, for $1 \le i \le n-1$. Next, putting this inequality into Equation (13), we have $c_n(i) > \frac{n-1}{n} c_{n-1}(i)$. Hence, when $n$ is very large, $\frac{n-1}{n} \approx 1$, and $c_n(i) > c_{n-1}(i)$, which indicates that the tail of $c_n(i)$ increases with $n$. Fig. 4 verifies this result, showing $c_n(i)$ obtained from Theorem 2 when $n = 1000$, $2000$, $5000$, and $20000$, as well as the geometric distribution with parameter 0.5 obtained from Corollary 2. Note that the y-axis uses a log scale. It can be seen that when $n$ increases from 1000 to 20000, the tail of $c_n(i)$ also increases to approach the tail of the geometric distribution. Moreover, the geometric distribution approximates the distribution of the number of children well when $n$ is large.

Next, we derive the generation distribution (i.e., $g_n(j)$) in a similar manner to the case of $c_n(i)$. Using Theorem 1 and $g_n(j) = \sum_{i=0}^{n-1} p_n(i,j)$, we obtain the following theorem:

Theorem 3:

When $n \ge 3$, the distribution of the generation in a worm tree $T_n$ follows

$$g_n(j) = \frac{n-1}{n} g_{n-1}(j) + \frac{1}{n} g_{n-1}(j-1), \quad 0 \le j \le n-1, \tag{18}$$

where $g_{n-1}(-1) = 0$.

Fig. 5. Distribution of the generation: $g_n(j)$ versus the generation $j$, from analysis with $n = 1000$, $2000$, $5000$, and $20000$, each compared with the Poisson distribution with parameter $\lambda_n = H_n - 1$.

Theorem 3 gives a method to calculate the distribution of the generation recursively. Moreover, from Theorem 3, we can derive the statistical properties of the generation distribution in the following corollary.

Corollary 3:

When $n \ge 1$, the expectation and the variance of the generation are

$$E_n[G] = \sum_{j=0}^{n-1} j \cdot g_n(j) = H_n - 1 \tag{19}$$

$$\mathrm{Var}_n[G] = \sum_{j=0}^{n-1} (j - E_n[G])^2 \cdot g_n(j) = H_n - H_{n,2}, \tag{20}$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ and $H_{n,2} = \sum_{i=1}^{n} \frac{1}{i^2}$.

The proof of Corollary 3 is given in Appendix 3. From Corollary 3, we have some interesting observations. Since $H_n$ is $O(1 + \ln n)$ and $H_{\infty,2} = \zeta(2) = \frac{\pi^2}{6} \approx 1.645$ is the Riemann zeta function of 2 [24], both $E_n[G]$ and $\mathrm{Var}_n[G]$ are $O(1 + \ln n)$. This indicates that the average path length of the worm tree (i.e., $E_n[G]$) increases approximately logarithmically with $n$. Moreover, when $n \to \infty$, $\lim_{n \to \infty} (E_n[G] - \ln n) = \gamma - 1$ and $\lim_{n \to \infty} (\mathrm{Var}_n[G] - \ln n) = \gamma - \zeta(2)$, where $\gamma \approx 0.577$ is the Euler-Mascheroni constant [25]. Therefore, when $n$ is large, $E_n[G] \approx \mathrm{Var}_n[G]$. Furthermore, we can use Theorem 3 to obtain a closed-form approximation to $g_n(j)$ as follows.

Corollary 4:

When $n$ is very large, the generation distribution $g_n(j)$ can be approximated by a Poisson distribution with parameter $\lambda_n = E_n[G] = H_n - 1$. That is,

$$g_n(j) \approx \frac{\lambda_n^j}{j!} e^{-\lambda_n}, \quad 0 \le j \le n-1. \tag{21}$$

The proof of Corollary 4 is given in Appendix 4. Fig. 5 verifies Corollary 4, showing $g_n(j)$ obtained from Theorem 3 when $n = 1000$, $2000$, $5000$, and $20000$, as well as the Poisson distribution with parameter $E_n[G]$. It can be seen that when $n$ is large, the Poisson distribution fits the generation distribution closely.

Finally, we derive a closed-form approximation to the joint distribution $p_n(i,j)$. From Equation (9), we can see that when $n \to \infty$, $p_n(i,j) = p_{n-1}(i,j)$. Substituting this into Equation (9) gives $\frac{2}{n} p_n(i,j) = \frac{1}{n} p_n(i-1,j)$, which yields

$$p_n(i,j) = \frac{1}{2} p_n(i-1,j). \tag{22}$$

Hence, since summing $p_n(i,j) = \left(\frac{1}{2}\right)^i p_n(0,j)$ over $i$ gives $g_n(j) \approx 2 p_n(0,j)$, we can obtain

$$p_n(i,j) = \left(\frac{1}{2}\right)^i p_n(0,j) \approx \left(\frac{1}{2}\right)^{i+1} g_n(j). \tag{23}$$

Since $g_n(j)$ closely follows the Poisson distribution of Corollary 4 when $n$ is very large,

$$p_n(i,j) \approx \left(\frac{1}{2}\right)^{i+1} \cdot \frac{\lambda_n^j}{j!} e^{-\lambda_n}, \quad 0 \le i, j \le n-1, \tag{24}$$

where $\lambda_n = H_n - 1$. The above derivation also shows that when $n$ is very large, the number of children and the generation are almost independent random variables.

Fig. 6 shows the parity plot of the approximation to the joint distribution when $n = 2000$. In the figure, the x-axis is the actual $p_n(i,j)$ obtained from Theorem 1, and the y-axis is the approximated $p_n(i,j)$ from Equation (24), where $0 \le i, j \le [\ldots]$. It can be seen that most points are on or near the diagonal line, indicating that the approximation to the joint distribution is reasonable.

Fig. 6. Parity plot of the approximation to the joint distribution ($n = 2000$): actual joint distribution (x-axis) versus approximated joint distribution (y-axis).

4 SIMULATIONS AND VERIFICATION
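As a lightweight numerical check that complements the worm simulations in this section, the recurrence of Theorem 3 can be iterated directly and compared with Corollaries 3 and 4. The sketch below is illustrative only: the function names, the choice to start from $g_2(0) = g_2(1) = \frac{1}{2}$ (implied by $p_2(1,0) = p_2(0,1) = \frac{1}{2}$), and the use of total variation distance as a summary of the Poisson fit are assumptions of this example rather than part of the paper.

```python
import math


def generation_distribution(n):
    """Return [g_n(0), ..., g_n(n-1)] by iterating Theorem 3 from n = 2."""
    if n < 2:
        raise ValueError("this sketch starts the recurrence at n = 2")
    g = [0.5, 0.5]                              # g_2(0), g_2(1)
    for m in range(3, n + 1):
        prev = g + [0.0]                        # g_{m-1}(m-1) = 0
        g = [
            ((m - 1) * prev[j] + (prev[j - 1] if j > 0 else 0.0)) / m
            for j in range(m)
        ]
    return g


def harmonic(n, power=1):
    """Generalized harmonic number: sum of 1 / k**power for k = 1..n."""
    return sum(1.0 / k ** power for k in range(1, n + 1))


def poisson_pmf(lam, j):
    """Poisson probability computed in log space to avoid overflow in j!."""
    return math.exp(j * math.log(lam) - lam - math.lgamma(j + 1))


def compare_with_poisson(n):
    """Return (mean, variance, total variation distance to Poisson(H_n - 1))."""
    g = generation_distribution(n)
    mean = sum(j * p for j, p in enumerate(g))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(g))
    lam = harmonic(n) - 1.0
    tv = 0.5 * sum(abs(p - poisson_pmf(lam, j)) for j, p in enumerate(g))
    return mean, var, tv


if __name__ == "__main__":
    n = 2000
    mean, var, tv = compare_with_poisson(n)
    print("E_n[G]   :", mean, "vs H_n - 1      =", harmonic(n) - 1.0)
    print("Var_n[G] :", var, "vs H_n - H_{n,2} =", harmonic(n) - harmonic(n, 2))
    print("total variation distance to Poisson:", tv)
```

The mean and variance should agree with Equations (19) and (20) up to floating-point error, while the total variation distance gives a single-number view of how closely the Poisson approximation of Equation (21) tracks $g_n(j)$.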

In this section, we study the worm infection structure through simulations. As far as we know, there is no publicly available data to show the real worm tree and verify our analytical results. Moreover, real experiments in a controlled environment are impractical for this study since the closed-form approximations are derived based on the assumption that the number of nodes is very large. Therefore, we apply empirical simulations. Specifically, we first simulate the infection structure of the Code Red v2 worm and then study the effects of important parameters on the worm tree. Finally, we extend our simulation to localized-scanning worms.
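A discrete-time random-scanning simulation that records the worm tree can be sketched in a few lines. This is a toy illustration, not the simulator of [26]: the function name, the address-space layout, and the small parameter values are assumptions chosen so that the example runs quickly, and they are not the Code Red v2 settings used in this section.

```python
import random


def simulate_random_scanning(num_vulnerable, address_space, scans_per_tick, seed=0):
    """Toy discrete-time random-scanning worm that records who infects whom.

    Assumptions of this sketch (not from the paper): vulnerable hosts occupy
    addresses 0..num_vulnerable-1 (placement is irrelevant under uniform
    scanning), host 0 is patient zero, and every infected host sends the same
    number of scans per time tick. Hosts infected during a tick start
    scanning at the next tick, and infected hosts are never re-infected.
    """
    rng = random.Random(seed)
    parent = {0: None}       # infectee -> infector
    generation = {0: 0}
    children = {0: 0}
    while len(parent) < num_vulnerable:
        scanners = list(parent)                  # hosts infected before this tick
        for host in scanners:
            for _ in range(scans_per_tick):
                target = rng.randrange(address_space)
                if target < num_vulnerable and target not in parent:
                    parent[target] = host
                    generation[target] = generation[host] + 1
                    children[target] = 0
                    children[host] += 1
    return parent, generation, children


if __name__ == "__main__":
    parent, generation, children = simulate_random_scanning(
        num_vulnerable=1000, address_space=2 ** 14, scans_per_tick=2
    )
    n = len(parent)
    leaves = sum(1 for c in children.values() if c == 0)
    print("infected hosts:", n)
    print("fraction with no children:", leaves / n)
    print("mean generation:", sum(generation.values()) / n)
```

Unlike the sequential growth model, several hosts can be infected within one tick here, which mirrors the relaxation described for the simulations in this section.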

Number of children ( i ) c n ( i ) Simulation ( n = n / n = n )Simulation ( n = 4 n )Geometric (a) Number of children. Generation ( j ) g n ( j ) Simulation ( n = n / n = n )Simulation ( n = 4 n )Poisson ( λ n = H n / − λ n = H n − λ n = H n − (b) Generation. Measured joint distribution A pp r o x i m a t e d j o i n t d i s t r i bu t i o n (c) Joint distribution ( n = n ). Fig. 7. Simulating the infection structure of the Code Red v2 worm ( n = 360000 ). We simulate the propagation of the Code Red v2 wormby using and extending the simulator in [26]. Speciﬁcally,the simulator considers a discrete-time system and mimicsthe random-scanning behavior of infected hosts duringeach discrete time interval. Moreover, the parameter settingis based on the Code Red v2 worm’s characteristics. Forexample, the vulnerable population is n = 360 , , and anewly infected host is assigned with a scanning rate of 358scans/min. Detailed information about how the parametersare chosen can be found in Section VII of [27]. We thenextend the simulator to track the worm infection structureby adding the information of the number of children andthe generation to each infected host. Moreover, we set thetime unit to 20 seconds and start our simulation at time tick0 with patient zero. Note that we remove the assumptionused in the sequential growth model that no two hosts arecompromised at the same time. That is, multiple hosts canbe compromised at one time tick. Moreover, all new victimsof the current time tick start scanning at the next time tick.The simulation results (mean ± standard deviation) areobtained from 100 independent runs with different seedsand are presented in Fig. 7.Fig. 7(a) shows the distribution of the number of children,comparing the simulation results of c n ( i ) for n = n / , n , and n with the geometric distribution obtained fromCorollary 2. Note that the y-axis uses the log-scale. Thedotted line represents the standard deviation that goes intothe negative territory. 
It can be seen that the distribution of the number of children can be well approximated by the geometric distribution with parameter 0.5. This implies that $c_n(i)$ decreases approximately exponentially with a decay constant of $\ln 2$. Specifically, in all three cases, on average 50.0% of the infected hosts do not have children, about 98.4% of them have no more than five children, and 0.1% of them have no fewer than ten children. We also calculate the expectation and the variance of the number of children from the simulation and find that they are identical to the analytical results obtained from Corollary 1.

Fig. 7(b) demonstrates the generation distribution, comparing the simulation results of $g_n(j)$ for n = n / , n , and n with the Poisson distributions with parameter $E_n[G] = H_n - 1$ obtained from Corollary 4. It can be seen that the simulation results of $g_n(j)$ closely follow the Poisson distributions for all three cases. Hence, simulation results verify that the average path length of the worm tree increases approximately logarithmically with the total number of infected hosts. Moreover, we also compute the expectation and the variance of the generation in simulations and verify the analytical results in Corollary 3.

Fig. 7(c) compares the measured joint distribution from simulations with the approximated joint distribution from Equation (24) by using the parity plot. It can be seen that most points are on or near the diagonal line, indicating that the approximation works well.

Next, we extend our simulator to examine the effects of three important parameters of worm propagation on the worm tree: the scanning rate, the scanning rate standard deviation, and the hitlist size. When a parameter is studied and varied, we set the other parameters to those of the Code Red v2 worm as used in Section 4.1. The simulation results are obtained from 100 independent simulation runs and are shown in Fig. 8.

[Fig. 8. Effects of the scanning rate, the scanning rate standard deviation, and the hitlist size on the distributions of the number of children and the generation ($n = 360000$). (a) Effect of scanning rates ($s$ = 158, 358, 558) on $c_n(i)$. (b) Effect of scanning rates on $g_n(j)$. (c) Effect of scanning rate standard deviation ($\sigma$ = 0, 100, 200) on $c_n(i)$. (d) Effect of scanning rate standard deviation on $g_n(j)$. (e) Effect of hitlist sizes (1, 10, 100) on $c_n(i)$. (f) Effect of hitlist sizes on $g_n(j)$.]

Figs. 8(a) and (b) show the effect of varying the scanning rate $s$ (scans/min) from 158 to 558 on the distributions of the number of children and the generation. Here, the scanning rate is set to a fixed value for every infected host, i.e., the scanning rate standard deviation is 0. The figures also plot the geometric distribution with parameter 0.5 and the Poisson distribution with parameter $H_n - 1$ for reference. It can be seen that the scanning rate does not affect the worm tree structure.

Figs. 8(c) and (d) demonstrate the effect of the variation of the scanning rates among different hosts (i.e., $\sigma$). In our simulation, a newly infected host is assigned a scanning rate (scans/min) drawn from a normal distribution $N(358, \sigma^2)$. The figures show the simulation results when $\sigma$ = 0, 100, and 200. It can be seen that while the scanning rate standard deviation $\sigma$ has no effect on the generation distribution, it does affect the distribution of the number of children. Specifically, when $\sigma$ increases, the tail of $c_n(i)$ moves upward from the geometric distribution with parameter 0.5. This is because when $\sigma$ becomes larger, the variation of the scanning rate among infected hosts is greater. That is, there are more hosts with high scanning rates and also more hosts with low scanning rates. As a result, those hosts with high scanning rates tend to infect a large number of hosts, making the tail of $c_n(i)$ move upward. However, it is also observed that when $\sigma$ is not very large (the case for real worms), the geometric distribution with parameter 0.5 is still a good approximation.

In Figs. 8(e) and (f), we show the effect of the hitlist size on the worm tree. As pointed out in Section 2, when the hitlist size is greater than 1, the underlying infection topology is a worm forest with the number of trees equal to the hitlist size. Moreover, in a worm forest, it is intuitive that each tree is a smaller version of the single worm tree of hitlist size 1 and has fewer nodes. Hence, it is not surprising to see that in Fig. 8(f), the generation distribution moves leftward when the hitlist size increases. However, the generation distribution can still be well approximated by the Poisson distribution with parameter $H_{n_h} - 1$, where $n_h$ is the average number of nodes in a tree. Moreover, since in each tree the distribution of the number of children can be approximated by the geometric distribution with parameter 0.5, in the worm forest $c_n(i)$ still closely follows the same distribution.

Finally, we extend our simulation study to the infection tree of localized-scanning worms. Different from random scanning, localized scanning preferentially searches for targets in the “local” address space [8].
As a result, when a new node is added to the worm tree, it connects with a higher probability to one of the existing nodes that are in the same “local” address space. That is, the growth model is no longer uniform attachment as studied in Section 3. For simplicity, in this paper we only consider the /l localized scanning [19]:
• Local scanning: $p_a$ ($0 \le p_a < 1$) of the time, a “local” address with the same first $l$ ( ≤ l ≤ ) bits as the attacking host is chosen as the target.
• Global scanning: $1 - p_a$ of the time, a random address is chosen.

Note that random scanning can be regarded as a special case of localized scanning when $p_a = 0$. Moreover, if local scanning is selected, it can be regarded as random scanning in a local /l subnet. It has been shown that since the vulnerable-host distribution is highly uneven, localized scanning can spread a worm much faster than random scanning [18], [21].

[Fig. 9. Simulating the infection structure of the localized-scanning worm ($n = 360000$, $s = 358$ scans/min, $\sigma = 0$, hitlist = 1, and $l = 8$). (a) Number of children: $c_n(i)$ versus the number of children $i$. (b) Generation: $g_n(j)$ versus the generation $j$, simulation results and Poisson distributions. (c) Joint distribution: approximated versus measured joint distribution.]

We extend our simulator to imitate the spread of localized-scanning worms. We extract the distribution of vulnerable hosts in /l subnets from the dataset provided by DShield [28], [29]. Specifically, we use the dataset in [29] with port 80 (HTTP), which is exploited by the Code Red worm, to generate the vulnerable-host distribution. Moreover, we use similar parameters as in Section 4.1 (e.g., $n = 360000$, $s = 358$ scans/min, $\sigma = 0$, and hitlist = 1) and set the subnet level to 8 (i.e., $l = 8$). The results are obtained from 100 independent simulation runs and are shown in Fig. 9. For each run, patient zero is randomly chosen from vulnerable hosts.

Fig. 9(a) compares the simulation results of the distributions of the number of children (i.e., $c_n(i)$) when $p_a$ = 0, 0.3, and 0.6 with the geometric distribution with parameter 0.5. It is surprising that $c_n(i)$ of localized-scanning worms can still be well approximated by the geometric distribution. That is, the majority of nodes have few children, whereas a small portion of compromised hosts infect a large number of hosts. An intuitive explanation is given as follows. From Fig. 7(a), it can be seen that the total number of nodes has a minor effect on $c_n(i)$. Hence, if in a /8 subnet the majority of vulnerable hosts are infected through local scanning, it is expected that $c_n(i)$ of these hosts still closely follows the geometric distribution, since the local scanning can be regarded as random scanning inside a /8 subnet. Therefore, both local infection and global infection lead $c_n(i)$ towards the geometric distribution with parameter 0.5. On the other hand, it can also be seen that when $p_a$ increases, the tail of $c_n(i)$ moves slightly downward. This is because as $p_a$ increases, more vulnerable hosts are infected through local scanning. Hence, it is more difficult for an infected host to find targets after vulnerable hosts in this host’s local subnet have been exhausted. As a result, when $p_a$ increases, fewer nodes can have a large number of children.

[Fig. 10. Effect of the subnet level: $c_n(i)$ versus the number of children $i$, simulation results for $l$ = 16, 12, and 8 and the geometric distribution.]

Fig. 9(b) demonstrates that the generation distribution of localized-scanning worms (i.e., $g_n(j)$) can be well approximated by the Poisson distribution for the cases of $p_a$ = 0, 0.3, and 0.6. The Poisson parameter, however, depends not only on $n$, but also on $p_a$. We further define $\lambda^{p_a}_n = E^{p_a}_n[G]$ as the expectation of the generation for a localized-scanning worm with parameter $p_a$.
Here, $E^{p_a}_n[G]$ can be easily estimated from the simulation results of $g_n(j)$. Fig. 9(c) further shows the parity plot of the simulated joint distribution and the approximated joint distribution from Equation (24) when $p_a$ = 0 . . Since most points are on or near the diagonal line, the approximation is reasonable.

Moreover, Fig. 10 shows the effect of the subnet level (i.e., $l$) on the distribution of the number of children (i.e., $c_n(i)$). It can be seen that when $l$ increases, the tail of $c_n(i)$ moves downward. The reason is similar to the argument used for Fig. 9(a), i.e., as $l$ increases, fewer nodes can have a large number of children. However, the figure also demonstrates that the geometric distribution with parameter 0.5 is still a good approximation to $c_n(i)$, especially when the number of children is not large.

APPLICATIONS OF OBSERVATIONS

Our observations on the topologies formed by worm infection have important implications and applications for both defenders and attackers. For example, we have found that the generation distribution closely follows the Poisson distribution and the average path length increases approximately logarithmically with the number of nodes. On one hand, some schemes have been proposed to trace worms back to their origins through the cooperation between infected hosts [30], [31], and our work quantifies the average path length that describes a lower bound of the number of hosts required to cooperate. On the other hand, this average path length reflects the delay or the effort for a botmaster to deliver a command to all bots in a P2P-based botnet like Conficker C, and our results show that the botnet is scalable and can efficiently forward commands to a large number of bots. In this section, we focus on the applications of the distribution of the number of children for both defenders and attackers. Specifically, we study a simple and efficient bot detection method in a Conficker C like P2P-based botnet and consider a countermeasure by future botnets.

We consider a P2P-based botnet formed by worm scanning/infection. That is, once a host infects another host, they become peers in the resulting P2P-based botnet. When a defender captures an infected host in a botnet, the defender can process the historic records inside the host or monitor the traffic going into or out of the host, and will potentially detect other infected hosts such as the father and the children of this infected host. Then, our question is: if a defender can only access a small portion of nodes in a botnet, how many bots will be detected by the defender? Moreover, inspired by the random removal and targeted removal methods used in analyzing the robustness of a topology [13], here we study two bot detection strategies:
• Random detection: Access bots randomly.
• Targeted detection: Access bots that have the largest number of children.

Analytically, we suppose that a defender can access a small ratio of bots in a botnet. We assume that an accessed bot exposes itself, its father, and its children to the defender. To simplify the analysis, we also assume that the accessed bot ratio, $A$, is a power of 0.5 and all exposed nodes are different nodes. We then calculate the average percentages of exposed bots by random detection and targeted detection.

Since from Corollary 1 a randomly selected node has approximately one child, the average percentage of bots that can be exposed by random detection is then
$$D_R = 3A. \tag{25}$$
For targeted detection, since the nodes with the largest number of children are chosen and the number of children follows asymptotically a geometric distribution with parameter 0.5 as shown in Corollary 2,
$$A = \sum_{i \ge d} c_n(i) = \sum_{i=d}^{\infty} \left(\frac{1}{2}\right)^{i+1} = \left(\frac{1}{2}\right)^{d}, \tag{26}$$
where $d$ is the smallest number of children of accessed nodes. That is, $d = -\log_2 A$. Therefore, the average percentage of exposed nodes by targeted detection is
$$D_T = \sum_{i=d}^{\infty} (2 + i) \cdot c_n(i) = (d + 3) \left(\frac{1}{2}\right)^{d} = A (3 - \log_2 A). \tag{27}$$
Compared with random detection, targeted detection can expose $(-A \log_2 A) \times n$ more nodes. For example, if $A = 1/64$, on average random detection can detect 4.69% of nodes, whereas targeted detection can expose 14.06% of bots.

[Fig. 11. Random and targeted detection: percentage of exposed hosts ($D$) versus ratio of accessed hosts ($A$), simulation and analysis for both strategies.]

We then extend our simulation in Section 4.1 to study the effectiveness of random and targeted detection strategies. Fig. 11 shows the simulation results over 100 independent runs for both strategies, as well as the analytical results from Equations (25) and (27), when A = , , and . It can be seen that the analytical results slightly overestimate the exposed host percentage. This is because in our analysis we ignore the case that two exposed nodes can be duplicates. Fig. 11 also demonstrates that targeted detection performs much better than random detection. For example, in our simulation, when $A$ = 3.125%, 9.10% of the bots are exposed under random detection, whereas 22.36% of the bots are detected under targeted detection. Therefore, when a small portion of bots are examined, the botnets formed by worm infection are robust to random detection, but are relatively vulnerable to targeted detection.

To counteract the targeted detection method, an intuitive way for botnets is to limit the maximum number of children for each node. That is, set a small number $m$. Once an infected host has compromised $m$ other hosts, this host stops scanning. In this way, there is no node with a large number of children. Moreover, the infected hosts can self-stop scanning, potentially reducing the worm traffic [32].

To analyze the robustness of such botnets against targeted detection, we extend Corollary 2 to obtain an approximated distribution of the number of children in a botnet with the countermeasure:
$$c_n(i) = \begin{cases} \left(\frac{1}{2}\right)^{i+1}, & i = 0, 1, 2, \cdots, m-1 \\ \left(\frac{1}{2}\right)^{m}, & i = m. \end{cases} \tag{28}$$

[Fig. 12. A worm countermeasure via limiting the maximum number of children. (a) Distribution of the number of children: $c_n(i)$ versus the number of children $i$ for $m$ = 2, 3, 4, 5, and ∞, with the geometric distribution. (b) Targeted detection: percentage of exposed hosts ($D$) versus ratio of accessed hosts ($A$), simulation for $m$ = 2, 3, 4, and 5. (c) Worm propagation speed: number of infected hosts versus time (min) for $m$ = 2, 3, 4, 5, and ∞.]
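Equations (25) to (28) are simple enough to check numerically. The sketch below is an illustration rather than part of the simulator: it evaluates the closed forms for random and targeted detection, recomputes Equation (27) by direct summation over the geometric distribution of Corollary 2, and confirms that the capped distribution of Equation (28) sums to one. The function names and the truncation length `terms` are hypothetical.

```python
import math


def children_pmf(i):
    """Geometric distribution with parameter 0.5: c(i) = (1/2)^(i+1)."""
    return 0.5 ** (i + 1)


def capped_children_pmf(i, m):
    """Equation (28): children distribution when each host stops at m children."""
    if i < 0 or i > m:
        return 0.0
    return 0.5 ** m if i == m else 0.5 ** (i + 1)


def random_detection(a):
    """Equation (25): each accessed bot exposes itself, its father, ~1 child."""
    return 3 * a


def targeted_detection(a):
    """Equation (27), closed form; a is assumed to be a power of 0.5."""
    return a * (3 - math.log2(a))


def targeted_detection_by_sum(d, terms=200):
    """Equation (27), left-hand side, truncated after `terms` summands."""
    return sum((2 + i) * children_pmf(i) for i in range(d, d + terms))


if __name__ == "__main__":
    for d in (5, 6):
        a = 0.5 ** d  # Equation (26): A = (1/2)^d
        print(f"A = 1/{2 ** d}: D_R = {random_detection(a):.4%}, "
              f"D_T = {targeted_detection(a):.4%}, "
              f"sum = {targeted_detection_by_sum(d):.4%}")
    print("Eq. (28) total mass for m = 3:",
          sum(capped_children_pmf(i, 3) for i in range(4)))
```

For $A = 1/64$ the closed forms give 3/64 and 9/64, which are the 4.69% and 14.06% quoted in the text. For $A = 1/32$ they give 9.375% and 25%, somewhat above the simulated 9.10% and 22.36%, consistent with the remark that the analysis ignores duplicate exposed nodes.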

The distribution in Equation (28) is based on the observation that those nodes having more than $m$ children in a botnet without the countermeasure can now have only $m$ children. Hence, the expected percentage of exposed nodes under targeted detection can be calculated:
$$D'_T = \begin{cases} (m + 2) \cdot A, & A \le \left(\frac{1}{2}\right)^{m} \\ A (3 - \log_2 A) - \left(\frac{1}{2}\right)^{m}, & A > \left(\frac{1}{2}\right)^{m}. \end{cases} \tag{29}$$
Compared with $D_T$ in Equation (27), $D'_T$ is smaller. This means that under the countermeasure the number of exposed nodes can be reduced significantly. For example, when $m = 3$ and A = , D_T = , and D'_T = .

We then extend our simulation in Section 5.1 to simulate the worm tree generated using the above countermeasure and evaluate its performance against targeted detection. Fig. 12(a) shows the distribution of the number of children when $m$ = 2, 3, 4, and 5. It can be seen that except for $m = 2$, $c_n(i)$ is well approximated by Equation (28). For $m = 2$, since an infected host stops scanning when it has hit two vulnerable hosts, leaves in the worm tree have more chances to recruit a child. Fig. 12(b) demonstrates the expected percentage of exposed nodes (i.e., $D'_T$), when A = , , and , and $m$ = 2, 3, 4, and 5. It can be seen that $D'_T$ follows approximately the analytical results in Equation (29). Moreover, the expected percentage of exposed nodes under the countermeasure is reduced significantly. For example, when $A$ = 3.125%, the percentage is reduced from 22.36% without the countermeasure to 19.80%, 15.99%, 12.58%, and 9.38% when $m$ = 5, 4, 3, and 2, respectively.

On the other hand, since not every infected host keeps scanning the targets, the countermeasure can potentially slow down the speed of worm infection. Thus, we also simulate the propagation speed of worms that limit the maximum number of children and plot the results in Fig. 12(c) for $m$ = 2, 3, 4, and 5, as well as the original worm without the countermeasure. It can be seen that except for $m = 2$, the worm does not slow down much.
But even when $m = 2$, the worm can infect most vulnerable hosts within 17 hours. Moreover, Figs. 12(b) and (c) demonstrate the tradeoff between the efficiency of worm infection and the robustness of the formed botnet topology. That is, a worm with the countermeasure spreads slower, but the resulting botnet is more robust against targeted detection.

RELATED WORK

Since the Code Red worm in 2001, Internet worms have been an active research topic. Many mathematical models have been developed to characterize the spread of worms, estimate worm behaviors, and contain worm propagation. Most models, however, have focused on the macro-level behavior of worm infection. Specifically, different analytical approaches have been applied to study the total number of infected hosts over time [8], [9], [10], [11], [12], [2], [27]. For example, Staniford et al. used a simple differential equation to estimate the global propagation speed of the Code Red v2 worm [8], whereas Rohloff et al. applied a stochastic model to reflect the variation of the number of infected hosts at the early stage of worm infection [11]. Micro-level models of worm infection, however, have received little attention. In this paper, we apply probabilistic modeling methods and reveal some key micro-level information, such as the infection ability of individual hosts and the underlying botnet topology formed by worm infection.

Some efforts have focused on studying the “who infects whom” information or the worm infection sequence [30], [33], [31], [34]. Different from our work, the prior work investigates the details of the random number generator of worm propagation [30] or infers the worm infection sequence through the observations of network telescopes [33], [34]. Moreover, Sellke et al. applied a branching process to study the effectiveness of a containment strategy [35]. They assume that the total number of scans of an infected host is bounded. As a result, the worm tree studied in their work is fundamentally different from the one in our work.

Botnets have become the top threat to the Internet in recent years. It has been shown that in current botnets, worm infection is still a main tool for recruiting new bots or collecting network information, and random scanning has been widely used [3]. Moreover, botnets are rapidly transitioning from IRC systems to P2P systems. In [36], Wang et al. gave a systematic study on P2P-based botnets, whereas in [14], Dagon et al. surveyed different P2P-based botnet topologies, such as random graphs and power-law topologies. Several methods have been proposed to construct P2P-based botnets through worm infection and re-infection [4], [5].

Modeling the topology generation process has been an active research area. For example, Barabási et al. developed the well-known Barabási-Albert (BA) model and used a mean-field approach to characterize the growth of a topology with both preferential attachment and uniform attachment [22], [23]. Moreover, two exact mathematical models have been studied for the BA model [37], [38]. From the theoretical aspect, our proposed worm tree is similar to the random tree. For example, Devroye used the theory of records to derive the distribution of the level of a random ordered tree in [39]. Compared with these theoretical efforts, our work studies a very different problem (i.e., botnets formed by worm infection) and uses a very different approach (i.e., probabilistic modeling).

CONCLUSIONS

In this paper, we attempt to capture the key characteristics of the Internet worm infection family tree and apply them to bot detection. We have shown analytically and empirically that for the infection tree formed by a wide class of worms, the number of children asymptotically has a geometric distribution with parameter 0.5, and the generation closely follows a Poisson distribution with parameter $E_n[G]$ (i.e., $H_n - 1$). As a result, on average half of infected hosts never compromise any target, over 98% of nodes have no more than five children, and a small portion of hosts have a large number of children. Moreover, the average path length of the worm tree increases approximately logarithmically with the number of nodes. We have also demonstrated empirically that similar observations can be found in localized-scanning worms. We have then applied the observations to bot detection and found that targeted detection is an efficient way to expose bots in a botnet. However, we have also pointed out that a simple countermeasure by future botnets can weaken the performance of targeted detection, without greatly slowing down the speed of worm infection.

As part of our ongoing work, we plan to study in more depth efficient methods against future botnets and relax our assumptions to include more worm dynamics. For example, we are studying the effect of user defenses on the worm tree [40].

APPENDIX 1: PROOF OF COROLLARY 1

We apply the z-transform to derive the expectation and the variance of the number of children. First, note that Corollary 1 holds for $n = 1$ and $2$. Next, when $n \ge 3$, we define the z-transform
$$X_n(z) = \sum_{i=0}^{n-1} c_n(i) z^{-i}. \tag{30}$$
Setting $c_{n-1}(-1) = 1$, we can transform Theorem 2 to
$$c_n(i) = \frac{n-2}{n} c_{n-1}(i) + \frac{1}{n} c_{n-1}(i-1), \quad 0 \le i \le n-1, \tag{31}$$
when $n \ge 3$. Then, putting Equation (31) into Equation (30), we can obtain the difference equation of the z-transform
$$X_n(z) = \left(\frac{1}{n} z^{-1} + \frac{n-2}{n}\right) X_{n-1}(z) + \frac{1}{n}. \tag{32}$$
Note that $E_n[C] = -\left.\frac{dX_n(z)}{dz}\right|_{z=1}$ and $X_{n-1}(1) = 1$, which leads to
$$E_n[C] = \frac{n-1}{n} E_{n-1}[C] + \frac{1}{n}. \tag{33}$$
Since $E_2[C] = \frac{1}{2}$, we can show by induction that
$$E_n[C] = \frac{n-1}{n}. \tag{34}$$
Moreover, $E_n[C^2] = \left.\frac{d}{dz}\left[z \frac{dX_n(z)}{dz}\right]\right|_{z=1}$ yields
$$E_n[C^2] = \frac{n-1}{n} E_{n-1}[C^2] + \frac{2}{n} E_{n-1}[C] + \frac{1}{n} \tag{35}$$
$$= \frac{n-1}{n} E_{n-1}[C^2] + \frac{3n-5}{n(n-1)}. \tag{36}$$
Thus, we can use $E_2[C^2] = \frac{1}{2}$ to prove by induction that
$$E_n[C^2] = 2 + \frac{(n-1)(n-2)}{n^2} - \frac{2 H_n}{n}, \tag{37}$$
where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ is the $n$-th harmonic number [20]. Therefore,
$$\mathrm{Var}_n[C] = E_n[C^2] - (E_n[C])^2 \tag{38}$$
$$= 2 - \frac{n-1}{n^2} - \frac{2 H_n}{n}. \tag{39}$$

APPENDIX 2: PROOF OF COROLLARY 2

It is already known that $c(0) = \frac{1}{2}$. When $i \ge 1$, this corollary follows readily from Equation (13). Since $n \to \infty$, $c_{n-1}(i) = c_n(i) = c(i)$, which yields
$$c(i) = \frac{n-2}{n} c(i) + \frac{1}{n} c(i-1). \tag{40}$$
That is,
$$c(i) = \frac{1}{2} c(i-1), \quad i \ge 1. \tag{41}$$
Hence, from $c(0) = \frac{1}{2}$, we can recursively obtain
$$c(i) = \left(\frac{1}{2}\right)^{i+1}, \quad i \ge 0. \tag{42}$$

APPENDIX 3: PROOF OF COROLLARY 3

Similar to the proof of Corollary 1, we apply the z-transform to derive the expectation and the variance of the generation. First, note that Corollary 3 holds for $n = 1$ and $2$. Next, when $n \ge 3$, we define the z-transform
$$Y_n(z) = \sum_{j=0}^{n-1} g_n(j) z^{-j}. \tag{43}$$
Putting Equation (18) into Equation (43), we can obtain the difference equation of the z-transform
$$Y_n(z) = \left(\frac{1}{n} z^{-1} + \frac{n-1}{n}\right) Y_{n-1}(z). \tag{44}$$
Note that $E_n[G] = -\left.\frac{dY_n(z)}{dz}\right|_{z=1}$ and $Y_{n-1}(1) = 1$, which leads to
$$E_n[G] = E_{n-1}[G] + \frac{1}{n}. \tag{45}$$
Since $E_2[G] = \frac{1}{2}$, we can show by induction that
$$E_n[G] = H_n - 1. \tag{46}$$
Moreover, $E_n[G^2] = \left.\frac{d}{dz}\left[z \frac{dY_n(z)}{dz}\right]\right|_{z=1}$ yields
$$E_n[G^2] = E_{n-1}[G^2] + \frac{2}{n} E_{n-1}[G] + \frac{1}{n}. \tag{47}$$
Therefore, combining Equations (45) and (47) gives
$$\mathrm{Var}_n[G] = E_n[G^2] - (E_n[G])^2 = E_{n-1}[G^2] + \frac{1}{n}\left(2 E_{n-1}[G] + 1\right) - \left(E_{n-1}[G] + \frac{1}{n}\right)^2 = \mathrm{Var}_{n-1}[G] + \frac{1}{n} - \frac{1}{n^2}. \tag{48}$$
Thus, we can use $\mathrm{Var}_2[G] = \frac{1}{4}$ to prove by induction that
$$\mathrm{Var}_n[G] = H_n - H_{n,2}, \tag{49}$$
where $H_n = \sum_{i=1}^{n} \frac{1}{i}$ and $H_{n,2} = \sum_{i=1}^{n} \frac{1}{i^2}$.

APPENDIX 4: PROOF OF COROLLARY 4

We prove this corollary by applying the z-transform. If a random variable $X$ follows a Poisson distribution with parameter $\lambda$,
$$\Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \cdots. \tag{50}$$
Using the z-transform, we have
$$X(z) = \sum_{k=0}^{\infty} \Pr(X = k) z^{-k} = e^{\lambda (z^{-1} - 1)}. \tag{51}$$
Meanwhile, using Equation (18) in Theorem 3, we find the z-transform of $g_n(j)$:
$$Y_n(z) = \sum_{j=0}^{n-1} g_n(j) z^{-j} = \left(1 + \frac{z^{-1} - 1}{n}\right) Y_{n-1}(z). \tag{52}$$
Note that when $x \to 0$, $e^x \approx 1 + x$. Thus, when $n$ is very large, $1 + \frac{z^{-1} - 1}{n} \approx \exp\left((z^{-1} - 1)/n\right)$. That is,
$$Y_n(z) \approx e^{\frac{z^{-1} - 1}{n}} Y_{n-1}(z). \tag{53}$$
Using $Y_1(z) = 1$, we can recursively obtain
$$Y_n(z) \approx e^{(z^{-1} - 1) \sum_{i=2}^{n} \frac{1}{i}} = e^{(H_n - 1)(z^{-1} - 1)}. \tag{54}$$
Therefore, by comparing Equations (51) and (54), $g_n(j)$ can be approximated by the Poisson distribution with parameter $H_n - 1$ as in Equation (21).

REFERENCES

Proc. 13th Annual Network and Distributed System Security Symposium (NDSS’06), Feb. 2006.
[3] Z. Li, A. Goyal, Y. Chen, and V. Paxson, “Automating Analysis of Large-Scale Botnet Probing Events,” in Proc. ACM Symposium on Information, Computer and Communication Security (ASIACCS’09), Mar. 2009.
[4] R. Vogt, J. Aycock, and M. Jacobson, Jr., “Army of Botnets,” in Proc. 14th Annual Network and Distributed System Security Symposium (NDSS’07), Feb. 2007.
[5] P. Wang, S. Sparks, and C. C. Zou, “An Advanced Hybrid Peer-to-Peer Botnet,” to appear in IEEE Transactions on Dependable and Secure Computing.
[6] P. Porras, H. Saidi, and V. Yegneswaran, “Conficker C P2P Protocol and Implementation,” SRI International Technical Report
Proc. 11th USENIX Security Symposium (Security’02), Aug. 2002.
[9] C. C. Zou, D. Towsley, and W. Gong, “On the Performance of Internet Worm Scanning Strategies,” Elsevier Journal of Performance Evaluation, vol. 63, no. 7, pp. 700–723, Jul. 2006.
[10] Z. Chen, L. Gao, and K. Kwiat, “Modeling the Spread of Active Worms,” in Proc. IEEE INFOCOM, Apr. 2003.
[11] K. Rohloff and T. Basar, “Stochastic Behavior of Random Constant Scanning Worms,” in Proc. 14th ICCCN, Oct. 2005.
[12] M. Vojnovic and A. J. Ganesh, “On the Race of Worms, Alerts and Patches,” IEEE/ACM Transactions on Networking, vol. 16, no. 5, pp. 1066–1079, Oct. 2008.
[13] R. Albert and A.-L. Barabási, “Statistical Mechanics of Complex Networks,” Reviews of Modern Physics, vol. 74, pp. 47–97, 2002.
[14] D. Dagon, G. Gu, C. Lee, and W. Lee, “A Taxonomy of Botnet Structures,” in Proc. 23rd Annual Computer Security Applications Conference (ACSAC’07), Dec. 2007.
[15] J. Xia, S. Vangala, J. Wu, L. Gao, and K. Kwiat, “Effective Worm Detection for Various Scan Techniques,” Journal of Computer Security, vol. 14, no. 4, pp. 359–387, 2006.
[16] Z. Chen and C. Ji, “Optimal Worm-Scanning Method Using Vulnerable-Host Distributions,” International Journal of Security and Networks (IJSN): Special Issue on Computer and Network Security, vol. 2, no. 1/2, pp. 71–80, 2007.
[17] M. Vojnovic, V. Gupta, T. Karagiannis, and C. Gkantsidis, “Sampling Strategies for Epidemic-Style Information Dissemination,” to appear in IEEE/ACM Transactions on Networking.
[18] M. A. Rajab, F. Monrose, and A. Terzis, “On the Effectiveness of Distributed Worm Monitoring,” in Proc. 14th USENIX Security Symposium (Security’05), Aug. 2005.
[19] Z. Chen, C. Chen, and C. Ji, “Understanding Localized-Scanning Worms,” in Proc. IEEE IPCCC, Apr. 2007.
[20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, “Introduction to Algorithms (Second Edition),” The MIT Press and McGraw-Hill, 2002.
[21] Z. Chen and C. Ji, “An Information-Theoretic View of Network-Aware Malware Attacks,” IEEE Transactions on Information Forensics and Security, vol. 4, no. 3, pp. 530–541, Sept. 2009.
[22] A.-L. Barabási and R. Albert, “Emergence of Scaling in Random Networks,” Science, vol. 286, pp. 509–512, Oct. 1999.
[23] A.-L. Barabási, R. Albert, and H. Jeong, “Mean-field Theory for Scale-free Random Networks,” Physica A 272, 1999.
[24] E. C. Titchmarsh, “The Theory of the Riemann Zeta Function,” Oxford University Press, 1986.
[25] J. Havil, “Gamma: Exploring Euler’s Constant,” Princeton University Press
∼ czou/research/wormSimulation/simulator-codered-100run.cpp.
[27] C. C. Zou, W. Gong, D. Towsley, and L. Gao, “The Monitoring and Early Detection of Internet Worms,” IEEE/ACM Transactions on Networking
Proc. Passive and Active Measurement Conference (PAM’06), Mar. 2006.
[30] A. Kumar, V. Paxson, and N. Weaver, “Exploiting Underlying Structure for Detailed Reconstruction of an Internet-scale Event,” in Proc. Internet Measurement Conference, 2005.
[31] Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang, “Worm Origin Identification Using Random Walks,” in Proc. IEEE Symposium on Security and Privacy, May 2005.
[32] J. Ma, G. M. Voelker, and S. Savage, “Self-stopping Worms,” in Proc. ACM Workshop on Rapid Malcode, Nov. 2005.
[33] M. A. Rajab, F. Monrose, and A. Terzis, “Worm Evolution Tracking via Timing Analysis,” in Proc. Workshop on Rapid Malcode (WORM), Nov. 2005.
[34] Q. Wang, Z. Chen, K. Makki, N. Pissinou, and C. Chen, “Inferring Internet Worm Temporal Characteristics,” in Proc. IEEE GLOBECOM, Dec. 2008.
[35] S. Sellke, N. B. Shroff, and S. Bagchi, “Modeling and Automated Containment of Worms,” IEEE Transactions on Dependable and Secure Computing, vol. 5, no. 2, pp. 71–86, Apr.-Jun. 2008.
[36] P. Wang, L. Wu, B. Aslam, and C. C. Zou, “A Systematic Study on Peer-to-Peer Botnets,” in Proc. International Conference on Computer Communications and Networks (ICCCN), Aug. 2009.
[37] B. Bollobás, O. Riordan, J. Spencer, and G. Tusnady, “The Degree Sequence of a Scale-free Random Graph Process,” Random Structures & Algorithms, vol. 18, no. 3, pp. 279–290, Apr. 2001.
[38] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin, “Structure of Growing Networks with Preferential Linking,” Phys. Rev. Lett., vol. 85, pp. 4633–4636, Nov. 2000.
[39] L. Devroye, “Applications of the Theory of Records in the Study of Random Trees,” Acta Inf., vol. 26, no. 1-2, pp. 123–130, 1988.
[40] Q. Wang, Z. Chen, C. Chen, and N. Pissinou, “On the Robustness of the Botnet Topology Formed by Worm Infection,” to appear in Proc. IEEE GLOBECOM.