Fast Botnet Detection From Streaming Logs Using Online Lanczos Method
Zheng Chen, Xinli Yu, Chi Zhang, Jin Zhang, Cui Lin, Bo Song, Jianliang Gao, Xiaohua Hu, Wei-Shih Yang, Erjia Yan
FFast Botnet Detection From Streaming Logs Using Online Lanczos Method
Zheng Chen , Xinli Yu , Chi Zhang , Jin Zhang , Cui Lin , Bo Song , Jianliang Gao , Xiaohua Hu , Wei-Shih Yang , Erjia Yan CA Technologies, Inc. College of Computing & Informatics, Drexel University Department of Mathematics, Temple University Department of Computer Science, Maryland University at Baltimore County
Abstract — Botnet, a group of coordinated bots, is becoming the main platform of malicious Internet activities like DDOS, click fraud, web scraping, spam/rumor distribution, etc. This paper focuses on design and experiment of a new approach for botnet detection from streaming web server logs, motivated by its wide applicability, real-time protection capability, ease of use and better security of sensitive data. Our algorithm is inspired by a Principal Component Analysis (PCA) to capture correlation in data, and we are first to recognize and adapt Lanczos method to improve the time complexity of PCA-based botnet detection from cubic to sub-cubic, which enables us to more accurately and sensitively detect botnets with sliding time windows rather than fixed time windows. We contribute a generalized online correlation matrix update formula, and a new termination condition for Lanczos iteration for our purpose based on error bound and non-decreasing eigenvalues of symmetric matrices. On our dataset of an ecommerce website logs, experiments show the time cost of Lanczos method with different time windows are consistently only 20% to 25% of PCA.
Keywords-bot detection; botnet; streaming logs; correlation matrix; Lanczos iteration; online algorithm. I. I NTRODUCTION A bot is a software application that runs automated scripts over the Internet [1] to perform malicious tasks like DOS attack, website statistics skew, click fraud, price/information scraping, spam/rumor distribution, etc. Traditional single bots usually execute their tasks at a rate much higher than average human users to achieve their goal within a time limit. Recent single-bot detection methods, like peak-finding [2], outlier detection [3], threat propagation [4], more or less use this property of single bots. A botnet , as the name suggests, is a group of bots that work in a coordinated fashion. In contrast to single bots, a botnet, especially those large-scale botnets, they might request resources at a human-like speed, but altogether they place a heavy burden on the servers and collect large amount of information. Because bots in a botnet behave human-like, they are much harder to detect, and have become a key platform for many Internet attacks. Log data have been commonly leveraged for both bot and botnet detection. Generally, we can say a log is a sequence of data entries order by timestamp, where each data entry carries several fields that record the properties of an activity at a specific time. The main objective of this paper is to develop an efficient botnet detection from large-scale streaming web server logs with host identifier, request identifier, and time stamp; for example, in this paper we will experiment on Apache HTTP access logs. A host identifier could be an IP addresses or a MAC addresses, or anything similar; a request identifier could be an URL or an IP address, or web API name, or anything similar; the concrete forms of those identifiers depend on the type of stream-in logs. For Apache HTTP access log, a host identifier could be an IP address, or a host name, and a request identifier is a URL pointing to some resource on the server. Although we focus on sever logs, however, the approach can be used to monitor any streaming log data with similar information for coordinated behavior. The method developed in this paper is the key part of a larger bot/botnet detection system prototype overviewed in section II. The above objective is motivated by our research concerns, business goal and system requirements [5]. First is wider applicability. As far as we know, there still lacks researches on a generally applicable botnet detection method for web servers. Most recent botnet detection methods involve a particular type of non-log data. For example, [6] uses captcha test results to discover search engine bots, [7] takes advantage of an emulator to interact with the botnet, [8] is based on knowledge of protocol and DNS traffic, [9] needs to construct a user-user graph based on login activities. These methods typically intend for a special purpose and need additional efforts to collect and pre-process information not readily available on the server and sometimes need the modeler to understand advanced Internet structure. In contrast, our approach is quite “lightweight”, which only relies on above-mentioned basic information present in various access/activity logs of computer systems/software that host network services.
Secondly , the time complexity. Users stream in their log data to our system; the system monitors the streaming log data and in turn provide real-time warning on potential malicious hosts. Such real-time feedback has requirement on method complexity as well as sensitivity to coordinated attack from botnet. Logs usually come in a large volume. Users of a medium-scale ecommerce website can generate 100,000 log entries in 30 minutes. More popular website could have a much bigger number. As far as we know, all previous papers mentioned here do not meet our requirement. For example, method like in [9] needs hundreds of computers to run hours to figure out the botnet.
Thirdly , we believe our approach will be more secure in the sense that users are not required to provide sensitive low-level hardware information. Our method also does not need to intercept and inspect Internet packets like [10] [11] [12], which could bring additional security concerns. Log is the only data we need, and our method even allows users to anonymize the identifiers efore they stream in.
Last not the least , it is not possible for one method to detect all sorts of bots, and therefore modern industrial bot detection system is often an ensemble [13-15] integrating heterogenous methods to enhance detection capability. The simplicity and ease of use of our approach as discussed above: less data preprocessing, less requirement on specialized knowledge, less involvement with sensitive data, makes for better integration with other methods. The method we propose to adapt is Lanczos iteration [16, 17]. It is a method in numerical linear algebra to estimate eigenvalues and eigenvectors, with rigorous theory on its error and convergence. Detailed discussion is in section III. This idea is inspired by Principal Component Analysis (PCA) that captures data correlation by computing eigenvalues eigenvectors of correlation matrix. The main contribution of this paper can be summarized as the following, to the extent of our knowledge: 1) we are the first to investigate PCA-based botnet detection from streaming logs; 2) we are the first to contribute an algorithm that updates correlation matrix for the most general case of sliding window in section III.B; 3) we are first to first to recognize and adapt Lanczos method for fast botnet detection, and we innovate on the termination condition in our setup using Theorem 1 and Theorem 2 that leads to early termination of each iteration; our experiments in section IV further shows its effectiveness. II. B ACKGROUND A. Principal Component Analysis
Principal Component Analysis (PCA) is best known as a dimension reduction technique, is also a popular method in anomaly detection to detect outliers, for example, finding cyber anomalies [18, 19]; those data points not well-represented by the principal components are considered outliers or anomalies. It is also applied for the converse purpose to check if there is high-level coordination in the data, which has been successfully applied to the monitoring of industrial processes [20, 21]. PCA also has been combined with KL-divergence to detect botnet from search engine logs [22]; in that paper KL-divergence is used to filter users with usual “click distribution”, but still full PCA with cubic complexity is applied to detect correlation is the user is deemed “unusual”. A similar basic idea can also be found in [5]. Mathematically, PCA finds the eigenvalues and their eigenvectors of the covariance or correlation matrix s.t. the eigenvectors are orthonormal. Those orthonormal eigenvectors are then sorted by their eigenvalues and form rotated coordinate of the space and are called (first, second, …) principal components. In future discussion, we call the largest eigenvalue associated with the first principal component as the principal weight . The major problem of PCA, or in particular the calculation of the principal weight from the correlation matrix, is its cubic time complexity. The technical purpose of this paper is to use Lanczos method to reduce this time complexity. We discuss more about this in section III.C. B. Bot Detection System Prototype
Our algorithm plays a key role in a bot detection prototype system, which has been filed for patent [5, 23, 24]. The general work flow of this system is illustrated in Figure 1, working side by side with a Markov chain-based behavior model [24]. The latter views bots from a different perspective and detects them by their strange activities on the sever. For example, a human visiting an ecommerce website usually first browse products, make several searches, log in, add product to carts, and check out. However, it is hard for bots to follow this routine. Such routine can be modelled by Markov chain transitions, and bot visit sequence will have low probability in this Markov chain. However, this behavior model can only detect single bots or botnet-bots individually, lacking the ability to discover a botnet as a whole. As mentioned earlier, we favor a lightweight botnet detection algorithm, so it can be more easily integrated with the workflow.
Figure 1 The workflow of our bot/botnet detection prototype system.
III. P ROBLEM & A PPROACH A. Problem Formulation
For botnet detection of streaming log data, PCA provides a good start point, and a straightforward solution could be as the following: 1) split the streaming logs into time windows according to a specified interval; 2) for each time window, convert the log entries within a specified time interval to a host-request matrix 𝐗 with the integer value at 𝑖 th row and 𝑗 th column, denoted by 𝐗(𝑖, 𝑗) , represents the number of times host 𝑗 makes request 𝑖 ; 3) run PCA on each time window, and PCA can be applied on 𝐗 to check if the principal weight exceeds certain threshold. Above procedure can detect either ingle bots with a large volume of traffic, or a botnet where each bot might not make many requests but they correlate with each other and altogether still produce a high volume of traffic. An example is illustrated in Figure . (a) (b) Figure 2 (a) Two single bots, each of which has clearly higher visit rates that non-bot hosts. Although their visit counts are not correlated, they dominate non-bot visits, leading to a high principal weight of 0.779. (b) A botnet with correlated visit counts, with a high principal weight of 0.843.
Nonetheless, questions arise in the adaptation of PCA to our application.
First , the time complexity of full PCA used in [22] is cubic, which easily breaks down when a large number of logs come in. In our application, we are probably interested in only the largest eigenvalue and its eigenvector. It is obviously not necessary for a full PCA.
Secondly , division into time windows seems not suitable for monitor of streaming data. We are facing an accuracy-sensitivity dilemma : a small window will weaken the algorithm’s ability to find bots, since fewer data might not provide sufficient evidence for PCA; a larger window will slow down the algorithm’s sensitivity, e.g. in a massive attack the sever might already be brought down before the algorithm starts to analyze the last 30-min window. There is no universal criterion for good window length, and it is difficult to conduct experiments for different server logs. A more professional practice is sliding window , as in industrial process monitor [21, 25, 26], but this makes computation even more intense. In process control, the “features” as columns of data matrix is usually at most hundreds of columns, while in our application, the “features” are tens of thousands of host identifiers. From above discussion, reducing time complexity is the top priority if we desire to use sliding window. Our research problem can thus be summarized as constructing a fast algorithm suitable for using sliding window to monitor large-scale streaming logs for potential bots/botnets. B. Correlation Matrix Update
Consider the request-host matrix introduced in previous section. We evaluate the principal weight from the correlation matrix between hosts. After each window slide, this matrix will change, so does the correlation matrix. Our first problem is how to update the correlation matrix after every time window slide before updating the principal weight. The time complexity for computing correlation matrix of a 𝑛 × 𝑚 request-host matrix is
𝑂(𝑛𝑚 ) , so it is highly uneconomic and undesirable to re-compute entire correlation matrix. [21] calculated the updates for adding new rows and deleting old rows. Our case is more complicated: after every window slide, some rows and columns might be removed, some new rows and columns might be appended, and other rows and columns might have value change. Let 𝐗 𝑡o ∈ ℝ 𝑛 𝑡 ×𝑚 𝑡 be the request-host matrix before the 𝑡 th window slide, 𝑡 ∈ ℕ + . Then the mean of each column is given in the vector 𝐛 𝑡 = 1𝑛 𝑡 (𝐗 𝑡o ) T 𝑛 𝑡 (1) where 𝑛 𝑡 = [1, … ,1] T ∈ ℝ 𝑛 𝑡 is a column vector of length 𝑛 𝑡 with all components being . Then we call the following as the centralize request-host matrix , because each of its column has zero mean 𝐗 𝑡 = 𝐗 𝑡o − 𝟏 𝑛 𝑡 𝐛 𝑡T (2) where 𝚺 𝑡 = diag(𝜎 𝑡,1 , … , 𝜎 𝑡,𝑚 𝑡 ) is a diagonal matrix with the 𝑗 th diagonal element 𝜎 𝑡,𝑗 being the standard deviation of the 𝑗 th column of 𝐗 𝑡o . The correlation matrix for 𝐗 𝑡o is thus 𝐑 𝑡 = 1𝑛 𝑡 − 1 𝚺 𝑡−1 𝐗 𝑡T 𝐗 𝑡 𝚺 𝑡−1 (3) Our goal is to find the update formula for 𝐑 𝑡+1 using information from 𝐗 𝑡o , 𝐛 𝑡 , 𝐗 𝑡 and 𝐑 𝑡 . For simplicity, we can assume that the hosts are the same before and after the window slide, without loss of generality, i.e. column changes can be disregarded when calculating the updates. This is because, if there are any added columns or removed columns caused by window slide, we can simply add corresponding columns (and rows if necessary) into 𝐛 𝑡 , 𝐗 𝑡 , 𝐑 𝑡 , and use the modified ones for our future inferences. Also note correlation is not dependent on row order, therefore rows of 𝐗 𝑡 can be arbitrarily arranged for our convenience. Suppose after the 𝑡 th slide, 𝑛 𝑡+1+ new rows 𝐗 𝑛 𝑡+1+ o are appended to the bottom of 𝐗 𝑡o , each row representing a new request-host vector not currently in 𝐗 𝑡o ; 2) 𝑛 𝑡+1− rows 𝐗 𝑛 𝑡+1− o on top of 𝐗 𝑡o are removed, meaning those requests disappear after the window slide; 3) other rows have value change denoted by matrix 𝐗 𝑐 𝑡+1 o , where only the top 𝑐 𝑡+1 of 𝐗 𝑐 𝑡+1 o are non-zero. Since every window slide is small and of fixed distance, thus 𝑛 𝑡+1+ + 𝑛 𝑡+1− + 𝑐 𝑡+1 is usually small and can be treated as a constant. In future discussion, we also write 𝐗 𝑡o =[ 𝐗 𝑛 𝑡+1− o 𝐗 𝑛 𝑡 −𝑛 𝑡+1− o ] where 𝐗 𝑛 𝑡 −𝑛 𝑡+1− o is the last 𝑛 𝑡 − 𝑛 𝑡+1− rows of 𝐗 𝑡o ; likewise, 𝐗 𝑡 = [ 𝐗 𝑛 𝑡+1− 𝐗 𝑛 𝑡 −𝑛 𝑡+1− ] where 𝐗 𝑛 𝑡+1− and 𝐗 𝑛 𝑡 −𝑛 𝑡+1− are centralized data of 𝐗 𝑛 𝑡+1− o and 𝐗 𝑛 𝑡 −𝑛 𝑡+1− o respectively like in (2). We have the following relation for 𝐗 𝑛 𝑡+1+ o , 𝐗 𝑛 𝑡+1− o , and 𝐗 𝑐 𝑡+1 o , where 𝑛 𝑡+1− ×𝑚 𝑡 denotes a 𝑛 𝑡+1− × 𝑚 𝑡 zero matrix, and 𝐗 𝑡+1∗o is equivalent to 𝐗 𝑡+1o except for the remove rows are replaced by zeros. [𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝐗 𝑡+1o ] = [𝐗 𝑡o − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ]𝐗 𝑛 𝑡+1+ o ] ≔ 𝐗 𝑡+1∗o (4) he update of mean-value is now given by (𝑛 𝑡 + 𝑛 𝑡+1+ − 𝑛 𝑡+1− )𝐛 𝑡+1 = 𝑛 𝑡 𝐛 𝑡 + (𝐗 𝑛 𝑡+1+ o ) T 𝑛 𝑡+1+ − (𝐗 𝑛 𝑡+1− o ) T 𝑛 𝑡+1− + (𝐗 𝑐 𝑡+1 o ) T 𝐶 𝑡+1 (5) Let Δ𝐛 𝑡+1𝑇 = 𝐛 𝑡+1𝑇 − 𝐛 𝑡𝑇 , and note 𝐗 𝑛 𝑡+1+ = 𝐗 𝑛 𝑡+1+ o −𝟏 𝑛 𝑡+1+ 𝐛 𝑡+1𝑇 is data centralization like in (2), we then have 𝐗 𝑡+1∗ = [𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝐗 𝑡+1o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡+1 𝐛 𝑡+1T ]= [𝐗 𝑡o − 𝟏 𝑛 𝑡 𝐛 𝑡T + 𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ]𝐗 𝑛 𝑡+1+ o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡+1 𝐛 𝑡+1T ]= [𝐗 𝑡o − 𝟏 𝑛 𝑡 𝐛 𝑡T + 𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ]𝐗 𝑛 𝑡+1+ o − 𝟏 𝑛 𝑡+1+ 𝐛 𝑡+1𝑇 ]= [𝐗 𝑡 + (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ])𝐗 𝑛 𝑡+1+ ]= [𝐗 𝑡 − [ 𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡𝑇 𝐗 𝑐 𝑡+1 o + 𝟏 𝑛 𝑡 −𝑛 𝑡+1− (𝐛 𝑡+1𝑇 − 𝐛 𝑡𝑇 )]𝐗 𝑛 𝑡+1+ ]= [ [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝐗 𝑛 𝑡 −𝑛 𝑡+1− − 𝐗 𝑐 𝑡+1 o − 𝟏 𝑛 𝑡 −𝑛 𝑡+1− Δ𝐛 𝑡+1𝑇 ⏞ 𝑂(𝑚 𝑡 (𝑛 𝑡 −𝑛 𝑡+1− )) ]𝐗 𝑛 𝑡+1+ ⏞ 𝑂(𝑚 𝑡 𝑛 𝑡+1+ ) ] (6) where we note the top 𝑛 𝑡+1− rows of 𝐗 𝑡+1∗ must all be zero, and thus the time complexity of 𝐗 𝑡+1∗ is 𝑂(𝑚 𝑡 ( 𝑛 𝑡 − 𝑛 𝑡+1− +𝑛 𝑡+1+ )) = 𝑂(𝑚 𝑡 𝑛 𝑡+1 ) . We now start calculating the update for the standard deviations in 𝚺 𝑡+1 . 𝜎 𝑡+1,𝑗2 = ‖[𝐗 𝑡o (: , 𝑗) − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)]𝐗 𝑛 𝑡+1+ o (: , 𝑗) ] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1 ]‖ 𝑛 𝑡+1 − 1= ‖𝐗 𝑡o (: , 𝑗) − 𝟏 𝑛 𝑡 𝐛 𝑡T (𝑗) + 𝟏 𝑛 𝑡 𝐛 𝑡T (𝑗) − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ]𝐗 𝑛 𝑡+1+ o (: , 𝑗) − 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1+ ‖ 𝑛 𝑡+1 − 1 (7) We can simplify (7) piece by piece. First expand, ‖𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 + 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ]‖= ‖𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ‖ + ‖𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ]‖ + 2 [𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ]] T [𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ] (8) where ‖𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ‖ = (𝑛 𝑡 − 1)𝜎 𝑡,𝑗2 (9) 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − [𝐗 𝑛 𝑡+1− o (: , 𝑗)𝐗 𝑐 𝑡+1 o (: , 𝑗)] − [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ]= [ 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − 𝐗 𝑛 𝑡+1− o (: , 𝑗)−Δ𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− − 𝐗 𝑐 𝑡+1 o (: , 𝑗)] (10) 𝑛 𝑡 T 𝐗 𝑡o (: , 𝑗) − 𝟏 𝑛 𝑡 T 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 = 0⇒ [𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ] T [𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ] = 0 (11) Using (11), we have the following important simplification. [ 𝟎 𝑛 𝑡+1− 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− ] T [𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ]= (𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 𝑇 − [𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1− 𝑛 𝑡 −𝑛 𝑡+1− ] T ) [𝐗 𝑡o (: , 𝑗) − 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 ]= − [𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1− 𝑛 𝑡 −𝑛 𝑡+1− ] T 𝐗 𝑡 (: , 𝑗) (12) Plug (9)~(12) back to (8) and then (7), we have (𝑛 𝑡+1 − 1)𝜎 𝑡+1,𝑗2 = (𝑛 𝑡 − 1)𝜎 𝑡,𝑗2 + ‖‖ 𝐛 𝑡 (𝑗)𝟏 𝑛 𝑡 − 𝐗 𝑛 𝑡+1− o (: , 𝑗)⏞ 𝑂(𝑛 𝑡+1− ) Δ𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡 −𝑛 𝑡+1− + 𝐗 𝑐 𝑡+1 o (: , 𝑗)⏞ 𝑂(𝑐 𝑡+1 ) ‖‖ − 2 [𝐗 𝑛 𝑡+1− o (: , 𝑗) + 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1− 𝐗 𝑐 𝑡+1 o (: , 𝑗) ] T 𝐗 𝑡 (: , 𝑗)⏞ 𝑂(𝑛 𝑡+1− +𝑐 𝑡+1 ) + ‖𝐗 𝑛 𝑡+1+ o (: , 𝑗) − 𝐛 𝑡+1 (𝑗)𝟏 𝑛 𝑡+1+ ⏞ 𝑂(𝑛 𝑡+1+ ) ‖ (13) Recall the top 𝑐 𝑡+1 rows of 𝐗 𝑐 𝑡+1 o are non-zero, thus from (13) the time complexity for updating each standard deviation is 𝑂(𝑛 𝑡+1− + 𝑐 𝑡+1 + 𝑛 𝑡+1+ ) , linear to the number of rows that are affected by the window slide. The total complexity for updating all standard deviations is 𝑂(𝑚 𝑡 (𝑛 𝑡+1− + 𝑐 𝑡+1 +𝑛 𝑡+1+ )) . At last we update the correlation matrix. Using the fourth identity of (6) and a technique like (12), we have (𝑛 𝑡+1 − 1)𝐑 𝑡+1 = (𝐗 𝑡+1∗ 𝚺 𝑡+1−1 ) T (𝐗 𝑡+1∗ 𝚺 𝑡+1−1 )= (𝐗 𝑡 𝚺 𝑡+1−1 + (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ]) 𝚺 𝑡+1−1 𝐗 𝑛 𝑡+1+ 𝚺 𝑡+1−1 ) T × (𝐗 𝑡 𝚺 𝑡+1−1 + (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ]) 𝚺 𝑡+1−1 𝐗 𝑛 𝑡+1+ 𝚺 𝑡+1−1 )= 𝚺 𝑡+1−1 𝐗 𝑡T 𝐗 𝑡 𝚺 𝑡+1−1 + 𝚺 𝑡+1−1 𝐘 𝑡+1 𝚺 𝑡+1−1 = (𝑛 𝑡 − 1)𝚺 𝑡+1−1 𝚺 𝑡 𝐑 𝑡 𝚺 𝑡 𝚺 𝑡+1−1 ⏞ 𝑂(𝑚 𝑡2 ) + 𝚺 𝑡+1−1 𝐘 𝑡+1 𝚺 𝑡+1−1 ⏞ 𝑂(𝑚 𝑡2 ) (14) where 𝐘 𝑡+1 = 2𝐗 𝑡T (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ])+ (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ]) T (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ])+ 𝐗 𝑛 𝑡+1+ T 𝐗 𝑛 𝑡+1+ (15) We then use the following identities multiple times 𝑡T 𝑛 𝑡 = 𝟎 ⇒ 𝐗 𝑡T 𝐗 𝑡o = 𝐗 𝑡T (𝐗 𝑡o − 𝟏 𝑛 𝑡 𝐛 𝑡T + 𝟏 𝑛 𝑡 𝐛 𝑡T )= 𝐗 𝑡T 𝐗 𝑡 + 𝐗 𝑡T 𝑛 𝑡 𝐛 𝑡T = 𝐗 𝑡T 𝐗 𝑡 (16) [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ] = 𝟏 𝑛 𝑡 𝐛 𝑡+1T − [ 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T (𝑛 𝑡 −𝑛 𝑡+1− )×𝑚 𝑡 ] (17) Δ𝐛 𝑡+1 𝑛 𝑡 T 𝑛 𝑡 Δ𝐛 𝑡+1T = 𝑛 𝑡 Δ𝐛 𝑡+1 Δ𝐛 𝑡+1T (18) to arrive at 𝑡T ([𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] + [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ])= 2𝐗 𝑡T (𝟏 𝑛 𝑡 𝐛 𝑡+1T + [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T (𝑛 𝑡 −𝑛 𝑡+1− )×𝑚 𝑡 ])= 2𝐗 𝑡T [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ]= 2(𝐗 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− + 𝐗 𝑛 𝑡 −𝑛 𝑡+1− T 𝐗 𝑐 𝑡+1 o ) (19) and (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ]) T (𝟏 𝑛 𝑡 𝐛 𝑡T − [𝐗 𝑛 𝑡+1− o 𝐗 𝑐 𝑡+1 o ] − [ 𝟎 𝑛 𝑡+1− ×𝑚 𝑡 𝑛 𝑡 −𝑛 𝑡+1− 𝐛 𝑡+1T ])= (𝟏 𝑛 𝑡 Δ𝐛 𝑡+1T + [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ]) T (𝟏 𝑛 𝑡 Δ𝐛 𝑡+1T + [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ])= 𝑛 𝑡 Δ𝐛 𝑡+1 Δ𝐛 𝑡+1T + 2Δ𝐛 𝑡+1 𝑛 𝑡 T [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ] 𝚺 𝑡+1−1 + [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ] T [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ] (20) Continue the simplification (sketch calculation, some steps are long and hence omitted, but they are not hard to verify). Δ𝐛 𝑡+1 𝑛 𝑡 T [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ]= Δ𝐛 𝑡+1 (𝟏 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− T 𝑛 𝑡+1− 𝐛 𝑡+1T + 𝟏 𝑛 𝑡 −𝑛 𝑡+1− T 𝐗 𝑐 𝑡+1 o )= Δ𝐛 𝑡+1 (𝟏 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− o + 𝟏 𝑛 𝑡 −𝑛 𝑡+1− T 𝐗 𝑐 𝑡+1 o − 𝑛 𝑡+1− 𝐛 𝑡+1T ) (21) [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ] T [𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T 𝐗 𝑐 𝑡+1 o ]= (𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T ) T (𝐗 𝑛 𝑡+1− o − 𝟏 𝑛 𝑡+1− 𝐛 𝑡+1T ) + (𝐗 𝑐 𝑡+1 o ) T (𝐗 𝑐 𝑡+1 o )= 𝐗 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− + (𝐗 𝑐 𝑡+1 o ) T (𝐗 𝑐 𝑡+1 o ) + 𝑛 𝑡+1− Δ𝐛 𝑡+1 Δ𝐛 𝑡+1T (22) Finally, plug (21) and (22) back to (20), and then plug (19) and (20) back to (15), and after certain rearrangement, we will have 𝐘 𝑡+1 = (𝑛 𝑡 − 𝑛 𝑡+1− )Δ𝐛 𝑡+1 Δ𝐛 𝑡+1T ⏞ 𝑂(𝑚 𝑡2 ) + 𝐗 𝑛 𝑡+1+ T 𝐗 𝑛 𝑡+1+ ⏞ 𝑂(𝑚 𝑡2 𝑛 𝑡+1+ ) + 2Δ𝐛 𝑡+1 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− T − 𝐗 𝑛 𝑡+1− T 𝐗 𝑛 𝑡+1− ⏞ 𝑂(𝑚 𝑡2 𝑛 𝑡+1− ) − (2𝐗 𝑛 𝑡 −𝑛 𝑡+1− − 𝐗 𝑐 𝑡+1 o − 2𝟏 𝑛 𝑡 −𝑛 𝑡+1− Δ𝐛 𝑡+1T ) T 𝐗 𝑐 𝑡+1 o ⏞ 𝑂(𝑚 𝑡2 𝑐 𝑡+1 ) (23) For the last addend of (23), we can further use of the last identity of (6) to reuse the result of 𝐗 𝑡+1 . 𝑛 𝑡 −𝑛 𝑡+1− − 𝐗 𝑐 𝑡+1 o − 2𝟏 𝑛 𝑡 −𝑛 𝑡+1− Δ𝐛 𝑡+1T = 2𝐗 𝑡+1 (1: 𝑛 𝑡+1 − 𝑛 𝑡+1+ , : ) + 𝐗 𝑐 𝑡+1 o (24) The time complexity for update of correlation matrix 𝐑 𝑡+1 is clearly 𝑂(𝑚 𝑡2 (𝑛 𝑡+1+ + 𝑛 𝑡+1− + 𝑐 𝑡+1 )) by (14) and (23). All updates are summarized in Algorithm 1. Combining (6) and (13) the total complexity for correlation matrix update is 𝑂(min{𝑚 𝑡 𝑛 𝑡+1 , 𝑚 𝑡2 (𝑛 𝑡+1+ + 𝑛 𝑡+1− + 𝑐 𝑡+1 )}) , which can be considered as quadratic if 𝑛 𝑡+1+ + 𝑛 𝑡+1− + 𝑐 𝑡+1 is treated as a small constant. This complexity is theoretically down from straightforward complete re-evaluation by power of 1. In practice, we sometimes could expect even better acceleration, as Δ𝐛 𝑡+1T , 𝐗 𝑐 𝑡+1 o etc. are very often sparse. C. Lanczos Method
In future discussion, for convenience of mathematical analysis, we assume the correlation matrix is normalized to unit total variance by dividing every element by the number of columns, so that the sun of all eigenvalues is , the largest eigenvalue, i.e. the principal weight is between range [0,1] , and an eigenvalue larger than must be the largest eigenvalue. In implementation, we should instead normalize the computed eigenvalue to the range of [0,1] for better numerical stability. With updated correlation matrix 𝐑 𝑘+1 , our task is to determine if 𝐑 𝑘+1 has a large eigenvalue, i.e. the principal weight, is larger than a threshold. If so, we consider there exists potential bot visits in the current time window, and continue to find those hosts that have high correlation with the principal component. Several methods suit for this task, including singular value decomposition, which computes the full PCA in cubic time, or Rayleigh quotient iteration, which finds the exact eigenvalue and eigenvector in cubic time [27]. For our purpose, we discussed in III.A that an accurate evaluation of eigenvalue is not necessary. We will investigate Lanczos method [17, 28], a numerical method to approximate the eigenvalues, for its use in our application. Given any symmetric matrix
𝐑 ∈ ℝ 𝑚×𝑚 (the correlation matrix is symmetric) and any non-zero initial vector 𝐱 ∈ ℝ 𝑚 , the Lanczos iteration is expressed as the following process, 𝐯 = 𝟎, 𝐯 = 𝐱‖𝐱‖⇒ {𝐯̃ 𝑗 = 𝐑𝐯 𝑗−1 − 〈𝐑𝐯 𝑗−1 , 𝐯 𝑗−1 ⟩𝐯 𝑗−1 − ‖𝐯̃ 𝑗−1 ‖𝐯 𝑗−2 𝐯 𝑗 = 𝐯̃ 𝑗 ‖𝐯̂ 𝑗 ‖ , 𝑗 = 2,3, … (25) The iteration is guaranteed to terminate at 𝑗 = 𝑘 + 2 when it founds 𝐯̃ 𝑘 +2 = 𝟎 , where 𝑘 is the smallest positive integer s.t. 𝐑 𝑘 +1 𝐱 ∈ span{𝐱, 𝐑𝐱, … , 𝐑 𝑘 𝐱} , and is guaranteed to be well-defined, i.e. ‖𝐯̃ 𝑗 ‖ ≠ 0 , for 𝑗 = 1, … , 𝑘 . It can be proved that 𝐕 𝑘 +1 = [𝐯 , … , 𝐯 𝑘 +1 ] is an orthonormal basis of pan{𝐱, 𝐑𝐱, … , 𝐑 𝑘 𝐱} . Let 〈⟩ denote the standard inner product, and let 𝐕 𝑘 = [𝐯 , … , 𝐯 𝑘 ], 1 ≤ 𝑘 ≤ 𝑘 + 1 , then the Lanczos decomposition gives 𝐑𝐕 𝑘 = 𝐕 𝑘 𝐓 𝑘 + 𝐯̃ 𝑘+1 𝐞 𝑘T , ∀𝑘 = 1, … , 𝑘 + 1 (26) where 𝐞 𝑘 ∈ ℝ 𝑘 is a vector with only the 𝑘 th component being and all other components being , and 𝐓 𝑘 = ( 〈𝐑𝐯 , 𝐯 ⟩ ‖𝐯̃ ‖‖𝐯̃ ‖ 〈𝐑𝐯 , 𝐯 ⟩ ⋱⋱ ⋱ ‖𝐯̃ 𝑘 ‖‖𝐯̃ 𝑘 ‖ 〈𝐑𝐯 𝑘 , 𝐯 𝑘 ⟩) (27) is a symmetric tridiagonal matrix. Since 𝐯̃ 𝑘+1 is orthogonal to every column of 𝐕 𝑘 , then (26) is equivalent to 𝐕 𝑘T 𝐑𝐕 𝑘 = 𝐓 𝑘 , ∀𝑘 = 1, … , 𝑘 + 1 (28) One can choose to use either (26) or (28) for their own convenience. The more crucial point here for our purpose is that the eigenvalues and eigenvectors of 𝐓 𝑘 can estimate those of 𝐑 . The eigenvalues and eigenvectors of a symmetric tridiagonal matrix is a basic task in numerical linear algebra, [16, 21, 29-32]. They take advantage of the special form of symmetric tridiagonal matrices to run faster and more accurate than algorithms for general matrices. The eigenvalue approximation is robust because the error can be shown bounded by the following important theorem, • Theorem 1 . For any eigenvalue 𝜆̂ of 𝐓 𝑘 , there exists an eigenvalue 𝜆 of 𝐑 s.t |𝜆 − 𝜆̂| ≤ ‖𝐯̃ 𝑘+1 ‖ ; or equivalently if 𝐑 has 𝑚 eigenvalues {𝜆 , … , 𝜆 𝑚 } , then we have min |𝜆 𝑖 − 𝜆̂| ≤ ‖𝐯̃ 𝑘+1 ‖ . Further, if 𝛍̃ is an eigenvector of eigenvalue 𝜆̂ of 𝐓 𝑘 , then there exists an eigenvalue 𝜆 of 𝐑 s.t |𝜆 − 𝜆̂| ≤ ‖𝐯̃ 𝑘+1 ‖|〈𝐞 𝑘 ,𝛍̃⟩|‖𝛍̃‖ . The theorem is stated with a short proof sketch in [17]. Since 𝐓 𝑘 ∈ ℝ 𝑘×𝑘 , 𝛍̃ ∈ ℝ 𝑘 , thus |〈𝐞 𝑘 , 𝛍̃⟩| is the absolute value of the last component of eigenvector 𝛍̃ , and |〈𝐞 𝑘 ,𝛍̃⟩|‖𝛍̃‖ ≤ 1 , meaning the second error bound is equal or tighter. Since computation of 𝛍̃ only takes linear time for a tridiagonal matrix, we will use the second bound. |〈𝐞 𝑘 ,𝛍̃⟩|‖𝛍̃‖ is usually a small value and it decreases with 𝑘 [17, 33]. The approximation can also conveniently run in an “online” style due to another useful theorem, which implies the maximum eigenvalue of 𝐓 𝑘 is non-decreasing with 𝑘 . Thus, if we are not satisfied with the error bound 𝑑 , we can increase 𝑘 for better result. • Theorem 2 . For any symmetric matrix, the maximum eigenvalues of its leading principal submatrices are always non-decreasing.
Based on above, we propose the following to terminate the iteration early for our application. Suppose we use a threshold 𝜔 > 0.5 for principal weight, and let 𝑑 = ‖ 𝐯 ̃ 𝑘+1 ‖|〈 𝐞 𝑘 , 𝛍̃⟩|‖𝛍̃‖ , then the practical meaning of Theorem 1 is: if we find a large eigenvalue 𝜆̂ of 𝐓 𝑘 , then if 𝜆̂ − 𝑑 ≥ 𝜔 , then the largest eigenvalue of the “normalized” correlation matrix 𝐑 must exceed 𝜔 , and a warning can be immediately raised; in contrast, if 𝜆̂ + 𝑑 < 𝜔 and 𝜆̂ − 𝑑 ≥ 0.5 , then the largest eigenvalue of 𝐑 must not exceed 𝜔 , and we can simply wait for next window slide. For 𝜆̂ − 𝑑 < 0.5 , we setup as threshold 𝑐 , and if 𝜆̂ − 𝑑 stays below for 𝑐 rounds, then the largest eigenvalue of 𝐑 is not likely to exceed 𝜔 , and we can continue to next window slide. This is because the largest eigenvalue of 𝐓 𝑘 approaches the largest eigenvalue of 𝐑 from below, therefore when the estimated eigenvalue plus the error bound stays below 0.5 for many iterations, it becomes increasingly unlikely for the eigenvalue to exceed a threshold 𝜔 > 0.5 . Once the largest eigenvalue 𝜆̂ of 𝐓 𝑘 is estimated, its eigenvector 𝛍 ̃ can be found trivially, because we can set 𝛍 ̃( ) to any positive number if ‖𝐯 ‖ ≠ 𝜆̂ , and 𝛍 ̃( ) = 0 otherwise, then other components of 𝛍 ̃ can be solved iteratively by 𝐓 𝑘 𝛍̃ = 𝜆̂𝛍̃ , and 𝛍̂ = 𝐕 𝑘 𝛍̃‖𝐕 𝑘 𝛍̃‖ is the estimated principal component of 𝐑 . The correlation between host 𝑗 and 𝛍̂ can be computed by 𝜌 𝑗 = 𝛍̂(𝑗)√𝜆̂ . We recommend finding the first knee point of the descendingly sorted list of all 𝜌 𝑗 , e.g. the point before the first sharp slope, which is inspired by the scree plot of PCA; meanwhile we can disregard hosts with 𝜌 𝑗 < 𝜔 in case the knee point is too low. Hosts satisfying both criteria are considered as potential bots. Finally, we give some more facts that will be used in the algorithm. First a loose bound of eigenvalues of 𝐓 𝑘 are given by 𝜆 𝑙 ≤ 𝜆 ≤ 𝜆 𝑢 where 𝜆 𝑙 = max{min{〈𝐑𝐯 , 𝐯 ⟩ − ‖𝐯̃ ‖, 〈𝐑𝐯 𝑖 , 𝐯 𝑖 ⟩ − ‖𝐯̃ 𝑖+1 ‖ − ‖𝐯̃ 𝑖 ‖}, 0} 𝜆 𝑢 = min{max{〈𝐑𝐯 , 𝐯 ⟩ + ‖𝐯̃ ‖, 〈𝐑𝐯 𝑖 , 𝐯 𝑖 ⟩ + ‖𝐯̃ 𝑖+1 ‖ + ‖𝐯̃ 𝑖 ‖}, trace 𝐓 𝑘 } 𝑖 = 2, … , 𝑘 (29) Secondly, the 𝑖 th leading principal minor 𝑝 𝑖 (𝜆) of 𝐓 𝑘 − 𝜆𝐈 can be computed as 𝑝 𝑖 (𝜆) = {1 𝑖 = 0〈𝐑𝐯 , 𝐯 ⟩ − 𝜆 𝑖 = 1(〈𝐑𝐯 𝑖 , 𝐯 𝑖 ⟩ − 𝜆)𝑝 𝑖−1 (𝜆) − ‖𝐯̃ 𝑖 ‖ 𝑝 𝑖−2 (𝜆) 𝑖 = 2, … , 𝑘 (30) and we use 𝑠 𝑘 (𝜆) to denote the number of sign changes in 𝑝 , … , 𝑝 𝑘 . A complete Lanczos method based bot detection algorithm is now as below, based on tridiagonal-matrix eigenvalue estimation algorithm in [16, 21], where we innovate on the termination condition for our particular application using Theorem 1 and Theorem 2. The total time complexity of Algorithm 2 for one window slide is 𝑂(𝑘 ∗ 𝑚 ) where 𝑘 ∗ is the value of 𝑘 when the computation for the current window slide ends, and 𝑚 is the number of distinct hosts. In practice, the average complexity scales near quadratic as the average number of 𝑘 ∗ grows slowly with 𝑚 . For detection of multiple botnets, the part in marked by ⋆ in Algorithm 2 can be modified to compute more eigenvalues of 𝐓 𝑘 by nesting a loop the same as the do-while loop except for the total variance is minus the principal weight, and termination conditions need corresponding adjustment. We omit the detail for space. IV. E XPERIMENT & E VALUATION
We simulated both single bots and botnet on an ecommerce web server and collected about four months’ log data, totaling 315,688,764 Apache access log entries for our experiments, with 3,075,108 distinct request identifiers (URL to website resources) and distinct 2,519,022 host identifiers (IP addresses or hostname). The human-like visit rate is estimated by average interval between two requests of HTML web pages, which is 39 secs for this dataset. Some of the hosts in the dataset mark themselves as bots. They are mostly search engines, like google-bots, bing-bots, yahoo-bots. However, they cannot be used as gold standard because they are very well-behaved even though they are bots and hard to detect. Thus, we have to run simulation. The simulation is real on the website server so the generated logs are realistic, and it is guaranteed that it is absolutely harmless for the target server. The simulation is done in four modes to mimic different types of bots. 1)
Single request: the simulator randomly choose a link from a fixed list and keeps repeatedly visiting the link for two hours, then the simulator picks the next link. 2)
Random list: for every visit, the simulator randomly chooses a link from a list to visit. 3)
Fixed list: the simulator visits links of a fixed list in its original order. 4)
Focused random walk: the simulator works like a crawler – it downloads a webpage, extracts links of particular pattern from it, and then randomly picks the next link to visit. The four modes can mimic different types of bots. For example, DOS/DDOS might take the form of any of the first three types; click fraud or statistic skewer are usually the second and the third form, visiting a list of desired target links; price scraper or web crawler could be any of the last three types. We have several further comments: 1) multiple log entries could be generated for a single visit; 2) for all modes, the simulator requests at a Gaussian distributed human-like random rate estimated by the true human visits; 3) only one computer with a distinct IP is used to simulate each mode, therefore the simulation is harmless for a website that handles tens of thousands of customers each day. The simulated logs for each visit are identified, and then duplicated with new host ids and mixed as desired for various experiment purposes. For example, we can mimic single-bots of high visit rate of each type by duplicating them several times in a time window without changing their host id; we can mimic a massive bot net of human-like visit rate by choosing logs of several different visits, duplicate them many times with distinct host ids, and randomly mixed them with existing logs in several consecutive windows. The simulated logs are used as gold standard for performance analysis. A. Performance Analysis
This section presents comparison of accuracy, runtime and sensitivity between PCA and our Lanczos-based algorithm. We run PCA with fixed window and Lanczos with both fixed window (denoted by
Lanczos-B ) and sliding window (denoted by
Lanczos-S ). All experiments are restricted to one thread for fairness. Both methods have researches on parallelism as mentioned in section II.A. For this paper, we focus on experiments with one thread. For our purpose, we define the accuracy as the percentage of known bot visits that can be correctly marked by the algorithm. We compare the Lanczos-method based algorithm with the full PCA on our dataset and measure both running time, accuracy and sensitivity. We experiment on both fixed windows and sliding windows of length from 10 mins to 50 mins. Logs for 2100 single bot visits are placed at random time points with 30 times to 50 times faster visit rate than humans; 100 botnets of 10 to 100 hosts are placed at randomly chosen time points with 1 to 5 times human-like visit rate, and for simplicity, we let bots in a botnet start working simultaneously at the chosen time point, and their visits do not overlap. Each bot/botnet visit lasts from 10 mins to 2hrs. Window slide step is 10% of the window length. For parameters of Algorithm 2, we let 𝜔 = 0.65 , 𝜖 = 10 −10 , 𝜖 = 0.01 , 𝑘 𝑙 , 𝑘 𝑢 , 𝑘 𝑠 are set to and of the number of hosts in the window, and = 25 . The value of 𝜖 is given by [16, 21]. The choice of 𝜔, 𝜖 , 𝑐 will be experimented in next section. The results for time and accuracy are shown in Table 1, and it provides clear evidence for the advantage of our algorithm against PCA. Single Bots Botnets (3)
Run Time (3)
10m PCA 70.5% 52.4% 12s (1) /213s (2)
Lanczos-B 70.1% 50.3%
Lanczos-S
Lanczos-S
Lanczos-S
Lanczos-S
PCA
Lanczos-B
Lanczos-S
Table 1 Accuracy & runtime comparison of PCA with fixed time window, Lanczos with fixed time window (Lanczos-B) and Lanczos with sliding time window (Lanczos-S). Results in bold are better. (1) Average runtime of the algorithm on all time windows; (2) maximum runtime; (3) detection of botnets with better accuracy and less running time is the main technical purpose of this research.
Overall, we have several conclusions from Table 1: 1) It is not possible to run PCA with sliding window of length longer than 20m, as its computation time will generally exceed the sliding step; 2) Lanczos-B has competitive performance in comparison to PCA for bot detection; 3) Lanczos has much better running time, and on average it grows almost linearly with the window length; 4) Lanczos with sliding window consistently has higher accuracy than fixed window. A subtler implication of above results is about choice of window length . For single bots, longer window length damages performance, which is the weakness of all correlation-based botnet detection algorithms. This is because if there exists more than one “not-so-correlated” single bots in a time window, the principal weight will plunge, and longer time window length increases the probability of this situation. For botnet, within a range, accuracy increases with window length, because more data provide more statistical evidence of correlation if the botnet exists, especially when botnets include randomness in their visit pattern, like the random walk and random list in our simulation. However, longer time window usually makes the algorithm less capable of detecting bots with shorter visit duration. For example, for 50-min window length, botnet visits of less than 20 minutes become less discoverable. Given the assumption that botnets often have to maintain at least a request per 30 to 40 secs (the human-like visit rate of our dataset) to achieve its goal at a reasonable cost, 40 minutes’ data generally are sufficient for exposing them. In practice, we would recommend running at least two threads to monitor a log stream, one with short length (like 10-min, 15-min) for discovery of single bots and short-duration visits of botnets, and the other with medium length (like 30-min, 40-min) for detection of other botnets. If detection of slower botnet is desired, we can add one more thread with longer window length. We now present the sensitivity of both PCA and Lanczos-S which is defined as how much time does the algorithm needs to detect the bots since their initial requests. The experiments are done for 10-min window length and 40-min window length to test how what percentage of bots are detected within a specified sensitivity. The results are shown in Figure 3 with a point (𝑥, 𝑦) on a curve for algorithm 𝑧 means 𝑥 % of the bots detected by 𝑧 are detected within 𝑦 minutes of their initial request. A higher curve implies better sensitivity. The overall average sensitivity results are shown in Table 2. We can observe a clear advantage of Lanczos-S from both Figure 3 and Table 2. The advantage comes from two sources: less computation time of Lanczos, and the sliding window. Ultimately, it is all due to the less time complexity of Lanczos, which enables us to use sliding window rather than fixed windows. (a) 10-min window (b) 40-min window Figure 3 Experiment results for the sensitivity of PCA and Lanczos-S. “-Bot” suffix means results for single bots, “-Botnet” suffix means results for botnet-bots. x-axis marks the time in minutes since initial request, y-axis indicates among bots detected by the algorithm, what percentage is detected within corresponding minutes since initial request. For example, a point at (𝒙, 𝒚) on the curve for PCA-Botnet means 𝒙 % of all detected botnet-bots are detected within 𝒚 minutes since their initial request. B. Experiment on Parameters
This section presents experiments on parameters 𝜔, 𝑐 and 𝜖 and provides some insight into choice of their values. The experiments is on the whole dataset with Lanczos-B, because Lanczos-S takes too much time, and the results of Lanczos-B should be good enough for our discussion. In the case of real botnet detection, we can run the following experiments on historical logs and decide good values for the parameters. For choice of 𝜔 , we first run Lanczos-B on the original data (without simulation) over all fixed 40-min windows, and then use the Markov chain behavior model in section II.B to recognize and remove strangely-behaved suspicious hosts, and then run the Lanczos-B over all windows again. The principal weight distribution change is shown in Figure 4. The histogram shows most suspicious hosts are in window with principal weights higher than 0.5 quite possibly contain bots. However, 1) our algorithm requires that 𝜔 > 0.5 ; 2) we emphasize botnet detection, and mile principal weight is less likely to imply existence of botnet; 3) the rate of false alarm hould be contained at a reasonable level. Considering all these factors, we choose 𝜔 = 0.65 as our threshold. Figure 4 Principal weight distribution change after using the Markov chain behavior model to remove strangely-behaved hosts from the logs of all 40-min fixed time windows.
For choice of 𝑐 , it affects the early termination for a window where bots are not likely to exit, and a smaller 𝑐 will decease runtime, but possibly introduce certain error. For choice of 𝜖 , it affects the early termination for a window where bots are detected, and smaller 𝜖 indicates a more accurate principal component is desired. The results using Lanczos-B with both 10-min window and 40-min window are shown in Figure 5, and for the experiments in previous section we choose parameter value at the “elbow” points in Figure 5 (b),(d). We in addition remark that: 1) for shorter time window, larger 𝑐 may take larger value without hurting much accuracy, thus if we run a second thread with short time window as suggested in choice of window length in previous section, we could specify a larger 𝑐 ; 2) 𝜖 has stronger effect on performance, thus we recommend setting it to a small value even for short time window. (a) 𝑐 , 10-min window (b) 𝑐 , 40-min window (c) 𝜖 , 10-min window (d) 𝜖 , 40-min window Figure 5 Relation between 𝒄, 𝝐 with average runtime and accuracy for Lanczos method. Smaller 𝒄, 𝝐 improves accuracy at cost of more runtime. For experiments in previous section, we choose 𝒄, 𝝐 at the elbow point 𝒄 =𝟐𝟓 and 𝝐 = 𝟎. 𝟎𝟏 of (c) and (d). C. Botnet Examples
We apply our algorithm on the original data without simulated logs, and 4,557 distinct hosts are marked as potential bots, with 3,417 of them recognized as botnet-bots in 232 potential botnets. Botnets discovered in different time windows are merged if they share two or more hosts. We use an automatic program to compare the IPs with the Barracuda IP reputation database and 88% of them have poor reputation. Some of the botnets are search engine like “crawl-66-249-xx.xxx.googlebot.com” that can be actually confirmed. Besides that, we manually confirmed one of the top-rated botnet that do not mark themselves as bots in the access log, which comes from a website monitor company Anturis, as shown in Figure 6. Some other botnets are demonstrated in Figure 7, which clearly shows that one bot in the botnet might play the role of a controller.
Figure 6 A top-rated botnet is manually confirmed from a company providing website monitoring service. At the time our experiment, they do not mark themselves as bots in the Apache access log.
Figure 7 Some other examples of recognized botnets, where we can see usually one bot might play the role of a controller. V. C ONCLUSION & F UTURE W ORK
The main objective of this paper is to recognize botnets from streaming web server logs. We recognize and adapt Lanczos method to the application of botnet detection. For his purpose, we first develop the online correlation matrix updates, and feed them to the Lanczos iterations. Making use of Lanczos error bound the non-decreasing eigenvalues of symmetric matrices, and the special properties of our application, a method is proposed to terminate the iterations early. Our approach improves time use of eigenvalue-based botnet detection from cubic to sub-cubic, which enables us to monitor the log stream by sliding windows, rather than batch-based detection. Experiments show the time cost of Lanczos method with different time windows are consistently only 20% to 25% of PCA. In the future, we could further the research in two directions: 1) finding its good use in other anomaly detection applications; 2) compensate its weakness in single bot detection by the Markov-chain behavior model as mentioned in section II.B. R
EFERENCES
1. Dunham, K. and J. Melnick,
Malicious bots: an inside look into the cyber-criminal underground of the internet . 2008: CrC Press. 2. Tegeler, F., et al.
Botfinder: Finding bots in network traffic without deep packet inspection . in
Proceedings of the 8th international conference on Emerging networking experiments and technologies . 2012. ACM. 3. Kang, A.R., et al.,
Online game bot detection based on party-play log analysis.
Computers & Mathematics with Applications, 2013. (9): p. 1384-1395. 4. Carter, K.M., N. Idika, and W.W. Streilein. Probabilistic threat propagation for malicious activity detection . in
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on . 2013. IEEE. 5. Jin Zhang, Z.C., Chi Zhang,
Bot Detection Based on Divergence and Variance , I. CA Technologies. Patent Application US20170102US1. 2016: United States. 6. Kang, H., et al.
Large-scale bot detection for search engines . in
Proceedings of the 19th international conference on World wide web . 2010. ACM. 7. Thomas, K. and D.M. Nicol.
The Koobface botnet and the rise of social malware . in
Malicious and Unwanted Software (MALWARE), 2010 5th International Conference on . 2010. IEEE. 8. Manasrah, A.M., et al.,
Detecting botnet activities based on abnormal DNS traffic. arXiv preprint arXiv:0911.0487, 2009. 9. Zhao, Y., et al.
BotGraph: Large Scale Spamming Botnet Detection . in
NSDI . 2009. 10. Goebel, J. and T. Holz,
Rishi: Identify Bot Contaminated Hosts by IRC Nickname Evaluation.
HotBots, 2007. : p. 8-8. 11. Gu, G., et al. BotHunter: Detecting Malware Infection Through IDS-Driven Dialog Correlation . in
USENIX Security Symposium . 2007. 12. Wurzinger, P., et al.
Automatically generating models for botnet detection . in
European symposium on research in computer security . 2009. Springer. 13. Bhatia, S., D. Schmidt, and G. Mohay.
Ensemble-based ddos detection and mitigation model . in
Proceedings of the Fifth International Conference on Security of Information and Networks . 2012. ACM. 14. Rayana, S. and L. Akoglu,
Less is more: Building selective anomaly ensembles.
ACM Transactions on Knowledge Discovery from Data (TKDD), 2016. (4): p. 42. 15. Bot Detection and Mitigation
Matrix computations . Vol. 3. 2012: JHU Press. 17. Allaire, G. and S.M. Kaber,
Numerical linear algebra . Vol. 55. 2008: Springer. 18. Xie, M., S. Han, and B. Tian.
Highly efficient distance-based anomaly detection through univariate with PCA in wireless sensor networks . in
Trust, Security and Privacy in Computing and Communications (TrustCom), 2011 IEEE 10th International Conference on . 2011. IEEE. 19. Pevny, T., M. Rehák, and M. Grill.
Detecting anomalous network hosts by means of pca . in
Information Forensics and Security (WIFS), 2012 IEEE International Workshop on . 2012. IEEE. 20. Wise, B. and N. Ricker.
Recent advances in multivariate statistical process control: improving robustness and sensitivity . in
Proceedings of the IFAC. ADCHEM Symposium . 1991. 21. Li, W., et al.,
Recursive PCA for adaptive process monitoring.
Journal of process control, 2000. (5): p. 471-486. 22. Yu, F., Y. Xie, and Q. Ke. Sbotminer: large scale search bot detection . in
Proceedings of the third ACM international conference on Web search and data mining . 2010. ACM. 23. Chi Zhang, Z.C., Jin Zhang,
Bot Detection System Based On Deep Learning , I. CA Technologies, Patent Application 20170109US1. 2016: United States. 24. Zheng Chen, J.Z., Chi Zhang,
Bot Detection Based On Behavior Analytics , I. CA Technologies, Patent Application US20170131US1. 2016: United States. 25. Jeng, J.-C.,
Adaptive process monitoring using efficient recursive PCA and moving window PCA algorithms.
Journal of the Taiwan Institute of Chemical Engineers, 2010. (4): p. 475-481. 26. Wang, X., U. Kruger, and G.W. Irwin, Process monitoring approach using fast moving window PCA.
Industrial & Engineering Chemistry Research, 2005. (15): p. 5691-5702. 27. Rayleigh quotient iteration . Available from: https://en.wikipedia.org/wiki/Rayleigh_quotient_iteration. 28. Cullum, J.K. and R.A. Willoughby,
Lanczos algorithms for large symmetric eigenvalue computations: Vol. I: Theory . 2002: SIAM. 29. Da Fonseca, C.,
On the eigenvalues of some tridiagonal matrices.
Journal of Computational and Applied Mathematics, 2007. (1): p. 283-286. 30. Osipov, A.,
Evaluation of small elements of the eigenvectors of certain symmetric tridiagonal matrices with high relative accuracy.
Applied and Computational Harmonic Analysis, 2015. 31. Kahan, W.,
Accurate eigenvalues of a symmetric tri-diagonal matrix . 1966, STANFORD UNIV CA DEPT OF COMPUTER SCIENCE. 32. Cuppen, J.,
A divide and conquer method for the symmetric tridiagonal eigenproblem.
Numerische Mathematik, 1980. (2): p. 177-195. 33. Xu, G. and T. Kailath, Fast estimation of principal eigenspace using Lanczos algorithm.
SIAM Journal on Matrix Analysis and Applications, 1994. (3): p. 974-994.(3): p. 974-994.