Is this you? Create Your Porfile

Bo Zong

University of California, Santa Barbara

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Bo Zong is active.

Explore More

Publication

Featured researches published by Bo Zong.

international conference on management of data | 2012

Towards effective partition management for large graphs

Shengqi Yang; Xifeng Yan; Bo Zong; Arijit Khan

Searching and mining large graphs today is critical to a variety of application domains, ranging from community detection in social networks, to de novo genome sequence assembly. Scalable processing of large graphs requires careful partitioning and distribution of graphs across clusters. In this paper, we investigate the problem of managing large-scale graphs in clusters and study access characteristics of local graph queries such as breadth-first search, random walk, and SPARQL queries, which are popular in real applications. These queries exhibit strong access locality, and therefore require specific data partitioning strategies. In this work, we propose a Self Evolving Distributed Graph Management Environment (Sedge), to minimize inter-machine communication during graph query processing in multiple machines. In order to improve query response time and throughput, Sedge introduces a two-level partition management architecture with complimentary primary partitions and dynamic secondary partitions. These two kinds of partitions are able to adapt in real time to changes in query workload. (Sedge) also includes a set of workload analyzing algorithms whose time complexity is linear or sublinear to graph size. Empirical results show that it significantly improves distributed graph processing on todays commodity clusters.

international conference on computer communications | 2012

Efficient multicasting for delay tolerant networks using graph indexing

Misael Mongiovì; Ambuj K. Singh; Xifeng Yan; Bo Zong; Konstantinos Psounis

In Delay Tolerant Networks (DTNs), end-to-end connectivity between nodes does not always occur due to limited radio coverage, node mobility and other factors. Remote communication may assist in guaranteeing delivery. However, it has a considerable cost, and consequently, minimizing it is an important task. For multicast routing, the problem is NP-hard, and naive approaches are infeasible on large problem instances. In this paper we define the problem of minimizing the remote communication cost for multicast in DTNs. Our formulation handles the realistic scenario in which a data source is continuously updated and nodes need to receive recent versions of data. We analyze the problem in the case of scheduled trajectories and known traffic demands, and propose a solution based on a novel graph indexing system. We also present an adaptive extension that can work with limited knowledge of node mobility. Our method reduces the search space significantly and finds an optimal solution in reasonable time. Extensive experimental analysis on large real and synthetic datasets shows that the proposed method completes in less than 10 seconds on datasets with millions of encounters, with an improvement of up to 100 times compared to a naive approach.

international conference on data engineering | 2014

Cloud service placement via subgraph matching

Bo Zong; Ramya Raghavendra; Mudhakar Srivatsa; Xifeng Yan; Ambuj K. Singh; Kang-Won Lee

Fast service placement, finding a set of nodes with enough free capacity of computation, storage, and network connectivity, is a routine task in daily cloud administration. In this work, we formulate this as a subgraph matching problem. Different from the traditional setting, including approximate and probabilistic graphs, subgraph matching on data-center networks has two unique properties. (1) Node/edge labels representing vacant CPU cycles and network bandwidth change rapidly, while the network topology varies little. (2) There is a partial order on node/edge labels. Basically, one needs to place service in nodes with enough free capacity. Existing graph indexing techniques have not considered very frequent label updates, and none of them supports partial order on numeric labels. Therefore, we resort to a new graph index framework, Gradin, to address both challenges. Gradin encodes subgraphs into multi-dimensional vectors and organizes them with indices such that it can efficiently search the matches of a querys subgraphs and combine them to form a full match. In particular, we analyze how the index parameters affect update and search performance with theoretical results. Moreover, a revised pruning algorithm is introduced to reduce unnecessary search during the combination of partial matches. Using both real and synthetic datasets, we demonstrate that Gradin outperforms the baseline approaches up to 10 times.

international conference on data mining | 2012

Inferring the Underlying Structure of Information Cascades

Bo Zong; Yinghui Wu; Ambuj K. Singh; Xifeng Yan

In social networks, information and influence diffuse among users as cascades. While the importance of studying cascades has been recognized in various applications, it is difficult to observe the complete structure of cascades in practice. In this paper we study the cascade inference problem following the independent cascade model, and provide a full treatment from complexity to algorithms: (a) we propose the idea of consistent trees as the inferred structures for cascades, these trees connect source nodes and observed nodes with paths satisfying the constraints from the observed temporal information. (b) We introduce metrics to measure the likelihood of consistent trees as inferred cascades, as well as several optimization problems for finding them. (c) We show that the decision problems for consistent trees are in general NP-complete, and that the optimization problems are hard to approximate. (d) We provide approximation algorithms with performance guarantees on the quality of the inferred cascades, as well as heuristics. We experimentally verify the efficiency and effectiveness of our inference algorithms, using real and synthetic data.

very large data bases | 2015

Behavior query discovery in system-generated temporal graphs

Bo Zong; Xusheng Xiao; Zhichun Li; Zhenyu Wu; Zhiyun Qian; Xifeng Yan; Ambuj K. Singh; Guofei Jiang

Computer system monitoring generates huge amounts of logs that record the interaction of system entities. How to query such data to better understand system behaviors and identify potential system risks and malicious behaviors becomes a challenging task for system administrators due to the dynamics and heterogeneity of the data. System monitoring data are essentially heterogeneous temporal graphs with nodes being system entities and edges being their interactions over time. Given the complexity of such graphs, it becomes time-consuming for system administrators to manually formulate useful queries in order to examine abnormal activities, attacks, and vulnerabilities in computer systems. In this work, we investigate how to query temporal graphs and treat query formulation as a discriminative temporal graph pattern mining problem. We introduce TGMiner to mine discriminative patterns from system logs, and these patterns can be taken as templates for building more complex queries. TGMiner leverages temporal information in graphs to prune graph patterns that share similar growth trend without compromising pattern quality. Experimental results on real system data show that TGMiner is 6-32 times faster than baseline methods. The discovered patterns were verified by system experts; they achieved high precision (97%) and recall (91%).

knowledge discovery and data mining | 2014

Towards scalable critical alert mining

Bo Zong; Yinghui Wu; Jie Song; Ambuj K. Singh; Hasan Cam; Jiawei Han; Xifeng Yan

Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.

IEEE Transactions on Knowledge and Data Engineering | 2016

Nearest Keyword Set Search in Multi-Dimensional Datasets

Vishwakarma Singh; Bo Zong; Ambuj K. Singh

Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our experimental results on real and synthetic datasets show that ProMiSH has up to 60 times of speedup over state-of-the-art tree-based techniques.

conference on information and knowledge management | 2018

TGNet: Learning to Rank Nodes in Temporal Graphs

Qi Song; Bo Zong; Yinghui Wu; Lu-An Tang; Hui Zhang; Guofei Jiang; Haifeng Chen

Node ranking in temporal networks are often impacted by heterogeneous context from node content, temporal, and structural dimensions. This paper introduces TGNet , a deep learning framework for node ranking in heterogeneous temporal graphs. TGNet utilizes a variant of Recurrent Neural Network to adapt context evolution and extract context features for nodes. It incorporates a novel influence network to dynamically estimate temporal and structural influence among nodes over time. To cope with label sparsity, it integrates graph smoothness constraints as a weak form of supervision. We show that the application of TGNet is feasible for large-scale networks by developing efficient learning and inference algorithms with optimization techniques. Using real-life data, we experimentally verify the effectiveness and efficiency of TGNet techniques. We also show that TGNet yields intuitive explanations for applications such as alert detection and academic impact ranking, as verified by our case study.

acm special interest group on data communication | 2018

Deep Learning IP Network Representations

Mingda Li; Cristian Lumezanu; Bo Zong; Haifeng Chen

We present DIP, a deep learning based framework to learn structural properties of the Internet, such as node clustering or distance between nodes. Existing embedding-based approaches use linear algorithms on a single source of data, such as latency or hop count information, to approximate the position of a node in the Internet. In contrast, DIP computes low-dimensional representations of nodes that preserve structural properties and non-linear relationships across multiple, heterogeneous sources of structural information, such as IP, routing, and distance information. Using a large real-world data set, we show that DIP learns representations that preserve the real-world clustering of the associated nodes and predicts distance between them more than 30% better than a mean-based approach. Furthermore, DIP accurately imputes hop count distance to unknown hosts (i.e., not used in training) given only their IP addresses and routable prefixes. Our framework is extensible to new data sources and applicable to a wide range of problems in network monitoring and security.

international conference on learning representations | 2018