Kenji Yoshihira
Princeton University
Publications
Featured research published by Kenji Yoshihira.
Integrated Network Management | 2011
Ming Chen; Hui Zhang; Ya-Yunn Su; Xiaorui Wang; Guofei Jiang; Kenji Yoshihira
In this paper, we address the problem of server consolidation in virtualized data centers from the perspective of approximation algorithms. We formulate server consolidation as a stochastic bin packing problem: given the server capacity and an allowed server overflow probability p, the objective is to assign VMs to as few physical servers as possible such that the probability that the aggregated load of any physical server exceeds its capacity is at most p.
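As a rough illustration of this formulation, here is a minimal sketch (our own, not the paper's approximation algorithm) that packs VMs with Gaussian-approximated loads using a first-fit-decreasing heuristic; the `Vm`/`Server` names and the normal-sum assumption are ours:

```python
import math
import statistics
from dataclasses import dataclass

@dataclass
class Vm:
    mean: float  # mean resource demand
    var: float   # variance of resource demand

@dataclass
class Server:
    capacity: float
    mean: float = 0.0
    var: float = 0.0

def fits(server, vm, p):
    """Accept the VM only if P(aggregate load > capacity) <= p, assuming
    independent VM loads and a normal approximation for their sum."""
    m = server.mean + vm.mean
    s = math.sqrt(server.var + vm.var)
    z = statistics.NormalDist().inv_cdf(1.0 - p)  # tail quantile for p
    return m + z * s <= server.capacity

def consolidate(vms, capacity, p):
    """First-fit decreasing by mean demand; returns the list of servers used."""
    servers = []
    for vm in sorted(vms, key=lambda v: v.mean, reverse=True):
        for srv in servers:
            if fits(srv, vm, p):
                srv.mean += vm.mean
                srv.var += vm.var
                break
        else:
            servers.append(Server(capacity, vm.mean, vm.var))
    return servers
```

For example, `consolidate([Vm(0.3, 0.01), Vm(0.5, 0.04)], capacity=1.0, p=0.05)` returns the opened servers; a smaller p forces more headroom on each server.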
International Conference on Autonomic Computing | 2006
Guofei Jiang; Haifeng Chen; Kenji Yoshihira
A large amount of monitoring data can be collected from distributed systems as observables to analyze system behavior. However, without reasonable models to characterize systems, we can hardly interpret such monitoring data effectively for system management. In this paper, a new concept named flow intensity is introduced to measure the intensity with which internal monitoring data reacts to the volume of user requests in distributed transaction systems. We propose a novel approach to automatically model and search for relationships between the flow intensities measured at various points across the system. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Experimental results from a real system demonstrate that such invariants exist widely in distributed transaction systems. Further, we discuss how such invariants can be used to characterize complex systems and support autonomic system management.
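A minimal sketch of how flow intensities might be computed, assuming timestamped event logs and a fixed window size (both illustrative choices of ours, not details from the paper):

```python
from collections import defaultdict

def flow_intensities(events, window=1.0):
    """events: iterable of (timestamp, monitoring_point) pairs.
    Returns {monitoring_point: [count in window 0, count in window 1, ...]},
    i.e., one flow intensity time series per monitoring point."""
    counts = defaultdict(lambda: defaultdict(int))
    max_idx = 0
    for ts, point in events:
        idx = int(ts // window)
        counts[point][idx] += 1
        max_idx = max(max_idx, idx)
    return {point: [c[i] for i in range(max_idx + 1)]
            for point, c in counts.items()}
```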
IEEE Transactions on Dependable and Secure Computing | 2006
Guofei Jiang; Haifeng Chen; Kenji Yoshihira
With the prevalence of Internet services and the increase in their complexity, there is a growing need to improve their operational reliability and availability. While a large amount of monitoring data can be collected from systems for fault analysis, it is hard to correlate this data effectively across distributed systems and observation time. In this paper, we analyze the mass characteristics of user requests and propose a novel approach to model and track transaction flow dynamics for fault detection in complex information systems. We measure the flow intensity at multiple checkpoints inside the system and apply system identification methods to model transaction flow dynamics between these measurements. With the learned analytical models, a model-based fault detection and isolation method is applied to track the flow dynamics in real time for fault detection. We also propose an algorithm to automatically search for and validate the dynamic relationships between randomly selected monitoring points. Our algorithm enables systems to have self-cognition capability for system management. Our approach is tested in a real system with a list of injected faults. Experimental results demonstrate the effectiveness of our approach and algorithms.
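The abstract does not give the model details; as an illustration of the system-identification step, the following sketch fits a low-order ARX model between two flow intensity series by least squares and scores it with a normalized fitness. The specific ARX orders and the fitness threshold to apply are assumptions for this sketch:

```python
import numpy as np

def fit_arx(x, y):
    """Fit y(t) = a*y(t-1) + b0*x(t) + b1*x(t-1) + c by least squares,
    where x and y are two flow intensity time series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    A = np.column_stack([y[:-1], x[1:], x[:-1], np.ones(len(y) - 1)])
    theta, *_ = np.linalg.lstsq(A, y[1:], rcond=None)
    y_hat = A @ theta
    # Normalized fitness: 1 means a perfect fit; values near 1 suggest a
    # stable relationship, i.e., a candidate invariant.
    fitness = 1.0 - (np.linalg.norm(y[1:] - y_hat)
                     / np.linalg.norm(y[1:] - y[1:].mean()))
    return theta, fitness
```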
IEEE Transactions on Knowledge and Data Engineering | 2007
Guofei Jiang; Haifeng Chen; Kenji Yoshihira
Distributed systems generate a large amount of monitoring data, such as log files, to track their operational status. However, it is hard to correlate such monitoring data effectively across distributed systems and along observation time for system management. In previous work, we proposed a concept named flow intensity to measure the intensity with which internal monitoring data reacts to the volume of user requests. We calculated flow intensity measurements from monitoring data and proposed an algorithm to automatically search for constant relationships between flow intensities measured at various points across distributed systems. If such relationships hold all the time, we regard them as invariants of the underlying systems. Invariants can be used to characterize complex systems and support various system management tasks. However, the computational complexity of the previous invariant search algorithm is high, so it may not scale well in large systems with thousands of measurements. In this paper, we propose two efficient but approximate algorithms for inferring invariants in large-scale systems. The computational complexity of the new randomized algorithms is significantly reduced, and experimental results from a real system are included to demonstrate the accuracy and efficiency of our new algorithms.
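The paper's two randomized algorithms are not reproduced here; as an illustration of why pruning helps, this stand-in of our own buckets each series with its best-matching random anchor and tests pairs only within buckets, avoiding the full O(N^2) pair scan. The anchor count, threshold, and correlation-based fitness are all illustrative choices:

```python
import random
import numpy as np

def pair_fitness(a, b):
    """Cheap stand-in for a full ARX fit: the absolute Pearson
    correlation between two flow intensity series."""
    return abs(np.corrcoef(a, b)[0, 1])

def approximate_invariant_search(series, n_anchors=8, thresh=0.9):
    """series: {name: list of flow intensity samples}.
    Returns (name_a, name_b, score) triples for likely invariants."""
    names = list(series)
    anchors = random.sample(names, min(n_anchors, len(names)))
    buckets = {a: [] for a in anchors}
    for name in names:
        best = max(anchors, key=lambda a: pair_fitness(series[name], series[a]))
        buckets[best].append(name)
    invariants = []
    for members in buckets.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                score = pair_fitness(series[a], series[b])
                if score >= thresh:
                    invariants.append((a, b, score))
    return invariants
```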
Cluster Computing | 2006
Guofei Jiang; Haifeng Chen; Kenji Yoshihira
A large amount of monitoring data can be collected from distributed systems as observables to analyze system behavior. However, without reasonable models to characterize systems, we can hardly interpret such monitoring data effectively for system management. In this paper, a new concept named flow intensity is introduced to measure the intensity with which internal monitoring data reacts to the volume of user requests in distributed transaction systems. We propose a novel approach to automatically model and search for relationships between the flow intensities measured at various points across the system. If the modeled relationships hold all the time, they are regarded as invariants of the underlying system. Experimental results from a real system demonstrate that such invariants exist widely in distributed transaction systems. Further, we discuss how such invariants can be used to characterize complex systems and support autonomic system management.
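As a complement to the modeling sketches above, here is one illustrative way (our own assumption, not the paper's mechanism) that a learned invariant could support autonomic management: track its residual online and report it as broken when the residual exceeds a threshold.

```python
def check_invariant(theta, x_prev, x_now, y_prev, y_now, thresh=0.1):
    """theta = (a, b0, b1, c) from the fit_arx sketch above.
    Returns (residual, broken?) for the latest observation pair."""
    a, b0, b1, c = theta
    y_pred = a * y_prev + b0 * x_now + b1 * x_prev + c
    residual = abs(y_now - y_pred)
    return residual, residual > thresh
```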
International Conference on Autonomic Computing | 2005
Guofei Jiang; Haifeng Chen; Cristian Ungureanu; Kenji Yoshihira
Detection and diagnosis of faults in a large-scale distributed system is a formidable task. Interest in monitoring and using traces of user requests for fault detection has been on the rise recently. In this paper, we propose novel fault detection methods based on abnormal trace detection. One essential problem is how to represent the large amount of training trace data compactly as an oracle. Our key contribution is the novel use of varied-length n-grams and automata to characterize normal traces. A new trace is compared against the learned automata to determine whether it is abnormal. We develop algorithms to automatically extract n-grams and construct multiresolution automata from training data. Further, both deterministic and multi-hypothesis algorithms are proposed for detection. We inspect the trace constraints of real application software and verify the existence of long n-grams. Our approach is tested in a real system with injected faults and achieves good results in experiments.
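A minimal sketch of the n-gram idea: collect the set of n-grams seen in normal traces and flag a new trace as abnormal if it contains any unseen n-gram. The paper builds multiresolution automata over varied-length n-grams and also proposes multi-hypothesis detection; this fixed-n, set-based version only illustrates the deterministic principle:

```python
def ngrams(trace, n):
    """All length-n windows of a trace (a sequence of component/event names)."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def train(normal_traces, n=3):
    """Build the oracle: the vocabulary of n-grams observed in normal traces."""
    vocab = set()
    for trace in normal_traces:
        vocab |= ngrams(trace, n)
    return vocab

def is_abnormal(trace, vocab, n=3):
    # Deterministic detection: any unseen n-gram marks the trace abnormal.
    return bool(ngrams(trace, n) - vocab)
```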
IEEE Transactions on Network and Service Management | 2014
Hui Zhang; Guofei Jiang; Kenji Yoshihira; Haifeng Chen
The hindrances to the adoption of public cloud computing services include service reliability, data security and privacy, regulatory compliance requirements, and so on. To address these concerns, we propose a hybrid cloud computing model which users may adopt as a viable and cost-saving methodology to make the best use of public cloud services along with their privately-owned (legacy) data centers. As the core of this hybrid cloud computing model, an intelligent workload factoring service is designed for proactive workload management. It enables federation between on- and off-premise infrastructures for hosting Internet-based applications, and the intelligence lies in the explicit segregation of base workload and flash crowd workload, the two naturally different components composing the application workload. The core technology of the intelligent workload factoring service is a fast frequent data item detection algorithm, which enables factoring incoming requests not only on volume but also on data content, under changing application data popularity. Through analysis and extensive evaluation with real-trace-driven simulations and experiments on a hybrid testbed consisting of a local computing platform and the Amazon cloud service platform, we show that the proactive workload management technology can enable reliable workload prediction in the base workload zone (with simple statistical methods), achieve resource efficiency (e.g., 78% higher server capacity than in the base workload zone) and reduce data cache/replication overhead (by up to two orders of magnitude) in the flash crowd workload zone, and react fast (with an X^2 speed-up factor) to changing application data popularity upon the arrival of load spikes.
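The paper's fast frequent-data-item detection algorithm is not given in the abstract; as a stand-in, this sketch uses the well-known Space-Saving counter to track approximately hot items, which a workload factoring component could use to route requests for suddenly popular data to the flash-crowd (public cloud) zone. The counter budget k is an arbitrary choice here:

```python
def space_saving(stream, k=100):
    """Approximate heavy-hitter tracking with at most k counters.
    Returns {item: approximate count}; frequent items are retained."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the minimum counter and let the new item inherit
            # its count + 1 (the standard Space-Saving rule).
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```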
Knowledge Discovery and Data Mining | 2005
Haifeng Chen; Guofei Jiang; Cristian Ungureanu; Kenji Yoshihira
The increasing complexity of today's systems makes fast and accurate failure detection essential for their use in mission-critical applications. Various monitoring methods provide a large amount of data about system behavior. Analyzing this data with advanced statistical methods holds the promise of not only detecting errors faster, but also detecting errors that are difficult to catch with current monitoring tools. Two challenges in building such detection tools are the high dimensionality of observation data, which makes the models expensive to apply, and frequent system changes, which make the models expensive to update. In this paper, we present algorithms to reduce the dimensionality of data in a way that makes it easy to adapt to system changes. We decompose the observation data into signal and noise subspaces. Two statistics, the Hotelling T^2 score and the squared prediction error (SPE), are calculated to represent the data characteristics in the signal and noise subspaces, respectively. Instead of tracking the original data, we use a sequentially discounting expectation maximization (SDEM) algorithm to learn the distribution of the two extracted statistics. A failure event can then be detected based on an abnormal change of the distribution. Applying our technique to component interaction data in a simple e-commerce application shows better accuracy than building independent profiles for each component. Additionally, experiments on synthetic data show that the detection accuracy remains high even for changing systems.
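A sketch of the signal/noise decomposition step: project observations onto the top principal components, compute Hotelling's T^2 in the signal subspace and the SPE in the residual subspace. The number of components is an illustrative parameter, and the SDEM-based distribution tracking from the paper is not reproduced:

```python
import numpy as np

def t2_and_spe(X, n_components=3):
    """X: (samples x features) observation matrix.
    Returns per-sample Hotelling T^2 and SPE statistics."""
    Xc = X - X.mean(axis=0)
    # PCA via SVD; rows of Vt are principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                       # signal-subspace loadings
    lam = (s[:n_components] ** 2) / (len(X) - 1)  # component variances
    scores = Xc @ P
    t2 = np.sum(scores ** 2 / lam, axis=1)        # Hotelling's T^2
    resid = Xc - scores @ P.T                     # noise-subspace residual
    spe = np.sum(resid ** 2, axis=1)              # squared prediction error
    return t2, spe
```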
IEEE Transactions on Knowledge and Data Engineering | 2007
Haifeng Chen; Guofei Jiang; Kenji Yoshihira
Fast and accurate failure detection is becoming essential in managing large-scale Internet services. This paper proposes a novel detection approach based on the subspace mapping between the system inputs and internal measurements. By exploiting these contextual dependencies, our detector can initiate repair actions accurately, increasing the availability of the system. While a classical statistical method, canonical correlation analysis (CCA), is presented in the paper to achieve subspace mapping, we also propose a more advanced technique, principal canonical correlation analysis (PCCA), to improve the performance of the CCA-based detector. PCCA extracts a principal subspace from internal measurements that is not only highly correlated with the inputs but also a significant representative of the original measurements. Experimental results on a Java 2 Platform, Enterprise Edition (J2EE)-based Web application demonstrate that this property of PCCA is especially beneficial for failure detection tasks.
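A minimal CCA-based detector sketch, assuming scikit-learn is available: fit CCA between system inputs U and internal measurements Y on normal data, then flag samples whose paired canonical variates drift apart. The 3-sigma threshold is our own choice, and PCCA, the paper's refinement, is not implemented here:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_detector(U_train, Y_train, n_components=2):
    """U_train, Y_train: (samples x features) arrays of normal-period data."""
    cca = CCA(n_components=n_components).fit(U_train, Y_train)
    u_c, y_c = cca.transform(U_train, Y_train)
    # Residuals between paired canonical variates on normal data set the scale.
    r = np.linalg.norm(u_c - y_c, axis=1)
    return cca, r.mean() + 3 * r.std()

def is_faulty(cca, thresh, u, y):
    """u, y: 1-D arrays for a single new sample of inputs and measurements."""
    u_c, y_c = cca.transform(u.reshape(1, -1), y.reshape(1, -1))
    return np.linalg.norm(u_c - y_c) > thresh
```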
Dependable Systems and Networks | 2013
Abhishek Sharma; Haifeng Chen; Min Ding; Kenji Yoshihira; Guofei Jiang
Recent advances in sensing and communication technologies enable us to collect round-the-clock monitoring data from a wide array of distributed systems, including data centers, manufacturing plants, transportation networks, automobiles, etc. Often this data takes the form of time series collected from multiple sensors (hardware- as well as software-based). Previously, we developed an approach based on time-invariant relationships that uses Auto-Regressive models with eXogenous input (ARX) to model this data. A tool based on our approach has been effective for fault detection and capacity planning in distributed systems. In this paper, we first describe our experience in applying this tool in real-world settings. We also discuss the challenges in fault localization that we face when using our tool, and present two approaches that we developed to address this problem: a spatial approach based on invariant graphs and a temporal approach based on expected broken-invariant patterns.
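A sketch of the spatial (invariant-graph) idea: given which invariants are broken at a point in time, rank components by the fraction of their attached invariants that broke. The graph representation and the scoring rule here are simplified illustrations of the approach named above, not the paper's exact method:

```python
from collections import defaultdict

def rank_components(invariants, broken):
    """invariants: iterable of (comp_a, comp_b) edges in the invariant graph;
    broken: set of edges currently violated.
    Returns (component, broken-edge ratio) pairs, most suspicious first."""
    attached = defaultdict(int)
    broken_count = defaultdict(int)
    for edge in invariants:
        for comp in edge:
            attached[comp] += 1
            if edge in broken:
                broken_count[comp] += 1
    scores = {c: broken_count[c] / attached[c] for c in attached}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, if most invariants touching a database tier break while invariants elsewhere hold, that tier rises to the top of the ranking, which matches the intuition behind localizing faults spatially on the invariant graph.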