Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Haimonti Dutta is active.

Publication


Featured research published by Haimonti Dutta.


International Conference on Future Energy Systems | 2014

NILMTK: an open source toolkit for non-intrusive load monitoring

Nipun Batra; Jack Kelly; Oliver Parson; Haimonti Dutta; William J. Knottenbelt; Alex Rogers; Amarjeet Singh; Mani B. Srivastava

Non-intrusive load monitoring, or energy disaggregation, aims to separate household energy consumption data collected from a single point of measurement into appliance-level consumption data. In recent years, the field has rapidly expanded due to increased interest as national deployments of smart meters have begun in many countries. However, empirically comparing disaggregation algorithms is currently virtually impossible. This is due to the different data sets used, the lack of reference implementations of these algorithms and the variety of accuracy metrics employed. To address this challenge, we present the Non-intrusive Load Monitoring Toolkit (NILMTK): an open source toolkit designed specifically to enable the comparison of energy disaggregation algorithms in a reproducible manner. This work is the first research to compare multiple disaggregation approaches across multiple publicly available data sets. Our toolkit includes parsers for a range of existing data sets, a collection of preprocessing algorithms, a set of statistics for describing data sets, two reference benchmark disaggregation algorithms and a suite of accuracy metrics. We demonstrate the range of reproducible analyses which are made possible by our toolkit, including the analysis of six publicly available data sets and the evaluation of both benchmark disaggregation algorithms across such data sets.
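As a toy illustration of the disaggregation task itself (this is not NILMTK's API; the appliance names and ratings are hypothetical), the sketch below searches over on/off combinations of appliances to explain each aggregate meter reading:

```python
from itertools import product

def disaggregate(aggregate, appliance_power):
    """Brute-force combinatorial disaggregation: for each aggregate
    reading, pick the on/off combination of appliances whose summed
    rated power is closest to the measured total."""
    names = list(appliance_power)
    estimates = []
    for reading in aggregate:
        best = min(
            product([0, 1], repeat=len(names)),
            key=lambda states: abs(
                reading - sum(s * appliance_power[n] for s, n in zip(states, names))
            ),
        )
        estimates.append({n: s * appliance_power[n] for n, s in zip(names, best)})
    return estimates

# Hypothetical appliance ratings (watts) and two aggregate readings.
appliances = {"fridge": 100, "kettle": 2000, "tv": 150}
result = disaggregate([250, 2100], appliances)
```

Real disaggregation algorithms model appliance state sequences over time rather than fitting each reading independently, which is part of why a shared toolkit with reference implementations matters.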


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2012

Machine Learning for the New York City Power Grid

Cynthia Rudin; David L. Waltz; Roger N. Anderson; Albert Boulanger; Ansaf Salleb-Aouissi; Maggie Chow; Haimonti Dutta; Philip Gross; Bert Huang; Steve Ierome; Delfina Isaac; Arthur Kressner; Rebecca J. Passonneau; Axinia Radeva; Leon Wu

Power companies can benefit from the use of knowledge discovery methods and statistical machine learning for preventive maintenance. We introduce a general process for transforming historical electrical grid data into models that aim to predict the risk of failures for components and systems. These models can be used directly by power companies to assist with prioritization of maintenance and repair work. Specialized versions of this process are used to produce (1) feeder failure rankings, (2) cable, joint, terminator, and transformer rankings, (3) feeder Mean Time Between Failure (MTBF) estimates, and (4) manhole events vulnerability rankings. The process in its most general form can handle diverse, noisy sources that are historical (static), semi-real-time, or real-time, incorporates state-of-the-art machine learning algorithms for prioritization (supervised ranking or MTBF), and includes an evaluation of results via cross-validation and blind test. Above and beyond the ranked lists and MTBF estimates are business management interfaces that allow the prediction capability to be integrated directly into corporate planning and decision support; such interfaces rely on several important properties of our general modeling approach: that machine learning features are meaningful to domain experts, that the processing of data is transparent, and that prediction results are accurate enough to support sound decision making. We discuss the challenges in working with historical electrical grid data that were not designed for predictive purposes. The “rawness” of these data contrasts with the accuracy of the statistical models that can be obtained from the process; these models are sufficiently accurate to assist in maintaining New York City's electrical grid.
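The MTBF estimates mentioned above can be illustrated in their simplest form (the failure log below is hypothetical; the paper's estimates involve far more modeling):

```python
def mtbf(failure_times):
    """Naive MTBF estimate for one component: the mean gap between
    consecutive recorded failures (times in days)."""
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical failure log for a single feeder: gaps of 30, 60 and 30 days.
feeder_mtbf = mtbf([0, 30, 90, 120])
```

Components with a shorter estimated MTBF would be moved up the maintenance priority list.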


Machine Learning | 2010

A process for predicting manhole events in Manhattan

Cynthia Rudin; Rebecca J. Passonneau; Axinia Radeva; Haimonti Dutta; Steve Ierome; Delfina Isaac

We present a knowledge discovery and data mining process developed as part of the Columbia/Con Edison project on manhole event prediction. This process can assist with real-world prioritization problems that involve raw data in the form of noisy documents requiring significant amounts of pre-processing. The documents are linked to a set of instances to be ranked according to prediction criteria. In the case of manhole event prediction, which is a new application for machine learning, the goal is to rank the electrical grid structures in Manhattan (manholes and service boxes) according to their vulnerability to serious manhole events such as fires, explosions and smoking manholes. Our ranking results are currently being used to help prioritize repair work on the Manhattan electrical grid.
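The flavor of ranking structures from noisy documents can be sketched with a deliberately simple keyword score (the ticket texts and structure ids below are invented; the actual process uses substantial NLP pre-processing):

```python
import re

SERIOUS = {"fire", "explosion", "smoking"}

def vulnerability_score(tickets):
    """Toy score: count mentions of serious-event keywords across a
    structure's trouble tickets, standing in for the document
    pre-processing step described above."""
    words = re.findall(r"[a-z]+", " ".join(tickets).lower())
    return sum(w in SERIOUS for w in words)

# Hypothetical trouble tickets keyed by structure id.
tickets = {
    "MH-1": ["smoking manhole reported", "fire at cover"],
    "MH-2": ["routine inspection"],
}
ranked = sorted(tickets, key=lambda s: -vulnerability_score(tickets[s]))
```

In the real system, features extracted from the documents feed a supervised ranking model rather than a raw keyword count.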


IEEE Transactions on Knowledge and Data Engineering | 2006

Orthogonal decision trees

H. Kargupta; B.-H. Park; Haimonti Dutta

This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision tree ensembles often created by popular techniques such as bagging, boosting, random forests, and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers a technique to construct such trees based on the Fourier transformation of decision trees and eigen-analysis of the ensemble in the Fourier representation. It offers experimental results to document the performance of orthogonal trees on the grounds of accuracy and model complexity.
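The core idea that a redundant ensemble occupies a low-dimensional function space can be seen with a small eigen-analysis sketch (the prediction matrix is hypothetical, and this operates on prediction vectors rather than the paper's Fourier representation):

```python
import numpy as np

# Hypothetical ensemble: each row is one tree's predictions on a shared
# evaluation set; the three trees are highly redundant.
preds = np.array([
    [1.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
])

centered = preds - preds.mean(axis=0)
# Eigen-analysis of the ensemble's covariance in function space: the
# leading components play the role of the "orthogonal trees".
cov = centered @ centered.T
eigvals, eigvecs = np.linalg.eigh(cov)
# Number of non-negligible eigenvalues == effective ensemble size.
rank = int(np.sum(eigvals > 1e-9))
```

Here three trees collapse to a single significant component, mirroring how orthogonal decision trees remove redundancy from a large ensemble.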


International Conference on Machine Learning and Applications | 2013

INDiC: Improved Non-intrusive Load Monitoring Using Load Division and Calibration

Nipun Batra; Haimonti Dutta; Amarjeet Singh

Residential buildings contribute significantly to the overall energy consumption across most parts of the world. While smart monitoring and control of appliances can reduce the overall energy consumption, the management and cost associated with such systems act as a big hindrance. Prior work has established that detailed feedback in the form of appliance-level consumption to building occupants improves their awareness and paves the way for reduction in electricity consumption. Non-Intrusive Load Monitoring (NILM), i.e. the process of disaggregating the overall home electricity usage measured at the meter level into constituent appliances, provides a simple and cost-effective methodology to provide such feedback to the occupants. In this paper, we present Improved Non-Intrusive load monitoring using load Division and Calibration (INDiC), which simplifies NILM by dividing the appliances across multiple instrumented points (meters/phases) and calibrating the measured power. The proposed approach is used together with the Combinatorial Optimization framework and evaluated on the popular REDD dataset. Empirical results demonstrate significant improvement in disaggregation accuracy achieved by using INDiC-based Combinatorial Optimization.
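The calibration step can be illustrated with a least-squares fit of scale and offset between raw meter readings and reference power (the readings and the 5%/10 W distortion below are invented for illustration; this is not INDiC's actual procedure):

```python
import numpy as np

def calibrate(measured, reference):
    """Least-squares linear calibration (scale and offset) mapping a
    meter's raw readings onto reference power values."""
    A = np.column_stack([measured, np.ones_like(measured)])
    (scale, offset), *_ = np.linalg.lstsq(A, reference, rcond=None)
    return scale, offset

# Hypothetical readings: the meter under-reports by 5% with a 10 W offset.
raw = np.array([100.0, 500.0, 1000.0])
true = raw * 1.05 + 10.0
scale, offset = calibrate(raw, true)
```

Once each instrumented point is calibrated, disaggregation only has to explain the (smaller) set of appliances behind that point.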


International Conference on Data Mining | 2008

Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments

Haimonti Dutta; Hillol Kargupta

Advances in computing and communication have resulted in very large scale distributed environments in recent years. They are capable of storing large volumes of data and often have multiple compute nodes. However, the inherent heterogeneity of data components, the dynamic nature of distributed systems, the need for information synchronization and data fusion over a network, and security and access control issues make the problem of resource management and monitoring a tremendous challenge. In particular, centralized algorithms for management of resources and data may not be sufficient to manage complex distributed systems. In this paper, we present a distributed algorithm for resource and data management which builds on the traditional simplex algorithm used for solving linear optimization problems. Our distributed algorithm is exact, meaning its results are identical to those obtained by running the algorithm in a centralized setting. We provide extensive analytical results and experiments on simulated data to demonstrate the performance of our algorithm.
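The exactness property can be made concrete with a deliberately trivial separable problem (maximize x subject to per-node upper bounds); the two-node setup below is invented and is far simpler than a distributed simplex, but it shows what "identical to the centralized result" means:

```python
def local_optimum(bounds):
    # Each node solves max x s.t. x <= b for every local constraint b.
    return min(bounds)

def distributed_solve(node_constraints):
    """Each node optimises over its local constraints, then the local
    optima are combined; for this separable problem the combined answer
    equals the centralized solution exactly."""
    return min(local_optimum(c) for c in node_constraints)

def centralized_solve(node_constraints):
    all_constraints = [b for c in node_constraints for b in c]
    return min(all_constraints)

# Two hypothetical nodes, each holding its own constraint set.
nodes = [[7.0, 3.5], [5.0, 9.0]]
assert distributed_solve(nodes) == centralized_solve(nodes)
```

An exact distributed algorithm gives this guarantee for general linear programs, where the constraints interact and the coordination is much harder.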


Grid and Cloud Database Management | 2011

Distributed Storage of Large-Scale Multidimensional Electroencephalogram Data Using Hadoop and HBase

Haimonti Dutta; Alex Kamil; Manoj Pooleery; Simha Sethumadhavan; John Demme

Huge volumes of data are being accumulated from a variety of sources in engineering and scientific disciplines; this has been referred to as the “Data Avalanche”. Cloud computing infrastructures (such as Amazon Elastic Compute Cloud (EC2)) are specifically designed to combine high compute performance with high performance network capability to meet the needs of data-intensive science. Reliable, scalable, and distributed computing is used extensively on the cloud. Apache Hadoop is one such open-source project that provides a distributed file system to create multiple replicas of data blocks and distribute them on compute nodes throughout a cluster to enable reliable and rapid computations. Column-oriented databases built on Hadoop (such as HBase), along with the MapReduce programming paradigm, allow development of large-scale distributed computing applications with ease. In this chapter, benchmarking results on a small in-house Hadoop cluster composed of 29 nodes, each with an 8-core processor, are presented along with a case study on distributed storage of electroencephalogram (EEG) data. Our results indicate that the Hadoop / HBase projects are still in their nascent stages but provide promising performance characteristics with regard to latency and throughput. In future work, we will explore the development of novel machine learning algorithms on this infrastructure.
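The column-oriented layout for multidimensional EEG data can be sketched with a plain dictionary (the row-key scheme and channel names are hypothetical, not the chapter's actual schema):

```python
# Toy column-oriented layout in the spirit of an HBase table: the row
# key encodes subject and timestamp, column qualifiers are EEG channels.
table = {}

def put(subject, t, channel, value):
    table.setdefault(f"{subject}#{t:010d}", {})[channel] = value

put("s01", 0, "Fz", 1.2)
put("s01", 0, "Cz", 0.8)
put("s01", 1, "Fz", 1.1)

# Range scan over one subject's rows, as an HBase scan would do:
# zero-padded timestamps make lexicographic key order equal time order.
rows = sorted(k for k in table if k.startswith("s01#"))
```

This row-key design matters in HBase because rows are stored sorted by key, so a subject's recording becomes one contiguous scan.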


International Conference on Machine Learning and Applications | 2009

Ranking Electrical Feeders of the New York Power Grid

Philip Gross; Ansaf Salleb-Aouissi; Haimonti Dutta; Albert Boulanger

Ranking problems arise in a wide range of real world applications where an ordering on a set of examples is preferred to a classification model. These applications include collaborative filtering, information retrieval and ranking components of a system by susceptibility to failure. In this paper, we present an ongoing project to rank the underground primary feeders of New York City's electrical grid according to their susceptibility to outages. We describe our framework and the application of machine learning ranking methods, using scores from Support Vector Machines (SVM), RankBoost and Martingale Boosting. Finally, we present our experimental results and the lessons learned from this challenging real-world application.
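The final step shared by all three scoring methods, turning model scores into an ordering, can be sketched as follows (the feeder ids and scores are invented):

```python
def rank_by_score(scores):
    """Turn per-feeder susceptibility scores into a ranked list,
    highest risk first; ties broken by feeder id for determinism."""
    return sorted(scores, key=lambda f: (-scores[f], f))

# Hypothetical susceptibility scores from a single model.
scores = {"feeder_a": 0.30, "feeder_b": 0.90, "feeder_c": 0.55}
ranking = rank_by_score(scores)
```

Ranking methods like RankBoost are trained to get this ordering right (pairwise) rather than to calibrate the scores themselves, which is the distinction the abstract draws against classification.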


arXiv: Other Computer Science | 2014

NILMTK v0.2: a non-intrusive load monitoring toolkit for large scale data sets: demo abstract

Jack Kelly; Nipun Batra; Oliver Parson; Haimonti Dutta; William J. Knottenbelt; Alex Rogers; Amarjeet Singh; Mani B. Srivastava

In this demonstration, we present an open source toolkit for evaluating non-intrusive load monitoring research; a field which aims to disaggregate a household's total electricity consumption into individual appliances. The toolkit contains: a number of importers for existing public data sets, a set of preprocessing and statistics functions, a benchmark disaggregation algorithm and a set of metrics to evaluate the performance of such algorithms. Specifically, this release of the toolkit has been designed to enable the use of large data sets by only loading individual chunks of the whole data set into memory at once for processing, before combining the results of each chunk.
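The chunked-processing pattern described above, holding one chunk in memory and combining partial results, can be sketched generically (this is the general out-of-core idiom, not NILMTK's implementation):

```python
def chunked(readings, chunk_size):
    """Yield successive slices so only one chunk is materialized at a time."""
    for i in range(0, len(readings), chunk_size):
        yield readings[i:i + chunk_size]

def total_energy(readings, chunk_size=4):
    """Compute a statistic chunk by chunk, then combine the per-chunk
    partial sums into the final result."""
    return sum(sum(chunk) for chunk in chunked(readings, chunk_size))
```

The pattern works whenever the statistic decomposes over chunks; metrics that do not decompose need partial state carried between chunks instead.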


Archive | 2011

PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

Shen Wang; Haimonti Dutta

Large datasets, of the order of peta- and terabytes, are becoming prevalent in many scientific domains including astronomy, physical sciences, bioinformatics and medicine. To effectively store, query and analyze these gigantic repositories, parallel and distributed architectures have become popular. Apache Hadoop is one such framework for supporting data-intensive applications. It provides an open source implementation of the MapReduce programming paradigm which can be used to build scalable algorithms for pattern analysis and data mining. In this paper, we present a PArallel, RAndom-partition Based hierarchicaL clustEring algorithm (PARABLE) for the MapReduce framework. It proceeds in two main steps: local hierarchical clustering on nodes using mappers and reducers, and integration of results by a novel dendrogram alignment technique. Empirical results on two large data sets (High Energy Particle Physics and Intrusion Detection) from the KDDCup competition on a large cluster indicate that significant scalability benefits can be obtained by using the parallel hierarchical clustering algorithm while maintaining good cluster quality.
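The two-step random-partition scheme can be sketched on 1-D data (the greedy gap-based clustering and the integration via centroids below are simplifications; the paper's integration step is a dendrogram alignment, not shown here):

```python
import random

def local_clusters(points, threshold):
    """Greedy 1-D agglomeration: sort the points and start a new
    cluster whenever the gap to the previous point exceeds threshold."""
    pts = sorted(points)
    clusters, current = [], [pts[0]]
    for p in pts[1:]:
        if p - current[-1] <= threshold:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return clusters

def parallel_cluster(points, n_parts=2, threshold=1.0, seed=0):
    """Two-step sketch of the random-partition scheme: shuffle points
    into partitions, cluster each locally (the 'mapper' step), then
    cluster the local centroids to integrate the results."""
    rng = random.Random(seed)
    pts = points[:]
    rng.shuffle(pts)
    parts = [pts[i::n_parts] for i in range(n_parts)]
    centroids = [
        sum(c) / len(c)
        for part in parts
        for c in local_clusters(part, threshold)
    ]
    return local_clusters(centroids, threshold)

# Two well-separated groups survive the random partitioning.
groups = parallel_cluster([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

On Hadoop the partitions would be processed by independent reducers, so the local step parallelizes across the cluster while only the small set of local summaries is integrated centrally.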

Collaboration


Dive into Haimonti Dutta's collaborations.

Top Co-Authors

Amarjeet Singh

Indraprastha Institute of Information Technology


Nipun Batra

Indraprastha Institute of Information Technology
