Publication


Featured research published by George Forman.


SIGKDD Explorations | 2000

Distributed data clustering can be efficient and exact

George Forman; Bin Zhang

Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. Many of these data sets are geographically distributed across multiple sites, e.g., different sales or warehouse locations. To cluster such large and distributed data sets, efficient distributed algorithms are called for to reduce the communication overhead, central storage requirements, and computation time, as well as to bring the resources of multiple machines to bear on a given problem as data set sizes scale up. We describe a technique for parallelizing a family of center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We demonstrate in this paper that even for relatively small problem sizes, it can be more cost-effective to cluster the data in place using an exact distributed algorithm than to collect the data in one central location for clustering.
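To make the sufficient-statistics idea concrete, here is a minimal sketch of distributed k-means in this spirit, assuming the data are already partitioned across sites as NumPy arrays; in each round the sites exchange only per-cluster counts and vector sums rather than raw points. This is an illustration of the general idea, not the paper's exact algorithm.

```python
import numpy as np

def local_stats(points, centers):
    """Per-cluster sufficient statistics for one site: member count and vector sum."""
    k, d = centers.shape
    # assign each local point to its nearest center (squared Euclidean distance)
    labels = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1), axis=1)
    counts = np.zeros(k)
    sums = np.zeros((k, d))
    for j in range(k):
        members = points[labels == j]
        counts[j] = len(members)
        if len(members):
            sums[j] = members.sum(axis=0)
    return counts, sums

def distributed_kmeans(sites, centers, iters=20):
    """K-means over data split across sites; only (counts, sums) cross the network."""
    centers = centers.copy()
    for _ in range(iters):
        stats = [local_stats(points, centers) for points in sites]  # computed locally at each site
        counts = sum(c for c, _ in stats)                           # O(k) numbers per site
        sums = sum(s for _, s in stats)                             # O(k*d) numbers per site
        nonempty = counts > 0
        centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centers
```

Because the centroid update depends only on these aggregated counts and sums, the result matches running k-means on the pooled data, which is the sense in which such a distributed algorithm can be exact.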


SIGKDD Explorations | 2010

Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement

George Forman; Martin B. Scholz

Cross-validation is a mainstay for measuring performance and progress in machine learning. There are subtle differences in how exactly to compute accuracy, F-measure, and Area Under the ROC Curve (AUC) in cross-validation studies. However, these details are not discussed in the literature, and incompatible methods are used by various papers and software packages. This leads to inconsistency across the research literature. Anomalies in performance calculations for particular folds and situations go undiscovered when they are buried in results aggregated over many folds and datasets, without anyone ever examining the intermediate performance measurements. This research note clarifies and illustrates the differences, and it provides guidance on how best to measure classification performance under cross-validation. In particular, there are several divergent methods in use for computing F-measure, which is often recommended as a performance measure under class imbalance, e.g., for text classification domains and in one-vs.-all reductions of datasets having many classes. We show by experiment that all but one of these computation methods lead to biased measurements, especially under high class imbalance. This paper is of particular interest to those designing machine learning software libraries and to researchers focused on high class imbalance.
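As a toy illustration of why the conventions diverge, the sketch below contrasts two common ways of reporting F-measure over folds: averaging per-fold scores versus pooling the confusion counts first. The fold counts here are hypothetical, and the paper, not this sketch, is the authority on which convention is unbiased.

```python
def f1(tp, fp, fn):
    """F-measure from raw counts; defined as 0 when there are no predicted or actual positives."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec  = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Hypothetical per-fold confusion counts (tp, fp, fn) under high class imbalance.
folds = [(1, 0, 4), (0, 2, 5), (3, 1, 2)]

# Convention A: average the per-fold F-measures.
f_avg = sum(f1(*fold) for fold in folds) / len(folds)

# Convention B: pool the counts over all folds, then compute a single F-measure.
tp, fp, fn = map(sum, zip(*folds))
f_pooled = f1(tp, fp, fn)

print(f_avg, f_pooled)  # the two conventions disagree, especially when positives are rare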


International Conference on Machine Learning | 2004

A pitfall and solution in multi-class feature selection for text classification

George Forman

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we found that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes while largely ignoring the features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid the pitfall without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
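A minimal sketch of a round-robin selection scheme in the spirit the abstract describes: each class contributes its next best unused feature in turn, so no class's features are crowded out by classes with many strongly predictive terms. The per-class rankings (e.g., by one-vs-rest Information Gain) are assumed to be computed elsewhere, and this is not necessarily the paper's exact procedure.

```python
def round_robin_select(ranked_per_class, k):
    """Pick k features by cycling over classes, taking each class's next best unused feature.

    ranked_per_class: dict mapping class -> list of feature indices, best first.
    """
    selected, seen = [], set()
    cursors = {c: 0 for c in ranked_per_class}
    while len(selected) < k:
        progressed = False
        for c, ranked in ranked_per_class.items():
            i = cursors[c]
            # skip features already claimed on behalf of another class
            while i < len(ranked) and ranked[i] in seen:
                i += 1
            cursors[c] = i
            if i < len(ranked):
                seen.add(ranked[i])
                selected.append(ranked[i])
                cursors[c] = i + 1
                progressed = True
                if len(selected) == k:
                    break
        if not progressed:  # every class has exhausted its ranking
            break
    return selected

# Example: round_robin_select({"sports": [3, 7, 1], "finance": [7, 2, 9]}, k=4) -> [3, 7, 2, 1]
```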


European Conference on Principles of Data Mining and Knowledge Discovery | 2004

Learning from little: comparison of classifiers given little training

George Forman; Ira Cohen

Many real-world machine learning tasks are faced with the problem of small training sets. Additionally, the class distribution of the training set often does not match the target distribution. In this paper we compare the performance of many learning models on a substantial benchmark of binary text classification tasks having small training sets. We vary the training size and class distribution to examine the learning surface, as opposed to the traditional learning curve. The models tested include various feature selection methods, each coupled with one of four learning algorithms: Support Vector Machines (SVM), Logistic Regression, Naive Bayes, and Multinomial Naive Bayes. Different models excel in different regions of the learning surface, leading to meta-knowledge about which to apply in different situations. This helps guide researchers and practitioners facing choices of model and feature selection method in, for example, information retrieval settings.


Knowledge Discovery and Data Mining | 2005

Finding similar files in large document repositories

George Forman; Kave Eshghi; Stephane Chiocchetti

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
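A simplified sketch of the chunk-signature idea: hash chunks of each file's byte stream, index files by shared chunk hashes, and report pairs with high overlap. For brevity it uses fixed-size chunks and pairwise Jaccard overlap, whereas the paper's approach relies on content-based chunking and clustering of the file-chunk graph.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

CHUNK = 4096  # fixed-size chunking for brevity; content-defined chunking is more robust to insertions

def chunk_signatures(path):
    """Hash each chunk of the file's byte stream; the set of hashes is the file's signature."""
    sigs = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            sigs.add(hashlib.sha1(block).hexdigest())
    return sigs

def similar_pairs(paths, threshold=0.5):
    """Report file pairs sharing a large fraction of chunk signatures (Jaccard-style overlap)."""
    sig = {p: chunk_signatures(p) for p in paths}
    chunk_to_files = defaultdict(set)        # the file-chunk bipartite structure
    for p, s in sig.items():
        for h in s:
            chunk_to_files[h].add(p)
    candidates = set()
    for files in chunk_to_files.values():    # only compare files that share at least one chunk
        candidates.update(combinations(sorted(files), 2))
    pairs = []
    for a, b in candidates:
        overlap = len(sig[a] & sig[b]) / max(1, len(sig[a] | sig[b]))
        if overlap >= threshold:
            pairs.append((a, b, overlap))
    return pairs
```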


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006

Tackling concept drift by temporal inductive transfer

George Forman

Machine learning is the mainstay for text classification. However, even the most successful techniques are defeated by many real-world applications that have a strong time-varying component. To advance research on this challenging but important problem, we promote a natural experimental framework, the Daily Classification Task, which can be applied to large time-based datasets such as Reuters RCV1. In this paper we dissect concept drift into three main subtypes. We demonstrate via a novel visualization that the recurrent-themes subtype is present in RCV1. This understanding led us to develop a new learning model that transfers induced knowledge through time to benefit future classifier learning tasks. The method avoids two main problems with existing work on inductive transfer: scalability and the risk of negative transfer. In empirical tests, it consistently showed an F-measure improvement of more than 10 points for each of the four Reuters categories tested.


Conference on Information and Knowledge Management | 2008

BNS feature scaling: an improved representation over tf-idf for svm text classification

George Forman

In the realm of machine learning for text classification, TF-IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has previously been found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) when using BNS scaling. A wide variety of other feature representations, including binary features with no scaling, were also tested and found inferior. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for feature selection.
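For reference, Bi-Normal Separation scores a term by the gap between the inverse Normal CDF of its true-positive and false-positive document rates. The sketch below computes that weight as a drop-in replacement for the IDF factor; the clipping constant is chosen here for illustration rather than taken from the paper.

```python
from scipy.stats import norm

def bns_weight(tp, fn, fp, tn, eps=0.0005):
    """Bi-Normal Separation for one term: |inv_normal_cdf(tpr) - inv_normal_cdf(fpr)|.

    tp/fn: positive documents that do / do not contain the term
    fp/tn: negative documents that do / do not contain the term
    Rates are clipped away from 0 and 1 so the inverse Normal CDF stays finite.
    """
    tpr = min(max(tp / (tp + fn), eps), 1 - eps)
    fpr = min(max(fp / (fp + tn), eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# A term's feature value is then its (binary or term-frequency) occurrence scaled by
# bns_weight, taking the place of the IDF factor in a TF-IDF representation.
```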


Data Mining and Knowledge Discovery | 2008

Quantifying counts and costs via classification

George Forman

Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
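One of the simpler correction schemes in this vein, often called Adjusted Count, rescales the classifier's observed positive rate using its estimated true- and false-positive rates. The sketch below is a minimal illustration of that single method, not the full set of quantification methods the paper evaluates.

```python
def adjusted_count(pred_positive_rate, tpr, fpr):
    """Estimate positive-class prevalence from a classifier's raw output rate.

    pred_positive_rate: fraction of test cases the classifier labeled positive
    tpr, fpr: the classifier's true/false positive rates, e.g. estimated by
              cross-validation on the training set.
    """
    if tpr - fpr == 0:
        return pred_positive_rate          # uninformative classifier; no correction possible
    p = (pred_positive_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))           # clip to a valid proportion

# Example: the classifier flags 12% of the test set as positive, but with tpr=0.80 and
# fpr=0.05 the implied prevalence is (0.12 - 0.05) / (0.80 - 0.05) ~= 9.3%.
print(adjusted_count(0.12, tpr=0.80, fpr=0.05))
```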


Knowledge Discovery and Data Mining | 2012

Mining large-scale, sparse GPS traces for map inference: comparison of approaches

Xuemei Liu; James Biagioni; Jakob Eriksson; Yin Wang; George Forman; Yanmin Zhu

We address the problem of inferring road maps from large-scale GPS traces that have relatively low resolution and sampling frequency. Unlike past published work that requires high-resolution traces with dense sampling, we focus on situations with coarse granularity data, such as that obtained from thousands of taxis in Shanghai, which transmit their location as seldom as once per minute. Such data sources can be made available inexpensively as byproducts of existing processes, rather than having to drive every road with high-quality GPS instrumentation just for map building - and having to re-drive roads for periodic updates. Although the challenges in using opportunistic probe data are significant, successful mining algorithms could potentially enable the creation of continuously updated maps at very low cost. In this paper, we compare representative algorithms from two approaches: working with individual reported locations vs. segments between consecutive locations. We assess their trade-offs and effectiveness in both qualitative and quantitative comparisons for regions of Shanghai and Chicago.


International Conference on Mobile Systems, Applications, and Services | 2013

CrowdAtlas: self-updating maps for cloud and personal use

Yin Wang; Xuemei Liu; Hong Wei; George Forman; Chao Chen; Yanmin Zhu

The inaccuracy of manually created digital road maps is a persistent problem, despite their high economic value. We present CrowdAtlas, which automates map updates based on people's travels, either individually or crowdsourced. Its mobile navigation app detects significant portions of GPS traces that do not conform to the existing map, as determined by state-of-the-art Viterbi map matching. When sufficient evidence has been collected, map inference algorithms automatically update the map. The CrowdAtlas server aggregates exceptional traces from users of the navigation app as well as from other, large-scale data sources. From these it automatically generates high-quality map updates, which can be propagated to its navigation app and other interested applications. Using the CrowdAtlas app, we mapped out a 4.5 km^2 street block in Shanghai in less than half an hour and built a walking/cycling map of the SJTU campus. Using taxi traces collected in Beijing, we contributed 61 km of missing roads to OpenStreetMap, the first set of completely computer-generated roads contributed to the open-source map community.

Collaboration


Dive into George Forman's collaborations.

Top Co-Authors

Hong Wei

Shanghai Jiao Tong University
