Stephen C. Gates | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Stephen C. Gates is active.

Explore More

Publication

Featured researches published by Stephen C. Gates.

knowledge discovery and data mining | 1999

On the merits of building categorization systems by supervised clustering

Charu C. Aggarwal; Stephen C. Gates; Philip S. Yu

This paper investigates the use of supervised clustering in order to create sets of categories for classification of documents. We use information from a pre-existing taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of each category. We then categorize documents using this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters. Finally, we show empirically that this categorization system utilizing a machine-derived taxonomy performs as well as a manual categorization process, but at a far lower cost.

IEEE Transactions on Knowledge and Data Engineering | 2004

On using partial supervision for text categorization

Charu C. Aggarwal; Stephen C. Gates; Philip S. Yu

We discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches for document classification on a predefined set of classes are often unable to provide sufficient accuracy because of the difficulty of fitting a manually categorized collection of documents in a given classification model. This is especially the case for heterogeneous collections of Web documents which have varying styles, vocabulary, and authorship. Hence, we investigate the use of clustering in order to create the set of categories and its use for classification of documents. Completely unsupervised clustering has the disadvantage that it has difficulty in isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We use the information from a preexisting taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using partially supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of how each category is defined. An extremely effective way then to categorize documents is to use this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters.

conference on information and knowledge management | 2005

Taxonomies by the numbers: building high-performance taxonomies

Stephen C. Gates; Wilfried Teiken; Keh-Shin F. Cheng

In this paper, we describe a system for the construction of taxonomies which yield high accuracies with automated categorization systems, even on Web and intranet documents. In particular, we describe the way in which measurement of five key features of the system can be used to predict when categories are sufficiently well defined to yield high accuracy categorization. We describe the use of this system to construct a large (8800-category) general-purpose taxonomy and categorization system.

conference on information and knowledge management | 2009

Towards real-time measurement of customer satisfaction using automatically generated call transcripts

Youngja Park; Stephen C. Gates

Customer satisfaction is a very important indicator of how successful a contact center is at providing services to the customers. Contact centers typically conduct a manual survey with a randomly selected group of customers to measure customer satisfaction. Manual customer satisfaction surveys, however, provide limited values due to high cost and the time lapse between the service and the survey. In this paper, we demonstrate that it is possible to automatically measure customer satisfaction by analyzing call transcripts enabling companies to measure customer satisfaction for every call in near real-time. We have identified various features from multiple knowledge sources indicating prosodic, linguistic and behavioral aspects of the speakers, and built machine learning models that predict the degree of customer satisfaction with high accuracy. The machine learning algorithms used in this work include Decision Tree, Naive Bayes, Logistic Regression and Support Vector Machines (SVMs). Experiments were conducted for a 5-point satisfaction measurement and a 2-point satisfaction measurement using customer calls to an automotive company. The experimental results show that customer satisfaction can be measured quite accurately both at the end of calls and in the middle of calls. The best performing 5-point satisfaction classification yields an accuracy of 66.09% outperforming the DominantClass baseline by 15.16%. The best performing 2-point classification shows an accuracy of 89.42% and outperforms both the DominantClass baseline and the CSRJudgment baseline by 17.7% and 3.3% respectively. Furthermore, Decision Tree and SVMs achieve higher F-measure than the CSRJudgment baseline in identifying both satisfied customers and dissatisfied customers.

conference on information and knowledge management | 2008

Semi-automated logging of contact center telephone calls

Roy J. Byrd; Mary S. Neff; Wilfried Teiken; Youngja Park; Keh-Shin F. Cheng; Stephen C. Gates; Karthik Visweswariah

Modern businesses use contact centers as a communication channel with users of their products and services. The largest factor in the expense of running a telephone contact center is the labor cost of its agents. IBM Research has built a new system, Contact-Center Agent Buddies (CAB), which is designed to help reduce the average handle time (AHT) for customer calls, thereby also reducing their cost. In this paper, we focus on the call logging subsystem, which helps agents reduce the time they spend documenting those calls. We built a Template CAB and a Call Logging CAB, using a pipeline consisting of audio capture of a telephone conversation, automatic speech recognition, text analysis, and log generation. We developed techniques for ASR text cleansing, including normalization of expressions and acronyms, domain terms, capitalization, and boundaries for sentences, paragraphs, and call segments. We found that simple heuristics suffice to generate high-quality logs from the normalized sentences. The pipeline yields a candidate call log which the agents can edit in less time than it takes them to generate call logs manually. Evaluation of the Call Logging CAB in an industrial contact center environment shows that it reduces the amount of time agents spend logging calls by at least 50% without compromising the quality of the resulting call documentation.

symposium on access control models and technologies | 2011

System for automatic estimation of data sensitivity with applications to access control and other applications

Youngja Park; Stephen C. Gates; Wilfried Teiken; Suresh Chari

The Enterprise Information Security Management (EISM) system aims to semi-automatically estimate the sensitivity of enterprise data through advanced content analysis and business process mining. We demonstrate a proof-of-concept of EISM that crawls all the files in a personal computer and estimates the sensitivity of individual files and the overall sensitivity level of the computer. The system can identify 11 different personally identifiable information (PII) types and 11 sensitive data categories, and estimate data sensitivity based on the identified sensitive information in the data. Furthermore, the tool produces the evidences of the discovered sensitive information including the surrounding context in the document to help users understand what kinds of sensitive information are stored in their computer. The evidences allow users can easily redact the sensitive information or move it to a more secure location. Thus, this system can be used as a privacy enhancing tool as well as a security tool.

european symposium on research in computer security | 2013

Estimating Asset Sensitivity by Profiling Users

Youngja Park; Christopher S. Gates; Stephen C. Gates

We introduce algorithms to automatically score and rank information technology (IT) assets in an enterprise, such as computer systems or data files, by their business value and criticality to the organization. Typically, information assets are manually assigned classification labels with respect to the confidentiality, integrity and availability. In this paper, we propose semi-automatic machine learning algorithms to automatically estimate the sensitivity of assets by profiling the users. Our methods do not require direct access to the target assets or privileged knowledge about the assets, resulting in a more efficient, scalable and privacy-preserving approach compared with existing data security solutions relying on data content classification. Instead, we rely on external information such as the attributes of the users, their access patterns and other published data content by the users. Validation with a set of 8,500 computers collected from a large company show that all our algorithms perform significantly better than two baseline methods.

Archive | 1999