John C. Shafer | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where John C. Shafer is active.

Explore More

Publication

Featured researches published by John C. Shafer.

knowledge discovery and data mining | 2009

Improving classification accuracy using automatically extracted training data

Ariel Fuxman; Anitha Kannan; Andrew B. Goldberg; Rakesh Agrawal; Panayiotis Tsaparas; John C. Shafer

Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.

Proceedings of the second international workshop on MapReduce and its applications | 2011

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Ekaterina Gonina; Anitha Kannan; John C. Shafer; Mihai Budiu

The last decade has seen a surge of interest in large-scale data-parallel processing engines. While these engines share many features in common with parallel databases, they make a set of different trade-offs. In consequence many of the lessons learned for programming parallel databases have to be re-learned in the new environment. In this paper we show a case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ). We focus on the challenges raised by the nature of large data sets and data skew and show how they can be addressed effectively within this computation framework by optimizing the computation to adapt to the nature of the data. In particular we describe three different strategies for performing distributed joins and show how the platform language allows us to implement optimization strategies at the application level, without system support. We show that this flexibility in the programming model allows for a highly effective system, providing a measured speedup of more than 100 on 64 machines (256 cores), and an estimated speedup of 200 on 1280 machines (5120 cores)of matching 4 million offers.

international world wide web conferences | 2012

Associating structured records to text documents

Rakesh Agrawal; Ariel Fuxman; Anitha Kannan; John C. Shafer; Partha Pratim Talukdar

Postulate two independently created data sources. The first contains text documents, each discussing one or a small number of objects. The second is a collection of structured records, each containing information about the characteristics of some objects. We present techniques for associating structured records to corresponding text documents and empirical results supporting the proposed techniques.

international conference on data engineering | 2010

Symphony: A platform for search-driven applications

John C. Shafer; Rakesh Agrawal; Hady Wirawan Lauw

We present the design of Symphony, a platform that enables non-developers to build and deploy a new class of search-driven applications that combine their data and domain expertise with content from search engines and other web services. The Symphony prototype has been built on top of Microsofts Bing infrastructure. While Symphony naturally makes use of the customization capabilities exposed by Bing, its distinguishing feature is the capability it provides to the application creator to combine their proprietary data and domain expertise with content obtained from Bing. They can also integrate specialized data obtained from web services to enhance the richness of their applications. Finally, Symphony is targeted at non-developers and provides cloud services for the creation and deployment of applications.

IEEE Internet Computing | 2010