Publication


Featured research published by John C. Shafer.


knowledge discovery and data mining | 2009

Improving classification accuracy using automatically extracted training data

Ariel Fuxman; Anitha Kannan; Andrew B. Goldberg; Rakesh Agrawal; Panayiotis Tsaparas; John C. Shafer

Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.


Proceedings of the second international workshop on MapReduce and its applications | 2011

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Ekaterina Gonina; Anitha Kannan; John C. Shafer; Mihai Budiu

The last decade has seen a surge of interest in large-scale data-parallel processing engines. While these engines share many features with parallel databases, they make a different set of trade-offs. In consequence, many of the lessons learned for programming parallel databases have to be re-learned in the new environment. In this paper we show a case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ). We focus on the challenges raised by the nature of large data sets and data skew and show how they can be addressed effectively within this computation framework by optimizing the computation to adapt to the nature of the data. In particular, we describe three different strategies for performing distributed joins and show how the platform language allows us to implement optimization strategies at the application level, without system support. We show that this flexibility in the programming model allows for a highly effective system, providing a measured speedup of more than 100 on 64 machines (256 cores), and an estimated speedup of 200 on 1280 machines (5120 cores) for matching 4 million offers.


international world wide web conferences | 2012

Associating structured records to text documents

Rakesh Agrawal; Ariel Fuxman; Anitha Kannan; John C. Shafer; Partha Pratim Talukdar

Postulate two independently created data sources. The first contains text documents, each discussing one or a small number of objects. The second is a collection of structured records, each containing information about the characteristics of some objects. We present techniques for associating structured records to corresponding text documents and empirical results supporting the proposed techniques.


international conference on data engineering | 2010

Symphony: A platform for search-driven applications

John C. Shafer; Rakesh Agrawal; Hady Wirawan Lauw

We present the design of Symphony, a platform that enables non-developers to build and deploy a new class of search-driven applications that combine their data and domain expertise with content from search engines and other web services. The Symphony prototype has been built on top of Microsoft's Bing infrastructure. While Symphony naturally makes use of the customization capabilities exposed by Bing, its distinguishing feature is the capability it provides to the application creator to combine their proprietary data and domain expertise with content obtained from Bing. They can also integrate specialized data obtained from web services to enhance the richness of their applications. Finally, Symphony is targeted at non-developers and provides cloud services for the creation and deployment of applications.


IEEE Internet Computing | 2010

Homophily in the Digital World: A LiveJournal Case Study

Hady Wirawan Lauw; John C. Shafer; Rakesh Agrawal; Alexandros Ntoulas


international conference on management of data | 2009

Answering web queries using structured data sources

Stelios Paparizos; Alexandros Ntoulas; John C. Shafer; Rakesh Agrawal


Archive | 2008

Generating training data from click logs

Nina Mishra; Rakesh Agrawal; Sreenivas Gollapudi; Alan Halverson; Krishnaram Kenthapadi; Rina Panigrahy; John C. Shafer; Panayiotis Tsaparas


Archive | 2009

Symphony: Enabling Search-Driven Applications

John C. Shafer; Rakesh Agrawal; Hady Wirawan Lauw


Archive | 2012

Composing text and structured databases

Rakesh Agrawal; Anitha Kannan; John C. Shafer; Ariel Fuxman


Archive | 2011

Bringing achievements to an offline world

Mohammed Moinuddin; Joseph Futty; Matthew G. Dyor; Dan E. Walther; Sreenivas Gollapudi; Stelios Paparizos; John C. Shafer

Collaboration


Dive into John C. Shafer's collaborations.

Top Co-Authors

Hady Wirawan Lauw

Nanyang Technological University
