Josh Attenberg | Researchain

Publication

Featured research published by Josh Attenberg.

knowledge discovery and data mining | 2009

Modeling and predicting user behavior in sponsored search

Josh Attenberg; Sandeep Pandey; Torsten Suel

Implicit user feedback, including click-through and subsequent browsing behavior, is crucial for evaluating and improving the quality of results returned by search engines. Several recent studies [1, 2, 3, 13, 25] have used post-result browsing behavior, including the sites visited, the number of clicks, and the dwell time on site, in order to improve the ranking of search results. In this paper, we first study user behavior on sponsored search results (i.e., the advertisements displayed by search engines next to the organic results), and compare this behavior to that of organic results. Second, to exploit post-result user behavior for better ranking of sponsored results, we focus on identifying patterns in user behavior and predict expected on-site actions in future instances. In particular, we show how post-result behavior depends on various properties of the queries, advertisements, sites, and users, and build a classifier using properties such as these to predict certain aspects of the user behavior. Additionally, we develop a generative model to mimic trends in observed user activity using a mixture of Pareto distributions. We conduct experiments based on billions of real navigation trails collected by a major search engine's browser toolbar.
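As a rough illustration of what a generative model built from a mixture of Pareto distributions can look like, the sketch below draws synthetic values from a two-component mixture. The abstract does not say which quantity is modeled or how the mixture is fitted, so the choice of dwell time, the function name `sample_pareto_mixture`, and every weight, shape, and scale value here are assumptions made for the example, not the paper's method.

```python
import random


def sample_pareto_mixture(weights, alphas, scales, rng):
    """Draw one value from a mixture of Pareto distributions.

    weights: mixing proportions of the components
    alphas:  Pareto shape parameter of each component
    scales:  minimum value (scale) of each component
    """
    # Pick a component according to the mixing weights.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    # paretovariate(alpha) returns a value >= 1, so multiplying by the
    # scale gives a Pareto sample with that minimum value.
    return scales[k] * rng.paretovariate(alphas[k])


# Hypothetical parameters: many short visits plus a heavier tail of long ones.
rng = random.Random(0)
dwell_times = [
    sample_pareto_mixture([0.7, 0.3], [2.5, 1.2], [1.0, 10.0], rng)
    for _ in range(1000)
]
```

Every sample is at least as large as the smallest scale, and the component with the smaller shape parameter contributes the occasional very large value, which is the heavy-tailed behavior such a mixture is meant to capture.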

european conference on machine learning | 2010

A unified approach to active dual supervision for labeling features and examples

Josh Attenberg; Prem Melville; Foster Provost

When faced with the task of building accurate classifiers, active learning is often a beneficial tool for minimizing the requisite costs of human annotation. Traditional active learning schemes query a human for labels on intelligently chosen examples. However, human effort can also be expended in collecting alternative forms of annotation. For example, one may attempt to learn a text classifier by labeling words associated with a class, instead of, or in addition to, documents. Learning from two different kinds of supervision adds a challenging dimension to the problem of active learning. In this paper, we present a unified approach to such active dual supervision: determining which feature or example a classifier is most likely to benefit from having labeled. Empirical results confirm that appropriately querying for both example and feature labels significantly reduces overall human effort--beyond what is possible through traditional one-dimensional active learning.

international world wide web conferences | 2010

Scalable techniques for document identifier assignment in inverted indexes

Shuai Ding; Josh Attenberg; Torsten Suel

Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
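To make the link between document identifier assignment and index size concrete, the sketch below applies only the simple sort-by-URL heuristic mentioned above, not the paper's TSP and Locality Sensitive Hashing framework. It assumes postings lists are stored as gaps between consecutive identifiers and coded with Elias gamma, which is a choice made for this example; the toy collection, URLs, and function names are hypothetical.

```python
def gamma_bits(n: int) -> int:
    """Bits used by the Elias gamma code for a positive integer n."""
    return 2 * n.bit_length() - 1


def index_bits(docs: dict[str, set[str]], order: list[str]) -> int:
    """Total gamma-coded size of all postings lists for a document order."""
    # Assign identifiers 1, 2, 3, ... following the given order.
    doc_id = {url: i + 1 for i, url in enumerate(order)}
    postings: dict[str, list[int]] = {}
    for url, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(doc_id[url])
    total = 0
    for ids in postings.values():
        ids.sort()
        prev = 0
        for d in ids:
            total += gamma_bits(d - prev)  # encode the gap, not the raw id
            prev = d
    return total


# Toy collection: two sites, each page containing one site-specific term.
docs: dict[str, set[str]] = {}
crawl_order: list[str] = []
for i in range(300):
    for site in ("a", "b"):
        url = f"http://{site}.example/page{i:03d}"
        docs[url] = {f"term_{site}"}
        crawl_order.append(url)  # pages of the two sites interleaved

bits_crawl = index_bits(docs, crawl_order)
bits_sorted = index_bits(docs, sorted(docs))
```

Sorting by URL places pages from the same site next to each other, so documents that share terms receive nearby identifiers and the gaps in each postings list shrink. In this toy collection `bits_sorted` comes out well below `bits_crawl`.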

Sigkdd Explorations | 2011

Inactive learning?: difficulties employing active learning in practice

Josh Attenberg; Foster Provost

Despite the tremendous level of adoption of machine learning techniques in real-world settings, and the large volume of research on active learning, active learning techniques have been slow to gain substantial traction in practical applications. This reluctance of adoption is contrary to active learning's promise of reduced model-development costs and increased performance on a model-development budget. This essay presents several important and under-discussed challenges to using active learning well in practice. We hope this paper can serve as a call to arms for researchers in active learning--an encouragement to focus even more attention on how practitioners might actually use active learning.

web search and data mining | 2011

Batch query processing for web search engines

Shuai Ding; Josh Attenberg; Ricardo A. Baeza-Yates; Torsten Suel

Large web search engines are now processing billions of queries per day. Most of these queries are interactive in nature, requiring a response in fractions of a second. However, there are also a number of important scenarios where large batches of queries are submitted for various web mining and system optimization tasks that do not require an immediate response. Given the significant cost of executing search queries over billions of web pages, it is a natural question to ask if such batches of queries can be more efficiently executed than interactive queries. In this paper, we motivate and discuss the problem of batch query processing in search engines, identify basic mechanisms for improving the performance of such queries, and provide a preliminary experimental evaluation of the proposed techniques. Our conclusion is that significant cost reductions are possible by using specialized mechanisms for executing batch queries in Web search engines.

adversarial information retrieval on the web | 2008

Cleaning search results using term distance features

Josh Attenberg; Torsten Suel

The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated to the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure called Term Distance Histograms that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many web pages generated by utilizing techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.

knowledge discovery and data mining | 2016

Images Don't Lie: Transferring Deep Visual Semantic Features to Large-Scale Multimodal Learning to Rank

Corey Lynch; Kamelia Aryafar; Josh Attenberg

Search is at the heart of modern e-commerce. As a result, the task of ranking search results automatically (learning to rank) is a multibillion-dollar machine learning problem. Traditional models optimize over a few hand-constructed features based on the item's text. In this paper, we introduce a multimodal learning to rank model that combines these traditional features with visual semantic features transferred from a deep convolutional neural network. In a large-scale experiment using data from the online marketplace Etsy, we verify that moving to a multimodal representation significantly improves ranking quality. We show how image features can capture fine-grained style information not available in a text-only representation. In addition, we show concrete examples of how image information can successfully disentangle pairs of highly different items that are ranked similarly by a text-only model.

Proceedings of the first international workshop on Location and the web | 2008