Miles Efron
University of Illinois at Urbana–Champaign
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Miles Efron.
PLOS Biology | 2015
Zachary D. Stephens; Skylar Y. Lee; Faraz Faghri; Roy H. Campbell; ChengXiang Zhai; Miles Efron; Ravishankar K. Iyer; Michael C. Schatz; Saurabh Sinha; Gene E. Robinson
Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the “genomical” challenges of the next decade.
Journal of the Association for Information Science and Technology | 2011
Miles Efron
Modern information retrieval (IR) has come to terms with numerous new media in efforts to help people find information in increasingly diverse settings. Among these new media are so-called microblogs. A microblog is a stream of text that is written by an author over time. It comprises many very brief updates that are presented to the microblogs readers in reverse-chronological order. Today, the service called Twitter is the most popular microblogging platform. Although microblogging is increasingly popular, methods for organizing and providing access to microblog data are still new. This review offers an introduction to the problems that face researchers and developers of IR systems in microblog settings. After an overview of microblogs and the behavior surrounding them, the review describes established problems in microblog retrieval, such as entity search and sentiment analysis, and modeling abstractions, such as authority and quality. The review also treats user-created metadata that often appear in microblogs. Because the problem of microblog search is so new, the review concludes with a discussion of particularly pressing research issues yet to be studied in the field.
international acm sigir conference on research and development in information retrieval | 2011
Miles Efron; Gene Golovchinsky
Temporal aspects of documents can impact relevance for certain kinds of queries. In this paper, we build on earlier work of modeling temporal information. We propose an extension to the Query Likelihood Model that incorporates query-specific information to estimate rate parameters, and we introduce a temporal factor into language model smoothing and query expansion using pseudo-relevance feedback. We evaluate these extensions using a Twitter corpus and two newspaper article collections. Results suggest that, compared to prior approaches, our models are more effective at capturing the temporal variability of relevance associated with some topics.
international acm sigir conference on research and development in information retrieval | 2012
Miles Efron; Peter Organisciak; Katrina Fenlon
Collections containing a large number of short documents are becoming increasingly common. As these collections grow in number and size, providing effective retrieval of brief texts presents a significant research problem. We propose a novel approach to improving information retrieval (IR) for short texts based on aggressive document expansion. Starting from the hypothesis that short documents tend to be about a single topic, we submit documents as pseudo-queries and analyze the results to learn about the documents themselves. Document expansion helps in this context because short documents yield little in the way of term frequency information. However, as we show, the proposed technique helps us model not only lexical properties, but also temporal properties of documents. We present experimental results using a corpus of microblog (Twitter) data and a corpus of metadata records from a federated digital library. With respect to established baselines, results of these experiments show that applying our proposed document expansion method yields significant improvements in effectiveness. Specifically, our method improves the lexical representation of documents and the ability to let time influence retrieval.
international acm sigir conference on research and development in information retrieval | 2014
Miles Efron; Jimmy J. Lin; Jiyin He; Arjen P. de Vries
This paper investigates the temporal cluster hypothesis: in search tasks where time plays an important role, do relevant documents tend to cluster together in time? We explore this question in the context of tweet search and temporal feedback: starting with an initial set of results from a baseline retrieval model, we estimate the temporal density of relevant documents, which is then used for result reranking. Our contributions lie in a method to characterize this temporal density function using kernel density estimation, with and without human relevance judgments, and an approach to integrating this information into a standard retrieval model. Experiments on TREC datasets confirm that our temporal feedback formulation improves search effectiveness, thus providing support for our hypothesis. Our approach out-performs both a standard baseline and previous temporal retrieval models. Temporal feedback improves over standard lexical feedback (with and without human judgments), illus- trating that temporal relevance signals exist independently of document content.
international acm sigir conference on research and development in information retrieval | 2015
Yulu Wang; Garrick Sherman; Jimmy J. Lin; Miles Efron
In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the systems goal is to construct an informative summary of non-redundant tweets that addresses the users information need. Central to the evaluation methodology is human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
european conference on information retrieval | 2015
Jinfeng Rao; Jimmy J. Lin; Miles Efron
“Evaluation as a service” (EaaS) is a new methodology for community-wide evaluations where an API provides the only point of access to the collection for completing the evaluation task. Two important advantages of this model are that it enables reproducible IR experiments and encourages sharing of pluggable open-source components. In this paper, we illustrate both advantages by providing open-source implementations of lexical and temporal feedback techniques for tweet search built on the TREC Microblog API. For the most part, we are able to reproduce results reported in previous papers and confirm their general findings. However, experiments on new test collections and additional analyses provide a more nuanced look at the results and highlight issues not discussed in previous studies, particularly the large variances in effectiveness associated with training/test splits.
Proceedings of the American Society for Information Science and Technology | 2011
Miles Efron; Peter Organisciak; Katrina Fenlon
Building topic models in federated digital collections presents numerous challenges due to metadata inconsistencies. The quality of topical metadata is difficult to ascertain and is interspersed with often irrelevant administrative metadata. In this study, we propose a way to improve topic modeling in large collections by identifying documents that convey only weak topical information. These documents are ignored when training topic models. Their topical associations are instead inferred model training. A method is outlined for identifying weakly topical documents by defining runs of similar documents in a collection. In preliminary evaluation using a corpus from the Institute of Museum and Library Services Digital Collections and Content aggregation, results show an increase in coherence among words in topics. In showing this, we demonstrate that it may be beneficial to induce topic models using less, higher-quality data.
ASIST '13 Proceedings of the 76th ASIS&T Annual Meeting: Beyond the Cloud: Rethinking Information Boundaries | 2013
Craig Willis; Miles Efron
Searching large collections of digitized books is a relatively new area in information-seeking and retrieval research, made possible by initiatives such as Google Books and the HathiTrust Digital Library. The availability of large full-text book collections is transforming how users search and interact with information in books, but the characteristics of these changes are unknown. This paper aims to provide insight into the characteristics of full-text searches in a large collection of digitized books and is the first step in a broader research agenda intended to improve book retrieval. To better understand the types of queries that users are issuing to full-text-book collections, we analyzed a full year of anonymized query logs from the HathiTrust Digital Library full-text search engine. We also manually classified a random sample of 600 queries to develop a taxonomy of book search query types. We found that users are beginning to search for information in books instead of searching for books. Searches still largely follow bibliographic models, but, as expected, new types of searches are beginning to take advantage of full-text capabilities. Additionally, comparing the results of our query log analysis to searches in other domains, we found similar search patterns including short queries, sessions with only a few queries, and users viewing only a few pages of results per query. We discuss how these findings can be used to characterize users of large full-text book collections.
international acm sigir conference on research and development in information retrieval | 2013
Fernando Diaz; Susan T. Dumais; Miles Efron; Kira Radinsky; Maarten de Rijke; Milad Shokouhi
Web content increasingly reflects the current state of the physical and social world, manifested both in traditional news media sources along with user-generated publishing sites such as Twitter, Foursquare, and Facebook. At the same time, web searching increasingly reflects problems grounded in the real world. As a result of this blending of the web with the real world, we observe that the web, both in its composition and use, has incorporated many of the dynamics of the real world. Few of the problems associated with searching dynamic collections are well understood, such as defining time-sensitive relevance, understanding user query behavior over time and understanding why certain web content changes. We believe that, just as static collections often benefit from modeling topics, dynamic collections will likely benefit from temporal modeling of events and time-sensitive user interests and intents, which were rarely addressed in the literature. There have been preliminary efforts in the research and industrial communities to address algorithms, architectures, evaluation methodologies and metrics. We aim to bring together practitioners and researchers to discuss their recent breakthroughs and the challenges with addressing time-aware information access, both from the algorithmic and the architectural perspectives. This workshop is a successor to the successful SIGIR 2012 Workshop on Time Aware Information Access (#TAIA2012). Where the 2012 edition was the first to bring together a broad set of academic and industrial researchers around the topic of time-aware information access, the specific focus of this workshop is on the many time-aware benchmarking activities that are ongoing in 2013.