Haotian Zhang
University of Waterloo
Publication
Featured research published by Haotian Zhang.
International Joint Conference on Natural Language Processing | 2015
Luchen Tan; Haotian Zhang; Charles L. A. Clarke; Mark D. Smucker
Compared with carefully edited prose, the language of social media is informal in the extreme. The application of NLP techniques in this context may require a better understanding of word usage within social media. In this paper, we compute a word embedding for a corpus of tweets and compare it to a word embedding for Wikipedia. After learning a transformation from one vector space to the other and adjusting similarity values according to term frequency, we identify words whose usage differs greatly between the two corpora. For any given word, the set of words closest to it in a particular embedding characterizes that word's usage within the corresponding corpus.
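As a rough illustration of the comparison described in this abstract, the sketch below learns a linear map between two embedding spaces with ordinary least squares and ranks words by how poorly their tweet vectors transfer onto their Wikipedia vectors. The toy vectors, vocabulary, and use of least squares are assumptions for illustration; the paper's exact transformation and its frequency-based similarity adjustment are not reproduced here.

```python
# Minimal sketch, not the paper's exact method: learn a linear map from the tweet
# embedding space to the Wikipedia embedding space over a shared vocabulary, then
# flag words whose mapped vectors land far from their Wikipedia vectors. The random
# vectors below stand in for real trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["lol", "bank", "sick", "smh", "election"]
dim = 50
tweet_vecs = {w: rng.normal(size=dim) for w in vocab}   # stand-in for the tweet embedding
wiki_vecs = {w: rng.normal(size=dim) for w in vocab}    # stand-in for the Wikipedia embedding

# Solve min_W ||X W - Y||^2 over the shared vocabulary (ordinary least squares).
X = np.stack([tweet_vecs[w] for w in vocab])
Y = np.stack([wiki_vecs[w] for w in vocab])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words whose mapped tweet vector is least similar to their Wikipedia vector are the
# ones whose usage differs most between the two corpora.
divergence = {w: 1.0 - cosine(tweet_vecs[w] @ W, wiki_vecs[w]) for w in vocab}
for w, score in sorted(divergence.items(), key=lambda kv: -kv[1]):
    print(f"{w}\t{score:.3f}")
```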
OTM Confederated International Conferences "On the Move to Meaningful Internet Systems" | 2014
Claudia Diamantini; Domenico Potena; Emanuele Storti; Haotian Zhang
This paper describes the main functionalities of an ontology-based data explorer for Key Performance Indicators (KPIs), aimed at supporting users in the extraction of KPI values from a shared repository. Data produced by partners of a Virtual Enterprise are semantically annotated through a domain ontology in which KPIs are described together with their mathematical formulas. Based on this model and on reasoning capabilities, the tool provides functionalities for dynamic aggregation of data and computation of KPI values through their formulas. In this way, besides the usual drill-down, a novel mode of data exploration is enabled, based on the expansion of a KPI into its components.
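To make the formula-driven expansion concrete, here is a minimal sketch in which a KPI is either a leaf measure with raw values or a formula over other KPIs, and its value is computed by recursive expansion. The KPI names, formulas, and aggregation by summation are invented for the example and are not drawn from the ontology described in the paper.

```python
# Illustrative sketch of formula-based KPI expansion: a KPI is either a leaf measure
# with raw values contributed by partners or a formula over other KPIs, and a value
# is obtained by recursively expanding the formula. Names and numbers are made up.

raw = {"revenue": [120.0, 135.0], "cost": [80.0, 90.0]}   # leaf measures per partner
formulas = {
    "profit": lambda kpi: kpi("revenue") - kpi("cost"),
    "margin": lambda kpi: kpi("profit") / kpi("revenue"),
}

def kpi_value(name):
    """Aggregate a leaf measure (here: sum over partners) or expand its formula."""
    if name in raw:
        return sum(raw[name])
    return formulas[name](kpi_value)

print(kpi_value("margin"))   # expands margin -> profit -> (revenue, cost)
```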
Conference on Information and Knowledge Management | 2016
Gaurav Baruah; Haotian Zhang; Rakesh Guttikonda; Jimmy J. Lin; Mark D. Smucker; Olga Vechtomova
Nugget-based evaluations, such as those deployed in the TREC Temporal Summarization and Question Answering tracks, require human assessors to determine whether a nugget is present in a given piece of text. This process, known as nugget annotation, is labor-intensive. In this paper, we present two active learning techniques that prioritize the sequence in which candidate nugget/sentence pairs are presented to an assessor, based on the likelihood that the sentence contains a nugget. Our approach builds on the recognition that nugget annotation is similar to high-recall retrieval, and we adapt proven existing solutions. Simulation experiments with four existing TREC test collections show that our techniques yield far more matches for a given level of effort than baselines that are typically deployed in previous nugget-based evaluations.
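The prioritization step can be pictured as follows: train a classifier on the pairs the assessor has already judged and present the remaining pairs in decreasing order of predicted match probability. The features, classifier, and toy data in this sketch are stand-ins, not the techniques evaluated in the paper.

```python
# Sketch of active-learning prioritization for nugget annotation: score each candidate
# nugget/sentence pair with a classifier trained on the judgments collected so far and
# show the assessor the highest-scoring pairs first. Features and data are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

judged = [("capital of France", "Paris is the capital of France.", 1),
          ("capital of France", "France borders Spain and Italy.", 0)]
candidates = [("capital of France", "The French capital, Paris, hosts the parliament."),
              ("capital of France", "Lyon is known for its cuisine.")]

texts = [n + " " + s for n, s, _ in judged] + [n + " " + s for n, s in candidates]
X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X[: len(judged)], [y for *_, y in judged])

# Rank the unjudged pairs by predicted probability that the sentence contains the nugget.
scores = clf.predict_proba(X[len(judged):])[:, 1]
for (nugget, sentence), p in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(f"{p:.2f}\t{sentence}")
```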
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Mustafa Abualsaud; Nimesh Ghelani; Haotian Zhang; Mark D. Smucker; Gordon V. Cormack; Maura R. Grossman
The goal of high-recall information retrieval (HRIR) is to find all or nearly all relevant documents for a search topic. In this paper, we present the design of our system that affords efficient high-recall retrieval. HRIR systems commonly rely on iterative relevance feedback. Our system uses a state-of-the-art implementation of continuous active learning (CAL) and is designed to allow other feedback systems to be attached with little work. Our system allows users to judge documents as fast as possible with no perceptible interface lag. We also support the integration of a search engine for users who would like to interactively search and judge documents. In addition to detailing the design of our system, we report on user feedback collected as part of a 50-participant user study. While we have found that users find the most relevant documents when we restrict user interaction, a majority of participants prefer having flexibility in user interaction. Our work has implications for how to build effective assessment systems and which features of such systems users find useful.
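A hedged sketch of the kind of continuous active learning loop such a system builds on is shown below: retrain on the judgments gathered so far, score the unjudged documents, and route the top-scoring one to the assessor. The toy corpus, seed judgments, and simulated assessor are assumptions for illustration; the actual system uses a state-of-the-art CAL implementation over real document collections.

```python
# Hedged sketch of a continuous active learning (CAL) feedback loop: repeatedly train
# on the judgments collected so far, score the unjudged documents, and send the
# highest-scoring one to the assessor. The corpus, seed judgments, and the simulated
# assessor below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["tax fraud case heads to court", "new tax evasion charges filed",
        "local team wins championship", "recipe for apple pie",
        "appeals court rules on fraud conviction"]
relevant = {0, 1, 4}                # ground truth, used only to simulate the assessor

X = TfidfVectorizer().fit_transform(docs)
judged, labels = [0, 3], [1, 0]     # seed: one relevant and one non-relevant judgment
while len(judged) < len(docs):
    clf = LogisticRegression().fit(X[judged], labels)
    unjudged = [i for i in range(len(docs)) if i not in judged]
    scores = clf.decision_function(X[unjudged])
    nxt = unjudged[int(scores.argmax())]            # most likely relevant unjudged doc
    judged.append(nxt)
    labels.append(1 if nxt in relevant else 0)      # simulated relevance judgment
    print("assessor judges doc", nxt, "->", labels[-1])
```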
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016
Haotian Zhang; Jimmy J. Lin; Gordon V. Cormack; Mark D. Smucker
This paper tackles the challenge of accurately and efficiently estimating the number of relevant documents in a collection for a particular topic. One real-world application is estimating the volume of social media posts (e.g., tweets) pertaining to a topic, which is fundamental to tracking the popularity of politicians and brands, the potential sales of a product, etc. Our insight is to leverage active learning techniques to find all the easy documents, and then to use sampling techniques to infer the number of relevant documents in the residual collection. We propose a simple yet effective technique for determining this switchover point, which intuitively can be understood as the knee in an effort vs. recall gain curve, as well as alternative sampling strategies beyond the knee. We show on several TREC datasets and a collection of tweets that our best technique yields more accurate estimates (with the same effort) than several alternatives.
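The switchover idea can be illustrated with a small worked example: stop active learning at the knee of the gain curve, then estimate relevance in the residual collection from a random sample. The gain curve, simulated sample, and the chord-distance knee test below are assumptions; the paper evaluates its own knee criterion and several sampling strategies beyond the knee.

```python
# Worked sketch of the two-stage estimate: run active learning until the gain curve
# "knees", then sample the residual collection and scale up. The gain curve and the
# simulated sample are synthetic, and the chord-distance knee test is one common
# choice rather than the paper's exact criterion.
import random

random.seed(0)
gain = [0, 40, 70, 88, 95, 97, 98, 98, 99, 99]   # relevant docs found after each batch

def knee(curve):
    """Index of the point farthest above the chord from the first to the last point."""
    n = len(curve) - 1
    chord = lambda i: curve[0] + (curve[-1] - curve[0]) * i / n
    return max(range(len(curve)), key=lambda i: curve[i] - chord(i))

k = knee(gain)
found = gain[k]

# Estimate the number of relevant documents left in the residual collection by
# judging a small random sample and extrapolating.
residual_size = 10_000
sample_judgments = [random.random() < 0.002 for _ in range(500)]   # simulated judging
estimate = found + residual_size * sum(sample_judgments) / len(sample_judgments)
print(f"knee at batch {k}; estimated total relevant ~ {estimate:.0f}")
```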
Conference on Information and Knowledge Management | 2018
Haotian Zhang; Mustafa Abualsaud; Nimesh Ghelani; Mark D. Smucker; Gordon V. Cormack; Maura R. Grossman
High-recall retrieval --- finding all or nearly all relevant documents --- is critical to applications such as electronic discovery, systematic review, and the construction of test collections for information retrieval tasks. The effectiveness of current methods for high-recall information retrieval is limited by their reliance on human input, either to generate queries or to assess the relevance of documents. Past research has shown that humans can assess the relevance of documents faster and with little loss in accuracy by judging shorter document surrogates, e.g., extractive summaries, in place of full documents. To test the hypothesis that short document surrogates can reduce assessment time and effort for high-recall retrieval, we conducted a 50-person controlled user study. We designed a high-recall retrieval system using continuous active learning (CAL) that could display either full documents or short document excerpts for relevance assessment. In addition, we tested the value of integrating a search engine with CAL. In the experiment, we asked participants to try to find as many relevant documents as possible within one hour. We observed that our study participants were able to find significantly more relevant documents when they used the system with document excerpts as opposed to full documents. We also found that allowing participants to compose and execute their own search queries did not improve their ability to find relevant documents and, by some measures, impaired performance. These results suggest that for high-recall systems to maximize performance, system designers should think carefully about the amount and nature of user interaction incorporated into the system.
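As a rough picture of the excerpt condition, the sketch below selects the single passage of a document that scores highest against the topic and shows only that passage to the assessor. The TF-IDF scorer and the toy document are stand-ins for illustration, not the study's actual excerpt selector.

```python
# Illustrative sketch of the document-excerpt idea: instead of the full document, show
# the assessor only the passage that scores highest against the topic. The TF-IDF
# similarity used here is an assumption, not the study's actual excerpt selection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

topic = "tax evasion prosecution"
passages = ["The city council met on Tuesday to discuss parking fees.",
            "Prosecutors filed new tax evasion charges against the firm.",
            "In other news, the weather was unusually warm this week."]

vec = TfidfVectorizer().fit([topic] + passages)
scores = cosine_similarity(vec.transform([topic]), vec.transform(passages))[0]
print("Show assessor:", passages[int(scores.argmax())])
```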
Conference on Human Information Interaction and Retrieval | 2018
Haotian Zhang; Mustafa Abualsaud; Mark D. Smucker
When search results fail to satisfy users' information needs, users often reformulate their search query in the hope of receiving better results. In many cases, users immediately requery without clicking on any search results. In this paper, we report on a user study designed to investigate the rate at which users immediately reformulate at different levels of search quality. We had users search for answers to questions as we manipulated the placement of the only relevant document in a ranked list of search results. We show that as the quality of search results decreases, the probability of immediately requerying increases. We find that users can quickly decide to immediately reformulate, and the time to immediately reformulate appears to be independent of the quality of the search results. Finally, we show that there appear to be two types of users: one group has a high probability of immediately reformulating, and the other is unlikely to immediately reformulate unless no relevant documents can be found in the search results. While requerying takes time, it is the group of users who are more likely to immediately requery that is able to find answers to questions the fastest.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2017
Haotian Zhang; Jinfeng Rao; Jimmy J. Lin; Mark D. Smucker
We propose a heuristic called one answer per document for automatically extracting high-quality negative examples for answer selection in question answering. Starting with a collection of question-answer pairs from the popular TrecQA dataset, we identify the original documents from which the answers were drawn. Sentences from these source documents that contain query terms (aside from the answers) are selected as negative examples. Training on the original data plus these negative examples yields improvements in effectiveness by a margin that is comparable to successive recent publications on this dataset. Our technique is completely unsupervised, which means that the gains come essentially for free. We confirm that the improvements can be directly attributed to our heuristic, as other approaches to extracting comparable amounts of training data are not effective. Beyond the empirical validation of this heuristic, we also share our improved TrecQA dataset with the community to support further work in answer selection.
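The heuristic itself is easy to picture; the sketch below takes a question, its known answer, and the sentences of the answer's source document, and keeps as negatives the sentences that share query terms with the question but do not contain the answer. The tokenizer, stopword list, and toy document are simplifications rather than the paper's exact pipeline.

```python
# Sketch of the "one answer per document" heuristic: from the source document of a
# known answer, keep as negative examples the sentences that share query terms with
# the question but do not contain the answer. Tokenization, the stopword list, and
# the toy document are simplifications of the actual pipeline.
import re

STOPWORDS = {"what", "is", "the", "of", "a", "in"}

def negatives_from_source(question, answer, source_sentences):
    q_terms = set(re.findall(r"\w+", question.lower())) - STOPWORDS
    negatives = []
    for sent in source_sentences:
        if answer.lower() in sent.lower():
            continue                                  # skip sentences containing the answer
        if q_terms & set(re.findall(r"\w+", sent.lower())):
            negatives.append(sent)                    # shares query terms, lacks the answer
    return negatives

doc = ["The Amazon is the largest river in South America by discharge volume.",
       "The Amazon basin covers much of South America.",
       "Many tourists visit Brazil every year."]
print(negatives_from_source("What is the largest river in South America?",
                            "The Amazon is the largest river in South America by discharge volume.",
                            doc))
```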
arXiv: Information Retrieval | 2017
Royal Sequiera; Gaurav Baruah; Zhucheng Tu; Salman Mohammed; Jinfeng Rao; Haotian Zhang; Jimmy J. Lin
arXiv: Information Retrieval | 2017
Jinfeng Rao; Hua He; Haotian Zhang; Ferhan Türe; Royal Sequiera; Salman Mohammed; Jimmy J. Lin