Bo-June Paul Hsu | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Bo-June Paul Hsu is active.

Explore More

Publication

Featured researches published by Bo-June Paul Hsu.

international world wide web conferences | 2015

An Overview of Microsoft Academic Service (MAS) and Applications

Arnab Sinha; Zhihong Shen; Yang Song; Hao Ma; Darrin Eide; Bo-June Paul Hsu; Kuansan Wang

In this paper we describe a new release of a Web scale entity graph that serves as the backbone of Microsoft Academic Service (MAS), a major production effort with a broadened scope to the namesake vertical search engine that has been publicly available since 2008 as a research prototype. At the core of MAS is a heterogeneous entity graph comprised of six types of entities that model the scholarly activities: field of study, author, institution, paper, venue, and event. In addition to obtaining these entities from the publisher feeds as in the previous effort, we in this version include data mining results from the Web index and an in-house knowledge base from Bing, a major commercial search engine. As a result of the Bing integration, the new MAS graph sees significant increase in size, with fresh information streaming in automatically following their discoveries by the search engine. In addition, the rich entity relations included in the knowledge base provide additional signals to disambiguate and enrich the entities within and beyond the academic domain. The number of papers indexed by MAS, for instance, has grown from low tens of millions to 83 million while maintaining an above 95% accuracy based on test data sets derived from academic activities at Microsoft Research. Based on the data set, we demonstrate two scenarios in this work: a knowledge driven, highly interactive dialog that seamlessly combines reactive search and proactive suggestion experience, and a proactive heterogeneous entity recommendation.

international world wide web conferences | 2011

Online spelling correction for query completion

Huizhong Duan; Bo-June Paul Hsu

In this paper, we study the problem of online spelling correction for query completions. Misspelling is a common phenomenon among search engines queries. In order to help users effectively express their information needs, mechanisms for automatically correcting misspelled queries are required. Online spelling correction aims to provide spell corrected completion suggestions as a query is incrementally entered. As latency is crucial to the utility of the suggestions, such an algorithm needs to be not only accurate, but also efficient. To tackle this problem, we propose and study a generative model for input queries, based on a noisy channel transformation of the intended queries. Utilizing spelling correction pairs, we train a Markov n-gram transformation model that captures user spelling behavior in an unsupervised fashion. To find the top spell-corrected completion suggestions in real-time, we adapt the A* search algorithm with various pruning heuristics to dynamically expand the search space efficiently. Evaluation of the proposed methods demonstrates a substantial increase in the effectiveness of online spelling correction over existing techniques.

empirical methods in natural language processing | 2006

Style & Topic Language Model Adaptation Using HMM-LDA

Bo-June Paul Hsu; James R. Glass

Adapting language models across styles and topics, such as for lecture transcription, involves combining generic style models with topic-specific content relevant to the target document. In this work, we investigate the use of the Hidden Markov Model with Latent Dirichlet Allocation (HMM-LDA) to obtain syntactic state and semantic topic assignments to word instances in the training corpus. From these context-dependent labels, we construct style and topic models that better model the target document, and extend the traditional bag-of-words topic models to n-grams. Experiments with static model interpolation yielded a perplexity and relative word error rate (WER) reduction of 7.1% and 2.1%, respectively, over an adapted trigram baseline. Adaptive interpolation of mixture components further reduced perplexity by 9.5% and WER by a modest 0.3%.

international acm sigir conference on research and development in information retrieval | 2011

Unsupervised query segmentation using clickthrough for information retrieval

Yanen Li; Bo-June Paul Hsu; ChengXiang Zhai; Kuansan Wang

Query segmentation is an important task toward understanding queries accurately, which is essential for improving search results. Existing segmentation models either use labeled data to predict the segmentation boundaries, for which the training data is expensive to collect, or employ unsupervised strategy based on a large text corpus, which might be inaccurate because of the lack of relevant information. In this paper, we propose a probabilistic model to exploit clickthrough data for query segmentation, where the model parameters are estimated via an efficient EM algorithm. We further study how to properly interpret the segmentation results and utilize them to improve retrieval accuracy. Specifically, we propose an integrated language model based on the standard bigram language model to exploit the probabilistic structure obtained through query segmentation. Experiment results on two datasets show that our segmentation model outperforms existing segmentation models. Furthermore, extensive experiments on a large retrieval dataset reveals that the results of query segmentation can be leveraged to improve retrieval relevance by using the proposed integrated language model.

international world wide web conferences | 2011

Web scale NLP: a case study on url word breaking

Kuansan Wang; Christopher Thrasher; Bo-June Paul Hsu

This paper uses the URL word breaking task as an example to elaborate what we identify as crucial in designing statistical natural language processing (NLP) algorithms for Web scale applications: (1) rudimentary multilingual capabilities to cope with the global nature of the Web, (2) multi-style modeling to handle diverse language styles seen in the Web contents, (3) fast adaptation to keep pace with the dynamic changes of the Web, (4) minimal heuristic assumptions for generalizability and robustness, and (5) possibilities of efficient implementations and minimal manual efforts for processing massive amount of data at a reasonable cost. We first show that the state-of-the-art word breaking techniques can be unified and generalized under the Bayesian minimum risk (BMR) framework that, using a Web scale N-gram, can meet the first three requirements. We discuss how the existing techniques can be viewed as introducing additional assumptions to the basic BMR framework, and describe a generic yet efficient implementation called word synchronous beam search. Testing the framework and its implementation on a series of large scale experiments reveals the following. First, the language style used to build the model plays a critical role in the word breaking task, and the most suitable for the URL word breaking task appears to be that of the document title where the best performance is obtained. Models created from other language styles, such as from document body, anchor text, and even queries, exhibit varying degrees of mismatch. Although all styles benefit from increasing modeling power which, in our experiments, corresponds to the use of a higher order N-gram, the gain is most recognizable for the title model. The heuristics proposed by the prior arts do contribute to the word breaking performance for mismatched or less powerful models, but are less effective and, in many cases, lead to poorer performance than the matched model with minimal assumptions. For the matched model based on document titles, an accuracy rate of 97.18% can already be achieved using simple trigram without any heuristics.

ieee automatic speech recognition and understanding workshop | 2007

Generalized linear interpolation of language models

Bo-June Paul Hsu

Despite the prevalent use of model combination techniques to improve speech recognition performance on domains with limited data, little prior research has focused on the choice of the actual interpolation model. For merging language models, the most popular approach has been the simple linear interpolation. In this work, we propose a generalization of linear interpolation that computes context-dependent mixture weights from arbitrary features. Results on a lecture transcription task yield up to a 1.0% absolute improvement in recognition word error rate (WER).

web search and data mining | 2014

On building entity recommender systems using user click log and freebase knowledge

Xiao Yu; Hao Ma; Bo-June Paul Hsu; Jiawei Han

Due to their commercial value, search engines and recommender systems have become two popular research topics in both industry and academia over the past decade. Although these two fields have been actively and extensively studied separately, researchers are beginning to realize the importance of the scenarios at their intersection: providing an integrated search and information discovery user experience. In this paper, we study a novel application, i.e., personalized entity recommendation for search engine users, by utilizing user click log and the knowledge extracted from Freebase. To better bridge the gap between search engines and recommender systems, we first discuss important heuristics and features of the datasets. We then propose a generic, robust, and time-aware personalized recommendation framework to utilize these heuristics and features at different granularity levels. Using movie recommendation as a case study, with user click log dataset collected from a widely used commercial search engine, we demonstrate the effectiveness of our proposed framework over other popular and state-of-the-art recommendation techniques.

web search and data mining | 2015

Learning to Recommend Related Entities to Search Users

Bin Bi; Hao Ma; Bo-June Paul Hsu; Wei Chu; Kuansan Wang; Junghoo Cho

Over the past few years, major web search engines have introduced knowledge bases to offer popular facts about people, places, and things on the entity pane next to regular search results. In addition to information about the entity searched by the user, the entity pane often provides a ranked list of related entities. To keep users engaged, it is important to develop a recommendation model that tailors the related entities to individual user interests. We propose a probabilistic Three-way Entity Model (TEM) that provides personalized recommendation of related entities using three data sources: knowledge base, search click log, and entity pane log. Specifically, TEM is capable of extracting hidden structures and capturing underlying correlations among users, main entities, and related entities. Moreover, the TEM model can also exploit the click signals derived from the entity pane log. We further provide an inference technique to learn the parameters in TEM, and propose a principled preference learning method specifically designed for ranking related entities. Extensive experiments with two real-world datasets show that TEM with our probabilistic framework significantly outperforms a state of the art baseline, confirming the effectiveness of TEM and our probabilistic framework in related entity recommendation.

human factors in computing systems | 2011

Sampling representative phrase sets for text entry experiments: a procedure and public resource

Tim Paek; Bo-June Paul Hsu

Text entry experiments evaluating the effectiveness of various input techniques often employ a procedure whereby users are prompted with natural language phrases which they are instructed to enter as stimuli. For experimental validity, it is desirable to control the stimuli and present text that is representative of a target task, domain or language. MacKenzie and Soukoreff (2001) manually selected a set of 500 phrases for text entry experiments. To demonstrate representativeness, they correlated the distribution of single letters in their phrase set to a relatively small (by current standards) corpus of English prior to 1966, which may not reflect the style of text input today. In this paper, we ground the notion of representativeness in terms of information theory and propose a procedure for sampling representative phrases from any large corpus so that researchers can curate their own stimuli. We then describe the characteristics of phrase sets we generated using the procedure for email and social media (Facebook and Twitter). The phrase sets and code for the procedure are publicly available for download.

signal processing systems | 2001

Real-Time Automated Video and Audio Capture with Multiple Cameras and Microphones

Ce Wang; Scott M. Griebel; Michael S. Brandstein; Bo-June Paul Hsu

This work presents the acoustic and visual-based tracking system functioning at the Harvard Intelligent Multi-Media Environments Laboratory (HIMMEL). The environment is populated with a number of microphones and steerable video cameras. Acoustic source localization, video-based face tracking and pose estimation, and multi-channel speech enhancement methods are applied in combination to detect and track individuals in a practical environment while also providing an improved audio signal to accompany the video stream. The video portion of the system tracks talkers by utilizing source motion, contour geometry, color data, and simple facial features. Decisions involving which camera to use are based on an estimate of the heads gazing angle. This head pose estimation is achieved using a very general head model which employs hairline features and a learned network classification procedure. Finally, a beamforming and postfiltering microphone array technique is used to create an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g. ovals in the scene of interest which are not faces) and audible noise (e.g. reverberations and background noise).

Explore More