Publication


Featured research published by Jiyin He.


European Conference on Information Retrieval | 2008

Using coherence-based measures to predict query difficulty

Jiyin He; Martha Larson; Maarten de Rijke

We investigate the potential of coherence-based scores to predict query difficulty. The coherence of a document set associated with each query word is used to capture the quality of a query topic aspect. A simple query coherence score, QC-1, is proposed that requires the average coherence contribution of individual query terms to be high. Two further query scores, QC-2 and QC-3, are developed by constraining QC-1 in order to capture the semantic similarity among query topic aspects. All three query coherence scores show the correlation with average precision necessary to make them good predictors of query difficulty. Simple and efficient, the measures require no training data and are competitive with language model-based clarity scores.
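As a rough illustration of the idea behind QC-1, the sketch below averages a per-term coherence value over the query terms. The abstract does not define coherence precisely, so everything here is an assumption: coherence is taken to be the fraction of document pairs whose cosine similarity reaches a threshold, documents are sparse term-weight dictionaries, and the names `cosine`, `coherence`, `qc1`, and the `threshold` parameter are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def coherence(docs, threshold):
    """Assumed definition: fraction of document pairs with similarity >= threshold."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0  # a set with fewer than two documents has no pairs to compare
    return sum(cosine(a, b) >= threshold for a, b in pairs) / len(pairs)


def qc1(query_terms, collection, threshold=0.5):
    """Average coherence of the document sets associated with each query term."""
    scores = []
    for term in query_terms:
        # The document set for a term is every document that contains it.
        docs = [d for d in collection if term in d]
        scores.append(coherence(docs, threshold))
    return sum(scores) / len(scores) if scores else 0.0


# Toy collection: each document maps terms to weights (hypothetical data).
collection = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "c": 1.0},
    {"d": 1.0},
]
score = qc1(["a", "b"], collection, threshold=0.6)
```

A higher score suggests that the documents matching each query term are topically tight, which under this reading would indicate an easier query.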


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

Combining implicit and explicit topic representations for result diversification

Jiyin He; Vera Hollink; Arjen P. de Vries

Result diversification deals with ambiguous or multi-faceted queries by providing documents that cover as many subtopics of a query as possible. Various approaches to subtopic modeling have been proposed. Subtopics have been extracted internally, e.g., from retrieved documents, and externally, e.g., from Web resources such as query logs. Internally modeled subtopics are often implicitly represented, e.g., as latent topics, while externally modeled subtopics are often explicitly represented, e.g., as reformulated queries. We propose a framework that: i) combines both implicitly and explicitly represented subtopics; and ii) allows flexible combination of multiple external resources in a transparent and unified manner. Specifically, we use a random walk based approach to estimate the similarities of the explicit subtopics mined from a number of heterogeneous resources: click logs, anchor text, and web n-grams. We then use these similarities to regularize the latent topics extracted from the top-ranked documents, i.e., the internal (implicit) subtopics. Empirical results show that regularization with explicit subtopics extracted from the right resource leads to improved diversification results, indicating that the proposed regularization with (explicit) external resources forms better (implicit) topic models. Click logs and anchor text are shown to be more effective resources than web n-grams under current experimental settings. Combining resources does not always lead to better results, but achieves a robust performance. This robustness is important for two reasons: it cannot be predicted which resources will be most effective for a given query, and it is not yet known how to reliably determine the optimal model parameters for building implicit topic models.


Conference on Information and Knowledge Management | 2011

Generating links to background knowledge: a case study using narrative radiology reports

Jiyin He; Maarten de Rijke; Merlijn Sevenster; Rob C. van Ommering; Yuechen Qian

Automatically annotating texts with background information has recently received much attention. We conduct a case study in automatically generating links from narrative radiology reports to Wikipedia. Such links help users understand the medical terminology and thereby increase the value of the reports. Direct applications of existing automatic link generation systems trained on Wikipedia to our radiology data do not yield satisfactory results. Our analysis reveals that medical phrases are often syntactically regular but semantically complicated, e.g., containing multiple concepts or concepts with multiple modifiers. The latter property is the main reason for the failure of existing systems. Based on this observation, we propose an automatic link generation approach that takes into account these properties. We use a sequential labeling approach with syntactic features for anchor text identification in order to exploit syntactic regularities in medical terminology. We combine this with a sub-anchor based approach to target finding, which is aimed at coping with the complex semantic structure of medical phrases. Empirical results show that the proposed system effectively improves the performance over existing systems.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2014

Temporal feedback for tweet search with non-parametric density estimation

Miles Efron; Jimmy J. Lin; Jiyin He; Arjen P. de Vries

This paper investigates the temporal cluster hypothesis: in search tasks where time plays an important role, do relevant documents tend to cluster together in time? We explore this question in the context of tweet search and temporal feedback: starting with an initial set of results from a baseline retrieval model, we estimate the temporal density of relevant documents, which is then used for result reranking. Our contributions lie in a method to characterize this temporal density function using kernel density estimation, with and without human relevance judgments, and an approach to integrating this information into a standard retrieval model. Experiments on TREC datasets confirm that our temporal feedback formulation improves search effectiveness, thus providing support for our hypothesis. Our approach outperforms both a standard baseline and previous temporal retrieval models. Temporal feedback improves over standard lexical feedback (with and without human judgments), illustrating that temporal relevance signals exist independently of document content.
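The general shape of temporal feedback with kernel density estimation can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the kernel is Gaussian, the initial retrieval scores serve as kernel weights (the setting without human judgments), and the final score is a simple linear interpolation of normalized retrieval score and normalized density. The function names and the `bandwidth` and `mix` parameters are hypothetical.

```python
import math


def gaussian_kde(timestamps, weights, bandwidth):
    """Return a weighted Gaussian kernel density estimate over time."""
    norm = sum(weights) * bandwidth * math.sqrt(2 * math.pi)

    def density(t):
        return sum(
            w * math.exp(-0.5 * ((t - ts) / bandwidth) ** 2)
            for ts, w in zip(timestamps, weights)
        ) / norm

    return density


def temporal_rerank(results, bandwidth=3600.0, mix=0.5):
    """Rerank (doc_id, timestamp, score) tuples by mixing score and temporal density.

    Assumes a non-empty result list with strictly positive retrieval scores.
    """
    times = [t for _, t, _ in results]
    scores = [s for _, _, s in results]
    # Without relevance judgments, weight each result by its retrieval score.
    density = gaussian_kde(times, scores, bandwidth)
    dens = [density(t) for t in times]
    max_s, max_d = max(scores), max(dens)
    combined = [
        (doc, (1 - mix) * s / max_s + mix * d / max_d)
        for (doc, _, s), d in zip(results, dens)
    ]
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Hypothetical initial ranking: three tweets close in time, one far away.
initial = [
    ("a", 0.0, 1.0),
    ("b", 100000.0, 0.9),
    ("c", 100.0, 0.8),
    ("d", 200.0, 0.7),
]
reranked = temporal_rerank(initial)
```

In this toy input, the temporally isolated tweet "b" falls below the three tweets that cluster together in time, which is the behaviour the temporal cluster hypothesis would predict for relevant documents.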


International Journal on Document Analysis and Recognition | 2009

An effective coherence measure to determine topical consistency in user-generated content

Jiyin He; Wouter Weerkamp; Martha Larson; Maarten de Rijke

When searching for blogs on a specific topic, information seekers prefer blogs that place a central focus on that topic over blogs whose mention of the topic is diffuse or incidental. In order to present users with better blog feed search results, we developed a measure of topical consistency that is able to capture whether or not a blog is topically focused. The measure, called the coherence score, is inspired by the genetics literature and captures the tightness of the clustering structure of a data set relative to a background collection. In a set of experiments on synthetic data, the coherence score is shown to provide a faithful reflection of topic clustering structure. The properties that make the coherence score more appropriate than lexical cohesion, a common measure of topical structure, are discussed. Retrieval experiments show that integrating the coherence score as a prior in a language modeling-based approach to blog feed search improves retrieval effectiveness. The coherence score must, however, be used judiciously in order to avoid boosting the ranking of irrelevant but topically focused blogs. To this end, we experiment with a series of weighting schemes that adjust the contribution of the coherence score according to the relevance of a blog to the user query. An appropriate weighting scheme is able to improve retrieval performance. Finally, we show that the coherence score can be reliably estimated with a sample exceeding 20 posts in size. Consistent with this finding, experiments show that the best retrieval performance is achieved if coherence scores are used when a blog contains more than 20 posts.


Machine Vision Applications | 2014

A rule-based event detection system for real-life underwater domain

Concetto Spampinato; Emmanuelle Beauxis-Aussalet; Simone Palazzo; Cigdem Beyan; Jacco van Ossenbruggen; Jiyin He; Bas Boom; Xuan Huang

Understanding and analyzing fish behaviour is a fundamental task for biologists who study marine ecosystems, because changes in animal behaviour reflect environmental conditions such as pollution and climate change. To support investigators in addressing these complex questions, underwater cameras have recently been used. They can continuously monitor marine life while having almost no influence on the environment under observation, which is not the case with observations made by divers, for instance. However, the huge quantity of recorded data makes manual video analysis practically impossible. Thus, machine vision approaches are needed to distill the information to be investigated. In this paper, we propose an automatic event detection system able to identify solitary and pairing behaviours of the most common fish species of the Taiwanese coral reef. More specifically, the proposed system employs robust low-level processing modules for fish detection, tracking and recognition that extract the raw data used in the event detection process. Then each fish trajectory is modeled and classified using hidden Markov models. The events of interest are detected by integrating end-user rules, specified through an ad hoc user interface, and the analysis of fish trajectories. The system was tested on 499 events of interest, divided into solitary and pairing events for each fish species. It achieved an average accuracy of 0.105, expressed in terms of normalized detection cost. The obtained results are promising, especially given the difficulties occurring in underwater environments. Moreover, the system allows marine biologists to speed up the behaviour analysis process and to reliably carry out their investigations.


Proceedings of the First International Workshop on Gamification for Information Retrieval | 2014

Studying user browsing behavior through gamified search tasks

Jiyin He; Marc Bron; Leif Azzopardi; Arjen P. de Vries

Typical crowdsourcing tasks ask workers to label images or make relevance judgements, as a low-cost alternative to lab-based user studies. More recently, gamification has been employed as a way to make these tasks more appealing, so that users play rather than work. One observation is that differences in task design and incentives elicit different player behavior. In this paper we discuss a new type of task, where we aim at eliciting player behavior that resembles user behavior when performing a search task. Care should be taken in the design of a gamified version of such a task to allow players to complete tasks with a limited amount of effort and time, without changing the behavior to be studied. We discuss the motivation of the abstractions and design choices we have made in achieving this goal. We then analyze whether and how these abstractions and design choices influence our observations of player behaviors.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2010

A ranking approach to target detection for automatic link generation

Jiyin He; Maarten de Rijke

We focus on the task of target detection in automatic link generation with Wikipedia, i.e., given an N-gram in a snippet of text, find the relevant Wikipedia concepts that explain or provide background knowledge for it. We formulate the task as a ranking problem and investigate the effectiveness of learning to rank approaches and of the features that we use to rank the target concepts for a given N-gram. Our experiments show that learning to rank approaches outperform traditional binary classification approaches. Also, our proposed features are effective both in binary classification and learning to rank settings.


European Conference on Information Retrieval | 2009

Investigating the Global Semantic Impact of Speech Recognition Error on Spoken Content Collections

Martha Larson; Manos Tsagkias; Jiyin He; Maarten de Rijke

Errors in speech recognition transcripts have a negative impact on the effectiveness of content-based speech retrieval and present a particular challenge for collections containing conversational spoken content. We propose a Global Semantic Distortion (GSD) metric that measures the collection-wide impact of speech recognition error on spoken content retrieval in a query-independent manner. We deploy our metric to examine the effects of speech recognition substitution errors. First, we investigate frequent substitutions, cases in which the recognizer habitually mis-transcribes one word as another. Although habitual mistakes have a large global impact, the long tail of rare substitutions has a more damaging effect. Second, we investigate semantically similar substitutions, cases in which the word spoken and the word recognized do not diverge radically in meaning. Similar substitutions are shown to have slightly less global impact than semantically dissimilar substitutions.


Lecture Notes in Computer Science | 2009

Link Detection with Wikipedia

Jiyin He

This paper describes our participation in the INEX 2008 Link the Wiki track. We focused on the file-to-file task and submitted three runs, which were designed to compare the impact of different features on link generation. For outgoing links, we introduce the anchor likelihood ratio as an indicator for anchor detection, and explore two types of evidence for target identification, namely, the title field evidence and the topic article content evidence. We find that the anchor likelihood ratio is a useful indicator for anchor detection, and that in addition to the title field evidence, re-ranking with the topic article content evidence is effective for improving target identification. For incoming links, we use an exact match approach and a retrieval method based on language modeling, and find that the exact match approach works best. In addition, our experiments show that the semantic relatedness between Wikipedia articles also has some ability to indicate links.

Collaboration


Dive into Jiyin He's collaborations.

Top Co-Authors

Arjen P. de Vries

Radboud University Nijmegen

Marc Bron

University of Amsterdam

M. de Rijke

University of Amsterdam

Martha Larson

Delft University of Technology
