Publication

Featured research published by Yubin Kim.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016

Load-Balancing in Distributed Selective Search

Yubin Kim; Jamie Callan; J. Shane Culpepper; Alistair Moffat

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of shards to search for each query, fewer postings are evaluated. Here we extend the study of selective search using a fine-grained simulation investigating: selective search efficiency in a parallel query processing environment; the difference in efficiency when term-based and sample-based resource selection algorithms are used; and the effect of two policies for assigning index shards to machines. Results obtained for two large datasets and four large query logs confirm that selective search is significantly more efficient than conventional distributed search. In particular, we show that selective search is capable of both higher throughput and lower latency in a parallel environment than is exhaustive search.
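The selective search pipeline this abstract describes (partition the collection into topical shards, rank shards per query, evaluate postings only in the top-ranked shards) can be sketched as follows. The shard names, toy postings, and scoring are illustrative only, not data or algorithms from the paper:

```python
from collections import Counter

# Toy corpus already partitioned into topical shards (illustrative data):
# each shard maps a term to its postings list of document ids.
shards = {
    "sports":  {"game": [1, 2], "team": [2, 3], "score": [1]},
    "finance": {"bank": [4, 5], "score": [5], "market": [4]},
    "health":  {"diet": [6], "team": [7]},
}

def rank_shards(query_terms):
    """Term-based resource ranking: score each shard by how many
    query-term postings it holds (a crude stand-in for real term-based
    resource selection algorithms)."""
    scores = Counter()
    for name, index in shards.items():
        for t in query_terms:
            scores[name] += len(index.get(t, []))
    return [name for name, s in scores.most_common() if s > 0]

def selective_search(query_terms, top_shards=1):
    """Evaluate postings only in the top-ranked shards."""
    docs = set()
    for name in rank_shards(query_terms)[:top_shards]:
        for t in query_terms:
            docs.update(shards[name].get(t, []))
    return sorted(docs)

print(selective_search(["game", "score"], top_shards=1))  # [1, 2]
```

With `top_shards=1`, only the "sports" shard is searched, so the "finance" and "health" postings are never evaluated — the source of the cost reduction the abstract reports.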


Information Retrieval | 2017

Efficient distributed selective search

Yubin Kim; Jamie Callan; J. Shane Culpepper; Alistair Moffat

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of shards to search for each query, fewer postings are evaluated. In this paper we extend the study of selective search into new areas using a fine-grained simulation, examining the difference in efficiency when term-based and sample-based resource selection algorithms are used; measuring the effect of two policies for assigning index shards to machines; and exploring the benefits of index-spreading and mirroring as the number of deployed machines is varied. Results obtained for two large datasets and four large query logs confirm that selective search is significantly more efficient than conventional distributed search architectures and can handle higher query rates. Furthermore, we demonstrate that selective search can be tuned to avoid bottlenecks, and thus maximize usage of the underlying computer hardware.
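The effect of a shard-to-machine assignment policy can be pictured with two toy policies — random placement versus a greedy load-balancing heuristic. These are illustrative; they are not the specific policies evaluated in the paper:

```python
import random

def random_assignment(shard_loads, machines, seed=0):
    """Policy 1: assign each shard to a machine at random."""
    rng = random.Random(seed)
    return {s: rng.randrange(machines) for s in shard_loads}

def load_aware_assignment(shard_loads, machines):
    """Policy 2: greedily place the most-loaded shard on the currently
    least-loaded machine (longest-processing-time heuristic)."""
    totals = [0.0] * machines
    out = {}
    for s, load in sorted(shard_loads.items(), key=lambda kv: -kv[1]):
        m = totals.index(min(totals))  # least-loaded machine
        out[s] = m
        totals[m] += load
    return out

# Illustrative per-shard query loads (e.g. postings evaluated per query log).
loads = {"s1": 9, "s2": 5, "s3": 4, "s4": 3, "s5": 3}
print(load_aware_assignment(loads, machines=2))
```

The greedy policy yields machine loads of 12 and 12 for this example, whereas a random placement can easily put the two hottest shards on the same machine — the kind of bottleneck the abstract says careful tuning avoids.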


European Conference on Information Retrieval | 2016

Does Selective Search Benefit from WAND Optimization?

Yubin Kim; Jamie Callan; J. Shane Culpepper; Alistair Moffat

Selective search is a distributed retrieval technique that reduces the computational cost of large-scale information retrieval. By partitioning the collection into topical shards, and using a resource selection algorithm to identify a subset of shards to search, selective search allows retrieval effectiveness to be maintained while evaluating fewer postings, often resulting in 90+% reductions in querying cost. However, there has been only limited attention given to the interaction between dynamic pruning algorithms and topical index shards. We demonstrate that the WAND dynamic pruning algorithm is more effective on topical index shards than it is on randomly-organized index shards, and that the savings generated by selective search and WAND are additive. We also compare two methods for applying WAND to topical shards: searching each shard with a separate top-k heap and threshold; and sequentially passing a shared top-k heap and threshold from one shard to the next, in the order established by a resource selection mechanism. Separate top-k heaps provide low query latency, whereas a shared top-k heap provides higher throughput.
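The two heap-sharing policies the abstract compares can be sketched with a toy document-at-a-time scorer that applies WAND-style max-score pruning. The data, scores, and function name are illustrative, not the paper's implementation:

```python
import heapq

def wand_search(postings, max_scores, k, heap=None):
    """Simplified WAND-style pruning: a document is fully scored only if
    the sum of its terms' score upper bounds can beat the current k-th
    best score. Passing a shared `heap` across calls mimics the shared
    top-k heap/threshold policy; omitting it gives each shard its own."""
    if heap is None:
        heap = []  # min-heap of (score, doc)
    docs = {}  # doc -> {term: score contribution}
    for term, plist in postings.items():
        for doc, score in plist:
            docs.setdefault(doc, {})[term] = score
    for doc, term_scores in sorted(docs.items()):
        threshold = heap[0][0] if len(heap) == k else 0.0
        if sum(max_scores[t] for t in term_scores) <= threshold:
            continue  # pruned without full scoring
        score = sum(term_scores.values())
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return heap

# Shard 1 fills the heap; the resulting threshold then prunes shard 2
# entirely, because shard 2's per-term upper bounds are low.
shard1 = {"game": [(1, 2.0), (2, 1.5)], "score": [(1, 1.0)]}
shard2 = {"game": [(9, 0.4)], "score": [(9, 0.3)]}
heap = wand_search(shard1, {"game": 2.0, "score": 1.0}, k=2)
heap = wand_search(shard2, {"game": 0.4, "score": 0.3}, k=2, heap=heap)
print(sorted(heap))  # [(1.5, 2), (3.0, 1)]
```

With separate heaps, every shard can be searched in parallel (lower latency); sharing one heap lets later shards be pruned more aggressively (higher throughput) — the trade-off the abstract reports.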


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2017

Learning To Rank Resources

Zhuyun Dai; Yubin Kim; Jamie Callan

We present a learning-to-rank approach for resource selection. We develop features for resource ranking and present a training approach that does not require human judgments. Our method is well-suited to environments with a large number of resources such as selective search, is an improvement over the state-of-the-art in resource selection for selective search, and is statistically equivalent to exhaustive search even for recall-oriented metrics such as MAP@1000, an area in which selective search was lacking.
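One way to read "a training approach that does not require human judgments" is to treat the top documents from an exhaustive search as pseudo-relevant, and label each shard by how many of them it holds. The sketch below illustrates that idea only; it is not the paper's actual procedure, and the names are hypothetical:

```python
def shard_labels(exhaustive_topk, shard_assignment):
    """Generate training labels without human judgments: the documents
    an exhaustive search ranked highest act as pseudo-relevant, and a
    shard's label is the number of them it contains.

    exhaustive_topk: doc ids from an exhaustive (all-shards) run.
    shard_assignment: doc id -> shard name (the shard map)."""
    labels = {}
    for doc in exhaustive_topk:
        shard = shard_assignment[doc]
        labels[shard] = labels.get(shard, 0) + 1
    return labels

# Toy shard map and exhaustive top-3 result for one training query.
assignment = {1: "A", 2: "A", 3: "B", 4: "C"}
print(shard_labels([1, 2, 3], assignment))  # {'A': 2, 'B': 1}
```

A learned resource ranker can then be fit, per query, to reproduce this shard ordering from cheap features (e.g. term statistics or sample-run scores), which is what makes the approach suited to environments with many shards.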


ACM Transactions on Intelligent Systems and Technology | 2016

Using the Crowd to Improve Search Result Ranking and the Search Experience

Yubin Kim; Kevyn Collins-Thompson; Jaime Teevan

Despite technological advances, algorithmic search systems still have difficulty with complex or subtle information needs. For example, scenarios requiring deep semantic interpretation are a challenge for computers. People, on the other hand, are well suited to solving such problems. As a result, there is an opportunity for humans and computers to collaborate during the course of a search in a way that takes advantage of the unique abilities of each. While search tools that rely on human intervention will never be able to respond as quickly as current search engines do, recent research suggests that there are scenarios where a search engine could take more time if it resulted in a much better experience. This article explores how crowdsourcing can be used at query time to augment key stages of the search pipeline. We first explore the use of crowdsourcing to improve search result ranking. When the crowd is used to replace or augment traditional retrieval components such as query expansion and relevance scoring, we find that we can increase robustness against failure for query expansion and improve overall precision for results filtering. However, the gains that we observe are limited and unlikely to make up for the extra cost and time that the crowd requires. We then explore ways to incorporate the crowd into the search process that more drastically alter the overall experience. We find that using crowd workers to support rich query understanding and result processing appears to be a more worthwhile way to make use of the crowd during search. Our results confirm that crowdsourcing can positively impact the search experience but suggest that significant changes to the search process may be required for crowdsourcing to fulfill its potential in search systems.


International Conference on Data Engineering | 2010

ProbClean: A probabilistic duplicate detection system

George Beskales; Mohamed A. Soliman; Ihab F. Ilyas; Shai Ben-David; Yubin Kim

One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
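The "space of possible repairs" can be pictured by running a duplicate detector at several parameter settings, each setting yielding one repair. The naive enumeration below (single-link clustering over a string-similarity threshold; names and data are illustrative) is exactly what ProbClean avoids by encoding that space compactly:

```python
import difflib

def cluster(records, sim, threshold):
    """Single-link duplicate clustering: records are merged whenever
    their pairwise similarity reaches the threshold (union-find)."""
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if sim(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return sorted(map(tuple, groups.values()))

def possible_repairs(records, sim, thresholds):
    """One repair per parameter setting: the naive, materialized version
    of the space of repairs that ProbClean encodes compactly."""
    return {t: cluster(records, sim, t) for t in thresholds}

sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
records = ["John Smith", "Jon Smith", "Jane Doe"]
repairs = possible_repairs(records, sim, [0.9, 0.99])
print(len(repairs[0.9]), len(repairs[0.99]))  # 2 3
```

At threshold 0.9 the two Smith records are merged (one possible repair); at 0.99 nothing merges (another). No single threshold is "right", which is the motivation for keeping the whole space queryable.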


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2015

How Random Decisions Affect Selective Distributed Search

Zhuyun Dai; Yubin Kim; Jamie Callan

Selective distributed search is a retrieval architecture that reduces search costs by partitioning a corpus into topical shards such that only a few shards need to be searched for each query. Prior research created topical shards by using random seed documents to cluster a random sample of the full corpus. The resource selection algorithm might use a different random sample of the corpus. These random components make selective search non-deterministic. This paper studies how these random components affect experimental results. Experiments on two ClueWeb09 corpora and four query sets show that in spite of random components, selective search is stable for most queries.


International Conference on the Theory of Information Retrieval | 2018

Measuring the Effectiveness of Selective Search Index Partitions without Supervision

Yubin Kim; Jamie Callan

Selective search architectures partition a document collection into topic-oriented index shards, usually using algorithms that have random components. Different mappings of documents into index shards (shard maps) produce different search accuracy and consistency; however, identifying which shard maps will deliver the highest average effectiveness is an open problem. This paper presents a new metric, Area Under Recall Curve (AUReC), to evaluate and compare shard maps. AUReC is the first such metric that is independent of resource selection and shard cut-off estimation. It does not require an end-to-end evaluation or manual gold-standard judgements. Experiments show that its predictions are highly correlated with end-to-end evaluations of systems in various configurations, while being easier to implement and computationally inexpensive.
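The intuition behind a recall-based shard-map metric can be shown with a toy function: add shards best-first and measure the area under the resulting recall curve. This is a loose reading of AUReC for illustration only; the paper's definition differs in detail (in particular, it uses pseudo-relevant documents rather than true judgments):

```python
def aurec_toy(shard_docsets, relevant):
    """Area under the recall curve as shards are added best-first
    (shards ordered by how many relevant documents they contain).
    Higher values mean the shard map concentrates relevant documents
    into fewer shards, which favors selective search."""
    counts = sorted((len(s & relevant) for s in shard_docsets), reverse=True)
    recall, area = 0, 0.0
    for c in counts:
        recall += c
        area += recall / len(relevant)
    return area / len(shard_docsets)  # normalized to [0, 1]

relevant = {1, 2, 3, 4}
scattered = [{1, 2}, {3}, {4, 9}]       # relevant docs spread out
concentrated = [{1, 2, 3}, {4}, set()]  # relevant docs clustered
print(aurec_toy(scattered, relevant))     # 0.75
print(aurec_toy(concentrated, relevant))  # ~0.917
```

The concentrated map scores higher because recall climbs faster, matching the property that a good shard map lets a few shards cover most relevant documents — evaluated without any resource selection or shard cut-off in the loop.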


Symposium on Human-Computer Interaction and Information Retrieval | 2013

Slow Search: Information Retrieval without Time Constraints

Jaime Teevan; Kevyn Collins-Thompson; Ryen W. White; Susan T. Dumais; Yubin Kim


Text REtrieval Conference (TREC) | 2012

Overcoming Vocabulary Limitations in Twitter Microblogs

Yubin Kim; Reyyan Yeniterzi; Jamie Callan

Collaboration


Dive into Yubin Kim's collaborations.

Top Co-Authors

Jamie Callan
Carnegie Mellon University

Zhuyun Dai
Carnegie Mellon University

Reyyan Yeniterzi
Carnegie Mellon University