Shuming Shi | Researchain

Publication

Featured research published by Shuming Shi.

international world wide web conferences | 2007

Web object retrieval

Zaiqing Nie; Yunxiao Ma; Shuming Shi; Ji-Rong Wen; Wei-Ying Ma

The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the varying quality of Web sites and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performance. We conclude that the hybrid model is superior because it takes into account extraction errors at varying levels.

international acm sigir conference on research and development in information retrieval | 2005

Title extraction from bodies of HTML documents and its application to web page retrieval

Yunhua Hu; Guomao Xin; Ruihua Song; Guoping Hu; Shuming Shi; Yunbo Cao; Hang Li

This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1%-29.0% improvements).

conference on information and knowledge management | 2010

Efficient term proximity search with term-pair indexes

Hao Yan; Shuming Shi; Fan Zhang; Torsten Suel; Ji-Rong Wen

There has been a large amount of research on early termination techniques in web search and information retrieval. Such techniques return the top-k documents without scanning and evaluating the full inverted lists of the query terms. Thus, they can greatly improve query processing efficiency. However, only a limited amount of efficient top-k processing work considers the impact of term proximity, i.e., the distance between term occurrences in a document, which has recently been integrated into a number of retrieval models to improve effectiveness. In this paper, we propose new early termination techniques for efficient query processing for the case where term proximity is integrated into the retrieval model. We propose new index structures based on a term-pair index, and study new document retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation of our new techniques and compare them with the existing approaches. Experimental results on large-scale data sets show that our techniques can significantly improve the efficiency of query processing.
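
To make the notion of term proximity concrete, here is a minimal Python sketch that computes the smallest distance between occurrences of two terms in one document. It is only an illustration of the quantity described above, not the paper's term-pair index or its early termination strategy; the helper names `term_positions` and `min_pair_distance` are hypothetical.

```python
from typing import List, Optional, Sequence


def term_positions(tokens: Sequence[str], term: str) -> List[int]:
    """Return the (ascending) token offsets at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def min_pair_distance(
    positions_a: Sequence[int], positions_b: Sequence[int]
) -> Optional[int]:
    """Smallest absolute gap between an occurrence of A and one of B.

    Both position lists must be sorted ascending. Returns None when
    either term does not occur in the document.
    """
    if not positions_a or not positions_b:
        return None
    i = j = 0
    best = abs(positions_a[0] - positions_b[0])
    # Two-pointer merge: always advance the pointer at the smaller position.
    while i < len(positions_a) and j < len(positions_b):
        gap = positions_a[i] - positions_b[j]
        best = min(best, abs(gap))
        if gap < 0:
            i += 1
        else:
            j += 1
    return best


doc = "efficient term proximity search with term pair indexes".split()
print(min_pair_distance(term_positions(doc, "term"),
                        term_positions(doc, "indexes")))
```

A proximity-aware ranking function would typically turn such a distance into a score contribution; how that is done varies by retrieval model and is not specified here.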

international world wide web conferences | 2008

Improving relevance judgment of web search results with image excerpts

Zhiwei Li; Shuming Shi; Lei Zhang

Current web search engines return result pages containing mostly text summaries, even though the matched web pages may contain informative pictures. A text excerpt (i.e., snippet) is generated by selecting keywords around the matched query terms for each returned page to provide context for users' relevance judgments. However, in many scenarios, we found that the pictures in web pages, if selected properly, could be added to search result pages and provide a richer contextual description, because a picture is worth a thousand words. We call this new kind of summary an image excerpt. Through a well-designed user study, we demonstrate that image excerpts can help users make much quicker relevance judgments of search results for a wide range of query types. To implement this idea, we propose a practicable approach to automatically generate image excerpts in the result pages by considering the dominance of each picture in each web page and the relevance of the picture to the query. We also outline an efficient way to incorporate image excerpts in web search engines. Web search engines can adopt our approach by slightly modifying their index and inserting a few low-cost operations in their workflow. Our experiments on a large web dataset indicate that the performance of the proposed approach is very promising.

conference on information and knowledge management | 2008

Can phrase indexing help to process non-phrase queries?

Mingjie Zhu; Shuming Shi; Nenghai Yu; Ji-Rong Wen

Modern web search engines, while indexing billions of web pages, are expected to process queries and return results in a very short time. Many approaches have been proposed for efficiently computing top-k query results, but most of them ignore one key factor in the ranking functions of commercial search engines: term proximity, which measures the distance between query terms in a document. When term proximity is included in ranking functions, most of the existing top-k algorithms become inefficient. To address this problem, in this paper we propose to build a compact phrase index to speed up the search process when incorporating the term-proximity factor. The compact phrase index can help more accurately estimate the score upper bounds of unknown documents. The size of the phrase index is controlled by including only a small portion of phrases which are possibly helpful for improving search performance. Phrase indexes have been used to process phrase queries in existing work. It is, however, to the best of our knowledge, the first time that a phrase index is used to improve the performance of generic queries. Experimental results show that, compared with the state-of-the-art top-k computation approaches, our approach can reduce average query processing time to 1/5 for typical settings.

conference on information and knowledge management | 2007

Effective top-k computation in retrieving structured documents with term-proximity support

Mingjie Zhu; Shuming Shi; Mingjing Li; Ji-Rong Wen

Modern web search engines are expected to return top-k results efficiently given a query. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them tend to ignore some especially important factors in ranking functions, e.g., term proximity (the distance relationship between query terms in a document). The inclusion of term proximity breaks the monotonicity of ranking functions and therefore leads to additional challenges for efficient query processing. This paper studies the performance of some existing top-k computation approaches using term-proximity-enabled ranking functions. Our investigation demonstrates that, when term proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. Based on our analysis and experimental results, we propose two index structures and their corresponding index pruning strategies, Structured and Hybrid, which perform much better under the new settings. Moreover, the efficiency of index building and maintenance is not affected much by the two approaches.

empirical methods in natural language processing | 2015

Automatically Solving Number Word Problems by Semantic Parsing and Reasoning

Shuming Shi; Yuehui Wang; Chin-Yew Lin; Xiaojiang Liu; Yong Rui

This paper presents a semantic parsing and reasoning approach to automatically solving math word problems. A new meaning representation language is designed to bridge natural language text and math expressions. A CFG parser is implemented based on 9,600 semi-automatically created grammar rules. We conduct experiments on a test set of over 1,500 number word problems (i.e., verbally expressed number problems) and achieve 95.4% precision and 60.2% recall.

international acm sigir conference on research and development in information retrieval | 2005

Gravitation-based model for information retrieval

Shuming Shi; Ji-Rong Wen; Qing Yu; Ruihua Song; Wei-Ying Ma

This paper proposes GBM (gravitation-based model), a physical model for information retrieval inspired by Newton's theory of gravitation. A mapping is built in this model from concepts of information retrieval (documents, queries, relevance, etc.) to those of physics (mass, distance, radius, attractive force, etc.). This model provides a new perspective on IR problems. A family of effective term weighting functions can be derived from it, including the well-known BM25 formula. This model has some advantages over most existing ones: First, because it is directly based on basic physical laws, the derived formulas and algorithms have an explicit physical interpretation. Second, the ranking formulas derived from this model satisfy more intuitive heuristics than most existing ones, and thus have the potential to behave better empirically and to be used safely in various settings. Finally, a new approach for structured document retrieval derived from this model is more reasonable and behaves better than existing ones.
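
For readers unfamiliar with the BM25 formula mentioned above, the following Python sketch shows one common form of BM25 term weighting. It does not reproduce the gravitation-based derivation, which the abstract does not spell out. The function names, the defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are assumptions chosen for illustration.

```python
import math
from typing import Dict, Sequence


def bm25_term_weight(tf: int, doc_len: int, avg_doc_len: float,
                     num_docs: int, doc_freq: int,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 weight of one query term in one document.

    tf: occurrences of the term in the document
    doc_freq: number of documents in the collection containing the term
    """
    if tf <= 0:
        return 0.0
    # Assumed IDF variant that is never negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalised, controlled by b.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # Term frequency saturates as tf grows, controlled by k1.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


def bm25_score(query_terms: Sequence[str], doc_tokens: Sequence[str],
               avg_doc_len: float, num_docs: int,
               doc_freqs: Dict[str, int]) -> float:
    """Sum of BM25 term weights over the query terms."""
    return sum(
        bm25_term_weight(list(doc_tokens).count(term), len(doc_tokens),
                         avg_doc_len, num_docs, doc_freqs.get(term, 0))
        for term in query_terms
    )
```

The two tunable parameters capture the main behaviours of the formula: `k1` limits how much repeated occurrences of a term can add, and `b` sets how strongly document length is normalised.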

european conference on information retrieval | 2006

Exploring URL hit priors for web search

Ruihua Song; Guomao Xin; Shuming Shi; Ji-Rong Wen; Wei-Ying Ma

A URL usually contains meaningful information for measuring the relevance of a Web page to a query in Web search. Some existing works utilize URL depth priors (i.e., the probability of being a good page given the length and depth of a URL) for improving some types of Web search tasks. This paper suggests using the locations at which query terms occur in a URL to measure how well a web page matches a user's information need in web search. First, we define and estimate URL hit types, i.e., the prior probability of being a good answer given the type of query term hits in the URL. The main advantage of URL hit priors (over depth priors) is that they can achieve stable improvement for both informational and navigational queries. Second, an obstacle to exploiting such priors is that shortening and concatenation are frequently used in a URL. Our investigation shows that only 30% of URL hits are recognized by an ordinary word breaking approach. Thus we combine three methods to improve matching. Finally, the priors are integrated into the probabilistic model for enhancing web document retrieval. Our experiments were conducted using 7 query sets of TREC2002, TREC2003 and TREC2004, and show that the proposed approach is stable and improves retrieval effectiveness by 4%-11% for navigational queries and 10% for informational queries.
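
The obstacle described above, query terms hidden by concatenation in a URL, can be seen in a small Python sketch. It assumes that "ordinary word breaking" means splitting the URL on non-alphanumeric characters; the abstract does not define it, and the three matching methods the paper combines are not shown. The function names and the example URL are hypothetical.

```python
import re
from typing import Iterable, List, Set


def url_tokens(url: str) -> List[str]:
    """Break a URL into lowercase tokens at non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def url_hits(url: str, query_terms: Iterable[str]) -> Set[str]:
    """Query terms that appear as whole tokens of the URL."""
    tokens = set(url_tokens(url))
    return {term for term in query_terms if term.lower() in tokens}


url = "http://www.example.com/webtrack/named-page.html"
query = ["web", "track", "named", "page"]
# "named" and "page" are separated by a hyphen and are found;
# "web" and "track" are concatenated into "webtrack" and are missed.
print(sorted(url_hits(url, query)))
```

Recovering the missed terms would need some form of substring or dictionary-based segmentation, which trades extra matches against false hits.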

web search and data mining | 2010

Revisiting globally sorted indexes for efficient document retrieval

Fan Zhang; Shuming Shi; Hao Yan; Ji-Rong Wen

There has been a large amount of research on efficient document retrieval in both the IR and web search areas. One important technique to improve retrieval efficiency is early termination, which speeds up query processing by avoiding scanning the entire inverted lists. Most early termination techniques first build new inverted indexes by sorting the inverted lists in the order of either term-dependent information, e.g., term frequencies or term IR scores, or term-independent information, e.g., the static rank of the document, and then apply appropriate retrieval strategies on the resulting indexes. Although methods based only on static rank have been shown to be ineffective for early termination, there are still many advantages to using methods based on term-independent information. In this paper, we propose new techniques to organize inverted indexes based on term-independent information beyond static rank and study new retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation of our new techniques and compare them with the existing approaches. Our results on the TREC GOV and GOV2 data sets show that our techniques can improve query efficiency significantly.
