In-Su Kang
Pohang University of Science and Technology
Publication
Featured research published by In-Su Kang.
Information Processing and Management | 2009
In-Su Kang; Seung-Hoon Na; Seungwoo Lee; Hanmin Jung; Pyung Kim; Won-Kyung Sung; Jong-Hyeok Lee
Author name disambiguation deals with clustering same-name authors into distinct individuals. To attack the problem, many studies have employed a variety of disambiguation features such as coauthors, titles of papers/publications, topics of articles, emails/affiliations, etc. Among these, co-authorship is the most easily accessible and influential, since inter-person acquaintances represented by co-authorship can discriminate the identities of authors more clearly than other features. This study attempts to explore the net effect of co-authorship on author clustering in bibliographic data. First, to handle the shortage of explicit coauthors listed in known citations, a web-assisted technique is proposed for acquiring implicit coauthors of the target author to be disambiguated. Then, the coauthor disambiguation hypothesis, that the identity of an author can be determined by his/her coauthors, is examined and confirmed through a variety of author disambiguation experiments.
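The abstract does not give the exact clustering algorithm, but the coauthor disambiguation hypothesis can be illustrated with a minimal sketch: greedy single-link clustering of same-name citation records by coauthor-set overlap. The Jaccard similarity and the threshold value are illustrative assumptions, not taken from the paper.

```python
def jaccard(a, b):
    """Coauthor-set similarity between two citation records."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_by_coauthors(records, threshold=0.2):
    """Greedy single-link clustering of same-name author records.

    records: one coauthor-name list per citation. Two clusters are
    merged when any record pair across them shares enough coauthors
    (Jaccard >= threshold). A sketch only, not the paper's method.
    """
    clusters = [[r] for r in records]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(jaccard(x, y) >= threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Records sharing a coauthor end up in one cluster; records with disjoint coauthor sets stay apart, which is exactly the discriminative effect the hypothesis relies on.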
Information Processing and Management | 2007
In-Su Kang; Seung-Hoon Na; Jungi Kim; Jong-Hyeok Lee
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues to the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics make it necessary to reassess existing retrieval techniques, which have mainly been developed for short, structureless documents such as newspaper articles. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters provide additional evidence for matching users' information needs. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents carry manually assigned cluster information: International Patent Classification (IPC) codes. The IPC is a standard taxonomy for classifying patents and currently has about 69,000 nodes organized into a five-level hierarchy. Patent documents therefore provide an excellent test bed for developing and evaluating cluster-based retrieval techniques. Experiments on the NTCIR-4 patent collection showed that a cluster-based language model can help improve the cluster-less baseline language model.
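A common way to realize a cluster-based language model, and one plausible reading of the approach above, is to interpolate the document model with the model of its manually assigned (IPC) cluster and with the collection model. The mixture weights and the exact estimator below are assumptions for illustration; the paper's formulation may differ.

```python
import math
from collections import Counter

def cluster_lm_score(query, doc, cluster, collection,
                     lam_d=0.6, lam_c=0.3):
    """Query log-likelihood under a cluster-smoothed language model.

    Three-way mixture of document, cluster, and collection models;
    add-one smoothing on the collection model avoids log(0).
    """
    d, c, coll = Counter(doc), Counter(cluster), Counter(collection)
    nd, nc, ncoll = sum(d.values()), sum(c.values()), sum(coll.values())
    score = 0.0
    for w in query:
        p = (lam_d * d[w] / nd
             + lam_c * c[w] / nc
             + (1 - lam_d - lam_c) * (coll[w] + 1) / (ncoll + len(coll)))
        score += math.log(p)
    return score
```

The cluster term lets a short patent borrow probability mass from the vocabulary of its IPC class, which is the extra evidence cluster-based retrieval counts on.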
Information Processing and Management | 2011
In-Su Kang; Pyung Kim; Seungwoo Lee; Hanmin Jung; Beom-Jong You
Author disambiguation resolves same-name author occurrences in bibliographic data into namesakes. This enables author-centered searches and high-quality social network analysis. To promote research in author disambiguation, KISTI has constructed a new large-scale test set for this field. This article describes its semi-manual creation procedure and its characteristics, especially in terms of author ambiguity and name diversity. In addition, the baseline performance of author clustering on the test set is provided.
european conference on information retrieval | 2008
Seung-Hoon Na; In-Su Kang; Jong-Hyeok Lee
Term frequency normalization is a serious issue since document lengths vary widely. Generally, documents become long for two different reasons: verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned using terms related to that topic, so term frequency is higher than in a well-summarized document. Second, multi-topicality indicates that a document broadly discusses multiple topics rather than a single topic. Although these document characteristics should be handled differently, all previous term frequency normalization methods have ignored the distinction and used a simplified length-driven approach that decreases term frequency based only on document length, causing unreasonable penalization. To attack this problem, we propose a novel TF normalization method, a type of partially axiomatic approach. We first formulate two formal constraints that a retrieval model should satisfy for documents exhibiting verbosity and multi-topicality, respectively. Then, we modify language modeling approaches to better satisfy these two constraints, and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries and substantially improves MAP (Mean Average Precision) for verbose queries.
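To see the problem the paper targets, consider a purely length-driven scheme such as pivoted length normalization: it penalizes a verbose document and a multi-topic document of equal length identically, even though by the constraints above they should be treated differently. The sketch below (parameter values arbitrary) only demonstrates that limitation; it is not the paper's proposed method.

```python
def pivoted_tf(tf, doc_len, avg_len, s=0.2):
    """Purely length-driven TF normalization (pivoted normalization).

    Raw term frequency is divided by a factor that depends on
    document length alone, so the scheme cannot distinguish why a
    document is long.
    """
    return tf / (1 - s + s * doc_len / avg_len)

# A verbose document (one topic repeated) and a multi-topic document
# of the same length receive exactly the same penalty.
verbose_norm = pivoted_tf(tf=8, doc_len=400, avg_len=200)
multitopic_norm = pivoted_tf(tf=8, doc_len=400, avg_len=200)
```

Any normalization that is a function of (tf, doc_len) only will collapse the two cases, which is precisely why the paper introduces separate constraints for verbosity and multi-topicality.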
asia information retrieval symposium | 2008
Seung-Hoon Na; In-Su Kang; Yeha Lee; Jong-Hyeok Lee
Passage retrieval has been expected to be an alternative way to resolve the length-normalization problem, since passages have more uniform lengths and topics than documents. An important issue in passage retrieval is determining the type of passage. Among several passage types, the arbitrary passage type, which varies dynamically according to the query, has shown the best performance. However, the previous arbitrary passage type is not fully general, since it still imposes a fixed-length restriction such as n consecutive words. This paper proposes a new type of passage, the completely-arbitrary passage, obtained by eliminating all restrictions on both passage length and starting position, thereby maximally relaxing the original arbitrary passage type. The main advantage of completely-arbitrary passages is that the proximity of query terms is well supported in passage retrieval, which non-completely-arbitrary passages cannot clearly support. Extensive experimental results show that passage retrieval using completely-arbitrary passages significantly improves over document retrieval, as well as over passage retrieval using previous non-completely-arbitrary passages, on six standard TREC test collections, in the context of language modeling approaches.
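A minimal sketch of the completely-arbitrary passage idea: with no restriction on length or starting position, one may consider every (start, end) span, for instance selecting the smallest span covering all query terms, which directly captures query-term proximity. The paper's actual scoring function is more elaborate; the helper below is illustrative only.

```python
def best_passage(doc_terms, query_terms):
    """Smallest contiguous span of the document covering all query
    terms, with no length or starting-position restriction.

    Returns (start, end), inclusive, or None if some query term is
    absent from the document.
    """
    q = set(query_terms)
    best = None
    for i, t in enumerate(doc_terms):
        if t not in q:
            continue
        seen = set()
        for j in range(i, len(doc_terms)):
            if doc_terms[j] in q:
                seen.add(doc_terms[j])
            if seen == q:
                if best is None or j - i < best[1] - best[0]:
                    best = (i, j)
                break
    return best
```

A fixed-window passage of n words may be forced to include unrelated text or to split a tight cluster of query terms; the unrestricted span above shrinks exactly to the region where the query terms co-occur.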
asia information retrieval symposium | 2004
Jung-Min Lim; In-Su Kang; Jae-Hak J. Bae; Jong-Hyeok Lee
In multi-document summarization (MDS), especially for time-dependent documents, humans tend to select sentences in time sequence. Based on this insight, we use time features to separate documents and to score sentences when determining the most important ones. We implemented and compared two systems, one using time features and one not. In an evaluation on 29 news-article document sets, the test method using time features turned out to be more effective and precise than the control system.
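A toy rendition of the time-feature idea above: order candidate sentences by article publication date and within-article position, then take the top-ranked ones. The paper's actual scoring features are richer; the function and its inputs here are illustrative assumptions.

```python
from datetime import date

def rank_sentences(docs, top_n=3):
    """Time-aware sentence selection for multi-document news
    summarization: earlier-published, earlier-positioned sentences
    rank first (a simplification of the paper's scoring).

    docs: list of (publication_date, [sentences]) pairs.
    """
    scored = []
    for pub_date, sentences in docs:
        for pos, sent in enumerate(sentences):
            # Earlier date, then earlier position, gives a smaller
            # (better) sort key.
            scored.append(((pub_date, pos), sent))
    scored.sort(key=lambda x: x[0])
    return [s for _, s in scored[:top_n]]
```

This mirrors the observation that human summarizers of evolving news stories tend to pick sentences in chronological order of the underlying events.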
applications of natural language to data bases | 2004
In-Su Kang; Seung-Hoon Na; Jong-Hyeok Lee; Gijoo Yang
Most natural language database interfaces suffer from the translation-knowledge portability problem, and are vulnerable to ill-formed questions because of their reliance on deep analysis. To alleviate these problems, this paper proposes a lightweight approach to natural language interfaces, in which translation knowledge is semi-automatically acquired and user questions are only syntactically analyzed. To acquire translation knowledge, a target database is first reverse-engineered into a physical database schema, on which domain experts annotate linguistic descriptions to produce a pER (physically-derived Entity-Relationship) schema. Next, initial translation knowledge is automatically extracted from the pER schema, and then extended with synonyms from lexical databases. At question-answering time, this semi-automatically constructed translation knowledge is used to resolve translation ambiguities.
cyberworlds | 2002
In-Su Kang; JaeHak J. Bae; Jong-Hyeok Lee
In natural language database interfaces (NLDBI), manual construction of translation knowledge normally undermines domain portability because of the expensive human intervention it requires. To overcome this, this paper proposes linguistically motivated database semantics for constructing translation knowledge systematically. The database semantics consists of two structures, designed to function as a translation dictionary and to encode selectional-restriction constraints on domain classes; it is semi-automatically obtained from a semantic data model of the target database. Based on this database semantics, a conceptual NLDBI translation scheme is developed. Translating a natural language question into a database query suffers from the translation ambiguity problem, which has not received significant attention in this field. To deal with translation ambiguity, we suggest an ambiguity resolution method based on the proposed database semantics. Experiments showed the usefulness of the proposed database semantics and its disambiguation procedures.
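As an illustrative sketch of translation-dictionary lookup with a simple ambiguity-resolution rule, loosely in the spirit of the selectional restrictions mentioned above: ambiguous question terms are resolved toward tables that unambiguous terms already point at. The table and column names, and the resolution rule itself, are assumptions, not the papers' actual knowledge structures.

```python
def translate(question_terms, lexicon):
    """Map question keywords to schema elements via a translation
    lexicon, then resolve ambiguity by preferring candidates from
    tables anchored by unambiguous terms.

    lexicon: term -> list of (table, column) candidates.
    Returns term -> single (table, column), or None if unknown.
    """
    mapping = {t: lexicon.get(t, []) for t in question_terms}
    # Tables that some unambiguous term maps into.
    anchor_tables = {c[0] for cands in mapping.values()
                     if len(cands) == 1 for c in cands}
    resolved = {}
    for t, cands in mapping.items():
        if len(cands) <= 1:
            resolved[t] = cands[0] if cands else None
        else:
            preferred = [c for c in cands if c[0] in anchor_tables]
            resolved[t] = (preferred or cands)[0]
    return resolved
```

The point is that disambiguation can be driven by the same semi-automatically built knowledge used for translation, rather than by deep linguistic analysis of the question.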
asia information retrieval symposium | 2005
Seung-Hoon Na; In-Su Kang; Ji-Eun Roh; Jong-Hyeok Lee
In information retrieval, the word-mismatch problem is a critical issue. To resolve it, several techniques have been developed, such as query expansion, cluster-based retrieval, and dimensionality reduction. Of these, this paper presents an empirical study of query expansion and cluster-based retrieval. We examine the effect of parsimony in query expansion and the effect of clustering algorithms in cluster-based retrieval. In addition, query expansion and cluster-based retrieval are compared, and their combinations are evaluated in terms of retrieval performance. Through experiments on seven test collections from NTCIR and TREC, we conclude that 1) query expansion using parsimony performs well, 2) cluster-based retrieval with agglomerative clustering is better than with partitioning clustering, 3) query expansion is generally more effective than cluster-based retrieval in resolving the word-mismatch problem, and 4) their combinations are effective when each method significantly improves on the baseline.
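A rough sketch of query expansion with a parsimony-like filter: candidate terms that are frequent in the feedback documents but common in the collection are discounted. Parsimonious language models are typically estimated with EM; the subtraction-based weighting below is only a crude stand-in for that idea, with all details assumed for illustration.

```python
from collections import Counter

def expand_query(query, feedback_docs, collection, k=5):
    """Pseudo-relevance-feedback query expansion with a crude
    parsimony filter: terms are weighted by how much more frequent
    they are in the feedback documents than in the collection, so
    generic terms are suppressed.
    """
    fb = Counter(t for d in feedback_docs for t in d)
    coll = Counter(collection)
    n_fb, n_coll = sum(fb.values()), sum(coll.values())

    def weight(t):
        # Positive only for terms over-represented in the feedback set.
        return fb[t] / n_fb - coll[t] / n_coll

    candidates = [t for t in fb if t not in query]
    candidates.sort(key=weight, reverse=True)
    return list(query) + candidates[:k]
```

Without the collection-frequency discount, high-frequency function words from the feedback documents would dominate the expansion, which is the failure mode parsimony is meant to prevent.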
asia information retrieval symposium | 2008
Seung-Hoon Na; In-Su Kang; Yeha Lee; Jong-Hyeok Lee
Unlike traditional document-level feedback, passage-level feedback restricts the context for selecting relevant terms to a passage in a document rather than the entire document. It can thus avoid selecting non-relevant terms from non-relevant parts of a document. The most recent work on passage-level feedback has investigated the fixed-window type of passage. However, fixed-window passages limit the optimization of passage-level feedback, since they include query-independent portions. To minimize the query-independence of the passage, this paper proposes a new type of passage, called the completely-arbitrary passage. Based on this, we devise a novel two-stage passage feedback, consisting of passage retrieval and passage extension as sub-steps, unlike previous single-stage passage feedback that relies only on passage retrieval. Experimental results show that the proposed two-stage passage-level feedback improves over document-level feedback significantly more than single-stage passage feedback using fixed-window passages does.
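The core of passage-level feedback, restricting the source of expansion terms to a retrieved passage rather than the whole document, can be sketched as follows. The term-selection rule here (raw passage frequency) is a simplification of what the paper actually uses, and the span input is assumed to come from a prior passage-retrieval step.

```python
from collections import Counter

def passage_feedback_terms(doc_terms, span, k=3):
    """Select candidate feedback terms from a retrieved passage only.

    doc_terms: tokenized document; span: (start, end) of the
    retrieved passage, inclusive. Terms outside the passage, i.e.
    from potentially non-relevant parts of the document, cannot
    contribute to feedback.
    """
    start, end = span
    passage = doc_terms[start:end + 1]
    return [t for t, _ in Counter(passage).most_common(k)]
```

In the two-stage scheme described above, a passage-extension step would then widen the span before term selection; the function itself stays the same, only the span changes.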