Xinying Song | Researchain

Publication

Featured research published by Xinying Song.

web search and data mining | 2013

What's in a name?: an unsupervised approach to link users across communities

Jing Liu; Fan Zhang; Xinying Song; Young-In Song; Chin-Yew Lin; Hsiao-Wuen Hon

In this paper, we consider the problem of linking users across multiple online communities. Specifically, we focus on the alias-disambiguation step of this user linking task, which is meant to differentiate users with the same usernames. We start by quantitatively analyzing the importance of the alias-disambiguation step through a survey of 153 volunteers and an experimental analysis on a large About.me dataset (75,472 users). The analysis shows that the alias-disambiguation solution can address a major part of the user linking problem in terms of the coverage of true pairwise decisions (46.8%). To the best of our knowledge, this is the first study on human behaviors with regard to the use of online usernames. We then cast the alias-disambiguation step as a pairwise classification problem and propose a novel unsupervised approach. The key idea of our approach is to automatically label training instances based on two observations: (a) rare usernames are likely owned by a single natural person, e.g. pennystar88 as a positive instance; (b) common usernames are likely owned by different natural persons, e.g. tank as a negative instance. We propose using the n-gram probabilities of usernames to estimate the rareness or commonness of usernames. Moreover, these two observations are verified using the Yahoo! Answers dataset. The empirical evaluations on 53 forums verify: (a) the effectiveness of the classifiers with the automatically generated training data and (b) that the rareness and commonness of usernames can help user linking. We also analyze the cases where the classifiers fail.
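The idea of scoring username rareness with n-gram probabilities can be illustrated with a minimal sketch. The abstract does not state the n-gram order, the smoothing method, or the training corpus, so the character bigram model, add-one smoothing, boundary markers, and function names below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # hypothetical start/end markers, not from the paper


def train_bigram(usernames):
    """Count character bigrams and their left contexts over a list of usernames."""
    bigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for name in usernames:
        chars = [BOS] + list(name.lower()) + [EOS]
        vocab.update(chars[1:])
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def log_prob(name, model):
    """Add-one smoothed log-probability of a username; lower means rarer."""
    bigrams, contexts, vocab_size = model
    chars = [BOS] + list(name.lower()) + [EOS]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )


# Toy corpus (made up for this example).
model = train_bigram(["tank", "tank", "tom", "tim", "tanker"])
print(log_prob("tank", model) > log_prob("pennystar88", model))
```

Under these assumptions, a username with a low score would be treated as rare (a likely positive instance) and one with a high score as common (a likely negative instance). The raw score favors short names, so a practical variant might normalize by length.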

conference on information and knowledge management | 2010

Automatic extraction of web data records containing user-generated content

Xinying Song; Jing Liu; Yunbo Cao; Chin-Yew Lin; Hsiao-Wuen Hon

In this paper, we are concerned with the problem of automatically extracting web data records that contain user-generated content (UGC). In previous work, web data records are usually assumed to be well-formed with a limited amount of UGC, and thus can be extracted by testing repetitive structure similarity. However, when a web data record includes a large portion of free-format UGC, the similarity test between records may fail, which in turn results in lower performance. In our work, we find that certain domain constraints (e.g., post-date) can be used to design better similarity measures capable of circumventing the influence of UGC. In addition, we use anchor points provided by the domain constraints to improve the extraction process, which results in an algorithm called MiBAT (Mining data records Based on Anchor Trees). We conduct extensive experiments on a dataset consisting of forum thread pages collected from 307 sites that cover 219 different forum software packages. Our approach achieves a precision of 98.9% and a recall of 97.3% with respect to post record extraction. At the page level, it perfectly handles 91.7% of pages without extracting any wrong posts or missing any golden posts. We also apply our approach to comment extraction and achieve good results as well.

empirical methods in natural language processing | 2008

Better Binarization for the CKY Parsing

Xinying Song; Shilin Ding; Chin-Yew Lin

We present a study on how grammar binarization empirically affects the efficiency of CKY parsing. We argue that binarizations affect parsing efficiency primarily by affecting the number of incomplete constituents generated, and that the effectiveness of binarization also depends on the nature of the input. We propose a novel binarization method utilizing rich information learnt from a training corpus. Experimental results not only show that different binarizations have a great impact on parsing efficiency, but also confirm that our learnt binarization outperforms other existing methods. Furthermore, we show that it is feasible to combine existing parsing speed-up techniques with our binarization to achieve even better performance.
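To make the notion of binarization concrete, the sketch below splits one n-ary grammar rule into binary rules in two standard ways, left and right. It is not the learnt binarization proposed in the paper; the function name, the rule representation, and the `<...>` naming of intermediate symbols are assumptions for illustration. The intermediate symbols correspond to the incomplete constituents a CKY parser would build.

```python
def binarize(lhs, rhs, direction="right"):
    """Split the rule lhs -> rhs into binary rules via intermediate symbols.

    direction="right" groups the trailing symbols first;
    direction="left" groups the leading symbols first.
    """
    rules = []
    rhs = list(rhs)
    current = lhs
    while len(rhs) > 2:
        if direction == "right":
            head, rest = rhs[0], rhs[1:]
            new = "<" + "-".join(rest) + ">"  # hypothetical naming scheme
            rules.append((current, (head, new)))
        else:
            rest, last = rhs[:-1], rhs[-1]
            new = "<" + "-".join(rest) + ">"
            rules.append((current, (new, last)))
        rhs = rest
        current = new
    rules.append((current, tuple(rhs)))
    return rules


for direction in ("right", "left"):
    print(direction, binarize("NP", ["DT", "JJ", "NN"], direction))
```

For the same rule, the two directions introduce different intermediate symbols (`<JJ-NN>` versus `<DT-JJ>`), so a parser may generate a different number of incomplete constituents depending on which spans actually occur in the input.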

conference on information and knowledge management | 2013

Identifying salient entities in web pages

Michael Gamon; Tae Yano; Xinying Song; Johnson Apacible; Patrick Pantel

We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of entities is central to the document, which can lead to degraded relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To cost-effectively train the models, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.

conference on information and knowledge management | 2012

An unsupervised method for author extraction from web pages containing user-generated content

Jing Liu; Xinying Song; Jingtian Jiang; Chin-Yew Lin

In this paper, we address the problem of author extraction (AE) from user-generated content (UGC) pages. Most existing solutions for web information extraction, including AE, adopt supervised approaches, which require expensive manual annotation. We propose a novel unsupervised approach for automatically collecting and labeling training data based on two key observations about author names: (1) people tend to use a single name across sites if their preferred names are available; (2) people tend to create unique usernames to easily distinguish themselves from others, e.g. travelbug61. Our AE solution only requires features extracted from a single UGC page instead of relying on clues from multiple UGC pages. We conducted extensive experiments. (1) The evaluation of automatically labeled author field data shows 95.0% precision. (2) Our method achieves an F1 score of 96.1%, which significantly outperforms a state-of-the-art supervised approach with single-page features (F1 score: 68.4%) and is comparable to its multiple-page solution (F1 score: 95.4%). (3) We also examine the robustness of our approach on various UGC pages from forums and review sites, and achieve promising results as well.

conference on information and knowledge management | 2017

Deep Context Modeling for Web Query Entity Disambiguation

Zhen Liao; Xinying Song; Yelong Shen; Saekoo Lee; Jianfeng Gao; Ciya Liao

In this paper, we present a new study of Web query entity disambiguation (QED), which is the task of disambiguating different candidate entities in a knowledge base given their mentions in a query. QED is particularly challenging because queries are often too short to provide the rich contextual information required by traditional entity disambiguation methods. We propose several methods to tackle the problem of QED. First, we explore the use of a deep neural network (DNN) for capturing the character-level textual information in queries. Our DNN approach maps queries and their candidate reference entities to feature vectors in a latent semantic space where the distance between a query and its correct reference entity is minimized. Second, we utilize the Web search result information of queries to help generate large amounts of weakly supervised training data for the DNN model. Third, we propose a two-stage training method to combine large-scale weakly supervised data with a small amount of human-labeled data, which can significantly boost the performance of a DNN model. The effectiveness of our approach is demonstrated in experiments using large-scale real-world datasets.

IEEE Transactions on Knowledge and Data Engineering | 2013