Weimao Ke | Researchain

Publication

Featured research published by Weimao Ke.

International Journal on Digital Libraries | 2015

Information-theoretic term weighting schemes for document clustering and classification

Weimao Ke

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering and classification. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed least information theory provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities for document representation: (1) LI Binary (LIB) which quantifies information due to the observation of a term’s (binary) occurrence in a document; and (2) LI Frequency (LIF) which measures information for the observation of a randomly picked term from the document. The two quantities are computed based on terms’ prior distributions in the entire collection and posterior distributions in a document. LIB and LIF can be used individually or combined to represent documents for text clustering and classification. Experiments on four benchmark text collections demonstrate strong performances of the proposed methods compared to classic TF*IDF. Particularly, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering and classification.
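The abstract does not state the LIB and LIF formulas, so they are not reproduced here. For orientation, the classic TF*IDF baseline that the proposed schemes are compared against can be sketched as below. This is a minimal illustration, assuming raw term counts for TF and log(N/df) for IDF, which is one common variant and not necessarily the one used in the paper; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection (illustrative only).
docs = [
    ["entropy", "term", "weighting"],
    ["term", "clustering", "clustering"],
    ["term", "classification"],
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "term" here, receives zero weight under this variant, while a term concentrated in one document is weighted up by both its count and its rarity.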

IEEE Transactions on Nanobioscience | 2017

BalanceAli: Multiple PPI Network Alignment With Balanced High Coverage and Consistency

Jianliang Gao; Bo Song; Weimao Ke; Xiaohua Hu

Coverage and consistency are the two most commonly used metrics for evaluating the effectiveness of network alignment, but they are contradictory in protein-protein interaction (PPI) network alignment: it is difficult, if not impossible, to achieve high coverage and high consistency simultaneously. Furthermore, existing methods of multiple PPI network alignment mostly ignore k-coverage or k-consistency, where k indicates the number of aligned species. In this paper, we propose BalanceAli, a novel approach for global alignment of multiple PPI networks that achieves high k-coverage and k-consistency simultaneously. With six data sets consisting of various numbers of PPI networks from five species, we evaluate the experimental results using different k values. Performance evaluations against three other state-of-the-art methods demonstrate the overall strength of our approach.

ACM/IEEE Joint Conference on Digital Libraries | 2013

Information-theoretic term weighting schemes for document clustering

Weimao Ke

We propose a new theory to quantify information in probability distributions and derive a new document representation model for text clustering. By extending Shannon entropy to accommodate a non-linear relation between information and uncertainty, the proposed Least Information theory (LIT) provides insight into how terms can be weighted based on their probability distributions in documents vs. in the collection. We derive two basic quantities in the document clustering context: 1) LI Binary (LIB) which quantifies information due to the observation of a term's (binary) occurrence in a document; and 2) LI Frequency (LIF) which measures information for the observation of a randomly picked term from the document. Both quantities are computed given term distributions in the document collection as prior knowledge and can be used separately or combined to represent documents for text clustering. Experiments on four benchmark text collections demonstrate strong performances of the proposed methods compared to classic TF*IDF. Particularly, the LIB*LIF weighting scheme, which combines LIB and LIF, consistently outperforms TF*IDF in terms of multiple evaluation metrics. The least information measure has a potentially broad range of applications beyond text clustering.

ACM/IEEE Joint Conference on Digital Libraries | 2013

Interactive search result clustering: a study of user behavior and retrieval effectiveness

Xuemei Gong; Weimao Ke; Yan Zhang; Ramona Broussard

Scatter/Gather is a document browsing and information retrieval method based on document clustering. It is designed to facilitate user articulation of information needs through iterative clustering and interactive browsing. This paper reports on a study that investigated the effectiveness of Scatter/Gather browsing for information retrieval. We conducted a within-subject user study of 24 college students to investigate the utility of a Scatter/Gather system, to examine its strengths and weaknesses, and to receive feedback from users on the system. Results show that the clustering-based Scatter/Gather method was more difficult to use than the classic information retrieval systems in terms of user perception. However, clustering helped the subjects accomplish the tasks more efficiently. Scatter/Gather clustering was particularly useful in helping users finish tasks that they were less familiar with and allowed them to search with fewer words. Scatter/Gather tended to be more useful when it was more difficult for the user to do query specification for an information need. Topic familiarity and specificity had significant influences on user perceived retrieval effectiveness. The influences appeared to be greater with the Scatter/Gather system compared to a classic search system. Topic familiarity also had significant influences on query formulation.

ACM Transactions on Information Systems | 2013

Studying the clustering paradox and scalability of search in highly distributed environments

Weimao Ke; Javed Mostafa

With the ubiquitous production, distribution and consumption of information, today's digital environments such as the Web are increasingly large and decentralized. It is hardly possible to obtain central control over information collections and systems in these environments. Searching for information in these information spaces has brought about problems beyond traditional boundaries of information retrieval (IR) research. This article addresses one important aspect of scalability challenges facing information retrieval models and investigates a decentralized, organic view of information systems pertaining to search in large-scale networks. Drawing on observations from earlier studies, we conduct a series of experiments on decentralized searches in large-scale networked information spaces. Results show that how distributed systems interconnect is crucial to retrieval performance and scalability of searching. Particularly, in various experimental settings and retrieval tasks, we find a consistent phenomenon, namely, the Clustering Paradox, in which the level of network clustering (semantic overlay) imposes a scalability limit. Scalable searches are well supported by a specific, balanced level of network clustering emerging from local system interconnectivity. Departure from that level, either stronger or weaker clustering, leads to search performance degradation, which is dramatic in large-scale networks.

International Conference on Big Data | 2016

Scalability analysis of distributed search in large peer-to-peer networks

Weimao Ke; Javed Mostafa

We study decentralized searches in large-scale, self-organized peer-to-peer networks and investigate the influences of network size and degree distribution (neighborhood size) on search efficiency. Experimental results show that searches are efficient and scalable in large networks, especially with large neighborhood sizes (degrees). Analysis of the data supports a proposed scalability model, in which search path length L (efficiency) is proportional to a poly-logarithmic function of network size N, with degree dm (majority neighborhood size) as the log base. The model explains 90% (R²) of the variance in search path lengths. Search time (search path length) predicted by the model shows great potential for efficient searches in real-scale networks of up to a billion distributed systems.
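The shape of the scalability model can be illustrated with a short sketch. The abstract gives only the general form (path length L proportional to a poly-logarithmic function of N with the degree as log base), so the constant `c` and exponent `p` below are hypothetical placeholders, not fitted values from the paper, and `d_m` stands for the degree dm.

```python
import math

def predicted_path_length(n, d_m, c=1.0, p=1.0):
    """Poly-logarithmic sketch: L = c * (log base d_m of n) ** p.

    c and p are illustrative assumptions; the paper's fitted
    coefficients are not given in the abstract.
    """
    if n < 1 or d_m <= 1:
        raise ValueError("need n >= 1 and d_m > 1")
    return c * (math.log(n) / math.log(d_m)) ** p

# Compare neighborhood sizes for a network of one billion systems.
for d_m in (10, 100, 1000):
    print(d_m, round(predicted_path_length(1e9, d_m), 2))
```

Under this form, enlarging the neighborhood size shrinks the predicted path length for a fixed network size, which matches the abstract's observation that searches scale better with larger degrees.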

Bioinformatics and Biomedicine | 2016

Achieving high k-coverage and k-consistency in global alignment of multiple PPI networks

Bo Song; Jianliang Gao; Weimao Ke; Xiaohua Hu

Alignment among protein-protein interaction (PPI) networks greatly benefits biological research, as it helps uncover important information such as evolutionarily conserved pathways, protein complexes and functional orthologs. Global alignment of multiple PPI networks aims at clustering functionally conserved proteins across different species, and most traditional methods attempt to achieve results with high overall coverage and consistency. However, little attention has been paid to the deeper-level criteria of k-coverage and especially k-consistency, which we additionally use for evaluation, where k indicates the number of species that the proteins in a cluster belong to. In this paper, we propose a novel approach for global alignment of multiple PPI networks that achieves high k-coverage and k-consistency simultaneously, in addition to meeting conventional criteria. Evaluations on six datasets composed of various numbers of PPI networks from five species demonstrate the overall strength of our approach against three other state-of-the-art methods.
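The role of k can be made concrete with a small sketch that groups alignment clusters by the number of distinct species they span. The abstract does not give the exact definitions of k-coverage and k-consistency, so this only shows the grouping step that such per-k metrics would build on; the function `clusters_by_k`, the protein identifiers and the species mapping are all hypothetical.

```python
from collections import Counter

def clusters_by_k(clusters, species_of):
    """Count clusters by k, the number of distinct species they span."""
    return Counter(
        len({species_of[protein] for protein in cluster})
        for cluster in clusters
    )

# Toy data (illustrative only): protein id -> species.
species_of = {
    "h1": "human", "h2": "human",
    "m1": "mouse", "m2": "mouse",
    "y1": "yeast", "f1": "fly",
}
clusters = [{"h1", "m1", "y1"}, {"h2", "m2"}, {"f1"}]
counts = clusters_by_k(clusters, species_of)
```

Here `counts[3]` is the number of clusters spanning three species, `counts[2]` the number spanning two, and so on. A per-k metric would then be evaluated within each of these groups.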

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018

Computational Surprise in Information Retrieval

Xi Niu; Wlodek Zadrozny; Kazjon Grace; Weimao Ke

The concept of surprise is central to human learning and development. However, compared to accuracy, surprise has received little attention in the IR community, yet it is an essential component of the information seeking process. This workshop brings together researchers and practitioners of IR to discuss the topic of computational surprise, to set a research agenda, and to examine how to build datasets for research into this fascinating topic. The themes in this workshop include discussion of what can be learned from some well-known surprise models in other fields, such as Bayesian surprise; how to evaluate surprise based on user experience; and how computational surprise is related to the newly emerging areas, such as fake news detection, computational contradiction, clickbait detection, etc.

International Conference on the Theory of Information Retrieval | 2017

Text Retrieval based on Least Information Measurement

Weimao Ke

We developed a new information retrieval framework based on the Least Information (LI) metric. We derived multiple term weighting schemes and combined them with a vector space representation for ad hoc retrieval. Given probability distributions in a collection as prior knowledge, LI Binary (LIB) quantifies least information due to the binary occurrence of a term in a document whereas LI Frequency (LIF) measures least information based on the probability of drawing a term from a bag of words. Experiments on four benchmark TREC collections for ad hoc retrieval showed that LIT-based methods achieved superior performances compared to classic TF*IDF and BM25, especially for verbose queries and hard search topics. The least information theory is a method for entropy-based information measurement and offers a novel approach for IR modeling.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2017

Counter Deanonymization Query: H-index Based k-Anonymization Privacy Protection for Social Networks

Jianliang Gao; Bo Song; Zheng Chen; Weimao Ke; Wanying Ding; Xiaohua Hu

In this paper, we propose a novel k-anonymization scheme to counter deanonymization queries on social networks. With this scheme, all entities are protected by k-anonymization, which means the attackers cannot re-identify a target with confidence higher than 1/k. The proposed scheme minimizes the modification on original networks, and accordingly maximizes the utility preservation of published data while achieving k-anonymization privacy protection. Extensive experiments on real data sets demonstrate the effectiveness of the proposed scheme, where the efficacy of the k-anonymized networks is verified with the distributions of PageRank, betweenness, and their Kolmogorov-Smirnov (K-S) test.
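The 1/k guarantee can be illustrated with a generic check, which is not the paper's H-index based scheme. The sketch assumes an attacker who knows one structural signature per node (node degree is used here purely as an example): if every signature value is shared by at least k nodes, the attacker's best guess succeeds with probability at most 1/k. The function names and toy data are hypothetical.

```python
from collections import Counter

def smallest_group_size(signatures):
    """Size of the smallest set of nodes sharing one signature value."""
    return min(Counter(signatures.values()).values())

def is_k_anonymous(signatures, k):
    """True if every signature value is shared by at least k nodes."""
    return smallest_group_size(signatures) >= k

# Toy network (illustrative only): node -> degree used as the signature.
degrees = {"a": 2, "b": 2, "c": 2, "d": 3, "e": 3, "f": 3}

three_anonymous = is_k_anonymous(degrees, 3)
four_anonymous = is_k_anonymous(degrees, 4)
# Upper bound on the attacker's re-identification confidence.
worst_case_confidence = 1 / smallest_group_size(degrees)
```

An anonymization scheme would modify the network until such a check passes for the chosen k, ideally with as few changes as possible so that the published data stays useful.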
