Caiyan Jia
Beijing Jiaotong University
Publications
Featured research published by Caiyan Jia.
BMC Bioinformatics | 2013
Caiyan Jia; Matthew B. Carson; Jian Yu
Background: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application.
Results: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l, d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal.
Conclusions: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
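The (l, d) definition in the abstract above can be made concrete with a small sketch. This is not TFBSGroup itself, only a brute-force check of which l-mers in a sequence lie within d mutations (Hamming distance) of a given motif. The function names and the toy sequences are hypothetical, chosen for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def motif_instances(motif: str, sequence: str, d: int) -> list[int]:
    """Start positions of every l-mer in `sequence` within d mutations of `motif`."""
    l = len(motif)
    return [
        i
        for i in range(len(sequence) - l + 1)
        if hamming(motif, sequence[i : i + l]) <= d
    ]


# Toy input (hypothetical): under ZOMOPS a sequence may hold zero, one,
# or several instances of the motif.
sequences = ["TTACGTACGA", "GGGGGGGG", "ACGTTCGTACGT"]
hits = [motif_instances("ACGT", s, d=1) for s in sequences]
```

Here the second sequence contains no instance at all, which the OOPS constraint would not permit but ZOMOPS does.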
PLOS ONE | 2014
Caiyan Jia; Matthew B. Carson; Yang Wang; Youfang Lin; Hui Lu
ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of short (l, d) motifs (, ) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets, including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets, have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.
Scientific Reports | 2017
Caiyan Jia; Yafang Li; Matthew B. Carson; Xiaoyang Wang; Jian Yu
Community detection involves grouping the nodes of a network such that nodes in the same community are more densely connected to each other than to the rest of the network. Previous studies have focused mainly on identifying communities in networks using node connectivity. However, each node in a network may be associated with many attributes. Identifying communities in networks combining node attributes has become increasingly popular in recent years. Most existing methods operate on networks with attributes of binary, categorical, or numerical type only. In this study, we introduce kNN-enhance, a simple and flexible community detection approach that uses node attribute enhancement. This approach adds the k Nearest Neighbor (kNN) graph of node attributes to alleviate the sparsity and the noise effect of an original network, thereby strengthening the community structure in the network. We use two testing algorithms, kNN-nearest and kNN-Kmeans, to partition the newly generated, attribute-enhanced graph. Our analyses of synthetic and real world networks have shown that the proposed algorithms achieve better performance compared to existing state-of-the-art algorithms. Further, the algorithms are able to deal with networks containing different combinations of binary, categorical, or numerical attributes and could be easily extended to the analysis of massive networks.
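The attribute-enhancement idea can be sketched as follows, under assumptions that go beyond the abstract: attributes are purely numerical, similarity is Euclidean distance, kNN links are treated as undirected, and the enhanced graph is the plain union of the original and kNN edges. The names `knn_edges` and `enhance` and the toy data are hypothetical; the published method may weight or combine edges differently.

```python
import math


def knn_edges(attributes: dict[str, list[float]], k: int) -> set[frozenset[str]]:
    """Undirected edges linking each node to its k nearest neighbours in attribute space."""
    edges: set[frozenset[str]] = set()
    for u, au in attributes.items():
        # Sort the other nodes by distance; ties are broken by node name.
        others = sorted(
            (v for v in attributes if v != u),
            key=lambda v: (math.dist(au, attributes[v]), v),
        )
        for v in others[:k]:
            edges.add(frozenset((u, v)))
    return edges


def enhance(edges: set[frozenset[str]], attributes: dict[str, list[float]], k: int) -> set[frozenset[str]]:
    """Union of the original edges and the attribute kNN edges (an assumed combination rule)."""
    return set(edges) | knn_edges(attributes, k)


# Toy network (hypothetical): one noisy link between two attribute clusters.
attributes = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [5.0, 5.0], "d": [5.0, 6.0]}
original = {frozenset(("a", "c"))}
enhanced = enhance(original, attributes, k=1)
```

Any standard graph-partitioning routine could then be run on `enhanced` in place of the sparse original network.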
BMC Medical Genomics | 2017
Matthew B. Carson; Cong Liu; Yao Lu; Caiyan Jia; Hui Lu
Background: Complex diseases involve many genes, and these genes are often associated with several different illnesses. Disease similarity measurement can be based on shared genotype or phenotype. Quantifying relationships between genes can reveal previously unknown connections and form a reference base for therapy development and drug repurposing.
Methods: Here we introduce a method to measure disease similarity that incorporates the uniqueness of shared genes. For each disease pair, we calculated the uniqueness score and constructed disease similarity matrices using OMIM and Disease Ontology annotation.
Results: Using the Disease Ontology-based matrix, we identified several interesting connections between cancer and other diseases and conditions such as malaria, along with studies to support our findings. We also found several high-scoring pairwise relationships for which there was little or no literature support, highlighting potentially interesting connections warranting additional study.
Conclusions: We developed a co-occurrence matrix based on gene uniqueness to examine the relationships between diseases from OMIM and DORIF data. Our similarity matrix can be used to identify potential disease relationships and to motivate further studies investigating the causal mechanisms in diseases.
Pattern Recognition | 2018
Caiyan Jia; Matthew B. Carson; Xiaoyang Wang; Jian Yu
A new concept decomposition method, WordCom, is proposed. It creates concept vectors by identifying semantic word communities from a weighted word co-occurrence network. It is not only robust to the sparsity of short texts but also overcomes the curse of dimensionality. It scales to a large number of short text inputs because the concept vectors are obtained from term-term space. Experimental tests have shown that the proposed method outperforms state-of-the-art algorithms.
Short text clustering is an increasingly important methodology but faces the challenges of sparsity and high dimensionality of text data. Previous concept decomposition methods have obtained concept vectors via the centroids of clusters using k-means-type clustering algorithms on normal, full texts. In this study, we propose a new concept decomposition method that creates concept vectors by identifying semantic word communities from a weighted word co-occurrence network extracted from a short text corpus or a subset thereof. The cluster memberships of short texts are then estimated by mapping the original short texts to the learned semantic concept vectors. The proposed method is not only robust to the sparsity of short text corpora but also overcomes the curse of dimensionality, scaling to a large number of short text inputs because the concept vectors are obtained from term-term instead of document-term space. Experimental tests have shown that the proposed method outperforms state-of-the-art algorithms.
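As a rough illustration of the first step described above, the sketch below builds a weighted word co-occurrence network from a handful of short texts. The weighting rule (number of texts in which two words appear together), the whitespace tokenisation, and the name `cooccurrence_network` are assumptions made for this example; the abstract does not specify them, and the community detection step is not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(texts: list[str]) -> Counter:
    """Map each unordered word pair to the number of texts containing both words."""
    weights: Counter = Counter()
    for text in texts:
        # Each distinct word counts once per text; sorting gives a canonical pair order.
        words = sorted(set(text.lower().split()))
        weights.update(combinations(words, 2))
    return weights


# Toy corpus (hypothetical).
texts = [
    "graph community detection",
    "community detection network",
    "short text clustering",
]
network = cooccurrence_network(texts)
```

Because the network lives in term-term space, its size depends on the vocabulary rather than on the number of texts, which is the scaling property the abstract points to.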
IET Systems Biology | 2016
Wenyi Qin; Guijun Zhao; Matthew B. Carson; Caiyan Jia; Hui Lu
A structure-based statistical potential is developed for transcription factor binding site (TFBS) prediction. Besides the direct contact between amino acids from TFs and DNA bases, the authors also considered the influence of the neighbouring base. This three-body potential showed better discriminative power than the two-body potential. They validate the performance of the potential in TFBS identification, binding energy prediction and binding mutation prediction.
IEEE Transactions on Systems, Man, and Cybernetics | 2017
Yafang Li; Caiyan Jia; Xiangnan Kong; Liu Yang; Jian Yu
Attributed graphs have attracted much attention in recent years. Different from conventional graphs, attributed graphs involve two different types of heterogeneous information, i.e., structural information, which represents the links between the nodes, and attribute information on each of the nodes. Clustering on attributed graphs usually requires the fusion of both types of information in order to identify meaningful clusters. However, most existing works implement the combination of these two types of information in a “global” manner by treating all nodes equally and learning a global weight for the information fusion. To address this issue, this paper proposed a novel weighted K-means algorithm with “local” learning for attributed graph clustering, called adaptive fusion of structural and attribute information (Adapt-SA), and analyzed the convergence property of the algorithm. The key advantage of this model is that it automatically balances the structural connections and attribute information of each node to learn a fusion weight, and obtains densely connected clusters with high attribute semantic similarity. An experimental study of the weights on both synthetic and real-world data sets showed that the weights learned by Adapt-SA were reasonable, and that they reflected which of these two types of information was more important in deciding the membership of a node. We also compared Adapt-SA with state-of-the-art algorithms on real-world networks with a variety of characteristics. The experimental results demonstrated that our method outperformed the other algorithms in partitioning an attributed graph into a community structure or other general structures.
Physica A: Statistical Mechanics and its Applications | 2015
Yafang Li; Caiyan Jia; Jian Yu
Physica A: Statistical Mechanics and its Applications | 2018
Zhenhai Chang; Xianjun Yin; Caiyan Jia; Xiaoyang Wang
Physica A: Statistical Mechanics and its Applications | 2015
Bianfang Chai; Caiyan Jia; Jian Yu