Yangyong Zhu
Fudan University
Publications
Featured research published by Yangyong Zhu.
Bioinformatics | 2008
Guangyong Zheng; Kang Tu; Qing Yang; Yun Xiong; Chaochun Wei; Lu Xie; Yangyong Zhu; Yixue Li
Investigating transcription factors (TFs) and the genes they regulate downstream (targets) is a significant issue in the post-genome era and can provide a new perspective on vital biological processes. However, information on TFs and their targets in mammals is far from sufficient. Here, we developed an integrated TF platform (ITFP), which includes abundant mammalian TFs and their targets. The current release of ITFP includes 4105 putative TFs and 69 496 potential TF-target pairs for human, 3134 putative TFs and 37 040 potential TF-target pairs for mouse, and 1114 putative TFs and 18 055 potential TF-target pairs for rat. In short, ITFP will serve as an important resource for the transcription research community and provide strong support for regulatory network studies.
IEEE Transactions on Knowledge and Data Engineering | 2015
Yun Xiong; Yangyong Zhu; Philip S. Yu
As a newly emerging network model, heterogeneous information networks (HINs) have received growing attention. Many data mining tasks have been explored in HINs, including clustering, classification, and similarity search. Similarity join is a fundamental operation required for many problems. It is attracting attention from various applications on network data, such as friend recommendation, link prediction, and online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet been studied in heterogeneous networks. In particular, none of the existing research on similarity join considers the different semantic meanings behind paths, and almost all of it completely ignores the heterogeneity and diversity of HINs. In this paper, we propose a path-based similarity join (PS-join) method that returns the top k similar pairs of objects based on any user-specified join path in a heterogeneous information network. We study how to prune expensive similarity computations by introducing bucket-pruning-based locality-sensitive hashing (BPLSH) indexing. Compared with the existing link-based similarity join (LS-join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the efficiency and effectiveness of the proposed approach.
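To make the idea of a top-k similarity join over a join path concrete, here is a minimal Python sketch. It is not the PS-join algorithm: the abstract does not state the similarity measure, so the sketch assumes a PathSim-style score over a hypothetical Author-Paper-Author path, and it compares all pairs by brute force with no BPLSH pruning. The names `path_counts`, `top_k_similar_pairs`, and the sample data are illustrative only.

```python
from collections import Counter
from itertools import combinations
import heapq


def path_counts(author_papers):
    """Count Author-Paper-Author path instances for every author pair.

    `author_papers` maps each author to a set of paper ids (assumed input shape).
    """
    # Invert the mapping: paper -> set of authors.
    paper_authors = {}
    for author, papers in author_papers.items():
        for paper in papers:
            paper_authors.setdefault(paper, set()).add(author)

    counts = Counter()
    for authors in paper_authors.values():
        for a in authors:
            counts[(a, a)] += 1  # path from an author back to itself
        for a, b in combinations(sorted(authors), 2):
            counts[(a, b)] += 1  # one shared paper = one path instance
    return counts


def top_k_similar_pairs(author_papers, k):
    """Return the k highest-scoring (score, a, b) triples (assumed PathSim-style score)."""
    counts = path_counts(author_papers)
    scored = []
    for (a, b), n in counts.items():
        if a == b:
            continue
        # Normalise the shared-path count by each object's self-path count.
        score = 2 * n / (counts[(a, a)] + counts[(b, b)])
        scored.append((score, a, b))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    data = {
        "alice": {"p1", "p2"},
        "bob": {"p1", "p2"},
        "carol": {"p2", "p3", "p4"},
    }
    print(top_k_similar_pairs(data, 2))
```

Changing the join path (for example Author-Paper-Venue-Paper-Author) would change which objects count as similar, which is the sense in which a path-based join can express different similarity semantics.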
BMC Bioinformatics | 2008
Guangyong Zheng; Ziliang Qian; Qing Yang; Chaochun Wei; Lu Xie; Yangyong Zhu; Yixue Li
Background: Transcription factors (TFs) are core functional proteins that play important roles in gene expression control, and they are key factors in the construction of gene regulation networks. Traditionally, they were identified and classified through experimental approaches. To save time and reduce costs, many computational methods have been developed to identify TFs among new proteins and to classify the resulting TFs. Although these methods have facilitated the screening of TFs to some extent, low accuracy is still a common problem. With the fast-growing number of new proteins, more precise algorithms for identifying TFs among new proteins and classifying them are in high demand. Results: The support vector machine (SVM) algorithm was used to construct an automatic detector for TF identification, with protein domains and functional sites employed as feature vectors. The error-correcting output coding (ECOC) algorithm, which originated in the information and communication engineering fields, was combined with the SVM methodology for TF classification. The overall success rates of identification and classification reached 88.22% and 97.83%, respectively. Finally, a web site was constructed to let users access our tools (see the Availability and requirements section for the URL). Conclusion: The SVM method was a valid and stable means of TF identification with protein domains and functional sites as feature vectors. The ECOC algorithm is a powerful method for multi-class classification problems. When combined with the SVM method, it can remarkably increase the accuracy of TF classification using protein domains and functional sites as feature vectors. In addition, our work suggests that the ECOC algorithm may succeed in a broad range of applications in biological data mining.
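The decoding step of error-correcting output coding can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the four class labels are hypothetical, the 7-bit code matrix is a standard exhaustive code for four classes, and the bit vector passed to `ecoc_decode` stands in for the outputs of seven binary classifiers (SVMs in the paper's setting), which are not trained here.

```python
# Each class is assigned a 7-bit codeword; any two codewords differ in 4 bits,
# so a single wrong binary classifier can be corrected.
CODE_MATRIX = {
    "class_a": (1, 1, 1, 1, 1, 1, 1),  # hypothetical class labels
    "class_b": (0, 0, 0, 0, 1, 1, 1),
    "class_c": (0, 0, 1, 1, 0, 0, 1),
    "class_d": (0, 1, 0, 1, 0, 1, 0),
}


def hamming(u, v):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def ecoc_decode(bits, code_matrix=CODE_MATRIX):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(code_matrix, key=lambda label: hamming(code_matrix[label], bits))


if __name__ == "__main__":
    # The first binary classifier is wrong, yet the class is still recovered.
    noisy = (1, 0, 1, 1, 0, 0, 1)
    print(ecoc_decode(noisy))
```

The design choice is redundancy: using more binary classifiers than strictly needed to tell the classes apart leaves room for some of them to be wrong without changing the final label.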
knowledge discovery and data mining | 2007
Yue Chen; Jiankui Guo; Yaqin Wang; Yun Xiong; Yangyong Zhu
This paper first demonstrates that IncSpan+, the current PrefixSpan-based incremental mining algorithm proposed at PAKDD05, cannot completely mine all sequential patterns. A new incremental algorithm for mining sequential patterns using a prefix tree is then proposed. This algorithm constructs a prefix tree to represent the sequential patterns and then continuously scans the incremental element set to maintain the tree structure, using width pruning and depth pruning to reduce the search space. Experiments show that this algorithm performs well.
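As a rough illustration of representing sequential patterns in a prefix tree, the Python sketch below grows a tree of frequent patterns by extending each prefix one item at a time and cutting off any extension whose support falls below the threshold. It is a naive stand-in, assuming sequences of single items and recounting support on every step; it does not reproduce the paper's algorithm, its incremental maintenance, or its specific width and depth pruning rules. All function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def build_pattern_tree(database, min_support):
    """Build a nested-dict prefix tree of all patterns with support >= min_support."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    items = sorted({item for seq in database for item in seq})

    def grow(prefix):
        children = {}
        for item in items:
            candidate = prefix + (item,)
            sup = support(candidate, database)
            # An infrequent pattern cannot have frequent extensions,
            # so the whole subtree below it is skipped.
            if sup >= min_support:
                children[item] = {"support": sup, "children": grow(candidate)}
        return children

    return grow(())


def list_patterns(tree, prefix=()):
    """Yield (pattern, support) for every node of the tree, depth first."""
    for item, node in tree.items():
        pattern = prefix + (item,)
        yield pattern, node["support"]
        yield from list_patterns(node["children"], pattern)


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
    for pattern, sup in list_patterns(build_pattern_tree(db, 2)):
        print(pattern, sup)
```

Because every pattern shares storage with its prefixes, adding new sequences only requires revisiting the affected branches in principle, which is the property an incremental algorithm can exploit.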
world congress on intelligent control and automation | 2006
Yaqin Wang; Yue Chen; Minggui Qin; Yangyong Zhu
ITS technology collects a large amount of historical traffic flow data that may provide information for supporting and improving traffic control. Data mining techniques are well suited to analyzing this large amount of ITS data to extract useful traffic patterns. We present a dynamic traffic prediction model that converts traffic flow data into traffic status. In this paper, two data mining techniques, clustering analysis and classification analysis, are used to develop the model, and the classification model can be used to predict traffic status in real time. Experiments show that the prediction model can be used efficiently for dynamic traffic prediction in urban traffic flow guidance.
Brain Informatics | 2009
Yangyong Zhu; Ning Zhong; Yun Xiong
The essence of computer applications is to store real-world things in computer systems in the form of data; that is, it is a process of producing data. Some data are records related to culture and society, and others are descriptions of phenomena of the universe and life. Large volumes of data are rapidly generated and stored in computer systems, a phenomenon called data explosion. Data explosion forms data nature in computer systems. To explore data nature, new theories and methods are required. In this paper, we present the concept of data nature and introduce the problems arising from it, and then we define a new discipline named dataology (also called data science or science of data), which is an umbrella for theories, methods, and technologies for studying data nature. The research issues and framework of dataology are proposed.
web intelligence | 2010
Li Xue; Yun Xiong; Yangyong Zhu
User Navigation Behavior Mining (UNBM) mainly studies the problem of extracting interesting user access patterns from user access sequences (UAS), which are usually used for user access prediction and web page recommendation. By analyzing real-world web data, we find that most user access sequences carry hybrid features of different patterns rather than a single one.
computer and information technology | 2005
Yun Xiong; Yangyong Zhu
Sequential pattern mining is now widely used in various areas, such as the analysis of biological sequences, Web access patterns, and customer purchase patterns. In this paper, we propose a new definition for M-sequences. We also present multiple supports (local support, total support, and distribution support) for the corresponding mining of local sequential patterns, total sequential patterns, and existence sequential patterns. Based on these multiple supports, a multi-supports-based sequential pattern mining algorithm is developed that can be generally applied to find such patterns.
international conference data science | 2015
Zhongyi Sun; Fengke Chen; Mingmin Chi; Yangyong Zhu
With the fast development of remote sensing techniques, the volume of acquired data grows exponentially. This makes processing massive remote sensing data a big challenge. In this paper, an in-memory computing framework is proposed to address this problem. Here, Spark is an open-source distributed computing platform with Hadoop YARN as the resource scheduler and HDFS as the cloud storage system. On the Spark-based platform, data loaded into memory in the first iteration can be reused in subsequent iterations. This mechanism makes Spark much more suitable for running multi-iteration algorithms than MapReduce, which has to load the data in each iteration. The experiments are carried out on massive remote sensing data using a multi-iteration singular value decomposition (SVD) algorithm. The results show that Spark-based SVD achieves significantly faster computation time than MapReduce, usually by one order of magnitude.
bioinformatics and biomedicine | 2010
Yun Xiong; Junhua He; Yangyong Zhu
Biological sequential patterns usually exhibit significant functions in a set of sequences. Mining such patterns offers a key means of insight into transcription regulation mechanisms and has become a useful primitive task underlying much research and many applications. Recently, various methods have been developed to identify biological patterns. However, traditional approaches to mining sequential patterns produce a huge result set, which makes it difficult for biologists to decide which patterns are interesting and meaningful. In this paper, we study a variant of biological sequential pattern mining that addresses the huge result set, termed top k representative pattern mining based on regularity measurement. As a first attempt to tackle the problem, a new measure, 'regularity', is defined to evaluate the interestingness of each pattern, and an efficient algorithm with a pruning strategy is proposed that returns the top k representative patterns ranked by regularity. Experimental results demonstrate that the proposed method is more efficient than state-of-the-art methods on real datasets.