Yujun Han
Chinese Academy of Sciences
Publication
Featured research published by Yujun Han.
Science | 2002
Jun Yu; Songnian Hu; Jun Wang; Gane Ka-Shu Wong; Songgang Li; Bin Liu; Yajun Deng; Yan Zhou; Xiuqing Zhang; Mengliang Cao; Jing Liu; Jiandong Sun; Jiabin Tang; Yanjiong Chen; Xiaobing Huang; Wei Lin; Chen Ye; Wei Tong; Lijuan Cong; Jianing Geng; Yujun Han; Lin Li; Wei Li; Guangqiang Hu; Xiangang Huang; Wenjie Li; Jian Li; Zhanwei Liu; Long Li; Jianping Liu
The genome of the japonica subspecies of rice, an important cereal and model monocot, was sequenced and assembled by whole-genome shotgun sequencing. The assembled sequence covers 93% of the 420-megabase genome. Gene predictions on the assembled sequence suggest that the genome contains 32,000 to 50,000 genes. Homologs of 98% of the known maize, wheat, and barley proteins are found in rice. Synteny and gene homology between rice and the other cereal genomes are extensive, whereas synteny with Arabidopsis is limited. Assignment of candidate rice orthologs to Arabidopsis genes is possible in many cases. The rice genome sequence provides a foundation for the improvement of cereals, our most important crops.
PLOS Computational Biology | 2005
Ruiqiang Li; Jia Ye; Songgang Li; Jing Wang; Yujun Han; Chen Ye; Jian Wang; Huanming Yang; Jun Yu; Gane Ka-Shu Wong; Jun Wang
We describe an algorithm, ReAS, to recover ancestral sequences for transposable elements (TEs) from the unassembled reads of a whole genome shotgun. The main assumptions are that these TEs must exist at high copy numbers across the genome and must not be so old that they are no longer recognizable in comparison to their ancestral sequences. Tested on the japonica rice genome, ReAS was able to reconstruct all of the high copy sequences in the Repbase repository of known TEs, and increase the effectiveness of RepeatMasker in identifying TEs from genome sequences.
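The abstract's first assumption, that recoverable TEs occur at high copy number in the unassembled reads, can be illustrated with a simple k-mer count. The following is a minimal sketch of that idea only, not the ReAS algorithm itself; the function name `high_copy_kmers`, the toy reads, and the `k` and `min_count` values are hypothetical choices made for illustration.

```python
from collections import Counter


def high_copy_kmers(reads, k, min_count):
    """Return the k-mers that occur at least min_count times across all reads.

    Illustrative only: a real pipeline would also handle reverse
    complements, sequencing errors, and sequencing depth.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads (hypothetical data) sharing the 4-mer "ACGT".
reads = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
print(sorted(high_copy_kmers(reads, k=4, min_count=3)))
```

K-mers that pass such a threshold would be candidate seeds for repeat reconstruction; how ReAS actually selects and extends seeds is not described in this abstract.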
Theoretical and Applied Genetics | 2004
Ai-Guo Tian; Jun Wang; Peng Cui; Yujun Han; Hao Xu; Lijuan Cong; Xiangang Huang; Xiaoling Wang; Yongzhi Jiao; B. Wang; Yong-Jun Wang; Zhang J; Shou-Yi Chen
We analyzed 314,254 soybean expressed sequence tags (ESTs), including 29,540 from our laboratory and 284,714 from GenBank. These ESTs were assembled into 56,147 unigenes. About 76.92% of the unigenes were homologous to genes from Arabidopsis thaliana (Arabidopsis). The putative products of these unigenes were annotated according to their homology with the categorized proteins of Arabidopsis. Genes corresponding to cell growth and/or maintenance, enzymes and cell communication belonged to the slow-evolving class, whereas genes related to transcription regulation, cell, binding and death appeared to be fast-evolving. Soybean unigenes with no match to genes within the Arabidopsis genome were identified as soybean-specific genes. These genes were mainly involved in nodule development and the synthesis of seed storage proteins. In addition, we also identified 61 genes regulated by salicylic acid, 1,322 transcription factor genes and 326 disease resistance-like genes from soybean unigenes. SSR analysis showed that the soybean genome was more complex than the Arabidopsis and the Medicago truncatula genomes. GC content in soybean unigene sequences is similar to that in Arabidopsis and M. truncatula. Furthermore, the combined analysis of the EST database and the BAC-contig sequences revealed that the total gene number in the soybean genome is about 63,501.
Chinese Science Bulletin | 2003
E’de Qin; Qingyu Zhu; Man Yu; Baochang Fan; Guohui Chang; Bingyin Si; Bao’an Yang; Wenming Peng; Tao Jiang; Bohua Liu; Yong-Qiang Deng; Liu H; Yu Zhang; Cui’e Wang; Y. Li; Yonghua Gan; Xiaoyu Li; Fushuang Lü; Gang Tan; Wuchun Cao; Ruifu Yang; Jian Wang; Wei Li; Z. Y. Xu; Yan Li; Qingfa Wu; Wei Lin; Weijun Chen; Lin Tang; Yajun Deng
The genome sequence of the Severe Acute Respiratory Syndrome (SARS)-associated virus provides essential information for the identification of pathogen(s), exploration of etiology and evolution, interpretation of transmission and pathogenesis, development of diagnostics, prevention by future vaccination, and treatment by developing new drugs. We report the complete genome sequence and comparative analysis of an isolate (BJ01) of the coronavirus that has been recognized as a pathogen for SARS. The genome is 29725 nt in size and has 11 ORFs (Open Reading Frames). It is composed of a stable region encoding an RNA-dependent RNA polymerase (composed of 2 ORFs) and a variable region representing 4 CDSs (coding sequences) for viral structural genes (the S, E, M, N proteins) and 5 PUPs (putative uncharacterized proteins). Its gene order is identical to that of other known coronaviruses. The sequence alignment with all known RNA viruses places this virus as a member in the family of Coronaviridae. Thirty putative substitutions have been identified by comparative analysis of the 5 SARS-associated virus genome sequences in GenBank. Fifteen of them lead to possible amino acid changes (non-synonymous mutations) in the proteins. Three amino acid changes, with predicted alteration of physical and chemical features, have been detected in the S protein that is postulated to be involved in the immunoreactions between the virus and its host. Two amino acid changes have been detected in the M protein, which could be related to viral envelope formation. Phylogenetic analysis suggests the possibility of non-human origin of the SARS-associated viruses but provides no evidence that they are man-made. Further efforts should focus on identifying the etiology of the SARS-associated virus and ruling out conclusively the existence of other possible SARS-related pathogen(s).
Chinese Science Bulletin | 2001
Jun Yu; Songnian Hu; Jun Wang; Songgang Li; Ka-Shu Gane Wong; Bin Liu; Yajun Deng; Li Dai; Yan Zhou; Xiuqing Zhang; Mengliang Cao; Jing Liu; Jiandong Sun; Jiabin Tang; Yanjiong Chen; Xiaobing Huang; Wei Lin; Chen Ye; Wei Tong; Lijuan Cong; Jianing Geng; Yujun Han; Lin Li; Wei Li; Guangqiang Hu; Xiangang Huang; Wenjie Li; Jian Li; Zhanwei Liu; Long Li
The sequence of the rice genome holds fundamental information for its biology, including physiology, genetics, development, and evolution, as well as information on many beneficial phenotypes of economic significance. Using a “whole genome shotgun” approach, we have produced a draft rice genome sequence of Oryza sativa ssp. indica, the major crop rice subspecies in China and many other regions of Asia. The draft genome sequence is constructed from over 4.3 million successful sequencing traces with an accumulative total length of 2214.9 Mb. The initial assembly of the non-redundant sequences reached 409.76 Mb in length, based on 3.30 million successful sequencing traces with a total length of 1797.4 Mb from an indica variant cultivar 93-11, giving an estimated coverage of 95.29% of the rice genome with an average base accuracy of higher than 99%. The coverage of the draft sequence, the randomness of the sequence distribution, and the consistency of BIG-ASSEMBLER, a custom-designed software package used for the initial assembly, were verified rigorously by comparisons against finished BAC clone sequences from both indica and japonica strains, available from the public databases. Overall, 96.3% of full-length cDNAs, 96.4% of STS, STR, RFLP markers, 94.0% of ESTs and 94.9% of unigene clusters were identified from the draft sequence. Our preliminary analysis of the data set shows that our rice draft sequence is consistent with the common standard accepted by the genome sequencing community. The unconditional release of the draft to the public also undoubtedly provides a fundamental resource to the international scientific communities to facilitate genomic and genetic studies on rice biology.
Genomics, Proteomics & Bioinformatics | 2003
E’de Qin; Xionglei He; Wei Tian; Yong Liu; Wei Li; Jie Wen; Jingqiang Wang; Baochang Fan; Qingfa Wu; Guohui Chang; Wuchun Cao; Z. Y. Xu; Ruifu Yang; Jing Wang; Man Yu; Yan Li; Jing Xu; Bingyin Si; Yongwu Hu; Wenming Peng; Lin Tang; Tao Jiang; Jianping Shi; Jia Ji; Yu Zhang; Jia Ye; Cui’e Wang; Yujun Han; Jun Zhou; Yajun Deng
We report a complete genomic sequence of rare isolates (minor genotype) of the SARS-CoV from SARS patients in Guangdong, China, where the first few cases emerged. The most striking discovery from the isolate is an extra 29-nucleotide sequence located at the nucleotide positions between 27,863 and 27,864 (referred to the complete sequence of BJ01) within an overlapped region composed of BGI-PUP5 (BGI-postulated uncharacterized protein 5) and BGI-PUP6 upstream of the N (nucleocapsid) protein. The discovery of this minor genotype, GD-Ins29, suggests a significant genetic event and differentiates it from the previously reported genotype, the dominant form among all sequenced SARS-CoV isolates. A 17-nt segment of this extra sequence is identical to a segment of the same size in two human mRNA sequences that may interfere with viral genome replication and transcription in the cytosol of the infected cells. It provides a new avenue for the exploration of the virus-host interaction in viral evolution, host pathogenesis, and vaccine development.
Genomics, Proteomics & Bioinformatics | 2003
Yongwu Hu; Jie Wen; Lin Tang; Haijun Zhang; Xiaowei Zhang; Yan Li; Jing Wang; Yujun Han; Guoqing Li; Jianping Shi; Xiangjun Tian; Feng Jiang; Xiaoqian Zhao; Jun Wang; Siqi Liu; Changqing Zeng; Jian Wang; Huanming Yang
We studied structural and immunological properties of the SARS-CoV M (membrane) protein, based on comparative analyses of sequence features, phylogenetic investigation, and experimental results. The M protein is predicted to contain a triple-spanning transmembrane (TM) region, a single N-glycosylation site near its N-terminus that is in the exterior of the virion, and a long C-terminal region in the interior. The M protein harbors a higher substitution rate (0.6% correlated to its size) among viral open reading frames (ORFs) from published data. The four substitutions detected in the M protein, which cause non-synonymous changes, can be classified into three types. One of them results in changes of pI (isoelectric point) and charge, affecting antigenicity. The second changes hydrophobicity of the TM region, and the third one relates to hydrophilicity of the interior structure. Phylogenetic tree building based on the variations of the M protein appears to support the non-human origin of SARS-CoV. To investigate its immunogenicity, we synthesized eight oligopeptides covering 69.2% of the entire ORF and screened them by using ELISA (enzyme-linked immunosorbent assay) with sera from SARS patients. The results confirmed our predictions on antigenic sites.
Genomics, Proteomics & Bioinformatics | 2003
Jingxiang Li; Chunqing Luo; Yajun Deng; Yujun Han; Lin Tang; Jing Wang; Jia Ji; Jia Ye; Fanbo Jiang; Zhao Xu; Wei Tong; Wei Wei; Qingrun Zhang; Shengbin Li; Wei Li; Hongyan Li; Yudong Li; Wei Dong; Jian Wang; Shengli Bi; Huanming Yang
The corona-like spikes or peplomers on the surface of the virion under the electron microscope are the most striking features of coronaviruses. The S (spike) protein is the largest structural protein, with 1,255 amino acids, in the viral genome. Its structure can be divided into three regions: a long N-terminal region in the exterior, a characteristic transmembrane (TM) region, and a short C-terminus in the interior of a virion. We detected fifteen substitutions of nucleotides by comparisons with the seventeen published SARS-CoV genome sequences, eight (53.3%) of which are non-synonymous mutations leading to amino acid alterations with predicted physicochemical changes. The possible antigenic determinants of the S protein are predicted, and the result is confirmed by ELISA (enzyme-linked immunosorbent assay) with synthesized peptides. Another profound finding is that three disulfide bonds are defined at the C-terminus with the N-terminus of the E (envelope) protein, based on the typical sequence and positions, thus establishing the structural connection with these two important structural proteins, if confirmed. Phylogenetic analysis reveals several conserved regions that might be potent drug targets.
Genomics, Proteomics & Bioinformatics | 2003
Shengli Bi; E’de Qin; Z. Y. Xu; Wei Li; Jing Wang; Yongwu Hu; Yong Liu; Shumin Duan; Jianfei Hu; Yujun Han; Jing Xu; Yan Li; Yao Yi; Yongdong Zhou; Wei Lin; Jie Wen; Hong Xu; Ruan Li; Zizhang Zhang; Haiyan Sun; Jingui Zhu; Man Yu; Baochang Fan; Qingfa Wu; Lin Tang; Bao’an Yang; Guoqing Li; Wenming Peng; Wenjie Li; Tao Jiang
Beijing has been one of the epicenters attacked most severely by the SARS-CoV (severe acute respiratory syndrome-associated coronavirus) since the first patient was diagnosed in one of the city’s hospitals. We now report complete genome sequences of the BJ Group, including four isolates (Isolates BJ01, BJ02, BJ03, and BJ04) of the SARS-CoV. It is remarkable that all members of the BJ Group share a common haplotype, consisting of seven loci that differentiate the group from other isolates published to date. Among 42 substitutions uniquely identified from the BJ group, 32 are non-synonymous changes at the amino acid level. Rooted phylogenetic trees, proposed on the basis of haplotypes and other sequence variations of SARS-CoV isolates from Canada, USA, Singapore, and China, gave rise to different paradigms but positioned the BJ Group, together with the newly discovered GD01 (GD-Ins29) in the same clade, followed by the H-U Group (from Hong Kong to USA) and the H-T Group (from Hong Kong to Toronto), leaving the SP Group (Singapore) more distant. This result appears to suggest a possible transmission path from Guangdong to Beijing/Hong Kong, then to other countries and regions.
Genomics, Proteomics & Bioinformatics | 2003
Lan Zhong; Kunlin Zhang; Xiangang Huang; Peixiang Ni; Yujun Han; Kai Wang; Jun Wang; Songgang Li
The large number of repeats, especially high copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) quite difficult. In order to solve this problem, we tried to identify repeats and mask them prior to assembly even at the stage of genome survey. It is known that repeats of different copy number have different probabilities of appearance in shotgun data, so based on this principle, we constructed a statistical model and inferred criteria for mathematically defined repeats (MDRs) at different shotgun coverages. According to these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. With repeats masked prior to assembly, the speed of assembly was increased with lower error probability. In addition, clone-insert size affects the accuracy of repeat assembly and scaffold construction. We also designed the length distribution of clone-inserts using our model. In our simulated genomes of human and rice, the length distribution of repeats is different, so their optimal length distributions of clone-inserts were not the same. Thus with an optimal length distribution of clone-inserts, a given genome could be assembled better at lower coverage.
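The principle stated above, that sequences of different copy number appear in shotgun data with different probabilities, can be sketched with a Poisson model. This is an assumed stand-in, not the statistical model or criteria used by MDRmasker, which the abstract does not specify. Here a single-copy sequence sampled at shotgun depth `depth` is assumed to appear a Poisson-distributed number of times, and a sequence is called a repeat when its count is improbably high for a single copy. The names `poisson_sf`, `repeat_threshold`, and `alpha` are hypothetical.

```python
import math


def poisson_sf(t, lam):
    """Return P(X >= t) for X ~ Poisson(lam), with integer t >= 0."""
    cdf_below_t = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(t)
    )
    return 1.0 - cdf_below_t


def repeat_threshold(depth, alpha=0.01):
    """Smallest count t such that a single-copy sequence reaches t
    occurrences with probability below alpha at the given depth
    (assumed Poisson sampling)."""
    t = 0
    while poisson_sf(t, depth) >= alpha:
        t += 1
    return t


# The threshold rises with shotgun coverage, so the criterion for
# calling a repeat has to be set per coverage level.
for depth in (0.5, 1.0, 2.0, 4.0):
    print(depth, repeat_threshold(depth))
```

Under this assumption the count needed to call a repeat depends on coverage, which is consistent with the abstract's point that criteria were inferred separately for different shotgun coverages.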