In genomics, expressed sequence tags (ESTs) were once an important tool for gene identification and transcript research. As short stretches of sequence read from expressed genes, ESTs can reveal which genes are active in a given sample and hint at their possible functions. However, with the rapid development of whole-genome sequencing, the role of ESTs has been challenged, and many scientists have begun to question their practicality and effectiveness in genome research.
Currently, the number of available ESTs in public databases has reached 74.2 million.
ESTs are short sequences read from complementary DNA (cDNA) clones, typically between 500 and 800 nucleotides in length, which limits their usefulness in genome research. In contrast, genome sequencing provides a view of the entire genome and can capture structural and functional information about all genes at once. In resolution and completeness, it far exceeds what ESTs can offer.
Research in this field dates back to 1982. Scientists first sequenced randomly selected cDNA clones, and the term "EST" was coined in 1991. Although ESTs have made important contributions to gene discovery and function prediction, they are short, low-quality fragments, and they have gradually been displaced by whole-genome sequencing.
According to a 2006 study, the existence of ESTs made it possible to identify thousands of genes.
EST data come mainly from dbEST, a division of GenBank established in 1992. dbEST holds a large amount of EST data, but submissions do not go through a review process, so the quality of the information in the database is uneven. Many ESTs are redundant, often representing partial sequences of the same mRNA, so these sequences need to be assembled into EST contigs before they can be used for gene discovery.
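The idea of collapsing redundant ESTs into contigs can be sketched as a greedy overlap merge. This is a minimal illustration, not the method used by dbEST or by any particular assembler: the function names, the `min_overlap` threshold, and the toy reads are all assumptions, and real EST assembly must also handle sequencing errors, reverse complements, and alternative splicing.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble_contigs(ests: list[str], min_overlap: int = 4) -> list[str]:
    """Greedy merge of exact-overlapping reads (hypothetical helper)."""
    # Drop exact duplicates and reads fully contained in a longer read.
    contigs: list[str] = []
    for seq in sorted(set(ests), key=len, reverse=True):
        if not any(seq in kept for kept in contigs):
            contigs.append(seq)

    # Repeatedly merge the pair with the longest suffix/prefix overlap.
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        # Both merged reads are substrings of `merged`, so this filter
        # removes them along with anything else the new contig covers.
        contigs = [c for c in contigs if c not in merged] + [merged]


# Toy reads: three overlapping fragments, one duplicate, one contained read.
reads = ["ATGGCGTACG", "GTACGTTAGC", "TTAGCCATGA", "ATGGCGTACG", "GCGTA"]
print(assemble_contigs(reads))
```

Exact string overlap keeps the sketch short; a practical pipeline would score approximate overlaps instead.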
Once a whole-genome sequence is available, scientists can align ESTs directly to the genome, a process that is becoming increasingly important in current research. Systems such as TissueInfo use this kind of mapping to link transcripts with EST data.
Large-scale EST analysis faces a range of data management challenges, the most obvious of which is the inconsistent annotation of tissue of origin.
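A rough sketch of how an EST can be placed on a genome is to index the genome's k-mers and count, for each EST k-mer hit, the offset between genome position and EST position. This is an illustrative toy under stated assumptions, not the algorithm used by TissueInfo or any named aligner: the function names, the value of `k`, and the sequences are hypothetical, and real tools perform spliced alignment that tolerates mismatches.

```python
from collections import defaultdict


def index_kmers(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the genome to the positions where it starts."""
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_diagonals(est: str, genome: str, k: int = 8) -> dict[int, int]:
    """Count exact k-mer hits per diagonal (genome position minus EST position)."""
    index = index_kmers(genome, k)
    votes: dict[int, int] = defaultdict(int)
    for est_pos in range(len(est) - k + 1):
        for genome_pos in index.get(est[est_pos:est_pos + k], ()):
            votes[genome_pos - est_pos] += 1
    return dict(votes)


# Toy genome: two exons separated by an intron; the EST is the spliced transcript.
exon1 = "ATGGCGTACGTTAGC"
exon2 = "CATGACCGGATTCAA"
intron = "GTAAG" + "T" * 10 + "CAG"
genome = "CCCC" + exon1 + intron + exon2 + "GGGG"
est = exon1 + exon2

print(seed_diagonals(est, genome))
```

Because the intron is absent from the EST, the two exons vote for two different diagonals, and the gap between those diagonals equals the intron length.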
As EST data have grown, managing and using them effectively has become an important issue in scientific research. In particular, dbEST describes tissue origin and associated disease conditions in simple free text, which makes automated analysis difficult. The TissueInfo project, started in 2000, aims to fill this gap by providing curated data that disambiguate tissue origin and disease status, giving systematic support for the analysis of genetic data.
Although ESTs achieved a great deal in the early days of genome research, advances in genome and transcriptome sequencing have clearly cost these early tools some of their luster. Modern technology allows researchers to obtain more comprehensive and precise information on gene function.
While the early contributions of ESTs are undeniable, how will future genomic research evolve? Will we continue to rely on outdated technology or fully embrace emerging sequencing methods? These are questions worth pondering for every scientific researcher.