Hongfei Cao | Researchain

Publication

Featured research published by Hongfei Cao.

Applications in Plant Sciences | 2014

A Neotropical Miocene Pollen Database Employing Image-Based Search and Semantic Modeling

Jing Ginger Han; Hongfei Cao; Adrian S. Barb; Surangi W. Punyasena; Carlos Jaramillo; Chi-Ren Shyu

Premise of the study: Digital microscopic pollen images are being generated with increasing speed and volume, producing opportunities to develop new computational methods that increase the consistency and efficiency of pollen analysis and provide the palynological community a computational framework for information sharing and knowledge transfer. Methods: Mathematical methods were used to assign trait semantics (abstract morphological representations) of the images of neotropical Miocene pollen and spores. Advanced database-indexing structures were built to compare and retrieve similar images based on their visual content. A Web-based system was developed to provide novel tools for automatic trait semantic annotation and image retrieval by trait semantics and visual content. Results: Mathematical models that map visual features to trait semantics can be used to annotate images with morphology semantics and to search image databases with improved reliability and productivity. Images can also be searched by visual content, providing users with customized emphases on traits such as color, shape, and texture. Discussion: Content- and semantic-based image searches provide a powerful computational platform for pollen and spore identification. The infrastructure outlined provides a framework for building a community-wide palynological resource, streamlining the process of manual identification, analysis, and species discovery.
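The search by visual content with user-chosen emphasis on color, shape, and texture can be illustrated with a minimal sketch. It assumes each image is already reduced to small per-trait feature vectors and ranks candidates with a weighted Euclidean distance over a linear scan. The feature values, the `grain_*` names, and the functions `weighted_distance` and `search` are hypothetical; the system described in the abstract uses database-indexing structures rather than a linear scan.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance summed over per-trait feature groups."""
    total = 0.0
    for trait, w in weights.items():
        total += w * sum((x - y) ** 2 for x, y in zip(a[trait], b[trait]))
    return math.sqrt(total)

def search(query, database, weights, top_k=2):
    """Return the names of the top_k images closest to the query."""
    ranked = sorted(
        database,
        key=lambda name: weighted_distance(query, database[name], weights),
    )
    return ranked[:top_k]

# Hypothetical feature vectors; real features would come from image analysis.
database = {
    "grain_1": {"color": [0.9, 0.1], "shape": [0.2], "texture": [0.5]},
    "grain_2": {"color": [0.1, 0.9], "shape": [0.6], "texture": [0.5]},
    "grain_3": {"color": [0.7, 0.3], "shape": [0.9], "texture": [0.4]},
}
query = {"color": [0.85, 0.15], "shape": [0.85], "texture": [0.5]}

# The same query ranked with emphasis on color only, then on shape only.
by_color = search(query, database, {"color": 1.0, "shape": 0.0, "texture": 0.0})
by_shape = search(query, database, {"color": 0.0, "shape": 1.0, "texture": 0.0})
```

Changing the weights reorders the results without recomputing any features, which is the kind of customized emphasis the abstract refers to.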

Data Mining in Bioinformatics | 2016

Mining large-scale repetitive sequences in a MapReduce setting

Hongfei Cao; Michael A. Phinney; Devin Petersohn; Benjamin Ryan Merideth; Chi-Ren Shyu

Recent research suggests DNA repeats play critical roles in cellular regulatory functions and disease development. The challenge associated with identifying repeats across a collection of genomes arises from the amount of data stored within DNA, and intermediate data generated by alignment- and hash-based approaches are substantial. We present a MapReduce-based method for repeat identification and propose efficient storage and search techniques. Our approach distributes the computation and storage across a cluster of commodity computers, providing a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of six genomes, totalling approximately 14.2 billion base pairs. We demonstrate a tenfold speedup over previous state-of-the-art approaches and linear scalability. In addition, we conduct a deeper scalability analysis by processing a collection of 39 genomes, approximately 104 billion base pairs.

Bioinformatics and Biomedicine | 2014

MRSMRS: Mining repetitive sequences in a MapReduce setting

Hongfei Cao; Michael A. Phinney; Devin Petersohn; Benjamin Ryan Merideth; Chi-Ren Shyu

Recent research suggests DNA repeats play critical roles in cellular regulatory functions and disease development. Also, repeat variability among different species, or the same species, is an important indicator for the development of specific phenotypes. Similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultra conserved elements (UCEs), tandem repeats, and palindromes have been of interest. Researchers utilize various computational approaches to aid in the identification of each of these types of patterns. The challenge associated with identifying repeats across a collection of genomes arises from the amount of data stored within DNA. The human genome alone consists of more than 3.1 billion base pairs, and intermediate data generated by alignment- and hash-based approaches are substantial. This sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we utilize the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to alleviate redundant computation, intermediate data are organized in HBase, allowing us to incrementally process new genomic data without having to reprocess existing genomes. Our approach lends a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of 6 genomes, summing to an approximate total of 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4 times speedup over previous state-of-the-art approaches and linear scalability.
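The MapReduce structure of such a repeat search can be sketched in plain Python. This is a single-process illustration, not the MRSMRS implementation: it assumes repeats are exact fixed-length substrings (k-mers), and the names `map_phase`, `shuffle`, and `reduce_phase`, the window length `K`, and the toy genomes are hypothetical. On a real cluster the shuffle step is performed by the framework.

```python
from collections import defaultdict

K = 4  # hypothetical window length; real analyses use much longer keys

def map_phase(genome_id, sequence, k=K):
    """Mapper: emit (k-mer, (genome_id, offset)) for every window."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], (genome_id, i)

def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: keep only k-mers that occur more than once."""
    if len(values) > 1:
        return key, sorted(values)
    return None

genomes = {"g1": "ACGTACGT", "g2": "TTACGTAA"}  # toy sequences
pairs = [p for gid, seq in genomes.items() for p in map_phase(gid, seq)]
reduced = (reduce_phase(key, vals) for key, vals in shuffle(pairs).items())
repeats = dict(r for r in reduced if r is not None)
```

Each mapper only needs its own sequence chunk, and each reducer only needs the values for one key, which is what allows the work to be spread across commodity nodes.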

Proceedings of the American Society for Information Science and Technology | 2011

Search tactics for medical image retrieval

Xin Wang; Sanda Erdelez; Yunhui Lu; Carla Allen; Blake Anderson; Hongfei Cao; Chi-Ren Shyu

Few studies have explored image search behavior in the medical field. This study investigated the effect of domain knowledge on the use of search tactics during the image search process by radiologic technologists. We found that in the field of radiography, experts and novices demonstrated significant differences in employing five search tactics: Browse, Enlarge, Examine Enlarged Images, Refine, and Exhaust. Experts showed strong capabilities when using image content information to make their relevance judgments, while novices relied more on textual information (e.g., captions) to select relevant images. We also found that experts used exhaustive query terms to initiate a search and browsed several screens of results to locate well-matched images or generate new ideas for further search moves. In contrast, novices usually started their search with a simpler concept and only browsed the results on the first few screens.

IEEE International Conference on Multimedia Big Data | 2016

Visual Reasoning Indexing and Retrieval Using In-memory Computing

Hongfei Cao; Yu Li; Carla Allen; Michael A. Phinney; Chi-Ren Shyu

Research has shown that the visual information in multimedia is critical in highly skilled applications, such as biomedicine and the life sciences, and that a certain visual reasoning process is essential for meaningful search in a timely manner. Relevant image characteristics are learned and verified with accumulated experience during the reasoning process. However, this type of process is highly dynamic and elusive to quantify computationally, and therefore challenging to analyze, let alone to make the knowledge shareable across users. In this paper we study real-time human visual reasoning processes with the aid of gaze tracking devices. Temporal and spatial representations are proposed for gaze modeling, and a visual reasoning retrieval system utilizing in-memory computing such as Apache Spark is designed for real-time search. Simulated data derived from human subject experiments show that the system has reasonably high accuracy and provides predictive estimates of hardware requirements versus data sizes for exhaustive searches.

International Journal of Semantic Computing | 2016

Visual Reasoning Indexing and Retrieval Using In-Memory Computing

Hongfei Cao; Yu Li; Carla Allen; Michael A. Phinney; Chi-Ren Shyu

Bioinformatics and Biomedicine | 2013

Object behavior mining from mitochondrial shape databases for degenerated nerve disease studies

Hongfei Cao; Jing Ginger Han; Chi-Ren Shyu

To make a biological information system valuable for the life sciences community, data analytics have to provide explainable results that can demonstrate the linkages between measurable features and biological meaning of interest. Neither traditional classification nor information retrieval methods are sufficient for biomedical applications if such linkages cannot be established. An example area is the understanding of mitochondrial dynamics in Drosophila segmental nerves. Mitochondria are essential organelles of eukaryotic cells that undergo frequent shape changes as well as active movement, and simple shape analysis or image classification of mitochondrial objects is a far cry from an in-depth study of their temporal patterns and transitions. We present a computational approach to perform mitochondrial shape analysis, pattern mining, and dynamic behavior retrievals. The capability of the computational methods may provide new insights into scientific discoveries for degenerated nerve diseases.

Bioinformatics and Biomedicine | 2013

Mining repetitive sequences using a big data ecosystem

Michael A. Phinney; Hongfei Cao; Andi Dhroso; Chi-Ren Shyu

Identifying repetitive gene sequences occurring within DNA sequences that span a collection of species is a problem that is conceptually simple yet computationally challenging. Biological research suggests that certain regions within genomic sequences may be unchanged for hundreds of millions of years; understanding and identifying these highly preserved regions is a major challenge faced by bioinformaticians. Taking an evolutionary perspective on DNA, pinpointing these repetitive sequences is the first step to understanding functional similarities and diversities. The difficulty of this problem arises from the volume of the data required for analysis; it grows with every genome that is sequenced. Traditional approaches used to identify repetitive sequences often require the pair-wise comparison of chromosomes, which takes a significant amount of time to gather results. When comparing n chromosomes, n(n-1) individual comparisons must be made. To avoid exhaustive pair-wise comparisons, we designed an algorithm that partitions genomic sequences into search key values representing potential repetitive sequences, which are hashed into bins. With the introduction of new genomes, we only process the new sequences and aggregate new results with those that were previously processed.
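The incremental step, in which only a newly added genome is processed and its results are merged with earlier ones, can be illustrated with a minimal sketch. It assumes the search keys are fixed-length substrings and that bins are chosen with a CRC32 hash; both choices, along with `NUM_BINS`, `bin_of`, and `add_genome`, are hypothetical and not taken from the paper.

```python
import zlib
from collections import defaultdict

NUM_BINS = 8  # hypothetical bin count

def bin_of(key):
    """Deterministically map a search key to a bin index."""
    return zlib.crc32(key.encode("ascii")) % NUM_BINS

def add_genome(bins, genome_id, sequence, k):
    """Hash one new genome's keys into the existing bins.

    Returns the keys that this genome now shares with at least one
    previously processed genome. Earlier genomes are never rescanned.
    """
    updated = set()
    for i in range(len(sequence) - k + 1):
        key = sequence[i:i + k]
        genomes = bins[bin_of(key)][key]
        if genome_id not in genomes:
            genomes.add(genome_id)
            if len(genomes) > 1:
                updated.add(key)
    return updated

# Each bin maps a search key to the set of genomes containing it.
bins = [defaultdict(set) for _ in range(NUM_BINS)]
first = add_genome(bins, "A", "ACGTACGT", 4)   # nothing to compare against yet
second = add_genome(bins, "B", "TTACGTAA", 4)  # only genome B is scanned
```

Adding a genome costs one pass over that genome alone, in contrast to repeating pair-wise comparisons against every earlier chromosome.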

Journal of the Association for Information Science and Technology | 2012