Raymond Wan | Researchain

Publication

Featured researches published by Raymond Wan.

Genome Research | 2011

Adaptive seeds tame genomic sequence comparison

Szymon M. Kiełbasa; Raymond Wan; Kengo Sato; Paul Horton; Martin C. Frith

The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
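
The adaptive-seed idea described in this abstract can be illustrated with a minimal Python sketch. It is not LAST's implementation: it counts substrings naively, and the function names and the `max_hits` threshold are hypothetical choices for illustration. Starting at a query position, the seed is lengthened until it occurs at most `max_hits` times in the reference, so repetitive regions yield longer seeds than rare ones.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def adaptive_seed(reference, query, start, max_hits=1):
    """Return the shortest prefix of query[start:] that occurs between 1 and
    max_hits times in reference, or None if no such prefix exists.

    max_hits is a hypothetical rareness threshold chosen for this sketch.
    """
    for end in range(start + 1, len(query) + 1):
        seed = query[start:end]
        hits = count_occurrences(reference, seed)
        if hits == 0:
            return None  # the query no longer matches the reference
        if hits <= max_hits:
            return seed  # rare enough: stop extending
    return None  # still too frequent when the query ends


reference = "ACACACACGTTACACAC"
query = "ACACGTT"
seed = adaptive_seed(reference, query, 0, max_hits=1)
print(seed)
```

With `max_hits=1`, the repetitive prefix `ACAC` is too common in the reference, so the seed grows to `ACACG`, which occurs exactly once.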

Nucleic Acids Research | 2010

Incorporating sequence quality data into alignment improves DNA read mapping.

Martin C. Frith; Raymond Wan; Paul Horton

New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
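
Per-base error probabilities are commonly stored in FASTQ files as Phred quality characters. The sketch below only shows how such characters map to probabilities, assuming the Phred+33 (Sanger) encoding; it is not the alignment method of the paper, and the function name is hypothetical.

```python
def phred33_to_error_probs(quality):
    """Convert a Phred+33 quality string to per-base error probabilities.

    A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    The +33 offset is an assumption about the encoding of the input.
    """
    return [10 ** (-(ord(ch) - 33) / 10) for ch in quality]


# 'I' is Q40, '5' is Q20, '+' is Q10, and '!' is Q0.
probs = phred33_to_error_probs("I5+!")
print(probs)
```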

Bioinformatics | 2012

Transformations for the compression of FASTQ quality scores of next-generation sequencing data

Raymond Wan; Vo Ngoc Anh; Kiyoshi Asai

Motivation: The growth of next-generation sequencing means that more effective and efficient archiving methods are needed to store the generated data for public dissemination and in anticipation of more mature analytical methods later. This article examines methods for compressing the quality score component of the data to partly address this problem. Results: We compare several compression policies for quality scores, in terms of both compression effectiveness and overall efficiency. The policies employ lossy and lossless transformations with one of several coding schemes. Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume fewer computing resources, are highly competitive, especially when random access to reads is needed. Availability and implementation: Our C++ implementation, released under the Lesser General Public License, is available for download at http://www.cb.k.u-tokyo.ac.jp/asailab/members/rwan. Supplementary information: Supplementary data are available at Bioinformatics online.

Bioinformatics | 2013

ViralFusionSeq: accurately discover viral integration events and reconstruct fusion transcripts at single-base resolution

Jing-Woei Li; Raymond Wan; Chi-Shing Yu; Ngai Na Co; Nathalie Wong; Ting-Fung Chan

Summary: Insertional mutagenesis from virus infection is an important pathogenic risk for the development of cancer. Despite the advent of high-throughput sequencing, discovery of viral integration sites and expressed viral fusion events is still limited. Here, we present ViralFusionSeq (VFS), which combines soft-clipping information, read-pair analysis and targeted de novo assembly to discover and annotate viral–human fusions. VFS was used in an RNA-Seq experiment, a simulated DNA-Seq experiment and a re-analysis of published DNA-Seq datasets. Our experiments demonstrated that VFS is both sensitive and highly accurate. Availability: VFS is distributed under GPL version 3 at http://hkbic.cuhk.edu.hk/software/viralfusionseq. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

String Processing and Information Retrieval | 2001

Re-store: a system for compressing, browsing, and searching large documents

Alistair Moffat; Raymond Wan

We describe a software system for managing text files of up to several hundred megabytes that combines a number of useful facilities. First, the text is stored compressed using a variant of the RE-PAIR mechanism described by Larsson and Moffat, with space savings comparable to those obtained by other widely used general-purpose compression systems. Second, we provide, as a byproduct of the compression process, a phrase-based browsing tool that allows users to explore the contents of the source text in a natural and useful manner. And third, once a set of desired phrases has been determined through the use of the browsing tool, the compressed text can be searched to determine locations at which those phrases appear, without decompressing the whole of the stored text, and without use of an additional index. That is, we show how the RE-PAIR compression regime can be extended to allow phrase-based browsing and fast interactive searching.
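
The core idea of RE-PAIR is to repeatedly replace the most frequent pair of adjacent symbols with a new symbol, recording each replacement as a phrase rule. The Python sketch below is a naive version of that idea only, not the variant used in the system described above. It recounts all pairs on every pass, counts overlapping pairs such as those in `aaa` only approximately, and uses hypothetical function names.

```python
from collections import Counter


def repair_compress(symbols):
    """Naive Re-Pair: return (compressed sequence, phrase rules)."""
    seq = list(symbols)
    rules = {}  # new symbol -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so nothing more to gain
        new_symbol = ("R", next_id)  # tuples cannot clash with characters
        next_id += 1
        rules[new_symbol] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def repair_expand(seq, rules):
    """Expand a compressed sequence back into the original string."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)


text = "abracadabra abracadabra"
compressed, rules = repair_compress(text)
restored = repair_expand(compressed, rules)
print(len(text), len(compressed), len(rules))
```

The phrase rules form a hierarchy of repeated phrases, which is the kind of by-product the abstract describes as the basis for browsing.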

Bioinformatics | 2016

OMBlast: alignment tool for optical mapping using a seed-and-extend approach.

Alden King-Yung Leung; Tsz-Piu Kwok; Raymond Wan; Ming Xiao; Pui-Yan Kwok; Kevin Y. Yip; Ting-Fung Chan

Motivation: Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1‐1 Mbp). Recently, it has been complementing the widely used short‐read sequencing technology by assisting with scaffolding and detecting large and complex structural variations (SVs). Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed‐and‐extend approach from sequence alignment, with modifications specific to optical mapping. Results: Experiments with both synthetic and our real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with SVs. Availability and Implementation: OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine. Contact: [email protected] and [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

Asia Information Retrieval Symposium | 2009

Efficient Probabilistic Latent Semantic Analysis through Parallelization

Raymond Wan; Vo Ngoc Anh; Hiroshi Mamitsuka

Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but has one notable drawback: its dramatic consumption of computing resources, in terms of both execution time and internal memory. This drawback limits the practical application of the technique to document collections of modest size. In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. [2008] have shown how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effects of using it in combination with the Message Passing Interface (MPI) for distributed memory parallelization. We show how a more careful implementation of PLSA reduces execution time and memory costs by applying our method to several text collections commonly used in the literature.

Bioinformatics and Biomedicine | 2010

Sorting next generation sequencing data improves compression effectiveness

Raymond Wan; Kiyoshi Asai

With the increasing use of next-generation sequencing, the problem of effectively storing and transmitting such massive amounts of data will need to be addressed. Repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression system (GZIP) for data archiving. In this work, we investigate how GZIP (and BZIP2) can be made more effective for read archiving by pre-sorting the reads. Pre-sorting reduces the compressed size by up to 12% when only the sequences are considered, and by up to 6% when the original FASTQ data is considered.
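
A toy Python sketch can show why pre-sorting helps a window-based compressor such as GZIP. It uses `zlib` (the same DEFLATE method) on synthetic reads. The data are an assumption made for illustration: each read is duplicated several times, which exaggerates the effect far beyond the gains reported in the abstract for real data.

```python
import random
import zlib


def compressed_size(reads):
    """Size in bytes of the newline-joined reads after DEFLATE compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


rng = random.Random(0)
# Toy data (assumption): 2,000 distinct 100-base reads, each present 5 times.
distinct = ["".join(rng.choice("ACGT") for _ in range(100)) for _ in range(2000)]
reads = distinct * 5
rng.shuffle(reads)

unsorted_size = compressed_size(reads)
sorted_size = compressed_size(sorted(reads))
print(unsorted_size, sorted_size)
```

In shuffled order, most copies of a read lie outside the 32 KB DEFLATE window and cannot be matched; sorting places them next to each other, so the compressor can encode them as back-references.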

String Processing and Information Retrieval | 2008

Term Impacts as Normalized Term Frequencies for BM25 Similarity Scoring

Vo Ngoc Anh; Raymond Wan; Alistair Moffat

The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters. Motivation is provided by a preliminary analysis of a document collection that shows that impacts are more likely to identify documents whose lengths resemble those of the relevant judgments. Experiments on TREC data demonstrate that impact-based BM25 is as good as or better than the original term frequency-based BM25 in terms of retrieval effectiveness.
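
For reference, the standard term frequency-based BM25 that this abstract takes as its baseline can be sketched in Python as follows. This is the widely used textbook formulation, not the impact-based variant proposed in the paper. The function name, the IDF form, and the default values `k1=1.2` and `b=0.75` are assumptions for illustration.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score.

    tf: term frequency in the document; df: number of documents containing
    the term. k1 and b are the two tuning parameters of this formulation.
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # Document length normalization: longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


average_doc = bm25_term_score(tf=3, df=10, doc_len=100, avg_doc_len=100, num_docs=1000)
long_doc = bm25_term_score(tf=3, df=10, doc_len=200, avg_doc_len=100, num_docs=1000)
print(average_doc, long_doc)
```

The same term frequency scores lower in the longer document, which is the document length normalization the abstract refers to.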

International Conference on Data Mining | 2006

Applying Gaussian distribution-dependent criteria to decision trees for high-dimensional microarray data

Raymond Wan; Ichigaku Takigawa; Hiroshi Mamitsuka

Biological data presents unique problems for data analysis due to its high dimensionality. Microarray data is one example of such data that has received much attention in recent years. Machine learning algorithms such as support vector machines (SVMs) are ideal for microarray data due to their high classification accuracies. However, sometimes the information being sought is a list of genes that best separates the classes, and not a classification rate. Decision trees are one alternative; they do not perform as well as SVMs, but their output is easily understood by non-specialists. A major obstacle to applying current decision tree implementations to high-dimensional data sets is their tendency to assign the same score to multiple attributes. In this paper, we propose two distribution-dependent criteria for decision trees to improve their usefulness for microarray classification.