Ariel S. Schwartz
University of California, Berkeley
Publications
Featured research published by Ariel S. Schwartz.
Bioinformatics | 2007
Ariel S. Schwartz; Lior Pachter
MOTIVATION: We introduce a novel approach to multiple alignment that is based on an algorithm for rapidly checking whether single matches are consistent with a partial multiple alignment. This leads to a sequence annealing algorithm, which is an incremental method for building multiple sequence alignments one match at a time. Our approach improves significantly on the standard progressive alignment approach to multiple alignment. RESULTS: The sequence annealing algorithm performs well on benchmark test sets of protein sequences. It is not only sensitive, but also specific, drastically reducing the number of incorrectly aligned residues in comparison to other programs. The method allows for adjustment of the sensitivity/specificity tradeoff and can be used to reliably identify homologous regions among protein sequences. AVAILABILITY: An implementation of the sequence annealing algorithm is available at http://bio.math.berkeley.edu/amap/
BMC Bioinformatics | 2004
Diane E. Oliver; Gaurav Bhalotia; Ariel S. Schwartz; Russ B. Altman; Marti A. Hearst
BACKGROUND: Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers. RESULTS: We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated. CONCLUSIONS: Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE. RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at http://biotext.berkeley.edu.
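The parse-and-load idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' software (which was written in Java or Perl and targeted IBM DB2 or Oracle 9i); it swaps in Python with an in-memory SQLite database. The sample XML, the three-column `citation` table, and the function names `parse_citations` and `load_citations` are all assumptions made for illustration, and a real MEDLINE record carries many more fields.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample; not real MEDLINE data.
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>First example title</ArticleTitle>
      <Abstract><AbstractText>First example abstract.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <ArticleTitle>Second example title</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def parse_citations(xml_text):
    """Yield (pmid, title, abstract) tuples from a MEDLINE-style XML string."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle", default="")
        # findtext returns None when the record has no abstract.
        abstract = citation.findtext("Article/Abstract/AbstractText")
        yield pmid, title, abstract


def load_citations(conn, rows):
    """Create the assumed single-table schema and insert the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_citations(conn, parse_citations(SAMPLE_XML))

# Once loaded, ordinary SQL answers questions that are awkward against raw XML,
# for example which records lack an abstract.
missing = conn.execute(
    "SELECT pmid, title FROM citation WHERE abstract IS NULL"
).fetchall()
print(missing)
```

At MEDLINE scale, the choices the abstract mentions (intermediate files, parallel versus sequential input, how abstracts are stored) would dominate loading time and disk usage; this sketch ignores them.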
North American Chapter of the Association for Computational Linguistics | 2006
Ariel S. Schwartz; Marti A. Hearst
Citations have great potential to be a valuable resource in mining the bioscience literature (Nakov et al., 2004). The text around citations (or citances) tends to state biological facts with reference to the original papers that discovered them. The cited facts are typically stated in a more concise way in the citing papers than in the original. We hypothesize that in many cases, as time goes by, the citation sentences can more accurately indicate the most important contributions of a paper than its original abstract.
Meeting of the Association for Computational Linguistics | 2005
Preslav Nakov; Ariel S. Schwartz; Brian Wolf; Marti A. Hearst
We demonstrate a system for flexible querying against text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, flexibility in the format of returned results, and tight integration with SQL. We present a query language and its use on examples taken from the NLP literature.
Pacific Symposium on Biocomputing | 2002
Ariel S. Schwartz; Marti A. Hearst
arXiv: Quantitative Methods | 2005
Ariel S. Schwartz; Eugene W. Myers; Lior Pachter
Text Retrieval Conference | 2006
Gaurav Bhalotia; Preslav Nakov; Ariel S. Schwartz; Marti A. Hearst
Text Retrieval Conference | 2004
Preslav Nakov; Ariel S. Schwartz; Emilia Stoica; Marti A. Hearst
Empirical Methods in Natural Language Processing | 2007
Ariel S. Schwartz; Anna Divoli; Marti A. Hearst
Archive | 2007
Lior Pachter; Ariel S. Schwartz