Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where David Buttler is active.

Publication


Featured research published by David Buttler.


international conference on distributed computing systems | 2001

A fully automated object extraction system for the World Wide Web

David Buttler; Ling Liu; Calton Pu

This paper presents a fully automated object extraction system, Omini. A distinct feature of Omini is the suite of algorithms and the automatically learned information extraction rules for discovering and extracting objects from dynamic or static Web pages that contain multiple object instances. We evaluated the system using more than 2,000 Web pages over 40 sites. It achieves 100% precision (it returns only correct objects) and excellent recall (between 98% and 99%, with very few significant objects left out). The object boundary identification algorithms are fast, taking about 0.1 seconds per page with a simple optimization.
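
The object-extraction step Omini automates can be pictured with a small heuristic over the page's tag tree. The sketch below is only an illustration of the general idea, not Omini's algorithms: it assumes the page parses with Python's standard HTMLParser and simply picks the subtree whose children most often repeat a single tag as the candidate object region.

    # Minimal sketch of an "object-rich subtree" heuristic.
    # Assumption: the page parses with the stdlib HTMLParser; the real Omini
    # system combines several ranked heuristics, not just child-tag repetition.
    from html.parser import HTMLParser
    from collections import Counter

    class TagTreeBuilder(HTMLParser):
        """Builds a simple nested-dict tag tree from an HTML page."""
        def __init__(self):
            super().__init__()
            self.root = {"tag": "#root", "children": []}
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = {"tag": tag, "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)

        def handle_endtag(self, tag):
            if len(self.stack) > 1:
                self.stack.pop()

    def object_rich_subtree(node):
        """Return the subtree whose children most often repeat one tag,
        a crude stand-in for object-region discovery."""
        best, best_score = node, 0
        stack = [node]
        while stack:
            current = stack.pop()
            counts = Counter(child["tag"] for child in current["children"])
            score = max(counts.values(), default=0)
            if score > best_score:
                best, best_score = current, score
            stack.extend(current["children"])
        return best

    html = ("<html><body><h1>Results</h1><ul>"
            + "".join(f"<li>item {i}</li>" for i in range(5))
            + "</ul></body></html>")
    builder = TagTreeBuilder()
    builder.feed(html)
    region = object_rich_subtree(builder.root)
    print(region["tag"], len(region["children"]))  # ul 5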


international conference on management of data | 2001

Wrapping web data into XML

Wei Han; David Buttler; Calton Pu

The vast majority of online information is part of the World Wide Web. In order to use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique for converting HTML documents into semantically meaningful XML files. However, developing wrappers is slow and labor-intensive, and frequent changes to the HTML documents typically require frequent changes to the wrappers. This paper describes XWRAP Elite, a tool that automatically generates robust wrappers. XWRAP breaks the conversion process down into three steps. First, discover where the data is located in an HTML page and separate the data into individual objects. Second, decompose objects into data elements. Third, mark up objects and elements in an output format. XWRAP Elite automates the first two steps and minimizes human involvement in marking up the output data. Our experience shows that XWRAP is able to create useful wrapper software for a wide variety of real-world HTML documents.
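
The three-step conversion described above can be illustrated with a toy wrapper. The code below is a hypothetical simplification, not XWRAP Elite: it assumes the objects are table rows, decomposes them into cell elements, and marks the result up as XML with Python's standard library; the wrap_table function and its element names are made up for the example.

    # Toy three-step wrapper (not the real XWRAP Elite tool).
    # Step 1: locate objects (assumed here to be <tr> rows).
    # Step 2: decompose each object into data elements (<td> cells).
    # Step 3: mark objects and elements up as XML.
    import re
    import xml.etree.ElementTree as ET

    def wrap_table(html, element_names):
        objects = re.findall(r"<tr>(.*?)</tr>", html, flags=re.S)      # step 1
        root = ET.Element("records")
        for obj in objects:
            cells = re.findall(r"<td>(.*?)</td>", obj, flags=re.S)     # step 2
            record = ET.SubElement(root, "record")                     # step 3
            for name, value in zip(element_names, cells):
                ET.SubElement(record, name).text = value.strip()
        return ET.tostring(root, encoding="unicode")

    page = ("<table><tr><td>Omini</td><td>2001</td></tr>"
            "<tr><td>THOR</td><td>2004</td></tr></table>")
    print(wrap_table(page, ["system", "year"]))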


international conference on data engineering | 2004

Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

James Caverlee; Ling Liu; David Buttler

We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally-similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets.
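
THOR's first phase, grouping pages from a site into clusters of structurally similar pages, can be sketched with a simple similarity measure. The fragment below is an assumed approximation, not THOR: each page is represented as a bag of tag-path counts, and a greedy single pass with a cosine threshold stands in for the paper's clustering; the subtree-filtering phase is omitted.

    # Sketch of phase one: cluster pages by structural similarity.
    # The tag-path representation, greedy assignment, and 0.8 threshold are
    # illustrative assumptions, not THOR's actual algorithm.
    from collections import Counter
    from math import sqrt

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = (sqrt(sum(v * v for v in a.values()))
                * sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def cluster_pages(pages, threshold=0.8):
        """pages: list of Counters of tag-path frequencies,
        e.g. Counter({'html/body/table/tr': 12})."""
        clusters = []  # each cluster is (centroid Counter, [page indices])
        for i, page in enumerate(pages):
            best = max(clusters, key=lambda c: cosine(c[0], page), default=None)
            if best and cosine(best[0], page) >= threshold:
                best[0].update(page)          # fold the page into the centroid
                best[1].append(i)
            else:
                clusters.append((Counter(page), [i]))
        return [members for _, members in clusters]

    result_page_a = Counter({"html/body/table/tr": 20, "html/body/h1": 1})
    result_page_b = Counter({"html/body/table/tr": 18, "html/body/h1": 1})
    error_page = Counter({"html/body/p": 3})
    print(cluster_pages([result_page_a, result_page_b, error_page]))  # [[0, 1], [2]]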


international conference on management of data | 1998

CQ: a personalized update monitoring toolkit

Ling Liu; Calton Pu; Wei Tang; David Buttler; John Biggs; Tong Zhou; Paul Benninghoff; Wei Han; Fenghua Yu

The CQ project at OGI, funded by DARPA, aims at developing a scalable toolkit and techniques for update monitoring and event-driven information delivery on the net. The main feature of the CQ project is a “personalized update monitoring” toolkit based on continual queries [3]. Compared with pure pull technology (such as DBMSs and various web search engines) and pure push technology (such as PointCast, Marimba, and broadcast disks), the CQ project can be seen as a hybrid approach that combines pull and push by supporting personalized update monitoring through a combined client-pull and server-push paradigm.
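
The hybrid pull/push pattern behind a continual query can be pictured as a trigger that polls a source on the client's behalf and delivers a message only when the monitored condition fires. The loop below is a minimal sketch of that pattern, not the CQ toolkit; fetch_price, notify, and the 5% trigger are hypothetical stand-ins.

    # Minimal continual-query loop: pull from the source on a schedule,
    # push a notification only when the trigger condition fires.
    # fetch_price() and notify() are hypothetical stand-ins, not CQ APIs.
    import random
    import time

    def fetch_price(symbol: str) -> float:
        """Stand-in for pulling the current value from a remote source."""
        return round(random.uniform(90.0, 110.0), 2)

    def notify(user: str, message: str) -> None:
        """Stand-in for pushing an update out to the user (email, feed, pager)."""
        print(f"to {user}: {message}")

    def continual_query(user, symbol, trigger, interval_s=60, max_polls=5):
        """Poll the source (client pull) and deliver a message only when the
        trigger condition holds (server push)."""
        last = None
        for _ in range(max_polls):
            value = fetch_price(symbol)                      # client pull
            if last is None or trigger(value, last):
                notify(user, f"{symbol} is now {value}")     # server push
                last = value
            time.sleep(interval_s)

    # Example continual query: tell me when the value moves by more than 5%.
    moved_5_percent = lambda new, old: abs(new - old) / old > 0.05
    continual_query("alice", "ACME", moved_5_percent, interval_s=1, max_polls=3)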


knowledge discovery and data mining | 2011

Latent topic feedback for information retrieval

David Andrzejewski; David Buttler

We consider the problem of a user navigating an unfamiliar corpus of text documents where document metadata is limited or unavailable, the domain is specialized, and the user base is small. These challenging conditions may hold, for example, within an organization such as a business or government agency. We propose to augment standard keyword search with user feedback on latent topics. These topics are automatically learned from the corpus in an unsupervised manner and presented alongside search results. User feedback is then used to reformulate the original query, resulting in improved information retrieval performance in our experiments.
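
The feedback step, reformulating the original keyword query with terms from topics the user endorses, can be sketched independently of how the topics are learned. The snippet below is an illustrative assumption, not the paper's reformulation scheme: it takes topics as precomputed ranked word lists (for example, from LDA) and appends the top words of the selected topics to the query.

    # Sketch of latent-topic feedback: expand a keyword query with the top
    # words of topics the user marked as relevant. Assumes topics were
    # already learned offline; the word count per topic is illustrative.
    def reformulate(query_terms, topics, selected_topic_ids, words_per_topic=3):
        expansion = []
        for topic_id in selected_topic_ids:
            for word in topics[topic_id][:words_per_topic]:
                if word not in query_terms and word not in expansion:
                    expansion.append(word)
        return list(query_terms) + expansion

    topics = {
        0: ["reactor", "neutron", "flux", "core"],
        1: ["policy", "budget", "agency", "audit"],
    }
    print(reformulate(["shutdown", "procedure"], topics, selected_topic_ids=[0]))
    # ['shutdown', 'procedure', 'reactor', 'neutron', 'flux']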


Archive | 2009

Reconcile: A Coreference Resolution Research Platform

Veselin Stoyanov; Claire Cardie; Nathan Gilbert; Ellen Riloff; David Buttler; David Hysom

This research was supported in part by Lawrence Livermore National Laboratory subcontract B573245 and the Department of Homeland Security under ONR Grant N0014-07-1-0152.


knowledge discovery and data mining | 2007

Tracking multiple topics for finding interesting articles

Raymond K. Pon; Alfonso F. Cardenas; David Buttler; Terence Critchlow

We introduce multiple topic tracking (MTT) for iScore to better recommend news articles for users with multiple interests and to address changes in user interests over time. As an extension of the basic Rocchio algorithm, traditional topic detection and tracking, and single-pass clustering, MTT maintains multiple interest profiles to identify interesting articles for a specific user given user feedback. Focusing on only interesting topics enables iScore to discard useless profiles, addressing changes in user interests while achieving a balance between resource consumption and classification accuracy. By relating a topic's interestingness to an article's interestingness, iScore is also able to achieve higher-quality results than traditional methods such as the Rocchio algorithm. We identify several operating parameters that work well for MTT. Using the same parameters, we show that MTT alone yields high-quality results for recommending interesting articles from several corpora. The inclusion of MTT improves iScore's performance by 9% in recommending news articles from the Yahoo! News RSS feeds and the TREC11 adaptive filter article collection. Through a small user study, we also show that iScore can still perform well when provided with only a little user feedback.
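
The core mechanism, several interest profiles scored against each article and updated with Rocchio-style feedback, can be sketched with plain vectors. The class below is a hypothetical approximation of that idea, not the iScore implementation; the similarity threshold and update weights are assumed values.

    # Sketch of multiple-topic tracking: keep one profile vector per interest,
    # score an article by its best-matching profile, and on positive feedback
    # update that profile Rocchio-style (or start a new profile if nothing matches).
    import numpy as np

    class MultiTopicTracker:
        def __init__(self, new_topic_threshold=0.3, alpha=0.8, beta=0.2):
            self.threshold = new_topic_threshold
            self.alpha, self.beta = alpha, beta
            self.profiles = []                      # list of unit vectors

        def _similarities(self, article_vec):
            return [float(article_vec @ p) for p in self.profiles]

        def score(self, article_vec):
            """Interestingness = similarity to the closest interest profile."""
            return max(self._similarities(article_vec), default=0.0)

        def feedback(self, article_vec, interesting):
            """Rocchio-style update of the best-matching profile."""
            if not interesting:
                return
            sims = self._similarities(article_vec)
            if not sims or max(sims) < self.threshold:
                self.profiles.append(article_vec / np.linalg.norm(article_vec))
                return
            i = int(np.argmax(sims))
            updated = self.alpha * self.profiles[i] + self.beta * article_vec
            self.profiles[i] = updated / np.linalg.norm(updated)

    # Tiny example with 3-term vectors (terms: ["nuclear", "budget", "sports"]).
    tracker = MultiTopicTracker()
    tracker.feedback(np.array([1.0, 0.0, 0.0]), interesting=True)   # nuclear story
    tracker.feedback(np.array([0.0, 1.0, 0.0]), interesting=True)   # budget story
    print(round(tracker.score(np.array([0.9, 0.1, 0.0])), 2))       # 0.9: closest to the first profile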


international conference on management of data | 1999

An XML-based wrapper generator for Web information extraction

Ling Liu; Wei Han; David Buttler; Calton Pu; Wei Tang

There has been tremendous interest in information integration systems that automatically gather, manipulate, and integrate data from multiple information sources on a user's behalf. Unfortunately, web sites are primarily designed for human browsing rather than for use by a computer program. Mechanically extracting their content is in general a rather difficult job, if not impossible [4]. Software systems using such web information sources typically use hand-coded wrappers to extract information content of interest from web sources and translate query responses to a more structured format (e.g., relational form) before unifying them into an integrated answer to a user's query. The most recent generation of information mediator systems (e.g., Ariadne [3], CQ [5, 7], Internet Softbots [4], TSIMMIS [2]) addresses this problem by enabling a pre-wrapped set of web sources to be accessed via database-like queries. However, hand-coding a wrapper is time consuming and error-prone. We have also observed that, with a good design methodology, only a relatively small part of the code deals with the source-specific access details; the rest of the code is either common among wrappers or can be expressed in a high-level, more structured fashion. As the Web grows, maintaining a reasonable number of wrappers becomes impractical. First, the number of information sources of interest to a user query can be quite large, even within a particular domain. Second, new information sources are constantly added on the Web. Third, the content and presentation format of the existing information sources may change frequently and autonomously. With these observations in mind, we have developed a wrapper generation system, called XWrap, for semi-automatic construction of wrappers for Web information sources. The system contains a library of commonly used functions, such as receiving queries from applications, handling filter queries, and packaging results. It also contains source-specific facilities that are in charge of mapping a mediator query to a remote connection call to fetch the relevant pages and translating the retrieved page(s) into a more structured format (such as XML documents or relational tables). A distinct feature of our wrapper generator is its ability to provide an XML-enabled, feedback-based, interactive wrapper construction facility for Internet information sources. By XML-enabled we mean that the extraction of information content from Web pages is captured in XML form and filter queries are processed against XML documents. By feedback-based we mean that the wrapper construction process is revisited and tuned according to the feedback received by the wrapper manager.
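
The division of labor the abstract describes, a shared wrapper library plus a small amount of source-specific logic that yields XML, can be pictured with a toy declarative wrapper. The sketch below is an assumption-laden simplification, not XWrap's architecture or rule language: the source-specific part is reduced to a handful of regular expressions, and the shared part applies them and packages the result as XML.

    # Toy declarative wrapper: the source-specific part is only a record pattern
    # plus a dict of field-extraction rules; everything else (applying the rules,
    # packaging the result as XML) is shared machinery. A hypothetical
    # simplification of the idea behind XWrap, not its actual design.
    import re
    import xml.etree.ElementTree as ET

    class DeclarativeWrapper:
        def __init__(self, record_pattern, field_patterns):
            self.record_pattern = re.compile(record_pattern, re.S)
            self.field_patterns = {f: re.compile(p, re.S)
                                   for f, p in field_patterns.items()}

        def wrap(self, html, root_tag="results"):
            """Shared machinery: apply the source-specific rules, emit XML."""
            root = ET.Element(root_tag)
            for chunk in self.record_pattern.findall(html):
                record = ET.SubElement(root, "record")
                for field, pattern in self.field_patterns.items():
                    match = pattern.search(chunk)
                    if match:
                        ET.SubElement(record, field).text = match.group(1).strip()
            return ET.tostring(root, encoding="unicode")

    # Source-specific configuration for one hypothetical site layout.
    wrapper = DeclarativeWrapper(
        record_pattern=r"<div class='hit'>(.*?)</div>",
        field_patterns={"title": r"<b>(.*?)</b>", "year": r"\((\d{4})\)"},
    )
    page = ("<div class='hit'><b>XWrap</b> (1999)</div>"
            "<div class='hit'><b>Omini</b> (2001)</div>")
    print(wrapper.wrap(page))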


international conference on tools with artificial intelligence | 2006

Multi-Criterion Active Learning in Conditional Random Fields

Christopher T. Symons; Nagiza F. Samatova; Ramya Krishnamurthy; Byung-Hoon Park; Tarik Umar; David Buttler; Terence Critchlow; David Hysom

Conditional random fields (CRFs), which are popular supervised learning models for many natural language processing (NLP) tasks, typically require a large collection of labeled data for training. In practice, however, manual annotation of text documents is quite costly. Furthermore, even large labeled training sets can have arbitrarily limited performance peaks if they are not chosen with care. This paper considers the use of multi-criterion active learning for identification of a small but sufficient set of text samples for training CRFs. Our empirical results demonstrate that our method is capable of reducing the manual annotation costs, while also limiting the retraining costs that are often associated with active learning. In addition, we show that the generalization performance of CRFs can be enhanced through judicious selection of training examples.
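
The selection idea, combining more than one criterion when choosing which unlabeled samples to annotate, can be sketched without a full CRF. The scorer below mixes model uncertainty (entropy of predicted label probabilities) with a representativeness term; the 0.7/0.3 weighting and the cosine-density proxy are illustrative assumptions, not the paper's criteria.

    # Sketch of multi-criterion sample selection: rank unlabeled items by a
    # weighted mix of (1) uncertainty, measured as the entropy of the model's
    # predicted label distribution, and (2) representativeness, measured as the
    # item's average similarity to the rest of the unlabeled pool.
    import numpy as np

    def entropy(probs):
        probs = np.clip(probs, 1e-12, 1.0)
        return float(-(probs * np.log(probs)).sum())

    def select_batch(label_probs, features, batch_size=2, w_uncertainty=0.7):
        """label_probs: (n, k) predicted label distributions for unlabeled items;
        features: (n, d) unit-normalized feature vectors for the same items."""
        uncertainty = np.array([entropy(p) for p in label_probs])
        similarity = features @ features.T            # cosine, since unit-normalized
        representativeness = similarity.mean(axis=1)
        score = w_uncertainty * uncertainty + (1 - w_uncertainty) * representativeness
        return [int(i) for i in np.argsort(-score)[:batch_size]]

    probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.55, 0.45], [0.99, 0.01]])
    feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
    print(select_batch(probs, feats))   # [0, 2]: the two most uncertain items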


international world wide web conferences | 2005

Automatic Discovery and Inferencing of Complex Bioinformatics Web Interfaces

Anne H. H. Ngu; Daniel Rocco; Terence Critchlow; David Buttler

The World Wide Web provides a vast resource to genomics researchers, with Web-based access to distributed data sources such as BLAST sequence homology search interfaces. However, finding the desired scientific information can still be very tedious and frustrating. While there are several known servers on genomic data (e.g., GenBank, EMBL, NCBI) that are shared and accessed frequently, new data sources are created each day in laboratories all over the world. Sharing these new genomics results is hindered by the lack of a common interface or data exchange mechanism. Moreover, the number of autonomous genomics sources and their rate of change outpace the speed at which they can be manually identified, meaning that the available data is not being utilized to its full potential. An automated system that can find, classify, describe, and wrap new sources without tedious and low-level coding of source-specific wrappers is needed to assist scientists in accessing hundreds of dynamically changing bioinformatics Web data sources through a single interface. A correct classification of any kind of Web data source must address both the capability of the source and the conversation/interaction semantics inherent in the design of the data source. We propose a service class description (SCD), a meta-data approach for classifying Web data sources that takes into account both the capability and the conversational semantics of the source. The ability to discover the interaction pattern of a Web source leads to increased accuracy in the classification process. Our results show that an SCD-based approach successfully classifies two-thirds of BLAST sites with 100% accuracy and two-thirds of bioinformatics keyword search sites with around 80% precision.
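
The classification step can be pictured as matching an observed web form and its result page against a service class description that lists required inputs and expected output markers. The sketch below uses a toy dictionary encoding of an SCD invented for the example; the real SCD formalism is richer and also models the source's interaction pattern.

    # Toy service-class check: a candidate web form "belongs" to a service class
    # if it exposes the inputs the class description requires and its result page
    # shows the expected markers. This dictionary encoding of an SCD is an
    # illustrative assumption, not the paper's formalism.
    SCD_BLAST = {
        "required_inputs": {"sequence", "program", "database"},
        "result_markers": {"alignments", "e-value"},
    }

    def matches_service_class(scd, form_fields, result_text):
        has_inputs = scd["required_inputs"] <= {f.lower() for f in form_fields}
        has_outputs = all(m in result_text.lower() for m in scd["result_markers"])
        return has_inputs and has_outputs

    form = ["Sequence", "Program", "Database", "Email"]
    result = "Alignments found ... Score (bits)  E-value ..."
    print(matches_service_class(SCD_BLAST, form, result))   # True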

Collaboration


Dive into David Buttler's collaborations.

Top Co-Authors

Ling Liu, Georgia Institute of Technology
Terence Critchlow, Pacific Northwest National Laboratory
Calton Pu, Georgia Institute of Technology College of Computing
Raymond K. Pon, University of California
Wei Han, Georgia Institute of Technology
Daniel Rocco, University of West Georgia
Jianjun Zhang, Georgia Institute of Technology
Matthew A. Coleman, Lawrence Livermore National Laboratory