Dawn J. Lawrie | Researchain

Publication

Featured research published by Dawn J. Lawrie.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2001

Finding topic words for hierarchical summarization

Dawn J. Lawrie; W. Bruce Croft; Arnold L. Rosenberg

Hierarchies have long been used for organization, summarization, and access to information. In this paper we define summarization in terms of a probabilistic language model and use the definition to explore a new technique for automatically generating topic hierarchies by applying a graph-theoretic algorithm, which is an approximation of the Dominating Set Problem. The algorithm efficiently chooses terms according to a language model. We compare the new technique to previous methods proposed for constructing topic hierarchies including subsumption and lexical hierarchies, as well as the top TF.IDF terms. Our results show that the new technique consistently performs as well as or better than these other techniques. They also show the usefulness of hierarchies compared with a list of terms.
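
The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the textbook greedy heuristic for the Dominating Set Problem, not the paper's method. The toy term graph, the name `greedy_dominating_set`, and the unweighted coverage criterion are all assumptions made for illustration; the paper selects terms according to a language model, which is not modeled here.

```python
def greedy_dominating_set(graph):
    """Return a dominating set of an undirected graph, chosen greedily.

    graph maps each vertex to the set of its neighbours. At every step the
    vertex covering the most not-yet-covered vertices is selected.
    """
    uncovered = set(graph)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(graph),
            key=lambda v: len(({v} | graph[v]) & uncovered),
        )
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


# Hypothetical term graph: an edge links two terms that co-occur.
TERM_GRAPH = {
    "retrieval": {"query", "index", "ranking"},
    "query": {"retrieval"},
    "index": {"retrieval"},
    "ranking": {"retrieval", "model"},
    "model": {"ranking", "language"},
    "language": {"model"},
}

topic_terms = greedy_dominating_set(TERM_GRAPH)
print(topic_terms)
```

Every term in the toy graph is either selected or adjacent to a selected term, which is the sense in which a small set of topic words can stand in for the whole vocabulary.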

Conference on Information and Knowledge Management | 2000

Language models for financial news recommendation

Victor Lavrenko; Matthew D. Schmill; Dawn J. Lawrie; Paul Ogilvie; David D. Jensen; James Allan

We present a unique approach to identifying news stories that influence the behavior of financial markets. Specifically, we describe the design and implementation of Ænalyst, a system that can recommend interesting news stories: stories that are likely to affect market behavior. Ænalyst operates by correlating the content of news stories with trends in financial time series. We identify trends in time series using piecewise linear fitting and then assign labels to the trends according to an automated binning procedure. We use language models to represent patterns of language that are highly associated with particular labeled trends. Ænalyst can then identify and recommend news stories that are highly indicative of future trends. We evaluate the system in terms of its ability to recommend the stories that will affect the behavior of the stock market. We demonstrate that stories recommended by Ænalyst could be used to profitably predict forthcoming trends in stock prices.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2003

Generating hierarchical summaries for web searches

Dawn J. Lawrie; W. Bruce Croft

Hierarchies provide a means of organizing, summarizing and accessing information. We describe a method for automatically generating hierarchies from small collections of text, and then apply this technique to summarizing the documents retrieved by a search engine.

Innovations in Systems and Software Engineering | 2007

Effective identifier names for comprehension and memory

Dawn J. Lawrie; Christopher H. Morrell; Henry Feild; David W. Binkley

Readers of programs have two main sources of domain information: identifier names and comments. When functions are uncommented, as many are, comprehension is almost exclusively dependent on the identifier names. Assuming that writers of programs want to create quality identifiers (e.g., identifiers that include relevant domain knowledge), one must ask how should they go about it. For example, do the initials of a concept name provide enough information to represent the concept? If not, and a longer identifier is needed, is an abbreviation satisfactory or does the concept need to be captured in an identifier that includes full words? What is the effect of longer identifiers on limited short term memory capacity? Results from a study designed to investigate these questions are reported. The study involved over 100 programmers who were asked to describe 12 different functions and then recall identifiers that appeared in each function. The functions used three different levels of identifiers: single letters, abbreviations, and full words. Responses allow the extent of comprehension associated with the different levels to be studied along with their impact on memory. The functions used in the study include standard computer science textbook algorithms and functions extracted from production code. The results show that full-word identifiers lead to the best comprehension; however, in many cases, there is no statistical difference between using full words and abbreviations. When considered in the light of limited human short-term memory, well-chosen abbreviations may be preferable in some situations since identifiers with fewer syllables are easier to remember.

Working Conference on Reverse Engineering | 2010

Normalizing Source Code Vocabulary

Dawn J. Lawrie; David W. Binkley; Christopher H. Morrell

Information Retrieval (IR) based tools complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirement and design documents, test plans, as well as the source code) must be consistent. Vocabulary normalization aligns the vocabulary found in source code with that found in other software artifacts. Normalization both splits an identifier into its constituent parts and expands each part into a full dictionary word to match vocabulary in other artifacts. An algorithm for normalization is presented. Its current implementation incorporates a greatly improved splitter that exploits a collection of resources including several dictionaries, frequency distributions derived from the corpus of programs, and co-occurrence data. Empirical study of this new splitter, GenTest, on almost 8000 identifiers finds that it correctly splits 82%, outperforming the current state-of-the-art. A preliminary experiment with the normalization algorithm finds that it improves the FLAT feature locator's scores of relevant code from 0.60 to 0.95 on a scale from 0 to 1.
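
The two normalization steps named above, splitting and expansion, can be illustrated with a minimal sketch. It is not GenTest: it splits only at explicit markers (underscores and case changes) and expands parts through a small hand-written table. The names `split_identifier`, `normalize`, and `EXPANSIONS` are hypothetical, as are the table's contents; the paper's splitter instead draws on dictionaries, frequency distributions, and co-occurrence data.

```python
import re

# Hypothetical abbreviation table, for illustration only.
EXPANSIONS = {"cnt": "count", "idx": "index", "buf": "buffer"}

# An acronym run (not followed by a lowercase letter), a capitalized or
# lowercase word, or a run of digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name):
    """Split an identifier at underscores and camelCase boundaries."""
    parts = []
    for chunk in name.split("_"):
        parts.extend(m.group(0).lower() for m in _PART.finditer(chunk))
    return parts


def normalize(name):
    """Split an identifier, then expand each part that has a known expansion."""
    return " ".join(EXPANSIONS.get(part, part) for part in split_identifier(name))


print(split_identifier("HTTPServer_idx2"))
print(normalize("maxBufCnt"))
```

Identifiers with no explicit markers (for example, several words run together in lowercase) are left as a single part by this sketch; handling them is what requires the additional resources the abstract describes.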

Source Code Analysis and Manipulation | 2007

Extracting Meaning from Abbreviated Identifiers

Dawn J. Lawrie; Henry Feild; David W. Binkley

Informative identifiers are made up of full (natural language) words and (meaningful) abbreviations. Readers of programs typically have little trouble understanding the purpose of identifiers composed of full words. In addition, those familiar with the code can (most often) determine the meaning of abbreviations used in identifiers. However, when faced with unfamiliar code, abbreviations often carry little useful information. Furthermore, tools that focus on the natural language used in the code have a hard time in the presence of abbreviations. One approach to providing meaning to programmers and tools is to translate (expand) abbreviations into full words. This paper presents a methodology for expanding identifiers and evaluates the process on a code base of just over 35 million lines of code. For example, using phrase extraction, fs_exists is expanded to file_status_exists, illustrating how the expansion process can facilitate comprehension. On average, 16 percent of the identifiers in a program are expanded. Finally, as an example application, the approach is used to improve the syntactic identification of violations of Deissenboeck and Pizka's rules for concise and consistent identifier construction.

International Conference on Software Maintenance | 2011

Expanding identifiers to normalize source code vocabulary

Dawn J. Lawrie; David W. Binkley

Maintaining modern software requires significant tool support. Effective tools exploit a variety of information and techniques to aid a software maintainer. One area of recent interest in tool development exploits the natural language information found in source code. Such Information Retrieval (IR) based tools complement traditional static analysis tools and have tackled problems, such as feature location, that otherwise require considerable human effort. To reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirements, design, change requests, tests, and source code) must be consistent. Unfortunately, there is a significant proportion of invented vocabulary in source code. Vocabulary normalization aligns the vocabulary found in the source code with that found in other software artifacts. Most existing work related to normalization has focused on splitting an identifier into its constituent parts. The next step is to expand each part into a (dictionary) word that matches the vocabulary used in other software artifacts. Building on a successful approach to splitting identifiers, an implementation of an expansion algorithm is presented. Experiments on two systems find that up to 66% of identifiers are correctly expanded, which is within about 20% of the current system's best-case performance. Not only is this performance comparable to previous techniques, but the result is achieved in the absence of special purpose rules and is not limited to restricted syntactic contexts. Results from these experiments also show the impact that varying levels of documentation (including both internal documentation such as the requirements and design, and external, or user-level, documentation) have on the algorithm's performance.

International Conference on Program Comprehension | 2009

To camelcase or under_score

David W. Binkley; Marcia H. Davis; Dawn J. Lawrie; Christopher H. Morrell

Naming conventions are generally adopted in an effort to improve program comprehension. Two of the most popular conventions are alternatives for composing multi-word identifiers: the use of underscores and the use of camel casing. While most programmers have a personal opinion as to which style is better, empirical study forms a more appropriate basis for choosing between them. The central hypothesis considered herein is that identifier style affects the speed and accuracy of manipulating programs. An empirical study of 135 programmers and non-programmers was conducted to better understand the impact of identifier style on code readability. The experiment builds on past work of others who study how readers of natural language perform such tasks. Results indicate that camel casing leads to higher accuracy among all subjects regardless of training, and those trained in camel casing are able to recognize identifiers in the camel case style faster than identifiers in the underscore style.

Empirical Software Engineering | 2013

The impact of identifier style on effort and comprehension

Dave W. Binkley; Marcia H. Davis; Dawn J. Lawrie; Jonathan I. Maletic; Christopher H. Morrell; Bonita Sharif

A family of studies investigating the impact of program identifier style on human comprehension is presented. Two popular identifier styles are examined, namely camel case and underscore. The underlying hypothesis is that identifier style affects the speed and accuracy of comprehending source code. To investigate this hypothesis, five studies were designed and conducted. The first study, which investigates how well humans read identifiers in the two different styles, focuses on low-level readability issues. The remaining four studies build on the first to focus on the semantic implications of identifier style. The studies involve 150 participants with varied demographics from two different universities. A range of experimental methods is used in the studies including timed testing, read aloud, and eye tracking. These methods produce a broad set of measurements and appropriate statistical methods, such as regression models and Generalized Linear Mixed Models (GLMMs), are applied to analyze the results. While unexpected, the results demonstrate that the tasks of reading and comprehending source code are fundamentally different from those of reading and comprehending natural language. Furthermore, as the task becomes similar to reading prose, the results become similar to work on reading natural language text. For more “source focused” tasks, experienced software developers appear to be less affected by identifier style; however, beginners benefit from the use of camel casing with respect to accuracy and effort.

International Conference on Program Comprehension | 2006

Leveraged Quality Assessment using Information Retrieval Techniques

Dawn J. Lawrie; Henry Feild; David W. Binkley

The goal of this research is to apply language processing techniques to extend human judgment into situations where obtaining direct human judgment is impractical due to the volume of information that must be considered. One aspect of this is leveraged quality assessments, which can be used to evaluate third-party coded subsystems, to track quality across the versions of a program, to assess the comprehension effort (and subsequent cost) required to make a change, and to identify parts of a program in need of preventative maintenance. A description of the QALP tool, its output from just under two million lines of code, and an experiment aimed at evaluating the tool's use in leveraged quality assessment are presented. Statistically significant results from this experiment validate the use of the QALP tool in human leveraged quality assessment.
