Dorothea Blostein
Queen's University
Publications
Featured research published by Dorothea Blostein.
International Journal on Document Analysis and Recognition | 2004
Richard Zanibbi; Dorothea Blostein; James R. Cordy
Table characteristics vary widely. Consequently, a great variety of computational approaches have been applied to table recognition. In this survey, the table recognition literature is presented as an interaction of table models, observations, transformations, and inferences. A table model defines the physical and logical structure of tables; the model is used to detect tables and to analyze and decompose the detected tables. Observations perform feature measurements and data lookup, transformations alter or restructure data, and inferences generate and test hypotheses. This presentation clarifies both the decisions made by a table recognizer and the assumptions and inferencing techniques that underlie these decisions.
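To make the survey's vocabulary concrete, the sketch below (hypothetical, not from the paper) treats blank character columns in plain text as an observation, proposes column boundaries as a hypothesis, and tests the hypothesis against a crude physical table model:

```python
# Hypothetical illustration of observation / hypothesis generation /
# model testing for table recognition; the whitespace-column heuristic
# and all names are illustrative, not from the paper.
LINES = [
    "Name    Qty  Price",
    "Apple     3   1.20",
    "Banana    5   0.50",
]

def observe_gap_columns(lines):
    """Observation: character columns that are blank on every line."""
    width = max(len(l) for l in lines)
    padded = [l.ljust(width) for l in lines]
    return [c for c in range(width) if all(l[c] == " " for l in padded)]

def generate_hypothesis(gaps):
    """Inference: propose one column boundary per run of blank columns."""
    boundaries, run = [], []
    for c in gaps:
        if run and c != run[-1] + 1:
            boundaries.append(run[0])
            run = []
        run.append(c)
    if run:
        boundaries.append(run[0])
    return boundaries

def test_model(lines, boundaries):
    """Test: accept if there are at least two columns and every row has
    a non-empty value in every cell (a crude physical-structure model)."""
    if len(boundaries) < 1:
        return False
    cuts = [0] + boundaries + [None]
    for line in lines:
        cells = [line[a:b].strip() for a, b in zip(cuts, cuts[1:])]
        if any(not c for c in cells):
            return False
    return True

gaps = observe_gap_columns(LINES)
cols = generate_hypothesis(gaps)
print("column boundaries:", cols, "accepted:", test_model(LINES, cols))
```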
IEEE Transactions on Pattern Analysis and Machine Intelligence | 1989
Dorothea Blostein; Narendra Ahuja
A method is presented for identifying texture elements while simultaneously recovering the orientation of textured surfaces. A multiscale region detector, based on measurements in a ∇²G (Laplacian-of-Gaussian) scale space, is used to construct a set of candidate texture elements. True elements are selected from the set of candidate elements by finding the planar surface that best predicts the observed areas of the candidates. Results are shown for a variety of natural textures, including waves, flowers, rocks, clouds, and dirt clods.
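The multiscale detection step can be sketched with off-the-shelf tools. The following illustrative Python is not the authors' implementation; scipy's gaussian_laplace stands in for the ∇²G filtering, and candidates are taken as extrema of the scale-normalized response across space and scale:

```python
# Hypothetical sketch of candidate texture-element detection in a
# Laplacian-of-Gaussian scale space. Requires numpy and scipy.
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_scale_space(image, sigmas):
    """Scale-normalized LoG responses: sigma^2 * gaussian_laplace(image)."""
    return np.stack([s**2 * gaussian_laplace(image, s) for s in sigmas])

def candidate_elements(image, sigmas, thresh=0.1):
    """Candidates are local maxima of |response| over space and scale."""
    stack = np.abs(log_scale_space(image, sigmas))
    peaks = []
    for k in range(len(sigmas)):
        for y in range(1, image.shape[0] - 1):
            for x in range(1, image.shape[1] - 1):
                v = stack[k, y, x]
                if v < thresh:
                    continue
                nbhd = stack[max(k - 1, 0):k + 2, y - 1:y + 2, x - 1:x + 2]
                if v >= nbhd.max():
                    # (x, y, sigma); the element's area scales with sigma^2
                    peaks.append((x, y, sigmas[k]))
    return peaks

img = np.zeros((64, 64))
img[28:36, 28:36] = 1.0  # one bright square "texel"
print(candidate_elements(img, sigmas=[2.0, 4.0, 8.0])[:5])
```

The paper's further step, choosing the planar surface whose projected element areas best match the observed ones, would then filter these candidates.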
International Journal on Document Analysis and Recognition | 2012
Richard Zanibbi; Dorothea Blostein
Document recognition and retrieval technologies complement one another, providing improved access to increasingly large document collections. While recognition and retrieval of textual information is fairly mature, with widespread availability of optical character recognition and text-based search engines, recognition and retrieval of graphics such as images, figures, tables, diagrams, and mathematical expressions are in comparatively early stages of research. This paper surveys the state of the art in recognition and retrieval of mathematical expressions, organized around four key problems in math retrieval (query construction, normalization, indexing, and relevance feedback), and four key problems in math recognition (detecting expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation of meaning). Of special interest is the machine learning problem of jointly optimizing the component algorithms in a math recognition system, and developing effective indexing, retrieval and relevance feedback algorithms for math retrieval. Another important open problem is developing user interfaces that seamlessly integrate recognition and retrieval. Activity in these important research areas is increasing, in part because math notation provides an excellent domain for studying problems common to many document and graphics recognition and retrieval applications, and also because mature applications will likely provide substantial benefits for education, research, and mathematical literacy.
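To make one of the four retrieval problems concrete, here is a hypothetical sketch of query normalization, using Python's ast module as a stand-in for a math-notation parser; sorting commutative operands lets "b+a" and "a+b" share one index key:

```python
# Hypothetical sketch of math-query normalization; not from the paper.
import ast

OPS = {ast.Add: "+", ast.Mult: "*", ast.Sub: "-", ast.Div: "/"}
COMMUTATIVE = {ast.Add, ast.Mult}

def canonical(node):
    """Render an expression tree with commutative operands sorted."""
    if isinstance(node, ast.BinOp):
        left, right = canonical(node.left), canonical(node.right)
        if type(node.op) in COMMUTATIVE:
            left, right = sorted([left, right])
        return "(" + left + OPS[type(node.op)] + right + ")"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError("unsupported expression node")

def index_key(expr):
    return canonical(ast.parse(expr, mode="eval").body)

print(index_key("b + a"))  # (a+b)
print(index_key("a + b"))  # (a+b) -- same key, so the queries match
```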
Structured Document Image Analysis (Springer) | 1992
Dorothea Blostein; Henry S. Baird
The research literature concerning the automatic analysis of images of printed and handwritten music notation, for the period 1966 through 1990, is surveyed and critically examined.
International Journal on Document Analysis and Recognition | 2007
Nawei Chen; Dorothea Blostein
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
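The architecture the survey dissects can be sketched as feature extraction plus a class model. The toy example below (hypothetical features and classes, not from the paper) classifies a page image by grid ink density, with no OCR involved:

```python
# Hypothetical sketch of OCR-free document image classification.
# Requires numpy.
import numpy as np

def density_features(page, grid=(4, 4)):
    """Fraction of dark pixels in each cell of a grid over the page."""
    gy, gx = grid
    h, w = page.shape
    cells = [
        page[y*h//gy:(y+1)*h//gy, x*w//gx:(x+1)*w//gx].mean()
        for y in range(gy) for x in range(gx)
    ]
    return np.array(cells)

def classify(page, models):
    """Assign the class whose exemplar is nearest in feature space."""
    f = density_features(page)
    return min(models, key=lambda label: np.linalg.norm(f - models[label]))

# Toy "training": a letter is text-dense everywhere; a form has sparse rules.
letter = np.random.rand(64, 64) < 0.3         # ~30% ink coverage
form = np.zeros((64, 64))
form[::8, :] = 1                              # horizontal ruled lines
models = {"letter": density_features(letter), "form": density_features(form)}
print(classify(np.random.rand(64, 64) < 0.25, models))  # likely "letter"
```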
Source Code Analysis and Manipulation | 2010
Stephen W. Thomas; Bram Adams; Ahmed E. Hassan; Dorothea Blostein
Topics are collections of words that co-occur frequently in a text corpus. Topics have been found to be effective tools for describing the major themes spanning a corpus. Using such topics to describe the evolution of a software system’s source code promises to be extremely useful for development tasks such as maintenance and re-engineering. However, no one has yet examined whether these automatically discovered topics accurately describe the evolution of source code, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards determining the suitability of topic models in the analysis of software evolution by performing a qualitative case study on 12 releases of JHotDraw, a well studied and documented system. We define and compute various metrics on the identified topics and manually investigate how the metrics evolve over time. We find that topic evolutions are characterizable through spikes and drops in their metric values, and that the large majority of these spikes and drops are indeed caused by actual change activity in the source code. We are thus encouraged by the use of topic models as a tool for analyzing the evolution of software.
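The underlying machinery can be sketched with a standard LDA implementation; the toy corpus and parameters below are illustrative stand-ins, not the paper's JHotDraw study:

```python
# Hypothetical sketch: fit a topic model (LDA) to token "documents"
# drawn from source files, then inspect the top words per topic.
# Requires scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # identifiers/comments extracted from imaginary source files
    "draw figure canvas paint rectangle figure draw",
    "canvas paint color figure draw shape",
    "parse token grammar symbol parse tree token",
    "token grammar rule parse symbol",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [words[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}:", " ".join(top))  # ideally drawing vs. parsing themes
```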
Science of Computer Programming | 2014
Stephen W. Thomas; Bram Adams; Ahmed E. Hassan; Dorothea Blostein
Topic models are generative probabilistic models which have been applied to information retrieval to automatically organize and provide structure to a text corpus. Topic models discover topics in the corpus, which represent real world concepts by frequently co-occurring words. Recently, researchers found topics to be effective tools for structuring various software artifacts, such as source code, requirements documents, and bug reports. This research also hypothesized that using topics to describe the evolution of software repositories could be useful for maintenance and understanding tasks. However, research has yet to determine whether these automatically discovered topic evolutions describe the evolution of source code in a way that is relevant or meaningful to project stakeholders, and thus it is not clear whether topic models are a suitable tool for this task. In this paper, we take a first step towards evaluating topic models in the analysis of software evolution by performing a detailed manual analysis on the source code histories of two well-known and well-documented systems, JHotDraw and jEdit. We define and compute various metrics on the discovered topic evolutions and manually investigate how and why the metrics evolve over time. We find that the large majority (87%-89%) of topic evolutions correspond well with actual code change activities by developers. We are thus encouraged to use topic models as tools for studying the evolution of a software system.
Highlights:
- We apply an advanced IR technique, called topic models, to source code histories.
- High-level topic evolutions are created that describe the changes to source code.
- We examine whether the topic evolutions are accurate and meaningful to developers.
- After two case studies, we conclude that topic models are mostly accurate and meaningful.
Mining Software Repositories | 2011
Stephen W. Thomas; Bram Adams; Ahmed E. Hassan; Dorothea Blostein
Studying the evolution of topics (collections of co-occurring words) in a software project is an emerging technique to automatically shed light on how the project is changing over time: which topics are becoming more actively developed, which ones are dying down, or which topics are lately more error-prone and hence require more testing. Existing techniques for modeling the evolution of topics in software projects suffer from issues of data duplication, i.e., when the repository contains multiple copies of the same document, as is the case in source code histories. To address this issue, we propose the Diff model, which applies a topic model only to the changes of the documents in each version instead of to the whole document at each version. A comparative study with a state-of-the-art topic evolution model shows that the Diff model can detect more distinct topics as well as more sensitive and accurate topic evolutions, which are both useful for analyzing source code histories.
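The Diff model's key move, building the topic model's input from per-version changes rather than whole documents, can be sketched as follows; difflib stands in for a real VCS diff, and the history is a toy, not the paper's code:

```python
# Hypothetical sketch of change-based pseudo-documents for a Diff-style
# topic evolution model.
import difflib

versions = {  # version -> file contents (toy history of one file)
    "v1": ["class Canvas:", "    def draw(self): pass"],
    "v2": ["class Canvas:", "    def draw(self): pass",
           "    def erase(self): pass"],
    "v3": ["class Canvas:", "    def draw(self): self.paint()",
           "    def erase(self): pass"],
}

def change_documents(versions):
    """One pseudo-document per version, containing only added lines."""
    docs, tags = {}, sorted(versions)
    for old, new in zip(tags, tags[1:]):
        added = [line[2:] for line in
                 difflib.ndiff(versions[old], versions[new])
                 if line.startswith("+ ")]
        docs[new] = " ".join(added)
    return docs

for tag, doc in change_documents(versions).items():
    print(tag, "->", doc)  # feed these, not whole files, to the topic model
```

Because unchanged lines never appear twice, the duplication that plagues whole-document modeling of source histories is avoided by construction.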
Empirical Software Engineering | 2014
Stephen W. Thomas; Hadi Hemmati; Ahmed E. Hassan; Dorothea Blostein
Software development teams use test suites to test changes to their source code. In many situations, the test suites are so large that executing every test for every source code change is infeasible, due to time and resource constraints. Development teams need to prioritize their test suite so that as many distinct faults as possible are detected early in the execution of the test suite. We consider the problem of static black-box test case prioritization (TCP), where test suites are prioritized without the availability of the source code of the system under test (SUT). We propose a new static black-box TCP technique that represents test cases using a previously unused data source in the test suite: the linguistic data of the test cases, i.e., their identifier names, comments, and string literals. Our technique applies a text analysis algorithm called topic modeling to the linguistic data to approximate the functionality of each test case, allowing our technique to give high priority to test cases that test different functionalities of the SUT. We compare our proposed technique with existing static black-box TCP techniques in a case study of multiple real-world open source systems: several versions of Apache Ant and Apache Derby. We find that our static black-box TCP technique outperforms existing static black-box TCP techniques, and has comparable or better performance than two existing execution-based TCP techniques. Static black-box TCP methods are widely applicable because the only input they require is the source code of the test cases themselves. This contrasts with other TCP techniques which require access to the SUT runtime behavior, to the SUT specification models, or to the SUT source code.
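A minimal sketch of the prioritization idea follows, with plain term-frequency vectors standing in for real topic-model output; the test names and corpus are hypothetical:

```python
# Hypothetical sketch of static black-box TCP over linguistic data:
# greedily order tests so each next pick is maximally dissimilar to
# those already chosen. Requires numpy.
import numpy as np

tests = {  # test name -> identifiers/comments/string literals
    "testLoginOk":   "login user password session auth",
    "testLoginFail": "login user password error retry",
    "testCartTotal": "cart item price total checkout",
    "testCartEmpty": "cart empty item remove",
}

vocab = sorted({w for text in tests.values() for w in text.split()})

def vector(text):
    counts = np.array([text.split().count(w) for w in vocab], float)
    return counts / np.linalg.norm(counts)

vecs = {name: vector(text) for name, text in tests.items()}

def prioritize(vecs):
    """Greedy ordering: each pick is least similar to the picks so far."""
    order = [next(iter(vecs))]  # arbitrary first pick
    while len(order) < len(vecs):
        rest = [n for n in vecs if n not in order]
        order.append(max(rest, key=lambda n:
                         min(1 - vecs[n] @ vecs[p] for p in order)))
    return order

print(prioritize(vecs))  # spreads across the login and cart themes early
```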
International Workshop on Graph-Grammars and Their Application to Computer Science | 1994
Dorothea Blostein; Hoda Fahmy; Ann Grbavec
Graphs are a popular data structure, and graph-manipulation programs are common. Graph manipulations can be cleanly, compactly, and explicitly described using graph-rewriting notation. However, when a software developer is persuaded to try graph rewriting, several problems commonly arise. Primarily, it is difficult for a newcomer to develop a feel for how computations are expressed via graph rewriting. Also, graph rewriting is not convenient for solving all aspects of a problem: better mechanisms are needed for interfacing graph rewriting with other styles of computation. Efficiency considerations and the limited availability of development tools further limit practical use of graph rewriting. The inaccessible appearance of the graph-rewriting literature is an additional hindrance. These problems can be addressed through a combination of “public relations” work and further research and development, thereby promoting the widespread use of graph rewriting.
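To give a feel for how a computation is expressed via graph rewriting, the sketch below repeatedly applies one hypothetical rule, collapsing a chain x -> y -> z to x -> z, until no match remains; the rule and representation are illustrative, not from the paper:

```python
# Hypothetical sketch of a graph-rewriting step: match a rule's
# left-hand side, then replace it with the right-hand side.
def find_match(edges):
    """LHS pattern: some x -> y -> z where no other edge touches y."""
    for (x, y) in edges:
        for (y2, z) in edges:
            if y == y2 and x != z:
                others = [e for e in edges
                          if y in e and e not in {(x, y), (y, z)}]
                if not others:
                    return x, y, z
    return None

def rewrite(edges, nodes):
    """Apply the rule once: delete y and its edges, add x -> z (RHS)."""
    m = find_match(edges)
    if m is None:
        return edges, nodes, False
    x, y, z = m
    edges = {e for e in edges if y not in e} | {(x, z)}
    return edges, nodes - {y}, True

nodes = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c"), ("c", "d")}
changed = True
while changed:  # rewrite to a fixed point, as a rewriting system would
    edges, nodes, changed = rewrite(edges, nodes)
print(nodes, edges)  # chain collapsed: {'a', 'd'} and {('a', 'd')}
```

The separation between matching (find_match) and replacement (rewrite) mirrors the declarative flavor that makes graph-rewriting specifications compact.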