Publication


Featured research published by Alina Lazar.


Mining Software Repositories | 2014

Improving the accuracy of duplicate bug report detection using textual similarity measures

Alina Lazar; Sarah Ritchey; Bonita Sharif

The paper describes an improved method for automatic duplicate bug report detection based on new textual similarity features and binary classification. Using a set of new textual features, inspired by recent text-similarity research, we train several binary classification models. A case study was conducted on three open-source systems (Eclipse, Open Office, and Mozilla) to determine the effectiveness of the improved method. A comparison is also made with current state-of-the-art approaches, highlighting similarities and differences. Results indicate that the accuracy of the proposed method is better than previously reported results on all three systems.
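As an illustration of the textual-similarity idea behind this line of work, the sketch below compares two bug reports using cosine similarity over bag-of-words term frequencies and thresholds the score as a trivial binary decision. The tokenizer, the 0.5 threshold, and the example reports are hypothetical; the paper's actual feature set and trained classifiers are considerably richer.

```python
import math
from collections import Counter

def tf_vector(text):
    # Bag-of-words term frequencies over lowercase whitespace tokens.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_duplicate(report_a, report_b, threshold=0.5):
    # A single similarity feature with a fixed threshold stands in for
    # a trained binary classifier over many textual features.
    return cosine_similarity(tf_vector(report_a), tf_vector(report_b)) >= threshold

dup = is_duplicate("crash when opening large file",
                   "application crash opening a large file")
non_dup = is_duplicate("crash when opening large file",
                       "menu font renders too small")
```

In practice, a trained model (for example, logistic regression over several such similarity features) would replace the fixed threshold.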


Eye Tracking Research & Applications | 2014

An eye-tracking study assessing the comprehension of C++ and Python source code

Rachel Turner; Michael Falcone; Bonita Sharif; Alina Lazar

A study assessing the effect of programming language on student comprehension of source code is presented, comparing C++ and Python in two task categories: overview tasks and find-bug tasks. Eye gaze is tracked while thirty-eight students complete tasks and answer questions. Results indicate no significant difference in accuracy or time; however, there is a significant difference in the rate at which students look at buggy lines of code. These results begin to provide some direction as to the effect programming language might have in introductory programming classes.


Mining Software Repositories | 2016

Analyzing developer sentiment in commit logs

Vinayak Sinha; Alina Lazar; Bonita Sharif

The paper presents an analysis of developer commit logs for GitHub projects. In particular, developer sentiment in commits is analyzed across 28,466 projects within a seven-year time frame. We use the Boa infrastructure's online query system to generate commit logs as well as the files that were changed during each commit. Using a sentiment analysis tool, we analyze the commits in three categories (large, medium, and small) based on the number of commits. In addition, we group the data based on the day of the week the commit was made, and we map the sentiment to the file change history to determine whether there is any correlation. Although the majority of the sentiment was neutral, negative sentiment was about 10% more frequent than positive sentiment overall. Tuesdays seem to have the most negative sentiment overall. We also find a strong correlation between the number of files changed and the sentiment expressed by the commits the files were part of. Future work and implications of these results are discussed.
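A minimal sketch of the kind of grouping described above, assuming a toy word-list scorer in place of the dedicated sentiment analysis tool and a hypothetical three-commit log: each message is scored, then the scores are averaged per weekday.

```python
from statistics import mean

# Tiny illustrative lexicons; the study used a dedicated sentiment analysis tool.
NEGATIVE = {"fix", "bug", "broken", "fail", "ugly", "hack"}
POSITIVE = {"clean", "improve", "great", "nice", "add"}

def sentiment(message):
    # Score = positive hits minus negative hits over the message tokens.
    tokens = message.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

# Hypothetical commit log entries: (weekday, message, files_changed).
commits = [
    ("Tue", "fix broken build", 12),
    ("Tue", "hack around ugly bug", 9),
    ("Fri", "add clean logging, improve docs", 2),
]

# Group sentiment scores by the day of the week the commit was made.
by_day = {}
for day, msg, _ in commits:
    by_day.setdefault(day, []).append(sentiment(msg))
mean_by_day = {day: mean(scores) for day, scores in by_day.items()}
```

The same grouping, keyed on files changed instead of weekday, would feed the correlation analysis the abstract mentions.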


Mining Software Repositories | 2014

Generating duplicate bug datasets

Alina Lazar; Sarah Ritchey; Bonita Sharif

Automatic identification of duplicate bug reports is an important research problem in the mining software repositories field. This paper presents a collection of bug datasets collected, cleaned, and preprocessed for the duplicate bug report identification problem. The datasets were extracted from open-source systems that use Bugzilla as their bug-tracking component and contain all the bugs ever submitted. The systems used are Eclipse, Open Office, NetBeans, and Mozilla. For each dataset, we store both the initial data and the cleaned data in separate collections in a MongoDB document-oriented database. In addition to the bug data collections downloaded from the bug repositories, the database includes, for each dataset, a set of all pairs of duplicate bugs together with randomly selected pairs of non-duplicate bugs. Such a dataset is useful as input for classification models and forms a good basis for replications and comparisons by other researchers. We used a subset of this data to predict duplicate bug reports, but the same dataset may also be used to predict bug priorities and severities.
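The pairing step can be sketched as follows, with a hypothetical six-bug map standing in for a Bugzilla dump: every (duplicate, original) pair becomes a positive example, and an equal number of randomly sampled non-duplicate pairs become negatives.

```python
import random

# Hypothetical bug records: id -> id of the bug it duplicates (None if original).
bugs = {1: None, 2: 1, 3: None, 4: 3, 5: None, 6: None}

# Positive examples: every (duplicate, original) pair.
dup_pairs = [(b, d) for b, d in bugs.items() if d is not None]

# Negative examples: randomly sampled pairs that do not duplicate each other.
rng = random.Random(0)  # seeded for reproducibility
ids = list(bugs)
non_dup_pairs = []
while len(non_dup_pairs) < len(dup_pairs):
    a, b = rng.sample(ids, 2)
    if bugs[a] != b and bugs[b] != a:
        non_dup_pairs.append((a, b))
```

Balancing positives and negatives this way gives a classification model an unbiased training set, which is the role the pair collections play in the database.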


Computer, Information, and Systems Sciences, and Engineering | 2008

Epistemic Structured Representation for Legal Transcript Analysis

Tracey Hughes; Cameron Hughes; Alina Lazar

HTML-based standards and the new XML-based standards for digital transcripts generated by court recorders offer more search and analysis options than traditional CAT (Computer-Aided Transcription) technology. The LegalXML standards are promising opportunities for new methods of search over legal documents. However, the search techniques employed are still largely restricted to keyword search and various probabilistic association techniques. Rather than keyword and association searches, we are interested in semantic and inference-based search. In this paper, a process for transforming the semi-structured representation of a digital transcript into an epistemic structured representation that supports semantic and inference-based search is explored.


International Conference on Machine Learning and Applications | 2004

Income prediction via support vector machine

Alina Lazar

Principal component analysis and support vector machine methods are employed to generate and evaluate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau. A detailed statistical study targeted for relevant feature selection is found to increase efficiency and even improve classification accuracy. A systematic study is performed on the influence of this statistical narrowing on the grid parameter search, training time, accuracy, and number of support vectors. Accuracy values as high as 84%, when compared against a test population, are obtained with a reduced set of parameters while the computational time is reduced by 60%. Tailoring computational methods around specific real data sets is critical in designing powerful algorithms.
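A minimal sketch of the PCA-then-SVM pipeline, assuming scikit-learn and a synthetic stand-in for the Census data; the component count, kernel, and dataset shape are illustrative, not the paper's exact setup.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the income data: 500 samples, 14 features.
X, y = make_classification(n_samples=500, n_features=14,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Statistical narrowing: keep enough principal components to explain
# 95% of the variance, then train an RBF-kernel SVM on the reduced data.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Reducing the feature space before the grid parameter search is what shrinks the training time the abstract refers to, since each grid point then trains on fewer dimensions.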


International Multi-Conference on Computing in the Global Information Technology | 2006

Support Vector Machines Optimization - An Income Prediction Study

Alina Lazar; Robert Zaremba

Relevant feature selection through principal component analysis is employed to increase the efficiency of support vector machine (SVM) methods. In particular, a detailed study is presented on the effects of this statistical narrowing when used to generate income prediction data based on the Current Population Survey provided by the U.S. Census Bureau. A systematic analysis of the grid parameter search, training time, accuracy, and number of support vectors shows increases not only in the efficiency of the SVM methods but also in the classification accuracy. Proper identification of the relevant features for specific problems allows accuracy values as high as 93% against a test population to be obtained while reducing the total computational time. Tailoring computational methods around specific real data sets is critical in designing powerful algorithms.


International Conference on Machine Learning and Applications | 2010

A Comparison of Linear Support Vector Machine Algorithms on Large Non-Sparse Datasets

Alina Lazar

This paper demonstrates the effectiveness of linear Support Vector Machines (SVM) when applied to non-sparse datasets with a large number of instances. Two linear SVM algorithms are compared: the coordinate descent method (LibLinear), which trains a linear SVM with an L2-loss function, and the cutting-plane algorithm (SVMperf), which uses an L1-loss function. Four Geographical Information System (GIS) datasets with over a million instances were used for this study. Each dataset consists of seven independent variables and a class label that distinguishes urban areas from rural areas.
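The two losses can be contrasted with scikit-learn's `LinearSVC` (itself backed by liblinear), where `squared_hinge` is the L2-loss and `hinge` the L1-loss; this is only an approximation of the comparison, since the paper benchmarks the LibLinear and SVMperf implementations themselves on far larger GIS data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for one GIS dataset: seven independent variables.
X, y = make_classification(n_samples=1000, n_features=7, random_state=0)

# L2-loss (squared hinge), as solved by LibLinear's coordinate descent.
l2_loss_svm = LinearSVC(loss="squared_hinge", random_state=0).fit(X, y)
# L1-loss (standard hinge), the loss SVMperf's cutting-plane algorithm uses.
l1_loss_svm = LinearSVC(loss="hinge", random_state=0).fit(X, y)

acc_l2 = l2_loss_svm.score(X, y)
acc_l1 = l1_loss_svm.score(X, y)
```

The squared hinge penalizes margin violations quadratically, so the two solvers can differ in both the support vectors they keep and the training time, which is what the study measures.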


International Conference on Machine Learning and Applications | 2005

Comparing machine learning classification schemes - a GIS approach

Alina Lazar; Bradley A. Shellito

This project examines the effectiveness of two classification schemes, support vector machines (SVM) and artificial neural networks (NN), when applied to geographic (i.e., spatial) data. The context for this study is to examine patterns of urbanization in Mahoning County, OH in relation to several independent driving variables of urban development. These independent variables were constructed using Geographic Information Systems (GIS) and were compared to the dependent variable, the spatial locations of urban areas in Mahoning County. The classification techniques were used in conjunction with the GIS-created variables to predict the location of urban areas within Mahoning County. A comparison of the accuracy of the techniques is presented, and conclusions are drawn concerning which of the variables are the most influential on urban patterns in the region. Lastly, a spatial analysis of the prediction error is performed for each method.


International Conference on Artificial Intelligence and Law | 2011

Discovering coherence and justification clusters in digital transcripts using epistemic analysis

Cameron Hughes; Tracey Hughes; Alina Lazar

We are investigating the potential use of trial transcripts as sources of social knowledge for epistemic agents. But we are immediately faced with the reality that not all transcripts are equal. The quality of a transcript is partially related to the knowledge, consistency, and integrity of the individuals who testify during the course of the trial, and to the nature and sophistication of the questions. Before we can determine whether a transcript will be useful as a knowledge source for an epistemic agent, we have to identify the consistency and quality of the knowledge present in the transcript. Coherence clusters demarcate the network of positively and negatively related propositions in the transcript. The justification clusters define the subclusters of propositions that support or justify other propositions in a coherence cluster. These clusters can be used to determine the nature and consistency of the knowledge potentially present in the transcript. In this paper, we show how these clusters are identified using epistemic analysis. Our goal is to use these clusters as the basis for an epistemic metric that determines the quality of the propositional knowledge present in a transcript.

Collaboration


Dive into Alina Lazar's collaboration.

Top Co-Authors

Bonita Sharif (Youngstown State University)

Cameron Hughes (Youngstown State University)

Sarah Ritchey (Youngstown State University)

Tracey Hughes (Youngstown State University)

Alex Sim (University of California)

Kesheng Wu (Lawrence Berkeley National Laboratory)

Ling Jin (University of California)