Israel Herraiz | Researchain



Publication

Featured research published by Israel Herraiz.

International Workshop on Principles of Software Evolution | 2005

Evolution and growth in large libre software projects

Gregorio Robles; Juan Jose Amor; Jesus M. Gonzalez-Barahona; Israel Herraiz

Software evolution research has recently focused on new development paradigms, studying whether laws found in more classic development environments also apply. Previous works have pointed out that at least some laws seem not to be valid for these new environments, and even Lehman has labeled those cases (few so far) as anomalies and has suggested that further research is needed to clarify this issue. In this line, we consider in this paper a large set of libre (free, open source) software systems featuring a large community of users and developers. In particular, we analyze a number of projects found in the literature up to now, including the Linux kernel. For comparison, we include other libre software kernels from the BSD family, and for completeness we consider a wider range of libre software applications. In the case of Linux and the other operating system kernels, we have also studied growth patterns at the subsystem level. We have observed in the studied sample that super-linearity occurs only exceptionally, that many of the systems follow a linear growth pattern, and that smooth growth is not that common. These results differ from the ones generally found in classical software evolution studies. Other behaviors and patterns also hint that development in the libre software world could follow different laws than those known, at least in some cases.

Foundations of Software Engineering | 2013

Sample size vs. bias in defect prediction

Foyzur Rahman; Daryl Posnett; Israel Herraiz; Premkumar T. Devanbu

Most empirical disciplines promote the reuse and sharing of datasets, as it leads to greater possibility of replication. While this is increasingly the case in Empirical Software Engineering, some of the most popular bug-fix datasets are now known to be biased. This raises two significant concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of the studies based on biased datasets may be suspect. This issue has raised considerable consternation in the ESE literature in recent years. However, there is a confounding factor of these datasets that has not been examined carefully: size. Biased datasets are sampling only some of the data that could be sampled, and doing so in a biased fashion; but biased samples could be smaller, or larger. Smaller data sets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask the question: what affects performance more, bias or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much as bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUCROC and F-score. This indicates that at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue, and raises further issues to be explored in the future.

International Conference on Software Maintenance | 2007

On the prediction of the evolution of libre software projects

Israel Herraiz; Jesus M. Gonzalez-Barahona; Gregorio Robles; Daniel M. German

Libre (free / open source) software development is a complex phenomenon. Many actors (core developers, casual contributors, bug reporters, patch submitters, users, etc.), in many cases volunteers, interact in complex patterns without the constraints of formal hierarchical structures or organizational ties. Understanding this complex behavior in enough detail to build explanatory models suitable for prediction is an open challenge, and few results have been published to date in this area. Therefore statistical, non-explanatory models (such as the traditional regression model) have a clear role, and have been used in some evolution studies. Our proposal goes in this direction, but using a model that we have found more useful: time series analysis. Data available from the source code management repository is used to compute the size of the software over its past life, and this information is then used to estimate the future evolution of the project. In this paper we present this methodology and apply it to three large projects, showing how in these cases predictions are more accurate than regression models, and precise enough to estimate with little error their near future evolutions.
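
To make the contrast between a regression model and a time series model concrete, here is a minimal sketch. It assumes the simplest possible time series model, a random walk with drift (ARIMA(0,1,0) with a constant); the abstract does not say which models the paper fits, and the function names and size values below are hypothetical.

```python
def linear_regression_forecast(sizes, steps):
    """Fit size = a + b*t by least squares over the whole history, then extrapolate."""
    n = len(sizes)
    t_mean = (n - 1) / 2
    y_mean = sum(sizes) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(sizes)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    intercept = y_mean - slope * t_mean
    return intercept + slope * (n - 1 + steps)


def drift_forecast(sizes, steps):
    """Random walk with drift: last observed size plus the mean past increment per step."""
    n = len(sizes)
    mean_increment = (sizes[-1] - sizes[0]) / (n - 1)
    return sizes[-1] + steps * mean_increment


# Hypothetical project sizes (e.g. thousands of lines) at successive snapshots.
history = [100, 110, 125, 150, 190]
regression_estimate = linear_regression_forecast(history, steps=1)
time_series_estimate = drift_forecast(history, steps=1)
```

The regression line is fitted to the whole history, while the drift model starts from the most recent observation, so the two estimates diverge whenever growth is not perfectly linear.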

ACM Computing Surveys | 2013

The evolution of the laws of software evolution: A discussion based on a systematic literature review

Israel Herraiz; Daniel Rodríguez; Gregorio Robles; Jesus M. Gonzalez-Barahona

After more than 40 years of life, software evolution should be considered as a mature field. However, despite such a long history, many research questions still remain open, and controversial studies about the validity of the laws of software evolution are common. During the first part of these 40 years, the laws themselves evolved to adapt to changes in both the research and the software industry environments. This process of adaption to new paradigms, standards, and practices stopped about 15 years ago, when the laws were revised for the last time. However, most controversial studies have been raised during this latter period. Based on a systematic and comprehensive literature review, in this article, we describe how and when the laws, and the software evolution field, evolved. We also address the current state of affairs about the validity of the laws, how they are perceived by the research community, and the developments and challenges that are likely to occur in the coming years.

Annual Software Engineering Workshop | 2012

On software engineering repositories and their open problems

Daniel Rodríguez; Israel Herraiz; Rachel Harrison

In the last decade, a large number of software repositories have been created for different purposes. In this paper we present a survey of the publicly available repositories, classify the most common ones, and discuss the problems faced by researchers when applying machine learning or statistical techniques to them.

Evaluation and Assessment in Software Engineering | 2014

Preliminary comparison of techniques for dealing with imbalance in software defect prediction

Daniel Rodríguez; Israel Herraiz; Rachel Harrison; José Javier Dolado; José C. Riquelme

Imbalanced data is a common problem in data mining when dealing with classification problems, where samples of one class vastly outnumber those of other classes. In this situation, many data mining algorithms generate poor models as they try to optimize the overall accuracy and perform badly in classes with very few samples. Software Engineering data in general, and defect prediction datasets in particular, are no exception. In this paper, we compare different approaches, namely sampling, cost-sensitive, ensemble and hybrid approaches, to the problem of defect prediction with different datasets preprocessed differently. We have used the well-known NASA datasets curated by Shepperd et al. There are differences in the results depending on the characteristics of the dataset and the evaluation metrics, especially if duplicates and inconsistencies are removed as a preprocessing step. Further results and replication package: http://www.cc.uah.es/drg/ease14
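
As an illustration of the sampling family of approaches mentioned above, the following is a minimal sketch of random oversampling, which duplicates minority-class samples until every class is as large as the biggest one. It is not necessarily the sampling technique evaluated in the paper; the function name and the toy data are hypothetical.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw with replacement to fill the gap up to the target class size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 non-defective modules (label 0) and 2 defective ones (label 1).
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
```

Oversampling should only be applied to the training split, since duplicated samples leaking into the test split would inflate the measured performance.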

Journal of Software: Evolution and Process | 2014

Studying the laws of software evolution in a long-lived FLOSS project

Jesus M. Gonzalez-Barahona; Gregorio Robles; Israel Herraiz; Felipe Ortega

Some free, open‐source software projects have been around for quite a long time, the longest living ones dating from the early 1980s. For some of them, detailed information about their evolution is available in source code management systems tracking all their code changes for periods of more than 15 years. This paper examines in detail the evolution of one such project, glibc, with the main aim of understanding how it evolved and how it matched Lehman's laws of software evolution. As a result, we have developed a methodology for studying the evolution of such long‐lived projects based on the information in their source code management repository, described in detail several aspects of the history of glibc, including some activity and size metrics, and found how some of the laws of software evolution may not hold in this case.

Mining Software Repositories | 2013

Intensive metrics for the study of the evolution of open source projects: Case studies from Apache Software Foundation projects

Santiago Gala-Pérez; Gregorio Robles; Jesus M. Gonzalez-Barahona; Israel Herraiz

Based on the empirical evidence that the ratio of email messages in public mailing lists to versioning system commits has remained relatively constant throughout the history of the Apache Software Foundation (ASF), this paper studies what can be inferred from such a metric for ASF projects. We have found that the metric seems to be an intensive metric, as it is independent of the size of the project, its activity, or the number of developers, and remains relatively independent of the technology or functional area of the project. Our analysis provides evidence that the metric is related to the technical effervescence and popularity of a project, and as such can be a good candidate to measure its healthy evolution. Other, similar metrics (such as the ratio of developer messages to commits and the ratio of issue tracker messages to commits) are studied for several projects as well, in order to see if they have similar characteristics.
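
The metric itself is plain arithmetic. A minimal sketch follows, assuming yearly (messages, commits) counts per project; the project names and all numbers are invented for illustration and are not data from the paper.

```python
def messages_per_commit(messages, commits):
    """Ratio of mailing-list messages to commits; None when there are no commits."""
    if commits == 0:
        return None
    return messages / commits


# Hypothetical yearly (messages, commits) counts for two projects of very different size.
yearly_counts = {
    "small_project": [(120, 40), (150, 50), (90, 30)],
    "large_project": [(12000, 4000), (15300, 5100), (9000, 3000)],
}

ratios = {
    name: [messages_per_commit(m, c) for m, c in counts]
    for name, counts in yearly_counts.items()
}
```

In this invented data the ratio is identical for both projects even though their absolute counts differ by two orders of magnitude, which is what it means for a metric to be intensive rather than extensive.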

Working Conference on Reverse Engineering | 2011

Impact of Installation Counts on Perceived Quality: A Case Study on Debian

Israel Herraiz; Emad Shihab; Thanh H. D. Nguyen; Ahmed E. Hassan

Software defects are generally used to indicate software quality. However, due to the nature of software, we are often only able to know about the defects found and reported, either during testing or after deployment. In software research studies, it is assumed that a higher number of defect reports represents a higher number of defects in the software system. In this paper, we argue that widely deployed programs have more reported defects, regardless of their actual number of defects. To address this question, we perform a case study on the Debian GNU/Linux distribution, a well-known free / open source software collection. We compare the defects reported for all the software packages in Debian with their popularity. We find that the number of reported defects for a Debian package is limited by its popularity. This finding has implications for defect prediction studies, showing that they need to consider the impact of popularity on perceived quality, otherwise they might be risking bias.

Foundational and Practical Aspects of Resource Analysis | 2009

Comparing cost functions in resource analysis

Puri Arenas; Samir Genaim; Israel Herraiz; Germán Puebla

Cost functions provide information about the amount of resources required to execute a program in terms of the sizes of input arguments. They can provide an upper-bound, a lower-bound, or the average-case cost. Motivated by the existence of a number of automatic cost analyzers which produce cost functions, we propose an approach for automatically proving that a cost function is smaller than another one. In all applications of resource analysis, such as resource-usage verification, program synthesis and optimization, etc., it is essential to compare cost functions. This allows choosing an implementation with smaller cost or guaranteeing that the given resource-usage bounds are preserved. Unfortunately, automatically generated cost functions for realistic programs tend to be rather intricate, defined by multiple cases, involving non-linear subexpressions (e.g., exponential, polynomial and logarithmic) and they can contain multiple variables, possibly related by means of constraints. Thus, comparing cost functions is far from trivial. Our approach first syntactically transforms functions into simpler forms and then applies a number of sufficient conditions which guarantee that a set of expressions is smaller than another expression. Our preliminary implementation in the COSTA system indicates that the approach can be useful in practice.
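
To illustrate what a sufficient condition for comparing cost functions can look like, here is a minimal sketch for single-variable polynomials. It is not the approach implemented in COSTA; the representation (a dict from exponent to coefficient) and the function name are assumptions made for this example.

```python
def dominated_by(p, q):
    """Sufficient, not necessary, check that p(n) <= q(n) for every n >= 0.

    Polynomials are dicts mapping exponent -> coefficient. If every coefficient
    of q - p is non-negative, then q(n) - p(n) >= 0 for all non-negative n.
    """
    exponents = set(p) | set(q)
    return all(q.get(e, 0) - p.get(e, 0) >= 0 for e in exponents)


linear_cost = {1: 2, 0: 3}           # 2n + 3
quadratic_cost = {2: 1, 1: 2, 0: 5}  # n^2 + 2n + 5

proved = dominated_by(linear_cost, quadratic_cost)

# n <= n^2 + 1 holds for all n >= 0, but this coefficient-wise test cannot prove it.
inconclusive = dominated_by({1: 1}, {2: 1, 0: 1})
```

A `False` result only means that this particular condition does not apply, not that the inequality fails, which is why a practical comparator would combine several such conditions.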
