Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Anna Corazza is active.

Publications


Featured research published by Anna Corazza.


Conference on Software Maintenance and Reengineering | 2011

Investigating the use of lexical information for software system clustering

Anna Corazza; Sergio Di Martino; Valerio Maggio; Giuseppe Scanniello

Developers have a lot of freedom in writing comments, as well as in choosing identifiers and method names. This lexical information is intentional in nature, and its various sources differ in how much they reveal about what a software system implements, and in particular about the role of each source file. In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular, we explore the contribution of the combined use of six different dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information, namely class, attribute, method, and parameter names, comments, and source code statements. The relevance of each dictionary is weighted by means of a probabilistic model, whose parameters are estimated with the Expectation-Maximization algorithm. Source files are then grouped accordingly using a hierarchical clustering algorithm. The investigation has been conducted on a dataset of 13 open source Java software systems.
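The overall pipeline lends itself to a compact sketch. Below is a hypothetical Python illustration of its shape: one bag-of-words model per lexical zone ("dictionary"), a weight per zone, and hierarchical clustering over the combined similarities. The zone weights are fixed by hand here, whereas the paper estimates them with Expectation-Maximization; the toy data, weights, and TF-IDF vectorization are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch: weighted combination of per-zone lexical similarities,
# followed by hierarchical clustering of source files.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

# Toy corpus: for each of three source files, the text extracted per zone.
zones = {
    "class_names":  ["parser grammar", "lexer token", "report chart"],
    "method_names": ["parse rule", "next token scan", "draw render chart"],
    "comments":     ["builds the parse tree", "tokenizes input", "renders plots"],
}
# Assumed fixed weights; the paper estimates these with EM.
weights = {"class_names": 0.5, "method_names": 0.3, "comments": 0.2}

n_files = 3
similarity = np.zeros((n_files, n_files))
for zone, docs in zones.items():
    tfidf = TfidfVectorizer().fit_transform(docs)
    similarity += weights[zone] * cosine_similarity(tfidf)

# Agglomerative clustering over the combined (precomputed) distances.
distance = 1.0 - similarity
clusters = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(distance)
print(clusters)
```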


Empirical Software Engineering | 2013

Using tabu search to configure support vector regression for effort estimation

Anna Corazza; S. Di Martino; Filomena Ferrucci; Carmine Gravino; Federica Sarro; Emilia Mendes

Recent studies have reported that Support Vector Regression (SVR) has potential as a technique for software development effort estimation. However, its prediction accuracy is heavily influenced by the setting of its parameters. No general guidelines are available for selecting these parameters, whose choice also depends on the characteristics of the dataset being used. This motivated the work described in (Corazza et al. 2010), extended herein. To automatically select suitable SVR parameters, we proposed an approach based on the meta-heuristic Tabu Search (TS). We designed TS to search for the parameters of both the support vector algorithm and the employed kernel function, namely RBF. We empirically assessed the effectiveness of the approach using different types of datasets (single- and cross-company datasets, Web and non-Web projects) from the PROMISE repository and from the Tukutuku database. A total of 21 datasets were employed to perform a 10-fold or a leave-one-out cross-validation, depending on the size of the dataset. Several benchmarks were taken into account to assess both the effectiveness of TS in setting SVR parameters and the prediction accuracy of the proposed approach with respect to widely used effort estimation techniques. The use of TS allowed us to automatically obtain suitable parameter choices for running SVR. Moreover, the combination of TS and SVR significantly outperformed all the other techniques. The proposed approach is thus a suitable technique for software development effort estimation.
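As a rough illustration of how a Tabu Search can drive SVR configuration, here is a minimal Python sketch on synthetic data. The move operator, neighbourhood size, tabu tenure, parameter ranges, and iteration count are illustrative assumptions, not the settings used in the paper.

```python
# Minimal Tabu Search sketch for tuning SVR's C, epsilon and RBF gamma.
import random
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)

def fitness(params):
    C, eps, gamma = params
    model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=gamma)
    # Negative MSE so that higher is better.
    return cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()

def neighbours(params, n=8):
    # Perturb each parameter multiplicatively in log-space.
    return [tuple(p * 10 ** random.uniform(-0.5, 0.5) for p in params)
            for _ in range(n)]

random.seed(0)
current = best = (1.0, 0.1, 0.1)
best_score = fitness(best)
tabu, tenure = [], 10
for _ in range(20):                          # search iterations
    candidates = [p for p in neighbours(current) if p not in tabu]
    current = max(candidates or [current], key=fitness)
    tabu = (tabu + [current])[-tenure:]      # fixed tabu tenure
    score = fitness(current)
    if score > best_score:
        best, best_score = current, score
print("best (C, epsilon, gamma):", best, "score:", best_score)
```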


IEEE Transactions on Pattern Analysis and Machine Intelligence | 1991

Computation of probabilities for an island-driven parser

Anna Corazza; R. De Mori; R. Gretter; Giorgio Satta

The authors describe an effort to adapt island-driven parsers to handle stochastic context-free grammars. These grammars can be used as language models (LMs) by a language processor (LP) to compute the probability of a linguistic interpretation. As different islands may compete for growth, it is important to compute the probability that an LM generates a sentence containing islands and gaps between them. Algorithms for computing these probabilities are introduced. The complexity of these algorithms is analyzed from both theoretical and practical points of view. It is shown that the computation of probabilities in the presence of gaps of unknown length requires the impractical solution of a nonlinear system of equations, whereas the computation of probabilities for cases with gaps containing a known number of unknown words has polynomial time complexity and is practically feasible. The use of these results in automatic speech understanding systems is discussed.
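For reference, the island/gap probabilities discussed above generalize the standard inside probability of a stochastic context-free grammar, here written (in our notation, not the paper's) for a grammar in Chomsky normal form over a fully observed string:

```latex
% Inside probability: the probability that nonterminal A derives the
% substring w_i ... w_j. The island/gap quantities above generalize this
% to partially observed strings.
\beta(A, i, i) = P(A \to w_i), \qquad
\beta(A, i, j) = \sum_{A \to B\,C} \; \sum_{k=i}^{j-1}
    P(A \to B\,C)\,\beta(B, i, k)\,\beta(C, k+1, j).
```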


Conference on Software Maintenance and Reengineering | 2010

A Probabilistic Based Approach towards Software System Clustering

Anna Corazza; Sergio Di Martino; Giuseppe Scanniello

In this paper we present a clustering-based approach to partition software systems into meaningful subsystems. In particular, the approach uses lexical information extracted from four zones in Java classes, which may contribute differently to the partitioning of software systems. To automatically weigh these zones, we introduced a probabilistic model and applied the Expectation-Maximization (EM) algorithm. To group classes according to the considered lexical information, we customized the well-known K-Medoids algorithm. To assess the approach and the implemented supporting system, we conducted a case study on six open source software systems.
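For readers unfamiliar with K-Medoids, the textbook algorithm the authors customize can be sketched as follows. This compact PAM-style version over a precomputed distance matrix is a generic illustration, not the paper's adaptation.

```python
# Textbook K-Medoids (PAM-style) over a precomputed distance matrix D.
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # The new medoid minimizes total distance within its cluster.
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Toy distance matrix over five items (two obvious groups).
D = np.array([[0, 1, 9, 8, 9],
              [1, 0, 8, 9, 9],
              [9, 8, 0, 1, 2],
              [8, 9, 1, 0, 1],
              [9, 9, 2, 1, 0]], dtype=float)
print(k_medoids(D, k=2))
```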


International Conference on Software Maintenance | 2012

LINSEN: An efficient approach to split identifiers and expand abbreviations

Anna Corazza; Sergio Di Martino; Valerio Maggio

Information Retrieval (IR) techniques are being exploited by an increasing number of tools supporting software maintenance activities. Indeed, the lexical information embedded in the source code can be valuable for tasks such as concept location, clustering, or recovery of traceability links. The application of such IR-based techniques relies on the consistency of the lexicon available in the different artifacts, and their effectiveness can worsen if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). In this paper we propose an approach to automatically split identifiers into their component words and expand abbreviations. The solution is based on a graph model and runs in linear time with respect to the size of the dictionary, taking advantage of an approximate string matching technique. The proposed technique exploits a number of different dictionaries, referring to increasingly broader contexts, in order to achieve a disambiguation strategy based on the knowledge gathered from the most appropriate domain. The approach has been compared to other splitting and expansion techniques, using freely available oracles for the identifiers extracted from 24 C/C++ and Java open source systems. Results show an improvement in both splitting and expansion performance, in addition to a strong improvement in computational efficiency.
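To make the idea concrete, here is a hypothetical Python sketch of dictionary-driven splitting and abbreviation expansion in the spirit described above, using difflib for approximate matching. LINSEN itself is built on a graph model with a different matching scheme, and the tiny word list below is an assumption for illustration only.

```python
# Hypothetical sketch: Camel Case splitting, dictionary-driven segmentation
# of same-case runs (e.g. "UTFto"), and approximate-match expansion.
import re
import difflib

# Toy dictionary; LINSEN consults several, from source-code comments up to
# an English word list, preferring the most specific context.
WORDS = ["utf", "to", "ascii", "rectangle", "count", "pointer", "total"]

def camel_split(identifier):
    # Split on separators and lower-to-upper Camel Case boundaries.
    parts = re.split(r"[_\s]+|(?<=[a-z])(?=[A-Z])", identifier)
    return [p.lower() for p in parts if p]

def dict_split(token):
    # Greedy longest dictionary-prefix segmentation, for runs like
    # "utfto" that Camel Case rules alone cannot separate.
    out, i = [], 0
    while i < len(token):
        match = max((w for w in WORDS if token.startswith(w, i)),
                    key=len, default=None)
        if match is None:
            out.append(token[i:])
            break
        out.append(match)
        i += len(match)
    return out

def expand(token):
    # Exact hit, else closest approximate match (abbreviation expansion).
    if token in WORDS:
        return token
    close = difflib.get_close_matches(token, WORDS, n=1, cutoff=0.6)
    return close[0] if close else token

for ident in ["UTFtoASCII", "rectPtr", "cnt_total"]:
    tokens = [t for part in camel_split(ident) for t in dict_split(part)]
    print(ident, "->", [expand(t) for t in tokens])
```

On this toy input the sketch yields ['utf', 'to', 'ascii'], ['rectangle', 'pointer'], and ['count', 'total'], illustrating both splitting and expansion.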


Empirical Software Engineering | 2011

Investigating the use of Support Vector Regression for web effort estimation

Anna Corazza; Sergio Di Martino; Filomena Ferrucci; Carmine Gravino; Emilia Mendes

Support Vector Regression (SVR) is a new generation of Machine Learning algorithms, suitable for predictive data modeling problems. The objective of this paper is twofold: first, to investigate the effectiveness of SVR for Web effort estimation using a cross-company dataset; second, to compare different SVR configurations and identify the one with the best performance. In particular, we took into account three variable preprocessing strategies (no preprocessing, normalization, and logarithmic), in combination with two different dependent variables (effort and inverse effort). As a result, SVR was applied using six different data configurations. Moreover, to understand the suitability of kernel functions for handling non-linear problems, SVR was applied without a kernel and in combination with the Radial Basis Function (RBF) and Polynomial kernels, thus obtaining 18 different SVR configurations. To identify, for each configuration, the best values of the parameters, we defined a procedure based on a leave-one-out cross-validation approach. The dataset employed was the Tukutuku database, which has been adopted in many previous Web effort estimation studies. Three different training and test set splits were used, including respectively 130 and 65 projects. The SVR-based predictions were also benchmarked against predictions obtained using Manual StepWise Regression and Case-Based Reasoning. Our results showed that the configuration combining logarithmic feature preprocessing with the RBF kernel provided the best results for all three data splits. In addition, SVR provided significantly better prediction accuracy than all the considered benchmark techniques.
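One of the 18 configurations can be sketched as follows: logarithmic preprocessing combined with the RBF kernel, with parameters selected by leave-one-out cross-validation. The synthetic dataset and parameter grid below are placeholders, not the Tukutuku data or the paper's actual grid.

```python
# Sketch of one configuration: log preprocessing + RBF kernel + LOOCV.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=40, n_features=4, noise=5.0, random_state=1)
y = y - y.min() + 1.0                              # make "effort" positive

# Logarithmic preprocessing of features and dependent variable.
X_log, y_log = np.log(np.abs(X) + 1.0), np.log(y)

grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=LeaveOneOut(),
                      scoring="neg_mean_absolute_error")
search.fit(X_log, y_log)
print(search.best_params_, search.best_score_)
```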


Computers in Biology and Medicine | 2016

Unsupervised entity and relation extraction from clinical records in Italian

Anita Alicante; Anna Corazza; F. Isgrò; Stefano Silvestri

This paper proposes and discusses the use of text mining techniques for the extraction of information from clinical records written in Italian. Since it is very difficult and expensive to obtain annotated material for languages other than English, we only consider unsupervised approaches, where no annotated training set is necessary. We therefore propose a complete system structured in two steps. In the first, domain entities are extracted from the clinical records by means of a metathesaurus and standard natural language processing tools. The second step attempts to discover relations between the entity pairs extracted from the whole set of clinical records. For this second step we investigate the performance of unsupervised methods such as clustering in the space of entity pairs, each represented by an ad hoc feature vector. The resulting clusters are then automatically labelled using their most significant features. The system has been tested on a fairly large dataset of clinical records in Italian, investigating how performance varies when different similarity measures are adopted in the feature space. The results of our experiments show that the proposed unsupervised approach is promising and well suited to a semi-automatic labelling of the extracted relations.
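A minimal sketch of the second step might look as follows: entity-pair contexts are embedded as TF-IDF feature vectors, clustered, and each cluster is labelled by its highest-weight terms. The feature design, the clustering algorithm (plain k-means rather than the similarity-measure variants studied in the paper), and the toy contexts are all assumptions.

```python
# Sketch: cluster entity pairs by the text between them, then label each
# cluster with the most significant features of its centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Each item: the textual context observed between an entity pair.
contexts = [
    "treated with", "therapy based on", "treated using",
    "localized in", "found in the", "observed in",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(contexts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Automatic labelling: top-weight terms of each cluster centroid.
terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = np.argsort(km.cluster_centers_[c])[::-1][:2]
    print(f"cluster {c}: label =", [terms[i] for i in top])
```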


IEEE Transactions on Pattern Analysis and Machine Intelligence | 1994

Optimal probabilistic evaluation functions for search controlled by stochastic context-free grammars

Anna Corazza; R. De Mori; R. Gretter; Giorgio Satta

The possibility of using stochastic context-free grammars (SCFGs) in language modeling (LM) has been considered previously. When these grammars are used, search can be directed by evaluation functions based on the probability that an SCFG generates a sentence, given only some of its words. Expressions for computing the evaluation function have been proposed by Jelinek and Lafferty (1991) for the recognition of word sequences in the case in which only the prefix of a sequence is known. Corazza et al. (1991) have proposed methods for probability computation in the more general case in which partial word sequences interleaved by gaps are known. This computation is too complex in practice unless the lengths of the gaps are known. This paper proposes a method for computing the probability of the best parse tree that can generate a sentence of which only a part (consisting of islands and gaps) is known. This probability is the minimum possible, and thus the most informative, upper bound that can be used in the evaluation function. The computation of the proposed upper bound has cubic time complexity even if the lengths of the gaps are unknown. This makes it practical to use SCFGs to drive the interpretation of sentences in natural language processing.
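The upper bound rests on a Viterbi-style variant of the inside probability: replacing sums with maxima yields the probability of the best parse rather than the total generation probability. In our notation (not the paper's), the fully observed special case for a SCFG in Chomsky normal form is:

```latex
% Viterbi (max) analogue of the inside probability: the probability of the
% best parse of w_i ... w_j rooted in A. The islands-and-gaps bound above
% generalizes this quantity to partially observed sentences.
\hat{\beta}(A, i, i) = P(A \to w_i), \qquad
\hat{\beta}(A, i, j) = \max_{A \to B\,C} \; \max_{i \le k < j}
    P(A \to B\,C)\,\hat{\beta}(B, i, k)\,\hat{\beta}(C, k+1, j).
```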


International Conference on Software Maintenance | 2010

A Tree Kernel based approach for clone detection

Anna Corazza; Sergio Di Martino; Valerio Maggio; Giuseppe Scanniello

Reusing software by copying and pasting is a common practice in software development. This phenomenon is widely known as code cloning. Problems with clones are mainly due to the need to manage each duplication, which increases the effort required to maintain software systems. Clone detection approaches generally take into account either the syntactic structure (e.g., the Abstract Syntax Tree) or lexical elements (e.g., the signature of a function). In this paper we propose an approach to detect code clones based on syntactic information enriched by lexical elements. To this end, we have defined a Tree Kernel function to compare Abstract Syntax Trees. A preliminary investigation has also been conducted to assess the validity of the proposed approach.
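As a point of reference, a classic tree kernel in the Collins-Duffy style counts the subtrees two trees share, via a recursion over pairs of nodes. The sketch below is that textbook variant on toy ASTs, not the kernel defined in the paper:

```python
# Collins-Duffy style tree kernel: K(T1, T2) counts common subtrees,
# computed with the standard recursion over node pairs (decay factor lam).
class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def C(n1, n2, lam=1.0):
    # Number of common subtrees rooted at n1 and n2.
    if n1.label != n2.label or len(n1.children) != len(n2.children):
        return 0.0
    if not n1.children:                      # matching leaves
        return lam
    prod = lam
    for c1, c2 in zip(n1.children, n2.children):
        prod *= 1.0 + C(c1, c2, lam)
    return prod

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t1, t2, lam=0.5):
    return sum(C(a, b, lam) for a in nodes(t1) for b in nodes(t2))

# Two small "ASTs" for: if (cond) { call(); }
t1 = Node("if", [Node("cond"), Node("block", [Node("call")])])
t2 = Node("if", [Node("cond"), Node("block", [Node("call")])])
print(tree_kernel(t1, t2))
```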


Language and Technology Conference | 2006

Cross-Entropy and Estimation of Probabilistic Context-Free Grammars

Anna Corazza; Giorgio Satta

We investigate the problem of training probabilistic context-free grammars on the basis of a distribution defined over an infinite set of trees, by minimizing the cross-entropy. This problem can be seen as a generalization of the well-known maximum likelihood estimator on (finite) tree banks. We prove an unexpected theoretical property of grammars that are trained in this way, namely, we show that the derivational entropy of the grammar takes the same value as the cross-entropy between the input distribution and the grammar itself. We show that the result also holds for the widely applied maximum likelihood estimator on tree banks.
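In symbols, writing p_T for the input tree distribution and p_G for the tree distribution induced by a grammar G (notation ours, not the paper's), the quantities involved are:

```latex
% Cross-entropy between the input tree distribution p_T and the grammar's
% distribution p_G, and the derivational entropy of G. The theorem states
% that, for the cross-entropy-minimizing grammar G*, the two coincide.
H(p_T, p_G) = -\sum_{t} p_T(t) \log p_G(t), \qquad
H_d(G) = -\sum_{t} p_G(t) \log p_G(t), \qquad
H(p_T, p_{G^*}) = H_d(G^*).
```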

Collaboration


Dive into Anna Corazza's collaborations.

Top Co-Authors

Anita Alicante
University of Naples Federico II

Sergio Di Martino
University of Naples Federico II

F. Isgrò
Istituto Nazionale di Fisica Nucleare

Stefano Silvestri
Information Technology University