M. Rosário de Oliveira
Instituto Superior Técnico
Publications
Featured research published by M. Rosário de Oliveira.
international conference on computer communications | 2012
Cláudia Pascoal; M. Rosário de Oliveira; Rui Valadas; Peter Filzmoser; Paulo Salvador; António Pacheco
Robust statistics is a branch of statistics that includes methods capable of dealing adequately with the presence of outliers. In this paper, we propose an anomaly detection method that combines a feature selection algorithm and an outlier detection method, making extensive use of robust statistics. Feature selection is based on a mutual information metric for which we have developed a robust estimator; it also includes a novel and automatic procedure for determining the number of relevant features. Outlier detection is based on robust Principal Component Analysis (PCA) which, unlike classical PCA, is not sensitive to outliers and eliminates the need for training on a reliably labeled dataset, a strong advantage from the operational point of view. To evaluate our method we designed a network scenario capable of producing a perfect ground-truth under real (but controlled) traffic conditions. Results show significant improvements of our method over the corresponding classical ones. Moreover, despite being a largely overlooked issue in the context of anomaly detection, feature selection is found to be an important preprocessing step, allowing adaptation to different network conditions and inducing significant performance gains.
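As a rough illustration of the outlier-detection half of such a pipeline, the sketch below flags anomalies from score distances in a robust principal subspace obtained by eigendecomposing an MCD covariance estimate; the function name, the chi-squared cutoff, and the choice of MCD (rather than the paper's exact robust PCA estimator) are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.preprocessing import StandardScaler

def robust_pca_outliers(X, n_components=2, quantile=0.975):
    """Illustrative robust-PCA outlier detection: eigendecompose a robust
    (MCD) covariance estimate and flag points with large score distances."""
    Xs = StandardScaler().fit_transform(X)
    mcd = MinCovDet(random_state=0).fit(Xs)
    vals, vecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(vals)[::-1][:n_components]      # keep leading robust PCs
    vals, vecs = vals[order], vecs[:, order]
    scores = (Xs - mcd.location_) @ vecs
    # Score distance within the retained robust principal subspace.
    sd = np.sqrt(np.sum(scores**2 / vals, axis=1))
    cutoff = np.sqrt(chi2.ppf(quantile, df=n_components))
    return sd > cutoff                                  # boolean anomaly flags
```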
Neurocomputing | 2017
Cláudia Pascoal; M. Rosário de Oliveira; António Pacheco; Rui Valadas
Feature selection methods are usually evaluated by wrapping specific classifiers and datasets in the evaluation process, very often resulting in unfair comparisons between methods. In this work, we develop a theoretical framework that yields the true feature ordering of two-dimensional sequential forward feature selection methods based on mutual information. The ordering is independent of entropy or mutual information estimation methods, classifiers, and datasets, and therefore enables an unambiguous comparison of the methods. Moreover, the theoretical framework unveils problems intrinsic to some methods that are otherwise difficult to detect, namely inconsistencies in the construction of the objective function used to select the candidate features, due to various types of indeterminations and to the possibility of the entropy of continuous random variables taking null or negative values. Highlights: sequential forward feature selection methods are compared theoretically; the true feature ordering is obtained using a theoretical framework; several inconsistencies in the objective functions of the methods are unveiled.
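For concreteness, the sketch below implements one member of the family of methods under comparison, a greedy forward selection with a MIFS-style criterion J(f) = I(f; C) - beta * sum over selected s of I(f; s); the criterion, the equal-width discretisation, and the scikit-learn estimators are illustrative assumptions, not the paper's theoretical framework.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def forward_mi_selection(X, y, k, beta=0.5, n_bins=10):
    """Greedy forward selection with a MIFS-style criterion:
    J(f) = I(f; C) - beta * sum_{s in S} I(f; s).
    Features are discretised into equal-width bins to estimate I(f; s)."""
    n_feat = X.shape[1]
    Xd = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], n_bins))
                   for j in range(n_feat)], axis=1)
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(n_feat))
    while len(selected) < k and remaining:
        def score(f):
            redundancy = sum(mutual_info_score(Xd[:, f], Xd[:, s]) for s in selected)
            return relevance[f] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```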
advanced industrial conference on telecommunications | 2005
António Nogueira; M. Rosário de Oliveira; Paulo Salvador; Rui Valadas; António Pacheco
Traffic engineering and network management can greatly benefit from a reliable classification of Internet users. This paper evaluates the potential of different artificial neural network models for classifying Internet users based on their hourly traffic profile. The training of the neural networks and the evaluation of their performance rely on a previous classification of the Internet users obtained through cluster analysis. The results obtained for two data sets measured at the access network of a Portuguese ISP indicate that neural networks constitute a valuable tool for classifying Internet users.
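A minimal sketch of the two-stage idea (cluster analysis to obtain reference classes, then a neural network trained to reproduce them) could look as follows; the synthetic hourly profiles, the use of k-means, and the MLP architecture are assumptions for illustration, not the models evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical data: one 24-dimensional hourly traffic profile per user.
rng = np.random.default_rng(0)
profiles = rng.gamma(shape=2.0, scale=1.0, size=(500, 24))

# Step 1: obtain reference classes of users through cluster analysis.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

# Step 2: train a neural network to reproduce the cluster-based classification.
X_tr, X_te, y_tr, y_te = train_test_split(profiles, labels,
                                          test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
print("agreement with cluster labels:", mlp.score(X_te, y_te))
```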
Journal of Applied Statistics | 2012
Luzia Gonçalves; M. Rosário de Oliveira; Cláudia Pascoal; Ana M. Pires
The poor performance of the Wald method for constructing confidence intervals (CIs) for a binomial proportion has been demonstrated in a vast literature. The related problem of sample size determination needs to be updated, and comparative studies are essential to understanding the performance of alternative methods. In this paper, the sample size is obtained for the Clopper–Pearson, Bayesian (Uniform and Jeffreys priors), Wilson, Agresti–Coull, Anscombe, and Wald methods. Two two-step procedures are used: one based on the expected length (EL) of the CI and another on its first-order approximation. In the first step, all possible solutions that satisfy the optimal criterion are obtained. In the second step, a single solution is proposed according to a new criterion (e.g. highest coverage probability (CP)). In practice, a reduction in sample size is to be expected; therefore, we explore the behavior of the methods when admitting losses of 30% and 50%. For all the methods, the ELs are inflated, as expected, but the coverage probabilities remain close to the original target (with few exceptions). It is not easy to suggest a method that is optimal throughout the range (0, 1) for p. Depending on whether the goal is to achieve a CP approximately at or above the nominal level, different recommendations are made.
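As a worked illustration of the first step (finding the smallest sample size whose expected CI length meets a target), the sketch below does this for the Wilson method; the target length, the search range, and the exact-summation EL (rather than its first-order approximation) are illustrative choices, not the paper's procedure.

```python
import numpy as np
from scipy.stats import binom, norm

def wilson_ci(x, n, conf=0.95):
    """Wilson score interval for a binomial proportion (vectorised in x)."""
    z = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    centre = (phat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def expected_length(n, p, conf=0.95):
    """Expected CI length, averaging over the binomial distribution of X."""
    xs = np.arange(n + 1)
    lo, hi = wilson_ci(xs, n, conf)
    return np.sum(binom.pmf(xs, n, p) * (hi - lo))

def smallest_n(p, target_len, conf=0.95, n_max=5000):
    """Smallest n whose expected length meets the target, for given p."""
    for n in range(2, n_max):
        if expected_length(n, p, conf) <= target_len:
            return n
    return None

print(smallest_n(p=0.3, target_len=0.10))   # e.g. required n when p = 0.3
```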
Archive | 2000
M. Rosário de Oliveira; João A. Branco
Projection pursuit techniques are used to build new robust estimators for the parameters of the canonical correlation model. A simulation study shows that for non-ideal data these estimators can perform as well as other robust estimators. However, they can have much higher breakdown points. This advantage makes these estimators the right choice for use with real data, where potential outlying observations are very frequent.
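A minimal sketch of the projection pursuit idea, maximizing a robust association measure between one-dimensional projections of the two variable sets, might look as follows; Spearman's rho and the multi-start Nelder-Mead optimisation are illustrative stand-ins, not the estimators studied in this work.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def robust_cancorr(X, Y, seed=0, n_starts=10):
    """Illustrative projection-pursuit canonical correlation: find directions
    a, b maximizing a robust association measure (Spearman's rho here)
    between the projections X @ a and Y @ b."""
    rng = np.random.default_rng(seed)
    p, q = X.shape[1], Y.shape[1]

    def neg_assoc(w):
        a, b = w[:p], w[p:]
        a = a / (np.linalg.norm(a) + 1e-12)   # normalise to avoid scaling issues
        b = b / (np.linalg.norm(b) + 1e-12)
        rho, _ = spearmanr(X @ a, Y @ b)
        return -rho

    best = None
    for _ in range(n_starts):                 # multi-start: objective is non-convex
        res = minimize(neg_assoc, rng.normal(size=p + q), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    a, b = best.x[:p], best.x[p:]
    return a / np.linalg.norm(a), b / np.linalg.norm(b), -best.fun
```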
Traffic Management and Traffic Engineering for the Future Internet | 2009
M. Rosário de Oliveira; António Pacheco; Cláudia Pascoal; Rui Valadas; Paulo Salvador
The development of realistic Internet traffic models of applications and services calls for a good understanding of the nature of Internet flows, which can be affected by many factors. Especially relevant among these are the limitations imposed by link capacities and router algorithms that control bandwidth on a per-flow basis. In this paper, we perform a statistical analysis of an Internet traffic trace that specifically takes into account the upper bounds on the duration and rate of measured flows. In particular, we propose a new model for studying the dependencies between the logarithm of the size, the logarithm of the duration, and the logarithm of the transmission rate of an Internet flow. We consider a bivariate lognormal distribution for the flow size and flow duration, and derive estimators for the mean, the variance, and the correlation, based on a truncated domain that reflects the upper bounds on the duration and rate of measured flows. Moreover, we obtain regression equations that describe the expected value of one characteristic (size, duration, or rate) given another (size or duration), thus providing further insight into the dependencies between Internet flow characteristics. In particular, for flows with large sizes we are able to predict durations and rates that are coherent with the upper bound on transmission rates imposed by the network.
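Setting aside the truncation corrections that are the paper's main contribution, the regression of one log-characteristic on another implied by a fitted bivariate lognormal model can be sketched as follows; the synthetic flow records and parameter values are assumptions for illustration.

```python
import numpy as np

# Hypothetical flow records: sizes in bytes, durations in seconds (logged).
rng = np.random.default_rng(1)
log_dur = rng.normal(1.0, 1.2, size=10_000)
log_size = 7.0 + 0.8 * log_dur + rng.normal(0.0, 0.9, size=10_000)

# Fit a bivariate normal to (log size, log duration), i.e. a bivariate
# lognormal for (size, duration); truncation corrections are omitted here.
mu = np.array([log_size.mean(), log_dur.mean()])
S = np.cov(np.vstack([log_size, log_dur]))

# Regression implied by the fitted model:
# E[log size | log dur = d] = mu_s + (sigma_sd / sigma_dd) * (d - mu_d).
slope = S[0, 1] / S[1, 1]
intercept = mu[0] - slope * mu[1]
print(f"E[log size | log dur = d] ~ {intercept:.2f} + {slope:.2f} d")
```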
Neurocomputing | 2018
Francisco Macedo; M. Rosário de Oliveira; António Pacheco; Rui Valadas
Feature selection problems arise in a variety of applications, such as microarray analysis, clinical prediction, text categorization, image classification and face recognition, multi-label learning, and classification of Internet traffic. Among the various classes of methods, forward feature selection methods based on mutual information have become very popular and are widely used in practice. However, comparative evaluations of these methods have been limited by being based on specific datasets and classifiers. In this paper, we develop a theoretical framework that allows evaluating the methods based on their theoretical properties. Our framework is grounded on the properties of the target objective function that the methods try to approximate, and on a novel categorization of features according to their contribution to the explanation of the class; we derive upper and lower bounds for the target objective function and relate these bounds with the feature types. Then, we characterize the types of approximations taken by the methods, and analyze how these approximations cope with the good properties of the target objective function. Additionally, we develop a distributional setting designed to illustrate the various deficiencies of the methods, and provide several examples of wrong feature selections. Based on our work, we clearly identify the methods that should be avoided and the methods that currently have the best performance.
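A classical toy case of the kind of wrong selection such approximations can make is the XOR configuration sketched below, where each of two features is individually uninformative about the class but the pair determines it; the example is illustrative and is not taken from the paper's distributional setting.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 20_000
f1 = rng.integers(0, 2, n)
f2 = rng.integers(0, 2, n)
c = f1 ^ f2                                           # class = XOR of f1 and f2
f3 = np.where(rng.random(n) < 0.6, c, rng.integers(0, 2, n))   # weakly related to c

# Individually, f1 and f2 look useless while f3 looks best ...
for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name, "I(f; C) =", round(mutual_info_score(f, c), 3))

# ... yet jointly {f1, f2} determines the class (encode the pair as one variable).
pair = f1 * 2 + f2
print("I({f1,f2}; C) =", round(mutual_info_score(pair, c), 3))
```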
international conference on computer communications | 2015
M. Rosário de Oliveira; João Neves; Rui Valadas; Paulo Salvador
The classification of Internet traffic using supervised or semi-supervised statistical learning techniques, both for anomaly detection and identification of Internet applications, has been impaired by difficulties in obtaining a reliable ground-truth, required both to train the classifier and to evaluate its performance. A perfect ground-truth is increasingly difficult, or sometimes impossible, to obtain due to the growing percentage of encrypted traffic, the sophistication of network attacks, and the constant updates of Internet applications. In this paper, we study the impact of the ground-truth on training the classifier and estimating its performance measures. We show both theoretically and through simulation that ground-truth imperfections can severely bias the performance estimates. We then propose a latent class model that overcomes this problem by combining estimates of several classifiers over the same dataset. The model is evaluated using a high-quality dataset that includes the most representative Internet applications and network attacks. The results show that our latent class model produces very good performance estimates under mild levels of ground-truth imperfection, and can thus be used to correctly benchmark Internet traffic classifiers when only an imperfect ground-truth is available.
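The bias induced by an imperfect ground-truth can be illustrated with a short simulation: a classifier that is 90% accurate against the true labels appears noticeably less accurate when scored against labels that are themselves 10% wrong. The error rates below are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_label = rng.integers(0, 2, n)

# A classifier that agrees with the truth 90% of the time.
classifier = np.where(rng.random(n) < 0.90, true_label, 1 - true_label)

# An imperfect ground-truth that mislabels 10% of the flows.
ground_truth = np.where(rng.random(n) < 0.90, true_label, 1 - true_label)

true_acc = np.mean(classifier == true_label)
apparent_acc = np.mean(classifier == ground_truth)
print(f"accuracy vs truth:            {true_acc:.3f}")
print(f"accuracy vs noisy ground-truth: {apparent_acc:.3f}")   # biased towards ~0.82
```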
european conference on networks and communications | 2015
Sara Faria Leal; M. Rosário de Oliveira; Rui Valadas
Anomaly detection of Internet traffic is a network service of primary importance, given the constant threats that impinge on Internet security. From a statistical perspective, traffic anomalies can be considered outliers and must be handled through effective outlier detection methods, for which feature selection is an important pre-processing step. Feature selection removes redundant and irrelevant features from the detection process, improving its performance. In this work, we consider outlier detection based on principal component analysis, and feature selection based on mutual information. Moreover, we address the use of kernel density estimation (KDE) to estimate the mutual information, which is designed for continuous features and avoids the discretization step of histograms. Our results, obtained using a high-quality ground-truth, clearly show the usefulness of feature selection and the superiority of KDE for estimating the mutual information in the context of Internet traffic anomaly detection.
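A simple resubstitution-style KDE estimate of mutual information, of the general kind discussed here, could be sketched as follows; the use of scipy's gaussian_kde and the plain plug-in average are illustrative choices, not the exact estimator used in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Illustrative KDE estimate of I(X; Y) for continuous variables:
    sample average of log p(x, y) - log p(x) - log p(y)."""
    joint = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    pts = np.vstack([x, y])
    return np.mean(np.log(joint(pts)) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 0.8 * x + rng.normal(scale=0.6, size=2000)     # correlated pair
print("KDE MI estimate (nats):", round(mi_kde(x, y), 3))
```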
international telecommunications network strategy and planning symposium | 2014
M. Rosário de Oliveira; Rui Valadas; Marcin Pietrzyk; Denis Collange
One important requirement associated with the deployment of large-scale classification infrastructures is the portability of classifiers, which allows a small number of pre-trained classifiers to be used on many sites and time periods. Portability can be severely degraded if the flow features used in the classification process lack stability, i.e. if they do not preserve their most relevant statistical properties across different sites and time periods. In this paper we propose a statistical procedure to evaluate the stability of flow features, which resorts to the notion of effect size. The procedure is used to challenge the stability of popular flow features, such as the direction and size of the first four packets of a TCP connection. Our results, obtained with three high-quality traffic traces, clearly show that only some applications are portable when using these features as discriminators. We also provide evidence for these findings based on the operation of the protocols underlying the Internet applications.
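As a rough illustration of an effect-size-based stability check, the sketch below computes Cohen's d for the same flow feature observed on two sites and flags large values; Cohen's d, the 0.2 threshold, and the synthetic packet sizes are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d effect size between the same feature measured on two traces."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

rng = np.random.default_rng(0)
# Hypothetical feature: size of the first data packet of a TCP connection,
# observed on two different sites.
site_a = rng.normal(520, 80, size=5000)
site_b = rng.normal(560, 85, size=5000)

d = cohens_d(site_a, site_b)
print(f"effect size d = {d:.2f}",
      "-> feature looks unstable across sites" if abs(d) > 0.2 else "-> stable")
```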