M. Rosário de Oliveira
Instituto Superior Técnico
Publications
Featured research published by M. Rosário de Oliveira.
international conference on computer communications | 2012
Cláudia Pascoal; M. Rosário de Oliveira; Rui Valadas; Peter Filzmoser; Paulo Salvador; António Pacheco
Robust statistics is a branch of statistics that includes methods capable of dealing adequately with the presence of outliers. In this paper, we propose an anomaly detection method that combines a feature selection algorithm and an outlier detection method, making extensive use of robust statistics. Feature selection is based on a mutual information metric for which we have developed a robust estimator; it also includes a novel and automatic procedure for determining the number of relevant features. Outlier detection is based on robust Principal Component Analysis (PCA) which, unlike classical PCA, is not sensitive to outliers and eliminates the need for training on a reliably labeled dataset, a strong advantage from the operational point of view. To evaluate our method we designed a network scenario capable of producing a perfect ground-truth under real (but controlled) traffic conditions. Results show significant improvements of our method over the corresponding classical ones. Moreover, despite being a largely overlooked issue in the context of anomaly detection, feature selection is found to be an important preprocessing step, allowing adaptation to different network conditions and inducing significant performance gains.
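As a rough illustration of the outlier-detection half of such a pipeline, the sketch below flags anomalies from score distances in a robust principal subspace obtained by eigendecomposing an MCD covariance estimate; the function name, the chi-squared cutoff, and the choice of MCD (rather than the paper's exact robust PCA estimator) are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.preprocessing import StandardScaler

def robust_pca_outliers(X, n_components=2, quantile=0.975):
    """Illustrative robust-PCA outlier detection: eigendecompose a robust
    (MCD) covariance estimate and flag points with large score distances."""
    Xs = StandardScaler().fit_transform(X)
    mcd = MinCovDet(random_state=0).fit(Xs)
    vals, vecs = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(vals)[::-1][:n_components]      # keep leading robust PCs
    vals, vecs = vals[order], vecs[:, order]
    scores = (Xs - mcd.location_) @ vecs
    # Score distance within the retained robust principal subspace.
    sd = np.sqrt(np.sum(scores**2 / vals, axis=1))
    cutoff = np.sqrt(chi2.ppf(quantile, df=n_components))
    return sd > cutoff                                  # boolean anomaly flags
```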
Neurocomputing | 2017
Cláudia Pascoal; M. Rosário de Oliveira; António Pacheco; Rui Valadas
Feature selection methods are usually evaluated by wrapping specific classifiers and datasets in the evaluation process, very often resulting in unfair comparisons between methods. In this work, we develop a theoretical framework that yields the true feature ordering of two-dimensional sequential forward feature selection methods based on mutual information. The ordering is independent of entropy or mutual information estimation methods, classifiers, and datasets, and therefore enables an unambiguous comparison of the methods. Moreover, the theoretical framework unveils problems intrinsic to some methods that are otherwise difficult to detect, namely inconsistencies in the construction of the objective function used to select the candidate features, due to various types of indeterminations and to the possibility of the entropy of continuous random variables taking null or negative values. Highlights: sequential forward feature selection methods are compared theoretically; the true feature ordering is obtained using a theoretical framework; several inconsistencies in the objective functions of the methods are unveiled.
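For concreteness, the sketch below implements one member of the family of methods under comparison, a greedy forward selection with a MIFS-style criterion J(f) = I(f; C) - beta * sum over selected s of I(f; s); the criterion, the equal-width discretisation, and the scikit-learn estimators are illustrative assumptions, not the paper's theoretical framework.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def forward_mi_selection(X, y, k, beta=0.5, n_bins=10):
    """Greedy forward selection with a MIFS-style criterion:
    J(f) = I(f; C) - beta * sum_{s in S} I(f; s).
    Features are discretised into equal-width bins to estimate I(f; s)."""
    n_feat = X.shape[1]
    Xd = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], n_bins))
                   for j in range(n_feat)], axis=1)
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(n_feat))
    while len(selected) < k and remaining:
        def score(f):
            redundancy = sum(mutual_info_score(Xd[:, f], Xd[:, s]) for s in selected)
            return relevance[f] - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```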
advanced industrial conference on telecommunications | 2005
António Nogueira; M. Rosário de Oliveira; Paulo Salvador; Rui Valadas; António Pacheco
Traffic engineering and network management can greatly benefit from a reliable classification of Internet users. This paper evaluates the potential of different artificial neural network models for classifying Internet users based on their hourly traffic profile. The training of the neural networks and the evaluation of their performance rely on a previous classification of the Internet users obtained through cluster analysis. The results obtained for two data sets measured at the access network of a Portuguese ISP indicate that neural networks constitute a valuable tool for classifying Internet users.
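A minimal sketch of the two-stage idea (cluster analysis to obtain reference classes, then a neural network trained to reproduce them) could look as follows; the synthetic hourly profiles, the use of k-means, and the MLP architecture are assumptions for illustration, not the models evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical data: one 24-dimensional hourly traffic profile per user.
rng = np.random.default_rng(0)
profiles = rng.gamma(shape=2.0, scale=1.0, size=(500, 24))

# Step 1: obtain reference classes of users through cluster analysis.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)

# Step 2: train a neural network to reproduce the cluster-based classification.
X_tr, X_te, y_tr, y_te = train_test_split(profiles, labels,
                                          test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)
print("agreement with cluster labels:", mlp.score(X_te, y_te))
```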
Journal of Applied Statistics | 2012
Luzia Gonçalves; M. Rosário de Oliveira; Cláudia Pascoal; Ana M. Pires
The poor performance of the Wald method for constructing confidence intervals (CIs) for a binomial proportion has been demonstrated in a vast literature. The related problem of sample size determination needs to be updated, and comparative studies are essential to understanding the performance of alternative methods. In this paper, the sample size is obtained for the Clopper–Pearson, Bayesian (Uniform and Jeffreys priors), Wilson, Agresti–Coull, Anscombe, and Wald methods. Two two-step procedures are used: one based on the expected length (EL) of the CI and another on its first-order approximation. In the first step, all possible solutions that satisfy the optimal criterion are obtained. In the second step, a single solution is proposed according to a new criterion (e.g. highest coverage probability (CP)). In practice, a reduction in sample size is to be expected; therefore, we explore the behavior of the methods when admitting losses of 30% and 50%. For all the methods, the ELs are inflated, as expected, but the coverage probabilities remain close to the original target (with few exceptions). It is not easy to suggest a method that is optimal throughout the range (0, 1) for p. Depending on whether the goal is to achieve a CP approximately at or above the nominal level, different recommendations are made.
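As a worked illustration of the first step (finding the smallest sample size whose expected CI length meets a target), the sketch below does this for the Wilson method; the target length, the search range, and the exact-summation EL (rather than its first-order approximation) are illustrative choices, not the paper's procedure.

```python
import numpy as np
from scipy.stats import binom, norm

def wilson_ci(x, n, conf=0.95):
    """Wilson score interval for a binomial proportion (vectorised in x)."""
    z = norm.ppf(1 - (1 - conf) / 2)
    phat = x / n
    centre = (phat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

def expected_length(n, p, conf=0.95):
    """Expected CI length, averaging over the binomial distribution of X."""
    xs = np.arange(n + 1)
    lo, hi = wilson_ci(xs, n, conf)
    return np.sum(binom.pmf(xs, n, p) * (hi - lo))

def smallest_n(p, target_len, conf=0.95, n_max=5000):
    """Smallest n whose expected length meets the target, for given p."""
    for n in range(2, n_max):
        if expected_length(n, p, conf) <= target_len:
            return n
    return None

print(smallest_n(p=0.3, target_len=0.10))   # e.g. required n when p = 0.3
```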
Archive | 2000
M. Rosário de Oliveira; João A. Branco
Projection pursuit techniques are used to build new robust estimators for the parameters of the canonical correlation model. A simulation study shows that for non-ideal data these estimators can perform as well as other robust estimators. However, they can have much higher breakdown points. This advantage makes these estimators the right choice for use with real data, where potential outlying observations are very frequent.
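A minimal sketch of the projection pursuit idea, maximizing a robust association measure between one-dimensional projections of the two variable sets, might look as follows; Spearman's rho and the multi-start Nelder-Mead optimisation are illustrative stand-ins, not the estimators studied in this work.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def robust_cancorr(X, Y, seed=0, n_starts=10):
    """Illustrative projection-pursuit canonical correlation: find directions
    a, b maximizing a robust association measure (Spearman's rho here)
    between the projections X @ a and Y @ b."""
    rng = np.random.default_rng(seed)
    p, q = X.shape[1], Y.shape[1]

    def neg_assoc(w):
        a, b = w[:p], w[p:]
        a = a / (np.linalg.norm(a) + 1e-12)   # normalise to avoid scaling issues
        b = b / (np.linalg.norm(b) + 1e-12)
        rho, _ = spearmanr(X @ a, Y @ b)
        return -rho

    best = None
    for _ in range(n_starts):                 # multi-start: objective is non-convex
        res = minimize(neg_assoc, rng.normal(size=p + q), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    a, b = best.x[:p], best.x[p:]
    return a / np.linalg.norm(a), b / np.linalg.norm(b), -best.fun
```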
Traffic Management and Traffic Engineering for the Future Internet | 2009
M. Rosário de Oliveira; António Pacheco; Cláudia Pascoal; Rui Valadas; Paulo Salvador
The development of realistic Internet traffic models of applications and services calls for a good understanding of the nature of Internet flows, which can be affected by many factors. Especially relevant among these are the limitations imposed by link capacities and router algorithms that control bandwidth on a per-flow basis. In this paper, we perform a statistical analysis of an Internet traffic trace that specifically takes into account the upper bounds on the duration and rate of measured flows. In particular, we propose a new model for studying the dependencies between the logarithm of the size, the logarithm of the duration, and the logarithm of the transmission rate of an Internet flow. We consider a bivariate lognormal distribution for the flow size and flow duration, and derive estimators for the mean, the variance, and the correlation, based on a truncated domain that reflects the upper bounds on the duration and rate of measured flows. Moreover, we obtain regression equations that describe the expected value of one characteristic (size, duration, or rate) given another (size or duration), thus providing further insight into the dependencies between Internet flow characteristics. In particular, for flows with large sizes we are able to predict durations and rates that are coherent with the upper bound on transmission rates imposed by the network.
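Setting aside the truncation corrections that are the paper's main contribution, the regression of one log-characteristic on another implied by a fitted bivariate lognormal model can be sketched as follows; the synthetic flow records and parameter values are assumptions for illustration.

```python
import numpy as np

# Hypothetical flow records: sizes in bytes, durations in seconds (logged).
rng = np.random.default_rng(1)
log_dur = rng.normal(1.0, 1.2, size=10_000)
log_size = 7.0 + 0.8 * log_dur + rng.normal(0.0, 0.9, size=10_000)

# Fit a bivariate normal to (log size, log duration), i.e. a bivariate
# lognormal for (size, duration); truncation corrections are omitted here.
mu = np.array([log_size.mean(), log_dur.mean()])
S = np.cov(np.vstack([log_size, log_dur]))

# Regression implied by the fitted model:
# E[log size | log dur = d] = mu_s + (sigma_sd / sigma_dd) * (d - mu_d).
slope = S[0, 1] / S[1, 1]
intercept = mu[0] - slope * mu[1]
print(f"E[log size | log dur = d] ~ {intercept:.2f} + {slope:.2f} d")
```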
Neurocomputing | 2018
Francisco Macedo; M. Rosário de Oliveira; António Pacheco; Rui Valadas
Feature selection problems arise in a variety of applications, such as microarray analysis, clinical prediction, text categorization, image classification and face recognition, multi-label learning, and classification of Internet traffic. Among the various classes of methods, forward feature selection methods based on mutual information have become very popular and are widely used in practice. However, comparative evaluations of these methods have been limited by being based on specific datasets and classifiers. In this paper, we develop a theoretical framework that allows evaluating the methods based on their theoretical properties. Our framework is grounded on the properties of the target objective function that the methods try to approximate, and on a novel categorization of features according to their contribution to the explanation of the class; we derive upper and lower bounds for the target objective function and relate these bounds with the feature types. Then, we characterize the types of approximations taken by the methods, and analyze how these approximations cope with the good properties of the target objective function. Additionally, we develop a distributional setting designed to illustrate the various deficiencies of the methods, and provide several examples of wrong feature selections. Based on our work, we clearly identify the methods that should be avoided and the methods that currently have the best performance.
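A classical toy case of the kind of wrong selection such approximations can make is the XOR configuration sketched below, where each of two features is individually uninformative about the class but the pair determines it; the example is illustrative and is not taken from the paper's distributional setting.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 20_000
f1 = rng.integers(0, 2, n)
f2 = rng.integers(0, 2, n)
c = f1 ^ f2                                           # class = XOR of f1 and f2
f3 = np.where(rng.random(n) < 0.6, c, rng.integers(0, 2, n))   # weakly related to c

# Individually, f1 and f2 look useless while f3 looks best ...
for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name, "I(f; C) =", round(mutual_info_score(f, c), 3))

# ... yet jointly {f1, f2} determines the class (encode the pair as one variable).
pair = f1 * 2 + f2
print("I({f1,f2}; C) =", round(mutual_info_score(pair, c), 3))
```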
international conference on computer communications | 2015
M. Rosário de Oliveira; João Neves; Rui Valadas; Paulo Salvador
The classification of Internet traffic using supervised or semi-supervised statistical learning techniques, both for anomaly detection and identification of Internet applications, has been impaired by difficulties in obtaining a reliable ground-truth, required both to train the classifier and to evaluate its performance. A perfect ground-truth is increasingly difficult, or sometimes impossible, to obtain due to the growing percentage of encrypted traffic, the sophistication of network attacks, and the constant updates of Internet applications. In this paper, we study the impact of the ground-truth on training the classifier and estimating its performance measures. We show both theoretically and through simulation that ground-truth imperfections can severely bias the performance estimates. We then propose a latent class model that overcomes this problem by combining estimates of several classifiers over the same dataset. The model is evaluated using a high-quality dataset that includes the most representative Internet applications and network attacks. The results show that our latent class model produces very good performance estimates under mild levels of ground-truth imperfection, and can thus be used to correctly benchmark Internet traffic classifiers when only an imperfect ground-truth is available.
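The bias induced by an imperfect ground-truth can be illustrated with a short simulation: a classifier that is 90% accurate against the true labels appears noticeably less accurate when scored against labels that are themselves 10% wrong. The error rates below are arbitrary illustrative values, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_label = rng.integers(0, 2, n)

# A classifier that agrees with the truth 90% of the time.
classifier = np.where(rng.random(n) < 0.90, true_label, 1 - true_label)

# An imperfect ground-truth that mislabels 10% of the flows.
ground_truth = np.where(rng.random(n) < 0.90, true_label, 1 - true_label)

true_acc = np.mean(classifier == true_label)
apparent_acc = np.mean(classifier == ground_truth)
print(f"accuracy vs truth:            {true_acc:.3f}")
print(f"accuracy vs noisy ground-truth: {apparent_acc:.3f}")   # biased towards ~0.82
```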
european conference on networks and communications | 2015
Sara Faria Leal; M. Rosário de Oliveira; Rui Valadas
Anomaly detection of Internet traffic is a network service of primary importance, given the constant threats that impinge on Internet security. From a statistical perspective, traffic anomalies can be considered outliers and must be handled through effective outlier detection methods, for which feature selection is an important pre-processing step. Feature selection removes redundant and irrelevant features from the detection process, improving its performance. In this work, we consider outlier detection based on principal component analysis, and feature selection based on mutual information. Moreover, we address the use of kernel density estimation (KDE) to estimate the mutual information, which is designed for continuous features and avoids the discretization step of histograms. Our results, obtained using a high-quality ground-truth, clearly show the usefulness of feature selection and the superiority of KDE for estimating the mutual information in the context of Internet traffic anomaly detection.
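A simple resubstitution-style KDE estimate of mutual information, of the general kind discussed here, could be sketched as follows; the use of scipy's gaussian_kde and the plain plug-in average are illustrative choices, not the exact estimator used in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_kde(x, y):
    """Illustrative KDE estimate of I(X; Y) for continuous variables:
    sample average of log p(x, y) - log p(x) - log p(y)."""
    joint = gaussian_kde(np.vstack([x, y]))
    px, py = gaussian_kde(x), gaussian_kde(y)
    pts = np.vstack([x, y])
    return np.mean(np.log(joint(pts)) - np.log(px(x)) - np.log(py(y)))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 0.8 * x + rng.normal(scale=0.6, size=2000)     # correlated pair
print("KDE MI estimate (nats):", round(mi_kde(x, y), 3))
```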
international telecommunications network strategy and planning symposium | 2014
M. Rosário de Oliveira; Rui Valadas; Marcin Pietrzyk; Denis Collange
One important requirement associated with the deployment of large-scale classification infrastructures is the portability of classifiers, which allows a small number of pre-trained classifiers to be used on many sites and time periods. Portability can be severely degraded if the flow features used in the classification process lack stability, i.e. if they do not preserve their most relevant statistical properties across different sites and time periods. In this paper we propose a statistical procedure to evaluate the stability of flow features, which resorts to the notion of effect size. The procedure is used to challenge the stability of popular flow features, such as the direction and size of the first four packets of a TCP connection. Our results, obtained with three high-quality traffic traces, clearly show that only some applications are portable when using these features as discriminators. We also provide evidence for these findings based on the operation of the protocols underlying the Internet applications.
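As a rough illustration of an effect-size-based stability check, the sketch below computes Cohen's d for the same flow feature observed on two sites and flags large values; Cohen's d, the 0.2 threshold, and the synthetic packet sizes are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d effect size between the same feature measured on two traces."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

rng = np.random.default_rng(0)
# Hypothetical feature: size of the first data packet of a TCP connection,
# observed on two different sites.
site_a = rng.normal(520, 80, size=5000)
site_b = rng.normal(560, 85, size=5000)

d = cohens_d(site_a, site_b)
print(f"effect size d = {d:.2f}",
      "-> feature looks unstable across sites" if abs(d) > 0.2 else "-> stable")
```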