Rui Máximo Esteves | Researchain

Publication

Featured researches published by Rui Máximo Esteves.

advanced information networking and applications | 2011

K-means Clustering in the Cloud -- A Mahout Test

Rui Máximo Esteves; Rui Pais; Chunming Rong

K-Means is a well-known clustering algorithm that has been successfully applied to a wide variety of problems. However, its application has usually been restricted to small datasets. Mahout is a cloud computing approach to K-Means that runs on a Hadoop system. Both Mahout and Hadoop are free and open source. Because they are inexpensive and scalable, these platforms are a promising technology for solving data-intensive problems that were not trivial in the past. In this work we studied the performance of Mahout using a large dataset. The tests ran on Amazon EC2 instances and allowed us to compare the gain in runtime when running on a multi-node cluster. This paper presents some results of ongoing research.
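For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal single-machine sketch of Lloyd-style K-Means. It is not Mahout's MapReduce implementation; the function names `nearest` and `kmeans`, the seeding by random data points, and the iteration cap are assumptions made for illustration.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, max_iterations=100, seed=0):
    """Lloyd-style K-Means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seed with k distinct data points
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members;
        # a centroid with no members stays where it is.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:  # no centroid moved, so the result is stable
            break
        centroids = updated
    return centroids
```

In a distributed setting the assignment step is the natural candidate for a map phase and the update step for a reduce phase, which is the general idea behind running K-Means on Hadoop.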

ieee international conference on cloud computing technology and science | 2011

Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud

Rui Máximo Esteves; Chunming Rong

This paper compares k-means and fuzzy c-means for clustering a noisy, realistic, and large dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research we found that, in a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We also found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. In our experience, the use of Apache Mahout is premature.
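The difference between the two algorithms compared above lies mainly in how a point is assigned to clusters. The sketch below contrasts the hard assignment of k-means with the standard fuzzy c-means membership formula. It is an illustration only, not Mahout code; the function names and the default fuzzifier `m=2.0` are assumptions.

```python
import math


def hard_assignment(point, centroids):
    """k-means style: the point belongs entirely to its nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style: a membership degree per centroid, summing to 1.

    m > 1 is the fuzzifier; larger values give softer memberships.
    """
    distances = [math.dist(point, c) for c in centroids]
    if 0.0 in distances:
        # The point sits exactly on one or more centroids: share full membership among them.
        zeros = distances.count(0.0)
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

For a point at distance 1 from one centroid and 3 from another, `fuzzy_memberships` with `m=2.0` yields memberships of 0.9 and 0.1, whereas `hard_assignment` simply returns the index of the closer centroid.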

ieee international conference on cloud computing technology and science | 2010

Social Impact of Privacy in Cloud Computing

Rui Máximo Esteves; Chunming Rong

Cloud computing is emerging as a serious paradigm shift in the way we use computers. It relies on several technologies that are not new. However, the increasing availability of bandwidth allows new combinations and opens new IT perspectives. Data storage and processing power are being moved to more efficient and centralized structures over the web. Costs are being reduced, with the loss of control over our data as a trade-off. It will be almost inevitable for companies to follow this trend. Yet there are some important challenges to overcome. This paper discusses the Cloud Computing concept with regard to privacy and how it may affect our freedom of speech.

International Journal of Big Data Intelligence | 2014

A new approach for accurate distributed cluster analysis for Big Data: competitive K-Means

Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong

The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means, can be distributed across several machines. The accuracy of K-Means depends on the selection of seed centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from problems when it is applied to large datasets. In this paper, we describe a new algorithm and a MapReduce implementation we developed that address these problems. We compared the performance with three existing algorithms and found that our algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-Means++ and is as fast as the streaming K-Means. Our work provides a method to select a good initial seeding in less time, facilitating fast, accurate cluster analysis over large datasets.

ieee international conference on cloud computing technology and science | 2014

Predictive Analytics of Sensor Data Using Distributed Machine Learning Techniques

Girma Kejela; Rui Máximo Esteves; Chunming Rong

This work is based on a real-life dataset collected from sensors that monitor drilling processes and equipment in an oil and gas company. The sensor data streams in at an interval of one second, which is equivalent to 86400 rows of data per day. After studying state-of-the-art Big Data analytics tools including Mahout, RHadoop and Spark, we chose 0xdata's H2O for this particular problem because of its fast in-memory processing, strong machine learning engine, and ease of use. Accurate predictive analytics of big sensor data can be used to estimate missing values, or to replace incorrect readings due to malfunctioning sensors or a broken communication channel. It can also be used to anticipate situations, which helps in various kinds of decision making, including maintenance planning and operation.

ieee international conference on cloud computing technology and science | 2013

Competitive K-Means, a New Accurate and Distributed K-Means Algorithm for Large Datasets

Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong

The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyze large datasets. Cluster analysis techniques, such as K-means, can be used for large datasets distributed across several machines. The accuracy of K-means depends on the selection of seed centroids during initialization. K-means++ improves on the K-means seeder, but suffers from problems when it is applied to large datasets: (a) the random algorithm it employs can produce inconsistent results across several analysis runs under the same initial conditions; and (b) it scales poorly for large datasets. In this paper we describe a new Competitive K-means algorithm we developed that addresses both of these problems. We describe an efficient MapReduce implementation of our new Competitive K-means algorithm that we found scales well with large datasets. We compared the performance of our new algorithm with three existing cluster analysis algorithms and found that our new algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-means++ and is as fast as the Streaming K-means. Our work provides a method to select a good initial seeding in less time, facilitating accurate cluster analysis over large datasets.
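To make the two problems named above concrete, here is a minimal sketch of the standard serial K-means++ seeder (not the Competitive K-means algorithm, whose details are not given in this abstract). The function name `kmeans_pp_seeds` and the explicit `rng` argument are assumptions for illustration, and the sketch assumes the data contains at least `k` distinct points.

```python
import random


def kmeans_pp_seeds(points, k, rng):
    """Standard K-means++ seeding on a list of equal-length numeric tuples.

    The first seed is uniform at random; each later seed is drawn with
    probability proportional to the squared distance to the nearest seed
    chosen so far.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # One full pass over the data per seed: this sequential dependency
        # is why the serial seeder scales poorly on large datasets.
        weights = [
            min(sum((a - b) ** 2 for a, b in zip(p, s)) for s in seeds)
            for p in points
        ]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds
```

Because every draw is random, two runs with different `random.Random` states can return different seed sets for the same data, which corresponds to problem (a); the `k` sequential passes over all points correspond to problem (b).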

ieee international conference on cloud computing technology and science | 2012

Cluster analysis for the cloud: Parallel competitive fitness and parallel K-means++ for large dataset analysis

Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong

The amount of resources needed to provision Virtual Machines (VMs) in a cloud computing system to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool to understand the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependencies should be fast and accurate to understand the underlying stochastic properties of these systems. K-means is a fast cluster analysis method whose accuracy depends on the use of initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding of K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies was compared with five existing serial implementations: K-means implementations of 1) Lloyd (L); 2) MacQueen (M); and 3) Hartigan-Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing the cluster quality of the three K-means implementations (L, M, and HW), and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method, and is 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs by processing historic system fault and use data can now be done faster and with more accuracy.
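The quality measure used above, the within-cluster sum of squares, can be sketched as follows. This is a generic illustration, not the authors' code; the function name and the representation of a clustering as a list of point lists are assumptions.

```python
def within_sum_of_squares(clusters):
    """Sum of squared distances from every point to its own cluster mean.

    clusters is a list of clusters, each a list of equal-length numeric
    tuples. Lower values indicate tighter clusters.
    """
    total = 0.0
    for members in clusters:
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(col) / len(members) for col in zip(*members)]
        for p in members:
            total += sum((a - c) ** 2 for a, c in zip(p, centroid))
    return total
```

For the four points (0, 0), (0, 2), (10, 0), (10, 2), grouping the two left points and the two right points gives a value of 4.0, while pairing each left point with a right point gives 100.0, so the first clustering is the tighter one.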

international conference on cloud computing | 2015

Time Series Similarity Evaluation Based on Spearman's Correlation Coefficients and Distance Measures

Jiaqi Ye; Chengwei Xiao; Rui Máximo Esteves; Chunming Rong

This paper evaluates the similarity between two time series generated by two sensors manufactured by different companies, aiming to provide useful information for choosing between sensors of different brands. Spearman correlation coefficient analysis and Euclidean distance measurement were applied. The experiment was carried out in R. Visualizations of the studied time series are presented, together with the similarity results obtained with the Spearman correlation coefficient and the Euclidean distance. The consistency and inconsistency between the results of the two measures are also discussed.
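The two measures named above can disagree, which the following sketch illustrates. It is a plain Python illustration rather than the authors' R code; the function names and the sample readings are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical readings: sensor_b follows the same trend with a large offset.
sensor_a = [1.0, 2.0, 3.0, 4.0]
sensor_b = [11.0, 12.5, 13.0, 19.0]
rho = spearman(sensor_a, sensor_b)
distance = euclidean(sensor_a, sensor_b)
```

Here the rank correlation is perfect because both series rise together, while the Euclidean distance is large because of the offset; this is one way the two measures can give inconsistent pictures of similarity.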

international symposium on pervasive systems, algorithms, and networks | 2009

Bayesian Networks for Fault Detection under Lack of Historical Data

Rui Máximo Esteves; Tomasz Wiktor Wlodarczyk; Chunming Rong; Einar Landre

In this paper we propose a Bayesian Network approach as a promising data fusion technique for monitoring sensor accuracy. We demonstrate the usefulness of this method even when there is not enough usable data to construct the model in the traditional way. Under these data constraints, we suggest an inversion of the causal relationship. This approach proves to be a possible solution to help the expert in the conditional probability assessment process. As a result, a working model is constructed, which would not be possible using the traditional Bayesian Network approach.

Concurrency and Computation: Practice and Experience | 2016