Rui Máximo Esteves
University of Stavanger
Publication
Featured research published by Rui Máximo Esteves.
advanced information networking and applications | 2011
Rui Máximo Esteves; Rui Pais; Chunming Rong
K-Means is a well-known clustering algorithm that has been successfully applied to a wide variety of problems. However, its application has usually been restricted to small datasets. Mahout is a cloud computing approach to K-Means that runs on a Hadoop system. Both Mahout and Hadoop are free and open source. Because they are inexpensive and scalable, these platforms are a promising technology for solving data-intensive problems that were not trivial in the past. In this work we studied the performance of Mahout using a large dataset. The tests were run on Amazon EC2 instances and allowed us to compare the gain in runtime when running on a multi-node cluster. This paper presents some results of ongoing research.
ieee international conference on cloud computing technology and science | 2011
Rui Máximo Esteves; Chunming Rong
This paper compares k-means and fuzzy c-means for clustering a noisy, realistic, and big dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the usage of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. With this ongoing research we found that in a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We also found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. From our experience, the use of Apache Mahout is premature.
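To make the contrast between the two algorithms concrete, here is a minimal Python sketch of how each one assigns a single point to clusters. It uses the textbook fuzzy c-means membership formula with fuzzifier m; the function names and the sample values are illustrative assumptions and are not taken from the paper or from Mahout.

```python
import math

def hard_assignment(point, centroids):
    """k-means style: the point belongs entirely to its nearest centroid."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists))

def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style: a membership degree per centroid, summing to 1.

    Textbook formula: u_j = d_j^(-2/(m-1)) / sum_k d_k^(-2/(m-1)), with m > 1.
    """
    dists = [math.dist(point, c) for c in centroids]
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        # The point sits exactly on one or more centroids: split membership there.
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(dists))]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]

# Hypothetical example: one point between two centroids.
centroids = [(0.0, 0.0), (3.0, 0.0)]
print(hard_assignment((1.0, 0.0), centroids))    # index of the nearest centroid
print(fuzzy_memberships((1.0, 0.0), centroids))  # approximately [0.8, 0.2]
```

Because every point keeps some membership in every cluster, noisy points influence all fuzzy centroids, which is one plausible reading of why cluster quality can suffer on noisy data.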
ieee international conference on cloud computing technology and science | 2010
Rui Máximo Esteves; Chunming Rong
Cloud computing is emerging as a serious paradigm shift in the way we use computers. It relies on several technologies that are not new. However, the increasing availability of bandwidth allows new combinations and opens new IT perspectives. Data storage and processing power are being moved to more efficient and centralized structures over the web. Costs are being reduced, with the loss of control over our data as a trade-off. It will be almost inevitable for companies to follow this trend. Yet, there are some important challenges to overcome. This paper discusses the cloud computing concept with respect to privacy and how it may affect our freedom of speech.
International Journal of Big Data Intelligence | 2014
Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong
The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyse large datasets. Cluster analysis techniques, such as K-Means, can be distributed across several machines. The accuracy of K-Means depends on the selection of seed centroids during initialisation. K-Means++ improves on the K-Means seeder, but suffers from problems when it is applied to large datasets. In this paper, we describe a new algorithm and a MapReduce implementation we developed that address these problems. We compared the performance with three existing algorithms and found that our algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-Means++ and is as fast as the streaming K-Means. Our work provides a method to select a good initial seeding in less time, facilitating fast, accurate cluster analysis over large datasets.
ieee international conference on cloud computing technology and science | 2014
Girma Kejela; Rui Máximo Esteves; Chunming Rong
This work is based on a real-life dataset collected from sensors that monitor drilling processes and equipment in an oil and gas company. The sensor data stream in at an interval of one second, which is equivalent to 86400 rows of data per day. After studying state-of-the-art Big Data analytics tools including Mahout, RHadoop and Spark, we chose 0xdata's H2O for this particular problem because of its fast in-memory processing, strong machine learning engine, and ease of use. Accurate predictive analytics of big sensor data can be used to estimate missing values, or to replace incorrect readings due to malfunctioning sensors or a broken communication channel. It can also be used to anticipate situations, which helps in various kinds of decision making, including maintenance planning and operation.
ieee international conference on cloud computing technology and science | 2013
Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong
The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyze large datasets. Cluster analysis techniques, such as K-means, can be used for large datasets distributed across several machines. The accuracy of K-means depends on the selection of seed centroids during initialization. K-means++ improves on the K-means seeder, but suffers from problems when it is applied to large datasets: (a) the random algorithm it employs can produce inconsistent results across several analysis runs under the same initial conditions; and (b) it scales poorly for large datasets. In this paper we describe a new Competitive K-means algorithm we developed that addresses both of these problems. We describe an efficient MapReduce implementation of our new Competitive K-means algorithm that we found scales well with large datasets. We compared the performance of our new algorithm with three existing cluster analysis algorithms and found that our new algorithm improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-means++ and is as fast as the Streaming K-means. Our work provides a method to select a good initial seeding in less time, facilitating accurate cluster analysis over large datasets in a shorter time.
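For readers unfamiliar with the seeding step discussed in this abstract, the following is a minimal sketch of standard serial K-means++ seeding in Python. It is not the Competitive K-means algorithm or its MapReduce implementation, whose details are not given here; the function names and sample points are illustrative assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centroids with the standard K-means++ rule.

    The first seed is chosen uniformly at random; each later seed is drawn
    with probability proportional to its squared distance to the nearest
    seed chosen so far.
    """
    rng = rng or random.Random()
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed: fall back to a uniform draw.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Hypothetical data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_pp_seeds(points, k=2, rng=random.Random(0)))
print(kmeans_pp_seeds(points, k=2, rng=random.Random(1)))
```

The sketch makes both problems named in the abstract visible: the result depends on the random draws, so different runs can return different seeds, and each new seed requires a full pass over the data, so the k passes are inherently sequential.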
ieee international conference on cloud computing technology and science | 2012
Rui Máximo Esteves; Thomas J. Hacker; Chunming Rong
The amount of resources needed to provision Virtual Machines (VMs) in a cloud computing system to support virtual HPC clusters can be predicted from the analysis of historic use data. In previous work, Hacker et al. found that cluster analysis is a useful tool to understand the underlying spatio-temporal dependencies present in system fault and use logs. However, the cluster analysis used for reducing spatio-temporal dependencies should be fast and accurate to understand the underlying stochastic properties of these systems. K-means is a fast cluster analysis method whose accuracy depends on the use of initialization algorithms that are usually serial and slow. In this paper we present two new parallel strategies for fast seeding of K-means cluster analysis. Both strategies were tested on a real problem where the aim was to reduce spatial and temporal dependencies of failures on large supercomputer systems. The performance of both strategies was compared with five existing serial implementations: K-means implementations of 1) Lloyd (L); 2) MacQueen (M); and 3) Hartigan-Wong (HW), all of them using Forgy seeding; 4) K-means++; and 5) Neural Gas clustering (NG), a more recent and sophisticated method. Our results show that our new Parallel Competitive Fitness approach reduces the Within Sum of Squares (WSQQ) measure, thus increasing the cluster quality of the three K-means implementations (L, M, and HW), and is 200 times faster than the existing serial K-means++. The existing serial and our new Parallel K-means++ have the lowest WSQQ. Our new Parallel K-means++ is twice as fast as the existing serial K-means++ method, and is 4 times faster than the NG method. Moreover, our new methods did not generate empty clusters, while NG did. As a result of our new techniques, predicting the amount of resources needed to provision VMs by processing historic system fault and use data can now be done faster and with more accuracy.
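The quality measure used in this abstract, the within-cluster sum of squares, can be sketched in a few lines of Python. This is the generic definition (the sum of squared distances from each point to the centroid of its assigned cluster), not code from the paper; the function names and data are illustrative assumptions.

```python
def cluster_centroids(points, labels):
    """Mean of the points in each non-empty cluster, keyed by label."""
    sums, counts = {}, {}
    for p, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(p))
        for i, x in enumerate(p):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def within_sum_of_squares(points, labels):
    """Sum of squared distances from each point to its cluster centroid.

    Lower values mean tighter clusters for a fixed number of clusters.
    """
    centroids = cluster_centroids(points, labels)
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

# Hypothetical data: two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
print(within_sum_of_squares(points, labels))  # 4.0
```

A label that never appears in `labels` corresponds to an empty cluster; it contributes nothing to the sum, which is why empty clusters have to be checked for separately.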
international conference on cloud computing | 2015
Jiaqi Ye; Chengwei Xiao; Rui Máximo Esteves; Chunming Rong
This paper evaluates the similarity between two time series generated by two sensors manufactured by different companies, with the aim of providing useful information for choosing between sensors of different brands. Spearman correlation coefficient analysis and Euclidean distance measurement were applied. The experiments were carried out in R. Visualizations of the studied time series and the similarity results obtained with the Spearman correlation coefficient and Euclidean distance are presented. In addition, the consistency and inconsistency between the results of the two measures are discussed.
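The two similarity measures named in this abstract can be sketched as follows. The paper's experiments were carried out in R; this Python version is only an illustration with made-up readings, and the function names are assumptions.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for a constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(x), average_ranks(y))

def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical readings from two sensors that rise together on different scales.
sensor_a = [1.0, 2.0, 3.0, 4.0]
sensor_b = [10.0, 20.0, 35.0, 80.0]
print(spearman(sensor_a, sensor_b))   # close to 1: identical ordering
print(euclidean(sensor_a, sensor_b))  # large: the raw values are far apart
```

The example shows one way the two measures can disagree: Spearman depends only on the ordering of the values, while Euclidean distance is sensitive to offset and scale.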
international symposium on pervasive systems, algorithms, and networks | 2009
Rui Máximo Esteves; Tomasz Wiktor Wlodarczyk; Chunming Rong; Einar Landre
In this paper we propose a Bayesian Network approach as a promising data fusion technique for monitoring sensor accuracy. We demonstrate the usefulness of this method even when there is not enough usable data to construct the model in the traditional way. Under these data constraints we suggest an inversion of the causal relationship. This approach proves to be a possible way to help the expert in the conditional probability assessment process. As a result, a working model is constructed, which would not be possible using the traditional Bayesian Network approach.
Concurrency and Computation: Practice and Experience | 2016
Chengwei Xiao; Jiaqi Ye; Rui Máximo Esteves; Chunming Rong