Angeles G. Navarro
University of Málaga
Publications
Featured research published by Angeles G. Navarro.
International Conference on Multimedia and Expo | 2002
Sonia González; Angeles G. Navarro; Juan Torres López; Emilio L. Zapata
In recent years, there has been an increasing interest in video on demand (VoD) systems. We study a distributed VoD system in which the videos are replicated according to their popularity. We present an algorithm to share the load in such a system efficiently and an analytical model that captures the performance of this algorithm, which we validate through simulations. This research shows that popularity is an essential parameter which can save storage without reducing performance by just replicating a few popular videos in all the servers.
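The storage argument in this abstract can be illustrated with a rough sketch. It assumes a Zipf-like popularity law, which the abstract does not state, and it stores each non-replicated video exactly once. The function names, the `skew` parameter, and the example sizes are hypothetical and are not taken from the paper.

```python
def zipf_popularity(num_videos, skew=1.0):
    """Request probability per video, most popular first (assumed Zipf-like law)."""
    weights = [1.0 / (rank ** skew) for rank in range(1, num_videos + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def partial_replication(num_videos, num_servers, num_replicated, skew=1.0):
    """Return (copies stored system-wide, fraction of requests any server can serve)."""
    popularity = zipf_popularity(num_videos, skew)
    # The top videos are copied to every server; each remaining video is stored once.
    copies = num_replicated * num_servers + (num_videos - num_replicated)
    # A request for a replicated video can be served by any server.
    any_server_fraction = sum(popularity[:num_replicated])
    return copies, any_server_fraction


# Hypothetical sizes: 100 videos, 4 servers, the 10 most popular videos replicated.
copies, fraction = partial_replication(num_videos=100, num_servers=4, num_replicated=10)
full_replication_copies = 100 * 4
print(copies, full_replication_copies, round(fraction, 3))
```

Under these assumptions, replicating only a small set of popular videos stores far fewer copies than full replication while still letting any server handle a large share of the requests. The sketch does not model waiting time or load sharing, which the paper treats analytically.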
IEEE Transactions on Parallel and Distributed Systems | 2003
Angeles G. Navarro; Emilio L. Zapata; David A. Padua
This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions (the decompositions) that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
IEEE Transactions on Multimedia | 2006
Sonia González; Angeles G. Navarro; Juan Torres López; Emilio L. Zapata
In our research, we consider a distributed video-on-demand (VoD) system in which only the most popular videos are replicated in all the servers, whereas the rest of them are distributed through the system following some allocation scheme. In this paper, we present an algorithm to efficiently share the load in such a system and an analytical model that captures the performance of this algorithm, which we validate through simulations. One novelty in our work is that our analytical model lets us relate popularity and partial replication of some of the videos, and predict the user waiting time. We exploit such relationships to help the system designer select the size of the servers and network, choose the optimal number of servers to maintain a short waiting time, and predict when the network becomes a bottleneck.
IEEE Transactions on Parallel and Distributed Systems | 2002
Yunheung Paek; Angeles G. Navarro; Emilio L. Zapata; Jay Hoeflinger; David A. Padua
The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit a very stable and scalable performance for a variety of application programs. Considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. However, the principal drawback of these machines is a lack of programmability, caused by the absence of the global cache coherence that is necessary to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that a remedy for this problem is advanced compiler technology. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance with this type of machine. From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once they are fully implemented in the compiler.
The Journal of Supercomputing | 2014
Angeles G. Navarro; Antonio Vilches; Francisco Corbera; Rafael Asenjo
This paper explores the possibility of efficiently executing a single application using multicores simultaneously with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel_for template to allow its exploitation on heterogeneous architectures. Due to the asymmetry of the computing resources, we propose in this work a dynamic scheduling strategy coupled with an adaptive partitioning scheme that resizes chunks to prevent underutilization and load imbalance of CPUs and GPUs. In this paper we also address the problem of the underutilization of the CPU core where a host thread operates. To solve it, we propose two different approaches: (1) a collaborative host thread strategy, in which the host thread, instead of busy-waiting for the GPU to complete, carries out useful chunk processing; and (2) a host thread blocking strategy combined with oversubscription, which delegates to the OS the duty of scheduling threads to available CPU cores in order to guarantee that all cores are doing useful work. Using two benchmarks, we evaluate the overhead introduced by our scheduling and partitioning algorithms, finding that it is negligible. We also evaluate the efficiency of the proposed strategies, finding that allowing oversubscription controlled by the OS can be beneficial under certain scenarios.
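One plausible way to resize chunks for asymmetric devices is to give each CPU core a chunk that takes about as long as one GPU chunk. The sketch below simulates that rule. It is an assumption for illustration, not necessarily the partitioning scheme of the paper, and `cpu_chunk_size`, `simulate`, and the throughput figures are hypothetical.

```python
import heapq


def cpu_chunk_size(gpu_chunk, gpu_throughput, cpu_throughput, remaining):
    """Size a CPU chunk so it takes roughly as long as one GPU chunk."""
    if gpu_throughput <= 0 or cpu_throughput <= 0:
        raise ValueError("throughputs must be positive")
    size = int(gpu_chunk * cpu_throughput / gpu_throughput)
    # Always hand out at least one iteration, never more than what is left.
    return max(1, min(size, remaining))


def simulate(total_iters, gpu_chunk, gpu_throughput, cpu_throughput, num_cpu_workers):
    """Simulate dynamic scheduling; return (iterations per worker, makespan)."""
    # Worker 0 is the GPU; workers 1..n are CPU cores. All are idle at time 0.
    idle = [(0.0, worker) for worker in range(num_cpu_workers + 1)]
    heapq.heapify(idle)
    done = [0] * (num_cpu_workers + 1)
    remaining = total_iters
    makespan = 0.0
    while remaining > 0:
        # The worker that becomes idle first grabs the next chunk.
        now, worker = heapq.heappop(idle)
        if worker == 0:
            chunk, rate = min(gpu_chunk, remaining), gpu_throughput
        else:
            chunk = cpu_chunk_size(gpu_chunk, gpu_throughput, cpu_throughput, remaining)
            rate = cpu_throughput
        remaining -= chunk
        done[worker] += chunk
        finish = now + chunk / rate
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, worker))
    return done, makespan


# Hypothetical setup: the GPU is 10x faster than each of 3 CPU cores.
done, makespan = simulate(10_000, gpu_chunk=1_000, gpu_throughput=10.0,
                          cpu_throughput=1.0, num_cpu_workers=3)
print(done, makespan)
```

Because every chunk takes a similar time regardless of the device, no worker sits idle for long at the end of the iteration space. In a real runtime the throughputs would be measured during execution rather than passed in as constants.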
Symposium on Computer Architecture and High Performance Computing | 2012
Alberto Sanz; Rafael Asenjo; Juan Torres López; Rafael Larrosa; Angeles G. Navarro; Vassily Litvinov; Sung-Eun Choi; Bradford L. Chamberlain
Chapel is a parallel programming language designed to improve the productivity and ease of use of conventional and parallel computers. This language currently delivers suboptimal performance when executing codes that perform global data re-allocation operations on distributed memory architectures. This is mainly due to data communication that is done without aggregation (one message for each remote array element). In this work, we analyze Chapel's standard Block and Cyclic distribution modules and optimize the communication routines for array assignments by performing aggregation. Thanks to the expressive power of Chapel, the compiler and runtime have enough information to do communication aggregation without user intervention. The runtime relies on the low-level GASNet networking layer, whose versions of one-sided bulk put/get routines that support strides are particularly useful for us. Experiments conducted on HECToR (a Cray XE6) and Jaguar (a Cray XK6) reveal that the implemented techniques can lead to significant reductions in communication time.
High Performance Computing and Communications | 2010
Antonio J. Dios; Rafael Asenjo; Angeles G. Navarro; Francisco Corbera; Emilio L. Zapata
This paper analyzes the applicability of the task programming model in the parallelization of generic wavefront problems. Computations on this type of problem are characterized by a data dependency pattern across a data space, which can produce a variable number of independent tasks through the traversal of such space. Specifically, we think that it is better to formulate the parallelization of these wavefront-based programs in terms of logical tasks instead of threads, for several reasons: more efficient matching of computations to available resources, faster task start-up and creation times, improved load balancing, and higher-level thinking. To implement the parallel wavefront we have used two state-of-the-art task libraries: TBB and OpenMP 3.0. In this work, we highlight the differences between both implementations, from a programmer standpoint and from the performance point of view. To this end, we conduct several experiments to identify the factors that can limit the performance in each case. In addition, we present a task-based wavefront template that makes the coding of parallel wavefront codes easier. We have validated this template with three real dynamic programming algorithms, finding that the TBB-coded template always outperforms the OpenMP-based one.
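The wavefront dependency pattern can be sketched with per-cell dependence counters, a scheme commonly used with task libraries. The code below is a sequential Python illustration and is neither TBB nor OpenMP code, nor the template presented in the paper. It assumes each cell depends on its north and west neighbours. The names `wavefront` and `lattice_paths` are hypothetical.

```python
from collections import deque


def wavefront(rows, cols, compute):
    """Run compute(i, j, grid) over a grid where each cell depends on its
    north (i-1, j) and west (i, j-1) neighbours. Requires rows, cols >= 1."""
    grid = [[None] * cols for _ in range(rows)]
    # Each cell counts the dependences it is still waiting on (0, 1, or 2).
    pending = [[(i > 0) + (j > 0) for j in range(cols)] for i in range(rows)]
    ready = deque([(0, 0)])  # only the top-left cell starts with no dependences
    max_ready = 0
    while ready:
        # Cells in the ready queue are mutually independent: a parallel
        # runtime could execute all of them at the same time.
        max_ready = max(max_ready, len(ready))
        i, j = ready.popleft()
        grid[i][j] = compute(i, j, grid)
        # Finishing a cell releases its south and east neighbours.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < rows and nj < cols:
                pending[ni][nj] -= 1
                if pending[ni][nj] == 0:
                    ready.append((ni, nj))
    return grid, max_ready


def lattice_paths(i, j, grid):
    """Toy dynamic programming kernel: count monotone paths to cell (i, j)."""
    if i == 0 or j == 0:
        return 1
    return grid[i - 1][j] + grid[i][j - 1]


grid, max_ready = wavefront(4, 4, lattice_paths)
print(grid[3][3], max_ready)
```

The size of the ready queue changes as the traversal advances, which reflects the variable number of independent tasks mentioned in the abstract. In a task library, each entry of the queue would be spawned as a task instead of being processed in a loop.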
IEEE Transactions on Parallel and Distributed Systems | 2016
Antonio Vilches; Angeles G. Navarro; Rafael Asenjo; Francisco Corbera; Ruben Gran; María Jesús Garzarán
In this paper, we consider the problem of efficiently executing streaming applications on commodity processors composed of several cores and an on-chip GPU. Streaming applications, such as those in vision and video analytics, consist of a pipeline of stages and are good candidates to take advantage of this type of platform. We also consider that characteristics of the input may change while the application is running. Therefore, we propose a framework that adaptively finds the optimal mapping of the pipeline stages. The core of the framework is an analytical model coupled with information collected at runtime, used to dynamically map each pipeline stage to the most efficient device, taking into consideration both performance and energy. Our experimental results show that, for the evaluated applications running on two different architectures, our model always predicts the best configuration among the evaluated alternatives and significantly reduces the amount of information that needs to be collected at runtime. This best configuration has, on average, 20 percent higher throughput than the configuration recommended by a baseline state-of-the-art approach, while the throughput/energy ratio is 43 percent higher. We have measured improvements in throughput and throughput/energy of up to 81 and 204 percent, respectively, when the model is used to adapt to a video that changes from low to high definition.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2004
Angeles G. Navarro; Francisco Corbera; Rafael Asenjo; Adrian Tineo; Oscar G. Plata; Emilio L. Zapata
The approach presented in this paper focuses on detecting data dependences induced by heap-directed pointers on loops that access dynamic data structures. Knowledge about the shape of the data structure accessible from a heap-directed pointer provides critical information for disambiguating heap accesses originating from it. Our approach is based on a previously developed shape analysis that maintains topological information of the connections among the different nodes (memory locations) in the data structure. The novelty is that our approach carries out abstract interpretation of the statements being analyzed and lets us annotate the memory locations reached by each statement with read/write information. This information is later used to find dependences in a very accurate dependence test, which we introduce in this paper.
International Conference on Parallel Processing | 1999
Angeles G. Navarro; Rafael Asenjo; Emilio L. Zapata; David A. Padua
Most of today's multiprocessors have a Distributed-Shared Memory (DSM) organization, which enables scalability while retaining the convenience of the shared-memory programming paradigm. Data locality is crucial for performance in DSM machines, due to the difference in access times between local and remote memories. In this paper, we present a compile-time representation that captures the memory locality exhibited by a program in the form of a graph known as the Locality-Communication Graph (LCG). In the LCG, each node represents a DO loop nest which can have at most one level of parallelism. Not all loops need to be represented within a node and, therefore, the LCG may contain cycles. Our representation works whether the loops represented by the nodes are perfectly nested or not, and the subscript expressions and loop limits can be affine or non-affine expressions of the loop indices. The LCG provides essential information that a parallelizing compiler can use to automatically choose a good iteration/data distribution and to schedule the communication operations required during program execution.