Alberto A. Del Barrio
Complutense University of Madrid
Publication
Featured research published by Alberto A. Del Barrio.
Design, Automation, and Test in Europe | 2010
Alberto A. Del Barrio; María Molina; José M. Mendías; Román Hermida; Seda Ogrenci Memik
Speculative Functional Units (SFUs) enable a new execution paradigm for High Level Synthesis (HLS). SFUs are arithmetic functional units that operate using a predictor for the carry signal, which reduces the critical path delay. The performance of these units is determined by the success in the prediction of the carry value, i.e. the hit rate of the prediction. Hence, SFUs reduce the critical path at a low cost, but they cannot be used in HLS with current techniques. In order to use them, it is necessary to include hardware support to recover from mispredictions of the carry signals. In this paper, we present techniques for designing a datapath controller for seamless deployment of SFUs in HLS. We have developed two techniques for this goal. The first approach stops the execution of the entire datapath for each misprediction and resumes execution once the correct value of the carry is known. The second approach decouples the functional unit suffering from the misprediction from the rest of the datapath. Hence, it allows the rest of the SFUs to carry on execution and be at different scheduling states at different times. Experiments show that it is possible to reduce execution time by as much as 38% and by 33% on average.
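The difference between the two control approaches can be illustrated with a toy cycle count. The sketch below is not the controller from the paper: it assumes, purely for illustration, that every operation takes one cycle, that a misprediction costs exactly one extra cycle, and that operations on different units have no data dependences. The function names and the input layout (one list of misprediction flags per unit) are hypothetical.

```python
def stall_all_cycles(mispredicts):
    """Lock-step model: the whole datapath stalls one cycle whenever
    any unit mispredicts at a given step (assumed 1-cycle penalty)."""
    steps = len(mispredicts[0])
    return sum(
        2 if any(unit[i] for unit in mispredicts) else 1
        for i in range(steps)
    )


def decoupled_cycles(mispredicts):
    """Decoupled model: each unit only pays for its own mispredictions,
    so the schedule ends when the slowest unit finishes.
    Assumes no data dependences between units (a strong simplification)."""
    return max(len(unit) + sum(unit) for unit in mispredicts)


# Two units, four operations each; True marks a carry misprediction.
flags = [
    [True, False, False, False],
    [False, False, True, False],
]
print(stall_all_cycles(flags), decoupled_cycles(flags))
```

In this example the lock-step model pays for both mispredictions on every unit, while the decoupled model lets each unit absorb only its own penalty. Real datapaths have dependences between operations, so the gap in practice depends on the schedule.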
International Conference on Computer Design | 2013
Alberto A. Del Barrio; Román Hermida; Seda Ogrenci Memik
Variable Latency Adders are attracting strong interest for increasing performance at a low cost. However, most of the literature is focused on achieving a good area-delay tradeoff. In this paper we consider multispeculation as an alternative for designing adders with low energy consumption, while offering better performance than the corresponding non-speculative ones. Instead of introducing more logic to accelerate the computation, the adder is split into several fragments which operate in parallel, and whose carry-in signals are provided by predictor units. On the one hand, the critical path of the module is shortened, and on the other hand the frequent useless glitches produced in the carry propagation structure are diminished. This translates into an overall energy reduction. Several experiments have been performed with linear and logarithmic adders, and results show energy savings of up to 90% and 70%, respectively, while achieving an additional execution time decrease. Furthermore, when utilized in whole datapaths with current control techniques, it is possible to reduce execution time by 24.5% (34% best case) and energy by 32% (48% best case) on average.
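A behavioral sketch of a fragmented adder with predicted carry-ins may help make the idea concrete. It is only a software model under stated assumptions, not the hardware design from the paper: the predictor used here (predict a carry when both operands of the previous fragment have their most significant bit set) is a hypothetical choice, and recovery from any misprediction is modeled as a single extra cycle.

```python
def multispeculative_add(a, b, width=16, fragments=4):
    """Add two unsigned integers with carry-speculating fragments.

    Returns (sum modulo 2**width, cycles, mispredicted_fragments).
    The predictor and the 1-cycle recovery are illustrative assumptions.
    """
    if width % fragments:
        raise ValueError("width must be a multiple of fragments")
    k = width // fragments
    mask = (1 << k) - 1
    a_parts = [(a >> (i * k)) & mask for i in range(fragments)]
    b_parts = [(b >> (i * k)) & mask for i in range(fragments)]

    # Fragment 0 has a known carry-in of 0; the others use the
    # hypothetical predictor: AND of the previous fragment's operand MSBs.
    predicted = [0] + [
        (a_parts[i - 1] >> (k - 1)) & (b_parts[i - 1] >> (k - 1))
        for i in range(1, fragments)
    ]

    # Reference carry chain, used to check predictions and build the result.
    result, carry, misses = 0, 0, 0
    for i in range(fragments):
        if predicted[i] != carry:
            misses += 1
        total = a_parts[i] + b_parts[i] + carry
        result |= (total & mask) << (i * k)
        carry = total >> k

    cycles = 1 if misses == 0 else 2
    return result, cycles, misses


print(multispeculative_add(0x1234, 0x0101))  # no carries cross fragments
print(multispeculative_add(0x00FF, 0x0001))  # carry ripples through two fragments
```

When every prediction hits, the result is available after the delay of a single short fragment; only the mispredicting cases pay the extra cycle, which is why the hit rate drives average latency.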
Digital Signal Processing | 2014
Joaquín Recas; Nadia Khaled; Alberto A. Del Barrio; Román Hermida
The IEEE-802.15.4 standard is poised to become the global standard for low data rate, low energy consumption Wireless Sensor Networks. By assigning the same sets of contention access parameters for all data frames and nodes, the Contention Access Period (CAP) of the slotted IEEE-802.15.4 currently provides an even channel access functionality and no service differentiation. However, some applications may require service differentiation and traffic prioritization support to accommodate high-priority traffic (e.g., alarms). In order to simulate a scenario in which different sets of access parameters for different node classes can be configured, this paper develops a Markov-chain-based model of the CAP of the IEEE-802.15.4-MAC. Our Markov model can be used to evaluate the impact of mixing node classes on important factors such as throughput, energy consumption, probability of delivery and packet latency. The model has been used to provide traffic differentiation in a high saturation scenario in which a set of nodes can be configured to increase the probability of sending a packet by 76% and reduce latency by 58%, with a 69% energy penalty, in comparison with a standard scenario. The accuracy of the Markov model is validated by extensive ns-2 simulations.
Microelectronics Journal | 2014
Alberto A. Del Barrio; Román Hermida; Seda Ogrenci Memik; José M. Mendías; María Molina
The recent introduction of Variable Latency Functional Units (VLFUs) has broadened the design space of High-Level Synthesis (HLS). Nevertheless, their use is restricted to only a few operators in the datapaths because the number of cases to control grows exponentially. In this work an instance of VLFUs is described, and based on its structure, the average latency of tree structures is improved. Multispeculative Functional Units (MSFUs) are arithmetic Functional Units that operate using several predictors for the carry signal. In spite of utilizing more than one predictor, at most one additional very short cycle is needed to produce the correct result in the majority of cases. In this paper our proposal takes advantage of multispeculation in order to increase the performance of tree structures with a negligible area penalty. By judiciously introducing these structures into computation trees, it will only be necessary to predict the carry signals in certain selected nodes, thus minimizing the total number of predictions and the number of operations that can potentially mispredict. Hence, the average latency will be diminished and thus performance will be increased. Our experiments show that it is possible to improve execution time by 26%. Furthermore, our flow outperforms previous approaches with Speculative FUs.
Design, Automation, and Test in Europe | 2013
Alberto A. Del Barrio; Román Hermida; Seda Ogrenci Memik; José M. Mendías; María Molina
Multispeculative Functional Units (MSFUs) are arithmetic functional units that operate using several predictors for the carry signal. The carry prediction helps to shorten the critical path of the functional unit. The average performance of these units is determined by the hit rate of the prediction. In spite of utilizing more than one predictor, at most one additional cycle is needed to produce the correct result in the majority of cases. In this paper we present multispeculation as a way of increasing the performance of tree structures with a negligible area penalty. By judiciously introducing these structures into computation trees, it will only be necessary to predict in certain selected nodes, thus minimizing the number of operations that can potentially mispredict. Hence, the average latency will be diminished and thus performance will be increased. Our experiments show that it is possible to improve execution time by 24% and 38% on average, when considering logarithmic and linear modules, respectively.
IEEE Transactions on Computers | 2016
Alberto A. Del Barrio; Román Hermida; Seda Ogrenci Memik
Functional Units that are designed to receive inputs and produce outputs using a non-redundant format typically exhibit inferior performance. In order to overcome this limitation, the carry-save and partial carry-save formats have been proposed. Both approaches are very suitable when implementing addition trees. Nevertheless, if there are multiplications in the datapath, the inputs to the multiplier must be reduced to a non-redundant form, to avoid applying the distributive property. In this paper we present a multiplier able to receive two numbers in partial carry-save format, and produce a result in partial carry-save format as well. This is done by modifying the Booth encoder and leveraging the generate and propagate group signals that are available because of the partial carry-save format. Hence, datapaths can be fully implemented without additional penalty cycles due to reductions to non-redundant forms. Experiments show that our proposed multiplier has a 15 percent shorter delay than a conventional Booth radix-4 multiplier. Moreover, when combining it with partial carry-save adders it is possible to reduce execution time by 36 percent on average for several benchmarks, achieving a 32.7 percent reduction in the Energy Delay Product at the same time.
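For readers unfamiliar with redundant formats, the following sketch shows the standard full carry-save idea on non-negative Python integers: a 3:2 compressor keeps each running total as a (sum, carry) pair, so an addition tree needs only one carry-propagating addition at the very end. It does not model the partial carry-save format or the multiplier proposed in the paper, and the function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: reduce three operands to a (sum, carry) pair
    using only bitwise logic, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def add_tree(operands):
    """Accumulate non-negative integers while staying in carry-save form."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s, c


def to_non_redundant(s, c):
    """Final conversion: the only step that propagates carries."""
    return s + c


s, c = add_tree([13, 7, 22, 5])
print(to_non_redundant(s, c))
```

The cost noted in the abstract shows up when a value in this form must feed a conventional multiplier: it first has to go through `to_non_redundant`, which is the step a redundant-input multiplier avoids.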
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016
Alberto A. Del Barrio; Jason Cong; Román Hermida
Due to the necessity of handling unexpected events at execution time, e.g., to support process variations, new mechanisms for dealing with every possible behavior of the datapath must be developed. Conventional centralized controllers can only handle very few dynamic events. Distributed controllers, on the other hand, are able to support every combination of events. These controllers are composed of several finite state machines, which are interconnected via a global coordinator. The use of this type of controller requires checking the hazards between operations at run time, which entails some penalty in the controller complexity. In this paper, a new methodology for deploying a distributed controller over a set of clusters is presented. A register binding algorithm specially suited for distributed controllers has also been developed. It combines a clustering method and a least recently used policy to reduce the number of hazards at run time. Furthermore, our methodology allows the exploration of different solutions by tuning the input parameters of the binding algorithm. Several studies evaluating the execution time and area tradeoffs are presented to support our techniques. Results show that for some cases it is possible to reduce the expected execution time by more than 50%, at the expense of a slight area increase.
ACM Transactions on Embedded Computing Systems | 2014
Alberto A. Del Barrio; Nader Bagherzadeh; Román Hermida
Currently, the most powerful supercomputers can provide tens of petaflops. Future many-core systems are estimated to provide an exaflop. However, the power budget limitation makes these machines still unfeasible and unaffordable. Floating Point Units (FPUs) are critical from both the power consumption and performance points of view of today's microprocessors and supercomputers. The literature offers very different designs. Some of them are focused on increasing performance no matter the penalty, and others on decreasing power at the expense of lower performance. In this article, we propose a novel approach for reducing the power of the FPU without degrading the remaining parameters. Concretely, this power reduction is also accompanied by an area reduction and a performance improvement. Hence, an overall energy gain will be produced. According to our experiments, our proposed unit consumes 17.5%, 23% and 16.5% less energy for single, double and quadruple precision, with an additional 15%, 21.5% and 14.5% delay reduction, respectively. Furthermore, area is also diminished by 4%, 4.5% and 5%.
Integration | 2013
Alberto A. Del Barrio; Seda Ogrenci Memik; María Molina; José M. Mendías; Román Hermida
State-of-the-art multi-objective synthesis flows tend to degrade some parameters of the circuit while trying to optimize the target one. This paper addresses the power reduction problem in heterogeneous datapaths, while keeping a similar area and execution time with respect to the baseline case. Our specific approach first diminishes the area via fragmentation techniques and afterwards gives it back with the introduction of Low Power Functional Units (LP-FUs) that occupy more area than their corresponding non-low power counterparts. Furthermore, a fragmentation algorithm more suitable for power reduction is proposed. Results show that it is possible to diminish power by 27% on average (49% in the best case).
Design, Automation, and Test in Europe | 2017
Alberto A. Del Barrio; Román Hermida
In 1951 A. Booth published his algorithm to efficiently multiply signed numbers. Since the appearance of this algorithm, it has been widely accepted that radix 4-based Booth multipliers are the most efficient. They allow the height of the multiplier to be halved, at the expense of a simple recoding that consists of just shifts and negations. Theoretically, a higher radix should produce even larger reductions, especially in terms of area and power, but the recoding process is much more complex. Notably, in the case of radix 8 it is necessary to compute 3X, X being the multiplicand. In order to avoid the penalty due to this calculation, we propose decoupling it from the product and considering 3X as an extra operation within the application's Dataflow Graph (DFG). Experiments show that typically there is enough slack in the DFGs to do this without degrading the performance of the circuit, which permits the efficient deployment of radix 8 multipliers that do not calculate the 3X multiple. Results show that our approach is 10% and 17% faster than radix 4 and radix 8 Booth based implementations, respectively, and 12% and 10% more energy efficient in terms of Energy Delay Product.
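The role of the 3X multiple can be seen in a small software model of radix-8 Booth recoding. This is a sketch of the textbook recoding rather than the circuit from the paper: the multiplier is scanned three bits at a time (plus one overlap bit), giving digits in the range -4 to 4, and every multiple of X except 3X is a shift or a negation. To mirror the decoupling idea, the hypothetical function `booth_radix8_multiply` receives 3X as an input instead of computing it; all names and the default bit width are assumptions.

```python
def booth_radix8_digits(y, n_bits):
    """Recode a signed n_bits two's complement multiplier into
    radix-8 Booth digits in [-4, 4], least significant first."""
    groups = -(-n_bits // 3)  # ceil(n_bits / 3) partial products
    digits, prev = [], 0
    for j in range(groups):
        # Python's >> on negative ints sign-extends, as required here.
        b0 = (y >> (3 * j)) & 1
        b1 = (y >> (3 * j + 1)) & 1
        b2 = (y >> (3 * j + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits


def booth_radix8_multiply(x, y, three_x, n_bits=16):
    """Multiply x by y using radix-8 Booth digits.
    three_x is supplied by the caller, modeling a 3X value that was
    computed earlier as a separate operation."""
    multiples = {0: 0, 1: x, 2: x << 1, 3: three_x, 4: x << 2}
    product = 0
    for j, d in enumerate(booth_radix8_digits(y, n_bits)):
        partial = multiples[abs(d)]
        product += (-partial if d < 0 else partial) << (3 * j)
    return product


x, y = 25, -37
three_x = x + (x << 1)  # the only multiple that needs a real addition
print(booth_radix8_multiply(x, y, three_x, n_bits=8))
```

Only the entry for digit magnitude 3 requires a carry-propagating addition, which is why moving it out of the multiplier and into available schedule slack removes the main cost of the higher radix.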