
Publication


Featured research published by Somnath Mazumdar.


Computer and Information Technology | 2016

Forecasting HPC Workload Using ARMA Models and SSA

Anoop Kumar; Somnath Mazumdar

In high-performance computing (HPC) platforms, resource usage patterns change over time, which makes resource monitoring a challenge. Maintaining performance goals within an acceptable power envelope is critical where servers suffer from underutilisation, failures, or degraded hardware. Better workload forecasts can reduce energy costs, and identifying usage patterns is also vital for efficient capacity planning, since predictions can be augmented with an effective resource allocation strategy to manage resource distribution goals. However, no single prediction model fits all cases. In this paper, we compare the forecast performance of the state-of-the-art ARMA class (integrated ARMA (ARIMA), seasonal integrated ARMA (SARIMA) and fractionally integrated ARMA (ARFIMA)) with the singular spectrum analysis (SSA) method using CPU, RAM and Network traces collected from the Wikimedia grid. We found that the simplest model of the ARMA class (ARIMA) outperformed the more complex ARMA-class models when forecasting the bursty Network patterns: ARIMA provides the best forecast for the Network data, while SSA is the best method for CPU and RAM. We also show that with proper model fitting we can achieve forecasting errors as low as 0.00586% for RAM and a maximum error of around 5% for Network, without complete information about the underlying system hardware or the types of running applications.
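
As a rough illustration of the model-fitting step, the sketch below fits an ARIMA model to a synthetic bursty trace and reports the mean absolute percentage error. The Wikimedia traces and the paper's model-selection procedure are not reproduced here; the (2, 1, 2) order and the synthetic series are purely illustrative assumptions.

```python
# Minimal ARIMA forecasting sketch (illustrative, not the paper's pipeline).
# Requires: pip install numpy statsmodels
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic "bursty" network-like trace standing in for the Wikimedia data.
t = np.arange(500)
trace = 100 + 10 * np.sin(2 * np.pi * t / 48) + rng.gamma(2.0, 5.0, size=t.size)

train, test = trace[:450], trace[450:]

# Order (2, 1, 2) is an arbitrary illustrative choice; the paper selects
# orders through proper model fitting.
model = ARIMA(train, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=test.size)

mape = np.mean(np.abs((test - forecast) / test)) * 100
print(f"MAPE over the held-out window: {mape:.3f}%")
```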


IEEE Computer Architecture Letters | 2018

Enabling Massive Multi-Threading with Fast Hashing

Alberto Scionti; Somnath Mazumdar; Stéphane Zuckerman

The next generation of high-performance computers is expected to execute orders of magnitude more threads than today's systems. Improper management of such a huge number of threads can create resource contention, leading to degraded overall system performance. By leveraging more practical approaches to distributing threads over the available resources, execution models and manycore chips are expected to overcome the limitations of current systems. Here, we present DELTA, a Data-Enabled muLti-Threaded Architecture, in which a producer-consumer scheme executes threads via a completely distributed thread-management mechanism. We consider a manycore tiled-chip architecture in which Network-on-Chip (NoC) routers are extended to support our execution model. The proposed extension is analysed, and simulation results confirm that DELTA can manage a large number of simultaneous threads while relying on a simple hardware structure.
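
DELTA's mechanism lives in extended NoC routers, but the thread-distribution idea can be caricatured in software: hash each thread ID to a home tile and enqueue it there for consumption. Everything below (the tile count, SHA-1 standing in for a hardware hash) is a hypothetical model, not the paper's implementation.

```python
# Hash-based thread distribution over a tiled manycore, modelled in software.
import hashlib
from collections import deque

N_TILES = 16  # a 4x4 tiled chip, illustrative

# One consumer queue per tile; extended routers would "consume" threads here.
tile_queues = [deque() for _ in range(N_TILES)]

def tile_for(thread_id: int) -> int:
    """Map a thread ID to its home tile with a fast hash (SHA-1 stands in
    for whatever hash function a real hardware design would use)."""
    digest = hashlib.sha1(thread_id.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little") % N_TILES

def produce(thread_id: int, frame) -> None:
    """Producer side: enqueue a ready thread on its home tile."""
    tile_queues[tile_for(thread_id)].append((thread_id, frame))

# Spawn a burst of threads and check the load spreads roughly evenly.
for tid in range(10_000):
    produce(tid, None)
print([len(q) for q in tile_queues])
```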


International Conference on High Performance Computing and Simulation | 2016

Software defined Network-on-Chip for scalable CMPs

Alberto Scionti; Somnath Mazumdar; Antoni Portero

Moving from Petascale to Exascale computing requires optimizing the micro-architecture to increase the performance/power ratio of multicores (e.g., FLOPS/W). Future manycore processors will contain thousands of low-power processing elements (kilo-core Chip Multi-Processors, CMPs) to support the execution of a large number of concurrent threads. While data-driven Program eXecution Models (PXMs) are gaining popularity due to the support they provide for thread communication, frequent data exchange among many concurrent threads stresses the underlying interconnect subsystem, resulting in hotspots and high packet-delivery latency. As a solution, we propose a scalable Software Defined Network-on-Chip (SDNoC) architecture for future manycore processors. Our design merges the benefits of ring-based NoCs (performance, energy efficiency) with those of dynamic reconfiguration (adaptation, fault tolerance) while keeping the hard-wired topology (a 2D mesh) fixed. To accommodate different application and communication requirements, the interconnect allows mapping different types of topologies (virtual topologies) onto the mesh. To let the software layer control and monitor the NoC subsystem, a few customised instructions supporting a data-driven PXM are added to the core ISA. In experiments, we compared our lightweight reconfigurable architecture to a conventional 2D-mesh interconnection subsystem. Results show that our model saves 39.4% of the chip area and up to 72.4% of the consumed power.
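
One way to picture a virtual topology overlaid on a fixed 2D mesh is the sketch below, which builds a virtual ring by visiting a 4x4 mesh in snake order. The dimensions and routing-table shape are illustrative assumptions, not the paper's ISA-driven configuration mechanism.

```python
# Sketch: overlaying a virtual ring on a fixed 2D mesh (hypothetical model).
W = H = 4  # mesh dimensions, illustrative

def snake_order(w, h):
    """Visit mesh nodes row by row, alternating direction (boustrophedon),
    so consecutive nodes in the ring are adjacent in the mesh."""
    order = []
    for y in range(h):
        row = [(x, y) for x in range(w)]
        order.extend(row if y % 2 == 0 else reversed(row))
    return order

ring = snake_order(W, H)
# next_hop[node] = where a router forwards packets travelling the virtual ring.
# Every hop is a single mesh link except the wrap-around from last node back
# to first, which the mesh would have to route over several links.
next_hop = {ring[i]: ring[(i + 1) % len(ring)] for i in range(len(ring))}

for node in ring[:5]:
    print(node, "->", next_hop[node])
```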


Digital Systems Design | 2016

AXIOM: A Hardware-Software Platform for Cyber Physical Systems

Somnath Mazumdar; Eduard Ayguadé; Nicola Bettin; Javier Bueno; Sara Ermini; Antonio Filgueras; Daniel Jiménez-González; Carlos Álvarez Martínez; Xavier Martorell; Francesco Montefoschi; David Oro; Dionisis Pnevmatikatos; Antonio Rizzo; Dimitris Theodoropoulos; Roberto Giorgi

Cyber-Physical Systems (CPSs) are widely needed in applications that require interaction with humans and the physical environment. A CPS integrates a set of hardware-software components to distribute, execute and manage its operations. The AXIOM project (Agile, eXtensible, fast I/O Module) aims at developing a hardware-software platform for CPSs such that i) it offers an easy parallel programming model and ii) it can easily scale up performance by adding boards (e.g., 1 to 10 boards running in parallel). AXIOM supports a task-based programming model based on OmpSs and leverages a high-speed, inexpensive communication interface called AXIOM-Link. Another key aspect is that the board provides programmable logic (an FPGA) to accelerate portions of an application. We use smart video surveillance and smart home living applications to drive our design.


Computing Frontiers | 2017

Let's Go: a Data-Driven Multi-Threading Support

Alberto Scionti; Somnath Mazumdar

Increasing the performance of computing systems requires solutions that improve scalability and productivity. In recent times, data-driven Program eXecution Models (PXMs) have been gaining popularity due to their superior support compared to traditional von Neumann execution models. However, exposing the benefits of such PXMs within a high-level programming language remains a challenge. Although many high-level programming languages and APIs support concurrency and multi-threading (e.g., C++11, Java, OpenMP, MPI), their synchronisation models make heavy use of mutexes and locks, which generally leads to poor system performance. Conversely, one major appeal of the Go programming language is the way it supports concurrency: goroutines (tagged functions) are mapped onto OS threads and communicate with each other through data structures that buffer input data (channels). By forcing goroutines to exchange data only through channels, a data-driven execution can be enabled. This paper proposes a first attempt to map goroutines onto a data-driven PXM. The Go compilation procedure and run-time library are modified to execute fine-grain threads on an abstracted parallel machine model.
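
The paper modifies the Go toolchain itself; as a language-neutral sketch of the channel-only communication style it builds on, the snippet below emulates two "goroutines" exchanging data exclusively through a bounded channel, with queue.Queue standing in for a buffered Go channel.

```python
# Channel-only communication in the goroutine style, emulated with Python
# threads and a bounded queue (an analogy, not the paper's Go runtime).
import queue
import threading

def producer(ch: queue.Queue) -> None:
    for i in range(5):
        ch.put(i)      # send: blocks when the buffered channel is full
    ch.put(None)       # sentinel standing in for closing the channel

def consumer(ch: queue.Queue) -> None:
    while (item := ch.get()) is not None:  # receive: blocks until data arrives
        print("consumed", item)

ch = queue.Queue(maxsize=2)  # like a buffered Go channel of capacity 2
workers = [threading.Thread(target=producer, args=(ch,)),
           threading.Thread(target=consumer, args=(ch,))]
for w in workers:
    w.start()
for w in workers:
    w.join()
```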


Archive | 2017

Adaptive Resource Allocation for Load Balancing in Cloud

Somnath Mazumdar; Alberto Scionti; Anoop Kumar

Cloud computing has become a robust computing paradigm aimed at providing ubiquitous access to almost "infinite" computational and storage resources. However, the ever-changing demand for computational capability pushes data centres (DCs) to adopt power-hungry solutions. The problem of reducing energy consumption in a DC is exacerbated by the difficulty of fairly distributing the workload among the available physical servers. Current methods rely on algorithmic solutions that cannot capture and counterbalance all the changes in user access patterns, leading to over-provisioning of resources and underutilisation of the active servers. This chapter advocates an effective way to tackle the resource allocation problem with the aim of improving energy efficiency and reliability. We show how a flexible framework, designed to foresee the expected load and to use an evolutionary optimisation algorithm (such as particle swarm optimisation, PSO), can efficiently map user requests onto the available hardware resources.
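
A minimal PSO loop for the load-balancing flavour of the problem might look like the sketch below. The fitness function (standard deviation of per-server load), swarm constants, and problem sizes are all illustrative assumptions rather than the chapter's framework.

```python
# Toy PSO balancing request load across servers (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
N_REQ, N_SRV = 40, 5
load = rng.uniform(1, 10, size=N_REQ)   # resource demand of each request

def imbalance(position):
    """Fitness: std-dev of per-server load for a request->server assignment,
    decoded by rounding each continuous coordinate to a server index."""
    assign = np.clip(position.round().astype(int), 0, N_SRV - 1)
    per_server = np.bincount(assign, weights=load, minlength=N_SRV)
    return per_server.std()

N_PART, ITERS = 30, 200
pos = rng.uniform(0, N_SRV - 1, size=(N_PART, N_REQ))
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([imbalance(p) for p in pos])
gbest = pbest[pbest_fit.argmin()].copy()

for _ in range(ITERS):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # Standard velocity update: inertia + cognitive + social terms.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, N_SRV - 1)
    fit = np.array([imbalance(p) for p in pos])
    improved = fit < pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmin()].copy()

print("best load imbalance (std-dev):", pbest_fit.min())
```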


Sensors | 2018

Towards a Scalable Software Defined Network-on-Chip for Next Generation Cloud

Alberto Scionti; Somnath Mazumdar; Antoni Portero

The rapid evolution of Cloud-based services and the growing interest in deep learning (DL)-based applications are putting increasing pressure on hyperscalers and general-purpose hardware designers to provide more efficient and scalable systems. Cloud-based infrastructures must be built from more energy-efficient components, and the evolution must take place from the core of the infrastructure (i.e., data centres (DCs)) to the edges (Edge computing) to adequately support new and future applications. Adaptability/elasticity is one of the features required to increase performance-to-power ratios. Hardware-based mechanisms have been proposed to support system reconfiguration, mostly at the processing-element level, while fewer studies have addressed scalable, modular interconnection sub-systems. In this paper, we propose a scalable Software Defined Network-on-Chip (SDNoC)-based architecture. Thanks to a modular design approach, our solution can easily be adapted to devices ranging from low-power computing nodes at the edge of the Cloud to high-performance many-core processors in Cloud DCs. The proposed design merges the benefits of hierarchical network-on-chip (NoC) topologies (by fusing ring and 2D-mesh topologies) with those of dynamic reconfiguration (i.e., adaptation). The interconnect allows creating different types of virtualised topologies that serve different communication requirements and thus provide better resource partitioning (virtual tiles) for concurrent tasks. To allow the software layer to control and monitor the NoC subsystem, a few customised instructions supporting a data-driven program execution model (PXM) are added to the processing element's instruction set architecture (ISA); data-driven programming and execution models are in general well suited to DL applications. We also introduce a mechanism that maps a high-level programming language embedding concurrent execution models onto the basic functionalities offered by our SDNoC, easing the programming of the proposed system. In the reported experiments, we compared our lightweight reconfigurable architecture to a conventional flattened 2D-mesh interconnection subsystem. Results show that our design increases data traffic throughput by 9.5% and reduces average packet latency by 2.2× compared to a flattened 2D-mesh topology connecting the same number of processing elements (PEs) (up to 1024 cores). Power and resource consumption (on FPGA devices) are also low, confirming the good scalability of the proposed architecture.
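
To see why fusing rings into a mesh can pay off, the back-of-the-envelope model below compares average hop counts on a flattened 32x32 mesh against a hierarchy of 16-PE rings joined by an 8x8 cluster mesh. The gateway placement, uniform traffic, and random sampling are simplifying assumptions, not the paper's hardware.

```python
# Hop-count comparison: flattened 2D mesh vs. a ring-in-mesh hierarchy.
import itertools
import random
import statistics

random.seed(0)

def mesh_hops(a, b):
    """XY-routing distance between two mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def ring_hops(i, j, n):
    """Shortest distance between two positions on an n-node ring."""
    d = abs(i - j)
    return min(d, n - d)

# Flattened 32x32 mesh: 1024 PEs addressed by (x, y).
flat_nodes = list(itertools.product(range(32), repeat=2))

# Hierarchical: an 8x8 mesh of clusters, each a 16-PE ring (also 1024 PEs).
# A PE is (cluster_xy, ring_position); position 0 is assumed to be the
# ring's gateway onto the cluster mesh.
hier_nodes = [(c, p) for c in itertools.product(range(8), repeat=2)
              for p in range(16)]

def hier_hops(src, dst):
    (ca, pa), (cb, pb) = src, dst
    if ca == cb:
        return ring_hops(pa, pb, 16)          # stay on the local ring
    return (ring_hops(pa, 0, 16)              # reach the gateway
            + mesh_hops(ca, cb)               # cross the cluster mesh
            + ring_hops(0, pb, 16))           # walk the destination ring

# Average hops over random source/destination pairs (uniform traffic).
pairs = [(random.choice(flat_nodes), random.choice(flat_nodes))
         for _ in range(100_000)]
flat_avg = statistics.fmean(mesh_hops(a, b) for a, b in pairs)
pairs = [(random.choice(hier_nodes), random.choice(hier_nodes))
         for _ in range(100_000)]
hier_avg = statistics.fmean(hier_hops(a, b) for a, b in pairs)

print(f"flattened 2D mesh, avg hops: {flat_avg:.2f}")
print(f"ring-in-mesh hierarchy, avg hops: {hier_avg:.2f}")
```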


Archive | 2018

Statistical Analysis of Data Centre Resource Usage Patterns: A Case Study

Somnath Mazumdar; Anoop Kumar

Performance evaluation is necessary to understand the runtime behaviour of a computing system, and a better understanding of resource usage leads to better utilisation and lower energy costs. To optimise server provisioning as well as the energy cost of a data centre (DC), we should explore the underlying resource usage patterns to extract meaningful information. In this paper, our primary goal is to obtain correlations or cross-correlations among CPU, RAM, and Network at different timescales of a DC. To perform this analysis, we collected Wikimedia grid traces and conducted an experimental campaign using a rationally selected set of statistical methods: a univariate method (the Hurst exponent), multivariate explanatory methods (wavelets and cross-recurrence quantification analysis (CRQA)), and multivariate predictive methods (vector auto-regression (VAR) and multivariate adaptive regression splines (MARS)). It is worth noting that we analyse the data without any prior knowledge of the running applications. We present the results together with a comprehensive analysis. In our case study, we found that at long timescales CPU and RAM are more correlated than Network, and we show that wavelet-based methods are superior at detecting long-run relationships among these resource variables.
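
Of the methods listed, the Hurst exponent is the simplest to sketch. The snippet below is a textbook rescaled-range (R/S) estimator, not the paper's exact procedure (nor its wavelet/CRQA/VAR/MARS analyses), sanity-checked on white noise where H should be near 0.5.

```python
# Rescaled-range (R/S) estimate of the Hurst exponent (illustrative only).
import numpy as np

def hurst_rs(series, min_chunk=8):
    """Estimate H as the slope of log(R/S) against log(window size)."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    sizes, rs = [], []
    size = min_chunk
    while size <= n // 2:
        chunks = series[: n - n % size].reshape(-1, size)
        dev = chunks - chunks.mean(axis=1, keepdims=True)
        z = dev.cumsum(axis=1)
        r = z.max(axis=1) - z.min(axis=1)   # range of cumulative deviations
        s = chunks.std(axis=1)
        keep = s > 0                         # skip constant windows
        if keep.any():
            sizes.append(size)
            rs.append((r[keep] / s[keep]).mean())
        size *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(rs), 1)
    return slope

rng = np.random.default_rng(0)
print("white noise, H near 0.5:", round(hurst_rs(rng.normal(size=4096)), 2))
```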


International Conference on High Performance Computing and Simulation | 2017

Efficient Data-Driven Task Allocation for Future Many-Cluster On-chip Systems

Alberto Scionti; Somnath Mazumdar; Antoni Portero

The continuous demand for higher performance puts more pressure on hardware designers to provide faster machines with lower energy consumption. Recent technological advancements allow placing a group of silicon dies on top of a conventional interposer (a silicon layer), which provides space to integrate logic and interconnection resources for managing the active processing cores. However, such large resource availability requires an adequate Program eXecution Model (PXM) as well as an efficient mechanism to allocate resources in the system. From this perspective, fine-grain data-driven PXMs are an attractive way to reduce the cost of synchronising concurrent activities. The contribution of this work is twofold. First, a hardware architecture called TALHES, a Task ALlocator for HEterogeneous Systems, is proposed to support the scheduling of multi-threaded applications adhering to an explicit data-driven PXM. TALHES introduces a Network-on-Chip (NoC) extension: i) on-chip 2D-mesh NoCs support locality of computation in the execution of a single task, while ii) a global task scheduler integrated into the silicon interposer orchestrates application tasks among different clusters of cores (possibly with different computing capabilities). The second contribution of the paper is a simulation framework tailored to the analysis of such fine-grain data-driven applications, in which Linux Containers abstract and efficiently simulate clusters of cores (i.e., single dies) as well as the behaviour of the global scheduling unit.
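
The global-scheduler idea can be caricatured in a few lines: dispatch data-ready tasks to the least-loaded cluster and unlock successors as tasks complete. The task graph and load metric below are hypothetical; TALHES implements this in hardware on the interposer.

```python
# Toy global scheduler in the spirit of TALHES (hypothetical software model).
import heapq
from collections import defaultdict

# Task graph: task -> set of tasks whose data it needs (illustrative).
deps = {"A": set(), "B": set(), "C": {"A"}, "D": {"A", "B"}, "E": {"C", "D"}}
dependents = defaultdict(list)
for task, needed in deps.items():
    for d in needed:
        dependents[d].append(task)

pending = {task: len(needed) for task, needed in deps.items()}
ready = [task for task, n in pending.items() if n == 0]

# Clusters as (current_load, cluster_id) entries in a min-heap.
clusters = [(0, cid) for cid in range(4)]
heapq.heapify(clusters)

while ready:
    task = ready.pop()
    load, cid = heapq.heappop(clusters)   # least-loaded cluster wins
    print(f"dispatch {task} -> cluster {cid}")
    heapq.heappush(clusters, (load + 1, cid))
    # Assume the task completes immediately and unlocks its dependents.
    for succ in dependents[task]:
        pending[succ] -= 1
        if pending[succ] == 0:
            ready.append(succ)
```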


IEEE International Advance Computing Conference | 2017

Analysing Dataflow Multi-Threaded Applications at Runtime

Somnath Mazumdar; Alberto Scionti

Recently, dataflow-inspired execution models have gained popularity due to their better support for concurrent, multi-threaded execution, and dedicated hardware solutions have been proposed to help harness their capabilities. Such models may benefit both HPC and Cloud applications that are sensitive to communication latency and synchronisation. However, understanding the behaviour of a vast number of concurrent threads, as well as the achievable performance on a given hardware platform, remains a challenge; moreover, the evaluation process often requires using and modifying complex simulation frameworks. To counter these challenges, we propose RADA (Runtime Analysis of Dataflow-based Applications), a simulation tool for the fast evaluation of applications adhering to a hierarchical dataflow execution model. RADA provides an abstract model of different hardware platforms, which also allows optimising an application for execution on heterogeneous systems. In particular, RADA integrates an efficient scheduling mechanism, in the form of a runtime library, that distributes the workload efficiently while minimising synchronisation overhead. The output provided by RADA is of great help in analysing the traffic generated by the scheduling activity. Preliminary evaluation results show significant benefits in adopting RADA for the assessment of dataflow-based applications and execution models.
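
As a flavour of the bookkeeping such a tool performs, the sketch below runs a random dataflow DAG to completion and counts the synchronisation messages a runtime would send as threads finish; this is a hypothetical model, not RADA's engine.

```python
# Count synchronisation traffic while executing a random dataflow DAG.
import random
from collections import defaultdict

random.seed(0)
N_THREADS = 200

# Random DAG: each dataflow thread depends on up to three earlier threads.
deps = {t: set(random.sample(range(t), k=min(t, random.randrange(4))))
        for t in range(N_THREADS)}
succs = defaultdict(list)
for t, needed in deps.items():
    for d in needed:
        succs[d].append(t)

count = {t: len(needed) for t, needed in deps.items()}
ready = [t for t, c in count.items() if c == 0]
sync_msgs = 0

while ready:
    t = ready.pop()
    for s in succs[t]:        # completing t sends one update per successor
        sync_msgs += 1
        count[s] -= 1
        if count[s] == 0:
            ready.append(s)

print(f"executed {N_THREADS} threads; synchronisation messages: {sync_msgs}")
```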

Collaboration


Dive into Somnath Mazumdar's collaborations.

Top Co-Authors

Alberto Scionti

Istituto Superiore Mario Boella

Anoop Kumar

Birla Institute of Technology and Science

Daniel Jiménez-González

Polytechnic University of Catalonia

Javier Bueno

Polytechnic University of Catalonia

Dimitris Theodoropoulos

Delft University of Technology

Antoni Portero

Technical University of Ostrava