Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Waleed Meleis is active.

Publication


Featured research published by Waleed Meleis.


IEEE Design & Test of Computers | 1998

Rothko: a three-dimensional FPGA

Miriam Leeser; Waleed Meleis; Mankuan Michael Vai; Silviu Chiricescu; Weidong Xu; Paul M. Zavracky

Using transferred circuits and metal interconnections placed between layers of active devices anywhere on the chip, Rothko aims to solve the utilization, routing, and delay problems of existing FPGA architectures. Experimental implementations have demonstrated significant performance advantages.


field-programmable custom computing machines | 2003

Runtime assignment of reconfigurable hardware components for image processing pipelines

Heather Quinn; Laurie A. Smith King; Miriam Leeser; Waleed Meleis

The combination of hardware acceleration and flexibility makes FPGAs (field programmable gate arrays) important to image processing applications. There is also a need for efficient, flexible hardware/software codesign environments that can balance the benefits and costs of using FPGAs. Image processing applications often consist of a pipeline of components, where each component applies a different processing algorithm. Components can be implemented for FPGAs or in software. Such systems enable an image analyst to work with either FPGA or software implementations of image processing algorithms for a given problem. The pipeline assignment problem chooses from alternative implementations of pipeline components to yield the fastest pipeline. Our codesign system solves the pipeline assignment problem to provide the most effective implementation automatically, so the image analyst can focus solely on choosing the components that make up the pipeline. However, the pipeline assignment problem is NP-complete. An efficient, dynamic solution to the pipeline assignment problem is a desirable enabler of codesign systems that use both FPGA and software implementations. This paper is concerned with solving pipeline assignment in this context; consequently, we investigate optimal and heuristic methods for fast (fixed time limit) runtime pipeline assignment. We present experimental findings for pipelines of twenty or fewer components, which show that in our environment, optimal runtime solutions are possible for smaller pipelines and nearly optimal heuristic solutions are possible for larger pipelines.
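
To make the assignment problem concrete, here is a minimal sketch, assuming an additive per-frame latency model and a cap on how many components can be placed in hardware (a stand-in for limited FPGA resources); the function names and cost model are illustrative and are not taken from the paper.

```python
from itertools import product

# Hypothetical cost model: each component maps implementation name -> latency.
# At most `fpga_slots` components may be placed in hardware at once.

def optimal_assignment(components, fpga_slots):
    """Exhaustive search; feasible only for short pipelines."""
    best = (float("inf"), None)
    for choice in product(*[list(c) for c in components]):
        if choice.count("fpga") > fpga_slots:
            continue  # violates the hardware-capacity constraint
        cost = sum(c[impl] for c, impl in zip(components, choice))
        best = min(best, (cost, choice))
    return best

def greedy_assignment(components, fpga_slots):
    """Heuristic: spend FPGA slots where they save the most time."""
    choice = ["sw"] * len(components)
    savings = sorted(
        ((c["sw"] - c["fpga"], i) for i, c in enumerate(components) if "fpga" in c),
        reverse=True,
    )
    for gain, i in savings[:fpga_slots]:
        if gain > 0:
            choice[i] = "fpga"
    cost = sum(c[impl] for c, impl in zip(components, choice))
    return cost, tuple(choice)

pipeline = [{"sw": 4.0, "fpga": 1.0}, {"sw": 2.0}, {"sw": 5.0, "fpga": 2.5}]
print(optimal_assignment(pipeline, fpga_slots=1))  # (8.0, ('fpga', 'sw', 'sw'))
```

Exhaustive search is exponential in pipeline length, which is why optimal methods only suit small pipelines and heuristics take over for larger ones.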


international performance computing and communications conference | 2011

ArA: Adaptive resource allocation for cloud computing environments under bursty workloads

Jianzhe Tai; Juemin Zhang; Jun Li; Waleed Meleis; Ningfang Mi

Cloud computing has become popular among a community of cloud users by offering a variety of resources. However, burstiness in user demands often dramatically degrades application performance. To satisfy peak user demands and meet Service Level Agreements (SLAs), efficient resource allocation schemes are needed in the cloud. However, we find that conventional load balancers neglect bursty arrivals and thus suffer significant performance degradation. Motivated by this problem, we propose new burstiness-aware algorithms to balance bursty workloads across all computing sites and thus improve overall system performance. We present a smart load balancer that leverages knowledge of burstiness to predict changes in user demands and shifts on the fly between schemes that are "greedy" (i.e., always select the best site) and "random" (i.e., randomly select one) based on the predicted information. Both simulation and real experimental results show that this new load balancer can adapt quickly to changes in user demands and thus improve performance by making a smart site selection for cloud users under both bursty and non-bursty workloads.
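
A toy sketch of the switching idea, assuming burstiness is flagged when the squared coefficient of variation of recent inter-arrival times exceeds 1 (more variable than a Poisson stream); the class and threshold below are illustrative, not the paper's actual predictor.

```python
import random
from collections import deque

class BurstAwareBalancer:
    """Dispatches jobs to sites, switching between 'greedy' and 'random'
    selection depending on whether recent arrivals look bursty."""

    def __init__(self, site_queues, window=50, cv2_threshold=1.0):
        self.site_queues = site_queues      # site name -> queue length
        self.arrivals = deque(maxlen=window)
        self.cv2_threshold = cv2_threshold

    def record_arrival(self, interarrival_time):
        self.arrivals.append(interarrival_time)

    def _bursty(self):
        n = len(self.arrivals)
        if n < 2:
            return False
        mean = sum(self.arrivals) / n
        if mean == 0:
            return True  # a batch of simultaneous arrivals is a burst
        var = sum((x - mean) ** 2 for x in self.arrivals) / n
        return var / (mean * mean) > self.cv2_threshold

    def select_site(self):
        if self._bursty():
            # During a burst, queue-length information goes stale quickly;
            # random placement spreads the burst across sites.
            return random.choice(list(self.site_queues))
        # Otherwise be greedy: pick the currently least-loaded site.
        return min(self.site_queues, key=self.site_queues.get)
```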


conference on advanced research in vlsi | 1997

Architectural design of a three dimensional FPGA

Waleed Meleis; Miriam Leeser; Paul M. Zavracky; Mankuan Michael Vai

The design and evaluation of a 3-dimensional FPGA architecture called Rothko are described. Rothko takes advantage of a novel 3-dimensional VLSI circuit technology developed at Northeastern University that is based on transferred circuits with interconnections between layers of circuits. The Rothko 3-D FPGA architecture is based on a sea-of-gates FPGA model first proposed in the Triptych architecture (a 2-D architecture) in which individual cells have the dual functions of routing and logic implementation. Our 3-D VLSI technology allows metal interconnections to be made between cells on different layers so that Rothko is truly 3-D. A very fine-grain interconnection scheme is provided with each cell connected to the one above/below it. In this paper we present the architectural design of this 3-D FPGA. The 3-D technology that supports the Rothko architecture is also described. An example of mapping a combinational multiplier to both the Rothko and Triptych architectures is provided to demonstrate the advantages of Rothko.
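
The key architectural point is the fine-grain vertical connectivity: every cell can route to the cell directly above or below it in addition to its in-plane neighbors. A small sketch of that adjacency follows (the grid sizes and six-neighbor model are illustrative only, not the actual Rothko routing fabric):

```python
def cell_neighbors(x, y, z, nx, ny, nz):
    """Neighbors of cell (x, y, z) in a 3-D sea-of-gates array where each
    cell routes in-plane and to the cell directly above/below it."""
    deltas = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0),  # in-plane
              (0, 0, 1), (0, 0, -1)]                          # inter-layer
    for dx, dy, dz in deltas:
        p, q, r = x + dx, y + dy, z + dz
        if 0 <= p < nx and 0 <= q < ny and 0 <= r < nz:
            yield (p, q, r)

# A signal that would need many in-plane hops in a 2-D fabric can instead
# take a single vertical hop to the layer above:
print(list(cell_neighbors(0, 0, 0, nx=8, ny=8, nz=2)))
```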


Medical Imaging 2004: Physics of Medical Imaging | 2004

Digital tomosynthesis mammography using a parallel maximum-likelihood reconstruction method

Tao Wu; Juemin Zhang; Richard H. Moore; Elizabeth A. Rafferty; Daniel B. Kopans; Waleed Meleis; David R. Kaeli

A parallel reconstruction method, based on an iterative maximum likelihood (ML) algorithm, is developed to provide fast reconstruction for digital tomosynthesis mammography. Tomosynthesis mammography acquires 11 low-dose projections of a breast by moving an x-ray tube over a 50° angular range. In parallel reconstruction, each projection is divided into multiple segments along the chest-to-nipple direction. Using the 11 projections, segments located at the same distance from the chest wall are combined to compute a partial reconstruction of the total breast volume. The shape of the partial reconstruction forms a thin slab, angled toward the x-ray source at the 0° projection angle. The reconstruction of the total breast volume is obtained by merging the partial reconstructions. The overlap region between neighboring partial reconstructions and neighboring projection segments is used to compensate for the incomplete data at the boundary locations of the partial reconstructions. A serial execution of the reconstruction is compared to a parallel implementation using clinical data. The serial code was run on a PC with a single 2.2 GHz Pentium IV CPU. The parallel implementation was developed using MPI and run on a 64-node Linux cluster of 800 MHz Itanium CPUs. The serial reconstruction for a medium-sized breast (5 cm thickness, 11 cm chest-to-nipple distance) takes 115 minutes, while the parallel implementation takes only 3.5 minutes. For a larger breast, the serial implementation takes 187 minutes and the parallel implementation 6.5 minutes. No significant differences were observed between the reconstructions produced by the serial and parallel implementations.
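
The decomposition can be sketched in one dimension: split the data into overlapping segments, let each worker reconstruct its segment independently, and merge by averaging over the overlaps. Everything below (the 1-D stand-in, function names, identity "reconstruction") is illustrative; the actual implementation distributes ML iterations over MPI ranks.

```python
import numpy as np

def split_with_overlap(data, n_segments, overlap):
    """Split along axis 0 into overlapping segments: (lo, hi, view)."""
    core = data.shape[0] // n_segments
    segments = []
    for i in range(n_segments):
        lo = max(0, i * core - overlap)
        hi = min(data.shape[0], (i + 1) * core + overlap)
        segments.append((lo, hi, data[lo:hi]))
    return segments

def merge_partials(partials, length):
    """Average partial reconstructions over their overlap regions."""
    acc, hits = np.zeros(length), np.zeros(length)
    for lo, hi, part in partials:
        acc[lo:hi] += part
        hits[lo:hi] += 1
    return acc / np.maximum(hits, 1)

volume = np.arange(100, dtype=float)
parts = split_with_overlap(volume, n_segments=4, overlap=5)
# Each worker would reconstruct its segment here (identity for the sketch).
print(np.allclose(merge_partials(parts, 100), volume))  # True
```

For scale, the reported times correspond to a speedup of roughly 33× (115 min / 3.5 min) on the 64-node cluster.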


international symposium on microarchitecture | 1999

Balance scheduling: weighting branch tradeoffs in superblocks

Alexandre E. Eichenberger; Waleed Meleis

Since there is generally insufficient instruction-level parallelism within a single basic block, higher performance is achieved by speculatively scheduling operations in superblocks. This is difficult in general because each branch competes for the processor's limited resources. Previous work manages the performance tradeoffs that exist between branches only indirectly. We show here that dependence and resource constraints can be used to gather explicit knowledge about scheduling tradeoffs between branches. The first contribution of this paper is a set of new, tighter lower bounds on the execution times of superblocks that specifically account for the dependence and resource conflicts between pairs of branches. The second contribution is a novel superblock scheduling heuristic that finds high-performance schedules by determining which operations each branch needs scheduled early and selecting branches with compatible needs that favor beneficial branch tradeoffs. Performance evaluations on superblocks from SPECint95 indicate that our bounds are very tight and that our scheduling heuristic outperforms well-known superblock scheduling algorithms.
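
One ingredient of such bounds can be shown in a few lines: the longest latency-weighted dependence path into a branch is a lower bound on the cycle at which it can issue. The toy DAG and names below are made up, and the paper's bounds additionally account for resource conflicts between pairs of branches, which this sketch omits.

```python
from functools import lru_cache

# Toy superblock DAG: op -> (latency, predecessors).
OPS = {
    "load":    (3, []),
    "add":     (1, ["load"]),
    "cmp1":    (1, ["add"]),
    "branch1": (1, ["cmp1"]),
    "mul":     (4, ["load"]),
    "cmp2":    (1, ["mul"]),
    "branch2": (1, ["cmp2"]),
}

@lru_cache(maxsize=None)
def earliest_cycle(op):
    """Dependence-only lower bound: longest latency-weighted path into op."""
    _, preds = OPS[op]
    return max((earliest_cycle(p) + OPS[p][0] for p in preds), default=0)

for branch in ("branch1", "branch2"):
    print(branch, "cannot issue before cycle", earliest_cycle(branch))
# branch1 cannot issue before cycle 5; branch2 cannot issue before cycle 8
```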


world of wireless mobile and multimedia networks | 2011

Cooperation and communication in cognitive radio networks based on TV spectrum experiments

Kaushik R. Chowdhury; Rahman Doost-Mohammady; Waleed Meleis; Marco Di Felice; Luciano Bononi

Cognitive radio (CR) ad hoc networks are composed of wireless nodes that may opportunistically transmit in licensed frequency bands without affecting the primary users of that band. In such distributed networks, gathering the spectrum information is challenging as the nodes have a partial view of the spectrum environment based on the local sensing range. Moreover, individual measurements are also affected by channel uncertainties and location-specific fluctuations in signal strength. To facilitate the distributed operation, this paper makes the following contributions: (i) First, an experimental study is undertaken to measure the signal characteristics for indoor and outdoor locations for the TV channels 21–51, and these results are used to identify the conditions under which nodes may share information. (ii) Second, a Cooperative reinforcement LearnIng scheme for Cognitive radio networKs (CLICK) is designed for combining the spectrum usage information observed by a node and its neighbors. (iii) Finally, CLICK is integrated within a MAC protocol for testing the benefits and overhead of our approach on a higher layer protocol performance. The proposed learning framework and the protocol design are extensively evaluated through a thorough simulation study in ns-2 using experimental traces of channel measurements.
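
A stripped-down sketch of the cooperative learning step, assuming each node keeps a per-channel value updated from its own sensing outcomes and occasionally blends in a neighbor's values; the update rule, weights, and class name are illustrative and are not CLICK's actual equations.

```python
class ChannelLearner:
    """Per-node table of learned channel values for a CR node."""

    def __init__(self, n_channels, alpha=0.2):
        self.q = [0.0] * n_channels
        self.alpha = alpha  # learning rate

    def observe(self, channel, reward):
        """Local update from a sensing outcome, e.g. +1 if the channel was
        idle, -1 if a primary-user transmission was detected."""
        self.q[channel] += self.alpha * (reward - self.q[channel])

    def merge_neighbor(self, neighbor_q, weight=0.5):
        """Cooperative step: blend in a neighbor's values.  In CLICK this
        sharing is gated by conditions like those measured in the TV-band
        experiments (nodes with correlated signal environments)."""
        self.q = [(1 - weight) * own + weight * theirs
                  for own, theirs in zip(self.q, neighbor_q)]

    def best_channel(self):
        return max(range(len(self.q)), key=self.q.__getitem__)
```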


ieee workshop on wireless mesh networks | 2010

To Sense or to Transmit: A Learning-Based Spectrum Management Scheme for Cognitive Radio Mesh Networks

Marco Di Felice; Kaushik R. Chowdhury; Waleed Meleis; Luciano Bononi

Wireless mesh networks, composed of interconnected clusters of a mesh router (MR) and multiple associated mesh clients (MCs), may use cognitive radio equipped transceivers, allowing them to choose licensed frequencies for high-bandwidth communication. However, protecting the licensed users of these bands is a key constraint. In this paper, we propose a reinforcement learning based approach that allows each mesh cluster to independently decide the operating channel, the duration of spectrum sensing, the time of switching, and the duration of data transmission. The contributions made in this paper are threefold. First, based on accumulated rewards for a channel mapped to the link transmission delays, and the estimated licensed-user activity, the MRs assign a weight to each channel, thereby selecting the channel with the highest performance for MC operations. Second, our algorithm allows dynamic selection of the sensing time interval that optimizes the link throughput. Third, by cooperative sharing, we allow the MRs to share their channel table information, thus allowing a more accurate learning model. Simulation results reveal significant improvement over classical schemes which have pre-set sensing and transmission durations in the absence of learning.
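
A toy version of the per-cluster decision state, assuming channel weights are built from delay-based rewards and the sensing interval is adapted multiplicatively; the adaptation constants and names are assumptions for illustration, not the scheme's actual rules.

```python
class MeshClusterScheduler:
    """Per-cluster channel weights plus an adaptive sensing duration."""

    def __init__(self, n_channels, t_sense=5.0, t_min=1.0, t_max=50.0):
        self.weights = [0.0] * n_channels
        self.t_sense = t_sense              # time spent sensing per cycle
        self.t_min, self.t_max = t_min, t_max

    def end_of_cycle(self, channel, delay, pu_detected, alpha=0.3):
        # Reward a channel by its (negated) link transmission delay.
        self.weights[channel] += alpha * (-delay - self.weights[channel])
        # Illustrative adaptation: sense more aggressively when licensed
        # (primary) users appear, transmit longer when the channel is quiet.
        if pu_detected:
            self.t_sense = min(self.t_max, self.t_sense * 2.0)
        else:
            self.t_sense = max(self.t_min, self.t_sense * 0.9)

    def pick_channel(self):
        return max(range(len(self.weights)), key=self.weights.__getitem__)
```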


IEEE Transactions on Instrumentation and Measurement | 2004

Using data compression in automatic test equipment for system-on-chip testing

Farzin Karimi; Zainalabedin Navabi; Waleed Meleis; Fabrizio Lombardi

Compression has been used in automatic test equipment (ATE) to reduce storage and application time for high-volume data by exploiting the repetitive nature of test vectors. The application of a binary compression method to an ATE environment for manufacturing is studied using a technique referred to as reuse. In reuse, compression is achieved by partitioning the vector set and removing repeating segments. This process has O(n²) time complexity for compression (where n is the number of vectors) with simple hardware decoding circuitry. It is shown that for industrial system-on-chip (SoC) designs, the efficiency of the reuse compression technique is comparable to sophisticated software techniques, with the advantage of easy and fast decoding. Two shift register-based decompression schemes are presented; they can be either incorporated into internal scan chains or built into the tester's head. The proposed compression method has been applied to industrial test data, and an average compression rate of 84% has been achieved.
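
A minimal sketch of the reuse idea: partition the vector set into fixed-size segments, store only the first occurrence of each distinct segment, and encode repeats as references to earlier segments. The linear scan over previously stored segments is what gives the O(n²) compression cost; the encoding and names below are illustrative, not the paper's format.

```python
def reuse_compress(vectors, seg_len=4):
    """Replace repeated seg_len-sized segments with back-references."""
    segments = [tuple(vectors[i:i + seg_len])
                for i in range(0, len(vectors), seg_len)]
    stored, encoded = [], []
    for seg in segments:
        for j, prev in enumerate(stored):   # linear scan: O(n^2) overall
            if seg == prev:
                encoded.append(("ref", j))  # decoder just replays segment j
                break
        else:
            encoded.append(("lit", seg))
            stored.append(seg)
    return encoded

def reuse_decompress(encoded):
    stored, out = [], []
    for kind, val in encoded:
        seg = stored[val] if kind == "ref" else val
        if kind == "lit":
            stored.append(seg)
        out.extend(seg)
    return out

data = [0, 1, 1, 0] * 3 + [1, 1, 1, 1] + [0, 1, 1, 0]
assert reuse_decompress(reuse_compress(data)) == data
```

The decoder's simplicity (replay a stored segment on a reference) is what makes a shift register-based hardware implementation practical.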


Algorithmica | 2002

An Experimental Study of Algorithms for Weighted Completion Time Scheduling

Ivan D. Baev; Waleed Meleis; Alexandre E. Eichenberger

We consider the total weighted completion time scheduling problem for parallel identical machines and precedence constraints, P|prec|∑wᵢCᵢ. This important and broad class of problems is known to be NP-hard, even for restricted special cases, and the best known approximation algorithms have worst-case performance that is far from optimal. However, little is known about the experimental behavior of algorithms for the general problem. This paper represents the first attempt to describe and evaluate comprehensively a range of weighted completion time scheduling algorithms. We first describe a family of combinatorial scheduling algorithms that optimally solve the single-machine problem, and show that they can be used to achieve good performance for the multiple-machine problem. These algorithms are efficient and find schedules that are on average within 1.5% of optimal over a large synthetic benchmark consisting of trees, chains, and instances with no precedence constraints. We then present several ways to create feasible schedules from nonintegral solutions to a new linear programming relaxation for the multiple-machine problem. The best of these linear programming-based approaches finds schedules that are within 0.2% of optimal over our benchmark. Finally, we describe how the scheduling phase in profile-based program compilation can be expressed as a weighted completion time scheduling problem and apply our algorithms to a set of instances extracted from the SPECint95 compiler benchmark. For these instances with arbitrary precedence constraints, the best linear programming-based approach finds optimal solutions in 78% of cases. Our results demonstrate that careful experimentation can help lead the way to high-quality algorithms, even for difficult optimization problems.
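
For the single-machine case with no precedence constraints, Smith's rule is optimal for ∑wᵢCᵢ: schedule jobs in nonincreasing order of weight divided by processing time. A common multiple-machine heuristic then list-schedules jobs in that order onto the machine that frees up first. A minimal sketch follows (names chosen here; the paper's combinatorial and LP-based algorithms handle precedence constraints and are more involved):

```python
import heapq

def smith_order(jobs):
    """jobs: list of (processing_time, weight).  Optimal single-machine
    order for sum w_i C_i when there are no precedence constraints."""
    return sorted(jobs, key=lambda job: job[1] / job[0], reverse=True)

def list_schedule(jobs, machines):
    """Dispatch jobs in Smith order to the machine that frees up first;
    returns the total weighted completion time of the resulting schedule."""
    free = [0.0] * machines              # machine-available times (min-heap)
    heapq.heapify(free)
    total = 0.0
    for p, w in smith_order(jobs):
        start = heapq.heappop(free)
        finish = start + p
        total += w * finish
        heapq.heappush(free, finish)
    return total

jobs = [(3, 1), (1, 4), (2, 2)]          # (processing_time, weight)
print(list_schedule(jobs, machines=1))   # 4*1 + 2*3 + 1*6 = 16.0
```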

Collaboration


Dive into Waleed Meleis's collaborations.

Top Co-Authors


Cheng Wu

Northeastern University


Juemin Zhang

Northeastern University


Wei Li

Northeastern University


Brian Keegan

Northeastern University


David Lazer

Northeastern University
