Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Matthew W. Moskewicz is active.

Publication


Featured research published by Matthew W. Moskewicz.


Design Automation Conference | 2001

Chaff: engineering an efficient SAT solver

Matthew W. Moskewicz; Conor Madigan; Ying Zhao; Lintao Zhang; Sharad Malik

Boolean satisfiability is probably the most studied of the combinatorial optimization/search problems. Significant effort has been devoted to trying to provide practical solutions to this problem for problem instances encountered in a range of applications in electronic design automation (EDA), as well as in artificial intelligence (AI). This study has culminated in the development of several SAT packages, both proprietary and in the public domain (e.g. GRASP, SATO), which find significant use in both research and industry. Most existing complete solvers are variants of the Davis-Putnam (DP) search algorithm. In this paper we describe the development of a new complete solver, Chaff, which achieves significant performance gains through careful engineering of all aspects of the search, especially a particularly efficient implementation of Boolean constraint propagation (BCP) and a novel low-overhead decision strategy. Chaff has been able to obtain one to two orders of magnitude performance improvement on difficult SAT benchmarks in comparison with other solvers (DP or otherwise), including GRASP and SATO.
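Chaff's central engineering contribution is a fast BCP loop built on the two-literal-watching scheme. As a rough illustration of the unit rule that BCP repeatedly applies, here is a minimal Python sketch (not Chaff's actual C++ implementation; a real watched-literal scheme avoids rescanning every clause, which this naive version does not):

```python
# Minimal sketch of Boolean constraint propagation (the unit rule).
# Clauses are lists of literals (ints; negative = negated variable).
# assign maps var -> True/False; unassigned variables are absent.

def value(lit, assign):
    """Truth value of a literal under a partial assignment, or None."""
    v = assign.get(abs(lit))
    if v is None:
        return None
    return v if lit > 0 else not v

def propagate(clauses, assign):
    """Apply the unit rule to a fixpoint. Returns 'ok' or 'conflict'.
    A clause with all literals false is a conflict; a clause with one
    unassigned literal and no true literal forces that literal."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            vals = [value(l, assign) for l in clause]
            if any(v is True for v in vals):
                continue                       # clause already satisfied
            unassigned = [l for l, v in zip(clause, vals) if v is None]
            if not unassigned:
                return 'conflict'              # every literal is false
            if len(unassigned) == 1:           # unit clause: force literal
                lit = unassigned[0]
                assign[abs(lit)] = lit > 0
                changed = True
    return 'ok'
```

The two-watched-literal refinement tracks just two literals per clause, so a clause is only revisited when one of its watched literals becomes false; this is what makes Chaff's BCP cheap on large instances.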


Computer Vision and Pattern Recognition | 2016

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

Forrest N. Iandola; Matthew W. Moskewicz; Khalid Ashraf; Kurt Keutzer

Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. In this paper we present FireCaffe, which successfully scales deep neural network training across a cluster of GPUs. We also present a number of best practices to aid in comparing advancements in methods for scaling and accelerating the training of deep neural networks. The speed and scalability of distributed algorithms are almost always limited by the overhead of communication between servers; DNN training is no exception to this rule. Therefore, the key consideration here is to reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models that we train. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers; Infiniband or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter-server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes. When training GoogLeNet and Network-in-Network on ImageNet, we achieve a 47× and 39× speedup, respectively, when training on a cluster of 128 GPUs.
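The advantage of reduction trees over a parameter server comes from the communication pattern: a single server receives gradients one worker at a time, while a tree sums them pairwise in parallel rounds. A small illustrative sketch of the two cost models and the pairwise reduction (the step counts are a simplification, not the paper's exact performance model):

```python
import math

# Illustrative communication-step counts for summing gradients
# across N workers (ignoring bandwidth, latency, and overlap).

def param_server_steps(n_workers):
    """Central server receives one gradient at a time: O(N) steps."""
    return n_workers

def reduction_tree_steps(n_workers):
    """Pairwise sums proceed in parallel rounds: O(log N) steps."""
    return math.ceil(math.log2(n_workers))

def tree_reduce(values):
    """Pairwise reduction of per-worker gradient sums."""
    vals = list(values)
    while len(vals) > 1:
        vals = [vals[i] + (vals[i + 1] if i + 1 < len(vals) else 0)
                for i in range(0, len(vals), 2)]
    return vals[0]
```

On the paper's 128-GPU cluster, this model gives 7 tree rounds versus 128 serialized server receives, which is the intuition behind preferring reduction trees.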


IEEE Design & Test of Computers | 2002

Developing architectural platforms: a disciplined approach

Andrew Mihal; Chidamber Kulkarni; Matthew W. Moskewicz; Mel Tsai; Niraj Shah; Scott J. Weber; Yujia Jin; Kurt Keutzer; K. Vissers; Christian Sauer; Sharad Malik

The Mescal project brings a formalized, disciplined methodology to the design of programmable platform-based systems, enabling the exploration of a wide array of architectures and a correct-by-construction path to implementation.


International Symposium on Systems Synthesis | 2001

Accelerating boolean satisfiability through application specific processing

Ying Zhao; Sharad Malik; Matthew W. Moskewicz; Conor Madigan

This paper presents an application-specific multiprocessor system for SAT, utilizing the most recent results such as the development of highly efficient sequential SAT algorithms, the emergence of commercial configurable processor cores, and the rapid progress in IC manufacturing techniques. Based on an analysis of the basic SAT search algorithm, we propose a new parallel SAT algorithm that utilizes fine-grain parallelism. This is then used to design a multiprocessor architecture in which each processing node consists of a processor and a communication assist node that deals with message processing. Each processor is an application-specific processor built from a commercial configurable processor core. All the system configurations are determined based on the characteristics of SAT algorithms, and are supported by simulation results. While this hardware accelerator system does not change the inherent intractability of SAT problems, it achieves a 30-60× speedup over the fastest known SAT solver, Chaff. We believe that this system can be used to expand the practical applicability of SAT in all its application areas.


International Conference on Computer Aided Design | 2003

CAMA: A Multi-Valued Satisfiability Solver

Cong Liu; Andreas Kuehlmann; Matthew W. Moskewicz

This paper presents the multi-valued SAT solver CAMA. CAMA generalizes the recently developed speed-up techniques used in state-of-the-art binary SAT solvers, such as the two-literal-watching scheme for Boolean constraint propagation (BCP), conflict-based learning with identification of the first unique implication point (UIP), and non-chronological backtracking. In addition, a novel minimum value set (MVS) technique is introduced for improving the efficiency of conflict-based learning. By analyzing the conflict clauses, MVS can potentially prune conflicting space that has not been searched before. Two different decision heuristics are discussed and evaluated. Finally, the performance of CAMA is compared with Chaff using a one-hot encoding scheme. The experimental results show that, for MV-SAT problems with large variable domains, CAMA outperforms Chaff.
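The comparison with Chaff rests on one-hot encoding: each multi-valued variable becomes one Boolean variable per domain value, with clauses enforcing that exactly one is true. A minimal sketch of such an encoding (illustrative only, not CAMA's actual translation):

```python
from itertools import combinations

def one_hot_encode(var_domains):
    """One-hot encode multi-valued variables into Boolean CNF.
    var_domains: dict mapping variable -> domain size.
    Returns (var_of, clauses), where var_of[(v, d)] is the Boolean
    variable meaning 'v takes value d', and clauses enforce
    exactly-one-value per multi-valued variable."""
    var_of, clauses, next_id = {}, [], 1
    for v, size in var_domains.items():
        lits = []
        for d in range(size):
            var_of[(v, d)] = next_id
            lits.append(next_id)
            next_id += 1
        clauses.append(lits)                     # at least one value
        for a, b in combinations(lits, 2):
            clauses.append([-a, -b])             # at most one value
    return var_of, clauses
```

For a domain of size k this adds k Boolean variables and 1 + k(k-1)/2 clauses per multi-valued variable, which is why large domains favor a native multi-valued solver like CAMA.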


International Conference on Multimedia Retrieval | 2015

Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Khalid Ashraf; Benjamin Elizalde; Forrest N. Iandola; Matthew W. Moskewicz; Julia Bernd; Gerald Friedland; Kurt Keutzer

This paper presents advances in analyzing audio content information to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection using the entire audio track from each video causes a computational bottleneck. Here, we address this problem by developing a sparse audio frame-sampling method that improves event-detection speed and accuracy. We achieved a 10 percentage-point improvement in event classification accuracy, with a 200× reduction in the number of training input examples as compared to using the entire track. This reduction in input feature volume led to a 16× reduction in the size of the DNN architecture and a 300× reduction in training time. We applied our method using the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher MAP scores compared to a baseline i-vector system, at a significantly reduced computational cost. The speed improvement is relevant for processing videos on a large scale, and could enable more effective deployment in mobile systems.
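The core of the sparse frame-sampling idea is to train on a small subset of each video's audio frames rather than the full track. A minimal sketch under assumed parameters (uniform random sampling; the function name and sample count are illustrative, not the paper's exact method or settings):

```python
import random

def sparse_sample(frames, n_samples, seed=0):
    """Sample n_samples frames without replacement from a video's
    full audio-frame sequence, preserving temporal order. Videos
    shorter than n_samples are used in full."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    if len(frames) <= n_samples:
        return list(frames)
    idx = sorted(rng.sample(range(len(frames)), n_samples))
    return [frames[i] for i in idx]
```

With, say, 10 sampled frames per multi-thousand-frame track, the training set shrinks by orders of magnitude, which is where the reported reductions in input volume and training time come from.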


International Conference on Computer Design | 2001

Matching architecture to application via configurable processors: a case study with boolean satisfiability problem

Ying Zhao; Sharad Malik; Albert R. Wang; Matthew W. Moskewicz; Conor F. Madigan

Boolean Satisfiability (SAT) is a classical NP-complete problem with both theoretical and practical interest. This paper presents our work in developing an application-specific processor for SAT based on a commercial configurable processor core. We customize the processor configuration and design new instruction extensions based on the data structure and atomic operations used in SAT. The customized processor has achieved around a 24× speedup at a very low hardware cost. The small size of the processor makes it possible to integrate multiple processors and other customized logic into a single chip for an application-specific multiprocessor solution for SAT. Our work shows the strength of application-specific processing in accelerating applications with complex control and dynamic data structures - an area that has traditionally not been targeted by application-specific processing. It also demonstrates that configurable processor cores can be used to cut the development time and cost for designing and building such application-specific processors.


Wireless and Mobile Computing, Networking and Communications | 2016

Boda-RTC: Productive generation of portable, efficient code for convolutional neural networks on mobile computing platforms

Matthew W. Moskewicz; Forrest N. Iandola; Kurt Keutzer

The popularity of neural networks (NNs) spans academia [1], industry [2], and popular culture [3]. In particular, convolutional neural networks (CNNs) have been applied to many image-based machine learning tasks and have yielded strong results [4]. The availability of hardware/software systems for efficient training and deployment of large and/or deep CNN models is critical for the continued success of the field [5] [1]. Early systems for NN computation focused on leveraging existing dense linear algebra techniques and libraries [6] [7]. Current approaches use low-level machine-specific programming [8] and/or closed-source, purpose-built vendor libraries [9]. In this work, we present an open source system that, compared to existing approaches, achieves competitive computational speed while achieving significantly greater portability. We achieve this by targeting the vendor-neutral OpenCL platform [10] using a code-generation approach. We argue that our approach allows for both: (1) the rapid development of new computational kernels for existing hardware targets, and (2) the rapid tuning of existing computational kernels for new hardware targets. Results are presented for a case study of targeting the Qualcomm Snapdragon 820 mobile computing platform [11] for CNN deployment.
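The code-generation approach specializes kernel source to concrete problem sizes and tuning parameters before compilation, rather than shipping one generic kernel. A toy sketch of the idea, emitting an OpenCL kernel string from a template (the template, names, and parameters here are illustrative assumptions, not Boda-RTC's actual template language):

```python
# Generate an OpenCL 1-D convolution kernel specialized to a
# concrete filter size K and a tunable per-work-item tile size.
# Baking K and TILE in as literals lets the OpenCL compiler fully
# unroll the inner loop for each (hardware, problem) pair.

KERNEL_TEMPLATE = """
__kernel void conv1d_K{K}_T{TILE}(__global const float* in,
                                  __global const float* filt,
                                  __global float* out, int n) {{
    int i = get_global_id(0) * {TILE};
    for (int t = 0; t < {TILE} && i + t < n; ++t) {{
        float acc = 0.0f;
        for (int k = 0; k < {K}; ++k)       /* K is a literal here, */
            acc += in[i + t + k] * filt[k]; /* so this can unroll   */
        out[i + t] = acc;
    }}
}}
"""

def generate_kernel(filter_size, tile_size):
    """Emit specialized OpenCL source for one (K, TILE) variant."""
    return KERNEL_TEMPLATE.format(K=filter_size, TILE=tile_size)
```

An autotuner can then generate, compile, and time many (K, TILE) variants per device and keep the fastest, which is how one code base retunes for new hardware targets.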


Computing Frontiers | 2017

Boda: A Holistic Approach for Implementing Neural Network Computations

Matthew W. Moskewicz; Ali Jannesari; Kurt Keutzer

Neural networks (NNs) are currently a very popular topic in machine learning for both research and practice. GPUs are the dominant computing platform for research efforts and are also gaining popularity as a deployment platform for applications such as autonomous vehicles. As a result, GPU vendors such as NVIDIA have spent enormous effort to write special-purpose NN libraries. On other hardware targets, especially mobile GPUs, such vendor libraries are not generally available. Thus, the development of portable, open, high-performance, energy-efficient GPU code for NN operations would enable broader deployment of NN-based algorithms. A root problem is that high-efficiency GPU programming suffers from high complexity, low productivity, and low portability. To address this, this work presents a framework to enable productive, high-efficiency GPU programming for NN computations across hardware platforms and programming models. In particular, the framework provides specific support for metaprogramming and autotuning of operations over ND-Arrays. To show the correctness and value of our framework and approach, we implement a selection of NN operations, covering the core operations needed for deploying three common image-processing neural networks. We target three different hardware platforms: NVIDIA, AMD, and Qualcomm GPUs. On NVIDIA GPUs, we show both portability between OpenCL and CUDA as well as competitive performance compared to the vendor library. On Qualcomm GPUs, we show that our framework enables productive development of target-specific optimizations, and achieves reasonable absolute performance. Finally, on AMD GPUs, we show initial results that indicate our framework can yield reasonable performance on a new platform with minimal effort.


International Conference on Intelligent Transportation Systems | 2015

libHOG: Energy-Efficient Histogram of Oriented Gradient Computation

Forrest N. Iandola; Matthew W. Moskewicz; Kurt Keutzer

Histogram of Oriented Gradients (HOG) features are the underlying representation in automotive computer vision applications such as collision avoidance and lane keeping. In these applications, we have observed that HOG feature computation is often a slow and energy-intensive component of the overall pipeline. In this paper, we focus on reducing both the time taken and the energy used for computing Felzenszwalb HOG features. We achieve our results through a combination of reduced precision, SIMD parallelism, algorithmic changes, and outer-loop parallelism. In particular, we address a bottleneck in histogram accumulation by phrasing the problem as a gather instead of the (traditional) scatter. Additionally, we explore the tradeoffs of using L1 instead of L2 norms to compute gradients, which enables smaller operands and more SIMD parallelism. Overall, we are able to compute multiresolution HOG pyramids at 70 fps for 640×480 images on a multicore CPU. This is a 3.6× speedup over the best known HOG implementation and a 29× speedup over the popular voc-release5 HOG code. This is also a 3.6×-22× reduction in energy per frame compared to previous HOG implementations. Our open-source implementation is available for download.
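The gather-versus-scatter distinction in histogram accumulation can be shown in a few lines. A minimal sketch, assuming each pixel's orientation bin is already computed (illustrative only; libHOG's actual implementation is vectorized SIMD code, not Python):

```python
def scatter_hist(bins, n_bins):
    """Traditional scatter: loop over pixels, increment each pixel's
    bin. Consecutive pixels may hit the same bin, so the read-modify-
    write dependencies serialize SIMD execution."""
    hist = [0] * n_bins
    for b in bins:
        hist[b] += 1
    return hist

def gather_hist(bins, n_bins):
    """Gather: loop over bins, count the matching pixels. Each output
    bin is an independent reduction over the pixel array, which maps
    cleanly onto SIMD compare-and-accumulate."""
    return [sum(1 for b in bins if b == k) for k in range(n_bins)]
```

Both produce identical histograms; the gather formulation trades extra passes over the pixel data for dependency-free inner loops, which is the bottleneck fix the abstract describes.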

Collaboration


Dive into Matthew W. Moskewicz's collaborations.

Top Co-Authors

Kurt Keutzer (University of California)
Scott J. Weber (University of California)
Khalid Ashraf (University of California)
Matthias Gries (University of California)
Ali Jannesari (University of California)