Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where William A. Watson is active.

Publication


Featured research published by William A. Watson.


International Supercomputing Conference | 2013

Lattice QCD on Intel® Xeon Phi™ Coprocessors

Balint Joo; Dhiraj D. Kalamkar; Karthikeyan Vaidyanathan; Mikhail Smelyanskiy; Kiran Pamnany; Victor W. Lee; Pradeep Dubey; William A. Watson

Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of nuclear and high-energy physics. LQCD codes use large fractions of supercomputing cycles worldwide and are often among the first to be ported to new high-performance computing architectures. The recently released Intel Xeon Phi architecture features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. In this contribution, we describe our experiences optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, using single precision, our Dslash kernel sustains a performance of up to 320 GFLOPS, while our Conjugate Gradients solver sustains up to 237 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on KNC nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.9 TFLOPS on 32 KNCs.
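
As an informal illustration of the kind of solver being benchmarked, the sketch below shows a minimal conjugate gradient loop in C++ over a generic matrix-free operator. The operator type `Op` and the toy `A` in `main` are stand-ins for the optimized Dslash-based kernel; this is a simplified sketch, not the paper's code. In practice nearly all of the runtime goes into the operator application, which is why the Dslash kernel is the optimization target.

```cpp
// Minimal conjugate-gradient sketch with a matrix-free operator. The operator is
// a hypothetical stand-in for the Dslash-based kernel described in the paper.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
using Op  = std::function<void(const Vec&, Vec&)>;  // computes y = A x

float dot(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solve A x = b for symmetric positive-definite A; returns the iteration count.
int conjugate_gradient(const Op& A, const Vec& b, Vec& x, float tol, int max_iter) {
    Vec r = b, p = b, Ap(b.size());
    std::fill(x.begin(), x.end(), 0.0f);   // start from x = 0, so r = b
    float rr = dot(r, r);
    for (int k = 0; k < max_iter; ++k) {
        if (std::sqrt(rr) < tol) return k;
        A(p, Ap);                          // the expensive Dslash-like step
        float alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        float rr_new = dot(r, r);
        float beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return max_iter;
}

int main() {
    // Toy operator: A = 2 * I, so the solution of A x = b is b / 2.
    Op A = [](const Vec& in, Vec& out) {
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
    };
    Vec b(8, 1.0f), x(8);
    int iters = conjugate_gradient(A, b, x, 1e-6f, 100);
    std::printf("converged in %d iterations, x[0] = %f\n", iters, x[0]);
}
```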


Networking, Architecture and Storage | 2008

Parallel Job Scheduling with Overhead: A Benchmark Study

Richard A. Dutton; Weizhen Mao; Jie Chen; William A. Watson

We study parallel job scheduling, where each job may be scheduled on any number of available processors in a given parallel system. We propose a mathematical model to estimate a job's execution time when it is assigned to multiple parallel processors. The model incorporates both the linear computation speedup achieved by having multiple processors execute a job and the overhead incurred due to communication, synchronization, and management of multiple processors working on the same job. We show that the model is sophisticated enough to reflect the reality of parallel job execution while remaining concise enough to make theoretical analysis possible. In particular, we study the validity of our overhead model by running well-known benchmarks on a parallel system with 1024 processors. We compare our fitting results with the traditional linear model without the overhead term. The comparison shows conclusively that our model more accurately reflects the effect of the number of processors on the execution time. We also summarize some theoretical results for a parallel job scheduling problem that uses our overhead model to calculate execution times.
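
The paper's exact model is not reproduced in the abstract; as a hedged illustration, the snippet below sketches one plausible form: linear speedup of the work plus an overhead term that grows with the processor count. The parameters `w`, `alpha`, and `beta` are hypothetical fitting parameters, not values from the paper.

```cpp
// Hypothetical execution-time model of the kind described: linear speedup of the
// work plus an overhead term that grows with the number of processors p.
// w, alpha, and beta would be fit from benchmark runs; the exact functional form
// used in the paper may differ.
#include <cstdio>

double estimated_time(double w, double alpha, double beta, int p) {
    return w / p          // ideal linear speedup of the computation
         + alpha          // fixed per-job management cost
         + beta * p;      // communication/synchronization cost growing with p
}

int main() {
    // Example: 1000 s of serial work, 0.5 s fixed overhead, 0.02 s per processor.
    for (int p = 1; p <= 1024; p *= 4)
        std::printf("p=%4d  t=%.2f s\n", p, estimated_time(1000.0, 0.5, 0.02, p));
}
```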


International Parallel and Distributed Processing Symposium | 2005

Message passing for Linux clusters with gigabit Ethernet mesh connections

Jie Chen; William A. Watson; Robert G. Edwards; Weizhen Mao

Multiple copper-based commodity gigabit Ethernet (GigE) interconnects (adapters) on a single host make it possible to build Linux clusters with mesh/torus connections without using expensive switches and high-speed network interface cards (NICs). However, traditional message passing systems based on TCP over GigE perform poorly on this type of cluster because of the overhead TCP incurs across multiple GigE links. In this paper, we present two OS-bypass message passing systems based on a modified M-VIA (an implementation of the VIA specification) for two production GigE mesh clusters: one is constructed as a 4×8×8 (256 nodes) torus and has been in production use for a year; the other is constructed as a 6×8×8 (384 nodes) torus and was deployed recently. One of the message passing systems, called QMP, targets a specific application domain; the other is an implementation of the MPI 1.1 specification. The GigE mesh clusters using these two message passing systems achieve about 18.5 μs half round-trip latency and 400 MB/s total bandwidth, which compare reasonably well to systems using specialized high-speed adapters in a switched architecture, at much lower cost.
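
Latency figures such as the 18.5 μs quoted above are typically obtained with a ping-pong microbenchmark; the sketch below shows a standard MPI version of such a measurement (not the paper's benchmark code).

```cpp
// Standard MPI ping-pong microbenchmark of the kind used to report half
// round-trip latency figures. Run with at least two ranks, e.g. mpirun -np 2.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)  // half the round trip is the one-way latency
        std::printf("half round-trip latency: %.2f us\n", (t1 - t0) / iters / 2 * 1e6);
    MPI_Finalize();
}
```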


International Conference on Parallel and Distributed Systems | 2010

GMH: A Message Passing Toolkit for GPU Clusters

Jie Chen; William A. Watson; Weizhen Mao

Driven by the market demand for high-definition 3D graphics, commodity graphics processing units (GPUs) have evolved into highly parallel, multi-threaded, many-core processors, which are ideal for data-parallel computing. Many applications have been ported to run on a single GPU with tremendous speedups using general C-style programming languages such as CUDA. However, large applications require multiple GPUs and demand explicit message passing. This paper presents a message passing toolkit, called GMH (GPU Message Handler), on NVIDIA GPUs. This toolkit utilizes a data-parallel thread group to map multiple GPUs on a single host to an MPI rank, and introduces a notion of virtual GPUs as a way to bind a thread to a GPU automatically. The toolkit provides high-performance, MPI-style point-to-point and collective communication and, more importantly, offers event-driven APIs that allow an application to be managed and executed by the toolkit at runtime.
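
The GMH API itself is not shown in the abstract; the following C++ sketch only illustrates the general idea of binding one host thread per "virtual GPU" within an MPI rank. All names are invented for illustration, and the actual device binding would be a CUDA runtime call such as cudaSetDevice.

```cpp
// Hypothetical sketch of the "virtual GPU" idea: within one MPI rank, a group of
// host threads is created and each thread is bound to one physical GPU. In a real
// toolkit the bind step would call cudaSetDevice(device_id); here it is only printed.
#include <cstdio>
#include <thread>
#include <vector>

void worker(int virtual_gpu, int device_id) {
    // bind: cudaSetDevice(device_id) would go here
    std::printf("virtual GPU %d bound to physical device %d\n", virtual_gpu, device_id);
    // ... launch kernels, post point-to-point or collective messages ...
}

int main() {
    const int gpus_per_host = 4;  // assumed; queried from the runtime in practice
    std::vector<std::thread> group;
    for (int v = 0; v < gpus_per_host; ++v)
        group.emplace_back(worker, v, v);   // one host thread per virtual GPU
    for (auto& t : group) t.join();
}
```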


International Symposium on Parallel Architectures, Algorithms and Networks | 2000

User level communication on Alpha Linux systems

Jie Chen; William A. Watson

Recent advances in commodity network interface technology enable scientists and engineers to build clusters of workstations or PCs to execute parallel applications. However, raw hardware network performance is rarely delivered to applications because of the overheads of communication software and operating systems. To reduce these overheads, a technique called user-level communication can be used to allow applications to access the network interface directly without intervention from the operating system. We examine two user-level communication systems, GM and BIP, on Alpha-based systems connected by a Myrinet network. In addition, we present quantitative studies of how DMA initiation costs, flow control, and reliable communication affect the performance of communication software.
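
As a rough illustration of how startup costs such as DMA initiation show up in such measurements, the snippet below sketches the usual first-order latency-bandwidth model of per-message time; the parameter values are illustrative, not results from the paper.

```cpp
// Simple first-order message-time model often used in such studies: a fixed startup
// term t0 (software overhead plus DMA initiation) plus a size-dependent term n/B.
// The parameter values below are illustrative, not measurements from the paper.
#include <cstdio>

double message_time_us(double n_bytes, double t0_us, double bandwidth_MBps) {
    return t0_us + n_bytes / bandwidth_MBps;  // 1 MB/s == 1 byte per microsecond
}

int main() {
    // Illustrative parameters: 10 us startup cost, 100 MB/s sustained bandwidth.
    for (double n = 1; n <= (1 << 20); n *= 16)
        std::printf("%8.0f bytes -> %8.2f us\n", n, message_time_us(n, 10.0, 100.0));
}
```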


Concurrency and Computation: Practice and Experience | 2002

A Web services data analysis Grid

William A. Watson; Ian Bird; Jie Chen; Bryan Hess; Andy Kowalski; Ying Chen

The trend in large-scale scientific data analysis is to exploit computational, storage and other resources located at multiple sites, and to make those resources accessible to the scientist as if they were a single, coherent system. Web technologies driven by the huge and rapidly growing electronic commerce industry provide valuable components to speed the deployment of such sophisticated systems. Jefferson Lab, where several hundred terabytes of experimental data are acquired each year, is in the process of developing a Web-based distributed system for data analysis and management. The essential aspects of this system are a distributed data Grid (site-independent access to experimental, simulation and model data) and a distributed batch system, augmented with various supervisory and management capabilities, and integrated using Java and XML-based Web services.


IEEE Particle Accelerator Conference | 1997

A portable accelerator control toolkit

William A. Watson

In recent years, the expense of creating good control software has led to a number of collaborative efforts among laboratories to share this cost. The EPICS collaboration is a particularly successful example of this trend. More recently, another collaborative effort has addressed the need for sophisticated high-level software, including model-driven accelerator controls. This work builds upon the CDEV (Common DEVice) software framework, which provides a generic abstraction of a control system, and maps that abstraction onto a number of site-specific control systems including EPICS, the SLAC control system, CERN/PS and others. In principle, it is now possible to create portable accelerator control applications which have no knowledge of the underlying, site-specific control system. Applications based on CDEV now provide a growing suite of tools for accelerator operations, including general purpose displays, an on-line accelerator model, beamline steering, machine status displays incorporating both hardware and model information (such as beam positions overlaid with beta functions) and more. A survey of CDEV compatible portable applications will be presented, as well as plans for future development.
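
The actual CDEV API is not reproduced here; the sketch below only illustrates the device-abstraction idea with invented class and method names. The application programs against a generic device interface, and a site-specific backend (EPICS, the SLAC control system, CERN/PS, ...) supplies the implementation.

```cpp
// Hypothetical illustration of the device-abstraction idea behind CDEV. These
// class and method names are invented for illustration and are not the CDEV API.
#include <cstdio>
#include <memory>
#include <string>

struct Device {                                   // generic view of a controlled device
    virtual ~Device() = default;
    virtual double get(const std::string& attribute) = 0;
    virtual void   set(const std::string& attribute, double value) = 0;
};

struct EpicsDevice : Device {                     // one possible site-specific backend
    explicit EpicsDevice(std::string pv) : pv_(std::move(pv)) {}
    double get(const std::string&) override { return 0.0; /* would do a Channel Access read */ }
    void   set(const std::string&, double) override { /* would do a Channel Access write */ }
    std::string pv_;
};

int main() {
    std::unique_ptr<Device> bpm = std::make_unique<EpicsDevice>("bpm01");
    double x = bpm->get("xPosition");             // application code never sees the backend
    std::printf("beam position x = %f mm\n", x);
}
```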


International Parallel and Distributed Processing Symposium | 2012

Automatic Offloading C++ Expression Templates to CUDA Enabled GPUs

Jie Chen; Balint Joo; William A. Watson; Robert G. Edwards

In the last few years, many scientific applications have been developed for powerful graphics processing units (GPUs) and have achieved remarkable speedups. This success can be partially attributed to high-performance, host-callable GPU library routines that are offloaded to GPUs at runtime. These library routines are based on C/C++-like programming toolkits such as CUDA from NVIDIA and have the same calling signatures as their CPU counterparts. Recently, with sufficient support for C++ templates in CUDA, the emergence of template libraries has enabled further advances in code reusability and rapid software development for GPUs. However, Expression Templates (ET), which have been very popular for implementing data-parallel scientific software for host CPUs because of their intuitive and mathematics-like syntax, have been underutilized by GPU development libraries. The lack of ET usage is caused by the difficulty of offloading expression templates from hosts to GPUs, due to the inability to pass instantiated expressions to GPU kernels as well as the absence of the exact form of the expressions at the time of coding. This paper presents a general approach that enables automatic offloading of C++ expression templates to CUDA-enabled GPUs by using C++ metaprogramming and Just-In-Time (JIT) compilation to generate and compile CUDA kernels for the corresponding expression templates, and then executing the kernels with appropriate arguments. This approach allows developers to port applications to run on GPUs with virtually no code modifications. More specifically, this paper uses a large ET-based data-parallel physics library called QDP++ as an example to illustrate many aspects of the approach and to demonstrate very good speedups for typical QDP++ applications running on GPUs compared to CPUs. In addition, this approach to automatic offloading of expression templates could be applied to other many-core accelerators that provide C++ programming toolkits with support for C++ templates.
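
For readers unfamiliar with the technique, the following self-contained C++ sketch shows what an expression template looks like in its simplest form: `c = a + b` builds a lightweight expression object and evaluates it element-wise in a single fused loop. The QDP++ types and the CUDA JIT offload machinery described in the paper are not reproduced.

```cpp
// Minimal expression-template example showing the "mathematics-like" syntax the
// paper builds on. The assignment loop is the piece that, in the paper's approach,
// would be turned into a CUDA kernel and JIT-compiled at runtime.
#include <cstddef>
#include <cstdio>
#include <vector>

template <class L, class R>
struct Add {                                 // node representing lhs + rhs, evaluated lazily
    const L& lhs; const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
    template <class E>
    Vec& operator=(const E& expr) {          // single fused loop over the expression
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const L& a, const R& b) { return {a, b}; }

int main() {
    Vec a(4, 1.0), b(4, 2.0), c(4);
    c = a + b;                    // expression template: no temporaries, one loop
    std::printf("c[0] = %f\n", c[0]);
}
```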


Networking, Architecture and Storage | 2008

Software Barrier Performance on Dual Quad-Core Opterons

Jie Chen; William A. Watson

SMP servers based on multi-core processors have become building blocks for Linux clusters in recent years because they can deliver better performance for multi-threaded programs through on-chip multi-threading. However, a relatively slow software barrier can hinder the performance of a data-parallel scientific application on a multi-core system. In this paper we study the performance of different software barrier algorithms on a server based on the newly introduced AMD quad-core Opteron processors. We study how the memory architecture and the cache coherence protocol of the system influence the performance of barrier algorithms. We present an optimized barrier algorithm derived from the queue-based barrier algorithm. We find that the optimized barrier algorithm achieves a speedup of 1.77 over the original queue-based algorithm. In addition, it achieves a speedup of 2.39 over the software barrier generated by the Intel OpenMP compiler.
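
As background, the sketch below shows a minimal centralized sense-reversing barrier built on std::atomic; it is only meant to illustrate the kind of software barrier being compared, and is not the paper's optimized queue-based algorithm.

```cpp
// Minimal centralized sense-reversing barrier using std::atomic, shown only to
// illustrate the kind of software barrier being compared in the paper.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

class SpinBarrier {
    const int nthreads_;
    std::atomic<int>  count_;
    std::atomic<bool> sense_{false};
public:
    explicit SpinBarrier(int n) : nthreads_(n), count_(n) {}
    void wait() {
        bool my_sense = !sense_.load(std::memory_order_relaxed);
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            count_.store(nthreads_, std::memory_order_relaxed);
            sense_.store(my_sense, std::memory_order_release);   // last thread flips the sense
        } else {
            while (sense_.load(std::memory_order_acquire) != my_sense) { /* spin */ }
        }
    }
};

int main() {
    const int nthreads = 8;
    SpinBarrier barrier(nthreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            for (int phase = 0; phase < 3; ++phase) barrier.wait();
            std::printf("thread %d done\n", t);
        });
    for (auto& th : threads) th.join();
}
```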


Networking, Architecture and Storage | 2014

Efficient GCD Computation for Big Integers on Xeon Phi Coprocessor

Jie Chen; William A. Watson; Mayee F. Chen

Efficient calculation of the greatest common divisor (GCD) for big integers of 1024 bits or more has drawn a considerable amount of attention because it can be used to detect weaknesses in the RSA security infrastructure. This paper presents a parallel binary GCD algorithm and its implementation for big integers on the Intel Xeon Phi coprocessor. The algorithm computes GCDs of many pairs of big integers in parallel by utilizing all cores of a Xeon Phi coprocessor and by taking advantage of its vector processing units to speed up critical integer operations within the algorithm. Using 240 threads on a Xeon Phi coprocessor to carry out GCD calculations for a large number of 2048-bit integers, the implementation achieves a speedup of 30 over a sequential binary GCD implementation on a single CPU core, and it delivers twice the performance of the same sequential binary GCD implementation running on 240 threads of the Xeon Phi.
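
For reference, the sketch below shows the sequential binary GCD (Stein's algorithm) for machine-word integers, i.e. the recurrence the parallel big-integer version builds on; the paper's multi-word arithmetic and Xeon Phi vectorization are not shown.

```cpp
// Sequential binary GCD (Stein's algorithm) for machine-word integers. The parallel
// big-integer version described in the paper applies the same recurrence to
// multi-word operands using vectorized shift and subtract operations.
#include <cstdint>
#include <cstdio>
#include <utility>

uint64_t binary_gcd(uint64_t a, uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = 0;
    while (((a | b) & 1) == 0) { a >>= 1; b >>= 1; ++shift; }  // factor out common 2s
    while ((a & 1) == 0) a >>= 1;                              // make a odd
    while (b != 0) {
        while ((b & 1) == 0) b >>= 1;                          // strip factors of 2 from b
        if (a > b) std::swap(a, b);                            // keep a <= b
        b -= a;                                                // gcd(a, b) == gcd(a, b - a)
    }
    return a << shift;
}

int main() {
    std::printf("%llu\n", (unsigned long long)binary_gcd(2048, 1536));  // prints 512
}
```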

Collaboration


Dive into William A. Watson's collaboration.

Top Co-Authors

Jie Chen (Thomas Jefferson National Accelerator Facility)
Ying Chen (Thomas Jefferson National Accelerator Facility)
Walt Akers (Thomas Jefferson National Accelerator Facility)
Balint Joo (Thomas Jefferson National Accelerator Facility)
Robert G. Edwards (Thomas Jefferson National Accelerator Facility)
Andy Kowalski (Thomas Jefferson National Accelerator Facility)
Bryan Hess (Thomas Jefferson National Accelerator Facility)
Ian Bird (Thomas Jefferson National Accelerator Facility)