
Publication


Featured research published by Matthew Woitaszek.


Symposium on Applications and the Internet | 2003

Identifying junk electronic mail in Microsoft Outlook with a support vector machine

Matthew Woitaszek; Muhammad Shaaban; Roy Czernikowski

In this paper, we utilize a simple support vector machine to identify commercial electronic mail. The use of a personalized dictionary for model training provided a classification accuracy of 96.69%, while a much larger system dictionary achieved 95.26%. The classification system was subsequently implemented as an add-in for Microsoft Outlook XP, providing sorting and grouping capabilities through Outlook's interface to the typical desktop e-mail user.
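
The dictionary-based classification described above can be sketched in miniature. This is not the paper's implementation: it trains a tiny linear SVM (hinge loss, subgradient descent) over a hand-picked six-word dictionary, and every vocabulary word, message, and hyperparameter below is invented for illustration.

```python
# Toy linear SVM over bag-of-words features, illustrating the kind of
# dictionary-based spam classifier described above. All data, vocabulary,
# and hyperparameters here are invented placeholders.

def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed dictionary."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def train_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Minimize hinge loss + L2 regularization by subgradient descent.
    Labels in y are +1 (spam) / -1 (ham)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge subgradient step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # correctly classified: only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 (spam) or -1 (ham) for a feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

vocab = ["free", "offer", "winner", "meeting", "report", "thesis"]
train = [("free offer winner", 1), ("limited free offer", 1),
         ("meeting report today", -1), ("thesis draft report", -1)]
X = [featurize(t, vocab) for t, _ in train]
y = [label for _, label in train]
w, b = train_svm(X, y)
```

The paper's personalized-dictionary result corresponds to choosing `vocab` from the user's own mail rather than a large system word list.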


Computer and Information Technology | 2010

Developing a Cloud Computing Charging Model for High-Performance Computing Resources

Matthew Woitaszek; Henry M. Tufo

This paper examines the economics of cloud computing charging from the perspective of a supercomputing resource provider offering its own resources. To evaluate the competitiveness of our computing center with cloud computing resources, we develop a comprehensive system utilization charging model similar to that used by Amazon EC2 and apply the model to our current resources and planned procurements. For our current resource, we find that charging for computational time may be appropriate, but that charging for data traffic between the supercomputer and the storage/front-end systems would result in negligible additional revenue. Similarly, charging for data storage capacity at currently typical commercial rates yields insufficient revenue to offset the acquisition and operation of the storage. However, when we extend the analysis to a capacity cluster scheduled for deployment in the first half of 2010 that will be made available to users through batch, Grid, and cloud interfaces, we find that the resource will be competitive with current and anticipated cloud rates.
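
The charging model's structure, billing compute, data transfer, and storage as separate revenue streams so each can be inspected on its own, can be sketched as follows. All rates and usage figures are illustrative placeholders, not the paper's numbers or Amazon's actual prices.

```python
# Back-of-the-envelope utilization charging model in the spirit of the
# paper: compute, transfer, and storage are billed separately so the
# revenue contribution of each stream is visible. All rates below are
# assumed for illustration only.

def monthly_revenue(cpu_hours, transfer_gb, stored_tb,
                    cpu_rate=0.10,        # $/core-hour (assumed)
                    transfer_rate=0.10,   # $/GB moved (assumed)
                    storage_rate=100.0):  # $/TB-month (assumed)
    compute = cpu_hours * cpu_rate
    transfer = transfer_gb * transfer_rate
    storage = stored_tb * storage_rate
    return {"compute": compute, "transfer": transfer,
            "storage": storage, "total": compute + transfer + storage}

# A busy month on a mid-size cluster: compute dwarfs the data charges,
# mirroring the paper's finding that transfer fees add little revenue.
rev = monthly_revenue(cpu_hours=500_000, transfer_gb=2_000, stored_tb=50)
```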


Many Task Computing on Grids and Supercomputers | 2011

Parallel high-resolution climate data analysis using swift

Matthew Woitaszek; John M. Dennis; Taleena R. Sines

Advances in software parallelism and high-performance systems have resulted in an order of magnitude increase in the volume of output data produced by the Community Earth System Model (CESM). As the volume of data produced by CESM increases, the single-threaded script-based software packages traditionally used to post-process model output data have become a bottleneck in the analysis process. This paper presents a parallel version of the CESM atmosphere model data analysis workflow implemented using the Swift scripting language. Using the Swift implementation of the workflow, the time to analyze a 10-year atmosphere simulation on a typical cluster is reduced from 95 to 32 minutes on a single 8-core node and to 20 minutes on two nodes. The parallelized workflow is then used to evaluate several new data-intensive computational systems that feature RAM-based and flash-based storage. Even when constraining parallelism to limit the amount of file system space used by intermediate temporary data, our results show that the Swift-based implementation significantly reduces data analysis time.
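
The fan-out/fan-in structure behind the parallelized workflow can be sketched with Python's standard library; Swift expresses the same dataflow declaratively and runs it at cluster scale. The field names and the per-field "analysis" below are invented stand-ins for the CESM post-processing steps.

```python
# Sketch of the fan-out pattern behind the Swift workflow: independent
# per-field post-processing tasks run in parallel, then results are
# gathered. The per-field computation here (a simple mean) is a
# placeholder for the real diagnostics.
from concurrent.futures import ThreadPoolExecutor

def analyze_field(field_data):
    """Placeholder per-field reduction (e.g., a time mean)."""
    return sum(field_data) / len(field_data)

def parallel_analysis(fields, workers=4):
    """Map the analysis over every field concurrently, keyed by name."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        means = list(pool.map(analyze_field, fields.values()))
    return dict(zip(fields.keys(), means))

fields = {"T": [280.0, 282.0, 284.0], "U": [10.0, 12.0], "Q": [0.5, 0.7]}
result = parallel_analysis(fields)
```

Capping `workers` plays the same role as the paper's constraint on parallelism to limit the file-system space consumed by intermediate temporary data.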


Grid Computing Environments | 2009

AMP: a science-driven web-based application for the TeraGrid

Matthew Woitaszek; T. S. Metcalfe; Ian Shorrock

The Asteroseismic Modeling Portal (AMP) provides a web-based interface for astronomers to run and view simulations that derive the properties of Sun-like stars from observations of their pulsation frequencies. In this paper, we describe the architecture and implementation of AMP, highlighting the lightweight design principles and tools used to produce a functional fully-custom web-based science application in less than a year. Targeted as a TeraGrid science gateway, AMP's architecture and implementation are intended to simplify its orchestration of TeraGrid computational resources. AMP's web-based interface was developed as a traditional standalone database-backed web application using the Python-based Django web development framework, allowing us to leverage the Django framework's capabilities while cleanly separating the user interface development from the grid interface development. We have found this combination of tools flexible and effective for rapid gateway development and deployment.


International Conference on Cluster Computing | 2007

High throughput grid computing with an IBM Blue Gene/L

Jason Cope; Michael Oberg; Henry M. Tufo; Theron Voran; Matthew Woitaszek

While much high-performance computing is performed using massively parallel MPI applications, many workflows execute jobs with a mix of processor counts. At the extreme end of the scale, some workloads consist of large quantities of single-processor jobs. These types of workflows lead to inefficient usage of massively parallel architectures such as the IBM Blue Gene/L (BG/L) because of allocation constraints forced by its unique system design. Recently, IBM introduced the ability to schedule individual processors on BG/L - a feature named high throughput computing (HTC) - creating an opportunity to exploit the system's power efficiency for other classes of computing. In this paper, we present a Grid-enabled interface supporting HTC on BG/L. This interface accepts single-processor tasks using Globus GRAM, aggregates HTC tasks into BG/L partitions, and requests partition execution using the underlying system scheduler. By separating HTC task aggregation from scheduling, we provide the ability for workflows constructed using standard Grid middleware to run both parallel and serial jobs on the BG/L. We examine the startup latency and performance of running large quantities of HTC jobs. Finally, we deploy Daymet, a component of a coupled climate model, on a BG/L system using our HTC interface.
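
The aggregation step described above, grouping single-processor tasks into partition-sized batches before handing them to the system scheduler, can be sketched as below. The partition size and task names are invented for illustration; the real interface accepts tasks via Globus GRAM.

```python
# Minimal sketch of HTC task aggregation: serial tasks arrive one at a
# time, but BG/L allocates whole partitions, so the dispatcher groups
# tasks into partition-sized batches and submits each batch as one
# partition request. Batch size here is an assumed example value.

def aggregate_tasks(tasks, partition_size):
    """Group serial tasks into batches that each fill one partition."""
    batches = []
    for i in range(0, len(tasks), partition_size):
        batches.append(tasks[i:i + partition_size])
    return batches

tasks = [f"job-{i}" for i in range(10)]
batches = aggregate_tasks(tasks, partition_size=4)
# Ten serial jobs become three partition requests; the final batch
# is only partially full.
```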


IEEE Conference on Mass Storage Systems and Technologies | 2007

Tornado Codes for MAID Archival Storage

Matthew Woitaszek; Henry M. Tufo

This paper examines the application of Tornado codes, a class of low density parity check (LDPC) erasure codes, to archival storage systems based on massive arrays of idle disks (MAID). We present a log-structured extent-based archival file system built on Tornado-coded stripe storage. The file system is combined with a MAID simulator to emulate the behavior of a large-scale storage system, with the goal of employing Tornado codes to provide fault tolerance and performance in a power-constrained environment. The effect of power conservation constraints on system throughput is examined, and a policy of placing multiple data nodes on a single device is shown to increase read throughput at the cost of a measurable, but negligible, decrease in fault tolerance. Finally, a system prototype is implemented on a 100 TB Lustre storage cluster, providing GridFTP-accessible storage with higher reliability and availability than the underlying storage architecture.


High Performance Distributed Computing | 2006

Fault Tolerance of Tornado Codes for Archival Storage

Matthew Woitaszek; Henry M. Tufo

This paper examines a class of low density parity check (LDPC) erasure codes called Tornado codes for applications in archival storage systems. We analyze the fault tolerance of Tornado code graphs and show that worst-case failure scenarios in small (96-node) graphs can be identified and mitigated by using simulations to find and eliminate critical node sets that can cause Tornado codes to fail even when almost all blocks are present. The graph construction procedure resulting from this analysis is then used to build a 96-device Tornado code storage system with capacity overhead equivalent to RAID 10 that tolerates any 4 device failures. This system is demonstrated to be superior to other parity-based RAID systems. Finally, we describe how a geographically distributed data stewarding system can be enhanced by using cooperatively selected Tornado code graphs to obtain fault tolerance exceeding that of its constituent storage sites or site replication strategies.


European Conference on Parallel Processing | 2005

Grid-BGC: a grid-enabled terrestrial carbon cycle modeling system

Jason Cope; Craig Hartsough; Peter E. Thornton; Henry M. Tufo; Nathan Wilhelmi; Matthew Woitaszek

Grid-BGC is a Grid-enabled terrestrial biogeochemical cycle simulator collaboratively developed by the National Center for Atmospheric Research (NCAR) and the University of Colorado (CU) with funding from NASA. The primary objective of the project is to utilize Globus Grid technology to integrate inexpensive commodity cluster computational resources at CU with the mass storage system at NCAR while hiding the logistics of data transfer and job submission from the scientists. We describe a typical process for simulating the terrestrial carbon cycle, present our solution architecture and software design, and describe our implementation experiences with Grid technology on our systems. By design the Grid-BGC software framework is extensible in that it can utilize other grid-accessible computational resources and can be readily applied to other climate simulation problems which have similar workflows. Overall, this project demonstrates an end-to-end system which leverages Grid technologies to harness distributed resources across organizational boundaries to achieve a cost-effective solution to a compute-intensive problem.


Many Task Computing on Grids and Supercomputers | 2009

Ensemble dispatching on an IBM Blue Gene/L for a bioinformatics knowledge environment

Paul Marshall; Matthew Woitaszek; Henry M. Tufo; Rob Knight; Daniel McDonald; Julia K. Goodrich

This paper discusses our work providing support for processing a large number of short tasks within the context of our development of a collaborative bioinformatics knowledge environment for structural biologists, environmental microbiologists, and evolutionary biologists. We have designed and implemented a new ensemble-based task dispatching system that we have deployed on a Blue Gene/L system in conjunction with the Blue Gene's High Throughput Computing (HTC) capability. Unlike our prior general database-backed HTC task dispatching system, the ensemble-based task dispatching system is able to efficiently process and dispatch large numbers of very short tasks to over a thousand cores. We also investigate the scalability of the IBM Blue Gene/L at HTC in general, identifying and eliminating processor-reboot inefficiencies for very short tasks in specific applications, making the Blue Gene/L a feasible processing system for this bioinformatics workload.


Computer Science - Research and Development | 2011

A system architecture supporting high-performance and cloud computing in an academic consortium environment

Michael Oberg; Matthew Woitaszek; Theron Voran; Henry M. Tufo

The University of Colorado (CU) and the National Center for Atmospheric Research (NCAR) have been deploying complementary and federated resources supporting computational science in the Western United States since 2004. This activity has expanded to include other partners in the area, forming the basis for a broader Front Range Computing Consortium (FRCC). This paper describes the development of the Consortium's current architecture for federated high-performance resources, including a new 184 teraflop/s (TF) computational system at CU and prototype data-centric computing resources at NCAR. CU's new Dell-based computational plant is housed in a co-designed pre-fabricated data center facility that allowed the university to install a top-tier academic resource without major capital facility investments or renovations. We describe the integration of features such as virtualization, dynamic configuration of high-throughput networks, and Grid and cloud technologies into an architecture that supports collaboration among regional computational science participants.

Collaboration


Dive into Matthew Woitaszek's collaborations.

Top Co-Authors

Henry M. Tufo (University of Colorado Boulder)
Jason Cope (University of Colorado Boulder)
Michael Oberg (National Center for Atmospheric Research)
Paul Marshall (University of Colorado Boulder)
Theron Voran (University of Colorado Boulder)
Craig Hartsough (National Center for Atmospheric Research)
Nathan Wilhelmi (National Center for Atmospheric Research)
Peter E. Thornton (Oak Ridge National Laboratory)
Sean McCreary (National Center for Atmospheric Research)
Adam Boggs (National Center for Atmospheric Research)