D C Bradley
University of Wisconsin-Madison
Publications
Featured research published by D C Bradley.
Computer Science and Information Engineering | 2009
I. Sfiligoi; D C Bradley; Burt Holzman; Parag Mhashilkar; Sanjay Padhi; F. Würthwein
Grid computing has become very popular in big and widespread scientific communities with high computing demands, like high energy physics. Computing resources are being distributed over many independent sites with only a thin layer of Grid middleware shared between them. This deployment model has proven to be very convenient for computing resource providers, but has introduced several problems for the users of the system, the three major ones being the complexity of job scheduling, the non-uniformity of compute resources, and the lack of good job monitoring. Pilot jobs address all of the above problems by creating a virtual private computing pool on top of Grid resources. This paper presents both the general pilot concept and a concrete implementation, called glideinWMS, deployed in the Open Science Grid.
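As a rough illustration of the pilot concept described above, the following Python sketch shows a pilot's basic lifecycle: land on a Grid worker node, validate the environment, advertise the node back to a user-controlled central pool, and serve user jobs until an idle timeout expires. This is not glideinWMS code; every name in it (validate_node, advertise_slot, run_next_user_job, the collector address) is a hypothetical stand-in for the real glidein startup logic.

    # Illustrative sketch of a pilot-job lifecycle (not actual glideinWMS code).
    # All names below (validate_node, COLLECTOR, run_next_user_job) are hypothetical.
    import time

    COLLECTOR = "pool.example.org:9618"   # hypothetical central-pool address
    IDLE_TIMEOUT = 20 * 60                # give the slot back after 20 idle minutes

    def validate_node():
        """Check OS, scratch space, outbound connectivity, etc."""
        return True

    def advertise_slot(collector):
        """Register this worker node as a slot in the virtual private pool."""
        print(f"advertising slot to {collector}")

    def run_next_user_job():
        """Fetch and run one user job; return False if the queue is empty."""
        return False

    def pilot_main():
        if not validate_node():
            return                        # misconfigured node: exit quietly
        advertise_slot(COLLECTOR)
        idle_since = time.time()
        while time.time() - idle_since < IDLE_TIMEOUT:
            if run_next_user_job():
                idle_since = time.time()  # reset the idle clock after each job
            else:
                time.sleep(60)            # nothing to run yet; poll again

    if __name__ == "__main__":
        pilot_main()

Because the user pool only ever sees slots that a pilot has already validated and advertised, the non-uniformity and monitoring problems listed above are handled inside this layer rather than by each user job.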
Journal of Physics: Conference Series | 2011
D C Bradley; T St Clair; Matthew Farrellee; Z Guo; Miron Livny; I Sfiligoi; Todd Tannenbaum
Condor is being used extensively in the HEP environment. It is the batch system of choice for many compute farms, including several WLCG Tier 1s, Tier 2s and Tier 3s. It is also the building block of one of the Grid pilot infrastructures, namely glideinWMS. As with any software, Condor does not scale indefinitely with the number of users and/or the number of resources being handled. In this paper we present the currently observed scalability limits of both the latest production and the latest development release of Condor, and compare them with the limits reported in previous publications. A description of the changes introduced to remove the previous scalability limits is also presented.
Journal of Physics: Conference Series | 2014
L. A. T. Bauerdick; Kenneth Bloom; Brian Bockelman; D C Bradley; Sridhara Dasu; J M Dost; I. Sfiligoi; A Tadel; M. Tadel; Frank Wuerthwein; Avi Yagil
Following the success of the XRootd-based US CMS data federation, the AAA project investigated extensions of the federation architecture by developing two sample implementations of an XRootd, disk-based, caching proxy. The first one simply starts fetching a whole file as soon as a file open request is received and is suitable when completely random file access is expected or it is already known that the whole file will be read. The second implementation supports on-demand downloading of partial files. Extensions to the Hadoop Distributed File System have been developed to allow for an immediate fallback to network access when local HDFS storage fails to provide the requested block. Both cache implementations are in pre-production testing at UCSD.
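The two caching strategies can be contrasted with a small sketch, shown below under stated assumptions: the class and method names are invented for illustration and are not part of the XRootd proxy code. In whole-file mode the proxy starts prefetching the entire file as soon as it is opened; in partial-file mode it fetches and caches only the blocks that clients actually request, going back to the remote source on a cache miss.

    # Hypothetical sketch of the two caching-proxy modes (not actual XRootd code).
    # Reads are assumed not to cross block boundaries, for brevity.
    import threading

    BLOCK_SIZE = 1024 * 1024  # 1 MiB blocks, an arbitrary choice for illustration

    class CachingProxy:
        def __init__(self, remote, cache, whole_file=False):
            self.remote = remote          # object exposing read(path, offset, size)
            self.cache = cache            # dict-like local block store
            self.whole_file = whole_file

        def open(self, path, size):
            if self.whole_file:
                # Mode 1: start fetching the whole file as soon as it is opened.
                threading.Thread(target=self._prefetch, args=(path, size),
                                 daemon=True).start()

        def read(self, path, offset, size):
            block = offset // BLOCK_SIZE
            key = (path, block)
            if key not in self.cache:
                # Mode 2 (and misses in mode 1): fetch only the block that is needed.
                self.cache[key] = self.remote.read(path, block * BLOCK_SIZE, BLOCK_SIZE)
            start = offset - block * BLOCK_SIZE
            return self.cache[key][start:start + size]

        def _prefetch(self, path, size):
            for offset in range(0, size, BLOCK_SIZE):
                self.read(path, offset, BLOCK_SIZE)

The HDFS fallback mentioned above sits one level lower: when the local block store cannot serve a block, the read is redirected to the wide-area federation instead of failing.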
arXiv: Distributed, Parallel, and Cluster Computing | 2010
Zach Miller; D C Bradley; Todd Tannenbaum; I. Sfiligoi
Many secure communication libraries used by distributed systems, such as SSL, TLS, and Kerberos, fail to make a clear distinction between the authentication, session, and communication layers. In this paper we introduce CEDAR, the secure communication library used by the Condor High Throughput Computing software, and present the advantages to a distributed computing system resulting from CEDAR's separation of these layers. Regardless of the authentication method used, CEDAR establishes a secure session key, which has the flexibility to be used for multiple capabilities. We demonstrate how a layered approach to security sessions can avoid round-trips and latency inherent in network authentication. The creation of a distinct session management layer allows for optimizations to improve scalability by way of delegating sessions to other components in the system. This session delegation creates a chain of trust that reduces the overhead of establishing secure connections and enables centralized enforcement of system-wide security policies. Additionally, secure channels based upon UDP datagrams are often overlooked by existing libraries; we show how CEDAR's structure accommodates this as well. As an example of the utility of this work, we show how the use of delegated security sessions and other techniques inherent in CEDAR's architecture enables US CMS to meet their scalability requirements in deploying Condor over large-scale, wide-area grid systems.
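The layering argument can be made concrete with a minimal sketch, assuming a simple session cache keyed by a random session id; none of the class or method names below come from CEDAR itself. Authentication runs once to mint a session key, the key can be delegated to another component to form the chain of trust described above, and the same key can integrity-protect individual messages, including UDP datagrams, without further round trips.

    # Illustrative sketch of session establishment and delegation (not CEDAR's API).
    import os, hmac, hashlib

    class SessionCache:
        """Cache session keys so later exchanges skip the authentication round trips."""

        def __init__(self):
            self._sessions = {}   # session_id -> shared secret key

        def establish(self, authenticate):
            """Run the (possibly expensive) authentication once, then mint a session."""
            if not authenticate():
                raise PermissionError("authentication failed")
            session_id = os.urandom(16).hex()
            key = os.urandom(32)
            self._sessions[session_id] = key
            return session_id, key

        def delegate(self, session_id, other_cache):
            """Hand the session to another component, forming a chain of trust."""
            other_cache._sessions[session_id] = self._sessions[session_id]

        def sign(self, session_id, message):
            """Integrity-protect a message (e.g. a UDP datagram) with the session key."""
            return hmac.new(self._sessions[session_id], message, hashlib.sha256).digest()

A client that already holds a valid session id can have its messages verified by any component the session was delegated to, which is where the reduction in connection-establishment overhead comes from.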
Journal of Physics: Conference Series | 2010
D C Bradley; I. Sfiligoi; S Padhi; J Frey; Todd Tannenbaum
Physicists have access to thousands of CPUs in grid federations such as OSG and EGEE. With the start-up of the LHC, it is essential for individuals or groups of users to wrap together available resources from multiple sites across multiple grids under a higher user-controlled layer in order to provide a homogeneous pool of available resources. One such system is glideinWMS, which is based on the Condor batch system. A general discussion of glideinWMS can be found elsewhere. Here, we focus on recent advances in extending its reach: scalability and integration of heterogeneous compute elements. We demonstrate that the new developments exceed the design goal of over 10,000 simultaneous running jobs under a single Condor schedd, using strong security protocols across global networks, and sustaining a steady-state job completion rate of a few Hz. We also show interoperability across heterogeneous computing elements achieved using client-side methods. We discuss this technique and the challenges in direct access to NorduGrid and CREAM compute elements, in addition to Globus based systems.
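The client-side interoperability mentioned above amounts to choosing a submission method per compute-element flavor. The sketch below is a hypothetical illustration of that dispatch (the handler functions and the compute-element description format are assumptions, not glideinWMS internals).

    # Hypothetical sketch of per-compute-element dispatch (not glideinWMS code).
    def submit_via_gram(endpoint):  return f"submitted to GRAM at {endpoint}"
    def submit_via_cream(endpoint): return f"submitted to CREAM at {endpoint}"
    def submit_via_arc(endpoint):   return f"submitted to ARC at {endpoint}"

    def submit_pilot(ce):
        """Dispatch a pilot to a compute element based on its flavor."""
        handlers = {
            "gram":  submit_via_gram,    # Globus-based compute elements
            "cream": submit_via_cream,   # CREAM compute elements
            "arc":   submit_via_arc,     # NorduGrid ARC compute elements
        }
        try:
            return handlers[ce["type"]](ce["endpoint"])
        except KeyError:
            raise ValueError(f"unsupported compute element type: {ce['type']}")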
Journal of Physics: Conference Series | 2012
I. Sfiligoi; D C Bradley; Zachary Miller; Burt Holzman; F. Würthwein; J M Dost; Kenneth Bloom; C Grandi
Multi-user pilot infrastructures provide significant advantages for the communities using them, but also create new security challenges. With Grid authorization and mapping happening with the pilot credential only, the final user identity is not properly addressed in the classic Grid paradigm. In order to solve this problem, OSG and EGI have deployed glexec, a privileged executable on the worker nodes that allows for final user authorization and mapping from inside the pilot itself. The glideinWMS instances deployed on OSG have now been using glexec on OSG sites for several years, and have started using it on EGI resources in the past year. The user experience of using glexec has been mostly positive, although there are still some edge cases where things could be improved. This paper provides both usage statistics and a description of the remaining problems and the expected solutions.
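A minimal sketch of how a pilot might hand a user payload to glexec follows. The environment variable name GLEXEC_CLIENT_CERT matches common glexec deployments, but the glexec binary path and the exact variable set vary by site, so treat the details below as assumptions rather than a definitive recipe.

    # Hedged sketch: launching a user payload through glexec from inside a pilot.
    # The glexec path and the environment variable names are site-dependent assumptions.
    import os, subprocess

    def run_under_user_identity(user_proxy, payload_cmd,
                                glexec_path="/usr/sbin/glexec"):   # hypothetical path
        env = dict(os.environ)
        # Present the user's delegated proxy so glexec can authorize and map the user.
        env["GLEXEC_CLIENT_CERT"] = user_proxy
        # glexec then runs the payload under the Unix account mapped to that user,
        # rather than under the pilot's own account.
        return subprocess.run([glexec_path] + payload_cmd, env=env, check=False)

    # Example with placeholder paths:
    # run_under_user_identity("/tmp/x509up_u1234", ["/bin/sh", "run_user_job.sh"])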
Journal of Physics: Conference Series | 2014
I. Sfiligoi; Thomas Martin; Brian Bockelman; D C Bradley; F. Würthwein
The computing landscape is moving at an accelerated pace to many-core computing. Nowadays, it is not unusual to get 32 cores on a single physical node. As a consequence, there is increased pressure in the pilot systems domain to move from purely single-core scheduling and allow multi-core jobs as well. In order to allow for a gradual transition from single-core to multi-core user jobs, it is envisioned that pilot jobs will have to handle both kinds of user jobs at the same time, by requesting several cores at a time from Grid providers and then partitioning them between the user jobs at runtime. Unfortunately, the current Grid ecosystem only allows for relatively short pilot-job lifetimes, requiring frequent draining, with the resulting waste of compute resources due to the varying lifetimes of the user jobs. Significantly extending the lifetime of pilot jobs is thus highly desirable, but must come without any adverse effects for the Grid resource providers. In this paper we present a mechanism, based on communication between the pilot jobs and the Grid provider, that allows pilot jobs to run for extended periods of time when resources are available, but also allows the Grid provider to reclaim the resources in a short amount of time when needed. We also present the experience of running a prototype system using the above mechanism on a few US-based Grid sites.
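The pilot-provider negotiation described above can be illustrated with a short sketch; the retire-flag file is a made-up signalling channel and the function names are hypothetical, not the prototype's actual protocol. The pilot keeps accepting new user jobs while the provider is silent, and drains promptly once the provider asks for the slot back.

    # Hypothetical sketch of a long-lived pilot honoring a provider's reclaim request.
    import os, time

    RETIRE_FLAG = "/var/run/pilot_retire"   # assumed channel: provider creates this file

    def provider_wants_slot_back():
        return os.path.exists(RETIRE_FLAG)

    def pilot_loop(fetch_job, run_job):
        while not provider_wants_slot_back():
            job = fetch_job()
            if job is None:
                time.sleep(60)              # idle: keep the slot warm for the next job
                continue
            run_job(job)                    # finish the job that was already started
        # Retirement requested: accept no new jobs and exit, returning the
        # resource to the provider within a bounded amount of time.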
Journal of Physics: Conference Series | 2011
W Andrews; Brian Bockelman; D C Bradley; J M Dost; D Evans; I Fisk; J Frey; Burt Holzman; Miron Livny; T Martin; A McCrea; A Melo; S Metson; H Pi; I. Sfiligoi; P Sheldon; Todd Tannenbaum; A Tiradani; F. Würthwein; D Weitzel
Cloud computing is steadily gaining traction both in the commercial and research worlds, and there seems to be significant potential for the HEP community as well. However, most of the tools used in the HEP community are tailored to the current computing model, which is based on grid computing. One such tool is glideinWMS, a pilot-based workload management system. In this paper we present both the code changes that were needed to make it work in the cloud world and the architectural problems we encountered, along with how we solved them. Benchmarks comparing grid, Magellan, and Amazon EC2 resources are also included.
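As an illustration of the cloud-side analogue of pilot submission, the sketch below starts a virtual machine whose boot-time user data bootstraps a pilot that then joins the pool. It uses the present-day boto3 EC2 API rather than anything from the paper, and the AMI id, instance type, and bootstrap URL are placeholders.

    # Illustrative sketch (not the paper's code): starting a pilot VM on Amazon EC2.
    # The AMI id, instance type, and bootstrap script URL are placeholders.
    import boto3

    USER_DATA = """#!/bin/bash
    # Download and start the pilot, which then joins the Condor pool.
    curl -s https://example.org/pilot_bootstrap.sh | bash
    """

    def launch_pilot_vm():
        ec2 = boto3.client("ec2")
        response = ec2.run_instances(
            ImageId="ami-00000000",        # placeholder image with pilot prerequisites
            InstanceType="t2.micro",       # placeholder instance type
            MinCount=1,
            MaxCount=1,
            UserData=USER_DATA,            # runs at boot and starts the pilot
        )
        return response["Instances"][0]["InstanceId"]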
Journal of Physics: Conference Series | 2010
D C Bradley; Sridhara Dasu; Miron Livny; A Mohapatra; Todd Tannenbaum; G Thain
A number of recent enhancements to the Condor batch system have been stimulated by the challenges of LHC computing. The result is a more robust, scalable, and flexible computing platform. One product of this effort is the Condor Job Router, which serves as a high-throughput scheduler for feeding multiple (e.g. grid) queues from a single input job queue. We describe its principles and how it has been used at large scale in CMS production on the Open Science Grid. Improved scalability of Condor is another welcome advance. We describe the scaling characteristics of the Condor batch system under large workloads and when integrating large pools of resources; we then detail how LHC physicists have directly profited under the expanded scaling regime.
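A minimal sketch of the job-routing idea follows; the data structures and the headroom heuristic are invented for illustration and do not reproduce the Condor Job Router's route language. Jobs are drained from a single input queue and dispatched to whichever destination queue currently has the fewest jobs waiting, up to a per-route cap.

    # Hypothetical sketch of routing one input queue into several destination queues.
    from collections import deque

    def route_jobs(input_queue, destinations, max_idle_per_route=10):
        """Move jobs from a single input queue into destination (e.g. grid) queues.

        destinations maps a route name to a list acting as that route's queue.
        """
        while input_queue:
            # Pick the route with the fewest queued jobs, if any is below its cap.
            name, queue = min(destinations.items(), key=lambda kv: len(kv[1]))
            if len(queue) >= max_idle_per_route:
                break                      # every route is full; try again later
            queue.append(input_queue.popleft())

    # Example usage:
    # jobs = deque(f"job{i}" for i in range(25))
    # routes = {"site_A": [], "site_B": [], "site_C": []}
    # route_jobs(jobs, routes)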
international conference on parallel processing | 2009
José Nelson Falavinha Junior; Aleardo Manacero Júnior; Miron Livny; D C Bradley
In large distributed systems, where shared resources are owned by distinct entities, there is a need to reflect resource ownership in resource allocation. An appropriate resource management system should guarantee that resource owners have access to a share of resources proportional to the share they provide. To achieve this, policies can be used to revoke access to resources currently used by other users. In this paper, a scheduling policy based on the concept of distributed ownership, called the Owner Share Enforcement Policy (OSEP), is introduced. OSEP's goal is to guarantee that owners do not have their jobs postponed for long periods of time. We evaluate the results achieved with the application of this policy using metrics that describe policy violation, loss of capacity, policy cost, and user satisfaction, in environments with and without job checkpointing. We also evaluate and compare the OSEP policy with the Fair-Share policy, and from these results it is possible to capture the trade-offs between different ways of achieving fairness based on user satisfaction.
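The core of the policy can be sketched as follows; the data model and function below are hypothetical and are not the authors' simulator. When an owner's job is waiting and one of that owner's machines is running a guest job, the guest job is preempted, optionally after a checkpoint so it can resume elsewhere, and the owner's job takes the machine.

    # Hypothetical sketch of an Owner Share Enforcement decision (not the authors' code).

    def enforce_owner_share(machines, waiting_owner_jobs, checkpointing=True):
        """Preempt guest jobs so that owners regain access to their own machines.

        machines: list of dicts like {"owner": "alice", "running": {"user": "bob"}}
        waiting_owner_jobs: list of dicts like {"owner": "alice", "id": 42}
        """
        for job in waiting_owner_jobs:
            for machine in machines:
                running = machine["running"]
                if machine["owner"] == job["owner"] and running and \
                        running["user"] != machine["owner"]:
                    if checkpointing:
                        running["checkpointed"] = True   # guest job can resume elsewhere
                    machine["running"] = {"user": job["owner"], "job": job}
                    break   # this waiting owner job has been placed

With checkpointing the preempted guest job loses no work; without it, the lost work shows up in the loss-of-capacity and user-satisfaction metrics that the paper compares against Fair-Share.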