Todd Tannenbaum
University of Wisconsin-Madison
Publication
Featured research published by Todd Tannenbaum.
High Performance Distributed Computing | 2001
J. Frey; Todd Tannenbaum; Miron Livny; Ian T. Foster; Steven Tuecke
In recent years, there has been a dramatic increase in the number of available computing and storage resources. Yet few tools exist that allow these resources to be exploited effectively in an aggregated form. We present the Condor-G system, which leverages software from Globus and Condor to enable users to harness multi-domain resources as if they all belong to one personal domain. We describe the structure of Condor-G and how it handles job management, resource selection, security, and fault tolerance. We also present results from application experiments with the Condor-G system. We assert that Condor-G can serve as a general-purpose interface to Grid resources, for use by both end users and higher-level program development tools.
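Condor-G jobs are described and queued through the standard Condor submit interface; the grid universe forwards the job to a remote Globus GRAM gatekeeper. A minimal sketch in later HTCondor submit syntax, with the gatekeeper host and file names as hypothetical placeholders:

    # Minimal Condor-G submit description (a sketch, not from the paper).
    # The gatekeeper host and jobmanager below are hypothetical.
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
    executable    = analyze
    arguments     = run01.dat
    output        = run01.out
    error         = run01.err
    log           = run01.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

Condor-G then handles credential delegation, submission to the gatekeeper, and fault recovery on the user's behalf.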
Conference on High Performance Computing (Supercomputing) | 2007
Alexandru Iosup; Dick H. J. Epema; Todd Tannenbaum; Matthew Farrellee; Miron Livny
The grid vision of a single computing utility has yet to materialize: while many grids with thousands of processors each exist, most work in isolation. An important obstacle for the effective and efficient inter-operation of grids is the problem of resource selection. In this paper we propose a solution to this problem that combines the hierarchical and decentralized approaches for interconnecting grids. In our solution, a hierarchy of grid sites is augmented with peer-to-peer connections between sites under the same administrative control. To operate this architecture, we employ the key concept of delegated matchmaking, which temporarily binds resources from remote sites to the local environment. With trace-based simulations we evaluate our solution under various infrastructural and load conditions, and we show that it outperforms other approaches to inter-operating grids. Specifically, we show that delegated matchmaking achieves up to 60% more goodput and completes 26% more jobs than its best alternative.
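Delegated matchmaking extends Condor's standard two-sided ClassAd matchmaking, in which a job advertisement and a machine advertisement each carry Requirements and Rank expressions evaluated against the other; when no local match exists, the request is delegated up or across the site hierarchy and a matched remote resource is temporarily bound into the local pool. A minimal sketch of the underlying match (standard ClassAd attributes, illustrative values):

    # Job ClassAd (values illustrative)
    MyType        = "Job"
    RequestMemory = 2048
    Requirements  = (OpSys == "LINUX") && (Memory >= RequestMemory)
    Rank          = KFlops            # prefer faster machines

    # Machine ClassAd (values illustrative)
    MyType       = "Machine"
    OpSys        = "LINUX"
    Memory       = 16384
    KFlops       = 950000
    Requirements = True

A match requires each side's Requirements to evaluate to true in the context of the other ad; delegation simply widens the set of machine ads against which a request can be matched.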
Concurrency and Computation: Practice and Experience | 2006
Douglas Thain; Todd Tannenbaum; Miron Livny
How can we measure the impact of an open-source software package over time? When a system has no price, no purchase contracts, and no buyers or sellers, it can be difficult to judge its impact on the world. To explore this issue, we have instrumented the Condor distributed batch system in a variety of ways and observed its growth to over 50 000 CPUs at over 1000 sites over five years. Instrumentation methods include automatic updates by e-mail and user datagram protocol (UDP), annotated download records, and a voluntary user survey. Each of these metrics has various strengths and weaknesses that we are able to compare and contrast. We also explore the ethical and legal issues surrounding automatic data collection. Surprisingly, we discover that objections to automatic data collection are higher among people who are not using the Condor software. We conclude with some practical advice for further research into the measurement of software systems.
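The automatic updates mentioned above correspond to opt-out settings in an installation's condor_config; the macro names below are drawn from Condor documentation of that era and should be checked against the manual for any given release:

    # Participation in Condor usage reporting (illustrative values).
    # Weekly e-mail update to the developers; set to NONE to opt out.
    CONDOR_DEVELOPERS = condor-admin@cs.wisc.edu
    # UDP ClassAd updates to the UW-Madison collector; NONE to opt out.
    CONDOR_DEVELOPERS_COLLECTOR = condor.cs.wisc.edu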
Scientific Programming | 2008
Alexandru Iosup; Todd Tannenbaum; Matthew Farrellee; Dick H. J. Epema; Miron Livny
(Journal version of the SC '07 paper above; the abstract is identical to the conference version.)
Journal of Physics: Conference Series | 2011
D. C. Bradley; T. St. Clair; Matthew Farrellee; Z. Guo; Miron Livny; I. Sfiligoi; Todd Tannenbaum
Condor is being used extensively in the HEP environment. It is the batch system of choice for many compute farms, including several WLCG Tier 1s, Tier 2s, and Tier 3s. It is also the building block of one of the Grid pilot infrastructures, namely glideinWMS. As with any software, Condor does not scale indefinitely with the number of users and/or the number of resources being handled. In this paper we present the current observed scalability limits of both the latest production and the latest development release of Condor, and compare them with the limits reported in previous publications. We also describe the changes that were introduced to remove the previous scalability limits.
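The ceilings being measured are bounded in part by per-daemon tuning in condor_config. The macros below are real Condor configuration options of the kind exercised in such scale tests; the values are purely illustrative, not the settings used in the paper:

    MAX_JOBS_RUNNING    = 10000    # cap on concurrently running jobs per schedd
    MAX_JOBS_SUBMITTED  = 200000   # cap on queued jobs per schedd
    NEGOTIATOR_INTERVAL = 60       # seconds between negotiation cycles
    SCHEDD_INTERVAL     = 300      # schedd housekeeping/update period, seconds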
Journal of Physics: Conference Series | 2015
E. M. Fajardo; J. M. Dost; Burt Holzman; Todd Tannenbaum; J. Letts; A. Tiradani; Brian Bockelman; J. Frey; D. Mason
The HTCondor high throughput computing system is heavily used in the high energy physics (HEP) community as the batch system for several Worldwide LHC Computing Grid (WLCG) resources. Moreover, it is the backbone of glideinWMS, the pilot system used by the computing organization of the Compact Muon Solenoid (CMS) experiment. To prepare for LHC Run 2, we probed the scalability limits of new versions and configurations of HTCondor with a goal of reaching 200,000 simultaneous running jobs in a single internationally distributed dynamic pool. In this paper, we first describe how we created an opportunistic distributed testbed capable of exercising runs with 200,000 simultaneous jobs without impacting production. This testbed methodology is appropriate not only for scale testing HTCondor, but potentially for many other services. In addition to the test conditions and the testbed topology, we include the suggested configuration options used to obtain the scaling results, and describe some of the changes to HTCondor inspired by our testing that enabled sustained operations at scales well beyond previous limits.
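Two stock HTCondor mechanisms matter at this scale for an internationally distributed pool: the shared port daemon, which multiplexes all daemon traffic over a single inbound TCP port, and CCB, which lets workers behind NAT or firewalls receive connections via a broker. A hedged sketch of a worker-side configuration fragment (hostname hypothetical; macro names from the HTCondor manual, values not from the paper):

    CONDOR_HOST      = pool-collector.example.org   # hypothetical central manager
    USE_SHARED_PORT  = TRUE             # one inbound port for all daemons
    SHARED_PORT_ARGS = -p 9618
    CCB_ADDRESS      = $(CONDOR_HOST)   # reverse connections via the broker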
arXiv: Distributed, Parallel, and Cluster Computing | 2010
Zach Miller; D. C. Bradley; Todd Tannenbaum; I. Sfiligoi
Many secure communication libraries used by distributed systems, such as SSL, TLS, and Kerberos, fail to make a clear distinction between the authentication, session, and communication layers. In this paper we introduce CEDAR, the secure communication library used by the Condor High Throughput Computing software, and present the advantages to a distributed computing system resulting from CEDAR's separation of these layers. Regardless of the authentication method used, CEDAR establishes a secure session key, which has the flexibility to be used for multiple capabilities. We demonstrate how a layered approach to security sessions can avoid round-trips and latency inherent in network authentication. The creation of a distinct session management layer allows for optimizations to improve scalability by way of delegating sessions to other components in the system. This session delegation creates a chain of trust that reduces the overhead of establishing secure connections and enables centralized enforcement of system-wide security policies. Additionally, secure channels based upon UDP datagrams are often overlooked by existing libraries; we show how CEDAR's structure accommodates this as well. As an example of the utility of this work, we show how the use of delegated security sessions and other techniques inherent in CEDAR's architecture enables US CMS to meet its scalability requirements in deploying Condor over large-scale, wide-area grid systems.
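From an administrator's perspective, CEDAR's session layer is driven by HTCondor's SEC_* configuration macros, which set per-connection policy for authentication, integrity, and encryption and control how long the negotiated session key is cached. An illustrative fragment (macro names from the HTCondor manual; the particular choices are arbitrary):

    SEC_DEFAULT_AUTHENTICATION         = REQUIRED
    SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI, PASSWORD
    SEC_DEFAULT_INTEGRITY              = REQUIRED
    SEC_DEFAULT_ENCRYPTION             = PREFERRED
    SEC_DEFAULT_SESSION_DURATION       = 3600   # seconds to cache the session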
Journal of Physics: Conference Series | 2010
D. C. Bradley; I. Sfiligoi; S. Padhi; J. Frey; Todd Tannenbaum
Physicists have access to thousands of CPUs in grid federations such as OSG and EGEE. With the start-up of the LHC, it is essential for individuals or groups of users to wrap together resources from multiple sites across multiple grids under a higher user-controlled layer in order to provide a homogeneous pool of available resources. One such system is glideinWMS, which is based on the Condor batch system. A general discussion of glideinWMS can be found elsewhere. Here, we focus on recent advances in extending its reach: scalability and integration of heterogeneous compute elements. We demonstrate that the new developments exceed the design goal of over 10,000 simultaneous running jobs under a single Condor schedd, using strong security protocols across global networks, and sustaining a steady-state job completion rate of a few Hz. We also show interoperability across heterogeneous computing elements achieved using client-side methods. We discuss this technique and the challenges in direct access to NorduGrid and CREAM compute elements, in addition to Globus based systems.
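The user-controlled layer works by submitting pilot jobs that start a condor_startd on the remote resource, which then reports to the user's own collector so the grid slot appears as an ordinary pool machine. A hedged sketch of the configuration such a pilot might carry (hostname hypothetical; macro names from the HTCondor manual):

    DAEMON_LIST = MASTER, STARTD
    CONDOR_HOST = user-pool.example.org   # hypothetical user-side collector
    CCB_ADDRESS = $(CONDOR_HOST)          # traverse site NATs and firewalls
    START       = TRUE
    STARTD_NOCLAIM_SHUTDOWN = 1200        # exit if idle, freeing the grid slot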
Future Generation Computer Systems | 2017
Zhe Zhang; Brian Bockelman; Dale W. Carder; Todd Tannenbaum
High throughput computing (HTC) systems are widely adopted in scientific discovery and engineering research. They are responsible for scheduling submitted batch jobs to utilize the cluster resources. Current systems mostly focus on managing computing resources like CPU and memory; however, they lack flexible and fine-grained management mechanisms for network resources. This has increasingly become an urgent need, as current batch systems may be distributed among dozens of sites around the globe, like the Open Science Grid. The Lark project was motivated by this need to re-examine how the HTC layer interacts with the network layer. In this paper, we present the system architecture of Lark and its implementation as a plugin of HTCondor, a popular HTC software project. Lark achieves lightweight network virtualization at per-job granularity for HTCondor by utilizing Linux containers and virtual Ethernet devices; this provides each batch job with a unique network address in a private network namespace. We extended HTCondor's description language, ClassAds, so users can specify networking requirements in the job submission script. HTCondor can perform matchmaking to make sure user-specified network requirements and resource-specific policies are fulfilled. We also extended the job agent, condor_starter, so that it can manage and configure the job's network environment. With this building block as the core, we implement bandwidth management functionality at both the host and network levels utilizing software-defined networking (SDN). In addition to HTCondor job traffic, wide-area network bandwidth management for GridFTP transfers is designed and implemented. Our experiments and evaluations show that Lark can effectively manage network resources for both kinds of traffic simultaneously in the cluster environment. By not resorting to heavyweight VMs, we keep startup overheads minimal compared to “regular” batch jobs. This mechanism provides users with better predictability of their job execution and gives administrators more policy flexibility in the allocation of network resources.
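Because Lark's extensions ride on ClassAds, a job's network needs can be stated next to its CPU and memory requests in the submit description. The paper does not list its exact attribute names here, so the network-related attributes below are hypothetical placeholders for whatever Lark actually defines:

    universe       = vanilla
    executable     = transfer_and_analyze
    request_cpus   = 2
    request_memory = 4096
    +LarkNetworkType      = "NAT"   # hypothetical: per-job namespace mode
    +LarkRequestBandwidth = 100     # hypothetical: Mbps to reserve
    requirements   = (HasLark =?= True)   # hypothetical machine attribute
    queue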
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015
Zhe Zhang; Brian Bockelman; Dale W. Carder; Todd Tannenbaum
High throughput computing (HTC) systems are widely adopted in scientific discovery and engineering research. They are responsible for scheduling submitted batch jobs to utilize the cluster resources. Current systems mostly focus on managing computing resources like CPU and memory; however, they lack flexible and fine-grained management mechanisms for network resources. This has increasingly become an urgent need, as current batch systems may be distributed among dozens of sites around the globe, like the Open Science Grid. The Lark project was motivated by this need to re-examine how the HTC layer interacts with the network layer. In this paper, we present the system architecture of Lark and its implementation as a plugin of HTCondor, a popular HTC software project. Lark achieves lightweight network virtualization at per-job granularity for HTCondor by utilizing Linux containers and virtual Ethernet devices; this provides each batch job with a unique network address in a private network namespace. We extended HTCondor's description language, ClassAds, so users can specify networking requirements in the job submission script. HTCondor can perform matchmaking to make sure user-specified network requirements and resource-specific policies are fulfilled. We also extended the job agent, condor_starter, so that it can manage and configure the job's network environment. With this building block as the core, we implement bandwidth management functionality at both the host and network levels utilizing software-defined networking (SDN). Our experiments and evaluations show that Lark can effectively manage network resources within the cluster with low overhead. It provides users with better predictability of their job execution and gives administrators more flexibility in network resource consumption policies.