Taghrid Samak
Lawrence Berkeley National Laboratory
Publications
Featured research published by Taghrid Samak.
International Conference on e-Science | 2014
Gilberto Pastorello; Deborah A. Agarwal; Dario Papale; Taghrid Samak; Carlo Trotta; Alessio Ribeca; Cristina Poindexter; Boris Faybishenko; Dan Gunter; Rachel Hollowgrass; Eleonora Canfora
Observational data are fundamental for scientific research in almost any domain. Recent advances in sensor and data management technologies are enabling unprecedented amounts of observational data to be collected and analyzed. However, an essential part of using observational data is not currently as scalable as data collection and analysis methods: data quality assurance and control. While specialized tools for very narrow domains do exist, general methods are harder to create. This paper explores the identification of data issues that lead to the creation of data tests and tools to perform data quality control activities. Developing this identification step in a systematic manner allows for better and more general quality control tools. As our case study, we use carbon, water, and energy fluxes as well as micro-meteorological data collected at field sites that are part of FLUXNET, a network of over 400 ecosystem-level monitoring stations. As part of an effort toward the release of a new global flux data set, we are performing quality control on these data. The experience from this work led to the creation of a catalog of issues identified in the data. This paper presents this catalog and its generalization into a set of patterns of data quality issues that can be detected in observational data.
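A catalog of issue patterns like this lends itself to simple programmatic checks. Below is a minimal sketch of two such patterns, out-of-range values and spikes, applied to a toy flux series; the variable names, thresholds, and flagging rules are illustrative assumptions, not FLUXNET's actual QC tooling.

```python
# Hypothetical sketch of two observational-data quality patterns:
# out-of-range values and abrupt spikes. Thresholds are illustrative.
import numpy as np

def out_of_range(values, lo, hi):
    """Flag samples outside a physically plausible range."""
    return (values < lo) | (values > hi)

def spikes(values, max_step=10.0):
    """Flag samples that jump more than max_step from the previous sample."""
    return np.abs(np.diff(values, prepend=values[0])) > max_step

co2_flux = np.array([2.1, 2.3, 2.2, 45.0, 2.4, 2.2, -60.0, 2.3])
flags = out_of_range(co2_flux, -50.0, 50.0) | spikes(co2_flux)
# flags marks the 45.0 and -60.0 excursions, plus the sample right after
# each one (the recovery jump is also large).
print(flags)
```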
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Karan Vahi; Ian Harvey; Taghrid Samak; Daniel K. Gunter; Kieran Evans; David Rogers; Ian J. Taylor; Monte Goode; Fabio Silva; Eddie Al-Shakarchi; Gaurang Mehta; Andrew Clifford Jones; Ewa Deelman
Scientific workflow systems support different workflow representations, operational modes, and configurations. However, independent of the system used, end users need to track the status of their workflows in real time, be notified of execution anomalies and failures automatically, perform troubleshooting, and automate the analysis of the workflow to help categorize and qualify the results. In this paper, we describe how the Stampede monitoring infrastructure, which was previously integrated into the Pegasus Workflow Management System, was employed in Triana in order to add generic real-time monitoring and troubleshooting capabilities across both systems. Stampede is an infrastructure that addresses interoperable monitoring needs by providing a three-layer model: a common data model to describe workflow and job executions; high-performance tools to load workflow logs conforming to the data model into a data store; and a query interface for extracting information from the data store in a standard fashion. The resulting integration demonstrates the generic nature of the Stampede monitoring infrastructure, which has the potential to provide a common platform for monitoring across scientific workflow engines.
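To make the three-layer model concrete, here is a minimal sketch of the first two layers: a common record type for job-execution events and a loader that pushes engine-specific logs into a single store. The schema and field names are hypothetical stand-ins, not Stampede's actual data model.

```python
# Illustrative sketch of a common data model plus log loading; the schema
# below is an assumption, not Stampede's real (NetLogger-derived) model.
import sqlite3
from dataclasses import dataclass

@dataclass
class JobEvent:
    workflow_id: str
    job_id: str
    event: str       # e.g. "submit", "start", "success", "failure"
    timestamp: float

def load_events(conn, events):
    """Normalize engine-specific events into one shared table."""
    conn.executemany(
        "INSERT INTO job_event VALUES (?, ?, ?, ?)",
        [(e.workflow_id, e.job_id, e.event, e.timestamp) for e in events],
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_event (workflow_id, job_id, event, timestamp)")
load_events(conn, [
    JobEvent("wf-1", "job-A", "start", 100.0),
    JobEvent("wf-1", "job-A", "failure", 160.0),
])
```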
Workflows in Support of Large-Scale Science | 2014
Jack Deslippe; Abdelilah Essiari; Simon J. Patton; Taghrid Samak; Craig E. Tull; Alexander Hexemer; Dinesh Kumar; Dilworth Y. Parkinson; Polite Stewart
The Advanced Light Source (ALS) is an X-ray synchrotron facility at Lawrence Berkeley National Laboratory. The ALS generates terabytes of raw and derived data each day and serves thousands of researchers each year. Only a subset of the data is analyzed, due to processing barriers that small science teams are ill-equipped to surmount. In this paper, we discuss the development and application of a computational framework, termed SPOT, fed with synchrotron data and powered by storage, networking, and compute resources at NERSC and ESnet. We describe issues and recommendations for an end-to-end analysis workflow for ALS data. After one year of operation, the collection contains over 90,000 datasets (550 TB) from 85 users across three beamlines. For 16 months, data taken at the beamlines has been promptly and automatically analyzed and annotated with metadata, allowing users to focus on analysis, conclusions, and experiments.
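As a rough illustration of the prompt-analysis-and-annotation pattern described above (not SPOT's actual code), the sketch below scans a hypothetical drop directory for new beamline files, runs a placeholder analysis, and writes searchable metadata alongside each dataset.

```python
# One pass of what would be a polling loop: detect new datasets, process
# them, and tag them with metadata. Directory and file names are hypothetical.
import json, time
from pathlib import Path

RAW = Path("raw")   # hypothetical drop directory for beamline files
DONE = set()        # datasets already processed

def analyze(path: Path) -> dict:
    """Placeholder for the real reconstruction/analysis step."""
    return {"source": path.name, "size_bytes": path.stat().st_size,
            "processed_at": time.time()}

for f in sorted(RAW.glob("*.h5")):
    if f.name in DONE:
        continue
    meta = analyze(f)
    f.with_suffix(".meta.json").write_text(json.dumps(meta))
    DONE.add(f.name)
```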
Grid Computing | 2013
Karan Vahi; Ian Harvey; Taghrid Samak; Daniel K. Gunter; Kieran Evans; David Mckendrick Rogers; Ian J. Taylor; Monte Goode; Fabio Silva; Eddie Al-Shakarchi; Gaurang Mehta; Ewa Deelman; Andrew C. Jones
Scientific workflow systems support various workflow representations, operational modes, and configurations. Regardless of the system used, end users have common needs: to track the status of their workflows in real time, be notified of execution anomalies and failures automatically, perform troubleshooting, and automate the analysis of the workflow results. In this paper, we describe how the Stampede monitoring infrastructure was integrated with the Pegasus Workflow Management System and the Triana workflow system in order to add generic real-time monitoring and troubleshooting capabilities across both systems. Stampede is an infrastructure that provides interoperable monitoring using a three-layer model: (1) a common data model to describe workflow and job executions; (2) high-performance tools to load workflow logs conforming to the data model into a data store; and (3) a common query interface. This paper describes the integration of the Stampede monitoring architecture with Pegasus and Triana and shows the new analysis capabilities that Stampede provides to these workflow systems. The successful integration of Stampede with these workflow engines demonstrates the generic nature of the Stampede monitoring infrastructure and its potential to provide a common platform for monitoring across scientific workflow engines.
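The payoff of the third layer, the common query interface, is that one query serves both engines once their logs share a schema. The sketch below illustrates that idea over a hypothetical job_event table; the schema and field names are assumptions, not Stampede's actual interface.

```python
# Sketch of the query-layer idea: a single troubleshooting query answers
# questions about workflows from either engine. Schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_event (workflow_id, job_id, event, timestamp)")
conn.executemany("INSERT INTO job_event VALUES (?, ?, ?, ?)", [
    ("pegasus-wf", "j1", "failure", 10.0),
    ("pegasus-wf", "j2", "success", 12.0),
    ("triana-wf",  "j1", "failure", 11.0),
])

# Failure counts per workflow, regardless of which engine produced the logs.
for row in conn.execute(
        "SELECT workflow_id, COUNT(*) FROM job_event "
        "WHERE event = 'failure' GROUP BY workflow_id"):
    print(row)  # ('pegasus-wf', 1) then ('triana-wf', 1)
```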
Network Operations and Management Symposium | 2012
Taghrid Samak; Daniel K. Gunter; Valerie Hendrix
The deployment of ubiquitous distributed monitoring infrastructure such as perfSONAR is greatly increasing the availability and quality of network performance data. Cross-cutting analyses are now possible that can detect anomalies and provide real-time automated alerts to network management services. However, scaling these analyses to the volumes of available data remains a difficult task. Although there is significant research into offline analysis techniques, most of these approaches do not address the systems and scalability issues. This work presents an analysis framework incorporating industry best practices and tools to perform large-scale analyses. Our framework integrates the expressiveness of Pig, the scalability of Hadoop, and the analysis and visualization capabilities of R to achieve a significant increase in both speed and power of analysis. Evaluation of our framework on a large dataset of real measurements from perfSONAR demonstrates a large speedup and novel statistical capabilities.
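The framework itself expresses such analyses in Pig over Hadoop and hands the statistics to R; the Python sketch below only mirrors the shape of one such job on toy data, grouping perfSONAR-style throughput measurements by path and flagging measurements far below the path's typical rate. The paths, values, and threshold are illustrative.

```python
# Toy map/reduce-shaped analysis over network measurements; the real
# pipeline runs as Pig on Hadoop with R for statistics.
from collections import defaultdict
from statistics import median

measurements = [  # (src, dst, throughput_gbps) -- illustrative values
    ("lbl", "anl", 9.4), ("lbl", "anl", 9.1), ("lbl", "anl", 2.3),
    ("lbl", "ornl", 8.8), ("lbl", "ornl", 8.9),
]

by_path = defaultdict(list)
for src, dst, tput in measurements:   # "map": key each record by path
    by_path[(src, dst)].append(tput)

for path, tputs in by_path.items():   # "reduce": per-path statistics
    m = median(tputs)
    anomalies = [t for t in tputs if t < 0.5 * m]
    print(path, "median:", m, "anomalies:", anomalies)
```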
IEEE Transactions on Information Forensics and Security | 2014
Muhammad Qasim Ali; Ehab Al-Shaer; Taghrid Samak
In the past decade, scanning has been widely used as a reconnaissance technique to gather critical network information and launch follow-up attacks. To combat it, numerous intrusion detectors have been proposed. However, scanning methodologies have shifted to a next-generation paradigm designed to be evasive. Next-generation reconnaissance techniques are intelligent and stealthy: they use low-volume packet sequences and intelligent victim selection to evade detection. Previously, we proposed models for firewall policy reconnaissance that set bounds on learning accuracy as well as minimum requirements on the number of probes. We presented techniques for reconstructing the firewall policy by intelligently choosing the probing packets based on the responses to previous probes. In this paper, we present a statistical analysis of these techniques and discuss their evasiveness, along with improvements. First, we review the two previously proposed techniques, followed by their statistical analysis and their evasiveness against current detectors. Based on the statistical analysis, we show that these techniques still exhibit a pattern and thus can be detected. We then develop a hybrid approach to maximize the benefit by combining the two heuristics.
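A minimal sketch of what makes such reconnaissance low-volume (not the paper's actual heuristics): by choosing each probe from the responses to previous ones, a scanner can locate the boundary of an allowed port range with logarithmically many probes. The probe function and policy boundary below are hypothetical.

```python
# Response-adaptive probing sketch: binary search over destination ports.
def probe(port, policy_boundary=1024):
    """Stand-in for sending a packet: ports below the boundary are allowed."""
    return port < policy_boundary

lo, hi = 0, 65536
while hi - lo > 1:            # each response halves the search space
    mid = (lo + hi) // 2
    if probe(mid):
        lo = mid
    else:
        hi = mid
print("inferred boundary:", hi)  # 1024, found in ~16 probes instead of 65k
```

Even this tiny example shows why such scans evade volume-based detectors, and also why they leave a statistical footprint: the probe sequence converges in a characteristic pattern, which is exactly the kind of regularity the paper's analysis exploits.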
Network Operations and Management Symposium | 2012
Ezra Kissel; Ahmed El-Hassany; Guilherme Fernandes; D. Martin Swany; Dan Gunter; Taghrid Samak; Jennifer M. Schopf
Monitoring and managing multi-gigabit networks requires dynamic adaptation to end-to-end performance characteristics. This paper presents a measurement collection and analysis framework that automates the troubleshooting of end-to-end network bottlenecks. We integrate real-time host, application, and network measurements with a common representation (compatible with perfSONAR) within a flexible and scalable architecture. Our measurement architecture is supported by the lightweight eXtensible Session Protocol (XSP), which enables context-sensitive, adaptive measurement collection. We evaluate the ability of our system to analyze and detect bottleneck conditions over a series of high-speed and I/O-intensive bulk data transfer experiments. We find that the overhead of the system is very low and that we are able to detect and understand a variety of bottlenecks.
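As a toy illustration of correlating host, application, and network measurements to name a bottleneck (not the system's actual analysis), the sketch below applies simple threshold rules to one combined sample; all metric names and thresholds are assumptions.

```python
# Hypothetical bottleneck classification from one combined measurement sample.
def classify(sample):
    if sample["disk_util"] > 0.9:
        return "I/O-bound (storage)"
    if sample["cpu_util"] > 0.9:
        return "CPU-bound (host)"
    if sample["retransmits"] > 100:
        return "network loss (path)"
    if sample["app_goodput_gbps"] < 0.5 * sample["link_gbps"]:
        return "application-level (e.g. buffer tuning)"
    return "no bottleneck detected"

sample = {"disk_util": 0.95, "cpu_util": 0.4,
          "retransmits": 3, "app_goodput_gbps": 2.0, "link_gbps": 10.0}
print(classify(sample))  # I/O-bound (storage)
```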
International Conference on e-Science | 2012
Taghrid Samak; Dan Gunter; Zhong Wang
Gene synthesis is a key step in converting digitally predicted proteins to functional proteins. However, it is a relatively expensive and labor-intensive process. About 30-50% of synthesized proteins are not soluble, which further reduces the efficacy of gene synthesis as a method for protein function characterization. Solubility prediction from primary protein sequences holds the promise of dramatically reducing the cost of gene synthesis. This work presents a framework that creates models of solubility from sequence information. From the primary protein sequences of the genes to be synthesized, sequence features can be used to build computational models for solubility. This way, biologists can focus their efforts on synthesizing genes that are highly likely to generate soluble proteins. We have developed a framework that employs several machine learning algorithms to model protein solubility. The framework is used to predict protein solubility in the Escherichia coli expression system. The analysis is performed on over 1,600 quantified proteins. The approach successfully predicted solubility with more than 80% accuracy and enabled in-depth analysis of the most important features affecting solubility. The analysis pipeline is general and can be applied to any set of sequence features to predict any binary measure. The framework also provides the biologist with a comprehensive comparison of different learning algorithms and insightful feature analysis.
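A minimal sketch of this approach in scikit-learn, assuming two simple sequence-derived features: the paper's framework compares several learning algorithms and much richer feature sets, so the features, toy sequences, and labels below are purely illustrative.

```python
# Sequence features -> binary solubility label, with cross-validation and
# feature-importance inspection. Data and labels here are toy examples.
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def features(seq):
    c, n = Counter(seq), len(seq)
    return [n,
            sum(c[a] for a in "DEKR") / n,      # charged-residue fraction
            sum(c[a] for a in "AVILMFWY") / n]  # hydrophobic fraction

seqs = ["MKTAYIAKQR", "MLLWWFFAVI", "MDEKRDEKRA", "MFWYAVILMF"]
y = [1, 0, 1, 0]  # 1 = soluble, 0 = insoluble (illustrative labels)
X = [features(s) for s in seqs]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=2).mean())  # estimated accuracy
clf.fit(X, y)
print(clf.feature_importances_)  # which features drive predictions
```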
IEEE Transactions on Network and Service Management | 2012
Taghrid Samak; Ehab Al-Shaer
Policy-based network management is a necessity for managing large-scale environments. It provides the means to separate high-level system requirements from the actual implementation. As the network size increases, the need for automated management tools becomes more apparent, yet configuring routers and network devices to achieve QoS goals remains a challenging task. Using Differentiated Services to dynamically perform this configuration involves defining policies on different network nodes in multiple domains. Policy aggregation across domains requires a unified policy model that can overcome the challenge of conflict detection and resolution. In this work, we propose a unified model to represent and encode QoS policies. This model enables efficient and flexible conflict analysis. The representation uses a bottom-up approach, from the base policy parameters to the aggregation of policies across domains with respect to traffic classes. We also present a classification of these conflicts and a conflict measure to assess the severity of any misconfiguration. The model and the conflict measure are evaluated on large networks and different topologies.
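As a toy illustration of cross-domain conflict detection (not the paper's actual encoding, which is far more general), the sketch below models traffic classes as port ranges and reports a conflict whenever two domains assign different service classes to overlapping traffic.

```python
# Hypothetical cross-domain QoS policy conflict check over port ranges.
from itertools import combinations

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

policies = {  # domain -> list of (port_range, service_class)
    "domain-A": [((0, 1023), "EF"), ((1024, 65535), "BE")],
    "domain-B": [((80, 443), "AF41")],
}

rules = [(dom, rng, cls) for dom, rs in policies.items() for rng, cls in rs]
conflicts = [
    (a, b) for a, b in combinations(rules, 2)
    if a[0] != b[0] and overlaps(a[1], b[1]) and a[2] != b[2]
]
print(conflicts)  # domain-A's EF range (0-1023) clashes with domain-B's AF41
```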
International Workshop on Quality of Service | 2011
Taghrid Samak; Adel El-Atawy; Ehab Al-Shaer
Configuring routers and network devices to achieve quality of service (QoS) goals is a challenging task. In a DiffServ environment, traffic flows are assigned specific classes of service, and service level agreements (SLA) are enforced at routers within the domain. We present a model for QoS policy configurations that facilitates efficient property-based verification. Network configuration is given as a set of policies governing each device. The model efficiently checks the SLA against the current configuration using computation tree logic model checking. By following possible decision paths for a specific flow from source to destination, properties can be checked at each hop, and assessments can be made on how well configurations adhere to the specified agreement. The model also covers configuration debugging given a specific QoS violation.
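A toy hop-by-hop check in the spirit of following a flow's decision paths (not the paper's actual CTL model checking): walk the flow's path and verify that each router assigns the class the SLA requires. The router tables below are hypothetical.

```python
# Hypothetical per-hop SLA verification for a DiffServ flow.
def class_at(router, flow):
    for match, cls in router:   # first-match policy lookup
        if match(flow):
            return cls
    return "BE"                 # default: best-effort

path = [
    [(lambda f: f["dport"] == 5060, "EF")],    # router 1: marks VoIP as EF
    [(lambda f: f["dport"] == 5060, "EF")],    # router 2: same
    [(lambda f: f["dport"] <= 1023, "AF41")],  # router 3: EF rule is missing
]
flow, required = {"dport": 5060}, "EF"

for hop, router in enumerate(path, 1):
    got = class_at(router, flow)
    if got != required:
        print(f"SLA violated at hop {hop}: expected {required}, got {got}")
```

Pinpointing the first hop where the assigned class diverges from the agreement is the debugging use case the abstract mentions; the paper achieves this symbolically over all flows at once rather than by simulating one flow as this sketch does.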