Catharine van Ingen
Microsoft
Publication
Featured research published by Catharine van Ingen.
Global Biogeochemical Cycles | 2011
Youngryel Ryu; Dennis D. Baldocchi; Hideki Kobayashi; Catharine van Ingen; Jie Li; T. Andy Black; Jason Beringer; Eva van Gorsel; Alexander Knohl; Beverly E. Law; Olivier Roupsard
BESS estimates showed linear relations with measurements of solar irradiance (r² = 0.95, relative bias: 8%), gross primary productivity (r² = 0.86, relative bias: 5%) and evapotranspiration (r² = 0.86, relative bias: 15%) in data from 33 flux towers that cover seven plant functional types across arctic to tropical climatic zones. A sensitivity analysis revealed that the gross primary productivity and evapotranspiration computed in BESS were most sensitive to leaf area index and solar irradiance, respectively. We quantified the mean global terrestrial estimates of gross primary productivity and evapotranspiration between 2001 and 2003 as 118 ± 26 Pg C yr⁻¹ and 500 ± 104 mm yr⁻¹ (equivalent to 63,000 ± 13,100 km³ yr⁻¹), respectively. BESS-derived gross primary productivity and evapotranspiration estimates were consistent with the estimates from independent machine-learning, data-driven products, but the process-oriented structure has the advantage of diagnosing the sensitivity of mechanisms. The process-based BESS is able to offer gridded biophysical variables everywhere from local to global land scales at an 8-day interval over multiple years.
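The site-level validation quoted above rests on two statistics, r² and relative bias, computed between model estimates and flux-tower measurements. The following is a minimal sketch of that comparison; the arrays hold invented stand-in values, not FLUXNET or BESS data.

```python
# Sketch of the validation statistics cited above: r-squared of a linear
# fit and relative bias between modeled and observed values. The GPP
# numbers below are illustrative placeholders, not tower records.
import numpy as np

def r_squared(observed, modeled):
    """Coefficient of determination of a least-squares linear fit."""
    slope, intercept = np.polyfit(observed, modeled, 1)
    predicted = slope * observed + intercept
    ss_res = np.sum((modeled - predicted) ** 2)
    ss_tot = np.sum((modeled - np.mean(modeled)) ** 2)
    return 1.0 - ss_res / ss_tot

def relative_bias(observed, modeled):
    """Mean model-minus-observation difference, relative to the mean observation."""
    return (np.mean(modeled) - np.mean(observed)) / np.mean(observed)

# Hypothetical 8-day GPP values (g C m^-2 d^-1) at one tower.
tower = np.array([1.2, 3.4, 5.1, 7.8, 6.3, 4.0, 2.2])
model = np.array([1.0, 3.6, 5.5, 8.1, 6.0, 4.3, 2.0])

print(f"r² = {r_squared(tower, model):.2f}, "
      f"relative bias = {relative_bias(tower, model):+.1%}")
```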
international conference on cloud computing | 2010
Yogesh Simmhan; Catharine van Ingen; Girish Subramanian; Jie Li
The widely discussed scientific data deluge creates a need to computationally scale out eScience applications beyond the local desktop and cope with variable loads over time. Cloud computing offers a scalable, economic, on-demand model well matched to these needs. Yet cloud computing creates gaps that must be crossed to move existing science applications to the cloud. In this article, we propose a Generic Worker framework to deploy and invoke science applications in the cloud with minimal user effort and predictable, cost-effective performance. Our framework addresses three distinct challenges posed by the cloud: the complexity of application deployment, invocation of cloud applications from desktop clients, and efficient transparent data transfers across the desktop and the cloud. We present an implementation of the Generic Worker for the Microsoft Azure Cloud and evaluate its use for a genomics application. Our evaluation shows that the user complexity to port and scale the application is substantially reduced while introducing a negligible performance overhead of less than 5% for the genomics application when scaling to 20 VM instances.
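The core pattern is a worker that pulls tasks from a queue, stages inputs, runs the unmodified science executable, and publishes outputs. Below is a minimal local sketch of that loop; the real framework targets Azure queues and blob storage, for which queue.Queue and the local filesystem stand in here, and the task fields and executable name are invented for illustration.

```python
# Local sketch of the Generic Worker pattern: poll a task queue, stage
# inputs, shell out to an unmodified application, publish the outputs.
import queue
import shutil
import subprocess
from pathlib import Path

tasks = queue.Queue()
tasks.put({"exe": "blast_search", "inputs": ["genome.fa"], "output": "hits.txt"})

def generic_worker(task_queue, store=Path("cloud_store"), scratch=Path("scratch")):
    scratch.mkdir(exist_ok=True)
    while not task_queue.empty():
        task = task_queue.get()
        # Stage inputs from the (simulated) cloud store to local scratch space.
        local_inputs = []
        for name in task["inputs"]:
            local = scratch / name
            shutil.copy(store / name, local)
            local_inputs.append(str(local))
        # Invoke the unmodified application binary with its staged inputs.
        out_path = scratch / task["output"]
        subprocess.run([task["exe"], *local_inputs, "-o", str(out_path)], check=True)
        # Publish the result back to the store so desktop clients can fetch it.
        shutil.copy(out_path, store / task["output"])

# generic_worker(tasks) would drain the queue once "cloud_store" holds
# genome.fa and the blast_search binary is on the PATH.
```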
2009 Third International Conference on Advanced Engineering Computing and Applications in Sciences | 2009
Yogesh Simmhan; Roger S. Barga; Catharine van Ingen; Edward D. Lazowska; Alexander S. Szalay
Scientific workflows have gained popularity for modeling and executing in silico experiments by scientists for problem-solving. These workflows primarily engage in computation and data transformation tasks to perform scientific analysis in the Science Cloud. Increasingly, workflows are also used to manage scientific data as it arrives from external sensors and is prepared to become science-ready and available for use in the Cloud. While not directly part of the scientific analysis, these workflows operating behind the Cloud on behalf of the "data valets" play an important role in the end-to-end management of scientific data products. They share several features with traditional scientific workflows: both are data intensive and use Cloud resources. However, they also differ in significant respects, for example, in the reliability required, scheduling constraints and the use of the provenance collected. In this article, we investigate these two classes of workflows – Science Application workflows and Data Preparation workflows – and use them to derive common and distinct requirements on workflow systems for eScience in the Cloud. We use workflow examples from two collaborations, the NEPTUNE oceanography project and the Pan-STARRS astronomy project, to draw out our comparison. Our analysis of these workflow classes can guide the evolution of workflow systems to support emerging applications in the Cloud; the Trident Scientific Workbench is one workflow system that has directly benefited from this analysis to meet the needs of these two eScience projects.
Concurrency and Computation: Practice and Experience | 2013
Jane Hunter; Abdulmonem Alabri; Catharine van Ingen
The Internet, Web 2.0 and social networking technologies are enabling citizens to actively participate in 'citizen science' projects by contributing data to scientific programmes via the Web. However, the limited training, knowledge and expertise of contributors can lead to poor quality, misleading or even malicious data being submitted. Consequently, the scientific community often perceives citizen science data as not worthy of use in serious scientific research, which in turn leads to poor retention rates for volunteers. In this paper, we describe a technological framework that combines data quality improvements and trust metrics to enhance the reliability of citizen science data. We describe how online social trust models can provide a simple and effective mechanism for measuring the trustworthiness of community-generated data. We also describe filtering services that remove unreliable or untrusted data and enable scientists to confidently reuse citizen science data. The resulting software services are evaluated in the context of the CoralWatch project, a citizen science project that uses volunteers to collect comprehensive data on coral reef health.
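To make the trust-and-filter idea concrete, here is a minimal sketch: each volunteer accrues a reputation from how often past submissions agreed with expert review, and a filtering service keeps only records from sufficiently trusted contributors. The scoring rule, the 0.6 threshold, and the example records are illustrative assumptions, not the CoralWatch metrics.

```python
# Sketch of a social-trust score plus a filtering service for
# community-generated observations. All values are invented.
from dataclasses import dataclass

@dataclass
class Volunteer:
    name: str
    confirmed: int   # past submissions confirmed by experts or peers
    disputed: int    # past submissions flagged as unreliable

    def trust(self) -> float:
        # Laplace-smoothed agreement rate, so new volunteers start near 0.5.
        return (self.confirmed + 1) / (self.confirmed + self.disputed + 2)

def filter_submissions(submissions, threshold=0.6):
    """Drop records whose contributor falls below the trust threshold."""
    return [s for s in submissions if s["by"].trust() >= threshold]

alice = Volunteer("alice", confirmed=18, disputed=2)   # trust ~0.86
bob = Volunteer("bob", confirmed=1, disputed=4)        # trust ~0.29
data = [{"reef": "Heron Island", "health": 4, "by": alice},
        {"reef": "Heron Island", "health": 1, "by": bob}]
print(filter_submissions(data))  # only alice's record survives
```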
international conference on data engineering | 2009
Sebastian Michel; Ali Salehi; Liqian Luo; Nicholas Dawes; Karl Aberer; Guillermo Barrenetxea; Mathias Bavay; Aman Kansal; K. Ashwin Kumar; Suman Nath; Marc Parlange; Stewart Tansley; Catharine van Ingen; Feng Zhao; Yongluan Zhou
A sensor network data gathering and visualization infrastructure is demonstrated, comprising the Global Sensor Networks (GSN) middleware and Microsoft SensorMap. Users are invited to actively participate in the process of monitoring real-world deployments and can inspect measured data in the form of contour plots overlaid onto a high-resolution map and a digital topographic model. Users can go back in time virtually to search for interesting events or simply to visualize the temporal dependencies in the data. The system presented is not only interesting and visually enticing for non-expert users but brings substantial benefits to environmental scientists. The easily installed data acquisition component, together with the powerful data sharing and visualization platform, opens up new ground in collaborative data gathering and interpretation in the spirit of Web 2.0 applications.
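The "go back in time" feature amounts to selecting a historical window from archived readings before rendering them. A minimal sketch of that query follows; GSN exposes this through its middleware, for which a plain list of (timestamp, station, value) tuples stands in here, and the station name and values are invented.

```python
# Sketch of a time-window query over archived sensor readings,
# the kind of selection behind the demo's historical playback.
from datetime import datetime

readings = [
    (datetime(2009, 1, 14, 9, 0), "station_7", -8.2),
    (datetime(2009, 1, 14, 9, 30), "station_7", -7.9),
    (datetime(2009, 1, 15, 9, 0), "station_7", -11.4),
]

def window(data, start, end):
    """Return readings whose timestamp falls in [start, end)."""
    return [r for r in data if start <= r[0] < end]

past = window(readings, datetime(2009, 1, 14), datetime(2009, 1, 15))
print(past)  # the two Jan 14 readings, ready to overlay on the map
```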
Archive | 2012
Dario Papale; Deborah A. Agarwal; Dennis D. Baldocchi; R. B. Cook; Joshua B. Fisher; Catharine van Ingen
If I have seen further,” Sir Isaac Newton wrote to Robert Hooke in 1676, “it is by standing on the shoulders of giants.
International Journal of Agricultural and Environmental Information Systems | 2011
Jane Hunter; Peter Becker; Abdulmonem Alabri; Catharine van Ingen; Eva Abal
The Health-e-Waterways Project is a multi-disciplinary collaboration between the University of Queensland, Microsoft Research and the South East Queensland Healthy Waterways Partnership (SEQ-HWP). This project develops the underlying technological framework and set of services to enable streamlined access to the expanding collection of real-time, near-real-time and static datasets related to water resource management in South East Queensland. More specifically, the system enables water resource managers to access the datasets being captured by the various agencies participating in the SEQ-HWP Ecosystem Health Monitoring Program (EHMP). It also provides online access to the statistical data processing tools that enable users to analyse the data and generate online ecosystem report cards dynamically via a Web mapping interface. The authors examine the development of ontologies and semantic querying tools to integrate disparate datasets and relate management actions to water quality indicators for specific regions and periods. This semantic data integration approach enables scientists and resource managers to identify which actions are having an impact on which parameters and adapt the management strategies accordingly. This paper provides an overview of the semantic technologies developed to underpin the adaptive management framework that is the central philosophy behind the SEQ-HWP.
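The semantic-query idea can be sketched with RDF triples that encode management actions and water-quality observations, then queried for actions targeting an indicator that degraded in a region. The ontology terms, catchment name, and trend values below are invented for illustration; the real project defines its own vocabularies. Requires the rdflib package.

```python
# Sketch of semantic data integration: triples linking a management
# action to a water-quality indicator, queried with SPARQL.
from rdflib import Graph, Literal, Namespace

HW = Namespace("http://example.org/healthywaterways#")  # hypothetical vocabulary
g = Graph()
g.add((HW.RiparianReplanting, HW.targetsIndicator, HW.TotalNitrogen))
g.add((HW.TotalNitrogen, HW.observedIn, HW.LockyerCatchment))
g.add((HW.TotalNitrogen, HW.trend, Literal("degrading")))

q = """
SELECT ?action WHERE {
  ?action hw:targetsIndicator ?ind .
  ?ind hw:observedIn hw:LockyerCatchment ;
       hw:trend "degrading" .
}
"""
# Which actions address an indicator that is degrading in this catchment?
for row in g.query(q, initNs={"hw": HW}):
    print(row.action)  # -> ...RiparianReplanting
```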
international conference on e-science | 2009
Marty Humphrey; Deborah A. Agarwal; Catharine van Ingen
Many of today’s large-scale scientific projects attempt to collect data from a diverse set of sources. The traditional campaign-style approach to “synthesis” efforts gathers data through a single concentrated effort, and the data contributors know in advance exactly who will use their data and why. At even moderate scales, the cost and time required to find, gather, collate, normalize, and customize data in order to build a synthesis dataset can quickly outweigh the value of the resulting dataset. By explicitly identifying and addressing the different requirements for each data role (author, publisher, curator, and consumer), our data management architecture for large-scale shared scientific data enables the creation of such synthesis datasets that continue to grow and evolve with new data, data annotations, participants, and use rules. We show the effectiveness of our approach in the context of the FLUXNET Synthesis Dataset, one of the largest ongoing biogeophysical experiments.
international conference on e-science | 2009
Yogesh Simmhan; Catharine van Ingen; Alexander S. Szalay; Roger S. Barga; J. N. Heasley
The growing amount of scientific data from sensors and field observations is posing a challenge to "data valets" responsible for managing them in data repositories. These repositories, built on commodity clusters, need to reliably ingest data continuously and ensure its availability to a wide user community. Workflows provide several benefits for modeling data-intensive science applications, and many of these benefits can help manage data ingest pipelines too. But workflows are not a panacea in themselves, and data valets need to consider several issues when designing workflows that behave reliably on fault-prone hardware while retaining the consistency of the scientific data. In this paper, we propose workflow designs for reliable data ingest in a distributed environment and identify workflow framework features to support resilience. We illustrate these using the data pipeline for the Pan-STARRS repository, one of the largest digital surveys, which accumulates 100 TB of data annually to support 300 astronomers.
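Two resilience patterns such ingest designs typically lean on are bounded retries around flaky steps and idempotent steps that a crash-recovery re-run cannot corrupt or duplicate. A minimal sketch of both follows; the load function and batch identifier are placeholders, not Pan-STARRS pipeline code.

```python
# Sketch of retry-with-backoff and idempotent ingest, two building
# blocks for workflows that must survive fault-prone hardware.
import time

def with_retries(step, attempts=3, backoff_seconds=2.0):
    """Run step(); on failure, wait and retry up to the attempt limit."""
    for i in range(attempts):
        try:
            return step()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff_seconds * (2 ** i))

ingested = set()  # stands in for the repository's record of completed loads

def load_into_repository(file_id):
    if file_id in ingested:      # idempotence: skip work already committed
        return "skipped"
    # ... parse, validate, and load the file here ...
    ingested.add(file_id)
    return "loaded"

print(with_retries(lambda: load_into_repository("detection_batch_0042")))
print(with_retries(lambda: load_into_repository("detection_batch_0042")))  # skipped
```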
Future Generation Computer Systems | 2014
Valerie Hendrix; Lavanya Ramakrishnan; Youngryel Ryu; Catharine van Ingen; Keith Jackson; Deborah A. Agarwal
The Moderate Resolution Imaging Spectroradiometer (MODIS) instrument's land and atmosphere data are important to many scientific analyses that study processes at both local and global scales. The Terra and Aqua MODIS satellites acquire data of the entire Earth's surface every one or two days in 36 spectral bands. MODIS data provide information that complements many ground-based observations but are critical when studying global phenomena such as gross photosynthesis and evapotranspiration. However, data procurement and processing can be challenging and cumbersome due to the volume of the data and the scale of the analyses. For example, the very first step in MODIS data processing is to ensure that all products are in the same resolution and coordinate system. The reprojection step involves a complex inverse gridding algorithm and requires downloading tens of thousands of files for a single year, which is often infeasible on a scientist's desktop. Thus, the use of large-scale resource environments such as high performance computing (HPC) systems is becoming crucial for processing MODIS data. However, HPC environments have traditionally been used for tightly coupled applications and present several challenges for managing data-intensive pipelines. We have developed a data-processing pipeline that downloads the MODIS swath products and reprojects the data to a sinusoidal system on an HPC system. The 10-year archive of the reprojected data generated using the pipeline is made available through a web portal. In this paper, we detail a system architecture (CAMP) that manages the lifecycle of MODIS data, including procurement, storage, processing and dissemination. Our system architecture was developed in the context of the MODIS reprojection pipeline but is extensible to other analyses of MODIS data. Additionally, our work provides a framework and valuable experiences for future development and deployment of data-intensive pipelines from other scientific domains on HPC systems.
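A lifecycle pipeline of this shape can be sketched as discrete download, reproject, and publish stages with a persistent record of which granules have completed each stage, so an interrupted HPC run resumes rather than restarts. The stage bodies below are placeholders (the real reprojection implements MODIS's inverse gridding onto a sinusoidal grid), and the granule name and state-file layout are illustrative assumptions, not the CAMP design.

```python
# Sketch of a resumable multi-stage pipeline: each granule's completed
# stages are checkpointed to disk so a rerun skips finished work.
import json
from pathlib import Path

STATE = Path("pipeline_state.json")

def load_state():
    return json.loads(STATE.read_text()) if STATE.exists() else {}

def save_state(state):
    STATE.write_text(json.dumps(state))

def run_stage(granule, stage, work, state):
    done = state.setdefault(granule, [])
    if stage in done:            # resume support: skip completed stages
        return
    work(granule)
    done.append(stage)
    save_state(state)            # checkpoint after every stage

def download(g):  print(f"fetch swath {g} from the archive")   # placeholder
def reproject(g): print(f"regrid {g} to sinusoidal tiles")     # placeholder
def publish(g):   print(f"expose {g} through the web portal")  # placeholder

state = load_state()
for granule in ["MOD06_L2.A2003001.0005"]:   # hypothetical granule ID
    for stage, work in [("download", download),
                        ("reproject", reproject),
                        ("publish", publish)]:
        run_stage(granule, stage, work, state)
```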