October 5-9, 2014

Abstract

P2.11 Compute Pipelines with Advanced Data Management using Pegasus WMS

Karan Vahi (USC Information Sciences Institute)

Mats Rynge, Rafael Ferreira da Silva, Gideon Juve, Rajiv Mayani, and Ewa Deelman (Information Sciences Institute, University of Southern California)

Pegasus WMS is a workflow management system that can manage large-scale pipelines across desktops, campus clusters, grids, and clouds. This poster introduces the advanced capabilities available for managing these pipelines in an efficient, reliable, and automated fashion, with a focus on data management.

HTCondor DAGMan is a common foundation for many astronomy pipelines. When using DAGMan directly, projects commonly find themselves developing pipelines for a particular execution environment and data storage solution, and therefore have to spend valuable development time continuously adjusting the pipeline for new infrastructures or for changes to the existing one. Pegasus WMS solves this problem by providing an abstraction layer on top of HTCondor DAGMan. In Pegasus WMS, pipelines are represented in an abstract form that is independent of the resources available to run them and of the locations of data and executables (a sketch of such an abstract workflow appears below). Pegasus WMS compiles these abstract workflows into an executable form by querying information catalogs; the executable form is an advanced DAG that HTCondor DAGMan executes.

Data discovery is a key feature of Pegasus WMS. Within the abstract workflow, files are referred to only by logical file names (LFNs). During the mapping step, Pegasus WMS resolves each LFN to a list of physical file names (PFNs), which are URLs for the locations where the file can be found (the replica catalog sketch below illustrates this mapping). For input files, the system determines the appropriate replica to use and the data transfer mechanism to employ, and adds data transfer tasks to the pipeline accordingly. If, during these lookups, Pegasus WMS finds that a subset of the pipeline outputs already exists, the pipeline is automatically pruned so that the existing outputs are not recomputed. This data reuse feature is commonly used by projects with overlapping datasets and pipelines.

Required data transfers are automatically added to the pipeline and optimized for performance. Pegasus WMS combines compute environment descriptions provided by the scientist with the URLs for the data to schedule these transfers, including credential management. Which transfers are required depends on the execution environment. On a resource with a shared filesystem, Pegasus WMS stages input data to the shared filesystem, sets up each task to read and write directly against that filesystem, and then stages outputs out to permanent storage. In a distributed execution environment, i.e., one without a shared filesystem, Pegasus WMS uses a storage service for intermediate data products and configures the tasks to pull and push data against that service; an example is using Amazon S3 as the storage service when running a pipeline in the Amazon EC2 cloud (the configuration sketch below shows how this choice is expressed).

In addition to the data transfer tasks, data cleanup and data registration tasks are added to the pipeline. During the mapping step, Pegasus WMS determines when intermediate data files are no longer required and adds cleanup tasks, with the overall result being a minimized data footprint during execution. Data registration adds information about generated outputs to an information catalog for future data discovery and potential data reuse.

Pegasus WMS also captures the provenance of the entire pipeline lifecycle, from the planning stage, through execution, to the final output data, helping scientists accurately measure the performance of their pipelines and reconstruct the history of data products. Pegasus also provides debugging and monitoring tools that allow users to easily detect and debug failures in their pipelines (example commands below).
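As an illustration of the abstract representation, the following is a minimal sketch of a one-task workflow written with the Pegasus DAX3 Python API. The transformation name, file names, and argument flags are hypothetical placeholders.

    #!/usr/bin/env python
    # Minimal sketch of an abstract workflow (DAX) using the Pegasus
    # DAX3 Python API; names and flags here are illustrative only.
    import sys
    from Pegasus.DAX3 import ADAG, Job, File, Link

    dax = ADAG("example-pipeline")

    # Files are referred to only by logical file names (LFNs);
    # Pegasus WMS resolves them to physical locations at mapping time.
    raw = File("image.raw")
    calibrated = File("image.cal")

    job = Job(name="preprocess")
    job.addArguments("-i", raw, "-o", calibrated)
    job.uses(raw, link=Link.INPUT)
    job.uses(calibrated, link=Link.OUTPUT, transfer=True)
    dax.addJob(job)

    # Write the abstract workflow that pegasus-plan compiles into an
    # executable HTCondor DAGMan workflow.
    dax.writeXML(sys.stdout)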
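The LFN-to-PFN lookups are typically backed by a replica catalog. The sketch below uses the file-based catalog format, mapping one LFN to two candidate replicas; the hosts and paths are hypothetical.

    # Each line maps a logical file name (LFN) to a physical file name
    # (PFN); the site attribute records where the replica resides.
    image.raw   file:///lfs/archive/2014/image.raw            site="local"
    image.raw   gsiftp://gridftp.example.org/data/image.raw   site="remote"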
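The choice between a shared-filesystem and a distributed data staging setup is made at planning time. A minimal sketch of the relevant pegasus.properties entry follows, assuming the staging service (for example, an Amazon S3 bucket) is described in the site catalog.

    # "sharedfs" stages data to a shared filesystem on the compute site;
    # "nonsharedfs" routes intermediate data through a separate staging
    # site, such as Amazon S3 when running on Amazon EC2.
    pegasus.data.configuration = nonsharedfs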
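Typical monitoring and debugging usage is sketched below; the run directory path is hypothetical.

    # Show overall and per-job progress of a running workflow.
    pegasus-status -l /scratch/runs/run0001

    # Summarize a run and explain which jobs failed and why.
    pegasus-analyzer /scratch/runs/run0001

    # Generate runtime statistics for a completed workflow.
    pegasus-statistics -s all /scratch/runs/run0001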
Acknowledgments: This work was supported by the National Science Foundation under grant #OCI-1148515.

Mode of presentation: poster

Applicable ADASS XXIV theme category: Data Analysis / Pipelines