Scientific Workflow Management

In document GRDI2020 Final Roadmap Report (Page 40-44)

7. System Challenges

7.4 Scientific Workflow Management

Today, scientists face many of the same challenges found in enterprise computing, namely integrating distributed and heterogeneous resources. Scientists no longer use just a single machine, or even a single cluster of machines, or a single source of data. Research collaborations are becoming more and more geographically dispersed and often exploit heterogeneous tools, compare data from different sources, and use machines distributed across several institutions throughout the world. Therefore, the task of running and coordinating a scientific application across several administrative domains remains extremely complex.

Definition

Scientific Workflow is a key component in a research data infrastructure. It orchestrates e-Science services so that they co-operate to implement efficiently a scientific application.

Scientific Workflow has seen massive growth in recent years as science becomes increasingly reliant on the analysis of massive data sets and the use of distributed resources. The workflow programming paradigm is seen as a means of managing the complexity in defining the analysis, executing the necessary computations on distributed resources, collecting information about the analysis results, and providing means to record and reproduce the scientific analysis [40].

Workflows provide [41]:

• A systematic and automated means of conducting analyses across diverse datasets and applications;

• A way of capturing this process so that results can be reproduced and the method can be reviewed, validated, repeated, and adapted;

• A visual scripting interface so that computational scientists can create pipelines without low-level programming concerns;

• An integration and access platform for the growing pool of independent resource providers so that computational scientists need not specialize in each one.

The workflow is thus becoming a paradigm for enabling science on a large scale by managing data preparation and analysis pipelines, as well as the preferred vehicle for computational knowledge extraction.

Workflow Definition

A workflow is a precise description of a scientific procedure – a multi-step process to coordinate multiple tasks, acting like a sophisticated script. Each task represents the execution of a computational process, such as running a program, submitting a query to a database, submitting a job to a compute cloud or grid, or invoking a service over the Web to use a remote resource. Data output from one task is consumed by subsequent tasks according to a predefined graph topology that “orchestrates” the flow of data [41].
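The graph-driven flow of data described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the design of any particular workflow system: the function run_workflow, the task names, and the sample data are all hypothetical, and each task is a local function standing in for a program run, a database query, or a remote service call.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run every task after its upstream tasks, handing it their outputs.

    tasks:        task name -> function taking a dict of upstream outputs
    dependencies: task name -> set of upstream task names (the graph topology)
    """
    results = {}
    # static_order() yields each task only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical three-step pipeline: query -> clean -> summarize.
tasks = {
    "query": lambda up: [3.0, None, 4.5, 1.5],            # stands in for a database query
    "clean": lambda up: [x for x in up["query"] if x is not None],
    "summarize": lambda up: sum(up["clean"]) / len(up["clean"]),
}
dependencies = {"query": set(), "clean": {"query"}, "summarize": {"clean"}}

results = run_workflow(tasks, dependencies)
print(results["summarize"])
```

Keeping the topology in a separate dependencies mapping, rather than hard-coding call order, is what lets the same tasks be rewired into a different pipeline without touching their code.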

Workflow Systems

Workflow systems generally have three components: an execution platform, a visual design suite, and a development kit [41].

The platform executes the workflow on behalf of applications and handles common crosscutting concerns, including: invocation of services and handling heterogeneities of data types, recovery from failures, optimization of memory, storage, and execution including concurrency and parallelization, data handling (mapping, referencing, movement, streaming and staging), security and monitoring of access policies.

The design suite provides a visual scripting application for authoring and sharing workflows and preparing the components that are to be incorporated as executable steps.

The development kit enables developers to extend the capabilities of the system and enables workflows to be embedded into applications, Web portals, or databases.

Workflow Usage [41]

Workflows liberate scientists from the drudgery of routine data processing so they can concentrate on scientific discovery. They shoulder the burden of routine tasks, they represent the computational protocols needed to undertake data-centric science, and they open up the use of processes and data resources to a much wider group of scientists and scientific application developers. Workflows are ideal for systematically, accurately, and repeatedly running routine procedures: managing data capture from sensors or instruments; cleaning, normalizing, and validating data; securely and efficiently moving and archiving data; comparing data across repeated runs; and regularly updating data warehouses.

Workflow Types

In [40] four basic types of workflows encountered in business have been identified, and most have direct counterparts in science and engineering. The first type of workflows, referred to as collaborative workflows, are those that have high business value to the company and involve a single large project and possibly many individuals. For example, the production, promotion, documentation, and release of a major product fall into this category. The workflow is usually specific to the particular project, but it may follow a standard pattern used by the company. Within the scientific community, it can refer to the management of data produced and distributed on behalf of a large scientific experiment such as those encountered in high-energy physics. Another example may be the end-to-end tracking of the steps required by a biotech enterprise to produce and release a new drug.

The second type of workflow is ad hoc. These activities are less formal in both structure and required response; for example, a notification broadcast to the entire workforce that a business practice or policy has changed. Any required action is up to the individual receiving the notification.

Within science, notification-driven workflows are common. A good example is an agent process that looks at the output of an instrument. Based on events detected by the instrument, different actions may be required and sub-workflow instances may need to be created to deal with them.

The third type of workflow is administrative, which refers to enterprise activities such as internal bookkeeping, database management, and maintenance scheduling, that must be done frequently but are not tied directly to the core business of the company.

On the other hand, the fourth type of workflow, referred to as production workflow, is involved with those business processes that define core business activities. For example, the steps involved with loan processing are one of the central business processes of a bank. These are tasks that are repeated frequently, and many such workflows may be concurrently processed.

Both the administrative and production forms of workflow have obvious counterparts in science and engineering. For example, the routine tasks of managing data coming from instrument streams or verifying that critical monitoring services are running are administrative in nature. Production workflows are those that are run as standard data analyses and simulations by users on a daily basis. For example, predicting severe storms from current weather conditions within a specific domain and conducting a standard data-mining experiment on a new, large data sample are both central to e-Science workflow practice.

Workflow-enabled e-Science [40]

Scientific workflows have emerged and been adapted from the business world as a means to formalize and structure the data analysis and computations on the distributed resources.

However, scientific workflows raise issues unique to science. Business workflows are typically less dynamic and evolving in nature. Scientific workflows tend to change more frequently and may involve very voluminous data translations. In addition, while business workflows tend to be constructed by professional software and business flow engineers, scientific workflows are often constructed by scientists themselves. While they are experts in their domains, they are not necessarily experts in information technology, the software, or the networking in which the tools and workflows operate. Therefore, the two cases may require considerably different interfaces and end-user robustness both during the construction stage of the workflows and during their execution.

In composing a workflow, scientists often incorporate portions of existing workflows, making changes where necessary. Business workflow systems do not currently provide support for storing workflows in a repository and then later searching this repository during workflow composition. The degree of flexibility that scientists have in their work is usually much higher than in the business domain, where business processes are usually predefined and executed in a routine fashion. Scientific research is exploratory in nature. Scientists carry out experiments, often in a trial-and-error manner wherein they modify the steps of the task to be performed as the experiment proceeds. A scientist may decide to filter a data set coming from a measuring device. Even if such filtering was not originally planned, that is a perfectly acceptable option. The ability to run, pause, revise, and resume a workflow is not exposed in most business workflow systems. Finally, the control flow found in business workflows may not be expressive enough for highly concurrent workflows and data pipelines found in leading edge simulation studies. Scientific workflows may require a new control flow operator to succinctly capture concurrent execution and data flow.
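The run, pause, revise, and resume cycle described above can be illustrated with a small Python sketch. It assumes a purely linear workflow held in memory; the class ResumableWorkflow, its method names, and the sample readings are hypothetical and chosen only to mirror the example of a scientist adding an unplanned filtering step mid-experiment.

```python
class ResumableWorkflow:
    """A linear workflow that can be paused, revised, and resumed."""

    def __init__(self, steps, data):
        self.steps = list(steps)  # ordered (name, function) pairs
        self.data = data          # current intermediate result
        self.position = 0         # index of the next step to run
        self.history = []         # names of the steps already executed

    def run(self, pause_before=None):
        """Run the remaining steps; stop before the step named pause_before."""
        while self.position < len(self.steps):
            name, func = self.steps[self.position]
            if name == pause_before:
                return False      # paused: state is kept for a later resume
            self.data = func(self.data)
            self.history.append(name)
            self.position += 1
        return True               # all steps finished

    def insert_next(self, name, func):
        """Revise the workflow by adding a step that runs next."""
        self.steps.insert(self.position, (name, func))


wf = ResumableWorkflow(
    steps=[
        ("calibrate", lambda xs: [x * 2 for x in xs]),
        ("average", lambda xs: sum(xs) / len(xs)),
    ],
    data=[1, 2, 50, 3],  # hypothetical instrument readings with one outlier
)

finished = wf.run(pause_before="average")                       # run, then pause
wf.insert_next("filter", lambda xs: [x for x in xs if x < 20])  # revise: unplanned filtering
finished_after_resume = wf.run()                                # resume
print(wf.history, wf.data)
```

Because the intermediate result and the position are kept as explicit state, the revision takes effect without re-running the calibration step.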

Workflow-enabled Data-centric Science [41]

Workflows offer techniques to support the new paradigm of data-centric science. They can be replayed and repeated. Results and secondary data can be computed as needed using the latest sources, providing virtual data (or on-demand) warehouses by effectively providing distributed query processing.
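One way to picture computing secondary data "as needed using the latest sources" is a derived product that is recomputed only when it is requested and its source has changed. The Python sketch below assumes the source exposes a version number; the class VirtualProduct, the read_source callback, and the sample readings are hypothetical.

```python
class VirtualProduct:
    """Derived data computed on demand from the latest version of a source."""

    def __init__(self, read_source, derive):
        self.read_source = read_source  # returns (version, data) for the source
        self.derive = derive            # function turning source data into the product
        self._cached_version = None
        self._cached_value = None
        self.recomputations = 0         # how many times derive() actually ran

    def get(self):
        version, data = self.read_source()
        if version != self._cached_version:
            # The source changed (or this is the first request): recompute.
            self._cached_value = self.derive(data)
            self._cached_version = version
            self.recomputations += 1
        return self._cached_value


# Hypothetical source whose contents change between requests.
source = {"version": 1, "readings": [10, 20, 30]}
product = VirtualProduct(lambda: (source["version"], source["readings"]), max)

first = product.get()    # computed from version 1
again = product.get()    # served from the cached result, no recomputation
source.update(version=2, readings=[10, 20, 30, 45])
latest = product.get()   # recomputed from version 2
print(first, again, latest, product.recomputations)
```

Nothing is materialized until get() is called, which is the sense in which such a product behaves like an on-demand warehouse rather than a stored one.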

Workflows, as first-class citizens in data-centric science, can be generated and transformed dynamically to meet the requirements at hand. In a landscape of data in considerable flux, workflows provide robustness, accountability, and full auditing.
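The accountability and auditing mentioned above can be sketched as a trail of records, one per executed step. This is a minimal sketch under stated assumptions: the intermediate data must be JSON-serializable, and the names fingerprint and run_with_audit, as well as the record fields, are hypothetical rather than taken from any workflow system.

```python
import hashlib
import json


def fingerprint(value):
    """Short, stable digest of a JSON-serializable value."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_with_audit(steps, data):
    """Run (name, function) steps in order and record what each consumed and produced."""
    trail = []
    for name, func in steps:
        result = func(data)
        trail.append({
            "step": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, trail


# Hypothetical two-step analysis.
steps = [
    ("normalize", lambda xs: [x / max(xs) for x in xs]),
    ("total", sum),
]
result, trail = run_with_audit(steps, [2, 4, 8])
replay_result, replay_trail = run_with_audit(steps, [2, 4, 8])  # replay the same workflow
print(result, trail == replay_trail)
```

Comparing the trail of a replay with the original run gives a simple check that a result was reproduced from the same inputs through the same steps.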

Workflows enable data-centric science to be a collaborative endeavor on multiple levels. They enable scientists to collaborate over shared data and shared services, and they grant non-developers access to sophisticated code and applications without the need to install and operate them. Consequently, scientists can use the best applications, not just the ones with which they are familiar. Multidisciplinary workflows promote even broader collaboration. In this sense, a workflow system is a framework for reusing a community’s tools and datasets that respects the original codes and overcomes diverse coding styles.

Although the impact of workflow tools on data-centric science is potentially profound – scaling processing to match the scaling of data – many challenges exist over and above the engineering issues inherent in large-scale distributed software.

There are a confusing number of workflow platforms with various capabilities and purposes and little compliance with standards. Workflows are often difficult to author, using languages that are at an inappropriate level of abstraction and expecting too much knowledge of the underlying infrastructure.

The reusability of a workflow is often confined to the project it was conceived in – or even to its author – and it is inherently only as strong as its components. Although workflows encourage providers to supply clean, robust, and validated data services, component failure is common. Unfortunately, debugging failing workflows is a crucial but neglected topic. Contemporary workflow platforms fall short of adequately supporting rapid deployment into the user applications that consume them, and legacy application codes need to be integrated and managed.