Workflow Planning and Execution

The workflow engine transforms abstract GeoEDF workflows expressed in YAML syntax into a sequence of possibly parallel workflow jobs that execute connector or processor containers in a target execution environment. The actual workflow planning, data management, and execution are carried out by the Pegasus workflow management system. The workflow engine first converts a GeoEDF YAML workflow into a Pegasus workflow using the Pegasus Python API, after which the same API is used to plan and submit the workflow to a target execution site.
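As a rough sketch of this conversion step, the engine can be thought of as walking the parsed YAML stages in numeric order and emitting one job specification per stage. The stage keys, plugin names, and helper functions below are illustrative stand-ins, not the exact GeoEDF schema; the real engine builds Pegasus objects via the Pegasus Python API rather than plain tuples:

```python
# Illustrative sketch only: the dict stands in for a parsed GeoEDF YAML
# workflow, and "stage_order"/"to_job_sequence" are hypothetical helpers.
parsed = {
    "$1": {"Input": {"NASAInput": {"url": "https://example.org/data"}}},
    "$2": {"Processor": {"HDFProcessor": {"dir": "$1"}}},
}

def stage_order(workflow):
    """Return stage keys sorted by their numeric identifier ($1, $2, ...)."""
    return sorted(workflow, key=lambda k: int(k.lstrip("$")))

def to_job_sequence(workflow):
    """Map each stage to a (stage_id, plugin_kind, plugin_name) job tuple."""
    jobs = []
    for stage_id in stage_order(workflow):
        (kind, plugins), = workflow[stage_id].items()
        (name, _bindings), = plugins.items()
        jobs.append((stage_id, kind, name))
    return jobs

print(to_job_sequence(parsed))
```

In the actual engine, each such tuple would correspond to a Pegasus sub-workflow rather than a flat job, as described below.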

Each workflow stage (identified by a numeric identifier $n) is converted into a separate Pegasus sub-workflow, and these sub-workflows are executed in stage order. Bookend jobs at either end perform initial setup (creation of a temporary scratch directory on the execution site and of a key pair for encrypting sensitive arguments) and final cleanup (transfer of final outputs back to the submission site), respectively.
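The resulting top-level job sequence can be sketched as follows; the job names here are illustrative, not the identifiers the engine actually assigns:

```python
def top_level_jobs(stage_ids):
    """Sketch of the top-level job sequence: a setup bookend (scratch
    directory, key pair), one sub-workflow per stage in order, then a
    cleanup bookend that transfers final outputs back."""
    return ["setup"] + [f"subworkflow_{s}" for s in stage_ids] + ["cleanup"]

print(top_level_jobs([1, 2, 3]))
```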

Note that the individual stage sub-workflows cannot be constructed ahead of time. The primary reason is that at planning time, the number of files that each connector or processor instance may generate is unknown. For efficiency's sake, each sub-workflow is parallelized so that a connector or processor operates on each of its possible bindings in parallel rather than sequentially. Efficiency also dictates that the outputs of each workflow stage not be transferred back and forth between the execution and submission sites. Instead, each sub-workflow contains its own bookend jobs that create a stage-specific output directory on the execution site, collect the names of the output files placed in that directory by the plugins, and return this list of output filenames to the submission site. This list is used to construct the parallelized sub-workflows for subsequent stages that reference the outputs of this stage.
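The binding-level parallelization can be sketched as a Cartesian-product expansion: each attribute with multiple possible values (for example, the list of output filenames collected from a prior stage) contributes one dimension, and each combination becomes a separate, parallelizable job. The function and attribute names are hypothetical:

```python
from itertools import product

def expand_bindings(bindings):
    """Expand a plugin's attribute bindings into one job per combination.

    `bindings` maps attribute names to lists of possible values. Each
    combination of values becomes a separate job that can run in parallel.
    """
    keys = list(bindings)
    return [dict(zip(keys, combo))
            for combo in product(*(bindings[k] for k in keys))]

jobs = expand_bindings({
    "file": ["a.hdf", "b.hdf"],  # e.g. two outputs from the prior stage
    "var": ["precip"],
})
print(jobs)
```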

The construction of the sub-workflows is itself carried out by a Pegasus workflow job which outputs the constructed sub-workflow to a Pegasus YAML workflow file. The planning and execution of these sub-workflow files are carried out by another Pegasus workflow job. Workflow job dependencies are established so that the sub-workflow creation job is executed only after all prior workflow stages have completed execution, and the list of output files from each prior stage is provided as input to this creation job. Similarly, a sub-workflow execution job is set to execute only after the corresponding creation job has completed and produced the sub-workflow file.
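This dependency pattern can be sketched as follows: for each stage n, a creation job waits on the execution jobs of all prior stages, and the corresponding execution job waits on that creation job. The job names are illustrative:

```python
def build_dependencies(num_stages):
    """Sketch of the per-stage dependency edges.

    For stage n, 'create_n' (which builds the sub-workflow file) depends on
    the execution jobs of all prior stages, and 'exec_n' (which plans and
    runs that file) depends on 'create_n'. Names are hypothetical.
    """
    deps = {}
    for n in range(1, num_stages + 1):
        deps[f"create_{n}"] = [f"exec_{m}" for m in range(1, n)]
        deps[f"exec_{n}"] = [f"create_{n}"]
    return deps

print(build_dependencies(2))
```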

Pegasus profiles are utilized to ensure that the creation jobs always execute on the submission site. This is because the creation job in essence executes a Python script that utilizes the Pegasus Python API to construct the sub-workflow for this stage. Rather than transferring the entire Pegasus library to the execution site, the local Pegasus library on the submission site can be used. Furthermore, the planning and execution of a sub-workflow must occur on the submission site, so producing the sub-workflow file locally improves the efficiency of this process.

In general, a GeoEDF user does not need to be aware of the specifics of workflow planning and execution. Gateway-specific GeoEDF setup may require some initial configuration of the execution site where jobs are to be executed. Details of the setup process are provided in the Deployment documentation.

Some final points of note:

  1. Since filter plugins output strings rather than files, they must write the values they generate to a text file. The GeoEDF framework binds the target attribute on a filter plugin to the filepath of this output file.
  2. In other cases such as input or processor plugins, the target attribute points to the output directory where files generated by the plugin need to be stored.
  3. When a dir modifier is used in one of a plugin’s attribute bindings, that attribute is bound exactly once, irrespective of the number of files produced by the referenced stage. If the plugin has other attributes with multiple bindings, parallel jobs are still created for the plugin, with each job using the same binding for the dir-modified attribute.
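The dir-modifier behavior described in point 3 can be sketched as a variant of binding expansion in which dir-modified attributes are held fixed across all parallel jobs. The function and attribute names below are hypothetical:

```python
from itertools import product

def expand_with_dir(bindings, dir_attrs):
    """Expand attribute bindings into parallel jobs, binding any attribute
    carrying the dir modifier exactly once (to the shared stage directory),
    regardless of how many files the referenced stage produced."""
    fixed = {k: bindings[k] for k in dir_attrs}  # bound once, shared by all jobs
    varying = {k: v for k, v in bindings.items() if k not in dir_attrs}
    keys = list(varying)
    return [{**fixed, **dict(zip(keys, combo))}
            for combo in product(*(varying[k] for k in keys))]

jobs = expand_with_dir(
    {"inputdir": "/scratch/stage_1", "var": ["precip", "temp"]},
    dir_attrs={"inputdir"},
)
print(jobs)  # two jobs, both sharing the same inputdir binding
```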