Workflow Planning and Execution¶
The workflow engine transforms abstract GeoEDF workflows expressed in YAML syntax into a sequence of possibly parallel workflow jobs that execute connector or processor containers in a target execution environment. The actual workflow planning, data management, and execution is carried out using the Pegasus workflow management system. The workflow engine first converts a GeoEDF YAML workflow into a Pegasus workflow using the Pegasus Python API, following which the Pegasus API is again used to plan and submit the workflow to a target execution site.
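As a schematic illustration of the input format, a two-stage GeoEDF workflow might look like the following. The plugin names and attributes here are invented for the example and do not correspond to actual GeoEDF plugins; consult the workflow syntax documentation for the exact format:

```yaml
# Hypothetical two-stage workflow: stage $1 retrieves files with an input
# connector; stage $2 runs a processor over the files produced by stage $1.
$1:
  Input:
    ExampleHTTPInput:
      url: https://example.org/data/files.txt
$2:
  ExampleProcessor:
    input: $1
```

Each numbered stage becomes one sub-workflow in the converted Pegasus workflow, as described below.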
Each workflow stage (identified by a numeric identifier $n$) is converted into a separate Pegasus sub-workflow, and these sub-workflows are executed in the same sequence. Bookend jobs at either end perform initial setup (temporary scratch directory creation on the execution site, key-pair creation for encrypting sensitive arguments) and final cleanup (transfer of final outputs back to the submission site) respectively.
It should be pointed out that the individual stage sub-workflows cannot be constructed beforehand. The primary reason is that, at planning time, the number of files that may be generated by each connector or processor instance is unknown. For efficiency's sake, each sub-workflow needs to be parallelized so that a connector or processor operates on each of its possible bindings in parallel rather than sequentially.

Efficiency also dictates that the outputs of each workflow stage not be transferred back and forth between the execution and submission sites. Instead, each sub-workflow contains its own bookend jobs that create a stage-specific output directory on the execution site, collect the names of the output files placed in that directory (by the plugins), and return this list of output filenames to the submission site. This list is used in constructing the parallelized sub-workflows for subsequent stages that reference the outputs of this stage.
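The fan-out over bindings can be sketched as follows. This is a minimal illustration with invented helper names, not the actual GeoEDF planner code; it only shows how a list of output filenames collected from a prior stage turns into one parallel job per binding combination:

```python
from itertools import product

def expand_bindings(attribute_bindings):
    """Given a mapping of plugin attribute -> list of candidate values
    (e.g. the output filenames collected from a prior stage), return one
    binding dict per parallel job: the cartesian product of all values."""
    names = list(attribute_bindings)
    return [dict(zip(names, combo))
            for combo in product(*(attribute_bindings[n] for n in names))]

# A processor whose 'input' attribute references the two files produced
# by a prior stage yields two parallel jobs, one per file.
jobs = expand_bindings({"input": ["stage1/a.tif", "stage1/b.tif"],
                        "mode": ["clip"]})
```

Because the filename lists only become known as each stage completes, this expansion has to be deferred until workflow runtime, which is why sub-workflow construction is itself a workflow job.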
The construction of the sub-workflows is itself carried out by a Pegasus workflow job which outputs
the constructed sub-workflow to a Pegasus YAML workflow file. The planning and execution of these
sub-workflow files is carried out by another Pegasus workflow job. Workflow job dependencies are
established so that the sub-workflow creation
job is executed after all prior workflow stages
have completed execution and the list of output files from each prior stage is provided as input to
this creation
job. Similarly, a sub-workflow execution
job is set to only execute after the
corresponding creation
job has completed execution and created the sub-workflow file.
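The resulting dependency structure can be sketched abstractly. The job names below are invented for illustration; the point is only the ordering constraint: each stage's creation job waits on the execution jobs of all prior stages, and each execution job waits on its own creation job:

```python
from graphlib import TopologicalSorter

def stage_dag(num_stages):
    """Model the per-stage job dependencies as node -> predecessors:
    create_n depends on execute_m for every prior stage m < n, and
    execute_n depends on create_n."""
    deps = {}
    for n in range(1, num_stages + 1):
        deps[f"create_{n}"] = {f"execute_{m}" for m in range(1, n)}
        deps[f"execute_{n}"] = {f"create_{n}"}
    return deps

# For two stages this forces a strict chain:
# create_1 -> execute_1 -> create_2 -> execute_2
order = list(TopologicalSorter(stage_dag(2)).static_order())
```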
Pegasus profiles are used to ensure that the creation jobs always execute on the submission site. This is because the creation job essentially executes a Python script that uses the Pegasus Python API to construct the sub-workflow for this stage. Rather than transfer the entire Pegasus library to the execution site, the local Pegasus library on the submission site can be used. Furthermore, the planning and execution of a sub-workflow necessarily occurs on the submission site, so having the sub-workflow file produced and available locally improves the efficiency of this process.
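In the generated workflow file, such a site constraint might appear as a selector profile pinning the creation job to the local (submission) site. The fragment below is a sketch that assumes the Pegasus 5 workflow YAML layout, and the job name is invented for illustration:

```yaml
# Hypothetical fragment of a generated Pegasus workflow file: the selector
# profile pins this job to the "local" (submission) site.
jobs:
  - type: job
    name: create_subworkflow_stage2
    profiles:
      selector:
        execution.site: local
```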
In general, a GeoEDF user does not need to be aware of the specifics of the workflow planning and execution. Gateway specific GeoEDF setup may require some initial configuration of the execution site where jobs are to be executed. Details of the setup process are provided in the Deployment documentation.
Some final points of note:
- Since filter plugins simply output strings rather than files, they need to be written so that the generated values are output to a text file. The GeoEDF framework binds the `target` attribute on a filter plugin to the filepath of this output file.
- In other cases, such as input or processor plugins, the `target` attribute points to the output directory where files generated by the plugin are to be stored.
- When a `dir` modifier is used in one of a plugin's attribute bindings, that specific attribute will only be bound once, irrespective of the number of files produced by the referenced stage. If the plugin contains other attributes with multiple bindings, parallel jobs will be created for this plugin, with each job utilizing the same binding for this specific attribute.
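The `dir` modifier's effect on binding can be sketched as follows. This is an illustrative model, not the GeoEDF implementation: attributes carrying the modifier are bound once to the common parent directory of the referenced stage's outputs, while other multi-valued attributes still fan out into parallel jobs:

```python
import os
from itertools import product

def bind_attributes(bindings, dir_attrs):
    """Bind attributes in dir_attrs once (to the common parent directory
    of the referenced values); expand all other attributes into the
    cartesian product, one binding dict per parallel job."""
    fixed = {a: os.path.commonpath(vals)
             for a, vals in bindings.items() if a in dir_attrs}
    varying = {a: vals for a, vals in bindings.items() if a not in dir_attrs}
    names = list(varying)
    return [dict(fixed, **dict(zip(names, combo)))
            for combo in product(*(varying[n] for n in names))]

# 'data' uses the dir modifier, so both jobs share one directory binding,
# while 'region' still produces two parallel jobs.
jobs = bind_attributes({"data": ["out/s1/a.tif", "out/s1/b.tif"],
                        "region": ["east", "west"]},
                       dir_attrs={"data"})
```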