This code is intended to be used as the basis for derived transformers and Docker images.
- The file named entrypoint.py is expected to be kept for all transformers.
- For each environment (such as Clowder, TERRA REF, or CyVerse) the transformer_class.py file is replaced.
- For each transformer the transformer.py file is replaced.
- Additionally, the entrypoint.py script can be called from a different script, allowing pre- and post-processing (see entrypoint.py below).
It is expected that this arrangement will provide reusable code not only within a single environment, but across transformers in different environments as well.
Create a new repository to hold the code specific to your environment or transformer.
For a new environment:
- create a new transformer_class.py file specific to your environment
- fill in and create any methods and data necessary to support transformers
- if using Docker images, create a new Dockerfile that uses the base_image Docker image as its starting point, add needed executables and libraries, and overwrite the existing transformer_class.py file in your new image
For a new transformer:
- create a new transformer.py file specific to your transformer with the needed function signatures
- add the code to do your work
- if using Docker images, create a new Dockerfile that uses the appropriate starting Docker image, add needed executables and libraries, and overwrite the existing transformer.py file in your new image
- Dockerfile: contains the build instructions for a Docker image
- configuration.py: contains configuration information for transformers. Can be overridden by derived code as long as existing variables aren't lost
- entrypoint.py: entrypoint for the transformers and docker images. More on this file below
- transformer.py: stub of expected transformer interface. More on this file below as well
- transformer_class.py: stub of class used to provide environment for code in transformer.py
Unless documented here, the contents of this file are required by entrypoint.py.
If you are replacing this file with your own version, be sure to keep existing code (and its associated comments).
This file can be executed as an independent script, or called by other Python code.
If calling into this script, the entry point is a function named do_work.
The do_work function expects an instance of argparse.ArgumentParser passed in as its first parameter.
Additional named parameters can also be passed in as kwargs; these are then passed to the new instance of transformer_class.Transformer at initialization.
Calling do_work returns a dict describing the result.
Briefly, the 'code' key of the return value indicates the result of the call, and the presence of an 'error' key indicates an error occurred.
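For example, a wrapper script that adds pre- and post-processing might look like the following sketch. Only the ArgumentParser parameter and the 'code'/'error' keys are described above; the extra keyword argument shown is purely illustrative.

```python
# Sketch of calling do_work() from a wrapper script to allow pre- and
# post-processing. Any extra keyword arguments (an illustrative one is shown)
# are passed to the transformer_class.Transformer instance at initialization.
import argparse

import entrypoint


def main() -> None:
    parser = argparse.ArgumentParser(description='wrapper around a transformer')

    # ... pre-processing happens here ...

    result = entrypoint.do_work(parser, example_setting='illustrative value')

    # ... post-processing happens here ...
    if 'error' in result:
        print("Transformer returned code %s: %s" % (result.get('code'), result['error']))


if __name__ == "__main__":
    main()
```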
To provide environmental context to a transformer, the transformer_class.py file can be replaced with something more meaningful. The transformer_class.py file in this repo defines a class with methods that will be called by entrypoint.py if they're defined. The class methods are not required, but can provide convenient hooks for customization. An instance of this class is passed to the transformer code in transformer.py.
This is the file that performs all the work.
It is expected that this file will be replaced with a meaningful one for particular transformers.
The transformer.py file in this repo contains the functions that can be called by the main transformer script entrypoint.py.
The only required function in this file is the perform_process function.
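As a point of reference, here is a minimal transformer.py sketch. The perform_process() and check_continue() names come from this README; the use of **kwargs, the return shapes, and the comments about conventions are assumptions that depend on what your environment's get_transformer_params() returns.

```python
# Illustrative transformer.py sketch; not the stub shipped in this repo.
# The keyword parameters must match the keys of the dictionary returned by
# transformer_class.Transformer.get_transformer_params() (shown as **kwargs).

def check_continue(transformer, **kwargs):
    """Optional: determines whether processing should continue"""
    # The exact return convention is defined by entrypoint.py; a (code, message)
    # tuple is shown here as an assumption
    return 0, "processing can continue"


def perform_process(transformer, **kwargs) -> dict:
    """Required: performs the work of the transformer"""
    # ... the transformer-specific work goes here ...
    # A dictionary mirroring the do_work() result described above is assumed:
    # a 'code' of 0 for success, or a negative 'code' (-1000 and beyond) plus
    # an 'error' key when something goes wrong
    return {'code': 0}
```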
This is the file that provides the environment for transformers. It is expected that for different environments, this file will be replaced with a meaningful one. For example, in the CyVerse environment this file could be replaced with one containing iRODS support for any files generated by the transformer.
It is the responsibility of this class to appropriately handle any command line arguments for the transformer instance. The easiest way to achieve this is to store the parameters as part of the class instance.
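For illustration, here is a minimal sketch of what a replacement transformer_class.py could look like. The get_transformer_params() and retrieve_files() method names come from the flow described below; the add_parameters() name, the return values, and the dictionary keys shown are assumptions rather than part of a required interface.

```python
# Illustrative sketch of a transformer_class.py; not the stub shipped in this
# repo. Only the hooks documented in this README are meaningful here; the
# add_parameters() name, return values, and dictionary keys are assumptions.
import argparse


class Transformer:
    """Provides the environment for the code in transformer.py"""

    def __init__(self, **kwargs):
        # Extra keyword arguments passed to do_work() are forwarded here
        self.kwargs = kwargs
        self.args = None

    def add_parameters(self, parser: argparse.ArgumentParser) -> None:
        """Hypothetical hook for adding environment-specific command line arguments"""
        parser.add_argument('--example_flag', action='store_true',
                            help='illustrative environment-specific flag')

    def get_transformer_params(self, args: argparse.Namespace, metadata) -> dict:
        """Optional hook: builds the keyword parameters passed to transformer.py functions"""
        # Storing the parsed arguments on the instance is the easiest way to
        # handle command line arguments for this transformer instance
        self.args = args
        # The keys of this dictionary become the parameter names of the
        # transformer.py function calls (illustrative keys shown)
        return {'working_folder': args.working_space, 'full_md': metadata}

    def retrieve_files(self, transformer_params: dict, metadata) -> tuple:
        """Optional hook: downloads any data needed before processing begins"""
        # The return convention is defined by entrypoint.py; a (code, message)
        # tuple is shown here as an assumption
        return 0, "no files need to be retrieved"
```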
In this section we cover the flow of control for a transformer. We assume that the transformer is started by running the entrypoint.py script. A simplified sketch of this flow follows the numbered steps below.
1. Initialization of Parameters: The first thing that happens is the initialization of an instance of argparse.ArgumentParser and the creation of a transformer_class.Transformer instance. The entrypoint.py script adds its parameters, followed by the transformer_class.Transformer instance, and finally the transformer can add its own. The parse_args() method is then called on the ArgumentParser instance and the resulting argument values are stored in memory.
2. Loading of Metadata: One of the parameters required by entrypoint is the path to a JSON file containing metadata. After the parameters are parsed, the entire contents of the JSON file are loaded and stored in memory.
3. Getting Parameters for transformer function calls: If the transformer_class.Transformer instance has a method named get_transformer_params(), it is called with the command line arguments and the loaded metadata. The dictionary returned by get_transformer_params() is used to pass parameters to the functions defined in transformer.py. This allows the customization of parameters between an environment and a transformer. If get_transformer_params() is not defined by transformer_class.Transformer, no additional parameters are passed to the transformer functions.
4. Check to Continue: If the transformer.py file has a function named check_continue(), it is called with the transformer_class.Transformer instance and any parameters defined in the above step. The return value from check_continue() is used to determine whether processing should continue. If the function is not defined, processing continues automatically.
5. Retrieve Files: If the transformer_class.Transformer instance has a method named retrieve_files(), it is called with the dictionary returned by transformer_class.Transformer.get_transformer_params() (see step 3) and the loaded metadata. This allows data to be downloaded once the transformer has determined it can proceed (see step 4). If this method is not defined, processing continues automatically.
6. Processing: The perform_process() function in transformer.py is called with the transformer_class.Transformer instance and any parameters previously defined (see step 3). This performs the actual processing of the data. It's important to note that the dictionary returned in step 3 is used to populate the parameter list of the perform_process() call.
7. Result Handling: The result of the above steps may produce warnings, errors, or successful results. These results can be stored in a file, printed to standard output, and/or returned to the caller of do_work. In the default case that we're exploring here, the return value from do_work is ignored.
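The following condensed sketch ties the steps above together. It is illustrative only and is not the actual entrypoint.py implementation; the hasattr() checks, the interpretation of check_continue()'s return value, the handling of a repeated --metadata argument, and the omission of parameter registration are all assumptions based on the description above.

```python
# Condensed, illustrative view of the flow described above; not the actual
# entrypoint.py code. Hook names match this README; details are assumptions.
import argparse
import json

import transformer
import transformer_class


def simplified_do_work(parser: argparse.ArgumentParser, **kwargs) -> dict:
    # Step 1: create the Transformer instance and parse the command line
    # (registration of entrypoint, environment, and transformer parameters is omitted)
    transformer_instance = transformer_class.Transformer(**kwargs)
    args = parser.parse_args()

    # Step 2: load the metadata from the JSON file(s) named on the command line
    metadata = []
    for metadata_path in args.metadata:
        with open(metadata_path, 'r', encoding='utf-8') as in_file:
            metadata.append(json.load(in_file))

    # Step 3: optionally build the parameters passed to the transformer.py functions
    params = {}
    if hasattr(transformer_instance, 'get_transformer_params'):
        params = transformer_instance.get_transformer_params(args, metadata)

    # Step 4: optionally ask the transformer whether processing should continue
    if hasattr(transformer, 'check_continue'):
        continue_result = transformer.check_continue(transformer_instance, **params)
        # Treating a negative code as "do not continue" is an assumption
        code = continue_result[0] if isinstance(continue_result, tuple) else continue_result
        if code < 0:
            return {'code': code, 'error': 'check_continue indicated processing should stop'}

    # Step 5: optionally retrieve any files needed for processing
    if hasattr(transformer_instance, 'retrieve_files'):
        transformer_instance.retrieve_files(params, metadata)

    # Step 6: perform the actual processing
    result = transformer.perform_process(transformer_instance, **params)

    # Step 7: result handling (write to file, print, and/or return) happens here
    return result
```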
The following command line parameters are defined for all transformers; an illustrative argparse sketch follows the list.
- --debug, -d: (optional parameter) enable debug level logging messages
- -h: (optional parameter) display help message (automatically defined by argparse)
- --info, -i: (optional parameter) enable info level logging messages
- --result: (optional parameter) how to handle the result of processing; one or more comma-separated strings of: all, file, print
- --metadata: mandatory path to file containing JSON metadata; can be specified multiple times
- --working_space: path to folder to use as a working space and file store
- the "file_list" argument contains all additional parameters (which are assumed to be file names but may not be)
Pro Tip - Use the -h parameter against the script or Docker container to see all the command line options for a transformer.
Error return code ranges:
- entrypoint.py returns error values in the range of -1 to -99
- transformer_class.py returns error values in the range of -100 to -999
- transformer.py returns error values in the range of -1000 and beyond