The Lava Job Framework¶
Building and deploying simple lava jobs is fairly straightforward. For more complex requirements, a lava solution may require multiple jobs with associated JSON job specifications, connectors and S3 triggers to be tailored and deployed across multiple lava realms (e.g. dev, test and production). Job payloads also need to be assembled into packages or docker containers and deployed. Doing this by hand is labour intensive and error prone.
The lava job framework provides a suggested way of structuring, building and deploying lava jobs. Its use is completely optional.
The framework provides the following advantages over hand-crafting a complex lava solution.
- Job, connection and s3 trigger specifications can be written in either YAML or JSON in a realm / environment independent way. YAML is easier to read, write and annotate than JSON. YAML formatted samples are provided in the Lava Job Framework Samples section.
- Environment specific information is managed in separate, extensible configuration files.
- The deployable job, connection and s3 trigger specifications and the job payloads can be generated and deployed for a target environment in a single step.
- The framework can automatically build and deploy single file exe, sql and sqlc payloads, multi-file pkg payloads and images for docker jobs. It can be readily integrated into a CI/CD pipeline.
The framework uses a combination of common tools, including GNU make, Python, Jinja and cookiecutter.
It works on Linux and macOS. It can also run inside a docker container, with some limitations. It is not supported on DOS.
Usage Guidelines¶
A key goal of the framework is to facilitate the separation of environment specific configuration details from the structure and logic of a lava based solution. When developing lava solution components it is critical to properly parameterise the various components to allow the same solution to be rapidly migrated from one environment to another (e.g. dev to prod).
The following guidelines should be considered:
- Become familiar with basic Jinja syntax. Jinja is used extensively in lava and the job framework.
- Never hard-code environment specific information into source code. These details should be properly parameterised, with values provided by the environment configuration file at build time. Typical environment specific information includes S3 bucket names, database identifiers, schema names, host names and addresses.
- Complete solution design, implementation and testing in a single development environment. Once done, the configuration file can be cloned for other environments and the parameters adjusted appropriately. Building for these other environments is then quick and easy.
Getting Started¶
Ensure you have GNU make and Python 3.9+ installed. Make will be preinstalled on many Linux systems, or will be available in the distro package repos. For macOS, install the Xcode developer tools.
New in v8.2 (Kīlauea)
The recommended approach to creating a new lava framework project is using the lava-new utility.
# Install the jinlava package if not already installed.
pip install --user jinlava
# Create a new lava framework project.
lava-new my-project-directory
The lava framework itself is packaged as a zip file:
lava-job-framework-<VERSION>.zip. If this is not available, it can be
built by cloning the lava repo and running make tools. The zip file will
be placed in the dist/dev-tools directory.
Note
This was the approach required prior to v8.2 (Kīlauea). It can still be used provided you have access to the lava framework cookiecutter bundle.
# Install cookiecutter
pip install --user --upgrade cookiecutter
# Install the AWS CLI, if not installed.
pip install --user --upgrade awscli
# Create a new lava project. Cookiecutter will issue prompts for a few
# configuration parameters. The parameters "project_name" and "project_dir"
# are particularly important. The others can easily be changed later.
cookiecutter lava-job-framework-<VERSION>.zip
Either approach will prompt the user for some basic configuration options and then create the project structure in the specified directory. It should look like this (items in angle brackets refer to the values provided in response to the prompts):
<project_dir>/
+--> Makefile # Master make file. Try "make help".
+--> bin/ # Miscellaneous utilities.
+--> config/ # Config files - one per environment.
| +--> <env>.yaml # Initial config built by cookiecutter.
+--> etc/ # Miscellaneous support files.
+--> lava-connections/ # Lava connection specifications.
+--> lava-jobs/ # Lava job specifications.
+--> lava-payloads/ # Lava job payloads.
+--> lava-rules/ # Amazon EventBridge specifications.
+--> lava-triggers/ # Lava s3trigger specifications.
+--> misc/ # Non-deployable job related components.
To complete the setup:
# cd to the new project directory
cd <project_dir>
# Initialise the project environment. This will create a virtualenv and install
# a bunch of required components. This can be safely rerun at any time.
make init
# Activate the virtual environment
source venv/bin/activate
Additional Python packages needed by the project can be listed in a
requirements.txt file at the root of the project directory before running
make init. The packages required by the framework itself are already
covered in etc/requirements.txt.
The project is now ready to start creating the various lava components.
Jinja Rendering of Lava Framework Components¶
It is very important to create the lava job framework component specifications
in a way that is environment independent. This allows the same specification
file to generate deployable components for multiple target environments (e.g.
development, testing, production). The framework achieves this by placing all
environment specific parameters into a YAML configuration file in the config
directory and using Jinja rendering to
inject the environment parameters into the specifications at build/deploy time.
The cookiecutter will create an initial skeleton environment configuration file that can be tailored as needed and copied as the basis for other environments. Apart from a small number of variables required for the correct operation of the project framework, configuration files have no predefined structure. The file can contain whatever other variables are required for a given project.
Info
Parameters in the configuration file should not use the lava key. This is
reserved for use by lava itself.
Because lava specifications can contain Jinja markup intended for lava itself, this build/deploy time rendering must use non-standard Jinja delimiters to avoid a clash. For build/deploy time parameter injection, use the following Jinja delimiters.
- <{ ... }> instead of {{ ... }}
- <# ... #> instead of {# ... #}
- <% ... %> instead of {% ... %}
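For illustration, a renderer with these delimiters can be configured using the jinja2 library (this is a sketch; the framework's actual renderer is not shown in this document). Note how runtime markup such as {{ bucket }} passes through untouched:

```python
from jinja2 import Environment

# Jinja environment using the framework's build/deploy time delimiters,
# so markup intended for lava at runtime is left alone.
env = Environment(
    variable_start_string="<{",
    variable_end_string="}>",
    block_start_string="<%",
    block_end_string="%>",
    comment_start_string="<#",
    comment_end_string="#>",
)

template = env.from_string('bucket: "{{ bucket }}"\nprefix: "<{ source }>"')
print(template.render(source="source_a"))
# bucket: "{{ bucket }}"
# prefix: "source_a"
```

Only the `<{ source }>` expression is substituted at build time; the `{{ bucket }}` expression survives as literal text for lava to render later.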
Built-in Rendering Variables¶
In addition to the parameters defined in the environment configuration file, the following variables are also made available to the renderer.
| Name | Type | Description |
|---|---|---|
| env | str | The name of the configuration file (without the .yaml suffix). |
| jinja.ctime | str | Current local date/time in ctime(3) format. |
| jinja.datetime | str | Current local date/time in YYYY-mm-dd HH:MM:SS format. |
| jinja.iso_datetime | str | Current date/time in ISO8601 format. |
| jinja.prog | str | Name of the rendering program. |
| jinja.templates | list[str] | A list of the files being rendered. For framework files, jinja.templates[0] will be the name of the YAML source file relative to the enclosing lava-* directory. |
| jinja.user | str | Current user name. |
| jinja.utc_ctime | str | Current UTC date/time in ctime(3) format. |
| jinja.utc_datetime | str | Current UTC date/time in YYYY-mm-dd HH:MM:SS format. |
| lava.aws.account | str | The AWS account ID. |
| lava.aws.arn() | function | Helper function to assist with constructing AWS ARNs for a limited range of AWS resource types. See below. |
| lava.aws.ecr_uri | str | The base URI for the ECR registry. The repository name needs to be appended to get the repository URI. |
| lava.aws.region | str | The AWS region (e.g. ap-southeast-2). |
| lava.aws.user | str | The AWS user or role name (e.g. user/fred). |
| lava.dag() | function | A helper function for building DAG payloads. |
| lava.realm | dict[str,*] | The realms table entry for the target realm. e.g. <{ lava.realm.s3_temp.split('/')[2] }> is the realm temp bucket. |
The lava.aws.arn(service, resource) function is a helper to generate AWS ARNs. Mostly
these are required for specifying targets for
Amazon EventBridge Rules. The allowed
values for the service and resource arguments are:
| service | resource |
|---|---|
| iam-role | The name of an IAM role. |
| lambda-function | The name of an AWS Lambda function. |
| log-group | The name of a CloudWatch log group. |
| sns-topic | The name of an SNS topic. |
| sqs-queue | The name of an SQS queue. |
For example, this will generate the ARN for the s3trigger Lambda function for the realm:
<{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
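The implementation of lava.aws.arn() is not shown in this document, but a plausible sketch consistent with the service table above looks like this (the region and account values below are illustrative defaults):

```python
# A sketch of what lava.aws.arn() plausibly does for the supported
# service types. Region and account values here are illustrative.
def arn(service: str, resource: str,
        region: str = "ap-southeast-2", account: str = "123456789012") -> str:
    formats = {
        # IAM is a global service, so its ARNs carry no region.
        "iam-role": f"arn:aws:iam::{account}:role/{resource}",
        "lambda-function": f"arn:aws:lambda:{region}:{account}:function:{resource}",
        "log-group": f"arn:aws:logs:{region}:{account}:log-group:{resource}",
        "sns-topic": f"arn:aws:sns:{region}:{account}:{resource}",
        "sqs-queue": f"arn:aws:sqs:{region}:{account}:{resource}",
    }
    return formats[service]

print(arn("lambda-function", "lava-user01-s3trigger"))
# arn:aws:lambda:ap-southeast-2:123456789012:function:lava-user01-s3trigger
```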
Jinja Template Factorisation¶
In more complex projects, it is not uncommon to have several jobs or triggers containing similar or repeated material. This material can be factored out into reusable sub-templates that are included into the main component at build time.
If these reusable sub-templates are located in one of the lava-* directories
and have names ending in .yaml or .json, they must have names starting with
an underscore, or be in a subdirectory with a name starting with underscore,
otherwise the framework will attempt to process them as a complete specification
in their own right.
Jinja provides several mechanisms to facilitate such factorisation:
Jinja Template Inheritance¶
Template inheritance allows construction of a base skeleton template that contains common elements and defines blocks that child templates can override.
It is the most complex of the factorisation methods. Refer to the Jinja documentation for details.
Jinja Template Inclusion¶
The Jinja include statement is useful to include a sub-template and return the rendered contents of that file into the current namespace. These sub-templates are rendered using the environment configuration file in exactly same way as the main component files. They can also receive variable values that are set in the parent file.
The sub-templates can be placed in a subdirectory of the relevant lava job
framework directory or in a common area elsewhere in the project tree. YAML
files that are not full specification files must have names beginning with
underscore or be in a subdirectory with a name beginning with an underscore if
located within one of the lava-* directories.
For example, consider the following s3trigger specification.
description: "Process data received from source_a"
trigger_id: "<{ prefix.s3trigger }>/source_a"
enabled: true
job_id: "<{ prefix.job }>/process/source_a"
bucket: "<{ s3.bucket }>"
prefix: "source_a"
parameters:
vars:
bucket: "{{ bucket }}"
key: "{{ key }}"
This is fine if there is only a single source_a that needs to be handled. If
new sources are added that have similar processing, the s3trigger specification
will be copied multiple times with common material.
An alternative approach is to create a new directory lava-triggers/_common
containing the following file whatever.yaml. This relies on the main template
to set the source variable.
description: "Process data received from <{ source }>"
trigger_id: "<{ prefix.s3trigger }>/<{ source }>"
enabled: true
job_id: "<{ prefix.job }>/process/<{ source }>"
bucket: "<{ s3.bucket }>"
prefix: "<{ source }>"
parameters:
vars:
bucket: "{{ bucket }}"
key: "{{ key }}"
The main s3trigger specification then becomes:
# Set the source for the sub-template
# <% set source='source_a' %>
# Load the sub-template
# <% include '_common/whatever.yaml' %>
Jinja Template Import¶
Jinja2 allows variables (and macros) to be imported from other templates using the import statement. This process is broadly similar to Python imports.
Imported templates don’t have access to the current template variables, just the globals.
For example, consider the following file vars.jinja:
<% set bucket='my-bucket' %>
This can be used in YAML template thus:
# <% from 'vars.jinja' import bucket %>
bucket: "<{ bucket }>"
Alternatively:
# <% import 'vars.jinja' as v %>
bucket: "<{ v.bucket }>"
Note that because vars.jinja does not end in .yaml, the framework will not
confuse it with a specification file.
Jinja Template Self-Configuration¶
The Jinja rendering process is aware of the name of the source file being
rendered and makes this name available for use in the rendering process as the
expression jinja.templates[0]. For example, if the source file is
lava-jobs/dir/file.yaml, this expression will have the value dir/file.yaml.
This allows the contents of the created DynamoDB object to be dependent on the name of the file.
Here is a simple example of how this can be used to create a generic job that avoids embedding specific configuration details.
# Assume the name of this file is lava-jobs/my-db/my-schema/my-table/count.yaml
# Extract database, schema and table names:
# <% set db=jinja.templates[0].split('/')[0] %>
# <% set schema=jinja.templates[0].split('/')[1] %>
# <% set table=jinja.templates[0].split('/')[2] %>
# Now we can use these in our job spec
description: "Count rows in <{ schema }>.<{ table }> in database <{ db }>"
job_id: "<{ prefix.job }>/count/<{ db }>/<{ schema }>.<{ table }>"
type: sqli
owner: "<{ owner }>"
dispatcher: "<{ dispatcher.main }>"
worker: "<{ worker.main }>"
enabled: true
payload: "SELECT count(*) FROM <{ schema }>.<{ table }>"
parameters:
# Lookup a table in the config file to convert db to a connector ID
conn_id: "<{ db_conn_table[db] }>"
This is probably overkill for handling a single table in a single database. However, if the same action is required for multiple tables, the same specification can be copied without modification, provided the file naming structure is setup correctly. Alternatively, symlinks can be used to avoid multiple copies of the same specification. Like so:
lava-jobs
├── Makefile
├── README.md
├── _common
│ └── count.yaml
├── db1
│ ├── schema1
│ │ └── table1
│ │ └── count.yaml -> ../../../_common/count.yaml
│ └── schema2
│ └── table2
│ └── count.yaml -> ../../../_common/count.yaml
└── db2
└── schema3
└── table3
└── count.yaml -> ../../../_common/count.yaml
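The path-splitting logic used in the count.yaml spec above can be checked in plain Python. The path is what the renderer would see as jinja.templates[0] for db1/schema1/table1/count.yaml, relative to the lava-jobs directory:

```python
# Verify the jinja.templates[0] splitting used in the generic count job.
path = "db1/schema1/table1/count.yaml"
db, schema, table = path.split("/")[0:3]
print(db, schema, table)  # db1 schema1 table1
```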
Creating DynamoDB Items¶
The lava-connections, lava-jobs and lava-triggers directories will contain
the source for the lava specification components for the project. The source can
be either JSON (*.json) or YAML (*.yaml) files that will be converted into
JSON and pushed to the appropriate DynamoDB table as part of the deployment
process. It is strongly
recommended to use YAML as it is easier to write, read and annotate with
comments.
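The YAML-to-JSON conversion step can be sketched as follows (the spec content is illustrative, and the real deployment pipeline also applies Jinja rendering first). Note that comments survive in the YAML source but are dropped in the built JSON, which is one reason YAML is the recommended source format:

```python
import json
import yaml  # PyYAML

# Sketch of the build step that converts a YAML specification into the
# JSON pushed to DynamoDB.
spec_yaml = """
# Annotations like this live only in the YAML source.
description: Sample SQL job
enabled: true
parameters:
  conn_id: redshift/dev
"""
spec = yaml.safe_load(spec_yaml)
print(json.dumps(spec, sort_keys=True))
```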
There are samples for the various DynamoDB table entries available in the Lava Job Framework Samples section.
Existing lava configuration entries can be imported from the DynamoDB tables using the lava-dump utility. The resulting files will need to be manually edited to remove realm specific settings and move those to the environment configuration file(s).
Jinja Rendering of DynamoDB Items¶
Jinja rendering of lava framework components as part of the build and deploy process is supported. See Jinja Rendering of Lava Framework Components.
Conditional Deployment of DynamoDB Items¶
In most cases, exactly the same inventory of components should be deployed to all target environments, although the contents may be environment specific. In some limited circumstances, certain components may not need to be deployed to every target environment.
The lava framework will skip deployment of a component if the built JSON
component contains only a null object. This is achieved by wrapping the YAML
source for the object in a Jinja conditional block like so:
# <% if env in ('dev', 'uat') %>
description: Conditional job
job_id: maybe_yes_maybe_no
type: etc ...
# <% endif %>
When this job is built for the dev or uat environments, the resulting JSON
object will be non-null and hence the job will be installed. When it is built
for the prod environment, the contents will render to a null JSON object,
which will be skipped during installation.
The conditional logic can make use of any of the configuration information made available when rendering the item, including the contents of the environment configuration file.
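Assuming the framework's "null object" test is equivalent to YAML-parsing the rendered output, the skip behaviour can be sketched with jinja2 and PyYAML (delimiters configured as described earlier). When the conditional excludes the body, only comment lines remain and the result parses to None:

```python
import yaml  # PyYAML
from jinja2 import Environment

SPEC = """\
# <% if env in ('dev', 'uat') %>
description: Conditional job
job_id: maybe_yes_maybe_no
# <% endif %>
"""

# Renderer using the framework's build/deploy time delimiters.
renderer = Environment(
    variable_start_string="<{", variable_end_string="}>",
    block_start_string="<%", block_end_string="%>",
    comment_start_string="<#", comment_end_string="#>",
)

for target_env in ("dev", "prod"):
    built = yaml.safe_load(renderer.from_string(SPEC).render(env=target_env))
    print(target_env, "deploy" if built is not None else "skip")
# dev deploy
# prod skip
```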
Examples¶
The following example shows an sql job specification:
# --------------------------------------
description: Sample SQL job
dispatcher: <{ dispatcher.none }>
enabled: true
job_id: <{ prefix.job }>/simple-sql
owner: <{ owner }>
payload: <{ prefix.payload }>/simple.sql
type: sql
worker: <{ worker.main }>
# Get the name of the job source file relative to lava-jobs dir.
x-srcfile: <{ jinja.templates[0] }>
# --------------------------------------
# Post job actions
# <% if on_fail %>
on_fail: <{ on_fail }>
# <% endif %>
# <% if on_success %>
on_success: <{ on_success }>
# <% endif %>
# --------------------------------------
parameters:
conn_id: <{ conn.mydb }>
vars:
schema_name: <{ schema.staging }>
All of the values delimited by <{...}>, <%...%> will be obtained from
whichever environment configuration file is used at build/deploy time.
If the configuration file is:
# --------------------------------------
# Lava environment configuration file
realm: "user01"
prefix:
job: "app/demo"
payload: "app/demo"
s3trigger: "app/demo"
owner: "Fred"
worker:
main: "core"
dispatcher:
main: "Sydney"
none: "--"
schedule:
main: "--"
# --------------------------------------
# Connections
conn:
mydb: redshift/dev
# --------------------------------------
# Post-job actions. These can be safely removed if not needed.
on_fail:
- action: email
to: fred@somewhere.com
subject: "ALARM: Job={{job.job_id}}@{{realm.realm}}"
message: "Run {{job.run_id}}: {{result.error}}"
# --------------------------------------
# Custom variables.
schema:
staging: public
The final job will look like this:
{
"description": "Sample SQL job",
"dispatcher": "--",
"enabled": true,
"job_id": "app/demo/simple-sql",
"on_fail": [
{
"action": "email",
"message": "Run {{job.run_id}}: {{result.error}}",
"subject": "ALARM: Job={{job.job_id}}@{{realm.realm}}",
"to": "fred@somewhere.com"
}
],
"owner": "Fred",
"parameters": {
"conn_id": "redshift/dev",
"vars": {
"schema_name": "public"
}
},
"payload": "app/demo/simple.sql",
"type": "sql",
"worker": "core"
}
Creating Amazon EventBridge Rules¶
Lava provides support for triggering jobs from Amazon EventBridge rules via a number of mechanisms:
- Using the lava dispatch helper.
- Using Amazon S3 Event Notifications with Amazon EventBridge.
Also, a project may need to create EventBridge rules to interact with other non-lava elements in the environment.
In each case, EventBridge rules with suitable targets need to be created. The
lava job framework supports this with rule specifications placed in the
lava-rules directory.
Sorry, I couldn't resist.
Anatomy of EventBridge Rules¶
Note
This explanation is for general information only and many details are omitted. Consult AWS documentation for full details.
EventBridge rules are attached to an event bus (typically the default bus) and
contain the following key components:
- A rule name.
- A description.
- An optional event pattern that is matched against incoming events by EventBridge at runtime to determine if the rule should fire or not.
- An optional schedule that specifies a cron style schedule or repetition frequency for the rule to fire.
- Targets for the rule and a definition of what data to send to the targets. A range of target types are supported, including Lambda functions, CloudWatch log groups, SNS topics and SQS queues. While targets are optional, not having any is pretty pointless.
- Tags for the rule.
For the lava job framework, these elements are defined in a rule specification file.
Rule Specification Files¶
Rule specification files are YAML (or JSON, if you must) formatted and placed in
the lava-rules directory. These files are Jinja rendered against the specified
environment configuration file as for other YAML job framework components and
deployed to EventBridge by the lava job framework.
A sample rule specification file is provided here.
Each file has the following keys:
| Key | Type | Required | Description |
|---|---|---|---|
| description | str | Yes | A short description of the rule. |
| enabled | Boolean | No | Whether or not the rule is enabled. Defaults to false |
| event_bus_name | str | No | The event bus name. The default is default. |
| event_pattern | dict | No | The pattern used to select which events trigger the rule. See the AWS documentation for details. |
| owner | str | Yes | Name or email address of the rule owner. This will be added as a tag on the rule when deployed. |
| role_arn | str | No | The ARN of the IAM role associated with the rule. See the AWS documentation for details. |
| rule_id | str | Yes | The rule name. This must be of the form lava.<REALM>.*. |
| schedule_expression | str | No | A cron style schedule or repetition frequency for the rule to fire. See the AWS documentation for details. |
| tags | dict | No | A dictionary of key/value pairs that will be added as tags to the rule. These are additional to the owner and control tags added by the lava framework. |
| targets | list | No | A list of targets for the rule. If omitted, the rule may fire but nothing will happen. See Specifying Rule Targets. |
Specifying Rule Targets¶
A rule target is a resource to which EventBridge sends an event message when a rule fires. Rules can have zero or more targets. Consult the AWS documentation for details.
Rule specification files may contain the targets
key which is a list of targets for the rule. Each entry in the list specifies
the resource or endpoint and any additional parameters required for that
endpoint.
The format for each entry in the targets list can be either:
- The ARN of a target resource.
- A full target specification using the structure specified for a target in the boto3 EventBridge put_targets function (camel case and all).
In the first case, the incoming event is forwarded, unmodified, to the resource specified by the ARN. This is suitable for using EventBridge to trigger lava jobs from S3 events, among other uses. The lava job framework provides Jinja helper functions to assist with constructing ARNs.
In the second case, the specification provides full control over the target configuration, including the nature of the event message being sent.
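The handling of the two forms can be sketched as a normalisation into the boto3 put_targets structure, which expects entries with at least Id and Arn keys. The Id scheme for bare-ARN entries below is hypothetical; the framework's actual behaviour is not documented here:

```python
# Normalise the two allowed target forms into boto3-style target dicts.
def normalise_target(entry, index):
    if isinstance(entry, str):      # bare ARN form: forward event unmodified
        return {"Id": f"target-{index}", "Arn": entry}
    return entry                    # full boto3-style target specification

targets = [
    "arn:aws:lambda:ap-southeast-2:123456789012:function:lava-user01-s3trigger",
    {"Id": "log-me",
     "Arn": "arn:aws:logs:ap-southeast-2:123456789012:log-group:/aws/events/lava"},
]
normalised = [normalise_target(t, i) for i, t in enumerate(targets)]
print([t["Id"] for t in normalised])  # ['target-0', 'log-me']
```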
Example Rule Specification File¶
The following example is typical of one used to send an S3 bucket event to the realm s3trigger lambda function to dispatch a lava job. It also logs the event to CloudWatch logs.
# rule_id becomes the rule name
rule_id: "<{ prefix.rule }>.s3-rule-example"
# If you forget this, your rule is disabled.
enabled: true
owner: Fred
description: A sample rule
tags:
project: my-great-project
# This will capture object creation in s3://my-bucket/an/interesting/prefix
event_pattern:
detail:
bucket:
name:
- my-bucket
object:
key:
- prefix: an/interesting/prefix
detail-type:
- Object Created
source:
- aws.s3
targets:
# Construct the ARN for the realm s3trigger lambda
- <{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
# Let's log messages in CloudWatch logs
- <{ lava.aws.arn('log-group', '/aws/events/lava') }>
# This does exactly the same as the previous targets using the full target
# format. Don't do both or s3trigger will get 2 events sent
- Id: trigger-me
Arn: <{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
- Id: log-me
Arn: <{ lava.aws.arn('log-group', '/aws/events/lava') }>
Creating Payloads¶
Some job types, such as cmd, dag and sqli, have the payload fully contained within the job specification.
For other job types, such as
exe,
pkg and
sql, the payload is external to the
job specification, which references the
payload content (e.g. as a code bundle in S3 or a docker image repository).
For these, the lava-payloads directory will contain the source for the lava
payloads for the project. The framework currently supports automated build for
the following external payload types:
- Python scripts (*.py)
- Jupyter notebooks (*.ipynb)
- Shell scripts (*.sh)
- SQL scripts (*.sql)
- Docker images for docker jobs (*.docker/)
- Resource directories (*.rsc and *.raw)
Resource Directories¶
Directories directly under lava-payloads with names ending in .rsc or .raw
are static resource directories that contain no active job components but are
uploaded to the payloads area in S3 for consumption by lava jobs as required.
Directories ending in .rsc will have the contents Jinja rendered at
build/deploy time using the specified environment configuration file.
Directories ending in .raw are not Jinja rendered.
In either case, the directory structure is replicated in the project payload
area in S3 under the prefix.payload item from the environment configuration.
Symbolic links are followed as part of the process.
Note that the lava worker will completely ignore these areas in S3. It is up to
individual jobs to download the contents as required. For situations where
static resources need to be accessed locally by a job, it may be more
appropriate to place them directly in the .pkg or .docker directory so
that they are included in the job payload.
An element my-file from a resource directory xyz.rsc can be referenced in a
job specification thus:
{{ realm.s3_payloads }}/<{ prefix.payload }>/xyz.rsc/my-file
DAG Payloads¶
The payload for dag jobs is a map representing job dependencies. The details can be included directly in the job specification. The job framework also provides support for generating this map at build time via the following:
- The lava-dag-gen utility, which is provided in the job framework bin directory.
- A Jinja function, lava.dag(), that calls this utility to generate and interpolate a DAG payload at build time.
Note
The lava framework cannot easily tell if a job using the lava.dag()
function needs to be rebuilt as it may depend on external data. Hence, the
framework will always rebuild job specifications that use this function.
The following example shows how to use the Jinja function:
description: A daggy job
type: dag
job_id: "<{ prefix.job }>/dag/demo"
worker: "<{ worker.main }>"
enabled: true
owner: "<{ owner }>"
parameters:
workers: 2
# Generate the dag payload by reading the first tab in Excel file dag.xlsx
payload: "<{ lava.dag('dag.xlsx') }>"
The first (positional) argument to the lava.dag() function corresponds to the
source argument of the
lava-dag-gen utility.
The lava.dag() function also supports keyword arguments that match the
--option value command line options of the
lava-dag-gen utility, although not all of
these are useful in a lava framework job specification.
The following example shows how to generate the dag payload by reading dependencies from a database using a lava connector:
description: A daggy job
type: dag
job_id: "<{ prefix.job }>/dag/demo"
worker: "<{ worker.main }>"
enabled: true
owner: "<{ owner }>"
parameters:
workers: 2
# Generate the dag payload by reading a database table. Note that the realm
# value from the framework configuration file is used.
payload: "<{ lava.dag('a_conn_id', group='a_batch', table='a_schema.dags', realm=realm) }>"
Note that the lava.dag() function actually returns a JSON formatted string.
This works in a YAML source file because valid JSON is also valid YAML. Neat eh?
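The JSON-is-YAML property can be demonstrated directly (the DAG structure below is illustrative only; the real payload format is produced by lava-dag-gen):

```python
import json
import yaml  # PyYAML

# Any valid JSON document is also valid YAML, so a JSON string
# interpolated into a YAML spec parses cleanly.
dag_json = json.dumps({"load": [], "transform": ["load"], "report": ["transform"]})
as_yaml = yaml.safe_load(dag_json)
assert as_yaml == json.loads(dag_json)
print(as_yaml["report"])  # ['transform']
```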
Info
Using the lava.dag() function with a lava database connector requires that
the lava package is installed in the framework virtual environment.
Docker Payloads¶
Warning
Lava version 8.1 (Kīlauea) introduced some important changes in this area. It is essential to read Backward Compatibility Notes for Docker Payloads if running an earlier version.
Directories directly under lava-payloads with names ending in .docker are
assumed to contain the code for lava docker
jobs.
The build process is essentially:
- Create a clean copy of the source tree.
- Any files in the env/ directory of the source tree are Jinja rendered using the environment configuration file. This provides one possible mechanism to include environment specific information in the build.
- Any Jupyter notebooks (*.ipynb) are converted to Python.
- If the source directory already contains a Dockerfile, that will be Jinja rendered using the environment configuration file and used to build the image.
- If the source directory does not contain a Dockerfile, a default one is used.
Info
These components are placed in the container in /lava and owned by the
user lava. However, by default, the container will be run with the
effective user ID of the lava worker. This is required so that any items
left by the container in the $LAVA_TMP area can be read by the worker. Take
care when building containers to account for the different user IDs at build
and run time.
The install process will create an appropriate ECR repo and push the image. The uninstall process will delete the ECR repo.
The Default Dockerfile¶
The default Dockerfile supplied with the framework should suffice in most
cases. It effectively emulates the packaging process for pkg payloads but builds
a docker image instead of a zip file.
The payload files are installed in the /lava directory in the container. The
files are owned by root and are globally readable inside the container. Any
files that are user executable in the source directory are made globally
executable inside the container.
Info
The /lava directory is not added to any *PATH environment
variables by default.
If the root directory of the source tree contains a requirements.txt file,
then Python modules listed therein, including any dependencies, are installed as
part of the image build. If the root directory contains a
requirements-nodeps.txt file, then Python modules listed therein, excluding
any dependencies, are included.
If the default Dockerfile is not adequate, a custom one can be created. A
simple Dockerfile might look something like the following, but keep in mind
the runtime configuration to ensure permissions
are set correctly inside the container when building the image.
FROM ghcr.io/jin-gizmo/lava/amzn2023/base
# Copy our code into the image
COPY * /install/
# Point at the right pip repo. The Makefile will supply the value.
ARG PIP_INDEX_URL
ENV PIP_INDEX_URL $PIP_INDEX_URL
RUN \
    cd /install ; \
    echo My code is here ; \
    ls -lR ; \
    python3 -m pip install -r requirements.txt --upgrade
Docker Platform Architecture Selection¶
As of version 8.1 (Kīlauea), the lava job framework supports building docker payload images for a specific target platform architecture.
Info
Currently, the capability to generate cross-platform images is only supported when using Docker Desktop with multi-platform support enabled.
Image platform selection is controlled by the docker->platform key in the
environment configuration file. This key may have one of the following values.
| Docker platform | Description |
|---|---|
| host | Use the default behaviour of the build host docker platform. The platform selected will depend on some combination of the architecture of the base image and the build host, as is usual for docker. |
| linux/amd64 | Build an image for x86_64 platforms. |
| linux/arm64 | Build an image for ARM platforms, such as Mac M series and AWS Graviton. |
| unspecified | Build an image for x86_64 platforms. |
For a cross-platform build to work as expected, the base image must either be a multi-platform image or have itself been built for the target platform. Most standard operating system base images, such as Amazon Linux 2023 and Ubuntu Linux are multi-platform. As of version 8.1 (Kīlauea), the lava docker images are also multi-platform.
Compatibility Notes for Docker Payloads¶
This is a bit complicated but please bear with me ...
To understand platform compatibility when deploying a docker payload, the fundamental principle is that the docker image must contain a platform version that matches the host running the lava worker.
If lava workers are being run on x86 AWS EC2 instances
(linux/amd64 in docker terminology), job payload docker images must be, or
contain, a linux/amd64 version.
This, in turn, implies that the base image for the payload is either:
- A single platform linux/amd64 image; or
- A multi-platform image that:
  - includes a linux/amd64 platform version; and
  - is used by a build process that, implicitly or explicitly, directs the use of the linux/amd64 platform version.
If every machine in the dev / build / run chain is x86, no problems. That was the world view for lava versions prior to version 8.1. The lava docker images, commonly used as payload base images, were built only for x86. Any derived images would inevitably be x86.
Unfortunately, if a multi-platform base image, such as any of the common
operating system base images, was used on a M-series Mac build machine, the
result would be an ARM (linux/arm64) payload which would not run on an x86 AWS
EC2 worker. The lava job framework provided no way to specify what output
architecture was required.
It also meant that the lava docker images could not run on ARM machines, except under emulation.
Lava version 8.1 (Kīlauea) introduced some key changes in this area:
- The lava docker images are multi-platform images supporting x86 (linux/amd64) and ARM (linux/arm64).
- The lava job framework includes the ability to explicitly specify the target platform for docker payloads, rather than relying on some implicit combination of the platform types available in the base image and the platform type of the build host.
So far, so good.
New projects using the v8.1 lava job framework allow the user to control the
target platform using the docker->platform key in the environment
configuration file. It defaults to linux/amd64. This should work fine on x86
and M-series Mac build machines using Docker
Desktop with emulation.
What happens with existing projects that use an older version of the lava job framework, I hear you ask? It depends:
- Existing, deployed docker payloads and projects without docker payloads.
  No impact.
- Rebuilding and deploying docker payloads from an x86 build host.
  No impact.
- Rebuilding and deploying from an ARM build host (e.g. M-series Mac).
  This (probably) would have worked prior to v8.1. Now, it will not. The lava job framework version must be updated to v8.1 (or later). See Updating the Framework in an Existing Project. The docker->platform key should be added to the config/*.yaml files, but will default to linux/amd64 if not present.
Exe Payloads¶
Single file Python and Shell scripts directly under lava-payloads are copied
as is when deployed.
SQL scripts are Jinja rendered at build/deploy time using the specified environment configuration file, in the same way as the DynamoDB table specifications.
Jupyter notebooks are converted to Python scripts for deployment.
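The Jinja rendering step can be illustrated with a short sketch. This is not the framework's actual implementation; the template text and configuration values below are invented, and it assumes the jinja2 package is available.

```python
# Illustrative sketch only: how a SQL script might be Jinja rendered
# against values from an environment configuration. All values are made up.
from jinja2 import Template  # assumes the jinja2 package is installed

# Stand-in for values that would be loaded from a config/<ENV>.yaml file
config = {"schema": "analytics_dev", "s3_bucket": "my-dev-bucket"}

sql_template = "CREATE TABLE {{ schema }}.events (payload VARCHAR(65535));"
rendered = Template(sql_template).render(**config)
print(rendered)
# CREATE TABLE analytics_dev.events (payload VARCHAR(65535));
```

Swapping the configuration file per environment is what lets the same template deploy to dev, test and production unchanged.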
Pkg Payloads¶
Directories directly under lava-payloads with names ending in .pkg are
assumed to contain the code for lava pkg jobs.
The build process is essentially:
- Create a clean copy of the source tree.
- Any files in the env/ directory of the source tree are Jinja rendered using the environment configuration file. This provides one possible mechanism to include environment specific information in the build.
- Any Jupyter notebooks (*.ipynb) are converted to Python.
- If the root directory of the source tree contains a requirements.txt file, then the Python modules listed therein, including any dependencies, are included.
- If the root directory contains a requirements-nodeps.txt file, then the Python modules listed therein, excluding any dependencies, are included.
- Zip up everything and place it in the dist area of the project.
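The copy and zip steps of this process can be sketched in a few lines of Python. This is illustrative only (names such as build_pkg are invented); the framework's Makefiles perform the real build, including the rendering, notebook conversion and dependency steps.

```python
# Illustrative sketch of the copy-and-zip portion of the pkg build.
# Directory names are made up; the framework's Makefiles do the real work.
import shutil
import tempfile
from pathlib import Path

def build_pkg(src: str, dist: str) -> Path:
    """Copy the source tree to a clean build area and zip it into dist."""
    build = Path(tempfile.mkdtemp()) / Path(src).name
    shutil.copytree(src, build)            # 1. clean copy of the source tree
    # ... rendering, notebook conversion and pip installs happen here ...
    Path(dist).mkdir(parents=True, exist_ok=True)
    out = Path(dist) / Path(src).name      # e.g. dist/myjob.pkg -> .zip added
    return Path(shutil.make_archive(str(out), "zip", build))
```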
Miscellaneous Components¶
Lava jobs sometimes require other components that may, or may not, be deployed as part of a job but which don't naturally belong in the lava payloads area in S3.
For example, jobs may require some tables to be pre-created before the job runs. The SQL to create the tables would be one such miscellaneous component. Another example might be JSONPath files for a Redshift COPY operation for JSON data.
These components can be placed in the misc (miscellaneous) directory.
Any SQL scripts (*.sql) placed in the misc directory are
Jinja rendered at build/deploy time into
the dist directory using the specified environment configuration file, in the
same way as the DynamoDB table specifications.
By default, no other build or installation action is performed for anything in
the misc directory.
Info
Do not edit misc/Makefile as this file will be replaced in the event of
a framework update.
If some additional build or installation action is
required, the appropriate means to achieve this is to create a custom makefile
Makefile.local. This will be detected by the framework and invoked. This
makefile must implement the following targets, although they don't have to
do anything if not required:
- dist
- pre-install
- install
- uninstall
The recommended approach is to copy the file misc/Makefile.local.sample to
misc/Makefile.local and customise as required.
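If no extra action is needed for a given target, a minimal Makefile.local can provide no-op recipes. The sketch below is illustrative (the shipped Makefile.local.sample may differ); note that make recipe lines must be tab-indented.

```make
# Minimal misc/Makefile.local sketch: no-op recipes for the required targets.
.PHONY: dist pre-install install uninstall

dist:
	@true
pre-install:
	@true
install:
	@true
uninstall:
	@true
```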
Building the Deployable Components¶
Once the lava components are created, the installable components are created thus:
# cd to the project root directory then ...
# Activate the virtualenv
source venv/bin/activate
# Build the lava artefacts
make dist env=<ENV>
The value of the env parameter must correspond to one of the environment
configuration YAML files in the config directory.
The deployable components will be built and placed in the dist/<ENV>
directory.
Installing Deployable Components¶
The lava components can be installed using:
# cd to the project root directory then ...
# Activate the virtualenv
source venv/bin/activate
# Deploy the lava artefacts
make install env=<ENV>
This will do the following:
- Build any out-of-date artefacts.
- Perform some basic pre-installation checks (e.g. verify permission to write to the payloads area in S3).
- Back up any existing payloads in the realm S3 bucket under the __bak__ prefix.
- Deploy the DynamoDB table entries and payload components.
Warning
No backup is made of existing DynamoDB entries prior to uploading new ones.
To perform an installation without the pre-installation checks use:
# Deploy the lava artefacts without pre-install checks.
make _install env=<ENV>
Uninstalling Deployable Components¶
The lava components can be uninstalled using:
# cd to the project root directory then ...
# Activate the virtualenv
source venv/bin/activate
# Remove the lava artefacts
make uninstall env=<ENV>
To clean up the local dist area:
# cd to the project root directory then ...
make clean
Health Checking Deployable Components¶
See also Maintaining DynamoDB Table Entries.
Code Hygiene¶
The lava job framework incorporates some basic code health checks. The checks can be run using:
make check
# or ...
etc/git-hooks/pre-commit
The checks are also run prior to any installation process. Installation is blocked if the checks fail.
If the framework was used to automatically initialise Git for the project then the checking process is also configured as a pre-commit hook.
| Check Type | Tool | Description |
|---|---|---|
| Python quality | flake8 | Performs a range of PEP8 compliance and other code health checks, including compliance with black formatting. The configuration file for flake8 is contained in .flake8 and for black in pyproject.toml. |
| YAML correctness | yamllint | Performs correctness and style checks on the project YAML files. The configuration file is in .yamllint.yaml. |
| Config alignment | Builtin | Compares the key structures in the configuration files in the config directory and highlights any differences. Generally, configuration files for a project correlate to different target realms (e.g. test vs prod). While the configuration values will vary by environment, the key hierarchies should be identical. The only configuration option is the choice between warning and strict modes which is specified in etc/git-hooks/pre-commit. |
The following command will apply black formatting to project Python files:
make black
# or ...
black lava-payloads misc
# or even ...
black
Configuration Drift Detection¶
Changes to a lava job framework based project should always be done via a make
install from an appropriately managed Git repo to ensure that the deployed
components are fully aligned with the committed contents of the repo.
Deviation from this practice can result in misalignment between deployed components and the repo contents; aka drift.
The lava job framework supports drift detection for the DynamoDB table entries. To detect differences between the repo contents and the deployed table entries, run the following command:
make diff env=...
Note that fields starting with x- / X- are excluded from drift comparisons.
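The x- / X- exclusion can be illustrated with a short sketch. This is not the framework's actual diff logic; the function name and field names below are invented.

```python
# Illustrative sketch: compare a repo job spec with a deployed DynamoDB
# entry, ignoring any fields whose names start with x- or X-.
def drift_fields(repo: dict, deployed: dict) -> set:
    """Return the names of top-level fields that differ, excluding x-/X- fields."""
    keys = {
        k for k in repo.keys() | deployed.keys()
        if not k.lower().startswith("x-")
    }
    return {k for k in keys if repo.get(k) != deployed.get(k)}

repo_spec = {"job": "load", "schedule": "daily", "x-note": "local only"}
deployed = {"job": "load", "schedule": "hourly", "X-Audit": "2024-01-01"}
print(drift_fields(repo_spec, deployed))  # {'schedule'}
```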
Updating the Framework in an Existing Project¶
The lava framework can be updated for an existing project by obtaining the new
framework package lav-job-framework-<NEW-VERSION>.zip and applying it over the
top of the project.
This process is automated by the framework itself. A backup is made first as
part of the process in case of problems. However it is strongly recommended to
do a git commit and git push before starting the process.
The update process is relatively straightforward when updating from a framework version of 5.1.0 (Tungurahua) or above. Updating earlier versions is possible with a little bit of fiddling.
Updating from Lava Version 5.1.0 (Tungurahua) or Above¶
The process is:
# Go to the project root directory. Then ...
# Commit and push your code just in case. Then ...
# Deposit the new package at the root of the project directory. Then ...
# Activate the virtual environment
source venv/bin/activate
# Run the update process
make update pkg=lav-job-framework-<NEW-VERSION>.zip
This will do a backup of the project into a zip file, rerun the cookiecutter using the new package and apply the new framework components over the existing project.
Updating from Lava Versions Prior to 5.1.0 (Tungurahua)¶
The process is:
# Go to the project root directory. Then ...
# Commit and push your code just in case. Then ...
# Deposit the new package at the root of the project directory. Then ...
# Extract the `bin` directory from the new framework package
# The quotes are important here.
unzip -j -d bin lav-job-framework-<NEW-VERSION>.zip '*bin/*'
chmod u+x bin/*
# Activate the virtual environment
source venv/bin/activate
# Run the update process
PATH=$(pwd)/bin:$PATH make update pkg=lav-job-framework-<NEW-VERSION>.zip
Note that later versions of the framework move the framework's
requirements.txt file into the etc directory. After the update the
requirements.txt in the base directory can be deleted if there are no locally
added packages. If there are, only those packages need to be retained in that file.