
The Lava Job Framework

Building and deploying simple lava jobs is fairly straightforward. For more complex requirements, a lava solution may require multiple jobs with associated JSON job specifications, connectors and S3 triggers to be tailored and deployed across multiple lava realms (e.g. dev, test and production). Job payloads also need to be assembled into packages or docker containers and deployed. This can be labour intensive and error prone.

The lava job framework provides a suggested way of structuring, building and deploying lava jobs. Its use is completely optional.

The framework provides the following advantages over hand-crafting a complex lava solution.

  • Job, connection and s3 trigger specifications can be specified in either YAML or JSON in a realm / environment independent way. YAML is easier to read, write and annotate than JSON. YAML formatted samples are provided in the Lava Job Framework Samples section.

  • Environment specific information is managed in separate, extensible configuration files.

  • The deployable job, connection and s3 trigger specifications and the job payloads can be generated and deployed for a target environment in a single step.

  • The framework can automatically build and deploy single file exe, sql and sqlc payloads, multi-file pkg payloads and images for docker jobs. It could be readily integrated into a CI/CD pipeline.

The framework is driven by a combination of common tools, principally GNU make, Python and Jinja (see Getting Started).

It works on Linux and macOS. It can also run inside a docker container, with some limitations. It is not supported on DOS.

Usage Guidelines

A key goal of the framework is to facilitate the separation of environment specific configuration details from the structure and logic of a lava based solution. When developing lava solution components it is critical to properly parameterise the various components to allow the same solution to be rapidly migrated from one environment to another (e.g. dev to prod).

The following guidelines should be considered:

  1. Become familiar with basic Jinja syntax. Jinja is used extensively in lava and the job framework.

  2. Never hard-code any environment specific information into source code. These details should be properly parameterised and values provided by the environment configuration file at build time. Typical environment specific information includes S3 bucket names, database identifiers, schema names, host names and addresses etc.

  3. Complete solution design, implementation and testing in a single development environment. Once done, the configuration file can be cloned for other environments and the parameters adjusted appropriately. Building for these other environments is then a quick and easy process.
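
Guideline 2 can be sketched as follows; the configuration keys and values here are hypothetical, not part of the framework:

```yaml
# config/dev.yaml (hypothetical keys and values)
s3:
  landing_bucket: my-dev-landing-bucket
schema:
  staging: staging_dev

# A job or trigger specification can then reference these at build time,
# e.g. bucket: "<{ s3.landing_bucket }>" and schema: "<{ schema.staging }>".
# The same specification then builds unchanged for prod using config/prod.yaml.
```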

Getting Started

Ensure you have GNU make and Python 3.9+ installed. Make will be preinstalled on many Linux systems, or will be available in the distro package repos. For macOS, install the Xcode developer tools.

New in v8.2 (Kīlauea)

The recommended approach to creating a new lava framework project is using the lava-new utility.

# Install the jinlava package if not already installed.
pip install --user jinlava

# Create a new lava framework project.
lava-new my-project-directory

The lava framework itself is packaged as a zip file: lava-job-framework-<VERSION>.zip. If this is not available, it can be built by cloning the lava repo and running make tools. The zip file will be placed in the dist/dev-tools directory.

Note

This was the approach required prior to v8.2 (Kīlauea). It can still be used provided you have access to the lava framework cookiecutter bundle.

# Install cookiecutter 
pip install --user --upgrade cookiecutter

# Install the AWS CLI, if not installed.
pip install --user --upgrade awscli

# Create a new lava project. Cookiecutter will issue prompts for a few
# configuration parameters. The parameters "project_name" and "project_dir"
# are particularly important. The others can easily be changed later.

cookiecutter lava-job-framework-<VERSION>.zip

Either approach will prompt the user for some basic configuration options and then create the project structure in the specified directory. It should look like this (items in angle brackets refer to the values provided in response to cookiecutter prompts):

<project_dir>/
        +--> Makefile               # Master make file. Try "make help".
        +--> bin/                   # Miscellaneous utilities.
        +--> config/                # Config files - one per environment.
        |       +--> <env>.yaml     # Initial config built by cookiecutter.
        +--> etc/                   # Miscellaneous support files.
        +--> lava-connections/      # Lava connection specifications.
        +--> lava-jobs/             # Lava job specifications.
        +--> lava-payloads/         # Lava job payloads.
        +--> lava-rules/            # Amazon EventBridge specifications.
        +--> lava-triggers/         # Lava s3trigger specifications.
        +--> misc/                  # Non-deployable job related components.

To complete the setup:

# cd to the new project directory
cd <project_dir>

# Initialise the project environment. This will create a virtualenv and install
# a bunch of required components. This can be safely rerun at any time.
make init

# Activate the virtual environment
source venv/bin/activate

If the project requires any non-standard Python packages, create a suitable requirements.txt file at the root of the project directory before running make init. The packages required by the framework itself are already covered in etc/requirements.txt.

The project is now ready to start creating the various lava components.

Jinja Rendering of Lava Framework Components

It is very important to create the lava job framework component specifications in a way that is environment independent. This allows the same specification file to generate deployable components for multiple target environments (e.g. development, testing, production). The framework achieves this by placing all environment specific parameters into a YAML configuration file in the config directory and using Jinja rendering to inject the environment parameters into the specifications at build/deploy time.

The cookiecutter will create an initial skeleton environment configuration file that can be tailored as needed and copied as the basis for other environments. Apart from a small number of variables required for the correct operation of the project framework, configuration files have no predefined structure. The file can contain whatever other variables are required for a given project.

Info

Parameters in the configuration file should not use the lava key. This is reserved for use by lava itself.

Because lava specifications can contain Jinja markup intended for lava itself, this build/deploy time rendering must use non-standard Jinja delimiters to avoid a clash. For build/deploy time parameter injection, use the following Jinja delimiters.

  • <{...}> instead of {{...}}

  • <#...#> instead of {#...#}

  • <%...%> instead of {%...%}
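
For example, both sets of delimiters can coexist in one specification; the build-time markers are resolved when the component is built, while the standard Jinja markup is left intact for lava's own runtime rendering:

```yaml
# Resolved at build/deploy time from the environment configuration file:
job_id: "<{ prefix.job }>/example"
# <# This comment disappears at build time. #>
parameters:
  vars:
    # Left untouched at build time; rendered by lava at run time:
    key: "{{ key }}"
```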

Built-in Rendering Variables

In addition to the parameters defined in the environment configuration file, the following variables are also made available to the renderer.

Name Type Description
env str The name of the configuration file (without the .yaml suffix).
jinja.ctime str Current local date/time in ctime(3) format.
jinja.datetime str Current local date/time in YYYY-mm-dd HH:MM:SS format.
jinja.iso_datetime str Current date/time in ISO8601 format.
jinja.prog str Name of the rendering program.
jinja.templates list[str] A list of the files being rendered. For framework files, jinja.templates[0] will be the name of the YAML source file relative to the enclosing lava-* directory.
jinja.user str Current user name.
jinja.utc_ctime str Current UTC date/time in ctime(3) format.
jinja.utc_datetime str Current UTC date/time in YYYY-mm-dd HH:MM:SS format.
lava.aws.account str The AWS account ID.
lava.aws.arn() function Helper function to assist with constructing AWS ARNs for a limited range of AWS resource types. See below.
lava.aws.ecr_uri str The base URI for the ECR registry. The repository name needs to be appended to get the repository URI.
lava.aws.region str The AWS region (e.g. ap-southeast-2).
lava.aws.user str The AWS user or role name (e.g. user/fred).
lava.dag() function A helper function for building DAG payloads.
lava.realm dict[str,*] The realms table entry for the target realm. e.g. <{ lava.realm.s3_temp.split('/')[2] }> is the realm temp bucket.
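
To make the date/time formats concrete, here is a rough Python sketch of how such a variable set could be assembled (the function and its arguments are illustrative, not the framework's actual code):

```python
import os
import time
from datetime import datetime, timezone

def build_jinja_vars(prog: str, templates: list[str]) -> dict:
    """Assemble a dict resembling the jinja.* variables in the table above."""
    now = datetime.now()
    utc = datetime.now(timezone.utc)
    return {
        "ctime": time.ctime(),                    # ctime(3) format, local time
        "datetime": now.strftime("%Y-%m-%d %H:%M:%S"),
        "iso_datetime": now.astimezone().isoformat(),
        "prog": prog,
        "templates": templates,
        "user": os.environ.get("USER") or os.environ.get("USERNAME") or "unknown",
        "utc_ctime": time.asctime(time.gmtime()), # ctime(3) format, UTC
        "utc_datetime": utc.strftime("%Y-%m-%d %H:%M:%S"),
    }
```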

The lava.aws.arn(service, resource) function is a helper function to generate AWS ARNs. Mostly these are required for specifying targets for Amazon EventBridge Rules. The allowed values for the service and resource arguments are:

service resource
iam-role The name of an IAM role.
lambda-function The name of an AWS Lambda function.
log-group The name of a CloudWatch log group.
sns-topic The name of an SNS topic.
sqs-queue The name of an SQS queue.

For example, this will generate the ARN for the s3trigger Lambda function for the realm:

<{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
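
The ARNs for these services follow the standard AWS formats, so an equivalent helper can be sketched in a few lines of Python (the account and region values are hypothetical; the real function takes them from the deployment context):

```python
# Standard AWS ARN formats for the supported service types.
# IAM is a global service, so its ARNs carry no region.
ARN_FORMATS = {
    "iam-role": "arn:aws:iam::{account}:role/{resource}",
    "lambda-function": "arn:aws:lambda:{region}:{account}:function:{resource}",
    "log-group": "arn:aws:logs:{region}:{account}:log-group:{resource}",
    "sns-topic": "arn:aws:sns:{region}:{account}:{resource}",
    "sqs-queue": "arn:aws:sqs:{region}:{account}:{resource}",
}

def arn(service: str, resource: str,
        account: str = "123456789012", region: str = "ap-southeast-2") -> str:
    """Build an ARN for one of the supported service types."""
    return ARN_FORMATS[service].format(
        account=account, region=region, resource=resource
    )
```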

Jinja Template Factorisation

In more complex projects, it is not uncommon to have several jobs or triggers containing similar or repeated material. This material can be factored out into reusable sub-templates that are included into the main component at build time.

If these reusable sub-templates are located in one of the lava-* directories and have names ending in .yaml or .json, they must either have names starting with an underscore or be in a subdirectory whose name starts with an underscore; otherwise the framework will attempt to process them as complete specifications in their own right.

Jinja provides several mechanisms to facilitate such factorisation:

  1. Template inheritance

  2. Template inclusion

  3. Template import

  4. Template self-configuration

Jinja Template Inheritance

Template inheritance allows construction of a base skeleton template that contains common elements and defines blocks that child templates can override.

It is the most complex of the factorisation methods. Refer to the Jinja documentation for details.

Jinja Template Inclusion

The Jinja include statement renders a sub-template and inserts the rendered contents of that file into the current template. These sub-templates are rendered using the environment configuration file in exactly the same way as the main component files. They also have access to variable values set in the parent file.

The sub-templates can be placed in a subdirectory of the relevant lava job framework directory or in a common area elsewhere in the project tree. YAML files that are not full specification files must have names beginning with underscore or be in a subdirectory with a name beginning with an underscore if located within one of the lava-* directories.

For example, consider the following s3trigger specification.

description: "Process data received from source_a"
trigger_id: "<{ prefix.s3trigger }>/source_a"

enabled: true

job_id: "<{ prefix.job }>/process/source_a"

bucket: "<{ s3.bucket }>"
prefix: "source_a"

parameters:
  vars:
    bucket: "{{ bucket }}"
    key: "{{ key }}"

This is fine if there is only a single source_a that needs to be handled. If new sources are added that have similar processing, the s3trigger specification must be copied for each source, duplicating common material.

An alternative approach is to create a new directory lava-triggers/_common containing the following file, whatever.yaml. This relies on the main template to set the source variable.

description: "Process data received from <{ source }>"
trigger_id: "<{ prefix.s3trigger }>/<{ source }>"

enabled: true

job_id: "<{ prefix.job }>/process/<{ source }>"

bucket: "<{ s3.bucket }>"
prefix: "<{ source }>"

parameters:
  vars:
    bucket: "{{ bucket }}"
    key: "{{ key }}"

The main s3trigger specification then becomes:

# Set the source for the sub-template
# <% set source='source_a' %>

# Load the sub-template
# <% include '_common/whatever.yaml' %>

Jinja Template Import

Jinja2 allows variables (and macros) to be imported from other templates using the import statement. This process is broadly similar to Python imports.

Imported templates don’t have access to the current template variables, just the globals.

For example, consider the following file vars.jinja:

<% set bucket='my-bucket' %>

This can be used in YAML template thus:

# <% from 'vars.jinja' import bucket %>

bucket: "<{ bucket }>"

Alternatively:

# <% import 'vars.jinja' as v %>

bucket: "<{ v.bucket }>"

Note that because vars.jinja does not end in .yaml, the framework will not confuse it with a specification file.

Jinja Template Self-Configuration

The Jinja rendering process is aware of the name of the source file being rendered and makes this name available for use in the rendering process as the expression jinja.templates[0]. For example, if the source file is lava-jobs/dir/file.yaml, this expression will have the value dir/file.yaml.

This allows the contents of the created DynamoDB object to be dependent on the name of the file.

Here is a simple example of how this can be used to create a generic job that avoids embedding specific configuration details.

# Assume the name of this file is lava-jobs/my-db/my-schema/my-table/count.yaml

# Extract database, schema and table names:
# <% set db=jinja.templates[0].split('/')[0] %>
# <% set schema=jinja.templates[0].split('/')[1] %>
# <% set table=jinja.templates[0].split('/')[2] %>

# Now we can use these in our job spec

description: "Count rows in <{ schema }>.<{ table }> in database <{ db }>"
job_id: "<{ prefix.job }>/count/<{ db }>/<{ schema }>.<{ table }>"
type: sqli
owner: "<{ owner }>"

dispatcher: "<{ dispatcher.main }>"
worker: "<{ worker.main }>"
enabled: true

payload: "SELECT count(*) FROM <{ schema }>.<{ table }>"
parameters:
  # Lookup a table in the config file to convert db to a connector ID
  conn_id: "<{ db_conn_table[db] }>"

This is probably overkill for handling a single table in a single database. However, if the same action is required for multiple tables, the same specification can be copied without modification, provided the file naming structure is set up correctly. Alternatively, symlinks can be used to avoid multiple copies of the same specification. Like so:

lava-jobs
├── Makefile
├── README.md
├── _common
│   └── count.yaml
├── db1
│   ├── schema1
│   │   └── table1
│   │       └── count.yaml -> ../../../_common/count.yaml
│   └── schema2
│       └── table2
│           └── count.yaml -> ../../../_common/count.yaml
└── db2
    └── schema3
        └── table3
            └── count.yaml -> ../../../_common/count.yaml
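
The symlinks in the tree above can be created with plain ln -s; note that the link target is relative to the directory containing the link:

```shell
# Recreate a slice of the tree above (spec contents abbreviated).
mkdir -p lava-jobs/_common lava-jobs/db1/schema1/table1
printf 'description: "Count rows"\n' > lava-jobs/_common/count.yaml

# Three directory levels up from table1/ is the lava-jobs directory.
ln -sf ../../../_common/count.yaml lava-jobs/db1/schema1/table1/count.yaml
```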

Creating DynamoDB Items

The lava-connections, lava-jobs and lava-triggers directories will contain the source for the lava specification components for the project. The source can be either JSON (*.json) or YAML (*.yaml) files that will be converted into JSON and pushed to the appropriate DynamoDB table as part of the deployment process. It is strongly recommended to use YAML as it is easier to write, read and annotate with comments.

There are samples for the various DynamoDB table entries available in the Lava Job Framework Samples section.

Existing lava configuration entries can be imported from the DynamoDB tables using the lava-dump utility. The resulting files will need to be manually edited to remove realm specific settings and move those to the environment configuration file(s).

Jinja Rendering of DynamoDB Items

Jinja rendering of lava framework components as part of the build and deploy process is supported. See Jinja Rendering of Lava Framework Components.

Conditional Deployment of DynamoDB Items

In most cases, exactly the same inventory of components should be deployed to all target environments, although the contents may be environment specific. In limited circumstances, however, some components may not need to be deployed to every target environment.

The lava framework will skip deployment of a component if the built JSON component contains only a null object. This is achieved by wrapping the YAML source for the object in a Jinja conditional block like so:

# <% if env in ('dev', 'uat') %>
description: Conditional job
job_id: maybe_yes_maybe_no
type: etc ...
# <% endif %>

When this job is built for the dev or uat environments, the resulting JSON object will be non-null and hence the job will be installed. When it is built for the prod environment, the contents will generate a null JSON object, which will be skipped during installation.

The conditional logic can make use of any of the configuration information made available when rendering the item, including the contents of the environment configuration file.
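
Conceptually, the deployment-time check amounts to parsing the built JSON and skipping null results; a minimal Python sketch (not the framework's actual code):

```python
import json

def should_deploy(built_json: str) -> bool:
    """Deploy only if the built component is a non-null JSON value."""
    # A fully suppressed component renders to nothing, i.e. a null JSON value.
    return json.loads(built_json or "null") is not None
```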

Examples

The following example shows an sql job specification:

# --------------------------------------
description: Sample SQL job
dispatcher: <{ dispatcher.none }>
enabled: true
job_id: <{ prefix.job }>/simple-sql
owner: <{ owner }>
payload: <{ prefix.payload }>/simple.sql
type: sql
worker: <{ worker.main }>

# Get the name of the job source file relative to lava-jobs dir.
x-srcfile: <{ jinja.templates[0] }>

# --------------------------------------
# Post job actions
# <% if on_fail %>
on_fail: <{ on_fail }>
# <% endif %>

# <% if on_success %>
on_success: <{ on_success }>
# <% endif %>

# --------------------------------------
parameters:
  conn_id: <{ conn.mydb }>
  vars:
    schema_name: <{ schema.staging }>

All of the values delimited by <{...}>, <%...%> will be obtained from whichever environment configuration file is used at build/deploy time.

If the configuration file is:

# --------------------------------------
# Lava environment configuration file

realm: "user01"
prefix:
  job: "app/demo"
  payload: "app/demo"
  s3trigger: "app/demo"
owner: "Fred"
worker:
  main: "core"
dispatcher:
  main: "Sydney"
  none: "--"
schedule:
  main: "--"

# --------------------------------------
# Connections
conn:
  mydb: redshift/dev

# --------------------------------------
# Post-job actions. These can be safely removed if not needed.
on_fail:
  - action: email
    to: fred@somewhere.com
    subject: "ALARM: Job={{job.job_id}}@{{realm.realm}}"
    message: "Run {{job.run_id}}: {{result.error}}"

# --------------------------------------
# Custom variables.

schema:
  staging: public

The final job will look like this:

{
    "description": "Sample SQL job",
    "dispatcher": "--",
    "enabled": true,
    "job_id": "app/demo/simple-sql",
    "on_fail": [
        {
            "action": "email",
            "message": "Run {{job.run_id}}: {{result.error}}",
            "subject": "ALARM: Job={{job.job_id}}@{{realm.realm}}",
            "to": "fred@somewhere.com"
        }
    ],
    "owner": "Fred",
    "parameters": {
        "conn_id": "redshift/dev",
        "vars": {
            "schema": "public"
        }
    },
    "payload": "app/demo/simple.sql",
    "type": "sql",
    "worker": "core"
}

Creating Amazon EventBridge Rules

Lava provides support for triggering jobs from Amazon EventBridge rules via a number of mechanisms.

Also, a project may need to create EventBridge rules to interact with other non-lava elements in the environment.

In each case, EventBridge rules with suitable targets need to be created. The lava job framework supports this with rule specifications placed in the lava-rules directory.


Anatomy of EventBridge Rules

Note

This explanation is for general information only and many details are omitted. Consult AWS documentation for full details.

EventBridge rules are attached to an event bus (typically the default bus) and contain the following key components:

  • A rule name.

  • A description.

  • An optional event pattern that is matched against incoming events by EventBridge at runtime to determine if the rule should fire or not.

  • An optional schedule that specifies a cron style schedule or repetition frequency for the rule to fire.

  • Targets for the rule and a definition of what data to send to the targets. A range of target types are supported, including Lambda functions, CloudWatch log groups, SNS topics and SQS queues. While targets are optional, not having any is pretty pointless.

  • Tags for the rule.

For the lava job framework, these elements are defined in a rule specification file.

Rule Specification Files

Rule specification files are YAML (or JSON, if you must) formatted and placed in the lava-rules directory. These files are Jinja rendered against the specified environment configuration file as for other YAML job framework components and deployed to EventBridge by the lava job framework.

A sample rule specification file is provided here.

Each file has the following keys:

Key Type Required Description
description str Yes A short description of the rule.
enabled Boolean No Whether or not the rule is enabled. Defaults to false.
event_bus_name str No The event bus name. The default is default.
event_pattern dict No The pattern used to select which events trigger the rule. See the AWS documentation for details.
owner str Yes Name or email address of the rule owner. This will be added as a tag on the rule when deployed.
role_arn str No The ARN of the IAM role associated with the rule. See the AWS documentation for details.
rule_id str Yes The rule name. This must be of the form lava.<REALM>.*.
schedule_expression str No A cron style schedule or repetition frequency for the rule to fire. See the AWS documentation for details.
tags dict No A dictionary of key/value pairs that will be added as tags to the rule. These are additional to the owner and control tags added by the lava framework.
targets list No A list of targets for the rule. If omitted, the rule may fire but nothing will happen. See Specifying Rule Targets.

Specifying Rule Targets

A rule target is a resource to which EventBridge sends an event message when a rule fires. Rules can have zero or more targets. Consult the AWS documentation for details.

Rule specification files may contain the targets key which is a list of targets for the rule. Each entry in the list specifies the resource or endpoint and any additional parameters required for that endpoint.

The format for each entry in the targets list can be either:

  1. The ARN of a target resource.

  2. A full target specification using the structure specified for a target in the boto3 EventBridge put_targets function (camel case and all).

In the first case, the incoming event is forwarded, unmodified, to the resource specified by the ARN. This is suitable for using EventBridge to trigger lava jobs from S3 events, among other uses. The lava job framework provides Jinja helper functions to assist with constructing ARNs.

In the second case, the specification provides full control over the target configuration, including the nature of the event message being sent.

Example Rule Specification File

The following example is typical of one used to send an S3 bucket event to the realm s3trigger lambda function to dispatch a lava job. It also logs the event to CloudWatch logs.

# rule_id becomes the rule name
rule_id: "<{ prefix.rule }>.s3-rule-example"

# If you forget this, your rule is disabled.
enabled: true

owner: Fred
description: A sample rule

tags:
  project: my-great-project

# This will capture object creation in s3://my-bucket/an/interesting/prefix
event_pattern:
  detail:
    bucket:
      name:
        - my-bucket
    object:
      key:
        - prefix: an/interesting/prefix
  detail-type:
    - Object Created
  source:
    - aws.s3

targets:

  # Construct the ARN for the realm s3trigger lambda
  - <{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
  # Let's log messages in CloudWatch logs
  - <{ lava.aws.arn('log-group', '/aws/events/lava') }>

  # This does exactly the same as the previous targets using the full target
  # format. Don't do both or s3trigger will get 2 events sent
  - Id: trigger-me
    Arn: <{ lava.aws.arn('lambda-function', 'lava-' + realm + '-s3trigger') }>
  - Id: log-me
    Arn: <{ lava.aws.arn('log-group', '/aws/events/lava') }>

Creating Payloads

Some job types, such as cmd, dag and sqli, have the payload fully contained within the job specification.

For other job types, such as exe, pkg and sql, the payload is external to the job specification, which references the payload content (e.g. as a code bundle in S3 or a docker image repository). For these, the lava-payloads directory will contain the source for the lava payloads for the project. The framework currently supports automated builds for the external payload types described in the following sections.

Resource Directories

Directories directly under lava-payloads with names ending in .rsc or .raw are static resource directories that contain no active job components but are uploaded to the payloads area in S3 for consumption by lava jobs as required.

Directories ending in .rsc will have the contents Jinja rendered at build/deploy time using the specified environment configuration file.

Directories ending in .raw are not Jinja rendered.

In either case, the directory structure is replicated in the project payload area in S3 under the prefix.payload item from the environment configuration. Symbolic links are followed as part of the process.

Note that the lava worker will completely ignore these areas in S3. It is up to individual jobs to download the contents as required. For situations where static resources need to be accessed locally by a job, it may be more appropriate to place them directly in the .pkg or .docker directory so that they are included in the job payload.

An element my-file from a resource directory xyz.rsc can be referenced in a job specification thus:

{{ realm.s3_payloads }}/<{ prefix.payload }>/xyz.rsc/my-file

DAG Payloads

The payload for dag jobs is a map representing job dependencies. The details can be included directly in the job specification. The job framework also provides support for generating this map at build time via the following:

  • The lava-dag-gen utility which is provided in the job framework bin directory.

  • A Jinja function, lava.dag(), that calls this utility to generate and interpolate a DAG payload at build time.

Note

The lava framework cannot easily tell if a job using the lava.dag() function needs to be rebuilt as it may depend on external data. Hence, the framework will always rebuild job specifications that use this function.

The following example shows how to use the Jinja function:

description: A daggy job

type: dag

job_id: "<{ prefix.job }>/dag/demo"
worker: "<{ worker.main }>"
enabled: true
owner: "<{ owner }>"

parameters:
  workers: 2

# Generate the dag payload by reading the first tab in Excel file dag.xlsx
payload: "<{ lava.dag('dag.xlsx') }>"

The first (positional) argument to the lava.dag() function corresponds to the source argument of the lava-dag-gen utility.

The lava.dag() function also supports keyword arguments that match the --option value command line options of the lava-dag-gen utility, although not all of these are useful in a lava framework job specification.

The following example shows how to generate the dag payload by reading dependencies from a database using a lava connector:

description: A daggy job

type: dag

job_id: "<{ prefix.job }>/dag/demo"
worker: "<{ worker.main }>"
enabled: true
owner: "<{ owner }>"

parameters:
  workers: 2

# Generate the dag payload by reading a database table. Note that the realm
# value from the framework configuration file is used.
payload: "<{ lava.dag('a_conn_id', group='a_batch', table='a_schema.dags', realm=realm) }>"

Note that the lava.dag() function actually returns a JSON formatted string. This works in a YAML source file because valid JSON is also valid YAML. Neat eh?

Info

Using the lava.dag() function with a lava database connector requires that the lava package is installed in the framework virtual environment.

Docker Payloads

Warning

Lava version 8.1 (Kīlauea) introduced some important changes in this area. It is essential to read Compatibility Notes for Docker Payloads if running an earlier version.

Directories directly under lava-payloads with names ending in .docker are assumed to contain the code for lava docker jobs.

The build process is essentially:

  1. Create a clean copy of the source tree.

  2. Any files in the env/ directory of the source tree are Jinja rendered using the environment configuration file. This provides one possible mechanism to include environment specific information in the build.

  3. Any Jupyter notebooks (*.ipynb) are converted to Python.

  4. If the source directory already contains a Dockerfile, that will be Jinja rendered using the environment configuration file and used to build the image.

  5. If the source directory does not contain a Dockerfile, a default one is used.

Info

These components are placed in the container in /lava and owned by the user lava. However, by default, the container will be run with the effective user ID of the lava worker. This is required so that any items left by the container in the $LAVA_TMP area can be read by the worker. Take care when building containers to account for the different user IDs at build and run time.

The install process will create an appropriate ECR repo and push the image. The uninstall process will delete the ECR repo.

The Default Dockerfile

The default Dockerfile supplied with the framework should suffice in most cases. It effectively emulates the packaging process for pkg payloads but builds a docker image instead of a zip file.

The payload files are installed in the /lava directory in the container. The files are owned by root and are globally readable inside the container. Any files that are user executable in the source directory are made globally executable inside the container.

Info

The /lava directory is not added to any *PATH environment variables by default.

If the root directory of the source tree contains a requirements.txt file, the Python modules listed therein, including their dependencies, are installed as part of the image build. If the root directory contains a requirements-nodeps.txt file, the modules listed therein are installed without their dependencies.
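
For example (the package names below are purely illustrative, and the no-deps install is presumably the equivalent of pip's --no-deps option):

```text
# requirements.txt - installed together with dependencies
requests>=2.31

# requirements-nodeps.txt - installed without dependencies
some-internal-package==1.2.3
```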

If the default Dockerfile is not adequate, a custom one can be created. A simple Dockerfile might look something like the following; keep in mind the different build-time and runtime user IDs noted above when setting permissions inside the image.

FROM ghcr.io/jin-gizmo/lava/amzn2023/base

# Copy our code into the image
COPY * /install/

# Point at the right pip repo. The Makefile will supply the value.
ARG PIP_INDEX_URL
ENV PIP_INDEX_URL $PIP_INDEX_URL

RUN \
    cd /install ; \
    echo My code is here ; \
    ls -lR ; \
    python3 -m pip install -r requirements.txt --upgrade

Docker Platform Architecture Selection

As of version 8.1 (Kīlauea), the lava job framework supports building docker payload images for a specific target platform architecture.

Info

Currently, the capability to generate cross-platform images is only supported when using Docker Desktop with multi-platform support enabled.

Image platform selection is controlled by the docker->platform key in the environment configuration file. This key may have one of the following values.

Docker platform Description
host Use the build host's default docker behaviour. The platform selected will depend on some combination of the architecture of the base image and the build host, as is usual for docker.
linux/amd64 Build an image for x86_64 platforms.
linux/arm64 Build an image for 64-bit ARM platforms, such as Mac M series and AWS Graviton.
(unspecified) Defaults to linux/amd64, i.e. an image for x86_64 platforms.
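
In the environment configuration file, this is a nested key; for example:

```yaml
# Environment configuration file extract
docker:
  platform: linux/amd64    # or host; see the table above
```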

For a cross-platform build to work as expected, the base image must either be a multi-platform image or have itself been built for the target platform. Most standard operating system base images, such as Amazon Linux 2023 and Ubuntu, are multi-platform. As of version 8.1 (Kīlauea), the lava docker images are also multi-platform.

Compatibility Notes for Docker Payloads

This is a bit complicated but please bear with me ...

To understand platform compatibility when deploying a docker payload, the fundamental principle is that the docker image must contain a platform version that matches the host running the lava worker.

If lava workers are being run on x86 AWS EC2 instances (linux/amd64 in docker terminology), job payload docker images must be, or contain, a linux/amd64 version.

This, in turn, implies that the base image for the payload is either:

  1. A single platform linux/amd64 image; or

  2. A multi-platform image that

    • includes a linux/amd64 platform version; and
    • the build process, implicitly or explicitly, directs the use of the linux/amd64 platform version.

If every machine in the dev / build / run chain is x86, no problems. That was the world view for lava versions prior to version 8.1. The lava docker images, commonly used as payload base images, were built only for x86. Any derived images would inevitably be x86.

Unfortunately, if a multi-platform base image, such as any of the common operating system base images, was used on an M-series Mac build machine, the result would be an ARM (linux/arm64) payload which would not run on an x86 AWS EC2 worker. The lava job framework provided no way to specify the required output architecture.

It also meant that the lava docker images could not run on ARM machines, except under emulation.

Lava version 8.1 (Kīlauea) introduced some key changes in this area:

  1. The lava docker images are multi-platform images supporting x86 (linux/amd64) and ARM (linux/arm64).

  2. The lava job framework includes the ability to explicitly specify the target platform for docker payloads, rather than relying on some implicit combination of the platform types available in the base image and the platform type of the build host.

So far, so good.

New projects using the v8.1 lava job framework allow the user to control the target platform using the docker->platform key in the environment configuration file. It defaults to linux/amd64. This should work fine on x86 and M-series Mac build machines using Docker Desktop with emulation.

What happens with existing projects that use an older version of the lava job framework, I hear you ask? It depends:

  1. Existing, deployed docker payloads and projects without docker payloads.
    No impact.

  2. Rebuilding and deploying docker payloads from an x86 build host.
    No impact.

  3. Rebuilding and deploying from an ARM build host (e.g. M-series Mac).
    This (probably) would have worked prior to v8.1. Now, it will not. The lava job framework version must be updated to v8.1 (or later). See Updating the Framework in an Existing Project. The docker->platform key should be added to the config/*.yaml files; it defaults to linux/amd64 if not present.

Exe Payloads

Single-file Python and shell scripts placed directly under lava-payloads are copied as is when deployed.

SQL scripts are Jinja rendered at build/deploy time using the specified environment configuration file, in the same way as the DynamoDB table specifications.
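
The render step substitutes values from the environment configuration into the SQL template. A minimal sketch using the jinja2 package directly; the template variable and config values are illustrative only, with real values coming from the config/<ENV>.yaml file:

```python
from jinja2 import Template

# Values as they might appear in an environment configuration file (illustrative)
env_config = {'schema': 'dev_stage'}

sql = Template('CREATE TABLE {{ schema }}.events (id BIGINT);').render(**env_config)
print(sql)  # CREATE TABLE dev_stage.events (id BIGINT);
```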

Jupyter notebooks are converted to Python scripts for deployment.

Pkg Payloads

Directories directly under lava-payloads with names ending in .pkg are assumed to contain the code for lava pkg jobs.

The build process is essentially:

  1. Create a clean copy of the source tree.

  2. Any files in the env/ directory of the source tree are Jinja rendered using the environment configuration file. This provides one possible mechanism to include environment specific information in the build.

  3. Any Jupyter notebooks (*.ipynb) are converted to Python.

  4. If the root directory of the source tree contains a requirements.txt file, then Python modules listed therein, including any dependencies, are included.

  5. If the root directory contains a requirements-nodeps.txt file, then Python modules listed therein, excluding any dependencies, are included.

  6. Zip up everything and place it in the dist area of the project.
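
The final step can be sketched as below. This is an illustrative model of step 6 only, with hypothetical paths and function names, not the framework's actual code.

```python
import zipfile
from pathlib import Path

def zip_payload(src: Path, dist: Path) -> Path:
    """Zip everything under src into dist/<name>.zip, preserving relative paths."""
    dist.mkdir(parents=True, exist_ok=True)
    out = dist / (src.name + '.zip')
    with zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(src.rglob('*')):
            if f.is_file():
                zf.write(f, f.relative_to(src))
    return out
```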

Miscellaneous Components

Lava jobs sometimes require other components that may, or may not, be deployed as part of a job but which don't naturally belong in the lava payloads area in S3.

For example, jobs may require some tables to be pre-created before the job runs. The SQL to create the tables would be one such miscellaneous component. Another example might be JSONPath files for a Redshift COPY operation for JSON data.

These components can be placed in the misc (miscellaneous) directory.

Any SQL scripts (*.sql) placed in the misc directory are Jinja rendered at build/deploy time into the dist directory using the specified environment configuration file, in the same way as the DynamoDB table specifications.

By default, no other build or installation action is performed for anything in the misc directory.

Info

Do not edit misc/Makefile as this file will be replaced in the event of a framework update.

If some additional build or installation action is required, the appropriate means to achieve this is to create a custom makefile, Makefile.local. This will be detected by the framework and invoked. This makefile must implement the following targets, although they need not do anything if not required:

  • dist
  • pre-install
  • install
  • uninstall

The recommended approach is to copy the file misc/Makefile.local.sample to misc/Makefile.local and customise as required.

Building the Deployable Components

Once the lava components are created, the installable components are created thus:

# cd to the project root directory then ...

# Activate the virtualenv
source venv/bin/activate

# Build the lava artefacts
make dist env=<ENV>

The value of the env parameter must correspond to one of the environment configuration YAML files in the config directory.

The deployable components will be built and placed in the dist/<ENV> directory.

Installing Deployable Components

The lava components can be installed using:

# cd to the project root directory then ...

# Activate the virtualenv
source venv/bin/activate

# Deploy the lava artefacts
make install env=<ENV>

This will do the following:

  1. Build any out of date artefacts.

  2. Perform some basic pre-installation checks (e.g. verify permission to write to the payloads area in S3).

  3. Backup any existing payloads in the realm S3 bucket under the __bak__ prefix.

  4. Deploy the DynamoDB table entries and payload components.

Warning

No backup is made of existing DynamoDB entries prior to uploading new ones.

To perform an installation without the pre-installation checks use:

# Deploy the lava artefacts without pre-install checks.
make _install env=<ENV>

Uninstalling Deployable Components

The lava components can be uninstalled using:

# cd to the project root directory then ...

# Activate the virtualenv
source venv/bin/activate

# Remove the lava artefacts
make uninstall env=<ENV>

To clean up the local dist area:

# cd to the project root directory then ...

make clean

Health Checking Deployable Components

See also Maintaining DynamoDB Table Entries.

Code Hygiene

The lava job framework incorporates some basic code health checks. The checks can be run using:

make check

# or ...

etc/git-hooks/pre-commit

The checks are also run prior to any installation process. Installation is blocked if the checks fail.

If the framework was used to automatically initialise Git for the project then the checking process is also configured as a pre-commit hook.

  • Python quality (flake8): Performs a range of PEP8 compliance and other code health checks, including compliance with black formatting. The configuration file for flake8 is .flake8 and for black is pyproject.toml.
  • YAML correctness (yamllint): Performs correctness and style checks on the project YAML files. The configuration file is .yamllint.yaml.
  • Config alignment (builtin): Compares the key structures of the configuration files in the config directory and highlights any differences. Generally, the configuration files for a project correlate to different target realms (e.g. test vs prod). While the configuration values will vary by environment, the key hierarchies should be identical. The only configuration option is the choice between warning and strict modes, which is specified in etc/git-hooks/pre-commit.
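
The config alignment idea can be sketched as follows: flatten each config's nested keys into dotted paths and take the symmetric difference. This is an illustrative model with hypothetical function names, not the builtin checker itself.

```python
def key_paths(config: dict, prefix: str = '') -> set[str]:
    """Flatten a nested dict into its set of dotted key paths."""
    paths = set()
    for key, value in config.items():
        path = prefix + key
        paths.add(path)
        if isinstance(value, dict):
            paths |= key_paths(value, path + '.')
    return paths

def config_differences(a: dict, b: dict) -> set[str]:
    """Key paths present in one config but not the other."""
    return key_paths(a) ^ key_paths(b)
```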

The following command will apply black formatting to project Python files:

make black

# or ...

black lava-payloads misc

# or even ...

black

Configuration Drift Detection

Changes to a lava job framework based project should always be done via a make install from an appropriately managed Git repo to ensure that the deployed components are fully aligned with the committed contents of the repo.

Deviation from this practice can result in misalignment between deployed components and the repo contents; aka drift.

The lava job framework supports drift detection for the DynamoDB table entries. To detect differences between the repo contents and the deployed table entries, run the following command:

make diff env=...

Note that fields starting with x- / X- are excluded from drift comparisons.
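
The exclusion can be modelled as a recursive filter applied to both sides before comparing. This is an illustrative sketch with a hypothetical function name, not the framework's actual code.

```python
def strip_extension_fields(obj):
    """Recursively remove dict keys beginning with x- or X-."""
    if isinstance(obj, dict):
        return {k: strip_extension_fields(v) for k, v in obj.items()
                if not k.lower().startswith('x-')}
    if isinstance(obj, list):
        return [strip_extension_fields(v) for v in obj]
    return obj
```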

Updating the Framework in an Existing Project

The lava framework can be updated for an existing project by obtaining the new framework package lav-job-framework-<NEW-VERSION>.zip and applying it over the top of the project.

This process is automated by the framework itself. A backup is made first as part of the process in case of problems. However, it is strongly recommended to do a git commit and git push before starting the process.

The update process is relatively straightforward when updating from a framework version of 5.1.0 (Tungurahua) or above. Updating earlier versions is possible with a little bit of fiddling.

Updating from Lava Version 5.1.0 (Tungurahua) or Above

The process is:

# Go to the project root directory. Then ...
# Commit and push your code just in case. Then ...
# Deposit the new package at the root of the project directory. Then ...

# Activate the virtual environment
source venv/bin/activate

# Run the update process
make update pkg=lav-job-framework-<NEW-VERSION>.zip

This will do a backup of the project into a zip file, rerun the cookiecutter using the new package and apply the new framework components over the existing project.

Updating from Lava Versions Prior to 5.1.0 (Tungurahua)

The process is:

# Go to the project root directory. Then ...
# Commit and push your code just in case. Then ...
# Deposit the new package at the root of the project directory. Then ...

# Extract the `bin` directory from the new framework package
# The quotes are important here.
unzip -j -d bin lav-job-framework-<NEW-VERSION>.zip '*bin/*'
chmod u+x bin/*

# Activate the virtual environment
source venv/bin/activate

# Run the update process
PATH=$(pwd)/bin:$PATH make update pkg=lav-job-framework-<NEW-VERSION>.zip

Note that later versions of the framework move the framework's requirements.txt file into the etc directory. After the update, the requirements.txt file in the base directory can be deleted if it contains no locally added packages. If it does, only those packages need to be retained in that file.