Developing Lava Jobs

Executable Jobs

Executable jobs are handled by cmd, exe, pkg and docker job types.

Executable scripts (bash, Python, Perl, etc.), as well as worker-compatible binaries, are fine for use in lava. The information in this section applies to all of these.

For Python based jobs, additional capabilities are provided by direct access to the lava packages.

The run-time environment for executable jobs in lava is a conventional Linux environment based on the worker on which the job runs.

For docker jobs, some details may depend on the nature of the container being run. Refer to the chapter on lava and docker for more information.

The main peculiarities associated with lava based executable jobs are outlined below.

Executable Scripts

Lava relies on the hashbang line at the beginning of the script to determine the appropriate interpreter in exactly the same way that a UNIX shell does.

Beware DOS

If the script has been edited on a DOS system, it is very likely that it will have DOS style CRLF line endings instead of UNIX style LF endings. This will prevent the hashbang line from being recognised and the job will fail.

Handling of Temporary Files

Lava jobs are run in a temporary directory created by lava and deleted by lava when the job exits.

The TMPDIR environment variable is set for cmd, exe and pkg jobs to point within the private run area for the job rather than inheriting the default system setting. This variable can be referenced explicitly in a job. Alternatively, the mktemp(1) command line utility or the Python tempfile module can be used as these will use TMPDIR if used correctly.

Info

The following applies to Linux. Note that macOS mktemp(1) behaves very differently in a number of ways, including its use of TMPDIR.

Typical usage in a shell is:

#!/bin/bash

# Create a temp file in our private job area
MY_TMP_FILE=$(mktemp)

# Create a temp directory in our private job area
MY_TMP_DIR=$(mktemp -d)

# Create a temp directory using a name template. The -t is critical here.
MY_TMP_DIR2=$(mktemp -d -t tmp-XXXXXX)

Typical usage in Python is:

#!/usr/bin/env python3

import tempfile

# Create a temp file. The returned descriptor is already open.
tmp_file_descriptor, tmp_file_name = tempfile.mkstemp()

# Create a temp directory.
tmp_dir_name = tempfile.mkdtemp()

While lava will clean these up when the job exits, it is still good practice for jobs to clean up after themselves. Jobs should generally avoid creating temporary objects in /tmp because lava will not clean these up and there are no guarantees about availability of storage space in /tmp.
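For Python jobs, the tempfile context managers make this self-cleanup automatic. A minimal sketch:

```python
#!/usr/bin/env python3
import os
import tempfile

# TemporaryDirectory honours TMPDIR, so under lava the directory is
# created inside the job's private run area.
with tempfile.TemporaryDirectory() as tmp_dir:
    work_file = os.path.join(tmp_dir, "scratch.dat")
    with open(work_file, "w") as f:
        f.write("intermediate results")
    # ... use work_file ...

# The directory and its contents have been removed at this point.
```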

Testing if Running in Lava

Sometimes it's necessary for an executable to test whether or not it is running in lava. The easiest way to do this is to look for the presence of one of the lava environment variables LAVA_REALM or LAVA_JOB_ID.

In a bash script, this would look like:

#!/bin/bash

if [ "$LAVA_REALM" != "" ]
then
    # We are in lava
    echo "I lava you"
else
    # We are not in lava
    echo "I don't lava you anymore"
fi
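The equivalent check in a Python job reads the same environment variable:

```python
#!/usr/bin/env python3
import os

if os.environ.get("LAVA_REALM"):
    # We are in lava
    print("I lava you")
else:
    # We are not in lava
    print("I don't lava you anymore")
```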

Handling of stdin, stdout and stderr

For executable jobs, stdin is redirected from /dev/null while stdout and stderr are captured and uploaded to the realm temporary area in S3 unless the worker is running with the --dev option. In that case, stdout and stderr are emitted locally on the worker.

Exit Status

Lava assumes that a zero exit status indicates that the job has succeeded. This will trigger any on_success job actions.

A non-zero exit status indicates to lava that the job has failed. This will trigger any on_fail actions.
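In a Python job, one simple way to honour this convention is to return the status from a main() function and pass it to sys.exit(). This is a minimal sketch, not a required structure:

```python
#!/usr/bin/env python3
import sys

def main() -> int:
    try:
        # ... the job's real work goes here ...
        return 0   # zero: lava treats the job as succeeded (on_success)
    except Exception as exc:
        print(f"Job failed: {exc}", file=sys.stderr)
        return 1   # non-zero: lava treats the job as failed (on_fail)

if __name__ == "__main__":
    sys.exit(main())
```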

Status and Error messages

Executable jobs running under lava should print useful status and error messages to stdout and stderr, just as they should when running in any other environment.

In normal operation, lava will collect this output, upload it to S3 and place a pointer to it in the job event record.

When the worker is running with the --dev option, stdout and stderr from the job are emitted locally rather than being sent to S3. This can help with development and debugging.

Connection Handling for Executable Jobs

Handling of connections to external resources is facilitated via small executables created by lava to effect the connection. The path to the connector executable is passed to the lava job executable as an environment variable.

The following example shows how this would be used in a shell script to access a database connection, but the mechanism is generic and available to any executable that can read environment variables and invoke an external program.

#!/bin/bash

# The following environment variables are set in the Lava exe job specification.
# There can be as many connections to different resources as required.
#
# LAVA_CONN_AURORA01
#    Lava connector script for the "aurora01" database. The lava job spec must
#    have an "aurora01" connections entry.

# SQL to do something
SQL="...."

# Run a command line SQL client that is preconfigured for auto login.
$LAVA_CONN_AURORA01 --database=dbname -e "$SQL" > local-temp-file

# Now do something clever with the results.
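The same pattern works in Python by invoking the connector executable with subprocess. The connection name and the --database/-e flags here are carried over from the shell example above; the exact flags depend on the client the connector wraps:

```python
#!/usr/bin/env python3
import os
import subprocess

# LAVA_CONN_AURORA01 is set by lava from the job's "aurora01"
# connections entry, as in the shell example.
connector = os.environ.get("LAVA_CONN_AURORA01")

# SQL to do something
sql = "...."

if connector:
    # check=True raises if the connector exits with a non-zero status.
    result = subprocess.run(
        [connector, "--database=dbname", "-e", sql],
        capture_output=True, text=True, check=True,
    )
    # Now do something clever with result.stdout.
```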

Python based executable jobs have additional options for handling connections by virtue of programmatic access to the underlying lava connection manager.

Python Executable Jobs

In addition to the facilities available to all executable jobs, Python jobs gain extra functionality because lava itself is written in Python.

Python based executable jobs can interface directly with the lava Python base package. This provides access to the lava connection manager as well as other modules included with lava.

See Installing Lava Locally.

Connection Handling for Python Based Jobs

Python based lava programs can invoke the lava connection manager directly. See lava.connections in the API documentation for more information.

This is a simple example showing how to create a connection to an SQL-based database:

import os
from lava.connection import get_pysql_connection

# As we are using the lava connection manager, we just need the connection ID
# this time – not the connector script. So use LAVA_CONNID_DB not LAVA_CONN_DB.

conn_id = os.environ['LAVA_CONNID_DB']
realm = os.environ['LAVA_REALM']

# Get a standard DBAPI 2.0 connection
conn = get_pysql_connection(conn_id, realm)

# Knock yourself out with SQL wizardry…
cursor = conn.cursor()
...
conn.close()

Connection Handling for SQLAlchemy

The SQL database connectors provide native support for SQLAlchemy. An SQLAlchemy engine can be created using a lava connector to manage the underlying connection process.

This is useful, not just for using SQLAlchemy natively, but also for packages such as pandas that rely on SQLAlchemy for database interaction.

The following example shows how this would be used.

import os
import pandas as pd
from lava.connection import get_sqlalchemy_engine

# As we are using the lava connection manager, we just need the connection ID
# this time – not the connector script. So use LAVA_CONNID_DB not LAVA_CONN_DB.

conn_id = os.environ['LAVA_CONNID_DB']
realm = os.environ['LAVA_REALM']

engine = get_sqlalchemy_engine(conn_id, realm)
# engine is a standard SQLAlchemy engine.

with engine.connect() as conn:
    for row in conn.execute('... an SQL query ...'):
        print(row)

# Or use with pandas
table_df = pd.read_sql_table('my_table', con=engine)

Note

There is a known issue for SQLAlchemy and pg8000. See the workaround.

Database Connections - The Good, the Bad and the Ugly of DBAPI 2.0

Aaaah–aaaah–aaah–aaaah… Wah–wah–wahhhh…

(Don't tell me you don't know)

Lava uses DBAPI 2.0 based database drivers, the interface for which is specified in PEP 249.

The Good

DBAPI 2.0 provides some level of interface consistency across database types. In simple cases, you only need to invoke the lava get_pysql_connection() function as described above to obtain a database connection which can be used to execute queries in a more or less consistent way across database types. But ...

The Ugly

While DBAPI 2.0 provides Connection.commit() and Connection.rollback() methods, it does not provide a Connection.begin() method to start a transaction, and driver implementations differ in how they handle this (most, but not all, handle it via Connection.autocommit). Different databases also use different SQL syntax to begin a transaction. Oracle is notable in that it does not support BEGIN TRANSACTION in the way that Postgres and MySQL do.

To avoid this problem, lava provides a helper function lava.lib.db.begin_transaction().

from lava.connection import get_pysql_connection
from lava.lib.db import begin_transaction

conn = get_pysql_connection(...)
cursor = conn.cursor()

try:
    begin_transaction(conn, cursor)
    # Do some SQL stuff
except Exception:
    conn.rollback()
else:
    conn.commit()
finally:
    conn.close()

The Bad

PEP 249 defines 5 different possible mechanisms for passing query parameters when a query is executed.

This is because it is absolutely critical for a standard to have 5 incompatible ways of doing exactly the same thing.

Unfortunately there is no consistency across different drivers as to which subset of these is implemented or the default setting. The paramstyle module constant will specify the default mechanism.

Some drivers, such as pg8000, allow the paramstyle constant to be set to different values to support different parameter passing styles. Some don't.

It's a bit of a mess, unfortunately, and makes writing driver-independent code inordinately difficult. You either need to test the module's paramstyle setting and adapt the parameter passing mechanism at run-time, or just make do with the specific driver's settings.
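As an illustration of the run-time approach, the following sketch rewrites {}-style placeholders to suit a driver's paramstyle. adapt_params is a hypothetical helper for illustration only, not part of lava, and it covers only three of the five PEP 249 styles:

```python
#!/usr/bin/env python3
def adapt_params(sql_template, params, paramstyle):
    """Render {}-style placeholders in sql_template for a given paramstyle."""
    n = len(params)
    if paramstyle == "qmark":        # e.g. sqlite3
        marks = ["?"] * n
    elif paramstyle == "format":     # e.g. pg8000
        marks = ["%s"] * n
    elif paramstyle == "numeric":
        marks = [f":{i}" for i in range(1, n + 1)]
    else:
        raise ValueError(f"unhandled paramstyle: {paramstyle}")
    return sql_template.format(*marks), list(params)

# In real code the style would come from the driver module's
# paramstyle constant rather than a literal.
sql, args = adapt_params(
    "SELECT * FROM t WHERE a = {} AND b = {}", (1, 2), "qmark"
)
# sql is now "SELECT * FROM t WHERE a = ? AND b = ?"
```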

DBAPI 2.0 Usage in Lava

Lava uses the following drivers by default. Check the documentation for the driver for more details.

Database Family    Driver
---------------    ---------
MSSQL              pyodbc
MySQL              PyMySQL
Oracle             cx_Oracle
Postgres           pg8000
Redshift           pg8000
SQLite3            sqlite3

Python code must either be sophisticated enough to adapt to the DBAPI 2.0 variations at run-time or have specific knowledge of which driver is being used. Using SQLAlchemy instead of the native interface can help with the former approach.

Lava also provides limited support to select an alternate driver for some database types. This is done using the subtype field in the database connection specification. Refer to individual connectors for details.

SQL Jobs

The sql, sqlc, sqli and sqlv jobs will run SQL commands against a target RDBMS.

There is nothing special that needs to be done with the SQL to prepare it to run with lava but it is important to keep the following in mind:

  • Lava will manage all of the connectivity to the database.

  • The SQL must match the syntax requirements of the target database.

  • sqlc jobs use the command line client specific to the target database. Typically these will support some client specific meta commands to control behaviour of the client. These can be used in the job payload script.

  • sqlc and sqlv jobs have a timeout that can be configured in the job specification. sql and sqli jobs do not have a timeout. As with all jobs, the visibility timeout on the worker queue needs to be kept in mind.

  • If the queries return data, this will be placed into the temporary area in S3. Some other process may need to do something with this data.