Jobs and Job Types

A lava job is a static definition represented as an entry in the jobs table. Each invocation of a job is referred to as a run which has a unique run_id.

The run_id is used in construction of prefixes for job outputs in S3 as well as in event records in the events table.

Various job types are supported, depending on the kind of processing required. Each job type has its own requirements for the parameters field in the job specification.

Jobs can be run in either normal mode or in dev mode by specifying the --dev argument when invoking the lava worker. In dev mode, the behaviour of the job may be slightly altered to assist with job development and debugging without changing the fundamental action of the job. The specifics are job type dependent.

If the DEBUG configuration variable is set to true, some additional information is added to the event records for failed jobs.

Job Families

Job types can be roughly grouped into the following families.

Job Payloads

Some job types have an associated payload. Payloads are specified by the mandatory payload field in the job specification.

Payloads are job type specific and can be of the following types:

  • S3 based payloads: The payload field will point to location(s) in S3 that the lava worker will download at run-time. These are typically things such as code bundles, SQL scripts etc.

  • Inline payloads: The payload field will contain the actual payload. These may be things like single CLI commands, small in-line SQL scripts, docker repository names etc.

  • No payload: Some job types (e.g. log) don't require a payload and hence the value of the payload field is ignored. (As an accident of history, the field is still required but it can be set to anything. A value of -- or null is common usage.)
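
As a rough illustration (the key, command and placeholder below are hypothetical), the three styles might look like this in a job specification — an S3 object key, an inline CLI command, and an ignored placeholder value respectively:

{ "payload": "app/whatever/code-bundle.zip" }

{ "payload": "echo hello" }

{ "payload": "--" }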

S3 Payloads

Prior to version 7.1.0 (Pichincha), the S3 payload downloader was the v1 downloader.

As of version 7.1.0 (Pichincha), there is an additional v2 downloader.

Only one version of the downloader is active on any lava worker.

Info

As of version 8.1.0 (Kīlauea), the v2 downloader is the default and the v1 downloader is deprecated.

The v1 downloader can be enabled by setting the PAYLOAD_DOWNLOADER worker configuration parameter to v1. See Lava Worker Configuration for more details.

The v1 Payload Downloader

The v1 S3 payload downloader requires the payload field in the job specification to be a string that specifies a location in S3 relative to the s3_payloads area defined in the realms table. It can be either:

  • an S3 object key, in which case a single S3 object is downloaded; or

  • an S3 prefix ending in /, in which case all objects under that prefix will be downloaded and made available to the job in lexicographic order.

Info

The download process when a prefix is specified does not recurse down the object hierarchy in S3. All objects to be downloaded must sit directly under that prefix. The job will abort with an error if there are any sub-prefixes.
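
The selection rules can be sketched as follows (illustrative only, not lava's implementation):

```python
def select_payload_objects(keys, prefix):
    """Mimic the documented v1 rules: select objects sitting directly
    under `prefix`, return them in lexicographic order, and abort if
    any sub-prefixes exist. Illustrative sketch, not lava's code."""
    selected = []
    for key in keys:
        if not key.startswith(prefix) or key == prefix:
            continue
        rest = key[len(prefix):]
        if "/" in rest:  # the object lives under a sub-prefix
            raise ValueError(f"sub-prefix found under {prefix!r}: {key}")
        selected.append(key)
    return sorted(selected)
```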

A typical job specification might contain something like this:

{
  "payload": "app/whatever/query-some-stuff.sql"
}

... or this:

{
  "payload": "app/whatever/lots-of-sqls/"
}

The v2 Payload Downloader

If the payload field in the job specification is a string, the v2 S3 payload downloader behaves the same as the v1 downloader, with one minor exception. If the string is a prefix containing sub-prefixes, the v1 downloader aborts with an error whereas the v2 downloader silently ignores the sub-prefixes.

In v2, the payload field may also be a list of strings. These are processed in the specified order, using the same mechanism as the v1 downloader and the combined list of downloaded objects is made available to the job. The download process does not recurse into sub-prefixes which are silently ignored.

Info

Lava places the files resulting from each item in the list in a separate private directory to avoid name clashes. There is no way a payload file can know the location of any other payload file that isn't part of the same list item.
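
A v2 list payload might look like this (paths hypothetical); the single object key and the prefix are processed in order, each into its own private directory:

{
  "payload": [
    "app/whatever/setup.sql",
    "app/whatever/lots-of-sqls/"
  ]
}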

Job Retries

By default, lava will make one attempt to run a job once dispatched and the job will then either succeed or fail.

As of v6.1.0 (Volcán Pinta), lava supports a job retry mechanism that can retry a failed job one or more times. This process is controlled by the following fields in the job specification:

  • iteration_limit: The maximum number of attempts that will be made to run the job successfully before giving up.

  • iteration_delay: The delay between run attempts.

The job will be re-run until either it succeeds or the iteration_limit is reached. Lava makes an additional globals.lava.iteration global available for use with run-time Jinja rendering of job and action fields.

A single worker thread remains committed to the job for the entirety of the retry process so it is important to use the retry mechanism sensibly.
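
A retrying job specification might include fields like these (illustrative; the "5m" delay format is an assumption based on the nnX duration convention used for other lava parameters — check the field reference for the exact form):

{
  "iteration_limit": 3,
  "iteration_delay": "5m"
}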

Info

The entire duration of the retry process must fit within the queue visibility timeout of the worker SQS queue or else SQS itself will resubmit the dispatch message, with the same run ID, while the first job is still running. This is not ideal.
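
The sizing constraint amounts to simple arithmetic (illustrative only, not a lava API):

```python
def worst_case_retry_seconds(run_seconds, delay_seconds, iteration_limit):
    """Worst-case wall-clock time for the retry process: every attempt
    runs for its full duration and fails, with a delay between attempts.
    This total must fit inside the SQS queue visibility timeout."""
    return iteration_limit * run_seconds + (iteration_limit - 1) * delay_seconds

# e.g. 3 attempts of up to 10 minutes, with 5 minutes between them,
# needs a visibility timeout comfortably above 40 minutes.
```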

Job Actions for Retries

Lava makes the on_retry job action available for situations where a job fails but will be retried in another iteration.

For example, if a job has an iteration_limit of 2, and both attempts fail, the on_retry actions will be executed after the first iteration and the on_fail actions will be executed after the second and final iteration.

Lava Job Retries (Iterations) vs SQS Resubmissions

The job specification contains both iteration_* related fields and a max_tries field.

The iteration_* related fields control internal job retries within the lava worker.

The max_tries field controls how many times SQS is permitted to resubmit the same dispatch message due to SQS queue visibility timeouts before it is discarded by the lava worker.

Choosing an SQL Job Type

Lava provides the following SQL job types:

All of these will run blobs of SQL against a relational database using a lava database connector to connect to the target database. Each job type has unique characteristics that make it more or less suited for a given context.

The following table compares the job types as well as the stand-alone lava-sql CLI utility:

Feature sql sqli sqlc sqlv lava-sql
Runs inside the lava worker * *
Runs outside the lava worker * * *
Uniform, lava provided interface * * * *
DB specific client *
Timeout supported * *
Can be killed on lava side * * *
SQL in-line in job-spec *
SQL in S3 * * *
Lava transaction support * * *
Jinja payload rendering * * * *
Run multiple SQL statements * * * * *
Suitable for large jobs * ** ** **
Lava CSV formatting control * * * *
HTML output ? *
Output column headers * * * * *
Client side copy support ?
Runs as a stand-alone utility * *

? means support depends on database client capabilities.

Job type: chain

The chain job type runs a list of other jobs sequentially. Unlike the dispatch job type, this is a synchronous operation. All of the jobs in the chain must be set to run on the same worker and will all run under the run_id of the parent job.

It is possible to commence processing of a chain at an arbitrary point in the list. This is useful when it's necessary to resume a failed chain at a mid-point.

Any globals available in the chain job will also be passed in the dispatch requests to the jobs in the chain.

Payload

The payload is either a string containing a comma separated list of job_ids or an actual list of job_ids.

Parameters

Parameter Type Required Description
can_fail String or List[String] No A glob style pattern, or list of patterns specifying child jobs that are allowed to fail. See Allowing Child Jobs to Fail.
job_prefix String No Prepend the specified value to each job ID in the payload, including any job specified by the start parameter.
start String No The job_id of the starting point in the chain. The chain will commence with the first job with the given ID.
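
Taken together, the payload and these parameters can be sketched like this (illustrative only, not lava's implementation):

```python
def resolve_chain(payload, job_prefix="", start=None):
    """Sketch of how the chain payload and parameters combine: accept a
    comma separated string or a list of job IDs, prepend job_prefix to
    every entry (and to start), then begin the chain at the first job
    matching start. Illustrative only, not lava's code."""
    jobs = payload.split(",") if isinstance(payload, str) else list(payload)
    jobs = [job_prefix + j.strip() for j in jobs]
    if start is not None:
        jobs = jobs[jobs.index(job_prefix + start):]
    return jobs
```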

Allowing Child Jobs to Fail

By default, the chain is aborted when any job in the list fails, unless the can_fail parameter is specified. Jobs that are not enabled are simply skipped and the chain continues.

The can_fail parameter can be set to a glob style pattern, or list of patterns. A failed child job with a job_id that matches any of the patterns will not cause the parent chain to fail. In this situation, it is important that the child job handles its own on_fail actions, as the parent will not. Tough love.

This tolerance of failure does not include configuration errors on child jobs, such as malformed job specifications, jobs sent to the wrong worker etc. These will still cause the entire chain to fail.
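
The matching behaviour can be sketched with standard glob rules (illustrative; whether lava uses Python's fnmatch is an assumption):

```python
from fnmatch import fnmatch

def child_may_fail(job_id, can_fail):
    """Sketch of can_fail matching as documented: a glob style pattern
    or a list of patterns; a failed child whose job_id matches any
    pattern does not fail the parent. Illustrative, not lava's code."""
    patterns = [can_fail] if isinstance(can_fail, str) else can_fail
    return any(fnmatch(job_id, p) for p in patterns)
```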

Keep calm

Before anyone gets all bitter and twisted about can vs may in can_fail... both are essentially equivalent in this crazy, modern world. Look it up.

Handling of Globals

The chain job type merges its globals into those of the child jobs in the chain. A value specified in the parent chain job will override a similarly named value in the child.

The chain job will also add lava specific globals under globals.lava. These lava owned globals allow all jobs in a chain, even a multi-level chain, to access some common global values.

Dev Mode Behaviour

The chain job behaviour is unchanged for dev mode. However, dev mode is propagated to jobs in the chain.

Examples

A basic chain:

{
  "description": "Chain, chain, chain ... Chain of tools",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/chain-01",
  "owner": "demo@somewhere.com",
  "parameters": {},
  "payload": [
    "demo/job_01",
    "demo/job_02"
  ],
  "schedule": "0 0 * * *",
  "type": "chain",
  "worker": "default"
}

The chain with a different starting point:

{
  "description": "Chain, chain, chain ... Chain of tools",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/chain-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "start": "demo/job_02"
  },
  "payload": [
    "demo/job_01",
    "demo/job_02"
  ],
  "schedule": "0 0 * * *",
  "type": "chain",
  "worker": "default"
}

With a common job prefix:

{
  "description": "Chain, chain, chain ... Chain of tools",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/chain-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "job_prefix": "demo/",
    "start": "job_02"
  },
  "payload": [
    "job_01",
    "job_02"
  ],
  "schedule": "0 0 * * *",
  "type": "chain",
  "worker": "default"
}

An uncaring parent chain job that doesn't care if any of its children fail:

{
  "description": "Chain, chain, chain ... Chain of tools",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/chain-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "can_fail": "*"
  },
  "payload": [
    "demo/job_01",
    "demo/job_02"
  ],
  "schedule": "0 0 * * *",
  "type": "chain",
  "worker": "default"
}

A parent chain job that plays favourites:

{
  "description": "Chain, chain, chain ... Chain of tools",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/chain-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "can_fail": [
      "*/job_[0-9][0-9]"
    ]
  },
  "payload": [
    "demo/job_01",
    "demo/job_02",
    "demo/black_sheep"
  ],
  "schedule": "0 0 * * *",
  "type": "chain",
  "worker": "default"
}

Job type: cmd

The cmd job type runs a single Linux command.

Payload

The payload is the command string. This will be parsed using standard Linux shell lexical analysis to determine the executable and arguments. Additional arguments can also be specified with the args parameter.
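
The parsing step can be illustrated with shell-style lexing (a sketch; whether lava uses Python's shlex for this is an assumption):

```python
import shlex

# The payload string is split into an executable and its arguments using
# shell lexical rules, so quoted arguments survive as single tokens.
payload = "aws s3 ls 's3://my-bucket/some prefix/' --recursive"
argv = shlex.split(payload)
# argv[0] is the executable, the rest are its arguments
```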

Environment

Stdin will be redirected from /dev/null. Stdout and stderr are captured and, if non-empty, placed into the realm s3_temp area with the following prefixes:

  • stdout: <s3_temp>/<job_id>/<run_id>/stdout

  • stderr: <s3_temp>/<job_id>/<run_id>/stderr

The following variables are placed into the environment for the command.

Variable Description
LAVA_JOB_ID The job_id.
LAVA_OWNER The value of the owner field from the job specification.
LAVA_REALM The realm name.
LAVA_RUN_ID The run_id UUID.
LAVA_S3_KEY The identifier for the KMS key needed to write data into the S3 temporary area.
LAVA_S3_TMP The private S3 temporary area for this job run. The command is allowed to put data here.
LAVA_WORKER The worker name.

Parameters

Parameter Type Required Description
args List[String] No A list of additional arguments for the command.
env Map[String,String] No A map of additional environment variables for the command.
jinja Boolean No If false, disable Jinja rendering of the payload. Default true.
timeout String No By default, cmd jobs are killed after 10 minutes. This parameter can override that with values in the form nnX where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that this must be less than the visibility timeout on the worker SQS queue.
vars Map[String,*] No A map of variables injected when the command arguments and environment are Jinja rendered.
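
The nnX timeout form can be sketched as follows (illustrative only, not lava's parser):

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}

def parse_timeout(value):
    """Parse the documented nnX timeout form (e.g. "90s", "10m", "2h")
    into seconds. Illustrative sketch, not lava's code."""
    m = re.fullmatch(r"(\d+)([smh])", value)
    if not m:
        raise ValueError(f"bad timeout: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]
```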

Jinja Rendering of the Payload and Environment

The collected arguments for the command and any environment values defined in the job specification are individually rendered using Jinja prior to execution.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

Normally, the cmd job will copy stdout and stderr to S3 on the conclusion of the job. In dev mode, stdout and stderr are emitted locally during the job run instead of being copied to S3.

Examples

The following example will list the contents of the S3 payloads area for this realm.

{
  "description": "List S3",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/list_s3",
  "owner": "demo@somewhere.com",
  "payload": "aws s3 ls {{realm.s3_payloads}}/ --recursive",
  "type": "cmd",
  "worker": "default"
}

This does the same thing but with the arguments supplied a little differently.

{
  "description": "List S3",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/list_s3",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "--recursive"
    ]
  },
  "payload": "aws s3 ls {{realm.s3_payloads}}/",
  "type": "cmd",
  "worker": "default"
}

Job type: dag

The dag (Directed Acyclic Graph) job type runs a set of child jobs in the order defined by a dependency matrix.

Ordering controlled by explicit dependencies will be honoured in that a child job which is a predecessor of another child job will always run to completion before the successor job starts. Except for this, there is no guaranteed run order for the child jobs. Jobs that are not dependent on one another may run in any order, or in parallel.

All of the jobs in the DAG must be set to run on the same worker and will all run under the run_id of the parent job. The jobs will be run in a dedicated transient thread pool that is separate from the main lava worker threads.

Any globals available in the dag job will also be passed to the child jobs.

Payload

The payload is a map containing elements in the form:

{
  "payload": {
    "child_job_id_1": "predecessor_job_id",
    "child_job_id_2": [ List of predecessor job IDs ],
    "child_job_id_3": null,
    "child_job_id_4": []
  }
}

The keys are job IDs and values must be one of the following:

  1. The ID of a predecessor job as a string.

  2. A list of predecessor job IDs.

  3. A null indicating the job must be run but has no predecessor requirements.

  4. An empty list, which also indicates the job must be run but has no predecessor requirements.

Info

Every job listed in the payload map, either as a key or as an element in a predecessor list, will be run exactly once by the dag job.

It is not necessary to include a separate key for a job if it is also present in a predecessor list and has no predecessors of its own, but it's harmless to do so.
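
One way to visualise the resulting run order is a topological sort of the payload map (a sketch using the Python standard library; lava's actual scheduler also runs independent jobs in parallel):

```python
from graphlib import TopologicalSorter

def dag_run_order(payload):
    """Derive one valid run order from a dag payload as documented:
    keys are child jobs, values name their predecessors (a string, a
    list, or null). Jobs appearing only in predecessor lists are run
    too. Illustrative sketch, not lava's scheduler."""
    graph = {}
    for job, preds in payload.items():
        if preds is None:
            preds = []
        elif isinstance(preds, str):
            preds = [preds]
        graph[job] = set(preds)
        for p in preds:
            graph.setdefault(p, set())  # predecessors without their own key
    return list(TopologicalSorter(graph).static_order())
```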

Note

The lava-dag-gen utility generates a dag job payload from a dependency matrix. It can be used standalone or with the lava job framework.

Parameters

Parameter Type Required Description
can_fail String or List[String] No A glob style pattern, or list of patterns specifying child jobs that are allowed to fail. See Allowing Child Jobs to Fail.
job_prefix String No Prepend the specified value to each job ID in the payload.
workers Integer No The number of worker threads to use for running child jobs. The default is specified by the DAG_WORKERS worker configuration parameter and the maximum allowed value is specified by the DAG_MAX_WORKERS configuration parameter.

Tip

Don't get carried away setting the workers parameter too high. It will impact memory consumption.

Allowing Child Jobs to Fail

By default, the dag job is aborted when any child fails, however child jobs running in parallel with the failed job will generally run to completion.

The can_fail parameter can be set to a glob style pattern, or list of patterns. A failed child job with a job_id that matches any of the patterns will not cause the parent dag job, or dependent jobs, to fail. In this situation, it is important that the child job handles its own on_fail actions, as the parent will not. Tough love.

This tolerance of failure does not include configuration errors on child jobs, such as malformed job specifications, jobs sent to the wrong worker etc. These will still cause the entire dag job to fail.

Handling of Globals

The dag job type merges its globals into those of the child jobs. A value specified in the parent dag job will override a similarly named value in the child.

The dag job will also add lava specific globals under globals.lava. These lava owned globals allow all child jobs in the dag to access some common global values.

Dev Mode Behaviour

The dag job behaviour is unchanged for dev mode. However, dev mode is propagated to child jobs.

Examples

{
  "description": "Daggy job",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dag-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "workers": 2
  },
  "payload": {
    "demo/job_01": "demo/job_02",
    "demo/job_02": [ "demo/job_03", "demo/job_04" ],
    "demo/job_05": null
  },
  "schedule": "0 0 * * *",
  "type": "dag",
  "worker": "default"
}

This version uses job_prefix and is functionally identical to the one above:

{
  "description": "Daggy job",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dag-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "job_prefix": "demo/",
    "workers": 2
  },
  "payload": {
    "job_01": "job_02",
    "job_02": [ "job_03", "job_04" ],
    "job_05": null
  },
  "schedule": "0 0 * * *",
  "type": "dag",
  "worker": "default"
}

An uncaring parent dag job that doesn't care if any of its children fail:

{
  "description": "Daggy job",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dag-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "can_fail": "*",
    "workers": 2
  },
  "payload": {
    "demo/job_01": "demo/job_02",
    "demo/job_02": [ "demo/job_03", "demo/job_04" ],
    "demo/job_05": null
  },
  "schedule": "0 0 * * *",
  "type": "dag",
  "worker": "default"
}

A parent dag job that plays favourites:

{
  "description": "Daggy job",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dag-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "can_fail": [
      "*/job_0[24]"
    ],
    "workers": 2
  },
  "payload": {
    "demo/job_01": "demo/job_02",
    "demo/job_02": [ "demo/job_03", "demo/job_04" ],
    "demo/job_05": null
  },
  "schedule": "0 0 * * *",
  "type": "dag",
  "worker": "default"
}

Job type: db_from_s3

The db_from_s3 job type loads data from AWS S3 into a target database. The data formats supported, and the loading mechanism used, are dependent on the target database type. For example, Redshift databases use the COPY command to load structured data from S3.

The target database type, and hence the loading mechanism, are determined by the type of database connector specified for the job.

The db_from_s3 job type can be used in conjunction with the lava s3trigger Lambda function to load data whenever a new data file is placed in S3.

The following target database types are supported:

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
args * * Dependent on the target database type.
bucket String Yes The name of the S3 bucket from which the data is copied.
columns List[String] Sometimes A list of SQL column specifications. Required if the target table must be created (e.g. if the table does not exist or the mode is drop). If specified, the copy process will explicitly include the columns specified. This is useful when only some of the columns in the target table are to be loaded with data.
db_conn_id String Yes The connection ID for a target database. The type associated with the connection specification will determine which loader is used.
jinja Boolean No If false, disable Jinja rendering of the bucket and key. Default true.
key String Yes The key in the S3 bucket from which the data is copied.
min_size String | Int No The loading process will attempt to avoid loading files below the specified size. The value is a size value as described in Lava Worker Configuration. The default is 0. See below for more information.
mode String Yes Update mode for the target table. See Copy Modes for more information.
s3_conn_id String Conditional The connection ID for AWS S3. This is used to allow access to the source data in S3. Whether or not this is required is database dependent. Only one of either s3_conn_id or s3_iam_role is ever required.
s3_iam_role String Conditional The IAM role name used to allow access to the source data in S3. Whether or not this is required is database dependent. Only one of either s3_conn_id or s3_iam_role is ever required.
schema String Yes Target schema name. If the Jinja rendered value starts with /, the rest of the value is used as a regular expression (regex) to derive the schema name from the Jinja rendered key value. In this case, the columns argument is not permitted and the target table must already exist.
skip_missing Boolean No If true, no attempt is made to load missing files. The default is false which means attempting to load a non-existent file will result in an error. This won't help in the situation where the file is removed from S3 between the check for existence and the database load operation. Note that the lava worker will require s3:GetObject access to the object if the value is set to true.
table String Yes Target table name. If the Jinja rendered value starts with /, the rest of the value is used as a regular expression (regex) to derive the table name from the key argument. In this case, the columns argument is not permitted and the target table must already exist.
vars Map[String,*] No A map of variables injected when bucket, key, schema and table are Jinja rendered.

About the min_size Parameter

If the min_size parameter is specified, the loading process will attempt to avoid loading files below the specified size. This is not always possible, depending on the loading mechanism being used.

The fundamental use case for this parameter is to avoid attempting to load empty data files. Most of the database loading mechanisms will happily load an empty file and consider that to be a success. Some, such as the native S3 loading process for Aurora Postgres, consider this to be an error and will abort the transaction. This is particularly problematic when loading a number of files in one operation, some of which may be empty. This can happen with data unloaded from Redshift where the source table is skewed and some slice files are empty as a result.

The file size is checked directly in S3 without downloading the object to the lava worker. This means that for compressed files, the value applies to the file before decompression. It also means that the lava worker must have read access (s3:GetObject) to objects to obtain size information, even if the loading process is performed directly from the database.

Unfortunately, an empty file will GZIP to a non-empty file of small but uncertain length due to the embedding of the filename. To skip empty GZIP files, the value should be set to a small but non-zero size (e.g. 100).

For uncompressed data without headers, setting the value to 1 should avoid empty files.
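
The GZIP point is easy to demonstrate (a sketch; exact sizes vary with the compressor, and files produced by the gzip CLI also embed the filename):

```python
import gzip

# Compressing zero bytes still yields a non-empty result, because the
# gzip container adds a header and trailer. A min_size of 0 or 1 will
# therefore not catch empty compressed files; a small threshold such
# as 100 bytes is the documented workaround.
empty_gz = gzip.compress(b"")
```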

Jinja Rendering of the Bucket, Key, Schema and Table

The bucket, key, schema and table parameters are rendered using Jinja prior to further use. The rendering is done in the following order:

  1. The bucket and key parameters are rendered.
  2. The schema and table parameters are rendered.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

The following additional variables are made available when rendering the schema and table parameters:

Name Type Description
bucket str The Jinja rendered value of the bucket parameter.
key str The Jinja rendered value of the key parameter.

This two-step process provides an additional mechanism for deriving the target database object name from the S3 object name. It complements the regex based mechanism described below.

Why two mechanisms for deriving schema and table from the S3 key?

Well, the original regex mechanism was a legacy of previous code and is still supported for historical reasons. The newer Jinja based method is more flexible and consistent with lava generally. Try not to use them both at once. It probably won't end well.

Deriving Target Schema and Table Name from the S3 Key

The regex options for schema and table provide an alternate mechanism to allow the target table to be derived at run time from the S3 object name. This allows a single job specification to be used for multiple tables provided an appropriate S3 naming convention is used. This is particularly handy if the job is dispatched by an S3 object created event. In this configuration, the bucket and key parameters would be supplied when the job is dispatched as a result of an S3 bucket notification event.

For each of these arguments, if the value starts with /, the rest of the value must be a regular expression with a single capture group. The regex is applied to the S3 object key that triggered the event (minus any suffix) and the value of the capture group is used as the schema or table name respectively. If the regex option is used, the columns argument is not permitted and the target schema and table must already exist.

By combining these mechanisms, it is possible to use a single job specification to load any number of different tables into a given target database provided the target schema and table name can be derived from the S3 object key using regular expressions.
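
The regex mechanism can be sketched as follows (illustrative only, not lava's implementation; in particular, how lava strips the suffix is an assumption here):

```python
import re

def derive_name(spec, key):
    """Sketch of the documented regex option: if the parameter value
    starts with "/", the remainder is a regular expression with a
    single capture group, applied to the S3 key (minus any suffix) to
    yield the schema or table name. Illustrative, not lava's code."""
    if not spec.startswith("/"):
        return spec  # plain literal value, used as-is
    stem = key.rsplit(".", 1)[0]  # drop any suffix (an assumption)
    m = re.search(spec[1:], stem)
    if not m:
        raise ValueError(f"key {key!r} does not match {spec[1:]!r}")
    return m.group(1)
```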

Copy Modes

The mode argument specifies how any existing data in the table should be handled and how new data should be loaded.

Mode Description
abort If the target table contains any data, then abort with an error. This prevents the loss of any existing data.
append New data is appended to any existing data in the table.
drop The table is dropped and recreated prior to loading. This requires the columns argument to be provided so lava knows how to create the table.
delete Delete all existing data from the table.
switch Perform A/B table switching. If the table argument specifies a target table of mytable, data will be alternately loaded into switch tables mytable_a and mytable_b. During the load phase, lava will find whichever switch table is empty and load the new data into it. It will then delete all data from the other table. All this is done inside a transaction. Once the transaction is committed, the empty table is truncated to minimise vacuuming requirements. If neither switch table is empty, the load is aborted. Generally, there will be a view on top of the switch tables presenting the union of the two.
truncate Truncate the table before loading.

Event Driven Loading

A common usage pattern is to trigger a database load when a data file is created in S3. Lava supports this via the s3trigger lambda function.

To load a table in response to an S3 bucket notification event would require an entry in the s3triggers table like so:

{
  "description": "A new custard club membership file appeared in S3",
  "bucket": "my-bucket",
  "prefix": "a/b",
  "enabled": true,
  "job_id": "demo/copy-mytable",
  "parameters": {
    "bucket": "{{bucket}}",
    "key": "{{key}}"
   }
}

The bucket and key parameters are then determined at run-time and merged into the job specification.

When a new data file is placed in S3, the lava s3trigger Lambda function will be invoked. It will look up the event source (bucket + object key) in the s3triggers table to determine that it should invoke the demo/copy-mytable job. It will use Jinja to render the parameters shown above from the bucket event notification details and then dispatch the job with the customised bucket and key parameters replacing any value in the job specification.

Dev Mode Behaviour

Normally, the db_from_s3 job will copy a log of events to S3 on the conclusion of the job. In dev mode, this log is emitted locally at the end of the job run.

Moving Data from Redshift to Other Databases

Unfortunately, there is little consistency of data interchange formats or methods across the AWS database services.

For example:

  • Data unloaded from Redshift cannot be reliably copied back to Redshift without special handling of date fields.

  • Data unloaded from Redshift as pipe separated values (the default), cannot be reliably loaded to Postgres, Postgres RDS or Aurora Postgres because of differences in escaping the pipe character. CSV format seems to work.

  • Redshift unloading works best with compression and parallel unloading enabled. Aurora MySQL supports a manifest but does not support loading of compressed data. Postgres RDS and Aurora Postgres don't support loading from a manifest. They do support loading compressed data but only if the content encoding is set correctly on the S3 objects. Redshift does not do this.

  • The syntax and capability of data loading commands varies widely from Redshift to Aurora MySQL to Aurora/RDS Postgres to standard Postgres.

Frankly, it's a mess.

Lava can only compensate for these problems to a limited degree. For example, the redshift_unload job has additional logic to work around the first problem above. The db_from_s3 job uses a range of background trickery (such as manifest handling) to partially compensate for some of the other limitations and inconsistencies.

Be careful out there!

Loading Data to Aurora MySQL

Loading to Amazon Aurora MySQL is done using the native LOAD DATA FROM S3 facility.

There are a number of preliminary setup steps required on the database itself to enable this mechanism. Once enabled, it can load uncompressed data. Loading of compressed data is not supported.

Aurora MySQL Specific Parameters

The db_conn_id parameter must point to a connection with a type of mysql-aurora.

The loading operation relies on pre-configuration on the cluster itself to provide access to S3. The s3_conn_id and s3_iam_role parameters are not used.

Note

The syntax of the LOAD DATA FROM S3 command is very different from its Postgres and Redshift counterparts. Lava attempts to reduce the apparent differences by adopting some of the Redshift / Postgres conventions in the args parameter. Whether this is helpful or dangerous folly is in the eye of the beholder.

The args parameter must contain a list of Aurora MySQL LOAD DATA command options. See also the standard MySQL LOAD DATA reference. Only the following options are supported:

  • FILE
  • IGNORE
  • MANIFEST
  • PARTITION
  • PREFIX
  • REPLACE

In addition, the following non-standard options are supported:

Option Description
DELIMITER Takes an argument of the form 'string' specifying the column delimiter.
ENCODING Takes an argument specifying the character set of the data.
ESCAPE Takes an argument of the form 'char' specifying the character used to escape quote characters.
HEADER Takes an optional integer argument specifying the number of header lines to skip. The default value is 1.
QUOTE Takes arguments of the form [OPTIONAL] 'char' specifying the character used to quote data fields. If the OPTIONAL keyword is present, only fields with a string type are quoted. If omitted, all fields are quoted.
TERMINATOR Takes an argument in the form DOS | UNIX | 'char'. The DOS and UNIX options set the line terminator appropriate for data sourced from those systems.

Aurora MySQL Examples

The following example shows loading of a data file to a table in switch mode. The tables membership_a and membership_b will be created with the given column specifications if they don't already exist.

{
  "description": "Copy data to the table 'custardclub.membership'",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/copy-mytable",
  "owner": "demo@somewhere.com",
  "parameters": {
    "db_conn_id": "mysql-aurora-conn-01",
    "bucket": "my-bucket",
    "key": "a/b/data.csv",
    "schema": "custardclub",
    "table": "membership",
    "mode": "switch",
    "args": [
      "DELIMITER ','",
      "HEADER",
      "ENCODING utf8",
      "QUOTE OPTIONAL '\"'",
      "TERMINATOR UNIX"
    ],
    "columns": [
      "CustardClubNo INTEGER PRIMARY KEY NOT NULL",
      "GivenName VARCHAR(20)",
      "FamilyName VARCHAR(30)",
      "CustardBidPrice FLOAT(2)",
      "CustardJedi BOOLEAN",
      "LastCustard DATE",
      "CustardQuota INTEGER"
    ]
  },
  "payload": "--",
  "type": "db_from_s3",
  "worker": "default"
}

Loading Data to Aurora Postgres

See Loading Data to Postgres RDS.

Loading Data to Oracle

Loading to Oracle is done by lava performing direct data insertion. The files to be loaded are copied from S3 to the lava worker node, decompressed on the fly if required, and then inserted row by row into the database by lava.

Warning

If using TRUNCATE or DROP mode, a COMMIT is done after the TRUNCATE / DROP but prior to loading of new data.

Oracle Specific Parameters

The db_conn_id parameter must point to a connection with a type of oracle.

The args parameter supports the following options. The CSV format related options are linked directly to the Python csv module formatting parameters.

Option Description
DELIMITER 'c' As for the Python csv module formatting parameters.
DOUBLEQUOTE Controls how instances of QUOTECHAR appearing inside a field should themselves be quoted. If the DOUBLEQUOTE option is present, the character is doubled. If not present, the ESCAPECHAR is used as a prefix to the QUOTECHAR.
ESCAPECHAR 'c' As for the Python csv module formatting parameters.
GZIP Source data is compressed using gzip. The worker will decompress it on the fly for insertion.
HEADER Ignore the first line in each data file.
MANIFEST The S3 object will be treated as a Redshift compatible manifest containing a list of actual data files to load.
QUOTECHAR 'c' As for the Python csv module formatting parameters.
QUOTING 'style' As for the Python csv module QUOTE_* parameters (without the QUOTE_ prefix). Default is minimal (i.e. QUOTE_MINIMAL). Case is not significant.
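Since the table above states that the CSV options map directly onto Python csv module formatting parameters, the translation can be sketched as follows. This is an illustrative guess at the mapping, not lava's actual loader code; the `args` list and parsing logic are hypothetical.

```python
import csv
import io

# Hypothetical illustration: how the loader's CSV options plausibly map
# onto Python csv module formatting parameters.
args = ["DELIMITER '|'", "QUOTECHAR '\"'", "QUOTING 'minimal'", "HEADER"]

fmt = {}
skip_header = False
for arg in args:
    parts = arg.split(None, 1)
    opt = parts[0].upper()
    val = parts[1].strip("'") if len(parts) > 1 else None
    if opt == "DELIMITER":
        fmt["delimiter"] = val
    elif opt == "QUOTECHAR":
        fmt["quotechar"] = val
    elif opt == "QUOTING":
        # QUOTE_* constant without the QUOTE_ prefix; case-insensitive.
        fmt["quoting"] = getattr(csv, "QUOTE_" + val.upper())
    elif opt == "HEADER":
        skip_header = True

data = io.StringIO('a|b\n1|"x|y"\n')
rows = list(csv.reader(data, **fmt))
if skip_header:
    rows = rows[1:]
print(rows)  # [['1', 'x|y']]
```

Note how the quoted field protects the pipe delimiter, which is exactly the kind of escaping issue discussed earlier in this chapter.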

Loading Data to Postgres

Loading to Postgres is done using a client side COPY. The files to be loaded are copied from S3 to the lava worker node, decompressed if required, and then copied to the database via the client.

While this will work for Postgres RDS/Aurora, it will be much slower than a direct S3 copy as described in Loading Data to Postgres RDS.

Moreover, if multiple files are to be loaded (e.g. when using a manifest), the files are processed sequentially. This can make the whole operation rather slow. If large amounts of data need to be loaded, it is recommended to implement a custom loading process.

Loading of each individual file is subject to a timeout which is set via the PG_COPY_TIMEOUT configuration variable. This can be set at the worker or realm level.

Warning

If using TRUNCATE or DROP mode, a COMMIT is done after the TRUNCATE / DROP but prior to loading of new data.

Postgres Specific Parameters

The db_conn_id parameter must point to a connection with a type of postgres or psql.

Using it with a postgres-aurora or postgres-rds connection will work but is not recommended.

The worker uses its own credentials to copy the files from S3 to local storage. The s3_conn_id and s3_iam_role parameters are not used.

The args parameter must contain a list of Postgres COPY options.

The following options are supported:

  • DELIMITER
  • ENCODING
  • ESCAPE
  • FORCE_NOT_NULL (see below)
  • FORCE_NULL (see below)
  • FORMAT
  • FREEZE
  • HEADER
  • NULL
  • OIDS
  • QUOTE

In addition, the following non-standard options are supported:

Option Description
FORCE_NOT_NULL As for FORCE_NULL.
FORCE_NULL This standard option accepts a non-standard * argument which lava will replace with a list of all columns in the target table.
GZIP Source data is compressed using gzip. The worker will decompress it prior to loading.
MANIFEST The S3 object will be treated as a Redshift compatible manifest containing a list of actual data files to load.

Loading Data to Postgres RDS

Loading to Postgres RDS and Aurora Postgres is done using the native aws_s3.table_import_from_s3 facility. This is supported in the following versions:

  • AWS RDS PostgreSQL versions 11.1+
  • AWS RDS Aurora PostgreSQL-compatible versions 10.7+

There are a number of preliminary setup steps required on the database itself to enable this mechanism. Once enabled, it can load uncompressed data and gzip compressed data.

Postgres RDS Specific Parameters

The db_conn_id parameter must point to a connection with a type of either postgres-aurora or postgres-rds.

If the s3_conn_id parameter is specified, the relevant connector will be used to provide credentials to access S3. If it is not specified, the database must have an attached IAM role that provides the required access to S3; as that role is used implicitly, no further S3 configuration is needed in the job.

The args parameter must contain a list of Postgres COPY command options appropriate for the data file.

The following options are supported:

  • DELIMITER
  • ENCODING
  • ESCAPE
  • FORCE_NOT_NULL (see below)
  • FORCE_NULL (see below)
  • FORMAT
  • FREEZE
  • HEADER
  • NULL
  • OIDS
  • QUOTE

In addition, the following non-standard options are supported:

Option Description
FORCE_NOT_NULL As for FORCE_NULL.
FORCE_NULL This standard option accepts a non-standard * argument which lava will replace with a list of all columns in the target table.
GZIP The Postgres copy operation can load gzip compressed data but the S3 object being loaded must have its Content-Encoding metadata set to gzip. If this option is specified, lava will attempt to set the content encoding appropriately if necessary. Be aware that this is, in fact, an object copy operation with new metadata so the worker will require appropriate IAM permissions to do that.
MANIFEST The S3 object will be treated as a Redshift compatible manifest containing a list of actual data files to load.

Info

When configuring a bucket event notification for s3trigger to dispatch a db_from_s3 job using GZIP compression, do not trigger on All object create events as this may cause two events to be sent and the job will run twice. Restrict the event type to ObjectCreated:Put only. This is due to the hidden copy operation lava must perform to set the correct content encoding on the source S3 object.

Enabling the Postgres RDS Copy Extension

To enable the aws_s3.table_import_from_s3 copy extension follow the AWS instructions.

It is also necessary to allow the database users who will be performing copy operations to access the extension, thus:

GRANT USAGE ON SCHEMA aws_s3 TO <USER>;
GRANT EXECUTE ON FUNCTION aws_s3.table_import_from_s3
    (
        table_name text,
        column_list text,
        options text,
        bucket text,
        file_path text,
        region text,
        access_key text,
        secret_key text,
        session_token text
        )
    TO <USER>;

Postgres RDS/Aurora Examples

The following example shows loading of a data file to a table in switch mode. The tables membership_a and membership_b will be created with the given column specifications if they don't already exist.

{
  "description": "Copy data to the table 'custardclub.membership'",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/copy-mytable",
  "owner": "demo@somewhere.com",
  "parameters": {
    "db_conn_id": "pg-rds-conn-01",
    "s3_conn_id": "s3-conn-01",
    "bucket": "my-bucket",
    "key": "a/b/data.csv",
    "schema": "custardclub",
    "table": "membership",
    "mode": "truncate",
    "args": [
      "FORMAT CSV",
      "HEADER"
    ],
    "columns": [
      "CustardClubNo INTEGER PRIMARY KEY NOT NULL",
      "GivenName VARCHAR(20)",
      "FamilyName VARCHAR(30)",
      "CustardBidPrice FLOAT(2)",
      "CustardJedi BOOLEAN",
      "LastCustard DATE",
      "CustardQuota INTEGER"
    ]
  },
  "payload": "--",
  "type": "db_from_s3",
  "worker": "default"
}

Loading Data to Redshift

Loading to Redshift is done using the native Redshift COPY command. Data files contain structured, tabular data. All of the formats supported by the COPY command can be used, including:

  • CSV files
  • JSON files
  • Compressed versions of the above.

For Redshift, it is generally better to use truncate mode rather than delete.

Redshift Specific Parameters

The db_conn_id parameter must point to a connection with a type of redshift or redshift-serverless.

The args parameter must contain a list of Redshift COPY options appropriate for the data file. Most of the parameters supported by Redshift COPY, including MANIFEST, can be used.

Redshift Examples

The following example shows loading of a data file to a table in append mode. The table will be created with the given column specifications if it doesn't already exist.

{
  "description": "Copy data to the table 'custardclub.membership'",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/copy-mytable",
  "owner": "demo@somewhere.com",
  "parameters": {
    "db_conn_id": "redshift-conn-01",
    "s3_conn_id": "s3-conn-01",
    "bucket": "my-bucket",
    "key": "a/b/data.csv.bz2",
    "schema": "custardclub",
    "table": "membership",
    "mode": "truncate",
    "args": [
      "BLANKSASNULL",
      "BZIP2",
      "CSV",
      "EMPTYASNULL",
      "IGNOREHEADER 1",
      "MAXERROR 0"
    ],
    "columns": [
      "CustardClubNo INTEGER PRIMARY KEY NOT NULL",
      "GivenName VARCHAR(20)",
      "FamilyName VARCHAR(30)",
      "CustardBidPrice FLOAT(2)",
      "CustardJedi BOOLEAN",
      "LastCustard DATE",
      "CustardQuota INTEGER"
    ]
  },
  "payload": "--",
  "type": "db_from_s3",
  "worker": "default"
}

Loading Data to SQLite3

Loading to SQLite is done by lava performing direct data insertion. The files to be loaded are copied from S3 to the lava worker node, decompressed on the fly if required, and then inserted row by row into the database by lava.

Warning

If using TRUNCATE or DROP mode, a COMMIT is done after the TRUNCATE / DROP but prior to loading of new data.

SQLite3 Specific Parameters

The db_conn_id parameter must point to a connection with a type of sqlite3.

The schema parameter must be set to main as SQLite3 has a very limited notion of schemas.

The args parameter supports the following options. The CSV format related options are linked directly to the Python csv module formatting parameters.

Option Description
DELIMITER 'c' As for the Python csv module formatting parameters.
DOUBLEQUOTE Controls how instances of QUOTECHAR appearing inside a field should themselves be quoted. If the DOUBLEQUOTE option is present, the character is doubled. If not present, the ESCAPECHAR is used as a prefix to the QUOTECHAR.
ESCAPECHAR 'c' As for the Python csv module formatting parameters.
GZIP Source data is compressed using gzip. The worker will decompress it on the fly for insertion.
HEADER Ignore the first line in each data file.
MANIFEST The S3 object will be treated as a Redshift compatible manifest containing a list of actual data files to load.
QUOTECHAR 'c' As for the Python csv module formatting parameters.
QUOTING 'style' As for the Python csv module QUOTE_* parameters (without the QUOTE_ prefix). Default is minimal (i.e. QUOTE_MINIMAL). Case is not significant.

Job type: dispatch

The dispatch job type initiates a dispatch of other jobs.

Unlike the chain job type, this is an asynchronous operation in that the dispatch is initiated but the job run does not wait for the dispatched jobs to actually start. The dispatched jobs will not necessarily run on the same worker and will each have their own run_id.

Any globals available in the dispatch job will also be passed in the dispatch requests to the jobs being dispatched.

Payload

The payload is either a comma separated string of job IDs or an actual list of job IDs to dispatch.
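Accepting both forms means the worker must normalise the payload before dispatching. A minimal sketch of that normalisation (a hypothetical helper, not lava's actual code):

```python
def normalise_payload(payload):
    """Accept either a comma separated string or a list of job IDs."""
    if isinstance(payload, str):
        return [job_id.strip() for job_id in payload.split(",") if job_id.strip()]
    return list(payload)

print(normalise_payload("demo/job_01, demo/job_02"))  # ['demo/job_01', 'demo/job_02']
print(normalise_payload(["demo/job_03"]))             # ['demo/job_03']
```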

Parameters

Parameter Type Required Description
delay String No Dispatch message sending delay in the form nnX where nn is a number and X is s (seconds) or m (minutes). The maximum allowed value is 15 minutes.
jinja Boolean No If false, disable Jinja rendering of the payload. Default true.
job_prefix String No Prepend the specified value to each job ID in the payload.
parameters Map[String,*] No A map of parameters that will be passed to the job being dispatched. This is Jinja rendered.
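The nnX delay format described above can be parsed with a few lines of Python. This is a hypothetical sketch of the parsing and limit check, not lava's implementation; the 900 second cap reflects the documented 15 minute maximum.

```python
import re

def parse_delay(value, max_seconds=900):
    """Parse an 'nnX' delay such as '30s' or '5m' into seconds (sketch)."""
    m = re.fullmatch(r"(\d+)([sm])", value)
    if not m:
        raise ValueError(f"invalid delay: {value!r}")
    seconds = int(m.group(1)) * (60 if m.group(2) == "m" else 1)
    if seconds > max_seconds:
        raise ValueError("delay exceeds the 15 minute maximum")
    return seconds

print(parse_delay("5m"))   # 300
print(parse_delay("90s"))  # 90
```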

Handling of Globals

The dispatch job type merges its globals into those of the child jobs being dispatched. A value specified in the parent dispatch job overrides a similarly named value in the child.

In addition, the dispatch job will also add lava specific globals under globals.lava. These lava owned globals allow all jobs dispatched as a result of the current job to access some common global values.
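The merge semantics (parent values win over similarly named child values) amount to a simple dictionary merge. A sketch of the described behaviour, using illustrative values rather than real lava globals:

```python
# Parent (dispatch job) globals override similarly named child globals.
child_globals = {"env": "prod", "region": "ap-southeast-2"}
parent_globals = {"env": "dev", "batch_date": "2024-01-01"}

merged = {**child_globals, **parent_globals}
print(merged)  # {'env': 'dev', 'region': 'ap-southeast-2', 'batch_date': '2024-01-01'}
```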

Jinja Rendering of the Payload

The parameters parameter is Jinja rendered.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.

Dev Mode Behaviour

The dispatch job behaviour is unchanged for dev mode.

Examples

The following example dispatches two downstream jobs with no delay.

{
  "description": "Dispatch two downstream jobs",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dispatch2",
  "owner": "demo@somewhere.com",
  "payload": "demo/job_01, demo/job_02",
  "type": "dispatch",
  "worker": "default"
}

This one dispatches a downstream job with a 5 minute delay and some additional parameters being passed to the dispatched job. Jinja rendering is disabled.

{
  "description": "Dispatch a downstream job",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/dispatch1",
  "owner": "demo@somewhere.com",
  "payload": "demo/job_01, demo/job_02",
  "parameters": {
    "delay": "5m",
    "jinja": false,
    "parameters": {
      "param01": "Hello",
      "param02": "world"
    }
  },
  "type": "dispatch",
  "worker": "default"
}

Job type: docker

The docker job type launches a docker container on the worker node and runs the nominated command. Refer to the chapter on docker for more information.

Lava will connect to a docker registry, pull the specified image from a repository and use it to create and run a container with the specified command. If no command is specified, the default entry point for the container will be used.

The logs from the container will be captured and, if non-empty, placed into the realm s3_temp area as <s3_temp>/<job_id>/<run_id>/logs.

If the command returns a zero exit status the job is considered to have succeeded. A non-zero exit status indicates failure.

Payload

The payload is the container repository and, optionally, tag in the form repository[:tag]. If the tag is not specified, a tag of latest is used.
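The repository[:tag] split with a default of latest can be sketched as below. This is an illustrative helper (it deliberately ignores edge cases such as registry host names containing port numbers), not lava's actual parsing code:

```python
def parse_image(payload):
    """Split a 'repository[:tag]' payload, defaulting the tag to 'latest'."""
    repository, _, tag = payload.partition(":")
    return repository, tag or "latest"

print(parse_image("jin-gizmo/lava/amzn2023/base"))   # ('jin-gizmo/lava/amzn2023/base', 'latest')
print(parse_image("dist/lava/centos8/full:latest"))  # ('dist/lava/centos8/full', 'latest')
```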

Environment

The following variables are placed into the environment for the container.

Variable Description
LAVA_JOB_ID The job_id.
LAVA_OWNER The value of the owner field from the job specification.
LAVA_REALM The realm name.
LAVA_RUN_ID The run_id UUID.
LAVA_S3_KEY The identifier for the KMS key needed to write data into the S3 temporary area.
LAVA_S3_PAYLOAD The payload location for this job.
LAVA_S3_TMP The private S3 temporary area for this job run. The executables are allowed to put data here.
LAVA_TMP The temporary directory on the host worker node is mapped into the container at this location.
LAVA_WORKER The worker name.

Docker Runtime Configuration

The container is run with the following run-time attributes:

  1. Environment variables as described above.

  2. The local temporary area on the worker node is mapped inside the container and its location is made available as the LAVA_TMP environment variable. The container can read and write this directory as for other lava jobs. This can be useful for chain jobs as they share the same temporary directory.

  3. The container runs with the user ID (uid) set to the effective uid (euid) of the lava worker unless overridden by the host_config in the job specification. Don't do this unless you know what you're doing.

  4. The effective group ID (egid) of the lava worker is added to the list of group IDs for the container unless overridden by the host_config in the job specification. Don't do this unless you know what you're doing.

  5. Unless otherwise specified in the host_config job parameter, the default host configuration (memory, CPU, ports etc) is used.

  6. Stderr and stdout are captured by the lava worker.

  7. The container will use any proxy configuration present in the docker client configuration file.

  8. Unless otherwise specified in the host_config, the hostname of the container is set to lava-<REALM>-<WORKER>-<RUN_ID>.

  9. Unless otherwise specified in the host_config job parameter, the working directory is set to the temporary directory shared between host and container. This mimics the behaviour of other executable job types. The lava worker will clean up this area on job completion.

Parameters

Parameter Type Required Description
args List[String] No A list of additional arguments for the command. Ignored if no command is specified.
command String No The command to run in the container. This will be parsed using standard Linux shell lexical analysis to determine the executable and arguments. If not specified, the default entry point for the container is used.
connections Map[String,String] No A dictionary with keys that are connection labels and values that are connection IDs (conn_id).
docker String No The conn_id for connecting to docker. If not specified, a value must be specified for the entire realm in the realms table.
env Map[String,String] No A map of additional environment variables for the container.
host_config Map[String,*] No A map of container host configuration parameters.
jinja Boolean No If false, disable Jinja rendering of the args. Default true.
timeout String No By default, containers run by docker jobs are killed 10 minutes after the container starts to run. This parameter can override that with values in the form nnX where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that the timeout must be less than the visibility timeout on the worker SQS queue minus the time to pull the image and start the container.
vars Map[String,*] No A map of variables injected when the command arguments and environment are rendered.

Connecting to Docker

Lava needs to be able to connect to a docker daemon to create containers and a docker registry to obtain docker images. Like all connections in lava, this is managed through the connection manager.

When running a docker job, the process to obtain the specified image is:

  1. Look for a docker connection ID in the job parameters. If not found there, look for a docker connection ID in the realm specification from the realms table.

  2. Fetch the daemon and registry connection details from the connections table.

  3. Connect to the docker daemon.

  4. Login to the docker registry and pull the required image from the specified registry.

Refer to the section on the docker connector for more information.

Container Host Configuration

The host_config parameter in the docker job specification is a map that allows a number of configuration parameters for the container to be specified. The definition and types of these parameters correspond to like named parameters of the containers.run() function of the Docker SDK for Python.

The following parameters are permitted:

  • blkio_weight
  • blkio_weight_device
  • cap_add
  • cap_drop
  • cpu_count
  • cpu_percent
  • cpu_period
  • cpu_quota
  • cpu_shares
  • cpuset_cpus
  • cpuset_mems
  • device_read_bps
  • device_read_iops
  • device_write_bps
  • device_write_iops
  • dns
  • dns_opt
  • dns_search
  • domainname
  • extra_hosts
  • hostname
  • group_add
  • mem_limit
  • mem_swappiness
  • memswap_limit
  • nano_cpus
  • network_disabled
  • network_mode
  • ports
  • publish_all_ports
  • shm_size
  • user
  • working_dir

So, for example, to allow a container to use CPUs 0 and 1, disable networking in the container and set the working directory to /tmp/lava add the following:

{
  "parameters": {
    "host_config": {
      "cpuset_cpus": "0,1",
      "network_disabled": true,
      "working_dir": "/tmp/lava"
    }
  }
}
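Since only the parameters listed above are permitted, a worker would need to validate the supplied host_config before passing it to the Docker SDK. The following is an illustrative validation sketch (the allowed set is abridged here, and this is not lava's actual code):

```python
# Abridged from the full permitted list above; illustrative only.
ALLOWED_HOST_CONFIG = {
    "blkio_weight", "cap_add", "cap_drop", "cpu_count", "cpuset_cpus",
    "hostname", "mem_limit", "network_disabled", "ports",
    "publish_all_ports", "shm_size", "user", "working_dir",
}

def check_host_config(host_config):
    """Reject host_config keys outside the permitted set."""
    unknown = set(host_config) - ALLOWED_HOST_CONFIG
    if unknown:
        raise ValueError(f"unsupported host_config keys: {sorted(unknown)}")
    return host_config

cfg = check_host_config({"cpuset_cpus": "0,1", "network_disabled": True})
print(cfg)  # {'cpuset_cpus': '0,1', 'network_disabled': True}
```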

Jinja Rendering of the Arguments and Environment

The collected arguments for the command and any environment values defined in the job specification are individually rendered using Jinja prior to execution.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Connection Handling

Connections are handled similarly to the exe job type. In the case of a docker job, the connection handler executables are mapped into the container and are invoked in the same way.

Python executables running in a docker container based on one of the docker images for lava can use the connection manager directly as described in the relevant section in the chapter on developing lava jobs. It is important to ensure that /usr/local/lib/lava is on the PYTHONPATH for the executable in the container.

Dev Mode Behaviour

Normally, the docker job will copy the container logs to S3 on the conclusion of the job. In dev mode, the container logs are emitted locally after the job run instead of being copied to S3.

Examples

The following example will run a shell to invoke the Linux date command in one of the standard lava docker images. The connection ID for the docker registry containing the image is ecr.

{
  "description": "Run a simple command in a container",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/simple-docker",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "-c",
      "date"
    ],
    "command": "/bin/sh",
    "docker": "ecr",
    "timeout": "10s"
  },
  "payload": "jin-gizmo/lava/amzn2023/base",
  "type": "docker",
  "worker": "core"
}

This example does exactly the same thing with the docker image and command specified in a slightly different way.

{
  "description": "Run a simple command in a container",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/simple-docker",
  "owner": "demo@somewhere.com",
  "parameters": {
    "command": "/bin/sh -c date",
    "docker": "ecr",
    "timeout": "10s"
  },
  "payload": "dist/lava/centos8/full:latest",
  "type": "docker",
  "worker": "core"
}

And again:

{
  "description": "Run a simple command in a container",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/simple-docker",
  "owner": "demo@somewhere.com",
  "parameters": {
    "command": "/bin/sh -c {{vars.cmd}}",
    "docker": "ecr",
    "timeout": "10s",
    "vars": {
        "cmd": "date"
    }
  },
  "payload": "dist/lava/centos8/full:latest",
  "type": "docker",
  "worker": "core"
}

Job type: exe

The exe job type downloads one or more executables from the S3 payload area and runs them.

Payload

The payload is a location in S3 relative to the s3_payloads area specified in the realms table. It can be either an object key, in which case a single executable is downloaded, or a prefix ending in /, in which case all executables under that prefix will be downloaded and run in lexicographic order.

See S3 Payloads for more information.
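The prefix case (non-recursive selection, lexicographic run order) can be sketched as a filter and sort over the object keys. This is a hypothetical illustration of the described behaviour, not lava's downloader code:

```python
def payload_objects(keys, prefix):
    """Keys directly under `prefix` (non-recursive), in lexicographic order."""
    return sorted(
        k for k in keys
        if k.startswith(prefix) and "/" not in k[len(prefix):]
    )

keys = [
    "demo/scripts/10_extract.sh",
    "demo/scripts/05_setup.sh",
    "demo/scripts/sub/ignored.sh",
    "demo/other.sh",
]
print(payload_objects(keys, "demo/scripts/"))
# ['demo/scripts/05_setup.sh', 'demo/scripts/10_extract.sh']
```

A common convention with such prefixes is to number the scripts (05_, 10_, ...) so the lexicographic order matches the intended run order.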

The payload value will be parsed using standard Linux shell lexical analysis to determine the S3 object location and, optionally, any arguments. Additional arguments can also be specified with the args parameter.
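In Python, "standard Linux shell lexical analysis" corresponds to shlex.split, so quoted arguments in the payload survive intact. A small illustration (the payload string here is made up):

```python
import shlex

# Shell-style splitting keeps quoted arguments together.
payload = "demo/report.py --title 'Monthly Report' --verbose"
parts = shlex.split(payload)
print(parts[0])   # demo/report.py  (the S3 object location)
print(parts[1:])  # ['--title', 'Monthly Report', '--verbose']
```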

Info

The exe job type requires the payload to be a single string. It does not support the list mechanism provided by the v2 Payload Downloader.

Environment

Stdin will be redirected from /dev/null. Stdout and stderr are captured and, if non-empty, placed into the realm s3_temp area with the following prefixes:

  • stdout: <s3_temp>/<job_id>/<run_id>/stdout

  • stderr: <s3_temp>/<job_id>/<run_id>/stderr

The following variables are placed into the environment for the command.

Variable Description
LAVA_JOB_ID The job_id.
LAVA_OWNER The value of the owner field from the job specification.
LAVA_REALM The realm name.
LAVA_RUN_ID The run_id UUID.
LAVA_S3_KEY The identifier for the KMS key needed to write data into the S3 temporary area.
LAVA_S3_PAYLOAD The payload location for this job.
LAVA_S3_TMP The private S3 temporary area for this job run. The executables are allowed to put data here.
LAVA_WORKER The worker name.
PYTHONPATH The PYTHONPATH has the lava code directory appended. This allows Python based executables to directly import lava modules. This is particularly handy for accessing the lava connection manager from within a Python program.

Warning

If the payload is a script (e.g. bash or Python) then Lava relies on the hashbang line at the beginning of the file. If the file has been edited on a DOS system then it may have DOS style CR-LF line endings which will cause the script interpreter to be unrecognised and the job will fail.

Parameters

Parameter Type Required Description
args List[String] No A list of additional arguments for the executable(s).
connections Map[String,String] No A dictionary with keys that are connection labels and values that are connection IDs (conn_id).
env Map[String,String] No A map of additional environment variables for the command.
jinja Boolean No If false, disable Jinja rendering of the args. Default true.
timeout String No By default, executables run by exe jobs are killed after 10 minutes. This parameter can override that with values in the form nnX where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that the total timeout for the entire job (timeout multiplied by number of payload elements) must be less than the visibility timeout on the worker SQS queue.
vars Map[String,*] No A map of variables injected when the command arguments and environment are Jinja rendered.

Jinja Rendering of the Arguments and Environment

The collected arguments for the executable(s) and any environment values defined in the job specification are individually rendered using Jinja prior to execution.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Connection Handling

Lava provides a mechanism for exe jobs to utilise connections defined in the connections table. It does this by creating a small executable that gives effect to the connection and then placing the name of that executable into an environment variable where it can be accessed by the job payload.

For example, consider a parameters field for the job like the following,

{
  "parameters": {
    "connections": {
      "pgres_db": "conn_id_for_postgres"
    }
  }
}

Lava will create a small executable that automates connection to the relevant Postgres database using the Postgres command line client (psql) and place the name of that executable in the environment variable LAVA_CONN_PGRES_DB. A second environment variable, LAVA_CONNID_PGRES_DB, will contain the connection ID itself, conn_id_for_postgres in this case.

If the payload is a shell script, for example, it would use this in the following way:

# Use $LAVA_CONN_PGRES_DB anywhere you could use the psql CLI.
# The authentication is automated behind the scenes.

$LAVA_CONN_PGRES_DB <<!
SELECT * FROM pg_user
WHERE usename = 'fred';
!

If the payload is a Python script, the lava connection manager can be used directly to provide a native Python DBAPI 2.0 connection object, thus:

import os
from lava.connection import get_pysql_connection

conn = get_pysql_connection(
    conn_id=os.environ['LAVA_CONNID_PGRES_DB'],
    realm=os.environ['LAVA_REALM']
)

# Now use conn object in the usual way.

All of the database connectors work in the same way. Non-database connectors have a similar interface but the behaviour and usage depends on the underlying nature of the thing to which connection is required.

Dev Mode Behaviour

Normally, the exe job will copy stdout and stderr to S3 on the conclusion of the job. In dev mode, stdout and stderr are emitted locally during the job run instead of being copied to S3.

Examples

The following example will download and run a shell script with a non-standard timeout of 20 minutes.

{
  "description": "Test pgconnector from exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/shell",
  "owner": "demo@somewhere.com",
  "parameters": {
    "timeout": "20m"
  },
  "payload": "demo/shell_script.sh",
  "type": "exe",
  "worker": "default"
}

This one will download all files under the demo/scripts/ prefix from the payloads area and run them. Note that the download is not recursive.

{
  "description": "Test pgconnector from exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/multi_exe",
  "owner": "demo@somewhere.com",
  "payload": "demo/scripts/",
  "type": "exe",
  "worker": "default"
}

This one will run a Python script and provide it with connection handles to a Postgres database and a MySQL database. The command needed to run a Postgres database client with auto-connect will be in the environment variable LAVA_CONN_PGRES_DB. The command needed to run a MySQL database client with auto-connect will be in the environment variable LAVA_CONN_MYSQL_DB.

{
  "description": "Run pgconnector from exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/py_exe",
  "owner": "demo@somewhere.com",
  "parameters": {
    "connections": {
      "mysql_db": "conn_id_for_mysql",
      "pgres_db": "conn_id_for_postgres"
    }
  },
  "payload": "demo/conect_to_db.py",
  "type": "exe",
  "worker": "default"
}

This one will run an executable with some arguments.

{
  "description": "How to pass arguments to the exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/exe_with_args",
  "owner": "demo@somewhere.com",
  "payload": "demo/my-exe -y --log-level info",
  "type": "exe",
  "worker": "default"
}

This one will do exactly the same thing with the arguments specified in a different way.

{
  "description": "How to pass arguments to the exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/exe_with_args",
  "owner": "demo@somewhere.com",
  "payload": "demo/my-exe -y",
  "parameters": {
    "args": [
      "--log-level",
      "info"
    ]
  },
  "type": "exe",
  "worker": "default"
}

And again.

{
  "description": "How to pass arguments to the exe",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/exe_with_args",
  "owner": "demo@somewhere.com",
  "payload": "demo/my-exe -y --log-level {{vars.level}}",
  "parameters": {
    "vars": {
      "level": "info"
    }
  },
  "type": "exe",
  "worker": "default"
}

Job type: foreach

The foreach job type runs a single specified job in a loop.

Like the chain job type, this is a synchronous operation. The parent foreach job and the child job must be set to run on the same worker and the parent and all child iterations will run under the run_id of the parent job.

Any globals available in the foreach job will also be passed to each iteration of the child job, with additional iteration-specific globals injected for each iteration. A loop index counter (globals.lava.foreach_index) is also made available to child iterations in the globals owned by lava.

Payload

The payload is the job_id of the job to be iterated.

Parameters

Parameter Type Required Description
can_fail Boolean No A boolean that indicates if individual iterations are permitted to fail without causing the entire job to fail. Defaults to false. See Allowing Foreach Child Jobs to Fail.
foreach Map[String,*] Yes This parameter specifies the mechanism used to generate named values for each iteration. It is effectively a Python-like iterator yielding a dictionary of values for each iteration that is merged into the globals for the child job. See Foreach Value Generators.
limit Integer No Attempting to run a foreach job with more than this many loop iterations will fail without any iteration being run at all. Defaults to the realm level value as specified by the FOREACH_LIMIT configuration parameter.
jinja Boolean No If true (the default), enable Jinja rendering. What gets rendered in the foreach generator specification is controlled by the generator itself.

Foreach Value Generators

The foreach parameter is a map that defines the mechanism used to generate a set of named values for each loop iteration. These named values are merged into the globals passed to the child job.

Value generation is as lazy as possible to minimise overheads and memory usage.

Info

Value names returned by a foreach generator must not begin with lava (case independent).

Apart from the common type field, the other fields in the foreach parameter depend on the type. Field values are not Jinja rendered unless indicated otherwise below.

Foreach type: CSV

The csv generator reads data from a CSV file. The file must have a header to provide the names of values generated for each foreach iteration.

Field Type Required Description
type String Yes The foreach generator type: csv.
filename String Yes The name of the CSV file, which can be local or in S3 (s3://...). This field is Jinja rendered.

For example, given this CSV source file ...

a,b
a0,b0
a1,b1

... the globals of the child job would have globals.a=a0 on the first iteration and globals.a=a1 on the second.
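The lazy, dict-per-row behaviour of the csv generator can be sketched like so (illustrative only, not lava's actual code):

```python
import csv
import io

def csv_foreach(fileobj):
    # Hedged sketch of the csv generator's behaviour: lazily yield
    # one dict per data row, keyed on the header names.
    for row in csv.DictReader(fileobj):
        yield dict(row)

source = io.StringIO("a,b\na0,b0\na1,b1\n")
for values in csv_foreach(source):
    print(values)
# → {'a': 'a0', 'b': 'b0'}
# → {'a': 'a1', 'b': 'b1'}
```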

Foreach type: inline

The inline generator contains the iteration values within the foreach parameter itself.

Field Type Required Description
type String Yes The foreach generator type: inline.
values List[Map[String,*]] Yes A list of maps containing iteration values.

For example, given this (partial) job specification ...

{
  "job_id": "example",
  "type": "foreach",
  "payload": "child_job_id",
  "parameters": {
    "foreach": {
      "type": "inline",
      "values": [
        {"a": "a0", "b": "b0" },
        {"a": "a1", "b": "b1" }
      ]
    }
  }
}

... the globals of the child job would have globals.a=a0 on the first iteration and globals.a=a1 on the second.

Note

It is not required that each value map in the list has the same keys, although it would be an unusual use-case.

Foreach type: jsonl

The jsonl generator reads JSON encoded objects from a file, one object per line.

Field Type Required Description
type String Yes The foreach generator type: jsonl.
filename String Yes The name of the JSONL file, which can be local or in S3 (s3://...). This field is Jinja rendered.

For example, given this JSONL source file ...

{"a": "a0", "b": "b0" }
{"a": "a1", "b": "b1" }

... the globals of the child job would have globals.a=a0 on the first iteration and globals.a=a1 on the second.

Note

It is not required that each JSON line has the same keys, although it would be an unusual use-case.
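The jsonl generator's line-by-line behaviour can be sketched in the same way (illustrative only, not lava's actual code):

```python
import io
import json

def jsonl_foreach(fileobj):
    # Hedged sketch of the jsonl generator: lazily parse one JSON
    # object per non-blank line.
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

source = io.StringIO('{"a": "a0", "b": "b0"}\n{"a": "a1", "b": "b1"}\n')
for values in jsonl_foreach(source):
    print(values)
# → {'a': 'a0', 'b': 'b0'}
# → {'a': 'a1', 'b': 'b1'}
```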

Foreach type: query

The query generator obtains iteration values from a database query.

Field Type Required Description
type String Yes The foreach generator type: query.
conn_id String Yes The connection ID for a target database.
query String Yes The query to provide iteration values. Each row provides one set of iteration values. The field names are derived from the column names in the query. Take care to specify useful column names in the query when using database functions.

But but but ...

No, the query text is not Jinja rendered ... and it won't be.

For example, given this (partial) job specification ...

{
  "job_id": "example",
  "type": "foreach",
  "payload": "child_job_id",
  "parameters": {
    "foreach": {
      "type": "query",
      "conn_id": "my_db_conn_id",
      "query": "SELECT a, COUNT(*) AS b FROM some.table GROUP by 1 LIMIT 5"
    }
  }
}

... would generate up to 5 iterations, each containing an a and b global.

Note

Be careful about producing too many rows or including unnecessary values. The limit parameter will force the job to abort if more than that many rows are produced. Use database views to good effect here.
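The way each result row becomes a set of named iteration values can be sketched as follows (illustrative only, not lava's actual code; sqlite3 stands in here for the target database):

```python
import sqlite3

def query_foreach(conn, query):
    # Hedged sketch of the query generator: each row yields one set
    # of iteration values, keyed on the query's column names.
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    for row in cursor:
        yield dict(zip(names, row))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("x",), ("y",)])
rows = list(query_foreach(conn, "SELECT a, COUNT(*) AS b FROM t GROUP BY 1 ORDER BY 1"))
print(rows)
# → [{'a': 'x', 'b': 2}, {'a': 'y', 'b': 1}]
```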

Foreach type: range

The range generator behaves in the same way as the Python range() mechanism.

Field Type Required Description
type String Yes The foreach generator type: range.
name String No If specified, the current value of the range counter is made available in a global with this name. This value may differ from globals.lava.foreach_index, which always counts up from 0.
start Integer No The starting index for the range. Default is 0.
stop Integer Yes The stopping value for the range (exclusive, as for Python range()).
step Integer No The increment between successive values. Default is 1.

For example, given this (partial) job specification ...

{
  "job_id": "example",
  "type": "foreach",
  "payload": "child_job_id",
  "parameters": {
    "foreach": {
      "type": "range",
      "name": "a",
      "stop": 2
    }
  }
}

... the globals of the child job would have globals.a=0 on the first iteration and globals.a=1 on the second.
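The generator's behaviour maps directly onto Python's range(), as this sketch shows (illustrative only, not lava's actual code):

```python
def range_foreach(stop, start=0, step=1, name=None):
    # Hedged sketch of the range generator. The optional name exposes
    # the counter as a global for each iteration.
    for value in range(start, stop, step):
        yield {name: value} if name is not None else {}

print(list(range_foreach(2, name="a")))
# → [{'a': 0}, {'a': 1}]
```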

Foreach type: s3list

The s3list generator produces a list of objects in an AWS S3 bucket.

Field Type Required Description
type String Yes The foreach generator type: s3list.
bucket String Yes The S3 bucket name. This field is Jinja rendered.
prefix String No The S3 prefix to list. Defaults to the bucket root. This field is Jinja rendered.
glob String|List[String] No A glob style pattern, or list of patterns. Only S3 objects with names matching any of the patterns are returned.

A (partial) job specification might look like so:

{
  "job_id": "example",
  "type": "foreach",
  "payload": "child_job_id",
  "parameters": {
    "foreach": {
      "type": "s3list",
      "bucket": "my-bucket",
      "prefix": "a/prefix/",
      "glob": [
        "*.csv",
        "*.jsonl"
      ]
    }
  }
}

Each element returned by the generator and merged into the child globals is of the form:

{
    's3obj': {
        'Bucket': 'my-bucket',
        'Key': 'a/prefix/somewhere/in/s3.jsonl',
        'LastModified': datetime.datetime(2024, 1, 1, 6, 2, 59, tzinfo=tzutc()),
        'ETag': '"be0c0123456789abcd0123456678916a"',
        'Size': 197,
        'StorageClass': 'STANDARD'
    }
}

The value in the dictionary shown above is the object returned in the Contents list in the response to the boto3 S3 client list_objects_v2 function, with the addition of the bucket name.

In the example shown above, the child job could access the full S3 object name as:

s3://{{ globals.s3obj.Bucket }}/{{ globals.s3obj.Key }}
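The glob filtering described above can be sketched with the standard fnmatch module (illustrative only, not lava's actual code; it assumes patterns are matched against the full object key):

```python
import fnmatch

def keep_key(key, globs):
    # Hedged sketch of the s3list glob field: keep an object key if
    # it matches any of the supplied patterns.
    return any(fnmatch.fnmatch(key, pattern) for pattern in globs)

keys = ["a/prefix/x.csv", "a/prefix/y.jsonl", "a/prefix/z.txt"]
print([k for k in keys if keep_key(k, ["*.csv", "*.jsonl"])])
# → ['a/prefix/x.csv', 'a/prefix/y.jsonl']
```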

Allowing Foreach Child Jobs to Fail

By default, the foreach loop is aborted when any iteration fails. If the can_fail parameter is set to true, the iteration process will continue and the parent foreach job will succeed even if child iterations fail. In this situation, it is important that the child job handles its own on_fail actions, as the parent will not. Suck it up kid.

This tolerance of failure does not extend to configuration errors in the child job, such as a malformed job specification, a child job sent to the wrong worker etc. These will still cause the entire foreach job to fail.

Keep Calm

Before anyone gets all bitter and twisted about can vs may in can_fail... both are essentially equivalent in this crazy, modern world. Look it up.

Handling of Globals

The foreach job type merges its globals into those of the child job for each iteration. A value specified in the parent foreach job will override a similarly named value in the child.

The globals generated by the foreach iteration generator will override any existing global with the same name.

The foreach job will also add lava specific globals under globals.lava. These lava owned globals allow all child iterations to access some common global values. An attempt by the foreach iteration generator to override any of these will cause the job to fail.

The globals.lava.foreach_index global is special. It is a counter, starting at zero for the first loop iteration and incremented by one for each subsequent iteration.
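The merge precedence described in this section can be sketched as follows (illustrative only, not lava's actual code):

```python
def child_globals(child_spec, parent_spec, generated, index):
    # Hedged sketch of the documented precedence: later updates
    # override earlier ones.
    merged = dict(child_spec)          # child job's own globals
    merged.update(parent_spec)         # parent foreach job overrides
    merged.update(generated)           # generator values override both
    merged["lava"] = {"foreach_index": index}  # lava owned globals
    return merged

print(child_globals({"a": 1, "b": 2}, {"b": 3}, {"c": 4}, 0))
# → {'a': 1, 'b': 3, 'c': 4, 'lava': {'foreach_index': 0}}
```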

Dev Mode Behaviour

The foreach job behaviour is unchanged for dev mode. However, dev mode is propagated to child jobs.

Job type: lavasched

The lavasched job type is a special lava internal job type. It creates the crontab used to dispatch jobs on a node and must be run only on workers that are also dispatchers for the realm.

Each crontab entry schedules the dispatch of a job by running lava-dispatcher to send an SQS message to the appropriate worker SQS queue.

The lavasched jobs must themselves be scheduled to update the crontab periodically to accommodate changes in the jobs table. Refer to the section on Scheduled Dispatch for more information on how to initialise this process.

Refer to the section Schedule Specifications for more information.

Payload

The payload is ignored for lavasched jobs.

Parameters

Parameter Type Required Description
args List[String] No A list of additional arguments that will be added to the invocation of lava-dispatcher in the crontab entries.
dispatcher String or List[String] Yes Name of dispatcher, or a list of names, specifying the dispatchers for which schedules should be built.
env Map[String,String] No A map of environment variables that will be added into the crontab. Of these, the most useful is CRON_TZ, which controls the timezone used by cron for this dispatcher. Note that this only controls the timezone for the dispatch process, not for the job run itself which will be the local timezone of the worker. See also Cron and PATH.

Cron and PATH

Scheduled dispatches are effected by running the Python based lava-dispatcher utility via cron(8). It is critical that the PATH for cron yields a lava compatible version of Python.

Cron typically has a default PATH set at a system level that points to the system default version of Python. This may be an older version that is not lava compatible. This is the case for the lava AMI.

The best way to avoid problems is to explicitly specify the desired PATH in the env parameter for the job. e.g.

{
  "parameters": {
    "env": {
      "PATH": "/usr/local/bin:/bin:/usr/bin"
    }
  }
}

Dev Mode Behaviour

The lavasched job behaviour is unchanged for dev mode.

Examples

The following example will cause the crontab to be rebuilt for the localtime dispatcher every hour. It will include jobs that have a dispatcher field of localtime - including this lavasched job. The dispatch schedule will operate with respect to the local time set on the worker instance.

{
  "description": "Rebuild the dispatcher crontab",
  "dispatcher": "localtime",
  "enabled": true,
  "job_id": "demo/schedule-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "dispatcher": "localtime",
    "env": {
      "PATH": "/usr/local/bin:/bin:/usr/bin"
    }
  },
  "payload": "--",
  "schedule": "0 * * * *",
  "type": "lavasched",
  "worker": "default"
}

This one does the same but sets the logging level on the lava-dispatcher invocations to warning.

{
  "description": "Rebuild the Sydney dispatcher crontab",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/schedule-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [ "--level", "warning" ],
    "dispatcher": "Sydney",
    "env": {
      "PATH": "/usr/local/bin:/bin:/usr/bin"
    }
  },
  "payload": "--",
  "schedule": "0 * * * *",
  "type": "lavasched",
  "worker": "dispatch_syd"
}

This one will cause the dispatcher schedule to operate on Perth time. Note that the jobs themselves will be dispatched on a Perth schedule but they will run under whatever timezone setting the worker has.

{
  "description": "Rebuild the Perth dispatcher crontab",
  "dispatcher": "Perth",
  "enabled": true,
  "job_id": "demo/schedule-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "dispatcher": "Perth",
    "env": {
      "CRON_TZ": "Australia/Perth",
      "PATH": "/usr/local/bin:/bin:/usr/bin"
    }
  },
  "payload": "--",
  "schedule": "0 * * * *",
  "type": "lavasched",
  "worker": "dispatch_perth"
}

Job type: log

The log job type simply logs the job specification to the syslog. Payload and parameters are ignored.

Dev Mode Behaviour

The log job behaviour is unchanged for dev mode.

Job type: pkg

The pkg job type downloads one or more code packages from the S3 payload area, unpacks them and runs the nominated executable within the package using the same invocation mechanism as the exe job type.

Supported package types are:

  • Tar files, including any compressed variants supported by the standard Linux tar command.

  • Zip files.

Payload

The payload is a location in S3 relative to the s3_payloads area specified in the realms table. It can be either an object key, in which case a single package is downloaded, or a prefix ending in /, in which case all packages under that prefix will be downloaded and run in lexicographic order.

See S3 Payloads for more information.

Note

The pkg job type also allows the payload to be a list of S3 locations if the v2 Payload Downloader is enabled.

Environment

Stdin is redirected from /dev/null. Stdout and stderr are captured and, if non-empty, placed into the realm s3_temp area under the following prefixes:

  • stdout: <s3_temp>/<job_id>/<run_id>/stdout

  • stderr: <s3_temp>/<job_id>/<run_id>/stderr

The following variables are placed into the environment for the command.

Variable Description
LAVA_JOB_ID The job_id.
LAVA_OWNER The value of the owner field from the job specification.
LAVA_REALM The realm name.
LAVA_RUN_ID The run_id UUID.
LAVA_S3_KEY The identifier for the KMS key needed to write data into the S3 temporary area.
LAVA_S3_PAYLOAD The payload location for this job.
LAVA_S3_TMP The private S3 temporary area for this job run. The executables are allowed to put data here.
LAVA_WORKER The worker name.
PYTHONPATH The PYTHONPATH has the lava code directory appended. This allows Python based executables to directly import lava modules. This is particularly handy for accessing the lava connection manager from within a Python program.

Warning

If the entry point in the payload is a script (e.g. bash or Python) then Lava relies on the hashbang line at the beginning of the file. If the file has been edited on a DOS system it may have DOS style CR-LF line endings, which will cause the script interpreter to go unrecognised and the job to fail.

Parameters

Parameter Type Required Description
args List[String] No A list of additional arguments for the main executable(s).
command String Yes The name of the entry point executable in the bundle, relative to the root of the bundle. This will be parsed using standard Linux shell lexical analysis to determine the executable and arguments. Additional arguments can also be specified with the args parameter.
connections Map[String,String] No A map with connection labels as keys and connection IDs (conn_id) as values. Handled exactly as for the exe job type.
env Map[String,String] No A map of additional environment variables for the command.
jinja Boolean No If false, disable Jinja rendering of the args. Default true.
timeout String No By default, executables run by pkg jobs are killed after 10 minutes. This parameter can override that with values in the form nnX, where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that the total timeout for the entire job (timeout multiplied by the number of payload elements) must be less than the visibility timeout on the worker SQS queue.
vars Map[String,*] No A map of variables injected when the arguments and environment are Jinja rendered.
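The nnX timeout form can be parsed as in this sketch (illustrative only, not lava's actual code):

```python
import re

def parse_timeout(spec):
    # Hedged sketch of parsing the documented nnX timeout form:
    # nn is a number, X is s, m or h. Returns seconds.
    match = re.fullmatch(r"(\d+)([smh])", spec)
    if match is None:
        raise ValueError(f"invalid timeout: {spec!r}")
    number, unit = match.groups()
    return int(number) * {"s": 1, "m": 60, "h": 3600}[unit]

print(parse_timeout("20m"))
# → 1200
```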

Jinja Rendering of the Arguments and Environment

The collected arguments for the executable(s) and any environment values defined in the job specification are individually rendered using Jinja prior to execution.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Connection Handling

Connections are handled exactly as for the exe job type.

Dev Mode Behaviour

Normally, the pkg job will copy stdout and stderr to S3 on the conclusion of the job. In dev mode, stdout and stderr are emitted locally during the job run instead of being copied to S3.

Examples

The following example will download a zip file, unpack it and run the main executable:

{
  "description": "Run a package of code.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/my-pkg",
  "owner": "demo@somewhere.com",
  "parameters": {
    "command": "main-script.sh --log-level warning"
  },
  "payload": "demo/my-pkg.zip",
  "type": "pkg",
  "worker": "default"
}

Job type: redshift_unload

The redshift_unload job type performs an UNLOAD operation on an AWS Redshift cluster.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
args List[String] Yes A list of options for the UNLOAD command. All of the Redshift supported parameters except for the authorisation parameters are supported. The options are pre-processed before passing to Redshift as described below.
bucket String Yes The name of the S3 bucket to which the data is unloaded.
conn_id String Yes The connection ID for a Redshift provisioned cluster or a Redshift Serverless workgroup.
dateformat String No A Redshift datetime format string that will be applied to DATE fields when unloading. The safest value is YYYY-MM-DD. More information on date formats.
insecure Boolean No Disable bucket security checks. Default is false.
prefix String Yes The prefix in the S3 bucket to which the data is unloaded. This value is Jinja rendered prior to use.
relation String or List[String] Yes The name of the table or view to be unloaded (without schema) or a list of names.
s3_conn_id String No The connection ID for AWS S3. This is used in the UNLOAD command to allow Redshift access to S3. Either s3_conn_id or s3_iam_role is required.
s3_iam_role String No The IAM role name to use in the UNLOAD command to allow Redshift access to S3. Either s3_conn_id or s3_iam_role is required.
schema String Yes The name of the source schema for the object to be unloaded.
start String No Name of the relation to start with when unloading a list of relations. If not specified, start at the beginning of the list. This is useful when stop_on_fail is false and an unload fails as it allows the unloads to be resumed at the point of failure once the issue is rectified.
stop_on_fail Boolean No If true, stop when any unload fails otherwise keep moving through the unload list. Default is true. The event record for the job will indicate which unloads succeeded and which failed.
vars Map[String,*] No A map of variables injected when the S3 target prefix is Jinja rendered.
where String No An optional WHERE condition for the UNLOAD queries. Do not include the WHERE keyword. Note that the same condition will be added to each unload if multiple relations are specified.

Handling of Redshift Unload Options

The options for the Redshift UNLOAD command specified in the args key of the job parameters are pre-processed prior to use in the UNLOAD operation. Unless otherwise specified below, the option is passed through unmodified.

Partition Option

The PARTITION option is supplied to the UNLOAD command in the following form:

PARTITION BY (col1, col2,...)

If the unload args contain the PARTITION clause in this format, it is used as is.

The unload args can instead contain a PARTITION clause in this format:

PARTITION BY @<SCHEMA_NAME>.<REL_NAME>

In this case, the partition information is assumed to be contained in a partition table (or view) <SCHEMA_NAME>.<REL_NAME> that resides in the same Redshift cluster as the source relation. The partition table (or view) must include the following columns:

Name Type Description
schema_name VARCHAR(127) Schema name in lower case.
rel_name VARCHAR(127) Table or view name in lower case.
partitions VARCHAR(n) A comma separated list of columns in the relation that will be used to populate the partition column list in the UNLOAD command.

The following SQL DDL would create a suitable partition table.

CREATE TABLE metadata.partitions
(
    schema_name VARCHAR(127)  NOT NULL,
    rel_name    VARCHAR(127)  NOT NULL,
    partitions  VARCHAR(2048) NOT NULL,
    PRIMARY KEY (schema_name, rel_name)
)

The PARTITION argument in the job parameters would then look like:

PARTITION BY @metadata.partitions

Date Formatting

Redshift has some very nasty behaviour in its date handling. It will UNLOAD date fields with years less than 100 in YY-MM-DD format. Madness.

Redshift COPY using DATEFORMAT 'auto' assumes dates in YY-MM-DD format must be Y2K adjusted (i.e. they are moved to post 2000). More madness.

So, an UNLOAD followed by a COPY will mangle dates. While the COPY command could specify DATEFORMAT 'YY-MM-DD', this will then fail for any dates with years greater than 99. So, if the column contains dates before and after the year 100, you're in trouble.

In short, there is no reliable way to UNLOAD and then COPY a data set containing dates without manually formatting date fields.

To get around this singular piece of genius, use the dateformat key in the job parameters.

If the dateformat key is present, and the relation has one or more DATE fields, redshift_unload will construct a SELECT statement for the unload that includes all of the columns, applying the given date format to DATE columns. This is likely to impact UNLOAD performance, so it's best to avoid using it if you are certain all dates have years greater than 99. The safest format value is probably YYYY-MM-DD.

If the dateformat key is not present, or the relation has no DATE columns, redshift_unload will simply use SELECT * for the target relation in the UNLOAD command.
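The column handling described above can be sketched as follows. This is illustrative only: the (name, type) column list and the use of TO_CHAR are assumptions about how the date format might be applied, not lava's actual code.

```python
def unload_select(schema, relation, columns, dateformat=None):
    # Hedged sketch: columns is an assumed list of (name, type) pairs.
    # DATE columns get the format applied; everything else passes through.
    if dateformat and any(ctype == "DATE" for _, ctype in columns):
        parts = [
            f"TO_CHAR({name}, '{dateformat}') AS {name}" if ctype == "DATE" else name
            for name, ctype in columns
        ]
        return f"SELECT {', '.join(parts)} FROM {schema}.{relation}"
    # No dateformat or no DATE columns: plain SELECT *.
    return f"SELECT * FROM {schema}.{relation}"

print(unload_select("public", "mytable", [("id", "INTEGER"), ("dob", "DATE")], "YYYY-MM-DD"))
# → SELECT id, TO_CHAR(dob, 'YYYY-MM-DD') AS dob FROM public.mytable
```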

Jinja Rendering of the S3 Target Key

The prefix parameter specifies the UNLOAD location in the target bucket. Its value is rendered using Jinja to allow injection of parameters relevant to the individual job specification and run.

Info

If a list of relations is to be unloaded, it is important to use this rendering facility to ensure that each relation is unloaded to its own area in S3.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the S3 location of the unload (prefix) can be dynamically set to include components such as the schema, relation name, unload date etc.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
relation str The relation name.
schema str The schema name.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

S3 Bucket Security Checks

Redshift_unload performs some basic checks on the security of the target S3 bucket to reduce the risk of unloading data to somewhere unsafe. This is a convenience only and should not be relied upon for securing your data.

By default, if any of the following are true, the bucket will not be used for unloaded data and the job run will fail:

  • The bucket has any form of public access

  • The bucket does not have default encryption enabled

  • Server logging is not enabled on the bucket

  • The bucket is owned by an AWS account other than the one associated with the profile being used by the lava worker.

These security checks can be disabled by setting the insecure parameter to true.
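The default checks above amount to a simple predicate, sketched here (illustrative only, not lava's actual code; the bucket argument is an assumed summary dict, not a real boto3 response):

```python
def bucket_is_acceptable(bucket, worker_account):
    # Hedged sketch of the documented checks: all must pass for the
    # bucket to be used for unloaded data.
    return (
        not bucket["public_access"]
        and bucket["default_encryption"]
        and bucket["server_logging"]
        and bucket["owner_account"] == worker_account
    )

safe = {"public_access": False, "default_encryption": True,
        "server_logging": True, "owner_account": "123456789012"}
print(bucket_is_acceptable(safe, "123456789012"))
# → True
```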

If the security checks are enabled, the lava worker will require the following additional IAM permissions on the target bucket:

  • List all buckets: s3:ListAllMyBuckets

  • Get bucket logging configuration: s3:GetBucketLogging

  • Get bucket encryption configuration: s3:GetEncryptionConfiguration

  • Get bucket ACLs: s3:GetBucketAcl

Dev Mode Behaviour

The redshift_unload job behaviour is unchanged for dev mode.

Examples

The following example will unload the table mytable

{
  "description": "Unload the table 'mytable'",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/unload-mytable",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "parallel on",
      "delimiter '|'",
      "kms_key_id 'alias/lava-proto-user'",
      "encrypted",
      "header",
      "allowoverwrite"
    ],
    "bucket": "my-bucket",
    "conn_id": "redshift-conn-01",
    "dateformat": "YYYY-MM-DD",
    "prefix": "unload_demo/{{schema}}/{{relation}}/",
    "relation": "mytable",
    "s3_conn_id": "s3-conn-01",
    "schema": "public"
  },
  "payload": "--",
  "type": "redshift_unload",
  "worker": "default"
}

This one unloads the same table but the destination in S3 includes a date component.

{
  "description": "Unload the table 'mytable'",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/unload-mytable",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "parallel on",
      "delimiter '|'",
      "kms_key_id 'alias/lava-proto-user'",
      "encrypted",
      "header",
      "allowoverwrite"
    ],
    "bucket": "my-bucket",
    "conn_id": "redshift-conn-01",
    "dateformat": "YYYY-MM-DD",
    "prefix": "unload_demo/{{schema}}/{{relation}}/{{start.strftime('%Y/%m/%d')}}/",
    "relation": "mytable",
    "s3_conn_id": "s3-conn-01",
    "schema": "public"
  },
  "payload": "--",
  "type": "redshift_unload",
  "worker": "default"
}

This one unloads a list of tables.

{
  "description": "Unload multiple tables",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/unload-multi",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "parallel on",
      "delimiter '|'",
      "kms_key_id 'alias/lava-proto-user'",
      "encrypted",
      "header",
      "allowoverwrite"
    ],
    "bucket": "my-bucket",
    "conn_id": "redshift-conn-01",
    "dateformat": "YYYY-MM-DD",
    "prefix": "unload_demo/{{schema}}/{{relation}}/{{start.strftime('%Y/%m/%d')}}/",
    "relation": [
      "mytable",
      "yourtable"
    ],
    "s3_conn_id": "s3-conn-01",
    "schema": "public"
  },
  "payload": "--",
  "type": "redshift_unload",
  "worker": "default"
}

Job type: sharepoint_get_doc

The sharepoint_get_doc job type downloads a SharePoint document to a specified destination file.

Connection to the target SharePoint site is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the target file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for a SharePoint site.
file String Yes The destination file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
jinja Boolean No If false, disable Jinja rendering. Default true.
kms_key_id String No An AWS KMS encryption key to use when uploading to AWS S3.
library String Yes Source SharePoint library name. This value is Jinja rendered.
path String Yes The source document path in SharePoint. Use POSIX, not DOS, style path names (i.e. forward slash path separators). It must be an absolute path starting with /. This value is Jinja rendered.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the source path can be dynamically set to include components such as the date or part of the target file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.
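The effect of these variables can be illustrated with a minimal, hedged sketch using the jinja2 package. The template and the value of start are examples only; lava injects the real variables at run-time.

```python
from datetime import datetime
from jinja2 import Template

# Example value only; lava injects the actual local job start time as 'start'.
start = datetime(2024, 7, 1, 9, 30)

# A path parameter as it might appear in a job specification.
template = "/Interesting/Data/{{start.strftime('%Y-%m-%d')}}.csv"

# Rendering makes the datetime's Python methods (e.g. strftime) available.
rendered = Template(template).render(start=start)
print(rendered)  # /Interesting/Data/2024-07-01.csv
```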

Dev Mode Behaviour

The sharepoint_get_doc job behaviour is unchanged for dev mode.

Examples

The following example downloads a document from SharePoint and places it in S3. The file will be KMS encrypted in S3 with the specified KMS key.

{
  "description": "Download a CSV file from SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/x.csv",
    "file": "s3://my-bucket/x.csv",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "sharepoint_get_doc",
  "worker": "default"
}

This one shows how Jinja can be used to include the current date in the source document name.

{
  "description": "Download a CSV file from SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/{{start.strftime('%Y-%m-%d')}}.csv",
    "file": "s3://my-bucket/{{start.strftime('%Y-%m-%d')}}.csv",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "sharepoint_get_doc",
  "worker": "default"
}

Job type: sharepoint_get_list

The sharepoint_get_list job type downloads a SharePoint list to a specified destination file, one row per line.

Connection to the target SharePoint site is handled automatically by lava.

Payload

The payload is ignored.

Parameters

The formatting-related parameters are as defined for the Python CSV writer, although some of the defaults are different. Defaults can be overridden at the realm level using configuration variables.

Parameter Type Required Description
basedir String No If the target file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for a SharePoint site.
data_columns String No A comma separated list of column names. If specified, then only columns listed are extracted (in addition to any specified system_columns).
delimiter String No Single character field delimiter. Default |.
doublequote Boolean No As for csv.writer. Default false.
escapechar String No As for csv.writer. Default null.
file String Yes The destination file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
header Boolean No If true, include a header line containing column names. Default true.
jinja Boolean No If false, disable Jinja rendering. Default true.
kms_key_id String No An AWS KMS encryption key to use when uploading to AWS S3.
list String Yes Name of the list. It must already exist in SharePoint. This value is Jinja rendered.
quotechar String No As for csv.writer. Default ".
quoting String No As for csv.writer QUOTE_* parameters (without the QUOTE_ prefix). Default minimal (i.e. QUOTE_MINIMAL).
system_columns String No A comma separated list of system columns to retrieve in addition to the data columns. Unless specified, only data columns are retrieved.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.
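Since the formatting parameters are defined in terms of Python's csv module, their combined effect can be previewed locally. The following is a hedged sketch using the documented defaults; in particular, the mapping of the quoting name onto a csv.QUOTE_* constant is an assumption about how the value is interpreted.

```python
import csv
import io

# Formatting parameters with the documented defaults (delimiter '|',
# quotechar '"', quoting 'minimal', doublequote false).
params = {"delimiter": "|", "quotechar": '"', "quoting": "minimal", "doublequote": False}

buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter=params["delimiter"],
    quotechar=params["quotechar"],
    # 'minimal' -> csv.QUOTE_MINIMAL, 'all' -> csv.QUOTE_ALL, etc. (assumed mapping)
    quoting=getattr(csv, "QUOTE_" + params["quoting"].upper()),
    doublequote=params["doublequote"],
    escapechar="\\",
)
writer.writerow(["ID", "Title"])
writer.writerow([1, "contains | pipe"])

# The field containing the delimiter is quoted; the others are not.
print(buf.getvalue())
```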

Consult SharePoint documentation for available system columns. Currently known columns include:

  • ComplianceAssetId
  • AppAuthor
  • AppEditor
  • Attachments
  • Author
  • ContentType
  • Created
  • DocIcon
  • Edit
  • Editor
  • FolderChildCount
  • ID
  • ItemChildCount
  • LinkTitleNoMenu
  • LinkTitle
  • Modified
  • _ComplianceFlags
  • _ComplianceTagUserId
  • _ComplianceTagWrittenTime
  • _ComplianceTag
  • _IsRecord
  • _UIVersionString

Handling of personorgroup Columns

Columns of type personorgroup have an associated LookupId column. So for a column named <COLUMN>, the list also has a <COLUMN>LookupId column.

If the data_columns parameter is not specified, meaning all columns are retrieved, both the <COLUMN> and <COLUMN>LookupId columns will be retrieved, in that order.

If the data_columns parameter is specified, the <COLUMN>LookupId must be explicitly named in the list if required.

System columns of type personorgroup also have an associated LookupId column that can be retrieved by naming it in the system_columns parameter.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, target path can be dynamically set to include components such as the date or part of the source file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

The sharepoint_get_list job behaviour is unchanged for dev mode.

Examples

The following example downloads a SharePoint list to a CSV file in S3.

{
  "description": "Download a SharePoint list to a CSV file",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-list",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "list": "My List",
    "file": "s3://my-bucket/x.csv",
    "kms_key_id": "alias/data",
    "quoting": "minimal",
    "delimiter": ","
  },
  "payload": "--",
  "type": "sharepoint_get_list",
  "worker": "default"
}

Job type: sharepoint_get_multi_doc

The sharepoint_get_multi_doc job type downloads multiple SharePoint documents to a specified destination path.

Connection to the target SharePoint site is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the target file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for a SharePoint site.
glob String No The UNIX glob style pattern used to select files in the specified SharePoint folder path area. If not specified, all files are selected. Matching is case-insensitive. This value is Jinja rendered.
jinja Boolean No If false, disable Jinja rendering. Default true.
kms_key_id String No An AWS KMS encryption key to use when uploading to AWS S3.
library String Yes Source SharePoint library name. This value is Jinja rendered.
outpath String Yes The destination file location. If it starts with s3:// it is assumed to be a prefix in S3 (add your own trailing / if needed), otherwise a local directory. If a local directory and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
path String Yes The source document path in SharePoint. Use POSIX, not DOS, style path names (i.e. forward slash path separators). It must be an absolute path starting with / and a SharePoint folder. This value is Jinja rendered.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.
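The case-insensitive UNIX-glob selection described for the glob parameter can be approximated with Python's fnmatch module. This is an illustrative sketch of the matching semantics, not lava's actual implementation; lowercasing both sides makes the comparison case-insensitive on any platform.

```python
import fnmatch

# Hypothetical file names as they might appear in a SharePoint folder.
names = ["Report.CSV", "notes.txt", "data.csv"]
pattern = "*.csv"

# Case-insensitive glob match: compare lowercased name against lowercased pattern.
selected = [n for n in names if fnmatch.fnmatch(n.lower(), pattern.lower())]
print(selected)  # ['Report.CSV', 'data.csv']
```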

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the source path can be dynamically set to include components such as the date or part of the target file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

The sharepoint_get_multi_doc job behaviour is unchanged for dev mode.

Examples

The following example downloads all CSV files from a SharePoint path and places them in S3. The files will be KMS encrypted in S3 with the specified KMS key.

{
  "description": "Download a CSV file from SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-multi-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/",
    "outpath": "s3://my-bucket/base-prefix/",
    "glob": "*.csv",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "sharepoint_get_multi_doc",
  "worker": "default"
}

This one shows how Jinja can be used to include the current date in the source document name.

{
  "description": "Download a CSV file from SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-multi-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/{{start.strftime('%Y-%m-%d')}}",
    "outpath": "s3://my-bucket/{{start.strftime('%Y-%m-%d')}}/",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "sharepoint_get_multi_doc",
  "worker": "default"
}

Job type: sharepoint_put_doc

The sharepoint_put_doc job type creates or updates a SharePoint document from a specified source file.

Connection to the target SharePoint site is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the source file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for a SharePoint site.
file String Yes The source file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
jinja Boolean No If false, disable Jinja rendering. Default true.
library String Yes Target SharePoint library name. This value is Jinja rendered.
path String Yes The target document path in SharePoint. Use POSIX, not DOS, style path names (i.e. forward slash path separators). It must be an absolute path starting with /. This value is Jinja rendered.
title String No A title for the document. This value is Jinja rendered.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, target path can be dynamically set to include components such as the date or part of the source file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

The sharepoint_put_doc job behaviour is unchanged for dev mode.

Examples

The following example uploads a file from S3 to a SharePoint library.

{
  "description": "Upload a CSV file to SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/x.csv",
    "file": "s3://my-bucket/x.csv",
    "title": "This is an interesting data set"
  },
  "payload": "--",
  "type": "sharepoint_put_doc",
  "worker": "default"
}

This one shows how Jinja can be used to replicate the source file name into the target document name.

{
  "description": "Upload a CSV file to SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/{{job.parameters.file.split('/')[-1]}}",
    "file": "s3://my-bucket/x.csv",
    "title": "This is an interesting data set"
  },
  "payload": "--",
  "type": "sharepoint_put_doc",
  "worker": "default"
}

This one shows how Jinja can be used to include the current date in the target document name.

{
  "description": "Upload a CSV file to SharePoint",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-doc",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "library": "My SharePoint Library",
    "path": "/Interesting/Data/{{start.strftime('%Y-%m-%d')}}.csv",
    "file": "s3://my-bucket/x.csv",
    "title": "This is an interesting data set"
  },
  "payload": "--",
  "type": "sharepoint_put_doc",
  "worker": "default"
}

Job type: sharepoint_put_list

The sharepoint_put_list job type updates a SharePoint list from a specified CSV source file.

Connection to the target SharePoint site is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the source file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for a SharePoint site.
data_columns String No A comma separated list of column names. If specified, then only columns listed are modified. Any columns specified in SharePoint as required must be included in the source data for modes append and replace. Required columns present in the source data will be included in the append/replace even if not explicitly included in the data_columns list.
delimiter String No Single character field delimiter. Default |.
doublequote Boolean No As for csv.reader. Default false.
error_missing Boolean No If true and there are columns in the source file that are not in the SharePoint list, raise an error. If false, the extra columns are silently ignored. Default false.
escapechar String No As for csv.reader. Default null.
file String Yes The source file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered. The contents must be a CSV data with a single header line. The columns must match the pre-existing list in SharePoint.
jinja Boolean No If false, disable Jinja rendering. Default true.
list String Yes Name of the list. It must already exist in SharePoint. This value is Jinja rendered.
mode String No See below. Default is append.
quotechar String No As for csv.reader. Default ".
quoting String No As for csv.reader QUOTE_* parameters (without the QUOTE_ prefix). Default minimal (i.e. QUOTE_MINIMAL).
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.

The mode parameter can take the following values:

Mode Description
append Rows are added to the existing list contents.
delete Rows are deleted based on an ID column that must be present in the source data. No new data is added.
replace Existing list contents are deleted before adding new data.
update Rows are updated based on an ID column that must be present in the source data. Fields in existing rows may be updated but no new rows will be added.

Note that any read-only columns in the SharePoint list are not updated. These are silently skipped in the update process.

Handling of personorgroup Columns

Columns of type personorgroup have an associated LookupId column. So for a column named <COLUMN>, the list also has a <COLUMN>LookupId column.

A personorgroup type column can only be updated if it is provided using the name <COLUMN>LookupId with the actual LookupId value.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, target path can be dynamically set to include components such as the date or part of the source file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

The sharepoint_put_list job behaviour is unchanged for dev mode.

Examples

The following example replaces the contents of a list with data from S3.

{
  "description": "Upload a CSV file to a SharePoint list",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sharepoint-list",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "sp-conn-01",
    "list": "My List",
    "file": "s3://my-bucket/x.csv",
    "mode": "replace"
  },
  "payload": "--",
  "type": "sharepoint_put_list",
  "worker": "default"
}

Job type: smb_get

The smb_get job type downloads a file from an SMB file server to the specified destination file.

Connection to the target SMB server is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the target file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for an SMB file share.
file String Yes The destination file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
jinja Boolean No If false, disable Jinja rendering. Default true.
kms_key_id String No An AWS KMS encryption key to use when uploading to AWS S3.
path String Yes The source file path on the SMB file share. Use POSIX, not DOS, style path names (i.e. forward slash path separators). This value is Jinja rendered.
share_name String Yes The name of the file share.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the source path can be dynamically set to include components such as the date or part of the target file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

Job behaviour is unchanged for dev mode.

Examples

The following example downloads a file from an SMB file share and places it in S3. The file will be KMS encrypted in S3 with the specified KMS key.

{
  "description": "Download a CSV file from a file share",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/smb-get",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "smb-conn-01",
    "share_name": "Public",
    "path": "/Interesting/Data/x.csv",
    "file": "s3://my-bucket/x.csv",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "smb_get",
  "worker": "default"
}

This one shows how Jinja can be used to include the current date in the source file name.

{
  "description": "Download a CSV file from a file share",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/smb-get",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "smb-conn-01",
    "share_name": "Public",
    "path": "/Interesting/Data/{{start.strftime('%Y-%m-%d')}}.csv",
    "file": "s3://my-bucket/{{start.strftime('%Y-%m-%d')}}.csv",
    "kms_key_id": "alias/data"
  },
  "payload": "--",
  "type": "smb_get",
  "worker": "default"
}

Job type: smb_put

The smb_put job type creates or updates a file on an SMB file share from a specified source file.

Connection to the target SMB server is handled automatically by lava.

Payload

The payload is ignored.

Parameters

Parameter Type Required Description
basedir String No If the source file is specified as a relative filename, it will be treated as relative to the specified directory. Defaults to the lava temporary directory for the job.
conn_id String Yes The connection ID for an SMB server.
create_dirs Boolean No If true, the target directory, including parent directories, will be created if it doesn't exist. Default is false.
file String Yes The source file name. If it starts with s3:// it is assumed to be an object in S3, otherwise a local file. If a local file and not absolute, it will be relative to the basedir parameter. This value is Jinja rendered.
jinja Boolean No If false, disable Jinja rendering. Default true.
path String Yes The target path on the SMB file share. Use POSIX, not DOS, style path names (i.e. forward slash path separators). This value is Jinja rendered.
share_name String Yes The name of the file share.
vars Map[String,*] No A map of variables injected when the parameters are Jinja rendered.

Jinja Rendering of Parameters

Some of the parameters are rendered using Jinja.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, target path can be dynamically set to include components such as the date or part of the source file name.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

Job behaviour is unchanged for dev mode.

Examples

The following example uploads a file from S3 to an SMB file share.

{
  "description": "Upload a CSV file to an SMB file share",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/smb-put",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "smb-conn-01",
    "share_name": "Public",
    "path": "/Interesting/Data/x.csv",
    "file": "s3://my-bucket/x.csv"
  },
  "payload": "--",
  "type": "smb_put",
  "worker": "default"
}

This one shows how Jinja can be used to replicate the source file name into the target file name.

{
  "description": "Upload a CSV file to an SMB file share",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/smb-put",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "smb-conn-01",
    "share_name": "Public",
    "path": "/Interesting/Data/{{job.parameters.file.split('/')[-1]}}",
    "file": "s3://my-bucket/x.csv"
  },
  "payload": "--",
  "type": "smb_put",
  "worker": "default"
}

This one shows how Jinja can be used to include the current date in the target path. Note that target directories are created on the fly.

{
  "description": "Upload a CSV file to an SMB file share",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/smb-put",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "smb-conn-01",
    "share_name": "Public",
    "path": "/Interesting/Data/{{start.strftime('%Y/%m/%d')}}/data.csv",
    "file": "s3://my-bucket/x.csv",
    "create_dirs": true
  },
  "payload": "--",
  "type": "smb_put",
  "worker": "default"
}

Job type: sql

The sql job type runs one or more SQL statements.

The sql job type is functionally identical to the sqli job type, with the exception that sql obtains the SQL statements from files in S3 whereas for sqli jobs the payload is inline in the job specification.

Tip

For help selecting the appropriate SQL job type, see Choosing an SQL Job Type.

Connection to the target database is handled automatically by lava.

The sql job type is intended for simple SQL sequences. For more complex cases, the sqlc or sqlv job types may be more appropriate.

Payload

The payload is a location in S3 relative to the s3_payloads area specified in the realms table. It can be either an object key, in which case a single file is downloaded, or a prefix ending in /, in which case all files under that prefix will be downloaded and run in lexicographic order.

See S3 Payloads for more information.

Info

The sql job type also allows the payload to be a list of S3 locations if the v2 Payload Downloader is enabled.

Each payload file can contain one or more SQL statements that are compatible with the target database.

Note that some database drivers are pickier than others about the presence or absence of a terminating semi-colon (e.g. Oracle) so these will be stripped from the end of each statement before execution.
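For illustration only (this is not lava's actual splitter, which must also account for semicolons inside string literals and comments), a naive version of the split-and-strip behaviour described above looks like:

```python
# Hypothetical payload file contents containing two SQL statements.
payload = """
CREATE TABLE t (x INT);
INSERT INTO t VALUES (1);
"""

# Split on ';' and drop empty fragments; the split itself removes the
# terminating semicolons that drivers such as Oracle reject.
statements = [s.strip() for s in payload.split(";") if s.strip()]

for stmt in statements:
    print(stmt)
```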

Parameters

The formatting-related parameters are as defined for the Python CSV writer, although some of the defaults are different. Defaults can be overridden at the realm level using configuration variables.

Parameter Type Required Description
batch_size Integer No Fetch this many rows at a time. Default is 1000.
conn_id String Yes The connection ID for an SQL database.
delimiter String No Single character field delimiter. Default |.
dialect String No As for csv.writer. Default excel. The unix option is useful when DOS style line endings must be avoided.
doublequote Boolean No As for csv.writer. Default false.
escapechar String No As for csv.writer. Default null.
header Boolean No Add a header for data produced from SELECT queries. Default is false.
jinja Boolean No If false, disable Jinja rendering of the payload. Default true.
output String No If specified, the output from statements that produce result tuples will be placed in this subdirectory in both the job run temporary area in the local filesystem and in the s3_temp area. This option must be specified for dag jobs if the output is needed. See Output Data below. It must contain only alphanumeric characters and underscores.
quotechar String No As for csv.writer. Default ".
quoting String No As for csv.writer QUOTE_* parameters (without the QUOTE_ prefix). Default minimal (i.e. QUOTE_MINIMAL).
raw Boolean No By default, an attempt will be made to split each payload file into individual SQL statements. This should be safe in most cases. To suppress this behaviour and run the payload as-is, set raw to true. Default false.
transaction Boolean No If true, auto-commit is disabled and the sequence of SQLs is run within a transaction. If false, auto-commit is enabled (if supported by the driver). Default false.
vars Map[String,*] No A map of variables injected when the SQL is Jinja rendered.
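As an illustration of how these parameters relate to csv.writer, the following constructs a writer with the documented default values. The writer construction itself is an assumption about lava's internals, shown only to clarify the parameter semantics:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(
    buf,
    dialect="excel",            # dialect
    delimiter="|",              # delimiter
    quotechar='"',              # quotechar
    quoting=csv.QUOTE_MINIMAL,  # quoting: "minimal"
    doublequote=False,          # doublequote
    escapechar=None,            # escapechar
)

# Fields containing the delimiter are quoted; others are written as-is.
writer.writerow(["a", "b|c", "plain"])
print(buf.getvalue())  # → a|"b|c"|plain  (plus the \r\n line terminator)
```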

Jinja Rendering of the Payload

Each SQL statement is rendered using Jinja prior to execution.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the S3 location of the unload (s3key) can be dynamically set to include components such as the schema, table name, unload date etc.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.
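As a rough illustration, a render context matching the table above might look like the following. The values are placeholders; the real worker populates them from the job specification, realm and state tables:

```python
from datetime import datetime, timezone

start = datetime.now()               # local job start time
ustart = datetime.now(timezone.utc)  # UTC job start time

# Placeholder render context mirroring the variables table above.
context = {
    "globals": {},
    "job": {"job_id": "demo/sql-query-01"},
    "realm": {},
    "start": start,
    "state": {},
    "ustart": ustart,
    "utils": {},
    "vars": {"schema": "public", "table": "mytable"},
}

# Because start is a datetime, strftime() works in templates, e.g.
#   's3://bucket/unloads/{{ start.strftime("%Y/%m/%d") }}/data_'
print(start.strftime("%Y/%m/%d"))
```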

Output Data

If any of the queries in the payload produce result tuples (e.g. SELECT statements), the output is placed in its own file in the local file system and in the s3_temp area.

If the output parameter is not specified, the output files will be placed in the root of the temporary run directory on the local worker file system and at the root of the job run's s3_temp area.

If the output parameter is specified, the output files will be placed in a subdirectory of that name under the root of the temporary run directory on the local worker file system and under a similarly named sub-prefix of the root of the job run's s3_temp area.

Files are named <PAYLOAD>.<n>.out, where:

  • <PAYLOAD> is the name of the payload file containing the SELECT query; and
  • <n> is the SQL statement sequence number within that file, starting from zero.

So with an output parameter of whatever and a payload file, some-queries.sql, containing only a SELECT statement, the output file would be named whatever/some-queries.sql.0.out.
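The naming rule can be sketched as a small helper (hypothetical; lava does this internally):

```python
def sql_output_name(payload_file: str, n: int, output: str = "") -> str:
    """Build the <PAYLOAD>.<n>.out name, optionally under the
    output subdirectory."""
    name = f"{payload_file}.{n}.out"
    return f"{output}/{name}" if output else name

print(sql_output_name("some-queries.sql", 0, output="whatever"))
# → whatever/some-queries.sql.0.out
```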

As child jobs in chain and dag jobs all share the same run directory, this provides a mechanism for one child job to leave data behind for a subsequent job.

Info

It is critical to use the output parameter in these circumstances.

Unless modified by job parameters, the output of each SELECT statement is pipe separated values, one record per line.

Dev Mode Behaviour

The sql job behaviour is unchanged for dev mode.

Examples

The following example runs the SQL statement in demo/query-mytable.sql. The output is written to demo/query-mytable.sql.0.out in the s3_temp area.

{
  "description": "Run a single SQL statement.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sql-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "db-conn-01",
    "delimiter": "|",
    "quoting": "all"
  },
  "payload": "demo/query-mytable.sql",
  "type": "sql",
  "worker": "default"
}

Job type: sqlc

The sqlc job type runs one or more files containing native SQL client commands. It differs from the sql and sqli job types in that, instead of a Python DBAPI 2.0 driver, it uses a command line client specific to the target database type. This makes sqlc less portable than sql/sqli/sqlv, but it allows access to all the meta-commands available in the command line client.

The sqlc job type is intended for more complex requirements where the specific capabilities of the native CLI are critical. In most cases, one of the sql, sqli or sqlv job types will be more appropriate.

Tip

For help selecting the appropriate SQL job type, See Choosing an SQL Job Type.

The stdout and stderr of each SQL script, if any, is placed in its own file in the s3_temp area.

For Postgres flavoured databases (including Redshift), the psql client is used. For MySQL databases, the mysql client is used. For Oracle databases, the SQL*Plus client is used. It's a horror. Sorry.

Payload

The payload is a location in S3 relative to the s3_payloads area specified in the realms table. It can be either an object key, in which case a single file is downloaded, or a prefix ending in /, in which case all files under that prefix will be downloaded and run in lexicographic order.

See S3 Payloads for more information.

Note

The sqlc job type also allows the payload to be a list of S3 locations if the v2 Payload Downloader is enabled.

Each file may contain a mix of SQL statements and client meta-commands. The SQL must be compatible with the target database. SQL commands must always be properly terminated with semi-colons.

Parameters

Parameter Type Required Description
args List[String] No A list of zero or more additional arguments provided to the database client. These are necessarily specific to the database type and underlying database client.
conn_id String Yes The connection ID for a database.
jinja Boolean No If false, disable Jinja rendering of the payload. Default true.
timeout String No By default, payload components run by sqlc jobs are killed after 10 minutes. This parameter can override that with values in the form nnX where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that the total timeout for the entire job (timeout multiplied by number of payload elements) must be less than the visibility timeout on the worker SQS queue.
vars Map[String,*] No A map of variables injected when the SQL is Jinja rendered.
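A minimal sketch of the nnX timeout syntax, assuming the obvious interpretation of the units (the parser inside lava may differ):

```python
import re

# Hypothetical parser for the nnX timeout syntax described above.
_UNITS = {"s": 1, "m": 60, "h": 3600}

def timeout_seconds(spec: str) -> int:
    m = re.fullmatch(r"(\d+)([smh])", spec)
    if m is None:
        raise ValueError(f"bad timeout spec: {spec!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]

print(timeout_seconds("30m"))  # → 1800
```

A job with five payload elements and a 30m timeout would therefore need a visibility timeout of more than 9000 seconds on the worker SQS queue.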

Jinja Rendering of the Payload

Each file in the payload is rendered using Jinja prior to execution.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the S3 location of the unload (s3key) can be dynamically set to include components such as the schema, table name, unload date etc.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

Normally, the sqlc job will copy stdout and stderr to S3 on the conclusion of the job. In dev mode, stdout and stderr are emitted locally during the job run instead of being copied to S3.

Examples

The following example runs all of the commands in demo/psql-commands.sql. The --no-align argument is passed to the psql command line client. Stdout is written to demo/psql-commands.sql.stdout and stderr to demo/psql-commands.sql.stderr in the s3_temp area. The underlying database type is specified in the connection specification, not the job specification, but the payload must match the underlying database type.

{
  "description": "Run a psql file.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sqlc-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "--no-align"
    ],
    "conn_id": "pgdb-conn-01"
  },
  "payload": "demo/psql-commands.sql",
  "type": "sql",
  "worker": "default"
}

This one runs all files found under the given payload prefix. Separate stdout and stderr files are created for each payload file. The timeout for each individual payload element is set to 30 minutes.

{
  "description": "Run a psql file.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sqlc-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "args": [
      "--no-align"
    ],
    "conn_id": "pgdb-conn-01"
  },
  "payload": "demo/psql-commands/",
  "timeout": "30m",
  "type": "sql",
  "worker": "default"
}

Job type: sqli

The sqli job type runs one or more SQL statements.

The sqli job type is functionally identical to the sql job type with the exception that sqli obtains the SQL statements inline from the payload whereas for sql jobs the payload specifies S3 objects containing the SQL statements.

Tip

For help selecting the appropriate SQL job type, see Choosing an SQL Job Type.

Connection to the target database is handled automatically by lava.

The sqli job type is intended for simple, relatively small SQL sequences. For larger or more complex cases, the sql or sqlc job types may be more appropriate.

Payload

The payload is a string, or list of strings, containing SQL statements.

Each payload string can itself contain one or more SQL statements that are compatible with the target database.

For statements that produce result tuples (e.g. SELECT statements), the output is placed in its own file in the s3_temp area. Unless modified by job parameters, the output of each statement is pipe separated values, one record per line.

Note that some database drivers (e.g. Oracle) are pickier than others about the presence or absence of a terminating semi-colon, so trailing semi-colons are stripped from the end of each statement before execution.

Parameters

Parameters are identical to those for the sql job type.

Output Data

If any of the queries in the payload produce result tuples (e.g. SELECT statements), the output is placed in its own file in the local file system and in the s3_temp area.

If the output parameter is not specified, the output files will be placed in the root of the temporary run directory on the local worker file system and at the root of the job run's s3_temp area.

If the output parameter is specified, the output files will be placed in a subdirectory of that name under the root of the temporary run directory on the local worker file system and under a similarly named sub-prefix of the root of the job run's s3_temp area.

Files are named <m>.<n>.out, where:

  • <m> is the sequence number of the payload element, starting from zero (each payload element can itself contain multiple SQL statements); and
  • <n> is the SQL statement sequence number within the payload element, starting from zero.

So, with an output parameter of whatever and a payload list containing a single element which is a SELECT statement, the output file would be named whatever/0.0.out.
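The two-level indexing can be illustrated as follows (a sketch only; in practice only statements that produce result tuples create output files):

```python
# A payload list of two elements; the first contains two statements.
payload = [
    "SELECT 1; SELECT 2",
    "SELECT 3",
]

# Enumerate (m, n) pairs and build names under an output parameter
# of "whatever", as the worker might.
names = []
for m, element in enumerate(payload):
    statements = [s for s in element.split(";") if s.strip()]
    for n, _stmt in enumerate(statements):
        names.append(f"whatever/{m}.{n}.out")

print(names)
# → ['whatever/0.0.out', 'whatever/0.1.out', 'whatever/1.0.out']
```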

As child jobs in chain and dag jobs all share the same run directory, this provides a mechanism for one child job to leave data behind for a subsequent job.

Info

It is critical to use the output parameter in these circumstances.

Unless modified by job parameters, the output of each SELECT statement is pipe separated values, one record per line.

Jinja Rendering of the Payload

Each SQL statement is rendered using Jinja prior to execution. The rendering process is identical to that for the sql job type.

Dev Mode Behaviour

The sqli job behaviour is unchanged for dev mode.

Examples

The following example runs a single SQL statement. The output is written to the s3_temp area.

{
  "description": "Run a single SQL statement.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sql-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "db-conn-01",
    "delimiter": "|",
    "quoting": "all"
  },
  "payload": "SELECT numbers FROM lottery_results WHERE result_date > NOW()",
  "type": "sqli",
  "worker": "default"
}

The following example runs multiple SQL statements in a transaction.

{
  "description": "Run multiple SQL statements.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sql-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "db-conn-01",
    "delimiter": "|",
    "quoting": "all",
    "transaction": true
  },
  "payload": [
    "DELETE FROM main_table",
    "INSERT INTO main_table (SELECT * FROM staging_table)"
  ],
  "type": "sqli",
  "worker": "default"
}

Job type: sqlv

The sqlv job type runs one or more files containing SQL statements. It is essentially a hybrid of the sql and sqlc job types. It has a consistent, lava controlled interface across all database types like the former, but uses an external CLI client like the latter.

Tip

For help selecting the appropriate SQL job type, see Choosing an SQL Job Type.

The client, lava-sql, is provided as part of the lava code base. It supports SQL statements only, not the meta-commands that are typical of the proprietary database clients used by sqlc. The lava-sql utility can also be used stand-alone or invoked by lava exe, pkg and docker jobs.

Connection to the target database is handled automatically by lava.

Payload

The payload is a location in S3 relative to the s3_payloads area specified in the realms table. It can be either an object key, in which case a single file is downloaded, or a prefix ending in /, in which case all files under that prefix will be downloaded and run in lexicographic order.

See S3 Payloads for more information.

Note

The sqlv job type also allows the payload to be a list of S3 locations if the v2 Payload Downloader is enabled.

Each payload file can contain one or more SQL statements that are compatible with the target database.

For SELECT statements, the output is placed in its own file in the s3_temp area. Unless modified by job parameters, the output of each SELECT statement is pipe separated values, one record per line.

Tip

All output is placed in a single file. To avoid a mess, each job should either limit itself to a single SELECT statement that produces output, or ensure that all SELECT statements produce the same column structure.

Parameters

Some of the formatting related parameters are output format dependent. In most cases, defaults can be overridden at the realm level using configuration variables.

Parameter Type Required Description
batch_size Integer No Fetch this many rows at a time. Default is 1000.
conn_id String Yes The connection ID for an SQL database.
delimiter String No Single character field delimiter. Default |.
dialect String No As for csv.writer. Default excel. The unix option is useful when DOS style line endings must be avoided.
doublequote Boolean No As for csv.writer. Default false.
escapechar String No As for csv.writer. Default null.
format String No Specify the output format for SELECT statements. Any of the formats supported by lava-sql can be used. The default is csv.
jinja Boolean No If false, disable Jinja rendering of the payload. Default true.
header Boolean No Add a header for data produced from SELECT queries. Default is false.
level String No Print log messages of a given severity level or above. The standard logging level names are available but debug, info, warning and error are most useful. The default is info.
quotechar String No As for csv.writer. Default ".
quoting String No As for csv.writer QUOTE_* parameters (without the QUOTE_ prefix). Default minimal (i.e. QUOTE_MINIMAL).
raw Boolean No By default, an attempt will be made to split each payload file into individual SQL statements. This should be safe in most cases. To suppress this behaviour and run the payload as-is, set raw to true. Default false.
timeout String No By default, jobs are killed after 10 minutes. This parameter can override that with values in the form nnX where nn is a number and X is s (seconds), m (minutes) or h (hours). Note that the total timeout for the entire job (timeout multiplied by number of payload elements) must be less than the visibility timeout on the worker SQS queue.
transaction Boolean No If true, auto-commit is disabled and the sequence of SQLs is run within a transaction. If false, auto-commit is enabled (if supported by the driver). Default false.
vars Map[String,*] No A map of variables injected when the SQL is Jinja rendered.

Jinja Rendering of the Payload

Each SQL statement is rendered using Jinja prior to execution.

All of the injected parameters are effectively Python objects so the normal Jinja syntax and Python methods for those objects can be used in the Jinja templates. This is particularly useful for the datetime objects as strftime() becomes available. For example, the S3 location of the unload (s3key) can be dynamically set to include components such as the schema, table name, unload date etc.

Refer to Jinja Rendering in Lava for more information.

The following variables are made available to the renderer.

Name Type Description
globals dict[str,*] The globals from the job specification updated with any globals received in the job dispatch.
job dict[str,*] The augmented job specification.
realm dict[str,*] The realm specification.
start datetime The local time when the job run started.
state dict[str,*] A dictionary of the state items imported into the job, keyed on state_id. The default values are updated at run-time with any current values obtainable from the state table.
ustart datetime The UTC time when the job run started.
utils dict[str,runnable] A dictionary of utility functions that can be used in the Jinja markup.
vars dict[str,*] A dictionary of variables provided as the vars component of the job parameters.

Dev Mode Behaviour

Normally, the sqlv job will copy stdout and stderr to S3 on the conclusion of the job. In dev mode, stderr is emitted locally during the job run instead of being copied to S3. Stdout will still be copied to S3 as it may contain binary information.

Examples

The following example runs the SQL statements in demo/queries.sql. The output is written to demo/queries.sql.out in the s3_temp area.

{
  "description": "Run a file containing SQL statements.",
  "dispatcher": "Sydney",
  "enabled": true,
  "job_id": "demo/sql-query-01",
  "owner": "demo@somewhere.com",
  "parameters": {
    "conn_id": "db-conn-01",
    "delimiter": "|",
    "quoting": "all",
    "timeout": "20m",
    "transaction": true
  },
  "payload": "demo/queries.sql",
  "type": "sqlc",
  "worker": "default"
}