External Payload Storage

!!!warning External payload storage is currently implemented only for the Java client. Client libraries in other languages need to be modified to enable it. Contributions are welcome.

Context

Conductor can be configured to enforce barriers on the size of workflow and task payloads, for both input and output.
These barriers act as safeguards that prevent Conductor from being used as a data persistence system and reduce the pressure on its datastore.

Barriers

Conductor typically applies two kinds of barriers:

  • Soft Barrier
  • Hard Barrier

Soft Barrier

The soft barrier is used to alleviate pressure on the Conductor datastore. In some workflow use cases, a payload large enough to cross this barrier is still warranted and needs to be stored as part of the workflow execution.
In such cases, Conductor externalizes the storage of the payload to S3, uploading and downloading it as needed during execution. This process is completely transparent to the user/worker process.
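
In the Java codebase, these uploads and downloads go through Conductor's ExternalPayloadStorage abstraction. Below is a minimal sketch of that contract; the enum and method names follow com.netflix.conductor.common.utils.ExternalPayloadStorage, but exact signatures can vary between Conductor versions, and ExternalStorageLocation is stubbed here for illustration:

```java
import java.io.InputStream;

// Stub of Conductor's location descriptor: a (signed) URI plus the object path.
class ExternalStorageLocation {
    public String uri;
    public String path;
}

// Sketch of the contract a payload-storage backend (e.g. S3) implements.
interface ExternalPayloadStorage {
    enum Operation { READ, WRITE }
    enum PayloadType { WORKFLOW_INPUT, WORKFLOW_OUTPUT, TASK_INPUT, TASK_OUTPUT }

    // Resolve a location (e.g. a signed S3 URL) for reading or writing a payload.
    ExternalStorageLocation getLocation(Operation operation, PayloadType payloadType, String path);

    // Upload a payload of the given size to the external store at the given path.
    void upload(String path, InputStream payload, long payloadSize);

    // Download the payload stored at the given path.
    InputStream download(String path);
}
```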

Hard Barrier

Hard barriers are enforced to safeguard the Conductor backend from the pressure of persisting and handling voluminous data that is not essential for workflow execution. Conductor rejects such payloads and terminates/fails the workflow execution, with reasonForIncompletion set to an error message detailing the payload size.
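
If you prefer to fail fast instead of letting the server terminate the workflow, you can pre-check payload sizes on the client side. The helper below is hypothetical, not part of the Conductor client; it assumes Jackson for serialization and uses the 10240 KB default hard barrier from the table in the next section:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public class PayloadSizeGuard {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Mirrors the server-side hard barrier (10240 KB default for workflow input).
    private static final long HARD_BARRIER_BYTES = 10240L * 1024;

    // Throws before submission if the serialized payload would be rejected anyway.
    public static void assertWithinHardBarrier(Map<String, Object> payload) throws Exception {
        long size = MAPPER.writeValueAsBytes(payload).length;
        if (size > HARD_BARRIER_BYTES) {
            throw new IllegalArgumentException(
                    "Payload is " + size + " bytes, which exceeds the hard barrier of "
                            + HARD_BARRIER_BYTES + " bytes");
        }
    }
}
```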

Usage

Barriers setup

Set the following properties to the desired values in the JVM system properties:

| Property | Description | Default value |
|---|---|---|
| conductor.app.workflowInputPayloadSizeThreshold | Soft barrier for workflow input payload in KB | 5120 |
| conductor.app.maxWorkflowInputPayloadSizeThreshold | Hard barrier for workflow input payload in KB | 10240 |
| conductor.app.workflowOutputPayloadSizeThreshold | Soft barrier for workflow output payload in KB | 5120 |
| conductor.app.maxWorkflowOutputPayloadSizeThreshold | Hard barrier for workflow output payload in KB | 10240 |
| conductor.app.taskInputPayloadSizeThreshold | Soft barrier for task input payload in KB | 3072 |
| conductor.app.maxTaskInputPayloadSizeThreshold | Hard barrier for task input payload in KB | 10240 |
| conductor.app.taskOutputPayloadSizeThreshold | Soft barrier for task output payload in KB | 3072 |
| conductor.app.maxTaskOutputPayloadSizeThreshold | Hard barrier for task output payload in KB | 10240 |
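
For example, to lower the workflow payload barriers to half their defaults (values are in KB; the numbers here are illustrative, not recommendations):

```properties
conductor.app.workflowInputPayloadSizeThreshold=2560
conductor.app.maxWorkflowInputPayloadSizeThreshold=5120
conductor.app.workflowOutputPayloadSizeThreshold=2560
conductor.app.maxWorkflowOutputPayloadSizeThreshold=5120
```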

Amazon S3

Conductor provides an Amazon S3 implementation that can be used to externalize large payload storage.
Set the following property in the JVM system properties:

```properties
conductor.external-payload-storage.type=S3
```

!!!note This implementation assumes that S3 access is configured on the instance.

Set the following properties to the desired values in the JVM system properties:

| Property | Description | Default value |
|---|---|---|
| conductor.external-payload-storage.s3.bucketName | S3 bucket where the payloads will be stored | |
| conductor.external-payload-storage.s3.signedUrlExpirationDuration | The expiration time in seconds of the signed URL for the payload | 5 |
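
Putting it together, a minimal S3 configuration might look like this (the bucket name is a placeholder of your choosing):

```properties
conductor.external-payload-storage.type=S3
conductor.external-payload-storage.s3.bucketName=my-conductor-payloads
conductor.external-payload-storage.s3.signedUrlExpirationDuration=5
```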

The payloads are stored in the bucket configured above as UUID.json files, at locations determined by the payload type (for example, a workflow input payload lands under the workflow/input/ prefix). See the S3PayloadStorage source for details of how the object key is determined.

Azure Blob Storage

!!!note This implementation assumes that you have an Azure Blob Storage account's connection string or SAS token. If you want the signed URL to expire, you must specify a connection string.

Set the following properties to the desired values in the JVM system properties:

| Property | Description | Default value |
|---|---|---|
| workflow.external.payload.storage.azure_blob.connection_string | Azure Blob Storage connection string. Required to sign URLs. | |
| workflow.external.payload.storage.azure_blob.endpoint | Azure Blob Storage endpoint. Optional if connection_string is set. | |
| workflow.external.payload.storage.azure_blob.sas_token | Azure Blob Storage SAS token. Must have Read and Write permissions on the Blob service's resource objects. Optional if connection_string is set. | |
| workflow.external.payload.storage.azure_blob.container_name | Azure Blob Storage container where the payloads will be stored | conductor-payloads |
| workflow.external.payload.storage.azure_blob.signedurlexpirationseconds | The expiration time in seconds of the signed URL for the payload | 5 |
| workflow.external.payload.storage.azure_blob.workflow_input_path | Path prefix where workflow inputs will be stored, with a random UUID filename | workflow/input/ |
| workflow.external.payload.storage.azure_blob.workflow_output_path | Path prefix where workflow outputs will be stored, with a random UUID filename | workflow/output/ |
| workflow.external.payload.storage.azure_blob.task_input_path | Path prefix where task inputs will be stored, with a random UUID filename | task/input/ |
| workflow.external.payload.storage.azure_blob.task_output_path | Path prefix where task outputs will be stored, with a random UUID filename | task/output/ |
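
For reference, a minimal configuration with a connection string might look like the following (the connection-string values are placeholders; the storage-type selector for Azure Blob depends on your Conductor version and is omitted here):

```properties
workflow.external.payload.storage.azure_blob.connection_string=DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net
workflow.external.payload.storage.azure_blob.container_name=conductor-payloads
workflow.external.payload.storage.azure_blob.signedurlexpirationseconds=5
```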

The payloads will be stored in the same path structure as Amazon S3.

Testing with Azurite

You can use Azurite to simulate Azure Storage locally for development and testing.
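
For example, with Azurite running on its default blob port, you can point the connection string at the emulator's well-known development account (shown with the standard Azurite account key):

```properties
workflow.external.payload.storage.azure_blob.connection_string=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OuaXDXauSDFws/9pMDaNn/bQpF0cqnnLXhjFTejFQqKj2Zzp9tOQ/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;
```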

Troubleshooting

When using Elasticsearch persistence, you may receive a java.lang.IllegalStateException because the Netty library calls setAvailableProcessors twice. To resolve this, set:

```properties
es.set.netty.runtime.available.processors=false
```

To use OkHttp instead of the default Netty HTTP client, add the following dependency:

```
com.azure:azure-core-http-okhttp:${compatible version}
```

PostgreSQL Storage

Frinx provides a PostgreSQL implementation that can be used to externalize large payload storage.

!!!note This implementation assumes that you have a PostgreSQL database server and the required credentials.

Set the following properties in your application.properties:

| Property | Description | Default value |
|---|---|---|
| conductor.external-payload-storage.postgres.conductor-url | URL used to build the URI from which the JSON payloads stored in PostgreSQL can be downloaded via the Conductor server. For local development, this is {{ server_host }}. | "" |
| conductor.external-payload-storage.postgres.url | PostgreSQL database connection URL. Required to connect to the database. | |
| conductor.external-payload-storage.postgres.username | Username for connecting to the PostgreSQL database. Required. | |
| conductor.external-payload-storage.postgres.password | Password for connecting to the PostgreSQL database. Required. | |
| conductor.external-payload-storage.postgres.table-name | The PostgreSQL schema and table name where the payloads will be stored | external.external_payload |
| conductor.external-payload-storage.postgres.max-data-rows | Maximum number of data rows in the PostgreSQL table. Once this limit is exceeded, the oldest data is deleted. | Long.MAX_VALUE (9223372036854775807L) |
| conductor.external-payload-storage.postgres.max-data-days | Maximum age of data in days. Once this limit is exceeded, the oldest data is deleted. | 0 |
| conductor.external-payload-storage.postgres.max-data-months | Maximum age of data in months. Once this limit is exceeded, the oldest data is deleted. | 0 |
| conductor.external-payload-storage.postgres.max-data-years | Maximum age of data in years. Once this limit is exceeded, the oldest data is deleted. | 1 |
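
A minimal sketch of such a configuration is below (the URL and credentials are placeholders; the storage-type selector value for this implementation depends on your Conductor/Frinx version and is omitted here):

```properties
conductor.external-payload-storage.postgres.conductor-url=http://localhost:8080
conductor.external-payload-storage.postgres.url=jdbc:postgresql://localhost:5432/conductor
conductor.external-payload-storage.postgres.username=<username>
conductor.external-payload-storage.postgres.password=<password>
conductor.external-payload-storage.postgres.table-name=external.external_payload
conductor.external-payload-storage.postgres.max-data-days=30
```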

The maximum data age is the sum of the years, months, and days settings.
The payloads are stored in the PostgreSQL database under a key (externalPayloadPath) of the form UUID.json, and a URI for the data can be generated using the external-postgres-payload-resource REST controller.
For this URI to work correctly, the conductor-url property must be set correctly.