PQS is designed to operate as a long-running process which uses these principles to enhance availability:
- Redundancy involves running multiple instances of PQS in parallel to ensure that the system remains available even if one instance fails.
- Retry involves healing from transient and recoverable failures without shutting down the process or requiring operator intervention.
- Recovery entails reconciling the current state of the ledger with already exported data in the datastore after a cold start, and continuing from the latest checkpoint.
High availability
Multiple isolated instances of PQS can be run without any cross-dependency, which allows for an active-active high availability clustering model. Note that different instances might not be at the same offset, because of differing processing rates and general network non-determinism. The PQS SQL API provides capabilities to deal with this ‘eventual consistency’ model and to ensure that readers have at least ‘repeatable read’ consistency. See validate_offset_exists() in pqs-references-offset-management for more details.
    ---
    title: High Availability Deployment
    ---
    flowchart LR
        Participant --Ledger API--> PQS1[PQS<br>Process]
        Participant --Ledger API--> PQS2[PQS<br>Process]
        PQS1 --JDBC--> Database1[PQS<br>Database]
        PQS2 --JDBC--> Database2[PQS<br>Database]
        Database1 <--JDBC--> LoadBalancer[Load<br>Balancer]
        Database2 <--JDBC--> LoadBalancer[Load<br>Balancer]
        LoadBalancer <--JDBC--> App((App<br>Cluster))
        style PQS1 stroke-width:4px
        style PQS2 stroke-width:4px
        style Database1 stroke-width:4px
        style Database2 stroke-width:4px
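To make the eventual-consistency point concrete, here is a minimal client-side sketch of how a reader behind the load balancer might avoid being routed to a database that lags behind what it has already seen. It is not the PQS SQL API: `Instance` and `pick_instance` are hypothetical names, and offsets are assumed to be totally ordered integers. The documentation points to validate_offset_exists() for the actual mechanism.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """Hypothetical view of one PQS database and the latest offset it holds."""
    name: str
    latest_offset: int  # assumption: offsets are totally ordered integers


def pick_instance(instances: Sequence[Instance], min_offset: int) -> Optional[Instance]:
    """Return the first instance that has already reached min_offset, if any."""
    for instance in instances:
        if instance.latest_offset >= min_offset:
            return instance
    return None


# A reader that last saw offset 120 must not be served by a database behind it.
cluster = [Instance("db1", 100), Instance("db2", 130)]
chosen = pick_instance(cluster, min_offset=120)
print(chosen.name if chosen else "no instance is caught up yet")
```

Remembering the highest offset seen and requiring it on every subsequent read is what keeps a reader from observing the ledger going backwards when the load balancer switches databases.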
Retries
The PQS pipeline command is a unidirectional streaming process that relies heavily on the availability of its source and target dependencies. When PQS encounters an error that is designated as recoverable, it attempts to recover by restarting its internal engine. Recoverable errors are classified as follows:
- gRPC [1] (white-listed; retries if the status is one of):
  - CANCELLED
  - DEADLINE_EXCEEDED
  - NOT_FOUND
  - PERMISSION_DENIED
  - RESOURCE_EXHAUSTED
  - FAILED_PRECONDITION
  - ABORTED
  - INTERNAL
  - UNAVAILABLE
  - DATA_LOSS
  - UNAUTHENTICATED
- JDBC [2] (black-listed; retries unless the error is one of):
  - INVALID_PARAMETER_TYPE
  - PROTOCOL_VIOLATION
  - NOT_IMPLEMENTED
  - INVALID_PARAMETER_VALUE
  - SYNTAX_ERROR
  - UNDEFINED_COLUMN
  - UNDEFINED_OBJECT
  - UNDEFINED_TABLE
  - UNDEFINED_FUNCTION
  - NUMERIC_CONSTANT_OUT_OF_RANGE
  - NUMERIC_VALUE_OUT_OF_RANGE
  - DATA_TYPE_MISMATCH
  - INVALID_NAME
  - CANNOT_COERCE
  - UNEXPECTED_ERROR
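The two lists work in opposite directions: a gRPC error is retried only if it is on the list, while a JDBC error is retried unless it is on the list. A minimal sketch of that classification follows. The function name `is_recoverable` and the `"grpc"`/`"jdbc"` source labels are illustrative assumptions, not PQS internals; only the code names are taken from the lists above.

```python
# Status codes copied from the gRPC list above: retry only if listed.
GRPC_RETRYABLE = frozenset({
    "CANCELLED", "DEADLINE_EXCEEDED", "NOT_FOUND", "PERMISSION_DENIED",
    "RESOURCE_EXHAUSTED", "FAILED_PRECONDITION", "ABORTED", "INTERNAL",
    "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED",
})

# Error conditions copied from the JDBC list above: retry unless listed.
JDBC_FATAL = frozenset({
    "INVALID_PARAMETER_TYPE", "PROTOCOL_VIOLATION", "NOT_IMPLEMENTED",
    "INVALID_PARAMETER_VALUE", "SYNTAX_ERROR", "UNDEFINED_COLUMN",
    "UNDEFINED_OBJECT", "UNDEFINED_TABLE", "UNDEFINED_FUNCTION",
    "NUMERIC_CONSTANT_OUT_OF_RANGE", "NUMERIC_VALUE_OUT_OF_RANGE",
    "DATA_TYPE_MISMATCH", "INVALID_NAME", "CANNOT_COERCE", "UNEXPECTED_ERROR",
})


def is_recoverable(source: str, code: str) -> bool:
    """Classify an error as recoverable (hypothetical helper, not a PQS API)."""
    if source == "grpc":
        return code in GRPC_RETRYABLE      # allow-list semantics
    if source == "jdbc":
        return code not in JDBC_FATAL      # deny-list semantics
    raise ValueError(f"unknown error source: {source}")


print(is_recoverable("grpc", "UNAVAILABLE"))   # listed, so retried
print(is_recoverable("jdbc", "SYNTAX_ERROR"))  # listed, so not retried
```

The practical consequence of the asymmetry is that an unfamiliar gRPC status is treated as fatal, whereas an unfamiliar database error is treated as transient.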
Configuration
The following configuration options (see pqs-references-configuration-options) are available to control the retry behavior of PQS:
- The --retry-backoff-* settings control the periodicity of retries and the maximum duration between attempts.
- Configuring --retry-counter-attempts and --retry-counter-duration controls the maximum instability tolerance before shutting down.
- Configuring --retry-counter-reset controls the period of stability after which the retry counters are reset across the board.
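As a rough illustration of what "periodicity of retries and the maximum duration between attempts" can mean, the sketch below generates a capped, geometrically growing delay schedule. Whether PQS grows its delays this way is an assumption. The parameters `base`, `factor`, `cap` and `max_attempts` are hypothetical stand-ins and do not correspond to specific PQS flag names.

```python
from typing import Iterator


def backoff_delays(base: float, factor: float, cap: float, max_attempts: int) -> Iterator[float]:
    """Yield one delay (in seconds) per retry attempt, never exceeding cap."""
    delay = base
    for _ in range(max_attempts):
        yield min(delay, cap)
        delay *= factor


# Hypothetical values: start at 1 s, double each time, never wait more than 10 s.
schedule = list(backoff_delays(base=1.0, factor=2.0, cap=10.0, max_attempts=6))
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

The cap bounds how long a recovered dependency can go unnoticed, while the attempt limit bounds how long the process keeps trying before it gives up.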
Logging
While PQS recovers, the following log messages are emitted to indicate the progress of the recovery:
Metrics
The following metrics are available to monitor the stability of PQS’ dependencies. See pqs-application-metrics for more details on general observability:
Retry counters reset
If PQS encounters network unavailability, it starts incrementing retry counters with each attempt. These counters are reset only after a period of stability, as defined by --retry-counter-reset. As a result, during prolonged periods of intermittent failures that alternate with brief periods of normal operation, PQS remains cautious about the stability of the overall system.
For example, with the setting --retry-counter-reset PT5M, the following timeline illustrates how the retry counters behave:
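The reset rule can be sketched as a small counter that forgets past failures only once a full reset period has elapsed without a new one. This is a simplified model, not PQS code: the class `RetryCounter`, the choice to measure stability from the last failure, and the limit of three attempts are all assumptions. `PT5M` is the ISO 8601 notation for five minutes.

```python
from datetime import datetime, timedelta
from typing import Optional


class RetryCounter:
    """Hypothetical model of a retry counter with a stability-based reset."""

    def __init__(self, reset_after: timedelta, max_attempts: int) -> None:
        self.reset_after = reset_after
        self.max_attempts = max_attempts
        self.attempts = 0
        self.last_failure: Optional[datetime] = None

    def record_failure(self, now: datetime) -> bool:
        """Register a failure; return True while another retry is still allowed."""
        stable_long_enough = (
            self.last_failure is not None
            and now - self.last_failure >= self.reset_after
        )
        if stable_long_enough:
            self.attempts = 0  # the period of stability clears earlier failures
        self.attempts += 1
        self.last_failure = now
        return self.attempts <= self.max_attempts


start = datetime(2024, 1, 1, 12, 0)
counter = RetryCounter(reset_after=timedelta(minutes=5), max_attempts=3)
counts = []
for minute in (0, 2, 4, 10):  # failures at 12:00, 12:02, 12:04 and 12:10
    counter.record_failure(start + timedelta(minutes=minute))
    counts.append(counter.attempts)
print(counts)  # [1, 2, 3, 1]
```

The first three failures arrive less than five minutes apart, so the counter keeps climbing. The fourth arrives after six quiet minutes and starts again from one.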
Exit codes
PQS terminates with the following exit codes:
- 0: Normal termination
- 1: Termination due to an unrecoverable error, or because all retry attempts for recoverable errors have been exhausted
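A supervising script can branch on these exit codes. The sketch below shows the idea with a stand-in child process, since the actual PQS launch command is outside the scope of this example. `describe_exit` and `EXIT_MEANINGS` are hypothetical names.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "normal termination",
    1: "unrecoverable error, or retries for recoverable errors exhausted",
}


def describe_exit(code: int) -> str:
    """Translate a process exit code into the documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


# Stand-in child: a Python one-liner that exits with code 1 in place of PQS.
result = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print(describe_exit(result.returncode))
```

Because exit code 1 covers both unrecoverable errors and exhausted retries, a supervisor that restarts on it blindly may loop, so the logs are worth checking before restarting.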
Ledger streaming & recovery
On (re-)start, PQS determines the last saved checkpoint and continues incremental processing from that point onward. PQS can start and finish at prescribed ledger offsets, specified via arguments. In many scenarios, --pipeline-ledger-start Oldest --pipeline-ledger-stop Never is the most appropriate configuration, both for the initial population of all available history and for resumption/recovery processing.
Start offset meanings:
| Value | Meaning |
|---|---|
| Genesis | Commence from the first offset of the ledger, failing if not available. |
| Oldest | Resume processing, or start from the oldest available offset of the ledger (if the datastore is empty). |
| Latest | Resume processing, or start from the latest available offset of the ledger (if the datastore is empty). |
| <offset> | Offset from which to start processing, terminating if it does not match the state of the datastore. |

Stop offset meanings:
| Value | Meaning |
|---|---|
| Latest | Process until reaching the latest available offset of the ledger, then terminate. |
| Never | Keep processing and never terminate. |
| <offset> | Process until reaching this offset, then terminate. |
If the ledger has been pruned beyond the offset specified in --pipeline-ledger-start, PQS fails to start. For more details see pqs-history-slicing.
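The start-offset rules can be read as a small decision function. The sketch below is one interpretation, not PQS code: `resolve_start`, `StartError`, integer offsets, and a genesis offset of 0 are all assumptions, and cases the documentation does not spell out (such as Genesis against a non-empty datastore) are deliberately not modeled.

```python
from typing import Optional, Union


class StartError(Exception):
    """Raised when the requested start offset cannot be honored."""


def resolve_start(
    start: Union[str, int],
    checkpoint: Optional[int],  # last saved offset, or None if the datastore is empty
    ledger_oldest: int,         # oldest offset still available (after pruning)
    ledger_latest: int,         # latest offset currently on the ledger
    ledger_first: int = 0,      # assumption: genesis offset of the ledger
) -> int:
    """Return the offset at which processing would begin under these assumptions."""
    if start == "Genesis":
        if ledger_oldest > ledger_first:
            raise StartError("first offset of the ledger is no longer available")
        return ledger_first
    if start == "Oldest":
        return checkpoint if checkpoint is not None else ledger_oldest
    if start == "Latest":
        return checkpoint if checkpoint is not None else ledger_latest
    if isinstance(start, int):
        if start < ledger_oldest:
            raise StartError("ledger has been pruned beyond the requested offset")
        if checkpoint is not None and checkpoint != start:
            raise StartError("requested offset does not match the datastore state")
        return start
    raise ValueError(f"unsupported start value: {start!r}")


print(resolve_start("Oldest", None, ledger_oldest=10, ledger_latest=50))  # 10
print(resolve_start("Oldest", 30, ledger_oldest=10, ledger_latest=50))    # 30
```

Under this reading, Oldest and Latest differ only when the datastore is empty, which is why Oldest suits both initial population and later resumption.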