This section describes the observability features of PQS, which are designed to help you monitor the health and performance of the application.
Approach to observability
PQS uses OpenTelemetry APIs to provide its observability features. All three signal types (traces, metrics, and logs) can be exported to various backends by supplying the configuration defined by OpenTelemetry protocols and guidelines. This keeps PQS flexible with respect to observability backends: you can choose whatever fits your needs and established infrastructure. For PQS to emit observability data, an OpenTelemetry Java Agent must be attached to the JVM running PQS. OpenTelemetry's documentation page on Java Agent Configuration [1] has all the necessary information to get started. A frequently requested shortcut is to expose only metrics, over a Prometheus exposition endpoint embedded by PQS; refer to the official documentation for the exact agent settings.
Logging
Log level
Set the log level with --logger-level. Possible values are All, Fatal, Error, Warning, Info (default), Debug, Trace.
Per-logger log level
Use --logger-mappings to adjust the log level for individual loggers. For example, this can remove Netty network traffic from an otherwise more detailed overall log.
Log pattern
With --logger-pattern, use one of the predefined patterns, such as Plain (default), Standard (the standard format used in DA applications), or Structured, or set your own. Check Log Format Configuration [2] for more details.
To use a custom format, provide its string representation.
Log format for console output
Use --logger-format to set the log format. Possible values are Plain (default) or Json. These formats can be used for the pipeline command.
Log format for file output
Use --logger-format to set the log format. Possible values are Plain (default), Json, PlainAsync and JsonAsync. They can be used for interactive commands, such as prune. For PlainAsync and JsonAsync, log entries are written to the destination file asynchronously.
Destination file for file output
Use --logger-destination to set the path to the destination file (default: output.log) for interactive commands, such as prune.
Log format and log pattern combinations
- Plain/Plain
- Plain/Standard
- Plain/Custom
- Json/Standard
- Json/Structured
- Json/Custom

Notice that you need to use %label{your_label}{format} to describe a Json attribute-value pair.
Application metrics
Assuming PQS exposes metrics as described above, you can access the following metrics at http://localhost:9090/metrics. Each metric is accompanied by # HELP and # TYPE comments, which describe the meaning of the metric and its type, respectively.
Some metric types have additional constituent parts exposed as separate metrics. For example, a histogram metric type tracks max, count, sum, and actual ranged buckets as separate time series. Metrics are labeled where it makes sense, providing additional context such as the type of operation or the template/choice involved.
Conceptual list of metrics (refer to actual metric names in the Prometheus output):
| Type | Name | Description |
|---|---|---|
| gauge | watermark_ix | Current watermark index (transaction ordinal number for consistent reads) |
| counter | pipeline_events_total | Processed ledger events |
| histogram | jdbc_conn_use | Latency of database connections usage |
| histogram | jdbc_conn_isvalid | Latency of database connection validation |
| histogram | jdbc_conn_commit | Latency of database connection commit |
| histogram | total_tx_handling_latency | Total latency of transaction handling in PQS (observed in LAPI to committed in DB) |
| gauge | tx_lag_from_ledger_wallclock | Lag from ledger (wall-clock delta (in ms) from command completion to receipt by pipeline) |
| histogram | pipeline_convert_acs_event | Latency of converting ACS events |
| histogram | pipeline_convert_transaction | Latency of converting transactions |
| histogram | pipeline_prepare_batch_latency | Latency of preparing batches of statements |
| histogram | pipeline_execute_batch_latency | Latency of executing batches of statements |
| histogram | pipeline_progress_watermark_latency | Latency of watermark progression |
| histogram | pipeline_wp_acs_events_size | Number of in-flight units of work in pipeline_wp_acs_events wait point |
| histogram | pipeline_wp_acs_statements_size | Number of in-flight units of work in pipeline_wp_acs_statements wait point |
| histogram | pipeline_wp_acs_batched_statements_size | Number of in-flight units of work in pipeline_wp_acs_batched_statements wait point |
| histogram | pipeline_wp_acs_prepared_statements_size | Number of in-flight units of work in pipeline_wp_acs_prepared_statements wait point |
| histogram | pipeline_wp_events_size | Number of in-flight units of work in pipeline_wp_events wait point |
| histogram | pipeline_wp_statements_size | Number of in-flight units of work in pipeline_wp_statements wait point |
| histogram | pipeline_wp_batched_statements_size | Number of in-flight units of work in pipeline_wp_batched_statements wait point |
| histogram | pipeline_wp_prepared_statements_size | Number of in-flight units of work in pipeline_wp_prepared_statements wait point |
| histogram | pipeline_wp_watermarks_size | Number of in-flight units of work in pipeline_wp_watermarks wait point |
| counter | pipeline_wp_acs_events_total | Number of units of work processed in pipeline_wp_acs_events wait point |
| counter | pipeline_wp_acs_statements_total | Number of units of work processed in pipeline_wp_acs_statements wait point |
| counter | pipeline_wp_acs_batched_statements_total | Number of units of work processed in pipeline_wp_acs_batched_statements wait point |
| counter | pipeline_wp_acs_prepared_statements_total | Number of units of work processed in pipeline_wp_acs_prepared_statements wait point |
| counter | pipeline_wp_events_total | Number of units of work processed in pipeline_wp_events wait point |
| counter | pipeline_wp_statements_total | Number of units of work processed in pipeline_wp_statements wait point |
| counter | pipeline_wp_batched_statements_total | Number of units of work processed in pipeline_wp_batched_statements wait point |
| counter | pipeline_wp_prepared_statements_total | Number of units of work processed in pipeline_wp_prepared_statements wait point |
| counter | pipeline_wp_watermarks_total | Number of units of work processed in pipeline_wp_watermarks wait point |
| counter | app_restarts_total | Tracks number of times recoverable failures forced the pipeline to restart |
| gauge | grpc_up | Indicator whether gRPC channel is up and operational |
| gauge | jdbc_conn_pool_up | Indicator whether JDBC connection pool is up and operational |
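As a rough illustration of how these metrics could be consumed by a script, the following Python sketch parses Prometheus exposition text and reports which availability gauges are not up. The sample payload, the label-free metric names, the helper names, and the convention that a value of 1 means "up" are assumptions made for illustration; take the actual names from the Prometheus output of your PQS instance.

```python
from typing import Dict, List

# Hypothetical sample; real output comes from the metrics endpoint
# and may use different metric names and labels.
SAMPLE = """\
# HELP grpc_up Indicator whether gRPC channel is up and operational
# TYPE grpc_up gauge
grpc_up 1.0
# HELP jdbc_conn_pool_up Indicator whether JDBC connection pool is up and operational
# TYPE jdbc_conn_pool_up gauge
jdbc_conn_pool_up 0.0
# TYPE watermark_ix gauge
watermark_ix 42.0
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each sample name (including any {labels}) to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            name, rest = line.rsplit("}", 1)
            name += "}"
        else:
            name, rest = line.split(" ", 1)
        # The first token is the value; an optional timestamp may follow.
        samples[name] = float(rest.split()[0])
    return samples


def down_components(samples: Dict[str, float]) -> List[str]:
    """Return availability gauges that are missing or not equal to 1 (assumed 'up')."""
    gauges = ("grpc_up", "jdbc_conn_pool_up")
    return sorted(name for name in gauges if samples.get(name) != 1.0)
```

For the sample above, `down_components(parse_samples(SAMPLE))` lists only `jdbc_conn_pool_up`. In practice the text would be fetched from the metrics endpoint rather than embedded.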
Grafana dashboard
Based on the metrics described above, it is possible to build a comprehensive dashboard to monitor PQS. A vendor-supplied Grafana dashboard for PQS can be downloaded from the artifacts repository (see pqs-download). You may want to use it as a starting point for your own.
Health check
The health of the PQS process can be monitored using the health check endpoint /livez. The health check endpoint is available on the configured network interface (--health-address) and TCP port (--health-port). Note that the default is 127.0.0.1:8080.
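A liveness probe against this endpoint could look like the Python sketch below. It assumes that an HTTP 200 response means the process is healthy, which is not stated above; the function names and the injectable `fetch` parameter are illustrative only.

```python
import urllib.error
import urllib.request
from typing import Callable


def livez_url(host: str = "127.0.0.1", port: int = 8080) -> str:
    """Build the health check URL from the --health-address and --health-port values."""
    return f"http://{host}:{port}/livez"


def is_alive(url: str, timeout: float = 2.0,
             fetch: Callable = urllib.request.urlopen) -> bool:
    """Return True when the endpoint answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `is_alive(livez_url())` probes the default address; pass the host and port you configured if they differ.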
Tracing of pipeline execution
PQS instruments the most critical parts of its operations with tracing to provide insights into the execution flow and performance. Traces can be exported to various OpenTelemetry backends by providing appropriate configuration.
| Span name | Description |
|---|---|
| process metadata and schema | interactions that happen when PQS starts up and ensures its datastore is ready for operations |
| initialization routine | interactions that happen when PQS establishes its offset range boundaries (including seeding from ACS if requested) on startup |
| consume com.daml.ledger.api.v1.TransactionService/GetTransactions, consume com.daml.ledger.api.v1.TransactionService/GetTransactionTrees | [Daml SDK v2.x] timeline of processing a ledger transaction from delivery over gRPC to its persistence to datastore |
| consume com.daml.ledger.api.v2.UpdateService/GetUpdates, consume com.daml.ledger.api.v2.UpdateService/GetUpdateTrees | [Daml SDK v3.x] timeline of processing a ledger transaction from delivery over gRPC to its persistence to datastore |
| execute datastore transaction | interactions when a batch of transactions is persisted to the datastore |
| advance datastore watermark | interactions when the latest consecutive watermark is persisted to the datastore |
Trace context propagation
PQS is an intermediary between a ledger instance and downstream applications that prefer to access data through SQL rather than in a streaming manner from the Ledger API directly. Despite forming a pipeline between two data storage systems (Canton and PostgreSQL), PQS stores the original ledger transaction's trace context (see also open-tracing-ledger-api-client) for the purposes of propagation rather than its own. This allows downstream applications to decide for themselves how they want to connect to the original submission's trace (as a child span or as a new trace connected through span links).
The transactions.trace_context column allows any application to re-create the propagated trace context [4] and use it with its runtime's instrumentation library.
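The Python sketch below shows one way a downstream application might recover trace and span identifiers from a stored context. The storage shape is an assumption: it supposes the column value is a JSON object carrying a W3C `traceparent` entry, which this page does not confirm, so check the actual column contents first. The helper names are hypothetical.

```python
import json
import re
from typing import NamedTuple

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> ParentContext:
    """Split a traceparent value into its identifiers and the sampled flag."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a valid traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError(f"not a valid traceparent: {header!r}")
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def from_column(value: str) -> ParentContext:
    """Assumes the column value is a JSON object with a 'traceparent' key."""
    return parse_traceparent(json.loads(value)["traceparent"])
```

With the identifiers in hand, an application can hand them to its OpenTelemetry SDK to start a child span or to attach a span link, depending on how it wants to relate to the original submission.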
Diagnostics
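PQS is capable of exporting diagnostic telemetry snapshots. This data export archive contains essential troubleshooting information such as:
- application thread dumps (over a period of time)
- application metrics (over a period of time)

The archive can be retrieved from the exposition socket with a tool such as netcat. Diagnostics are controlled by the following settings: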
| System property | Environment variable | Default value | Description |
|---|---|---|---|
| da.diagnostics.enabled | DA_DIAGNOSTICS_ENABLED | true | Enables/disables diagnostics data collection and exposition |
| da.diagnostics.host | DA_DIAGNOSTICS_HOST | 127.0.0.1 | Hostname or IP address to use for binding the exposition socket |
| da.diagnostics.port | DA_DIAGNOSTICS_PORT | 0 | Port to use for binding the exposition socket (0 = random port) |
| da.diagnostics.dump.path | DA_DIAGNOSTICS_DUMP_PATH | <empty> | Directory to write to on graceful shutdown (path needs to be an existing writable directory) |
| da.diagnostics.metrics.interval | DA_DIAGNOSTICS_METRICS_INTERVAL | PT10S | Metrics collection interval in ISO 8601 format |
| da.diagnostics.metrics.buffer.size | DA_DIAGNOSTICS_METRICS_BUFFER_SIZE | 60 | Quantity of samples to store for each monitored metric (rolling window) |
| da.diagnostics.metrics.tags | DA_DIAGNOSTICS_METRICS_TAGS | <empty> | Comma-separated list of additional labels to enrich each metric with during exposition (for example, job=myapp,env=staging,deployed=20250101) |
| da.diagnostics.threads.interval | DA_DIAGNOSTICS_THREADS_INTERVAL | PT1M | Thread dumps collection interval in ISO 8601 format |
| da.diagnostics.threads.buffer.size | DA_DIAGNOSTICS_THREADS_BUFFER_SIZE | 10 | Quantity of thread dumps to store (rolling window) |
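To get a feel for how much history a snapshot covers, the Python sketch below multiplies a collection interval by its buffer size. It assumes one sample is stored per interval, so the rolling window spans roughly interval times buffer size; the parser handles only the time-only ISO 8601 subset (PTnHnMnS) used by the defaults, and the function names are illustrative.

```python
import re
from datetime import timedelta

# Time-only ISO 8601 durations such as PT10S, PT1M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> timedelta:
    """Parse a PTnHnMnS duration; other ISO 8601 forms are rejected."""
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def window(interval: str, buffer_size: int) -> timedelta:
    """Approximate time span covered by a rolling buffer of samples."""
    return parse_duration(interval) * buffer_size


# Defaults from the table: metrics PT10S x 60, thread dumps PT1M x 10.
metrics_window = window("PT10S", 60)
threads_window = window("PT1M", 10)
```

Under these assumptions both defaults cover about ten minutes, so lengthening an interval or buffer trades memory for a longer troubleshooting history.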