Monitoring

Key Metrics

The following metrics are important indicators of the health of the Amazon S3 Find and Forget Solution:

  • AWS/SQS - ApproximateNumberOfMessagesVisible for the Object Deletion Queue DLQ. Any value > 0 for this metric indicates that 1 or more objects could not be processed during a deletion job. The job which triggered the message(s) to be put in the queue will have a status of COMPLETED_WITH_ERRORS and the ObjectUpdateFailed event(s) will contain further debugging information.
  • AWS/SQS - ApproximateNumberOfMessagesVisible for the Events DLQ. Any value > 0 for this metric indicates that 1 or more Job Events could not be processed.
  • AWS/Athena - ProcessedBytes/TotalExecutionTime. If the average processed bytes and/or total execution time per query is rising, the average partition size may also be growing. This is not an issue per se; however, if partitions grow too large (or your dataset is unpartitioned), you may eventually encounter Athena errors.
  • AWS/States - ExecutionsFailed. Failing state machine executions indicate that the Amazon S3 Find and Forget Solution is misconfigured. To resolve this, find the State Machine execution which failed and investigate the cause of the failure.
  • AWS/States - ExecutionsTimedOut. State machine timeouts indicate that Amazon S3 Find and Forget is unable to complete a job before Step Functions kills the execution due to it exceeding the allowed execution time limit. See Troubleshooting for more details.

If required, you can create CloudWatch Alarms for any of the aforementioned metrics to be notified of potential solution misconfiguration.
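
As an illustration, the sketch below builds the arguments for an alarm that fires whenever a DLQ holds at least one visible message. It is a minimal example, not part of the solution: the queue name, alarm name and SNS topic ARN are hypothetical placeholders, and the period and statistic are assumptions you may want to tune. The resulting dictionary is shaped for boto3's CloudWatch `put_metric_alarm` call.

```python
from typing import Any, Dict, Optional


def dlq_alarm_params(
    queue_name: str,
    alarm_name: str,
    sns_topic_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for a 'DLQ is not empty' alarm."""
    params: Dict[str, Any] = {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        # SQS metrics are keyed by the queue name (not the URL or ARN).
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # assumption: evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",  # any value > 0 alarms
        # Assumption: a window with no datapoints is treated as healthy.
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


if __name__ == "__main__":
    # Hypothetical names; look up the real DLQ name in your deployed stack.
    params = dlq_alarm_params("my-deletion-queue-dlq", "S3F2-DeletionDLQ-NotEmpty")
    print(params)
    # To create the alarm (requires boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same helper can be reused for the Events DLQ by passing that queue's name. Alarms on the AWS/States metrics follow the same pattern with a different namespace, metric name and dimension.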

Service Level Monitoring

All standard metrics for the services used by the Amazon S3 Find and Forget Solution are available. For detailed information about the metrics and logging for a given service, view the relevant Monitoring docs for that service. The key services used by the solution:

1 CloudWatch Container Insights can be enabled when deploying the solution by setting EnableContainerInsights to true. Using Container Insights will incur additional charges. It is disabled by default.

2 To obtain Athena metrics, you will need to enable metrics for the workgroup you are using to execute the queries, as described in the Athena docs. By default the solution uses the primary workgroup; however, you can change this when deploying the stack using the AthenaWorkGroup parameter.
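
The two deployment parameters mentioned in the notes above could be assembled as follows. This is a sketch under stated assumptions: it only builds the parameter list in the shape CloudFormation's `create_stack`/`update_stack` APIs accept, it assumes "false" is the accepted value for disabling Container Insights, and it omits every other parameter the stack may require. The workgroup name in the usage example is hypothetical.

```python
from typing import Dict, List


def monitoring_stack_parameters(
    enable_container_insights: bool = False,
    athena_workgroup: str = "primary",
) -> List[Dict[str, str]]:
    """Build the monitoring-related CloudFormation parameters.

    Defaults mirror the documented behaviour: Container Insights disabled
    and the primary Athena workgroup.
    """
    return [
        {
            "ParameterKey": "EnableContainerInsights",
            # CloudFormation parameter values are passed as strings.
            # Assumption: "false" is the accepted value for disabled.
            "ParameterValue": "true" if enable_container_insights else "false",
        },
        {
            "ParameterKey": "AthenaWorkGroup",
            "ParameterValue": athena_workgroup,
        },
    ]


if __name__ == "__main__":
    # "s3f2-queries" is a hypothetical workgroup with metrics enabled.
    for param in monitoring_stack_parameters(True, "s3f2-queries"):
        print(param)
```

Remember that enabling Container Insights incurs additional charges, and that the chosen workgroup must have metrics enabled for the Athena metrics to appear.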