
Monitoring Runbook

All resource names can be found via:

aws cloudformation list-stack-resources --stack-name <STACK_NAME> --region <REGION> \
  --query 'StackResourceSummaries[*].{Type:ResourceType,Id:LogicalResourceId,Physical:PhysicalResourceId}' \
  --output table

Or list a specific resource type directly, for example the stack's Lambda log groups:

aws logs describe-log-groups --log-group-name-prefix /aws/lambda/<STACK_PREFIX> --query 'logGroups[*].logGroupName'
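
If the full resource list is saved as JSON (`--output json` without the `--query`), filtering by type can also be done in a few lines of Python. This is a minimal sketch: the sample resource names are hypothetical, and it assumes the response uses the `StackResourceSummaries` key that `list-stack-resources` returns.

```python
import json

# Hypothetical sample of `aws cloudformation list-stack-resources --output json`.
SAMPLE = """
{"StackResourceSummaries": [
  {"LogicalResourceId": "ReceiverFunction",
   "PhysicalResourceId": "example-receiver",
   "ResourceType": "AWS::Lambda::Function"},
  {"LogicalResourceId": "JobsTable",
   "PhysicalResourceId": "example-jobs",
   "ResourceType": "AWS::DynamoDB::Table"}
]}
"""


def resources_of_type(raw_json, resource_type):
    """Return logical and physical IDs for resources of one CloudFormation type."""
    summaries = json.loads(raw_json).get("StackResourceSummaries", [])
    return [
        # PhysicalResourceId can be missing for resources that were never created.
        {"Id": r["LogicalResourceId"], "Physical": r.get("PhysicalResourceId", "")}
        for r in summaries
        if r["ResourceType"] == resource_type
    ]


if __name__ == "__main__":
    for row in resources_of_type(SAMPLE, "AWS::Lambda::Function"):
        print(row["Id"], row["Physical"])
```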

URL: https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards/dashboard/ai3-mvp-platform

Shows: Events by handler, commands executed, auth results, errors, DLQ depth, API Gateway errors, alarms, health check, Step Functions executions.

| Alarm | What it means |
| --- | --- |
| ai3-mvp-webhook-oversized-payload | API Gateway returned 4XX (possibly a payload over 10 MB) |
| ai3-mvp-dlq-not-empty | A handler Lambda crashed; the event is in the dead letter queue |

When the bot replies "CI Check started (Job: <job-id>)", use that job ID to trace the full execution.

# Fetch the job record from DynamoDB
AWS_PROFILE=burner2 aws dynamodb get-item \
  --table-name <jobs-table> \
  --key "{\"JobId\":{\"S\":\"<job-id>\"}}" \
  --region us-east-1

# Search the CI check log group for the job ID.
# The ID is double-quoted inside the pattern so that IDs containing
# hyphens or other non-alphanumeric characters match as a literal term.
AWS_PROFILE=burner2 aws logs filter-log-events \
  --log-group-name <CI_CHECK_LOG_GROUP> \
  --start-time $(date -d '8 hours ago' +%s000) \
  --region us-east-1 \
  --filter-pattern '"<job-id>"' \
  --query 'events[*].message' --output text

# Search the CheckRunStep Lambda logs for the job ID
AWS_PROFILE=burner2 aws logs filter-log-events \
  --log-group-name /aws/lambda/<CHECK_RUN_STEP_FUNCTION> \
  --start-time $(date -d '8 hours ago' +%s000) \
  --region us-east-1 \
  --filter-pattern '"<job-id>"' \
  --query 'events[*].message' --output text

# Show only the terminal execution events (succeeded or failed)
AWS_PROFILE=burner2 aws logs filter-log-events \
  --log-group-name <CI_CHECK_LOG_GROUP> \
  --start-time $(date -d '8 hours ago' +%s000) \
  --region us-east-1 \
  --filter-pattern '{ $.type = "ExecutionSucceeded" || $.type = "ExecutionFailed" }' \
  --query 'events[*].message' --output text

# Summarize the last 8 hours of execution events as one "type name" line each
AWS_PROFILE=burner2 aws logs tail \
  <CI_CHECK_LOG_GROUP> \
  --since 8h --region us-east-1 \
  | python3 -c "
import sys, json
for line in sys.stdin:
    # Default tail format: timestamp, log stream, message
    parts = line.strip().split(' ', 2)
    if len(parts) < 3:
        continue
    try:
        d = json.loads(parts[2])
        t = d.get('type', '')
        n = d.get('details', {}).get('name', '')
        if t:
            print(f'{t:30s} {n}')
    except (ValueError, AttributeError):
        # Skip messages that are not JSON objects
        pass
"
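
For longer investigations, the same parsing can live in a small script that also counts events per type. The following is a sketch under the same assumption as the one-liner above: each line is `timestamp stream message` and the message is a JSON object with a `type` field. The sample lines and stream name are made up for illustration.

```python
import json
from collections import Counter


def parse_tail_line(line):
    """Return (type, name) for one `aws logs tail` line, or None if it is not a JSON event."""
    parts = line.strip().split(" ", 2)
    if len(parts) < 3:
        return None
    try:
        event = json.loads(parts[2])
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or not event.get("type"):
        return None
    details = event.get("details")
    name = details.get("name", "") if isinstance(details, dict) else ""
    return event["type"], name


def count_event_types(lines):
    """Count how many events of each type appear in the given lines."""
    parsed = (parse_tail_line(line) for line in lines)
    return Counter(p[0] for p in parsed if p is not None)


# Hypothetical sample input; real lines come from `aws logs tail`.
SAMPLE_LINES = [
    '2024-01-01T00:00:00+00:00 example-stream {"type": "ExecutionStarted", "details": {"name": "demo"}}',
    '2024-01-01T00:00:05+00:00 example-stream {"type": "ExecutionSucceeded", "details": {}}',
    "2024-01-01T00:00:06+00:00 example-stream START RequestId: abc",
]

if __name__ == "__main__":
    for event_type, count in count_event_types(SAMPLE_LINES).items():
        print(f"{event_type:30s} {count}")
```

To use it with live data, replace `SAMPLE_LINES` with `sys.stdin` and pipe the `aws logs tail` output into the script.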

CheckRunStep Lambda logs (create/complete)

AWS_PROFILE=burner2 aws logs filter-log-events \
  --log-group-name /aws/lambda/<CHECK_RUN_STEP_FUNCTION> \
  --start-time $(date -d '8 hours ago' +%s000) \
  --region us-east-1 \
  --filter-pattern 'INFO' \
  --query 'events[*].message' --output text
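
The `--start-time` values in the `filter-log-events` commands above are epoch timestamps in milliseconds, and `date -d '8 hours ago'` is GNU `date` syntax that the BSD `date` shipped with macOS does not accept. A portable way to compute the same value is sketched below; the function name `start_time_ms` is illustrative only.

```python
from datetime import datetime, timedelta, timezone


def start_time_ms(hours_ago, now=None):
    """Return the epoch time in milliseconds for `hours_ago` hours before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - timedelta(hours=hours_ago)).timestamp() * 1000)


if __name__ == "__main__":
    # Print a value suitable for --start-time
    print(start_time_ms(8))
```

Saved as a script (the file name is up to you), its output can replace the `$(date ...)` substitution, for example `--start-time $(python3 start_time.py)`.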

Comment handler (processes @ai3-mvp commands)

AWS_PROFILE=burner2 aws logs tail \
  /aws/lambda/<COMMENT_HANDLER_FUNCTION> \
  --since 1h --region us-east-1

Webhook receiver (signature verification, dispatch)

AWS_PROFILE=burner2 aws logs tail \
  /aws/lambda/<RECEIVER_FUNCTION> \
  --since 1h --region us-east-1

# Health check Lambda logs
AWS_PROFILE=burner2 aws logs tail \
  /aws/lambda/<HEALTH_CHECK_FUNCTION> \
  --since 1h --region us-east-1

Namespace: GitHubAppPlatform

| Metric | Dimensions | Meaning |
| --- | --- | --- |
| EventProcessed | AppId, EventType, HandlerName, OrgName | An event was handled |
| CommandExecuted | AppId, Command, HandlerName, OrgName | A bot command ran |
| AuthSuccess | AppId, HandlerName, OrgName | User authorized successfully |
| AuthFailed | AppId, HandlerName, OrgName | Authorization denied |
| ErrorOccurred | AppId, HandlerName, OrgName | Handler error (also goes to DLQ) |
| HealthCheckSuccess | AppId, CheckType | Health check pass (1) or fail (0) |

Browse in console: CloudWatch → Metrics → All metrics → GitHubAppPlatform
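
To query one of these metrics from a script, note that CloudWatch treats every distinct combination of dimension values as a separate metric, so a query must supply all of the dimensions the metric was published with. The sketch below only builds the request parameters; the dimension values are placeholders, and passing the result to boto3's `get_metric_statistics(**params)` is an assumption about how you would call it, not something this runbook prescribes.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "GitHubAppPlatform"


def metric_query(metric_name, dimensions, hours=1, period=300, now=None):
    """Build GetMetricStatistics parameters for one metric and dimension set."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        # Every dimension the metric was published with must be listed.
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(dimensions.items())
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Sum"],
    }


if __name__ == "__main__":
    # Placeholder dimension values; substitute the real ones from the console.
    params = metric_query(
        "ErrorOccurred",
        {"AppId": "example-app", "HandlerName": "example-handler", "OrgName": "example-org"},
    )
    print(params)
```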

Troubleshooting: Verify GitHub Actions Before Debugging Handlers


If an expected event doesn’t arrive (e.g., code_scanning_alert), verify the upstream GitHub Action succeeded first:

# Check recent workflow runs for a repo
gh api repos/<OWNER>/<REPO>/actions/runs --jq '.workflow_runs[:5] | .[] | {id: .id, status: .status, conclusion: .conclusion, name: .name, created: .created_at}'
| Symptom | Cause | Fix |
| --- | --- | --- |
| "actions not allowed" | Org requires pinned SHAs | Use actions/checkout@<full-sha>, not @v4 |
| Workflow never triggers | Missing on: trigger or wrong branch filter | Check .github/workflows/*.yml |
| SARIF upload 403 | Code scanning not enabled | Repo must be public or have GHAS |
| Event in S3 but handler silent | EventBridge rule mismatch | Check that detail-type matches exactly |

Debugging order:

  1. GitHub Actions tab: did the workflow run and succeed?
  2. S3 payload bucket: did the event arrive? (aws s3 ls s3://<bucket>/webhooks/<date>/<event_type>/)
  3. Handler CloudWatch logs: did the Lambda execute?
  4. DLQ: did the handler crash?
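
The first check, whether the workflow ran and succeeded, can be scripted against the unfiltered JSON from `gh api repos/<OWNER>/<REPO>/actions/runs`. This is a rough sketch: the sample payload is invented, and treating anything other than a completed run with conclusion `success` as worth inspecting is a simplifying assumption.

```python
import json

# Hypothetical, trimmed sample of the workflow runs API response.
SAMPLE = """
{"workflow_runs": [
  {"id": 101, "name": "CodeQL", "status": "completed", "conclusion": "success"},
  {"id": 102, "name": "CodeQL", "status": "completed", "conclusion": "failure"},
  {"id": 103, "name": "CI", "status": "in_progress", "conclusion": null}
]}
"""


def runs_needing_attention(runs_json):
    """Return runs that have not completed successfully."""
    runs = json.loads(runs_json).get("workflow_runs", [])
    return [
        {
            "id": run["id"],
            "name": run.get("name"),
            "status": run.get("status"),
            "conclusion": run.get("conclusion"),
        }
        for run in runs
        if run.get("status") != "completed" or run.get("conclusion") != "success"
    ]


if __name__ == "__main__":
    for run in runs_needing_attention(SAMPLE):
        print(run)
```

If this returns nothing for the relevant workflow, the upstream action is probably fine and the next stage to check is the S3 payload bucket.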

When the DLQ alarm fires:

# Tail the alert handler logs, which record each failed event from the DLQ
AWS_PROFILE=burner2 aws logs tail \
  /aws/lambda/<ALERT_HANDLER_FUNCTION> \
  --since 1h --region us-east-1

The alert handler logs the full body of each failed event, including the delivery ID, which you can use to redrive the event (see docs/runbooks/redrive-events.md).