Monitoring Runbook
Finding Resource Names
All resource names can be found via:

    aws cloudformation list-stack-resources --stack-name <STACK_NAME> --region <REGION> \
      --query 'StackResourceSummaries[*].{Type:ResourceType,Id:LogicalResourceId,Physical:PhysicalResourceId}' \
      --output table

Or filter for specific types:

    aws logs describe-log-groups --log-group-name-prefix /aws/lambda/<STACK_PREFIX> --query 'logGroups[*].logGroupName'

Dashboard
Shows: events by handler, commands executed, auth results, errors, DLQ depth, API Gateway errors, alarms, health check, Step Functions executions.
Alarms
| Alarm | What it means |
|---|---|
| ai3-mvp-webhook-oversized-payload | API Gateway returned a 4XX error (possibly a payload over 10 MB) |
| ai3-mvp-dlq-not-empty | A handler Lambda crashed; the event is in the dead letter queue |
Looking Up a Job by ID
When the bot replies "CI Check started (Job: <job-id>)", use that ID to trace the full execution.
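Copying the ID out of a comment and hand-escaping the DynamoDB key JSON are both easy to get wrong. The following is a minimal sketch, not part of the platform itself: the helper names `extract_job_id` and `dynamodb_key` are hypothetical, the sample ID is made up, and the pattern assumes the job ID contains no whitespace or parentheses (the runbook does not specify its format). Only the `JobId` attribute name comes from the lookup command in this runbook.

```python
import json
import re
from typing import Optional

# Assumption: the job ID has no whitespace or parentheses in it.
JOB_ID_PATTERN = re.compile(r"\(Job:\s*([^\s()]+)\)")


def extract_job_id(reply: str) -> Optional[str]:
    """Return the job ID from a 'CI Check started (Job: ...)' reply, or None."""
    match = JOB_ID_PATTERN.search(reply)
    return match.group(1) if match else None


def dynamodb_key(job_id: str) -> str:
    """Build the JSON document expected by `aws dynamodb get-item --key`."""
    return json.dumps({"JobId": {"S": job_id}})


if __name__ == "__main__":
    reply = "CI Check started (Job: example-job-123)"  # hypothetical reply text
    job_id = extract_job_id(reply)
    if job_id is not None:
        print(dynamodb_key(job_id))
```

The printed string can be passed to `--key` inside single quotes, which avoids the backslash escaping needed when the JSON is written by hand inside double quotes.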
Find the job in DynamoDB
    AWS_PROFILE=burner2 aws dynamodb get-item \
      --table-name <jobs-table> \
      --key "{\"JobId\":{\"S\":\"<job-id>\"}}" \
      --region us-east-1

Find the execution in Step Functions logs

    AWS_PROFILE=burner2 aws logs filter-log-events \
      --log-group-name <CI_CHECK_LOG_GROUP> \
      --start-time $(date -d '8 hours ago' +%s000) \
      --region us-east-1 \
      --filter-pattern "<job-id>" \
      --query 'events[*].message' --output text

Find the Lambda invocations for that job

    AWS_PROFILE=burner2 aws logs filter-log-events \
      --log-group-name /aws/lambda/<CHECK_RUN_STEP_FUNCTION> \
      --start-time $(date -d '8 hours ago' +%s000) \
      --region us-east-1 \
      --filter-pattern "<job-id>" \
      --query 'events[*].message' --output text

Viewing Step Functions Executions
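Several log commands in this runbook pass `--start-time` as epoch milliseconds, computed with `date -d '8 hours ago' +%s000`. The `-d '8 hours ago'` form is a GNU extension and is not accepted by the BSD `date` shipped with macOS. As a hedged alternative, the same value can be computed with a few lines of Python; the helper name `start_time_ms` is hypothetical and not part of the runbook's tooling.

```python
import time
from typing import Optional


def start_time_ms(hours_ago: float, now: Optional[float] = None) -> int:
    """Epoch milliseconds for `hours_ago` hours before `now`.

    `now` is an epoch timestamp in seconds; it defaults to the current time
    and exists mainly so the function can be checked deterministically.
    """
    if now is None:
        now = time.time()
    return int((now - hours_ago * 3600) * 1000)


if __name__ == "__main__":
    # Print a value usable as --start-time for an 8-hour window.
    print(start_time_ms(8))
```

Saved under any file name (for example a hypothetical `start_time.py`), the script can replace the `date` substitution: `--start-time "$(python3 start_time.py)"`.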
Summary: successes and failures only

    AWS_PROFILE=burner2 aws logs filter-log-events \
      --log-group-name <CI_CHECK_LOG_GROUP> \
      --start-time $(date -d '8 hours ago' +%s000) \
      --region us-east-1 \
      --filter-pattern '{ $.type = "ExecutionSucceeded" || $.type = "ExecutionFailed" }' \
      --query 'events[*].message' --output text

All state transitions (readable)
    AWS_PROFILE=burner2 aws logs tail \
      <CI_CHECK_LOG_GROUP> \
      --since 8h --region us-east-1 \
      | python3 -c "
    import sys, json
    # Each line is '<timestamp> <log stream> <JSON message>'
    for line in sys.stdin:
        parts = line.strip().split(' ', 2)
        if len(parts) < 3:
            continue
        try:
            d = json.loads(parts[2])
            t = d.get('type', '')
            n = d.get('details', {}).get('name', '')
            if t:
                print(f'{t:30s} {n}')
        except (ValueError, AttributeError):
            # Skip messages that are not JSON objects
            pass
    "

CheckRunStep Lambda logs (create/complete)
    AWS_PROFILE=burner2 aws logs filter-log-events \
      --log-group-name /aws/lambda/<CHECK_RUN_STEP_FUNCTION> \
      --start-time $(date -d '8 hours ago' +%s000) \
      --region us-east-1 \
      --filter-pattern 'INFO' \
      --query 'events[*].message' --output text

Viewing Webhook Handler Logs
Comment handler (processes @ai3-mvp commands)

    AWS_PROFILE=burner2 aws logs tail \
      /aws/lambda/<COMMENT_HANDLER_FUNCTION> \
      --since 1h --region us-east-1

Webhook receiver (signature verification, dispatch)

    AWS_PROFILE=burner2 aws logs tail \
      /aws/lambda/<RECEIVER_FUNCTION> \
      --since 1h --region us-east-1

Health check (runs every 15 min)

    AWS_PROFILE=burner2 aws logs tail \
      /aws/lambda/<HEALTH_CHECK_FUNCTION> \
      --since 1h --region us-east-1

Custom Metrics
Namespace: GitHubAppPlatform

| Metric | Dimensions | Meaning |
|---|---|---|
| EventProcessed | AppId, EventType, HandlerName, OrgName | An event was handled |
| CommandExecuted | AppId, Command, HandlerName, OrgName | A bot command ran |
| AuthSuccess | AppId, HandlerName, OrgName | User authorized successfully |
| AuthFailed | AppId, HandlerName, OrgName | Authorization denied |
| ErrorOccurred | AppId, HandlerName, OrgName | Handler error (also goes to DLQ) |
| HealthCheckSuccess | AppId, CheckType | Health check pass (1) or fail (0) |
Browse in console: CloudWatch → Metrics → All metrics → GitHubAppPlatform
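For querying these metrics outside the console, note that CloudWatch identifies a metric by its full dimension set, so a query normally has to supply every dimension listed for that metric in the table above. The sketch below only builds the request parameters; the helper name `metric_query` and all dimension values are hypothetical, and the choice of the `Sum` statistic with a 5-minute period is an assumption rather than something the runbook prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

NAMESPACE = "GitHubAppPlatform"  # namespace named in this runbook


def metric_query(
    metric_name: str,
    dimensions: Dict[str, str],
    hours: float = 1,
    period_seconds: int = 300,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],  # assumption: counts are summed per period
    }


if __name__ == "__main__":
    # Hypothetical dimension values; substitute the real ones.
    query = metric_query(
        "ErrorOccurred",
        {"AppId": "12345", "HandlerName": "comment-handler", "OrgName": "example-org"},
    )
    # With boto3 installed and credentials configured, the dict can be passed as:
    #   boto3.client("cloudwatch").get_metric_statistics(**query)
    print(query["MetricName"], [d["Name"] for d in query["Dimensions"]])
```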
Troubleshooting: Verify GitHub Actions Before Debugging Handlers
If an expected event doesn’t arrive (e.g., code_scanning_alert), verify the upstream GitHub Action succeeded first:

    # Check recent workflow runs for a repo
    gh api repos/<OWNER>/<REPO>/actions/runs --jq '.workflow_runs[:5] | .[] | {id: .id, status: .status, conclusion: .conclusion, name: .name, created: .created_at}'

Common Failures

| Symptom | Cause | Fix |
|---|---|---|
| "actions not allowed" | Org requires pinned SHAs | Use actions/checkout@<full-sha>, not @v4 |
| Workflow never triggers | Missing on: trigger or branch filter wrong | Check .github/workflows/*.yml |
| SARIF upload 403 | Code scanning not enabled | Repo must be public or have GHAS |
| Event in S3 but handler silent | EventBridge rule mismatch | Check detail-type matches exactly |
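To spot upstream failures in the output of the `gh api` command above without reading every run, the results can be filtered down to completed runs that did not succeed. This is a minimal sketch under stated assumptions: the function name `unsuccessful_runs` is hypothetical, and it assumes the command's output arrives as one JSON object per line (as is typical when `gh` output is piped) with the `id`, `status`, `conclusion`, `name`, and `created` keys produced by the `--jq` expression.

```python
import json
from typing import Any, Dict, Iterable, List


def unsuccessful_runs(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per line; keep completed runs that did not succeed."""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        run = json.loads(line)
        if run.get("status") == "completed" and run.get("conclusion") != "success":
            result.append(run)
    return result


if __name__ == "__main__":
    # Hypothetical sample data in the shape emitted by the --jq expression.
    sample = [
        '{"id": 1, "status": "completed", "conclusion": "success", "name": "CodeQL", "created": "2024-01-01T00:00:00Z"}',
        '{"id": 2, "status": "completed", "conclusion": "failure", "name": "CodeQL", "created": "2024-01-01T01:00:00Z"}',
        '{"id": 3, "status": "in_progress", "conclusion": null, "name": "CodeQL", "created": "2024-01-01T02:00:00Z"}',
    ]
    for run in unsuccessful_runs(sample):
        print(run["id"], run["name"], run["conclusion"])
```

To process real output, pass `sys.stdin` as `lines` and pipe the `gh api` command into the script. Runs that are still queued or in progress are deliberately left out, since they have no conclusion yet.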
Verification order
- GitHub Actions tab: did the workflow run and succeed?
- S3 payload bucket: did the event arrive? (`aws s3 ls s3://<bucket>/webhooks/<date>/<event_type>/`)
- Handler CloudWatch logs: did the Lambda execute?
- DLQ: did the handler crash?
DLQ Investigation
When the DLQ alarm fires:

    # See what reached the DLQ, via the alert handler's logs
    AWS_PROFILE=burner2 aws logs tail \
      /aws/lambda/<ALERT_HANDLER_FUNCTION> \
      --since 1h --region us-east-1

The alert handler logs the full failed event body, including the delivery ID, which you can use to redrive (see docs/runbooks/redrive-events.md).