# IncidentOps
A full-stack on-call & incident management platform
## Environment Configuration

| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | Postgres connection string | — |
| `REDIS_URL` | Legacy Redis endpoint, also used if no broker override is supplied | `redis://localhost:6379/0` |
| `TASK_QUEUE_DRIVER` | Task queue implementation (`celery` or `inmemory`) | `celery` |
| `TASK_QUEUE_BROKER_URL` | Celery broker URL (falls back to `REDIS_URL` when unset) | `None` |
| `TASK_QUEUE_BACKEND` | Celery transport semantics (`redis` or `sqs`) | `redis` |
| `TASK_QUEUE_DEFAULT_QUEUE` | Queue used for fan-out and notification deliveries | `default` |
| `TASK_QUEUE_CRITICAL_QUEUE` | Queue used for escalation and delayed work | `critical` |
| `TASK_QUEUE_VISIBILITY_TIMEOUT` | Visibility timeout passed to the SQS transport | `600` |
| `TASK_QUEUE_POLLING_INTERVAL` | Polling interval for the SQS transport (seconds) | `1.0` |
| `NOTIFICATION_ESCALATION_DELAY_SECONDS` | Delay before re-checking unacknowledged incidents | `900` |
| `AWS_REGION` | Region used when `TASK_QUEUE_BACKEND=sqs` | `None` |
| `JWT_SECRET_KEY` | Symmetric JWT signing key | — |
| `JWT_ALGORITHM` | JWT algorithm | `HS256` |
| `JWT_ISSUER` | JWT issuer claim | `incidentops` |
| `JWT_AUDIENCE` | JWT audience claim | `incidentops-api` |
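For local development, a minimal `.env` might look like the following (values are illustrative, not the project's committed defaults):

```bash
# .env (illustrative local-development values)
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/incidentops  # scheme may need +asyncpg depending on the engine setup
REDIS_URL=redis://localhost:6379/0
TASK_QUEUE_DRIVER=celery        # switch to "inmemory" for tests
JWT_SECRET_KEY=change-me        # use a strong random value anywhere but local dev
JWT_ISSUER=incidentops
JWT_AUDIENCE=incidentops-api
```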
## Task Queue Modes

- **Development / Tests** – Set `TASK_QUEUE_DRIVER=inmemory` to bypass Celery entirely (the default for local pytest). The API will enqueue events into an in-memory recorder while the worker code remains importable; see the sketch after this list.
- **Celery + Redis** – Set `TASK_QUEUE_DRIVER=celery` and either leave `TASK_QUEUE_BROKER_URL` unset (relying on `REDIS_URL`) or point it at another Redis endpoint. This is the default production-style configuration.
- **Celery + Amazon SQS** – Provide `TASK_QUEUE_BROKER_URL=sqs://` (Celery discovers AWS credentials automatically), set `TASK_QUEUE_BACKEND=sqs`, and configure `AWS_REGION`. Optional tuning is available via the visibility timeout and polling interval variables above.
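The in-memory driver is what keeps the queue abstraction testable without a running broker. The repo's actual interface isn't reproduced here; the sketch below only illustrates the pattern, and all class and method names are hypothetical:

```python
from typing import Any, Protocol


class TaskQueue(Protocol):
    """What the API layer depends on (hypothetical interface)."""

    def enqueue(self, task_name: str, *, queue: str = "default", **kwargs: Any) -> None: ...


class InMemoryTaskQueue:
    """Records enqueued tasks instead of dispatching them; handy for pytest assertions."""

    def __init__(self) -> None:
        self.enqueued: list[tuple[str, str, dict[str, Any]]] = []

    def enqueue(self, task_name: str, *, queue: str = "default", **kwargs: Any) -> None:
        self.enqueued.append((task_name, queue, kwargs))


class CeleryTaskQueue:
    """Dispatches by task name so the API process never imports worker code."""

    def __init__(self, celery_app: Any) -> None:
        self._app = celery_app

    def enqueue(self, task_name: str, *, queue: str = "default", **kwargs: Any) -> None:
        # send_task routes by string name; `queue` is forwarded to apply_async
        self._app.send_task(task_name, kwargs=kwargs, queue=queue)
```

Dispatching by name via `send_task` is what lets the two drivers stay interchangeable behind the same `enqueue` call.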
## Running the Worker

The worker automatically discovers tasks under `worker/tasks`. Use the same environment variables as the API:

```bash
uv run celery -A worker.celery_app worker --loglevel=info
```
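Because discovery is automatic, adding a notification task is just a new module under `worker/tasks`. A hypothetical example (module path, task name, and signature are assumptions, not the repo's actual code):

```python
# worker/tasks/notifications.py (hypothetical example of a discoverable task)
from worker.celery_app import app  # assumes the Celery instance is exported as `app`


@app.task(name="notifications.deliver_webhook", bind=True, max_retries=3)
def deliver_webhook(self, incident_id: str, url: str) -> None:
    """Deliver an incident payload to a subscriber webhook, retrying on failure."""
    try:
        ...  # POST the serialized incident to `url` (e.g. with httpx)
    except Exception as exc:
        # standard Celery retry idiom: re-enqueue with a 30s backoff
        raise self.retry(exc=exc, countdown=30)
```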
## Setup

### Docker Compose

```bash
docker compose up --build -d
```
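Once the stack is up, the usual Compose commands apply (the `api` and `worker` service names below are assumptions about the repo's compose file):

```bash
docker compose logs -f api worker   # follow API and worker logs
docker compose down -v              # tear down and remove volumes
```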
### K8S with Skaffold and Helm

```bash
# Create a local cluster
kind create cluster --name incidentops

# Deploy with live rebuilds during development
skaffold dev

# One-time deployment
skaffold run

# Production deployment
skaffold run -p production
```

Helm can also be invoked directly:

```bash
# Install infrastructure only (for testing)
helm install incidentops helm/incidentops -n incidentops --create-namespace \
  --set migration.enabled=false \
  --set api.replicaCount=0 \
  --set worker.replicaCount=0 \
  --set web.replicaCount=0

# Full install (requires building app images first)
helm install incidentops helm/incidentops -n incidentops --create-namespace
```
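Whichever path you choose, a quick sanity check on the rollout:

```bash
kubectl get pods -n incidentops
kubectl get svc -n incidentops
```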