Add OpenTelemetry instrumentation with distributed tracing and metrics:
- Structured JSON logging with trace context correlation
- Auto-instrumentation for FastAPI, asyncpg, httpx, redis
- OTLP exporter for traces and Prometheus metrics endpoint

Implement Celery worker and notification task system:
- Celery app with Redis/SQS broker support and configurable queues
- Notification tasks for incident fan-out, webhooks, and escalations
- Pluggable TaskQueue abstraction with in-memory driver for testing

Add Grafana observability stack (Loki, Tempo, Prometheus, Grafana):
- OpenTelemetry Collector for receiving OTLP traces and logs
- Tempo for distributed tracing backend
- Loki for log aggregation with Promtail DaemonSet
- Prometheus for metrics scraping with RBAC configuration
- Grafana with pre-provisioned datasources and API overview dashboard
- Helm templates for all observability components

Enhance application infrastructure:
- Global exception handlers with structured ErrorResponse schema
- Request logging middleware with timing metrics
- Health check updated to verify task queue connectivity
- Non-root user in Dockerfile for security
- Init containers in Helm deployments for dependency ordering
- Production Helm values with autoscaling and retention policies
39 lines
701 B
YAML
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
    spike_limit_mib: 64

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
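
Every receiver, processor, and exporter named under service.pipelines has to be defined in the matching top-level section of the collector config. The sketch below shows one way to check that wiring before deploying. It is not part of this repository: the CONFIG dict is a hand-copied mirror of the YAML above (so no YAML parser is needed), and the helper name undefined_components is hypothetical.

```python
# Hand-copied mirror of the collector config above (an assumption of this
# sketch; a real check would load the YAML file instead).
CONFIG = {
    "receivers": {
        "otlp": {
            "protocols": {
                "grpc": {"endpoint": "0.0.0.0:4317"},
                "http": {"endpoint": "0.0.0.0:4318"},
            }
        }
    },
    "processors": {
        "batch": {"timeout": "1s", "send_batch_size": 1024},
        "memory_limiter": {
            "check_interval": "1s",
            "limit_mib": 256,
            "spike_limit_mib": 64,
        },
    },
    "exporters": {
        "otlp/tempo": {"endpoint": "tempo:4317", "tls": {"insecure": True}},
        "loki": {
            "endpoint": "http://loki:3100/loki/api/v1/push",
            "default_labels_enabled": {"exporter": True, "job": True},
        },
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlp/tempo"],
            },
            "logs": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["loki"],
            },
        }
    },
}


def undefined_components(config):
    """Return (pipeline, section, name) for every pipeline reference
    that has no definition in the matching top-level section."""
    missing = []
    for pipeline, spec in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section, {})
            for name in spec.get(section, []):
                if name not in defined:
                    missing.append((pipeline, section, name))
    return missing


if __name__ == "__main__":
    problems = undefined_components(CONFIG)
    for pipeline, section, name in problems:
        print(f"pipeline {pipeline!r}: {section} entry {name!r} is not defined")
    if not problems:
        print("all pipeline references are defined")
```

For the config as written the check finds nothing to report. Adding a pipeline that names an exporter without a matching entry under exporters would be flagged.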