# IncidentOps

A fullstack on-call & incident management platform.

## Environment Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `DATABASE_URL` | Postgres connection string | — |
| `REDIS_URL` | Legacy Redis endpoint; also used as the Celery broker when no override is supplied | `redis://localhost:6379/0` |
| `TASK_QUEUE_DRIVER` | Task queue implementation (`celery` or `inmemory`) | `celery` |
| `TASK_QUEUE_BROKER_URL` | Celery broker URL (falls back to `REDIS_URL` when unset) | `None` |
| `TASK_QUEUE_BACKEND` | Celery transport semantics (`redis` or `sqs`) | `redis` |
| `TASK_QUEUE_DEFAULT_QUEUE` | Queue used for fan-out and notification deliveries | `default` |
| `TASK_QUEUE_CRITICAL_QUEUE` | Queue used for escalation and delayed work | `critical` |
| `TASK_QUEUE_VISIBILITY_TIMEOUT` | Visibility timeout passed to the `sqs` transport (seconds) | `600` |
| `TASK_QUEUE_POLLING_INTERVAL` | Polling interval for the `sqs` transport (seconds) | `1.0` |
| `NOTIFICATION_ESCALATION_DELAY_SECONDS` | Delay before re-checking unacknowledged incidents | `900` |
| `AWS_REGION` | AWS region used when `TASK_QUEUE_BACKEND=sqs` | `None` |
| `JWT_SECRET_KEY` | Symmetric JWT signing key | — |
| `JWT_ALGORITHM` | JWT signing algorithm | `HS256` |
| `JWT_ISSUER` | JWT issuer claim | `incidentops` |
| `JWT_AUDIENCE` | JWT audience claim | `incidentops-api` |

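For local development, a minimal `.env` might look like the following. Every value here is illustrative (including the database credentials and secret) and should be adjusted to your environment:

```
# .env — example local development settings (illustrative values)
DATABASE_URL=postgresql://incidentops:incidentops@localhost:5432/incidentops
REDIS_URL=redis://localhost:6379/0
TASK_QUEUE_DRIVER=celery
JWT_SECRET_KEY=change-me-in-production
JWT_ALGORITHM=HS256
JWT_ISSUER=incidentops
JWT_AUDIENCE=incidentops-api
```
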
### Task Queue Modes

- **Development / Tests** – Set `TASK_QUEUE_DRIVER=inmemory` to bypass Celery entirely (the default for local pytest). The API enqueues events into an in-memory recorder while the worker code remains importable.
- **Celery + Redis** – Set `TASK_QUEUE_DRIVER=celery` and either leave `TASK_QUEUE_BROKER_URL` unset (relying on `REDIS_URL`) or point it at another Redis endpoint. This is the default production-style configuration.
- **Celery + Amazon SQS** – Provide `TASK_QUEUE_BROKER_URL=sqs://` (Celery discovers AWS credentials automatically), set `TASK_QUEUE_BACKEND=sqs`, and configure `AWS_REGION`. Optional tuning is available via the visibility timeout and polling interval variables above; see the example below.

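For example, a production-style SQS configuration could combine the variables above like this (the region is an illustrative value; the timeout and polling values shown are simply the documented defaults):

```
# Example: Celery + Amazon SQS configuration
TASK_QUEUE_DRIVER=celery
TASK_QUEUE_BROKER_URL=sqs://
TASK_QUEUE_BACKEND=sqs
AWS_REGION=eu-west-1
TASK_QUEUE_VISIBILITY_TIMEOUT=600
TASK_QUEUE_POLLING_INTERVAL=1.0
```
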
### Running the Worker
The worker automatically discovers tasks under `worker/tasks`. Use the same environment variables as the API:
```
uv run celery -A worker.celery_app worker --loglevel=info
```
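
Workers can also be pinned to specific queues with Celery's standard `-Q` flag, using the queue names configured above (a sketch of an optional setup, not a requirement):

```
# Worker consuming only the critical queue (escalations and delayed work)
uv run celery -A worker.celery_app worker --loglevel=info -Q critical

# Worker consuming both queues with a fixed concurrency
uv run celery -A worker.celery_app worker --loglevel=info -Q default,critical --concurrency=4
```
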
## Setup
### Docker Compose
```
docker compose up --build -d
```
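
A few common follow-up commands once the stack is running (the `api` service name is an assumption; check `docker-compose.yml` for the actual service names):

```
# Tail API logs
docker compose logs -f api

# Check service status
docker compose ps

# Tear everything down, including volumes
docker compose down -v
```
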
### K8S with Skaffold and Helm

```
# Create a local cluster
kind create cluster --name incidentops

# Iterative development loop (rebuilds and redeploys on changes)
skaffold dev

# One-time deployment
skaffold run

# Production deployment
skaffold run -p production

# Manual install with Helm: infrastructure only (for testing; app workloads scaled to zero)
helm install incidentops helm/incidentops -n incidentops --create-namespace \
  --set migration.enabled=false \
  --set api.replicaCount=0 \
  --set worker.replicaCount=0 \
  --set web.replicaCount=0

# Manual install with Helm: full install (requires building the app images first)
helm install incidentops helm/incidentops -n incidentops --create-namespace
```

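To verify the rollout or tear things down again, the standard kubectl, Helm, and kind commands apply:

```
# Watch the pods come up
kubectl get pods -n incidentops -w

# Remove the release and the local cluster
helm uninstall incidentops -n incidentops
kind delete cluster --name incidentops
```
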
### Accessing Dashboards
When running with `skaffold dev`, the following dashboards are port-forwarded automatically:

| Dashboard | URL | Description |
|-----------|-----|-------------|
| **OpenAPI (Swagger)** | http://localhost:8000/docs | Interactive API documentation |
| **OpenAPI (ReDoc)** | http://localhost:8000/redoc | Alternative API docs |
| **Grafana** | http://localhost:3001 | Metrics, logs, and traces |
| **Prometheus** | http://localhost:9090 | Raw metrics queries |
| **Tempo** | http://localhost:3200 | Distributed tracing backend |
| **Loki** | http://localhost:3100 | Log aggregation backend |

Grafana comes pre-configured with datasources for Prometheus, Loki, and Tempo.
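
When not running `skaffold dev`, equivalent port-forwards can be created manually with `kubectl`. The service names and container ports below are assumptions about the chart and may need adjusting:

```
# Assumed service names; verify with: kubectl get svc -n incidentops
kubectl port-forward -n incidentops svc/incidentops-api 8000:8000
kubectl port-forward -n incidentops svc/grafana 3001:3000
kubectl port-forward -n incidentops svc/prometheus 9090:9090
```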