# IncidentOps

A fullstack on-call & incident management platform.

## Environment Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| `DATABASE_URL` | Postgres connection string | — |
| `REDIS_URL` | Legacy Redis endpoint; also used as the Celery broker when no override is supplied | `redis://localhost:6379/0` |
| `TASK_QUEUE_DRIVER` | Task queue implementation (`celery` or `inmemory`) | `celery` |
| `TASK_QUEUE_BROKER_URL` | Celery broker URL (falls back to `REDIS_URL` when unset) | `None` |
| `TASK_QUEUE_BACKEND` | Celery transport semantics (`redis` or `sqs`) | `redis` |
| `TASK_QUEUE_DEFAULT_QUEUE` | Queue used for fan-out and notification deliveries | `default` |
| `TASK_QUEUE_CRITICAL_QUEUE` | Queue used for escalation and delayed work | `critical` |
| `TASK_QUEUE_VISIBILITY_TIMEOUT` | Visibility timeout passed to the `sqs` transport (seconds) | `600` |
| `TASK_QUEUE_POLLING_INTERVAL` | Polling interval for the `sqs` transport (seconds) | `1.0` |
| `NOTIFICATION_ESCALATION_DELAY_SECONDS` | Delay before re-checking unacknowledged incidents | `900` |
| `AWS_REGION` | AWS region used when `TASK_QUEUE_BACKEND=sqs` | `None` |
| `JWT_SECRET_KEY` | Symmetric JWT signing key | — |
| `JWT_ALGORITHM` | JWT signing algorithm | `HS256` |
| `JWT_ISSUER` | JWT issuer claim | `incidentops` |
| `JWT_AUDIENCE` | JWT audience claim | `incidentops-api` |

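For local development, a minimal `.env` might look like the following. Every value here is illustrative (including the database credentials and secret) and should be adjusted to your environment:

```
# .env — example local development settings (illustrative values)
DATABASE_URL=postgresql://incidentops:incidentops@localhost:5432/incidentops
REDIS_URL=redis://localhost:6379/0
TASK_QUEUE_DRIVER=celery
JWT_SECRET_KEY=change-me-in-production
JWT_ALGORITHM=HS256
JWT_ISSUER=incidentops
JWT_AUDIENCE=incidentops-api
```
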
### Task Queue Modes

- **Development / Tests** – Set `TASK_QUEUE_DRIVER=inmemory` to bypass Celery entirely (the default for local pytest). The API enqueues events into an in-memory recorder while the worker code remains importable.
- **Celery + Redis** – Set `TASK_QUEUE_DRIVER=celery` and either leave `TASK_QUEUE_BROKER_URL` unset (relying on `REDIS_URL`) or point it at another Redis endpoint. This is the default production-style configuration.
- **Celery + Amazon SQS** – Provide `TASK_QUEUE_BROKER_URL=sqs://` (Celery discovers AWS credentials automatically), set `TASK_QUEUE_BACKEND=sqs`, and configure `AWS_REGION`. Optional tuning is available via the visibility timeout and polling interval variables above; see the example below.

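For example, a production-style SQS configuration could combine the variables above like this (the region is an illustrative value; the timeout and polling values shown are simply the documented defaults):

```
# Example: Celery + Amazon SQS configuration
TASK_QUEUE_DRIVER=celery
TASK_QUEUE_BROKER_URL=sqs://
TASK_QUEUE_BACKEND=sqs
AWS_REGION=eu-west-1
TASK_QUEUE_VISIBILITY_TIMEOUT=600
TASK_QUEUE_POLLING_INTERVAL=1.0
```
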
### Running the Worker
The worker automatically discovers tasks under `worker/tasks`. Use the same environment variables as the API:
```
uv run celery -A worker.celery_app worker --loglevel=info
```
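
Workers can also be pinned to specific queues with Celery's standard `-Q` flag, using the queue names configured above (a sketch of an optional setup, not a requirement):

```
# Worker consuming only the critical queue (escalations and delayed work)
uv run celery -A worker.celery_app worker --loglevel=info -Q critical

# Worker consuming both queues with a fixed concurrency
uv run celery -A worker.celery_app worker --loglevel=info -Q default,critical --concurrency=4
```
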
## Setup
### Docker Compose
```
docker compose up --build -d
```
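
A few common follow-up commands once the stack is running (the `api` service name is an assumption; check `docker-compose.yml` for the actual service names):

```
# Tail API logs
docker compose logs -f api

# Check service status
docker compose ps

# Tear everything down, including volumes
docker compose down -v
```
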
### K8S with Skaffold and Helm

```
# Create a local cluster
kind create cluster --name incidentops

# Iterative development loop (rebuilds and redeploys on changes)
skaffold dev

# One-time deployment
skaffold run

# Production deployment
skaffold run -p production

# Manual install with Helm: infrastructure only (for testing; app workloads scaled to zero)
helm install incidentops helm/incidentops -n incidentops --create-namespace \
  --set migration.enabled=false \
  --set api.replicaCount=0 \
  --set worker.replicaCount=0 \
  --set web.replicaCount=0

# Manual install with Helm: full install (requires building the app images first)
helm install incidentops helm/incidentops -n incidentops --create-namespace
```

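To verify the rollout or tear things down again, the standard kubectl, Helm, and kind commands apply:

```
# Watch the pods come up
kubectl get pods -n incidentops -w

# Remove the release and the local cluster
helm uninstall incidentops -n incidentops
kind delete cluster --name incidentops
```
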
### Accessing Dashboards
When running with `skaffold dev`, the following dashboards are port-forwarded automatically:

| Dashboard | URL | Description |
|-----------|-----|-------------|
| **OpenAPI (Swagger)** | http://localhost:8000/docs | Interactive API documentation |
| **OpenAPI (ReDoc)** | http://localhost:8000/redoc | Alternative API docs |
| **Grafana** | http://localhost:3001 | Metrics, logs, and traces |
| **Prometheus** | http://localhost:9090 | Raw metrics queries |
| **Tempo** | http://localhost:3200 | Distributed tracing backend |
| **Loki** | http://localhost:3100 | Log aggregation backend |

Grafana comes pre-configured with datasources for Prometheus, Loki, and Tempo.
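
When not running `skaffold dev`, equivalent port-forwards can be created manually with `kubectl`. The service names and container ports below are assumptions about the chart and may need adjusting:

```
# Assumed service names; verify with: kubectl get svc -n incidentops
kubectl port-forward -n incidentops svc/incidentops-api 8000:8000
kubectl port-forward -n incidentops svc/grafana 3001:3000
kubectl port-forward -n incidentops svc/prometheus 9090:9090
```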