Compare commits: 3170f10e86...master (4 commits)

- 2fb4907e31
- 46ede7757d
- f427d191e0
- ad94833830
Dockerfile (12 changed lines)

@@ -7,7 +7,7 @@ WORKDIR /app
 COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
 
 # Install Python dependencies
-COPY pyproject.toml uv.lock ./
+COPY pyproject.toml uv.lock README.md ./
 RUN uv sync --no-cache --no-dev
 
 # Copy application code
@@ -15,9 +15,17 @@ COPY app/ ./app/
 COPY worker/ ./worker/
 COPY migrations/ ./migrations/
 
+# Set up non-root user and cache directory
+RUN useradd -m -u 1000 appuser && \
+    mkdir -p /app/.cache && \
+    chown -R appuser:appuser /app
+
+ENV UV_CACHE_DIR=/app/.cache
+
 # API service target
 FROM base AS api
 
+USER appuser
 EXPOSE 8000
 
 CMD ["uv", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -25,4 +33,6 @@ CMD ["uv", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "800
 # Worker service target
 FROM base AS worker
 
+USER appuser
+
 CMD ["uv", "run", "celery", "-A", "worker.celery_app", "worker", "--loglevel=info", "-Q", "critical,default,low"]
README.md (49 changed lines)

@@ -2,6 +2,40 @@
 
 A fullstack on-call & incident management platform
 
+## Environment Configuration
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DATABASE_URL` | Postgres connection string | — |
+| `REDIS_URL` | Legacy redis endpoint, also used if no broker override is supplied | `redis://localhost:6379/0` |
+| `TASK_QUEUE_DRIVER` | Task queue implementation (`celery` or `inmemory`) | `celery` |
+| `TASK_QUEUE_BROKER_URL` | Celery broker URL (falls back to `REDIS_URL` when unset) | `None` |
+| `TASK_QUEUE_BACKEND` | Celery transport semantics (`redis` or `sqs`) | `redis` |
+| `TASK_QUEUE_DEFAULT_QUEUE` | Queue used for fan-out + notification deliveries | `default` |
+| `TASK_QUEUE_CRITICAL_QUEUE` | Queue used for escalation + delayed work | `critical` |
+| `TASK_QUEUE_VISIBILITY_TIMEOUT` | Visibility timeout passed to `sqs` transport | `600` |
+| `TASK_QUEUE_POLLING_INTERVAL` | Polling interval for `sqs` transport (seconds) | `1.0` |
+| `NOTIFICATION_ESCALATION_DELAY_SECONDS` | Delay before re-checking unacknowledged incidents | `900` |
+| `AWS_REGION` | Region used when `TASK_QUEUE_BACKEND=sqs` | `None` |
+| `JWT_SECRET_KEY` | Symmetric JWT signing key | — |
+| `JWT_ALGORITHM` | JWT algorithm | `HS256` |
+| `JWT_ISSUER` | JWT issuer claim | `incidentops` |
+| `JWT_AUDIENCE` | JWT audience claim | `incidentops-api` |
+
+### Task Queue Modes
+
+- **Development / Tests** – Set `TASK_QUEUE_DRIVER=inmemory` to bypass Celery entirely (default for local pytest). The API will enqueue events into an in-memory recorder while the worker code remains importable.
+- **Celery + Redis** – Set `TASK_QUEUE_DRIVER=celery` and either leave `TASK_QUEUE_BROKER_URL` unset (and rely on `REDIS_URL`) or point it to another Redis endpoint. This is the default production-style configuration.
+- **Celery + Amazon SQS** – Provide `TASK_QUEUE_BROKER_URL=sqs://` (Celery automatically discovers credentials), set `TASK_QUEUE_BACKEND=sqs`, and configure `AWS_REGION`. Optional tuning is available via the visibility timeout and polling interval variables above.
+
+### Running the Worker
+
+The worker automatically discovers tasks under `worker/tasks`. Use the same environment variables as the API:
+
+```
+uv run celery -A worker.celery_app worker --loglevel=info
+```
+
 ## Setup
 
 ### Docker Compose
@@ -35,3 +69,18 @@ skaffold run
 # Production deployment
 skaffold run -p production
 ```
+
+### Accessing Dashboards
+
+When running with `skaffold dev`, the following dashboards are port-forwarded automatically:
+
+| Dashboard | URL | Description |
+|-----------|-----|-------------|
+| **OpenAPI (Swagger)** | http://localhost:8000/docs | Interactive API documentation |
+| **OpenAPI (ReDoc)** | http://localhost:8000/redoc | Alternative API docs |
+| **Grafana** | http://localhost:3001 | Metrics, logs, and traces |
+| **Prometheus** | http://localhost:9090 | Raw metrics queries |
+| **Tempo** | http://localhost:3200 | Distributed tracing backend |
+| **Loki** | http://localhost:3100 | Log aggregation backend |
+
+Grafana comes pre-configured with datasources for Prometheus, Loki, and Tempo.
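To make the mode switch concrete, here is a minimal sketch of how `TASK_QUEUE_DRIVER` and the documented `REDIS_URL` fallback could select a driver. The factory and both driver classes are illustrative stand-ins, not the repo's actual `app/taskqueue` internals:

```
# Illustrative sketch only: how TASK_QUEUE_DRIVER could pick an implementation,
# with the REDIS_URL fallback for the Celery broker URL described above.
import os


class InMemoryTaskQueue:
    """Dev/test driver: records enqueued events instead of using a broker."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def enqueue(self, task: str, payload: dict) -> None:
        self.events.append((task, payload))


class CeleryTaskQueue:
    """Production-style driver: forwards to a Celery broker."""

    def __init__(self, broker_url: str) -> None:
        self.broker_url = broker_url


def make_task_queue() -> InMemoryTaskQueue | CeleryTaskQueue:
    if os.environ.get("TASK_QUEUE_DRIVER", "celery") == "inmemory":
        return InMemoryTaskQueue()
    broker = os.environ.get("TASK_QUEUE_BROKER_URL") or os.environ.get(
        "REDIS_URL", "redis://localhost:6379/0"
    )
    return CeleryTaskQueue(broker_url=broker)
```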
app/api/deps.py (new file, 101 lines)

"""Shared FastAPI dependencies (auth, RBAC, ownership)."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Callable
from uuid import UUID

from fastapi import Depends
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

from app.core import exceptions as exc, security
from app.db import db
from app.repositories import OrgRepository, UserRepository


bearer_scheme = HTTPBearer(auto_error=False)

ROLE_RANKS: dict[str, int] = {"viewer": 0, "member": 1, "admin": 2}


@dataclass(slots=True)
class CurrentUser:
    """Authenticated user context derived from the access token."""

    user_id: UUID
    email: str
    org_id: UUID
    org_role: str
    token: str


async def get_current_user(
    credentials: HTTPAuthorizationCredentials | None = Depends(bearer_scheme),
) -> CurrentUser:
    """Extract and validate the current user from the Authorization header."""

    if credentials is None or credentials.scheme.lower() != "bearer":
        raise exc.UnauthorizedError("Missing bearer token")

    try:
        payload = security.TokenPayload(security.decode_access_token(credentials.credentials))
    except security.JWTError as err:  # pragma: no cover - jose error types
        raise exc.UnauthorizedError("Invalid access token") from err

    async with db.connection() as conn:
        user_repo = UserRepository(conn)
        user = await user_repo.get_by_id(payload.user_id)
        if user is None:
            raise exc.UnauthorizedError("User not found")

        org_repo = OrgRepository(conn)
        membership = await org_repo.get_member(payload.user_id, payload.org_id)
        if membership is None:
            raise exc.ForbiddenError("Organization access denied")

    return CurrentUser(
        user_id=payload.user_id,
        email=user["email"],
        org_id=payload.org_id,
        org_role=membership["role"],
        token=credentials.credentials,
    )


class RoleChecker:
    """Dependency that enforces a minimum organization role."""

    def __init__(self, minimum_role: str) -> None:
        if minimum_role not in ROLE_RANKS:
            raise ValueError(f"Unknown role '{minimum_role}'")
        self.minimum_role = minimum_role

    def __call__(self, current_user: CurrentUser = Depends(get_current_user)) -> CurrentUser:
        if ROLE_RANKS[current_user.org_role] < ROLE_RANKS[self.minimum_role]:
            raise exc.ForbiddenError("Insufficient role for this operation")
        return current_user


def require_role(min_role: str) -> Callable[[CurrentUser], CurrentUser]:
    """Factory that returns a dependency enforcing the specified role."""

    return RoleChecker(min_role)


def ensure_org_access(resource_org_id: UUID, current_user: CurrentUser) -> None:
    """Verify that the resource belongs to the active org in the token."""

    if resource_org_id != current_user.org_id:
        raise exc.ForbiddenError("Resource does not belong to the active organization")


__all__ = [
    "CurrentUser",
    "ROLE_RANKS",
    "RoleChecker",
    "bearer_scheme",
    "ensure_org_access",
    "get_current_user",
    "require_role",
]
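A quick usage sketch for the dependencies above. The route itself is hypothetical, but `require_role` and `CurrentUser` are the helpers from this file; the role check rejects the request with a 403 before the handler body ever runs:

```
from fastapi import APIRouter, Depends

from app.api.deps import CurrentUser, require_role

router = APIRouter()


@router.delete("/services/{service_id}")  # hypothetical endpoint for illustration
async def delete_service(
    service_id: str,
    current_user: CurrentUser = Depends(require_role("admin")),  # viewer/member get 403
) -> dict[str, str]:
    return {"deleted_by": str(current_user.user_id)}
```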
app/api/v1/auth.py (new file, 59 lines)

"""Authentication API endpoints."""

from fastapi import APIRouter, Depends, status

from app.api.deps import CurrentUser, get_current_user
from app.schemas.auth import (
    LoginRequest,
    LogoutRequest,
    RefreshRequest,
    RegisterRequest,
    SwitchOrgRequest,
    TokenResponse,
)
from app.services import AuthService


router = APIRouter(prefix="/auth", tags=["auth"])
auth_service = AuthService()


@router.post("/register", response_model=TokenResponse, status_code=status.HTTP_201_CREATED)
async def register_user(payload: RegisterRequest) -> TokenResponse:
    """Register a new user and default org, returning auth tokens."""

    return await auth_service.register_user(payload)


@router.post("/login", response_model=TokenResponse)
async def login_user(payload: LoginRequest) -> TokenResponse:
    """Authenticate an existing user and issue tokens."""

    return await auth_service.login_user(payload)


@router.post("/refresh", response_model=TokenResponse)
async def refresh_tokens(payload: RefreshRequest) -> TokenResponse:
    """Rotate refresh token and mint a new access token."""

    return await auth_service.refresh_tokens(payload)


@router.post("/switch-org", response_model=TokenResponse)
async def switch_org(
    payload: SwitchOrgRequest,
    current_user: CurrentUser = Depends(get_current_user),
) -> TokenResponse:
    """Switch the active organization for the authenticated user."""

    return await auth_service.switch_org(current_user, payload)


@router.post("/logout", status_code=status.HTTP_204_NO_CONTENT)
async def logout(
    payload: LogoutRequest,
    current_user: CurrentUser = Depends(get_current_user),
) -> None:
    """Revoke the provided refresh token for the current session."""

    await auth_service.logout(current_user, payload)
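A sketch of the token lifecycle against these routes, using httpx. Only the paths come from the router above; the base URL, the register body fields, and the `access_token`/`refresh_token` response keys are assumptions about the schemas:

```
import httpx

with httpx.Client(base_url="http://localhost:8000/v1") as client:
    # POST /auth/register returns 201 with a TokenResponse.
    tokens = client.post(
        "/auth/register",
        json={"email": "oncall@example.com", "password": "s3cret!"},  # assumed fields
    ).json()

    # Use the access token as a bearer credential on protected routes.
    headers = {"Authorization": f"Bearer {tokens['access_token']}"}  # assumed key

    # POST /auth/refresh rotates the refresh token and mints a new access token.
    tokens = client.post(
        "/auth/refresh",
        json={"refresh_token": tokens["refresh_token"]},  # assumed RefreshRequest field
    ).json()
```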
app/api/v1/health.py (modified)

@@ -2,7 +2,8 @@
 from fastapi import APIRouter, Response, status
 
-from app.db import db, redis_client
+from app.db import db
+from app.taskqueue import task_queue
 
 router = APIRouter()
 
@@ -16,14 +17,14 @@ async def healthz() -> dict[str, str]:
 @router.get("/readyz")
 async def readyz(response: Response) -> dict[str, str | dict[str, bool]]:
     """
-    Readiness probe - checks database and Redis connectivity.
+    Readiness probe - checks database and task queue connectivity.
     - Check Postgres status
-    - Check Redis status
+    - Check configured task queue backend
     - Return overall healthiness
     """
     checks = {
         "postgres": False,
-        "redis": False,
+        "task_queue": False,
     }
 
     try:
@@ -34,7 +35,7 @@ async def readyz(response: Response) -> dict[str, str | dict[str, bool]]:
     except Exception:
         pass
 
-    checks["redis"] = await redis_client.ping()
+    checks["task_queue"] = await task_queue.ping()
 
     all_healthy = all(checks.values())
     if not all_healthy:
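The probe now only needs an async `ping()` from the queue abstraction rather than a live Redis client. A minimal stand-in that satisfies that contract (illustrative, not the repo's `app/taskqueue` implementation):

```
class InMemoryTaskQueue:
    """Stand-in driver: nothing to reach, so readiness is trivially true."""

    async def ping(self) -> bool:
        return True


class CeleryTaskQueue:
    """Broker-backed driver: readiness means the broker answered."""

    def __init__(self, broker) -> None:
        self._broker = broker  # assumed broker client exposing an async ping()

    async def ping(self) -> bool:
        try:
            await self._broker.ping()
            return True
        except Exception:
            return False
```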
app/api/v1/incidents.py (new file, 103 lines)

"""Incident API endpoints."""

from datetime import datetime
from uuid import UUID

from fastapi import APIRouter, Depends, Query, status

from app.api.deps import CurrentUser, get_current_user, require_role
from app.schemas.common import PaginatedResponse
from app.schemas.incident import (
    CommentRequest,
    IncidentEventResponse,
    IncidentResponse,
    IncidentStatus,
    TransitionRequest,
    IncidentCreate,
)
from app.services import IncidentService


router = APIRouter(tags=["incidents"])
incident_service = IncidentService()


@router.get("/incidents", response_model=PaginatedResponse[IncidentResponse])
async def list_incidents(
    status: IncidentStatus | None = Query(default=None),
    cursor: datetime | None = Query(default=None, description="Cursor (created_at)"),
    limit: int = Query(default=20, ge=1, le=100),
    current_user: CurrentUser = Depends(get_current_user),
) -> PaginatedResponse[IncidentResponse]:
    """List incidents for the active organization."""

    return await incident_service.get_incidents(
        current_user,
        status=status,
        cursor=cursor,
        limit=limit,
    )


@router.post(
    "/services/{service_id}/incidents",
    response_model=IncidentResponse,
    status_code=status.HTTP_201_CREATED,
)
async def create_incident(
    service_id: UUID,
    payload: IncidentCreate,
    current_user: CurrentUser = Depends(require_role("member")),
) -> IncidentResponse:
    """Create a new incident for the given service (member+)."""

    return await incident_service.create_incident(current_user, service_id, payload)


@router.get("/incidents/{incident_id}", response_model=IncidentResponse)
async def get_incident(
    incident_id: UUID,
    current_user: CurrentUser = Depends(get_current_user),
) -> IncidentResponse:
    """Fetch a single incident by ID."""

    return await incident_service.get_incident(current_user, incident_id)


@router.get("/incidents/{incident_id}/events", response_model=list[IncidentEventResponse])
async def get_incident_events(
    incident_id: UUID,
    current_user: CurrentUser = Depends(get_current_user),
) -> list[IncidentEventResponse]:
    """Get the event timeline for an incident."""

    return await incident_service.get_incident_events(current_user, incident_id)


@router.post(
    "/incidents/{incident_id}/transition",
    response_model=IncidentResponse,
)
async def transition_incident(
    incident_id: UUID,
    payload: TransitionRequest,
    current_user: CurrentUser = Depends(require_role("member")),
) -> IncidentResponse:
    """Transition an incident status (member+)."""

    return await incident_service.transition_incident(current_user, incident_id, payload)


@router.post(
    "/incidents/{incident_id}/comment",
    response_model=IncidentEventResponse,
    status_code=status.HTTP_201_CREATED,
)
async def add_comment(
    incident_id: UUID,
    payload: CommentRequest,
    current_user: CurrentUser = Depends(require_role("member")),
) -> IncidentEventResponse:
    """Add a comment to the incident timeline (member+)."""

    return await incident_service.add_comment(current_user, incident_id, payload)
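Walking the paginated list with the `cursor`/`limit` parameters above. The query parameters match `list_incidents`; the `items` and `next_cursor` field names of `PaginatedResponse` are assumptions:

```
import httpx


def fetch_all_incidents(client: httpx.Client, token: str) -> list[dict]:
    incidents: list[dict] = []
    cursor: str | None = None
    while True:
        params: dict = {"limit": 100}
        if cursor is not None:
            params["cursor"] = cursor  # created_at of the last item on the prior page
        page = client.get(
            "/v1/incidents",
            params=params,
            headers={"Authorization": f"Bearer {token}"},
        ).json()
        incidents.extend(page["items"])    # assumed PaginatedResponse field
        cursor = page.get("next_cursor")   # assumed PaginatedResponse field
        if cursor is None:
            return incidents
```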
app/api/v1/org.py (new file, 72 lines)

"""Organization API endpoints."""

from fastapi import APIRouter, Depends, status

from app.api.deps import CurrentUser, get_current_user, require_role
from app.schemas.org import (
    MemberResponse,
    NotificationTargetCreate,
    NotificationTargetResponse,
    OrgResponse,
    ServiceCreate,
    ServiceResponse,
)
from app.services import OrgService


router = APIRouter(prefix="/org", tags=["org"])
org_service = OrgService()


@router.get("", response_model=OrgResponse)
async def get_org(current_user: CurrentUser = Depends(get_current_user)) -> OrgResponse:
    """Return the active organization summary for the authenticated user."""

    return await org_service.get_current_org(current_user)


@router.get("/members", response_model=list[MemberResponse])
async def list_members(current_user: CurrentUser = Depends(require_role("admin"))) -> list[MemberResponse]:
    """List members of the current organization (admin only)."""

    return await org_service.get_members(current_user)


@router.get("/services", response_model=list[ServiceResponse])
async def list_services(current_user: CurrentUser = Depends(get_current_user)) -> list[ServiceResponse]:
    """List services for the current organization."""

    return await org_service.get_services(current_user)


@router.post("/services", response_model=ServiceResponse, status_code=status.HTTP_201_CREATED)
async def create_service(
    payload: ServiceCreate,
    current_user: CurrentUser = Depends(require_role("member")),
) -> ServiceResponse:
    """Create a new service within the current organization (member+)."""

    return await org_service.create_service(current_user, payload)


@router.get("/notification-targets", response_model=list[NotificationTargetResponse])
async def list_notification_targets(
    current_user: CurrentUser = Depends(require_role("admin")),
) -> list[NotificationTargetResponse]:
    """List notification targets for the current organization (admin only)."""

    return await org_service.get_notification_targets(current_user)


@router.post(
    "/notification-targets",
    response_model=NotificationTargetResponse,
    status_code=status.HTTP_201_CREATED,
)
async def create_notification_target(
    payload: NotificationTargetCreate,
    current_user: CurrentUser = Depends(require_role("admin")),
) -> NotificationTargetResponse:
    """Create a notification target for the current organization (admin only)."""

    return await org_service.create_notification_target(current_user, payload)
app/config.py (modified)

@@ -1,5 +1,7 @@
 """Application configuration via pydantic-settings."""
 
+from typing import Literal
+
 from pydantic_settings import BaseSettings, SettingsConfigDict
 
 
@@ -15,9 +17,22 @@ class Settings(BaseSettings):
     # Database
     database_url: str
 
-    # Redis
+    # Redis (legacy default for Celery broker)
     redis_url: str = "redis://localhost:6379/0"
 
+    # Task queue
+    task_queue_driver: Literal["celery", "inmemory"] = "celery"
+    task_queue_broker_url: str | None = None
+    task_queue_backend: Literal["redis", "sqs"] = "redis"
+    task_queue_default_queue: str = "default"
+    task_queue_critical_queue: str = "critical"
+    task_queue_visibility_timeout: int = 600
+    task_queue_polling_interval: float = 1.0
+    notification_escalation_delay_seconds: int = 900
+
+    # AWS (used when task_queue_backend="sqs")
+    aws_region: str | None = None
+
     # JWT
     jwt_secret_key: str
     jwt_algorithm: str = "HS256"
@@ -30,5 +45,22 @@ class Settings(BaseSettings):
     debug: bool = False
     api_v1_prefix: str = "/v1"
 
+    # OpenTelemetry
+    otel_enabled: bool = True
+    otel_service_name: str = "incidentops-api"
+    otel_environment: str = "development"
+    otel_exporter_otlp_endpoint: str | None = None  # e.g., "http://tempo:4317"
+    otel_exporter_otlp_insecure: bool = True
+    otel_log_level: str = "INFO"
+
+    # Metrics
+    prometheus_port: int = 9464  # Port for Prometheus metrics endpoint
+
+    @property
+    def resolved_task_queue_broker_url(self) -> str:
+        """Return the broker URL with redis fallback for backwards compatibility."""
+
+        return self.task_queue_broker_url or self.redis_url
 
-settings = Settings()
+
+settings = Settings()  # type: ignore[call-arg]
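The broker fallback is small enough to verify standalone. This sketch mirrors the property above with a stub settings object, so it runs without the app's required environment variables:

```
class StubSettings:
    redis_url = "redis://localhost:6379/0"
    task_queue_broker_url: str | None = None

    @property
    def resolved_task_queue_broker_url(self) -> str:
        # Same logic as Settings: explicit broker URL wins, REDIS_URL otherwise.
        return self.task_queue_broker_url or self.redis_url


cfg = StubSettings()
assert cfg.resolved_task_queue_broker_url == "redis://localhost:6379/0"

cfg.task_queue_broker_url = "sqs://"  # e.g. the SQS mode from the README
assert cfg.resolved_task_queue_broker_url == "sqs://"
```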
app/core/logging.py (new file, 164 lines)

"""Structured JSON logging configuration with OpenTelemetry integration."""

import json
import logging
import sys
from datetime import datetime, timezone
from typing import Any

from app.config import settings


class JSONFormatter(logging.Formatter):
    """
    JSON log formatter that outputs structured logs with trace context.

    Log format includes:
    - timestamp: ISO 8601 format
    - level: Log level name
    - message: Log message
    - logger: Logger name
    - trace_id: OpenTelemetry trace ID (if available)
    - span_id: OpenTelemetry span ID (if available)
    - Extra fields from log record
    """

    def format(self, record: logging.LogRecord) -> str:
        log_data: dict[str, Any] = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }

        # Add trace context if available (injected by OpenTelemetry LoggingInstrumentor)
        if hasattr(record, "otelTraceID") and record.otelTraceID != "0":
            log_data["trace_id"] = record.otelTraceID
        if hasattr(record, "otelSpanID") and record.otelSpanID != "0":
            log_data["span_id"] = record.otelSpanID

        # Add exception info if present
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)

        # Add extra fields (excluding standard LogRecord attributes)
        standard_attrs = {
            "name",
            "msg",
            "args",
            "created",
            "filename",
            "funcName",
            "levelname",
            "levelno",
            "lineno",
            "module",
            "msecs",
            "pathname",
            "process",
            "processName",
            "relativeCreated",
            "stack_info",
            "exc_info",
            "exc_text",
            "thread",
            "threadName",
            "taskName",
            "message",
            "otelTraceID",
            "otelSpanID",
            "otelTraceSampled",
            "otelServiceName",
        }
        for key, value in record.__dict__.items():
            if key not in standard_attrs and not key.startswith("_"):
                log_data[key] = value

        return json.dumps(log_data, default=str)


class DevelopmentFormatter(logging.Formatter):
    """
    Human-readable formatter for development with color support.

    Format: [TIME] LEVEL logger - message [trace_id]
    """

    COLORS = {
        "DEBUG": "\033[36m",  # Cyan
        "INFO": "\033[32m",  # Green
        "WARNING": "\033[33m",  # Yellow
        "ERROR": "\033[31m",  # Red
        "CRITICAL": "\033[35m",  # Magenta
    }
    RESET = "\033[0m"

    def format(self, record: logging.LogRecord) -> str:
        color = self.COLORS.get(record.levelname, "")
        reset = self.RESET

        # Format timestamp
        timestamp = datetime.now(timezone.utc).strftime("%H:%M:%S.%f")[:-3]

        # Build message
        msg = f"[{timestamp}] {color}{record.levelname:8}{reset} {record.name} - {record.getMessage()}"

        # Add trace context if available
        if hasattr(record, "otelTraceID") and record.otelTraceID != "0":
            msg += f" [{record.otelTraceID[:8]}...]"

        # Add exception if present
        if record.exc_info:
            msg += f"\n{self.formatException(record.exc_info)}"

        return msg


def setup_logging() -> None:
    """
    Configure application logging.

    - JSON format in production (OTEL enabled)
    - Human-readable format in development
    - Integrates with OpenTelemetry trace context
    """
    # Determine log level
    log_level = getattr(logging, settings.otel_log_level.upper(), logging.INFO)

    # Choose formatter based on environment
    if settings.otel_enabled and not settings.debug:
        formatter = JSONFormatter()
    else:
        formatter = DevelopmentFormatter()

    # Configure root logger
    root_logger = logging.getLogger()
    root_logger.setLevel(log_level)

    # Remove existing handlers
    for handler in root_logger.handlers[:]:
        root_logger.removeHandler(handler)

    # Add stdout handler
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(formatter)
    root_logger.addHandler(handler)

    # Reduce noise from third-party libraries (keep uvicorn access at INFO so requests are logged)
    logging.getLogger("uvicorn.access").setLevel(logging.INFO)
    logging.getLogger("asyncpg").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("httpcore").setLevel(logging.WARNING)

    logging.info(
        "Logging configured",
        extra={
            "log_level": settings.otel_log_level,
            "format": "json" if settings.otel_enabled and not settings.debug else "dev",
        },
    )


def get_logger(name: str) -> logging.Logger:
    """Get a logger instance with the given name."""
    return logging.getLogger(name)
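What the extra-field pass-through in `JSONFormatter.format` buys you: any non-standard `LogRecord` attribute becomes a top-level JSON key. A small demo, assuming the app's settings load (the module reads them at import time):

```
import logging

from app.core.logging import JSONFormatter

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("incident created", extra={"incident_id": "inc_123", "severity": "high"})
# Emits roughly:
# {"timestamp": "...", "level": "INFO", "message": "incident created",
#  "logger": "demo", "incident_id": "inc_123", "severity": "high"}
# trace_id/span_id appear only once LoggingInstrumentor has injected them.
```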
app/core/telemetry.py (new file, 271 lines)

"""OpenTelemetry instrumentation for tracing, metrics, and logging."""

import logging
from contextlib import contextmanager
from typing import Any

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.instrumentation.asyncpg import AsyncPGInstrumentor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.system_metrics import SystemMetricsInstrumentor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.semconv.resource import ResourceAttributes
from prometheus_client import REGISTRY, start_http_server

from app.config import settings

logger = logging.getLogger(__name__)

_tracer_provider: TracerProvider | None = None
_meter_provider: MeterProvider | None = None

# Custom metrics
_request_counter = None
_request_duration = None
_active_requests = None
_error_counter = None


def setup_telemetry(app: Any) -> None:
    """
    Initialize OpenTelemetry with tracing, metrics, and logging instrumentation.

    Configures:
    - OTLP exporter for traces (to Tempo/Jaeger)
    - Prometheus exporter for metrics (scraped by Prometheus)
    - Auto-instrumentation for FastAPI, asyncpg, httpx, redis
    - System metrics (CPU, memory, etc.)
    - Logging instrumentation for trace context injection
    """
    global _tracer_provider, _meter_provider
    global _request_counter, _request_duration, _active_requests, _error_counter

    if not settings.otel_enabled:
        logger.info("OpenTelemetry disabled")
        return

    # Create resource with service info
    resource = Resource.create(
        {
            ResourceAttributes.SERVICE_NAME: settings.otel_service_name,
            ResourceAttributes.SERVICE_VERSION: "0.1.0",
            ResourceAttributes.DEPLOYMENT_ENVIRONMENT: settings.otel_environment,
        }
    )

    # =========================================
    # TRACING SETUP
    # =========================================
    _tracer_provider = TracerProvider(resource=resource)

    if settings.otel_exporter_otlp_endpoint:
        otlp_exporter = OTLPSpanExporter(
            endpoint=settings.otel_exporter_otlp_endpoint,
            insecure=settings.otel_exporter_otlp_insecure,
        )
        _tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
        logger.info(f"OTLP exporter configured: {settings.otel_exporter_otlp_endpoint}")
    else:
        _tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
        logger.info("Console span exporter configured (no OTLP endpoint)")

    trace.set_tracer_provider(_tracer_provider)

    # =========================================
    # METRICS SETUP
    # =========================================
    # Prometheus metric reader exposes metrics at /metrics endpoint
    prometheus_reader = PrometheusMetricReader()
    _meter_provider = MeterProvider(resource=resource, metric_readers=[prometheus_reader])
    metrics.set_meter_provider(_meter_provider)

    # Start Prometheus HTTP server on port 9464
    prometheus_port = settings.prometheus_port
    try:
        start_http_server(port=prometheus_port, registry=REGISTRY)
        logger.info(f"Prometheus metrics server started on port {prometheus_port}")
    except OSError as e:
        logger.warning(f"Could not start Prometheus server on port {prometheus_port}: {e}")

    # Create custom metrics
    meter = metrics.get_meter(__name__)

    _request_counter = meter.create_counter(
        name="http_requests_total",
        description="Total number of HTTP requests",
        unit="1",
    )

    _request_duration = meter.create_histogram(
        name="http_request_duration_seconds",
        description="HTTP request duration in seconds",
        unit="s",
    )

    _active_requests = meter.create_up_down_counter(
        name="http_requests_active",
        description="Number of active HTTP requests",
        unit="1",
    )

    _error_counter = meter.create_counter(
        name="http_errors_total",
        description="Total number of HTTP errors",
        unit="1",
    )

    # Instrument system metrics (CPU, memory, etc.)
    SystemMetricsInstrumentor().instrument()
    logger.info("System metrics instrumentation enabled")

    # =========================================
    # LIBRARY INSTRUMENTATION
    # =========================================
    FastAPIInstrumentor.instrument_app(
        app,
        excluded_urls="healthz,readyz,metrics",
        tracer_provider=_tracer_provider,
        meter_provider=_meter_provider,
    )
    AsyncPGInstrumentor().instrument(tracer_provider=_tracer_provider)
    HTTPXClientInstrumentor().instrument(tracer_provider=_tracer_provider)
    RedisInstrumentor().instrument(tracer_provider=_tracer_provider)

    # Inject trace context into logs
    LoggingInstrumentor().instrument(
        set_logging_format=True,
        log_level=logging.INFO,
    )

    logger.info(
        f"OpenTelemetry initialized: service={settings.otel_service_name}, "
        f"env={settings.otel_environment}, metrics_port={prometheus_port}"
    )


async def shutdown_telemetry() -> None:
    """Gracefully shutdown the tracer and meter providers."""
    global _tracer_provider, _meter_provider

    if _tracer_provider:
        _tracer_provider.shutdown()
        _tracer_provider = None
        logger.info("Tracer provider shutdown complete")

    if _meter_provider:
        _meter_provider.shutdown()
        _meter_provider = None
        logger.info("Meter provider shutdown complete")


def get_tracer(name: str) -> trace.Tracer:
    """Get a tracer instance for manual span creation."""
    return trace.get_tracer(name)


def get_meter(name: str) -> metrics.Meter:
    """Get a meter instance for custom metrics."""
    return metrics.get_meter(name)


def get_current_trace_id() -> str | None:
    """Get the current trace ID for request correlation."""
    span = trace.get_current_span()
    if span and span.get_span_context().is_valid:
        return format(span.get_span_context().trace_id, "032x")
    return None


def get_current_span_id() -> str | None:
    """Get the current span ID."""
    span = trace.get_current_span()
    if span and span.get_span_context().is_valid:
        return format(span.get_span_context().span_id, "016x")
    return None


@contextmanager
def create_span(name: str, attributes: dict[str, Any] | None = None):
    """Context manager for creating manual spans."""
    tracer = get_tracer(__name__)
    with tracer.start_as_current_span(name, attributes=attributes) as span:
        yield span


def add_span_attributes(attributes: dict[str, Any]) -> None:
    """Add attributes to the current span."""
    span = trace.get_current_span()
    if span:
        for key, value in attributes.items():
            span.set_attribute(key, value)


def record_exception(exception: Exception) -> None:
    """Record an exception on the current span."""
    span = trace.get_current_span()
    if span:
        span.record_exception(exception)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(exception)))


# =========================================
# CUSTOM METRICS HELPERS
# =========================================


def record_request(method: str, endpoint: str, status_code: int) -> None:
    """Record a request metric."""
    if _request_counter:
        _request_counter.add(
            1,
            {
                "method": method,
                "endpoint": endpoint,
                "status_code": str(status_code),
            },
        )


def record_request_duration(method: str, endpoint: str, duration: float) -> None:
    """Record request duration in seconds."""
    if _request_duration:
        _request_duration.record(
            duration,
            {
                "method": method,
                "endpoint": endpoint,
            },
        )


def increment_active_requests(method: str, endpoint: str) -> None:
    """Increment active requests counter."""
    if _active_requests:
        _active_requests.add(1, {"method": method, "endpoint": endpoint})


def decrement_active_requests(method: str, endpoint: str) -> None:
    """Decrement active requests counter."""
    if _active_requests:
        _active_requests.add(-1, {"method": method, "endpoint": endpoint})


def record_error(method: str, endpoint: str, error_type: str) -> None:
    """Record an error metric."""
    if _error_counter:
        _error_counter.add(
            1,
            {
                "method": method,
                "endpoint": endpoint,
                "error_type": error_type,
            },
        )
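A usage sketch for the manual-instrumentation helpers above. The escalation function is hypothetical; the helper names and import path come from this module:

```
from app.core.telemetry import add_span_attributes, create_span, record_exception


def escalate(incident_id: str) -> None:  # hypothetical caller
    with create_span("escalate_incident", {"incident.id": incident_id}):
        try:
            add_span_attributes({"escalation.level": 1})
            # ... notify the on-call target ...
        except Exception as exc:
            # Marks the current span as errored, then re-raises.
            record_exception(exc)
            raise
```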
app/db.py (60 changed lines)

@@ -2,9 +2,10 @@
 from collections.abc import AsyncGenerator
 from contextlib import asynccontextmanager
+from contextvars import ContextVar
 
 import asyncpg
-import redis.asyncio as redis
+from asyncpg.pool import PoolConnectionProxy
 
 
 class Database:
@@ -27,7 +28,7 @@ class Database:
         await self.pool.close()
 
     @asynccontextmanager
-    async def connection(self) -> AsyncGenerator[asyncpg.Connection, None]:
+    async def connection(self) -> AsyncGenerator[asyncpg.Connection | PoolConnectionProxy, None]:
         """Acquire a connection from the pool."""
         if not self.pool:
             raise RuntimeError("Database not connected")
@@ -35,7 +36,7 @@ class Database:
            yield conn
 
    @asynccontextmanager
-    async def transaction(self) -> AsyncGenerator[asyncpg.Connection, None]:
+    async def transaction(self) -> AsyncGenerator[asyncpg.Connection | PoolConnectionProxy, None]:
        """Acquire a connection with an active transaction."""
        if not self.pool:
            raise RuntimeError("Database not connected")
@@ -44,37 +45,30 @@ class Database:
             yield conn
 
 
-class RedisClient:
-    """Manages Redis connection."""
-
-    client: redis.Redis | None = None
-
-    async def connect(self, url: str) -> None:
-        """Create Redis connection."""
-        self.client = redis.from_url(url, decode_responses=True)
-
-    async def disconnect(self) -> None:
-        """Close Redis connection."""
-        if self.client:
-            await self.client.aclose()
-
-    async def ping(self) -> bool:
-        """Check if Redis is reachable."""
-        if not self.client:
-            return False
-        try:
-            await self.client.ping()
-            return True
-        except redis.RedisError:
-            return False
-
-
-# Global instances
+# Global instance
 db = Database()
-redis_client = RedisClient()
 
 
-async def get_conn() -> AsyncGenerator[asyncpg.Connection, None]:
-    """Dependency for getting a database connection."""
-    async with db.connection() as conn:
+_connection_ctx: ContextVar[asyncpg.Connection | PoolConnectionProxy | None] = ContextVar(
+    "db_connection",
+    default=None,
+)
+
+
+async def get_conn() -> AsyncGenerator[asyncpg.Connection | PoolConnectionProxy, None]:
+    """Dependency that reuses the same DB connection within a request context."""
+
+    existing_conn = _connection_ctx.get()
+    if existing_conn is not None:
+        yield existing_conn
+        return
+
+    if not db.pool:
+        raise RuntimeError("Database not connected")
+
+    async with db.pool.acquire() as conn:
+        token = _connection_ctx.set(conn)
+        try:
             yield conn
+        finally:
+            _connection_ctx.reset(token)
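The point of the `ContextVar` rewrite: nested `get_conn()` calls within one request observe the same pooled connection instead of acquiring a second one, so two repositories can share it. A sketch driving the dependency generator directly, assuming `db.connect(...)` has already run:

```
from app.db import get_conn


async def show_connection_reuse() -> None:
    # Outer call acquires from the pool and stores it in the ContextVar.
    async for conn in get_conn():
        # Inner call (e.g. a second repository dependency) gets the same object.
        async for same_conn in get_conn():
            assert conn is same_conn
```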
app/main.py (259 changed lines)

@@ -1,33 +1,282 @@
 """FastAPI application entry point."""
 
+import logging
+import time
 from contextlib import asynccontextmanager
 from typing import AsyncGenerator
 
-from fastapi import FastAPI
+from fastapi import FastAPI, Request, status
+from fastapi.encoders import jsonable_encoder
+from fastapi.exceptions import RequestValidationError
+from fastapi.openapi.utils import get_openapi
+from fastapi.responses import JSONResponse
+from starlette.exceptions import HTTPException as StarletteHTTPException
 
-from app.api.v1 import health
+from app.api.v1 import auth, health, incidents, org
 from app.config import settings
-from app.db import db, redis_client
+from app.core.logging import setup_logging
+from app.core.telemetry import (
+    get_current_trace_id,
+    record_exception,
+    setup_telemetry,
+    shutdown_telemetry,
+)
+from app.db import db
+from app.schemas.common import ErrorDetail, ErrorResponse
+from app.taskqueue import task_queue
+
+# Initialize logging before anything else
+setup_logging()
+logger = logging.getLogger(__name__)
 
 
 @asynccontextmanager
 async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
     """Manage application lifecycle - connect/disconnect resources."""
     # Startup
+    logger.info("Starting IncidentOps API")
     await db.connect(settings.database_url)
-    await redis_client.connect(settings.redis_url)
+    await task_queue.startup()
+    logger.info("Startup complete")
     yield
     # Shutdown
-    await redis_client.disconnect()
+    logger.info("Shutting down IncidentOps API")
+    await task_queue.shutdown()
     await db.disconnect()
+    await shutdown_telemetry()
+    logger.info("Shutdown complete")
 
 
 app = FastAPI(
     title="IncidentOps",
     description="Incident management API with multi-tenant org support",
     version="0.1.0",
+    docs_url="/docs",
+    redoc_url="/redoc",
+    openapi_url="/openapi.json",
     lifespan=lifespan,
 )
 
+# Set up OpenTelemetry instrumentation
+setup_telemetry(app)
+
+
+@app.middleware("http")
+async def request_logging_middleware(request: Request, call_next):
+    start = time.time()
+    response = await call_next(request)
+    duration_ms = (time.time() - start) * 1000
+    logger.info(
+        "request",
+        extra={
+            "method": request.method,
+            "path": request.url.path,
+            "status_code": response.status_code,
+            "duration_ms": round(duration_ms, 2),
+        },
+    )
+    return response
+
+
+app.openapi_tags = [
+    {"name": "auth", "description": "Registration, login, token lifecycle"},
+    {"name": "org", "description": "Organization membership, services, and notifications"},
+    {"name": "incidents", "description": "Incident lifecycle and timelines"},
+    {"name": "health", "description": "Service health probes"},
+]
+
+
+# ---------------------------------------------------------------------------
+# Global Exception Handlers
+# ---------------------------------------------------------------------------
+
+
+def _build_error_response(
+    error: str,
+    message: str,
+    status_code: int,
+    details: list[ErrorDetail] | None = None,
+) -> JSONResponse:
+    """Build a structured error response with trace context."""
+    response = ErrorResponse(
+        error=error,
+        message=message,
+        details=details,
+        request_id=get_current_trace_id(),
+    )
+    return JSONResponse(
+        status_code=status_code,
+        content=jsonable_encoder(response),
+    )
+
+
+@app.exception_handler(StarletteHTTPException)
+async def http_exception_handler(
+    request: Request, exc: StarletteHTTPException
+) -> JSONResponse:
+    """Handle HTTP exceptions with structured error responses."""
+    # Map status codes to error type strings
+    error_types = {
+        400: "bad_request",
+        401: "unauthorized",
+        403: "forbidden",
+        404: "not_found",
+        409: "conflict",
+        422: "validation_error",
+        429: "rate_limited",
+        500: "internal_error",
+        502: "bad_gateway",
+        503: "service_unavailable",
+    }
+    error_type = error_types.get(exc.status_code, "error")
+
+    logger.warning(
+        "HTTP exception",
+        extra={
+            "status_code": exc.status_code,
+            "error": error_type,
+            "detail": exc.detail,
+            "path": str(request.url.path),
+            "method": request.method,
+        },
+    )
+
+    return _build_error_response(
+        error=error_type,
+        message=str(exc.detail),
+        status_code=exc.status_code,
+    )
+
+
+@app.exception_handler(RequestValidationError)
+async def validation_exception_handler(
+    request: Request, exc: RequestValidationError
+) -> JSONResponse:
+    """Handle Pydantic validation errors with detailed error responses."""
+    details = [
+        ErrorDetail(
+            loc=[str(loc) for loc in error["loc"]],
+            msg=error["msg"],
+            type=error["type"],
+        )
+        for error in exc.errors()
+    ]
+
+    logger.warning(
+        "Validation error",
+        extra={
+            "path": str(request.url.path),
+            "method": request.method,
+            "error_count": len(details),
+        },
+    )
+
+    return _build_error_response(
+        error="validation_error",
+        message="Request validation failed",
+        status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
+        details=details,
+    )
+
+
+@app.exception_handler(Exception)
+async def unhandled_exception_handler(request: Request, exc: Exception) -> JSONResponse:
+    """Handle unexpected exceptions with logging and safe error response."""
+    # Record exception in the current span for tracing
+    record_exception(exc)
+
+    logger.exception(
+        "Unhandled exception",
+        extra={
+            "path": str(request.url.path),
+            "method": request.method,
+            "exception_type": type(exc).__name__,
+        },
+    )
+
+    # Don't leak internal error details in production
+    message = "An unexpected error occurred"
+    if settings.debug:
+        message = f"{type(exc).__name__}: {exc}"
+
+    return _build_error_response(
+        error="internal_error",
+        message=message,
+        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+    )
+
+
+# ---------------------------------------------------------------------------
+# OpenAPI Customization
+# ---------------------------------------------------------------------------
+
+
+def custom_openapi() -> dict:
+    """Add JWT bearer security scheme and error responses to OpenAPI schema."""
+    if app.openapi_schema:
+        return app.openapi_schema
+
+    openapi_schema = get_openapi(
+        title=app.title,
+        version=app.version,
+        description=app.description,
+        routes=app.routes,
+        tags=app.openapi_tags,
+    )
+
+    # Add security schemes
+    components = openapi_schema.setdefault("components", {})
+    security_schemes = components.setdefault("securitySchemes", {})
+    security_schemes["BearerToken"] = {
+        "type": "http",
+        "scheme": "bearer",
+        "bearerFormat": "JWT",
+        "description": "Paste the JWT access token returned by /auth endpoints",
+    }
+    openapi_schema["security"] = [{"BearerToken": []}]
+
+    # Add common error response schemas
+    schemas = components.setdefault("schemas", {})
+    schemas["ErrorResponse"] = {
+        "type": "object",
+        "properties": {
+            "error": {"type": "string", "description": "Error type identifier"},
+            "message": {"type": "string", "description": "Human-readable error message"},
+            "details": {
+                "type": "array",
+                "items": {"$ref": "#/components/schemas/ErrorDetail"},
+                "nullable": True,
+                "description": "Validation error details",
+            },
+            "request_id": {
+                "type": "string",
+                "nullable": True,
+                "description": "Trace ID for debugging",
+            },
+        },
+        "required": ["error", "message"],
+    }
+    schemas["ErrorDetail"] = {
+        "type": "object",
+        "properties": {
+            "loc": {
+                "type": "array",
+                "items": {"oneOf": [{"type": "string"}, {"type": "integer"}]},
+                "description": "Error location path",
+            },
+            "msg": {"type": "string", "description": "Error message"},
+            "type": {"type": "string", "description": "Error type"},
+        },
+        "required": ["loc", "msg", "type"],
+    }
+
+    app.openapi_schema = openapi_schema
+    return app.openapi_schema
+
+
+app.openapi = custom_openapi  # type: ignore[assignment]
+
 # Include routers
+app.include_router(auth.router, prefix=settings.api_v1_prefix)
+app.include_router(incidents.router, prefix=settings.api_v1_prefix)
+app.include_router(org.router, prefix=settings.api_v1_prefix)
 app.include_router(health.router, prefix=settings.api_v1_prefix, tags=["health"])
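What a client sees from the validation handler above: a 422 whose body follows `ErrorResponse`, with one `ErrorDetail` per failed field and the current trace id as `request_id`. A hedged test sketch; it assumes `/v1/auth/login` rejects an empty body before touching the database:

```
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app, raise_server_exceptions=False)

resp = client.post("/v1/auth/login", json={})  # missing required fields
assert resp.status_code == 422

body = resp.json()
assert body["error"] == "validation_error"
assert body["message"] == "Request validation failed"
assert body["details"][0]["loc"][0] == "body"
```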
@@ -2,12 +2,13 @@
|
|||||||
|
|
||||||
from app.schemas.auth import (
|
from app.schemas.auth import (
|
||||||
LoginRequest,
|
LoginRequest,
|
||||||
|
LogoutRequest,
|
||||||
RefreshRequest,
|
RefreshRequest,
|
||||||
RegisterRequest,
|
RegisterRequest,
|
||||||
SwitchOrgRequest,
|
SwitchOrgRequest,
|
||||||
TokenResponse,
|
TokenResponse,
|
||||||
)
|
)
|
||||||
from app.schemas.common import CursorParams, PaginatedResponse
|
from app.schemas.common import CursorParams, ErrorDetail, ErrorResponse, PaginatedResponse
|
||||||
from app.schemas.incident import (
|
from app.schemas.incident import (
|
||||||
CommentRequest,
|
CommentRequest,
|
||||||
IncidentCreate,
|
IncidentCreate,
|
||||||
@@ -27,12 +28,15 @@ from app.schemas.org import (
|
|||||||
__all__ = [
|
__all__ = [
|
||||||
# Auth
|
# Auth
|
||||||
"LoginRequest",
|
"LoginRequest",
|
||||||
|
"LogoutRequest",
|
||||||
"RefreshRequest",
|
"RefreshRequest",
|
||||||
"RegisterRequest",
|
"RegisterRequest",
|
||||||
"SwitchOrgRequest",
|
"SwitchOrgRequest",
|
||||||
"TokenResponse",
|
"TokenResponse",
|
||||||
# Common
|
# Common
|
||||||
"CursorParams",
|
"CursorParams",
|
||||||
|
"ErrorDetail",
|
||||||
|
"ErrorResponse",
|
||||||
"PaginatedResponse",
|
"PaginatedResponse",
|
||||||
# Incident
|
# Incident
|
||||||
"CommentRequest",
|
"CommentRequest",
|
||||||
|
app/schemas/auth.py
@@ -33,6 +33,12 @@ class SwitchOrgRequest(BaseModel):
     refresh_token: str
 
 
+class LogoutRequest(BaseModel):
+    """Request body for logging out and revoking a refresh token."""
+
+    refresh_token: str
+
+
 class TokenResponse(BaseModel):
     """Response containing access and refresh tokens."""
 
app/schemas/common.py
@@ -3,6 +3,47 @@
 from pydantic import BaseModel, Field
 
 
+class ErrorDetail(BaseModel):
+    """Individual error detail for validation errors."""
+
+    loc: list[str | int] = Field(description="Location of the error (field path)")
+    msg: str = Field(description="Error message")
+    type: str = Field(description="Error type identifier")
+
+
+class ErrorResponse(BaseModel):
+    """Structured error response returned by all error handlers."""
+
+    error: str = Field(description="Error type (e.g., 'not_found', 'validation_error')")
+    message: str = Field(description="Human-readable error message")
+    details: list[ErrorDetail] | None = Field(
+        default=None, description="Additional error details for validation errors"
+    )
+    request_id: str | None = Field(
+        default=None, description="Request trace ID for debugging"
+    )
+
+    model_config = {
+        "json_schema_extra": {
+            "examples": [
+                {
+                    "error": "not_found",
+                    "message": "Incident not found",
+                    "request_id": "abc123def456",
+                },
+                {
+                    "error": "validation_error",
+                    "message": "Request validation failed",
+                    "details": [
+                        {"loc": ["body", "title"], "msg": "Field required", "type": "missing"}
+                    ],
+                    "request_id": "abc123def456",
+                },
+            ]
+        }
+    }
+
+
 class CursorParams(BaseModel):
     """Pagination parameters using cursor-based pagination."""
 
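A minimal sketch of how an error handler might serialize this model; the handler itself and the `request.state.request_id` lookup are assumptions, since this diff only defines the schema:

```python
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

from app.core import exceptions as exc
from app.schemas.common import ErrorResponse

app = FastAPI()


@app.exception_handler(exc.NotFoundError)
async def not_found_handler(request: Request, err: exc.NotFoundError) -> JSONResponse:
    # Shape the body with the shared ErrorResponse schema so clients
    # always see {error, message, details?, request_id?}.
    body = ErrorResponse(
        error="not_found",
        message=str(err),
        request_id=getattr(request.state, "request_id", None),  # assumed middleware
    )
    return JSONResponse(status_code=404, content=body.model_dump(exclude_none=True))
```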
app/services/__init__.py (new file, 7 lines)
@@ -0,0 +1,7 @@
"""Service layer entrypoints."""

from app.services.auth import AuthService
from app.services.incident import IncidentService
from app.services.org import OrgService

__all__ = ["AuthService", "OrgService", "IncidentService"]
app/services/auth.py (new file, 269 lines)
@@ -0,0 +1,269 @@
"""Authentication service providing business logic for auth flows."""

from __future__ import annotations

import re
from typing import cast
from uuid import UUID, uuid4

import asyncpg
from asyncpg.pool import PoolConnectionProxy

from app.api.deps import CurrentUser
from app.config import settings
from app.core import exceptions as exc, security
from app.db import Database, db
from app.repositories import OrgRepository, RefreshTokenRepository, UserRepository
from app.schemas.auth import (
    LoginRequest,
    LogoutRequest,
    RefreshRequest,
    RegisterRequest,
    SwitchOrgRequest,
    TokenResponse,
)


_SLUG_PATTERN = re.compile(r"[^a-z0-9]+")


def _as_conn(conn: asyncpg.Connection | PoolConnectionProxy) -> asyncpg.Connection:
    """Helper to satisfy typing when a pool proxy is returned."""
    return cast(asyncpg.Connection, conn)


class AuthService:
    """Encapsulates authentication workflows (register/login/refresh/logout)."""

    def __init__(self, database: Database | None = None) -> None:
        self.db = database or db
        self._access_token_expires_in = settings.access_token_expire_minutes * 60

    async def register_user(self, data: RegisterRequest) -> TokenResponse:
        """Create a new user, default org, membership, and token pair."""
        async with self.db.transaction() as conn:
            db_conn = _as_conn(conn)
            user_repo = UserRepository(db_conn)
            org_repo = OrgRepository(db_conn)
            refresh_repo = RefreshTokenRepository(db_conn)

            if await user_repo.exists_by_email(data.email):
                raise exc.ConflictError("Email already registered")

            user_id = uuid4()
            org_id = uuid4()
            member_id = uuid4()
            password_hash = security.hash_password(data.password)

            await user_repo.create(user_id, data.email, password_hash)
            slug = await self._generate_unique_org_slug(org_repo, data.org_name)
            await org_repo.create(org_id, data.org_name, slug)
            await org_repo.add_member(member_id, user_id, org_id, "admin")

            return await self._issue_token_pair(
                refresh_repo,
                user_id=user_id,
                org_id=org_id,
                role="admin",
            )

    async def login_user(self, data: LoginRequest) -> TokenResponse:
        """Authenticate a user and issue tokens for their first organization."""
        async with self.db.connection() as conn:
            db_conn = _as_conn(conn)
            user_repo = UserRepository(db_conn)
            org_repo = OrgRepository(db_conn)
            refresh_repo = RefreshTokenRepository(db_conn)

            user = await user_repo.get_by_email(data.email)
            if not user or not security.verify_password(data.password, user["password_hash"]):
                raise exc.UnauthorizedError("Invalid email or password")

            orgs = await org_repo.get_user_orgs(user["id"])
            if not orgs:
                raise exc.ForbiddenError("User does not belong to any organization")

            active_org = orgs[0]
            return await self._issue_token_pair(
                refresh_repo,
                user_id=user["id"],
                org_id=active_org["id"],
                role=active_org["role"],
            )

    async def refresh_tokens(self, data: RefreshRequest) -> TokenResponse:
        """Rotate refresh token and mint a new access token."""
        old_hash = security.hash_token(data.refresh_token)
        new_refresh_token = security.generate_refresh_token()
        new_refresh_hash = security.hash_token(new_refresh_token)
        new_refresh_id = uuid4()
        new_refresh_expiry = security.get_refresh_token_expiry()

        rotated: dict | None = None
        membership: dict | None = None

        async with self.db.transaction() as conn:
            db_conn = _as_conn(conn)
            refresh_repo = RefreshTokenRepository(db_conn)
            rotated = await refresh_repo.rotate(
                old_token_hash=old_hash,
                new_token_id=new_refresh_id,
                new_token_hash=new_refresh_hash,
                new_expires_at=new_refresh_expiry,
            )

            if rotated is not None:
                org_repo = OrgRepository(db_conn)
                membership = await org_repo.get_member(rotated["user_id"], rotated["active_org_id"])
                if membership is None:
                    raise exc.UnauthorizedError("Invalid refresh token")

        if rotated is None or membership is None:
            await self._handle_invalid_refresh(old_hash)

        assert rotated is not None and membership is not None
        access_token = security.create_access_token(
            sub=str(rotated["user_id"]),
            org_id=str(rotated["active_org_id"]),
            org_role=membership["role"],
        )

        return TokenResponse(
            access_token=access_token,
            refresh_token=new_refresh_token,
            expires_in=self._access_token_expires_in,
        )

    async def switch_org(
        self,
        current_user: CurrentUser,
        data: SwitchOrgRequest,
    ) -> TokenResponse:
        """Switch active organization (rotates refresh token + issues new JWT)."""
        target_org_id = data.org_id
        old_hash = security.hash_token(data.refresh_token)
        new_refresh_token = security.generate_refresh_token()
        new_refresh_hash = security.hash_token(new_refresh_token)
        new_refresh_expiry = security.get_refresh_token_expiry()

        rotated: dict | None = None
        membership: dict | None = None

        async with self.db.transaction() as conn:
            db_conn = _as_conn(conn)
            org_repo = OrgRepository(db_conn)
            membership = await org_repo.get_member(current_user.user_id, target_org_id)
            if membership is None:
                raise exc.ForbiddenError("Not a member of the requested organization")

            refresh_repo = RefreshTokenRepository(db_conn)
            rotated = await refresh_repo.rotate(
                old_token_hash=old_hash,
                new_token_id=uuid4(),
                new_token_hash=new_refresh_hash,
                new_expires_at=new_refresh_expiry,
                new_active_org_id=target_org_id,
                expected_user_id=current_user.user_id,
            )

        if rotated is None:
            await self._handle_invalid_refresh(old_hash)

        access_token = security.create_access_token(
            sub=str(current_user.user_id),
            org_id=str(target_org_id),
            org_role=membership["role"],
        )

        return TokenResponse(
            access_token=access_token,
            refresh_token=new_refresh_token,
            expires_in=self._access_token_expires_in,
        )

    async def logout(self, current_user: CurrentUser, data: LogoutRequest) -> None:
        """Revoke the provided refresh token for the current session."""
        token_hash = security.hash_token(data.refresh_token)

        async with self.db.transaction() as conn:
            refresh_repo = RefreshTokenRepository(_as_conn(conn))
            token = await refresh_repo.get_by_hash(token_hash)
            if token and token["user_id"] != current_user.user_id:
                raise exc.ForbiddenError("Refresh token does not belong to this user")

            if not token:
                return

            await refresh_repo.revoke(token["id"])

    async def _issue_token_pair(
        self,
        refresh_repo: RefreshTokenRepository,
        *,
        user_id: UUID,
        org_id: UUID,
        role: str,
    ) -> TokenResponse:
        """Create access/refresh tokens and persist the refresh token."""
        access_token = security.create_access_token(
            sub=str(user_id),
            org_id=str(org_id),
            org_role=role,
        )

        refresh_token = security.generate_refresh_token()
        await refresh_repo.create(
            token_id=uuid4(),
            user_id=user_id,
            token_hash=security.hash_token(refresh_token),
            active_org_id=org_id,
            expires_at=security.get_refresh_token_expiry(),
        )

        return TokenResponse(
            access_token=access_token,
            refresh_token=refresh_token,
            expires_in=self._access_token_expires_in,
        )

    async def _handle_invalid_refresh(self, token_hash: str) -> None:
        """Raise appropriate errors for invalid/compromised refresh tokens."""
        async with self.db.connection() as conn:
            refresh_repo = RefreshTokenRepository(_as_conn(conn))
            reused = await refresh_repo.check_token_reuse(token_hash)
            if reused:
                await refresh_repo.revoke_token_chain(reused["id"])
                raise exc.UnauthorizedError("Refresh token reuse detected")

        raise exc.UnauthorizedError("Invalid refresh token")

    async def _generate_unique_org_slug(
        self,
        org_repo: OrgRepository,
        org_name: str,
    ) -> str:
        """Slugify the org name and append a counter until unique."""
        base_slug = self._slugify(org_name)
        candidate = base_slug
        counter = 1
        while await org_repo.slug_exists(candidate):
            suffix = f"-{counter}"
            max_base_len = 50 - len(suffix)
            candidate = f"{base_slug[:max_base_len]}{suffix}"
            counter += 1
        return candidate

    def _slugify(self, value: str) -> str:
        """Convert arbitrary text into a URL-friendly slug."""
        slug = _SLUG_PATTERN.sub("-", value.strip().lower()).strip("-")
        return slug[:50] or "org"
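A sketch of how a router might delegate to AuthService; the route paths, status codes, and module-level service instance are illustrative assumptions, only the service itself is defined in this commit:

```python
from fastapi import APIRouter

from app.schemas.auth import LoginRequest, RegisterRequest, TokenResponse
from app.services import AuthService

router = APIRouter(prefix="/auth", tags=["auth"])  # assumed prefix
service = AuthService()  # falls back to the module-level Database


@router.post("/register", response_model=TokenResponse, status_code=201)
async def register(data: RegisterRequest) -> TokenResponse:
    # Creates user + default org + admin membership in one transaction.
    return await service.register_user(data)


@router.post("/login", response_model=TokenResponse)
async def login(data: LoginRequest) -> TokenResponse:
    # Tokens are scoped to the user's first organization.
    return await service.login_user(data)
```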
app/services/incident.py (new file, 247 lines)
@@ -0,0 +1,247 @@
"""Incident service implementing incident lifecycle operations."""

from __future__ import annotations

from datetime import datetime
from typing import cast
from uuid import UUID, uuid4

import asyncpg
from asyncpg.pool import PoolConnectionProxy

from app.api.deps import CurrentUser, ensure_org_access
from app.config import settings
from app.core import exceptions as exc
from app.db import Database, db
from app.repositories import IncidentRepository, ServiceRepository
from app.schemas.common import PaginatedResponse
from app.schemas.incident import (
    CommentRequest,
    IncidentCreate,
    IncidentEventResponse,
    IncidentResponse,
    TransitionRequest,
)
from app.taskqueue import TaskQueue
from app.taskqueue import task_queue as default_task_queue

_ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "triggered": {"acknowledged"},
    "acknowledged": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": set(),
}


def _as_conn(conn: asyncpg.Connection | PoolConnectionProxy) -> asyncpg.Connection:
    """Helper to satisfy typing when a pool proxy is returned."""
    return cast(asyncpg.Connection, conn)


class IncidentService:
    """Encapsulates incident lifecycle operations within an org context."""

    def __init__(
        self,
        database: Database | None = None,
        task_queue: TaskQueue | None = None,
        escalation_delay_seconds: int | None = None,
    ) -> None:
        self.db = database or db
        self.task_queue = task_queue or default_task_queue
        self.escalation_delay_seconds = (
            escalation_delay_seconds
            if escalation_delay_seconds is not None
            else settings.notification_escalation_delay_seconds
        )

    async def create_incident(
        self,
        current_user: CurrentUser,
        service_id: UUID,
        data: IncidentCreate,
    ) -> IncidentResponse:
        """Create an incident for a service in the active org and record the creation event."""
        async with self.db.transaction() as conn:
            db_conn = _as_conn(conn)
            service_repo = ServiceRepository(db_conn)
            incident_repo = IncidentRepository(db_conn)

            service = await service_repo.get_by_id(service_id)
            if service is None:
                raise exc.NotFoundError("Service not found")
            ensure_org_access(service["org_id"], current_user)

            incident_id = uuid4()
            incident = await incident_repo.create(
                incident_id=incident_id,
                org_id=current_user.org_id,
                service_id=service_id,
                title=data.title,
                description=data.description,
                severity=data.severity,
            )

            await incident_repo.add_event(
                uuid4(),
                incident_id,
                "created",
                actor_user_id=current_user.user_id,
                payload={
                    "title": data.title,
                    "severity": data.severity,
                    "description": data.description,
                },
            )

        incident_response = IncidentResponse(**incident)

        self.task_queue.incident_triggered(
            incident_id=incident_response.id,
            org_id=current_user.org_id,
            triggered_by=current_user.user_id,
        )

        if self.escalation_delay_seconds > 0:
            self.task_queue.schedule_escalation_check(
                incident_id=incident_response.id,
                org_id=current_user.org_id,
                delay_seconds=self.escalation_delay_seconds,
            )

        return incident_response

    async def get_incidents(
        self,
        current_user: CurrentUser,
        *,
        status: str | None = None,
        cursor: datetime | None = None,
        limit: int = 20,
    ) -> PaginatedResponse[IncidentResponse]:
        """Return paginated incidents for the active organization."""
        async with self.db.connection() as conn:
            incident_repo = IncidentRepository(_as_conn(conn))
            rows = await incident_repo.get_by_org(
                org_id=current_user.org_id,
                status=status,
                cursor=cursor,
                limit=limit,
            )

        has_more = len(rows) > limit
        items = rows[:limit]
        next_cursor = items[-1]["created_at"].isoformat() if has_more and items else None

        incidents = [IncidentResponse(**row) for row in items]
        return PaginatedResponse[IncidentResponse](
            items=incidents,
            next_cursor=next_cursor,
            has_more=has_more,
        )

    async def get_incident(self, current_user: CurrentUser, incident_id: UUID) -> IncidentResponse:
        """Return a single incident, ensuring it belongs to the active org."""
        async with self.db.connection() as conn:
            incident_repo = IncidentRepository(_as_conn(conn))
            incident = await incident_repo.get_by_id(incident_id)
            if incident is None:
                raise exc.NotFoundError("Incident not found")
            ensure_org_access(incident["org_id"], current_user)
            return IncidentResponse(**incident)

    async def get_incident_events(
        self, current_user: CurrentUser, incident_id: UUID
    ) -> list[IncidentEventResponse]:
        """Return the timeline events for an incident in the active org."""
        async with self.db.connection() as conn:
            incident_repo = IncidentRepository(_as_conn(conn))
            incident = await incident_repo.get_by_id(incident_id)
            if incident is None:
                raise exc.NotFoundError("Incident not found")
            ensure_org_access(incident["org_id"], current_user)

            events = await incident_repo.get_events(incident_id)
            return [IncidentEventResponse(**event) for event in events]

    async def transition_incident(
        self,
        current_user: CurrentUser,
        incident_id: UUID,
        data: TransitionRequest,
    ) -> IncidentResponse:
        """Transition an incident status with optimistic locking and event recording."""
        async with self.db.transaction() as conn:
            db_conn = _as_conn(conn)
            incident_repo = IncidentRepository(db_conn)

            incident = await incident_repo.get_by_id(incident_id)
            if incident is None:
                raise exc.NotFoundError("Incident not found")
            ensure_org_access(incident["org_id"], current_user)
            self._validate_transition(incident["status"], data.to_status)

            updated = await incident_repo.update_status(
                incident_id,
                data.to_status,
                data.version,
            )
            if updated is None:
                raise exc.ConflictError("Incident version mismatch")

            payload = {"from": incident["status"], "to": data.to_status}
            if data.note:
                payload["note"] = data.note

            await incident_repo.add_event(
                uuid4(),
                incident_id,
                "status_changed",
                actor_user_id=current_user.user_id,
                payload=payload,
            )

            return IncidentResponse(**updated)

    async def add_comment(
        self,
        current_user: CurrentUser,
        incident_id: UUID,
        data: CommentRequest,
    ) -> IncidentEventResponse:
        """Add a comment event to the incident timeline."""
        async with self.db.connection() as conn:
            incident_repo = IncidentRepository(_as_conn(conn))
            incident = await incident_repo.get_by_id(incident_id)
            if incident is None:
                raise exc.NotFoundError("Incident not found")
            ensure_org_access(incident["org_id"], current_user)

            event = await incident_repo.add_event(
                uuid4(),
                incident_id,
                "comment_added",
                actor_user_id=current_user.user_id,
                payload={"content": data.content},
            )
            return IncidentEventResponse(**event)

    def _validate_transition(self, current_status: str, to_status: str) -> None:
        """Validate a requested status transition against the allowed state machine."""
        if current_status == to_status:
            raise exc.BadRequestError("Incident is already in the requested status")

        allowed = _ALLOWED_TRANSITIONS.get(current_status, set())
        if to_status not in allowed:
            raise exc.BadRequestError("Invalid incident status transition")


__all__ = ["IncidentService"]
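The `_ALLOWED_TRANSITIONS` table enforces a strictly linear lifecycle (triggered, then acknowledged, then mitigated, then resolved) with no skips or reversals. A small standalone sketch mirroring that table, e.g. for checking transitions client-side before calling the API:

```python
ALLOWED = {
    "triggered": {"acknowledged"},
    "acknowledged": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": set(),
}


def can_transition(current: str, target: str) -> bool:
    # Mirrors IncidentService._validate_transition: same-status moves and
    # anything outside the table are rejected.
    return current != target and target in ALLOWED.get(current, set())


assert can_transition("triggered", "acknowledged")
assert not can_transition("triggered", "resolved")  # no skipping states
assert not can_transition("resolved", "triggered")  # terminal state
```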
app/services/org.py (new file, 115 lines)
@@ -0,0 +1,115 @@
"""Organization service providing org-scoped operations."""

from __future__ import annotations

from typing import cast
from uuid import UUID, uuid4

import asyncpg
from asyncpg.pool import PoolConnectionProxy

from app.api.deps import CurrentUser
from app.core import exceptions as exc
from app.db import Database, db
from app.repositories import NotificationRepository, OrgRepository, ServiceRepository
from app.schemas.org import (
    MemberResponse,
    NotificationTargetCreate,
    NotificationTargetResponse,
    OrgResponse,
    ServiceCreate,
    ServiceResponse,
)


def _as_conn(conn: asyncpg.Connection | PoolConnectionProxy) -> asyncpg.Connection:
    """Helper to satisfy typing when a pool proxy is returned."""
    return cast(asyncpg.Connection, conn)


class OrgService:
    """Encapsulates organization-level operations within the active org context."""

    def __init__(self, database: Database | None = None) -> None:
        self.db = database or db

    async def get_current_org(self, current_user: CurrentUser) -> OrgResponse:
        """Return the active organization summary for the current user."""
        async with self.db.connection() as conn:
            org_repo = OrgRepository(_as_conn(conn))
            org = await org_repo.get_by_id(current_user.org_id)
            if org is None:
                raise exc.NotFoundError("Organization not found")
            return OrgResponse(**org)

    async def get_members(self, current_user: CurrentUser) -> list[MemberResponse]:
        """List members of the active organization."""
        async with self.db.connection() as conn:
            org_repo = OrgRepository(_as_conn(conn))
            members = await org_repo.get_members(current_user.org_id)
            return [MemberResponse(**member) for member in members]

    async def create_service(self, current_user: CurrentUser, data: ServiceCreate) -> ServiceResponse:
        """Create a new service within the active organization."""
        async with self.db.connection() as conn:
            service_repo = ServiceRepository(_as_conn(conn))

            if await service_repo.slug_exists(current_user.org_id, data.slug):
                raise exc.ConflictError("Service slug already exists in this organization")

            try:
                service = await service_repo.create(
                    service_id=uuid4(),
                    org_id=current_user.org_id,
                    name=data.name,
                    slug=data.slug,
                )
            except asyncpg.UniqueViolationError as err:  # pragma: no cover - race protection
                raise exc.ConflictError("Service slug already exists in this organization") from err

            return ServiceResponse(**service)

    async def get_services(self, current_user: CurrentUser) -> list[ServiceResponse]:
        """List services for the active organization."""
        async with self.db.connection() as conn:
            service_repo = ServiceRepository(_as_conn(conn))
            services = await service_repo.get_by_org(current_user.org_id)
            return [ServiceResponse(**svc) for svc in services]

    async def create_notification_target(
        self,
        current_user: CurrentUser,
        data: NotificationTargetCreate,
    ) -> NotificationTargetResponse:
        """Create a notification target for the active organization."""
        if data.target_type == "webhook" and data.webhook_url is None:
            raise exc.BadRequestError("webhook_url is required for webhook targets")

        async with self.db.connection() as conn:
            notification_repo = NotificationRepository(_as_conn(conn))
            target = await notification_repo.create_target(
                target_id=uuid4(),
                org_id=current_user.org_id,
                name=data.name,
                target_type=data.target_type,
                webhook_url=str(data.webhook_url) if data.webhook_url else None,
                enabled=data.enabled,
            )
            return NotificationTargetResponse(**target)

    async def get_notification_targets(self, current_user: CurrentUser) -> list[NotificationTargetResponse]:
        """List notification targets for the active organization."""
        async with self.db.connection() as conn:
            notification_repo = NotificationRepository(_as_conn(conn))
            targets = await notification_repo.get_targets_by_org(current_user.org_id)
            return [NotificationTargetResponse(**target) for target in targets]


__all__ = ["OrgService"]
app/taskqueue.py (new file, 188 lines)
@@ -0,0 +1,188 @@
"""Task queue abstractions for scheduling background work."""

from __future__ import annotations

import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any
from uuid import UUID

from app.config import settings

try:
    from worker.celery_app import celery_app
except Exception:  # pragma: no cover - celery app may not import during docs builds
    celery_app = None  # type: ignore[assignment]


class TaskQueue(ABC):
    """Interface for enqueueing background work."""

    async def startup(self) -> None:  # pragma: no cover - default no-op
        """Hook for queue initialization."""

    async def shutdown(self) -> None:  # pragma: no cover - default no-op
        """Hook for queue teardown."""

    async def ping(self) -> bool:
        """Check if the queue backend is reachable."""
        return True

    def reset(self) -> None:  # pragma: no cover - optional for in-memory impls
        """Reset any in-memory state (used in tests)."""

    @abstractmethod
    def incident_triggered(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        triggered_by: UUID | None,
    ) -> None:
        """Fan out an incident triggered notification."""

    @abstractmethod
    def schedule_escalation_check(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        delay_seconds: int,
    ) -> None:
        """Schedule a delayed escalation check."""


class CeleryTaskQueue(TaskQueue):
    """Celery-backed task queue that can use Redis or SQS brokers."""

    def __init__(self, default_queue: str, critical_queue: str) -> None:
        if celery_app is None:  # pragma: no cover - guarded by try/except
            raise RuntimeError("Celery application is unavailable")
        self._celery = celery_app
        self._default_queue = default_queue
        self._critical_queue = critical_queue

    def incident_triggered(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        triggered_by: UUID | None,
    ) -> None:
        self._celery.send_task(
            "worker.tasks.notifications.incident_triggered",
            kwargs={
                "incident_id": str(incident_id),
                "org_id": str(org_id),
                "triggered_by": str(triggered_by) if triggered_by else None,
            },
            queue=self._default_queue,
        )

    def schedule_escalation_check(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        delay_seconds: int,
    ) -> None:
        self._celery.send_task(
            "worker.tasks.notifications.escalate_if_unacked",
            kwargs={
                "incident_id": str(incident_id),
                "org_id": str(org_id),
            },
            countdown=max(delay_seconds, 0),
            queue=self._critical_queue,
        )

    async def ping(self) -> bool:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._ping_sync)

    def _ping_sync(self) -> bool:
        connection = self._celery.connection()
        try:
            connection.connect()
            return True
        except Exception:
            return False
        finally:
            try:
                connection.release()
            except Exception:  # pragma: no cover - release best effort
                pass


@dataclass
class InMemoryTaskQueue(TaskQueue):
    """Test-friendly queue that records dispatched tasks in memory."""

    dispatched: list[tuple[str, dict[str, Any]]] | None = None

    def __post_init__(self) -> None:
        if self.dispatched is None:
            self.dispatched = []

    def incident_triggered(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        triggered_by: UUID | None,
    ) -> None:
        self.dispatched.append(
            (
                "incident_triggered",
                {
                    "incident_id": incident_id,
                    "org_id": org_id,
                    "triggered_by": triggered_by,
                },
            )
        )

    def schedule_escalation_check(
        self,
        *,
        incident_id: UUID,
        org_id: UUID,
        delay_seconds: int,
    ) -> None:
        self.dispatched.append(
            (
                "escalate_if_unacked",
                {
                    "incident_id": incident_id,
                    "org_id": org_id,
                    "delay_seconds": delay_seconds,
                },
            )
        )

    def reset(self) -> None:
        if self.dispatched is not None:
            self.dispatched.clear()


def _build_task_queue() -> TaskQueue:
    if settings.task_queue_driver == "inmemory":
        return InMemoryTaskQueue()

    return CeleryTaskQueue(
        default_queue=settings.task_queue_default_queue,
        critical_queue=settings.task_queue_critical_queue,
    )


task_queue = _build_task_queue()


__all__ = [
    "CeleryTaskQueue",
    "InMemoryTaskQueue",
    "TaskQueue",
    "task_queue",
]
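A sketch of how the in-memory driver can back assertions in tests; the imports assume the package layout above, and the pytest-style function name is illustrative:

```python
from uuid import uuid4

from app.taskqueue import InMemoryTaskQueue


def test_escalation_check_is_recorded() -> None:
    queue = InMemoryTaskQueue()
    incident_id, org_id = uuid4(), uuid4()

    queue.schedule_escalation_check(
        incident_id=incident_id, org_id=org_id, delay_seconds=900
    )

    # Tasks are recorded as (task_name, kwargs) tuples instead of being sent.
    assert queue.dispatched == [
        (
            "escalate_if_unacked",
            {"incident_id": incident_id, "org_id": org_id, "delay_seconds": 900},
        ),
    ]

    queue.reset()
    assert queue.dispatched == []
```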
docker-compose.yml
@@ -41,6 +41,7 @@ services:
     container_name: incidentops-api
     ports:
       - "8000:8000"
+      - "9464:9464"  # Prometheus metrics
     environment:
       DATABASE_URL: postgresql://incidentops:incidentops@postgres:5432/incidentops
       REDIS_URL: redis://redis:6379/0
@@ -48,11 +49,24 @@ services:
       JWT_ALGORITHM: HS256
       ACCESS_TOKEN_EXPIRE_MINUTES: 30
       REFRESH_TOKEN_EXPIRE_DAYS: 30
+      # OpenTelemetry
+      OTEL_ENABLED: "true"
+      OTEL_SERVICE_NAME: incidentops-api
+      OTEL_ENVIRONMENT: development
+      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
+      OTEL_EXPORTER_OTLP_INSECURE: "true"
+      OTEL_LOG_LEVEL: INFO
+      # Metrics
+      PROMETHEUS_PORT: "9464"
     depends_on:
       postgres:
         condition: service_healthy
       redis:
         condition: service_healthy
+      otel-collector:
+        condition: service_started
+      prometheus:
+        condition: service_started
     healthcheck:
       test: ["CMD", "curl", "-f", "http://localhost:8000/v1/healthz"]
       interval: 30s
@@ -72,6 +86,12 @@ services:
       REDIS_URL: redis://redis:6379/0
       CELERY_BROKER_URL: redis://redis:6379/0
       CELERY_RESULT_BACKEND: redis://redis:6379/1
+      # OpenTelemetry
+      OTEL_ENABLED: "true"
+      OTEL_SERVICE_NAME: incidentops-worker
+      OTEL_ENVIRONMENT: development
+      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
+      OTEL_EXPORTER_OTLP_INSECURE: "true"
     depends_on:
       postgres:
         condition: service_healthy
@@ -121,9 +141,89 @@ services:
     profiles:
       - monitoring
 
+  # ============================================
+  # Observability Stack
+  # ============================================
+
+  # OpenTelemetry Collector - receives traces/logs from apps
+  otel-collector:
+    image: otel/opentelemetry-collector-contrib:0.96.0
+    container_name: incidentops-otel-collector
+    command: ["--config=/etc/otel-collector/config.yaml"]
+    volumes:
+      - ./observability/otel-collector/config.yaml:/etc/otel-collector/config.yaml:ro
+    ports:
+      - "4317:4317"  # OTLP gRPC
+      - "4318:4318"  # OTLP HTTP
+    depends_on:
+      - tempo
+      - loki
+
+  # Tempo - distributed tracing backend
+  tempo:
+    image: grafana/tempo:2.4.1
+    container_name: incidentops-tempo
+    command: ["-config.file=/etc/tempo/config.yaml"]
+    volumes:
+      - ./observability/tempo/config.yaml:/etc/tempo/config.yaml:ro
+      - tempo_data:/var/tempo
+    ports:
+      - "3200:3200"  # Tempo HTTP
+      - "4320:4317"  # Tempo OTLP gRPC (different host port to avoid conflict)
+
+  # Loki - log aggregation
+  loki:
+    image: grafana/loki:2.9.6
+    container_name: incidentops-loki
+    command: ["-config.file=/etc/loki/config.yaml"]
+    volumes:
+      - ./observability/loki/config.yaml:/etc/loki/config.yaml:ro
+      - loki_data:/loki
+    ports:
+      - "3100:3100"  # Loki HTTP
+
+  # Prometheus - metrics storage
+  prometheus:
+    image: prom/prometheus:v2.51.0
+    container_name: incidentops-prometheus
+    command:
+      - "--config.file=/etc/prometheus/prometheus.yml"
+      - "--storage.tsdb.path=/prometheus"
+      - "--web.enable-lifecycle"
+    volumes:
+      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - prometheus_data:/prometheus
+    ports:
+      - "9090:9090"  # Prometheus UI
+
+  # Grafana - visualization
+  grafana:
+    image: grafana/grafana:10.4.1
+    container_name: incidentops-grafana
+    environment:
+      GF_SECURITY_ADMIN_USER: admin
+      GF_SECURITY_ADMIN_PASSWORD: admin
+      GF_USERS_ALLOW_SIGN_UP: "false"
+      GF_EXPLORE_ENABLED: "true"
+      GF_FEATURE_TOGGLES_ENABLE: traceqlEditor tempoSearch tempoBackendSearch tempoApmTable
+    volumes:
+      - ./observability/grafana/provisioning:/etc/grafana/provisioning:ro
+      - ./observability/grafana/dashboards:/var/lib/grafana/dashboards:ro
+      - grafana_data:/var/lib/grafana
+    ports:
+      - "3001:3000"  # Grafana UI (3001 to avoid conflict with web frontend)
+    depends_on:
+      - tempo
+      - loki
+      - prometheus
+
 volumes:
   postgres_data:
   redis_data:
+  tempo_data:
+  loki_data:
+  prometheus_data:
+  grafana_data:
 
 networks:
   default:
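The OTEL_* variables point both services at the collector over OTLP gRPC. A minimal sketch of app-side tracer setup consuming them, using standard opentelemetry-sdk calls; the exact wiring inside the app code is not shown in this diff:

```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracing() -> None:
    # Reads the same env vars docker-compose injects above.
    provider = TracerProvider(
        resource=Resource.create(
            {"service.name": os.environ.get("OTEL_SERVICE_NAME", "incidentops-api")}
        )
    )
    exporter = OTLPSpanExporter(
        endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317"),
        insecure=os.environ.get("OTEL_EXPORTER_OTLP_INSECURE", "false") == "true",
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```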
@@ -29,6 +29,29 @@ spec:
       serviceAccountName: {{ include "incidentops.serviceAccountName" . }}
       securityContext:
         {{- toYaml .Values.podSecurityContext | nindent 8 }}
+      initContainers:
+        - name: wait-for-postgres
+          image: busybox:1.36
+          command:
+            - sh
+            - -c
+            - |
+              until nc -z {{ include "incidentops.fullname" . }}-postgresql 5432; do
+                echo "Waiting for PostgreSQL..."
+                sleep 2
+              done
+              echo "PostgreSQL is ready"
+        - name: wait-for-redis
+          image: busybox:1.36
+          command:
+            - sh
+            - -c
+            - |
+              until nc -z {{ include "incidentops.fullname" . }}-redis 6379; do
+                echo "Waiting for Redis..."
+                sleep 2
+              done
+              echo "Redis is ready"
       containers:
         - name: api
           securityContext:
@@ -39,6 +62,11 @@ spec:
             - name: http
               containerPort: 8000
               protocol: TCP
+            {{- if .Values.metrics.enabled }}
+            - name: metrics
+              containerPort: {{ .Values.metrics.port }}
+              protocol: TCP
+            {{- end }}
           envFrom:
             - configMapRef:
                 name: {{ include "incidentops.fullname" . }}-config
@@ -11,5 +11,11 @@ spec:
       targetPort: http
       protocol: TCP
       name: http
+    {{- if .Values.metrics.enabled }}
+    - port: {{ .Values.metrics.port }}
+      targetPort: metrics
+      protocol: TCP
+      name: metrics
+    {{- end }}
   selector:
     {{- include "incidentops.api.selectorLabels" . | nindent 4 }}
@@ -8,3 +8,16 @@ data:
   JWT_ALGORITHM: {{ .Values.config.jwtAlgorithm | quote }}
   ACCESS_TOKEN_EXPIRE_MINUTES: {{ .Values.config.accessTokenExpireMinutes | quote }}
   REFRESH_TOKEN_EXPIRE_DAYS: {{ .Values.config.refreshTokenExpireDays | quote }}
+  # OpenTelemetry configuration
+  OTEL_ENABLED: {{ .Values.observability.enabled | quote }}
+  OTEL_SERVICE_NAME: "incidentops-api"
+  OTEL_ENVIRONMENT: {{ .Values.config.environment | default "production" | quote }}
+  {{- if .Values.observability.enabled }}
+  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ include "incidentops.fullname" . }}-otel-collector:4317"
+  {{- end }}
+  OTEL_EXPORTER_OTLP_INSECURE: "true"
+  OTEL_LOG_LEVEL: {{ .Values.config.logLevel | default "INFO" | quote }}
+  # Metrics configuration
+  {{- if .Values.metrics.enabled }}
+  PROMETHEUS_PORT: {{ .Values.metrics.port | quote }}
+  {{- end }}
helm/incidentops/templates/grafana-deployment.yaml (new file, 387 lines)
@@ -0,0 +1,387 @@
{{- if .Values.observability.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana-datasources
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        url: http://{{ include "incidentops.fullname" . }}-prometheus:9090
        access: proxy
        isDefault: false
        jsonData:
          httpMethod: POST
          exemplarTraceIdDestinations:
            - name: trace_id
              datasourceUid: tempo

      - name: Tempo
        type: tempo
        uid: tempo
        url: http://{{ include "incidentops.fullname" . }}-tempo:3200
        access: proxy
        isDefault: false
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            spanStartTimeShift: '-1h'
            spanEndTimeShift: '1h'
            filterByTraceID: true
            filterBySpanID: true
          tracesToMetrics:
            datasourceUid: prometheus
            spanStartTimeShift: '-1h'
            spanEndTimeShift: '1h'
          serviceMap:
            datasourceUid: prometheus
          nodeGraph:
            enabled: true
          lokiSearch:
            datasourceUid: loki

      - name: Loki
        type: loki
        uid: loki
        url: http://{{ include "incidentops.fullname" . }}-loki:3100
        access: proxy
        isDefault: true
        jsonData:
          derivedFields:
            - datasourceUid: tempo
              matcherRegex: '"trace_id":"([a-f0-9]+)"'
              name: TraceID
              url: '$${__value.raw}'
              urlDisplayLabel: 'View Trace'
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana-dashboards-provider
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
      - name: 'default'
        orgId: 1
        folder: 'IncidentOps'
        folderUid: 'incidentops'
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana-dashboards
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
data:
  api-overview.json: |
    {
      "title": "IncidentOps API Overview",
      "uid": "incidentops-api",
      "tags": ["incidentops", "api"],
      "timezone": "browser",
      "editable": true,
      "panels": [
        {
          "id": 1,
          "title": "Request Rate",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
          "targets": [
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
              "legendFormat": "Requests/sec",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "reqps"
            }
          }
        },
        {
          "id": 2,
          "title": "Request Duration (p50, p95, p99)",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
          "targets": [
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
              "legendFormat": "p50",
              "refId": "A"
            },
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
              "legendFormat": "p95",
              "refId": "B"
            },
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
              "legendFormat": "p99",
              "refId": "C"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "s"
            }
          }
        },
        {
          "id": 3,
          "title": "Error Rate",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
          "targets": [
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\", http_status_code=~\"5..\"}[1m])) / sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m])) * 100",
              "legendFormat": "Error %",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "percent",
              "min": 0,
              "max": 100
            }
          }
        },
        {
          "id": 4,
          "title": "Requests by Status Code",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
          "targets": [
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "sum by (http_status_code) (rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
              "legendFormat": "{{ "{{" }}http_status_code{{ "}}" }}",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "reqps"
            }
          }
        },
        {
          "id": 5,
          "title": "Requests by Endpoint",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
          "targets": [
            {
              "datasource": {"type": "prometheus", "uid": "prometheus"},
              "expr": "sum by (http_route) (rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
              "legendFormat": "{{ "{{" }}http_route{{ "}}" }}",
              "refId": "A"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "reqps"
            }
          }
        },
        {
          "id": 6,
          "title": "Recent Logs",
          "type": "logs",
          "gridPos": {"h": 10, "w": 24, "x": 0, "y": 16},
          "targets": [
            {
              "datasource": {"type": "loki", "uid": "loki"},
              "expr": "{service_name=\"incidentops-api\"} | json",
              "refId": "A"
            }
          ],
          "options": {
            "showTime": true,
            "showLabels": true,
            "wrapLogMessage": true,
            "enableLogDetails": true,
            "sortOrder": "Descending"
          }
        },
        {
          "id": 7,
          "title": "Recent Traces",
          "type": "traces",
          "gridPos": {"h": 10, "w": 24, "x": 0, "y": 26},
          "targets": [
            {
              "datasource": {"type": "tempo", "uid": "tempo"},
              "queryType": "traceqlSearch",
              "filters": [
                {
                  "id": "service-name",
                  "operator": "=",
                  "scope": "resource",
                  "tag": "service.name",
                  "value": ["incidentops-api"]
                }
              ],
              "refId": "A"
            }
          ]
        }
      ],
      "schemaVersion": 38,
      "version": 2
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: grafana
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: grafana
      annotations:
        checksum/datasources: {{ .Values.observability.grafana.image.tag | sha256sum }}
    spec:
      securityContext:
        fsGroup: 472
        runAsUser: 472
      containers:
        - name: grafana
          image: "{{ .Values.observability.grafana.image.repository }}:{{ .Values.observability.grafana.image.tag }}"
          imagePullPolicy: {{ .Values.observability.grafana.image.pullPolicy }}
          ports:
            - name: http
              containerPort: 3000
              protocol: TCP
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: {{ .Values.observability.grafana.adminUser | quote }}
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: {{ include "incidentops.fullname" . }}-grafana
                  key: admin-password
            - name: GF_USERS_ALLOW_SIGN_UP
              value: "false"
            - name: GF_EXPLORE_ENABLED
              value: "true"
            - name: GF_FEATURE_TOGGLES_ENABLE
              value: "traceqlEditor tempoSearch tempoBackendSearch tempoApmTable"
          volumeMounts:
            - name: datasources
              mountPath: /etc/grafana/provisioning/datasources
            - name: dashboards-provider
              mountPath: /etc/grafana/provisioning/dashboards
            - name: dashboards
              mountPath: /var/lib/grafana/dashboards
            - name: data
              mountPath: /var/lib/grafana
          resources:
            {{- toYaml .Values.observability.grafana.resources | nindent 12 }}
          readinessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /api/health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: datasources
          configMap:
            name: {{ include "incidentops.fullname" . }}-grafana-datasources
        - name: dashboards-provider
          configMap:
            name: {{ include "incidentops.fullname" . }}-grafana-dashboards-provider
        - name: dashboards
          configMap:
            name: {{ include "incidentops.fullname" . }}-grafana-dashboards
        - name: data
          {{- if .Values.observability.grafana.persistence.enabled }}
          persistentVolumeClaim:
            claimName: {{ include "incidentops.fullname" . }}-grafana
          {{- else }}
          emptyDir: {}
          {{- end }}
---
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
type: Opaque
data:
  admin-password: {{ .Values.observability.grafana.adminPassword | b64enc | quote }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
spec:
  type: {{ .Values.observability.grafana.service.type }}
  ports:
    - name: http
      port: 80
      targetPort: http
      protocol: TCP
  selector:
    {{- include "incidentops.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
{{- if .Values.observability.grafana.persistence.enabled }}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.observability.grafana.persistence.size }}
{{- end }}
{{- end }}
38  helm/incidentops/templates/grafana-ingress.yaml  (new file)
@@ -0,0 +1,38 @@
{{- if and .Values.observability.enabled .Values.observability.grafana.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "incidentops.fullname" . }}-grafana
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: grafana
  {{- with .Values.observability.grafana.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.className }}
  ingressClassName: {{ .Values.ingress.className }}
  {{- end }}
  {{- if .Values.observability.grafana.ingress.tls }}
  tls:
    {{- range .Values.observability.grafana.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    - host: {{ .Values.observability.grafana.ingress.host | quote }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ include "incidentops.fullname" . }}-grafana
                port:
                  number: 80
{{- end }}
155  helm/incidentops/templates/loki-deployment.yaml  (new file)
@@ -0,0 +1,155 @@
{{- if .Values.observability.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-loki-config
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: loki
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100
      grpc_listen_port: 9096

    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory

    query_range:
      results_cache:
        cache:
          embedded_cache:
            enabled: true
            max_size_mb: 100

    schema_config:
      configs:
        - from: "2020-10-24"
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h

    ruler:
      alertmanager_url: http://localhost:9093

    limits_config:
      retention_period: {{ .Values.observability.loki.retention }}
      allow_structured_metadata: true
      volume_enabled: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "incidentops.fullname" . }}-loki
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: loki
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: loki
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: loki
      annotations:
        checksum/config: {{ .Values.observability.loki.image.tag | sha256sum }}
    spec:
      containers:
        - name: loki
          image: "{{ .Values.observability.loki.image.repository }}:{{ .Values.observability.loki.image.tag }}"
          imagePullPolicy: {{ .Values.observability.loki.image.pullPolicy }}
          args:
            - -config.file=/etc/loki/loki.yaml
          ports:
            - name: http
              containerPort: 3100
              protocol: TCP
            - name: grpc
              containerPort: 9096
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: data
              mountPath: /loki
          resources:
            {{- toYaml .Values.observability.loki.resources | nindent 12 }}
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: config
          configMap:
            name: {{ include "incidentops.fullname" . }}-loki-config
        - name: data
          {{- if .Values.observability.loki.persistence.enabled }}
          persistentVolumeClaim:
            claimName: {{ include "incidentops.fullname" . }}-loki
          {{- else }}
          emptyDir: {}
          {{- end }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "incidentops.fullname" . }}-loki
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: loki
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 3100
      targetPort: http
      protocol: TCP
    - name: grpc
      port: 9096
      targetPort: grpc
      protocol: TCP
  selector:
    {{- include "incidentops.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: loki
{{- if .Values.observability.loki.persistence.enabled }}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "incidentops.fullname" . }}-loki
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: loki
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.observability.loki.persistence.size }}
{{- end }}
{{- end }}
@@ -30,9 +30,11 @@ spec:
         - name: migrate
           securityContext:
             {{- toYaml .Values.securityContext | nindent 12 }}
-          image: {{ include "incidentops.api.image" . }}
+          image: "{{ .Values.migration.image.repository }}:{{ .Values.migration.image.tag }}"
           imagePullPolicy: {{ .Values.migration.image.pullPolicy }}
           command:
+            - uv
+            - run
             - python
             - migrations/migrate.py
             - apply
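The migration job now matches the uv-managed runtime from the Dockerfile: it invokes `uv run python migrations/migrate.py apply`. The migration script itself is not part of this diff; purely for orientation, here is a minimal sketch of the CLI shape such an entry point could have (names, arguments, and structure are assumptions, not the repository's actual code):

```python
# Hypothetical sketch of a migrations/migrate.py entry point (not from this repo).
import asyncio
import sys

import asyncpg  # asyncpg is already a project dependency (used by the test helpers)


async def apply_migrations(database_url: str) -> None:
    conn = await asyncpg.connect(database_url)
    try:
        # A real runner would record applied versions in a table and execute
        # pending SQL files in order; that logic is elided here.
        print("applying pending migrations...")
    finally:
        await conn.close()


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "apply":
        # DATABASE_URL would normally come from the environment, per the README.
        asyncio.run(apply_migrations("postgresql://localhost/incidentops"))
```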
132  helm/incidentops/templates/otel-collector-deployment.yaml  (new file)
@@ -0,0 +1,132 @@
{{- if .Values.observability.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-otel-collector-config
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: otel-collector
data:
  otel-collector-config.yaml: |
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128

    exporters:
      otlp/tempo:
        endpoint: {{ include "incidentops.fullname" . }}-tempo:4317
        tls:
          insecure: true
      loki:
        endpoint: http://{{ include "incidentops.fullname" . }}-loki:3100/loki/api/v1/push
        default_labels_enabled:
          exporter: true
          job: true

    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "incidentops.fullname" . }}-otel-collector
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: otel-collector
spec:
  replicas: {{ .Values.observability.otelCollector.replicaCount }}
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: otel-collector
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: otel-collector
      annotations:
        checksum/config: {{ .Values.observability.otelCollector.image.tag | sha256sum }}
    spec:
      containers:
        - name: otel-collector
          image: "{{ .Values.observability.otelCollector.image.repository }}:{{ .Values.observability.otelCollector.image.tag }}"
          imagePullPolicy: {{ .Values.observability.otelCollector.image.pullPolicy }}
          args:
            - --config=/etc/otel-collector/otel-collector-config.yaml
          ports:
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector
          resources:
            {{- toYaml .Values.observability.otelCollector.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /
              port: 13133
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /
              port: 13133
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: config
          configMap:
            name: {{ include "incidentops.fullname" . }}-otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "incidentops.fullname" . }}-otel-collector
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: otel-collector
spec:
  type: ClusterIP
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: otlp-grpc
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: otlp-http
      protocol: TCP
  selector:
    {{- include "incidentops.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: otel-collector
{{- end }}
163  helm/incidentops/templates/prometheus-deployment.yaml  (new file)
@@ -0,0 +1,163 @@
{{- if and .Values.observability.enabled .Values.metrics.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
data:
  prometheus.yml: |
    global:
      scrape_interval: {{ .Values.observability.prometheus.scrapeInterval | default "15s" }}
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      - job_name: "incidentops-api"
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - {{ .Release.Namespace }}
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            action: keep
            regex: api
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        metrics_path: /metrics
        scrape_interval: 10s

      - job_name: "incidentops-worker"
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - {{ .Release.Namespace }}
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            action: keep
            regex: worker
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: metrics
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
        metrics_path: /metrics
        scrape_interval: 10s
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: prometheus
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: prometheus
      annotations:
        checksum/config: {{ .Values.observability.prometheus.image.tag | sha256sum }}
    spec:
      serviceAccountName: {{ include "incidentops.serviceAccountName" . }}
      securityContext:
        fsGroup: 65534
        runAsUser: 65534
        runAsNonRoot: true
      containers:
        - name: prometheus
          image: "{{ .Values.observability.prometheus.image.repository }}:{{ .Values.observability.prometheus.image.tag }}"
          imagePullPolicy: {{ .Values.observability.prometheus.image.pullPolicy }}
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention.time={{ .Values.observability.prometheus.retention }}"
            - "--web.enable-lifecycle"
          ports:
            - name: http
              containerPort: 9090
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
          resources:
            {{- toYaml .Values.observability.prometheus.resources | nindent 12 }}
          readinessProbe:
            httpGet:
              path: /-/ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: config
          configMap:
            name: {{ include "incidentops.fullname" . }}-prometheus
        - name: data
          {{- if .Values.observability.prometheus.persistence.enabled }}
          persistentVolumeClaim:
            claimName: {{ include "incidentops.fullname" . }}-prometheus
          {{- else }}
          emptyDir: {}
          {{- end }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 9090
      targetPort: http
      protocol: TCP
  selector:
    {{- include "incidentops.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
{{- if .Values.observability.prometheus.persistence.enabled }}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.observability.prometheus.persistence.size }}
{{- end }}
{{- end }}
29  helm/incidentops/templates/prometheus-rbac.yaml  (new file)
@@ -0,0 +1,29 @@
{{- if and .Values.observability.enabled .Values.metrics.enabled }}
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
rules:
  - apiGroups: [""]
    resources: ["pods", "endpoints", "services"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: {{ include "incidentops.fullname" . }}-prometheus
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: prometheus
subjects:
  - kind: ServiceAccount
    name: {{ include "incidentops.serviceAccountName" . }}
    namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: {{ include "incidentops.fullname" . }}-prometheus
{{- end }}
169  helm/incidentops/templates/promtail-daemonset.yaml  (new file)
@@ -0,0 +1,169 @@
{{- if and .Values.observability.enabled .Values.observability.promtail.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-promtail-config
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: promtail
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
      grpc_listen_port: 0

    positions:
      filename: /run/promtail/positions.yaml

    clients:
      - url: http://{{ include "incidentops.fullname" . }}-loki:3100/loki/api/v1/push

    scrape_configs:
      - job_name: kubernetes-pods
        pipeline_stages:
          - cri: {}
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [{{ .Release.Namespace }}]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_init]
            regex: "true"
            action: drop
          - source_labels: [__meta_kubernetes_pod_phase]
            regex: Pending|Failed|Succeeded
            action: drop
          - source_labels: [__meta_kubernetes_pod_name, __meta_kubernetes_pod_namespace, __meta_kubernetes_pod_container_name]
            target_label: __path__
            replacement: /var/log/containers/$1_$2_$3-*.log
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_component]
            regex: (.*)
            target_label: service_name
            replacement: {{ include "incidentops.fullname" . }}-$1
          - source_labels: [__meta_kubernetes_pod_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
          - source_labels: [__meta_kubernetes_pod_uid]
            target_label: pod_uid
          - target_label: cluster
            replacement: {{ .Release.Namespace }}

      - job_name: containers-fallback
        pipeline_stages:
          - cri: {}
        static_configs:
          - labels:
              job: containers
              namespace: {{ .Release.Namespace }}
              service_name: incidentops-api
              __path__: /var/log/containers/incidentops-api-*_incidentops_api-*.log
          - labels:
              job: containers
              namespace: {{ .Release.Namespace }}
              service_name: incidentops-worker
              __path__: /var/log/containers/incidentops-worker-*_incidentops_worker-*.log
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "incidentops.fullname" . }}-promtail
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: promtail
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: {{ include "incidentops.fullname" . }}-promtail
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: promtail
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "namespaces", "services", "endpoints", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: {{ include "incidentops.fullname" . }}-promtail
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: promtail
subjects:
  - kind: ServiceAccount
    name: {{ include "incidentops.fullname" . }}-promtail
    namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: {{ include "incidentops.fullname" . }}-promtail
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ include "incidentops.fullname" . }}-promtail
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: promtail
spec:
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: promtail
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: promtail
      annotations:
        checksum/config: {{ .Values.observability.promtail.image.tag | sha256sum }}
    spec:
      serviceAccountName: {{ include "incidentops.fullname" . }}-promtail
      securityContext:
        runAsUser: 0
      containers:
        - name: promtail
          image: "{{ .Values.observability.promtail.image.repository }}:{{ .Values.observability.promtail.image.tag }}"
          imagePullPolicy: {{ .Values.observability.promtail.image.pullPolicy }}
          args:
            - -config.file=/etc/promtail/promtail.yaml
          ports:
            - name: http-metrics
              containerPort: 3101
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: positions
              mountPath: /run/promtail
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true
            - name: varlogcontainers
              mountPath: /var/log/containers
              readOnly: true
          resources:
            {{- toYaml .Values.observability.promtail.resources | nindent 12 }}
      volumes:
        - name: config
          configMap:
            name: {{ include "incidentops.fullname" . }}-promtail-config
        - name: positions
          emptyDir: {}
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
{{- end }}
153  helm/incidentops/templates/tempo-deployment.yaml  (new file)
@@ -0,0 +1,153 @@
{{- if .Values.observability.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "incidentops.fullname" . }}-tempo-config
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: tempo
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318

    ingester:
      trace_idle_period: 10s
      max_block_bytes: 1048576
      max_block_duration: 5m

    compactor:
      compaction:
        block_retention: {{ .Values.observability.tempo.retention }}

    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal

    querier:
      search:
        query_timeout: 30s
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "incidentops.fullname" . }}-tempo
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: tempo
spec:
  replicas: 1
  selector:
    matchLabels:
      {{- include "incidentops.selectorLabels" . | nindent 6 }}
      app.kubernetes.io/component: tempo
  template:
    metadata:
      labels:
        {{- include "incidentops.selectorLabels" . | nindent 8 }}
        app.kubernetes.io/component: tempo
      annotations:
        checksum/config: {{ .Values.observability.tempo.image.tag | sha256sum }}
    spec:
      containers:
        - name: tempo
          image: "{{ .Values.observability.tempo.image.repository }}:{{ .Values.observability.tempo.image.tag }}"
          imagePullPolicy: {{ .Values.observability.tempo.image.pullPolicy }}
          args:
            - -config.file=/etc/tempo/tempo.yaml
          ports:
            - name: http
              containerPort: 3200
              protocol: TCP
            - name: otlp-grpc
              containerPort: 4317
              protocol: TCP
            - name: otlp-http
              containerPort: 4318
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
            - name: data
              mountPath: /var/tempo
          resources:
            {{- toYaml .Values.observability.tempo.resources | nindent 12 }}
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 30
            periodSeconds: 30
      volumes:
        - name: config
          configMap:
            name: {{ include "incidentops.fullname" . }}-tempo-config
        - name: data
          {{- if .Values.observability.tempo.persistence.enabled }}
          persistentVolumeClaim:
            claimName: {{ include "incidentops.fullname" . }}-tempo
          {{- else }}
          emptyDir: {}
          {{- end }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "incidentops.fullname" . }}-tempo
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: tempo
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 3200
      targetPort: http
      protocol: TCP
    - name: otlp-grpc
      port: 4317
      targetPort: otlp-grpc
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: otlp-http
      protocol: TCP
  selector:
    {{- include "incidentops.selectorLabels" . | nindent 4 }}
    app.kubernetes.io/component: tempo
{{- if .Values.observability.tempo.persistence.enabled }}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "incidentops.fullname" . }}-tempo
  labels:
    {{- include "incidentops.labels" . | nindent 4 }}
    app.kubernetes.io/component: tempo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.observability.tempo.persistence.size }}
{{- end }}
{{- end }}
@@ -29,6 +29,29 @@ spec:
       serviceAccountName: {{ include "incidentops.serviceAccountName" . }}
       securityContext:
         {{- toYaml .Values.podSecurityContext | nindent 8 }}
+      initContainers:
+        - name: wait-for-postgres
+          image: busybox:1.36
+          command:
+            - sh
+            - -c
+            - |
+              until nc -z {{ include "incidentops.fullname" . }}-postgresql 5432; do
+                echo "Waiting for PostgreSQL..."
+                sleep 2
+              done
+              echo "PostgreSQL is ready"
+        - name: wait-for-redis
+          image: busybox:1.36
+          command:
+            - sh
+            - -c
+            - |
+              until nc -z {{ include "incidentops.fullname" . }}-redis 6379; do
+                echo "Waiting for Redis..."
+                sleep 2
+              done
+              echo "Redis is ready"
       containers:
         - name: worker
           securityContext:
@@ -36,6 +59,8 @@ spec:
           image: {{ include "incidentops.worker.image" . }}
           imagePullPolicy: {{ .Values.worker.image.pullPolicy }}
           command:
+            - uv
+            - run
             - celery
             - -A
             - worker.celery_app
@@ -52,6 +77,8 @@ spec:
           livenessProbe:
             exec:
               command:
+                - uv
+                - run
                 - celery
                 - -A
                 - worker.celery_app
@@ -80,3 +80,63 @@ redis:
     limits:
       cpu: 1000m
       memory: 1Gi
+
+# Application configuration
+config:
+  environment: production
+  logLevel: INFO
+
+# Observability Stack - Production settings
+observability:
+  enabled: true
+
+  otelCollector:
+    replicaCount: 2
+    resources:
+      requests:
+        cpu: 100m
+        memory: 256Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+
+  tempo:
+    retention: "720h" # 30 days
+    persistence:
+      enabled: true
+      size: 50Gi
+    resources:
+      requests:
+        cpu: 250m
+        memory: 512Mi
+      limits:
+        cpu: 1000m
+        memory: 2Gi
+
+  loki:
+    retention: "720h" # 30 days
+    persistence:
+      enabled: true
+      size: 100Gi
+    resources:
+      requests:
+        cpu: 250m
+        memory: 512Mi
+      limits:
+        cpu: 1000m
+        memory: 2Gi
+
+  grafana:
+    adminPassword: "" # Set via external secret in production
+    service:
+      type: ClusterIP
+    persistence:
+      enabled: true
+      size: 5Gi
+    resources:
+      requests:
+        cpu: 100m
+        memory: 256Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
@@ -106,6 +106,8 @@ config:
   jwtAlgorithm: HS256
   accessTokenExpireMinutes: 30
   refreshTokenExpireDays: 30
+  environment: development
+  logLevel: INFO

 # Secrets (use external secrets in production)
 secrets:
@@ -161,3 +163,117 @@ podSecurityContext:
 securityContext:
   runAsNonRoot: true
   runAsUser: 1000
+
+# Observability Stack (Grafana + Loki + Tempo + OpenTelemetry Collector)
+observability:
+  enabled: true
+
+  otelCollector:
+    replicaCount: 1
+    image:
+      repository: otel/opentelemetry-collector-contrib
+      tag: "0.96.0"
+      pullPolicy: IfNotPresent
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 200m
+        memory: 256Mi
+
+  tempo:
+    image:
+      repository: grafana/tempo
+      tag: "2.4.1"
+      pullPolicy: IfNotPresent
+    retention: "168h" # 7 days
+    persistence:
+      enabled: false
+      size: 10Gi
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+
+  loki:
+    image:
+      repository: grafana/loki
+      tag: "2.9.6"
+      pullPolicy: IfNotPresent
+    retention: "168h" # 7 days
+    persistence:
+      enabled: false
+      size: 10Gi
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+
+  prometheus:
+    image:
+      repository: prom/prometheus
+      tag: "v2.51.0"
+      pullPolicy: IfNotPresent
+    retention: "15d"
+    scrapeInterval: "15s"
+    persistence:
+      enabled: false
+      size: 10Gi
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 512Mi
+
+  grafana:
+    image:
+      repository: grafana/grafana
+      tag: "10.4.1"
+      pullPolicy: IfNotPresent
+    adminUser: admin
+    adminPassword: "admin" # Change in production!
+    service:
+      type: ClusterIP
+    ingress:
+      enabled: false
+      host: grafana.incidentops.local
+      annotations: {}
+      tls: []
+    persistence:
+      enabled: false
+      size: 1Gi
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 200m
+        memory: 256Mi
+
+  promtail:
+    enabled: true
+    image:
+      repository: grafana/promtail
+      tag: "2.9.6"
+      pullPolicy: IfNotPresent
+    resources:
+      requests:
+        cpu: 25m
+        memory: 64Mi
+      limits:
+        cpu: 200m
+        memory: 256Mi
+
+# Metrics configuration
+metrics:
+  enabled: true
+  port: 9464
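The `metrics` block gates the Prometheus deployment and RBAC templates, and the chart's scrape configs only keep pods that expose a container port named `metrics` on 9464. On the application side, the pyproject dependencies (`opentelemetry-exporter-prometheus`, `prometheus-client`) are enough to serve that endpoint; a minimal sketch of the wiring, assuming the port number from `values.yaml` (the module layout is illustrative, not the repo's actual code):

```python
# Sketch: expose an OpenTelemetry-backed Prometheus scrape endpoint on 9464.
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
start_http_server(9464)  # serves /metrics; matches metrics.port in values.yaml

meter = metrics.get_meter("incidentops")
incidents_counter = meter.create_counter("incidents_processed")  # illustrative metric
incidents_counter.add(1)
```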
294  observability/grafana/dashboards/api-overview.json  (new file)
@@ -0,0 +1,294 @@
{
  "title": "IncidentOps API Overview",
  "uid": "incidentops-api",
  "tags": ["incidentops", "api"],
  "timezone": "browser",
  "editable": true,
  "panels": [
    {
      "id": 1,
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
          "legendFormat": "Requests/sec",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "unit": "reqps"
        }
      }
    },
    {
      "id": 2,
      "title": "Request Duration (p50, p95, p99)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 8, "y": 0},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
          "legendFormat": "p50",
          "refId": "A"
        },
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
          "legendFormat": "p95",
          "refId": "B"
        },
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job=\"incidentops-api\"}[5m])) by (le))",
          "legendFormat": "p99",
          "refId": "C"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "unit": "s"
        }
      }
    },
    {
      "id": 3,
      "title": "Error Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\", http_status_code=~\"5..\"}[1m])) / sum(rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m])) * 100",
          "legendFormat": "Error %",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"fixedColor": "red", "mode": "fixed"},
          "unit": "percent",
          "min": 0,
          "max": 100
        }
      }
    },
    {
      "id": 4,
      "title": "Requests by Status Code",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "sum by (http_status_code) (rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
          "legendFormat": "{{http_status_code}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "unit": "reqps"
        }
      }
    },
    {
      "id": 5,
      "title": "Requests by Endpoint",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "sum by (http_route) (rate(http_server_request_duration_seconds_count{job=\"incidentops-api\"}[1m]))",
          "legendFormat": "{{http_route}}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "palette-classic"},
          "unit": "reqps"
        }
      }
    },
    {
      "id": 6,
      "title": "System CPU Usage",
      "type": "gauge",
      "gridPos": {"h": 6, "w": 6, "x": 0, "y": 16},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "avg(system_cpu_utilization{job=\"incidentops-api\"}) * 100",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 60},
              {"color": "red", "value": 80}
            ]
          },
          "unit": "percent",
          "min": 0,
          "max": 100
        }
      }
    },
    {
      "id": 7,
      "title": "Memory Usage",
      "type": "gauge",
      "gridPos": {"h": 6, "w": 6, "x": 6, "y": 16},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "process_runtime_cpython_memory_bytes{job=\"incidentops-api\", type=\"rss\"} / 1024 / 1024",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 256},
              {"color": "red", "value": 512}
            ]
          },
          "unit": "decmbytes"
        }
      }
    },
    {
      "id": 8,
      "title": "Active Threads",
      "type": "stat",
      "gridPos": {"h": 6, "w": 6, "x": 12, "y": 16},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "process_runtime_cpython_thread_count{job=\"incidentops-api\"}",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 50},
              {"color": "red", "value": 100}
            ]
          }
        }
      }
    },
    {
      "id": 9,
      "title": "GC Collections",
      "type": "stat",
      "gridPos": {"h": 6, "w": 6, "x": 18, "y": 16},
      "targets": [
        {
          "datasource": {"type": "prometheus", "uid": "prometheus"},
          "expr": "sum(rate(process_runtime_cpython_gc_count{job=\"incidentops-api\"}[5m]))",
          "refId": "A"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "color": {"mode": "thresholds"},
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null}
            ]
          },
          "unit": "cps"
        }
      }
    },
    {
      "id": 10,
      "title": "Recent Logs",
      "type": "logs",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 22},
      "targets": [
        {
          "datasource": {"type": "loki", "uid": "loki"},
          "expr": "{service_name=\"incidentops-api\"} | json",
          "refId": "A"
        }
      ],
      "options": {
        "showTime": true,
        "showLabels": true,
        "wrapLogMessage": true,
        "enableLogDetails": true,
        "sortOrder": "Descending"
      }
    },
    {
      "id": 11,
      "title": "Error Logs",
      "type": "logs",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 32},
      "targets": [
        {
          "datasource": {"type": "loki", "uid": "loki"},
          "expr": "{service_name=\"incidentops-api\"} |= \"ERROR\" | json",
          "refId": "A"
        }
      ],
      "options": {
        "showTime": true,
        "showLabels": true,
        "wrapLogMessage": true,
        "enableLogDetails": true,
        "sortOrder": "Descending"
      }
    },
    {
      "id": 12,
      "title": "Recent Traces",
      "type": "traces",
      "gridPos": {"h": 10, "w": 24, "x": 0, "y": 40},
      "targets": [
        {
          "datasource": {"type": "tempo", "uid": "tempo"},
          "queryType": "traceqlSearch",
          "filters": [
            {
              "id": "service-name",
              "operator": "=",
              "scope": "resource",
              "tag": "service.name",
              "value": ["incidentops-api"]
            }
          ],
          "refId": "A"
        }
      ]
    }
  ],
  "schemaVersion": 38,
  "version": 2
}
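The dashboard's PromQL queries assume the API exports HTTP server duration histograms named `http_server_request_duration_seconds`. With `opentelemetry-instrumentation-fastapi` declared in pyproject, the instrumentation hook is roughly one call; a sketch under the assumption that the app wires it at startup (the exact exported metric name depends on the instrumentation and semantic-convention versions in use, so the dashboard and instrumentation versions need to agree):

```python
# Illustrative only: instrument a FastAPI app so OpenTelemetry records
# HTTP server traces and metrics. The app module layout is assumed.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI(title="IncidentOps API")
FastAPIInstrumentor.instrument_app(app)
```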
@@ -0,0 +1,12 @@
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'IncidentOps'
    folderUid: 'incidentops'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
@@ -0,0 +1,48 @@
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: false
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    access: proxy
    isDefault: false
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      lokiSearch:
        datasourceUid: loki

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    access: proxy
    isDefault: true
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          name: TraceID
          url: '$${__value.raw}'
          urlDisplayLabel: 'View Trace'
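The Loki datasource's `derivedFields` regex extracts `trace_id` from JSON log lines and turns it into a Tempo link, which only works if the application's logs actually carry the active trace ID in that shape. A sketch of a formatter that would produce matching lines (a plausible wiring assuming the app logs JSON; this is not code from this diff):

```python
# Sketch (assumed logging setup): emit JSON log lines whose trace_id field
# matches the derivedFields matcherRegex '"trace_id":"([a-f0-9]+)"' above.
import json
import logging

from opentelemetry import trace


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                # 32 lowercase hex chars, so Grafana can derive a Tempo link
                "trace_id": format(ctx.trace_id, "032x"),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```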
41  observability/loki/config.yaml  (new file)
@@ -0,0 +1,41 @@
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: "2020-10-24"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

limits_config:
  retention_period: 168h # 7 days
  allow_structured_metadata: true
  volume_enabled: true
38  observability/otel-collector/config.yaml  (new file)
@@ -0,0 +1,38 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
    spike_limit_mib: 64

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
23  observability/prometheus/prometheus.yml  (new file)
@@ -0,0 +1,23 @@
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Scrape Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape IncidentOps API metrics
  - job_name: "incidentops-api"
    static_configs:
      - targets: ["api:9464"]
    metrics_path: /metrics
    scrape_interval: 10s

  # Scrape IncidentOps Worker metrics (when metrics are enabled)
  - job_name: "incidentops-worker"
    static_configs:
      - targets: ["worker:9464"]
    metrics_path: /metrics
    scrape_interval: 10s
32  observability/tempo/config.yaml  (new file)
@@ -0,0 +1,32 @@
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  trace_idle_period: 10s
  max_block_bytes: 1048576
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 168h # 7 days

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

querier:
  search:
    query_timeout: 30s
@@ -15,6 +15,18 @@ dependencies = [
     "celery[redis]>=5.4.0",
     "redis>=5.0.0",
     "httpx>=0.28.0",
+    # OpenTelemetry
+    "opentelemetry-api>=1.27.0",
+    "opentelemetry-sdk>=1.27.0",
+    "opentelemetry-exporter-otlp>=1.27.0",
+    "opentelemetry-exporter-prometheus>=0.48b0",
+    "opentelemetry-instrumentation-fastapi>=0.48b0",
+    "opentelemetry-instrumentation-asyncpg>=0.48b0",
+    "opentelemetry-instrumentation-httpx>=0.48b0",
+    "opentelemetry-instrumentation-redis>=0.48b0",
+    "opentelemetry-instrumentation-logging>=0.48b0",
+    "opentelemetry-instrumentation-system-metrics>=0.48b0",
+    "prometheus-client>=0.20.0",
 ]

 [project.optional-dependencies]
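These dependencies give the services everything needed to ship traces to the collector's OTLP gRPC receiver on 4317 (which then forwards to Tempo, per the configs above). A minimal tracing bootstrap, assuming those defaults; the endpoint host is environment-dependent (e.g. the in-cluster collector Service is named `<release>-otel-collector`, while a compose setup might use `otel-collector`), and the module layout here is illustrative:

```python
# Minimal OTLP tracing bootstrap sketch (endpoint and service name assumed).
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "incidentops-api"})
)
# Batch spans and export over gRPC to the collector's OTLP receiver.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("incidentops")
with tracer.start_as_current_span("example-span"):
    pass  # work traced here shows up in Tempo via the collector
```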
@@ -27,14 +27,15 @@ build:
         - src: "worker/**/*.py"
           dest: /app

-  - image: incidentops/web
-    docker:
-      dockerfile: Dockerfile.web
-      context: .
-    sync:
-      manual:
-        - src: "web/src/**/*"
-          dest: /app
+  # Web frontend disabled until implemented
+  # - image: incidentops/web
+  #   docker:
+  #     dockerfile: Dockerfile.web
+  #     context: .
+  #   sync:
+  #     manual:
+  #       - src: "web/src/**/*"
+  #         dest: /app

 local:
   push: false
@@ -48,12 +49,15 @@ deploy:
     valuesFiles:
       - helm/incidentops/values.yaml
     setValues:
-      api.image.repository: incidentops/api
-      api.image.tag: ""
-      worker.image.repository: incidentops/worker
-      worker.image.tag: ""
-      web.image.repository: incidentops/web
-      web.image.tag: ""
+      web.replicaCount: 0 # Disabled until frontend is implemented
+      migration.enabled: true
+    setValueTemplates:
+      api.image.repository: "{{.IMAGE_REPO_incidentops_api}}"
+      api.image.tag: "{{.IMAGE_TAG_incidentops_api}}"
+      worker.image.repository: "{{.IMAGE_REPO_incidentops_worker}}"
+      worker.image.tag: "{{.IMAGE_TAG_incidentops_worker}}"
+      migration.image.repository: "{{.IMAGE_REPO_incidentops_api}}"
+      migration.image.tag: "{{.IMAGE_TAG_incidentops_api}}"
     createNamespace: true
     namespace: incidentops
@@ -74,13 +78,15 @@ profiles:
       setValues:
         api.replicaCount: 1
         worker.replicaCount: 1
-        web.replicaCount: 1
-        api.image.repository: incidentops/api
-        api.image.tag: ""
-        worker.image.repository: incidentops/worker
-        worker.image.tag: ""
-        web.image.repository: incidentops/web
-        web.image.tag: ""
+        web.replicaCount: 0 # Disabled until frontend is implemented
+        migration.enabled: true
+      setValueTemplates:
+        api.image.repository: "{{.IMAGE_REPO_incidentops_api}}"
+        api.image.tag: "{{.IMAGE_TAG_incidentops_api}}"
+        worker.image.repository: "{{.IMAGE_REPO_incidentops_worker}}"
+        worker.image.tag: "{{.IMAGE_TAG_incidentops_worker}}"
+        migration.image.repository: "{{.IMAGE_REPO_incidentops_api}}"
+        migration.image.tag: "{{.IMAGE_TAG_incidentops_api}}"
       createNamespace: true
       namespace: incidentops
@@ -115,8 +121,30 @@ portForward:
|
|||||||
namespace: incidentops
|
namespace: incidentops
|
||||||
port: 8000
|
port: 8000
|
||||||
localPort: 8000
|
localPort: 8000
|
||||||
|
# Web frontend disabled until implemented
|
||||||
|
# - resourceType: service
|
||||||
|
# resourceName: incidentops-web
|
||||||
|
# namespace: incidentops
|
||||||
|
# port: 3000
|
||||||
|
# localPort: 3000
|
||||||
|
# Observability
|
||||||
- resourceType: service
|
- resourceType: service
|
||||||
resourceName: incidentops-web
|
resourceName: incidentops-grafana
|
||||||
namespace: incidentops
|
namespace: incidentops
|
||||||
port: 3000
|
port: 80
|
||||||
localPort: 3000
|
localPort: 3001
|
||||||
|
- resourceType: service
|
||||||
|
resourceName: incidentops-prometheus
|
||||||
|
namespace: incidentops
|
||||||
|
port: 9090
|
||||||
|
localPort: 9090
|
||||||
|
- resourceType: service
|
||||||
|
resourceName: incidentops-tempo
|
||||||
|
namespace: incidentops
|
||||||
|
port: 3200
|
||||||
|
localPort: 3200
|
||||||
|
- resourceType: service
|
||||||
|
resourceName: incidentops-loki
|
||||||
|
namespace: incidentops
|
||||||
|
port: 3100
|
||||||
|
localPort: 3100
|
||||||
|
tests/api/helpers.py (new file, 65 lines)

```python
"""Shared helpers for API integration tests."""

from __future__ import annotations

from typing import Any
from uuid import UUID, uuid4

import asyncpg
from httpx import AsyncClient

API_PREFIX = "/v1"


async def register_user(
    client: AsyncClient,
    *,
    email: str,
    password: str,
    org_name: str = "Test Org",
) -> dict[str, Any]:
    """Call the register endpoint and return JSON body (raises on failure)."""
    response = await client.post(
        f"{API_PREFIX}/auth/register",
        json={"email": email, "password": password, "org_name": org_name},
    )
    response.raise_for_status()
    return response.json()


async def create_org(
    conn: asyncpg.Connection,
    *,
    name: str,
    slug: str | None = None,
) -> UUID:
    """Insert an organization row and return its ID."""
    org_id = uuid4()
    slug_value = slug or name.lower().replace(" ", "-")
    await conn.execute(
        "INSERT INTO orgs (id, name, slug) VALUES ($1, $2, $3)",
        org_id,
        name,
        slug_value,
    )
    return org_id


async def add_membership(
    conn: asyncpg.Connection,
    *,
    user_id: UUID,
    org_id: UUID,
    role: str,
) -> None:
    """Insert a membership record for the user/org pair."""
    await conn.execute(
        "INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
        uuid4(),
        user_id,
        org_id,
        role,
    )
```
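A typical test composes these helpers: register one user through the real API, then arrange extra orgs and memberships directly in the database. A minimal illustrative sketch (the `api_client` and `db_admin` fixtures come from the conftest changes further down; the test body itself is not part of this change set):

```python
# Illustrative composition of the helpers above; not a test from this change.
import asyncpg
from httpx import AsyncClient

from tests.api import helpers


async def test_example_two_org_user(api_client: AsyncClient, db_admin: asyncpg.Connection) -> None:
    tokens = await helpers.register_user(
        api_client, email="demo@example.com", password="DemoPass1!"
    )
    # Arrange a second org plus membership straight in the database.
    org_id = await helpers.create_org(db_admin, name="Second Org")
    user_row = await db_admin.fetchrow(
        "SELECT id FROM users WHERE email = $1", "demo@example.com"
    )
    await helpers.add_membership(
        db_admin, user_id=user_row["id"], org_id=org_id, role="member"
    )
    assert "access_token" in tokens
```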
tests/api/test_auth.py (new file, 213 lines)

```python
"""Integration tests for FastAPI auth endpoints."""

from __future__ import annotations

from uuid import UUID

import asyncpg
import pytest
from httpx import AsyncClient

from app.core import security
from tests.api import helpers


pytestmark = pytest.mark.asyncio

API_PREFIX = "/v1/auth"


async def test_register_endpoint_persists_user_and_membership(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    data = await helpers.register_user(
        api_client,
        email="api-register@example.com",
        password="SuperSecret1!",
        org_name="API Org",
    )
    assert "access_token" in data and "refresh_token" in data

    token_payload = security.decode_access_token(data["access_token"])
    assert token_payload["org_role"] == "admin"

    stored_user = await db_admin.fetchrow("SELECT email FROM users WHERE email = $1", "api-register@example.com")
    assert stored_user is not None

    membership = await db_admin.fetchrow(
        "SELECT role FROM org_members WHERE user_id = $1 AND org_id = $2",
        UUID(token_payload["sub"]),
        UUID(token_payload["org_id"]),
    )
    assert membership is not None and membership["role"] == "admin"


async def test_login_endpoint_rejects_bad_credentials(
    api_client: AsyncClient,
) -> None:
    register_payload = {
        "email": "api-login@example.com",
        "password": "CorrectHorse1!",
        "org_name": "Login Org",
    }
    await helpers.register_user(api_client, **register_payload)

    response = await api_client.post(
        f"{API_PREFIX}/login",
        json={"email": register_payload["email"], "password": "wrong"},
    )

    assert response.status_code == 401


async def test_refresh_endpoint_rotates_refresh_token(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    register_payload = {
        "email": "api-refresh@example.com",
        "password": "RefreshPass1!",
        "org_name": "Refresh Org",
    }
    initial = await helpers.register_user(api_client, **register_payload)

    response = await api_client.post(
        f"{API_PREFIX}/refresh",
        json={"refresh_token": initial["refresh_token"]},
    )

    assert response.status_code == 200
    data = response.json()
    assert data["refresh_token"] != initial["refresh_token"]

    old_hash = security.hash_token(initial["refresh_token"])
    old_row = await db_admin.fetchrow(
        "SELECT rotated_to FROM refresh_tokens WHERE token_hash = $1",
        old_hash,
    )
    assert old_row is not None and old_row["rotated_to"] is not None


async def test_refresh_endpoint_detects_reuse(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="api-reuse@example.com",
        password="ReusePass1!",
        org_name="Reuse Org",
    )

    rotated = await api_client.post(
        f"{API_PREFIX}/refresh",
        json={"refresh_token": tokens["refresh_token"]},
    )
    assert rotated.status_code == 200

    reuse_response = await api_client.post(
        f"{API_PREFIX}/refresh",
        json={"refresh_token": tokens["refresh_token"]},
    )
    assert reuse_response.status_code == 401

    old_hash = security.hash_token(tokens["refresh_token"])
    old_row = await db_admin.fetchrow(
        "SELECT revoked_at FROM refresh_tokens WHERE token_hash = $1",
        old_hash,
    )
    assert old_row is not None and old_row["revoked_at"] is not None


async def test_switch_org_changes_active_org(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    email = "api-switch@example.com"
    register_payload = {
        "email": email,
        "password": "SwitchPass1!",
        "org_name": "Primary Org",
    }
    tokens = await helpers.register_user(api_client, **register_payload)

    user_id_row = await db_admin.fetchrow("SELECT id FROM users WHERE email = $1", email)
    assert user_id_row is not None
    user_id = user_id_row["id"]

    target_org_id = await helpers.create_org(db_admin, name="Secondary Org", slug="secondary-org")
    await helpers.add_membership(db_admin, user_id=user_id, org_id=target_org_id, role="member")

    response = await api_client.post(
        f"{API_PREFIX}/switch-org",
        json={"org_id": str(target_org_id), "refresh_token": tokens["refresh_token"]},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert response.status_code == 200
    data = response.json()
    payload = security.decode_access_token(data["access_token"])
    assert payload["org_id"] == str(target_org_id)
    assert payload["org_role"] == "member"

    new_hash = security.hash_token(data["refresh_token"])
    new_row = await db_admin.fetchrow(
        "SELECT active_org_id FROM refresh_tokens WHERE token_hash = $1",
        new_hash,
    )
    assert new_row is not None and new_row["active_org_id"] == target_org_id


async def test_switch_org_forbidden_without_membership(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="api-switch-no-access@example.com",
        password="SwitchBlock1!",
        org_name="Primary",
    )

    foreign_org = await helpers.create_org(db_admin, name="Foreign Org", slug="foreign-org")

    response = await api_client.post(
        f"{API_PREFIX}/switch-org",
        json={"org_id": str(foreign_org), "refresh_token": tokens["refresh_token"]},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )
    assert response.status_code == 403

    # ensure refresh token still valid after failed attempt
    retry = await api_client.post(
        f"{API_PREFIX}/refresh",
        json={"refresh_token": tokens["refresh_token"]},
    )
    assert retry.status_code == 200


async def test_logout_revokes_refresh_token(
    api_client: AsyncClient,
) -> None:
    register_payload = {
        "email": "api-logout@example.com",
        "password": "LogoutPass1!",
        "org_name": "Logout Org",
    }
    tokens = await helpers.register_user(api_client, **register_payload)

    logout_response = await api_client.post(
        f"{API_PREFIX}/logout",
        json={"refresh_token": tokens["refresh_token"]},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert logout_response.status_code == 204

    refresh_response = await api_client.post(
        f"{API_PREFIX}/refresh",
        json={"refresh_token": tokens["refresh_token"]},
    )

    assert refresh_response.status_code == 401
```
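The rotation and reuse tests above encode the standard refresh-token scheme: every refresh issues a new token, the old row points at its successor via `rotated_to`, and presenting an already-rotated token a second time is treated as theft and revoked. A self-contained, purely illustrative sketch of that state machine (the real service persists SHA hashes in the `refresh_tokens` table rather than an in-memory dict):

```python
# In-memory illustration of rotate-and-detect-reuse; not the app's code.
import secrets

_store: dict[str, dict] = {}  # token -> {"rotated_to": str | None, "revoked": bool}


def issue() -> str:
    token = secrets.token_urlsafe(32)
    _store[token] = {"rotated_to": None, "revoked": False}
    return token


def refresh(token: str) -> str:
    row = _store.get(token)
    if row is None or row["revoked"]:
        raise PermissionError("invalid refresh token")
    if row["rotated_to"] is not None:
        # Replay of a rotated token: likely stolen, so revoke it outright.
        row["revoked"] = True
        raise PermissionError("refresh token reuse detected")
    new_token = issue()
    row["rotated_to"] = new_token
    return new_token


first = issue()
second = refresh(first)
try:
    refresh(first)  # replaying the rotated token must fail
except PermissionError:
    pass
assert _store[first]["revoked"] is True
assert _store[second]["rotated_to"] is None
```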
tests/api/test_incidents.py (new file, 230 lines)

```python
"""Integration tests for incident endpoints."""

from __future__ import annotations

from datetime import UTC, datetime, timedelta
from uuid import UUID, uuid4

import asyncpg
import pytest
from httpx import AsyncClient

from app.core import security
from app.repositories.incident import IncidentRepository
from tests.api import helpers


pytestmark = pytest.mark.asyncio

API_PREFIX = "/v1"


async def _create_service(conn: asyncpg.Connection, org_id: UUID, slug: str = "api") -> UUID:
    service_id = uuid4()
    await conn.execute(
        "INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
        service_id,
        org_id,
        "API",
        slug,
    )
    return service_id


async def _create_incident(
    conn: asyncpg.Connection,
    org_id: UUID,
    service_id: UUID,
    title: str,
    severity: str = "low",
    created_at: datetime | None = None,
) -> UUID:
    repo = IncidentRepository(conn)
    incident_id = uuid4()
    incident = await repo.create(
        incident_id,
        org_id,
        service_id,
        title,
        description=None,
        severity=severity,
    )
    if created_at:
        await conn.execute(
            "UPDATE incidents SET created_at = $1 WHERE id = $2",
            created_at,
            incident_id,
        )
    return incident["id"]


async def _login(client: AsyncClient, *, email: str, password: str) -> dict:
    response = await client.post(
        f"{API_PREFIX}/auth/login",
        json={"email": email, "password": password},
    )
    response.raise_for_status()
    return response.json()


async def test_create_incident_requires_member_role(
    api_client: AsyncClient, db_admin: asyncpg.Connection
) -> None:
    owner_tokens = await helpers.register_user(
        api_client,
        email="owner-inc@example.com",
        password="OwnerInc1!",
        org_name="Incident Org",
    )
    payload = security.decode_access_token(owner_tokens["access_token"])
    org_id = UUID(payload["org_id"])

    service_id = await _create_service(db_admin, org_id)

    viewer_password = "Viewer123!"
    viewer_id = uuid4()
    await db_admin.execute(
        "INSERT INTO users (id, email, password_hash) VALUES ($1, $2, $3)",
        viewer_id,
        "viewer@example.com",
        security.hash_password(viewer_password),
    )
    await db_admin.execute(
        "INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
        uuid4(),
        viewer_id,
        org_id,
        "viewer",
    )

    viewer_tokens = await _login(api_client, email="viewer@example.com", password=viewer_password)

    forbidden = await api_client.post(
        f"{API_PREFIX}/services/{service_id}/incidents",
        json={"title": "View only", "description": None, "severity": "low"},
        headers={"Authorization": f"Bearer {viewer_tokens['access_token']}"},
    )
    assert forbidden.status_code == 403

    created = await api_client.post(
        f"{API_PREFIX}/services/{service_id}/incidents",
        json={"title": "Database down", "description": "Primary unavailable", "severity": "critical"},
        headers={"Authorization": f"Bearer {owner_tokens['access_token']}"},
    )

    assert created.status_code == 201
    incident_id = UUID(created.json()["id"])

    row = await db_admin.fetchrow(
        "SELECT status, org_id FROM incidents WHERE id = $1",
        incident_id,
    )
    assert row is not None and row["status"] == "triggered" and row["org_id"] == org_id

    event = await db_admin.fetchrow(
        "SELECT event_type FROM incident_events WHERE incident_id = $1",
        incident_id,
    )
    assert event is not None and event["event_type"] == "created"


async def test_list_incidents_paginates_and_isolates_org(
    api_client: AsyncClient, db_admin: asyncpg.Connection
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="pager@example.com",
        password="Pager123!",
        org_name="Pager Org",
    )
    payload = security.decode_access_token(tokens["access_token"])
    org_id = UUID(payload["org_id"])

    service_id = await _create_service(db_admin, org_id)
    now = datetime.now(UTC)
    await _create_incident(db_admin, org_id, service_id, "Old", created_at=now - timedelta(minutes=3))
    await _create_incident(db_admin, org_id, service_id, "Mid", created_at=now - timedelta(minutes=2))
    await _create_incident(db_admin, org_id, service_id, "New", created_at=now - timedelta(minutes=1))

    # Noise in another org
    other_org = await helpers.create_org(db_admin, name="Other", slug="other")
    other_service = await _create_service(db_admin, other_org, slug="other-api")
    await _create_incident(db_admin, other_org, other_service, "Other incident")

    response = await api_client.get(
        f"{API_PREFIX}/incidents",
        params={"limit": 2},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert response.status_code == 200
    body = response.json()
    titles = [item["title"] for item in body["items"]]
    assert titles == ["New", "Mid"]
    assert body["has_more"] is True
    assert body["next_cursor"] is not None


async def test_transition_incident_enforces_version_and_updates_status(
    api_client: AsyncClient, db_admin: asyncpg.Connection
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="trans@example.com",
        password="Trans123!",
        org_name="Trans Org",
    )
    payload = security.decode_access_token(tokens["access_token"])
    org_id = UUID(payload["org_id"])
    service_id = await _create_service(db_admin, org_id)

    incident_id = await _create_incident(db_admin, org_id, service_id, "Queue backlog")

    conflict = await api_client.post(
        f"{API_PREFIX}/incidents/{incident_id}/transition",
        json={"to_status": "acknowledged", "version": 5, "note": None},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )
    assert conflict.status_code == 409

    ok = await api_client.post(
        f"{API_PREFIX}/incidents/{incident_id}/transition",
        json={"to_status": "acknowledged", "version": 1, "note": "Looking"},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )
    assert ok.status_code == 200
    assert ok.json()["status"] == "acknowledged"
    assert ok.json()["version"] == 2


async def test_add_comment_appends_event(
    api_client: AsyncClient, db_admin: asyncpg.Connection
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="commenter@example.com",
        password="Commenter1!",
        org_name="Comment Org",
    )
    payload = security.decode_access_token(tokens["access_token"])
    org_id = UUID(payload["org_id"])
    service_id = await _create_service(db_admin, org_id)
    incident_id = await _create_incident(db_admin, org_id, service_id, "Add comment")

    response = await api_client.post(
        f"{API_PREFIX}/incidents/{incident_id}/comment",
        json={"content": "Monitoring"},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert response.status_code == 201
    body = response.json()
    assert body["event_type"] == "comment_added"
    assert body["payload"] == {"content": "Monitoring"}

    event_row = await db_admin.fetchrow(
        "SELECT event_type, actor_user_id FROM incident_events WHERE id = $1",
        UUID(body["id"]),
    )
    assert event_row is not None
    assert event_row["event_type"] == "comment_added"
```
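The pagination test expects newest-first ordering plus `has_more` and an opaque `next_cursor`. One common way to get that contract is keyset pagination on `(created_at, id)`, fetching `limit + 1` rows to learn whether another page exists. A sketch under those assumptions (the cursor encoding and function shape are illustrative, not the repository's actual code):

```python
# Keyset-pagination sketch: order by (created_at, id) DESC and over-fetch by one.
# Table/column names mirror the tests; cursor format is an assumption.
import base64
import json
from datetime import datetime

import asyncpg


async def list_incidents(conn: asyncpg.Connection, org_id, limit: int, cursor: str | None):
    args = [org_id]
    where = "org_id = $1"
    if cursor:
        created_at, last_id = json.loads(base64.urlsafe_b64decode(cursor))
        where += " AND (created_at, id) < ($2, $3)"
        args += [datetime.fromisoformat(created_at), last_id]
    rows = await conn.fetch(
        f"SELECT id, title, created_at FROM incidents WHERE {where} "
        f"ORDER BY created_at DESC, id DESC LIMIT {limit + 1}",
        *args,
    )
    has_more = len(rows) > limit          # the extra row signals another page
    items = rows[:limit]
    next_cursor = None
    if has_more:
        tail = items[-1]                  # resume strictly after the last item shown
        next_cursor = base64.urlsafe_b64encode(
            json.dumps([tail["created_at"].isoformat(), str(tail["id"])]).encode()
        ).decode()
    return items, has_more, next_cursor
```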
tests/api/test_org.py (new file, 238 lines)

```python
"""Integration tests for org endpoints."""

from __future__ import annotations

from uuid import UUID, uuid4

import asyncpg
import pytest
from httpx import AsyncClient

from app.core import security
from tests.api import helpers


pytestmark = pytest.mark.asyncio

API_PREFIX = "/v1/org"


async def _create_user_in_org(
    conn: asyncpg.Connection,
    *,
    org_id: UUID,
    email: str,
    password: str,
    role: str,
) -> UUID:
    user_id = uuid4()
    await conn.execute(
        "INSERT INTO users (id, email, password_hash) VALUES ($1, $2, $3)",
        user_id,
        email,
        security.hash_password(password),
    )
    await conn.execute(
        "INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
        uuid4(),
        user_id,
        org_id,
        role,
    )
    return user_id


async def _login(client: AsyncClient, *, email: str, password: str) -> dict:
    response = await client.post(
        "/v1/auth/login",
        json={"email": email, "password": password},
    )
    response.raise_for_status()
    return response.json()


async def test_get_org_returns_active_org(api_client: AsyncClient) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="org-owner@example.com",
        password="OrgOwner1!",
        org_name="Org Owner Inc",
    )

    response = await api_client.get(
        API_PREFIX,
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert response.status_code == 200
    data = response.json()
    payload = security.decode_access_token(tokens["access_token"])
    assert data["id"] == payload["org_id"]
    assert data["name"] == "Org Owner Inc"


async def test_get_members_requires_admin(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    owner_tokens = await helpers.register_user(
        api_client,
        email="owner@example.com",
        password="OwnerPass1!",
        org_name="Members Co",
    )
    payload = security.decode_access_token(owner_tokens["access_token"])
    org_id = UUID(payload["org_id"])

    member_password = "MemberPass1!"
    await _create_user_in_org(
        db_admin,
        org_id=org_id,
        email="member@example.com",
        password=member_password,
        role="member",
    )
    member_tokens = await _login(api_client, email="member@example.com", password=member_password)

    admin_response = await api_client.get(
        f"{API_PREFIX}/members",
        headers={"Authorization": f"Bearer {owner_tokens['access_token']}"},
    )
    assert admin_response.status_code == 200
    emails = {item["email"] for item in admin_response.json()}
    assert emails == {"owner@example.com", "member@example.com"}

    member_response = await api_client.get(
        f"{API_PREFIX}/members",
        headers={"Authorization": f"Bearer {member_tokens['access_token']}"},
    )
    assert member_response.status_code == 403


async def test_create_service_allows_member_and_persists(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    owner_tokens = await helpers.register_user(
        api_client,
        email="service-owner@example.com",
        password="ServiceOwner1!",
        org_name="Service Org",
    )
    payload = security.decode_access_token(owner_tokens["access_token"])
    org_id = UUID(payload["org_id"])

    member_password = "CreateSvc1!"
    await _create_user_in_org(
        db_admin,
        org_id=org_id,
        email="svc-member@example.com",
        password=member_password,
        role="member",
    )
    member_tokens = await _login(api_client, email="svc-member@example.com", password=member_password)

    response = await api_client.post(
        f"{API_PREFIX}/services",
        json={"name": "API Gateway", "slug": "api-gateway"},
        headers={"Authorization": f"Bearer {member_tokens['access_token']}"},
    )

    assert response.status_code == 201
    body = response.json()
    row = await db_admin.fetchrow(
        "SELECT org_id, slug FROM services WHERE id = $1",
        UUID(body["id"]),
    )
    assert row is not None and row["org_id"] == org_id and row["slug"] == "api-gateway"


async def test_create_service_rejects_duplicate_slug(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    tokens = await helpers.register_user(
        api_client,
        email="dup-owner@example.com",
        password="DupOwner1!",
        org_name="Dup Org",
    )
    payload = security.decode_access_token(tokens["access_token"])
    org_id = UUID(payload["org_id"])

    await db_admin.execute(
        "INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
        uuid4(),
        org_id,
        "Existing",
        "duplicate",
    )

    response = await api_client.post(
        f"{API_PREFIX}/services",
        json={"name": "New", "slug": "duplicate"},
        headers={"Authorization": f"Bearer {tokens['access_token']}"},
    )

    assert response.status_code == 409


async def test_notification_targets_admin_only_and_validation(
    api_client: AsyncClient,
    db_admin: asyncpg.Connection,
) -> None:
    owner_tokens = await helpers.register_user(
        api_client,
        email="notify-owner@example.com",
        password="NotifyOwner1!",
        org_name="Notify Org",
    )
    payload = security.decode_access_token(owner_tokens["access_token"])
    org_id = UUID(payload["org_id"])

    member_password = "NotifyMember1!"
    await _create_user_in_org(
        db_admin,
        org_id=org_id,
        email="notify-member@example.com",
        password=member_password,
        role="member",
    )
    member_tokens = await _login(api_client, email="notify-member@example.com", password=member_password)

    forbidden = await api_client.post(
        f"{API_PREFIX}/notification-targets",
        json={"name": "Webhook", "target_type": "webhook", "webhook_url": "https://example.com"},
        headers={"Authorization": f"Bearer {member_tokens['access_token']}"},
    )
    assert forbidden.status_code == 403

    missing_url = await api_client.post(
        f"{API_PREFIX}/notification-targets",
        json={"name": "Bad", "target_type": "webhook"},
        headers={"Authorization": f"Bearer {owner_tokens['access_token']}"},
    )
    assert missing_url.status_code == 400

    created = await api_client.post(
        f"{API_PREFIX}/notification-targets",
        json={"name": "Pager", "target_type": "webhook", "webhook_url": "https://example.com/hook"},
        headers={"Authorization": f"Bearer {owner_tokens['access_token']}"},
    )

    assert created.status_code == 201
    target_id = UUID(created.json()["id"])

    row = await db_admin.fetchrow(
        "SELECT org_id, name FROM notification_targets WHERE id = $1",
        target_id,
    )
    assert row is not None and row["org_id"] == org_id

    listing = await api_client.get(
        f"{API_PREFIX}/notification-targets",
        headers={"Authorization": f"Bearer {owner_tokens['access_token']}"},
    )
    assert listing.status_code == 200
    names = [item["name"] for item in listing.json()]
    assert names == ["Pager"]
```
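The member/admin checks these tests exercise are commonly enforced with a role-checking dependency. A hedged sketch of that pattern (the real `CurrentUser` and wiring live in `app.api.deps` and may differ; everything below is illustrative):

```python
# Illustrative FastAPI role gate mirroring the 403 assertions above.
from dataclasses import dataclass

from fastapi import Depends, HTTPException


@dataclass
class User:  # hypothetical stand-in for the app's CurrentUser
    email: str
    org_role: str


_RANK = {"viewer": 0, "member": 1, "admin": 2}


async def get_current_user() -> User:  # stand-in; the real app decodes the JWT
    return User(email="member@example.com", org_role="member")


def require_role(minimum: str):
    """Dependency factory: reject callers whose org role ranks below `minimum`."""

    async def checker(user: User = Depends(get_current_user)) -> User:
        if _RANK[user.org_role] < _RANK[minimum]:
            raise HTTPException(status_code=403, detail="insufficient role")
        return user

    return checker


# Usage, e.g. an admin-only listing like GET /v1/org/members above:
# @router.get("/members", dependencies=[Depends(require_role("admin"))])
```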
conftest.py (modified)

```diff
@@ -3,15 +3,23 @@
 from __future__ import annotations
 
 import os
-from uuid import uuid4
+from contextlib import asynccontextmanager
+from typing import AsyncGenerator, Callable, Generator
+from uuid import UUID, uuid4
 
 import asyncpg
+import httpx
 import pytest
 
 # Set test environment variables before importing app modules
 os.environ.setdefault("DATABASE_URL", "postgresql://incidentops:incidentops@localhost:5432/incidentops_test")
 os.environ.setdefault("JWT_SECRET_KEY", "test-secret-key-for-testing-only")
 os.environ.setdefault("REDIS_URL", "redis://localhost:6379/1")
+os.environ.setdefault("TASK_QUEUE_DRIVER", "inmemory")
+os.environ.setdefault("TASK_QUEUE_BROKER_URL", "redis://localhost:6379/2")
+
+from app.main import app
+from app.taskqueue import task_queue
 
 # Module-level setup: create database and run migrations once
@@ -65,7 +73,7 @@ async def _init_test_db() -> None:
 
 
 @pytest.fixture
-async def db_conn() -> asyncpg.Connection:
+async def db_conn() -> AsyncGenerator[asyncpg.Connection, None]:
     """Get a database connection with transaction rollback for test isolation."""
     await _init_test_db()
 
@@ -84,12 +92,88 @@ async def db_conn() -> asyncpg.Connection:
 
 
 @pytest.fixture
-def make_user_id() -> uuid4:
+def make_user_id() -> Callable[[], UUID]:
     """Factory for generating user IDs."""
     return lambda: uuid4()
 
 
 @pytest.fixture
-def make_org_id() -> uuid4:
+def make_org_id() -> Callable[[], UUID]:
     """Factory for generating org IDs."""
     return lambda: uuid4()
+
+
+TABLES_TO_TRUNCATE = [
+    "incident_events",
+    "notification_attempts",
+    "incidents",
+    "notification_targets",
+    "services",
+    "refresh_tokens",
+    "org_members",
+    "orgs",
+    "users",
+]
+
+
+async def _truncate_all_tables() -> None:
+    test_dsn = os.environ["DATABASE_URL"]
+    conn = await asyncpg.connect(test_dsn)
+    try:
+        tables = ", ".join(TABLES_TO_TRUNCATE)
+        await conn.execute(f"TRUNCATE TABLE {tables} CASCADE")
+    finally:
+        await conn.close()
+
+
+@pytest.fixture
+async def clean_database() -> AsyncGenerator[None, None]:
+    """Ensure the database is initialized and truncated before/after tests."""
+    await _init_test_db()
+    await _truncate_all_tables()
+    yield
+    await _truncate_all_tables()
+
+
+@asynccontextmanager
+async def _lifespan_manager() -> AsyncGenerator[None, None]:
+    lifespan = app.router.lifespan_context
+    if lifespan is None:
+        yield
+    else:
+        async with lifespan(app):
+            yield
+
+
+@pytest.fixture
+async def api_client(clean_database: None) -> AsyncGenerator[httpx.AsyncClient, None]:
+    """HTTPX async client bound to the FastAPI app with lifespan support."""
+    async with _lifespan_manager():
+        transport = httpx.ASGITransport(app=app)
+        async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
+            yield client
+
+
+@pytest.fixture
+async def db_admin(clean_database: None) -> AsyncGenerator[asyncpg.Connection, None]:
+    """Plain connection for arranging/inspecting API test data (no rollback)."""
+    test_dsn = os.environ["DATABASE_URL"]
+    conn = await asyncpg.connect(test_dsn)
+    try:
+        yield conn
+    finally:
+        await conn.close()
+
+
+@pytest.fixture(autouse=True)
+def reset_task_queue() -> Generator[None, None, None]:
+    """Ensure in-memory task queue state is cleared between tests."""
+    if hasattr(task_queue, "reset"):
+        task_queue.reset()
+    yield
+    if hasattr(task_queue, "reset"):
+        task_queue.reset()
```
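The new `api_client` fixture drives the FastAPI app in-process through `httpx.ASGITransport`, so no server ever binds a port and the app's lifespan hooks still run. A self-contained demo of the same technique (the `/ping` route and `demo_app` are invented for illustration):

```python
# Minimal standalone demo of httpx.ASGITransport against a FastAPI app.
import asyncio

import httpx
from fastapi import FastAPI

demo_app = FastAPI()


@demo_app.get("/ping")
async def ping() -> dict[str, str]:
    return {"status": "ok"}


async def main() -> None:
    # Requests are dispatched straight into the ASGI app, no network socket.
    transport = httpx.ASGITransport(app=demo_app)
    async with httpx.AsyncClient(transport=transport, base_url="http://testserver") as client:
        response = await client.get("/ping")
        assert response.json() == {"status": "ok"}


asyncio.run(main())
```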
tests/db/test_get_conn.py (new file, 80 lines)

```python
"""Tests for the get_conn dependency helper."""

from __future__ import annotations

import pytest

from app.db import db, get_conn


pytestmark = pytest.mark.asyncio


class _FakeConnection:
    def __init__(self, idx: int) -> None:
        self.idx = idx


class _AcquireContext:
    def __init__(self, conn: _FakeConnection, tracker: "_FakePool") -> None:
        self._conn = conn
        self._tracker = tracker

    async def __aenter__(self) -> _FakeConnection:
        self._tracker.active += 1
        return self._conn

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._tracker.active -= 1


class _FakePool:
    def __init__(self) -> None:
        self.acquire_calls = 0
        self.active = 0

    def acquire(self) -> _AcquireContext:
        conn = _FakeConnection(self.acquire_calls)
        self.acquire_calls += 1
        return _AcquireContext(conn, self)


async def _collect_single_connection():
    connection = None
    async for conn in get_conn():
        connection = conn
    return connection


async def test_get_conn_reuses_connection_within_scope():
    original_pool = db.pool
    fake_pool = _FakePool()
    db.pool = fake_pool
    try:
        captured: list[_FakeConnection] = []

        async for outer in get_conn():
            captured.append(outer)
            async for inner in get_conn():
                captured.append(inner)

        assert len(captured) == 2
        assert captured[0] is captured[1]
        assert fake_pool.acquire_calls == 1
    finally:
        db.pool = original_pool


async def test_get_conn_acquires_new_connection_per_root_scope():
    original_pool = db.pool
    fake_pool = _FakePool()
    db.pool = fake_pool
    try:
        first = await _collect_single_connection()
        second = await _collect_single_connection()

        assert first is not None and second is not None
        assert first is not second
        assert fake_pool.acquire_calls == 2
    finally:
        db.pool = original_pool
```
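These two tests pin down `get_conn`'s contract: nested calls inside one scope share a single pooled connection, while separate root scopes each acquire their own. One way to satisfy that contract is a `ContextVar` guard around pool acquisition; the sketch below is an assumption about the shape of the implementation, not the actual code in `app.db`:

```python
# Hypothetical get_conn sketch: reuse the connection already bound to this
# context, otherwise acquire one from the pool for the root scope's duration.
from contextvars import ContextVar
from typing import Any, AsyncGenerator

from app.db import db  # the module-level pool holder the tests monkeypatch

_current_conn: ContextVar[Any | None] = ContextVar("current_conn", default=None)


async def get_conn() -> AsyncGenerator[Any, None]:
    existing = _current_conn.get()
    if existing is not None:
        yield existing  # nested scope: no second pool acquisition
        return
    async with db.pool.acquire() as conn:  # root scope: exactly one acquire
        token = _current_conn.set(conn)
        try:
            yield conn
        finally:
            _current_conn.reset(token)
```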
tests/services/test_auth_service.py (new file, 260 lines)

```python
"""Unit tests covering AuthService flows."""

from __future__ import annotations

from contextlib import asynccontextmanager
from datetime import UTC, datetime, timedelta
from uuid import UUID, uuid4

import pytest

from app.api.deps import CurrentUser
from app.core import security
from app.db import Database
from app.schemas.auth import (
    LoginRequest,
    LogoutRequest,
    RefreshRequest,
    RegisterRequest,
    SwitchOrgRequest,
)
from app.services.auth import AuthService


pytestmark = pytest.mark.asyncio


class _SingleConnectionDatabase(Database):
    """Database stub that reuses a single asyncpg connection."""

    def __init__(self, conn) -> None:  # type: ignore[override]
        self._conn = conn

    @asynccontextmanager
    async def connection(self):  # type: ignore[override]
        yield self._conn

    @asynccontextmanager
    async def transaction(self):  # type: ignore[override]
        tr = self._conn.transaction()
        await tr.start()
        try:
            yield self._conn
        except Exception:
            await tr.rollback()
            raise
        else:
            await tr.commit()


@pytest.fixture
async def auth_service(db_conn):
    """AuthService bound to the per-test database connection."""
    return AuthService(database=_SingleConnectionDatabase(db_conn))


async def _create_user(conn, email: str, password: str) -> UUID:
    user_id = uuid4()
    password_hash = security.hash_password(password)
    await conn.execute(
        "INSERT INTO users (id, email, password_hash) VALUES ($1, $2, $3)",
        user_id,
        email,
        password_hash,
    )
    return user_id


async def _create_org(
    conn,
    name: str,
    slug: str | None = None,
    *,
    created_at: datetime | None = None,
) -> UUID:
    org_id = uuid4()
    slug_value = slug or f"{name.lower().replace(' ', '-')}-{org_id.hex[:6]}"
    created = created_at or datetime.now(UTC)
    await conn.execute(
        "INSERT INTO orgs (id, name, slug, created_at) VALUES ($1, $2, $3, $4)",
        org_id,
        name,
        slug_value,
        created,
    )
    return org_id


async def _add_membership(conn, user_id: UUID, org_id: UUID, role: str) -> None:
    await conn.execute(
        "INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
        uuid4(),
        user_id,
        org_id,
        role,
    )


async def test_register_user_creates_admin_membership(auth_service, db_conn):
    request = RegisterRequest(
        email="founder@example.com",
        password="SuperSecret1!",
        org_name="Founders Inc",
    )

    response = await auth_service.register_user(request)

    payload = security.decode_access_token(response.access_token)
    assert payload["org_role"] == "admin"

    user_id = UUID(payload["sub"])
    org_id = UUID(payload["org_id"])

    user = await db_conn.fetchrow("SELECT email FROM users WHERE id = $1", user_id)
    assert user is not None and user["email"] == request.email

    membership = await db_conn.fetchrow(
        "SELECT role FROM org_members WHERE user_id = $1 AND org_id = $2",
        user_id,
        org_id,
    )
    assert membership is not None and membership["role"] == "admin"

    refresh_hash = security.hash_token(response.refresh_token)
    refresh_row = await db_conn.fetchrow(
        "SELECT user_id, active_org_id FROM refresh_tokens WHERE token_hash = $1",
        refresh_hash,
    )
    assert refresh_row is not None
    assert refresh_row["user_id"] == user_id
    assert refresh_row["active_org_id"] == org_id


async def test_login_user_returns_tokens_for_valid_credentials(auth_service, db_conn):
    email = "member@example.com"
    password = "Password123!"
    user_id = await _create_user(db_conn, email, password)
    org_id = await _create_org(
        db_conn,
        name="Member Org",
        slug="member-org",
        created_at=datetime.now(UTC) - timedelta(days=1),
    )
    await _add_membership(db_conn, user_id, org_id, "member")

    response = await auth_service.login_user(LoginRequest(email=email, password=password))

    payload = security.decode_access_token(response.access_token)
    assert payload["sub"] == str(user_id)
    assert payload["org_id"] == str(org_id)

    refresh_hash = security.hash_token(response.refresh_token)
    refresh_row = await db_conn.fetchrow(
        "SELECT active_org_id FROM refresh_tokens WHERE token_hash = $1",
        refresh_hash,
    )
    assert refresh_row is not None and refresh_row["active_org_id"] == org_id


async def test_refresh_tokens_rotates_existing_token(auth_service, db_conn):
    email = "rotate@example.com"
    password = "Rotate123!"
    user_id = await _create_user(db_conn, email, password)
    org_id = await _create_org(db_conn, name="Rotate Org", slug="rotate-org")
    await _add_membership(db_conn, user_id, org_id, "member")

    initial = await auth_service.login_user(LoginRequest(email=email, password=password))

    rotated = await auth_service.refresh_tokens(
        RefreshRequest(refresh_token=initial.refresh_token)
    )

    assert rotated.refresh_token != initial.refresh_token

    old_hash = security.hash_token(initial.refresh_token)
    old_row = await db_conn.fetchrow(
        "SELECT rotated_to FROM refresh_tokens WHERE token_hash = $1",
        old_hash,
    )
    assert old_row is not None and old_row["rotated_to"] is not None

    new_hash = security.hash_token(rotated.refresh_token)
    new_row = await db_conn.fetchrow(
        "SELECT user_id FROM refresh_tokens WHERE token_hash = $1",
        new_hash,
    )
    assert new_row is not None and new_row["user_id"] == user_id


async def test_switch_org_updates_active_org(auth_service, db_conn):
    email = "switcher@example.com"
    password = "Switch123!"
    user_id = await _create_user(db_conn, email, password)

    primary_org = await _create_org(
        db_conn,
        name="Primary Org",
        slug="primary-org",
        created_at=datetime.now(UTC) - timedelta(days=2),
    )
    await _add_membership(db_conn, user_id, primary_org, "member")

    secondary_org = await _create_org(
        db_conn,
        name="Secondary Org",
        slug="secondary-org",
        created_at=datetime.now(UTC) - timedelta(days=1),
    )
    await _add_membership(db_conn, user_id, secondary_org, "admin")

    initial = await auth_service.login_user(LoginRequest(email=email, password=password))
    current_user = CurrentUser(
        user_id=user_id,
        email=email,
        org_id=primary_org,
        org_role="member",
        token=initial.access_token,
    )

    switched = await auth_service.switch_org(
        current_user,
        SwitchOrgRequest(org_id=secondary_org, refresh_token=initial.refresh_token),
    )

    payload = security.decode_access_token(switched.access_token)
    assert payload["org_id"] == str(secondary_org)
    assert payload["org_role"] == "admin"

    new_hash = security.hash_token(switched.refresh_token)
    new_row = await db_conn.fetchrow(
        "SELECT active_org_id FROM refresh_tokens WHERE token_hash = $1",
        new_hash,
    )
    assert new_row is not None and new_row["active_org_id"] == secondary_org


async def test_logout_revokes_refresh_token(auth_service, db_conn):
    email = "logout@example.com"
    password = "Logout123!"
    user_id = await _create_user(db_conn, email, password)
    org_id = await _create_org(db_conn, name="Logout Org", slug="logout-org")
    await _add_membership(db_conn, user_id, org_id, "member")

    initial = await auth_service.login_user(LoginRequest(email=email, password=password))
    current_user = CurrentUser(
        user_id=user_id,
        email=email,
        org_id=org_id,
        org_role="member",
        token=initial.access_token,
    )

    await auth_service.logout(current_user, LogoutRequest(refresh_token=initial.refresh_token))

    token_hash = security.hash_token(initial.refresh_token)
    row = await db_conn.fetchrow(
        "SELECT revoked_at FROM refresh_tokens WHERE token_hash = $1",
        token_hash,
    )
    assert row is not None and row["revoked_at"] is not None
```
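Every lookup above goes through `security.hash_token(...)`, meaning only a digest of each refresh token is persisted. A common implementation is a plain SHA-256 hex digest; that is an assumption here, since the repo's `app.core.security` is not shown in this change:

```python
# Hedged sketch of hash_token: persist sha256(token), never the raw token,
# so a leaked refresh_tokens table cannot be replayed against the API.
import hashlib


def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


assert hash_token("abc") == hash_token("abc")  # deterministic lookup key
assert hash_token("abc") != hash_token("abd")  # any change misses the index
```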
275
tests/services/test_incident_service.py
Normal file
275
tests/services/test_incident_service.py
Normal file
@@ -0,0 +1,275 @@
|
|||||||
|
"""Unit tests for IncidentService."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from contextlib import asynccontextmanager
|
||||||
|
from datetime import UTC, datetime, timedelta
|
||||||
|
from uuid import UUID, uuid4
|
||||||
|
|
||||||
|
import asyncpg
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from app.api.deps import CurrentUser
|
||||||
|
from app.core import exceptions as exc, security
|
||||||
|
from app.db import Database
|
||||||
|
from app.schemas.incident import CommentRequest, IncidentCreate, TransitionRequest
|
||||||
|
from app.services.incident import IncidentService
|
||||||
|
from app.taskqueue import InMemoryTaskQueue
|
||||||
|
|
||||||
|
|
||||||
|
pytestmark = pytest.mark.asyncio
|
||||||
|
|
||||||
|
|
||||||
|
class _SingleConnectionDatabase(Database):
|
||||||
|
"""Database stub that reuses a single asyncpg connection."""
|
||||||
|
|
||||||
|
def __init__(self, conn) -> None: # type: ignore[override]
|
||||||
|
self._conn = conn
|
||||||
|
|
||||||
|
@asynccontextmanager
|
||||||
|
async def connection(self): # type: ignore[override]
|
||||||
|
yield self._conn
|
||||||
|
|
||||||
|
@asynccontextmanager
|
||||||
|
async def transaction(self): # type: ignore[override]
|
||||||
|
tr = self._conn.transaction()
|
||||||
|
await tr.start()
|
||||||
|
try:
|
||||||
|
yield self._conn
|
||||||
|
except Exception:
|
||||||
|
await tr.rollback()
|
||||||
|
raise
|
||||||
|
else:
|
||||||
|
await tr.commit()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def incident_task_queue() -> InMemoryTaskQueue:
|
||||||
|
"""In-memory task queue used to assert dispatch behavior."""
|
||||||
|
|
||||||
|
return InMemoryTaskQueue()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
async def incident_service(
|
||||||
|
db_conn: asyncpg.Connection,
|
||||||
|
incident_task_queue: InMemoryTaskQueue,
|
||||||
|
):
|
||||||
|
"""IncidentService bound to the per-test database connection."""
|
||||||
|
|
||||||
|
return IncidentService(
|
||||||
|
database=_SingleConnectionDatabase(db_conn),
|
||||||
|
task_queue=incident_task_queue,
|
||||||
|
escalation_delay_seconds=60,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
async def _seed_user_org_service(conn: asyncpg.Connection) -> tuple[CurrentUser, UUID]:
|
||||||
|
"""Create a user, org, and service and return the CurrentUser + service_id."""
|
||||||
|
|
||||||
|
user_id = uuid4()
|
||||||
|
org_id = uuid4()
|
||||||
|
service_id = uuid4()
|
||||||
|
|
||||||
|
await conn.execute(
|
||||||
|
"INSERT INTO users (id, email, password_hash) VALUES ($1, $2, $3)",
|
||||||
|
user_id,
|
||||||
|
"owner@example.com",
|
||||||
|
security.hash_password("Passw0rd!"),
|
||||||
|
)
|
||||||
|
await conn.execute(
|
||||||
|
"INSERT INTO orgs (id, name, slug) VALUES ($1, $2, $3)",
|
||||||
|
org_id,
|
||||||
|
"Test Org",
|
||||||
|
"test-org",
|
||||||
|
)
|
||||||
|
await conn.execute(
|
||||||
|
"INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
|
||||||
|
uuid4(),
|
||||||
|
user_id,
|
||||||
|
org_id,
|
||||||
|
"member",
|
||||||
|
)
|
||||||
|
await conn.execute(
|
||||||
|
"INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
|
||||||
|
service_id,
|
||||||
|
org_id,
|
||||||
|
"API",
|
||||||
|
"api",
|
||||||
|
)
|
||||||
|
|
||||||
|
current_user = CurrentUser(
|
||||||
|
user_id=user_id,
|
||||||
|
email="owner@example.com",
|
||||||
|
org_id=org_id,
|
||||||
|
org_role="member",
|
||||||
|
token="token",
|
||||||
|
)
|
||||||
|
return current_user, service_id
|
||||||
|
|
||||||
|
|
||||||
|
async def test_create_incident_persists_and_records_event(
|
||||||
|
incident_service: IncidentService,
|
||||||
|
db_conn: asyncpg.Connection,
|
||||||
|
incident_task_queue: InMemoryTaskQueue,
|
||||||
|
) -> None:
|
||||||
|
current_user, service_id = await _seed_user_org_service(db_conn)
|
||||||
|
|
||||||
|
incident = await incident_service.create_incident(
|
||||||
|
current_user,
|
||||||
|
service_id,
|
||||||
|
IncidentCreate(title="API outage", description="Gateway 502s", severity="critical"),
|
||||||
|
)
|
||||||
|
|
||||||
|
row = await db_conn.fetchrow(
|
||||||
|
"SELECT status, org_id, service_id FROM incidents WHERE id = $1",
|
||||||
|
incident.id,
|
||||||
|
)
|
||||||
|
assert row is not None
|
||||||
|
assert row["status"] == "triggered"
|
||||||
|
assert row["org_id"] == current_user.org_id
|
||||||
|
assert row["service_id"] == service_id
|
||||||
|
|
||||||
|
event = await db_conn.fetchrow(
|
||||||
|
"SELECT event_type, actor_user_id FROM incident_events WHERE incident_id = $1",
|
||||||
|
incident.id,
|
||||||
|
)
|
||||||
|
assert event is not None
|
||||||
|
assert event["event_type"] == "created"
|
||||||
|
assert event["actor_user_id"] == current_user.user_id
|
||||||
|
|
||||||
|
assert incident_task_queue.dispatched is not None
|
||||||
|
assert len(incident_task_queue.dispatched) == 2
|
||||||
|
first, second = incident_task_queue.dispatched
|
||||||
|
assert first[0] == "incident_triggered"
|
||||||
|
assert second[0] == "escalate_if_unacked"
|
||||||
|
|
||||||
|
|
||||||
|
async def test_get_incidents_paginates_by_created_at(
|
||||||
|
incident_service: IncidentService, db_conn: asyncpg.Connection
|
||||||
|
) -> None:
|
||||||
|
current_user, service_id = await _seed_user_org_service(db_conn)
|
||||||
|
|
||||||
|
first = await incident_service.create_incident(
|
||||||
|
current_user, service_id, IncidentCreate(title="First", description=None, severity="low")
|
||||||
|
)
|
||||||
|
second = await incident_service.create_incident(
|
||||||
|
current_user, service_id, IncidentCreate(title="Second", description=None, severity="medium")
|
||||||
|
)
|
||||||
|
third = await incident_service.create_incident(
|
||||||
|
current_user, service_id, IncidentCreate(title="Third", description=None, severity="high")
|
||||||
|
)
|
||||||
|
|
||||||
|
# Stagger created_at for deterministic ordering
|
||||||
|
now = datetime.now(UTC)
|
||||||
|
await db_conn.execute(
|
||||||
|
"UPDATE incidents SET created_at = $1 WHERE id = $2",
|
||||||
|
now - timedelta(minutes=3),
|
||||||
|
first.id,
|
||||||
|
)
|
||||||
|
await db_conn.execute(
|
||||||
|
"UPDATE incidents SET created_at = $1 WHERE id = $2",
|
||||||
|
now - timedelta(minutes=2),
|
||||||
|
second.id,
|
||||||
|
)
|
||||||
|
await db_conn.execute(
|
||||||
|
"UPDATE incidents SET created_at = $1 WHERE id = $2",
|
||||||
|
now - timedelta(minutes=1),
|
||||||
|
third.id,
|
||||||
|
)
|
||||||
|
|
||||||
|
page = await incident_service.get_incidents(current_user, limit=2)
|
||||||
|
|
||||||
|
titles = [item.title for item in page.items]
|
||||||
|
assert titles == ["Third", "Second"]
|
||||||
|
assert page.has_more is True
|
||||||
|
assert page.next_cursor is not None


async def test_transition_incident_updates_status_and_records_event(
    incident_service: IncidentService, db_conn: asyncpg.Connection
) -> None:
    current_user, service_id = await _seed_user_org_service(db_conn)
    incident = await incident_service.create_incident(
        current_user, service_id, IncidentCreate(title="Escalation", severity="high", description=None)
    )

    updated = await incident_service.transition_incident(
        current_user,
        incident.id,
        TransitionRequest(to_status="acknowledged", version=incident.version, note="On it"),
    )

    assert updated.status == "acknowledged"
    assert updated.version == incident.version + 1

    event = await db_conn.fetchrow(
        """
        SELECT payload
        FROM incident_events
        WHERE incident_id = $1 AND event_type = 'status_changed'
        ORDER BY created_at DESC
        LIMIT 1
        """,
        incident.id,
    )
    assert event is not None
    payload = event["payload"]
    if isinstance(payload, str):
        import json

        payload = json.loads(payload)

    assert payload["from"] == "triggered"
    assert payload["to"] == "acknowledged"
    assert payload["note"] == "On it"


async def test_transition_incident_rejects_invalid_transition(
    incident_service: IncidentService, db_conn: asyncpg.Connection
) -> None:
    current_user, service_id = await _seed_user_org_service(db_conn)
    incident = await incident_service.create_incident(
        current_user, service_id, IncidentCreate(title="Invalid", severity="low", description=None)
    )

    with pytest.raises(exc.BadRequestError):
        await incident_service.transition_incident(
            current_user,
            incident.id,
            TransitionRequest(to_status="resolved", version=incident.version, note=None),
        )


async def test_transition_incident_conflict_on_version_mismatch(
    incident_service: IncidentService, db_conn: asyncpg.Connection
) -> None:
    current_user, service_id = await _seed_user_org_service(db_conn)
    incident = await incident_service.create_incident(
        current_user, service_id, IncidentCreate(title="Version", severity="medium", description=None)
    )

    with pytest.raises(exc.ConflictError):
        await incident_service.transition_incident(
            current_user,
            incident.id,
            TransitionRequest(to_status="acknowledged", version=999, note=None),
        )


async def test_add_comment_creates_event(
    incident_service: IncidentService, db_conn: asyncpg.Connection
) -> None:
    current_user, service_id = await _seed_user_org_service(db_conn)
    incident = await incident_service.create_incident(
        current_user, service_id, IncidentCreate(title="Comment", severity="low", description=None)
    )

    event = await incident_service.add_comment(
        current_user,
        incident.id,
        CommentRequest(content="Investigating"),
    )

    assert event.event_type == "comment_added"
    assert event.payload == {"content": "Investigating"}
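Reviewer note: the version assertions in the transition tests above imply optimistic locking in the service layer, which is why `version=999` surfaces as a `ConflictError` rather than a lost update. A minimal sketch of that pattern, assuming a guarded UPDATE (the SQL and helper name are illustrative, not the actual IncidentService code):

    # Hypothetical optimistic-locking transition; the real implementation may differ.
    from app.core.exceptions import ConflictError

    async def transition_with_version_check(conn, incident_id, to_status, expected_version):
        row = await conn.fetchrow(
            """
            UPDATE incidents
            SET status = $1, version = version + 1
            WHERE id = $2 AND version = $3
            RETURNING id, status, version
            """,
            to_status,
            incident_id,
            expected_version,
        )
        if row is None:
            # Another writer bumped the version first (or the id is unknown).
            raise ConflictError("incident was modified concurrently")
        return row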
219
tests/services/test_org_service.py
Normal file
@@ -0,0 +1,219 @@
"""Unit tests covering OrgService flows."""

from __future__ import annotations

from contextlib import asynccontextmanager
from uuid import UUID, uuid4

import pytest

from app.api.deps import CurrentUser
from app.core import exceptions as exc, security
from app.db import Database
from app.repositories import NotificationRepository, OrgRepository, ServiceRepository
from app.schemas.org import NotificationTargetCreate, ServiceCreate
from app.services.org import OrgService


pytestmark = pytest.mark.asyncio


class _SingleConnectionDatabase(Database):
    """Database stub that reuses a single asyncpg connection."""

    def __init__(self, conn) -> None:  # type: ignore[override]
        self._conn = conn

    @asynccontextmanager
    async def connection(self):  # type: ignore[override]
        yield self._conn

    @asynccontextmanager
    async def transaction(self):  # type: ignore[override]
        tr = self._conn.transaction()
        await tr.start()
        try:
            yield self._conn
        except Exception:
            await tr.rollback()
            raise
        else:
            await tr.commit()


@pytest.fixture
async def org_service(db_conn):
    """OrgService bound to the per-test database connection."""

    return OrgService(database=_SingleConnectionDatabase(db_conn))


async def _create_user(conn, email: str) -> UUID:
    user_id = uuid4()
    await conn.execute(
        "INSERT INTO users (id, email, password_hash) VALUES ($1, $2, $3)",
        user_id,
        email,
        security.hash_password("Password123!"),
    )
    return user_id


async def _create_org(conn, name: str, slug: str | None = None) -> UUID:
    org_id = uuid4()
    org_repo = OrgRepository(conn)
    await org_repo.create(org_id, name, slug or name.lower().replace(" ", "-"))
    return org_id


async def _add_membership(conn, user_id: UUID, org_id: UUID, role: str) -> None:
    await conn.execute(
        "INSERT INTO org_members (id, user_id, org_id, role) VALUES ($1, $2, $3, $4)",
        uuid4(),
        user_id,
        org_id,
        role,
    )


async def _create_service(conn, org_id: UUID, name: str, slug: str) -> None:
    repo = ServiceRepository(conn)
    await repo.create(uuid4(), org_id, name, slug)


async def _create_notification_target(conn, org_id: UUID, name: str) -> None:
    repo = NotificationRepository(conn)
    await repo.create_target(uuid4(), org_id, name, "webhook", "https://example.com/hook")


def _make_user(user_id: UUID, email: str, org_id: UUID, role: str) -> CurrentUser:
    return CurrentUser(user_id=user_id, email=email, org_id=org_id, org_role=role, token="token")


async def test_get_current_org_returns_summary(org_service, db_conn):
    org_id = await _create_org(db_conn, "Current Org", slug="current-org")
    user_id = await _create_user(db_conn, "owner@example.com")
    await _add_membership(db_conn, user_id, org_id, "admin")

    current_user = _make_user(user_id, "owner@example.com", org_id, "admin")

    result = await org_service.get_current_org(current_user)

    assert result.id == org_id
    assert result.slug == "current-org"


async def test_get_current_org_raises_not_found(org_service, db_conn):
    user_id = await _create_user(db_conn, "ghost@example.com")
    missing_org = uuid4()
    current_user = _make_user(user_id, "ghost@example.com", missing_org, "admin")

    with pytest.raises(exc.NotFoundError):
        await org_service.get_current_org(current_user)


async def test_get_members_returns_org_members(org_service, db_conn):
    org_id = await _create_org(db_conn, "Members Org", slug="members-org")
    admin_id = await _create_user(db_conn, "admin@example.com")
    member_id = await _create_user(db_conn, "member@example.com")
    await _add_membership(db_conn, admin_id, org_id, "admin")
    await _add_membership(db_conn, member_id, org_id, "member")

    current_user = _make_user(admin_id, "admin@example.com", org_id, "admin")
    members = await org_service.get_members(current_user)

    emails = {m.email for m in members}
    assert emails == {"admin@example.com", "member@example.com"}


async def test_create_service_rejects_duplicate_slug(org_service, db_conn):
    org_id = await _create_org(db_conn, "Dup Org", slug="dup-org")
    user_id = await _create_user(db_conn, "service@example.com")
    await _add_membership(db_conn, user_id, org_id, "member")
    await _create_service(db_conn, org_id, "Existing", "duplicate")

    current_user = _make_user(user_id, "service@example.com", org_id, "member")
    with pytest.raises(exc.ConflictError):
        await org_service.create_service(current_user, ServiceCreate(name="New", slug="duplicate"))


async def test_create_service_persists_service(org_service, db_conn):
    org_id = await _create_org(db_conn, "Service Org", slug="service-org")
    user_id = await _create_user(db_conn, "creator@example.com")
    await _add_membership(db_conn, user_id, org_id, "member")
    current_user = _make_user(user_id, "creator@example.com", org_id, "member")

    result = await org_service.create_service(current_user, ServiceCreate(name="API", slug="api"))

    assert result.name == "API"
    row = await db_conn.fetchrow(
        "SELECT name, org_id FROM services WHERE id = $1",
        result.id,
    )
    assert row is not None and row["org_id"] == org_id


async def test_get_services_returns_only_org_services(org_service, db_conn):
    org_id = await _create_org(db_conn, "Own Org", slug="own-org")
    other_org = await _create_org(db_conn, "Other Org", slug="other-org")
    user_id = await _create_user(db_conn, "viewer@example.com")
    await _add_membership(db_conn, user_id, org_id, "viewer")

    await _create_service(db_conn, org_id, "Owned", "owned")
    await _create_service(db_conn, other_org, "Foreign", "foreign")

    current_user = _make_user(user_id, "viewer@example.com", org_id, "viewer")
    services = await org_service.get_services(current_user)

    assert len(services) == 1
    assert services[0].name == "Owned"


async def test_create_notification_target_requires_webhook_url(org_service, db_conn):
    org_id = await _create_org(db_conn, "Webhook Org", slug="webhook-org")
    user_id = await _create_user(db_conn, "admin-webhook@example.com")
    await _add_membership(db_conn, user_id, org_id, "admin")
    current_user = _make_user(user_id, "admin-webhook@example.com", org_id, "admin")

    with pytest.raises(exc.BadRequestError):
        await org_service.create_notification_target(
            current_user,
            NotificationTargetCreate(name="Hook", target_type="webhook", webhook_url=None),
        )


async def test_create_notification_target_persists_target(org_service, db_conn):
    org_id = await _create_org(db_conn, "Notify Org", slug="notify-org")
    user_id = await _create_user(db_conn, "notify@example.com")
    await _add_membership(db_conn, user_id, org_id, "admin")
    current_user = _make_user(user_id, "notify@example.com", org_id, "admin")

    target = await org_service.create_notification_target(
        current_user,
        NotificationTargetCreate(
            name="Pager", target_type="webhook", webhook_url="https://example.com/hook"
        ),
    )

    assert target.enabled is True
    row = await db_conn.fetchrow(
        "SELECT org_id, name FROM notification_targets WHERE id = $1",
        target.id,
    )
    assert row is not None and row["org_id"] == org_id


async def test_get_notification_targets_scopes_to_org(org_service, db_conn):
    org_id = await _create_org(db_conn, "Scope Org", slug="scope-org")
    other_org = await _create_org(db_conn, "Scope Other", slug="scope-other")
    user_id = await _create_user(db_conn, "scope@example.com")
    await _add_membership(db_conn, user_id, org_id, "admin")

    await _create_notification_target(db_conn, org_id, "Own Target")
    await _create_notification_target(db_conn, other_org, "Other Target")

    current_user = _make_user(user_id, "scope@example.com", org_id, "admin")
    targets = await org_service.get_notification_targets(current_user)

    assert len(targets) == 1
    assert targets[0].name == "Own Target"
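Reviewer note: the `_SingleConnectionDatabase` stub at the top of this file is what lets the service under test share the per-test connection (and its enclosing transaction) instead of drawing from a real pool. A hedged usage sketch, reusing names from this file:

    # Sketch: pin a service to the test connection so seeded rows are visible.
    async def example(db_conn, current_user):
        service = OrgService(database=_SingleConnectionDatabase(db_conn))
        # Every repository call inside the service now runs on db_conn.
        return await service.get_current_org(current_user)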
199
tests/worker/test_celery_tasks.py
Normal file
@@ -0,0 +1,199 @@
"""End-to-end Celery worker tests against the real Redis broker."""

from __future__ import annotations

import asyncio
import inspect
from uuid import UUID, uuid4

import asyncpg
import pytest
import redis

from app.config import settings
from app.repositories.incident import IncidentRepository
from app.taskqueue import CeleryTaskQueue
from celery.contrib.testing.worker import start_worker

from worker.celery_app import celery_app


pytestmark = pytest.mark.asyncio


@pytest.fixture(scope="module", autouse=True)
def ensure_redis_available() -> None:
    """Skip the module if the configured Redis broker is unreachable."""

    client = redis.Redis.from_url(settings.resolved_task_queue_broker_url)
    try:
        client.ping()
    except redis.RedisError as exc:  # pragma: no cover - diagnostic-only path
        pytest.skip(f"Redis broker unavailable: {exc}")
    finally:
        client.close()


@pytest.fixture(scope="module")
def celery_worker_instance(ensure_redis_available: None):
    """Run a real Celery worker connected to Redis for the duration of the module."""

    queues = [settings.task_queue_default_queue, settings.task_queue_critical_queue]
    with start_worker(
        celery_app,
        loglevel="INFO",
        pool="solo",
        concurrency=1,
        queues=queues,
        perform_ping_check=False,
    ):
        yield


@pytest.fixture(autouse=True)
def purge_celery_queues():
    """Clear any pending tasks before and after each test for isolation."""

    celery_app.control.purge()
    yield
    celery_app.control.purge()


@pytest.fixture
def celery_queue() -> CeleryTaskQueue:
    return CeleryTaskQueue(
        default_queue=settings.task_queue_default_queue,
        critical_queue=settings.task_queue_critical_queue,
    )


async def _seed_incident_with_target(conn: asyncpg.Connection) -> tuple[UUID, UUID]:
    org_id = uuid4()
    service_id = uuid4()
    incident_id = uuid4()
    target_id = uuid4()

    await conn.execute(
        "INSERT INTO orgs (id, name, slug) VALUES ($1, $2, $3)",
        org_id,
        "Celery Org",
        f"celery-{org_id.hex[:6]}",
    )
    await conn.execute(
        "INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
        service_id,
        org_id,
        "API",
        f"svc-{service_id.hex[:6]}",
    )

    repo = IncidentRepository(conn)
    await repo.create(
        incident_id=incident_id,
        org_id=org_id,
        service_id=service_id,
        title="Latency spike",
        description="",
        severity="high",
    )

    await conn.execute(
        """
        INSERT INTO notification_targets (id, org_id, name, target_type, webhook_url, enabled)
        VALUES ($1, $2, $3, $4, $5, $6)
        """,
        target_id,
        org_id,
        "Primary Webhook",
        "webhook",
        "https://example.com/hook",
        True,
    )

    return org_id, incident_id


async def _wait_until(predicate, timeout: float = 5.0, interval: float = 0.1) -> None:
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        result = predicate()
        if inspect.isawaitable(result):
            result = await result
        if result:
            return
        if asyncio.get_running_loop().time() >= deadline:
            raise AssertionError("Timed out waiting for Celery worker to finish")
        await asyncio.sleep(interval)


async def _attempt_sent(conn: asyncpg.Connection, incident_id: UUID) -> bool:
    row = await conn.fetchrow(
        "SELECT status FROM notification_attempts WHERE incident_id = $1",
        incident_id,
    )
    return bool(row and row["status"] == "sent")


async def _attempt_count(conn: asyncpg.Connection, incident_id: UUID) -> int:
    count = await conn.fetchval(
        "SELECT COUNT(*) FROM notification_attempts WHERE incident_id = $1",
        incident_id,
    )
    return int(count or 0)


async def _attempt_count_is(conn: asyncpg.Connection, incident_id: UUID, expected: int) -> bool:
    return await _attempt_count(conn, incident_id) == expected


async def test_incident_triggered_task_marks_attempt_sent(
    db_admin: asyncpg.Connection,
    celery_worker_instance: None,
    celery_queue: CeleryTaskQueue,
) -> None:
    org_id, incident_id = await _seed_incident_with_target(db_admin)

    celery_queue.incident_triggered(
        incident_id=incident_id,
        org_id=org_id,
        triggered_by=uuid4(),
    )

    await _wait_until(lambda: _attempt_sent(db_admin, incident_id))


async def test_escalate_task_refires_when_incident_still_triggered(
    db_admin: asyncpg.Connection,
    celery_worker_instance: None,
    celery_queue: CeleryTaskQueue,
) -> None:
    org_id, incident_id = await _seed_incident_with_target(db_admin)

    celery_queue.schedule_escalation_check(
        incident_id=incident_id,
        org_id=org_id,
        delay_seconds=0,
    )

    await _wait_until(lambda: _attempt_count_is(db_admin, incident_id, 1))


async def test_escalate_task_skips_when_incident_acknowledged(
    db_admin: asyncpg.Connection,
    celery_worker_instance: None,
    celery_queue: CeleryTaskQueue,
) -> None:
    org_id, incident_id = await _seed_incident_with_target(db_admin)
    await db_admin.execute(
        "UPDATE incidents SET status = 'acknowledged' WHERE id = $1",
        incident_id,
    )

    celery_queue.schedule_escalation_check(
        incident_id=incident_id,
        org_id=org_id,
        delay_seconds=0,
    )

    await asyncio.sleep(1)
    assert await _attempt_count(db_admin, incident_id) == 0
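Reviewer note: these tests treat `CeleryTaskQueue.schedule_escalation_check` as a black box; its body is not part of this diff. A plausible sketch, assuming it wraps `apply_async` with a countdown (conjecture, not the actual adapter):

    # Hypothetical queue-adapter method; the real CeleryTaskQueue may differ.
    from worker.tasks.notifications import escalate_if_unacked

    def schedule_escalation_check(self, *, incident_id, org_id, delay_seconds):
        escalate_if_unacked.apply_async(
            kwargs={"incident_id": str(incident_id), "org_id": str(org_id)},
            countdown=delay_seconds,  # 0 in these tests, so the check fires immediately
            queue=self.critical_queue,  # matches the task_routes entry in celery_app.py
        )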
96
tests/worker/test_notifications.py
Normal file
@@ -0,0 +1,96 @@
"""Tests for worker notification helpers."""

from __future__ import annotations

from uuid import UUID, uuid4

import asyncpg
import pytest

from app.repositories.incident import IncidentRepository
from worker.tasks.notifications import NotificationDispatch, prepare_notification_dispatches


pytestmark = pytest.mark.asyncio


async def _seed_incident(conn: asyncpg.Connection) -> tuple[UUID, UUID, UUID]:
    org_id = uuid4()
    service_id = uuid4()
    incident_id = uuid4()

    await conn.execute(
        "INSERT INTO orgs (id, name, slug) VALUES ($1, $2, $3)",
        org_id,
        "Notif Org",
        "notif-org",
    )
    await conn.execute(
        "INSERT INTO services (id, org_id, name, slug) VALUES ($1, $2, $3, $4)",
        service_id,
        org_id,
        "API",
        "api",
    )

    repo = IncidentRepository(conn)
    await repo.create(
        incident_id=incident_id,
        org_id=org_id,
        service_id=service_id,
        title="Outage",
        description="",
        severity="high",
    )

    return org_id, service_id, incident_id


async def test_prepare_notification_dispatches_creates_attempts(db_conn: asyncpg.Connection) -> None:
    org_id, _service_id, incident_id = await _seed_incident(db_conn)

    target_id = uuid4()
    await db_conn.execute(
        """
        INSERT INTO notification_targets (id, org_id, name, target_type, enabled)
        VALUES ($1, $2, $3, $4, $5)
        """,
        target_id,
        org_id,
        "Primary Webhook",
        "webhook",
        True,
    )

    dispatches = await prepare_notification_dispatches(db_conn, incident_id=incident_id, org_id=org_id)

    assert len(dispatches) == 1
    dispatch = dispatches[0]
    assert isinstance(dispatch, NotificationDispatch)
    assert dispatch.target["name"] == "Primary Webhook"

    attempt = await db_conn.fetchrow(
        "SELECT status FROM notification_attempts WHERE id = $1",
        dispatch.attempt_id,
    )
    assert attempt is not None and attempt["status"] == "pending"


async def test_prepare_notification_dispatches_skips_disabled_targets(db_conn: asyncpg.Connection) -> None:
    org_id, _service_id, incident_id = await _seed_incident(db_conn)

    await db_conn.execute(
        """
        INSERT INTO notification_targets (id, org_id, name, target_type, enabled)
        VALUES ($1, $2, $3, $4, $5)
        """,
        uuid4(),
        org_id,
        "Disabled",
        "email",
        False,
    )

    dispatches = await prepare_notification_dispatches(db_conn, incident_id=incident_id, org_id=org_id)

    assert dispatches == []
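Reviewer note: both tests hinge on a repository contract that is only implied here: `create_attempt` persists a `pending` row and returns it. A hedged sketch of that assumption (signature taken from its call site in worker/tasks/notifications.py later in this diff; returned columns inferred):

    # Assumed NotificationRepository behavior; not shown in this diff.
    from uuid import uuid4
    from app.repositories.notification import NotificationRepository

    async def example(conn, incident_id, target_id):
        attempt = await NotificationRepository(conn).create_attempt(uuid4(), incident_id, target_id)
        assert attempt["status"] == "pending"          # the initial state asserted above
        assert attempt["incident_id"] == incident_id   # echoed into NotificationDispatch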
3
worker/__init__.py
Normal file
@@ -0,0 +1,3 @@
"""Celery worker package for IncidentOps."""

__all__ = ["celery_app"]
43
worker/celery_app.py
Normal file
@@ -0,0 +1,43 @@
"""Celery application configured for IncidentOps."""

from __future__ import annotations

from celery import Celery
from kombu import Queue

from app.config import settings


celery_app = Celery("incidentops")


celery_app.conf.update(
    broker_url=settings.resolved_task_queue_broker_url,
    task_default_queue=settings.task_queue_default_queue,
    task_queues=(
        Queue(settings.task_queue_default_queue),
        Queue(settings.task_queue_critical_queue),
    ),
    task_routes={
        "worker.tasks.notifications.escalate_if_unacked": {
            "queue": settings.task_queue_critical_queue
        },
    },
    task_serializer="json",
    accept_content=["json"],
    timezone="UTC",
    enable_utc=True,
)

if settings.task_queue_backend == "sqs":
    celery_app.conf.broker_transport_options = {
        "region": settings.aws_region or "us-east-1",
        "visibility_timeout": settings.task_queue_visibility_timeout,
        "polling_interval": settings.task_queue_polling_interval,
    }


celery_app.autodiscover_tasks(["worker.tasks"])


__all__ = ["celery_app"]
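Reviewer note: a worker must consume both declared queues or escalation tasks will sit unprocessed. A quick hedged sanity check of the routing:

    # Verify routing before deploying; relies only on names defined above.
    from app.config import settings
    from worker.celery_app import celery_app

    route = celery_app.conf.task_routes["worker.tasks.notifications.escalate_if_unacked"]
    assert route["queue"] == settings.task_queue_critical_queue

    # Launch a matching worker (shell): celery -A worker.celery_app worker -Q critical,default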
5
worker/tasks/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Celery task definitions for IncidentOps."""

from worker.tasks import notifications

__all__ = ["notifications"]
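Reviewer note: importing `notifications` at package-import time is what registers the `@shared_task` functions with the app that `autodiscover_tasks` resolves. A small hedged check:

    # After the package import, tasks should appear under their full names.
    import worker.tasks  # noqa: F401  # side effect: registers the shared tasks
    from worker.celery_app import celery_app

    assert "worker.tasks.notifications.incident_triggered" in celery_app.tasks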
225
worker/tasks/notifications.py
Normal file
@@ -0,0 +1,225 @@
"""Notification-related Celery tasks and helpers."""

from __future__ import annotations

import asyncio
from dataclasses import dataclass
from datetime import UTC, datetime
from typing import Any
from uuid import UUID, uuid4

import asyncpg
from celery import shared_task
from celery.utils.log import get_task_logger

from app.config import settings
from app.repositories.incident import IncidentRepository
from app.repositories.notification import NotificationRepository

logger = get_task_logger(__name__)


@dataclass
class NotificationDispatch:
    """Represents a pending notification attempt for a target."""

    attempt_id: UUID
    incident_id: UUID
    target: dict[str, Any]


def _serialize_target(target: dict[str, Any]) -> dict[str, Any]:
    serialized: dict[str, Any] = {}
    for key, value in target.items():
        if isinstance(value, UUID):
            serialized[key] = str(value)
        else:
            serialized[key] = value
    return serialized
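
# Editor's note (hedged): this helper exists because the app pins Celery to the
# "json" serializer, and uuid.UUID values coming back from asyncpg would raise
# in json.dumps when the fan-out re-enqueues the target dict, e.g.:
#
#     json.dumps({"id": uuid4()})       # TypeError: not JSON serializable
#     json.dumps({"id": str(uuid4())})  # fine once stringified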


async def prepare_notification_dispatches(
    conn: asyncpg.Connection,
    *,
    incident_id: UUID,
    org_id: UUID,
) -> list[NotificationDispatch]:
    """Create notification attempts for all enabled targets in the org."""

    notification_repo = NotificationRepository(conn)
    targets = await notification_repo.get_targets_by_org(org_id, enabled_only=True)
    dispatches: list[NotificationDispatch] = []

    for target in targets:
        attempt = await notification_repo.create_attempt(uuid4(), incident_id, target["id"])
        dispatches.append(
            NotificationDispatch(
                attempt_id=attempt["id"],
                incident_id=attempt["incident_id"],
                target=_serialize_target(target),
            )
        )

    return dispatches


async def _prepare_dispatches_with_new_connection(
    incident_id: UUID,
    org_id: UUID,
) -> list[NotificationDispatch]:
    conn = await asyncpg.connect(settings.database_url)
    try:
        return await prepare_notification_dispatches(conn, incident_id=incident_id, org_id=org_id)
    finally:
        await conn.close()


async def _mark_attempt_success(attempt_id: UUID) -> None:
    conn = await asyncpg.connect(settings.database_url)
    try:
        repo = NotificationRepository(conn)
        await repo.update_attempt_success(attempt_id, datetime.now(UTC))
    finally:
        await conn.close()


async def _mark_attempt_failure(attempt_id: UUID, error: str) -> None:
    conn = await asyncpg.connect(settings.database_url)
    try:
        repo = NotificationRepository(conn)
        await repo.update_attempt_failure(attempt_id, error)
    finally:
        await conn.close()


async def _should_escalate(incident_id: UUID) -> bool:
    conn = await asyncpg.connect(settings.database_url)
    try:
        repo = IncidentRepository(conn)
        incident = await repo.get_by_id(incident_id)
        if incident is None:
            return False
        return incident["status"] == "triggered"
    finally:
        await conn.close()


def _simulate_delivery(channel: str, target: dict[str, Any], incident_id: str) -> None:
    target_name = target.get("name") or target.get("id")
    logger.info("Simulated %s delivery for incident %s to %s", channel, incident_id, target_name)


@shared_task(name="worker.tasks.notifications.incident_triggered", bind=True)
def incident_triggered(
    self,
    *,
    incident_id: str,
    org_id: str,
    triggered_by: str | None = None,
) -> None:
    """Fan out notifications to all enabled targets in the incident's org."""

    incident_uuid = UUID(incident_id)
    org_uuid = UUID(org_id)
    try:
        dispatches = asyncio.run(_prepare_dispatches_with_new_connection(incident_uuid, org_uuid))
    except Exception as exc:  # pragma: no cover - logged for observability
        logger.exception("Failed to prepare notification dispatches: %s", exc)
        raise

    if not dispatches:
        logger.info("No notification targets for org %s", org_id)
        return

    for dispatch in dispatches:
        target_type = dispatch.target.get("target_type")
        kwargs = {
            "attempt_id": str(dispatch.attempt_id),
            "incident_id": incident_id,
            "target": dispatch.target,
        }
        if target_type == "webhook":
            send_webhook.apply_async(kwargs=kwargs, queue=settings.task_queue_default_queue)
        elif target_type == "email":
            send_email.apply_async(kwargs=kwargs, queue=settings.task_queue_default_queue)
        elif target_type == "slack":
            send_slack.apply_async(kwargs=kwargs, queue=settings.task_queue_default_queue)
        else:
            logger.warning("Unsupported notification target type: %s", target_type)


@shared_task(
    name="worker.tasks.notifications.send_webhook",
    bind=True,
    autoretry_for=(Exception,),
    retry_backoff=True,
    retry_kwargs={"max_retries": 3},
)
def send_webhook(self, *, attempt_id: str, target: dict[str, Any], incident_id: str) -> None:
    """Simulate webhook delivery and mark the attempt status."""

    try:
        _simulate_delivery("webhook", target, incident_id)
        asyncio.run(_mark_attempt_success(UUID(attempt_id)))
    except Exception as exc:  # pragma: no cover - logged for observability
        logger.exception("Webhook delivery failed: %s", exc)
        asyncio.run(_mark_attempt_failure(UUID(attempt_id), str(exc)))
        raise


@shared_task(name="worker.tasks.notifications.send_email", bind=True)
def send_email(self, *, attempt_id: str, target: dict[str, Any], incident_id: str) -> None:
    """Simulate email delivery for the notification attempt."""

    try:
        _simulate_delivery("email", target, incident_id)
        asyncio.run(_mark_attempt_success(UUID(attempt_id)))
    except Exception as exc:  # pragma: no cover
        logger.exception("Email delivery failed: %s", exc)
        asyncio.run(_mark_attempt_failure(UUID(attempt_id), str(exc)))
        raise


@shared_task(name="worker.tasks.notifications.send_slack", bind=True)
def send_slack(self, *, attempt_id: str, target: dict[str, Any], incident_id: str) -> None:
    """Simulate Slack delivery for the notification attempt."""

    try:
        _simulate_delivery("slack", target, incident_id)
        asyncio.run(_mark_attempt_success(UUID(attempt_id)))
    except Exception as exc:  # pragma: no cover
        logger.exception("Slack delivery failed: %s", exc)
        asyncio.run(_mark_attempt_failure(UUID(attempt_id), str(exc)))
        raise


@shared_task(name="worker.tasks.notifications.escalate_if_unacked", bind=True)
def escalate_if_unacked(self, *, incident_id: str, org_id: str) -> None:
    """Re-dispatch notifications if the incident remains unacknowledged."""

    incident_uuid = UUID(incident_id)
    should_escalate = asyncio.run(_should_escalate(incident_uuid))
    if not should_escalate:
        logger.info("Incident %s no longer needs escalation", incident_id)
        return

    logger.info("Incident %s still triggered; re-fanning notifications", incident_id)
    incident_triggered.apply_async(  # type: ignore[attr-defined]
        kwargs={
            "incident_id": incident_id,
            "org_id": org_id,
            "triggered_by": None,
        },
        queue=settings.task_queue_critical_queue,
    )


__all__ = [
    "NotificationDispatch",
    "incident_triggered",
    "escalate_if_unacked",
    "prepare_notification_dispatches",
    "send_email",
    "send_slack",
    "send_webhook",
]
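Reviewer note: taken together, a producer-side flow enqueues the fan-out and schedules the escalation re-check in one go. A hedged end-to-end sketch; production code presumably goes through CeleryTaskQueue rather than calling the tasks directly, and the delay attribute name is an assumption:

    # Hedged producer sketch, not the actual API-layer code.
    from uuid import uuid4

    from app.config import settings
    from worker.tasks.notifications import escalate_if_unacked, incident_triggered

    incident_id, org_id = str(uuid4()), str(uuid4())

    incident_triggered.apply_async(
        kwargs={"incident_id": incident_id, "org_id": org_id, "triggered_by": None},
        queue=settings.task_queue_default_queue,
    )
    escalate_if_unacked.apply_async(
        kwargs={"incident_id": incident_id, "org_id": org_id},
        countdown=settings.notification_escalation_delay_seconds,  # assumed attribute name
        queue=settings.task_queue_critical_queue,
    )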