Operations

Start / Run / Deploy

Local development:

  • ./scripts/dev.sh bootstrap
  • ./scripts/dev.sh run

CI verification:

  • ./scripts/ci.sh

OCI image helper:

  • ./scripts/oci_image.sh print-tags
  • ./scripts/oci_image.sh build
  • ./scripts/oci_image.sh publish

Operator Entry Paths

  • Local service path:
    • ./scripts/dev.sh bootstrap
    • ./scripts/ci.sh
    • ./scripts/dev.sh run
  • Standalone host path:
    • deploy/compose/jhf-warp.stack.yml
    • PRODUCTION_STACK_DEPLOYMENT.md (docs/PRODUCTION_STACK_DEPLOYMENT.md)
    • OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
  • Fabric read-first consumer path:
    • /health
    • /ready
    • /version
    • /fabric-manifest.json
    • /openapi.json
  • OCI consumer path:
    • ./scripts/oci_image.sh print-tags
    • OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)

Healthchecks

  • GET /health
    • basic liveness
  • GET /ready
    • readiness plus warnings, capability keys, and self-description references
  • GET /version
    • canonical version endpoint
  • Compose healthcheck policy:
    • no interval below 20s
    • production default: 120s
    • optional low-CPU production override: 180s
    • integration stack: 20s on Postgres only, then stop the stack after verification
  • Production container checks:
    • api: lightweight TCP socket open on <internal-runtime-redacted>:8080
    • postgres: native pg_isready
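
Bounded spot check (sketch): assumes the API is reachable at 127.0.0.1:8080 from the checking host; adjust the base URL for your deployment.

curl -fsS --max-time 5 http://127.0.0.1:8080/health
curl -fsS --max-time 5 http://127.0.0.1:8080/ready
curl -fsS --max-time 5 http://127.0.0.1:8080/version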

Deployment Boundary

Minimum operator baseline:

  • reverse proxy or gateway in front of the service
  • internal-network-only exposure wherever possible
  • self-description endpoints handled separately from mutation/control surfaces
  • mutating routes default-denied unless an authenticated internal caller explicitly needs them
  • configure JHF_WARP_FABRIC_CONTEXT_BASE_URL for the current projection/composition layer
  • optionally configure JHF_WARP_FABRIC_CONTEXT_AUTH_TOKEN for Warp -> Fabric service auth
  • verify host-local env files with python scripts/verify_host_env_contract.py <env-file> and keep only canonical JHF_WARP_* keys in live deployment files (a sketch follows this list)
  • verify runtime materialization drift with python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted> so repo truth, host env, container env, compose labels, and app readback stay aligned
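
Illustrative env file and verification pass (sketch): the env-file path and URL are placeholder values, and only the two JHF_WARP_* keys named above come from this document.

JHF_WARP_FABRIC_CONTEXT_BASE_URL=https://fabric.internal.example/context
JHF_WARP_FABRIC_CONTEXT_AUTH_TOKEN=<token>

python scripts/verify_host_env_contract.py /etc/jhf-warp/jhf-warp.env
python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>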

Read-only self-description surfaces:

  • /health
  • /ready
  • /version
  • /openapi.json
  • /fabric-manifest.json

Mutating/control surfaces that should stay on authenticated internal-only paths:

  • /api/v1/openclaw/patch/*
  • /api/v1/execution/*
  • /api/v1/control-agent/*
  • persistent learning proposal review/write paths

Projected authority contract:

  • self-description endpoints remain open
  • internal routes require Authorization: Bearer <token>
  • Heddle stays upstream auth truth
  • Fabric currently normalizes/projects context
  • future normative governance docks at Spine
  • internal write/control endpoints fail closed when the projected authority context is unavailable or incomplete
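
Illustrative boundary check (sketch): the base URL and the JHF_WARP_INTERNAL_TOKEN variable are placeholders; /api/v1/runtime/inventory is used only because it appears below as an internal operator read. The second call, sent without the bearer token, should be denied (non-2xx) under the fail-closed contract.

curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v1/runtime/inventory
curl -sS --max-time 5 -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/api/v1/runtime/inventory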

Safe Docker Log Diagnostics

For Docker log inspection on a live host, use bounded snapshots only.

Rules:

  • never run unbounded docker logs on a live host
  • always include --since
  • always include --tail
  • always wrap the call in a hard timeout
  • prefer one bounded snapshot over long-running follow mode

Repo helper:

./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20

Equivalent raw pattern:

timeout --foreground 20s docker logs --since 10m --tail 200 jhf-warp-api

Runtime Guardrails

CPU-safe runtime guardrails for the shared host baseline:

  • no-repeat, low-pressure diagnostics only
  • repo-owned stack truth stays jhf-warp with canonical jhf-warp-* container names
  • default shared-host health and watchdog cadence must stay non-aggressive (>= 60s)
  • restart handling must use bounded backoff instead of tight loops
  • every deploy/verify pass must end with a bounded post-deploy cleanup check
  • rerunning the same bounded verify flow must stay idempotent and leave no hanging debug helpers

Canonical verifier:

python scripts/verify_runtime_guardrails.py --report artifacts/runtime-guardrails-report.json
python scripts/verify_runtime_guardrails.py --host <internal-runtime-redacted> --report artifacts/runtime-guardrails-live-report.json
python scripts/verify_runtime_materialization.py
python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>
python scripts/verify_agent_capability_policy_projection.py

Troubleshooting shortcut:

Standard bounded diagnostics evidence:

  • repo/CI path writes artifacts/runtime-guardrails-report.json
  • live host path writes artifacts/runtime-guardrails-live-report.json
  • the smoke workflow uploads the repo report as the canonical bounded diagnostics artifact

Post-deploy cleanup/postcheck expectations:

  • run a bounded log snapshot, not a long-lived stream:
    • ./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20
  • a bounded timeout counts as valid completion for the diagnostic snapshot as long as no lingering log readers remain
  • ensure no lingering docker logs, watch, or tail -f processes remain for jhf-warp
  • prefer one lightweight docker stats --no-stream sample over sustained monitoring (a bounded postcheck sketch follows this list)
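
Bounded postcheck sketch (assumes a Linux host with procps pgrep and the Docker CLI):

pgrep -af 'docker logs|tail -f|watch ' | grep jhf-warp || echo "no lingering log readers"
timeout --foreground 15s docker stats --no-stream jhf-warp-api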

Logging

The service currently relies on standard application logging and CI command output. There is no fully documented structured logging contract yet.

Minimum operator-useful logging should include:

  • startup mode, runtime mode, and persistence mode
  • outbound integration skip/fail/success outcomes
  • control-agent cycle outcomes
  • patch plan/apply guard decisions

Monitoring

Useful operator views today:

  • /health
  • /ready
  • /version
  • /metrics
  • /api/v1/runtime/inventory
  • /api/v1/topology/diff
  • /api/v1/drift/summary
  • /api/v1/control-agent/status
  • /api/v1/persistent-agents

Bounded verify/test stack handling:

  • start only for explicit integration verification (one bounded pass is sketched after this list):
    • docker compose -f compose.integration.yml up --build -d
  • stop immediately after the check window:
    • docker compose -f compose.integration.yml down -v
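
One bounded pass end to end (sketch): the readiness URL and the 120s wait budget are illustrative assumptions, not repository-defined values.

docker compose -f compose.integration.yml up --build -d
timeout --foreground 120s sh -c 'until curl -fsS http://127.0.0.1:8080/ready; do sleep 5; done'
docker compose -f compose.integration.yml down -v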

Minimum monitoring baseline today:

  • self-description state from /health, /ready, /version
  • normalized internal metrics from /metrics
  • persistence mode and runtime mode from /api/v1/runtime/inventory
  • drift severity from /api/v1/drift/summary
  • rollout verification from /api/v1/rollouts/audit
  • control-agent health and scheduler state from /api/v1/control-agent/status
  • persistent-agent governance state from /api/v1/persistent-agents

Metrics surface:

  • GET /metrics
    • Prometheus-style text payload
    • protected as an internal read route by the same projected-authority boundary as other internal operator reads
    • intentionally small: service, persistence, runtime, drift, outbox, control-agent, and governance counters only
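
Bounded scrape spot check (sketch), reusing the same placeholder base URL and token variable as above:

curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/metrics | head -n 20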

Minimum alert/warning set that should be surfaced operationally:

  • fixture-memory active outside explicit development/test use
  • OpenClaw runtime unavailable or degraded
  • failed or repeatedly skipped downstream delivery
  • control-agent reconcile warnings, replay spikes, or watchdog growth
  • OCI publish failures on release-oriented builds

Useful dashboard fields:

  • Grafana:
    • health/readiness state
    • version and deployment image ref
    • runtime mode and persistence mode
    • drift severity trend
    • integration failure counts
  • Gitea:
    • latest green CI state
    • latest built image tag
    • known blocker/warning summary
    • last successful verification timestamp
    • current /metrics scrape timestamp or last successful metrics read

Recommended alert thresholds and warnings:

  • fixture-memory active outside explicit development/test use
  • repeated OpenClaw inventory failures
  • non-zero drift severity that persists across checks
  • repeated control-agent reconcile warnings, replay spikes, or watchdog growth
  • repeated downstream integration failures or skipped deliveries
  • publish-lane failure when a release-oriented OCI build is expected

Known Failure Modes

  • service starts in fixture-memory mode because Postgres DSN is missing
  • runtime inventory/drift degrade because OpenClaw host/runtime facts are unavailable
  • outbound integration routes skip because downstream tokens or URLs are missing
  • OCI publish job skips because GITEA_PACKAGES_TOKEN is not present in the runner context

Restart / Recovery

  • restart the service process/container through the deployment system
  • verify /ready and /api/v1/runtime/inventory after restart, as sketched below
  • verify migration state before assuming persistence regressions
  • regenerate a patch plan before any live runtime mutation retry
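
Post-restart verification sketch, with the same placeholder base URL and token variable as above:

curl -fsS --max-time 5 http://127.0.0.1:8080/ready
curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v1/runtime/inventory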

Runtime Dependencies

  • Python runtime
  • Postgres for durable state
  • OpenClaw host/runtime for full orchestration value
  • optional downstream integrations for setup/sync delivery

Implemented Vs Planned

Implemented today:

  • local and stack deployment flows
  • projected authority gating for internal read/write routes
  • minimal metrics export at /metrics
  • read-first Fabric self-description surfaces

Planned or external-only:

  • Fabric registration or write-back control
  • remote MCP server delivery
  • operator-managed runner Postgres verification prerequisites
  • operator-managed OCI publish credentials and downstream consumer rollout
  • RUNBOOK.md
  • SECURITY.md (docs/SECURITY.md)
  • OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
  • OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)

License

AGPLv3. See ../LICENSE. Learn more at helpifyr.com.