Operations

Start / Run / Deploy

Local development:

  • ./scripts/dev.sh bootstrap
  • ./scripts/dev.sh run

CI verification:

  • ./scripts/ci.sh

OCI image helper:

  • ./scripts/oci_image.sh print-tags
  • ./scripts/oci_image.sh build
  • ./scripts/oci_image.sh publish

Operator Entry Paths

  • Local service path:
    • ./scripts/dev.sh bootstrap
    • ./scripts/ci.sh
    • ./scripts/dev.sh run
  • Standalone host path:
    • deploy/compose/jhf-warp.stack.yml
    • PRODUCTION_STACK_DEPLOYMENT.md (docs/PRODUCTION_STACK_DEPLOYMENT.md)
    • OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
  • Fabric read-first consumer path:
    • /health
    • /ready
    • /version
    • /fabric-manifest.json
    • /openapi.json
  • OCI consumer path:
    • ./scripts/oci_image.sh print-tags
    • OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)

Healthchecks

  • GET /health
    • basic liveness
  • GET /ready
    • readiness plus warnings, capability keys, and self-description references
  • GET /version
    • canonical version endpoint
  • Compose healthcheck policy:
    • no interval below 20s
    • production default: 120s
    • optional low-CPU production override: 180s
    • integration stack: 20s on Postgres only, then stop the stack after verification
  • Production container checks:
    • api: lightweight TCP socket open on <internal-runtime-redacted>:8080
    • postgres: native pg_isready
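
Bounded spot check (sketch): assumes the API is reachable at 127.0.0.1:8080 from the checking host; adjust the base URL for your deployment.

curl -fsS --max-time 5 http://127.0.0.1:8080/health
curl -fsS --max-time 5 http://127.0.0.1:8080/ready
curl -fsS --max-time 5 http://127.0.0.1:8080/version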

Deployment Boundary

Minimum operator baseline:

  • reverse proxy or gateway in front of the service
  • internal-network-only exposure wherever possible
  • self-description endpoints handled separately from mutation/control surfaces
  • mutating routes default-denied unless an authenticated internal caller explicitly needs them
  • configure JHF_WARP_FABRIC_CONTEXT_BASE_URL for the current projection/composition layer
  • optionally configure JHF_WARP_FABRIC_CONTEXT_AUTH_TOKEN for Warp -> Fabric service auth
  • verify host-local env files with python scripts/verify_host_env_contract.py <env-file> and keep only canonical JHF_WARP_* keys in live deployment files (a sketch follows this list)
  • verify runtime materialization drift with python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted> so repo truth, host env, container env, compose labels, and app readback stay aligned
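
Illustrative env file and verification pass (sketch): the env-file path and URL are placeholder values, and only the two JHF_WARP_* keys named above come from this document.

JHF_WARP_FABRIC_CONTEXT_BASE_URL=https://fabric.internal.example/context
JHF_WARP_FABRIC_CONTEXT_AUTH_TOKEN=<token>

python scripts/verify_host_env_contract.py /etc/jhf-warp/jhf-warp.env
python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>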

Read-only self-description surfaces:

  • /health
  • /ready
  • /version
  • /openapi.json
  • /fabric-manifest.json

Mutating/control surfaces that should stay on authenticated internal-only paths:

  • /api/v1/openclaw/patch/*
  • /api/v1/execution/*
  • /api/v1/control-agent/*
  • persistent learning proposal review/write paths

Projected authority contract:

  • self-description endpoints remain open
  • internal routes require Authorization: Bearer <token>
  • Heddle stays upstream auth truth
  • Fabric currently normalizes/projects context
  • future normative governance docks at Spine
  • internal write/control endpoints fail closed when the projected authority context is unavailable or incomplete
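
Illustrative boundary check (sketch): the base URL and the JHF_WARP_INTERNAL_TOKEN variable are placeholders; /api/v1/runtime/inventory is used only because it appears below as an internal operator read. The second call, sent without the bearer token, should be denied (non-2xx) under the fail-closed contract.

curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v1/runtime/inventory
curl -sS --max-time 5 -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/api/v1/runtime/inventory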

Safe Docker Log Diagnostics

For Docker log inspection on a live host, use bounded snapshots only.

Rules:

  • never run unbounded docker logs on a live host
  • always include --since
  • always include --tail
  • always wrap the call in a hard timeout
  • prefer one bounded snapshot over long-running follow mode

Repo helper:

./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20

Equivalent raw pattern:

timeout --foreground 20s docker logs --since 10m --tail 200 jhf-warp-api

Runtime Guardrails

CPU-safe runtime guardrails for the shared host baseline:

  • no-repeat, low-pressure diagnostics only
  • repo-owned stack truth stays jhf-warp with canonical jhf-warp-* container names
  • default shared-host health and watchdog cadence must stay non-aggressive (>= 60s)
  • restart handling must use bounded backoff instead of tight loops
  • every deploy/verify pass must end with a bounded post-deploy cleanup check
  • rerunning the same bounded verify flow must stay idempotent and leave no hanging debug helpers

Canonical verifier:

python scripts/verify_runtime_guardrails.py --report artifacts/runtime-guardrails-report.json
python scripts/verify_runtime_guardrails.py --host <internal-runtime-redacted> --report artifacts/runtime-guardrails-live-report.json
python scripts/verify_runtime_materialization.py
python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>
python scripts/verify_agent_capability_policy_projection.py

Troubleshooting shortcut:

Standard bounded diagnostics evidence:

  • repo/CI path writes artifacts/runtime-guardrails-report.json
  • live host path writes artifacts/runtime-guardrails-live-report.json
  • the smoke workflow uploads the repo report as the canonical bounded diagnostics artifact

Post-deploy cleanup/postcheck expectations:

  • run a bounded log snapshot, not a long-lived stream:
    • ./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20
  • a bounded timeout counts as valid completion for the diagnostic snapshot as long as no lingering log readers remain
  • ensure no lingering docker logs, watch, or tail -f processes remain for jhf-warp
  • prefer one lightweight docker stats --no-stream sample over sustained monitoring (a bounded postcheck sketch follows this list)
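
Bounded postcheck sketch (assumes a Linux host with procps pgrep and the Docker CLI):

pgrep -af 'docker logs|tail -f|watch ' | grep jhf-warp || echo "no lingering log readers"
timeout --foreground 15s docker stats --no-stream jhf-warp-api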

Logging

The service currently relies on standard application logging and CI command output. There is no fully documented structured logging contract yet.

Minimum operator-useful logging should include:

  • startup mode, runtime mode, and persistence mode
  • outbound integration skip/fail/success outcomes
  • control-agent cycle outcomes
  • patch plan/apply guard decisions

Monitoring

Useful operator views today:

  • /health
  • /ready
  • /version
  • /metrics
  • /api/v1/runtime/inventory
  • /api/v1/topology/diff
  • /api/v1/drift/summary
  • /api/v1/control-agent/status
  • /api/v1/persistent-agents

Bounded verify/test stack handling:

  • start only for explicit integration verification (one bounded pass is sketched after this list):
    • docker compose -f compose.integration.yml up --build -d
  • stop immediately after the check window:
    • docker compose -f compose.integration.yml down -v
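
One bounded pass end to end (sketch): the readiness URL and the 120s wait budget are illustrative assumptions, not repository-defined values.

docker compose -f compose.integration.yml up --build -d
timeout --foreground 120s sh -c 'until curl -fsS http://127.0.0.1:8080/ready; do sleep 5; done'
docker compose -f compose.integration.yml down -v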

Minimum monitoring baseline today:

  • self-description state from /health, /ready, /version
  • normalized internal metrics from /metrics
  • persistence mode and runtime mode from /api/v1/runtime/inventory
  • drift severity from /api/v1/drift/summary
  • rollout verification from /api/v1/rollouts/audit
  • control-agent health and scheduler state from /api/v1/control-agent/status
  • persistent-agent governance state from /api/v1/persistent-agents

Metrics surface:

  • GET /metrics
    • Prometheus-style text payload
    • protected as an internal read route by the same projected-authority boundary as other internal operator reads
    • intentionally small: service, persistence, runtime, drift, outbox, control-agent, and governance counters only
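
Bounded scrape spot check (sketch), reusing the same placeholder base URL and token variable as above:

curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/metrics | head -n 20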

Minimum alert/warning set that should be surfaced operationally:

  • fixture-memory active outside explicit development/test use
  • OpenClaw runtime unavailable or degraded
  • failed or repeatedly skipped downstream delivery
  • control-agent reconcile warnings, replay spikes, or watchdog growth
  • OCI publish failures on release-oriented builds

Useful dashboard fields:

  • Grafana:
    • health/readiness state
    • version and deployment image ref
    • runtime mode and persistence mode
    • drift severity trend
    • integration failure counts
  • Gitea:
    • latest green CI state
    • latest built image tag
    • known blocker/warning summary
    • last successful verification timestamp
    • current /metrics scrape timestamp or last successful metrics read

Recommended alert thresholds and warnings:

  • fixture-memory active outside explicit development/test use
  • repeated OpenClaw inventory failures
  • non-zero drift severity that persists across checks
  • repeated control-agent reconcile warnings, replay spikes, or watchdog growth
  • repeated downstream integration failures or skipped deliveries
  • publish-lane failure when a release-oriented OCI build is expected

Known Failure Modes

  • service starts in fixture-memory mode because Postgres DSN is missing
  • runtime inventory/drift degrade because OpenClaw host/runtime facts are unavailable
  • outbound integration routes skip because downstream tokens or URLs are missing
  • OCI publish job skips because GITEA_PACKAGES_TOKEN is not present in the runner context

Restart / Recovery

  • restart the service process/container through the deployment system
  • verify /ready and /api/v1/runtime/inventory after restart, as sketched below
  • verify migration state before assuming persistence regressions
  • regenerate a patch plan before any live runtime mutation retry
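
Post-restart verification sketch, with the same placeholder base URL and token variable as above:

curl -fsS --max-time 5 http://127.0.0.1:8080/ready
curl -fsS --max-time 5 -H "Authorization: Bearer ${JHF_WARP_INTERNAL_TOKEN}" http://127.0.0.1:8080/api/v1/runtime/inventory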

Runtime Dependencies

  • Python runtime
  • Postgres for durable state
  • OpenClaw host/runtime for full orchestration value
  • optional downstream integrations for setup/sync delivery

Implemented Vs Planned

Implemented today:

  • local and stack deployment flows
  • projected authority gating for internal read/write routes
  • minimal metrics export at /metrics
  • read-first Fabric self-description surfaces

Planned or external-only:

  • Fabric registration or write-back control
  • remote MCP server delivery
  • operator-managed runner Postgres verification prerequisites
  • operator-managed OCI publish credentials and downstream consumer rollout
  • RUNBOOK.md
  • SECURITY.md (docs/SECURITY.md)
  • OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
  • OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)

License

AGPLv3. See ../LICENSE. Learn more at helpifyr.com.