Operations

Start / Run / Deploy

Primary modes:

  • local/operator CLI via python -m n8n_expert ...
  • packaged CLI via jhf-shuttle ...
  • optional API runtime via jhf-shuttle serve
  • optional OCI image from the root Dockerfile
  • optional on-prem compose bundle via docker-compose.onprem-messaging.yml
  • optional OCI package path via scripts/oci_image.sh (explicit version and sha tags)

Canonical runtime contract source:

  • docs/RUNTIME_STACK_CONTRACT.md

Healthchecks

  • API health: GET /api/v1/health
  • API status (readiness-adjacent): GET /api/v1/status
  • mailbox adapter health: GET /healthz
  • host-facing mailbox adapter health default: GET http://<host>:58805/healthz (MAILBOX_ADAPTER_HEALTH_HOST_PORT)
  • mailbox publish payload guardrail: MAILBOX_ADAPTER_MAX_PAYLOAD_BYTES (default 131072 bytes; returns 413 Payload Too Large when exceeded)
  • there is no separate committed /readyz endpoint today
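
The endpoints above can be checked with any bounded HTTP client. A minimal sketch using only the Python standard library (the helper names are hypothetical; the 131072-byte default mirrors MAILBOX_ADAPTER_MAX_PAYLOAD_BYTES, everything else is an assumption):

```python
import urllib.request

MAX_PAYLOAD_BYTES = 131072  # MAILBOX_ADAPTER_MAX_PAYLOAD_BYTES default

def probe(url: str, timeout: float = 3.0) -> bool:
    """Bounded GET against a health endpoint; True only on a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False

def publish_status(payload: bytes, limit: int = MAX_PAYLOAD_BYTES) -> int:
    """Status the payload guardrail implies: 413 over the limit, 200 otherwise."""
    return 413 if len(payload) > limit else 200
```

probe() would target GET /api/v1/health or GET /healthz; publish_status() mirrors the adapter's 413 behavior at the exact byte boundary.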

On-prem messaging compose health policy:

  • jhf-shuttle-nats and jhf-shuttle-mailbox-adapter are the only healthchecked services in the default stack
  • default interval: 30s, timeout: 3s, retries: 5, start period: 30s
  • low-CPU override (docker-compose.onprem-messaging.lowcpu.yml) raises intervals to 90s
  • jhf-shuttle-restart-recovery and openclaw-lane-wait-observer intentionally run without healthchecks to avoid unnecessary probe churn
  • low-CPU evidence runbook: docs/LOW_CPU_24H_EVIDENCE_PLAN.md
  • low-CPU run status snapshot: docs/LOW_CPU_24H_EVIDENCE_STATUS_2026-04-02.md
  • overlap guardrail preflight: py scripts\check_shuttle_runtime_overlap.py must return 0 before cutover/deploy (mailbox/NATS/restart-recovery alias groups)
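
As a sketch, the documented probe policy maps onto a compose healthcheck block like the following. This is an illustrative fragment only: the container-internal port and the curl-based test command are assumptions, not the committed docker-compose.onprem-messaging.yml; only the four timing values are documented above.

```yaml
services:
  jhf-shuttle-mailbox-adapter:
    healthcheck:
      # hypothetical probe command and internal port
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/healthz"]
      interval: 30s      # low-CPU override raises this to 90s
      timeout: 3s
      retries: 5
      start_period: 30s
```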

Lane-wait observer guardrails:

  • only one active lane-wait observer may poll logs for the same gateway target at a time
  • OPENCLAW_LANE_WAIT_PRIMARY_OBSERVER_NAME defines deterministic primary ownership when multiple stacks are present
  • secondary observers enter standby mode and must not continuously poll Docker logs
  • default polling is hardened for low host overhead:
    • OPENCLAW_FLOW_CONTROL_POLL_SECONDS=45
    • hard floor OPENCLAW_FLOW_CONTROL_MIN_POLL_SECONDS=30
    • idle poll OPENCLAW_FLOW_CONTROL_IDLE_POLL_SECONDS=120
    • bounded log window OPENCLAW_FLOW_CONTROL_LOG_TAIL_LINES=300
    • bounded backoff and jitter (OPENCLAW_FLOW_CONTROL_MAX_BACKOFF_SECONDS, OPENCLAW_FLOW_CONTROL_POLL_JITTER_SECONDS)
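
A minimal sketch of the interval selection under the documented defaults; next_poll_delay is a hypothetical helper, not the shipped observer, and the jitter default is an assumption (the variable is named above without a value):

```python
import random

POLL_SECONDS = 45        # OPENCLAW_FLOW_CONTROL_POLL_SECONDS
MIN_POLL_SECONDS = 30    # OPENCLAW_FLOW_CONTROL_MIN_POLL_SECONDS (hard floor)
IDLE_POLL_SECONDS = 120  # OPENCLAW_FLOW_CONTROL_IDLE_POLL_SECONDS
POLL_JITTER_SECONDS = 3  # assumed value for OPENCLAW_FLOW_CONTROL_POLL_JITTER_SECONDS

def next_poll_delay(idle: bool) -> float:
    """Idle observers stretch to the idle interval, active ones use the base
    interval, neither drops below the hard floor, and jitter de-synchronizes
    observers that share a host."""
    base = IDLE_POLL_SECONDS if idle else POLL_SECONDS
    return max(base, MIN_POLL_SECONDS) + random.uniform(0.0, POLL_JITTER_SECONDS)
```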

Restart-recovery guardrails:

  • restart poll defaults are low-pressure and bounded:
    • OPENCLAW_RESTART_RECOVERY_MIN_POLL_SECONDS=30
    • OPENCLAW_RESTART_RECOVERY_POLL_SECONDS=45
    • OPENCLAW_RESTART_RECOVERY_MAX_BACKOFF_SECONDS=300
    • OPENCLAW_RESTART_RECOVERY_POLL_JITTER_SECONDS=3
  • errors must not trigger tight-loop retries; restart-recovery uses bounded backoff + jitter
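
The bounded-backoff rule can be sketched with the documented defaults; retry_delay is a hypothetical helper showing why a persistently failing dependency can never produce a tight retry loop:

```python
import random

MIN_POLL = 30      # OPENCLAW_RESTART_RECOVERY_MIN_POLL_SECONDS
POLL = 45          # OPENCLAW_RESTART_RECOVERY_POLL_SECONDS
MAX_BACKOFF = 300  # OPENCLAW_RESTART_RECOVERY_MAX_BACKOFF_SECONDS
JITTER = 3         # OPENCLAW_RESTART_RECOVERY_POLL_JITTER_SECONDS

def retry_delay(consecutive_errors: int) -> float:
    """Double the base poll per consecutive error, cap at MAX_BACKOFF, never
    drop below MIN_POLL, and add jitter so restarts do not synchronize."""
    backoff = min(POLL * (2 ** consecutive_errors), MAX_BACKOFF)
    return max(backoff, MIN_POLL) + random.uniform(0.0, JITTER)
```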

Safe Docker log policy for observer diagnostics:

  • use bounded log reads only (timeout + --since + --tail)
  • do not run unbounded or follow-mode reads on live hosts (docker logs -f, unlimited --tail)
  • observer use-case must always stay within bounded read limits and bounded poll intervals
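
The policy reduces to "always pass a timeout, a time window, and a tail cap; never follow". A sketch that builds such a command line (the helper name and defaults are assumptions; `timeout` is the coreutils wrapper):

```python
def bounded_logs_cmd(container: str, since: str = "5m",
                     tail: int = 300, timeout_s: int = 10) -> list[str]:
    """argv for a policy-compliant log read: wrapped in coreutils `timeout`,
    windowed with --since, capped with --tail, and never follow-mode."""
    return ["timeout", str(timeout_s), "docker", "logs",
            "--since", since, "--tail", str(tail), container]
```

For example, `bounded_logs_cmd("jhf-shuttle-nats")` (container name hypothetical) can be handed directly to subprocess.run.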

Runtime port-policy verify/readiness path:

  1. python3 scripts/check_host_port_contract.py
  2. python3 scripts/verify_runtime_port_contract.py --json
  3. python3 scripts/verify_runtime_port_contract.py --json --ssh-target <internal-runtime-redacted>
  4. python3 scripts/verify_cpu_safe_runtime_guardrails.py --json
  5. bash scripts/post_deploy_runtime_cleanup.sh
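
A hypothetical wrapper that encodes the fail-fast ordering of the steps above (the scripts themselves are the contract; the --ssh-target variant is omitted here because its target is redacted):

```python
import subprocess

STEPS = [
    ["python3", "scripts/check_host_port_contract.py"],
    ["python3", "scripts/verify_runtime_port_contract.py", "--json"],
    ["python3", "scripts/verify_cpu_safe_runtime_guardrails.py", "--json"],
    ["bash", "scripts/post_deploy_runtime_cleanup.sh"],
]

def run_gate(steps=STEPS) -> int:
    """Run the verification steps in order, stopping at the first nonzero
    exit code; 0 means every step passed."""
    for cmd in steps:
        rc = subprocess.run(cmd).returncode
        if rc != 0:
            return rc
    return 0
```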

Cutover-only overlap guardrail:

  • py scripts\check_shuttle_runtime_overlap.py must return 0 before enforcing a single mailbox/NATS/restart-recovery path

Port-policy rules:

  • static-required: undeclared live port drift is a failure
  • dynamic-allowed-with-discovery: discovery source + consumer-safe publish path are mandatory
  • internal-only: host port publishes are failures
  • shared-host-exception: allowed only with explicit contract entry and discovery path
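
The four rules read as a decision table. A sketch of that classification (function and parameter names are hypothetical; the real verifier is scripts/verify_runtime_port_contract.py):

```python
def port_verdict(mode: str, declared: bool, published: bool,
                 has_discovery: bool = False, contract_entry: bool = False) -> str:
    """Classify one observed live port against the port-policy rules."""
    if mode == "static-required":
        return "pass" if declared else "fail"       # undeclared drift fails
    if mode == "dynamic-allowed-with-discovery":
        return "pass" if has_discovery else "fail"  # discovery path is mandatory
    if mode == "internal-only":
        return "fail" if published else "pass"      # any host publish fails
    if mode == "shared-host-exception":
        return "pass" if contract_entry and has_discovery else "fail"
    return "fail"                                   # unknown modes fail closed
```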

Logs And Artifacts

  • logs/events.jsonl
  • logs/contexts/*.json
  • logs/upgrade-impact/*.json
  • logs/upgrade-automation/latest.json
  • logs/catalog-refresh/*.json
  • logs/release-hardening/latest.json
  • logs/end-to-end-regression/latest.json
  • dist/package-metadata.json (generated packaging contract output)

Monitoring-Relevant States

  • n8n reachability
  • instance version and version gap
  • catalog freshness and baseline refresh truth
  • latest upgrade summary, alerts, and backlog
  • mailbox adapter health
  • NATS/JetStream configuration truth
  • webhook/callback contract visibility

Dashboard Fields

Grafana should prioritize:

  • latest upstream version
  • versions behind
  • catalog freshness status
  • baseline coverage ratio
  • top upgrade alerts
  • mailbox adapter health and pressure level

Gitea dashboards should prioritize:

  • current version
  • latest successful verification
  • lifecycle stage
  • open residual risks
  • registered capabilities
  • critical dependencies

Operational Gaps

  • no unified /metrics endpoint
  • no single committed readiness-only endpoint
  • long-running operational evidence still depends on scripts and artifacts rather than a persistent control plane

AGPLv3. Learn more at helpifyr.com.