jhf-spool Operations

Version: 2026-04-01

Start / Run / Deploy

  • entrypoint: compose.dev.yaml
  • low-cpu override: compose.lowcpu.yaml
  • scripts:
    • scripts/dev-up.sh
    • scripts/dev-down.sh
    • scripts/ops/deploy_news_memory_main_stack.sh
    • docs/STACK_CONTRACT.md (canonical runtime contract)

Shared-host n8n port defaults:

  • default host port: NEWS_MEMORY_N8N_PORT=25678
  • reserved/forbidden by default: 15678 (shared-host global n8n runtime)
  • startup preflight in both start scripts fails early when the target n8n port is busy
  • optional override for reserved ports: NEWS_MEMORY_ALLOW_RESERVED_N8N_PORTS=1
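
The preflight behavior above can be sketched as follows. This is a minimal illustration, not the actual logic in the start scripts: the env var names come from this doc, but the reserved-port set and the TCP-connect check are assumptions; consult scripts/dev-up.sh for the real implementation.

```python
# Sketch of the n8n port preflight (assumed logic; the real check lives
# in scripts/dev-up.sh and may differ).
import os
import socket

RESERVED_N8N_PORTS = {15678}  # shared-host global n8n runtime

def port_is_busy(port: int, host: str = "127.0.0.1") -> bool:
    """Return True when something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0

def preflight_n8n_port(port: int, allow_reserved: bool) -> tuple[bool, str]:
    """Decide whether startup may proceed for the requested n8n host port."""
    if port in RESERVED_N8N_PORTS and not allow_reserved:
        return False, (f"port {port} is reserved; set "
                       "NEWS_MEMORY_ALLOW_RESERVED_N8N_PORTS=1 to override")
    if port_is_busy(port):
        return False, f"port {port} is already in use"
    return True, "ok"

if __name__ == "__main__":
    port = int(os.environ.get("NEWS_MEMORY_N8N_PORT", "25678"))
    allow = os.environ.get("NEWS_MEMORY_ALLOW_RESERVED_N8N_PORTS") == "1"
    ok, reason = preflight_n8n_port(port, allow)
    print(("PASS" if ok else "FAIL") + ": " + reason)
```

Note the ordering: the reserved-port check fires before any socket probe, so a forbidden port fails fast even when nothing is listening on it.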

Health and Readiness

  • /v1/health/live
  • /v1/health/ready
  • /v1/health/info
  • /v1/fabric/metadata
  • /metrics
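
A minimal readiness probe over the surfaces above can look like this. The base URL is a placeholder, the status-code classification is an assumption (not part of any documented contract), and TLS handling for the internal CA is omitted for brevity.

```python
# Minimal probe of the health surfaces listed above (base URL and verdict
# mapping are illustrative assumptions).
import json
import urllib.error
import urllib.request

HEALTH_ENDPOINTS = ["/v1/health/live", "/v1/health/ready", "/v1/health/info"]

def classify(status: int) -> str:
    """Map an HTTP status to a coarse operator verdict."""
    if 200 <= status < 300:
        return "healthy"
    if status in (502, 503, 504):
        return "upstream-or-readiness-failure"
    return "unexpected"

def probe(base_url: str) -> dict:
    """Probe each health endpoint and return path -> verdict."""
    results = {}
    for path in HEALTH_ENDPOINTS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                results[path] = classify(resp.status)
        except urllib.error.HTTPError as exc:
            results[path] = classify(exc.code)
        except OSError:
            results[path] = "unreachable"
    return results

if __name__ == "__main__":
    print(json.dumps(probe("https://example.internal"), indent=2))
```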

Additional operator surfaces:

  • /v1/research/operational-slo-gates
  • /v1/research/security-compliance-gates
  • /v1/research/incident-readiness-gates
  • /v1/research/tls-proxy-readiness
  • /v1/research/secret-readiness
  • /v1/research/paddle-readiness

Current healthcheck cadence policy (maintained stack):

  • postgres, minio, redis: 120s interval, 5s timeout, 3 retries, 60s start period
  • api: 60s interval, 3s timeout, 3 retries, 30s start period
  • n8n: 60s interval, 5s timeout, 3 retries, 45s start period
  • reverse-proxy and observability: host-managed in integrated mode, outside this compose stack

Monitoring

Current stack includes:

  • OpenTelemetry collector
  • host-managed Prometheus/Grafana in integrated deployments
  • runtime and readiness helper scripts

Logs and Diagnostics

Primary diagnostic paths:

  • API container logs from the maintained Compose stack
  • host reverse-proxy logs when TLS or upstream routing fails
  • n8n container logs when scheduled orchestration degrades
  • host observability surfaces for health, readiness, and freshness signals

Useful scripts:

  • scripts/ops/run_host_release_check.py
  • scripts/ops/run_operational_release_checks.py
  • scripts/ops/run_live_platform_journey.py
  • scripts/ops/evaluate_n8n_live_readiness.py
  • scripts/ops/verify_runtime_materialization_drift.py
  • scripts/ops/evaluate_secret_readiness.py
  • scripts/ops/evaluate_fabric_combination_consumer.py
  • scripts/ops/run_lowcpu_soak_probe.py
  • scripts/ops/query_gitea_actions_runs.py

Gitea Actions run API compatibility:

  • in this environment /api/v1/repos/{owner}/{repo}/actions/runs can return 404
  • maintained run-status collection must use scripts/ops/query_gitea_actions_runs.py
  • the helper uses /actions/runs first and automatically falls back to /actions/tasks on 404
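
The fallback behavior above can be expressed as a small selection function. The fetch callable is injected so the logic stays testable; this is a sketch of the documented behavior, and scripts/ops/query_gitea_actions_runs.py remains the maintained tool.

```python
# Sketch of the documented 404 fallback: prefer /actions/runs, fall back
# to /actions/tasks. The fetch function is injected for testability.
from typing import Callable, Tuple

RUNS_SUFFIX = "/actions/runs"
TASKS_SUFFIX = "/actions/tasks"

def fetch_runs(repo_api_base: str,
               fetch: Callable[[str], Tuple[int, dict]]) -> Tuple[str, dict]:
    """Return (endpoint_used, payload), preferring /actions/runs."""
    status, payload = fetch(repo_api_base + RUNS_SUFFIX)
    if status == 404:
        status, payload = fetch(repo_api_base + TASKS_SUFFIX)
        if status == 404:
            raise RuntimeError("neither runs nor tasks endpoint is available")
        return TASKS_SUFFIX, payload
    return RUNS_SUFFIX, payload
```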

Release-check telemetry snapshot:

  • run_operational_release_checks.py now emits healthcheck_load in report JSON
  • block includes:
    • host CPU sample (usage_percent_1s, best effort)
    • exec_create sample count over a short time window
  • fallback behavior:
    • if CPU sampling fails, cpu_sample.available=false and error text is preserved
  • if the timeout command is unavailable on the host, exec_create_sample.available=false with a fallback error marker
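
A defensive reader for that block might look like the sketch below. Field names cpu_sample, exec_create_sample, usage_percent_1s, and available come from this doc; the count and error keys are assumptions, so treat the actual report JSON as authoritative.

```python
# Defensive reader for the healthcheck_load block (field names partly
# assumed; the real report schema is authoritative).
def summarize_healthcheck_load(report: dict) -> dict:
    load = report.get("healthcheck_load", {})
    cpu = load.get("cpu_sample", {})
    execs = load.get("exec_create_sample", {})
    return {
        # None signals "sample unavailable", mirroring available=false.
        "cpu_percent": cpu.get("usage_percent_1s") if cpu.get("available") else None,
        "cpu_error": None if cpu.get("available") else cpu.get("error"),
        "exec_creates": execs.get("count") if execs.get("available") else None,
    }
```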

Example:

python scripts/ops/run_host_release_check.py --output-dir reports/operational-release-checks

Runtime materialization drift verify:

python scripts/ops/verify_runtime_materialization_drift.py \
--host <internal-runtime-redacted><internal-runtime-redacted> \
--repo-path-on-host /home/administrator/jhf-spool-main \
--base-url https://<internal-runtime-redacted> \
--insecure \
--output reports/runtime-materialization/latest.json

The verifier compares:

  • repo-owned runtime contract and compose truth
  • active host compose materialization
  • running container env/labels/mounts/networks
  • app readback from /v1/health/info

It fails on missing keys, undocumented non-interpolated overrides, container/app readback mismatch, and externally visible ingress drift.

Standard Restart Order

When the maintained stack needs a bounded restart:

  1. verify database and storage dependencies first
  2. start or recover PostgreSQL, MinIO, Qdrant, and Redis
  3. start the API service
  4. verify host-managed proxy/TLS edge
  5. verify n8n only after the API is healthy
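
The ordering above can be encoded as a command plan. The Compose service names and the readiness URL are assumptions for illustration; check compose.dev.yaml for the real service names before using anything like this.

```python
# Ordered restart plan matching the documented sequence (service names
# and base URL are illustrative assumptions).
COMPOSE = "docker compose -f compose.dev.yaml"

def restart_plan() -> list:
    """Return shell commands in the documented bounded-restart order."""
    return [
        f"{COMPOSE} up -d postgres minio qdrant redis",  # steps 1-2: data deps
        f"{COMPOSE} up -d api",                          # step 3: API after deps
        "curl -fsS https://<base-url>/v1/health/ready",  # step 4: verify edge/readiness
        f"{COMPOSE} up -d n8n",                          # step 5: n8n last
    ]
```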

Compose Core Stack Recovery

If host TLS is up but the product is unavailable:

  1. inspect whether the core jhf-spool services still exist
  2. recover missing core containers before debugging host proxy or TLS
  3. verify /v1/health/live and /v1/health/ready
  4. verify docs and OpenAPI surfaces
  5. re-run the operational release check if the outage affected orchestration or gates

Known Failure Modes

  • core stack partly absent while host proxy remains up
  • TLS trust issues mistaken for service outages
  • external source/provider drift
  • inactive n8n workflows causing stale automation

Typical 502 Cases

  • host proxy upstream points to stale backend target
  • upstream API container stopped or absent
  • DNS lookup mismatch between host proxy config and active backend target

Typical TLS Cases

  • local trust failure against the internal Caddy CA
  • HTTP/HTTPS mismatch during manual checks
  • wrong public base URL or reverse-proxy configuration

Typical n8n Cases

  • workflows deployed but inactive
  • stale NEWS_MEMORY_API_BASE_URL
  • invalid or missing shared API key
  • workflow host path still pointing at an old domain/base URL

Runtime Dependencies

Hard:

  • PostgreSQL
  • MinIO
  • Qdrant
  • Redis

Optional:

  • n8n
  • Paddle
  • NewsAPI
  • external source providers

Weak Host Mode

For low-resource hosts, run the maintained stack with:

docker compose -f compose.dev.yaml -f compose.lowcpu.yaml up -d --build

This keeps the same healthcheck surfaces but stretches intervals to reduce healthcheck exec load.

For lightweight release telemetry sampling on weak hosts:

  • keep telemetry windows short (30s default)
  • avoid full monitoring suites for routine verification
  • use release report snapshots as regression evidence between rollouts

For 24h low-cpu soak evidence collection:

python scripts/ops/run_lowcpu_soak_probe.py \
--host <internal-runtime-redacted><internal-runtime-redacted> \
--stack-prefix jhf-spool- \
--samples 24 \
--sample-interval-seconds 3600 \
--telemetry-window-seconds 30

Host-target note:

  • when running the collector on the same host as the stack, use --host <internal-runtime-redacted>
  • user-prefixed self-targets like <internal-runtime-redacted><internal-runtime-redacted> are normalized to local execution by the collector to avoid SSH self-auth failures
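
The self-target normalization described above amounts to stripping an optional user@ prefix and comparing against the local host's names. This is a sketch of the documented behavior; the real normalization lives in run_lowcpu_soak_probe.py.

```python
# Sketch of the collector's self-target normalization: user-prefixed
# targets that resolve to the local host execute locally, not over SSH.
def normalize_host(host_arg: str, local_hostnames: set) -> str:
    """Return "local" for self-targets, else the SSH target unchanged."""
    target = host_arg.rsplit("@", 1)[-1]  # strip an optional user@ prefix
    if target in local_hostnames:
        return "local"
    return host_arg
```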

Artifacts are written to reports/healthcheck-soak/ as:

  • per-sample JSONL telemetry stream
  • summarized metrics JSON

Credential Drift Verify (Spool)

For jhf-spool auth-resilience checks (valid key, invalid key, rotated key) against /v1/search/semantic:

python scripts/ops/verify_spool_auth_rotation.py \
--base-url https://<internal-runtime-redacted> \
--valid-key "$VALID_SPOOL_KEY" \
--invalid-key "$INTENTIONALLY_INVALID_KEY" \
--rotated-key "$ROTATED_SPOOL_KEY" \
--insecure \
--strict \
--output reports/auth-rotation/latest.json

The output is machine-readable and separates:

  • auth drift (401 on invalid key while valid key path is healthy)
  • rotation recovery (rotated key succeeds again)
  • potential platform/network outage (valid path not healthy)
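
One way to read that separation is as a priority-ordered classification over the three probe results. The status-code mapping below is an assumption about what "healthy", "drift", and "outage" mean in HTTP terms; the machine-readable report from verify_spool_auth_rotation.py is the authoritative source.

```python
# Assumed classification over the three key probes against
# /v1/search/semantic (the script's actual report schema governs).
def classify_rotation(valid_status: int, invalid_status: int,
                      rotated_status: int) -> str:
    if valid_status != 200:
        return "platform-or-network-outage"  # valid path not healthy
    if invalid_status != 401:
        return "auth-drift"                  # invalid key not rejected with 401
    if rotated_status != 200:
        return "rotation-not-recovered"      # rotated key still failing
    return "healthy"
```

The outage check comes first on purpose: when the valid-key path is down, the invalid- and rotated-key results say nothing about auth behavior.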

Canonical consumer contract:

  • docs/CREDENTIAL_ROTATION_CONTRACT.md (spool-auth-rotation-v1)

License: AGPLv3
Learn more: https://helpifyr.com