Operate and Monitor
Use this page when the stack is already running and you need a daily operator path for health, readiness, drift visibility, and bounded recovery decisions.
When to use this page
- You need the standard health and readiness readback loop.
- You need to monitor whether repo truth, runtime truth, and user-visible behavior still align.
- You need to decide whether to observe, escalate, restart, or recover.
Prerequisites
- You can access the Fabric health and readiness surfaces.
- You know whether you are verifying local repo truth, deployed runtime truth, or both.
- You can use bounded repo-owned verification commands before host mutation.
Operating model
Healthy operations on Helpifyr are read-first and compare multiple layers:
- docs truth
- contract truth
- runtime health
- observability readiness
- security readiness
- recovery posture
Architecture / Flow
Step-by-step procedure
1. Start with the standard readback sequence
Use the same order documented in the Fabric operations lane:
GET /health
GET /api/v1/platform/services
GET /api/v1/observability/readiness
GET /api/v1/security/readiness
GET /api/v1/recovery/readiness
GET /api/v1/signoff/readiness
This separates:
- coarse service health
- subsystem readiness
- security posture
- recovery posture
- signoff readiness
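The readback order above can be scripted so that every surface is read in the same sequence each time. The following is a minimal sketch, not the Fabric tooling itself: the names `READBACK_PATHS`, `http_status`, and `run_readback` are hypothetical, and it records only the HTTP status code because the response body format is not described on this page.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List

# Paths and order copied from the readback sequence in this step.
READBACK_PATHS: List[str] = [
    "/health",
    "/api/v1/platform/services",
    "/api/v1/observability/readiness",
    "/api/v1/security/readiness",
    "/api/v1/recovery/readiness",
    "/api/v1/signoff/readiness",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for GET url, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def run_readback(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Dict[str, int]:
    """Read every surface in the documented order; never stop at the first failure."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in READBACK_PATHS}
```

Calling `run_readback("<fabric-base-url>")` returns one status per path in the documented order. Reading all six surfaces even when an early one fails keeps the comparison between layers intact.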
2. Check runtime evidence when signals disagree
If one surface looks stale or contradictory, use:
python ./scripts/verify_runtime_materialization.py --check
Use it when:
- docs and runtime disagree
- readiness looks worse than health
- a service exists but its deployment shape is suspect
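The three triggers above can be written as a small decision helper. This is an illustrative sketch only: the function name and its boolean inputs are hypothetical, and how each input is derived (for example, what counts as "readiness looks worse than health") is left to the operator.

```python
def should_check_materialization(
    docs_match_runtime: bool,
    health_ok: bool,
    readiness_ok: bool,
    deployment_shape_suspect: bool,
) -> bool:
    """Return True when any trigger listed in this step applies."""
    docs_and_runtime_disagree = not docs_match_runtime
    readiness_worse_than_health = health_ok and not readiness_ok
    return (
        docs_and_runtime_disagree
        or readiness_worse_than_health
        or deployment_shape_suspect
    )
```

When the helper returns True, the next action is the read-only materialization check shown in this step, not a restart.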
3. Keep bounded guardrails before mutation
Before restart or redeploy work, use:
bash ./scripts/verify-runtime-guardrails.sh
This keeps diagnostics and operator behavior inside the repo-owned low-pressure safety lane.
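One way to enforce this ordering is to gate any mutating action on the guardrail script. The sketch below assumes the script signals success with exit code 0, which this page does not state; `exit_code` and `mutate_if_guardrails_pass` are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

# Command copied from this step.
GUARDRAILS_CMD = ("bash", "./scripts/verify-runtime-guardrails.sh")


def exit_code(cmd: Sequence[str]) -> int:
    """Run cmd and return its exit code without raising on a non-zero exit."""
    return subprocess.run(list(cmd), check=False).returncode


def mutate_if_guardrails_pass(
    mutation: Callable[[], None],
    run: Callable[[Sequence[str]], int] = exit_code,
) -> bool:
    """Run the guardrail check first; call mutation only if it passes.

    Assumption: exit code 0 means the guardrails passed.
    """
    if run(GUARDRAILS_CMD) != 0:
        return False  # Guardrails failed: leave the host untouched.
    mutation()
    return True
```

The `run` parameter exists so the gate can be exercised without touching a real host, for example `mutate_if_guardrails_pass(restart, run=lambda cmd: 1)` never calls `restart`.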
4. Use the matching operations runbook
Typical next choices:
5. Verify behavior, not just endpoints
After any recovery or operator action:
- repeat the same readiness sequence
- repeat the materialization check when relevant
- confirm one representative user, workflow, or consumer path behaves correctly again
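Repeating the sequence is most useful when the result is compared with the state captured before the action. The sketch below assumes each run is stored as a mapping from endpoint path to HTTP status, and that status 200 means "ready"; both are assumptions, and the function names are hypothetical.

```python
from typing import Callable, Dict, List


def compare_readbacks(
    before: Dict[str, int],
    after: Dict[str, int],
    is_ok: Callable[[int], bool] = lambda status: status == 200,
) -> Dict[str, List[str]]:
    """Classify each path in after as recovered, still failing, or regressed."""
    report: Dict[str, List[str]] = {
        "recovered": [],
        "still_failing": [],
        "regressed": [],
    }
    for path, status in after.items():
        # A path missing from before is treated as previously not OK.
        was_ok = is_ok(before.get(path, 0))
        now_ok = is_ok(status)
        if now_ok and not was_ok:
            report["recovered"].append(path)
        elif not now_ok and not was_ok:
            report["still_failing"].append(path)
        elif not now_ok and was_ok:
            report["regressed"].append(path)
    return report


def action_verified(report: Dict[str, List[str]], behavior_ok: bool) -> bool:
    """Endpoints alone are not enough: a representative path must also behave."""
    return not report["still_failing"] and not report["regressed"] and behavior_ok
```

`behavior_ok` stands for the manual or scripted confirmation that one representative user, workflow, or consumer path works again; green endpoints do not replace it.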
Example operator loop
curl -s <fabric-base-url>/health
curl -s <fabric-base-url>/api/v1/platform/services
curl -s <fabric-base-url>/api/v1/observability/readiness
curl -s <fabric-base-url>/api/v1/security/readiness
curl -s <fabric-base-url>/api/v1/recovery/readiness
curl -s <fabric-base-url>/api/v1/signoff/readiness
python ./scripts/verify_runtime_materialization.py --check
bash ./scripts/verify-runtime-guardrails.sh
Verification
Operations posture is considered healthy enough to continue when:
- the standard health and readiness sequence is readable
- no contradiction between signals remains unresolved
- repo/runtime drift is absent or bounded
- no required owner runbook escalation remains open
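The four conditions above can be captured as an explicit checklist so the decision to continue is recorded rather than implied. This is a sketch with hypothetical field names; how each value is determined remains an operator judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperationsPosture:
    readback_readable: bool          # the health and readiness sequence is readable
    unresolved_contradictions: int   # signals that still disagree
    drift_absent_or_bounded: bool    # repo/runtime drift is absent or bounded
    open_runbook_escalations: int    # required owner runbook escalations still open

    def blockers(self) -> List[str]:
        """Return the conditions that currently block continuing."""
        found: List[str] = []
        if not self.readback_readable:
            found.append("readback sequence not readable")
        if self.unresolved_contradictions > 0:
            found.append("unresolved signal contradictions")
        if not self.drift_absent_or_bounded:
            found.append("unbounded repo/runtime drift")
        if self.open_runbook_escalations > 0:
            found.append("open owner runbook escalations")
        return found

    def healthy_enough(self) -> bool:
        """True only when all four conditions hold."""
        return not self.blockers()
```

For example, `OperationsPosture(True, 0, True, 1).blockers()` names the open escalation as the single reason not to continue.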
Common failure modes
Monitoring only /health
Problem:
- you miss deeper readiness and security regressions.
Better path:
- include observability, security, recovery, and signoff readiness
Recovering without comparing to repo truth
Problem:
- a runtime drift issue is treated as a generic outage.
Better path:
- run python ./scripts/verify_runtime_materialization.py --check
Turning every red signal into a restart
Problem:
- operator actions become noisy and non-diagnostic.
Better path:
- pick the narrowest matching runbook first
Owner Handoff
- operations truth owner: JaddaHelpifyr/helpifyr-fabric
- environment and deployment recovery support: JaddaHelpifyr/jhf-openclaw-env, JaddaHelpifyr/jhf-deployment