Troubleshoot Production Issues
Use this page when a live Helpifyr environment is degraded and you need to narrow the failure before you restart, redeploy, or escalate.
When to use this page
- user-visible behavior looks broken
- the stack is unhealthy, stale, or internally inconsistent
- you need the fastest safe path from symptom to bounded owner runbook
Prerequisites
- You can read the public-safe health and readiness surfaces.
- You can capture evidence before mutating the runtime.
- You know whether you are allowed to run verification-only commands or owner-approved recovery actions.
Troubleshooting principle
Start with readback, not mutation.
The first goal is to separate:
- control-plane health
- runtime materialization drift
- security or readiness posture
- product-specific failures that only look like stack-wide problems
Architecture / Flow
Step-by-step procedure
1. Capture the symptom in operator language
Record:
- what failed first
- whether the failure is total, degraded, stale, or intermittent
- which user path, agent path, or automation lane exposed it
Do this before deeper troubleshooting so later evidence can be compared to the original symptom.
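One way to keep the original symptom comparable with later evidence is to write it down in a fixed shape. The following is a minimal sketch, not part of the Helpifyr tooling: the `SymptomRecord` class, its field names, and the `FAILURE_SHAPES` set are hypothetical and only mirror the three items listed above.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

# Hypothetical labels that mirror the failure shapes named in this step.
FAILURE_SHAPES = {"total", "degraded", "stale", "intermittent"}


@dataclass
class SymptomRecord:
    first_failure: str  # what failed first, in operator language
    shape: str          # one of FAILURE_SHAPES
    exposed_by: str     # user path, agent path, or automation lane
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Reject free-form shapes so later comparisons stay consistent.
        if self.shape not in FAILURE_SHAPES:
            raise ValueError(f"shape must be one of {sorted(FAILURE_SHAPES)}")

    def to_json(self) -> str:
        # Serialize the record so it can be attached to an incident note.
        return json.dumps(asdict(self), indent=2)
```

A record created at the start of triage can be saved next to the health readback and reread during final verification.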
2. Read the stack-level health surfaces first
Start with the same surfaces used for healthy-day operations:
GET /health
GET /api/v1/platform/services
GET /api/v1/observability/readiness
GET /api/v1/security/readiness
This tells you whether the problem is:
- a broad outage
- a readiness-only regression
- a security posture issue
- a user-visible failure that does not yet show up as stack-wide red
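The readback above can be sketched as a read-only script. This is an illustration under stated assumptions: it treats HTTP 200 as "green" and anything else as "not green", and it maps a red `/health` or `/api/v1/platform/services` to a broad outage. The real response bodies and status conventions of these endpoints are not described on this page, so adjust `classify` to what the surfaces actually return. The function names are hypothetical.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# The four stack-level surfaces named in this step.
SURFACES = [
    "/health",
    "/api/v1/platform/services",
    "/api/v1/observability/readiness",
    "/api/v1/security/readiness",
]


def read_surfaces(base_url, timeout=5.0):
    """Issue read-only GETs; return {path: HTTP status, or None if unreachable}."""
    results = {}
    for path in SURFACES:
        try:
            with urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
                results[path] = resp.status
        except HTTPError as exc:
            # The server answered, but with an error status.
            results[path] = exc.code
        except (URLError, OSError):
            # DNS failure, refused connection, timeout, and similar.
            results[path] = None
    return results


def classify(results):
    """Map surface statuses to the four outcomes listed above (assumed rules)."""
    def green(path):
        return results.get(path) == 200  # assumption: 200 means green

    if not green("/health") or not green("/api/v1/platform/services"):
        return "broad outage"
    if not green("/api/v1/security/readiness"):
        return "security posture issue"
    if not green("/api/v1/observability/readiness"):
        return "readiness-only regression"
    return "no stack-wide red"
```

Calling `classify(read_surfaces("<fabric-base-url>"))` with the real base URL gives a first label to record next to the symptom. A result of "no stack-wide red" corresponds to the last case in the list: a user-visible failure that the stack-level surfaces do not yet show.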
3. Check for repo-truth versus runtime-truth drift
If the symptom suggests deployment skew, stale state, or partial rollout, run:
python ./scripts/verify_runtime_materialization.py --check
This is the preferred first comparison because it stays in repo-owned verification territory and avoids ad-hoc host mutations.
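To keep the verifier's output as evidence, the command can be wrapped so that its exit code and output are captured rather than scrolled past. This is a sketch only: `run_check` is a hypothetical helper, and treating a non-zero exit code as "drift suspected" is an assumption, since the script's exit codes and output format are not documented on this page.

```python
import subprocess
import sys


def run_check(cmd):
    """Run a verification-only command and capture its result without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "cmd": cmd,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        # Assumption: the verifier signals drift with a non-zero exit code.
        "drift_suspected": proc.returncode != 0,
    }


# Example call, run from the repository root:
# evidence = run_check(
#     [sys.executable, "./scripts/verify_runtime_materialization.py", "--check"]
# )
```

Storing the returned dictionary alongside the symptom record gives a before-recovery baseline for the final verification step.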
4. Use the narrowest owner runbook
Once the likely boundary is clearer, move to the corresponding bounded runbook. The owning repositories are listed under Owner Handoff.
If restart or redeploy looks tempting before you have a boundary, pause and prove the boundary first.
5. Only then consider guarded mutation
If verification shows the recovery really needs a bounded runtime action, the public-safe guarded command family is:
bash ./scripts/verify-runtime-guardrails.sh
and only after the owner path confirms it:
bash ./scripts/redeploy-host-stack.sh
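The ordering above (guardrails first, owner confirmation second, redeploy last) can be expressed as a small gate. This is a hedged sketch: `guarded_redeploy` is a hypothetical wrapper, `owner_confirmed` stands in for whatever approval the owner path actually gives, and a zero exit code is assumed to mean the guardrail check passed.

```python
import subprocess

GUARDRAILS = ["bash", "./scripts/verify-runtime-guardrails.sh"]
REDEPLOY = ["bash", "./scripts/redeploy-host-stack.sh"]


def guarded_redeploy(owner_confirmed, run=subprocess.run):
    """Redeploy only if guardrails pass AND the owner path has confirmed."""
    guard = run(GUARDRAILS)
    if guard.returncode != 0:  # assumption: non-zero means guardrails failed
        return "blocked: guardrails failed"
    if not owner_confirmed:
        return "blocked: owner path has not confirmed"
    result = run(REDEPLOY)
    return "redeployed" if result.returncode == 0 else "redeploy failed"
```

The `run` parameter exists so the gate can be exercised with a stub before it is ever pointed at a live host.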
6. Verify against the original symptom
After recovery, do not stop at a green endpoint. Re-run:
- the same health and readiness surfaces
- the same repo-owned verification command
- one representative user or workflow path that originally failed
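Comparing the pre-recovery and post-recovery readbacks makes it harder to stop at a single green endpoint. The sketch below assumes each readback is a mapping from surface path to HTTP status, with 200 meaning green. That shape and the `compare_readbacks` name are illustrative assumptions, not an existing Helpifyr format.

```python
def compare_readbacks(before, after, green=200):
    """Group surfaces by how they changed between two readbacks."""
    report = {"recovered": [], "still_failing": [], "regressed": []}
    for path in sorted(before):
        was_ok = before[path] == green
        is_ok = after.get(path) == green
        if not was_ok and is_ok:
            report["recovered"].append(path)
        elif not was_ok and not is_ok:
            report["still_failing"].append(path)
        elif was_ok and not is_ok:
            # Green before recovery, red after: the recovery introduced this.
            report["regressed"].append(path)
    return report
```

An empty `still_failing` and `regressed` list covers only the first bullet above. The repo-owned verification command and the representative user or workflow path still need to be re-run separately.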
Example production triage sequence
curl -s <fabric-base-url>/health
curl -s <fabric-base-url>/api/v1/platform/services
curl -s <fabric-base-url>/api/v1/observability/readiness
curl -s <fabric-base-url>/api/v1/security/readiness
python ./scripts/verify_runtime_materialization.py --check
bash ./scripts/verify-runtime-guardrails.sh
If those checks prove a bounded restart or redeploy is required:
bash ./scripts/redeploy-host-stack.sh
Verification
A production issue is only considered resolved when:
- the original symptom was recorded
- stack-level health and readiness were captured before recovery
- repo/runtime drift was checked when relevant
- the recovery used the narrowest owner runbook that fit the evidence
- the same verification sequence is green after recovery
- one representative failing path now works again
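The criteria above can be tracked as an explicit checklist so that nothing is marked resolved by omission. This is a minimal sketch with hypothetical field names, each paraphrasing one of the six bullets.

```python
from dataclasses import dataclass, fields


@dataclass
class ResolutionChecklist:
    symptom_recorded: bool = False
    pre_recovery_health_captured: bool = False
    drift_checked_or_not_relevant: bool = False
    narrowest_runbook_used: bool = False
    verification_green_after_recovery: bool = False
    representative_path_works: bool = False

    def missing(self):
        """Return the names of criteria that are not yet satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def resolved(self):
        """True only when every criterion is satisfied."""
        return not self.missing()
```

Listing `missing()` in the incident note shows which evidence is still outstanding before the issue is closed.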
Common failure modes
Restarting before the failure boundary is clear
Problem:
- this can hide the real root cause and create more drift.
Better path:
- run health, readiness, and materialization verification first
Treating readiness as the same thing as health
Problem:
- services can answer while the platform is still not operationally ready.
Better path:
- compare /health with /api/v1/observability/readiness and /api/v1/security/readiness
Escalating to host mutation without guardrails
Problem:
- a real production issue turns into an uncontrolled runtime mutation.
Better path:
- use bash ./scripts/verify-runtime-guardrails.sh
- then follow the owner runbook
Owner Handoff
- cross-stack troubleshooting truth owner: JaddaHelpifyr/helpifyr-fabric
- environment recovery support: JaddaHelpifyr/jhf-openclaw-env
- deployment recovery support: JaddaHelpifyr/jhf-deployment