Aller au contenu principal

Troubleshoot Production Issues

Use this page when a live Helpifyr environment is degraded and you need to narrow the failure before you restart, redeploy, or escalate.

When to use this page

  • user-visible behavior looks broken
  • the stack is unhealthy, stale, or internally inconsistent
  • you need the fastest safe path from symptom to bounded owner runbook

Prerequisites

  • You can read the public-safe health and readiness surfaces.
  • You can capture evidence before mutating the runtime.
  • You know whether you are allowed to run verification-only commands or owner-approved recovery actions.

Troubleshooting principle

Start with readback, not mutation.

The first goal is to separate:

  • control-plane health
  • runtime materialization drift
  • security or readiness posture
  • product-specific failures that only look like stack-wide problems

Architecture / Flow

Step-by-step procedure

1. Capture the symptom in operator language

Record:

  • what failed first
  • whether the failure is total, degraded, stale, or intermittent
  • which user path, agent path, or automation lane exposed it

Do this before deeper troubleshooting so later evidence can be compared to the original symptom.

2. Read the stack-level health surfaces first

Start with the same surfaces used for healthy-day operations:

GET /health
GET /api/v1/platform/services
GET /api/v1/observability/readiness
GET /api/v1/security/readiness

This tells you whether the problem is:

  • a broad outage
  • a readiness-only regression
  • a security posture issue
  • a user-visible failure that does not yet show up as stack-wide red

3. Check for repo-truth versus runtime-truth drift

If the symptom suggests deployment skew, stale state, or partial rollout, run:

python ./scripts/verify_runtime_materialization.py --check

This is the preferred first comparison because it stays in repo-owned verification territory and avoids ad-hoc host mutations.

4. Use the narrowest owner runbook

Once the likely boundary is clearer, move to the corresponding bounded runbook:

If restart or redeploy looks tempting before you have a boundary, pause and prove the boundary first.

5. Only then consider guarded mutation

If verification shows the recovery really needs a bounded runtime action, the public-safe guarded command family is:

bash ./scripts/verify-runtime-guardrails.sh

and only after the owner path confirms it:

bash ./scripts/redeploy-host-stack.sh

6. Verify against the original symptom

After recovery, do not stop at a green endpoint. Re-run:

  • the same health and readiness surfaces
  • the same repo-owned verification command
  • one representative user or workflow path that originally failed

Example production triage sequence

curl -s <fabric-base-url>/health
curl -s <fabric-base-url>/api/v1/platform/services
curl -s <fabric-base-url>/api/v1/observability/readiness
curl -s <fabric-base-url>/api/v1/security/readiness
python ./scripts/verify_runtime_materialization.py --check
bash ./scripts/verify-runtime-guardrails.sh

If those checks prove a bounded restart or redeploy is required:

bash ./scripts/redeploy-host-stack.sh

Verification

A production issue is only considered resolved when:

  1. the original symptom was recorded
  2. stack-level health and readiness were captured before recovery
  3. repo/runtime drift was checked when relevant
  4. the recovery used the narrowest owner runbook that fit the evidence
  5. the same verification sequence is green after recovery
  6. one representative failing path now works again

Common failure modes

Restarting before the failure boundary is clear

Problem:

  • this can hide the real root cause and create more drift.

Better path:

  • run health, readiness, and materialization verification first

Treating readiness as the same thing as health

Problem:

  • services can answer while the platform is still not operationally ready.

Better path:

  • compare /health with /api/v1/observability/readiness and /api/v1/security/readiness

Escalating to host mutation without guardrails

Problem:

  • a real production issue turns into an uncontrolled runtime mutation.

Better path:

  • use bash ./scripts/verify-runtime-guardrails.sh
  • then follow the owner runbook

Owner Handoff

  • cross-stack troubleshooting truth owner: JaddaHelpifyr/helpifyr-fabric
  • environment recovery support: JaddaHelpifyr/jhf-openclaw-env
  • deployment recovery support: JaddaHelpifyr/jhf-deployment

Source Truth

Next paths