Troubleshoot Production Issues

Use this page when a live Helpifyr environment is degraded and you need to narrow the failure before you restart, redeploy, or escalate.

When to use this page

user-visible behavior looks broken
the stack is unhealthy, stale, or internally inconsistent
you need the fastest safe path from a symptom to a bounded owner runbook

Prerequisites

You can read the public-safe health and readiness surfaces.
You can capture evidence before mutating the runtime.
You know whether you are allowed to run verification-only commands or owner-approved recovery actions.

Troubleshooting principle

Start with readback, not mutation.

The first goal is to separate:

control-plane health
runtime materialization drift
security or readiness posture
product-specific failures that only look like stack-wide problems

Architecture / Flow

Step-by-step procedure

1. Capture the symptom in operator language

Record:

what failed first
whether the failure is total, degraded, stale, or intermittent
which user path, agent path, or automation lane exposed it

Do this before deeper troubleshooting so later evidence can be compared to the original symptom.
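
One way to keep the symptom comparable with later evidence is to record it as a small structured object. The sketch below is illustrative only: the `SymptomRecord` class and its field names are hypothetical and are not part of any Helpifyr schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The four failure kinds named in this step.
FAILURE_KINDS = {"total", "degraded", "stale", "intermittent"}


@dataclass(frozen=True)
class SymptomRecord:
    """Hypothetical record shape for the original symptom."""

    first_failure: str  # what failed first, in operator language
    kind: str           # total, degraded, stale, or intermittent
    exposed_by: str     # user path, agent path, or automation lane
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # Reject kinds outside the four categories so records stay comparable.
        if self.kind not in FAILURE_KINDS:
            raise ValueError(f"kind must be one of {sorted(FAILURE_KINDS)}")
```

The record is frozen so that the original symptom cannot be edited after the fact while the investigation is under way.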

2. Read the stack-level health surfaces first

Start with the same surfaces used for healthy-day operations:

GET /health
GET /api/v1/platform/services
GET /api/v1/observability/readiness
GET /api/v1/security/readiness

This tells you whether the problem is:

a broad outage
a readiness-only regression
a security posture issue
a user-visible failure that does not yet show up as stack-wide red
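
The mapping from surface results to the four outcomes above can be sketched as follows. This is a minimal illustration under stated assumptions: it treats an HTTP 200 as "ok" (the real surfaces may report degraded state in the response body), and the `probe` and `classify` helpers and their priority order are hypothetical, not a documented Helpifyr contract.

```python
from urllib.error import URLError
from urllib.request import urlopen

# The four read-only surfaces listed in this step.
SURFACES = {
    "health": "/health",
    "services": "/api/v1/platform/services",
    "observability": "/api/v1/observability/readiness",
    "security": "/api/v1/security/readiness",
}


def probe(base_url, timeout=5.0):
    """Return {surface name: True if the GET answered with HTTP 200}."""
    results = {}
    for name, path in SURFACES.items():
        try:
            with urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
                results[name] = resp.status == 200
        except (URLError, OSError):
            # Connection errors, timeouts, and 4xx/5xx all count as not ok.
            results[name] = False
    return results


def classify(results):
    """Map probe results to one of the four outcomes (assumed priority order)."""
    if not results["health"] or not results["services"]:
        return "broad outage"
    if not results["security"]:
        return "security posture issue"
    if not results["observability"]:
        return "readiness-only regression"
    return "user-visible failure not yet stack-wide red"
```

Keeping `probe` separate from `classify` means the raw results can be stored as evidence before any interpretation is applied.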

3. Check for repo-truth versus runtime-truth drift

If the symptom suggests deployment skew, stale state, or partial rollout, run:

python ./scripts/verify_runtime_materialization.py --check

This is the preferred first comparison because it stays in repo-owned verification territory and avoids ad-hoc host mutations.
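
To keep the output of the check as evidence, the command can be wrapped so that its exit code and output are captured together. The sketch below assumes that exit code 0 means no drift was reported; that convention is an assumption and should be confirmed against the script itself. The `run_check` helper is hypothetical.

```python
import subprocess
import sys

# The verification command from this step, run with the current interpreter.
VERIFY_CMD = (sys.executable, "./scripts/verify_runtime_materialization.py", "--check")


def run_check(cmd=VERIFY_CMD):
    """Run a verification command and capture its output as evidence."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        # Assumption: a zero exit code means the check reported no drift.
        "clean": proc.returncode == 0,
    }
```

The wrapper only reads and records; it does not retry or attempt any repair, which keeps it inside verification-only territory.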

4. Use the narrowest owner runbook

Once the likely boundary is clearer, move to the corresponding bounded runbook. The repositories that own each boundary are listed under Owner Handoff.

If restart or redeploy looks tempting before you have a boundary, pause and prove the boundary first.

5. Only then consider guarded mutation

If verification shows the recovery really needs a bounded runtime action, the public-safe guarded command family is:

bash ./scripts/verify-runtime-guardrails.sh

and only after the owner path confirms it:

bash ./scripts/redeploy-host-stack.sh
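
The ordering in this step can be expressed as an explicit gate that lists what still blocks a redeploy. This is an illustrative sketch only: the `redeploy_blockers` function and its three inputs are hypothetical and are not part of the guardrail scripts.

```python
def redeploy_blockers(boundary_proven, guardrails_passed, owner_confirmed):
    """Return the reasons a redeploy should not run yet (empty list = clear)."""
    blockers = []
    if not boundary_proven:
        blockers.append("failure boundary not proven")
    if not guardrails_passed:
        blockers.append("verify-runtime-guardrails.sh has not passed")
    if not owner_confirmed:
        blockers.append("owner path has not confirmed the redeploy")
    return blockers
```

Returning the reasons rather than a single boolean gives the operator something concrete to record when a redeploy is deferred.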

6. Verify against the original symptom

After recovery, do not stop at a green endpoint. Re-run:

the same health and readiness surfaces
the same repo-owned verification command
one representative user or workflow path that originally failed
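
Comparing the pre-recovery and post-recovery readings can be sketched as a simple diff over two snapshots. The snapshot shape (surface name mapped to a boolean) and the `compare_snapshots` helper are assumptions made for illustration.

```python
def compare_snapshots(before, after):
    """Group surfaces by how they changed between two readings."""
    report = {"recovered": [], "still_failing": [], "regressed": []}
    for name in sorted(before):
        was_ok = before[name]
        is_ok = after.get(name, False)  # a missing reading counts as failing
        if not was_ok and is_ok:
            report["recovered"].append(name)
        elif not was_ok and not is_ok:
            report["still_failing"].append(name)
        elif was_ok and not is_ok:
            report["regressed"].append(name)
    return report
```

A non-empty `regressed` list is a sign that the recovery action introduced new drift, even if the original symptom has cleared.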

Example production triage sequence

curl -s <fabric-base-url>/health
curl -s <fabric-base-url>/api/v1/platform/services
curl -s <fabric-base-url>/api/v1/observability/readiness
curl -s <fabric-base-url>/api/v1/security/readiness
python ./scripts/verify_runtime_materialization.py --check
bash ./scripts/verify-runtime-guardrails.sh

If those checks prove a bounded restart or redeploy is required:

bash ./scripts/redeploy-host-stack.sh

Verification

A production issue is only considered resolved when:

the original symptom was recorded
stack-level health and readiness were captured before recovery
repo/runtime drift was checked when relevant
the recovery used the narrowest owner runbook that fit the evidence
the same verification sequence is green after recovery
one representative failing path now works again
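
The checklist above can be evaluated mechanically once the evidence has been gathered. In this sketch the evidence keys and the `unmet_criteria` helper are hypothetical; only the criteria themselves come from this page.

```python
# Hypothetical evidence keys, one per criterion in the checklist.
CRITERIA = {
    "symptom_recorded": "original symptom was recorded",
    "pre_recovery_captured": "health and readiness captured before recovery",
    "drift_checked": "repo/runtime drift was checked",
    "narrowest_runbook_used": "recovery used the narrowest owner runbook",
    "sequence_green": "verification sequence is green after recovery",
    "failing_path_works": "representative failing path works again",
}


def unmet_criteria(evidence, drift_relevant=True):
    """Return the labels of criteria the evidence does not yet satisfy."""
    unmet = []
    for key, label in CRITERIA.items():
        # The drift check only applies when drift was a plausible cause.
        if key == "drift_checked" and not drift_relevant:
            continue
        if not evidence.get(key, False):
            unmet.append(label)
    return unmet
```

An empty result means every applicable criterion is met; anything else names exactly what is still missing before the issue can be called resolved.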

Common failure modes

Restarting before the failure boundary is clear

Problem:

this can hide the real root cause and create more drift.

Better path:

run health, readiness, and materialization verification first

Treating readiness as the same thing as health

Problem:

services can answer while the platform is still not operationally ready.

Better path:

compare /health with /api/v1/observability/readiness and /api/v1/security/readiness

Escalating to host mutation without guardrails

Problem:

a real production issue turns into an uncontrolled runtime mutation.

Better path:

use bash ./scripts/verify-runtime-guardrails.sh
then follow the owner runbook

Owner Handoff

cross-stack troubleshooting truth owner: JaddaHelpifyr/helpifyr-fabric
environment recovery support: JaddaHelpifyr/jhf-openclaw-env
deployment recovery support: JaddaHelpifyr/jhf-deployment
