Troubleshooting

Run These Checks

Fast repo-local checks:

python3 scripts/check_docs_contract.py
python3 scripts/check_module_features_contract.py
python3 scripts/check_runtime_materialization_drift.py
python3 scripts/check_live_runtime_contract.py

Host/runtime checks:

bash scripts/plugin_smoke_test.sh
bash scripts/probe_live_host_readonly.sh
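
To get a single pass/fail signal, a minimal wrapper along these lines works, assuming each script signals problems with a nonzero exit code:

#!/usr/bin/env bash
# Hedged sketch: run every check above and stop at the first failure.
set -euo pipefail
python3 scripts/check_docs_contract.py
python3 scripts/check_module_features_contract.py
python3 scripts/check_runtime_materialization_drift.py
python3 scripts/check_live_runtime_contract.py
bash scripts/plugin_smoke_test.sh
bash scripts/probe_live_host_readonly.sh
echo "all checks passed"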

LocalAI starts but downloads too much

The tested aio-cpu image is convenient but heavy. For leaner production packaging, swap the image later, but keep the OpenAI-compatible embedding endpoint contract stable.

Mem0 recall/capture returns Bad Request

Check in this order (a verification sketch for the first two items follows the list):

  1. LocalAI /v1/embeddings returns 384 values
  2. Qdrant collection uses 384 + Cosine
  3. The LocalAI compatibility patch is present in vendor/mem0-oss.mjs
  4. Qdrant collection is plain, not native-tenant
  5. Qdrant client in the fork is new enough for your server version
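
A quick way to cover checks 1 and 2, assuming jq is installed and HOST plus the collection name match your deployment:

curl -s http://HOST:8088/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"text-embedding-ada-002","input":"dimension check"}' \
| jq '.data[0].embedding | length'

curl -s http://HOST:6333/collections/COLLECTION \
| jq '.result.config.params.vectors | {size, distance}'

The first call should print 384; the second should show size 384 with distance Cosine.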

OpenAI SDK embeddings from LocalAI look wrong

In the validated stack, raw HTTP to LocalAI returned correct 384-d vectors while the SDK path returned an invalid 96-value zero vector. That is why this kit patches the fork to use raw HTTP for LocalAI embeddings.
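
A hedged probe for that symptom over the known-good raw-HTTP path (jq assumed):

curl -s http://HOST:8088/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"text-embedding-ada-002","input":"zero-vector probe"}' \
| jq '.data[0].embedding | {len: length, all_zero: (map(select(. != 0)) | length == 0)}'

A healthy response prints len 384 and all_zero false; if an SDK call in your own code shows anything else while this raw call is healthy, you are hitting the same SDK issue.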

AIO LocalAI returns 500 for /v1/embeddings

Symptom:

  • POST /v1/embeddings fails with 500
  • LocalAI logs mention a broken model path under /models/huggingface:/...
  • text-embedding-ada-002 appears present, but embeddings still fail

Cause:

  • the stock aio-cpu image can fall back to an internal huggingface://... embedding reference
  • that fallback can drift away from the working all-MiniLM-L6-v2 definition used by this kit

Fix in this repository (sketched below):

  1. force MODELS to include /models/text-embedding-ada-002.yaml
  2. bind-mount the known-good localai-model.text-embedding-ada-002.yaml to /models/text-embedding-ada-002.yaml
  3. restart the LocalAI container
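
One hedged way to express steps 1 and 2 is as docker run flags; the kit's installer applies the same override through its compose file, and the image tag and port mapping here are assumptions to adapt:

docker run -d --name jhf-bobbin-localai \
-e MODELS=/models/text-embedding-ada-002.yaml \
-v "$PWD/configs/localai-model.text-embedding-ada-002.yaml:/models/text-embedding-ada-002.yaml:ro" \
-p 8088:8080 \
localai/localai:latest-aio-cpu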

Verify:

curl -s http://HOST:8088/v1/models
curl -s http://HOST:8088/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"text-embedding-ada-002","input":"semantic memory check"}'

Expected:

  • /v1/models lists text-embedding-ada-002
  • /v1/embeddings returns a vector with 384 values

Operator guidance:

  • for production stability, slim remains the safer LocalAI profile
  • aio is fine when you want convenience, but only with the explicit model override now built into this kit

LocalAI says backend not found: transformers or sentencetransformers

Symptom:

  • /v1/embeddings fails even though the model file exists
  • LocalAI logs mention backend not found: transformers or backend not found: sentencetransformers
  • this often appears on latest-aio-cpu when operators expect all-MiniLM-L6-v2 to work through the sentence-transformer backend

Cause:

  • the runtime image does not provide the sentence-transformer backend for that deployment
  • the previous fallback path expected a GGUF artifact that was not guaranteed to exist

Deterministic fix path (a pre-flight sketch follows the steps):

  1. keep the canonical YAML contract:
    • localai-model.text-embedding-ada-002.yaml (configs/localai-model.text-embedding-ada-002.yaml)
  2. ensure the canonical GGUF artifact exists:
    • all-minilm-l6-v2_f16.gguf
  3. ensure installer vars are set:
    • LOCALAI_EMBEDDING_MODEL_FILE=all-minilm-l6-v2_f16.gguf
    • LOCALAI_EMBEDDING_MODEL_URL=https://huggingface.co/LLukas22/all-MiniLM-L6-v2-GGUF/resolve/main/all-minilm-l6-v2_f16.gguf
  4. recreate LocalAI and verify POST /v1/embeddings returns length 384
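
A pre-flight sketch for steps 2 and 3, assuming the installer reads a .env at the repo root and model artifacts live under ./models (adjust both paths to your layout):

grep -E '^LOCALAI_EMBEDDING_MODEL_(FILE|URL)=' .env
test -s models/all-minilm-l6-v2_f16.gguf \
|| curl -fL -o models/all-minilm-l6-v2_f16.gguf \
'https://huggingface.co/LLukas22/all-MiniLM-L6-v2-GGUF/resolve/main/all-minilm-l6-v2_f16.gguf'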

Why this matters:

  • this path removes backend ambiguity and artifact-missing drift from the critical embedding runtime
  • the smoke path now fails closed when embeddings are not functional

AIO LocalAI keeps downloading unrelated models and Mem0 stays down too long

Symptom:

  • jhf-bobbin-localai remains in health: starting
  • logs show downloads for speech, image, vision, or other side models
  • OpenClaw logs show transient fetch failed during that time

Fix (see the sketch after this list):

  • reduce MODELS to /models/text-embedding-ada-002.yaml during recovery
  • let the stack expose only the embedding path needed by Mem0 first
  • reintroduce extra AIO models later only if you actually need them
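
One hedged way to apply the first step, assuming the installer's .env feeds MODELS into the compose file and the service is named jhf-bobbin-localai:

sed -i 's|^MODELS=.*|MODELS=/models/text-embedding-ada-002.yaml|' .env
docker compose up -d --force-recreate jhf-bobbin-localai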

Validated effect:

  • this removed the long startup drag on the live host and allowed Mem0 recall/capture to come back promptly after recreate

OpenClaw update breaks semantic memory

Run:

bash scripts/reapply_after_openclaw_update.sh

If still broken, roll back to memory-core:

python3 scripts/activate_memory_core.py
cd /root/openclaw
docker compose up -d --force-recreate openclaw-gateway

apply_patch warning still appears

That warning is unrelated to the semantic-memory stack. It comes from the OpenClaw runtime/profile tool allowlist and does not by itself mean Mem0 is broken.

Qdrant external host vs local host

If you already have a Qdrant server elsewhere:

  • set USE_LOCAL_QDRANT=0
  • set QDRANT_URL=http://HOST:6333
  • rerun the installer (example below)
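
For example (the installer entry point named here is hypothetical; substitute the kit's actual installer command):

# scripts/install.sh is a placeholder name, not a confirmed path in this kit
USE_LOCAL_QDRANT=0 QDRANT_URL=http://HOST:6333 bash scripts/install.sh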

Mistral 7B embeddings feel worse than MiniLM

That is consistent with the live validation behind this kit.

mistral:7b-instruct-v0.3-q4_K_M can technically generate embeddings, but its retrieval quality for Mem0-style operational recall was usually worse than with all-MiniLM-L6-v2.

Recommendation:

  • keep all-MiniLM-L6-v2 / text-embedding-ada-002 as the primary Mem0 embedding path
  • use local Mistral 7B for compaction, heartbeat, watchdog, and log-triage tasks instead

Want to disable Mem0 but keep the stack ready

Run:

python3 scripts/activate_memory_core.py
cd /root/openclaw
docker compose up -d --force-recreate openclaw-gateway

LocalAI and Qdrant can remain running so Mem0 can be re-enabled later.

Agents hit session file locked and suddenly switch provider/model

Symptom:

  • logs show session file locked (timeout 10000ms)
  • the same agent then shows model fallback decisions such as deepseek/deepseek-chat -> kimi -> ...
  • cicd-ops, main, or hocksie are affected most often

Important diagnosis:

  • this is usually not a Mem0 recall bug
  • the provider switch is often a downstream effect of the lock/timeout
  • in the validated live stack, the biggest trigger was the older agent-native post-compaction-recovery cron jobs before they were converted to the lightweight file-only variant

Why this happens:

  • those recovery jobs run with isolated cron session keys
  • but they still inspect the agent's recent state
  • that can contend with the hot interactive agent:*:main session for the same agent
  • once the session path stalls, the model request times out and OpenClaw falls through its fallback chain
  • a separate validated failure mode was using host-only paths like /root/openclaw/... inside the gateway container; the lightweight jobs must read the mounted container paths under /home/node/.openclaw/...
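
To confirm the mounted path is visible inside the gateway container, a bounded probe like this works:

timeout 10s docker exec openclaw-gateway ls -d /home/node/.openclaw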

Preferred fix in this kit:

  1. convert the older recovery jobs to the lightweight local-mistral / file-only form
  2. disable the overlapping watchdog jobs for the same three workspaces so only one lightweight maintenance path remains

Helper:

bash scripts/migrate_recovery_crons_to_lightweight.sh

The helper converts these recovery jobs:

  • d8201a2f-f816-42bc-acb0-287df0a0050e
  • e9dc7e0c-bbbf-4108-af72-245d3a52c23c
  • 1ca99d11-caa8-4e84-bfaf-8cb3707ad505

Verify:

openclaw cron list --json
timeout 20s docker logs --since 30m --tail 400 openclaw-gateway 2>&1 | grep -E 'session file locked|model fallback decision'

Expected:

  • the three recovery jobs above show agentId: local-mistral
  • their session keys start with agent:local-mistral:post-compaction-recovery-
  • the older overlapping watchdog jobs are disabled
  • lock spikes and surprise provider switches drop sharply
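
To check the first two expectations programmatically (jq assumed; the JSON field names here are assumptions, so adapt the filter to the real cron schema):

openclaw cron list --json \
| jq -r '.. | objects | select(has("id")) | "\(.id) \(.agentId // "?") \(.sessionKey // "?")"' \
| grep -E 'd8201a2f|e9dc7e0c|1ca99d11'

Each matched line should show local-mistral and a session key starting with agent:local-mistral:post-compaction-recovery-.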

Log-read safety note:

  • on live hosts, use bounded log probes only (timeout + --since + --tail)
  • avoid unbounded docker logs calls in operational verification paths

Repeated LocalAI timeout bursts and recreate churn

Symptom:

  • frequent timeout/grpc-style errors around LocalAI probes
  • repeated recreate attempts increase host CPU pressure

Guarded recovery in this repo (sketched after the steps):

  1. use scripts/localai_probe_guard.py for /readyz and /v1/models probes
  2. allow the guard to enter temporary degraded mode when timeout bursts cross threshold
  3. do not run tight recreate loops while degraded mode is active
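
A guard-gated recreate sketch; the nonzero-exit-when-degraded convention and the compose service name are assumptions:

# Hedged sketch: only recreate when the guard reports a healthy probe.
if python3 scripts/localai_probe_guard.py --endpoint models \
--base-url http://<internal-runtime-redacted>:8088/v1; then
  docker compose up -d --force-recreate jhf-bobbin-localai
else
  echo "probe guard degraded; skipping recreate" >&2
fi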

Useful checks:

python3 scripts/localai_probe_guard.py --endpoint models --base-url http://<internal-runtime-redacted>:8088/v1
cat /tmp/jhf-bobbin-localai-guard.prom

Tuning knobs (installer env; illustrative values below):

  • LOCALAI_PROBE_DEGRADE_THRESHOLD
  • LOCALAI_PROBE_TIMEOUT_WINDOW_SECONDS
  • LOCALAI_PROBE_DEGRADE_COOLDOWN_SECONDS
  • LOCALAI_MODELS_MIN_INTERVAL_SECONDS
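
Illustrative .env values only; these are not validated defaults, and the right magnitudes depend on your host:

LOCALAI_PROBE_DEGRADE_THRESHOLD=3
LOCALAI_PROBE_TIMEOUT_WINDOW_SECONDS=120
LOCALAI_PROBE_DEGRADE_COOLDOWN_SECONDS=300
LOCALAI_MODELS_MIN_INTERVAL_SECONDS=30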

AGPLv3. See ../LICENSE.

Learn more at helpifyr.com.