Runbooks for routine operations and incident recovery. Read this when something breaks.
Incident: site is down
1. curl -4 -sL "https://quiz.thesynergygroup.ch/" --max-time 15 -o /dev/null -w "%{http_code}"
→ 5xx means platform broken; 502/503 likely cluster issue; 404 likely Ingress drift
2. kubectl get pods -A --kubeconfig=$HOME/.kube/coachpilot-prod.yaml | grep -v Running
→ any pods not Running? → describe → events
3. kubectl get nodes
→ any NotReady? → likely Exoscale SG drop (see below)
4. curl -v https://api.coachpilot.ch/health
→ if 404 from nginx: gateway pod down OR Ingress rule missing
If the gateway returns an nginx 404 but pods are Running, check that the Ingress secretName matches the secret name stored on the cert resource. The stored name is often <name>-secret, not <name>. See reference_exoscale_coachpilot_ops.md.
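If step 1 is ever scripted, the status-code reading can be written down as a small lookup. This is a minimal sketch: the function name triage_status is hypothetical, and the hints only restate the rules of thumb from step 1.

```python
def triage_status(code: int) -> str:
    """Map the HTTP status from the step 1 curl check to a likely cause."""
    # 502/503 are checked first because they are the more specific 5xx case.
    if code in (502, 503):
        return "likely cluster issue: check pods and nodes (steps 2 and 3)"
    if 500 <= code <= 599:
        return "platform broken"
    if code == 404:
        return "likely Ingress drift: check Ingress rules and secretName"
    if 200 <= code <= 399:
        return "site is answering"
    return f"unexpected status {code}: investigate manually"


if __name__ == "__main__":
    for code in (200, 404, 500, 502):
        print(code, "->", triage_status(code))
```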
Lesson #42: SG drop after nodepool resize
After ANY SKS node-pool resize, scale, or type-upgrade, verify all workers have at least one Security Group attached. The cluster's CCM does NOT manage worker SG attachments; new instances come up with security-groups: [] and Exoscale default-denies inbound, killing the NLB. All in-cluster ingress dies even though pods show Running.
Caused a 24h outage of quiz/assess/app.coachpilot.ch on 2026-04-26→27.
Recovery:
python3 "10 Projects/Agent Zero/repos/exoscale-deploy-kit/repair_worker_sgs.py" \
a2e3aa09-0be5-4ee6-993d-2d0aa506d125
NLB health checks recover within 10–20 seconds. The script is idempotent, so re-running it is safe.
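To confirm the state before or after the repair, the check comes down to finding instances with an empty Security Group list. The sketch below works on plain dictionaries and assumes each instance has a name and a security-groups list (the key name follows the security-groups: [] symptom described above). How the instance list is fetched is left out, and the sample data is made up.

```python
from typing import Any, Dict, List


def workers_missing_sg(instances: List[Dict[str, Any]]) -> List[str]:
    """Return the names of instances with no Security Group attached.

    A missing "security-groups" key is treated the same as an empty list.
    """
    return [
        inst.get("name", "<unnamed>")
        for inst in instances
        if not inst.get("security-groups")
    ]


if __name__ == "__main__":
    # Hypothetical sample data, not real worker names.
    sample = [
        {"name": "pool-a-1", "security-groups": [{"id": "sg-1"}]},
        {"name": "pool-a-2", "security-groups": []},
        {"name": "pool-a-3"},
    ]
    print(workers_missing_sg(sample))  # ['pool-a-2', 'pool-a-3']
```

An empty result means every worker has at least one Security Group attached.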
Two related gotchas:
- exoscale.api.v2.Client(url=...) is silently ignored. Always use Client(key, sec, zone='ch-dk-2'). Pass url= and every call hits Geneva regardless, giving phantom 404s on Zurich resources.
- API keys are scoped to a specific Exoscale account/org. For the-synergy-group-ag use the claude_fix_nlb key in vault apis/exoscale/claude_fix_nlb.
Open: a CronJob auto-remedy for this (Lesson #42) is on the deferred list. Today it's manual.
kubeconfig expired (24h)
Symptom: kubectl returns "the server has asked for the client to provide credentials".
Recovery: see 10 Deployment §kubeconfig — 24h TTL.
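Because the TTL is fixed at 24h, a script can warn before kubectl starts failing. A minimal sketch, assuming the kubeconfig file's modification time is the moment it was last issued (an assumption, not something the platform guarantees). The function name is hypothetical; the path is the one used in the site-down runbook.

```python
import os
import time
from typing import Optional

KUBECONFIG_TTL_SECONDS = 24 * 60 * 60  # 24h TTL stated in this runbook


def kubeconfig_expired(path: str, now: Optional[float] = None) -> bool:
    """Return True if the file at `path` is at least 24h old.

    Uses the file's mtime as a stand-in for the issue time (assumption).
    """
    current = time.time() if now is None else now
    return current - os.path.getmtime(path) >= KUBECONFIG_TTL_SECONDS


if __name__ == "__main__":
    path = os.path.expanduser("~/.kube/coachpilot-prod.yaml")
    if os.path.exists(path):
        print("expired" if kubeconfig_expired(path) else "still valid")
    else:
        print("kubeconfig not found:", path)
```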
Postgres migration didn't apply
Symptom: gateway logs column "<new_col>" does not exist.
Recovery: re-apply the migration. Migrations are idempotent, so re-running one that already applied is harmless.
kubectl exec -i postgres-0 -n coachpilot -- psql -U coachpilot -d coachpilot \
< migrations/00XX_<name>.sql
If unsure which migration is missing: query information_schema.columns to see what exists.
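One way to run that check is to list a table's columns through the same kubectl exec path used for the migration. The sketch below only builds the command and does not run it. The pod, namespace, user, and database come from the recovery command above; the function name and the tenant_brand example table are illustrative.

```python
import shlex
from typing import List


def list_columns_cmd(table: str) -> List[str]:
    """Build a kubectl command that lists the columns of `table`."""
    # Allow only simple identifiers so the name can be inlined into SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    sql = (
        "SELECT column_name, data_type FROM information_schema.columns "
        f"WHERE table_name = '{table}' ORDER BY ordinal_position;"
    )
    return [
        "kubectl", "exec", "-i", "postgres-0", "-n", "coachpilot", "--",
        "psql", "-U", "coachpilot", "-d", "coachpilot", "-c", sql,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command rather than executing it.
    print(shlex.join(list_columns_cmd("tenant_brand")))
```

Comparing the printed columns against the migration file shows which migration has not been applied.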
OPcache serving stale PHP (MMC)
Symptom: deployed a .php file but the change isn't visible.
Recovery:
# Via SSH:
cd <wp-root>
wp eval 'opcache_invalidate("/full/path/to/file.php", true); echo "OK";'
Or via Python paramiko (see 10 Deployment §MMC deploy). Touching the file also helps (forces filemtime update).
If that doesn't fix it: the file probably wasn't uploaded successfully. Re-SFTP + verify size matches.
Snapshot restore (rollback a config)
# Find the snapshot id
curl -s "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/snapshots?client=mmc&limit=20" \
-H "Authorization: Bearer <BEARER_MMC>" | python3 -m json.tool
# Restore it
curl -X POST "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/restore?client=mmc&snapshot=<id>" \
-H "Authorization: Bearer <BEARER_MMC>"
A new snapshot is created for the restore action itself, so this is reversible.
For tenant_brand there's no snapshot system today — that's a deferred item. To roll back a brand change, manually PATCH the previous values.
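The two snapshot URLs differ only in path suffix and query parameters, so they can be built by small helpers when the rollback is scripted. This is a sketch: the function names are hypothetical, and only the URL shapes shown in the curl commands above are assumed.

```python
from urllib.parse import urlencode

API_BASE = "https://api.coachpilot.ch/api/v1/assess/framework"


def snapshots_url(framework: str, client: str, limit: int = 20) -> str:
    """URL that lists recent snapshots for a framework and tenant."""
    query = urlencode({"client": client, "limit": limit})
    return f"{API_BASE}/{framework}/snapshots?{query}"


def restore_url(framework: str, client: str, snapshot_id: str) -> str:
    """URL to POST to in order to restore one snapshot."""
    query = urlencode({"client": client, "snapshot": snapshot_id})
    return f"{API_BASE}/{framework}/restore?{query}"


if __name__ == "__main__":
    print(snapshots_url("money_archetypes", "mmc"))
    print(restore_url("money_archetypes", "mmc", "<id>"))
```

Both requests still need the Authorization: Bearer header shown in the curl commands.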
Smoke tests (post-deploy)
Tenant brand fields surface correctly
curl -s "https://api.coachpilot.ch/api/v1/assess/tenant/mmc/brand" \
-H "Authorization: Bearer $BEARER_MMC" | python3 -m json.tool
Verify the new D4 fields (archetype_scores_webhook, etc.) are present with MMC's backfilled values.
Calibration endpoint
curl -s -X POST "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/calibrate?client=mmc" \
-H "Content-Type: application/json" \
-d '{"scores":{"victim":65,"martyr":22,"hero":40}}' | python3 -m json.tool
Expected: martyr lifts from 22 → 32, applied_rules includes victim_martyr_cooccurrence.
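The expectation above can be checked mechanically once the response is parsed. The sketch assumes the response JSON carries the calibrated values under a scores key next to applied_rules. The scores key is an assumption about the response shape; applied_rules and the rule name come from the expected result above.

```python
from typing import Any, Dict, List

EXPECTED_MARTYR = 32
EXPECTED_RULE = "victim_martyr_cooccurrence"


def calibration_problems(response: Dict[str, Any]) -> List[str]:
    """Return a list of mismatches against the documented expectation."""
    problems = []
    martyr = response.get("scores", {}).get("martyr")
    if martyr != EXPECTED_MARTYR:
        problems.append(f"martyr is {martyr!r}, expected {EXPECTED_MARTYR}")
    if EXPECTED_RULE not in response.get("applied_rules", []):
        problems.append(f"{EXPECTED_RULE} missing from applied_rules")
    return problems


if __name__ == "__main__":
    # Hypothetical response matching the documented expectation.
    ok = {
        "scores": {"victim": 65, "martyr": 32, "hero": 40},
        "applied_rules": ["victim_martyr_cooccurrence"],
    }
    print(calibration_problems(ok))  # []
```

An empty list means the smoke test passed.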
Resolved framework
curl -s "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/resolved?client=mmc" \
| python3 -m json.tool | head -50
Should include _tenant_id: "mmc", _tenant_urls: {...}, scoring.engine.archetype_calibration.rules.
Public quiz emits theme provider
curl -s "https://quiz.thesynergygroup.ch/?tenant=mmc" \
| grep -oE 'data-tenant-theme="[^"]*"|data-tenant="[^"]*"'
# Expected: data-tenant="mmc" + data-tenant-theme="mmc"
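The same check can be done on a saved copy of the page, which is useful in CI where grep output is awkward to assert on. A minimal sketch that mirrors the grep pattern above; the function name is hypothetical.

```python
import re
from typing import Dict

# Matches data-tenant="..." and data-tenant-theme="...", like the grep above.
ATTR_RE = re.compile(r'(data-tenant(?:-theme)?)="([^"]*)"')


def tenant_attrs(html: str) -> Dict[str, str]:
    """Extract data-tenant and data-tenant-theme values from page HTML."""
    return {name: value for name, value in ATTR_RE.findall(html)}


if __name__ == "__main__":
    # Hypothetical markup, shaped only to exercise the pattern.
    page = '<html data-tenant="mmc" data-tenant-theme="mmc"><body></body></html>'
    print(tenant_attrs(page))
```

For the MMC smoke test both values should be "mmc".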
MMC WordPress receiver
curl -s -X POST "https://mindfulmoneycoaching.online/wp-json/sg-course/v1/archetype-scores" \
-H "Content-Type: application/json" \
-H "X-Deal-Secret: <mmc archetype_scores_secret>" \
-d '{"email":"smoke@example.com","name":"Smoke","scores":{"victim":0.65,"martyr":0.22,...},"source":"smoke","response_format":"likert"}'
Then verify martyr was calibrated to ~0.32 (i.e. +10 percentage points after the gateway round-trip):
ssh u146818668@coachpilot.ch -p 65002 \
'cd /home/u146818668/domains/mindfulmoneycoaching.online/public_html && wp transient get sg_archetype_pending_$(echo -n smoke@example.com | md5sum | cut -d" " -f1)'
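The transient name in that command is a fixed prefix plus the MD5 hex digest of the email, with no trailing newline (hence echo -n). The sketch below computes the same name locally; the function name is hypothetical. It hashes the email exactly as given, because the runbook does not say whether the WordPress side lowercases or trims it first.

```python
import hashlib

TRANSIENT_PREFIX = "sg_archetype_pending_"


def pending_transient_key(email: str) -> str:
    """Return the transient name for `email`, matching the md5sum pipeline."""
    digest = hashlib.md5(email.encode("utf-8")).hexdigest()
    return f"{TRANSIENT_PREFIX}{digest}"


if __name__ == "__main__":
    print(pending_transient_key("smoke@example.com"))
```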
Dashboard live preview (B8) sanity
Open https://app.coachpilot.ch/en/portal/tenant/brand?tenant=mmc in a browser. The live preview card at the top should:
- Render with MMC's current brand (Ilana palette + photo)
- Update instantly when you type a new colour in the form
- Persist on save, so a refresh shows the saved value
If the preview doesn't update, the form values aren't bound to the BrandPreviewCard props. Check brand/page.tsx.
Logs
# Gateway
kubectl logs -n coachpilot deployment/coachpilot-gateway --tail=100 -f
# Public quiz
kubectl logs -n assess deployment/dashboard --tail=100 -f
# Agent
kubectl logs -n az deployment/adaptive-assessment --tail=100 -f
# Postgres
kubectl logs -n coachpilot postgres-0 --tail=50
Look for:
- "[results] webhook fire failed:" means the webhook URL is unreachable for some tenant
- "[results] tenant <id> has no coach_email/coach_notification_email" is the D4 fallback warning
- "tenant <id> has no engine config" means calibration won't fire
- "[archetype-scores] calibration applied rules: <ids>" is the WP-side calibration trace
- "generate_results failed:" means the agent narrative path is broken
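The signals listed above can be matched mechanically against saved log output. This is a sketch using plain substring matching on the fixed parts of each message. The function name and the sample lines are made up, and real log lines may carry timestamps or prefixes, which substring matching tolerates.

```python
from typing import Iterable, Iterator, Tuple

# Fixed fragment of each log message mapped to its meaning in this runbook.
LOG_SIGNALS = {
    "[results] webhook fire failed:": "webhook URL unreachable for some tenant",
    "has no coach_email/coach_notification_email": "D4 fallback warning",
    "has no engine config": "calibration won't fire",
    "[archetype-scores] calibration applied rules:": "WP-side calibration trace",
    "generate_results failed:": "agent narrative path broken",
}


def annotate(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (meaning, line) for each line containing a known signal."""
    for line in lines:
        for needle, meaning in LOG_SIGNALS.items():
            if needle in line:
                yield meaning, line.rstrip("\n")
                break  # report each line at most once


if __name__ == "__main__":
    # Hypothetical log lines for illustration.
    sample = [
        "INFO request handled in 12ms\n",
        "WARN tenant hsp has no engine config\n",
        "ERROR generate_results failed: timeout\n",
    ]
    for meaning, line in annotate(sample):
        print(f"{meaning}: {line}")
```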
Common production data tasks
Find all tenants with webhook configured
SELECT tenant_id, archetype_scores_webhook, coach_notification_email
FROM tenant_brand
WHERE archetype_scores_webhook IS NOT NULL;
Find sessions completed in last hour
SELECT tenant_id, format, COUNT(*), MAX(completed_at)
FROM assess_sessions
WHERE completed_at >= NOW() - INTERVAL '1 hour'
GROUP BY tenant_id, format;
Find a specific prospect's session history
For platform-tracked sessions:
SELECT * FROM assess_sessions WHERE user_email='player@example.com' ORDER BY completed_at DESC;
For MMC legacy sessions: bridge endpoint /wp-json/coachpilot/v1/money-quiz?email=player@example.com (gateway-authenticated).
Credential locations
| What | Where |
|---|---|
| Hostinger SSH (MMC + TSG + HSP shared) | Vault secret/passwords/hostinger_ssh |
| Exoscale API key (claude_fix_nlb) | Vault apis/exoscale/claude_fix_nlb |
| Resend API key | k8s secret coachpilot-secrets.RESEND_API_KEY |
| Anthropic API key | k8s secret coachpilot-secrets.ANTHROPIC_API_KEY |
| BEARER_MMC | k8s secret coachpilot-secrets.BEARER_MMC |
| BEARER_HSP | k8s secret coachpilot-secrets.BEARER_HSP |
| MMC bridge secret | k8s + WP option (see 09 MMC bridge §Secret rotation) |
| Docker Hub creds | ~/.docker/config.json (interactive docker login synergygroup) |
Never commit secrets. Never paste them in PR descriptions or Slack.
Escalation
When in doubt, the safe action is rollback (kubectl rollout undo) rather than debug-in-place. Pods are stateless; rolling back to the previous image takes ~30 s and restores the last known good state.
For unrecoverable data loss (someone dropped a table): Postgres has WAL backups via Exoscale managed snapshots. Restore via Exoscale console — DO NOT attempt to repair the live cluster.
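For the rollback path, the commands are the same for every deployment apart from name and namespace. The sketch below builds the undo command and a follow-up status check without running them. The deployment-to-namespace mapping is taken from the Logs section; the function name is hypothetical, and following the undo with kubectl rollout status is a suggestion rather than an established step of this runbook.

```python
import shlex
from typing import List

# Deployments and their namespaces, as listed in the Logs section.
DEPLOYMENTS = {
    "coachpilot-gateway": "coachpilot",
    "dashboard": "assess",
    "adaptive-assessment": "az",
}


def rollback_cmds(deployment: str) -> List[List[str]]:
    """Return the undo command followed by a rollout status check."""
    namespace = DEPLOYMENTS[deployment]  # raises KeyError for unknown names
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "undo", target, "-n", namespace],
        ["kubectl", "rollout", "status", target, "-n", namespace],
    ]


if __name__ == "__main__":
    for cmd in rollback_cmds("coachpilot-gateway"):
        print(shlex.join(cmd))
```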
Next
→ 12 Security — auth, RBAC, secrets, audit log.