
Runbooks for routine operations and incident recovery. Read this when something breaks.


Incident: site is down

1. curl -4 -sL "https://quiz.thesynergygroup.ch/" --max-time 15 -o /dev/null -w "%{http_code}"
   → 5xx means platform broken; 502/503 likely cluster issue; 404 likely Ingress drift
2. kubectl get pods -A --kubeconfig=$HOME/.kube/coachpilot-prod.yaml | grep -v Running
   → any pods not Running? → describe → events
3. kubectl get nodes
   → any NotReady?  → likely Exoscale SG drop (see below)
4. curl -v https://api.coachpilot.ch/health
   → if 404 from nginx: gateway pod down OR Ingress rule missing

If the gateway returns an nginx 404 but pods are Running, check that the Ingress secretName matches the secret name stored on the cert resource. The secret is often named <name>-secret, not <name>. See reference_exoscale_coachpilot_ops.md.
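The status-code reading from step 1 can be sketched as a small lookup. This is an illustrative helper, not part of the platform: the name triage_status and the wording of the returned hints are assumptions, and only the 5xx / 502 / 503 / 404 mapping comes from the steps above.

```python
def triage_status(code: int) -> str:
    """Map the HTTP status printed by the step-1 curl to a likely cause."""
    if code in (502, 503):
        return "likely cluster issue: check pods and nodes"
    if 500 <= code <= 599:
        return "platform broken"
    if code == 404:
        return "likely Ingress drift: check Ingress rules and secretName"
    if 200 <= code <= 399:
        return "site is answering: look elsewhere"
    # Anything else (including curl's 000 on connection failure, parsed as 0)
    return "unclassified: inspect the response manually"
```

Feed it the integer that curl prints via -w "%{http_code}".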


Lesson #42: SG drop after nodepool resize

After ANY SKS node-pool resize, scale, or type-upgrade, verify all workers have at least one Security Group attached. The cluster's CCM does NOT manage worker SG attachments; new instances come up with security-groups: [] and Exoscale default-denies inbound, killing the NLB. All in-cluster ingress dies even though pods show Running.

This caused a 24-hour outage of quiz/assess/app.coachpilot.ch on 2026-04-26→27.
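The verification step can be sketched as a filter over an instance listing. This is a minimal illustration, assuming each instance is a dict carrying a "name" and a "security-groups" list (mirroring the security-groups: [] symptom described above). The exact shape of the Exoscale API response is not confirmed here, and workers_missing_sg is a hypothetical name.

```python
from typing import Iterable, List, Mapping


def workers_missing_sg(instances: Iterable[Mapping]) -> List[str]:
    """Return names of instances whose security-group list is empty or absent."""
    missing = []
    for inst in instances:
        # An empty list and a missing key are both treated as "no SG attached".
        if not inst.get("security-groups"):
            missing.append(inst.get("name", "<unnamed>"))
    return missing
```

A non-empty result after a resize means the recovery below is needed.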

Recovery:

python3 "10 Projects/Agent Zero/repos/exoscale-deploy-kit/repair_worker_sgs.py" \
  a2e3aa09-0be5-4ee6-993d-2d0aa506d125

NLB health checks recover within 10–20 seconds. The script is idempotent, so it is safe to re-run.

Two related gotchas:

  • The url= argument of exoscale.api.v2.Client is silently ignored: pass it and every call still hits Geneva, giving phantom 404s on Zurich resources. Always construct the client as Client(key, sec, zone='ch-dk-2').
  • API keys are scoped to a specific Exoscale account/org. For the-synergy-group-ag use the claude_fix_nlb key in vault apis/exoscale/claude_fix_nlb.

Open: a CronJob auto-remedy for this (Lesson #42) is on the deferred list. Today it's manual.


kubeconfig expired (24h)

Symptom: kubectl returns "the server has asked for the client to provide credentials".

Recovery: see 10 Deployment §kubeconfig — 24h TTL.


Postgres migration didn't apply

Symptom: the gateway logs show column "<new_col>" does not exist.

Recovery: re-apply the migration. Migrations are idempotent, so re-running one that already applied is harmless.

kubectl exec -i postgres-0 -n coachpilot -- psql -U coachpilot -d coachpilot \
  < migrations/00XX_<name>.sql

If you are unsure which migration is missing, query information_schema.columns to see which columns exist.
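A minimal sketch of that check: the query text lists the columns of one table, and a helper diffs them against what the migrations should have produced. The table name tenant_brand is taken from the data tasks later in this runbook. The names COLUMNS_SQL and missing_columns, and the expected list you pass in, are assumptions.

```python
# information_schema.columns is standard PostgreSQL; run this text through psql.
COLUMNS_SQL = (
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_name = 'tenant_brand' ORDER BY column_name;"
)


def missing_columns(existing, expected):
    """Return the expected column names that the table lacks, sorted."""
    return sorted(set(expected) - set(existing))
```

Any name returned by missing_columns points at the migration that still needs applying.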


OPcache serving stale PHP (MMC)

Symptom: deployed a .php file but the change isn't visible.

Recovery:

# Via SSH:
cd <wp-root>
wp eval 'opcache_invalidate("/full/path/to/file.php", true); echo "OK";'

Or via Python paramiko (see 10 Deployment §MMC deploy). Touching the file also helps (forces filemtime update).

If that doesn't fix it, the file probably wasn't uploaded successfully. Re-upload it over SFTP and verify that the remote size matches the local size.
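The size check can be sketched as follows. The helper name upload_size_matches is hypothetical; remote_size is whatever your transfer tool reports for the uploaded file (with paramiko, SFTPClient.stat(path).st_size is one way to obtain it).

```python
import os


def upload_size_matches(local_path: str, remote_size: int) -> bool:
    """True when the local file's byte size equals the size reported remotely."""
    return os.path.getsize(local_path) == remote_size
```

A size match does not prove the contents are identical, but a mismatch reliably signals a truncated or failed upload.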


Snapshot restore (rollback a config)

# Find the snapshot id
curl -s "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/snapshots?client=mmc&limit=20" \
  -H "Authorization: Bearer <BEARER_MMC>" | python3 -m json.tool

# Restore it
curl -X POST "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/restore?client=mmc&snapshot=<id>" \
  -H "Authorization: Bearer <BEARER_MMC>"

A new snapshot is created for the restore action itself, so this is reversible.

For tenant_brand there's no snapshot system today — that's a deferred item. To roll back a brand change, manually PATCH the previous values.


Smoke tests (post-deploy)

Tenant brand fields surface correctly

curl -s "https://api.coachpilot.ch/api/v1/assess/tenant/mmc/brand" \
  -H "Authorization: Bearer $BEARER_MMC" | python3 -m json.tool

Verify the new D4 fields (archetype_scores_webhook, etc.) are present with MMC's backfilled values.

Calibration endpoint

curl -s -X POST "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/calibrate?client=mmc" \
  -H "Content-Type: application/json" \
  -d '{"scores":{"victim":65,"martyr":22,"hero":40}}' | python3 -m json.tool

Expected: martyr lifts from 22 → 32, applied_rules includes victim_martyr_cooccurrence.
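The expectation can be turned into an automated check. This sketch assumes the response JSON exposes the calibrated values under a top-level "scores" key alongside "applied_rules". Only the applied_rules name and the expected values come from this runbook; the "scores" key and the function name are assumptions to adjust against the real payload.

```python
def calibration_ok(response: dict) -> bool:
    """Check the smoke-test expectation: martyr lifted to 32 by the co-occurrence rule."""
    scores = response.get("scores", {})        # assumed key, verify against the real response
    rules = response.get("applied_rules", [])  # named in the expectation above
    return scores.get("martyr") == 32 and "victim_martyr_cooccurrence" in rules
```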

Resolved framework

curl -s "https://api.coachpilot.ch/api/v1/assess/framework/money_archetypes/resolved?client=mmc" \
  | python3 -m json.tool | head -50

Should include _tenant_id: "mmc", _tenant_urls: {...}, scoring.engine.archetype_calibration.rules.
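Eyeballing the first 50 lines is easy to get wrong, so the same check can be sketched over the parsed JSON. The key names come from the line above; the helper names and the problem messages are illustrative assumptions.

```python
def get_path(doc, dotted):
    """Walk a dotted key path through nested dicts; None if any key is missing."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def resolved_framework_problems(doc):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if doc.get("_tenant_id") != "mmc":
        problems.append("_tenant_id is not 'mmc'")
    if not isinstance(doc.get("_tenant_urls"), dict):
        problems.append("_tenant_urls missing or not an object")
    if not get_path(doc, "scoring.engine.archetype_calibration.rules"):
        problems.append("scoring.engine.archetype_calibration.rules missing or empty")
    return problems
```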

Public quiz emits theme provider

curl -s "https://quiz.thesynergygroup.ch/?tenant=mmc" \
  | grep -oE 'data-tenant-theme="[^"]*"|data-tenant="[^"]*"'
# Expected: data-tenant="mmc" + data-tenant-theme="mmc"
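The same extraction can be sketched in Python for use in a scripted smoke test. The function name tenant_attrs is hypothetical, and the regex is a simple pattern match over the raw HTML rather than a full parser, which is enough for this check.

```python
import re

# Matches data-tenant="..." and data-tenant-theme="..."
ATTR_RE = re.compile(r'data-tenant(-theme)?="([^"]*)"')


def tenant_attrs(html: str) -> dict:
    """Return the first data-tenant and data-tenant-theme values found in the page."""
    found = {}
    for match in ATTR_RE.finditer(html):
        name = "data-tenant-theme" if match.group(1) else "data-tenant"
        found.setdefault(name, match.group(2))  # keep the first occurrence only
    return found
```

For the MMC tenant, both returned values should be "mmc".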

MMC WordPress receiver

curl -s -X POST "https://mindfulmoneycoaching.online/wp-json/sg-course/v1/archetype-scores" \
  -H "Content-Type: application/json" \
  -H "X-Deal-Secret: <mmc archetype_scores_secret>" \
  -d '{"email":"smoke@example.com","name":"Smoke","scores":{"victim":0.65,"martyr":0.22,...},"source":"smoke","response_format":"likert"}'

Then verify martyr was calibrated to ~0.32 (i.e. +10 percentage points after the gateway round-trip):

ssh u146818668@coachpilot.ch -p 65002 \
  'cd /home/u146818668/domains/mindfulmoneycoaching.online/public_html && wp transient get sg_archetype_pending_$(echo -n smoke@example.com | md5sum | cut -d" " -f1)'
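The transient name in the command above is the prefix sg_archetype_pending_ followed by the MD5 hex digest of the email, exactly as echo -n ... | md5sum produces it. A minimal Python equivalent, with a hypothetical function name, is handy when scripting the check:

```python
import hashlib


def pending_transient_name(email: str) -> str:
    """Build the transient name the same way the shell pipeline does (no trailing newline)."""
    digest = hashlib.md5(email.encode("utf-8")).hexdigest()
    return f"sg_archetype_pending_{digest}"
```

The email is hashed as given, with no trimming or lower-casing, to match the shell command.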

Dashboard live preview (B8) sanity

Open https://app.coachpilot.ch/en/portal/tenant/brand?tenant=mmc in a browser. The live preview card at the top should:

  • Render with MMC's current brand (Ilana palette + photo)
  • Update instantly when you type a new colour in the form
  • Persist the change on Save, so that a refresh shows the saved value

If the preview doesn't update, the form values aren't bound to the BrandPreviewCard props. Check brand/page.tsx.


Logs

# Gateway
kubectl logs -n coachpilot deployment/coachpilot-gateway --tail=100 -f

# Public quiz
kubectl logs -n assess deployment/dashboard --tail=100 -f

# Agent
kubectl logs -n az deployment/adaptive-assessment --tail=100 -f

# Postgres
kubectl logs -n coachpilot postgres-0 --tail=50

Look for:

  • [results] webhook fire failed: — webhook URL unreachable for some tenant
  • [results] tenant <id> has no coach_email/coach_notification_email — D4 fallback warning
  • tenant <id> has no engine config — calibration won't fire
  • [archetype-scores] calibration applied rules: <ids> — WP-side calibration trace
  • generate_results failed: — agent narrative path broken
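Scanning for these signatures can be sketched as a substring lookup. The signature strings and their meanings are copied from the list above; the names LOG_SIGNATURES and classify are assumptions, and the matching is deliberately naive (first hit wins).

```python
LOG_SIGNATURES = [
    ("[results] webhook fire failed:", "webhook URL unreachable for some tenant"),
    ("has no coach_email/coach_notification_email", "D4 fallback warning"),
    ("has no engine config", "calibration won't fire"),
    ("[archetype-scores] calibration applied rules:", "WP-side calibration trace"),
    ("generate_results failed:", "agent narrative path broken"),
]


def classify(line: str):
    """Return the meaning of the first known signature found in the line, else None."""
    for needle, meaning in LOG_SIGNATURES:
        if needle in line:
            return meaning
    return None
```

To use it, save the output of one of the kubectl logs commands above to a file and call classify on each line.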

Common production data tasks

Find all tenants with webhook configured

SELECT tenant_id, archetype_scores_webhook, coach_notification_email
FROM tenant_brand
WHERE archetype_scores_webhook IS NOT NULL;

Find sessions completed in last hour

SELECT tenant_id, format, COUNT(*), MAX(completed_at)
FROM assess_sessions
WHERE completed_at >= NOW() - INTERVAL '1 hour'
GROUP BY tenant_id, format;

Find a specific prospect's session history

For platform-tracked sessions:

SELECT * FROM assess_sessions WHERE user_email='player@example.com' ORDER BY completed_at DESC;

For MMC legacy sessions: bridge endpoint /wp-json/coachpilot/v1/money-quiz?email=player@example.com (gateway-authenticated).


Credential locations

What                                   | Where
---------------------------------------|------------------------------------------------------------
Hostinger SSH (MMC + TSG + HSP shared) | Vault secret/passwords/hostinger_ssh
Exoscale API key (claude_fix_nlb)      | Vault apis/exoscale/claude_fix_nlb
Resend API key                         | k8s secret coachpilot-secrets.RESEND_API_KEY
Anthropic API key                      | k8s secret coachpilot-secrets.ANTHROPIC_API_KEY
BEARER_MMC                             | k8s secret coachpilot-secrets.BEARER_MMC
BEARER_HSP                             | k8s secret coachpilot-secrets.BEARER_HSP
MMC bridge secret                      | k8s + WP option (see 09 MMC bridge §Secret rotation)
Docker Hub creds                       | ~/.docker/config.json (interactive docker login synergygroup)

Never commit secrets. Never paste them in PR descriptions or Slack.


Escalation

When in doubt, the safe action is rollback (kubectl rollout undo) rather than debug-in-place. Pods are stateless; rolling back to the previous image takes ~30 s and restores the last known good state.

For unrecoverable data loss (for example, someone dropped a table): Postgres has WAL backups via Exoscale managed snapshots. Restore via the Exoscale console. DO NOT attempt to repair the live cluster.


Next

12 Security — auth, RBAC, secrets, audit log.