Docker Compose → k3d Migration Plan
Status: Planned · April 2026
Phased migration from Docker Compose on a single Hetzner VPS to a self-hosted k3s cluster managed by k3d. Same hardware, dramatically better operational story: rolling updates, self-healing restarts, readiness gates, declarative config, and a clean path to multi-node if the platform outgrows one server.
Why k3d, Not Full k3s or a Managed Cluster
| Option | Pro | Con |
|---|---|---|
| Docker Compose (current) | Zero ops overhead now | No rolling updates, no readiness gates, restarts drop traffic |
| k3s bare-metal | Production-grade, no Docker overhead | More complex install, harder to reset |
| k3d (k3s in Docker) | Same Docker layer already present, reset with one command, mirrors prod locally | Tiny Docker-in-Docker overhead (~50 MB) |
| Hetzner Managed k8s | Fully managed | Monthly cost, overkill for current scale |
k3d runs k3s inside Docker containers. The same Hetzner VPS keeps running Docker; k3d creates a k3s control-plane + agent in containers, then kubectl talks to it. No new servers, no new cost.
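Concretely, after the Phase 1 bootstrap the VPS's Docker daemon runs a handful of extra containers. The sketch below is illustrative — names follow k3d's `k3d-<cluster>-<role>` convention, and the exact list can vary by k3d version:

```
$ docker ps --format '{{.Names}}'
k3d-vf-cluster-serverlb    # entry load balancer — host ports 80/443 land here
k3d-vf-cluster-agent-0     # worker node (runs the app pods)
k3d-vf-cluster-server-0    # k3s control plane
```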
Current vs Target Architecture
Current
───────
Hetzner VPS
docker-compose.yml
db postgres:16-alpine :5432
api ghcr.io/…/api :3000
kick-server ghcr.io/…/kick-server :3001
worker ghcr.io/…/worker (no port)
migrator ghcr.io/…/migrator (run-once)
Target
──────
Hetzner VPS
k3d cluster: vf-cluster (1 server node + 1 agent node inside Docker)
namespace: vf-prod
Deployment/api (2 replicas, rolling update)
Deployment/kick-server (1 replica)
Deployment/worker (1 replica)
Job/migrator (run-once per deploy, pre-upgrade hook)
Service/api ClusterIP → Traefik Ingress → :443
Service/kick-server ClusterIP → Traefik Ingress → :443
Secret/vf-env all .env values (replaces env_file)
ConfigMap/vf-config non-secret values (LOG_LEVEL, PORT, APP_URL)
db → Hetzner Managed PostgreSQL (external, no StatefulSet)

The PostgreSQL container is removed from the k3d cluster — the platform already targets Hetzner Managed PG in production. The compose stack only ever ran PG for local dev, which stays as docker-compose.yml for developer use.
Repository Layout After Migration
infra/
k8s/
namespace.yaml
secret.yaml.example ← committed skeleton, real values in CI secrets
configmap.yaml
deployment-api.yaml
deployment-kick-server.yaml
deployment-worker.yaml
job-migrator.yaml
service-api.yaml
service-kick-server.yaml
ingress.yaml
scripts/
cluster-bootstrap.sh ← one-shot: install k3d, create cluster, apply base
deploy.sh ← called by CI: update image tags + kubectl rollout

Phase 1 — Cluster Bootstrap (day 1, ~2 h)
1.1 Install k3d on the Hetzner VPS
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d version # confirm ≥ 5.7

1.2 Create the cluster
k3d cluster create vf-cluster \
--servers 1 \
--agents 1 \
--port "80:80@loadbalancer" \
--port "443:443@loadbalancer" \
--k3s-arg "--disable=servicelb@server:0" # use Traefik LB instead

k3d automatically installs Traefik as the ingress controller. The host ports 80/443 map directly into the cluster's load-balancer container — no iptables rules to manage.
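The same flags can also be kept in version control as a k3d config file. This is a sketch against the v1alpha5 `Simple` schema (field names may shift between k3d releases; the `infra/k3d-config.yaml` path is a suggestion, not an existing file):

```yaml
# infra/k3d-config.yaml — declarative equivalent of the CLI flags above
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: vf-cluster
servers: 1
agents: 1
ports:
  - port: 80:80
    nodeFilters:
      - loadbalancer
  - port: 443:443
    nodeFilters:
      - loadbalancer
options:
  k3s:
    extraArgs:
      - arg: --disable=servicelb
        nodeFilters:
          - server:0
```

Create with `k3d cluster create --config infra/k3d-config.yaml`.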
1.3 Verify
kubectl get nodes # should show 1 server + 1 agent, Ready
kubectl get pods -A # traefik pod Running in kube-system

Phase 2 — Manifests (day 1–2, ~4 h)
All files live in infra/k8s/.
namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: vf-prod

secret.yaml (skeleton — real values injected by CI)
apiVersion: v1
kind: Secret
metadata:
name: vf-env
namespace: vf-prod
type: Opaque
stringData:
DATABASE_URL: ""
RESEND_API_KEY: ""
SCRAPER_API_KEY: ""
PUSHER_APP_ID: ""
PUSHER_KEY: ""
PUSHER_SECRET: ""
PUSHER_CLUSTER: "eu"
S3_ENDPOINT: ""
S3_REGION: ""
S3_ACCESS_KEY: ""
S3_SECRET_KEY: ""
S3_BUCKET: "kycdocs"
KICK_S3_BUCKET: "eventskick"
X_CLIENT_ID: ""
X_CLIENT_SECRET: ""
KICK_CLIENT_ID: ""
KICK_CLIENT_SECRET: ""

configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vf-config
namespace: vf-prod
data:
LOG_LEVEL: "info"
PORT: "3000"
APP_URL: "https://app.verifluence.io"
RESEND_FROM: "no-reply@updates.verifluence.io"

deployment-api.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: vf-prod
spec:
replicas: 2
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0 # never take a pod down before a new one is Ready
maxSurge: 1
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
spec:
containers:
- name: api
image: ghcr.io/verifluence/api:latest
ports:
- containerPort: 3000
envFrom:
- secretRef:
name: vf-env
- configMapRef:
name: vf-config
readinessProbe: # IR-1: gates traffic until /api/health returns 200
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 15
periodSeconds: 30
failureThreshold: 3
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "512Mi"

deployment-kick-server.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kick-server
namespace: vf-prod
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0
maxSurge: 1
selector:
matchLabels:
app: kick-server
template:
metadata:
labels:
app: kick-server
spec:
containers:
- name: kick-server
image: ghcr.io/verifluence/kick-server:latest
ports:
- containerPort: 3000
envFrom:
- secretRef:
name: vf-env
- configMapRef:
name: vf-config
readinessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
resources:
requests:
cpu: "50m"
memory: "64Mi"
limits:
cpu: "200m"
memory: "256Mi"

deployment-worker.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: worker
namespace: vf-prod
spec:
replicas: 1
selector:
matchLabels:
app: worker
template:
metadata:
labels:
app: worker
spec:
containers:
- name: worker
image: ghcr.io/verifluence/worker:latest
envFrom:
- secretRef:
name: vf-env
- configMapRef:
name: vf-config
resources:
requests:
cpu: "50m"
memory: "64Mi"
limits:
cpu: "500m"
memory: "256Mi"

job-migrator.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: migrator
namespace: vf-prod
annotations:
# Helm pre-upgrade hook equivalent. These manifests are applied with kubectl, not
# Helm, so the annotation is inert — CI deletes and recreates this Job before each rollout.
"helm.sh/hook": pre-upgrade
spec:
backoffLimit: 2
template:
spec:
restartPolicy: Never
containers:
- name: migrator
image: ghcr.io/verifluence/migrator:latest
envFrom:
- secretRef:
name: vf-env

service-api.yaml
apiVersion: v1
kind: Service
metadata:
name: api
namespace: vf-prod
spec:
selector:
app: api
ports:
- port: 80
targetPort: 3000

service-kick-server.yaml
apiVersion: v1
kind: Service
metadata:
name: kick-server
namespace: vf-prod
spec:
selector:
app: kick-server
ports:
- port: 80
targetPort: 3000

ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: vf-ingress
namespace: vf-prod
annotations:
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
spec:
rules:
- host: api.verifluence.io
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api
port:
number: 80
- host: kick.verifluence.io
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: kick-server
port:
number: 80
tls:
- hosts:
- api.verifluence.io
- kick.verifluence.io

TLS: Traefik's built-in Let's Encrypt cert resolver handles certificate issuance and renewal automatically — no cert-manager required for a setup this small.
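On k3s, the bundled Traefik ships without an ACME resolver configured. One way to enable it is a HelmChartConfig overlay in kube-system — a sketch, with a placeholder email; the resolver name must match the certresolver annotation in the Ingress:

```yaml
# traefik-config.yaml — enables a 'letsencrypt' ACME resolver on k3s's bundled Traefik
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    persistence:
      enabled: true          # keep acme.json across Traefik pod restarts
    additionalArguments:
      - "--certificatesresolvers.letsencrypt.acme.email=ops@verifluence.io"
      - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
```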
Phase 3 — Local Validation (day 2, ~2 h)
Before touching production, validate all manifests against a local k3d cluster:
# Start a local mirror cluster
k3d cluster create vf-local --port "8080:80@loadbalancer"
# Apply everything
kubectl apply -f infra/k8s/namespace.yaml
kubectl apply -f infra/k8s/configmap.yaml
kubectl apply -f infra/k8s/secret.yaml # use dev values
kubectl apply -f infra/k8s/deployment-api.yaml
kubectl apply -f infra/k8s/deployment-kick-server.yaml
kubectl apply -f infra/k8s/deployment-worker.yaml
kubectl apply -f infra/k8s/service-api.yaml
kubectl apply -f infra/k8s/service-kick-server.yaml
# Watch rollout
kubectl rollout status deployment/api -n vf-prod
# Port-forward to smoke-test without ingress
kubectl port-forward svc/api 3000:80 -n vf-prod
curl http://localhost:3000/api/health

All probes must pass before Phase 4.
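Pods and port-forwards need a few seconds to come up, so the smoke test above can flake if run immediately. A tiny retry helper (a sketch, not a script in the repo) makes the check deterministic:

```shell
# retry <attempts> <delay-seconds> <command...> — poll a command until it succeeds
retry() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example during Phase 3 (assumes the port-forward above is still running):
# retry 10 3 curl -sf http://localhost:3000/api/health
```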
Phase 4 — CI/CD Pipeline Update (day 2–3, ~3 h)
Replace the current deploy step (SSH + docker compose pull && up) with:
# .github/workflows/deploy-api.yml — new deploy job
deploy:
name: Deploy to k3d
needs: build
runs-on: ubuntu-latest
environment: production
steps:
- uses: actions/checkout@v4
- name: Write kubeconfig
run: |
mkdir -p ~/.kube
echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > ~/.kube/config
- name: Delete old migrator Job (if exists)
run: |
kubectl delete job migrator -n vf-prod --ignore-not-found=true
- name: Run migrator
run: |
# Patch image tag then apply
sed "s|:latest|:sha-${{ github.sha }}|g" infra/k8s/job-migrator.yaml \
| kubectl apply -f -
kubectl wait --for=condition=complete job/migrator -n vf-prod --timeout=120s
- name: Roll out new images
run: |
SHA=sha-${{ github.sha }}
kubectl set image deployment/api api=${{ env.API_IMAGE }}:${SHA} -n vf-prod
kubectl set image deployment/kick-server kick-server=${{ env.KICK_IMAGE }}:${SHA} -n vf-prod
kubectl set image deployment/worker worker=${{ env.WORKER_IMAGE }}:${SHA} -n vf-prod
- name: Wait for rollout
run: |
kubectl rollout status deployment/api -n vf-prod --timeout=120s
kubectl rollout status deployment/kick-server -n vf-prod --timeout=120s
kubectl rollout status deployment/worker -n vf-prod --timeout=120s
- name: Smoke-test health endpoint
run: |
kubectl run smoke --rm -i --restart=Never \
--image=curlimages/curl:latest \
--namespace=vf-prod \
-- curl -sf http://api/api/health | grep '"status":"ok"'

Secrets required in GitHub:
KUBECONFIG_B64 — base64-encoded kubeconfig from the Hetzner VPS (k3d kubeconfig get vf-cluster | base64)
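The workflow steps above are what the repo layout calls infra/scripts/deploy.sh. A sketch of that script, consolidating the same commands — hypothetical in its details, with image names and manifest paths following the conventions in this plan:

```shell
#!/usr/bin/env bash
# infra/scripts/deploy.sh — CI deploy entry point (sketch)
set -euo pipefail

NS=vf-prod

# Rewrite :latest image tags (stdin) to the immutable sha- tag in $SHA.
retag() { sed "s|:latest|:${SHA}|g"; }

main() {
  SHA="sha-$1"

  # 1. Migrations first: recreate the Job so 'kubectl wait' tracks a fresh run.
  kubectl delete job migrator -n "$NS" --ignore-not-found=true
  retag < infra/k8s/job-migrator.yaml | kubectl apply -f -
  kubectl wait --for=condition=complete job/migrator -n "$NS" --timeout=120s

  # 2. Roll out app images; readiness probes (IR-1) gate traffic during the roll.
  for d in api kick-server worker; do
    kubectl set image "deployment/$d" "$d=ghcr.io/verifluence/$d:$SHA" -n "$NS"
    kubectl rollout status "deployment/$d" -n "$NS" --timeout=120s
  done
}

# Run only when a git SHA is supplied, so the helpers can be sourced in tests.
if [ -n "${1:-}" ]; then main "$1"; fi
```

CI then reduces to `./infra/scripts/deploy.sh "$GITHUB_SHA"` after writing the kubeconfig.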
Phase 5 — Production Cutover (day 3–4, ~2 h)
Cutover is done in-place on the same VPS — no DNS changes needed because Traefik takes over the same ports (80/443) that the compose reverse-proxy was using.
# On the Hetzner VPS:
# 1. Stop compose stack (brief downtime window, < 30 s)
cd /srv/verifluence
docker compose down
# 2. Apply k8s manifests (cluster already running from bootstrap)
kubectl apply -f infra/k8s/
kubectl rollout status deployment/api -n vf-prod --timeout=120s
# 3. Verify
curl https://api.verifluence.io/api/health

Rollback if anything is wrong:
docker compose up -d # old stack back in seconds

Phase 6 — Post-Migration Cleanup (week 2)
Once k3d has been stable in production for 7 days:
- Remove the db: service from docker-compose.yml (PG is Hetzner Managed — always was in prod)
- Rename docker-compose.yml → docker-compose.dev.yml with a comment: "local dev only"
- Archive the old SSH-based deploy step from deploy-api.yml
- Document the new kubectl runbook in the DR playbook (IR-8)
Synergies with M-IR
This migration directly satisfies several open M-IR items:
| M-IR item | How k3d covers it |
|---|---|
| IR-2 Zero-downtime rolling deploy | RollingUpdate strategy + maxUnavailable: 0 is the deploy strategy |
| IR-1 Health endpoint gates traffic | readinessProbe on /api/health — pods never receive traffic until healthy |
| IR-6 Resource metrics | kubectl top pods + resource requests/limits baseline; Prometheus scrape via k3s built-in metrics-server |
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| k3d overhead on a small VPS | k3s is ~40 MB RAM for control plane; well within headroom on a Hetzner CX22 (4 GB) |
| kubeconfig leaking production access | Store in a GitHub secret, rotate quarterly; prefer a namespace-scoped ServiceAccount token over the cluster-admin kubeconfig |
| Traefik TLS fails on first deploy | Point the cert resolver at the ACME staging endpoint first; switch to production once a staging cert is verified |
| Migrator Job fails mid-deploy | backoffLimit: 2; deploy pipeline halts on kubectl wait timeout before images are rolled |
| Local dev breaks (no compose db) | docker-compose.dev.yml keeps postgres for local dev; devs switch with COMPOSE_FILE=docker-compose.dev.yml |
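One way to avoid shipping the cluster-admin kubeconfig to CI is a ServiceAccount scoped to the vf-prod namespace. A sketch — the deploy-bot name is hypothetical, and the verb/resource list should be trimmed to what the pipeline actually needs:

```yaml
# rbac-deploy-bot.yaml — namespace-scoped CI identity for vf-prod
apiVersion: v1
kind: ServiceAccount
metadata:
  name: deploy-bot
  namespace: vf-prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy-bot
  namespace: vf-prod
rules:
  - apiGroups: ["apps", "batch", ""]
    resources: ["deployments", "jobs", "pods", "services", "configmaps", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy-bot
  namespace: vf-prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: deploy-bot
subjects:
  - kind: ServiceAccount
    name: deploy-bot
    namespace: vf-prod
```

A token for this ServiceAccount can then replace the admin credentials in KUBECONFIG_B64.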
Out of Scope
- Helm chart packaging (manifests are enough at current scale; Helm is a V2 consideration)
- Horizontal Pod Autoscaler (HPA) — single-node k3d cannot meaningfully autoscale; address when a second node is added
- Multi-node cluster — follow-up: add a second Hetzner VPS as a k3d agent node
- Persistent volumes for PostgreSQL — DB stays on Hetzner Managed PG; no PVCs needed