
Docker Compose → k3d Migration Plan

Status: Planned · April 2026

Phased migration from docker compose up on a single Hetzner VPS to a self-hosted k3s cluster managed by k3d. Same hardware, dramatically better operational story: rolling updates, self-healing restarts, readiness gates, declarative config, and a clean path to multi-node if the platform outgrows one server.


Why k3d, Not Full k3s or a Managed Cluster

| Option | Pro | Con |
|---|---|---|
| Docker Compose (current) | Zero ops overhead now | No rolling updates, no readiness gates, restarts drop traffic |
| k3s bare-metal | Production-grade, no Docker overhead | More complex install, harder to reset |
| k3d (k3s in Docker) | Same Docker layer already present, reset with one command, mirrors prod locally | Tiny Docker-in-Docker overhead (~50 MB) |
| Hetzner Managed k8s | Fully managed | Monthly cost, overkill for current scale |

k3d runs k3s inside Docker containers. The same Hetzner VPS keeps running Docker; k3d creates a k3s control-plane + agent in containers, then kubectl talks to it. No new servers, no new cost.


Current vs Target Architecture

Current
───────
Hetzner VPS
  docker-compose.yml
    db          postgres:16-alpine   :5432
    api         ghcr.io/…/api        :3000
    kick-server ghcr.io/…/kick-server :3001
    worker      ghcr.io/…/worker     (no port)
    migrator    ghcr.io/…/migrator   (run-once)

Target
──────
Hetzner VPS
  k3d cluster: vf-cluster (1 server node + 1 agent node inside Docker)
    namespace: vf-prod
      Deployment/api           (2 replicas, rolling update)
      Deployment/kick-server   (1 replica)
      Deployment/worker        (1 replica)
      Job/migrator             (run-once per deploy, recreated by CI)
      Service/api              ClusterIP → Traefik Ingress → :443
      Service/kick-server      ClusterIP → Traefik Ingress → :443
      Secret/vf-env            all .env values (replaces env_file)
      ConfigMap/vf-config      non-secret values (LOG_LEVEL, PORT, APP_URL)
  db  →  Hetzner Managed PostgreSQL (external, no StatefulSet)

The PostgreSQL container is not carried over into the cluster — the platform already targets Hetzner Managed PG in production. The compose file only ever ran PG for local dev, and local dev keeps using docker-compose.yml.
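Rather than hand-copying values into the Secret, the vf-env manifest can be generated from the existing .env — `kubectl create secret generic vf-env --from-env-file=.env -n vf-prod --dry-run=client -o yaml` does this directly. A dependency-free sketch of the same transformation (the sample .env values are placeholders; note it assumes values contain no double quotes):

```shell
# Sample .env content (stand-in for the real file):
cat > .env <<'EOF'
# comment line
DATABASE_URL=postgres://user:pass@db-host:5432/vf
S3_BUCKET=kycdocs
EOF

# Emit the Secret manifest header, then append each KEY=VALUE line
# as an indented stringData entry, skipping comments and blank lines.
cat > vf-env.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: vf-env
  namespace: vf-prod
type: Opaque
stringData:
EOF
grep -Ev '^[[:space:]]*(#|$)' .env \
  | sed -E 's/^([A-Za-z_][A-Za-z0-9_]*)=(.*)$/  \1: "\2"/' >> vf-env.yaml

cat vf-env.yaml
```

Keeping the generator next to the skeleton means the committed secret.yaml.example never drifts from the real key list.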


Repository Layout After Migration

infra/
  k8s/
    namespace.yaml
    secret.yaml.example       ← committed skeleton, real values in CI secrets
    configmap.yaml
    deployment-api.yaml
    deployment-kick-server.yaml
    deployment-worker.yaml
    job-migrator.yaml
    service-api.yaml
    service-kick-server.yaml
    ingress.yaml
  scripts/
    cluster-bootstrap.sh      ← one-shot: install k3d, create cluster, apply base
    deploy.sh                 ← called by CI: update image tags + kubectl rollout

Phase 1 — Cluster Bootstrap (day 1, ~2 h)

1.1 Install k3d on the Hetzner VPS

bash
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
k3d version   # confirm ≥ 5.7

1.2 Create the cluster

bash
k3d cluster create vf-cluster \
  --servers 1 \
  --agents  1 \
  --port "80:80@loadbalancer" \
  --port "443:443@loadbalancer" \
  --k3s-arg "--disable=servicelb@server:0"   # drop klipper ServiceLB; k3d's own load balancer publishes 80/443

k3s bundles Traefik as its ingress controller, so the cluster gets it automatically. The host ports 80/443 map directly into the cluster's load-balancer container — no iptables rules to manage. (If Traefik's Service sits in Pending after disabling ServiceLB, drop the --disable flag and recreate the cluster: the k3d load balancer forwards to ports that ServiceLB publishes on the nodes.)

1.3 Verify

bash
kubectl get nodes          # should show 1 server + 1 agent, Ready
kubectl get pods -A        # traefik pod Running in kube-system

Phase 2 — Manifests (day 1–2, ~4 h)

All files live in infra/k8s/.

namespace.yaml

yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vf-prod

secret.yaml (skeleton — real values injected by CI)

yaml
apiVersion: v1
kind: Secret
metadata:
  name: vf-env
  namespace: vf-prod
type: Opaque
stringData:
  DATABASE_URL:      ""
  RESEND_API_KEY:    ""
  SCRAPER_API_KEY:   ""
  PUSHER_APP_ID:     ""
  PUSHER_KEY:        ""
  PUSHER_SECRET:     ""
  PUSHER_CLUSTER:    "eu"
  S3_ENDPOINT:       ""
  S3_REGION:         ""
  S3_ACCESS_KEY:     ""
  S3_SECRET_KEY:     ""
  S3_BUCKET:         "kycdocs"
  KICK_S3_BUCKET:    "eventskick"
  X_CLIENT_ID:       ""
  X_CLIENT_SECRET:   ""
  KICK_CLIENT_ID:    ""
  KICK_CLIENT_SECRET: ""

configmap.yaml

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vf-config
  namespace: vf-prod
data:
  LOG_LEVEL: "info"
  PORT:      "3000"
  APP_URL:   "https://app.verifluence.io"
  RESEND_FROM: "no-reply@updates.verifluence.io"

Both api and kick-server load this ConfigMap via envFrom, so inside the cluster both listen on PORT 3000 — the compose-era :3001 for kick-server is no longer needed, since each pod has its own network namespace and Services route by name.

deployment-api.yaml

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: vf-prod
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0    # never take a pod down before a new one is Ready
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/verifluence/api:latest
          ports:
            - containerPort: 3000
          envFrom:
            - secretRef:
                name: vf-env
            - configMapRef:
                name: vf-config
          readinessProbe:           # IR-1: gates traffic until /api/health returns 200
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 30
            failureThreshold: 3
          resources:
            requests:
              cpu:    "100m"
              memory: "128Mi"
            limits:
              cpu:    "500m"
              memory: "512Mi"

deployment-kick-server.yaml

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kick-server
  namespace: vf-prod
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: kick-server
  template:
    metadata:
      labels:
        app: kick-server
    spec:
      containers:
        - name: kick-server
          image: ghcr.io/verifluence/kick-server:latest
          ports:
            - containerPort: 3000
          envFrom:
            - secretRef:
                name: vf-env
            - configMapRef:
                name: vf-config
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu:    "50m"
              memory: "64Mi"
            limits:
              cpu:    "200m"
              memory: "256Mi"

deployment-worker.yaml

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
  namespace: vf-prod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: ghcr.io/verifluence/worker:latest
          envFrom:
            - secretRef:
                name: vf-env
            - configMapRef:
                name: vf-config
          resources:
            requests:
              cpu:    "50m"
              memory: "64Mi"
            limits:
              cpu:    "500m"
              memory: "256Mi"

job-migrator.yaml

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: migrator
  namespace: vf-prod
  # No Helm in play (see Out of Scope), so there is no hook mechanism here —
  # the CI pipeline deletes and recreates this Job before rolling out new images (Phase 4)
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrator
          image: ghcr.io/verifluence/migrator:latest
          envFrom:
            - secretRef:
                name: vf-env

service-api.yaml

yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: vf-prod
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000

service-kick-server.yaml

yaml
apiVersion: v1
kind: Service
metadata:
  name: kick-server
  namespace: vf-prod
spec:
  selector:
    app: kick-server
  ports:
    - port: 80
      targetPort: 3000

ingress.yaml

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vf-ingress
  namespace: vf-prod
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
spec:
  rules:
    - host: api.verifluence.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
    - host: kick.verifluence.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kick-server
                port:
                  number: 80
  tls:
    - hosts:
        - api.verifluence.io
        - kick.verifluence.io

TLS: Traefik's built-in Let's Encrypt (ACME) cert resolver handles certificate issuance and renewal automatically — no cert-manager required at this scale. Note that the Traefik bundled with k3s ships with no cert resolver configured: the letsencrypt resolver referenced in the annotations must be enabled via the Traefik chart values, and its acme.json storage must persist across Traefik restarts or certificates will be re-issued on every restart.
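On k3s, the bundled Traefik is customized through a HelmChartConfig in kube-system. A sketch of the resolver setup — the file name, email address, and exact values keys are assumptions to verify against the Traefik chart version your k3s ships:

```yaml
# infra/k8s/traefik-config.yaml (hypothetical) — applied once during bootstrap
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - "--certificatesresolvers.letsencrypt.acme.email=ops@verifluence.io"
      - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
    persistence:
      enabled: true   # keep acme.json across Traefik restarts
```

k3s picks this up and redeploys Traefik with the extra arguments; no manual Helm invocation is involved.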


Phase 3 — Local Validation (day 2, ~2 h)

Before touching production, validate all manifests against a local k3d cluster:

bash
# Start a local mirror cluster
k3d cluster create vf-local --port "8080:80@loadbalancer"

# Apply everything
kubectl apply -f infra/k8s/namespace.yaml
kubectl apply -f infra/k8s/configmap.yaml
kubectl apply -f infra/k8s/secret.yaml          # use dev values
kubectl apply -f infra/k8s/deployment-api.yaml
kubectl apply -f infra/k8s/deployment-kick-server.yaml
kubectl apply -f infra/k8s/deployment-worker.yaml
kubectl apply -f infra/k8s/service-api.yaml
kubectl apply -f infra/k8s/service-kick-server.yaml

# Watch rollout
kubectl rollout status deployment/api -n vf-prod

# Port-forward to smoke-test without ingress
kubectl port-forward svc/api 3000:80 -n vf-prod
curl http://localhost:3000/api/health
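The curl can race the port-forward while it establishes, making the smoke test flaky. A tiny retry wrapper (a generic sketch, not part of the repo) sidesteps that:

```shell
# retry N CMD... — run CMD up to N times, one second apart, until it succeeds.
retry() {
  n=$1; shift
  i=1
  while [ "$i" -le "$n" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage: retry 10 curl -sf http://localhost:3000/api/health
```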

All probes must pass before Phase 4.


Phase 4 — CI/CD Pipeline Update (day 2–3, ~3 h)

Replace the current deploy step (SSH + docker compose pull && up) with:

yaml
# .github/workflows/deploy-api.yml  — new deploy job

  deploy:
    name: Deploy to k3d
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Write kubeconfig
        run: |
          mkdir -p ~/.kube
          echo "${{ secrets.KUBECONFIG_B64 }}" | base64 -d > ~/.kube/config

      - name: Delete old migrator Job (if exists)
        run: |
          kubectl delete job migrator -n vf-prod --ignore-not-found=true

      - name: Run migrator
        run: |
          # Patch image tag then apply
          sed "s|:latest|:sha-${{ github.sha }}|g" infra/k8s/job-migrator.yaml \
            | kubectl apply -f -
          kubectl wait --for=condition=complete job/migrator -n vf-prod --timeout=120s

      - name: Roll out new images
        run: |
          SHA=sha-${{ github.sha }}
          kubectl set image deployment/api         api=${{ env.API_IMAGE }}:${SHA}         -n vf-prod
          kubectl set image deployment/kick-server kick-server=${{ env.KICK_IMAGE }}:${SHA} -n vf-prod
          kubectl set image deployment/worker      worker=${{ env.WORKER_IMAGE }}:${SHA}    -n vf-prod

      - name: Wait for rollout
        run: |
          kubectl rollout status deployment/api         -n vf-prod --timeout=120s
          kubectl rollout status deployment/kick-server -n vf-prod --timeout=120s
          kubectl rollout status deployment/worker      -n vf-prod --timeout=120s

      - name: Smoke-test health endpoint
        run: |
          kubectl run smoke --rm -i --restart=Never \
            --image=curlimages/curl:latest \
            --namespace=vf-prod \
            -- curl -sf http://api/api/health | grep '"status":"ok"'
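The migrate-then-roll sequence above is also worth capturing in infra/scripts/deploy.sh so it can be run by hand from the VPS. A sketch, assuming image names follow ghcr.io/verifluence/&lt;name&gt; as in the manifests:

```shell
#!/usr/bin/env bash
# infra/scripts/deploy.sh — roll all workloads to the image tag for one git SHA.
set -eu

deploy() {
  sha="sha-$1"
  # Recreate the migrator Job at the new tag and wait for it to finish.
  kubectl delete job migrator -n vf-prod --ignore-not-found=true
  sed "s|:latest|:${sha}|g" infra/k8s/job-migrator.yaml | kubectl apply -f -
  kubectl wait --for=condition=complete job/migrator -n vf-prod --timeout=120s
  # Roll each Deployment and block until its rollout succeeds.
  for d in api kick-server worker; do
    kubectl set image "deployment/${d}" "${d}=ghcr.io/verifluence/${d}:${sha}" -n vf-prod
    kubectl rollout status "deployment/${d}" -n vf-prod --timeout=120s
  done
}

# Only act when a SHA is passed: ./deploy.sh "$GITHUB_SHA"
if [ -n "${1:-}" ]; then
  deploy "$1"
fi
```

The CI job then shrinks to writing the kubeconfig and calling this script, keeping the deploy logic testable outside GitHub Actions.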

Secrets required in GitHub:

  • KUBECONFIG_B64 — base64-encoded kubeconfig from the Hetzner VPS (k3d kubeconfig get vf-cluster | base64). Note that k3d writes the API server address as localhost: it must be rewritten to an address GitHub's runners can reach, with the API port published (k3d's --api-port flag) — or the deploy job moved to a self-hosted runner on the VPS.
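The server-address rewrite is a one-line sed; the full pipeline on the VPS would be `k3d kubeconfig get vf-cluster | sed … | base64 -w0`. Demonstrated on a sample line as k3d emits it (203.0.113.10 is a placeholder for the VPS public IP, and the random local port is typical k3d output):

```shell
echo '    server: https://0.0.0.0:39155' \
  | sed -E 's|server: https://[^:]+:[0-9]+|server: https://203.0.113.10:6443|'
# → prints:     server: https://203.0.113.10:6443
```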

Phase 5 — Production Cutover (day 3–4, ~2 h)

Cutover is done in-place on the same VPS — no DNS changes needed because Traefik takes over the same ports (80/443) that the compose reverse-proxy was using. One caveat: Docker cannot bind 80/443 for the k3d load balancer while the compose proxy still holds them, so either create the Phase 1 cluster without the --port mappings and add them here (k3d cluster edit vf-cluster --port-add "80:80@loadbalancer", likewise 443), or perform the bootstrap inside this downtime window.

bash
# On the Hetzner VPS:

# 1. Stop compose stack (brief downtime window, < 30 s)
cd /srv/verifluence
docker compose down

# 2. Apply k8s manifests (cluster already running from bootstrap)
kubectl apply -f infra/k8s/
kubectl rollout status deployment/api -n vf-prod --timeout=120s

# 3. Verify
curl https://api.verifluence.io/api/health

Rollback if anything is wrong:

bash
k3d cluster stop vf-cluster   # release ports 80/443 first
docker compose up -d          # old stack back in seconds

Phase 6 — Post-Migration Cleanup (week 2)

Once k3d has been stable in production for 7 days:

  • Keep the db: service for local dev only — prod PG is Hetzner Managed, and always was
  • Rename docker-compose.yml → docker-compose.dev.yml with a comment: "local dev only"
  • Archive the old SSH-based deploy step from deploy-api.yml
  • Document the new kubectl runbook in the DR playbook (IR-8)

Synergies with M-IR

This migration directly satisfies several open M-IR items:

| M-IR item | How k3d covers it |
|---|---|
| IR-2 Zero-downtime rolling deploy | RollingUpdate with maxUnavailable: 0 is the deploy strategy |
| IR-1 Health endpoint gates traffic | readinessProbe on /api/health — pods receive no traffic until healthy |
| IR-6 Resource metrics | kubectl top pods via the metrics-server that k3s bundles, with resource requests/limits as the baseline; Prometheus scraping can be layered on later |

Risks & Mitigations

| Risk | Mitigation |
|---|---|
| k3d overhead on a small VPS | The k3s control plane is lightweight (on the order of a few hundred MB of RAM) — well within headroom on a Hetzner CX22 (4 GB) |
| kubeconfig leaking production access | Store in a GitHub secret, rotate quarterly; scope via the --kubeconfig flag |
| Traefik TLS fails on first deploy | Point the cert resolver at Let's Encrypt's staging endpoint first; switch to production once a staging cert is issued |
| Migrator Job fails mid-deploy | backoffLimit: 2; the pipeline halts on the kubectl wait timeout before any images are rolled |
| Local dev breaks (no compose db) | docker-compose.dev.yml keeps postgres for local dev; devs switch with COMPOSE_FILE=docker-compose.dev.yml |

Out of Scope

  • Helm chart packaging (manifests are enough at current scale; Helm is a V2 consideration)
  • Horizontal Pod Autoscaler (HPA) — everything runs on one host, so there is no capacity to scale into; revisit when a second node is added
  • Multi-node cluster — follow-up: add a second Hetzner VPS as a k3d agent node
  • Persistent volumes for PostgreSQL — DB stays on Hetzner Managed PG; no PVCs needed

Verifluence Documentation