Migrating from Docker Swarm to GKE: 5 Lessons from Production

April 15, 2026

When I joined the Nielai project at ENB Mobile Care, the AI inference layer was running on a Docker Swarm cluster on bare-metal servers. It worked, if only barely, for our smartphone-grading pipeline of six specialized models (detection, classification, segmentation). But as request volume grew and we needed to ship new model versions every sprint, Swarm's limitations started showing up everywhere: no real autoscaling, opaque rollouts, and a brittle deployment story.

We migrated everything to Google Kubernetes Engine (GKE) over a few months. Here's what I learned the hard way.

1. HPA tuning is per-microservice, not per-cluster

The naive approach: set a single Horizontal Pod Autoscaler config and apply it to every deployment. We did this for about a week. It was a disaster.

Our pipeline has wildly different workload shapes:

A single HPA targeting 70% CPU made the cheap services scale aggressively while the expensive ones starved. The fix was per-service tuning:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: detection-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: detection
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: inference_latency_p95_ms
        target:
          type: AverageValue
          averageValue: "300"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300

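The custom metric (inference_latency_p95_ms) is what actually matters to users. CPU utilization is a proxy at best. One caveat: CPU works out of the box through the resource metrics API, but a Pods metric like this one is only visible to the HPA if a metrics adapter exposes it through the custom metrics API. Once we wired in latency-based scaling, p95 stabilized within minutes of a traffic spike.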
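To see why adding the latency metric changes scaling behavior, it helps to look at how the HPA combines multiple metrics. The sketch below follows the rule in the Kubernetes documentation: each metric proposes ceil(currentReplicas * currentValue / targetValue), and the largest proposal wins. The function names and the sample readings are illustrative assumptions, and the scaleDown stabilization window is not modeled.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count proposed by a single metric.

    Implements the documented HPA rule:
        desired = ceil(current_replicas * current_value / target_value)
    Scaling is skipped when the ratio is within `tolerance` of 1.0
    (0.1 is the documented default).
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(current_replicas: int,
                       metrics: list[tuple[float, float]],
                       min_replicas: int, max_replicas: int) -> int:
    """Combine several (current, target) metric pairs the way the HPA does:
    take the largest proposal, then clamp to [min_replicas, max_replicas]."""
    proposals = [desired_replicas(current_replicas, cur, tgt)
                 for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # Hypothetical readings: CPU at 45% (target 60), p95 at 540 ms (target 300).
    # CPU alone would keep 2 replicas; latency pushes the recommendation higher.
    print(hpa_recommendation(2, [(45, 60), (540, 300)],
                             min_replicas=2, max_replicas=6))
```

With those made-up readings, CPU on its own proposes no change while the latency metric asks for more pods, which is the situation a CPU-only HPA cannot see.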

2. Observability is non-negotiable on day one

Swarm was opaque. We had docker stats and a Slack channel that screamed when something fell over. On GKE, we wired up VictoriaMetrics + Grafana before the first production deploy.

The dashboards we found indispensable:

Lesson: if you can't see it, you can't debug it at 2 AM.

3. Don't migrate the pipeline. Migrate one service at a time.

Our first migration plan was "lift the entire stack into GKE in one weekend." We tried. It took two weekends, and the second one was a rollback.

What worked: migrate one stateless service at a time, route a small percentage of traffic via a gateway, and only proceed when the new service is producing identical outputs to the old one. We used a hash-comparison sidecar to verify that the GKE-served model produced byte-identical predictions to the Swarm version. When we found discrepancies (we did, on three separate models), they were almost always due to slightly different ONNX Runtime versions or CUDA driver mismatches. Catching those during traffic shadowing saved a production incident.
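The core of a byte-level comparison like this is small. The following is a minimal sketch, not the actual sidecar: it assumes both backends return their predictions as raw response bytes, and the function names, record fields, and sample payloads are hypothetical.

```python
import hashlib


def digest(payload: bytes) -> str:
    """SHA-256 hex digest of a raw response body."""
    return hashlib.sha256(payload).hexdigest()


def compare_responses(request_id: str, swarm_body: bytes,
                      gke_body: bytes) -> dict:
    """Compare the two backends' responses for one mirrored request.

    Hashes are equal only if the bodies are byte-identical, so even a
    last-digit floating-point difference shows up as a mismatch.
    """
    old, new = digest(swarm_body), digest(gke_body)
    return {
        "request_id": request_id,
        "match": old == new,
        "swarm_sha256": old,
        "gke_sha256": new,
    }


def mismatch_rate(results: list[dict]) -> float:
    """Fraction of compared requests whose responses differed."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r["match"]) / len(results)


if __name__ == "__main__":
    # Hypothetical payloads: the second pair differs in one score digit.
    results = [
        compare_responses("req-1", b'{"score": 0.91}', b'{"score": 0.91}'),
        compare_responses("req-2", b'{"score": 0.42}', b'{"score": 0.43}'),
    ]
    print(mismatch_rate(results))
```

Logging both digests alongside the request ID makes it possible to pull the exact inputs that diverged and replay them against each runtime version.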

4. Cost vs latency is the tradeoff that will haunt you

GKE makes it trivially easy to scale to "always have headroom." It also makes it trivially easy to burn money. Our first month of GKE was 2.4x the bare-metal Swarm bill.

We dialed it back by:

End result: cost came down to ~1.3x Swarm, with materially better latency and zero on-call pages from infrastructure issues.
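For a sense of scale, the two multipliers can be turned into percentages. This is plain arithmetic on the figures quoted above, with the Swarm bill normalized to 1.0; the helper name is made up for illustration.

```python
def relative_saving(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return 1 - after / before


if __name__ == "__main__":
    first_month = 2.4   # GKE bill as a multiple of the Swarm bill
    tuned = 1.3         # after dialing it back

    # Share of the first-month GKE bill that the tuning removed.
    print(f"cut vs. first month: {relative_saving(first_month, tuned):.0%}")
    # Premium still paid over bare-metal Swarm.
    print(f"premium over Swarm:  {tuned - 1.0:.0%}")
```

In other words, the tuning removed a bit under half of the first month's GKE spend, and what remains is a premium of roughly a third over bare metal.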

5. CI/CD discipline matters more than the platform

GKE didn't make our deployments faster. GitLab CI/CD pipelines with proper gates did.

Every model deploy now goes through:

  1. Lint + unit tests
  2. Containerization with docker buildx (multi-arch)
  3. Push to Artifact Registry
  4. Deploy to a staging namespace on GKE
  5. Run a regression suite of 200 labeled samples and assert mAP ≥ baseline minus tolerance
  6. Manual approval gate
  7. Canary rollout to 10% of production traffic
  8. Auto-promote to 100% if p95 latency stays in bounds for 30 minutes

This is the boring infrastructure that nobody thanks you for. It's also what lets us ship a new model version on a Friday without losing sleep.
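The two automated gates in that list (step 5 and step 8) boil down to simple predicates. Here is a minimal sketch under stated assumptions: the function names, the one-sample-per-minute cadence, and all numeric values are hypothetical, and in a real pipeline the inputs would come from the regression suite and the metrics store.

```python
def passes_regression_gate(candidate_map: float, baseline_map: float,
                           tolerance: float) -> bool:
    """Step 5: accept the candidate if mAP >= baseline minus tolerance."""
    return candidate_map >= baseline_map - tolerance


def should_promote(p95_samples_ms: list[float], limit_ms: float,
                   window_minutes: int = 30,
                   sample_interval_minutes: int = 1) -> bool:
    """Step 8: promote only if every p95 sample in the most recent
    window is within the limit.

    Returns False while the canary has not yet run for a full window.
    """
    needed = window_minutes // sample_interval_minutes
    if len(p95_samples_ms) < needed:
        return False
    return all(s <= limit_ms for s in p95_samples_ms[-needed:])


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(passes_regression_gate(0.805, baseline_map=0.81, tolerance=0.01))
    print(should_promote([250.0] * 30, limit_ms=300.0))
    print(should_promote([250.0] * 29 + [350.0], limit_ms=300.0))
```

Requiring a full window of samples before promoting is a deliberate choice in this sketch: a canary that has only been up for five quiet minutes should not count as passing.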

What I'd do differently

If I were starting over today:

The migration took four months end-to-end. Worth every hour. Our deploy frequency went from "carefully, every two weeks" to "any weekday afternoon," and we sleep through the night.
