When I joined the Nielai project at ENB Mobile Care, the AI inference layer was running on a Docker Swarm cluster on bare-metal servers. It worked — barely — for our smartphone-grading pipeline of six specialized models (detection, classification, segmentation). But as request volume grew and we needed to add new model versions every sprint, Swarm's limitations started showing up everywhere: no real autoscaling, opaque rollouts, and a brittle deployment story.
We migrated everything to Google Kubernetes Engine (GKE) over a few months. Here's what I learned the hard way.
1. HPA tuning is per-microservice, not per-cluster
The naive approach: set a single Horizontal Pod Autoscaler config and apply it to every deployment. We did this for about a week. It was a disaster.
Our pipeline has wildly different workload shapes:
- Detection (YOLOv11l): GPU-bound, ~150ms p50 latency, expensive to scale.
- Classification (ResNet-50): CPU-friendly, ~40ms latency, cheap to scale horizontally.
- Segmentation (UNet): Memory-heavy, slow to warm up, scales poorly above 2 replicas per node.
A single HPA targeting 70% CPU made the cheap services scale aggressively while the expensive ones starved. The fix was per-service tuning:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: detection-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: detection
      minReplicas: 2
      maxReplicas: 6
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
        - type: Pods
          pods:
            metric:
              name: inference_latency_p95_ms
            target:
              type: AverageValue
              averageValue: "300"
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300

The custom metric (inference_latency_p95_ms) is what actually matters to users. CPU utilization is a proxy at best. Once we wired in latency-based scaling, p95 stabilized within minutes of a traffic spike.
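To make the interaction between the CPU target and the latency target concrete, here is a minimal Python sketch of the replica calculation the Kubernetes documentation describes for the HPA: each metric proposes ceil(currentReplicas * currentValue / targetValue), and the largest proposal wins. The function names and the sample numbers are illustrative assumptions, and the sketch ignores the scaleDown stabilization window and pod readiness handling.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count proposed by a single metric."""
    ratio = current_value / target_value
    # The controller leaves the replica count alone when the ratio is
    # close enough to 1.0 (the documented default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def hpa_decision(current_replicas, metrics, min_replicas, max_replicas):
    """metrics is a list of (current_value, target_value) pairs.

    With several metrics, the largest proposal is used, then clamped
    to the [min_replicas, max_replicas] range.
    """
    proposals = [
        desired_replicas(current_replicas, current, target)
        for current, target in metrics
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical snapshot: CPU at 45% (target 60), p95 latency at 450 ms (target 300).
# CPU alone proposes 3 replicas; latency proposes 5, so latency drives the scale-up.
print(hpa_decision(3, [(45, 60), (450, 300)], min_replicas=2, max_replicas=6))  # 5
```

In this snapshot CPU looks healthy, which is exactly the case where a CPU-only HPA would have done nothing while users waited.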
2. Observability is non-negotiable on day one
Swarm was opaque. We had docker stats and a Slack channel that screamed when something fell over. On GKE, we wired up VictoriaMetrics + Grafana before the first production deploy.
The dashboards we found indispensable:
- Per-model inference latency histograms (p50, p95, p99) — caught a regression where ONNX Runtime fell back to CPU silently.
- Pod CPU/memory by deployment — surfaced a memory leak in the segmentation worker that only appeared after 6+ hours.
- Queue depth before and after each model stage — showed exactly where the pipeline was bottlenecked when end-to-end latency drifted.
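For readers who have not worked with latency percentiles, here is a rough Python sketch of the statistic behind the first dashboard. It is not our VictoriaMetrics query; it computes p50, p95, and p99 from raw samples with the standard library and flags a regression against a baseline. The function names and the 1.5x threshold are assumptions for illustration.

```python
import statistics


def latency_summary(samples_ms):
    """Return p50, p95 and p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def p95_regressed(current_ms, baseline_ms, factor=1.5):
    """True when current p95 exceeds baseline p95 by more than `factor` (assumed threshold)."""
    current = latency_summary(current_ms)["p95"]
    baseline = latency_summary(baseline_ms)["p95"]
    return current > baseline * factor
```

A silent fallback from GPU to CPU execution shifts the whole distribution, so a check like this on p95 tends to catch it even when the median still looks tolerable.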
Lesson: if you can't see it, you can't debug it at 2 AM.
3. Don't migrate the pipeline. Migrate one service at a time.
Our first migration plan was "lift the entire stack into GKE in one weekend." We tried. It took two weekends, and the second one was a rollback.
What worked: migrate one stateless service at a time, route a small percentage of traffic via a gateway, and only proceed when the new service is producing identical outputs to the old one. We used a hash-comparison sidecar to verify that the GKE-served model produced byte-identical predictions to the Swarm version. When we found discrepancies (we did, on three separate models), they were almost always due to slightly different ONNX Runtime versions or CUDA driver mismatches. Catching those during traffic shadowing saved a production incident.
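The comparison itself is simple. Below is a minimal Python sketch of the idea, not the actual sidecar: it assumes predictions can be serialized to canonical JSON, hashes them with SHA-256, and reports the request IDs where the two backends disagree. The names prediction_digest and compare_outputs, and the shape of the prediction dictionaries, are hypothetical.

```python
import hashlib
import json


def prediction_digest(prediction):
    """SHA-256 of a prediction serialized as canonical JSON."""
    # Sorted keys and fixed separators make equal predictions serialize
    # to equal bytes regardless of dictionary ordering.
    payload = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_outputs(old_outputs, new_outputs):
    """Return the sorted request IDs whose predictions differ or are missing.

    Both arguments map request ID -> prediction (any JSON-serializable value).
    """
    mismatches = []
    for request_id, old_pred in old_outputs.items():
        if request_id not in new_outputs:
            mismatches.append(request_id)
        elif prediction_digest(old_pred) != prediction_digest(new_outputs[request_id]):
            mismatches.append(request_id)
    return sorted(mismatches)


# Hypothetical outputs from the Swarm and GKE backends for the same two requests.
swarm = {"req-1": {"label": "cracked", "score": 0.91}, "req-2": {"label": "ok", "score": 0.75}}
gke = {"req-1": {"score": 0.91, "label": "cracked"}, "req-2": {"label": "ok", "score": 0.7500001}}
print(compare_outputs(swarm, gke))  # ['req-2']
```

Exact hashing is deliberately strict: a difference in the seventh decimal place counts as a mismatch, which is what surfaces runtime and driver version drift.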
4. Cost vs latency is the tradeoff that will haunt you
GKE makes it trivially easy to scale to "always have headroom." It also makes it trivially easy to burn money. Our first month of GKE was 2.4x the bare-metal Swarm bill.
We dialed it back by:
- Setting minReplicas: 1 for non-critical services with lazy warmup.
- Using GKE Autopilot pools for the bursty services and standard pools with reserved nodes for the always-on detection service.
- Aggressively shrinking CPU/RAM requests based on real p95 utilization, not the over-provisioned defaults.
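As a sketch of what "based on real p95 utilization" can look like in practice, the Python snippet below derives a resource request from observed usage samples. The 1.2 headroom multiplier and the rounding step are assumptions chosen for illustration, not the values we used, and recommend_request is a hypothetical helper.

```python
import math
import statistics


def recommend_request(usage_samples, headroom=1.2, step=50):
    """Suggest a resource request from observed usage.

    usage_samples: per-interval usage in one unit (e.g. CPU millicores or MiB).
    headroom: assumed safety multiplier applied on top of p95.
    step: round the result up to a multiple of this value.
    """
    # Index 94 of the 99 cut points is the 95th percentile.
    p95 = statistics.quantiles(usage_samples, n=100, method="inclusive")[94]
    return math.ceil(p95 * headroom / step) * step
```

Sizing from p95 rather than the maximum accepts that the pod will occasionally exceed its request, which is usually fine for CPU but deserves more caution for memory.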
End result: cost came down to ~1.3x Swarm, with materially better latency and zero on-call pages from infrastructure issues.
5. CI/CD discipline matters more than the platform
GKE didn't make our deployments faster. GitLab CI/CD pipelines with proper gates did.
Every model deploy now goes through:
- Lint + unit tests
- Containerization with docker buildx (multi-arch)
- Push to Artifact Registry
- Deploy to a staging namespace on GKE
- Run a regression suite of 200 labeled samples and assert mAP ≥ baseline minus tolerance
- Manual approval gate
- Canary rollout to 10% of production traffic
- Auto-promote to 100% if p95 latency stays in bounds for 30 minutes
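The two automated gates in that list reduce to small predicates. Here is a hedged Python sketch of both: the regression check on mAP and the canary promotion check on p95 latency. The function names, the 0.01 tolerance, and the one-sample-per-minute assumption are illustrative, not taken from our pipeline.

```python
def passes_regression_gate(candidate_map, baseline_map, tolerance=0.01):
    """True when the candidate's mAP is at least baseline minus tolerance."""
    return candidate_map >= baseline_map - tolerance


def should_promote(p95_samples_ms, p95_limit_ms, window_minutes=30):
    """Decide whether a canary can go to 100% of traffic.

    p95_samples_ms: p95 latency readings, assumed to be one per minute,
    oldest first. Promotion requires a full window of readings, all
    within the limit.
    """
    if len(p95_samples_ms) < window_minutes:
        return False
    recent = p95_samples_ms[-window_minutes:]
    return all(value <= p95_limit_ms for value in recent)
```

Requiring a full window before promoting means a canary that has only been up for ten quiet minutes cannot be promoted early.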
This is the boring infrastructure that nobody thanks you for. It's also what lets us ship a new model version on a Friday without losing sleep.
What I'd do differently
If I were starting over today:
- Start with observability before writing the first deployment manifest.
- Adopt Kustomize or Helm from day one. We templated manifests with shell scripts for far too long.
- Invest in a traffic shadowing harness earlier — it pays for itself the first time it catches a silent model regression.
- Don't pick the cluster size based on load tests. Pick it based on the second-busiest hour of real production traffic, then add 30%.
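The sizing rule in the last bullet is plain arithmetic. A small Python sketch, assuming you have peak requests per second for each hour and a measured per-node capacity (both hypothetical inputs here):

```python
import math


def cluster_size(hourly_peak_rps, rps_per_node, headroom=0.30):
    """Node count sized for the second-busiest hour plus headroom.

    hourly_peak_rps: peak requests/second observed in each hour of production.
    rps_per_node: assumed sustainable throughput of one node.
    """
    ranked = sorted(hourly_peak_rps, reverse=True)
    # Fall back to the only hour available if there is just one reading.
    reference = ranked[1] if len(ranked) > 1 else ranked[0]
    return math.ceil(reference * (1 + headroom) / rps_per_node)


# Hypothetical day: the busiest hour (900 rps) is treated as an outlier,
# so sizing uses the second-busiest hour (410 rps).
print(cluster_size([120, 340, 900, 410, 380], rps_per_node=50))  # 11
```

Using the second-busiest hour keeps one freak spike from setting the baseline, and the autoscaler is left to absorb that spike.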
The migration took four months end-to-end. Worth every hour. Our deploy frequency went from "carefully, every two weeks" to "any weekday afternoon," and we sleep through the night.