Building a 6-Model Ensemble for Smartphone Grading

Smartphone grading is the boring-but-hard part of the secondhand phone trade. A buyback agent looks at a phone, judges scratches, dents, screen burn-in, and water-damage indicators, and assigns a grade (A, B, C, D, or F). The grade determines the trade-in price. Two agents looking at the same phone often disagree by one or two grades, which is enough to lose customer trust.

We built a computer-vision system to automate this. Here's how the pipeline ended up after a year of iteration.

The architecture

A single model can't do this. Smartphone defects span vastly different scales: a hairline scratch on the back glass is sub-pixel at full-frame resolution, while a missing button is obvious. We landed on a six-model ensemble organized by function:

1. Phone detection: locate and crop the phone in the frame (YOLOv11l).
2. Side classification: front, back, left, right, top, bottom (ResNet-50, fine-tuned).
3. Screen-region segmentation: isolate the display panel for burn-in / dead-pixel analysis (UNet).
4. Body-defect detection: scratches, dents on the chassis (YOLOv11x with a custom defect dataset).
5. Glass-crack segmentation: front/back glass cracks (UNet, separate weights).
6. Final grade classifier: takes structured outputs from models 1–5 and emits a final letter grade (gradient boosting on tabular features).

Models 1–5 are vision models. Model 6 is deliberately not a vision model — it's a small XGBoost classifier operating on counts, areas, and confidence scores. This decoupling turned out to be one of the best decisions we made.
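
To make the hand-off to model 6 concrete, here is a minimal sketch of how outputs from the vision models might be flattened into one tabular row of counts, areas, and confidence scores. The `Detection` type, `build_feature_row`, and every feature name are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One defect reported by an upstream vision model (hypothetical schema)."""
    kind: str          # e.g. "scratch" or "dent"
    length_mm: float   # longest dimension of the defect
    confidence: float  # detector confidence in [0, 1]


def build_feature_row(
    body_defects: list[Detection],
    crack_area_fraction: float,
    dead_pixel_clusters: int,
) -> dict[str, float]:
    """Flatten per-model outputs into a single row of numeric features."""
    scratches = [d for d in body_defects if d.kind == "scratch"]
    dents = [d for d in body_defects if d.kind == "dent"]
    return {
        "scratch_count": float(len(scratches)),
        "dent_count": float(len(dents)),
        "max_scratch_mm": max((d.length_mm for d in scratches), default=0.0),
        "min_confidence": min((d.confidence for d in body_defects), default=1.0),
        "crack_area_fraction": crack_area_fraction,
        "dead_pixel_clusters": float(dead_pixel_clusters),
    }


# Made-up example values, one row per phone.
row = build_feature_row(
    [Detection("scratch", 6.2, 0.91), Detection("dent", 3.0, 0.78)],
    crack_area_fraction=0.021,
    dead_pixel_clusters=1,
)
```

A row like this is the kind of input a tabular classifier such as XGBoost consumes, which is what keeps the grader independent of pixels.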

Why decouple the final grade?

Early prototypes used an end-to-end vision model that emitted a grade directly. It worked, sort of. It also failed catastrophically when we changed lighting conditions, camera angles, or phone models. Retraining the whole thing took days.

By splitting the pipeline, we gained three things:

Each component is independently testable. A scratch detector either finds a scratch or it doesn't. We can build a tight regression suite for that one model.
The grading logic is auditable. When a customer disputes a grade, we can show them: "We detected 3 scratches longer than 5 mm, covering 2.1% of the back glass, plus one screen dead-pixel cluster. That maps to grade B." A black-box CNN can't explain itself.
The grade thresholds become a config file. When the business team wants to be more lenient on grade B in Q4, we change a YAML, not a model.
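
A rough sketch of how thresholds in config and an auditable explanation can fit together. It uses a plain rule table in place of the XGBoost classifier described above, purely for illustration. `GRADE_RULES` stands in for a parsed YAML file, and all keys and limits are made-up assumptions.

```python
# Stand-in for a parsed YAML config; keys and limits are invented examples.
GRADE_RULES = [
    {"grade": "A", "max_scratches": 0, "max_scratch_area": 0.0, "max_dead_pixel_clusters": 0},
    {"grade": "B", "max_scratches": 3, "max_scratch_area": 0.03, "max_dead_pixel_clusters": 1},
    {"grade": "C", "max_scratches": 8, "max_scratch_area": 0.10, "max_dead_pixel_clusters": 3},
    {"grade": "D", "max_scratches": 20, "max_scratch_area": 0.30, "max_dead_pixel_clusters": 10},
]
FALLBACK_GRADE = "F"


def grade_phone(scratches, scratch_area, dead_pixel_clusters, rules=GRADE_RULES):
    """Return (grade, explanation) from the first rule whose limits all hold."""
    summary = (
        f"{scratches} scratches covering {scratch_area:.1%} of back glass, "
        f"{dead_pixel_clusters} dead-pixel cluster(s)"
    )
    for rule in rules:  # rules are ordered best grade first
        if (
            scratches <= rule["max_scratches"]
            and scratch_area <= rule["max_scratch_area"]
            and dead_pixel_clusters <= rule["max_dead_pixel_clusters"]
        ):
            return rule["grade"], f"{summary}: within grade {rule['grade']} limits"
    return FALLBACK_GRADE, f"{summary}: exceeds all configured limits"


grade, why = grade_phone(scratches=3, scratch_area=0.021, dead_pixel_clusters=1)
```

Loosening grade B then means editing one number in the rule table, and the explanation string is built from the same inputs that produced the grade.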

The bottleneck wasn't where we expected

We benchmarked YOLOv8, v10, v11, v12, and v26 on 10,000+ samples, looking for the best detection model. The differences were real but small (1–3% mAP). The actual bottlenecks in production were elsewhere:

Image upload from the agent's tablet dominated end-to-end latency. The model pipeline took ~2 seconds; the upload took 5–8 seconds.
Image preprocessing (decoding, color-space conversion, resize) was ~30% of GPU time before we batched it.
Disk I/O on the model registry during cold starts spiked p99 latency by 4 seconds.

The fix wasn't a better model. It was:

Client-side image compression (WebP at quality 85) — cut upload time by 60%.
Preprocessing batched and run on the GPU with TorchVision Transforms v2, instead of image by image.
Model weights pre-fetched in init containers, so the first request to a fresh pod doesn't pay the cold-start cost.

Lesson: profile before you optimize. We almost spent two weeks tuning YOLOv12 hyperparameters when the win was in the upload pipeline.
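
Measuring the whole request does not need heavy tooling. Below is a minimal sketch of a per-stage timer, with `time.sleep` calls standing in for the real upload and inference steps. `StageTimer` and the stage names are hypothetical, not part of the production system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self) -> None:
        self.totals: dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict[str, float]:
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        return {k: v / total for k, v in self.totals.items()} if total else {}


timer = StageTimer()
with timer.stage("upload"):
    time.sleep(0.05)  # placeholder for the real image upload
with timer.stage("inference"):
    time.sleep(0.01)  # placeholder for the model pipeline

slowest = max(timer.totals, key=timer.totals.get)
```

For GPU stages, wall-clock numbers like these are only meaningful if the GPU work is synchronized before the timer stops, since CUDA calls return asynchronously.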

Model versioning is harder than it looks

Six models means six independent training schedules, six datasets, six sets of weights. We use:

DVC for dataset versioning, with remote storage on GCS.
MLflow on Vertex AI for experiment tracking and model registry.
A single pipeline.yaml that pins exact model versions for production. Bumping a model version is a PR. The PR runs the regression suite. The merge triggers a canary rollout.

Without this, "which version of model 4 is running in production?" becomes an unanswerable question within a few sprints.
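
With versions pinned in one place, answering that question can be a small check. This is a sketch only: `PINNED` stands in for the parsed contents of pipeline.yaml, `running` for whatever the serving pods report, and all model names and version strings are invented.

```python
# Stand-in for parsed pipeline.yaml; model names and versions are invented.
PINNED = {
    "phone_detection": "3.2.0",
    "side_classification": "1.4.1",
    "body_defect_detection": "5.0.2",
}


def find_drift(pinned: dict[str, str], running: dict[str, str]) -> dict[str, tuple]:
    """Map each mismatched model to (pinned_version, running_version).

    A model missing on either side is reported with None for that side.
    """
    drift = {}
    for name in sorted(pinned.keys() | running.keys()):
        want, have = pinned.get(name), running.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


running = {
    "phone_detection": "3.2.0",
    "side_classification": "1.4.1",
    "body_defect_detection": "4.9.0",  # e.g. a stale pod serving old weights
}
drift = find_drift(PINNED, running)
```

An empty result means production matches the pins. Anything else names the model and both versions, which is usually enough to start debugging.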

What I'd tell my past self

Build the regression suite before the second model. Once you have several models, retrofitting tests gets harder with every model you add.
Don't optimize model latency until you've measured the whole request lifecycle. Network and I/O usually dominate.
Decouple decisions from features. Anything that requires business approval (grade thresholds, accept/reject logic) should be config, not weights.
Plan for the day the dataset doubles. Our annotation tooling around Label Studio and Roboflow was an afterthought; it became the rate-limiting step for new model versions.
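
On the first point, a regression check for a single detector can start small. The sketch below scores predicted boxes against a hand-labeled golden set using greedy IoU matching. The box format, the 0.5 IoU threshold, and the `MIN_RECALL` gate are assumptions for illustration, not the suite we run.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(golden: list[Box], predicted: list[Box], threshold: float = 0.5) -> float:
    """Fraction of golden boxes matched greedily by a distinct prediction."""
    unused = list(predicted)
    hits = 0
    for g in golden:
        best = max(unused, key=lambda p: iou(g, p), default=None)
        if best is not None and iou(g, best) >= threshold:
            unused.remove(best)  # each prediction may match only one golden box
            hits += 1
    return hits / len(golden) if golden else 1.0


MIN_RECALL = 0.95  # assumed release gate; pick your own

golden = [(10, 10, 50, 20), (100, 100, 140, 110)]
predicted = [(12, 10, 52, 20), (300, 300, 320, 310)]  # one hit, one miss
recall = recall_at_iou(golden, predicted)
passes_gate = recall >= MIN_RECALL
```

Run against a frozen golden set on every model-version PR, a check like this turns "did the scratch detector get worse?" into a pass/fail answer.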

The system is in production today, processing thousands of phones a day, with grading consistency materially better than human-only review. The interesting engineering wasn't in any single model — it was in the seams between them.