Building a 6-Model Ensemble for Smartphone Grading

May 10, 2026

Smartphone grading is the boring-but-hard part of the secondhand phone trade. A buyback agent inspects a phone for scratches, dents, screen burn-in, and water-damage indicators, then assigns a grade (A, B, C, D, or F). The grade determines the trade-in price. Two agents looking at the same phone often disagree by one or two grades, which is enough to lose customer trust.

We built a computer-vision system to automate this. Here's how the pipeline ended up after a year of iteration.

The architecture

A single model can't do this. Smartphone defects span vastly different scales: a hairline scratch on the back glass is sub-pixel at full-frame resolution, while a missing button is obvious. We landed on a six-model ensemble organized by function:

  1. Phone detection — locate and crop the phone in the frame (YOLOv11l).
  2. Side classification — front, back, left, right, top, bottom (ResNet-50, fine-tuned).
  3. Screen-region segmentation — isolate the display panel for burn-in / dead-pixel analysis (UNet).
  4. Body-defect detection — scratches and dents on the chassis (YOLOv11x with a custom defect dataset).
  5. Glass-crack segmentation — front/back glass cracks (UNet, separate weights).
  6. Final grade classifier — takes structured outputs from models 1–5 and emits a final letter grade (gradient boosting on tabular features).

Models 1–5 are vision models. Model 6 is deliberately not a vision model — it's a small XGBoost classifier operating on counts, areas, and confidence scores. This decoupling turned out to be one of the best decisions we made.
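
To make the hand-off concrete, here is a minimal sketch of how per-side vision outputs could be flattened into one tabular row for the grade classifier. The class names, field names, and feature list are all hypothetical, chosen only to illustrate "counts, areas, and confidence scores"; they are not the production schema.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    label: str         # hypothetical labels: "scratch" or "dent"
    confidence: float  # detector score in [0, 1]
    area_frac: float   # defect box area / phone crop area

@dataclass
class SideResult:
    side: str                                    # "front", "back", ...
    defects: list = field(default_factory=list)  # list of Detection
    crack_area_frac: float = 0.0                 # crack mask area / glass area

# Fixed column order, so every phone yields a row with the same schema.
FEATURE_NAMES = [
    "n_scratch", "n_dent", "defect_area_frac", "max_defect_conf",
    "front_crack_area_frac", "back_crack_area_frac",
]

def to_features(sides):
    """Collapse per-side results into counts, areas, and confidence scores."""
    feats = dict.fromkeys(FEATURE_NAMES, 0.0)
    for side in sides:
        for d in side.defects:
            count_key = f"n_{d.label}"
            if count_key in feats:  # ignore labels outside the schema
                feats[count_key] += 1
            feats["defect_area_frac"] += d.area_frac
            feats["max_defect_conf"] = max(feats["max_defect_conf"], d.confidence)
        crack_key = f"{side.side}_crack_area_frac"
        if crack_key in feats:      # only front and back glass are tracked
            feats[crack_key] = max(feats[crack_key], side.crack_area_frac)
    return feats

def to_row(sides):
    """Feature values in FEATURE_NAMES order, ready for a tabular model."""
    feats = to_features(sides)
    return [feats[name] for name in FEATURE_NAMES]
```

A row from `to_row` is the kind of input a gradient-boosted classifier trains on. As long as the schema stays fixed, any of the vision models can be retrained or swapped without touching the grader.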

Why decouple the final grade?

Early prototypes used an end-to-end vision model that emitted a grade directly. It worked, sort of. It also failed catastrophically when we changed lighting conditions, camera angles, or phone models. Retraining the whole thing took days.

By splitting the pipeline, we gained three things:

The bottleneck wasn't where we expected

We benchmarked YOLOv8, v10, v11, v12, and v26 on 10,000+ samples, looking for the best detection model. The differences were real but small (1–3% mAP). The actual bottleneck in production was elsewhere:

The fix wasn't a better model. It was:

Lesson: profile before you optimize. We almost spent two weeks tuning YOLOv12 hyperparameters when the win was in the upload pipeline.
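
A minimal sketch of what "profile first" can look like: wrap each pipeline stage in a timer and rank stages by total wall-clock time. The `StageTimer` class and the stage names are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulates wall-clock durations per named pipeline stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append(time.perf_counter() - start)

    def report(self):
        """Return (stage, total_seconds) pairs, slowest stage first."""
        totals = {name: sum(durations) for name, durations in self.samples.items()}
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)

timer = StageTimer()
with timer.stage("upload"):
    time.sleep(0.02)    # stand-in for image upload
with timer.stage("detect"):
    time.sleep(0.001)   # stand-in for model inference

for name, seconds in timer.report():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Even a crude ranking like this is enough to show whether the time goes to inference or to everything around it.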

Model versioning is harder than it looks

Six models means six independent training schedules, six datasets, six sets of weights. We use:

Without this, "which version of model 4 is running in production?" becomes an unanswerable question within a few sprints.
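
A minimal sketch of one way to keep that question answerable, assuming a JSON manifest that pins each model to a version string and a SHA-256 of its weights. The function names, model names, and version numbers are hypothetical and do not describe any specific tool.

```python
import hashlib
import json

def sha256_bytes(data: bytes) -> str:
    """Hex digest of a weights blob (stream from disk for large files)."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(models):
    """models maps name -> (version, weight_bytes); returns a JSON-safe dict."""
    return {
        name: {"version": version, "sha256": sha256_bytes(blob)}
        for name, (version, blob) in sorted(models.items())
    }

def verify(manifest, name, blob) -> bool:
    """True if the loaded weights match what the manifest says is deployed."""
    return manifest[name]["sha256"] == sha256_bytes(blob)

# Hypothetical names and versions; the byte strings stand in for weight files.
manifest = build_manifest({
    "body_defect": ("1.4.0", b"fake-weights-a"),
    "glass_crack": ("0.9.2", b"fake-weights-b"),
})
print(json.dumps(manifest, indent=2))
```

Checking `verify` at service start-up turns a silent weight mismatch into a loud failure, and the manifest itself can be committed alongside the deployment config.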

What I'd tell my past self

The system is in production today, processing thousands of phones a day, with grading consistency materially better than human-only review. The interesting engineering wasn't in any single model — it was in the seams between them.
