Breakdown 06

ML recommendation system: a production-grade training project

A next-best-offer service designed with Spec-Driven Development via GitHub Spec Kit. Full MLOps stack built at a corporate session as a direct answer to skepticism about vibe-coding for serious systems.

CTO, Head of AI, Architect

Problem

"Vibe-coding is fine for a three-page site, but not for production ML" — the typical objection at corporate training sessions. This project was built at a training session specifically to answer that skepticism with working code.

Methodology

A next-best-offer service: personalized recommendations in real time for conversion and LTV growth. Designed via SDD (Spec-Driven Development) through GitHub Spec Kit. The spec is the source of truth; code is its output.

SDD phases, in order:

specify: what to build and why.
clarify: the agent places [NEEDS CLARIFICATION] markers, which forces open unknowns to be named.
plan: each architectural decision with documented rationale (not "we decided this" but "here's why").
tasks: breakdown by dependencies, with parallelizable tasks explicitly marked.
analyze: continuous validation that finds contradictions between spec, plan, and tasks; a refinement loop, not a one-time gate.
implement: the agent executes tasks one by one, committing and testing as it goes.
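The tasks phase can be pictured as a dependency graph. The sketch below is a minimal illustration using Python's standard graphlib module; the task names are hypothetical and are not taken from the repository or from Spec Kit's output format. It groups tasks into batches whose members have no unmet dependencies and could therefore run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical task plan: task name -> set of tasks it depends on.
# These names are illustrative only, not taken from the project.
TASKS = {
    "api_schema": set(),
    "feature_definitions": set(),
    "event_ingestion": {"api_schema"},
    "train_als": {"feature_definitions"},
    "train_lightgbm": {"feature_definitions"},
    "nbo_endpoint": {"api_schema", "train_als", "train_lightgbm"},
}


def parallel_batches(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into ordered batches; tasks within a batch are independent."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, readable order
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(TASKS), start=1):
        print(number, batch)
```

With this plan the two independent tasks form the first batch, and nbo_endpoint lands in the last batch because it depends on both models.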

Architecture components:

1. API layer: FastAPI endpoints for customer data, events, and NBO recommendations.
2. Async processing: Celery workers process events and compute recommendations asynchronously rather than inside the request.
3. ML pipeline: two models, ALS (collaborative filtering) and LightGBM (gradient boosting), plus model training and deployment.
4. Feature store: Feast centralizes features so that training and inference use the same definitions (no train/serve skew).
5. Real-time + batch hybrid: instant API responses plus scheduled retraining.
6. Observable by default: Prometheus metrics, Jaeger tracing, structured logs; Docker Compose with health checks.
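Train/serve skew is easiest to see in miniature. In the project this job belongs to Feast; the sketch below deliberately does not use Feast's API and only illustrates the principle in plain Python, with hypothetical names (CustomerEvents, build_features) and made-up features. Both the training path and the serving path call one feature definition, so they cannot drift apart.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CustomerEvents:
    """Hypothetical raw input: purchase timestamps for one customer."""
    customer_id: str
    purchase_timestamps: list[datetime]


def build_features(events: CustomerEvents, as_of: datetime) -> dict[str, float]:
    """The single feature definition shared by training and serving."""
    # Only events at or before `as_of` are visible, so training rows
    # never see data from after the moment they describe.
    past = [t for t in events.purchase_timestamps if t <= as_of]
    if past:
        days_since_last = (as_of - max(past)).total_seconds() / 86400
    else:
        days_since_last = -1.0  # assumed sentinel for "no purchases yet"
    return {
        "purchase_count": float(len(past)),
        "days_since_last_purchase": days_since_last,
    }


def training_row(events: CustomerEvents, label_time: datetime) -> dict[str, float]:
    """Offline path: features as they looked at the label's timestamp."""
    return build_features(events, as_of=label_time)


def serving_row(events: CustomerEvents, now: datetime) -> dict[str, float]:
    """Online path: the same definition, evaluated at request time."""
    return build_features(events, as_of=now)


if __name__ == "__main__":
    utc = timezone.utc
    events = CustomerEvents("c1", [datetime(2024, 1, 1, tzinfo=utc),
                                   datetime(2024, 1, 5, tzinfo=utc)])
    print(serving_row(events, datetime(2024, 1, 7, tzinfo=utc)))
```

A feature store adds storage, versioning, and low-latency lookup on top of this idea; the shared definition is the part that removes skew.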

Dependency map and cascade failure points:

events ──▶ Celery worker ──▶ Feast (feature store)
                                   └───────────────┐
HTTP /nbo ──▶ FastAPI ──▶ model (ALS + LightGBM) ◀─┘
                              ▲
              batch retrain ──┘   (separate scheduled pipeline)

Feast / Redis / PG failure ──▶ cascade to /nbo
Silent batch retrain failure ──▶ model staleness (service up, model stale)

The main trap is visible in this diagram: the /nbo service stays "green" when the batch pipeline dies, so alerting has to cover model freshness, not just API availability.
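A freshness check can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the 36-hour threshold, the function names, and the report fields are all assumptions. The idea is that health is the conjunction of "API answers" and "model is recent enough", so a dead batch pipeline turns the report red even while /nbo keeps responding.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: how old the model may get before it counts as stale.
# A real value would follow from the retraining schedule.
MAX_MODEL_AGE = timedelta(hours=36)


def is_stale(trained_at: datetime, now: datetime,
             max_age: timedelta = MAX_MODEL_AGE) -> bool:
    """True if the model was trained longer ago than max_age."""
    return (now - trained_at) > max_age


def health_report(api_ok: bool, trained_at: datetime, now: datetime) -> dict:
    """Combine API availability and model freshness into one status."""
    stale = is_stale(trained_at, now)
    return {
        "api_available": api_ok,
        "model_age_seconds": (now - trained_at).total_seconds(),
        "model_stale": stale,
        # Healthy only if the API responds AND the model is fresh.
        "healthy": api_ok and not stale,
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_trained = now - timedelta(days=3)  # simulated silent retrain failure
    print(health_report(api_ok=True, trained_at=last_trained, now=now))
```

In a Prometheus setup, the model age could be exported as a gauge and the threshold moved into an alerting rule; the timestamp of the last successful training run would have to be recorded by the training pipeline itself.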

Artifact

github.com/dobryakov/next-best-0ffer (Python). Containerized, with each component (API, workers, ML pipeline) isolated. The repository includes the SDD specs and the task plan, which are used as training material at corporate sessions.

Series signature

Where it breaks

Model staleness: if retraining fails silently, the model goes stale unnoticed while the service stays up. Alert on model freshness, not just service availability.
Redis/PostgreSQL unavailability: a storage failure cascades across services, so an explicit degradation contract is needed.
Celery scaling: worker scaling becomes complex at high event volume.
Feature latency: time spent computing features adds to recommendation response time.
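One possible shape for an explicit degradation contract is sketched below. It is an assumption about how such a contract might look, not the project's code: FeatureStoreUnavailable, FALLBACK_OFFERS, and the response fields are hypothetical. When the feature lookup fails, the service returns a static fallback and says so in the response instead of propagating the storage error.

```python
from typing import Callable

# Hypothetical static fallback, e.g. a precomputed list of popular offers.
FALLBACK_OFFERS = ["popular_offer_1", "popular_offer_2"]


class FeatureStoreUnavailable(Exception):
    """Raised by the feature lookup when the backing storage is down."""


def recommend(
    customer_id: str,
    fetch_features: Callable[[str], dict],
    score: Callable[[dict], list],
) -> dict:
    """Return personalized offers, or a flagged fallback if storage is down."""
    try:
        features = fetch_features(customer_id)
    except FeatureStoreUnavailable:
        # Contract: never fail the request on storage errors; serve the
        # fallback and mark the response so callers and metrics can see it.
        return {"offers": list(FALLBACK_OFFERS), "degraded": True}
    return {"offers": score(features), "degraded": False}


if __name__ == "__main__":
    def broken_store(customer_id: str) -> dict:
        raise FeatureStoreUnavailable("storage unreachable")

    print(recommend("c1", broken_store, lambda features: ["offer_a"]))
```

The degraded flag matters as much as the fallback itself: counting degraded responses makes the cascade visible instead of hiding it behind a successful status code.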

For whom and why

If you're running or planning corporate vibe-coding training, this is a training project that answers "it's not for serious systems" with working code. It is a full MLOps stack with an honest breakdown of where it fails, so participants see both the result and its limitations.

Want this production-grade training format for your team?

Corporate enablement on AI-assisted ML development — with a working artifact that participants can dissect, not just a slide deck.


Other breakdowns

An engineering breakdown series: real task → methodology → working artifact → honest breakdown of where it fails.
