ML recommendation system: a production-grade training project
A next-best-offer service designed with Spec-Driven Development via GitHub Spec Kit. Full MLOps stack built at a corporate session as a direct answer to skepticism about vibe-coding for serious systems.
Problem
"Vibe-coding is fine for a three-page site, but not for production ML" — the typical objection at corporate training sessions. This project was built at a training session specifically to answer that skepticism with working code.
Methodology
A next-best-offer service: personalized recommendations in real time for conversion and LTV growth. Designed via SDD (Spec-Driven Development) through GitHub Spec Kit. The spec is the source of truth; code is its output.
SDD phases: specify (what to build and why) → clarify (the agent places [NEEDS CLARIFICATION] markers, which forces open unknowns to be named) → plan (each architectural decision with documented rationale: not "we decided this" but "here's why") → tasks (breakdown by dependencies, parallelizable tasks explicitly marked) → analyze (continuous validation that finds contradictions between spec, plan, and tasks; a refinement loop, not a one-time gate) → implement (the agent executes tasks one by one, commits and tests).
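To make the tasks phase concrete, here is a minimal sketch of a dependency breakdown in which parallelizable tasks are marked explicitly. It uses only the Python standard library (`graphlib`, Python 3.9+). The task names and the `[P]` marker are hypothetical illustrations, not taken from the project's actual task plan.

```python
from graphlib import TopologicalSorter

# Hypothetical task plan: each task maps to the set of tasks it depends on.
# Names are illustrative only, not the project's real tasks.
TASKS = {
    "db_schema": set(),
    "feature_definitions": set(),
    "event_ingest_api": {"db_schema"},
    "training_pipeline": {"db_schema", "feature_definitions"},
    "nbo_endpoint": {"event_ingest_api", "training_pipeline"},
}


def execution_waves(tasks):
    """Group tasks into waves; tasks inside one wave do not depend on each other."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)  # unblock tasks that were waiting on this wave
    return waves


waves = execution_waves(TASKS)
for number, wave in enumerate(waves, start=1):
    marker = " [P]" if len(wave) > 1 else ""  # more than one task: can run in parallel
    print(f"wave {number}{marker}: {', '.join(wave)}")
```

Tasks that land in the same wave can be handed to the agent in parallel. A cycle in the plan surfaces as an error before any implementation starts, which is the kind of contradiction the analyze phase is meant to catch.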
- 1. API layer: FastAPI — endpoints for customer data, events, NBO recommendations.
- 2. Async processing: Celery workers process events and compute recommendations asynchronously, outside the request path, rather than synchronously inside each request.
- 3. ML pipeline: dual-model — ALS (collaborative filtering) + LightGBM (gradient boosting); model training and deployment.
- 4. Feature store: Feast centralizes features — consistency between training and inference (no train/serve skew).
- 5. Real-time + batch hybrid: instant API responses + scheduled retraining.
- 6. Observable by default: Prometheus metrics, Jaeger tracing, structured logs; Docker Compose with health checks.
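The source does not say how the two models are combined. One common arrangement, assumed here purely for illustration, is two-stage ranking: latent factors of the kind ALS produces select candidates, and a second model re-ranks them using features. The sketch below is plain Python with made-up numbers. `rerank_score` is a hand-written stand-in for a trained LightGBM model, and all offer names and feature values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy latent factors of the kind ALS learns (values are made up).
USER_FACTORS = {"u1": [0.9, 0.1]}
ITEM_FACTORS = {
    "card": [0.8, 0.2],
    "loan": [0.7, 0.1],
    "deposit": [0.1, 0.9],
    "insurance": [0.2, 0.3],
}

# Hypothetical per-offer features for user u1, as a feature store might serve them.
FEATURES = {
    "card": {"recent_views": 2, "already_owned": True},
    "loan": {"recent_views": 5, "already_owned": False},
    "deposit": {"recent_views": 0, "already_owned": False},
    "insurance": {"recent_views": 1, "already_owned": False},
}


def als_candidates(user_id, k):
    """Stage 1: top-k offers by user-item factor similarity."""
    user = USER_FACTORS[user_id]
    scored = {item: dot(user, vec) for item, vec in ITEM_FACTORS.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


def rerank_score(features):
    """Stage 2 stand-in for a trained gradient-boosting model's prediction."""
    not_owned = 0.0 if features["already_owned"] else 1.0
    return 0.6 * features["recent_views"] / 10 + 0.4 * not_owned


def next_best_offers(user_id, k=3, top_n=2):
    """Generate k candidates, then return the top_n after re-ranking."""
    candidates = als_candidates(user_id, k)
    ranked = sorted(candidates, key=lambda item: rerank_score(FEATURES[item]), reverse=True)
    return ranked[:top_n]


offers = next_best_offers("u1")
print(offers)
```

In this toy data the collaborative stage favors `card`, but the re-ranker pushes it down because the customer already owns it. Reading the same features at training and serving time is what the feature store is there to guarantee.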
Dependency map and cascade failure points:
events ────▶ Celery worker ────▶ Feast (feature store)
                                                   │
HTTP /nbo ──▶ FastAPI ──▶ model (ALS + LightGBM) ◀─┘
                            ▲
            batch retrain ──┘ (separate scheduled pipeline)
Feast / Redis / PG failure ──▶ cascade to /nbo
Silent batch retrain failure ──▶ model staleness (service up, model stale)
The main trap is visible on this diagram: the /nbo service stays "green" when the batch pipeline dies, so an alert on model freshness is needed, not just on API availability.
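A freshness check of this kind can be sketched in a few lines of standard-library Python. The 36-hour threshold and the idea that the training pipeline records a `trained_at` timestamp are assumptions for illustration. In a stack like this one, the computed age could plausibly be exported as a Prometheus metric and alerted on, but that wiring is not shown here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: retraining runs roughly daily, so 36 hours leaves some slack.
MAX_MODEL_AGE = timedelta(hours=36)


def model_age(trained_at, now=None):
    """Time elapsed since the model was last trained (timezone-aware datetimes)."""
    now = now or datetime.now(timezone.utc)
    return now - trained_at


def is_model_stale(trained_at, now=None, max_age=MAX_MODEL_AGE):
    """True when the last successful training is older than the allowed age."""
    return model_age(trained_at, now) > max_age


# Hypothetical timestamps: in practice trained_at would come from model metadata.
now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=2)
stale = now - timedelta(hours=48)

print(is_model_stale(fresh, now))
print(is_model_stale(stale, now))
```

The point of the check is that it depends only on the last successful training time, so it fires even while the API keeps answering requests normally.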
Artifact
github.com/dobryakov/next-best-0ffer (Python). Containerized, each component (API, workers, ML pipeline) isolated. The repository includes SDD specs and the task plan — used as training material at corporate sessions.
Where it breaks
- Model staleness: if retraining fails silently, the model goes stale unnoticed. An alert on model freshness is needed, not just on service availability.
- Redis/PostgreSQL unavailability cascades: a storage failure propagates across services, so an explicit degradation contract is needed.
- Celery scaling complexity at high event volume.
- Feature computation latency hits recommendation speed.
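One possible shape for the degradation contract mentioned in the list above: when the feature store cannot be reached, the service returns a precomputed non-personalized list and flags the response as degraded instead of failing. This is a sketch under assumptions; the exception type, the fallback list, and the response fields are hypothetical and not taken from the repository.

```python
class FeatureStoreUnavailable(Exception):
    """Raised by the (hypothetical) feature client when storage is unreachable."""


# Assumption: a small, precomputed list of generally popular offers.
POPULAR_OFFERS = ["card", "deposit"]


def recommend(customer_id, fetch_features, score):
    """Return personalized offers, or a flagged fallback if features are unavailable."""
    try:
        features = fetch_features(customer_id)
    except FeatureStoreUnavailable:
        # Degraded mode: non-personalized answer, explicitly marked for callers and metrics.
        return {"offers": list(POPULAR_OFFERS), "degraded": True}
    return {"offers": score(features), "degraded": False}


def broken_store(customer_id):
    """Simulates a storage outage."""
    raise FeatureStoreUnavailable("storage is down")


def healthy_store(customer_id):
    """Simulates a working feature lookup."""
    return {"segment": "premium"}


def toy_score(features):
    """Stand-in for the real model call."""
    return ["loan"] if features["segment"] == "premium" else ["card"]


normal = recommend("c1", healthy_store, toy_score)
degraded = recommend("c1", broken_store, toy_score)
print(normal)
print(degraded)
```

Making the `degraded` flag part of the response keeps the failure visible: the endpoint stays available, while dashboards and downstream consumers can still tell a personalized answer from a fallback.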
For whom and why
If you're running or planning corporate vibe-coding training — this is a training project that answers "it's not for serious systems" with working code. A full MLOps stack with an honest breakdown of where it fails: participants see both the result and its limitations.
Want this production-grade training format for your team?
Corporate enablement on AI-assisted ML development — with a working artifact that participants can dissect, not just a slide deck.