
# yolodex

Agent Skills for Autonomous YOLO Dataset Generation & Model Training
## Overview
Yolodex is a fully autonomous ML pipeline that turns any YouTube video into a trained YOLO object detection model, with no manual labeling required. Point it at a video URL, name your target classes (e.g. “player”, “weapon”, “vehicle”), and the system handles everything: video download, frame extraction, AI-powered labeling, data augmentation, model training, evaluation, and iterative refinement.
Built at the OpenAI Codex Hackathon 2026 (Feb 2026). Winner.
## Pipeline
### Collect
Downloads the video via yt-dlp and extracts frames at a configurable FPS using ffmpeg.
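The Collect step boils down to two CLI invocations. A minimal sketch of how they might be assembled (the helper name and the specific flags are assumptions for illustration, not the project's actual code); it builds the command lines rather than running them:

```python
from pathlib import Path

def collect_commands(url: str, out_dir: str = "frames", fps: float = 1.0) -> list[list[str]]:
    """Build the download and frame-extraction commands (hypothetical helper)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # yt-dlp fetches the video; -o names the output file
    download = ["yt-dlp", "-f", "mp4", "-o", "video.mp4", url]
    # ffmpeg's fps filter samples N frames per second; %06d.jpg numbers the frames
    extract = ["ffmpeg", "-i", "video.mp4", "-vf", f"fps={fps}", f"{out_dir}/%06d.jpg"]
    return [download, extract]
```

Each returned list can then be executed with `subprocess.run(cmd, check=True)`.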
### Label
A vision LLM (GPT-5-nano, GPT-4.1-mini, or Gemini) auto-generates YOLO bounding-box labels for each frame via structured JSON output.
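The structured JSON from the model then has to be converted into YOLO's normalized label format. A sketch assuming pixel-corner fields (`label`, `x1`, `y1`, `x2`, `y2`); the project's actual schema may differ:

```python
import json

def to_yolo_line(det: dict, class_ids: dict, img_w: int, img_h: int) -> str:
    """Convert one detection with pixel corners (x1, y1, x2, y2) into a YOLO
    label line: "class x_center y_center width height", normalized to [0, 1]."""
    x1, y1, x2, y2 = det["x1"], det["y1"], det["x2"], det["y2"]
    xc = (x1 + x2) / 2 / img_w   # box center, as a fraction of image width
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # box size, as a fraction of image size
    h = (y2 - y1) / img_h
    return f"{class_ids[det['label']]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example: one frame's structured-JSON response from the vision LLM
response = '{"detections": [{"label": "player", "x1": 100, "y1": 50, "x2": 300, "y2": 250}]}'
lines = [to_yolo_line(d, {"player": 0}, img_w=640, img_h=480)
         for d in json.loads(response)["detections"]]
# → ["0 0.312500 0.312500 0.312500 0.416667"]
```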
### Augment
Generates 4 synthetic variants per frame (flip, brightness, contrast, noise) with coordinated label transforms, for a 5x dataset expansion.
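A geometric augmentation only stays valid if the labels move with the pixels. For a horizontal flip the transform is simple: the normalized x-center reflects to 1 - x, while y, width, and height are untouched (brightness, contrast, and noise change pixels only, so their labels pass through unchanged). A sketch:

```python
def hflip_label(line: str) -> str:
    """Mirror a YOLO label line for a horizontally flipped image:
    only the normalized x-center changes (x -> 1 - x)."""
    cls, x, y, w, h = line.split()
    return f"{cls} {1.0 - float(x):.6f} {y} {w} {h}"

hflip_label("0 0.250000 0.400000 0.100000 0.200000")
# → "0 0.750000 0.400000 0.100000 0.200000"
```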
### Train
Runs Ultralytics YOLOv8 training on the labeled and augmented dataset.
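Ultralytics reads the dataset layout and class map from a YAML file. A plausible fragment, assuming the directory names below and the example classes from the Overview:

```yaml
# dataset.yaml (Ultralytics format): train/val paths are relative to `path`,
# and the class ids must match those written into the label files
path: ./dataset
train: images/train
val: images/val
names:
  0: player
  1: weapon
  2: vehicle
```

Training can then be launched with the Ultralytics CLI, e.g. `yolo detect train data=dataset.yaml model=yolov8n.pt epochs=50`.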
### Evaluate
Extracts mAP@50, precision, recall, and per-class AP, and identifies the weakest classes.
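Once per-class AP has been extracted from the validation run, picking the weakest classes is just a sort. A sketch with made-up scores (the helper name is an assumption):

```python
def weakest_classes(per_class_ap: dict[str, float], k: int = 2) -> list[str]:
    """Return the k classes with the lowest AP; these are the first
    candidates for re-labeling or extra data collection."""
    return sorted(per_class_ap, key=per_class_ap.get)[:k]

ap = {"player": 0.91, "weapon": 0.42, "vehicle": 0.67}
weakest_classes(ap)
# → ["weapon", "vehicle"]
```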
### Iterate
If mAP@50 is below target, re-labels the worst frames or collects more data, then re-trains automatically.
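The Iterate step can be sketched as a small control loop; the stage functions here are stand-ins for the real pipeline skills, and the default target and round budget are assumptions:

```python
def iterate(train, evaluate, relabel_worst,
            target_map50: float = 0.80, max_rounds: int = 3) -> float:
    """Train, evaluate, and re-label until mAP@50 reaches the target
    or the round budget runs out. `evaluate` returns the current mAP@50."""
    map50 = 0.0
    for _ in range(max_rounds):
        train()
        map50 = evaluate()
        if map50 >= target_map50:
            break
        relabel_worst()  # or collect more data for the weakest classes
    return map50
```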
## Key Features
- Zero-label-effort training: point at a YouTube URL, name your classes, and it handles everything autonomously
- Parallel Codex subagents via git worktrees for an Nx speedup on frame labeling
- Iterative feedback loop: automatically re-labels and re-trains until the mAP@50 target is met
- Multiple labeling backends: GPT-5-nano, GPT-4.1-mini, Gemini native bbox, CUA+SAM, and a keyless Codex image-view mode
- 5x data augmentation with coordinated label transforms (flip, brightness, contrast, noise)
- Codex-native skill architecture: each pipeline stage is an independently invocable skill
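The parallel-labeling feature depends on giving each subagent a disjoint slice of the frames (each working in its own git worktree so their writes never collide). The worktree orchestration is not shown here, but the sharding itself is a one-liner; the helper name is an assumption:

```python
def shard_frames(frames: list[str], n_workers: int) -> list[list[str]]:
    """Round-robin split of the frame list into n_workers disjoint,
    near-equal shards, one per labeling subagent."""
    return [frames[i::n_workers] for i in range(n_workers)]

shard_frames(["f1.jpg", "f2.jpg", "f3.jpg", "f4.jpg", "f5.jpg"], 2)
# → [["f1.jpg", "f3.jpg", "f5.jpg"], ["f2.jpg", "f4.jpg"]]
```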