yolodex

Agent Skills for Autonomous YOLO Dataset Generation & Model Training

Overview

Yolodex is a fully autonomous ML pipeline that turns any YouTube video into a trained YOLO object detection model—no manual labeling required. Point it at a video URL, name your target classes (e.g. “player”, “weapon”, “vehicle”), and the system handles everything: video download, frame extraction, AI-powered labeling, data augmentation, model training, evaluation, and iterative refinement.

Built at the OpenAI Codex Hackathon 2026 (Feb 2026), where it won.

Pipeline

1. Collect: Downloads video via yt-dlp and extracts frames at configurable FPS using ffmpeg
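The collect step is essentially two CLI calls chained together. A minimal sketch, assuming a flat `frames/` output directory and default yt-dlp naming (the helper names and flags here are illustrative, not the project's actual interface):

```python
import subprocess

def build_extract_cmd(video_path: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build the ffmpeg command that samples frames at the given FPS.
    Hypothetical helper; the real pipeline's flags may differ."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",            # sample `fps` frames per second of video
        f"{out_dir}/frame_%05d.jpg",    # zero-padded frame filenames
    ]

def collect(url: str, out_dir: str, fps: float = 1.0) -> None:
    # Download the video with yt-dlp, then extract frames with ffmpeg.
    subprocess.run(["yt-dlp", "-o", "video.mp4", url], check=True)
    subprocess.run(build_extract_cmd("video.mp4", out_dir, fps), check=True)
```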

2. Label: A vision LLM (GPT-5-nano, GPT-4.1-mini, or Gemini) auto-generates YOLO bounding-box labels for each frame via structured JSON output
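Whatever the model returns has to be converted into YOLO's label format: one line per box, `class x_center y_center width height`, all normalized to [0, 1]. A sketch of that conversion, assuming a hypothetical response shape of pixel-space corner boxes:

```python
def to_yolo_lines(detections, class_names, img_w, img_h):
    """Convert LLM-returned pixel boxes to normalized YOLO label lines.
    Assumed response shape: [{"label": "player", "box": [x1, y1, x2, y2]}, ...]
    (the real structured-output schema may differ)."""
    lines = []
    for det in detections:
        cls = class_names.index(det["label"])
        x1, y1, x2, y2 = det["box"]
        xc = (x1 + x2) / 2 / img_w   # normalized box center
        yc = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w        # normalized box size
        h = (y2 - y1) / img_h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines
```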

3. Augment: Generates 4 synthetic variants per frame (flip, brightness, contrast, noise) with coordinated label transforms — 5x dataset expansion
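Of the four augmentations, only the flip is geometric: brightness, contrast, and noise change pixels but leave labels untouched, while a horizontal flip must mirror every box's x-center. A sketch of that coordinated transform on normalized YOLO labels:

```python
def hflip_labels(labels):
    """Mirror YOLO labels for a horizontally flipped frame.
    Only x_center changes (x -> 1 - x); y, width, height are unchanged.
    labels: list of (cls, xc, yc, w, h) tuples with normalized coords."""
    return [(c, 1.0 - xc, yc, w, h) for c, xc, yc, w, h in labels]
```

Photometric variants (brightness, contrast, noise) can simply copy the original label file, which is why one flip transform is enough to keep all five dataset copies consistent.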

4. Train: Runs Ultralytics YOLOv8 training on the labeled + augmented dataset
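Ultralytics reads the dataset layout from a YAML config. A sketch that renders one, assuming a conventional `images/train` / `images/val` split (the project's actual layout may differ); the training call itself is a one-liner:

```python
def dataset_yaml(root: str, class_names: list[str]) -> str:
    """Render the dataset config Ultralytics expects.
    Assumes an images/train + images/val directory layout."""
    names = "\n".join(f"  {i}: {n}" for i, n in enumerate(class_names))
    return f"path: {root}\ntrain: images/train\nval: images/val\nnames:\n{names}\n"

# Training itself (requires the ultralytics package and a prepared dataset):
# from ultralytics import YOLO
# YOLO("yolov8n.pt").train(data="dataset.yaml", epochs=50, imgsz=640)
```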

5. Evaluate: Extracts mAP@50, precision, recall, and per-class AP, then identifies the weakest classes
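Picking the weakest classes from the per-class AP scores is a simple sort; a sketch (function name and shape are assumptions for illustration):

```python
def weakest_classes(per_class_ap: dict[str, float], k: int = 2) -> list[str]:
    """Return the k classes with the lowest AP@50 — the candidates
    for targeted re-labeling in the next iteration."""
    return sorted(per_class_ap, key=per_class_ap.get)[:k]
```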

6. Iterate: If mAP@50 is below the target, re-labels the worst frames or collects more data, then re-trains automatically
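The iterate stage amounts to a bounded train-evaluate-refine loop. A minimal sketch with the expensive steps injected as callables (the signatures are assumptions, not the project's API):

```python
def refine(train_fn, relabel_fn, target_map=0.75, max_rounds=3):
    """Train, check mAP@50, and re-label until the target is met or the
    round budget runs out. train_fn() -> mAP@50 after a training run;
    relabel_fn() re-labels the worst frames (or collects more data).
    Returns (final mAP@50, rounds used)."""
    for round_no in range(max_rounds):
        score = train_fn()
        if score >= target_map:
            return score, round_no + 1   # target met, stop early
        relabel_fn()                     # refine data before the next round
    return score, max_rounds
```

The round cap matters: labeling and training both cost money and GPU time, so the loop needs a budget even if the target mAP is never reached.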

Key Features

  • Zero-label-effort training — point at a YouTube URL, name your classes, and it handles everything autonomously
  • Parallel Codex subagents via git worktrees for Nx speedup on frame labeling
  • Iterative feedback loop — automatically re-labels and re-trains until mAP@50 target is met
  • Multiple labeling backends: GPT-5-nano, GPT-4.1-mini, Gemini native bbox, CUA+SAM, and keyless Codex image-view mode
  • 5x data augmentation with coordinated label transforms (flip, brightness, contrast, noise)
  • Codex-native skill architecture — each pipeline stage is an independently invocable skill
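For the parallel-subagent feature, each worker needs a disjoint slice of frames to label. A round-robin shard is one simple way to split the work (a sketch; how yolodex actually partitions frames across worktrees is not specified here):

```python
def shard(frames: list[str], n_workers: int) -> list[list[str]]:
    """Round-robin split of frames into n disjoint slices, one per
    parallel labeling subagent running in its own git worktree."""
    return [frames[i::n_workers] for i in range(n_workers)]
```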