projects

zero-rl

A compact R1-Zero-style post-training experiment for Countdown. It trains Qwen2.5-3B with GRPO using strict programmatic rewards: no supervised traces, no reward model, and no LLM judge.
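
To make "strict programmatic rewards" concrete, here is a minimal sketch of what a Countdown checker could look like. It is not the project's actual reward code: the `<answer>` tag format, the name `countdown_reward`, the binary 0/1 reward, and the rule that each given number is used exactly once are all assumptions made for illustration.

```python
import ast
import re
from collections import Counter
from fractions import Fraction

# Assumed output format: the final expression sits inside <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def _eval(node, used):
    """Evaluate a restricted arithmetic AST exactly, recording integer literals."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp):
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right  # division by zero is caught by the caller
    # Anything else (names, calls, unary operators, floats) is rejected.
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target):
    """Return 1.0 only if the answer uses exactly `numbers` and equals `target`."""
    matches = ANSWER_RE.findall(completion)
    if len(matches) != 1:  # missing or repeated answer tags score zero
        return 0.0
    try:
        tree = ast.parse(matches[0].strip(), mode="eval")
        used = []
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) != Counter(numbers):  # each number exactly once (assumed rule)
        return 0.0
    return 1.0 if value == target else 0.0
```

Parsing with `ast` instead of calling `eval` keeps the checker from executing model output, and `Fraction` avoids floating-point error when an expression divides.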

The interesting bit is the transfer test: train on 3-number problems, then evaluate on both 3-number and harder 4-number prompts. The final run, on an AMD MI300X GPU, reached a 68.46% solve rate on held-out 3-number examples, and vLLM brought the observed step time for the 3B model down to roughly 8.1 seconds. It also surfaced a clear generalization gap on 4-number transfer.

Python·GRPO·Qwen2.5
cursor-lens

A local-first CLI for understanding how people work with coding agents. It reads Cursor transcript files, computes workflow metrics, applies privacy filtering, and generates private Markdown and HTML reports.

The tool keeps raw transcripts local by default, supports optional AI-generated builder profiles, and makes agent workflows easier to inspect through sessions, prompts, tool calls, and file references.

TypeScript·CLI·privacy
nanogpt-hack

A small nanoGPT-style pretraining playground built to understand the training loop end to end: dataset prep, configs, checkpointing, WandB logging, and controlled ablations.

The model has roughly 30M parameters and trains on a 100M-token mix of FineWeb-Edu, Cosmopedia, and Python code. The cleanest result was RoPE + AdamW, which improved validation loss under the same token budget.
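
For readers unfamiliar with RoPE, the idea can be shown in a few lines of plain Python. This is a hedged sketch, not code from the playground: the interleaved pairing of dimensions, the `base=10000.0` default, and the helper names `rope_rotate` and `dot` are assumptions chosen for illustration, and a real training loop would use a vectorized tensor version.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The score depends only on the relative offset (2 in both cases),
# not on the absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because each pair is rotated rather than shifted, vector norms are unchanged and the query-key score encodes relative position directly.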

Python·pretraining·RoPE