Large Language Models (LLMs) already carry surprising reasoning skills, but tapping into them has been either expensive (reinforcement learning with verifiable rewards, RLVR) or fragile (large-data supervised fine-tuning, SFT). A new study led by the University of Waterloo’s TIGER Lab—with NetMind as an industry collaborator and our CEO Kai Zou as one of the co-authors—introduces One-Shot Critique Fine-Tuning (CFT) and shows there’s a third way that is both cheap and robust. Although the paper is not yet published and is currently only available on arXiv, it has already been featured by QbitAI, a leading Chinese tech media outlet with over 3.5 million subscribers.
CFT still belongs to the SFT family, but instead of asking the model to imitate a reference answer, as standard SFT does, CFT trains the model to critique a candidate answer. This mirrors how humans learn: before mastering a problem, we often learn by evaluating and reflecting on existing attempts. Critiquing exposes the model to diverse reasoning paths, both correct and flawed, building a deeper understanding of logical patterns and pitfalls.
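To make the distinction concrete, here is a minimal sketch of how an SFT example and a CFT example for the same problem might look. This is our own illustration; the field names and wording are assumptions, not the paper's exact data schema:

```python
# Illustrative only: field names and wording are assumptions, not the paper's schema.

# Standard SFT: the model learns to imitate a reference answer.
sft_example = {
    "input": "Problem: What is 15% of 240?",
    "target": "15% of 240 is 0.15 * 240 = 36. The answer is 36.",
}

# CFT: the model learns to critique a candidate answer instead of imitating one.
cft_example = {
    "input": (
        "Problem: What is 15% of 240?\n"
        "Candidate solution: 15% of 240 is 240 / 15 = 16. The answer is 16."
    ),
    "target": (
        "The candidate divides by 15 instead of multiplying by 0.15. "
        "The correct computation is 0.15 * 240 = 36, so the answer of 16 is wrong."
    ),
}
```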
Overview of the 1-shot CFT dataset construction and the key difference between SFT and CFT training.
The One-Shot CFT framework is refreshingly simple, yet conceptually powerful. Here's how it works:
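At a high level (simplifying the full pipeline), a single seed problem is expanded into many training examples: diverse candidate solutions are sampled for that one problem, a stronger teacher model writes a critique of each candidate, and the target model is then fine-tuned to produce those critiques. The sketch below illustrates this construction under those assumptions; `sample_solutions` and `write_critique` are hypothetical placeholder helpers, not the project's actual code:

```python
# Minimal sketch of one-shot CFT dataset construction (illustrative, not the official scripts).
# `sample_solutions` and `write_critique` are hypothetical helpers standing in for
# calls to candidate-generator models and a stronger teacher model.

def build_one_shot_cft_dataset(seed_problem: str, n_candidates: int = 100) -> list[dict]:
    dataset = []
    # 1. Sample many diverse candidate solutions for the single seed problem.
    candidates = sample_solutions(seed_problem, n=n_candidates)
    for solution in candidates:
        # 2. A teacher model critiques each candidate, pointing out errors
        #    or confirming that the reasoning is sound.
        critique = write_critique(seed_problem, solution)
        # 3. Each (problem + candidate solution, critique) pair becomes one
        #    training example in the format shown earlier.
        dataset.append({
            "input": f"Problem: {seed_problem}\nCandidate solution: {solution}",
            "target": critique,
        })
    return dataset
```

The model is then fine-tuned on these critique targets with an ordinary SFT-style training loop, which is why no reward model or RL infrastructure is required.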
The biggest advantage of SFT/CFT over RL is sample efficiency. In our experiments, One-Shot CFT training completes in under 5 GPU-hours yet still delivers substantial gains across several benchmarks.
We benchmarked One-Shot CFT against state-of-the-art methods across both math and logical reasoning, using benchmarks such as MATH-500, Olympiad, AMC24, and AMC23.
Average accuracy (%) on different benchmarks for Qwen and Llama models, comparing base, SFT, RLVR, and CFT with only one training example.
Low-Cost, High-Impact: A New Path for LLM Training
Compared to the massive computational demands of reinforcement learning, One-Shot CFT is a game-changer. It offers significantly more efficient training that can be completed with just a single A100 GPU (depending on your base model size), eliminating the need for complex reward models or specialized RL infrastructure. Moreover, the project is fully open-sourced, providing access to training scripts, fine-tuned model weights, and datasets. This makes One-Shot CFT a practical and scalable solution for individual researchers, small labs, and startups with limited resources looking to enhance the reasoning capabilities of large language models.
Learn More & Get Started
In a world where massive models are often synonymous with massive costs, our new algorithm One-Shot CFT proves there’s a third way: efficient, effective, and elegantly human-like!