Understanding R1-Zero-Like Training with the Dr. GRPO Algorithm

Explore how R1-Zero-like training works through an in-depth interview with Zichen Liu, first author of the Dr. GRPO algorithm. The conversation covers the technical foundations of LLM post-training and how R1-Zero-like training came to dominate 2025 even though its mechanisms were initially poorly understood. Learn about the "aha moment" in the v3 base model and examine critical questions about self-reflection in LLMs: whether the self-reflection is genuine, and what happens when self-reflection keywords are penalized. Investigate how keyword-based and LLM-based detection can be fused, and whether LLM-as-a-judge systems can be trusted. Discover the technical details of Dr. GRPO, including why a particular term is in the algorithm at all, the removal of the bias term in the GRPO Nature paper, and how compatible Dr. GRPO is with GSPO. Analyze potential drawbacks of Dr. GRPO, consider whether other terms could also be removed, and understand the balance required in algorithm engineering. The interview closes with the next research directions for Liu's lab and what these training methods mean for the field more broadly.

Syllabus

- intro
- start of the interview
- background of Zichen
- LLM post-training
- summary of R1-Zero-like training
- v3 base model aha moment
- is self-reflection real?
- what would happen if we penalized self-reflection keywords?
- fusing keyword- and LLM-based detection
- can you trust the LLM-as-a-judge?
- what's up with Qwen?
- Dr. GRPO overview
- why is that term there at all?
- GRPO Nature paper removed the bias term?
- how compatible is Dr. GRPO with GSPO?
- is there a drawback of Dr. GRPO?
- are there other terms we can remove?
- balance in algorithm engineering
- next research for the lab
- conclusion
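
For viewers who want a concrete picture of the "terms" discussed in the Dr. GRPO segments, here is a minimal Python sketch, not the authors' implementation. It assumes the commonly described difference between the two methods: GRPO divides group-centered rewards by the group standard deviation and averages each response's token loss by its own length, while Dr. GRPO drops both normalizations. The function names, the use of a population standard deviation, and the constant `max_len` budget are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center by the group mean, divide by the group std.

    Assumption: population std is used here; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: center by the group mean only (no std division)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def token_weights(lengths, length_normalize, max_len=None):
    """Weight applied to each response's summed per-token loss.

    length_normalize=True  -> 1/|o_i|, the per-response length term (GRPO-style).
    length_normalize=False -> a shared constant 1/max_len (Dr. GRPO-style sketch);
                              max_len is a hypothetical generation budget.
    """
    if length_normalize:
        return [1.0 / n for n in lengths]
    budget = max_len if max_len is not None else max(lengths)
    return [1.0 / budget for _ in lengths]


if __name__ == "__main__":
    # One prompt, four sampled responses: only the first is correct.
    rewards = [1.0, 0.0, 0.0, 0.0]
    lengths = [120, 800, 300, 450]

    print("GRPO advantages:    ", grpo_advantages(rewards))
    print("Dr. GRPO advantages:", dr_grpo_advantages(rewards))
    print("GRPO token weights: ", token_weights(lengths, length_normalize=True))
    print("Dr. GRPO weights:   ", token_weights(lengths, length_normalize=False))
```

With per-response length weights, a long incorrect response receives a smaller per-token penalty than a short incorrect one. With a shared constant, every token is weighted the same regardless of which response it belongs to.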