Overview
Syllabus
- intro:
- start of the interview:
- background of Zichen:
- LLM post-training:
- summary of R1-Zero-like training:
- V3 base model aha moment:
- is self-reflection real?:
- what would happen if we penalized self-reflection keywords:
- fusing keyword- and LLM-based detection:
- can you trust the LLM-as-a-judge:
- what's up with Qwen:
- Dr. GRPO overview:
- why is that term there at all?:
- GRPO Nature paper removed the bias term?:
- how compatible is Dr. GRPO with GSPO?:
- are there drawbacks to Dr. GRPO?:
- are there other terms we can remove?:
- balance in the algorithm engineering:
- next research for the lab:
- conclusion:
Taught by
Yacine Mahdid