DeepSeekMath

llm learning road by zc

Episode notes

00:00:00 Start

00:17:41 Evaluating the mathematical ability of DeepSeekMath-Base 7B

00:27:47 Training and evaluation of DeepSeekMath-RL

00:40:24 Exploring algorithms that are robust to noisy reward signals

00:46:35 Objectives and gradients of DPO, PPO, and GRPO

00:51:30 Closing