meliksahturker committed on
Commit 42fdba4 · verified · 1 Parent(s): 5a8da9a

Correction in DPO Beta intuition. Changed 'higher' to 'lower' and vice versa in line 3995.

Files changed (1)
  1. app/src/content/article.mdx +1 -1
app/src/content/article.mdx CHANGED
@@ -3992,7 +3992,7 @@ Our recommendation for your training runs is to run scans of your learning rate
 
 **Tune your β**
 
-The experiments we ran for the ß parameter ranged from 0.01 to 0.99 to explore values that encourage different degrees of alignment to the reference model. As a reminder, lower values of beta encourage staying close to the reference model while higher values allow the model to match the preference data more closely. The model performance for β=0.1 is the highest for both reasoning modes and improves compared to the metrics from the SFT checkpoint. Using a low beta value hurts model performance and results in a worse model than the SFT checkpoint, while performance remains stable across multiple ß values without extended thinking.
+The experiments we ran for the ß parameter ranged from 0.01 to 0.99 to explore values that encourage different degrees of alignment to the reference model. As a reminder, higher values of beta encourage staying close to the reference model while lower values allow the model to match the preference data more closely. The model performance for β=0.1 is the highest for both reasoning modes and improves compared to the metrics from the SFT checkpoint. Using a low beta value hurts model performance and results in a worse model than the SFT checkpoint, while performance remains stable across multiple ß values without extended thinking.
 
 These results suggest that values greater than 0.1 are preferable for preference optimisation, and that aligning the model with the preference data is more beneficial than staying close to the reference model. However, we suggest exploring ß values in the range 0.01 and 0.5. Higher values may erase capabilities from the SFT checkpoint that we might not be capturing in the evals shown on the plot.
 
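For context on the corrected intuition, here is a minimal sketch of the DPO objective showing where β enters: it scales the policy-vs-reference log-ratios, so a higher β penalises drifting from the reference model more strongly, while a lower β lets the policy fit the preference data more closely. The function and argument names below are illustrative and not taken from the article's codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the trained policy vs. the frozen reference model,
    # one value per (chosen, rejected) preference pair.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Beta scales this margin: a larger beta punishes deviations from the
    # reference model harder (stay close to it), while a smaller beta lets
    # the policy match the preference data more aggressively.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```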