A Regret Minimization Framework on Preference Learning in Large Language Models

Suhwan Kim¹,*, Taehyun Cho¹,*,†, Geon-Hyeong Kim², Yu Jin Kim²,
Youngsoo Jang³,‡, Moontae Lee²,‡, Jungwoo Lee¹,⁴,‡

¹Seoul National University, ²LG AI Research, ³UNIST, ⁴HodooAI Labs

* Equal contribution. † Work done during internship at LG AI Research. ‡ Equal corresponding authorship.

🥇 ICML 2026 Spotlight (Top 2.2%)

**Figure 1.** Overview of reward maximization versus regret minimization. Reward maximization evaluates each reasoning step locally, whereas regret minimization evaluates behavior through prospective rollout and counterfactual reassessment.

RePO reframes RLHF as regret minimization: preferences over reasoning trajectories are interpreted as behavior-conditioned judgments of relative suboptimality.

Abstract

Reinforcement learning with verifiable rewards has driven progress on reasoning-intensive tasks, but many realistic language tasks cannot be equipped with reliable automated verifiers. In such settings, learning systems increasingly rely on human preference feedback, making it important to ask how such feedback should be interpreted.

This paper introduces Regret-based Preference Optimization (RePO), which views human preferences not as immediate reward labels, but as prospective and counterfactual assessments of relative suboptimality. Instead of asking which partial trajectory has larger local utility, RePO asks which behavior is closer to optimal behavior after considering plausible future continuations and alternative actions.

Under a KL-regularized reinforcement learning framework, this perspective yields a regret decomposition compatible with direct preference optimization. Experiments on human preference alignment and mathematical reasoning benchmarks show that RePO improves over DPO-style baselines and remains practical through RePO_det, a behavior-policy-free deterministic approximation.

Why Regret?

The paper argues that human judgments over intermediate reasoning are not purely local. Evaluators mentally anticipate future outcomes and compare the observed behavior against plausible alternatives. This motivates a regret-minimization view: preferences should reflect closeness to optimal behavior, not only realized reward.

Prospective judgment

A partial reasoning segment may be preferred when it lies on a plausible path to a good future outcome, even if it has not yet received any immediate reward.

Counterfactual comparison

A realized outcome can be worse than another, yet the chosen action can still be closer to the optimal policy under plausible counterfactual alternatives.

Regret-based Preference Optimization

RePO instantiates the preference score with negative regret. For context $q_{<t}$ and output $o_t$, regret compares the optimal value at the current context with the behavior-policy value of taking the observed output and then following the behavior policy $\mu$.

\[ \begin{aligned} \mathrm{Reg}^{\mu}_{\pi^*}(q_{<t}, o_t) &= V^{\pi^*}(q_{<t}) - Q^{\mu}(q_{<t}, o_t) \\ &= -\alpha\,\log \frac{\pi^*(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \alpha\,\overline{\mathbb{D}}_{\mathrm{KL}}(\mu \,\Vert\, \pi^*;\, q_{<t}, o_t). \end{aligned} \tag{1} \]

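The first term is a local relative-likelihood term analogous to DPO: in KL-regularized reinforcement learning, the optimal policy satisfies $V^{\pi^*}(q_{<t}) - Q^{\pi^*}(q_{<t}, o_t) = -\alpha \log \frac{\pi^*(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})}$. The second term is a sequential forward KL divergence that measures how far the behavior policy deviates from the optimal policy along future rollouts, defined as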

\[ \begin{aligned} &\overline{\mathbb{D}}_{\mathrm{KL}}(\mu \,\Vert\, \pi^*;\, q_{<t}, o_t) :=\; \mathbb{E}_{\tau \sim \mathbb{P}^{\mu}}\!\left[\, \sum_{l>0} \gamma^{l}\, \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t+l}) \,\Vert\, \pi^*(\cdot \mid q_{<t+l})\right) \,\right]. \end{aligned} \tag{2} \]

This makes the objective explicitly behavior-aware in offline or heterogeneous preference data.
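To make Eq. (2) concrete, here is a minimal sketch of a one-sample estimate of the sequential forward KL along a single rollout. It assumes that the full next-token distributions of $\mu$ and $\pi^*$ are available at every future step, which is rarely true in practice; the function names `kl_divergence` and `sequential_forward_kl` and the toy distributions are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; entries with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sequential_forward_kl(
    mu_steps: Sequence[Sequence[float]],
    pi_steps: Sequence[Sequence[float]],
    gamma: float = 1.0,
) -> float:
    """Discounted sum of per-step KL(mu || pi) along one rollout.

    Index l starts at 1, matching the sum over l > 0 in Eq. (2).
    A single rollout gives a one-sample estimate of the expectation.
    """
    total = 0.0
    for l, (mu_dist, pi_dist) in enumerate(zip(mu_steps, pi_steps), start=1):
        total += (gamma ** l) * kl_divergence(mu_dist, pi_dist)
    return total


# Toy rollout: two future steps over a 3-token vocabulary (hypothetical values).
mu_rollout = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
pi_rollout = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]
seq_kl = sequential_forward_kl(mu_rollout, pi_rollout, gamma=0.9)
print(f"sequential forward KL estimate: {seq_kl:.4f}")
```

The estimate is zero when the two policies agree at every step and grows as the behavior policy drifts from the target policy later in the rollout.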

Exact sequential KL computation is expensive, so the practical estimator reuses observed trajectories as finite-horizon rollouts. When behavior-policy log-probabilities are known, RePO uses them directly:

\[ \begin{aligned} S^{\mathrm{RePO}} &:= \widehat{\mathrm{Reg}}^{\,\mu}_{\pi_\theta}(q_{<t}, o_t) \\ &= -\alpha\!\left(\, \log \frac{\pi_\theta(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \frac{1}{T-t}\sum_{1 \leq l \leq T-t} \log \frac{\pi_\theta(o_{t+l} \mid q_{<t+l})}{\mu(o_{t+l} \mid q_{<t+l})} \,\right). \end{aligned} \tag{3} \]
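The estimator in Eq. (3) only needs per-step log-probabilities of the observed outputs. The following is a minimal sketch under that reading, not the authors' implementation: `repo_regret_estimate` is a hypothetical helper, the inputs are assumed to be precomputed log-probabilities, and returning a zero future term when no future steps exist is an assumption the page does not specify.

```python
from typing import Sequence


def repo_regret_estimate(
    logp_theta_t: float,
    logp_ref_t: float,
    logp_theta_future: Sequence[float],
    logp_mu_future: Sequence[float],
    alpha: float,
) -> float:
    """Evaluate the right-hand side of Eq. (3) for one (context, output) pair.

    logp_theta_t / logp_ref_t: log-probability of o_t under pi_theta / pi_ref.
    logp_theta_future / logp_mu_future: log-probabilities of the observed
    outputs o_{t+1}, ..., o_T under pi_theta and the behavior policy mu.
    """
    if len(logp_theta_future) != len(logp_mu_future):
        raise ValueError("future log-probability sequences must have equal length")

    # Local relative-likelihood term: log pi_theta(o_t) - log pi_ref(o_t).
    local_term = logp_theta_t - logp_ref_t

    # Finite-horizon average of log pi_theta - log mu over the observed rollout.
    horizon = len(logp_theta_future)
    if horizon == 0:
        future_term = 0.0  # Assumption: no future steps means no rollout term.
    else:
        future_term = sum(
            lt - lm for lt, lm in zip(logp_theta_future, logp_mu_future)
        ) / horizon

    return -alpha * (local_term + future_term)


# Hypothetical log-probabilities for a segment with two future steps.
score = repo_regret_estimate(
    logp_theta_t=-1.0,
    logp_ref_t=-1.5,
    logp_theta_future=[-0.5, -1.0],
    logp_mu_future=[-0.4, -0.8],
    alpha=0.1,
)
print(score)
```

Averaging over the remaining horizon rather than summing keeps the rollout term on a comparable scale for segments of different lengths.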

When the behavior policy is unavailable, RePO_det replaces it with a deterministic pseudo-policy concentrated on the observed trajectory. This keeps the regret structure while making the method usable for pre-collected offline preference datasets:

\[ \begin{aligned} S^{\mathrm{RePO\_det}} &:= \widehat{\mathrm{Reg}}_{\pi_\theta}(q_{<t}, o_t) \\ &= -\alpha\!\left(\, \log \frac{\pi_\theta(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \frac{1}{T-t}\sum_{1 \leq l \leq T-t} \log \pi_\theta(o_{t+l} \mid q_{<t+l}) \,\right). \end{aligned} \tag{4} \]
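Eq. (4) is Eq. (3) with the behavior-policy log-probabilities set to zero, which is what a deterministic pseudo-policy that places probability one on each observed output gives. A minimal sketch under the same assumptions as before: `repo_det_regret_estimate` is a hypothetical helper name, and the zero future term for an empty horizon is an assumption.

```python
from typing import Sequence


def repo_det_regret_estimate(
    logp_theta_t: float,
    logp_ref_t: float,
    logp_theta_future: Sequence[float],
    alpha: float,
) -> float:
    """Evaluate the right-hand side of Eq. (4) without behavior-policy access.

    The deterministic pseudo-policy assigns probability 1 to every observed
    future output, so its log-probability is 0 and drops out of the average.
    """
    local_term = logp_theta_t - logp_ref_t

    horizon = len(logp_theta_future)
    if horizon == 0:
        future_term = 0.0  # Assumption: no future steps means no rollout term.
    else:
        future_term = sum(logp_theta_future) / horizon

    return -alpha * (local_term + future_term)


# Same hypothetical segment as a behavior-aware example, minus the mu values.
score_det = repo_det_regret_estimate(
    logp_theta_t=-1.0,
    logp_ref_t=-1.5,
    logp_theta_future=[-0.5, -1.0],
    alpha=0.1,
)
print(score_det)
```

Because only $\pi_\theta$ and $\pi_{\mathrm{ref}}$ log-probabilities appear, this variant can be computed on any pre-collected preference dataset, including data generated by a different model family.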

Prospective Judgment

Preferences over partial reasoning are interpreted through likely future continuation.

Counterfactual Thinking

Actions are compared against alternatives that could have been taken at the same context.

Behavior-aware Learning

The objective accounts for rollout likelihoods instead of assuming on-policy preference data.
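One way regret estimates could enter a pairwise objective is a Bradley-Terry logistic loss on negative regret, mirroring the DPO loss. This page does not spell out the training loss, so the sketch below is an assumption rather than the paper's stated objective; `log_sigmoid` and `pairwise_regret_loss` are hypothetical helper names.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_regret_loss(regret_chosen: float, regret_rejected: float) -> float:
    """Bradley-Terry loss with negative regret as the preference score.

    Assumption: the chosen segment should have lower regret than the
    rejected one, so the score difference is (-chosen) - (-rejected).
    """
    score_chosen = -regret_chosen
    score_rejected = -regret_rejected
    return -log_sigmoid(score_chosen - score_rejected)


# Hypothetical regret estimates for a chosen and a rejected segment.
loss_agree = pairwise_regret_loss(regret_chosen=-0.035, regret_rejected=0.025)
loss_disagree = pairwise_regret_loss(regret_chosen=0.025, regret_rejected=-0.035)
print(loss_agree, loss_disagree)
```

The loss is smaller when the chosen segment already has the lower regret estimate, and equals $\log 2$ when the two estimates are tied.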

Inductive Bias of Regret

Why does regret-based preference learning outperform reward-based alternatives? Beyond the off-policy correction in the regret decomposition above, the regret objective structurally internalizes a pessimistic inductive bias that aligns with how humans evaluate incomplete reasoning: regret at a successful terminal state is dominated by the expected regret at any intermediate partial context along the same trajectory.

The assumption behind this property is mild: the behavior policy $\mu$ should be closer to the optimal policy $\pi^*$ than to the reference policy $\pi_{\mathrm{ref}}$ in KL divergence — a condition that naturally holds along trajectories that reach a verifier-accepted outcome.

\[ \epsilon \;:=\; \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t}) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid q_{<t})\right) \;-\; \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t}) \,\Vert\, \pi^*(\cdot \mid q_{<t})\right) \;\geq\; 0. \tag{5} \]

Under this condition, for any verifier-accepted output $o_T^{\star}$ and any earlier context $q_{<t}$ on the same trajectory, the regret of the full success is upper-bounded by the expected regret at the partial context:

\[ \widehat{\mathrm{Reg}}^{\,\mu}_{\pi^*}(q_{<T},\, o_T^{\star}) \;\leq\; \mathbb{E}_{\mu}\!\left[\,\widehat{\mathrm{Reg}}^{\,\mu}_{\pi^*}(q_{<t}, \,\cdot)\,\right]. \tag{6} \]

Intuitively, masking parts of a successful trajectory introduces uncertainty about its eventual completion, so a partial context is judged more harshly than the corresponding fully revealed accepted trajectory. The regret objective therefore prefers complete, verifier-consistent reasoning over partial reasoning without requiring any auxiliary masked-data augmentation — the bias is encoded in the loss itself.

This is the structural reason RePO achieves stronger sample efficiency than DPO-style baselines: where DPO must rely on externally constructed contrastive pairs to teach this preference, RePO inherits it directly from the regret decomposition, as we verify empirically below.
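The assumption in Eq. (5) is easy to check numerically at a single context when the three next-token distributions are known. The sketch below uses hypothetical toy distributions in which the behavior policy sits closer to $\pi^*$ than to $\pi_{\mathrm{ref}}$; the function names are illustrative and not from the paper.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for categorical distributions; assumes q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regret_bias_margin(
    mu: Sequence[float],
    pi_ref: Sequence[float],
    pi_star: Sequence[float],
) -> float:
    """Epsilon from Eq. (5): KL(mu || pi_ref) - KL(mu || pi_star)."""
    return kl_divergence(mu, pi_ref) - kl_divergence(mu, pi_star)


# Hypothetical next-token distributions over a 3-token vocabulary.
mu = [0.7, 0.2, 0.1]
pi_ref = [0.4, 0.3, 0.3]
pi_star = [0.6, 0.3, 0.1]

epsilon = regret_bias_margin(mu, pi_ref, pi_star)
condition_holds = epsilon >= 0.0
print(f"epsilon = {epsilon:.4f}, condition holds: {condition_holds}")
```

When the behavior policy coincides with the reference policy instead, the margin becomes negative and the condition fails, which matches the intuition that the bound relies on behavior data that is already better than the reference.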

Experiments

The experiments ask whether RePO improves human preference alignment and mathematical reasoning, whether RePO_det remains effective without behavior-policy access, and whether regret minimization internalizes the pessimistic inductive bias toward incomplete successful trajectories.

Human preference alignment

Best results are bold; second-best results are underlined.

**Table 1.** AlpacaEval2, Arena-Hard, and MT-Bench results on Qwen3-1.7B/4B.
Backbone	Method	AlpacaEval 2 LC (%)	AlpacaEval 2 WR (%)	Arena-Hard WR (%)	MT-Bench (GPT-4.1)	MT-Bench (GPT-5.1)
Qwen3-1.7B-Base	Base	8.27	6.05	10.4	5.09	4.41
	DPO	23.90	25.84	23.4	6.00	4.81
	IPO	24.46	26.37	21.9	6.56	5.10
	RPO	18.63	15.47	17.7	5.98	5.13
	KTO	34.73	38.93	30.4	6.92	5.32
	TDPO	12.24	10.72	14.5	5.56	4.73
	RePO	36.61	43.66	27.1	6.88	5.43
	RePO_det	34.95	41.42	26.6	6.89	5.16
Qwen3-4B-Base	Base	12.80	11.62	25.4	5.56	4.83
	DPO	32.89	33.92	44.5	6.79	5.74
	IPO	36.43	38.63	47.8	7.43	6.14
	RPO	29.51	28.23	41.9	7.05	6.13
	KTO	52.31	55.78	63.9	8.22	6.93
	TDPO	17.97	17.08	30.9	6.24	5.38
	RePO	55.08	60.12	60.1	8.09	6.78
	RePO_det	51.66	55.53	59.9	8.18	6.97

Mathematical reasoning

**Table 2.** Math-reasoning benchmark results using Qwen3-1.7B/4B.
Backbone	Method	GSM8K	MATH	MATH500	AMC23	Minerva
Qwen3-1.7B-Base	Base	61.71	48.50	48.60	30.00	9.60
	DPO	77.33	53.44	52.80	32.50	16.91
	RPO	69.07	50.32	51.40	30.00	13.24
	IPO	79.45	51.76	53.40	20.00	16.54
	KTO	79.68	54.42	56.60	35.00	17.28
	TDPO	64.06	48.80	52.20	25.00	9.93
	RePO	80.52	54.50	57.40	30.00	20.59
	RePO_det	80.74	54.84	54.40	25.00	25.74
Qwen3-4B-Base	Base	78.77	61.20	64.20	32.50	19.90
	DPO	87.87	56.66	57.80	35.00	27.21
	RPO	90.30	63.44	66.80	47.50	22.79
	IPO	88.86	58.36	57.40	45.00	27.57
	KTO	90.83	67.38	67.60	55.00	25.74
	TDPO	90.67	62.76	64.80	47.50	24.26
	RePO	90.60	65.54	66.20	42.50	22.43
	RePO_det	91.05	65.72	65.40	47.50	23.50

Behavior-policy-free Training

Offline preference datasets often do not expose the behavior policy that generated each response. RePO_det approximates each observed trajectory as if it were generated by a deterministic policy, which makes regret-based training usable in behavior-agnostic and cross-model settings.

**Figure 4.** Math-reasoning accuracy of DPO and **RePO_det** on (a) Llama3.2-1B and (b) Gemma-2-2B trained on Qwen-family preference data.

Sample Efficiency from Regret Inductive Bias

RePO structurally encodes the bias that incomplete successful trajectories should be evaluated pessimistically because their final completion is uncertain. In the masked-data study, DPO improves as more augmented pairs are provided, while RePO remains strong without requiring this additional supervision.

**Figure 5.** GSM8K accuracy as a function of masked-augmented preference data for (a) Qwen3-1.7B and (b) Qwen3-4B.

**Table 4.** Masked-data ablation on mathematical reasoning benchmarks.
Model	Method	GSM8K	MATH	MSE
Qwen3-1.7B	DPO	72.55	52.14	0.073
	DPO_masked	77.71 (+5.16)	53.52 (+1.38)	0.056
	RePO	80.52	54.50	0
	RePO_masked	80.29 (-0.23)	55.26 (+0.76)	-
Qwen3-4B	DPO	85.22	63.70	0.066
	DPO_masked	88.63 (+3.41)	64.46 (+0.72)	0.054
	RePO	90.60	65.54	0
	RePO_masked	90.98 (+0.38)	65.46 (-0.08)	-

Takeaways

Human-centered objective

RePO models preferences as prospective and counterfactual judgments over reasoning trajectories.

Regret decomposition

The objective combines a local DPO-like term with a long-horizon behavior-deviation term.

Practical offline variant

RePO_det supports offline preference data without explicit behavior-policy metadata.

BibTeX

@inproceedings{kim2026regret,
  title     = {A Regret Minimization Framework on Preference Learning in Large Language Models},
  author    = {Kim, Suhwan and Cho, Taehyun and Kim, Geonhyeong and Kim, Yujin and Jang, Youngsoo and Lee, Moontae and Lee, Jungwoo},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}