A Regret Minimization Framework on Preference Learning in Large Language Models

Suhwan Kim1,*, Taehyun Cho1,*,†, Geon-Hyeong Kim2, Yu Jin Kim2,
Youngsoo Jang3,‡, Moontae Lee2,‡, Jungwoo Lee1,4,‡
1Seoul National University, 2LG AI Research, 3UNIST, 4HodooAI Labs
* Equal contribution. † Work done during internship at LG AI Research. ‡ Equal corresponding authorship.
🥇 ICML 2026 Spotlight (Top 2.2%)
Figure 1. Reward maximization evaluates each reasoning step locally, whereas regret minimization evaluates behavior through prospective rollout and counterfactual reassessment.

RePO reframes RLHF as regret minimization: preferences over reasoning trajectories are interpreted as behavior-conditioned judgments of relative suboptimality.

Abstract

Reinforcement learning with verifiable rewards has driven progress on reasoning-intensive tasks, but many realistic language tasks cannot be equipped with reliable automated verifiers. In such settings, learning systems increasingly rely on human preference feedback, making it important to ask how such feedback should be interpreted.

This paper introduces Regret-based Preference Optimization (RePO), which views human preferences not as immediate reward labels, but as prospective and counterfactual assessments of relative suboptimality. Instead of asking which partial trajectory has larger local utility, RePO asks which behavior is closer to optimal behavior after considering plausible future continuations and alternative actions.

Under a KL-regularized reinforcement learning framework, this perspective yields a regret decomposition compatible with direct preference optimization. Experiments on human preference alignment and mathematical reasoning benchmarks show that RePO improves over DPO-style baselines and remains practical through RePO_det, a behavior-policy-free deterministic approximation.

Why Regret?

The paper argues that human judgments over intermediate reasoning are not purely local. Evaluators mentally anticipate future outcomes and compare the observed behavior against plausible alternatives. This motivates a regret-minimization view: preferences should reflect closeness to optimal behavior, not only realized reward.

Prospective judgment

A partial reasoning segment may be preferred when it lies on a plausible path to a good future outcome, even if it has not yet received any immediate reward.

Figure 2. Preference over an incomplete trajectory depends on imagined continuation.

Counterfactual comparison

A realized outcome can be worse than another, yet the chosen action can still be closer to the optimal policy under plausible counterfactual alternatives.

Figure 3. Regret minimization evaluates proximity to optimal behavior, not just realized payoff.

Regret-based Preference Optimization

RePO instantiates the preference score with negative regret. For context $q_{<t}$ and output $o_t$, regret compares the optimal value at the current context with the behavior-policy value of taking the observed output and then following the behavior policy $\mu$.

\[ \begin{aligned} \mathrm{Reg}^{\mu}_{\pi^*}(q_{<t}, o_t) &= V^{\pi^*}(q_{<t}) - Q^{\mu}(q_{<t}, o_t) \\ &= -\alpha\,\log \frac{\pi^*(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \alpha\,\overline{\mathbb{D}}_{\mathrm{KL}}(\mu \,\Vert\, \pi^*;\, q_{<t}, o_t). \end{aligned} \tag{1} \]
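
One way to see where the two terms of Eq. (1) come from is the following sketch. It assumes the standard closed form of the KL-regularized optimal policy, $\pi^*(o_t \mid q_{<t}) = \pi_{\mathrm{ref}}(o_t \mid q_{<t})\,\exp\!\big((Q^{\pi^*}(q_{<t}, o_t) - V^{\pi^*}(q_{<t}))/\alpha\big)$, which the text above does not state explicitly; the symbol $Q^{\pi^*}$ (the optimal action value) is introduced here only for illustration.

\[ \begin{aligned} V^{\pi^*}(q_{<t}) - Q^{\mu}(q_{<t}, o_t) &= \underbrace{\left[ V^{\pi^*}(q_{<t}) - Q^{\pi^*}(q_{<t}, o_t) \right]}_{\text{local gap}} + \underbrace{\left[ Q^{\pi^*}(q_{<t}, o_t) - Q^{\mu}(q_{<t}, o_t) \right]}_{\text{future gap}} \\ &= -\alpha\,\log \frac{\pi^*(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \left[ Q^{\pi^*}(q_{<t}, o_t) - Q^{\mu}(q_{<t}, o_t) \right]. \end{aligned} \]

Under this reading, the local gap gives the log-ratio term, and Eq. (1) identifies the future gap with $\alpha\,\overline{\mathbb{D}}_{\mathrm{KL}}(\mu \,\Vert\, \pi^*;\, q_{<t}, o_t)$.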

The first term is a local relative-likelihood term analogous to DPO. The second term is a sequential forward KL divergence that measures how far the behavior policy deviates from the optimal policy along future rollouts, defined as

\[ \begin{aligned} &\overline{\mathbb{D}}_{\mathrm{KL}}(\mu \,\Vert\, \pi^*;\, q_{<t}, o_t) :=\; \mathbb{E}_{\tau \sim \mathbb{P}^{\mu}}\!\left[\, \sum_{l>0} \gamma^{l}\, \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t+l}) \,\Vert\, \pi^*(\cdot \mid q_{<t+l})\right) \,\right]. \end{aligned} \tag{2} \]

This makes the objective explicitly behavior-aware in offline or heterogeneous preference data.
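
As a rough illustration of Eq. (2), the sketch below computes the discounted sum of per-step KL divergences along a single rollout. The function names, the toy three-token vocabulary, and the probabilities are all hypothetical. The sketch also assumes that both policies give full categorical distributions with non-zero mass wherever $\mu$ has mass. Averaging the result over rollouts sampled from $\mu$ would give a Monte Carlo estimate of the expectation.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for categorical distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sequential_forward_kl(mu_dists, pi_dists, gamma=1.0):
    """Discounted sum of KL(mu || pi) over future steps l = 1, 2, ...

    mu_dists[l - 1] and pi_dists[l - 1] are the next-token distributions of
    the behavior policy and the target policy at step t + l of one rollout.
    """
    return sum(
        gamma ** l * kl_categorical(mu, pi)
        for l, (mu, pi) in enumerate(zip(mu_dists, pi_dists), start=1)
    )


# Toy rollout with two future steps (made-up numbers).
mu_dists = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
pi_dists = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
seq_kl = sequential_forward_kl(mu_dists, pi_dists, gamma=0.9)
```

In this toy case the two policies agree at the second step, so only the first step contributes to the sum.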

Exact sequential KL computation is expensive, so the practical estimator reuses observed trajectories as finite-horizon rollouts. When behavior-policy log-probabilities are known, RePO uses them directly:

\[ \begin{aligned} S^{\mathrm{RePO}} &:= \widehat{\mathrm{Reg}}^{\,\mu}_{\pi_\theta}(q_{<t}, o_t) \\ &= -\alpha\!\left(\, \log \frac{\pi_\theta(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \frac{1}{T-t}\sum_{1 \leq l \leq T-t} \log \frac{\pi_\theta(o_{t+l} \mid q_{<t+l})}{\mu(o_{t+l} \mid q_{<t+l})} \,\right). \end{aligned} \tag{3} \]
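
A minimal sketch of how the estimator in Eq. (3) might be computed from per-step log-probabilities is shown below. The function name, argument layout, and numbers are hypothetical, and the handling of the boundary case $t = T$ (dropping the rollout term) is an assumption rather than something the text specifies. The sign convention follows Eq. (3) as written.

```python
def repo_regret_estimate(logp_theta, logp_ref_t, logp_mu_future, alpha=1.0):
    """Finite-horizon regret estimate in the form of Eq. (3).

    logp_theta     : log pi_theta(o_{t+l} | q_{<t+l}) for l = 0, ..., T-t
    logp_ref_t     : log pi_ref(o_t | q_{<t}) at the current step only
    logp_mu_future : log mu(o_{t+l} | q_{<t+l}) for l = 1, ..., T-t
    """
    future_theta = logp_theta[1:]
    if len(future_theta) != len(logp_mu_future):
        raise ValueError("need one behavior log-prob per future step")

    # Local relative-likelihood term at step t.
    local_term = logp_theta[0] - logp_ref_t

    # Average log-ratio against the behavior policy over the observed rollout.
    # Assumption: with no future steps (t = T) this term is set to zero.
    if future_theta:
        rollout_term = sum(
            lt - lm for lt, lm in zip(future_theta, logp_mu_future)
        ) / len(future_theta)
    else:
        rollout_term = 0.0

    return -alpha * (local_term + rollout_term)


# Made-up log-probabilities for a segment with two future steps.
score = repo_regret_estimate(
    logp_theta=[-1.0, -0.5, -0.7],
    logp_ref_t=-1.2,
    logp_mu_future=[-0.4, -0.9],
    alpha=0.1,
)
```

Here the local term is 0.2 and the averaged rollout term is 0.05, so the estimate is $-0.1 \times 0.25 = -0.025$.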

When the behavior policy is unavailable, RePO_det replaces it with a deterministic pseudo-policy concentrated on the observed trajectory. This keeps the regret structure while making the method usable for pre-collected offline preference datasets:

\[ \begin{aligned} S^{\mathrm{RePO\_det}} &:= \widehat{\mathrm{Reg}}_{\pi_\theta}(q_{<t}, o_t) \\ &= -\alpha\!\left(\, \log \frac{\pi_\theta(o_t \mid q_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q_{<t})} + \frac{1}{T-t}\sum_{1 \leq l \leq T-t} \log \pi_\theta(o_{t+l} \mid q_{<t+l}) \,\right). \end{aligned} \tag{4} \]
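
The deterministic variant in Eq. (4) can be sketched in the same way. A pseudo-policy concentrated on the observed trajectory assigns probability 1 to each observed token, so its log-probability is 0 and the behavior term drops out. The function name and numbers are hypothetical, and setting the rollout term to zero when there are no future steps is again an assumption.

```python
def repo_det_regret_estimate(logp_theta, logp_ref_t, alpha=1.0):
    """Behavior-policy-free regret estimate in the form of Eq. (4).

    logp_theta : log pi_theta(o_{t+l} | q_{<t+l}) for l = 0, ..., T-t
    logp_ref_t : log pi_ref(o_t | q_{<t}) at the current step only
    """
    # Local relative-likelihood term at step t.
    local_term = logp_theta[0] - logp_ref_t

    # Average log-likelihood of the observed future tokens; log mu = 0
    # because the pseudo-policy is deterministic on the observed trajectory.
    future = logp_theta[1:]
    rollout_term = sum(future) / len(future) if future else 0.0

    return -alpha * (local_term + rollout_term)


# Made-up log-probabilities for a segment with two future steps.
score_det = repo_det_regret_estimate([-1.0, -0.5, -0.7], -1.2, alpha=0.1)
```

With these numbers the local term is 0.2 and the rollout term is -0.6, giving $-0.1 \times (-0.4) = 0.04$.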

Prospective Judgment

Preferences over partial reasoning are interpreted through likely future continuation.

Counterfactual Thinking

Actions are compared against alternatives that could have been taken at the same context.

Behavior-aware Learning

The objective accounts for rollout likelihoods instead of assuming on-policy preference data.

Inductive Bias of Regret

Why does regret-based preference learning outperform reward-based alternatives? Beyond the off-policy correction in the regret decomposition above, the regret objective builds in a pessimistic inductive bias that matches how humans evaluate incomplete reasoning: the regret at a successful terminal state is bounded above by the expected regret at any intermediate partial context along the same trajectory.

The assumption behind this property is mild: the behavior policy $\mu$ should be closer to the optimal policy $\pi^*$ than to the reference policy $\pi_{\mathrm{ref}}$ in KL divergence — a condition that naturally holds along trajectories that reach a verifier-accepted outcome.

\[ \epsilon \;:=\; \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t}) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid q_{<t})\right) \;-\; \mathbb{D}_{\mathrm{KL}}\!\left(\mu(\cdot \mid q_{<t}) \,\Vert\, \pi^*(\cdot \mid q_{<t})\right) \;\geq\; 0. \tag{5} \]
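
To make the condition in Eq. (5) concrete, the sketch below evaluates the KL gap $\epsilon$ at a single context for toy categorical distributions. The function names and all probabilities are hypothetical, and the distributions are assumed to have full support.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_gap(mu, pi_ref, pi_star):
    """epsilon = KL(mu || pi_ref) - KL(mu || pi_star) at one context."""
    return kl_categorical(mu, pi_ref) - kl_categorical(mu, pi_star)


# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
pi_ref = [1 / 3, 1 / 3, 1 / 3]   # uninformative reference policy
pi_star = [0.8, 0.15, 0.05]      # stand-in for the optimal policy
mu = [0.7, 0.2, 0.1]             # behavior policy close to pi_star

eps = kl_gap(mu, pi_ref, pi_star)

# If the behavior policy coincides with the reference policy instead,
# the gap is non-positive and the condition fails.
eps_violated = kl_gap(pi_ref, pi_ref, pi_star)
```

In the first case $\mu$ sits closer to the stand-in optimal policy than to the reference, so `eps` is positive. In the second case `eps_violated` is negative.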

Under this condition, for any verifier-accepted output $o_T^{\star}$ and any earlier context $q_{<t}$ on the same trajectory, the regret of the full success is upper-bounded by the expected regret at the partial context:

\[ \widehat{\mathrm{Reg}}^{\,\mu}_{\pi^*}(q_{<T},\, o_T^{\star}) \;\leq\; \mathbb{E}_{\mu}\!\left[\,\widehat{\mathrm{Reg}}^{\,\mu}_{\pi^*}(q_{<t}, \,\cdot)\,\right]. \tag{6} \]

Intuitively, masking parts of a successful trajectory introduces uncertainty about its eventual completion, so a partial context is judged more harshly than the corresponding fully revealed accepted trajectory. The regret objective therefore prefers complete, verifier-consistent reasoning over partial reasoning without requiring any auxiliary masked-data augmentation — the bias is encoded in the loss itself.

This is the structural reason RePO achieves stronger sample efficiency than DPO-style baselines: where DPO must rely on externally constructed contrastive pairs to teach this preference, RePO inherits it directly from the regret decomposition, as we verify empirically below.

Experiments

The experiments ask whether RePO improves human preference alignment and mathematical reasoning, whether RePO_det remains effective without behavior-policy access, and whether regret minimization internalizes the pessimistic inductive bias toward incomplete successful trajectories.

Human preference alignment

Best results are bold; second-best results are underlined.

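Table 1. AlpacaEval 2, Arena-Hard, and MT-Bench results on Qwen3-1.7B/4B.

| Backbone | Method | AlpacaEval 2 LC (%) | AlpacaEval 2 WR (%) | Arena-Hard WR (%) | MT-Bench (GPT-4.1) | MT-Bench (GPT-5.1) |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | Base | 8.27 | 6.05 | 10.4 | 5.09 | 4.41 |
| Qwen3-1.7B-Base | DPO | 23.90 | 25.84 | 23.4 | 6.00 | 4.81 |
| Qwen3-1.7B-Base | IPO | 24.46 | 26.37 | 21.9 | 6.56 | 5.10 |
| Qwen3-1.7B-Base | RPO | 18.63 | 15.47 | 17.7 | 5.98 | 5.13 |
| Qwen3-1.7B-Base | KTO | 34.73 | 38.93 | 30.4 | 6.92 | 5.32 |
| Qwen3-1.7B-Base | TDPO | 12.24 | 10.72 | 14.5 | 5.56 | 4.73 |
| Qwen3-1.7B-Base | RePO | 36.61 | 43.66 | 27.1 | 6.88 | 5.43 |
| Qwen3-1.7B-Base | RePO_det | 34.95 | 41.42 | 26.6 | 6.89 | 5.16 |
| Qwen3-4B-Base | Base | 12.80 | 11.62 | 25.4 | 5.56 | 4.83 |
| Qwen3-4B-Base | DPO | 32.89 | 33.92 | 44.5 | 6.79 | 5.74 |
| Qwen3-4B-Base | IPO | 36.43 | 38.63 | 47.8 | 7.43 | 6.14 |
| Qwen3-4B-Base | RPO | 29.51 | 28.23 | 41.9 | 7.05 | 6.13 |
| Qwen3-4B-Base | KTO | 52.31 | 55.78 | 63.9 | 8.22 | 6.93 |
| Qwen3-4B-Base | TDPO | 17.97 | 17.08 | 30.9 | 6.24 | 5.38 |
| Qwen3-4B-Base | RePO | 55.08 | 60.12 | 60.1 | 8.09 | 6.78 |
| Qwen3-4B-Base | RePO_det | 51.66 | 55.53 | 59.9 | 8.18 | 6.97 |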

Mathematical reasoning

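Table 2. Math-reasoning benchmark results using Qwen3-1.7B/4B.

| Backbone | Method | GSM8K | MATH | MATH500 | AMC23 | Minerva |
|---|---|---|---|---|---|---|
| Qwen3-1.7B-Base | Base | 61.71 | 48.50 | 48.60 | 30.00 | 9.60 |
| Qwen3-1.7B-Base | DPO | 77.33 | 53.44 | 52.80 | 32.50 | 16.91 |
| Qwen3-1.7B-Base | RPO | 69.07 | 50.32 | 51.40 | 30.00 | 13.24 |
| Qwen3-1.7B-Base | IPO | 79.45 | 51.76 | 53.40 | 20.00 | 16.54 |
| Qwen3-1.7B-Base | KTO | 79.68 | 54.42 | 56.60 | 35.00 | 17.28 |
| Qwen3-1.7B-Base | TDPO | 64.06 | 48.80 | 52.20 | 25.00 | 9.93 |
| Qwen3-1.7B-Base | RePO | 80.52 | 54.50 | 57.40 | 30.00 | 20.59 |
| Qwen3-1.7B-Base | RePO_det | 80.74 | 54.84 | 54.40 | 25.00 | 25.74 |
| Qwen3-4B-Base | Base | 78.77 | 61.20 | 64.20 | 32.50 | 19.90 |
| Qwen3-4B-Base | DPO | 87.87 | 56.66 | 57.80 | 35.00 | 27.21 |
| Qwen3-4B-Base | RPO | 90.30 | 63.44 | 66.80 | 47.50 | 22.79 |
| Qwen3-4B-Base | IPO | 88.86 | 58.36 | 57.40 | 45.00 | 27.57 |
| Qwen3-4B-Base | KTO | 90.83 | 67.38 | 67.60 | 55.00 | 25.74 |
| Qwen3-4B-Base | TDPO | 90.67 | 62.76 | 64.80 | 47.50 | 24.26 |
| Qwen3-4B-Base | RePO | 90.60 | 65.54 | 66.20 | 42.50 | 22.43 |
| Qwen3-4B-Base | RePO_det | 91.05 | 65.72 | 65.40 | 47.50 | 23.50 |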

Behavior-policy-free Training

Offline preference datasets often do not expose the behavior policy that generated each response. RePO_det approximates each observed trajectory as if it were generated by a deterministic policy, which makes regret-based training usable in behavior-agnostic and cross-model settings.

Figure 4. Math-reasoning accuracy of DPO and RePO_det on (a) Llama3.2-1B and (b) Gemma-2-2B, trained on Qwen-family preference data.

Sample Efficiency from Regret Inductive Bias

RePO structurally encodes the bias that incomplete successful trajectories should be evaluated pessimistically because their final completion is uncertain. In the masked-data study, DPO improves as more augmented pairs are provided, while RePO remains strong without requiring this additional supervision.

Figure 5. GSM8K accuracy as a function of masked-augmented preference data on (a) Qwen3-1.7B and (b) Qwen3-4B.

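Table 4. Masked-data ablation on mathematical reasoning benchmarks.

| Model | Method | GSM8K | MATH | MSE |
|---|---|---|---|---|
| Qwen3-1.7B | DPO | 72.55 | 52.14 | 0.073 |
| Qwen3-1.7B | DPO_masked | 77.71 (+5.16) | 53.52 (+1.38) | 0.056 |
| Qwen3-1.7B | RePO | 80.52 | 54.50 | 0 |
| Qwen3-1.7B | RePO_masked | 80.29 (-0.23) | 55.26 (+0.76) | - |
| Qwen3-4B | DPO | 85.22 | 63.70 | 0.066 |
| Qwen3-4B | DPO_masked | 88.63 (+3.41) | 64.46 (+0.72) | 0.054 |
| Qwen3-4B | RePO | 90.60 | 65.54 | 0 |
| Qwen3-4B | RePO_masked | 90.98 (+0.38) | 65.46 (-0.08) | - |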

Takeaways

Human-centered objective

RePO models preferences as prospective and counterfactual judgments over reasoning trajectories.

Regret decomposition

The objective combines a local DPO-like term with a long-horizon behavior-deviation term.

Practical offline variant

RePO_det supports offline preference data without explicit behavior-policy metadata.

BibTeX

@inproceedings{kim2026regret,
  title     = {A Regret Minimization Framework on Preference Learning in Large Language Models},
  author    = {Kim, Suhwan and Cho, Taehyun and Kim, Geonhyeong and Kim, Yujin and Jang, Youngsoo and Lee, Moontae and Lee, Jungwoo},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}