arxiv:2605.10899

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Published on May 11 · Submitted by Gaotang Li on May 13

Abstract

AI-generated summary: Deep research agents trained using the RubricEM framework demonstrate superior performance on long-form research tasks through rubric-guided reinforcement learning with stage-aware planning and reflection-based meta-policy evolution.

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
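To make the idea of stagewise credit concrete, here is a minimal sketch of a group-relative advantage computed separately for each stage. It is not the paper's Stage-Structured GRPO: the abstract does not give the formula, so the per-stage normalization, the function name `stagewise_group_advantages`, the `eps` parameter, and the score format are all assumptions made for illustration. Only the four stage names come from the authors' description of the trajectory.

```python
from statistics import mean, pstdev

# Stage names follow the Plan/Research/Review/Answer trajectory described by the authors.
STAGES = ("plan", "research", "review", "answer")


def stagewise_group_advantages(group_scores, eps=1e-6):
    """Normalize judge scores within a rollout group, one stage at a time.

    group_scores: list of dicts, one per rollout of the same prompt,
                  mapping stage name -> judge score (hypothetical format).
    Returns a list of dicts mapping stage name -> group-relative advantage.
    """
    advantages = [{} for _ in group_scores]
    for stage in STAGES:
        scores = [rollout[stage] for rollout in group_scores]
        mu = mean(scores)
        sigma = pstdev(scores)  # population std over the group
        for adv, score in zip(advantages, scores):
            # Standard GRPO-style normalization, applied per stage (assumption).
            adv[stage] = (score - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    group = [
        {"plan": 1.0, "research": 0.5, "review": 0.2, "answer": 0.9},
        {"plan": 0.0, "research": 0.5, "review": 0.8, "answer": 0.1},
    ]
    for rollout_adv in stagewise_group_advantages(group):
        print(rollout_adv)
```

In this sketch a rollout can receive positive credit for one stage and negative credit for another, which is one plausible reading of "denser semantic feedback" compared with a single scalar reward per trajectory.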

Community

Paper author Paper submitter

TLDR: RubricEM introduces a rubric-guided reinforcement learning framework for training long-form deep research agents, enabling finer-grained stagewise credit assignment and reflection meta-policy training beyond verifiable rewards.

noticed a potential snag: self-generated rubrics guiding plan, research, review, and answer might drift or get biased as training progresses. did you try any calibration mechanism to anchor rubrics across stages so credit signals stay aligned with the intended semantics rather than chasing emergent artifacts? an ablation removing the rubric memory would be telling, since the reflection bank seems to be where reusable guidance really comes from. btw the arXivLens breakdown helped me parse the method in more depth, it covers the stage-structured credit and the reflection loop nicely: https://arxivlens.com/PaperView/Details/rubricem-meta-rl-with-rubric-guided-policy-decomposition-beyond-verifiable-rewards-4221-54872c72

Paper author

Thank you for the thoughtful comment and for sharing the arXivLens write-up! We would like to first clarify a potential misunderstanding. In RubricEM, the agent’s self-generated rubrics are mainly used to guide its own Plan/Research/Review/Answer trajectory, while reward credit comes from separate judge-side stagewise rubrics produced by an external judge model whose parameters are not updated during RL. The self-rubrics can inform the judge, but they are not blindly used as the scoring standard.

We also include relevant ablations: Sec. 5.1 shows that SS-GRPO still improves without the reflection meta-policy / rubric-bank memory, and the full method performs best when the memory is added. Sec. 5.2 and Fig. 6 further show that rubric-structured scaffolding improves both SFT quality and RL gains (compared to standard ReAct-style scaffolding); notably, this RL ablation does not use memory either, so the gains mainly come from a better scaffold that makes trajectories more structured and gives the judge better references. We hope this explanation helps.


