ACL 2026 Main Conference

Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning

Bo Li1, Mingda Wang2, Gexiang Fang1, Shikun Zhang1, Wei Ye1
1National Engineering Research Center for Software Engineering, Peking University, 2Hebei University of Technology
GRIP rethinks RAG by making retrieval control part of generation itself. Instead of relying on external controllers or heuristic triggers, the model emits token-level actions to decide when to retrieve, how to reformulate follow-up queries, and when to stop. It further learns multi-step retrieval behavior through one-step decision optimization, avoiding heavy long-horizon search-policy optimization while preserving adaptive depth and controllable stopping.
[RETRIEVE] [INTERMEDIARY] [ANSWER] [SOLVED]

Two Representative GRIP Cases

Explore two GRIP trajectories: relation chaining and semantic reinterpretation with query reformulation.

English Case · Relation Chaining
Question
Who founded the company that Robinsons-May is a subsidiary of?

Current State

Only a related company name is recalled.

Current Action

Round 1 of a 3-round retrieval budget.

External Knowledge

Robinsons-May is a subsidiary of The May Department Stores Company.

What changed in this round?

The parent company is identified, but the founder is still missing.

Final Answer
Final answer pending: after Round 1 the parent company is known, but the founder has not yet been identified.

Why GRIP

Most RAG pipelines still treat retrieval as an external intervention: a separate controller decides whether to retrieve, the retriever runs, and the generator consumes fixed evidence. GRIP instead moves retrieval control into the autoregressive trajectory itself, so information planning, query reformulation, and termination become explicit, trainable actions represented in the same decoding space as language generation.

01

Retrieval should be part of decoding

GRIP internalizes retrieval timing into generation rather than delegating it to an external controller.

02

Control tokens make behavior explicit

When to retrieve, what to ask next, and when to stop are all represented as language-native actions.

03

Structured supervision teaches planning

Four answerability types expose the model to direct answering, retrieval triggering, multi-hop planning, and answer completion.

04

One-step decision optimization

GRIP learns multi-step retrieval behavior through one-step decision optimization rather than long-horizon search-policy optimization, making it simpler and more stable while preserving adaptive depth and controllable stopping.

Method

GRIP combines a unified token-control interface with Self-Triggered Information Planning. The full model figure below shows the complete training-and-inference pipeline: structured SFT and RL in the upper half, and a token-driven retrieval planning loop in the lower half, all within a single autoregressive trajectory.

GRIP model figure
Control Tokens

Token-level retrieval interface

[RETRIEVE], [INTERMEDIARY], [ANSWER], and [SOLVED] form a minimal but expressive action space for retrieval planning.

Planning Loop

Self-triggered planning

The model decides whether current knowledge is sufficient, emits a retrieval action if not, integrates evidence, and repeats until resolved.
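The self-triggered loop described above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only; the function and class names (`grip_loop`, `model.generate`, `retriever.search`) are assumptions for exposition, not the API of the released `inference/agent.py`.

```python
# Sketch of GRIP's self-triggered planning loop (illustrative names, not the released API).
CONTROL_TOKENS = ("[RETRIEVE]", "[INTERMEDIARY]", "[ANSWER]", "[SOLVED]")

def grip_loop(model, retriever, question, max_budget=3):
    """Decode until [SOLVED], issuing retrievals the model itself triggers."""
    context = f"Question: {question}\n"
    for _ in range(max_budget):
        # Generate until one of the control tokens fires; `step` carries the
        # emitted text and which token stopped decoding.
        step = model.generate(context, stop=CONTROL_TOKENS)
        context += step.text
        if step.stop_token == "[SOLVED]":
            return step.text  # final answer already emitted
        if step.stop_token == "[RETRIEVE]":
            # Assumed convention: the follow-up query is the last emitted line.
            query = step.text.splitlines()[-1]
            passages = retriever.search(query, top_k=3)
            context += "\nEvidence: " + " ".join(passages) + "\n"
    # Budget exhausted: force answer completion.
    return model.generate(context + "[ANSWER]", stop=("[SOLVED]",)).text
```

The key design point is that the controller is the decoder itself: retrieval, reformulation, and termination are all just tokens in one autoregressive trajectory.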

Structured Data

Four answerability types

The supervision signal is scenario-typed, so the same model learns direct answering, retrieval triggering, multi-hop planning, and answer completion.

Optimization

SFT + rule-based RL

SFT learns token-controlled behaviors, and DAPO-based RL further improves answer fidelity and behavior control while reducing redundant retrievals.
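A rule-based reward in this spirit can be sketched as an exact-match answer term minus a per-retrieval cost, which rewards fidelity while discouraging redundant retrievals. The weighting and normalization below are illustrative assumptions, not the paper's actual reward specification.

```python
# Hedged sketch of a rule-based RL reward: answer correctness minus a
# small cost per retrieval call. Weights are illustrative assumptions.
def normalize(s: str) -> str:
    # Lowercase and collapse whitespace before comparison.
    return " ".join(s.lower().split())

def rule_reward(pred: str, gold: str, n_retrievals: int,
                retrieval_cost: float = 0.1) -> float:
    answer_reward = 1.0 if normalize(pred) == normalize(gold) else 0.0
    return answer_reward - retrieval_cost * n_retrievals
```

Under a reward like this, a correct answer with one retrieval beats a correct answer with three, which matches the observed drop in redundant retrievals after RL.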

Four Structured Supervision Types

GRIP uses four structured data types aligned with distinct token trajectories, allowing the model to learn when to answer, when to retrieve, and when to continue planning. Instead of a single generic supervision pattern, each type teaches a different decision boundary in the token-controlled retrieval loop.

GRIP four supervision types
α

Type-α · Direct Answer

Queries that can be answered directly from internal knowledge. The model emits [ANSWER] and ends with [SOLVED].

β

Type-β · Retrieval Needed

The model has partial knowledge but lacks enough information to finalize the answer. It emits [INTERMEDIARY] and then [RETRIEVE].

γ

Type-γ · Multi-hop Planning

Complex cases requiring iterative evidence gathering. The model produces a refined partial state and then issues a new follow-up query.

θ

Type-θ · Answer Completion

Retrieved passages contain the needed evidence, but the model must still synthesize and finalize the answer before stopping.
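The four types can be thought of as four target token trajectories. The serialization below is a guess at the general shape for illustration; the exact templates in the released data may differ.

```python
# Illustrative token trajectories for the four supervision types
# (assumed serialization, not necessarily the released data format).
TRAJECTORIES = {
    "alpha": "{answer} [ANSWER] [SOLVED]",                            # direct answer
    "beta":  "{partial_state} [INTERMEDIARY] {query} [RETRIEVE]",     # retrieval needed
    "gamma": "{refined_state} [INTERMEDIARY] {next_query} [RETRIEVE]",# multi-hop planning
    "theta": "{synthesized_answer} [ANSWER] [SOLVED]",                # answer completion
}

def make_example(type_key: str, **slots) -> str:
    """Fill a trajectory template to produce one supervision target."""
    return TRAJECTORIES[type_key].format(**slots)
```

Each template ends on a different control token, which is exactly the decision boundary the model is meant to learn: answer now, or retrieve again.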

Main Results

The table below reproduces the paper’s main result table with the full metric set: EM, ROUGE, F1, and Avg.Score. GRIP is the strongest open-source system in the main comparison and remains competitive with GPT-4o using a much smaller backbone.

Table 1 in the paper reports EM, ROUGE (average of ROUGE-1/2/L), F1, and Avg.Score on five QA benchmarks.
Method | HotpotQA (EM / ROUGE / F1) | PopQA (EM / ROUGE / F1) | NQ (EM / ROUGE / F1) | WebQ (EM / ROUGE / F1) | TriviaQA (EM / ROUGE / F1) | Avg.Score
------ | ------ | ------ | ------ | ------ | ------ | ------
Training-free Methods
Instruct | 17.2 / 21.6 / 25.9 | 17.4 / 19.6 / 23.2 | 14.4 / 23.8 / 20.3 | 14.8 / 25.1 / 29.4 | 46.2 / 45.3 / 55.1 | 26.6
GPT-3.5 Turbo | 26.2 / 32.4 / 38.2 | 29.1 / 31.1 / 35.3 | 20.7 / 34.3 / 30.0 | 15.9 / 27.4 / 31.9 | 55.6 / 57.8 / 69.7 | 35.7
GPT-4o | 33.2 / 40.2 / 47.0 | 30.6 / 32.3 / 39.9 | 26.5 / 42.7 / 28.3 | 23.5 / 32.3 / 37.0 | 65.7 / 64.3 / 78.2 | 41.4
Single RAG | 26.1 / 31.6 / 37.2 | 22.8 / 26.6 / 31.1 | 19.3 / 28.4 / 24.8 | 14.0 / 22.7 / 26.6 | 46.4 / 47.0 / 56.8 | 30.8
FLARE | 23.2 / 27.9 / 32.8 | 14.3 / 16.0 / 18.4 | 14.7 / 22.4 / 21.3 | 24.2 / 30.3 / 34.7 | 48.6 / 48.5 / 56.4 | 28.9
DRAGIN | 27.9 / 32.6 / 38.7 | 15.5 / 16.8 / 19.8 | 23.9 / 32.8 / 28.5 | 25.2 / 31.5 / 35.7 | 55.3 / 53.6 / 64.6 | 33.5
ETC | 32.5 / 37.7 / 44.2 | 30.5 / 32.5 / 37.5 | 20.9 / 26.7 / 30.7 | 18.9 / 26.6 / 30.4 | 52.9 / 52.1 / 63.0 | 35.8
Training-based Methods
SFT-RAG | 20.3 / 24.1 / 28.6 | 29.4 / 25.2 / 30.4 | 20.8 / 17.9 / 21.3 | 18.9 / 18.3 / 23.1 | 50.1 / 24.8 / 57.2 | 27.4
Self-RAG | 19.6 / 23.8 / 26.7 | 18.1 / 22.3 / 22.8 | 15.7 / 22.4 / 24.0 | 16.4 / 26.5 / 27.4 | 50.2 / 47.3 / 57.5 | 28.0
INFO-RAG | 19.9 / 23.7 / 26.9 | 18.3 / 22.6 / 23.0 | 17.2 / 22.9 / 24.9 | 18.1 / 27.7 / 28.9 | 50.8 / 47.8 / 58.1 | 28.7
RobustRAG | 27.6 / 31.8 / 37.5 | 29.7 / 27.7 / 32.4 | 26.4 / 25.1 / 29.2 | 21.5 / 25.0 / 29.1 | 48.8 / 47.7 / 57.9 | 33.2
GainRAG | 31.4 / 35.6 / 41.8 | 30.1 / 33.3 / 38.1 | 22.9 / 27.9 / 32.2 | 16.5 / 24.5 / 28.9 | 50.3 / 49.1 / 59.2 | 34.8
R1-Searcher | 26.0 / 29.1 / 34.9 | 41.6 / 35.2 / 41.3 | 25.8 / 24.9 / 28.7 | 21.8 / 26.1 / 30.6 | 56.0 / 53.3 / 64.9 | 36.0
RetRobust | 29.6 / 34.9 / 40.9 | 34.1 / 35.1 / 40.4 | 24.2 / 29.0 / 33.8 | 21.8 / 27.4 / 31.7 | 53.6 / 50.9 / 61.9 | 36.6
InstructRAG | 31.2 / 36.8 / 42.3 | 33.1 / 35.7 / 40.3 | 29.5 / 29.5 / 33.6 | 19.5 / 26.8 / 31.4 | 51.3 / 52.2 / 62.5 | 37.0
GRIP (ours) | 33.0 / 37.6 / 44.1 | 38.6 / 37.5 / 38.4 | 32.1 / 35.8 / 32.0 | 31.4 / 39.3 / 34.6 | 57.9 / 55.9 / 67.4 | 41.0
  w/o RL | 31.6 / 36.6 / 43.0 | 38.1 / 37.1 / 37.6 | 32.6 / 36.1 / 32.7 | 32.0 / 39.9 / 35.1 | 57.0 / 55.2 / 66.8 | 40.7
GRIP reaches an Avg.Score of 41.0, outperforming all open-source baselines in the main table and approaching GPT-4o while using a substantially smaller backbone.

Behavior Analysis

Before turning to deeper control analyses, the paper first shows that GRIP learns task-aware retrieval depth and can improve retrieval quality by generating follow-up queries conditioned on evolving intermediate states.

Adaptive retrieval across tasks

GRIP retrieves more on HotpotQA and PopQA, less on NQ, and far less than R1-Searcher on average. RL reduces redundant retrieval while preserving the same task-aware pattern.

Table 4 in the paper reports mean retrieval count per dataset: GRIP averages 1.24 calls, versus 5.12 for R1-Searcher and 1.60 for GRIP without RL.

Improving retrieval quality with new queries

GRIP-generated follow-up queries substantially improve gold-answer coverage in top-1 and top-3 retrieved passages on both NQ and WebQ.

The larger gains on top-1 than top-3 suggest that GRIP improves evidence ranking, not just recall.
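The coverage metric behind this analysis can be sketched simply: a question counts as covered at rank k if any of its top-k retrieved passages contains the gold answer. The function names below are illustrative, and real evaluation scripts typically apply answer normalization that this sketch omits.

```python
# Sketch of gold-answer coverage@k for retrieved passages
# (simplified: case-insensitive substring match, no answer normalization).
def coverage_at_k(ranked_passages, gold_answer, k):
    """True if any of the top-k passages contains the gold answer string."""
    return any(gold_answer.lower() in p.lower() for p in ranked_passages[:k])

def mean_coverage(results, k):
    """results: list of (ranked_passages, gold_answer) pairs."""
    hits = [coverage_at_k(passages, gold, k) for passages, gold in results]
    return sum(hits) / len(hits)
```

Comparing `mean_coverage(results, 1)` against `mean_coverage(results, 3)` before and after query reformulation is one way to separate ranking gains from pure recall gains.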

Controllable Retrieval Budget and Depth Extrapolation

This is one of the key analysis sections in the paper. GRIP does not blindly consume the available retrieval budget: even when the maximum budget B increases, realized retrieval count stays far below the cap, while performance improves modestly. The distribution plots below show selective extrapolation beyond the training depth of three retrieval steps.

Retrieval-count distributions under budget control


Figure 14 shows that as the inference-time budget increases from B=3 to B=10, GRIP shifts some probability mass beyond r=3 while keeping most of it concentrated on small retrieval counts. This indicates selective, task-aware extrapolation rather than mechanical budget consumption.

Effect of maximum retrieval budget B

Max B | HotpotQA | PopQA | NQ | WebQ | TriviaQA | Avg.Count | Avg.Score
----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
3 | 1.44 | 1.58 | 0.76 | 1.15 | 1.25 | 1.24 | 41.0
5 | 1.56 | 1.73 | 1.01 | 1.45 | 1.31 | 1.41 | 41.2
7 | 1.66 | 1.79 | 1.12 | 1.65 | 1.35 | 1.51 | 41.5
10 | 1.74 | 1.86 | 1.22 | 1.90 | 1.39 | 1.62 | 41.8
Per-dataset columns are mean retrieval counts under each budget.
As B increases from 3 to 10, Avg.Score rises from 41.0 to 41.8 while Avg.Count grows only from 1.24 to 1.62.

Behavior shift after RL

Appendix G retrieval count shift after RL

Appendix G shows how RL shifts the retrieval-count distribution on NQ and WebQ. Compared with GRIP without RL, the final model places more mass on Retrieve = 0 for cases that can be answered directly, while also increasing the proportion of Retrieve = 1 cases where one retrieval step is sufficient. This supports the paper's claim that RL improves behavior control rather than merely changing answer style.

General Capability Preservation

Another critical analysis in the paper asks whether retrieval-planning fine-tuning damages non-RAG abilities. GRIP shows only minimal degradation on closed-form benchmarks and remains competitive with the base instruct model on open-ended summarization judged by GPT-4o.

Closed-form non-RAG evaluation

Model | MMLU (Acc) | MBPP (Pass@1)
----- | ----- | -----
Instruct | 66.56 | 54.6
SFT-RAG | 62.93 | 47.8
GRIP | 65.73 | 53.8
Compared with the base instruct model, GRIP drops only 0.83 points on MMLU and 0.8 on MBPP, much less than the matched SFT-RAG baseline.

Pairwise GPT-4o evaluation on CNN/DailyMail

Pairwise comparison (GRIP vs.) | Win | Equal | Loss
----- | ----- | ----- | -----
Instruct | 51.0 | 37.5 | 11.5
SFT-RAG | 90.5 | 8.5 | 1.0
R1-Searcher | 59.5 | 32.5 | 8.0
GRIP remains competitive with the base instruct model and is preferred much more often than SFT-RAG or R1-Searcher, suggesting that retrieval control does not come at the cost of general generation quality.

Resources

Reproducibility at a glance

T

Structured supervision types

GRIP training is organized around Type-α: direct answer, Type-β: retrieval needed, Type-γ: multi-hop planning, and Type-θ: answer completion.

D

Data construction

data_generation/first.sh, make_first_steps.py, use_gpt_for_data.py, merge_dataset.py, and index.py support data construction and BM25 indexing.
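The BM25 step of the indexing pipeline can be illustrated with a tiny pure-Python scorer. This is a pedagogical stand-in only; the released `index.py` presumably uses an off-the-shelf BM25 implementation, and the parameter defaults here are the usual textbook values, not the repo's settings.

```python
# Tiny pure-Python BM25 sketch (Okapi BM25 with standard k1/b defaults),
# standing in for the repo's actual indexing code.
import math
from collections import Counter

def bm25_rank(corpus, query, k1=1.5, b=0.75, top_k=3):
    """Return the top_k documents from `corpus` ranked by BM25 score for `query`."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each term across the corpus.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append((score, i))
    scores.sort(reverse=True)
    return [corpus[i] for _, i in scores[:top_k]]
```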

I

Inference

inference/agent.py provides the multi-step GRIP inference logic, and inference/inference.sh is the example launch script.

P

Preprocess & training

train/examples/data_preprocess/grip/sft.py, rl.py, run_sft_llama.sh, and dapo_4w_continue_rl_ep3_llama.sh cover the two-stage optimization pipeline.

E

Evaluation

eval/eval.py and eval/utils.py compute EM, F1, ROUGE, and related outputs under a unified protocol.
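The EM and token-level F1 metrics can be sketched as follows. The normalization here is deliberately simplified (lowercasing plus stripping commas and periods); the actual `eval/utils.py` may normalize differently, e.g. also removing articles.

```python
# Minimal sketch of EM and token-level F1 for QA evaluation
# (simplified normalization; real eval scripts often do more).
from collections import Counter

def normalize(s: str) -> str:
    # Lowercase, strip commas/periods, collapse whitespace.
    return " ".join(s.lower().replace(",", " ").replace(".", " ").split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    p_tokens, g_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(p_tokens) & Counter(g_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(g_tokens)
    return 2 * precision * recall / (precision + recall)
```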

W

Released data and weights

The repository also links released structured datasets and trained checkpoints, making it possible to reproduce GRIP training and evaluate the final model directly.

Wikipedia / QA data → Type-α / β / γ / θ structured supervision → SFT parquet + RL parquet → supervised fine-tuning → DAPO-based RL → merged checkpoint → multi-step inference → benchmark evaluation
Citation

BibTeX

If you find this work useful, please cite the paper below.

@misc{li2026retrievalgenerationunifiedframework,
      title={Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning}, 
      author={Bo Li and Mingda Wang and Gexiang Fang and Shikun Zhang and Wei Ye},
      year={2026},
      eprint={2604.11407},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.11407}, 
}