Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning
Two Representative GRIP Cases
Explore two GRIP trajectories: relation chaining and semantic reinterpretation with query reformulation. Each round of the interactive walkthrough shows the Current State (e.g., only a related company name is recalled), the Current Action, the External Knowledge retrieved, and what changed in that round (e.g., the parent company is identified, but the founder is still missing).
Why GRIP
Most RAG pipelines still treat retrieval as an external intervention: a separate controller decides whether to retrieve, the retriever runs, and the generator consumes fixed evidence. GRIP instead moves retrieval control into the autoregressive trajectory itself, so information planning, query reformulation, and termination become explicit, trainable actions represented in the same decoding space as language generation.
Retrieval should be part of decoding
GRIP internalizes retrieval timing into generation rather than delegating it to an external controller.
Control tokens make behavior explicit
When to retrieve, what to ask next, and when to stop are all represented as language-native actions.
Structured supervision teaches planning
Four answerability types expose the model to direct answering, retrieval triggering, multi-hop planning, and answer completion.
One-step decision optimization
GRIP learns multi-step retrieval behavior through one-step decision optimization rather than long-horizon search-policy optimization, making it simpler and more stable while preserving adaptive depth and controllable stopping.
Method
GRIP combines a unified token-control interface with Self-Triggered Information Planning. The full model figure below shows the complete training-and-inference pipeline: structured SFT and RL in the upper half, and a token-driven retrieval planning loop in the lower half, all within a single autoregressive trajectory.
Token-level retrieval interface
[RETRIEVE], [INTERMEDIARY], [ANSWER], and [SOLVED] form a minimal but expressive action space for retrieval planning.
Self-triggered planning
The model decides whether current knowledge is sufficient, emits a retrieval action if not, integrates evidence, and repeats until resolved.
Four answerability types
The supervision signal is scenario-typed, so the same model learns direct answering, retrieval triggering, multi-hop planning, and answer completion.
SFT + rule-based RL
SFT learns token-controlled behaviors, and DAPO-based RL further improves answer fidelity and behavior control while reducing redundant retrievals.
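The token-driven planning loop above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: `generate_step` and `retrieve` are hypothetical stand-ins for the model and the retriever, and the budget handling is an assumption.

```python
# Minimal sketch of GRIP-style self-triggered planning. The control tokens
# mirror the paper's interface; everything else here is illustrative.
RETRIEVE, INTERMEDIARY, ANSWER, SOLVED = "[RETRIEVE]", "[INTERMEDIARY]", "[ANSWER]", "[SOLVED]"

def grip_loop(question, generate_step, retrieve, max_budget=3):
    """Run the token-driven loop until [SOLVED] is emitted or the budget is hit."""
    context, retrievals = [question], 0
    while True:
        segment = generate_step(context)  # one autoregressive segment
        context.append(segment)
        # Stop if the model declares the question solved, emits no retrieval
        # action, or the retrieval budget is exhausted.
        if SOLVED in segment or RETRIEVE not in segment or retrievals >= max_budget:
            return segment, retrievals
        # The text after [RETRIEVE] is treated as the follow-up query.
        query = segment.split(RETRIEVE, 1)[1].strip()
        context.append(retrieve(query))   # integrate fresh evidence
        retrievals += 1
```

Because the stop condition, the retrieval trigger, and the follow-up query all live in the generated text, a single decoding loop covers direct answers, single retrievals, and multi-hop chains.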
Four Structured Supervision Types
GRIP uses four structured data types aligned with distinct token trajectories, allowing the model to learn when to answer, when to retrieve, and when to continue planning. Instead of a single generic supervision pattern, each type teaches a different decision boundary in the token-controlled retrieval loop.
Type-α · Direct Answer
Queries that can be answered directly from internal knowledge. The model emits [ANSWER] and ends with [SOLVED].
Type-β · Retrieval Needed
The model has partial knowledge but lacks enough information to finalize the answer. It emits [INTERMEDIARY] and then [RETRIEVE].
Type-γ · Multi-hop Planning
Complex cases requiring iterative evidence gathering. The model produces a refined partial state and then issues a new follow-up query.
Type-θ · Answer Completion
Retrieved passages contain the needed evidence, but the model must still synthesize and finalize the answer before stopping.
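As a rough illustration of the token trajectories these four types target, a shape validator could look like the following. The regex patterns and the `is_well_formed` helper are assumptions for illustration, not code from the paper; note that Type-β and Type-γ share the same surface shape and differ in the context they are conditioned on.

```python
import re

# Hypothetical validator for the four trajectory shapes described above.
PATTERNS = {
    "alpha": r"\[ANSWER\].+\[SOLVED\]",            # direct answer
    "beta":  r"\[INTERMEDIARY\].+\[RETRIEVE\].+",  # partial knowledge, trigger retrieval
    "gamma": r"\[INTERMEDIARY\].+\[RETRIEVE\].+",  # refined state + new follow-up query
    "theta": r"\[ANSWER\].+\[SOLVED\]",            # synthesize from evidence, then stop
}

def is_well_formed(example_type, trajectory):
    """Check that a target trajectory matches its supervision type's token order."""
    return re.fullmatch(PATTERNS[example_type], trajectory.strip()) is not None
```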
Main Results
The table below reproduces the paper’s main result table with the full metric set: EM, ROUGE, F1, and Avg. Score. GRIP is the strongest open-source system in the main comparison and remains competitive with GPT-4o while using a much smaller backbone.
| Method | HotpotQA EM | HotpotQA ROUGE | HotpotQA F1 | PopQA EM | PopQA ROUGE | PopQA F1 | NQ EM | NQ ROUGE | NQ F1 | WebQ EM | WebQ ROUGE | WebQ F1 | TriviaQA EM | TriviaQA ROUGE | TriviaQA F1 | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Training-free Method** | | | | | | | | | | | | | | | | |
| Instruct | 17.2 | 21.6 | 25.9 | 17.4 | 19.6 | 23.2 | 14.4 | 23.8 | 20.3 | 14.8 | 25.1 | 29.4 | 46.2 | 45.3 | 55.1 | 26.6 |
| GPT-3.5 Turbo | 26.2 | 32.4 | 38.2 | 29.1 | 31.1 | 35.3 | 20.7 | 34.3 | 30.0 | 15.9 | 27.4 | 31.9 | 55.6 | 57.8 | 69.7 | 35.7 |
| GPT-4o | 33.2 | 40.2 | 47.0 | 30.6 | 32.3 | 39.9 | 26.5 | 42.7 | 28.3 | 23.5 | 32.3 | 37.0 | 65.7 | 64.3 | 78.2 | 41.4 |
| Single RAG | 26.1 | 31.6 | 37.2 | 22.8 | 26.6 | 31.1 | 19.3 | 28.4 | 24.8 | 14.0 | 22.7 | 26.6 | 46.4 | 47.0 | 56.8 | 30.8 |
| FLARE | 23.2 | 27.9 | 32.8 | 14.3 | 16.0 | 18.4 | 14.7 | 22.4 | 21.3 | 24.2 | 30.3 | 34.7 | 48.6 | 48.5 | 56.4 | 28.9 |
| DRAGIN | 27.9 | 32.6 | 38.7 | 15.5 | 16.8 | 19.8 | 23.9 | 32.8 | 28.5 | 25.2 | 31.5 | 35.7 | 55.3 | 53.6 | 64.6 | 33.5 |
| ETC | 32.5 | 37.7 | 44.2 | 30.5 | 32.5 | 37.5 | 20.9 | 26.7 | 30.7 | 18.9 | 26.6 | 30.4 | 52.9 | 52.1 | 63.0 | 35.8 |
| **Training-based Method** | | | | | | | | | | | | | | | | |
| SFT-RAG | 20.3 | 24.1 | 28.6 | 29.4 | 25.2 | 30.4 | 20.8 | 17.9 | 21.3 | 18.9 | 18.3 | 23.1 | 50.1 | 24.8 | 57.2 | 27.4 |
| Self-RAG | 19.6 | 23.8 | 26.7 | 18.1 | 22.3 | 22.8 | 15.7 | 22.4 | 24.0 | 16.4 | 26.5 | 27.4 | 50.2 | 47.3 | 57.5 | 28.0 |
| INFO-RAG | 19.9 | 23.7 | 26.9 | 18.3 | 22.6 | 23.0 | 17.2 | 22.9 | 24.9 | 18.1 | 27.7 | 28.9 | 50.8 | 47.8 | 58.1 | 28.7 |
| RobustRAG | 27.6 | 31.8 | 37.5 | 29.7 | 27.7 | 32.4 | 26.4 | 25.1 | 29.2 | 21.5 | 25.0 | 29.1 | 48.8 | 47.7 | 57.9 | 33.2 |
| GainRAG | 31.4 | 35.6 | 41.8 | 30.1 | 33.3 | 38.1 | 22.9 | 27.9 | 32.2 | 16.5 | 24.5 | 28.9 | 50.3 | 49.1 | 59.2 | 34.8 |
| R1-Searcher | 26.0 | 29.1 | 34.9 | 41.6 | 35.2 | 41.3 | 25.8 | 24.9 | 28.7 | 21.8 | 26.1 | 30.6 | 56.0 | 53.3 | 64.9 | 36.0 |
| RetRobust | 29.6 | 34.9 | 40.9 | 34.1 | 35.1 | 40.4 | 24.2 | 29.0 | 33.8 | 21.8 | 27.4 | 31.7 | 53.6 | 50.9 | 61.9 | 36.6 |
| InstructRAG | 31.2 | 36.8 | 42.3 | 33.1 | 35.7 | 40.3 | 29.5 | 29.5 | 33.6 | 19.5 | 26.8 | 31.4 | 51.3 | 52.2 | 62.5 | 37.0 |
| GRIP (ours) | 33.0 | 37.6 | 44.1 | 38.6 | 37.5 | 38.4 | 32.1 | 35.8 | 32.0 | 31.4 | 39.3 | 34.6 | 57.9 | 55.9 | 67.4 | 41.0 |
| w/o RL | 31.6 | 36.6 | 43.0 | 38.1 | 37.1 | 37.6 | 32.6 | 36.1 | 32.7 | 32.0 | 39.9 | 35.1 | 57.0 | 55.2 | 66.8 | 40.7 |
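For reference, the Avg. Score column is consistent with the unweighted mean of the fifteen per-dataset metrics (EM, ROUGE, F1 on five datasets); the GRIP row checks out:

```python
# Sanity check: average the fifteen metric values in the GRIP row.
grip = [33.0, 37.6, 44.1,   # HotpotQA EM / ROUGE / F1
        38.6, 37.5, 38.4,   # PopQA
        32.1, 35.8, 32.0,   # NQ
        31.4, 39.3, 34.6,   # WebQ
        57.9, 55.9, 67.4]   # TriviaQA
avg_score = round(sum(grip) / len(grip), 1)  # → 41.0, matching the table
```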
Behavior Analysis
Before turning to deeper control analyses, the paper first shows that GRIP learns task-aware retrieval depth and can improve retrieval quality by generating follow-up queries conditioned on evolving intermediate states.
Adaptive retrieval across tasks
GRIP retrieves more on HotpotQA and PopQA, less on NQ, and far less on average than R1-Searcher. RL reduces redundant retrieval while preserving the same task-aware pattern.
Improving retrieval quality with new queries
GRIP-generated follow-up queries substantially improve gold-answer coverage in top-1 and top-3 retrieved passages on both NQ and WebQ.
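Gold-answer coverage here can be read as the fraction of questions whose gold answer string appears in at least one top-k retrieved passage. A minimal sketch, assuming a simple case-insensitive substring criterion (the paper's exact matching rule may differ):

```python
def coverage_at_k(gold_answers, retrieved, k):
    """Fraction of questions whose gold answer appears in any top-k passage.

    gold_answers: dict question -> gold answer string
    retrieved:    dict question -> ranked list of passage strings
    """
    hits = 0
    for question, gold in gold_answers.items():
        top_k = retrieved.get(question, [])[:k]
        if any(gold.lower() in passage.lower() for passage in top_k):
            hits += 1
    return hits / max(len(gold_answers), 1)
```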
Controllable Retrieval Budget and Depth Extrapolation
This is one of the key analysis sections in the paper. GRIP does not blindly consume the available retrieval budget: even when the maximum budget B increases, realized retrieval count stays far below the cap, while performance improves modestly. The distribution plots below show selective extrapolation beyond the training depth of three retrieval steps.
Retrieval-count distributions under budget control
Figure 14 shows that as the inference-time budget increases from B=3 to B=10, GRIP shifts some mass beyond r>3 while keeping most probability concentrated on small retrieval counts. This indicates selective and task-aware extrapolation rather than mechanical budget consumption.
Effect of maximum retrieval budget B
| Max B | Hotpot | PopQA | NQ | WebQ | Trivia | Avg.Count | Avg.Score |
|---|---|---|---|---|---|---|---|
| 3 | 1.44 | 1.58 | 0.76 | 1.15 | 1.25 | 1.24 | 41.0 |
| 5 | 1.56 | 1.73 | 1.01 | 1.45 | 1.31 | 1.41 | 41.2 |
| 7 | 1.66 | 1.79 | 1.12 | 1.65 | 1.35 | 1.51 | 41.5 |
| 10 | 1.74 | 1.86 | 1.22 | 1.90 | 1.39 | 1.62 | 41.8 |
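The Avg.Count column is consistent with the unweighted mean of the five per-dataset realized retrieval counts in each row, which is easy to verify:

```python
# Per-budget realized retrieval counts from the table above
# (HotpotQA, PopQA, NQ, WebQ, TriviaQA).
counts = {
    3:  [1.44, 1.58, 0.76, 1.15, 1.25],
    5:  [1.56, 1.73, 1.01, 1.45, 1.31],
    7:  [1.66, 1.79, 1.12, 1.65, 1.35],
    10: [1.74, 1.86, 1.22, 1.90, 1.39],
}
# Unweighted mean per budget reproduces the Avg.Count column.
avg_count = {b: round(sum(v) / len(v), 2) for b, v in counts.items()}
```

Even at B=10, the realized average stays near 1.6 retrievals, far below the cap.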
Behavior shift after RL
Appendix G shows how RL shifts the retrieval-count distribution on NQ and WebQ. Compared with GRIP without RL, the final model places more mass on Retrieve = 0 for cases that can be answered directly and increases the proportion of Retrieve = 1 where a single retrieval step suffices. This supports the paper’s claim that RL improves behavior control rather than merely changing answer style.
General Capability Preservation
Another critical analysis in the paper asks whether retrieval-planning fine-tuning damages non-RAG abilities. GRIP shows only minimal degradation on closed-form benchmarks and remains competitive with the base instruct model on open-ended summarization judged by GPT-4o.
Closed-form non-RAG evaluation
| Model | MMLU (Acc) | MBPP (Pass@1) |
|---|---|---|
| Instruct | 66.56 | 54.6 |
| SFT-RAG | 62.93 | 47.8 |
| GRIP | 65.73 | 53.8 |
Pairwise GPT-4o evaluation on CNN/DailyMail
| Pairwise comparison (GRIP vs.) | Win | Equal | Loss |
|---|---|---|---|
| Instruct | 51.0 | 37.5 | 11.5 |
| SFT-RAG | 90.5 | 8.5 | 1.0 |
| R1-Searcher | 59.5 | 32.5 | 8.0 |
Resources
Reproducibility at a glance
Structured supervision types
GRIP training is organized around Type-α: direct answer, Type-β: retrieval needed, Type-γ: multi-hop planning, and Type-θ: answer completion.
Data construction
data_generation/first.sh, make_first_steps.py, use_gpt_for_data.py, merge_dataset.py, and index.py support data construction and BM25 indexing.
Inference
inference/agent.py provides the multi-step GRIP inference logic, and inference/inference.sh is the example launch script.
Preprocess & training
train/examples/data_preprocess/grip/sft.py, rl.py, run_sft_llama.sh, and dapo_4w_continue_rl_ep3_llama.sh cover the two-stage optimization pipeline.
Evaluation
eval/eval.py and eval/utils.py compute EM, F1, ROUGE, and related outputs under a unified protocol.
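EM and token-level F1 in open-domain QA typically follow a standard recipe (lowercasing, punctuation and article stripping, whitespace normalization, bag-of-tokens overlap). The sketch below shows that common recipe; the exact normalization in eval/eval.py may differ slightly.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Standard QA answer normalization (assumed, not the repo's exact code)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace

def exact_match(pred, gold):
    """1.0 if normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 over the normalized bags of tokens."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```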
Released data and weights
The repository also links released structured datasets and trained checkpoints, making it possible to reproduce GRIP training and evaluate the final model directly.
BibTeX
If you find this work useful, please cite the paper below.
@misc{li2026retrievalgenerationunifiedframework,
  title={Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning},
  author={Bo Li and Mingda Wang and Gexiang Fang and Shikun Zhang and Wei Ye},
  year={2026},
  eprint={2604.11407},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.11407},
}