Instruction Data Selection via Answer Divergence
Why ADG
Many instructions admit several valid answers, so single-reference scoring can be structurally misleading. ADG turns this ambiguity into a useful signal by asking how sampled answers distribute in representation space: are they tightly clustered, stretched along one direction, or scattered in a high-signal multi-directional regime? This shift lets ADG prioritize examples that are more likely to reveal genuine model uncertainty rather than superficial stylistic variation.
Single-reference scoring is limited
A high loss may reflect harmless differences in style, format, or reasoning path rather than a real competence gap.
Multi-sample behavior is more informative
Several sampled answers reveal whether the model exhibits structured disagreement or only trivial paraphrastic variation.
Coverage is preserved explicitly
ADG combines geometry-aware scoring with proportional selection inside semantic bins so the final subset stays broad and useful.
Method
The method is shown first as the complete pipeline diagram and then decomposed into three steps, so readers can move from the global workflow down to the two scoring components, D(x) and I(x).
Sample multiple answers
- Draw K stochastic answers for each instruction using high-temperature decoding.
- Treat the answer set as a structured object rather than relying on one teacher answer.
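The sampling step can be sketched as follows. The `generate_fn` callable is a hypothetical stand-in for the actual decoder (in the released repo this role is played by generation/generation.py, whose exact interface may differ); the toy generator below only illustrates how high-temperature decoding yields varied surface forms.

```python
import random

def sample_answers(generate_fn, instruction, k=8, temperature=1.0):
    """Draw k stochastic answers for one instruction and return them as a set
    of strings, to be treated as one structured object downstream."""
    return [generate_fn(instruction, temperature) for _ in range(k)]

# Toy stand-in for a real decoder: it picks a phrasing at random, mimicking
# the surface variation that high-temperature sampling produces.
def toy_generate(instruction, temperature):
    phrasings = ["Paris.", "The capital is Paris.", "It is Paris, France."]
    return random.choice(phrasings)

answers = sample_answers(toy_generate, "What is the capital of France?", k=5)
```

With a real model, `generate_fn` would wrap a sampling-mode decoding call (e.g., temperature well above 1.0) so the K answers expose genuine disagreement rather than a single greedy path.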
Score the geometry
- Pool output-token hidden states into answer embeddings.
- D(x) quantifies spread around the mean.
- I(x) quantifies whether the spread is multi-directional.
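The page does not give closed-form definitions for D(x) and I(x), so the sketch below uses natural stand-ins: D(x) as the mean distance of answer embeddings to their centroid, and I(x) as the normalized spectral entropy of the embedding covariance, which is near 0 when spread lies along a single direction and near 1 when it is multi-directional. These exact formulas are an assumption, not the paper's.

```python
import numpy as np

def divergence_scores(E):
    """Score the geometry of one instruction's K answer embeddings E (K x d).

    D(x): mean distance to the centroid (spread around the mean).
    I(x): normalized spectral entropy of the covariance eigenvalues
    (whether the spread is multi-directional).
    """
    centroid = E.mean(axis=0)
    D = float(np.linalg.norm(E - centroid, axis=1).mean())
    cov = np.cov(E, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    total = eig.sum()
    if total == 0.0:            # all K answers identical: no spread at all
        return D, 0.0
    p = eig / total
    nz = p[p > 1e-12]           # drop zero directions before taking logs
    entropy = -(nz * np.log(nz)).sum()
    I = float(entropy / np.log(len(p)))  # 0 = one direction, 1 = isotropic
    return D, I
```

Embeddings stretched along one axis score high D(x) but near-zero I(x); an isotropic cloud of the same radius scores high on both, matching the high-signal multi-directional regime described above.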
Select with semantic coverage
- Cluster instruction embeddings into semantic bins.
- Assign a proportional quota to each bin.
- Select top-ranked instructions within each region.
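The coverage step above can be sketched as a quota fill over cluster labels. The rounding and tie-breaking here are illustrative choices, not the repo's exact policy.

```python
from collections import defaultdict

def select_with_coverage(scores, bins, budget):
    """Pick roughly `budget` instructions: each semantic bin gets a quota
    proportional to its size, filled with its top-scored members.

    scores: per-instruction scores (e.g., a combination of D(x) and I(x)).
    bins:   parallel list of cluster labels from instruction embeddings.
    Note: per-bin rounding can leave the total slightly off `budget`.
    """
    members = defaultdict(list)
    for idx, label in enumerate(bins):
        members[label].append(idx)
    n = len(bins)
    selected = []
    for label, idxs in members.items():
        quota = round(budget * len(idxs) / n)       # proportional quota
        idxs.sort(key=lambda i: scores[i], reverse=True)
        selected.extend(idxs[:quota])               # top-ranked within bin
    return selected
```

Because quotas track bin sizes, a dominant cluster cannot absorb the whole budget, which is how the final subset stays semantically broad.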
Main Results
Across two backbones and three public instruction pools, ADG achieves the best Avg. Score in every setting under the same 10K budget.
Analysis
The analysis examines what ADG actually selects and how robust the gains are: shifts in task-type composition, a quadrant-based view of the D(x)/I(x) geometry, behavior under different selection budgets, selection cost, and cross-backbone generalization.
Task-type composition shift
ADG increases the proportion of Discrete Decision and Math while reducing Writing and Knowledge-heavy prompts, suggesting that the selected data better emphasizes boundary-revealing and verifiable tasks.
Quadrant-based case study
The quadrant view makes the geometry concrete: ADG favors the high-D(x), high-I(x) region where disagreement is both strong and structured, rather than mere stylistic drift or stable single-mode outputs.
Selection-budget scaling
Selection efficiency
End-to-end selection on Alpaca-GPT4 (52,002 examples), using 4 GPUs.
| Selector | Time (h) ↓ | Ex/s ↑ |
|---|---|---|
| SuperFiltering | 1.05 | 13.76 |
| ADG | 1.78 | 8.10 |
| ZIP | 3.17 | 4.56 |
| MIG | 3.92 | 3.69 |
Cross-backbone generalization
| Fine-tune backbone | Selection backbone | Reason. | Know. | Code. | Avg. |
|---|---|---|---|---|---|
| Qwen2.5 | Qwen2.5 | 75.43 | 67.36 | 64.40 | 69.06 |
| Qwen2.5 | LLaMA3 | 76.09 | 67.11 | 62.69 | 68.63 |
| LLaMA3 | LLaMA3 | 50.11 | 60.31 | 42.14 | 50.85 |
| LLaMA3 | Qwen2.5 | 49.84 | 59.61 | 41.11 | 50.19 |
Robustness and ablations
Resources
Reproducibility at a glance
Core selection code
ADG/ADG_llama.py and ADG/ADG_qwen.py implement ADG scoring and subset selection for the LLaMA and Qwen backbones.
Generation & embedding
generation/generation.py produces multiple sampled answers, while generation/embedding/embed.py builds instruction embeddings and clustering labels.
Training & evaluation
train/train_llama.sh, train/train_qwen.sh, and eval/eval.sh cover fine-tuning and benchmark evaluation.
Analysis
analysis/analyse.py supports optional task-type analysis for inspecting the selected data.
Environment
requirements.txt lists the Python dependencies, and the README recommends Python 3.10 or above.
Data format
The repo expects JSON or JSONL instruction data with fields such as id, instruction, input, and output.
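A minimal record in the expected shape might look like the following (all field values are illustrative):

```json
{
  "id": "alpaca_00001",
  "instruction": "Summarize the following paragraph in one sentence.",
  "input": "The committee met twice this month to review the proposal and voted to approve it.",
  "output": "The committee approved the proposal after two meetings this month."
}
```

For JSONL, each such object occupies one line of the file.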
Pipeline
The released repository already exposes a practical end-to-end workflow from instruction pool preparation to selection, fine-tuning, and evaluation.
instruction pool → generation/generation.py → multi-sample answer JSONL → generation/embedding/embed.py → instruction embeddings + cluster labels → ADG/ADG_llama.py or ADG/ADG_qwen.py → top / middle / bottom selected subsets → train/train_*.sh → finetuned checkpoints → eval/eval.sh
BibTeX
If you find this work useful, please cite the paper below.
@misc{li2026instructiondataselectionanswer,
  title={Instruction Data Selection via Answer Divergence},
  author={Bo Li and Mingda Wang and Shikun Zhang and Wei Ye},
  year={2026},
  eprint={2604.10448},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.10448},
}