ACL 2026 Main Conference

Instruction Data Selection via Answer Divergence

Bo Li1, Mingda Wang2, Shikun Zhang1, Wei Ye1
1National Engineering Research Center for Software Engineering, Peking University, 2Hebei University of Technology
ADG measures the geometry of multi-sample answers instead of scoring each instruction against a single reference response. It combines dispersion magnitude with shape anisotropy, then selects top examples within semantic bins under a fixed 10K budget to preserve both quality and coverage.

Why ADG

Many instructions admit several valid answers, so single-reference scoring can be structurally misleading. ADG turns this ambiguity into a useful signal by asking how sampled answers distribute in representation space: are they tightly clustered, stretched along one direction, or scattered in a high-signal multi-directional regime? This shift lets ADG prioritize examples that are more likely to reveal genuine model uncertainty rather than superficial stylistic variation.

1. Single-reference scoring is limited

A high loss may reflect harmless differences in style, format, or reasoning path rather than a real competence gap.

2. Multi-sample behavior is more informative

Several sampled answers reveal whether the model exhibits structured disagreement or only trivial paraphrastic variation.

3. Coverage is preserved explicitly

ADG combines geometry-aware scoring with proportional selection inside semantic bins so the final subset stays broad and useful.

Method

The method is shown first as the complete pipeline diagram, then decomposed into three stages, so readers can move from the global workflow to the core mechanism: the two scoring components, D(x) and I(x).

ADG method figure
Stage 1: Sample multiple answers

  • Draw K stochastic answers for each instruction using high-temperature decoding.
  • Treat the answer set as a structured object rather than relying on one teacher answer.
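The sampling step above can be sketched as follows. This is a minimal illustration, not the repo's `generation/generation.py`: the `sampler` callable stands in for high-temperature decoding from the backbone LLM, and `toy_sampler` is a hypothetical stand-in used only to make the sketch runnable.

```python
import random

def sample_answers(instruction, sampler, k=8, temperature=1.2, seed=0):
    """Draw k stochastic answers for one instruction.

    `sampler` is any callable (instruction, temperature, rng) -> str;
    in ADG's setting it would wrap high-temperature decoding from the
    backbone model (hypothetical interface, for illustration only).
    """
    rng = random.Random(seed)
    # Keep all k samples together as one structured object instead of
    # collapsing them to a single reference answer.
    return [sampler(instruction, temperature, rng) for _ in range(k)]

def toy_sampler(instruction, temperature, rng):
    # Toy stand-in for an LLM: picks among canned phrasings.
    variants = ["Paris.", "The capital is Paris.", "It is Paris, France."]
    return rng.choice(variants)

answers = sample_answers("What is the capital of France?", toy_sampler, k=5)
```

Any decoding backend can be dropped in behind `sampler`; only the K-answer set matters downstream.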
Stage 2: Score the geometry

  • Pool output-token hidden states into answer embeddings.
  • D(x) quantifies spread around the mean.
  • I(x) quantifies whether the spread is multi-directional.
s(x) = (1 − λ) · D(x) + λ · I(x)
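One plausible instantiation of the score is sketched below. The source gives only the combination s(x) = (1 − λ)·D(x) + λ·I(x); the concrete choices here (D(x) as mean distance to the centroid, I(x) as one minus the variance share of the top principal direction, so multi-directional spread scores higher) are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

def geometry_score(emb, lam=0.5):
    """Score one instruction from its K answer embeddings (K x d).

    D(x): dispersion, mean distance of answers to their centroid.
    I(x): anisotropy proxy, 1 minus the share of variance captured by
    the top principal direction (illustrative stand-in definition).
    """
    emb = np.asarray(emb, dtype=float)
    center = emb.mean(axis=0)
    d = np.linalg.norm(emb - center, axis=1).mean()            # D(x)
    cov = np.atleast_2d(np.cov((emb - center).T))
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    i = 1.0 - eig.max() / eig.sum() if eig.sum() > 0 else 0.0  # I(x)
    return (1 - lam) * d + lam * i                             # s(x)
```

Collinear answer embeddings get I(x) = 0 (all variance on one axis), while an isotropic cluster in 2D gets I(x) = 0.5, matching the intuition that structured multi-directional disagreement should score higher than stretch along a single direction.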
Stage 3: Select with semantic coverage

  • Cluster instruction embeddings into semantic bins.
  • Assign a proportional quota to each bin.
  • Select top-ranked instructions within each region.
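The quota-based selection rule can be sketched as below, assuming cluster labels have already been produced (e.g. by k-means over instruction embeddings). This is an illustration of the proportional-quota idea, not the repo's selection code.

```python
from collections import defaultdict

def select_with_coverage(scores, labels, budget):
    """Pick `budget` indices: each semantic bin gets a quota proportional
    to its size, and its top-scoring items fill that quota.

    scores[i] is s(x_i); labels[i] is the cluster id of instruction i.
    """
    bins = defaultdict(list)
    for i, c in enumerate(labels):
        bins[c].append(i)
    n = len(scores)
    chosen = []
    for idx in bins.values():
        quota = max(1, round(budget * len(idx) / n))  # proportional quota
        idx.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(idx[:quota])
    # Rounding can overshoot the budget; trim globally by score.
    chosen.sort(key=lambda i: scores[i], reverse=True)
    return sorted(chosen[:budget])
```

Because quotas track bin sizes, a high-scoring but narrow cluster cannot crowd out the rest of the pool, which is what keeps the final 10K subset broad.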

Main Results

Across two backbones and three public instruction pools, ADG achieves the best Avg.Score in every setting under the same 10K budget.

Analysis

The analyses below examine task-type composition, a quadrant-based case study of the scoring geometry, selection-budget scaling, selection efficiency, and cross-backbone generalization.

Task-type composition shift

ADG increases the proportion of Discrete Decision and Math while reducing Writing and Knowledge-heavy prompts, suggesting that the selected data better emphasizes boundary-revealing and verifiable tasks.

Quadrant-based case study

The quadrant view makes the geometry concrete: ADG favors the high-D(x), high-I(x) region where disagreement is both strong and structured, rather than mere stylistic drift or stable single-mode outputs.

Selection-budget scaling

On Alpaca-GPT4 with LLaMA3-8B, ADG stays ahead of Random and SuperFiltering from 3K to 25K, while improvements gradually saturate at larger budgets.

Selection efficiency

End-to-end selection on Alpaca-GPT4 (52,002 examples), using 4 GPUs.

Selector        Time (h) ↓   Ex/s ↑
SuperFiltering  1.05         13.76
ADG             1.78          8.10
ZIP             3.17          4.56
MIG             3.92          3.69
ADG is slower than the lightest selectors because it performs multi-sample decoding and hidden-state extraction, but remains practical as an offline preprocessing step.

Cross-backbone generalization

Fine-tune  Selection  Reason.  Know.  Code.  Avg.
Qwen2.5    Qwen2.5    75.43    67.36  64.40  69.06
Qwen2.5    LLaMA3     76.09    67.11  62.69  68.63
LLaMA3     LLaMA3     50.11    60.31  42.14  50.85
LLaMA3     Qwen2.5    49.84    59.61  41.11  50.19
The ranking signal transfers with only a small drop, suggesting ADG captures more than a backbone-specific heuristic.

Robustness and ablations

Resources

Reproducibility at a glance

Core selection code

ADG/ADG_llama.py and ADG/ADG_qwen.py implement ADG scoring and subset selection for the LLaMA and Qwen backbones.

Generation & embedding

generation/generation.py produces multiple sampled answers, while generation/embedding/embed.py builds instruction embeddings and clustering labels.

Training & evaluation

train/train_llama.sh, train/train_qwen.sh, and eval/eval.sh cover fine-tuning and benchmark evaluation.

Analysis

analysis/analyse.py supports optional task-type analysis for inspecting the selected data.

Environment

requirements.txt lists the Python dependencies, and the README recommends Python 3.10 or above.

Data format

The repo expects JSON or JSONL instruction data with fields such as id, instruction, input, and output.
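A record in this schema can be built and round-tripped as one JSONL line, as in the minimal sketch below; the field values are illustrative, not taken from any released dataset.

```python
import json

# One instruction record in the expected schema (illustrative values).
record = {
    "id": "example_00001",
    "instruction": "Summarize the paragraph below in one sentence.",
    "input": "ADG scores instructions by the geometry of sampled answers.",
    "output": "ADG ranks instructions using multi-sample answer geometry.",
}

line = json.dumps(record, ensure_ascii=False)  # one JSONL line
parsed = json.loads(line)                      # round-trips losslessly
```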

Pipeline

The released repository already exposes a practical end-to-end workflow from instruction pool preparation to selection, fine-tuning, and evaluation.

instruction pool
  → generation/generation.py
  → multi-sample answer JSONL
  → generation/embedding/embed.py
  → instruction embeddings + cluster labels
  → ADG/ADG_llama.py or ADG/ADG_qwen.py
  → top / middle / bottom selected subsets
  → train/train_*.sh
  → finetuned checkpoints
  → eval/eval.sh
Citation

BibTeX

If you find this work useful, please cite the paper below.

@misc{li2026instructiondataselectionanswer,
      title={Instruction Data Selection via Answer Divergence}, 
      author={Bo Li and Mingda Wang and Shikun Zhang and Wei Ye},
      year={2026},
      eprint={2604.10448},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.10448}, 
}