Findings of ACL 2026

Data Selection for Multi-turn Dialogue Instruction Tuning

Bo Li¹ · Shikun Zhang¹ · Wei Ye¹
1National Engineering Research Center for Software Engineering, Peking University

MDS selects training-effective multi-turn dialogues by combining global semantic coverage with local structural quality, building compact supervision sets that are more diverse, coherent, and reliable for dialogue instruction tuning.

Comparison between single-turn and multi-turn data quality
A turn-level comparison already reveals the challenge: multi-turn dialogue data are noisier and more variable than standard single-turn instruction data, which motivates dialogue-level selection rather than isolated turn scoring.
Overview

Why multi-turn data selection needs a different design

Many selectors score isolated instruction-response pairs. MDS instead treats each dialogue as a coherent trajectory and selects full conversations that are both semantically representative and structurally reliable.

01 · Multi-turn corpora are structurally noisier

Later turns may drift away from the original intent, end in chitchat tails, or violate the requested response format. These errors accumulate across the dialogue and are hard to detect at the single-turn level.

02 · Turn-level scoring misses trajectory structure

Applying single-turn selectors to dialogue turns ignores cross-turn dependencies such as history anchoring, topic continuity, and query-answer form alignment.

03 · MDS selects dialogues, not disconnected turns

By combining semantic coverage with dialogue-level quality, MDS builds a compact subset that suppresses noisy conversations while preserving long-tail intents and better-formed supervision.

Method

A two-stage selector for semantic diversity and dialogue structure

MDS first preserves broad intent coverage in a dialogue-trajectory space, then reranks candidate dialogues using local structural signals that capture history grounding, information progress, and form consistency.

Stage 1 · Global selection

Trajectory-aware semantic coverage

User queries are embedded into a dialogue-level trajectory representation. MDS clusters this space into semantic bins and performs coverage-aware bin-wise selection to avoid collapsing onto a few frequent patterns.

  • User-query trajectory embeddings
  • Semantic bins via K-means
  • Coverage and anti-redundancy within each bin
Long-tail intent coverage · Dialogue-level selection
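The bin-wise step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes trajectory embeddings and K-means bin assignments are already computed, and the function name, round-robin visiting order, and the 0.9 similarity threshold are all illustrative choices.

```python
import math
from collections import defaultdict

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def binwise_select(embeddings, bin_ids, budget, sim_threshold=0.9):
    """Round-robin over semantic bins for coverage; within a bin, skip
    candidates that are near-duplicates of already selected dialogues
    (anti-redundancy). Returns indices of selected dialogues."""
    bins = defaultdict(list)
    for idx, b in enumerate(bin_ids):
        bins[b].append(idx)
    selected = []
    bin_order = sorted(bins)
    pos = {b: 0 for b in bin_order}          # next unvisited candidate per bin
    while len(selected) < budget:
        progressed = False
        for b in bin_order:
            if len(selected) >= budget:
                break
            cands = bins[b]
            while pos[b] < len(cands):
                cand = cands[pos[b]]
                pos[b] += 1
                # Keep the candidate only if it is not redundant with the set.
                if all(cosine(embeddings[cand], embeddings[s]) < sim_threshold
                       for s in selected):
                    selected.append(cand)
                    progressed = True
                    break
        if not progressed:                   # every bin exhausted
            break
    return selected
```

Visiting bins in rotation is what keeps a few dense clusters from monopolizing the budget; the inner redundancy check keeps each bin's picks spread out.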
Stage 2 · Local selection

Structural reliability inside each dialogue

MDS scores candidate dialogues with entity-grounded topic coherence and information-progress signals, together with a hard form-consistency filter that removes badly aligned query-answer pairs.

  • Entity coherence and anti-redundancy
  • History anchoring and information progress
  • Query-answer form consistency
Topic grounding · Form reliability
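A toy sketch of these local signals, under loose assumptions: capitalized tokens stand in for a real entity extractor, and the form rule shown (a "list" query should be answered with list markers) is just one illustrative instance of a hard filter. Every name and rule here is hypothetical, not the paper's scoring function.

```python
import re

def entities(text):
    # Toy entity extractor: capitalized words stand in for a real NER model.
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def structural_scores(dialogue):
    """dialogue: list of (query, answer) turns.
    anchoring: fraction of later turns that reuse an entity from history.
    progress:  fraction of turns that introduce at least one new entity."""
    history, anchored, progressed = set(), 0, 0
    for i, (q, a) in enumerate(dialogue):
        ents = entities(q) | entities(a)
        if i > 0 and ents & history:     # grounded in earlier turns
            anchored += 1
        if ents - history:               # adds new information
            progressed += 1
        history |= ents
    n = len(dialogue)
    return anchored / max(n - 1, 1), progressed / n

def form_consistent(query, answer):
    # Hard filter example: a query asking for a list should get list markers.
    if re.search(r"\blist\b", query, re.I):
        return bool(re.search(r"(^|\n)\s*(?:[-*]|\d+\.)\s", answer))
    return True
```

A dialogue whose later turns never touch earlier entities scores low on anchoring (topic drift), while one that keeps repeating the same entities scores low on progress (redundancy); the form filter is binary and removes the pair outright.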
Global stage: Prevents a few dense interaction patterns from dominating the selected set.
Local stage: Filters out structurally weak dialogues that look acceptable at the turn level but break across turns.
Final subset: A compact training pool that is simultaneously diverse, coherent, and better aligned with multi-turn learning.
Results

Consistent gains across backbones, benchmarks, and domains

On the general-domain Baize pool, MDS achieves the best average rank on both LLaMA3-8B-Instruct and Qwen3-8B-Instruct. It also improves domain-specific training on Banking while retaining cross-domain transfer.

Table: per-method results on MT-Eval, ConsistentChat, and TopDial (columns L-E, G-E, Ent-F1, and Cos for each benchmark), plus an Avg. Rank ↓ column summarizing overall performance.
L-E and G-E denote LLM-EVAL and G-EVAL; Ent-F1 is entity-level F1; Cos is embedding cosine similarity. Lower average rank is better.
Domain-specific setting

MDS also improves Banking selection without sacrificing transfer

Table: per-method results on the Banking Test set and ConsistentChat (columns G-E and Ent-F1 for each).
When all methods select 10K dialogues from the Banking corpus, MDS attains the strongest in-domain G-E and also delivers the best out-of-domain results on ConsistentChat.
Analysis

Where the gains come from

Beyond main results, MDS shows stronger long-dialogue robustness, preserves order-sensitive cross-turn structure, and shifts the selected training pool toward cleaner, more grounded conversations.

Length robustness analysis

Long-dialogue robustness

As dialogues extend to deeper turns, MDS preserves stronger entity coverage and semantic fidelity than competing selectors.

Error type delta analysis

Error-type difference sets

MDS-only dialogues contain more clean examples and substantially fewer topic-drift and unsupported errors.

Controlled perturbation analysis

The paper also includes an order-perturbation study on the same 10K dialogues selected by MDS. Shuffling turns degrades order-sensitive consistency, especially on the high-history-dependency subset, supporting the design choice of explicitly modeling cross-turn anchoring and anti-redundancy.
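The perturbation itself is simple to reproduce in spirit; a minimal sketch, assuming turn order is the only thing being disturbed (function name and seeding are illustrative):

```python
import random

def shuffle_turns(dialogue, seed=0):
    """Order perturbation: keep the opening turn, shuffle the rest.
    Used to probe whether a selector or metric is sensitive to
    cross-turn order rather than to turn content alone."""
    head, tail = dialogue[:1], dialogue[1:]
    rng = random.Random(seed)        # seeded for a reproducible perturbation
    rng.shuffle(tail)
    return head + tail
```

If a score barely moves under this shuffle, it is not measuring cross-turn structure; the reported degradation on the high-history-dependency subset is what supports the anchoring and anti-redundancy terms.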

High-history subset
Resources

Reproducibility at a glance

The project centers on offline dialogue selection, compact subset construction, and multi-turn evaluation under a fixed dialogue budget.

Training pools
Baize for general-domain dialogue selection and Banking for domain-specific customer-service dialogue selection.
Evaluation
MT-Eval, ConsistentChat, TopDial, and a held-out Banking Test set, covering open-ended, consistency-sensitive, and task-oriented settings.
Backbones
LLaMA3-8B-Instruct and Qwen3-8B-Instruct, both fine-tuned with LoRA under the same training budget and protocol.
Core message
Selection quality matters more than simply training on the full noisy multi-turn pool when dialogue structure and intent coverage must be preserved together.

Useful links

Paper PDF
The full ACL 2026 paper.
arXiv
Abstract page and citation metadata.
GitHub repository
Code, training scripts, and project updates.

Deployment note

This page is provided as a single self-contained index.html, which makes local preview and GitHub Pages deployment straightforward.

Citation

BibTeX

If you find this work useful, please cite the paper below.

@misc{li2026dataselectionmultiturndialogue,
  title        = {Data Selection for Multi-turn Dialogue Instruction Tuning},
  author       = {Bo Li and Shikun Zhang and Wei Ye},
  year         = {2026},
  eprint       = {2604.07892},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2604.07892}
}