Findings of ACL 2026

Data Selection for Multi-turn Dialogue Instruction Tuning

Bo Li¹ · Shikun Zhang¹ · Wei Ye¹
1National Engineering Research Center for Software Engineering, Peking University

MDS selects training-effective multi-turn dialogues by combining global semantic coverage with local structural quality, building compact supervision sets that are more diverse, coherent, and reliable for dialogue instruction tuning.

Comparison between single-turn and multi-turn data quality
A turn-level comparison already reveals the challenge: multi-turn dialogue data are noisier and more variable than standard single-turn instruction data, which motivates dialogue-level selection rather than isolated turn scoring.
Overview

Why multi-turn data selection needs a different design

Many selectors score isolated instruction-response pairs. MDS instead treats each dialogue as a coherent trajectory and selects full conversations that are both semantically representative and structurally reliable.

01 · Multi-turn corpora are structurally noisier

Later turns may drift away from the original intent, end in chitchat tails, or violate the requested response format. These errors accumulate across the dialogue and are hard to detect at the single-turn level.

02 · Turn-level scoring misses trajectory structure

Applying single-turn selectors to dialogue turns ignores cross-turn dependencies such as history anchoring, topic continuity, and query-answer form alignment.

03 · MDS selects dialogues, not disconnected turns

By combining semantic coverage with dialogue-level quality, MDS builds a compact subset that suppresses noisy conversations while preserving long-tail intents and better-formed supervision.

Method

A two-stage selector for semantic diversity and dialogue structure

MDS first preserves broad intent coverage in a dialogue-trajectory space, then reranks candidate dialogues using local structural signals that capture history grounding, information progress, and form consistency.

Stage 1 · Global selection

Trajectory-aware semantic coverage

User queries are embedded into a dialogue-level trajectory representation. MDS clusters this space into semantic bins and performs coverage-aware bin-wise selection to avoid collapsing onto a few frequent patterns.

  • User-query trajectory embeddings
  • Semantic bins via K-means
  • Coverage and anti-redundancy within each bin
Long-tail intent coverage · Dialogue-level selection
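The bin-wise step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes trajectory embeddings and K-means bin assignments are already computed, and the function name, round-robin visiting order, and the 0.9 similarity threshold are all illustrative choices.

```python
import math
from collections import defaultdict

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def binwise_select(embeddings, bin_ids, budget, sim_threshold=0.9):
    """Round-robin over semantic bins for coverage; within a bin, skip
    candidates that are near-duplicates of already selected dialogues
    (anti-redundancy). Returns indices of selected dialogues."""
    bins = defaultdict(list)
    for idx, b in enumerate(bin_ids):
        bins[b].append(idx)
    selected = []
    bin_order = sorted(bins)
    pos = {b: 0 for b in bin_order}          # next unvisited candidate per bin
    while len(selected) < budget:
        progressed = False
        for b in bin_order:
            if len(selected) >= budget:
                break
            cands = bins[b]
            while pos[b] < len(cands):
                cand = cands[pos[b]]
                pos[b] += 1
                # Keep the candidate only if it is not redundant with the set.
                if all(cosine(embeddings[cand], embeddings[s]) < sim_threshold
                       for s in selected):
                    selected.append(cand)
                    progressed = True
                    break
        if not progressed:                   # every bin exhausted
            break
    return selected
```

Visiting bins in rotation is what keeps a few dense clusters from monopolizing the budget; the inner redundancy check keeps each bin's picks spread out.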
Stage 2 · Local selection

Structural reliability inside each dialogue

MDS scores candidate dialogues with entity-grounded topic coherence and information-progress signals, together with a hard form-consistency filter that removes badly aligned query-answer pairs.

  • Entity coherence and anti-redundancy
  • History anchoring and information progress
  • Query-answer form consistency
Topic grounding · Form reliability
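A toy sketch of these local signals, under loose assumptions: capitalized tokens stand in for a real entity extractor, and the form rule shown (a "list" query should be answered with list markers) is just one illustrative instance of a hard filter. Every name and rule here is hypothetical, not the paper's scoring function.

```python
import re

def entities(text):
    # Toy entity extractor: capitalized words stand in for a real NER model.
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def structural_scores(dialogue):
    """dialogue: list of (query, answer) turns.
    anchoring: fraction of later turns that reuse an entity from history.
    progress:  fraction of turns that introduce at least one new entity."""
    history, anchored, progressed = set(), 0, 0
    for i, (q, a) in enumerate(dialogue):
        ents = entities(q) | entities(a)
        if i > 0 and ents & history:     # grounded in earlier turns
            anchored += 1
        if ents - history:               # adds new information
            progressed += 1
        history |= ents
    n = len(dialogue)
    return anchored / max(n - 1, 1), progressed / n

def form_consistent(query, answer):
    # Hard filter example: a query asking for a list should get list markers.
    if re.search(r"\blist\b", query, re.I):
        return bool(re.search(r"(^|\n)\s*(?:[-*]|\d+\.)\s", answer))
    return True
```

A dialogue whose later turns never touch earlier entities scores low on anchoring (topic drift), while one that keeps repeating the same entities scores low on progress (redundancy); the form filter is binary and removes the pair outright.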
Global stage: Prevents a few dense interaction patterns from dominating the selected set.
Local stage: Filters out structurally weak dialogues that look acceptable at the turn level but break across turns.
Final subset: A compact training pool that is simultaneously diverse, coherent, and better aligned with multi-turn learning.
Results

Consistent gains across backbones, benchmarks, and domains

On the general-domain Baize pool, MDS achieves the best average rank on both LLaMA3-8B-Instruct and Qwen3-8B-Instruct. It also improves domain-specific training on Banking while retaining cross-domain transfer.

Table: per-method results on MT-Eval, ConsistentChat, and TopDial (columns L-E, G-E, Ent-F1, and Cos for each benchmark), plus an Avg. Rank ↓ column summarizing overall performance.
L-E and G-E denote LLM-EVAL and G-EVAL; Ent-F1 is entity-level F1; Cos is embedding cosine similarity. Lower average rank is better.
Domain-specific setting

MDS also improves Banking selection without sacrificing transfer

Table: per-method results on the Banking Test set and ConsistentChat (columns G-E and Ent-F1 for each).
When all methods select 10K dialogues from the Banking corpus, MDS attains the strongest in-domain G-E and also delivers the best out-of-domain results on ConsistentChat.
Analysis

Where the gains come from

Beyond main results, MDS shows stronger long-dialogue robustness, preserves order-sensitive cross-turn structure, and shifts the selected training pool toward cleaner, more grounded conversations.

Length robustness analysis

Long-dialogue robustness

As dialogues extend to deeper turns, MDS preserves stronger entity coverage and semantic fidelity than competing selectors.

Error type delta analysis

Error-type difference sets

MDS-only dialogues contain more clean examples and substantially fewer topic-drift and unsupported errors.

Controlled perturbation analysis

The paper also includes an order-perturbation study on the same 10K dialogues selected by MDS. Shuffling turns degrades order-sensitive consistency, especially on the high-history-dependency subset, supporting the design choice of explicitly modeling cross-turn anchoring and anti-redundancy.
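The perturbation itself is simple to reproduce in spirit; a minimal sketch, assuming turn order is the only thing being disturbed (function name and seeding are illustrative):

```python
import random

def shuffle_turns(dialogue, seed=0):
    """Order perturbation: keep the opening turn, shuffle the rest.
    Used to probe whether a selector or metric is sensitive to
    cross-turn order rather than to turn content alone."""
    head, tail = dialogue[:1], dialogue[1:]
    rng = random.Random(seed)        # seeded for a reproducible perturbation
    rng.shuffle(tail)
    return head + tail
```

If a score barely moves under this shuffle, it is not measuring cross-turn structure; the reported degradation on the high-history-dependency subset is what supports the anchoring and anti-redundancy terms.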

High-history subset
Resources

Reproducibility at a glance

The project centers on offline dialogue selection, compact subset construction, and multi-turn evaluation under a fixed dialogue budget.

Training pools
Baize for general-domain dialogue selection and Banking for domain-specific customer-service dialogue selection.
Evaluation
MT-Eval, ConsistentChat, TopDial, and a held-out Banking Test set, covering open-ended, consistency-sensitive, and task-oriented settings.
Backbones
LLaMA3-8B-Instruct and Qwen3-8B-Instruct, both fine-tuned with LoRA under the same training budget and protocol.
Core message
Selection quality matters more than simply training on the full noisy multi-turn pool when dialogue structure and intent coverage must be preserved together.

Useful links

Paper PDF
The full ACL 2026 paper.
arXiv
Abstract page and citation metadata.
GitHub repository
Code, training scripts, and project updates.

Deployment note

This page is provided as a single self-contained index.html, which makes local preview and GitHub Pages deployment straightforward.

Citation

BibTeX

If you find this work useful, please cite the paper below.

@misc{li2026dataselectionmultiturndialogue,
  title        = {Data Selection for Multi-turn Dialogue Instruction Tuning},
  author       = {Bo Li and Shikun Zhang and Wei Ye},
  year         = {2026},
  eprint       = {2604.07892},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2604.07892}
}