Differentiable Articulatory Copy-Synthesis
of Biphonic Singing

Mateo Cámara, María Pilar Daza-Llin, Fernando Marcos-Macías, José Luis Blanco

Companion page for DAFx 2026 — Audio examples, supplementary results, and MRI analysis

Abstract

Sygyt is a Tuvan style of biphonic singing in which a low vocal drone is sustained while a high harmonic is selectively amplified in the 1–3 kHz region. Copy-synthesizing this effect remains challenging for articulatory models, since it requires fine control of narrowly focused resonances that standard low-dimensional tract parameterizations cannot easily reproduce.

We address this problem with a differentiable Kelly–Lochbaum waveguide augmented with a sublingual second source, cubic B-spline tract parameterization, and spatially varying learnable damping, optimized end-to-end by gradient descent from audio. On 20 segments from two independent sygyt datasets (5 singers, 10 pitches), the proposed model reduces log-spectral distance by 30–38% relative to an articulatory baseline, with the largest gains concentrated in the overtone region. Cepstral-envelope analysis further shows more accurate recovery of the merged formant structure characteristic of sygyt production. The model also outperforms a DDSP harmonic-plus-noise baseline with direct per-harmonic spectral control, suggesting that explicit acoustic structure is a useful inductive bias for overtone-singing copy-synthesis.

We compare three synthesis approaches across 20 segments from two datasets:

Articulator chain — fixed glottis + waveguide (~19 DOFs/frame)
DDSP baseline — harmonic-plus-noise neural synthesizer (~100k parameters)
B-spline (ours) — differentiable waveguide with B-spline tract (86 DOFs/frame, 1100× fewer parameters than DDSP)
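As a rough illustration of what a cubic B-spline tract parameterization involves, the sketch below expands a small set of control points into a smooth 44-section diameter profile using uniform cubic B-spline blending weights. The function names, the eight control points, and the uniform knot spacing are assumptions made for illustration. This page does not specify the knot layout, and the authors' implementation is in JAX rather than plain Python.

```python
def cubic_bspline_basis(t):
    """Return the four uniform cubic B-spline blending weights for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return (
        (1 - 3 * t + 3 * t2 - t3) / 6.0,
        (4 - 6 * t2 + 3 * t3) / 6.0,
        (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0,
        t3 / 6.0,
    )


def tract_profile(control_points, n_sections=44):
    """Evaluate a uniform cubic B-spline at n_sections evenly spaced positions."""
    n_spans = len(control_points) - 3  # each span blends 4 consecutive control points
    if n_spans < 1:
        raise ValueError("need at least 4 control points")
    profile = []
    for i in range(n_sections):
        u = i / (n_sections - 1) * n_spans  # position measured in spans
        span = min(int(u), n_spans - 1)     # clamp so the last sample stays in range
        weights = cubic_bspline_basis(u - span)
        profile.append(
            sum(w * control_points[span + k] for k, w in enumerate(weights))
        )
    return profile


# Hypothetical control points (diameters in cm), chosen only for illustration.
control_points = [0.5, 1.0, 2.0, 1.0, 0.3, 1.2, 1.8, 2.2]
profile = tract_profile(control_points)
print(len(profile), min(profile), max(profile))
```

Because the blending weights are non-negative and sum to one, the profile always stays within the range of the control points, which keeps a low-dimensional parameter set from producing wild section-to-section jumps.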

What Does Sygyt Sound Like?

Video recording of a Tuvan singer demonstrating sygyt technique, with synchronized spectrogram overlay showing the characteristic amplified harmonic in the 1–3 kHz region.

Key Results

9.34

Overall LSD (dB) — B-spline

0.868

Spectral Correlation

30–38%

LSD improvement over Articulator

1100×

Fewer params than DDSP

86

DOFs per frame

20

Listeners (subjective test)

Main Results (Table III)

Copy-synthesis results on 20 segments of Tuvan sygyt from two independent datasets. Objective metrics are reported as mean ± std: PESQ (wideband MOS-LQO), CDPAM (learned perceptual distance, lower = closer), ViSQOL (audio-mode MOS-LQO, 1–5). Q1 (overall quality) and Q2 (harmonic similarity) are subjective scores (mean ± 95% CI, 0–100, 20 listeners × 3 segments/dataset).

Dataset / Method	DOFs	LSD (dB) ↓	SpCorr ↑	PESQ ↑	CDPAM ↓	ViSQOL ↑	Q1 ↑	Q2 ↑
HFA — 1 singer, 10 pitches (F3–E♭4)
Articulator chain	19	13.84 ± 0.54	0.708 ± 0.040	1.17 ± 0.15	3.06 ± 3.34	2.81 ± 0.53	13.6 ± 3.7	25.3 ± 4.5
DDSP	~100k	10.99 ± 0.54	0.819 ± 0.014	1.20 ± 0.13	3.56 ± 3.71	3.81 ± 0.21	42.4 ± 5.0	50.3 ± 4.9
B-spline (ours)	86	9.64 ± 0.29	0.857 ± 0.019	1.37 ± 0.34	2.36 ± 3.07	3.60 ± 0.22	44.8 ± 5.5	52.9 ± 5.8
Bergevin — 4 singers (T1–T4)
Articulator chain	19	14.53 ± 0.93	0.658 ± 0.046	1.10 ± 0.05	0.73 ± 0.41	2.88 ± 0.21	11.6 ± 3.6	15.7 ± 3.9
DDSP	~100k	10.71 ± 0.72	0.825 ± 0.030	1.26 ± 0.13	0.77 ± 0.55	3.69 ± 0.33	35.4 ± 5.5	45.1 ± 5.4
B-spline (ours)	86	9.04 ± 0.46	0.879 ± 0.008	1.58 ± 0.35	0.60 ± 0.34	3.85 ± 0.23	38.5 ± 5.3	46.3 ± 5.6
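The 30–38% figure quoted for the LSD improvement can be checked directly from the mean LSD values in the table above. The short sketch below does that arithmetic; the dictionary layout and function name are illustrative only.

```python
# Mean LSD (dB) copied from the results table above.
table_iii_lsd = {
    "HFA": {"articulator": 13.84, "bspline": 9.64},
    "Bergevin": {"articulator": 14.53, "bspline": 9.04},
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    name: relative_reduction(v["articulator"], v["bspline"])
    for name, v in table_iii_lsd.items()
}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower LSD than the articulator chain")
# prints 30.3% for HFA and 37.8% for Bergevin
```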

Audio Examples — HFA Dataset

10 overtone-singing segments from the HFA dataset, spanning pitches F3–E♭4. Each segment has audio for Target, Articulator, DDSP, and B-spline; per-segment LSD (dB) and SpCorr for the three synthesis methods are listed below.

Segment	Articulator LSD	Articulator SpCorr	DDSP LSD	DDSP SpCorr	B-spline LSD	B-spline SpCorr
F3 — Scale (h8–h10)	14.0	0.739	12.3	0.814	9.86	0.878
G3 — Interval (h6→h7)	13.8	0.748	11.2	0.834	9.17	0.890
A3 — Arpeggio (h4–h6)	14.1	0.719	11.1	0.821	9.11	0.877
A3 — Interval (h5→h6)	14.0	0.710	10.7	0.833	9.78	0.858
A3 — Scale (h6–h8)	12.9	0.756	10.8	0.838	9.83	0.863
B♭3 — Interval (h6→h7)	13.2	0.739	11.3	0.819	10.0	0.855
B3 — Interval (h4→h5)	14.3	0.693	10.9	0.814	9.52	0.853
C4 — Arpeggio (h3–h5)	14.5	0.651	10.7	0.799	9.90	0.827
D4 — Glissando (h3→h8)	13.2	0.694	10.1	0.825	9.69	0.837
E♭4 — Arpeggio (h3–h5)	14.5	0.629	10.7	0.795	9.53	0.836

Audio Examples — Bergevin Dataset

10 ethnographic Khoomei recordings from Bergevin et al., organized by singer (T1–T4).

Singer	Segment	Articulator LSD	Articulator SpCorr	DDSP LSD	DDSP SpCorr	B-spline LSD	B-spline SpCorr
T1	Take 3 Short	13.5	0.696	12.2	0.755	9.24	0.866
T3	Take 2 Short A	13.0	0.740	11.4	0.803	8.47	0.893
T3	Take 2 Short B	14.3	0.689	11.4	0.809	8.84	0.890
T4	Take 1 Short A	14.1	0.654	10.0	0.833	8.20	0.875

Cross-Pitch Generalization

LSD (dB) across 10 HFA segments spanning pitches F3–E♭4. B-spline achieves the lowest LSD on every segment.

Pitch	Pattern	Articulator	DDSP	B-spline
F3	Scale (h8–h10)	14.0	12.3	9.9
G3	Interval (h6→h7)	13.8	11.2	9.2
A3	Arpeggio (h4–h6)	14.1	11.1	9.1
A3	Interval (h5→h6)	14.0	10.7	9.8
A3	Scale (h6–h8)	12.9	10.8	9.8
B♭3	Interval (h6→h7)	13.2	11.3	10.0
B3	Interval (h4→h5)	14.3	10.9	9.5
C4	Arpeggio (h3–h5)	14.5	10.7	9.9
D4	Glissando (h3→h8)	13.2	10.1	9.7
E♭4	Arpeggio (h3–h5)	14.5	10.7	9.5
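This page does not spell out how LSD is computed. For readers unfamiliar with the metric, the sketch below shows one common definition: the per-frame root-mean-square difference between two magnitude spectra in dB, averaged over frames. The function name, the `eps` floor, and the choice of magnitude (rather than power) spectra are assumptions, so values from this sketch are not guaranteed to match the numbers reported here.

```python
import math


def log_spectral_distance(ref_mags, syn_mags, eps=1e-8):
    """Mean over frames of the RMS dB difference between two magnitude spectra.

    ref_mags and syn_mags are lists of frames; each frame is a list of
    non-negative magnitude bins of the same length.
    """
    frame_lsd = []
    for ref, syn in zip(ref_mags, syn_mags):
        squared_diffs = [
            (20 * math.log10(r + eps) - 20 * math.log10(s + eps)) ** 2
            for r, s in zip(ref, syn)
        ]
        frame_lsd.append(math.sqrt(sum(squared_diffs) / len(squared_diffs)))
    return sum(frame_lsd) / len(frame_lsd)


# Toy spectra: the "synthesis" is the reference scaled by a factor of 10,
# so every bin differs by 20 dB.
reference = [[1.0, 2.0, 0.5], [0.5, 4.0, 1.0]]
synthesis = [[10 * m for m in frame] for frame in reference]
print(log_spectral_distance(reference, synthesis))
```

A uniform gain error shows up as a constant LSD offset under this definition, which is one reason copy-synthesis pipelines often normalize level before scoring.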

Overtone-Region Evaluation

Metrics for the 1–3 kHz overtone region across all 20 segments. SpCorr_OT: spectral correlation in 1–3 kHz; eR: Bergevin energy ratio (1–2 kHz / 0–8 kHz); S_ot: overtone salience (dB); HPR: harmonic prominence ratio. For eR, S_ot, and HPR, the best synthesis value is the one closest to the target.

Dataset	Method	SpCorr_OT	eR	S_ot (dB)	HPR
HFA — 10 segments
	Target	—	0.56 ± 0.12	15.13 ± 1.39	41.05 ± 4.46
	Articulator	0.79 ± 0.05	0.86 ± 0.23	16.04 ± 1.35	26.89 ± 4.36
	DDSP	0.90 ± 0.02	0.09 ± 0.03	13.65 ± 2.17	25.32 ± 6.66
	B-spline	0.88 ± 0.03	0.68 ± 0.10	14.99 ± 0.85	35.21 ± 3.53
Bergevin — 10 segments
	Target	—	0.43 ± 0.24	12.35 ± 1.82	22.61 ± 6.72
	Articulator	0.52 ± 0.21	0.79 ± 0.20	14.88 ± 1.54	15.08 ± 4.58
	DDSP	0.86 ± 0.08	0.20 ± 0.09	10.91 ± 1.86	13.42 ± 2.51
	B-spline	0.82 ± 0.12	0.54 ± 0.24	12.92 ± 2.71	17.48 ± 6.02
All — 20 segments
	Target	—	0.49 ± 0.20	13.74 ± 2.14	31.83 ± 10.84
	Articulator	0.66 ± 0.21	0.82 ± 0.21	15.46 ± 1.56	20.99 ± 7.40
	DDSP	0.88 ± 0.06	0.14 ± 0.09	12.28 ± 2.44	19.37 ± 7.80
	B-spline	0.85 ± 0.09	0.61 ± 0.19	13.95 ± 2.26	26.35 ± 10.15

DDSP achieves the highest overtone-region spectral correlation (SpCorr_OT) because it fits each harmonic independently. B-spline, however, better preserves the overtone energy structure: its eR, S_ot, and HPR values are the closest to the target in both datasets, and these properties are perceptually more important for the characteristic sygyt timbre.
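To make the energy ratio eR concrete, the sketch below computes the share of spectral energy in 1–2 kHz relative to 0–8 kHz from a single power spectrum. It assumes uniformly spaced bins from 0 Hz to Nyquist, a 16 kHz sample rate (so that Nyquist is 8 kHz), and inclusive band edges. The function name and these conventions are illustrative, not taken from the paper.

```python
def band_energy_ratio(power, sample_rate=16000,
                      band=(1000.0, 2000.0), full=(0.0, 8000.0)):
    """Energy in `band` divided by energy in `full` for one power spectrum.

    `power` holds bins spaced uniformly from 0 Hz to sample_rate / 2.
    """
    bin_hz = (sample_rate / 2) / (len(power) - 1)

    def band_sum(lo, hi):
        # Sum every bin whose centre frequency lies in [lo, hi].
        return sum(p for k, p in enumerate(power) if lo <= k * bin_hz <= hi)

    total = band_sum(*full)
    return band_sum(*band) / total if total > 0 else 0.0


# Toy spectrum with 9 bins at 0, 1, ..., 8 kHz and a strong peak at 2 kHz.
toy_power = [1.0, 1.0, 6.0, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
print(band_energy_ratio(toy_power))  # 7.0 / 10.0 = 0.7
```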

Ablation Study

2×2 factorial ablation (B-spline K_A=40, all 20 segments, 500 iterations). The sublingual source is the primary driver (ΔLSD ≈ 1.0 dB), with per-segment damping providing a secondary improvement (ΔLSD ≈ 0.14 dB). SpCorr_OT: spectral correlation in the 1–3 kHz overtone region.

Condition	Sublingual	Damping	LSD (dB) ↓	SpCorr ↑	SpCorr_OT ↑
Full model	✓	✓	9.34 ± 0.49	0.868 ± 0.018	0.850 ± 0.090
No sublingual	—	✓	10.32 ± 0.56	0.820 ± 0.026	0.814 ± 0.127
No damping	✓	—	9.48 ± 0.48	0.864 ± 0.019	0.845 ± 0.091
Minimal	—	—	10.45 ± 0.61	0.815 ± 0.028	0.810 ± 0.139
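The effect sizes quoted for the ablation can be recomputed from the four mean LSD values in the table. The sketch below reports the difference of each single-ablation condition from the full model, and also the standard 2×2 factorial main effects and interaction; the variable names are illustrative.

```python
# Mean LSD (dB) keyed by (sublingual_enabled, damping_enabled), from the table above.
lsd = {
    (True, True): 9.34,    # Full model
    (False, True): 10.32,  # No sublingual
    (True, False): 9.48,   # No damping
    (False, False): 10.45, # Minimal
}

# Cost of removing one component from the full model.
drop_sublingual = lsd[(False, True)] - lsd[(True, True)]
drop_damping = lsd[(True, False)] - lsd[(True, True)]

# Factorial main effects: average the cost of removal over both levels
# of the other factor.
sublingual_effect = (
    (lsd[(False, True)] - lsd[(True, True)])
    + (lsd[(False, False)] - lsd[(True, False)])
) / 2
damping_effect = (
    (lsd[(True, False)] - lsd[(True, True)])
    + (lsd[(False, False)] - lsd[(False, True)])
) / 2

# Interaction: how much the sublingual effect changes when damping is removed.
interaction = (
    (lsd[(False, False)] - lsd[(True, False)])
    - (lsd[(False, True)] - lsd[(True, True)])
) / 2

print(f"remove sublingual: {drop_sublingual:.3f} dB, remove damping: {drop_damping:.3f} dB")
print(f"main effects: {sublingual_effect:.3f} dB, {damping_effect:.3f} dB")
print(f"interaction: {interaction:.3f} dB")
```

The interaction term is close to zero, so the two components contribute almost additively on these means.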

Mean loss curves (±1σ envelope) for the four ablation conditions over 500 iterations.

Learned Parameters

Visualization of parameters learned by B-spline optimization across all 20 segments.

Tract profiles. Mean oral diameter profile (0–43 sections) for each of the 20 segments, with ±1σ temporal variation.

Sublingual source. Heatmap of learned second-source amplitude over normalized time for all segments.

Glottal parameters. Box plots of OQ, tilt, tenseness, and aspiration comparing HFA and Bergevin datasets.

Velum opening. Mean ± std velum opening per segment. Low values indicate closed velopharyngeal port.

Paper Figures

Publication-quality figures from the paper.

Architecture. Differentiable Kelly–Lochbaum waveguide with sublingual second source and three-way junction. Oral (44 sections), nasal (28), and sublingual (15) tracts.

Spectrogram comparison. Target vs. three synthesis methods for representative segments, showing the amplified harmonic in 1–3 kHz.

Tract & damping. Learned oral diameter and per-segment damping profiles, showing constriction near the sublingual junction.

Formant envelope. Cepstral-smoothed spectral envelopes showing the merged F2≈F3 structure characteristic of sygyt production.

Convergence. Loss curves for the three methods, showing comparable convergence speed even though B-spline has about 4.5× more degrees of freedom per frame than the articulator chain (86 vs. 19).

What is sygyt? Spectrogram of a sygyt performance showing the fundamental drone and the selectively amplified harmonic melody in 1–3 kHz.
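For readers new to Kelly–Lochbaum waveguides (named in the Architecture caption), the sketch below shows the textbook building block: a section-diameter profile is converted to cross-sectional areas, and each pair of adjacent sections gets a reflection coefficient that drives a lossless scattering junction. It covers only two-port junctions in pressure-wave form under one common sign convention. The three-way junction, the sublingual source, and the learnable damping described on this page are not modelled, and all function names are illustrative.

```python
import math


def areas_from_diameters(diameters):
    """Cross-sectional area of each circular tube section."""
    return [math.pi * (d / 2.0) ** 2 for d in diameters]


def reflection_coefficients(areas):
    """k_i = (A_i - A_{i+1}) / (A_i + A_{i+1}) for each adjacent pair of sections."""
    return [(a0 - a1) / (a0 + a1) for a0, a1 in zip(areas, areas[1:])]


def scatter(forward_in, backward_in, k):
    """One lossless two-port junction for pressure waves.

    forward_in arrives from the left section, backward_in from the right.
    Returns (wave leaving to the right, wave reflected back to the left).
    """
    forward_out = (1 + k) * forward_in - k * backward_in
    backward_out = k * forward_in + (1 - k) * backward_in
    return forward_out, backward_out


# Hypothetical 44-section profile: a uniform 2 cm tube with a narrow
# constriction around sections 30-33.
diameters = [2.0] * 44
for i in range(30, 34):
    diameters[i] = 0.6
ks = reflection_coefficients(areas_from_diameters(diameters))
print(len(ks), max(ks), min(ks))
```

A narrowing gives a positive coefficient under this convention and a widening gives a negative one; uniform stretches of tube give zero, so they only delay the travelling waves.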

Supplementary Visualizations

Method Comparison

Per-segment LSD. Grouped horizontal bars for 20 segments across 3 methods, sorted by B-spline quality.

DDSP vs B-spline LSD. Points above the parity line are segments where B-spline achieves the lower LSD. Colored by dataset.

Energy preservation. Per-segment energy ratio (ideal = 1.0) for all three methods.

Optimization time vs. quality. Longer segments take more time but do not necessarily yield worse LSD.

Convergence

B-spline convergence. Loss curves for all 20 segments overlaid; HFA (blue) and Bergevin (orange).

3-method convergence. Articulator, DDSP, and B-spline loss curves for 3 representative segments.

Diphonic Content

Diphonic content. Top: fraction of diphonic frames; Bottom: mean harmonic enhancement (dB) per segment.

Enhancement vs. quality. Harmonic enhancement (dB) plotted against B-spline LSD.

Comparison with Prior Approaches

Comparison with related approaches relevant to overtone singing synthesis. Phys. = physical vocal tract model; Diff. = differentiable (gradient-based); C-S = copy-synthesis from audio; Biph. = biphonic/overtone production. LSD on 10 HFA sygyt segments.

Method	Phys.	Diff.	C-S	Biph.	Formant merging	LSD (dB) ↓
Sondhi; Story	✓	×	×	×	×	—
Kob	✓	×	×	✓	Manual	—
Tsai et al.	×	×	×	✓	Post-hoc	—
Pink Trombone	✓	×	×	×	×	—
DDSP	×	✓	✓	×	×	10.99 ± 0.54
Ours (B-spline)	✓	✓	✓	✓	Emergent	9.64 ± 0.29

System Comparison: VocalTrax vs. TubeTalkerPkg

Feature	TubeTalkerPkg	VocalTrax (ours)
Task	Forward synthesis	Copy-synthesis (inverse)
Input	Target formants	Target audio
Method	SensMap perturbation	Gradient descent
Tract sections	44	44
Sample rate	44.1 kHz	16 kHz
Glottal source	Single LF	Dual (sublingual)
Damping	Fixed (wall)	Per-segment (learnable)
Differentiable	No	Yes (JAX)
Overtone control	Manual (area fn.)	Automatic
Platform	MATLAB/MEX	Python/JAX

Subjective Listening Test

Protocol

We conducted a perceptual evaluation inspired by the MUSHRA methodology (ITU-R BS.1534-3). Twenty listeners (3 expert, 2 trained, 15 novice; one monotone respondent excluded) evaluated 6 segments (3 per dataset) in a blind A/B/C comparison via a web interface. Each trial presented the reference recording followed by three unlabeled synthesized conditions. Listeners rated each condition on two 0–100 anchored scales:

Q1 — Overall quality: How natural and similar to the reference does the synthesis sound overall?
Q2 — Harmonic similarity: How closely does the whistled overtone match the reference in pitch, clarity, and loudness?

Scale anchors: Bad (0) – Poor (20) – Acceptable (40) – Good (60) – Very good (80) – Excellent (100). Presentation order and condition assignment were randomized per listener using a seeded PRNG. Two labeled practice trials preceded the test for familiarization.
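Per-listener randomization with a seeded PRNG can be sketched as follows. The seed format, the listener and segment identifiers, and the function name are hypothetical; the sketch only illustrates how a seed derived from the listener makes both the presentation order and the A/B/C assignment reproducible.

```python
import random

CONDITIONS = ("Articulator", "DDSP", "B-spline")
LABELS = ("A", "B", "C")


def trial_plan(listener_id, segments, base_seed=2026):
    """Return a reproducible list of (segment, {label: condition}) trials."""
    # Seeding with a string derived from the listener gives each listener
    # a different but repeatable randomization.
    rng = random.Random(f"{base_seed}-{listener_id}")
    order = list(segments)
    rng.shuffle(order)  # randomize presentation order
    plan = []
    for segment in order:
        shuffled = list(CONDITIONS)
        rng.shuffle(shuffled)  # randomize which method hides behind A, B, C
        plan.append((segment, dict(zip(LABELS, shuffled))))
    return plan


# Hypothetical segment identifiers.
segments = ["seg01", "seg02", "seg03", "seg04", "seg05", "seg06"]
for segment, mapping in trial_plan("listener-07", segments):
    print(segment, mapping)
```

Storing only the seed and listener identifier is then enough to reconstruct which method each blind label referred to when analyzing the ratings.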


Example Trial (seg02 — HFA, F3 Scale)

Below is a reconstruction of one test trial with method labels revealed. In the actual test, conditions were labeled A/B/C in randomized order.

Reference: target recording