Differentiable Articulatory Copy-Synthesis
of Biphonic Singing

Mateo Cámara, María Pilar Daza-Llin, Fernando Marcos-Macías, José Luis Blanco

Companion page for DAFx 2026 — Audio examples, supplementary results, and MRI analysis

Abstract

Sygyt is a Tuvan style of biphonic singing in which a low vocal drone is sustained while a high harmonic is selectively amplified in the 1–3 kHz region. Copy-synthesizing this effect remains challenging for articulatory models, since it requires fine control of narrowly focused resonances that standard low-dimensional tract parameterizations cannot easily reproduce.

We address this problem with a differentiable Kelly–Lochbaum waveguide augmented with a sublingual second source, cubic B-spline tract parameterization, and spatially varying learnable damping, optimized end-to-end by gradient descent from audio. On 20 segments from two independent sygyt datasets (5 singers, 10 pitches), the proposed model reduces log-spectral distance by 30–38% relative to an articulatory baseline, with the largest gains concentrated in the overtone region. Cepstral-envelope analysis further shows more accurate recovery of the merged formant structure characteristic of sygyt production. The model also outperforms a DDSP harmonic-plus-noise baseline with direct per-harmonic spectral control, suggesting that explicit acoustic structure is a useful inductive bias for overtone-singing copy-synthesis.
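As a rough illustration of how a spline-parameterized tract can feed a Kelly–Lochbaum waveguide, the sketch below evaluates a uniform cubic B-spline over a few control points to obtain a 44-section area function, then converts it to junction reflection coefficients. The number of control points, the log-area parameterization, the example areas, and the function names are assumptions made for this example and are not taken from the paper. The sublingual source and the learnable damping are omitted.

```python
import numpy as np

def cubic_bspline(ctrl, n_out):
    """Evaluate a uniform cubic B-spline with control points `ctrl` at n_out points."""
    ctrl = np.asarray(ctrl, dtype=float)
    n_seg = len(ctrl) - 3                       # number of polynomial pieces
    t = np.linspace(0.0, n_seg, n_out, endpoint=False)
    i = t.astype(int)                           # piece index, 0 .. n_seg - 1
    u = t - i                                   # local coordinate in [0, 1)
    basis = np.stack([
        (1 - u) ** 3,
        3 * u**3 - 6 * u**2 + 4,
        -3 * u**3 + 3 * u**2 + 3 * u + 1,
        u**3,
    ]) / 6.0                                    # shape (4, n_out), columns sum to 1
    pts = np.stack([ctrl[i + k] for k in range(4)])
    return (basis * pts).sum(axis=0)

def reflection_coefficients(area):
    """Junction reflection coefficients (A_i - A_{i+1}) / (A_i + A_{i+1}).

    This is one common sign convention; others flip the sign.
    """
    area = np.asarray(area, dtype=float)
    return (area[:-1] - area[1:]) / (area[:-1] + area[1:])

# Hypothetical control points (cm^2), glottis to lips; log-area keeps areas positive.
log_area_ctrl = np.log(np.array([0.6, 0.6, 1.5, 3.0, 2.0, 0.3, 0.3, 2.5, 4.0, 4.0]))
area = np.exp(cubic_bspline(log_area_ctrl, 44))  # 44 tract sections
k = reflection_coefficients(area)                # 43 junctions
```

Because the spline is smooth, a small number of control points produces a smooth area function while still allowing a narrow constriction, which is the kind of shape that focuses a resonance.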

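We compare three synthesis approaches across 20 segments from two datasets: an articulator-chain baseline, a DDSP harmonic-plus-noise baseline, and the proposed B-spline model.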

What Does Sygyt Sound Like?

Video recording of a Tuvan singer demonstrating sygyt technique, with synchronized spectrogram overlay showing the characteristic amplified harmonic in the 1–3 kHz region.

Key Results

Overall LSD (dB), B-spline: 9.34
Spectral correlation: 0.868
LSD improvement over Articulator: 30–38%
Parameters relative to DDSP: 1100× fewer
DOFs per frame: 86
Listeners (subjective test): 20
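The 30–38% range can be checked against the per-dataset mean LSD values reported in Table III (Articulator chain vs. B-spline). The helper name below is illustrative only.

```python
def relative_reduction(baseline, proposed):
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

# Mean LSD (dB) from Table III: Articulator chain vs. B-spline.
hfa_gain = relative_reduction(13.84, 9.64)       # HFA dataset
bergevin_gain = relative_reduction(14.53, 9.04)  # Bergevin dataset

print(f"HFA: {hfa_gain:.1f}%  Bergevin: {bergevin_gain:.1f}%")
```

The two values round to 30% and 38%, the ends of the reported range.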

Main Results (Table III)

Copy-synthesis results on 20 segments of Tuvan sygyt from two independent datasets. Objective metrics: mean ± std. PESQ (wideband MOS-LQO), CDPAM (learned perceptual distance, lower = closer), ViSQOL (audio-mode MOS-LQO, 1–5). Q1 (overall quality) and Q2 (harmonic similarity): subjective scores (mean ± 95% CI, 0–100, 20 listeners × 3 segments/dataset).

| Dataset / Method | DOFs | LSD (dB) ↓ | SpCorr ↑ | PESQ ↑ | CDPAM ↓ | ViSQOL ↑ | Q1 ↑ | Q2 ↑ |
|---|---|---|---|---|---|---|---|---|
| HFA: 1 singer, 10 pitches (F3–E♭4) | | | | | | | | |
| Articulator chain | 19 | 13.84 ± 0.54 | 0.708 ± 0.040 | 1.17 ± 0.15 | 3.06 ± 3.34 | 2.81 ± 0.53 | 13.6 ± 3.7 | 25.3 ± 4.5 |
| DDSP | ~100k | 10.99 ± 0.54 | 0.819 ± 0.014 | 1.20 ± 0.13 | 3.56 ± 3.71 | 3.81 ± 0.21 | 42.4 ± 5.0 | 50.3 ± 4.9 |
| B-spline (ours) | 86 | 9.64 ± 0.29 | 0.857 ± 0.019 | 1.37 ± 0.34 | 2.36 ± 3.07 | 3.60 ± 0.22 | 44.8 ± 5.5 | 52.9 ± 5.8 |
| Bergevin: 4 singers (T1–T4) | | | | | | | | |
| Articulator chain | 19 | 14.53 ± 0.93 | 0.658 ± 0.046 | 1.10 ± 0.05 | 0.73 ± 0.41 | 2.88 ± 0.21 | 11.6 ± 3.6 | 15.7 ± 3.9 |
| DDSP | ~100k | 10.71 ± 0.72 | 0.825 ± 0.030 | 1.26 ± 0.13 | 0.77 ± 0.55 | 3.69 ± 0.33 | 35.4 ± 5.5 | 45.1 ± 5.4 |
| B-spline (ours) | 86 | 9.04 ± 0.46 | 0.879 ± 0.008 | 1.58 ± 0.35 | 0.60 ± 0.34 | 3.85 ± 0.23 | 38.5 ± 5.3 | 46.3 ± 5.6 |
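For readers unfamiliar with the two headline objective metrics, the following is a minimal sketch of a log-spectral distance and a spectral correlation computed on STFT log-magnitudes. The exact definitions used for the table (window length, hop, frequency range, frame averaging) are not stated on this page, so the choices below, along with all function names, are assumptions.

```python
import numpy as np

def log_mag_spectrogram(x, n_fft=512, hop=128, eps=1e-8):
    """Log-magnitude STFT in dB with a Hann window, shape (frames, bins)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return 20.0 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) + eps)

def log_spectral_distance(ref, est):
    """RMS dB difference over frequency, averaged over frames (assumed definition)."""
    r, e = log_mag_spectrogram(ref), log_mag_spectrogram(est)
    return float(np.mean(np.sqrt(np.mean((r - e) ** 2, axis=1))))

def spectral_correlation(ref, est):
    """Pearson correlation between the two log-magnitude spectrograms (assumed)."""
    r, e = log_mag_spectrogram(ref), log_mag_spectrogram(est)
    return float(np.corrcoef(r.ravel(), e.ravel())[0, 1])

# Toy signals: a drone plus one overtone, with the overtone weaker in the estimate.
sr = 16000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)
estimate = np.sin(2 * np.pi * 220 * t) + 0.2 * np.sin(2 * np.pi * 1760 * t)

lsd_same = log_spectral_distance(target, target)
lsd_diff = log_spectral_distance(target, estimate)
corr_same = spectral_correlation(target, target)
corr_diff = spectral_correlation(target, estimate)
```

Under these definitions a perfect copy gives an LSD of 0 dB and a correlation of 1, and any spectral mismatch raises the LSD.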

Audio Examples — HFA Dataset

10 overtone-singing segments from the HFA dataset, spanning pitches F3–E♭4. Each card shows Target, Articulator, DDSP, and B-spline with per-segment LSD (dB) and SpCorr.

F3 — Scale (h8–h10)

Articulator: LSD 14.0 · SpCorr 0.739
DDSP: LSD 12.3 · SpCorr 0.814
B-spline: LSD 9.86 · SpCorr 0.878
Spectrogram comparison for F3 Scale

G3 — Interval (h6→h7)

Articulator: LSD 13.8 · SpCorr 0.748
DDSP: LSD 11.2 · SpCorr 0.834
B-spline: LSD 9.17 · SpCorr 0.890
Spectrogram comparison for G3 Interval

A3 — Arpeggio (h4–h6)

Articulator: LSD 14.1 · SpCorr 0.719
DDSP: LSD 11.1 · SpCorr 0.821
B-spline: LSD 9.11 · SpCorr 0.877
Spectrogram comparison for A3 Arpeggio

A3 — Interval (h5→h6)

Articulator: LSD 14.0 · SpCorr 0.710
DDSP: LSD 10.7 · SpCorr 0.833
B-spline: LSD 9.78 · SpCorr 0.858
Spectrogram comparison for A3 Interval

A3 — Scale (h6–h8)

Articulator: LSD 12.9 · SpCorr 0.756
DDSP: LSD 10.8 · SpCorr 0.838
B-spline: LSD 9.83 · SpCorr 0.863
Spectrogram comparison for A3 Scale

B♭3 — Interval (h6→h7)

Articulator: LSD 13.2 · SpCorr 0.739
DDSP: LSD 11.3 · SpCorr 0.819
B-spline: LSD 10.0 · SpCorr 0.855
Spectrogram comparison for Bb3 Interval

B3 — Interval (h4→h5)

Articulator: LSD 14.3 · SpCorr 0.693
DDSP: LSD 10.9 · SpCorr 0.814
B-spline: LSD 9.52 · SpCorr 0.853
Spectrogram comparison for B3 Interval

C4 — Arpeggio (h3–h5)

Articulator: LSD 14.5 · SpCorr 0.651
DDSP: LSD 10.7 · SpCorr 0.799
B-spline: LSD 9.90 · SpCorr 0.827
Spectrogram comparison for C4 Arpeggio

D4 — Glissando (h3→h8)

Articulator: LSD 13.2 · SpCorr 0.694
DDSP: LSD 10.1 · SpCorr 0.825
B-spline: LSD 9.69 · SpCorr 0.837
Spectrogram comparison for D4 Glissando

E♭4 — Arpeggio (h3–h5)

Articulator: LSD 14.5 · SpCorr 0.629
DDSP: LSD 10.7 · SpCorr 0.795
B-spline: LSD 9.53 · SpCorr 0.836
Spectrogram comparison for Eb4 Arpeggio

Audio Examples — Bergevin Dataset

10 ethnographic Khoomei recordings from Bergevin et al., organized by singer (T1–T4).

Singer T1

T1 — Take 3 Short

Articulator: LSD 13.5 · SpCorr 0.696
DDSP: LSD 12.2 · SpCorr 0.755
B-spline: LSD 9.24 · SpCorr 0.866
Spectrogram comparison for T1 3short

Singer T2

T2 — Take 1 Short A

Articulator: LSD 15.2 · SpCorr 0.653
DDSP: LSD 9.84 · SpCorr 0.865
B-spline: LSD 9.17 · SpCorr 0.886
Spectrogram comparison for T2 1shortA

T2 — Take 1 Short B

Articulator: LSD 15.5 · SpCorr 0.636
DDSP: LSD 10.8 · SpCorr 0.841
B-spline: LSD 9.59 · SpCorr 0.878
Spectrogram comparison for T2 1shortB

T2 — Take 1 Short C

Articulator: LSD 15.0 · SpCorr 0.639
DDSP: LSD 9.88 · SpCorr 0.857
B-spline: LSD 9.48 · SpCorr 0.871
Spectrogram comparison for T2 1shortC

T2 — Take 2 Short

Articulator: LSD 13.6 · SpCorr 0.690
DDSP: LSD 10.6 · SpCorr 0.823
B-spline: LSD 8.89 · SpCorr 0.880
Spectrogram comparison for T2 2short

T2 — Take 3 Short

Articulator: LSD 15.8 · SpCorr 0.616
DDSP: LSD 10.8 · SpCorr 0.841
B-spline: LSD 9.68 · SpCorr 0.877
Spectrogram comparison for T2 3short

T2 — Take 5 Short

Articulator: LSD 15.4 · SpCorr 0.567
DDSP: LSD 10.3 · SpCorr 0.827
B-spline: LSD 8.81 · SpCorr 0.876
Spectrogram comparison for T2 5short

Singer T3

T3 — Take 2 Short A

Articulator: LSD 13.0 · SpCorr 0.740
DDSP: LSD 11.4 · SpCorr 0.803
B-spline: LSD 8.47 · SpCorr 0.893
Spectrogram comparison for T3 2shortA

T3 — Take 2 Short B

Articulator: LSD 14.3 · SpCorr 0.689
DDSP: LSD 11.4 · SpCorr 0.809
B-spline: LSD 8.84 · SpCorr 0.890
Spectrogram comparison for T3 2shortB

Singer T4

T4 — Take 1 Short A

Articulator: LSD 14.1 · SpCorr 0.654
DDSP: LSD 10.0 · SpCorr 0.833
B-spline: LSD 8.20 · SpCorr 0.875
Spectrogram comparison for T4 1shortA

Cross-Pitch Generalization

LSD (dB) across 10 HFA segments spanning pitches F3–E♭4. B-spline achieves the lowest LSD on every segment.

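| Pitch | Pattern | Articulator | DDSP | B-spline |
|---|---|---|---|---|
| F3 | Scale (h8–h10) | 14.0 | 12.3 | 9.9 |
| G3 | Interval (h6–h7) | 13.8 | 11.2 | 9.2 |
| A3 | Arpeggio (h4–h6) | 14.1 | 11.1 | 9.1 |
| A3 | Interval (h5–h6) | 14.0 | 10.7 | 9.8 |
| A3 | Scale (h6–h8) | 12.9 | 10.8 | 9.8 |
| B♭3 | Interval (h6–h7) | 13.2 | 11.3 | 10.0 |
| B3 | Interval (h4–h5) | 14.3 | 10.9 | 9.5 |
| C4 | Arpeggio (h3–h5) | 14.5 | 10.7 | 9.9 |
| D4 | Glissando (h3–h8) | 13.2 | 10.1 | 9.7 |
| E♭4 | Arpeggio (h3–h5) | 14.5 | 10.7 | 9.5 |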

Overtone-Region Evaluation

Metrics for the 1–3 kHz overtone region across all 20 segments. SpCorrOT: spectral correlation in 1–3 kHz; eR: Bergevin energy ratio (1–2 kHz / 0–8 kHz); Sot: overtone salience (dB); HPR: harmonic prominence ratio. For eR, Sot, and HPR, the best synthesis value is the one closest to the target.

| Dataset / Method | SpCorrOT | eR | Sot (dB) | HPR |
|---|---|---|---|---|
| HFA (10 segments) | | | | |
| Target | | 0.56 ± 0.12 | 15.13 ± 1.39 | 41.05 ± 4.46 |
| Articulator | 0.79 ± 0.05 | 0.86 ± 0.23 | 16.04 ± 1.35 | 26.89 ± 4.36 |
| DDSP | 0.90 ± 0.02 | 0.09 ± 0.03 | 13.65 ± 2.17 | 25.32 ± 6.66 |
| B-spline | 0.88 ± 0.03 | 0.68 ± 0.10 | 14.99 ± 0.85 | 35.21 ± 3.53 |
| Bergevin (10 segments) | | | | |
| Target | | 0.43 ± 0.24 | 12.35 ± 1.82 | 22.61 ± 6.72 |
| Articulator | 0.52 ± 0.21 | 0.79 ± 0.20 | 14.88 ± 1.54 | 15.08 ± 4.58 |
| DDSP | 0.86 ± 0.08 | 0.20 ± 0.09 | 10.91 ± 1.86 | 13.42 ± 2.51 |
| B-spline | 0.82 ± 0.12 | 0.54 ± 0.24 | 12.92 ± 2.71 | 17.48 ± 6.02 |
| All (20 segments) | | | | |
| Target | | 0.49 ± 0.20 | 13.74 ± 2.14 | 31.83 ± 10.84 |
| Articulator | 0.66 ± 0.21 | 0.82 ± 0.21 | 15.46 ± 1.56 | 20.99 ± 7.40 |
| DDSP | 0.88 ± 0.06 | 0.14 ± 0.09 | 12.28 ± 2.44 | 19.37 ± 7.80 |
| B-spline | 0.85 ± 0.09 | 0.61 ± 0.19 | 13.95 ± 2.26 | 26.35 ± 10.15 |

DDSP achieves the highest local spectral correlation (SpCorrOT) because it fits each harmonic independently. B-spline, however, better preserves the overtone energy structure: its eR, Sot, and HPR values are closer to the target than DDSP's in both datasets, and this structure is perceptually more important for the characteristic sygyt timbre.
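To make the energy-ratio metric concrete, here is a minimal sketch of a band energy ratio (1–2 kHz over 0–8 kHz) computed from a power spectrum. It follows the band limits given in the caption above, but the windowing, framing, and function name are assumptions, and the original definition by Bergevin et al. may differ in detail.

```python
import numpy as np

def band_energy_ratio(x, sr, band=(1000.0, 2000.0), full=(0.0, 8000.0)):
    """Energy in `band` divided by energy in `full`, from the power spectrum of x."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    in_band = (freqs >= band[0]) & (freqs < band[1])
    in_full = (freqs >= full[0]) & (freqs <= full[1])
    return float(power[in_band].sum() / power[in_full].sum())

# Toy example: a 200 Hz drone plus an equally strong 1600 Hz overtone.
sr = 16000
t = np.arange(sr) / sr
drone = np.sin(2 * np.pi * 200 * t)
overtone = np.sin(2 * np.pi * 1600 * t)

ratio_drone = band_energy_ratio(drone, sr)             # almost no energy in 1-2 kHz
ratio_mix = band_energy_ratio(drone + overtone, sr)    # half the energy in 1-2 kHz
```

A synthesis that matches individual harmonic amplitudes locally but leaves too little energy in the overtone band would score a low ratio, which is the pattern the DDSP rows show in the table.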

Ablation Study

2×2 factorial ablation (B-spline KA=40, all 20 segments, 500 iterations). The sublingual source is the primary driver (ΔLSD ≈ 1.0 dB), with per-segment damping providing a secondary improvement (ΔLSD ≈ 0.14 dB). SpCorrOT: spectral correlation in the 1–3 kHz overtone region.

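| Condition | Sublingual | Damping | LSD (dB) ↓ | SpCorr ↑ | SpCorrOT |
|---|---|---|---|---|---|
| Full model | yes | yes | 9.34 ± 0.49 | 0.868 ± 0.018 | 0.850 ± 0.090 |
| No sublingual | no | yes | 10.32 ± 0.56 | 0.820 ± 0.026 | 0.814 ± 0.127 |
| No damping | yes | no | 9.48 ± 0.48 | 0.864 ± 0.019 | 0.845 ± 0.091 |
| Minimal | no | no | 10.45 ± 0.61 | 0.815 ± 0.028 | 0.810 ± 0.139 |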
Ablation convergence curves

Mean loss curves (±1σ envelope) for the four ablation conditions over 500 iterations.
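The 2×2 design also allows the effects to be separated with simple arithmetic on the mean LSD values from the ablation table. The sketch below computes the average effect of removing each component and their interaction. It assumes that "Minimal" means both components are removed, as the condition names suggest.

```python
# Mean LSD (dB) from the ablation table, keyed by (sublingual, damping).
lsd = {
    (True, True): 9.34,     # Full model
    (False, True): 10.32,   # No sublingual
    (True, False): 9.48,    # No damping
    (False, False): 10.45,  # Minimal (assumed: both removed)
}

# Average LSD increase when the sublingual source is removed.
sublingual_effect = ((lsd[False, True] - lsd[True, True])
                     + (lsd[False, False] - lsd[True, False])) / 2

# Average LSD increase when the learnable damping is removed.
damping_effect = ((lsd[True, False] - lsd[True, True])
                  + (lsd[False, False] - lsd[False, True])) / 2

# Interaction: does removing the source cost more when damping is also absent?
interaction = ((lsd[False, False] - lsd[True, False])
               - (lsd[False, True] - lsd[True, True]))

print(f"sublingual: {sublingual_effect:.3f} dB, damping: {damping_effect:.3f} dB, "
      f"interaction: {interaction:.3f} dB")
```

The interaction term is close to zero, so on these means the two components contribute almost additively.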

Learned Parameters

Visualization of parameters learned by B-spline optimization across all 20 segments.

Paper Figures

Publication-quality figures from the paper.

Supplementary Visualizations

Method Comparison

Convergence

Diphonic Content

Comparison with Prior Approaches

Comparison with related approaches relevant to overtone singing synthesis. Phys. = physical vocal tract model; Diff. = differentiable (gradient-based); C-S = copy-synthesis from audio; Biph. = biphonic/overtone production. LSD on 10 HFA sygyt segments.

Method | Phys. | Diff. | C-S | Biph. | Formant merging | LSD (dB) ↓
Sondhi; Story | × × × ×
Kob | × × | Manual
Tsai et al. | × × × | Post-hoc
Pink Trombone | × × × ×
DDSP | × × × | 10.99 ± 0.54
Ours (B-spline) | Emergent | 9.64 ± 0.29

System Comparison: VocalTrax vs. TubeTalkerPkg

| Feature | TubeTalkerPkg | VocalTrax (ours) |
|---|---|---|
| Task | Forward synthesis | Copy-synthesis (inverse) |
| Input | Target formants | Target audio |
| Method | SensMap perturbation | Gradient descent |
| Tract sections | 44 | 44 |
| Sample rate | 44.1 kHz | 16 kHz |
| Glottal source | Single LF | Dual (sublingual) |
| Damping | Fixed (wall) | Per-segment (learnable) |
| Differentiable | No | Yes (JAX) |
| Overtone control | Manual (area fn.) | Automatic |
| Platform | MATLAB/MEX | Python/JAX |

Subjective Listening Test

Protocol

We conducted a perceptual evaluation inspired by the MUSHRA methodology (ITU-R BS.1534-3). Twenty listeners (3 expert, 2 trained, 15 novice; one monotone respondent excluded) evaluated 6 segments (3 per dataset) in a blind A/B/C comparison via a web interface. Each trial presented the reference recording followed by three unlabeled synthesized conditions. Listeners rated each condition on two 0–100 anchored scales: Q1 (overall quality) and Q2 (harmonic similarity).

Scale anchors: Bad (0) – Poor (20) – Acceptable (40) – Good (60) – Very good (80) – Excellent (100). Presentation order and condition assignment were randomized per listener using a seeded PRNG. Two labeled practice trials preceded the test for familiarization.
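A per-listener randomization of this kind can be sketched with Python's seeded `random.Random`. The seeding scheme, segment identifiers, and function name below are hypothetical; the page only states that a seeded PRNG was used to randomize presentation order and condition assignment.

```python
import random

METHODS = ["articulator", "ddsp", "bspline"]

def trial_plan(listener_id, segments, base_seed=0):
    """Return a reproducible list of (segment, {label: method}) for one listener."""
    # Hypothetical seeding scheme: one independent stream per listener.
    rng = random.Random(base_seed * 100003 + listener_id)
    order = list(segments)
    rng.shuffle(order)                        # randomize presentation order
    plan = []
    for seg in order:
        methods = METHODS[:]
        rng.shuffle(methods)                  # randomize which method is A, B, or C
        plan.append((seg, dict(zip("ABC", methods))))
    return plan

# Hypothetical segment identifiers: 3 per dataset, 6 in total.
segments = ["hfa_1", "hfa_2", "hfa_3", "berg_1", "berg_2", "berg_3"]
plan = trial_plan(listener_id=7, segments=segments)
```

Seeding per listener makes each session reproducible for later analysis while still varying order and labels across listeners.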


Example Trial (seg02 — HFA, F3 Scale)

Below is a reconstruction of one test trial with method labels revealed. In the actual test, conditions were labeled A/B/C in randomized order.

Conditions shown: Articulator chain, DDSP, B-spline (ours). Each condition has two rating sliders, Q1 (overall quality) and Q2 (harmonic similarity), both labeled Bad, Poor, Acceptable, Good, Very good, Excellent.

Aggregate Results (mean ± 95% CI)

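| Method | HFA (3 segments): Q1 (Quality) | HFA (3 segments): Q2 (Harmonic) | Bergevin (3 segments): Q1 (Quality) | Bergevin (3 segments): Q2 (Harmonic) |
|---|---|---|---|---|
| Articulator chain | 13.6 ± 3.7 | 25.3 ± 4.5 | 11.6 ± 3.6 | 15.7 ± 3.9 |
| DDSP | 42.4 ± 5.0 | 50.3 ± 4.9 | 35.4 ± 5.5 | 45.1 ± 5.4 |
| B-spline (ours) | 44.8 ± 5.5 | 52.9 ± 5.8 | 38.5 ± 5.3 | 46.3 ± 5.6 |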

B-spline and DDSP are both significantly preferred over the articulator chain (Friedman test, p < 10⁻⁴; pairwise Wilcoxon with Holm–Bonferroni: p < 0.001, Cliff’s δ > 0.7). The B-spline–DDSP difference is not statistically significant (p > 0.19). 20 listeners, 6 stimuli (3 per dataset), 0–100 anchored scales.
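Two of the quantities named above are simple enough to sketch in plain Python: Cliff's δ as an effect size and the Holm–Bonferroni step-down adjustment for multiple pairwise comparisons. The ratings and p-values below are made-up toy numbers, not data from the listening test, and the function names are illustrative.

```python
def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross pairs."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

def holm_bonferroni(pvals):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k) and enforce monotonicity.
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted

# Toy ratings on a 0-100 scale (illustrative only).
ratings_a = [45, 50, 38, 60, 42]
ratings_b = [12, 20, 15, 10, 18]
delta = cliffs_delta(ratings_a, ratings_b)

# Toy raw p-values for three pairwise comparisons (illustrative only).
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A δ of 1 means every rating in the first group exceeds every rating in the second, so the reported δ > 0.7 indicates a large separation between the articulator chain and the other two methods.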

MPILARS: Real-Time MRI Vocal Tract Analysis

We apply the B-spline model to two sygyt segments extracted from real-time MRI recordings (MPILARS dataset), demonstrating generalization to a third independent data source. The MRI data provide ground-truth vocal tract geometry for qualitative comparison with the learned B-spline profiles.

Vibrato Phonation — clip 21–23 (2.0 s, 126 frames)

LSD (dB): 10.70
SpCorr: 0.801

MRI-Derived Vocal Tract Analysis

B-spline synthesis results - vibrato

B-spline copy-synthesis results: spectrogram comparison and learned parameters.

2D spectrogram - vibrato

Full 2D spectrogram of the vibrato segment.

Stable Phonation — clip 24 (0.75 s, 47 frames)

LSD (dB): 11.01
SpCorr: 0.796

MRI-Derived Vocal Tract Analysis

B-spline synthesis results - stable

B-spline copy-synthesis results: spectrogram comparison and learned parameters.

2D spectrogram - stable

Full 2D spectrogram of the stable segment.