Companion page for DAFx 2026 — Audio examples, supplementary results, and MRI analysis
Sygyt is a Tuvan style of biphonic singing in which a low vocal drone is sustained while a high harmonic is selectively amplified in the 1–3 kHz region. Copy-synthesizing this effect remains challenging for articulatory models, since it requires fine control of narrowly focused resonances that standard low-dimensional tract parameterizations cannot easily reproduce.
We address this problem with a differentiable Kelly–Lochbaum waveguide augmented with a sublingual second source, cubic B-spline tract parameterization, and spatially varying learnable damping, optimized end-to-end by gradient descent from audio. On 20 segments from two independent sygyt datasets (5 singers, 10 pitches), the proposed model reduces log-spectral distance by 30–38% relative to an articulatory baseline, with the largest gains concentrated in the overtone region. Cepstral-envelope analysis further shows more accurate recovery of the merged formant structure characteristic of sygyt production. The model also outperforms a DDSP harmonic-plus-noise baseline with direct per-harmonic spectral control, suggesting that explicit acoustic structure is a useful inductive bias for overtone-singing copy-synthesis.
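As a rough illustration of the parameterization described above, the sketch below samples a uniform cubic B-spline area function at 44 tube sections and converts it to Kelly–Lochbaum junction reflection coefficients. The function names, the uniform knot layout, the control-point values, and the sign convention of the reflection coefficient are assumptions made for illustration; this page does not specify them, and the sublingual source and damping are not modeled here.

```python
def bspline_area(ctrl, n_sections=44):
    """Sample a uniform cubic B-spline with control points `ctrl` at n_sections points."""
    k = len(ctrl)
    if k < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    n_spans = k - 3                      # number of polynomial pieces
    areas = []
    for i in range(n_sections):
        s = i / (n_sections - 1)         # normalized position along the tract, 0..1
        u = s * n_spans
        span = min(int(u), n_spans - 1)  # clamp so s = 1 falls in the last piece
        t = u - span
        # Uniform cubic B-spline basis functions (they sum to 1 for any t).
        b0 = (1 - t) ** 3 / 6
        b1 = (3 * t**3 - 6 * t**2 + 4) / 6
        b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
        b3 = t**3 / 6
        p = ctrl[span:span + 4]
        areas.append(b0 * p[0] + b1 * p[1] + b2 * p[2] + b3 * p[3])
    return areas


def reflection_coeffs(areas):
    """Junction reflection coefficients, assuming k_i = (A_i - A_{i+1}) / (A_i + A_{i+1})."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


# Hypothetical control points (cm^2), chosen only to produce a constriction mid-tract.
ctrl = [3.0, 2.5, 1.0, 0.3, 0.8, 2.0, 3.5, 4.0]
areas = bspline_area(ctrl)
refl = reflection_coeffs(areas)
print(len(areas), len(refl))
```

Because the basis functions are non-negative and sum to one, positive control points always give positive areas, so every reflection coefficient stays strictly inside (−1, 1).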
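We compare three synthesis approaches across 20 segments from two datasets: an articulator-chain baseline (19 DOFs), a DDSP harmonic-plus-noise baseline (~100k DOFs), and the proposed B-spline waveguide model (86 DOFs).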
Video recording of a Tuvan singer demonstrating sygyt technique, with synchronized spectrogram overlay showing the characteristic amplified harmonic in the 1–3 kHz region.
Copy-synthesis results on 20 segments of Tuvan sygyt from two independent datasets. Objective metrics: mean ± std. PESQ (wideband MOS-LQO), CDPAM (learned perceptual distance, lower = closer), ViSQOL (audio-mode MOS-LQO, 1–5). Q1 (overall quality) and Q2 (harmonic similarity): subjective scores (mean ± 95% CI, 0–100, 20 listeners × 3 segments/dataset). Best in bold blue.
| Dataset / Method | DOFs | LSD (dB) ↓ | SpCorr ↑ | PESQ ↑ | CDPAM ↓ | ViSQOL ↑ | Q1 ↑ | Q2 ↑ |
|---|---|---|---|---|---|---|---|---|
| HFA — 1 singer, 10 pitches (F3–E♭4) | ||||||||
| Articulator chain | 19 | 13.84 ± 0.54 | 0.708 ± 0.040 | 1.17 ± 0.15 | 3.06 ± 3.34 | 2.81 ± 0.53 | 13.6 ± 3.7 | 25.3 ± 4.5 |
| DDSP | ~100k | 10.99 ± 0.54 | 0.819 ± 0.014 | 1.20 ± 0.13 | 3.56 ± 3.71 | 3.81 ± 0.21 | 42.4 ± 5.0 | 50.3 ± 4.9 |
| B-spline (ours) | 86 | 9.64 ± 0.29 | 0.857 ± 0.019 | 1.37 ± 0.34 | 2.36 ± 3.07 | 3.60 ± 0.22 | 44.8 ± 5.5 | 52.9 ± 5.8 |
| Bergevin — 4 singers (T1–T4) | ||||||||
| Articulator chain | 19 | 14.53 ± 0.93 | 0.658 ± 0.046 | 1.10 ± 0.05 | 0.73 ± 0.41 | 2.88 ± 0.21 | 11.6 ± 3.6 | 15.7 ± 3.9 |
| DDSP | ~100k | 10.71 ± 0.72 | 0.825 ± 0.030 | 1.26 ± 0.13 | 0.77 ± 0.55 | 3.69 ± 0.33 | 35.4 ± 5.5 | 45.1 ± 5.4 |
| B-spline (ours) | 86 | 9.04 ± 0.46 | 0.879 ± 0.008 | 1.58 ± 0.35 | 0.60 ± 0.34 | 3.85 ± 0.23 | 38.5 ± 5.3 | 46.3 ± 5.6 |
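The 30–38% figure quoted in the abstract can be reproduced from the mean LSD values in the table above. The short check below uses only those table values; the helper name is ours.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Mean LSD (dB) from the table: (articulator chain, B-spline).
lsd = {"HFA": (13.84, 9.64), "Bergevin": (14.53, 9.04)}
reductions = {name: relative_reduction(b, p) for name, (b, p) in lsd.items()}

for name, r in reductions.items():
    print(f"{name}: {r:.1f}% lower LSD than the articulator chain")
```

This gives about 30.3% for HFA and 37.8% for Bergevin, matching the stated range.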
10 overtone-singing segments from the HFA dataset, spanning pitches F3–E♭4. Each card shows Target, Articulator, DDSP, and B-spline with per-segment LSD (dB) and SpCorr.
10 ethnographic Khoomei recordings from Bergevin et al., organized by singer (T1–T4).
LSD (dB) for the 10 HFA segments, spanning pitches F3–E♭4. B-spline achieves the lowest LSD on every segment.
| Pitch | Pattern | Articulator | DDSP | B-spline |
|---|---|---|---|---|
| F3 | Scale (h8–h10) | 14.0 | 12.3 | 9.9 |
| G3 | Interval (h6–h7) | 13.8 | 11.2 | 9.2 |
| A3 | Arpeggio (h4–h6) | 14.1 | 11.1 | 9.1 |
| A3 | Interval (h5–h6) | 14.0 | 10.7 | 9.8 |
| A3 | Scale (h6–h8) | 12.9 | 10.8 | 9.8 |
| B♭3 | Interval (h6–h7) | 13.2 | 11.3 | 10.0 |
| B3 | Interval (h4–h5) | 14.3 | 10.9 | 9.5 |
| C4 | Arpeggio (h3–h5) | 14.5 | 10.7 | 9.9 |
| D4 | Glissando (h3–h8) | 13.2 | 10.1 | 9.7 |
| E♭4 | Arpeggio (h3–h5) | 14.5 | 10.7 | 9.5 |
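For readers unfamiliar with the metric, one common definition of log-spectral distance is sketched below: the RMS difference between two magnitude spectra in dB, computed per frame and averaged over frames. The exact configuration behind the table (window, hop size, frequency range, magnitude floor) is not stated on this page, so the `eps` floor and the frame layout here are assumptions.

```python
import math


def log_spectral_distance(ref_frames, est_frames, eps=1e-10):
    """Mean over frames of the RMS dB difference between magnitude spectra.

    Each argument is a list of frames; each frame is a list of non-negative
    magnitudes. `eps` is an assumed floor that avoids log(0).
    """
    if not ref_frames or len(ref_frames) != len(est_frames):
        raise ValueError("need the same, non-zero number of frames")
    per_frame = []
    for ref, est in zip(ref_frames, est_frames):
        if len(ref) != len(est):
            raise ValueError("frames must have the same number of bins")
        sq = [
            (20 * math.log10(r + eps) - 20 * math.log10(e + eps)) ** 2
            for r, e in zip(ref, est)
        ]
        per_frame.append(math.sqrt(sum(sq) / len(sq)))
    return sum(per_frame) / len(per_frame)


# Toy spectra: the estimate is the reference scaled by 10, i.e. +20 dB everywhere.
ref = [[1.0, 0.5, 0.25], [0.8, 0.4, 0.2]]
est = [[10 * v for v in frame] for frame in ref]
print(round(log_spectral_distance(ref, est), 3))
```

A uniform gain error of 20 dB yields an LSD of 20 dB, which gives a sense of scale for the 9–15 dB values in the table.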
Metrics for the 1–3 kHz overtone region across all 20 segments. SpCorrOT: spectral correlation in 1–3 kHz; eR: Bergevin energy ratio (1–2 kHz / 0–8 kHz); Sot: overtone salience (dB); HPR: harmonic prominence ratio. Best synthesis value (closest to target) in bold blue.
| Dataset | Method | SpCorrOT | eR | Sot (dB) | HPR |
|---|---|---|---|---|---|
| HFA (10 segments) | Target | — | 0.56 ± 0.12 | 15.13 ± 1.39 | 41.05 ± 4.46 |
| HFA (10 segments) | Articulator | 0.79 ± 0.05 | 0.86 ± 0.23 | 16.04 ± 1.35 | 26.89 ± 4.36 |
| HFA (10 segments) | DDSP | 0.90 ± 0.02 | 0.09 ± 0.03 | 13.65 ± 2.17 | 25.32 ± 6.66 |
| HFA (10 segments) | B-spline | 0.88 ± 0.03 | 0.68 ± 0.10 | 14.99 ± 0.85 | 35.21 ± 3.53 |
| Bergevin (10 segments) | Target | — | 0.43 ± 0.24 | 12.35 ± 1.82 | 22.61 ± 6.72 |
| Bergevin (10 segments) | Articulator | 0.52 ± 0.21 | 0.79 ± 0.20 | 14.88 ± 1.54 | 15.08 ± 4.58 |
| Bergevin (10 segments) | DDSP | 0.86 ± 0.08 | 0.20 ± 0.09 | 10.91 ± 1.86 | 13.42 ± 2.51 |
| Bergevin (10 segments) | B-spline | 0.82 ± 0.12 | 0.54 ± 0.24 | 12.92 ± 2.71 | 17.48 ± 6.02 |
| All (20 segments) | Target | — | 0.49 ± 0.20 | 13.74 ± 2.14 | 31.83 ± 10.84 |
| All (20 segments) | Articulator | 0.66 ± 0.21 | 0.82 ± 0.21 | 15.46 ± 1.56 | 20.99 ± 7.40 |
| All (20 segments) | DDSP | 0.88 ± 0.06 | 0.14 ± 0.09 | 12.28 ± 2.44 | 19.37 ± 7.80 |
| All (20 segments) | B-spline | 0.85 ± 0.09 | 0.61 ± 0.19 | 13.95 ± 2.26 | 26.35 ± 10.15 |
DDSP achieves the highest local spectral correlation (SpCorrOT) because it fits each harmonic independently, but B-spline better preserves the overtone energy structure (eR, Sot, HPR), which is closer to the target values and perceptually more important for the characteristic sygyt timbre.
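To make the energy-ratio metric concrete, the sketch below computes the share of spectral energy in the 1–2 kHz band relative to 0–8 kHz for a synthetic two-tone signal at 16 kHz. It follows the band definition given in the caption, but the inclusive band edges, the lack of windowing, the naive DFT, and the test signal are assumptions; the analysis settings used for the table are not given on this page.

```python
import cmath
import math


def power_spectrum(x, fs):
    """One-sided power spectrum via a naive DFT (fine for short frames)."""
    n = len(x)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def energy_ratio(freqs, power, band=(1000.0, 2000.0), full=(0.0, 8000.0)):
    """Energy in `band` divided by energy in `full` (edges inclusive, an assumption)."""
    num = sum(p for f, p in zip(freqs, power) if band[0] <= f <= band[1])
    den = sum(p for f, p in zip(freqs, power) if full[0] <= f <= full[1])
    return num / den


# Synthetic stand-in for a drone (250 Hz) plus one amplified overtone (1500 Hz).
fs, n = 16000, 256
drone = [math.sin(2 * math.pi * 250 * t / fs) for t in range(n)]
overtone = [math.sin(2 * math.pi * 1500 * t / fs) for t in range(n)]
mix = [d + o for d, o in zip(drone, overtone)]

e_r = energy_ratio(*power_spectrum(mix, fs))
print(round(e_r, 2))
```

With equal-amplitude tones the ratio is about one half. A model that spreads energy away from the overtone band lowers it, and one that over-concentrates energy there raises it.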
2×2 factorial ablation (B-spline KA=40, all 20 segments, 500 iterations). The sublingual source is the primary driver (ΔLSD ≈ 1.0 dB), with per-segment damping providing a secondary improvement (ΔLSD ≈ 0.14 dB). SpCorrOT: spectral correlation in the 1–3 kHz overtone region.
| Condition | Sublingual | Damping | LSD (dB) ↓ | SpCorr ↑ | SpCorrOT ↑ |
|---|---|---|---|---|---|
| Full model | ✓ | ✓ | 9.34 ± 0.49 | 0.868 ± 0.018 | 0.850 ± 0.090 |
| No sublingual | — | ✓ | 10.32 ± 0.56 | 0.820 ± 0.026 | 0.814 ± 0.127 |
| No damping | ✓ | — | 9.48 ± 0.48 | 0.864 ± 0.019 | 0.845 ± 0.091 |
| Minimal | — | — | 10.45 ± 0.61 | 0.815 ± 0.028 | 0.810 ± 0.139 |
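The factor contributions can be read off the four cell means. The sketch below computes the usual 2×2 main effects and the interaction from the LSD column above. Note that the ΔLSD values quoted in the caption are differences relative to the full model (0.98 and 0.14 dB), whereas main effects average over the other factor, so the numbers differ slightly.

```python
# Mean LSD (dB) keyed by (sublingual_on, damping_on), taken from the table.
lsd = {
    (True, True): 9.34,     # full model
    (False, True): 10.32,   # no sublingual
    (True, False): 9.48,    # no damping
    (False, False): 10.45,  # minimal
}


def main_effect(table, factor):
    """LSD reduction from switching one factor on, averaged over the other factor."""
    on = [v for key, v in table.items() if key[factor]]
    off = [v for key, v in table.items() if not key[factor]]
    return sum(off) / len(off) - sum(on) / len(on)


sublingual_effect = main_effect(lsd, 0)
damping_effect = main_effect(lsd, 1)

# Interaction: how much the sublingual gain changes when damping is switched on.
gain_without_damping = lsd[(False, False)] - lsd[(True, False)]
gain_with_damping = lsd[(False, True)] - lsd[(True, True)]
interaction = (gain_without_damping - gain_with_damping) / 2

print(f"sublingual: {sublingual_effect:.3f} dB")
print(f"damping:    {damping_effect:.3f} dB")
print(f"interaction: {interaction:.3f} dB")
```

The interaction term is close to zero, so at the level of these means the two components contribute roughly additively.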
Mean loss curves (±1σ envelope) for the four ablation conditions over 500 iterations.
Visualization of parameters learned by B-spline optimization across all 20 segments.
Publication-quality figures from the paper.
Comparison with related approaches relevant to overtone singing synthesis. Phys. = physical vocal tract model; Diff. = differentiable (gradient-based); C-S = copy-synthesis from audio; Biph. = biphonic/overtone production. LSD on 10 HFA sygyt segments.
| Method | Phys. | Diff. | C-S | Biph. | Formant merging | LSD (dB) ↓ |
|---|---|---|---|---|---|---|
| Sondhi; Story | ✓ | × | × | × | × | — |
| Kob | ✓ | × | × | ✓ | Manual | — |
| Tsai et al. | × | × | × | ✓ | Post-hoc | — |
| Pink Trombone | ✓ | × | × | × | × | — |
| DDSP | × | ✓ | ✓ | × | × | 10.99 ± 0.54 |
| Ours (B-spline) | ✓ | ✓ | ✓ | ✓ | Emergent | 9.64 ± 0.29 |
| Feature | TubeTalkerPkg | VocalTrax (ours) |
|---|---|---|
| Task | Forward synthesis | Copy-synthesis (inverse) |
| Input | Target formants | Target audio |
| Method | SensMap perturbation | Gradient descent |
| Tract sections | 44 | 44 |
| Sample rate | 44.1 kHz | 16 kHz |
| Glottal source | Single LF | Dual (sublingual) |
| Damping | Fixed (wall) | Per-segment (learnable) |
| Differentiable | No | Yes (JAX) |
| Overtone control | Manual (area fn.) | Automatic |
| Platform | MATLAB/MEX | Python/JAX |
We conducted a perceptual evaluation inspired by the MUSHRA methodology (ITU-R BS.1534-3). Twenty listeners (3 expert, 2 trained, 15 novice; one monotone respondent excluded) evaluated 6 segments (3 per dataset) in a blind A/B/C comparison via a web interface. Each trial presented the reference recording followed by three unlabeled synthesized conditions. Listeners rated each condition on two 0–100 anchored scales: Q1 (overall quality) and Q2 (harmonic similarity).
Scale anchors: Bad (0) – Poor (20) – Acceptable (40) – Good (60) – Very good (80) – Excellent (100). Presentation order and condition assignment were randomized per listener using a seeded PRNG. Two labeled practice trials preceded the test for familiarization.
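A per-listener seeded randomization of this kind could look like the sketch below. The seed format, the function name, and the segment identifiers are hypothetical; the page states only that a seeded PRNG randomized presentation order and condition assignment per listener.

```python
import random

CONDITIONS = ["articulator", "ddsp", "bspline"]


def trial_plan(listener_id, segments, base_seed=0):
    """Return a reproducible list of (segment, {label: condition}) for one listener."""
    # A string seed keeps the plan reproducible across sessions for the same listener.
    rng = random.Random(f"{base_seed}-{listener_id}")
    order = list(segments)
    rng.shuffle(order)                    # randomize presentation order
    plan = []
    for seg in order:
        conds = CONDITIONS[:]
        rng.shuffle(conds)                # randomize which method hides behind A/B/C
        plan.append((seg, dict(zip("ABC", conds))))
    return plan


# Hypothetical segment identifiers: 3 per dataset, 6 in total.
segments = ["hfa_1", "hfa_2", "hfa_3", "berg_1", "berg_2", "berg_3"]
for seg, labels in trial_plan(listener_id=7, segments=segments):
    print(seg, labels)
```

Seeding from the listener identifier means a listener's plan can be regenerated later to map A/B/C ratings back to methods without storing the mapping separately.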
Below is a reconstruction of one test trial with method labels revealed. In the actual test, conditions were labeled A/B/C in randomized order.
| Method | HFA (3 segments) Q1 (Quality) | HFA (3 segments) Q2 (Harmonic) | Bergevin (3 segments) Q1 (Quality) | Bergevin (3 segments) Q2 (Harmonic) |
|---|---|---|---|---|
| Articulator chain | 13.6 ± 3.7 | 25.3 ± 4.5 | 11.6 ± 3.6 | 15.7 ± 3.9 |
| DDSP | 42.4 ± 5.0 | 50.3 ± 4.9 | 35.4 ± 5.5 | 45.1 ± 5.4 |
| B-spline (ours) | 44.8 ± 5.5 | 52.9 ± 5.8 | 38.5 ± 5.3 | 46.3 ± 5.6 |
B-spline and DDSP are both significantly preferred over the articulator chain (Friedman test, p < 10⁻⁴; pairwise Wilcoxon with Holm–Bonferroni: p < 0.001, Cliff’s δ > 0.7). The B-spline–DDSP difference is not statistically significant (p > 0.19). 20 listeners, 6 stimuli (3 per dataset), 0–100 anchored scales.
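Two of the quantities named above are simple enough to sketch directly: the Holm–Bonferroni step-down adjustment of a family of p-values, and Cliff's δ as an effect size. The ratings and p-values below are made up for illustration and are not the study data; the Friedman and Wilcoxon tests themselves are omitted.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))


def holm_bonferroni(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest p by (m - rank), enforce monotonicity, cap at 1.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


# Hypothetical 0-100 ratings for two conditions (not the study data).
bspline_scores = [48, 52, 40, 61, 45]
articulator_scores = [12, 20, 9, 15, 30]
print(cliffs_delta(bspline_scores, articulator_scores))

# Hypothetical raw p-values for three pairwise comparisons.
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

A δ above 0.7, as reported for both comparisons against the articulator chain, means the preferred method received the higher rating in the large majority of pairings.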
We apply the B-spline model to two sygyt segments extracted from real-time MRI recordings (MPILARS dataset), demonstrating generalization to a third independent data source. The MRI data provide ground-truth vocal tract geometry for qualitative comparison with the learned B-spline profiles.
B-spline copy-synthesis results for the vibrato segment: spectrogram comparison and learned parameters.
Full 2D spectrogram of the vibrato segment.
B-spline copy-synthesis results for the stable segment: spectrogram comparison and learned parameters.
Full 2D spectrogram of the stable segment.