Abstract

In Music Information Retrieval (MIR), modeling and transforming the tone of musical instruments, particularly the electric guitar, has attracted increasing attention due to the richness of the instrument's tone and its expressive flexibility. Tone morphing enables smooth transitions between different guitar sounds, giving musicians greater freedom to explore new textures and personalize their performances. This study explores learning-based approaches to guitar tone morphing, beginning with LoRA fine-tuning to improve model performance under limited data. We then introduce a simpler method, spherical interpolation in the Music2Latent latent space, which yields significantly better results than the more complex fine-tuning approach. Experiments show that the proposed architecture generates smoother and more natural tone transitions, making it a practical and efficient tool for music production and real-time audio effects.
Method Overview
We investigate three model variants for guitar tone morphing: two diffusion-based variants with different LoRA placements and one lightweight latent-interpolation approach, together with standard interpolation operators. Given two input tones, we encode them into latent vectors and interpolate in latent space (and, for the dual-sided variant, in parameter space) to trace a smooth morphing trajectory, then decode back to audio. When content preservation is critical, Adaptive Instance Normalization (AdaIN) is optionally applied to align feature statistics.
A. Single-Sided LoRA Fine-Tuning
Built on an AudioLDM-style pipeline, this variant uses textual inversion to enrich prompts and then applies LoRA fine-tuning to the conditional U-Net (decoder side) to better align text embeddings with latent features. The unconditional U-Net is independently fine-tuned with a small LoRA rank (e.g., 2) without textual input. At inference time, we perform SLERP on the latent vectors of the two input tones and LERP on their text embeddings. The interpolated representation is then decoded through the unconditional U-Net, the VAE, and the vocoder. Only the decoder is adapted; the encoder remains frozen, enabling expressive tone blending while limiting content drift.
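For concreteness, the sketch below shows one way to attach a rank-2 LoRA to the attention projections of a diffusers U-Net using the peft library. The checkpoint id, target module names, and hyperparameters are illustrative assumptions and are not taken from the released code.

```python
# Sketch: attaching a rank-2 LoRA to the attention projections of an
# AudioLDM-style U-Net with the `peft` library. The checkpoint id and the
# target module names are assumptions, not the paper's exact configuration.
from diffusers import AudioLDMPipeline
from peft import LoraConfig, get_peft_model

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")  # assumed checkpoint
unet = pipe.unet

lora_cfg = LoraConfig(
    r=2,                                                  # small rank, as in the paper
    lora_alpha=2,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
    lora_dropout=0.0,
)
unet = get_peft_model(unet, lora_cfg)  # only the injected LoRA weights are trainable
unet.print_trainable_parameters()
```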
B. Dual-Sided LoRA Fine-Tuning
This variant fine-tunes two U-Nets (conditional and unconditional) for each tone domain, with LoRA rank 2 after textual inversion. At inference, we use SLERP for latent vectors, LERP for prompt embeddings, and additionally perform parameter-space LERP to merge the two U-Nets into a model specialized for the interpolated tone. AdaIN is integrated in the decoding path to match the mean and variance of the output to the interpolated style, offering more precise control over stylistic attributes.
$$\mathrm{AdaIN}(\mathbf{x},\mathbf{y})=\sigma(\mathbf{y})\left(\frac{\mathbf{x}-\mu(\mathbf{x})}{\sigma(\mathbf{x})}\right)+\mu(\mathbf{y})$$
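As a minimal sketch (assuming PyTorch tensors with a leading batch and channel dimension, and two U-Nets with matching state-dict keys), the two inference-time operations specific to this variant, parameter-space LERP between the fine-tuned U-Nets and channel-wise AdaIN, can be written as:

```python
# Sketch (assumed shapes and names): parameter-space LERP between two
# fine-tuned U-Nets, and channel-wise AdaIN following the equation above.
import torch

def lerp_state_dicts(sd_a, sd_b, alpha):
    """Merge two state dicts with identical keys: (1 - alpha) * A + alpha * B."""
    return {k: (1.0 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

def adain(x, y, eps=1e-5):
    """Match the per-channel mean/std of x to those of y.
    x, y: (batch, channels, ...) feature maps or latents."""
    dims = tuple(range(2, x.dim()))                           # spatial/time dims
    mu_x, std_x = x.mean(dims, keepdim=True), x.std(dims, keepdim=True) + eps
    mu_y, std_y = y.mean(dims, keepdim=True), y.std(dims, keepdim=True) + eps
    return std_y * (x - mu_x) / std_x + mu_y

# Hypothetical usage: specialize a U-Net for the interpolated tone.
# merged = lerp_state_dicts(unet_tone_a.state_dict(), unet_tone_b.state_dict(), alpha=0.5)
# unet_interp.load_state_dict(merged)
```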
C. Music2Latent Interpolation
This method relies on a lightweight encoder–decoder that reconstructs audio in a learned latent space without diffusion or text conditioning. We encode both inputs, perform spherical interpolation (SLERP) between their latent vectors, and decode back to audio. Although it does not involve fine-tuning, it provides a clean benchmark of latent-space interpolation quality. AdaIN can also be applied during morphing to better align style characteristics and enhance naturalness.
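A minimal sketch of this pipeline is given below, assuming the open-source music2latent package (EncoderDecoder with encode/decode), 44.1 kHz input, equal-length clips, and placeholder file names. As an implementation detail not stated above, the sin-weighted blend is applied to the raw (unnormalized) latents so that their scale is interpolated as well.

```python
# Sketch of the lightweight morphing path, assuming the `music2latent`
# package and equal-length source/target clips; file names are placeholders.
import librosa
import soundfile as sf
import torch
from music2latent import EncoderDecoder

def slerp(alpha, v0, v1, eps=1e-8):
    # SLERP as in the Interpolation section; the angle is computed from the
    # normalized latents, the blend is applied to the raw latents.
    v0n, v1n = v0 / (v0.norm() + eps), v1 / (v1.norm() + eps)
    theta = torch.arccos(torch.clamp((v0n * v1n).sum(), -1.0, 1.0))
    if theta.abs() < 1e-4:                              # nearly parallel: plain LERP
        return (1 - alpha) * v0 + alpha * v1
    return (torch.sin((1 - alpha) * theta) * v0
            + torch.sin(alpha * theta) * v1) / torch.sin(theta)

encdec = EncoderDecoder()
wav_a, sr = librosa.load("tone_source.wav", sr=44100)
wav_b, _ = librosa.load("tone_target.wav", sr=44100)

lat_a = encdec.encode(wav_a)                            # latent tensor
lat_b = encdec.encode(wav_b)

for alpha in (0.25, 0.50, 0.75):                        # morph progress
    lat_mix = slerp(alpha, lat_a.flatten(), lat_b.flatten()).reshape(lat_a.shape)
    wav_mix = encdec.decode(lat_mix)
    sf.write(f"morph_{alpha:.2f}.wav", wav_mix.squeeze().cpu().numpy(), sr)
```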
Interpolation
$$\hat{\mathbf{v}}_i=\frac{\mathbf{v}_i}{\lVert \mathbf{v}_i\rVert},\; i\in\{0,1\}, \quad \theta_0=\arccos\!\big(\hat{\mathbf{v}}_0 \cdot \hat{\mathbf{v}}_1\big)$$
$$\mathrm{LERP}(\alpha,\hat{\mathbf{v}}_0,\hat{\mathbf{v}}_1)=(1-\alpha)\hat{\mathbf{v}}_0+\alpha\hat{\mathbf{v}}_1$$
$$\mathrm{SLERP}(\alpha,\hat{\mathbf{v}}_0,\hat{\mathbf{v}}_1) =\frac{\sin\big((1-\alpha)\theta_0\big)}{\sin(\theta_0)}\,\hat{\mathbf{v}}_0 +\frac{\sin\big(\alpha\theta_0\big)}{\sin(\theta_0)}\,\hat{\mathbf{v}}_1$$
Notation. $\mathbf{v}_0,\mathbf{v}_1$ are the latent vectors of the two tones; $\hat{\mathbf{v}}$ denotes $\ell_2$-normalized vectors; $\theta_0$ is the angle between $\hat{\mathbf{v}}_0$ and $\hat{\mathbf{v}}_1$; $\alpha \in [0,1]$ controls the morph progress. LERP is straight-line interpolation in latent space; SLERP moves along the great-circle on the unit hypersphere, preserving constant angular speed and norm.
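These formulas map directly to code; below is a minimal NumPy transcription (the small-angle fallback to LERP is a numerical-stability detail added here, not stated above):

```python
# Direct NumPy transcription of the equations above (alpha in [0, 1]).
import numpy as np

def normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def lerp(alpha, v0_hat, v1_hat):
    return (1.0 - alpha) * v0_hat + alpha * v1_hat

def slerp(alpha, v0_hat, v1_hat, eps=1e-7):
    theta0 = np.arccos(np.clip(np.dot(v0_hat, v1_hat), -1.0, 1.0))
    if theta0 < eps:                      # nearly parallel: fall back to LERP
        return lerp(alpha, v0_hat, v1_hat)
    return (np.sin((1.0 - alpha) * theta0) * v0_hat
            + np.sin(alpha * theta0) * v1_hat) / np.sin(theta0)
```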
Results
Perceptual Results — CDPAM ↓ & MOS ↑
We evaluate perceptual quality with CDPAM (↓ better) and a MOS listening test (↑ better). Multi-resolution Spectral Convergence (SC) is additionally computed to diagnose reconstruction fidelity across STFT settings.
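CDPAM scores can be computed with the metric's open-source cdpam package; the snippet below follows its documented interface with placeholder file names, and is an illustration of the metric rather than the exact evaluation script used here.

```python
# Sketch: CDPAM distance between a source tone and a morphed output,
# assuming the open-source `cdpam` package; file names are placeholders.
import cdpam

loss_fn = cdpam.CDPAM()                        # pretrained perceptual metric
wav_ref = cdpam.load_audio("tone_source.wav")  # load audio as expected by the metric
wav_out = cdpam.load_audio("morph_0.50.wav")
dist = loss_fn.forward(wav_ref, wav_out)       # lower = perceptually closer
print(float(dist))
```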
TABLE I — Tone Morphing Quality. Lower CDPAM and higher MOS indicate better perceptual quality.
| Model / Setting | CDPAM mean ± std ↓ | MOS ↑ |
|---|---|---|
| AudioLDM — w/o LoRA | 0.32 ± 0.100 | 3.17 |
| AudioLDM — w/ 1 LoRA | 0.45 ± 0.140 | 1.07 |
| AudioLDM — w/ 2 LoRA | 0.22 ± 0.132 | 3.03 |
| AudioLDM2 — w/o LoRA | 0.25 ± 0.120 | 3.30 |
| AudioLDM2 — w/ 1 LoRA | 0.34 ± 0.110 | 2.70 |
| AudioLDM2 — w/ 2 LoRA | 0.33 ± 0.122 | 3.23 |
| MusicLDM — w/o LoRA | 0.85 ± 0.116 | 1.97 |
| MusicLDM — w/ 1 LoRA | 0.19 ± 0.120 | 3.70 |
| MusicLDM — w/ 2 LoRA | 0.08 ± 0.114 | 1.20 |
| Spherical Music2Latent interpolation | 0.13 ± 0.060 | 4.30 |
Some Interpolation Audio Samples
Tone 1.wav
Tone 2.wav
Tone 3 (Source).wav
Tone 4 (Target).wav
Spectral Convergence (SC) Results — Decoder/Vocoder & Reconstruction
To further analyze decoder/vocoder behavior, we report multi-resolution Spectral Convergence (SC). The Without LDM VAE block isolates the vocoders; the With LDM VAE block additionally includes VAE decoding to reflect the end-to-end artifacts introduced by latent reconstruction.
$$\mathrm{SC}(M_r, M_t) \;=\; \frac{\sqrt{\sum_{m,k}\,\big(M_t(m,k)-M_r(m,k)\big)^2}} {\sqrt{\sum_{m,k}\, M_t(m,k)^2}}\;.$$
Notation. $M_t(m,k)$ is the target (ground-truth) magnitude spectrogram at frame $m$ and frequency bin $k$; $M_r(m,k)$ is the reconstructed magnitude spectrogram produced by the model under test. The sums run over all time–frequency indices $(m,k)$. We compute SC for multiple STFT settings and average the results; lower is better.
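A minimal sketch of this computation is shown below; the specific STFT settings are illustrative assumptions.

```python
# Sketch: multi-resolution Spectral Convergence, matching the equation above.
# The STFT settings (FFT sizes / hop lengths) are illustrative assumptions.
import numpy as np
import librosa

def spectral_convergence(mag_rec, mag_target):
    # Frobenius norm of the difference, normalized by the target's norm.
    return np.linalg.norm(mag_target - mag_rec) / (np.linalg.norm(mag_target) + 1e-12)

def multi_res_sc(y_rec, y_target, settings=((512, 128), (1024, 256), (2048, 512))):
    scores = []
    for n_fft, hop in settings:
        m_rec = np.abs(librosa.stft(y_rec, n_fft=n_fft, hop_length=hop))
        m_tgt = np.abs(librosa.stft(y_target, n_fft=n_fft, hop_length=hop))
        frames = min(m_rec.shape[1], m_tgt.shape[1])        # align frame counts
        scores.append(spectral_convergence(m_rec[:, :frames], m_tgt[:, :frames]))
    return float(np.mean(scores))
```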
TABLE II — The Reconstruction Spectral Convergence (SC) Loss Result. Lower SC is better.
| Model | Mean | Median |
|---|---|---|
| **Without LDM VAE** | | |
| BigVGAN | 0.03978 | 0.04005 |
| HifiGAN (AudioLDM) | 0.72955 | 0.15795 |
| HifiGAN (AudioLDM2) | 0.72955 | 0.15795 |
| HifiGAN (MusicLDM) | 0.30206 | 0.15017 |
| **With LDM VAE** | | |
| HifiGAN (AudioLDM) | 1.80133 | 0.79835 |
| HifiGAN (AudioLDM2) | 0.49473 | 0.13271 |
| HifiGAN (MusicLDM) | 0.46880 | 0.11239 |
| BigVGAN (AudioLDM) | 1.10628 | 0.35980 |
| BigVGAN (AudioLDM2) | 0.50800 | 0.12355 |
| BigVGAN (MusicLDM) | 0.65704 | 0.22113 |
- Best perceptual quality: Direct spherical interpolation in Music2Latent achieves the top MOS (4.3) and one of the lowest CDPAM scores, yielding the smoothest and most natural morphs.
- LoRA is not always beneficial: Depending on the backbone (AudioLDM / AudioLDM2 / MusicLDM) and LoRA placement, perceptual quality can decrease due to content shift or artifacts under limited data.
- SC (diagnostic): Multi-resolution SC highlights decoder/vocoder losses; adding VAE reconstruction exposes the end-to-end artifacts typical of LDM pipelines.
Some Reconstruction Audio Samples
Ground truth.wav
Reconstructed result (HifiGAN + MusicLDM VAE).wav
Ground truth.wav
Reconstructed result (HifiGAN + MusicLDM VAE).wav
BibTeX
@misc{chen2025guitartonemorphingdiffusionbased,
title={Guitar Tone Morphing by Diffusion-based Model},
author={Kuan-Yu Chen and Kuan-Lin Chen and Yu-Chieh Yu and Jian-Jiun Ding},
year={2025},
eprint={2510.07908},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2510.07908},
}