Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by effectively incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages of sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation.
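To make the pipeline above concrete, the following is a minimal sketch of how the three components could be chained at sampling time: the semantic LM produces semantic tokens from the text prompt, the DPD model iteratively denoises latents while cross-attending to those tokens, and the audio VAE-GAN decoder maps the latents to a waveform. All module interfaces, tensor shapes, and step counts below are illustrative assumptions, not the released implementation.

# Hypothetical sketch of MeLoDy's sampling pipeline: a semantic LM produces
# semantic tokens from text, a dual-path diffusion (DPD) model decodes them
# into VAE latents via iterative denoising with cross-attention conditioning,
# and an audio VAE-GAN decoder converts the latents into a waveform.
# All interfaces and shapes below are illustrative assumptions.
import torch


def sample_music(text_prompt, semantic_lm, dpd_model, vae_decoder,
                 num_denoise_steps=20, latent_shape=(1, 625, 64)):
    """Generate a waveform from a text prompt (interfaces are hypothetical)."""
    # 1) Semantic modeling: the LM (inherited from MusicLM) maps the text
    #    prompt to a sequence of discrete semantic tokens.
    semantic_tokens = semantic_lm.generate(text_prompt)          # (1, T_sem)

    # 2) Acoustic modeling: DPD denoises Gaussian noise in the VAE latent
    #    space, attending to the semantic tokens at every denoising step
    #    (replacing MusicLM's coarse and fine acoustic LMs).
    latents = torch.randn(latent_shape)
    for step in reversed(range(num_denoise_steps)):
        latents = dpd_model.denoise_step(latents, step, cond=semantic_tokens)

    # 3) Waveform synthesis: the VAE-GAN decoder maps latents to audio.
    waveform = vae_decoder(latents)                              # (1, num_samples)
    return waveform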
Overview of MeLoDy
Generation with Text Prompts from MusicLM
Given the same text prompt, we pair-wise compare MeLoDy's generated samples (non-cherry-picked) with the samples taken from MusicLM's demo page. For a fair comparison, text prompts involving vocals or human voice are ignored since MeLoDy was mainly trained with non-vocal music data. We also drop the painting-conditioned prompts, since the evaluation of visual-audio correlation tends to be highly varied and dependent on subjective feelings. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the audio to double its duration (see Algorithm 2 in the Appendix).
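The continuation procedure itself is given as Algorithm 2 in the Appendix; the sketch below is only an assumed prefix-conditioned variant to illustrate how an already generated clip could be prolonged, and every interface in it (the VAE encoder, the token continuation call, the latent frame rate, and the overlap length) is hypothetical.

# Hypothetical sketch of music continuation (the actual procedure is
# Algorithm 2 in the paper's Appendix; this is an assumed variant).
# Idea: continue the semantic token sequence with the LM, then let DPD
# generate a new block of latents while the tail of the existing latents
# is kept fixed as a prefix, so the new segment joins the old audio smoothly.
import torch


def continue_music(waveform, semantic_lm, dpd_model, vae_encoder, vae_decoder,
                   num_denoise_steps=20, prefix_seconds=2.0, latent_rate=62.5):
    """Double the duration of `waveform` (all interfaces are illustrative)."""
    # Re-encode the existing audio into VAE latents and semantic tokens.
    latents = vae_encoder(waveform)                      # (1, T_lat, D)
    semantic_tokens = semantic_lm.tokenize(waveform)     # (1, T_sem)

    # Continue the semantic sequence to cover the new segment.
    continued_tokens = semantic_lm.continue_tokens(semantic_tokens)

    # Anchor the last `prefix_seconds` of latents and denoise fresh noise for
    # the remaining frames, conditioned on the continued semantic tokens.
    prefix_len = int(prefix_seconds * latent_rate)
    prefix = latents[:, -prefix_len:]
    total_len = prefix_len + latents.shape[1]
    new_latents = torch.randn(latents.shape[0], total_len, latents.shape[2])
    for step in reversed(range(num_denoise_steps)):
        new_latents[:, :prefix_len] = prefix             # keep the overlap fixed
        new_latents = dpd_model.denoise_step(new_latents, step,
                                             cond=continued_tokens)

    # Decode only the frames past the overlap and append them to the input.
    new_audio = vae_decoder(new_latents[:, prefix_len:])
    return torch.cat([waveform, new_audio], dim=-1)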
Generation with Text Prompts from Noise2Music
Given the same text prompt, we pair-wise compare MeLoDy's generated samples (non-cherry-picked) with the samples taken from Noise2Music's demo page. For a fair comparison, text prompts involving vocals or human voice are ignored since MeLoDy was mainly trained with non-vocal music data. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the audio to double its duration (see Algorithm 2 in the Appendix).
Diversity and Validity of MeLoDy's Samples
We use 2 thoughtfully designed text prompts to analyze the diversity and validity of MeLoDy's samples, and present 10 samples (non-cherry-picked) per text prompt. In addition, we take 4 non-vocal music clips from MusicCaps as music prompts for MeLoDy, and generate 5 samples (non-cherry-picked) per music prompt to show that MeLoDy is capable of generating diverse music audios of similar style. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the sampled 10s audio to 20s (see Algorithm 2 in the Appendix).
Text Prompt 1: Give me a background music track suitable for time-lapse videos.
Ablation: Generation with Different Angle Schedules
In this section, we present the generated examples for the ablation study on angle schedules (in Appendix D). We analyze the quality of samples generated with the uniform angle schedule and with our proposed angle schedule, respectively, using 3 representative text prompts: acoustic guitar, flute, and saxophone.
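For reference, the snippet below shows how a uniform angle schedule can be constructed under the common angle parameterization x_delta = cos(delta) * x_0 + sin(delta) * noise with delta in [0, pi/2]; the warped schedule included alongside is only an illustrative non-uniform alternative, not the schedule proposed in Appendix D.

# Sketch of angle schedules for an angle-parameterized diffusion sampler.
# The uniform schedule matches the "uniform" baseline in this ablation; the
# warped schedule is a hypothetical non-uniform alternative for illustration
# only, not the proposed schedule from Appendix D of the paper.
import numpy as np


def uniform_angle_schedule(num_steps):
    """Evenly spaced angles from pi/2 (pure noise) down to 0 (clean latent)."""
    return np.linspace(np.pi / 2, 0.0, num_steps + 1)


def warped_angle_schedule(num_steps, power=2.0):
    """Hypothetical non-uniform schedule spending more steps at small angles."""
    u = np.linspace(1.0, 0.0, num_steps + 1)
    return (np.pi / 2) * u ** power


print(uniform_angle_schedule(5))
print(warped_angle_schedule(5))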