Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% for sampling 10s or 30s music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by effectively incorporating the semantic information into segments of latents via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages of sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation.
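To make the pipeline above concrete, the following is a minimal sketch of how the three components could be chained at sampling time: the semantic LM produces semantic tokens from the text prompt, the DPD model iteratively denoises latents while cross-attending to those tokens, and the audio VAE-GAN decoder maps the latents to a waveform. All module interfaces, tensor shapes, and step counts below are illustrative assumptions, not the released implementation.

# Hypothetical sketch of MeLoDy's sampling pipeline: a semantic LM produces
# semantic tokens from text, a dual-path diffusion (DPD) model decodes them
# into VAE latents via iterative denoising with cross-attention conditioning,
# and an audio VAE-GAN decoder converts the latents into a waveform.
# All interfaces and shapes below are illustrative assumptions.
import torch


def sample_music(text_prompt, semantic_lm, dpd_model, vae_decoder,
                 num_denoise_steps=20, latent_shape=(1, 625, 64)):
    """Generate a waveform from a text prompt (interfaces are hypothetical)."""
    # 1) Semantic modeling: the LM (inherited from MusicLM) maps the text
    #    prompt to a sequence of discrete semantic tokens.
    semantic_tokens = semantic_lm.generate(text_prompt)          # (1, T_sem)

    # 2) Acoustic modeling: DPD denoises Gaussian noise in the VAE latent
    #    space, attending to the semantic tokens at every denoising step
    #    (replacing MusicLM's coarse and fine acoustic LMs).
    latents = torch.randn(latent_shape)
    for step in reversed(range(num_denoise_steps)):
        latents = dpd_model.denoise_step(latents, step, cond=semantic_tokens)

    # 3) Waveform synthesis: the VAE-GAN decoder maps latents to audio.
    waveform = vae_decoder(latents)                              # (1, num_samples)
    return waveform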
Overview of MeLoDy
Generation with Text Prompts from MusicLM
Given the same text prompt, we pair-wise compare MeLoDy's generated samples (non-cherry-picked) with the samples taken from MusicLM's demo page. For a fair comparison, text prompts involving vocals or human voice are ignored since MeLoDy was mainly trained with non-vocal music data. We also drop the painting-conditioned prompts, since the evaluation of visual-audio correlation tends to be highly varied and dependent on subjective feelings. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the audio to double its duration (see Algorithm 2 in the Appendix).
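The continuation procedure itself is given as Algorithm 2 in the Appendix; the sketch below is only an assumed prefix-conditioned variant to illustrate how an already generated clip could be prolonged, and every interface in it (the VAE encoder, the token continuation call, the latent frame rate, and the overlap length) is hypothetical.

# Hypothetical sketch of music continuation (the actual procedure is
# Algorithm 2 in the paper's Appendix; this is an assumed variant).
# Idea: continue the semantic token sequence with the LM, then let DPD
# generate a new block of latents while the tail of the existing latents
# is kept fixed as a prefix, so the new segment joins the old audio smoothly.
import torch


def continue_music(waveform, semantic_lm, dpd_model, vae_encoder, vae_decoder,
                   num_denoise_steps=20, prefix_seconds=2.0, latent_rate=62.5):
    """Double the duration of `waveform` (all interfaces are illustrative)."""
    # Re-encode the existing audio into VAE latents and semantic tokens.
    latents = vae_encoder(waveform)                      # (1, T_lat, D)
    semantic_tokens = semantic_lm.tokenize(waveform)     # (1, T_sem)

    # Continue the semantic sequence to cover the new segment.
    continued_tokens = semantic_lm.continue_tokens(semantic_tokens)

    # Anchor the last `prefix_seconds` of latents and denoise fresh noise for
    # the remaining frames, conditioned on the continued semantic tokens.
    prefix_len = int(prefix_seconds * latent_rate)
    prefix = latents[:, -prefix_len:]
    total_len = prefix_len + latents.shape[1]
    new_latents = torch.randn(latents.shape[0], total_len, latents.shape[2])
    for step in reversed(range(num_denoise_steps)):
        new_latents[:, :prefix_len] = prefix             # keep the overlap fixed
        new_latents = dpd_model.denoise_step(new_latents, step,
                                             cond=continued_tokens)

    # Decode only the frames past the overlap and append them to the input.
    new_audio = vae_decoder(new_latents[:, prefix_len:])
    return torch.cat([waveform, new_audio], dim=-1)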
Generation with Text Prompts from Noise2Music
Given the same text prompt, we pair-wise compare MeLoDy's generated samples (non-cherry-picked) with the samples taken from Noise2Music's demo page. For a fair comparison, text prompts involving vocals or human voice are ignored since MeLoDy was mainly trained with non-vocal music data. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the audio to double its duration (see Algorithm 2 in the Appendix).
Diversity and Validity of MeLoDy's Samples
We use 2 thoughtfully designed text prompts to analyze the diversity and validity of MeLoDy's samples, and present 10 samples (non-cherry-picked) per text prompt. In addition, we take 4 non-vocal music clips from MusicCaps as music prompts for MeLoDy, and generate 5 samples (non-cherry-picked) per music prompt to show that MeLoDy is capable of generating diverse music audios of similar style. For each generated music clip, we demonstrate MeLoDy's ability of music continuation by prolonging the sampled 10s audio to 20s (see Algorithm 2 in the Appendix).
Text Prompt 1: Give me a background music track suitable for time-lapse videos.
Ablation: Generation with Different Angle Schedules
In this section, we present the generated examples for the ablation study on angle schedules (in Appendix D). We analyze the quality of samples generated with the uniform angle schedule and with our proposed angle schedule, respectively, using 3 representative text prompts: acoustic guitar, flute, and saxophone.
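For reference, the snippet below shows how a uniform angle schedule can be constructed under the common angle parameterization x_delta = cos(delta) * x_0 + sin(delta) * noise with delta in [0, pi/2]; the warped schedule included alongside is only an illustrative non-uniform alternative, not the schedule proposed in Appendix D.

# Sketch of angle schedules for an angle-parameterized diffusion sampler.
# The uniform schedule matches the "uniform" baseline in this ablation; the
# warped schedule is a hypothetical non-uniform alternative for illustration
# only, not the proposed schedule from Appendix D of the paper.
import numpy as np


def uniform_angle_schedule(num_steps):
    """Evenly spaced angles from pi/2 (pure noise) down to 0 (clean latent)."""
    return np.linspace(np.pi / 2, 0.0, num_steps + 1)


def warped_angle_schedule(num_steps, power=2.0):
    """Hypothetical non-uniform schedule spending more steps at small angles."""
    u = np.linspace(1.0, 0.0, num_steps + 1)
    return (np.pi / 2) * u ** power


print(uniform_angle_schedule(5))
print(warped_angle_schedule(5))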