Imagine a professional musician being able to explore new compositions without having to play a single note on an instrument. Or an indie game developer populating virtual worlds with realistic sound effects and ambient noise on a shoestring budget. Or a small business owner adding a soundtrack to their latest Instagram post with ease. That’s the promise of AudioCraft — our simple framework that generates high-quality, realistic audio and music from text-based user inputs after training on raw audio signals as opposed to MIDI or piano rolls.
AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen, which was trained with Meta-owned and specifically licensed music, generates music from text-based user inputs, while AudioGen, which was trained on public sound effects, generates audio from text-based user inputs. Today, we’re excited to release an improved version of our EnCodec decoder, which allows for higher quality music generation with fewer artifacts; our pre-trained AudioGen model, which lets you generate environmental sounds and sound effects like a dog barking, cars honking, or footsteps on a wooden floor; and all of the AudioCraft model weights and code. The models are available for research purposes and to further people’s understanding of the technology. We’re excited to give researchers and practitioners access so they can train their own models with their own datasets for the first time and help advance the state of the art.
From text to audio with ease
In recent years, generative AI models, including language models, have made huge strides and shown exceptional abilities: from generating a wide variety of images and video that exhibit spatial understanding from text descriptions, to text and speech models that perform machine translation, to text or speech dialogue agents. Yet while we’ve seen a lot of excitement around generative AI for images, video, and text, audio has always seemed to lag a bit behind. There’s some work out there, but it’s highly complicated and not very open, so people aren’t able to readily play with it.
Generating high-fidelity audio of any kind requires modeling complex signals and patterns at varying scales. Music is arguably the most challenging type of audio to generate because it’s composed of local and long-range patterns, from a suite of notes to a global musical structure with multiple instruments. Generating coherent music with AI has often been addressed through the use of symbolic representations like MIDI or piano rolls. However, these approaches are unable to fully grasp the expressive nuances and stylistic elements found in music. More recent advances leverage self-supervised audio representation learning and a number of hierarchical or cascaded models to generate music, feeding the raw audio into a complex system in order to capture long-range structures in the signal while generating quality audio. But we knew that more could be done in this field.
The AudioCraft family of models is capable of producing high-quality audio with long-term consistency, and it is easy to interact with through a natural interface. With AudioCraft, we simplify the overall design of generative models for audio compared to prior work in the field, giving people the full recipe to play with the existing models that Meta has been developing over the past several years while also empowering them to push the limits and develop their own models.
AudioCraft works for music and sound generation and compression — all in the same place. Because it’s easy to build on and reuse, people who want to build better sound generators, compression algorithms, or music generators can do it all in the same code base and build on top of what others have done.
And while a lot of work went into making the models simple, the team was equally committed to ensuring that AudioCraft could support the state of the art. People can easily extend our models and adapt them to their use cases for research. There are nearly limitless possibilities once you give people access to the models to tune them to their needs. And that’s what we want to do with this family of models: give people the power to extend their work.
A simple approach to audio generation
Generating audio from raw audio signals is challenging as it requires modeling extremely long sequences. A typical music track of a few minutes sampled at 44.1 kHz (which is the standard quality of music recordings) consists of millions of timesteps. In comparison, text-based generative models like Llama and Llama 2 are fed text processed as sub-words that represent just a few thousand timesteps per sample.
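To make the scale concrete, here is a small back-of-the-envelope calculation. The 44.1 kHz rate comes from the paragraph above; the 3-minute track length and the 4,000-token text sample are assumptions chosen only for illustration.

```python
SAMPLE_RATE_HZ = 44_100  # sampling rate quoted in the text


def audio_timesteps(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw waveform samples (per channel) in a clip."""
    return round(duration_seconds * sample_rate_hz)


# Assumptions for illustration only: a 3-minute track, a 4,000-token text sample.
TRACK_SECONDS = 3 * 60
TEXT_TOKENS = 4_000

track_timesteps = audio_timesteps(TRACK_SECONDS)
ratio = track_timesteps / TEXT_TOKENS

print(f"{track_timesteps:,} audio timesteps per channel")
print(f"roughly {ratio:,.0f} times longer than the text sample")
```

Under these assumptions a single track is about 7.9 million timesteps per channel, which is why modeling the raw waveform directly is so much harder than modeling text.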
To address this challenge, we learn discrete audio tokens from the raw signal using the EnCodec neural audio codec, which gives us a new fixed “vocabulary” for music samples. We can then train autoregressive language models over these discrete audio tokens to generate new tokens and new sounds and music when converting the tokens back to the audio space with EnCodec’s decoder.
Learning audio tokens from the waveform
EnCodec is a lossy neural codec that was trained specifically to compress any kind of audio and reconstruct the original signal with high fidelity. It consists of an autoencoder with a residual vector quantization bottleneck that produces several parallel streams of audio tokens with a fixed vocabulary. In residual vector quantization, each successive quantizer encodes the error left by the ones before it, so the different streams capture different levels of information about the audio waveform, allowing us to reconstruct the audio with high fidelity from all the streams.
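The idea of a residual vector quantization bottleneck can be sketched in a few lines of plain Python. This is a toy illustration, not EnCodec's implementation: the two tiny hand-written codebooks, the 2-dimensional frames, and all function names are hypothetical, whereas a real codec learns its codebooks and operates on the encoder's latent frames.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize x in stages; each stage encodes the residual left by earlier stages."""
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Sum the selected entries; passing fewer tokens gives a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def encode_frames(frames: Sequence[Vector], codebooks: Sequence[Codebook]) -> List[List[int]]:
    """Return one token stream per codebook: streams[k][t] is stage k's token for frame t."""
    per_frame = [rvq_encode(frame, codebooks) for frame in frames]
    return [list(stream) for stream in zip(*per_frame)]


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frames = [[1.2, 0.9], [-0.9, 1.2], [0.1, 0.2]]
streams = encode_frames(frames, CODEBOOKS)

first_frame_tokens = [stream[0] for stream in streams]
coarse = rvq_decode(first_frame_tokens[:1], CODEBOOKS)  # first stream only
full = rvq_decode(first_frame_tokens, CODEBOOKS)        # all streams
```

In this toy setup, decoding the first frame from the first stream alone gives a rough approximation, and adding the second stream moves the reconstruction closer to the input, which mirrors how the parallel streams carry progressively finer detail.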
Training audio language models
We then use a single autoregressive language model to recursively model the audio tokens from EnCodec. We introduce a simple approach to leverage the internal structure of the parallel streams of tokens and show that with a single model and an elegant token interleaving pattern, our approach efficiently models audio sequences, simultaneously capturing the long-term dependencies in the audio and allowing us to generate high-quality sound.
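The post does not spell out the interleaving pattern, so the following is a minimal sketch of one possible choice, a "delay"-style pattern in which stream k is shifted right by k steps. The function names and the PAD value are hypothetical. With this layout, a single model can predict one token per stream at every step while the finer streams for a frame still come after that frame's coarser tokens.

```python
from typing import List

PAD = -1  # hypothetical placeholder for positions that hold no real token


def apply_delay_pattern(streams: List[List[int]]) -> List[List[int]]:
    """Shift stream k right by k steps; all rows end up with the same length."""
    num_streams = len(streams)
    return [
        [PAD] * k + list(stream) + [PAD] * (num_streams - 1 - k)
        for k, stream in enumerate(streams)
    ]


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, recovering the time-aligned streams."""
    num_streams = len(delayed)
    num_frames = len(delayed[0]) - (num_streams - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


# Three parallel token streams over three audio frames (toy values).
streams = [
    [1, 2, 3],  # stream 0 (coarsest)
    [4, 5, 6],  # stream 1
    [7, 8, 9],  # stream 2 (finest)
]

delayed = apply_delay_pattern(streams)
restored = undo_delay_pattern(delayed)

for row in delayed:
    print(row)
```

The cost of this layout is only a few extra steps (one fewer than the number of streams) compared with the number of frames, instead of multiplying the sequence length by the number of streams as a fully flattened ordering would.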
MusicGen is an audio generation model specifically tailored for music generation. Music tracks are more complex than environmental sounds, and generating samples that stay coherent over the long-term structure is especially important when creating novel musical pieces. MusicGen was trained on roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed specifically for this purpose.