Step 1/5

Composer's Direction (Prompt)

Generating music begins with telling the AI how the result should sound. Generators accept genres, instruments, song structures (intro, chorus), and even lyrics.

The AI handles abstract descriptors like "melodic", "banger", or "atmospheric" by associating them with particular chords, scales, and timbres.

"An energetic synthwave pop track with a female vocalist, powerful 80s drum beat, 120bpm tempo, and a catchy chorus."
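A prompt like the one above is really just structured text. As a minimal sketch, the pieces of a description could be assembled like this (the field names `genre`, `vocals`, `drums`, `bpm` are illustrative assumptions, not any specific generator's API):

```python
# Hypothetical helper: assemble a text-to-music prompt from parts.
# Field names are made up for illustration; real generators simply
# take the finished string (or tags) as input.
def build_prompt(genre, vocals, drums, bpm, extra=""):
    parts = [
        f"An energetic {genre} track",
        f"with {vocals}",
        drums,
        f"{bpm} bpm tempo",
    ]
    if extra:
        parts.append(extra)
    return ", ".join(parts) + "."

prompt = build_prompt(
    genre="synthwave pop",
    vocals="a female vocalist",
    drums="a powerful 80s drum beat",
    bpm=120,
    extra="and a catchy chorus",
)
print(prompt)
```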

Step 2/5

Semantic and Acoustic Tokens

The AI does not intuitively understand sound; rather, it "listens" to music by reading massive amounts of frequency data.

At this stage, the prompt is converted into semantic tokens (which capture the style and melodic idea) and acoustic tokens (which define what, say, an electric guitar or a singer's voice physically sounds like).
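Acoustic tokens are typically produced by vector quantization: each short frame of audio is mapped to the index of its nearest entry in a codebook. The sketch below uses a random codebook purely to illustrate the idea; real neural codecs learn the codebook from data.

```python
import numpy as np

# Minimal vector-quantization sketch: map short audio frames to
# discrete "acoustic tokens" via nearest-neighbor codebook lookup.
# The codebook here is random, only to show the mechanics.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))        # 8 codes, 4-dim frames

def tokenize(frames):
    # Squared distance from every frame to every codebook vector,
    # then pick the closest code per frame.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)            # one integer token per frame

audio_frames = rng.normal(size=(6, 4))     # 6 fake frames of "audio"
tokens = tokenize(audio_frames)
print(tokens)
```

Downstream, the model works only with these integer tokens, exactly as a language model works with word tokens.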

Step 3/5

Audio Construction in Latent Space

The music is built as a "mel-spectrogram", a compressed visual representation of the sound waves. Just as image generators denoise pixels, audio AI applies iterative denoising to this time-frequency representation of the audio.
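The denoising loop has a simple shape: start from pure noise and repeatedly nudge the array toward the model's prediction of the clean spectrogram. In this toy sketch the "denoiser" is a stand-in that moves toward a fixed target; a real diffusion model would predict that target with a neural network at every step.

```python
import numpy as np

# Toy denoising loop over a mel-spectrogram-shaped array.
# "predicted_clean" stands in for a neural network's output.
rng = np.random.default_rng(1)
target = rng.random((80, 64))      # pretend: 80 mel bins x 64 time frames
x = rng.normal(size=target.shape)  # start from pure latent noise

for step in range(50):             # 50 denoising steps
    predicted_clean = target       # stand-in for the model's prediction
    x = x + 0.2 * (predicted_clean - x)   # move a fraction toward it

error = float(np.abs(x - target).mean())
print(round(error, 4))             # close to 0 after 50 steps
```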

Most commercial generators also use autoregressive ("next-word predicting") models similar to LLMs to predict what the next slice of audio should sound like musically.
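The autoregressive loop can be sketched in a few lines: sample each next token from a distribution conditioned on the previous one. Here a tiny hand-made transition table stands in for the neural network, but the generation loop itself has the same shape.

```python
import numpy as np

# Autoregressive generation sketch over 4 toy music tokens.
# Row i of the table gives P(next token | current token i);
# a real model would compute these probabilities with a network.
rng = np.random.default_rng(2)
transitions = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.2, 0.2, 0.1, 0.5],
    [0.6, 0.1, 0.2, 0.1],
])

def generate(start, length):
    seq = [start]
    for _ in range(length - 1):
        probs = transitions[seq[-1]]           # condition on last token
        seq.append(int(rng.choice(4, p=probs)))  # sample the next one
    return seq

print(generate(start=0, length=10))
```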

Step 4/5

Decoding and Vocoding (Vocoder / VAE)

The "spectrogram" designed by the AI is not yet audio that a speaker can play back. It is still just a numerical map of frequencies over time.

Here the model's final component, the vocoder (e.g., HiFi-GAN), transforms this abstract frequency map into a 44,100 Hz sound wave that is audible to the human ear!
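The frames-to-samples step can be illustrated with crude additive synthesis: treat each spectrogram frame as per-frequency magnitudes and sum the corresponding sinusoids. HiFi-GAN does this with a trained neural network; this loop only shows what "turning frequency data into a waveform" means.

```python
import numpy as np

# Crude "vocoder" sketch: convert spectrogram-like frames (one row of
# frequency magnitudes per frame) into a 44,100 Hz waveform by
# summing sinusoids, keeping phase continuous across frame edges.
SR = 44_100                          # samples per second
FRAME = 512                          # samples per frame

def vocode(frames, freqs):
    out = []
    phase = np.zeros(len(freqs))
    t = np.arange(FRAME)
    for mags in frames:
        block = np.zeros(FRAME)
        for i, (f, m) in enumerate(zip(freqs, mags)):
            block += m * np.sin(2 * np.pi * f * t / SR + phase[i])
            phase[i] += 2 * np.pi * f * FRAME / SR  # carry phase forward
        out.append(block)
    return np.concatenate(out)

frames = np.array([[1.0, 0.5], [0.8, 0.6]])  # 2 frames, 2 frequency bins
wave = vocode(frames, freqs=[220.0, 440.0])
print(wave.shape)                    # 2 frames x 512 samples each
```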

Step 5/5

Finished Music Track

Moments later, you have a complete, unique mastered track! The bass pumps, the singer performs, and the rhythm locks in according to the patterns the AI learned from hundreds of thousands of hours of human-made music.

Whether it's country or EDM, math and neural networks have been turned into a genuine groove!