Google's AI Brings Text to Life with Music Generation

Google's new AI technology, MusicLM, can turn a text description into minutes-long musical pieces, mimicking human-like composition. The model can even transform a whistled or hummed melody into different instruments, much as DALL-E generates images from written prompts. Although the model isn't available for public use, Google has shared samples that showcase its capabilities.
The samples range from 30-second snippets that sound like actual songs, created from a single paragraph-long description specifying the genre, vibe, and instruments, to five-minute pieces generated from just one or two words, such as "melodic techno." One particularly impressive demonstration is "story mode," in which the model is given a script of prompts and morphs the music from one to the next.
MusicLM can also simulate human vocals, although the quality noticeably lags: the voices sound grainy or staticky, most prominently in the example of music meant to play in a gym. Even so, the model correctly captures the tone and overall sound of human voices.
Google's research paper details how MusicLM works, stating that it outperforms other AI music systems in "quality and adherence to the caption." MusicLM can also take audio as input and copy its melody, as demonstrated in examples where a hummed or whistled tune is reproduced as a synth lead, a string quartet, a guitar solo, and so on.
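MusicLM itself isn't publicly accessible, but the idea of "copying a melody" can be illustrated with off-the-shelf tools. The sketch below is purely illustrative: it extracts a pitch contour from a hypothetical hummed recording using librosa's pYIN tracker, standing in for the learned melody representation the paper describes.

```python
# Illustrative only: extract a rough melody contour from a hummed
# recording, the kind of signal a generator could condition on.
# librosa's pYIN tracker stands in for MusicLM's learned melody
# representation; "hummed_melody.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("hummed_melody.wav", mono=True)

# pYIN estimates a fundamental frequency per frame and flags
# which frames are voiced (i.e., actually contain humming).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
)

# Keep only voiced frames and convert to MIDI note numbers,
# yielding a compact pitch contour of the hummed tune.
contour = librosa.hz_to_midi(f0[voiced_flag])
print(f"{contour.size} voiced frames, "
      f"median pitch around MIDI note {np.median(contour):.1f}")
```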
Google is taking a cautious approach with MusicLM and has no plans to release the model for now, citing the risk of misappropriating creative content and of cultural appropriation. Instead, the company is releasing a dataset of around 5,500 music-text pairs to help train and evaluate other musical AI systems.
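For anyone who wants to build on that release (the MusicCaps dataset), loading it might look like the sketch below. Both the Hugging Face dataset ID and the field names are assumptions about the public mirror; the official release pairs expert-written captions with references to ten-second YouTube clips rather than bundled audio.

```python
# A minimal sketch, assuming the released music-text pairs are
# mirrored on the Hugging Face Hub as "google/MusicCaps" (an
# assumed ID) with "ytid" and "caption" columns (assumed names).
from datasets import load_dataset

ds = load_dataset("google/MusicCaps", split="train")
print(len(ds), "music-text pairs")

# Each row pairs a YouTube clip reference with a free-text
# caption describing genre, mood, and instrumentation.
row = ds[0]
print(row["ytid"], "->", row["caption"])
```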