Nvidia (NVDA) has developed a new type of artificial intelligence model that can create sound effects, change the way a person's voice sounds, and generate music from natural language prompts. Called Fugatto, short for Foundational Generative Audio Transformer Opus 1, the model is a research project. Nvidia says it isn't announcing any plans to commercialize the technology, but it could have broad implications for industries ranging from music and entertainment to translation services.
“What’s so exciting about [Fugatto] is that having a model that you can ask to make sounds in a certain way really opens up the landscape of things you can imagine doing with it,” Bryan Catanzaro, vice president of applied deep learning research at Nvidia, told Yahoo Finance.
What sets Fugatto apart from other models, Catanzaro explained, is that it can perform the tasks of several models at once. For example, there are models capable of synthesizing speech and others that can add sound effects to music, but Fugatto does it all. Think of it as an audio complement to video and image generation models like Stability AI’s Stable Video Diffusion or OpenAI’s Sora.
“The fundamental improvement here is that…we’re able to synthesize audio using language, which I think opens up new avenues for tools that people can use to create incredible audio,” Catanzaro added.
According to Nvidia, Fugatto is the first foundational model with emergent properties, meaning it can mix the elements it was trained on and follow “free-form instructions.”
The model can generate audio from standard text prompts as well as manipulate audio files you upload. So if you have a recording of a person speaking, you can translate that person’s words into another language while keeping the output sounding like their voice. You can also take a simple piece of music and make it sound like an orchestral performance, or add different rhythms to it.
You can also upload a document and have the model read it aloud in the voice of your choice. Additionally, you can ask the model to produce voices that carry emotional weight. Want audio of a discouraged English teacher reading Edgar Allan Poe? Fugatto should be able to do it.
Catanzaro cautions, however, that the model isn’t perfect, and some results turn out better than others.
Like image and video generation models, Fugatto raises questions about its potential impact on artists, audio engineers, and people working in related fields. Catanzaro, however, hopes the technology will help musicians.