Unlocking GenAI: How Diffusion Transformers Are Revolutionizing OpenAI’s Sora

OpenAI’s Sora, which can generate videos and interactive 3D environments on demand, is a bona fide testament to how far GenAI has come, showcasing the state of the art in the field.

Interestingly, one of the innovations behind it, an AI model architecture colloquially known as the diffusion transformer, arrived on the AI research scene years ago.

The diffusion transformer also powers Stability AI’s newest image generator, Stable Diffusion 3.0, and it could transform the GenAI industry by allowing models to scale beyond what was previously thought possible.

Saining Xie, a computer science professor at NYU, began the research project that spawned the diffusion transformer in June 2022. Along with his then-mentee William Peebles, who went on to co-lead Sora at OpenAI, Xie combined two machine learning concepts, diffusion and the transformer, to create the diffusion transformer.

The process of generating media with AI, including images, videos, speech, music, and 3D meshes, often relies on a technique known as diffusion. This involves slowly adding noise to a piece of media until it becomes unrecognizable. The diffusion model then learns how to remove this noise, approaching the desired output one step at a time.
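
To make the idea concrete, here is a minimal, hypothetical sketch of the forward “noising” step in NumPy. The linear schedule and the variable names are illustrative assumptions, not code from Sora or Stable Diffusion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule: alpha_bar[t] shrinks toward zero,
# so x_t drifts from clean data toward pure Gaussian noise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward diffusion: blend clean media x0 with Gaussian noise at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # a model is trained to predict eps given (x_t, t)

x0 = rng.standard_normal((8, 8))   # stand-in for an image patch
x_t, eps = add_noise(x0, t=500)    # partway to unrecognizable noise
```

A trained model runs this process in reverse: starting from pure noise, it estimates and subtracts a little noise at each step until a clean sample remains.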

The majority of modern AI-powered media generators use a “backbone,” or engine of sorts, called a U-Net. This backbone is responsible for estimating the noise to be removed at each step, but its complexity can significantly slow down the diffusion process.

Luckily, the diffusion transformer provides a more efficient and effective alternative to U-Nets.

Transformers, the architecture of choice for complex reasoning tasks and the engine behind models such as GPT-4, Gemini, and ChatGPT, have a defining feature called an “attention mechanism.” For every piece of input data, the mechanism weighs the relevance of all the other inputs and draws from them to generate the output, making transformers simpler and more parallelizable than other architectures. In other words, larger and larger transformer models can be trained with significant, but not unattainable, increases in compute.
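
As a rough illustration of that attention mechanism, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and names are illustrative, not any particular model’s implementation:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: score every input against every
    other input, then mix the values according to those scores."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                               # relevance-weighted mix

tokens = np.random.default_rng(1).standard_normal((4, 16))  # 4 tokens
out = attention(tokens, tokens, tokens)              # self-attention
```

Because every token attends to every other token in a single matrix product, the whole computation parallelizes cleanly on modern hardware, which is what makes transformers so scalable.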

“The introduction of transformers marks a significant leap in scalability and effectiveness,” Xie said in an email interview with TechCrunch. “This is particularly evident in models like Sora, which benefit from training on vast volumes of video data and leverage extensive model parameters to showcase the transformative potential of transformers when applied at scale.”

So why did it take so long for projects like Sora and Stable Diffusion to utilize the diffusion transformer? Xie believes that it was only recently that the importance of having a scalable backbone model became apparent.

“The Sora team has truly gone above and beyond to demonstrate the capabilities of this approach on a large scale,” he said. “They have made it clear that U-Nets are out and transformers are in when it comes to diffusion models.”

Xie believes that swapping a diffusion transformer in for a U-Net in an existing model should be a simple process, regardless of the type of media being generated. The current way diffusion transformers are trained may introduce some inefficiencies and performance loss, but Xie is confident these issues can be addressed over the long horizon.
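
To picture why the swap is simple, consider this hypothetical sketch: if the sampling loop only ever asks its backbone for a noise estimate, the backbone behind that interface is interchangeable. The class names and the update rule here are invented for illustration:

```python
import numpy as np

class UNetBackbone:
    def predict_noise(self, x_t, t):
        return np.zeros_like(x_t)  # stand-in for a convolutional U-Net

class TransformerBackbone:
    def predict_noise(self, x_t, t):
        return np.zeros_like(x_t)  # stand-in for patchify + attention blocks

def denoise_step(model, x_t, t):
    # The sampler only ever calls predict_noise(), so replacing
    # UNetBackbone with TransformerBackbone is a one-line change.
    eps_hat = model.predict_noise(x_t, t)
    return x_t - eps_hat  # heavily simplified update, for illustration only

x_t = np.ones((8, 8))
x_next = denoise_step(TransformerBackbone(), x_t, t=999)
```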

“The main takeaway is simple: forget U-Nets and make the switch to transformers, as they are faster, more efficient, and much more scalable,” he stated. “I am interested in integrating the domains of content understanding and creation within the framework of diffusion transformers. Currently, these aspects are like two separate worlds – one for understanding and one for creating. I envision a future where these domains are integrated, and I believe that standardizing underlying architectures, with transformers being the ideal candidate, is key to achieving this integration.”

If Sora and Stable Diffusion 3.0 are any indication of what we can expect from diffusion transformers, it seems we are in for an exciting journey ahead.

Kira Kim

Kira Kim is a science journalist with a background in biology and a passion for environmental issues. She is known for her clear and concise writing, as well as her ability to bring complex scientific concepts to life for a general audience.
