Text to 3D: DreamFusion vs Shap-E
A comparison of text-to-3D algorithms from Google and OpenAI, an overview of other major open-source alternatives, and a look at text-to-3D startups.
Why do we need text to 3D?
Over the past year, we've seen significant advances on the text-to-image problem using diffusion models, to the point where it's now largely considered solved. With this progress in mind, sights are now set on the next grand challenge: text-to-3D.
While adding a third dimension brings extra complexity, it also offers the exciting prospect of generating 3D objects directly from textual descriptions. Advances in this area could enhance technical capabilities, refine design workflows, and reshape storytelling in the gaming and virtual-reality sectors. We will be trying out two text-to-3D proposals, one from Google and one from OpenAI.
Unlike OpenAI, Google didn't open-source its implementation, but there is a great open-source alternative. So, let's take a closer look at both.
What is Google's proposal?
Google developed DreamFusion, which leverages the combined power of two models:
- Imagen is a pretrained text-to-image diffusion model;
- NeRF (Neural Radiance Field) is a multilayer perceptron (MLP) that can generate novel views of complex 3D scenes based on a partial set of 2D images.
The Google team used only the pretrained 64×64 base Imagen model with no modifications (not even the super-resolution cascade for generating higher-resolution images).
A NeRF-like model, initialized with random weights, builds the scene from a text prompt. It repeatedly renders images of the scene from random viewpoints, and these renders are used to compute a distillation loss that wraps around Imagen. Gradient descent on this loss eventually yields a 3D model (parameterized as a NeRF) that matches the text.
Fig. 1: Source: DreamFusion: Text-to-3D using 2D Diffusion by Poole et al.
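To make the optimization loop concrete, here is a minimal sketch of the score-distillation idea in PyTorch. ToyRenderer and ToyDenoiser are hypothetical stand-ins for the NeRF and Imagen (this is not the actual DreamFusion code, and the timestep weighting is omitted for brevity); only the gradient flow mirrors the description above.

```python
import torch
import torch.nn as nn

class ToyRenderer(nn.Module):
    """Plays the role of the NeRF: parameters -> image from a given viewpoint."""
    def __init__(self, image_size=64):
        super().__init__()
        self.canvas = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, viewpoint):
        # A real NeRF would volume-render the scene from `viewpoint`;
        # here we just jitter a learned canvas so the example runs end to end.
        return self.canvas + 0.01 * viewpoint

class ToyDenoiser(nn.Module):
    """Plays the role of the frozen diffusion model: predicts the added noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy_image, t):
        # The timestep `t` is unused in this toy stand-in.
        return self.net(noisy_image)

renderer = ToyRenderer()
denoiser = ToyDenoiser().requires_grad_(False)   # the diffusion model stays frozen
optimizer = torch.optim.Adam(renderer.parameters(), lr=1e-2)

for step in range(100):
    viewpoint = torch.rand(1)                    # random camera per iteration
    image = renderer(viewpoint)

    t = torch.randint(1, 1000, (1,))             # random diffusion timestep
    alpha = 1.0 - t.float() / 1000.0             # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha.sqrt() * image + (1 - alpha).sqrt() * noise

    eps_pred = denoiser(noisy, t)

    # Score distillation: treat (eps_pred - noise) as the gradient w.r.t. the
    # rendered image, without differentiating through the denoiser itself.
    grad = (eps_pred - noise).detach()
    loss = (grad * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```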
Stable DreamFusion is an open-source variant of Google's DreamFusion, but it comes with several modifications:
- The Stable Diffusion model is used instead of the closed-source Imagen. Stable Diffusion is a latent diffusion model: it diffuses in a latent space rather than the original image space, so the loss also has to be backpropagated through the VAE's encoder, which adds training time.
- For faster rendering, the multiresolution grid encoder from torch-ngp was implemented. It's an efficient method to process neural graphics primitives using a novel input encoding and a multiresolution hash table of trainable feature vectors. This approach enables faster training and rendering times, with high-quality graphics trainable in seconds and renderable in milliseconds at high resolutions.
- NeRF creates a continuous 3D representation of a scene from 2D images. This volumetric scene function models the color and volume density of every 3D point in space.
- To create a 3D object or mesh that can be opened in 3D modeling tools like Blender, the Marching Cubes algorithm can be applied to this 3D representation. The algorithm divides space into cubes and determines where the object's surface passes through each cube; it "marches" through all the cubes, hence the name. In our case, we applied the DMTet algorithm from NVIDIA, which works significantly better than the default Marching Cubes (see the sketch below).
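For the default Marching Cubes path, the idea boils down to sampling the learned density field on a regular grid and extracting an iso-surface. Below is a minimal sketch using scikit-image and trimesh rather than the project's own exporter; `query_density` is a hypothetical stand-in for the trained NeRF's density network, replaced here by an analytic sphere so the snippet runs on its own.

```python
import numpy as np
import trimesh
from skimage import measure

def query_density(points):
    """Hypothetical stand-in for the trained NeRF's density network.
    Here: a sphere of radius 0.5 so the example is self-contained."""
    return (0.5 - np.linalg.norm(points, axis=-1)).clip(min=0.0)

# Sample the density field on a regular grid over the scene's bounding box.
resolution = 128
grid = np.linspace(-1.0, 1.0, resolution)
xs, ys, zs = np.meshgrid(grid, grid, grid, indexing="ij")
points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
density = query_density(points).reshape(resolution, resolution, resolution)

# Marching Cubes: find the surface where the density crosses the chosen threshold.
verts, faces, normals, _ = measure.marching_cubes(density, level=0.01)

# Map vertex indices back to world coordinates and export an .obj for Blender.
verts = verts / (resolution - 1) * 2.0 - 1.0
trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals).export("mesh.obj")
```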
Here are some objects generated with Stable DreamFusion, including their 3D meshes that can be opened in Blender:
After you unzip and import the .obj file into Blender, switch to Texture Paint, as shown in the screenshot below:
Fig. 2: mesh.obj file from angel.zip imported into Blender.
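If you prefer to script the import step, a small Blender Python snippet along these lines should work (a sketch assuming Blender 3.x, where the legacy OBJ importer is available; in 4.x the operator is `bpy.ops.wm.obj_import`, and the file path is a placeholder):

```python
import bpy

# Import the extracted mesh (Blender 3.x operator; use bpy.ops.wm.obj_import in 4.x).
bpy.ops.import_scene.obj(filepath="/path/to/mesh.obj")

# Make the imported object active and switch into Texture Paint mode.
obj = bpy.context.selected_objects[0]
bpy.context.view_layer.objects.active = obj
bpy.ops.object.mode_set(mode='TEXTURE_PAINT')
```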
What is the OpenAI proposal?
OpenAI has introduced Shap-E, a novel open-source AI-driven text-to-3D-model generator that, at first glance, appears to be similar to its recent Point-E.
According to the paper, Shap-E is trained in two stages:
- First, an encoder takes 3D assets as input and converts them into the parameters of mathematical representations known as implicit functions. This lets the model learn a deep underlying representation of 3D assets;
- Second, a conditional diffusion model is trained on the encoder's outputs. It learns the conditional distribution of the implicit-function parameters given the input data and generates a wide variety of intricate 3D assets by sampling from the learned distribution.
A vast dataset of 3D assets paired with text descriptions was used to train the diffusion model. Unlike Point-E, which generates explicit representations over point clouds, Shap-E directly generates the parameters of implicit functions that can be rendered as textured meshes or neural radiance fields.
Despite modeling this richer output space, Shap-E converges faster and achieves comparable or better sample quality.
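In practice, generating assets with the open-source shap-e repository follows the pattern of its example notebooks: load the pretrained text-conditional model and the "transmitter", sample latents for a prompt, and decode them into a mesh. A condensed sketch is below (adapted from the repo's text-to-3D sample notebook; exact function and argument names are taken from that notebook and may differ between versions):

```python
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import decode_latent_mesh

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained text-conditional diffusion model and the latent-to-3D "transmitter".
xm = load_model("transmitter", device=device)
model = load_model("text300M", device=device)
diffusion = diffusion_from_config(load_config("diffusion"))

# Sample implicit-function latents conditioned on the text prompt.
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=["a cheeseburger"]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Decode each latent into a triangle mesh and save it as .ply for Blender.
for i, latent in enumerate(latents):
    mesh = decode_latent_mesh(xm, latent).tri_mesh()
    with open(f"burger_{i}.ply", "wb") as f:
        mesh.write_ply(f)
```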
Here are some objects generated with Shap-E, including their 3D meshes that can be opened in Blender:
After you import a .ply file into Blender, switch to Vertex Paint, as shown in the screenshot below:
Fig. 3: burger.ply file imported into Blender.
DreamFusion & Shap-E Open-Source Alternatives
The team is addressing the challenge of creating a 360° model from a single image. They utilize a neural radiance field and a conditional image generator to produce new views of the object.
This approach, influenced by DreamFields and DreamFusion, provides a consistent, realistic 3D reconstruction from a single perspective, outperforming previous methods.
An input image is converted into a 3D version utilizing a Neural Radiance Field (NeRF) and a previously trained denoising diffusion model. The initial model is then enhanced and converted into textured point clouds by deploying text-to-image generative and contrastive models. The outcome is a high-definition 3D visualization of the initial image.
A framework capable of modifying the viewpoint of an object based on a solitary RGB image, leveraging large-scale diffusion models. It uses synthetic data to learn camera controls, permitting the generation of new views of an object.
Despite its synthetic training data, it has strong generalization capabilities and outperforms existing models in single-view 3D reconstruction and new view creation.
TANGO is a novel approach for transforming 3D shapes according to text prompts, generating photorealistic styles. It leverages the CLIP model to optimize appearance factors like reflectance and lighting, offering improved photorealism, 3D geometry consistency, and robustness even for low-quality meshes.
NVIDIA presented GET3D, a generative model that produces explicit textured 3D meshes with intricate topology, detailed geometry, and high-quality textures. It leverages recent advancements in differentiable surface modeling, differentiable rendering, and 2D Generative Adversarial Networks to train from 2D image sets.
Additional interesting implementations.
3D Asset Generation Startups
With 3DFY Prompt, users can quickly transform their text prompts into high-quality 3D models that can be used for various purposes, from designing virtual environments for video games to creating 3D models of products for prototyping purposes.
Innovative 3D generative AI technology for crafting game-optimized 3D models. With text-to-3d, transform written prompts into detailed meshes and textures, while text-to-animation breathes life into your creations with user-defined animations. Plus, you can freely edit and customize your freshly generated 3D models.
State-of-the-art algorithms extract human motions from any video to generate 3D animations in minutes. Proprietary technology handles animation and emote retargeting on any avatar and in any environment. A cloud-based emote catalog, unique on the market, provides partners with fresh and virtually unlimited content.
Kaedim lets you generate stunning 3D art from nothing more than an image in minutes, and is optimized for usable, production-ready 3D assets.
This new generation of image generators, which can be trained with your unique art, provides unparalleled style consistency. With text prompts and reference images, you can generate visually appealing and stylistically consistent graphics in seconds.
The team applies NeRF, which captures intricate 3D scenes and game assets, with the ability to display them online or integrate them into platforms like Blender or Unity.
The API can transform video walkthroughs and coarse textured models into interactive 3D scenes and generate pre-rendered 360° images and videos.
Conclusions
This year's grand challenge lies in generating 3D assets from text. Despite its complexity, we are confident this problem will soon be solved and ready for production across various industries, including games, virtual & augmented reality, and other immersive 3D experiences.
This breakthrough will undeniably revolutionize how we interact with digital environments, leading to an unprecedented leap in user experience.
Have an idea? Let's discuss!
Talk to Yuliya. She will make sure that everything is covered. Don't waste time googling: get all your answers from a relevant expert in under one hour.