Make-It-3D: High-Fidelity 3D Creation from A Single Image with Diffusion Prior

Abstract

In this work, we investigate the problem of creating high-fidelity 3D content from only a single image. This is inherently challenging: it essentially involves estimating the underlying 3D geometry while simultaneously hallucinating unseen textures. To address this challenge, we leverage prior knowledge from a well-trained 2D diffusion model to act as 3D-aware supervision for 3D creation. Our approach, Make-It-3D, employs a two-stage optimization pipeline: the first stage optimizes a neural radiance field by incorporating constraints from the reference image at the frontal view and diffusion prior at novel views; the second stage transforms the coarse model into textured point clouds and further elevates the realism with diffusion prior while leveraging the high-quality textures from the reference image. Extensive experiments demonstrate that our method outperforms prior works by a large margin, resulting in faithful reconstructions and impressive visual quality. Our method presents the first attempt to achieve high-quality 3D creation from a single image for general objects and enables various applications such as text-to-3D creation and texture editing.

Pipeline

Generating novel views of general scenes or objects from only a single image is inherently challenging, because both the geometry and the missing texture must be inferred. We tackle this challenge by exploiting the prior knowledge embedded in pretrained 2D diffusion models.

Given an input image, we first hallucinate its underlying 3D representation, a neural radiance field (NeRF), whose renderings appear as plausible samples to a pretrained denoising diffusion model, and we constrain this optimization with texture and depth supervision at the reference view. To further improve the rendering realism, we keep the learned geometry and enhance the textures with the reference image. As such, in the second stage, we lift the input image to textured point clouds and focus on refining the color of the points occluded in the reference view. We leverage the prior knowledge of a text-to-image generative model and a text-image contrastive model in both stages. In this way, we achieve a faithful 3D representation of the input image with restored high-fidelity texture and geometry.
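To make the second-stage "lifting" step more concrete, the sketch below back-projects each pixel of a reference image into a colored 3D point using a per-pixel depth value. It is only an illustration under stated assumptions: this page does not specify the camera model, so a simple pinhole camera is assumed, and the function name `lift_image_to_points` and the parameters `focal`, `cx`, and `cy` are hypothetical.

```python
def lift_image_to_points(colors, depth, focal, cx, cy):
    """Back-project pixels into colored 3D points (assumed pinhole camera).

    colors: H x W nested list of (r, g, b) tuples from the reference image.
    depth:  H x W nested list of depths along the camera z-axis.
    focal, cx, cy: hypothetical intrinsics (focal length, principal point).
    Returns a list of ((x, y, z), (r, g, b)) pairs.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                # Skip pixels without a valid depth (e.g. background).
                continue
            # Standard pinhole unprojection of pixel (u, v) at depth d.
            x = (u - cx) * d / focal
            y = (v - cy) * d / focal
            points.append(((x, y, d), colors[v][u]))
    return points


# Tiny 2 x 2 example: one pixel has no valid depth and is skipped.
colors = [[(255, 0, 0), (0, 255, 0)],
          [(0, 0, 255), (255, 255, 255)]]
depth = [[2.0, 2.0],
         [0.0, 4.0]]
points = lift_image_to_points(colors, depth, focal=1.0, cx=0.5, cy=0.5)
```

Points obtained this way inherit their colors directly from the reference image. In the pipeline described above, the points that need refinement are the remaining ones, those occluded in the reference view.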

Diverse Text-to-3D

Make-It-3D can generate diverse and visually stunning 3D models given a text description.

Texture Modification

Make-It-3D achieves 3D-aware texture modification such as tattoo drawing and stylization.
