SiTH

TL;DR

SiTH is an optimization-free single-view 3D reconstruction pipeline that integrates a novel image-conditioned ControlNet model with implicit neural fields.
SiTH reconstructs a fully textured, high-quality 3D human mesh from a single image in 2 minutes.
SiTH can be trained with as few as 500 3D human scans and generalizes well to diverse images such as AI-generated images. (Check out our demo!)
To foster future research, we contribute a new evaluation benchmark for single-view 3D human reconstruction.

Abstract

A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate unseen back-view appearance based on the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body textured meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks, including our newly created one, highlight our method's superior accuracy and perceptual quality in 3D textured human reconstruction.

Video

Method Overview

Hallucination and Reconstruction

At the core of SiTH is the decomposition of the challenging single-view problem into two subproblems: generative back-view hallucination and mesh reconstruction. For hallucination, we harness the generative capabilities of diffusion models to infer unobserved back-view appearances from the input images. For reconstruction, we utilize a skinned body mesh, providing essential 3D guidance for accurate human mesh reconstruction. This decomposition strategy allows our pipeline to be trained efficiently with just 500 3D scans, while still robustly handling unseen images.
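The data flow of this two-stage decomposition can be sketched in a few lines of Python. This is only an illustration of how the stages connect, not the SiTH implementation: `TwoStagePipeline`, `fake_hallucinate`, and `fake_reconstruct` are hypothetical names, the hallucination stage is replaced by a horizontal flip instead of a diffusion model, and the reconstruction stage simply bundles its inputs instead of producing a mesh.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

import numpy as np


@dataclass
class TwoStagePipeline:
    """Hypothetical container wiring hallucination into reconstruction."""
    hallucinate_back: Callable[[np.ndarray, Any], np.ndarray]
    reconstruct_mesh: Callable[[np.ndarray, np.ndarray, Any], Dict[str, Any]]

    def run(self, front: np.ndarray, body_mesh: Any) -> Dict[str, Any]:
        # Stage 1: infer the unobserved back view from the front view.
        back = self.hallucinate_back(front, body_mesh)
        # Stage 2: reconstruct from both views, guided by the body mesh.
        return self.reconstruct_mesh(front, back, body_mesh)


def fake_hallucinate(front: np.ndarray, body_mesh: Any) -> np.ndarray:
    # Placeholder only: mirror the image. A real system would run a
    # generative model here.
    return front[:, ::-1].copy()


def fake_reconstruct(front: np.ndarray, back: np.ndarray, body_mesh: Any) -> Dict[str, Any]:
    # Placeholder only: return the inputs a mesh reconstructor would consume.
    return {"front": front, "back": back, "guidance": body_mesh}


pipeline = TwoStagePipeline(fake_hallucinate, fake_reconstruct)
front = np.arange(2 * 3 * 3).reshape(2, 3, 3)  # tiny stand-in for an H x W x 3 image
result = pipeline.run(front, body_mesh="smplx-placeholder")
```

Keeping the two stages behind separate callables mirrors the point made above: each subproblem can be trained or swapped independently, and only the back-view image and the body mesh pass between them.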

Image-conditioned Diffusion

Compared to text-conditioning in traditional diffusion models, our image-conditioning strategy is more consistent and accurate, which makes it possible to build a data-driven 3D reconstruction pipeline. First, the pretrained CLIP and VAE encoders ensure the output images maintain visual consistency with the front-view images. Additionally, we render UV maps from the SMPL-X body mesh and extract silhouette masks from the back-view images. These additional conditional controls ensure the human poses in the output images match those in the front view. To preserve the diffusion model's generative power, we specifically optimize the ControlNet and cross-attention layers with only a small amount of 3D human data.
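One way to picture "optimizing the ControlNet and cross-attention layers" is as a filter over parameter names that decides which weights receive gradients. The sketch below is a hedged illustration in plain Python: `select_trainable` is a hypothetical helper, and the substrings `controlnet` and `attn2` as well as the example parameter names are assumptions made for this example, not identifiers taken from the SiTH code.

```python
def select_trainable(param_names, patterns=("controlnet", "attn2")):
    """Return the parameter names that match any of the given substrings.

    Assumption: ControlNet weights contain "controlnet" in their name and
    cross-attention weights contain "attn2". Adjust to the actual model.
    """
    return [name for name in param_names if any(p in name for p in patterns)]


# Hypothetical parameter names, for illustration only.
names = [
    "unet.down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "unet.down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "unet.down_blocks.0.resnets.0.conv1.weight",
    "controlnet.input_blocks.0.weight",
]
trainable = select_trainable(names)
```

In a training script, the names returned here would be the ones left trainable, while every other weight stays frozen.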

Results

Single-view 3D Human Reconstruction

                            

Text-guided 3D Human Creation

SiTH can be easily combined with powerful 2D text-to-image models.

"A male wearing a white suit"
"A man wearing brown t-shirt and pants"
"A woman in a white t-shirt and a tennis skirt"

3D Avatars

SiTH can easily create animation-ready 3D avatars from images.

Benchmark Evaluation

The benchmark is based on the CustomHumans dataset. To access the benchmark, please apply for access to the dataset directly.

Single-view 3D Human Reconstruction

Methods	P-to-S (cm) ↓	S-to-P (cm) ↓	NC ↑	f-Score ↑
PIFu [Saito2019]	2.209	2.582	0.805	34.881
PIFuHD [Saito2020]	2.107	2.228	0.804	39.076
PaMIR [Zheng2021]	2.181	2.507	0.813	35.847
FOF [Feng2022]	2.079	2.644	0.808	36.013
2K2K [Han2023]	2.488	3.292	0.796	30.186
ICON* [Xiu2022]	2.256	2.795	0.791	30.437
ECON* [Xiu2023]	2.483	2.680	0.797	30.894
SiTH* (Ours)	1.871	2.045	0.826	37.029

*indicates methods trained on the same THuman2.0 dataset.
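For readers unfamiliar with the distance columns, the following NumPy sketch shows one common way such numbers are computed from sampled point sets. It rests on several assumptions that this page does not confirm: that P-to-S is the mean distance from predicted points to the scan and S-to-P the reverse, that point-to-surface distance can be approximated by nearest-neighbor distance between dense samples, that the f-Score is the harmonic mean of precision and recall at a distance threshold `tau` (a hypothetical value here), and that it is reported as a percentage. Normal consistency (NC) is not covered. The benchmark's own evaluation code remains the authority on the exact protocol.

```python
import numpy as np


def nearest_dist(src, dst):
    """Distance from each point in src to its nearest neighbor in dst (brute force)."""
    diff = src[:, None, :] - dst[None, :, :]  # shape (N, M, 3)
    return np.linalg.norm(diff, axis=-1).min(axis=1)


def distance_metrics(pred, scan, tau=1.0):
    """Return (pred-to-scan mean, scan-to-pred mean, f-score in percent).

    tau is a hypothetical threshold in the same unit as the points.
    """
    d_ps = nearest_dist(pred, scan)  # prediction -> scan
    d_sp = nearest_dist(scan, pred)  # scan -> prediction
    precision = float((d_ps < tau).mean())
    recall = float((d_sp < tau).mean())
    total = precision + recall
    fscore = 0.0 if total == 0 else 2 * precision * recall / total
    return float(d_ps.mean()), float(d_sp.mean()), 100.0 * fscore


# Toy point sets, for illustration only.
scan = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
pred = np.array([[0.5, 0.0, 0.0], [10.0, 0.0, 3.0]])
p_to_s, s_to_p, f_score = distance_metrics(pred, scan)
```

The brute-force distance matrix is fine for toy inputs; for dense scans a spatial index such as a k-d tree would be the usual replacement.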

Comparison with Optimization-based (Score Distillation) Approaches

                            
Zero-1-to-3, 10 mins
Magic-1-to-3, 6 hours
TeCH, 6 hours
SiTH (Ours), 2 mins
GT Reference

Compared to optimization-based methods (Score Distillation), SiTH reconstructs consistent facial textures that align with the input images and their underlying geometry.

Acknowledgement

This work was partially supported by the Swiss SERI Consolidation Grant "AI-PERCEIVE". We thank Xu Chen for insightful discussions, Manuel Kaufmann for suggestions on writing and the title, and Christoph Gebhardt, Marcel Buehler, and Juan-Ting Lin for their writing advice.

Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Paper (CVPR 2024)

Video

Code

Demo

News:

[June 14, 2024] Release training scripts.

[May 15, 2024] Add an application for 3D avatar animation.

[April 24, 2024] Gradio demo for 3D human creation is now available.

[April 15, 2024] Release demo code, models, and the evaluation benchmark.
