Separable 3D Reconstruction of Two Interacting Objects from Multiple Views


Abstract

Separable 3D reconstruction of multiple objects from multi-view RGB images, i.e., recovering two distinct 3D shapes with a clear separation between them, remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects' interaction boundaries. This paper investigates this setting and introduces a new neuro-implicit method that reconstructs the geometry and appearance of two closely interacting objects from multiple RGB views while keeping them disjoint in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. In our approach, the scene is first encoded using a shared multi-resolution hash grid, whose features are then decoded into two neural SDFs, one per object. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that ensures the two geometries remain well separated even under extreme occlusions. Our reconstruction method is markerless and applies to rigid as well as articulated objects. Experiments confirm the effectiveness of our framework and show substantial improvements in 3D reconstruction and novel-view synthesis metrics over several existing approaches applicable in our setting. We also introduce a new dataset comprising close interactions between a human and an object, as well as two humans performing martial arts.
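To make the architecture concrete, below is a minimal PyTorch sketch of the shared-encoding / two-decoder idea: one feature encoding queried per point, decoded by two separate SDF MLPs. It is illustrative only; a small dense grid stands in for the multi-resolution hash grid, and all names (`SeparableSDFField`, `feat_dim`, etc.) are ours, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSDFField(nn.Module):
    """Illustrative sketch: one shared feature encoding, two SDF decoders.

    A small dense learnable grid stands in for the multi-resolution hash
    grid used in the paper (e.g. an Instant-NGP-style encoding).
    """

    def __init__(self, feat_dim: int = 32, res: int = 64, hidden: int = 64):
        super().__init__()
        # Shared encoding e, here a dense grid for simplicity.
        self.grid = nn.Parameter(0.01 * torch.randn(1, feat_dim, res, res, res))

        def sdf_mlp() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.Softplus(beta=100),
                nn.Linear(hidden, hidden), nn.Softplus(beta=100),
                nn.Linear(hidden, 1),
            )

        self.sdf1 = sdf_mlp()  # decodes shared features into Phi_1
        self.sdf2 = sdf_mlp()  # decodes shared features into Phi_2
        # Colour head conditioned on the joint scene SDF Phi_s.
        self.colour = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        """x: (N, 3) query points in [-1, 1]^3."""
        # Trilinear lookup of the shared features at the query points.
        g = F.grid_sample(
            self.grid, x.view(1, -1, 1, 1, 3), align_corners=True
        ).view(self.grid.shape[1], -1).t()                # (N, feat_dim)
        phi1, phi2 = self.sdf1(g), self.sdf2(g)           # per-object SDFs
        phi_s = torch.minimum(phi1, phi2)                 # union Phi_s
        rgb = self.colour(torch.cat([g, phi_s], dim=-1))  # per-point colour
        return phi1, phi2, phi_s, rgb
```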

Video Introduction

Method Overview


We semantically segment the input multi-view images into the background and the regions corresponding to the two interacting objects. The scene is then encoded using a shared multi-resolution hash-grid encoding $\mathbf{e}$. The shared features are decoded by two separate SDF MLPs into the corresponding SDFs $\Phi_1$ and $\Phi_2$. The per-point colour $\mathcal{C}_s$ is estimated from the joint scene SDF, composed as the union $\Phi_s = \Phi_1 \cup \Phi_2$ (i.e., $\Phi_s = \min(\Phi_1, \Phi_2)$). Finally, we integrate the colour values of the points sampled along each ray by $\alpha$-blending the individual opacities $\alpha_1$ and $\alpha_2$, which ensures clean separation boundaries between the two objects. The entire framework is supervised end to end using the rendering loss and additional regularisers.
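The $\alpha$-blending step can be sketched as follows, reusing the `SeparableSDFField` model from the sketch above. The sigmoid SDF-to-opacity conversion is a simplified stand-in for a NeuS/VolSDF-style density, and the per-object mask weights `m1`, `m2` only hint at where a separation regulariser could attach; none of this is the authors' exact formulation.

```python
import torch

def sdf_to_alpha(phi: torch.Tensor, s: float = 50.0) -> torch.Tensor:
    """Map SDF values to opacities in [0, 1]; a simplified stand-in for
    a NeuS/VolSDF-style SDF-to-density conversion."""
    return torch.sigmoid(-s * phi)

def render_ray(model, origin, direction, n_samples=64, near=0.5, far=3.0):
    """Alpha-blend the two objects' opacities along a single ray."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction            # (n_samples, 3)
    phi1, phi2, _, rgb = model(pts)
    a1, a2 = sdf_to_alpha(phi1), sdf_to_alpha(phi2)  # per-object opacities
    alpha = 1.0 - (1.0 - a1) * (1.0 - a2)            # blended opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), dim=0
    )
    w = trans * alpha                                 # volume-rendering weights
    colour = (w * rgb).sum(dim=0)                     # rendered pixel colour
    # Per-object soft masks; a separation regulariser can penalise their
    # overlap so the boundary between the two objects stays clean.
    m1, m2 = (trans * a1).sum(), (trans * a2).sum()
    return colour, m1, m2

# Usage with the sketch model from above:
# model = SeparableSDFField()
# colour, m1, m2 = render_ray(model, torch.zeros(3), torch.tensor([0., 0., 1.]))
```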

Results


Human Object Scenario
Videos: Scene, Human, Object

Human Human Scenario
Videos: Scene, Human 1, Human 2

Hand Object Scenario
Videos: Scene, Hand, Object

Object Object Scenario
Videos: Scene, Object 1, Object 2



Full Video