<!--[metadata] title = "Learning to render novel views from wide-baseline stereo pairs" source = "https://github.com/rerun-io/cross_attention_renderer/" tags = ["2D", "3D", "View synthesis", "Time series", "Pinhole camera", "Paper walkthrough"] thumbnail = "https://static.rerun.io/learning-to-render/75c96220e356938037dce35fcb5349f5f8064d8f/480w.png" thumbnail_dimensions = [480, 480] -->

This example is a visual walkthrough of the paper "Learning to render novel views from wide-baseline stereo pairs". All the visualizations were created by editing the original source code to log data with the Rerun SDK.

Visual paper walkthrough

Novel view synthesis has made remarkable progress in recent years, but most methods require per-scene optimization on many images. In their CVPR 2023 paper, Yilun Du et al. propose a method that works with just two views. I created a visual walkthrough of the work using the Rerun SDK.

https://vimeo.com/865975229?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:10547

“Learning to Render Novel Views from Wide-Baseline Stereo Pairs” describes a three-stage approach. (a) Image features are extracted for each input view. (b) Features along the target rays are collected. (c) The color is predicted using cross-attention.

<picture> <source media="(max-width: 480px)" srcset="https://static.rerun.io/widebaseline-overview/76d19a9bc9f4c101036577a747c029caa85fb95e/480w.png"> <source media="(max-width: 768px)" srcset="https://static.rerun.io/widebaseline-overview/76d19a9bc9f4c101036577a747c029caa85fb95e/768w.png"> <source media="(max-width: 1024px)" srcset="https://static.rerun.io/widebaseline-overview/76d19a9bc9f4c101036577a747c029caa85fb95e/1024w.png"> <source media="(max-width: 1200px)" srcset="https://static.rerun.io/widebaseline-overview/76d19a9bc9f4c101036577a747c029caa85fb95e/1200w.png"> </picture>
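
As a rough illustration of stage (c), the sketch below runs single-head scaled dot-product attention for one query vector over a handful of sampled feature vectors. It is a toy under stated assumptions, not the paper's architecture: the function name `cross_attention`, the two-dimensional features, and the RGB-like values are all made up for this example, and the learned projections and decoding layers of a real model are left out.

```python
import math


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Hypothetical data: one ray query and three sampled features.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [5.0, 0.0], [0.0, -1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output, weights = cross_attention(query, keys, values)
```

Here the second key matches the query best, so it receives most of the attention and the output is dominated by the second value vector.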

To render a pixel, its corresponding ray is projected onto each input image. Instead of uniformly sampling along the ray in 3D, the samples are distributed such that they are equally spaced on the image plane. The same points are also projected onto the other view (light color).

https://vimeo.com/865975245?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:7941
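
One way to get samples that are equally spaced on the image plane is perspective-correct interpolation between the projections of a near and a far point on the ray. The sketch below assumes a simple pinhole camera and a ray already expressed in the input camera's coordinates. The helper names (`project`, `point_on_ray`, `ray_params_uniform_in_image`) and all numbers are hypothetical, and the paper's implementation may handle this differently.

```python
def project(point, focal, center):
    """Pinhole projection of a 3D point in camera coordinates to pixels."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def point_on_ray(origin, direction, t):
    """3D point at parameter t along the ray origin + t * direction."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def ray_params_uniform_in_image(origin, direction, t_near, t_far, n):
    """Ray parameters whose projections are equally spaced in the image.

    Assumes n >= 2 and that both end points lie in front of the
    camera (positive z).
    """
    z_near = origin[2] + t_near * direction[2]
    z_far = origin[2] + t_far * direction[2]
    params = []
    for i in range(n):
        s = i / (n - 1)  # fraction along the projected 2D segment
        # Perspective-correct interpolation: map the 2D fraction s
        # to the matching fraction lam along the 3D segment.
        lam = s * z_near / ((1 - s) * z_far + s * z_near)
        params.append(t_near + lam * (t_far - t_near))
    return params


# Hypothetical ray and camera intrinsics.
origin, direction = (0.5, 0.0, 0.0), (-0.2, 0.1, 1.0)
params = ray_params_uniform_in_image(origin, direction, 1.0, 10.0, 5)
pixels = [project(point_on_ray(origin, direction, t), 500.0, (320.0, 240.0))
          for t in params]
```

The resulting `pixels` are evenly spaced along the projected ray, while `params` bunch up near the camera: equal steps in the image correspond to increasingly large steps in depth.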

The image features at these samples are used to synthesize new views. The method learns to attend to the features close to the surface. Here we show the attention maps for one pixel and the pseudo-depth maps that result if we interpret the attention as a probability distribution.

https://vimeo.com/865975258?autoplay=1&loop=1&autopause=0&background=1&muted=1&ratio=10000:9184
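
If the attention over the samples along a ray is read as a probability distribution, a pseudo depth for that pixel can be taken as the expected sample depth. The sketch below shows this for a single ray. The function names and the weights are hypothetical, and taking the expectation (rather than, say, the peak) is an assumption about how such a depth could be derived.

```python
def expected_depth(attention, depths):
    """Mean depth under the attention weights, treated as probabilities."""
    total = sum(attention)  # renormalize in case the weights do not sum to 1
    return sum(a * d for a, d in zip(attention, depths)) / total


def most_likely_depth(attention, depths):
    """Depth of the sample that receives the highest attention."""
    best = max(range(len(attention)), key=lambda i: attention[i])
    return depths[best]


# Hypothetical attention weights for the samples along one pixel's ray.
depths = [1.0, 2.0, 3.0, 4.0]
attention = [0.05, 0.10, 0.70, 0.15]
mean_depth = expected_depth(attention, depths)     # roughly 2.95
peak_depth = most_likely_depth(attention, depths)  # 3.0
```

Repeating this for every pixel gives a pseudo-depth map. The expectation varies smoothly with the weights, while the peak snaps to one of the sample depths.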

Make sure to check out the paper by Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann to learn about the details of the method.