Date | Title/Abstract |
---|---|
9 Sep, 2024 | Introductions |
16 Sep, 2024 | Lightning talks |
23 Sep, 2024 | Katie Luo: Denoising Vision Transformers (to be presented at ECCV). Abstract: We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. I-Ting Tsai: 3D Synthesis for Architectural Design. Abstract: We introduce a 3D synthesis method for architectural design to allow for the efficient generation of diverse and precise building designs. In spite of recent progress, current off-the-shelf 3D synthesis techniques are ill-suited to architectural design: they are trained primarily on isolated objects, have limited diversity, blend building facades with the background, and produce overly complex geometry that is difficult to edit or manipulate, a major issue in an iterative design process. We propose an alternative pipeline that integrates auto-generated coarse models with segment-wise texture inpainting, resulting in diverse, style-consistent, and shape-precise designs. We show through qualitative and quantitative experiments that our pipeline generates more diverse, visually appealing architectures with clean geometries, without the need for any extensive training. |
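As a rough, hypothetical sketch of the second-stage idea described above (a lightweight module trained to predict clean features from raw ViT outputs), a minimal PyTorch-style version might look like the following; the module structure, dimensions, and MSE objective are illustrative assumptions, not the DVT implementation.

```python
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Hypothetical lightweight block mapping raw (artifact-contaminated) ViT
    patch features to denoised features, in the spirit of DVT's second stage."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)

    def forward(self, raw_feats):  # raw_feats: (batch, num_patches, dim)
        return self.block(raw_feats)

denoiser = FeatureDenoiser()
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def train_step(raw_feats, clean_feats):
    """One supervised step; `clean_feats` stands in for the per-image
    neural-field estimates from stage 1 (not reproduced here)."""
    pred = denoiser(raw_feats)
    loss = nn.functional.mse_loss(pred, clean_feats)  # assumed training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```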
30 Sep, 2024 |
Xinrui Liu: Hybrid Tours: A Clip-based System for Authoring Long-take Touring Shots. Long-take touring shots are characterized by smooth camera motion over a long distance that seamlessly connects different views of the captured scene. However, filming continuous long-take shots in real life is very difficult, and scanning and reconstructing an environment to render these shots virtually is very resource-intensive. We propose Hybrid Tours, a hybrid approach to creating these shots that combines capturing short clips representing potential tour segments with a custom interactive application that lets users filter and combine these segments to create and render longer camera trajectories. We show that short clips are easier to capture than long-take shots, and that clip-based authoring and reconstruction lead to higher-fidelity results at a lower cost than typical image-based rendering workflows. Nhan Tran: Practice talk for the upcoming UIST 2024 (10-12 minutes). If time allows, I will also briefly discuss an in-progress work under submission to CHI 2025 (5 minutes).
|
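Purely as a toy illustration of the clip-combination idea in the Hybrid Tours abstract above (not the system's actual algorithm), one could greedily chain clips whose end camera pose lies near another clip's start pose; all names and thresholds below are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    start_pos: np.ndarray  # camera position at the clip's first frame
    end_pos: np.ndarray    # camera position at the clip's last frame

def chain_clips(clips, start, max_gap=0.5):
    """Greedily build a longer tour by repeatedly appending the clip whose start
    pose is nearest to the current clip's end pose (toy illustration only)."""
    tour, current = [start], start
    remaining = [c for c in clips if c is not start]
    while remaining:
        dists = [np.linalg.norm(c.start_pos - current.end_pos) for c in remaining]
        best = int(np.argmin(dists))
        if dists[best] > max_gap:  # no remaining clip connects smoothly enough
            break
        current = remaining.pop(best)
        tour.append(current)
    return tour
```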
7 Oct, 2024 |
Lekha Revankar: Recognition of features in satellite imagery (forests, swimming pools, etc.) depends strongly on the spatial scale of the concept and therefore the resolution of the images. This poses two challenges:
Which resolution is best suited for recognizing a given concept, and where and when should the costlier higher-resolution (HR) imagery be acquired?
We present a novel scheme to address these challenges by introducing three components: (1) a technique to distill knowledge from models trained on HR imagery to recognition models that operate on imagery of lower resolution (LR), (2) a sampling strategy for HR imagery based on model disagreement, and (3) an LLM-based approach for inferring concept "scale". With these components, we present a system to efficiently perform scale-aware recognition in satellite imagery, improving accuracy over single-scale inference while staying within budget constraints. Our approach offers up to a 26.3% improvement over entirely HR baselines, using 76.3% fewer HR images. Chia-Hsiang Kao: Remote sensing question answering is challenging due to the modality-sensitive and parameter-sensitive nature of real-world observations. We present a novel system that aims to answer remote sensing questions scientifically, emphasizing interpretability, replicability, and cross-validity. Our approach utilizes a self-reflective and consensus-based scheme for statement verification. Key contributions include (1) Dataset: we build a multi-modal remote sensing QA benchmark containing (question, answer, map, code) tuples verified by domain experts, and (2) System: we develop an architecture that integrates cross-modal analysis, self-reflection, and consensus formation to verify remote sensing statements. Preliminary results demonstrate the system's ability to handle complex remote sensing queries across various modalities and parameters. The iterative refinement process and consensus formation mechanism show promise in improving answer accuracy and reliability. |
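As a hedged sketch of two of the components described above, (1) distilling an HR-trained teacher into an LR student and (2) ranking tiles for HR acquisition by model disagreement, the snippet below uses standard soft-target distillation and a symmetric-KL disagreement score; both choices are illustrative assumptions rather than the presented method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits_lr, teacher_logits_hr, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: the LR student mimics the HR teacher's softened
    predictions while also fitting the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits_lr / T, dim=-1),
        F.softmax(teacher_logits_hr / T, dim=-1),
        reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits_lr, labels)
    return alpha * soft + (1 - alpha) * hard

def disagreement_score(probs_a, probs_b):
    """Rank tiles for costly HR acquisition by how much two models disagree;
    a symmetric KL divergence is used here purely as an example measure."""
    kl_ab = F.kl_div(probs_a.log(), probs_b, reduction="none").sum(-1)
    kl_ba = F.kl_div(probs_b.log(), probs_a, reduction="none").sum(-1)
    return 0.5 * (kl_ab + kl_ba)
```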
21 Oct, 2024 |
Gene Chou: We address the problem of in-the-wild, wide-baseline view interpolation. The input is internet photos of a scene with illumination variations and occlusions, which serve as start and end viewpoints. The output is interpolated views that are consistent in appearance and geometry. We validate that our method outperforms even commercial models in terms of consistency and 3D-awareness. Densely generated frames implicitly represent a 3D scene, providing an alternative to repeated renderings for applications such as simulations and virtual walkthroughs. Furthermore, we show our model benefits applications that require 3D control, such as novel view synthesis via Gaussian splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and minimally annotated internet photos, which are available in nearly unlimited quantities.
Kuan Wei Huang: MegaScenes is a large, diverse dataset of structure-from-motion reconstructions of landmarks around the world from Internet photos. The sparse views from Internet photos result in sets of disconnected reconstructed components. To generate a complete 3D scene for a particular landmark, these point clouds need to be registered into a global coordinate frame. However, existing registration methods fail because they rely on overlapping point clouds as inputs or for training. My work bridges this gap by addressing the task of point cloud registration with no overlaps, leveraging landmark floor plans as the global coordinate system. In this talk, I will share an overview and updates on the dataset and preliminary results. |
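The floor-plan-based registration is the contribution of the talk above and is not reproduced here; as background, the standard alignment primitive involved, estimating a similarity transform from point correspondences (Umeyama), can be sketched as follows. How correspondences to a floor plan are obtained is the open problem and is not shown.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation) mapping
    src -> dst from known correspondences (Umeyama 1991). Background primitive
    only; the floor-plan-based registration itself is not reproduced here."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1  # guard against reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / (src_c ** 2).sum(1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Usage: map a reconstructed component into the global (floor-plan) frame given
# a few correspondences, then apply: aligned = scale * (R @ pts.T).T + t
```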
28 Oct, 2024 |
Haian Jin: In recent years, the fields of computer vision and graphics have achieved significant advances in scene reconstruction, representation, and editing. However, much of this work relies heavily on inductive biases, such as shading models, 3D structures, and rendering formulas, without fully leveraging data-driven techniques.
My recent research focuses on bypassing these commonly used inductive biases to build more generalizable, scalable, and high-quality pipelines that are fully data-driven.
For instance, in the field of relighting, we developed a model called "Neural Gaffer", which accurately relights any object in a single image under various lighting conditions, entirely free of traditional shading models and BRDF assumptions.
Moreover, novel view synthesis has long been a core challenge in 3D vision. But how much 3D inductive bias is truly necessary? Surprisingly, very little. We introduced "LVSM", a fully transformer-based large view synthesis model that generates consistent and high-quality views from sparse posed inputs with minimal reliance on 3D inductive bias. By bypassing traditional 3D inductive biases, from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps), we address novel view synthesis with a fully data-driven approach and outperform all previous methods.
Related work: Neural Gaffer: Relighting Any Object via Diffusion (NeurIPS 2024). Bradon Thymes: Video Question Answering (VideoQA) plays a critical role in advancing video understanding by generating responses to questions about video content based on visual and contextual cues. Traditional benchmarks for VideoQA emphasize spatio-temporal and scene analysis; however, they often overlook questions that require audio information, particularly speaker diarization. Speaker diarization, which involves identifying and segmenting different speakers in an audio stream, adds valuable context by tracking who spoke and when. Integrating speaker diarization into large language models (LLMs) can significantly enhance the accuracy of responses by leveraging speaker-specific context, providing a more comprehensive understanding of multimodal content. This approach presents a promising direction for improving video understanding systems. |
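As a very rough, hypothetical sketch of the "tokens in, tokens out" view-synthesis idea behind LVSM described above (not the actual LVSM architecture), the toy model below concatenates embedded source-image tokens with target-ray tokens and decodes RGB patches; the Plücker-ray conditioning, dimensions, and decoder are all assumptions.

```python
import torch
import torch.nn as nn

class MinimalViewSynthesizer(nn.Module):
    """Toy 'posed image tokens in, target view tokens out' transformer.
    Patchification, ray encoding, and decoding are simplified placeholders."""

    def __init__(self, dim=512, depth=6, heads=8, patch=16):
        super().__init__()
        self.patch = patch
        self.embed_src = nn.Linear(patch * patch * (3 + 6), dim)  # RGB + Plücker rays
        self.embed_tgt = nn.Linear(patch * patch * 6, dim)        # target rays only
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_rgb = nn.Linear(dim, patch * patch * 3)

    def forward(self, src_tokens, tgt_ray_tokens):
        # src_tokens: (B, N_src, patch*patch*9); tgt_ray_tokens: (B, N_tgt, patch*patch*6)
        x = torch.cat([self.embed_src(src_tokens),
                       self.embed_tgt(tgt_ray_tokens)], dim=1)
        x = self.backbone(x)
        tgt = x[:, -tgt_ray_tokens.shape[1]:]  # keep only the target-token outputs
        return self.to_rgb(tgt)                # predicted RGB patches of the novel view
```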
4 Nov, 2024 |
Peter Michael: A low-cost system for computational illumination tasks. Abstract: Computational illumination techniques have been used for various tasks, from relighting Hollywood actors to recovering 3D geometry. However, the rigs used to perform many of these tasks can be prohibitively expensive. We propose a simple setup consisting of a projector, a camera, and a handheld reflective surface for such tasks. In this talk, I will discuss how this setup could be used to acquire the light transport characteristics of an environment and show preliminary results on the related task of light painting, which places virtual emissive objects in an environment with accurate modeling of secondary lighting effects. Ethan Yang: 3D Time-lapse. Time-lapse is a powerful medium that helps us visualize slow and subtle changes over time. But one challenge of capturing conventional time-lapse is that it requires determining the viewpoint at the start and carefully returning to the same fixed viewpoint in each subsequent capture. Since the most interesting changes often occur in unexpected parts of the scene, framing the initial viewpoint is difficult: there is no way to see into the future or to recover a different view during or after acquisition. This project explores a more flexible alternative where the goal is to capture a time-lapse in 3D, allowing the user to freely reposition the camera in response to the observed temporal changes. To achieve this, we aim to create a holistic pipeline spanning capture, registration, and representation, and to address the key challenges that working with this specific kind of data raises at each stage. Very much a work in progress. |
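As a hedged illustration of the light-transport idea in the first abstract above (not the proposed projector/camera/mirror setup itself): with a light transport matrix T, a photo under illumination pattern p is c = T p, so T can be acquired one basis pattern at a time and then reused to relight the scene, indirect effects included. The brute-force one-pixel-at-a-time acquisition below is a simplification for clarity; practical systems use smarter patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CAM_PIXELS, NUM_PROJ_PIXELS = 64, 16
T_true = rng.random((NUM_CAM_PIXELS, NUM_PROJ_PIXELS))  # stand-in for the real scene

def capture(pattern):
    """Stand-in for projecting `pattern` and photographing the scene;
    a real system would drive the projector and camera here."""
    return T_true @ pattern

def acquire_transport(num_proj_pixels=NUM_PROJ_PIXELS):
    """Brute-force acquisition: column j of T is the camera's response to
    lighting up projector pixel j alone."""
    cols = []
    for j in range(num_proj_pixels):
        pattern = np.zeros(num_proj_pixels)
        pattern[j] = 1.0
        cols.append(capture(pattern))
    return np.stack(cols, axis=1)  # (num_cam_pixels, num_proj_pixels)

T = acquire_transport()
relit = T @ rng.random(NUM_PROJ_PIXELS)  # relight under a novel illumination vector
```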
11 Nov, 2024 | --- No seminar - CVPR --- |
18 Nov, 2024 |
Mariia Soroka: Quadric-Based Silhouette Sampling for Differentiable Rendering. Differentiable rendering is concerned with computing derivatives of a rendering algorithm w.r.t. various scene parameters (e.g., textures, vertex positions, camera). High-quality derivatives are crucial for inverse rendering, as they allow backpropagation through the renderer. Changing scene parameters moves the silhouettes of objects, which have a significant impact on the shading integrals computed by the renderer. Therefore, they also contribute to the derivatives and have to be accounted for. Building upon the edge sampling approach, we sample silhouette edges explicitly for each shading point. We maintain a dedicated data structure over the edges and traverse it stochastically based on heuristics tailored to match the expected edge contribution. We utilize the analytic simplicity and geometric expressiveness of quadrics and convex polyhedra to reject irrelevant edges and achieve better silhouette detection. Youming Deng: Large field-of-view (FOV) cameras offer the potential for high-quality scene reconstruction with fewer captures, as each capture covers a broader region. However, current state-of-the-art reconstruction pipelines, such as 3D Gaussian splatting, fail to fully utilize the advantages of large-FOV captures due to incompatibilities between fast rasterization and the modeling of extreme lens distortion. We present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large-FOV images without sacrificing resolution or introducing distortion artifacts. |
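As a rough sketch of the cubemap-style resampling mentioned in the second abstract above (the talk's actual camera model and method may differ): each cube face is an ordinary 90-degree perspective view, so a large-FOV image can be resampled by generating each face's rays and looking them up in the source image. An equidistant fisheye model and nearest-neighbor sampling are assumed purely for illustration.

```python
import numpy as np

def front_face_rays(face_res):
    """Unit ray directions for the front (+z) cube face, a 90-degree pinhole view.
    The other faces are axis permutations/sign flips of these rays."""
    u = np.linspace(-1, 1, face_res)
    x, y = np.meshgrid(u, u)                       # tangent-plane coordinates at z = 1
    dirs = np.stack([x, y, np.ones_like(x)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def rays_to_fisheye_pixels(dirs, f, cx, cy):
    """Project unit rays with an (assumed) equidistant fisheye model: r = f * theta."""
    theta = np.arccos(np.clip(dirs[..., 2], -1.0, 1.0))  # angle from the optical axis
    phi = np.arctan2(dirs[..., 1], dirs[..., 0])
    r = f * theta
    return cx + r * np.cos(phi), cy + r * np.sin(phi)

def resample_front_face(fisheye_img, f, face_res=512):
    """Nearest-neighbor resampling of the front cube face from a fisheye image."""
    h, w = fisheye_img.shape[:2]
    px, py = rays_to_fisheye_pixels(front_face_rays(face_res), f, w / 2, h / 2)
    px = np.clip(np.round(px).astype(int), 0, w - 1)
    py = np.clip(np.round(py).astype(int), 0, h - 1)
    return fisheye_img[py, px]
```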