Date | Title/Abstract |
---|---|
9 Sep, 2024 | Introductions |
16 Sep, 2024 | Lightning talks |
23 Sep, 2024 | Katie Luo: Denoising Vision Transformers (to be presented at ECCV). Abstract: We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue down to the positional embeddings at the input stage. To mitigate this, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision. Our method, DVT, does not require re-training the existing pre-trained ViTs, and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets. We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. I-Ting Tsai: 3D Synthesis for Architectural Design. Abstract: We introduce a 3D synthesis method for architectural design to allow for the efficient generation of diverse and precise building designs. In spite of recent progress, current off-the-shelf 3D synthesis techniques are ill-suited to architectural design: they are trained primarily on isolated objects, have limited diversity, blend building facades with the background, and produce overly complex geometry that is difficult to edit or manipulate, a major issue in an iterative design process. We propose an alternative pipeline that integrates auto-generated coarse models with segment-wise texture inpainting, resulting in diverse, style-consistent, and shape-precise designs. We show through qualitative and quantitative experiments that our pipeline generates more diverse, visually appealing architectures with clean geometries, without the need for any extensive training. |
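As a rough, hypothetical sketch of the second-stage idea described above (a lightweight module trained to predict clean features from raw ViT outputs), a minimal PyTorch-style version might look like the following; the module structure, dimensions, and MSE objective are illustrative assumptions, not the DVT implementation.

```python
import torch
import torch.nn as nn

class FeatureDenoiser(nn.Module):
    """Hypothetical lightweight block mapping raw (artifact-contaminated) ViT
    patch features to denoised features, in the spirit of DVT's second stage."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True)

    def forward(self, raw_feats):  # raw_feats: (batch, num_patches, dim)
        return self.block(raw_feats)

denoiser = FeatureDenoiser()
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def train_step(raw_feats, clean_feats):
    """One supervised step; `clean_feats` stands in for the per-image
    neural-field estimates from stage 1 (not reproduced here)."""
    pred = denoiser(raw_feats)
    loss = nn.functional.mse_loss(pred, clean_feats)  # assumed training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```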
30 Sep, 2024 |
Xinrui Liu: Hybrid Tours: A Clip-based System for Authoring Long-take Touring Shots. Long-take touring shots are characterized by smooth camera motion over a long distance that seamlessly connects different views of the captured scene. However, filming continuous long-take shots in real life is very difficult, and scanning and reconstructing an environment to render these shots virtually is very resource-intensive. We propose Hybrid Tours, a hybrid approach to creating these shots that combines capturing short clips representing potential tour segments with a custom interactive application that lets users filter and combine these segments to create and render longer camera trajectories. We show that short clips are easier to capture than long-take shots, and that clip-based authoring and reconstruction lead to higher-fidelity results at a lower cost than typical image-based rendering workflows. Nhan Tran: Practice talk for the upcoming UIST 2024 (10-12 minutes). If time allows, I will also briefly discuss an in-progress work under submission to CHI 2025 (5 minutes).
|
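Purely as a toy illustration of the clip-combination idea in the Hybrid Tours abstract above (not the system's actual algorithm), one could greedily chain clips whose end camera pose lies near another clip's start pose; all names and thresholds below are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    start_pos: np.ndarray  # camera position at the clip's first frame
    end_pos: np.ndarray    # camera position at the clip's last frame

def chain_clips(clips, start, max_gap=0.5):
    """Greedily build a longer tour by repeatedly appending the clip whose start
    pose is nearest to the current clip's end pose (toy illustration only)."""
    tour, current = [start], start
    remaining = [c for c in clips if c is not start]
    while remaining:
        dists = [np.linalg.norm(c.start_pos - current.end_pos) for c in remaining]
        best = int(np.argmin(dists))
        if dists[best] > max_gap:  # no remaining clip connects smoothly enough
            break
        current = remaining.pop(best)
        tour.append(current)
    return tour
```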
7 Oct, 2024 |
Lekha Revankar: Recognition of features in satellite imagery (forests, swimming pools, etc.) depends strongly on the spatial scale of the concept and therefore the resolution of the images. This poses two challenges:
Which resolution is best suited for recognizing a given concept, and where and when should the costlier higher-resolution (HR) imagery be acquired?
We present a novel scheme to address these challenges by introducing three components: (1) a technique to distill knowledge from models trained on HR imagery to recognition models that operate on imagery of lower resolution (LR), (2) a sampling strategy for HR imagery based on model disagreement, and (3) an LLM-based approach for inferring concept "scale". With these components, we present a system to efficiently perform scale-aware recognition in satellite imagery, improving accuracy over single-scale inference while staying within budget constraints. Our approach offers up to a 26.3% improvement over entirely HR baselines, using 76.3% fewer HR images. Chia-Hsiang Kao: Remote sensing question answering is challenging due to the modality-sensitive and parameter-sensitive nature of real-world observations. We present a novel system that aims to answer remote sensing questions scientifically, emphasizing interpretability, replicability, and cross-validity. Our approach utilizes a self-reflective and consensus-based scheme for statement verification. Key contributions include (1) Dataset: we build a multi-modal remote sensing QA benchmark containing (question, answer, map, code) tuples verified by domain experts, and (2) System: we develop an architecture that integrates cross-modal analysis, self-reflection, and consensus formation to verify remote sensing statements. Preliminary results demonstrate the system's ability to handle complex remote sensing queries across various modalities and parameters. The iterative refinement process and consensus formation mechanism show promise in improving answer accuracy and reliability. |
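As a hedged sketch of two of the components described above, (1) distilling an HR-trained teacher into an LR student and (2) ranking tiles for HR acquisition by model disagreement, the snippet below uses standard soft-target distillation and a symmetric-KL disagreement score; both choices are illustrative assumptions rather than the presented method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits_lr, teacher_logits_hr, labels, T=2.0, alpha=0.5):
    """Soft-target distillation: the LR student mimics the HR teacher's softened
    predictions while also fitting the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits_lr / T, dim=-1),
        F.softmax(teacher_logits_hr / T, dim=-1),
        reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits_lr, labels)
    return alpha * soft + (1 - alpha) * hard

def disagreement_score(probs_a, probs_b):
    """Rank tiles for costly HR acquisition by how much two models disagree;
    a symmetric KL divergence is used here purely as an example measure."""
    kl_ab = F.kl_div(probs_a.log(), probs_b, reduction="none").sum(-1)
    kl_ba = F.kl_div(probs_b.log(), probs_a, reduction="none").sum(-1)
    return 0.5 * (kl_ab + kl_ba)
```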
21 Oct, 2024 |
Gene Chou: We address the problem of in-the-wild, wide-baseline view interpolation. The input is internet photos of a scene with illumination variations and occlusions, which serve as start and end viewpoints. The output is interpolated views that are consistent in appearance and geometry. We validate that our method outperforms even commercial models in terms of consistency and 3D-awareness. Densely generated frames implicitly represent a 3D scene, providing an alternative to repeated renderings for applications such as simulations and virtual walkthroughs. Furthermore, we show our model benefits applications that require 3D control, such as novel view synthesis via Gaussian splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and minimally annotated internet photos, which are available in nearly unlimited quantities.
Kuan Wei Huang: MegaScenes is a large, diverse dataset of structure-from-motion reconstructions of landmarks around the world from Internet photos. The sparse views from Internet photos result in sets of disconnected reconstructed components. To generate a complete 3D scene for a particular landmark, these point clouds need to be registered into a global coordinate frame. However, existing registration methods fail because they rely on overlapping point clouds as inputs or for training. My work bridges this gap by addressing the task of point cloud registration with no overlaps, leveraging landmark floor plans as the global coordinate system. In this talk, I will share an overview and updates on the dataset and preliminary results. |
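The floor-plan-based registration is the contribution of the talk above and is not reproduced here; as background, the standard alignment primitive involved, estimating a similarity transform from point correspondences (Umeyama), can be sketched as follows. How correspondences to a floor plan are obtained is the open problem and is not shown.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation) mapping
    src -> dst from known correspondences (Umeyama 1991). Background primitive
    only; the floor-plan-based registration itself is not reproduced here."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1  # guard against reflections
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / (src_c ** 2).sum(1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Usage: map a reconstructed component into the global (floor-plan) frame given
# a few correspondences, then apply: aligned = scale * (R @ pts.T).T + t
```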
28 Oct, 2024 |
Haian Jin: In recent years, the fields of computer vision and graphics have achieved significant advances in scene reconstruction, representation, and editing. However, much of this work relies heavily on inductive biases, such as shading models, 3D structures, and rendering formulas, without fully leveraging data-driven techniques.
My recent research focuses on bypassing these commonly used inductive biases to build more generalizable, scalable, and high-quality pipelines that are fully data-driven.
For instance, in the field of relighting, we developed a model called "Neural Gaffer", which accurately relights any object in a single image under various lighting conditions, entirely free of traditional shading models and BRDF assumptions.
Moreover, novel view synthesis has long been a core challenge in 3D vision. But how much 3D inductive bias is truly necessary? Surprisingly, very little. We introduced "LVSM", a fully transformer-based large view synthesis model that generates consistent and high-quality views from sparse posed inputs with minimal reliance on 3D inductive bias. By bypassing traditional 3D inductive biases, from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps), we address novel view synthesis with a fully data-driven approach and outperform all previous methods.
Related work: Neural Gaffer: Relighting Any Object via Diffusion (NeurIPS 2024). Bradon Thymes: Video Question Answering (VideoQA) plays a critical role in advancing video understanding by generating responses to questions about video content based on visual and contextual cues. Traditional benchmarks for VideoQA emphasize spatio-temporal and scene analysis; however, they often overlook questions that require audio information, particularly speaker diarization. Speaker diarization, which involves identifying and segmenting different speakers in an audio stream, adds valuable context by tracking who spoke and when. Integrating speaker diarization into large language models (LLMs) can significantly enhance the accuracy of responses by leveraging speaker-specific context, providing a more comprehensive understanding of multimodal content. This approach presents a promising direction for improving video understanding systems. |
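As a very rough, hypothetical sketch of the "tokens in, tokens out" view-synthesis idea behind LVSM described above (not the actual LVSM architecture), the toy model below concatenates embedded source-image tokens with target-ray tokens and decodes RGB patches; the Plücker-ray conditioning, dimensions, and decoder are all assumptions.

```python
import torch
import torch.nn as nn

class MinimalViewSynthesizer(nn.Module):
    """Toy 'posed image tokens in, target view tokens out' transformer.
    Patchification, ray encoding, and decoding are simplified placeholders."""

    def __init__(self, dim=512, depth=6, heads=8, patch=16):
        super().__init__()
        self.patch = patch
        self.embed_src = nn.Linear(patch * patch * (3 + 6), dim)  # RGB + Plücker rays
        self.embed_tgt = nn.Linear(patch * patch * 6, dim)        # target rays only
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_rgb = nn.Linear(dim, patch * patch * 3)

    def forward(self, src_tokens, tgt_ray_tokens):
        # src_tokens: (B, N_src, patch*patch*9); tgt_ray_tokens: (B, N_tgt, patch*patch*6)
        x = torch.cat([self.embed_src(src_tokens),
                       self.embed_tgt(tgt_ray_tokens)], dim=1)
        x = self.backbone(x)
        tgt = x[:, -tgt_ray_tokens.shape[1]:]  # keep only the target-token outputs
        return self.to_rgb(tgt)                # predicted RGB patches of the novel view
```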
4 Nov, 2024 |
Peter Michael: A low-cost system for computational illumination tasks. Abstract: Computational illumination techniques have been used for various tasks, from relighting Hollywood actors to recovering 3D geometry. However, the rigs used to perform many of these tasks can be prohibitively expensive. We propose a simple setup consisting of a projector, a camera, and a handheld reflective surface for such tasks. In this talk, I will discuss how this setup could be used to acquire the light transport characteristics of an environment and show preliminary results on the related task of light painting, which places virtual emissive objects in an environment with accurate modeling of secondary lighting effects. Ethan Yang: 3D Time-lapse. Time-lapse is a powerful medium that helps us visualize slow and subtle changes over time. But one challenge of capturing conventional time-lapse is that it requires determining the viewpoint at the start and carefully returning to the same fixed viewpoint in each subsequent capture. Since the most interesting changes often occur in unexpected parts of the scene, framing the initial viewpoint is difficult: there is no way to see into the future or to recover a different view during or after acquisition. This project explores a more flexible alternative where the goal is to capture a time-lapse in 3D, allowing the user to freely reposition the camera in response to the observed temporal changes. To achieve this, we aim to create a holistic pipeline spanning capture, registration, and representation, and to address the key challenges that working with this specific kind of data raises at each stage. Very much a work in progress. |
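As a hedged illustration of the light-transport idea in the first abstract above (not the proposed projector/camera/mirror setup itself): with a light transport matrix T, a photo under illumination pattern p is c = T p, so T can be acquired one basis pattern at a time and then reused to relight the scene, indirect effects included. The brute-force one-pixel-at-a-time acquisition below is a simplification for clarity; practical systems use smarter patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CAM_PIXELS, NUM_PROJ_PIXELS = 64, 16
T_true = rng.random((NUM_CAM_PIXELS, NUM_PROJ_PIXELS))  # stand-in for the real scene

def capture(pattern):
    """Stand-in for projecting `pattern` and photographing the scene;
    a real system would drive the projector and camera here."""
    return T_true @ pattern

def acquire_transport(num_proj_pixels=NUM_PROJ_PIXELS):
    """Brute-force acquisition: column j of T is the camera's response to
    lighting up projector pixel j alone."""
    cols = []
    for j in range(num_proj_pixels):
        pattern = np.zeros(num_proj_pixels)
        pattern[j] = 1.0
        cols.append(capture(pattern))
    return np.stack(cols, axis=1)  # (num_cam_pixels, num_proj_pixels)

T = acquire_transport()
relit = T @ rng.random(NUM_PROJ_PIXELS)  # relight under a novel illumination vector
```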
11 Nov, 2024 | --- No seminar - CVPR --- |
18 Nov, 2024 |
Mariia Soroka: Quadric-Based Silhouette Sampling for Differentiable Rendering. Differentiable rendering is concerned with computing derivatives of a rendering algorithm w.r.t. various scene parameters (e.g., textures, vertex positions, camera). High-quality derivatives are crucial for inverse rendering, as they allow backpropagation through the renderer. Changing scene parameters moves the silhouettes of objects, which have a significant impact on the shading integrals computed by the renderer. Therefore, they also contribute to the derivatives and have to be accounted for. Building upon the edge sampling approach, we sample silhouette edges explicitly for each shading point. We maintain a dedicated data structure over the edges and traverse it stochastically based on heuristics tailored to match the expected edge contribution. We utilize the analytic simplicity and geometric expressiveness of quadrics and convex polyhedra to reject irrelevant edges and achieve better silhouette detection. Youming Deng: Large field-of-view (FOV) cameras offer the potential for high-quality scene reconstruction with fewer captures, as each capture covers a broader region. However, current state-of-the-art reconstruction pipelines, such as 3D Gaussian splatting, fail to fully utilize the advantages of large-FOV captures due to incompatibilities between fast rasterization and the modeling of extreme lens distortion. We present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large-FOV images without sacrificing resolution or introducing distortion artifacts. |
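As a rough sketch of the cubemap-style resampling mentioned in the second abstract above (the talk's actual camera model and method may differ): each cube face is an ordinary 90-degree perspective view, so a large-FOV image can be resampled by generating each face's rays and looking them up in the source image. An equidistant fisheye model and nearest-neighbor sampling are assumed purely for illustration.

```python
import numpy as np

def front_face_rays(face_res):
    """Unit ray directions for the front (+z) cube face, a 90-degree pinhole view.
    The other faces are axis permutations/sign flips of these rays."""
    u = np.linspace(-1, 1, face_res)
    x, y = np.meshgrid(u, u)                       # tangent-plane coordinates at z = 1
    dirs = np.stack([x, y, np.ones_like(x)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def rays_to_fisheye_pixels(dirs, f, cx, cy):
    """Project unit rays with an (assumed) equidistant fisheye model: r = f * theta."""
    theta = np.arccos(np.clip(dirs[..., 2], -1.0, 1.0))  # angle from the optical axis
    phi = np.arctan2(dirs[..., 1], dirs[..., 0])
    r = f * theta
    return cx + r * np.cos(phi), cy + r * np.sin(phi)

def resample_front_face(fisheye_img, f, face_res=512):
    """Nearest-neighbor resampling of the front cube face from a fisheye image."""
    h, w = fisheye_img.shape[:2]
    px, py = rays_to_fisheye_pixels(front_face_rays(face_res), f, w / 2, h / 2)
    px = np.clip(np.round(px).astype(int), 0, w - 1)
    py = np.clip(np.round(py).astype(int), 0, h - 1)
    return fisheye_img[py, px]
```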