Date: February 6, 2026

Speaker: Jiani Huang

Title: Neurosymbolic Multi-modal Alignment for Scene Graph Generation

 

Abstract: Multi-modal large language models (MLLMs) show strong potential for embodied AI, but often struggle with fine-grained grounding between visual perception and high-level semantics, leading to inaccurate and unreliable behavior in dynamic environments. We address this challenge by using spatio-temporal scene graphs (STSGs) as a structured interface between perception and reasoning.
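
To make the representation concrete, the sketch below shows one way an STSG might be modeled as a data structure: entity nodes plus relation edges annotated with the frame interval over which they hold. All names here (Entity, Relation, SceneGraph, active_at) are invented for illustration and are not taken from LASER or ESCA.

```python
from dataclasses import dataclass, field

# Illustrative STSG data model; all names are hypothetical.

@dataclass(frozen=True)
class Entity:
    id: int
    category: str              # e.g. "person", "cup"

@dataclass(frozen=True)
class Relation:
    subject: int               # Entity id
    predicate: str             # e.g. "holding", "left_of"
    object: int                # Entity id
    start_frame: int           # relation holds on [start_frame, end_frame]
    end_frame: int

@dataclass
class SceneGraph:
    entities: dict[int, Entity] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def active_at(self, frame: int) -> list[Relation]:
        """Relations that hold at a given video frame."""
        return [r for r in self.relations
                if r.start_frame <= frame <= r.end_frame]

# Example: "a person holding a cup" over frames 0-30.
g = SceneGraph()
g.entities[0] = Entity(0, "person")
g.entities[1] = Entity(1, "cup")
g.relations.append(Relation(0, "holding", 1, 0, 30))
print(g.active_at(15))         # the "holding" relation is active here
```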

To extract STSGs, we introduce LASER, a weakly supervised framework that learns fine-grained spatio-temporal representations from video using only natural-language captions, eliminating the need for manual STSG annotations. LASER extracts rich spatio-temporal specifications from captions and trains perception models via differentiable symbolic reasoning, outperforming fully supervised baselines across multiple video datasets.
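
As a rough illustration of this training signal (not LASER's actual specification language or semantics), the toy PyTorch snippet below scores a hypothetical caption-derived constraint, "the holding relation occurs at some frame", against per-frame predicate probabilities using a smooth noisy-or, so the resulting loss is differentiable and can supervise the perception model without frame-level labels.

```python
import torch

# Toy weak supervision via differentiable symbolic evaluation.
# The spec and loss below are invented for illustration only.

T = 8  # number of video frames

# Perception output: probability that "holding" holds at each frame.
logits = torch.randn(T, requires_grad=True)
p_holding = torch.sigmoid(logits)

def soft_exists(probs: torch.Tensor) -> torch.Tensor:
    """Smooth existential quantifier over time (noisy-or)."""
    return 1.0 - torch.prod(1.0 - probs)

# Caption-derived spec: "holding holds at some frame".
spec_prob = soft_exists(p_holding)

# Push the spec's probability toward 1; no per-frame labels needed.
loss = -torch.log(spec_prob + 1e-8)
loss.backward()                # gradients reach the perception model
print(loss.item(), logits.grad.shape)
```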

Building on this ability to extract STSGs across diverse scenarios, we introduce ESCA, which contextualizes embodied agents by grounding their perception in structured spatio-temporal representations. ESCA significantly improves perception for agents powered by both open-source and commercial MLLMs, reducing perception errors and enabling open-source models to surpass proprietary baselines.
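
One plausible way to picture this grounding step (the exact interface is not described in the abstract, so everything below is hypothetical) is to serialize the scene-graph facts active at the current frame into text that accompanies the agent's observation and task.

```python
# Hypothetical grounding step: serialize scene-graph facts into a
# textual context for the MLLM.  Not ESCA's actual interface.

# Facts as (subject, predicate, object) triples active at the current frame.
active_facts = [("person", "holding", "cup"), ("cup", "on", "table")]

def facts_to_context(facts: list[tuple[str, str, str]]) -> str:
    return "Observed scene facts: " + "; ".join(
        f"{s} {p} {o}" for (s, p, o) in facts
    )

prompt = facts_to_context(active_facts) + "\nTask: hand the cup to the person."
# The grounded prompt is passed to the MLLM alongside the video frame.
print(prompt)
```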

Together, LASER and ESCA provide a unified neurosymbolic approach for integrating video, language, and action through spatio-temporal structure, enabling more accurate, explainable, and reliable embodied AI systems.


Bio: Jiani Huang is a Ph.D. candidate in Computer Science at the University of Pennsylvania, advised by Professor Mayur Naik. Her research interests lie at the intersection of programming languages and machine learning, with a focus on neurosymbolic methods for multi-modal systems.

Her work centers on (1) the design and implementation of Scallop, a neurosymbolic programming language, and (2) its applications across diverse domains, including natural language processing, computer vision, and trustworthy AI. More broadly, her research explores how structured reasoning over rich spatio-temporal representations can support robust perception and decision-making in complex, dynamic environments.
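
For a feel of what a neurosymbolic program computes, the standalone Python toy below evaluates a recursive "path" rule over probabilistic "edge" facts by naive fixpoint iteration, combining probabilities with min for conjunction and max for disjunction. This only mimics the flavor of probabilistic Datalog-style semantics; it is not Scallop syntax, its implementation, or its API.

```python
# Toy fixpoint evaluation of a recursive Datalog-style rule:
#   path(a, c) = edge(a, c) or (path(a, b) and edge(b, c))
# Probabilities combine via min (conjunction) and max (disjunction).

edges = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.7}

path = dict(edges)                   # base case: direct edges
changed = True
while changed:                       # naive fixpoint iteration
    changed = False
    for (a, b), p_ab in list(path.items()):
        for (b2, c), p_bc in edges.items():
            if b2 != b:
                continue
            p = min(p_ab, p_bc)      # conjunction
            if p > path.get((a, c), 0.0):
                path[(a, c)] = p     # disjunction keeps the max
                changed = True

print(path[(0, 3)])                  # 0.7 == min(0.9, 0.8, 0.7)
```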

By integrating symbolic structure with learning-based models across perception, reasoning, and action, she aims to advance the development of AI systems that are reliable, interpretable, and suitable for deployment in real-world, safety-critical settings.