Date: February 9, 2026
Time: 2:55-4:10 p.m.
Location: Gates 114
Speaker: William Yang, Ph.D. candidate, Princeton University
Abstract: Synthetic data generation has become an increasingly powerful tool for overcoming the limitations of collecting and curating large real-world datasets for model training. Yet, fundamental questions remain about how synthetic data stores task-relevant information and how it can best be generated. In this talk, we bring together two complementary lines of work that aim to deepen our understanding of synthetic dataset construction. First, we examine dataset distillation, which compresses large datasets into a compact collection of synthetic examples while retaining important task-specific information. We discuss what distilled data actually represents, how it encodes task-relevant information about early training dynamics, and why it cannot simply substitute for real data. Second, we investigate text-to-image (T2I) models as generative engines for synthetic training data, focusing on the challenge of producing diverse, semantically aligned samples. We introduce a fine-tuning strategy, Beyond OBjects (BOB), which leverages class-agnostic attributes such as background and pose to guide model adaptation, mitigating overfitting while preserving generative diversity. Together, these perspectives offer both conceptual insights and practical advances toward building more effective, interpretable, and generalizable synthetic datasets in the era of large-scale data.
Bio: William Yang is a fifth-year Ph.D. candidate at Princeton University under the supervision of Prof. Olga Russakovsky. His research interests lie broadly in the field of machine learning, with a focus on understanding the impact of data on modern computer vision systems. His current research centers on synthetic data generation for more efficient learning.