Chris De Sa

Gates Hall, Room 426

I am an Assistant Professor in the Computer Science department at Cornell University. I am a member of the Cornell Machine Learning Group and I lead the Relax ML Lab. My research interests include algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on relaxed-consistency variants of stochastic algorithms such as asynchronous and low-precision stochastic gradient descent (SGD) and Markov chain Monte Carlo. My work builds towards using these techniques to construct data analytics and machine learning frameworks, including for deep learning, that are efficient, parallel, and distributed.

I graduated from Stanford University in 2017, where I was advised by Kunle Olukotun and by Chris R‌é.

Recent News and Awards[show all news][show only recent news]

Our paper "Is My Prediction Arbitrary? The Confounding Effects of Variance in Fair Classification Benchmarks" won a Best Student Paper Award Honorable Mention for the AI for Social Impact track at AAAI-2024.

I gave a keynote speech at the International Conference on AI-ML Systems.

I am program chair at MLSys 2024.

I was awarded the Google Research Scholar Award.

I was a workshop/tutorial chair at MLSys 2022.

Our paper "Optimal Complexity in Decentralized Training" (Yucheng Lu, Chris De Sa) won an Outstanding Paper Award Honorable Mention at ICML 2021 (awarded to 4 papers out of 1184 publications).

I co-organized the Cornell Institute for Digital Agriculture Hackathon.

I have joined the executive committee of the Cornell Institute for Digital Agriculture (CIDA).

I won the National Science Foundation CAREER award.

Our paper on bias in model selection was featured in the popular press.

I was awarded the Mr. & Mrs. Richard F. Tucker Excellence in Teaching Award from the College of Engineering.

I was awarded an NSF Robust Intelligence Small Grant for "Reliable Machine Learning in Hyperbolic Spaces."

Three papers from our lab were accepted into NeurIPS 2020, of which two won spotlight awards!

Together with faculty from other departments, I have been teaching PLSCI 7202, a short course on applications of machine learning to plant science.

PhD Students

Ruqi Zhang. PhD 2021, Statistics (now an Assistant Professor at Purdue CS). Scalable Bayesian inference for ML.

Yucheng Lu. PhD Student, Computer Science (now at Together). Distributed optimization, ML systems.

A. Feder Cooper. PhD Student, Computer Science. Accountable, robust ML systems.

Jerry Chee. PhD Student, Computer Science. Scalable and resource-efficient ML.

Tao Yu. PhD Student, Computer Science. Private and secure ML, accurate learning in hyperbolic space.

Yaohui Cai. PhD Student, Electrical and Computing Engineering. Efficient deep learning inference and training (co-advised with Zhiru Zhang).

Si Yi (Cathy) Meng. PhD Student, Computer Science. Optimization algorithms for large-scale machine learning.

Albert Tseng. PhD Student, Computer Science. Deep learning quantization and compression.


▷ CS 4780/5780 Machine Learning (Spring 2022, Spring 2018)

▷ CS 4787/5777 Principles of Large-Scale Machine Learning (Fall 2023, Fall 2022, Spring 2021, Spring 2020, Spring 2019)

▷ CS 6787 Advanced Machine Learning Systems (Fall 2021, Fall 2020, Fall 2019, Fall 2018, Fall 2017)

▷ CS 7792 Special Topics in Machine Learning (Spring 2023)

Office Hours Wednesdays 2:00-3:00 PM in Gates 426.


Lab WebsiteCVGoogle ScholarShow All AbstractsHide All Abstracts

AAAI 2024
Arbitrariness of Prediction in Fair Classification Best Student Paper (Honorable Mention)
A. Feder Cooper, Katherine Lee, Madiha Choksi, Solon Barocas, Christopher De Sa, James Grimmelmann, Jon Kleinberg, Siddhartha Sen, Baobao Zhang
In The 38th Annual AAAI Conference on Artificial Intelligence, February 2024.
Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions: We: 1) Define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) Develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) Conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and, 4) Release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close-to-fair when taking into account the amount of arbitrariness present in predictions -- before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.

SA 2023
Neural Caches for Monte Carlo Partial Differential Equation Solvers
Zilu Li, Guandao Yang, Xi Deng, Christopher De Sa, Bharath Hariharan, Steve Marschner
In SIGGRAPH Asia 2023, December 2023.
This paper presents a method that uses neural networks as a caching mechanism to reduce the variance of Monte Carlo Partial Differential Equation solvers, such as the Walk-on-Spheres algorithm [Sawhney and Crane 2020]. While these Monte Carlo PDE solvers have the merits of being unbiased and discretization-free, their high variance often hinders real-time applications. On the other hand, neural networks can approximate the PDE solution, and evaluating these networks at inference time can be very fast. However, neural-network-based solutions may suffer from convergence difficulties and high bias. Our hybrid system aims to combine these two potentially complementary solutions by training a neural field to approximate the PDE solution using supervision from a WoS solver. This neural field is then used as a cache in the WoS solver to reduce variance during inference. We demonstrate that our neural field training procedure is better than the commonly used self-supervised objectives in the literature. We also show that our hybrid solver exhibits lower variance than WoS with the same computational budget: it is significantly better for small compute budgets and provides smaller improvements for larger budgets, reaching the same performance as WoS in the limit.

NeurIPS 2023
QuIP: 2-Bit Quantization of Large Language Models With Guarantees Spotlight
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
In Proceedings of the 36th Neural Information Processing Systems Conference, December 2023.
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices, i.e., from the weights being even in magnitude and the directions in which it is important to round them accurately being unaligned with the coordinate axes. QuIP consists of two steps: (1) an adaptive rounding procedure minimizing a quadratic proxy objective; (2) efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. We complement QuIP with the first theoretical analysis for an LLM-scale quantization algorithm, and show that our theory also applies to an existing method, OPTQ. Empirically, we find that our incoherence preprocessing improves several existing quantization algorithms and yields the first LLM quantization methods that produce viable results using only two bits per weight. Our code can be found at

CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training
A. Feder Cooper, Wentao Guo, Khiem Pham, Tiancheng Yuan, Charlie F. Ruan, Yucheng Lu, Christopher De Sa
In Proceedings of the 36th Neural Information Processing Systems Conference, December 2023.
Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: While it demonstrates an impressive ability to scale-up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms baselines empirically, including distributed RR, on a variety of benchmark tasks.

Coneheads: Hierarchy Aware Attention
Albert Tseng, Tao Yu, Toni J.B. Liu, Christopher De Sa
In Proceedings of the 36th Neural Information Processing Systems Conference, December 2023.
Attention networks such as transformers have achieved state-of-the-art performance in many domains. These networks rely heavily on the dot product attention operator, which computes the similarity between two points by taking their inner product. However, the inner product does not explicitly model the complex structural properties of real world datasets, such as hierarchies between data points. To remedy this, we introduce cone attention, a drop-in replacement for dot product attention based on hyperbolic entailment cones. Cone attention associates two points by the depth of their lowest common ancestor in a hierarchy defined by hyperbolic cones, which intuitively measures the divergence of two points and gives a similarity score. We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention.

TART: A plug-and-play Transformer module for task-agnostic reasoning
Kush Bhatia, Avanika Narayan, Christopher De Sa, Christopher Re
In Proceedings of the 36th Neural Information Processing Systems Conference, December 2023.
Large language models (LLMs) exhibit in-context learning abilities which enable the same model to perform several tasks without any task-specific training. In contrast, traditional adaptation approaches, such as fine-tuning, modify the underlying models for each specific task. In-context learning, however, consistently underperforms task-specific tuning approaches even when presented with the same examples. While most existing approaches (e.g., prompt engineering) focus on the LLM's learned representations to patch this performance gap, our experiments actually reveal that LLM representations contain sufficient information to make good predictions. As such, we focus on the LLM's reasoning abilities and demonstrate that this performance gap exists due to their inability to perform simple probabilistic reasoning tasks. This raises an intriguing question: Are LLMs actually capable of learning how to reason in a task-agnostic manner? We answer this in the affirmative and, as a proof of concept, propose TART which generically improves an LLM's reasoning abilities using a synthetically trained reasoning module. TART trains this Transformer-based reasoning module in a task-agnostic manner using only synthetic logistic regression tasks and composes it with an arbitrary real-world pre-trained model without any additional training. With a single inference module, TART improves performance across different model families (GPT-Neo, Pythia, Bloom), model sizes (100M - 6B), tasks (14 NLP classification tasks), and even across different modalities (audio and vision). On the RAFT Benchmark, TART improves GPT-Neo (125M)'s performance such that it outperforms Bloom (176B), and is within 4% of GPT-3.

Riemannian Residual Neural Networks
Isay Katsman, Eric Ming Chen, Sidhanth Holalkere, Anna Asch, Aaron Lou, Ser-Nam Lim, Christopher De Sa
In Proceedings of the 36th Neural Information Processing Systems Conference, December 2023.
Recent methods in geometric deep learning have introduced various neural networks to operate over data that lie on Riemannian manifolds. Such networks are often necessary to learn well over graphs with a hierarchical structure or to learn over manifold-valued data encountered in the natural sciences. These networks are often inspired by and directly generalize standard Euclidean neural networks. However, extending Euclidean networks is difficult and has only been done for a select few manifolds. In this work, we examine the residual neural network (ResNet) and show how to extend this construction to general Riemannian manifolds in a geometrically principled manner. Originally introduced to help solve the vanishing gradient problem, ResNets have become ubiquitous in machine learning due to their beneficial learning properties, excellent empirical results, and easy-to-incorporate nature when building varied neural networks. We find that our Riemannian ResNets mirror these desirable properties: when compared to existing manifold neural networks designed to learn over hyperbolic space and the manifold of symmetric positive definite matrices, we outperform both kinds of networks in terms of relevant testing metrics and training dynamics.

UAI 2023
Inference for Probabilistic Dependency Graphs
Oliver Richardson, Joe Halpern, Christopher De Sa
In UAI: the 39th Conference on Uncertainty in Artificial Intelligence, August 2023.
Probabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that of the graphical models they generalize. The key components are: (1) the observation that PDG inference can be reduced to convex optimization with exponential cone constraints, (2) a construction that allows us to express these problems compactly for PDGs of boundeed treewidth, for which we needed to further develop the theory of PDGs, and (3) an appeal to interior point methods that can solve such problems in polynomial time. We verify the correctness and time complexity of our approach, and provide an implementation of it. We then evaluate our implementation, and demonstrate that it outperforms baseline approaches. Our code is available at

ICML 2023
CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks
Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Re, Ce Zhang
In ICML: the Thirty-ninth International Conference on Machine Learning, July 2023.
Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques -- random sparsification, top-K sparsification, and quantization -- to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117× compression in fine-tuning LLMs up to 20 billion parameters without hurting convergence. On a 500Mbps network, CocktailSGD only incurs ∼1.2× slowdown compared with data center networks.

InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models
Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, Volodymyr Kuleshov
In ICML: the Thirty-ninth International Conference on Machine Learning, July 2023.
Diffusion models feature high sample quality, but are not effective at learning semantically meaningful latent representations. Here, we propose InfoDiffusion, an algorithm that enables diffusion models to perform representation learning using low-dimensional latent variables. We introduce auxiliary-variable diffusion models---a model family that contains an additional set of semantically meaningful latents---and we derive new variational inference algorithms that optimize a learning objective regularized with a mutual information term. Maximizing mutual information helps InfoDiffusion uncover semantically meaningful representations across multiple datasets, including representations that achieve the strong property of disentanglement. We envision our methods being useful in applications that require exploring a learned latent space to generate high-quality outputs, e.g., in generative design.

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition
Yucheng Lu, Amir Yazdanbakhsh, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa
In ICML: the Thirty-ninth International Conference on Machine Learning, July 2023.
Recent innovations on hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate such gap origins from poorly estimated second moment (i.e. variance) in Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks with two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point of two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.

JMLR 2023
Decentralized Learning: Theoretical Optimality and Practical Improvements
Yucheng Lu, Christopher De Sa
In JMLR: Journal of Machine Learning Research, April 2023.
Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. While a simple version of DeTAG with plain SGD and constant step size suffice for achieving theoretical limits, we additionally provide convergence bound for DeTAG under general non-increasing step size and momentum. Empirically, we compare DeTAG with other decentralized algorithms on multiple vision benchmarks, including CIFAR10/100 and ImageNet. We substantiate our theory and show DeTAG converges faster on unshuffled data and in sparse networks. Furthermore, we study a DeTAG variant, DeTAG*, that practically speeds up data-center-scale model training. This manuscript provides extended contents to its ICML version.

ICLR 2023
Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam
Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He
In ICLR: The Eleventh International Conference on Learning Representations, May 2023.
1-bit gradient compression and local steps are two representative techniques that enable drastic communication reduction in distributed SGD. Their benefits, however, remain an open question on Adam-based large model pre-training (e.g. BERT and GPT). In this paper, we demonstrate the non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are individually applied. To alleviate this limitation, we propose 0/1 Adam that linearizes each Adam step via approximating its optimizer states using their stale estimates and linear correlation. 0/1 Adam performs an Adam-like step to preserve the adaptivity, while its linearity allows utilizing 1-bit compression and local steps simultaneously for wall-clock time speed up. We provide convergence guarantee for 0/1 Adam on smooth non-convex objectives. On various large-scale benchmarks such as BERT-Base, BERT-Large, GPT-2 pre-training and ImageNet, we demonstrate on up to 128 GPUs that 0/1 Adam is able to reduce up to 87% of data volume, 54% of communication rounds, and achieve up to 2 higher training throughput and end-to-end training time reduction compared to the state-of-the-art baseline 1-bit Adam; while enjoying the same statistical convergence speed and end task model accuracy on GLUE dataset and ImageNet validation set.

Random Laplacian Features for Learning with Hyperbolic Space
Tao Yu, Christopher De Sa
In ICLR: The Eleventh International Conference on Learning Representations, May 2023.
Due to its geometric properties, hyperbolic space can support high-fidelity embeddings of tree- and graph-structured data, upon which various hyperbolic networks have been developed. Existing hyperbolic networks encode geometric priors not only for the input, but also at every layer of the network. This approach involves repeatedly mapping to and from hyperbolic space, which makes these networks complicated to implement, computationally expensive to scale, and numerically unstable to train. In this paper, we propose a simpler approach: learn a hyperbolic embedding of the input, then map once from it to Euclidean space using a mapping that encodes geometric priors by respecting the isometries of hyperbolic space, and finish with a standard Euclidean network. The key insight is to use a random feature mapping via the eigenfunctions of the Laplace operator, which we show can approximate any isometry-invariant kernel on hyperbolic space. Our method can be used together with any graph neural networks: using even a linear graph model yields significant improvements in both efficiency and performance over other hyperbolic baselines in both transductive and inductive tasks.

NeurIPS 2022
GraB: Finding Provably Better Data Permutations than Random Reshuffling
Yucheng Lu, Wentao Guo, Christopher De Sa
In NeurIPS: Proceedings of the 35th Neural Information Processing Systems Conference, December 2022.
Random reshuffling, which randomly permutes the dataset each epoch, is widely adopted in model training because it yields faster convergence than with-replacement sampling. Recent studies indicate greedily chosen data orderings can further speed up convergence empirically, at the cost of using more computation and memory. However, greedy ordering lacks theoretical justification and has limited utility due to its non-trivial memory and computation overhead. In this paper, we first formulate an example-ordering framework named mph{herding} and answer affirmatively that SGD with herding converges at the rate \( O(T^{-2/3}) \) on smooth, non-convex objectives, faster than the \( O(n^{1/3}T^{-2/3}) \) obtained by random reshuffling, where \( n \) denotes the number of data points and \( T \) denotes the total number of iterations. To reduce the memory overhead, we leverage discrepancy minimization theory to propose an online Gradient Balancing algorithm (GraB) that enjoys the same rate as herding, while reducing the memory usage from \( O(nd) \) to just \( O(d) \) and computation from \( O(n^2) \) to \( O(n) \), where \( d \) denotes the model dimension. We show empirically on applications including MNIST, CIFAR10, WikiText and GLUE that GraB can outperform random reshuffling in terms of both training and validation performance, and even outperform state-of-the-art greedy ordering while reducing memory usage over \( 100 \times \).

Understanding Hyperdimensional Computing for Parallel Single-Pass Learning
Tao Yu, Yichi Zhang, Zhiru Zhang, Christopher De Sa
In NeurIPS: Proceedings of the 35th Neural Information Processing Systems Conference, December 2022.
Hyperdimensional computing (HDC) is an emerging learning paradigm that computes with high dimensional binary vectors. There is an active line of research on HDC in the community of emerging hardware because of its energy efficiency and ultra-low latency—but HDC suffers from low model accuracy, with little theoretical understanding of what limits its performance. We propose a new theoretical analysis of the limits of HDC via a consideration of what similarity matrices can be "expressed" by binary vectors, and we show how the limits of HDC can be approached using random Fourier features (RFF). We extend our analysis to the more general class of vector symbolic architectures (VSA), which compute with high-dimensional vectors (hypervectors) that are not necessarily binary. We propose a new class of VSAs, finite group VSAs, which surpass the limits of HDC. Using representation theory, we characterize which similarity matrices can be "expressed" by finite group VSA hypervectors, and we show how these VSAs can be constructed. Experimental results show that our RFF method and group VSA can both outperform the state-of-the-art HDC model by up to 7.6% while maintaining hardware efficiency. This work aims to inspire a future interest on HDC in the ML community and connect to the hardware community.

Model Preserving Compression for Neural Networks
Jerry Chee, Megan Renz, Anil Damle, Christopher De Sa
In NeurIPS: Proceedings of the 35th Neural Information Processing Systems Conference, December 2022.
After training complex deep learning models, a common task is to compress the model to reduce compute and storage demands. When compressing, it is desirable to preserve the original model's per-example decisions (e.g., to go beyond top-1 accuracy or preserve robustness), maintain the network's structure, automatically determine per-layer compression levels, and eliminate the need for fine tuning. No existing compression methods simultaneously satisfy these criteria—we introduce a principled approach that does by leveraging interpolative decompositions. Our approach simultaneously selects and eliminates channels (analogously, neurons), then constructs an interpolation matrix that propagates a correction into the next layer, preserving the network's structure. Consequently, our method achieves good performance even without fine tuning and admits theoretical analysis. Our theoretical generalization bound for a one layer network lends itself naturally to a heuristic that allows our method to automatically choose per-layer sizes for deep networks. We demonstrate the efficacy of our approach with strong empirical performance on a variety of tasks, models, and datasets—from simple one-hidden-layer networks to deep networks on ImageNet.

From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent
Christopher De Sa, Satyen Kale, Jason D. Lee, Ayush Sekhari, Karthik Sridharan
In NeurIPS: Proceedings of the 35th Neural Information Processing Systems Conference, December 2022.
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges. Our main tool to establish this connection is a general converse Lyapunov like theorem, which implies the existence of a Lyapunov potential under mild assumptions on the rates of convergence of GF. In fact, using these potentials, we show a one-to-one correspondence between rates of convergence of GF and geometrical properties of the underlying objective. When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the GF path and GD/SGD paths are quite far apart). It turns out that these self-bounding assumptions are in a sense also necessary for GD/SGD to work. Using our framework, we provide a unified analysis for GD/SGD not only for classical settings like convex losses, or objectives that satisfy PL/ KL properties, but also for more complex problems including Phase Retrieval and Matrix sq-root, and extending the results in the recent work of Chatterjee 2022.

CSLaw 2022
Non-Determinism and the Lawlessness of ML Code Oral
A. Feder Cooper, Jonathan Frankle, Christopher De Sa
In CSLaw: 2nd ACM Symposium on Computer Science and Law, November 2022.
Legal literature on machine learning (ML) tends to focus on harms, and as a result tends to reason about individual model outcomes and summary error rates. This focus on model-level outcomes and errors has masked important aspects of ML that are rooted in its inherent non-determinism. We show that the effects of non-determinism, and consequently its implications for the law, instead become clearer from the perspective of reasoning about ML outputs as probability distributions over possible outcomes. This distributional viewpoint accounts for non-determinism by emphasizing the possible outcomes of ML. Importantly, this type of reasoning is not exclusive with current legal reasoning; it complements (and in fact can strengthen) analyses concerning individual, concrete outcomes for specific automated decisions. By clarifying the important role of non-determinism, we demonstrate that ML code falls outside of the cyberlaw frame of treating "code as law," as this frame assumes that code is deterministic. We conclude with a brief discussion of what work ML can do to constrain the potentially harm-inducing effects of non-determinism, and we clarify where the law must do work to bridge the gap between its current individual-outcome focus and the distributional approach that we recommend.

ICML 2022
Low-Precision Stochastic Gradient Langevin Dynamics Spotlight
Ruqi Zhang, Andrew Wilson, Christopher De Sa
In ICML: the Thirty-ninty International Conference on Machine Learning, July 2022.
While low-precision optimization has been widely used to accelerate deep learning, low-precision sampling remains largely unexplored. As a consequence, sampling is simply infeasible in many large-scale scenarios, despite providing remarkable benefits to generalization and uncertainty estimation for neural networks. In this paper, we provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance, due to its intrinsic ability to handle system noise. We prove that the convergence of low-precision SGLD with full-precision gradient accumulators is less affected by the quantization error than its SGD counterpart in the strongly convex setting. To further enable low-precision gradient accumulators, we develop a new quantization function for SGLD that preserves the variance in each update step. We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.

ICLR 2022
A General Analysis of Example-Selection for Stochastic Gradient Descent Spotlight
Yucheng Lu, Si Yi Meng, and Christopher De Sa
In ICLR: Proceedings of the Tenth International Conference on Learning Representations, April 2022.
Training example order in SGD has long been known to affect convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. In this paper, we develop a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings. We show that our approach suffices to recover, and in some cases improve upon, previous state-of-the-art analyses for four known example-selection schemes: (1) shuffle once, (2) random reshuffling, (3) random reshuffling with data echoing, and (4) Markov Chain Gradient Descent. Motivated by our theory, we propose two new example-selection approaches. First, using quasi-Monte-Carlo methods, we achieve unprecedented accelerated convergence rates for learning with data augmentation. Second, we greedily choose a fixed scan-order to minimize the metric used in our condition and show that we can obtain more accurate solutions from the same number of epochs of SGD. We conclude by empirically demonstrating the utility of our approach for both convex linear-model and deep learning tasks.

How Low Can We Go: Trading Memory for Error in Low-Precision Training
Chengrun Yang, Ziyang Wu, Jerry Chee, Christopher De Sa, and Madeleine Udell
In ICLR: Proceedings of the Tenth International Conference on Learning Representations, April 2022.
Low-precision arithmetic trains deep learning models using less energy, less memory and less time. However, we pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error. As applications proliferate, users must choose which precision to use to train a new model, and chip manufacturers must decide which precisions to manufacture. We view these precision choices as a hyperparameter tuning problem, and borrow ideas from meta-learning to learn the tradeoff between memory and error. In this paper, we introduce Pareto Estimation to Pick the Perfect Precision (PEPPP). We use matrix factorization to find non-dominated configurations (the Pareto frontier) with a limited number of network evaluations. For any given memory budget, the precision that minimizes error is a point on this frontier. Practitioners can use the frontier to trade memory for error and choose the best precision for their goals.

NeurIPS 2021
Hyperparameter Optimization Is Deceiving Us, and How to Stop It
A. Feder Cooper, Yucheng Lu, and Christopher De Sa
In NeurIPS: Proceedings of the 34th Neural Information Processing Systems Conference, December 2021.
Recent empirical work shows that inconsistent results, based on choice of hyperparameter optimization (HPO) configuration, are a widespread problem in ML research. When comparing two algorithms J and K, searching one subspace can yield the conclusion that J outperforms K, whereas searching another can entail the opposite. In short, the way we choose hyperparameters can deceive us. We provide a theoretical complement to this prior work, arguing that, to avoid such deception, the process of drawing conclusions from HPO should be made more rigorous. We call this process epistemic hyperparameter optimization (EHPO), and put forth a logical framework to capture its semantics and how it can lead to inconsistent conclusions about performance. Our framework enables us to prove EHPO methods that are guaranteed to be defended against deception. We demonstrate its utility by proving and empirically validating a defended variant of random search.

Representing Hyperbolic Space Accurately using Multi-Component Floats
Tao Yu, Christopher De Sa
In NeurIPS: Proceedings of the 34th Neural Information Processing Systems Conference, December 2021.
Hyperbolic space is very useful for embedding data with hierarchical structure; however, representing hyperbolic space with ordinary floating-point numbers greatly affects the performance due to its ineluctable numerical errors. Simply increasing the precision of floats fails to solve the problem and incurs a high computation cost for simulating greater-than-double-precision floats on hardware such as GPUs, which does not support them. In this paper, we propose a simple, feasible-on-GPUs, and easy-to-understand solution for numerically accurate learning on hyperbolic space. We do this with a new approach to represent hyperbolic space using multi-component floating-point (MCF) in the Poincare upper-half space model. Theoretically and experimentally we show our model has small numerical error, and on embedding tasks across various datasets, models represented by multi-component floating-points gain significantly more capacity with only a mild computational slowdown on GPUs.

Equivariant Manifold Flows
Isay Katsman, Aaron Lou, Derek Lim, Qingxuan Jiang, Ser-Nam Lim, Christopher De Sa
In NeurIPS: Proceedings of the 34th Neural Information Processing Systems Conference, December 2021.
Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries—a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over SU(n) in the context of quantum field theory.

EAAMO 2021
Accuracy-Efficiency Trade-Offs and Accountability in Distributed ML Systems Oral
A. Feder Cooper, Karen Levy, Christopher De Sa
In EAAMO: Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (to appear), October 2021.
Trade-offs between accuracy and efficiency pervade law, public health, and other non-computing domains, which have developed policies to guide how to balance the two in conditions of uncertainty. While computer science also commonly studies accuracy-efficiency trade-offs, their policy implications remain poorly examined. Drawing on risk assessment practices in the US, we argue that, since examining these trade-offs has been useful for guiding governance in other domains, we need to similarly reckon with these tradeoffs in governing computer systems. We focus our analysis on distributed machine learning systems. Understanding the policy implications in this area is particularly urgent because such systems, which include autonomous vehicles, tend to be high-stakes and safety-critical. We 1) describe how the trade-off takes shape for these systems, 2) highlight gaps between existing US risk assessment standards and what these systems require to be properly assessed, and 3) make specific calls to action to facilitate accountability when hypothetical risks concerning the accuracy-efficiency trade-off become realized as accidents in the real world. We close by discussing how such accountability mechanisms encourage more just, transparent governance aligned with public values.

ICML 2021
Optimal Complexity in Decentralized Training Outstanding Paper (Honorable Mention)
Yucheng Lu, Christopher De Sa
In ICML: the Thirty-eighth International Conference on Machine Learning, July 2021.
Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks, and we show DeTAG enjoys faster convergence compared to baselines, especially on unshuffled data and in sparse networks.

Variance Reduced Training with Stratified Sampling for Forecasting Models
Yucheng Lu, Youngsuk Park, Lifan Chen, Yuyang Wang, Christopher De Sa, Dean Foster
In ICML: the Thirty-eighth International Conference on Machine Learning, July 2021.
In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large variance on gradient estimation, and thus incurs long-time training. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. For better trading-off gradient variance and computation complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance reduced SGD-style optimizer that utilizes stratified sampling via control variate. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate SCott converges faster with respect to both iterations and wall clock time.

Low-Precision Reinforcement Learning: Running Soft Actor-Critic in Half Precision
Johan Björck, Xiangyu Chen, Christopher De Sa, Carla Gomes, Kilian Weinberger
In ICML: the Thirty-eighth International Conference on Machine Learning, July 2021.
Low-precision training has become a popular approach to reduce compute requirements, memory footprint, and energy consumption in supervised learning. In contrast, this promising approach has not yet enjoyed similarly widespread adoption within the reinforcement learning (RL) community, partly because RL agents can be notoriously hard to train even in full precision. In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a naïve adaptation of low-precision methods from supervised learning fails. We propose a set of six modifications, all straightforward to implement, that leaves the underlying agent and its hyperparameters unchanged but improves the numerical stability dramatically. The resulting modified SAC agent has lower memory and compute requirements while matching full-precision rewards, demonstrating that low-precision training can substantially accelerate state-of-the-art RL without parameter tuning.

Equivariant Manifold Flows
Isay Katsman, Aaron Lou, Derek Lim, Qingxuan Jiang, Ser-Nam Lim, Christopher De Sa
In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, July 2021.
Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries—a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over SU(n) in the context of quantum field theory.

Hyperparameter Optimization Is Deceiving Us, and How to Stop It
A. Feder Cooper, Yucheng Lu, and Christopher De Sa
In ICLR 2021, Workshop on Robust ML (RML), May 2021.
While hyperparameter optimization (HPO) is known to greatly impact learning algorithm performance, it is often treated as an empirical afterthought. Recent empirical works have highlighted the risk of this second-rate treatment of HPO. They show that inconsistent performance results, based on choice of hyperparameter subspace to search, are a widespread problem in ML research. When comparing two algorithms, J and K searching one subspace can yield the conclusion that J outperforms K, whereas searching another can entail the opposite result. In short, your choice of hyperparameters can deceive you. We provide a theoretical complement to this prior work: We analytically characterize this problem, which we term hyperparameter deception, and show that grid search is inherently deceptive. We prove a defense with guarantees against deception, and demonstrate a defense in practice.

Model Selection's Disparate Impact in Real-World Deep Learning Applications Oral
Jessica Zosa Forde*, A. Feder Cooper*, Kweku Kwegyir-Aggrey, Christopher De Sa, Michael Littman
In ICLR 2021, Workshop on the Science and Engineering of Deep Learning (SEDL), May 2021.
Algorithmic fairness has emphasized the role of biased data in automated decision outcomes. Recently, there has been a shift in attention to sources of bias that implicate fairness in other stages in the ML pipeline. We contend that one source of such bias, human preferences in model selection, remains under-explored in terms of its role in disparate impact across demographic groups. Using a deep learning model trained on real-world medical imaging data, we verify our claim empirically and argue that choice of metric for model comparison can significantly bias model selection outcomes.

Meta-Learning Divergences for Variational Inference
Ruqi Zhang, Yingzhen Li, Christopher De Sa, Sam Devlin, Cheng Zhang
In AISTATS: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, April 2021.
Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and broad applicability. Crucial to the performance of VI is the selection of the associated divergence measure, as VI approximates the intractable distribution by minimizing this divergence. In this paper we propose a meta-learning algorithm to learn the divergence metric suited for the task of interest, automating the design of VI methods. In addition, we learn the initialization of the variational parameters without additional cost when our method is deployed in the few-shot learning scenarios. We demonstrate our approach outperforms standard VI on Gaussian mixture distribution approximation, Bayesian neural network regression, image generation with variational autoencoders and recommender systems with a partial variational autoencoder.

MLSys 2021
PipeMare: Asynchronous Pipeline Parallel DNN Training
Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R. Aberger, Christopher De Sa
In MLSys: Proceedings of the 4th Conference on Machine Learning and Systems, April 2021.
Recently there has been a flurry of interest around using pipeline parallelism while training neural networks. Pipeline parallelism enables larger models to be partitioned spatially across chips and within a chip, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve statistical efficiency, existing pipeline-parallelism techniques sacrifice hardware efficiency by introducing bubbles into the pipeline and/or incurring extra memory costs. In this paper, we investigate to what extent these sacrifices are necessary. Theoretically, we derive a simple but robust training method, called PipeMare, that tolerates asynchronous updates during pipeline-parallel execution. Using this, we show empirically, on a ResNet network and a Transformer network, that PipeMare can achieve final model qualities that match those of synchronous training techniques (at most 0.9% worse test accuracy and 0.3 better test BLEU score) while either using up to 2.0X less weight and optimizer memory or being up to 3.3X faster than other pipeline parallel training techniques. To the best of our knowledge we are the first to explore these techniques and fine-grained pipeline parallelism (e.g. the number of pipeline stages equals to the number of layers) during neural network training.

NeurIPS 2020
Asymptotically Optimal Exact Minibatch Metropolis-Hastings Spotlight
Ruqi Zhang, A. Feder Cooper, Christopher De Sa
In NeurIPS: Proceedings of the 33rd Neural Information Processing Systems Conference, December 2020.
Metropolis-Hastings (MH) is a commonly-used MCMC algorithm, but it can be intractable on large datasets due to requiring computations over the whole dataset. In this paper, we study minibatch MH methods, which instead use subsamples to enable scaling. We observe that most existing minibatch MH methods are inexact (i.e. they may change the target distribution), and show that this inexactness can cause arbitrarily large errors in inference. We propose a new exact minibatch MH method, TunaMH, which exposes a tunable trade-off between its minibatch size and its theoretically guaranteed convergence rate. We prove a lower bound on the batch size that any minibatch MH method must use to retain exactness while guaranteeing fast convergence—the first such bound for minibatch MH—and show TunaMH is asymptotically optimal in terms of the batch size. Empirically, we show TunaMH outperforms other exact minibatch MH methods on robust linear regression, truncated Gaussian mixtures, and logistic regression.

Random Reshuffling is Not Always Better Spotlight
Christopher De Sa
In NeurIPS: Proceedings of the 33rd Neural Information Processing Systems Conference, December 2020.
Many learning algorithms, such as stochastic gradient descent, are affected by the order in which training examples are used. It is generally believed that sampling the training examples without-replacement, also known as random reshuffling, causes learning algorithms to converge faster. We give a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms (Recht and Ré, "Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences," COLT 2012). We use this to give an example of a learning task and algorithm for which with-replacement random sampling actually outperforms random reshuffling.

Neural Manifold Ordinary Differential Equations
Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser-Nam Lim, Christopher De Sa
In NeurIPS: Proceedings of the 33rd Neural Information Processing Systems Conference, December 2020.
To better conform to data geometry, recent deep generative modelling techniques adapt Euclidean constructions to non-Euclidean spaces. In this paper, we study normalizing flows on manifolds. Previous work has developed flow models for specific cases; however, these advancements hand craft layers on a manifold-by-manifold basis, restricting generality and inducing cumbersome design constraints. We overcome these issues by introducing Neural Manifold Ordinary Differential Equations, a manifold generalization of Neural ODEs, which enables the construction of Manifold Continuous Normalizing Flows (MCNFs). MCNFs require only local geometry (therefore generalizing to arbitrary manifolds) and compute probabilities with continuous change of variables (allowing for a simple and expressive flow construction). We find that leveraging continuous manifold dynamics produces a marked improvement for both density estimation and downstream tasks.

ICML 2020
Moniqua: Modulo Quantized Communication in Decentralized SGD
Yucheng Lu, Christopher De Sa
In ICML: the Thirty-seventh International Conference on Machine Learning, July 2020.
Running Stochastic Gradient Descent (SGD) in a decentralized fashion has shown promising results. In this paper we propose Moniqua, a technique that allows decentralized SGD to use quantized communication. We prove in theory that Moniqua communicates a provably bounded number of bits per iteration, while converging at the same asymptotic rate as the original algorithm does with full-precision communication. Moniqua improves upon prior works in that it (1) requires zero additional memory, (2) works with 1-bit quantization, and (3) is applicable to a variety of decentralized algorithms. We demonstrate empirically that Moniqua converges faster with respect to wall clock time than other quantized decentralized algorithms. We also show that Moniqua is robust to very low bit-budgets, allowing 1-bit-per-parameter communication without compromising validation accuracy when training ResNet20 and ResNet110 on CIFAR10.

Differentiating through the Fréchet Mean
Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser Nam Lim, Christopher De Sa
In ICML: the Thirty-seventh International Conference on Machine Learning, July 2020.
Recent advances in deep representation learning on Riemannian manifolds extend classical deep learning operations to better capture the geometry of the manifold. One possible extension is the Fréchet mean, the generalization of the Euclidean mean; however, it has been difficult to apply because it lacks a closed form with an easily computable derivative. In this paper, we show how to differentiate through the Fréchet mean for arbitrary Riemannian manifolds. Then, focusing on hyperbolic space, we derive explicit gradient expressions and a fast, accurate, and hyperparameter-free Fréchet mean solver. This fully integrates the Fréchet mean into the hyperbolic neural network pipeline. To demonstrate this integration, we present two case studies. First, we apply our Fréchet mean to the existing Hyperbolic Graph Convolutional Network, replacing its projected aggregation to obtain state-of-the-art results on datasets with high hyperbolicity. Second, to demonstrate the Fréchet mean's capacity to generalize Euclidean neural network operations, we develop a hyperbolic batch normalization method that gives an improvement parallel to the one observed in the Euclidean setting.

Regulating Accuracy-Efficiency Trade-Offs in Distributed Machine Learning Systems Oral
A. Feder Cooper, Karen Levy, Christopher De Sa
In LML 2020: ICML Workshop on Law and Machine Learning, July 2020.
In this paper we discuss the trade-off between accuracy and efficiency in distributed machine learning (ML) systems and analyze its resulting policy considerations. This trade-off is in fact quite common in multiple disciplines, including law and medicine, and it applies to a wide variety of subfields within computer science. Accuracy and efficiency trade-offs have unique implications in ML algorithms because, being probabilistic in nature, such algorithms generally exhibit error tolerance. After describing how the trade-off takes shape in real-world distributed computing systems, we show the interplay between such systems and ML algorithms, explaining in detail how accuracy and efficiency interact particularly in distributed ML systems. We close by making specific calls to action for approaching regulatory policy for the emerging technology of real-time distributed ML systems.

Neural Manifold Ordinary Differential Equations
Aaron Lou, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser-Nam Lim, Christopher De Sa
In INNF+ 2020: ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, July 2020.
To better conform to data geometry, recent deep generative modelling techniques adapt Euclidean constructions to non-Euclidean spaces. In this paper, we study normalizing flows on manifolds. Previous work has developed flow models for specific cases; however, these advancements hand craft layers on a manifold-by-manifold basis, restricting generality and inducing cumbersome design constraints. We overcome these issues by introducing Neural Manifold Ordinary Differential Equations, a manifold generalization of Neural ODEs, which enables the construction of Manifold Continuous Normalizing Flows (MCNFs). MCNFs require only local geometry (therefore generalizing to arbitrary manifolds) and compute probabilities with continuous change of variables (allowing for a simple and expressive flow construction). We find that leveraging continuous manifold dynamics produces a marked improvement for both density estimation and downstream tasks.

AMAGOLD: Amortized Metropolis Adjustment for Efficient Stochastic Gradient MCMC
Ruqi Zhang, A. Feder Cooper, Christopher De Sa
In AISTATS: The 23rd International Conference on Artificial Intelligence and Statistics, June 2020.
Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is an efficient method for sampling from continuous distributions. It is a faster alternative to HMC: instead of using the whole dataset at each iteration, SGHMC uses only a subsample. This improves performance, but introduces bias that can cause SGHMC to converge to the wrong distribution. One can prevent this using a step size that decays to zero, but such a step size schedule can drastically slow down convergence. To address this tension, we propose a novel second-order SG-MCMC algorithm—AMAGOLD—that infrequently uses Metropolis-Hastings (M-H) corrections to remove bias. The infrequency of corrections amortizes their cost. We prove AMAGOLD converges to the target distribution with a fixed, rather than a diminishing, step size, and that its convergence rate is at most a constant factor slower than a full-batch baseline. We empirically demonstrate AMAGOLD's effectiveness on synthetic distributions, Bayesian logistic regression, and Bayesian neural networks.

NeurIPS 2019
Channel Gating Neural Networks
Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, G. Edward Suh
In NeurIPS: Proceedings of the 32nd Neural Information Processing Systems Conference, December 2019.
This paper introduces channel gating, a dynamic, fine-grained, and highly hardware-efficient pruning scheme to reduce the compute cost for convolutional neural networks (CNNs). Channel gating identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. Unlike static network pruning, channel gating optimizes CNN inference at run-time by exploiting input-specific characteristics, which allows substantially reducing the compute cost with almost no accuracy loss. We experimentally show that applying channel gating in state-of-the-art networks achieves a 2.7-8.0x reduction in FLOPs with minimal accuracy loss on CIFAR-10. Combining our method with knowledge distillation reduces the compute cost of ResNet-18 by 2.6x without accuracy degradation on ImageNet. We further demonstrate that channel gating can be realized in hardware in an efficient manner. Our approach exhibits sparsity patterns that are well-suited to dense systolic arrays with minimal additional hardware. We have designed an accelerator for channel gating networks, which can be implemented using either FPGAs or ASICs. Running a quantized ResNet-18 model for ImageNet, our accelerator achieves an encouraging speedup of 2.4x on average, with a theoretical FLOP reduction of 2.8x.

Numerically Accurate Hyperbolic Embeddings Using Tiling-Based Models Spotlight
Tao Yu, Christopher De Sa
In NeurIPS: Proceedings of the 32nd Neural Information Processing Systems Conference, December 2019.
Hyperbolic embeddings achieve excellent performance when embedding hierarchical data structures like synonym or type hierarchies, but they can be limited by numerical error when ordinary floating point numbers are used to represent points in hyperbolic space. Standard models such as the Poincaré disk and the Lorentz model have unbounded error as points get far from the origin. To address this, we propose a new model with which uses an integer-based tiling to represent any point in the space with provably bounded numerical error. This allows us to learn high-precision embeddings without using BigFloats, and enables us to store the resulting embeddings with fewer bits. We evaluate our tiling-based model empirically, and show that it can both compress hyperbolic embeddings (down to 2% of a Poincaré embedding on WordNet) and learn more accurate embeddings on real-world datasets.

Poisson-Minibatching for Gibbs Sampling with Convergence Rate Guarantees Spotlight
Ruqi Zhang, Christopher De Sa
In NeurIPS: Proceedings of the 32nd Neural Information Processing Systems Conference, December 2019.
Gibbs sampling is a Markov chain Monte Carlo method that is often used for learning and inference on graphical models. Minibatching, in which a small random subset of the graph is used at each iteration, can help make Gibbs sampling scale to large graphical models by reducing its computational cost. In this paper, we propose a new auxiliary-variable minibatched Gibbs sampling method, Poisson-minibatching Gibbs, which both produces unbiased samples and has a theoretical guarantee on its convergence rate. In comparison to previous minibatched Gibbs algorithms, Poisson-minibatching Gibbs supports fast sampling from continuous state spaces and avoids the need for a Metropolis-Hastings correction on discrete state spaces. We demonstrate the effectiveness of our method on multiple applications and in comparison with both plain Gibbs and previous minibatched methods.

Dimension-Free Bounds for Low-Precision Training
Zheng Li, Christopher De Sa
In NeurIPS: Proceedings of the 32nd Neural Information Processing Systems Conference, December 2019.
Low-precision training is a promising way of decreasing the time and energy cost of training machine learning models. Previous work has analyzed low-precision training algorithms, such as low-precision stochastic gradient descent, and derived theoretical bounds on their convergence rates. These bounds tend to depend on the dimension of the model d in that the number of bits needed to achieve a particular error bound increases as d increases. In this paper, we derive new bounds for low-precision training algorithms that do not contain the dimension d, which lets us better understand what affects the convergence of these algorithms as parameters scale. Our methods also generalize naturally to let us prove new convergence bounds on low-precision training with other quantization schemes, such as low-precision floating-point computation and logarithmic quantization.

EMC2@NeurIPS 2019
QPyTorch: A Low-Precision Arithmetic Simulation Framework
Tianyi Zhang, Zhiqiu Lin, Guandao Yang, Christopher De Sa
In EMC2: Workshop on Energy Efficient ML and Cognitive Computing, at NeurIPS, December 2019.
Low-precision training reduces computational cost and produces efficient models. Recent research in developing new low-precision training algorithms often relies on simulation to empirically evaluate the statistical effects of quantization while avoiding the substantial overhead of building specific hardware. To support this empirical research, we introduce QPyTorch, a low-precision arithmetic simulation framework. Built natively in PyTorch, QPyTorch provides a convenient interface that minimizes the efforts needed to reliably convert existing codes to study low-precision training. QPyTorch is general, and supports a variety of combinations of precisions, number formats, and rounding options. Additionally, it leverages an efficient fused-kernel approach to reduce simulator overhead, which enables simulation of large-scale, realistic problems.

MICRO 2019
Boosting the Performance of CNN Accelerators with Dynamic Fine-Grained Channel Gating
Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, G. Edward Suh
In MICRO '52 Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, October 2019.
This paper proposes a new fine-grained dynamic pruning technique for CNN inference, named channel gating, and presents an accelerator architecture that can effectively exploit the dynamic sparsity. Intuitively, channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result and turns off a subset of channels for computing the activations in these less important regions. Unlike static network pruning, which removes redundant weights or neurons prior to inference, channel gating exploits dynamic sparsity specific to each input at run time and in a structured manner. To maximize compute savings while minimizing accuracy loss, channel gating learns the gating thresholds together with weights automatically through training. Experimental results show that the proposed approach can significantly speed up state-of-the-art networks with a marginal accuracy loss, and enable a trade-off between performance and accuracy. This paper also shows that channel gating can be supported with a small set of extensions to a CNN accelerator, and implements a prototype for quantized ResNet-18 models. The accelerator shows an average speedup of 2.3× for ImageNet when the theoretical FLOP reduction is 2.8×, indicating that the hardware can effectively exploit the dynamic sparsity exposed by channel gating.

Cloud-Hosted Intelligence for Real-time IoT Applications
Ken Birman, Bharath Hariharan, Christopher De Sa
In SIGOPS Operating Systems Review 53, July 2019.
Deploying machine learning into IoT cloud settings will require an evolution of the cloud infrastructure. In this white paper, we justify this assertion and identify new capabilities needed for real-time intelligent systems. We also outline our initial efforts to create a new edge architecture more suitable for ML. Although the work is still underway, several components exist, and we review them. We then point to open technical problems that will need to be solved as we progress further in this direction.

ICML 2019
Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, Zhiru Zhang
In ICML: the Thirty-sixth International Conference on Machine Learnin, June 2019.
Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the less-studied topic of quantizing a floating-point model without (re)training. DNN weights and activations follow a bell-shaped distribution post-training, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform state-of-the-art clipping techniques with only minor overhead.

Distributed Learning with Sublinear Communication Long oral
Jayadev Acharya, Christopher De Sa, Dylan J. Foster, Karthik Sridharan
In ICML: the Thirty-sixth International Conference on Machine Learnin, June 2019.
In distributed statistical learning, \( N \) samples are split across \( m \) machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in machine learning due to its scalability and potential for parallel speedup. However, in high-dimensional settings, where the number examples is smaller than the number of features ("dimension"), the speedup afforded by distributed learning may be overshadowed by the cost of communicating a single example. This paper investigates the following question: When is it possible to learn a d-dimensional model in the distributed setting with total communication sublinear in \( d \)?

Starting with a negative result, we show that for learning \( \ell_1 \)-bounded or sparse linear models, no algorithm can obtain optimal error until communication is linear in dimension. Our main result is that that by slightly relaxing the standard boundedness assumptions for linear models, we can obtain distributed algorithms that enjoy optimal error with communication logarithmic in dimension. This result is based on a family of algorithms that combine mirror descent with randomized sparsification/quantization of iterates, and extends to the general stochastic convex optimization model.

A Kernel Theory of Modern Data Augmentation
Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré
In ICML: the Thirty-sixth International Conference on Machine Learnin, June 2019.
Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proof-of-concept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.

SWALP : Stochastic Weight Averaging in Low Precision Training
Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa
In ICML: the Thirty-sixth International Conference on Machine Learnin, June 2019.
Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SWALP is easy to implement and can match the performance of full-precision SGD even with all numbers quantized down to 8 bits, including the gradient accumulators. Additionally, we show that SWALP converges arbitrarily close to the optimal solution for quadratic objectives, and to a noise ball asymptotically smaller than low precision SGD in strongly convex settings.

CVPR 2019
Building Efficient Deep Neural Networks with Unitary Group Convolutions
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, Zhiru Zhang
In CVPR: The Conference on Computer Vision and Pattern Recognition, June 2019.
We propose unitary group convolutions (UGConvs), a building block for CNNs which compose a group convolution with unitary transforms in feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e. ShuffleNet) and block-circulant networks (i.e. CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different dense transforms exhibit comparable accuracy performance. Based on these observations we propose HadaNet, a UGConv network using Hadamard transforms. HadaNets achieve similar accuracy to circulant networks with lower computation complexity, and better accuracy than ShuffleNets with the same number of parameters and floating-point multiplies.

ICDT 2019
A Formal Framework For Probabilistic Unclean Databases
Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas
In ICDT: 22nd International Conference on Database Theory, March 2019.
Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database instance. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and a dirty observed database instance that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable clean instance given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed instance), and learning (estimate the most likely intention and realization models of a PUD given a collection of training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connection to consistent query answering, and prove tractability results. We further show that parameters can be learned in practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database instance without any need for clean examples.

SPIE 2019
Addressing sensor drift in a proprioceptive optical foam system
Ilse M. Van Meerbeek, Jose A. Barreiros, Robert F. Shepherd, Christopher M. De Sa
In Proc. SPIE 10970: Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2019, March 2019.
We previously reported an elastomeric, optical foam sensor that can sense different types of deformation. The elastomeric foam is embedded with optical fibers that shine light into the foam while simultaneously transmitting scattered light exiting the foam. We applied machine learning techniques to the optical fiber data to form a prediction model that predicts whether the foam is being twisted or bent (classification), as well as the magnitude and direction of the deformation (regression). The best classification model had 100% accuracy on new data points, and the best regression models had a mean absolute error of 0.06 degrees on new data points. This kind of proprioceptive ability could give soft robots much more information about their physical state and therefore improve our ability to control them; however, prediction error increases with time due to drift in the optical fiber outputs. This paper presents an attempt to address this drift. We applied a technique based on work presented by Di Carlo et. al. This unsupervised technique uses the evolutionary optimization process "covariance matrix adaptation evolution strategy" (CMA-ES) to compute a correction factor that can be applied to unobserved, drifted data points. The best solutions reduced classification error by 49% and regression mean absolute error by 36%.

Sci. Robot. 2018
Soft optoelectronic sensory foams with proprioception
Ilse Van Meerbeek, Christopher De Sa, Robert Shepherd
In Science Robotics, November 2018.
In a step toward soft robot proprioception, and therefore better control, this paper presents an internally illuminated elastomer foam that has been trained to detect its own deformation through machine learning techniques. Optical fibers transmitted light into the foam and simultaneously received diffuse waves from internal reflection. The diffuse reflected light was interpreted by machine learning techniques to predict whether the foam was twisted clockwise, twisted counterclockwise, bent up, or bent down. Machine learning techniques were also used to predict the magnitude of the deformation type. On new data points, the model predicted the type of deformation with 100% accuracy and the magnitude of the deformation with a mean absolute error of 0.06°. This capability may impart soft robots with more complete proprioception, enabling them to be reliably controlled and responsive to external stimuli.

ICML 2018
Minibatch Gibbs Sampling on Large Graphical Models Long oral
Christopher De Sa, Vincent Chen, Wing Wong
In ICML: Proceedings of the 35rd International Conference on Machine Learning, July 2018.
Gibbs sampling is the de facto Markov chain Monte Carlo method used for inference and learning on large scale graphical models. For complicated factor graphs with lots of factors, the performance of Gibbs sampling can be limited by the computational cost of executing a single update step of the Markov chain. This cost is proportional to the degree of the graph, the number of factors adjacent to each variable. In this paper, we show how this cost can be reduced by using minibatching: subsampling the factors to form an estimate of their sum. We introduce several minibatched variants of Gibbs, show that they can be made unbiased, prove bounds on their convergence rates, and show that under some conditions they can result in asymptotic single-update-run-time speedups over plain Gibbs sampling.

Representation Tradeoffs for Hyperbolic Embeddings Long oral
Frederic Sala, Christopher De Sa, Albert Gu, Christopher Ré
In ICML: Proceedings of the 35rd International Conference on Machine Learning, July 2018.
Hyperbolic embeddings offer excellent quality with few dimensions when embedding hierarchical data structures. We give a combinatorial construction that embeds trees into hyperbolic space with arbitrarily low distortion without optimization. On WordNet, this algorithm obtains a mean-average-precision of 0.989 with only two dimensions, outperforming existing work by 0.11 points. We provide bounds characterizing the precision-dimensionality tradeoff inherent in any hyperbolic embedding. To embed general metric spaces, we propose a hyperbolic generalization of multidimensional scaling (h-MDS). We show how to perform exact recovery of hyperbolic points from distances, provide a perturbation analysis, and give a recovery result that enables us to reduce dimensionality. Finally, we extract lessons from the algorithms and theory above to design a scalable PyTorch-based implementation that can handle incomplete information.

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
Dan Alistarh, Christopher De Sa, Nikola Konstantinov
In PODC: Principles of Distributed Computing, July 2018.
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard shared-memory model are still not well-understood.

In this work, we address this gap, and provide new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony" when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental trade-off between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.

Accelerated Stochastic Power Iteration
Christopher De Sa, Bryan He, Ioannis Mitliagkas, Christopher Ré, Peng Xu
In AISTATS: The 21st International Conference on Artificial Intelligence and Statistics, April 2018.
Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires \( \mathcal O(1/\Delta) \) full-data passes to recover the principal component of a matrix with eigen-gap \( \Delta \). Lanczos, a significantly more complex method, achieves an accelerated rate of \( \mathcal O(1/\sqrt{\Delta}) \) passes. Modern applications, however, motivate methods that only ingest a subset of available data, known as the stochastic setting. In the online stochastic setting, simple algorithms like Oja's iteration achieve the optimal sample complexity \( \mathcal O(\sigma^2/\Delta^2) \). Unfortunately, they are fully sequential, and also require \( \mathcal O(\sigma^2/\Delta^2) \) iterations, far from the \( \mathcal O(1/\sqrt{\Delta}) \) rate of Lanczos. We propose a simple variant of the power iteration with an added momentum term, that achieves both the optimal sample and iteration complexity. In the full-pass setting, standard analysis shows that momentum achieves the accelerated rate, \( \mathcal O(1/\sqrt{\Delta}) \). We demonstrate empirically that naively applying momentum to a stochastic method, does not result in acceleration. We perform a novel, tight variance analysis that reveals the "breaking-point variance" beyond which this acceleration does not occur. By combining this insight with modern variance reduction techniques, we construct stochastic PCA algorithms, for the online and offline setting, that achieve an accelerated iteration complexity \( \mathcal O(1/\sqrt{\Delta}) \). Due to the embarassingly parallel nature of our methods, this acceleration translates directly to wall-clock time if deployed in a parallel environment. Our approach is very general, and applies to many non-convex optimization problems that can now be accelerated using the same technique.

SODA 2018
A Two Pronged Progress in Structured Dense Matrix Multiplication
Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher Ré, Atri Rudra
In SODA: ACM-SIAM Symposium on Discrete Algorithms, January 2018.
Matrix-vector multiplication is one of the most fundamental computing primitives. Given a matrix \( A\in\mathbb{F}^{N\times N} \) and a vector \( b \), it is known that in the worst case \( \Theta(N^2) \) operations over \( \mathbb{F} \) are needed to compute \( Ab \). A broad question is to identify classes of structured dense matrices that can be represented with \( O(N) \) parameters, and for which matrix-vector multiplication can be performed sub-quadratically. One such class of structured matrices is the orthogonal polynomial transforms, whose rows correspond to a family of orthogonal polynomials. Other well known classes include the Toeplitz, Hankel, Vandermonde, Cauchy matrices and their extensions that are all special cases of a displacement rank property. In this paper, we make progress on two fronts:
1. We introduce the notion of recurrence width of matrices. For matrices with constant recurrence width, we design algorithms to compute \( Ab \) and \( A^Tb \) in a near-linear number of operations. This notion of width is finer than all the above classes of structured matrices and thus we can compute multiplication for all of them using the same core algorithm.
2. We additionally adapt this algorithm to an algorithm for a much more general class of matrices with displacement structure: those with low displacement rank with respect to quasiseparable matrices. This class includes Toeplitz-plus-Hankel-like matrices, Discrete Cosine/Sine Transforms, and more, and captures all previously known matrices with displacement structure that we are aware of.
Our work unifies, generalizes, and simplifies existing state-of-the-art results in structured matrix-vector multiplication. Finally, we show how applications in areas such as multipoint evaluations of multivariate polynomials and computing linear sequences can be reduced to problems involving low recurrence width matrices.

NeurIPS 2017
Gaussian Quadrature for Kernel Features Spotlight
Tri Dao, Christopher De Sa, Christopher Ré
In NeurIPS: Proceedings of the 30th Neural Information Processing Systems Conference, December 2017.
Kernel methods have recently attracted resurgent interest, matching the performance of deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that \( O(\epsilon^{-2})\) samples are required to achieve an approximation error of at most \( \epsilon\) . In this paper, we investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any \( \gamma > 0\) , to achieve error \( \epsilon\) with \( O(e^{\gamma} + \epsilon^{-1/\gamma})\) samples as \( \epsilon \) goes to 0. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve comparable accuracy to the state-of-the-art kernel methods based on random Fourier features.

ISCA 2017
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent
Christopher De Sa, Matt Feldman, Christopher Ré, Kunle Olukotun
In ISCA: 44th International Symposium on Computer Architecture, June 2017.
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11x. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

HILDA 2017
Flipper: A Systematic Approach to Debugging Training Sets
Paroma Varma, Dan Iter, Christopher De Sa, Christopher Ré
In HILDA: Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, at SIGMOD, May 2017.
As machine learning methods gain popularity across different fields, acquiring labeled training datasets has become the primary bottleneck in the machine learning pipeline. Recently generative models have been used to create and label large amounts of training data, albeit noisily. The output of these generative models is then used to train a discriminative model of choice, such as logistic regression or a complex neural network. However, any errors in the generative model can propagate to the subsequent model being trained. Unfortunately, these generative models are not easily interpretable and are therefore difficult to debug for users. To address this, we present our vision for Flipper, a framework that presents users with high-level information about why their training set is inaccurate and informs their decisions as they improve their generative model manually. We present potential tools within the Flipper framework, inspired by observing biomedical experts working with generative models, which allow users to analyze the errors in their training data in a systematic fashion. Finally, we discuss a prototype of Flipper and report results of a user study where users create a training set for a classification task and improve the discriminative model's accuracy by 2.4 points in less than an hour with feedback from Flipper.

NeurIPS 2016
Data Programming: Creating Large Training Sets, Quickly
Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré
In NeurIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label large subsets of data points, albeit noisily. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to "denoise" the training set. Then, we show how to modify a discriminative loss function to make it noise-aware. We demonstrate our method over a range of discriminative models including logistic regression and LSTMs. We establish theoretically that we can recover the parameters of these generative models in a handful of settings. Experimentally, on the 2014 TAC-KBP relation extraction challenge, we show that data programming would have obtained a winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a supervised LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way to create machine learning models for non-experts.

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much
Bryan He, Christopher De Sa, Ioannis Mitliagkas, Christopher Ré
In NeurIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.

FiLM-NIPS 2016
Socratic Learning: Empowering the Generative Model
Paroma Varma, Rose Yu, Dan Iter, Christopher De Sa, Christopher Ré
In FiLM-NIPS: Future of Interactive Learning Machines at NIPS, December 2016.
Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they "tell a story" about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on real-world relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts.

ICML 2016
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Best Paper Award
Christopher De Sa, Kunle Olukotun, Christopher Ré
In ICML: Proceedings of the 33rd International Conference on Machine Learning, June 2016.
Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.

OptML 2016
Parallel SGD: When does Averaging Help?
Jian Zhang, Christopher De Sa, Ioannis Mitiliagkas, Christopher Ré
In OptML: Optimization Methods for the Next Generation of Machine Learning, workshop at ICML, June 2016.
Consider a number of workers running SGD independently on the same pool of data and averaging the models every once in a while — a common but not well understood practice. We study model averaging as a variance-reducing mechanism and describe two ways in which the frequency of averaging affects convergence. For convex objectives, we show the benefit of frequent averaging depends on the gradient variance envelope. For non-convex objectives, we illustrate that this benefit depends on the presence of multiple globally optimal points. We complement our findings with multicore experiments on both synthetic and real data.

DeepDive: Declarative Knowledge Base Construction
Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang
In SIGMOD Record, Research Highlight, April 2016.
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

Generating Configurable Hardware from Parallel Patterns
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, Kunle Olukotun
In ASPLOS: 21st Int'l Conference on Architectural Support for Programming Languages and Operating Systems, April 2016.
In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use low-level hardware description languages like Verilog and VHDL, which have none of the productivity features of modern software development languages but produce very efficient designs, and low-level software languages like C and OpenCL coupled with high-level synthesis (HLS) tools that typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they both provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two optimizations that are important when using parallel patterns to generate hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups up to 40x on a set of benchmarks from the data analytics domain.

CGO 2016
Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns
Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth, Christopher De Sa, Christopher Aberger, Kunle Olukotun
In CGO: International Symposium on Code Generation and Optimization, March 2016.
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and combine them together in ad-hoc ways. To optimize distributed applications both for modern hardware and for modern programmers we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately existing systems tend to fall short of these requirements.

In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually-optimized counterparts in multiple distributed programming models.

NeurIPS 2015
Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width Spotlight
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width—regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.

Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

VLDB 2015
Incremental Knowledge Base Construction Using DeepDive Best of Issue
Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, Christopher De Sa, Christopher Ré
In VLDB: Proceedings of the 41st International Conference on Very Large Data Bases, September 2015.
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

ICML 2015
Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems
Christopher De Sa, Kunle Olukotun, Christopher Ré
In ICML: Proceedings of the 32nd International Conference on Machine Learning, July 2015.
Stochastic gradient descent (SGD) on a low-rank factorization is commonly employed to speed up matrix problems including matrix completion, subspace tracking, and SDP relaxation. In this paper, we exhibit a step size scheme for SGD on a low-rank least-squares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within \(O(\epsilon^{-1} n \log n)\) steps with constant probability for constant-rank problems. Our modification of SGD relates it to stochastic power iteration. We also show experiments to illustrate the runtime and convergence of the algorithm.


MixML: A Unified Analysis of Weakly Consistent Parallel Learning
Yucheng Lu, Jack Nash, Christopher De Sa
Manuscript updated May 2020
Parallelism is a ubiquitous method for accelerating machine learning algorithms. However, theoretical analysis of parallel learning is usually done in an algorithm- and protocol-specific setting, giving little insight about how changes in the structure of communication could affect convergence. In this paper we propose MixML, a general framework for analyzing convergence of weakly consistent parallel machine learning. Our framework includes: (1) a unified way of modeling the communication process among parallel workers; (2) a new parameter, the mixing time tmix, that quantifies how the communication process affects convergence; and (3) a principled way of converting a convergence proof for a sequential algorithm into one for a parallel version that depends only on tmix. We show MixML recovers and improves on known convergence bounds for asynchronous and/or decentralized versions of many algorithms, includingSGD and AMSGrad. Our experiments substantiate the theory and show the dependency of convergence on the underlying mixing time.

Overwrite Quantization: Opportunistic Outlier Handling for Neural Network Accelerators
Ritchie Zhao, Christopher De Sa, Zhiru Zhang
Manuscript updated October 2019
Outliers in weights and activations pose a key challenge for fixed-point quantization of neural networks. While outliers can be addressed by fine-tuning, this is not practical for machine learning (ML) service providers (e.g., Google, Microsoft) who often receive customers' models without the training data. Specialized hardware for handling outliers can enable low-precision DNNs, but incurs nontrivial area overhead. In this paper, we propose overwrite quantization (OverQ), a novel hardware technique which opportunistically increases bitwidth for outliers by letting them overwrite adjacent values. An FPGA prototype shows OverQ can significantly improve ResNet-18 accuracy at 4 bits while incurring relatively little increase in resource utilization.

High-Accuracy Low-Precision Training
Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, Christopher Ré
Manuscript updated December 2018
There is currently an arms race to design low-precision hardware accelerators capable of training machine learning models. This is because purpose-built, low-precision hardware accelerators can lower both the time and energy needed to complete a task. In contrast, traditional hardware architectures are over-provisioned, in terms of numerical precision, for machine learning tasks. Unfortunately, the statistical effects of low-precision computation during training are still not well understood. As a result, it is difficult to reach the statistical accuracies of traditional architectures on these new accelerators which have limited support for higher precision computation. This is due to a tradeoff with standard low-precision training algorithms: as the number of bits is decreased, noise that limits statistical accuracy is increased. In this paper we argue that one can reap the hardware benefits of low-precision accelerators while maintaining the statistical accuracies of traditional, higher-precision data formats. To do this we introduce a training algorithm called High-Accuracy Low-Precision (HALP). HALP is a low-precision stochastic gradient descent variant that uses entirely low-precision computation in its inner loop while infrequently recentering this computation with higher-precision computation done in an outer loop. HALP uses three techniques to reduce noise: (1) a known variance reduction method based on stochastic variance-reduced gradient (SVRG); (2) a novel bit centering technique that uses infrequent high-precision computation to reduce quantization noise; and (3) a novel dynamic bias adjustment technique to prevent overflow and underflow. On strongly convex problems, we show both theoretically and empirically that HALP converges at the same linear rate as full-precision SVRG. Inspired by these results, we show on two neural network applications (CNN and LSTM) that HALP can empirically compete with higher-precision training algorithms.

SysML: The New Frontier of Machine Learning Systems
Alexander Ratner et. al.
On arxiv March 2019
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, SysML, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.