
Channel Gating Neural Networks
Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, G. Edward Suh
In NeurIPS: Proceedings of the 32th Neural Information Processing Systems Conference, December 2019. (to appear)
This paper introduces channel gating, a dynamic, finegrained, and highly hardwareefficient pruning scheme to reduce the compute cost for convolutional neural networks (CNNs). Channel gating identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. Unlike static network pruning, channel gating optimizes CNN inference at runtime by exploiting inputspecific characteristics, which allows substantially reducing the compute cost with almost no accuracy loss. We experimentally show that applying channel gating in stateoftheart networks achieves a 2.78.0x reduction in FLOPs with minimal accuracy loss on CIFAR10. Combining our method with knowledge distillation reduces the compute cost of ResNet18 by 2.6x without accuracy degradation on ImageNet. We further demonstrate that channel gating can be realized in hardware in an efficient manner. Our approach exhibits sparsity patterns that are wellsuited to dense systolic arrays with minimal additional hardware. We have designed an accelerator for channel gating networks, which can be implemented using either FPGAs or ASICs. Running a quantized ResNet18 model for ImageNet, our accelerator achieves an encouraging speedup of 2.4x on average, with a theoretical FLOP reduction of 2.8x.

Numerically Accurate Hyperbolic Embeddings Using TilingBased Models
Spotlight
Tao Yu, Christopher De Sa
In NeurIPS: Proceedings of the 32th Neural Information Processing Systems Conference, December 2019. (to appear)
Hyperbolic embeddings achieve excellent performance when embedding hierarchical data structures like synonym or type hierarchies, but they can be limited by numerical error when ordinary floating point numbers are used to represent points in hyperbolic space.
Standard models such as the Poincaré disk and the Lorentz model have unbounded error as points get far from the origin.
To address this, we propose a new model with which uses an integerbased tiling to represent any point in the space with provably bounded numerical error. This allows us to learn highprecision embeddings without using BigFloats, and enables us to store the resulting embeddings with fewer bits. We evaluate our tilingbased model empirically, and show that it can both compress hyperbolic embeddings (down to 2% of a Poincaré embedding on WordNet) and learn more accurate embeddings on realworld datasets.

PoissonMinibatching for Gibbs Sampling with Convergence Rate Guarantees
Spotlight
Ruqi Zhang, Christopher De Sa
In NeurIPS: Proceedings of the 32th Neural Information Processing Systems Conference, December 2019. (to appear)
Gibbs sampling is a Markov chain Monte Carlo method that is often used for learning and inference on graphical models.
Minibatching, in which a small random subset of the graph is used at each iteration, can help make Gibbs sampling scale to large graphical models by reducing its computational cost.
In this paper, we propose a new auxiliaryvariable minibatched Gibbs sampling method, Poissonminibatching Gibbs, which both produces unbiased samples and has a theoretical guarantee on its convergence rate.
In comparison to previous minibatched Gibbs algorithms, Poissonminibatching Gibbs supports fast sampling from continuous state spaces and avoids the need for a MetropolisHastings correction on discrete state spaces.
We demonstrate the effectiveness of our method on multiple applications and in comparison with both plain Gibbs and previous minibatched methods.

DimensionFree Bounds for LowPrecision Training
Zheng Li, Christopher De Sa
In NeurIPS: Proceedings of the 32th Neural Information Processing Systems Conference, December 2019. (to appear)
Lowprecision training is a promising way of decreasing the time and energy cost of training machine learning models.
Previous work has analyzed lowprecision training algorithms, such as lowprecision stochastic gradient descent, and derived theoretical bounds on their convergence rates.
These bounds tend to depend on the dimension of the model d in that the number of bits needed to achieve a particular error bound increases as d increases.
In this paper, we derive new bounds for lowprecision training algorithms that do not contain the dimension d, which lets us better understand what affects the convergence of these algorithms as parameters scale.
Our methods also generalize naturally to let us prove new convergence bounds on lowprecision training with other quantization schemes, such as lowprecision floatingpoint computation and logarithmic quantization.

Boosting the Performance of CNN Accelerators with Dynamic FineGrained Channel Gating
Weizhe Hua, Yuan Zhou, Christopher De Sa, Zhiru Zhang, G. Edward Suh
In MICRO '52 Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, October 2019.
This paper proposes a new finegrained dynamic pruning technique for CNN inference, named channel gating, and presents an accelerator architecture that can effectively exploit the dynamic sparsity. Intuitively, channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result and turns off a subset of channels for computing the activations in these less important regions. Unlike static network pruning, which removes redundant weights or neurons prior to inference, channel gating exploits dynamic sparsity specific to each input at run time and in a structured manner. To maximize compute savings while minimizing accuracy loss, channel gating learns the gating thresholds together with weights automatically through training. Experimental results show that the proposed approach can significantly speed up stateoftheart networks with a marginal accuracy loss, and enable a tradeoff between performance and accuracy. This paper also shows that channel gating can be supported with a small set of extensions to a CNN accelerator, and implements a prototype for quantized ResNet18 models. The accelerator shows an average speedup of 2.3× for ImageNet when the theoretical FLOP reduction is 2.8×, indicating that the hardware can effectively exploit the dynamic sparsity exposed by channel gating.

CloudHosted Intelligence for Realtime IoT Applications
Ken Birman, Bharath Hariharan, Christopher De Sa
In SIGOPS Operating Systems Review 53, 1, 7–13, July 2019
Deploying machine learning into IoT cloud settings will require an evolution of the cloud infrastructure. In this white paper, we justify this assertion and identify new capabilities needed for realtime intelligent systems. We also outline our initial efforts to create a new edge architecture more suitable for ML. Although the work is still underway, several components exist, and we review them. We then point to open technical problems that will need to be solved as we progress further in this direction.

Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, Zhiru Zhang
In ICML: the Thirtysixth International Conference on Machine Learning, June 2019
Quantization can improve the execution latency and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs, while this work examines the lessstudied topic of quantizing a floatingpoint model without (re)training. DNN weights and activations follow a bellshaped distribution posttraining, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the channel values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform stateoftheart clipping techniques with only minor overhead.

Distributed Learning with Sublinear Communication
Jayadev Acharya, Christopher De Sa, Dylan J. Foster, Karthik Sridharan
In ICML: the Thirtysixth International Conference on Machine Learning, June 2019
In distributed statistical learning, \( N \) samples are split across \( m \) machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in machine learning due to its scalability and potential for parallel speedup. However, in highdimensional settings, where the number examples is smaller than the number of features ("dimension"), the speedup afforded by distributed learning may be overshadowed by the cost of communicating a single example. This paper investigates the following question: When is it possible to learn a ddimensional model in the distributed setting with total communication sublinear in \( d \)?
Starting with a negative result, we show that for learning \( \ell_1 \)bounded or sparse linear models, no algorithm can obtain optimal error until communication is linear in dimension. Our main result is that that by slightly relaxing the standard boundedness assumptions for linear models, we can obtain distributed algorithms that enjoy optimal error with communication logarithmic in dimension. This result is based on a family of algorithms that combine mirror descent with randomized sparsification/quantization of iterates, and extends to the general stochastic convex optimization model.

A Kernel Theory of Modern Data Augmentation
Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré
In ICML: the Thirtysixth International Conference on Machine Learning, June 2019
Data augmentation, a technique in which a training set is expanded with classpreserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding data augmentation. We approach this from two directions: First, we provide a general model of augmentation as a Markov process, and show that kernels appear naturally with respect to this model, even when we do not employ kernel classification. Next, we analyze more directly the effect of augmentation on kernel classifiers, showing that data augmentation can be approximated by firstorder feature averaging and secondorder variance regularization components. These frameworks both serve to illustrate the ways in which data augmentation affects the downstream learning model, and the resulting analyses provide novel connections between prior work in invariant kernels, tangent propagation, and robust optimization. Finally, we provide several proofofconcept applications showing that our theory can be useful for accelerating machine learning workflows, such as reducing the amount of computation needed to train using augmented data, and predicting the utility of a transformation prior to training.

SWALP : Stochastic Weight Averaging in Low Precision Training
Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa
In ICML: the Thirtysixth International Conference on Machine Learning, June 2019
Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages lowprecision SGD iterates with a modified learning rate schedule. SWALP is easy to implement and can match the performance of fullprecision SGD even with all numbers quantized down to 8 bits, including the gradient accumulators. Additionally, we show that SWALP converges arbitrarily close to the optimal solution for quadratic objectives, and to a noise ball asymptotically smaller than low precision SGD in strongly convex settings.

Building Efficient Deep Neural Networks with Unitary Group Convolutions
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Christopher De Sa, Zhiru Zhang
The Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
We propose unitary group convolutions (UGConvs), a building block for CNNs which compose a group convolution with unitary transforms in feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e. ShuffleNet) and blockcirculant networks (i.e. CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different dense transforms exhibit comparable accuracy performance. Based on these observations we propose HadaNet, a UGConv network using Hadamard transforms. HadaNets achieve similar accuracy to circulant networks with lower computation complexity, and better accuracy than ShuffleNets with the same number of parameters and floatingpoint multiplies.

A Formal Framework For Probabilistic Unclean Databases
Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas
In the 22nd International Conference on Database Theory (ICDT 2019), March 2019.
Most theoretical frameworks that focus on data errors and inconsistencies follow logicbased reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in realworld data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database instance. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and a dirty observed database instance that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable clean instance given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed instance), and learning (estimate the most likely intention and realization models of a PUD given a collection of training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connection to consistent query answering, and prove tractability results. We further show that parameters can be learned in practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database instance without any need for clean examples.

Addressing sensor drift in a proprioceptive optical foam system
Ilse M. Van Meerbeek, Jose A. Barreiros, Robert F. Shepherd, Christopher M. De Sa
Proc. SPIE 10970, Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2019, March 2019.
We previously reported an elastomeric, optical foam sensor that can sense different types of deformation. The elastomeric foam is embedded with optical fibers that shine light into the foam while simultaneously transmitting scattered light exiting the foam. We applied machine learning techniques to the optical fiber data to form a prediction model that predicts whether the foam is being twisted or bent (classification), as well as the magnitude and direction of the deformation (regression). The best classification model had 100% accuracy on new data points, and the best regression models had a mean absolute error of 0.06 degrees on new data points. This kind of proprioceptive ability could give soft robots much more information about their physical state and therefore improve our ability to control them; however, prediction error increases with time due to drift in the optical fiber outputs. This paper presents an attempt to address this drift. We applied a technique based on work presented by Di Carlo et. al. This unsupervised technique uses the evolutionary optimization process "covariance matrix adaptation evolution strategy" (CMAES) to compute a correction factor that can be applied to unobserved, drifted data points. The best solutions reduced classification error by 49% and regression mean absolute error by 36%.

Soft optoelectronic sensory foams with proprioception
Ilse Van Meerbeek, Christopher De Sa, Robert Shepherd
Science Robotics, November 2018.
In a step toward soft robot proprioception, and therefore better control, this paper presents an internally illuminated elastomer foam that has been trained to detect its own deformation through machine learning techniques. Optical fibers transmitted light into the foam and simultaneously received diffuse waves from internal reflection. The diffuse reflected light was interpreted by machine learning techniques to predict whether the foam was twisted clockwise, twisted counterclockwise, bent up, or bent down. Machine learning techniques were also used to predict the magnitude of the deformation type. On new data points, the model predicted the type of deformation with 100% accuracy and the magnitude of the deformation with a mean absolute error of 0.06°. This capability may impart soft robots with more complete proprioception, enabling them to be reliably controlled and responsive to external stimuli.

Minibatch Gibbs Sampling on Large Graphical Models
Christopher De Sa, Vincent Chen, Wing Wong
In ICML: Proceedings of the 35rd International Conference on Machine Learning, July 2018.
Gibbs sampling is the de facto Markov chain Monte Carlo method used for inference and learning on large scale graphical models. For complicated factor graphs with lots of factors, the performance of Gibbs sampling can be limited by the computational cost of executing a single update step of the Markov chain. This cost is proportional to the degree of the graph, the number of factors adjacent to each variable. In this paper, we show how this cost can be reduced by using minibatching: subsampling the factors to form an estimate of their sum. We introduce several minibatched variants of Gibbs, show that they can be made unbiased, prove bounds on their convergence rates, and show that under some conditions they can result in asymptotic singleupdateruntime speedups over plain Gibbs sampling.

Representation Tradeoffs for Hyperbolic Embeddings
Frederic Sala, Chris De Sa, Albert Gu, Christopher Ré
In ICML: Proceedings of the 35rd International Conference on Machine Learning, July 2018.
Hyperbolic embeddings offer excellent quality with few dimensions when embedding hierarchical data structures. We give a combinatorial construction that embeds trees into hyperbolic space with arbitrarily low distortion without optimization. On WordNet, this algorithm obtains a meanaverageprecision of 0.989 with only two dimensions, outperforming existing work by 0.11 points. We provide bounds characterizing the precisiondimensionality tradeoff inherent in any hyperbolic embedding. To embed general metric spaces, we propose a hyperbolic generalization of multidimensional scaling (hMDS). We show how to perform exact recovery of hyperbolic points from distances, provide a perturbation analysis, and give a recovery result that enables us to reduce dimensionality. Finally, we extract lessons from the algorithms and theory above to design a scalable PyTorchbased implementation that can handle incomplete information.

The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
Dan Alistarh, Christopher De Sa, Nikola Konstantinov
In Principles of Distributed Computing (PODC 2018), July 2018.
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on distributed machine learning, significant work has been dedicated to the convergence properties of this algorithm under the inconsistent and noisy updates arising from execution in a distributed environment. However, surprisingly, the convergence properties of this classic algorithm in the standard sharedmemory model are still not wellunderstood.
In this work, we address this gap, and provide new convergence bounds for lockfree concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary. Our results give improved upper and lower bounds on the "price of asynchrony" when executing the fundamental SGD algorithm in a concurrent setting. They show that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations. At the same time, we exhibit a fundamental tradeoff between the maximum delay in the system and the rate at which SGD can converge, which governs the set of parameters under which this algorithm can still work efficiently.

Accelerated Stochastic Power Iteration
Christopher De Sa, Bryan He, Ioannis Mitliagkas, Christopher Ré, Peng Xu
In AISTATS: The 21st International Conference on Artificial Intelligence and Statistics, April 2018.
Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires \( \mathcal O(1/\Delta) \) fulldata passes to recover the principal component of a matrix with eigengap \( \Delta \). Lanczos, a significantly more complex method, achieves an accelerated rate of \( \mathcal O(1/\sqrt{\Delta}) \) passes. Modern applications, however, motivate methods that only ingest a subset of available data, known as the stochastic setting. In the online stochastic setting, simple algorithms like Oja's iteration achieve the optimal sample complexity \( \mathcal O(\sigma^2/\Delta^2) \). Unfortunately, they are fully sequential, and also require \( \mathcal O(\sigma^2/\Delta^2) \) iterations, far from the \( \mathcal O(1/\sqrt{\Delta}) \) rate of Lanczos. We propose a simple variant of the power iteration with an added momentum term, that achieves both the optimal sample and iteration complexity. In the fullpass setting, standard analysis shows that momentum achieves the accelerated rate, \( \mathcal O(1/\sqrt{\Delta}) \). We demonstrate empirically that naively applying momentum to a stochastic method, does not result in acceleration. We perform a novel, tight variance analysis that reveals the "breakingpoint variance" beyond which this acceleration does not occur. By combining this insight with modern variance reduction techniques, we construct stochastic PCA algorithms, for the online and offline setting, that achieve an accelerated iteration complexity \( \mathcal O(1/\sqrt{\Delta}) \). Due to the embarassingly parallel nature of our methods, this acceleration translates directly to wallclock time if deployed in a parallel environment. Our approach is very general, and applies to many nonconvex optimization problems that can now be accelerated using the same technique.

A Two Pronged Progress in Structured Dense Matrix Multiplication
Christopher De Sa, Albert Gu, Rohan Puttagunta, Christopher Ré, Atri Rudra
In SODA: ACMSIAM Symposium on Discrete Algorithms (SODA18), January 2018.
Matrixvector multiplication is one of the most fundamental computing
primitives. Given a matrix \( A\in\mathbb{F}^{N\times N} \) and a vector \( b \), it is
known that in the worst case \( \Theta(N^2) \) operations over \( \mathbb{F} \) are
needed to compute \( Ab \). A broad question is to identify classes of structured
dense matrices that can be represented with \( O(N) \) parameters, and for which
matrixvector multiplication can be performed subquadratically. One such class
of structured matrices is the orthogonal polynomial transforms, whose rows
correspond to a family of orthogonal polynomials. Other well known classes
include the Toeplitz, Hankel, Vandermonde, Cauchy matrices and their extensions
that are all special cases of a displacement rank property. In this paper, we
make progress on two fronts:
1. We introduce the notion of recurrence width of matrices. For matrices with
constant recurrence width, we design algorithms to compute \( Ab \) and \( A^Tb \) in a
nearlinear number of operations. This notion of width is finer than all the
above classes of structured matrices and thus we can compute multiplication for
all of them using the same core algorithm.
2. We additionally adapt this algorithm to an algorithm for a much more
general class of matrices with displacement structure: those with low
displacement rank with respect to quasiseparable matrices. This class includes
ToeplitzplusHankellike matrices, Discrete Cosine/Sine Transforms, and more,
and captures all previously known matrices with displacement structure that we
are aware of.
Our work unifies, generalizes, and simplifies existing stateoftheart
results in structured matrixvector multiplication. Finally, we show how
applications in areas such as multipoint evaluations of multivariate
polynomials and computing linear sequences can be reduced to problems involving
low recurrence width matrices.

Gaussian Quadrature for Kernel Features
Spotlight
Tri Dao, Christopher De Sa, Christopher Ré
In NIPS: Proceedings of the 30th Neural Information Processing Systems Conference, December 2017.
Kernel methods have recently attracted resurgent interest, matching the
performance of deep neural networks in tasks such as speech recognition. The
random Fourier features map is a technique commonly used to scale up kernel
machines, but employing the randomized feature map means that
\( O(\epsilon^{2})\) samples are required to achieve an approximation error of at
most \( \epsilon\) . In this paper, we investigate some alternative schemes for
constructing feature maps that are deterministic, rather than random, by
approximating the kernel in the frequency domain using Gaussian quadrature. We
show that deterministic feature maps can be constructed, for any \( \gamma > 0\) ,
to achieve error \( \epsilon\) with \( O(e^{\gamma} + \epsilon^{1/\gamma})\) samples
as \( \epsilon \) goes to 0. We validate our methods on datasets in different
domains, such as MNIST and TIMIT, showing that deterministic features are
faster to generate and achieve comparable accuracy to the stateoftheart
kernel methods based on random Fourier features.

Understanding and Optimizing Asynchronous LowPrecision Stochastic Gradient Descent
Chris De Sa, Matt Feldman, Christopher Ré, and Kunle Olukotun
In ISCA: 44th International Symposium on Computer Architecture, June 2017.
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called BUCKWILD that uses both asynchronous execution and lowprecision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing lowprecision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of lowprecision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11x. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of currentgeneration hardware. We also implement and analyze lowprecision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

Flipper: A Systematic Approach to Debugging Training Sets
Paroma Varma, Dan Iter, Christopher De Sa, and Christopher Ré
In HILDA: Proceedings of the 2nd Workshop on HumanIntheLoop Data Analytics, May 2017.
As machine learning methods gain popularity across different fields, acquiring labeled training datasets has become the primary bottleneck in the machine learning pipeline. Recently generative models have been used to create and label large amounts of training data, albeit noisily. The output of these generative models is then used to train a discriminative model of choice, such as logistic regression or a complex neural network. However, any errors in the generative model can propagate to the subsequent model being trained. Unfortunately, these generative models are not easily interpretable and are therefore difficult to debug for users. To address this, we present our vision for Flipper, a framework that presents users with highlevel information about why their training set is inaccurate and informs their decisions as they improve their generative model manually. We present potential tools within the Flipper framework, inspired by observing biomedical experts working with generative models, which allow users to analyze the errors in their training data in a systematic fashion. Finally, we discuss a prototype of Flipper and report results of a user study where users create a training set for a classification task and improve the discriminative model's accuracy by 2.4 points in less than an hour with feedback from Flipper.

Data Programming: Creating Large Training Sets, Quickly
Alex Ratner, Chris De Sa, Sen Wu, Daniel Selsam, and Christopher Ré
In NIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most timeconsuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label large subsets of data points, albeit noisily. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to "denoise" the training set. Then, we show how to modify a discriminative loss function to make it noiseaware. We demonstrate our method over a range of discriminative models including logistic regression and LSTMs. We establish theoretically that we can recover the parameters of these generative models in a handful of settings. Experimentally, on the 2014 TACKBP relation extraction challenge, we show that data programming would have obtained a winning score, and also show that applying data programming to an LSTM model leads to a TACKBP score almost 6 F1 points over a supervised LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way to create machine learning models for nonexperts.

Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much
Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré
In NIPS: Proceedings of the 29th Neural Information Processing Systems Conference, December 2016.
Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.

Socratic Learning: Empowering the Generative Model
Paroma Varma, Rose Yu, Dan Iter, Christopher De Sa, Christopher Ré
In FiLMNIPS: Future of Interactive Learning Machines at NIPS, December 2016.
Modern machine learning techniques, such as deep learning, often use discriminative models that require large amounts of labeled data. An alternative approach is to use a generative model, which leverages heuristics from domain experts to train on unlabeled data. Domain experts often prefer to use generative models because they "tell a story" about their data. Unfortunately, generative models are typically less accurate than discriminative models. Several recent approaches combine both types of model to exploit their strengths. In this setting, a misspecified generative model can hurt the performance of subsequent discriminative training. To address this issue, we propose a framework called Socratic learning that automatically uses information from the discriminative model to correct generative model misspecification. Furthermore, this process provides users with interpretable feedback about how to improve their generative model. We evaluate Socratic learning on realworld relation extraction tasks and observe an immediate improvement in classification accuracy that could otherwise require several weeks of effort by domain experts.

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Best Paper Award
Christopher De Sa, Kunle Olukotun, and Christopher Ré
In ICML: Proceedings of the 33rd International Conference on
Machine Learning, June 2016.
Gibbs sampling is a Markov chain Monte Carlo technique commonly used
for estimating marginal distributions.
To speed up Gibbs sampling, there has recently been interest in parallelizing
it by executing asynchronously. While empirical results suggest
that many models can be efficiently sampled asynchronously, traditional
Markov chain analysis does not apply to the asynchronous case, and thus
asynchronous Gibbs sampling is poorly understood.
In this paper, we derive a better understanding of the two main challenges
of asynchronous Gibbs: bias and mixing time.
We show experimentally that our theoretical results match practical outcomes.

Parallel SGD: When does Averaging Help?
Jian Zhang, Christopher De Sa, Ioannis Mitiliagkas, and Christopher Ré
In OptML: Optimization Methods for the Next Generation of Machine Learning
, workshop at ICML, June 2016.
Consider a number of workers running SGD independently on the
same pool of data and averaging the models every once in a while — a
common but not well understood practice. We study model averaging
as a variancereducing mechanism and describe two ways in which the
frequency of averaging affects convergence. For convex objectives, we show
the benefit of frequent averaging depends on the gradient variance envelope.
For nonconvex objectives, we illustrate that this benefit depends on the
presence of multiple globally optimal points. We complement our findings
with multicore experiments on both synthetic and real data.

DeepDive: Declarative Knowledge Base Construction
Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin,
Feiran Wang, Sen Wu, and Ce Zhang
Research highlight in SIGMOD Record, April 2016.
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a longstanding problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

Generating Configurable Hardware from Parallel Patterns
Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun
In ASPLOS: 21st Int'l Conference on Architectural Support
for Programming Languages and Operating Systems, April 2016.
In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixedfunction ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use lowlevel hardware description languages like Verilog and VHDL, which have none of the productivity features of modern software development languages but produce very efficient designs, and lowlevel software languages like C and OpenCL coupled with highlevel synthesis (HLS) tools that typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they both provide highlevel abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two optimizations that are important when using parallel patterns to generate hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups up to 40x on a set of benchmarks from the data analytics domain.

Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns
Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth,
Christopher De Sa, Christopher Aberger, and Kunle Olukotun
In CGO: International Symposium on Code Generation and Optimization, March 2016.
High performance in modern computing platforms requires
programs to be parallel, distributed, and run on heterogeneous
hardware. However programming such architectures
is extremely difficult due to the need to implement the application
using multiple programming models and combine
them together in adhoc ways. To optimize distributed applications
both for modern hardware and for modern programmers
we need a programming model that is sufficiently
expressive to support a variety of parallel applications, sufficiently
performant to surpass handoptimized sequential implementations,
and sufficiently portable to support a variety
of heterogeneous hardware. Unfortunately existing systems
tend to fall short of these requirements.
In this paper we introduce the Distributed Multiloop Language
(DMLL), a new intermediate language based on common
parallel patterns that captures the necessary semantic
knowledge to efficiently target distributed heterogeneous architectures.
We show straightforward analyses that determine
what data to distribute based on its usage as well as
powerful transformations of nested patterns that restructure
computation to enable distribution and optimize for heterogeneous
devices. We present experimental results for a range
of applications spanning multiple domains and demonstrate
highly efficient execution compared to manuallyoptimized
counterparts in multiple distributed programming models.

Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width
Spotlight
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width—regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.

Taming the Wild: A Unified Analysis of Hogwild!Style Algorithms
Christopher De Sa, Ce Zhang, Kunle Olukotun, Christopher Ré
In NIPS: Proceedings of the 28th Neural Information Processing Systems Conference, December 2015.
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingalebased analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for nonconvex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lowerprecision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

Incremental Knowledge Base Construction Using DeepDive
Best of Issue
Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, Christopher De Sa, Christopher Ré
In VLDB: Proceedings of the 41st International Conference on Very Large Data Bases, September 2015.
Populating a database with unstructured information is a
longstanding problem in industry and research that encompasses
problems of extraction, cleaning, and integration. Recent
names used for this problem include dealing with dark
data and knowledge base construction (KBC). In this work,
we describe DeepDive, a system that combines database and
machine learning ideas to help develop KBC systems, and we
present techniques to make the KBC process more efficient.
We observe that the KBC process is iterative, and we develop
techniques to incrementally produce inference results
for KBC systems. We propose two methods for incremental
inference, based respectively on sampling and variational
techniques. We also study the tradeoff space of these methods
and develop a simple rulebased optimizer. DeepDive
includes all of these contributions, and we evaluate DeepDive
on five KBC systems, showing that it can speed up
KBC inference tasks by up to two orders of magnitude with
negligible impact on quality.

Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems
Christopher De Sa, Kunle Olukotun, and Christopher Ré
In ICML: Proceedings of the 32nd International Conference on
Machine Learning, July 2015.
Stochastic gradient descent (SGD) on a lowrank factorization is commonly employed to speed up matrix problems including matrix completion, subspace tracking, and SDP relaxation. In this paper, we exhibit a step size scheme for SGD on a lowrank leastsquares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within \(O(\epsilon^{1} n \log n)\) steps with constant probability for constantrank problems. Our modification of SGD relates it to stochastic power iteration. We also show experiments to illustrate the runtime and convergence of the algorithm.