<h1><a href="http://www.cs.cornell.edu/~asampson/blog/closedproblems.html">Closed Problems in Approximate Computing</a></h1>
<p>Adrian Sampson · 2017-10-14</p>
<aside>
These are notes for a <a href="/~asampson/media/closedproblems-nope2017-slides.pdf">talk</a> I will give at the <a href="http://nope.pub">NOPE</a> workshop at MICRO 2017, where the title is <i>Approximate Computing Is Dead; Long Live Approximate Computing</i>.
</aside>
<p><a href="/~asampson/research.html#approximate-computing">Approximate computing</a> has reached an adolescent phase as a research area. We have picked bushels of low-hanging fruit. While there are many approximation papers left to write, it’s a good time to enumerate the closed problems: research problems that are probably no longer worth pursuing.</p>
<h2 id="closed-problems-in-approximate-hardware">Closed Problems in Approximate Hardware</h2>
<p><strong>No more approximate functional units.</strong>
Especially for people who love VLSI work, a natural first step in approximate computing is to design approximate adders, multipliers, and other basic functional units. Cut a carry chain here, drop a block of intermediate results there, or use an automated search to find “unnecessary” gates—there are lots of ways to design an FU that’s mostly right most of the time. Despite dozens of papers in this vein, however, the gains seem to range from minimal to nonexistent. A lovely <a href="https://hal.inria.fr/hal-01423147">DATE 2017 paper by Barrois et al.</a> recently studied some of these approximate FUs and found that:</p>
<blockquote>
<p>existing approximate adders and multipliers tend to be dominated by truncated or rounded fixed-point ones.</p>
</blockquote>
<p>In other words, plain old fixed-point FUs with a narrower bit width are usually at least as good as fancy “approximate” FUs. The problem is that, if you’re approximating the values below a given bit position, it’s usually not worth it to compute those approximate bits at all. In fact, by dropping the approximate bits altogether, you can exploit the smaller data size for broader advantages in the whole application. Software using approximate adders and multipliers, on the other hand, ends up copying around and operating on worthless, incorrect trailing bits for no benefit in precision.</p>
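To make the truncation baseline concrete, here is a toy sketch (the function name, bit width, and example values are mine) of a fixed-point add that ignores the low-order bits entirely:

```python
def truncated_add(a, b, drop_bits=4):
    """Add two fixed-point integers while ignoring their low drop_bits bits.

    In hardware, the low-order adder logic (and the bits themselves) would
    simply not exist; masking here models the narrower datapath.
    """
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) + (b & mask)

exact = 183 + 109                   # 292
approx = truncated_add(183, 109)    # 272: error bounded by 2 * (2**4 - 1)
```

The worst-case error is bounded, and the dropped bits never need to be stored or moved, unlike the incorrect trailing bits that an approximate FU still produces and that software must then carry around.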
<p>This paper has raised the bar for FU-level approximation research. We should no longer publish VLSI papers that measure approximate adders in isolation. Without a radically different approach, we should stop designing approximate functional units altogether.</p>
<p><strong>No more voltage overscaling.</strong>
For some, <em>approximate computing</em> is a synonym for <em>voltage overscaling</em>. Voltage overscaling is when you turn up the clock rate or turn down $V_{\text{DD}}$ beyond their safe ranges and allow occasional timing errors. I accept some of the blame for solidifying voltage overscaling’s outsized mindshare by co-authoring <a href="/~asampson/media/papers/truffle-asplos2012.pdf">a paper about “architecture support for approximate computing”</a> that exclusively used voltage as its error–energy knob.</p>
<p>The problem with voltage overscaling is that it’s nearly impossible to evaluate. It’s easy to model its effects on energy and frequency, but the pattern of timing errors depends on a chip’s design, synthesis, layout, manufacturing process, and even environmental conditions such as temperature. Even a halfway-decent error analysis for voltage overscaling requires a full circuit simulator. To account for process variation, we’d need to tape out real silicon at scale. In a frustrating Catch-22 of research evaluation, it’s hard to muster the enthusiasm for a tapeout before we can prove that the results are likely to be good.</p>
<p>There is even a credible argument that the results are likely to be bad. In voltage overscaling, the circuit’s critical path fails first. And in many circuits, the longest paths in the design contribute the most to the output accuracy. In a functional unit, for example, the most significant output bits usually arise from the longest paths. So instead of a smooth degradation in accuracy as the voltage decreases, these circuits are likely to fall off a steep accuracy cliff when the critical path first fails to meet timing.</p>
<p>Research should stop using voltage overscaling as the “default” approximate computing technique. In fact, we should stop using it altogether until we have evidence from real silicon that the technique’s voltage–error trade-offs are favorable.</p>
<p><strong>In general, no more fine-grained approximate operations.</strong>
Approximate functional units and voltage overscaling are both instances of <em>operation-level</em> approximation techniques. They reduce the cost of individual operations like multiplies or adds, but they do not change the computation’s higher-level structure. All of these <a href="http://ieeexplore.ieee.org/document/8054698/">fine-grained techniques</a> have to contend with <a href="http://ieeexplore.ieee.org/document/6757323/">the Horowitz imbalance</a>, to coin a phrase: the huge discrepancy between the cost of processor control with respect to “real work.” Even if an FU operation were free, the benefit would be irrelevant compared to the cost of fetching, decoding, and scheduling the instruction that invoked it. These fine-grained approximation strategies are no longer worth pursuing on their own.</p>
<h3 id="instead">Instead…</h3>
<p>If we have any hope of making hardware approximation useful, we will need to start by addressing control overhead. Research that reduces non-computational processing costs works as a benefit multiplier for approximate computing. Approximate operations in <a href="https://dl.acm.org/citation.cfm?id=2594339">CGRA-like spatial architectures</a> or <a href="https://dl.acm.org/citation.cfm?id=2750380">explicit dataflow processors</a>, for example, have a chance of succeeding where they would fail in a CPU or GPU context. We have work to do to integrate approximation into the <a href="https://dl.acm.org/citation.cfm?id=2594339">constraint-based</a> <a href="https://dl.acm.org/citation.cfm?id=2462163">techniques</a> that these accelerators use for configuration.</p>
<h2 id="closed-problems-in-approximate-programming-models">Closed Problems in Approximate Programming Models</h2>
<p><strong>No more automatic approximability analysis.</strong>
Papers in programming languages sometimes try to automatically determine which variables and operations in an unannotated program require perfect precision and which are amenable to approximation. The idea is—sometimes explicitly—to alleviate <a href="/~asampson/media/papers/enerj-pldi2011.pdf">EnerJ</a>’s annotation burden (which can be high, I admit).</p>
<p>This is not a good goal. Imagine a world where your compiler is free to make its own decisions about which parts of your program are really important and which can be exposed to errors. No one wants this compiler.</p>
<p>“But wait,” the compiler might protest. “I can demonstrate that approximating those variables has only a tiny impact on your quality metric in this broad set of test inputs!”</p>
<p>That’s very useful to know, compiler, but it’s not strong enough evidence to justify breaking a program without the developer’s express consent. Without a guarantee that the test inputs perfectly represent real run-time conditions, silent failures in the field are a real possibility. And quality metrics are only a loose reflection of real-world utility, so basing automatic decisions on them is deeply concerning.</p>
<p>Work that makes EnerJ annotations implicit misunderstands EnerJ’s intent. We designed EnerJ <em>in response</em> to earlier work that applied approximation without developer involvement. The explicit annotation style acts as a check on the compiler’s freedom to break your code. The time has passed for research that places the power back into the compiler’s grubby hands.</p>
<p><strong>No more generic unsound compiler transformations.</strong>
I love <a href="https://people.csail.mit.edu/stelios/papers/fse11.pdf">loop perforation</a> and the devil-may-care attitude that its paper represents. I hope its inventors won’t be angry if I say loop perforation is the world’s dumbest approximate programming technique: it works by finding a loop and changing its counter increment expression, <code class="highlighter-rouge">i++</code>, to <code class="highlighter-rouge">i += 2</code> or <code class="highlighter-rouge">i += 3</code>. The shocking thing about loop perforation is that it sometimes works: some loops can survive the removal of some of their iterations.</p>
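A minimal sketch of the transformation (the averaging example and stride are mine, not from the loop perforation paper):

```python
def average(values):
    total = 0
    for v in values:        # the original loop: i++
        total += v
    return total / len(values)

def average_perforated(values, stride=3):
    """The perforated loop: i += 3, skipping two thirds of the work."""
    total, count = 0, 0
    for i in range(0, len(values), stride):
        total += values[i]
        count += 1
    return total / count
```

For many reduction-like loops, the skipped iterations barely move the answer, which is exactly why this dumb-sounding transformation is so hard to beat.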
<p>Loop perforation is surprisingly hard to beat. Using it as an inspiration, you can imagine endless compiler-driven schemes for creatively transforming programs that, while totally unsound, will work some of the time. In my anecdotal experience, however, few techniques can dominate loop perforation on an efficiency–accuracy Pareto frontier. Some transformations do somewhat better some of the time, but I have never seen a dramatic, broad improvement.</p>
<p>It’s time to stop looking. While it can be fun to cook up novel unsound compiler transformations, we do not need any more papers in this vein.</p>
<h3 id="instead-1">Instead…</h3>
<p>More researchers in our community should favor tool design over language constructs and program analysis. For example, we need practical operating system support for balancing resource contention with approximate computing. <a href="http://approximate.computer/wax2017/papers/kulkarni.pdf">Especially in datacenters</a>, applications should be able to negotiate with the OS to reduce their output quality in exchange for bandwidth or latency. An approximation-aware resource scheduler does not depend on novel hardware or compiler techniques: many applications have built-in quality parameters that can compete with resource consumption. Research prototypes probably won’t cut it for this kind of work; real-world system implementations, on the other hand, might be ripe for adoption.</p>
<h2 id="closed-problems-in-quality-enforcement">Closed Problems in Quality Enforcement</h2>
<p><strong>No more weak statistical guarantees.</strong>
To control output quality degradation in approximate computing, one promising approach is to offer a <em>statistical guarantee</em>. When an approximation technique leads to good quality in most cases but poor quality in rare cases, traditional compile-time guarantees can be unhelpful. A statistical guarantee, on the other hand, can bound the <em>probability</em> of seeing a bad output. A statistical guarantee might certify, for example, that the probability of observing output error $E$ above a threshold $T$ is at most $P$.</p>
<p>Too many papers that strive to check statistical correctness end up offering <a href="/~asampson/blog/probablycorrect.html">extremely weak guarantees</a>. The problem is that even fancy-sounding statistical machinery rests on the dubious assumption that we can predict the probability distribution that programs will encounter at run time. We assume that an input follows a Gaussian distribution, or that it’s uniformly distributed in some range, or that it is drawn from a known body of example images. For an input randomly selected from this given input distribution, we can make a strong guarantee about the probability of observing a high-quality output.</p>
<p>When real-world inputs inevitably follow some other distribution, however, all bets are off. Imagine a degenerate distribution that finds the worst possible input for your approximate program, in terms of output quality, and presents that value with probability 1.0. An <em>adversarial input distribution</em> can break any quality enforcement technique that relies on stress-testing a program pre-deployment. Even ignoring adversarial conditions, it’s extremely hard to defend the assumption that in-deployment inputs for a program will <em>exactly</em> match the distribution that the programmer modeled in development. Run-time input distributions are inherently unpredictable, and they render development-time statistical guarantees useless.</p>
<h3 id="instead-2">Instead…</h3>
<p>We need more research on run-time enforcement that directly addresses the problem of unpredictable input distributions. Consider SaaS applications based on machine learning: <a href="https://wit.ai">Wit.ai</a> for natural language understanding, for example, or <a href="https://aws.amazon.com/rekognition/">Rekognition</a> for computer vision. All ML models have an error rate, meaning that some customers’ workloads will observe higher accuracy than others. Given this variation in output accuracy, what strong statements can cloud providers make to their customers about precision? And if a service advertises a quality guarantee, how can customers keep the provider honest without recomputing everything themselves? These narrower questions may be tractable where the fully general problem of statistical guarantees is not.</p>
<h2 id="closed-domains-for-approximation">Closed Domains for Approximation</h2>
<p><strong>No more sadness about the imperfection of quality metrics.</strong>
All approximate computing research depends on application-specific output quality metrics, and these quality metrics are far from perfect. Researchers make a good-faith effort to capture meaningful properties in each program’s output, but we have no real assurances that any metric is reasonable. Even worse, we often need to fix an arbitrary threshold on quality to call an output “good enough.” These thresholds rarely have any bearing on any deployment scenario, real or imagined. A <em>de facto</em> standard threshold of 10% error has emerged, which is both a triumph in consistency and a tragedy in real-world relevance.</p>
<p>Arbitrary thresholds and researcher-invented quality metrics are worth griping about, but most of what needs to be said about this sorry state of affairs has already been said. The system is not perfect, but neither are our other evaluation standards: PARSEC is not a perfect representation of all parallel computation workloads; gem5 is not a perfect model of real CPUs; and geometric mean speedup is not a perfect proxy for utility. No standard evaluation strategy is without flaws. We should constantly work to develop better quality metrics and to understand thresholds that actually matter to users, but it is no longer useful to complain about the basic system of quality metrics and thresholds.</p>
<p><strong>No more benchmark-oriented research?</strong>
More worrisome than quality metrics themselves is the fact that we need to invent quality metrics at all. Our current evaluation standards are a symptom of a <em>benchmark-oriented</em> approach to approximate computing research. It follows the basic strategy forged by classic architecture and compiler research: develop a new gadget and measure its impact on figures of merit for a broad class of off-the-shelf benchmarks from as many domains as possible. The point is to claim generality: to show that an idea advances the capabilities of <em>computers in general</em>, and that it’s not just an optimization for one person’s code.</p>
<p>This commitment to generality is the root cause of approximate computing’s messy evaluation norms. When a research project needs to demonstrate benefits for seven different domains, its researchers don’t have time to deeply engage with any single domain. A benchmark-driven attitude leads directly to invented quality metrics, arbitrary thresholds, and minimal involvement from domain experts. To break free from the traditional trappings of approximate computing work, we need to break free from benchmark suites.</p>
<h3 id="instead-3">Instead…</h3>
<p>Instead of bringing approximation to every domain we can think of, let’s look for domains that already embrace approximation. In the PARSEC benchmarks, approximate computing is optional. There are real, important applications where <a href="http://approximate.computer/wax2016/papers/sampson.pdf">approximation is compulsory</a> because perfection is unachievable. In these domains, we don’t have to invent quality metrics: researchers in the domain already have a consensus on what makes one system better than another.</p>
<p>Approximation is compulsory in AI domains like vision, natural language understanding, and speech recognition, for example, so these fields already have established methodologies for measuring accuracy. These established metrics, like word error rate for speech recognition or mean average precision for object detection, are certainly not perfect, but their flaws are intimately understood by ML and AI researchers. Real-time 3D rendering is another example: approximations abound in the effort to draw a subjectively beautiful scene at a high frame rate.</p>
<p>Approximate computing researchers should embed with these domains. Instead of trying to foist a new framework for approximate execution onto domain experts, we can learn something from how they currently manage their compulsory approximation. Following the approximation rules for an established domain will be hard work: any new approximation technique will need to beat the Pareto frontier established by conventional techniques. And it’s impossible to manipulate the quality metric to make your proposal look better. But no one said research was supposed to be easy.</p>
<h1><a href="http://www.cs.cornell.edu/~asampson/blog/statsmistakes.html">Statistical Mistakes and How to Avoid Them</a></h1>
<p>2016-11-23</p>
<p>Computer scientists in systemsy fields, myself included, aren’t great at using statistics. Maybe it’s because there are so many other potential problems with empirical evaluations that solid statistical reasoning doesn’t seem that important. Other subfields, like HCI and machine learning, have much higher standards for data analysis. Let’s learn from their example.</p>
<p>Here are three kinds of avoidable statistics mistakes that I notice in published papers.</p>
<h3 id="no-statistics-at-all">No Statistics at All</h3>
<p>The most common blunder is not using statistics at all when your paper clearly uses statistical data. If your paper uses the phrase “we report the average time over 20 runs of the algorithm,” for example, you should probably use statistics.</p>
<p>Here are two easy things that every paper should do when it deals with performance data or anything else that can randomly vary:</p>
<p>First, plot the error bars. In every figure that represents an average, compute the <a href="https://www.r-bloggers.com/standard-deviation-vs-standard-error/">standard error of the mean</a> or just the plain old standard deviation and add little whiskers to each bar. Explain what the error bars mean in the caption.</p>
<p><img src="/~asampson/media/errorbars.svg" alt="(a) Just noise. (b) Meaningful results. (c) Who knows???" class="img-responsive" style="width: 100%;" /></p>
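Computing the mean and its whiskers takes only the standard library; a quick sketch (the function name is mine):

```python
import math
import statistics

def mean_with_error_bar(samples):
    """Return (mean, standard error of the mean) for repeated measurements."""
    m = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return m, sem

m, sem = mean_with_error_bar([1.0, 2.0, 3.0, 4.0, 5.0])
```

Pass the standard error (or the standard deviation, as long as the caption says which) as the <code class="highlighter-rouge">yerr</code> argument to matplotlib’s <code class="highlighter-rouge">bar</code> or <code class="highlighter-rouge">errorbar</code> to draw the whiskers.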
<p>Second, do a simple statistical test. If you ever say “our system’s average running time is X seconds, which is less than the baseline running time of Y seconds,” you need to show that the difference is <a href="https://en.wikipedia.org/wiki/Statistical_significance">statistically significant</a>. Statistical significance tells the reader that the difference you found was more than just “in the noise.”</p>
<p>For most CS papers I read, a really basic test will work: <a href="http://stattrek.com/hypothesis-test/difference-in-means.aspx?Tutorial=AP">Student’s $t$-test</a> checks that two averages that look different actually are different. The process is easy. Collect $N_1$ and $N_2$ samples from the two conditions, compute the mean $\overline{X}$ and the standard deviation $s$ for each, and plug them into this formula:</p>
<p>\[
t =
\frac{ \overline{X}_1 - \overline{X}_2 }
{ \sqrt{ \frac{s_1^2}{N_1} +
\frac{s_2^2}{N_2} } }
\]</p>
<p>then plug that $t$ into <a href="https://en.m.wikipedia.org/wiki/Student%27s_t-distribution">the cumulative distribution function of the $t$-distribution</a> to get a $p$-value. If your $p$-value is below a threshold $\alpha$ that you chose ahead of time (0.05 or 0.01, say), then you have a statistically significant difference. Your favorite numerical library probably already has <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html">an implementation</a> that does all the work for you.</p>
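As a concrete sketch with made-up timing numbers (eight runs per condition; the data are mine for illustration):

```python
import math
import statistics

def t_statistic(xs, ys):
    """Student's t for two independent samples, per the formula above."""
    m1, m2 = statistics.mean(xs), statistics.mean(ys)
    v1, v2 = statistics.variance(xs), statistics.variance(ys)
    return (m1 - m2) / math.sqrt(v1 / len(xs) + v2 / len(ys))

# hypothetical running times, in seconds
ours     = [1.21, 1.19, 1.22, 1.18, 1.20, 1.17, 1.23, 1.19]
baseline = [1.30, 1.28, 1.33, 1.27, 1.31, 1.29, 1.32, 1.30]

t = t_statistic(ours, baseline)   # ~ -10: far beyond any 5% critical value
```

In practice you would hand the raw samples to <code class="highlighter-rouge">scipy.stats.ttest_ind</code>, which also computes the $p$-value for you.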
<p>If you’ve taken even an intro stats course, you know all this already! But you might be surprised to learn how many computer scientists don’t. Program committees don’t require that papers use solid statistics, so the literature is full of statistics-free but otherwise-good papers, so standards remain low, and Prof. Ouroboros keeps drawing figures without error bars. Other fields are <a href="http://www.nature.com/news/psychology-journal-bans-p-values-1.17001">moving <em>beyond</em> the $p$-value</a>, and CS isn’t even there yet.</p>
<h3 id="failure-to-reject--confirmation">Failure to Reject = Confirmation</h3>
<p>When you do use a statistical test in a paper, you need to interpret its results correctly. When your test produces a $p$-value, here are the correct interpretations:</p>
<ul>
<li>If $p < \alpha$: The difference between our average running time and the baseline’s average running time is statistically significant. Pedantically, we <em>reject the null hypothesis</em> that says that the averages might be the same.</li>
<li>Otherwise, if $p \ge \alpha$: We conclude nothing at all. Pedantically, we <em>fail to reject</em> that null hypothesis.</li>
</ul>
<p>It’s tempting to think, when $p \ge \alpha$, that you’ve found the opposite thing from the $p < \alpha$ case: that you get to conclude that there is <em>no statistically significant difference</em> between the two averages. Don’t do that!</p>
<p>Simple statistical tests like the $t$-test only tell you when averages are different; they can’t tell you when they’re the same. When they fail to find a difference, there are two possible explanations: either there is no difference or you haven’t collected enough data yet. So when a test fails, it could be your fault: if you had run a slightly larger experiment with a slightly larger $N$, the test might have successfully found the difference. It’s always wrong to conclude that the difference does not exist.</p>
<p>If you want to claim that two means are <em>equal</em>, you’ll need a different test whose null hypothesis says that they differ by at least a certain amount. An equivalence test built from <a href="http://stattrek.com/hypothesis-test/difference-in-means.aspx?Tutorial=AP">two one-sided $t$-tests</a> (TOST) will do.</p>
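To see why failure to reject is not confirmation, compute $t$ for the same mean difference at two sample sizes (the numbers are illustrative and mine):

```python
import math
import statistics

def t_statistic(xs, ys):
    """Student's t for two independent samples."""
    m1, m2 = statistics.mean(xs), statistics.mean(ys)
    v1, v2 = statistics.variance(xs), statistics.variance(ys)
    return (m1 - m2) / math.sqrt(v1 / len(xs) + v2 / len(ys))

xs, ys = [10.0, 10.2, 9.9], [10.1, 10.3, 10.0]

t_small = t_statistic(xs, ys)            # |t| ~ 0.8: inconclusive
t_large = t_statistic(xs * 10, ys * 10)  # same difference, 10x the data
```

At three samples per condition, $|t| \approx 0.8$ is far below the 5% critical value (about 2.78 at 4 degrees of freedom), so we conclude nothing. With ten times the data, the identical mean difference yields $|t| \approx 3$, which is significant. The small experiment never showed the difference was absent; it just hadn’t collected enough evidence.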
<h3 id="the-multiple-comparisons-problem">The Multiple Comparisons Problem</h3>
<p>In most ordinary evaluation sections, it’s probably enough to use only a handful of statistical tests to draw one or two bottom-line conclusions. But you might find yourself automatically running an unbounded number of comparisons. Perhaps you have $n$ benchmarks, and you want to compare the running time <em>on each one</em> to a corresponding baseline with a separate statistical test. Or maybe your system works in a feedback loop: it tries one strategy, performs a statistical test to check whether the strategy worked, and starts over with a new strategy otherwise.</p>
<p>Repeated statistical tests can get you into trouble. The problem is that every statistical test has a probability of lying to you. The probability that any <em>single</em> test is wrong is small, but if you do lots of tests, the probability that at least one of them goes wrong grows quickly.</p>
<p>For example, say you choose $\alpha = 0.05$ and run one $t$-test. When the test succeeds—when it finds a significant difference—it’s telling you that, if a difference <em>didn’t</em> exist, the data you saw would arise by random chance with probability $\alpha$. So there’s still a chance that you measured a difference when one doesn’t really exist, but that happens in only 5 out of 100 parallel universes. I’d take that bet.</p>
<p>Now, say you run a series of $n$ tests in the scope of one paper. Say there’s no <em>true</em> difference to be found. Even so, every test has an $\alpha$ chance of going wrong and telling you a difference exists. The chances that your paper has more than $k$ errors in it is given by the binomial distribution:</p>
<p>\[
1 - \sum_{i=0}^{k} {n \choose i} \alpha^i (1-\alpha)^{n-i}
\]</p>
<p>which approaches certainty quickly as the number of tests, $n$, grows. If you use just 10 tests with $\alpha = 0.05$, for example, your chance of having at least one test go wrong grows to 40%. If you do 100, the probability is above 99%. At that point, it’s a near certainty that your paper is misreporting some result.</p>
<p>(To compute these probabilities yourself, set $k = 0$ so you get the chance of at least one error. Then the expression above simplifies down to $1 - (1 - \alpha)^n$.)</p>
<p>This pitfall is called the <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a>. If you really need to run lots of tests, all is not lost: there are standard ways to compensate for the increased chance of error. The simplest are the <a href="http://mathworld.wolfram.com/BonferroniCorrection.html">Bonferroni</a> and <a href="https://en.m.wikipedia.org/wiki/Šidák_correction">Šidák</a> corrections: Bonferroni reduces your per-test $\alpha$ to $\frac{\alpha}{n}$, and Šidák uses $1 - (1 - \alpha)^{1/n}$, to preserve an overall $\alpha$ chance of going wrong.</p>
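The arithmetic is easy to check yourself (the Šidák form $1 - (1 - \alpha)^{1/n}$ is exact for independent tests; Bonferroni’s $\alpha/n$ is a close, conservative approximation):

```python
alpha, n = 0.05, 10

# chance that at least one of n independent tests spuriously "succeeds"
familywise = 1 - (1 - alpha) ** n        # ~0.40 for n = 10
familywise_100 = 1 - (1 - alpha) ** 100  # ~0.99 for n = 100

# per-test thresholds that restore an overall alpha
bonferroni = alpha / n                   # 0.005 (conservative)
sidak = 1 - (1 - alpha) ** (1 / n)       # ~0.0051 (exact under independence)
restored = 1 - (1 - sidak) ** n          # exactly alpha again
```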
<p>You can get CS papers published with shoddy statistics, but that doesn’t mean you should. Here are three easy ways to bungle the data analysis in your evaluation section: don’t even try to use statistics when you really ought to; misinterpret an inconclusive statistical test as concluding a negative; or run too many tests without considering that some of them might be lying to you. I’ve seen all three of these mistakes in multiple published papers—don’t let this be you!</p>
<h1><a href="http://www.cs.cornell.edu/~asampson/blog/probablycorrect.html">Probably Correct</a></h1>
<p>2016-06-15</p>
<p>How do you know whether a program is good enough if it’s allowed to be wrong some of the time?</p>
<p>Say, for example, that you want to use <a href="https://en.wikipedia.org/wiki/Fast_inverse_square_root">Quake III’s famous inverse square root approximation</a>.
The approximation is closer to $x^{-1/2}$ for some inputs $x$ and farther away for others.
You’ll want to know the chances that the approximation is close enough for the $x$s you care about.</p>
<p>When your program can only be right some of the time, it’s important to take a statistical view of correctness.
This is not just about squirrelly floating-point hacks: probably-correct programs are ubiquitous, from <a href="http://www.apple.com/ios/siri/">Siri</a> to <a href="https://www.teslamotors.com/presskit/autopilot">Tesla’s autopilot</a>.
This post is about infusing statistics into the ways we define correctness and the everyday tools we use to enforce it, like unit testing.
We’ll explore two simple but solid approaches to enforcing statistical correctness.
The first is an analogy to traditional testing, and the second moves checking to run time for a stronger guarantee.
Both require only Wikipedia-level statistics to understand and implement.</p>
<p>At the end, I’ll argue that these basic approaches are deceptively difficult to beat.
If we want to make stronger guarantees about probably-correct programs, we’ll need more creative ideas.</p>
<h2 id="correct-vs-probably-correct">Correct vs. Probably Correct</h2>
<p>First, let’s recap traditional definitions of correctness.
With ordinary, hopefully-always-correct programs, the ultimate goal is <strong>verification</strong>:</p>
<p>\[ \forall x \; f(x) \text{ is good} \]</p>
<p>The word <em>good</em> is intentionally vague: it might mean something about the output $f$ writes to a file, or about how fast $f$ runs, or whether $f$ violated some security policy.
In any case, verification says your program behaves well on every input.</p>
<p>Verification is hard, so we also have <strong>testing</strong>, which says a program behaves well on a few example inputs:</p>
<p>\[ \forall\; x \in X \; f(x) \text{ is good} \]</p>
<p>Testing tells us a set of inputs $X$ all lead to good behavior.
It doesn’t imply $\forall x$ anything, but it’s something.</p>
<p>For this post, we’ll assume $f$ is good on some inputs and bad on others, but it doesn’t fail at random.
In other words, it’s <em>deterministic:</em> for a given $x$, running $f(x)$ is either always good or always bad.
The <a href="https://en.wikipedia.org/wiki/Fast_inverse_square_root">fast inverse square root</a> function is one example: the error is below $10^{-4}$ for most inputs, but it can be as high as $0.04$ for reasonably small values of $x$.
(See for yourself with this <a href="https://gist.github.com/sampsyo/c1ed448618dadce682fdc5303ce432ec">Python implementation</a>.)
If you know your threshold for a good-enough inverse square root is an error of 0.01, you’ll want to know your chances of violating that bound.</p>
<p>Nondeterministically correct programs are also important, of course, but there the goal is to show something more complicated: something like $\forall x \; \text{Pr}\left[ f(x) \text{ is good} \right] \ge T$.
This post focuses on deterministic programs.</p>
<h2 id="statistical-testing">Statistical Testing</h2>
<p>There’s an easy way to get a basic kind of statistical correctness.
It’s roughly equivalent to traditional testing in terms of both difficulty and strength, so I’ll call it <strong>statistical testing</strong>.
(But to be clear, this is not my invention.)</p>
<p>The idea is to pick, instead of a set $X$ of representative inputs, a <a href="https://en.wikipedia.org/wiki/Probability_distribution">probability distribution</a> $D$ of inputs that you think is representative of real-world behavior.
For the fast inverse square root function, for example, we might pick a uniform distribution between 0.0 and 10.0, suggesting that any input in that range is equally likely.</p>
<p>Statistical testing can show, with high confidence, that when you randomly choose an $x$ from the input distribution $D$, it has a high probability of making $f(x)$ good.
In other words, your goal is to show:</p>
<p>\[ \text{Pr}_{x \sim D} \left[ f(x) \text{ is good} \right] \ge T \]</p>
<p>with confidence $\alpha$.
Your <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence</a> parameter helps you decide how much evidence to collect—instead of proving that statement absolutely, we’ll say that we have observed enough evidence that there’s only an $\alpha$ chance we observed a random fluke.</p>
<p>Let $p = \text{Pr}_{x \sim D} \left[ f(x) \text{ is good} \right]$ be the <em>correctness probability</em> for $f$.
Our goal is to check whether $p \ge T$, our threshold for <em>good enough</em>.
Here’s the complete recipe:</p>
<ol>
<li>Pick your input distribution $D$.</li>
<li>Randomly choose $n$ inputs $x$ according to $D$. (This is called <a href="https://en.wikipedia.org/wiki/Sampling_(statistics)">sampling</a>.)</li>
<li>Run $f$ on each sampled $x$ and check whether each $f(x)$ is good.</li>
<li>Let $g$ be the number of good runs. Now, $\hat{p} = \frac{g}{n}$ is your estimate for $p$.</li>
<li>Perform some light statistics magic.</li>
</ol>
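Steps 1–4 are mechanical. Here is a sketch for the fast inverse square root example, assuming a uniform input distribution (bounded away from zero so relative error is well defined) and the 0.01 error threshold from above; the helper names are mine, and the magic constant is the well-known Quake III one:

```python
import random
import struct

def fast_inv_sqrt(x):
    """Quake III's approximation of x**-0.5, with one Newton refinement."""
    i = struct.unpack('<i', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)           # the famous bit-level hack
    y = struct.unpack('<f', struct.pack('<i', i))[0]
    return y * (1.5 - 0.5 * x * y * y)

def estimate_correctness(n=10000, threshold=0.01, seed=0):
    """Sample n inputs from D, run f on each, and return p-hat = g / n."""
    rng = random.Random(seed)
    good = 0
    for _ in range(n):
        x = rng.uniform(0.1, 10.0)      # steps 1-2: sample from D
        exact = x ** -0.5
        err = abs(fast_inv_sqrt(x) - exact) / exact
        good += err < threshold         # step 3: is f(x) good?
    return good / n                     # step 4: the estimate p-hat

p_hat = estimate_correctness()
```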
<p>There are a few ways to do the statistics. Here’s a really simple way: use a <a href="https://en.m.wikipedia.org/wiki/Binomial_proportion_confidence_interval">confidence interval formula</a> to get upper and lower bounds on $p$.
The <a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval">Clopper–Pearson</a> formula, for example, gives you a $p_{\text{low}}$ and $p_{\text{high}}$ so that:</p>
<p>\[ \text{Pr}\left[ p_{\text{low}} \le p \le p_{\text{high}} \right] \ge 1 - \alpha \]</p>
<p>Remember that $\alpha$ is small, so you’re saying it’s likely that the interval really does contain $p$.
If $p_{\text{low}} \ge T$, then you can say with confidence $\alpha$ that $f$ is good on the input distribution $D$.
If $p_{\text{high}} \le T$, then you can say that $f$ isn’t good enough.
Otherwise, the test is inconclusive—you need to take more samples.
Collecting more samples (increasing $n$) tightens the interval; demanding higher confidence (decreasing $\alpha$) loosens the interval.</p>
<p>There are fancier ways, too: you could use <a href="https://en.wikipedia.org/wiki/Sequential_probability_ratio_test">Wald’s sequential sampling</a> to automatically choose $n$ and rule out the possibility of an inconclusive result.
But the simple Clopper–Pearson way is perfectly good, and it’s easy to implement: here it is in <a href="https://gist.github.com/sampsyo/c073c089bde311a6777313a4a7ac933e">four lines of Python</a>.</p>
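<p>In the same spirit, here’s a self-contained sketch using only the Python standard library. The bisection-based inversion is just one way to compute the interval, and the <code>is_good</code> and <code>sample_input</code> arguments are hypothetical stand-ins for whatever your application defines:</p>

```python
import math
import random

def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def clopper_pearson(g, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion g / n."""
    def invert(k, target):
        # Bisect for the p where binom_cdf(k, n, p) == target;
        # the CDF is decreasing in p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    p_low = 0.0 if g == 0 else invert(g - 1, 1 - alpha / 2)
    p_high = 1.0 if g == n else invert(g, alpha / 2)
    return p_low, p_high

def statistical_test(f, is_good, sample_input, n=1000, alpha=0.05):
    """The whole recipe: sample n inputs from D, count good runs,
    and return a confidence interval on p."""
    g = sum(is_good(x, f(x)) for x in (sample_input() for _ in range(n)))
    return clopper_pearson(g, n, alpha)
```

<p>For example, observing 95 good runs out of 100 samples at $\alpha = 0.05$ gives an interval of roughly $(0.89, 0.98)$: conclusive for $T = 0.85$, inconclusive for $T = 0.95$.</p>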
<p>The statistical testing technique is so simple that it, or something at least as strong, should appear in every paper that proposes a new approximation strategy.
It doesn’t require any fancy computer science: all you need to do is run $f$ as a black box and check its output, just like in traditional testing.
Our <a href="http://dx.doi.org/10.1145/2594291.2594294">probabilistic assertions</a> checker uses some fanciness to make the approach more efficient, but these tricks aren’t necessary to perform a statistically sound test.
So if you read an <a href="/~asampson/research.html#approximate-computing">approximate computing</a> paper that doesn’t report its $\alpha$, be suspicious.</p>
<h3 id="limitations">Limitations</h3>
<p>Statistical testing is limited by its need for an input distribution, $D$.
That requirement makes statistical testing’s guarantee about as strong as traditional testing is for normal programs:
it says that your program behaves itself under specific conditions that you anticipate in development.
It doesn’t say anything about what will happen when your program meets the real world—there are no guarantees for any input distribution other than $D$.</p>
<p>More subtly, statistical testing also requires that you have a $D$ that you can generate random samples from.
This makes it tricky to use, for example, if your $f$ is an image classifier that works on photographs that users upload to a Web service—it’s hard to randomly generate photos from scratch!
You could sample from a pool of test photos, but that will only let you draw conclusions about those test photos—not the distribution of photos that users might upload.</p>
<p>Statistical testing is useful when you can anticipate the input distribution ahead of time.
Is it possible to make statements that don’t depend on a known, sample-able distribution?</p>
<h2 id="going-on-line-statistical-checking">Going On-Line: Statistical Checking</h2>
<p>A stronger guarantee could help us cope with unanticipated distributions—even <em>adversarial</em> distributions.
For example, a user might find
a single $x_\text{bad}$ input that your program doesn’t handle well and then issue a probability distribution $D_\text{bad}$ that hammers on that one $x_\text{bad}$ over and over.
Statistical testing will never help with adversarial input distributions, but some form of on-line enforcement might.</p>
<p>Let’s explore a simple on-line variant of statistical testing, which I’ll call <strong>statistical checking</strong>, and consider how its guarantees stack up against adversarial input distributions.
The idea is that you have an oracle that can decide whether a given execution $f(x)$ is good or bad, but it’s too expensive to run on <em>every</em> execution.
For example, you can always check the <a href="https://en.wikipedia.org/wiki/Fast_inverse_square_root">fast inverse square root</a> output by comparing with an exact $x^{-1/2}$ computation, but that would obviate all the efficiency benefits of using the approximation in the first place.
Statistical checking reduces the overhead by running the oracle after a random sample of executions.</p>
<p>Say you run $f$ on a server for a full day and, at the end of the day, you want to know how many of the requests were good.
Let $p$ be the probability that an execution on that day is good: in expectation, $p$ is also the fraction of good executions.
Again, we hope $p$ will be high.
Here’s the statistical checking recipe:</p>
<ol>
<li>Choose a probability $p_\text{check}$ that you’ll use to decide whether to check each execution.</li>
<li>After running $f(x)$ each time, flip a biased coin that comes up heads with probability $p_\text{check}$. If it’s heads, pay the expense to check whether $f(x)$ is good; otherwise, do nothing.</li>
<li>At the end of the day, tally up the number of times you checked, $c$, and the number of times the check came out good, $g$. Now, $\hat{p} = \frac{g}{c}$ is your estimate for $p$.</li>
<li>Use the same statistical magic as last time to get a confidence interval on $p$.</li>
</ol>
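<p>The recipe above can be sketched as a thin wrapper. The Python port of the fast inverse square root and the 1%-relative-error oracle here are illustrative choices of mine, not something prescribed by the technique:</p>

```python
import random
import struct

def fast_inv_sqrt(x):
    """A Python port of the classic fast inverse square root
    (one Newton iteration), used here as the approximate f."""
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    i = 0x5f3759df - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    return y * (1.5 - 0.5 * x * y * y)

class StatisticalChecker:
    """Run the expensive oracle on a random sample of executions."""

    def __init__(self, f, oracle, p_check=0.1):
        self.f = f            # the approximate function
        self.oracle = oracle  # oracle(x, y): is output y good for input x?
        self.p_check = p_check
        self.checked = 0      # c: executions we audited
        self.good = 0         # g: audited executions that passed

    def __call__(self, x):
        y = self.f(x)
        if random.random() < self.p_check:  # flip the biased coin
            self.checked += 1
            self.good += self.oracle(x, y)
        return y

    def estimate(self):
        """The estimate g / c; g and c feed the same kind of
        binomial confidence interval described earlier."""
        return self.good / self.checked

# Check that the fast inverse square root stays within 1% of x ** -0.5.
checker = StatisticalChecker(
    fast_inv_sqrt,
    lambda x, y: abs(y - x ** -0.5) / (x ** -0.5) < 0.01,
)
for _ in range(10000):
    checker(random.uniform(0.1, 10.0))
```

<p>Only about $p_\text{check}$ of the executions ever pay for the exact $x^{-1/2}$; the rest run at full speed.</p>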
<p>The same binomial confidence interval techniques that we used for statistical testing, like Clopper–Pearson, work here too.
And if you want to do the statistics multiple times, like at the end of <em>every</em> day or even after each execution you randomly check, you can again use <a href="https://en.wikipedia.org/wiki/Sequential_probability_ratio_test">Wald’s sequential sampling</a> to avoid the <a href="https://en.wikipedia.org/wiki/Multiple_comparisons_problem">multiple comparisons problem</a>.</p>
<p>The guarantees are similar: you again get an $\alpha$-confidence interval on $p$ that lets you decide whether you have enough evidence to conclude that the day’s executions were good enough or not.
The $p_\text{check}$ knob lets you pay more overhead for a better shot at a conclusive outcome in either direction.</p>
<p>Like random screening in the customs line, randomly choosing the executions to check is the key to defeating adversarial distributions.
This way, your program’s adversary can <em>provably</em> have no idea which executions will be checked—it has nowhere to hide.
Any non-random strategy, such as <a href="https://en.wikipedia.org/wiki/Exponential_backoff">exponential backoff</a>, admits some adversary that behaves well only on checked executions.
(This <a href="/~asampson/blog/naivemonitoring.html">old post with pictures</a> gets at the same idea.)</p>
<h2 id="even-stronger-statements">Even Stronger Statements</h2>
<p>Statistical testing and statistical checking, as simple as they are, yield surprisingly good guarantees.
Is it possible to do even better?</p>
<p>In particular, neither sampling-based technique can say anything about worst-case errors.
We can know with high confidence that 99% of executions are good enough, for example, but we can’t know <em>how</em> bad that remaining 1% might be.
We could check looser bounds, but sampling will never get us to 100% certainty about anything: there’s always a chance we got unlucky and failed to see a particularly bad $x_\text{bad}$.</p>
<p>A worst-case guarantee is deceptively difficult to certify.
I can only see two ways that might work:</p>
<ul>
<li>Conservatively identify <em>all</em> (not just most) of the bad $x$s for $f$ and detect them at run time.</li>
<li>Derive a cheap-enough oracle that can dynamically check <em>every</em> execution for correctness.</li>
</ul>
<p>Both options are hard!
And they amount to recovering complete correctness—anything less than perfection risks missing a single outlier $x_\text{bad}$.
Getting a guarantee that’s stronger than simple statistical checking will take real creativity.</p>
<h2 id="heuristics-cant-beat-statistical-testing">Heuristics Can’t Beat Statistical Testing</h2>
<p>One approach that <em>can’t</em> beat the simple techniques is an on-line heuristic.
Here’s the usual line of reasoning:</p>
<blockquote>
<p>Statistical testing is weak because it only knows about inputs we anticipated <em>in vitro</em>.
And statistical checking is weak because it only looks at some of the inputs at run time.
To do better, let’s check <em>every</em> execution!
Just before running $f$ on $x$, or just after getting the output $f(x)$, apply some heuristic to predict whether the execution is good or bad.
The heuristic will statistically avoid bad behavior, so we’ll get a stronger guarantee.</p>
</blockquote>
<p>Let’s call this general approach <strong>heuristic checking</strong>.
It’s “easy” because there’s no program analysis necessary: we still get to treat $f$ as a black box.
And the idea to check every run sounds like it might offer a stronger kind of guarantee.</p>
<p>It can’t.
Heuristics are orthogonal to statistical guarantees—you need some other technique, like statistical testing or checking, to make any rigorous statements about them.</p>
<p>The problem is that every heuristic has false positives.
Regardless of whether you choose a decision tree, a support vector machine, a neural network, or a fuzzy lookup table, your favorite heuristic necessarily has blind spots.
For example, you might try to train an SVM on lots of inputs to predict when a given $x$ will cause lots of error in your fast inverse square root approximation, $f$.
If the SVM predicts for a given $x$ that $f(x)$ will be bad, then run the slower fallback $x^{-1/2}$ code instead.</p>
<figure style="max-width: 200px;">
<img src="/~asampson/media/heuristiccheck.svg" alt="heuristic checks on inputs and outputs" />
<figcaption>Adding checks to an approximate program $f$ yields a new approximate program $f'$.</figcaption>
</figure>
<p>Like any trained model, the SVM will make a wrong prediction in some minority of the cases—in exactly the same way that the approximation itself is inaccurate some of the time.
That means that we can think of the entire SVM-augmented system as just another probably-correct program with all the same problems as the original $f$.
Let $f'$ be the function that runs the SVM predictor and then chooses to run $f$ or the accurate $x^{-1/2}$.
This new $f'$ you’ve created also has some $x_\text{bad}$ inputs and also needs some validation of its correctness, just as much as the original $f$.
You’ll still need to apply statistical testing, statistical checking, or something of their ilk to understand the correctness of $f'$.</p>
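<p>The construction of $f'$ is worth writing down, because it makes the circularity plain. Here is a generic sketch, with <code>predict_bad</code> standing in for any trained heuristic:</p>

```python
def make_f_prime(f, predict_bad, exact):
    """Wrap approximate f with a heuristic check and an exact fallback.
    predict_bad(x, y) stands in for any trained predictor (SVM, DNN, ...)."""
    def f_prime(x):
        y = f(x)
        if predict_bad(x, y):  # heuristic thinks this run went wrong...
            return exact(x)    # ...so pay for the accurate computation
        return y
    return f_prime

# f_prime is just another probably-correct program: wherever the
# predictor has a false negative, f_prime inherits f's bad output.
```

<p>Nothing about the wrapper changes the shape of the problem: $f'$ still maps inputs to possibly-bad outputs, so it still needs its own statistical validation.</p>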
<p>In that sense, heuristic checking can never offer any statistical guarantees by itself—it’s <em>orthogonal</em> to the technique you use to assess statistical correctness.
Even the best heuristic can only adjust the correctness probability; it can’t change the <em>kind</em> of guarantee that’s possible.</p>
<p>That’s not to say that heuristic checking is useless.
It can definitely be a useful way to empirically improve your program’s correctness probability; hence the publications in <a href="https://homes.cs.washington.edu/~luisceze/publications/approxdebug-asplos15.pdf">ASPLOS 2015</a> (where I’m an author), <a href="http://cccp.eecs.umich.edu/papers/dskhudia-isca15.pdf">ISCA 2015</a>, <a href="http://dl.acm.org/citation.cfm?id=2872402">ASPLOS 2016</a>, <a href="http://dl.acm.org/citation.cfm?id=2908087">PLDI 2016</a>, and <a href="http://www.cc.gatech.edu/~ayazdanb/publication/papers/mithra-isca16.pdf">ISCA 2016</a>.
But we need to be clear about exactly what this kind of work can do: it can adjust the correctness probability $p$, but it can’t change the <em>kind</em> of guarantee you state about $p$.</p>
<p>Work along these lines needs to be careful to use the right baseline.
Enhancing an $f$ with heuristic checking is morally equivalent to using a more accurate $f$ in the first place.
You could, for example, change your fast inverse square root function to enable the second Newton iteration.
This would increase accuracy and increase cost—exactly the same effects as adding heuristic checking.
So if you design a new checking heuristic, remember to compare against other strategies for improving accuracy.</p>
<p>In my own <a href="https://homes.cs.washington.edu/~luisceze/publications/approxdebug-asplos15.pdf">ASPLOS 2015 paper</a>, for example, we used a fuzzy memoization table to detect approximate outputs that deviated too much from previously-observed behavior.
Our evaluation showed that the extra checking costs energy, but it also increases accuracy on average.
There were other, more obvious ways to change the energy–accuracy trade-off: we could have adjusted the hardware voltage parameters, for example, and ended up with the same strength of guarantee.
A good evaluation should treat the obvious strategy as a baseline: compare the total energy savings when the average accuracy is equal, or vice versa.
<p>Statistical correctness is a critical but underappreciated problem. Fortunately, basic statistics are enough to make pretty good statements about statistical correctness. But we’re far from done: there are juicy, unsolved, computer-sciencey problems remaining in meaningfully outperforming these basic tools.</p>
<hr />
<p><em>Thanks to Cyrus Rashtchian, Todd Mytkowicz, and Kathryn McKinley for unbelievably valuable feedback on earlier drafts of this post. Needless to say, they don’t necessarily agree with everything here.</em></p>
<p>Say you have a program that’s right only some of the time. How can you tell whether it’s correct enough? Using some Wikipedia-level statistics, it’s pretty easy to make probabilistic statements about quality. I’ll explain two strategies for measuring statistical correctness. Then I’ll argue that it’s deceptively difficult to produce guarantees that are any stronger than the ones you get from the basic techniques.</p>
http://www.cs.cornell.edu/~asampson/blog/opengl.htmlWeep for Graphics Programming2016-05-02T00:00:00+00:00<p>The mainstream real-time graphics APIs, OpenGL and Direct3D, are probably the most widespread way that programmers interact with heterogeneous hardware.
But their brand of CPU–GPU integration is unconscionable.
CPU-side code needs to coordinate closely with GPU-side shader programs for good performance, but the APIs we have today treat the two execution units as isolated universes.
This mindset leads to stringly typed interfaces, a huge volume of boilerplate, and impoverished GPU-specific programming languages.</p>
<p>This post tours a few gritty realities in a trivial OpenGL application.
You can follow along with <a href="http://adriansampson.net/doc/tinygl/">a literate listing</a> of the <a href="https://github.com/sampsyo/tinygl/blob/master/tinygl.c">full source code</a>.</p>
<h2 id="shaders-are-strings">Shaders are Strings</h2>
<p>To define an object’s appearance in a 3D scene, real-time graphics applications use <em><a href="https://en.wikipedia.org/wiki/Shader">shaders</a>:</em> small programs that run on the GPU as part of the rendering pipeline.
There are several <a href="https://en.wikipedia.org/wiki/Shader#Types">kinds</a> of shaders, but the two most common are the <a href="https://www.opengl.org/wiki/Vertex_Shader">vertex shader</a>, which determines the position of each vertex in an object’s mesh, and the <a href="https://www.opengl.org/wiki/Fragment_Shader">fragment shader</a>, which produces the color of each pixel on the object’s surface.
You write shaders in special C-like programming languages: OpenGL uses <a href="https://www.opengl.org/documentation/glsl/">GLSL</a>.</p>
<p>This is where things go wrong. To set up a shader, the host program sends a <em>string containing shader source code</em> to the graphics card driver.
The driver JITs the source to the GPU’s internal architecture and loads it onto the hardware.</p>
<p>Here’s a simplified pair of GLSL <a href="http://sampsyo.github.io/tinygl/#section-7">vertex and fragment shaders in C string constants</a>:</p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">vertex_shader</span> <span class="o">=</span>
<span class="s">"in vec4 position;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"out vec4 myPos;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"void main() {</span><span class="se">\n</span><span class="s">"</span>
<span class="s">" myPos = position;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">" gl_Position = position;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"}</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">fragment_shader</span> <span class="o">=</span>
<span class="s">"uniform float phase;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"in vec4 myPos;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"void main() {</span><span class="se">\n</span><span class="s">"</span>
<span class="s">" gl_FragColor = ...;</span><span class="se">\n</span><span class="s">"</span>
<span class="s">"}</span><span class="se">\n</span><span class="s">"</span><span class="p">;</span>
</code></pre>
</div>
<p>(It’s also common to load shader code from text files at startup time.)
Those <a href="https://www.opengl.org/wiki/Type_Qualifier_(GLSL)"><code class="highlighter-rouge">in</code>, <code class="highlighter-rouge">out</code>, and <code class="highlighter-rouge">uniform</code> qualifiers</a> denote communication channels between the CPU and GPU and between the different stages of the GPU’s rendering pipeline.
That <code class="highlighter-rouge">myPos</code> variable serves to shuffle data from the vertex shader into the fragment shader.
The vertex shader’s <code class="highlighter-rouge">main</code> function assigns to the magic <code class="highlighter-rouge">gl_Position</code> variable for its output, and the fragment shader assigns to <code class="highlighter-rouge">gl_FragColor</code>.</p>
<p>Here’s roughly how you <a href="http://sampsyo.github.io/tinygl/#section-18">compile and load the shader program</a>:</p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="c1">// Compile the vertex shader.
</span><span class="n">GLuint</span> <span class="n">vshader</span> <span class="o">=</span> <span class="n">glCreateShader</span><span class="p">(</span><span class="n">GL_VERTEX_SHADER</span><span class="p">);</span>
<span class="n">glShaderSource</span><span class="p">(</span><span class="n">vshader</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&</span><span class="n">vertex_shader</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">glCompileShader</span><span class="p">(</span><span class="n">vshader</span><span class="p">);</span>
<span class="c1">// Compile the fragment shader.
</span><span class="n">GLuint</span> <span class="n">fshader</span> <span class="o">=</span> <span class="n">glCreateShader</span><span class="p">(</span><span class="n">GL_FRAGMENT_SHADER</span><span class="p">);</span>
<span class="n">glShaderSource</span><span class="p">(</span><span class="n">fshader</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">&</span><span class="n">fragment_shader</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">glCompileShader</span><span class="p">(</span><span class="n">fshader</span><span class="p">);</span>
<span class="c1">// Create a program that stitches the two shader stages together.
</span><span class="n">GLuint</span> <span class="n">shader_program</span> <span class="o">=</span> <span class="n">glCreateProgram</span><span class="p">();</span>
<span class="n">glAttachShader</span><span class="p">(</span><span class="n">shader_program</span><span class="p">,</span> <span class="n">vshader</span><span class="p">);</span>
<span class="n">glAttachShader</span><span class="p">(</span><span class="n">shader_program</span><span class="p">,</span> <span class="n">fshader</span><span class="p">);</span>
<span class="n">glLinkProgram</span><span class="p">(</span><span class="n">shader_program</span><span class="p">);</span>
</code></pre>
</div>
<p>With that boilerplate, we’re ready to invoke <code class="highlighter-rouge">shader_program</code> to draw objects.</p>
<p>The shaders-in-strings interface is the original sin of graphics programming.
It means that some parts of the complete program’s semantics are unknowable until run time—for no reason except that they target a different hardware unit.
It’s like <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval"><code class="highlighter-rouge">eval</code> in JavaScript</a>, but worse: every OpenGL program is <em>required</em> to cram some of its code into strings.</p>
<p>Direct3D and the next generation of graphics APIs—<a href="http://www.amd.com/en-us/innovations/software-technologies/technologies-gaming/mantle">Mantle</a>, <a href="https://developer.apple.com/metal/">Metal</a>, and <a href="https://www.khronos.org/vulkan/">Vulkan</a>—clean up some of the mess by using a bytecode to ship shaders instead of raw source code.
But pre-compiling shader programs to an IR doesn’t solve the fundamental problem:
the <em>interface</em> between the CPU and GPU code is needlessly dynamic, so you can’t reason statically about the whole, heterogeneous program.</p>
<h2 id="stringly-typed-binding-boilerplate">Stringly Typed Binding Boilerplate</h2>
<p>If string-wrapped shader code is OpenGL’s principal investment in pain,
then it collects its pain dividends via the CPU–GPU communication interface.</p>
<p>Check out those variables <code class="highlighter-rouge">position</code> and <code class="highlighter-rouge">phase</code> in the vertex and fragment shaders, respectively.
The <code class="highlighter-rouge">in</code> and <code class="highlighter-rouge">uniform</code> qualifiers mean they’re parameters that come from the CPU.
To use those parameters, the host program’s first step is to <a href="http://sampsyo.github.io/tinygl/#section-28">look up <em>location</em> handles</a> for each variable:</p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="n">GLuint</span> <span class="n">loc_phase</span> <span class="o">=</span> <span class="n">glGetUniformLocation</span><span class="p">(</span><span class="n">program</span><span class="p">,</span> <span class="s">"phase"</span><span class="p">);</span>
<span class="n">GLuint</span> <span class="n">loc_position</span> <span class="o">=</span> <span class="n">glGetAttribLocation</span><span class="p">(</span><span class="n">program</span><span class="p">,</span> <span class="s">"position"</span><span class="p">);</span>
</code></pre>
</div>
<p>Yes, you look up the variable by passing its name as a string.
The <code class="highlighter-rouge">phase</code> parameter is just a <code class="highlighter-rouge">float</code> scalar, but <code class="highlighter-rouge">position</code> is a dynamically sized array of position vectors, so it requires even more boilerplate to <a href="http://sampsyo.github.io/tinygl/#section-34">set up a backing buffer</a>.</p>
<p>Next, we use these handles to <a href="http://sampsyo.github.io/tinygl/#section-42">pass new data to the shaders</a> to draw each frame:</p>
<div class="language-c highlighter-rouge"><pre class="highlight"><code><span class="c1">// The render loop.
</span><span class="k">while</span> <span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Bind our compiled shader program.
</span> <span class="n">glUseProgram</span><span class="p">(</span><span class="n">shader_program</span><span class="p">);</span>
<span class="c1">// Set the scalar `phase` variable.
</span> <span class="n">glUniform1f</span><span class="p">(</span><span class="n">loc_phase</span><span class="p">,</span> <span class="n">sin</span><span class="p">(</span><span class="mi">4</span> <span class="o">*</span> <span class="n">t</span><span class="p">));</span>
<span class="c1">// Set the `position` array by copying data into the buffer.
</span> <span class="n">glBindBuffer</span><span class="p">(</span><span class="n">GL_ARRAY_BUFFER</span><span class="p">,</span> <span class="n">buffer</span><span class="p">);</span>
<span class="n">glBufferSubData</span><span class="p">(</span><span class="n">GL_ARRAY_BUFFER</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">points</span><span class="p">),</span> <span class="n">points</span><span class="p">);</span>
<span class="c1">// Use these parameters and our shader program to draw something.
</span> <span class="n">glDrawArrays</span><span class="p">(</span><span class="n">GL_TRIANGLE_FAN</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">NVERTICES</span><span class="p">);</span>
<span class="p">}</span>
</code></pre>
</div>
<p>The <a href="http://sampsyo.github.io/tinygl/#section-42">verbosity</a> is distracting, but those <a href="https://www.khronos.org/opengles/sdk/docs/man/xhtml/glUniform.xml"><code class="highlighter-rouge">glUniform1f</code></a> and <a href="https://www.opengl.org/sdk/docs/man2/xhtml/glBufferSubData.xml"><code class="highlighter-rouge">glBufferSubData</code></a> calls are morally equivalent to
writing <code class="highlighter-rouge">set("variable", value)</code> instead of <code class="highlighter-rouge">let variable = value</code>.
The C and GLSL compilers can check and optimize the CPU and GPU code separately,
but the stringly typed CPU–GPU interface prevents either compiler from doing anything useful to the complete program.</p>
<h2 id="the-age-of-heterogeneity">The Age of Heterogeneity</h2>
<p>OpenGL and its equivalents make miserable standard bearers for the age of hardware heterogeneity.
Heterogeneity is rapidly becoming ubiquitous, and we need better ways to write software that spans hardware units with different capabilities.
OpenGL’s programming model espouses the simplistic view that heterogeneous software should comprise multiple, loosely coupled, independent programs.</p>
<p>If pervasive heterogeneity is going to succeed, we need to bury this 20th-century notion. We need programming models that let us write <em>one</em> program that spans multiple execution contexts.
This won’t erase heterogeneity’s essential complexity, but it will let us stop treating non-CPU code as a second-class citizen.</p>
<p>The mainstream real-time graphics APIs, OpenGL and Direct3D, make miserable standard bearers for the age of hardware heterogeneity.
Their approach to heterogeneous programming leads to stringly typed interfaces, a huge volume of boilerplate, and impoverished GPU-specific programming languages.</p>
http://www.cs.cornell.edu/~asampson/blog/wax2016.htmlNotes from WAX 20162016-04-17T00:00:00+00:00<p>We held <a href="http://approximate.computer/wax2016/">WAX</a>, the workshop on approximate computing, at <a href="https://www.ece.cmu.edu/calcm/asplos2016/">ASPLOS</a> last week. I love organizing WAX—it’s a great excuse for the approximation community to talk about the broader themes that extend beyond any single person’s research project <em>du jour</em>.</p>
<p>Here are some notes on those themes.
You can also check out <a href="http://approximate.computer/wax2016/program/">the archived program</a> for links to papers and slides.</p>
<h3 id="disgruntled-introductions">Disgruntled Introductions</h3>
<p>To introduce ourselves, we all said something we like about approximate computing and something we don’t like.
Predictably, this invited a healthy dose of griping.
Two gripey themes emerged:</p>
<ul>
<li>There’s some sadness that approximation hasn’t “hit it big” yet, commercially speaking. We’re a half decade or so into the approximate-computing craze, so it feels to many like we should see shipping hardware soon.</li>
<li>Our terminology is confusing. What does <em>quality</em> mean versus <em>accuracy</em> versus <em>quality of service?</em> Some attendees even complained that <em>approximate computing</em> itself is more off-putting than alternatives like <em>inexact</em> or <em>good-enough</em> computing.</li>
</ul>
<p><img class="img-responsive" src="http://www.cs.cornell.edu/~asampson/media/wax2016.jpg" alt="looks like fun!" /></p>
<h3 id="cross-stack-keynotes">Cross-Stack Keynotes</h3>
<p>We had two awesome keynote speakers, both of whom brought broad, interdisciplinary views on approximation.</p>
<p><a href="http://research.microsoft.com/en-us/people/matthaip/">Matthai Philipose</a> from MSR has a goal of <em>continuous mobile vision:</em> always-on CV on a wearable device with all-day battery life and reasonable cloud costs.
His data suggests that approximation is critical—not just a luxury—for this setting: current vision techniques can’t fit in the necessary energy and dollar budgets. It’s not even close.
So he’s on a campaign to introduce approximation everywhere, from the camera sensor hardware to the DNN models and algorithms.</p>
<p><a href="http://ee.princeton.edu/people/faculty/naveen-verma">Naveen Verma</a> from Princeton is a hardware researcher but, unlike some architects, believes approximate computing should come from the top down, from algorithms.
He showed off <em>data-driven hardware resilience</em>, where you train a machine learning model to counteract the effects of deterministic hardware approximation.
Under the right conditions, this cross-stack approach can lead to extremely good tolerance—much more than algorithm-agnostic approximation.</p>
<h3 id="best-practices">Best Practices</h3>
<p>The discussion at the end of the day coalesced around standards of rigor in approximate-computing research.
There was a broad consensus that evaluation methodologies have not improved enough since those heady days of the first few approximation papers.
We hatched the idea of putting together a best practices document for approximation research, covering:</p>
<ul>
<li>Standard benchmarks with standard quality metrics and standard thresholds. When people are free to define their own quality metrics, there’s no way to compare two papers and no way to trust that an approximation is actually useful. The <em>de facto</em> 10% quality loss standard is my least favorite legacy of <a href="/~asampson/media/papers/enerj-pldi2011.pdf">the EnerJ paper</a>.</li>
<li>A map of the available approximation techniques. If you want to apply approximation to a bottleneck in your favorite application, where should you start? This kind of guide is common for traditional performance optimization, so we should have one too.</li>
<li>Agreement on what <em>kinds</em> of quality guarantees are worth striving for. Where do statistical guarantees make sense, and where do we need more traditional deterministic bounds?</li>
</ul>
<p>I’m excited about this idea for a community-sanctioned set of standards. But it’s going to be difficult: work like this doesn’t fit with normal incentives for academics.</p>
<h3 id="a-better-workshop-next-year">A Better Workshop Next Year</h3>
<p>I have plenty to learn about organizing workshops.
Here are some things we need to fix:</p>
<ul>
<li>One-minute lightning talks are a staple at WAX, but they need work. They’re supposed to be an effortless and fun way to provoke discussion, but people recently have put too much work into them—and a high standard means a low turnout. Even so, I heard reactions that one minute isn’t enough to communicate a whole idea. We need new ideas to keep the lightning round’s lighthearted spirit while making it more useful.</li>
<li>Several people told me that the free-form, small-group lunch discussions were their favorite part of WAX. This was doubly true for new people and outsiders, who got to ask questions that wouldn’t work in front of a whole-workshop audience. We need to create more of this kind of discussion, ideally by replacing the usual anemic post-talk Q&A. Maybe we can get seating at small round tables for the next WAX, or we could steal <a href="http://composition.al/">Lindsey Kuper</a>’s <a href="http://composition.al/blog/2016/01/25/off-the-beaten-track-2016-program-chairs-report/">card-based Q&A idea</a> from POPL OBT.</li>
<li>We need to remind speakers to skip their motivation slides for this venue.</li>
</ul>
<p>Next year, my co-chairs and I want to get the WAX franchise more organized. Haphazardly cobbling things together one year at a time has been fun, but the workshop is getting bigger and more serious. We should assign real roles, like <em>program chair</em> and <em>publicity chair</em>, which means we’ll need more help. Please <a href="mailto:asampson@cs.cornell.edu">get in touch</a> if you’d like to get involved—and thanks to everyone who already volunteered!</p>
<p>See you at WAX 2017.</p>
<p><a href="http://approximate.computer/wax2016/">WAX</a> is the workshop on approximate computing. This year at <a href="https://www.ece.cmu.edu/calcm/asplos2016/">ASPLOS</a>, I organized its third or fourth iteration, depending on how you count, along with <a href="http://homes.cs.washington.edu/~luisceze/">Luis Ceze</a>, <a href="http://www.cc.gatech.edu/~hadi/">Hadi Esmaeilzadeh</a>, and <a href="http://research.microsoft.com/en-us/people/zorn/">Ben Zorn</a>. Here’s some stuff that happened at the workshop.</p>