Adrian Sampson: Viva Las Vega

Tools for drawing plots are ubiquitous. But good, usable tools that create beautiful, publishable visualizations for academic papers are harder to find. In papers I’ve published, I have experimented with gnuplot, Pychart, R with ggplot2 and tikzDevice, Matplotlib, plain old Numbers, and still more tools that are lost to the sands of academic time. As with word processing or build automation, computer science academics tend to get all precious and finnicky about their plotting tools. I’m no exception, so I’d like to tell you about a new hope in this genre: Vega.

Vega is a new tool from Trifacta, a startup that includes UW CSE’s own Jeff Heer, that builds on the wildly popular D3 library to accomplish something new: a declarative language for defining beautiful visualizations. Many tools, including D3 itself and gnuplot, are plenty powerful but hindered by the infinite flexibility afforded by a full programming language. In gnuplot’s case, for example, you need to learn a completely new, hacked-together DSL just to produce the world’s dumbest 2D scatter plot. Vega’s approach eschews the traditional programming language. Instead, it provides a grammar that lets you declare elements in your plot without shooting yourself in the foot with Turing completeness. A plot is just a JSON document: there’s no obscure looping, conditional, or string-munging syntax to learn. In its place, Vega provides sensible defaults and powerful canned functions that obviate the need for explicit scene construction. This means that Vega plots are beautiful by default (unlike, say, gnuplot’s ugly ducklings). Check out some of the examples to see what I mean.

In short, Vega trades away flexibility for great returns in productivity. The result is that I can produce more useful plots while swearing a lot less.

But as impressively designed as it is, Vega was clearly not intended for producing publication-ready figures. My earlier post on coffee intake was an experiment in using Vega for its indended purpose: Web-embedded, potentially interactive, dynamically drawn plots. I’m excited enough by the Vega concept to produce tools for using it, so I’d like to abuse it for publication purposes.

This post explores a few mismatches between Vega and publishing and suggests how to work around a few of them—and how Vega itself might address some others. I think it’s worth it: Vega pleasant to use in many ways—certainly moreso than my previous workflow (Python scripts that generate gnuplot scripts that generate PostScript files that generate PDFs; ugh). With a little additional tooling, it makes a great addition to the paper-writing process.

Black and White

Color is a great visual element for the Web, but good published figures are grayscale. Vega comes with built-in color scales that look great, but you’ll need to define your own scale to use gray values. Here’s a small one you can steal and adapt:

{
  "name": "color",
  "type": "ordinal",
  "range": [
    "#000000",
    "#999999",
    "#cccccc"
  ]
}

Producing a PDF

Vega can produce PNGs and SVGs, but you need a PDF if you want to embed a figure into a LaTeX paper. My first attempt at this was to use an svg2pdf program to do this conversion, but this led to weird text rendering bugs and inconsistent output. I recommend that you use the kind software that’s the best at rendering SVGs these days: real Web browsers.

Specifically, I use the wkpdf command-line tool, which invokes the WebKit engine to render web pages to PDF snapshots. This approach produces perfect, publishable PDFs. There are other options, but if you use wkpdf itself, remember to use the --no-paginate flag to use the right bounding box for your figure.

Here’s a Makefile snippet that automates this approach:

%.pdf: %.svg
    wkpdf --no-paginate --source $< --output $@

I’ll keep dreaming of a LaTeX-aware backend that lets TeX itself do the text rendering, à la tikzDevice or gnuplot’s epslatex backend. But this approach is more than good enough in the mean time.

The Stacked and Clustered Bar Chart

In systems fields, the “lead figure” in a paper is often the infamous clustered and stacked bar chart. This comes up when you want to compare at least two versions of a system (say, a stock compiler and one with My Great Optimization) for each of many benchmark programs but you also want to break down each of those bars into parts (perhaps the time spent on compilation vs. execution). Almost every paper I write these days has one of these things in it somewhere. This kind of plot is tricky to get right in any plotting system (just try googling for “stacked and clustered” along with your favorite plotting tool’s name), and Vega’s examples include a stacked chart and a grouped chart but not the holy grail.

The solution is elegant in Vega: nest a group for the stack layers inside another group for the clusters. Please steal my example spec, which produces a plot like this:

A word of caution: Vega’s “stack” transform expects stacked magnitudes to be cumulative, not independent values. So if you want a stack to show 30 points of one factor and 20 points of another, the second input value will need to be 30 + 20 = 50. Don’t fall for this trap like I did.

A Better Language than JSON

One of Vega’s strengths is that it uses a familiar declarative language, JSON, instead of a home-brewed imperative mish-mash. But JSON has a few warts that make it frustrating for authoring directly. It’s not really meant for humans, in my opinion—even though it’s great as a machine-written interchange language.

Here are a few complaints. The strictness of JSON’s grammar means that you end up writing ridiculous nestings of brackets. Actual excerpt from a spec I wrote:

Them’s some information-rich LOCs! This kind of syntactic noise makes the structure incredibly difficult to line up with your eyes.

Also, JSON doesn’t allow trailing commas in maps and lists. Spot the bug in this snippet, for example:

"properties": {
  "enter": {
    "x": {"scale": "program", "field": "key"},
    "width": {"scale": "program", "band": true},
    "height": {"group": "height"},
  }
}

You got it: the comma after "height"} is not allowed. This restriction makes it awkward to add and remove rows from a JSON document. It might seem trivial, but it’s impossible to express how often this comes up in practice—and how angry it makes my typing fingers.

JSON also doesn’t have comments. Vega specs may be simple, but they’re not that simple. They need comments.

I’d like to build a frontend that takes a human-friendly language like YAML and compiles it down to JSON to feed to Vega. It’s also possible that a domain-specific (but still declartive!) approach could make Vega users even more productive.

Loose Ends

There are a few other areas where Vega needs a little help. (Fortunately, it’s open source! Somebody somewhere should step up and contribute.)

Debuggability: Mostly because of JavaScript’s shortcomings, Vega’s error messages are inscrutable. Some engineering effort is needed to avoid messages like “undefined has no property x” and instead say something like “you forgot to write that scale, dummy.” Some introspection would be useful as well: I often find myself wondering what a transform is doing to my data—a tool that can show me the data’s intermediate state in a table could help immensely.
Legend on top: Legends in Vega go on the side, listed vertically. They can take up less space if the items are listed horizontally and the legends are placed above the main plot. Let’s make this possible.
Labeled proportion chart: As far as I can tell, there’s no easy option to label the slices of a pie chart (or any less-problematic technique for showing proportions of a whole).

Despite these shortcomings, Vega is astoundingly useful and efficient for such a young tool. Give it a try for your next data-driven publication.