Declarative HPC

Summary

Engineering high-performance serial code is hard. Squeezing performance from parallel code with any nontrivial communication is harder. Even on one node, it’s hard to orchestrate communication so that computing units don’t spend all their time waiting for data, whether that data is coming from another core or from a slow main memory. Communication costs are particularly hard to manage in applications with irregular data access; and the task is maddening in cloud environments, where network contention and virtual machine scheduling can cause huge, unpredictable spikes in message latency. We tackle these challenges with frameworks consisting of a language to let the programmer declare data dependencies and a runtime to manage the task of communicating that data in a timely fashion.

Papers

W. Xie, G. Wang, D. Bindel, A. Demers, and J. Gehrke, “Fast Iterative Graph Computation with Block Updates,” Proceedings of the VLDB Endowment, vol. 6, no. 14, pp. 2014–2025, 2013.

@article{2013-blockgrace,
  author = {Xie, Wenlei and Wang, Guozhang and Bindel, David and Demers, Alan and Gehrke, Johannes},
  journal = {Proceedings of the VLDB Endowment},
  number = {14},
  pages = {2014--2025},
  publisher = {VLDB Endowment},
  title = {Fast Iterative Graph Computation with Block Updates},
  volume = {6},
  year = {2013},
  doi = {10.14778/2556549.2556581},
  code = {https://github.com/wenleix/BlockGRACE}
}

Abstract:

Scaling iterative graph processing applications to large graphs is an important problem. Performance is critical, as data scientists need to execute graph programs many times with varying parameters. The need for a high-level, high-performance programming model has inspired much research on graph programming frameworks. In this paper, we show that the important class of computationally light graph applications – applications that perform little computation per vertex – has severe scalability problems across multiple cores as these applications hit an early “memory wall” that limits their speedup. We propose a novel block-oriented computation model, in which computation is iterated locally over blocks of highly connected nodes, significantly improving the amount of computation per cache miss. Following this model, we describe the design and implementation of a block-aware graph processing runtime that keeps the familiar vertex-centric programming paradigm while reaping the benefits of block-oriented execution. Our experiments show that block-oriented execution significantly improves the performance of our framework for several graph applications.

T. Zou, G. Wang, M. Vaz Salles, D. Bindel, A. Demers, J. Gehrke, and W. White, “Making Time-Stepped Applications Tick in the Cloud,” in Proceedings of the Second ACM Symposium on Cloud Computing (SOCC), 2011.

@inproceedings{2011-socc,
  author = {Zou, Tao and Wang, Guozhang and Vaz Salles, Marcos and Bindel, David and Demers, Alan and Gehrke, Johannes and White, Walker},
  title = {Making Time-Stepped Applications Tick in the Cloud},
  booktitle = {Proceedings of the Second
                 ACM Symposium on Cloud Computing (SOCC)},
  month = oct,
  year = {2011},
  doi = {10.1145/2038916.2038936}
}

Abstract:

Scientists are currently evaluating the cloud as a new platform. Many important scientific applications, however, perform poorly in the cloud. These applications proceed in highly parallel discrete time-steps or “ticks,” using logical synchronization barriers at tick boundaries. We observe that network jitter in the cloud can severely increase the time required for communication in these applications, significantly increasing overall running time.

In this paper, we propose a general parallel framework to process time-stepped applications in the cloud. Our framework exposes a high-level, data-centric programming model which represents application state as tables and dependencies between states as queries over these tables. We design a jitter-tolerant runtime that uses these data dependencies to absorb latency spikes by (1) carefully scheduling computation and (2) replicating data and computation. Our data-driven approach is transparent to the scientist and requires little additional code. Our experiments show that our methods improve performance up to a factor of three for several typical time-stepped applications.