Efficient Communication and Collection with Compact Normal Forms

Edward Z. Yang, Giovanni Campagna, Ömer Ağacan, Ahmed El-Hassany, Abhishek Kulkarni, Ryan Newton.
Paper (PDF)SlidesHaddock

Abstract In distributed applications, the transmission of non-contiguous data structures is greatly slowed down by the need to serialize them into a buffer before sending. We describe Compact Normal Forms, an API that allows programmers to explicitly place immutable heap objects into regions, which can both be accessed like ordinary data as well as efficiently transmitted over the network. The process of placing objects into compact regions (essentially a copy) is faster than any serializer and can be amortized over a series of functional updates to the data structure in question. We implement this scheme in the Glasgow Haskell Compiler and show that even with the space expansion attendant with memory-oriented data structure representations, we achieve between ×2 and ×4 speedups on fast local networks with sufficiently large data structures.

Usage

Thanks to Simon Marlow, compact normal forms will be shipping with GHC 8.2; see the Haddock for the actual API we provide. The API documentation differs slightly from the paper in a few ways:

  1. Most functions are prefixed by compact,
  2. There are variants of functions which preserve sharing and which don't prefer sharing (most of the benchmarks in the paper were done with the no-sharing variant, which is a bit faster),
  3. We use the NFData type class rather than a new Serializable type class, and error at runtime if you attempt to serialize a data structure which is not serializable, and
  4. You can specify a default block size for compact regions. The larger the block size, the more continuity you have, but also the more space that is wasted.

Errata

At ICFP, we were informed of two relevant pieces of work which had escaped our notice prior to the camera-ready. We'd like to acknowledge this work:

  • Jost Berthold. Orthogonal Serialisation for Haskell. In Proceedings of the 22nd Symposium on Implementation and Application of Functional Languages. 2011. This paper describes a method for serializing by directly using the information tables which are available by the runtime system for garbage collection or other mechanisms. This method does require serialization, but consequently does not require virtual address tricks. Their API supports serializing more types of objects than we deal with in our paper; we rely on data being in normal form for our safety guarantees.
  • Andrew Appel. A Runtime System. 1990. In Section 17, "Structured writes and reads", Appel describes a nice mechanism for piggy-backing off of garbage collection in order to perform a copy to a "region" (in their case, a file). However, in order to be able to write forwarding pointers, they must stop-the-world; our system can initialize multiple compact regions concurrently without inducing a GC.