Datacenter-In-A-Box at LOw Cost (DIABLO)

Datacenter-in-a-Box at Low cost (DIABLO) is a novel, cost-efficient evaluation methodology that uses Field-Programmable Gate Arrays (FPGAs) and treats datacenters as whole computers with tightly integrated hardware and software. Instead of prototyping everything in FPGAs, we build realistic, reconfigurable, abstracted performance models at scales of O(10,000) servers, focusing on detailed datacenter switch models. Our server model supports the full SPARC v8 instruction set architecture and runs the full Linux operating system and an open-source datacenter software stack, achieving two orders of magnitude simulation speedup over software-based simulators. This speedup enables us to run the full datacenter software stack for O(100) seconds of simulated time.

DIABLO is built on top of the FPGA Architecture Model Execution (FAME) technology. We have built several working DIABLO prototypes using 65nm Xilinx Virtex 5 FPGAs. The cost per simulated node is only $12. We have used DIABLO to successfully reproduce several datacenter phenomena at scales of 2,000 nodes, showing that it is an excellent research tool for HW/SW design space exploration in datacenters. To the best of our knowledge, DIABLO is the world's largest distributed execution-driven datacenter simulator.

The tremendous success of Internet services has led to the rapid growth of Warehouse-Scale Computers (WSCs). The networking infrastructure has become one of the most vital components in a datacenter, being crucial to improving server utilization and supporting massive map-reduce jobs, and is now a very active area of research. With a rapidly evolving set of workloads and software, evaluating network designs requires simulating a computer system with three key features:

  • Scale: the test environment requires enough scale to study system phenomena at aggregate-level switches
  • Performance: the evaluation platform needs to have decent performance to cope with massive fine-grained parallelism in the target architecture
  • Accuracy: a high-performance network requires nanosecond-scale accuracy
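
The accuracy requirement can be made concrete with a rough calculation. The sketch below is not part of DIABLO; it simply computes how long an Ethernet frame occupies a link at a given speed. The link speeds and frame sizes are illustrative assumptions, and the calculation ignores preamble and inter-frame gap. It suggests that a minimum-size frame lasts only tens of nanoseconds at 10 Gbps, so a simulator with coarser timing could not resolve individual packets at a switch port.

```python
# Illustrative arithmetic only: the link speeds and frame sizes below are
# assumptions chosen for the example, not DIABLO parameters.

def serialization_time_ns(frame_bytes: int, link_gbps: float) -> float:
    """Return the time, in nanoseconds, to put one frame on the wire.

    bits / (Gbit/s) yields nanoseconds directly, because 1 Gbit/s
    is exactly 1 bit per nanosecond.
    """
    bits = frame_bytes * 8
    return bits / link_gbps


if __name__ == "__main__":
    for gbps in (1, 10, 100):          # assumed link speeds
        for size in (64, 1500):        # minimum and typical maximum frame sizes
            t = serialization_time_ns(size, gbps)
            print(f"{gbps:>3} Gbps, {size:>4} B frame: {t:.2f} ns")
```
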
To avoid the high capital cost of hardware prototyping, many designs have only been evaluated with a very small testbed built with off-the-shelf devices, often running unrealistic microbenchmarks or traces collected from an old cluster. Many evaluations assume the workload is static and that computations are only loosely coupled with the very adaptive networking stack. We argue the research community is facing a hardware-software co-evaluation crisis.

Academia has tried to address the problem by building small clusters of around 100 nodes, using wimpy ARM or x86 cores with gigabit Ethernet. None has been able to reach the scale of thousands of nodes. In addition, nothing in the hardware is really adjustable, so it is not possible to perform HW/SW co-tuning on such platforms.

We are planning the second generation of DIABLO for the ASPIRE lab. New DIABLO FPGA hardware targeting 14nm FPGAs is also planned. DIABLO 2 will be used to evaluate novel datacenter rack-level optimizations and 100 Gbps interconnect designs.