The Cell Broadband Engine, jointly developed by IBM, Sony, and Toshiba, is an emerging hardware platform for multimedia devices. Unlike the symmetric multi-core architectures being developed for the desktop computing market, Cell implements an asymmetric architecture: it combines one general-purpose processor core derived from the PowerPC with eight identical throughput-oriented cores that implement an entirely different architecture and instruction set targeted toward multimedia applications.
Integrating different cores onto a single chip in this manner makes sense from a hardware designer's perspective. For the intended multimedia domain, a number of throughput-oriented processor cores is far more effective than general-purpose cores doing the same work, and the general-purpose cores would also take up more space on the die. From a software designer's perspective, on the other hand, the asymmetric architecture and multiple instruction sets of a heterogeneous multiprocessor make programming considerably more difficult.
The goal of our work is to provide the best of both worlds: the programming model of a symmetric multi-processor system (which is simpler for programmers) running on top of an asymmetric multi-processor system (which hardware architects can build more efficiently and cheaply). Our solution consists of a virtual machine layer that sits on top of the heterogeneous hardware and automatically distributes work to the different processor cores.
We have implemented a prototype of such a system, which we call CellVM. To the programmer, CellVM presents the interface of a standard Java Virtual Machine (JVM). Internally, CellVM interprets JVM bytecodes through co-execution across the different functional units of the Cell microprocessor. CellVM uses an approach that we call cooperative interpretation: it executes simple instructions directly on the specialized processor cores, but requests assistance from the general-purpose core to execute complex (and often rarely used) instructions.
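The division of labor in cooperative interpretation can be illustrated with a minimal sketch. This is not CellVM's implementation: the toy opcodes, the `Frame` class, and the function names are all hypothetical, and both "cores" are ordinary Python functions. It only shows the dispatch idea, in which a simple interpreter loop handles common instructions itself and hands everything else to a helper that stands in for the general-purpose core.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical interpreter state: an operand stack plus a delegation counter."""
    stack: list = field(default_factory=list)
    delegated: int = 0  # instructions that needed the general-purpose core


def general_purpose_core(op, frame):
    """Stand-in for the general-purpose core: complex, rarely used instructions."""
    if op == "new_array":
        length = frame.stack.pop()
        frame.stack.append([0] * length)
    elif op == "array_length":
        frame.stack.append(len(frame.stack.pop()))
    else:
        raise ValueError(f"unknown opcode: {op}")


def specialized_core_run(program):
    """Stand-in for a throughput-oriented core: runs simple opcodes locally."""
    frame = Frame()
    for op, arg in program:
        if op == "push":
            frame.stack.append(arg)
        elif op == "add":
            b, a = frame.stack.pop(), frame.stack.pop()
            frame.stack.append(a + b)
        elif op == "mul":
            b, a = frame.stack.pop(), frame.stack.pop()
            frame.stack.append(a * b)
        else:
            # Not a simple instruction: request assistance from the other core.
            frame.delegated += 1
            general_purpose_core(op, frame)
    return frame


# Toy program (assumed opcodes, not real JVM bytecode encodings).
program = [
    ("push", 2), ("push", 3), ("add", None),
    ("new_array", None), ("array_length", None),
    ("push", 4), ("mul", None),
]
frame = specialized_core_run(program)
print(frame.stack, frame.delegated)
```

In this toy program only the two array instructions are delegated; the arithmetic stays on the simulated specialized core.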
Of particular note is our solution to the data latency problem. In the Cell architecture, the throughput-oriented processors do not have hardware-based data caches; instead, they provide software-controlled scratchpad memories in conjunction with DMA capabilities. Key to a streamlined division of labor in our cooperative interpretation environment is the efficient use of these scratchpad memories as a distributed data cache. Our software-controlled data caching algorithm is surprisingly effective, achieving hit rates above 90% for the instruction cache and above 74% for the data cache on the benchmarks that we tested.
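To make the idea of a software-controlled cache concrete, the following is a minimal sketch of a read-only, direct-mapped cache kept in a small local buffer. The abstract does not describe CellVM's actual cache organization, so the direct-mapped layout, the line size, the `SoftwareCache` class, and the `_dma_get` helper (a plain list slice standing in for a DMA transfer) are all assumptions made for illustration.

```python
class SoftwareCache:
    """Hypothetical direct-mapped, read-only cache managed entirely in software."""

    def __init__(self, main_memory, num_lines=4, line_size=8):
        self.main_memory = main_memory   # stand-in for off-core main memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # which memory line each slot holds
        self.lines = [None] * num_lines  # stand-in for the scratchpad contents
        self.hits = 0
        self.misses = 0

    def _dma_get(self, line_addr):
        # Simulated DMA transfer: copy one whole line out of main memory.
        start = line_addr * self.line_size
        return list(self.main_memory[start:start + self.line_size])

    def read(self, addr):
        line_addr, offset = divmod(addr, self.line_size)
        slot = line_addr % self.num_lines
        if self.tags[slot] == line_addr:
            self.hits += 1
        else:
            # Miss: evict whatever occupies the slot and fetch the needed line.
            self.misses += 1
            self.tags[slot] = line_addr
            self.lines[slot] = self._dma_get(line_addr)
        return self.lines[slot][offset]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


memory = list(range(64))
cache = SoftwareCache(memory)
values = [cache.read(addr) for addr in range(32)]  # sequential scan of 4 lines
print(cache.hits, cache.misses, cache.hit_rate())
```

The sequential scan misses once per line and hits on the remaining accesses to that line, which is why locality in the access pattern determines the hit rate. A real implementation would also need to handle writes and coherence between cores, which this sketch omits.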
Andreas Gal (UC Irvine)
Kevin William (Trinity College)