ICCAD keynote

ICCAD, San Jose, CA - Peter Hofstee, Cell chief scientist and architect in the IBM Systems and Technology Group, Austin, TX, presented "The Cell Processor: Applications, Architecture, and Design in the Multi-core Era" in his keynote address.

Hofstee described the goals of the Cell processor design as improving the performance of a graphics accelerator by a factor of 1000 while overcoming limitations of power, memory, and frequency. The new chip would have to run in real time and integrate with the user environment across a wide range of platforms. His team envisioned a symmetric multi-processing (SMP) environment with integrated security features. From the start of the project, they had five years and $400 million to create the chip. The team included engineers from Sony, Toshiba, and IBM.

Power is an issue for any high-performance chip, and here the power budget is severely constrained by the economics of a consumer end product. Because standard designs and architectures show a direct relationship between performance and power, a 1000 times improvement would require a similar increase in power consumption and heat density. Such an increase is neither possible nor realistic for air-cooled platforms.

Performance is also related to operating frequency. The problem is that Gelsinger's law predicts only a 1.4 times performance improvement for a doubling of transistors. Increasing circuit complexity costs power and area, so every new generation of the original architecture is less efficient than the last. The use of multiple cores is one solution.
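As a rough illustration of this scaling argument, the sketch below compares the speedup Gelsinger's law gives a single, ever larger core with the speedup from spending the same transistor budget on identical cores. The 1.4 factor is the figure quoted in the talk; the function names and the use of Amdahl's law for the multi-core case are assumptions made for this example and were not part of the presentation.

```python
import math

GELSINGER_GAIN = 1.4  # speedup per doubling of transistors, as quoted in the talk


def single_core_speedup(transistor_ratio: float) -> float:
    """Speedup of one larger core when the transistor budget grows by transistor_ratio."""
    doublings = math.log2(transistor_ratio)
    return GELSINGER_GAIN ** doublings


def multi_core_speedup(transistor_ratio: float, parallel_fraction: float = 1.0) -> float:
    """Speedup from spending the same budget on identical cores.

    Assumption (not from the talk): one core per unit of transistor budget,
    with Amdahl's law limiting the gain when only part of the work is parallel.
    """
    cores = transistor_ratio
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    for ratio in (2, 4, 8):
        print(ratio, single_core_speedup(ratio), multi_core_speedup(ratio, 0.9))
```

With an eight-fold transistor budget, the single-core model gives about 2.7 times the performance, while eight ideal cores running fully parallel work give 8 times. The gap narrows as the parallel fraction drops.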

The disparity between memory and processor speeds is causing memory latency to approach 64 clocks, while microprocessor pipelines are only about 5 stages deep. The current alternative is speculation, which incurs a large transistor overhead for miss and refill operations. The last major architectural change that helped performance was the addition of pipelines to the instruction flow.
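A first-order model of average cycles per instruction shows why a 64-clock memory latency hurts so much. The sketch below uses the standard textbook stall model; the 64-clock latency comes from the talk, while the base CPI and the miss rate in the usage note are hypothetical values chosen only for illustration.

```python
MEMORY_LATENCY_CLOCKS = 64  # latency figure quoted in the talk


def effective_cpi(base_cpi: float, misses_per_instruction: float,
                  latency_clocks: int = MEMORY_LATENCY_CLOCKS) -> float:
    """Average cycles per instruction if every miss stalls for the full latency."""
    return base_cpi + misses_per_instruction * latency_clocks


def slowdown(base_cpi: float, misses_per_instruction: float,
             latency_clocks: int = MEMORY_LATENCY_CLOCKS) -> float:
    """How many times slower the core runs compared with a perfect memory system."""
    return effective_cpi(base_cpi, misses_per_instruction, latency_clocks) / base_cpi
```

For example, assuming a base CPI of 1.0 and two misses per hundred instructions, `slowdown(1.0, 0.02)` comes to about 2.3, so even a small miss rate more than doubles execution time in this simple model.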

To address these problems, the design teams looked at two areas for change: increasing concurrency to reduce data and clock rates, and increasing specialization to offload compute tasks into hardware, which can reduce average power consumption while increasing total throughput. These architectural choices have dramatic effects on the power, memory latency, and efficiency of the SoC.

As basic starting points, a PowerPC can achieve about 20 Gflops (billion floating-point operations per second), while the Cell chip can perform about 200 Gflops and an Nvidia graphics chip can do over 2 Tflops. The two orders of magnitude difference in performance shows the possibilities and also illustrates the trade-off between programmability and performance. A general-purpose processor can do more kinds of tasks at the cost of reduced performance.
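The arithmetic behind the "two orders of magnitude" remark can be checked with a few lines. This is a minimal sketch: the throughput figures are the ones quoted in the talk (with "over 2 Tflops" taken as exactly 2 Tflops), and the dictionary and function names are made up for this example.

```python
import math

# Peak throughput figures quoted in the talk, in floating-point operations per second.
PEAK_FLOPS = {
    "PowerPC": 20e9,
    "Cell": 200e9,
    "Nvidia graphics chip": 2e12,  # "over 2 Tflops", treated here as a lower bound
}


def orders_of_magnitude(faster: str, slower: str) -> float:
    """Base-10 logarithm of the throughput ratio between two chips."""
    return math.log10(PEAK_FLOPS[faster] / PEAK_FLOPS[slower])
```

On these figures the Cell sits one order of magnitude above the PowerPC and one below the graphics chip, which matches its position between general-purpose and fixed-function hardware.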

To address the performance and programmability issues, they developed the Cell multi-core processor architecture. It is based on a PowerPC core in an SMP structure. The synergistic processors, which are the specialized accelerator blocks, have the same external architecture as the PowerPC to ease internal resource allocation and interfacing issues. The Cell chip uses a combination of concurrency and specialization while keeping software compatibility with the 64-bit PowerPC.

The Cell chip addresses the power wall by designing for 4 GHz but operating at only 3.8 GHz. This speed reduction permits lower voltages and less critical timing at a minimal performance penalty. The general-purpose accelerators have a RISC-like 32-bit structure with about 10 percent of the area in the core, another 10 percent for the streaming DMA, and the balance (roughly 80 percent) for programmable acceleration functions.
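The reason a small frequency reduction pays off in power can be sketched with the usual first-order model of dynamic power, which is proportional to capacitance times voltage squared times frequency. The 4 GHz and 3.8 GHz figures come from the talk; the assumption that supply voltage can be lowered in proportion to frequency is a common simplification introduced here and was not a number given in the keynote.

```python
from typing import Optional


def relative_dynamic_power(operating_ghz: float, design_ghz: float,
                           voltage_ratio: Optional[float] = None) -> float:
    """Dynamic power at the operating point relative to the design point.

    Uses P proportional to C * V^2 * f with the capacitance unchanged.
    If voltage_ratio is not given, voltage is assumed to scale linearly
    with frequency (an illustrative assumption, not a figure from the talk).
    """
    freq_ratio = operating_ghz / design_ghz
    if voltage_ratio is None:
        voltage_ratio = freq_ratio
    return voltage_ratio ** 2 * freq_ratio
```

Under these assumptions, `relative_dynamic_power(3.8, 4.0)` is about 0.86, so running 5 percent below the design frequency cuts dynamic power by roughly 14 percent.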

One issue for the chip is the need for security, addressed by an isolated load feature that has no dependence on the operating system or hypervisor. The architecture separates the real-time and non-real-time functions.

Overall, the chip achieves about a 10 times improvement in performance for streaming workloads such as encoding or decoding, and about a 100 times improvement for random functions. At this level of performance, the Cell chip enables lifelike interfaces such as speech recognition and determining emotions from facial expressions. The first applications are for games, but blade servers and home media center controllers are in the works. Toshiba has an application called "magic mirror" that modifies a real-time video image, so you can look into a screen and watch your hair turn into feathers or see other image modifications.

Eventually, the Cell will facilitate an infrastructure that has ubiquitous connectivity, seamless functions with natural interfaces, data security, and on-demand compute capabilities. Ideally, all of this will be based on standard and open-source architectures and software.

The road to higher performance is through modularity, more parallel operations, and greater specialization. The specialized functions need good APIs and functional standards, since the programming models for highly parallel, application-specific accelerators are still in their embryonic stages. The future for the design community is healthy, since applications still need greater performance to meet users' expectations.

To comment on this article, send email to: gmoretti@gabeoneda.com