MIT professor advocates going multi-core in designs

Multicore Expo, Santa Clara, CA, March 21, 2006 -- Anant Agarwal, professor at MIT, discussed "Going Multi-core: Opportunities, Challenges and Dreams" in his keynote talk. After noting that processing demands are outstripping available compute capabilities, he gave the example of a board-level product with 1 CPU, 20 DSP chips, 16 FPGAs and an ASIC.

To address the compute challenge, vendors have traditionally used scaling and frequency increases to follow Moore's law. Until 2002, performance and transistor count seemed to track each other. Architectural changes in the processors accounted for the increase in internal devices as designers added pipeline stages, moved to superscalar execution, and increased the size of caches. After 2002, however, continued scaling started to see diminishing returns from a single CPU, due to wire delays, power envelopes and a lack of new architectural features to add to the existing designs.

Intel’s Pentium family offers an example of diminishing returns. In 2000 the Pentium 3 ran at 1 GHz, contained 28 M transistors in a 0.18 µm process, and achieved a SPECint2000 score of 343. The Pentium 4, released in the same year, operated at 1.4 GHz, contained 42 M transistors in the same process, and achieved a SPECint2000 score of 393. Transistor count increased by 50 percent, but performance improved by only 15 percent.

Many things have changed in the processing landscape over time. Today's applications have much more parallelism, not just PowerPoint and Word, but games, voice over Internet, streaming video, security and firewalls, networks, and wireless. The result is that a single processor cannot meet the throughput and bandwidth requirements for the various transformations and threads now running.

The ability to put multiple cores on a single chip allows the hardware to exploit the high parallelism built into the latest applications. Higher levels of integration are also more efficient. Between discrete chips, bandwidth might be 2 Gbits per second, latency about 60 ns, and the energy required to transfer a word 500 pJ. In comparison, a multicore design will have on-chip bandwidth of over 40 Gbits per second, a latency of less than 3 ns, and an energy cost of about 5 pJ to transfer a word. The parallelism and interconnect efficiency enable harnessing the power of "n", so processing throughput increases roughly linearly with the number of processors.
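The gap between the two sets of figures is easier to see as ratios. The following is a small sketch using only the numbers quoted above; the dictionary layout and function name are illustrative, and because the on-chip figures are quoted as bounds ("over 40", "less than 3 ns"), the results are lower bounds.

```python
# Figures quoted in the talk; names and layout are illustrative only.
discrete = {"bandwidth_gbps": 2.0, "latency_ns": 60.0, "energy_pj_per_word": 500.0}
on_chip = {"bandwidth_gbps": 40.0, "latency_ns": 3.0, "energy_pj_per_word": 5.0}


def improvement_factors(off_chip, integrated):
    """Return how many times better the integrated link is on each metric."""
    return {
        # Higher bandwidth is better, so divide integrated by off-chip.
        "bandwidth": integrated["bandwidth_gbps"] / off_chip["bandwidth_gbps"],
        # Lower latency and energy are better, so divide off-chip by integrated.
        "latency": off_chip["latency_ns"] / integrated["latency_ns"],
        "energy": off_chip["energy_pj_per_word"] / integrated["energy_pj_per_word"],
    }


factors = improvement_factors(discrete, on_chip)
for metric, factor in factors.items():
    print(f"{metric}: at least {factor:.0f}x better on chip")
```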

Analyzing the tradeoffs between single- and multi-core processors shows two areas of comparison. Let’s start with performance. Cycles per instruction (CPI) is a figure of merit for a processor; it rises with the cache miss rate and with the clock cycles needed to recover from each miss. Assuming a cache miss rate of 1 percent and 100 cycles to refill the cache, a typical 90 nm microprocessor would have a CPI of 2. Tripling the cache size reduces the miss rate to 0.6 percent, so a 65 nm CPU achieves a CPI of 1.6: a 40 percent lower miss rate yields only a 20 percent lower CPI. A dual-core CPU would have an effective CPI of 1, twice the performance of the original single-core design. The single-core microprocessor has therefore hit the range of diminishing returns, and further investment in it is not worth the effort.
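The CPI figures can be reproduced with a simple stall model. This is a sketch, not the speaker's own calculation: it assumes a base CPI of 1 when the cache hits, one memory reference per instruction, and perfect scaling across two cores. The function and constant names are illustrative.

```python
def effective_cpi(base_cpi, miss_rate, miss_penalty_cycles):
    """Base CPI plus the average stall cycles per instruction caused by misses."""
    return base_cpi + miss_rate * miss_penalty_cycles


BASE_CPI = 1.0      # assumption: one cycle per instruction on a cache hit
MISS_PENALTY = 100  # cycles to refill the cache, as quoted in the talk

original = effective_cpi(BASE_CPI, 0.01, MISS_PENALTY)       # 1 percent miss rate
bigger_cache = effective_cpi(BASE_CPI, 0.006, MISS_PENALTY)  # 0.6 percent miss rate
dual_core = original / 2  # assumption: two original cores, perfectly parallel work

print(f"original single core : CPI {original:.2f}")
print(f"3x cache single core : CPI {bigger_cache:.2f} "
      f"(speedup {original / bigger_cache:.2f}x)")
print(f"dual core            : CPI {dual_core:.2f} "
      f"(speedup {original / dual_core:.2f}x)")
```

Under these assumptions the larger cache lowers CPI from 2.0 to 1.6, while the second core halves it.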

One example of the performance of a multi-core design is the MIT Raw processor. It has 16 cores, runs at a clock rate of 425 MHz and achieves 6.8 GOPS (billion operations per second). Even though it runs at a slower clock, it has as much as 100 times the performance of a 600 MHz Pentium 3 in streaming applications. For highly parallel applications, this processor will perform much better.

At the same time, the power efficiency of a design is also an issue. Power consumption is roughly a third-order (cubed) function of clock frequency, because dynamic power is proportional to frequency times voltage squared and supply voltage must rise roughly in step with frequency. A 1 percent increase in frequency therefore costs about a 3 percent increase in power consumption. Comparing a superscalar with a multicore processor shows that power efficiency, measured in BOPS/watt (billions of operations per second per watt), is better for the multicore design. The measurements presented show a multicore processor delivering 50 percent more performance than a single-core processor while running at 75 percent of its clock frequency. The multicore chip achieved 1.88 times the BOPS/watt of a normalized superscalar chip.
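A rough model shows where figures like these come from. The sketch below assumes power scales with the number of cores times frequency cubed and that throughput scales perfectly with cores times frequency. The choice of two cores is an assumption, since the article does not state the core count. With these idealized inputs the result is about 1.78, close to but not exactly the 1.88 reported.

```python
def relative_performance(cores, freq):
    """Throughput relative to one core at frequency 1.0 (perfect scaling assumed)."""
    return cores * freq


def relative_power(cores, freq):
    """Power relative to one core at frequency 1.0, assuming power ~ cores * freq**3."""
    return cores * freq ** 3


def perf_per_watt(cores, freq):
    """Normalized efficiency; 1.0 corresponds to the baseline single core."""
    return relative_performance(cores, freq) / relative_power(cores, freq)


# Two ways to get 1.5x the baseline performance:
faster_single = perf_per_watt(cores=1, freq=1.5)   # push the clock up 50 percent
slower_dual = perf_per_watt(cores=2, freq=0.75)    # assumed: two cores at 75 percent clock

print(f"1 core  @ 1.50x clock: perf {relative_performance(1, 1.5):.2f}, "
      f"power {relative_power(1, 1.5):.2f}, perf/watt {faster_single:.2f}")
print(f"2 cores @ 0.75x clock: perf {relative_performance(2, 0.75):.2f}, "
      f"power {relative_power(2, 0.75):.2f}, perf/watt {slower_dual:.2f}")
```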

This improvement in both performance and power efficiency implies a "Moore's law" of cores: the number of cores will double every 18 months. In academia, papers have shown something approaching this trend, with 16 cores in '02 and 64 cores in '05. Industry follows with 4 cores in '02 and 16 in '05. The trends predict a continued doubling over time.
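The doubling rule is easy to check against the data points given. The sketch below is a plain exponential extrapolation; the function name and the 18-month default are illustrative, and any year beyond 2005 is a projection of the stated trend rather than a reported figure.

```python
def projected_cores(base_cores, base_year, year, doubling_months=18):
    """Extrapolate core count assuming it doubles every `doubling_months` months."""
    months_elapsed = (year - base_year) * 12
    return base_cores * 2 ** (months_elapsed / doubling_months)


# Check the rule against the data points quoted in the talk.
academia_2005 = projected_cores(16, 2002, 2005)
industry_2005 = projected_cores(4, 2002, 2005)
print(f"academia 2005: {academia_2005:.0f} cores")
print(f"industry 2005: {industry_2005:.0f} cores")

# Pure extrapolation of the trend, not a measurement.
for year in (2008, 2011):
    print(f"industry {year} (projected): {projected_cores(4, 2002, year):.0f} cores")
```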

The challenges with multicore lie not only in performance and power efficiency, but also in programming. Raw throughput as measured with benchmarks is one metric, but the overall architecture also affects performance. Current systems rely on buses or rings, which are not scalable. Interconnect bottlenecks affect both bandwidth and latency in a system. An alternative to bus or ring topologies is a mesh structure for connecting the parallel processors. A bus can handle 2-4 cores and 8 threads, but a miss rate of 1 percent quickly increases the overall latency. A ring can handle 4-8 processors and can tolerate a miss rate of 2 percent with 16 threads before the latency rises dramatically. A mesh is scalable to the limits of the silicon. Miss rates of up to 10 percent may not affect the latency at all, and 64 threads see a doubling of latency, from 15 to 30 ns, at a 2 percent miss rate.

Communication latency is not an interconnect problem, but a protocol and software overhead issue. If processor-to-processor communication overhead can be reduced to a few cycles, memory accesses are minimized, and direct access to the interconnect fabric is available, a multicore processor can provide far more processing cycles than almost any alternative structure.

While the programming issues are not easy to address, power efficiency is also a serious challenge. The largest CPUs now consume 100 watts. A massively parallel cluster of only 100 such CPUs would require 10 kW. This needs to be reconsidered. In comparison, a multicore chip composed of 90 RISC processors might consume ½ W and deliver 1/8 of the performance of an Itanium. The overall power efficiency of the array is 25 times that of the single-core Itanium.
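The 25x figure follows from the numbers in the paragraph if the Itanium is taken to draw the 100 W quoted for the largest CPUs. That pairing is an assumption made for this sketch, as are the constant and function names.

```python
ITANIUM_WATTS = 100.0  # assumption: the 100 W quoted for the largest CPUs
ITANIUM_PERF = 1.0     # performance normalized to one Itanium
ARRAY_WATTS = 0.5      # 90-core RISC array, as quoted
ARRAY_PERF = 1.0 / 8   # one eighth of the Itanium's performance, as quoted


def perf_per_watt(perf, watts):
    """Normalized performance delivered per watt consumed."""
    return perf / watts


efficiency_ratio = (perf_per_watt(ARRAY_PERF, ARRAY_WATTS)
                    / perf_per_watt(ITANIUM_PERF, ITANIUM_WATTS))
cluster_kw = 100 * ITANIUM_WATTS / 1000.0  # 100 CPUs at 100 W each

print(f"array is {efficiency_ratio:.0f}x more power efficient")
print(f"100-CPU cluster draws {cluster_kw:.0f} kW")
```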

The Madison Itanium allocates only about 4 percent of its chip area to ALU and FPU functions, yet all of the chip's area consumes power: area equates to power consumption. The metric of interest for a design is therefore resource size. A resource should not be enlarged unless the resulting percentage increase in performance is at least as large as the percentage increase in area (and hence power).

Other architectural considerations for multicore rest on the premise that processing and communication are cheaper than memory access. A data transfer over 1 mm of network consumes 3 pJ of energy. An ALU add operation uses 2 pJ. In comparison, a cache read uses 50 pJ and an off-chip read uses 500 pJ. Multicore architectures will not only change the nature of processors, but will also cause a migration from memory-oriented computational models to communication-centric ones. Traditional cluster programming squanders the multicore opportunity because its message-passing and shared-memory structures carry high communication overhead. To achieve optimal results from a multicore environment, the programming model and OS must allow parallelism to be specified at any granularity and must favor communication over memory.
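To see why these figures favor communication, compare the energy of a single add whose operand arrives from different places. This is a sketch using only the per-operation energies quoted above; the scenario (one operand, one add, 1 mm of network) and all names are illustrative assumptions.

```python
# Energy per operation in picojoules, as quoted in the talk.
ENERGY_PJ = {
    "network_per_mm": 3.0,
    "alu_add": 2.0,
    "cache_read": 50.0,
    "off_chip_read": 500.0,
}


def add_with_network_operand(distance_mm=1.0):
    """Operand travels `distance_mm` over the network, then is added."""
    return ENERGY_PJ["network_per_mm"] * distance_mm + ENERGY_PJ["alu_add"]


def add_with_memory_operand(source):
    """Operand is read from `source` ('cache_read' or 'off_chip_read'), then added."""
    return ENERGY_PJ[source] + ENERGY_PJ["alu_add"]


streamed = add_with_network_operand(1.0)
from_cache = add_with_memory_operand("cache_read")
from_dram = add_with_memory_operand("off_chip_read")

print(f"operand from neighbor core: {streamed:.0f} pJ")
print(f"operand from cache        : {from_cache:.0f} pJ ({from_cache / streamed:.1f}x)")
print(f"operand from off chip     : {from_dram:.0f} pJ ({from_dram / streamed:.1f}x)")
```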

This set of changes in hardware and software leads to a stream programming approach. Data streams in from the network to a processing element, which computes its values and sends the results back out to the network. There is no actual memory access, synchronization or address arithmetic. Processing and data communication occur at the same time.
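The shape of a stream program can be illustrated in ordinary software. In the sketch below, Python generators stand in for processing elements and the links between them: each stage consumes values as they arrive and emits results downstream, with no shared array, explicit index arithmetic, or locks. It is an analogy only, since generators interleave on one thread rather than running concurrently on separate cores, and the stage names are hypothetical.

```python
def source(values):
    """Emit input samples one at a time (stands in for data arriving from the network)."""
    for value in values:
        yield value


def scale(stream, factor):
    """Processing element: multiply each incoming value and pass it on."""
    for value in stream:
        yield value * factor


def running_sum(stream):
    """Processing element: keep only local state and emit the cumulative total."""
    total = 0
    for value in stream:
        total += value
        yield total


# Wire the elements together; data flows through as each value is produced.
pipeline = running_sum(scale(source([1, 2, 3, 4]), factor=10))
result = list(pipeline)
print(result)  # [10, 30, 60, 100]
```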

Multicore capabilities are challenging several myths of computing: that existing cores make good sense, that bigger cores are better, and that interconnect latency is due to wire delays. In the future, we might consider a core as the lookup table of the 21st century. If we can solve the performance, power efficiency, and programming challenges, multicore could replace all hardware in the future.

To comment on this article, send email to: gmoretti@gabeoneda.com