Intel CTO looks into the future: Measuring the value and need for multi-core

Stanford, CA, Aug. 21, 2006 -- Justin Rattner, CTO of Intel Corp., discussed "Cool codes for hot chips: a quantitative basis for multi-core design" in one of the Hot Chips conference keynote addresses. He started by reviewing the past 20 years of quantitative measurement of processor designs. Tools such as SPEC and other benchmarks have helped to tune and optimize designs by supplying standard suites of tests that measure computer characteristics such as speed and throughput.

Now, an increasing number of companies are working on both large and small multi-core designs. Intel is developing a suite of test applications to provide a standard way to measure this new class of multi-processor chips. The company calls its test suite the Tera-Scale straw man, as the project has just started and must eventually measure new classes of applications and unknown future generations of multi-core-enabled applications.

Early evaluations are already showing the challenges and deficiencies of the existing benchmarks in measuring these new paradigms. Future processor designs will likely include a mix of fixed-function accelerators and controllers for performance and energy considerations. So new benchmarks must be developed to measure the computing environment that will exist in the next 5 to 10 years.

To illustrate some of the challenges in developing multi-core benchmarks, Rattner showed examples of applications that are likely candidates for multi-core chips. One is a 4-camera image-tracking system that doesn't need the special body suits with marker points currently used to digitize motion for video and games. Another is full shadowing and ray tracing for photo-realistic images. A third is motion tracking and estimation.

All of these applications are candidates for multi-core processors with complex memory hierarchies, on-die communication fabrics, explicit thread support, and some fixed-function accelerators. The question is how to measure performance. Existing metrics such as SPLASH, TPC (which measures server throughput), EEMBC (for embedded applications), and 3DMark (for 3D graphics) are all optimized for very small numbers of processor cores.

If the industry continues to use the existing metrics and waits for the new applications to emerge, the growing gap between the number of processor cores and the number of multi-threaded applications will make performance measurements meaningless. Therefore, the industry needs to develop benchmarks that can measure both new and established characteristics so that comparisons are possible. The newer applications are in areas like data mining, recognition, and large-scale synthesis.

Benchmarks must measure highly threaded, scalable functions and stress differentiating characteristics such as memory structure, cache size and count, and number of cores, all in the context of performance and energy efficiency. The new benchmark suite must be able to evaluate variables including core size, in-order versus out-of-order execution, performance versus scalability, compute versus I/O, and the type of cache and memory architecture.
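
To make that design space concrete, here is a minimal sketch (not from the talk) of the variables such a suite would have to sweep; the structure, names, and values are illustrative assumptions, not Intel's actual benchmark suite.

```cpp
// Hypothetical sketch of the multi-core design space a benchmark suite
// would sweep: core count, core complexity, cache size, memory fabric.
// All names and values are illustrative assumptions.
#include <cstdio>
#include <string>
#include <vector>

struct DesignPoint {
    int         cores;          // number of cores on the die
    bool        out_of_order;   // in-order vs. out-of-order cores
    int         l2_kb_per_core; // cache size per core, in KB
    std::string memory;         // e.g. shared bus vs. on-die fabric
};

int main() {
    std::vector<DesignPoint> sweep;
    for (int cores : {2, 4, 8, 16, 32, 64})
        for (bool ooo : {false, true})
            for (int l2 : {256, 512, 1024})
                for (const char* mem : {"shared bus", "on-die fabric"})
                    sweep.push_back({cores, ooo, l2, mem});

    // Each design point would be paired with every workload in the suite
    // and scored on both performance and energy efficiency.
    std::printf("design points to evaluate: %zu\n", sweep.size());
    return 0;
}
```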

Some existing benchmarks do have multi-core capabilities, but they focus on a small number of cores because of the type of hardware that was available for benchmarking. For example, in graph-mining applications, Gaston shows increasing performance for up to 4 cores, then flattens out with more processors. gSpan scales better, but shows very low performance until the cluster exceeds 16 cores.
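
As a rough illustration of how such scaling curves are measured, the following sketch times a stand-in kernel at increasing thread counts and reports speedup over a single thread; the kernel, problem size, and thread counts are assumptions, not the actual Gaston or gSpan code.

```cpp
// Hypothetical scaling-measurement harness: run the same kernel at
// increasing thread counts and report speedup over one thread. The kernel
// below is a stand-in for a real graph-mining workload.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Toy compute kernel: each thread sums over a slice of the iteration space.
static void kernel_slice(long begin, long end, double* out) {
    double acc = 0.0;
    for (long i = begin; i < end; ++i)
        acc += 1.0 / (1.0 + static_cast<double>(i));
    *out = acc;
}

static double run_with_threads(int nthreads, long work = 50000000L) {
    std::vector<std::thread> threads;
    std::vector<double> partial(nthreads, 0.0);
    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < nthreads; ++t) {
        long begin = work * t / nthreads, end = work * (t + 1) / nthreads;
        threads.emplace_back(kernel_slice, begin, end, &partial[t]);
    }
    for (auto& th : threads) th.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    double base = run_with_threads(1);   // single-thread baseline
    for (int n : {1, 2, 4, 8, 16})
        std::printf("%2d threads: speedup %.2fx\n", n, base / run_with_threads(n));
    return 0;
}
```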

Another issue is that the hardware architecture and the algorithms will be very closely linked. For example, a forward solver seems to peak at 5x single-core performance with as many as 16 processors. Adding hardware thread support for scheduling and synchronization, along with direct instruction support, lets performance keep increasing as the cluster grows beyond 32 processors. Changing the cache structure enables further gains beyond 64 cores. Finally, adding application-specific accelerators yields still more improvement. This example shows how tightly the architecture and the algorithm depend on each other, and why designers need to examine the current bottleneck when evaluating benchmark performance.
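
One way to see why such plateaus appear is Amdahl's law, speedup(N) = 1 / ((1 - p) + p/N), where p is the fraction of the work that parallelizes. The sketch below is a back-of-envelope illustration, not an analysis from the talk: a roughly 85% parallel fraction is one assumption that would produce a plateau near 5x at 16 cores, and raising p (for example by moving scheduling and synchronization into hardware) shifts the whole curve upward.

```cpp
// Back-of-envelope sketch (not from the talk): Amdahl's law speedup for a
// few assumed parallel fractions. With p = 0.85, 16 cores give about 4.9x,
// which matches the kind of plateau described for the forward solver;
// larger p values show how removing a bottleneck lifts the whole curve.
#include <cstdio>

static double amdahl(double parallel_fraction, int cores) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

int main() {
    for (double p : {0.85, 0.95, 0.99}) {       // assumed parallel fractions
        std::printf("p = %.2f:", p);
        for (int n : {4, 16, 32, 64})
            std::printf("  %2d cores -> %5.2fx", n, amdahl(p, n));
        std::printf("\n");
    }
    return 0;
}
```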

Emerging applications in recognition, mining, and synthesis (RMS) are likely to become the "killer" applications of the next decade. Recognition involves modeling and identifying relevant data from multi-modal inputs, such as combining speech recognition with lip tracking to improve the signal-to-noise ratio and the overall accuracy.

Search functions will work to find similar instances in a sea of data. If you are trying to search through 5,000 photos, the existing metadata are highly inadequate: a date and possibly a unique file name. A search tool that can use content characteristics such as colors, faces, and textures can quickly find the picture you want. Today's tools must pre-index the photos and take over 20 hours to characterize and index about 10,000 photos. Obviously, users want to do this in something approaching real time.
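
As a purely illustrative example of a "content characteristic," the sketch below computes a coarse color histogram from raw RGB pixels and compares two photos by histogram distance; real content-based indexing of the kind described (faces, textures) is far heavier, which is why characterizing roughly 10,000 photos took some 20 hours on the hardware of the day.

```cpp
// Illustrative sketch only: a coarse RGB color histogram as a simple
// content feature, plus a distance between two photos' histograms.
#include <array>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

using Histogram = std::array<double, 4 * 4 * 4>;  // 4 bins per color channel

Histogram color_histogram(const std::vector<uint8_t>& rgb) {
    Histogram h{};
    for (size_t i = 0; i + 2 < rgb.size(); i += 3) {
        int r = rgb[i] / 64, g = rgb[i + 1] / 64, b = rgb[i + 2] / 64;
        h[(r * 4 + g) * 4 + b] += 1.0;
    }
    double total = rgb.size() / 3.0;                   // number of pixels
    for (double& bin : h) bin /= (total > 0 ? total : 1.0);
    return h;
}

// Smaller distance means more similar overall color content.
double histogram_distance(const Histogram& a, const Histogram& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

int main() {
    // Two tiny synthetic "photos": one all red, one all blue.
    std::vector<uint8_t> red(300, 0), blue(300, 0);
    for (size_t i = 0; i < red.size(); i += 3) { red[i] = 255; blue[i + 2] = 255; }
    std::printf("distance(red, blue) = %.2f\n",
                histogram_distance(color_histogram(red), color_histogram(blue)));
    return 0;
}
```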

The next-generation applications will require very large amounts of computation, in excess of 10 GFLOPS, along with large memory capacity and bandwidth of more than 100 GB/s. Vision systems and ray tracing are applications that can benefit greatly from changes in architectures and algorithms, easily a 10-times improvement from each. Currently, a 4-way dual-core compute cluster (8 CPUs in total) can achieve only about 6-10 frames per second in these applications, well below real-time rates.
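
A quick back-of-envelope calculation, assuming a 30 frames-per-second real-time target (that target is an assumption, not a figure from the talk), shows how the cited 10x architecture and 10x algorithm gains more than cover the gap:

```cpp
// Back-of-envelope only: how far 6-10 frames/s is from an assumed 30 fps
// real-time target, versus the 10x-from-architecture and 10x-from-algorithms
// gains cited in the talk.
#include <cstdio>

int main() {
    const double current_fps_low = 6.0, current_fps_high = 10.0;
    const double realtime_fps = 30.0;                  // assumed target
    const double arch_gain = 10.0, algo_gain = 10.0;   // gains cited in the talk

    std::printf("needed speedup: %.1fx to %.1fx\n",
                realtime_fps / current_fps_high, realtime_fps / current_fps_low);
    std::printf("headroom from architecture x algorithms: %.0fx\n",
                arch_gain * algo_gain);
    return 0;
}
```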

Because an application's code performance depends on many factors, Intel's research into an RMS test suite tries to measure a mixture of compute and memory bandwidths. So far, the suite contains some primitive functions and some parallel applications. All of these benchmarks still need to be optimized for high-thread-count processors, since the design assumptions behind the measurements are based on today's architectures and algorithms. To improve on the current state of measurement, Intel is trying to encourage the development of a public RMS suite.
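
To illustrate what measuring "a mixture of compute and memory bandwidths" means in practice, the sketch below times one bandwidth-bound kernel and one arithmetic-bound kernel on the same data; it is an illustration of the idea, not part of Intel's suite.

```cpp
// Illustrative sketch: a memory-bound kernel (stream copy) and a
// compute-bound kernel, timed separately. Real RMS workloads fall
// somewhere between these two extremes.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 23;                 // ~8M doubles, ~64 MB per buffer
    std::vector<double> a(n, 1.0), b(n, 2.0);

    // Memory-bound: touches every byte, does almost no arithmetic.
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i) a[i] = b[i];
    auto t1 = std::chrono::steady_clock::now();
    double sec_mem = std::chrono::duration<double>(t1 - t0).count();
    std::printf("stream copy: %.2f GB/s (sample %.1f)\n",
                2.0 * n * sizeof(double) / sec_mem / 1e9, a[n / 2]);

    // Compute-bound: many floating-point operations per element loaded.
    double acc = 0.0;
    t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i) {
        double x = b[i];
        for (int k = 0; k < 16; ++k) x = x * 1.0000001 + 0.5;  // 32 flops/element
        acc += x;
    }
    t1 = std::chrono::steady_clock::now();
    double sec_cpu = std::chrono::duration<double>(t1 - t0).count();
    std::printf("compute loop: %.2f GFLOP/s (checksum %.1f)\n",
                32.0 * n / sec_cpu / 1e9, acc);
    return 0;
}
```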

They are already running into intellectual-property issues, but are starting to engage with partners at UC Berkeley, Stanford, the University of Pittsburgh, and Princeton, as well as other industrial partners. Intel and its partners are in the early stages of calling for contributions from industry and academia. One recent addition to the suite is a server benchmark from Professor Kai Li at Princeton.

The future of multi-core processors is in our hands. We need to develop a powerful suite of benchmarks that can evaluate both architectures and algorithms. These tests need to be general solutions, not ones focused on specific applications and languages. The emergence of frameworks and machine learning is necessary to make the parallelism of the coming machines more accessible to ordinary programmers.

The tests also have to acknowledge the penalties for going off chip. Compute clusters will have to be designed to manage I/O transactions, addressing the difference between on-chip core performance and the communication between boxes. In addition, future applications are likely to be statistical rather than exact in what they require to be considered correct. Without a robust set of benchmarks, there is a high potential for an unproductive core-count race.