ARM CTO discusses power and computing.

February 4, 2008, San Francisco, CA. The International Solid-State Circuits Conference (ISSCC) opened with a series of plenary presentations. One talk, “Embedded Processing at the Heart of Life and Style” by Mike Muller, CTO of ARM, looked at the trade-offs among performance, power, and die area.

The challenges in design are exacerbated by the rising costs and risks of the designs. Managers are looking for any way to de-risk the design process. One way to address the problem is to use more external IP. The cost of microprocessor development is rising at an exponential rate with each generation. When many users take advantage of the expertise embedded in this IP, the knowledge base and costs are spread across the entire industry. The IP vendor will invest millions of dollars to develop the IP and deliver robust solutions. For example, ARM invested $500 million to develop the ARM11 and Cortex families. If companies created their own processors, their aggregate investment would exceed $11 billion for this type of microprocessor. By creating this IP family, ARM has not only saved the industry a lot of investment and duplicated engineering, but has also helped to reduce the risk of processor-based designs.

The same comments also apply to the underlying libraries of smaller-scale components. One reason that users are looking into new libraries and other IP is growing concern about the environment, especially with regard to power consumption. Many companies are trying to address power consumption by adding more processors to their designs to enable intelligent power control. One example is a washing machine that has reduced power consumption by 70 percent compared to earlier generations of machines. The intelligent controllers reduce power by adjusting the drive current to the motors to match the actual loads, rather than supplying full power to the motors all the time.

In another area, manufacturers are looking for ways to eliminate standing power. When the UK changed to digital TV, most viewers had to get a converter box. The problem is that everyone leaves the converter on all the time, so the DTV converters now use the equivalent of $11 billion in electricity and generate 32,000 tons of CO2. By designing the DTV converters to switch to a “sleep mode” when not in use, most of the costs and CO2 are eliminated.

A third area that affects many people is battery life. For example, in most modern cellular phone handsets, talk and standby times have hit a plateau after rising for the past few years. The batteries are improving, and the circuitry is placed on the lowest-power processes possible, but the increase in functions and capabilities is outpacing the improvements in silicon and batteries. To address this issue, new libraries are always being generated to make more power variants and smaller transistors available for new designs. Unfortunately, everything else about the libraries is getting worse.

To really address the power issues, designers need to do more work at the architectural level and look at the design at the system level. The design must consider the effects of hardware, software, and the architecture on all phases of the design. These effects must be addressed and optimized to keep from melting the case in the next generation of hand-held products. Designers must make choices between power and performance early in the design to take the greatest advantage of the architecture.

The manufacturing process is also part of the equation. When people started to port their designs from 90 nm to 65 nm, they found that the low-power libraries had lower performance in the 65 nm process than in the 90 nm process. The maximum frequency was lower, and the power budget didn't allow for substitution of the higher-power but faster libraries. The ported designs required changes to the architecture, such as additional pipeline stages, to meet the performance requirements.

The system-level challenges start at the top level with the choice of processors and the applications that have to run on them. As the design moves through the design and implementation process, the structure of protocol stacks, virtualization, and real-time functions starts to have a greater effect on power and performance. When the details of the baseband functions come up, the designer must address the fine details and the interactions of the hardware and software. Now, the size and complexity of the programs have made the software part of the design harder than the hardware.

Partitioning more functions into dedicated hardware improves performance and may not affect the power budget too much. Unfortunately, the computations per mW don't scale with transistor size. At process nodes below 90 nm, a microprocessor consumes more power than the same processor at 90 nm. Design flexibility then diverges towards two different goals. A single chip can be designed for multiple uses in one product generation, or a more general-purpose platform can be the basis for designs over many product generations. The challenge is to pick the best combination of flexibility and efficiency.

The conundrum is that shifting the hardware-software partitioning towards more custom hardware improves performance, but the product line needs the flexibility that only comes from more software. The general trend in design has been towards heterogeneous systems that start with C in a single-threaded, single shared memory configuration. As the systems get more complex, programming becomes more difficult. If the original system had a lot of hand-built software, then changes in architecture require an almost complete redesign of the software.

When the designer is compiling the system, the architectural-level view has to define the inter-kernel communications, set increasingly precise timing constraints, and generate the tests. Efficiency requires efficient components coupled with an optimized architecture. One way to improve the efficiency is to reduce the operating voltage, but reducing the supply voltage adversely affects bit-cell stability and noise margins. At the same time, increasing the threshold voltage on the transistors to help reduce leakage can have other consequences. A 25 percent increase in performance causes a 1000 percent increase in leakage. To make matters even worse, the worst-case leakage in the low-power libraries is higher than the best-case leakage in the general-purpose libraries. So there are serious limits to the benefits of voltage scaling.
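The appeal of lowering the operating voltage comes from the standard first-order CMOS relation, in which switching power grows with the square of the supply voltage. The sketch below is only an illustration of that relation: the capacitance, voltages, and frequency are made-up values, not figures from the talk, and the model ignores leakage entirely.

```python
def dynamic_power(c_eff_farads, vdd_volts, freq_hz, activity=1.0):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical operating points (illustrative values only).
nominal = dynamic_power(1e-9, 1.2, 200e6)  # 1 nF effective, 1.2 V, 200 MHz
scaled = dynamic_power(1e-9, 1.0, 200e6)   # same block at 1.0 V

# Fraction of switching power saved by the lower supply voltage.
saving = 1.0 - scaled / nominal
print(f"nominal {nominal:.3f} W, scaled {scaled:.3f} W, saving {saving:.1%}")
```

Because leakage, bit-cell stability, and noise margins are left out, the saving computed here is an upper bound on what voltage scaling alone can deliver.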

Another tactic is to use aggressive power gating. Effective use of power gating requires adaptive on-chip voltage sensing IP and deployable functions to implement the necessary retention flip-flops. The system must account for the additional time needed to store the state information or must allow for more area for faster retention and synchronization. In addition, the system-level functions must accommodate the additional costs for the structures needed for the multiple voltage islands.
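The time-versus-savings trade-off in power gating can be framed as a break-even calculation: gating only pays off when a block stays idle long enough for the leakage saved to exceed the energy spent saving and restoring state. The following is a minimal sketch under that assumption; the function names and all numeric values are hypothetical and do not come from the presentation.

```python
def break_even_idle_time(save_restore_energy_j, leakage_on_w, leakage_gated_w):
    """Idle time in seconds beyond which gating saves net energy."""
    saved_power = leakage_on_w - leakage_gated_w
    if saved_power <= 0:
        raise ValueError("gating must reduce leakage to ever pay off")
    return save_restore_energy_j / saved_power


def should_gate(expected_idle_s, wake_latency_s, latency_budget_s, break_even_s):
    """Gate only if the idle period is long enough and wake-up is fast enough."""
    return expected_idle_s > break_even_s and wake_latency_s <= latency_budget_s


# Hypothetical block: 2 uJ to save and restore state, 1 mW leakage when
# powered, 50 uW when gated.
t_be = break_even_idle_time(2e-6, 1e-3, 5e-5)
print(f"break-even idle time: {t_be * 1e3:.2f} ms")
print(should_gate(10e-3, 100e-6, 1e-3, t_be))  # long idle period
print(should_gate(1e-3, 100e-6, 1e-3, t_be))   # idle period too short
```

Faster retention structures lower the wake latency at the cost of area, which is the trade-off the paragraph above describes.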

Lower voltage leads to lower leakage, but now the issue is further clouded by the increase in variability. Unfortunately, there is no correlation between variability and leakage, leading to greater design margins that affect frequency, yield, and mean time to failure. More uncertainty leads to the interesting condition that the worst case becomes improbable. In the past, all designs were able to run at the lowest specified voltage and nominal frequency because of the optimizations and transformations made available by the EDA tools. But some of these chips could run at much lower voltages or at much higher frequencies. Manufacturers got more yield and money by selling the different performance levels at different prices to address the variability issue.
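Selling parts at different performance levels amounts to sorting each die into the fastest grade it can meet. The sketch below illustrates that idea only; the speed grades and measured frequencies are invented for the example and are not data from the talk.

```python
def assign_bin(max_freq_mhz, bins_mhz):
    """Return the fastest speed grade the die meets, or None if it fails all."""
    passing = [grade for grade in bins_mhz if max_freq_mhz >= grade]
    return max(passing) if passing else None


# Hypothetical speed grades and per-die maximum frequencies, in MHz.
bins = [400, 500, 600]
measured = [380, 450, 520, 610, 590]

assigned = [assign_bin(freq, bins) for freq in measured]
print(assigned)
```

A die that misses the lowest grade is returned as `None`, which corresponds to yield loss in this simplified picture.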

Now, additional analog circuitry, coupled to the operating system, can be used to ratchet down the supply voltage until the error rate hits some threshold, then increase the voltage by one step for operation. If the cost of correcting errors is less than the cost of the additional area for data correction, then the voltage can be reduced even more. An architectural and implementation alternative is to use flip-flops with built-in error detection.
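The ratchet-down scheme is essentially a simple control loop. The following is a minimal sketch of that loop, assuming the error rate can be sampled at each voltage step; the function names, the millivolt step size, and the toy error model are all hypothetical stand-ins for the on-chip sensing hardware.

```python
def settle_voltage_mv(measure_error_rate, v_start_mv, v_min_mv, step_mv,
                      max_error_rate):
    """Step the supply down until the error rate reaches the threshold,
    then back off by one step. Voltages are integer millivolts."""
    v = v_start_mv
    while v - step_mv >= v_min_mv:
        v -= step_mv
        if measure_error_rate(v) >= max_error_rate:
            v += step_mv  # threshold hit: raise the voltage by one step
            break
    return v


def toy_error_rate(v_mv):
    """Hypothetical error model: error-free at 900 mV and above."""
    return 0.0 if v_mv >= 900 else (900 - v_mv) * 1e-4


operating_mv = settle_voltage_mv(toy_error_rate, 1200, 700, 25, 1e-3)
print(operating_mv)
```

A system that can correct errors cheaply would simply pass a higher `max_error_rate`, letting the loop settle at a lower voltage.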

All of these examples show the need to use a system-level view to look at issues of operating power, error rates, and performance. In fact, it is essential to optimize at the system level to avoid local minimizations. Silicon scaling isn't what it used to be. Now a given speed may increase power and energy consumption.

To see how much more the industry needs to do, compare some highly parallel systems. The human brain has about 10^12 neurons with an average fanout of 1, operating at about 10 Hz. Current research into artificial silicon neurons envisions a fanout of 1000 to emulate 10^12 synapses. This is about 1 terabyte of equivalent memory and will operate at 10^13 links per second, resulting in a network that performs about 10^8 MIPS. A system of this size can be implemented with about 1 million ARM processors, partitioned as 20 ARM968 cores plus 128 MB of DDR memory per chip.
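The figures quoted for the silicon neuron system can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given above, plus two labeled assumptions: one byte of storage per synapse, and decimal units (1 MB = 10^6 bytes). The derived chip count and per-processor MIPS are not stated in the talk; they simply follow from the quoted totals.

```python
synapses = 10**12        # quoted number of emulated synapses
firing_rate_hz = 10      # quoted operating rate
links_per_second = synapses * firing_rate_hz

bytes_per_synapse = 1    # assumption: one byte of state per synapse
memory_tb = synapses * bytes_per_synapse / 10**12

processors = 1_000_000   # quoted processor count
processors_per_chip = 20 # quoted ARM968 cores per chip
chips = processors // processors_per_chip

ddr_per_chip_mb = 128    # quoted DDR per chip, taken as decimal megabytes
total_ddr_tb = chips * ddr_per_chip_mb / 10**6

total_mips = 10**8       # quoted aggregate performance
mips_per_processor = total_mips / processors

print(links_per_second, memory_tb, chips, total_ddr_tb, mips_per_processor)
```

Under these assumptions the aggregate DDR comfortably exceeds the 1 terabyte needed for the synapse state, and each processor has to deliver on the order of 100 MIPS.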

In comparison to the human brain, operating on picowatts, the first computers used over 1 kW of power and operated at about 700 Hz. An ARM7 comprises about 100,000 transistors, consumes 9 mW, and performs at 165 digital MIPS. We have a long way to go before we can seriously compete with nature.

To comment on this article, send email to: gmoretti@gabeoneda.com