The trouble with low-power design is that the intuitive answer is not necessarily the one that works. ARM has found in its work on processor design that looking at system-design and process issues, as well as at the way software interacts with the machine, can yield surprising results.
Richard York, director of product marketing at ARM, explained at the recent Embedded World show in Nürnberg how instrumenting software uncovered how a technique once used purely for performance can also pay off in terms of energy saving.
Research carried out at places such as Stanford University has underlined where a lot of the power goes in most microprocessors – instruction fetch rather than execution. And a lot of that energy goes in the memory accesses to repeatedly fetch those instructions from cache or main memory. This work has led to experiments with compressing instruction streams to minimise that power, which is largely lost through capacitive charging; researchers at the University of Michigan have tried this approach. ARM is funding some of the low-power work being done on both architecture and circuit-design techniques such as near-threshold logic.
However, York pointed out that the designers at ARM found that, rather than compressing instructions for memory density, storing them in a partially decoded form in level-one cache wound up saving energy. The cost of repeatedly decoding the instructions turned out to be more significant than the capacitive load from a level-one cache.
“The Cortex-A7 ends up with huge power benefits by partially decoding in the cache,” said York. “It turns out that it’s lower power to fetch more from cache than to fetch less.
“Decoding in the instruction cache? That used to be for high performance. No-one ever dreamed of doing that for low power. If we can spot techniques that were used for performance and rework them for power those are interesting areas for innovation,” York added.
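York gave no figures, but the trade-off he describes can be illustrated with a toy energy model. All the numbers below are invented for illustration, not ARM data: predecoding widens each cache line, so every fetch costs a little more, but it removes most of the per-fetch decode cost.

```python
# Toy model of the predecoded-cache trade-off York describes.
# Every energy value here is an invented placeholder, not an ARM figure.

def total_energy(fetches, fetch_energy_pj, decode_energy_pj):
    """Total energy (pJ) to fetch and decode `fetches` instructions."""
    return fetches * (fetch_energy_pj + decode_energy_pj)

FETCHES = 1_000_000

# Conventional: narrow L1 lines, full decode repeated on every fetch.
conventional = total_energy(FETCHES, fetch_energy_pj=10.0, decode_energy_pj=6.0)

# Predecoded: wider lines cost more per fetch, but most decode work is
# done once at line fill and amortised across repeated fetches.
predecoded = total_energy(FETCHES, fetch_energy_pj=12.0, decode_energy_pj=1.0)

print(conventional, predecoded)  # the predecoded cache wins overall
```

With these illustrative numbers, "fetching more from cache" (12 pJ vs 10 pJ per fetch) still comes out ahead because the repeated decode cost dominates.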
The benefit extends beyond code that is running in tight loops, according to York. “If only it all were tight loops. That is why routinely people are putting in level-two caches. Level-three caches are routinely being talked about now. The question now is should we push this technique out to level-two cache?”
York said the question of using level-three caches underlines the need for processor clusters, as this allows you to “have a unified level three that supports a cluster of four processors”.
Memory overhead is still a concern, York said, especially when it comes to accessing the flash memory: “Flash memory in particular is very power hungry.”
One of the techniques being explored is to reduce the amount of power fetching words from flash memory that are never used. “We are looking at the number of memory fetches and doing fewer for the same sequence. The engineers have managed to get 10 to 15 per cent fewer memory accesses by carefully profiling customer code to see where it fetched things that were never used.”
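The kind of profiling York describes can be mimicked with a simple trace analysis. The sketch below is hypothetical (the trace format is invented): each record pairs the number of words fetched in a flash line with the word positions the processor actually consumed, and the function reports what fraction of fetched words were wasted.

```python
# Hypothetical trace analysis in the spirit of the profiling York describes.
# Each record is (words_fetched_in_line, set_of_word_indices_actually_used).

def wasted_fetch_ratio(trace):
    """Fraction of fetched words that were never consumed."""
    fetched = sum(words for words, _ in trace)
    used = sum(len(used_words) for _, used_words in trace)
    return (fetched - used) / fetched

# Example trace with 16-word (512-bit) flash lines:
trace = [
    (16, set(range(16))),      # straight-line code: whole line used
    (16, set(range(8, 16))),   # branch lands mid-line: first 8 words wasted
    (16, set(range(4))),       # early branch out: last 12 words wasted
]

print(f"{wasted_fetch_ratio(trace):.0%} of fetched words never used")
```

Profiling real customer code this way is what let ARM's engineers find and eliminate the 10 to 15 per cent of accesses that fetched words which were never used.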
This technique can go further, said York. “How do you give the processor hints about what it could fetch? Maybe say to flash don’t send a full 512bit line. I’m jumping halfway in, so I only need this much. All these little tricks add up. We have taken the easy benefits. It gets harder and harder and harder.
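The partial-line hint York mentions amounts to simple arithmetic: the interface below is imagined, but the 512-bit line width is the figure he cites, so jumping halfway into a line means half of it need never cross the bus.

```python
# Back-of-the-envelope sketch of the partial-line flash fetch York
# describes. The function is hypothetical; the 512-bit line is his figure.

LINE_BITS = 512

def bits_needed(entry_offset_bits):
    """Bits actually required when execution enters a line at an offset."""
    return LINE_BITS - entry_offset_bits

# Jumping halfway into the line: only the second half is needed.
saved = LINE_BITS - bits_needed(LINE_BITS // 2)
print(saved)  # 256 bits of this fetch never need to leave the flash
```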
“With one of the new processors for the microcontroller space, the engineering team swore blind with the first version they couldn’t find more ways to save power. Now they have found some,” said York.
It is possible as a system architect to go too far in uncovering power savings. For example, intuition says that picking different voltage levels for individual voltage islands makes sense. But that only works if power-management circuitry can deliver all those voltages. In practice, that does not work out so well because each conversion stage, although probably 70 to 80 per cent efficient, is still going to waste some energy. So, engineers at ARM talk to the power-management companies to find out what they are implementing, York said, and this informs the architects.
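The arithmetic behind that caution is simple: conversion efficiencies multiply, so chaining stages erodes the savings quickly. A quick sketch, using efficiencies in the 70 to 80 per cent range York cites:

```python
# Cascaded power-conversion stages: overall efficiency is the product
# of the per-stage efficiencies, so extra stages compound the losses.

def chain_efficiency(stage_efficiencies):
    """Overall efficiency of a chain of conversion stages."""
    result = 1.0
    for eff in stage_efficiencies:
        result *= eff
    return result

# One 80%-efficient regulator vs. two 80% stages in series:
print(f"{chain_efficiency([0.8]):.2f}")       # 0.80
print(f"{chain_efficiency([0.8, 0.8]):.2f}")  # 0.64: over a third lost
```

Two 80 per cent stages in series waste more than a third of the energy, which is why adding voltage domains, each needing its own conversion, can cancel out the savings the domains were meant to deliver.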
“How many voltages can they supply? And therefore how many voltage domains can we have? The amount of current. How many regulators? You might have regulators that hybridise and use different techniques based on the supply voltage. If you understand these issues, you may find there is no point in having a lot of different voltage domains. This type of work makes sure we don’t overcook it,” York explained. “Their chip in the system that is providing the voltages and currents: knowing about that is just as important as working with the circuit-design issues.”