Techniques for low power at the system level
Designers thinking about low power and energy have a variety of strategies at their disposal. The most common are:
- Process/libraries (e.g. low-power processes/libraries; high and low threshold voltage cells; and voltage scaling);
- Power and voltage domains;
- Clock gating;
- Low-power optimized clock synthesis;
- Low-power synthesis (e.g. automatic insertion of operand isolation circuitry);
- Implementation optimizations (e.g. operand isolation; pre-computation; and power efficient scheduling of resources).
So, with such a plethora of low-power tools and capabilities focused on the post-register-transfer-level (RTL) stages of the design flow, why should you care about what electronic system level (ESL) synthesis can do for power? A large part of the answer is that architectural tradeoffs can have a 10X bigger impact than decisions taken further downstream in the design flow. ESL synthesis, in turn, provides the best vantage point for making, and therefore the best control over, the architectural, micro-architectural and implementation choices that drive energy consumption.
Key factors in low energy design
1. Low-power may no longer mean low-energy
For line-powered applications, where heat and power budget issues can dominate, the key focus is often low-power design. For battery-powered applications, however, the primary focus is low energy. This has traditionally been addressed mainly by low-power design techniques because, at larger geometries, dynamic power was a suitable proxy for energy use, which allowed designers to focus almost exclusively on minimizing gate switching.
But because static power dissipation has a material impact below 130 nm, low-energy design has become more complex. Architects and designers must now understand the energy impact of trade-offs in performance/delay, area, design topology and power implementation strategies for a given task. Moreover, the impact of such trade-offs is typically hard to assess until late in the design, by which time it is often too late to make any appreciable changes.
Figure 1 illustrates how the choice of circuit is not always obvious. It shows the power and energy characteristics for two different circuits performing the same task. While circuit A operates at significantly lower power than B, its lower performance means that it needs much more time to complete the task. So although circuit A draws less power, it does not use less energy. The high-performance circuit B, which powers up, processes the task quickly and shuts down, is therefore preferable for a battery-powered device. Low power is not necessarily low energy.
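The relationship in Figure 1 can be made concrete with a little arithmetic. The sketch below assumes constant power while each circuit is running, so that energy is simply power multiplied by time. The power and time figures are hypothetical, chosen only to reproduce the shape of the figure, and are not taken from it.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """For constant power, energy is power multiplied by time: E = P * t."""
    return power_watts * seconds

# Hypothetical figures, chosen only to illustrate the shape of Figure 1.
circuit_a = {"power_w": 0.010, "task_time_s": 1.00}  # low power, slow
circuit_b = {"power_w": 0.040, "task_time_s": 0.10}  # high power, fast

energy_a = energy_joules(circuit_a["power_w"], circuit_a["task_time_s"])
energy_b = energy_joules(circuit_b["power_w"], circuit_b["task_time_s"])

# B draws four times the power of A but finishes ten times sooner,
# so it uses less energy for the same task.
print(f"A: {energy_a * 1e3:.1f} mJ, B: {energy_b * 1e3:.1f} mJ")
```

With these assumed numbers, circuit B uses less than half the energy of circuit A despite drawing four times the power.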
2. The architecture has a first-order impact
Performance/delay, area, and design topology are directly driven by architectural and micro-architectural choices. Power implementation strategies, such as power and clock gating, are best controlled at this level too. It has been estimated that 80% of the power in designs is determined at the RTL stage or earlier. Relying on back-end tools for low power after you have solidified your architecture limits your ability to optimize the chip. You need to be architecting for low energy.
3. Effective tradeoffs require accurate estimates and rapid exploration
To make effective tradeoffs in system-level design, you need two things:
- Accurate estimates. These are gained by analyzing hardware architectures, not functional models.
- Rapid exploration. You cannot make effective choices without exploring and assessing the possibilities.
Figure 1. Two circuits, A and B, for the same operation but with different power/energy characteristics
System-level techniques for low energy
Optimal low power and energy implementation (as well as control over low-power implementation tools) is best managed by using the following three system-level techniques.
1. Task characterization
One of the early steps in developing a low-energy implementation strategy should be to characterize tasks according to processing frequency, processing time, and hardware algorithm alternatives. This is important because the strategies for implementation will depend heavily on these characteristics. So:
- For discrete tasks requiring specialized hardware, the best solution might be a high-performance architecture that can be powered down between uses to minimize static power dissipation.
- For ongoing processing or monitoring functions, the best solution might be to optimize for low power while meeting minimum performance requirements.
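The two cases above can be compared with a simple duty-cycle model. This is a minimal sketch under stated assumptions: all power, leakage and timing figures are hypothetical, the function name energy_per_period is introduced here for illustration, and the energy cost of powering a block up and down is ignored.

```python
def energy_per_period(active_w, leak_w, busy_s, period_s, power_gated):
    """Energy (J) over one task period.

    The block burns active_w while busy. While idle it leaks leak_w,
    unless it is powered down between uses (power_gated=True).
    """
    idle_s = period_s - busy_s
    idle_w = 0.0 if power_gated else leak_w
    return active_w * busy_s + idle_w * idle_s

PERIOD_S = 1.0  # hypothetical: the task arrives once per second

# Hypothetical high-performance block: 50 mW active, 5 mW leakage, done in 10 ms.
fast_gated = energy_per_period(0.050, 0.005, 0.010, PERIOD_S, power_gated=True)
fast_ungated = energy_per_period(0.050, 0.005, 0.010, PERIOD_S, power_gated=False)

# Hypothetical low-power block: 2 mW active, busy for the whole period.
slow_always_on = energy_per_period(0.002, 0.0005, 1.0, PERIOD_S, power_gated=False)

for name, joules in [("fast, gated", fast_gated),
                     ("fast, not gated", fast_ungated),
                     ("slow, always on", slow_always_on)]:
    print(f"{name}: {joules * 1e3:.2f} mJ")
```

Under these assumptions the fast block only wins if it is actually powered down between uses; left idle and leaking, it is the worst of the three. For a function that is busy all the time there is no idle interval to gate, which is why the lowest active power that still meets the performance requirement becomes the target.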
2. Architecting and micro-architecting
Figure 2. Two different micro-architecture implementations of the same LPM architecture
With the task characterizations in mind, each major intellectual property (IP) block can be architected and micro-architected. This is where the biggest impact can be made. To illustrate this, consider two implementations of the Longest Prefix Match (LPM) algorithm for Internet Protocol (IP) address lookup for packet processing.
Each IP address goes through one to three memory lookups to identify the packet’s forwarding port. Both implementations use the same simplistic architecture, which schedules three lookups per packet regardless of whether the packet requires them. But the two implementations, as illustrated in Figure 2 and Figure 3, use dramatically different amounts of logic to accomplish the same task in the same amount of time.
Without exploring different approaches, you could easily have ended up settling on the first option, even though this would dissipate significantly more power without providing any obvious advantage. And modeling by itself would never identify this significant difference; it can only be assessed by characterizing the implementation of each approach.
Architecture affects the amount of logic, the delay and the clock speed required to process tasks, with impacts that can run to orders of magnitude.
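For readers unfamiliar with LPM, the following is a minimal software sketch of a lookup that takes between one and three table accesses. It is not the hardware design from the figures. The 16/8/8 split of the 32-bit address, the restriction to /16, /24 and /32 prefixes, and all names (Node, insert, lookup) are assumptions made for illustration. Unlike the implementations described above, which always schedule three lookups, the sketch stops as soon as no deeper table exists, so it shows the one-to-three range directly.

```python
import ipaddress

class Node:
    """One table entry: an optional forwarding port and an optional next-level table."""
    def __init__(self):
        self.port = None
        self.children = None  # dict: index -> Node

def chunks(addr: int):
    """Split a 32-bit address into three table indices (assumed 16/8/8 split)."""
    return (addr >> 16, (addr >> 8) & 0xFF, addr & 0xFF)

def insert(root: dict, prefix: str, port: int) -> None:
    """Add a route. Only /16, /24 and /32 prefixes are supported in this sketch."""
    net = ipaddress.ip_network(prefix)
    depth = {16: 1, 24: 2, 32: 3}[net.prefixlen]
    idx = chunks(int(net.network_address))
    table = root
    for level in range(depth):
        node = table.setdefault(idx[level], Node())
        if level == depth - 1:
            node.port = port
        else:
            if node.children is None:
                node.children = {}
            table = node.children

def lookup(root: dict, address: str):
    """Return (port, memory_lookups), remembering the longest match seen so far."""
    best, table, lookups = None, root, 0
    for index in chunks(int(ipaddress.ip_address(address))):
        lookups += 1
        node = table.get(index)
        if node is None:
            break
        if node.port is not None:
            best = node.port
        if node.children is None:
            break
        table = node.children
    return best, lookups

routes = {}
insert(routes, "10.1.0.0/16", 1)
insert(routes, "10.1.2.0/24", 2)
insert(routes, "10.1.2.3/32", 3)
insert(routes, "192.168.0.0/16", 4)

for address in ["192.168.7.7", "10.1.9.9", "10.1.2.9", "10.1.2.3"]:
    print(address, lookup(routes, address))
```

Addresses under 192.168.0.0/16 resolve in a single access, while addresses under 10.1.0.0/16 need two or three because more specific routes exist beneath that prefix.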
3. Applying adaptive architectures
In addition to picking the best architecture and micro-architecture for a given set of requirements, there are additional layers of implementation that enable a design to further reduce power usage. We will call these techniques adaptive, in that they dynamically optimize the implementation based on the temporal context.
Figure 3. Area and speed results for the two implementations of the same LPM architecture
On a macro level, clock gating and power gating are adaptive techniques because they allow an isolated set of resources to power down when not required. But there is also a host of adaptive, architecture-based implementation approaches that dynamically optimize power usage while resources are still active. The goal is to build cache structures, not just in the traditional sense of a processor cache, but anywhere that repetitive behavior allows pre-computed, stored results to replace the use of costly hardware resources.
Examples from a processor design might include:
- Post-decode trace caching
Instruction decode and control can encompass a lot of logic and, therefore, require a lot of power. This matters most for embedded applications that repeatedly execute the same routines: the decode traces for repeated instruction flows can be cached, allowing the decode logic to be statically isolated whenever a trace match occurs.
- Adaptive cache policies
For streaming applications, where there is little locality of data, you may not benefit from operating all of the data cache and may want to power down a portion or all of it.
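The post-decode trace caching idea can be illustrated with a toy software model that counts how often the full decode logic has to be exercised. This is a sketch under loud assumptions: the 16-bit instruction encoding, the class name DecodeStage and the instruction words are all hypothetical, and the cache is keyed on the instruction word for simplicity, whereas a real trace cache would more likely be keyed on the program counter or a trace identifier. The energy cost of the cache lookup itself is not modeled.

```python
class DecodeStage:
    """Toy model: count how often the full decode logic has to be exercised."""

    def __init__(self, use_trace_cache: bool):
        self.use_trace_cache = use_trace_cache
        self.trace_cache = {}          # instruction word -> decoded fields
        self.decoder_activations = 0

    def _full_decode(self, word: int) -> tuple:
        self.decoder_activations += 1
        # Hypothetical 16-bit encoding: 4-bit opcode and three 4-bit register fields.
        return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF

    def decode(self, word: int) -> tuple:
        if self.use_trace_cache and word in self.trace_cache:
            return self.trace_cache[word]   # hit: the decoder can stay isolated
        decoded = self._full_decode(word)
        if self.use_trace_cache:
            self.trace_cache[word] = decoded
        return decoded

loop_body = [0x1234, 0x2345, 0x3456, 0x4567]  # hypothetical instruction words
program = loop_body * 100                     # the same routine executed 100 times

plain, cached = DecodeStage(False), DecodeStage(True)
for word in program:
    # Both stages must produce identical control signals.
    assert plain.decode(word) == cached.decode(word)

print(plain.decoder_activations, cached.decoder_activations)  # prints "400 4"
```

In this model the decoder is exercised once per distinct instruction rather than once per executed instruction, which is the repetitive behavior the technique relies on.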
ESL synthesis enables low-power and low-energy design
Let us now explore how ESL synthesis both automates and facilitates many of the techniques that we have discussed.
Integrated modeling and implementation
Architects need immediate feedback with accurate details. Architectures explored at an artificially high level obscure important details that have a material impact on final performance. Modeling that is not tied directly to the hardware architecture detaches the engineer from the details required to understand the energy implications of different choices. ESL synthesis links modeling with implementation. It provides an environment that synthesizes directly into hardware from multiple levels of abstraction, while allowing the reuse of design components throughout.
Architectural exploration
If you can explore the design space quickly and correctly for changes in topology, architecture, micro-architecture and even implementation, you can test the implications of ‘what-if’ scenarios to identify the optimal approach for minimal energy.
When coding RTL in Verilog or VHDL, design engineers have complete control over every nuance of the design. Theoretically, this means that the resulting implementations will be optimal. Unfortunately, modifying (and re-verifying) the RTL in order to perform a series of ‘what-if’ evaluations on alternative micro-architectures is difficult, time-consuming, and vulnerable to error. As a result, the micro-architecture you start out with is often the one you end up with. But different micro-architectures can return dramatically different results, as demonstrated in the LPM example above.
ESL synthesis introduces new ways of managing complex concurrency and streamlining design composition, while retaining the designer’s control over the architecture and micro-architecture much as he or she would have at the RTL stage. However, because the design is approached at the system level, change is much easier, while extensive static checking, automatic formal interface contracts and operation-centric design specifications ensure the design’s correctness.
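Once estimates for several candidate micro-architectures are available, the ‘what-if’ comparison itself is mechanical. The sketch below assumes each candidate has already been characterized with a power and task-time estimate; the candidate names, the numbers and the 4 ms deadline are hypothetical and do not come from any tool or from the figures above. It keeps the candidates that meet the deadline and picks the one with the lowest energy per task.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    power_mw: float      # estimated average power while processing the task
    task_time_ms: float  # estimated time to complete the task

    @property
    def energy_uj(self) -> float:
        # mW * ms = microjoules
        return self.power_mw * self.task_time_ms

# Hypothetical estimates for three micro-architectures of the same block.
candidates = [
    Candidate("uarch_1", power_mw=40.0, task_time_ms=2.0),
    Candidate("uarch_2", power_mw=12.0, task_time_ms=2.0),
    Candidate("uarch_3", power_mw=5.0, task_time_ms=6.0),
]

DEADLINE_MS = 4.0  # hypothetical performance requirement

feasible = [c for c in candidates if c.task_time_ms <= DEADLINE_MS]
best = min(feasible, key=lambda c: c.energy_uj)

for c in candidates:
    status = "ok" if c in feasible else "misses deadline"
    print(f"{c.name}: {c.energy_uj:.0f} uJ ({status})")
print("selected:", best.name)
```

The selection is only as good as the estimates fed into it, which is why the accuracy of the characterization matters as much as the speed of the exploration.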
Automated clock implementation and formal clock verification
ESL synthesis has integrated clock management and formal clock connectivity verification for multi-clock domain (MCD) support. This capability validates proper MCD implementations during synthesis. Gated-clock implementations are automated for power management, identifying and managing interface communications between active and inactive clock domains. Complex topology implementations are simplified.
Powerful parameterization
IP is ideally used more than once. However, a block built for one application may not be optimally built for another in terms of power and energy usage. Typically, completed IP must be used as is, even if it contains sub-optimal choices for algorithms or resource sizes. Alternatively, the IP must be redesigned or enhanced by someone unfamiliar with the original implementation. ESL synthesis has a high degree of parameterization, enabling a single design to be delivered where the customization of resources and algorithms is deferred.
Delivering low-energy silicon
Too often, engineers do not treat architecture and design as first-order factors influencing the overall performance of a chip. Typically, with RTL design, there is only one opportunity to set the architecture and the implementation. So, design teams must depend on back-end tools, libraries and processes to optimize power and energy characteristics. Unfortunately, by that point, the power and energy ‘tone’ of the chip has largely been set.
System-level techniques using ESL synthesis provide the best control over low-power and low-energy implementation by:
- Integrating modeling and implementation;
- Enabling rapid architectural exploration at the system level, not just within an algorithmic block;
- Automating clock implementation and verification;
- Providing powerful parameterization for derivatives and follow-ons.
Bluespec Inc.
200 West Street
3rd Floor
Waltham
MA 02451
USA
T: +1 781 250-2200
http://www.bluespec.com