How AMD implemented efficient clock gating analysis for Jaguar

By Steve Kommrusch, AMD | Posted: April 9, 2013

The chipmaker used Calypto’s PowerPro to carry out power analysis of its latest core design at the RTL rather than at post-gate synthesis.

This paper describes how AMD used Calypto’s PowerPro suite to improve efficiency in clock gating for its Jaguar core, and shares the results and advantages of doing power analysis at the RTL stage rather than waiting until post-gate synthesis.

Overview

The flow and tool had these key features:

The RTL analysis could run over the weekend and analyze key power benchmark tests.
The output format was easy to parse and summarize for designer use.
Recommended improvements had value as suggestions and showed possible optimizations.
The correlation between active clock count and total power used was good.
Even given improvements in instructions per clock cycle (IPC) and frequency, PowerPro helped achieve an approximately 20% reduction in typical dynamic application power compared to an already-tuned low-power X86 CPU.

Introduction

Lowering the power consumption of consumer products and networking centers is an important design consideration. The same goes for many of the processor cores that go into these devices.

For its new Jaguar X86 core, AMD wanted to improve on the previous generation with faster performance in a given power envelope, higher frequency at a given voltage, and improved power efficiency through clock gating and unit redesign. The AMD low-power core design team used the Calypto PowerPro power analysis solution to analyze RTL clock-gating quality, find opportunities for improvement, and generate reports the engineering team could use to decrease the operating power of the design.

Because PowerPro analyzes pre-synthesis RTL, it can be run more often and analyze a larger number of simulation cycles more quickly and with fewer machine resources than tools that use synthesized gates. AMD selected a suite of 39 tests, which included a maximum-power condition (making as much of the core active at once as possible), a halt case (no instructions or interrupt activity occurring), and several actual application code snippets. The focus on clock gating and the quick turnaround of RTL analysis allowed AMD to achieve measurable power reductions for typical applications of the Jaguar core.

Low power design

The AMD Jaguar X86 core is a flexible, high-frequency processor aimed at system-on-chip designs for low-power markets and cloud clients. It uses a 28nm process technology and has a small die area (3.1 mm²).

Compared to the previous generation of this core, Bobcat, many blocks were redesigned for improved power efficiency, including the IC loop buffer, store queue, and L2 clocks.

The AMD Jaguar compute unit (CU) includes four independent Jaguar cores and a shared-cache unit with four L2 databanks and an L2 interface tile. The L2 interface block runs at the core clock speed. The L2 databanks run at half-clock to save power and are clocked only when required, reducing power further.

Figure 1
AMD Jaguar compute core architecture (Source: AMD)

Design goals included increases in the frequency and IPC, so designers were doing timing work and trying to minimize the gates between flops. In a few cases, they had to add flops to create more of a pipeline design. The goal at the start of the project was to lower typical application power by 10%. Ultimately, using a design methodology that included deployment of PowerPro, AMD was able to lower the typical power by approximately 20% while increasing frequency at the given voltage by over 10%.

Power analysis/clock gating methodology

In AMD’s overall design flow, engineering managers would pick a tag from which to do synthesis at selected intervals. A snapshot of the relevant RTL code would run through PowerPro. This could be done over the weekend, so AMD’s designers would have results on Monday morning.

Meanwhile, the gates team could carry out synthesis, placement, routing, and gate simulations for the same tag. This process would typically take several weeks.

Because PowerPro can analyze the RTL in a matter of hours, AMD could run weekend regressions to make sure all of the simulations passed and conduct power analysis of the RTL design very quickly. This helped increase clock-gating efficiency by iteratively adjusting the existing clock gates based on PowerPro’s recommendations. The weekend regressions allowed the rapid analysis of design alternatives, resulting in significant performance and power improvements, including optimizations that could not have been done at the gate level or that may not have been detected and targeted without the PowerPro reports.

To run a typical PowerPro analysis of a given snapshot of RTL, the following steps were completed using short AMD-internal scripts.

1. Run ‘builtIt’ script

Checks out the RTL view from the Perforce design repository.
Builds simulation model using pre-processor scripts and VCS.
Builds pre-synthesis view of the RTL code using pre-processor scripts.

2. Run ‘simIt’ script

Runs 39 tests using LSF to spawn jobs out to simulation farm machines.
Captures FSDB data that starts and ends at an instruction-count boundary.
Converts FSDB to switching activity interchange format (SAIF) files used by PowerPro.

3. Run ‘powerProIt’ script

Reads in IP.f and run.tcl files for each block and SAIF files for each simulation set.
Uses LSF to spawn PowerPro jobs out to simulation farm machines.
Creates output report directories and files with improved clock gating for review.

4. Run ‘sum.pl’ script

Analyzes PowerPro outputs and organizes results into summary tables to help track clock-gating improvements month to month and per IP.
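The four steps above run strictly in sequence, so a weekend run could be driven by a small wrapper along the following lines. This is only a sketch: the article does not describe how the AMD scripts were invoked, so the `run_flow` and `shell_runner` helpers, the absence of command-line arguments, and the stop-on-first-failure policy are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# Script names as given in the text. Their arguments are not described
# there, so this sketch assumes each one is invoked with none.
STEPS = ["builtIt", "simIt", "powerProIt", "sum.pl"]


def run_flow(steps: Sequence[str], runner: Callable[[str], int]) -> List[str]:
    """Run steps in order and return the ones that completed.

    Stops at the first step whose runner reports a non-zero exit status,
    since later steps depend on the outputs of earlier ones.
    """
    completed: List[str] = []
    for step in steps:
        if runner(step) != 0:
            break
        completed.append(step)
    return completed


def shell_runner(step: str) -> int:
    """Hypothetical runner: execute the script by name and return its status."""
    return subprocess.run([step]).returncode
```

Calling `run_flow(STEPS, shell_runner)` would launch the real scripts; passing a stub runner instead makes the ordering logic easy to check without them.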

In AMD’s Bobcat and Jaguar methodologies, designers inserted architectural clock gating into the RTL. AMD chose this path, as opposed to options such as automatic clock-gating, largely due to high-speed timing path concerns. The designer of a high-speed CPU block is often in a better position to know which signals are available in time to gate a clock than an RTL analysis tool. AMD’s Calypto application engineer created a short, 20-line module describing AMD’s clock-gating cell; this was used to build the pre-synthesis model that PowerPro analyzed.
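The clock-gating cell module itself is not reproduced in this article. As a rough illustration of what such a cell typically does, the sketch below models a generic latch-based clock gate in Python: the enable is sampled while the clock is low, and the gated clock is the AND of the clock and the latched enable. This is a textbook structure and not AMD's cell; the function name and the one-sample-per-half-cycle encoding are assumptions.

```python
from typing import List, Sequence


def gated_clock(clk: Sequence[int], en: Sequence[int]) -> List[int]:
    """Behavioral model of a generic latch-based clock-gating cell.

    clk and en hold one 0/1 sample per half clock cycle. The latch is
    transparent while clk is 0 and holds its value while clk is 1, so a
    change on en during the high phase cannot glitch the gated clock.
    """
    latched = 0
    out: List[int] = []
    for c, e in zip(clk, en):
        if c == 0:
            latched = e  # latch follows enable during the low phase
        out.append(c & latched)
    return out


# The enable rises during the second high phase (index 3) but was low when
# last sampled, so that pulse is suppressed. It falls during the third high
# phase (index 5) but was high when sampled, so that pulse is delivered.
pulses = gated_clock([0, 1, 0, 1, 0, 1], [1, 1, 0, 1, 1, 0])
```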

To optimize AMD’s use of PowerPro, AMD designers added clock gating artificially onto the JTAG clock. This allowed AMD to focus on other areas where optimizations were more likely to be helpful. This RTL adjustment was a four-line Perl script called by the ‘builtIt’ script that was run prior to building the simulation and PowerPro models. The PowerPro model was the same as our pre-synthesis model. It did not include certain verification pieces that were included in the simulation model.

Once the models were built, simulations were started. PowerPro uses SAIF files to track gate efficiency and gather temporal information about the design. AMD devised a way for the simulations to enable SAIF tracking, starting and ending at specific instruction counts. This was much preferred over starting and ending at a given simulation cycle number or at a given time because it allowed the key instruction sequence to be analyzed from week to week even as designers made improvements that caused the test to execute more instructions per clock cycle.
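To see why an instruction-count boundary is more stable than a cycle boundary, consider the sketch below. It is not AMD's mechanism, which the article does not detail; it simply assumes a per-cycle trace of retired-instruction counts and finds the cycles that cover a fixed range of instructions. The function name and trace format are hypothetical.

```python
from itertools import accumulate
from typing import Sequence, Tuple


def capture_window(retired_per_cycle: Sequence[int],
                   start_instr: int, end_instr: int) -> Tuple[int, int]:
    """Return (first_cycle, last_cycle) covering the given instructions.

    retired_per_cycle[i] is the number of instructions retired in cycle i.
    Instructions are numbered from 1 by cumulative retire count.
    """
    first = None
    for cycle, total in enumerate(accumulate(retired_per_cycle)):
        if first is None and total >= start_instr:
            first = cycle
        if total >= end_instr:
            return first, cycle
    raise ValueError("trace ends before the requested instruction window")


# Same instruction sequence before and after an IPC improvement.
older_rtl = [1, 0, 1, 0, 1, 0, 1, 0]  # one instruction every two cycles
newer_rtl = [1, 1, 1, 1]              # one instruction every cycle
window_old = capture_window(older_rtl, 2, 4)
window_new = capture_window(newer_rtl, 2, 4)
```

The two windows cover different cycles but the same instructions 2 through 4, which is what keeps week-to-week comparisons meaningful. A fixed cycle range would have captured different work in each snapshot.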

After the simulations were completed, 39 test cases were run on PowerPro. Important cases included the halt test, virus test, and 17 tests in the ‘AppTyp’ group. Synthetic and ‘special’ groups for clock-gating analysis of certain modes of interest were also deployed. AMD initially focused design work and PowerPro improvement runs on the ‘cpu_halt’ test because it is one of the more straightforward tests to optimize. A halt case occurs when the CPU micro-code has no more instructions to execute and halts the entire core. For this test, we left on the root clock to reveal anything that was not clock-gated. Because by definition nothing should be clocked during this scenario, looking at the halt case allowed the team to determine which flops were being clocked and how to gate them.
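The halt-case check reduces to a simple filter: any flop that receives a clock edge while the core is halted is a gating candidate. The sketch below assumes the activity data has already been reduced to a mapping from flop name to clock-edge count. That format and the flop names are hypothetical and do not reflect PowerPro's actual report layout.

```python
from typing import Dict, List


def ungated_in_halt(clock_edges: Dict[str, int]) -> List[str]:
    """Return the flops that saw at least one clock edge during halt."""
    return sorted(name for name, edges in clock_edges.items() if edges > 0)


# Hypothetical activity counts for a halt-test capture window.
halt_activity = {
    "blk_a/state_q": 0,    # correctly gated
    "blk_a/cnt_q": 1200,   # clocked every cycle: needs a gate
    "blk_b/valid_q": 3,    # occasionally clocked: worth a look
}
suspects = ungated_in_halt(halt_activity)
```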

As the design matured, AMD focused on the typical application suites for the design (the ‘AppTyp’ group) to improve the power efficiency of the lab benchmarks. Accomplishing these improvements was more challenging. It was not a matter of whether a flop was on or off, as with the halt case; instead, it involved slight reductions in activity. For example, PowerPro may report that a flop active 20% of the time needs to be active only 15% of the time. To help with these optimizations, PowerPro provides suggested RTL clock gating. This can be used as-is or can inspire further improvements based on the team’s knowledge of the design. For example, it may be apparent that an entire section of a block is not needed at a given time.
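The 20% versus 15% example can be expressed as a fraction of that flop's clock activity that is unnecessary. The helper below is an illustrative calculation only; its name and the way the two figures are combined are assumptions, not a metric defined by PowerPro.

```python
def wasted_clock_fraction(clocked: float, needed: float) -> float:
    """Fraction of a flop's clock activity that could be gated away.

    clocked: fraction of cycles in which the flop is currently clocked.
    needed:  fraction of cycles in which it actually has to be clocked.
    """
    if not 0.0 <= needed <= clocked <= 1.0:
        raise ValueError("expected 0 <= needed <= clocked <= 1")
    if clocked == 0.0:
        return 0.0
    return (clocked - needed) / clocked


# The example from the text: clocked 20% of the time, needed 15%.
wasted = wasted_clock_fraction(0.20, 0.15)
```

For the figures in the text this works out to roughly a quarter of the flop's clock activity, even though the absolute change is only five percentage points.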

After optimizing the design based on the ‘cpu_halt’ and ‘AppTyp’ tests, AMD ran a virus pattern. This is an in-house snippet of code that makes the vast majority of the core active, including floating-point computations, cache misses, and memory fetches. Although normal applications would not impose such demands, it can be used to set maximum instantaneous power requirements.

After PowerPro completed these tests, a final in-house script was executed that parsed through all the report files and created tables. The script processed the PowerPro reports and presented the results to track progress and identify further opportunities for power improvements.

The goal was to reduce the number of flops that are clocked; reducing the number of flops or improving clock gating manifests as an improvement in the PowerPro analysis. The PowerPro analyses, and the RTL recommendations suggested therein, could be used as an alternative to PTPX roll-ups.

The PowerPro results revealed how efficient the design could be. For example, when PowerPro suggested a way to eliminate 20% of the gated activity, AMD’s engineers could review the pre-optimization report and see which flops were less efficient. Or they could look at the post-optimization report after the PowerPro-recommended improvements were made; if PowerPro determined that the design could use 10% fewer clocked flops, an engineer could examine the design to see which 10% were eliminated.

The result was that the actual power, as shown in the monthly or bi-monthly PTPX reports, was lower because the designers were making design adjustments based on the week-to-week reports from PowerPro.

PowerPro results

The first test AMD ran was ‘cpu_halt’, because it offered one of the easier ways to make significant improvements in clock gating, for the reasons explained earlier. Figure 2 shows a snapshot of the clock-gating improvement process as tracked by PowerPro. It covers 13 blocks that were leveraged between Bobcat and Jaguar. Because PowerPro helped the team track progress frequently, even as functionality and timing work continued, the team was able to drive down active clock counts dramatically during product development.

Figure 2
Clock-gating improvements based on ‘cpu_halt’ regressions (Source: AMD)

The ‘cpu_halt’ test was also run after adding a new block (the shared L2 cache controller) to the design that was not leveraged from the previous core. The significant drop in activity seen in Figure 3 from Month 3 to Month 4 shows a point at which the functionality of the new block was nearly complete and design work began focusing on power concerns.

Figure 3
Average clocked flops after adding ‘newblock’: the shared L2 cache controller (Source: AMD)

AMD then ran various applications on PowerPro (Table 1). The goal was to minimize the average number of flops clocked each cycle by optimizing away flops or improving clock-gating efficiency. Designers could look at RTL as-is flop-efficiency details as well as recommended improvements for gating efficiency. The design owner’s name was associated with each block to establish a clear assignment of responsibility for reviewing and improving clock-gating results.
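A per-block summary of this kind can be sketched as follows. The record fields, block name, and averaging scheme are assumptions for illustration; the article does not give the layout of the PowerPro reports or of the in-house summary script.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Result(NamedTuple):
    block: str                # hypothetical block name
    test: str                 # test the figures were measured on
    total_flops: int          # flops in the block
    avg_clocked_flops: float  # average flops clocked per cycle


def summarize(results: Iterable[Result]) -> Dict[str, float]:
    """Percentage of flops clocked per cycle, averaged over tests, per block."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for r in results:
        sums[r.block] += 100.0 * r.avg_clocked_flops / r.total_flops
        counts[r.block] += 1
    return {block: sums[block] / counts[block] for block in sums}


# Placeholder figures, not measured data.
table = summarize([
    Result("blk_a", "test_1", 1000, 100.0),
    Result("blk_a", "test_2", 1000, 200.0),
])
```

Attaching an owner to each block would then be a matter of joining this table against a block-to-owner mapping.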

Table 1
Summary of PowerPro ‘AppTyp’ results (NB: ‘newblock’ is not part of CPU core total) (Source: AMD)

Table 2 shows the result of our clock-gating efforts on the AMD Jaguar core. For typical applications, even though IPC improved from one core to the next, the percentage of active flops decreased by approximately 25%.

Table 2
Comparison of clock-gating improvements (NB: % of flops active is approximate) (Source: AMD)

Post-synthesis power results

At the same frequency, Bobcat and Jaguar have similar maximum power levels for the virus case (due to timing work, Jaguar can run at lower voltages for the same frequency, but it also has higher IPC architecturally). The power shown in Table 3 is all CPU power, including active mesh power and array power outside the purview of our current PowerPro flow. All power results shown are approximate dynamic power values scaled relative to the virus power for Bobcat. For example, according to the PTPX estimation, Jaguar takes 10% less dynamic power than Bobcat (0.9 vs. 1.0) at the same frequency while providing slightly higher IPC on the same Bobcat virus code.

Table 3 shows that the 25% reduction in active flops for typical applications carried through to about a 21% power benefit in gate-level, post-route power analysis runs (7.7/10.3 = 0.75 and 0.58/0.73 = 0.79).
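The two ratios quoted above convert to percentage reductions as shown in this short check. It assumes, as the text implies, that the numerator of each ratio is the Jaguar figure and the denominator is the Bobcat figure; the helper name is illustrative.

```python
def reduction(new: float, old: float) -> float:
    """Fractional reduction going from old to new."""
    return 1.0 - new / old


# Percentage of flops active for typical applications (PowerPro).
flop_reduction = reduction(7.7, 10.3)
# Scaled dynamic power for typical applications (PTPX).
power_reduction = reduction(0.58, 0.73)

flop_pct = round(100 * flop_reduction)
power_pct = round(100 * power_reduction)
```

Rounded to whole percentages, these give the 25% and 21% figures cited in the text.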

Table 3
Comparison of power improvements (NB: % of flops active is approximate) (Source: AMD)

These PTPX runs included accurate gate and wire capacitance for the actual tape-out netlist. But, as noted previously, getting an accurate PTPX result can take several weeks, because it requires that the design be synthesized and routed through a back-end flow that is capable of achieving the high frequencies at which the AMD Jaguar cores can run.

Although the overall average benefit seen by PTPX was very similar to PowerPro, at the block level we saw about a +/-30% variance. That is, some blocks had more power per clocked flop than others, which would be expected because some blocks tend to have more combinatorial logic than others and the version of PowerPro used here did not estimate combinatorial capacitive load.

The general PTPX results demonstrate that using PowerPro as a quick estimate for power work was useful. Also, based on Bobcat silicon results, reasonable correlation (+/-10%) between silicon and PTPX results has been observed.

Future use

The Calypto team has various improvements underway to make the PowerPro tool more accurate and augment its ability to find power savings. The current tool is already valuable for doing early power estimates based on investigative RTL work. For example, if a particular market is focused on a different typical application mix, the design team could estimate the power savings achievable by making RTL changes and running test PowerPro analyses using appropriate tests and simulations. Even without PTPX gate simulations, an accurate estimate of the potential power savings can be used for product planning.

AMD may continue to explore other PowerPro features that have promise in helping lower the power of future designs. PowerPro data-gating analysis can find large combinatorial logic cones that can be prevented from toggling when the result is not needed. It also has a quick-synth mode that estimates the capacitive load behind flops and makes timing-related suggestions for clock gating. Calypto has improved the signal names for its suggested clock-gate improvements, which should make PowerPro clock-gating suggestions easier to analyze and incorporate.

About the author

Steve Kommrusch is a Fellow Design Engineer at AMD.

Contact

Calypto Design Systems
1731 Technology Drive
Ste 340
San Jose
CA 95110
USA
T: +1 (408) 850-2300
W: www.calypto.com