Addressing SoC performance challenges in advanced deep-submicron CMOS processes
The article describes a novel optimization approach that extends leading methodologies to improve performance, power and area. It couples a pre-generated cell library, which extends commercially available foundry libraries, with novel logic optimization, with the aim of delivering near full-custom performance levels.
The approach assesses the gate-level netlist generated by RTL synthesis tools using commercially available foundry cell libraries and selects the best-suited additional cells from the matching design-specific extension library. It then creates directives for synthesis optimization that introduce cells with better drive strengths and skew variants, as well as combined and complex cells, from the extension library to significantly improve the design’s performance.
This methodology has demonstrated improvements in performance of 15-20%, cuts in power consumption of up to 25% and reductions in die area of 15-20%.
The SoC performance challenge
Electronic products such as network infrastructure devices, mobile phones, laptops, tablets and other consumer gadgets contain system-on-chips (SoCs) that must meet increasingly aggressive performance, power and cost targets.
Recent CMOS processes at 40nm, 28nm and below have the potential to deliver these sought-after performance gains; but achieving the maximum possible speed, while consuming the minimum area and dynamic/static power, remains a daunting challenge.
A significant limitation of today’s RTL-to-GDSII design methodology is the nature of the digital cell IP used. Standard-cell libraries, the fundamental building blocks of common digital design methodologies, have been designed to employ a limited number of library cells to minimize cell characterization costs and reduce design tool run times. Digital logic implementation tools have continued to evolve without exploring any changes to this practice, which means that the performance available using traditional synthesis and standard-cell libraries alone continues to be limited.
You can achieve significantly better performance by adopting a full-custom design strategy; but, in the great majority of cases, this is too expensive, too time-consuming and insufficiently scalable to meet the demands of present-day product development cycles.
Instead, this article proposes a methodology that combines a novel design optimization technique with a new generation of cell IP libraries containing very large numbers of cell variants.
Limitations
For a given design architecture, its RTL description, the choice of process technology and the accompanying standard-cell library, there is a practical limit to achievable performance.
When RTL is synthesized using a set of performance constraints, the optimization process will draw upon a foundry’s standard-cell library containing logical and physical representations of basic logic functions that have been designed and characterized for the chosen process.
When all the ‘knobs’ on the logic synthesis and physical implementation tools have been turned to their highest settings, the design will generally reach its performance limit. To increase performance further, RTL code or architectural changes must be made, a process that is often costly in terms of implementation effort, verification time and risk. For instance, it might be necessary to insert an extra level of pipelining.
The role of custom design
RTL-to-GDSII design flows have been honed to a remarkable level of sophistication, but they are constrained by the coarse granularity of the underlying logic component libraries they rely on to build the circuits described in the RTL. These libraries contain from a few hundred to a few thousand cells and are provided by the foundry or by commercial standard-cell library vendors.
If the SoC designer cannot achieve sufficiently high performance within the required power budget using the traditional standard cell-based design methodology, an alternative approach is to apply custom design techniques to one or more key blocks in the SoC.
Designs with embedded CPU or DSP engines are good targets for a custom design flow, as the CPU or DSP performance directly limits the end application performance. Custom blocks may be implemented using special handcrafted cells or larger structures, which may be the only path to achieving the required speed gain or area/power reduction. Certain combinational cells, when used in sufficient quantity within critical blocks of the design, can be manually adapted with transistor-level optimization to increase the performance of the block in question.
Identifying which cells to hand craft is critical to minimizing overhead in additional cell construction, characterization and deployment. Custom design can be a prohibitively expensive trade-off, even when a suitably qualified team is available to perform the work. And when performance issues are more general in nature, rather than confined to a well-structured block, manual transistor-level optimizations are usually not an option.
Scalable solution to the performance problem
The strategy we propose enables a broad range of design teams to take advantage of extended cell libraries, without the need for a dedicated IP development team.
Instead of just adding a few so-called ‘tactical cells’ to existing libraries, this approach is based on a set of pre-generated cell libraries containing comprehensive and fine-grained extensions to existing foundry cell libraries. Together with novel logic optimization technologies that are tuned to take best advantage of these extended libraries, the technique can be used as an adjunct to established design flows.
The key to getting improved performance from a standard cell-based RTL design methodology is a concept we call ‘application-specific library sub-setting’. This technique is employed after an initial gate-level netlist for the design has been produced using RTL synthesis with the preferred base cell library.
During the normal design optimization phase, following gate mapping, a new type of logic optimization tool is used to analyze the design netlist and calculate a set of better digital cells to use for implementing the logic. This cell set includes a select set of cell functions, drive strengths and rise/fall skew (beta ratio) variants that will likely offer improved performance for the target design.
The chosen set of cells is then extracted from the large pre-validated extension library and is introduced into the gate-level netlist via the optimization tool. The cell selection is tuned to the requirements of the design, aiming to complete optimization while minimizing tool run times. This optimization loop may be performed iteratively using a variety of possible strategies, allowing the design performance to improve incrementally until target performance has been reached.
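To make the sub-setting step concrete, the sketch below shows, in Python, one way a post-synthesis optimizer might pick extension cells for instances with negative slack. The data structures, field names and the simple upsizing heuristic are illustrative assumptions made for this article; they are not NanGate’s actual data model or selection algorithm, which works from full timing reports and characterized library data.

```python
from dataclasses import dataclass

@dataclass
class CellInstance:
    name: str          # instance name in the gate-level netlist
    function: str      # logical function, e.g. "NAND2"
    drive: int         # current drive strength, e.g. 1 for an X1 cell
    slack_ps: float    # worst timing slack at this instance, in picoseconds

@dataclass
class Variant:
    function: str
    drive: int
    skew: str          # "balanced", "rise" or "fall" (beta-ratio variant)

def select_extension_subset(netlist, extension_lib, slack_threshold_ps=0.0):
    """Pick extension-library cells likely to help instances with negative slack.

    Only the selected subset is exposed to the next optimization pass,
    keeping the working library small so tool run times stay manageable.
    """
    subset = set()
    for inst in netlist:
        if inst.slack_ps >= slack_threshold_ps:
            continue  # instance is not on a failing path
        # Candidates: same logic function, higher drive strength than today.
        candidates = [v for v in extension_lib
                      if v.function == inst.function and v.drive > inst.drive]
        # Take the smallest upsizing step first to limit area and power cost.
        for v in sorted(candidates, key=lambda v: v.drive):
            subset.add((v.function, v.drive, v.skew))
            break
    return subset

# Example: a NAND2 at drive X1 with negative slack picks up the X2 variant.
print(select_extension_subset(
    [CellInstance("u7", "NAND2", 1, -35.0)],
    [Variant("NAND2", 2, "balanced"), Variant("NAND2", 4, "balanced")]))
# -> {('NAND2', 2, 'balanced')}
```

In the real flow, the resulting subset would be written out as directives for the synthesis and place-and-route tools, and the analyze/select/re-optimize loop repeated until the timing target is met.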
Optimization continues through the place-and-route stage to ensure gains achieved during front-end optimization are preserved. Additional cell types may be specified at each optimization step and tuned to the design metric to be improved: speed, area, power, or a combination of them. In-place-optimization (IPO) methodologies can also be employed, even at late stages of post-detail routing: footprint-compatible cells can be automatically specified to reduce leakage power without disturbing the routing.
The availability of these footprint-compatible cells also improves timing closure productivity in the backend, and in many cases makes re-extraction of the design’s routing parasitics unnecessary—an often time-consuming process.
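The footprint-compatible swap lends itself to a similarly simple illustration. The sketch below, again with invented names and figures, replaces cells that have ample timing margin with same-footprint, lower-leakage variants so that placement and routing are untouched; it is a toy model of the idea rather than the tool’s implementation.

```python
def inplace_leakage_swap(placed_instances, footprint_groups, min_slack_ps=20.0):
    """Swap cells that have timing margin for footprint-compatible,
    lower-leakage variants, leaving placement and routing untouched.

    placed_instances: dicts with keys "name", "footprint", "slack_ps"
    footprint_groups: footprint id -> list of variant dicts with keys
                      "cell_name", "extra_delay_ps", "leakage_nw"
    """
    swaps = {}
    for inst in placed_instances:
        if inst["slack_ps"] < min_slack_ps:
            continue  # not enough margin to absorb a slower cell
        variants = footprint_groups.get(inst["footprint"], [])
        # Keep only variants whose added delay still fits in the slack.
        viable = [v for v in variants if v["extra_delay_ps"] <= inst["slack_ps"]]
        if viable:
            best = min(viable, key=lambda v: v["leakage_nw"])
            swaps[inst["name"]] = best["cell_name"]
    return swaps

# Example: a NAND2 with 80 ps of margin moves to a high-Vt variant with the
# same footprint (all figures are made up for illustration).
print(inplace_leakage_swap(
    [{"name": "u42", "footprint": "NAND2_FP1", "slack_ps": 80.0}],
    {"NAND2_FP1": [{"cell_name": "NAND2_HVT_X1",
                    "extra_delay_ps": 30.0, "leakage_nw": 2.0}]}))
# -> {'u42': 'NAND2_HVT_X1'}
```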
Flow components
The methodology uses NanGate MegaLibrary and NanGate Design Optimizer to enhance traditional RTL-to-GDSII flows and delivers higher-performance design results.
The NanGate MegaLibrary is an extended cell library containing upward of 10,000 fine-grained cell variants, built to complement the existing foundry-provided or commercially available base cell libraries.
A MegaLibrary is made available by an independent library vendor and is deployed through the usual foundry interface mechanism. It contains a wide variety of additional cell types including drive strength, skew and leakage control variants, in addition to many combined and complex cell functions that are not normally present in the base library.
A library cell count of 10,000 is well in excess of that seen in a typical foundry process base library; it would be difficult for most IP vendors to develop and validate such a library without the significant additional automation technology available from NanGate.
Synthesis tools from the major EDA vendors cannot handle the large number of cells available in a NanGate MegaLibrary because the algorithms they employ are not designed to take advantage of the library’s cell granularity. The NanGate Design Optimizer (NDO) post-synthesis optimization tool integrates with existing flows to take advantage of richer and more fine-grained MegaLibrary extension libraries. It operates on the design netlist in concert with the existing RTL-to-GDSII flow, and enables cell subsets from an extended cell library to be used within the synthesis optimization loop to improve and fine-tune the design performance.
Even when NDO uses only the base libraries for the foundry process in conjunction with traditional synthesis tools, it can still provide some improvement in design performance as a result of its proprietary critical region analysis and post-synthesis optimization techniques.
When NDO is used with a MegaLibrary, the performance advantage is significantly amplified by enabling the addition of fine-grained cell variants to the design. The advantage is realized both inside and outside the critical region targeted for performance gains.
After the required performance has been reached, NDO employs another strategy to reduce area: combined cells with smaller footprints are used to replace groups of individual standard cells. When this process has completed, functionality remains the same but overall block area is substantially reduced. The result is the potential for large savings in die area, and it can also cut weeks of design-team effort. This translates directly to more die per wafer and reduced silicon cost.
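As a rough illustration of this combined-cell pass, the sketch below scans a toy netlist for adjacent gate pairs that an extension library could implement as a single complex cell, for example an AND2 feeding a NOR2 becoming an AOI21. The pattern table, cell names and area figures are hypothetical, and a production tool would of course also check pin ordering, timing and electrical constraints before committing a merge.

```python
# Hypothetical mapping from a (driver, load) gate pair to the combined cell
# that implements both, plus the placement sites saved by the merge.
COMBINED_CELLS = {
    ("AND2", "NOR2"): ("AOI21", 2),   # NOR(AND(a, b), c) == AOI21
    ("OR2", "NAND2"): ("OAI21", 2),   # NAND(OR(a, b), c) == OAI21
}

def merge_combined_cells(netlist):
    """netlist: list of (instance, function, fanout_instances) tuples.
    Returns the gate pairs to replace and the total sites saved."""
    function_of = {inst: func for inst, func, _ in netlist}
    replacements, sites_saved = [], 0
    for inst, func, fanout in netlist:
        if len(fanout) != 1:
            continue  # the intermediate net must be private to the pair
        load = fanout[0]
        match = COMBINED_CELLS.get((func, function_of.get(load)))
        if match:
            cell, saving = match
            replacements.append(((inst, load), cell))
            sites_saved += saving
    return replacements, sites_saved

# Example: u1 (AND2) drives only u2 (NOR2), so the pair becomes one AOI21.
print(merge_combined_cells([("u1", "AND2", ["u2"]),
                            ("u2", "NOR2", [])]))
# -> ([(('u1', 'u2'), 'AOI21')], 2)
```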
Performance metrics
There are many factors that influence the maximum performance a design can achieve. Experience has shown that on a highly optimized ARM CPU core in a 45nm process, it is possible to achieve a 14% speed increase over the best result obtained using traditional foundry-provided base libraries and current state-of-the-art EDA tools.
The combination of NDO and a MegaLibrary provides an effective and efficient route to higher performing digital designs on any process geometry; however, the performance improvements obtainable are dictated by the characteristics of the process in question.
The first MegaLibraries were developed for the 65nm and 40nm geometry nodes; the focus is now on 28nm and 20nm. Leakage management is more challenging at 65nm and below, where a 10nm increase in channel length can reduce static power consumption by up to 50%; by comparison, at the 180nm node, the static power reduction would be minimal since leakage is usually not a major issue.
For area optimization, deploying combined cells and new complex cells has been found to reduce digital logic real estate by 15-20%. This can have a huge impact on baseline profit margins due to reduced manufacturing costs as well as a reduction in time spent in design and development.
A secondary benefit of area reduction is typically a decrease in the chip’s power consumption. For mobile devices, this translates to a longer battery life.
Jens Tagore-Brage is chief technology officer of NanGate.
NanGate
155-A Moffett Park Drive, Suite 101
Sunnyvale CA
94089
USA
W: www.nangate.com
T: +1 408 541 1992