Proving the 20nm ecosystem with the ARM Mali GPU

By Tim Whitfield | Posted: August 25, 2013
Topics/Categories: IP - Assembly & Integration

Tim Whitfield is director of engineering for ARM’s Hsinchu Design Centre in Taiwan. He joined ARM in 2000, was involved with the first ARM synthesisable core, and helped create the physical implementation group.

ARM’s job is to help companies develop complex SoCs more quickly than they could on their own. As our processors become more powerful and power efficient, and the processes upon which they are made become more complex, it is up to us to ensure that our customers can produce their designs as quickly as ever.

This is why we worked with tools partner Synopsys, foundry partner TSMC, and our own physical and logical IP to run a complete SoC design through to silicon on a 20nm process. This helped us explore how all the elements of the design flow work together, and to discover and overcome issues so that the 20nm ecosystem is now ready for our mutual customers.

We added a further twist by centering our test design on the ARM Mali graphics processing unit (GPU) rather than one of ARM’s CPUs. Why did we choose to do that?

One reason is the increasing demand for high-performance graphics processing; another is the diversity of that demand. Some customers want to use our GPUs to power next-generation smartphone displays, others as the heart of compute engines. Diverse applications mean diverse implementations, providing the impetus to explore the 20nm design envelope.

Another reason for using Mali is that GPUs tend to take up a lot of silicon area compared to CPUs. This challenges us to move away from the design optimisations we have long used to implement dense, high-performance and low-energy CPUs and look instead for different trade-offs of power, performance and configuration.

The third reason for choosing Mali is that the dataflow requirements of GPU computation tend to lead to routing congestion. This issue can be overcome by adding more routing layers, but that increases cost. Navigating these trade-offs is complex. Advanced process technologies offer more choices, such as various channel lengths and threshold voltages in the basic transistors. We undertake extensive benchmarking to understand the power and performance trade-offs of each option, so that users can understand the implications of their GPU configuration choices.
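The kind of trade-off this benchmarking informs can be illustrated with a toy sketch. This is not ARM’s actual characterisation flow, and the delay and leakage numbers below are invented for illustration: given benchmark data for the threshold-voltage variants of a cell, pick the least leaky variant that still meets the timing budget.

```python
# Toy illustration (not ARM's methodology): choosing a threshold-voltage
# variant for a cell from hypothetical benchmark data. Low-Vt cells are
# faster but leak more, so we pick the lowest-leakage variant that still
# meets the required delay.

# Hypothetical characterisation data: delay in ns, leakage in relative units
VARIANTS = {
    "hvt": {"delay": 0.30, "leakage": 1.0},   # high Vt: slow, low leakage
    "svt": {"delay": 0.22, "leakage": 3.0},   # standard Vt
    "lvt": {"delay": 0.15, "leakage": 10.0},  # low Vt: fast, leaky
}

def pick_variant(required_delay_ns):
    """Return the lowest-leakage variant meeting the delay budget, or None."""
    feasible = [(v["leakage"], name) for name, v in VARIANTS.items()
                if v["delay"] <= required_delay_ns]
    return min(feasible)[1] if feasible else None
```

On a relaxed path, `pick_variant(0.5)` returns `"hvt"`; tightening the budget to `0.25` forces `"svt"`, and a budget no variant can meet returns `None`. Real flows make this choice per cell instance across millions of cells, which is why the benchmarking data behind it matters so much.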

We worked with a Synopsys-based design flow, starting with Design Compiler for physically-aware synthesis, then using IC Compiler for double-patterning-aware place and route and PrimeTime SI for timing sign-off. To make the chip as valid a trial as possible, we used DFT MAX to insert full test structures. Lastly, the physical verification was handled by IC Validator.

Completing a GPU-based SoC design from RTL to GDSII can take weeks, so it is important to converge on a correct design quickly. The Synopsys tools helped us achieve this goal. For example, we used Design Compiler Graphical to analyse early versions of the RTL for routing congestion. Getting the RTL designers to alter their code to ease congestion at this early stage saved having to deal with it later in the design flow.

We were also able to use IC Compiler’s Data Flow Analyzer feature to provide a graphical representation of the connectivity and dataflows between sub-blocks in the GPU, helping us to develop a congestion-aware floorplan.

At 20nm, we had to deal with the impact of double-patterning lithography and extreme density on place and route. These two factors have led to the introduction of many more, and more restrictive, design rules from the foundries. This, in turn, is challenging the standard-cell and memory designers to achieve the density increase one usually expects from migrating to a new process.

For example, to achieve adequate routing density we had to plan the pin position in each standard cell very carefully, adding padding in some cases to help. Doing this intelligently is critical; without padding, a design may be unroutable, but padding every cell indiscriminately will unnecessarily add a lot of die area.
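A crude way to picture the padding decision is as a pin-density test. The sketch below is purely illustrative, with hypothetical cell names and numbers: a cell whose pins consume too large a fraction of the routing tracks above it is hard to access, so it is a candidate for padding.

```python
# Toy sketch (hypothetical cells and numbers): flagging standard cells for
# padding based on local pin density. If a cell's pins would occupy more
# than a given fraction of the routing tracks over it, the router struggles
# to reach them all, and empty padding area next to the cell helps.

def needs_padding(pin_count, cell_width_tracks, max_utilisation=0.5):
    """Pad when pins would occupy more than max_utilisation of the tracks."""
    return pin_count / cell_width_tracks > max_utilisation

# (name, pin count, width in routing tracks) -- invented for illustration
cells = [("NAND2", 4, 6), ("AOI221", 7, 8), ("INV", 2, 4)]
padded = [name for name, pins, width in cells if needs_padding(pins, width)]
```

Here the pin-dense cells get padded while the inverter does not, which mirrors the point in the text: pad selectively where pin access demands it, not indiscriminately across the library.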

We also worked with Synopsys to automatically insert boundary cells required for 20nm designs and to reduce other layout-dependent effects that can cause variability in the design.

Accommodating the needs of double-patterning lithography has made the power grid implementation more difficult. We found that 20nm design rules with different metal pitches on each double-patterned layer make it easy to end up with misaligned pitches that can absorb a lot of routing resource. We worked with Synopsys to figure out how to pitch-match the various power grids to avoid wasting routing resources.
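The arithmetic behind pitch-matching is simple to sketch. The layer pitches below are invented for illustration, not TSMC’s actual 20nm values: choosing the power-strap pitch as a common multiple of the track pitches of the layers it crosses keeps straps landing on legal tracks instead of blocking fractions of tracks on each layer.

```python
# Illustrative sketch (hypothetical pitch values): aligning a power grid
# with the track pitches of two double-patterned routing layers. A strap
# pitch that is a common multiple of both layer pitches lands on legal
# tracks on both layers, so no partial tracks are wasted.
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

# Hypothetical minimum track pitches in nanometres for two routing layers
m2_pitch = 64
m3_pitch = 80

grid_pitch = lcm(m2_pitch, m3_pitch)  # straps align on both layers
```

With these assumed pitches the matched grid pitch is 320nm; a grid placed at any other pitch would repeatedly straddle tracks on one layer or the other, absorbing routing resource exactly as described above.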

Even with good cell and memory libraries, layout at 20nm remains difficult. The layout-decomposition and coloring tools in IC Compiler help ensure that our designs will be printable using double patterning. The Automatic DRC Repair facility in IC Validator helped us fix any remaining layout violations.
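The core idea behind layout decomposition can be sketched as a graph-coloring problem. This is the textbook formulation, not IC Compiler’s internal algorithm: features spaced closer than the same-mask minimum form a conflict graph, and the layout is decomposable onto two masks exactly when that graph is bipartite, i.e. contains no odd cycle.

```python
# Sketch of the principle behind double-patterning decomposition (not the
# IC Compiler implementation): shapes closer than the same-mask spacing
# form a conflict graph; a 2-coloring of that graph is a legal mask
# assignment, and an odd cycle means no two-mask assignment exists.
from collections import deque

def two_colorable(n, conflicts):
    """BFS 2-coloring of n shapes. Returns a mask assignment list,
    or None if an odd cycle makes two-mask decomposition impossible."""
    adj = [[] for _ in range(n)]
    for a, b in conflicts:
        adj[a].append(b)
        adj[b].append(a)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]  # opposite mask to neighbour
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: double-patterning conflict
    return color
```

A chain of conflicting shapes decomposes cleanly by alternating masks, but three mutually conflicting shapes form an odd cycle and cannot be printed with two masks; that is the kind of violation decomposition tools flag for the designer to resolve by respacing.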

With our Mali test design, we have taken our latest GPU core, the latest Synopsys tools and an advanced TSMC process and used them to produce real silicon. This has enabled us to explore the impact of process, library and configuration choices in implementing Mali, and to validate our Artisan physical IP libraries. In working with Synopsys we have been able to show that the tools are ready, and we developed ways to keep some of the implementation challenges of 20nm and smaller processes transparent to the end user.

We have done all this in a very dynamic design environment in which the process and its design rules, as well as the physical IP and the tools were evolving all at once. With a 16nm test chip already underway, it’s clear that design is going to remain very dynamic. For ARM, joint validation projects with tool providers such as Synopsys and foundry partners such as TSMC help us ensure that the entire design ecosystem is ready for customers who want to access the power, performance and area advantages of new process nodes.

To watch Tim being interviewed about developing this 20nm flow by Phil Dworsky, director of strategic alliances at Synopsys, click here.

More on the collaboration between ARM and Synopsys here.



Tim Whitfield is currently the director of engineering for ARM’s Hsinchu Design Centre in Taiwan. He moved to Taiwan two years ago to create a design centre focused on ARM physical implementation on advanced process nodes. Whitfield worked for GEC and Fujitsu Telecommunications before joining ARM in 2000. Involved in the first ARM synthesisable core, he was a founding member of the physical implementation group, where he technically led a number of CPU macro hardening projects for ARM partners and foundry projects. Whitfield has also led a number of key ARM development and test chips used for software development and IP validation, spanning multiple process nodes from 180nm through 16nm. Whitfield graduated from Brunel University in the United Kingdom in 1995 with a degree in electrical and electronic engineering.


