Work undertaken jointly by ARM, TSMC and Synopsys to implement advanced processor designs on TSMC’s 16FF+ process and its emerging 10nm N10 process has revealed the design and verification challenges of working at such dimensions, according to speakers at a Synopsys event at DAC last month.
Willy Chen, deputy director, design and technology platform, TSMC, said that the company had already done 20 product tape-outs using its 16nm finFET process, and expects to reach 50 by the end of this year.
Working on these designs has revealed issues in areas as diverse as IP design, place and route, fin-based placement, low VDD enablement, pin access and routability, accuracy of RC extraction, correlation between STA and Spice, and advanced EM rule support.
Chen took place and route as an example, pointing out that for finFET-based design, place and route tools need to be able to snap blocks to the fin grid for automated IP and macro placement.
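As a rough illustration of what snapping to the fin grid involves, the sketch below rounds a block's edge coordinate to the nearest multiple of the fin pitch. The function name, the integer nanometre units and the 48nm pitch used in the example are assumptions made for illustration; they are not figures from TSMC or from any place and route tool.

```python
def snap_to_fin_grid(y_nm: int, fin_pitch_nm: int, offset_nm: int = 0) -> int:
    """Return the fin-grid coordinate nearest to y_nm (exact ties round up).

    fin_pitch_nm and offset_nm are hypothetical technology parameters.
    """
    if fin_pitch_nm <= 0:
        raise ValueError("fin pitch must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    steps = (y_nm - offset_nm + fin_pitch_nm // 2) // fin_pitch_nm
    return offset_nm + steps * fin_pitch_nm


# Example with an assumed 48nm fin pitch.
print(snap_to_fin_grid(100, 48))  # 96
print(snap_to_fin_grid(120, 48))  # 144
```

A real placer would also have to legalise the snapped block against its neighbours, which this sketch ignores.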
He said TSMC has taped out a 10nm validator chip with a four-core ARM Cortex-A57 processor and other circuits on board: “It’s an important step in a concurrent ecosystem bring-up.”
This work has shown that designers working at 10nm will have to do full colouring of their layouts, and develop a colouring strategy for the pins of library and macro IP. Place and route tools will need to work with tracks on predefined colours, be able to create access routes to coloured pins efficiently, and work with same/different-colour spacing rules.
Chen said that RC extraction, EM checking and DRC will also all need to become colour-aware.
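To make the idea of same/different-colour spacing rules concrete, here is a minimal sketch of a colour-aware spacing check on one routing track. The `Wire` type, the two colour names and both spacing values are hypothetical and chosen only for illustration; real N10 rules are far more detailed and are not given in the talk.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Wire:
    name: str
    colour: str    # mask colour, e.g. "A" or "B" (assumed two-mask scheme)
    left_nm: int   # left edge of the shape
    right_nm: int  # right edge of the shape


# Hypothetical rule values: same-colour shapes need more room than
# shapes printed on different masks.
SAME_COLOUR_SPACING_NM = 60
DIFF_COLOUR_SPACING_NM = 30


def spacing_violations(wires):
    """Return (first, second, gap, required) for every pair that is too close."""
    ordered = sorted(wires, key=lambda w: w.left_nm)
    violations = []
    for a, b in combinations(ordered, 2):
        gap = b.left_nm - a.right_nm
        required = (SAME_COLOUR_SPACING_NM if a.colour == b.colour
                    else DIFF_COLOUR_SPACING_NM)
        if gap < required:
            violations.append((a.name, b.name, gap, required))
    return violations


wires = [
    Wire("n1", "A", 0, 20),
    Wire("n2", "B", 50, 70),
    Wire("n3", "A", 75, 95),
]
for violation in spacing_violations(wires):
    print(violation)
```

Checking every pair, rather than only adjacent shapes, matters here because two same-colour wires can violate their rule even with a different-colour wire sitting between them.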
A full-colour design flow for the N10 process is now ready for customers, according to Chen. One of the key aspects of the flow is the ability to promote key signals to higher layers automatically, to reduce resistance, a change from the previous approach which emphasised creating the shortest route through the metal stack.
“It’s an area into which we put extra effort for N10,” said Chen.
Is it worth moving to the new process? Chen said that according to results so far, moving from N16 to N10 halves the area of the test chip, and offers 16% faster speed at 6% lower VDD.
Haroon Gauhar, principal design engineer, ARM, talked about the work his company has been doing on implementing a Cortex-A72 processor on the TSMC 16FF+ process. He said the processor achieved 3.5 times the performance of an A15 implemented on a 28nm planar process, or consumed 75% less energy for the same workload.
The 64bit ARMv8-A architecture processor, with four cores, a Mali GPU, CoreLink interconnect and 2Mbyte of cache, was implemented using an ARM Artisan 9-track cell library. To control power consumption, it uses regional clock-gating. Libraries are ULVT C20 throughout and ULVT C16 on the clock network.
Gauhar said that the challenges of implementing the design included choosing the right 16FF+ libraries, VT classes and ratios. The higher Fmax of the design, the targeted increase in power efficiency and generally rising design complexity also meant that his design team needed a 2 to 3x increase in productivity to keep up.
Joe Walston, senior staff applications consultant, Synopsys, talked about the way the tools have had to evolve to enable 16nm and 10nm design.
The team had two main goals: to reduce the clock network power, and to manage the total negative slack (TNS) for low power.
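For readers less familiar with the metric, TNS is conventionally the sum of the negative slacks over all timing endpoints, while the worst negative slack (WNS) is the single most negative value. The sketch below computes both from a list of endpoint slacks; the function names and the picosecond figures are made up for illustration and are not data from the project.

```python
def total_negative_slack(slacks_ps):
    """Sum of all negative endpoint slacks (0 if timing is met everywhere)."""
    return sum(s for s in slacks_ps if s < 0)


def worst_negative_slack(slacks_ps):
    """Most negative endpoint slack, or 0 if no endpoint fails timing."""
    return min(min(slacks_ps, default=0), 0)


# Hypothetical endpoint slacks in picoseconds.
slacks = [12, -5, -30, 0, 7]
print(total_negative_slack(slacks))  # -35
print(worst_negative_slack(slacks))  # -30
```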
One way to reduce clock-network power is to try to bank registers together into multibit registers. This can be done in three ways: by inference from the RTL, by using physically-aware multibit approaches, and by using multibit facilities within IC Compiler.
Walston said that in this case, the team did the RTL inferencing first, then looked for nearby registers, and then ran the IC Compiler multibit optimisation. He said the majority of the power savings came from the RTL inferencing process. The combined result of the three steps was a 5% saving in dynamic power and a 44% reduction in TNS.
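The "looking for nearby registers" step can be pictured as a simple grouping problem. The sketch below greedily banks registers that share a clock and sit within a distance limit of the first register in the bank. It is only an illustration of the idea: the tuple layout, the `max_bits` and `max_dist` parameters and the greedy strategy are all assumptions, and this is not how IC Compiler or any Synopsys tool is known to implement multibit banking.

```python
from collections import defaultdict


def bank_registers(registers, max_bits=4, max_dist=10):
    """Group registers into candidate multibit banks.

    registers: list of (name, clock, x, y) tuples (hypothetical format).
    A bank holds at most max_bits registers on the same clock, each within
    max_dist (Manhattan distance) of the first register in the bank.
    """
    by_clock = defaultdict(list)
    for reg in registers:
        by_clock[reg[1]].append(reg)

    banks = []
    for regs in by_clock.values():
        regs.sort(key=lambda r: (r[2], r[3]))  # sweep left to right
        current = []
        for reg in regs:
            if current:
                first = current[0]
                dist = abs(reg[2] - first[2]) + abs(reg[3] - first[3])
                if len(current) == max_bits or dist > max_dist:
                    banks.append([r[0] for r in current])
                    current = []
            current.append(reg)
        if current:
            banks.append([r[0] for r in current])
    return banks


regs = [
    ("r0", "clk", 0, 0), ("r1", "clk", 2, 0), ("r2", "clk", 4, 0),
    ("r3", "clk", 50, 0), ("r4", "clk2", 1, 0),
]
print(bank_registers(regs, max_bits=2))
```

Banks that end up with a single register would simply stay as ordinary single-bit flops in this picture.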
The team’s clock-gating strategy was to take a tiered approach, using six levels of clock gating, and focusing their efforts on the most energy-consuming clusters of logic on the chip.
Other key implementation challenges included routing on the 16FF+ process, which has 11 layers of metal, the first three requiring double-patterned lithography. The clock tree was implemented on M4 through M7, signal routing on M2 through M9, and the power grid on M8 through M11.
Gauhar said one of the key advantages of the flow was using Synopsys’ IC Compiler II tool to speed up implementation.
“The quicker we get to placement optimisation the more iterations we can do,” said Gauhar. “The biggest advantage of ICCII is the runtime reduction, from three to four days to around one day.”
Rob Aitken, ARM Fellow, gave a systemic view of what it takes to implement complex processors at advanced nodes such as 16nm and 10nm finFET.
He argued that to produce efficient 10nm processor implementations, designers need to be thinking about their system architecture, the expected workload, the number of cores, interconnect strategy, memory architecture, and caching strategies. He also said that designers should be thinking about the CPU’s microarchitecture features, such as pipeline depth, in-order vs out-of-order execution, caching and so on.
In terms of implementation, he recommended thinking about the target process, the ability to do dynamic voltage and frequency scaling, the relative performance of logic and SRAM, and the importance of wires, capacitances and resistances.
Aitken argued that ARM’s experience at these advanced nodes suggests that there is a change in the frequency distribution of signals in the designs, with more close-to-critical paths than have been seen before. As a result, getting a chip to work to spec requires a holistic approach to optimising standard logic paths, arithmetic paths, memory paths, and wire-dominated paths, rather than just fixing the design on a path-by-path basis.
How do you optimise a design for performance? Aitken said the thing to do is to start with a fast design, tune its implementation, tune the scripts controlling the implementation, tune the standard-cell and SRAM instances – and look for other co-optimisation opportunities.
“If you are going to 10nm you really need an ecosystem,” he concluded.