Flow exploration key to finFET network processor implementation
As SoCs and the processes upon which they are implemented become more complex, designers are finding it increasingly challenging to explore more than one approach to implementation. Representatives from ARM, Samsung and Synopsys got together at DAC to discuss how a design management tool had helped them explore multiple approaches to the implementation of a complex SoC over just four weeks.
According to Wolfgang Helfricht, physical IP platform marketing director, ARM, the target design was one instance of a family of network infrastructure SoCs, designed to scale in performance and power, depending on where they sit in an ‘intelligent flexible cloud’ linking peripheral devices to central data centres.
Helfricht’s argument is that future networks will need computation, storage and acceleration to be much more widely distributed throughout the network than they are now, to cope with the flood of data being thrown off by devices at the network’s periphery.
“Lots of little data equals big data and big bandwidth,” he said.
To service this requirement, ARM has developed a set of cores with differing performance and power characteristics, from the Cortex A53 to the A57, which could be formed into multicore clusters linked over a cache-coherent network as part of a scalable network infrastructure SoC architecture.
Helfricht said the implementation would use one of ARM’s Artisan physical IP libraries, optimised for Samsung’s 14LPP finFET process. The libraries take advantage of the lower leakage of the finFET process. They also use the extra drive strength of the finFET transistor to enable cells to be shrunk, turning a 12-track library into a nine-track version for greater density. The process can also run at lower voltages.
Kelvin Low, senior director of foundry marketing, Samsung, said the targeted 14LPP finFET process would offer 1.67 times the performance of the planar 28LPP process at the same power consumption, or consume 0.41 times the power at 0.8V for the same performance, and take up 0.55 times the area of 28LPP, calculated on a mix of logic, SRAM and analogue I/O.
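As a rough back-of-the-envelope check, the quoted factors can be combined to estimate performance per watt and per unit area. The derived figures below are my own arithmetic from the numbers above, not figures from the presentation:

```python
# Samsung's quoted 14LPP-vs-28LPP scaling factors (from the talk above).
perf_gain_iso_power = 1.67   # 1.67x performance at the same power
power_iso_perf = 0.41        # 0.41x power at 0.8V for the same performance
area_scale = 0.55            # 0.55x area on a logic/SRAM/analogue-I/O mix

# Derived figures (my arithmetic, not Samsung's):
perf_per_watt_iso_perf = 1 / power_iso_perf          # ~2.44x perf/watt
perf_per_area = perf_gain_iso_power / area_scale     # ~3.04x perf per unit area

print(f"~{perf_per_watt_iso_perf:.2f}x performance per watt at iso-performance")
print(f"~{perf_per_area:.2f}x performance per unit area at iso-power")
```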
Andy Potemski, director of R&D for the Lynx Design System at Synopsys, outlined the challenge of proving the SoC ecosystem worked by building a multicore network subsystem using 16 Cortex A53 cores, arranged in four quad-core clusters, linked over a CoreLink cache-coherent network with 4Mbyte of cache.
The target was to enable the SoC to run at 1.5GHz at its worst-case design corner, when implemented on Samsung’s 14LPP finFET process – in four weeks.
The design flow was managed by the Lynx Design System, which includes a Run Time Manager, to control the flow, and Design Tracker, to measure Quality of Results (QoR). The flow included Synopsys’s IC Compiler II place and route system, which has been developed to improve design productivity.
Lynx acts as a design management cockpit, enabling users to take generic design flows, tune them using technology plug-ins to match the needs of the process in use, and then explore how different combinations of tool settings, scripts and design approaches affect the overall project results. Lynx can also run multiple flows concurrently, so designers can compare different approaches as the design progresses.
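The concurrent what-if exploration described here can be sketched conceptually in Python. To be clear, this is an illustrative stand-in, not Lynx's actual interface: the `run_flow` stub, the variant names and the QoR numbers are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for launching one tuned flow variant and collecting
# its Quality-of-Results metrics (worst negative slack in ns, relative area).
# A real flow manager would dispatch synthesis and place-and-route jobs to a
# compute farm; here the results are canned for illustration.
def run_flow(variant: str) -> dict:
    canned_qor = {
        "flat":         {"wns": -0.02, "area": 1.00},
        "hierarchical": {"wns": -0.08, "area": 1.05},
    }
    return {"variant": variant, **canned_qor[variant]}

# Run the competing flow variants concurrently, then compare their QoR.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_flow, ["flat", "hierarchical"]))

best = max(results, key=lambda r: r["wns"])  # least-negative slack wins
print(f"best variant: {best['variant']} (WNS {best['wns']} ns)")
```

The design choice mirrored here is the key point of the article: because the variants run side by side, the comparison happens as the design progresses rather than after a sequential series of experiments.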
The technology plug-in for the Samsung 14LPP process handles process-specific issues such as managing the placement of cells on the finFET grid, pin accessibility checks, via clustering, and RC layer optimisation.
Among the challenges of implementing the design, beyond the fact that the whole flow was assembled in weeks, was that the design had more than 6 million instances and three levels of hierarchy, and was expected to reach 95% of its design-goal results on the first pass.
Hierarchical vs flat
One big challenge in achieving this was to optimise the implementation of multiple cores at once. According to Potemski, it would have been possible to take a ‘divide and conquer’ approach, optimising a single core on its own and then instantiating it four times to make a quad-core cluster. However, the capacity of IC Compiler II allowed the team to go for a flat implementation strategy, which meant there was no need for I/O budgets for each CPU, no iterations between the CPU and quad-core cluster levels, and each CPU was optimised in its real context. The approach also improved design productivity, since there was less hierarchy to manage, and led to better QoR.
Lynx also enabled the team to explore different approaches to synthesis and place and route. The team tried a hierarchical synthesis approach, using Design Compiler Graphical to synthesise a single CPU, and then to synthesise a quad-core cluster, which led to faster synthesis runs. The design was then passed to IC Compiler II to do the flat place and route process. This hierarchical/flat approach cut the implementation time by 40%.
The shift from a hierarchical approach to flat place and route improved QoR, especially because it made it easier to optimise the placement of ports between the CPUs and the top-level interconnect.
The project team managed to create an implementation running at 1.44GHz at its worst-case design corner within the four weeks. The flat flow also helped improve WNS and TNS, and moderately reduced area and power.
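Against the 1.5GHz target quoted earlier, that result works out at 96% of the goal, in line with the 95% first-pass expectation mentioned above. The arithmetic below is my own check, not a figure from the presentation:

```python
target_ghz = 1.5      # worst-case-corner frequency target from the brief
achieved_ghz = 1.44   # frequency achieved after four weeks

# Fraction of the frequency goal reached on the first-pass flow.
fraction_of_goal = achieved_ghz / target_ghz
print(f"{fraction_of_goal:.0%} of the frequency target")  # prints "96% of the frequency target"
```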
The team also looked at optimising the design at the top level, where the challenges of optimising the quad-core cluster (such as the depth of the logic from register to register) differ from those of the cache-coherent network (where the placement of the crosspoints is critical to timing closure).
To help optimise module placement, the tool flow offers a graphical display of data ‘flylines’, to help users visualise and explore the signal density between the cores, and between the cores and the cache. A similar display helped designers understand the timing effort going into making particular routes meet their timing requirements. Lynx acted as an interactive cockpit for this work, so that designers could optimise their designs, see how the optimisations affected QoR, and then re-optimise until they achieved the desired goal.
“Designers never want to be far from the tools,” said Potemski, highlighting the value of the interactivity enabled by Lynx alongside its automation.
Part of what enabled such an iterative approach to the design was the performance of the tools. Potemski said that running the four-core clusters through the entire flow took about 48 hours each time, and that the team probably ran the whole flow “a dozen times”. He added that the team did more iterations on the core than on the top level of the design, for which the complete implementation runtime was between two and a half and three days.