Calibre scales to 4000 cores for faster sign-off in the cloud
A number of under-the-hood tweaks have enabled Calibre, the sign-off tool built by Mentor, a Siemens business, to run urgent jobs across thousands of cores in the cloud, letting customers such as AMD tune their resources to balance cost against turnaround time.
In a session at this week’s Design Automation Conference (DAC) in Las Vegas (June 4, 2019), AMD silicon design engineer James Robinson said that moving huge Calibre jobs to cloud arrays of up to 4000 cores when verifying the latest version of its Epyc server processor complex had cut runtimes from 17 hours to less than eight. “We can run that overnight instead of taking up almost an entire day. This has made a real difference to turnaround times for AMD.”
Robinson said moving away from on-premises resources to the cloud was an essential part of the time savings. “When you are in a tapeout crunch, where do you get 4000 cores?”
Michael White, director of product marketing for Calibre physical verification at Mentor, said that customers building large ICs on advanced nodes are beginning to use cloud resources partly because their demand for compute power is outstripping their on-premises resources, and partly because of growing trust in the security of the compute platforms. “One thing we hear from customers is that it’s increasingly difficult to get hold of hardware in a timely manner. Cloud gives you the flexibility to accelerate getting your design taped-out.
“Calibre has supported scalable computing for many years,” White added. “This is the same Calibre as that running on-premise. But we’ve done some specific things in Calibre to take us to the next level. We’ve added intelligence to look at operations that don’t scale that well and bring them forward in the run, to have them run in parallel with those that do [scale], to ensure you are not waiting for them to complete at the end. And we have looked at how we can improve memory usage.”
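White did not spell out how Calibre orders its operations, but the general idea of pulling poorly scaling work forward can be sketched in a few lines. The Python below is purely a hypothetical illustration, not Mentor’s implementation: checks flagged as scaling badly are dispatched first, so their long tails overlap with the embarrassingly parallel work instead of serialising at the end of the run.

    import time
    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass

    @dataclass
    class CheckTask:
        name: str
        scales_well: bool    # hypothetical flag: does this check spread across many cores?
        runtime_s: float     # illustrative runtime if it ran on its own

    def run_check(task: CheckTask) -> str:
        time.sleep(task.runtime_s)   # stand-in for the real verification work
        return task.name

    def schedule(tasks, workers: int = 4):
        # Dispatch the poorly scaling checks first so they start early and
        # overlap with the highly parallel ones, rather than running after
        # everything else has finished.
        ordered = sorted(tasks, key=lambda t: t.scales_well)   # False sorts before True
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(run_check, t) for t in ordered]
            return [f.result() for f in futures]

    if __name__ == "__main__":
        jobs = [CheckTask("density", True, 0.1),
                CheckTask("antenna", False, 0.5),
                CheckTask("spacing", True, 0.1)]
        print(schedule(jobs))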
Memory savings
Noting that cloud-server pricing is often sensitive to the amount of memory specified, AMD took advantage of the way Calibre’s per-machine memory demand falls as a job is scaled out across more workers.
“As you scale out, you need less RAM on each machine. That means you can use smaller, cheaper instances,” Robinson said, pointing to how a demand for 300Gbyte of worker RAM dropped to less than 150Gbyte as the job scaled up to 2000 cores.
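Robinson’s figures amount to a simple sizing exercise. The sketch below is hypothetical: the instance catalogue, prices and the assumption that a job’s memory footprint divides roughly evenly across workers are ours, not AMD’s or Azure’s. It picks the cheapest machine shape whose RAM covers the per-worker requirement at a given level of scale-out.

    # Hypothetical instance catalogue: (name, cores, RAM in GB, $/hour) -- illustrative only.
    INSTANCES = [
        ("small",  16, 128, 1.0),
        ("medium", 16, 256, 1.6),
        ("large",  16, 512, 2.8),
    ]

    def ram_per_worker(total_footprint_gb: float, workers: int) -> float:
        # Rough assumption: the job's memory footprint is split close to
        # evenly across workers as the run is scaled out.
        return total_footprint_gb / workers

    def cheapest_instance(required_gb: float):
        fits = [i for i in INSTANCES if i[2] >= required_gb]
        return min(fits, key=lambda i: i[3]) if fits else None

    # Illustrative total footprint of 1200 GB: per-worker demand halves
    # each time the worker count doubles, letting the job drop to a
    # smaller, cheaper machine shape.
    for workers in (4, 8, 16):
        need = ram_per_worker(1200, workers)
        print(workers, need, cheapest_instance(need))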
Another tweak involved organising the job to avoid having cloud resources waiting for work to do when the verification run starts up. “On the cloud, anything sitting idle is costing you money. When starting up, usually the master is working hard but the workers are doing nothing. We asked Mentor: ‘Can you do something about this?’ They listened,” Robinson said, noting that the HDBflex feature to allow the addition of worker nodes as tasks become available “was a direct result of AMD feedback”.
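Mentor has not published how HDBflex is implemented, so the following is only a schematic of the behaviour described in the session: rather than provisioning every worker up front and paying for it while the master partitions the job, capacity is requested only once tasks are actually queued. The class name and provisioning hook are invented for illustration.

    import queue

    class LazyWorkerPool:
        """Schematic only: request cloud workers as tasks appear, instead
        of holding idle (billed) machines through the single-threaded
        start-up phase of a run."""

        def __init__(self, provision_fn, max_workers: int):
            self.provision = provision_fn   # hypothetical hook, e.g. a cloud API call
            self.max_workers = max_workers
            self.workers = []
            self.tasks = queue.Queue()

        def submit(self, task) -> None:
            self.tasks.put(task)
            # Add a worker only when there is queued work for it to take.
            if len(self.workers) < min(self.tasks.qsize(), self.max_workers):
                self.workers.append(self.provision())

In a real deployment the provisioning hook would be a cloud API call and the queue would be distributed; the point is simply that worker capacity follows the task queue rather than sitting idle ahead of it.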
To help deliver the runtime improvements to customers such as AMD, Mentor worked with TSMC to ensure that its foundry data could be used on the cloud, and with Microsoft Azure on scale-out and node-management strategies.
Six-month project
Willy Chen, deputy director of design methodology and services marketing at TSMC, said the project started in October last year, taking just six months to reach certification. Before it moved to customer deployments, TSMC tested the Calibre support with a 500-million-gate test chip and its 17Gbyte design database. The baseline runtime was more than 24 hours.
“From there we worked with Mentor and Microsoft and figured out a way to push down the runtime to four hours,” Chen said. “Now I can submit a job in the morning and see the result after lunchtime.”
Optimisation is an ongoing process, Robinson said: “We have been working closely with Mentor. They’ve given us pre-release versions. In one case we saw a drop from ten hours to less than six with a software change. That’s one reason why I am so excited to continue working with Mentor and Microsoft on cloud infrastructure.”