DAC Pavilion focuses on cloud-scaling issues
Talks in the Design-on-Cloud Pavilion at this year’s DAC showed that the question is no longer whether design could or should migrate to the cloud but how to optimize cost and performance once it is there.
Not surprisingly, Amazon’s own projects have turned to cloud computing. “We benefit within Amazon using AWS in IC and SoC development,” said David Pellerin, head of worldwide business development for infotech and semiconductors at AWS, pointing to three chip projects at the company: Graviton/2; Inferentia; and Nitro. “All three of these chips were 100 per cent developed using AWS.”
But a growing number of semiconductor manufacturers have been moving workloads to the cloud to take the pressure off their in-house server farms. Pellerin said: “The important difference is on the compute and storage sides in the way you can scale up and scale down.”
Craig Johnson, vice president of cloud solutions at Cadence Design Systems, said: “There is the opportunity to optimize in any domain from either a cost or throughput perspective.”
Selected server speedups
Although a major focus is on parallelized applications, Johnson said that simply being able to tune the capacity and performance of a target server or cluster can readily yield speedups. He pointed to the experience of customer Gaonchips, which saw a 30 per cent reduction in runtime with the Innovus implementation tools.
Bigger gains come from jobs that make use of the massive parallelism available on the cloud, though it is important to scale compute, storage and I/O together to avoid wasting capacity or licenses. Wei-Lii Tan, product manager for analog and mixed-signal verification at Mentor, a Siemens business, used library characterization as an example of the kinds of considerations that need to be made. He showed an experiment in which 10,000 CPUs were deployed and runtime fell in near-linear proportion to the number of CPUs.
Library characterization has the key advantage of scaling relatively easily: it is mostly short SPICE simulations and the only dependencies are entries in the output .lib tables that rely on multiple simulations. Even so, Tan said, the job dispatcher that issues jobs to the nodes needs to be reasonably intelligent because I/O and disk access may prove to be bottlenecks. Having the dispatcher place together jobs that do not conflict over resources will allow greater scaling.
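The idea of a dispatcher that avoids resource conflicts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dispatcher Tan described: the `Job` and `Node` classes, the per-job I/O estimates and the per-node I/O budgets are all hypothetical, and the placement rule is a simple greedy heuristic.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    io_mb_s: float  # assumed estimate of the job's disk/network demand


@dataclass
class Node:
    name: str
    cpu_slots: int          # how many jobs the node can run at once
    io_budget_mb_s: float   # assumed I/O bandwidth the node can sustain
    jobs: list = field(default_factory=list)

    def io_in_use(self):
        return sum(j.io_mb_s for j in self.jobs)

    def can_take(self, job):
        # A job fits only if a CPU slot is free AND it does not push
        # the node past its I/O budget.
        return (len(self.jobs) < self.cpu_slots
                and self.io_in_use() + job.io_mb_s <= self.io_budget_mb_s)


def dispatch(jobs, nodes):
    """Greedy placement: heaviest-I/O jobs first, each onto the node
    with the most I/O headroom. Returns the jobs that must wait."""
    pending = []
    for job in sorted(jobs, key=lambda j: j.io_mb_s, reverse=True):
        candidates = [n for n in nodes if n.can_take(job)]
        if not candidates:
            pending.append(job)
            continue
        best = max(candidates, key=lambda n: n.io_budget_mb_s - n.io_in_use())
        best.jobs.append(job)
    return pending


# Hypothetical workload: two I/O-heavy and three lighter characterization jobs.
nodes = [Node("n0", 2, 100.0), Node("n1", 2, 100.0)]
jobs = [Job("cellA", 80.0), Job("cellB", 80.0), Job("cellC", 15.0),
        Job("cellD", 15.0), Job("cellE", 30.0)]
pending = dispatch(jobs, nodes)

for node in nodes:
    print(node.name, [j.name for j in node.jobs], node.io_in_use())
print("waiting:", [j.name for j in pending])
```

In this sketch the two I/O-heavy jobs land on different nodes and each is paired with a light job, while the job that would saturate either node waits rather than slowing its neighbours. A production dispatcher would also need to account for licenses, memory and job dependencies.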
“A cluster-management system is needed to allocate and deallocate nodes: the last thing we need is execution nodes being active and costing us money if they are not running jobs,” Tan explained.
“You can easily get overwhelmed with choices. It’s important to work with EDA and cloud partners to choose the right machine type,” Tan added, noting that it is important to use key performance indicator (KPI) tools to determine how well the target machines perform on jobs.
“By tracking KPIs, you can optimize the workload and see if different virtual machine types work better. There are many KPI tracking mechanisms available from EDA and cloud providers,” Tan explained.
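As a rough illustration of the kind of comparison KPI tracking makes possible, the Python sketch below ranks machine types by runtime and by cost per job. The machine-type names, runtimes and hourly prices are invented for the example and do not come from any vendor or from the talks.

```python
def cost_per_job(runtime_hours, price_per_hour):
    """Cost of one job on one machine: runtime multiplied by hourly price."""
    return runtime_hours * price_per_hour


# Hypothetical measurements: machine type -> (runtime in hours, price per hour).
measurements = {
    "general-purpose": (4.0, 1.00),
    "compute-optimized": (2.5, 1.80),
    "memory-optimized": (3.8, 1.30),
}

costs = {name: cost_per_job(hours, price)
         for name, (hours, price) in measurements.items()}

fastest = min(measurements, key=lambda name: measurements[name][0])
cheapest = min(costs, key=costs.get)

for name, (hours, _) in measurements.items():
    print(f"{name}: {hours:.1f} h, cost {costs[name]:.2f}")
print("fastest:", fastest)
print("cheapest:", cheapest)
```

With these made-up numbers the fastest machine type is not the cheapest per job, which is the trade-off such tracking is meant to expose. Real comparisons would also factor in license time and storage costs.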
Mark Duffield, worldwide tech lead for semiconductors and electronics at AWS, said machine-learning techniques also provide opportunities for improving the utilization of cloud resources on EDA jobs. “You can make sure all the licenses are being used,” he said.
One major advantage of cloud deployment versus on-premises is that it makes some aspects of performance planning and deployment easier. “With a lot of on-premises servers, machines get added over time: there will be batches of machines with different specs. To increase the predictability of job runtimes, it’s nice to have machines with the same spec. That’s all possible using cloud resources.”
EDA jobs have characteristics that can call for more specialized forms of storage and I/O, claimed Ravi Poddar, director of EDA and HPC solutions at Pure Storage. “EDA tends to be high file-count dominated. A profile we took at one customer showed 95 per cent of files are 8Kbyte or under in size. Being able to handle a large namespace efficiently is very important.
“It’s easy to get a funnel effect: storage tends to get saturated and then runs pretty slow. Even as you increase parallelism, runtimes can get very, very large. You may not even know: you may be used to seeing runtimes of x and not realize it could run faster. One way to address saturation is to run test jobs and see where the knee of the curve is.”
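One way to locate that knee from a set of test runs is sketched below in Python. It uses a common heuristic, picking the point farthest from the straight line joining the first and last measurements, which is an assumption of this example rather than a method Poddar described. The throughput figures are invented.

```python
def find_knee(xs, ys):
    """Return the x value whose point lies farthest from the chord joining
    the first and last points. Assumes xs is sorted and that the first and
    last points differ in both x and y."""
    x0, x1 = xs[0], xs[-1]
    y0, y1 = ys[0], ys[-1]
    # Normalize both axes to [0, 1] so the result does not depend on units.
    nx = [(x - x0) / (x1 - x0) for x in xs]
    ny = [(y - y0) / (y1 - y0) for y in ys]
    # After normalization the chord is the line y = x, so the distance from
    # it is proportional to |ny - nx|.
    gaps = [abs(b - a) for a, b in zip(nx, ny)]
    return xs[gaps.index(max(gaps))]


# Hypothetical test runs: concurrent jobs vs. completed jobs per hour.
concurrent_jobs = [1, 2, 4, 8, 16, 32, 64]
jobs_per_hour = [10, 20, 40, 78, 110, 118, 120]

knee = find_knee(concurrent_jobs, jobs_per_hour)
print("throughput flattens beyond about", knee, "concurrent jobs")
```

In this invented data set, throughput roughly doubles with each step up to eight concurrent jobs and then flattens, and the heuristic places the knee at 16. Beyond that point, adding parallelism mostly adds cost because storage, not compute, is the limit.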
Pure Storage has designed flash-based arrays to handle the number of metadata transactions that come with huge numbers of small files, coupled with software optimized for bulk file operations such as the deletions that follow the completion of something like a large physical-verification job.
“We have over 25 EDA customers in production. Jobs that used to take over eight hours now complete in 20 minutes.”
Other use-cases are emerging. Pellerin pointed to the concept of the virtual chamber – a temporary development environment created using a collection of containers – as a further use-case for cloud computing in EDA. These chambers can be used to provide enclaves for teams that are segregated from others. “Or you maybe create a chamber to debug an issue with a partner,” he said. “We are starting to see this on the foundry where digital twins are used to get at operational issues and handle predictive maintenance. And we are seeing AWS being used to collaborate on supply-chain issues.”
Though customers are becoming more comfortable with the idea of moving data out to the cloud, security concerns remain and need to be handled.
“It’s important to consider how data is transferred to and from the cloud and how access is controlled,” Tan said. “You need to control access using secure access points and use protocols that prevent unauthorized access.”