How tool parallelism, automatic partitioning, deep debug memories and time domain multiplexing ease FPGA prototyping of large ASIC and SoC designs
There are three big myths about prototyping large ASIC or SoC designs on FPGAs: that prototyping capacity is limited to less than 100 million ASIC gates; that it takes months to get an FPGA-based prototype working; and that the resulting prototype will have limited debug visibility.
This may still be the case for homebrew solutions, but commercial tools, such as Synopsys’ ProtoCompiler, a design automation and debug environment for the HAPS FPGA-based prototyping system, combined with integrated HAPS firmware and hardware accessories, use a number of techniques to overcome these issues:
- Multi-processing the synthesis, mapping, partition, and place and route tool chain to enable large designs to be processed in hours
- Automated partitioning and time-domain multiplexing (TDM) of critical signals, to avoid FPGA I/O bottlenecks
- Combining firmware and hardware to make it possible to distribute clock, reset, and other key signals across dozens of FPGAs
- Multi-FPGA debug schemes with lots of storage, to avoid impacting the prototype floorplan
Multi-processing multi-million gate ASIC designs
What benefits does multiprocessing bring in practice? Figure 1 shows the execution threads for a 96 million gate ASIC design being processed for prototyping on a HAPS-70 S96 system with eight target FPGAs, in this case Xilinx Virtex-7 2000T FPGAs with approximately 70% utilization.
The Intel i7-46xxM processor doing the work supports four concurrent processes, so in this case each thread processes two of the target FPGAs. ProtoCompiler licenses allow up to four processes to execute at once, assuming similar parallel capacity in the Xilinx Vivado place and route system.
As figure 1 shows, a 96 million ASIC gate design can be synthesised and partitioned, and then placed and routed on the target FPGAs in under a day with a typical workstation.
Figure 1 Parallel execution time for 5-8 Virtex-7 2000T FPGAs (Source: Synopsys)
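The scheduling described above, eight per-FPGA jobs spread across four concurrent processes, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ProtoCompiler's scheduler: `build_fpga` is a hypothetical stand-in for a per-FPGA synthesis and place and route job, and the FPGA names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PROCESSES = 4  # concurrent processes per license, as stated in the article
FPGAS = [f"FPGA_{i}" for i in range(8)]  # hypothetical names for the eight targets


def build_fpga(name: str) -> str:
    """Hypothetical placeholder for one FPGA's synthesis and place and route run."""
    return f"{name}: done"


def run_builds(fpgas, max_workers=MAX_PROCESSES):
    """Run the per-FPGA jobs with at most `max_workers` executing at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(build_fpga, fpgas))


def jobs_per_worker(num_jobs: int, max_workers: int) -> int:
    """Ceiling division: how many jobs the busiest worker has to handle."""
    return -(-num_jobs // max_workers)


results = run_builds(FPGAS)
print(len(results), "FPGA builds finished")
print("jobs per worker:", jobs_per_worker(len(FPGAS), MAX_PROCESSES))
```

With eight FPGAs and four workers, each worker handles two jobs, which matches the two-FPGAs-per-thread split described for Figure 1.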
Multiprocessing can also cut runtimes for designs targeting a single FPGA, by using ‘compile points’ to treat RTL partitions as blocks that the tools can synthesize, optimize, and place and route independently. This improves runtime efficiency and enables incremental builds, in which a compile point that does not change from one revision to the next does not need to be reprocessed.
A design can have any number of compile points, which can also be nested, as shown in Figure 2, where CP5 is nested inside CP4, CP6 and CP7 are nested in CP5, and so on. Compile points can be assigned manually or automatically by the system. The best places for compile points are around design modules that have registered interfaces and which entirely encompass a clock domain.
Figure 2 Using compile points to segment a large single-FPGA prototype (Source: Synopsys)
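One way to picture nested compile points and incremental builds is as a tree in which each node carries a fingerprint of its own RTL. The sketch below is a simplified model under stated assumptions, not the tool's actual data structure: the `CompilePoint` class, the hashing scheme and the RTL strings are all hypothetical, and it only reports which compile points changed between two revisions. How a real flow treats the parents of a changed compile point is not modeled here.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class CompilePoint:
    name: str
    rtl: str  # RTL owned by this compile point, excluding nested children
    children: list[CompilePoint] = field(default_factory=list)

    def digest(self) -> str:
        """Fingerprint of this compile point's own RTL text."""
        return hashlib.sha256(self.rtl.encode()).hexdigest()


def walk(cp: CompilePoint):
    """Yield a compile point and everything nested inside it, depth first."""
    yield cp
    for child in cp.children:
        yield from walk(child)


def changed_points(old_root: CompilePoint, new_root: CompilePoint) -> list[str]:
    """Names of compile points whose own RTL differs between two revisions."""
    old = {cp.name: cp.digest() for cp in walk(old_root)}
    return [cp.name for cp in walk(new_root) if old.get(cp.name) != cp.digest()]


def make_tree(cp6_rtl: str) -> CompilePoint:
    """Nesting taken from the article: CP5 inside CP4, CP6 and CP7 inside CP5."""
    cp6 = CompilePoint("CP6", cp6_rtl)
    cp7 = CompilePoint("CP7", "module cp7; endmodule")
    cp5 = CompilePoint("CP5", "module cp5; endmodule", [cp6, cp7])
    return CompilePoint("CP4", "module cp4; endmodule", [cp5])


rev1 = make_tree("module cp6; endmodule")
rev2 = make_tree("module cp6; wire fix; endmodule")
print(changed_points(rev1, rev2))
```

In this example only CP6 is reported as changed, so CP4, CP5 and CP7 are candidates for reuse in the next build.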
ProtoCompiler’s ability to run four processes means that up to four compile points can be processed at once with one license. ProtoCompiler DX supports multiprocessing up to the licenses available.
To extend parallelism further, users can apply multiple ProtoCompiler licenses to target large prototypes that integrate dozens of FPGAs. Synopsys has developed technology that enables tools such as ProtoCompiler to use distributed parallel-processing strategies to accelerate large computational tasks.
This gives users the chance to ask more ‘what if’ questions during the prototyping process, leading to a more useful prototype.
Automated partitioning with TDM
Large ASIC or SoC designs can have very large numbers of internal signals. Even the most advanced FPGAs have limited I/O capacity. One of the key issues in prototyping large designs on multi-FPGA target systems is to ensure, as far as possible, that the physical limits of the FPGA I/O don’t create bottlenecks in the prototyped design.
Synopsys has addressed this issue in two ways. The first is to use the Xilinx Virtex-7 family of FPGAs as its prototyping target. These FPGAs include multi-gigabit transceivers optimized for chip-to-chip interfaces, which can be configured to support DDR3 interfaces at up to 1,866Mbit/s.
The second is to use an automated time domain multiplexing scheme to run multiple signals from the design under test across the same IOSERDES channel of the Virtex FPGAs. These source-synchronous links can achieve throughput of greater than 1Gbit/s in the HAPS system, thanks to careful characterization of the PCBs, HapsTrak connectors, cabling and interconnect daughter boards of the HAPS system.
ProtoCompiler automates the use of these TDM channels by inserting Tx and Rx TDM IP, clock distribution circuits, control logic for data selection, SERDES training logic for data synchronization, and optional monitoring logic for status and channel error detection. It also generates timing constraints for these additional circuits for the Xilinx Vivado system, and functional simulation models for the Synopsys VCS simulator.
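The principle behind TDM can be shown with a small behavioural model: several design signals share one physical link by taking turns in consecutive time slots, and the receiver regroups the slots into the original signals. The sketch below is a simplified illustration, not the inserted TDM IP. The function names, the 8:1 ratio and the 4,000-signal example are assumptions, and the clock-rate bound ignores training, latency and protocol overhead.

```python
def tdm_serialize(samples, ratio):
    """Flatten per-design-clock words of `ratio` bits into one serial stream."""
    stream = []
    for word in samples:
        if len(word) != ratio:
            raise ValueError("each word must carry exactly `ratio` signals")
        stream.extend(word)
    return stream


def tdm_deserialize(stream, ratio):
    """Regroup the serial stream into per-design-clock words on the Rx side."""
    return [tuple(stream[i:i + ratio]) for i in range(0, len(stream), ratio)]


def pins_needed(num_signals, ratio):
    """Physical links required when `ratio` signals share each link."""
    return -(-num_signals // ratio)  # ceiling division


def max_transfer_rate_mhz(link_mbps, ratio):
    """Idealized upper bound on how often each multiplexed signal can update."""
    return link_mbps / ratio


# Two design-clock cycles of eight signals each, sent over one link.
words = [(1, 0, 1, 1, 0, 0, 1, 0), (0, 1, 1, 0, 1, 0, 0, 1)]
stream = tdm_serialize(words, ratio=8)
assert tdm_deserialize(stream, ratio=8) == words

print(pins_needed(4000, 8))            # links for 4,000 inter-FPGA signals at 8:1
print(max_transfer_rate_mhz(1000, 8))  # per-signal bound on a 1Gbit/s link
```

The trade-off is visible in the two helper functions: a higher multiplexing ratio reduces the number of I/Os needed, but also lowers the rate at which each signal can cross between FPGAs.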
Automatic system chaining
As the size of ASIC and SoC designs increases, so must the capacity of FPGA-based prototyping systems. System chaining strategies enable multiple FPGA modules to be linked together to increase overall capacity.
Chaining functions include:
- Clock distribution and synchronization
- Reset distribution
- System boot sequencing
HAPS systems include an embedded processor, dedicated logic, and firmware to serve as the system’s supervisor and manager. The system supervisor automates bring-up and helps ensure safe operation for the system including functions for power and temperature monitoring, physical interface management, a user control interface, and clock and reset distribution.
High fan-out signals, such as clock and reset, can degrade and introduce skew when driven over long PCB traces. The highest capacity HAPS series, HAPS-70, which can be ordered with up to four FPGA modules, can be reliably chained into a system supporting up to 12 FPGAs, without additional clock distribution circuits. Using Virtex-7 2000T FPGAs, and conservative FPGA logic cell to ASIC gate equivalence metrics, this provides around 144 million gates of prototyping capacity.
For larger designs, an external clock distribution circuit enables up to 12 clocks to be synchronized across up to six HAPS-70 FPGA modules, which can in turn host 1, 2, or 4 FPGAs. This means that up to 24 FPGAs, for a total capacity of 288 million ASIC gates, can be chained together reliably.
Figure 3 Clock distribution in the HAPS-70 system (Source: Synopsys)
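The capacity figures quoted above follow from simple arithmetic. The sketch below assumes 12 million ASIC gates per Virtex-7 2000T, which is the value implied by the article's own numbers (144 million gates across 12 FPGAs) rather than an official rating, and the function names are hypothetical.

```python
GATES_PER_FPGA_M = 12          # million ASIC gates per FPGA, implied by 144M / 12
VALID_FPGAS_PER_MODULE = (1, 2, 4)  # module options stated in the article


def chained_capacity_m(modules: int, fpgas_per_module: int) -> int:
    """Total capacity, in millions of ASIC gates, of a chained system."""
    if fpgas_per_module not in VALID_FPGAS_PER_MODULE:
        raise ValueError("a module hosts 1, 2, or 4 FPGAs")
    return modules * fpgas_per_module * GATES_PER_FPGA_M


def fpgas_required(design_gates_m: float) -> int:
    """Minimum FPGA count for a design of the given size (ceiling division)."""
    return int(-(-design_gates_m // GATES_PER_FPGA_M))


print(chained_capacity_m(3, 4))  # 12 FPGAs without extra clock distribution
print(chained_capacity_m(6, 4))  # 24 FPGAs with external clock distribution
print(fpgas_required(200))       # FPGAs for a hypothetical 200M-gate design
```

A hypothetical 200 million gate design would need 17 FPGAs on this estimate, which puts it beyond the 12-FPGA limit and into the range that requires the external clock distribution circuit.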
As prototypes scale up, the ability to relate the behaviour of multiple logic signals spread across multiple FPGAs becomes a challenge for traditional FPGA debugging tools. Prototype-specific debug systems have been developed to increase both the capacity and synchronization of the embedded instrumentation necessary for successful large-prototype debug.
Such tools need to synchronize trigger and sample circuits across multiple FPGAs, to ensure a correlated view of the state activity. In the HAPS series, this is done by using the RTL debug features of ProtoCompiler to add watchpoints and trigger specifications to the pre-partitioned RTL of the design. The debug circuitry inferred is then automatically distributed and chained by ProtoCompiler’s partitioning engine.
Because design debugging is iterative, the debug logic must be repeatedly recompiled and its connectivity across FPGAs updated. Each rebuild of the prototype causes delay, and threatens to disrupt the partitioning and system routing schemes in a multi-FPGA prototype.
To minimize disruption to the prototype floorplan, the HAPS multi-FPGA Deep Trace Debug (DTD) system relies on the HAPS MGB (multi-gigabit) interfaces, rather than the general purpose HapsTrak I/Os, to connect instrumented logic to a centralized Intelligent In-Circuit Emulator (IICE) system hosted by an external FPGA. Cables attached to the MGB sites, rather than HapsTrak, link the prototype (of up to eight FPGAs) to the debug hub for synchronization and storage.
The HAPS debug system is also designed to overcome the scarcity of embedded memory on the system’s FPGAs, which may have already been allocated to the caches, scratch-pad, and other memory elements of the SoC design. The lack of available on-chip memory limits the number of signals that can be instrumented and the sample period that can be captured. To address this capacity problem, HAPS DTD uses an external memory storage system of up to 8Gbyte.
This capacity means that hundreds of signals can be recorded for full seconds of clock time. The buffer can also be configured to maximize signal visibility by taking advantage of ProtoCompiler’s debugger signal multiplexing schemes. In this approach, eight groups of up to 2,000 signals can be designated as watchpoints for the IICE, making 16,000 signals available during a debugger session.
Figure 4 HAPS multi-FPGA deep trace debug (DTD) architecture (Source: Synopsys)
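A rough calculation shows why an 8Gbyte buffer translates into seconds of capture. The sketch below is a back-of-envelope estimate under stated assumptions: one bit stored per signal per sample, no compression or storage overhead, 8Gbyte taken as 8 x 2^30 bytes, and a hypothetical case of 500 signals sampled at 50MHz. None of these parameters come from the HAPS documentation.

```python
BUFFER_BYTES = 8 * 2**30  # 8Gbyte external trace memory (binary gigabytes assumed)


def capture_seconds(num_signals: int, sample_mhz: float,
                    buffer_bytes: int = BUFFER_BYTES) -> float:
    """Seconds of trace that fit, storing one bit per signal per sample."""
    bits_available = buffer_bytes * 8
    bits_per_second = num_signals * sample_mhz * 1e6
    return bits_available / bits_per_second


def visible_signals(groups: int, signals_per_group: int) -> int:
    """Signals reachable in a session when watchpoint groups are multiplexed."""
    return groups * signals_per_group


# Hypothetical case: 500 signals sampled at 50MHz.
print(round(capture_seconds(500, 50), 2), "seconds of capture")
print(visible_signals(8, 2000), "signals visible per session")
```

Under these assumptions the buffer holds a little under three seconds of activity for 500 signals, and the multiplexed watchpoint groups give the 16,000-signal total quoted above.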
Prototyping large ASIC or SoC designs on a set of FPGAs presents a number of complex engineering challenges. However, those challenges are being met by the introduction of techniques such as multi-processing, automated partitioning with TDM, system chaining strategies and multi-FPGA debug schemes in commercial tools such as those provided by Synopsys. The result is that it is increasingly possible to prototype designs of tens or even hundreds of millions of gates, accelerating debug and system software development.
Troy Scott is product marketing manager responsible for FPGA-based prototyping software tools at Synopsys. He has more than 20 years of experience in the EDA and semiconductor industries. His background includes HDL synthesis and simulation, SoC prototyping, and IP evaluation and marketing.