Developing and integrating configurable GPU IP using FPGA-based prototyping

By Andy Jolley | Posted: April 20, 2015

How Imagination Technologies used FPGA-based prototyping to develop its GPU IP and integrate it into a real-world system

Imagination Technologies develops families of complex GPU IP cores that can be configured for applications ranging from smartphones and tablets, games consoles, smart TVs and set-top boxes to in-car navigation information, laptops and netbooks.

The breadth of application means Imagination needs a robust FPGA-based prototyping methodology to help it develop the cores and the supporting software, and to offer end users a version of the cores that they can integrate into their SoC prototypes. The methodology also needs to scale from the simplest configurations to the most complex, which may include real-world PCIe, DDR3, and DVI interfaces to provide live graphics output.

How complex can the cores get? The Imagination PowerVR Series6, 6XE and 6XT GPU cores can have ½, 1, 2, 4, 6 or 8 unified shader clusters, with configurable 2D and 3D graphics processors. The largest configuration is eight times the size of the smallest.

The prototyping system

The Synopsys FPGA-based prototyping solution consists of the modular, scalable HAPS prototyping systems, and the integrated ProtoCompiler software environment for implementation, bring-up and debug. The hardware uses FPGA modules based on Xilinx Virtex-7 2000T devices, each able to prototype approximately 12 million ASIC gates, to create a system that scales from four million to 288 million ASIC gates. The system is designed to enable users to bring up designs quickly, run them at high system clock rates (in prototyping terms), and use off-FPGA memory to store seconds of debug data.

The ProtoCompiler software includes multiple FPGA partitioning, FPGA synthesis, platform bring-up and debug modules. The synthesis technology can migrate ASIC-centric RTL designs to the latest FPGA architectures, efficiently and with good utilization. ProtoCompiler also has automatic partitioning technology to split designs across the multiple FPGAs of a large HAPS-70 system, and debug technologies to capture and correlate signals from across the implementation, irrespective of the number of FPGAs used.

The ProtoCompiler partitioner splits designs across multiple FPGAs and handles the interconnection of their signals across FPGA boundaries. It can work automatically, or be driven by constraints that help designers guide the partitioning.

Creating a top-level test infrastructure for Imagination’s GPU IP

Imagination and Synopsys worked together to develop a flexible methodology to implement the PowerVR GX6250 GPU on the HAPS-70 platform. The first step was to develop a methodology for implementing a PowerVR Series6 core, using Certify, a manual partitioning tool. We assumed we would need a top-level test infrastructure to enable stand-alone testing. This would include a PCIe interface for host PC connections, DDR3 memory for test stimulus and result storage, and rate-conversion modules to connect the PCIe and DDR3 interfaces, which run at full speed, to the much slower prototype GPU.

The PCIe module, running at 125MHz, connects to the host PC that runs the regression test suite. The DDR3 memory module, running at 133MHz, holds test stimulus and results. For each regression test, the host PC loads test data into the DDR3 memory, then sends a command to the GPU to read in the test data and run the test. The results are written into the DDR3 memory so the host PC can read them back.
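
The load, run and read-back sequence can be sketched in a few lines of Python. This is only an illustration of the flow described above: the `FakePrototype` class, its methods, the memory addresses and the byte-reversing "GPU" are all hypothetical stand-ins, not the actual host software or driver interface.

```python
from dataclasses import dataclass, field


@dataclass
class FakePrototype:
    """Hypothetical stand-in for the DDR3 memory and GPU reached over PCIe."""
    ddr3: dict = field(default_factory=dict)

    def write_ddr3(self, addr: int, data: bytes) -> None:
        self.ddr3[addr] = data

    def read_ddr3(self, addr: int) -> bytes:
        return self.ddr3[addr]

    def run_gpu(self, src: int, dst: int) -> None:
        # Placeholder "GPU": reverses the bytes so the flow has a visible effect.
        self.ddr3[dst] = self.ddr3[src][::-1]


STIMULUS_ADDR = 0x0000_0000  # assumed memory layout, for illustration only
RESULT_ADDR = 0x1000_0000


def run_regression(proto, tests):
    """Run each (name, stimulus, expected) test and return {name: passed}."""
    outcome = {}
    for name, stimulus, expected in tests:
        proto.write_ddr3(STIMULUS_ADDR, stimulus)  # 1. host loads test data
        proto.run_gpu(STIMULUS_ADDR, RESULT_ADDR)  # 2. host commands the GPU
        result = proto.read_ddr3(RESULT_ADDR)      # 3. host reads results back
        outcome[name] = (result == expected)
    return outcome


tests = [("t0", b"abc", b"cba"), ("t1", b"xyz", b"xyz")]
summary = run_regression(FakePrototype(), tests)
print(summary)
```

Keeping the stimulus and results in the same off-chip memory means the host only needs one data path into the prototype, which is the design choice the paragraph above describes.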

Implementing the GPU on the HAPS-70 system – first pass

We chose a HAPS-70 S48 platform as the target for the PowerVR Series6 core. We used the manual-partitioning features of Certify to assign the GPU and test-infrastructure logic to the FPGAs of the HAPS platform. Experience from previous prototyping efforts told us to assign the two datapath-intensive universal shader cores, and the processing unit, to separate FPGAs, and put all the top-level test infrastructure logic on another FPGA so the various interfaces and memories could communicate at on-chip speeds.

All the other logic was manually assigned to the FPGAs. We also watched the utilization of each FPGA and adjusted the partitioning to minimize the interconnect between the FPGAs. Minimizing the number of signals that needed to be multiplexed across FPGA boundaries helped maximize performance. With this manual methodology, it took around a month to achieve a satisfactory partition and a system speed of 8MHz. But once the system was up and running we were able to run a set of 7,000 regression tests successfully.

Implementing the GPU on the HAPS-70 system – second pass

Having completed the Series6 implementation, the next challenge was to implement the PowerVR GX6250 GPU, a more complex design with additional logic, including data compressors, decompressors and a frame buffer, to support live graphics output.

By running area estimation, we could see that although the new shader blocks and processing units would each still fit in their assigned FPGAs, the assignment of the test-infrastructure logic, which utilized 90% of its FPGA’s logic in the first implementation, would have to be reconsidered.

To reduce time-to-prototype, we tried two ways to implement the more complex GPU.

In the first approach, we used the existing HAPS-70 S48 platform configuration as guidance for a manual repartition to accommodate the more complex core and its enhanced test infrastructure. This was possible, but the increased traffic between FPGAs required four times more pin multiplexing and therefore reduced system performance from 8MHz to 2MHz.
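
The drop from 8MHz to 2MHz is consistent with a first-order model in which the achievable system clock falls in proportion to the amount of pin multiplexing. A minimal Python sketch of that arithmetic follows. It assumes strict inverse proportionality, which is a simplification, and the function name is illustrative rather than part of any tool.

```python
def scaled_clock_mhz(base_mhz: float, mux_increase: float) -> float:
    """Estimate the system clock after pin multiplexing grows by `mux_increase`.

    Assumes the clock is inversely proportional to the multiplexing ratio.
    """
    if mux_increase <= 0:
        raise ValueError("mux_increase must be positive")
    return base_mhz / mux_increase


# Figures from the text: 8MHz baseline, four times more multiplexing.
estimate = scaled_clock_mhz(8.0, 4.0)
print(f"{estimate:.1f} MHz")
```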

In the second approach, we shifted from Certify to the recently introduced ProtoCompiler design automation and debug tool, and to a larger HAPS-70 system. We used five of the six FPGAs present on an S60 system comprising a four-FPGA S48 module and a two-FPGA S24 module. This was to accommodate the more complex test infrastructure and to give the ProtoCompiler partitioner greater flexibility by providing more resources.

Constraining the partitioning

The idea was to move away from the constrained approach of the S48 implementation and give the partitioner greater freedom to decide where logic should be implemented and how the individual FPGAs should be interconnected. However, we did lock the two universal shader cores into their own FPGAs, so that other logic could not be added to those devices by the partitioner. We also assigned the other processing units and test infrastructure logic to an FPGA each, but didn’t lock out other logic from being implemented in those FPGAs.

Simple constraints were used to configure the five-FPGA S60 system, set utilization limits for each FPGA at 80%, and define a simple pin-multiplexing strategy. For this first attempt at implementing the more complex GPU core, we also defined logical interconnect bus widths between each of the FPGAs, rather than defining explicit cable connections between them, using insights gained when implementing the Series6 core. We then ran the ProtoCompiler partitioner.
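
To make the two kinds of constraint concrete, the sketch below models them as plain Python data and checks a candidate assignment against them. It is not ProtoCompiler constraint syntax. The block names, area figures and FPGA labels are hypothetical, and only the 80% utilization limit and the idea of locking the shader cores into their own devices come from the text.

```python
from dataclasses import dataclass

MAX_UTILIZATION = 0.80  # per-FPGA limit quoted in the text


@dataclass(frozen=True)
class Block:
    name: str
    area: float              # fraction of one FPGA's logic (hypothetical values)
    exclusive: bool = False  # True: no other logic may share the FPGA


def check_partition(assignment: dict[str, list[Block]]) -> list[str]:
    """Return human-readable violations for an FPGA -> blocks assignment."""
    problems = []
    for fpga, blocks in assignment.items():
        used = sum(b.area for b in blocks)
        if used > MAX_UTILIZATION:
            problems.append(f"{fpga}: utilization {used:.0%} exceeds limit")
        if len(blocks) > 1 and any(b.exclusive for b in blocks):
            problems.append(f"{fpga}: locked block shares the device")
    return problems


assignment = {
    "FPGA_A": [Block("shader0", 0.70, exclusive=True)],
    "FPGA_B": [Block("shader1", 0.70, exclusive=True)],
    "FPGA_C": [Block("processing", 0.50), Block("glue", 0.35)],
    "FPGA_D": [Block("test_infra", 0.60)],
    "FPGA_E": [Block("misc", 0.20)],
}
violations = check_partition(assignment)
print(violations)
```

In this made-up assignment only `FPGA_C` is flagged, because its two blocks together exceed the 80% limit.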

ProtoCompiler includes an abstraction flow, which speeds up partitioning runs (to under one minute for this implementation) and provides an early view of the number of signals between FPGAs, the necessary multiplexing ratios, and the utilizations of each device. This analysis showed that the bottleneck for this implementation would be between FPGA D on the S48 system and FPGA A on the S24 system.

The resultant pin-multiplexing ratio of 16 was likely to yield a system performance of about 4MHz, about half that of the previous implementation. To fix this we rewrote the partitioning constraints to increase the logical interconnect between FPGA D of the S48 system and FPGA A of the S24 system from 200 to 300 signals. Rerunning the partitioner showed that the multiplexing ratio was reduced from 16 to 12, increasing system performance to about 6MHz.
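
The relationship between interconnect width and multiplexing ratio can be illustrated with a short Python calculation. This is a sketch under stated assumptions, not the partitioner's algorithm: the count of 3,200 signals crossing the boundary is hypothetical, the ratio is assumed to be rounded up to a multiple of four, and the logical interconnect width is treated as the number of connections available for sharing.

```python
def mux_ratio(signals: int, wires: int, step: int = 4) -> int:
    """Smallest multiple of `step` that lets `signals` share `wires` connections."""
    if signals <= 0 or wires <= 0 or step <= 0:
        raise ValueError("all arguments must be positive")
    groups = -(-signals // (wires * step))  # integer ceiling division
    return groups * step


CROSSING_SIGNALS = 3200  # hypothetical number of signals crossing the boundary

before = mux_ratio(CROSSING_SIGNALS, 200)
after = mux_ratio(CROSSING_SIGNALS, 300)
print(before, after)
```

With these assumed numbers, widening the interconnect from 200 to 300 lowers the ratio from 16 to 12, the same step reported above.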

To get a prototype running as fast as possible, we decided to go ahead with this implementation, using the information gained from the abstract iterations to define where to put flexible connectors to link between each of the FPGAs. We then advanced the implementation through FPGA synthesis, logic placement and routing to get the BIN files to configure the HAPS system. This implementation of the Imagination PowerVR GX6250 GPU core ran at 7.3MHz.

Adding live video output

The next step was to add the compressor, decompressor and frame buffer to the test-infrastructure logic to enable a video output. Fortunately, all this logic had already been assigned to one FPGA and there was enough spare capacity to resubmit this implementation without disrupting the original partition.

ProtoCompiler can perform most of its processes incrementally and in parallel, which made it quick to implement the changes necessary to incorporate the live video support logic without having to rerun the whole process. The updated design ran at the same system frequency as it had without the video support logic.

Improving system performance

Having achieved a first full prototype of the PowerVR GX6250 GPU as quickly as possible, the next step was to improve its performance by updating the pin multiplexing strategy.

The first implementation had used a simple asynchronous pin multiplexing strategy inherited from the prototype of the Series6 core developed using Certify. ProtoCompiler supports high-speed TDM I/O pin multiplexing to speed up data transfers to around 1Gbit/s. This approach relies on the availability of a clock capable of signalling between the source and destination FPGAs. Every connector used in the HAPS-70 system accommodates this, so all possible links between all FPGAs can operate at these higher speeds.
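
As a rough illustration of time-division multiplexing in general, the Python sketch below packs a set of signal values into per-wire frames, one value per time slot, and unpacks them on the receiving side. It models only the bookkeeping. The function names are hypothetical and it says nothing about how the HAPS hardware or ProtoCompiler implement the high-speed TDM cells.

```python
def tdm_send(signals: list[int], ratio: int) -> list[list[int]]:
    """Split signal bits into frames of `ratio` time slots, one frame per wire."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Pad with zeros so the last wire also carries a full frame.
    padded = signals + [0] * (-len(signals) % ratio)
    return [padded[i:i + ratio] for i in range(0, len(padded), ratio)]


def tdm_receive(frames: list[list[int]], count: int) -> list[int]:
    """Rebuild the original `count` signal bits from the received frames."""
    flat = [bit for frame in frames for bit in frame]
    return flat[:count]


signals = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
frames = tdm_send(signals, ratio=4)   # 10 signals need 3 wires at ratio 4
recovered = tdm_receive(frames, len(signals))
print(len(frames), recovered == signals)
```

Each frame needs `ratio` transfer slots for every cycle of the design clock, which is why a faster link clock or a lower ratio translates into higher system performance.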

ProtoCompiler maintains a full understanding of all the FPGA-to-FPGA links made using the flexible cable connections. This means the tool can assign the original FPGA-to-FPGA protocol signals to individual high-speed TDM cells, and then further assign them and their clocks to the traces available on the flexible cables. It can also map multi-terminal logical nets onto the point-to-point high-speed TDM connections, by replicating the drivers within the transmitting FPGAs for each point-to-point link.
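
Driver replication for multi-terminal nets can be pictured as expanding each net into one link per remote destination. The Python sketch below shows that expansion on a made-up netlist. The net names, FPGA labels and function name are hypothetical, and the code is a conceptual model rather than a description of the tool's internals.

```python
def to_point_to_point(
    nets: dict[str, tuple[str, list[str]]]
) -> list[tuple[str, str, str]]:
    """Expand each net (driver FPGA, sink FPGAs) into (net, source, destination) links.

    The driver is replicated once per remote sink; sinks on the driver's own
    FPGA need no inter-FPGA link.
    """
    links = []
    for name, (driver, sinks) in nets.items():
        for sink in sorted(set(sinks)):
            if sink != driver:
                links.append((name, driver, sink))
    return links


nets = {
    "irq": ("A", ["B", "C", "D"]),  # one driver, three remote sinks
    "ack": ("B", ["A", "B"]),       # one remote sink, one local sink
}
links = to_point_to_point(nets)
print(links)
```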

For the PowerVR GX6250 GPU implementation, ProtoCompiler automated the use of the high-speed TDM infrastructure, boosting system performance to around 12MHz.

Conclusion

Using the HAPS-70 system and ProtoCompiler, Imagination Technologies and Synopsys were able to build a prototyping strategy for Imagination’s GPU IP that enabled the company to bring up new core configurations quickly and then refine them incrementally to improve system performance.

Further information

Imagination Technologies PowerVR Series6 GPU cores

Synopsys HAPS-70 series FPGA prototyping system

Successful GPU IP implementation on Synopsys HAPS platforms using ProtoCompiler

YouTube video of Prototyping Imagination’s PowerVR Series6XT dual-cluster 64-core GPU with HAPS

Author

Andy Jolley is senior staff application consultant – worldwide product line lead, FPGA-based prototyping at Synopsys. Jolley has been working with FPGA technologies for more than 25 years, originally in a design capacity in the telecommunications, radar and video industries before supporting FPGA synthesis and prototyping technologies at Synplicity and then Synopsys. Most recently, Andy has been supporting UK customers with their complex CPU SoC and GPU IP prototyping needs on the Synopsys HAPS platforms and providing support for worldwide engagements to deploy the same SoC and GPU IPs embedded into user applications. Andy holds a Bachelor’s Degree in Electronic Engineering from the University of Brighton, England.