Behavioral IP reuse methodology
No one disputes the promise inherent in the concept of design reuse. But the true value of what has been delivered so far is often debated. This paper proposes a reuse methodology that is both practical and real and which uses behavioral synthesis as its driving technology. It discusses the most basic elements of behavioral synthesis and demonstrates two different reuse scenarios using a concrete example and real logic synthesis results.
Behavioral synthesis for beginners
Behavioral synthesis is a technology that allows you to design at a level of abstraction higher than the common practice of RTL design. It takes a relatively abstract description in a high-level language, such as SystemC, and generates an RTL or gate-level implementation of the functionality.
This process typically involves the conversion of the behavioral design into a data-path and a state machine. The FSM is automatically generated by the behavioral synthesis tool. The inputs to the behavioral synthesis process are:
- The behavioral description
- Technology information (in the form of a technology library)
- Implementation requirements (in the form of synthesis design constraints, performance requirements and directives)
By varying any one of these three inputs, you will get a different implementation (FSM/data-path pair). For example, constraining a design to a latency of 10 cycles will produce, say, implementation A, whereas constraining it to a latency of 8 cycles with pipelining will produce a completely different implementation B. To create RTL designs by hand that meet these requirements, you would need to write two completely different RTL implementations, doubling the workload. Using behavioral synthesis, you simply change your constraints.
For the purposes of the reuse discussion, perhaps the most important function that behavioral synthesis fulfils is that it automatically creates the schedule for the design, and maps the functional units and data path to that schedule. Each time it does this, it gets all of these pieces to work together in a way that is optimized for the constraints it was given.
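To make the idea of automatic scheduling concrete, here is a minimal sketch of resource-constrained list scheduling. It is a toy illustration and not the algorithm of any particular behavioral synthesis tool: the data-flow graph, the operation names and the one-cycle-per-operation timing are all assumptions made for the example. A real tool would also use delays from the technology library when deciding what fits in a cycle.

```python
from collections import namedtuple

# Hypothetical operation record: a name, a functional-unit kind, and its inputs.
Op = namedtuple("Op", ["name", "kind", "deps"])

# Assumed data-flow graph for y = (a*b + c*d) + e*f
OPS = [
    Op("m1", "mul", []),
    Op("m2", "mul", []),
    Op("m3", "mul", []),
    Op("a1", "add", ["m1", "m2"]),
    Op("a2", "add", ["a1", "m3"]),
]


def schedule(ops, resources):
    """Assign each op to a cycle, given the number of units of each kind.

    Every op is assumed to take exactly one cycle. Returns {name: cycle}.
    """
    for op in ops:
        if resources.get(op.kind, 0) < 1:
            raise ValueError("no functional unit available for " + op.kind)
    done = {}
    pending = list(ops)
    cycle = 0
    while pending:
        free = dict(resources)  # units still unused in this cycle
        started = []
        for op in pending:
            # An op is ready once all of its inputs finished in an earlier cycle.
            ready = all(d in done and done[d] < cycle for d in op.deps)
            if ready and free[op.kind] > 0:
                free[op.kind] -= 1
                started.append(op)
        for op in started:
            done[op.name] = cycle
            pending.remove(op)
        cycle += 1
    return done


def latency(sched):
    """Number of cycles from the first op starting to the last op finishing."""
    return max(sched.values()) + 1


small = schedule(OPS, {"mul": 1, "add": 1})  # one multiplier: smaller, slower
fast = schedule(OPS, {"mul": 3, "add": 1})   # three multipliers: larger, faster
print("1 multiplier :", latency(small), "cycles", small)
print("3 multipliers:", latency(fast), "cycles", fast)
```

The source description (`OPS`) never changes. Only the resource constraint does, and the schedule and latency change with it, which is the property the reuse methodology relies on.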
Forms of reuse
The term “reuse”, when used in an RTL design context, usually implies the use of an implementation from one design cycle to the next. This may be direct design reuse or there may be some small modifications to the original design. Such a design is often reused in a technical context that is quite different to that of the original implementation. For example, the first may have been implemented in a 180nm technology and the second in 130nm. These two environments will have quite different timing characteristics.
Figure 1. Rapid application derivatives.
The only practical way to implement such a reuse methodology is to simply plug your old RTL code into the logic synthesis flow with the new library. Since the new technology is typically faster than the old, this leaves much of the new technology's performance unused.
If you would like to reuse the original functionality, but would like it to actually be faster the second time around, you are facing a complete rewrite of the RTL code.
There is a second form of reuse that was simply not possible before behavioral synthesis was available. This is, specifically, retargeting. A single behavioral implementation can be trivially retargeted to different implementations for different purposes. There are many reasons to want to do this, including:
- Building fast, more expensive devices alongside a slower, smaller and cheaper implementation.
- Building two devices with the same throughput, but with differing clock speeds and latencies.
- Targeting an FPGA for prototyping and an ASIC for production.
With an RTL flow, building multiple targets usually requires complete re-implementation. This is very expensive and not often done.
Concrete example
Many of the issues surrounding reuse can be best demonstrated by way of a concrete example. To this end, we examined the implementation of a DCT/IDCT module. Our goal, when developing such a module, was to build an IP block capable of being reused in as many different cases as possible with no modification to its source code. We then further developed the exercise by looking at two distinct scenarios:
- The initial design needs to be targeted at two different clock speeds: 100MHz and 200MHz. The implementation technology will be 180nm.
- The initial design is built on 180nm at 100MHz. At some later date, we wish to reuse this same design with a 130nm technology. At this time we will consider the changes in performance required to:
  a. Move to 200MHz
  b. Move to 400MHz
  c. Implement at 200MHz, but with half the latency
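Each of these targets translates directly into a clock period constraint. The short sketch below does the arithmetic. The scenario labels and the dictionary layout are ours, chosen for illustration, and the 4-cycle base latency is the figure reported for the base design later in this article.

```python
def clock_period_ns(freq_mhz):
    """Convert a clock frequency in MHz to a clock period in ns."""
    return 1000.0 / freq_mhz


BASE_LATENCY_CYCLES = 4  # latency of the base design, as reported in the article

# (label, target clock in MHz, latency in cycles); labels are illustrative only
SCENARIOS = [
    ("base, 180nm", 100, BASE_LATENCY_CYCLES),
    ("a: 200MHz", 200, BASE_LATENCY_CYCLES),
    ("b: 400MHz", 400, BASE_LATENCY_CYCLES),
    ("c: 200MHz, half latency", 200, BASE_LATENCY_CYCLES // 2),
]


def constraints(scenarios):
    """Map each scenario label to (clock period in ns, latency in cycles)."""
    return {label: (clock_period_ns(mhz), cycles) for label, mhz, cycles in scenarios}


for label, (period, cycles) in constraints(SCENARIOS).items():
    print(f"{label:26s} period = {period:4.1f}ns, latency = {cycles} cycles")
```

So 100MHz, 200MHz and 400MHz correspond to clock periods of 10ns, 5ns and 2.5ns respectively, which are the values that appear in the constraint lists that follow.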
In walking through the different implementation scenarios, let’s compare the implications of using a behavioral implementation with those of using an RTL implementation.
The ‘base’ DCT/IDCT
The DCT/IDCT was implemented in its base form on 180nm technology with a latency of four cycles. The initial target was 100MHz. As with any behavioral implementation, we produced RTL output that was subsequently input to a normal logic synthesis flow. For the purposes of the exercise, we treated the RTL code that was output from the initial behavioral synthesis run as if it were the original, hand-coded RTL design.
This initial implementation had the following characteristics:
- Input Constraints
  - Technology – 180nm
  - Design Clock Period – 10ns
  - Logic Synthesis Clock Period – 10ns
  - Latency – 4 cycles
- Results
  - Area – 624K
  - Logic Synthesis slack – 0.001ns
Reusing the base block
Next, we assumed that it was two years down the line, and we were building a new application that needed a DCT block. Obviously, we wanted to reuse the one we already had. However, we had moved to a 130nm implementation technology, and faced the requirement that the new device be able to handle data at 4X the rate of the original.
Reuse in the RTL flow
When you have the original RTL code for a design, the first thing you might try is to simply take the existing code and put it through logic synthesis with the new technology library. On adopting this approach for the DCT/IDCT, we got the following results:
- Input Constraints
  - Technology – 130nm
  - Design Clock Period – 10ns
  - Logic Synthesis Clock Period – 10ns
  - Latency – 4 cycles
- Results
  - Area – 200K
  - Logic Synthesis slack – 4.37ns
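The slack figure tells us how much headroom this netlist has. As a rough sketch, the longest path in the synthesized design can be estimated as the clock period minus the slack. The function names below are ours, and the estimate applies only to this particular synthesis result: given a tighter clock constraint, a logic synthesis tool will re-optimize the logic and may do better than this figure suggests.

```python
def implied_critical_path_ns(clock_period_ns, slack_ns):
    """Estimate the longest path delay from the clock period and reported slack."""
    return clock_period_ns - slack_ns


def max_frequency_mhz(critical_path_ns):
    """Highest clock frequency at which a path of this length still fits in one cycle."""
    return 1000.0 / critical_path_ns


# Figures reported above for the 130nm run at the original 10ns constraint
path = implied_critical_path_ns(10.0, 4.37)
print(f"implied critical path: {path:.2f}ns")
print(f"implied maximum clock: {max_frequency_mhz(path):.0f}MHz")
```

With 4.37ns of slack on a 10ns clock, the longest path is roughly 5.63ns, so almost half of every clock cycle is idle.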
So, yes, this block was easily reusable with the new technology, but it was still running at the original performance level (10ns clock, 4 cycles). What we really needed was to improve the throughput of the circuit.
To that end, and since there is plenty of slack with the new 130nm technology, one could simply change the clock period for the logic synthesis tool. Applying that to the exercise produced these updated results:
- Input Constraints
  - Technology – 130nm
  - Design Clock Period – 10ns
  - Logic Synthesis Clock Period – 5ns
  - Latency – 4 cycles
- Results
  - Area – 195K
  - Logic Synthesis slack – 0.001ns
We are on the right path. With the new technology and the original RTL code, we have been able to double the performance of the design. But, remember, we really need a 4X improvement. So, you might say, cut the clock period in half again. In this exercise, that change gave the following:
- Input Constraints
  - Technology – 130nm
  - Design Clock Period – 10ns
  - Logic Synthesis Clock Period – 2.5ns
  - Latency – 4 cycles
- Results
  - Logic Synthesis slack – FAILED TO MEET TIMING
A dead end! Using this approach to reuse, it was clear that we would need to significantly rework the RTL code to meet the performance goals. A further implication was that a complete functional verification pass would need to be made. Our “RTL reuse” methodology has thus become so prohibitively expensive that it appears to be not much better than starting from scratch.
Reuse in the behavioral synthesis flow
Instead, we re-ran the exercise with the same objectives, but using behavioral synthesis as our implementation flow. We started out at exactly the same point. The divergence occurred when we came to face our future project objectives: the need to create a new implementation with the new performance parameters.
Two options based around behavioral synthesis were considered. For both, we re-ran the behavioral synthesis tool on the original behavioral description, varying only the constraints so that they reflected the new performance requirements.
First, we simply re-ran our original behavioral synthesis script, changing only the technology library (now 130nm) and the clock period (now 2.5ns). This gave us:
- Input Constraints
  - Technology – 130nm
  - Design Clock Period – 2.5ns
  - Logic Synthesis Clock Period – 2.5ns
  - Latency – 4 cycles
- Results
  - Area – 427K
  - Logic Synthesis slack – 0.001ns
The second variation was to reduce the clock period to 5ns, and reduce the latency of the design from 4 cycles to 2 cycles. We were able to do this because all of these variables are simply constraints on the behavioral synthesis tool. This implementation produced:
- Input Constraints
  - Technology – 130nm
  - Design Clock Period – 5ns
  - Logic Synthesis Clock Period – 5ns
  - Latency – 2 cycles
- Results
  - Area – 300K
  - Logic Synthesis slack – 0.001ns
Objective achieved.
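To see why both behavioral variations reach the 4X goal while the RTL route stops at 2X, it helps to convert each run into the time taken to process one block. The sketch below does that arithmetic using the clock periods and latencies reported above. It assumes that the design accepts one new block only after the previous one has finished (no pipelining), which is our simplification and is not stated in the results. The run labels are also ours.

```python
def block_time_ns(clock_period_ns, latency_cycles):
    """Time to process one block, assuming no overlap between blocks."""
    return clock_period_ns * latency_cycles


def speedup(base, run):
    """Ratio of the base block time to the block time of a run."""
    return block_time_ns(*base) / block_time_ns(*run)


BASE = (10.0, 4)  # 180nm base design: 10ns clock, 4 cycles

# (label, (achieved clock period in ns, latency in cycles)); only runs that met timing
RUNS = [
    ("RTL reuse, 130nm, 10ns", (10.0, 4)),
    ("RTL reuse, 130nm, 5ns", (5.0, 4)),
    ("Behavioral, 130nm, 2.5ns", (2.5, 4)),
    ("Behavioral, 130nm, 5ns, 2 cycles", (5.0, 2)),
]

for label, run in RUNS:
    print(f"{label:34s} {block_time_ns(*run):5.1f}ns per block, {speedup(BASE, run):.1f}X")
```

Under this assumption the base design takes 40ns per block. The best RTL reuse run takes 20ns (2X), while both behavioral runs take 10ns (4X): one by running the same 4-cycle schedule at a 2.5ns clock, the other by halving the latency at a 5ns clock.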
Conclusion
This comparison demonstrates that there is some potential in reuse in an RTL flow, but that it faces severe limitations in a migration to future designs. With an RTL reuse methodology, we could reuse our DCT/IDCT IP module in our future design, with a modern technology, but got only halfway to our target performance goal.
With a behavioral synthesis methodology, by contrast, we had maximum flexibility to reuse IP blocks in future designs. We were able to get to our performance goal of a 4X improvement without changing a single line of code in the original behavioral description. It is also worth noting that the original IP was already functionally validated, so the theoretical verification cost would have been negligible.
It is therefore clear that a behavioral IP reuse strategy has the potential to finally deliver on the promise of design reuse that we have all been waiting for.