The challenge of clock domain crossings – and some solutions
Advanced semiconductor processes have made it possible to integrate hundreds of millions of gates of digital logic on a die. What has made this practical, however, has been the shift to block-based design, in which many large functional blocks from a variety of sources are quickly integrated into a new SoC. Without the ability to reuse design blocks, it would be impractical, and perhaps even impossible, to take full advantage of the capabilities of an advanced process in any reasonable timescale – designing all that functionality from scratch is simply too complex.
What abstraction to the block level gives with one hand, it tends to take away with the other. Even if each block can be relied upon to behave properly within its boundaries, a complex SoC design attempts to integrate and then coordinate many such blocks, despite the fact that each may have been designed by a different group using a different strategy. Each block, for example, may expect a different clock rate, may dynamically adjust its clock to match its workload, and may employ sophisticated clock-gating strategies to minimise power consumption.
How bad is the problem becoming? According to a survey last year, 32% of SoC designs underway among those surveyed employed 50 or more clock domains. SoC designers, therefore, are faced with trying to ensure that their systemic and inter-block clocking strategies work as expected, even as thousands of signals pass between tens or even hundreds of different clock domains. Add in power-management strategies that turn blocks on and off to minimise energy use, and therefore leave signals at block boundaries in undetermined states, and complex external interfaces which introduce their own clocking requirements, and the potential for errors multiplies.
What sort of errors can occur? Clock domain crossing (CDC) bugs can happen when a signal crosses from one asynchronous domain to another and changes too close to the receiving clock edge, violating the set-up or hold time of the capturing flip-flop. The flip-flop can then become metastable, so that the captured value is non-deterministic and incorrect values are propagated through the design, causing functional errors. The problem with such errors is that standard methods such as simulation or static timing analysis are poorly suited to tracking them down, because the failures arise from corner-case combinations of timing and functionality that make them unpredictable and intermittent.
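To give a feel for how often a signal can arrive "too close" to the receiving clock edge, the following is a minimal sketch in Python rather than a model of any particular tool. It assumes the two clocks are unrelated, so that data transitions fall uniformly across the receiving clock period, and that the set-up plus hold window is shorter than that period. The function name and all the numbers are hypothetical, chosen only for illustration.

```python
def violation_rate_hz(window_s, f_clk_hz, f_data_hz):
    """Expected set-up/hold window hits per second on an asynchronous crossing.

    window_s  -- combined set-up + hold window of the capturing flop (seconds)
    f_clk_hz  -- receiving clock frequency (Hz)
    f_data_hz -- average rate of data transitions on the crossing signal (Hz)
    """
    # Fraction of each receiving clock period occupied by the vulnerable window;
    # under the uniform-arrival assumption this is the chance one transition hits it.
    p_hit = window_s * f_clk_hz
    return p_hit * f_data_hz


# Hypothetical figures: 100 ps window, 500 MHz receiving clock, 10 MHz data rate.
rate = violation_rate_hz(100e-12, 500e6, 10e6)
print(f"{rate:.0f} window violations per second")
```

Not every window violation produces a visible failure, but the arithmetic suggests why an unprotected crossing cannot be expected to be caught by a handful of simulation runs, and why each crossing needs deliberate treatment.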
Some of the tools used to tackle such bugs have limited capacity, which can force designers into partitioning their SoCs artificially, so that each sub-block can be handled within the tool’s capacity. Other methodologies produce so many false positives that their diagnostic value is self-limiting – there are just too many potential errors to consider.
Our current thinking is that designers need a configurable solution, which uses both structural and functional analysis to ensure that signals which cross between domains on both ASICs and FPGAs are received reliably. Such a tool also needs high capacity and to work hierarchically, so that a design can be partitioned for analysis in a way that matches the original design intent, but doesn’t sacrifice top-level, full-chip precision on extremely large designs.
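As an illustration of what the structural half of such an analysis might look like, here is a deliberately simplified Python sketch. It is not how any specific CDC tool works. The `Flop` record, the `is_synchronizer` flag and the register names are all hypothetical, and the model ignores combinational logic, multi-bit buses and reconvergence, all of which a real analysis would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str
    clock: str                     # name of the clock domain driving this flop
    is_synchronizer: bool = False  # assumed marker: first stage of a synchroniser chain


def find_unsynchronized_crossings(flops, connections):
    """Return (source, destination) pairs that cross domains without a synchroniser.

    connections is an iterable of (source_name, destination_name) flop-to-flop paths.
    """
    by_name = {f.name: f for f in flops}
    violations = []
    for src, dst in connections:
        s, d = by_name[src], by_name[dst]
        # A crossing is flagged when the clocks differ and the receiving
        # flop is not marked as a synchroniser stage.
        if s.clock != d.clock and not d.is_synchronizer:
            violations.append((src, dst))
    return violations


# Hypothetical toy netlist: one protected crossing and one unprotected one.
flops = [
    Flop("tx_data", "clk_a"),
    Flop("rx_sync0", "clk_b", is_synchronizer=True),
    Flop("rx_raw", "clk_b"),
]
connections = [("tx_data", "rx_sync0"), ("tx_data", "rx_raw")]
violations = find_unsynchronized_crossings(flops, connections)
print(violations)
```

A purely structural pass like this can only say whether a synchroniser is present. Checking that the data behaves correctly across the crossing is the job of the functional analysis mentioned above.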
If we can get this right then we have a chance of helping designers tackle the new class of issues that has emerged as chip design has evolved into a block-integration challenge.