RAID vendor Dot Hill adopts OVM flow for reliability
How the company migrated to an OVM-based methodology to design and verify a 30 million-gate ASIC design, on the path to UVM.
Here’s the thing about RAID storage: anyone can do it. It’s easy enough to take an updated Windows PC or server and a handful of spare hard drives, and then set up a software RAID that delivers the basics of reliability, performance and capacity.
Then again….
DIY RAID quickly gets ragged when the unusual happens: power failures, severed PCI Express (PCIe) links or even stray alpha particles flipping bits in an SRAM. These events are not that uncommon. As ‘big data’ propels complexity and demand for cloud storage burgeons, they are among the reasons why storage vendors like Dot Hill Systems exist. Such companies promise customers that they will not lose data even when bizarre exceptions occur.
This is how Dot Hill kept that promise when designing and verifying a 30 million-gate RAID accelerator using advanced OVM-based verification while migrating from AVM. The approach yielded several benefits, including first-pass success on its maiden use of an ASIC architecture.
Figure 1 Chip plot of Dot Hill's OVM-verified ASIC (Source: Dot Hill Systems)
Increasing bit rates and customer expectations
The accelerator helps move data around and churn through XOR calculations. It is the key enabler for performance and speed. The chip’s specifications included two ARM 926 processors, four eight-lane PCIe Gen 2 interfaces, a dual-channel DDR3 interface running at 1333 MHz, and RAID acceleration engines.
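To make the XOR workload concrete, here is a minimal software sketch of single-parity RAID arithmetic: parity is the byte-wise XOR of the data blocks, and any one lost block can be rebuilt by XOR-ing the survivors with the parity. The block contents and the helper name xor_blocks are illustrative assumptions; the article does not describe how Dot Hill's engines are implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical 4-byte data blocks held on three drives.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Simulate losing drive 1 and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

A hardware engine performs the same operation on far larger blocks, which is why a handful of register writes can translate into thousands of XOR operations.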
The building blocks were assembled and verified by a team of six engineers. A decade ago, teams would have been stitching together PCI links that could send and receive 133MBps and DRAM that shuttled data at 800MBps. Today’s bit rates are 4GBps and 20GBps, respectively.
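The headline figures can be roughly reproduced from the interface parameters. The sketch below assumes that the older numbers refer to 32-bit, 33MHz PCI and 64-bit SDRAM at 100 million transfers per second, and that the DDR3 figure is a transfer rate of 1333 million transfers per second on two 64-bit channels. Those interpretations are assumptions, not statements from Dot Hill. PCIe Gen 2 signals at 5GT/s per lane with 8b/10b encoding.

```python
def pcie_gen2_mbytes_per_s(lanes):
    """Usable one-direction bandwidth of a PCIe Gen 2 link, in MB/s."""
    line_rate_mbits = 5000                      # 5 GT/s per lane
    payload_mbits = line_rate_mbits * 8 // 10   # 8b/10b: 8 payload bits per 10 sent
    return lanes * payload_mbits // 8           # bits to bytes


def dram_mbytes_per_s(mega_transfers, channels=1, bus_bytes=8):
    """Peak DRAM bandwidth in MB/s; bus_bytes=8 models a 64-bit channel."""
    return mega_transfers * bus_bytes * channels


pcie_then = 33.33 * 4                           # assumed 32-bit, 33 MHz PCI
dram_then = dram_mbytes_per_s(100)              # assumed 100 MT/s SDRAM
pcie_now = pcie_gen2_mbytes_per_s(8)            # eight-lane Gen 2 link
dram_now = dram_mbytes_per_s(1333, channels=2)  # assumed dual-channel DDR3-1333

print(f"PCI(e): {pcie_then:.0f} -> {pcie_now} MB/s")
print(f"DRAM:   {dram_then} -> {dram_now} MB/s")
```

Under these assumptions the eight-lane link works out to 4,000MBps per direction and the memory interface to a little over 21,000MBps peak, in line with the rounded figures quoted above.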
And bit rates are just one measure of change and increasing complexity.
“We’ve had the migration from parallel buses to high-speed serial links, too,” says verification engineer Ty Sell. “Maintaining signal integrity is a challenge.”
Another challenge is the move beyond simple parity on buses for error correction. With PCIe, things are trickier. This is partly because all PCIe devices support existing, non-PCIe-aware software for error handling (by mapping PCIe errors to existing PCI reporting mechanisms). There are also new PCIe-specific mechanisms.
Consider, too, how customer expectations, once established, never ratchet back. Today, it’s assumed that RAID systems are resilient and self-recovering. So, verification teams must test all the recovery mechanisms that are supposed to kick in when bad data comes across a link. These mechanisms should execute so quickly that the user does not notice them.
Skyrocketing expectations for speed are the main reason Dot Hill switched from FPGA- to ASIC-based designs. Despite the many advances in FPGAs, ASICs remain far and away the performance leader.
Trading lab debug for simulation – and lots of it
Moving to ASICs required fundamental changes to the verification methodology and the shift to OVM.
“We’d try to get the FPGAs into the lab as quick as we could and just find the problems there,” says Mike Peters, a Dot Hill design engineer. “With an ASIC, the goal is to find problems in simulation before you build the chip. This required much more thorough verification than the company has ever done before.”
Consider the issues tossed at the simulator. For starters, there’s the sheer volume of traffic pulsing through the device and all the associated concurrency those bits entail. The four PCIe ports can route through any of the other PCIe ports or to DDR. Writing to two or three registers in a RAID engine can launch literally thousands of XOR operations requiring complex calculations. All the while, the processors are running and executing code.
Figure 2 The Dot Hill ASIC is used in storage arrays (Source: Dot Hill Systems)
Even something as seemingly simple as routing data across the chip is fraught. All the ASIC components are connected via a high-speed, point-to-point switch fabric, which itself supports multicasting. So, a single write can be written to multiple destinations. That indeterminacy makes simulation devilishly difficult.
Heavy internal and external traffic traveling at high speeds directly ties to the second big challenge — testing for all the myriad options, configurations and use cases. One of the biggest problems Dot Hill encountered came in maintaining control over any overlapping cycles. When writing and reading randomly throughout memory, you must ensure that nothing overlaps or interferes with anything else.
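One common way to keep random traffic from colliding is to reserve disjoint address regions before any transfers are issued. The following is a minimal sketch of that idea; the function names, memory size and region lengths are hypothetical, and Dot Hill's actual testbench (written in SystemVerilog) is not shown in the article.

```python
import random


def random_disjoint_regions(rng, mem_size, count, max_len):
    """Pick `count` random [start, end) regions in memory that never overlap."""
    regions = []
    attempts = 0
    while len(regions) < count:
        attempts += 1
        if attempts > 10_000:
            raise RuntimeError("could not place regions; memory too crowded")
        length = rng.randint(1, max_len)
        start = rng.randrange(0, mem_size - length + 1)
        end = start + length
        # Accept the candidate only if it lies entirely before or entirely
        # after every region that has already been reserved.
        if all(end <= s or start >= e for s, e in regions):
            regions.append((start, end))
    return sorted(regions)


def has_overlap(regions):
    """Check neighbouring regions in sorted order for any overlap."""
    ordered = sorted(regions)
    return any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(ordered, ordered[1:]))


rng = random.Random(2012)  # fixed seed so a failing run can be replayed
regions = random_disjoint_regions(rng, mem_size=1 << 20, count=16, max_len=4096)
print(len(regions), has_overlap(regions))
```

Each concurrent read or write stream can then be handed its own region, so any corruption seen in one region points to a real routing or ordering bug rather than to two tests stepping on each other.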
The drivers for test proliferation are not random. Rather, they reflect the broader context in which a chip will operate and the configuration choices that will be made by the end customer. Dot Hill’s ASIC has a great deal of programmability and there is no single, fixed-data path. Address maps are everywhere and users can reconfigure address bases to be sent to different places on the circuit. All this rerouting must be tested.
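A reconfigurable address map can be pictured as a set of windows, each with a base and a size, that decide where an access is routed. The sketch below is an illustrative model only: the class, the window names and the addresses are assumptions made for the example, not details of the Dot Hill design.

```python
class AddressMap:
    """Toy model of base/size windows that route an address to a destination."""

    def __init__(self):
        self.windows = {}  # destination name -> (base, size)

    def set_window(self, name, base, size):
        """Create or move a window, refusing any overlap with another window."""
        for other, (b, s) in self.windows.items():
            if other != name and base < b + s and b < base + size:
                raise ValueError(f"{name} would overlap {other}")
        self.windows[name] = (base, size)

    def route(self, addr):
        """Return (destination, offset) for an address, or None if unmapped."""
        for name, (base, size) in self.windows.items():
            if base <= addr < base + size:
                return name, addr - base
        return None


amap = AddressMap()
amap.set_window("ddr", 0x0000, 0x1000)     # hypothetical destinations
amap.set_window("pcie0", 0x1000, 0x1000)
before = amap.route(0x1004)

amap.set_window("pcie0", 0x8000, 0x1000)   # the user reprograms the base
after_old = amap.route(0x1004)
after_new = amap.route(0x8004)
print(before, after_old, after_new)
```

Every reprogramming of a base changes where the same address lands, which is why each rerouting needs its own checks.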
Each port has high variability too. Though the PCIe ports are eight lanes wide, they can operate at four lanes, two lanes or one lane, and in Gen 2 or Gen 1 mode. The DRAM can be DDR2 or DDR3, single channel or dual channel. The DIMMs can be composed of different configurations of chips with different sizes and numbers of data pins.
“Trying to handle all this with directed tests that covered all permutations and configurations just isn’t possible,” says Peters. “So we tried to run lots and lots of random tests over a long period of time on as many servers as we could to get as much coverage as possible.”
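A quick count shows why directed tests run out of road. The sketch below enumerates only the link and memory options named above (lane width and generation for each of four PCIe ports, plus DRAM type and channel count) and then draws random configurations from that space. It deliberately ignores DIMM organization, address maps and traffic patterns, so the real space is far larger. The data layout and seed are assumptions made for illustration.

```python
import itertools
import random

PCIE_WIDTHS = (8, 4, 2, 1)
PCIE_GENS = (1, 2)
DRAM_TYPES = ("DDR2", "DDR3")
DRAM_CHANNELS = (1, 2)
NUM_PCIE_PORTS = 4

port_modes = list(itertools.product(PCIE_WIDTHS, PCIE_GENS))     # 8 per port
dram_modes = list(itertools.product(DRAM_TYPES, DRAM_CHANNELS))  # 4 in total

# Each port is configured independently of the others and of the DRAM.
total_configs = len(port_modes) ** NUM_PCIE_PORTS * len(dram_modes)


def random_config(rng):
    """Draw one random chip configuration from the enumerated options."""
    return {
        "ports": [rng.choice(port_modes) for _ in range(NUM_PCIE_PORTS)],
        "dram": rng.choice(dram_modes),
    }


rng = random.Random(2012)  # fixed seed so a failing run can be reproduced
sampled = {
    (tuple(cfg["ports"]), cfg["dram"])
    for cfg in (random_config(rng) for _ in range(1000))
}
print(f"{len(sampled)} distinct configurations sampled out of {total_configs}")
```

Even this reduced model has 16,384 combinations before a single data pattern is chosen, which is the argument for spreading seeded random runs across as many servers as possible.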
Other characteristics of the ASIC, including features designed to increase robustness, created further verification headaches. The chip generates lots of interrupts, each one requiring that traffic stop and an action be taken, often including some sort of error correction. Since each data path is protected with parity, myriad instances of error injection and fault recovery had to be tested.
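The principle behind parity error injection can be shown in a few lines. In this sketch a transfer carries one even-parity bit, a test optionally flips a data bit in flight, and the receiver checks whether data and parity still agree. The word width, function names and test value are illustrative assumptions; the article does not specify the parity scheme used on the chip.

```python
def even_parity(word):
    """Return the parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") & 1


def send(word, flip_bit=None):
    """Model a parity-protected transfer, optionally flipping one data bit."""
    parity = even_parity(word)     # computed by the sender before any fault
    if flip_bit is not None:
        word ^= 1 << flip_bit      # injected fault
    return word, parity


def check(word, parity):
    """Receiver-side check: True when data and parity still agree."""
    return even_parity(word) == parity


clean_ok = check(*send(0xDEADBEEF))
single_flip_caught = all(not check(*send(0xDEADBEEF, bit)) for bit in range(32))

# A second flipped bit cancels the first: plain parity misses even-count errors.
word, parity = send(0xDEADBEEF, flip_bit=0)
double_flip_missed = check(word ^ (1 << 1), parity)
print(clean_ok, single_flip_caught, double_flip_missed)
```

A testbench built on this idea also has to confirm what happens next: that the right interrupt fires and that the recovery path restores good data.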
Module by module
The key to verification is to start small and build from there. For Dot Hill, this meant starting by testing individual modules of the ASIC, gaining confidence that these modules worked properly and then moving to the full chip testbench. The good news: almost none of this granular work is wasted thanks to the reuse enabled by OVM.
The work requires a craftsman’s eye and a saint’s patience. Compared to the eventual number of external interfaces, the chip’s inner workings present far more interfaces that must be modeled. The team previously had to sweat through the details of writing laundry lists of directed tests for these modules.
“But with SystemVerilog and OVM/UVM, there is so much power from randomization you can build into your tests,” says Peters. “It’s much easier to get a robust test on a module quickly.”
Standardized methodologies and languages also offer more flexibility. This was important because Dot Hill did not abandon FPGAs completely. It built an FPGA prototype, mostly to allow the software team to make an early start on the close-to-the-metal code. Unfortunately, the prototype turned out to be substantially different from the production device, in part because of turmoil in the ASIC market.
When Dot Hill launched this project, many ASIC vendors were in the midst of a process change from 90 to 65nm. Moreover, Dot Hill’s first ASIC contractor went out of business a few months after starting work on the design. This forced the team to shift away from original specifications still in the prototype: it had PCIe interfaces that were four rather than eight lanes wide and memory that was single-channel DDR2 rather than dual-channel DDR3.
OVM allowed the Dot Hill team to abstract enough of the design so that the same tests could be run in both environments, even for very different pinouts. Beyond the Questa Advanced Simulator, the team used Questa Codelink to debug and test software that would run on the ARM processors and for software-driven verification of certain hardware components. This was crucial for an internal ROM that had to be error-free because it contained boot code (if this code didn’t work properly, it would have rendered the processors worthless).
“Software is normally easily changed, but not when it’s embedded in a ROM inside a chip — that’s why it received a little extra attention,” says Peters.
Questa Codelink works in much the same way as the latest generation of digital video recording devices. During functional simulation, advanced tracing technology automatically captures and compresses all important activity inside the processors, enabling the verification engineer to ‘playback’ a simulation with features such as fast forward, rewind, pause, single step, and even equivalents to zoom and pan.
Benchmarking success
The key benchmark was whether the new approach worked and how the results compared to earlier FPGA-based accelerators. As one answer, consider that within approximately two hours of getting the chip back on a board in its lab, the team had the system running more or less as expected. Once some minor board problems were fixed, the team was able to access the processors. A short time afterwards, it had functional DRAM, and within about a week it was pretty much running RAID cycles, which is not bad on a chip this large.
Also not bad given that it can take months to bring up an FPGA in a lab. Granted, this pace can be a result of many things beyond a verification team’s control (mostly that hardware visibility in FPGAs is opaque compared to ASICs). But it always poses the singular challenge of being slow.
Concrete benefits of OVM abstraction
This work occurred mostly in the first half of 2012. The Dot Hill team that handled the migration from AVM to OVM is now well down the road on a new chip, this time with an entirely UVM-based verification environment. Like its predecessor, the new chip is heavily controlled via registers. So it’s perhaps not surprising the team already has good things to say about the coverage built into the UVM register package.
The biggest benefit continues to come from the ability to abstract more and more of the verification work. This started several years ago when Dot Hill moved to AVM and made an effort to do scoreboarding, write drivers and so on. With OVM, many of these features evolved into standard core components of the methodology, freeing up verification engineers from supporting their own custom components. The UVM testbench the team is working on now is going to have an unprecedented amount of top-level reusability.
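For readers unfamiliar with the term, a scoreboard is the testbench component that compares what the design produced against what a reference model predicted. The sketch below shows the idea in its simplest in-order form. It is a language-neutral illustration with hypothetical names, not the OVM or UVM class library, which is written in SystemVerilog.

```python
from collections import deque


class Scoreboard:
    """In-order scoreboard: predictions are queued, observations are compared."""

    def __init__(self):
        self.expected = deque()
        self.matches = 0
        self.mismatches = []  # list of (expected, observed) pairs

    def expect(self, item):
        """Record a transaction the reference model says should appear."""
        self.expected.append(item)

    def observe(self, item):
        """Compare a transaction seen at the design's output with the oldest prediction."""
        if not self.expected:
            self.mismatches.append((None, item))  # unexpected extra output
            return
        want = self.expected.popleft()
        if want == item:
            self.matches += 1
        else:
            self.mismatches.append((want, item))

    def passed(self):
        """Pass only if nothing mismatched and no prediction is left over."""
        return not self.mismatches and not self.expected


sb = Scoreboard()
for payload in (b"\x00\x01", b"\x02\x03"):
    sb.expect(payload)
sb.observe(b"\x00\x01")
sb.observe(b"\x02\x03")
print(sb.passed(), sb.matches)
```

When a methodology supplies building blocks of this kind as standard components, a team no longer has to write and maintain its own version for every project.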
Figure 3 From left-to-right: Mike Peters, Ty Sell, Don Allingham (Source: Dot Hill Systems)
“AVM, OVM, UVM — each is better than what we had before,” says Sell. “Each one is making us more productive.”
The progression bodes well for the ability to keep up with demands for advanced RAID-based storage and backup. The only bad news, if it counts as such, is that the team will also have to keep up with demands of Dot Hill management.
“Now the company expects us to do a more complex data chip with the same team in less time,” says Peters. “No good verification work goes unpunished.”
And while building and testing advanced RAID arrays may be something of a black art, that’s a sentiment nearly any verification engineer can relate to.
About Dot Hill
Dot Hill Systems Corp. is a recognized leader in software and hardware solutions for storing, sharing, protecting and managing data. Leveraging its proprietary Assured family of storage solutions, Dot Hill solves many of today’s most challenging storage problems – helping IT to improve performance, increase availability, simplify operations, and reduce costs.
About the authors
Michael Horn is a principal verification architect at Mentor Graphics. Don Allingham is a design engineer mostly focused on ARM-related projects at Dot Hill Systems.
Contact
Mentor Graphics
Corporate Office
8005 SW Boeckman Rd
Wilsonville
OR 97070
USA
T: +1 800 547 3000
W: www.mentor.com