It has been said that more than 100 companies are currently developing custom hardware engines to accelerate machine learning (ML). Some target the data center, where huge amounts of algorithm development and training are performed. Training today often relies on large numbers of high-end GPUs and FPGAs, and power consumption has become one of its largest cost components. It is hoped that dedicated accelerators will both speed up this function and perform it using a fraction of the power. Because algorithms and networks are evolving so rapidly, these devices must retain maximum flexibility.
Other accelerators focus on the inference problem, which runs input sets through a trained network to produce a classification. Most are deployed in the field, where power, performance and accuracy are being optimized. Many are designed for a particular class of problem, such as audio or vision, and are targeted at segments including consumer, automotive or the IoT. Each of these specializations limits how much flexibility is necessary. Flexibility becomes a design optimization choice: the more that is fixed in hardware, the greater the performance or the lower the power, but the less amenable the device is to change through software.
At the heart of most of these devices is an array of computational elements that have access to distributed memory for holding weights and intermediate data. Waves of data are fed into the computational array and results flow out of it. The computational elements could be fixed function (such as a multiply-accumulate function) or could be specialized DSP blocks. Some devices implement them as hardwired blocks, others use elements that resemble highly focused processors, and yet others look more like FPGA blocks. Each of these blocks is then replicated many times and connected using some interconnect fabric. Other parts of the device are responsible for managing the flow of data through the chip or for performing custom functions that do not map neatly onto the computational fabric.
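To make that structure concrete, here is a minimal behavioral sketch of an array of fixed-function multiply-accumulate elements, each holding its weight locally, with waves of data fed through it. The names `MacElement` and `run_array` are hypothetical. The sketch ignores the interconnect, timing and number formats, and it does not describe any particular device.

```python
from dataclasses import dataclass


@dataclass
class MacElement:
    """One computational element: a locally stored weight plus an accumulator."""
    weight: int
    acc: int = 0

    def step(self, activation: int) -> None:
        # Fixed-function multiply-accumulate.
        self.acc += self.weight * activation


def run_array(weights, waves):
    """Feed waves of activations through a grid of MAC elements.

    weights: one list of weights per row of the array.
    waves:   a sequence of activation vectors, one value per column.
    Returns one accumulated result per row.
    """
    grid = [[MacElement(w) for w in row] for row in weights]
    for wave in waves:
        for row in grid:
            for element, activation in zip(row, wave):
                element.step(activation)
    # Each row's output is the sum of its elements' accumulators.
    return [sum(element.acc for element in row) for row in grid]
```

For example, `run_array([[1, 2], [3, 4]], [[5, 6]])` computes the products of a 2x2 weight array with a single wave `[5, 6]`.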
The big question is: how does one verify these devices? I am not going to claim that I have the answer. It is not clear if anyone has a definitive answer today. This subject area is nascent, and everyone is trying to learn from the best practices of the past while recognizing the unique challenges of these devices.
The programming scheme for these devices is also considerably different from the relationship that exists between software and an instruction set processor. Traditional programming languages are stable. While new languages are being designed, they all utilize the same underlying concepts and basically try to optimize the development process. AI software, by contrast, goes through a compilation process that is highly unpredictable and produces non-deterministic results. Retraining a network may produce an inference network that, after it has been quantized and optimized, is considerably different from the preceding one. This means that ‘real’ software has less relevance to the verification task than in the past.
It is important to consider the task being performed. When using Portable Stimulus (PSS) and Test Suite Synthesis, we are not trying to ascertain if the architecture is a good one; we are trying to ascertain that the architecture as defined works. The PSS graph does not know what representative workloads may look like, or what might represent worst-case conditions. While you may be able to extract a power trace from a testcase that was generated using Test Suite Synthesis, this may not represent a typical operating condition. Likewise, it is not possible to generate tests that meaningfully measure throughput. The verification of these attributes requires over-constraining the model to generate synthetic benchmarks. There are pros and cons to using this approach versus real-life sample workloads and networks.
But the verification of the engine on a few sample workloads does not provide enough confidence that the engine will be able to handle the range of networks it is likely to encounter during its lifespan. It thus becomes important to develop effective verification strategies that test the fundamental dataflows through these engines. Luckily, this is a task highly suited to PSS and Test Suite Synthesis.
A neural network is a set of nodes with connections between them. The best way to verify that is to focus on the outcomes, the results that you want to see from the last node. Then you can say that if this is the result you want to see, the previous three nodes must have this input and thus the nodes before that must have… and so on. That is how the problem solver inside the Test Suite Synthesis engine works, and that is why it is very good for this kind of problem.
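The flavor of that outcome-first reasoning can be shown with a toy sketch. It starts from the result we want to observe and walks backward through what each step requires, emitting the steps in an order that satisfies every dependency. This is only an illustration of goal-directed planning over a graph, not a description of how any commercial solver is implemented. The function `plan_backward` and the action names are hypothetical.

```python
def plan_backward(goal, requires):
    """Derive a legal sequence of actions by reasoning back from the goal.

    requires maps each action to the actions whose outputs it needs.
    Returns the actions in an order where every prerequisite comes first.
    """
    order = []
    seen = set()

    def visit(action):
        if action in seen:
            return
        seen.add(action)
        # Before this action can run, everything it needs must be planned.
        for prerequisite in requires.get(action, []):
            visit(prerequisite)
        order.append(action)

    visit(goal)
    return order


# Hypothetical dataflow: the result check needs the activation stage,
# which needs the MAC array, which needs weights and inputs loaded.
requires = {
    "check_result": ["activation"],
    "activation": ["mac_array"],
    "mac_array": ["load_weights", "load_inputs"],
}
plan = plan_backward("check_result", requires)
```

A real solver also has to pick data values and resolve resource conflicts along the way. The sketch covers only the ordering part of the problem and assumes the graph has no cycles.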
The job of the software toolchain for an AI engine is to schedule pieces of work on various pieces of hardware so that the right result comes out at the right time. Using the AI toolchain to come up with a representative set of tests is too difficult. We must start with the hardware itself and look at the capabilities that have been built into its architecture. For example, we may want to make sure that a particular queue of operations works correctly, or that a particular resource is maxed out. You can reason back through the network to see what you must feed in to achieve those corner cases.
The verification of a neural accelerator must rely on the same hierarchical approach used for more traditional processors, but perhaps with a different emphasis. First, the tiles (the replicated computational blocks) must be verified as close to exhaustively as possible. Depending upon the complexity of the blocks, this could be done with SystemVerilog and a constrained random verification methodology, or if they contain processors, PSS could be used.
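As a rough illustration of the constrained random idea at the tile level, the sketch below is written in Python rather than SystemVerilog and makes several assumptions: a tile that performs a signed 8-bit multiply-accumulate, a stand-in `tile_mac` model in place of the real design, and a stimulus constraint that weights corner values more heavily. All names are hypothetical.

```python
import random

INT8_MIN, INT8_MAX = -128, 127


def tile_mac(acc, a, b):
    """Stand-in for the design under test: a shift-and-add multiplier."""
    sign = -1 if (a < 0) != (b < 0) else 1
    x, y = abs(a), abs(b)
    product = 0
    while y:
        if y & 1:
            product += x
        x <<= 1
        y >>= 1
    return acc + sign * product


def random_operand(rng):
    """Constraint: signed 8-bit values, biased toward corner cases."""
    if rng.random() < 0.3:
        return rng.choice([INT8_MIN, -1, 0, 1, INT8_MAX])
    return rng.randint(INT8_MIN, INT8_MAX)


def run_random_test(num_ops, seed):
    """Drive random operations and compare against a reference model."""
    rng = random.Random(seed)  # seeded so that any failure is reproducible
    acc_dut = acc_ref = 0
    for _ in range(num_ops):
        a, b = random_operand(rng), random_operand(rng)
        acc_dut = tile_mac(acc_dut, a, b)
        acc_ref = acc_ref + a * b  # reference: plain integer arithmetic
        if acc_dut != acc_ref:
            return False
    return True
```

In a real flow the stand-in model would be replaced by the simulated RTL, and coverage would be collected to judge how close to exhaustive the testing has become.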
Second, the network that interconnects them must be verified. This could be a more extensive problem than is seen for system assembly today because of the large numbers of blocks involved, but it is likely to involve a regular array structure, so that may simplify the task.
Third, we need to define graphs in PSS that define the important dataflows through the devices and use Test Suite Synthesis to explore those. At first, this would be a simple verification of the individual flows, and then multiples of these can be scheduled concurrently to look for unintended interactions between them or to locate congestion points in the device.
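A very simple way to picture the search for congestion points is to overlay several flows in time and count how often each resource is requested in the same cycle. The sketch below assumes that each flow occupies one resource per cycle and that every resource can serve a fixed number of flows at once. The function `find_congestion` and the resource names are hypothetical.

```python
from collections import Counter


def find_congestion(flows, capacity=1):
    """Report (cycle, resource) pairs requested by more flows than allowed.

    flows: list of (start_cycle, path), where path is the sequence of
           resources a flow occupies, one per cycle.
    """
    usage = Counter()
    for start, path in flows:
        for step, resource in enumerate(path):
            usage[(start + step, resource)] += 1
    return sorted(key for key, count in usage.items() if count > capacity)


# Two flows start together and share the interconnect; a third starts later.
flows = [
    (0, ["dma0", "noc", "tile0"]),
    (0, ["dma1", "noc", "tile1"]),
    (1, ["dma0", "noc", "tile0"]),
]
hotspots = find_congestion(flows)
```

Here the first two flows collide on the shared `noc` resource in cycle 1, while the third flow, offset by one cycle, passes through without conflict. Varying the start offsets and the mix of flows is the kind of exploration that test synthesis would automate.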
Finally, some sample networks can be run, and these can be used to ascertain how effectively the device achieves its performance or power goals. However, the emphasis on real software is likely to be less than it was for traditional software running on processors.
The first wave of custom accelerator chips can be expected to hit the market quite soon. It remains to be seen how many will be successful, how many will have functional errors that will be difficult to hide using the software toolchain, and how many will reach the desired levels of performance or power.
Breker Verification Systems will continue to work with the industry to ensure that it has access to the tools it needs to enable a good functional verification methodology. We are listening and learning along with the industry. Together we can do this.