Cadence culls zeroes for faster neural throughput
Cadence Design Systems has launched what it calls its second-generation AI processor, with an architecture designed to exploit the structure of typical deep neural networks (DNNs) to reduce the number of wasted calculations and so improve overall throughput.
In common with a growing number of IP and silicon vendors, Cadence aims to take advantage of the trend to do more DNN inferencing on embedded devices instead of handing that work off to cloud servers, as is currently the case with services such as Amazon’s Echo and Apple’s Siri.
Lazaar Louis, senior director of product management, marketing and business development in Cadence’s Tensilica IP unit, said the company is looking not just at bringing more local AI processing into smart home assistants but also automotive systems, drones and smart-city surveillance units.
Louis pointed to cameras that observe city parking spots and which are used in services to direct drivers to empty locations. “Often the connectivity is bad: you can’t send the camera images to the cloud. It’s a similar problem for the drones used for inspection of wind turbines and power lines in rural areas.”
“Then there is privacy,” Louis said, pointing to consumer applications such as voice-driven assistants. “There are concerns about user data being sent to the cloud. There is a desire to have privacy, with processing done on the device.”
Louis claimed the DNA-100 offers up to 4.7x the performance of existing architectures for a given array size. “This is enabled through support for sparse computing and higher MAC utilization.”
The core optimization is what Cadence calls its sparse computing engine and the way that it handles calculations that do not result in useful work. In many convolution and fully connected layers there are many zero-value weights that do not need to tie up a MAC unit. The weight compression used by Cadence means zero weights are not stored explicitly and are identified once the data has been loaded and decompressed. Only calculations that involve non-zero weights are forwarded to the MAC pipelines. The result, for a network with 15 per cent zero weights and around double that proportion of zero activations, is a two-fold speedup.

Additional pruning carried out by the network compiler, which culls vectors that have limited impact on the output of a layer, increases the amount of useful work by a further 50 per cent. The main target for pruning is duplicated neurons, in order to limit the impact on accuracy, Louis said. “Our target accuracy loss is 1 per cent.”
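As a rough illustration of the idea (not Cadence’s actual implementation), the sketch below stores only the non-zero weights of a layer together with their positions, so that multiply-accumulate work scales with the number of non-zero weights rather than the full layer size. The storage format and function names here are hypothetical.

import numpy as np

def compress_weights(weights):
    """Keep only the non-zero weights and their positions
    (an illustrative stand-in for a compressed weight format)."""
    nz_idx = np.flatnonzero(weights)
    return nz_idx, weights[nz_idx]

def sparse_dot(nz_idx, nz_weights, activations):
    """Multiply-accumulate only where the weight is non-zero, so the
    MAC count follows the non-zero weight count, not the vector length."""
    acc = 0.0
    for i, w in zip(nz_idx, nz_weights):
        acc += w * activations[i]
    return acc

# Example: a layer where most weights have been pruned to zero.
weights = np.array([0.0, 0.8, 0.0, 0.0, -0.3, 0.0, 0.5, 0.0])
activations = np.array([1.0, 0.2, 0.0, 0.7, 0.4, 0.0, 1.5, 0.9])

idx, vals = compress_weights(weights)
print(sparse_dot(idx, vals, activations))  # 3 MACs instead of 8
print(float(weights @ activations))        # same result from the dense reference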
The DNA-100 has hardware support for activation calculation, with built-in functions for ReLU, sigmoid and several other common activation techniques.
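For reference, the two activations named above are simple element-wise operations applied to a layer’s outputs; the definitions below are the standard mathematical forms, not Cadence’s API.

import numpy as np

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return np.maximum(x, 0.0)

def sigmoid(x):
    """Sigmoid: squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))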
As it is an integer architecture, with quantization down to 8-bit, Cadence has implemented a compiler to convert trained networks from the standard floating-point formats to its architecture and to take care of pruning and compression. As well as supporting Caffe and TensorFlow, with PyTorch support planned, Cadence was one of the first crop of hardware vendors to say it would support Facebook’s recently launched Glow environment.
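The float-to-integer conversion step can be pictured with a simple symmetric 8-bit scheme, sketched below. Cadence has not detailed its compiler’s quantization method, so the scale choice here is an assumption for illustration only.

import numpy as np

def quantize_int8(weights_fp32):
    """Symmetric 8-bit quantization: map the float range onto [-127, 127].
    Illustrative only; the DNA-100 compiler's scheme may differ."""
    scale = max(float(np.max(np.abs(weights_fp32))), 1e-8) / 127.0
    q = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values to inspect quantization error."""
    return q.astype(np.float32) * scale

w = np.array([0.52, -1.3, 0.0, 0.07], dtype=np.float32)
q, s = quantize_int8(w)
print(q, s)
print(dequantize(q, s))  # close to the original weights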
Louis said the architecture can support arrays of the DNA-100 engine, interconnected using a network-on-chip (NoC). The DNA-100 processor will be available to select customers in December 2018 with general availability expected in the first quarter of 2019.