The Linley Spring Conference this week (19 April) saw several vendors present architectures they claim can deliver more performance to edge systems, together with a move away from benchmarks based on the old stalwarts of ResNet and YOLO.
Though the TOPS metric is already waning as a way of describing the performance of network accelerators, Hailo chief architect Daniel Chibotero took aim at it in his presentation on the Hailo-8, which is now in mass production. He argued it is far better to assess how well an embedded accelerator will fare in the real world by running a portfolio of models – and relatively new models at that. “Usually, customers’ workloads make use of modern, state-of-the-art networks,” he said.
Chibotero pointed to another issue that drives the design of edge-AI accelerators in a different direction from server-based designs. Both are constrained by power consumption but with edge systems, he said, the main requirement is that the energy the accelerator needs for the target model does not exceed the ability of passive cooling to keep the chip’s transistors from overheating.
The architecture used by Hailo’s device, built on a 16nm process, distributes control, memory and processing elements across the SoC and uses dataflow techniques so that the intermediate results produced by each layer move through the model’s pipeline.
Dataflow looks to be a popular model for edge AI. Expedera chief scientist Sharad Chole, who used to work at Cisco, has adopted a network-processor paradigm similar to that of the relatively new class of DPU devices offered by Fungible and Nvidia to run neural-network applications. He described how networking-SoC providers had succeeded in moving traditional software operations into hardware, letting data flow through the device to be processed by a stream of different operations.
“We came up with a completely new architecture with no global interconnect,” Chole claimed, using a deeply pipelined design similar to DPUs. Hardware schedulers use queues to organize how data packets move between execution units. The overall approach uses what he called a “fire and forget” strategy that can potentially deliver high numbers of inferences per second per watt (Ips/W), the metric Expedera has chosen to focus on. He cited ResNet50 as an example, with a peak of 2,000 Ips/W versus 500 for an “edge GPU”.
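The queue-driven, fire-and-forget idea can be illustrated in a few lines of code. This is a toy sketch of the general concept only, not Expedera’s design: each execution unit (the `Stage` class and the scale/offset/clamp operations here are invented for illustration) pops packets from its own input queue, processes them, and pushes the result to the next unit’s queue, with no global interconnect or central controller.

```python
# Toy sketch of a queue-based "fire and forget" dataflow pipeline.
# Illustrative only; the stages and operations are hypothetical.
from collections import deque

class Stage:
    def __init__(self, op, out_queue=None):
        self.op = op                  # the operation this unit performs
        self.in_queue = deque()       # hardware-style input queue
        self.out_queue = out_queue    # downstream unit's queue (or None)

    def step(self):
        """Process one packet if available, then fire-and-forget downstream."""
        if self.in_queue:
            result = self.op(self.in_queue.popleft())
            if self.out_queue is not None:
                self.out_queue.append(result)

# Three-stage pipeline: scale -> offset -> ReLU-like clamp
sink = deque()
s3 = Stage(lambda x: max(0, x), out_queue=sink)
s2 = Stage(lambda x: x - 5, out_queue=s3.in_queue)
s1 = Stage(lambda x: x * 2, out_queue=s2.in_queue)

for packet in [1, 4, 2]:
    s1.in_queue.append(packet)

# Drain the pipeline; each stage advances independently every cycle.
for _ in range(5):
    s1.step(); s2.step(); s3.step()
```

Because each stage only touches its local queues, the units never need to synchronise globally, which is what makes the deeply pipelined layout possible.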
Training with integers
For its edge-AI processors, Deep AI Technologies aims to exploit the programmability of FPGAs to tune the hardware processing to the target neural-network model. One unconventional aspect of the hardware engine is that it is designed not just to run inferences in embedded systems but to perform training as well. Rather than accept the overhead of floating-point arithmetic for training, DeepAI uses the same reduced precision of 8bit integer operations for both inferencing and training. Potentially, this causes issues during training: it is easy for the lengthy backpropagation calculations to saturate at the maximum or minimum representable value and, in doing so, lose the ability to converge on usable gradients.
DeepAI CTO Moshe Mishali explained that off-chip control algorithms tweak the assumed range of the 8bit values used by the backpropagation operations so that they never overflow or underflow, a mechanism not dissimilar to the block floating-point calculations used by some server-class AI engines. This, Mishali claimed, gives the DeepAI processor a power and speed advantage over something like a V100 GPU for training. Again using ResNet50, one test showed the DeepAI approach on a Xilinx Alveo U50 card training at 800 images per second while drawing 75W, against the 300W a V100 used for a training throughput of 360 images per second.
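The principle of rescaling the assumed range so 8bit values never clip can be sketched briefly. This is a generic shared-scale quantisation example, not DeepAI’s actual control algorithm; the function names and the margin parameter are invented for illustration. A block of gradient values shares one scale factor, which is recomputed from the block’s peak magnitude so the largest value lands just inside the int8 range.

```python
# Generic sketch of shared-scale int8 quantisation for gradients.
# Hypothetical helpers; not DeepAI's real mechanism.

def update_scale(values, margin=1.0):
    """Pick a scale so the largest magnitude maps inside the int8 range."""
    peak = max(abs(v) for v in values) or 1e-8
    return peak * margin / 127.0

def quantize_block(values, scale):
    """Map floats to int8 using the shared scale, clipping at the limits."""
    return [max(-128, min(127, round(v / scale))) for v in values]

grads = [0.5, -0.02, 0.31, -0.47, 0.003]
scale = update_scale(grads)          # refreshed each step as ranges drift
q = quantize_block(grads, scale)     # all values fit; the peak maps to 127
dequant = [i * scale for i in q]     # approximate reconstruction
```

Because the scale tracks the block’s actual range from step to step, no value saturates at the integer limits, which is the failure mode that would otherwise destroy the gradients during backpropagation.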