Arm readying AI processor to catch up with ‘surprising’ demand
By the middle of this year Arm intends to deliver a processor designed specifically for deep-learning pipelines in edge devices, to capitalize on a move away from cloud computing for image and voice recognition.
Rene Haas, president of Arm’s IP products group, said: “Machine learning is one of the most significant changes that is hitting our computing landscape. We believe that years from now people will not be looking at machine learning as a unique category where computers learn but rather it will be native to everything that computers do.”
Jem Davies, general manager of Arm’s machine-learning group, said the Project Trillium machine-learning processor represents “a ground-up design for high performance and efficiency. It is a scalable architecture that…specifically targets inference at the edge and gives us massive uplift over traditional CPU and GPU architectures.”
Arm is aiming for a performance level of 3 tera-operations per second per watt (3TOPs/W) when the design is implemented on a 7nm process. Davies claimed: “We believe this is considerably in excess of anything in the market right now. What we are looking for is for early adopters to take these into premium smartphones. By the time the IP is delivered to them, [7nm is] where everybody will be.”
Haas added: “The architecture that we chose will scale up through a number of sophisticated applications right up to the data center.”
Closer to the edge
Although higher-performance versions of the Trillium architecture may be used in data-center accelerators, the focus is on inference at the edge.
Davies cited concerns over internet bandwidth availability and data-center power as drivers for pulling machine-learning functions back from the cloud for local processing.
Haas added: “This move of machine learning to the edge took us all a little bit by surprise. Everyone knows Arm has been doing a lot of work around IoT for quite some time…It wasn’t very long ago that the problem statement was about connecting these devices to the internet, applying a sensor and connecting data and sending it through. The level of analytics, the level of learning, the level of sophistication has moved much faster than I think everyone has anticipated.”
Like competitors such as Cadence Design Systems’ Tensilica group and Ceva, Arm is building a toolchain that will accept the output of deep-learning development and training environments such as Caffe and TensorFlow.
Davies noted: “CMSIS-NN is available already and has caught the attention of Google’s Android group. It is enabling machine learning on Cortex-M class devices. Using the software support, our partners will be able to use a wide variety of devices with a wide variety of capabilities in machine learning.”
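As a flavor of what that software support looks like today, the sketch below runs a small quantized layer through CMSIS-NN’s 8-bit (q7) kernels of the kind a Cortex-M device would execute. The layer dimensions, fixed-point shifts, and weight values are hypothetical placeholders; only the two library calls are actual CMSIS-NN API.

```c
#include "arm_nnfunctions.h"  /* CMSIS-NN kernel library */

#define IN_DIM  64   /* input vector length (hypothetical) */
#define OUT_DIM 10   /* output vector length (hypothetical) */

/* 8-bit (q7) weights and biases, produced offline by quantizing a
 * trained network; zeros here stand in for real trained values. */
static const q7_t weights[OUT_DIM * IN_DIM] = {0};
static const q7_t biases[OUT_DIM]           = {0};

void classify(const q7_t input[IN_DIM], q7_t output[OUT_DIM])
{
    /* Scratch buffer required by the q7 fully connected kernel. */
    q15_t scratch[IN_DIM];

    /* Quantized fully connected layer; the bias and output shifts
     * set the fixed-point scaling and are hypothetical here. */
    arm_fully_connected_q7(input, weights, IN_DIM, OUT_DIM,
                           0 /* bias_shift */, 7 /* out_shift */,
                           biases, output, scratch);

    /* In-place 8-bit ReLU activation. */
    arm_relu_q7(output, OUT_DIM);
}
```

In a flow like the one Arm describes, networks built in Caffe or TensorFlow would be quantized offline to produce the 8-bit weight tables that kernels of this kind consume.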
Combination offering
Arm expects the Trillium processors to be paired with more application-specific offerings, such as a second generation of vision processors derived from the Apical architecture it acquired in the spring of 2016. The company expects to release an object-detection core by the end of March.
“What happens if we put [the object detector and machine-learning processor] together is we get much better capability,” Davies said. “You use the object detector as a preprocessor that triggers when faces are detected, for example. That way the machine-learning processor’s workload is greatly reduced.
“This will lead to a completely new class of smart cameras, which will be the basis for a whole new class of value-added services built on top of those devices,” he said.
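The gating Davies describes amounts to a simple two-stage control flow. The sketch below is purely illustrative: the detector and recognition entry points are hypothetical stand-ins for whatever driver interface Arm ships, and the face test is a stub.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stub for the low-cost, always-on object-detection core. */
static bool od_detect_face(const uint8_t *frame, size_t len)
{
    (void)len;
    return frame[0] != 0;  /* pretend: nonzero first byte == face present */
}

/* Stub for the expensive network on the machine-learning processor. */
static void mlp_run_recognition(const uint8_t *frame, size_t len)
{
    (void)frame; (void)len;
    printf("running full recognition network\n");
}

/* The detector acts as a preprocessor that gates the heavy network,
 * so most frames never wake the machine-learning processor. */
static void process_frame(const uint8_t *frame, size_t len)
{
    if (!od_detect_face(frame, len))
        return;  /* nothing of interest: ML processor stays idle */
    mlp_run_recognition(frame, len);
}

int main(void)
{
    uint8_t empty_frame[4] = {0};
    uint8_t face_frame[4]  = {1};
    process_frame(empty_frame, sizeof empty_frame);  /* skipped */
    process_frame(face_frame, sizeof face_frame);    /* triggers recognition */
    return 0;
}
```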
Although the processor is due to be delivered to lead partners by the middle of the year, with the aim of appearing in smartphones in early 2019, Arm is not yet describing the architecture in detail publicly.
Architecture hints
Davies said that, because the processor is aimed at inference workloads, it will employ reduced-precision arithmetic, a technique already used by competitors such as Cadence and Ceva, whose very long instruction word (VLIW) digital signal processor (DSP) cores run multiple 8-bit or 16-bit calculations in a single cycle. Arm is focusing primarily on 8-bit operations, with 16-bit operations provided for weight calculations that require the additional precision.
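To make the reduced-precision point concrete, the generic C sketch below shows the arithmetic pattern such engines exploit: narrow 8-bit operands (or 16-bit ones where precision demands) multiplied into a wide 32-bit accumulator, which is what lets a VLIW DSP pack several multiply-accumulates into one cycle. It illustrates the principle only and implies nothing about Arm’s actual datapath.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-bit dot product: every int8 x int8 product fits comfortably in an
 * int32 accumulator, so hardware can issue several of these
 * multiply-accumulates per cycle without overflow. */
int32_t dot_q7(const int8_t *a, const int8_t *w, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)w[i];
    return acc;
}

/* 16-bit variant for values that need the extra precision, at roughly
 * half the per-cycle throughput of the 8-bit path. */
int32_t dot_q15(const int16_t *a, const int16_t *w, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)w[i];
    return acc;
}
```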
Davies indicated that the processor will incorporate hardware coprocessors and that the design effort has concentrated on the performance of convolutional neural-network layers. He said the team used profiling techniques similar to those employed in the development of the Mali GPUs to identify areas of the microarchitecture that could prove to be bottlenecks.
“The thing we are very proud of is the intelligent memory system. To achieve [the target] level of performance with a conventional fetch-and-decode architecture, you will use your entire power budget just doing that. We use optimized units for specific functions and are clever about what data we load and use.
“[In neural networks] the figure of merit is never to try to reload a piece of data. If you have to reload you have failed. There is lots of work in the memory subsystem to avoid having to reload something. Once loaded you try to use it as much as possible before you discard it. We have local memories attached to the machine-learning processor to minimize the time and power needed to load that data,” Davies said.
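Davies’s “never reload” rule is, in software terms, a statement about loop order and tiling. The sketch below is a minimal illustration of that principle, not Arm’s design: a tile of weights is copied once into a small local buffer standing in for on-chip memory, then reused against an entire batch of inputs before the next tile is fetched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 32  /* weights that fit in local memory (hypothetical) */

/* Small local buffer standing in for the on-chip memories Davies
 * describes; copies from `weights` model expensive external fetches. */
static int8_t local_w[TILE];

/* Accumulates dot(in[b], weights) into out[b] for a batch of inputs
 * (caller zero-initializes out). Loop order is the point: each weight
 * tile is loaded ONCE, then used against every input in the batch
 * before being discarded, so no weight is ever reloaded. */
void fc_tiled(const int8_t *in, const int8_t *weights,
              int32_t *out, size_t batch, size_t dim)
{
    for (size_t t = 0; t < dim; t += TILE) {
        size_t n = (dim - t < TILE) ? dim - t : TILE;
        memcpy(local_w, weights + t, n);       /* one load per tile */

        for (size_t b = 0; b < batch; b++)     /* many uses per load */
            for (size_t i = 0; i < n; i++)
                out[b] += (int32_t)in[b * dim + t + i] * local_w[i];
    }
}
```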