Bit width tweaks point way to practical deep learning

By Chris Edwards | Posted: May 23, 2016
Topics/Categories: Embedded - Architecture & Design, EDA - ESL

In both data centers and automobiles deep learning is taking hold. But it is a technique that challenges conventional microprocessors, leading system designers to look at alternative architectures for acceleration.

In mid-May nVidia CEO Jen-Hsun Huang argued deep learning is likely to drive a boost to the company’s sales not just into data centers but into cars. Vehicle makers bought 50 per cent more of the company’s GPUs this quarter compared to the same quarter a year ago. Many are going into dashboard computers to run infotainment systems but some of the graphics accelerators are beginning to underpin advanced driver assistance system (ADAS) designs.

The GPU has become a mainstay of deep-learning research over the past five years, due largely to the work of researchers such as Dan Ciresan and colleagues at the Swiss research institute IDSIA. GPUs have proven particularly important to training the convolutional neural networks (CNNs) that underpin the technology. Training requires a number of passes over large matrices of simulated neurons, feeding results both forwards and backwards through the network to update the weights that they will use when processing incoming data. Deep-learning networks often rely on several layers of fully interconnected neurons, requiring thousands of weight adjustments for each neuron on each pass.

Ciresan’s group is also the one that taught a bank of CNNs to recognise road signs in an experiment in which the machine performed slightly better than humans. The CNN approach had an advantage when signs were so badly damaged there was practically no text left. Deep learning used other visual cues such as the shapes of the signs and position of text fragments to guide recognition. Even if it forms only part of an overall ADAS, the CNN looks interesting enough for a number of automakers to investigate it.

Deep learning “not a fad”

Huang told analysts on a recent conference call: “I have a great deal of confidence that machine learning is not a fad. I have a great deal of confidence that machine learning is going to be the future computing model for a lot of very large and complicated problems.”

According to Huang, the company has on the order of ten times more autonomous-driving projects than infotainment projects “and we have a fair number of infotainment projects”.

There is an issue with using a GPU. The architecture is comparatively power hungry, although GPUs use less power than general-purpose processors on the same CNN jobs. Because of their number-crunching ability on floating-point numbers, GPUs seem likely to continue in training applications for CNNs. But other architectures are likely to come to the fore when it comes to inferencing. Google is one of the companies arguing that.

Norm Jouppi, who is famous for work on caching designs such as the victim-cache idea and is now a distinguished hardware engineer at the search-engine giant, wrote in a blog post in mid-May that the company has developed its own ASIC for handling CNNs, specifically for the company’s TensorFlow language and computation framework. Some have interpreted Jouppi’s comment about “allowing the chip to be more tolerant of reduced computational precision, which means it requires fewer transistors per operation” as a reference to the use of approximate computing. It seems more likely that the machine architecture is a massively parallel array of conventional arithmetic units but with fairly narrow bitwidths.

Google released a version of TensorFlow earlier in May that provided support for quantized 8bit representations. Pete Warden, Google research engineer and former Jetpac CTO, wrote of CNNs in a blog post about the release: “You can run them with 8bit parameters and intermediate buffers, and suffer no noticeable loss in the final results. This was astonishing to me, but it’s something that’s been rediscovered over and over again.”
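The effect Warden describes can be illustrated with a minimal sketch of linear min/max quantization. This is not TensorFlow's implementation: the function names `quantize` and `dequantize` and the weight values are assumptions made up for illustration.

```python
def quantize(values, bits=8):
    """Map floats onto integer codes 0 .. 2**bits - 1 using a min/max range."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a zero range when all values are identical.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Illustrative weights only; not taken from any real network.
weights = [-0.62, -0.11, 0.0, 0.27, 0.48, 0.93]
codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

# The worst-case rounding error is half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With 256 levels spread across the range of the weights, each restored value differs from the original by at most half a step, which is the kind of small perturbation that trained networks appear to tolerate.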

Fixed-point inferencing

In short, training needs the precision of 32bit floating point for the most part. Inferencing engines, which will be the hardware used in most ADAS implementations, can get by with far less arithmetic precision. Cars are unlikely to be learning what pedestrians look like while driving along. The split of workloads will be good news to the IP suppliers aiming to sell into these new applications, as well as chipmakers such as NXP and STMicroelectronics.

In principle, digital signal processors (DSPs) can do the job while consuming less energy, and they are gaining extensions that deal with the specialised needs of neural networks. In recent months, processor-core developers Ceva and Cadence Design Systems’ Tensilica operation have launched versions of their DSPs designed to move data around more efficiently for CNNs. Both are using fixed-point architectures, with Ceva offering a tool that will convert floating-point representations into fixed point for implementation on the DSP core.
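As a rough idea of what a floating-point to fixed-point conversion involves, the sketch below assumes a signed 16bit word with 12 fractional bits. The format, the helper names and the data are hypothetical choices for illustration and do not describe Ceva's tool or any vendor's DSP.

```python
FRAC_BITS = 12   # assumed number of fractional bits (hypothetical choice)
WORD_BITS = 16   # assumed signed word width
Q_MAX = (1 << (WORD_BITS - 1)) - 1
Q_MIN = -(1 << (WORD_BITS - 1))


def to_fixed(x):
    """Convert a float to a saturated signed fixed-point integer."""
    q = round(x * (1 << FRAC_BITS))
    return max(Q_MIN, min(Q_MAX, q))


def to_float(q):
    """Convert a fixed-point integer back to a float for comparison."""
    return q / (1 << FRAC_BITS)


def fixed_dot(a, b):
    """Dot product using only integer multiply-accumulate operations."""
    acc = 0  # Python integers do not overflow, standing in for a wide accumulator
    for x, y in zip(a, b):
        acc += x * y
    # Each product carries 2 * FRAC_BITS fractional bits; shift back with rounding.
    return (acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


# Illustrative inputs and weights only.
inputs = [0.5, -0.25, 0.125]
weights = [0.75, 0.5, -1.0]
exact = sum(i * w for i, w in zip(inputs, weights))
approx = to_float(fixed_dot([to_fixed(i) for i in inputs],
                            [to_fixed(w) for w in weights]))
```

Values outside the representable range saturate rather than wrap, which is why the choice of how many fractional bits to use for each layer tends to matter in such conversions.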

Freescale’s acquisition of Cognivue provides the recently merged NXP with a DSP architecture that is being enhanced to handle CNNs. The third-generation part, which has features for handling CNNs, has yet to appear in silicon.

The DSP in the BlueBox ADAS platform released at NXP’s FTF conference uses an older version, which has not been optimised for CNNs. However, NXP product manager Allan MacAuslin argues flexibility is more important in the short term as car makers try to work out which AI and vision-processing techniques work best.

In a separate move, STMicroelectronics has tied up a deal with automotive-vision specialist MobilEye to develop a fifth-generation architecture that could start to move into cars around 2020. ST aims to use an aggressive process technology – 10nm finFET – to implement the future processors. NXP is opting for a more conservative approach.

Process choices

“High-performance processing often comes at the cost of long term reliability and functional safety. A more robust transistor switches slower. Clocking fast comes at the cost of reliability,” MacAuslin claims, adding that the current product family for ADAS is based on the 28nm process, which first appeared on the market several years ago.

Those favoring the use of older process technologies can take comfort in Google’s adoption of what looks to be quite narrow fixed-point arithmetic for its own Tensor processor.

The other issue for any embedded implementation of CNNs is how it handles memory accesses. The fully connected layers imply the provision of massive amounts of memory bandwidth as data is moved through the array of simulated neurons. Cadence and Ceva have approached the problem through the use of scatter-gather memory controllers that organize data fetched from main memory more efficiently for the parallel processors to work on.
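A back-of-envelope estimate gives a sense of the bandwidth involved. The sketch assumes every weight in a fully connected layer is fetched from main memory once per pass, with no on-chip reuse. The layer size and frame rate are hypothetical figures chosen only for illustration.

```python
def weight_traffic_bytes_per_second(n_inputs, n_outputs,
                                    bytes_per_weight, passes_per_second):
    """Bytes of weight data fetched per second if each weight is read once per pass."""
    weights = n_inputs * n_outputs  # one weight per input-output connection
    return weights * bytes_per_weight * passes_per_second


# Hypothetical 4096 x 4096 layer processed 30 times a second.
float32_traffic = weight_traffic_bytes_per_second(4096, 4096, 4, 30)
int8_traffic = weight_traffic_bytes_per_second(4096, 4096, 1, 30)
```

Under these assumptions a single layer needs roughly 2GB/s of weight traffic at 32bit precision, and a quarter of that with 8bit weights.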

Some have gone further in attempting to capture what is necessary for CNNs. Synopsys’ EV family of vision processors combines a multicore implementation of the ARC architecture with an accelerator designed specifically for running CNNs.

Software optimization

However, further optimizations in software may reduce the need to develop architectures specific to CNNs. A team led by nVidia chief scientist Bill Dally, who is also a professor at Stanford University and one of the key researchers behind data-streaming processor ideas, has been working on ways to reduce the number of connections between neurons that are needed within CNNs. Techniques similar to those used to speed up the matrix-intensive processing common to many EDA tools may have application in CNNs.

Approximations to the trained network can achieve a much sparser representation: sparsity means fewer MIPS and data moves. Making the software friendlier to the way GPUs handle data could prove more effective than trying to bend the memory management of DSPs to be better at handling conventional CNNs, and could deliver power savings almost for free. As R&D is moving quickly, the focus is likely to be on reasonably general-purpose architectures until the software optimizations start to run out of steam.
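One simple way to obtain a sparser network is to drop connections whose weights are close to zero and store only what remains. The sketch below illustrates that idea under stated assumptions. It is not a description of the method used by any particular team, and the threshold, function names and weight matrix are made up for illustration.

```python
def prune(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def to_sparse(weights):
    """Store only non-zero weights as (column, value) pairs per row."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in weights]


def sparse_matvec(sparse_rows, x):
    """Multiply the sparse matrix by vector x, skipping pruned connections."""
    return [sum(w * x[j] for j, w in row) for row in sparse_rows]


# Illustrative 3 x 4 weight matrix; values are made up.
dense = [
    [0.80, 0.02, -0.01, 0.50],
    [0.03, -0.70, 0.04, 0.00],
    [-0.02, 0.01, 0.90, -0.60],
]
sparse = to_sparse(prune(dense, threshold=0.1))
kept = sum(len(row) for row in sparse)  # multiply-accumulates remaining
outputs = sparse_matvec(sparse, [1.0, 1.0, 1.0, 1.0])
```

In this toy case only 5 of the 12 connections survive, so the layer needs fewer multiply-accumulate operations and fewer weight fetches. In practice a pruned network would normally be checked, and possibly retrained, to confirm that accuracy holds up.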
