For Nvidia chief scientist and Stanford professor Bill Dally, now is a great time to be involved in hardware design. The rapid growth of machine learning as part of the push towards artificial intelligence (AI) has made all the difference.
Speaking at the VLSI Circuits Symposium plenary in Honolulu on Tuesday (June 19, 2018), Dally said: “Hardware is what’s fueling the AI revolution around us. Today, progress is gated by hardware.”
He argued that a lot of the core ideas have been in place for decades, albeit with tweaks to make them perform better, but that it took improvements in hardware to make them viable. A case in point was the way in which the research community seized on the high floating-point throughput of graphics processing units (GPUs) in the late 2000s once a more compute-friendly approach to training deeply layered neural networks was developed. The faster hardware made it feasible to train deep neural networks (DNNs) on large enough quantities of data to be useful for real-world applications.
To move into applications such as self-driving cars, deep-learning networks have to make bigger strides in energy efficiency, Dally said. Benchmarks run on servers, such as ResNet, use 224 x 224 pixel images. “But the cameras we put in cars aren’t 224 x 224. They are HD. And there are 12 of them. It’s an enormous computational load.”
Deep cuts to precision
The graphics processor maker is far from alone in trying to slash the amount of work DNNs have to do when they are running inferences – which will be their primary job in cyberphysical systems and other edge devices. But Dally has been able to call on work performed by his research group at Stanford to inform the design of future processors and the software that will go onto them.
The attack on the compute burden of DNNs at Nvidia is happening on several fronts. One is reducing the energy of the computations themselves; another is reducing the number of computations; and the third is working out ways to reduce the overhead of instruction control and memory movement. Without careful design, control and memory operations can easily dominate the energy cost.
There are key differences between the needs of server-based accelerators and the processors that need to go into mobile systems, whether they are handsets or cars. Engineers in the data center want flexibility so they can progressively optimize algorithms. And the server is where the network gets trained.
High precision is needed for training because of gradient descent, although 64bit floating-point arithmetic is probably overkill for many uses. “Scientists like to do everything in FP64 because it saves them having to think” about the ramifications of using less precision, Dally said.
No-one trying to make embedded DNN processors has any qualms about reducing precision. Most designers working on DNN processors and software have moved from floating point to integers, from 16bit down to 8bit and, now, think binary is probably OK for a lot of applications. What happens, Dally explained, is that the weights used to scale the inputs to each neuron often fall into something that looks like a bimodal distribution, clustered on either side of zero.
The binary option chooses between a positive and a negative fixed weight rather than implying multiplication by 1 or 0. With a small local codebook to store the actual weight values, the storage demand for the millions of weights a typical DNN needs plummets compared with 64bit values.
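The sign-plus-codebook idea can be sketched in a few lines. This is a minimal illustration, not Nvidia's implementation: it keeps only a sign bit per weight plus one shared full-precision scale (the single "codebook" entry), with the mean absolute value chosen as the scale because it minimizes the L2 quantization error.

```python
import numpy as np

def binarize_weights(w):
    """Quantize a weight tensor to two per-tensor values {-alpha, +alpha}.

    Each weight keeps only its sign (1 bit of storage); alpha is a single
    shared full-precision scale -- the lone "codebook" entry.
    """
    alpha = float(np.abs(w).mean())      # shared scale stored once
    signs = np.sign(w).astype(np.int8)   # 1 bit per weight in hardware
    return signs, alpha

def dequantize(signs, alpha):
    """Recover the approximate weights from signs and the codebook scale."""
    return signs.astype(np.float32) * alpha

# Bimodal weights clustered either side of zero compress well this way.
w = np.array([-0.42, 0.37, -0.35, 0.41, 0.39], dtype=np.float32)
signs, alpha = binarize_weights(w)
w_hat = dequantize(signs, alpha)
```

Storage drops from 64 bits per weight to 1 bit plus one scale per tensor, which is why the footprint for millions of weights plummets.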
Binary may be a step too far, so some have proposed ternary arithmetic. It adds a zero to the binary option of a negative or positive fixed-weight value and is fast becoming a popular option for hardware accelerators.
For systems that use more precision than binary there is exponential coding. One observation to come out of work by Song Han, formerly based at Stanford and now at MIT, and other colleagues at Stanford – in common with other groups – is that evenly spaced number formats do not reflect the weight distribution all that well. “With evenly spaced sampling, I’m sampling very sparsely where things are happening,” Dally said.
Having the steps grow as the weights increase can fit the application a lot better. “The best approach I have so far is exponential coding,” he said.
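One simple form of exponential coding rounds each weight to a signed power of two, so code points are dense near zero where most weights sit and the steps grow with magnitude. This is a sketch under assumed parameters (the exponent range is an illustrative choice), not a description of any specific accelerator's format.

```python
import numpy as np

def log_quantize(w, exp_min=-8, exp_max=0):
    """Round each weight to sign * 2**e with an integer exponent e.

    Stored form is one sign bit plus a few-bit exponent. Rounding is done
    in log2 space for simplicity, which is close to (but not exactly)
    nearest-power-of-two in linear space.
    """
    sign = np.sign(w).astype(np.int8)
    mag = np.abs(w)
    # Replace zeros with the smallest representable magnitude before log2;
    # their sign code of 0 still dequantizes them to 0.
    safe = np.where(mag > 0, mag, 2.0 ** exp_min)
    e = np.clip(np.round(np.log2(safe)), exp_min, exp_max).astype(np.int8)
    return sign, e

def log_dequantize(sign, e):
    """Recover approximate weights from sign and exponent codes."""
    return sign.astype(np.float32) * (2.0 ** e.astype(np.float32))
```

Because every reconstructed weight is a power of two, multiplication by a weight reduces to a bit shift of the activation, which is what makes the next step possible.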
Lose the multiplier
With exponential coding, it is even possible to get rid of the multiplier that is common to just about every DNN accelerator that uses digital computation and replace it with simple shift-and-add operations. “At 16nm, core inference can be performed for just 10fJ per MAC [multiply-accumulate],” Dally said, though it is not really a MAC operation. At such low computation energies, the focus on power shifts strongly. “It becomes all about moving the data around.”
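The multiplier-free "MAC" follows directly from power-of-two weights: multiplying an integer activation by sign * 2**e is just a shift and a signed add. A minimal sketch, assuming integer activations:

```python
def shift_accumulate(acc, activation, sign, exponent):
    """One multiply-accumulate step with a power-of-two weight.

    weight = sign * 2**exponent, so weight * activation is a bit shift of
    the integer activation (a right shift for negative exponents, which
    discards low bits) followed by a signed add -- no multiplier needed.
    """
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent
    return acc + shifted if sign > 0 else acc - shifted
```

A shifter and adder use far fewer transistors than a full multiplier, which is what pushes the per-operation energy down to the point where data movement dominates.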
Work by Dally and colleagues at Stanford more than a decade ago already homed in on this issue, showing that even with a full MAC, computation is often an order of magnitude more efficient than moving the data in and out of the registers that feed that MAC unit. Once you have removed many of the transistors from the arithmetic unit, the difference is even more stark.
Taking advantage of the sparsity of many practical DNNs is one way to cut memory energy. It turns out that many weights in a network are zero, and the activations that neurons pass on are frequently zero as well. Use of activation functions such as ReLU, which zero out negative inputs and pass positive values through linearly, leads to a lot of zeroes moving around. Pruning the network to get rid of zero weights and aggressively data-gating arithmetic units can make a big difference. But signalling between cores could also be due a change, Dally said. He proposed using low-energy techniques such as ground-referenced signalling.
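Pruning plus a compressed storage format can be sketched as follows. This is an illustrative (index, value) representation, not the specific encoding used in any product: weights below a threshold are dropped outright, and the dot product then touches only the activations paired with surviving weights.

```python
import numpy as np

def prune(w, threshold):
    """Drop small-magnitude weights and keep survivors in (index, value)
    form, so zero weights cost nothing to store, fetch, or multiply by."""
    mask = np.abs(w) >= threshold
    idx = np.nonzero(mask)[0].astype(np.int32)   # positions of survivors
    vals = w[mask]                               # surviving weight values
    return idx, vals

def sparse_dot(idx, vals, activations):
    """Neuron input sum using only the non-zero weights."""
    return float(np.dot(vals, activations[idx]))
```

In practice pruned networks are usually fine-tuned afterwards to recover accuracy; the energy win comes from both the smaller weight memory and the arithmetic that is never performed.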
However, advanced signalling schemes could fall foul of the patent wars because some IP companies concentrate on this area. Dally had the honor of working at a company being sued by another – Rambus – over a patent he himself had co-written. The patent was for a low-energy signalling scheme developed at startup Velio. Rambus later bought the I/O patents.
When asked if he would back its use as a standard, Dally said of the more recent ground-signalling scheme he proposed: “We would be very happy to have other people using it. It’s a very efficient way of doing signalling.”