Deep pipelines and dynamic memory sharing may provide the key to the development of faster and more efficient server-farm blades as the focus in hardware design moves to augmenting conventional processors with specialized accelerators.
Research over the past decade has shown that data movement is fast becoming the biggest issue in energy-efficient designs. Environmental groups such as Greenpeace are putting the electricity consumption of data centers under the spotlight. Focusing on how software uses the memory subsystem provides an opportunity to slash that consumption.
Speaking at the DAC 2016 Tuesday session on heterogeneous architectures, Professor Jason Cong of UCLA, a cofounder of Falcon Computing Solutions, said: “Data center energy consumption is a very big deal. We are looking at having to build 50 additional large power plants by 2020 to support them.”
One of the problems in data-center computer design is, Cong claimed, “a pretty big mismatch between workloads in data centers and processor design”.
Professor Mark Horowitz of Stanford University said: “You quickly realize that in a modern microprocessor most of the energy goes into the memory system. A little more than half of the energy goes into the uncore. I don’t want to read and write the DRAM because it’s a thousand times more energy than accessing the data locally.”
One of the ways to deal with the mismatch and avoid excessive data movement to and from DRAM and the various levels of cache is to make much greater use of hardware that appears to support some of the algorithms now needed in server farms better than general-purpose processors do.
Inside the algorithm space
“If you look at the space of all algorithms, there is a tiny space for algorithms suited to GPUs. If the algorithm is massively data parallel, you can run it on a GPU and get much better performance,” Horowitz said. “Within that space of GPU applications there is an even smaller space of massively parallel, incredibly local algorithms that fit. With GPUs, you can’t get efficiency if you don’t have high levels of data locality.
“At this point, you might think ‘great, we’re all screwed’. But the truth is we do a lot of computation that is highly local. Convolutional neural networks are like this. It turns out that on these algorithms it isn’t worth touching the DRAM,” said Horowitz. “Modem processing, deep learning and linear algebra problems all fall into this subset.”
The key is to rework the algorithm so that as many as possible of the operations on a particular piece of data are queued before writing back to main memory. “You work on the data once you have it and work on it some more,” Horowitz said.
Cong added: “Many machine-learning applications can have high locality. There is a fair amount of data you can cache and prefetch.”
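The reworking Horowitz describes, doing all the work on a piece of data while it is held locally, can be sketched as a toy comparison between an unfused and a fused pipeline. This is an illustrative model only: the three stage functions and the access counter are hypothetical stand-ins, and the counts are simulated main-memory reads and writes rather than measurements from any real machine.

```python
# Illustrative model only: count simulated main-memory accesses for an
# unfused versus a fused pipeline. The stage functions are hypothetical.

def scale(x):
    return x * 2


def offset(x):
    return x + 3


def square(x):
    return x * x


STAGES = [scale, offset, square]


def run_unfused(data, stages):
    """Apply one stage at a time, writing every intermediate back to memory."""
    accesses = 0
    current = list(data)
    for stage in stages:
        result = []
        for value in current:
            accesses += 1              # read the element from main memory
            result.append(stage(value))
            accesses += 1              # write the intermediate result back
        current = result
    return current, accesses


def run_fused(data, stages):
    """Read each element once, apply every stage locally, then write once."""
    accesses = 0
    result = []
    for value in data:
        accesses += 1                  # single read from main memory
        for stage in stages:
            value = stage(value)       # intermediate stays local
        result.append(value)
        accesses += 1                  # single write of the final result
    return result, accesses


if __name__ == "__main__":
    data = list(range(8))
    unfused_out, unfused_accesses = run_unfused(data, STAGES)
    fused_out, fused_accesses = run_fused(data, STAGES)
    assert unfused_out == fused_out    # same answer either way
    print("unfused accesses:", unfused_accesses)
    print("fused accesses:", fused_accesses)
```

In this model the unfused version touches memory twice per element per stage, while the fused version touches it twice per element regardless of how many stages there are. The saving grows with the depth of the pipeline, which is consistent with the deeply pipelined, flow-forward structure the speakers favor.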
But even with highly local algorithms the “wire cost” today is significant, Horowitz said. Although GPUs do improve the performance of massively parallel workloads with high locality, he added: “If you were going to build a machine that’s optimally bad for locality, you would build a GPU. The bottom line is that it’s all about the memory. The architectures want to be very deeply pipelined. The algorithms tend to have a flow-forward structure. The pipeline through may be hundreds of cycles long and have very few data dependencies.”
Horowitz claimed the coarse-grained reconfigurable architecture (CGRA) model appears to offer better promise for future architectures. “CGRAs look to be about twice as energy efficient as SIMD.”
The ability to dynamically create and destroy computational pipelines has helped drive the use of FPGAs in servers. Although their core arithmetic speed generally lags that of a GPU, where more of the die area can be given over to high-speed shader cores, the FPGA’s highly flexible routing helps ensure that if data elements can be shared or passed between execution units, they won’t have far to travel. Dedicated wiring can deliver the data to where it’s needed on the next cycle, rather than the data being shovelled in and out of caches.
As a result, the FPGA has provided a convenient way to implement custom hardware in server blades – a development that encouraged Intel to buy Altera last year.
“The work we have been doing has been catching on. Microsoft, for example, has deployed a lot of FPGAs and their use now goes way beyond search,” Cong claimed.
The key problem, said Cong, is that “heterogeneity makes programming very hard”. Researchers such as Cong and Horowitz are working on development frameworks that can make code that runs on accelerators more portable and easier to read.
How applications share accelerator substrates also introduces problems, Columbia University associate professor Luca Carloni explained. To help exploit data locality, much of the accelerator hardware will be scratchpad memories that, ideally, would be usable across many different algorithms.
“But each algorithm is different and the computation I/O patterns vary greatly,” Carloni said.
Carloni recommended the use of multi-ported local memories that can be split and recombined as needed to fit different algorithm pipelines and access patterns. The memory might be made available as a last-level cache with portions split off to be used as scratchpad memories when an accelerator on the same die is configured. “When the accelerator needs the memory, the last level cache gets smaller,” he said.
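The trade-off Carloni describes can be illustrated with a minimal bookkeeping model. It is a sketch under stated assumptions, not a description of his group's hardware: the class name, the notion of equal-sized "banks" and the method names are all hypothetical, and real designs would also have to deal with porting, flushing and coherence.

```python
class PartitionedMemory:
    """Toy model of on-die memory banks shared between a last-level cache
    and accelerator scratchpads. All names here are hypothetical."""

    def __init__(self, total_banks):
        if total_banks <= 0:
            raise ValueError("total_banks must be positive")
        self.total_banks = total_banks
        self.scratchpads = {}          # accelerator name -> banks it holds

    @property
    def cache_banks(self):
        """Banks still serving as last-level cache."""
        return self.total_banks - sum(self.scratchpads.values())

    def configure_accelerator(self, name, banks):
        """Split banks off the cache for use as an accelerator's scratchpad."""
        if name in self.scratchpads:
            raise ValueError(f"{name} is already configured")
        if banks <= 0 or banks > self.cache_banks:
            raise ValueError("not enough cache banks available")
        self.scratchpads[name] = banks

    def release_accelerator(self, name):
        """Return an accelerator's banks to the last-level cache."""
        return self.scratchpads.pop(name)


if __name__ == "__main__":
    memory = PartitionedMemory(total_banks=16)
    memory.configure_accelerator("conv", 6)
    print("cache banks while accelerator runs:", memory.cache_banks)
    memory.release_accelerator("conv")
    print("cache banks after release:", memory.cache_banks)
```

While the hypothetical "conv" accelerator is configured, the cache has fewer banks to work with; once it is released, the cache grows back to its full size.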
But even if the core algorithms exploit hardware platforms with minimal memory movement, work will be needed in the upper software layers because of the way that the applications themselves are written. One of the biggest overheads now, Cong said, is the cost of serializing and deserializing data as gigabytes are read in and out of the storage subsystems. “Google call it the data-center tax,” he said.
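One rough way to see the overhead Cong refers to is to time encoding and decoding separately from the work done on the data. The sketch below is an assumption-laden illustration: JSON from the Python standard library stands in for whatever formats a real data center uses, and the record layout and the `process` function are hypothetical. The timings it reports depend entirely on the machine and say nothing about the workloads discussed in the session.

```python
import json
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def process(records):
    """Hypothetical stand-in for the useful work: sum one field."""
    return sum(record["value"] for record in records)


def measure(num_records=10_000):
    """Time serialization, deserialization and compute for toy records."""
    records = [{"id": i, "value": i * 0.5} for i in range(num_records)]
    wire, serialize_s = timed(json.dumps, records)
    decoded, deserialize_s = timed(json.loads, wire)
    total, compute_s = timed(process, decoded)
    return {
        "round_trip_ok": decoded == records,
        "total": total,
        "serialize_s": serialize_s,
        "deserialize_s": deserialize_s,
        "compute_s": compute_s,
    }


if __name__ == "__main__":
    report = measure()
    for key, value in report.items():
        print(f"{key}: {value}")
```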