The wrong units of compute
“We are not using systems the right way and not building the right systems in the first place,” claimed Christos Kozyrakis, associate professor at Stanford University, at the recent DATE conference in Grenoble, France.
The systems in question are the scale-out computers now used heavily for the internet servers operated by the likes of Facebook and Google. These systems are overturning conventional thinking on the design not just of complete computers but also of the processors inside them. But there’s nothing like a technological revolution for creativity.
“It’s a great time to be in computing,” said Kozyrakis.
Scale-out applications typically run common application code but have their data “sharded” across compute elements to make their massive thirst for memory manageable.
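As a rough illustration of what sharding means in practice, the sketch below spreads a set of records across several shards and routes each lookup to the one shard that holds the key. The article does not say how any particular service places its data; the hash-based placement and the names `shard_for`, `shard_records` and `lookup` are assumptions made purely for this example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard index for a key using a stable hash (assumed placement scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_records(records: dict, num_shards: int) -> list:
    """Split one large key/value data set into num_shards smaller dictionaries."""
    shards = [{} for _ in range(num_shards)]
    for key, value in records.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


def lookup(shards: list, key: str):
    """Every node runs this same code, but only the owning shard is consulted."""
    return shards[shard_for(key, len(shards))].get(key)


if __name__ == "__main__":
    data = {f"user{i}": i * i for i in range(1000)}
    shards = shard_records(data, 4)
    print([len(s) for s in shards])   # records held by each shard
    print(lookup(shards, "user12"))   # value fetched from the owning shard
```

Each shard holds only a fraction of the data, which is what keeps the memory footprint per compute element manageable while the application code stays the same everywhere.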
Onur Kocberber, a PhD candidate from EPFL in Switzerland, said: “Most of the application is memory access.”
Kocberber said the consequence for today’s processors is not just low instruction-per-clock counts, caused by the amount of time spent waiting for data to arrive from main memory, but also, ironically, poor use of the memory bandwidth that is available. The sheer quantity of data that needs to be sifted for applications such as search means that the last layer of data cache, the largest one in the system, is quickly overwhelmed. It is this that leads to bandwidth underuse, according to the EPFL team.
“Beyond three to four megabytes of cache is useless and the cache adds latency,” he said.
Compute units and pods
The team at EPFL proposed the idea of compute ‘pods’, effectively the same as the compute units introduced by ARM’s John Goodacre in his keynote at DATE. Having analyzed the performance of several processor families, including the widely used Intel Xeon, the ARM-based Calxeda and the Tilera multiprocessor, the researchers proposed a different approach based around the pod concept.
The scale-out pod surrounds a relatively small last-layer data cache with processor cores “to ensure each core has a short distance to the last-layer cache”.
“Then we just replicate the pods. Because the cores don’t communicate with each other, we can just fill the die area,” said Kocberber. “But finding the right size of pod is not a trivial task.”
Assuming a process node of 20nm and a Cortex-A15-like processor consuming 1.1mm square of die area, the EPFL team decided that the optimal pod would surround an 8Mbyte data cache with 32 cores. It is not clear, however, whether the cache needs to be multiported or how the multi-pod system would communicate with main memory.
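The replication argument can be put into back-of-the-envelope form. The sketch below takes the figures quoted above (32 cores per pod, 8Mbyte of cache, and a core area read here as 1.1 square millimeters) and estimates how many pods fit on a die. The die area and the cache area per Mbyte are hypothetical placeholders, not figures from the EPFL work, and the estimate ignores interconnect, memory controllers and I/O.

```python
CORE_AREA_MM2 = 1.1        # per-core area quoted in the article (read as mm^2)
CORES_PER_POD = 32         # pod size quoted in the article
CACHE_MBYTES_PER_POD = 8   # last-layer cache per pod quoted in the article


def pod_core_area_mm2(cores: int = CORES_PER_POD,
                      core_area_mm2: float = CORE_AREA_MM2) -> float:
    """Die area taken up by the cores of a single pod."""
    return cores * core_area_mm2


def pods_that_fit(die_area_mm2: float, cache_mm2_per_mbyte: float) -> int:
    """Number of whole pods that fit on a die.

    Both arguments are hypothetical inputs chosen by the caller; the model
    counts only core and cache area.
    """
    pod_area = pod_core_area_mm2() + CACHE_MBYTES_PER_POD * cache_mm2_per_mbyte
    return int(die_area_mm2 // pod_area)


if __name__ == "__main__":
    # Hypothetical die: 250 mm^2, with cache assumed to cost 1.0 mm^2 per Mbyte.
    pods = pods_that_fit(die_area_mm2=250.0, cache_mm2_per_mbyte=1.0)
    print(f"core area per pod: {pod_core_area_mm2():.1f} mm^2")
    print(f"pods on die: {pods}, total cores: {pods * CORES_PER_POD}")
```

Because the pods do not share data with one another, the only limit in this simple model is area: a larger die or a denser cache just means more copies of the same pod.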
When first analyzing server performance while working at Microsoft, Kozyrakis thought the memory usage of scale-out applications was an experimental error. “The CPU utilization was high but memory bandwidth was extremely low,” he said. “At first, I thought it was a mistake.”
Short of a massive improvement in main memory latency, the reality is that scale-out servers do not need the bandwidth, and memory designed for high bandwidth is simply wasting energy. There is a problem, though: you cannot easily trade off bandwidth against energy in servers with today’s components.
“It turns out that there is a very good technology available now, which is LPDDR2,” said Kozyrakis. “Can we build server memory out of LPDDR2 chips? You can’t build high-capacity chips because they were not designed for that. However, you could achieve capacity through die stacking.”
Simulating an LPDDR2-based server against a conventional design, Kozyrakis found no appreciable difference in performance. “You didn’t need the bandwidth to begin with.”