Memory gets smarter for network speedups

By Chris Edwards | Posted: October 22, 2013
Topics/Categories: Blog - EDA

Memoir Systems has developed a set of memory controller IP cores that exploit common access patterns used by processors in network switches and similar systems to improve their performance and power consumption. For one system designed to handle multiple channels of 480Gbit/s traffic, the design resulted in close to an order of magnitude improvement in performance for a 40 per cent increase in memory die area and 15 per cent in overall die size, with a modest power bump.

Sundar Iyer, cofounder and CEO, claimed that, with traditional approaches to memory architecture, power is spiralling out of control as network SoC designers attempt to squeeze more high-speed ports onto their deep submicron devices.

“It’s no longer just bandwidth but cost constraints. Customers want to integrate more ports on the SoC. One customer was building a 480Gbit/s product for their last generation. For the next generation it’s 3.2Tbit/s. Even on the last generation, power was 50W. Now [they are] hitting power budgets that blow up the chip even before they meet the spec,” said Iyer.

Using the survey of processor performance versus that of memory performed by John Hennessy and David Patterson in the fifth edition of their book “Computer Architecture: A Quantitative Approach”, Iyer pointed out that the two have diverged progressively for several decades. Although instruction-level parallelism and clock speed have topped out in processors, the rise of multicore processing has boosted aggregate performance whereas “the number of memory operations per second since 2000 has flatlined”.

Iyer distinguished between memory bandwidth, which has been helped by pipelining and clock-speed boosts, and memory operations per second. The distinction matters because a number of the common functions performed by packet switches involve rapid manipulation of comparatively small pieces of data that control packet flow. These, he claimed, have become the bottlenecks in packet-processor designs rather than the bulk memory operations used to transfer packets between ports.

“The gap is one order of magnitude today,” Iyer said. “How can we eliminate this gap or at least reduce the gap?”

Memoir’s answer is to look at the types of memory access that are causing problems for conventional bulk on-chip memory macros and build more intelligence into a controller that mediates transactions between the processor and memory. These controllers sit between the SoC bus or on-chip network and SRAM macros. The company developed four types of “pattern-aware memory” that coordinate specific types of access:

  • Counter memory uses two ports side by side to communicate increments to memory-based variables instead of demanding successive reads and writes to the same location with a simple addition in between.
  • A more general-purpose read-modify-write controller provides more sophisticated support for variables that need back-to-back accesses, in effect caching the data location to speed up the subsequent accesses.
  • FIFO memory implements flexible hardware-assisted queues instead of relying on software to perform successive reads and writes. Using this mode allows more efficient prefetching.
  • Buffer memory lets the controller take care of memory block allocation when the processor creates a buffer for a received packet. “We return the address so the processor knows where the packet is,” said Iyer.
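To make the difference concrete, here is a minimal Python sketch of two of these ideas: an increment command that replaces a processor-side read and write, and a controller that allocates packet buffers and returns the address. It is a toy behavioural model, not Memoir's design. The class names, method names and the operation counter are all hypothetical, and the model counts only the commands the processor issues, not what happens inside the controller.

```python
class PlainMemory:
    """Conventional memory: the processor issues every read and write."""

    def __init__(self, size):
        self.cells = [0] * size
        self.ops = 0  # commands issued by the processor

    def read(self, addr):
        self.ops += 1
        return self.cells[addr]

    def write(self, addr, value):
        self.ops += 1
        self.cells[addr] = value


class CounterMemory:
    """Hypothetical controller that accepts an 'increment' command."""

    def __init__(self, size):
        self.cells = [0] * size
        self.ops = 0  # commands issued by the processor

    def increment(self, addr, delta=1):
        # One command from the processor; the addition is done
        # on the controller side (an assumption of this sketch).
        self.ops += 1
        self.cells[addr] += delta


class BufferMemory:
    """Hypothetical controller-side allocator for packet buffers."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def alloc(self):
        # Return the start address of a free block, or None if full.
        if not self.free_blocks:
            return None
        return self.free_blocks.pop(0) * self.block_size

    def release(self, addr):
        self.free_blocks.append(addr // self.block_size)


def count_ops(n_updates, addr=3):
    """Apply n_updates increments both ways; return processor op counts."""
    plain, smart = PlainMemory(16), CounterMemory(16)
    for _ in range(n_updates):
        plain.write(addr, plain.read(addr) + 1)  # read, add, write
        smart.increment(addr)                    # single command
    assert plain.cells[addr] == smart.cells[addr]
    return plain.ops, smart.ops
```

In this model, the increment path issues half as many processor commands for the same final counter value. The sketch says nothing about real cycle counts or about the performance gain quoted in the article.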

Because the controller cores have more influence on memory usage and allocation, they enable the use of power-saving modes, Iyer said.

“We made the observation that 70 per cent of the time memories are idle but burning energy doing nothing,” explained Iyer. “Even then, in these memories only 20 per cent of the memory macros are active at any one time. For example, output ports are only rarely congested: only 10 to 15 per cent of the buffer may be utilized most of the time. The rest are idle. Using these two key observations, we were able to provide fine-grained control of these idle macros. You can put blocks into sleep or hibernate.”
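As a rough illustration of fine-grained control over idle macros, the sketch below picks a power state for each macro from how long it has been idle and whether it still holds live data. Everything in it is an assumption made for illustration: the thresholds, the function name, and in particular the idea that "sleep" retains contents while "hibernate" does not. The article does not define the two modes.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    SLEEP = "sleep"          # assumed here to retain memory contents
    HIBERNATE = "hibernate"  # assumed here to lose memory contents


def choose_state(idle_cycles, holds_data,
                 sleep_after=16, hibernate_after=1024):
    """Pick a power state for one memory macro.

    The thresholds are hypothetical placeholders, not measured values.
    """
    if idle_cycles < sleep_after:
        return PowerState.ACTIVE
    # A macro with live data may only go to the retaining state.
    if holds_data or idle_cycles < hibernate_after:
        return PowerState.SLEEP
    return PowerState.HIBERNATE


def plan(macros):
    """macros: list of (idle_cycles, holds_data) tuples, one per macro."""
    return [choose_state(idle, live) for idle, live in macros]
```

A controller that handles buffer allocation itself knows which macros hold live packets, which is the kind of information such a policy would need.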

As more intelligence devolves into the memory array, it raises the prospect of re-partitioning the system to take advantage of 3D and interposer multichip packaging. Iyer said: “We have had very early conversations in the 3D space, looking at the use of 3D memories and whether we could insert these ideas into bulk-memory devices.”
