Multicore Association prepares standard for modelling parallelised software
The Multicore Association is getting close to publishing the first version of a specification that aims to standardise the way processor designers describe the available parallelism in their products to performance-estimation tools and auto-parallelising code generators.
Markus Levy, president of the Multicore Association, said the SHIM standard is analogous to the IP-XACT standard for describing the configuration of hardware IP. “It’s a link between a model of the hardware and the types of tool that might take advantage of it, such as performance estimators. I would not go as far as, say, emulators because we are targeting only 80 per cent accuracy. But it will allow people to work on candidate multicore devices and get to a certain point of confidence in their performance before handing off to the platform-specific tools.”
Like IP-XACT, the association’s SHIM standard is based on XML. “The end user will never see SHIM in the end,” said Levy. “It’s an agreement between the hardware guy and the tool guy. Right now we are trying to evangelise this so we can get this information to the tool providers more easily.”
Hardware structure
Masaki Gondo, software CTO at eSol and chairman of the SHIM working group, said: “SHIM will describe the hardware topology, what kind of processor clusters are involved and the memory subsystem topology. It describes performance metrics, such as the number of cycles it takes for them to talk to each other or how long it would take to fetch data from an address in the memory space. We have different performance metrics for different access types, such as how long a word or doubleword write would take. But it’s a simple enough XML schema: just 368 lines of XSD.”
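To give a flavour of the kind of description Gondo outlines, the sketch below shows a minimal, hypothetical SHIM-style file for a two-core cluster and a shared memory, annotated with cycle-count metrics. The element and attribute names are invented for this illustration; they are not taken from the published schema.

  <!-- Illustrative sketch only: names are hypothetical, not the actual SHIM schema -->
  <SystemDescription name="example-soc">
    <!-- Hardware topology: two identical masters in one cluster, one shared memory -->
    <Cluster id="cluster0">
      <Master id="cpu0" type="CPU" clockMHz="800"/>
      <Master id="cpu1" type="CPU" clockMHz="800"/>
    </Cluster>
    <Memory id="ddr0" sizeKBytes="1048576"/>
    <!-- Performance metrics: cycles for masters to talk to each other and to access memory -->
    <CommunicationLatency from="cpu0" to="cpu1" cycles="12"/>
    <AccessLatency master="cpu0" target="ddr0" accessType="writeWord" cycles="38"/>
    <AccessLatency master="cpu0" target="ddr0" accessType="writeDoubleword" cycles="42"/>
  </SystemDescription>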
“It’s not much fun to play with XML files, so we needed a tool to write the XML files for SHIM,” said Gondo.
The result is an editor together with a wizard-style tool that builds the skeleton of a SHIM description from the choices a user at the processor supplier makes about clustering, cache structure and memory sizes.
Gondo explained: “We will make this open source when we publish the specification. The editor creates new XML files and also reads existing SHIM XML files. You can also generate XML data bindings for tools: if you give the schema file to a schema compiler, it will generate Java or C++ code. A tool vendor can use this source code as a reference to see how to use the SHIM APIs in their tools.”
Memory moves
The key issue for the SHIM working group was to define a mechanism that could describe a multicore system and its memory hierarchy in sufficient detail to allow accurate performance estimation and parallelisation without burdening the analysis tools with masses of timing data and detailed instruction sets.
To describe what each processor can do, the working group borrowed the common instruction set from the open-source LLVM compiler. In SHIM, the operations supported by the processor, or ‘master’ in SHIM parlance, are associated with their estimated cycle times.
“We are not concerned with what the actual instructions are. If a tool needs to work on that, it can access the actual instruction set. But in the case of an auto-parallelising compiler, it does not need to understand the actual instructions; it just needs to be able to take a fragment of GCC source code and estimate how many cycles it will take on a particular master. You would then get a map of how many cycles it would take for this particular hardware,” Gondo explained.
Gondo claimed that, because LLVM’s intermediate-code instruction set has the concept of SIMD or vector operations, SHIM should be able to represent the performance of a given core’s instruction-level parallelism.
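As a rough sketch of how that might look in the XML, one could imagine a per-master operation list that pairs abstract, LLVM-like operations with cycle estimates, including a vector form. The element names below are again hypothetical rather than the schema’s actual vocabulary.

  <!-- Illustrative sketch only: names are hypothetical, not the actual SHIM schema -->
  <CommonInstructionSet master="cpu0">
    <Operation name="add.i32"   cycles="1"/>
    <Operation name="mul.i32"   cycles="3"/>
    <Operation name="load.i32"  cycles="4"/>
    <!-- A SIMD/vector add, so tools can credit the core's vector parallelism -->
    <Operation name="add.v4i32" cycles="1"/>
  </CommonInstructionSet>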
The growing complexity of memory subsystems causes a problem for performance estimation. Trying to analyse the response times of on-chip networks that support out-of-order completion and other ways to reduce the effect of blocking is hard without implementing the actual arbitration logic. This is another area where the working group believe it’s better to aim for 80 per cent rather than 100 per cent accuracy.
“There are so many possible parameters,” Gondo said. “One way to model that is to have some kind of polynomial equation to express the performance characteristics. But that could be very complex, and it would be difficult for tool vendors to adopt. So the approach we take is much simpler: pitch and latency.”
In SHIM, models of memory blocks store three values for latency – the time taken to perform a random access to the block – and one for pitch, which measures the latency of consecutive or predicted reads and writes. The three latency values represent the minimum, which may reflect a level-one cache hit, the maximum and the typical case.
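Put together, the entry for one memory block might look something like the hypothetical fragment below, with three latency figures and a pitch figure, all in cycles. As before, the markup is illustrative rather than the specification’s own schema.

  <!-- Illustrative sketch only: names are hypothetical, not the actual SHIM schema -->
  <MemoryPerformance master="cpu0" memory="ddr0" accessType="read">
    <!-- Random-access latency: best case (e.g. a level-one cache hit), typical and worst case -->
    <Latency min="4" typical="38" max="120" unit="cycles"/>
    <!-- Pitch: latency of consecutive or predicted accesses, e.g. within a burst -->
    <Pitch cycles="2"/>
  </MemoryPerformance>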
“We will get those figures from benchmarking tools. We also plan to make that tool open source,” said Gondo.
There are currently ten companies in the working group: five tool vendors and five semiconductor suppliers. “We believe the tool vendors will benefit most,” Gondo said. “But I’m finding that, once OEMs understand what SHIM is, they see the benefit because they are interested in getting the best performance.”
Pull from the OEM community should encourage both chipmakers and tools vendors to invest in tools based around a standard like SHIM. The standard’s structure is likely to be on show at the association’s upcoming conference in May.