A year on from the last Computex and the launch of the Cortex-A73 and Mali-G71, ARM has launched a new trio of processors aimed this time not just at smartphones but also at servers and driver-assistance systems.
The two Cortex processors are the first to incorporate concepts from DynamIQ, the set of architectural additions the company plans for its forthcoming version 8 processor cores. Both the Cortex-A75 and its little sibling, the A55, support the version 8.2 instruction set, which adds operations for high-availability systems as well as dot-product and half-precision floating-point instructions to suit machine-learning applications.
Ian Smythe, senior director of marketing programs in ARM’s processor group, said the design of the new processor cores is intended to deliver “not just on-device AI performance but also deliver TCO improvements for infrastructure and support the development of ASIL-D safety systems in autonomous machines”.
Smythe claimed a 50 per cent improvement in throughput for the three-way superscalar A75 over its two-way predecessor, the A73. The figure assumes an increase in clock frequency from 2.4GHz on an A73 to 3GHz for an A75, based on simulations of the Specint2006 benchmark suite; the 25 per cent clock uplift combined with a roughly 20 per cent gain in instructions per cycle accounts for the 1.5x result. The A55, which uses an in-order issue pipeline in contrast to the out-of-order architecture of the A75, improves power efficiency by up to 2.5 times, according to the company, assuming a shift from a 28nm process for an A53 to 16nm finFET for an A55.
With the introduction of a new interconnect infrastructure with the CCI-600, the processors can be built into eight-way clusters – double that of previous implementations. Typical configurations are expected to be four big plus four little cores for vehicle-based ADAS and servers, and one big plus seven little for power-constrained devices.
Govind Wathan, product manager at ARM, said the move to a three-way superscalar leads to approximately 20 per cent more instructions per cycle compared to the A73. “We have also completely redesigned the memory subsystem,” he added. “On many actual applications, 30-odd per cent of the instruction mix is memory based so that is one area where we have focused. Another area where we have worked is on the branch predictor and the design of that is linked with the new memory system.”
One of the limitations on cluster size previously was the use of a shared level-two (L2) cache. In the new processors, each has its own private L2 and can share a level-three (L3) cache, although some infrastructure-oriented designs may opt to push this out to a system-level cache.
“We’ve reduced the latency into the L2 cache by about half,” Wathan claimed. “To help with coherency we have a functional unit called the DynamIQ share unit.”
The L3 cache, when used as a single unified array, is 16-way set-associative, but it can be split into four 4-way set-associative regions to improve performance in infrastructure applications where individual processors have different types of workload and memory demands. The DSU controls power to the cache so that portions are only powered up when the workload handled by the cluster justifies it: its controller analyzes the memory flow to determine when to fire up cache lines in groups of set-associative ways.
As part of the DynamIQ family, the Cortex processors can talk to hardware accelerators through dedicated ports managed by the DSU. The processor cores have activity monitoring hardware to support thread management by an operating system and to watch for system and memory errors in high-availability scenarios as well as additional instructions to support type-two hypervisors. To support design into ASIL-rated applications, ARM will provide documentation to support safety cases for the new processors.
Alongside the version 8.2 instruction set, other additions for memory handling in a multicore environment include support for cache stashing and for more atomic operations. Stashing sends data from an I/O controller directly into the cache while it is also being transferred to main memory. This avoids the delay of issuing a read from main memory following the I/O access, and has previously been a feature of architectures such as the NXP/Freescale PowerQUICC.
AI meets graphics
The Mali-G72 continues with the tile-based rendering approach of its predecessors but adds extra local memory to cache partially completed tiles between passes by the renderer, further reducing off-chip memory-bus traffic.
Anand Patel, director of product marketing, said: “We did some optimizations of the pipeline. We simplified the architecture in places and improved throughput for some use-cases.”
One of the optimizations is better performance for low bit-width multiplies, which better suits the output of some neural-network tools such as Google’s TensorFlow. The way that the shader cores access distributed L1 memories has also been altered to better suit the accesses made by typical neural-network kernels. “We also increased the size of those in certain places,” Patel added.
Further changes include profile-driven modifications to the shader cores that reduce overall die area while increasing the cycle times of some lesser-used operations. “As content has evolved we’ve got better visibility of the code. We’ve got a better view of how they exercise the execution units,” Patel said. “There are some parts of the pipeline that don’t get used as much while others get stressed heavily. The lesser-used components have reduced throughput and reduced area. For others, like reciprocal square root, we’ve added more hardware.”