Synopsys has introduced a variant of its HS family of cores for use in embedded Linux applications. The HS38 is intended as a successor to the ARC 770D and includes an enhanced MMU, bigger page sizes and a larger physical addressing range, among other changes.
The company sees Linux, and variants such as Android, being embedded in applications as diverse as digital TVs, home networking, data centre processors and switches, smart appliances, network attached storage and wearables.
“What we’re seeing today with wearables is just the tip of the iceberg,” said Michael Thompson, senior manager of product marketing for DesignWare ARC processors at Synopsys. “They’re going to draw us in and will take a different form factor to watches – maybe in glasses or phones.”
HS38 vs 770D
The HS38 uses the same 10-stage pipeline as earlier HS family cores and can be clocked 30% faster than the 770D, which has a shorter, 7-stage pipeline. The HS38 pipeline also includes an enhanced branch prediction unit, which Thompson claims is 95% accurate, and a second ALU in its ninth stage to mask the effect of mispredicted branches.
The core uses the ARCv2 instruction set architecture (ISA), which offers 18% better code density than the v1 ISA, and delivers 1.93 DMIPS/MHz against 1.71 DMIPS/MHz for the 770D, roughly 13% more performance per clock cycle.
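The per-clock comparison can be checked directly from the two DMIPS/MHz figures. This minimal Python sketch uses only the numbers quoted above; the constant and function names are illustrative.

```python
# Per-clock performance figures quoted for each core (DMIPS per MHz).
HS38_DMIPS_PER_MHZ = 1.93
ARC770D_DMIPS_PER_MHZ = 1.71


def relative_gain(new: float, old: float) -> float:
    """Return the fractional improvement of `new` over `old` (0.10 == 10%)."""
    return new / old - 1.0


gain = relative_gain(HS38_DMIPS_PER_MHZ, ARC770D_DMIPS_PER_MHZ)
print(f"HS38 vs 770D, per clock: {gain:.1%}")
```

The ratio works out to about 12.9% per clock cycle, before any gain from the HS38's higher clock speed is taken into account.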
An optional floating-point unit (FPU) with support for single- and double-precision arithmetic instructions is also available.
The HS38 has a full MMU with a translation lookaside buffer (TLB) of up to 1024 entries, and can be configured with normal and large (up to 16Mbyte) page sizes. More TLB entries and larger pages mean fewer TLB misses, so the HS38 spends less time translating addresses than the 770D, which has a simpler MMU with a fixed 8Kbyte page size and just 256 TLB entries.
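One way to see the effect of the MMU changes is "TLB reach": the amount of memory the TLB can map at once, which is simply the number of entries multiplied by the page size. The sketch below treats the quoted 1024 and 256 figures as TLB entries and assumes an 8Kbyte normal page for the HS38 comparison; the 16Mbyte case is a theoretical upper bound, since a 32bit core's virtual address space limits what any single process can actually use.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space the TLB can map at once (entries x page size)."""
    return entries * page_size


# 770D: 256 entries, fixed 8Kbyte pages.
arc770d_reach = tlb_reach(256, 8 * KIB)
# HS38 with the same 8Kbyte pages (assumed), so only the entry count differs.
hs38_small_reach = tlb_reach(1024, 8 * KIB)
# HS38 upper bound: every entry mapping a 16Mbyte large page.
hs38_large_reach = tlb_reach(1024, 16 * MIB)

for label, reach in (("770D", arc770d_reach),
                     ("HS38, 8Kbyte pages", hs38_small_reach),
                     ("HS38, 16Mbyte pages", hs38_large_reach)):
    print(f"{label}: {reach // MIB} MiB")
```

With equal page sizes the HS38 maps four times as much memory as the 770D (8MiB vs 2MiB); large pages raise the ceiling far higher, which is where most of the reduction in TLB misses comes from.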
Figure 1 The HS38 core includes features to support symmetric multiprocessing (Source: Synopsys)
The HS38 also has a 40bit physical address space, so it can directly address 1Tbyte of memory. Thompson says this may be useful in data centre and NAS applications, “and we have had requests from customers for a 49bit address space.”
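The 1Tbyte figure follows from the address width: each extra address bit doubles the number of distinct byte addresses. This short sketch compares a conventional 32bit physical address with the HS38's 40bit one; the function name is illustrative.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with `address_bits` bits."""
    return 1 << address_bits


for bits in (32, 40):
    size = addressable_bytes(bits)
    # Shift right by 30 to express the size in GiB (2**30 bytes).
    print(f"{bits}-bit physical address: {size:,} bytes = {size >> 30} GiB")
```

Moving from 32 to 40 bits multiplies the addressable memory by 2^8 = 256, from 4GiB to 1024GiB (1TiB).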
The HS38 can be configured with up to 64Kbyte each of L1 instruction and data cache, and up to 8Mbyte of L2 cache. It is also possible to implement separate closely coupled memories (CCMs) for instructions and data, each of up to 16Mbyte.
Because the HS38 allows two clock cycles for each cache and CCM access, using large memories limits the core's maximum clock speed less than it otherwise would. The two-cycle access lets designers use high-density memory with the HS38, whereas the 770D would need high-speed memory (which takes more area and consumes more power) to reach the same frequency.
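The benefit of a two-cycle access can be illustrated with a first-order timing budget: if a memory must respond within N clock periods, the highest usable clock is roughly N divided by its access time. This sketch ignores setup time, clock skew and wiring delay, and the 0.8ns access time is a hypothetical value chosen for illustration, not a figure from Synopsys.

```python
def max_clock_ghz(access_time_ns: float, cycles: int) -> float:
    """Highest clock (GHz) at which a memory with the given access time fits
    inside `cycles` clock periods. Ignores setup time, skew and wiring."""
    return cycles / access_time_ns


def access_budget_ns(clock_ghz: float, cycles: int) -> float:
    """Time (ns) a memory has to respond when given `cycles` clock periods."""
    return cycles / clock_ghz


# Hypothetical access time for a large, high-density SRAM.
ACCESS_NS = 0.8

for cycles in (1, 2):
    print(f"{cycles}-cycle access: f_max = {max_clock_ghz(ACCESS_NS, cycles):.2f} GHz, "
          f"budget at 2.2GHz = {access_budget_ns(2.2, cycles):.3f} ns")
```

Under these assumptions the same memory supports 1.25GHz with single-cycle access but 2.5GHz with two-cycle access; put the other way, at 2.2GHz a two-cycle memory has about 0.91ns to respond instead of about 0.45ns, which is what allows slower, denser memory to be used.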
The HS38 is available in dual- and quad-core configurations, with support for SMP Linux. Multicore designs can take advantage of a full L1 cache-coherency unit.
Figure 2 The ARConnect IP provides facilities to ease the co-ordination of multiple cores (Source: Synopsys)
ARConnect multicore IP is also available, providing much of the hardware infrastructure needed to manage interactions between the cores and with the outside world.
Customers are, as always, looking for high performance at low power consumption. Synopsys claims that the HS38 can deliver up to 4200 DMIPS at 2.2GHz in a typical 28nm process, while consuming less than 90mW and occupying 0.21mm² of die area. The performance is said to be twice that of the 660D.
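The headline numbers can be turned into the efficiency ratios usually used to compare embedded cores. This sketch uses only the figures Synopsys quotes; because power is given as "less than 90mW", the DMIPS/mW result is a lower bound and the mW/MHz result an upper bound.

```python
def efficiency(dmips: float, clock_mhz: float, power_mw: float, area_mm2: float) -> dict:
    """Derive common efficiency ratios from headline performance figures."""
    return {
        "DMIPS/MHz": dmips / clock_mhz,
        "DMIPS/mW": dmips / power_mw,    # lower bound if power is a maximum
        "mW/MHz": power_mw / clock_mhz,  # upper bound if power is a maximum
        "DMIPS/mm2": dmips / area_mm2,
    }


metrics = efficiency(dmips=4200, clock_mhz=2200, power_mw=90, area_mm2=0.21)
for name, value in metrics.items():
    print(f"{name:>10}: {value:,.3f}")
```

The result of about 1.91 DMIPS/MHz is close to the 1.93 DMIPS/MHz quoted for the core, and the power figure works out to roughly 0.041mW/MHz.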
Configurability and extensibility
The HS38 can be configured and extended in a number of ways using the ARChitect tool. Configuration options include:
- CPU set-up, register file size, timers, byte ordering
- memory type, size, partitions, base address
- power management, clock gating
- ports and bus protocol
- multipliers, dividers and other hardware features
- optional features: XY memory, FPU, real-time trace (RTT)
- optional instructions
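To make the size limits mentioned earlier concrete, the sketch below models a handful of HS38 build options as a Python dataclass and checks them against the quoted maximums (64Kbyte L1 caches, 8Mbyte L2, 16Mbyte CCMs, one, two or four cores). It is entirely hypothetical: the class, field names and default sizes are illustrative and do not reflect ARChitect's actual configuration format.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass
class HS38Config:
    """Hypothetical summary of a few HS38 build options (not ARChitect's format)."""
    cores: int = 1
    icache_bytes: int = 32 * KB  # illustrative default
    dcache_bytes: int = 32 * KB  # illustrative default
    l2_bytes: int = 0
    iccm_bytes: int = 0
    dccm_bytes: int = 0
    fpu: bool = False
    big_endian: bool = False

    def problems(self) -> list[str]:
        """List violations of the quoted limits; an empty list means the config fits."""
        errors = []
        if self.cores not in (1, 2, 4):
            errors.append("cores must be 1, 2 or 4")
        for name, size, limit in (
            ("icache", self.icache_bytes, 64 * KB),
            ("dcache", self.dcache_bytes, 64 * KB),
            ("L2", self.l2_bytes, 8 * MB),
            ("ICCM", self.iccm_bytes, 16 * MB),
            ("DCCM", self.dccm_bytes, 16 * MB),
        ):
            if not 0 <= size <= limit:
                errors.append(f"{name} size {size} outside 0..{limit} bytes")
        return errors


quad = HS38Config(cores=4, l2_bytes=2 * MB, fpu=True)
print(quad.problems())                                  # []
print(HS38Config(dcache_bytes=128 * KB).problems())     # one dcache error
```

Keeping the limits in one table makes it easy to add further checks as more options are modelled.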
The core is also set up so that users can extend its instruction set and functionality by adding proprietary hardware to the core’s pipeline through the ARC Processor EXtension (APEX) custom instruction interface. This enables users to add 32bit instructions, core and auxiliary registers, condition and status codes and memory-mapped blocks. The added instructions can be blocking or non-blocking, and out-of-order completion is supported. The user’s proprietary instruction extensions (specified in Verilog RTL) can be added to the HS38 using a graphical wizard in ARChitect and are usable in the MetaWare compiler and nSIM simulator.
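A typical candidate for a custom instruction is an operation that takes a loop or a long shift-and-mask sequence in software but is trivial in hardware. Bit reversal is used below purely as a hypothetical example; nothing in the source says any customer uses it. The Python function is the kind of software reference model a team might keep to check the Verilog RTL of such an instruction against, and is not part of any APEX API.

```python
def bit_reverse32(x: int) -> int:
    """Reference model: reverse the bit order of a 32bit word.

    In software this takes a 32-iteration loop; a custom instruction
    could produce the same result in a single operation.
    """
    x &= 0xFFFFFFFF
    result = 0
    for _ in range(32):
        result = (result << 1) | (x & 1)  # move the lowest bit of x into result
        x >>= 1
    return result


for word in (0x00000001, 0x0000FFFF, 0x12345678):
    print(f"{word:#010x} -> {bit_reverse32(word):#010x}")
```

Running a hardware simulation and the reference model on the same inputs and comparing results is a simple way to gain confidence in the RTL before integration.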
“About 60% of our customers use one or more APEX instructions,” said Thompson. “The most is just over 100, but typically it is five to 10.”
The release of the HS38 also brings updates to the HS34 and HS36 versions of the core. These include support for up to 16Mbyte each of instruction and data CCM, up to eight contexts (with options to configure up to eight register files to go with those contexts), and support for 64bit APEX instructions, which can use register pairs as operands to move data faster.
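A 64bit APEX instruction that works on a register pair sees one 64bit value spread across two 32bit registers, so a single instruction handles twice as much data as a 32bit one. The sketch below shows that split and recombination. Which register holds the low word is an assumption made here for illustration (low word in the first register of the pair); the ARCv2 documentation defines the real convention.

```python
MASK32 = 0xFFFFFFFF


def split_to_pair(value: int) -> tuple[int, int]:
    """Split a 64bit value into (low, high) 32bit register contents.

    Assumption: the low word goes in the first register of the pair and
    the high word in the second.
    """
    value &= (1 << 64) - 1
    return value & MASK32, value >> 32


def join_pair(low: int, high: int) -> int:
    """Recombine a (low, high) register pair into one 64bit value."""
    return ((high & MASK32) << 32) | (low & MASK32)


low, high = split_to_pair(0x0123456789ABCDEF)
print(hex(low), hex(high))  # 0x89abcdef 0x1234567
```

Splitting and rejoining is lossless, so the pair behaves as a single 64bit operand.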
The HS34 and HS36 also get improved power management through extra architectural clock gating and sleep modes, and support for L1 cache coherency and L2 cache in the dual- and quad-core versions of the HS36.
The HS38 is supported by the Synopsys MetaWare Development Toolkit, which includes an optimized C/C++ compiler, a debugger and an instruction set simulator. An ARC HS Processor Family Virtualizer Development Kit, including models of the processor and common peripherals, is also available, as is a cycle-accurate simulator.
The ARC AXS103 Software Development Platform provides a development environment with a rich set of peripherals, drivers, pre-built Linux images and application examples.
Open source software support includes an optimized Linux kernel as well as the GNU Compiler Collection (GCC), GNU Project Debugger (GDB) and associated GNU programming utilities (binutils).
A technology plug-in is also available for Synopsys’ Lynx Design System, providing pre-tuned design flow scripts, constraints and tool settings to accelerate chip-level integration and shorten the time needed to reach optimized results.
Support for closely coupled memories and direct-mapped peripherals, with single-cycle access to all peripheral registers on an SoC, improves performance and reduces system latency.