This article looks at some of the key architectural and implementation decisions Synopsys has made in developing a version of its HS series of licensable processor cores to serve the embedded Linux market.
The ARC HS38 licensable CPU core has been designed for embedded applications that use a high-performance, virtual memory operating system such as Linux.
The core is a successor to the ARC 700 family, and includes a number of features to support the development of power-efficient embedded designs running such operating systems. These include:
- Single-, dual-, and quad-core variants with cache-coherent symmetric multiprocessing (SMP)
- Support for shared level-two (L2) cache
- A new memory-management unit (MMU)
- Enhanced context switching
- Multiple power domains and an optional power manager that enables dynamic voltage/frequency scaling
Simulations indicate that an ARC HS38 core will deliver a per-core throughput at 1.6GHz of more than 3,100 Dhrystone MIPS or 5,600 CoreMarks, when implemented in a 28nm high-performance-mobile CMOS process.
Power consumption for a minimal ARC HS38 implementation is 0.036mW per megahertz (58mW at 1.6GHz). It would occupy 0.21mm² of silicon.
Multicore symmetric multiprocessing
Multicore designs have always been possible using ARC cores, but the ARC HS38 series makes it easier to implement dual- and quad-core clusters and cache-coherent SMP for applications such as embedded Linux.
L1 cache coherence is critical for SMP. When two or more CPUs can access the same memory, some mechanism must keep them from independently modifying the same data. Maintaining this coherence in software consumes numerous clock cycles, so cache-coherent processors implement this mechanism in hardware. The ARC HS38 series uses a common method called snooping that watches all the L1 caches for read and write operations and keeps the cached data coherent with the data in the other caches.
A new cache coherency unit performs the snooping. Each CPU core has a three-channel snoop interface: one channel carries snoop commands from the coherency unit to the core; another channel returns the core’s response to the coherency unit; and the third channel transfers data from the core’s L1 cache to the coherency unit, which shares it with the L1 caches of the other core(s).
To govern these cache-to-cache transfers, the coherency unit employs a common protocol known as MOESI (modified, owned, exclusive, shared, invalid) to represent the five possible states of cached data. For example, when a core writes an instruction result to the L1 data cache, it marks that cache line as modified, and that line becomes the only valid copy. The coherency unit must transfer that data to the other L1 data caches before other cores can use it.
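The state transitions described above can be sketched with a toy model. This is purely illustrative — the real coherency unit's hardware logic is far more involved, and none of these helpers are Synopsys APIs:

```python
# Simplified MOESI bookkeeping for a single cache line in a small
# cluster. States are tracked per core; an illustrative model only.

MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID = "M", "O", "E", "S", "I"

def on_local_write(states, writer):
    """A core writes the line: its copy becomes Modified, the only
    valid copy; every other core's copy is invalidated."""
    return [MODIFIED if core == writer else INVALID
            for core in range(len(states))]

def on_remote_read(states, owner, reader):
    """Another core reads a Modified line: the coherency unit forwards
    the data, the writer keeps it as Owned, and the reader's copy
    becomes Shared."""
    new = list(states)
    new[owner] = OWNED
    new[reader] = SHARED
    return new
```

After core 0 writes, the line's states are `["M", "I"]`; once core 1 reads it through the coherency unit, they become `["O", "S"]`.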
Figure 1 Dual-core ARC HS38x2 cluster (Source: Synopsys)
The new cache coherency unit snoops the L1 caches of all CPU cores in a cluster and ensures that data modified by one core is shared with the others
Likewise, an optional I/O coherency unit keeps input/output traffic coherent with the L1 caches. When an I/O device modifies data in one core’s L1 cache, this unit updates the other L1 caches, too. The important point is that application programmers needn’t worry about these details, because the coherency hardware automatically handles the complex bookkeeping.
Inter-core communication is another critical feature for SMP. To enable the CPUs in a multicore cluster to exchange messages, the cluster shares a centralized SRAM among all the cores. If two or more cores try to access this memory at the same time, an arbitrator gives each core its turn in a round-robin fashion.
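Round-robin arbitration of this kind can be sketched in a few lines (illustrative only; the actual arbiter is hardware):

```python
def round_robin(last_granted, requests):
    """Grant the first requesting core after last_granted, wrapping
    around the cluster; returns the granted core index, or None if
    no core is requesting access."""
    n = len(requests)
    for offset in range(1, n + 1):
        core = (last_granted + offset) % n
        if requests[core]:
            return core
    return None
```

Because the search always starts just past the previous winner, no requesting core can be starved: in a quad-core cluster where cores 0, 2, and 3 all request, the grants rotate 2, 3, 0, 2, ...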
Similarly, an inter-core interrupt handler allows any core to send an interrupt to another core—for example, if a process running on one core generates an error that must be handled by a process running on another core.
To synchronize multiple processes, a 64-bit real-time clock acts as a central resource. This avoids the redundancy of implementing separate clocks in each CPU core. Hardware semaphores aid this synchronization and govern each core’s access to the cluster’s shared resources. Developers will appreciate a centralized mechanism that enables an external debugger or development system to halt, run, or reset any core or combination of cores in a cluster.
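The semaphore behavior can be modeled as an atomic test-and-set with owner-only release. A minimal sketch, assuming the semaphore guards a single shared resource (the real mechanism is a hardware register, not software):

```python
class HardwareSemaphore:
    """Models one hardware semaphore: claiming is an atomic
    test-and-set, and only the owning core may release it."""

    def __init__(self):
        self.owner = None   # core ID currently holding the semaphore

    def try_claim(self, core):
        """Atomically claim the semaphore; returns True on success."""
        if self.owner is None:
            self.owner = core
            return True
        return False

    def release(self, core):
        """Release the semaphore; ignored unless `core` owns it."""
        if self.owner == core:
            self.owner = None
            return True
        return False
```

A core that fails `try_claim` simply retries later; because the claim is atomic in hardware, two cores can never both believe they hold the resource.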
The coherent L1 caches can work with an optional L2 cache, which is much more than just a block of SRAM bolted onto the CPU. Several features ensure high performance while consuming minimal power.
All CPUs in a multicore cluster share the same L2 cache, whose size is user-configurable up to 8Mbyte. This cache is designed to run at the same clock frequency as the CPU core and attaches to a private backside bus on each CPU with a separate 64-bit bus for instructions and a 128-bit bus for data. On its front side, which connects to the AXI peripheral bus, the L2 cache has a configurable interface that can be 64, 128, or 256 bits wide. These features ensure that the L2 cache can keep up with the CPUs while avoiding AXI bus traffic on the datapaths between the CPU cores and the L2 cache.
In addition to configuring the L2 cache’s clock speed, memory size, and AXI interfaces, chip designers can customize other features. Cache lines can be 64 or 128 bytes long with up to 16-way set-associativity. For mission-critical applications, the L2 cache optionally supports error-correction codes (ECC). The cache also supports AXI cache protocols, including read-through, write-through, read-no-allocate, and write-no-allocate.
To save power, the L2 cache can quickly enter a sleep mode that retains the SRAM’s contents but slashes power consumption by about 90%. Another low-power mode (idle) is faster to enter and exit but saves less power. A full shut-down mode saves the most power but does not retain the cached data.
When chip designers implement the L2 cache, they can select higher-density SRAM libraries to reduce its power consumption and die area, although performance will suffer a bit. The cache-control hardware implements four levels of clock gating, one more level than is found in the CPU core.
Memory management unit
The new MMU enables the HS38 cores to run sophisticated embedded operating systems that support both SMP and virtual memory. Although the 10-year-old ARC 700 already has an MMU, the new one is more advanced. Its configurable physical-address space is 40 bits, enough for one terabyte (1Tbyte) of main memory, compared with a 32-bit physical-address space on the MMU of the ARC 700, addressing just 4Gbyte of memory.
Likewise, the new MMU supports variable-size memory pages. The ARC HS38 concurrently supports memory pages in the normal range (4Kbyte, 8Kbyte, or 16Kbyte) as well as very large pages (4Mbyte, 8Mbyte, or 16Mbyte). Larger pages tend to reduce the number of missed references in the translation-lookaside buffer (TLB).
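The benefit of larger pages is easy to quantify: each TLB entry maps exactly one page, so a fixed-size TLB "reaches" more memory when pages are bigger. A back-of-the-envelope sketch (the 64Mbyte working set is an arbitrary example):

```python
def tlb_entries_needed(region_bytes, page_bytes):
    """TLB entries required to map a region with no TLB misses
    (ceiling division: partial pages still cost a full entry)."""
    return -(-region_bytes // page_bytes)

KB = 1024
MB = 1024 * KB

# Mapping a hypothetical 64Mbyte working set:
with_small_pages = tlb_entries_needed(64 * MB, 4 * KB)   # 16,384 entries
with_large_pages = tlb_entries_needed(64 * MB, 4 * MB)   # 16 entries
```

With 4Kbyte pages, this working set overflows even a 1,024-entry TLB many times over; with 4Mbyte pages it fits with room to spare.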
The new MMU’s primary TLB has 1,024 entries and is four-way set-associative. With its large TLB, an ARC HS38 core is much more likely to find a virtual-to-physical memory-address translation in the buffer without having to waste numerous clock cycles walking the page table in main memory.
The new TLB has a predictor that can further accelerate these address translations. Instead of waiting for the translation to finish before accessing the cache, the TLB monitors recent translations and predicts, with 99% accuracy, whether the physical-address bit it needs to begin the cache access is zero or one, thereby predicting the physical address.
Using that prediction, the CPU can access the data while the TLB resolves the address translation. If the prediction is correct, the CPU has already started fetching the data; if it is wrong, the CPU discards the speculative access and retries with the correct address from the TLB.
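The payoff can be sketched with a simple expected-cost model. The cycle counts below are hypothetical, chosen only to show the shape of the trade-off; only the 99% accuracy figure comes from the text:

```python
def access_cycles(predicted_bit, actual_bit, hit_cycles, redo_cycles):
    """Cycles for one cache access: if the predicted address bit
    matches the real translation, the speculative access stands;
    otherwise it is redone with the correct address."""
    if predicted_bit == actual_bit:
        return hit_cycles
    return hit_cycles + redo_cycles

def average_cycles(accuracy, hit_cycles, redo_cycles):
    """Expected cycles per access for a predictor of given accuracy."""
    return hit_cycles + (1.0 - accuracy) * redo_cycles
```

With hypothetical 2-cycle hits, a 3-cycle redo penalty, and 99% accuracy, the average access costs 2.03 cycles: mispredictions are so rare that their penalty barely registers.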
Multiple processes running on ARC HS38 cores can share a single TLB mapping. On Linux, for example, the MMU’s address-space identifier allows up to 64 libraries to share the TLB without flushing and refilling the buffer for each one.
Enhanced context switching
ARC CPU cores support multiple register banks for fast context switching. In a single clock cycle, the CPU can change a pointer to a duplicate register bank that holds the complete state of another process. This bank switching is much faster than copying the registers to memory and reloading them later. Customers can configure an ARC HS38 core with up to eight banks.
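Bank switching can be illustrated with a toy model. The bank count comes from the text; the 32-register width and everything else here is illustrative:

```python
class RegisterBanks:
    """Models bank-switched context switching: changing the active
    bank pointer swaps in a whole register set in one step, instead
    of copying registers out to memory and reloading them later."""

    def __init__(self, banks=8, regs=32):
        self.banks = [[0] * regs for _ in range(banks)]
        self.active = 0   # the bank pointer

    def switch(self, bank):
        self.active = bank   # the whole "context switch" is one move

    def write(self, reg, value):
        self.banks[self.active][reg] = value

    def read(self, reg):
        return self.banks[self.active][reg]
```

Each process's register state stays resident in its own bank, so switching back to a process restores its state without touching memory.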
Multiple power domains
In an ARC HS multicore design, each CPU core resides in its own power domain. To reduce power consumption when the system does not require maximum performance, individual cores can sleep or shut down. As Figure 2 shows, a centralized power-management unit (PMU) controls these functions for all the cores.
Figure 2 ARC HS family power-management unit (Source: Synopsys)
Using this configurable PMU, designers can govern the power-on, power-off, and sleep states of each core in a multicore cluster.
The PMU can also adjust the core voltage and clock frequency to match the CPUs’ performance with varying workloads. This dynamic voltage/frequency scaling can dramatically reduce power consumption, which varies linearly with clock frequency and quadratically with voltage. Typically, these parameters will change in a step function. The PMU will adjust the clock frequency up or down in small increments until reaching a certain threshold. Then it will change the voltage, which has the greatest effect on power consumption.
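These relationships follow from the standard dynamic-power equation for CMOS logic, P ≈ C·V²·f. A quick numeric sketch (the capacitance value is arbitrary) shows why the voltage step matters most:

```python
def dynamic_power(c_eff, voltage, freq_hz):
    """Dynamic CMOS switching power: P = C_eff * V^2 * f
    (linear in frequency, quadratic in voltage)."""
    return c_eff * voltage ** 2 * freq_hz

C = 1e-9                                  # arbitrary effective capacitance
base = dynamic_power(C, 1.0, 1.6e9)       # baseline operating point
half_freq = dynamic_power(C, 1.0, 0.8e9)  # halving frequency halves power
half_volt = dynamic_power(C, 0.5, 1.6e9)  # halving voltage quarters power
```

Halving the clock halves dynamic power, but halving the voltage cuts it to a quarter — which is why the PMU inches the frequency along and saves the voltage change for when a threshold is crossed.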
Programmers can control how the PMU adjusts these parameters, fine-tuning the system for specific workloads. However, as with most CPU clusters that support dynamic voltage/frequency scaling, all the CPU cores must work in unison: individual cores can sleep or shut down, but they cannot operate at voltages or clock frequencies different from those of the other cores in the same cluster.
Individual CPU cores and other elements of a multicore cluster have multiple power domains of their own. Each core is partitioned into three domains: an always-on domain that keeps a small amount of critical logic active; a second domain that powers the core logic and programmer-invisible registers; and a third domain that powers the programmer-visible registers and L2-cache SRAMs. The always-on domain contains the logic that must remain available to respond to wake-up commands. The second domain can sleep while retaining state, or shut down after saving state. The third domain can retain state while sleeping in a low-voltage mode that reduces current leakage.
Figure 3 Power domains in a quad-core ARC HS38x4 cluster (Source: Synopsys)
These independent domains allow the CPU cores and related logic to optimize power consumption as the workload varies.
Other elements of a multicore cluster also reside in their own power domains and can enter a lower-power state. These elements include the cache-coherency unit and I/O coherency unit. For instance, if all the cores in a cluster but one shut down, some of these multiprocessing functions become unnecessary and can draw less current as well.
Complex operating systems such as Linux are becoming increasingly common in embedded applications ranging from digital TVs to networking and data-center equipment. Running these operating systems efficiently takes a combination of dedicated architectural features to handle issues such as cache coherency in multiprocessor designs, good performance, and effective power-management strategies. Synopsys has included all these features in its ARC HS38 core to serve current needs, and left headroom, in terms of both performance and feature set, to serve future needs.
This piece has been adapted from a more detailed discussion of the HS38 architecture, available here.
There are more details of the HS38 here.
There’s a related piece on the configurability of the core here.
Company info: Synopsys Corporate Headquarters, 700 East Middlefield Road, Mountain View, CA 94043; (650) 584-5000; (800) 541-7737; www.synopsys.com