Picking the right-sized crypto processor for your SoC

By Ruud Derwig | 1 Comment | Posted: July 11, 2018
Topics/Categories: IP - Selection | Tags: AES, benchmark, cryptography, FIPS 140-2, SHA | Organizations: EEMBC, Synopsys

Ruud Derwig has more than 20 years of experience with software and system architectures for embedded systems. Key areas of expertise include (real-time, multi-core) operating systems, media processing, component based architectures, and security. He holds a master's degree in computing science and a professional doctorate in engineering. Derwigis currently a software and systems architect at Synopsys.

Applications such as encrypted storage, authenticated communication, secure boot, and software update all need confidentiality, integrity, authenticity, and/or non-repudiation, which are made possible by cryptography algorithms.

Security protocols prescribe which cryptography algorithms should be used, how they should be applied, and in what order. Within these constraints, SoC designers have a lot of implementation choices, which may depend upon the kind of device they are trying to secure, the importance of doing so, and its resource constraints. The robustness of the security implementation will also vary, depending on the kind of attacks expected. Implementing security is therefore a trade-off between factors such as the level of risk, level of security, cost of security implementation, energy consumption, and performance.

Cryptography implementation options

Cryptography algorithms and protocols can be implemented in hardware, software, or a mixture of both. Figure 1 shows three commercial designs built three different ways.

No URL for image

The left of Figure 1 shows a software-only model built with the Synopsys DesignWare Cryptography Software Library. This library features configurable symmetric and asymmetric cryptography algorithms for encryption and certificate-processing functions. The Library is certified under the National Institute of Science and Technology Crypto Algorithm Verification Program, making it useful for systems that need to comply with FIPS 140-2.

The middle of Figure 1 shows an example of partial acceleration by means of specialized instructions and auxiliary registers. The DesignWare ARC SEM Security Processor is extended with the DesignWare ARC CryptoPack option, which adds custom instructions and registers to the ARC processors to speed up software algorithms including AES, 3DES, ECC, SHA-256, and RSA.

The right side of Figure 1 shows hardware solutions. The DesignWare Cryptography IP solutions include configurable symmetric and hash cryptographic engines, public key accelerators (PKAs) and true random number generators (TRNGs).

Software-only algorithm implementation

A software-only implementation involves trade-offs between performance, code size, and memory. Using lookup tables instead of computing values on-the-fly, for example, can speed up the implementation of certain algorithms – at the cost of extra memory. Loop unrolling can reduce branching overhead and provide more scheduling options to the compiler, at the cost of larger code.

Designers need to consider the vulnerability of their software implementations to attacks. For example, implementing RSA requires the modular exponentiation of very large integer numbers. The square-and-multiply algorithm does this efficiently, but the time it takes to execute depends linearly on the number of ‘1’ bits in the key. By repeating timing measurements for different data values and applying statistical analysis to the results, an attacker could obtain the secret key.

As an example of the impact of different software implementation choices, Figure 2 shows the performance versus code size of several SHA hashing implementations.

The labels of the data points indicate the software implementation choice. A number indicates the loop unroll factor, and ‘pre’ or ‘otf’ indicates whether the message scheduler is pre-computed at the beginning of the routine (pre), or the computation is interleaved with the round transformations (on-the-fly). The shape and color of the points indicate the compiler configuration for generating code optimized either for performance or size. Performance is shown as the number of cycles per message byte, so a lower number indicates higher performance. The figure shows that loop unrolling has the largest effect on performance, but compiler optimization for speed and on-the-fly computation improve performance as well, at the cost of increased size.

No URL for image

Partially accelerated implementation

In this approach, parts of the cryptographic algorithm are implemented in dedicated hardware, which improves overall performance and reduces energy consumption at the cost of increased hardware area and reduced flexibility. Some of the area absorbed by the cryptographic acceleration functions can be regained by reducing the instruction memory. Data memory could also be reduced by computing values on-the-fly in hardware instead of using lookup tables.

Designers can offload software computations to dedicated hardware in various ways. One is to implement a bus-based peripheral device to do the computations, but this only makes sense if enough cryptographic computations are offloaded at once to make up for bus latencies to and from the peripheral.

Other approaches include using a coprocessor interface or extending the processor with dedicated registers and specialized instructions.

The CryptoPack implementation of AES shows one approach to finding the right granularity and mix of hardware and software elements to meet a performance, area, and energy budget. The AES algorithm repeats the same computations (called a ‘round’) several times. CryptoPack introduces an instruction for the ARC 32bit processor that computes a quarter of a round, instead of a full round. This makes the best use of the ARC core’s 32bit datapath to get the required round keys from memory to the accelerator, pipelined with the computations. Implementing a full round in hardware as an instruction extension would need more hardware but would not improve performance much because extra instructions would still be required to load and transfer the full 128bit round key.

Full hardware algorithm implementation

For cases in which a software or partially accelerated software implementation does not provide enough performance or energy efficiency, a hardware engine can implement the full algorithm. Software is then limited to a device driver or hardware abstraction layer for interacting with the hardware engine.

For bulk cryptographic operations (such as hashing, encryption, and decryption), hardware engines often have DMA features, so the engine does not have to rely on a processor and device driver software for data transfers.

Another implementation issue involves dealing with the keys needed for the algorithms. In a software implementation, the keys are accessed by the software and this can be a security risk. In a hardware implementation, the keys may be accessed through a dedicated interface to non-volatile memory. A cryptography engine could also be configured with extra key storage memory, so it can switch between multiple parallel clients quickly.

SoC designers also need to decide how to implement cryptography algorithms, for example exploring whether to implement constant timing strategies for extra security, or to choose look-up tables versus on-the-fly computations. Each choice has a different implication for the robustness of the security implementation.

Security robustness: countermeasures

Different implementations of a cryptography algorithm have different levels of security robustness. For example, what happens when an implementation is fed incorrect parameters, such as weak keys? Does the implementation check such corner cases? Will it withstand side-channel attacks made by measuring the timing, power, or electromagnetic emissions of the implementation when running an algorithm?

Countermeasures to these attacks come at a performance, power, and area cost. Knowing whether or not to incur this cost demands an analysis of the kinds of threats the device may face, and the importance of securing it against such attacks. Responses can range from no hardening (the device is only operated in a secure environment by trusted operators), to some hardening (robustness against network-based timing attacks), to extensive hardening (protection against differential power analysis and/or hardware fault-injection attacks).

Timing and power analysis attacks can be countered by preventing information leakage by using implementations whose timing is not dependent on the data being processed, and/or by masking the information by deliberately adding noise.

Measurable parameters

If SoC designers want to assess whether they have chosen the right-sized cryptographic processing solution, they need to choose some measurable parameters with which to compare options. These can include:

Cryptographic performance, usually defined as throughput and measured as bits or bytes per cycle or second, or operations per second.
Latency, the time from starting to process a message to completing it.
CPU utilization during cryptographic operations. This will typically be 100% for a software solution, but less for a hardware solution, especially if a crypto engine is in use and the main CPU can be doing other things, or switch into a low-power mode.
Size of solution. Hardware can be measured by logic and memory gate count or resulting die area. Software can be measured by memory footprint, divided between code size and data size.
Power, usually measured in Watts, or energy measured in Joules. Power usually scales linearly with frequency, so it can be normalized for clock frequency to give a measure of power efficiency in μW/MHz.
Security robustness, which is more difficult to quantify. Some test labs assess how long it takes to hack a solution, given certain resources. Lab testing is often part of formal certification like Common Criteria or the EMV specification. Such certifications can be used as a proxy for security robustness.

The EEMBC IoT SecureMark

The Embedded Microprocessor Benchmark Consortium (EEMBC) develops benchmarks such as CoreMark, the industry standard for embedded microprocessor performance. It has now introduced a set of Internet of Things benchmarks: IoTMark and SecureMark.

SecureMark provides a framework for benchmarking the efficiency of cryptographic processing solutions, which supports different profiles for different application domains.

SecureMark-TLS models the cryptographic operations required for the Transport Layer Security (TLS) protocol that is used for secure internet communication. TLS supports various cryptographic approaches to authenticating the communicating parties and ensuring the privacy and integrity of messages exchanged between them. The IoT security working group of EEMBC chose a set of cryptography algorithms and parameters appropriate for IoT edge devices for its SecureMark-TLS benchmark.

The resulting system-level benchmark measures the energy used and performance of the cryptographic operations necessary to do the TLS operations for IoT devices.

The SecureMark-TLS test framework connects the device under test (DUT), such as an ARC processor-based IoT development kit, to a testbed that has two hardware boards and two software components. An I/O manager board sends commands over a UART to the DUT and receives results back to check whether the operations have been performed correctly. An energy monitor board provides power to the DUT and measures its energy consumption. A timestamp I/O pin, toggled by the DUT, marks the beginning and end of the operations being measured.

The SecureMark-TLS benchmarking software consists of a host PC application and embedded DUT software. The host application drives the execution of the benchmark by using the test-bed hardware boards to send commands to the DUT to perform cryptographic operations. It receives the results of the operations for checking, and it obtains power and timing measurements from the energy monitor.

The EEMBC test-harness software that runs on the DUT toggles the timestamp I/O to indicate when energy measurements should be started and stopped, and uses an API to tell the DUT to do various benchmark cryptographic operations.

Example benchmark results

Synopsys has experimented using the beta release of SecureMark-TLS, as shown in Figure 3. It used the mbed TLS reference implementation from EEMBC, and the DWC Cryptography Software implementation, accelerated using the CryptoPack processor extensions for the ARC EM processor in the DesignWare ARC EM Starter Kit.

No URL for image

The graph shows the results measured in time units, normalized to the smallest number (the 23 byte SHA256 hash operation on the ARC EM processor using CryptoPack acceleration). Lower numbers are better.

The benchmark shows that symmetric crypto and hashing run much faster than asymmetric shared-secret, sign, and verify computations. It also shows that CryptoPack acceleration speeds up the symmetric and hash (AES and SHA) processes between 4x and 7x, depending on the message size. The speedup for asymmetric ECDH and ECDSA operations is smaller, at up to 40%. CryptoPack was designed to add only 10% to 20% hardware area to a small ARC EM processor, and the large multipliers required for further speedup of the asymmetric operations would require more area.

Conclusions

SoC designers have to make nuanced tradeoffs between the cost, performance, energy consumption and security of their designs, especially for IoT devices. Most of the cost of security comes from implementing the necessary cryptographic algorithms. This can be done in hardware, software, or a mix of the two, and making effective comparisons among these approaches demands a consistent set of measurements.

The EEMBC has developed SecureMark as a tool for comparing the efficiency of cryptographic implementations. In turn, SecureMark-TLS focuses on implementations of the TLS protocol in IoT devices. Using this benchmark to compare a pure software approach with a hardware-accelerated software solution that uses CryptoPack extensions for the ARC EM processor showed a 4x to 7x performance improvement, at a cost of a 10% to 20% area increase when implementing the AES and SHA algorithms.

Further information

Download the Complete White Paper: Right-Sizing Your Cryptographic Processing Solution
For more benchmark details and cryptography implementations, visit the EEMBC website.
Download White Paper: Securing the Internet of Things Using Hardware Rooted Processor Security – An Architect’s Guide

Author

Ruud Derwig

Ruud Derwig has more than 20 years of experience with software and system architectures for embedded systems. Key areas of expertise include (real-time, multi-core) operating systems, media processing, component based architectures, and security. He holds a master’s degree in computing science and a professional doctorate in engineering. Derwig started his career at Philips Corporate Research, worked as a software technology competence manager at NXP Semiconductors, and is currently a software and systems architect at Synopsys.

Company info

Synopsys Corporate Headquarters

690 East Middlefield Road

Mountain View, CA 94043

(650) 584-5000

(800) 541-7737

www.synopsys.com