Going beyond traditional compiler optimizations with SOMNIUM DRT

Posted on June 30, 2016

This article discusses traditional code optimization techniques and describes how SOMNIUM DRT's patented optimizations extend them to provide smaller code, higher performance and lower energy consumption.

(This article is based on our webinar on optimization: https://www.youtube.com/watch?v=r1SXYph9vKw)


Compiler optimization

The current state of the art in compiler technology means that compilers do a very good job of optimizing code. This generally makes hand-optimizing the source code fairly pointless - the compiler will usually generate much the same code even if you reorder expressions and statements (assuming the code is semantically equivalent).

However, there are hints you can give the compiler which can improve the optimization by giving it information that it may not be able to deduce from analysing the source code - particularly because the compiler only looks at one compilation unit at a time.

These include annotations such as marking a variable as const, or marking a function as "pure": one that has no side effects and returns a value that depends only on its arguments (GCC also allows a pure function to read global memory).

const int n = 42;

int f(int x) __attribute__((pure));   /* return type and parameter are illustrative */
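
As a minimal sketch of why the pure annotation helps, consider the following. The function names checksum and scaled_checksum are hypothetical, and the attribute syntax assumes a GNU-compatible compiler. Because checksum is declared pure, the compiler is allowed (but not obliged) to call it once and reuse the result for both uses.

```c
#include <stddef.h>

/* Hypothetical helper: no side effects, and the result depends only on
   its arguments and the memory they point to. */
__attribute__((pure))
int checksum(const unsigned char *buf, size_t len)
{
    int sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* The two calls have identical arguments, so a compiler that trusts the
   pure attribute may evaluate checksum() once and reuse the value. */
int scaled_checksum(const unsigned char *buf, size_t len)
{
    return checksum(buf, len) * 2 + checksum(buf, len);
}
```

The attribute matters most when the definition of checksum lives in a different compilation unit, where the compiler cannot inspect the body and has to rely on the declaration.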

On the other hand, you might need to tell the compiler *not* to perform certain optimizations. A common case is marking a variable as volatile so that every read or write in the source code is preserved in the compiled code, in order relative to other volatile accesses. Also, for embedded systems with limited memory, it is useful to be able to tell the compiler not to inline certain functions, to avoid excessive code size.

volatile int p;

void g(void) __attribute__((noinline));   /* signature is illustrative */
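
The following sketch shows both annotations in context. The names data_ready, wait_for_data and report_error are hypothetical; in a real system the flag would typically be set by an interrupt handler or mapped to a device register.

```c
#include <stdint.h>

/* Hypothetical status flag, assumed to be changed outside normal
   program flow (for example by an interrupt handler). */
volatile uint32_t data_ready;

/* Polls the flag at most max_polls times. Because data_ready is
   volatile, the compiler must reload it on every iteration instead of
   hoisting the read out of the loop. Returns 1 if the flag was seen. */
int wait_for_data(uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if (data_ready)
            return 1;
    }
    return 0;
}

/* Rarely used error path: noinline asks the compiler to keep a single
   out-of-line copy rather than duplicating the body at each call site. */
__attribute__((noinline))
int report_error(int code)
{
    return -code;
}
```

Without volatile, an optimizing compiler could legally read data_ready once and loop on the cached value.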

Although modern compilers do an excellent job, there are two major limitations on what they can do. The first is that the compiler only looks at the instruction sequences within each compilation unit. The second is that the compiler only knows about the CPU (for example, instruction sizes and timings) and has no information about the speed and sizes of different types of memory, the presence of caches, etc.

This places a limit on how much "real world" optimization is possible using existing compiler technology - a system-level view is required to fully optimize the code.

SOMNIUM DRT

SOMNIUM DRT is a set of development tools for ARM Cortex-M based devices such as the Kinetis and LPC devices from NXP and the STM32 devices from STMicroelectronics. It is fully compatible with industry-standard tools such as the GNU toolchain and Eclipse IDE. DRT uses our patented techniques to produce highly optimized code by exploiting information about the embedded processor and the memory system to deliver improved performance, smaller code size and reduced energy use.

Optimizations

The DRT toolchain extends the existing compiler infrastructure and optimizations to tune them for the specific device you are using.

The DRT resequencing linker looks at the whole program generated by the compiler and, using knowledge of the system architecture (including the memory system), applies a number of optimizations that the compiler is not able to do. These include replacing instruction sequences with more efficient ones, removing redundant instructions, reordering functions to make better use of caches, and so on.
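
To give a feel for what a whole-program layout decision can look like, here is a deliberately simple sketch that places the most frequently called functions first so that they sit close together in memory. It is an illustration only and not DRT's algorithm; struct func_info, layout_hot_first and the call_count field (assumed to come from profiling or a static estimate) are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-function record, as a linker might hold it. */
struct func_info {
    const char *name;
    uint32_t size_bytes;
    uint32_t call_count;   /* assumed input: profile or static estimate */
    uint32_t address;      /* output: filled in by layout_hot_first() */
};

/* qsort comparator: higher call counts sort first. */
static int by_call_count_desc(const void *a, const void *b)
{
    const struct func_info *fa = a;
    const struct func_info *fb = b;
    if (fa->call_count != fb->call_count)
        return fa->call_count < fb->call_count ? 1 : -1;
    return 0;
}

/* Sorts functions hottest-first and assigns consecutive addresses
   starting at base, rounding each size up to a 4-byte boundary. */
void layout_hot_first(struct func_info *funcs, size_t n, uint32_t base)
{
    qsort(funcs, n, sizeof funcs[0], by_call_count_desc);
    uint32_t addr = base;
    for (size_t i = 0; i < n; i++) {
        funcs[i].address = addr;
        addr += (funcs[i].size_bytes + 3u) & ~3u;
    }
}
```

A compiler cannot make this kind of decision because it sees one compilation unit at a time; only a tool that sees every function in the final image can choose their relative order.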

The exact set of optimizations performed by the linker depends on the optimization level you specify (for example, -O3 to optimize for speed or -Os for size).

The resequencing linker provides improvements in code size, performance AND energy consumption - all at the same time. This requires no changes to your source code or development process.

Several of the optimizations remove unused or redundant instruction sequences. Others replace instructions or instruction sequences with better ones. Some examples of the optimizations performed by the linker are:

  • Changing PC relative load instructions (ldr) to adr instructions to allow the associated data to be removed
  • Optimizing the alignment of branch target addresses
  • Finding duplicate constants in text sections and merging them
  • Replacing some 32-bit instructions with shorter equivalents
  • Reordering functions to make the best use of cache
  • Performing advanced data flow analysis and further optimizations of register usage and instruction selection
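
As an illustration of the duplicate-constant item in the list above, the sketch below merges repeated 32-bit values in a pool of literals and records where each original entry ended up. It is a toy model under stated assumptions, not DRT's implementation: the function name merge_constants and the flat-array representation of the pool are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compacts pool[0..n) in place so that each distinct value appears once.
   remap[i] receives the new index of original entry i, which is what a
   linker would need in order to redirect the loads that referenced it.
   Returns the number of unique constants kept. */
size_t merge_constants(uint32_t *pool, size_t n, size_t *remap)
{
    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t value = pool[i];   /* read before the slot can be reused */
        size_t j = 0;
        while (j < unique && pool[j] != value)
            j++;
        if (j == unique)
            pool[unique++] = value; /* first occurrence: keep it */
        remap[i] = j;
    }
    return unique;
}
```

Each merged entry saves four bytes of ROM in this model. The linear search keeps the sketch short; a real tool handling large pools would more likely use a hash table.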

DRT also automatically chooses the best compilation options for the target device and optimization level you specify.

More detail can be found in the DRT Reference Manual.

DRT Toolchain & IDE

The DRT toolchain currently supports ARM Cortex-M devices from NXP, STMicroelectronics and Atmel. We are working with other vendors to extend this range.

The DRT toolchain is available in two products: an Atmel Studio plugin that provides just the toolchain (including the resequencing linker and the optimized libraries), and a complete IDE that includes the toolchain plus extra debugging tools such as trace, live expressions and fault diagnosis. The DRT IDE supports multiple vendors and provides extra features for debugging and ease of use. It is currently available for both Windows and Linux. A Mac OS version is coming soon.

Benchmarking

We evaluate the benefits of the DRT toolchain in a number of different ways. These include tests we have generated internally, some based on customer code, standard examples from semiconductor vendors, and a number of industry standard benchmarks such as Nullstone and EEMBC CoreMark.

Run-time library overheads

One interesting test we have done is to compare the size of the code for an "empty" C program - a program newly created in the IDE that contains a minimal main function. This is a useful way of understanding the overheads of the C run-time libraries and startup code. We find that most development systems use between 2KB and 5KB of ROM for this empty program. DRT uses significantly less ROM and/or RAM than any other toolchain.

[Table: empty C program results]

This is particularly significant for devices with very small memory footprints, such as those used in IoT nodes and other embedded systems. With some development tools it would not be possible to use C at all on devices with so little memory.

Our benchmark document shows detailed results for Kinetis, LPC and STM32 examples.

EEMBC CoreMark

SOMNIUM is a member of the EEMBC Automotive Subcommittee and we use their industry standard benchmarks.

We believe that, to be useful, benchmarks should show three dimensions: not just performance, but memory size and energy use as well. The CoreMark benchmarks are great for this.

It's easy to obtain higher performance by unrolling, inlining and specializing functions to the extreme, but real world systems are memory limited, so this approach makes little sense if you want an accurate understanding of real world behaviour. We always measure memory size and performance and, where possible, energy, which we measure to the microjoule using the high accuracy EEMBC EnergyMonitor.
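
One way to make the three-dimensional comparison concrete is to say that a result only counts as a clear win if it is no worse on every axis and better on at least one. The sketch below expresses that rule; struct bench_result, its field names and units, and the function dominates are illustrative assumptions rather than part of any benchmark suite.

```c
#include <stdbool.h>
#include <stdint.h>

/* One benchmark run. Field names and units are illustrative. */
struct bench_result {
    uint32_t code_bytes;    /* memory footprint in bytes */
    uint32_t run_time_us;   /* time to complete the workload */
    uint32_t energy_uj;     /* energy for the workload, in microjoules */
};

/* Returns true if a is no worse than b on all three axes and strictly
   better on at least one (lower is better for every field). */
bool dominates(const struct bench_result *a, const struct bench_result *b)
{
    bool no_worse = a->code_bytes  <= b->code_bytes &&
                    a->run_time_us <= b->run_time_us &&
                    a->energy_uj   <= b->energy_uj;
    bool better   = a->code_bytes  <  b->code_bytes ||
                    a->run_time_us <  b->run_time_us ||
                    a->energy_uj   <  b->energy_uj;
    return no_worse && better;
}
```

Under this rule, a build that gains speed through aggressive unrolling but grows in size does not dominate a smaller, slightly slower build, which is exactly the trade-off a performance-only benchmark hides.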

Here we have a few examples of typical results:

[Figure: CoreMark results graphs]

NXP KL02 devices are quite constrained by their memory size and performance. Using DRT increases performance whilst saving significant amounts of ROM, RAM and energy.

KV10 devices have high performance memory systems, with a 16-entry, 4-way set-associative flash cache. Even with this high performance hardware, DRT still increases KV10 performance and saves energy whilst reducing code size.

We compared against the vanilla GNU tools from Atmel Studio 6 and 7. DRT didn't affect RAM usage, but always generated the smallest, fastest, lowest energy results.

STM32L053 is an ultra-low-power device with a very simple flash buffer, rather than a cache. DRT can significantly improve its performance and energy behaviour.

NXP Attach demo

As an example of vendor code, we can look at the Attach demo from NXP. This uses the NXP Sensor Fusion library to combine the data from various sensors. It uses the NXP eGUI library to display the data in various forms and allow user interaction. So it demonstrates many of the features that might appear in a real embedded system.

Note that libraries such as Sensor Fusion are only available for GNU-compatible toolchains and so are not supported by Keil or IAR.

[Figure: Attach demo results]

Although the savings in this particular example are not huge, they demonstrate an important point: only the DRT toolchain is able to fit the application into the available ROM and RAM. Using the KDS tools would have required a device with more memory, and therefore a more expensive one.

Summary

DRT provides industry-leading optimization resulting in smaller, faster code which uses less energy. It supports the latest C and C++ standards, including C++ exceptions, and is extensively tested with several test and validation suites. It includes ease-of-use enhancements and easier debugging with trace, live expression view and fault diagnosis.

A free trial of DRT is available from the SOMNIUM Portal.

DRT provides automatic import of projects from other Eclipse-based development environments, making it simple to evaluate. Download your free trial today from the SOMNIUM Portal and try it for yourself.

For more information on benchmarking and results, see the white paper on our website.
