
EE Times-India | eetindia.com Copyright 2014 eMedia Asia Ltd.


Engineering embedded software: Basic C
techniques

Here are practical guidelines on how to use a compiler to get better performance out
of the existing C code in your embedded application.

By Robert Oshana
Director of Global Software R&D for Digital Networking
Freescale Semiconductor

and

Mark Kraeling
Product Manager
General Electric


An essential first step in optimising your embedded system software for performance takes place before the
optimisation process begins: confirming functional accuracy. In the case of standards-based code (e.g., a voice or
video coder), there may be reference vectors already available. If not, then at least some basic tests should be
written so that a baseline is obtained before optimisation. This makes it easy to identify when an error has been
introduced during optimisation, whether by incorrect code changes made by the programmer or by overly
aggressive optimisation by the compiler. Once tests are in place, optimisation can begin. Figure 1 shows the basic
optimisation process.


Figure 1: Basic flow of optimisation process.

It's also important to understand the features of the development tools, as they provide many useful, time-saving
capabilities. Modern compilers perform increasingly well on embedded software, reducing the development time
required. Linkers, debuggers and other components of the tool chain have useful code build and debugging
features, but in this chapter we will focus only on the compiler.

Compiler optimisation
From the compiler perspective, there are two basic ways of compiling an application: traditional compilation or
global (cross-file) compilation. In traditional compilation, each source file is compiled separately and the
generated objects are then linked together. In global optimisation, each C file is preprocessed and passed to the
optimiser in one file. This enables greater (inter-procedural) optimisations to be made, since the compiler has
complete visibility of the program and does not have to make conservative assumptions about external functions
and references.
Global optimisation does have drawbacks, however. Programs compiled this way take longer to compile and are
harder to debug, because the compiler may remove function boundaries and eliminate variables. In the event of a
compiler bug, the problem is also more difficult to isolate and work around when the code is built globally.
Figure 2 shows the compilation flow for each approach.
Basic compiler configuration. Before building for the first time, some basic configuration is necessary. The
development tools may come with project stationery that has the basic options configured, but if not, these items
should be checked:
Target architecture: specifying the correct target architecture allows the best code to be generated.
Endianness: the vendor may sell silicon with only one endianness, or the silicon may be configurable. There
will likely be a default option.
Memory model: different processors may have options for different memory model configurations.
Initial optimisation level: it's best to disable optimisations initially.


Enabling optimisations. Optimisations may be disabled by default when no optimisation level is specified and either
new project stationery is created or code is built on the command line. Such code is suited to debugging only:
with optimisations disabled, all variables are written to and read back from the stack, enabling the programmer to
modify the value of any variable via the debugger when stopped. The code is inefficient and should not be used in
production.


Figure 2: Traditional (on left) versus global (on right) compilation.

The levels of optimisation available to the programmer will vary from vendor to vendor, but there are typically four
levels (e.g., from zero to three), with three producing the most optimised code (table 1). With optimisations turned
off, debugging will be simpler because many debuggers have a hard time with optimised and out-of-order scheduled
code, but the code will obviously be much slower (and larger). As the level of optimisation increases, more and
more compiler features will be activated and compilation time will be longer.


Table 1: Example optimisation levels for an embedded optimising compiler.

Note that typically optimisation levels can be applied at the project, module, and function level by using pragmas,
allowing different functions to be compiled at different levels of optimisation.
In addition, there will typically be an option to build for size, which can be specified at any optimisation level. In
practice, a few optimisation levels are most often used: O3 (optimise fully for speed) and O3Os (optimise for size).
In a typical application, critical code is optimised for speed and the bulk of the code may be optimised for size.
Many development environments have a profiler, which enables the programmer to analyse where cycles are spent.
These are valuable tools and should be used to find the critical areas. The function profiler works in the IDE and also
with the command line simulator.
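Per-function optimisation levels are selected with toolchain-specific pragmas. As a minimal sketch, the example below uses GCC-style pragmas; an embedded vendor's compiler will use its own pragma names, but the idea of compiling hot code for speed and cold code for size is the same:

```c
#include <stdio.h>

/* Hot path: request full speed optimisation for this function only
   (GCC syntax; other toolchains provide equivalent pragmas). */
#pragma GCC push_options
#pragma GCC optimize("O3")
int dot16(const short *a, const short *b, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
#pragma GCC pop_options

/* Cold path: optimise this function for size instead. */
#pragma GCC push_options
#pragma GCC optimize("Os")
void report(int value)
{
    printf("result = %d\n", value);
}
#pragma GCC pop_options
```

A profiler run identifies which functions belong in which category; only the few cycle-critical routines need the speed treatment.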
Understanding the embedded architecture. Before writing code for an embedded processor, it's important to assess
the architecture itself and understand the resources and capabilities available. Modern embedded architectures
have many features to maximise throughput. Table 2 shows some features that should be understood and questions
the programmer should ask.


Table 2: Embedded architectural features.

Basic C optimisation techniques
Following are some of the basic C optimisation techniques that benefit code written for all embedded
processors. The central ideas are to ensure that the compiler leverages all features of the architecture and to
communicate to the compiler additional information about the program that is not normally expressed in C.
Choosing the right data types. It's important to learn the sizes of the various types on the core before starting to
write code. A compiler is required to support all the standard types, but there may be performance implications
and reasons to choose one type over another.


Figure 3: Simple FIR filter with intrinsics.



For example, a processor may not support a 32-bit multiplication, so using a 32-bit type in a multiply causes the
compiler to generate a sequence of instructions rather than a single one. If 32-bit precision is not needed, a 16-bit
type is the better choice. Similarly, using a 64-bit type on a processor which does not natively support it results in
64-bit arithmetic constructed from 32-bit operations.
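The fixed-width types from <stdint.h> make the intended precision explicit, so the choice is visible in the code rather than buried in a platform-dependent int. A minimal sketch of keeping a multiply on the fast 16x16 path:

```c
#include <stdint.h>

/* On a core with a 16-bit multiplier, a 16x16 -> 32-bit product is a
   single instruction; a full 32x32 multiply must be synthesised from
   several operations. Using int16_t inputs keeps the fast path. */
static inline int32_t mul16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;  /* widen first to keep the full product */
}
```

Using int16_t here documents that 16-bit precision is sufficient, and lets the compiler select the single-instruction multiply where the hardware has one.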
Use of intrinsics in embedded design. Intrinsic functions, or intrinsics for short, are a way to express operations not
possible or convenient to express in C, or target-specific features (table 3). Intrinsics in combination with custom
data types can allow the use of non-standard data sizes or types. They can also be used to get to application-specific
instructions (e.g., Viterbi or video instructions) which cannot be automatically generated from ANSI C by the
compiler. They are used like function calls but the compiler will replace them with the intended instruction or
sequence of instructions. There is no calling overhead.


Table 3: Example intrinsic.

Some examples of features accessible via intrinsics are:
saturation
fractional types
disabling/enabling interrupts.

For example, an FIR filter can be rewritten to use intrinsics and therefore to specify processor operations natively
(figure 3). In this case, simply replacing the multiply and add operations with the intrinsic L_mac (for long multiply-
accumulate) replaces two operations with one and adds the saturation function to ensure that DSP arithmetic is
handled properly.
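Since L_mac itself is target-specific, a portable C sketch of its semantics helps show what the single instruction buys. The version below assumes the common Q15 fractional convention, where the product is doubled and the 32-bit accumulator saturates rather than wrapping:

```c
#include <stdint.h>

/* Portable sketch of a saturating multiply-accumulate: on a DSP, the
   L_mac intrinsic maps all of this to one instruction. */
static int32_t L_mac(int32_t acc, int16_t a, int16_t b)
{
    /* Widen before doubling so the intermediate cannot overflow. */
    int64_t sum = (int64_t)acc + (((int64_t)a * b) << 1);
    if (sum > INT32_MAX) return INT32_MAX;  /* saturate high */
    if (sum < INT32_MIN) return INT32_MIN;  /* saturate low  */
    return (int32_t)sum;
}

/* FIR inner loop using the helper, as in figure 3. */
static int32_t fir(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = L_mac(acc, x[i], h[i]);
    return acc;
}
```

On the target, the intrinsic replaces the widen-multiply-shift-compare sequence with one multiply-accumulate that saturates in hardware.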

Figure 4: Configuration of calling conventions.


Function calling conventions. Each processor or platform has its own calling conventions. Some are stack-based,
others register-based, or a combination of both. Typically, the default calling convention can be overridden,
which is useful for functions unsuited to the default, such as those with many arguments, for which the default
convention may be inefficient.
The main advantage of changing a calling convention is the ability to pass more arguments in registers rather than
on the stack. For example, on some embedded processors, custom calling conventions can be specified for any
function through an application configuration file and pragmas. It's a two-step process.
First, custom calling conventions are defined in the application configuration file, a file which is included in the
compilation (figure 4).


Figure 5: Case Definition for Femto application.

Second, they are invoked via pragma where needed; the rest of the project continues to use the default calling
convention. In the example in figures 6 and 7, the custom convention is invoked for the function TestCallingConvention.

char TestCallingConvention (int a, int b, int c, char d, short e)
{
    return a + b + c + d + e;
}
#pragma call_conv TestCallingConvention mycall
Figure 6: Invoking calling conventions.

Pointers and memory access
Ensuring alignment. Some embedded processors such as digital signal processors (DSPs) support loading multiple
data values across the buses, which is necessary to keep the arithmetic functional units busy. These moves are
called multiple data moves (not to be confused with packed or vector moves); they move adjacent values in
memory into different registers. Many compiler optimisations depend on these multiple-register moves because of
the volume of data needed to keep all the functional units fed.
Typically, however, a compiler aligns variables in memory to their access width. For example, an array of short
(16-bit) data is aligned to 16 bits. To leverage multiple data moves, however, the data must be aligned more
strictly: to load two 16-bit values at once, for example, the data must be aligned to 32 bits.
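One portable way to request the stricter alignment is the C11 alignas specifier; many embedded toolchains also accept vendor attributes such as __attribute__((aligned(n))). The 8-byte figure below is illustrative; the right value depends on the target's widest load:

```c
#include <stdalign.h>
#include <stdint.h>

/* A short array is only 16-bit aligned by default; forcing 8-byte
   alignment lets the compiler use wider, multi-value loads on
   targets that support them. */
alignas(8) static int16_t coeffs[8] = {1, 2, 3, 4, 5, 6, 7, 8};
```

With the array aligned this way, the compiler is free to fetch several adjacent coefficients in a single wide move rather than one 16-bit load at a time.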





Figure 7: Generated code for function with modified calling conventions.

Restrict and pointer aliasing. When two pointers are used in the same piece of code, the compiler must assume
they could point to the same memory location (alias) unless told otherwise. When the compiler knows the pointers
do not alias, it can schedule accesses to the memory they point to in parallel, greatly improving performance. This
can be communicated to the compiler in one of two ways: using the restrict keyword, or by informing the compiler
that no pointers alias anywhere in the program (figure 8).


Figure 8: Illustration of pointer aliasing.



Table 4: Example loop before restrict added to parameters (DSP code).


Table 5: Example loop after restrict added to parameters.

The restrict keyword is a type qualifier that can be applied to pointers, references, and arrays (tables 4 and 5). Its
use represents a guarantee by the programmer that within the scope of the pointer declaration, the object pointed
to can be accessed only by that pointer. A violation of this guarantee can produce undefined results.
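A minimal sketch of the qualifier in use (the function and buffer names are illustrative):

```c
/* restrict (C99) promises the compiler that, within this function,
   out and in never overlap, so loads from in[] and stores to out[]
   can be scheduled in parallel without reloading out[i]. */
void scale_add(int n, float * restrict out,
               const float * restrict in, float k)
{
    for (int i = 0; i < n; i++)
        out[i] += k * in[i];
}
```

If a caller ever passes overlapping buffers to such a function, the guarantee is violated and the results are undefined, which is why restrict must be added only where non-overlap is genuinely known.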



About the author
Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time
systems for the defence and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior
Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is
also an international speaker. He has over 200 presentations and publications in various technology fields and has
written several books on embedded software technology. He is an adjunct professor at Southern Methodist
University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical
Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.
Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced
product development in real-time controls, wireless, and communications. He's developed embedded software for
the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA
from Johns Hopkins, and an MSE from Arizona State.



Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from
Software Engineering for Embedded Systems, by Robert Oshana and Mark Kraeling.
