blob: b433fdafd8e98710aaeee5cf1af5dd1e7a11e950 [file] [log] [blame]
Topic: Marvell-defined options in GCC C/C++ Compiler
Release: 2013Q2 (gcc version 4.6.4)
Date: 2013/06
The document describes the command line options of GCC created by Marvell
in addition to original options.
Default Core Options of Marvell GCC
Marvell-defined GCC Options
Depreciated Marvell-defined GCC Options
Marvell-defined Binutils Features
--------------------------------------------------------------------------------
Default Core Options of Marvell GCC
--------------------------------------------------------------------------------
The following listing is the basic Marvell's CPU cores:
-------------- --------- ------- ---------- ---------------
CPU Cores ISA Thumb VFP VFP+SIMD(NEON)
-mcpu= -march= -mthumb -mfpu= -mfpu=
-------------- --------- ------- ---------- ---------------
marvell-pj1 armv5te THUMB1 vfp N/A
marvell-f armv5te THUMB1 vfp N/A
marvell-fv7 armv7-a THUMB2 vfpv3-d16 N/A
marvell-pj4 armv7-a THUMB2 vfpv3-d16 N/A
marvell-pj4b armv7-a THUMB2 vfpv3 neon
vfpv3-fp16 neon-fp16
marvell-pj4c armv7-a THUMB2 vfpv3 neon
vfpv3-fp16 neon-fp16
-------------- --------- ------- ---------- ---------------
'marvell-pj4c' is only for experiment.
The Marvell GCC uses the following options to select what features the
target supports.
-mcpu=<marvell-...>
Specify the target core with its ISA. It's better to use
'-mcpu=<marvell-...>' instead of '-march=<armv5te/armv7-a>'.
The latter may not known the new features of Marvell cores.
-mfpu=<vfp/vfpv3-d16/vfpv3/vfpv3-fp16/neon/neon-fp16>
Specify the version of VFP or SIMD(NEON).
-mfloat-abi=<soft/softfp/hard>
'soft' float ABI does not generate VFP instructions, and others
do. For function calls, 'softfp' does not pass floating-point
value by VFP registers, but 'hard' may do. Object files
generated with 'soft' and 'softfp' are compatible, but not
with 'hard'.
-mthumb
For ARMv5te, enable THUMB1 features; for ARMv7-a, enable THUMB2.
Note that THUMB1 and VFP cannot co-work, so it is necessary to
use -mfloat-abi=soft to stop generating VFP instruction.
-mwmmxt
If your Marvell Cores support WMMXT feature, this option let
GCC generate the WMMXT instructions. WMMXT and NEON are
incompatible.
It's hard to build the Marvell GCC Toolchains by user himself, and use
the correct options mentioned above. Marvell provides some pre-built toolchains
to support different target features. See the list:
--------------- ------- ----------------- ------------------------------------
Package ISA CPU Core VFP and ABI
--------------- ------- ----------------- ------------------------------------
armv5-*-soft ARMv5te -mcpu=marvell-f -mfloat-abi=soft
armv5-*-softfp ARMv5te -mcpu=marvell-f -mfpu=vfp -mfloat-abi=softfp
armv5-*-hard ARMv5te -mcpu=marvell-f -mfpu=vfp -mfloat-abi=hard
armv7-*-soft ARMv7-a -mcpu=marvell-pj4 -mfloat-abi=soft
armv7-*-softfp ARMv7-a -mcpu=marvell-pj4 -mfpu=vfpv3-d16 -mfloat-abi=softfp
armv7-*-hard ARMv7-a -mcpu=marvell-pj4 -mfpu=vfpv3-d16 -mfloat-abi=hard
--------------- ------- ----------------- ------------------------------------
* Note: For BE32(ARMv5te)/BE8(ARMv7-a) package, the prefix is armeb*-*-*.
The armv5-* packages only provide ARM-code run-time and standard C/C++
libraries. That is your program may be built as THUMB1-code with '-mthumb'
options, but linked to the ARM-code libraries.
The armv7-* packages provide the multilib mechanism and provide ARM-code
and THUMB2-code run-time and standard C/C++ libraries. Use '-mthumb' option
to link to the THUMB2-code libraries.
If you are not sure the default core options, you can use 'gcc -v' to
check the setting of '--with-cpu', '--with-fpu' and '--with-float'. For
example:
$ arm-marvell-eabi-gcc -v
Using built-in specs.
Target: arm-marvell-eabi
Configured with: ... --with-cpu=marvell-f
--with-fpu=vfp
--with-float=softfp ...
If your Marvell core is not matched, you should select the best similar
package according to ARMv5te/ARMv7-a and soft/softfp/hard features. And then
use '-mcpu=' and '-mfpu=' options to compile the programs. For example:
$ arm-marvell-eabi-gcc ... -mcpu=marvell-pj4b -mfpu=neon ...
But you should known that about options does not affect the compiled
libraries, unless the whole toolchains are re-configured and re-built.
--------------------------------------------------------------------------------
Marvell-defined GCC Options
--------------------------------------------------------------------------------
-mmarvell-div (default: off)
Generate Marvell-defined hardware integer division instructions supported
by some Marvell cores instead of calling run-time suport library.
For example, r1 = r2 / r3, the following co-processor instructions would be
emitted according to singed or unsigned integer:
mrc p6, 1, r1, c2, c3, 4 @ signed div
mrc p6, 1, r1, c2, c3, 0 @ unsigned div
Above codes are incompatiable with not-supporting chips and non-Marvell ARM
cores.
This option makes the predefined macro "__hw_int_div__" available. This
may be helpful for assembler coding. Assembler can use SDIV/UDIV pseudo
code instead of MRC code if the under gas is supported by Marvell. User
should note that the predefined macro "__hw_int_div__" is for Marvell-
defined hardware integer division instructions, but "__ARM_ARCH_EXT_IDIV__"
for ARM-defined ones.
-mldrd-strd (default: on for LDRD/STRD is supported)
Turn on to support ldrd/strd code generation. It works iff architecture is
ARMv6t2/ARMv7 in ARM/THUMB mode or ARMv5te in ARM mode.
The old ABI -mabi=apcs-gnu/atpcs does not require double words at 8-bytes
alignment, so -mldrd-strd is useless.
The AAPCS ABI -mabi=aapcs/aapcs-linux always uses LDRD/STRD to access
8-byte-aligned double-word data (e.g. 'long long', 'double', or 'struct'
contains them). But for some stange reasons that the data are not aligned
on 8-bytes, you may wish to use -mno-ldrd-strd to avoid generating
LDRD/STRD instructions.
If this option is turned off, the predefined macro "__no_ldrd_strd__" is
available.
If you need to trun it off to solve the non-alignment access problem, you
may better check your whole program. Maybe the initial code (before 'main'
function) does not adjust the 'sp' register to the 8-byte boundary. Or
maybe the data layout does not follow the AAPCS ABI.
-mtune-ldrd (default: off)
Tune to generate LDRD/STRD (double word load-stores) instructions over
LDM/STM. For some Marvell's chips, the performacne of LDRD/STRD is better
than LDM/STM. Becasue of the requirement of 8-byte alignment for LDRD/STRD,
the compiler may rearrange data aligned at most on 8-byte boundary in order
to promote the probability of using them.
Example 1:
struct S {
char a[16];
};
struct S a, b;
void foo() { a = b; }
Using -mtune-ldrd, the structure 'a' and 'b' are placed on the 8-byte
boundary, so the compiler can generate LDRD/STRD insructions instead of
LDM/STM to do the memory block coping.
Example 2:
memcpy(q, "01234567", 8);
The const string literal, i.e. "01234567", may be forced to aling 8-byte,
so the memory routines, e.g. memcpy, may do a better job by using LDRD/STRD.
Moreover, the compiler may inline or expand these routines and use
LDRD/STRD.
Example 3:
long long q;
q = 0x3456789034567890LL;
The const long long literal, i.e. 0x3456789034567890LL, is divided into two
words, and use 2 LDR instructions to load them. Using -mtune-ldrd, the
two-word long long literal is placed on the 8-byte boundary and 1 LDRD
instruction is generated to load.
Because of larger alignment and more LDRD/STRD code numbers, this option
may waste code and data space, so is turned off if using -Os.
Note that some data may be adjusted to align on 8-byte boundary. If you
want to keep the data's natural alignment, e.g. 4-byte alignment, you
should do not use this option. For building Linux or uBoot examples,
turning on may corrupt the layout of some arrays which natural
4-byte-aligned items come from different objects and are merged by the
static linker (i.e ld).
-mbxret (default: on)
Try to use the 'BX LR' instruction for function returns as possible.
Marvell cores has tuned it with the hardware return-stack mechanism.
For example, the original output codes:
foo:
@ function prologue
stmfd sp!, {r3, lr}
...
@ function epilogue
ldmfd sp!, {r3, pc}
would becomes:
foo:
@ function prologue
stmfd sp!, {r3, lr}
...
@ function epilogue
ldmfd sp!, {r3, lr}
bx lr
This option may increase the code size. Use -mno-bxret to turn it off.
-mcond-exec (default: on)
Enable conditional execution (CE) instructions, overriding processor
specific tune settings, while -mno-cond-exec does the reverse and forces
all CE to be avoided.
If this option is turned off, the predefined macro "__no_cond_exec__" is
available.
Some Marvell cores may get benefits from the pipeline scheduling by
avoiding conditional execution.
-mwmmxt (default: off)
Enable IWMMXT feature in Marvell Core. You'd better use this option
instead of -mcpu=xscale/iwmmxt/iwmmxt2.
If this option is turned on, the predefined macro "__IWMMXT__"、
__IWMMXT2__ and "__ARM_WMMX" are available.
-msched1=ARM_CORE (default: same as -mcpu=)
-msched2=ARM_CORE (default: same as -mcpu=)
Specify pipeline description in pipeline scheduling staget before (
-msched1=) and after (-msched2=) register allocation. The valid ARM_CORE
name is the same with '-mcpu=ARM_CORE' option.
-mthumb-cbz (default: off)
Generate thumb instructions CBZ/CBNZ if possible. So far Marvell chips
don't prefer them.
-mpromote-inline-asm-input-int (default: on)
Promote the small integer type, e.g. short, of input operands of the inline
asm to the full word-size integer.
__asm__ ("..." : ... : "r" (short_var) : ... );
would be converted to:
__asm__ ("..." : ... : "r" ((int)short_var) : ... );
The default is true. If turning off, the compiler may do not zero- nor
signed-extend the small int to the full one in the general integer
register, and the high-part bytes of that register may contain garbage
value.
-mmemcpy-ninsn=N (default: 4. Internal tunning option.)
Specify the threshold of inlined memcpy instruction number. If exceeding
this threshold, memcpy will take a library call.
This internal tunning option affects the MOVE_RATIO value used in the
Scalar Reduction of Aggregates (SRA) pass. For example:
struct S s, *p;
s = *p; // structure copy even though some fields are not used!
...
s.f2++;
If the threshold is higher, the compiler may not use memcpy call to do the
structure assignment. Instead it may be tuned to the following code:
s.f1 = p->f1; // maybe redundant and can be removed latter.
s.f2 = p->f2;
...
s.f2++;
The member-wise assignment may be good sometime, e.g. simple structure and
removing the redundant assignments, but the higher the threshold is, the
code size and register pressure also may become higher.
This option may waste code and data space, so is turned off if using -Os.
-freduce-passed-addressof (default: off)
Try to reduce the number of passing &var argument for improving points-to
analysis. For example:
foo(&local_var);
If foo does not escape the local_var outside the current function, the
compiler may convert it under safety to:
tmp = local_var;
foo(&tmp);
local_var = tmp;
If being lucky, some optimizers may improve the useage of that local
variable because of not escaping.
-fargument-restrict (default: off)
Force pointer arguments to be qualified with restrict keyword. For
example:
foo(int* p1, int* p2);
is converted to:
foo(int* restrict p1, int* restrict p2);
So that the compiler can assume the memory blocks pointed to by p1 and
p2 are not overlapped and do more advance optimization.
This option only works for the C language. This is a dangerous option
and works on the whole compilation unit. Users should use it carefully.
-floop-post-opt (default: off)
Perform some optimizations immediately after RTL-level loop unrolling, such
as address forwarding and dead-code elimination. From the experience of
clinpack benchmark, moving some optimization passes early can reduce
the data dependences and improve the instruction scheduling finally.
-falign-arrays=N (default: 0)
Align the start of non-local arrays to the next power-of-two greater than or
equal to the maximum value of N and their natural alignment. For example:
char a1[11];
static char a2[22] = { 'a', 'b', ... };
void foo() { char a3[33]; ... }
If N is 17 ~ 32, then a1 and a2 are aligned to 32-byte boundary, but the
local stack variable, a3, is not affected by it.
Sometime it can improve the cache operation performance. E.g. benchmark
STREAM with N=32 may be improved on some Marvell's platforms.
This option may waste data space, so is turned off if using -Os.
--------------------------------------------------------------------------------
Depreciated Marvell-defined GCC Options
--------------------------------------------------------------------------------
The following options are depreciated and will become obsolete (nothing) in
the next Marvell GCC toolchains (mgcc48x).
-miwmmxt-use-realign (default: off)
Enable IWMMXT Vectorizer using WALIGNR & WALIGNI to support unaligned
load.
-miwmmxt-use-aggressive-realign (default: off)
Use optimized realign scheme for IWMMXT vectorizer.
-mneon-no-always-misalign (default: on)
Enable Vectorizer do loop peeling for vectorized loop to avoid unaligned
load/store.
-mfused-mac (default: off, experiment)
For VFPv4 or above, generate floating-point fused multiply-add instructions
supported by some Marvell cores.
--------------------------------------------------------------------------------
Marvell-defined Binutils Features
--------------------------------------------------------------------------------
. Marvell-defined ld option
--skip-vendor-check
This option will skip vendor information check, so you
can link libraries built by different toolchain vendors, like
ARM RVCT, successfully.
. Marvell-defined opcodes
The following testsuite give you the Marvell-defined opcode list:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ Tests for Marvell ALU extersion instructions
.text
.syntax unified
.thumb
sdiv r0, r1, r2
mrc p6, 1, r0, cr1, cr2, 4
sdiv r11, r7, r3
mrc p6, 1, r11, cr7, cr3, 4
.arm
sdiv r0, r1, r2
mrc p6, 1, r0, cr1, cr2, 4
sdiv r11, r7, r3
mrc p6, 1, r11, cr7, cr3, 4
.thumb
udiv r1, r2, r3
mrc p6, 1, r1, cr2, cr3, 0
udiv r10, r6, r2
mrc p6, 1, r10, cr6, cr2, 0
.arm
udiv r1, r2, r3
mrc p6, 1, r1, cr2, cr3, 0
udiv r10, r6, r2
mrc p6, 1, r10, cr6, cr2, 0
.thumb
cnt32 r11, r12
mrc p6,2,r11, cr12,cr0,0
cnt32 r1, r2
mrc p6,2,r1, cr2,cr0,0
cnt32 r3, r7
mrc p6,2,r3, cr7,cr0,0
.arm
cnt32 r11, r12
mrc p6,2,r11, cr12,cr0,0
cnt32 r1, r2
mrc p6,2,r1, cr2,cr0,0
cnt32 r3, r7
mrc p6,2,r3, cr7,cr0,0
.thumb
bitcnt2 r11, r12, r14
mrc p6,2,r11, cr12,cr14,1
bitcnt2 r1, r2, r0
mrc p6,2,r1, cr2,cr0,1
bitcnt2 r3, r7, r7
mrc p6,2,r3, cr7,cr7,1
and3 r0, r14, r8, r9, 7
mrc p6, 3, r0, cr14, cr8, 7
and3 r4, r2, r12, r13, 1
mrc p6, 3, r4, cr2, cr12, 1
and3 r14, r7, r0, r1, 3
mrc p6, 3, r14, cr7, cr0, 3
and3one r0, r14, r8, r9, 7
mrc p6, 4, r0, cr14, cr8, 7
and3one r4, r2, r12, r13, 1
mrc p6, 4, r4, cr2, cr12, 1
and3one r14, r7, r0, r1, 3
mrc p6, 4, r14, cr7, cr0, 3
mrc p6, 5, r4, cr2, cr3, 7
qaddsub r4, r7, r2, r3
mrc p6, 6, r4, cr2, cr3, 7
qdaddsub r4, r7, r2, r3
cmp4x4 r14, r7, r7
mrc p6, 0, r14, cr7, cr7, 0
cmp4x4 r7, r1, r10
mrc p6, 0, r7, cr1, cr10, 0
cmp4x4 r2, r9, r2
mrc p6, 0, r2, cr9, cr2, 0
cmp4x4 r3, r2, r15
mrc p6, 0, r3, cr2, cr15, 0
cmp4x4s r7, r1, r10
mrc p6, 0, r7, cr1, cr10, 1
cmp1x4 r2, r9, r2
mrc p6, 0, r2, cr9, cr2, 2
cmp1x4s r3, r2, r15
mrc p6, 0, r3, cr2, cr15, 3
cmp1x3 r2, r9, r2
mrc p6, 0, r2, cr9, cr2, 4
cmp1x3s r3, r2, r15
mrc p6, 0, r3, cr2, cr15, 5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~