blob: dd7514662dd79964150b9140f0727fd3ba3844d4 [file] [log] [blame]
0. Turning on auto vectorization
==================================================
To switch on auto vec for the compiler, the following options should
be given:
-mmrvl-use-iwmmxt -ftree-vectorize -O2 (or -O3)
To view the detailed optimization information, use the additional options:
-ftree-vectorizer-verbose=3
================================================================================
1. Hint: use loop local variable instead of long-live variable to access array
The following code snippet is from jpeg:
>>> for( k=0, i=1; i<17; i++ ){
>>> L=(int)pHuffBits[i-1];
>>> for (j = 1; j <= L; j++){
>>> huffsize[k++] = (Ipp8u) i;
>>> }
>>> }
To make it clearer, rewrite it into:
<A> int L, i, k, j;
<A> for( k=0, i=1; i<17; i++ )
<A> {
<A> L = array2[ i - 1 ];
<A> for( j=1; j<=L; j++ )
<A> {
<A> array[ ++k ] = i;
<A> }
<A> }
To enable auto vectorization, rewrite the above into:
<B> int L, i, k, j;
<B> for( k=0, i=1; i<17; i++ )
<B> {
<B> L = array2[ i - 1 ];
<B> for( j=1; j<=L; j++ )
<B> {
<B> array[ j + k - 1 ] = i;
<B> }
<B> k += L;
<B> }
The difference between the above two codes is that for <A>,
"array[++k]" references "k", which outlives this inner loop (the
inner loop is executed for 16 times from the outer loop, and k is
alive for these times).
While in <B>, "j+k-1" is an expression whose live time is defined by
the inner loop (j is an induction variable, (k-1) is (inner)loop
invariant)
================================================================================
2. Hint: use same data type in a single loop
2.1. For the following codes from jpeg:
>>> for (j=0; j<17; j++) {
>>> pTable->mincodeptr[j] = 0;
>>> pTable->maxcode[j] = 0;
>>> }
>>>
The above 2 assignment statements have different vec types thus 2
different vec factor, however a single loop can only be vectorized
using one vect factor, so the above case cannot be vectorized.
To enable above vectorizatio, you can transform the above into:
>>> for (j=0; j<17; j++) {
>>> pTable->maxcode[j] = 0;
>>> }
>>>
>>> for (j=0; j<17; j++) {
>>> pTable->mincodeptr[j] = 0;
>>> }
2.2. For the following codes from jpeg:
>>> while (p<i) {
>>> pTable->huffVal[p] = pHuffValue[p];
>>> p++;
>>> }
The above "pTable->huffVal[p]" is of type IPP16u while pHuffValue
is of type IPP8u.
So far, for the above example, there is no viable workaround.
================================================================================
3. Hint: use simple object reference
For the following codes
>>> #define FLUSH_STREAM(pStream, fStreamFlush, pStreamHandle, len) \
>>> { \
>>> int byteWritten, ti;\
>>> int availBytes = (pStream)->pBsCurByte - (pStream)->pBsBuffer;\
>>> int minLen = (len)-((pStream)->bsByteLen - ((pStream)->pBsCurByte - (pStream)->pBsBuffer));\
>>> byteWritten = fStreamFlush((pStream)->pBsBuffer, pStreamHandle, availBytes, 0);\
>>> if(minLen>byteWritten){\
>>> return IPP_STATUS_STREAMFLUSH_ERR;\
>>> }\
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {\
>>> (pStream)->pBsBuffer[ti] = (pStream)->pBsBuffer[ti+byteWritten];\
>>> }\
>>> (pStream)->pBsCurByte -= byteWritten;\
>>> }
In the above loops, data reference analysis for
"pStream->pBsBuffer[ti]" will be reported out as "unhandled ref",
thus prevents it being processed by the vectorizer. To get around
this problem, you can rewrite the above as:
>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>> pBuf[ti] = pBuf[ti+byteWritten];
>>> }
(The above case though is still not vectorizable because of cross
loop iteration dependence mentioned below)
================================================================================
4. Hint: use "aligned" attribute for array when possible
For the following simplest code:
>>> extern unsigned char arr1[100];
>>>
>>> extern a;
>>>
>>> int foo()
>>> {
>>> int i;
>>> for( i = 0; i < 100; ++i )
>>> {
>>> arr1[i] = 0;
>>> }
>>> }
Loop peeling (or loop versioning) will be used by auto vectorizer,
which will bloat code size, to overcome such problems, try add
"aligned" attribute to the external declaration:
>>> extern unsigned char __attribute__((aligned(8))) arr1[100];
>>>
>>> extern a;
>>>
>>> int foo()
>>> {
>>> int i;
>>> for( i = 0; i < 100; ++i )
>>> {
>>> arr1[i] = 0;
>>> }
>>> }
================================================================================
5. Issue: auto vected cases for pointers with loop peeling
>>> IPPCODECFUN(IppCodecStatus, _ijxSetZero_16s) (Ipp16s * pDst, int len)
>>> {
>>> int i;
>>> Ipp16s pp[len];
>>>
>>> //_IPP_CHECK_ARG(pDst && len > 0);
>>>
>>> for ( i = 0; i < len; i ++ ) {
>>> pDst[i] = 0;
>>> }
>>>
>>> return IPP_STATUS_NOERR;
>>> }
For the above code snippet, the vectorizer will use loop peeling to
enable loop vectorizing. The side effect is that code size is bloated.
Currently there is no way to specify pointer alignment in GCC, the
following is a description from the mailing list of one of the GCC
developers:
"Unfortunately there's no way to specify alignment attribute of
pointers in GCC - the syntax was allowed in the past but not really
supported correctly, and then entirely disallowed (by this patch
http://gcc.gnu.org/ml/gcc-patches/2005-04/msg02284.html). This issue
was discussed in details in these threads:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20794
http://gcc.gnu.org/ml/gcc/2005-03/msg00483.html (and recently came up
again also in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827#c56).
The problem is that "We don't yet implement either attributes on array
parameters applying to the array not the pointer, or attributes inside
the [] of the array parameter applying to the pointer. (This is
documented in "Attribute Syntax".)" (from the above thread)."
So there seems no workaround for this loop peeling problem on pointers now.
================================================================================
6. Issue: aliasing issue
6.1. Pointers aliasing prevents vectorization
>>> extern unsigned char arr1[100];
>>>
>>> int foo(unsigned char * p)
>>> {
>>> int i;
>>> for( i = 0; i < 100; ++i )
>>> {
>>> p[ i ] = arr1[ i ];
>>> }
>>> }
The alias analysis result for "p" and "arr1" is "may_alias",
which effective prevents vectorization.
(To work around this problem, "-fargument-noalias-global" may
seems the right way to go, however, it just does not work here,
it may be fixed in future versions.)
6.2. A more common example:
>>> int foo(unsigned char * p, unsigned char * q)
>>> {
>>> int i;
>>> for( i = 0; i < 100; ++i )
>>> {
>>> p[ i ] = q[ i ];
>>> }
>>> }
("-fargument-noalias" does not work here)
================================================================================
7. Issue: dependence testing issue
For the following case from jpeg:
>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>> pBuf[ti] = pBuf[ti+byteWritten];
>>> }
The dependence tester will judge that pBuf[ti] and
pBuf[ti+byteWritten] has cross loop dependence relation.
Currently there is no way to overcome this limitation, even for the
following case we know that byteWritten is large than 8: (seems
that dependence testing does not know that byteWritten > 8 ):
>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> if( byteWritten > 32 )
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>> pBuf[ti] = pBuf[ti+byteWritten];
>>> }
The following case works:
>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=anything;ti++) {
>>> pBuf[ti] = pBuf[ti+32];
>>> }
(Because vectorizer knows that difference for (ti) and (ti+32) is
larger than the vector factor(8 here))
================================================================================
8. Issue: straight lines of statements will not be vectorized.
For the following example:
>>> int foo(int i)
>>> {
>>> ...
>>> a1[ i ] = 0
>>> a1[ i + 1 ] = 0;
>>> a1[ i + 2 ] = 0;
>>> a1[ i + 3 ] = 0;
>>> ...
>>> }
The assignment to a1 array will not be vectorized, because all of
these statements belongs to a same iteration vector.
In other words, vectorizer will only try to combine statements from
different but consecutive loop iteration vectors, statements that
belong to same iteration vector will not be combined.