0. Turning on auto vectorization
==================================================

To switch on auto vectorization in the compiler, give the following options:

    -mmrvl-use-iwmmxt -ftree-vectorize -O2 (or -O3)

To view detailed optimization information, add the option:

    -ftree-vectorizer-verbose=3

================================================================================
1. Hint: use loop local variable instead of long-lived variable to access array

The following code snippet is from jpeg:

>>> for( k=0, i=1; i<17; i++ ){
>>>     L=(int)pHuffBits[i-1];
>>>     for (j = 1; j <= L; j++){
>>>         huffsize[k++] = (Ipp8u) i;
>>>     }
>>> }

To make it clearer, rewrite it as (first version):

int L, i, k, j;
for( k=0, i=1; i<17; i++ )
{
    L = array2[ i - 1 ];
    for( j=1; j<=L; j++ )
    {
        array[ k++ ] = i;
    }
}

To enable auto vectorization, rewrite the above as (second version):

int L, i, k, j;
for( k=0, i=1; i<17; i++ )
{
    L = array2[ i - 1 ];
    for( j=1; j<=L; j++ )
    {
        array[ j + k - 1 ] = i;
    }
    k += L;
}

The difference between the two versions is that in the first, "array[k++]"
references "k", which outlives the inner loop (the outer loop runs the inner
loop 16 times, and k stays live across all of them). In the second, "j+k-1" is
an expression whose lifetime is confined to the inner loop (j is an induction
variable, and (k-1) is invariant in the inner loop).

================================================================================
2. Hint: use same data type in a single loop

2.1. For the following code from jpeg:

>>> for (j=0; j<17; j++) {
>>>     pTable->mincodeptr[j] = 0;
>>>     pTable->maxcode[j] = 0;
>>> }

The two assignment statements above have different vector types and therefore
different vectorization factors. A single loop can only be vectorized with one
vectorization factor, so this loop cannot be vectorized. To enable
vectorization, split it into two loops:

>>> for (j=0; j<17; j++) {
>>>     pTable->maxcode[j] = 0;
>>> }
>>>
>>> for (j=0; j<17; j++) {
>>>     pTable->mincodeptr[j] = 0;
>>> }

2.2.
For the following code from jpeg:

>>> while (p < ...) {
>>>     pTable->huffVal[p] = pHuffValue[p];
>>>     p++;
>>> }

Here "pTable->huffVal[p]" is of type Ipp16u while "pHuffValue[p]" is of type
Ipp8u. So far there is no viable workaround for this example.

================================================================================
3. Hint: use simple object reference

For the following code:

>>> #define FLUSH_STREAM(pStream, fStreamFlush, pStreamHandle, len) \
>>> { \
>>>     int byteWritten, ti;\
>>>     int availBytes = (pStream)->pBsCurByte - (pStream)->pBsBuffer;\
>>>     int minLen = (len)-((pStream)->bsByteLen - ((pStream)->pBsCurByte - (pStream)->pBsBuffer));\
>>>     byteWritten = fStreamFlush((pStream)->pBsBuffer, pStreamHandle, availBytes, 0);\
>>>     if(minLen>byteWritten){\
>>>         return IPP_STATUS_STREAMFLUSH_ERR;\
>>>     }\
>>>     for(ti=0;ti<=availBytes-byteWritten;ti++) {\
>>>         (pStream)->pBsBuffer[ti] = (pStream)->pBsBuffer[ti+byteWritten];\
>>>     }\
>>>     (pStream)->pBsCurByte -= byteWritten;\
>>> }

In the loop above, data reference analysis reports "pStream->pBsBuffer[ti]" as
an "unhandled ref", which prevents the vectorizer from processing it. To get
around this problem, you can rewrite the loop as:

>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>>     pBuf[ti] = pBuf[ti+byteWritten];
>>> }

(This case is still not vectorizable, though, because of the cross-iteration
dependence described in item 7.)

================================================================================
4.
Hint: use "aligned" attribute for array when possible

For the following simple code:

>>> extern unsigned char arr1[100];
>>>
>>> extern int a;
>>>
>>> int foo()
>>> {
>>>     int i;
>>>     for( i = 0; i < 100; ++i )
>>>     {
>>>         arr1[i] = 0;
>>>     }
>>> }

The alignment of arr1 is not known here, so the auto vectorizer falls back on
loop peeling (or loop versioning), which bloats code size. To avoid this, try
adding the "aligned" attribute to the external declaration:

>>> extern unsigned char __attribute__((aligned(8))) arr1[100];
>>>
>>> extern int a;
>>>
>>> int foo()
>>> {
>>>     int i;
>>>     for( i = 0; i < 100; ++i )
>>>     {
>>>         arr1[i] = 0;
>>>     }
>>> }

================================================================================
5. Issue: auto vectorized cases for pointers with loop peeling

>>> IPPCODECFUN(IppCodecStatus, _ijxSetZero_16s) (Ipp16s * pDst, int len)
>>> {
>>>     int i;
>>>     Ipp16s pp[len];
>>>
>>>     //_IPP_CHECK_ARG(pDst && len > 0);
>>>
>>>     for ( i = 0; i < len; i ++ ) {
>>>         pDst[i] = 0;
>>>     }
>>>
>>>     return IPP_STATUS_NOERR;
>>> }

For the code snippet above, the vectorizer uses loop peeling to make the loop
vectorizable. The side effect is that code size is bloated. Currently there is
no way to specify pointer alignment in GCC. The following description comes
from one of the GCC developers on the mailing list:

"Unfortunately there's no way to specify alignment attribute of pointers in
GCC - the syntax was allowed in the past but not really supported correctly,
and then entirely disallowed (by this patch
http://gcc.gnu.org/ml/gcc-patches/2005-04/msg02284.html). This issue was
discussed in details in these threads:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20794
http://gcc.gnu.org/ml/gcc/2005-03/msg00483.html
(and recently came up again also in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827#c56).
The problem is that "We don't yet implement either attributes on array
parameters applying to the array not the pointer, or attributes inside the []
of the array parameter applying to the pointer.
(This is documented in "Attribute Syntax".)" (from the above thread)."

So there seems to be no workaround for this loop peeling problem on pointers
for now.

================================================================================
6. Issue: aliasing issue

6.1. Pointer aliasing prevents vectorization

>>> extern unsigned char arr1[100];
>>>
>>> int foo(unsigned char * p)
>>> {
>>>     int i;
>>>     for( i = 0; i < 100; ++i )
>>>     {
>>>         p[ i ] = arr1[ i ];
>>>     }
>>> }

The alias analysis result for "p" and "arr1" is "may_alias", which effectively
prevents vectorization.
(To work around this problem, "-fargument-noalias-global" may seem the right
way to go; however, it does not work here. It may be fixed in future versions.)

6.2. A more common example:

>>> int foo(unsigned char * p, unsigned char * q)
>>> {
>>>     int i;
>>>     for( i = 0; i < 100; ++i )
>>>     {
>>>         p[ i ] = q[ i ];
>>>     }
>>> }

("-fargument-noalias" does not work here.)

================================================================================
7. Issue: dependence testing issue

For the following case from jpeg:

>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>>     pBuf[ti] = pBuf[ti+byteWritten];
>>> }

The dependence tester judges that pBuf[ti] and pBuf[ti+byteWritten] have a
cross-iteration dependence.
Currently there is no way to overcome this limitation, even in the following
case, where byteWritten is known to be larger than 8 (it seems that dependence
testing does not pick up that byteWritten > 8):

>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> if( byteWritten > 32 )
>>>     for(ti=0;ti<=availBytes-byteWritten;ti++) {
>>>         pBuf[ti] = pBuf[ti+byteWritten];
>>>     }

The following case works:

>>> Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
>>> for(ti=0;ti<=anything;ti++) {
>>>     pBuf[ti] = pBuf[ti+32];
>>> }

(This works because the vectorizer knows that the distance between (ti) and
(ti+32) is larger than the vectorization factor, which is 8 here.)

================================================================================
8. Issue: straight lines of statements will not be vectorized.

For the following example:

>>> int foo(int i)
>>> {
>>>     ...
>>>     a1[ i ] = 0;
>>>     a1[ i + 1 ] = 0;
>>>     a1[ i + 2 ] = 0;
>>>     a1[ i + 3 ] = 0;
>>>     ...
>>> }

The assignments to the a1 array will not be vectorized, because all of these
statements belong to the same iteration vector. In other words, the vectorizer
only tries to combine statements from different but consecutive loop
iterations; statements that belong to the same iteration are not combined.
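
As an illustration of item 8, the straight-line assignments can be expressed as
a counted loop, so that each store belongs to a different loop iteration. This
is only a sketch: the helper name clear_range and its parameters are
hypothetical and do not come from the jpeg sources, and whether the compiler
actually vectorizes the loop depends on the compiler version, the target and
the trip count. The report from -ftree-vectorizer-verbose=3 shows what was
decided.

```c
/* Hypothetical helper (not from the original sources): clears `count`
   consecutive elements of a1 starting at index `start`.  Written as a
   loop so that the stores come from consecutive loop iterations, which
   is the form the vectorizer looks at. */
void clear_range(int *a1, int start, int count)
{
    int j;
    for (j = 0; j < count; ++j) {
        a1[start + j] = 0;
    }
}
```

The four assignments in the example would then become a single call such as
clear_range(a1, i, 4).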