 0. Turning on auto vectorization
 ==================================================

  To enable auto-vectorization, pass the following options to the
  compiler:

    -mmrvl-use-iwmmxt -ftree-vectorize -O2 (or -O3)

  To see detailed information about the vectorizer's decisions, add
  the option:

    -ftree-vectorizer-verbose=3
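
  As a quick way to try these options, a small test file such as the
  following sketch can be compiled with them. The file name, the
  function names, and the compiler name in the comment are placeholders
  chosen for illustration; only the options themselves come from the
  text above.

```c
/* vec_test.c: hypothetical file name for trying the options above.
 *
 * Example command line, where <your-gcc> is a placeholder for the
 * compiler of the toolchain in use:
 *
 *   <your-gcc> -mmrvl-use-iwmmxt -ftree-vectorize -O2
 *       -ftree-vectorizer-verbose=3 -c vec_test.c
 */
#define N 256

static unsigned char buf[N];

/* A simple counted loop over one array with a single element type. */
void fill_buf(unsigned char value)
{
    int i;
    for (i = 0; i < N; ++i) {
        buf[i] = value;
    }
}

/* Accessor so that a caller can inspect the result. */
unsigned char buf_at(int i)
{
    return buf[i];
}
```

  The verbose output should then say, for the loop in fill_buf, whether
  it was vectorized or why it was rejected.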


 ================================================================================
 1. Hint: use a loop-local variable instead of a long-lived variable to index an array

   The following code snippet is from jpeg:

     >>>  for( k=0, i=1; i<17; i++ ){
     >>>       L=(int)pHuffBits[i-1];
     >>>       for (j = 1; j <= L; j++){
     >>>           huffsize[k++] = (Ipp8u) i;
     >>>       }
     >>>  }

   To make the access pattern clearer, it can be reduced to:

     <A>  int L, i, k, j;
     <A>  for( k=0, i=1; i<17; i++ )
     <A>    {
     <A>      L = array2[ i - 1 ];
     <A>      for( j=1; j<=L; j++ )
     <A>        {
     <A>          array[ k++ ] = i;
     <A>        }
     <A>    }

   To enable auto-vectorization, rewrite the above into:

     <B>  int L, i, k, j;
     <B>  for( k=0, i=1; i<17; i++ )
     <B>    {
     <B>      L = array2[ i - 1 ];
     <B>      for( j=1; j<=L; j++ )
     <B>        {
     <B>          array[ j + k - 1 ] = i;
     <B>        }
     <B>      k += L;
     <B>    }

   The difference between the two versions is that in <A>,
   "array[k++]" is indexed by "k", which outlives the inner loop (the
   outer loop runs the inner loop 16 times, and k stays live across
   all of them).

   In <B>, "j+k-1" is an expression whose lifetime is confined to the
   inner loop: j is the induction variable and (k-1) is invariant
   within the inner loop.
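
   The following sketch checks that the index pattern of the original
   jpeg snippet ("k++" carried across the inner loop) and the pattern
   of <B> ("j + k - 1" with "k += L" after the inner loop) write the
   same elements. The function names, the unsigned char element type,
   and the 64-byte scratch buffers are assumptions made only for this
   example.

```c
#include <string.h>

#define NBITS 16

/* Index k is carried across all runs of the inner loop. */
int fill_carried(unsigned char *array, const int *array2)
{
    int L, i, j, k;
    for (k = 0, i = 1; i <= NBITS; i++) {
        L = array2[i - 1];
        for (j = 1; j <= L; j++) {
            array[k++] = (unsigned char)i;
        }
    }
    return k;
}

/* Index is built from the inner induction variable j plus a value
 * that does not change inside the inner loop. */
int fill_local(unsigned char *array, const int *array2)
{
    int L, i, j, k;
    for (k = 0, i = 1; i <= NBITS; i++) {
        L = array2[i - 1];
        for (j = 1; j <= L; j++) {
            array[j + k - 1] = (unsigned char)i;
        }
        k += L;
    }
    return k;
}

/* Returns 1 if both variants write the same bytes. Assumes the counts
 * in array2 are non-negative and sum to at most 64. */
int variants_agree(const int *array2)
{
    unsigned char a[64] = {0}, b[64] = {0};
    int na = fill_carried(a, array2);
    int nb = fill_local(b, array2);
    return na == nb && memcmp(a, b, sizeof a) == 0;
}
```

   Calling variants_agree with any table of 16 small counts is a cheap
   way to confirm that the rewrite does not change behavior.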


 ================================================================================
 2. Hint: use the same data type within a single loop

 2.1. Consider the following code from jpeg:

     >>>  for (j=0; j<17; j++) {
     >>>      pTable->mincodeptr[j] = 0;
     >>>      pTable->maxcode[j] = 0;
     >>>  }
     >>>

   The two assignment statements above use different vector types and
   therefore need two different vectorization factors. A single loop
   can be vectorized with only one vectorization factor, so this loop
   cannot be vectorized.

   To enable vectorization, split it into two loops:

     >>>  for (j=0; j<17; j++) {
     >>>      pTable->maxcode[j] = 0;
     >>>  }
     >>>
     >>>  for (j=0; j<17; j++) {
     >>>      pTable->mincodeptr[j] = 0;
     >>>  }
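
   A self-contained sketch of this loop split is shown below. The
   struct layout is hypothetical: the text does not give the real field
   types, so the 16-bit and 32-bit element types here are assumptions
   chosen only so that the two arrays differ in element size.

```c
/* Hypothetical table layout; the field types are assumed. */
typedef struct {
    unsigned short mincodeptr[17];   /* assumed 16-bit elements */
    int            maxcode[17];      /* assumed 32-bit elements */
} HuffTable;

void clear_table(HuffTable *pTable)
{
    int j;

    /* Each loop touches one element type only, so each loop can be
     * given its own vectorization factor. */
    for (j = 0; j < 17; j++) {
        pTable->maxcode[j] = 0;
    }

    for (j = 0; j < 17; j++) {
        pTable->mincodeptr[j] = 0;
    }
}
```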

 2.2. Consider the following code from jpeg:

     >>>  while (p<i) {
     >>>      pTable->huffVal[p] = pHuffValue[p];
     >>>      p++;
     >>>  }

     Here "pTable->huffVal[p]" is of type Ipp16u while "pHuffValue[p]"
     is of type Ipp8u, so the loop again mixes two element sizes.

     So far, no viable workaround is known for this case.


 ================================================================================
 3. Hint: use simple object reference

   Consider the following code:

   >>>  #define FLUSH_STREAM(pStream, fStreamFlush, pStreamHandle, len) \
   >>>  { \
   >>>      int byteWritten, ti;\
   >>>      int availBytes = (pStream)->pBsCurByte - (pStream)->pBsBuffer;\
   >>>      int minLen     = (len)-((pStream)->bsByteLen - ((pStream)->pBsCurByte - (pStream)->pBsBuffer));\
   >>>      byteWritten = fStreamFlush((pStream)->pBsBuffer, pStreamHandle, availBytes, 0);\
   >>>      if(minLen>byteWritten){\
   >>>          return IPP_STATUS_STREAMFLUSH_ERR;\
   >>>      }\
   >>>      for(ti=0;ti<=availBytes-byteWritten;ti++) {\
   >>>          (pStream)->pBsBuffer[ti] = (pStream)->pBsBuffer[ti+byteWritten];\
   >>>      }\
   >>>      (pStream)->pBsCurByte -= byteWritten;\
   >>>  }

   In the loop above, data reference analysis reports
   "pStream->pBsBuffer[ti]" as an "unhandled ref", which prevents the
   vectorizer from processing the loop. To get around this problem,
   rewrite the loop as:

   >>>      Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
   >>>      for(ti=0;ti<=availBytes-byteWritten;ti++) {
   >>>          pBuf[ti] = pBuf[ti+byteWritten];
   >>>      }

   (This loop is still not vectorizable, however, because of the
   cross-iteration dependence described in section 7.)
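
   As a compilable illustration of the rewrite, the sketch below wraps
   the loop in a function. The Stream type, the typedef for Ipp8u, and
   the name shift_down are assumptions for this example; only the
   pBsBuffer field and the loop body follow the code above.

```c
typedef unsigned char Ipp8u;   /* assumed definition of the byte type */

/* Hypothetical stream type holding only the field used here. */
typedef struct {
    Ipp8u *pBsBuffer;
} Stream;

/* Moves count bytes down by byteWritten positions. Assumes that
 * byteWritten >= 0 and that the buffer holds at least
 * count + byteWritten bytes. */
void shift_down(Stream *pStream, int count, int byteWritten)
{
    /* Read the pointer member once, so the loop body indexes a plain
     * local pointer instead of going through pStream each time. */
    Ipp8u *pBuf = &(pStream->pBsBuffer[0]);
    int ti;

    for (ti = 0; ti < count; ti++) {
        pBuf[ti] = pBuf[ti + byteWritten];
    }
}
```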

 ================================================================================
 4. Hint: use the "aligned" attribute for arrays when possible

   Consider the following simple code:

     >>>  extern unsigned char arr1[100];
     >>>
     >>>  extern int a;
     >>>
     >>>  int foo()
     >>>  {
     >>>    int i;
     >>>    for( i = 0; i < 100; ++i )
     >>>      {
     >>>        arr1[i] = 0;
     >>>      }
     >>>  }

   Because the compiler cannot see the alignment of the external
   array, the auto-vectorizer falls back on loop peeling (or loop
   versioning), which bloats code size. To avoid this, try adding the
   "aligned" attribute to the external declaration:

     >>>  extern unsigned char __attribute__((aligned(8))) arr1[100];
     >>>
     >>>  extern int a;
     >>>
     >>>  int foo()
     >>>  {
     >>>    int i;
     >>>    for( i = 0; i < 100; ++i )
     >>>      {
     >>>        arr1[i] = 0;
     >>>      }
     >>>  }
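
   The attribute on an extern declaration is a promise to the
   compiler, so the file that defines the array should give it the
   same alignment. The following sketch shows such a definition
   together with a small run-time check. The helper names are
   hypothetical, and the example assumes a compiler that accepts the
   GCC attribute syntax.

```c
#include <stdint.h>

/* Definition with the same 8-byte alignment as the declaration. */
unsigned char __attribute__((aligned(8))) arr1[100];

/* Returns 1 if arr1 really starts on an 8-byte boundary. */
int arr1_is_aligned(void)
{
    return ((uintptr_t)arr1 % 8) == 0;
}

/* Same loop as above, now over an array of known alignment. */
void clear_arr1(void)
{
    int i;
    for (i = 0; i < 100; ++i) {
        arr1[i] = 0;
    }
}
```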

 ================================================================================
 5. Issue: loops over pointers are vectorized with loop peeling

    >>> IPPCODECFUN(IppCodecStatus, _ijxSetZero_16s) (Ipp16s * pDst, int len)
    >>> {
    >>>     int  i;
    >>>     Ipp16s pp[len];
    >>>
    >>>     //_IPP_CHECK_ARG(pDst && len > 0);
    >>>
    >>>     for ( i = 0; i < len; i ++ ) {
    >>>         pDst[i] = 0;
    >>>     }
    >>>
    >>>     return IPP_STATUS_NOERR;
    >>> }

   For the code above, the vectorizer uses loop peeling to make the
   loop vectorizable. The side effect is larger code.

   Currently there is no way to specify pointer alignment in GCC. The
   following explanation was posted to the mailing list by one of the
   GCC developers:

      "Unfortunately there's no way to specify alignment attribute of
      pointers in GCC - the syntax was allowed in the past but not really
      supported correctly, and then entirely disallowed (by this patch
      http://gcc.gnu.org/ml/gcc-patches/2005-04/msg02284.html). This issue
      was discussed in details in these threads:
      http://gcc.gnu.org/bugzilla/show_bug.cgi?id=20794
      http://gcc.gnu.org/ml/gcc/2005-03/msg00483.html (and recently came up
      again also in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827#c56).
      The problem is that "We don't yet implement either attributes on array
      parameters applying to the array not the pointer, or attributes inside
      the [] of the array parameter applying to the pointer.  (This is
      documented in "Attribute Syntax".)" (from the above thread)."

   So there currently seems to be no workaround for the loop peeling
   problem with pointers.

 ================================================================================
 6. Issue: pointer aliasing

 6.1. Pointer aliasing prevents vectorization
      >>>  extern unsigned char arr1[100];
      >>>
      >>>  int foo(unsigned char * p)
      >>>  {
      >>>    int i;
      >>>    for( i = 0; i < 100; ++i )
      >>>      {
      >>>        p[ i ] = arr1[ i ];
      >>>      }
      >>>  }

      The alias analysis result for "p" and "arr1" is "may_alias",
      which effectively prevents vectorization.

      ("-fargument-noalias-global" may seem to be the right way to
      work around this problem, but it does not work here. This may
      be fixed in future versions.)


 6.2. A more common example:

      >>>  int foo(unsigned char * p, unsigned char * q)
      >>>  {
      >>>    int i;
      >>>    for( i = 0; i < 100; ++i )
      >>>      {
      >>>        p[ i ] = q[ i ];
      >>>      }
      >>>  }

      ("-fargument-noalias" does not help here either.)
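
      To see why a possible alias forces the compiler to be careful,
      the sketch below compares the loop as written with a model of
      its vectorized form that loads 8 bytes and then stores 8 bytes.
      The function names and the use of memcpy to stand in for vector
      loads and stores are assumptions for illustration; this is not
      what the compiler literally emits.

```c
#include <string.h>

#define VF 8   /* assumed vector factor: 8 bytes per vector */

/* Element-by-element copy, as in the source loop. */
void copy_scalar(unsigned char *p, const unsigned char *q, int n)
{
    int i;
    for (i = 0; i < n; ++i) {
        p[i] = q[i];
    }
}

/* Model of a vectorized copy. Assumes n is a multiple of VF. */
void copy_chunked(unsigned char *p, const unsigned char *q, int n)
{
    unsigned char v[VF];
    int i;
    for (i = 0; i < n; i += VF) {
        memcpy(v, q + i, VF);   /* stands in for a vector load  */
        memcpy(p + i, v, VF);   /* stands in for a vector store */
    }
}

/* Returns 1 if both copies give the same result when p is
 * base + offset and q is base. Assumes 0 <= offset <= 16. */
int copies_agree(int offset)
{
    unsigned char a[32], b[32];
    int i;
    for (i = 0; i < 32; ++i) {
        a[i] = b[i] = (unsigned char)i;
    }
    copy_scalar(a + offset, a, 16);
    copy_chunked(b + offset, b, 16);
    return memcmp(a, b, sizeof a) == 0;
}
```

      With an offset of 1 the two versions disagree, because the scalar
      loop rereads bytes it has just written. With an offset of 16 the
      regions do not overlap and the results match.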


 ================================================================================
 7. Issue: dependence testing

    For the following case from jpeg:

      >>>      Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
      >>>      for(ti=0;ti<=availBytes-byteWritten;ti++) {
      >>>          pBuf[ti] = pBuf[ti+byteWritten];
      >>>      }

    The dependence tester concludes that pBuf[ti] and
    pBuf[ti+byteWritten] have a cross-iteration dependence.

    Currently there is no way to overcome this limitation, even when,
    as in the following case, we know that byteWritten is larger than
    8 (the dependence tester does not seem to use the fact that
    byteWritten > 8):

      >>>      Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
      >>>      if( byteWritten > 32 )
      >>>        for(ti=0;ti<=availBytes-byteWritten;ti++) {
      >>>            pBuf[ti] = pBuf[ti+byteWritten];
      >>>        }

    The following case is vectorized:

      >>>      Ipp8u * pBuf = &(pStream->pBsBuffer[0]);
      >>>      for(ti=0;ti<=anything;ti++) {
      >>>          pBuf[ti] = pBuf[ti+32];
      >>>      }

    (This works because the vectorizer knows that the distance between
    (ti) and (ti+32) is larger than the vectorization factor, which
    is 8 here.)
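
    The role of the distance can be illustrated with a small model
    that compares the scalar loop with an 8-byte "load then store"
    version. The function names, the buffer size, and the use of
    memcpy in place of vector instructions are assumptions for this
    sketch. It also includes a distance of -1, a value the compiler
    cannot rule out for a plain int such as byteWritten.

```c
#include <string.h>

#define VF 8   /* vectorization factor taken from the text above */

/* The loop as written: pBuf[ti] = pBuf[ti + d]. */
void shift_scalar(unsigned char *pBuf, int n, int d)
{
    int ti;
    for (ti = 0; ti < n; ti++) {
        pBuf[ti] = pBuf[ti + d];
    }
}

/* Model of the vectorized loop. Assumes n is a multiple of VF. */
void shift_chunked(unsigned char *pBuf, int n, int d)
{
    unsigned char v[VF];
    int ti;
    for (ti = 0; ti < n; ti += VF) {
        memcpy(v, pBuf + ti + d, VF);   /* vector load  */
        memcpy(pBuf + ti, v, VF);       /* vector store */
    }
}

/* Returns 1 if both versions agree for distance d.
 * Assumes -32 <= d <= 32 so that all accesses stay in bounds. */
int shifts_agree(int d)
{
    unsigned char a[96], b[96];
    int i;
    for (i = 0; i < 96; ++i) {
        a[i] = b[i] = (unsigned char)i;
    }
    shift_scalar(a + 32, 32, d);
    shift_chunked(b + 32, 32, d);
    return memcmp(a, b, sizeof a) == 0;
}
```

    For the constant distance 32, the block that is loaded never
    overlaps the block that is stored, and the two versions agree.
    For a distance of -1 they do not.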


 ================================================================================
 8. Issue: straight-line statements are not vectorized

    For the following example:

      >>>  int foo(int i)
      >>>  {
      >>>    ...
      >>>    a1[ i ] = 0;
      >>>    a1[ i + 1 ] = 0;
      >>>    a1[ i + 2 ] = 0;
      >>>    a1[ i + 3 ] = 0;
      >>>    ...
      >>>  }

    The assignments to the a1 array will not be vectorized, because
    all of these statements belong to the same iteration vector.

    In other words, the vectorizer only tries to combine statements
    from different but consecutive loop iterations. Statements that
    belong to the same iteration are not combined.
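
    One possible way to give the vectorizer a loop to analyze is to
    write the stores as a short counted loop, as in the sketch below.
    The array a1, its length, its element type, and the function name
    are assumptions for this example. Whether such a short loop is
    actually vectorized depends on the trip count, element type, and
    alignment, so the vectorizer's verbose output should be checked.

```c
#define A1_LEN 64      /* assumed array length */

int a1[A1_LEN];        /* assumed element type */

/* Clears a1[i] .. a1[i + 3]. The caller must ensure that
 * 0 <= i and i + 3 < A1_LEN. */
void clear_four(int i)
{
    int j;
    for (j = 0; j < 4; j++) {
        a1[i + j] = 0;
    }
}
```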