How to understand the speedup in the optimization report from the icc compiler?

My environment is:

icc version 19.0.0.117 (gcc version 5.4.0 compatibility)
Intel Parallel Studio XE Cluster Edition 2019
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Ubuntu 16.04

The compiler flags are:

-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all

I use OpenMP SIMD directives or Intel pragmas to vectorize my loops and gain a speedup. In the optimization report generated by icc, I usually see results like the following:



LOOP BEGIN at get_forces.c(3668,3)
remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access [ get_forces.c(3669,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3669,36) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3669,51) ]
remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access [ get_forces.c(3671,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3671,40) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3671,57) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.773
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 21
remark #15477: vector cost: 11.000
remark #15478: estimated potential speedup: 1.050
remark #15488: --- end vector cost summary ---
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25015: Estimate of max trip count of loop=1
LOOP END


My question is: I do not understand how the estimated potential speedup of 1.050 is calculated from these values:



normalized vectorization overhead 0.773
scalar cost: 21
vector cost: 11.000


Another, more extreme and more puzzling case is:



LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,19) ]
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,26) ]
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 1.857
remark #15448: unmasked aligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 3.500
remark #15478: estimated potential speedup: 0.770
remark #15488: --- end vector cost summary ---
remark #25436: completely unrolled by 3
LOOP END


Now, 3.5 + 1.857 = 5.357 < 7, so the vector cost plus the overhead is still below the scalar cost.

So, can I still apply SIMD to this loop and gain a speedup, or should I trust the estimated speedup of 0.770 in the report and leave the loop scalar?



In short, how should I interpret the speedup figure in the icc optimization report?










      vectorization intel compiler-optimization simd icc






asked 2 days ago by jjl