How to understand the speedup in the optimization report from the icc compiler?

My environment is:

icc version 19.0.0.117 (gcc version 5.4.0 compatibility)

Intel Parallel Studio XE Cluster Edition 2019

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz

Ubuntu 16.04



The compiler flags are:

-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all



I use OpenMP SIMD or Intel pragmas to vectorize my loops to gain speedup; a minimal sketch of what I mean is shown right below.
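
The names and the loop body below are made up for illustration (this is not my actual get_forces.c code), but they show the pattern I apply:

    #include <stddef.h>

    /* Hypothetical example of how I annotate a loop for vectorization:
     * update a force array from two velocity arrays. */
    void damp_forces(double *restrict force, const double *restrict vel,
                     const double *restrict ref_vel, double gamma, size_t n)
    {
    #pragma omp simd
        for (size_t i = 0; i < n; i++)
            force[i] -= gamma * (vel[i] - ref_vel[i]);
    }

In the optimization report generated by icc, I usually see results like the following: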



LOOP BEGIN at get_forces.c(3668,3)
remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access [ get_forces.c(3669,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3669,36) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3669,51) ]
remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access [ get_forces.c(3671,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3671,40) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3671,57) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.773
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 21
remark #15477: vector cost: 11.000
remark #15478: estimated potential speedup: 1.050
remark #15488: --- end vector cost summary ---
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25015: Estimate of max trip count of loop=1
LOOP END


My question is:
I do not understand how the estimated potential speedup of 1.050 is calculated from



normalized vectorization overhead 0.773
scalar cost: 21
vector cost: 11.000


Naively I would have expected roughly scalar cost / vector cost = 21 / 11 ≈ 1.9, which is far from the reported 1.050, so something else must enter the calculation, but I cannot see what. A more extreme and more puzzling case is this one:



LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,19) ]
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,26) ]
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 1.857
remark #15448: unmasked aligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 3.500
remark #15478: estimated potential speedup: 0.770
remark #15488: --- end vector cost summary ---
remark #25436: completely unrolled by 3
LOOP END


Now, 3.5 + 1.857 = 5.357 < 7, i.e. even adding the reported overhead to the vector cost still comes out below the scalar cost.

So could I still apply SIMD to this loop and gain a speedup, or should I trust the 0.770 estimated speedup in the report and leave the loop scalar?
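
If it is worth overriding the cost model here, the report's remark #15335 itself suggests the vector always directive (or -vec-threshold0). A sketch of how I would apply it (the loop body and the second array are invented, since the real code at get_forces.c(2690,8) is not shown above):

    /* Hypothetical sketch: force icc to vectorize a short loop despite its
     * cost model, using the directive named in remark #15335. */
    void average_q(double *restrict q12, const double *restrict q12_prev)
    {
    #pragma vector always
        for (int j = 0; j < 3; j++)
            q12[j] = 0.5 * (q12[j] + q12_prev[j]);
    }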



Tags: vectorization intel compiler-optimization simd icc





