Spark: can I manually specify the number of partitions when calling textFile?












Spark automatically decides the number of partitions based on the size of the input file. I have two questions:



Can I specify the number of partitions myself rather than letting Spark decide how many to use?



How bad is the shuffle when repartitioning? Is it really that expensive for performance? In my case I need to repartition down to 1 partition to write a single Parquet file, and the data originally had 31 partitions. How bad is that, and why?










      apache-spark text-files hive-partitions






      asked Nov 19 '18 at 5:19









      bd zhang

          1 Answer
          repartition() and coalesce() are the two functions used to change the number of partitions after the data has been read.
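
As a rough illustration of how the two differ, the sketch below is a toy model in plain Python. It is not Spark code and not Spark's actual implementation. Partitions are modeled as lists. The hypothetical helper `coalesce` concatenates whole neighboring partitions, while the hypothetical helper `repartition` deals every record out again round-robin as a stand-in for a full shuffle. The 31 partitions of 3 records each are made-up data chosen to mirror the question.

```python
from typing import List

Partitions = List[List[int]]


def coalesce(parts: Partitions, n: int) -> Partitions:
    """Toy model: merge whole neighboring partitions into n groups.

    Records are never split out of their original partition, and asking
    for more partitions than exist leaves the layout unchanged.
    """
    if n >= len(parts):
        return [list(p) for p in parts]
    merged: Partitions = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        # Map input partition i to one of the n output groups, in order.
        merged[i * n // len(parts)].extend(part)
    return merged


def repartition(parts: Partitions, n: int) -> Partitions:
    """Toy model: redistribute every single record round-robin.

    This stands in for a full shuffle, where each record may end up in
    any output partition.
    """
    shuffled: Partitions = [[] for _ in range(n)]
    i = 0
    for part in parts:
        for record in part:
            shuffled[i % n].append(record)
            i += 1
    return shuffled


# Made-up input: 31 partitions of 3 records each (93 records in total).
parts = [[p * 3 + k for k in range(3)] for p in range(31)]

single = coalesce(parts, 1)
print(len(single), len(single[0]))  # prints "1 93"

print([len(p) for p in repartition(parts, 4)])  # prints "[24, 23, 23, 23]"
```

In Spark itself the corresponding calls are `rdd.coalesce(n)` and `rdd.repartition(n)`, and `sc.textFile(path, minPartitions)` takes a hint for the minimum number of partitions at read time. Going from 31 partitions to 1, `coalesce(1)` avoids the shuffle, but the Spark documentation warns that such a drastic coalesce can pull the upstream computation onto fewer nodes. `repartition(1)` pays for a shuffle and keeps the earlier stages parallel. Which one is cheaper likely depends on how much work precedes the write, so it is worth measuring both.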







          answered Nov 19 '18 at 7:40









          BDA

          • I know, I mean the performance.

            – bd zhang
            Nov 19 '18 at 19:15


















