Pandas merge handling duplicates in join output











Is there a nice way to bring in only one row, preferably a random one, for one-to-many matches during a left join in pandas?



For example:

import numpy as np
import pandas as pd

left = [[1,1,1], [2,2,2], [3,3,3], [9,9,9], [1,3,2]]
right = [[1,2,2], [1,2,3], [3,2,2], [3,2,9], [3,2,2]]
left = pd.DataFrame(np.asarray(left))
right = pd.DataFrame(np.asarray(right))
joined_left = left.merge(right, how="left", left_on=[0], right_on=[0])


This is what we get for left, right, and joined_left, in that order:

   0  1  2
0  1  1  1
1  2  2  2
2  3  3  3
3  9  9  9
4  1  3  2

   0  1  2
0  1  2  2
1  1  2  3
2  3  2  2
3  3  2  9
4  3  2  2

   0  1_x  2_x  1_y  2_y
0  1    1    1  2.0  2.0
1  1    1    1  2.0  3.0
2  2    2    2  NaN  NaN
3  3    3    3  2.0  2.0
4  3    3    3  2.0  9.0
5  3    3    3  2.0  2.0
6  9    9    9  NaN  NaN
7  1    3    2  2.0  2.0
8  1    3    2  2.0  3.0


Now I want the output to be the same size as my left dataframe, and when there is more than one match in the right dataframe I want to bring in only a single random row.

Is there a nice way of doing this with a pandas shortcut?

Thank you!










      python pandas dataframe random merge






      edited Nov 11 at 1:26 by coldspeed

      asked Nov 11 at 0:36 by YohanRoth
























          1 Answer



          accepted










          You can shuffle right and call drop_duplicates (which defaults to keep='first') on the key column before merging.



          right2 = right.sample(frac=1).drop_duplicates(subset=[0])
          left.merge(right2, how='left', left_on=[0], right_on=[0])

             0  1_x  2_x  1_y  2_y
          0  1    1    1  2.0  2.0
          1  2    2    2  NaN  NaN
          2  3    3    3  2.0  2.0
          3  9    9    9  NaN  NaN
          4  1    3    2  2.0  2.0


          We shuffle right first, and then drop every duplicate except the first row for each value of column 0 (the merge key). This amounts to randomly selecting one row of right per key, so the output above is just one possible result.
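
          The shuffle-then-deduplicate idea can be wrapped in a small helper. This is only a sketch: the function name merge_one_random_match and the seed parameter are illustrative and not part of pandas, and passing validate="many_to_one" is an extra safety check I added so that pandas raises if right still contains duplicate keys.

```python
import pandas as pd

left = pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3], [9, 9, 9], [1, 3, 2]])
right = pd.DataFrame([[1, 2, 2], [1, 2, 3], [3, 2, 2], [3, 2, 9], [3, 2, 2]])


def merge_one_random_match(left, right, key, seed=None):
    """Left-join, keeping one randomly chosen row of `right` per key value.

    Hypothetical helper for illustration; `seed` only makes the shuffle repeatable.
    """
    # Shuffle right, then keep the first surviving row for each key value.
    right_unique = right.sample(frac=1, random_state=seed).drop_duplicates(subset=[key])
    # many_to_one: raise if right_unique still has duplicate keys.
    return left.merge(right_unique, how="left", on=[key], validate="many_to_one")


result = merge_one_random_match(left, right, key=0, seed=42)
print(result)
```

          The result has exactly one row per row of left. Leaving seed as None gives a different pick on each run.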






          • I see, so you drop duplicates for a merge key column right. Ingenious! Thank you
            – YohanRoth
            Nov 11 at 0:43










          • @YohanRoth - in this case - if your first row of the output is 1 1 1 2.0 2.0, I think that guarantees the last row is also 1 3 2 2.0 2.0 since you've dropped 1 2 3. From your question asking for a random choice, I'm a bit concerned that this may not be the behavior you want. Perhaps it's fine, but worth making sure it's consistent with what you want.
            – Joel
            Nov 11 at 4:47
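
          If the behavior described in this comment is a concern, one possible variation is to merge first and then pick one candidate per left row, so that two left rows sharing a key are drawn independently. This is a sketch rather than part of the accepted answer: the helper name merge_random_match_per_row and the seed parameter are made up for illustration, and it assumes left has an unnamed default index and no column called "index".

```python
import pandas as pd

left = pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 3, 3], [9, 9, 9], [1, 3, 2]])
right = pd.DataFrame([[1, 2, 2], [1, 2, 3], [3, 2, 2], [3, 2, 9], [3, 2, 2]])


def merge_random_match_per_row(left, right, key, seed=None):
    """Left-join, then keep one random match for each individual row of `left`.

    Hypothetical helper; assumes `left` has no column named "index".
    """
    # reset_index() stores left's row labels in an "index" column,
    # so every candidate match remembers which left row it came from.
    joined = left.reset_index().merge(right, how="left", on=[key])
    picked = (
        joined.sample(frac=1, random_state=seed)  # shuffle all candidate rows
        .drop_duplicates(subset=["index"])        # keep one candidate per left row
        .sort_values("index")                     # restore left's row order
        .set_index("index")
    )
    picked.index.name = None
    return picked


result = merge_random_match_per_row(left, right, key=0, seed=0)
print(result)
```

          Here rows 0 and 4 of left (both with key 1) can end up with different matches, at the cost of building the full one-to-many join first.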











          answered Nov 11 at 0:39 by coldspeed







