Compare items separated by comma in two column












1















I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.



enter image description here



I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B



enter image description here



I'll appreciate any help with this.



Thank you.










share|improve this question























  • Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

    – SpghttCd
    Nov 18 '18 at 19:24


















1















I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.



enter image description here



I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B



enter image description here



I'll appreciate any help with this.



Thank you.










share|improve this question























  • Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

    – SpghttCd
    Nov 18 '18 at 19:24
















1












1








1








I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.



enter image description here



I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B



enter image description here



I'll appreciate any help with this.



Thank you.










share|improve this question














I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.



enter image description here



I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B



enter image description here



I'll appreciate any help with this.



Thank you.







python pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 18 '18 at 19:17









UJAYUJAY

374




374













  • Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

    – SpghttCd
    Nov 18 '18 at 19:24





















  • Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

    – SpghttCd
    Nov 18 '18 at 19:24



















Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24







Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24














2 Answers
2






active

oldest

votes


















0














You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible



inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')


I'd suggest a pure Python approach, though.



df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]


Timings are clear



%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)





share|improve this answer































    0














    Try the following code (application of a function along column axis (1):



    import pandas as pd
    import re

    # Source data
    df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
    pat = re.compile(r',s*')
    df['C'] = df.apply(lambda x: ', '.join(
    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)


    The result is:



                     A                 B          C
    0 Lisa, John, Sam Lisa, Peter, Sam John
    1 Lisa, John, Sam Lisa, Peter John, Sam





    share|improve this answer

























      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53364570%2fcompare-items-separated-by-comma-in-two-column%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      0














      You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible



      inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
      df['C'] = inter.str.join(',')


      I'd suggest a pure Python approach, though.



      df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]


      Timings are clear



      %timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
      40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

      %timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
      730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)





      share|improve this answer




























        0














        You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible



        inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
        df['C'] = inter.str.join(',')


        I'd suggest a pure Python approach, though.



        df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]


        Timings are clear



        %timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
        40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

        %timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
        730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)





        share|improve this answer


























          0












          0








          0







          You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible



          inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
          df['C'] = inter.str.join(',')


          I'd suggest a pure Python approach, though.



          df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]


          Timings are clear



          %timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
          40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

          %timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
          730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)





          share|improve this answer













          You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible



          inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
          df['C'] = inter.str.join(',')


          I'd suggest a pure Python approach, though.



          df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]


          Timings are clear



          %timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
          40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

          %timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
          730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 18 '18 at 19:27









          RafaelCRafaelC

          26.4k82851




          26.4k82851

























              0














              Try the following code (application of a function along column axis (1):



              import pandas as pd
              import re

              # Source data
              df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
              'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
              pat = re.compile(r',s*')
              df['C'] = df.apply(lambda x: ', '.join(
              set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)


              The result is:



                               A                 B          C
              0 Lisa, John, Sam Lisa, Peter, Sam John
              1 Lisa, John, Sam Lisa, Peter John, Sam





              share|improve this answer






























                0














                Try the following code (application of a function along column axis (1):



                import pandas as pd
                import re

                # Source data
                df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
                'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
                pat = re.compile(r',s*')
                df['C'] = df.apply(lambda x: ', '.join(
                set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)


                The result is:



                                 A                 B          C
                0 Lisa, John, Sam Lisa, Peter, Sam John
                1 Lisa, John, Sam Lisa, Peter John, Sam





                share|improve this answer




























                  0












                  0








                  0







                  Try the following code (application of a function along column axis (1):



                  import pandas as pd
                  import re

                  # Source data
                  df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
                  'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
                  pat = re.compile(r',s*')
                  df['C'] = df.apply(lambda x: ', '.join(
                  set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)


                  The result is:



                                   A                 B          C
                  0 Lisa, John, Sam Lisa, Peter, Sam John
                  1 Lisa, John, Sam Lisa, Peter John, Sam





                  share|improve this answer















                  Try the following code (application of a function along column axis (1):



                  import pandas as pd
                  import re

                  # Source data
                  df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
                  'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
                  pat = re.compile(r',s*')
                  df['C'] = df.apply(lambda x: ', '.join(
                  set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)


                  The result is:



                                   A                 B          C
                  0 Lisa, John, Sam Lisa, Peter, Sam John
                  1 Lisa, John, Sam Lisa, Peter John, Sam






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Nov 18 '18 at 20:55

























                  answered Nov 18 '18 at 20:05









                  Valdi_BoValdi_Bo

                  4,7202815




                  4,7202815






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53364570%2fcompare-items-separated-by-comma-in-two-column%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Guess what letter conforming each word

                      Port of Spain

                      Run scheduled task as local user group (not BUILTIN)