Compare items separated by comma in two column

I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.

enter image description here

I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B

enter image description here

I'll appreciate any help with this.

Thank you.

asked Nov 18 '18 at 19:17

UJAY

374

Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24

add a comment |

I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.

enter image description here

I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B

enter image description here

I'll appreciate any help with this.

Thank you.

asked Nov 18 '18 at 19:17

UJAY

374

Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24

add a comment |

I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.

enter image description here

I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B

enter image description here

I'll appreciate any help with this.

Thank you.

asked Nov 18 '18 at 19:17

UJAY

374

I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.

enter image description here

I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B

enter image description here

I'll appreciate any help with this.

Thank you.

python pandas

asked Nov 18 '18 at 19:17

UJAY

374

asked Nov 18 '18 at 19:17

UJAY

374

asked Nov 18 '18 at 19:17

UJAY

374

asked Nov 18 '18 at 19:17

UJAY

374

asked Nov 18 '18 at 19:17

UJAY

374

Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24

add a comment |

Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24

Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of df.head() (as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.

– SpghttCd
Nov 18 '18 at 19:24

add a comment |

2 Answers
2

active

oldest

votes

You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible

inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

df['C'] = inter.str.join(',')

I'd suggest a pure Python approach, though.

df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

Timings are clear

%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

add a comment |

Try the following code (application of a function along column axis (1):

import pandas as pd

import re



# Source data

df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],

    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })

pat = re.compile(r',s*')

df['C'] = df.apply(lambda x: ', '.join(

    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)

The result is:

                 A                 B          C

0  Lisa, John, Sam  Lisa, Peter, Sam       John

1  Lisa, John, Sam       Lisa, Peter  John, Sam

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53364570%2fcompare-items-separated-by-comma-in-two-column%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible

inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

df['C'] = inter.str.join(',')

I'd suggest a pure Python approach, though.

df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

Timings are clear

%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

add a comment |

You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible

inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

df['C'] = inter.str.join(',')

I'd suggest a pure Python approach, though.

df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

Timings are clear

%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

add a comment |

You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible

inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

df['C'] = inter.str.join(',')

I'd suggest a pure Python approach, though.

df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

Timings are clear

%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

You can use set intersection. Notice that the performance will not be good if you use pandas, but it is possible

inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

df['C'] = inter.str.join(',')

I'd suggest a pure Python approach, though.

df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

Timings are clear

%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]

40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)



%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values

730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

answered Nov 18 '18 at 19:27

RafaelC

26.4k82851

add a comment |

Try the following code (application of a function along column axis (1):

import pandas as pd

import re



# Source data

df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],

    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })

pat = re.compile(r',s*')

df['C'] = df.apply(lambda x: ', '.join(

    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)

The result is:

                 A                 B          C

0  Lisa, John, Sam  Lisa, Peter, Sam       John

1  Lisa, John, Sam       Lisa, Peter  John, Sam

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

add a comment |

Try the following code (application of a function along column axis (1):

import pandas as pd

import re



# Source data

df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],

    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })

pat = re.compile(r',s*')

df['C'] = df.apply(lambda x: ', '.join(

    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)

The result is:

                 A                 B          C

0  Lisa, John, Sam  Lisa, Peter, Sam       John

1  Lisa, John, Sam       Lisa, Peter  John, Sam

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

add a comment |

Try the following code (application of a function along column axis (1):

import pandas as pd

import re



# Source data

df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],

    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })

pat = re.compile(r',s*')

df['C'] = df.apply(lambda x: ', '.join(

    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)

The result is:

                 A                 B          C

0  Lisa, John, Sam  Lisa, Peter, Sam       John

1  Lisa, John, Sam       Lisa, Peter  John, Sam

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

Try the following code (application of a function along column axis (1):

import pandas as pd

import re



# Source data

df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],

    'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })

pat = re.compile(r',s*')

df['C'] = df.apply(lambda x: ', '.join(

    set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)

The result is:

                 A                 B          C

0  Lisa, John, Sam  Lisa, Peter, Sam       John

1  Lisa, John, Sam       Lisa, Peter  John, Sam

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

edited Nov 18 '18 at 20:55

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

answered Nov 18 '18 at 20:05

Valdi_Bo

4,7202815

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Agfdhyk