Compare items separated by comma in two column
I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.
I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B
I'll appreciate any help with this.
Thank you.
python pandas
add a comment |
I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.
I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B
I'll appreciate any help with this.
Thank you.
python pandas
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result ofdf.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.
– SpghttCd
Nov 18 '18 at 19:24
add a comment |
I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.
I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B
I'll appreciate any help with this.
Thank you.
python pandas
I have a table with two columns, columns A and column B. Each column have items separated by a comma as shown below.
I would like to create a third column (column C) which returns the items that exists in column A but does not exist in column B
I'll appreciate any help with this.
Thank you.
python pandas
python pandas
asked Nov 18 '18 at 19:17
UJAYUJAY
374
374
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result ofdf.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.
– SpghttCd
Nov 18 '18 at 19:24
add a comment |
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result ofdf.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.
– SpghttCd
Nov 18 '18 at 19:24
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of
df.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.– SpghttCd
Nov 18 '18 at 19:24
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of
df.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.– SpghttCd
Nov 18 '18 at 19:24
add a comment |
2 Answers
2
active
oldest
votes
You can use set intersection. Notice that the performance will not be good if you use pandas
, but it is possible
inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')
I'd suggest a pure Python approach, though.
df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
Timings are clear
%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
add a comment |
Try the following code (application of a function along column axis (1):
import pandas as pd
import re
# Source data
df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
pat = re.compile(r',s*')
df['C'] = df.apply(lambda x: ', '.join(
set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)
The result is:
A B C
0 Lisa, John, Sam Lisa, Peter, Sam John
1 Lisa, John, Sam Lisa, Peter John, Sam
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53364570%2fcompare-items-separated-by-comma-in-two-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can use set intersection. Notice that the performance will not be good if you use pandas
, but it is possible
inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')
I'd suggest a pure Python approach, though.
df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
Timings are clear
%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
add a comment |
You can use set intersection. Notice that the performance will not be good if you use pandas
, but it is possible
inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')
I'd suggest a pure Python approach, though.
df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
Timings are clear
%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
add a comment |
You can use set intersection. Notice that the performance will not be good if you use pandas
, but it is possible
inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')
I'd suggest a pure Python approach, though.
df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
Timings are clear
%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use set intersection. Notice that the performance will not be good if you use pandas
, but it is possible
inter = ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
df['C'] = inter.str.join(',')
I'd suggest a pure Python approach, though.
df['C'] = [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
Timings are clear
%timeit [','.join(set(a.split(',')) - set(b.split(','))) for a,b in zip(ds.A, ds.B)]
40.4 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit ds.A.str.split(',').apply(set) - ds.B.str.split(',').apply(set).values
730 µs ± 27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
answered Nov 18 '18 at 19:27
RafaelCRafaelC
26.4k82851
26.4k82851
add a comment |
add a comment |
Try the following code (application of a function along column axis (1):
import pandas as pd
import re
# Source data
df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
pat = re.compile(r',s*')
df['C'] = df.apply(lambda x: ', '.join(
set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)
The result is:
A B C
0 Lisa, John, Sam Lisa, Peter, Sam John
1 Lisa, John, Sam Lisa, Peter John, Sam
add a comment |
Try the following code (application of a function along column axis (1):
import pandas as pd
import re
# Source data
df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
pat = re.compile(r',s*')
df['C'] = df.apply(lambda x: ', '.join(
set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)
The result is:
A B C
0 Lisa, John, Sam Lisa, Peter, Sam John
1 Lisa, John, Sam Lisa, Peter John, Sam
add a comment |
Try the following code (application of a function along column axis (1):
import pandas as pd
import re
# Source data
df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
pat = re.compile(r',s*')
df['C'] = df.apply(lambda x: ', '.join(
set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)
The result is:
A B C
0 Lisa, John, Sam Lisa, Peter, Sam John
1 Lisa, John, Sam Lisa, Peter John, Sam
Try the following code (application of a function along column axis (1):
import pandas as pd
import re
# Source data
df = pd.DataFrame( data={'A': [ 'Lisa, John, Sam', 'Lisa, John, Sam' ],
'B': [ 'Lisa, Peter, Sam', 'Lisa, Peter' ] })
pat = re.compile(r',s*')
df['C'] = df.apply(lambda x: ', '.join(
set(re.split(pat, x.A)) - set(re.split(pat, x.B))), axis=1)
The result is:
A B C
0 Lisa, John, Sam Lisa, Peter, Sam John
1 Lisa, John, Sam Lisa, Peter John, Sam
edited Nov 18 '18 at 20:55
answered Nov 18 '18 at 20:05
Valdi_BoValdi_Bo
4,7202815
4,7202815
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53364570%2fcompare-items-separated-by-comma-in-two-column%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Please don't post images of data. Where do you hold the data, is it already a pandas dataframe or are we looking at an Excel screenshot, is it a text file representation..... Please clarify. If dataframe, post result of
df.head()
(as text, formatted as code, i.e. indented) if textfile, post the first few lines, in the same way. Thank you.– SpghttCd
Nov 18 '18 at 19:24