pandas: fill nan values based on another column with duplicate values
I have two tables that I am trying to merge, df_a and df_b, but both are missing some values. The merge itself works fine; what I'm trying to speed up is filling in the NaN values after the merge.
The data:
import numpy as np
import pandas as pd

df_a = pd.DataFrame(data={
    'id': [1, 2, 3, 1, 6, 5, 3],
    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
    'Q1': [11, 22, 33, 11, 66, 55, 33],
    'Q2': [111, 222, 333, 111, 666, 555, 333]})

df_b = pd.DataFrame(data={
    'id': [5, 2, 4, 1, 3, 1, 7, 3],
    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})

df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
                  suffixes=['_a', '_b'])
Output (df_a, then df_b, then the merged df_m):
id name length Q1 Q2
0 1 NaN l1 11 111
1 2 two l2 22 222
2 3 three l3 33 333
3 1 one l1 11 111
4 6 NaN l6 66 666
5 5 NaN NaN 55 555
6 3 NaN NaN 33 333
id name length Q1 Q2
0 5 five l5 2055 2055
1 2 two l2 2022 2022
2 4 four l4 2044 2044
3 1 one_extra l1 2011 2011
4 3 three l3 2033 2033
5 1 one l1 2011 2011
6 7 seven l7 2077 2077
7 3 three l3_extra 2033 2033
id name length Q1_a Q2_a Q1_b Q2_b
0 1 NaN l1 11.0 111.0 NaN NaN
3 1 one l1 11.0 111.0 2011.0 2011.0
9 1 one_extra l1 NaN NaN 2011.0 2011.0
1 2 two l2 22.0 222.0 2022.0 2022.0
2 3 three l3 33.0 333.0 2033.0 2033.0
6 3 NaN NaN 33.0 333.0 NaN NaN
11 3 three l3_extra NaN NaN 2033.0 2033.0
8 4 four l4 NaN NaN 2044.0 2044.0
5 5 NaN NaN 55.0 555.0 NaN NaN
7 5 five l5 NaN NaN 2055.0 2055.0
4 6 NaN l6 66.0 666.0 NaN NaN
10 7 seven l7 NaN NaN 2077.0 2077.0
The Endpoint
I'm trying to get df_m into a state where any NaN values are filled in from another row with the same id. Any time an id is duplicated, I want to keep both rows, so that I end up with the following df:
id name length Q1_a Q2_a Q1_b Q2_b
0 1 one l1 11.0 111.0 2011.0 2011.0
1 1 one_extra l1 11.0 111.0 2011.0 2011.0
2 2 two l2 22.0 222.0 2022.0 2022.0
3 3 three l3 33.0 333.0 2033.0 2033.0
4 3 three l3_extra 33.0 333.0 2033.0 2033.0
5 4 four l4 NaN NaN 2044.0 2044.0
6 5 five l5 55.0 555.0 2055.0 2055.0
7 6 NaN l6 66.0 666.0 NaN NaN
8 7 seven l7 NaN NaN 2077.0 2077.0
Note that some NaN values remain where there was an entry in one table but not the other; e.g. there's no entry in df_a for id=4, so Q1_a and Q2_a are NaN for that row.
My solution so far
The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:
dupe_ids = pd.value_counts(df_m.id)[pd.value_counts(df_m.id) > 1].index.values

# Fill each duplicated id's rows from one another, then tidy up.
for dupe_id in dupe_ids:
    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()

df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)
I tried vectorising the code as follows, but the results were strange (see id=4):
df_m.sort_values(by='id').groupby('id').ffill() \
    .bfill().drop_duplicates().reset_index(drop=True)
Output:
id name length Q1_a Q2_a Q1_b Q2_b
0 1 one l1 11.0 111.0 2011.0 2011.0
1 1 one_extra l1 11.0 111.0 2011.0 2011.0
2 2 two l2 22.0 222.0 2022.0 2022.0
3 3 three l3 33.0 333.0 2033.0 2033.0
4 3 three l3_extra 33.0 333.0 2033.0 2033.0
5 4 four l4 55.0 555.0 2044.0 2044.0
6 5 five l5 55.0 555.0 2055.0 2055.0
7 6 seven l6 66.0 666.0 2077.0 2077.0
8 7 seven l7 NaN NaN 2077.0 2077.0
The two tables I'm actually trying to merge are quite a bit more complex, so they lead to even stranger behaviour than the example above. There are over 100 columns, and different columns can have NaN values in different places.
Any help would be greatly appreciated!
python pandas performance pandas-groupby
asked Nov 11 at 18:46 by KevL · edited Nov 11 at 19:55 by jpp
2 Answers
Accepted answer (answered Nov 11 at 18:56 by jpp)
The issue is that you need to perform both the ffill and the bfill groupwise. Your current vectorisation attempt performs only the ffill groupwise; the subsequent bfill runs over the whole frame, which is why values bleed across id groups.
res = (df_m.sort_values('id')
           .groupby('id').apply(lambda x: x.ffill().bfill())
           .drop_duplicates())

print(res)
id name length Q1_a Q2_a Q1_b Q2_b
0 1 one l1 11.0 111.0 2011.0 2011.0
9 1 one_extra l1 11.0 111.0 2011.0 2011.0
1 2 two l2 22.0 222.0 2022.0 2022.0
2 3 three l3 33.0 333.0 2033.0 2033.0
11 3 three l3_extra 33.0 333.0 2033.0 2033.0
8 4 four l4 NaN NaN 2044.0 2044.0
5 5 five l5 55.0 555.0 2055.0 2055.0
4 6 NaN l6 66.0 666.0 NaN NaN
10 7 seven l7 NaN NaN 2077.0 2077.0
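To see why the frame-wide bfill misbehaves, here is a minimal hypothetical illustration (the two-row toy frame below is not from the post, just a sketch): a groupwise ffill cannot fill id 4's NaN, so a subsequent frame-wide bfill borrows the value from the neighbouring id 5 row, which is exactly what happened to id=4 in the question.

import numpy as np
import pandas as pd

# Hypothetical two-row frame: id 4 has a missing Q1_a, id 5 has 55.0.
toy = pd.DataFrame({'id': [4, 5], 'Q1_a': [np.nan, 55.0]})

# Groupwise ffill leaves id 4's NaN untouched (nothing earlier in its group),
# but the frame-wide bfill that follows copies 55.0 up from the id 5 row.
print(toy.groupby('id').ffill().bfill())                     # id 4 -> 55.0

# Doing both fills inside the group keeps the NaN, as in the answer above.
print(toy.groupby('id').apply(lambda x: x.ffill().bfill()))  # id 4 stays NaN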
Ah! Jaysus, I can't believe I missed that. Would the following be a small bit faster, to avoid having to use .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49
@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you to test it and see for yourself.
– jpp
Nov 11 at 19:55
Second answer (answered Nov 11 at 21:24 by KevL)
Thanks to jpp for his answer. Just to check, I've timed his solution against the one from my comment on it.
Input:
df_m = df_m.append([df_m] * 10000, ignore_index=True)

%timeit df_r1 = df_m.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()
Output:
114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems that, at around 100,000 rows, the two are very close.
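As an editorial aside (not part of the original answers), a further alternative that avoids the per-group apply is to broadcast each id group's first non-null value per column with a groupwise transform and use that to fill the gaps. This is only a sketch: it matches the ffill/bfill result when each id carries at most one distinct non-null value per column (as in this example), and the names filled and result are placeholders.

# Sketch: fill each NaN with the first non-null value seen for that column
# within the same id group, then deduplicate and sort as before.
filled = df_m.fillna(df_m.groupby('id').transform('first'))
result = (filled.drop_duplicates()
                .sort_values('id')
                .reset_index(drop=True))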