pandas: fill NaN values based on another column with duplicate values
I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine; however, I'm trying to speed up the process of filling in the NaN values after the merge.
The data:
df_a = pd.DataFrame(data={
    'id': [1, 2, 3, 1, 6, 5, 3],
    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
    'Q1': [11, 22, 33, 11, 66, 55, 33],
    'Q2': [111, 222, 333, 111, 666, 555, 333]})
df_b = pd.DataFrame(data={
    'id': [5, 2, 4, 1, 3, 1, 7, 3],
    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})
df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
                  suffixes=['_a', '_b'])
Output, starting with df_a:
       id   name length  Q1   Q2
    0   1    NaN     l1  11  111
    1   2    two     l2  22  222
    2   3  three     l3  33  333
    3   1    one     l1  11  111
    4   6    NaN     l6  66  666
    5   5    NaN    NaN  55  555
    6   3    NaN    NaN  33  333
df_b:
       id       name    length    Q1    Q2
    0   5       five        l5  2055  2055
    1   2        two        l2  2022  2022
    2   4       four        l4  2044  2044
    3   1  one_extra        l1  2011  2011
    4   3      three        l3  2033  2033
    5   1        one        l1  2011  2011
    6   7      seven        l7  2077  2077
    7   3      three  l3_extra  2033  2033
and the merged df_m:
        id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
    0    1        NaN        l1  11.0  111.0     NaN     NaN
    3    1        one        l1  11.0  111.0  2011.0  2011.0
    9    1  one_extra        l1   NaN    NaN  2011.0  2011.0
    1    2        two        l2  22.0  222.0  2022.0  2022.0
    2    3      three        l3  33.0  333.0  2033.0  2033.0
    6    3        NaN       NaN  33.0  333.0     NaN     NaN
    11   3      three  l3_extra   NaN    NaN  2033.0  2033.0
    8    4       four        l4   NaN    NaN  2044.0  2044.0
    5    5        NaN       NaN  55.0  555.0     NaN     NaN
    7    5       five        l5   NaN    NaN  2055.0  2055.0
    4    6        NaN        l6  66.0  666.0     NaN     NaN
    10   7      seven        l7   NaN    NaN  2077.0  2077.0
The Endpoint
I'm trying to get df_m into a state where any NaN values are filled in using the values from other rows with the same id. Any time there's a duplicate id, I want to keep both rows, so that I end up with the following df:
   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
0   1        one        l1  11.0  111.0  2011.0  2011.0
1   1  one_extra        l1  11.0  111.0  2011.0  2011.0
2   2        two        l2  22.0  222.0  2022.0  2022.0
3   3      three        l3  33.0  333.0  2033.0  2033.0
4   3      three  l3_extra  33.0  333.0  2033.0  2033.0
5   4       four        l4   NaN    NaN  2044.0  2044.0
6   5       five        l5  55.0  555.0  2055.0  2055.0
7   6        NaN        l6  66.0  666.0     NaN     NaN
8   7      seven        l7   NaN    NaN  2077.0  2077.0
Note that some NaN values remain where there was an entry in one table but not the other; e.g. there's no entry in df_a for id=4, so Q1_a and Q2_a are NaN.
My solution so far
The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:
# ids that appear more than once in the merged frame
id_counts = df_m['id'].value_counts()
dupe_ids = id_counts[id_counts > 1].index.values
# fill each duplicated id's rows from its sibling rows, in both directions
for dupe_id in dupe_ids:
    mask = df_m['id'] == dupe_id
    df_m.loc[mask] = df_m.loc[mask].ffill().bfill()
df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)
I tried vectorising the code, as follows, but the results were strange (see id=4).
df_m.sort_values(by='id').groupby(
    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)
Output:
       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
    0   1        one        l1  11.0  111.0  2011.0  2011.0
    1   1  one_extra        l1  11.0  111.0  2011.0  2011.0
    2   2        two        l2  22.0  222.0  2022.0  2022.0
    3   3      three        l3  33.0  333.0  2033.0  2033.0
    4   3      three  l3_extra  33.0  333.0  2033.0  2033.0
    5   4       four        l4  55.0  555.0  2044.0  2044.0
    6   5       five        l5  55.0  555.0  2055.0  2055.0
    7   6      seven        l6  66.0  666.0  2077.0  2077.0
    8   7      seven        l7   NaN    NaN  2077.0  2077.0
The two tables I'm actually trying to merge are quite a bit more complex, and they produce much stranger behaviour than the example above shows. There are over 100 columns, and different columns can have NaN values in different places.
Any help would be greatly appreciated!
python pandas performance pandas-groupby
asked Nov 11 at 18:46 by KevL · edited Nov 11 at 19:55 by jpp
2 Answers
Accepted answer (score 1), answered Nov 11 at 18:56 by jpp
The issue is that you need to perform the ffill + bfill both groupwise. Your current vectorisation attempt performs only the ffill groupwise; the subsequent bfill runs on the whole frame, which is why values bleed across ids (hence the strange id=4 row).
res = (df_m.sort_values('id')
           .groupby('id').apply(lambda x: x.ffill().bfill())
           .drop_duplicates())
print(res)
    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
0    1        one        l1  11.0  111.0  2011.0  2011.0
9    1  one_extra        l1  11.0  111.0  2011.0  2011.0
1    2        two        l2  22.0  222.0  2022.0  2022.0
2    3      three        l3  33.0  333.0  2033.0  2033.0
11   3      three  l3_extra  33.0  333.0  2033.0  2033.0
8    4       four        l4   NaN    NaN  2044.0  2044.0
5    5       five        l5  55.0  555.0  2055.0  2055.0
4    6        NaN        l6  66.0  666.0     NaN     NaN
10   7      seven        l7   NaN    NaN  2077.0  2077.0
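For reference, a sketch of an .apply()-free variant (not part of the original answer; it reuses df_m from the question and groups on an explicit id Series, so the 'id' column is carried through both fills regardless of whether a given pandas version keeps or drops the grouping column):
# Sketch: grouping by the id Series (rather than the column name) means
# every column, including 'id', comes back from each groupwise fill.
df_s = df_m.sort_values('id')
res = (df_s.groupby(df_s['id']).ffill()   # forward fill within each id
           .groupby(df_s['id']).bfill()   # backward fill within each id
           .drop_duplicates()
           .reset_index(drop=True))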
Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to use .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49
@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55
Answer (score 0), answered Nov 11 at 21:24 by KevL
Thanks to jpp for his answer.
Just to check, I've timed his solution against the one in my comment on his answer.
Input:
df_m = df_m.append([df_m] * 10000, ignore_index=True)
%timeit df_r1 = df_m.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()
Output:
114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, it seems with 100,000 rows it's very close.
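(Side note for anyone rerunning this today: DataFrame.append was removed in pandas 2.0, so the setup line needs pd.concat instead. A sketch of the equivalent:)
# DataFrame.append was removed in pandas 2.0; this builds the same
# benchmark frame (the original 12 rows plus 10,000 copies, ~120k rows).
df_m = pd.concat([df_m] * 10001, ignore_index=True)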