pandas: fill nan values based on another column with duplicate values

up vote
1
down vote

favorite

I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine, however I'm trying to speed up the process of filling in the NaN values after the merge.

The data:

df_a = pd.DataFrame(data={

    'id': [1, 2, 3, 1, 6, 5, 3],

    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],

    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],

    'Q1': [11, 22, 33, 11, 66, 55, 33],

    'Q2': [111, 222, 333, 111, 666, 555, 333]})



df_b = pd.DataFrame(data={

    'id': [5, 2, 4, 1, 3, 1, 7, 3],

    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],

    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],

    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],

    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})



df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',

                  suffixes=['_a', '_b'])

Output:



       id   name length  Q1   Q2

    0   1    NaN     l1  11  111

    1   2    two     l2  22  222

    2   3  three     l3  33  333

    3   1    one     l1  11  111

    4   6    NaN     l6  66  666

    5   5    NaN    NaN  55  555

    6   3    NaN    NaN  33  333



       id       name    length    Q1    Q2

    0   5       five        l5  2055  2055

    1   2        two        l2  2022  2022

    2   4       four        l4  2044  2044

    3   1  one_extra        l1  2011  2011

    4   3      three        l3  2033  2033

    5   1        one        l1  2011  2011

    6   7      seven        l7  2077  2077

    7   3      three  l3_extra  2033  2033



        id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0    1        NaN        l1  11.0  111.0     NaN     NaN

    3    1        one        l1  11.0  111.0  2011.0  2011.0

    9    1  one_extra        l1   NaN    NaN  2011.0  2011.0

    1    2        two        l2  22.0  222.0  2022.0  2022.0

    2    3      three        l3  33.0  333.0  2033.0  2033.0

    6    3        NaN       NaN  33.0  333.0     NaN     NaN

    11   3      three  l3_extra   NaN    NaN  2033.0  2033.0

    8    4       four        l4   NaN    NaN  2044.0  2044.0

    5    5        NaN       NaN  55.0  555.0     NaN     NaN

    7    5       five        l5   NaN    NaN  2055.0  2055.0

    4    6        NaN        l6  66.0  666.0     NaN     NaN

    10   7      seven        l7   NaN    NaN  2077.0  2077.0

The Endpoint

I'm trying to get df_m into a state where any NaN values are filled in based on their values from a corresponding id column. Any time there's a duplicate id, I want to keep both values, so that I end up with the following df:

   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0   1        one        l1  11.0  111.0  2011.0  2011.0

1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

2   2        two        l2  22.0  222.0  2022.0  2022.0

3   3      three        l3  33.0  333.0  2033.0  2033.0

4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

5   4       four        l4   NaN    NaN  2044.0  2044.0

6   5       five        l5  55.0  555.0  2055.0  2055.0

7   6        NaN        l6  66.0  666.0     NaN     NaN

8   7      seven        l7   NaN    NaN  2077.0  2077.0

Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.

My solution so far

The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:

dupe_ids = pd.value_counts(df_m.id)[

    pd.value_counts(df_m.id) > 1].index.values



for dupe_id in dupe_ids:

    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()



df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)

I tried vectorising the code, as follows, but the results were strange (see id=4).

df_m.sort_values(by='id').groupby(

    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)



Output:



       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0   1        one        l1  11.0  111.0  2011.0  2011.0

    1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

    2   2        two        l2  22.0  222.0  2022.0  2022.0

    3   3      three        l3  33.0  333.0  2033.0  2033.0

    4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

    5   4       four        l4  55.0  555.0  2044.0  2044.0

    6   5       five        l5  55.0  555.0  2055.0  2055.0

    7   6      seven        l6  66.0  666.0  2077.0  2077.0

    8   7      seven        l7   NaN    NaN  2077.0  2077.0

The two tables I'm actually trying to merge are quite a bit more complex, so the example above leads to much more strange behaviour than is seen there. There are over 100 columns, and different columns can have NaN values in different places.

Any help would be greatly appreciated!

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

add a comment |

up vote
1
down vote

favorite

The data:

df_a = pd.DataFrame(data={

    'id': [1, 2, 3, 1, 6, 5, 3],

    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],

    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],

    'Q1': [11, 22, 33, 11, 66, 55, 33],

    'Q2': [111, 222, 333, 111, 666, 555, 333]})



df_b = pd.DataFrame(data={

    'id': [5, 2, 4, 1, 3, 1, 7, 3],

    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],

    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],

    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],

    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})



df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',

                  suffixes=['_a', '_b'])

Output:



       id   name length  Q1   Q2

    0   1    NaN     l1  11  111

    1   2    two     l2  22  222

    2   3  three     l3  33  333

    3   1    one     l1  11  111

    4   6    NaN     l6  66  666

    5   5    NaN    NaN  55  555

    6   3    NaN    NaN  33  333



       id       name    length    Q1    Q2

    0   5       five        l5  2055  2055

    1   2        two        l2  2022  2022

    2   4       four        l4  2044  2044

    3   1  one_extra        l1  2011  2011

    4   3      three        l3  2033  2033

    5   1        one        l1  2011  2011

    6   7      seven        l7  2077  2077

    7   3      three  l3_extra  2033  2033



        id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0    1        NaN        l1  11.0  111.0     NaN     NaN

    3    1        one        l1  11.0  111.0  2011.0  2011.0

    9    1  one_extra        l1   NaN    NaN  2011.0  2011.0

    1    2        two        l2  22.0  222.0  2022.0  2022.0

    2    3      three        l3  33.0  333.0  2033.0  2033.0

    6    3        NaN       NaN  33.0  333.0     NaN     NaN

    11   3      three  l3_extra   NaN    NaN  2033.0  2033.0

    8    4       four        l4   NaN    NaN  2044.0  2044.0

    5    5        NaN       NaN  55.0  555.0     NaN     NaN

    7    5       five        l5   NaN    NaN  2055.0  2055.0

    4    6        NaN        l6  66.0  666.0     NaN     NaN

    10   7      seven        l7   NaN    NaN  2077.0  2077.0

The Endpoint

   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0   1        one        l1  11.0  111.0  2011.0  2011.0

1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

2   2        two        l2  22.0  222.0  2022.0  2022.0

3   3      three        l3  33.0  333.0  2033.0  2033.0

4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

5   4       four        l4   NaN    NaN  2044.0  2044.0

6   5       five        l5  55.0  555.0  2055.0  2055.0

7   6        NaN        l6  66.0  666.0     NaN     NaN

8   7      seven        l7   NaN    NaN  2077.0  2077.0

Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.

My solution so far

The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:

dupe_ids = pd.value_counts(df_m.id)[

    pd.value_counts(df_m.id) > 1].index.values



for dupe_id in dupe_ids:

    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()



df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)

I tried vectorising the code, as follows, but the results were strange (see id=4).

df_m.sort_values(by='id').groupby(

    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)



Output:



       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0   1        one        l1  11.0  111.0  2011.0  2011.0

    1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

    2   2        two        l2  22.0  222.0  2022.0  2022.0

    3   3      three        l3  33.0  333.0  2033.0  2033.0

    4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

    5   4       four        l4  55.0  555.0  2044.0  2044.0

    6   5       five        l5  55.0  555.0  2055.0  2055.0

    7   6      seven        l6  66.0  666.0  2077.0  2077.0

    8   7      seven        l7   NaN    NaN  2077.0  2077.0

Any help would be greatly appreciated!

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

add a comment |

up vote
1
down vote

favorite

The data:

df_a = pd.DataFrame(data={

    'id': [1, 2, 3, 1, 6, 5, 3],

    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],

    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],

    'Q1': [11, 22, 33, 11, 66, 55, 33],

    'Q2': [111, 222, 333, 111, 666, 555, 333]})



df_b = pd.DataFrame(data={

    'id': [5, 2, 4, 1, 3, 1, 7, 3],

    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],

    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],

    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],

    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})



df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',

                  suffixes=['_a', '_b'])

Output:



       id   name length  Q1   Q2

    0   1    NaN     l1  11  111

    1   2    two     l2  22  222

    2   3  three     l3  33  333

    3   1    one     l1  11  111

    4   6    NaN     l6  66  666

    5   5    NaN    NaN  55  555

    6   3    NaN    NaN  33  333



       id       name    length    Q1    Q2

    0   5       five        l5  2055  2055

    1   2        two        l2  2022  2022

    2   4       four        l4  2044  2044

    3   1  one_extra        l1  2011  2011

    4   3      three        l3  2033  2033

    5   1        one        l1  2011  2011

    6   7      seven        l7  2077  2077

    7   3      three  l3_extra  2033  2033



        id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0    1        NaN        l1  11.0  111.0     NaN     NaN

    3    1        one        l1  11.0  111.0  2011.0  2011.0

    9    1  one_extra        l1   NaN    NaN  2011.0  2011.0

    1    2        two        l2  22.0  222.0  2022.0  2022.0

    2    3      three        l3  33.0  333.0  2033.0  2033.0

    6    3        NaN       NaN  33.0  333.0     NaN     NaN

    11   3      three  l3_extra   NaN    NaN  2033.0  2033.0

    8    4       four        l4   NaN    NaN  2044.0  2044.0

    5    5        NaN       NaN  55.0  555.0     NaN     NaN

    7    5       five        l5   NaN    NaN  2055.0  2055.0

    4    6        NaN        l6  66.0  666.0     NaN     NaN

    10   7      seven        l7   NaN    NaN  2077.0  2077.0

The Endpoint

   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0   1        one        l1  11.0  111.0  2011.0  2011.0

1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

2   2        two        l2  22.0  222.0  2022.0  2022.0

3   3      three        l3  33.0  333.0  2033.0  2033.0

4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

5   4       four        l4   NaN    NaN  2044.0  2044.0

6   5       five        l5  55.0  555.0  2055.0  2055.0

7   6        NaN        l6  66.0  666.0     NaN     NaN

8   7      seven        l7   NaN    NaN  2077.0  2077.0

Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.

My solution so far

The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:

dupe_ids = pd.value_counts(df_m.id)[

    pd.value_counts(df_m.id) > 1].index.values



for dupe_id in dupe_ids:

    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()



df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)

I tried vectorising the code, as follows, but the results were strange (see id=4).

df_m.sort_values(by='id').groupby(

    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)



Output:



       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0   1        one        l1  11.0  111.0  2011.0  2011.0

    1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

    2   2        two        l2  22.0  222.0  2022.0  2022.0

    3   3      three        l3  33.0  333.0  2033.0  2033.0

    4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

    5   4       four        l4  55.0  555.0  2044.0  2044.0

    6   5       five        l5  55.0  555.0  2055.0  2055.0

    7   6      seven        l6  66.0  666.0  2077.0  2077.0

    8   7      seven        l7   NaN    NaN  2077.0  2077.0

Any help would be greatly appreciated!

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

The data:

df_a = pd.DataFrame(data={

    'id': [1, 2, 3, 1, 6, 5, 3],

    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],

    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],

    'Q1': [11, 22, 33, 11, 66, 55, 33],

    'Q2': [111, 222, 333, 111, 666, 555, 333]})



df_b = pd.DataFrame(data={

    'id': [5, 2, 4, 1, 3, 1, 7, 3],

    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],

    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],

    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],

    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})



df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',

                  suffixes=['_a', '_b'])

Output:



       id   name length  Q1   Q2

    0   1    NaN     l1  11  111

    1   2    two     l2  22  222

    2   3  three     l3  33  333

    3   1    one     l1  11  111

    4   6    NaN     l6  66  666

    5   5    NaN    NaN  55  555

    6   3    NaN    NaN  33  333



       id       name    length    Q1    Q2

    0   5       five        l5  2055  2055

    1   2        two        l2  2022  2022

    2   4       four        l4  2044  2044

    3   1  one_extra        l1  2011  2011

    4   3      three        l3  2033  2033

    5   1        one        l1  2011  2011

    6   7      seven        l7  2077  2077

    7   3      three  l3_extra  2033  2033



        id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0    1        NaN        l1  11.0  111.0     NaN     NaN

    3    1        one        l1  11.0  111.0  2011.0  2011.0

    9    1  one_extra        l1   NaN    NaN  2011.0  2011.0

    1    2        two        l2  22.0  222.0  2022.0  2022.0

    2    3      three        l3  33.0  333.0  2033.0  2033.0

    6    3        NaN       NaN  33.0  333.0     NaN     NaN

    11   3      three  l3_extra   NaN    NaN  2033.0  2033.0

    8    4       four        l4   NaN    NaN  2044.0  2044.0

    5    5        NaN       NaN  55.0  555.0     NaN     NaN

    7    5       five        l5   NaN    NaN  2055.0  2055.0

    4    6        NaN        l6  66.0  666.0     NaN     NaN

    10   7      seven        l7   NaN    NaN  2077.0  2077.0

The Endpoint

   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0   1        one        l1  11.0  111.0  2011.0  2011.0

1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

2   2        two        l2  22.0  222.0  2022.0  2022.0

3   3      three        l3  33.0  333.0  2033.0  2033.0

4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

5   4       four        l4   NaN    NaN  2044.0  2044.0

6   5       five        l5  55.0  555.0  2055.0  2055.0

7   6        NaN        l6  66.0  666.0     NaN     NaN

8   7      seven        l7   NaN    NaN  2077.0  2077.0

Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.

My solution so far

The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:

dupe_ids = pd.value_counts(df_m.id)[

    pd.value_counts(df_m.id) > 1].index.values



for dupe_id in dupe_ids:

    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()



df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)

I tried vectorising the code, as follows, but the results were strange (see id=4).

df_m.sort_values(by='id').groupby(

    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)



Output:



       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

    0   1        one        l1  11.0  111.0  2011.0  2011.0

    1   1  one_extra        l1  11.0  111.0  2011.0  2011.0

    2   2        two        l2  22.0  222.0  2022.0  2022.0

    3   3      three        l3  33.0  333.0  2033.0  2033.0

    4   3      three  l3_extra  33.0  333.0  2033.0  2033.0

    5   4       four        l4  55.0  555.0  2044.0  2044.0

    6   5       five        l5  55.0  555.0  2055.0  2055.0

    7   6      seven        l6  66.0  666.0  2077.0  2077.0

    8   7      seven        l7   NaN    NaN  2077.0  2077.0

Any help would be greatly appreciated!

python pandas performance pandas-groupby

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

edited Nov 11 at 19:55

jpp

87.1k194999

edited Nov 11 at 19:55

jpp

87.1k194999

edited Nov 11 at 19:55

jpp

87.1k194999

asked Nov 11 at 18:46

KevL

116210

asked Nov 11 at 18:46

KevL

116210

asked Nov 11 at 18:46

KevL

116210

add a comment |

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.

res = df_m.sort_values('id')

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



print(res)



    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0    1        one        l1  11.0  111.0  2011.0  2011.0

9    1  one_extra        l1  11.0  111.0  2011.0  2011.0

1    2        two        l2  22.0  222.0  2022.0  2022.0

2    3      three        l3  33.0  333.0  2033.0  2033.0

11   3      three  l3_extra  33.0  333.0  2033.0  2033.0

8    4       four        l4   NaN    NaN  2044.0  2044.0

5    5       five        l5  55.0  555.0  2055.0  2055.0

4    6        NaN        l6  66.0  666.0     NaN     NaN

10   7      seven        l7   NaN    NaN  2077.0  2077.0

answered Nov 11 at 18:56

jpp

87.1k194999

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

1

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

add a comment |

up vote
0
down vote

Thanks to jpp for his answer.

Just to check, I've timed his solution against the one in my reply to his answer.

Input:

df_m = df_m.append([df_m] * 10000, ignore_index=True) 



%timeit df_r1 = df_m

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()



Output:

114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, it seems with 100,000 rows it's very close.

answered Nov 11 at 21:24

KevL

116210

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53251999%2fpandas-fill-nan-values-based-on-another-column-with-duplicate-values%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.

res = df_m.sort_values('id')

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



print(res)



    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0    1        one        l1  11.0  111.0  2011.0  2011.0

9    1  one_extra        l1  11.0  111.0  2011.0  2011.0

1    2        two        l2  22.0  222.0  2022.0  2022.0

2    3      three        l3  33.0  333.0  2033.0  2033.0

11   3      three  l3_extra  33.0  333.0  2033.0  2033.0

8    4       four        l4   NaN    NaN  2044.0  2044.0

5    5       five        l5  55.0  555.0  2055.0  2055.0

4    6        NaN        l6  66.0  666.0     NaN     NaN

10   7      seven        l7   NaN    NaN  2077.0  2077.0

answered Nov 11 at 18:56

jpp

87.1k194999

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

1

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

add a comment |

up vote
1
down vote

accepted

The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.

res = df_m.sort_values('id')

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



print(res)



    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0    1        one        l1  11.0  111.0  2011.0  2011.0

9    1  one_extra        l1  11.0  111.0  2011.0  2011.0

1    2        two        l2  22.0  222.0  2022.0  2022.0

2    3      three        l3  33.0  333.0  2033.0  2033.0

11   3      three  l3_extra  33.0  333.0  2033.0  2033.0

8    4       four        l4   NaN    NaN  2044.0  2044.0

5    5       five        l5  55.0  555.0  2055.0  2055.0

4    6        NaN        l6  66.0  666.0     NaN     NaN

10   7      seven        l7   NaN    NaN  2077.0  2077.0

answered Nov 11 at 18:56

jpp

87.1k194999

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

1

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

add a comment |

up vote
1
down vote

accepted

The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.

res = df_m.sort_values('id')

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



print(res)



    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0    1        one        l1  11.0  111.0  2011.0  2011.0

9    1  one_extra        l1  11.0  111.0  2011.0  2011.0

1    2        two        l2  22.0  222.0  2022.0  2022.0

2    3      three        l3  33.0  333.0  2033.0  2033.0

11   3      three  l3_extra  33.0  333.0  2033.0  2033.0

8    4       four        l4   NaN    NaN  2044.0  2044.0

5    5       five        l5  55.0  555.0  2055.0  2055.0

4    6        NaN        l6  66.0  666.0     NaN     NaN

10   7      seven        l7   NaN    NaN  2077.0  2077.0

answered Nov 11 at 18:56

jpp

87.1k194999

The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.

res = df_m.sort_values('id')

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



print(res)



    id       name    length  Q1_a   Q2_a    Q1_b    Q2_b

0    1        one        l1  11.0  111.0  2011.0  2011.0

9    1  one_extra        l1  11.0  111.0  2011.0  2011.0

1    2        two        l2  22.0  222.0  2022.0  2022.0

2    3      three        l3  33.0  333.0  2033.0  2033.0

11   3      three  l3_extra  33.0  333.0  2033.0  2033.0

8    4       four        l4   NaN    NaN  2044.0  2044.0

5    5       five        l5  55.0  555.0  2055.0  2055.0

4    6        NaN        l6  66.0  666.0     NaN     NaN

10   7      seven        l7   NaN    NaN  2077.0  2077.0

answered Nov 11 at 18:56

jpp

87.1k194999

answered Nov 11 at 18:56

jpp

87.1k194999

answered Nov 11 at 18:56

jpp

87.1k194999

answered Nov 11 at 18:56

jpp

87.1k194999

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

1

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

add a comment |

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

1

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
– KevL
Nov 11 at 19:49

@KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
– jpp
Nov 11 at 19:55

add a comment |

up vote
0
down vote

Thanks to jpp for his answer.

Just to check, I've timed his solution against the one in my reply to his answer.

Input:

df_m = df_m.append([df_m] * 10000, ignore_index=True) 



%timeit df_r1 = df_m

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()



Output:

114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, it seems with 100,000 rows it's very close.

answered Nov 11 at 21:24

KevL

116210

add a comment |

up vote
0
down vote

Thanks to jpp for his answer.

Just to check, I've timed his solution against the one in my reply to his answer.

Input:

df_m = df_m.append([df_m] * 10000, ignore_index=True) 



%timeit df_r1 = df_m

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()



Output:

114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, it seems with 100,000 rows it's very close.

answered Nov 11 at 21:24

KevL

116210

add a comment |

up vote
0
down vote

Thanks to jpp for his answer.

Just to check, I've timed his solution against the one in my reply to his answer.

Input:

df_m = df_m.append([df_m] * 10000, ignore_index=True) 



%timeit df_r1 = df_m

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()



Output:

114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, it seems with 100,000 rows it's very close.

answered Nov 11 at 21:24

KevL

116210

Thanks to jpp for his answer.

Just to check, I've timed his solution against the one in my reply to his answer.

Input:

df_m = df_m.append([df_m] * 10000, ignore_index=True) 



%timeit df_r1 = df_m

          .groupby('id').apply(lambda x: x.ffill().bfill())

          .drop_duplicates()



%timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()



Output:

114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, it seems with 100,000 rows it's very close.

answered Nov 11 at 21:24

KevL

116210

answered Nov 11 at 21:24

KevL

116210

answered Nov 11 at 21:24

KevL

116210

answered Nov 11 at 21:24

KevL

116210

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Agfdhyk