pandas: fill nan values based on another column with duplicate values











up vote
1
down vote

favorite
1












I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine, however I'm trying to speed up the process of filling in the NaN values after the merge.



The data:



df_a = pd.DataFrame(data={
'id': [1, 2, 3, 1, 6, 5, 3],
'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
'Q1': [11, 22, 33, 11, 66, 55, 33],
'Q2': [111, 222, 333, 111, 666, 555, 333]})

df_b = pd.DataFrame(data={
'id': [5, 2, 4, 1, 3, 1, 7, 3],
'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})

df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
suffixes=['_a', '_b'])
Output:

id name length Q1 Q2
0 1 NaN l1 11 111
1 2 two l2 22 222
2 3 three l3 33 333
3 1 one l1 11 111
4 6 NaN l6 66 666
5 5 NaN NaN 55 555
6 3 NaN NaN 33 333

id name length Q1 Q2
0 5 five l5 2055 2055
1 2 two l2 2022 2022
2 4 four l4 2044 2044
3 1 one_extra l1 2011 2011
4 3 three l3 2033 2033
5 1 one l1 2011 2011
6 7 seven l7 2077 2077
7 3 three l3_extra 2033 2033

id name length Q1_a Q2_a Q1_b Q2_b
0 1 NaN l1 11.0 111.0 NaN NaN
3 1 one l1 11.0 111.0 2011.0 2011.0
9 1 one_extra l1 NaN NaN 2011.0 2011.0
1 2 two l2 22.0 222.0 2022.0 2022.0
2 3 three l3 33.0 333.0 2033.0 2033.0
6 3 NaN NaN 33.0 333.0 NaN NaN
11 3 three l3_extra NaN NaN 2033.0 2033.0
8 4 four l4 NaN NaN 2044.0 2044.0
5 5 NaN NaN 55.0 555.0 NaN NaN
7 5 five l5 NaN NaN 2055.0 2055.0
4 6 NaN l6 66.0 666.0 NaN NaN
10 7 seven l7 NaN NaN 2077.0 2077.0


The Endpoint



I'm trying to get df_m into a state where any NaN values are filled in based on their values from a corresponding id column. Any time there's a duplicate id, I want to keep both values, so that I end up with the following df:



   id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
0 1 one l1 11.0 111.0 2011.0 2011.0
1 1 one_extra l1 11.0 111.0 2011.0 2011.0
2 2 two l2 22.0 222.0 2022.0 2022.0
3 3 three l3 33.0 333.0 2033.0 2033.0
4 3 three l3_extra 33.0 333.0 2033.0 2033.0
5 4 four l4 NaN NaN 2044.0 2044.0
6 5 five l5 55.0 555.0 2055.0 2055.0
7 6 NaN l6 66.0 666.0 NaN NaN
8 7 seven l7 NaN NaN 2077.0 2077.0


Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.



My solution so far



The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:



dupe_ids = pd.value_counts(df_m.id)[
pd.value_counts(df_m.id) > 1].index.values

for dupe_id in dupe_ids:
df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()

df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)


I tried vectorising the code, as follows, but the results were strange (see id=4).



df_m.sort_values(by='id').groupby(
'id').ffill().bfill().drop_duplicates().reset_index(drop=True)

Output:

id name length Q1_a Q2_a Q1_b Q2_b
0 1 one l1 11.0 111.0 2011.0 2011.0
1 1 one_extra l1 11.0 111.0 2011.0 2011.0
2 2 two l2 22.0 222.0 2022.0 2022.0
3 3 three l3 33.0 333.0 2033.0 2033.0
4 3 three l3_extra 33.0 333.0 2033.0 2033.0
5 4 four l4 55.0 555.0 2044.0 2044.0
6 5 five l5 55.0 555.0 2055.0 2055.0
7 6 seven l6 66.0 666.0 2077.0 2077.0
8 7 seven l7 NaN NaN 2077.0 2077.0


The two tables I'm actually trying to merge are quite a bit more complex, so the example above leads to much more strange behaviour than is seen there. There are over 100 columns, and different columns can have NaN values in different places.



Any help would be greatly appreciated!










share|improve this question




























    up vote
    1
    down vote

    favorite
    1












    I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine, however I'm trying to speed up the process of filling in the NaN values after the merge.



    The data:



    df_a = pd.DataFrame(data={
    'id': [1, 2, 3, 1, 6, 5, 3],
    'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
    'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
    'Q1': [11, 22, 33, 11, 66, 55, 33],
    'Q2': [111, 222, 333, 111, 666, 555, 333]})

    df_b = pd.DataFrame(data={
    'id': [5, 2, 4, 1, 3, 1, 7, 3],
    'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
    'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
    'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
    'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})

    df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
    suffixes=['_a', '_b'])
    Output:

    id name length Q1 Q2
    0 1 NaN l1 11 111
    1 2 two l2 22 222
    2 3 three l3 33 333
    3 1 one l1 11 111
    4 6 NaN l6 66 666
    5 5 NaN NaN 55 555
    6 3 NaN NaN 33 333

    id name length Q1 Q2
    0 5 five l5 2055 2055
    1 2 two l2 2022 2022
    2 4 four l4 2044 2044
    3 1 one_extra l1 2011 2011
    4 3 three l3 2033 2033
    5 1 one l1 2011 2011
    6 7 seven l7 2077 2077
    7 3 three l3_extra 2033 2033

    id name length Q1_a Q2_a Q1_b Q2_b
    0 1 NaN l1 11.0 111.0 NaN NaN
    3 1 one l1 11.0 111.0 2011.0 2011.0
    9 1 one_extra l1 NaN NaN 2011.0 2011.0
    1 2 two l2 22.0 222.0 2022.0 2022.0
    2 3 three l3 33.0 333.0 2033.0 2033.0
    6 3 NaN NaN 33.0 333.0 NaN NaN
    11 3 three l3_extra NaN NaN 2033.0 2033.0
    8 4 four l4 NaN NaN 2044.0 2044.0
    5 5 NaN NaN 55.0 555.0 NaN NaN
    7 5 five l5 NaN NaN 2055.0 2055.0
    4 6 NaN l6 66.0 666.0 NaN NaN
    10 7 seven l7 NaN NaN 2077.0 2077.0


    The Endpoint



    I'm trying to get df_m into a state where any NaN values are filled in based on their values from a corresponding id column. Any time there's a duplicate id, I want to keep both values, so that I end up with the following df:



       id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
    0 1 one l1 11.0 111.0 2011.0 2011.0
    1 1 one_extra l1 11.0 111.0 2011.0 2011.0
    2 2 two l2 22.0 222.0 2022.0 2022.0
    3 3 three l3 33.0 333.0 2033.0 2033.0
    4 3 three l3_extra 33.0 333.0 2033.0 2033.0
    5 4 four l4 NaN NaN 2044.0 2044.0
    6 5 five l5 55.0 555.0 2055.0 2055.0
    7 6 NaN l6 66.0 666.0 NaN NaN
    8 7 seven l7 NaN NaN 2077.0 2077.0


    Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.



    My solution so far



    The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:



    dupe_ids = pd.value_counts(df_m.id)[
    pd.value_counts(df_m.id) > 1].index.values

    for dupe_id in dupe_ids:
    df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()

    df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)


    I tried vectorising the code, as follows, but the results were strange (see id=4).



    df_m.sort_values(by='id').groupby(
    'id').ffill().bfill().drop_duplicates().reset_index(drop=True)

    Output:

    id name length Q1_a Q2_a Q1_b Q2_b
    0 1 one l1 11.0 111.0 2011.0 2011.0
    1 1 one_extra l1 11.0 111.0 2011.0 2011.0
    2 2 two l2 22.0 222.0 2022.0 2022.0
    3 3 three l3 33.0 333.0 2033.0 2033.0
    4 3 three l3_extra 33.0 333.0 2033.0 2033.0
    5 4 four l4 55.0 555.0 2044.0 2044.0
    6 5 five l5 55.0 555.0 2055.0 2055.0
    7 6 seven l6 66.0 666.0 2077.0 2077.0
    8 7 seven l7 NaN NaN 2077.0 2077.0


    The two tables I'm actually trying to merge are quite a bit more complex, so the example above leads to much more strange behaviour than is seen there. There are over 100 columns, and different columns can have NaN values in different places.



    Any help would be greatly appreciated!










    share|improve this question


























      up vote
      1
      down vote

      favorite
      1









      up vote
      1
      down vote

      favorite
      1






      1





      I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine, however I'm trying to speed up the process of filling in the NaN values after the merge.



      The data:



      df_a = pd.DataFrame(data={
      'id': [1, 2, 3, 1, 6, 5, 3],
      'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
      'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
      'Q1': [11, 22, 33, 11, 66, 55, 33],
      'Q2': [111, 222, 333, 111, 666, 555, 333]})

      df_b = pd.DataFrame(data={
      'id': [5, 2, 4, 1, 3, 1, 7, 3],
      'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
      'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
      'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
      'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})

      df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
      suffixes=['_a', '_b'])
      Output:

      id name length Q1 Q2
      0 1 NaN l1 11 111
      1 2 two l2 22 222
      2 3 three l3 33 333
      3 1 one l1 11 111
      4 6 NaN l6 66 666
      5 5 NaN NaN 55 555
      6 3 NaN NaN 33 333

      id name length Q1 Q2
      0 5 five l5 2055 2055
      1 2 two l2 2022 2022
      2 4 four l4 2044 2044
      3 1 one_extra l1 2011 2011
      4 3 three l3 2033 2033
      5 1 one l1 2011 2011
      6 7 seven l7 2077 2077
      7 3 three l3_extra 2033 2033

      id name length Q1_a Q2_a Q1_b Q2_b
      0 1 NaN l1 11.0 111.0 NaN NaN
      3 1 one l1 11.0 111.0 2011.0 2011.0
      9 1 one_extra l1 NaN NaN 2011.0 2011.0
      1 2 two l2 22.0 222.0 2022.0 2022.0
      2 3 three l3 33.0 333.0 2033.0 2033.0
      6 3 NaN NaN 33.0 333.0 NaN NaN
      11 3 three l3_extra NaN NaN 2033.0 2033.0
      8 4 four l4 NaN NaN 2044.0 2044.0
      5 5 NaN NaN 55.0 555.0 NaN NaN
      7 5 five l5 NaN NaN 2055.0 2055.0
      4 6 NaN l6 66.0 666.0 NaN NaN
      10 7 seven l7 NaN NaN 2077.0 2077.0


      The Endpoint



      I'm trying to get df_m into a state where any NaN values are filled in based on their values from a corresponding id column. Any time there's a duplicate id, I want to keep both values, so that I end up with the following df:



         id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
      0 1 one l1 11.0 111.0 2011.0 2011.0
      1 1 one_extra l1 11.0 111.0 2011.0 2011.0
      2 2 two l2 22.0 222.0 2022.0 2022.0
      3 3 three l3 33.0 333.0 2033.0 2033.0
      4 3 three l3_extra 33.0 333.0 2033.0 2033.0
      5 4 four l4 NaN NaN 2044.0 2044.0
      6 5 five l5 55.0 555.0 2055.0 2055.0
      7 6 NaN l6 66.0 666.0 NaN NaN
      8 7 seven l7 NaN NaN 2077.0 2077.0


      Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.



      My solution so far



      The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:



      dupe_ids = pd.value_counts(df_m.id)[
      pd.value_counts(df_m.id) > 1].index.values

      for dupe_id in dupe_ids:
      df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()

      df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)


      I tried vectorising the code, as follows, but the results were strange (see id=4).



      df_m.sort_values(by='id').groupby(
      'id').ffill().bfill().drop_duplicates().reset_index(drop=True)

      Output:

      id name length Q1_a Q2_a Q1_b Q2_b
      0 1 one l1 11.0 111.0 2011.0 2011.0
      1 1 one_extra l1 11.0 111.0 2011.0 2011.0
      2 2 two l2 22.0 222.0 2022.0 2022.0
      3 3 three l3 33.0 333.0 2033.0 2033.0
      4 3 three l3_extra 33.0 333.0 2033.0 2033.0
      5 4 four l4 55.0 555.0 2044.0 2044.0
      6 5 five l5 55.0 555.0 2055.0 2055.0
      7 6 seven l6 66.0 666.0 2077.0 2077.0
      8 7 seven l7 NaN NaN 2077.0 2077.0


      The two tables I'm actually trying to merge are quite a bit more complex, so the example above leads to much more strange behaviour than is seen there. There are over 100 columns, and different columns can have NaN values in different places.



      Any help would be greatly appreciated!










      share|improve this question















      I have two tables which I am trying to merge, df_a and df_b, but both are missing some values. The merge happens fine, however I'm trying to speed up the process of filling in the NaN values after the merge.



      The data:



      df_a = pd.DataFrame(data={
      'id': [1, 2, 3, 1, 6, 5, 3],
      'name': [np.nan, 'two', 'three', 'one', np.nan, np.nan, np.nan],
      'length': ['l1', 'l2', 'l3', 'l1', 'l6', np.nan, np.nan],
      'Q1': [11, 22, 33, 11, 66, 55, 33],
      'Q2': [111, 222, 333, 111, 666, 555, 333]})

      df_b = pd.DataFrame(data={
      'id': [5, 2, 4, 1, 3, 1, 7, 3],
      'name': ['five', 'two', 'four', 'one_extra', 'three', 'one', 'seven', 'three'],
      'length': ['l5', 'l2', 'l4', 'l1', 'l3', 'l1', 'l7', 'l3_extra'],
      'Q1': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033],
      'Q2': [2055, 2022, 2044, 2011, 2033, 2011, 2077, 2033]})

      df_m = df_a.merge(df_b, on=['id', 'name', 'length'], how='outer',
      suffixes=['_a', '_b'])
      Output:

      id name length Q1 Q2
      0 1 NaN l1 11 111
      1 2 two l2 22 222
      2 3 three l3 33 333
      3 1 one l1 11 111
      4 6 NaN l6 66 666
      5 5 NaN NaN 55 555
      6 3 NaN NaN 33 333

      id name length Q1 Q2
      0 5 five l5 2055 2055
      1 2 two l2 2022 2022
      2 4 four l4 2044 2044
      3 1 one_extra l1 2011 2011
      4 3 three l3 2033 2033
      5 1 one l1 2011 2011
      6 7 seven l7 2077 2077
      7 3 three l3_extra 2033 2033

      id name length Q1_a Q2_a Q1_b Q2_b
      0 1 NaN l1 11.0 111.0 NaN NaN
      3 1 one l1 11.0 111.0 2011.0 2011.0
      9 1 one_extra l1 NaN NaN 2011.0 2011.0
      1 2 two l2 22.0 222.0 2022.0 2022.0
      2 3 three l3 33.0 333.0 2033.0 2033.0
      6 3 NaN NaN 33.0 333.0 NaN NaN
      11 3 three l3_extra NaN NaN 2033.0 2033.0
      8 4 four l4 NaN NaN 2044.0 2044.0
      5 5 NaN NaN 55.0 555.0 NaN NaN
      7 5 five l5 NaN NaN 2055.0 2055.0
      4 6 NaN l6 66.0 666.0 NaN NaN
      10 7 seven l7 NaN NaN 2077.0 2077.0


      The Endpoint



      I'm trying to get df_m into a state where any NaN values are filled in based on their values from a corresponding id column. Any time there's a duplicate id, I want to keep both values, so that I end up with the following df:



         id       name    length  Q1_a   Q2_a    Q1_b    Q2_b
      0 1 one l1 11.0 111.0 2011.0 2011.0
      1 1 one_extra l1 11.0 111.0 2011.0 2011.0
      2 2 two l2 22.0 222.0 2022.0 2022.0
      3 3 three l3 33.0 333.0 2033.0 2033.0
      4 3 three l3_extra 33.0 333.0 2033.0 2033.0
      5 4 four l4 NaN NaN 2044.0 2044.0
      6 5 five l5 55.0 555.0 2055.0 2055.0
      7 6 NaN l6 66.0 666.0 NaN NaN
      8 7 seven l7 NaN NaN 2077.0 2077.0


      Note that there are some NaN values, where there was an entry in one table, and not in the other, e.g. there's no entry in df_a for id=4, so there's NaNs at Q1_a and Q2_a.



      My solution so far



      The following works, but I'd like to vectorise this code, as it takes more than 4 seconds with only ~200 rows:



      dupe_ids = pd.value_counts(df_m.id)[
      pd.value_counts(df_m.id) > 1].index.values

      for dupe_id in dupe_ids:
      df_m.loc[df_m.id == dupe_id] = df_m.loc[df_m.id == dupe_id].ffill().bfill()

      df_m = df_m.drop_duplicates().sort_values(by='id').reset_index(drop=True)


      I tried vectorising the code, as follows, but the results were strange (see id=4).



      df_m.sort_values(by='id').groupby(
      'id').ffill().bfill().drop_duplicates().reset_index(drop=True)

      Output:

      id name length Q1_a Q2_a Q1_b Q2_b
      0 1 one l1 11.0 111.0 2011.0 2011.0
      1 1 one_extra l1 11.0 111.0 2011.0 2011.0
      2 2 two l2 22.0 222.0 2022.0 2022.0
      3 3 three l3 33.0 333.0 2033.0 2033.0
      4 3 three l3_extra 33.0 333.0 2033.0 2033.0
      5 4 four l4 55.0 555.0 2044.0 2044.0
      6 5 five l5 55.0 555.0 2055.0 2055.0
      7 6 seven l6 66.0 666.0 2077.0 2077.0
      8 7 seven l7 NaN NaN 2077.0 2077.0


      The two tables I'm actually trying to merge are quite a bit more complex, so the example above leads to much more strange behaviour than is seen there. There are over 100 columns, and different columns can have NaN values in different places.



      Any help would be greatly appreciated!







      python pandas performance pandas-groupby






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 11 at 19:55









      jpp

      87.1k194999




      87.1k194999










      asked Nov 11 at 18:46









      KevL

      116210




      116210
























          2 Answers
          2






          active

          oldest

          votes

















          up vote
          1
          down vote



          accepted










          The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.



          res = df_m.sort_values('id')
          .groupby('id').apply(lambda x: x.ffill().bfill())
          .drop_duplicates()

          print(res)

          id name length Q1_a Q2_a Q1_b Q2_b
          0 1 one l1 11.0 111.0 2011.0 2011.0
          9 1 one_extra l1 11.0 111.0 2011.0 2011.0
          1 2 two l2 22.0 222.0 2022.0 2022.0
          2 3 three l3 33.0 333.0 2033.0 2033.0
          11 3 three l3_extra 33.0 333.0 2033.0 2033.0
          8 4 four l4 NaN NaN 2044.0 2044.0
          5 5 five l5 55.0 555.0 2055.0 2055.0
          4 6 NaN l6 66.0 666.0 NaN NaN
          10 7 seven l7 NaN NaN 2077.0 2077.0





          share|improve this answer





















          • Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
            – KevL
            Nov 11 at 19:49








          • 1




            @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
            – jpp
            Nov 11 at 19:55


















          up vote
          0
          down vote













          Thanks to jpp for his answer.



          Just to check, I've timed his solution against the one in my reply to his answer.



          Input:
          df_m = df_m.append([df_m] * 10000, ignore_index=True)

          %timeit df_r1 = df_m
          .groupby('id').apply(lambda x: x.ffill().bfill())
          .drop_duplicates()

          %timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()

          Output:
          114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
          112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


          So, it seems with 100,000 rows it's very close.






          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53251999%2fpandas-fill-nan-values-based-on-another-column-with-duplicate-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            1
            down vote



            accepted










            The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.



            res = df_m.sort_values('id')
            .groupby('id').apply(lambda x: x.ffill().bfill())
            .drop_duplicates()

            print(res)

            id name length Q1_a Q2_a Q1_b Q2_b
            0 1 one l1 11.0 111.0 2011.0 2011.0
            9 1 one_extra l1 11.0 111.0 2011.0 2011.0
            1 2 two l2 22.0 222.0 2022.0 2022.0
            2 3 three l3 33.0 333.0 2033.0 2033.0
            11 3 three l3_extra 33.0 333.0 2033.0 2033.0
            8 4 four l4 NaN NaN 2044.0 2044.0
            5 5 five l5 55.0 555.0 2055.0 2055.0
            4 6 NaN l6 66.0 666.0 NaN NaN
            10 7 seven l7 NaN NaN 2077.0 2077.0





            share|improve this answer





















            • Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
              – KevL
              Nov 11 at 19:49








            • 1




              @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
              – jpp
              Nov 11 at 19:55















            up vote
            1
            down vote



            accepted










            The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.



            res = df_m.sort_values('id')
            .groupby('id').apply(lambda x: x.ffill().bfill())
            .drop_duplicates()

            print(res)

            id name length Q1_a Q2_a Q1_b Q2_b
            0 1 one l1 11.0 111.0 2011.0 2011.0
            9 1 one_extra l1 11.0 111.0 2011.0 2011.0
            1 2 two l2 22.0 222.0 2022.0 2022.0
            2 3 three l3 33.0 333.0 2033.0 2033.0
            11 3 three l3_extra 33.0 333.0 2033.0 2033.0
            8 4 four l4 NaN NaN 2044.0 2044.0
            5 5 five l5 55.0 555.0 2055.0 2055.0
            4 6 NaN l6 66.0 666.0 NaN NaN
            10 7 seven l7 NaN NaN 2077.0 2077.0





            share|improve this answer





















            • Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
              – KevL
              Nov 11 at 19:49








            • 1




              @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
              – jpp
              Nov 11 at 19:55













            up vote
            1
            down vote



            accepted







            up vote
            1
            down vote



            accepted






            The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.



            res = df_m.sort_values('id')
            .groupby('id').apply(lambda x: x.ffill().bfill())
            .drop_duplicates()

            print(res)

            id name length Q1_a Q2_a Q1_b Q2_b
            0 1 one l1 11.0 111.0 2011.0 2011.0
            9 1 one_extra l1 11.0 111.0 2011.0 2011.0
            1 2 two l2 22.0 222.0 2022.0 2022.0
            2 3 three l3 33.0 333.0 2033.0 2033.0
            11 3 three l3_extra 33.0 333.0 2033.0 2033.0
            8 4 four l4 NaN NaN 2044.0 2044.0
            5 5 five l5 55.0 555.0 2055.0 2055.0
            4 6 NaN l6 66.0 666.0 NaN NaN
            10 7 seven l7 NaN NaN 2077.0 2077.0





            share|improve this answer












            The issue is you need to perform a ffill + bfill both groupwise. You current vectorisation attempt performs the ffill groupwise only.



            res = df_m.sort_values('id')
            .groupby('id').apply(lambda x: x.ffill().bfill())
            .drop_duplicates()

            print(res)

            id name length Q1_a Q2_a Q1_b Q2_b
            0 1 one l1 11.0 111.0 2011.0 2011.0
            9 1 one_extra l1 11.0 111.0 2011.0 2011.0
            1 2 two l2 22.0 222.0 2022.0 2022.0
            2 3 three l3 33.0 333.0 2033.0 2033.0
            11 3 three l3_extra 33.0 333.0 2033.0 2033.0
            8 4 four l4 NaN NaN 2044.0 2044.0
            5 5 five l5 55.0 555.0 2055.0 2055.0
            4 6 NaN l6 66.0 666.0 NaN NaN
            10 7 seven l7 NaN NaN 2077.0 2077.0






            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Nov 11 at 18:56









            jpp

            87.1k194999




            87.1k194999












            • Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
              – KevL
              Nov 11 at 19:49








            • 1




              @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
              – jpp
              Nov 11 at 19:55


















            • Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
              – KevL
              Nov 11 at 19:49








            • 1




              @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
              – jpp
              Nov 11 at 19:55
















            Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
            – KevL
            Nov 11 at 19:49






            Ah! Jaysus I can't believe I missed that. Would the following be a small bit faster, to avoid having to sue .apply()? df_m.sort_values(by='id').groupby('id').ffill().groupby('id').bfill()
            – KevL
            Nov 11 at 19:49






            1




            1




            @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
            – jpp
            Nov 11 at 19:55




            @KevL, I don't think it would be faster, since you are repeating the GroupBy operation. But I advise you test it and see for yourself.
            – jpp
            Nov 11 at 19:55












            up vote
            0
            down vote













            Thanks to jpp for his answer.



            Just to check, I've timed his solution against the one in my reply to his answer.



            Input:
            df_m = df_m.append([df_m] * 10000, ignore_index=True)

            %timeit df_r1 = df_m
            .groupby('id').apply(lambda x: x.ffill().bfill())
            .drop_duplicates()

            %timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()

            Output:
            114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
            112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


            So, it seems with 100,000 rows it's very close.






            share|improve this answer

























              up vote
              0
              down vote













              Thanks to jpp for his answer.



              Just to check, I've timed his solution against the one in my reply to his answer.



              Input:
              df_m = df_m.append([df_m] * 10000, ignore_index=True)

              %timeit df_r1 = df_m
              .groupby('id').apply(lambda x: x.ffill().bfill())
              .drop_duplicates()

              %timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()

              Output:
              114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
              112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


              So, it seems with 100,000 rows it's very close.






              share|improve this answer























                up vote
                0
                down vote










                up vote
                0
                down vote









                Thanks to jpp for his answer.



                Just to check, I've timed his solution against the one in my reply to his answer.



                Input:
                df_m = df_m.append([df_m] * 10000, ignore_index=True)

                %timeit df_r1 = df_m
                .groupby('id').apply(lambda x: x.ffill().bfill())
                .drop_duplicates()

                %timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()

                Output:
                114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
                112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


                So, it seems with 100,000 rows it's very close.






                share|improve this answer












                Thanks to jpp for his answer.



                Just to check, I've timed his solution against the one in my reply to his answer.



                Input:
                df_m = df_m.append([df_m] * 10000, ignore_index=True)

                %timeit df_r1 = df_m
                .groupby('id').apply(lambda x: x.ffill().bfill())
                .drop_duplicates()

                %timeit df_r2 = df_m.groupby('id').ffill().groupby('id').bfill().drop_duplicates()

                Output:
                114 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
                112 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


                So, it seems with 100,000 rows it's very close.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 11 at 21:24









                KevL

                116210




                116210






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53251999%2fpandas-fill-nan-values-based-on-another-column-with-duplicate-values%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Guess what letter conforming each word

                    Port of Spain

                    Run scheduled task as local user group (not BUILTIN)