Pandas: filling missing values by mean in each group












43















This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....



Suppose I have the following dataframe



df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3


and I'd like to fill in "NaN" with mean value in each "name" group, i.e.



      name  value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3


I'm not sure where to go after:



grouped = df.groupby('name').mean()


Thanks a bunch.










share|improve this question





























    43















    This should be straightforward, but the closest thing I've found is this post:
    pandas: Filling missing values within a group, and I still can't solve my problem....



    Suppose I have the following dataframe



    df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

    name value
    0 A 1
    1 A NaN
    2 B NaN
    3 B 2
    4 B 3
    5 B 1
    6 C 3
    7 C NaN
    8 C 3


    and I'd like to fill in "NaN" with mean value in each "name" group, i.e.



          name  value
    0 A 1
    1 A 1
    2 B 2
    3 B 2
    4 B 3
    5 B 1
    6 C 3
    7 C 3
    8 C 3


    I'm not sure where to go after:



    grouped = df.groupby('name').mean()


    Thanks a bunch.










    share|improve this question



























      43












      43








      43


      18






      This should be straightforward, but the closest thing I've found is this post:
      pandas: Filling missing values within a group, and I still can't solve my problem....



      Suppose I have the following dataframe



      df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

      name value
      0 A 1
      1 A NaN
      2 B NaN
      3 B 2
      4 B 3
      5 B 1
      6 C 3
      7 C NaN
      8 C 3


      and I'd like to fill in "NaN" with mean value in each "name" group, i.e.



            name  value
      0 A 1
      1 A 1
      2 B 2
      3 B 2
      4 B 3
      5 B 1
      6 C 3
      7 C 3
      8 C 3


      I'm not sure where to go after:



      grouped = df.groupby('name').mean()


      Thanks a bunch.










      share|improve this question
















      This should be straightforward, but the closest thing I've found is this post:
      pandas: Filling missing values within a group, and I still can't solve my problem....



      Suppose I have the following dataframe



      df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})

      name value
      0 A 1
      1 A NaN
      2 B NaN
      3 B 2
      4 B 3
      5 B 1
      6 C 3
      7 C NaN
      8 C 3


      and I'd like to fill in "NaN" with mean value in each "name" group, i.e.



            name  value
      0 A 1
      1 A 1
      2 B 2
      3 B 2
      4 B 3
      5 B 1
      6 C 3
      7 C 3
      8 C 3


      I'm not sure where to go after:



      grouped = df.groupby('name').mean()


      Thanks a bunch.







      python pandas






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited May 23 '17 at 12:18









      Community

      11




      11










      asked Nov 13 '13 at 22:43









      BlueFeetBlueFeet

      6021819




      6021819
























          8 Answers
          8






          active

          oldest

          votes


















          57














          One way would be to use transform:



          >>> df
          name value
          0 A 1
          1 A NaN
          2 B NaN
          3 B 2
          4 B 3
          5 B 1
          6 C 3
          7 C NaN
          8 C 3
          >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
          >>> df
          name value
          0 A 1
          1 A 1
          2 B 2
          3 B 2
          4 B 3
          5 B 1
          6 C 3
          7 C 3
          8 C 3





          share|improve this answer



















          • 2





            I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

            – DSM
            Nov 13 '13 at 22:57











          • Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

            – Woody Pride
            Nov 14 '13 at 0:51






          • 19





            if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

            – Lauren
            Jan 10 '17 at 16:57






          • 8





            @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

            – André C. Andersen
            Jul 28 '17 at 12:11











          • I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

            – cyber-math
            Jan 9 at 20:26





















          14














          @DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:



          df = pd.DataFrame(
          {
          'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
          'name': ['A','A', 'B','B','B','B', 'C','C','C'],
          'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
          'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
          }
          )


          ... gives ...



            category name  other_value value
          0 X A 10.0 1.0
          1 X A NaN NaN
          2 X B NaN NaN
          3 X B 20.0 2.0
          4 X B 30.0 3.0
          5 X B 10.0 1.0
          6 Y C 30.0 3.0
          7 Y C NaN NaN
          8 Y C 30.0 3.0


          In this generalized case we would like to group by category and name, and impute only on value.



          This can be solved as follows:



          df['value'] = df.groupby(['category', 'name'])['value']
          .transform(lambda x: x.fillna(x.mean()))


          Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.



          Performance test by increasing the dataset by doing ...



          big_df = None
          for _ in range(10000):
          if big_df is None:
          big_df = df.copy()
          else:
          big_df = pd.concat([big_df, df])
          df = big_df


          ... confirms that this increases the speed proportional to how many columns you don't have to impute:



          import pandas as pd
          from datetime import datetime

          def generate_data():
          ...

          t = datetime.now()
          df = generate_data()
          df['value'] = df.groupby(['category', 'name'])['value']
          .transform(lambda x: x.fillna(x.mean()))
          print(datetime.now()-t)

          # 0:00:00.016012

          t = datetime.now()
          df = generate_data()
          df["value"] = df.groupby(['category', 'name'])
          .transform(lambda x: x.fillna(x.mean()))['value']
          print(datetime.now()-t)

          # 0:00:00.030022


          On a final note you can generalize even further if you want to impute more than one column, but not all:



          df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']
          .transform(lambda x: x.fillna(x.mean()))





          share|improve this answer


























          • Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

            – cyber-math
            Jan 9 at 21:55





















          6














          I'd do it this way



          df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')





          share|improve this answer

































            3














            The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:



            df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
            lambda x: x.fillna(x.mean()))





            share|improve this answer

































              2














              def groupMeanValue(group):
              group['value'] = group['value'].fillna(group['value'].mean())
              return group

              dft = df.groupby("name").transform(groupMeanValue)





              share|improve this answer































                1














                df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)





                share|improve this answer





















                • 2





                  Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                  – divibisan
                  Oct 4 '18 at 20:28



















                0














                I just did this



                df.fillna(df.mean(), inplace=True)


                All missing values within your DataFrame will be filled by mean. If that is what you're looking for. This worked for me. It's simple, and gets the job done.






                share|improve this answer
























                • This answer is incorrect because it doesn't work groupwise.

                  – jpp
                  Nov 16 '18 at 13:49



















                0















                fillna + groupby + transform + mean



                This seems intuitive:



                df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))


                The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.






                share|improve this answer























                  Your Answer






                  StackExchange.ifUsing("editor", function () {
                  StackExchange.using("externalEditor", function () {
                  StackExchange.using("snippets", function () {
                  StackExchange.snippets.init();
                  });
                  });
                  }, "code-snippets");

                  StackExchange.ready(function() {
                  var channelOptions = {
                  tags: "".split(" "),
                  id: "1"
                  };
                  initTagRenderer("".split(" "), "".split(" "), channelOptions);

                  StackExchange.using("externalEditor", function() {
                  // Have to fire editor after snippets, if snippets enabled
                  if (StackExchange.settings.snippets.snippetsEnabled) {
                  StackExchange.using("snippets", function() {
                  createEditor();
                  });
                  }
                  else {
                  createEditor();
                  }
                  });

                  function createEditor() {
                  StackExchange.prepareEditor({
                  heartbeatType: 'answer',
                  autoActivateHeartbeat: false,
                  convertImagesToLinks: true,
                  noModals: true,
                  showLowRepImageUploadWarning: true,
                  reputationToPostImages: 10,
                  bindNavPrevention: true,
                  postfix: "",
                  imageUploader: {
                  brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                  contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                  allowUrls: true
                  },
                  onDemand: true,
                  discardSelector: ".discard-answer"
                  ,immediatelyShowMarkdownHelp:true
                  });


                  }
                  });














                  draft saved

                  draft discarded


















                  StackExchange.ready(
                  function () {
                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19966018%2fpandas-filling-missing-values-by-mean-in-each-group%23new-answer', 'question_page');
                  }
                  );

                  Post as a guest















                  Required, but never shown

























                  8 Answers
                  8






                  active

                  oldest

                  votes








                  8 Answers
                  8






                  active

                  oldest

                  votes









                  active

                  oldest

                  votes






                  active

                  oldest

                  votes









                  57














                  One way would be to use transform:



                  >>> df
                  name value
                  0 A 1
                  1 A NaN
                  2 B NaN
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C NaN
                  8 C 3
                  >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
                  >>> df
                  name value
                  0 A 1
                  1 A 1
                  2 B 2
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C 3
                  8 C 3





                  share|improve this answer



















                  • 2





                    I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                    – DSM
                    Nov 13 '13 at 22:57











                  • Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                    – Woody Pride
                    Nov 14 '13 at 0:51






                  • 19





                    if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                    – Lauren
                    Jan 10 '17 at 16:57






                  • 8





                    @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                    – André C. Andersen
                    Jul 28 '17 at 12:11











                  • I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                    – cyber-math
                    Jan 9 at 20:26


















                  57














                  One way would be to use transform:



                  >>> df
                  name value
                  0 A 1
                  1 A NaN
                  2 B NaN
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C NaN
                  8 C 3
                  >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
                  >>> df
                  name value
                  0 A 1
                  1 A 1
                  2 B 2
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C 3
                  8 C 3





                  share|improve this answer



















                  • 2





                    I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                    – DSM
                    Nov 13 '13 at 22:57











                  • Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                    – Woody Pride
                    Nov 14 '13 at 0:51






                  • 19





                    if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                    – Lauren
                    Jan 10 '17 at 16:57






                  • 8





                    @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                    – André C. Andersen
                    Jul 28 '17 at 12:11











                  • I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                    – cyber-math
                    Jan 9 at 20:26
















                  57












                  57








                  57







                  One way would be to use transform:



                  >>> df
                  name value
                  0 A 1
                  1 A NaN
                  2 B NaN
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C NaN
                  8 C 3
                  >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
                  >>> df
                  name value
                  0 A 1
                  1 A 1
                  2 B 2
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C 3
                  8 C 3





                  share|improve this answer













                  One way would be to use transform:



                  >>> df
                  name value
                  0 A 1
                  1 A NaN
                  2 B NaN
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C NaN
                  8 C 3
                  >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
                  >>> df
                  name value
                  0 A 1
                  1 A 1
                  2 B 2
                  3 B 2
                  4 B 3
                  5 B 1
                  6 C 3
                  7 C 3
                  8 C 3






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 13 '13 at 22:51









                  DSMDSM

                  207k35397372




                  207k35397372








                  • 2





                    I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                    – DSM
                    Nov 13 '13 at 22:57











                  • Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                    – Woody Pride
                    Nov 14 '13 at 0:51






                  • 19





                    if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                    – Lauren
                    Jan 10 '17 at 16:57






                  • 8





                    @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                    – André C. Andersen
                    Jul 28 '17 at 12:11











                  • I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                    – cyber-math
                    Jan 9 at 20:26
















                  • 2





                    I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                    – DSM
                    Nov 13 '13 at 22:57











                  • Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                    – Woody Pride
                    Nov 14 '13 at 0:51






                  • 19





                    if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                    – Lauren
                    Jan 10 '17 at 16:57






                  • 8





                    @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                    – André C. Andersen
                    Jul 28 '17 at 12:11











                  • I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                    – cyber-math
                    Jan 9 at 20:26










                  2




                  2





                  I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                  – DSM
                  Nov 13 '13 at 22:57





                  I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.

                  – DSM
                  Nov 13 '13 at 22:57













                  Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                  – Woody Pride
                  Nov 14 '13 at 0:51





                  Also look for the Wes McKinney book. Personally I think the docs on groupby are abismal, the book is marginally better.

                  – Woody Pride
                  Nov 14 '13 at 0:51




                  19




                  19





                  if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                  – Lauren
                  Jan 10 '17 at 16:57





                  if you have more than two columns, make sure to specify the column name df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']

                  – Lauren
                  Jan 10 '17 at 16:57




                  8




                  8





                  @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                  – André C. Andersen
                  Jul 28 '17 at 12:11





                  @Lauren Good point. I'd like to add that for performance reasons you might consider to move the value column specification further left to the group-by clause. This way the lambda function is only called for values in that particular column, and not every column and then chose column. Did a test and it was twice as fast when using two columns. And naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))

                  – André C. Andersen
                  Jul 28 '17 at 12:11













                  I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                  – cyber-math
                  Jan 9 at 20:26







                  I have been searching for this for two days.. Just a question for you. Why is it too hard to do this with loops? Because in my case there are two multi indexes i.e. State and Age_Group then I am trying to fill missing values in those groups with group means (from the same state within the same age group take mean and fill missings in group)..Thanks

                  – cyber-math
                  Jan 9 at 20:26















                  14














                  @DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:



                  df = pd.DataFrame(
                  {
                  'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
                  'name': ['A','A', 'B','B','B','B', 'C','C','C'],
                  'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
                  'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                  }
                  )


                  ... gives ...



                    category name  other_value value
                  0 X A 10.0 1.0
                  1 X A NaN NaN
                  2 X B NaN NaN
                  3 X B 20.0 2.0
                  4 X B 30.0 3.0
                  5 X B 10.0 1.0
                  6 Y C 30.0 3.0
                  7 Y C NaN NaN
                  8 Y C 30.0 3.0


                  In this generalized case we would like to group by category and name, and impute only on value.



                  This can be solved as follows:



                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))


                  Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.



                  Performance test by increasing the dataset by doing ...



                  big_df = None
                  for _ in range(10000):
                  if big_df is None:
                  big_df = df.copy()
                  else:
                  big_df = pd.concat([big_df, df])
                  df = big_df


                  ... confirms that this increases the speed proportional to how many columns you don't have to impute:



                  import pandas as pd
                  from datetime import datetime

                  def generate_data():
                  ...

                  t = datetime.now()
                  df = generate_data()
                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))
                  print(datetime.now()-t)

                  # 0:00:00.016012

                  t = datetime.now()
                  df = generate_data()
                  df["value"] = df.groupby(['category', 'name'])
                  .transform(lambda x: x.fillna(x.mean()))['value']
                  print(datetime.now()-t)

                  # 0:00:00.030022


                  On a final note you can generalize even further if you want to impute more than one column, but not all:



                  df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']
                  .transform(lambda x: x.fillna(x.mean()))





                  share|improve this answer


























                  • Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                    – cyber-math
                    Jan 9 at 21:55


















                  14














                  @DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:



                  df = pd.DataFrame(
                  {
                  'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
                  'name': ['A','A', 'B','B','B','B', 'C','C','C'],
                  'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
                  'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                  }
                  )


                  ... gives ...



                    category name  other_value value
                  0 X A 10.0 1.0
                  1 X A NaN NaN
                  2 X B NaN NaN
                  3 X B 20.0 2.0
                  4 X B 30.0 3.0
                  5 X B 10.0 1.0
                  6 Y C 30.0 3.0
                  7 Y C NaN NaN
                  8 Y C 30.0 3.0


                  In this generalized case we would like to group by category and name, and impute only on value.



                  This can be solved as follows:



                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))


                  Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.



                  Performance test by increasing the dataset by doing ...



                  big_df = None
                  for _ in range(10000):
                  if big_df is None:
                  big_df = df.copy()
                  else:
                  big_df = pd.concat([big_df, df])
                  df = big_df


                  ... confirms that this increases the speed proportional to how many columns you don't have to impute:



                  import pandas as pd
                  from datetime import datetime

                  def generate_data():
                  ...

                  t = datetime.now()
                  df = generate_data()
                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))
                  print(datetime.now()-t)

                  # 0:00:00.016012

                  t = datetime.now()
                  df = generate_data()
                  df["value"] = df.groupby(['category', 'name'])
                  .transform(lambda x: x.fillna(x.mean()))['value']
                  print(datetime.now()-t)

                  # 0:00:00.030022


                  On a final note you can generalize even further if you want to impute more than one column, but not all:



                  df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']
                  .transform(lambda x: x.fillna(x.mean()))





                  share|improve this answer


























                  • Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                    – cyber-math
                    Jan 9 at 21:55
















                  14












                  14








                  14







                  @DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:



                  df = pd.DataFrame(
                  {
                  'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
                  'name': ['A','A', 'B','B','B','B', 'C','C','C'],
                  'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
                  'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                  }
                  )


                  ... gives ...



                    category name  other_value value
                  0 X A 10.0 1.0
                  1 X A NaN NaN
                  2 X B NaN NaN
                  3 X B 20.0 2.0
                  4 X B 30.0 3.0
                  5 X B 10.0 1.0
                  6 Y C 30.0 3.0
                  7 Y C NaN NaN
                  8 Y C 30.0 3.0


                  In this generalized case we would like to group by category and name, and impute only on value.



                  This can be solved as follows:



                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))


                  Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.



                  Performance test by increasing the dataset by doing ...



                  big_df = None
                  for _ in range(10000):
                  if big_df is None:
                  big_df = df.copy()
                  else:
                  big_df = pd.concat([big_df, df])
                  df = big_df


                  ... confirms that this increases the speed proportional to how many columns you don't have to impute:



                  import pandas as pd
                  from datetime import datetime

                  def generate_data():
                  ...

                  t = datetime.now()
                  df = generate_data()
                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))
                  print(datetime.now()-t)

                  # 0:00:00.016012

                  t = datetime.now()
                  df = generate_data()
                  df["value"] = df.groupby(['category', 'name'])
                  .transform(lambda x: x.fillna(x.mean()))['value']
                  print(datetime.now()-t)

                  # 0:00:00.030022


                  On a final note you can generalize even further if you want to impute more than one column, but not all:



                  df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']
                  .transform(lambda x: x.fillna(x.mean()))





                  share|improve this answer















                  @DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:



                  df = pd.DataFrame(
                  {
                  'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
                  'name': ['A','A', 'B','B','B','B', 'C','C','C'],
                  'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
                  'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                  }
                  )


                  ... gives ...



                    category name  other_value value
                  0 X A 10.0 1.0
                  1 X A NaN NaN
                  2 X B NaN NaN
                  3 X B 20.0 2.0
                  4 X B 30.0 3.0
                  5 X B 10.0 1.0
                  6 Y C 30.0 3.0
                  7 Y C NaN NaN
                  8 Y C 30.0 3.0


                  In this generalized case we would like to group by category and name, and impute only on value.



                  This can be solved as follows:



                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))


                  Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.



                  Performance test by increasing the dataset by doing ...



                  big_df = None
                  for _ in range(10000):
                  if big_df is None:
                  big_df = df.copy()
                  else:
                  big_df = pd.concat([big_df, df])
                  df = big_df


                  ... confirms that this increases the speed proportional to how many columns you don't have to impute:



                  import pandas as pd
                  from datetime import datetime

                  def generate_data():
                  ...

                  t = datetime.now()
                  df = generate_data()
                  df['value'] = df.groupby(['category', 'name'])['value']
                  .transform(lambda x: x.fillna(x.mean()))
                  print(datetime.now()-t)

                  # 0:00:00.016012

                  t = datetime.now()
                  df = generate_data()
                  df["value"] = df.groupby(['category', 'name'])
                  .transform(lambda x: x.fillna(x.mean()))['value']
                  print(datetime.now()-t)

                  # 0:00:00.030022


                  On a final note you can generalize even further if you want to impute more than one column, but not all:



                  df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']
                  .transform(lambda x: x.fillna(x.mean()))






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jul 28 '17 at 12:32

























                  answered Jul 28 '17 at 12:06









                  André C. AndersenAndré C. Andersen

                  3,93612646




                  3,93612646













                  • Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                    – cyber-math
                    Jan 9 at 21:55





















                  • Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                    – cyber-math
                    Jan 9 at 21:55



















                  Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                  – cyber-math
                  Jan 9 at 21:55







                  Thank you for this great work. I am wondering how I could success the same transformation with using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen

                  – cyber-math
                  Jan 9 at 21:55













                  6














                  I'd do it this way



                  df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')





                  share|improve this answer






























                    6














                    I'd do it this way



                    df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')





                    share|improve this answer




























                      6












                      6








                      6







                      I'd do it this way



                      df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')





                      share|improve this answer















                      I'd do it this way



                      df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')






                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Oct 9 '17 at 9:55









                      IanS

                      8,42122762




                      8,42122762










                      answered Nov 18 '16 at 17:18









                      piRSquaredpiRSquared

                      154k22146288




                      154k22146288























                          3














                          The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:



                          df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
                          lambda x: x.fillna(x.mean()))





                          share|improve this answer






























                            3














                            The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:



                            df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
                            lambda x: x.fillna(x.mean()))





                            share|improve this answer




























                              3












                              3








                              3







                              The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:



                              df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
                              lambda x: x.fillna(x.mean()))





                              share|improve this answer















                              The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:



                              df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
                              lambda x: x.fillna(x.mean()))






                              share|improve this answer














                              share|improve this answer



                              share|improve this answer








                              edited Oct 13 '16 at 9:08

























                              answered Oct 13 '16 at 8:52









                              Philipp SchwarzPhilipp Schwarz

                              5,03132029




                              5,03132029























                                  2














                                  def groupMeanValue(group):
                                  group['value'] = group['value'].fillna(group['value'].mean())
                                  return group

                                  dft = df.groupby("name").transform(groupMeanValue)





                                  share|improve this answer




























                                    2














                                    def groupMeanValue(group):
                                    group['value'] = group['value'].fillna(group['value'].mean())
                                    return group

                                    dft = df.groupby("name").transform(groupMeanValue)





                                    share|improve this answer


























                                      2












                                      2








                                      2







                                      def groupMeanValue(group):
                                      group['value'] = group['value'].fillna(group['value'].mean())
                                      return group

                                      dft = df.groupby("name").transform(groupMeanValue)





                                      share|improve this answer













                                      def groupMeanValue(group):
                                      group['value'] = group['value'].fillna(group['value'].mean())
                                      return group

                                      dft = df.groupby("name").transform(groupMeanValue)






                                      share|improve this answer












                                      share|improve this answer



                                      share|improve this answer










                                      answered Mar 9 '16 at 14:36









                                      Prajit PatilPrajit Patil

                                      212




                                      212























                                          1














                                          df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)





                                          share|improve this answer





















                                          • 2





                                            Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                            – divibisan
                                            Oct 4 '18 at 20:28
















                                          1














                                          df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)





                                          share|improve this answer





















                                          • 2





                                            Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                            – divibisan
                                            Oct 4 '18 at 20:28














                                          1












                                          1








                                          1







                                          df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)





                                          share|improve this answer















                                          df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)






                                          share|improve this answer














                                          share|improve this answer



                                          share|improve this answer








                                          edited Oct 4 '18 at 18:19









                                          Paul Roub

                                          32.7k85773




                                          32.7k85773










                                          answered Oct 4 '18 at 18:11









                                          Vino VincentVino Vincent

                                          111




                                          111








                                          • 2





                                            Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                            – divibisan
                                            Oct 4 '18 at 20:28














                                          • 2





                                            Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                            – divibisan
                                            Oct 4 '18 at 20:28








                                          2




                                          2





                                          Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                          – divibisan
                                          Oct 4 '18 at 20:28





                                          Please give some explanation of your answer. Why should someone who stumbles upon this page from google use your solution over the other 6 answers?

                                          – divibisan
                                          Oct 4 '18 at 20:28











                                          0














                                          I just did this



                                          df.fillna(df.mean(), inplace=True)


                                          All missing values within your DataFrame will be filled by mean. If that is what you're looking for. This worked for me. It's simple, and gets the job done.






                                          share|improve this answer
























                                          • This answer is incorrect because it doesn't work groupwise.

                                            – jpp
                                            Nov 16 '18 at 13:49
















                                          0














                                          I just did this



                                          df.fillna(df.mean(), inplace=True)


                                          All missing values within your DataFrame will be filled by mean. If that is what you're looking for. This worked for me. It's simple, and gets the job done.






                                          share|improve this answer
























                                          • This answer is incorrect because it doesn't work groupwise.

                                            – jpp
                                            Nov 16 '18 at 13:49














                                          0












                                          0








                                          0







                                          I just did this



                                          df.fillna(df.mean(), inplace=True)


                                          All missing values within your DataFrame will be filled by mean. If that is what you're looking for. This worked for me. It's simple, and gets the job done.






                                          share|improve this answer













                                          I just did this



                                          df.fillna(df.mean(), inplace=True)


                                          All missing values within your DataFrame will be filled by mean. If that is what you're looking for. This worked for me. It's simple, and gets the job done.







                                          share|improve this answer












                                          share|improve this answer



                                          share|improve this answer










                                          answered Nov 16 '17 at 18:15









                                          keith singletonkeith singleton

                                          1




                                          1













                                          • This answer is incorrect because it doesn't work groupwise.

                                            – jpp
                                            Nov 16 '18 at 13:49



















                                          • This answer is incorrect because it doesn't work groupwise.

                                            – jpp
                                            Nov 16 '18 at 13:49

















                                          This answer is incorrect because it doesn't work groupwise.

                                          – jpp
                                          Nov 16 '18 at 13:49





                                          This answer is incorrect because it doesn't work groupwise.

                                          – jpp
                                          Nov 16 '18 at 13:49











                                          0















                                          fillna + groupby + transform + mean



                                          This seems intuitive:



                                          df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))


                                          The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.






                                          share|improve this answer




























                                            0















                                            fillna + groupby + transform + mean



                                            This seems intuitive:



                                            df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))


                                            The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.






                                            share|improve this answer


























                                              0












                                              0








                                              0








                                              fillna + groupby + transform + mean



                                              This seems intuitive:



                                              df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))


                                              The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.






                                              share|improve this answer














                                              fillna + groupby + transform + mean



                                              This seems intuitive:



                                              df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))


                                              The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.







                                              share|improve this answer












                                              share|improve this answer



                                              share|improve this answer










                                              answered Nov 16 '18 at 13:59









                                              jppjpp

                                              98.1k2159109




                                              98.1k2159109






























                                                  draft saved

                                                  draft discarded




















































                                                  Thanks for contributing an answer to Stack Overflow!


                                                  • Please be sure to answer the question. Provide details and share your research!

                                                  But avoid



                                                  • Asking for help, clarification, or responding to other answers.

                                                  • Making statements based on opinion; back them up with references or personal experience.


                                                  To learn more, see our tips on writing great answers.




                                                  draft saved


                                                  draft discarded














                                                  StackExchange.ready(
                                                  function () {
                                                  StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f19966018%2fpandas-filling-missing-values-by-mean-in-each-group%23new-answer', 'question_page');
                                                  }
                                                  );

                                                  Post as a guest















                                                  Required, but never shown





















































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown

































                                                  Required, but never shown














                                                  Required, but never shown












                                                  Required, but never shown







                                                  Required, but never shown







                                                  Popular posts from this blog

                                                  鏡平學校

                                                  ꓛꓣだゔៀៅຸ໢ທຮ໕໒ ,ໂ'໥໓າ໼ឨឲ៵៭ៈゎゔit''䖳𥁄卿' ☨₤₨こゎもょの;ꜹꟚꞖꞵꟅꞛေၦေɯ,ɨɡ𛃵𛁹ޝ޳ޠ޾,ޤޒޯ޾𫝒𫠁သ𛅤チョ'サノބޘދ𛁐ᶿᶇᶀᶋᶠ㨑㽹⻮ꧬ꧹؍۩وَؠ㇕㇃㇪ ㇦㇋㇋ṜẰᵡᴠ 軌ᵕ搜۳ٰޗޮ޷ސޯ𫖾𫅀ल, ꙭ꙰ꚅꙁꚊꞻꝔ꟠Ꝭㄤﺟޱސꧨꧼ꧴ꧯꧽ꧲ꧯ'⽹⽭⾁⿞⼳⽋២៩ញណើꩯꩤ꩸ꩮᶻᶺᶧᶂ𫳲𫪭𬸄𫵰𬖩𬫣𬊉ၲ𛅬㕦䬺𫝌𫝼,,𫟖𫞽ហៅ஫㆔ాఆఅꙒꚞꙍ,Ꙟ꙱エ ,ポテ,フࢰࢯ𫟠𫞶 𫝤𫟠ﺕﹱﻜﻣ𪵕𪭸𪻆𪾩𫔷ġ,ŧآꞪ꟥,ꞔꝻ♚☹⛵𛀌ꬷꭞȄƁƪƬșƦǙǗdžƝǯǧⱦⱰꓕꓢႋ神 ဴ၀க௭எ௫ឫោ ' េㇷㇴㇼ神ㇸㇲㇽㇴㇼㇻㇸ'ㇸㇿㇸㇹㇰㆣꓚꓤ₡₧ ㄨㄟ㄂ㄖㄎ໗ツڒذ₶।ऩछएोञयूटक़कयँृी,冬'𛅢𛅥ㇱㇵㇶ𥄥𦒽𠣧𠊓𧢖𥞘𩔋цѰㄠſtʯʭɿʆʗʍʩɷɛ,əʏダヵㄐㄘR{gỚṖḺờṠṫảḙḭᴮᵏᴘᵀᵷᵕᴜᴏᵾq﮲ﲿﴽﭙ軌ﰬﶚﶧ﫲Ҝжюїкӈㇴffצּ﬘﭅﬈軌'ffistfflſtffतभफɳɰʊɲʎ𛁱𛁖𛁮𛀉 𛂯𛀞నఋŀŲ 𫟲𫠖𫞺ຆຆ ໹້໕໗ๆทԊꧢꧠ꧰ꓱ⿝⼑ŎḬẃẖỐẅ ,ờỰỈỗﮊDžȩꭏꭎꬻ꭮ꬿꭖꭥꭅ㇭神 ⾈ꓵꓑ⺄㄄ㄪㄙㄅㄇstA۵䞽ॶ𫞑𫝄㇉㇇゜軌𩜛𩳠Jﻺ‚Üမ႕ႌႊၐၸဓၞၞၡ៸wyvtᶎᶪᶹစဎ꣡꣰꣢꣤ٗ؋لㇳㇾㇻㇱ㆐㆔,,㆟Ⱶヤマފ޼ޝަݿݞݠݷݐ',ݘ,ݪݙݵ𬝉𬜁𫝨𫞘くせぉて¼óû×ó£…𛅑הㄙくԗԀ5606神45,神796'𪤻𫞧ꓐ㄁ㄘɥɺꓵꓲ3''7034׉ⱦⱠˆ“𫝋ȍ,ꩲ軌꩷ꩶꩧꩫఞ۔فڱێظペサ神ナᴦᵑ47 9238їﻂ䐊䔉㠸﬎ffiﬣ,לּᴷᴦᵛᵽ,ᴨᵤ ᵸᵥᴗᵈꚏꚉꚟ⻆rtǟƴ𬎎

                                                  Why https connections are so slow when debugging (stepping over) in Java?