Pandas: filling missing values by mean in each group
This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem...
Suppose I have the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})
  name value
0    A     1
1    A   NaN
2    B   NaN
3    B     2
4    B     3
5    B     1
6    C     3
7    C   NaN
8    C     3
and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.
  name value
0    A     1
1    A     1
2    B     2
3    B     2
4    B     3
5    B     1
6    C     3
7    C     3
8    C     3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
8 Answers
One way would be to use transform:
>>> df
  name value
0    A     1
1    A   NaN
2    B   NaN
3    B     2
4    B     3
5    B     1
6    C     3
7    C   NaN
8    C     3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name value
0    A     1
1    A     1
2    B     2
3    B     2
4    B     3
5    B     1
6    C     3
7    C     3
8    C     3
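To see why transform is the right tool here, a minimal sketch (on the question's df) contrasting it with a plain group mean:
# .mean() aggregates to one row per group, indexed by the group key:
df.groupby('name')['value'].mean()
# name
# A    1.0
# B    2.0
# C    3.0

# .transform('mean') broadcasts each group's mean back onto the original
# row index (0..8), which is what lets it be combined with fillna:
df.groupby('name')['value'].transform('mean')
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0
# 6    3.0
# 7    3.0
# 8    3.0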
I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.
– DSM
Nov 13 '13 at 22:57
Also look for the Wes McKinney book. Personally I think the docs on groupby are abysmal; the book is marginally better.
– Woody Pride
Nov 14 '13 at 0:51
If you have more than two columns, make sure to specify the column name: df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']
– Lauren
Jan 10 '17 at 16:57
@Lauren Good point. I'd like to add that for performance reasons you might consider moving the value column specification further left, into the group-by clause. That way the lambda function is called only on that particular column, rather than on every column with the wanted column chosen afterwards. I did a test and it was twice as fast when using two columns, and naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
– André C. Andersen
Jul 28 '17 at 12:11
I have been searching for this for two days. Just a question for you: why is it so hard to do this with loops? In my case there are two group keys, i.e. State and Age_Group, and I am trying to fill missing values in those groups with group means (take the mean from the same state within the same age group and fill the missing values in that group). Thanks
– cyber-math
Jan 9 at 20:26
@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by and multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
  category name  other_value  value
0        X    A         10.0    1.0
1        X    A          NaN    NaN
2        X    B          NaN    NaN
3        X    B         20.0    2.0
4        X    B         30.0    3.0
5        X    B         10.0    1.0
6        Y    C         30.0    3.0
7        Y    C          NaN    NaN
8        Y    C         30.0    3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the column selection at the end, but then you would run the transformation for all columns only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
A performance test, increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that the speedup is proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
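As an aside, building big_df by concatenating inside the loop re-copies the accumulated frame on every iteration; a single concat over a list of copies is the more usual way to produce the same blown-up frame:
# One-shot equivalent of the blow-up loop above: 10000 copies of df,
# concatenated once instead of incrementally.
df = pd.concat([df] * 10000)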
On a final note, you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
    .transform(lambda x: x.fillna(x.mean()))
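A compatibility note: the answer above targets pandas 0.19.2; on newer pandas releases, passing several column names directly after the groupby (without an extra pair of brackets) was deprecated, and a list selection is expected instead. A sketch of the same idea in that spelling:
# Same multi-column imputation with list-based column selection,
# as expected on newer pandas releases.
df[['value', 'other_value']] = (
    df.groupby(['category', 'name'])[['value', 'other_value']]
      .transform(lambda x: x.fillna(x.mean()))
)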
Thank you for this great work. I am wondering how I could accomplish the same transformation using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen
– cyber-math
Jan 9 at 21:55
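For reference, a minimal loop-based sketch of the same imputation (using the category/name/value frame from this answer), for cases where speed is not the concern:
# Manual version: iterate over the (category, name) groups and fill each
# group's NaNs with that group's own mean, writing back via the group's
# index labels.
for _, group in df.groupby(['category', 'name']):
    group_mean = group['value'].mean()  # mean() skips NaN
    df.loc[group.index, 'value'] = group['value'].fillna(group_mean)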
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
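Applied to the question's frame, where the grouping column is name, the same pattern would read:
# Only the rows where value is NaN are overwritten; the transform result
# aligns on the original index, so each NaN gets its own group's mean.
df.loc[df['value'].isnull(), 'value'] = df.groupby('name')['value'].transform('mean')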
The featured, highly ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
def groupMeanValue(group):
    # fill this group's NaNs with the group's own mean
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
Please give some explanation of your answer. Why should someone who stumbles upon this page from Google use your solution over the other 6 answers?
– divibisan
Oct 4 '18 at 20:28
I just did this:
df.fillna(df.mean(), inplace=True)
All missing values within your DataFrame will be filled with the column mean, if that is what you're looking for. This worked for me. It's simple, and it gets the job done.
This answer is incorrect because it doesn't work groupwise.
– jpp
Nov 16 '18 at 13:49
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.