Pandas: filling missing values by mean in each group
This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem...
Suppose I have the following DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
                   'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})
  name value
0    A     1
1    A   NaN
2    B   NaN
3    B     2
4    B     3
5    B     1
6    C     3
7    C   NaN
8    C     3
and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.
  name value
0    A     1
1    A     1
2    B     2
3    B     2
4    B     3
5    B     1
6    C     3
7    C     3
8    C     3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
8 Answers
One way would be to use transform:
>>> df
  name value
0    A     1
1    A   NaN
2    B   NaN
3    B     2
4    B     3
5    B     1
6    C     3
7    C   NaN
8    C     3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
  name value
0    A     1
1    A     1
2    B     2
3    B     2
4    B     3
5    B     1
6    C     3
7    C     3
8    C     3
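To see why transform is the right tool here, a minimal sketch (on the question's df) contrasting it with a plain group mean:
# .mean() aggregates to one row per group, indexed by the group key:
df.groupby('name')['value'].mean()
# name
# A    1.0
# B    2.0
# C    3.0

# .transform('mean') broadcasts each group's mean back onto the original
# row index (0..8), which is what lets it be combined with fillna:
df.groupby('name')['value'].transform('mean')
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0
# 6    3.0
# 7    3.0
# 8    3.0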
I found it helpful when starting out to sit down and read through the docs. This one is covered in the groupby section. There's too much stuff to remember, but you pick up rules like "transform is for per-group operations which you want indexed like the original frame" and so on.
– DSM
Nov 13 '13 at 22:57
Also look for the Wes McKinney book. Personally I think the docs on groupby are abysmal; the book is marginally better.
– Woody Pride
Nov 14 '13 at 0:51
If you have more than two columns, make sure to specify the column name: df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))['value']
– Lauren
Jan 10 '17 at 16:57
@Lauren Good point. I'd like to add that for performance reasons you might consider moving the value column specification further left, into the group-by clause. That way the lambda function is called only on that particular column, rather than on every column with the wanted column chosen afterwards. I did a test and it was twice as fast when using two columns, and naturally you get better performance the more columns you don't need to impute: df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
– André C. Andersen
Jul 28 '17 at 12:11
I have been searching for this for two days. Just a question for you: why is it so hard to do this with loops? In my case there are two group keys, i.e. State and Age_Group, and I am trying to fill missing values in those groups with group means (take the mean from the same state within the same age group and fill the missing values in that group). Thanks
– cyber-math
Jan 9 at 20:26
@DSM has, IMO, the right answer, but I'd like to share my generalization and optimization of the question: multiple columns to group by and multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
  category name  other_value  value
0        X    A         10.0    1.0
1        X    A          NaN    NaN
2        X    B          NaN    NaN
3        X    B         20.0    2.0
4        X    B         30.0    3.0
5        X    B         10.0    1.0
6        Y    C         30.0    3.0
7        Y    C          NaN    NaN
8        Y    C         30.0    3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the column selection at the end, but then you would run the transformation for all columns only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
A performance test, increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that the speedup is proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime

def generate_data():
    ...

t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
    .transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012

t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
    .transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
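As an aside, building big_df by concatenating inside the loop re-copies the accumulated frame on every iteration; a single concat over a list of copies is the more usual way to produce the same blown-up frame:
# One-shot equivalent of the blow-up loop above: 10000 copies of df,
# concatenated once instead of incrementally.
df = pd.concat([df] * 10000)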
On a final note, you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
    .transform(lambda x: x.fillna(x.mean()))
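A compatibility note: the answer above targets pandas 0.19.2; on newer pandas releases, passing several column names directly after the groupby (without an extra pair of brackets) was deprecated, and a list selection is expected instead. A sketch of the same idea in that spelling:
# Same multi-column imputation with list-based column selection,
# as expected on newer pandas releases.
df[['value', 'other_value']] = (
    df.groupby(['category', 'name'])[['value', 'other_value']]
      .transform(lambda x: x.fillna(x.mean()))
)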
Thank you for this great work. I am wondering how I could accomplish the same transformation using for loops. Speed is not my concern since I am trying to find manual methods. Thanks @AndréC.Andersen
– cyber-math
Jan 9 at 21:55
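For reference, a minimal loop-based sketch of the same imputation (using the category/name/value frame from this answer), for cases where speed is not the concern:
# Manual version: iterate over the (category, name) groups and fill each
# group's NaNs with that group's own mean, writing back via the group's
# index labels.
for _, group in df.groupby(['category', 'name']):
    group_mean = group['value'].mean()  # mean() skips NaN
    df.loc[group.index, 'value'] = group['value'].fillna(group_mean)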
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
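Applied to the question's frame, where the grouping column is name, the same pattern would read:
# Only the rows where value is NaN are overwritten; the transform result
# aligns on the original index, so each NaN gets its own group's mean.
df.loc[df['value'].isnull(), 'value'] = df.groupby('name')['value'].transform('mean')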
The featured, highly ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
def groupMeanValue(group):
    # fill this group's NaNs with the group's own mean
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
Please give some explanation of your answer. Why should someone who stumbles upon this page from Google use your solution over the other 6 answers?
– divibisan
Oct 4 '18 at 20:28
I just did this:
df.fillna(df.mean(), inplace=True)
All missing values within your DataFrame will be filled with the column mean, if that is what you're looking for. This worked for me. It's simple, and it gets the job done.
This answer is incorrect because it doesn't work groupwise.
– jpp
Nov 16 '18 at 13:49
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.