Get a sample of aggregated row values with pandas
I need a function that, given a data frame and a number num, constructs a data frame with num rows such that every row has the following values:
- for columns with string values, a value sampled from that column in the original table
- for columns with floats or ints, the mean value of the column
Here is my code:

    def rows_aggr(df, num):
        dataframe = None
        for i in range(0, num):
            row = None
            for cname in df.columns.values:
                column = df[cname]
                dfcol = Series.to_frame(column)
                if column.dtype != np.number:
                    item = dfcol.sample(n=1)
                else:
                    item = dfcol.mean(axis=1)
                if row is None:
                    row = item
                else:
                    row = pd.concat([row, item], axis=1)
            if dataframe is None:
                dataframe = row
            else:
                dataframe = pd.concat([dataframe, row], axis=0)
        return dataframe

For some reason the rows contain NaN values and the number of rows exceeds num, so this code does not seem to work right. If you know a better way of accomplishing what I need, I would be happy to know.
For example, for

    df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)})

and num=3, we would get something like

    c, 2.5
    f, 2.5
    b, 2.5

assuming c, f, b were randomly picked.
Thank you!
python pandas
So for each column with float or int, the value will always be the same for each row?
– Ben.T Nov 15 '18 at 17:21
@Ben.T yes, why not )
– YohanRoth Nov 15 '18 at 18:02
I'm not sure I see what you want to get. Can you add an example of your expected output if df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3?
– Ben.T Nov 15 '18 at 18:09
@Ben.T check pls, added
– YohanRoth Nov 15 '18 at 18:13
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait Nov 15 '18 at 19:22
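To make the last comment concrete, here is a minimal sketch of the usual alternative: collect plain Python records in a list and build the frame once at the end. The variable names (records, result) are illustrative and not from the thread; the data is the example frame from the question. Note also that DataFrame.append was removed in pandas 2.0, so only pd.concat remains available for the looped pattern.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})

# Collect plain dicts in a list; no DataFrame is copied inside the loop.
records = []
for _ in range(3):
    records.append({
        'col1': df['col1'].sample(n=1).iloc[0],  # one randomly picked string
        'col2': df['col2'].mean(),               # the same mean in every row
    })

# Build the frame a single time, after the loop has finished.
result = pd.DataFrame(records)
print(result)
```

Appending to a list is cheap, whereas each pd.concat call allocates a new frame and copies everything accumulated so far.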
edited Nov 15 '18 at 18:13
asked Nov 15 '18 at 16:55 by YohanRoth
1 Answer
One error seems to be that the condition column.dtype != np.number does not work as intended. Then there is a problem with index alignment: when you do pd.concat([row, item], axis=1), item carries an index label that is not always the same, and this adds rows with NaN to row. Here is another way to do it.
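As a side note before the alternative, the alignment problem described above can be reproduced in isolation. This is a minimal sketch with two made-up one-row frames (the names a, b, misaligned and aligned are illustrative, not from the question); they stand in for two independent sample(n=1) draws that picked different rows.

```python
import pandas as pd

# Two one-row frames whose index labels differ, like two independent
# .sample(n=1) draws that happened to pick different rows.
a = pd.DataFrame({'col1': ['c']}, index=[2])
b = pd.DataFrame({'col2': ['k']}, index=[4])

# concat with axis=1 aligns on index labels, so the result has
# two rows (labels 2 and 4), each padded with NaN.
misaligned = pd.concat([a, b], axis=1)
print(misaligned)

# Dropping the labels first gives a single combined row.
aligned = pd.concat([a.reset_index(drop=True),
                     b.reset_index(drop=True)], axis=1)
print(aligned)
```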
SETUP

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'col1':list('abcdef'),'col2':list('ijklmn'),
                       'col3':range(6),'col4':np.arange(10,16)/1.5})
    print (df)
      col1 col2  col3       col4
    0    a    i     0   6.666667
    1    b    j     1   7.333333
    2    c    k     2   8.000000
    3    d    l     3   8.666667
    4    e    m     4   9.333333
    5    f    n     5  10.000000

You can use select_dtypes to check whether a column is not numeric, and create the dataframe with a dictionary comprehension like:

    def rows_aggr(df, num):
        list_col_notnumeric = df.select_dtypes(exclude=[np.number]).columns
        return pd.DataFrame({col: df[col].sample(num).values
                                  if col in list_col_notnumeric
                                  else df[col].mean()
                             for col in df.columns})
    print (rows_aggr(df, 3))
      col1 col2  col3      col4
    0    d    i   2.5  8.333333
    1    a    n   2.5  8.333333
    2    c    j   2.5  8.333333

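If a per-column test is preferred over select_dtypes, pandas also ships pd.api.types.is_numeric_dtype. The following is only a sketch of that option, not part of the answer's method; the name numeric_flags is made up for illustration, and the frame is a trimmed version of the SETUP data. One caveat worth checking on real data: is_numeric_dtype also treats boolean columns as numeric.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'col1': list('abcdef'), 'col3': range(6),
                   'col4': np.arange(10, 16) / 1.5})

# Map each column name to whether pandas considers its dtype numeric.
numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}
print(numeric_flags)
```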
So when I run it on my dataset I get the error "ValueError: If using all scalar values, you must pass an index"
– YohanRoth Nov 16 '18 at 2:54
@YohanRoth from what I read about this error, it seems that adding the parameter index=range(num) in the pd.DataFrame call may solve the issue. As I can't reproduce the error, it's difficult to be sure that is the cause. Maybe share the first few rows of your dataset; if there is any special type of data in it (other than str/float/int), that may be the reason too.
– Ben.T Nov 16 '18 at 4:16
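A sketch of the index=range(num) suggestion from the comment above. One situation that does raise this ValueError is a frame in which every column is numeric, so every value in the dict is a scalar; whether that was the case for the asker's dataset is not known. The names not_numeric, data and only_numbers are illustrative.

```python
import numpy as np
import pandas as pd

def rows_aggr(df, num):
    not_numeric = df.select_dtypes(exclude=[np.number]).columns
    data = {col: df[col].sample(num).values if col in not_numeric
            else df[col].mean()
            for col in df.columns}
    # An explicit index lets pandas broadcast the scalars, even when
    # every value in the dict is a scalar (all columns numeric).
    return pd.DataFrame(data, index=range(num))

# Assumed failing case: a frame with numeric columns only.
only_numbers = pd.DataFrame({'col3': range(6),
                             'col4': np.arange(10, 16) / 1.5})
print(rows_aggr(only_numbers, 3))
```

With the explicit index, the same function still handles frames that mix string and numeric columns, because the sampled arrays already have length num.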
answered Nov 15 '18 at 19:12 by Ben.T