Get a sample of aggregated row values with pandas


























I need a function that, given a data frame and a number num, constructs a data frame with num rows such that every row has the following values:
- for columns with string values, a value sampled from that column in the original table
- for columns with floats or ints, the mean value of that column



Here is my code



    def rows_aggr(df, num):
        dataframe = None
        for i in range(0, num):
            row = None
            for cname in df.columns.values:
                column = df[cname]
                dfcol = Series.to_frame(column)

                if column.dtype != np.number:
                    item = dfcol.sample(n=1)
                else:
                    item = dfcol.mean(axis=1)

                if row is None:
                    row = item
                else:
                    row = pd.concat([row, item], axis=1)

            if dataframe is None:
                dataframe = row
            else:
                dataframe = pd.concat([dataframe, row], axis=0)

        return dataframe


For some reason the rows contain NaN values and the number of rows exceeds num, so this code does not seem to work right. If you know a better way of accomplishing what I need, I would be happy to hear it.

For

    df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)})

and num=3, we would get something like

    c, 2.5
    f, 2.5
    b, 2.5

assuming c, f, b were randomly picked.



Thank you!


































  • so for each column with float or int, the value will always be the same for each row?
    – Ben.T
    Nov 15 '18 at 17:21












  • @Ben.T yes, why not )
    – YohanRoth
    Nov 15 '18 at 18:02










  • Not sure I see what you want to get. Can you add an example of your expected output if df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3?
    – Ben.T
    Nov 15 '18 at 18:09






  • @Ben.T check pls, added
    – YohanRoth
    Nov 15 '18 at 18:13






  • Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
    – Parfait
    Nov 15 '18 at 19:22
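
A minimal sketch of the pattern this comment points toward, assuming the two-column example frame from the question: collect plain Python records inside the loop and construct the DataFrame once afterwards. The names records and result are illustrative only and do not come from the thread.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})
num = 3

# Build plain Python dicts in the loop; no DataFrame is copied here.
records = []
for _ in range(num):
    records.append({
        'col1': df['col1'].sample(n=1).iloc[0],  # one randomly picked string
        'col2': df['col2'].mean(),               # mean of the numeric column
    })

# A single DataFrame construction after the loop.
result = pd.DataFrame(records)
print(result)
```

Each pass through the loop draws its string independently, so the same value can appear in more than one row.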
















python pandas














asked Nov 15 '18 at 16:55 by YohanRoth (edited Nov 15 '18 at 18:13)












1 Answer














One error is that the condition column.dtype != np.number is not a reliable way to test whether a column is numeric. Then there is a problem with index alignment: when you do pd.concat([row, item], axis=1), item carries the index label of whichever row was sampled, which is not the same from one column to the next, so the concat adds rows filled with NaN. Here is another way to do it.
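
The alignment problem can be reproduced in isolation. The sketch below is only an illustration: it picks rows with fixed positions (iloc) instead of sample(n=1) so that the outcome is deterministic, and the variable names are made up for the example. It also shows that mean(axis=1) is a row-wise mean that returns one value per original row, whereas the column mean is a single scalar.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'), 'col2': list('ijklmn'),
                   'col3': range(6)})

# Two one-row frames whose index labels differ, which is what two
# independent .sample(n=1) calls usually produce.
a = df[['col1']].iloc[[2]]   # index label 2
b = df[['col2']].iloc[[5]]   # index label 5

# concat with axis=1 aligns on index labels: the result has one row per
# distinct label, with NaN where a frame has no value for that label.
misaligned = pd.concat([a, b], axis=1)
print(misaligned)

# Dropping the original labels first lets both values share row 0.
aligned = pd.concat([a.reset_index(drop=True),
                     b.reset_index(drop=True)], axis=1)
print(aligned)

# mean(axis=1) averages across columns, row by row.
row_wise = df[['col3']].mean(axis=1)
# The column mean is a single number.
col_mean = df['col3'].mean()
print(len(row_wise), col_mean)
```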



SETUP

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'col1':list('abcdef'),'col2':list('ijklmn'),
                       'col3':range(6),'col4':np.arange(10,16)/1.5})
    print(df)
      col1 col2  col3       col4
    0    a    i     0   6.666667
    1    b    j     1   7.333333
    2    c    k     2   8.000000
    3    d    l     3   8.666667
    4    e    m     4   9.333333
    5    f    n     5  10.000000


You can use select_dtypes to check whether a column is non-numeric, and create the dataframe with a dictionary comprehension like this:



    def rows_aggr(df, num):
        # names of the columns that are not numeric
        list_col_notnumeric = df.select_dtypes(exclude=[np.number]).columns
        # sample num values for non-numeric columns, use the mean otherwise
        return pd.DataFrame({col: df[col].sample(num).values
                                  if col in list_col_notnumeric
                                  else df[col].mean()
                             for col in df.columns})

    print(rows_aggr(df, 3))
      col1 col2  col3      col4
    0    d    i   2.5  8.333333
    1    a    n   2.5  8.333333
    2    c    j   2.5  8.333333


























  • So when I run it on my dataset I get error "ValueError: If using all scalar values, you must pass an index"
    – YohanRoth
    Nov 16 '18 at 2:54










  • @YohanRoth from what I have read about this error, it seems that adding the parameter index=range(num) to the pd.DataFrame call may solve the issue. As I can't reproduce the error, it's difficult to be sure that this is the cause. Maybe share the first few rows of your dataset; if it contains any special type of data (other than str/float/int), that may be the reason too.
    – Ben.T
    Nov 16 '18 at 4:16
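
A sketch of the index=range(num) suggestion from the comment above. pandas raises "If using all scalar values, you must pass an index" when every value in the dictionary is a scalar, which would happen here if all columns of the dataset are numeric; whether that is what the asker's dataset looked like is an assumption, since it was never shared. The names non_numeric, data, all_numeric and mixed are illustrative.

```python
import numpy as np
import pandas as pd

def rows_aggr(df, num):
    # Names of the columns that are not numeric (strings, objects, ...).
    non_numeric = set(df.select_dtypes(exclude=[np.number]).columns)
    data = {col: df[col].sample(num).values if col in non_numeric
            else df[col].mean()
            for col in df.columns}
    # An explicit index lets pandas broadcast the scalar means to num rows,
    # even when every column is numeric.
    return pd.DataFrame(data, index=range(num))

# Frame with numeric columns only: the case that fails without an index.
all_numeric = pd.DataFrame({'col3': range(6), 'col4': np.arange(10, 16) / 1.5})
result = rows_aggr(all_numeric, 3)
print(result)

# Mixed frame from the question still works the same way.
mixed = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})
out = rows_aggr(mixed, 3)
print(out)
```

Note that sample(num) draws without replacement by default, so num must not exceed the number of rows in df unless replace=True is passed.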











answered Nov 15 '18 at 19:12 by Ben.T











