Get a sample of aggregated row values with pandas

I need a function that, given a data frame and a number num, constructs a data frame with num rows in which every row has the following values:
- for columns with string values, we sample a value from that column in the original table
- for columns with floats or ints, we take the column's mean value

Here is my code

def rows_aggr(df, num):
    dataframe = None
    for i in range(0, num):
        row = None
        for cname in df.columns.values:
            column = df[cname]
            dfcol = Series.to_frame(column)

            if column.dtype != np.number:
                item = dfcol.sample(n=1)
            else:
                item = dfcol.mean(axis=1)

            if row is None:
                row = item
            else:
                row = pd.concat([row, item], axis=1)

        if dataframe is None:
            dataframe = row
        else:
            dataframe = pd.concat([dataframe, row], axis=0)

    return dataframe

For some reason the rows contain NaN values and the number of rows exceeds num, so this code does not seem to work correctly. If you know a better way to accomplish what I need, I would be happy to hear it.

For

df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3

we would get something like

c, 2.5
f, 2.5
b, 2.5

assuming c, f, and b were randomly picked.

Thank you!

edited Nov 15 '18 at 18:13

asked Nov 15 '18 at 16:55

YohanRoth


so for each column with float or int, the value will always be the same for each row?
– Ben.T
Nov 15 '18 at 17:21

@Ben.T yes, why not )
– YohanRoth
Nov 15 '18 at 18:02

I'm not sure I see what you want to get. Can you add an example of your expected output for df = pd.DataFrame({'col1':list('abcdef'),'col2':range(6)}) and num=3?
– Ben.T
Nov 15 '18 at 18:09

@Ben.T check pls, added
– YohanRoth
Nov 15 '18 at 18:13

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
– Parfait
Nov 15 '18 at 19:22
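
A minimal sketch of the pattern this comment recommends, assuming the same small example frame used in the question: collect the pieces in a plain Python list and call pd.concat once, after the loop. The names pieces and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})

# Build every piece first and keep them in an ordinary list ...
pieces = [df.sample(n=1) for _ in range(3)]

# ... then concatenate once, outside the loop, so the data is copied a single time.
# ignore_index=True replaces the sampled row labels with 0, 1, 2.
result = pd.concat(pieces, axis=0, ignore_index=True)
print(result)
```

Each df.sample(n=1) call is independent, so the same row can appear more than once in result.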

python pandas

One error seems to be that the condition column.dtype != np.number does not work as intended. There is also a problem with index alignment when you do pd.concat([row, item], axis=1): item carries the index label of the sampled row, which is not always the same from one column to the next, and this adds rows with NaN to row. Here is another way to do it.

SETUP

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':list('abcdef'),'col2':list('ijklmn'),
                   'col3':range(6),'col4':np.arange(10,16)/1.5})
print (df)
  col1 col2  col3       col4
0    a    i     0   6.666667
1    b    j     1   7.333333
2    c    k     2   8.000000
3    d    l     3   8.666667
4    e    m     4   9.333333
5    f    n     5  10.000000

You can use select_dtypes to find the columns that are not numeric, and create the dataframe with a dictionary comprehension like:

def rows_aggr(df, num):
    # columns whose dtype is not numeric (e.g. strings)
    list_col_notnumeric = df.select_dtypes(exclude=[np.number]).columns
    # non-numeric columns: num sampled values; numeric columns: the mean, repeated on every row
    return pd.DataFrame({col: df[col].sample(num).values
                              if col in list_col_notnumeric
                              else df[col].mean()
                         for col in df.columns})

print (rows_aggr(df, 3))
  col1 col2  col3      col4
0    d    i   2.5  8.333333
1    a    n   2.5  8.333333
2    c    j   2.5  8.333333
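
For reference, the symptoms described in the question (NaN values and more rows than num) can be reproduced in isolation. The sketch below is only an illustration on the question's two-column example: it uses fixed row positions in place of random sampling so the result is repeatable, and it shows pandas.api.types.is_numeric_dtype as one alternative per-column numeric check. The variable names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})

# sample(n=1) returns a one-row frame that keeps the original index label.
# Fixed positions stand in for two independent random draws here.
a = df[['col1']].iloc[[2]]   # index label 2
b = df[['col2']].iloc[[5]]   # index label 5

# axis=1 aligns on index labels; 2 and 5 do not match,
# so the result has two rows, each with one NaN.
misaligned = pd.concat([a, b], axis=1)
print(misaligned)

# mean(axis=1) averages across columns for each row: one value per original row.
row_means = df[['col2']].mean(axis=1)
# mean() on the column itself gives a single scalar.
col_mean = df['col2'].mean()

# A per-column numeric check that does not rely on comparing dtype to np.number.
numeric_flags = {c: is_numeric_dtype(df[c]) for c in df.columns}
print(numeric_flags)
```

Sampling values with .values, as in the function above, drops the index labels and so sidesteps the alignment step entirely.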

answered Nov 15 '18 at 19:12

Ben.T


So when I run it on my dataset I get the error "ValueError: If using all scalar values, you must pass an index"
– YohanRoth
Nov 16 '18 at 2:54

@YohanRoth from what I have read about this error, adding the parameter index=range(num) to the pd.DataFrame call may solve the issue. As I can't reproduce the error, it's difficult to be sure that is the cause. Maybe share the first few rows of your dataset; if it contains any special type of data (other than str/float/int), that may be the reason too.
– Ben.T
Nov 16 '18 at 4:16
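
A sketch of the index=range(num) suggestion, under one assumption the thread does not confirm: that the dataset in question had only numeric columns. In that case every dictionary value is a scalar mean, which is a situation where pandas raises this ValueError. The frame all_numeric is made up for illustration.

```python
import numpy as np
import pandas as pd

def rows_aggr(df, num):
    non_numeric = df.select_dtypes(exclude=[np.number]).columns
    data = {col: df[col].sample(num).values
                 if col in non_numeric
                 else df[col].mean()
            for col in df.columns}
    # An explicit index tells pandas how many rows to build,
    # so scalar means are broadcast even if no column supplies an array.
    return pd.DataFrame(data, index=range(num))

# Hypothetical all-numeric input: without index=..., this case raises the ValueError.
all_numeric = pd.DataFrame({'x': range(6), 'y': np.arange(10, 16) / 1.5})
out = rows_aggr(all_numeric, 3)
print(out)

# Mixed string/numeric input still works with the explicit index.
mixed = pd.DataFrame({'col1': list('abcdef'), 'col2': range(6)})
out_mixed = rows_aggr(mixed, 3)
print(out_mixed)
```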
