Most efficient way to convert values of column in Pandas DataFrame

I have a pd.DataFrame that looks like this:

[image: input DataFrame; see DF_test below]

I want to apply a cutoff to the values to turn them into binary digits; my cutoff in this case is 0.85. I want the resulting DataFrame to look like this:

[image: desired output DataFrame; see DF_want below]

The script I wrote to do this is easy to understand but for large datasets it is inefficient. I'm sure Pandas has some way of taking care of these types of transformations.

Does anyone know of an efficient way to convert a column of floats to a column of integers using a threshold?

My extremely naive way of doing such a thing:

import numpy as np
import pandas as pd

DF_test = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0.12,0.23,0.93,0.86,0.33]]).T,columns=["c1","c2","value"])
DF_want = pd.DataFrame(np.array([list("abcde"),list("pqrst"),[0,0,1,1,0]]).T,columns=["c1","c2","value"])

threshold = 0.85

# Empty dataframe to append rows
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)

# Relabel columns
DF_naive.columns = DF_test.columns
DF_naive.head()
# the sample DF_want

Tags: python, pandas, int, dataframe

asked Feb 25 '16 at 22:22 by O.rka

Can't you just do df['value'] = np.where(df['value'] > 0.85, 1, 0)? This will convert and set the entire column.
– EdChum, Feb 25 '16 at 22:28


2 Answers

You can use np.where to set your desired value based on a boolean condition:

In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1,0)
DF_test

Out[18]:
  c1 c2  value
0  a  p      0
1  b  q      0
2  c  r      1
3  d  s      1
4  e  t      0

Note that because your data was built from a heterogeneous NumPy array, the 'value' column contains strings rather than floats:

In [58]:
DF_test.iloc[0]['value']

Out[58]:
'0.12'

So you'll need to convert the dtype to float before applying the threshold: DF_test['value'] = DF_test['value'].astype(float)
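As a rough illustration of the dtype issue, here is a minimal sketch that rebuilds the question's frame and converts the column before thresholding. The names raw, df and first_raw are only illustrative, and the exact non-float dtype reported may vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Mixing letters and floats in one NumPy array coerces every element to a string.
raw = np.array([list("abcde"), list("pqrst"), [0.12, 0.23, 0.93, 0.86, 0.33]]).T
df = pd.DataFrame(raw, columns=["c1", "c2", "value"])

first_raw = df.loc[0, "value"]
print(repr(first_raw))        # a string such as '0.12', not a float
print(df["value"].dtype)      # not a float dtype (typically object)

# Convert once, up front, so later comparisons are numeric.
df["value"] = df["value"].astype(float)
print(df["value"].dtype)      # float64
```

If some entries might not parse as numbers, pd.to_numeric(df["value"], errors="coerce") is an alternative that turns unparseable entries into NaN instead of raising.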

You can compare the timings:

In [16]:
%timeit np.where(DF_test['value'] > threshold, 1,0)
1000 loops, best of 3: 297 µs per loop

In [17]:
%%timeit
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get first 2 columns
    first2cols = list(DF_test.ix[i][:-1])
    # Check if value is greater than threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create series object
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Add to empty dataframe container
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop

The np.where version is over 100x faster. Admittedly, your code is doing a lot of unnecessary work, but you get the point.
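To try a similar comparison outside IPython, here is a sketch using the standard-library timeit module. It is not the benchmark above: the 10,000-row random frame is hypothetical test data, and the slow path is a plain Python list comprehension rather than the question's row-append loop, so the measured ratio will differ.

```python
import timeit

import numpy as np
import pandas as pd

threshold = 0.85

# Hypothetical test data: 10,000 random floats in [0, 1).
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.random(10_000)})

def vectorized():
    # One NumPy call over the whole column.
    return np.where(df["value"] > threshold, 1, 0)

def python_loop():
    # Element-by-element comparison in Python, for contrast.
    return np.array([1 if v > threshold else 0 for v in df["value"]])

# Both approaches must agree before their timings are worth comparing.
assert (vectorized() == python_loop()).all()

t_vec = timeit.timeit(vectorized, number=20)
t_loop = timeit.timeit(python_loop, number=20)
print(f"np.where: {t_vec:.4f}s, Python loop: {t_loop:.4f}s")
```

Absolute numbers depend on the machine and library versions, so treat the output as indicative only.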

answered Feb 25 '16 at 22:32 by EdChum (edited Feb 25 '16 at 23:18)

When I run this, the entire 'value' column is filled with 1s. np.where(DF_test['value'] > 0.85) returns (array([0, 1, 2, 3, 4]),) and DF_test['value'] > 0.85 returns True everywhere. Any idea why that happens? I copy-pasted DF_test from above.
– Cleb, Feb 25 '16 at 23:08

You may need to convert the DF_test['value'] dtype first: DF_test['value'] = DF_test['value'].astype(float). Otherwise I haven't a clue.
– EdChum, Feb 25 '16 at 23:11

That's it, thanks.
– Cleb, Feb 25 '16 at 23:12

@Cleb The OP created a heterogeneous np.array as the data for the df, which turned all the values in the 'value' column into strings, hence the need to convert the dtype.
– EdChum, Feb 25 '16 at 23:15

OK, you might want to add this to your answer. +1 from my side.
– Cleb, Feb 25 '16 at 23:15

Since bool is a subclass of int, i.e. True == 1 and False == 0, you can convert a Boolean series to its integer form:

DF_test['value'] = (DF_test['value'] > threshold).astype(int)

In most cases, including computation and indexing, the int conversion is not necessary and you may wish to forgo it altogether.
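A small sketch of that point, assuming the column already holds floats; the names values, mask and as_int are only illustrative:

```python
import pandas as pd

threshold = 0.85
values = pd.Series([0.12, 0.23, 0.93, 0.86, 0.33], name="value")

mask = values > threshold        # Boolean Series
as_int = mask.astype(int)        # explicit 0/1 integers

print(as_int.tolist())           # [0, 0, 1, 1, 0]
print(int(mask.sum()))           # 2, because True counts as 1 in arithmetic
print(values[mask].tolist())     # [0.93, 0.86], the mask indexes directly
```

The integer cast mainly matters when the result is written out or passed to code that expects a numeric dtype.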

answered Nov 18 '18 at 23:44 by jpp
