Loops in R with big data set, a better way?

I am using R and have a big dataset containing 12,224,433 rows.
For every row I want to run a Spearman correlation test against one vector
and extract the p-value. The script looks like this:

pvals <- numeric(nrow(SNP))

for(i in 1:nrow(SNP)) {
  fit <- cor.test(vector, as.numeric(SNP[i,c(4:50)]), method='spearman', exact=FALSE)
  pvals[i] <- fit$p.value
  names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])
}

The problem is that it takes ages. By my estimate, it took 2 hours to run only the first 70,000 rows, so the full run could take 200 hours or more.
Is there any way to speed it up?

asked Nov 20 '18 at 14:23 by Yun Wang
edited Nov 20 '18 at 14:27 by MrFlick

Can you post the first couple of rows of the data set so we have a reproducible example?
– Ben Bolker, Nov 20 '18 at 15:07 (3 upvotes)

Tags: r, loops


3 Answers

This would be a good candidate for using parallel processing with a package such as foreach or future.apply.

The code below makes use of future.apply because of how simple that package is to use.

The general strategy is to take the action you want to repeat (i.e. getting p-values based on a subset of data), turn that action into a function, and use future.apply to repeat that function for the different subsets of data you want to use.

library(future.apply)

# Establish the method used for parallel processing
# (multisession is used here because the multiprocess strategy is deprecated in recent versions of future)
plan(multisession)

# Convert the relevant columns to a numeric matrix
# (SNP is a data frame, so go through as.matrix() first)
snp_subset <- as.matrix(SNP[, 4:50])
storage.mode(snp_subset) <- "double"

# Define a function to get the p-value for a given row of the matrix
get_pvals <- function(row_index) {
  cor.test(vector, snp_subset[row_index, ], method = 'spearman', exact = FALSE)$p.value
}

# Use parallel processing to get the p-value for each row of the matrix.
# For very large inputs, the future.globals.maxSize option may need to be raised.
pvals <- future_sapply(X = seq_len(nrow(snp_subset)),
                       FUN = get_pvals)

# Assign all names in one go, outside the parallel call
names(pvals) <- paste(SNP$V1, SNP$V2)

answered Nov 20 '18 at 16:00 by bschneidr

That's awesome. I will also follow the advice above to move names() outside the loop.
– Yun Wang, Nov 22 '18 at 10:27
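Not part of the answer above, but the strategy it describes (wrap the per-row work in a function, then hand it to an apply-style function) can be tried out without installing anything. Below is a minimal sketch under stated assumptions: `chunked_pvals` and its `apply_fun` argument are hypothetical names, the data are simulated, and the default `lapply` runs sequentially. An lapply-compatible parallel function such as `future.apply::future_lapply` could presumably be passed as `apply_fun` instead, but that is not exercised here.

```r
# Hypothetical helper: compute Spearman p-values block by block.
# `apply_fun` is any lapply-compatible function; base lapply is the default.
chunked_pvals <- function(x, y, chunk_size = 10000, apply_fun = lapply) {
  idx <- seq_len(nrow(y))
  # Group row indices into consecutive chunks of at most chunk_size rows
  chunks <- split(idx, ceiling(idx / chunk_size))
  res <- apply_fun(chunks, function(rows) {
    # One p-value per row in this chunk
    vapply(rows, function(i) {
      cor.test(x, y[i, ], method = "spearman", exact = FALSE)$p.value
    }, numeric(1))
  })
  # Chunks come back in row order, so flattening restores the original order
  unlist(res, use.names = FALSE)
}

# Small simulated input (made-up data, for illustration only)
set.seed(1)
x <- rnorm(47)
y <- matrix(rnorm(200 * 47), nrow = 200)
p_chunked <- chunked_pvals(x, y, chunk_size = 64)
```

Working in chunks also makes it easy to save intermediate results, which may matter for a run that takes many hours.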

Here's what I can suggest based on the info you have shared. I have added my thoughts as comments in the code:

# Convert the relevant columns to a matrix once, instead of calling
# as.numeric(SNP[i, c(4:50)]) in every iteration.
# Subsetting a matrix row directly gives a vector, which is what cor.test() needs.
y <- as.matrix(SNP[, 4:50])

# Initialize pvals with NA, then fill in one value per iteration
pvals <- rep(NA_real_, nrow(SNP))

for (i in seq_len(nrow(SNP))) {
  fit <- cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)
  pvals[i] <- fit$p.value
}

# Assign all names in one go instead of doing it inside the loop
names(pvals) <- paste(SNP$V1, SNP$V2)

Finally, yours is a classic use case for parallel processing. With a parallel processing package such as foreach, you can run multiple tests in parallel and then combine them into your result vector pvals.

I also suggest reading the book 'The R Inferno' for more on improving code efficiency.

answered Nov 20 '18 at 15:14 by Shree
edited Nov 20 '18 at 15:22

You might be able to shave off a little bit more by extracting only the necessary bits from cor.test.default() (but it would be a nuisance).
– Ben Bolker, Nov 20 '18 at 15:21

@BenBolker Thanks for the suggestion. Feel free to add to or edit the answer. I am unfamiliar with cor.test() and cor.test.default(), so I can't do it myself. Thanks!
– Shree, Nov 20 '18 at 15:33
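For what it's worth, here is a rough sketch of what Ben Bolker's suggestion might look like. It is not taken from any answer here. It assumes that, with exact = FALSE and the default two-sided alternative, cor.test() derives the Spearman p-value from a t approximation on the correlation of the ranks, and it assumes no missing values and no constant rows. `spearman_pvals` is a hypothetical name and the genotype-like data are made up.

```r
# Hypothetical helper: Spearman p-values of x against every row of y_mat,
# using the t approximation (assumed equivalent to exact = FALSE, two-sided).
# Assumes no missing values and no constant rows.
spearman_pvals <- function(x, y_mat) {
  n <- length(x)
  rx <- rank(x)
  # apply() over rows returns one column of ranks per original row
  ry <- apply(y_mat, 1, rank)
  # Pearson correlation of the ranks is Spearman's rho (ties get average ranks)
  rho <- as.vector(cor(rx, ry))
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Made-up genotype-like data (0/1/2 codes), for illustration only
set.seed(42)
x <- rnorm(47)
y_mat <- matrix(sample(0:2, 100 * 47, replace = TRUE), nrow = 100)

p_fast <- spearman_pvals(x, y_mat)

# Reference: the row-by-row cor.test() approach from the question
p_ref <- apply(y_mat, 1, function(row) {
  cor.test(x, row, method = "spearman", exact = FALSE)$p.value
})
all.equal(p_fast, p_ref)
```

On 12 million rows the matrix of ranks would be large, so this would probably have to be run on blocks of rows rather than on the whole matrix at once.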

I thought I would have to use a parallel processing package, but moving names() out of the loop alone already saves an enormous amount of time!
– Yun Wang, Nov 22 '18 at 9:02 (1 upvote)
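A small sketch of why this probably helps so much: as far as I understand, names(pvals)[i] <- ... rewrites the names attribute of the whole vector on every iteration, so the cost of each step grows with the length of the vector. The example below uses made-up labels and values and a small n; the two variants give the same result, and the system.time() calls can be compared on your own machine.

```r
n <- 2000
ids <- paste0("id", seq_len(n))   # made-up labels
vals <- runif(n)                  # made-up values standing in for p-values

# Variant 1: set one name per iteration (as in the question)
slow <- numeric(n)
t_slow <- system.time(
  for (i in seq_len(n)) {
    slow[i] <- vals[i]
    names(slow)[i] <- ids[i]
  }
)

# Variant 2: fill values in the loop, assign all names once afterwards
fast <- numeric(n)
t_fast <- system.time({
  for (i in seq_len(n)) {
    fast[i] <- vals[i]
  }
  names(fast) <- ids
})

# Same result either way; only the running time differs
identical(slow, fast)
```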

You can use apply:

SNP["pvals"] <- apply(SNP[, c(4:50)], MARGIN = 1, FUN = function(row) {
  cor.test(vector, as.numeric(row), method = 'spearman', exact = FALSE)$p.value
})

# The p-values are then available as SNP$pvals

answered Nov 20 '18 at 15:08 by emsinko
edited Nov 20 '18 at 15:16

It turns out that apply behaves much like a for loop: it runs its calls at the R level (not at the C level like other members of the apply family). See this canonical question and answer from @DavidArenburg, who answered their own question.
– Parfait, Nov 20 '18 at 15:12 (3 upvotes)
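To illustrate Parfait's point, here is a minimal sketch on simulated data (all names and data are made up). The for loop, apply(), and vapply() all make the same per-row call, so they should return the same p-values; the choice between them is mostly about readability, with vapply() additionally checking that each result is a single number.

```r
set.seed(7)
x <- rnorm(47)
y <- matrix(rnorm(50 * 47), nrow = 50)

# The per-row work, written once and reused by all three variants
pval_row <- function(row) {
  cor.test(x, row, method = "spearman", exact = FALSE)$p.value
}

# Explicit for loop with a preallocated result vector
p_for <- numeric(nrow(y))
for (i in seq_len(nrow(y))) {
  p_for[i] <- pval_row(y[i, ])
}

# apply() over rows
p_apply <- apply(y, 1, pval_row)

# vapply() over row indices, with a type-checked result
p_vapply <- vapply(seq_len(nrow(y)), function(i) pval_row(y[i, ]), numeric(1))
```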
