Loops in R with big data set, a better way?

I am using R and have a big dataset containing 12,224,433 rows.
For every row I want to run a Spearman correlation test against one vector
and extract the p-value. The script looks like this:

    pvals <- numeric(nrow(SNP))

    for (i in 1:nrow(SNP)) {
      fit <- cor.test(vector, as.numeric(SNP[i, c(4:50)]), method = 'spearman', exact = FALSE)
      pvals[i] <- fit$p.value
      names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])
    }

The problem is that it takes ages. By my rough calculation, the first 70,000 rows alone took 2 hours, so at that rate the full run would take roughly 350 hours.
Is there any way to speed it up?

Tags: r, loops

edited Nov 20 '18 at 14:27 by MrFlick
asked Nov 20 '18 at 14:23 by Yun Wang

  • Can you post the first couple of rows of the data set so we have a reproducible example?
    – Ben Bolker, Nov 20 '18 at 15:07
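
Since the real data were never posted, a small simulated stand-in makes the snippets in this thread runnable. The sketch below is only a guess at the layout, not the asker's data: it builds a hypothetical `SNP` data frame in which `V1` and `V2` hold identifiers, `V3` is a filler column, and `V4` to `V50` hold 47 genotype-like values, plus a `vector` of 47 values to test against. All values are made up.

```r
# Simulated stand-in for the asker's data (every value here is invented).
set.seed(42)
n_rows <- 200                      # tiny compared with 12,224,433 rows

geno <- matrix(sample(0:2, n_rows * 47, replace = TRUE), nrow = n_rows)
SNP <- data.frame(
  V1 = sample(paste0("chr", 1:5), n_rows, replace = TRUE),  # hypothetical label
  V2 = seq_len(n_rows),                                     # hypothetical position
  V3 = "A",                                                 # hypothetical filler
  geno
)
names(SNP)[4:50] <- paste0("V", 4:50)

vector <- rnorm(47)                # same variable name as in the question

# The loop from the question, run unchanged on the toy data
pvals <- numeric(nrow(SNP))
for (i in 1:nrow(SNP)) {
  fit <- cor.test(vector, as.numeric(SNP[i, c(4:50)]), method = 'spearman', exact = FALSE)
  pvals[i] <- fit$p.value
  names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])
}
head(pvals)
```

Raising `n_rows` step by step gives a feel for how the run time grows before committing to the full data set.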














3 Answers

This would be a good candidate for parallel processing with a package such as foreach or future.apply.

The code below uses future.apply because of how simple that package is to use.

The general strategy is to take the action you want to repeat (i.e. getting the p-value for one row of the data), turn that action into a function, and use future.apply to run that function over every row.

    library(future.apply)

    # Establish the method used for parallel processing
    # (plan(multiprocess) is deprecated in recent versions of the future package)
    plan(multisession)

    # Convert the relevant columns to a numeric matrix
    snp_subset <- as.matrix(SNP[, 4:50])
    storage.mode(snp_subset) <- 'double'

    # Define a function to get the p-value for a given row of the matrix
    get_pval <- function(row_index) {
      cor.test(vector, snp_subset[row_index, ], method = 'spearman', exact = FALSE)$p.value
    }

    # Use parallel processing to get the p-value for each row of the matrix
    pvals <- future_sapply(X = seq_len(nrow(snp_subset)), FUN = get_pval)

    # Assign all names in one go, outside the parallel step
    names(pvals) <- paste(SNP$V1, SNP$V2)

answered Nov 20 '18 at 16:00 by bschneidr

  • That's awesome. I will also follow the advice in the other answer and move names() outside the loop.
    – Yun Wang, Nov 22 '18 at 10:27

Here's what I can suggest based on the info you have shared. I have added my thoughts as comments in the code:

    # Convert the relevant columns to a matrix once, instead of calling
    # as.numeric(SNP[i, c(4:50)]) on every iteration. Subsetting a matrix row
    # directly gives a vector, which is what cor.test() needs.
    y <- as.matrix(SNP[, c(4:50)])

    # Initialize pvals with NA, then fill in one value per iteration
    pvals <- rep(NA_real_, nrow(SNP))

    for (i in seq_len(nrow(SNP))) {
      fit <- cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)
      pvals[i] <- fit$p.value
    }

    # Assign all names in one go instead of doing it inside the loop
    names(pvals) <- paste(SNP$V1, SNP$V2)
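
One plausible reason why moving the names() assignment out of the loop matters so much: `names(pvals)[i] <- ...` has to rebuild the names attribute of the whole vector on every iteration, so each step gets more expensive as the vector grows. The sketch below illustrates this on made-up data and is not a benchmark of the asker's script; `name_in_loop`, `name_after_loop`, `ids` and `vals` are hypothetical names.

```r
# Compare naming inside the loop with naming once afterwards.
# `ids` stands in for paste(SNP$V1, SNP$V2); the values are placeholders.
n <- 20000
ids <- paste("chr1", seq_len(n))
vals <- runif(n)

name_in_loop <- function() {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- vals[i]
    names(out)[i] <- ids[i]   # touches the full names vector each time
  }
  out
}

name_after_loop <- function() {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- vals[i]
  }
  names(out) <- ids           # one assignment for all names
  out
}

t_in    <- system.time(a <- name_in_loop())
t_after <- system.time(b <- name_after_loop())

identical(a, b)               # check that both give the same named vector
c(in_loop = t_in[["elapsed"]], after_loop = t_after[["elapsed"]])
```

The timings depend on the machine and the R version, so treat them as a rough indication only.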


Finally, yours is a classic use case for parallel processing. With a parallel processing package like foreach you can run multiple tests in parallel and then combine the results into your result vector pvals.
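
As a rough sketch of that idea, the example below splits the rows into blocks and hands each block to a worker. It swaps in the base `parallel` package rather than foreach so that it runs without extra installs. The data are simulated, `x`, `y` and `block_pvals` are placeholder names, and the worker count of 2 is an arbitrary assumption.

```r
library(parallel)

set.seed(1)
x <- rnorm(47)                                # stand-in for `vector`
y <- matrix(rnorm(1000 * 47), nrow = 1000)    # stand-in for the numeric SNP columns

# p-values for every row of one block of the matrix
block_pvals <- function(block, x) {
  apply(block, 1, function(row) {
    cor.test(x, row, method = 'spearman', exact = FALSE)$p.value
  })
}

n_workers <- 2
cl <- makeCluster(n_workers)

# One contiguous block of rows per worker, so each worker only
# receives its own part of the matrix
blocks <- lapply(splitIndices(nrow(y), n_workers),
                 function(idx) y[idx, , drop = FALSE])

res <- parLapply(cl, blocks, block_pvals, x = x)
stopCluster(cl)

pvals_par <- unlist(res)

# Compare with the plain serial computation
pvals_ser <- block_pvals(y, x)
all.equal(pvals_par, pvals_ser)
```

Sending blocks of rows rather than the whole matrix keeps the amount of data copied to each worker down, which is likely to matter with millions of rows.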



I also suggest reading the book 'The R Inferno' for more on how to improve code efficiency.

edited Nov 20 '18 at 15:22
answered Nov 20 '18 at 15:14 by Shree

  • You might be able to shave off a little more by extracting only the necessary bits from cor.test.default() (but it would be a nuisance).
    – Ben Bolker, Nov 20 '18 at 15:21

  • @BenBolker Thanks for the suggestion. Feel free to add to or edit the answer. I am unfamiliar with cor.test() and cor.test.default(), so I can't do it myself. Thanks!
    – Shree, Nov 20 '18 at 15:33

  • I thought I would have to use a parallel processing package, but moving names() out of the loop alone already saves an enormous amount of time!
    – Yun Wang, Nov 22 '18 at 9:02
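
The comment about extracting only the necessary bits from cor.test.default() can be taken a step further. As far as I can tell from the R source, with exact = FALSE the two-sided Spearman p-value comes from a t approximation on the rank correlation, so it can be computed for all rows at once with matrix operations. The sketch below rests on several assumptions: `row_spearman_pvals` is a hypothetical helper, it assumes no missing values and no constant rows, and it is checked against cor.test() on simulated data only.

```r
# Spearman p-values for every row of a numeric matrix `y` against a
# numeric vector `x`, using the t approximation (two-sided test).
row_spearman_pvals <- function(y, x) {
  n <- length(x)
  stopifnot(ncol(y) == n, n > 2)

  rx <- rank(x)
  ry <- t(apply(y, 1, rank))            # rank within each row (ties averaged)

  # Pearson correlation of the ranks, computed for all rows at once
  rx_c <- rx - mean(rx)
  ry_c <- ry - rowMeans(ry)
  rho <- drop(ry_c %*% rx_c) / sqrt(rowSums(ry_c^2) * sum(rx_c^2))

  # t statistic with n - 2 degrees of freedom
  t_stat <- rho / sqrt((1 - rho^2) / (n - 2))
  pmin(2 * pt(-abs(t_stat), df = n - 2), 1)
}

# Check against cor.test() on simulated data
set.seed(1)
x <- rnorm(47)
y <- matrix(rnorm(200 * 47), nrow = 200)

fast <- row_spearman_pvals(y, x)
slow <- apply(y, 1, function(row) {
  cor.test(x, row, method = 'spearman', exact = FALSE)$p.value
})
all.equal(fast, slow)
```

Ranking each row is still a loop at the R level, but it skips the argument checking and object construction that cor.test() repeats on every call. Before relying on this, compare it with cor.test() on a sample of the real rows, especially if they contain ties or missing values.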



















You can use apply:



    SNP["pvals"] <- apply(SNP[, c(4:50)], MARGIN = 1, FUN = function(row) {
      cor.test(vector, as.numeric(row), method = 'spearman', exact = FALSE)$p.value
    })

    # SNP$pvals

edited Nov 20 '18 at 15:16
answered Nov 20 '18 at 15:08 by emsinko

  • It turns out that apply behaves similarly to for: it runs its calls at the R level (not at the C level like other members of the apply family). See the canonical question and answer from @DavidArenburg, who answered their own question.
    – Parfait, Nov 20 '18 at 15:12
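
A quick way to check this claim on your own machine is to time both forms on the same matrix. The sketch below uses simulated data (`x` and `y` are placeholders for `vector` and the numeric SNP columns, and `p_one` is a hypothetical helper). It only confirms that the two forms return the same values and lets you compare their run times, which will vary by machine.

```r
set.seed(1)
x <- rnorm(47)
y <- matrix(rnorm(2000 * 47), nrow = 2000)

# p-value for a single row
p_one <- function(row) {
  cor.test(x, row, method = 'spearman', exact = FALSE)$p.value
}

# Row-wise apply()
t_apply <- system.time(p_apply <- apply(y, 1, p_one))

# Explicit for loop with a preallocated result vector
t_for <- system.time({
  p_for <- numeric(nrow(y))
  for (i in seq_len(nrow(y))) p_for[i] <- p_one(y[i, ])
})

all.equal(p_apply, p_for)
c(apply = t_apply[["elapsed"]], for_loop = t_for[["elapsed"]])
```

If the two timings come out close, most of the cost is in cor.test() itself rather than in the looping construct.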















