Loops in R with big data set, a better way?
I am using R and have a big data set containing 12,224,433 rows.
For every row I want to run a Spearman correlation test against one vector
and extract the p-value. The script looks like this:
pvals <- numeric(nrow(SNP))
for(i in 1:nrow(SNP)) {
  fit <- cor.test(vector, as.numeric(SNP[i,c(4:50)]), method='spearman', exact=FALSE)
  pvals[i] <- fit$p.value
  names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])
}
The problem is that it takes ages. I timed it: the first 70,000 rows alone took 2 hours, so the full data set could take hundreds of hours.
Is there any way to speed it up?
r loops
Can you post the first couple of rows of the data set so we have a reproducible example?
– Ben Bolker
Nov 20 '18 at 15:07
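Since no sample rows were posted, here is a small simulated stand-in that can be used to try the suggestions in this thread. It is only a sketch: the layout (identifier columns V1 and V2, a filler third column, and values in columns 4 to 50) is inferred from the question's code, and the sizes, column contents, and value range 0 to 2 are assumptions.

```r
# Simulated stand-in for the real data (all sizes and contents are assumptions)
set.seed(42)
n_rows <- 1000      # the real data set has 12,224,433 rows
n_samples <- 47     # columns 4 to 50 hold the values tested against `vector`

SNP <- data.frame(
  V1 = paste0("chr", sample(1:22, n_rows, replace = TRUE)),  # hypothetical ID column
  V2 = seq_len(n_rows),                                      # hypothetical position column
  V3 = "A",                                                  # filler third column
  matrix(sample(0:2, n_rows * n_samples, replace = TRUE), nrow = n_rows)
)

# The vector every row is correlated against (same name as in the question)
vector <- rnorm(n_samples)

dim(SNP)
```

With something like this in place, any of the approaches can be timed on a few thousand rows before committing to the full data set.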
edited Nov 20 '18 at 14:27 by MrFlick
asked Nov 20 '18 at 14:23 by Yun Wang
3 Answers
This would be a good candidate for parallel processing with a package such as foreach or future.apply.
The code below uses future.apply because that package is simple to use.
The general strategy is to take the action you want to repeat (i.e. getting a p-value from one subset of the data), turn that action into a function, and use future.apply to repeat that function over the different subsets of data.
library(future.apply)
# Establish the method used for parallel processing
# (multisession replaces the deprecated multiprocess plan)
plan(multisession)
# Convert the relevant columns to a numeric matrix once, up front
snp_subset <- as.matrix(SNP[, 4:50])
storage.mode(snp_subset) <- 'double'
# Define a function to get the p-value for a given row of the matrix
get_pvals <- function(row_index) {
  cor.test(vector, snp_subset[row_index, ], method = 'spearman', exact = FALSE)$p.value
}
# Use parallel processing to get the p-value for each row of the matrix
pvals <- future_sapply(X = seq_len(nrow(snp_subset)),
                       FUN = get_pvals)
# Assign all names in one vectorized step instead of once per row
names(pvals) <- paste(SNP$V1, SNP$V2)
That's awesome. I will also follow the advice above and move names() outside the loop.
– Yun Wang
Nov 22 '18 at 10:27
Here's what I can suggest based on the info you have shared. I have added my thoughts as comments in the code:
# Convert the relevant columns to a numeric matrix once, instead of calling
# as.numeric(SNP[i, c(4:50)]) in every iteration.
# Subsetting a matrix row directly gives a vector, which is what cor.test() needs.
y <- as.matrix(SNP[, 4:50])
# Initialize pvals with NA, then fill in one value per iteration
pvals <- rep(NA_real_, nrow(SNP))
for (i in seq_len(nrow(SNP))) {
  fit <- cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)
  pvals[i] <- fit$p.value
}
# Assign all names in one go instead of doing it inside the loop
names(pvals) <- paste(SNP$V1, SNP$V2)
Finally, yours is a classic use case for parallel processing. With a parallel processing package such as foreach you can run multiple tests in parallel and then combine them into your result vector pvals.
I also suggest reading the book 'The R Inferno' for more on how to improve code efficiency.
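A rough way to see what naming inside the loop costs is to time both patterns on a toy vector. This is only a sketch: `n`, `labels`, `slow`, and `fast` are made-up names, and the timings will vary by machine and R version.

```r
n <- 20000
labels <- paste("id", seq_len(n))

# Names assigned one element at a time, as in the question's loop
slow <- function() {
  x <- numeric(n)
  for (i in seq_len(n)) {
    x[i] <- i
    names(x)[i] <- labels[i]
  }
  x
}

# Names assigned once, after the loop has finished
fast <- function() {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i
  names(x) <- labels
  x
}

system.time(a <- slow())
system.time(b <- fast())

# Both patterns should build the same named vector
identical(a, b)
```

The per-element version has to touch the whole names attribute on every iteration, so its cost is likely to grow much faster than linearly with the number of rows.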
You might be able to shave off a little more by extracting only the necessary bits from cor.test.default() (but it would be a nuisance).
– Ben Bolker
Nov 20 '18 at 15:21
@BenBolker Thanks for the suggestion. Feel free to add to or edit the answer. I am unfamiliar with cor.test() and cor.test.default(), so I can't do it myself. Thanks!
– Shree
Nov 20 '18 at 15:33
I thought I would have to use a parallel processing package, but moving names() out of the loop alone already saves an enormous amount of time!
– Yun Wang
Nov 22 '18 at 9:02
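Following up on the comment about extracting only the necessary bits from cor.test.default(): as far as I can tell, with exact = FALSE and the default two-sided alternative (no continuity correction), the Spearman p-value comes from a t approximation applied to the correlation of the ranks. Under that assumption, a vectorised sketch could look like the one below. `spearman_pvals`, `mat`, and `v` are hypothetical names, the data is simulated, and the function assumes no missing values.

```r
# Sketch: Spearman p-values for every row of a numeric matrix against one vector.
# Assumes no NAs and a two-sided test; constant rows give NaN.
spearman_pvals <- function(mat, v) {
  n <- length(v)
  stopifnot(ncol(mat) == n)
  # Rank each row (ties get average ranks); apply() returns the ranks
  # column-wise, so transpose back to one row per test
  row_ranks <- t(apply(mat, 1, rank))
  v_rank <- rank(v)
  # Pearson correlation of the ranks is Spearman's rho
  rc <- row_ranks - rowMeans(row_ranks)
  vc <- v_rank - mean(v_rank)
  rho <- as.vector(rc %*% vc) / sqrt(rowSums(rc^2) * sum(vc^2))
  # t approximation with n - 2 degrees of freedom
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Simulated check against cor.test() on a small matrix
set.seed(1)
mat <- matrix(sample(0:2, 50 * 47, replace = TRUE), nrow = 50)
v <- rnorm(47)

p_fast <- spearman_pvals(mat, v)
p_ref <- apply(mat, 1, function(y) {
  cor.test(v, y, method = "spearman", exact = FALSE)$p.value
})
all.equal(p_fast, p_ref)
```

Ranking is still done row by row here, but the correlation and the p-value are computed for all rows at once, which avoids the per-call overhead of cor.test(). It is worth comparing against cor.test() on a sample of the real data before relying on it.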
You can use apply:
SNP["pvals"] <- apply(SNP[, 4:50], MARGIN = 1, FUN = function(row) {
  cor.test(vector, as.numeric(row), method = 'spearman', exact = FALSE)$p.value
})
# SNP$pvals
It turns out apply behaves similarly to for, in that it runs its calls at the R level (not at the C level like other members of the apply family). See this canonical question and answer from @DavidArenburg, who answered their own question.
– Parfait
Nov 20 '18 at 15:12
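To check this on your own machine, the three looping styles can be run side by side on simulated data. This is only a sketch: `mat`, `v`, and `p_one` are hypothetical stand-ins for SNP[, 4:50], the question's vector, and the per-row test.

```r
set.seed(1)
mat <- matrix(rnorm(200 * 47), nrow = 200)  # stand-in for the numeric columns
v <- rnorm(47)                              # stand-in for `vector`

# The per-row test used throughout the thread
p_one <- function(y) {
  cor.test(v, y, method = "spearman", exact = FALSE)$p.value
}

# Explicit loop with a preallocated result
p_for <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) p_for[i] <- p_one(mat[i, ])

# apply() over the rows of the matrix
p_apply <- apply(mat, 1, p_one)

# vapply() over row indices, with a declared return type
p_vapply <- vapply(seq_len(nrow(mat)), function(i) p_one(mat[i, ]), numeric(1))

all.equal(p_for, p_apply)
all.equal(p_for, p_vapply)
```

All three call the same R function once per row, so wrapping each in system.time() should show broadly similar run times. The choice between them is mostly about readability.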
answered Nov 20 '18 at 16:00 by bschneidr (the future.apply answer)
edited Nov 20 '18 at 15:22, answered Nov 20 '18 at 15:14 by Shree (the preallocated-loop answer)
edited Nov 20 '18 at 15:16, answered Nov 20 '18 at 15:08 by emsinko (the apply answer)