An efficient way of aggregating data from repeated measurements [duplicate]
This question already has an answer here:
- Calculate the mean by group (3 answers)
- Aggregate / summarize multiple variables per group (e.g. sum, mean) (5 answers)
I'm analyzing gene expression data from a large experiment (12,400 single cells and 23,800 genes) and I'm running into an efficiency problem. I will write a reproducible example below, but my problem is the following:
I converted the mouse genes in my dataset to their human counterparts so I can compare with previously published data. In some cases there are multiple matches (one human gene maps to more than one mouse gene). In these cases, I'd like to average the expression values from these multiple genes and come up with one expression value for the human counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying the aggregate() function, but it takes a VERY long time on the full dataset. It is difficult to reproduce the exact situation here, but my mock analytical pipeline is below:
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
# Adding gene names as rownames
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
# Mock gene expression matrix
# Columns indicate expression values from individual cells
# Rows indicate genes
data
#> cell1 cell2 cell3 cell4
#> ABC1 1 1 0 1
#> ABC2 1 2 0 3
#> ABC2 1 4 0 4
#> ABC4 1 10 1 4
#> ABC5 3 5 10 20
#> ABC5 3 10 20 20
# Averaging gene expression values where there are multiple measurements for the same gene
aggr_data <- aggregate(data, by=list(rownames(data)), mean)
# End result I'm trying to achieve
aggr_data
#> Group.1 cell1 cell2 cell3 cell4
#> 1 ABC1 1 1.0 0 1.0
#> 2 ABC2 1 3.0 0 3.5
#> 3 ABC4 1 10.0 1 4.0
#> 4 ABC5 3 7.5 15 20.0
Is there a more efficient way of doing this?
Thanks for your answers!
Tags: r, bigdata, aggregate
marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap
Nov 11 at 12:40
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
asked Nov 10 at 3:17 by Atakan
2 Answers
Accepted answer (score 2), answered Nov 10 at 4:21 by Blended, edited Nov 10 at 6:19:
You can try dplyr: summarise_all() with mean() computes the average of every column within each group.
library(tidyverse) # includes dplyr
(df <- data_frame(
  cell1 = c(1, 1, 1, 1, 3, 3),
  cell2 = c(1, 2, 4, 10, 5, 10),
  cell3 = c(0, 0, 0, 1, 10, 20),
  cell4 = c(1, 3, 4, 4, 20, 20),
  gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
))
#> # A tibble: 6 x 5
#> cell1 cell2 cell3 cell4 gene_name
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 0 1 ABC1
#> 2 1 2 0 3 ABC2
#> 3 1 4 0 4 ABC2
#> 4 1 10 1 4 ABC4
#> 5 3 5 10 20 ABC5
#> 6 3 10 20 20 ABC5
I just added the gene names as an additional column. Now you can use group_by() for the grouped operation:
df %>%
group_by(gene_name) %>% # for each group
summarise_all(mean) # calculate mean for all columns
#> # A tibble: 4 x 5
#> gene_name cell1 cell2 cell3 cell4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC1 1 1 0 1
#> 2 ABC2 1 3 0 3.5
#> 3 ABC4 1 10 1 4
#> 4 ABC5 3 7.5 15 20
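As a side note (not part of the original answer): in dplyr 1.0 and later, summarise_all() is superseded by summarise(across(...)). A sketch of the equivalent call, rebuilding the same df with tibble():

```r
library(dplyr)

df <- tibble(
  cell1 = c(1, 1, 1, 1, 3, 3),
  cell2 = c(1, 2, 4, 10, 5, 10),
  cell3 = c(0, 0, 0, 1, 10, 20),
  cell4 = c(1, 3, 4, 4, 20, 20),
  gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
)

# across(everything()) applies mean() to every non-grouping column
df %>%
  group_by(gene_name) %>%
  summarise(across(everything(), mean))
```

This produces the same 4 x 5 tibble as summarise_all(mean) above.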
In general, for a large data set like yours, the data.table package would be appropriate; the code looks like this:
setDT(df)[, lapply(.SD, mean), by = gene_name]
#> gene_name cell1 cell2 cell3 cell4
#> 1: ABC1 1 1.0 0 1.0
#> 2: ABC2 1 3.0 0 3.5
#> 3: ABC4 1 10.0 1 4.0
#> 4: ABC5 3 7.5 15 20.0
setDT() simply converts the data frame into a data.table object, in place.
dplyr vs data.table
If we bind copies of your data set into one of a comparable size,
df_bench
#># A tibble: 18,000 x 10,001
#> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC308 1 1 0 1 1 1 0
#> 2 ABC258 1 2 0 3 1 2 0
#> 3 ABC553 1 4 0 4 1 4 0
#> 4 ABC57 1 10 1 4 1 10 1
#> 5 ABC469 3 5 10 20 3 5 10
#> 6 ABC484 3 10 20 20 3 10 20
#> 7 ABC813 1 1 0 1 1 1 0
#> 8 ABC371 1 2 0 3 1 2 0
#> 9 ABC547 1 4 0 4 1 4 0
#>10 ABC171 1 10 1 4 1 10 1
#> # ... with 17,990 more rows, and 9,993 more variables:
#> #   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>, ..., cell107 <dbl>, …
Using this set,
microbenchmark::microbenchmark(
DPLYR = {
df_bench %>%
group_by(gene_name) %>%
summarise_all(mean)
},
DATATABLE = {
setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
},
times = 50
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
#> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50
data.table seems faster than dplyr here.
Thanks for the detailed explanation here. Thumbs up!
– Atakan
Nov 12 at 17:14
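Since the original data is a plain numeric matrix, it may also be worth noting base R's rowsum(), which computes grouped column sums in compiled code; group means are then just the sums divided by each group's row count. This is a sketch, not from either answer:

```r
# Mock matrix from the question
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

grp    <- rownames(data)
sums   <- rowsum(data, group = grp)  # grouped column sums; rows ordered by sort(unique(grp))
counts <- as.vector(table(grp))      # table() uses the same group ordering
means  <- sums / counts              # divides row i by counts[i]
means
#>      cell1 cell2 cell3 cell4
#> ABC1     1   1.0     0   1.0
#> ABC2     1   3.0     0   3.5
#> ABC4     1  10.0     1   4.0
#> ABC5     3   7.5    15  20.0
```

Because rowsum() works directly on the matrix, it avoids the data frame conversion entirely.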
Answer (score 1), answered Nov 10 at 4:10 by Mike H., edited Nov 10 at 4:20:
Using data.table should work pretty well:
library(data.table)
as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
# rownames cell1 cell2 cell3 cell4
#1: ABC1 1 1.0 0 1.0
#2: ABC2 1 3.0 0 3.5
#3: ABC4 1 10.0 1 4.0
#4: ABC5 3 7.5 15 20.0
A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):
Calculate the mean by group
Thanks for your answer. Somehow, I missed the link you shared during my search. Very good info there!
– Atakan
Nov 12 at 17:13
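At single-cell scale the expression matrix is typically sparse, so one more option (a sketch using the Matrix package, not from either answer) is to aggregate by multiplying with a sparse gene-by-row indicator matrix; this keeps the whole computation in sparse arithmetic:

```r
library(Matrix)

# Mock matrix from the question
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

ind   <- fac2sparse(rownames(data))   # sparse 0/1 indicator: one row per gene, one column per input row
sums  <- ind %*% data                 # grouped column sums via matrix multiplication
means <- as.matrix(sums) / as.vector(rowSums(ind))  # divide each gene's row by its duplicate count
```

The resulting means matrix matches the aggregate() output in the question (rows ordered by gene name).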