An efficient way of aggregating data from repeated measurements [duplicate]

This question already has an answer here:

Calculate the mean by group

3 answers

Aggregate / summarize multiple variables per group (e.g. sum, mean)

5 answers

I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I will write a reproducible example below but my problem is the following:

I converted the mouse genes in my dataset to their human counterparts so that I can compare them with previously published data. In some cases there are multiple matches (one human gene is mapped to more than one mouse gene). In these cases, I'd like to average the expression values of these multiple genes to get a single expression value for the human counterpart. I can achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying the aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to reproduce the exact situation here, but my mock analytical pipeline is below:

data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))

# Adding gene names as rownames
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

# Mock gene expression matrix
# Columns indicate expression values from individual cells
# Rows indicate genes
data
#>      cell1 cell2 cell3 cell4
#> ABC1     1     1     0     1
#> ABC2     1     2     0     3
#> ABC2     1     4     0     4
#> ABC4     1    10     1     4
#> ABC5     3     5    10    20
#> ABC5     3    10    20    20

# Averaging gene expression values where there are multiple measurements for the same gene
aggr_data <- aggregate(data, by = list(rownames(data)), mean)

# End result I'm trying to achieve
aggr_data
#>   Group.1 cell1 cell2 cell3 cell4
#> 1    ABC1     1   1.0     0   1.0
#> 2    ABC2     1   3.0     0   3.5
#> 3    ABC4     1  10.0     1   4.0
#> 4    ABC5     3   7.5    15  20.0

Is there a more efficient way of doing this?

Thanks for your answers!

asked Nov 10 at 3:17

Atakan

355

marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap r
Users with the r badge can single-handedly close r questions as duplicates and reopen them as needed.

Nov 11 at 12:40

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.


r bigdata aggregate


2 Answers
accepted

You can try dplyr. After group_by(), summarise_all() with mean() computes the average of every column within each group.

library(tidyverse) # including dplyr

# tibble() replaces the deprecated data_frame()
(df <-
  tibble(
    cell1 = c(1, 1, 1, 1, 3, 3),
    cell2 = c(1, 2, 4, 10, 5, 10),
    cell3 = c(0, 0, 0, 1, 10, 20),
    cell4 = c(1, 3, 4, 4, 20, 20),
    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
  ))
#> # A tibble: 6 x 5
#>   cell1 cell2 cell3 cell4 gene_name
#>   <dbl> <dbl> <dbl> <dbl> <chr>
#> 1     1     1     0     1 ABC1
#> 2     1     2     0     3 ABC2
#> 3     1     4     0     4 ABC2
#> 4     1    10     1     4 ABC4
#> 5     3     5    10    20 ABC5
#> 6     3    10    20    20 ABC5

I just added the gene names as an additional column. Now you can use group_by() for the grouped operation:

df %>%
  group_by(gene_name) %>% # for each group
  summarise_all(mean) # calculate mean for all columns
#> # A tibble: 4 x 5
#>   gene_name cell1 cell2 cell3 cell4
#>   <chr>     <dbl> <dbl> <dbl> <dbl>
#> 1 ABC1          1   1       0   1
#> 2 ABC2          1   3       0   3.5
#> 3 ABC4          1  10       1   4
#> 4 ABC5          3   7.5    15  20
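As a side note, summarise_all() is superseded in dplyr 1.0.0 and later. A minimal sketch of the same grouped mean written with across(), assuming a dplyr version that provides across(); the toy data is the one from the question:

```r
library(dplyr)

# Same toy data as above, with gene names stored in a regular column
df <- tibble(
  cell1 = c(1, 1, 1, 1, 3, 3),
  cell2 = c(1, 2, 4, 10, 5, 10),
  cell3 = c(0, 0, 0, 1, 10, 20),
  cell4 = c(1, 3, 4, 4, 20, 20),
  gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
)

# across(everything(), mean) applies mean() to every non-grouping column
res <- df %>%
  group_by(gene_name) %>%
  summarise(across(everything(), mean))
```

The result should have one row per gene, just like the summarise_all() call.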

In general, for a large data set like yours, the data.table package would be more appropriate. The code looks like this:

library(data.table)

setDT(df)[, lapply(.SD, mean), by = gene_name]
#>    gene_name cell1 cell2 cell3 cell4
#> 1:      ABC1     1   1.0     0   1.0
#> 2:      ABC2     1   3.0     0   3.5
#> 3:      ABC4     1  10.0     1   4.0
#> 4:      ABC5     3   7.5    15  20.0

setDT() just converts df into a data.table object (by reference, without copying it).
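To make the "by reference" part concrete, here is a small sketch contrasting setDT() with as.data.table(). The three-row data frame is made up for illustration only:

```r
library(data.table)

# Hypothetical toy data frame, not taken from the question
df <- data.frame(cell1 = c(1, 1, 3), gene_name = c("ABC1", "ABC2", "ABC2"))

# as.data.table() returns a converted copy; df itself stays a plain data.frame
dt_copy <- as.data.table(df)
copy_left_df_alone <- !is.data.table(df)

# setDT() converts df in place, so df itself is a data.table afterwards
setDT(df)
df_is_dt_now <- is.data.table(df)
```

One consequence is that once setDT(df) has run, df remains a data.table for any code that uses it later.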

dplyr vs data.table

If you bind copies of your data set together to build a much larger one,

df_bench
#> # A tibble: 18,000 x 10,001
#>    gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 ABC308        1     1     0     1     1     1     0
#>  2 ABC258        1     2     0     3     1     2     0
#>  3 ABC553        1     4     0     4     1     4     0
#>  4 ABC57         1    10     1     4     1    10     1
#>  5 ABC469        3     5    10    20     3     5    10
#>  6 ABC484        3    10    20    20     3    10    20
#>  7 ABC813        1     1     0     1     1     1     0
#>  8 ABC371        1     2     0     3     1     2     0
#>  9 ABC547        1     4     0     4     1     4     0
#> 10 ABC171        1    10     1     4     1    10     1
#> # ... with 17,990 more rows, and 9,993 more variables:
#> #   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>, …

Using this set, both approaches can be benchmarked:

microbenchmark::microbenchmark(
  DPLYR = {
    df_bench %>%
      group_by(gene_name) %>%
      summarise_all(mean)
  },
  DATATABLE = {
    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
  },
  times = 50
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50
#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table is faster than dplyr here: the median run takes about 14 seconds, compared with about 37 seconds for dplyr.
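The answer does not show how df_bench was built. A scaled-down sketch of one way to construct a comparable mock set and time the grouped mean is given below. The sizes, the Poisson-distributed values and the name bench are all assumptions made for illustration, not the original benchmark data:

```r
library(data.table)

set.seed(1)        # reproducible mock data
n_rows  <- 1800    # hypothetical size, far smaller than the 18,000 rows above
n_cells <- 100     # hypothetical size, far smaller than the 10,000 cells above

# Random expression values, one column per cell
bench <- as.data.table(matrix(rpois(n_rows * n_cells, lambda = 5), nrow = n_rows))
setnames(bench, paste0("cell", seq_len(n_cells)))

# Sample gene names with replacement so that many genes appear more than once
bench[, gene_name := sample(paste0("ABC", 1:1000), n_rows, replace = TRUE)]

# Time the grouped mean; the result has one row per distinct gene
elapsed <- system.time(
  res <- bench[, lapply(.SD, mean), by = gene_name]
)
```

Timings on such a small set will not match the numbers above; the point is only to have something reproducible to scale up.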

edited Nov 10 at 6:19

answered Nov 10 at 4:21

Blended

40617

Thanks for the detailed explanation here. Thumbs up!
– Atakan
Nov 12 at 17:14


Using data.table should work pretty well:

library(data.table)

as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
#   rownames cell1 cell2 cell3 cell4
#1:     ABC1     1   1.0     0   1.0
#2:     ABC2     1   3.0     0   3.5
#3:     ABC4     1  10.0     1   4.0
#4:     ABC5     3   7.5    15  20.0
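A possible variation, assuming a reasonably recent data.table in which as.data.table() accepts a column name for keep.rownames: the row names can be carried over into a regular column first, so the grouping column gets a meaningful name. The column name gene_name is chosen here for illustration:

```r
library(data.table)

# Matrix from the question, with duplicated row names
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

# Store the row names in a column called gene_name (illustrative name)
dt <- as.data.table(data, keep.rownames = "gene_name")

# Mean of every cell column within each gene
res <- dt[, lapply(.SD, mean), by = gene_name]
```

Because the gene names now live in an ordinary column, duplicated names are not a problem for later joins or filters.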

A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):

Calculate the mean by group
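Since the data in the question is already a numeric matrix, a base R alternative that neither answer mentions may also be worth trying: rowsum() adds up rows that share a group label, and dividing by the group sizes gives the means. This is only a sketch; whether it beats data.table on the real data is an assumption to check by timing it:

```r
# Matrix from the question, with duplicated row names
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

# Sum the rows that share a gene name; the result has one row per gene
sums <- rowsum(data, group = rownames(data))

# Number of rows per gene, looked up by name so the order matches sums
counts <- table(rownames(data))
n_per_gene <- as.vector(counts[rownames(sums)])

# Dividing a matrix by a vector of length nrow() divides each row by its count
avg <- sums / n_per_gene
```

The result stays a matrix with unique gene names as row names, which may suit a matrix-based pipeline better than a data frame.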

edited Nov 10 at 4:20

answered Nov 10 at 4:10

Mike H.

10.8k11023

Thanks for your answer. Somehow, I missed the link you shared during my search. Very good info there!
– Atakan
Nov 12 at 17:13

add a comment |

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
2
down vote

accepted

You can try dplyr. summarise_all with mean() function offers average of every columns for each group.

library(tidyverse) # including dplyr

(df <-

  data_frame(

    cell1 = c(1,1,1,1,3,3),

    cell2 = c(1, 2 ,4 ,10,5,10),

    cell3 = c(0,0,0,1,10,20),

    cell4 = c(1,3,4,4,20,20),

    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

  ))

#> # A tibble: 6 x 5

#>   cell1 cell2 cell3 cell4 gene_name

#>   <dbl> <dbl> <dbl> <dbl> <chr>    

#> 1     1     1     0     1 ABC1     

#> 2     1     2     0     3 ABC2     

#> 3     1     4     0     4 ABC2     

#> 4     1    10     1     4 ABC4     

#> 5     3     5    10    20 ABC5     

#> 6     3    10    20    20 ABC5

I just added the gene names as additional row. Now you can use group_by() for the group operation

df %>%

  group_by(gene_name) %>% # for each group

  summarise_all(mean) # calculate mean for all columns

#> # A tibble: 4 x 5

#>   gene_name cell1 cell2 cell3 cell4

#>   <chr>     <dbl> <dbl> <dbl> <dbl>

#> 1 ABC1          1   1       0   1  

#> 2 ABC2          1   3       0   3.5

#> 3 ABC4          1  10       1   4  

#> 4 ABC5          3   7.5    15  20

In general, for large data set as your situation, data.table package would be appropriate: the code is like this

setDT(df)[, lapply(.SD, mean), by = gene_name]

#>    gene_name cell1 cell2 cell3 cell4

#> 1:      ABC1     1   1.0     0   1.0

#> 2:      ABC2     1   3.0     0   3.5

#> 3:      ABC4     1  10.0     1   4.0

#> 4:      ABC5     3   7.5    15  20.0

setDT is just for making data.table object.

dplyr vs data.table

If bind your data set,

df_bench

#># A tibble: 18,000 x 10,001

#>   gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7

#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>

#> 1 ABC308        1     1     0     1     1     1     0

#> 2 ABC258        1     2     0     3     1     2     0

#> 3 ABC553        1     4     0     4     1     4     0

#> 4 ABC57         1    10     1     4     1    10     1

#> 5 ABC469        3     5    10    20     3     5    10

#> 6 ABC484        3    10    20    20     3    10    20

#> 7 ABC813        1     1     0     1     1     1     0

#> 8 ABC371        1     2     0     3     1     2     0

#> 9 ABC547        1     4     0     4     1     4     0

#>10 ABC171        1    10     1     4     1    10     1

#># ... with 17,990 more rows, and 9,993 more variables:

#>#   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,

#>#   cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,

#>#   cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,

#>#   cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,

#>#   cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,

#>#   cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,

#>#   cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,

#>#   cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,

#>#   cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,

#>#   cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,

#>#   cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,

#>#   cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,

#>#   cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,

#>#   cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,

#>#   cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,

#>#   cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,

#>#   cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,

#>#   cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,

#>#   cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,

#>#   cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,

#>#   cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,

#>#   cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,

#>#   cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,

#>#   cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,

#>#   cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,

#>#   cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,

#>#   cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,

#>#   cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,

#>#   cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,

#>#   cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,

#>#   cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,

#>#   cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,

#>#   cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,

#>#   cell107 <dbl>, …

Using this set,

microbenchmark::microbenchmark(

  DPLYR = {

    df_bench %>%

      group_by(gene_name) %>%

      summarise_all(mean)

  },

  DATATABLE = {

    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]

  },

  times = 50

)

#> Unit: seconds

#>       expr      min       lq     mean   median       uq      max neval

#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50

#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table seems faster than dplyr here.

edited Nov 10 at 6:19

answered Nov 10 at 4:21

Blended

40617

Thanks for the detailed explanation here. Thumbs up!
– Atakan
Nov 12 at 17:14

add a comment |

up vote
2
down vote

accepted

You can try dplyr. summarise_all with mean() function offers average of every columns for each group.

library(tidyverse) # including dplyr

(df <-

  data_frame(

    cell1 = c(1,1,1,1,3,3),

    cell2 = c(1, 2 ,4 ,10,5,10),

    cell3 = c(0,0,0,1,10,20),

    cell4 = c(1,3,4,4,20,20),

    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

  ))

#> # A tibble: 6 x 5

#>   cell1 cell2 cell3 cell4 gene_name

#>   <dbl> <dbl> <dbl> <dbl> <chr>    

#> 1     1     1     0     1 ABC1     

#> 2     1     2     0     3 ABC2     

#> 3     1     4     0     4 ABC2     

#> 4     1    10     1     4 ABC4     

#> 5     3     5    10    20 ABC5     

#> 6     3    10    20    20 ABC5

I just added the gene names as additional row. Now you can use group_by() for the group operation

df %>%

  group_by(gene_name) %>% # for each group

  summarise_all(mean) # calculate mean for all columns

#> # A tibble: 4 x 5

#>   gene_name cell1 cell2 cell3 cell4

#>   <chr>     <dbl> <dbl> <dbl> <dbl>

#> 1 ABC1          1   1       0   1  

#> 2 ABC2          1   3       0   3.5

#> 3 ABC4          1  10       1   4  

#> 4 ABC5          3   7.5    15  20

In general, for large data set as your situation, data.table package would be appropriate: the code is like this

setDT(df)[, lapply(.SD, mean), by = gene_name]

#>    gene_name cell1 cell2 cell3 cell4

#> 1:      ABC1     1   1.0     0   1.0

#> 2:      ABC2     1   3.0     0   3.5

#> 3:      ABC4     1  10.0     1   4.0

#> 4:      ABC5     3   7.5    15  20.0

setDT is just for making data.table object.

dplyr vs data.table

If bind your data set,

df_bench

#># A tibble: 18,000 x 10,001

#>   gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7

#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>

#> 1 ABC308        1     1     0     1     1     1     0

#> 2 ABC258        1     2     0     3     1     2     0

#> 3 ABC553        1     4     0     4     1     4     0

#> 4 ABC57         1    10     1     4     1    10     1

#> 5 ABC469        3     5    10    20     3     5    10

#> 6 ABC484        3    10    20    20     3    10    20

#> 7 ABC813        1     1     0     1     1     1     0

#> 8 ABC371        1     2     0     3     1     2     0

#> 9 ABC547        1     4     0     4     1     4     0

#>10 ABC171        1    10     1     4     1    10     1

#># ... with 17,990 more rows, and 9,993 more variables:

#>#   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,

#>#   cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,

#>#   cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,

#>#   cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,

#>#   cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,

#>#   cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,

#>#   cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,

#>#   cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,

#>#   cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,

#>#   cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,

#>#   cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,

#>#   cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,

#>#   cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,

#>#   cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,

#>#   cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,

#>#   cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,

#>#   cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,

#>#   cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,

#>#   cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,

#>#   cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,

#>#   cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,

#>#   cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,

#>#   cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,

#>#   cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,

#>#   cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,

#>#   cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,

#>#   cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,

#>#   cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,

#>#   cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,

#>#   cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,

#>#   cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,

#>#   cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,

#>#   cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,

#>#   cell107 <dbl>, …

Using this set,

microbenchmark::microbenchmark(

  DPLYR = {

    df_bench %>%

      group_by(gene_name) %>%

      summarise_all(mean)

  },

  DATATABLE = {

    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]

  },

  times = 50

)

#> Unit: seconds

#>       expr      min       lq     mean   median       uq      max neval

#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50

#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table seems faster than dplyr here.

edited Nov 10 at 6:19

answered Nov 10 at 4:21

Blended

40617

Thanks for the detailed explanation here. Thumbs up!
– Atakan
Nov 12 at 17:14

add a comment |

up vote
2
down vote

accepted

You can try dplyr. summarise_all with mean() function offers average of every columns for each group.

library(tidyverse) # including dplyr

(df <-

  data_frame(

    cell1 = c(1,1,1,1,3,3),

    cell2 = c(1, 2 ,4 ,10,5,10),

    cell3 = c(0,0,0,1,10,20),

    cell4 = c(1,3,4,4,20,20),

    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

  ))

#> # A tibble: 6 x 5

#>   cell1 cell2 cell3 cell4 gene_name

#>   <dbl> <dbl> <dbl> <dbl> <chr>    

#> 1     1     1     0     1 ABC1     

#> 2     1     2     0     3 ABC2     

#> 3     1     4     0     4 ABC2     

#> 4     1    10     1     4 ABC4     

#> 5     3     5    10    20 ABC5     

#> 6     3    10    20    20 ABC5

I just added the gene names as additional row. Now you can use group_by() for the group operation

df %>%

  group_by(gene_name) %>% # for each group

  summarise_all(mean) # calculate mean for all columns

#> # A tibble: 4 x 5

#>   gene_name cell1 cell2 cell3 cell4

#>   <chr>     <dbl> <dbl> <dbl> <dbl>

#> 1 ABC1          1   1       0   1  

#> 2 ABC2          1   3       0   3.5

#> 3 ABC4          1  10       1   4  

#> 4 ABC5          3   7.5    15  20

In general, for large data set as your situation, data.table package would be appropriate: the code is like this

setDT(df)[, lapply(.SD, mean), by = gene_name]

#>    gene_name cell1 cell2 cell3 cell4

#> 1:      ABC1     1   1.0     0   1.0

#> 2:      ABC2     1   3.0     0   3.5

#> 3:      ABC4     1  10.0     1   4.0

#> 4:      ABC5     3   7.5    15  20.0

setDT is just for making data.table object.

dplyr vs data.table

If bind your data set,

df_bench

#># A tibble: 18,000 x 10,001

#>   gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7

#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>

#> 1 ABC308        1     1     0     1     1     1     0

#> 2 ABC258        1     2     0     3     1     2     0

#> 3 ABC553        1     4     0     4     1     4     0

#> 4 ABC57         1    10     1     4     1    10     1

#> 5 ABC469        3     5    10    20     3     5    10

#> 6 ABC484        3    10    20    20     3    10    20

#> 7 ABC813        1     1     0     1     1     1     0

#> 8 ABC371        1     2     0     3     1     2     0

#> 9 ABC547        1     4     0     4     1     4     0

#>10 ABC171        1    10     1     4     1    10     1

#># ... with 17,990 more rows, and 9,993 more variables:

#>#   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,

#>#   cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,

#>#   cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,

#>#   cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,

#>#   cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,

#>#   cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,

#>#   cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,

#>#   cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,

#>#   cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,

#>#   cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,

#>#   cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,

#>#   cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,

#>#   cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,

#>#   cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,

#>#   cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,

#>#   cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,

#>#   cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,

#>#   cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,

#>#   cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,

#>#   cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,

#>#   cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,

#>#   cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,

#>#   cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,

#>#   cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,

#>#   cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,

#>#   cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,

#>#   cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,

#>#   cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,

#>#   cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,

#>#   cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,

#>#   cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,

#>#   cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,

#>#   cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,

#>#   cell107 <dbl>, …

Using this set,

microbenchmark::microbenchmark(

  DPLYR = {

    df_bench %>%

      group_by(gene_name) %>%

      summarise_all(mean)

  },

  DATATABLE = {

    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]

  },

  times = 50

)

#> Unit: seconds

#>       expr      min       lq     mean   median       uq      max neval

#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50

#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table seems faster than dplyr here.

edited Nov 10 at 6:19

answered Nov 10 at 4:21

Blended

40617

You can try dplyr. summarise_all with mean() function offers average of every columns for each group.

library(tidyverse) # including dplyr

(df <-

  data_frame(

    cell1 = c(1,1,1,1,3,3),

    cell2 = c(1, 2 ,4 ,10,5,10),

    cell3 = c(0,0,0,1,10,20),

    cell4 = c(1,3,4,4,20,20),

    gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

  ))

#> # A tibble: 6 x 5

#>   cell1 cell2 cell3 cell4 gene_name

#>   <dbl> <dbl> <dbl> <dbl> <chr>    

#> 1     1     1     0     1 ABC1     

#> 2     1     2     0     3 ABC2     

#> 3     1     4     0     4 ABC2     

#> 4     1    10     1     4 ABC4     

#> 5     3     5    10    20 ABC5     

#> 6     3    10    20    20 ABC5

I just added the gene names as additional row. Now you can use group_by() for the group operation

df %>%

  group_by(gene_name) %>% # for each group

  summarise_all(mean) # calculate mean for all columns

#> # A tibble: 4 x 5

#>   gene_name cell1 cell2 cell3 cell4

#>   <chr>     <dbl> <dbl> <dbl> <dbl>

#> 1 ABC1          1   1       0   1  

#> 2 ABC2          1   3       0   3.5

#> 3 ABC4          1  10       1   4  

#> 4 ABC5          3   7.5    15  20

In general, for large data set as your situation, data.table package would be appropriate: the code is like this

setDT(df)[, lapply(.SD, mean), by = gene_name]

#>    gene_name cell1 cell2 cell3 cell4

#> 1:      ABC1     1   1.0     0   1.0

#> 2:      ABC2     1   3.0     0   3.5

#> 3:      ABC4     1  10.0     1   4.0

#> 4:      ABC5     3   7.5    15  20.0

setDT is just for making data.table object.

dplyr vs data.table

If bind your data set,

df_bench
#> # A tibble: 18,000 x 10,001
#>    gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 ABC308        1     1     0     1     1     1     0
#>  2 ABC258        1     2     0     3     1     2     0
#>  3 ABC553        1     4     0     4     1     4     0
#>  4 ABC57         1    10     1     4     1    10     1
#>  5 ABC469        3     5    10    20     3     5    10
#>  6 ABC484        3    10    20    20     3    10    20
#>  7 ABC813        1     1     0     1     1     1     0
#>  8 ABC371        1     2     0     3     1     2     0
#>  9 ABC547        1     4     0     4     1     4     0
#> 10 ABC171        1    10     1     4     1    10     1
#> # ... with 17,990 more rows, and 9,993 more variables:
#> #   cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
#> #   cell11 <dbl>, cell12 <dbl>, cell13 <dbl>, …
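The construction of df_bench is not shown here. The sketch below is one way a similar data set could be built, and it is an assumption rather than the code actually used: it tiles the 6 x 4 example block and draws gene names at random. The names base, n_row_rep, n_col_rep and n_genes are hypothetical, and the sizes are scaled down so that it runs quickly.

```r
set.seed(1)

# The 6 x 4 example block from the question (filled column by column)
base <- matrix(
  c(1, 1, 1, 1, 3, 3,
    1, 2, 4, 10, 5, 10,
    0, 0, 0, 1, 10, 20,
    1, 3, 4, 4, 20, 20),
  ncol = 4
)

n_row_rep <- 30  # assumption: 3000 would give 18,000 rows
n_col_rep <- 25  # assumption: 2500 would give 10,000 cell columns
n_genes   <- 50  # assumption: number of distinct gene names

# Tile the block by repeating its row and column indices
m <- base[rep(seq_len(nrow(base)), n_row_rep),
          rep(seq_len(ncol(base)), n_col_rep)]
colnames(m) <- paste0("cell", seq_len(ncol(m)))

# Random gene names, so several rows share the same name
df_bench <- data.frame(
  gene_name = paste0("ABC", sample(n_genes, nrow(m), replace = TRUE)),
  m,
  stringsAsFactors = FALSE
)

dim(df_bench)
```

With the scaled-down settings this gives 180 rows and 101 columns; raising the repetition counts approaches the dimensions printed above.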

Using this set,

microbenchmark::microbenchmark(
  DPLYR = {
    df_bench %>%
      group_by(gene_name) %>%
      summarise_all(mean)
  },
  DATATABLE = {
    # Note: setDT() converts df_bench in place, so after the first DATATABLE
    # run df_bench stays a data.table for all remaining iterations
    setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
  },
  times = 50
)
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>      DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549    50
#>  DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257    50

data.table is faster than dplyr here: its median time is about 14 s versus about 37 s, roughly 2.6 times faster.

edited Nov 10 at 6:19

answered Nov 10 at 4:21

Blended

40617


Thanks for the detailed explanation here. Thumbs up!
– Atakan
Nov 12 at 17:14


up vote
1
down vote

Using data.table should work pretty well:

library(data.table)

as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
#   rownames cell1 cell2 cell3 cell4
#1:     ABC1     1   1.0     0   1.0
#2:     ABC2     1   3.0     0   3.5
#3:     ABC4     1  10.0     1   4.0
#4:     ABC5     3   7.5    15  20.0

A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):

Calculate the mean by group
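If the aggregated result has to go back into matrix form with gene names as row names, as in the question, the round trip could look like the sketch below. It assumes the mock matrix from the question; the names dt, res, cell_cols and aggr are only for illustration.

```r
library(data.table)

# Mock expression matrix from the question, with duplicated row names
data <- as.matrix(data.frame(cell1 = c(1, 1, 1, 1, 3, 3),
                             cell2 = c(1, 2, 4, 10, 5, 10),
                             cell3 = c(0, 0, 0, 1, 10, 20),
                             cell4 = c(1, 3, 4, 4, 20, 20)))
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")

# keep.rownames = TRUE stores the row names in a column called "rn"
dt <- as.data.table(data, keep.rownames = TRUE)
res <- dt[, lapply(.SD, mean), by = rn]

# Back to a numeric matrix with one row per gene
cell_cols <- setdiff(names(res), "rn")
aggr <- as.matrix(res[, cell_cols, with = FALSE])
rownames(aggr) <- res$rn
```

Keeping the gene names in a regular column avoids evaluating rownames(data) inside by, and the final matrix has unique row names.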

edited Nov 10 at 4:20

answered Nov 10 at 4:10

Mike H.

10.8k11023

Thanks for your answer. Somehow, I missed the link you shared during my search. Very good info there!
– Atakan
Nov 12 at 17:13

