An efficient way of aggregating data from repeated measurements [duplicate]











up vote
2
down vote

favorite













This question already has an answer here:




  • Calculate the mean by group

    3 answers



  • Aggregate / summarize multiple variables per group (e.g. sum, mean)

    5 answers




I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I will write a reproducible example below but my problem is the following:



I converted mouse genes in my dataset to human counterparts to be able to compare with other previously published data. There are multiple matches in some cases (one human gene is mapped to more than one mouse genes). In these cases, I'd like to average the expression values from these multiple genes and come up with one expression value for the human genetic counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to exemplify the exact situation here, but I my mock analytical pipeline is below:



data <- as.matrix(data.frame(cell1 = c(1,1,1,1,3,3),
cell2 = c(1, 2 ,4 ,10,5,10),
cell3 = c(0,0,0,1,10,20),
cell4 = c(1,3,4,4,20,20)))

# Adding gene names as rownames
rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")



# Mock gene expression matrix
# Columns indicate expression values from individual cells
# Rows indicate genes
data
#> cell1 cell2 cell3 cell4
#> ABC1 1 1 0 1
#> ABC2 1 2 0 3
#> ABC2 1 4 0 4
#> ABC4 1 10 1 4
#> ABC5 3 5 10 20
#> ABC5 3 10 20 20



# Averaging gene expression values where there are multiple measurements for the same gene
aggr_data <- aggregate(data, by=list(rownames(data)), mean)

# End result I'm trying to achieve
aggr_data
#> Group.1 cell1 cell2 cell3 cell4
#> 1 ABC1 1 1.0 0 1.0
#> 2 ABC2 1 3.0 0 3.5
#> 3 ABC4 1 10.0 1 4.0
#> 4 ABC5 3 7.5 15 20.0


Is there a more efficient way for doing this?



Thanks for your answers!










share|improve this question













marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap r
Users with the  r badge can single-handedly close r questions as duplicates and reopen them as needed.

StackExchange.ready(function() {
if (StackExchange.options.isMobile) return;

$('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
var $hover = $(this).addClass('hover-bound'),
$msg = $hover.siblings('.dupe-hammer-message');

$hover.hover(
function() {
$hover.showInfoMessage('', {
messageElement: $msg.clone().show(),
transient: false,
position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
dismissable: false,
relativeToBody: true
});
},
function() {
StackExchange.helpers.removeMessages();
}
);
});
});
Nov 11 at 12:40


This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.



















    up vote
    2
    down vote

    favorite













    This question already has an answer here:




    • Calculate the mean by group

      3 answers



    • Aggregate / summarize multiple variables per group (e.g. sum, mean)

      5 answers




    I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I will write a reproducible example below but my problem is the following:



    I converted mouse genes in my dataset to human counterparts to be able to compare with other previously published data. There are multiple matches in some cases (one human gene is mapped to more than one mouse genes). In these cases, I'd like to average the expression values from these multiple genes and come up with one expression value for the human genetic counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to exemplify the exact situation here, but I my mock analytical pipeline is below:



    data <- as.matrix(data.frame(cell1 = c(1,1,1,1,3,3),
    cell2 = c(1, 2 ,4 ,10,5,10),
    cell3 = c(0,0,0,1,10,20),
    cell4 = c(1,3,4,4,20,20)))

    # Adding gene names as rownames
    rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")



    # Mock gene expression matrix
    # Columns indicate expression values from individual cells
    # Rows indicate genes
    data
    #> cell1 cell2 cell3 cell4
    #> ABC1 1 1 0 1
    #> ABC2 1 2 0 3
    #> ABC2 1 4 0 4
    #> ABC4 1 10 1 4
    #> ABC5 3 5 10 20
    #> ABC5 3 10 20 20



    # Averaging gene expression values where there are multiple measurements for the same gene
    aggr_data <- aggregate(data, by=list(rownames(data)), mean)

    # End result I'm trying to achieve
    aggr_data
    #> Group.1 cell1 cell2 cell3 cell4
    #> 1 ABC1 1 1.0 0 1.0
    #> 2 ABC2 1 3.0 0 3.5
    #> 3 ABC4 1 10.0 1 4.0
    #> 4 ABC5 3 7.5 15 20.0


    Is there a more efficient way for doing this?



    Thanks for your answers!










    share|improve this question













    marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap r
    Users with the  r badge can single-handedly close r questions as duplicates and reopen them as needed.

    StackExchange.ready(function() {
    if (StackExchange.options.isMobile) return;

    $('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
    var $hover = $(this).addClass('hover-bound'),
    $msg = $hover.siblings('.dupe-hammer-message');

    $hover.hover(
    function() {
    $hover.showInfoMessage('', {
    messageElement: $msg.clone().show(),
    transient: false,
    position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
    dismissable: false,
    relativeToBody: true
    });
    },
    function() {
    StackExchange.helpers.removeMessages();
    }
    );
    });
    });
    Nov 11 at 12:40


    This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

















      up vote
      2
      down vote

      favorite









      up vote
      2
      down vote

      favorite












      This question already has an answer here:




      • Calculate the mean by group

        3 answers



      • Aggregate / summarize multiple variables per group (e.g. sum, mean)

        5 answers




      I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I will write a reproducible example below but my problem is the following:



      I converted mouse genes in my dataset to human counterparts to be able to compare with other previously published data. There are multiple matches in some cases (one human gene is mapped to more than one mouse genes). In these cases, I'd like to average the expression values from these multiple genes and come up with one expression value for the human genetic counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to exemplify the exact situation here, but I my mock analytical pipeline is below:



      data <- as.matrix(data.frame(cell1 = c(1,1,1,1,3,3),
      cell2 = c(1, 2 ,4 ,10,5,10),
      cell3 = c(0,0,0,1,10,20),
      cell4 = c(1,3,4,4,20,20)))

      # Adding gene names as rownames
      rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")



      # Mock gene expression matrix
      # Columns indicate expression values from individual cells
      # Rows indicate genes
      data
      #> cell1 cell2 cell3 cell4
      #> ABC1 1 1 0 1
      #> ABC2 1 2 0 3
      #> ABC2 1 4 0 4
      #> ABC4 1 10 1 4
      #> ABC5 3 5 10 20
      #> ABC5 3 10 20 20



      # Averaging gene expression values where there are multiple measurements for the same gene
      aggr_data <- aggregate(data, by=list(rownames(data)), mean)

      # End result I'm trying to achieve
      aggr_data
      #> Group.1 cell1 cell2 cell3 cell4
      #> 1 ABC1 1 1.0 0 1.0
      #> 2 ABC2 1 3.0 0 3.5
      #> 3 ABC4 1 10.0 1 4.0
      #> 4 ABC5 3 7.5 15 20.0


      Is there a more efficient way for doing this?



      Thanks for your answers!










      share|improve this question














      This question already has an answer here:




      • Calculate the mean by group

        3 answers



      • Aggregate / summarize multiple variables per group (e.g. sum, mean)

        5 answers




      I'm analyzing gene expression data from a large experiment (12400 single cells and 23800 genes) and I'm running into an efficiency problem. I will write a reproducible example below but my problem is the following:



      I converted mouse genes in my dataset to human counterparts to be able to compare with other previously published data. There are multiple matches in some cases (one human gene is mapped to more than one mouse genes). In these cases, I'd like to average the expression values from these multiple genes and come up with one expression value for the human genetic counterpart. I'm able to achieve this by converting my expression data to matrix format (which allows duplicate row names) and applying aggregate() function, but it takes a VERY long time to go through the large dataset. It is difficult to exemplify the exact situation here, but I my mock analytical pipeline is below:



      data <- as.matrix(data.frame(cell1 = c(1,1,1,1,3,3),
      cell2 = c(1, 2 ,4 ,10,5,10),
      cell3 = c(0,0,0,1,10,20),
      cell4 = c(1,3,4,4,20,20)))

      # Adding gene names as rownames
      rownames(data) <- c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")



      # Mock gene expression matrix
      # Columns indicate expression values from individual cells
      # Rows indicate genes
      data
      #> cell1 cell2 cell3 cell4
      #> ABC1 1 1 0 1
      #> ABC2 1 2 0 3
      #> ABC2 1 4 0 4
      #> ABC4 1 10 1 4
      #> ABC5 3 5 10 20
      #> ABC5 3 10 20 20



      # Averaging gene expression values where there are multiple measurements for the same gene
      aggr_data <- aggregate(data, by=list(rownames(data)), mean)

      # End result I'm trying to achieve
      aggr_data
      #> Group.1 cell1 cell2 cell3 cell4
      #> 1 ABC1 1 1.0 0 1.0
      #> 2 ABC2 1 3.0 0 3.5
      #> 3 ABC4 1 10.0 1 4.0
      #> 4 ABC5 3 7.5 15 20.0


      Is there a more efficient way for doing this?



      Thanks for your answers!





      This question already has an answer here:




      • Calculate the mean by group

        3 answers



      • Aggregate / summarize multiple variables per group (e.g. sum, mean)

        5 answers








      r bigdata aggregate






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 10 at 3:17









      Atakan

      355




      355




      marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap r
      Users with the  r badge can single-handedly close r questions as duplicates and reopen them as needed.

      StackExchange.ready(function() {
      if (StackExchange.options.isMobile) return;

      $('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
      var $hover = $(this).addClass('hover-bound'),
      $msg = $hover.siblings('.dupe-hammer-message');

      $hover.hover(
      function() {
      $hover.showInfoMessage('', {
      messageElement: $msg.clone().show(),
      transient: false,
      position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
      dismissable: false,
      relativeToBody: true
      });
      },
      function() {
      StackExchange.helpers.removeMessages();
      }
      );
      });
      });
      Nov 11 at 12:40


      This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.






      marked as duplicate by Mike H., phiver, Billal Begueradj, Jaap r
      Users with the  r badge can single-handedly close r questions as duplicates and reopen them as needed.

      StackExchange.ready(function() {
      if (StackExchange.options.isMobile) return;

      $('.dupe-hammer-message-hover:not(.hover-bound)').each(function() {
      var $hover = $(this).addClass('hover-bound'),
      $msg = $hover.siblings('.dupe-hammer-message');

      $hover.hover(
      function() {
      $hover.showInfoMessage('', {
      messageElement: $msg.clone().show(),
      transient: false,
      position: { my: 'bottom left', at: 'top center', offsetTop: -7 },
      dismissable: false,
      relativeToBody: true
      });
      },
      function() {
      StackExchange.helpers.removeMessages();
      }
      );
      });
      });
      Nov 11 at 12:40


      This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.


























          2 Answers
          2






          active

          oldest

          votes

















          up vote
          2
          down vote



          accepted










          You can try dplyr. summarise_all with mean() function offers average of every columns for each group.



          library(tidyverse) # including dplyr
          (df <-
          data_frame(
          cell1 = c(1,1,1,1,3,3),
          cell2 = c(1, 2 ,4 ,10,5,10),
          cell3 = c(0,0,0,1,10,20),
          cell4 = c(1,3,4,4,20,20),
          gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
          ))
          #> # A tibble: 6 x 5
          #> cell1 cell2 cell3 cell4 gene_name
          #> <dbl> <dbl> <dbl> <dbl> <chr>
          #> 1 1 1 0 1 ABC1
          #> 2 1 2 0 3 ABC2
          #> 3 1 4 0 4 ABC2
          #> 4 1 10 1 4 ABC4
          #> 5 3 5 10 20 ABC5
          #> 6 3 10 20 20 ABC5


          I just added the gene names as additional row. Now you can use group_by() for the group operation



          df %>%
          group_by(gene_name) %>% # for each group
          summarise_all(mean) # calculate mean for all columns
          #> # A tibble: 4 x 5
          #> gene_name cell1 cell2 cell3 cell4
          #> <chr> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC1 1 1 0 1
          #> 2 ABC2 1 3 0 3.5
          #> 3 ABC4 1 10 1 4
          #> 4 ABC5 3 7.5 15 20




          In general, for large data set as your situation, data.table package would be appropriate: the code is like this



          setDT(df)[, lapply(.SD, mean), by = gene_name]
          #> gene_name cell1 cell2 cell3 cell4
          #> 1: ABC1 1 1.0 0 1.0
          #> 2: ABC2 1 3.0 0 3.5
          #> 3: ABC4 1 10.0 1 4.0
          #> 4: ABC5 3 7.5 15 20.0


          setDT is just for making data.table object.



          dplyr vs data.table



          If bind your data set,



          df_bench
          #># A tibble: 18,000 x 10,001
          #> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
          #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC308 1 1 0 1 1 1 0
          #> 2 ABC258 1 2 0 3 1 2 0
          #> 3 ABC553 1 4 0 4 1 4 0
          #> 4 ABC57 1 10 1 4 1 10 1
          #> 5 ABC469 3 5 10 20 3 5 10
          #> 6 ABC484 3 10 20 20 3 10 20
          #> 7 ABC813 1 1 0 1 1 1 0
          #> 8 ABC371 1 2 0 3 1 2 0
          #> 9 ABC547 1 4 0 4 1 4 0
          #>10 ABC171 1 10 1 4 1 10 1
          #># ... with 17,990 more rows, and 9,993 more variables:
          #># cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
          #># cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,
          #># cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,
          #># cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,
          #># cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,
          #># cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,
          #># cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,
          #># cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,
          #># cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,
          #># cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,
          #># cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,
          #># cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,
          #># cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,
          #># cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,
          #># cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,
          #># cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,
          #># cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,
          #># cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,
          #># cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,
          #># cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,
          #># cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,
          #># cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,
          #># cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,
          #># cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,
          #># cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,
          #># cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,
          #># cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,
          #># cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,
          #># cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,
          #># cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,
          #># cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,
          #># cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,
          #># cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,
          #># cell107 <dbl>, …


          Using this set,



          microbenchmark::microbenchmark(
          DPLYR = {
          df_bench %>%
          group_by(gene_name) %>%
          summarise_all(mean)
          },
          DATATABLE = {
          setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
          },
          times = 50
          )
          #> Unit: seconds
          #> expr min lq mean median uq max neval
          #> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
          #> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50


          data.table seems faster than dplyr here.






          share|improve this answer























          • Thanks for the detailed explanation here. Thumbs up!
            – Atakan
            Nov 12 at 17:14


















          up vote
          1
          down vote













          Using data.table should work pretty well:



          library(data.table)
          as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
          # rownames cell1 cell2 cell3 cell4
          #1: ABC1 1 1.0 0 1.0
          #2: ABC2 1 3.0 0 3.5
          #3: ABC4 1 10.0 1 4.0
          #4: ABC5 3 7.5 15 20.0


          A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):



          Calculate the mean by group






          share|improve this answer























          • Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
            – Atakan
            Nov 12 at 17:13


















          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          2
          down vote



          accepted










          You can try dplyr. summarise_all with mean() function offers average of every columns for each group.



          library(tidyverse) # including dplyr
          (df <-
          data_frame(
          cell1 = c(1,1,1,1,3,3),
          cell2 = c(1, 2 ,4 ,10,5,10),
          cell3 = c(0,0,0,1,10,20),
          cell4 = c(1,3,4,4,20,20),
          gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
          ))
          #> # A tibble: 6 x 5
          #> cell1 cell2 cell3 cell4 gene_name
          #> <dbl> <dbl> <dbl> <dbl> <chr>
          #> 1 1 1 0 1 ABC1
          #> 2 1 2 0 3 ABC2
          #> 3 1 4 0 4 ABC2
          #> 4 1 10 1 4 ABC4
          #> 5 3 5 10 20 ABC5
          #> 6 3 10 20 20 ABC5


          I just added the gene names as additional row. Now you can use group_by() for the group operation



          df %>%
          group_by(gene_name) %>% # for each group
          summarise_all(mean) # calculate mean for all columns
          #> # A tibble: 4 x 5
          #> gene_name cell1 cell2 cell3 cell4
          #> <chr> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC1 1 1 0 1
          #> 2 ABC2 1 3 0 3.5
          #> 3 ABC4 1 10 1 4
          #> 4 ABC5 3 7.5 15 20




          In general, for large data set as your situation, data.table package would be appropriate: the code is like this



          setDT(df)[, lapply(.SD, mean), by = gene_name]
          #> gene_name cell1 cell2 cell3 cell4
          #> 1: ABC1 1 1.0 0 1.0
          #> 2: ABC2 1 3.0 0 3.5
          #> 3: ABC4 1 10.0 1 4.0
          #> 4: ABC5 3 7.5 15 20.0


          setDT is just for making data.table object.



          dplyr vs data.table



          If bind your data set,



          df_bench
          #># A tibble: 18,000 x 10,001
          #> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
          #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC308 1 1 0 1 1 1 0
          #> 2 ABC258 1 2 0 3 1 2 0
          #> 3 ABC553 1 4 0 4 1 4 0
          #> 4 ABC57 1 10 1 4 1 10 1
          #> 5 ABC469 3 5 10 20 3 5 10
          #> 6 ABC484 3 10 20 20 3 10 20
          #> 7 ABC813 1 1 0 1 1 1 0
          #> 8 ABC371 1 2 0 3 1 2 0
          #> 9 ABC547 1 4 0 4 1 4 0
          #>10 ABC171 1 10 1 4 1 10 1
          #># ... with 17,990 more rows, and 9,993 more variables:
          #># cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
          #># cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,
          #># cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,
          #># cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,
          #># cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,
          #># cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,
          #># cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,
          #># cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,
          #># cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,
          #># cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,
          #># cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,
          #># cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,
          #># cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,
          #># cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,
          #># cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,
          #># cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,
          #># cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,
          #># cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,
          #># cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,
          #># cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,
          #># cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,
          #># cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,
          #># cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,
          #># cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,
          #># cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,
          #># cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,
          #># cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,
          #># cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,
          #># cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,
          #># cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,
          #># cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,
          #># cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,
          #># cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,
          #># cell107 <dbl>, …


          Using this set,



          microbenchmark::microbenchmark(
          DPLYR = {
          df_bench %>%
          group_by(gene_name) %>%
          summarise_all(mean)
          },
          DATATABLE = {
          setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
          },
          times = 50
          )
          #> Unit: seconds
          #> expr min lq mean median uq max neval
          #> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
          #> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50


          data.table seems faster than dplyr here.






          share|improve this answer























          • Thanks for the detailed explanation here. Thumbs up!
            – Atakan
            Nov 12 at 17:14















          up vote
          2
          down vote



          accepted










          You can try dplyr. summarise_all with mean() function offers average of every columns for each group.



          library(tidyverse) # including dplyr
          (df <-
          data_frame(
          cell1 = c(1,1,1,1,3,3),
          cell2 = c(1, 2 ,4 ,10,5,10),
          cell3 = c(0,0,0,1,10,20),
          cell4 = c(1,3,4,4,20,20),
          gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
          ))
          #> # A tibble: 6 x 5
          #> cell1 cell2 cell3 cell4 gene_name
          #> <dbl> <dbl> <dbl> <dbl> <chr>
          #> 1 1 1 0 1 ABC1
          #> 2 1 2 0 3 ABC2
          #> 3 1 4 0 4 ABC2
          #> 4 1 10 1 4 ABC4
          #> 5 3 5 10 20 ABC5
          #> 6 3 10 20 20 ABC5


          I just added the gene names as additional row. Now you can use group_by() for the group operation



          df %>%
          group_by(gene_name) %>% # for each group
          summarise_all(mean) # calculate mean for all columns
          #> # A tibble: 4 x 5
          #> gene_name cell1 cell2 cell3 cell4
          #> <chr> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC1 1 1 0 1
          #> 2 ABC2 1 3 0 3.5
          #> 3 ABC4 1 10 1 4
          #> 4 ABC5 3 7.5 15 20




          In general, for large data set as your situation, data.table package would be appropriate: the code is like this



          setDT(df)[, lapply(.SD, mean), by = gene_name]
          #> gene_name cell1 cell2 cell3 cell4
          #> 1: ABC1 1 1.0 0 1.0
          #> 2: ABC2 1 3.0 0 3.5
          #> 3: ABC4 1 10.0 1 4.0
          #> 4: ABC5 3 7.5 15 20.0


          setDT is just for making data.table object.



          dplyr vs data.table



          If bind your data set,



          df_bench
          #># A tibble: 18,000 x 10,001
          #> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
          #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC308 1 1 0 1 1 1 0
          #> 2 ABC258 1 2 0 3 1 2 0
          #> 3 ABC553 1 4 0 4 1 4 0
          #> 4 ABC57 1 10 1 4 1 10 1
          #> 5 ABC469 3 5 10 20 3 5 10
          #> 6 ABC484 3 10 20 20 3 10 20
          #> 7 ABC813 1 1 0 1 1 1 0
          #> 8 ABC371 1 2 0 3 1 2 0
          #> 9 ABC547 1 4 0 4 1 4 0
          #>10 ABC171 1 10 1 4 1 10 1
          #># ... with 17,990 more rows, and 9,993 more variables:
          #># cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
          #># cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,
          #># cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,
          #># cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,
          #># cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,
          #># cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,
          #># cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,
          #># cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,
          #># cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,
          #># cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,
          #># cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,
          #># cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,
          #># cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,
          #># cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,
          #># cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,
          #># cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,
          #># cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,
          #># cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,
          #># cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,
          #># cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,
          #># cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,
          #># cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,
          #># cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,
          #># cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,
          #># cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,
          #># cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,
          #># cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,
          #># cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,
          #># cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,
          #># cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,
          #># cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,
          #># cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,
          #># cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,
          #># cell107 <dbl>, …


          Using this set,



          microbenchmark::microbenchmark(
          DPLYR = {
          df_bench %>%
          group_by(gene_name) %>%
          summarise_all(mean)
          },
          DATATABLE = {
          setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
          },
          times = 50
          )
          #> Unit: seconds
          #> expr min lq mean median uq max neval
          #> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
          #> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50


          data.table seems faster than dplyr here.






          share|improve this answer























          • Thanks for the detailed explanation here. Thumbs up!
            – Atakan
            Nov 12 at 17:14













          up vote
          2
          down vote



          accepted







          up vote
          2
          down vote



          accepted






          You can try dplyr. summarise_all with mean() function offers average of every columns for each group.



          library(tidyverse) # including dplyr
          (df <-
          data_frame(
          cell1 = c(1,1,1,1,3,3),
          cell2 = c(1, 2 ,4 ,10,5,10),
          cell3 = c(0,0,0,1,10,20),
          cell4 = c(1,3,4,4,20,20),
          gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
          ))
          #> # A tibble: 6 x 5
          #> cell1 cell2 cell3 cell4 gene_name
          #> <dbl> <dbl> <dbl> <dbl> <chr>
          #> 1 1 1 0 1 ABC1
          #> 2 1 2 0 3 ABC2
          #> 3 1 4 0 4 ABC2
          #> 4 1 10 1 4 ABC4
          #> 5 3 5 10 20 ABC5
          #> 6 3 10 20 20 ABC5


          I just added the gene names as additional row. Now you can use group_by() for the group operation



          df %>%
          group_by(gene_name) %>% # for each group
          summarise_all(mean) # calculate mean for all columns
          #> # A tibble: 4 x 5
          #> gene_name cell1 cell2 cell3 cell4
          #> <chr> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC1 1 1 0 1
          #> 2 ABC2 1 3 0 3.5
          #> 3 ABC4 1 10 1 4
          #> 4 ABC5 3 7.5 15 20




          In general, for large data set as your situation, data.table package would be appropriate: the code is like this



          setDT(df)[, lapply(.SD, mean), by = gene_name]
          #> gene_name cell1 cell2 cell3 cell4
          #> 1: ABC1 1 1.0 0 1.0
          #> 2: ABC2 1 3.0 0 3.5
          #> 3: ABC4 1 10.0 1 4.0
          #> 4: ABC5 3 7.5 15 20.0


          setDT is just for making data.table object.



          dplyr vs data.table



          If bind your data set,



          df_bench
          #># A tibble: 18,000 x 10,001
          #> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
          #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC308 1 1 0 1 1 1 0
          #> 2 ABC258 1 2 0 3 1 2 0
          #> 3 ABC553 1 4 0 4 1 4 0
          #> 4 ABC57 1 10 1 4 1 10 1
          #> 5 ABC469 3 5 10 20 3 5 10
          #> 6 ABC484 3 10 20 20 3 10 20
          #> 7 ABC813 1 1 0 1 1 1 0
          #> 8 ABC371 1 2 0 3 1 2 0
          #> 9 ABC547 1 4 0 4 1 4 0
          #>10 ABC171 1 10 1 4 1 10 1
          #># ... with 17,990 more rows, and 9,993 more variables:
          #># cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
          #># cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,
          #># cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,
          #># cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,
          #># cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,
          #># cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,
          #># cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,
          #># cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,
          #># cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,
          #># cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,
          #># cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,
          #># cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,
          #># cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,
          #># cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,
          #># cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,
          #># cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,
          #># cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,
          #># cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,
          #># cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,
          #># cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,
          #># cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,
          #># cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,
          #># cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,
          #># cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,
          #># cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,
          #># cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,
          #># cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,
          #># cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,
          #># cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,
          #># cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,
          #># cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,
          #># cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,
          #># cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,
          #># cell107 <dbl>, …


          Using this set,



          microbenchmark::microbenchmark(
          DPLYR = {
          df_bench %>%
          group_by(gene_name) %>%
          summarise_all(mean)
          },
          DATATABLE = {
          setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
          },
          times = 50
          )
          #> Unit: seconds
          #> expr min lq mean median uq max neval
          #> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
          #> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50


          data.table seems faster than dplyr here.






          share|improve this answer














          You can try dplyr. summarise_all with mean() function offers average of every columns for each group.



          library(tidyverse) # including dplyr
          (df <-
          data_frame(
          cell1 = c(1,1,1,1,3,3),
          cell2 = c(1, 2 ,4 ,10,5,10),
          cell3 = c(0,0,0,1,10,20),
          cell4 = c(1,3,4,4,20,20),
          gene_name = c("ABC1", "ABC2", "ABC2", "ABC4", "ABC5", "ABC5")
          ))
          #> # A tibble: 6 x 5
          #> cell1 cell2 cell3 cell4 gene_name
          #> <dbl> <dbl> <dbl> <dbl> <chr>
          #> 1 1 1 0 1 ABC1
          #> 2 1 2 0 3 ABC2
          #> 3 1 4 0 4 ABC2
          #> 4 1 10 1 4 ABC4
          #> 5 3 5 10 20 ABC5
          #> 6 3 10 20 20 ABC5


          I just added the gene names as additional row. Now you can use group_by() for the group operation



          df %>%
          group_by(gene_name) %>% # for each group
          summarise_all(mean) # calculate mean for all columns
          #> # A tibble: 4 x 5
          #> gene_name cell1 cell2 cell3 cell4
          #> <chr> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC1 1 1 0 1
          #> 2 ABC2 1 3 0 3.5
          #> 3 ABC4 1 10 1 4
          #> 4 ABC5 3 7.5 15 20




          In general, for large data set as your situation, data.table package would be appropriate: the code is like this



          setDT(df)[, lapply(.SD, mean), by = gene_name]
          #> gene_name cell1 cell2 cell3 cell4
          #> 1: ABC1 1 1.0 0 1.0
          #> 2: ABC2 1 3.0 0 3.5
          #> 3: ABC4 1 10.0 1 4.0
          #> 4: ABC5 3 7.5 15 20.0


          setDT is just for making data.table object.



          dplyr vs data.table



          If bind your data set,



          df_bench
          #># A tibble: 18,000 x 10,001
          #> gene_name cell1 cell2 cell3 cell4 cell5 cell6 cell7
          #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          #> 1 ABC308 1 1 0 1 1 1 0
          #> 2 ABC258 1 2 0 3 1 2 0
          #> 3 ABC553 1 4 0 4 1 4 0
          #> 4 ABC57 1 10 1 4 1 10 1
          #> 5 ABC469 3 5 10 20 3 5 10
          #> 6 ABC484 3 10 20 20 3 10 20
          #> 7 ABC813 1 1 0 1 1 1 0
          #> 8 ABC371 1 2 0 3 1 2 0
          #> 9 ABC547 1 4 0 4 1 4 0
          #>10 ABC171 1 10 1 4 1 10 1
          #># ... with 17,990 more rows, and 9,993 more variables:
          #># cell8 <dbl>, cell9 <dbl>, cell10 <dbl>,
          #># cell11 <dbl>, cell12 <dbl>, cell13 <dbl>,
          #># cell14 <dbl>, cell15 <dbl>, cell16 <dbl>,
          #># cell17 <dbl>, cell18 <dbl>, cell19 <dbl>,
          #># cell20 <dbl>, cell21 <dbl>, cell22 <dbl>,
          #># cell23 <dbl>, cell24 <dbl>, cell25 <dbl>,
          #># cell26 <dbl>, cell27 <dbl>, cell28 <dbl>,
          #># cell29 <dbl>, cell30 <dbl>, cell31 <dbl>,
          #># cell32 <dbl>, cell33 <dbl>, cell34 <dbl>,
          #># cell35 <dbl>, cell36 <dbl>, cell37 <dbl>,
          #># cell38 <dbl>, cell39 <dbl>, cell40 <dbl>,
          #># cell41 <dbl>, cell42 <dbl>, cell43 <dbl>,
          #># cell44 <dbl>, cell45 <dbl>, cell46 <dbl>,
          #># cell47 <dbl>, cell48 <dbl>, cell49 <dbl>,
          #># cell50 <dbl>, cell51 <dbl>, cell52 <dbl>,
          #># cell53 <dbl>, cell54 <dbl>, cell55 <dbl>,
          #># cell56 <dbl>, cell57 <dbl>, cell58 <dbl>,
          #># cell59 <dbl>, cell60 <dbl>, cell61 <dbl>,
          #># cell62 <dbl>, cell63 <dbl>, cell64 <dbl>,
          #># cell65 <dbl>, cell66 <dbl>, cell67 <dbl>,
          #># cell68 <dbl>, cell69 <dbl>, cell70 <dbl>,
          #># cell71 <dbl>, cell72 <dbl>, cell73 <dbl>,
          #># cell74 <dbl>, cell75 <dbl>, cell76 <dbl>,
          #># cell77 <dbl>, cell78 <dbl>, cell79 <dbl>,
          #># cell80 <dbl>, cell81 <dbl>, cell82 <dbl>,
          #># cell83 <dbl>, cell84 <dbl>, cell85 <dbl>,
          #># cell86 <dbl>, cell87 <dbl>, cell88 <dbl>,
          #># cell89 <dbl>, cell90 <dbl>, cell91 <dbl>,
          #># cell92 <dbl>, cell93 <dbl>, cell94 <dbl>,
          #># cell95 <dbl>, cell96 <dbl>, cell97 <dbl>,
          #># cell98 <dbl>, cell99 <dbl>, cell100 <dbl>,
          #># cell101 <dbl>, cell102 <dbl>, cell103 <dbl>,
          #># cell104 <dbl>, cell105 <dbl>, cell106 <dbl>,
          #># cell107 <dbl>, …


          Using this set,



          microbenchmark::microbenchmark(
          DPLYR = {
          df_bench %>%
          group_by(gene_name) %>%
          summarise_all(mean)
          },
          DATATABLE = {
          setDT(df_bench)[, lapply(.SD, mean), by = gene_name]
          },
          times = 50
          )
          #> Unit: seconds
          #> expr min lq mean median uq max neval
          #> DPLYR 32.82307 34.89050 38.10948 37.44543 40.01937 47.67549 50
          #> DATATABLE 12.16752 13.59018 16.09665 14.25976 15.60752 40.30257 50


          data.table seems faster than dplyr here.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 10 at 6:19

























          answered Nov 10 at 4:21









          Blended

          40617




          40617












          • Thanks for the detailed explanation here. Thumbs up!
            – Atakan
            Nov 12 at 17:14


















          • Thanks for the detailed explanation here. Thumbs up!
            – Atakan
            Nov 12 at 17:14
















          Thanks for the detailed explanation here. Thumbs up!
          – Atakan
          Nov 12 at 17:14




          Thanks for the detailed explanation here. Thumbs up!
          – Atakan
          Nov 12 at 17:14












          up vote
          1
          down vote













          Using data.table should work pretty well:



          library(data.table)
          as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
          # rownames cell1 cell2 cell3 cell4
          #1: ABC1 1 1.0 0 1.0
          #2: ABC2 1 3.0 0 3.5
          #3: ABC4 1 10.0 1 4.0
          #4: ABC5 3 7.5 15 20.0


          A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):



          Calculate the mean by group






          share|improve this answer























          • Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
            – Atakan
            Nov 12 at 17:13















          up vote
          1
          down vote













          Using data.table should work pretty well:



          library(data.table)
          as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
          # rownames cell1 cell2 cell3 cell4
          #1: ABC1 1 1.0 0 1.0
          #2: ABC2 1 3.0 0 3.5
          #3: ABC4 1 10.0 1 4.0
          #4: ABC5 3 7.5 15 20.0


          A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):



          Calculate the mean by group






          share|improve this answer























          • Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
            – Atakan
            Nov 12 at 17:13













          up vote
          1
          down vote










          up vote
          1
          down vote









          Using data.table should work pretty well:



          library(data.table)
          as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
          # rownames cell1 cell2 cell3 cell4
          #1: ABC1 1 1.0 0 1.0
          #2: ABC2 1 3.0 0 3.5
          #3: ABC4 1 10.0 1 4.0
          #4: ABC5 3 7.5 15 20.0


          A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):



          Calculate the mean by group






          share|improve this answer














          Using data.table should work pretty well:



          library(data.table)
          as.data.table(data)[, lapply(.SD, mean), by = .(rownames(data))]
          # rownames cell1 cell2 cell3 cell4
          #1: ABC1 1 1.0 0 1.0
          #2: ABC2 1 3.0 0 3.5
          #3: ABC4 1 10.0 1 4.0
          #4: ABC5 3 7.5 15 20.0


          A quick SO search dug up a link to speed comparisons for group-by operations (data.table is the fastest for large data):



          Calculate the mean by group







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 10 at 4:20

























          answered Nov 10 at 4:10









          Mike H.

          10.8k11023




          10.8k11023












          • Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
            – Atakan
            Nov 12 at 17:13


















          • Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
            – Atakan
            Nov 12 at 17:13
















          Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
          – Atakan
          Nov 12 at 17:13




          Thanks for your answer. Somehow, I missed the link you shared during my search. The very good info there!
          – Atakan
          Nov 12 at 17:13



          Popular posts from this blog

          Guess what letter conforming each word

          Run scheduled task as local user group (not BUILTIN)

          Port of Spain