Usage of 'for loop' in R to split a dataframe into several dataframes

I have a problem with a for loop.
I have a data frame with 120 unique IDs, and I want to split it into 120 separate data frames based on the ID. I split it using the following code:

split_part0 <- split(PART0_DF, PART0_DF$sysid)

Now I want to do something like

for(i in 1:120){ 

sys[i] <- as.data.frame(split_part0[[i]])}

This way I would have 120 data frames with unique names that I can use for further analysis.
Is a for loop not possible in this particular case? If it is not, what other commands can I use?
Dummy data for PART0_DF:

 Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11

I want the output to be like

     >>sys1

     Date      sysid   power   temperature

     1.1.2018    1     1000     14

     2.1.2018    1     1200     16

     3.1.2018    1      800     18

     >>sys2

     1.1.2018    2     1500      8

     2.1.2018    2      800     18

     3.1.2018    2     1300     11

edited Nov 21 '18 at 14:42

asked Nov 21 '18 at 11:04

Shruthi Patil

If you provide a small dummy example of PART0_DF it would be easier to understand what it is you need.

– rookie
Nov 21 '18 at 11:47

r for-loop


2 Answers

An easy way to do this is to build a grouping vector by prepending the string sys to the id numbers, and to use it to split the data. No for() loop is needed to produce the desired output, because split() applied to a data frame already returns a list of data frames.

The value of the factor is used to name each element in the list generated by split(). In the case of the OP, since sysid is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split().
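To make the naming behaviour concrete, here is a minimal sketch using the sample values from the question. The object names PART0_DF and split_part0 come from the question; the data frame itself is retyped by hand here as an assumption. Splitting directly on the numeric sysid still produces a named list, but the names are the strings "1" and "2":

```r
# Rebuild the sample data from the question (hand-typed for illustration)
PART0_DF <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300),
  temperature = c(14, 16, 18, 8, 18, 11)
)

# split() coerces the grouping vector to a factor and uses its levels as names
split_part0 <- split(PART0_DF, PART0_DF$sysid)
names(split_part0)

# Names such as "1" are not syntactic, so [[ ]] is the convenient way to extract
split_part0[["1"]]
```

Prefixing the ids with a string, as done in the rest of this answer, simply turns those names into ones that can be used comfortably with the $ operator.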

Using the data from the OP, we'll illustrate how to use the sysid column to create a grouping variable that combines the string sys with the id values, and how to use it to split the data into a list of data frames that can be accessed by name.

rawData <- "Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11"



data <- read.table(text = rawData,header=TRUE)

sysidName <- paste0("sys",data$sysid)



splitData <- split(data,sysidName)



splitData

...and the output:

> splitData

$sys1

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11



>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18

>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)

[1] "sys1" "sys2"

>
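Because the pieces live in one named list, further analysis can be applied to every data frame without writing out 120 variable names. The following is only a sketch under assumptions: the sample data is retyped by hand, and meanPower is an illustrative name that does not appear in the original post.

```r
# Sample data from the question, retyped for a self-contained example
data <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1L, 1L, 1L, 2L, 2L, 2L),
  power = c(1000L, 1200L, 800L, 1500L, 800L, 1300L),
  temperature = c(14L, 16L, 18L, 8L, 18L, 11L)
)
splitData <- split(data, paste0("sys", data$sysid))

# Apply the same analysis to every element: mean power per system
meanPower <- sapply(splitData, function(d) mean(d$power))
meanPower

# The same idea as an explicit loop over the element names
for (nm in names(splitData)) {
  cat(nm, ":", nrow(splitData[[nm]]), "rows\n")
}
```

With many ids, keep in mind that the list is ordered by the sorted character names, so sys10 typically comes before sys2.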

Reiterating the main point from the top of the answer: when split() is used with a data frame, the result is a list of objects of class data.frame. For example:

> str(splitData["sys1"])

List of 1

 $ sys1:'data.frame':   3 obs. of  4 variables:

  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3

  ..$ sysid      : int [1:3] 1 1 1

  ..$ power      : int [1:3] 1000 1200 800

  ..$ temperature: int [1:3] 14 16 18

>

If you must use a `for()` loop...

The OP asked whether the problem could be solved with a for() loop. The answer is "yes."

# create a vector containing unique values of sysid

ids <- unique(data$sysid)

# initialize output data frame list 

dfList <- list() 

# loop through unique values and generate named data frames in list() 

for(i in ids){

     dfname <- paste0("sys",i)

     dfList[[dfname]] <- data[data$sysid == i,]

}

dfList

...and the output:

> for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ }

> dfList

$sys1

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11
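The loop in the question does not work as hoped because sys[i] <- ... indexes into a single object called sys, which has to exist already, rather than creating objects named sys1, sys2 and so on. A minimal sketch of two possible fixes follows, assuming the sample data from the question (retyped by hand); the names pieces, sysList and sysEnv are hypothetical and chosen only for illustration.

```r
# Sample data from the question, retyped for a self-contained example
data <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300),
  temperature = c(14, 16, 18, 8, 18, 11)
)
pieces <- split(data, data$sysid)

# Fix 1: collect the pieces in a pre-allocated list, indexed with [[ ]]
sysList <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  sysList[[i]] <- pieces[[i]]
}
names(sysList) <- paste0("sys", names(pieces))

# Fix 2: turn the list elements into separate variables sys1, sys2, ...
# (here inside a dedicated environment rather than the workspace)
sysEnv <- list2env(sysList, envir = new.env())
ls(sysEnv)
get("sys2", envir = sysEnv)
```

Passing envir = globalenv() instead would create the variables in the workspace, although keeping everything in one list is usually easier to work with.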

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

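One criterion is speed, since the real data will be much larger than the sample data from the original post. The timings below were nonetheless measured on the six-row sample, so treat them as indicative only.

We can use the microbenchmark package to compare the performance of the three approaches. The by() call is timed on the df object defined in the other answer.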

`split()` performance

library(microbenchmark)

> microbenchmark(splitData <- split(data,sysidName),unit="us")

Unit: microseconds

                                expr     min      lq     mean   median       uq     max neval

 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100

>

`for()` performance

> microbenchmark(for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ },unit="us")

Unit: microseconds

                                                                                              expr      min       lq     mean

 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642

   median       uq      max neval

 3099.064 3479.311 8511.609   100

>

`by()` performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")

Unit: microseconds

                                                 expr     min       lq     mean   median       uq      max neval

 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100

>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.
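The timings above come from the six-row sample, so they mostly measure fixed overhead. Below is a rough sketch of how the comparison could be repeated on data closer to the OP's size. The synthetic data frame bigData (120 ids with 1,000 rows each) and all of its values are assumptions invented for illustration, and base system.time() is used in place of microbenchmark to avoid the extra dependency. No timings are quoted because they depend on the machine.

```r
set.seed(1)
nIds <- 120
rowsPerId <- 1000

# Synthetic stand-in for the real data (values are made up)
bigData <- data.frame(
  sysid = rep(seq_len(nIds), each = rowsPerId),
  power = round(runif(nIds * rowsPerId, 500, 1500)),
  temperature = round(runif(nIds * rowsPerId, 5, 20))
)
bigNames <- paste0("sys", bigData$sysid)

# split() approach
tSplit <- system.time(bySplit <- split(bigData, bigNames))

# for() approach, following the loop shown earlier in this answer
tLoop <- system.time({
  byLoop <- list()
  for (i in unique(bigData$sysid)) {
    byLoop[[paste0("sys", i)]] <- bigData[bigData$sysid == i, ]
  }
})

# Both approaches should yield the same pieces (the list order may differ)
all.equal(bySplit[["sys7"]], byLoop[["sys7"]])

# Elapsed seconds for each approach on this machine
c(split = tSplit[["elapsed"]], loop = tLoop[["elapsed"]])
```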

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski


I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.

– Shruthi Patil
Nov 23 '18 at 9:58


Another option is using the function by():

df <- data.frame(

  Date = c("1.1.2018",  "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),

  sysid = c(1, 1, 1, 2, 2, 2),

  power = c(1000, 1200, 800, 1500, 800, 1300)

  )

df

      Date sysid power

1 1.1.2018     1  1000

2 2.1.2018     1  1200

3 3.1.2018     1   800

4 1.1.2018     2  1500

5 2.1.2018     2   800

6 3.1.2018     2  1300

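Now split df into as many data frames as there are distinct sysid values using by(). The function passed to by() is an identity function whose argument merely happens to be named unique; it returns each group unchanged and does not call base R's unique():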

df_list <- by(df, df$sysid, function(unique) unique)

df_list

df$sysid: 1

      Date sysid power

1 1.1.2018     1  1000

2 2.1.2018     1  1200

3 3.1.2018     1   800

---------------------------------------------------------------------------------------------- 

df$sysid: 2

      Date sysid power

4 1.1.2018     2  1500

5 2.1.2018     2   800

6 3.1.2018     2  1300
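One practical difference from split() is the return type. The sketch below rebuilds df so that it is self-contained; byResult and splitResult are illustrative names that are not part of the original answer. by() returns an object of class "by" whose elements are named after the sysid values, so an individual piece can be extracted with [[ ]]:

```r
df <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300)
)

# Identity function: each group is returned unchanged
byResult <- by(df, df$sysid, function(x) x)
class(byResult)

# Extract the piece for sysid 2 by name
byResult[["2"]]

# split() yields the same pieces, but as a plain named list
splitResult <- split(df, df$sysid)
all.equal(byResult[["2"]], splitResult[["2"]])
```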

answered Nov 21 '18 at 15:52

Chris Ruehlemann


I tried this too. But, the solution provided by @LenGreski is more suited for my requirement. Thanks for your help

– Shruthi Patil
Nov 23 '18 at 10:00

On Stack Overflow it is customary to click the upward arrow if a given answer is useful.

– Chris Ruehlemann
Nov 23 '18 at 14:49

I have already. It says I don't have enough reputation for it to get displayed. However, it is recorded.

– Shruthi Patil
Nov 24 '18 at 16:17

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53410738%2fusage-of-for-loop-in-r-to-split-a-dataframe-into-several-dataframes%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

rawData <- "Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11"



data <- read.table(text = rawData,header=TRUE)

sysidName <- paste0("sys",data$sysid)



splitData <- split(data,sysidName)



splitData

...and the output:

> splitData

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11



>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1

      Date sysid power temperature sysidName

1 1.1.2018     1  1000          14      sys1

2 2.1.2018     1  1200          16      sys1

3 3.1.2018     1   800          18      sys1

>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)

[1] "sys1" "sys2"

>

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])

List of 1

 $ sys1:'data.frame':   3 obs. of  4 variables:

  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3

  ..$ sysid      : int [1:3] 1 1 1

  ..$ power      : int [1:3] 1000 1200 800

  ..$ temperature: int [1:3] 14 16 18

>

If you must use a `for()` loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid

ids <- unique(data$sysid)

# initialize output data frame list 

dfList <- list() 

# loop thru unique values and generate named data frames in list() 

for(i in ids){

     dfname <- paste0("sys",i)

     dfList[[dfname]] <- data[data$sysid == i,]

}

dfList

...and the output:

> for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ }

> dfList

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

`split()` performance

library(microbenchmark)

> microbenchmark(splitData <- split(data,sysidName),unit="us")

Unit: microseconds

                                expr     min      lq     mean   median       uq     max neval

 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100

>

`for()` performance

> microbenchmark(for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ },unit="us")

Unit: microseconds

                                                                                              expr      min       lq     mean

 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642

   median       uq      max neval

 3099.064 3479.311 8511.609   100

>

`by()` performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")

Unit: microseconds

                                                 expr     min       lq     mean   median       uq      max neval

 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100

>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.

– Shruthi Patil
Nov 23 '18 at 9:58

add a comment |

rawData <- "Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11"



data <- read.table(text = rawData,header=TRUE)

sysidName <- paste0("sys",data$sysid)



splitData <- split(data,sysidName)



splitData

...and the output:

> splitData

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11



>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1

      Date sysid power temperature sysidName

1 1.1.2018     1  1000          14      sys1

2 2.1.2018     1  1200          16      sys1

3 3.1.2018     1   800          18      sys1

>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)

[1] "sys1" "sys2"

>

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])

List of 1

 $ sys1:'data.frame':   3 obs. of  4 variables:

  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3

  ..$ sysid      : int [1:3] 1 1 1

  ..$ power      : int [1:3] 1000 1200 800

  ..$ temperature: int [1:3] 14 16 18

>

If you must use a `for()` loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid

ids <- unique(data$sysid)

# initialize output data frame list 

dfList <- list() 

# loop thru unique values and generate named data frames in list() 

for(i in ids){

     dfname <- paste0("sys",i)

     dfList[[dfname]] <- data[data$sysid == i,]

}

dfList

...and the output:

> for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ }

> dfList

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

`split()` performance

library(microbenchmark)

> microbenchmark(splitData <- split(data,sysidName),unit="us")

Unit: microseconds

                                expr     min      lq     mean   median       uq     max neval

 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100

>

`for()` performance

> microbenchmark(for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ },unit="us")

Unit: microseconds

                                                                                              expr      min       lq     mean

 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642

   median       uq      max neval

 3099.064 3479.311 8511.609   100

>

`by()` performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")

Unit: microseconds

                                                 expr     min       lq     mean   median       uq      max neval

 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100

>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.

– Shruthi Patil
Nov 23 '18 at 9:58

add a comment |

rawData <- "Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11"



data <- read.table(text = rawData,header=TRUE)

sysidName <- paste0("sys",data$sysid)



splitData <- split(data,sysidName)



splitData

...and the output:

> splitData

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11



>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1

      Date sysid power temperature sysidName

1 1.1.2018     1  1000          14      sys1

2 2.1.2018     1  1200          16      sys1

3 3.1.2018     1   800          18      sys1

>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)

[1] "sys1" "sys2"

>

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])

List of 1

 $ sys1:'data.frame':   3 obs. of  4 variables:

  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3

  ..$ sysid      : int [1:3] 1 1 1

  ..$ power      : int [1:3] 1000 1200 800

  ..$ temperature: int [1:3] 14 16 18

>

If you must use a `for()` loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid

ids <- unique(data$sysid)

# initialize output data frame list 

dfList <- list() 

# loop thru unique values and generate named data frames in list() 

for(i in ids){

     dfname <- paste0("sys",i)

     dfList[[dfname]] <- data[data$sysid == i,]

}

dfList

...and the output:

> for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ }

> dfList

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

`split()` performance

library(microbenchmark)

> microbenchmark(splitData <- split(data,sysidName),unit="us")

Unit: microseconds

                                expr     min      lq     mean   median       uq     max neval

 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100

>

`for()` performance

> microbenchmark(for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ },unit="us")

Unit: microseconds

                                                                                              expr      min       lq     mean

 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642

   median       uq      max neval

 3099.064 3479.311 8511.609   100

>

`by()` performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")

Unit: microseconds

                                                 expr     min       lq     mean   median       uq      max neval

 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100

>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

rawData <- "Date      sysid   power   temperature

 1.1.2018    1     1000       14

 2.1.2018    1     1200       16

 3.1.2018    1      800       18

 1.1.2018    2     1500        8

 2.1.2018    2      800       18

 3.1.2018    2     1300       11"



data <- read.table(text = rawData,header=TRUE)

sysidName <- paste0("sys",data$sysid)



splitData <- split(data,sysidName)



splitData

...and the output:

> splitData

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11



>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1

      Date sysid power temperature sysidName

1 1.1.2018     1  1000          14      sys1

2 2.1.2018     1  1200          16      sys1

3 3.1.2018     1   800          18      sys1

>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)

[1] "sys1" "sys2"

>

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])

List of 1

 $ sys1:'data.frame':   3 obs. of  4 variables:

  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3

  ..$ sysid      : int [1:3] 1 1 1

  ..$ power      : int [1:3] 1000 1200 800

  ..$ temperature: int [1:3] 14 16 18

>

If you must use a `for()` loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid

ids <- unique(data$sysid)

# initialize output data frame list 

dfList <- list() 

# loop thru unique values and generate named data frames in list() 

for(i in ids){

     dfname <- paste0("sys",i)

     dfList[[dfname]] <- data[data$sysid == i,]

}

dfList

...and the output:

> for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ }

> dfList

$`sys1`

      Date sysid power temperature

1 1.1.2018     1  1000          14

2 2.1.2018     1  1200          16

3 3.1.2018     1   800          18



$sys2

      Date sysid power temperature

4 1.1.2018     2  1500           8

5 2.1.2018     2   800          18

6 3.1.2018     2  1300          11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

`split()` performance

library(microbenchmark)

> microbenchmark(splitData <- split(data,sysidName),unit="us")

Unit: microseconds

                                expr     min      lq     mean   median       uq     max neval

 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100

>

`for()` performance

> microbenchmark(for(i in ids){

+      dfname <- paste0("sys",i)

+      dfList[[dfname]] <- data[data$sysid == i,]

+ },unit="us")

Unit: microseconds

                                                                                              expr      min       lq     mean

 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642

   median       uq      max neval

 3099.064 3479.311 8511.609   100

>

`by()` performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")

Unit: microseconds

                                                 expr     min       lq     mean   median       uq      max neval

 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100

>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

edited Nov 22 '18 at 6:39

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

answered Nov 21 '18 at 14:56

Len Greski

3,2281623

I would like to access each data frame to do further analysis separately. Hence, the first solution is perfect. Thank you very much.

– Shruthi Patil
Nov 23 '18 at 9:58

Another option is using the function by():

df <- data.frame(
  Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300)
)

df
      Date sysid power
1 1.1.2018     1  1000
2 2.1.2018     1  1200
3 3.1.2018     1   800
4 1.1.2018     2  1500
5 2.1.2018     2   800
6 3.1.2018     2  1300

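Now split df into as many data frames as there are distinct sysid values using by(). The function passed to by() simply returns each subset unchanged; its argument happens to be named unique, but base R's unique() is not called: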

df_list <- by(df, df$sysid, function(unique) unique)

df_list
df$sysid: 1
      Date sysid power
1 1.1.2018     1  1000
2 2.1.2018     1  1200
3 3.1.2018     1   800
----------------------------------------------------------------------------------------------
df$sysid: 2
      Date sysid power
4 1.1.2018     2  1500
5 2.1.2018     2   800
6 3.1.2018     2  1300
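by() returns an object of class "by" rather than a plain list, which may matter if each piece is to be analysed separately. Here is a minimal sketch of one way to turn the result into an ordinary named list, assuming the df defined above; the names pieces, keys, sys_list and mean_power are hypothetical.

```r
df <- data.frame(
  Date  = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300)
)

# The function just returns each subset unchanged; nothing is de-duplicated
pieces <- by(df, df$sysid, function(d) d)

# The sysid values are stored as the dimnames of the "by" object
keys <- dimnames(pieces)[[1]]

# Build a plain list whose elements are named sys1, sys2, ...
sys_list <- setNames(
  lapply(seq_along(keys), function(k) pieces[[k]]),
  paste0("sys", keys)
)

# Example follow-up analysis: mean power per system
mean_power <- sapply(sys_list, function(d) mean(d$power))
```

Each data frame is then available as sys_list$sys1, sys_list$sys2 and so on, and per-system analyses can be applied across the whole list with lapply() or sapply().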

answered Nov 21 '18 at 15:52

Chris Ruehlemann

I tried this too. But, the solution provided by @LenGreski is more suited for my requirement. Thanks for your help

– Shruthi Patil
Nov 23 '18 at 10:00

On Stack Overflow it is customary to click the upward arrow if a given answer is useful.

– Chris Ruehlemann
Nov 23 '18 at 14:49

I have already. It says I don't have enough reputation for it to get displayed. However, it is recorded.

– Shruthi Patil
Nov 24 '18 at 16:17

