Using Caret to select features within folds of a cross validation

In the caret package, is there any way to use the recursive feature elimination function within the folds of a cross-validation scheme defined in a trainControl object that is passed to a train call that uses tuning grids?



I love the recursive feature elimination function, but it really should be applied to the training folds in cross validation and then tested on the hold-out fold.



I've played around with a bunch of different approaches, but none are perfect. For example, I can make my own cross-validation folds and run trainControl with method = 'none', but that won't use the tuning grid in train (an evaluation set is needed for that). I can also make my own CV folds and use method = 'cv' in trainControl (a tuning grid works here), but then the best tune is chosen on the hold-out samples that trainControl generates, not on my hold-out fold.



Is there a way to tell caret to evaluate the models with the tuning grids on my pre-specified hold-out fold (the one taken prior to feature elimination)?
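
(For what it's worth, trainControl does have index and indexOut arguments, which is roughly the mechanism I'm after. A minimal sketch, assuming a hypothetical holdout_rows vector of pre-specified hold-out row numbers and a data frame data with a target column, as used later in this post:)

## Minimal sketch, not a full solution: supply the pre-specified split
## directly to trainControl via index / indexOut. 'holdout_rows' is a
## hypothetical vector of row numbers.
library(caret)

train_rows <- setdiff(seq_len(nrow(data)), holdout_rows)

manual_ctrl <- trainControl(method = 'cv',
                            index = list(Fold1 = train_rows),       # rows used to fit each candidate
                            indexOut = list(Fold1 = holdout_rows),  # rows used to score the tuning grid
                            savePredictions = TRUE)

## Every row of the tuning grid is then evaluated on 'holdout_rows'
fit <- train(target ~ ., data = data,
             method = 'rf',
             trControl = manual_ctrl,
             tuneGrid = expand.grid(mtry = c(2, 4, 6)))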



In my workflow, I am testing a few different model types, each with its own tuning grid. There are parts of caret I really like, and I've spent a ton of time on this, so I'd like to keep using it, but this is a deal-breaker if I can't get it to work. I'm open to any suggestions!



Thanks in advance!



SOLUTION:
My solution may not be the most efficient, but it seems to work. I made my cross-validation folds using the information here: https://stats.stackexchange.com/questions/61090/how-to-split-a-data-set-to-do-10-fold-cross-validation.
Using createFolds (a caret function) did not give me equal-sized folds, so I opted for the second solution in that post. It looks like you might be able to do it with caret's stratified sampling, but I haven't explored that yet.
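
(For reference, a quick way to check the fold sizes that createFolds produces, assuming the same data frame data with a target column used in the code below:)

## Quick check of fold sizes from caret's createFolds (sketch only)
library(caret)

cf <- createFolds(data$target, k = 10)  # list of 10 vectors of hold-out row indexes
sapply(cf, length)                      # fold sizes are only approximately equal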



This code uses a bootstrap approach within each CV fold and predicts all of the observations in the hold-out fold at each iteration.



## Make the folds for the cross validation
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE) %>%
  sample(., length(.), replace = FALSE)

for (f in 1:10) {

  ## Row indexes for the hold-out fold and the training folds
  testIndexes  <- which(folds == f)
  trainIndexes <- which(folds != f)

  ## 500 bootstrap resamples of the training rows; each one is evaluated
  ## on the same pre-specified hold-out fold
  trainIndexList <- replicate(500, sample(trainIndexes, length(trainIndexes), replace = TRUE),
                              simplify = FALSE)
  testIndexList  <- replicate(500, testIndexes, simplify = FALSE)

  testData  <- data[testIndexes, ]
  trainData <- data[-testIndexes, ]

  ## Make the train control object
  train_control <- trainControl(method = 'boot',
                                number = 1,
                                summaryFunction = modfun,   # user-defined summary function (not shown)
                                preProcOptions = c('center', 'scale', newdata = testData),
                                index = trainIndexList,
                                indexOut = testIndexList,
                                classProbs = TRUE,
                                savePredictions = TRUE)

  ## Feature selection: control for the recursive feature elimination
  rfe_control <- rfeControl(functions = rfFuncs, method = 'cv', number = 10)

  ## Run RFE on the training portion of this fold only
  fs_results <- rfe(trainData[, 2:ncol(trainData)],
                    trainData[, 'target'],
                    sizes = 2:ncol(trainData),
                    rfeControl = rfe_control)

  use_features <- c('target', predictors(fs_results))

  ## 'features' must be initialized as an empty data frame before the loop
  features <- predictors(fs_results) %>%
    data.frame(features = .) %>%
    mutate(fold = f) %>%
    rbind(features, .)

  data_min <- data[, use_features] %>% data.frame()

  ## ...(modeling code, including train() calls and desired output)...

}
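
As an illustration only (not my actual modeling code), the train() call that goes where the placeholder comment sits might look something like this; the random forest method and the mtry values are placeholders:

## Hypothetical example of the elided modeling step (random forest as a placeholder)
rf_grid <- expand.grid(mtry = c(2, 4, 6))

rf_fit <- train(x = data_min[, setdiff(names(data_min), 'target')],
                y = data_min[, 'target'],
                method = 'rf',
                trControl = train_control,
                tuneGrid = rf_grid)

## Predictions on this iteration's hold-out fold
holdout_preds <- predict(rf_fit, newdata = testData)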


I haven't tried to do an lapply instead of a for loop yet. I'd appreciate any suggestions for efficiency.
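
If anyone wants to try the lapply route, a rough, untested sketch would wrap the loop body in a function of the fold number:

## Rough, untested sketch of an lapply version
run_fold <- function(f) {
  testIndexes  <- which(folds == f)
  trainIndexes <- which(folds != f)
  ## ... same body as the for loop above, returning whatever is needed
  ## for this fold (selected features, predictions, fit metrics) ...
}

fold_results <- lapply(1:10, run_fold)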










Tags: cross-validation, r-caret, feature-selection






asked Aug 3 '17 at 18:41 by RoseS; edited Nov 8 at 9:19 by jmuhlenkamp
  • Have you found any solutions?
    – Shubham Sharma
    Jan 24 at 13:22

  • I ended up assigning folds outside the loop: Then, for
    – RoseS
    Mar 16 at 17:15