Using Caret to select features within folds of a cross validation

In the caret package, is there any way to run recursive feature elimination within the folds of a cross-validation scheme defined in trainControl, which is then passed to a train call that uses a tuning grid?



I love the recursive feature elimination function, but it really should be applied to the training folds in cross validation and then tested on the hold-out fold.



I've played around with a bunch of different methods to do this, but none are perfect. For example, I can make my own cross-validation folds and run trainControl with method = 'none', but that won't use the tuning grid in train (an evaluation set is needed for that). I can also make my own CV folds and use method = 'cv' in trainControl (a tuning grid works here), but then the best tune is chosen on the hold-out samples generated by trainControl, not on my hold-out.



Is there a way to tell caret to evaluate the models with the tuning grids on my pre-specified hold-out fold (the one taken prior to feature elimination)?
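
For context, trainControl() can take pre-computed resampling indices through its index and indexOut arguments, which is one way to score every tuning-grid candidate on a specific hold-out. Below is a minimal sketch, assuming myTrain and myHoldout are row-index vectors built before feature elimination and using a random forest with an illustrative mtry grid; none of these names come from the original post.

library(caret)

## Hedged sketch: fit each tuning-grid candidate on myTrain and score it on myHoldout
## ('myTrain' and 'myHoldout' are assumed integer row-index vectors, not from the post)
ctrl <- trainControl(method = "cv",
                     index = list(Fold1 = myTrain),      # rows used to fit each candidate
                     indexOut = list(Fold1 = myHoldout), # rows used to evaluate each candidate
                     classProbs = TRUE,
                     savePredictions = TRUE)

fit <- train(target ~ ., data = data,
             method = "rf",
             tuneGrid = expand.grid(mtry = c(2, 4, 8)),  # illustrative grid only
             trControl = ctrl)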



In my workflow, I am testing a few different model types, each with its own tuning grid. There are parts of caret I really like, and I've spent a ton of time on this, so I'd like to keep using it, but this is a deal breaker if I can't get it to work. I'm open to any suggestions!



Thanks in advance-



SOLUTION:
My solution may not be the most efficient, but it seems to work. I made my cross-validation folds using the approaches described here: https://stats.stackexchange.com/questions/61090/how-to-split-a-data-set-to-do-10-fold-cross-validation.
Using createFolds (a caret function) does not create equal-sized folds, so I opted for the second solution on that page. It looks like it might also be possible with caret's stratified sampling, but I haven't explored that yet.
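
For comparison, here is a hedged sketch of the createFolds() route: it stratifies on the outcome, so fold sizes come out approximately (not exactly) equal, which is presumably the behaviour noted above. data$target is assumed to be the outcome column used elsewhere in this post.

library(caret)

set.seed(42)
## returnTrain = FALSE returns, for each of the 10 folds, the hold-out row indices
fold_list <- createFolds(data$target, k = 10, list = TRUE, returnTrain = FALSE)
sapply(fold_list, length)   # fold sizes are only approximately equal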



This code uses a bootstrap approach within each CV fold and, for each bootstrap iteration, predicts all of the observations in the hold-out fold.



## Packages needed below
library(caret)
library(dplyr)

## Make the folds for the cross validation
folds <- cut(seq(1, nrow(data)), breaks = 10, labels = FALSE) %>%
  sample(., length(.), replace = FALSE)

## Specify features as a data frame ahead of time; it collects the selections from every fold
features <- data.frame()

for (f in 1:10) {

  testIndexes  <- which(folds == f)
  trainIndexes <- which(folds != f)

  ## 500 bootstrap resamples of the training rows, each scored on the same hold-out fold
  trainIndexList <- replicate(500, sample(trainIndexes, length(trainIndexes), replace = TRUE),
                              simplify = FALSE)
  testIndexList  <- replicate(500, testIndexes, simplify = FALSE)

  testData  <- data[testIndexes, ]
  trainData <- data[-testIndexes, ]

  ## Make the train control object
  train_control <- trainControl(method = 'boot',
                                number = 1,
                                summaryFunction = modfun,
                                preProcOptions = c('center', 'scale', newdata = testData),
                                index = trainIndexList,
                                indexOut = testIndexList,
                                classProbs = TRUE,
                                savePredictions = TRUE)

  ## Make the control for the recursive feature elimination
  rfe_control <- rfeControl(functions = rfFuncs, method = 'cv', number = 10)

  ## Run the feature selection on the training fold only
  fs_results <- rfe(trainData[, 2:ncol(trainData)],
                    trainData[, 'target'],
                    sizes = 2:ncol(trainData),
                    rfeControl = rfe_control)

  use_features <- c('target', predictors(fs_results))

  ## Record which features were selected in this fold
  features <- predictors(fs_results) %>%
    data.frame(features = .) %>%
    mutate(fold = f) %>%
    rbind(features, .)

  data_min <- data[, use_features] %>% data.frame()

  ## ...(modeling code, including train functions and desired output)...

}


I haven't tried using lapply instead of a for loop yet. I'd appreciate any suggestions for efficiency.
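
One possible refactor, offered as an untested sketch rather than a drop-in replacement: wrap the body of the loop in a helper (run_fold() here is a hypothetical name) and map over the folds with lapply(), collecting each fold's results in a list.

## Hypothetical helper wrapping the per-fold feature selection and modeling shown above
run_fold <- function(f, data, folds) {
  testIndexes <- which(folds == f)
  trainData   <- data[-testIndexes, ]
  testData    <- data[testIndexes, ]
  ## ... rfe(), trainControl(), train(), etc., exactly as in the loop body ...
  list(fold = f, test_rows = testIndexes)   # placeholder return value
}

cv_results <- lapply(1:10, run_fold, data = data, folds = folds)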










  • had you got any solutions
    – Shubham Sharma
    Jan 24 at 13:22

  • I ended up assigning folds outside the loop: Then, for
    – RoseS
    Mar 16 at 17:15















Tags: cross-validation, r-caret, feature-selection






asked Aug 3 '17 at 18:41 by RoseS; edited Nov 8 at 9:19 by jmuhlenkamp











