Using Caret to select features within folds of a cross validation

In the caret package, is there any way to use the recursive feature elimination function within the folds of a cross-validation scheme defined in trainControl and passed to a train call that uses tuning grids?



I love the recursive feature elimination function, but it really should be applied to the training folds in cross validation and then tested on the hold-out fold.



I've played around with a bunch of different methods to do this, but none are perfect. For example, I can make my own cross-validation folds and run trainControl with method = 'none', but that won't utilize the tuning grid in train (a resampled evaluation set is needed for that). I can also make my own CV folds and use method = 'cv' in trainControl (a tuning grid works here), but the best tune is chosen on the hold-out samples that trainControl generates, not on my hold-out.



Is there a way to tell caret to evaluate the models with the tuning grids on my pre-specified hold-out fold (the one taken prior to feature elimination)?
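
(A sketch of one possible route, using the index and indexOut arguments of trainControl, which the SOLUTION below also relies on. This is untested against the exact workflow here, and data, target, my_grid, and holdout_rows are placeholder names: index lists the rows each candidate model is fit on, and indexOut lists the rows the tuning grid is scored on.)

library(caret)

## Minimal sketch, assuming placeholder objects:
##   data         - full data frame with an outcome column 'target'
##   holdout_rows - pre-specified hold-out row indices (taken before RFE)
##   my_grid      - a tuning grid for the chosen model
train_rows <- setdiff(seq_len(nrow(data)), holdout_rows)

ctrl <- trainControl(index    = list(Fold1 = train_rows),    # rows each candidate model is fit on
                     indexOut = list(Fold1 = holdout_rows),  # rows the tuning grid is evaluated on
                     savePredictions = TRUE)

fit <- train(target ~ ., data = data,
             method    = "rf",       # any caret method; "rf" only for illustration
             tuneGrid  = my_grid,
             trControl = ctrl)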



In my workflow, I am testing a few different model types, each with its own tuning grid. There are parts of caret I really like, and I've spent a ton of time on this, so I'd like to use it, but this is a deal breaker if I can't get it to work. I'm open to any suggestions!



Thanks in advance-



SOLUTION:
My solution may not be the most efficient, but it seems to work. I made my cross validation folds using information here: https://stats.stackexchange.com/questions/61090/how-to-split-a-data-set-to-do-10-fold-cross-validation.
Using createFolds (a caret function) does not create equally sized folds, so I opted for the second solution from that post. It looks like you might be able to do it with caret's stratified sampling, but I haven't explored that yet.
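
(For reference, a minimal sketch of the stratified route, untested here; data$target is assumed to be the outcome column, matching the code below. createFolds() splits within each class of the outcome, so class balance is preserved even though fold sizes may differ by a few rows.)

library(caret)

## Stratified fold assignment: createFolds() partitions each outcome class
## separately, so the folds are balanced by class but only roughly equal in size.
set.seed(123)
cv_folds <- createFolds(data$target, k = 10, list = TRUE, returnTrain = FALSE)
sapply(cv_folds, length)   # inspect the (roughly equal) fold sizes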



This code uses a bootstrap approach within each CV fold and, at each bootstrap iteration, predicts all of the observations in the hold-out fold.



library(caret)
library(dplyr)

## Make the folds for the cross validation: assign rows to 10 roughly equal
## groups, then shuffle the assignments
folds <- cut(seq_len(nrow(data)), breaks = 10, labels = FALSE) %>%
  sample(., length(.), replace = FALSE)

## Specify 'features' as a data frame ahead of time (accumulates the
## features selected in each fold)
features <- data.frame(features = character(), fold = integer())

for (f in 1:10) {

  testIndexes  <- which(folds == f)
  trainIndexes <- which(folds != f)

  ## 500 bootstrap resamples of the training rows; every resample is scored
  ## on the same hold-out fold
  trainIndexList <- replicate(500, sample(trainIndexes, length(trainIndexes), replace = TRUE), simplify = FALSE)
  testIndexList  <- replicate(500, testIndexes, simplify = FALSE)

  testData  <- data[testIndexes, ]
  trainData <- data[-testIndexes, ]

  ## Make the train control object
  ## (modfun is a user-defined summary function; centering/scaling should be
  ## requested in train() via preProcess = c('center', 'scale'), not here)
  train_control <- trainControl(method = 'boot',
                                number = 1,
                                summaryFunction = modfun,
                                index = trainIndexList,
                                indexOut = testIndexList,
                                classProbs = TRUE,
                                savePredictions = TRUE)

  ## Feature selection: control for the recursive feature elimination
  rfe_control <- rfeControl(functions = rfFuncs, method = 'cv', number = 10)

  ## Run RFE on the training fold only (column 1 is assumed to be 'target')
  fs_results <- rfe(trainData[, 2:ncol(trainData)],
                    trainData[, 'target'],
                    sizes = 2:(ncol(trainData) - 1),
                    rfeControl = rfe_control)

  use_features <- c('target', predictors(fs_results))

  ## Record which features were selected in this fold
  features <- predictors(fs_results) %>%
    data.frame(features = .) %>%
    mutate(fold = f) %>%
    rbind(features, .)

  data_min <- data[, use_features] %>% data.frame()

  ## ...(modeling code, including train() calls and desired output)...

}


I haven't tried to do an lapply instead of a for loop yet. I'd appreciate any suggestions for efficiency.
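
(One possible refactor, as a sketch only: run_fold() below is a hypothetical wrapper, not the code actually used above. Moving the body of the loop into a function and iterating with lapply keeps the per-fold results in a list and makes it easy to switch to parallel::mclapply later.)

## Hypothetical wrapper: run_fold() would hold the per-fold code above
## (index lists, rfe, train calls) and return whatever the loop accumulates.
run_fold <- function(f, data, folds) {
  testIndexes  <- which(folds == f)
  trainIndexes <- which(folds != f)
  ## ... feature selection and modeling for this fold ...
  list(fold = f, test_rows = testIndexes)   # placeholder return value
}

fold_results <- lapply(1:10, run_fold, data = data, folds = folds)
## or in parallel: parallel::mclapply(1:10, run_fold, data = data, folds = folds)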










Tags: cross-validation, r-caret, feature-selection

asked Aug 3 '17 at 18:41 by RoseS · edited Nov 8 at 9:19 by jmuhlenkamp

  • had you got any solutions
    – Shubham Sharma
    Jan 24 at 13:22

  • I ended up assigning folds outside the loop: Then, for
    – RoseS
    Mar 16 at 17:15