298118 (1)
#1
Is it wise to do pre-processing after vtreat, such as removing near-zero-variance and highly correlated predictors, or should this be done before vtreat? vtreat appears to create a large number of columns...
john.mount (79)
#2
Whether it is worth removing columns after vtreat processing depends on which machine learning step comes next. If that step depends directly on the geometry of the columns, I would suggest setting scale=TRUE in vtreat and using principal components analysis to reduce the number of columns (setting project=TRUE in the https://github.com/WinVector/zmPDSwR/blob/master/KDD2009/KDD2009vtreat.Rmd example shows how to do that). If it is something that specializes in picking variables (like decision trees, gradient boosting, or random forests), I would not worry as much about the number of variables. In all cases, variables that don't move at all are not passed through vtreat, and you can also choose variables to prune out by looking at the variable scores returned in the treatment plan (try names() on the plan to see which slots are populated).
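To make the steps above concrete, here is a minimal sketch in R. The data frame, its column names (x_num, x_cat, x_const, y), and the 0.05 significance cutoff are all made up for illustration and are not taken from the KDD2009 example; the scoreFrame slot and the pruneSig argument are what I would expect in a reasonably recent vtreat release, so check names() on your own plan if they differ.

```r
library(vtreat)

set.seed(2015)

# Hypothetical toy data: one numeric input, one categorical input,
# and one column that never varies.
d <- data.frame(
  x_num = rnorm(200),
  x_cat = sample(c("a", "b", "c"), 200, replace = TRUE),
  x_const = 1,
  stringsAsFactors = FALSE
)
d$y <- d$x_num + ifelse(d$x_cat == "a", 1, 0) + rnorm(200)

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(d, c("x_num", "x_cat", "x_const"), "y",
                          verbose = FALSE)

# See which slots the plan carries, then look at the per-variable scores.
# x_const does not move, so it should not produce any treated variables.
print(names(plan))
print(plan$scoreFrame[, c("varName", "origName", "sig")])

# Prune on the significance score and ask for scaled columns.
# (Assumption: 0.05 is an arbitrary cutoff. In real work, design the plan
# and prepare the modeling frame on different rows of data.)
treated <- prepare(plan, d, pruneSig = 0.05, scale = TRUE)
model_vars <- setdiff(colnames(treated), "y")

# Optional geometric reduction: PCA on the scaled, treated columns.
pca <- prcomp(treated[, model_vars, drop = FALSE], center = TRUE)
print(summary(pca))
```

For a tree-based learner I would likely stop after the prepare() call and hand it all of model_vars; the prcomp() step is only meant for methods that are sensitive to column geometry.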