375463 (5)
Can you talk a little more about the choice of an aucTrain threshold of >= .8 for single-variable models on categorical variables, versus >= .55 for numeric variables, in listing 6.7? I can see how a higher threshold is valuable if a large number of levels inflates aucTrain for categorical variables; I am interested in whether there is anything specific about .8 that makes it your recommendation. In my case, the original thresholds returned no results for categorical variables. Once I lowered the threshold to .55 for catVars, I got results, and in fact found my best single-variable model by calibrationAUC (a little larger than the best calibrationAUC among my numeric variables). Should I be skeptical of this result?

Thank you!
john.mount (79)
The thresholds are a bit arbitrary, and we may not have picked them well. My idea was that categorical variables have more levels (and so more room to overfit the training set), so we should expect more from them before we pass them. But as you noticed, this can cut every variable (not good).
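
To make the "more levels" concern concrete, here is a minimal base-R sketch on simulated data. It is not the code from listing 6.7: `rankAUC` and `scoreNoise` are hypothetical helper names, and the data are pure noise by construction. It scores categorical variables that carry no signal by their per-level training outcome rates and compares training AUC against AUC on held-out rows.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
rankAUC <- function(score, truth) {
  r <- rank(score)
  nPos <- sum(truth)
  nNeg <- sum(!truth)
  (sum(r[truth]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

set.seed(2024)
n <- 2000
y <- runif(n) < 0.3        # simulated binary outcome (assumption)
isTrain <- runif(n) < 0.5  # random train / calibration split

# Score a pure-noise categorical variable with k levels.
scoreNoise <- function(k) {
  x <- sample.int(k, n, replace = TRUE)            # unrelated to y
  rates <- tapply(y[isTrain], x[isTrain], mean)    # per-level training rate
  pred <- as.numeric(rates[as.character(x)])
  pred[is.na(pred)] <- mean(y[isTrain])            # levels unseen in training
  c(levels = k,
    aucTrain = rankAUC(pred[isTrain], y[isTrain]),
    aucCal = rankAUC(pred[!isTrain], y[!isTrain]))
}

res <- t(sapply(c(2, 20, 500), scoreNoise))
print(round(res, 3))
```

In runs like this the training AUC of the noise variable climbs as the number of levels grows, while the calibration AUC stays close to 0.5. That is why a variable that looks good on calibrationAUC, as in your case, is more believable than one that only clears a training threshold.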

Checking the variables on two sets helps a bit (as one set isn't involved in the construction of the variables). A bit more principled is selecting variables on effect significance (how unlikely an effect of that strength would be under pure chance) instead of effect size.
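
One way to picture the difference between effect size and effect significance is a permutation test, sketched below in base R. This is an illustration I am swapping in, not the method used in the book or by vtreat; `rankAUC`, `trainAUC`, and `permSig` are hypothetical names, and the data are simulated.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
rankAUC <- function(score, truth) {
  r <- rank(score)
  nPos <- sum(truth)
  nNeg <- sum(!truth)
  (sum(r[truth]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Training AUC of one categorical variable scored by per-level outcome rates.
trainAUC <- function(x, y) {
  rates <- tapply(y, x, mean)
  rankAUC(as.numeric(rates[as.character(x)]), y)
}

# Permutation p-value: share of outcome shuffles whose training AUC matches
# or beats the observed one (with the usual +1 correction).
permSig <- function(x, y, nPerm = 200) {
  observed <- trainAUC(x, y)
  nullAUC <- replicate(nPerm, trainAUC(x, sample(y)))
  (1 + sum(nullAUC >= observed)) / (1 + nPerm)
}

set.seed(2024)
n <- 1000
y <- runif(n) < 0.3
noisy <- sample.int(300, n, replace = TRUE)          # many levels, no signal
useful <- ifelse(y, runif(n) < 0.6, runif(n) < 0.4)  # two levels, weak signal

aucNoisy <- trainAUC(noisy, y)
aucUseful <- trainAUC(useful, y)
sigNoisy <- permSig(noisy, y)
sigUseful <- permSig(useful, y)
print(c(aucNoisy = aucNoisy, aucUseful = aucUseful,
        sigNoisy = sigNoisy, sigUseful = sigUseful))
```

The many-level noise variable gets the larger training AUC, yet shuffled outcomes do about as well, so its significance is unremarkable. The weak two-level variable has a modest AUC that shuffled outcomes rarely reach. Ranking on significance therefore adjusts for the number of levels without a hand-picked threshold per variable type.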

If you have the time, take a look at https://github.com/WinVector/zmPDSwR/blob/master/KDD2009/KDD2009vtreat.Rmd (and the matching rendered HTML). It shows how to prepare the variables using the vtreat library (available on CRAN). In particular, add the line <code>print(treatmentsC$scoreFrame)</code> (after the treatments are designed) to see a lot of useful diagnostics.
375463 (5)
Thank you very much!