CElliott (12)
#1
Real-world Machine Learning is a great book, and I have really enjoyed reading it. The best part of the book is the attention to detail. For example, when machine learning is used as an adjective, it is hyphenated; when it is used as a noun, it is not hyphenated. Few authors today take the time to get that detail right. However, the day might not be complete without one small demur.

When a binary categorical variable, such as gender, is turned into two columns, such as male and female, both Excel Regression and Mathematica LinearModelFit blow up because the two columns, together with the constant term, are not linearly independent. The same is true of the AutoMPG dataset if region is made three columns instead of two. Every book I have ever read about regression analysis says this: a binary categorical variable should be one column of zeros and ones, if for no other reason than that the partial derivative of the regression with respect to that column, which is the column's coefficient, then makes sense. If there are three columns for region, then what does the partial derivative of the regression WRT, say, Europe mean: the change in MPG WRT a car that is made nowhere?
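To make the linear-dependence point concrete, here is a minimal sketch in Python with NumPy. The six-row gender column is made-up toy data, not the Titanic dataset, and the variable names are my own. It only shows that adding a female column next to a male column and a constant term leaves the rank of the design matrix unchanged.

```python
import numpy as np

# Hypothetical toy data (not from the book): 1 = male, 0 = female.
male = np.array([1, 0, 1, 1, 0, 0], dtype=float)
female = 1.0 - male
intercept = np.ones_like(male)

# One-column encoding: constant term + male.
X_one = np.column_stack([intercept, male])
# Two-column encoding: constant term + male + female.
X_two = np.column_stack([intercept, male, female])

rank_one = np.linalg.matrix_rank(X_one)
rank_two = np.linalg.matrix_rank(X_two)

print("one-column encoding: rank", rank_one, "of", X_one.shape[1], "columns")
print("two-column encoding: rank", rank_two, "of", X_two.shape[1], "columns")

# The dependence is exact: male + female reproduces the constant column.
print(np.allclose(male + female, intercept))
```

The two-column design has three columns but only rank two, which is why a solver that inverts X'X directly cannot proceed.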

I don't understand how the software used for the book completes an analysis in which gender is two columns in the Titanic dataset or region is three columns in the AutoMPG dataset. What meaning do you then assign to the coefficient of male or female, or to the coefficient of Asia, Europe, or America?
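One possible explanation, offered as an assumption since I do not know which solver the book's software actually uses: some least-squares routines do not invert X'X but return a minimum-norm solution when the design matrix is rank deficient. The sketch below uses NumPy's np.linalg.lstsq on made-up data (the response y and its coefficients are hypothetical) to show that the redundant encoding then yields the same fitted values as the one-column encoding, and that only the difference between the male and female coefficients is pinned down.

```python
import numpy as np

# Hypothetical toy data (not from the book).
rng = np.random.default_rng(0)
male = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
female = 1.0 - male
y = 2.0 + 3.0 * male + rng.normal(scale=0.1, size=male.size)

X_one = np.column_stack([np.ones_like(male), male])          # constant, male
X_two = np.column_stack([np.ones_like(male), male, female])  # constant, male, female

# lstsq returns the minimum-norm least-squares solution,
# so it does not fail on the rank-deficient X_two.
beta_one, *_ = np.linalg.lstsq(X_one, y, rcond=None)
beta_two, *_ = np.linalg.lstsq(X_two, y, rcond=None)

# Both encodings span the same column space, so the fits agree.
same_fit = np.allclose(X_one @ beta_one, X_two @ beta_two)

# Individual coefficients in the redundant encoding are not unique,
# but the male-minus-female gap matches the one-column male coefficient.
gap = beta_two[1] - beta_two[2]

print("same fitted values:", same_fit)
print("male coefficient (one column):", beta_one[1])
print("male minus female (two columns):", gap)
```

Under this assumption, the coefficient of male or female has no meaning on its own; only contrasts such as male minus female, or Europe minus America, are interpretable.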

Thanks again for the wonderful book.