Thanks for your comment!

This is a tricky subject. What assumptions you need depends on which modeling framework you are using. We ended up sharing a slightly mixed point of view in the book (frankly, we are sympathetic to the Bayesian view, which suggests transforms, but we didn't want to go to a full Bayesian model). We try to clarify this a bit in our errata:

http://winvector.github.io/PDSwR/PracticalDataScienceWithRErrata.html
Overall things are a bit more operational than some would like: a method is good if it helps with the data and problems you have at hand (though obviously you need to choose from principled methods).

If you are using the Gauss-Markov theorem to justify linear regression you need only assume facts about the errors (that they have mean zero, are uncorrelated, and have the same variance; see:

http://www.win-vector.com/blog/2014/08/reading-the-gauss-markov-theorem/ ). If you are using a Bayesian/generative derivation you may want to assume some distributional facts about the x's and perhaps the y's. The "assuming normality" advice is along those lines, but it is not strictly what is traditionally taught in statistics.
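To make the Gauss-Markov point concrete, here is a small simulation sketch (not from the book or the linked post). It assumes Python with NumPy, and the sample sizes, coefficients, and the shifted-exponential error distribution are all made up for illustration. The errors are clearly non-normal (skewed), but they are mean zero, uncorrelated, and of equal variance, and the least-squares slope still averages out to the true value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative settings (assumptions, not taken from the discussion above).
n, trials = 200, 2000
true_intercept, true_slope = 2.0, 0.5

x = rng.uniform(0.0, 10.0, size=n)
design = np.column_stack([np.ones(n), x])  # intercept column plus x

slopes = np.empty(trials)
for t in range(trials):
    # Skewed, non-normal errors: exponential shifted to have mean zero, variance 1.
    errors = rng.exponential(1.0, size=n) - 1.0
    y = true_intercept + true_slope * x + errors
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes[t] = coef[1]

mean_slope = slopes.mean()
print(f"true slope {true_slope}, average fitted slope {mean_slope:.4f}")
```

This only illustrates unbiasedness of the coefficient estimate; it says nothing about confidence intervals or tests, which is where distributional assumptions usually come back in.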

See Andrew Gelman for some good ideas on regression:

http://andrewgelman.com/2013/08/04/19470/ (this is from a Bayesian point of view; the frequentists pride themselves on working from weaker assumptions).

Among the most convincing reasons to log-transform are:

1) To fix structural assumptions. For problems like income, wealth and hedonic regression it is plausible that each factor contributes a relative change in expectation (air conditioning may be valued as adding 10% to the value of a car, even though its cost is in dollars). So it is natural to model y ~ product_i (x_i)^(b_i), or equivalently log(y) ~ sum_i b_i * log(x_i) (notice we transformed both the x's and the y's here; the transform is not needed for categorical variables), or even log(y) ~ sum_i b_i * x_i (only the y's transformed).

2) To fix domain issues like y being non-negative.

3) To compress range. If y varies over several orders of magnitude, a few very large y's will dominate the fit. So if log(y) has a more reasonable range you are using more of your data (though you have changed the error model).
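The three reasons above can be sketched on synthetic data. This is a minimal illustration, assuming Python with NumPy; the data-generating process (y = 3 * x^1.5 with multiplicative noise) and the helper name fit_line are hypothetical and chosen only to show the mechanics of a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic multiplicative data (an assumption for illustration):
# y = 3 * x^1.5 * noise, so log(y) = log(3) + 1.5 * log(x) + error.
n = 500
true_b = 1.5
x = rng.uniform(1.0, 1000.0, size=n)
y = 3.0 * x**true_b * np.exp(rng.normal(0.0, 0.3, size=n))

def fit_line(u, v):
    """Least-squares fit of v ~ intercept + slope * u; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(u), u])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return coef[0], coef[1]

# Fit on the log scale: both x and y are transformed.
log_intercept, log_slope = fit_line(np.log(x), np.log(y))
print(f"log(y) ~ {log_intercept:.2f} + {log_slope:.2f} * log(x)")

# Range compression: ratio of largest to smallest y, versus spread of log(y).
raw_ratio = y.max() / y.min()
log_spread = np.log(y).max() - np.log(y).min()
print(f"max(y)/min(y) = {raw_ratio:.0f}, range of log(y) = {log_spread:.1f}")
```

Note that exponentiating a prediction made on the log scale does not in general give the expected value of y; that is one concrete consequence of having changed the error model.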

Also be aware: we are mostly using regression for prediction (estimating new unseen y's) not inference (what we called extracting advice, estimating the betas or coefficients). The requirements/standards are lower when making predictions than when inferring parameters. See

http://www.win-vector.com/blog/2014/04/what-is-meant-by-regression-modeling/ for a bit of discussion on this. For more transforms see also the book or

http://www.win-vector.com/blog/2012/03/modeling-trick-the-signed-pseudo-logarithm/
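For readers who want to experiment, here is a minimal sketch of one smooth signed log-like transform, assuming Python with NumPy. The function name pseudo_log10 and the arcsinh-based form are assumptions made here for illustration and may not match the exact definition in the linked post. The idea is that asinh(x/2) is close to ln(x) for large positive x, is zero at zero, and is odd, so negative values are handled symmetrically.

```python
import numpy as np

def pseudo_log10(x):
    """Smooth, sign-preserving stand-in for log10, defined at zero and for negatives."""
    x = np.asarray(x, dtype=float)
    return np.arcsinh(x / 2.0) / np.log(10.0)

values = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(pseudo_log10(values))
```

Unlike log(y), this can be applied to variables such as profit or net change that take zero or negative values, at the cost of behaving linearly rather than logarithmically near zero.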