Re: Definition of shape1 & shape2 in Bayesian evaluation of A/B test, page 355
Thanks for the question. Sorry if we were too telegraphic about what is going on. It is actually a beautiful topic, and I'll try to explain it here.
You are right: with the amount of data we have, the commonRate gets swamped out (so you don't really need it; you would get nearly the same answer without it). That is a good thing.
Roughly, what we are doing is using a Bayesian formulation of A/B testing. The math is based on the assumption that the conversion rate (or intensity) is an unobserved quantity that has a prior distribution of plausible values. As we observe events, we get a new posterior estimate of the distribution of plausible values of the conversion rate. The easiest way to do this is to additionally assume that the unknown conversion rate is distributed according to the beta distribution (mentioned in this section of the appendix and earlier in the appendix).
The beta distribution has two shape parameters, here called shape1 and shape2. We are implicitly saying that before we look at the B results, a plausible, somewhat non-informative prior on the B rate is Beta(shape1=commonRate, shape2=1-commonRate). That is a distribution with mean equal to the commonRate (the rate of conversions from the A and B observations grouped together; sort of a frequentist-style null hypothesis that there is no difference, or a deliberate bias toward assuming no difference prior to looking at the data). The commonRate is a fraction, so this is like adding a single observation that is fractionally split between converting and not converting. Then, as in the book, after we see tab['B','1'] and tab['B','0'], we say the posterior distribution of the rate is Beta(shape1=commonRate+tab['B','1'], shape2=1-commonRate+tab['B','0']), which is our actual observations plus our fractional pseudo-observation. So if we see that the distribution of plausible B rates is very far from that of the A rates, this is good evidence the B rate is in fact better.
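To make the posterior comparison concrete, here is a small sketch in Python (the counts and the random seed are made up for illustration, standing in for tab['A','1'], tab['B','1'], and so on). It builds both posterior Beta distributions and uses Monte Carlo draws to estimate the probability that the B rate exceeds the A rate:

```python
import random

# Hypothetical counts standing in for tab['A','1'], tab['A','0'], tab['B','1'], tab['B','0']
conv_A, nonconv_A = 200, 9800
conv_B, nonconv_B = 250, 9750

# Pooled ("common") conversion rate from both groups together
common_rate = (conv_A + conv_B) / (conv_A + nonconv_A + conv_B + nonconv_B)

# Draw from each posterior:
#   Beta(commonRate + conversions, (1 - commonRate) + non-conversions)
random.seed(355)
n_draws = 20000
wins = 0
for _ in range(n_draws):
    a = random.betavariate(common_rate + conv_A, (1 - common_rate) + nonconv_A)
    b = random.betavariate(common_rate + conv_B, (1 - common_rate) + nonconv_B)
    wins += b > a

# Estimated probability the B rate is truly higher than the A rate
prob_B_better = wins / n_draws
print(prob_B_better)
```

With these (invented) counts the two posteriors barely overlap, so the estimated probability comes out close to 1, which is the "good evidence the B rate is in fact better" situation described above.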
This can seem a bit mysterious. But the ease of calculation comes from what Bayesians call "conjugate distributions." If we assume the unknown B rate is Beta-distributed with some parameters, then the posterior distribution estimate is also Beta-distributed with new parameters (which are in fact just the original parameters with the observed counts added in). This is picking a prior distribution on the unknown parameter (the Beta distribution) that is conjugate to the assumed data-generating process (the Binomial or Bernoulli distribution).
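The conjugate update itself is just bookkeeping: add the successes to shape1 and the failures to shape2. A minimal sketch (the function name and the example counts are ours, for illustration):

```python
def beta_posterior(prior_shape1, prior_shape2, successes, failures):
    """Conjugate Beta-Binomial update: the posterior is again a Beta,
    with the observed counts added to the prior shape parameters."""
    return prior_shape1 + successes, prior_shape2 + failures

# Start from the fractional pseudo-observation prior Beta(commonRate, 1 - commonRate)
common_rate = 0.04
s1, s2 = beta_posterior(common_rate, 1 - common_rate, successes=50, failures=950)

# Posterior mean of a Beta(s1, s2) is s1 / (s1 + s2); with this much data
# it sits essentially on the observed rate 50/1000
print(s1 / (s1 + s2))
```

Note that the single fractional pseudo-observation barely moves the estimate here, which is the "swamped out" behavior mentioned at the top of this reply.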
The theory is pretty standard Bayesian practice (though often used with a truly uninformative prior called the Jeffreys prior). This is also related to "Laplace smoothing," where you add one positive and one negative pseudo-observation before starting (though here we are adding a total of one pseudo-observation instead of two, and we are monkeying around to get the starting mean to be something plausible rather than always 1/2, which could be a huge conversion rate).
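A quick numeric illustration of that difference (the counts are hypothetical): on a small sample, the Laplace-style Beta(1,1) prior pulls the estimated rate toward 1/2, while the single fractional pseudo-observation Beta(commonRate, 1 - commonRate) pulls it toward the commonRate:

```python
# Hypothetical small sample and pooled rate
common_rate = 0.04
successes, failures = 2, 98

# Laplace smoothing: Beta(1, 1) prior, i.e. one success and one failure added
laplace_mean = (1 + successes) / (2 + successes + failures)

# Book's approach: Beta(commonRate, 1 - commonRate), one fractional pseudo-observation
fractional_mean = (common_rate + successes) / (1 + successes + failures)

print(laplace_mean)     # pulled up toward 1/2
print(fractional_mean)  # stays near the observed rate / commonRate
```

With a large sample both estimates converge to the observed rate; the choice of prior only matters early on, exactly when a bias toward 1/2 would be most misleading for low conversion rates.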
This is definitely something you will want to read more on (by us and by others). The book was limited by space. I suggest checking out http://www.winvector.com/blog/2014/04/bandit-formulations-for-ab-tests-some-intuition/ and http://www.winvector.com/blog/2014/05/a-clear-picture-of-power-and-significance-in-ab-tests/
