rachelb (6) [Avatar] Offline
#1
I was already familiar with "count min sketch" probabilistic algorithm and was eager to understand the hyperloglog algorithm. I appreciate the section is only a high level overview, but when I finished it I was confused about how the algorithm works.

My understanding is that a hash value is taken for each element in the stream, then the number of leading zeros for each hash value is taken and then the maximum of the number of leading zeros is determined. This maximum is an estimate for log base 2 of the number of distinct elements. Finally multiple estimates are taken and then combined to give the final estimate.

However, the description misses out the need to take a maximum and the fact that it is an estimate log base 2 of the final number - so the algorithm is not clear at all.

Two other areas could also be clarified. The count of leading zeros being an estimate is not immediate obvious. A few sentences to help build intuition in this area would help.

Also the use of the harmonic mean is also not obvious. Again a few sentences to indicate that each distinct count is an estimate and they are combined to produce a better estimate and the harmonic mean is used as it produces a more accurate estimate.

Finally, more guidance on the relationship between number of registers (hence memory) and accuracy would be useful - I am sure like count min sketch there is a formula for this.



andrew.psaltis (31) [Avatar] Offline
#2
Thank you for your detailed analysis and pointing out things that were flawed or where the text was lacking. I have taken all of this into consideration and updated the text accordingly.

Again, sorry for the delay in responding to you regarding this.