Hi,

I'm enjoying the material so far, but I agree with others here that the code in 2.1.2 is completely incomprehensible without an introduction to numpy. I strongly believe that a description like this should be included in the book: the code is hugely distracting from the main task at hand (understanding kNN), and many people will simply close the book at this point. You're more than welcome to yank any of the below for the book!

Each piece of code I reproduced below is followed by a description of what it does.

*****

dataSetSize = dataSet.shape[0]

dataSet is a numpy array. The shape attribute is a tuple holding the size of the array along each dimension: shape[0] gives the number of rows, shape[1] the number of columns, and so on. Here we are obtaining the number of rows, since this corresponds to the number of observations (points) in our data set.
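Here's a quick illustration with a toy array (the values are my own, not the book's):

```python
import numpy as np

# A toy 4x2 array standing in for dataSet: 4 points, 2 features each
dataSet = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])

print(dataSet.shape)     # the whole tuple: (4, 2)
print(dataSet.shape[0])  # number of rows, i.e. number of points: 4
```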

*****

(First, just part of the line; the full version appears further down.)

diffMat = tile(inX, (dataSetSize,1))

inX is a Python list; this line replicates inX a total of dataSetSize times. The tile function here takes two parameters: the first is what you want to replicate, and the second is how you want to replicate it. The second parameter here is the tuple (dataSetSize, 1), meaning that inX is replicated in a grid of dataSetSize rows and 1 column. (That is, it's replicated dataSetSize times going down, but only once going across.)

Here's an example:

>>> a = [2, 3]
>>> tile(a, (5, 1))   # replicate as five copies down, one copy across
array([[2, 3],
       [2, 3],
       [2, 3],
       [2, 3],
       [2, 3]])

The reason 'tile' is useful here is that it creates an array with the same dimensions (same number of rows and columns) as dataSet. Since the two arrays are the same shape, they can be subtracted element by element. In the full line of code:

diffMat = tile(inX, (dataSetSize,1)) - dataSet

we take each element of the tiled array and subtract the corresponding element of dataSet. Each row of diffMat therefore contains the differences between the features of the target inX and the features of one of the points in dataSet. This is the beginning of our computation of the Euclidean distances.
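To make that concrete, here is the subtraction run on toy data (the values are mine; np.tile is the same function as the bare tile() the book gets from "from numpy import *"):

```python
import numpy as np

inX = [0.0, 0.0]                 # the point we want to classify
dataSet = np.array([[1.0, 1.1],  # toy data set: 4 points, 2 features
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
dataSetSize = dataSet.shape[0]

# Row i of diffMat is inX minus row i of dataSet
diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
print(diffMat)
```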

*****

sqDiffMat = diffMat**2

This squares each element of diffMat. sqDiffMat therefore holds the squared per-feature differences between inX and each point in dataSet; it does not hold the squared distances themselves yet. The next thing to do is therefore find the sum of each row in sqDiffMat...
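A tiny example with made-up values:

```python
import numpy as np

diffMat = np.array([[-1.0, -1.1],
                    [ 0.0, -0.1]])

# ** applies element-wise to a numpy array; the signs disappear
sqDiffMat = diffMat**2
print(sqDiffMat)   # squares: 1.0, 1.21 / 0.0, 0.01
```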

*****

sqDistances = sqDiffMat.sum(axis=1)

For a dataSet consisting of five points, where each row has three measured features, our sqDiffMat could look something like this (every number is nonnegative by now, since each has been squared):

array([[1, 2, 3],
       [3, 4, 1],
       [5, 5, 1],
       [2, 3, 7],
       [3, 4, 3]])

What we want to do is compute the sum of each row, since this gives the squared distance between each observation in dataSet and our inX vector. That is, we want to calculate 6, 8, 11, 12, 10. And the 'sum' function does exactly this:

>>> sqDiffMat.sum(axis=1)
array([ 6,  8, 11, 12, 10])

If you leave out 'axis=1', you get 47, the sum of the entire array, and not what we want. If you use axis=0 instead of axis=1, you get the sum of the columns, which again is not useful.
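The three variants side by side, using the same toy matrix:

```python
import numpy as np

sqDiffMat = np.array([[1, 2, 3],
                      [3, 4, 1],
                      [5, 5, 1],
                      [2, 3, 7],
                      [3, 4, 3]])

print(sqDiffMat.sum())        # 47: every element summed together
print(sqDiffMat.sum(axis=0))  # [14 18 15]: column sums
print(sqDiffMat.sum(axis=1))  # [ 6  8 11 12 10]: row sums -- what we want
```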

*****

distances = sqDistances**0.5

At this point, sqDistances is a vector with dataSetSize elements holding the squared distance from inX to each point. The only thing left to do is to take the square root of each value.
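A quick demonstration (made-up values chosen to have clean square roots):

```python
import numpy as np

sqDistances = np.array([16.0, 9.0, 4.0])
distances = sqDistances**0.5   # element-wise square root
print(distances)               # [4. 3. 2.]
```

Like `**2` earlier, `**0.5` applies element-wise to the whole array.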

*****

sortedDistIndicies = distances.argsort()

Now that we have all of the distances, we want the k smallest ones. So we want to sort the distances. argsort returns the _indices_ (not the values) that would sort 'distances' in ascending order. Here's what I mean:

>>> distances
array([ 6,  8, 11, 12, 10])
>>> distances.argsort()
array([0, 1, 4, 2, 3])

*****

classCount = {}
for i in range(k):
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

We care only about the k closest data points. This loop goes through those k closest, grabbing the corresponding labels, and keeping track of how many times those labels appear. Thus, classCount maps labels to occurrence counts.
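Here's the loop run on toy values (the labels, k, and index order are my own, not the book's):

```python
# Five points, their labels, and indices already sorted by distance
k = 3
labels = ['a', 'b', 'c', 'b', 'a']
sortedDistIndicies = [0, 1, 4, 2, 3]

classCount = {}
for i in range(k):
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

print(classCount)   # {'a': 2, 'b': 1}: two of the 3 nearest are 'a'
```

The .get(voteIlabel, 0) idiom returns 0 the first time a label is seen, so the loop never raises a KeyError.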

*****

sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

At this point, our classCount looks something like this:

{'a': 3, 'c': 5, 'b': 2}

This means that the 'a' label appeared three times, 'b' appeared two times, and 'c' appeared five times. Ultimately, we want to return 'c' because it appeared most often among the k closest neighbors.

This code sorts these items into a list, biggest first, so that the result is:

[('c', 5), ('a', 3), ('b', 2)]

Notice that we want to sort based on the second component of each tuple. This is accomplished by the key function:

operator.itemgetter(1)

which builds a callable that, given a tuple (a, b), returns b (i.e. the element at index 1 of that tuple).
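A quick check of that behavior:

```python
import operator

get_count = operator.itemgetter(1)   # builds a function that picks index 1
print(get_count(('c', 5)))           # 5

pairs = [('a', 3), ('c', 5), ('b', 2)]
print(sorted(pairs, key=operator.itemgetter(1), reverse=True))
# [('c', 5), ('a', 3), ('b', 2)]
```

(One caveat: iteritems() is Python 2; on Python 3 the same line works with classCount.items().)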

*****

return sortedClassCount[0][0]

This grabs the first element of the first tuple in the sorted list: the label with the highest vote count among the k nearest neighbors. That label is our prediction for inX.
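For completeness, here are all the snippets assembled into one runnable function. The function name and the toy data are mine, and I've used items() so it runs on Python 3 (the book's code targets Python 2 with iteritems()):

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Distance computation, exactly as walked through above
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    # Voting among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))   # 'B'
```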