The Author Online Book Forums are Moving

The Author Online Book Forums will soon redirect to Manning's liveBook and liveVideo. All book forum content will migrate to liveBook's discussion forum and all video forum content will migrate to liveVideo. Log in to liveBook or liveVideo with your Manning credentials to join the discussion!

Thank you for your engagement in the AoF over the years! We look forward to offering you a more enhanced forum experience.

danielzingaro (6) [Avatar] Offline

I'm enjoying the material so far, but I agree with others here that the code in 2.1.2 is completely incomprehensible without an introduction to numpy. I strongly believe that a description like this should be included in the book: the code is hugely distracting from the main task at hand (understanding kNN), and many people will simply close the book at this point. You're more than welcome to yank any of the below for the book! smilie

Each piece of code I reproduced below is followed by a description of what it does.


dataSetSize = dataSet.shape[0]
dataset is a numpy array. The shape attribute contains the number of rows, columns, etc. from an array. shape[0] gives the number of rows, shape[1] the number of columns. Here we are obtaining the number of rows, since this corresponds to the number of observations (points) in our data set.


[partial line]
diffMat = tile(inX, (dataSetSize,1))
inX is a Python list; this line replicates inX a total of dataSetSize times. The tile function here takes two parameters: the first is what you want to replicate, and the second is how you want to replicate it. The second parameter here is the tuple (dataSetSize, 1), meaning that inX is replicated in a grid of dataSetSize rows and 1 column. (That is, it's replicated dataSetSize times going down, but only once going across.)

Here's an example:
>>> a = [2, 3]
# Replicate as five copies down, one copy across
>>> tile (a, (5, 1))
array([[2, 3],
[2, 3],
[2, 3],
[2, 3],
[2, 3]])

The reason 'tile' is useful here is because it creates an array to be the same dimensions (same number of rows and columns) as dataSet. And since this is the case, the two arrays can be subtracted. In the full line of code:
diffMat = tile(inX, (dataSetSize,1)) - dataSet
we are taking each element of diffMat and subtracting the corresponding element of dataSet. Each row of diffMat, therefore, contains the differences of the features of the target inX and one of the points in dataSet. This is the beginning of our computation of the Euclidean distances.


sqDiffMat = diffMat**2
This squares each element of diffMat. sqDiffMat is now therefore the square of the distances between inX and each point in dataSet. The next thing to do is therefore find the sum of each row in sqDiffMat...


sqDistances = sqDiffMat.sum(axis=1)

For a dataSet consisting of five points, where each row has three measured features, our sqDiffMat could look something like this (all numbers must be positive by now, since each has been squared):
array([[1, 2, 3],
[3, 4, 1],
[5, 5, 1],
[2, 3, 7],
[3, 4, 3]])

What we want to do is compute the sum of each row, since this tells us the distance between each observation in dataSet and our inX vector. That is, we want to calculate 6, 8, 11, 12, 10. And the 'sum' function does exactly this:
>>> sqDiffMat.sum (axis=1)
array([ 6, 8, 11, 12, 10])
If you leave out 'axis=1', you get 47, the sum of the entire array, and not what we want. If you use axis=0 instead of axis=1, you get the sum of the columns, which again is not useful.


distances = sqDistances**0.5
At this point, sqDistances is a vector with dataSetSize elements that almost contains each distance. The only thing left to do is to take the square root of each value.


sortedDistIndicies = distances.argsort()
Now that we have all of the distances, we want the k smallest ones. So we want to sort the distances. This line creates a vector consisting of the order of the _indices_ (not values) from 'distances', in ascending order. Here's what I mean:

>>> distances
array([ 6, 8, 11, 12, 10])
>>> distances.argsort()
array([0, 1, 4, 2, 3])


for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

We care only about the k closest data points. This loop goes through those k closest, grabbing the corresponding labels, and keeping track of how many times those labels appear. Thus, classCount maps labels to occurrence counts.


sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)

At this point, our classCount looks something like this:
{'a': 3, 'c': 5, 'b': 2}
This means that the 'a' label appeared three times, 'b' appeared 2 times, and 'c' appeared five times. Ultimately, we want to now return 'c' because it existed most among the k closest neighbors.

This code sorts these items into a list, biggest first, so that the result is:
[('c', 5), ('a', 3), ('b', 2)]

Notice that we want to sort based on the second component of each tuple. This is accomplished by:
which, when called with a tuple (a, b), returns (b) (i.e. index 1 from that tuple).

return sortedClassCount[0][0]
This grabs the highest-numbered label from the list of the k sorted neighbors.
peter.harrington (82) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
Hi Daniel,
Thanks for taking the time to write such a long, constructive post.
The readers have spoken and my example is difficult to follow. I think this is a good place to introduce them to NumPy.
I would like to copy your explanation verbatim and put it in the book but the publisher has certain conventions that I have to follow. I will try to get an explanation as close to this as possible in the text while following their rules.
Thanks again for writing this.
danielzingaro (6) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
Hi Peter,

Thanks for responding. A numpy intro really is required, and I agree that (as I did) above, it can be folded into the kNN description.

I'll make my way through more of the book this evening and provide further comments. You likely don't have to go into grueling detail like this in future chapters, once readers have a grasp of numpy.

Take my text above as public domain. I'm happy to help however I can. I see value in this book's material, but it is sufficiently complex that tight, meticulous explanations (in the style of a textbook) might help. I imagine that most people reading this book are interested not only in the solution code, but also how it works. I understand that "in action" means "how do I use this stuff to do something cool", but the nature of the topic means you're likely attracting many academics... I am one smilie

Thanks again,
rtc1 (2) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
Special thank you to Daniel! Very helpful!

In case anyone is new to Python like I am and would like to see a verbose tracing of the variable types and values after assignment, please see the text below:

>>> from kNN import *

inX=<type 'list'> = [0, 0]
dataSet=<type 'numpy.ndarray'> =
[[ 1. 1.1]
[ 1. 1. ]
[ 0. 0. ]
[ 0. 0.1]]
labels=<type 'list'> = ['A', 'A', 'B', 'B']
k=<type 'int'> = 3
dataSetSize=<type 'int'> = 4
diffMat=<type 'numpy.ndarray'> =
[[-1. -1.1]
[-1. -1. ]
[ 0. 0. ]
[ 0. -0.1]]
sqDiffMat=<type 'numpy.ndarray'> =
[[ 1. 1.21]
[ 1. 1. ]
[ 0. 0. ]
[ 0. 0.01]]
sqDistances=<type 'numpy.ndarray'> = [ 2.21 2. 0. 0.01]
distances=<type 'numpy.ndarray'> = [ 1.48660687 1.41421356 0. 0.1 ]
sortedDistIndices=<type 'numpy.ndarray'> = [2 3 1 0]
classCount=<type 'dict'> = {}
i=<type 'int'> = 0
i=<type 'int'> = 1
i=<type 'int'> = 2
classCount=<type 'dict'> = {'A': 1, 'B': 2}
sortedClassCount=<type 'list'> = [('B', 2), ('A', 1)]
sortedClassCount[0][0]=<type 'str'> = B
results of the classification: B
> > >
aespinoza (1) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
I rewrote the code to reflect the purpose of each line. I did this to help me learn python mostly.

Here is the code: (Pretty print:

def myClassify0(inX, dataSet, labels, k) :
distances = calculateEuclideanDistance(inX, dataSet)

sortedDistIndices = distances.argsort()
return findKClosestDataPoint(sortedDistIndices, labels, k)

def findKClosestDataPoint(sortedDistIndices, labels, k) :
classCount = {}
for i in range(k) :
voteIlabel = labels[sortedDistIndices[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

sortedClassCount = sorted(classCount.iteritems(), key = operator.itemgetter(1), reverse = True)

return sortedClassCount[0][0]

def calculateEuclideanDistance(inX, dataSet) :
diffMatrix = getDifferences(inX, dataSet)
squareDiffMatrix = getSquareOfDifferences(diffMatrix)
sqDistance = sumSqDifferences(squareDiffMatrix)
return getSquareRootOfSum(sqDistance)

def getDifferences(inX, dataSet) :
dataSetSize = dataSet.shape[0]
diffMat = tile(inX, (dataSetSize, 1)) - dataSet

return diffMat

def getSquareOfDifferences(sqDiffMat) :
return sqDiffMat**2

def sumSqDifferences(sqDiffMat) :
return sqDiffMat.sum(axis=1)

def getSquareRootOfSum(sqDistance) :
return sqDistance ** 0.5
foobard (2) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
I just wanted to say "Thanks!" to danielzingaro. As someone who is new to matrices and numpy, I was quite confused by the code in 2.1.2. Your detailed explanation was very helpful.

Thanks for taking the time to write it up!
foobard (2) [Avatar] Offline
Re: Laboriously Explaining the kNN Code
I'm new to matrices and was confused when I saw the code in 2.1.2. Danielzingaro's post helped quite a bit but after going through the code in more detail there's one aspect that I still don't understand.

The differences are calculated here:
diffMat = tile(inX, (dataSetSize,1))
and then squared here:
sqDiffMat = diffMat**2

The variable names lead me to believe that they were actual matrices but that wouldn't make sense because multiplying the matrix by itself (using matrix math) wouldn't result in the elements themselves being squared.

I eventually realized that "diffMat" and "sqDiffMat" are actually arrays. The operation results appear to be quite different. For example, given the following variables :
m = matrix([[1, 2], [3, 4]])
a = array([[1, 2], [3, 4]])

squaring them gives:
m**2 == matrix([[ 7, 10], [15, 22]])
a**2 == array([[ 1, 4], [ 9, 16]])

It seems very misleading to call them matrices but then use them as arrays. As I said, I'm very new to matrices so maybe I'm just confused...