kps
#1
Are the following statements missing from the sample code?
movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
movies['predicted_ispositive'] = nb.predict(df_bows.values).astype(int)

Otherwise the following statement will fail, because those columns don't exist:
movies['sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'.split()].head(8)


These statements are provided in the sample on the following page (products).

Also, in the products sample on page 53 there seem to be mismatches in the naming of the various DataFrames.
hobs
#2
Good catch! We've fixed that snippet in the final version of the manuscript headed to the typesetters this week!

Here's what we have for the movie sentiment example:

>>> from sklearn.naive_bayes import MultinomialNB
>>> nb = MultinomialNB()
>>> nb = nb.fit(df_bows, movies.sentiment > 0)  # <1>
>>> movies['predicted_sentiment'] = nb.predict(df_bows) * 8 - 4  # <2>
>>> movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
>>> movies.error.mean().round(1)
2.4  # <3>
>>> movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
>>> movies['predicted_ispositive'] = nb.predict(df_bows.values).astype(int)
>>> movies['sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'.split()].head(8)
    sentiment  predicted_sentiment  sentiment_ispositive  predicted_ispositive
id                                                                            
1    2.266667                    4                     1                     1
2    3.533333                    4                     1                     1
3   -0.600000                   -4                     0                     0
4    1.466667                    4                     1                     1
5    1.733333                    4                     1                     1
6    2.533333                    4                     1                     1
7    2.466667                    4                     1                     1
8    1.266667                   -4                     1                     0
>>> (movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)
0.9344648750589345  # <4>

<1> Naive Bayes models are classifiers, so you need to convert your output variable (sentiment float) to a discrete label (integer, string, or bool).
<2> Convert your discrete classification variable back to a real value between -4 and +4 so you can compare it to the "ground truth" sentiment.
<3> The mean absolute error (MAE) between your predicted sentiment and the "ground truth" sentiment is 2.4.
<4> You got the "thumbs up" rating correct 93% of the time.
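
If you want to sanity-check <3> and <4>, the same two numbers fall out of scikit-learn's metric helpers. This is just a minimal sketch assuming the movies DataFrame from the snippet above, not part of the book listing:

>>> from sklearn.metrics import mean_absolute_error, accuracy_score
>>> round(mean_absolute_error(movies.sentiment, movies.predicted_sentiment), 1)  # same as <3>
2.4
>>> round(accuracy_score(movies.sentiment > 0, movies.predicted_sentiment > 0), 2)  # same as <4>
0.93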


And here's the revised product review snippet:


>>> products = get_data('hutto_products')
>>> bags_of_words = []
>>> for text in products.text:
...     bags_of_words.append(Counter(casual_tokenize(text)))
>>> df_product_bows = pd.DataFrame.from_records(bags_of_words)
>>> df_product_bows = df_product_bows.fillna(0).astype(int)
>>> df_all_bows = df_bows.append(df_product_bows)
>>> df_all_bows.columns  # <1>
# Index(['!', '"', '#', '#38', '$', '%', '&', ''', '(', '(8',
#        ...
#        'zoomed', 'zooming', 'zooms', 'zx', 'zzzzzzzzz', '~', '½', 'élan', '–', '’'],
#       dtype='object', length=23302)
>>> df_product_bows = df_all_bows.iloc[len(movies):][df_bows.columns]  # <2>
>>> df_product_bows.shape
(3546, 20756)
>>> df_bows.shape  # <3>
(10605, 20756)
>>> products['sentiment_ispositive'] = (products.sentiment > 0).astype(int)
>>> products['predicted_ispositive'] = nb.predict(df_product_bows.values).astype(int)
>>> products.head()
#     id  sentiment                                               text  sentiment_ispositive 
# 0  1_1      -0.90  troubleshooting ad-2500 and ad-2600 no picture...                     0 
# 1  1_2      -0.15  repost from january 13, 2004 with a better fit...                     0
# 2  1_3      -0.20  does your apex dvd player only play dvd audio ...                     0 
# 3  1_4      -0.10  or does it play audio and video but scrolling ...                     0 
# 4  1_5      -0.50  before you try to return the player or waste h...                     0
>>> (products.predicted_ispositive == products.sentiment_ispositive).sum() / len(products)
0.5572476029328821

<1> Your new bags of words have some tokens that weren't in the original bags of words DataFrame (23302 columns now instead of 20756 before).
<2> You need to make sure your new product DataFrame of bags of words has the exact same columns (tokens) in the exact same order as the original one used to train your Naive Bayes model.
<3> This is the original movie bags of words.
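
One more note on <2>: an equivalent way to line up the columns is to reindex the product bag-of-words DataFrame (the one built right after the tokenization loop) against the training vocabulary. That drops product-only tokens and fills tokens the products never use with integer zeros. Just a sketch, with df_product_bows_aligned as an illustrative name rather than anything in the book:

>>> df_product_bows_aligned = df_product_bows.reindex(
...     columns=df_bows.columns, fill_value=0)  # same columns, same order as the training data
>>> df_product_bows_aligned.shape
(3546, 20756)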