The book looks amazing and I cannot wait to get my copy.

I have a suggestion, and I hope you can consider it before the April release date.

I hope you can cover how to build a neural spell checker and grammar corrector. Many people are interested in this topic, and as far as I know, no available book tackles it. I think it would be a great asset for the book.

Some use OpenNMT or ModernMT, tools built for neural machine translation, to do this efficiently. Others use Keras with a TensorFlow backend. I am sure the authors are more knowledgeable and can come up with a better approach to this topic.

Thanks,

Mohamed


I'm at section 2.2 (Building your vocabulary with a tokenizer), and though I'm certain I'm learning a lot and enjoying it, I find parts of it needlessly difficult and somewhat meandering (when such difficult new material is being introduced, losing the through-line with arcane, confusing anecdotes isn't very helpful). I'm also left wondering about the intended audience of the book, as it seems to assume familiarity with some foundational concepts in NLP and with the Python data science packages. (P.S. Finishing the section in Appendix B on np.arrays, pd.Series, and DataFrames would help a lot.)

Anyway, I've stopped reading at the dot-product chapter because I'm feeling a little overwhelmed. To be able to resume this NLP book, I've been revisiting Jake VanderPlas's book on data science in Python, and I'm generally finding his accessible, direct tone, eager to explain and unpack, a whole lot more approachable than what I'm reading here. But maybe that's by design: J.V.'s book is more an introduction and a handbook, while this is more of a deep dive?

Still, though, I applaud the authors of this NLP book, because while it's certainly not easy, even for someone who's already pretty comfortable with Python, I'm confident that it contains nearly everything I'd need to know about NLP in Python, or as much as could reasonably be expected in a sub-800-page book.

This question is resolved. Apologies for my error.


The author assumes that the machine cannot semantically contextualize but can only compute. That is not sufficient for reading comprehension, for which machines do need to understand semantics and context. You cannot do this with machine learning alone. But you can certainly build a knowledge graph to represent and replicate the associative memory by which humans build relationships and patterns between existing and new knowledge. This is how humans generalize over everyday experiences and assimilate new information through transfer learning. One must look at semantics as well as machine learning to process natural language. In this manner, machines can indeed figure things out and compute, which enables artificial general intelligence. The mind processes information by means of perception, language semantics, and associative memory.

I disagree with the tip's assumption about 'compute' and 'figure out' for machines on page 117. Machines can indeed compute and figure things out.

Could you please help me?

[code]
>>> from nlpia.data.loaders import kite_text
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\IntelPython3\lib\logging\config.py in configure(self)
    557             try:
--> 558                 handler = self.configure_handler(handlers[name])
    559                 handler.name = name

C:\IntelPython3\lib\logging\config.py in configure_handler(self, config)
    730         try:
--> 731             result = factory(**kwargs)
    732         except TypeError as te:

TypeError: __init__() missing 1 required positional argument: 'appname'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-5-8e8dc0cc6f2d> in <module>()
----> 1 from nlpia.data.loaders import kite_text

~\downloads\nlpia\src\nlpia\data\__init__.py in <module>()
----> 1 from nlpia.loaders import *  # noqa

~\downloads\nlpia\src\nlpia\loaders.py in <module>()
     68 from pugnlp.futil import mkdir_p, path_status, find_files
     69 from pugnlp.util import clean_columns
---> 70 from nlpia.constants import DATA_PATH, BIGDATA_PATH
     71 from nlpia.constants import DATA_INFO_FILE, BIGDATA_INFO_FILE, BIGDATA_INFO_LATEST
     72

~\downloads\nlpia\src\nlpia\constants.py in <module>()
     89
     90
---> 91 logging.config.dictConfig(LOGGING_CONFIG)
     92 logger = logging.getLogger(__name__)
     93

C:\IntelPython3\lib\logging\config.py in dictConfig(config)
    793 def dictConfig(config):
    794     """Configure logging using a dictionary."""
--> 795     dictConfigClass(config).configure()
    796
    797

C:\IntelPython3\lib\logging\config.py in configure(self)
    564                 else:
    565                     raise ValueError('Unable to configure handler '
--> 566                                      '%r: %s' % (name, e))
    567
    568         # Now do any that were deferred

ValueError: Unable to configure handler 'logging.handlers.NTEventLogHandler': __init__() missing 1 required positional argument: 'appname'
[/code]

[code]
>>> import numpy as np
>>> FLOAT_TYPES = [t for t in set(np.typeDict.values()) if t.__name__.startswith('float')]
>>> FLOAT_TYPES
[numpy.float64, numpy.float128, numpy.float16, numpy.float32]
>>> FLOAT_TYPE_NAMES = [t.__name__ for t in FLOAT_TYPES]
>>> FLOAT_TYPE_NAMES
['float64', 'float128', 'float16', 'float32']
[/code]

I added your Issue to the github issue tracker at [url]https://github.com/totalgood/nlpia/issues/18[/url]. If you want to help us out and get full credit for your code, you can submit a PR at [url]https://github.com/totalgood/nlpia/pulls[/url].

Regarding "industrial strength" and memory-efficient SVD, we'll definitely attempt to address that in a second edition of the book. In the meantime you may have to do some more research yourself. The `gensim` source code can be a great source of ideas and patterns for "out-of-core" processing, incremental optimizers/solvers, and sparse matrix multiplication. However gensim implemented its algorithms is probably a good way to go: it minimizes the memory footprint on my machine whenever I'm careful not to instantiate my entire corpus or bags of words in memory, but rather generate them as needed.
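That "generate them as needed" pattern can be sketched with a plain Python generator. This is a toy illustration (whitespace tokenizer, in-memory corpus), not gensim's actual code; in practice `corpus` would be a lazy iterator over files or database rows:

```python
from collections import Counter

def stream_bows(documents):
    """Yield one bag of words at a time instead of building the whole corpus in memory."""
    for doc in documents:
        yield Counter(doc.lower().split())  # crude whitespace tokenizer, for illustration only

# Toy corpus; real code would stream documents lazily from disk.
corpus = ["The kite flew high", "The wind died and the kite fell"]

vocab = Counter()
for bow in stream_bows(corpus):  # only one document's bag of words is in memory at a time
    vocab.update(bow)

print(vocab.most_common(2))  # [('the', 3), ('kite', 2)]
```

The key point is that nothing ever holds more than one document's counts plus the running vocabulary, which is the memory discipline gensim's streaming corpora enforce.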

Regarding your three "researching" questions, please post your discoveries here whenever you have an update on your research. Here are my thoughts:

1. `numpy` can handle sparse matrix multiplication just fine. SciPy just routes its linear algebra operators there:

[code]
>>> import numpy as np
>>> import scipy
>>> import scipy.sparse
>>> id(scipy.dot) == id(np.dot)
True
>>> A = scipy.sparse.csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
>>> v = scipy.sparse.csr_matrix([[1], [0], [-1]])
>>> A.dot(v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> scipy.dot(A, v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> np.dot(A, v)
<3x1 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
[/code]

2. TruncatedSVD and its explained_variance_ attribute are implemented in `sklearn.decomposition.TruncatedSVD`. SciPy's equivalent is `scipy.sparse.linalg.svds`. Gensim's is `gensim.models.lsimodel.stochastic_svd`. You can calculate explained variance yourself from the gensim SVD results (or any SVD) with:

[code]
>>> import numpy as np
>>> import sklearn.decomposition
>>> svd = sklearn.decomposition.TruncatedSVD(2)
>>> A = np.random.randn(1000, 100)
>>> svd.fit(A)
TruncatedSVD(algorithm='randomized', n_components=2, n_iter=5,
       random_state=None, tol=0.0)
>>> A_2D = svd.transform(A)
>>> np.var(A_2D, axis=0)
array([1.7039018 , 1.65362273])
>>> var = A.shape[1] * np.var(A_2D, axis=0) / np.var(A, axis=0).sum()
>>> var
array([1.6955373 , 1.64550506])
>>> np.abs(var - svd.explained_variance_).round(2)
array([0., 0.])
[/code]

3. I think you just need to correct the math you're using to compute the total variance. You should be squaring, then summing, then dividing by `shape[1] - 1`, not `shape[1]`. But `np.var` will do all that efficiently on a sparse matrix for you. However, it won't do it incrementally (also called "out-of-core processing"). So if you have a severe RAM restriction, you should research techniques for incremental computation of things like sum() and var(), or just copy the implementations in the gensim source code.
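For the incremental var() computation mentioned above, Welford's online algorithm is the standard one-pass, constant-memory trick. A minimal sketch (this is the textbook algorithm, not gensim's actual implementation):

```python
def online_variance(stream):
    """Welford's algorithm: one pass over the data, O(1) memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n          # running mean
        m2 += delta * (x - mean)   # running sum of squared deviations
    return m2 / (n - 1)            # sample variance (the n-1 normalization)

# Matches the two-pass formula without ever holding the data in memory
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(online_variance(iter(data)))  # 4.571428...
```

Because it consumes any iterator, you can feed it values streamed from disk one at a time, which is exactly the RAM-restricted scenario you described.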

In general I'd recommend using a computational graph framework like Spark, TensorFlow, Keras, or Hadoop whenever you can't get things done on a server with a lot of RAM. This is the "industrial strength" out-of-core processing that you are searching for.


Here's what we have for the movie sentiment example:

[code]
>>> from sklearn.naive_bayes import MultinomialNB
>>> nb = MultinomialNB()
>>> nb = nb.fit(df_bows, movies.sentiment > 0)  # <1>
>>> movies['predicted_sentiment'] = nb.predict(df_bows) * 8 - 4  # <2>
>>> movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
>>> movies.error.mean().round(1)
2.4  # <3>
>>> movies['sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'.split()].head(8)
    sentiment  predicted_sentiment  sentiment_ispositive  predicted_ispositive
id
1    2.266667                    4                     1                     1
2    3.533333                    4                     1                     1
3   -0.600000                   -4                     0                     0
4    1.466667                    4                     1                     1
5    1.733333                    4                     1                     1
6    2.533333                    4                     1                     1
7    2.466667                    4                     1                     1
8    1.266667                   -4                     1                     0
>>> hist -o -p
>>> (movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)
0.9344648750589345  # <4>
<1> Naive Bayes models are classifiers, so you need to convert your output variable (a sentiment float) to a discrete label (integer, string, or bool).
<2> Convert your discrete classification variable back to a real value between -4 and +4 so you can compare it to the "ground truth" sentiment.
<3> The mean absolute error (MAE) between your predictions and the ground truth sentiment is 2.4.
<4> You got the "thumbs up" rating correct 93% of the time.
[/code]

And here's the revised product review snippet:

[code]
>>> from collections import Counter
>>> import pandas as pd
>>> from nltk.tokenize.casual import casual_tokenize
>>> from nlpia.data.loaders import get_data
>>> products = get_data('hutto_products')
>>> bags_of_words = []
>>> for text in products.text:
...     bags_of_words.append(Counter(casual_tokenize(text)))
>>> df_product_bows = pd.DataFrame.from_records(bags_of_words)
>>> df_product_bows = df_product_bows.fillna(0).astype(int)
>>> df_all_bows = df_bows.append(df_product_bows)
>>> df_all_bows.columns  # <1>
Index(['!', '"', '#', '#38', '$', '%', '&', ''', '(', '(8',
       ...
       'zoomed', 'zooming', 'zooms', 'zx', 'zzzzzzzzz', '~', '½', 'élan', '–', '’'],
      dtype='object', length=23302)
>>> df_product_bows = df_all_bows.iloc[len(movies):][df_bows.columns]  # <2>
>>> df_product_bows.shape
(3546, 20756)
>>> df_bows.shape  # <3>
(10605, 20756)
>>> products['sentiment_ispositive'] = (products.sentiment > 0).astype(int)
>>> products['predicted_ispositive'] = nb.predict(df_product_bows.values).astype(int)
>>> products.head()
    id  sentiment                                               text  sentiment_ispositive
0  1_1      -0.90  troubleshooting ad-2500 and ad-2600 no picture...                     0
1  1_2      -0.15  repost from january 13, 2004 with a better fit...                     0
2  1_3      -0.20  does your apex dvd player only play dvd audio ...                     0
3  1_4      -0.10  or does it play audio and video but scrolling ...                     0
4  1_5      -0.50  before you try to return the player or waste h...                     0
>>> (products.predicted_ispositive == products.sentiment_ispositive).sum() / len(products)
0.5572476029328821
<1> Your new bags of words have some tokens that weren't in the original bags of words DataFrame (23302 columns now instead of 20756 before).
<2> You need to make sure your new product DataFrame of bags of words has the exact same columns (tokens) in the exact same order as the original one used to train your Naive Bayes model.
<3> This is the original movie bags of words.
[/code]
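As an aside, the column alignment described in note <2> (same tokens, same order, zeros for tokens the model never saw) can also be done in one step with pandas `reindex`. A toy sketch with made-up frames, as an alternative to the append-then-slice approach above:

```python
import pandas as pd

# Training-time bags of words define the vocabulary and column order
train = pd.DataFrame([[1, 0], [2, 1]], columns=['kite', 'wind'])
# New data has a new token ('zoom') and a different column order
new = pd.DataFrame([[3, 5], [0, 1]], columns=['wind', 'zoom'])

# Align to the training columns: unseen tokens are dropped,
# missing tokens are zero-filled, and the order matches exactly.
aligned = new.reindex(columns=train.columns, fill_value=0)
print(list(aligned.columns))  # ['kite', 'wind']
```

This avoids concatenating the two DataFrames just to re-slice them, which matters when the combined bag-of-words table is large.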

It's sad to see language changing so fast, paying less and less attention to the "rule book".

* How to configure and customize open source chatbots like Chatterbot, Will, and AIMLBot

* How to generate completely new text using deep learning.

Some of the sentences in the book were composed by this generative model, but we couldn't train it until we had most of the manuscript completed. Do you think a sentiment analyzer will be able to tag the sentences of the book according to authorship: "Hobson", "Cole", "Hannes" or "Bot"?

We're working with other readers to build a chatbot that integrates all 4 of the chatbot approaches explained in chapter 12 and in the block diagram that decorates the inside cover of the printed book. I don't know of any other open source package or online tutorial that even attempts to use all 4 chatbot approaches in a single package. I hope you'll consider contributing by submitting bug reports, feature requests, or pull requests at [url]github.com/totalgood/nlpia[/url].


[code]
>>> pca_copy.explained_variance_ratio_.cumsum()
array([0.01209829, 0.02122578, 0.03023511, 0.03861407, 0.04582388,
       0.05276413, 0.05922511, 0.06471775, 0.07010423, 0.07530074,
       0.08020538, 0.08460077, 0.08884538, 0.0930544 , 0.09717685,
       0.10115221])
>>> pca_copy.explained_variance_ratio_.cumsum()[-1]
0.10115221
>>> pca_copy.explained_variance_ratio_.sum()
0.10115221
[/code]

That `0.1` total "explained variance" for the 16 best features (components) created by PCA is still much lower than the `1.0` you were hoping for. The reason is that you've summed the explained variance contributed by only 16 of the 9232 possible components (the number of words in your vocabulary). Some information is lost when you create linear combinations of the 9232 features to produce 16 components/features. Unless your 9232 features could be divided into 16 groups of perfectly correlated features (exactly equal to each other, so that every time one word is used in a document another particular word is used too), your sum will never equal 1.0.
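Here's a toy numpy sketch of that point, with random data standing in for your TF-IDF matrix. The explained variance ratios come straight from the squared singular values of the centered data, and they only sum to 1.0 when you keep every component:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 9)          # stand-in for your documents x vocabulary matrix
Xc = X - X.mean(axis=0)        # PCA centers the data first
s = np.linalg.svd(Xc, compute_uv=False)  # singular values of the centered data

ratio = s**2 / (s**2).sum()    # explained variance ratio, one value per component
print(ratio[:3].sum())         # a few components explain only part of the variance
print(ratio.sum())             # all components together explain everything (1.0)
```

With only 16 of 9232 components kept, you're summing a short prefix of a long `ratio` vector, so 0.1 is a perfectly plausible total.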

Before I realized what was going on, I compared the explained variance to the imbalance in the training set and to the accuracy achievable with a single-feature model. The variance explained by your PCA components is the variance in the original 9232 TF-IDF vector components, not the variance in the target variable, so none of this is really relevant to your question:

[code]
>>> sms.spam.mean()
0.1319
>>> pca_copy.explained_variance_ratio_.sum() / sms.spam.mean()
0.7669
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA()
>>> lda = lda.fit(tfidf_docs, sms.spam)
>>> lda.score(tfidf_docs, sms.spam)
1.0
>>> lda_pca = LDA()
>>> lda_pca = lda_pca.fit(pca16_topic_vectors, sms.spam)
>>> lda_pca.score(pca16_topic_vectors, sms.spam)
0.9568
>>> lda_pca1 = lda_pca.fit(pca16_topic_vectors[:, 0].reshape((-1, 1)), sms.spam)
>>> lda_pca1.score(pca16_topic_vectors[:, 0].reshape((-1, 1)), sms.spam)
0.8681  # the best you can do if you choose only the strongest component for your predictions
[/code]

86.8% accuracy is the best you could do if you had only one PCA component as your feature and you scaled/thresholded that feature optimally to maximize agreement with your target variable.
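That "threshold a single feature optimally" idea can be sketched by brute force. This uses random toy data with roughly the same 13% class imbalance, not your actual SMS features:

```python
import numpy as np

rng = np.random.RandomState(0)
y = (rng.rand(1000) < 0.13).astype(int)  # ~13% positive labels, like the spam set
x = y + 0.8 * rng.randn(1000)            # one noisy feature correlated with the labels

# Try every observed value as a decision threshold and keep the most accurate one
thresholds = np.sort(x)
best_acc = max(((x > t) == y).mean() for t in thresholds)
print(best_acc)
```

Note that with this much imbalance, always predicting the majority class already scores about 87%, which is why a single-feature accuracy in the high 80s is less impressive than it sounds.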


Thanks for the feedback. The LSTM chapter was certainly a fun one to write.

I certainly agree with you and the paper that CNNs are a good entry point into NLP, if for no other reason than that the difference in training and inference speed between CNNs and RNNs makes iteration so much faster.

We do have a chapter devoted to seq2seq and attention networks that should be available in MEAP now. Hopefully that will give some further insight into dealing with language as sequences. Unfortunately, we won't be able to cover temporal convolutional networks in this edition of the book, but there's definitely interesting work being done there. Interesting and exciting, since the greatest struggle with LSTMs and GRUs is their computational cost.

C

Thanks for the catch. Sorry for the long delay, but Chapter 5 has been heavily reworked to hopefully provide a more detailed introduction to neural networks. We'll double-check that the omission didn't follow us forward.

C

My email is huasheng0822@gmail.com. Could you email me, please? I have a few questions to ask you.

One suggestion I would make is to define "overfitting" when you first mention the concept. I came to the book with a linguistics background (rather than a data science/ML one) and had no idea what it meant.

Thanks, and keep up the great work. I'm loving the book so far!

However, many NLP practitioners use the term "term frequency" as shorthand for "normalized term frequency." Unfortunately, "normalized" can mean a lot of different things, so it wasn't good practice for us to be so casual with the term "term frequency." We'll fix that in Chapter 3 to make it more precise and clear, and we'll use the word "frequency" to mean a raw integer count, the same way statisticians use that word.

There are many possible normalizations (or weightings) of term frequency used in NLP. A term's count is sometimes divided by the total number of tokens in the document, so that it's normalized by document length. But most Python packages save this for last, normalizing by the 2-norm (length) of the TF-IDF vector instead, which produces a similar weighted frequency value to normalizing for document length up front. For TF-IDF, term frequency is also normalized by the prevalence of that token (term or n-gram) in the other documents of the corpus: the number of different documents it appears in. And the list of weighted (or normalized) term frequencies for a document (its TF-IDF vector) is usually normalized once more by the 2-norm ("length") of that vector. This is what "vector normalization" means in most machine learning circles. In the end, for each document you get a normalized TF-IDF vector of "weighted term frequency" values between zero and one, and all your documents' vectors have the same magnitude (2-norm) of 1. Most TF-IDF implementations (including [tt][color=darkred]gensim[/color][/tt] and [tt][color=darkred]scikit-learn[/color][/tt]) do it this way.
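For concreteness, here's a minimal pure-Python sketch of that pipeline: length-normalized term counts, down-weighted by document frequency (one common smoothed IDF variant; gensim and scikit-learn differ in the exact smoothing), then 2-norm vector normalization at the end:

```python
import math
from collections import Counter

docs = ["the faster harry got to the store",
        "harry is hairy and faster than jill",
        "jill is not as hairy as harry"]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

def tfidf_vector(doc):
    tf = Counter(doc)
    vec = []
    for term in vocab:
        tf_norm = tf[term] / len(doc)            # count normalized by document length
        df = sum(term in d for d in tokenized)   # how many documents contain the term
        vec.append(tf_norm * math.log((1 + len(tokenized)) / (1 + df)))
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]               # final 2-norm "vector normalization"

vec = tfidf_vector(tokenized[0])
print(round(sum(v * v for v in vec), 6))  # 1.0 -- every document's vector has unit length
```

Note how "harry", which appears in every document, gets a weight of zero here, while rare words dominate the vector, which is exactly the behavior the IDF factor is meant to produce.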

The editors are reviewing our updates to Chapters 1-3, which includes this code correction. We'll also add more details about "term frequency." And we've already given the editors 4 new chapters that they are reviewing "early this week." They'll hopefully release them to you soon.

[b]BONUS[/b]: Attached is a diagram we are working on for the chapter on semantic analysis. It relies on "compressing" the TF-IDF vectors down to a more reasonable size, producing a much more meaningful vector, using a linear algebra algorithm called SVD and then truncating the result ([tt][color=darkred]TruncatedSVD[/color][/tt] in scikit-learn). I just thought I'd give you a head start on picking up this really cool technique. Obviously, it'll make more sense with the chapter text to go along with it. When looking at the matrices in the attached SVD equation, keep in mind that we are trying to transform our TF-IDF vectors (the [b]W[/b] matrix on the far left) into the topic vectors in the last matrix (the [b]U[/b] matrix on the far right). To do that we use the SVD algorithm to compute the word-topic matrix ([b]V[/b], just to the right of the equals sign). We truncate it by ignoring the topics with low magnitude in the singular values matrix ([b]S[/b], in the middle) to keep the dimensionality low. And then we just multiply this smaller V matrix (using the inner or dot product) by the TF-IDF vector for any document, and that gives us a topic vector!
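If you want to play with the idea before the chapter arrives, here's a toy numpy sketch of the truncate-and-project step (random data in place of real TF-IDF vectors, and numpy's `U, s, Vt` naming rather than the diagram's labels):

```python
import numpy as np

rng = np.random.RandomState(0)
tfidf = rng.rand(10, 50)          # 10 documents x 50 vocabulary terms

# Full SVD: tfidf = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)

k = 3                             # keep only the 3 strongest topics
word_topic = Vt[:k].T             # 50 x 3 word-topic matrix

# Project any document's TF-IDF vector into topic space with one dot product
topic_vector = tfidf[0].dot(word_topic)
print(topic_vector.shape)         # (3,)

# Sanity check: this equals U @ diag(s) truncated to k components
assert np.allclose(topic_vector, (U[:, :k] * s[:k])[0])
```

Once `word_topic` is computed from a training corpus, projecting a brand-new document is just that one dot product, which is what makes the technique practical.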

Perhaps we should wait until they are used in an example pipeline to introduce one-hot vectors.
