chimaelg (2)
#1
Hi,
I could not reproduce the bag-of-words pandas DataFrame from the code on page 33. I am using Python 3.6.

Here is a modification of the code I came up with to get the expected dataframe:

sentences = """Construction was done mostly by local carpenters.
He moved into the South pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.
"""

corpus = {}
for i, sentence in enumerate(sentences.splitlines()):
    corpus['sent_%d' % (1+i)] = [(token.strip('.'), 1) for token in sentence.split()]

df = pd.DataFrame()
for s in corpus:
    df = df.append(pd.DataFrame([c[1] for c in corpus[s]],index=[c[0] for c in corpus[s]], columns=[s]))
df.fillna(0, inplace=True)
df = df.astype(int)


df = df.groupby(df.index).sum()
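A more compact way to build the same table is to let `collections.Counter` do the counting and `pd.DataFrame.from_dict` assemble the columns in one step. A sketch of that idea (my own variant, not the book's listing):

```python
import pandas as pd
from collections import Counter

sentences = """Construction was done mostly by local carpenters.
He moved into the South pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

# One Counter per sentence: token -> count, with trailing periods stripped
corpus = {
    'sent_%d' % (i + 1): Counter(token.strip('.') for token in sentence.split())
    for i, sentence in enumerate(sentences.splitlines())
}

# Columns are sentences, rows are tokens; tokens absent from a sentence become 0
df = pd.DataFrame.from_dict(corpus).fillna(0).astype(int)
```

This avoids the repeated `df.append` calls (each of which copies the whole frame) and the final `groupby`, since `Counter` already merges duplicate tokens.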


hobs (47)
#2
That's an interesting approach, using a pandas Series and groupby on its index to count words and construct a bag of words. I'll double-check that the latest code works in Python 3.6. I did notice that the stopword lists and tokenizer behavior changed in NLTK for Python 3.6.
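The Series/groupby-on-index idiom mentioned above can be sketched in a couple of lines (the sentence here is a hypothetical toy example, not from the book):

```python
import pandas as pd

# Build a Series of 1s indexed by token (toy sentence for illustration)
counts = pd.Series(1, index="the cat sat on the mat".split())

# Grouping by the index and summing merges repeated tokens into counts
bag_of_words = counts.groupby(counts.index).sum()
```

The result is a Series whose index is the vocabulary and whose values are the per-token counts, i.e. a one-sentence bag of words.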