erinkay (2)
#1
Hello!

So I’ve been working through the chapter 6 examples with mixed results. In example 6.4 (Sample Vector creation from a Lucene index), I get the error message:
ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for desc-clustering
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for desc-clustering

This error mostly goes away when I add
--maxPercentErrorDocs 1
to the end of the command line. When I do, I get this warning instead:
WARN lucene.LuceneIterator: 80 documents do not have a term vector for description

When I get these warning messages, it does still write some vectors, and I am able to use them in the k-means example and get normal-looking output from ClusterDump.
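
In case the exact invocation matters, the vector-creation step I'm running looks roughly like this (the index path and output locations are placeholders for my local setup; desc-clustering and id are the field names from the example):

$MAHOUT_HOME/bin/mahout lucene.vector \
    --dir <path-to-index> \
    --output ./clustering/part-out.vec \
    --field desc-clustering \
    --idField id \
    --dictOut ./clustering/dictionary.txt \
    --norm 2 \
    --maxPercentErrorDocs 1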

But when I try to label those clusters, the log looks normal and the output is structured correctly, yet it doesn't contain any data. In addition, the clusters contain significantly fewer vectors than I would expect (the largest has 256).
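
For reference, the k-means and ClusterDump steps look something like this (the -k value, distance measure, and paths are just what I happen to be using, and clusters-2 stands for whatever the final iteration directory turns out to be):

$MAHOUT_HOME/bin/mahout kmeans \
    --input ./clustering/part-out.vec \
    --clusters ./clustering/out/clusters \
    --output ./clustering/out \
    --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
    --convergenceDelta 0.001 \
    --maxIter 50 \
    -k 10 \
    --clustering \
    --overwrite

$MAHOUT_HOME/bin/mahout clusterdump \
    --seqFileDir ./clustering/out/clusters-2 \
    --pointsDir ./clustering/out/clusteredPoints \
    --dictionary ./clustering/dictionary.txt \
    --substring 100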

In addition, when I try the topic modeling example, I get many warning messages which culminate in the following:

WARN lucene.LuceneIterator: 1000 documents do not have a term vector for desc-clustering
17/04/29 13:13:26 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for desc-clustering
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for desc-clustering
    at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:114)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterator.computeNext(LuceneIterator.java:127)

I’ve tried to resolve these issues on my own, but can’t quite figure them out. Does anyone have any resources or ideas about how to approach these problems?

Thanks and all the best!
erinkay (2)
#2
Would it be helpful if I included any more information? I'm still lost and haven't made any progress toward resolving this issue.