JGF1 (322)
#1
Is there a way you could use Tika to extract meta-data based upon Microformats in a web page?

For example, take a web page with div tags that have an id or class attribute. The id or class value could serve as the name of the metadata, and the content inside the div, li, or ul element could be extracted as the content associated with that metadata.

It would be more useful if you could give a real-world example of data extraction,
for example LinkedIn data extraction or a similar site. It would frame the whole chapter if you could base an example on a useful real-world use case rather than something hypothetical.
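
To make the idea concrete, here is a minimal sketch that uses only the JDK's XML parser, not Tika. It assumes the page is well-formed XHTML (real-world HTML usually needs a tolerant parser first), and the class name ClassContentExtractor and the sample markup are made up for illustration. It treats each element's class value (or id, if there is no class) as the metadata name and the element's text as the value.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or the book's source code.
class ClassContentExtractor {

    /** Maps each element's class (or, failing that, id) value to its trimmed text content. */
    static Map<String, String> extract(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList all = doc.getElementsByTagName("*"); // every element, document order
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                String key = el.getAttribute("class");    // "" when the attribute is absent
                if (key.isEmpty()) {
                    key = el.getAttribute("id");
                }
                if (!key.isEmpty()) {
                    // Keep the first occurrence of each name.
                    fields.putIfAbsent(key, el.getTextContent().trim());
                }
            }
            return fields;
        } catch (Exception e) {
            // Assumption: input that is not well-formed XML is simply rejected.
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<div id=\"profile\"><span class=\"fn\">Jane Doe</span>"
                    + "<span class=\"org\">Example Corp</span></div>";
        extract(page).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The resulting name/value pairs could then be added to a Lucene document as fields, in the same way the chapter's indexer adds the metadata it gets back from Tika.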
mike.mccandless (221)
#2
Re: Chapter 7. Reviewing Chapter on Tika.
I don't think Tika is a good fit for microformats in a web page (this is why we don't have such examples), but I could be wrong. Can you ask this question on Tika's users list?
JGF1 (322)
#3
Re: Chapter 7. Reviewing Chapter on Tika.
Hi Mike.
Have just read the chapter. Enjoyed what I read.
I guess Aperture may be more suited to this type of use case.
Hadn't heard about it before. Will have a look at it.
Good luck with the book sales.

PS: Surprised at Tika using ICU4J. I would have thought the Google or Yahoo translate APIs would have been a higher-level abstraction capable of handling this kind of stuff.
otis (156)
#4
Re: Chapter 7. Reviewing Chapter on Tika.
I think Tika may actually be a good fit for that, but yes, tika-user@lucene is the place to ask.

I think ICU4J is there for lang ID, not translation, but I'm not 100% certain.

Otis
JGF1 (322)
#5
Re: Chapter 7. Reviewing Chapter on Tika.
I think both these (Yahoo/Google Translate) have that (language detection) built in. But I guess on reflection you would want to keep the size of the ancillary JARs down.
mancocapac (15)
#6
Re: Chapter 7. Reviewing Chapter on Tika.
FYI:
The latest version of the book PDF refers to version 0.5 in several places,
but the source code lib is still version 0.4.
otis (156)
#7
Re: Chapter 7. Reviewing Chapter on Tika.
Thanks. Actually, I believe Tika 0.6 is around the corner, so maybe that's what we should get and change version references to (eh).
mancocapac (15)
#8
Re: Chapter 7. Reviewing Chapter on Tika.
I am running the TikaIndexer.java code and I am having a problem
trying to index the book PDF (Lucene in Action).
I buy all my Manning books in PDF format and I am hoping I can use Lucene to search
them.

When I look at the index via Luke, the content type comes up as <application/rdf+xml>, whereas the content type for the
example pdf (file1.pdf) and some other pdf(s) comes up as <application/pdf>.

From Luke (0.9.9)
stored/uncompressed,omitNorms<Content-Type:application/rdf+xml>
stored/uncompressed,indexed<filename:C:WorkspacesHSiA-1.0.0lia2eTikaDataLuceneinActionSecondEdition.pdf>
stored/uncompressed,omitNorms<resourceName:C:WorkspacesHSiA-1.0.0lia2eTikaDataLuceneinActionSecondEdition.pdf>


stored/uncompressed,omitNorms<Author:John Griffin & Emmanuel Bernard>
stored/uncompressed,omitNorms<Content-Type:application/pdf>
stored/uncompressed,omitNorms<Keywords:LucenePDFDocument; Keanu Reeves; Alfonso Arau>
stored/uncompressed,omitNorms<Last-Modified:Sun Jan 20 04:23:02 PST 2008>
stored/uncompressed,omitNorms<created:Sun Jan 20 03:05:33 PST 2008>
stored/uncompressed,indexed<filename:C:WorkspacesHSiA-1.0.0lia2eTikaDatafile1.pdf>
stored/uncompressed,omitNorms<producer:Acrobat Web Capture 8.0>
stored/uncompressed,omitNorms<resourceName:C:WorkspacesHSiA-1.0.0lia2eTikaDatafile1.pdf>
stored/uncompressed,omitNorms<subject:Testing PDFBox's LucenePDFDocument.class>
stored/uncompressed,omitNorms<title:file1>

I tried setting the content type as suggested in the code, but got the same
result.
From TikaIndexer.java:

// If you know content type (eg because this document
// was loaded from an HTTP server), then you should also
// set Metadata.CONTENT_TYPE
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");

// If you know content encoding (eg because this
// document was loaded from an HTTP server), then you
// should also set Metadata.CONTENT_ENCODING

Do I need to also set the content encoding? If so what do I set it to?

Thanks for your help
mike.mccandless (221)
#9
Re: Chapter 7. Reviewing Chapter on Tika.
That is strange and spooky, that Tika can't recognize that the LIA2E PDF is in fact a PDF doc.

I have an older LIA PDF doc which runs fine, but unfortunately I don't have the latest MEAP PDF available...

This sounds like a Tika-specific problem.

If you run java -jar lib/tika-app-0.5.jar book.pdf, does that also fail to extract text? (Pass --help to see its options, e.g. pass -m to get only the metadata.)
mancocapac (15)
#10
Re: Chapter 7. Reviewing Chapter on Tika.
I ran java -jar lib/tika-app-0.5.jar book.pdf as you suggested and got good results.
So I'm half way there, but still I want
to be able to index my pdf(s) after highlighting/bookmarking them.

I use Adobe Acrobat Pro 8 to open/read my pdfs.
This gives me the ability to Highlight text (not to be confused with lucene highlight) and
create bookmarks and add notes. In this way I can treat my PDFs just like my books.

When I open the LiASe pdfs with VIM, both the original and my
highlighted/bookmarked/noted versions have an xml-rdf element section in them,
so I assume xml-rdf is a standard component of PDF files. The problem is that indexing
works fine on the original and not on my highlighted/bookmarked version.
They both work fine in Acrobat.

If I go back to the original pdf, add a few notes/highlights/bookmarks to it, and
save it, the indexing works fine. I tried this with several other pdfs, including the
sample code pdfs from Hibernate Search Code.

However, I found several pdfs that I have
highlighted/bookmarked that exhibit the bad behaviors, including my
Hibernate Search pdf. Perhaps this problem occurs only after a lot of highlighting/bookmarking.

If I open the files in my editor (VIM), I do see some CR+LF strangeness.
Both the good and "bad" files have

0000000: 2550 4446 2d31 2e36 0d25 e2e3 cfd3 0d0a %PDF-1.6.%......

for the first line, but the "bad" file doesn't have another $0d0a until

0001210: 6574 2065 6e64 3d22 7722 3f3e 0d0a 656e et end="w"?>..en

Up until that point I do see some 0d (CR) but no CR+LF. It is probably the case that
something is getting confused because it sees this verrrry long line. Why the file
stops using CR+LF I don't know. I assume this confusion then leads Tika to guess
that this is an xml-rdf file, probably because it gets lost until it starts seeing valid XML, the
rdf.
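
Since both the good and the "bad" files start with the same %PDF-1.6 header bytes shown in the hex dump above, a simple check on the first few bytes is one way to tell whether a file that was detected as something else is really a PDF. The following is only a sketch using the plain JDK; the class PdfMagicCheck and its methods are hypothetical names, and this is not a description of how Tika itself detects types.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or the book's source code.
class PdfMagicCheck {

    // The bytes 25 50 44 46 2d from the hex dump, i.e. "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the given bytes begin with the "%PDF-" signature. */
    static boolean looksLikePdf(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads up to n bytes from the start of the file. */
    static byte[] readHead(String path, int n) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] buf = new byte[n];
            int off = 0;
            while (off < n) {
                int r = in.read(buf, off, n - off);
                if (r < 0) {
                    break; // file is shorter than n bytes
                }
                off += r;
            }
            return Arrays.copyOf(buf, off);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            boolean pdf = looksLikePdf(readHead(path, PDF_MAGIC.length));
            System.out.println(path + ": " + (pdf ? "starts with %PDF-" : "no PDF header"));
        }
    }
}
```

A file that passes this check but still comes back from auto-detection as application/rdf+xml would be a good candidate to attach to the Tika bug report.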

I see the following bug in Tika: Mime type application/rdf+xml not correctly detected
[#TIKA-309], but it says it is fixed in 0.5, which I am using. It's a bit questionable
though, as I can see by reading the fix comments that it was supposed to be
fixed a couple of times; maybe it's still broken.

I will log this issue with Apache Tika. I would still like to find a workaround in the
meantime, as I have highlighted/bookmarked many of my books.

Is there a way to force Tika to see these as application/pdf?
Is there some converter program I should try? I saw something called Jena.


Thanks
mike.mccandless (221)
#11
Re: Chapter 7. Reviewing Chapter on Tika.
Wow, good sleuthing! This does sound like a Tika issue. It sounds like the AutoDetectParser is being tricked by the XML into thinking it's xml-rdf.

I think you can use the PDFParser instead of AutoDetectParser? Would that work?
mancocapac (15)
#12
Re: Chapter 7. Reviewing Chapter on Tika.
I entered a bug with Tika. I will try PDFParser, get back to ya.

Thanks
mancocapac (15)
#13
Re: Chapter 7. Reviewing Chapter on Tika.
I received an email stating that this bug has been fixed in 0.6. I have not yet gotten a copy
of 0.6 to check it out.