rrmadhav (4)
#1
My input content consists of XML files with well-known elements. I want to use Tika to create a set of objects from these XML files. Is there a way to provide some kind of mapping between the names of the XML elements and the fields of Java objects or the columns of a DB?

How would I handle this mapping with other file formats, e.g. PDF?
jukka.zitting (6)
#2
Re: Mapping fields for extraction
If you already know the structure of your content and have a very specific mapping in mind, then Tika is probably not the best tool for you. Tika makes very few assumptions about the incoming content structure and tries to map all content to a reasonably normalized XHTML presentation, which is probably not detailed enough for your needs.

Instead of Tika, you might want to look at directly using a normal XML parser (see http://en.wikipedia.org/wiki/Java_API_for_XML_Processing) or an XML binding mechanism like JAXB (see http://www.oracle.com/technetwork/articles/javase/index-140168.html).
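
As a rough illustration of the plain XML parser route, the sketch below uses the JAXP DOM API from the JDK to copy element text into named fields according to a mapping table. The class name `XmlFieldMapper`, the element names, and the field names are all hypothetical, and it only reads the first occurrence of each element, so treat it as a starting point rather than a complete binding layer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: maps XML element names to target field (or column) names.
class XmlFieldMapper {

    private final Map<String, String> elementToField;

    XmlFieldMapper(Map<String, String> elementToField) {
        this.elementToField = elementToField;
    }

    // Returns field name -> trimmed text of the first element with the mapped name.
    Map<String, String> extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : elementToField.entrySet()) {
            NodeList nodes = doc.getElementsByTagName(entry.getKey());
            if (nodes.getLength() > 0) {
                fields.put(entry.getValue(), nodes.item(0).getTextContent().trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Example mapping; both sides are made-up names for illustration.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("title", "bookTitle");
        mapping.put("author", "authorName");
        String xml = "<book><title>Tika in Action</title><author>Jukka</author></book>";
        System.out.println(new XmlFieldMapper(mapping).extract(xml));
    }
}
```

The resulting map could then be used to populate a Java object or build a database insert; JAXB would replace the explicit mapping table with annotations on the target class.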
rrmadhav (4)
#3
Re: Mapping fields for extraction
Thanks for your suggestion, Jukka.

Actually, I was wondering whether Tika can be used beyond its role of extracting text from a variety of file formats. Yes, JAXB could be used if the content were only in XML form. What I wanted to achieve was a single component:

extractor + rule-executor/mapping-configuration entity.

It looks like my application needs to interface with two libraries separately, unless the latter part can utilize Tika.
jukka.zitting (6)
#4
Re: Mapping fields for extraction
OK, I see what you're after.

There are two ways in which you could achieve this with Tika. You could make the relevant Tika parsers output XHTML in which the fields you're interested in are marked with something like special <span class="my-field-X">...</span> tags, which you can then pick out for use in your application. Another, perhaps easier, approach is to use the Metadata object for such fields, as then you have a simpler API for accessing the field values after parsing.

Note that both of these approaches will likely require you to customize some of the Tika parser classes so that they know where and how to look for the predefined content you're interested in. The upcoming chapter 11 of the book will talk more about how to best handle such customizations.
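
To make the first approach a bit more concrete, here is a minimal sketch of the consuming side: a SAX handler that collects the text of every span whose class starts with `my-field-`. The class name `FieldSpanHandler` and the `my-field-` prefix convention are assumptions for illustration; stock Tika parsers do not emit such spans, so this presumes parsers customized as described above. Since Tika reports its XHTML output as SAX events, a handler like this could be passed as the `ContentHandler` argument of `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)`. The `main` method feeds it a hand-written XHTML string through the JDK's SAX parser instead, so the example runs without Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text inside <span class="my-field-X"> as field X.
class FieldSpanHandler extends DefaultHandler {

    private static final String PREFIX = "my-field-"; // assumed class-name convention

    private final Map<String, String> fields = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentField; // name of the field being captured, or null
    private int depth;           // nesting depth of elements inside the captured span

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (currentField != null) {
            depth++; // nested element inside a field span; keep capturing its text
            return;
        }
        // localName is empty when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String cls = atts.getValue("class");
        if ("span".equals(name) && cls != null && cls.startsWith(PREFIX)) {
            currentField = cls.substring(PREFIX.length());
            depth = 0;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentField != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentField == null) {
            return;
        }
        if (depth > 0) {
            depth--; // closing a nested element, not the field span itself
            return;
        }
        fields.put(currentField, text.toString().trim());
        currentField = null;
    }

    Map<String, String> getFields() {
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML a customized Tika parser might produce.
        String xhtml = "<html><body><p>Invoice <span class=\"my-field-number\">A-17</span>"
                + " for <span class=\"my-field-customer\">Acme</span></p></body></html>";
        FieldSpanHandler handler = new FieldSpanHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.getFields());
    }
}
```

With the Metadata approach the handler would be unnecessary: the customized parser would set the values on the Metadata object, and the application would read them back by name after parsing.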