farnetani
I'm building a new system that will hold several PDF files.

The fields I need in my index are:
1. File name
2. No. of pages
3. File date
4. File

When I run a search, I will type full names that appear in the indexed files, and I need the system to return:

- All the fields above (file name, file date) and, especially, the page number where the occurrence was found, the line number and, if possible, the exact position within the line where the occurrence starts.

I need this because, starting from the occurrence, I have to go back and find the words that identify topics and subtopics. Traversing the file backwards line by line lets me identify the first subtopic and capture it, and then do the same when I reach the topic. The subtopic and the topic will not always be on the same page as the occurrence.


Document: 00001.pdf

page 115
Line 1:
Line 2: TTTTT - TITLE (will be captured as the first title found when scanning backwards)
Line 3:
Line 4: YYYY - SECOND SUBTITLE (will be ignored because the system will already have captured the first subtitle on line 6)
Line 5:
Line 6: XXXX - FIRST SUBTITLE (will be captured as the first subtitle found when scanning backwards)
Line 7:
... page ...116
... page ...121
page 122
Line 1: line break
Line 2: Content pertaining to occurrence ...
Line 3: content from occurrence ...
Line 5: content from occurrence ...
Line 6: line break
Line 7:
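
To make the backward scan concrete, here is a minimal sketch of what I have in mind, in plain Python rather than Lucene code. The `pages` layout (page number mapped to a list of lines) and the `is_topic` / `is_subtopic` checks are assumptions made up for illustration; in the real files the headings would have to be recognized some other way.

```python
def find_headings(pages, hit_page, hit_line, is_topic, is_subtopic):
    """Scan backwards from an occurrence to find its subtopic and topic.

    pages: dict mapping page number -> list of line strings (line 1 first).
    hit_page, hit_line: 1-based location of the occurrence.
    Returns (topic, subtopic), each a (page, line, text) tuple or None.
    """
    # Flatten every page into (page, line, text) records in reading order.
    records = [(p, n, text)
               for p in sorted(pages)
               for n, text in enumerate(pages[p], start=1)]
    # Position of the occurrence inside the flattened list.
    start = next(i for i, (p, n, _) in enumerate(records)
                 if (p, n) == (hit_page, hit_line))

    topic = subtopic = None
    for page, line, text in reversed(records[:start]):
        if subtopic is None and is_subtopic(text):
            subtopic = (page, line, text)   # first subtitle met going backwards
        elif is_topic(text):
            topic = (page, line, text)      # first title met going backwards
            break                           # nothing above the title matters
    return topic, subtopic


# Hypothetical data shaped like the example above.
pages = {
    115: ["", "TTTTT - TITLE", "", "YYYY - SECOND SUBTITLE", "",
          "XXXX - FIRST SUBTITLE", ""],
    122: ["", "content ... JOHN MCLAEN"],
}
topic, subtopic = find_headings(
    pages, 122, 2,
    is_topic=lambda t: t.startswith("TTTTT"),
    is_subtopic=lambda t: t.startswith(("XXXX", "YYYY")),
)
print(topic, subtopic)
```

The scan stops at the first title it meets, so any subtitle above that title is never considered.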

The big problem is that I do not know how to obtain the page number and the line number. Is there any functionality for this when I convert the PDF file to a String for the index, or will I have to store the file in the Lucene index line by line, somehow recording the page number each line belongs to?
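
To show what I mean by storing the file line by line, here is a minimal sketch in plain Python, not Lucene code. It assumes the text of each page has already been extracted separately (`page_texts` is a hypothetical list with one string per page), and the record layout is my own invention; each record is what I imagine becoming one indexed entry.

```python
def line_records(file_name, page_texts, first_page=1):
    """Turn per-page text into one record per line, tagged with page and line."""
    records = []
    for page_no, text in enumerate(page_texts, start=first_page):
        for line_no, line in enumerate(text.splitlines(), start=1):
            records.append({"file": file_name, "page": page_no,
                            "line": line_no, "text": line})
    return records


def find_occurrences(records, name):
    """Return (file, page, line, column) for every line containing name.

    column is the 1-based position where the name starts within the line.
    """
    hits = []
    for rec in records:
        col = rec["text"].find(name)
        if col != -1:
            hits.append((rec["file"], rec["page"], rec["line"], col + 1))
    return hits


# Hypothetical two-page file.
page_texts = ["first page\nnothing here", "\nADV: JOHN MCLAEN (OAB)\n"]
records = line_records("00001.pdf", page_texts)
hits = find_occurrences(records, "JOHN MCLAEN")
print(hits)
```

This only reports the first occurrence of the name on each line, which I think is enough for my case.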

In the example above, I need the system to return:

1 occurrence on page 122 with topic = TTTTT and subtopic = XXXX, together with all the content that precedes the name JOHN MCLAEN, back to the previous line break.

In other words, the result should be a string containing the block around the occurrence, starting at line 2 (after the line break) on page 122 and ending at line 5 (before the next line break).
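
A minimal sketch of that block extraction, in plain Python rather than Lucene code. It assumes the lines of the page are already available as a list, that a "line break" means an empty line, and that the block does not cross a page boundary; `extract_block` is a name I made up.

```python
def extract_block(lines, hit_line):
    """Return (first_line, last_line, text) of the block around hit_line.

    lines: list of line strings for one page (line 1 first).
    hit_line: 1-based line number of the occurrence.
    The block grows up and down until it meets an empty line.
    """
    start = end = hit_line - 1
    while start > 0 and lines[start - 1].strip():
        start -= 1                      # extend upwards to the line break
    while end + 1 < len(lines) and lines[end + 1].strip():
        end += 1                        # extend downwards to the line break
    text = " ".join(line.strip() for line in lines[start:end + 1])
    return start + 1, end + 1, text


# Hypothetical page 122: empty lines at 1, 6 and 7, occurrence on line 5.
page_122 = ["", "Processo ...", "content", "content",
            "ADV: JOHN MCLAEN (OAB 103587/SP)", "", ""]
first, last, block = extract_block(page_122, 5)
print(first, last, block)
```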

Example of result:

Page: 122 - File: 00001.pdf
Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o executado Banco SantanderS/A devidamente intimado através de seu advogado a efetuar o pagamento do valor de R$ 90.200,42 (noventa mil, duzentos reais e quarenta e dois centavos) no prazo de 15 dias, sob pena de multa de 10%, nos termos do artigo 475-J. - ADV: JOHN MCLAEN (OAB 103587/SP), MARISA REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA (OAB 275050/SP)
Is this possible?

Any help or hint will be of great value.

Thank you very much.