animusdk (2)
#1
Hello, new here, had a quick question.
I have PDF files, and I have OCR files, as well as a file that has the coordinates for each word in the PDF. My goal is to combine the text with the PDF so I can have searchable PDFs. Is there a part of the API or a chapter of the book I should look at more closely?
thanks
_d
blowagie (284)
#2
Re: Marry text to image PDF
Wow, that's a very interesting question.
So the PDF files contain nothing but (scanned?) images,
now you want to add the words over those images at the exact coordinates.
That's an interesting project and I'm curious to know what the file with
the words/coordinates looks like.

The first thing you'll have to take into account is that the origin of the
coordinate system in PDF is in the lower left corner. Most other
systems (SVG, Graphics2D,...) have their origin in the upper left corner.
So when you add data, you'll have to take care that you don't add
everything upside down.
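
To make the flip concrete, here is a minimal sketch in plain Java (no iText needed). The class and method names are made up for illustration, and it assumes the box edges have already been converted to points and that the page height is known.

```java
/** Hypothetical helper: converts a top-left-origin box to PDF user space. */
final class PdfCoordinates {
    private PdfCoordinates() {}

    /**
     * Flips the Y axis of a box measured from the upper left corner
     * (y grows downward) so that it is measured from the lower left
     * corner (y grows upward), as PDF expects.
     *
     * @return {llx, lly, urx, ury} in PDF coordinates
     */
    static float[] toPdfBox(float left, float top, float right, float bottom,
                            float pageHeight) {
        float llx = left;
        float lly = pageHeight - bottom; // bottom edge becomes the lower y
        float urx = right;
        float ury = pageHeight - top;    // top edge becomes the upper y
        return new float[] { llx, lly, urx, ury };
    }
}
```

For a LETTER page (792 points high), a box with top 92.8 and bottom 102 ends up with lly = 690 and ury = 699.2.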

Secondly, you'll have to make sure you have more or less the same
font and font size. I assume you now have the words in plain ASCII
characters, but if you want them to match the scanned images, you'll
need a font file with more or less the same glyphs as in the scanned images.

If I were you, I'd start with some small experiments using the method
PdfContentByte.showTextAligned (somewhere in Chapter 11).

Once you move on with your project, you may want to use ColumnText;
but I'm not sure, it all depends on what your words/coordinates file looks
like. (For instance: does it contain complete sentences, and if so: does
it contain information about the leading?)

I think this is something you should take to the mailing list:
itext-questions@lists.sourceforge.net where there are more people
who can help you (I think I'm the only one answering questions on
this forum). I think many other people will find this an interesting question.

br,
Bruno
animusdk (2)
#3
Re: Marry text to image PDF
It is an interesting little deal. The coordinate files were created by our OCR engine, which is regarded as one of the best available. Here is a little snippet from one of them:

11 means 3500 928 3936 1020
12 and 4016 908 4276 1016
13 from 4352 904 4688 1012
14 which 536 1064 944 1172
15 the 1004 1064 1232 1172
16 principal 1284 1060 1892 1184
17 part 1948 1060 2244 1184
18 of 2308 1060 2440 1164
19 the 2504 1056 2720 1168
20 production 2780 1060 3524 1184
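
A line in this format could be parsed roughly as follows. This is only a sketch: the class name OcrWord is made up, and it assumes the six fields are a running index, the word, and then the left, top, right and bottom edges of the box measured from the upper left corner, which is how the numbers above appear to behave.

```java
/** Hypothetical record for one line of the OCR coordinate file. */
final class OcrWord {
    final int index;
    final String text;
    final int left, top, right, bottom; // assumed order, in OCR units

    OcrWord(int index, String text, int left, int top, int right, int bottom) {
        this.index = index;
        this.text = text;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    /** Parses a line such as "16 principal 1284 1060 1892 1184". */
    static OcrWord parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        return new OcrWord(
                Integer.parseInt(parts[0]),
                parts[1],
                Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]),
                Integer.parseInt(parts[4]),
                Integer.parseInt(parts[5]));
    }
}
```

Rejecting lines that do not have exactly six fields keeps a malformed line from silently shifting the coordinates of everything after it.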

I'll give the mailing list a shot as well. The other, not-so-fun solution is to re-run everything to output searchable PDFs from the OCR engine, which with 3 machines and our data set (33 million pages) will take about 4-5 months.
blowagie (284)
#4
Re: Marry text to image PDF
Ouch, a rerun is out of the question then.

However, the snippet you posted looks very promising!

You have each word separately and you have the enclosing rectangle.
I'm currently preparing the release of iText 2.0.8, but I'll be glad to write
you a small prototype that generates a searchable PDF based on that snippet
after I'm done. I'll post it on the mailing list (please remind me if I forget).
blowagie (284)
#5
Re: Marry text to image PDF
Hello again,
in the code sample below, I first create an 'image' PDF.
This is a PDF that doesn't allow you to search (you have plenty of PDFs like that).
Then I create a 'normal' searchable PDF. I deliberately chose another font to emphasize
the second point in my original answer. You'll also notice that when creating the image
PDF, I use Graphics2D with the actual coordinates. When I create the 'normal' PDF,
I take the different orientation of the Y axis into account, demonstrating the first point
in my initial response.
Finally I take the image_pdf and I add some text similar to the text in the 'normal' PDF
on top of the image. However: I make sure the text is not visible. If I made it visible,
the difference in font would make the text look ugly.
Now we have a PDF where you see the image; the invisible text allows you to search
for the words that are present but not shown. Again: due to the difference in fonts, the
rectangle highlighted by Adobe Reader will not correspond exactly with the actual word in
the image, but you could easily fine-tune that by using the urx and ury variables.
(llx = lower left x; lly = lower left y; urx = upper right x; ury = upper right y.)

import java.awt.Graphics2D;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Element;
import com.lowagie.text.PageSize;
import com.lowagie.text.Rectangle;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.pdf.PdfContentByte;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;


public class OCRRenderer {
    public static void main(String[] args) {
        try {
            createImagePdf();
            createNormalPdf();
            addTextToImagePdf();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }

    // Step 1: create the 'image' PDF. The words are drawn as shapes, so it is not searchable.
    private static void createImagePdf() throws IOException, DocumentException {
        Document document = new Document(PageSize.LETTER);
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("ocr_image.pdf"));
        document.open();
        Graphics2D g2d = writer.getDirectContent().createGraphicsShapes(PageSize.LETTER.getWidth(), PageSize.LETTER.getHeight());
        String line;
        String word;
        float llx, lly, urx, ury;
        StringTokenizer tokenizer;
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("ocr.txt")));
        while ((line = reader.readLine()) != null) {
            tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                tokenizer.nextToken(); // skip the running index
                word = tokenizer.nextToken();
                // Graphics2D has its origin in the upper left corner: use the coordinates as they are
                llx = Float.parseFloat(tokenizer.nextToken()) / 10;
                lly = Float.parseFloat(tokenizer.nextToken()) / 10;
                urx = Float.parseFloat(tokenizer.nextToken()) / 10;
                ury = Float.parseFloat(tokenizer.nextToken()) / 10;
                g2d.drawString(word, llx, lly);
            }
        }
        reader.close();
        g2d.dispose();
        document.close();
    }

    // Step 2: create a 'normal' searchable PDF with real text in another font.
    private static void createNormalPdf() throws IOException, DocumentException {
        Document document = new Document(PageSize.LETTER);
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("ocr_normal.pdf"));
        document.open();
        PdfContentByte cb = writer.getDirectContent();
        BaseFont font = BaseFont.createFont(BaseFont.TIMES_ROMAN, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        cb.beginText();
        cb.setFontAndSize(font, 12);
        String line;
        String word;
        float llx, lly, urx, ury;
        StringTokenizer tokenizer;
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("ocr.txt")));
        while ((line = reader.readLine()) != null) {
            tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                tokenizer.nextToken(); // skip the running index
                word = tokenizer.nextToken();
                // PDF has its origin in the lower left corner: flip the Y axis
                llx = Float.parseFloat(tokenizer.nextToken()) / 10;
                lly = document.top() - Float.parseFloat(tokenizer.nextToken()) / 10;
                urx = Float.parseFloat(tokenizer.nextToken()) / 10;
                ury = document.top() - Float.parseFloat(tokenizer.nextToken()) / 10;
                cb.showTextAligned(Element.ALIGN_LEFT, word, llx, lly, 0);
            }
        }
        reader.close();
        cb.endText();
        document.close();
    }

    // Step 3: stamp invisible text on top of the 'image' PDF to make it searchable.
    private static void addTextToImagePdf() throws IOException, DocumentException {
        PdfReader pdfReader = new PdfReader("ocr_image.pdf");
        Rectangle pagesize = pdfReader.getPageSizeWithRotation(1);
        PdfStamper stamper = new PdfStamper(pdfReader, new FileOutputStream("ocr_combined.pdf"));
        PdfContentByte cb = stamper.getOverContent(1);
        BaseFont font = BaseFont.createFont(BaseFont.TIMES_ROMAN, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        cb.beginText();
        // The text can be found by a search, but it is never painted
        cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
        cb.setFontAndSize(font, 12);
        String line;
        String word;
        float llx, lly, urx, ury;
        StringTokenizer tokenizer;
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("ocr.txt")));
        while ((line = reader.readLine()) != null) {
            tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                tokenizer.nextToken(); // skip the running index
                word = tokenizer.nextToken();
                // Flip the Y axis using the height of the existing page
                llx = Float.parseFloat(tokenizer.nextToken()) / 10;
                lly = pagesize.getTop() - Float.parseFloat(tokenizer.nextToken()) / 10;
                urx = Float.parseFloat(tokenizer.nextToken()) / 10;
                ury = pagesize.getTop() - Float.parseFloat(tokenizer.nextToken()) / 10;
                cb.showTextAligned(Element.ALIGN_LEFT, word, llx, lly, 0);
            }
        }
        reader.close();
        cb.endText();
        stamper.close();
    }
}
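
As for the fine-tuning with urx and ury mentioned above, one possible approach is to derive a font size from the height of the box and a horizontal scaling factor from its width. The sketch below is plain arithmetic with made-up names (TextFit and its methods). It assumes you measure the natural width of the word yourself at the chosen font size (in iText, BaseFont.getWidthPoint(text, fontSize) should do that) and pass the resulting percentage to PdfContentByte.setHorizontalScaling before showing the text. Taking the box height as the font size is a crude approximation, because the height of a word box depends on which ascenders and descenders the word has.

```java
/** Hypothetical helper for fitting invisible text to an OCR word box. */
final class TextFit {
    private TextFit() {}

    /** Crude estimate: use the height of the box as the font size. */
    static float fontSizeForBox(float lly, float ury) {
        return Math.abs(ury - lly);
    }

    /**
     * Horizontal scaling, in percent, that stretches or squeezes a word
     * of the given natural width so that it spans the box from llx to urx.
     */
    static float horizontalScalingPercent(float llx, float urx, float naturalWidth) {
        if (naturalWidth <= 0) {
            throw new IllegalArgumentException("naturalWidth must be positive");
        }
        return 100f * (urx - llx) / naturalWidth;
    }
}
```

For example, a box from 128.4 to 189.2 points holding a word that is naturally 40 points wide needs a scaling of 152 percent.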