Thursday, July 17, 2008

Optical Character Recognition With Acrobat Pro

Optical Character Recognition (OCR) software can be a very helpful tool in the practice of law. OCR lets you scan a document and convert the scan into editable text. This is especially helpful when you don't have digital copies of statements, briefs, etc., and you need to quote large portions that would be a burden to type. (Keep in mind that in South Dakota the opposing party is required to send an electronic copy of discovery requests if you make a written request for it. See SDCL 15-6-5(i).) I plan to do a rundown of OCR options in a future post. For now, I want to share something quick: OCR with Adobe Acrobat Pro.

The first step is getting your text into a PDF (Adobe Portable Document Format) file. Most scanners allow you to send your scanned document straight to a PDF, so in that case you can skip this step. Keep in mind, OCR is only needed when your text is a picture, i.e., you can't select the characters individually. Picture files include those ending in extensions like .jpg, .bmp, and .gif. Sometimes .pdf files will contain pictures as well. You will know you have a picture simply because you can't select the text. For this example, I'm going to take text from a source that doesn't allow you to select it. You might find this kind of problem on web pages that use Flash instead of HTML.
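If you have a folder full of files and want a quick first pass at which ones will need OCR, the rule of thumb above can be sketched in a few lines of Python. This is only a rough illustration: the function name `needs_ocr` and the three answers it returns are my own invention, and I've added a few common picture extensions (.jpeg, .png, .tif, .tiff) beyond the ones named above. The extension can't tell you what is inside a PDF, so the real test is still whether you can select the text.

```python
from pathlib import Path

# Picture extensions: .jpg, .bmp and .gif come from the post;
# .jpeg, .png, .tif and .tiff are added here as an assumption.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".bmp", ".gif", ".png", ".tif", ".tiff"}


def needs_ocr(filename: str) -> str:
    """Guess from the extension alone whether a file needs OCR.

    Returns "yes" for picture files, "maybe" for PDFs, "unknown" otherwise.
    """
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        return "yes"
    if ext == ".pdf":
        # A PDF may hold real text or just a picture of text.
        # Open it and try to select the text to find out.
        return "maybe"
    return "unknown"
```

For example, `needs_ocr("exhibit.JPG")` gives "yes", while `needs_ocr("scan.pdf")` gives "maybe" because a PDF has to be opened and checked by hand.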

We will need to "capture" the image. There are a variety of ways to do this. The method everyone can use is the PrtSc (Print Screen) button, usually at the top right of the keyboard. When you hit this button, the computer takes a snapshot of the screen and stores it in its clipboard, so you can paste it into a paint program just as if you had copied or cut the picture from somewhere. The downside of the PrtSc method is that it takes a picture of the ENTIRE screen, and you will have to clip out the area you want. OneNote also has a "clip" feature that allows you to draw a box around the area you want. OneNote pastes the image into a OneNote page, and then you will have to copy the image to a program that will make your PDF. Finally, I would recommend the program SnagIt, which allows you to do screen grabs and also to record movies of anything on your computer screen. You can try SnagIt here.
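The "clip out the area you want" step after a full-screen PrtSc capture can also be scripted. Here is a minimal sketch in Python, assuming you have already pasted the capture into a paint program and saved it as an image file, and that the third-party Pillow library is installed. The function names `selection_to_box` and `clip_screenshot` are hypothetical, made up for this illustration; none of this is how OneNote or SnagIt do it internally.

```python
def selection_to_box(x0, y0, x1, y1, screen_w, screen_h):
    """Turn two dragged corner points into a (left, top, right, bottom) box.

    The corners may be given in any order; the box is clamped so it
    never extends past the edges of the screenshot.
    """
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    left = max(0, left)
    top = max(0, top)
    right = min(screen_w, right)
    bottom = min(screen_h, bottom)
    return (left, top, right, bottom)


def clip_screenshot(src_path, dst_path, corner_a, corner_b):
    """Crop a saved full-screen capture down to the selected area."""
    from PIL import Image  # Pillow, a third-party library (assumed installed)

    with Image.open(src_path) as shot:
        box = selection_to_box(*corner_a, *corner_b, shot.width, shot.height)
        # Image.crop takes a (left, upper, right, lower) box in pixels.
        shot.crop(box).save(dst_path)
```

A call like `clip_screenshot("fullscreen.png", "clip.png", (100, 50), (400, 300))` would keep only the rectangle between those two corners; the file names and coordinates are placeholders.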

Now you need to save the image as a PDF. One method is to use a PDF virtual printer like PDFCreator, which I've discussed before. Or, if you are lucky enough to have Photoshop, it may be the best option because you can control the output. You simply "save as" a .pdf and then turn downsampling off in the PDF converter so the OCR engine gets the highest-quality image to work from. Here are my settings:
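Separately from the Photoshop route, the same save-as-PDF step can be sketched in Python with the third-party Pillow library (assumed installed). This is a swapped-in technique, not what Photoshop or PDFCreator do, and the names `image_to_pdf` and `page_size_inches` are hypothetical. As far as I know, Pillow keeps every pixel when writing a PDF, which is the point of turning downsampling off, though it may still JPEG-compress color images, so treat the result as "not downsampled" rather than "lossless".

```python
def page_size_inches(width_px, height_px, dpi):
    """Page size the PDF will have if every pixel is kept at the given dpi."""
    return (width_px / dpi, height_px / dpi)


def image_to_pdf(src_path, dst_path, dpi=300):
    """Save a picture as a one-page PDF without shrinking it."""
    from PIL import Image  # Pillow, a third-party library (assumed installed)

    with Image.open(src_path) as img:
        # Converting to RGB sidesteps image modes (such as RGBA) that
        # the PDF writer may reject.
        # 'resolution' only sets the dpi used to size the page; the
        # pixel count is left unchanged.
        img.convert("RGB").save(dst_path, "PDF", resolution=float(dpi))
```

The helper shows why the dpi matters: a 2550 x 3300 pixel scan at 300 dpi comes out as an 8.5 x 11 inch page, while the same pixels at a lower dpi would just make a physically larger page with the same detail.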


The last step (and the only step if you scanned a document and saved the file as a .pdf) is to OCR the PDF file. In Acrobat Pro go to the menu item Document -> OCR Text Recognition -> Recognize Text Using OCR.

OCR is still a developing technology, so your mileage may vary. Here's my output:

2 comments:

  1. I agree with you that OCR is a very helpful and useful tool as OCR allows you to scan documents and convert the scan into an editable document. This can be especially helpful when you don't have digital copies of statements, briefs, etc.
