PDF-XChange Co Ltd :: Knowledge Base :: How do I OCR documents with the PDF-XChange family of products?

KB351
Jun 10, 2024 08:53 AM

How do I OCR documents with the PDF-XChange family of products?

Question

How do I perform OCR on documents?

How do I convert image-based documents into text-searchable documents?

Answer

You can perform OCR with PDF-XChange Editor, PDF-Tools or even the discontinued product PDF-XChange Viewer:

PDF-XChange Editor

Two optical character recognition (OCR) engines are available in PDF-XChange Editor: the default OCR engine and the enhanced OCR engine. The enhanced OCR engine is available when PDF-XChange Editor Plus is purchased, either as a stand-alone product or as part of the PDF-XChange PRO bundle. It is faster, more accurate and more dynamic than the default OCR engine, and it also contains some extra features. Further information about the enhanced OCR engine is available here. Use the OCR preferences (available via the Preferences option in the File tab) to switch between the default and enhanced OCR engines:

Default OCR Engine

The default OCR engine in PDF-XChange Editor analyzes image-based documents, recognizes the text in them and then places a duplicate, invisible text layer on top of the recognized text. The original, image-based text can then be searched and selected via the invisible text layer in the same manner as ordinary text, which is the main benefit of OCR. However, the document text cannot be edited in the same manner as the text in normal, text-based documents, because the document remains image-based despite the invisible text layer. The enhanced OCR engine must be used in order to convert image-based text into editable text.
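The following is a minimal conceptual sketch of why a page with an invisible text layer can be searched but not edited as text. It is not how PDF-XChange Editor is implemented. The names `Word`, `ScannedPage` and `find_text` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the page (hypothetical model)."""
    text: str
    box: tuple[float, float, float, float]  # left, bottom, right, top


@dataclass(frozen=True)
class ScannedPage:
    """An image-based page plus the invisible text layer that OCR adds."""
    image: bytes            # the scanned picture, left untouched by OCR
    text_layer: list[Word]  # invisible words positioned over the picture


def find_text(page: ScannedPage, query: str) -> list[tuple[float, float, float, float]]:
    """Return the boxes of all text-layer words that contain the query."""
    needle = query.lower()
    return [word.box for word in page.text_layer if needle in word.text.lower()]


page = ScannedPage(
    image=b"<scanned pixels>",
    text_layer=[
        Word("Invoice", (72.0, 700.0, 130.0, 714.0)),
        Word("Total", (72.0, 120.0, 110.0, 134.0)),
    ],
)

# Searching works because it reads the text layer, not the picture.
matches = find_text(page, "invoice")
print(matches)
```

In this model, a search hit only reports where the matching word lies over the picture. The visible content is still the picture itself, which is why the text cannot be edited unless the image-based text is replaced with real text.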

Click the Convert tab, then click OCR Pages to perform OCR on documents:

The OCR Pages dialog box will open:

Use the Page Range settings to determine the page range for OCR:

Select All to specify all pages.
Select Current to specify only the current page.
Select Custom to specify a custom page range, then enter the desired page range in the adjacent number box. Further information about how to specify custom page ranges is available here.

Use the Subset options to specify a subset of selected pages. Select All, Odd or Even as desired.
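As a rough illustration of how the Page Range and Subset settings combine, the sketch below first builds the page list and then filters it. It makes two assumptions that are not confirmed by this article: that a custom range uses a comma-and-hyphen syntax such as `2-6,9`, and that Odd and Even refer to page numbers. The function names are hypothetical and are not part of any PDF-XChange API.

```python
def parse_custom_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "2-6,9" into 1-based page numbers (assumed syntax)."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.extend(range(start, end + 1))
    return pages


def select_pages(total_pages: int, mode: str = "all", custom: str = "",
                 current: int = 1, subset: str = "all") -> list[int]:
    """Apply the Page Range choice first, then the Subset filter."""
    if mode == "all":
        pages = list(range(1, total_pages + 1))
    elif mode == "current":
        pages = [current]
    elif mode == "custom":
        pages = parse_custom_range(custom, total_pages)
    else:
        raise ValueError(f"unknown page range mode: {mode!r}")

    if subset == "odd":
        return [p for p in pages if p % 2 == 1]
    if subset == "even":
        return [p for p in pages if p % 2 == 0]
    if subset == "all":
        return pages
    raise ValueError(f"unknown subset: {subset!r}")


# Custom range 2-6,9 restricted to odd pages of a 10-page document.
print(select_pages(10, mode="custom", custom="2-6,9", subset="odd"))  # [3, 5, 9]
```

The point of the sketch is that the Subset setting narrows whatever the Page Range setting selected, so choosing Current together with Even on an odd-numbered page would select nothing under these assumptions.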

Use the Recognition Options to determine the language and accuracy of the OCR process. Increasing the accuracy also increases the time that the process takes, and vice versa. Setting the accuracy to high may also result in unusual output if the document contains imperfections, because the software searches to a greater depth and may attempt to recognize imperfections as text. Click Add/Update Languages to add/update the language packs used for OCR.

Select the Detect skew of page content box to enable automatic detection of skewed page content, which occurs when documents are scanned crookedly.
Select the Detect incorrect page rotation box to enable the automatic detection of incorrect page rotation in documents.
Select the Ignore company logos box to omit company logos from OCR. Click the ellipsis icon to add/view/manage logos.
Select the Ignore existing text on page box to omit existing text from the process of optical character recognition.
Select the Ignore comments on page box to omit comments from the process of optical character recognition.
Select the Ignore form fields on page box to omit form fields from the process of optical character recognition.

Use the Output Options to determine the format and quality of output from the OCR process:

Select the Fix content skew and incorrect page rotation box to deskew pages that were scanned crookedly and auto-correct page rotation issues.
Select the Create a New Document box to create a new document for the output of the optical character recognition. If this box is not selected, then the original document will be updated with the output instead.

Click OK to OCR documents.

Enhanced OCR Engine

The Enhanced OCR dialog box appears as detailed below:

The options in this dialog box are the same as those detailed above but with additional Output Options:

Select Searchable Image to retain the image-based content on which OCR is performed and insert a duplicate, invisible text layer on the text recognized during the operation. This will make the source text selectable and searchable in the same manner as ordinary text.
Select Editable Text and Images to replace image-based text in source documents with the text recognized in the process of optical character recognition. This will convert image-based text into editable text, and retain existing content such as text and images.
Select Fine Page Content to replace the content of source documents with new content that contains only the text and images recognized during optical character recognition.
Select the Draw Lines for tables box to replace recognized column/row lines in tables with editable vector lines in output documents.

Note that in some cases (for example, documents that contain one large graphic zone that takes up the whole page area and has some text zones over it), the visual output for Editable Text and Images and Fine Page Content will be very similar.

Click OK to OCR documents.

Note that it is also possible to OCR documents when scanned content or images are used to create PDF documents, and to perform OCR on only a selected area of documents, as detailed below.

Creating Documents from Images

1. Click the File tab, then click New Document and click From Images:

The Image to PDF dialog box will open:

2. Add files and determine settings as detailed here.

3. Click Options for further options. The Image to PDF Options dialog box will open. Click Image Post-Processing to view OCR options when images are converted to PDF:

4. Select the Run OCR box to OCR images when they are converted to PDF. Click OCR Settings to determine language and accuracy options, as detailed above.

Creating Documents from the Scanner

1. Click the File tab, then click New Document.

2. Click From Scanner, then click Custom Scan:

3. The Scan Properties dialog box will open:

4. Determine settings as detailed here.

5. Click Images Insertion Options to determine options for inserted images. The Image to PDF Options dialog box will open. Click Image Post-Processing to view OCR options when scanned content is converted to PDF:

6. Select the Run OCR box to OCR images when they are converted to PDF. Click OCR Settings to determine language and accuracy options, as detailed above.

OCR Selected Region

It is also possible to perform OCR on selected regions of documents when either the Snapshot Tool or the Crop Page Tool has been used to define a page area. For example, click Other Tools in the Organize tab, then click Snapshot Tool and click and drag the mouse to define a snapshot area:

When the area has been defined, right-click it and then click OCR Selected Region in the shortcut menu:

The OCR Options dialog box will open. Determine parameters as detailed above and then click OK to perform OCR on the selected region of the document.

PDF-Tools

Two optical character recognition (OCR) engines are available in PDF-Tools: the default OCR engine and the enhanced OCR engine. The enhanced OCR engine is available when PDF-Tools is purchased as part of the PDF-XChange PRO bundle. It is faster, more accurate and more dynamic than the default OCR engine, and it also contains some extra features. Further information about the enhanced OCR engine is available here. Use the OCR preferences (available via the Preferences option in the Options tab) to switch between the default and enhanced OCR engines.

Follow the steps below to perform OCR with PDF-Tools:

1. Open PDF-Tools and double-click the OCR Pages tool to run it:

2. Select the files/folders to be processed.
3. The OCR Pages dialog box will open:

Use the Page Range settings to determine the page range for OCR:

Select All to specify all pages.
Select Current to specify only the current page.
Select Custom to specify a custom page range, then enter the desired page range in the adjacent number box. Further information about how to specify custom page ranges is available here.

Use the Subset options to specify a subset of selected pages. Select All, Odd or Even as desired.

Select the Detect skew of page content box to enable automatic detection of skewed page content, which occurs when documents are scanned crookedly.
Select the Detect incorrect page rotation box to enable the automatic detection of incorrect page rotation in documents.
Select the Ignore company logos box to omit company logos from OCR. Click the ellipsis icon to add/view/manage logos.
Select the Ignore existing text on page box to omit existing text from the process of optical character recognition.
Select the Ignore comments on page box to omit comments from the process of optical character recognition.
Select the Ignore form fields on page box to omit form fields from the process of optical character recognition.

Use the Output Options to determine the format and quality of output from the OCR process:

Select the Fix content skew and incorrect page rotation box to deskew pages that were scanned crookedly and auto-correct page rotation issues.
Select the Create a New Document box to create a new document for the output of the optical character recognition. If this box is not selected, then the original document will be updated with the output instead.

When the enhanced OCR engine is in use, the following additional Output Options determine the output of OCR:

Select Searchable Image to retain the image-based content on which OCR is performed and insert a duplicate, invisible text layer on the text recognized during the operation. This will make the source text selectable and searchable in the same manner as ordinary text.
Select Editable Text and Images to replace image-based text in source documents with the text recognized in the process of optical character recognition. This will convert image-based text into editable text, and retain existing content such as text and images.
Select Fine Page Content to replace the content of source documents with new content that contains only the text and images recognized during optical character recognition.
Select the Draw Lines for tables box to replace recognized column/row lines in tables with editable vector lines in output documents.

Default OCR Engine

The default OCR engine in PDF-Tools analyzes image-based documents, recognizes the text in them and then places a duplicate, invisible text layer on top of the recognized text. The original, image-based text can then be searched and selected via the invisible text layer in the same manner as ordinary text, which is the main benefit of OCR. However, the document text cannot be edited in the same manner as the text in normal, text-based documents, because the document remains image-based despite the invisible text layer. The enhanced OCR engine must be used in order to convert image-based text into editable text.

Additionally, please note that you can create custom tools that include OCR functionality, as detailed here.

PDF-XChange Viewer (Discontinued)

1. Click Document in the Menu Toolbar, then click OCR Pages in the submenu (or press Ctrl+Shift+C). The OCR Pages dialog box will open:

The Pages Range options are as follows:

Select All to OCR all the pages of the document.
Select Selected Pages to OCR only the pages currently selected in the document.
Select Current Page to OCR only the current page.
Select Pages to specify the pages of the document on which to perform the OCR process. Enter the desired page range(s) in the text box.

The Recognition options determine the language and accuracy of the OCR process. If the desired language is not available in the dropdown menu, then click More Languages for further options. Increasing the accuracy increases the time that the process takes, and vice versa. Setting the accuracy to high may also result in unusual output if the document contains imperfections, because the software searches to a greater depth and may attempt to recognize imperfections as text.

The Output options determine the format of the output information from the OCR process:

Select Preserve Original Content & Add Text Layer to have PDF-XChange Viewer analyze the document, recognize text and then insert an invisible text layer over the text. The text layer contains the same text as that recognized in the document, which means that the original, image-based text can be searched and selected via the invisible text layer. This is the main benefit of OCR. However, the document text cannot be edited in the same manner as the text in normal, text-based documents, because the document remains image-based despite the invisible text layer.
Select Convert Page Content to Image only - Add Text As a Layer to convert documents that contain both images and text into a single, consolidated image. If this option is selected, then use the Images Quality dropdown menu to determine the resolution in dpi (dots per inch) of the created image. If this mode is used for image-only documents, then the only change will be the resolution of the image, and only when the initial dpi is different from the dpi specified in the Images Quality dropdown menu. Otherwise no changes will occur. Output documents from this process replace the input documents, so make a copy beforehand if the input documents will be needed in their original format.
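To give a sense of what the Images Quality resolution means in practice, the sketch below estimates the pixel size of a page image at a given dpi. It relies on the fact that PDF page dimensions are measured in points (72 points per inch), and it assumes the whole page is rasterized at exactly the chosen dpi. The function name `raster_size` is hypothetical, and the calculation is not taken from PDF-XChange Viewer itself.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Estimate the pixel dimensions of a page rendered at the given dpi."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(raster_size(612, 792, 300))  # (2550, 3300)
print(raster_size(612, 792, 150))  # (1275, 1650)
```

Doubling the dpi doubles each dimension and so roughly quadruples the number of pixels, which is worth bearing in mind when choosing a resolution for large documents.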