PDFImageOCR

Reads in PDF Files from a given path, extracts the images from them and converts them to text with Tesseract

Input

It reads in a PDF file or a directory containing PDF files

Output

It creates a DataFrame from the data read and sends it to its output

Type

dataset

Class

fire.nodes.dataset.NodeDatasetPDFImageOCR

Fields

Name Title Description
path Path of the PDF files Path of the PDF file/directory
fileNameCol File Name Column File Name Column in the Output DataFrame
outputCol Column Name which contains the result of OCR OCR output Column in the Output DataFrame