Running Tesseract in Fire

To run Tesseract in Fire, complete the installation steps below:

Download & Install the Tesseract Language Data files

  • Download and install the Tesseract language data files on each worker node of the cluster.
  • Install them in the same directory on every worker node:
    • git clone https://github.com/tesseract-ocr/tessdata.git
  • Make sure that the tessdata directory is accessible to all users.
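
The steps above can be sketched as a short shell snippet. This is a minimal sketch, not the official install procedure: it assumes /home/centos/tessdata as the target directory (the path used in the examples on this page), and the helper name make_world_readable is hypothetical. The git clone line is left commented out because it needs network access and only has to run once per node.

```shell
# Assumed install location; use the same path on every worker node.
TESSDATA_DIR="${TESSDATA_DIR:-/home/centos/tessdata}"

# Download the language data (run once per worker node):
# git clone https://github.com/tesseract-ocr/tessdata.git "$TESSDATA_DIR"

# Hypothetical helper: make a directory tree readable by all users.
# a+r grants read access; a+X grants execute (traverse) permission only to
# directories and to files that are already executable.
make_world_readable() {
  chmod -R a+rX "$1"
}

# Apply the permissions only if the directory is present on this node.
if [ -d "$TESSDATA_DIR" ]; then
  make_world_readable "$TESSDATA_DIR"
fi
```

Other users also need traverse permission on every parent directory of the tessdata directory (for example /home/centos), otherwise the files stay unreachable even when they are world-readable.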

Set TESSDATA_PREFIX as an Environment Variable and restart the Sparkflows server

  • Point the environment variable TESSDATA_PREFIX to the tessdata directory:
    • export TESSDATA_PREFIX=/home/centos/tessdata
  • Restart the Sparkflows server so that it picks up the variable.
  • If this step is not done correctly, the Sparkflows server exits when any OCR node is run.

Include TESSDATA_PREFIX in the Spark configs when submitting the job

Include the following in the Spark configs when running workflows that contain the OCR node:

  • --conf spark.executorEnv.TESSDATA_PREFIX=/home/centos/tessdata
  • Here, /home/centos/tessdata is the directory that holds the Tesseract language data files on each worker node.
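
To keep the executor setting in sync with the path used on the server, the flag can be generated from a single variable. The snippet below is a small sketch under that assumption; the function name executor_env_conf and the TESSDATA_DIR variable are hypothetical, while spark.executorEnv.TESSDATA_PREFIX is the Spark property shown above.

```shell
# Assumed location of the language data on the worker nodes.
TESSDATA_DIR="${TESSDATA_DIR:-/home/centos/tessdata}"

# Hypothetical helper: print the Spark config flag that passes
# TESSDATA_PREFIX to every executor for the given directory.
executor_env_conf() {
  printf '%s%s\n' '--conf spark.executorEnv.TESSDATA_PREFIX=' "$1"
}

# Print the flag for the configured directory.
executor_env_conf "$TESSDATA_DIR"
```

The printed text is what goes into the Spark configs for the workflow; when a job is launched with plain spark-submit, the same flag would be passed on the command line.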

Error if TESSDATA_PREFIX is not set correctly

If TESSDATA_PREFIX is not set correctly, the Spark job fails with the error below:

  • Error opening data file /Users/saudet/projects/bytedeco/javacpp-presets/tesseract/cppbuild/macosx-x86_64/share/tessdata/eng.traineddata
  • Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
  • Failed loading language 'eng'
  • Tesseract couldn't load any languages!
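
When this error appears, a quick check on a worker node can narrow down the cause. The following is a minimal sketch, assuming (as in the examples on this page) that TESSDATA_PREFIX points directly at the directory containing the .traineddata files; the function name check_tessdata is hypothetical.

```shell
# Hypothetical helper: return 0 if the directory holds a readable traineddata
# file for the given language (default: eng); otherwise print the reason to
# stderr and return 1.
check_tessdata() {
  dir="$1"
  lang="${2:-eng}"
  if [ -z "$dir" ]; then
    echo "TESSDATA_PREFIX is not set" >&2
    return 1
  fi
  if [ ! -r "$dir/${lang}.traineddata" ]; then
    echo "Cannot read $dir/${lang}.traineddata" >&2
    return 1
  fi
  echo "OK: $dir/${lang}.traineddata"
}

# Check the current environment; "|| true" keeps a calling script running
# even when the check fails.
check_tessdata "${TESSDATA_PREFIX:-}" eng || true
```

Run the check on each worker node as the user that runs the Spark executors, since both a missing file and missing read permission produce the same Tesseract error.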