Modern organizations are inundated with vast amounts of unstructured data in the form of images, PDFs, and scanned documents. Extracting relevant text information from these files manually is time-consuming, error-prone, and inefficient.
However, with the advancements in artificial intelligence (AI), you now have the ability to automate this process using code. You can use AI-powered optical character recognition (OCR) algorithms to accurately extract text from images and make your data more accessible, searchable, and actionable.
This article looks at different types of images and methods to extract text from both simple and complex images. We also look at limitations of some common methods and suggest practical ways to improve the output. Let’s begin by understanding why you need to convert images to text!
Why Extract Text from Images?
Many organizations have image data scanned from operational paperwork. The text in these scans is not searchable, editable, or usable for analysis. You have to extract the text and convert it into a string data type so you can store and use the data.
For example, you can extract supplier information, invoice date, invoice amount, and other text information from invoice images. You can store the data for tax and audits or use it to analyze supplier performance.
Other use cases for text extraction include:
- Digital conversion of healthcare records, scans, and images.
- Digital conversion of resumes and forms for recruitment and other HR processes.
- Automatic scanning of ID documents like passports, voter IDs, and rental agreements as part of authorization and authentication workflows.
- Scanning food labels and ingredients when adding products.
- Identifying location details from images of places—like street signs, store names, and so on.
What Types of Images Can You Extract Text From?
Technically, you can extract text from all types of images in Python. However, the code complexity and output accuracy can vary greatly depending on the input you expect.
You may need just a few lines of code if you expect simple images as input, like the ones shown below. Such images have large text, few words, simple fonts, and clear contrast between the text and the background.
However, most text extraction inputs have noisy backgrounds, varying fonts, shaded, skewed, or handwritten text, like the one shown below.
Such images require much more coding and testing effort in a DIY approach. You have to preprocess the image before extraction and then analyze and correct the text afterward.
Convert Simple Images to Text in Python
The methods outlined below will work well for simple images.
#1: Tesseract and OpenCV
Tesseract is a widely used open-source OCR (optical character recognition) engine that provides accurate text extraction from images. The Open Source Computer Vision Library (OpenCV) is a computer vision and machine learning library that provides various functions and algorithms for working with images and videos. OpenCV is written in C++ and offers interfaces for various programming languages, including Python.
You can use Tesseract and OpenCV to extract information from images using Python.
To begin, install Tesseract on your system by following the instructions for your operating system.
Once Tesseract is set up, install the pytesseract library, which acts as a Python wrapper for Tesseract, along with OpenCV's Python package (`pip install pytesseract opencv-python`).
After installing everything, follow the steps below to convert an image of text to a string using Tesseract.
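A minimal sketch of those steps, assuming pytesseract and opencv-python are installed; `invoice.png` below is a hypothetical input file:

```python
def extract_text(image_path, lang="eng"):
    """Read an image, convert it to grayscale, and run Tesseract OCR on it."""
    # Imported inside the function so the sketch runs even where the
    # libraries are missing; in production, import at module level.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Grayscale conversion usually improves Tesseract's accuracy
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray, lang=lang)
```

Calling `extract_text("invoice.png")` returns the recognized text as a single string.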
#2: EasyOCR
EasyOCR is a user-friendly and efficient Python library for OCR. It provides a simple interface for extracting text from basic images. To get started, install the library with pip (`pip install easyocr`).
Once installed, follow these steps to extract text from an image.
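A short sketch of those steps, assuming EasyOCR is installed; `sign.jpg` is a hypothetical input file:

```python
def read_image_text(image_path, languages=("en",)):
    """Return the text fragments EasyOCR detects in an image."""
    # Imported lazily: EasyOCR pulls in heavy deep-learning dependencies
    import easyocr

    reader = easyocr.Reader(list(languages))
    # detail=0 returns plain strings instead of
    # (bounding box, text, confidence) tuples
    return reader.readtext(image_path, detail=0)
```

`read_image_text("sign.jpg")` returns a list of strings, one per detected text region.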
#3: Other Python Libraries
In addition to pytesseract and EasyOCR, there are other Python libraries available that offer OCR capabilities for extracting text from images.
PyOCR is a Python wrapper that provides access to various OCR engines such as Tesseract, CuneiForm, and GOCR. It offers a unified interface to utilize these engines for text extraction from images. Here's an example of how to use PyOCR with Tesseract.
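A sketch of that usage, assuming pyocr, Pillow, and a Tesseract install are available:

```python
def ocr_with_pyocr(image_path):
    """Extract text using the first OCR engine PyOCR finds (e.g. Tesseract)."""
    # Imported inside the function so the sketch stands alone
    import pyocr
    import pyocr.builders
    from PIL import Image

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR engine found; install Tesseract first.")
    tool = tools[0]
    return tool.image_to_string(
        Image.open(image_path),
        lang="eng",
        builder=pyocr.builders.TextBuilder(),
    )
```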
OCRopus is a collection of OCR tools and libraries originally developed with Google's sponsorship. It provides a framework for OCR research and includes components for layout analysis, character recognition, and post-processing. OCRopus can be used for both single-page and multi-page document OCR. Here's a basic example.
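OCRopus is driven through command-line tools rather than a Python API, so a sketch drives its pipeline via `subprocess`. The command names (`ocropus-nlbin`, `ocropus-gpageseg`, `ocropus-rpred`) come from the ocropy distribution and may differ per install; `page.png` is a hypothetical input:

```python
import glob
import subprocess


def run_ocropus(page_image="page.png", out_dir="book"):
    """Drive the OCRopus command-line pipeline from Python (a sketch)."""
    # 1. Binarize and normalize the scanned page
    subprocess.run(["ocropus-nlbin", page_image, "-o", out_dir], check=True)
    # 2. Segment the binarized page into individual text lines
    subprocess.run(
        ["ocropus-gpageseg"] + glob.glob(f"{out_dir}/*.bin.png"), check=True
    )
    # 3. Recognize each extracted line image
    subprocess.run(
        ["ocropus-rpred"] + glob.glob(f"{out_dir}/*/*.bin.png"), check=True
    )
```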
These Python libraries offer additional options and flexibility regarding OCR in Python. Depending on your specific requirements and the nature of your images, exploring these alternative libraries might provide you with different features and performance characteristics.
Limitations of Python libraries
Open source Python libraries give good results for basic images but often fail for complex ones. For example, they give inaccurate results if:
- The background is pixellated, blurry, or the same colour as the text.
- The image is a scanned copy of handwritten text.
- The image has multiple columns or irregular text placement.
They also cannot perform natural language processing (NLP) to check and improve the output. For example, if only partial text is extracted, NLP can infer and complete the missing words for better output. Python OCR libraries cannot do this on their own and return incorrect results when the input is not standard.
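To illustrate the kind of correction meant here, a toy post-processor using only the standard library's difflib can snap garbled OCR tokens back to a known vocabulary. The word list is hypothetical; real NLP correction uses language models rather than a fixed dictionary:

```python
import difflib

# A tiny domain vocabulary for illustration only
VOCAB = ["invoice", "supplier", "amount", "total", "date"]


def correct_token(token, vocab=VOCAB, cutoff=0.6):
    """Snap an OCR token to the closest known word, if one is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("invo1ce"))  # -> invoice
print(correct_token("xyz"))      # -> xyz (no close match, returned unchanged)
```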
Improving Results When Using Python Libraries
You can improve text extraction results from Python libraries through image conversion. First convert the image to grayscale, then threshold the grayscale image into a binary format, where text is represented as black pixels and the background as white pixels.
You can also write additional code to preprocess images. Common preprocessing tasks include:
- Applying filters to remove speckles and improve clarity.
- Adjusting the contrast between the text and the background.
- Correcting any skew or rotation in the image.
- Normalizing varying text sizes to a single standard size.
It is important to note that while most of these preprocessing tasks are conceptually simple, they still require significant coding and testing effort.
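A sketch of how the preprocessing tasks above might look with OpenCV, assuming opencv-python and NumPy are installed. The deskew logic is a common heuristic, not the only approach, and OpenCV's `minAreaRect` angle convention varies between versions:

```python
def preprocess_for_ocr(image_path):
    """Despeckle, grayscale, binarize, and deskew an image before OCR."""
    # Imported inside the function so the sketch stands alone
    import cv2
    import numpy as np

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Median filter removes speckle noise
    gray = cv2.medianBlur(gray, 3)
    # Otsu's method picks a threshold that separates text from background
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Estimate skew from the minimum-area rectangle around dark (text) pixels
    coords = np.column_stack(np.where(binary == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # newer OpenCV versions report angles in (0, 90]
        angle -= 90
    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, rotation, (w, h), borderValue=255)
```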
In practical applications, real-world images require additional, more complex computer-vision-based preprocessing, such as:
- Image component analysis to find regions of text blocks.
- Pre-labelling of image regions.
- Contour analysis to find boundaries or edges of text regions.
- Stroke thinning for handwritten text.
Such computer-vision-based preprocessing is not possible with the Python libraries above.
Convert Complex Images to Text
If you expect more complex images as input, you are better off choosing an enterprise solution.
#1: Cloud APIs
You can use fully managed OCR services from cloud providers to extract text from images. The provider handles the underlying complexity of text extraction: you pass the image to the API as input and get the string as output. The three major cloud OCR services are:
- Google Cloud Vision API
- Amazon Textract
- Microsoft Azure AI Vision (Read OCR)
You can call any of these APIs in your code, based on your organization's cloud infrastructure. Below, we give an example using the Google Cloud Vision API. First, set up a Google Cloud project.
- Visit the Google Cloud Console and create a new project.
- Enable the Cloud Vision API for the project.
- Generate an API key or set up authentication credentials to access the API.
After setting up, follow the code example below.
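A sketch of the call, assuming the google-cloud-vision package is installed and credentials are configured (for example via the `GOOGLE_APPLICATION_CREDENTIALS` environment variable):

```python
def detect_text(image_path):
    """Send an image to the Cloud Vision API and return the detected text."""
    # Imported inside the function so the sketch stands alone
    from google.cloud import vision  # pip install google-cloud-vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    # The first annotation holds the full page text; the rest are individual
    # words with bounding boxes and confidence scores
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""
```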
The API analyzes the image and returns the extracted text and additional information, such as bounding box coordinates and confidence scores.
Limitations of cloud services
Cloud APIs provide a convenient and scalable solution, allowing you to process large volumes of images without infrastructure setup or maintenance. However, pricing can be unpredictable and outside your control. You often have to use multiple cloud services—like storing your input images and output results in the cloud provider's database—which adds to costs. You may also get locked into their infrastructure through legally binding contracts.
Most importantly, cloud providers only provide the infrastructure for your image extraction. You still have to write the code and build the image extraction applications yourself.
#2: Third-Party AI Tools
Third-party AI tools provide more comprehensive, AI-powered solutions for image processing. For example, Affinda's document processing engine, Vega, powers many custom image extraction solutions such as invoice processing, recruitment data extraction, and ID data extraction. Vega combines three AI techniques:
- Computer vision technologies
- Deep learning
- Natural language processing
All three work together to pre-process documents, extract text, and post-process the results. You just input the image and use the output as you like. You can use existing solutions out of the box, without any code, or request the Affinda team to develop a custom solution for you. Pricing and usage are totally in your control!
Best Way to Convert Images to Text with Python
You have to consider the complexity of your images, the expected quality of output, and the expected scale and speed of extraction, before choosing the best method. Open source Python libraries are useful for simple image to text conversion. However, for most real-world applications, input images are complex with varying text placement, background colours and image quality. Open source libraries give inaccurate and inconsistent results for such images. For complex legal and finance use cases, with little scope for error and changing text placement in images, consider a ready-to-use document processing engine like Vega.