Automated Extraction of Data from Aadhaar and PAN Cards Using EasyOCR and OpenAI

EasyOCR is an open-source library that provides easy-to-use OCR functionality in Python. It supports over 80 languages, making it an ideal choice for extracting text from images of documents like Aadhaar and PAN cards.

The script uses EasyOCR to extract text from Aadhaar and PAN card images, then processes that text with an OpenAI GPT-3.5 Turbo model to pick out specific details: Name, Date of Birth, Gender, and Address for Aadhaar cards, and Father’s Name and PAN Number for PAN cards.

Azure OpenAI provides access to OpenAI’s GPT language models, which can interpret unstructured text. By giving the model explicit rules, we can pull only the fields we need out of the OCR output and leave out unrelated text.
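
As an illustration of what "applying rules" could look like, here is a minimal sketch of a prompt builder. The function name, the field lists, and the rule wording are assumptions made for this example and are not part of the script shown later; adjust them to the fields your workflow needs.

```python
# Hypothetical field lists (an assumption for this sketch).
FIELDS = {
    "aadhaar": ["Name", "Date of Birth", "Gender", "Address"],
    "pan": ["Name", "Father's Name", "Date of Birth", "PAN Number"],
}


def build_extraction_prompt(ocr_lines, card_type):
    """Build a user prompt that states the extraction rules explicitly."""
    fields = FIELDS[card_type]
    rules = [
        "Return only a JSON object with exactly these keys: " + ", ".join(fields) + ".",
        "Use null for any field that is not present in the text.",
        "Do not add commentary or extra keys.",
    ]
    # Number the rules so the model sees them as a short checklist.
    numbered_rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    ocr_text = "\n".join(ocr_lines)
    return f"Rules:\n{numbered_rules}\n\nOCR text:\n{ocr_text}"
```

The returned string can be sent as the "user" message of a chat completion request. Asking for JSON with fixed keys makes the reply easier to parse, although the model may still deviate from the requested format.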

In this blog, we will walk through a Python-based solution that extracts key data from Aadhaar and PAN card images, using EasyOCR for text extraction and an OpenAI GPT-3.5 Turbo model (through Azure OpenAI) for data processing and formatting.

Aadhaar and PAN cards are essential documents in India, used for identity verification, tax purposes, and more. With the increasing reliance on digital workflows, the need to automate the extraction of key information from these documents is growing.

In this blog, we’ll explore how to automate this process with the following steps:

  1. Extracting text from Aadhaar and PAN card images using EasyOCR.
  2. Processing the extracted text with an OpenAI GPT-3.5 Turbo model for structured output.
  3. Formatting the extracted data into a readable format.
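
Step 3 can be sketched as follows, assuming the model was asked to reply with a JSON object. The function name format_fields and the "Not found" placeholder are illustrative choices for this example, not part of the original script.

```python
import json


def format_fields(model_reply):
    """Parse a JSON reply and render one 'Field : value' line per key."""
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError:
        # The model did not return valid JSON; fall back to the raw reply.
        return model_reply.strip()
    if not isinstance(data, dict):
        return model_reply.strip()
    # Pad keys to the same width so the values line up.
    width = max((len(key) for key in data), default=0)
    lines = []
    for key, value in data.items():
        shown = "Not found" if value is None else str(value)
        lines.append(f"{key.ljust(width)} : {shown}")
    return "\n".join(lines)
```

Falling back to the raw reply keeps the script usable when the model answers in free text instead of JSON.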

Requirements

Before we dive into the script, let’s make sure you have the necessary libraries and tools installed.

  1. Python 3.x
  2. EasyOCR for optical character recognition.
  3. OpenCV for reading and processing images (optional but helpful for visualization).
  4. OpenAI Python SDK for calling the Azure OpenAI service.

You can install the required dependencies by running the following:

Command:

pip install easyocr opencv-python openai

Python Code:

import cv2
import easyocr
from openai import AzureOpenAI

# Initialize the Azure OpenAI client (replace the placeholders with your own values)
client = AzureOpenAI(
    api_key="YOUR_API_KEY",
    api_version="2024-08-01-preview",
    azure_endpoint="YOUR_AZURE_ENDPOINT"
)

# Create the EasyOCR reader once; loading the model for every image is slow
reader = easyocr.Reader(['en'])


# Function to extract text from an image using EasyOCR
def extract_text_from_image(image_path):
    img = cv2.imread(image_path)  # Load the image; returns None if it cannot be read
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # detail=0 returns only the text strings; width_ths controls merging of nearby boxes
    result = reader.readtext(img, detail=0, width_ths=0.9)
    return "\n".join(result)  # One recognized text segment per line


# List of image paths to process
image_paths = ['path_to_your_images']

# Process each image and get the response from the Azure OpenAI model
for image_path in image_paths:
    # Extract text from image
    extracted_text = extract_text_from_image(image_path)

    # Send the extracted text to the Azure OpenAI chat model
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # Name of your Azure OpenAI deployment
        messages=[
            {
                "role": "system",
                "content": "Assistant is a large language model trained by OpenAI."
            },
            {
                "role": "user",
                "content": f"Extract data from the following text:\n{extracted_text}"
            }
        ]
    )

    # Print the response from the model
    print(response.choices[0].message.content)
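
The script calls readtext with detail=0, which returns only the recognized strings. With detail=1 (EasyOCR's default), each result also carries a bounding box and a confidence score, so low-confidence fragments can be dropped before the text is sent to the model. The sketch below shows one way to do that; the helper name and the 0.5 threshold are assumptions for illustration and should be tuned on your own images.

```python
def keep_confident_text(ocr_results, min_confidence=0.5):
    """Keep text from (bbox, text, confidence) results at or above the threshold."""
    return [
        text
        for _bbox, text, confidence in ocr_results
        if confidence >= min_confidence
    ]


# Usage with EasyOCR (not executed here; needs an image and a Reader):
# results = reader.readtext(image_path, detail=1)
# lines = keep_confident_text(results)
```

Filtering this way trades recall for cleaner input: a threshold set too high can discard a correctly read name or number.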

Applications of the Script

  1. Identity Verification: Automatically extract key information from Aadhaar and PAN cards for identity verification purposes. This can be integrated into digital onboarding systems, KYC (Know Your Customer) processes, or document verification apps.
  2. Data Processing Automation: For businesses that need to process large volumes of documents, this script can automate the data extraction process, saving time and reducing errors.
  3. Form Filling: By extracting the relevant fields from documents, this tool can be used in automated form filling applications, making it easier to transfer data from physical or scanned documents to digital forms.
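
For verification and form-filling use cases like these, a simple format check can catch OCR misreads before the extracted values are used. The sketch below assumes the standard layouts: a PAN is five letters, four digits, and one letter, and an Aadhaar number is 12 digits, often printed in groups of four. The function names are hypothetical, and the checks confirm only the shape of a value, not that the number was actually issued.

```python
import re

PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
AADHAAR_PATTERN = re.compile(r"[0-9]{12}")


def is_plausible_pan(value):
    """Return True if the value has the shape of a PAN (e.g. ABCDE1234F)."""
    return PAN_PATTERN.fullmatch(value.strip().upper()) is not None


def is_plausible_aadhaar(value):
    """Return True if the value has 12 digits once spaces and hyphens are removed."""
    digits = re.sub(r"[\s-]", "", value)
    return AADHAAR_PATTERN.fullmatch(digits) is not None
```

Values that fail the check can be routed to manual review instead of being written straight into a form.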

Conclusion

Automating data extraction from Aadhaar and PAN cards can save time, reduce human error, and streamline workflows. By combining EasyOCR for text extraction with an OpenAI GPT-3.5 Turbo model for data processing, this solution turns scanned cards into structured fields with little manual effort. The script can be integrated into larger systems in industries such as banking, finance, and government services.

 


Neha Vittal Annam