Как извлечь текст из изображений в файлах PDF

В настоящее время компании среднего и крупного масштаба ежедневно используют огромное количество печатных документов. Среди них — счета-фактуры, квитанции, корпоративные документы, отчеты и пресс-релизы.

Для этих компаний использование сканера OCR^એ может сэкономить значительное количество времени, одновременно повышая эффективность и точность. Алгоритмы оптического распознавания символов (OCR) позволяют компьютерам автоматически анализировать напечатанные или рукописные документы и подготавливать текстовые данные в редактируемых форматах, чтобы компьютеры могли их эффективно обрабатывать. Системы оптического распознавания символов преобразуют двухмерное изображение текста, которое может содержать машинноепечатный или рукописный текст из его графического представления в машиночитаемый текст.

Как правило, механизм OCR включает несколько шагов, необходимых для обучения алгоритма машинного обучения эффективному решению проблем с помощью оптического распознавания символов. Следующие шаги, которые могут отличаться от одного движка к другому, примерно необходимы, чтобы приблизиться к автоматическому распознаванию символов:

В рамках этого урока я покажу вам следующее:

Как запустить сканер OCR для файла изображения.
Как отредактировать или выделить определенный текст в файле изображения.
Как запустить сканер OCR для файла PDF или коллекции файлов PDF.

Для начала нам нужно использовать следующие библиотеки:
Tesseract OCR — это механизм распознавания текста с открытым исходным кодом, доступный по лицензии Apache 2.0, и его разработка спонсируется Google с 2006 года. В 2006 году Tesseract считался одним из самых точных механизмов распознавания текста с открытым исходным кодом. Вы можете использовать его напрямую или можете использовать API для извлечения печатного текста из изображений. Самое приятное то, что он поддерживает большое количество языков.

Установка движка Tesseract выходит за рамки этой статьи. Однако вам необходимо следовать официальному руководству по установке Tesseract, чтобы установить его в вашей операционной системе.

Чтобы проверить настройку Tesseract, выполните следующую команду и проверьте сгенерированный вывод:

Python-tesseract — это оболочка Python для Google Tesseract-OCR Engine. Он также полезен в качестве автономного сценария вызова для tesseract, поскольку он может читать все типы изображений, поддерживаемые библиотеками изображений Pillow и Leptonica, включая jpeg, png, gif, bmp, tiff и другие.

OpenCV — это библиотека Python с открытым исходным кодом для компьютерного зрения,машинное обучение и обработка изображений. OpenCV поддерживает широкий спектр языков программирования, таких как Python, C++, Java и т. Д. Он может обрабатывать изображения и видео для идентификации объектов, лиц или даже почерка человека.

PyMuPDF — MuPDF — это универсальный настраиваемый PDF, XPS,и решение для интерпретатора электронных книг, которое можно использовать в широком спектре приложений в качестве средства визуализации PDF, средства просмотра или набора инструментов. PyMuPDF — это привязка Python для MuPDF. Это легкий просмотрщик PDF и XPS.

Numpy — универсальный пакет для обработки массивов. Он предоставляет высокопроизводительный объект многомерного массива,и инструменты для работы с этими массивами. Это фундаментальный пакет для научных вычислений с Python. Кроме того, Numpy также можно использовать как эффективный многомерный контейнер общих данных.

Pillow — построена на основе PIL (библиотеки изображений Python). Это важный модуль для обработки изображений в Python.

Pandas — с открытым исходным кодом, BSD-лицензированная библиотека Python, обеспечивающая высокопроизводительные, простые в использовании структуры данных и инструменты анализа данных для языка программирования Python.

Тип файла: небольшой пакет Python без зависимостей для определения типа файла и типа MIME^એ.

Это руководство направлено на разработку облегченной утилиты на основе командной строки для извлечения, редактировать или выделять текст, включенный в изображение или отсканированный файл PDF, или в папку, содержащую коллекцию файлов PDF.

Для нвчала установим всё, что нам потртребуеся:

$pip install Filetype==1.0.7 numpy==1.19.4 opencv-python==4.4.0.46 pandas==1.1.4 Pillow==8.0.1 PyMuPDF==1.18.9 pytesseract==0.3.7

Начнем с импорта необходимых библиотек:

import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype

# Путь к движку Tesseract OCR
TESSERACT_PATH = "C:\Program Files\Tesseract-OCR\tesseract.exe"
# Включить исполняемый файл tesseract
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH

TESSERACT_PATH — это место, где находится исполняемый файл Tesseract. Очевидно, вам нужно изменить его для вашего случая.

def pix2np(pix):
    """
    Преобразует буфер растрового изображения в массив numpy
    """
    # pix.samples = последовательность байтов пикселей изображения типа RGBA
    #pix.h = высота в пикселях
    #pix.w = ширина в пикселях
    # pix.n = количество компонентов на пиксель (зависит от цветового пространства и альфа)
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.h, pix.w, pix.n)
    try:
        im = np.ascontiguousarray(im[..., [2, 1, 0]])  # RGB в BGR
    except IndexError:
        # Преобразовать серый в RGB
        im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
        im = np.ascontiguousarray(im[..., [2, 1, 0]])  # RGB в BGR
    return im

Эта функция преобразует буфер растрового изображения, представляющий снимок экрана, сделанный с использованием библиотеки PyMuPDF, в массив NumPy.

Чтобы повысить точность Tesseract, давайте определим некоторые функции предварительной обработки с помощью OpenCV:

# Функции предварительной обработки изображений для повышения точности вывода
# Преобразовать в оттенки серого
def grayscale(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Убрать шум
def remove_noise(img):
    return cv2.medianBlur(img, 5)

# Thresholding
def threshold(img):
    # return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# дилатация
def dilate(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.dilate(img, kernel, iterations=1)

# эрозия
def erode(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(img, kernel, iterations=1)

# открытие - эрозия с последующим расширением
def opening(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

# canny edge detection
def canny(img):
    return cv2.Canny(img, 100, 200)

# коррекция перекоса
def deskew(img):
    coords = np.column_stack(np.where(img > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = img.shape[:2]
    center = (w//2, h//2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

# соответствие шаблону
def match_template(img, template):
    return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)

def convert_img2bin(img):
    """
    Предварительно обрабатывает изображение и генерирует двоичный вывод
    """
    # Преобразовать изображение в изображение в оттенках серого
    output_img = grayscale(img)
    # Инвертируйте изображение в оттенках серого, переворачивая значения пикселей.
    # Все пиксели, размер которых больше 0, устанавливаются в 0, а все пиксели, которые равны 0, устанавливаются в 255
    output_img = cv2.bitwise_not(output_img)
    # Преобразование изображения в двоичное с помощью порогового значения, чтобы показать четкое разделение между белыми и черными пикселями.
    output_img = threshold(output_img)
    return output_img

Мы определили функции для многих задач предварительной обработки, включая преобразование изображений в оттенки серого, изменение значений пикселей, разделение белых и черных пикселей и многое другое.

Затем давайте определим функцию для отображения изображения:

def display_img(title, img):
    """Отображает изображение на экране и удерживает его, пока пользователь не нажмет клавишу"""
    cv2.namedWindow('img', cv2.WINDOW_NORMAL)
    cv2.setWindowTitle('img', title)
    cv2.resizeWindow('img', 1200, 900)
    # Показ изображения на экране
    cv2.imshow('img', img)
    # Удерживать изображение, пока пользователь не нажмет клавишу
    cv2.waitKey(0)
    # Уничтожать окна, когда пользователь нажимает клавишу
    cv2.destroyAllWindows()

Функция display_img() отображает на экране изображение в окне с заголовком, установленным для параметра title, и поддерживает это окно открытым до тех пор, пока пользователь не нажмет клавишу на клавиатуре.

def generate_ss_text(ss_details):
    """Перебирает захваченный текст изображения и выстраивает его построчно.
    Эта функция зависит от макета изображения."""
    # Расположите захваченный текст после сканирования страницы
    parse_text = []
    word_list = []
    last_word = ''
    # Прокрутите захваченный текст на всей странице
    for word in ss_details['text']:
        # Если захваченное слово не пустое
        if word != '':
            # Добавить в список строчных слов
            word_list.append(word)
            last_word = word
        if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
            parse_text.append(word_list)
            word_list = []
    return parse_text

Вышеупомянутая функция выполняет итерацию по захваченному тексту изображения и выстраивает захваченный текст построчно. Это зависит от макета изображения и может потребовать настройки некоторых форматов изображений.

Затем давайте определим функцию для поиска текста с помощью регулярных выражений:

def search_for_text(ss_details, search_str):
    """поиск строки поиска в содержимом изображения"""
    # Найти все совпадения на одной странице
    results = re.findall(search_str, ss_details['text'], re.IGNORECASE)
    # В случае нескольких совпадений на одной странице
    for result in results:
        yield result

Мы будем использовать эту функцию для поиска определенного текста в захваченном содержимом изображения. Возвращает генератор найденных совпадений.

def save_page_content(pdfContent, page_id, page_data):
    """Добавляет содержимое отсканированной страницы построчно в фрейм данных pandas."""
    if page_data:
        for idx, line in enumerate(page_data, 1):
            line = ' '.join(line)
            pdfContent = pdfContent.append(
                {'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
            )
    return pdfContent

Функция save_page_content() добавляет захваченное содержимое изображения построчно после его сканирования в фрейм данных pdfContent pandas.

Теперь давайте создадим функцию для сохранения полученного фрейма данных в файл CSV:

def save_file_content(pdfContent, input_file):
    """Выводит содержимое кадра данных pandas в CSV-файл, имеющий тот же путь, что и входной_файл.
    но с другим расширением (.csv)"""
    content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
        os.path.basename(input_file))[0] + ".csv")
    pdfContent.to_csv(content_file, sep=',', index=False)
    return content_file

Затем давайте напишем функцию, которая вычисляет степень достоверности текста, взятого из отсканированного изображения:

def calculate_ss_confidence(ss_details: dict):
    """Вычислить степень достоверности текста, взятого из отсканированного изображения."""
    # page_num  --> Номер страницы обнаруженного текста или элемента
    # block_num --> Номер блока обнаруженного текста или элемента
    # par_num   --> Номер абзаца обнаруженного текста или элемента
    # line_num  --> Номер строки обнаруженного текста или элемента
    # Преобразовать dict в dataFrame
    df = pd.DataFrame.from_dict(ss_details)
    # Преобразуем поле conf (достоверность) в числовое
    df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
    # Устранение записей с отрицательной уверенностью
    df = df[df.conf != -1]
    # Вычислить среднюю достоверность по страницам
    conf = df.groupby(['page_num'])['conf'].mean().tolist()
    return conf[0]

Переход к основной функции: сканирование изображения:

def ocr_img(
        img: np.array, input_file: str, search_str: str, 
        highlight_readable_text: bool = False, action: str = 'Highlight', 
        show_comparison: bool = False, generate_output: bool = True):
    """Сканирует буфер изображения или файл изображения.
    Предварительно обрабатывает изображение.
    Вызывает движок Tesseract с предопределенными параметрами.
    Вычисляет степень достоверности полученного изображения.
    Обводит зеленый прямоугольник вокруг читаемых текстовых элементов, имеющих показатель достоверности> 30.
    Ищет определенный текст.Выделите или отредактируйте найденные совпадения искомого текста.
    Отображает окно, показывающее читаемые текстовые поля или выделенный или отредактированный текст.
    Создает текстовое содержимое изображения.
    Печатает сводку на консоль."""
    # Если исходный файл изображения вводится как параметр
    if input_file:
        # Чтение изображения с помощью opencv
        img = cv2.imread(input_file)
    # Сохранить копию этого изображения для сравнения
    initial_img = img.copy()
    highlighted_img = img.copy()
    # Convert image to binary
    bin_img = convert_img2bin(img)
    # Calling Tesseract
    # Tesseract Configuration parameters
    # oem --> OCR engine mode = 3 >> Legacy + LSTM mode only (LSTM neutral net mode works the best)
    # psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
    config_param = r'--oem 3 --psm 6'
    # Feeding image to tesseract
    details = pytesseract.image_to_data(
        bin_img, output_type=Output.DICT, config=config_param, lang='eng')
    # The details dictionary contains the information of the input image
    # such as detected text, region, position, information, height, width, confidence score.
    ss_confidence = calculate_ss_confidence(details)
    boxed_img = None
    # Total readable items
    ss_readable_items = 0
    # Total matches found
    ss_matches = 0
    for seq in range(len(details['text'])):
        # Consider only text fields with confidence score > 30 (text is readable)
        if float(details['conf'][seq]) > 30.0:
            ss_readable_items += 1
            # Draws a green rectangle around readable text items having a confidence score > 30
            if highlight_readable_text:
                (x, y, w, h) = (details['left'][seq], details['top']
                                [seq], details['width'][seq], details['height'][seq])
                boxed_img = cv2.rectangle(
                    img, (x, y), (x+w, y+h), (0, 255, 0), 2)
            # Searches for the string
            if search_str:
                results = re.findall(
                    search_str, details['text'][seq], re.IGNORECASE)
                for result in results:
                    ss_matches += 1
                    if action:
                        # Draw a red rectangle around the searchable text
                        (x, y, w, h) = (details['left'][seq], details['top']
                                        [seq], details['width'][seq], details['height'][seq])
                        # Details of the rectangle
                        # Starting coordinate representing the top left corner of the rectangle
                        start_point = (x, y)
                        # Ending coordinate representing the botton right corner of the rectangle
                        end_point = (x + w, y + h)
                        #Color in BGR -- Blue, Green, Red
                        if action == "Highlight":
                            color = (0, 255, 255)  # Yellow
                        elif action == "Redact":
                            color = (0, 0, 0)  # Black
                        # Thickness in px (-1 will fill the entire shape)
                        thickness = -1
                        boxed_img = cv2.rectangle(
                            img, start_point, end_point, color, thickness)
                            
    if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
        highlighted_img = boxed_img.copy()
    # Highlight found matches of the search string
    if ss_matches > 0 and action == "Highlight":
        cv2.addWeighted(boxed_img, 0.4, highlighted_img,
                        1 - 0.4, 0, highlighted_img)
    # Redact found matches of the search string
    elif ss_matches > 0 and action == "Redact":
        highlighted_img = boxed_img.copy()
        #cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
    # save the image
    cv2.imwrite("highlighted-text-image.jpg", highlighted_img)  
    # Displays window showing readable text fields or the highlighted or redacted data
    if show_comparison and (highlight_readable_text or action):
        title = input_file if input_file else 'Compare'
        conc_img = cv2.hconcat([initial_img, highlighted_img])
        display_img(title, conc_img)
    # Generates the text content of the image
    output_data = None
    if generate_output and details:
        output_data = generate_ss_text(details)
    # Prints a summary to the console
    if input_file:
        summary = {
            "File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
        }
        # Printing Summary
        print("## Summary ########################################################")
        print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
        print("###################################################################")
    return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
    # pass image into pytesseract module
    # pytesseract is trained in many languages
    #config_param = r'--oem 3 --psm 6'
    #details = pytesseract.image_to_data(img,config=config_param,lang='eng')
    # print(details)
    # return details

Вышеупомянутое выполняет следующее:

Сканирует буфер изображения или файл изображения.
Предварительно обрабатывает изображение.
Запускает движок Tesseract с заранее заданными параметрами.
Вычисляет степень достоверности полученного содержимого изображения.
Обводит зеленый прямоугольник вокруг читаемых текстовых элементов, имеющих показатель достоверности выше 30.
Выполняет поиск определенного текста в захваченном содержимом изображения.
Подсвечивает или редактирует найденные совпадения с искомым текстом.
Отображает окно с читаемыми текстовыми полями, выделенным или отредактированным текстом.
Создает текстовое содержимое изображения.
Печатает сводку на консоль.

def image_to_byte_array(image: Image):
    """
    Converts an image into a byte array
    """
    imgByteArr = BytesIO()
    image.save(imgByteArr, format=image.format if image.format else 'JPEG')
    imgByteArr = imgByteArr.getvalue()
    return imgByteArr

def ocr_file(**kwargs):
    """Opens the input PDF File.
    Opens a memory buffer for storing the output PDF file.
    Creates a DataFrame for storing pages statistics
    Iterates throughout the chosen pages of the input PDF file
    Grabs a screen-shot of the selected PDF page.
    Converts the screen-shot pix to a numpy array
    Scans the grabbed screen-shot.
    Collects the statistics of the screen-shot(page).
    Saves the content of the screen-shot(page).
    Adds the updated screen-shot (Highlighted, Redacted) to the output file.
    Saves the whole content of the PDF file.
    Saves the output PDF file if required.
    Prints a summary to the console."""
    input_file = kwargs.get('input_file')
    output_file = kwargs.get('output_file')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    highlight_readable_text = kwargs.get('highlight_readable_text')
    action = kwargs.get('action')
    show_comparison = kwargs.get('show_comparison')
    generate_output = kwargs.get('generate_output')
    # Opens the input PDF file
    pdfIn = fitz.open(input_file)
    # Opens a memory buffer for storing the output PDF file.
    pdfOut = fitz.open()
    # Creates an empty DataFrame for storing pages statistics
    dfResult = pd.DataFrame(
        columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
    # Creates an empty DataFrame for storing file content
    if generate_output:
        pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
    # Iterate throughout the pages of the input file
    for pg in range(pdfIn.pageCount):
        if str(pages) != str(None):
            if str(pg) not in str(pages):
                continue
        # Select a page
        page = pdfIn[pg]
        # Rotation angle
        rotate = int(0)
        # PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
        # zoom = 1.33333333 -----> Image size = 1056*816
        # zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read)    = filesize small / Image size = 1584*1224
        # zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
        # zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
        zoom_x = 2
        zoom_y = 2
        # The zoom factor is equal to 2 in order to make text clear
        # Pre-rotate is to rotate if needed.
        mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
        # To captue a specific part of the PDF page
        # rect = page.rect #page size
        # mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
        # clip = fitz.Rect(mp,rect.br) #The area to capture
        # pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
        # Get a screen-shot of the PDF page
        # Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
        # alpha -> Transparancy indicator
        pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
        # convert the screen-shot pix to numpy array
        img = pix2np(pix)
        # Erode image to omit or thin the boundaries of the bright area of the image
        # We apply Erosion on binary images.
        #kernel = np.ones((2,2) , np.uint8)
        #img = cv2.erode(img,kernel,iterations=1)
        upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
            = ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text  # False
                      , action=action  # 'Redact'
                      , show_comparison=show_comparison  # True
                      , generate_output=generate_output  # False
                      )
        # Collects the statistics of the page
        dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
                                   'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
        if generate_output:
            pdfContent = save_page_content(
                pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
        # Convert the numpy array to image object with mode = RGB
        #upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
        upd_img = Image.fromarray(upd_np_array[..., ::-1])
        # Convert the image to byte array
        upd_array = image_to_byte_array(upd_img)
        # Get Page Size
        """
        #To check whether initial page is portrait or landscape
        if page.rect.width > page.rect.height:
            fmt = fitz.PaperRect("a4-1")
        else:
            fmt = fitz.PaperRect("a4")

        #pno = -1 -> Insert after last page
        pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
        """
        pageo = pdfOut.newPage(
            pno=-1, width=page.rect.width, height=page.rect.height)
        pageo.insertImage(page.rect, stream=upd_array)
        #pageo.insertImage(page.rect, stream=upd_img.tobytes())
        #pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
    content_file = None
    if generate_output:
        content_file = save_file_content(
            pdfContent=pdfContent, input_file=input_file)
    summary = {
        "File": input_file, "Total pages": pdfIn.pageCount, 
        "Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(), 
        "Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(), 
        "Output file": output_file, "Content file": content_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("\nPages Statistics:")
    print(dfResult, sep='\n')
    print("###################################################################")
    pdfIn.close()
    if output_file:
        pdfOut.save(output_file)
    pdfOut.close()

Функция image_to_byte_array() конвертирует изображение в байтовый массив.

Функция ocr_file() делает следующее:

Открывает исходный PDF-файл.
Открывает буфер памяти для хранения выходного PDF-файла.
Создает фреймворк pandas для хранения статистики страницы.
Перебирает выбранные страницы входного PDF-файла.Делает снимок экрана (изображение) выбранной страницы входного файла PDF.
Преобразует снимок экрана (pix) в массив NumPy.
Сканирует сделанный снимок экрана.
Собирает статистику снимка экрана (страницы).
Сохраняет содержимое скриншота.
Добавляет обновленный снимок экрана в выходной файл.
Сохраняет все содержимое входного файла PDF в файл CSV.При необходимости сохраняет выходной файл PDF.
Печатает сводку на консоль.

Давайте добавим еще одну функцию для обработки папки, содержащей несколько файлов PDF:

def ocr_folder(**kwargs):
    """Scans all PDF Files within a specified path"""
    input_folder = kwargs.get('input_folder')
    # Run in recursive mode
    recursive = kwargs.get('recursive')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    action = kwargs.get('action')
    generate_output = kwargs.get('generate_output')
    # Loop though the files within the input folder.
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Check if pdf file
            if not filename.endswith('.pdf'):
                continue
            # PDF File found
            inp_pdf_file = os.path.join(foldername, filename)
            print("Processing file =", inp_pdf_file)
            output_file = None
            if search_str:
                # Generate an output file
                output_file = os.path.join(os.path.dirname(
                    inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
            ocr_file(
                input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
            )
        if not recursive:
            break

Эта функция предназначена для сканирования файлов PDF, содержащихся в определенной папке. Он перебирает файлы указанной папки рекурсивно или нет, в зависимости от значения параметра рекурсивно, и обрабатывает эти файлы один за другим.

Он принимает следующие параметры:

input_folder, путь к папке, содержащей файлы PDF для обработки.
search_str, текст для поиска и манипулирования.
recursive, запускать ли этот процесс рекурсивно, перебирая вложенные папки или нет.
action, действие, которое необходимо выполнить из следующего: Highlight, Redact.
pages: страницы для рассмотрения.generate_output: выберите, сохранять ли содержимое входного PDF-файла вCSV-файл или нет

Прежде чем мы закончим, давайте определим полезные функции для анализа аргументов командной строки:

def is_valid_path(path):
    """Validates the path inputted and checks whether it is a file path or a folder path"""
    if not path:
        raise ValueError(f"Invalid Path")
    if os.path.isfile(path):
        return path
    elif os.path.isdir(path):
        return path
    else:
        raise ValueError(f"Invalid Path {path}")


def parse_args():
    """Get user command line parameters"""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', type=is_valid_path,
                        required=True, help="Enter the path of the file or the folder to process")
    parser.add_argument('-a', '--action', choices=[
                        'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
    parser.add_argument('-s', '--search-str', dest='search_str',
                        type=str, help="Enter a valid search string")
    parser.add_argument('-p', '--pages', dest='pages', type=tuple,
                        help="Enter the pages to consider in the PDF file, e.g. (0,1)")
    parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
    path = parser.parse_known_args()[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', dest='output_file',
                            type=str, help="Enter a valid output file")
        parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
        parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
    if os.path.isdir(path):
        parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
    # To Porse The Command Line Arguments
    args = vars(parser.parse_args())
    # To Display The Command Line Arguments
    print("## Command Arguments #################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
    print("######################################################################")
    return args

Функция is_valid_path() проверяет путь, введенный в качестве параметра, и проверяет, является ли он путем к файлу или путем к каталогу.

Функция parse_args() определяет и устанавливает соответствующие ограничения для аргументов командной строки пользователя при запуске этой утилиты.

Ниже приведены пояснения ко всем параметрам:

input_path: обязательный параметр для ввода пути к файлу или папке. для обработки этот параметр связан с ранее определенной функциейis_valid_path ().
action: действие, выполняемое среди списка предопределенных параметров для Избегайте ошибочного выбора.
search_str: текст, который нужно искать для обработки.
pages: страницы, которые следует учитывать при обработке файла PDF.
generate_content: указывает, следует ли создавать захваченное содержимое входного файла, будь то изображение или PDF в файл CSV или нет.
output_file: путь к выходному файлу. Заполнение этого аргумента ограничивается выбором файла в качестве входных данных, а не каталога. & Nbsp;
highlight_readable_text: для рисования зеленых прямоугольников вокруг читаемых текстовых полей с более высоким показателем достоверности чем 30.
show_comparison: отображает окно, в котором сравнивается исходное изображение и обработанное. изображение.
recursive: следует ли обрабатывать папку рекурсивно или нет. Заполнение этого аргумента ограничено выбором каталога.

Наконец, напишем основной код, использующий ранее определенные функции:

if __name__ == '__main__':
    # Parsing command line arguments entered by user
    args = parse_args()
    # If File Path
    if os.path.isfile(args['input_path']):
        # Process a file
        if filetype.is_image(args['input_path']):
            ocr_img(
                # if 'search_str' in (args.keys()) else None
                img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
            )
        else:
            ocr_file(
                input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
            )
    # If Folder Path
    elif os.path.isdir(args['input_path']):
        # Process a folder
        ocr_folder(
            input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
        )

Проверим нашу программу:

$python pdf_ocr.py

Результат:

usage: pdf_ocr.py [-h] -i INPUT_PATH [-a {Highlight,Redact}] [-s SEARCH_STR] [-p PAGES] [-g GENERATE_OUTPUT]

Available Options

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        Enter the path of the file or the folder to process
  -a {Highlight,Redact}, --action {Highlight,Redact}
                        Choose to highlight or to redact
  -s SEARCH_STR, --search_str SEARCH_STR
                        Enter a valid search string
  -p PAGES, --pages PAGES
                        Enter the pages to consider e.g.: (0,1)
  -g GENERATE_OUTPUT, --generate_output GENERATE_OUTPUT
                        Generate content in a CSV file

Прежде чем исследовать наши тестовые сценарии, обратите внимание на следующее:

Чтобы избежать ошибки PermissionError, закройте входной файл перед запуском этой утилиты.
Строка поиска соответствует правилам регулярных выражений с использованием встроенного в Python модуля re.
Например, установка для строки поиска значения «organi[sz]e» соответствует как "organise", так и "organize".

Во-первых, давайте попробуем ввести изображение (вы можете получить его здесь, если хотите получить тот же результат) без использования какого-либо файла PDF:

$python pdf_ocr.py -s "BERT" -a Highlight -i example-image-containing-text.jpg

На выходе будет следующее:

## Command Arguments #################################################
input_path:example-image-containing-text.jpg
action:Highlight
search_str:BERT
pages:None
generate_output:False
output_file:None
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:example-image-containing-text.jpg
Total readable words:192
Total matches:3
Confidence score:89.89337547979804
###################################################################

И в текущем каталоге появилось новое изображение:

Вы можете передать -t или --highlight-readable-text, чтобы выделить весь обнаруженный текст (в другом формате, чтобы отличать строку поиска от других).

Вы также можете передать -c или --show-compare, чтобы отобразить исходное изображение и отредактированное изображение в одном окне.

Теперь, когда мы работаем с изображениями, давайте попробуем для файлов PDF:

$ python pdf_ocr.py -s "BERT" -i image.pdf -o output.pdf --generate-output -a "Highlight"

image.pdf - это простой PDF-файл, содержащий изображение из предыдущего примера (опять же, вы можете получить его здесь).

На этот раз мы передали PDF-файл аргументу -i и output.pdf в качестве результирующего PDF-файла (где происходит выделение всех элементов). Приведенная выше команда генерирует следующий вывод:

## Command Arguments #################################################
input_path:image.pdf
action:Highlight
search_str:BERT
pages:None
generate_output:True
output_file:output.pdf
highlight_readable_text:False
show_comparison:False
######################################################################
## Summary ########################################################
File:image.pdf
Total pages:1
Processed pages:1
Total readable words:192.0
Total matches:3.0
Confidence score:83.1775128855722
Output file:output.pdf
Content file:image.csv

Pages Statistics:
   page  page_readable_items  page_matches  page_total_confidence
0   1.0                192.0           3.0              83.177513
###################################################################

После выполнения создается файл output.pdf, содержащий тот же исходный PDF-файл, но с выделенным текстом. Кроме того, теперь у нас есть статистика по нашему PDF-файлу, в котором было обнаружено 192 слова, а 3 слова были сопоставлены с помощью нашего поиска с достоверностью около 83,2%. Также создается файл CSV, который включает обнаруженный текст из изображения в каждой строке.

Заключение

Есть и другие параметры, которые мы не использовали в наших примерах, не стесняйтесь их исследовать. Вы также можете передать всю папку в аргумент -i, чтобы сканировать коллекцию файлов PDF.

Tesseract идеально подходит для сканирования чистых и четких документов. Низкое качество сканирования может привести к плохим результатам при распознавании текста. Как обычно, он не дает точных результатов для изображений, на которые влияют артефакты, включая частичное перекрытие, искаженную перспективу и сложный фон.

А вот и полный код:

# Import Libraries
import os
import re
import argparse
import pytesseract
from pytesseract import Output
import cv2
import numpy as np
import fitz
from io import BytesIO
from PIL import Image
import pandas as pd
import filetype

# Path Of The Tesseract OCR engine
TESSERACT_PATH = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"
# Include tesseract executable
pytesseract.pytesseract.tesseract_cmd = TESSERACT_PATH


def pix2np(pix):
    """
    Converts a pixmap buffer into a numpy array
    """
    # pix.samples = sequence of bytes of the image pixels like RGBA
    #pix.h = height in pixels
    #pix.w = width in pixels
    # pix.n = number of components per pixel (depends on the colorspace and alpha)
    im = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.h, pix.w, pix.n)
    try:
        im = np.ascontiguousarray(im[..., [2, 1, 0]])  # RGB To BGR
    except IndexError:
        # Convert Gray to RGB
        im = cv2.cvtColor(im, cv2.COLOR_GRAY2RGB)
        im = np.ascontiguousarray(im[..., [2, 1, 0]])  # RGB To BGR
    return im
################################################################################
# Image Pre-Processing Functions to improve output accurracy
# Convert to grayscale


def grayscale(img):
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)


# Remove noise
def remove_noise(img):
    return cv2.medianBlur(img, 5)


# Thresholding
def threshold(img):
    # return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    return cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]


# dilation
def dilate(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.dilate(img, kernel, iterations=1)


# erosion
def erode(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(img, kernel, iterations=1)


# opening -- erosion followed by a dilation
def opening(img):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)


# canny edge detection
def canny(img):
    return cv2.Canny(img, 100, 200)


# skew correction
def deskew(img):
    coords = np.column_stack(np.where(img > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = img.shape[:2]
    center = (w//2, h//2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        img, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated


# template matching
def match_template(img, template):
    return cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)


def convert_img2bin(img):
    """
    Pre-processes the image and generates a binary output
    """
    # Convert the image into a grayscale image
    output_img = grayscale(img)
    # Invert the grayscale image by flipping pixel values.
    # All pixels that are grater than 0 are set to 0 and all pixels that are = to 0 are set to 255
    output_img = cv2.bitwise_not(output_img)
    # Converting image to binary by Thresholding in order to show a clear separation between white and blacl pixels.
    output_img = threshold(output_img)
    return output_img


def display_img(title, img):
    """Displays an image on screen and maintains the output until the user presses a key"""
    cv2.namedWindow('img', cv2.WINDOW_NORMAL)
    cv2.setWindowTitle('img', title)
    cv2.resizeWindow('img', 1200, 900)
    # Display Image on screen
    cv2.imshow('img', img)
    # Mantain output until user presses a key
    cv2.waitKey(0)
    # Destroy windows when user presses a key
    cv2.destroyAllWindows()


def generate_ss_text(ss_details):
    """Loops through the captured text of an image and arranges this text line by line.
    This function depends on the image layout."""
    # Arrange the captured text after scanning the page
    parse_text = []
    word_list = []
    last_word = ''
    # Loop through the captured text of the entire page
    for word in ss_details['text']:
        # If the word captured is not empty
        if word != '':
            # Add it to the line word list
            word_list.append(word)
            last_word = word
        if (last_word != '' and word == '') or (word == ss_details['text'][-1]):
            parse_text.append(word_list)
            word_list = []
    return parse_text


def search_for_text(ss_details, search_str):
    """Search for the search string within the image content"""
    # Find all matches within one page
    results = re.findall(search_str, ss_details['text'], re.IGNORECASE)
    # In case multiple matches within one page
    for result in results:
        yield result


def save_page_content(pdfContent, page_id, page_data):
    """Appends the content of a scanned page, line by line, to a pandas DataFrame."""
    if page_data:
        for idx, line in enumerate(page_data, 1):
            line = ' '.join(line)
            pdfContent = pdfContent.append(
                {'page': page_id, 'line_id': idx, 'line': line}, ignore_index=True
            )
    return pdfContent


def save_file_content(pdfContent, input_file):
    """Outputs the content of the pandas DataFrame to a CSV file having the same path as the input_file
    but with different extension (.csv)"""
    content_file = os.path.join(os.path.dirname(input_file), os.path.splitext(
        os.path.basename(input_file))[0] + ".csv")
    pdfContent.to_csv(content_file, sep=',', index=False)
    return content_file


def calculate_ss_confidence(ss_details: dict):
    """Calculate the confidence score of the text grabbed from the scanned image."""
    # page_num  --> Page number of the detected text or item
    # block_num --> Block number of the detected text or item
    # par_num   --> Paragraph number of the detected text or item
    # line_num  --> Line number of the detected text or item
    # Convert the dict to dataFrame
    df = pd.DataFrame.from_dict(ss_details)
    # Convert the field conf (confidence) to numeric
    df['conf'] = pd.to_numeric(df['conf'], errors='coerce')
    # Elliminate records with negative confidence
    df = df[df.conf != -1]
    # Calculate the mean confidence by page
    conf = df.groupby(['page_num'])['conf'].mean().tolist()
    return conf[0]


def ocr_img(
        img: np.array, input_file: str, search_str: str, 
        highlight_readable_text: bool = False, action: str = 'Highlight', 
        show_comparison: bool = False, generate_output: bool = True):
    """Scans an image buffer or an image file.
    Pre-processes the image.
    Calls the Tesseract engine with pre-defined parameters.
    Calculates the confidence score of the image grabbed content.
    Draws a green rectangle around readable text items having a confidence score > 30.
    Searches for a specific text.
    Highlight or redact found matches of the searched text.
    Displays a window showing readable text fields or the highlighted or redacted text.
    Generates the text content of the image.
    Prints a summary to the console."""
    # If image source file is inputted as a parameter
    if input_file:
        # Reading image using opencv
        img = cv2.imread(input_file)
    # Preserve a copy of this image for comparison purposes
    initial_img = img.copy()
    highlighted_img = img.copy()
    # Convert image to binary
    bin_img = convert_img2bin(img)
    # Calling Tesseract
    # Tesseract Configuration parameters
    # oem --> OCR engine mode = 3 >> Legacy + LSTM mode only (LSTM neutral net mode works the best)
    # psm --> page segmentation mode = 6 >> Assume as single uniform block of text (How a page of text can be analyzed)
    config_param = r'--oem 3 --psm 6'
    # Feeding image to tesseract
    details = pytesseract.image_to_data(
        bin_img, output_type=Output.DICT, config=config_param, lang='eng')
    # The details dictionary contains the information of the input image
    # such as detected text, region, position, information, height, width, confidence score.
    ss_confidence = calculate_ss_confidence(details)
    boxed_img = None
    # Total readable items
    ss_readable_items = 0
    # Total matches found
    ss_matches = 0
    for seq in range(len(details['text'])):
        # Consider only text fields with confidence score > 30 (text is readable)
        if float(details['conf'][seq]) > 30.0:
            ss_readable_items += 1
            # Draws a green rectangle around readable text items having a confidence score > 30
            if highlight_readable_text:
                (x, y, w, h) = (details['left'][seq], details['top']
                                [seq], details['width'][seq], details['height'][seq])
                boxed_img = cv2.rectangle(
                    img, (x, y), (x+w, y+h), (0, 255, 0), 2)
            # Searches for the string
            if search_str:
                results = re.findall(
                    search_str, details['text'][seq], re.IGNORECASE)
                for result in results:
                    ss_matches += 1
                    if action:
                        # Draw a red rectangle around the searchable text
                        (x, y, w, h) = (details['left'][seq], details['top']
                                        [seq], details['width'][seq], details['height'][seq])
                        # Details of the rectangle
                        # Starting coordinate representing the top left corner of the rectangle
                        start_point = (x, y)
                        # Ending coordinate representing the botton right corner of the rectangle
                        end_point = (x + w, y + h)
                        #Color in BGR -- Blue, Green, Red
                        if action == "Highlight":
                            color = (0, 255, 255)  # Yellow
                        elif action == "Redact":
                            color = (0, 0, 0)  # Black
                        # Thickness in px (-1 will fill the entire shape)
                        thickness = -1
                        boxed_img = cv2.rectangle(
                            img, start_point, end_point, color, thickness)
                            
    if ss_readable_items > 0 and highlight_readable_text and not (ss_matches > 0 and action in ("Highlight", "Redact")):
        highlighted_img = boxed_img.copy()
    # Highlight found matches of the search string
    if ss_matches > 0 and action == "Highlight":
        cv2.addWeighted(boxed_img, 0.4, highlighted_img,
                        1 - 0.4, 0, highlighted_img)
    # Redact found matches of the search string
    elif ss_matches > 0 and action == "Redact":
        highlighted_img = boxed_img.copy()
        #cv2.addWeighted(boxed_img, 1, highlighted_img, 0, 0, highlighted_img)
    # save the image
    cv2.imwrite("highlighted-text-image.jpg", highlighted_img)  
    # Displays window showing readable text fields or the highlighted or redacted data
    if show_comparison and (highlight_readable_text or action):
        title = input_file if input_file else 'Compare'
        conc_img = cv2.hconcat([initial_img, highlighted_img])
        display_img(title, conc_img)
    # Generates the text content of the image
    output_data = None
    if generate_output and details:
        output_data = generate_ss_text(details)
    # Prints a summary to the console
    if input_file:
        summary = {
            "File": input_file, "Total readable words": ss_readable_items, "Total matches": ss_matches, "Confidence score": ss_confidence
        }
        # Printing Summary
        print("## Summary ########################################################")
        print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
        print("###################################################################")
    return highlighted_img, ss_readable_items, ss_matches, ss_confidence, output_data
    # pass image into pytesseract module
    # pytesseract is trained in many languages
    #config_param = r'--oem 3 --psm 6'
    #details = pytesseract.image_to_data(img,config=config_param,lang='eng')
    # print(details)
    # return details


def image_to_byte_array(image: Image):
    """
    Converts an image into a byte array
    """
    imgByteArr = BytesIO()
    image.save(imgByteArr, format=image.format if image.format else 'JPEG')
    imgByteArr = imgByteArr.getvalue()
    return imgByteArr


def ocr_file(**kwargs):
    """Opens the input PDF File.
    Opens a memory buffer for storing the output PDF file.
    Creates a DataFrame for storing pages statistics
    Iterates throughout the chosen pages of the input PDF file
    Grabs a screen-shot of the selected PDF page.
    Converts the screen-shot pix to a numpy array
    Scans the grabbed screen-shot.
    Collects the statistics of the screen-shot(page).
    Saves the content of the screen-shot(page).
    Adds the updated screen-shot (Highlighted, Redacted) to the output file.
    Saves the whole content of the PDF file.
    Saves the output PDF file if required.
    Prints a summary to the console."""
    input_file = kwargs.get('input_file')
    output_file = kwargs.get('output_file')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    highlight_readable_text = kwargs.get('highlight_readable_text')
    action = kwargs.get('action')
    show_comparison = kwargs.get('show_comparison')
    generate_output = kwargs.get('generate_output')
    # Opens the input PDF file
    pdfIn = fitz.open(input_file)
    # Opens a memory buffer for storing the output PDF file.
    pdfOut = fitz.open()
    # Creates an empty DataFrame for storing pages statistics
    dfResult = pd.DataFrame(
        columns=['page', 'page_readable_items', 'page_matches', 'page_total_confidence'])
    # Creates an empty DataFrame for storing file content
    if generate_output:
        pdfContent = pd.DataFrame(columns=['page', 'line_id', 'line'])
    # Iterate throughout the pages of the input file
    for pg in range(pdfIn.pageCount):
        if str(pages) != str(None):
            if str(pg) not in str(pages):
                continue
        # Select a page
        page = pdfIn[pg]
        # Rotation angle
        rotate = int(0)
        # PDF Page is converted into a whole picture 1056*816 and then for each picture a screenshot is taken.
        # zoom = 1.33333333 -----> Image size = 1056*816
        # zoom = 2 ---> 2 * Default Resolution (text is clear, image text is hard to read)    = filesize small / Image size = 1584*1224
        # zoom = 4 ---> 4 * Default Resolution (text is clear, image text is barely readable) = filesize large
        # zoom = 8 ---> 8 * Default Resolution (text is clear, image text is readable) = filesize large
        zoom_x = 2
        zoom_y = 2
        # The zoom factor is equal to 2 in order to make text clear
        # Pre-rotate is to rotate if needed.
        mat = fitz.Matrix(zoom_x, zoom_y).preRotate(rotate)
        # To captue a specific part of the PDF page
        # rect = page.rect #page size
        # mp = rect.tl + (rect.bl - (0.75)/zoom_x) #rectangular area 56 = 75/1.3333
        # clip = fitz.Rect(mp,rect.br) #The area to capture
        # pix = page.getPixmap(matrix=mat, alpha=False,clip=clip)
        # Get a screen-shot of the PDF page
        # Colorspace -> represents the color space of the pixmap (csRGB, csGRAY, csCMYK)
        # alpha -> Transparancy indicator
        pix = page.getPixmap(matrix=mat, alpha=False, colorspace="csGRAY")
        # convert the screen-shot pix to numpy array
        img = pix2np(pix)
        # Erode image to omit or thin the boundaries of the bright area of the image
        # We apply Erosion on binary images.
        #kernel = np.ones((2,2) , np.uint8)
        #img = cv2.erode(img,kernel,iterations=1)
        upd_np_array, pg_readable_items, pg_matches, pg_total_confidence, pg_output_data \
            = ocr_img(img=img, input_file=None, search_str=search_str, highlight_readable_text=highlight_readable_text  # False
                      , action=action  # 'Redact'
                      , show_comparison=show_comparison  # True
                      , generate_output=generate_output  # False
                      )
        # Collects the statistics of the page
        dfResult = dfResult.append({'page': (pg+1), 'page_readable_items': pg_readable_items,
                                   'page_matches': pg_matches, 'page_total_confidence': pg_total_confidence}, ignore_index=True)
        if generate_output:
            pdfContent = save_page_content(
                pdfContent=pdfContent, page_id=(pg+1), page_data=pg_output_data)
        # Convert the numpy array to image object with mode = RGB
        #upd_img = Image.fromarray(np.uint8(upd_np_array)).convert('RGB')
        upd_img = Image.fromarray(upd_np_array[..., ::-1])
        # Convert the image to byte array
        upd_array = image_to_byte_array(upd_img)
        # Get Page Size
        """
        #To check whether initial page is portrait or landscape
        if page.rect.width > page.rect.height:
            fmt = fitz.PaperRect("a4-1")
        else:
            fmt = fitz.PaperRect("a4")

        #pno = -1 -> Insert after last page
        pageo = pdfOut.newPage(pno = -1, width = fmt.width, height = fmt.height)
        """
        pageo = pdfOut.newPage(
            pno=-1, width=page.rect.width, height=page.rect.height)
        pageo.insertImage(page.rect, stream=upd_array)
        #pageo.insertImage(page.rect, stream=upd_img.tobytes())
        #pageo.showPDFpage(pageo.rect, pdfDoc, page.number)
    content_file = None
    if generate_output:
        content_file = save_file_content(
            pdfContent=pdfContent, input_file=input_file)
    summary = {
        "File": input_file, "Total pages": pdfIn.pageCount, 
        "Processed pages": dfResult['page'].count(), "Total readable words": dfResult['page_readable_items'].sum(), 
        "Total matches": dfResult['page_matches'].sum(), "Confidence score": dfResult['page_total_confidence'].mean(), 
        "Output file": output_file, "Content file": content_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("\nPages Statistics:")
    print(dfResult, sep='\n')
    print("###################################################################")
    pdfIn.close()
    if output_file:
        pdfOut.save(output_file)
    pdfOut.close()


def ocr_folder(**kwargs):
    """Scans all PDF Files within a specified path"""
    input_folder = kwargs.get('input_folder')
    # Run in recursive mode
    recursive = kwargs.get('recursive')
    search_str = kwargs.get('search_str')
    pages = kwargs.get('pages')
    action = kwargs.get('action')
    generate_output = kwargs.get('generate_output')
    # Loop though the files within the input folder.
    for foldername, dirs, filenames in os.walk(input_folder):
        for filename in filenames:
            # Check if pdf file
            if not filename.endswith('.pdf'):
                continue
            # PDF File found
            inp_pdf_file = os.path.join(foldername, filename)
            print("Processing file =", inp_pdf_file)
            output_file = None
            if search_str:
                # Generate an output file
                output_file = os.path.join(os.path.dirname(
                    inp_pdf_file), 'ocr_' + os.path.basename(inp_pdf_file))
            ocr_file(
                input_file=inp_pdf_file, output_file=output_file, search_str=search_str, pages=pages, highlight_readable_text=False, action=action, show_comparison=False, generate_output=generate_output
            )
        if not recursive:
            break


def is_valid_path(path):
    """Validates the path inputted and checks whether it is a file path or a folder path"""
    if not path:
        raise ValueError(f"Invalid Path")
    if os.path.isfile(path):
        return path
    elif os.path.isdir(path):
        return path
    else:
        raise ValueError(f"Invalid Path {path}")


def parse_args():
    """Get user command line parameters"""
    parser = argparse.ArgumentParser(description="Available Options")
    parser.add_argument('-i', '--input-path', type=is_valid_path,
                        required=True, help="Enter the path of the file or the folder to process")
    parser.add_argument('-a', '--action', choices=[
                        'Highlight', 'Redact'], type=str, help="Choose to highlight or to redact")
    parser.add_argument('-s', '--search-str', dest='search_str',
                        type=str, help="Enter a valid search string")
    parser.add_argument('-p', '--pages', dest='pages', type=tuple,
                        help="Enter the pages to consider in the PDF file, e.g. (0,1)")
    parser.add_argument("-g", "--generate-output", action="store_true", help="Generate text content in a CSV file")
    path = parser.parse_known_args()[0].input_path
    if os.path.isfile(path):
        parser.add_argument('-o', '--output_file', dest='output_file',
                            type=str, help="Enter a valid output file")
        parser.add_argument("-t", "--highlight-readable-text", action="store_true", help="Highlight readable text in the generated image")
        parser.add_argument("-c", "--show-comparison", action="store_true", help="Show comparison between captured image and the generated image")
    if os.path.isdir(path):
        parser.add_argument("-r", "--recursive", action="store_true", help="Whether to process the directory recursively")
    # To Porse The Command Line Arguments
    args = vars(parser.parse_args())
    # To Display The Command Line Arguments
    print("## Command Arguments #################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
    print("######################################################################")
    return args


if __name__ == '__main__':
    # Parsing command line arguments entered by user
    args = parse_args()
    # If File Path
    if os.path.isfile(args['input_path']):
        # Process a file
        if filetype.is_image(args['input_path']):
            ocr_img(
                # if 'search_str' in (args.keys()) else None
                img=None, input_file=args['input_path'], search_str=args['search_str'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
            )
        else:
            ocr_file(
                input_file=args['input_path'], output_file=args['output_file'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], highlight_readable_text=args['highlight_readable_text'], action=args['action'], show_comparison=args['show_comparison'], generate_output=args['generate_output']
            )
    # If Folder Path
    elif os.path.isdir(args['input_path']):
        # Process a folder
        ocr_folder(
            input_folder=args['input_path'], recursive=args['recursive'], search_str=args['search_str'] if 'search_str' in (args.keys()) else None, pages=args['pages'], action=args['action'], generate_output=args['generate_output']
        )

По мотивам How to Extract Text from Images in PDF Files with Python

Получить в PDF

Как извлечь текст из изображений в файлах PDF

Заключение

Опубликовано Вадим В. Костерин

Оставьте комментарий

Отменить ответ