Error must install unstructured_pytesseract when using paddleocr #4007

Open
@anggapark

Version:
unstructured: 0.17.2
unstructured-client: 0.36.0
unstructured-inference: 1.0.2
unstructured_paddleocr: 2.10.0
paddlepaddle: 3.0.0

Environment variable set before running:
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

I'm using paddle as my OCR model, but when I run this code

raw_pdf = partition_pdf(
    filename=filepath,
    strategy="hi_res",
    infer_table_structure=True,
    extract_images_in_pdf=True,
    # extract_image_block_types=["Image", "Table"],
    # extract_image_block_output_dir=path,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
)

it fails with this error:
ModuleNotFoundError: No module named 'unstructured_pytesseract'

Why do I have to install unstructured_pytesseract when I already have unstructured_paddleocr?
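
For anyone trying to reproduce this, here is a small diagnostic sketch that may help narrow things down. It only reports which OCR-related modules Python can find in the current environment and what `OCR_AGENT` is set to. It does not call unstructured at all. The helper name `installed_modules` is made up for illustration. The module names come from the traceback and version list above, and `paddle` is assumed to be the import name for the `paddlepaddle` package.

```python
import importlib.util
import os


def installed_modules(names):
    """Map each top-level module name to True if Python can locate it for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names taken from the traceback and the version list in this report;
    # "paddle" is assumed to be the import name of the paddlepaddle package.
    report = installed_modules(
        ["unstructured_pytesseract", "unstructured_paddleocr", "paddle"]
    )
    print("OCR_AGENT =", os.environ.get("OCR_AGENT", "<not set>"))
    for name, found in report.items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If `unstructured_pytesseract` shows as missing while the paddle modules show as installed, that matches the situation described here: the agent is configured for paddle, yet some import path still reaches for the tesseract package.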
