From 34ab1681ef3db835b2dddc7ef138686df8ab7db3 Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 16 Feb 2026 15:47:29 +0000
Subject: [PATCH] Updates for 0.3.4 API.

---
 docs/pymupdf-layout/index.rst | 58 +++++++++++++++++++++++++++++++++++
 docs/pymupdf4llm/api.rst      | 20 ++++++++++--
 2 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/docs/pymupdf-layout/index.rst b/docs/pymupdf-layout/index.rst
index bd89e2501..39218d437 100644
--- a/docs/pymupdf-layout/index.rst
+++ b/docs/pymupdf-layout/index.rst
@@ -159,6 +159,64 @@ The decision tree for whether OCR is actually used or not depends on the followi

 .. image:: ../images/layout-ocr-flow.png

+
+.. _pymupdf_layout_ocr_engines:
+
+OCR engines
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Tesseract
+""""""""""""""""""""""""""""""""""
+
+Tesseract is the default OCR engine used by |PyMuPDF4LLM| when the above criteria are met. It is a widely used open-source OCR engine that supports multiple languages and is known for its accuracy.
+
+
+.. _pymupdf_layout_rapid_ocr:
+
+RapidOCR
+""""""""""""""""""""""""""""""""""
+
+If you want to use an OCR engine other than Tesseract, you can do so by providing your own OCR function via the `ocr_function` parameter of the :ref:`PyMuPDF4LLM API `.
+
+If `RapidOCR `_ and the RapidOCR ONNX Runtime are available, you can use a pre-made callable OCR function for it, which is provided in the `pymupdf4llm.ocr` module as `rapidocr_api.exec_ocr`.
+
+
+Example
+''''''''''''''''''''''''''''
+::
+
+    from pymupdf4llm.ocr import rapidocr_api
+
+    md = pymupdf4llm.to_markdown(
+        doc,
+        ocr_function=rapidocr_api.exec_ocr,
+        force_ocr=True
+    )
+
+In this way RapidOCR can be used as an alternative OCR engine to Tesseract for all pages (if `force_ocr=True`) or just for those pages which meet the default criteria for applying OCR (if `force_ocr=False` or omitted).
+
+
+RapidOCR & Tesseract side-by-side
+""""""""""""""""""""""""""""""""""
+
+If you want to use both OCR engines side-by-side, you can do so by implementing a custom OCR function which calls both OCR engines - one for bbox recognition (RapidOCR) and the other for text recognition (Tesseract) - and then combines their results.
+
+This pre-made callable OCR function can be found in the `pymupdf4llm.ocr` module as `rapidtess_api.exec_ocr`.
+
+Example
+''''''''''''''''''''''''''''
+::
+
+    from pymupdf4llm.ocr import rapidtess_api
+
+    md = pymupdf4llm.to_markdown(
+        doc,
+        ocr_function=rapidtess_api.exec_ocr,
+        force_ocr=True
+    )
+
+
+
 ----

 .. _pymupdf_layout_and_pymupdf4llm_api:
diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst
index f5e1304ec..11e2015c7 100644
--- a/docs/pymupdf4llm/api.rst
+++ b/docs/pymupdf4llm/api.rst
@@ -45,7 +45,9 @@ The |PyMuPDF4LLM| API
     dpi: int = 150, \
     use_ocr: bool = True, \
     ocr_language: str = "eng", \
-    ocr_dpi: int = 400, \
+    ocr_dpi: int = 300, \
+    ocr_function: callable = None, \
+    force_ocr: bool = False, \
     embed_images: bool = False, \
     extract_words: bool = False, \
     filename: str | None = None, \
@@ -75,17 +77,29 @@ The |PyMuPDF4LLM| API

     Reads the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that **support for building page chunks** from the |Markdown| text is supported.

+
     :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`.

     :arg bool detect_bg_color: |PyMuPDFLayoutMode_Ignored| does a simple check for the general background color of the pages (default is ``True``). If any text or vector has this color it will be ignored. May increase detection accuracy.

     :arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True` or `embed_images=True`. Default value is 150.

-    :arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability ` to help analyse the page.
+    :arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability ` to help analyse the page. This will OCR pages as determined by the default criteria.

     :arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German.

-    :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 400. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Large values may increase the OCR precision but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high.
+    :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Large values may increase the OCR precision but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high.
+
+    :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), the built-in Tesseract OCR engine will be used.
+
+    :arg bool force_ocr: |PyMuPDFLayoutMode_Valid| if `True`, OCR will be applied to all pages regardless of their content.
+
+        Only works when `ocr_function` is specified.
+
+        This may be useful for documents which are known to be image-based and thus profit from OCR, but which do not meet the default criteria for applying OCR. Default is `False` meaning that OCR will only be applied to pages which meet the default criteria.
+
+        .. warning::
+            Requires `ocr_function` to be specified otherwise an exception will be raised.

     :arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text.
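
Reviewer note: the documented parameters interact in two ways that are easy to get wrong on the caller side: `ocr_language` joins Tesseract codes with "+", and `force_ocr=True` requires `ocr_function`. The sketch below is one possible way to check both before calling `to_markdown`. The helper names `join_ocr_languages` and `build_ocr_kwargs` are hypothetical and not part of |PyMuPDF4LLM|, and raising `ValueError` is this sketch's own choice, since the patch does not say which exception the library raises.

```python
from typing import Callable, Iterable, Optional


def join_ocr_languages(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', as described for `ocr_language`."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def build_ocr_kwargs(
    languages: Iterable[str] = ("eng",),
    ocr_function: Optional[Callable] = None,
    force_ocr: bool = False,
    ocr_dpi: int = 300,
) -> dict:
    """Hypothetical helper: collect OCR keyword arguments for `to_markdown`.

    Mirrors the documented rule that `force_ocr=True` needs `ocr_function`.
    """
    if force_ocr and ocr_function is None:
        # Fail early on the caller side; the exception type is an assumption.
        raise ValueError("force_ocr=True requires ocr_function to be specified")

    kwargs = {
        "use_ocr": True,
        "ocr_language": join_ocr_languages(languages),
        "ocr_dpi": ocr_dpi,
        "force_ocr": force_ocr,
    }
    if ocr_function is not None:
        # Only pass the callable when one was given, so the default applies otherwise.
        kwargs["ocr_function"] = ocr_function
    return kwargs
```

Assuming the package is installed, the result could then be passed as `pymupdf4llm.to_markdown(doc, **build_ocr_kwargs(languages=["eng", "deu"]))`.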