Consider removing PyMuPDF for dependency that is not AGPL licensed #486

madisonmay · 2021-09-21T15:26:08Z

Hi folks! Love the project, but the dependency on PyMuPDF can pose a problem for commercial use because of its AGPL license (I think use of PyMuPDF may technically require the project itself to be GPL licensed, although I'm no expert here). Would you consider integrating a PR that replaces the PyMuPDF dependency with an alternative?

fg-mindee · 2021-09-27T08:14:00Z

Hi @madisonmay,

Thanks for pointing this out!
You are correct: we did not pay attention to the license of PyMuPDF, and AGPL does indeed have extra requirements. We'll try to come up with a proposal, but here are the elements to consider:

* as shown over here, PyMuPDF is much, much faster than the other options

* we're starting to consider making a few deps (and the corresponding features) "extras", such as HTML support through weasyprint; the same approach could be used here

* since part of the project is about accessibility and ease of use, it's going to be rather tricky not being able to read PDFs through the library. Personally, I would be concerned about removing PDF reading. If you have suggestions for other options, happy to discuss this :)

fg-mindee · 2021-12-07T12:40:12Z

Quick update here, we're looking for suggestions from everyone 🙏

We have two main options:

* keeping PyMuPDF and trying to figure out whether making some parts of the library optional (as an extra) would work legally to avoid AGPL

* looking for a fast alternative, and here we have absolutely no clue (cf. "[documents] Benchmark PDF document reading + numpy conversion options" #23: PyMuPDF is much faster than the other alternatives)
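For anyone who wants to compare candidate libraries, here is a minimal timing sketch. It is not the benchmark from #23; the function name `benchmark_readers` and its parameters are hypothetical, and each reader is assumed to be a callable that takes a PDF path and does the full read (for example, rendering every page to a numpy array).

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def benchmark_readers(
    readers: Dict[str, Callable[[str], object]],
    pdf_paths: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the mean wall-clock seconds per call for each reader.

    `readers` maps a label (e.g. a library name) to a callable that reads
    one PDF given its path. Both names are illustrative assumptions.
    """
    paths = list(pdf_paths)
    if not paths or repeats < 1:
        raise ValueError("need at least one PDF path and one repeat")

    results: Dict[str, float] = {}
    for name, read_fn in readers.items():
        timings = []
        for path in paths:
            for _ in range(repeats):
                start = time.perf_counter()
                read_fn(path)  # the work being measured
                timings.append(time.perf_counter() - start)
        results[name] = mean(timings)
    return results
```

Running the same set of files through every candidate keeps the comparison fair; the absolute numbers will depend on the machine and the documents used.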

charlesmindee · 2021-12-30T08:52:00Z

That is indeed a tricky issue. Could you explain the first point, @fg-mindee? How can we make the license work if the dependency is optional?

fg-mindee · 2022-01-10T19:59:37Z

@charlesmindee I have to check but the idea was to put all our PDF features in an extra build (not the core/default one)
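To illustrate what an "extra build" could look like on the code side, here is a rough sketch of an optional-dependency guard. It only shows the packaging mechanics and says nothing about whether this resolves the licensing question. The helper name `require_optional`, the extra name `pdf`, and the package name `example-pkg` are all hypothetical.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str) -> ModuleType:
    """Import an optional dependency, or explain how to install it.

    `extra` is the name of the pip extra that would ship the dependency
    (a hypothetical name here, e.g. "pdf").
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "{package}[{extra}]"'
        ) from exc
```

A PDF reader function would then call something like `require_optional("fitz", extra="pdf", package="example-pkg")` at the point of use, so the core install never imports the optional library.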

fg-mindee · 2022-01-10T20:00:22Z

For reference, here are some properly licensed alternatives:

* https://github.com/Hopding/pdf-lib (we'd need a binding for Python)

* https://github.com/pdfminer/pdfminer.six

* https://github.com/Belval/pdf2image

charlesmindee · 2022-02-09T14:33:59Z

pdf2image seems to be easy to use and to integrate. What do you think, @fg-mindee?

fg-mindee · 2022-02-09T14:54:28Z

We should definitely give it a try then. It's MIT licensed, so it's as permissive as it can get 👌

charlesmindee · 2022-02-16T13:21:39Z

Hi @madisonmay, after a lot of investigation, it seems that there is no Python library that renders PDFs to images without relying on poppler (pdf2image) or ghostscript (pdfplumber). All the other license-compatible libraries, such as pdfminer.six, pikepdf and PyPDF2, have no option to render PDF pages to images. If you know of another option, don't hesitate to suggest it!
We are going to make PyMuPDF an extra (outside the default build) to avoid license issues.

cc @fg-mindee

madisonmay · 2022-02-16T14:01:17Z

Hi @charlesmindee, thank you for the ping. I think we may have recently found a decent option for this niche (at least for PDF-to-image rendering)! Check out PyPDFium2 (PyPI, GitHub) -- haven't done a full benchmark, but anecdotally it seems quite fast and is Apache-2.0. You'd still need something else for text extraction.

That being said, the extras option seems like a fine resolution to me!
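For reference, a minimal sketch of page rendering with PyPDFium2. It assumes the v4-style `pypdfium2` API (`PdfDocument`, `page.render(scale=...)`, `bitmap.to_numpy()`); earlier releases exposed different functions, so check it against the installed version. The helper names `dpi_to_scale` and `render_pdf_pages` are made up for this example.

```python
PDF_POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a render scale factor (scale 1.0 = 72 DPI)."""
    return dpi / PDF_POINTS_PER_INCH


def render_pdf_pages(path: str, dpi: float = 144) -> list:
    """Render every page of a PDF to a numpy array (hypothetical helper)."""
    # Imported lazily so the module loads even when pypdfium2 is absent.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    try:
        images = []
        for index in range(len(pdf)):
            page = pdf[index]
            bitmap = page.render(scale=dpi_to_scale(dpi))
            # Copy so the array does not depend on the bitmap's buffer.
            images.append(bitmap.to_numpy().copy())
        return images
    finally:
        pdf.close()
```

Text extraction is not covered here, in line with the caveat above.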

fg-mindee · 2022-02-16T15:56:08Z

Thanks a lot for the suggestion @madisonmay 🙏
We're gonna take a look at it!

fg-mindee · 2022-02-24T15:43:25Z

FYI, pypdfium2 is incompatible with Python 3.8.1 (but no problem with 3.7 or 3.8.10+ so far).
The docker job (which was using Python 3.8.1) was failing, so I just updated the PR.

fg-mindee self-assigned this Sep 21, 2021

fg-mindee added the "module: io" (Related to doctr.io) label Sep 21, 2021

fg-mindee mentioned this issue Oct 25, 2021: Adding hocr output to generate Pdf/A #512 (Closed)

fg-mindee added the "help wanted" (Extra attention is needed) label Dec 7, 2021

fg-mindee added this to the 1.0.0 milestone Dec 10, 2021

fg-mindee mentioned this issue Jan 10, 2022: Release tracker - v0.5.1 #788 (Closed)

charlesmindee mentioned this issue Feb 15, 2022: feat: replace PyMuPDF by pdf2image, for license compatibility #818 (Closed)

fg-mindee mentioned this issue Feb 21, 2022: refactor: Switched from PyMuPDF to pypdfium2 #829 (Merged)

fg-mindee modified the milestones: 1.0.0, 0.5.1 Feb 23, 2022

fg-mindee closed this as completed in #829 Feb 24, 2022

frgfm mentioned this issue May 30, 2022: [documents] Benchmark PDF document reading + numpy conversion options #23 (Closed)