Consider removing PyMuPDF for dependency that is not AGPL licensed #486

Closed
Tracked by #788
madisonmay opened this issue Sep 21, 2021 · 11 comments · Fixed by #829
Labels: help wanted (extra attention is needed), module: io (related to doctr.io)

Comments

@madisonmay

Hi folks! Love the project, but the dependency on PyMuPDF can pose a problem for commercial use because of its AGPL license (I think use of PyMuPDF may technically require the project itself to be GPL licensed, although I'm no expert here). Would you consider integrating a PR that replaces the PyMuPDF dependency with an alternative?

fg-mindee self-assigned this on Sep 21, 2021
fg-mindee added the "module: io" label on Sep 21, 2021
@fg-mindee
Contributor

Hi @madisonmay,

Thanks for pointing this out!
You are correct: we did not pay attention to the license of PyMuPDF, and AGPL does indeed have extra requirements. We'll try to come up with a proposal, but here are the considerations:

  • as shown over here, PyMuPDF is much much faster than the other options
  • we're starting to consider making a few deps (and the corresponding features) optional "extras", such as HTML support through weasyprint; the same approach could be used here.
  • since part of the project is about accessibility and ease of use, it would be rather tricky not to be able to read PDFs through the library. Personally, I would be concerned about removing PDF reading. If you have suggestions for other options, happy to discuss them :)

fg-mindee added the "help wanted" label on Dec 7, 2021
@fg-mindee
Contributor

Quick update here, we're looking for suggestions from everyone 🙏

We have two main options:

fg-mindee added this to the 1.0.0 milestone on Dec 10, 2021
@charlesmindee
Collaborator

That is indeed a tricky issue. Could you explain the first point, @fg-mindee? How does making the dependency optional resolve the license problem?

@fg-mindee
Contributor

@charlesmindee I have to check but the idea was to put all our PDF features in an extra build (not the core/default one)
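
A minimal sketch of what such an "extra" could look like on the import side, assuming a hypothetical extra named `pdf` and a hypothetical helper `require_optional` (neither is taken from the project). The idea is that PyMuPDF, imported as `fitz`, is only loaded when a PDF feature is actually called, so the default install never pulls it in:

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with an install hint.

    Both this helper and the extra name passed to it are hypothetical.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'python-doctr[{extra}]'"
        ) from exc


def read_pdf(path: str):
    """Hypothetical entry point: open a PDF only if PyMuPDF is installed."""
    # PyMuPDF is imported lazily here, not at package import time.
    fitz = require_optional("fitz", extra="pdf")
    return fitz.open(path)
```

Whether this arrangement is enough to address the AGPL concern is a licensing question that the code itself cannot answer.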

@fg-mindee
Contributor

For reference, here are some properly licensed alternatives:

fg-mindee mentioned this issue on Jan 10, 2022
@charlesmindee
Collaborator

> For reference, here are some properly licensed alternatives:
>
> * https://github.com/Hopding/pdf-lib (we'd need a binding for Python)
> * https://github.com/pdfminer/pdfminer.six
> * https://github.com/Belval/pdf2image

pdf2image seems to be easy to use and to integrate, what do you think @fg-mindee ?

@fg-mindee
Contributor

We should definitely give it a try then. It's MIT-licensed, so it's as permissive as it can get 👌

@charlesmindee
Collaborator

Hi @madisonmay, after a lot of investigation, it seems that there is no Python library that renders PDFs to images without relying on poppler (pdf2image) or ghostscript (pdfplumber). The other license-compatible libraries, such as pdfminer.six, pikepdf, and PyPDF2, have no option to render PDF pages to images. If you know of another option, don't hesitate to suggest it!
We are going to make PyMuPDF an extra (outside the default build) to avoid license issues.

cc @fg-mindee

@madisonmay
Author

Hi @charlesmindee, thank you for the ping. I think we may have recently found a decent option for this niche (at least for PDF-to-image rendering)! Check out PyPDFium2 (PyPI, GitHub). I haven't done a full benchmark, but anecdotally it seems quite fast, and it is Apache-2.0 licensed. You'd still need something else for text extraction.

That being said, the extras option seems like a fine resolution to me!
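
For illustration, here is a rough sketch of how a library could detect which of the renderers mentioned in this thread is installed, without importing any of them. The function names and the preference order are assumptions made for this example, not the project's actual behavior:

```python
from importlib.util import find_spec

# Import names of the renderers mentioned in this thread. The preference
# order is an assumption for illustration, not the project's decision.
CANDIDATE_BACKENDS = ("pypdfium2", "pdf2image", "fitz")  # fitz is PyMuPDF


def available_backends(candidates=CANDIDATE_BACKENDS):
    """Return the candidate modules that are importable, in preference order."""
    return [name for name in candidates if find_spec(name) is not None]


def pick_backend(candidates=CANDIDATE_BACKENDS):
    """Return the first installed backend, or raise with an install hint."""
    found = available_backends(candidates)
    if not found:
        raise ImportError(
            "No PDF rendering backend found; install one of: "
            + ", ".join(candidates)
        )
    return found[0]
```

`find_spec` only checks that a module can be located, so nothing AGPL-licensed is loaded unless the caller later imports the chosen backend.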

@fg-mindee
Contributor

Thanks a lot for the suggestion @madisonmay 🙏
We're gonna take a look at it!

@fg-mindee
Contributor

FYI, pypdfium is incompatible with Python 3.8.1 (but no problem with 3.7, or 3.8.10+ so far)
The Docker job (which was using Python 3.8.1) was failing, so I just updated the PR.
