Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doctr: init at 0.7.0. Also dependencies mplcursors and pypdfium2 #265735

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

nh2
Copy link
Contributor

@nh2 nh2 commented Nov 5, 2023

  • ➡️ Looking for co-maintainers! ⬅️

doctr is a pretty good OCR software.

TODO

This is a draft PR because:

  • The distro packaging story for the dependency pypdfium2 isn't fully figured out yet: Better instructions / support for offline building pypdfium2-team/pypdfium2#272 (comment)
    • The upstream build system currently does various network acceses during builds.
    • My patches provided fix that (via the standard solution of passing in everything that needs to be downloaded), but it would be better if they were smaller or not needed.
    • Edit: Solved now, as per new upstream instructions.

Still, already sharing this PR in case others want to use doctr.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
    • Tested the doctr example code to detect text from a scanned bank account statement
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@mara004
Copy link

mara004 commented Nov 9, 2023

The distro packaging story for the dependency pypdfium2 isn't fully figured out yet

  • The upstream build system currently does various network acceses during builds.
  • My patches provided fix that (via the standard solution of passing in everything that needs to be downloaded), but it would be better if they were smaller or not needed.

pypdfium2 TLDR (maintainer POV)

  • It should be doable without patches by externalizing the data files generation instead of trying to put it in our upstream setupsrc (see new instructions https://github.com/pypdfium2-team/pypdfium2/#install-source-caller)
  • Packaging pdfium separately first may be more advisable for a distro, which somewhat changes the pypdfium2 packaging story anyway (likely results in a different layout than a pdfium-binaries tarball - e.g. where would version info be stored at system level, and in which form?)

That said, here's another idea for what you're currently doing (bundling): Maybe you could just extract the data files from our wheels? The prebuilt tarball's origin shouldn't make much difference (I guess?, anyway), but using the wheel would relieve the necessity of dealing with ctypesgen/bindings and (most of) the version info for now.

@nh2 nh2 marked this pull request as ready for review November 11, 2023 23:09
@nh2
Copy link
Contributor Author

nh2 commented Nov 11, 2023

I updated the pypdfium2-bin build as per upstream instructions: https://github.com/pypdfium2-team/pypdfium2/tree/30c60af438b7cd90e22d42dd2ba5bffdeb568c42#install-source-caller

I also added a test that checks that doctr successfully detects text.

This is ready for review now.

"${ctypesgen_pypdfium_fork}/bin/ctypesgen" \
--library pdfium \
--compile-libdirs "${pdfium-bin}/lib" \
--runtime-libdirs "${pdfium-bin}/lib" \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think --runtime-libdirs should be ., supposing ${pdfium-bin} is not available at runtime?
Or if it is, you wouldn't have to copy in the binary below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mara004 It seems to work with both --runtime-libdirs "${pdfium-bin}/lib" and --runtime-libdirs ..

With --runtime-libdirs "${pdfium-bin}/lib", the generated lib/python3.11/site-packages/pypdfium2_raw/bindings.py contains:

_libdirs = ['/nix/store/q79y5h4af3vyi5kqbrwgjm6zxasshp7k-pdfium-bin/lib']
_allow_system_search = True
_libpath = _find_library("pdfium", _libdirs, _allow_system_search)
_lib = ctypes.CDLL(_libpath)

With --runtime-libdirs . the first line is

_libdirs = ['.']

instead.

supposing ${pdfium-bin} is not available at runtime?

The above _libdirs = ['/nix/store/q79y5h4af3vyi5kqbrwgjm6zxasshp7k-pdfium-bin/lib'] makes ${pdfium-bin} available at runtime.

Or if it is, you wouldn't have to copy in the binary below.

Right -- my understanding was that --runtime-libdirs . with the cp is for bundling the .so inside pypdfium2, while --compile-libdirs "${pdfium-bin}/lib" would be to reference it.

Nixpkgs would likely rather reference it (better build deduplication / sharing of dependencies), but I found that removing the cp has the inconvenient side effect that it produces this error:

Version uncertain: git repo not available.
Falling back to autorelease record.
Traceback (most recent call last):
  File "/build/source/nix_run_setup", line 8, in <module>
    exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
  File "setup.py", line 154, in <module>
    main()
  File "setup.py", line 150, in main
    run_setup(modnames, pl_name, pdfium_ver)
  File "setup.py", line 134, in run_setup
    assert_exists(ModuleDir_Raw, kwargs["package_data"]["pypdfium2_raw"])
  File "setup.py", line 76, in assert_exists
    assert False, f"Missing data files: {missing}"
AssertionError: Missing data files: ['libpdfium.so']

This is because of:

    elif pl_name == ExtPlats.system:
        kwargs["package_data"]["pypdfium2_raw"] = [VersionFN, BindingsFN]
    else:
        # ...
        kwargs["package_data"]["pypdfium2_raw"] = [VersionFN, BindingsFN, libname]

I'm currently setting PDFIUM_PLATFORM='prepared!linux_x64:6110, not PDFIUM_PLATFORM='prepared!system:6110, so the lower branch is taken and it expects the libname (.so) in the pypdfium2_raw directory.

So far from the pypdfium2 README I understood system to denotate "when the system has built the .so", and linux_x64 to be used when using the pdfium binary releases, but maybe "placing the binary relase .so in a system directory" also means I should better use system instead?

Copy link

@mara004 mara004 Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, system is the correct one here. The latter meaning "in a system directory" is right - for the purposes of library loading, it doesn't matter who built the binary. Explicit platforms like linux_x64 imply bundling into pypdfium2's source tree, as the assertion hints. With prepared!, the target value is basically only used for the file inclusion mask.

Since I'm not accustomed to the recipe format, I just wasn't sure about the scope of ${pdfium-bin} (temporary to setup, or part of the install), and therefore not sure what you were trying to achieve.
But this clarifies ${pdfium-bin} is part of the install, and you're wanting to load that at runtime, so no need to bundle an unused binary.

--compile-libdirs is used by ctypesgen to load the library at setup/bindings creation time to verify symbol presence, and does not impact install time library loading, which is controlled by --runtime-libdirs instead.
I believe you can also use --universal-libdirs if both match, but I haven't actually tested this yet.

And confusing though it may sound, you might want to add --no-system-libsearch to disable the ctypes.util.find_library() fallback, since you are passing the runtime libdir explicitly.
(Maybe the flag should be renamed to --no-ctypes-libsearch or something to better reflect its meaning.)

Or if the library were contained in a "standard system location" as recognized by ctypes (e.g. /usr/lib), you could alternatively omit the explicit --runtime-libdirs and rely on ctypes to find it, but as NixOS library locations look quite different than on other distros, I don't think that would work, unless NixOS spoofs a different system layout (similar to conda), sets $LD_LIBRARY_PATH or something.

Note, if the binary is contained in a caller-given runtime libdir, pypdfium2 will of course depend on the continued presence of that dir, and load the library from there.
So if pdfium is updated and the directory's hash changes, you also have to rebuild pypdfium2 to link against the newer version of pdfium. Actually this is good because it enforces ABI safety.

@mara004
Copy link

mara004 commented Nov 11, 2023

Thanks, pypdfium2 part basically looks good, except for the two spots noted above, which I don't think would work as-is.

Sorry BTW for acting a bit too rashly in our previous discussion. Anyway, I believe we've reached a good result now.

@nh2
Copy link
Contributor Author

nh2 commented Nov 12, 2023

I've switched to system and also added a doctr OCR test for PDFs that exercises the pypdfium2 code path.

@nh2
Copy link
Contributor Author

nh2 commented Nov 12, 2023

I force-pushed, also renaming pytorch -> torch so that hopefully the ofborg-eval CI passes.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/2910

@nh2
Copy link
Contributor Author

nh2 commented Nov 12, 2023

Sorry BTW for acting a bit too rashly in our previous discussion. Anyway, I believe we've reached a good result now.

@mara004 No worries. Yes, I think the result is good, thanks for the offline packaging instructions!

Also includes required dependencies.
@mara004

This comment was marked as resolved.

"patch": ${toString pdfiumVersion_patch},
"n_commits": 0,
"hash": null,
"origin": "pdfium-binaries/nixos",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite unclear whether this should be pdfium-binaries/nixos or system/nixos, but that's our upstream fault. In theory pdfium-binaries implies bundling, but system doesn't include the meaning that the binary comes from pdfium-binaries.
Since a hashed, explicit runtime libdir is as good as bundling, I'd leave it as-is for now.
I plan to revise the version file (and how our code is writing it) to cleanly include all these meanings: library location (bundled/external), package flavor (pypa / conda / linux distro), and binary builder (pdfium-binaries / distro / other)

Copy link
Member

@pbsds pbsds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result of nixpkgs-review pr 265735 run on x86_64-linux 1

8 packages failed to build:
  • python310Packages.doctr
  • python310Packages.doctr.dist
  • python310Packages.pypdfium2-bin
  • python310Packages.pypdfium2-bin.dist
  • python311Packages.doctr
  • python311Packages.doctr.dist
  • python311Packages.pypdfium2-bin
  • python311Packages.pypdfium2-bin.dist
4 packages built:
  • python310Packages.mplcursors
  • python310Packages.mplcursors.dist
  • python311Packages.mplcursors
  • python311Packages.mplcursors.dist
error: hash mismatch in fixed-output derivation '/nix/store/k84a5hkxr1cfabs922i5f8szkh553d4w-pdfium-linux-x64.tgz.drv':
         specified: sha256-d1tFlzJuh0q7gzwbVegPngxkiCdL/5kXZTOs/aXATlE=
            got:    sha256-aN7VSp+p7xuxcVn1uQuiIKqdV05El+WD3TWHmdq2gIU=

The build error might simply be a recompressed archive upstream? The suggestions below try to unpack the archive

};

patches = [
# TODO: Remove when https://github.com/mindee/doctr/pull/1373 is merged and available, or when nixpkgs has newer versions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: Remove when https://github.com/mindee/doctr/pull/1373 is merged and available, or when nixpkgs has newer versions
# https://github.com/mindee/doctr/pull/1373

Comment on lines +22 to +24
# This package uses setuptools-scm to derive the version from the git repo (which we don't have),
# so use this environment variable to set it manually instead.
SETUPTOOLS_SCM_PRETEND_VERSION = version;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a common case. Context is inferrable

Suggested change
# This package uses setuptools-scm to derive the version from the git repo (which we don't have),
# so use this environment variable to set it manually instead.
SETUPTOOLS_SCM_PRETEND_VERSION = version;
env.SETUPTOOLS_SCM_PRETEND_VERSION = version;

Comment on lines +104 to +106
torch
torchvision
onnx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move these deps to passthru.optional-dependencies.torch?

homepage = "https://mindee.github.io/doctr/";
license = licenses.asl20;
maintainers = with maintainers; [ chpatrick nh2 ];
platforms = [ "x86_64-linux" ]; # because `pypdfium2` is only packed for this so far
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
platforms = [ "x86_64-linux" ]; # because `pypdfium2` is only packed for this so far
inherit (pypdfium2.meta) platforms;

Comment on lines +47 to +55
pdfiumPrebuiltTar = (fetchurl {
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz";
hash = pdfium.hash;
});

pdfium-bin = runCommand "pdfium-bin" {} ''
mkdir "$out"
tar xf "${pdfiumPrebuiltTar}" --directory "$out"
'';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pdfiumPrebuiltTar = (fetchurl {
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz";
hash = pdfium.hash;
});
pdfium-bin = runCommand "pdfium-bin" {} ''
mkdir "$out"
tar xf "${pdfiumPrebuiltTar}" --directory "$out"
'';
pdfiumPrebuiltTar = (fetchzip {
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz";
hash = pdfium.hash;
stripRoot = false;
};

Comment on lines +75 to +77
# This package uses setuptools-scm to derive the version from the git repo (which we don't have),
# so use this environment variable to set it manually instead.
SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${version}"; # we use the most recent tag and append the git version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This package uses setuptools-scm to derive the version from the git repo (which we don't have),
# so use this environment variable to set it manually instead.
SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${version}"; # we use the most recent tag and append the git version
env.SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${src.rev}"; # the most recent tag + git version

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better if it could be also be based on version

src = fetchFromGitHub {
owner = "pypdfium2-team"; # fork needed for `pypdfium2`
repo = "ctypesgen";
rev = version;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rev = version;
rev = "3a811c28160b7c742f97ee50bfe55b18d4021ca0";

Comment on lines +40 to +43
pdfiumVersion_major = lib.elemAt parsedNumericVersionComponents 0;
pdfiumVersion_minor = lib.elemAt parsedNumericVersionComponents 1;
pdfiumVersion_build = lib.elemAt parsedNumericVersionComponents 2;
pdfiumVersion_patch = lib.elemAt parsedNumericVersionComponents 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pdfiumVersion_major = lib.elemAt parsedNumericVersionComponents 0;
pdfiumVersion_minor = lib.elemAt parsedNumericVersionComponents 1;
pdfiumVersion_build = lib.elemAt parsedNumericVersionComponents 2;
pdfiumVersion_patch = lib.elemAt parsedNumericVersionComponents 3;
pdfiumVersion.major = lib.elemAt parsedNumericVersionComponents 0;
pdfiumVersion.minor = lib.elemAt parsedNumericVersionComponents 1;
pdfiumVersion.build = lib.elemAt parsedNumericVersionComponents 2;
pdfiumVersion.patch = lib.elemAt parsedNumericVersionComponents 3;

Comment on lines +133 to +137
build
setuptools-scm
numpy
pytest
pillow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
build
setuptools-scm
numpy
pytest
pillow
numpy
pillow

Please don't propagate pytest, move it to buildInputs, also move build and setuptools-scm to nativeBuildInputs.

meta = with lib; {
description = "Python bindings to PDFium";
homepage = "https://pypdfium2.readthedocs.io/";
license = with licenses; [ asl20 /* or */ bsd3 ];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment about the dual license?

@wegank wegank added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants