-
-
Notifications
You must be signed in to change notification settings - Fork 13.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doctr: init at 0.7.0. Also dependencies mplcursors and pypdfium2 #265735
base: master
Are you sure you want to change the base?
Conversation
pypdfium2 TLDR (maintainer POV)
That said, here's another idea for what you're currently doing (bundling): Maybe you could just extract the data files from our wheels? The prebuilt tarball's origin shouldn't make much difference (I guess?, anyway), but using the wheel would relieve the necessity of dealing with ctypesgen/bindings and (most of) the version info for now. |
6c06943
to
d7ab8ab
Compare
I updated the I also added a test that checks that This is ready for review now. |
"${ctypesgen_pypdfium_fork}/bin/ctypesgen" \ | ||
--library pdfium \ | ||
--compile-libdirs "${pdfium-bin}/lib" \ | ||
--runtime-libdirs "${pdfium-bin}/lib" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think --runtime-libdirs
should be .
, supposing ${pdfium-bin}
is not available at runtime?
Or if it is, you wouldn't have to copy in the binary below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mara004 It seems to work with both --runtime-libdirs "${pdfium-bin}/lib"
and --runtime-libdirs .
.
With --runtime-libdirs "${pdfium-bin}/lib"
, the generated lib/python3.11/site-packages/pypdfium2_raw/bindings.py
contains:
_libdirs = ['/nix/store/q79y5h4af3vyi5kqbrwgjm6zxasshp7k-pdfium-bin/lib']
_allow_system_search = True
_libpath = _find_library("pdfium", _libdirs, _allow_system_search)
_lib = ctypes.CDLL(_libpath)
With --runtime-libdirs .
the first line is
_libdirs = ['.']
instead.
supposing
${pdfium-bin}
is not available at runtime?
The above _libdirs = ['/nix/store/q79y5h4af3vyi5kqbrwgjm6zxasshp7k-pdfium-bin/lib']
makes ${pdfium-bin}
available at runtime.
Or if it is, you wouldn't have to copy in the binary below.
Right -- my understanding was that --runtime-libdirs .
with the cp
is for bundling the .so
inside pypdfium2
, while --compile-libdirs "${pdfium-bin}/lib"
would be to reference it.
Nixpkgs would likely rather reference it (better build deduplication / sharing of dependencies), but I found that removing the cp
has the inconvenient side effect that it produces this error:
Version uncertain: git repo not available.
Falling back to autorelease record.
Traceback (most recent call last):
File "/build/source/nix_run_setup", line 8, in <module>
exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\\r\\n', '\\n'), __file__, 'exec'))
File "setup.py", line 154, in <module>
main()
File "setup.py", line 150, in main
run_setup(modnames, pl_name, pdfium_ver)
File "setup.py", line 134, in run_setup
assert_exists(ModuleDir_Raw, kwargs["package_data"]["pypdfium2_raw"])
File "setup.py", line 76, in assert_exists
assert False, f"Missing data files: {missing}"
AssertionError: Missing data files: ['libpdfium.so']
This is because of:
elif pl_name == ExtPlats.system:
kwargs["package_data"]["pypdfium2_raw"] = [VersionFN, BindingsFN]
else:
# ...
kwargs["package_data"]["pypdfium2_raw"] = [VersionFN, BindingsFN, libname]
I'm currently setting PDFIUM_PLATFORM='prepared!linux_x64:6110
, not PDFIUM_PLATFORM='prepared!system:6110
, so the lower branch is taken and it expects the libname
(.so
) in the pypdfium2_raw
directory.
So far from the pypdfium2
README I understood system
to denotate "when the system has built the .so
", and linux_x64
to be used when using the pdfium
binary releases, but maybe "placing the binary relase .so
in a system directory" also means I should better use system
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, system
is the correct one here. The latter meaning "in a system directory" is right - for the purposes of library loading, it doesn't matter who built the binary. Explicit platforms like linux_x64
imply bundling into pypdfium2's source tree, as the assertion hints. With prepared!
, the target value is basically only used for the file inclusion mask.
Since I'm not accustomed to the recipe format, I just wasn't sure about the scope of ${pdfium-bin}
(temporary to setup, or part of the install), and therefore not sure what you were trying to achieve.
But this clarifies ${pdfium-bin}
is part of the install, and you're wanting to load that at runtime, so no need to bundle an unused binary.
--compile-libdirs
is used by ctypesgen to load the library at setup/bindings creation time to verify symbol presence, and does not impact install time library loading, which is controlled by --runtime-libdirs
instead.
I believe you can also use --universal-libdirs
if both match, but I haven't actually tested this yet.
And confusing though it may sound, you might want to add --no-system-libsearch
to disable the ctypes.util.find_library()
fallback, since you are passing the runtime libdir explicitly.
(Maybe the flag should be renamed to --no-ctypes-libsearch
or something to better reflect its meaning.)
Or if the library were contained in a "standard system location" as recognized by ctypes (e.g. /usr/lib
), you could alternatively omit the explicit --runtime-libdirs
and rely on ctypes to find it, but as NixOS library locations look quite different than on other distros, I don't think that would work, unless NixOS spoofs a different system layout (similar to conda), sets $LD_LIBRARY_PATH
or something.
Note, if the binary is contained in a caller-given runtime libdir, pypdfium2 will of course depend on the continued presence of that dir, and load the library from there.
So if pdfium is updated and the directory's hash changes, you also have to rebuild pypdfium2 to link against the newer version of pdfium. Actually this is good because it enforces ABI safety.
Thanks, pypdfium2 part basically looks good, except for the two spots noted above, which I don't think would work as-is. Sorry BTW for acting a bit too rashly in our previous discussion. Anyway, I believe we've reached a good result now. |
d7ab8ab
to
fcf937d
Compare
I've switched to |
fcf937d
to
88e206b
Compare
I force-pushed, also renaming |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/prs-ready-for-review/3032/2910 |
@mara004 No worries. Yes, I think the result is good, thanks for the offline packaging instructions! |
Also includes required dependencies.
88e206b
to
a21d9b3
Compare
This comment was marked as resolved.
This comment was marked as resolved.
"patch": ${toString pdfiumVersion_patch}, | ||
"n_commits": 0, | ||
"hash": null, | ||
"origin": "pdfium-binaries/nixos", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's quite unclear whether this should be pdfium-binaries/nixos
or system/nixos
, but that's our upstream fault. In theory pdfium-binaries
implies bundling, but system
doesn't include the meaning that the binary comes from pdfium-binaries.
Since a hashed, explicit runtime libdir is as good as bundling, I'd leave it as-is for now.
I plan to revise the version file (and how our code is writing it) to cleanly include all these meanings: library location (bundled/external), package flavor (pypa / conda / linux distro), and binary builder (pdfium-binaries / distro / other)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result of nixpkgs-review pr 265735
run on x86_64-linux 1
8 packages failed to build:
- python310Packages.doctr
- python310Packages.doctr.dist
- python310Packages.pypdfium2-bin
- python310Packages.pypdfium2-bin.dist
- python311Packages.doctr
- python311Packages.doctr.dist
- python311Packages.pypdfium2-bin
- python311Packages.pypdfium2-bin.dist
4 packages built:
- python310Packages.mplcursors
- python310Packages.mplcursors.dist
- python311Packages.mplcursors
- python311Packages.mplcursors.dist
error: hash mismatch in fixed-output derivation '/nix/store/k84a5hkxr1cfabs922i5f8szkh553d4w-pdfium-linux-x64.tgz.drv':
specified: sha256-d1tFlzJuh0q7gzwbVegPngxkiCdL/5kXZTOs/aXATlE=
got: sha256-aN7VSp+p7xuxcVn1uQuiIKqdV05El+WD3TWHmdq2gIU=
The build error might simply be a recompressed archive upstream? The suggestions below try to unpack the archive
}; | ||
|
||
patches = [ | ||
# TODO: Remove when https://github.com/mindee/doctr/pull/1373 is merged and available, or when nixpkgs has newer versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# TODO: Remove when https://github.com/mindee/doctr/pull/1373 is merged and available, or when nixpkgs has newer versions | |
# https://github.com/mindee/doctr/pull/1373 |
# This package uses setuptools-scm to derive the version from the git repo (which we don't have), | ||
# so use this environment variable to set it manually instead. | ||
SETUPTOOLS_SCM_PRETEND_VERSION = version; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a common case. Context is inferrable
# This package uses setuptools-scm to derive the version from the git repo (which we don't have), | |
# so use this environment variable to set it manually instead. | |
SETUPTOOLS_SCM_PRETEND_VERSION = version; | |
env.SETUPTOOLS_SCM_PRETEND_VERSION = version; |
torch | ||
torchvision | ||
onnx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move these deps to passthru.optional-dependencies.torch
?
homepage = "https://mindee.github.io/doctr/"; | ||
license = licenses.asl20; | ||
maintainers = with maintainers; [ chpatrick nh2 ]; | ||
platforms = [ "x86_64-linux" ]; # because `pypdfium2` is only packed for this so far |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
platforms = [ "x86_64-linux" ]; # because `pypdfium2` is only packed for this so far | |
inherit (pypdfium2.meta) platforms; |
pdfiumPrebuiltTar = (fetchurl { | ||
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz"; | ||
hash = pdfium.hash; | ||
}); | ||
|
||
pdfium-bin = runCommand "pdfium-bin" {} '' | ||
mkdir "$out" | ||
tar xf "${pdfiumPrebuiltTar}" --directory "$out" | ||
''; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdfiumPrebuiltTar = (fetchurl { | |
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz"; | |
hash = pdfium.hash; | |
}); | |
pdfium-bin = runCommand "pdfium-bin" {} '' | |
mkdir "$out" | |
tar xf "${pdfiumPrebuiltTar}" --directory "$out" | |
''; | |
pdfiumPrebuiltTar = (fetchzip { | |
url = "https://github.com/bblanchon/pdfium-binaries/releases/download/chromium/${toString pdfiumVersion_build}/pdfium-${pdfium.platformFileInfix}.tgz"; | |
hash = pdfium.hash; | |
stripRoot = false; | |
}; |
# This package uses setuptools-scm to derive the version from the git repo (which we don't have), | ||
# so use this environment variable to set it manually instead. | ||
SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${version}"; # we use the most recent tag and append the git version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This package uses setuptools-scm to derive the version from the git repo (which we don't have), | |
# so use this environment variable to set it manually instead. | |
SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${version}"; # we use the most recent tag and append the git version | |
env.SETUPTOOLS_SCM_PRETEND_VERSION = "1.1.1+g${src.rev}"; # the most recent tag + git version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better if it could be also be based on version
src = fetchFromGitHub { | ||
owner = "pypdfium2-team"; # fork needed for `pypdfium2` | ||
repo = "ctypesgen"; | ||
rev = version; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rev = version; | |
rev = "3a811c28160b7c742f97ee50bfe55b18d4021ca0"; |
pdfiumVersion_major = lib.elemAt parsedNumericVersionComponents 0; | ||
pdfiumVersion_minor = lib.elemAt parsedNumericVersionComponents 1; | ||
pdfiumVersion_build = lib.elemAt parsedNumericVersionComponents 2; | ||
pdfiumVersion_patch = lib.elemAt parsedNumericVersionComponents 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pdfiumVersion_major = lib.elemAt parsedNumericVersionComponents 0; | |
pdfiumVersion_minor = lib.elemAt parsedNumericVersionComponents 1; | |
pdfiumVersion_build = lib.elemAt parsedNumericVersionComponents 2; | |
pdfiumVersion_patch = lib.elemAt parsedNumericVersionComponents 3; | |
pdfiumVersion.major = lib.elemAt parsedNumericVersionComponents 0; | |
pdfiumVersion.minor = lib.elemAt parsedNumericVersionComponents 1; | |
pdfiumVersion.build = lib.elemAt parsedNumericVersionComponents 2; | |
pdfiumVersion.patch = lib.elemAt parsedNumericVersionComponents 3; |
build | ||
setuptools-scm | ||
numpy | ||
pytest | ||
pillow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
build | |
setuptools-scm | |
numpy | |
pytest | |
pillow | |
numpy | |
pillow |
Please don't propagate pytest, move it to buildInputs
, also move build and setuptools-scm to nativeBuildInputs
.
meta = with lib; { | ||
description = "Python bindings to PDFium"; | ||
homepage = "https://pypdfium2.readthedocs.io/"; | ||
license = with licenses; [ asl20 /* or */ bsd3 ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment about the dual license?
doctr is a pretty good OCR software.
TODO
This is a draft PR because:
pypdfium2
isn't fully figured out yet: Better instructions / support for offline building pypdfium2-team/pypdfium2#272 (comment)Still, already sharing this PR in case others want to use
doctr
.Things done
nix.conf
? (See Nix manual)sandbox = relaxed
sandbox = true
nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)doctr
example code to detect text from a scanned bank account statement