Text in PDF recognized as gibberish in any PDFium viewer due to invalid bfrange definitions in ToUnicodeMap #1498

Closed
orzFly opened this issue Feb 24, 2024 · 1 comment

orzFly commented Feb 24, 2024

Bug Report

Description of the problem

pdfkit/lib/font/embedded.js

Lines 269 to 271 in 485b7e6

1 beginbfrange
<0000> <${toHex(entries.length - 1)}> [${entries.join(' ')}]
endbfrange

Currently, our code emits all ToUnicodeMap entries as a single bfrange entry on one line, spanning every glyph. This yields an invalid text mapping in any PDFium-based viewer (and maybe others).

https://source.chromium.org/chromium/_/pdfium/pdfium.git/+/master:core/fpdfapi/font/cpdf_tounicodemap.cpp;l=171-172;drc=61bda438f9071586c92f8f626c29021524a8d0b0

    uint32_t lowcode = lowcode_opt.value();
    uint32_t highcode = (lowcode & 0xffffff00) | (highcode_opt.value() & 0xff);
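
To make the effect of those two lines concrete, here is a minimal sketch that mimics the same arithmetic in JavaScript. The function name `pdfiumEffectiveRange` is made up for illustration and is not part of PDFium or pdfkit; it only reproduces the masking shown above.

```javascript
// Mimics the two PDFium lines quoted above: only the low byte of the
// high code is kept, and the upper bytes are taken from the low code.
function pdfiumEffectiveRange(lowcode, highcode) {
  // ">>> 0" keeps the result an unsigned 32-bit value, like uint32_t.
  const effectiveHigh = ((lowcode & 0xffffff00) | (highcode & 0xff)) >>> 0;
  return { lowcode, highcode: effectiveHigh };
}

// With 258 glyphs, the single range is written as <0000> <0101> [...].
const glyphCount = 258;
const range = pdfiumEffectiveRange(0x0000, glyphCount - 1);

// Number of codes PDFium would actually map from that range.
const mapped = range.highcode - range.lowcode + 1;
console.log(range, mapped);
```

Under this arithmetic the range `<0000> <0101>` is read as `<0000> <0001>`, so only two codes are mapped.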

Related Chromium bug: https://bugs.chromium.org/p/pdfium/issues/detail?id=1339#c1

The PDF spec doesn't give too much detail about beginbfrange. I looked around and found the doc below. Based on section 1.4.1 in that doc, the <19ff><1a00><63cf> beginbfrange entry is illegal. The first byte values should be the same for the two source range values in the entry.
https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

The link above has since been moved or removed. I found another copy at http://www.audentia-gestion.fr/ADOBE/5411.ToUnicode.pdf
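
One way to satisfy the "same first byte" constraint is to split the entries into ranges of at most 256 codes. The following is only a sketch under that assumption, not necessarily the change that ends up in pdfkit; `buildBfRanges` is a hypothetical name, and `toHex` here is a stand-in for pdfkit's own helper.

```javascript
// Stand-in helper (assumption): formats a code as 4 hex digits.
function toHex(n) {
  return n.toString(16).padStart(4, '0');
}

// Splits the entries into ranges of at most 256 codes, so that the low and
// high code of every bfrange line share the same first byte.
function buildBfRanges(entries) {
  const chunkSize = 256;
  const ranges = [];
  for (let start = 0; start < entries.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, entries.length);
    const targets = entries.slice(start, end).join(' ');
    ranges.push(`<${toHex(start)}> <${toHex(end - 1)}> [${targets}]`);
  }
  return `${ranges.length} beginbfrange\n${ranges.join('\n')}\nendbfrange`;
}

// 258 placeholder entries produce two ranges: <0000> <00ff> and <0100> <0101>.
const sampleEntries = Array.from({ length: 258 }, () => '<0041>');
console.log(buildBfRanges(sampleEntries).split('\n').length);
```

Each emitted line then stays within a single 256-code block, which is what the PDFium code above assumes.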

Screenshots

  • Google Chrome 122.0.6261.69 Linux x86_64
  • Chromium 122.0.6261.69 (Official Build) Arch Linux (64-bit)
  • WPS Office for Linux 11.1.0.11698
  • Firefox (pdf.js) - CORRECT
  • Adobe Acrobat Reader 2023.008.20533 64-bit on Windows 11 - CORRECT
Code sample

https://replit.com/@orzFly/pdfkit-tounicode?v=1
test.pdf

I used 258 glyphs in the document, so only the first two glyphs (258 % 256 = 2) are correct and yield "AB". All the rest are incorrect.

Your environment

  • pdfkit version: 0.12.3, or master
  • Node version: 12.22.9
  • Browser version:
    • Google Chrome 122.0.6261.69 Linux x86_64
    • WPS Office for Linux 11.1.0.11698
    • Chromium 122.0.6261.69 (Official Build) Arch Linux (64-bit)
  • Operating System: Linux x86_64

orzFly commented Feb 24, 2024

I have a possible fix and will send a pull request later. However, I am not sure how to add a unit test for this particular behavior.
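
One possible shape for such a test, offered only as a sketch and not as the fix mentioned above: a small checker that scans the generated CMap text and reports any bfrange line whose low and high codes differ in their first byte. The name `findInvalidBfRanges` is hypothetical, and the sketch assumes 2-byte codes with one range per line, as in the snippet from `embedded.js`.

```javascript
// Returns the bfrange lines whose low and high codes differ in their first byte.
// Assumes 2-byte (4 hex digit) codes and one range per line.
function findInvalidBfRanges(cmapText) {
  const invalid = [];
  const rangePattern = /^<([0-9a-fA-F]{4})>\s*<([0-9a-fA-F]{4})>/;
  let inBfRange = false;
  for (const rawLine of cmapText.split('\n')) {
    const line = rawLine.trim();
    if (line.endsWith('beginbfrange')) { inBfRange = true; continue; }
    if (line === 'endbfrange') { inBfRange = false; continue; }
    if (!inBfRange) continue;
    const match = rangePattern.exec(line);
    if (!match) continue;
    const low = parseInt(match[1], 16);
    const high = parseInt(match[2], 16);
    // The first byte of both codes must match for the range to be valid.
    if ((low >> 8) !== (high >> 8)) invalid.push(line);
  }
  return invalid;
}

const badCMap = '1 beginbfrange\n<0000> <0101> [<0041> <0042>]\nendbfrange';
const goodCMap = '2 beginbfrange\n<0000> <00ff> [<0041>]\n<0100> <0101> [<0042>]\nendbfrange';
console.log(findInvalidBfRanges(badCMap).length, findInvalidBfRanges(goodCMap).length);
```

A unit test could then assert that the checker returns an empty list for the CMap produced from a font subset with more than 256 glyphs.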

orzFly added a commit to orz-forks/pdfkit that referenced this issue Feb 24, 2024
orzFly added a commit to orz-forks/pdfkit that referenced this issue Feb 26, 2024