Handle non-latin encodings #10

Open
leeper opened this issue May 25, 2016 · 3 comments

Comments

@leeper
Member

leeper commented May 25, 2016

This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from rOpenSci onboarding.

@leeper leeper added this to the CRAN Release milestone May 25, 2016
@leeper leeper self-assigned this May 25, 2016
@leeper leeper closed this as completed in 71c3b82 May 31, 2016
@leeper leeper reopened this Jun 1, 2016
@leeper
Member Author

leeper commented Jun 1, 2016

This works on Windows, but is screwy on Linux. Must investigate more.

@tpaskhalis
Contributor

My understanding of the structure of PDF files is that there is no way one could guarantee the correct encoding of text extracted from PDF in anything other than Unicode. Not only can encodings be defined for each individual text block within a single PDF file, but the file can even contain embedded fonts. Can we just do Encoding(out) <- "UTF-8" and remove the argument?
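
For illustration, a minimal sketch of what that suggestion amounts to in base R. Here `out` is a hypothetical stand-in for the extracted text, not the package's actual code, and the sketch assumes the extractor already emits UTF-8 bytes:

```r
# Hypothetical stand-in for text extracted from a PDF: the UTF-8 bytes for
# "café", built from raw bytes so the example does not depend on the locale.
out <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))

# The proposal: declare the bytes to be UTF-8. This only sets the encoding
# mark on the string; it does not convert any bytes.
Encoding(out) <- "UTF-8"

# Sanity check that the bytes really are well-formed UTF-8.
utf8_ok <- validUTF8(out)

# Marking is only safe when the bytes are UTF-8 already. Bytes in another
# encoding (here latin1 "café") fail validation and need iconv() instead.
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
latin1_ok <- validUTF8(latin1_bytes)
converted <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
```

The point of the sketch is that `Encoding<-` relabels rather than transcodes, so dropping the argument would rest on the assumption that the extraction step always yields UTF-8.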

@leeper
Member Author

leeper commented Apr 11, 2018

I don't think I understand PDF well enough to know whether that makes any sense.
