Handle non-latin encodings #10

leeper · 2016-05-25T08:38:30Z

This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from rOpenSci onboarding.

leeper · 2016-06-01T16:01:25Z

This works on Windows, but is screwy on Linux. Must investigate more.

tpaskhalis · 2018-04-11T18:58:42Z

My understanding of the structure of PDF files is that there is no way to guarantee the correct encoding of text extracted from a PDF in anything other than Unicode. Not only can encodings be defined for each individual text block within a single PDF file, the file can even contain embedded fonts. Can we just do Encoding(out) <- "UTF-8" and remove the argument?
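For reference, a minimal base R sketch of what that assignment would and would not do. It assumes `out` is the character vector of extracted text; the sample bytes and the `latin1` and `fixed` names are made up for illustration and are not taken from the package. `Encoding<-` only sets the declared encoding mark on a string, it does not transcode the underlying bytes, so it is only safe if the extracted text really is UTF-8 already.

```r
# Hypothetical sample: "café" as UTF-8 bytes, built from raw values so the
# example does not depend on the encoding of this source file.
out <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))

# Declare the bytes to be UTF-8. This changes the mark only, not the bytes.
Encoding(out) <- "UTF-8"
validUTF8(out)   # checks that the bytes are well-formed UTF-8

# Hypothetical counter-case: the same word as Latin-1 bytes.
latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
validUTF8(latin1)  # not well-formed UTF-8, so marking it "UTF-8" would be wrong

# Text in another encoding has to be converted, not just re-marked.
fixed <- iconv(latin1, from = "latin1", to = "UTF-8")
identical(charToRaw(fixed), charToRaw(out))
```

If the text coming back from extraction is always Unicode, the first half is all that is needed; the second half only matters if some platform hands back bytes in a different encoding.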

leeper · 2018-04-11T23:52:44Z

I don't think I understand PDF well enough to know whether that makes any sense.

leeper added bug documentation labels May 25, 2016

leeper added this to the CRAN Release milestone May 25, 2016

leeper self-assigned this May 25, 2016

leeper closed this as completed in 71c3b82 May 31, 2016

leeper reopened this Jun 1, 2016

leeper mentioned this issue Jun 1, 2016

tabulizer: Bindings for Tabula PDF Table Extractor Library ropensci/software-review#42

Closed

9 tasks

leeper added the help wanted label Apr 8, 2017
