write_* improvements #432

jimhester · 2016-06-10T17:12:48Z

As we know, the previous implementation was very slow because of the character conversion: it was on par with write.csv().

set.seed(1)
df <- as.data.frame(matrix(runif(256*2^15), nrow = 256))
system.time(write.csv(df, "/tmp/df1.csv"))
#>    user  system elapsed 
#>  26.074   7.622  33.776
system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>  34.012   0.537  34.589

9c28645 just does a little cleanup and turns off converting numerics to character first. This produces valid, round-trippable results and is quite a bit faster than converting to character; however, all doubles are printed with the maximum amount of precision.

system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>   8.308   0.240   8.588
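
To illustrate the trade-off described above, here is a minimal C sketch. It assumes that "maximum precision" means 17 significant digits, which is enough for any IEEE 754 double to survive a text round trip. The function names are hypothetical and this is not the code from the commit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format x with 17 significant digits.
 * Returns the number of characters written (excluding the NUL),
 * exactly as snprintf reports it. */
int format_max_precision(double x, char *buf, size_t size) {
  return snprintf(buf, size, "%.17g", x);
}

/* Hypothetical helper: returns 1 if parsing buf yields exactly x again. */
int round_trips(double x, const char *buf) {
  return strtod(buf, NULL) == x;
}
```

With this format, 3.9 is written as 3.8999999999999999: correct and round-trippable, but longer than it needs to be.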

a67c1d5 uses the grisu3 implementation found at https://github.com/juj/MathGeoLib/blob/master/src/Math/grisu3.c. It is under the Apache 2 license, so it is safe for us to use. This actually gives us quite a bit better performance than the naive approach.

system.time(readr::write_csv(df, "/tmp/df4.csv"))
#>    user  system elapsed 
#>   3.047   0.203   3.265

However, data.table::fwrite() is still faster than any of these methods.

system.time(data.table::fwrite(df, "/tmp/df3.csv"))
#> Your platform/environment has not detected OpenMP support. fwrite() will still work, but slower in single threaded mode.
#>    user  system elapsed 
#>   1.578   0.063   1.651

I did some profile sampling with R -q -d "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes" and the vast majority of our computational time is spent doing the formatting, so I am not sure how much more room there is to improve.

Fixes #387

Don't convert numeric to character, use the max amount of precision necessary to roundtrip doubles.

jimhester · 2016-06-10T17:22:21Z

Sample output from grisu3 on mtcars is identical to both the current output and write.csv():

readr::write_tsv(head(mtcars), "/dev/stdout")
#> mpg  cyl disp    hp  drat    wt  qsec    vs  am  gear    carb
#> 21   6   160 110 3.9 2.62    16.46   0   1   4   4
#> 21   6   160 110 3.9 2.875   17.02   0   1   4   4
#> 22.8 4   108 93  3.85    2.32    18.61   1   1   4   1
#> 21.4 6   258 110 3.08    3.215   19.44   1   0   3   1
#> 18.7 8   360 175 3.15    3.44    17.02   0   0   3   2
#> 18.1 6   225 105 2.76    3.46    20.22   1   0   3   1

vs setting the precision manually

readr::write_tsv(head(mtcars), "/dev/stdout")
#> mpg  cyl disp    hp  drat    wt  qsec    vs  am  gear    carb
#> 21   6   160 110 3.8999999999999999  2.6200000000000001  16.460000000000001  0   1   4   4
#> 21   6   160 110 3.8999999999999999  2.875   17.02   0   1   4   4
#> 22.800000000000001   4   108 93  3.8500000000000001  2.3199999999999998  18.609999999999999  1   1   4   1
#> 21.399999999999999   6   258 110 3.0800000000000001  3.2149999999999999  19.440000000000001  1   0   3   1
#> 18.699999999999999   8   360 175 3.1499999999999999  3.4399999999999999  17.02   0   0   3   2
#> 18.100000000000001   6   225 105 2.7599999999999998  3.46    20.219999999999999  1   0   3   1
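
The difference between the two outputs comes down to "shortest string that still round-trips" versus "fixed maximum precision". The following C sketch reaches the shortest form by brute force, trying increasing precisions until the text parses back to the same double. It is only an illustration of the goal, not grisu3 (which avoids the repeated format-and-parse loop), and the function name is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write the shortest "%.{p}g" rendering of x that
 * parses back to exactly x, trying p = 1..17 significant digits.
 * Slow by design; it only demonstrates what "shortest round-trippable"
 * means. Returns the number of characters written (excluding the NUL). */
int format_shortest_by_search(double x, char *buf, size_t size) {
  int n = 0;
  for (int precision = 1; precision <= 17; ++precision) {
    n = snprintf(buf, size, "%.*g", precision, x);
    if (strtod(buf, NULL) == x) {
      break; /* first precision that round-trips is the shortest */
    }
  }
  return n;
}
```

For 3.9 this yields 3.9 rather than 3.8999999999999999. Note that %g may pick exponent notation for some inputs (20 at one significant digit renders as 2e+01), so the exact strings can differ from a dedicated shortest-digits formatter.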

jimhester · 2016-06-10T17:28:41Z

One possible issue is mentioned in the grisu3 paper, which states the following (emphasis mine)

With just two extra bits it is difficult to do better than in our
example, but often there exists an integer type with more bits. For
IEEE 754 floating-point numbers, which have a significand size of
53, one can use 64 bit integers, providing 11 extra bits. We have
developed an algorithm Grisu2 that uses these extra bits to shorten
the output. However, even 11 extra bits may not be sufficient in
every case. There are still boundary conditions under which Grisu2
will not be able to produce the shortest representation. Since this
property is often a requirement (see [Steele Jr. and White(2004)]
for some examples) we propose a variant, Grisu3, that detects (and
aborts) when its output may not be the shortest. As a consequence
Grisu3 is incomplete and will fail for some percentage of its input.
Given 11 extra bits roughly 99.5% are processed correctly and
are thus guaranteed to be optimal (with respect to shortness and
rounding). The remaining 0.5% are rejected and need to be printed
by another printing algorithm (like Dragon4).

I need to look at this implementation and see what happens if the input is rejected, but it is reassuring that it did not fail on the test inputs, which are random, although only between 0 and 1.

Edit
Answered by https://github.com/hadley/readr/pull/432/files#diff-d249c6cf5b0b488ebfa485f331caed3bR331, which uses sprintf in this case.
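
The reject-then-fall-back pattern can be sketched in C as below. Everything here is an assumption for illustration: fast_format_or_reject is a hypothetical stand-in that accepts only small whole numbers (it is NOT grisu3), and the "%.17g" fallback format is a guess, since the thread only says that sprintf is used.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL stand-in for a fast formatter that may reject its input.
 * Returns 1 and fills buf on success, 0 when it declines. It accepts only
 * whole numbers of modest magnitude so both branches can be exercised. */
static int fast_format_or_reject(double x, char *buf, size_t size) {
  /* The range check comes first so the cast below is always well defined
   * (NaN and infinities fail the comparisons and are rejected). */
  if (x > -1e15 && x < 1e15 && x == (double)(long long)x) {
    snprintf(buf, size, "%lld", (long long)x);
    return 1;
  }
  return 0;
}

/* Try the fast path; if it rejects, fall back to a slower formatter.
 * Returns the number of characters written (excluding the NUL). */
int format_with_fallback(double x, char *buf, size_t size) {
  if (fast_format_or_reject(x, buf, size)) {
    return (int)strlen(buf);
  }
  /* Assumed fallback: 17 significant digits always round-trips. */
  return snprintf(buf, size, "%.17g", x);
}
```

The point of the design is that a rejection never produces wrong output; it only costs the slower path for that one value.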

jimhester · 2016-06-10T17:42:52Z

We may also want to incorporate the changes from https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/portable/grisu3_print.h#L207-L228 (also under Apache 2.0) which prefer 'unscientific' notation at the same length and always append a 0 on decimals.

hadley · 2016-06-10T18:13:35Z

src/grisu3.c

@@ -0,0 +1,361 @@
+/* This file is part of an implementation of the "grisu3" double to string


Can you include the license too? Might need to copy and paste from somewhere else

hadley · 2016-06-10T18:14:13Z

Also need to update Authors@R

hadley · 2016-06-10T18:15:24Z

You mean C-level formatting or R-level formatting?

But I'm happy with that performance - we don't need to be as fast as fwrite(), we just need not to be embarrassingly slow.

jimhester · 2016-06-13T19:56:15Z

C-level formatting is what I meant (after the above changes).

hadley · 2016-06-13T20:02:07Z

Apart from the authorship/license stuff (and news bullet), LGTM. Feel free to merge when you've done those bits.

jimhester · 2016-06-14T15:02:03Z

Added the license and authors to the DESCRIPTION. PTAL briefly to make sure it looks OK and then I can merge this.

hadley · 2016-06-14T18:15:46Z

Looks good - I confirmed that Apache license is compatible with GPL3.

Small cleanup for writing

9c28645

Don't convert numeric to character, use the max amount of precision necessary to roundtrip doubles.

jimhester added the in progress label Jun 10, 2016

hadley reviewed Jun 10, 2016

jimhester added ready and removed in progress labels Jun 14, 2016

jimhester changed the title ~~WIP: write_* improvements~~ write_* improvements Jun 14, 2016

Use grisu3 to format doubles

fb9f46a

jimhester force-pushed the master branch from 94c990e to 52cca9e Compare June 14, 2016 15:11

jimhester added 3 commits June 14, 2016 11:21

Add LICENSE information and output formatting tweaks

a7e1abd

Add note to NEWS

44372f0

Write exactly the string length

ca075fd

jimhester force-pushed the master branch from 52cca9e to ca075fd Compare June 14, 2016 15:21

jimhester merged commit 58d5682 into tidyverse:master Jun 14, 2016

jimhester removed the ready label Jun 14, 2016
