
WIP: Transition to Missing #200

Merged: 25 commits merged into master from jd/missing on Dec 12, 2018
Conversation

@joshday (Collaborator) commented Dec 5, 2018

This builds on #199 (Tables.jl integration) and removes the dependency on DataValues.

The build/tests are running, but I'm still unsure of the consequences for performance, integration with IterableTables, etc.

However, we are currently in a situation where TextParse loads missing data as `Union{T,Missing}`, but operations like table joins can then introduce DataValues.
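To illustrate the mismatch (a hypothetical example; the exact column types depend on the query):

```julia
using DataValues

# A column as TextParse loads it: `missing` sentinel, Union element type.
parsed = Union{Int,Missing}[1, missing, 3]
eltype(parsed)  # Union{Missing, Int64}

# A column as a join may produce it: DataValue-wrapped elements.
joined = [DataValue(1), DataValue{Int}(), DataValue(3)]
eltype(joined)  # DataValue{Int64}
```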

@joshday changed the title from "WIP DO NOT MERGE: Convert to Missing" to "WIP: Convert to Missing" on Dec 6, 2018
@joshday changed the title from "WIP: Convert to Missing" to "WIP: Transition to Missing" on Dec 7, 2018
@piever (Collaborator) commented Dec 8, 2018

In terms of performance, it seems that the Tables-style iteration (with a lazy `ColumnRows` object that overloads `getproperty` to return the correct value without materializing the `NamedTuple`) preserves performance much better than the IndexedTables-style iteration (which materializes the `NamedTuple`): see JuliaData/Tables.jl#51 for a benchmark. With JuliaData/Tables.jl#50 and JuliaData/Tables.jl#51, I suspect we could switch to using Tables iteration.
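A minimal sketch of that lazy-row idea (illustrative names only, not the actual Tables.jl implementation):

```julia
# A lazy row holds the columns plus a row index; `getproperty` pulls out
# a single field on demand, so the full NamedTuple is never materialized.
struct LazyRowSketch{C}
    columns::C  # a NamedTuple of column vectors
    i::Int
end

Base.getproperty(r::LazyRowSketch, name::Symbol) =
    getfield(r, :columns)[name][getfield(r, :i)]

cols = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
row = LazyRowSketch(cols, 2)
row.a, row.b  # (2, 5.0)
```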

@joshday (Collaborator, Author) commented Dec 8, 2018

Since CI is green, I'm inclined to merge this first and let the lazy-row implementation go into a separate PR before we make a new release of IndexedTables.

@piever Do you want to give the lazy row stuff a go?

@piever (Collaborator) commented Dec 8, 2018

I agree; this PR is quite massive as it is. I'm happy to try my hand at changing the default iteration in a new PR, though it may be a bit tricky.

In particular, one issue is that the Tables implementation of collecting a row iterator into columns does not support unwrapping (see JuliaData/Tables.jl#17), whereas IndexedTables unwraps (Named)Tuples by default (though I'm not sure how necessary that is).

Resolved review threads: src/join.jl, src/ndsparse.jl, src/reshape.jl, src/selection.jl, src/utils.jl
@quinnj (Member) left a comment
This is looking great; there's actually a lot less change than I expected. What's left?

```julia
excludecols(t, cols) = excludecols(t, (cols,))
excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
function excludecols(t, cols::Tuple)
    Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
end
```
A reviewer (Member) commented on the diff:
Nice simplification here

@joshday joshday merged commit a5e9c32 into master Dec 12, 2018
@joshday joshday deleted the jd/missing branch December 12, 2018 18:15
@davidanthoff (Contributor) commented:
Have you guys decided to just live with the performance issues that you get from Missing? They are quite severe for common cases, like a map over a table with columns that can hold missing values.

On v0.8.1:

```julia
d = DataValueVector{Float64}(randn(10_000_000))
d[3] = NA
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))

# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)

@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

gets me `0.643079 seconds (82.47 k allocations: 766.929 MiB, 29.42% gc time)`.

On v0.9.0:

```julia
d = Union{Missing,Float64}[i for i in randn(10_000_000)]
d[3] = missing
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))

# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)

@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

gets me `7.376202 seconds (150.13 M allocations: 4.672 GiB, 7.68% gc time)`.

That is more than an order-of-magnitude performance drop... The root cause here is the same issue that keeps me from adopting Missing for Query.jl, and that we have already discussed to death in various threads on Discourse etc. I don't want to restart that discussion; I'm just curious what the thinking behind the choice here was.

@davidanthoff (Contributor) commented:
Oh, and I also get the impression that the whole TableTraits.jl/Tables.jl integration is all broken...

For example:

```julia
julia> DataFrame(a=[1,missing,5], b=[2., 6., missing]) |> table
Table with 3 rows, 2 columns:
a    b
────────
1    2.0
#NA  6.0
5    #NA
```

Shouldn't that create an IndexedTable with missing values? I'm not even sure which interface it uses for the conversion, but the end result seems incorrect to me.

And the other way around, things are also broken:

```julia
julia> table((a=[1,missing,5],b=[2.,6.,missing])) |> IteratorInterfaceExtensions.getiterator |> first
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}((1, 2.0))
```

That is incorrect: IteratorInterfaceExtensions.getiterator must return an iterator that uses DataValue for missing values; otherwise it does not correctly implement the TableTraits.jl interface.
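For reference, the expected conversion has roughly this shape (a hypothetical shim, not the actual TableTraits.jl code):

```julia
using DataValues

# Hypothetical per-field conversion: Union{T,Missing} -> DataValue{T}.
to_datavalue(::Type{T}, x) where {T} =
    x === missing ? DataValue{T}() : DataValue{T}(x)

row = (a = missing, b = 2.0)
(a = to_datavalue(Int, row.a), b = to_datavalue(Float64, row.b))
```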

All of these scenarios worked fine and correctly for a very long time before this PR...

@piever (Collaborator) commented Dec 21, 2018

There definitely is a performance issue once at least three fields can hold missing. Below, `vN` has `N` missable fields:

```julia
julia> using StructArrays

julia> m = Union{Missing, Int}[i for i in rand(Int, 10_000_000)];

julia> m[3] = missing;

julia> nm = rand(Int, 10_000_000);

julia> v0 = StructArray(a=nm, b=nm, c=nm);

julia> v1 = StructArray(a=m, b=nm, c=nm);

julia> v2 = StructArray(a=m, b=m, c=nm);

julia> v3 = StructArray(a=m, b=m, c=m);

julia> f1(v) = StructArray(v[i] for i in eachindex(v))
f1 (generic function with 1 method)

julia> f2(v) = StructArray(StructArrays.LazyRow(v, i) for i in eachindex(v))
f2 (generic function with 1 method)

julia> @time f1(v0);
  0.297717 seconds (138.80 k allocations: 235.571 MiB, 22.80% gc time)

julia> @time f1(v1);
  0.325315 seconds (153.66 k allocations: 245.685 MiB, 20.19% gc time)

julia> @time f1(v2);
  0.467856 seconds (174.08 k allocations: 256.051 MiB, 23.24% gc time)

julia> @time f1(v3);
  4.976934 seconds (60.11 M allocations: 1.747 GiB, 5.96% gc time)

julia> @time f2(v0);
  0.257504 seconds (172.24 k allocations: 237.419 MiB, 5.44% gc time)

julia> @time f2(v1);
  0.539007 seconds (10.18 M allocations: 552.336 MiB, 36.22% gc time)

julia> @time f2(v2);
  0.499234 seconds (10.21 M allocations: 563.577 MiB, 28.97% gc time)

julia> @time f2(v3);
  0.520931 seconds (10.24 M allocations: 574.369 MiB, 27.54% gc time)
```

The plan to alleviate this is to transition to `LazyRow` for iteration (lazy iteration that does not materialize the full NamedTuple). It should probably become the default (as it is for DataFrames, and it has several side advantages).

The collection machinery in StructArrays is designed to be fast even when the eltype of the iterator is unstable, as long as each column is stable. However, here it is actually the function itself that is slow:

```julia
julia> const T = eltype(v1)
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> f(i)::T = (a=i.a, b=i.b, c=i.c)
f (generic function with 1 method)

julia> Core.Compiler.return_type(f, Tuple{eltype(v1)})
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> @time StructArray(f(el) for el in v0);
  0.239057 seconds (17 allocations: 238.419 MiB, 44.94% gc time)

julia> @time StructArray(f(el) for el in v1);
  3.895166 seconds (60.00 M allocations: 1.723 GiB, 7.76% gc time)
```

Here the bottleneck is actually "computing f" when the input data is missing:

```julia
julia> s0 = v0[1];

julia> s1 = v1[1];

julia> @btime f(s0);
  17.957 ns (1 allocation: 48 bytes)

julia> @btime f(s1);
  341.888 ns (7 allocations: 208 bytes)
```

@davidanthoff (Contributor) commented:
Hm, that is very strange... I don't understand the problems you show in your first code block; I would not have anticipated any issue there. I think that is an entirely different problem from the one we had discussed previously in these endless threads, obviously also important and quite severe, but distinct. Do you understand why there is a problem there?

The issue I had in mind was the one you describe later, i.e. the problem of f not being type-stable. In your example you kind of hide that by annotating f with a return type, which of course one can't do in a generic map implementation. I don't see how a lazy row view would help with that; those seem like entirely orthogonal problems. Is there a plan for that specific issue?

@piever (Collaborator) commented Dec 22, 2018

> The issue I had in mind was the one you describe later, i.e. the problem of f not being type-stable. In your example you kind of hide that by annotating f with a return type, which of course one can't do in a generic map implementation. I don't see how a lazy row view would help with that; those seem like entirely orthogonal problems. Is there a plan for that specific issue?

I annotated the return type exactly to show that type instability of f is not the problem (or at least not the only one): performance is bad even when the return type of f is known. What I think is the bottleneck (in this extreme example) is not the "collection code" when f is type-unstable, but the actual "computation of f": computing f(s1) when s1 has "missable" fields is very slow. Unless I'm misunderstanding something basic, here N = 10_000_000 and f(s1) takes 340 ns, so it would seem 3.4 of the total 3.8 seconds are spent on that computation alone.

I don't understand the compiler at all, so I have no idea what can be done in practice to fix this, but I suspect the slowness of this kind of simple function f is an MWE of something that can only be optimized on the Base Julia side of things (not in StructArrays or IndexedTables).
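One possible workaround (just a sketch; not benchmarked in this thread's setting) is to split the union by hand before building the NamedTuple, so each branch only handles concretely-typed values:

```julia
# Manual union splitting: branch on `ismissing` so that in each branch
# the `a` field has a concrete type, and the NamedTuple constructor
# never has to handle a boxed Union value.
f_split(i) = ismissing(i.a) ?
    (a = missing, b = i.b, c = i.c) :
    (a = i.a::Int, b = i.b, c = i.c)
```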

@JeffBezanson (Contributor) commented:
I think changing how the NamedTuple constructor is implemented might alleviate that, but I don't have a full solution yet. And fixing the performance of f is nice, but as @davidanthoff said, that will probably not fix all real-world cases.

For now I would humbly suggest not moving IndexedTables and JuliaDB to missing as long as there are serious performance regressions.

@davidanthoff (Contributor) commented:
> For now I would humbly suggest not moving IndexedTables and JuliaDB to missing as long as there are serious performance regressions.

I think this has all been released already, right?

I think there might generally be a need for more quality control in the release process. The last two releases at least were quite problematic, IMO: the previous one broke joins etc. whenever data was loaded from CSV files (because TextParse.jl had moved to Missing but JuliaDB had not), and now this release has these performance issues and essentially broke the whole TableTraits.jl integration story...
