
WIP: Transition to Missing #200

Merged: 25 commits merged into master from jd/missing on Dec 12, 2018
Conversation

@joshday (Collaborator) commented Dec 5, 2018

This builds on #199 (Tables.jl integration) and removes the dependency on DataValues.

The build/tests are running, but I'm still unsure of the consequences for performance, integration with IterableTables, etc.

However, we are currently in a situation where TextParse loads missing data as `Union{T,Missing}`, but operations like table joins can then introduce DataValues.
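To illustrate the mismatch (a hypothetical example; the exact column types depend on the query):

```julia
using DataValues

# A column as TextParse loads it: `missing` sentinel, Union element type.
parsed = Union{Int,Missing}[1, missing, 3]
eltype(parsed)  # Union{Missing, Int64}

# A column as a join may produce it: DataValue-wrapped elements.
joined = [DataValue(1), DataValue{Int}(), DataValue(3)]
eltype(joined)  # DataValue{Int64}
```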

@joshday changed the title from "WIP DO NOT MERGE: Convert to Missing" to "WIP: Convert to Missing" on Dec 6, 2018
@joshday changed the title from "WIP: Convert to Missing" to "WIP: Transition to Missing" on Dec 7, 2018
@piever (Collaborator) commented Dec 8, 2018

In terms of performance, it seems that the Tables-style iteration (with a lazy `ColumnRows` object that overloads `getproperty` to return the correct value without materializing the `NamedTuple`) preserves performance much better than the IndexedTables-style iteration (which materializes the `NamedTuple`): see JuliaData/Tables.jl#51 for a benchmark. With JuliaData/Tables.jl#50 and JuliaData/Tables.jl#51, I suspect we could switch to using Tables iteration.
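A minimal sketch of that lazy-row idea (illustrative names only, not the actual Tables.jl implementation):

```julia
# A lazy row holds the columns plus a row index; `getproperty` pulls out
# a single field on demand, so the full NamedTuple is never materialized.
struct LazyRowSketch{C}
    columns::C  # a NamedTuple of column vectors
    i::Int
end

Base.getproperty(r::LazyRowSketch, name::Symbol) =
    getfield(r, :columns)[name][getfield(r, :i)]

cols = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
row = LazyRowSketch(cols, 2)
row.a, row.b  # (2, 5.0)
```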

@joshday (Collaborator, Author) commented Dec 8, 2018

Since CI is green, I'm inclined to merge this first and let the lazy-row implementation go into a separate PR before we make a new release of IndexedTables.

@piever Do you want to give the lazy row stuff a go?

@piever (Collaborator) commented Dec 8, 2018

I agree; this PR is quite massive as it is. I'm happy to try my hand at changing the default iteration in a new PR, though it may be a bit tricky.

In particular, one issue is that the Tables implementation of collecting a row iterator into columns does not support unwrapping (see JuliaData/Tables.jl#17), whereas IndexedTables unwraps (Named)Tuples by default (though I'm not sure how necessary that is).

Resolved review threads: src/join.jl, src/ndsparse.jl, src/reshape.jl, src/selection.jl, src/utils.jl
@quinnj (Member) left a comment
This is looking great; there's actually a lot less change than I expected. What's left?

```julia
excludecols(t, cols) = excludecols(t, (cols,))
excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
function excludecols(t, cols::Tuple)
    Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
end
```
A reviewer (Member) commented on the diff:
Nice simplification here

@joshday joshday merged commit a5e9c32 into master Dec 12, 2018
@joshday joshday deleted the jd/missing branch December 12, 2018 18:15
@davidanthoff (Contributor) commented:
Have you guys decided to just live with the performance issues that you get from Missing? They are quite severe for common cases, like a map over a table with columns that can hold missing values.

On v0.8.1:

```julia
d = DataValueVector{Float64}(randn(10_000_000))
d[3] = NA
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))

# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)

@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

gets me `0.643079 seconds (82.47 k allocations: 766.929 MiB, 29.42% gc time)`.

On v0.9.0:

```julia
d = Union{Missing,Float64}[i for i in randn(10_000_000)]
d[3] = missing
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))

# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)

@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

gets me `7.376202 seconds (150.13 M allocations: 4.672 GiB, 7.68% gc time)`.

That is more than an order-of-magnitude performance drop... The root cause here is the same issue that keeps me from adopting Missing for Query.jl, and that we have already discussed to death in various threads on Discourse etc. I don't want to restart that discussion; I'm just curious what the thinking behind the choice here was.

@davidanthoff (Contributor) commented:
Oh, and I also get the impression that the whole TableTraits.jl/Tables.jl integration is all broken...

For example:

```julia
julia> DataFrame(a=[1,missing,5], b=[2., 6., missing]) |> table
Table with 3 rows, 2 columns:
a    b
────────
1    2.0
#NA  6.0
5    #NA
```

Shouldn't that create an IndexedTable with missing values? I'm not even sure which interface it uses for the conversion, but the end result seems incorrect to me.

And the other way around, things are also broken:

```julia
julia> table((a=[1,missing,5],b=[2.,6.,missing])) |> IteratorInterfaceExtensions.getiterator |> first
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}((1, 2.0))
```

That is incorrect: IteratorInterfaceExtensions.getiterator must return an iterator that uses DataValue for missing values; otherwise it does not correctly implement the TableTraits.jl interface.
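For reference, the expected conversion has roughly this shape (a hypothetical shim, not the actual TableTraits.jl code):

```julia
using DataValues

# Hypothetical per-field conversion: Union{T,Missing} -> DataValue{T}.
to_datavalue(::Type{T}, x) where {T} =
    x === missing ? DataValue{T}() : DataValue{T}(x)

row = (a = missing, b = 2.0)
(a = to_datavalue(Int, row.a), b = to_datavalue(Float64, row.b))
```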

All of these scenarios worked fine and correctly for a very long time before this PR...

@piever (Collaborator) commented Dec 21, 2018

There definitely is a performance issue once at least three fields can hold missing. Below, `vN` has `N` missable fields:

```julia
julia> using StructArrays

julia> m = Union{Missing, Int}[i for i in rand(Int, 10_000_000)];

julia> m[3] = missing;

julia> nm = rand(Int, 10_000_000);

julia> v0 = StructArray(a=nm, b=nm, c=nm);

julia> v1 = StructArray(a=m, b=nm, c=nm);

julia> v2 = StructArray(a=m, b=m, c=nm);

julia> v3 = StructArray(a=m, b=m, c=m);

julia> f1(v) = StructArray(v[i] for i in eachindex(v))
f1 (generic function with 1 method)

julia> f2(v) = StructArray(StructArrays.LazyRow(v, i) for i in eachindex(v))
f2 (generic function with 1 method)

julia> @time f1(v0);
  0.297717 seconds (138.80 k allocations: 235.571 MiB, 22.80% gc time)

julia> @time f1(v1);
  0.325315 seconds (153.66 k allocations: 245.685 MiB, 20.19% gc time)

julia> @time f1(v2);
  0.467856 seconds (174.08 k allocations: 256.051 MiB, 23.24% gc time)

julia> @time f1(v3);
  4.976934 seconds (60.11 M allocations: 1.747 GiB, 5.96% gc time)

julia> @time f2(v0);
  0.257504 seconds (172.24 k allocations: 237.419 MiB, 5.44% gc time)

julia> @time f2(v1);
  0.539007 seconds (10.18 M allocations: 552.336 MiB, 36.22% gc time)

julia> @time f2(v2);
  0.499234 seconds (10.21 M allocations: 563.577 MiB, 28.97% gc time)

julia> @time f2(v3);
  0.520931 seconds (10.24 M allocations: 574.369 MiB, 27.54% gc time)
```

The plan to alleviate this is to transition to `LazyRow` for iteration (lazy iteration that does not materialize the full NamedTuple). It should probably become the default (as it is for DataFrames, and it has several side advantages).

The collection machinery in StructArrays is designed to be fast even when the eltype of the iterator is unstable, as long as each column is stable. However, here it is actually the function itself that is slow:

```julia
julia> const T = eltype(v1)
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> f(i)::T = (a=i.a, b=i.b, c=i.c)
f (generic function with 1 method)

julia> Core.Compiler.return_type(f, Tuple{eltype(v1)})
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> @time StructArray(f(el) for el in v0);
  0.239057 seconds (17 allocations: 238.419 MiB, 44.94% gc time)

julia> @time StructArray(f(el) for el in v1);
  3.895166 seconds (60.00 M allocations: 1.723 GiB, 7.76% gc time)
```

Here the bottleneck is actually "computing f" when the input data is missing:

```julia
julia> s0 = v0[1];

julia> s1 = v1[1];

julia> @btime f(s0);
  17.957 ns (1 allocation: 48 bytes)

julia> @btime f(s1);
  341.888 ns (7 allocations: 208 bytes)
```

@davidanthoff (Contributor) commented:
Hm, that is very strange... I don't understand the problems you show in your first code block; I would not have anticipated any issue there. I think that is an entirely different problem from the one we had discussed previously in these endless threads, obviously also important and quite severe, but distinct. Do you understand why there is a problem there?

The issue I had in mind was the one you describe later, i.e. the problem of f not being type-stable. In your example you kind of hide that by annotating f with a return type, which of course one can't do in a generic map implementation. I don't see how a lazy row view would help with that; those seem like entirely orthogonal problems. Is there a plan for that specific issue?

@piever (Collaborator) commented Dec 22, 2018

> The issue I had in mind was the one you describe later, i.e. the problem of f not being type-stable. In your example you kind of hide that by annotating f with a return type, which of course one can't do in a generic map implementation. I don't see how a lazy row view would help with that; those seem like entirely orthogonal problems. Is there a plan for that specific issue?

I annotated the return type exactly to show that type instability of f is not the problem (or at least not the only one): performance is bad even when the return type of f is known. What I think is the bottleneck (in this extreme example) is not the "collection code" when f is type-unstable, but the actual "computation of f": computing f(s1) when s1 has "missable" fields is very slow. Unless I'm misunderstanding something basic, here N = 10_000_000 and f(s1) takes 340 ns, so it would seem 3.4 of the total 3.8 seconds are spent on that computation alone.

I don't understand the compiler at all, so I have no idea what can be done in practice to fix this, but I suspect the slowness of this kind of simple function f is an MWE of something that can only be optimized on the Base Julia side of things (not in StructArrays or IndexedTables).
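One possible workaround (just a sketch; not benchmarked in this thread's setting) is to split the union by hand before building the NamedTuple, so each branch only handles concretely-typed values:

```julia
# Manual union splitting: branch on `ismissing` so that in each branch
# the `a` field has a concrete type, and the NamedTuple constructor
# never has to handle a boxed Union value.
f_split(i) = ismissing(i.a) ?
    (a = missing, b = i.b, c = i.c) :
    (a = i.a::Int, b = i.b, c = i.c)
```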

@JeffBezanson (Contributor) commented:
I think changing how the NamedTuple constructor is implemented might alleviate that, but I don't have a full solution yet. And fixing the performance of f is nice, but as @davidanthoff said, that will probably not fix all real-world cases.

For now I would humbly suggest not moving IndexedTables and JuliaDB to missing as long as there are serious performance regressions.

@davidanthoff (Contributor) commented:
> For now I would humbly suggest not moving IndexedTables and JuliaDB to missing as long as there are serious performance regressions.

I think this has all been released already, right?

I think there might generally be a need for more quality control in the release process. The last two releases at least were quite problematic, IMO: the previous one broke joins etc. whenever data was loaded from CSV files (because TextParse.jl had moved to Missing but JuliaDB had not), and now this release has these performance issues and essentially broke the whole TableTraits.jl integration story...
