WIP: Transition to Missing #200
Conversation
In terms of performance, it seems that the Tables-style iteration (with a lazy row type) is promising.
Since CI is green, I'm inclined to merge this first and let the lazy row implementation go into a separate PR before we make a new release for IndexedTables. @piever Do you want to give the lazy row stuff a go?
I agree, this PR is already quite massive as it is. I'm happy to try my hand at changing the default iteration in a new PR, though it may be a bit tricky. In particular, one issue is that the Tables implementation of collecting a row iterator to columns does not support unwrapping (see JuliaData/Tables.jl#17), whereas in IndexedTables we unwrap (Named)Tuples by default (though I'm not sure how necessary this is).
This is looking great; there's actually a lot less change than I expected. What's left?
```julia
excludecols(t, cols) = excludecols(t, (cols,))
excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
function excludecols(t, cols::Tuple)
    Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
end
```
Nice simplification here
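As a sanity check on the simplified logic, here is a small standalone sketch of what `excludecols` computes, using a plain tuple of names and a hypothetical `colindex` stand-in rather than a real IndexedTables table:

```julia
# Standalone sketch (not IndexedTables' actual internals): `colindex` here is a
# hypothetical stand-in that resolves a column name to its position.
names = (:a, :b, :c)
colindex(names, s::Symbol) = findfirst(==(s), collect(names))
# Excluding :b from (:a, :b, :c) leaves the index tuple (1, 3).
excluded = Tuple(setdiff(1:length(names), map(s -> colindex(names, s), (:b,))))
@assert excluded == (1, 3)
```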
Have you guys decided to just live with the performance issues that you get from `Missing`?

On v0.8.1:

```julia
d = DataValueVector{Float64}(randn(10_000_000))
d[3] = NA
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))
# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

On v0.9.0:

```julia
d = Union{Missing,Float64}[i for i in randn(10_000_000)]
d[3] = missing
t = table((a=copy(d), b=copy(d), c=copy(d), d=copy(d), e=copy(d)))
# Precompile
map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
@time map(i->(e=i.e, d=i.d, c=i.c, b=i.b, a=i.a), t)
```

That gets me more than an order of magnitude performance drop... The root cause here is the same issue that keeps me from adopting `Missing`.
Oh, and I also get the impression that the whole TableTraits.jl/Tables.jl integration is broken... For example:

```julia
julia> DataFrame(a=[1,missing,5], b=[2., 6., missing]) |> table
Table with 3 rows, 2 columns:
a    b
────────
1    2.0
#NA  6.0
5    #NA
```

Shouldn't that create a table that uses `missing` rather than `DataValue`s?

And the other way around, things are also broken:

```julia
julia> table((a=[1,missing,5], b=[2.,6.,missing])) |> IteratorInterfaceExtensions.getiterator |> first
NamedTuple{(:a, :b),Tuple{Union{Missing, Int64},Union{Missing, Float64}}}((1, 2.0))
```

That is incorrect, because the TableTraits.jl iteration protocol expects `DataValue`s for missing data here. All of these scenarios worked just fine and correctly for a very long time before this PR...
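To make the element-type mismatch concrete, here is a small sketch in which plain NamedTuples stand in for table rows (no Tables/TableTraits machinery involved): iterating `Union{Missing,T}` columns yields rows carrying `missing` directly, rather than fields wrapped in `DataValue`.

```julia
# Sketch: plain NamedTuples standing in for table rows. Rows built from
# Union{Missing,T} columns carry `missing` directly in their fields.
cols = (a = Union{Missing,Int}[1, missing, 5],
        b = Union{Missing,Float64}[2.0, 6.0, missing])
rows = [(a = cols.a[i], b = cols.b[i]) for i in eachindex(cols.a)]
@assert rows[1] == (a = 1, b = 2.0)
@assert ismissing(rows[2].a) && ismissing(rows[3].b)
```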
There definitely is a performance issue once at least three fields can hold `missing`:

```julia
julia> using StructArrays

julia> m = Union{Missing, Int}[i for i in rand(Int, 10_000_000)];

julia> m[3] = missing;

julia> nm = rand(Int, 10_000_000);

julia> v0 = StructArray(a=nm, b=nm, c=nm);

julia> v1 = StructArray(a=m, b=nm, c=nm);

julia> v2 = StructArray(a=m, b=m, c=nm);

julia> v3 = StructArray(a=m, b=m, c=m);

julia> f1(v) = StructArray(v[i] for i in eachindex(v))
f1 (generic function with 1 method)

julia> f2(v) = StructArray(StructArrays.LazyRow(v, i) for i in eachindex(v))
f2 (generic function with 1 method)

julia> @time f1(v0);
  0.297717 seconds (138.80 k allocations: 235.571 MiB, 22.80% gc time)

julia> @time f1(v1);
  0.325315 seconds (153.66 k allocations: 245.685 MiB, 20.19% gc time)

julia> @time f1(v2);
  0.467856 seconds (174.08 k allocations: 256.051 MiB, 23.24% gc time)

julia> @time f1(v3);
  4.976934 seconds (60.11 M allocations: 1.747 GiB, 5.96% gc time)

julia> @time f2(v0);
  0.257504 seconds (172.24 k allocations: 237.419 MiB, 5.44% gc time)

julia> @time f2(v1);
  0.539007 seconds (10.18 M allocations: 552.336 MiB, 36.22% gc time)

julia> @time f2(v2);
  0.499234 seconds (10.21 M allocations: 563.577 MiB, 28.97% gc time)

julia> @time f2(v3);
  0.520931 seconds (10.24 M allocations: 574.369 MiB, 27.54% gc time)
```

The plan to alleviate this is to transition to lazy row iteration (the `f2` approach above).

The collection method in StructArrays is designed to be fast even when the type of the iterator is unstable, as long as each column's type is stable. However, here it is actually the function itself that is slow:

```julia
julia> const T = eltype(v1)
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> f(i)::T = (a=i.a, b=i.b, c=i.c)
f (generic function with 1 method)

julia> Core.Compiler.return_type(f, Tuple{eltype(v1)})
NamedTuple{(:a, :b, :c),Tuple{Union{Missing, Int64},Int64,Int64}}

julia> @time StructArray(f(el) for el in v0);
  0.239057 seconds (17 allocations: 238.419 MiB, 44.94% gc time)

julia> @time StructArray(f(el) for el in v1);
  3.895166 seconds (60.00 M allocations: 1.723 GiB, 7.76% gc time)
```

Here the bottleneck is actually computing `f` when the input data contains `missing`:

```julia
julia> using BenchmarkTools

julia> s0 = v0[1];

julia> s1 = v1[1];

julia> @btime f(s0);
  17.957 ns (1 allocation: 48 bytes)

julia> @btime f(s1);
  341.888 ns (7 allocations: 208 bytes)
```
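The annotation trick used above can be reproduced in isolation. The following is a hedged standalone sketch (the names `NT`, `f`, and `s` are illustrative) showing that an explicit return-type annotation makes the return type inferable exactly, even when a field has a small-union type:

```julia
# Sketch: with an explicit return-type annotation, `f` is inferrable even though
# the :a field has the small-union type Union{Missing, Int}.
const NT = NamedTuple{(:a, :b, :c), Tuple{Union{Missing, Int}, Int, Int}}
f(i)::NT = (a = i.a, b = i.b, c = i.c)
s = NT((1, 2, 3))
@assert Core.Compiler.return_type(f, Tuple{NT}) == NT  # inference is exact
@assert f(s) == s
```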
Hm, that is very strange... I don't understand the problems you show in your first code block. I would actually not have anticipated any issue there; I think that is an entirely different problem from the one we had discussed previously in these endless threads. Obviously also important and quite severe, but distinct. Do you understand why there is a problem there? The issue I had in mind was what you describe later.
I annotated the return type exactly to show that type instability of `f` is not the problem here. I don't understand the compiler at all, so I have no idea what can be done in practice to fix this, but I suspect the slowness of this kind of simple function on fields of type `Union{Missing, T}` is the real bottleneck.
I think changing how the `NamedTuple` constructor is implemented might be able to alleviate that, but I don't have a full solution yet. For now I would humbly suggest not moving IndexedTables and JuliaDB to `missing` as long as there are serious performance regressions.
I think this has all been released already, right? There might generally be a need for more quality control in the release process. The last two releases at least were quite problematic, IMO: the previous one broke joins etc. whenever data was loaded from CSV files (because TextParse.jl had moved to `Missing`, but JuliaDB had not), and now this release has these performance issues and essentially broke the whole TableTraits.jl integration story...
This builds on #199 (Tables.jl integration) and removes the dependency on DataValues.

The build/tests are running, but I'm still unsure of the consequences for performance, integration with IterableTables, etc. However, we are currently in a situation where TextParse loads missing data as `Union{T,Missing}`, but then operations like table joins can introduce `DataValue`s.
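To illustrate the mixed state described above, here is a minimal sketch assuming only DataValues.jl's basic constructors (`DataValue(x)` and `DataValue{T}()`): the two representations of the same column have incompatible element types.

```julia
# Two incompatible missing-data representations side by side (sketch):
using DataValues
parsed = Union{Missing,Float64}[1.0, missing, 3.0]               # TextParse-style column
joined = [DataValue(1.0), DataValue{Float64}(), DataValue(3.0)]  # DataValue-style column
@assert eltype(parsed) == Union{Missing, Float64}
@assert eltype(joined) == DataValue{Float64}
@assert ismissing(parsed[2]) && isna(joined[2])
```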