WIP: Transition to Missing #200

Merged 25 commits on Dec 12, 2018
6 changes: 6 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,9 @@
- **(feature)** - `collect_columns` function to collect an iterator of tuples to `Columns` object. (#135)
- **(bugfix)** use `collect_columns` to implement `map`, `groupreduce` and `groupjoin` (#150) to not depend on type inference. Works in many more cases.
- **(feature)** - `view` works with logical indexes now (#134)


## v0.9.0

- **(breaking)** Switch from DataValues to Missing. Related: `dropna` has been changed to `dropmissing`.
- **(breaking)** Depend on OnlineStatsBase rather than OnlineStats.
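
For readers coming from DataValues, the `Missing` model named in the changelog entry above can be sketched with Base Julia alone. This is a minimal illustration, not code from this PR: `vals`, `total`, `filled`, and `kept` are made-up names, and the sketch deliberately avoids calling `dropmissing`, whose exact signature is not shown in this diff.

```julia
# Base Julia only; `vals` is an illustrative vector, not data from this PR.
vals = [91, missing, 95]

# A vector holding missing entries has a Union element type.
T = eltype(vals)

# skipmissing iterates over the non-missing entries only.
total = sum(skipmissing(vals))

# coalesce replaces each missing entry with a fallback value.
filled = coalesce.(vals, 0)

# Dropping missing entries by hand, similar in spirit to what a
# `dropmissing`-style function would do row by row (an assumption).
kept = collect(skipmissing(vals))
```

The element type `Union{Missing, Int}` is what makes plain `Vector`s sufficient as column storage, which is presumably why the dedicated `DataValueArray` dependency could be dropped.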
61 changes: 30 additions & 31 deletions README.md
Expand Up @@ -10,8 +10,28 @@ be used on its own for efficient in-memory data processing and analytics.

## Data Structures

- **The two table types in IndexedTables differ in how data is accessed.**
- **There is no performance difference between table types for operations such as selecting, filtering, and map/reduce.**
IndexedTables offers two data structures: `IndexedTable` and `NDSparse`.

- **Both types store data _in columns_**.
- **`IndexedTable` and `NDSparse` differ mainly in how data is accessed.**
- **Both types have equal performance for Table operations (`select`, `filter`, etc.).**


## Quickstart

```
using Pkg
Pkg.add("IndexedTables")
using IndexedTables

t = table((x = 1:100, y = randn(100)))

select(t, :x)

filter(row -> row.y > 0, t)
```

## `IndexedTable` vs. `NDSparse`

First let's create some data to work with.

Expand All @@ -22,18 +42,18 @@ city = vcat(fill("New York", 3), fill("Boston", 3))

dates = repeat(Date(2016,7,6):Day(1):Date(2016,7,8), 2)

values = [91, 89, 91, 95, 83, 76]
vals = [91, 89, 91, 95, 83, 76]
```

### Table
### IndexedTable

- Data is accessed as a Vector of NamedTuples.
- Sorted by primary key(s), `pkey`.
- (Optionally) Sorted by primary key(s), `pkey`.
- Data is accessed as a Vector of NamedTuples.

```julia
using IndexedTables

julia> t1 = table((city = city, dates = dates, values = values); pkey = [:city, :dates])
julia> t1 = table((city = city, dates = dates, values = vals); pkey = [:city, :dates])
Table with 6 rows, 3 columns:
city dates values
──────────────────────────────
Expand All @@ -46,18 +66,15 @@ city dates values

julia> t1[1]
(city = "Boston", dates = 2016-07-06, values = 95)

julia> first(t1)
(city = "Boston", dates = 2016-07-06, values = 95)
```

### NDSparse

- Data is accessed as an N-dimensional sparse array with arbitrary indexes.
- Sorted by index variables (first argument).
- Data is accessed as an N-dimensional sparse array with arbitrary indexes.

```julia
julia> t2 = ndsparse(@NT(city=city, dates=dates), @NT(value=values))
julia> t2 = ndsparse((city=city, dates=dates), (value=vals,))
2-d NDSparse with 6 values (1 field named tuples):
city dates │ value
───────────────────────┼──────
Expand All @@ -70,26 +87,8 @@ city dates │ value

julia> t2["Boston", Date(2016, 7, 6)]
(value = 95)

julia> first(t2)
(value = 95)
```

As with other multi-dimensional arrays, dimensions can be permuted to change the sort order:

```julia
julia> permutedims(t2, [2,1])
2-d NDSparse with 6 values (1 field named tuples):
dates city │ value
───────────────────────┼──────
2016-07-06 "Boston" │ 95
2016-07-06 "New York" │ 91
2016-07-07 "Boston" │ 83
2016-07-07 "New York" │ 89
2016-07-08 "Boston" │ 76
2016-07-08 "New York" │ 91
```

## Get started

For more information, check out the [JuliaDB API Reference](http://juliadb.org/latest/api/datastructures.html).
For more information, check out the [JuliaDB Documentation](http://juliadb.org/latest/index.html).
2 changes: 1 addition & 1 deletion REQUIRE
Expand Up @@ -5,4 +5,4 @@ WeakRefStrings 0.4.4
TableTraits 0.3.0
TableTraitsUtils 0.2.0
IteratorInterfaceExtensions 0.1.0
DataValues
Tables
19 changes: 11 additions & 8 deletions src/IndexedTables.jl
Expand Up @@ -4,13 +4,15 @@ using PooledArrays, SparseArrays, Statistics, WeakRefStrings, TableTraits,
TableTraitsUtils, IteratorInterfaceExtensions

using OnlineStatsBase: OnlineStat, fit!
using DataValues: DataValues, DataValue, NA, isna, DataValueArray
import DataValues: dropna
import Tables

import Base:
show, eltype, length, getindex, setindex!, ndims, map, convert, keys, values,
==, broadcast, empty!, copy, similar, sum, merge, merge!, mapslices,
permutedims, sort, sort!, iterate, pairs
permutedims, sort, sort!, iterate, pairs, reduce, push!, size, permute!, issorted,
sortperm, summary, resize!, vcat, append!, copyto!, view, tail,
tuple_type_cons, tuple_type_head, tuple_type_tail, in, convert


#-----------------------------------------------------------------------# exports
export
Expand All @@ -20,20 +22,20 @@ export
AbstractNDSparse, All, ApplyColwise, Between, ColDict, Columns, IndexedTable,
Keys, NDSparse, NextTable, Not,
# functions
aggregate, aggregate!, aggregate_vec, antijoin, asofjoin, collect_columns, colnames,
column, columns, convertdim, dimlabels, dropna, flatten, flush!, groupby, groupjoin,
aggregate!, antijoin, asofjoin, collect_columns, colnames,
column, columns, convertdim, dimlabels, flatten, flush!, groupby, groupjoin,
groupreduce, innerjoin, insertafter!, insertbefore!, insertcol, insertcolafter,
insertcolbefore, leftgroupjoin, leftjoin, map_rows, naturalgroupjoin, naturaljoin,
ncols, ndsparse, outergroupjoin, outerjoin, pkeynames, pkeys, popcol, pushcol,
reducedim_vec, reindex, renamecol, rows, select, selectkeys, selectvalues, setcol,
stack, summarize, table, unstack, update!, where
stack, summarize, table, unstack, update!, where, dropmissing, dropna

const Tup = Union{Tuple,NamedTuple}
const DimName = Union{Int,Symbol}

include("utils.jl")
include("columns.jl")
include("table.jl")
include("indexedtable.jl")
include("ndsparse.jl")
include("collect.jl")

Expand Down Expand Up @@ -73,7 +75,8 @@ include("flatten.jl")
include("join.jl")
include("reshape.jl")

# TableTraits.jl integration
# TableTraits/Tables integration
include("tabletraits.jl")
include("tables.jl")

end # module
7 changes: 2 additions & 5 deletions src/collect.jl
@@ -1,8 +1,5 @@
_is_subtype(::Type{S}, ::Type{T}) where {S, T} = promote_type(S, T) == T

dataarrayof(::Type{<:DataValue{T}}, len) where {T} = DataValueArray{T,1}(len)
dataarrayof(::Type{T}, len) where {T} = Vector{T}(undef, len)

"""
collect_columns(itr)

Expand Down Expand Up @@ -166,7 +163,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
idx = findall(collect(!(s <: t) for (s, t) in zip(sp, tp)))
new = dest
for l in idx
newcol = dataarrayof(promote_type(sp[l], tp[l]), length(dest))
newcol = Vector{promote_type(sp[l], tp[l])}(undef, length(dest))
copyto!(newcol, 1, column(dest, l), 1, i-1)
new = setcol(new, l, newcol)
end
Expand All @@ -175,7 +172,7 @@ function widencolumns(dest, i, el::S, ::Type{T}) where{S <: Tup, T<:Tup}
end

function widencolumns(dest, i, el::S, ::Type{T}) where{S, T}
new = dataarrayof(promote_type(S, T), length(dest))
new = Vector{promote_type(S, T)}(undef, length(dest))
copyto!(new, 1, dest, 1, i-1)
new
end
Expand Down
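
The widening step in `widencolumns` above (allocate a `Vector` of the promoted type, then copy the already-collected prefix) can be tried in isolation. The following is a minimal sketch in Base Julia; `widen_column` is a hypothetical helper written for illustration and is not part of IndexedTables.

```julia
# Hypothetical helper mirroring the widening pattern in the diff:
# allocate a vector of the promoted element type, copy the first i-1
# elements that were already collected, then store the new element.
function widen_column(dest::AbstractVector{T}, i::Int, el::S) where {T, S}
    new = Vector{promote_type(S, T)}(undef, length(dest))
    copyto!(new, 1, dest, 1, i - 1)
    new[i] = el
    return new
end

col = [1, 2, 0]                         # third slot not yet filled

with_missing = widen_column(col, 3, missing)
with_float   = widen_column(col, 3, 2.5)
```

Because `promote_type(Missing, Int)` is `Union{Missing, Int}`, an ordinary `Vector` can hold the widened column, so no separate array type is needed for missing data.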
36 changes: 8 additions & 28 deletions src/columns.jl
@@ -1,7 +1,3 @@
import Base:
push!, size, sort, sort!, permute!, issorted, sortperm,
summary, resize!, vcat, append!, copyto!, view

"""
Wrapper around a (named) tuple of Vectors that acts like a Vector of (named) tuples.

Expand Down Expand Up @@ -97,7 +93,6 @@ available selection options and syntax.
"""
function columns end

columns(c) = error("no columns defined for $(typeof(c))")
columns(c::Columns) = c.columns

# Array-like API
Expand All @@ -110,17 +105,14 @@ length(c::Columns{<:Pair, <:Pair}) = length(c.columns.first)
ndims(c::Columns) = 1

"""
`ncols(itr)`
ncols(itr)

Returns the number of columns in `itr`.

# Examples

ncols([1,2,3])
ncols(rows(([1,2,3],[4,5,6])))
ncols(table(([1,2,3],[4,5,6])))
ncols(table(@NT(x=[1,2,3],y=[4,5,6])))
ncols(ndsparse(d, [7,8,9]))
ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
"""
function ncols end
ncols(c::Columns) = fieldcount(typeof(c.columns))
Expand Down Expand Up @@ -184,21 +176,7 @@ resize!(I::Columns, n::Int) = (foreach(c->resize!(c,n), I.columns); I)

_sizehint!(c::Columns, n::Integer) = (foreach(c->_sizehint!(c,n), c.columns); c)

function ==(x::Columns, y::Columns)
nc = length(x.columns)
length(y.columns) == nc || return false
fieldnames(eltype(x)) == fieldnames(eltype(y)) || return false
n = length(x)
length(y) == n || return false
for i in 1:nc
x.columns[i] == y.columns[i] || return false
end
return true
end

==(x::Columns{<:Pair}, y::Columns) = false
==(x::Columns, y::Columns{<:Pair}) = false
==(x::Columns{<:Pair}, y::Columns{<:Pair}) = (x.columns.first == y.columns.first) && (x.columns.second == y.columns.second)
==(x::Columns, y::Columns) = x.columns == y.columns
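
The one-line `==` above delegates to equality of the underlying (named) tuple of vectors. A small sketch of how Base Julia compares such tuples follows; the named tuples here are stand-ins chosen for illustration, and the assumption that `Columns` stores its data this way is taken from the surrounding code, not verified against the full source.

```julia
# Stand-ins for the `columns` field: named tuples of vectors.
a = (x = [1, 2], y = ["a", "b"])
b = (x = [1, 2], y = ["a", "b"])
c = (x = [1, 2], z = ["a", "b"])   # same values, different field name

same_names = a == b       # names and values both match
diff_names = a == c       # field names differ

# With missing values, `==` follows three-valued logic,
# while `isequal` treats missing as equal to missing.
m1 = (x = [1, missing],)
m2 = (x = [1, missing],)
three_valued = m1 == m2
strict = isequal(m1, m2)
```

One consequence worth keeping in mind after the switch to `Missing`: comparing columns that contain `missing` with `==` can return `missing` rather than a `Bool`, so `isequal` is the safer choice in tests.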

function _strip_pair(c::Columns{<:Pair})
f, s = map(columns, c.columns)
Expand Down Expand Up @@ -368,7 +346,7 @@ end
# map

"""
`map_rows(f, c...)`
map_rows(f, c...)

Transform collection `c` by applying `f` to each element. For multiple collection arguments, apply `f`
elementwise. Collect output as `Columns` if `f` returns
Expand Down Expand Up @@ -449,7 +427,7 @@ struct Between{T1 <: Union{Int, Symbol}, T2 <: Union{Int, Symbol}}
last::T2
end

const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex}
const SpecialSelector = Union{Not, All, Keys, Between, Function, Regex, Type}

hascolumns(t, s) = true
hascolumns(t, s::Symbol) = s in colnames(t)
Expand All @@ -458,6 +436,7 @@ hascolumns(t, s::Tuple) = all(hascolumns(t, x) for x in s)
hascolumns(t, s::Not) = hascolumns(t, s.cols)
hascolumns(t, s::Between) = hascolumns(t, s.first) && hascolumns(t, s.last)
hascolumns(t, s::All) = all(hascolumns(t, x) for x in s.cols)
hascolumns(t, s::Type) = any(x -> eltype(x) <: s, columns(t))

lowerselection(t, s) = s
lowerselection(t, s::Union{Int, Symbol}) = colindex(t, s)
Expand All @@ -467,6 +446,7 @@ lowerselection(t, s::Keys) = lowerselection(t, IndexedTables.pkeyn
lowerselection(t, s::Between) = Tuple(colindex(t, s.first):colindex(t, s.last))
lowerselection(t, s::Function) = colindex(t, Tuple(filter(s, collect(colnames(t)))))
lowerselection(t, s::Regex) = lowerselection(t, x -> occursin(s, string(x)))
lowerselection(t, s::Type) = Tuple(findall(x -> eltype(x) <: s, columns(t)))
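
The new `Type` selector picks columns whose element type is a subtype of the given type. The predicate can be tried on a plain tuple of vectors; `select_by_type` below is a hypothetical helper for illustration, not an IndexedTables function, and it uses `enumerate` rather than `findall` to keep the sketch independent of the container type.

```julia
# Hypothetical helper: positions of the columns whose eltype is a subtype of T.
select_by_type(cols, T) = Tuple(i for (i, c) in enumerate(cols) if eltype(c) <: T)

cols = ([1, 2], ["a", "b"], [1.5, missing])

nums            = select_by_type(cols, Number)
nums_or_missing = select_by_type(cols, Union{Missing, Number})
strs            = select_by_type(cols, AbstractString)
```

Note that a column with missing entries has element type `Union{Missing, Float64}`, which is not a subtype of `Number`; selecting such columns requires a selector like `Union{Missing, Number}`.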

function lowerselection(t, s::All)
s.cols == () && return lowerselection(t, valuenames(t))
Expand Down
30 changes: 9 additions & 21 deletions src/table.jl → src/indexedtable.jl
@@ -1,5 +1,3 @@
import Base: setindex!, reduce

"""
A permutation

Expand All @@ -16,7 +14,7 @@ end
abstract type AbstractIndexedTable end

"""
A tabular data structure that extends [`Columns`](@ref). Create a `IndexedTable` with the
A tabular data structure that extends [`Columns`](@ref). Create an `IndexedTable` with the
[`table`](@ref) function.
"""
struct IndexedTable{C<:Columns} <: AbstractIndexedTable
Expand Down Expand Up @@ -51,7 +49,9 @@ Construct a table from a vector of tuples. See [`rows`](@ref) and [`Columns`](@r

Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.

table(iter; kw...)
table(x; kw...)

Create an `IndexedTable` from any object `x` that follows the `Tables.jl` interface.


# Keyword Argument Options:
Expand Down Expand Up @@ -353,7 +353,7 @@ function sort!(t::IndexedTable, by...; kwargs...)
end

"""
excludecols(itr, cols)
excludecols(itr, cols) -> Tuple of Int

Names of all columns in `itr` except `cols`. `itr` can be any of
`Table`, `NDSparse`, `Columns`, or `AbstractVector`
Expand All @@ -369,22 +369,10 @@ Names of all columns in `itr` except `cols`. `itr` can be any of
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))
"""
function excludecols(t, cols)
if cols isa SpecialSelector
return excludecols(t, lowerselection(t, cols))
end
if !isa(cols, Tuple)
return excludecols(t, (cols,))
end
ns = colnames(t)
mask = ones(Bool, length(ns))
for c in cols
i = colindex(t, c)
if i !== 0
mask[i] = false
end
end
((1:length(ns))[mask]...,)
excludecols(t, cols) = excludecols(t, (cols,))
excludecols(t, cols::SpecialSelector) = excludecols(t, lowerselection(t, cols))
function excludecols(t, cols::Tuple)
Tuple(setdiff(1:length(colnames(t)), map(x -> colindex(t, x), cols)))
Review comment (Member): Nice simplification here
end
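
The `setdiff`-based rewrite of `excludecols` can be checked on bare indices. In this sketch `exclude_indices` is a hypothetical stand-in that takes the column count and already-resolved indices directly, so it does not depend on `colnames` or `colindex`.

```julia
# Hypothetical stand-in for the new method body: all column positions
# except the excluded ones, returned as a tuple of Int.
exclude_indices(ncols::Int, excluded) = Tuple(setdiff(1:ncols, excluded))

kept = exclude_indices(4, (2, 4))

# An index of 0 (which the removed loop skipped explicitly) matches no
# position in 1:ncols, so setdiff ignores it without a special case.
kept_with_zero = exclude_indices(3, (0, 2))
```

This suggests the shorter version preserves the old behavior for indices that do not correspond to a column, under the assumption that `colindex` reports such columns as 0, as the removed `i !== 0` check implies.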

"""
Expand Down
1 change: 0 additions & 1 deletion src/indexing.jl
Expand Up @@ -19,7 +19,6 @@ _in(x, v::AbstractString) = x == v
_in(x, v::Symbol) = x === v
_in(x, v::Number) = isequal(x, v)

import Base: tail
# test whether row r is within product(idxs...)
@inline row_in(cs, r::Integer, idxs) = _row_in(cs[1], r, idxs[1], tail(cs), tail(idxs))
@inline _row_in(c1, r, i1, rI, ri) = _in(c1[r],i1) & _row_in(rI[1], r, ri[1], tail(rI), tail(ri))
Expand Down