티스토리 뷰

08_joins

Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2017

Reference

Series

In [1]:
using DataFrames # load package

Joining DataFrames

Preparing DataFrames for a join

In [2]:
x = DataFrame(ID=[1,2,3,4,missing], name = ["Alice", "Bob", "Conor", "Dave","Zed"])
y = DataFrame(id=[1,2,5,6,missing], age = [21,22,23,24,99])
println(x)
println(y)
5×2 DataFrame
│ Row │ ID      │ name   │
│     │ Int64⍰String │
├─────┼─────────┼────────┤
│ 1   │ 1       │ Alice  │
│ 2   │ 2       │ Bob    │
│ 3   │ 3       │ Conor  │
│ 4   │ 4       │ Dave   │
│ 5   │ missing │ Zed    │
5×2 DataFrame
│ Row │ id      │ age   │
│     │ Int64⍰Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 21    │
│ 2   │ 2       │ 22    │
│ 3   │ 5       │ 23    │
│ 4   │ 6       │ 24    │
│ 5   │ missing │ 99    │
In [3]:
rename!(x, :ID=>:id) # names of columns on which we want to join must be the same
Out[3]:
idname
Int64⍰String
11Alice
22Bob
33Conor
44Dave
5missingZed

Standard joins: inner, left, right, outer, semi, anti

In [4]:
join(x, y, on=:id) # :inner join by default, missing is joined
Out[4]:
idnameage
Int64⍰StringInt64
11Alice21
22Bob22
3missingZed99
In [5]:
join(x, y, on=:id, kind=:left)
Out[5]:
idnameage
Int64⍰StringInt64⍰
11Alice21
22Bob22
33Conormissing
44Davemissing
5missingZed99
In [6]:
join(x, y, on=:id, kind=:right)
Out[6]:
idnameage
Int64⍰String⍰Int64
11Alice21
22Bob22
3missingZed99
45missing23
56missing24
In [7]:
join(x, y, on=:id, kind=:outer)
Out[7]:
idnameage
Int64⍰String⍰Int64⍰
11Alice21
22Bob22
33Conormissing
44Davemissing
5missingZed99
65missing23
76missing24
In [8]:
join(x, y, on=:id, kind=:semi)
Out[8]:
idname
Int64⍰String
11Alice
22Bob
3missingZed
In [9]:
join(x, y, on=:id, kind=:anti)
Out[9]:
idname
Int64⍰String
13Conor
24Dave

Cross join

In [10]:
# cross-join does not require on argument
# it produces a Cartesian product or arguments
function expand_grid(;xs...) # a simple replacement for expand.grid in R
    reduce((x,y) -> join(x, DataFrame(Pair(y...)), kind=:cross),
           DataFrame(Pair(xs[1]...)), xs[2:end])
end

expand_grid(a=[1,2], b=["a","b","c"], c=[true,false])
ArgumentError: unable to construct DataFrame from Pair{Int64,Int64}

Stacktrace:
 [1] DataFrame(::Pair{Int64,Int64}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/tables.jl:32
 [2] #expand_grid#3(::Base.Iterators.Pairs{Symbol,Array{T,1} where T,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{String,1},Array{Bool,1}}}}, ::Function) at ./In[10]:4
 [3] (::getfield(Main, Symbol("#kw##expand_grid")))(::NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{String,1},Array{Bool,1}}}, ::typeof(expand_grid)) at ./none:0
 [4] top-level scope at In[10]:7
In [11]:
?reduce
search: reduce mapreduce

Out[11]:
reduce(op, itr; [init])

Reduce the given collection itr with the given binary operator op. If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

For empty collections, providing init will be necessary, except for some special cases (e.g. when op is one of +, *, max, min, &, |) when Julia can determine the neutral element of op.

Reductions for certain commonly-used operators may have special implementations, and should be used instead: maximum(itr), minimum(itr), sum(itr), prod(itr), any(itr), all(itr).

The associativity of the reduction is implementation dependent. This means that you can't use non-associative operations like - because it is undefined whether reduce(-,[1,2,3]) should be evaluated as (1-2)-3 or 1-(2-3). Use foldl or foldr instead for guaranteed left or right associativity.

Some operations accumulate error. Parallelism will be easier if the reduction can be executed in groups. Future versions of Julia might change the algorithm. Note that the elements are not reordered if you use an ordered collection.

Examples

jldoctest
julia> reduce(*, [2; 3; 4])
24

julia> reduce(*, [2; 3; 4]; init=-1)
-24

reduce(f, A; dims=:, [init])

Reduce 2-argument function f along dimensions of A. dims is a vector specifying the dimensions to reduce, and the keyword argument init is the initial value to use in the reductions. For +, *, max and min the init argument is optional.

The associativity of the reduction is implementation-dependent; if you need a particular associativity, e.g. left-to-right, you should write your own loop or consider using foldl or foldr. See documentation for reduce.

Examples

jldoctest
julia> a = reshape(Vector(1:16), (4,4))
4×4 Array{Int64,2}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> reduce(max, a, dims=2)
4×1 Array{Int64,2}:
 13
 14
 15
 16

julia> reduce(max, a, dims=1)
1×4 Array{Int64,2}:
 4  8  12  16

Complex cases of joins

In [12]:
x = DataFrame(id1=[1,1,2,2,missing,missing],
              id2=[1,11,2,21,missing,99],
              name = ["Alice", "Bob", "Conor", "Dave","Zed", "Zoe"])
y = DataFrame(id1=[1,1,3,3,missing,missing],
              id2=[11,1,31,3,missing,999],
              age = [21,22,23,24,99, 100])
println(x)
println(y)
6×3 DataFrame
│ Row │ id1     │ id2     │ name   │
│     │ Int64⍰Int64⍰String │
├─────┼─────────┼─────────┼────────┤
│ 1   │ 1       │ 1       │ Alice  │
│ 2   │ 1       │ 11      │ Bob    │
│ 3   │ 2       │ 2       │ Conor  │
│ 4   │ 2       │ 21      │ Dave   │
│ 5   │ missingmissing │ Zed    │
│ 6   │ missing │ 99      │ Zoe    │
6×3 DataFrame
│ Row │ id1     │ id2     │ age   │
│     │ Int64⍰Int64⍰Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 11      │ 21    │
│ 2   │ 1       │ 1       │ 22    │
│ 3   │ 3       │ 31      │ 23    │
│ 4   │ 3       │ 3       │ 24    │
│ 5   │ missingmissing │ 99    │
│ 6   │ missing │ 999     │ 100   │
In [13]:
join(x, y, on=[:id1, :id2]) # joining on two columns
Out[13]:
id1id2nameage
Int64⍰Int64⍰StringInt64
111Alice22
2111Bob21
3missingmissingZed99
In [14]:
join(x, y, on=[:id1], makeunique=true) # with duplicates all combinations are produced (here :inner join)
Out[14]:
id1id2nameid2_1age
Int64⍰Int64⍰StringInt64⍰Int64
111Alice1121
211Alice122
3111Bob1121
4111Bob122
5missingmissingZedmissing99
6missingmissingZed999100
7missing99Zoemissing99
8missing99Zoe999100
In [15]:
join(x, y, on=[:id1], kind=:semi) # but not by :semi join (as it would duplicate rows)
Out[15]:
id1id2name
Int64⍰Int64⍰String
111Alice
2111Bob
3missingmissingZed
4missing99Zoe

'Flux in Julia > Learning Julia (Intro_to_Julia_DFs)' 카테고리의 다른 글

09. reshaping  (0) 2018.10.15
08. joins (한글)  (0) 2018.10.14
07. factors (한글)  (0) 2018.10.13
07. factors  (0) 2018.10.13
06. rows (한글)  (0) 2018.10.12
공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
TAG
more
«   2025/05   »
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
글 보관함