딥스탯 2018. 10. 14. 14:42
08_joins(한글)

Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2017

출처

함께보기

In [1]:
using DataFrames # load package

데이터프레임 조인하기 (Joining DataFrames)

조인할 데이터프레임 준비하기 (Preparing DataFrames for a join)

In [2]:
x = DataFrame(ID=[1,2,3,4,missing], name = ["Alice", "Bob", "Conor", "Dave","Zed"])
y = DataFrame(id=[1,2,5,6,missing], age = [21,22,23,24,99])
println(x)
println(y)
5×2 DataFrame
│ Row │ ID      │ name   │
│     │ Int64⍰String │
├─────┼─────────┼────────┤
│ 1   │ 1       │ Alice  │
│ 2   │ 2       │ Bob    │
│ 3   │ 3       │ Conor  │
│ 4   │ 4       │ Dave   │
│ 5   │ missing │ Zed    │
5×2 DataFrame
│ Row │ id      │ age   │
│     │ Int64⍰Int64 │
├─────┼─────────┼───────┤
│ 1   │ 1       │ 21    │
│ 2   │ 2       │ 22    │
│ 3   │ 5       │ 23    │
│ 4   │ 6       │ 24    │
│ 5   │ missing │ 99    │
In [3]:
rename!(x, :ID=>:id) # 조인(joini할 기준이 되는 행 이름은 같아야만 한다.
Out[3]:
idname
Int64⍰String
11Alice
22Bob
33Conor
44Dave
5missingZed

기본 조인 (Standard joins: inner, left, right, outer, semi, anti)

In [4]:
join(x, y, on=:id) # 기본적으로 이너조인(inner join)을 수행한다. 결측(missing)도 조인된다.
Out[4]:
idnameage
Int64⍰StringInt64
11Alice21
22Bob22
3missingZed99
In [5]:
join(x, y, on=:id, kind=:left) # 레프트조인(left join)
Out[5]:
idnameage
Int64⍰StringInt64⍰
11Alice21
22Bob22
33Conormissing
44Davemissing
5missingZed99
In [6]:
join(x, y, on=:id, kind=:right) # 라이트조인(right join)
Out[6]:
idnameage
Int64⍰String⍰Int64
11Alice21
22Bob22
3missingZed99
45missing23
56missing24
In [7]:
join(x, y, on=:id, kind=:outer) #아우터조인(outer join)
Out[7]:
idnameage
Int64⍰String⍰Int64⍰
11Alice21
22Bob22
33Conormissing
44Davemissing
5missingZed99
65missing23
76missing24
In [8]:
join(x, y, on=:id, kind=:semi) #세미조인(semi join)
Out[8]:
idname
Int64⍰String
11Alice
22Bob
3missingZed
In [9]:
join(x, y, on=:id, kind=:anti) #안티조인(anti join)
Out[9]:
idname
Int64⍰String
13Conor
24Dave

크로스조인 (Cross join)

In [10]:
# 크로스조인(cross-join)은 "on" 인자(argument)를 필요로 하지 않는다.
# 크로스조인(cross-join)은 카테이션 곱(Cartesian product) 혹은 인자(argument)를 만든다.
function expand_grid(;xs...) # R 언어에서 쓰이는 expand.grid의 간단한 형태의 함수
    reduce((x,y) -> join(x, DataFrame(Pair(y...)), kind=:cross),
           DataFrame(Pair(xs[1]...)), xs[2:end])
end

expand_grid(a=[1,2], b=["a","b","c"], c=[true,false])
ArgumentError: unable to construct DataFrame from Pair{Int64,Int64}

Stacktrace:
 [1] DataFrame(::Pair{Int64,Int64}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/tables.jl:32
 [2] #expand_grid#3(::Base.Iterators.Pairs{Symbol,Array{T,1} where T,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{String,1},Array{Bool,1}}}}, ::Function) at ./In[10]:4
 [3] (::getfield(Main, Symbol("#kw##expand_grid")))(::NamedTuple{(:a, :b, :c),Tuple{Array{Int64,1},Array{String,1},Array{Bool,1}}}, ::typeof(expand_grid)) at ./none:0
 [4] top-level scope at In[10]:7
In [11]:
?reduce
search: reduce mapreduce

Out[11]:
reduce(op, itr; [init])

Reduce the given collection itr with the given binary operator op. If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

For empty collections, providing init will be necessary, except for some special cases (e.g. when op is one of +, *, max, min, &, |) when Julia can determine the neutral element of op.

Reductions for certain commonly-used operators may have special implementations, and should be used instead: maximum(itr), minimum(itr), sum(itr), prod(itr), any(itr), all(itr).

The associativity of the reduction is implementation dependent. This means that you can't use non-associative operations like - because it is undefined whether reduce(-,[1,2,3]) should be evaluated as (1-2)-3 or 1-(2-3). Use foldl or foldr instead for guaranteed left or right associativity.

Some operations accumulate error. Parallelism will be easier if the reduction can be executed in groups. Future versions of Julia might change the algorithm. Note that the elements are not reordered if you use an ordered collection.

Examples

jldoctest
julia> reduce(*, [2; 3; 4])
24

julia> reduce(*, [2; 3; 4]; init=-1)
-24

reduce(f, A; dims=:, [init])

Reduce 2-argument function f along dimensions of A. dims is a vector specifying the dimensions to reduce, and the keyword argument init is the initial value to use in the reductions. For +, *, max and min the init argument is optional.

The associativity of the reduction is implementation-dependent; if you need a particular associativity, e.g. left-to-right, you should write your own loop or consider using foldl or foldr. See documentation for reduce.

Examples

jldoctest
julia> a = reshape(Vector(1:16), (4,4))
4×4 Array{Int64,2}:
 1  5   9  13
 2  6  10  14
 3  7  11  15
 4  8  12  16

julia> reduce(max, a, dims=2)
4×1 Array{Int64,2}:
 13
 14
 15
 16

julia> reduce(max, a, dims=1)
1×4 Array{Int64,2}:
 4  8  12  16

복잡한 형태의 조인 (Complex cases of joins)

In [12]:
x = DataFrame(id1=[1,1,2,2,missing,missing],
              id2=[1,11,2,21,missing,99],
              name = ["Alice", "Bob", "Conor", "Dave","Zed", "Zoe"])
y = DataFrame(id1=[1,1,3,3,missing,missing],
              id2=[11,1,31,3,missing,999],
              age = [21,22,23,24,99, 100])
println(x)
println(y)
6×3 DataFrame
│ Row │ id1     │ id2     │ name   │
│     │ Int64⍰Int64⍰String │
├─────┼─────────┼─────────┼────────┤
│ 1   │ 1       │ 1       │ Alice  │
│ 2   │ 1       │ 11      │ Bob    │
│ 3   │ 2       │ 2       │ Conor  │
│ 4   │ 2       │ 21      │ Dave   │
│ 5   │ missingmissing │ Zed    │
│ 6   │ missing │ 99      │ Zoe    │
6×3 DataFrame
│ Row │ id1     │ id2     │ age   │
│     │ Int64⍰Int64⍰Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 11      │ 21    │
│ 2   │ 1       │ 1       │ 22    │
│ 3   │ 3       │ 31      │ 23    │
│ 4   │ 3       │ 3       │ 24    │
│ 5   │ missingmissing │ 99    │
│ 6   │ missing │ 999     │ 100   │
In [13]:
join(x, y, on=[:id1, :id2]) # 2개 행을 기준으로 조인
Out[13]:
id1id2nameage
Int64⍰Int64⍰StringInt64
111Alice22
2111Bob21
3missingmissingZed99
In [14]:
join(x, y, on=[:id1], makeunique=true) # 중복되는 경우 모든 경우의 결합을 다 만들어준다. (이 예제는 이너조인(inner join))
Out[14]:
id1id2nameid2_1age
Int64⍰Int64⍰StringInt64⍰Int64
111Alice1121
211Alice122
3111Bob1121
4111Bob122
5missingmissingZedmissing99
6missingmissingZed999100
7missing99Zoemissing99
8missing99Zoe999100
In [15]:
join(x, y, on=[:id1], kind=:semi) # 예외적으로 세미조인(semi join)인 경우는 모든 결합을 다 만들어주지 않는다.
Out[15]:
id1id2name
Int64⍰Int64⍰String
111Alice
2111Bob
3missingmissingZed
4missing99Zoe