Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Source

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

See also

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(in Korean)
https://deepstat.tistory.com/83 (08. joins)(in English)
https://deepstat.tistory.com/84 (08. joins)(in Korean)
https://deepstat.tistory.com/85 (09. reshaping)(in English)
https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
https://deepstat.tistory.com/87 (10. transforms)(in English)
https://deepstat.tistory.com/88 (10. transforms)(in Korean)
https://deepstat.tistory.com/89 (11. performance)(in English)
https://deepstat.tistory.com/90 (11. performance)(in Korean)

using DataFrames
using BenchmarkTools

Performance tips

Accessing a column by number is faster than accessing it by name.

x = DataFrame(rand(5, 1000))
@btime x[500];
@btime x[:x500];

  13.653 ns (0 allocations: 0 bytes)
  20.983 ns (0 allocations: 0 bytes)
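
The gap between the two timings above is plausibly the cost of translating a name into a position before the column can be fetched. The following is a minimal sketch of that idea, not the DataFrames.jl implementation: `TinyTable`, `tinytable`, and `bycol` are hypothetical names introduced only for illustration, and the `Dict`-based lookup is an assumption about how name access might work.

```julia
# Hypothetical miniature column store (NOT the DataFrames.jl source).
struct TinyTable
    columns::Vector{AbstractVector}  # one vector per column
    lookup::Dict{Symbol,Int}         # column name => column position
end

# Build the table and the name => position mapping in one step.
function tinytable(names::Vector{Symbol}, cols::Vector)
    lookup = Dict(name => i for (i, name) in enumerate(names))
    TinyTable(AbstractVector[cols...], lookup)
end

# Access by position: a single array index.
bycol(t::TinyTable, i::Int) = t.columns[i]

# Access by name: a hash lookup first, then the same array index.
bycol(t::TinyTable, name::Symbol) = t.columns[t.lookup[name]]

t = tinytable([:a, :b], [rand(3), rand(3)])
same_column = bycol(t, 2) === bycol(t, :b)  # both routes reach the same vector
```

In this sketch the name route does strictly more work than the position route, which matches the direction of the measurements, although the difference is only a few nanoseconds per access.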

When working with a `DataFrame`, use barrier functions or type annotations.

using Random

function f_bad() # this function will be slow
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y, z = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_bad();

  107.566 ms (5999022 allocations: 122.06 MiB)

@code_warntype f_bad() # it is slow because Julia does not know the types of the columns in the DataFrame

Body::Any
│╻            seed!4  1 ── %1  = Random.GLOBAL_RNG::MersenneTwister
││╻╷╷╷         seed!   │    %2  = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{UInt32,1}, svec(Any, Int64), :(:ccall), 2, Array{UInt32,1}, 0, 0))::Array{UInt32,1}
│││╻╷╷╷╷╷╷╷     make_seed   │    %3  = (Core.lshr_int)(1, 63)::Int64
││││┃│││││││     push!   │    %4  = (Core.trunc_int)(Core.UInt8, %3)::UInt8
│││││┃││││││      _growend!   │    %5  = (Core.eq_int)(%4, 0x01)::Bool
││││││┃││││        cconvert   └───       goto #3 if not %5
│││││││┃│││         convert   2 ──       invoke Core.throw_inexacterror(:check_top_bit::Symbol, Int64::Any, 1::Int64)
││││││││┃││          Type   └───       $(Expr(:unreachable))
│││││││││┃│           toUInt64   3 ──       goto #4
││││││││││      4 ── %10 = (Core.bitcast)(Core.UInt64, 1)::UInt64
││││││││││      └───       goto #5
│││││││││       5 ──       goto #6
││││││││        6 ──       goto #7
│││││││         7 ──       goto #8
││││││          8 ──       $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), :(:ccall), 2, :(%2), :(%10), :(%10)))
││││││          └───       goto #9
│││││╻╷╷╷╷        lastindex   9 ── %17 = (Base.arraysize)(%2, 1)::Int64
││││││╻╷╷╷         eachindex   │    %18 = (Base.slt_int)(%17, 0)::Bool
│││││││┃│││││       axes1   │    %19 = (Base.ifelse)(%18, 0, %17)::Int64
│││││╻            setindex!   │          (Base.arrayset)(true, %2, 0x00000001, %19)
││││╻            push!   └───       goto #10
││││╻            >>   10 ─       (Base.ifelse)(true, 0, 0)
│││╻            make_seed   └───       goto #11
│││             11 ─       invoke Random.seed!(%1::MersenneTwister, %2::Array{UInt32,1})
│││             └───       goto #12
││              12 ─       goto #13
│╻╷           rand   13 ─ %27 = Random.GLOBAL_RNG::MersenneTwister
││┃│╷╷╷        rand   │    %28 = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{Float64,2}, 1000000, 2, 2, 1000000))::Array{Float64,2}
│││╻╷           rand   │    %29 = (Base.arraylen)(%28)::Int64
││││╻            rand!   │    %30 = (Base.mul_int)(8, %29)::Int64
│││││╻            rand!   │    %31 = (Base.arraylen)(%28)::Int64
││││││╻            _rand!   │    %32 = (Base.mul_int)(8, %31)::Int64
│││││││╻            <=   │    %33 = (Base.sle_int)(%30, %32)::Bool
│││││││         └───       goto #15 if not %33
│││││││         14 ─       goto #16
│               15 ─       nothing
│││││││         16 ┄ %37 = φ (#14 => true, #15 => false)::Bool
│││││││         └───       goto #18 if not %37
│││││││╻            macro expansion   17 ─ %39 = $(Expr(:gc_preserve_begin, :(%28)))
││││││││╻╷           pointer   │    %40 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), :(:ccall), 1, :(%28)))::Ptr{Float64}
││││││││╻            Type   │    %41 = %new(Random.UnsafeView{Float64}, %40, %29)::Random.UnsafeView{Float64}
││││││││        │          invoke Random.rand!(%27::MersenneTwister, %41::Random.UnsafeView{Float64}, $(QuoteNode(Random.SamplerTrivial{Random.CloseOpen01{Float64},Float64}(Random.CloseOpen01{Float64}())))::Random.SamplerTrivial{Random.CloseOpen01{Float64},Float64})
││││││││        │          $(Expr(:gc_preserve_end, :(%39)))
││││││││        └───       goto #19
│││││││╻            Type   18 ─ %45 = %new(Core.AssertionError, "sizeof(Float64) * n64 <= sizeof(T) * length(A) && isbitstype(T)")::AssertionError
│││││││         │          (Base.throw)(%45)
│││││││         └───       $(Expr(:unreachable))
││││││          19 ┄       goto #20
│││││           20 ─       goto #21
││││            21 ─       goto #22
│││             22 ─       goto #23
││              23 ─       goto #24
│               24 ─ %53 = Main.DataFrame::Core.Compiler.Const(DataFrame, false)
││╻            size   │    %54 = (Base.arraysize)(%28, 2)::Int64
││              │    %55 = invoke DataFrames.gennames(%54::Int64)::Array{Symbol,1}
││╻            Type   │    %56 = invoke DataFrames.:(#DataFrame#60)(false::Bool, %53::Type, %28::Array{Float64,2}, %55::Array{Symbol,1})::DataFrame
│╻╷           getindex5  │    %57 = (DataFrames.getfield)(%56, :columns)::Array{AbstractArray{T,1} where T,1}
││              │    %58 = π (1, Int64)
││╻            getindex   │    %59 = (Base.arrayref)(true, %57, %58)::AbstractArray{T,1} where T
││╻            columns   │    %60 = (DataFrames.getfield)(%56, :columns)::Array{AbstractArray{T,1} where T,1}
││              │    %61 = π (2, Int64)
││╻            getindex   │    %62 = (Base.arrayref)(true, %60, %61)::AbstractArray{T,1} where T
│            7  │    %63 = invoke Main.nrow(%56::DataFrame)::Int64
│╻╷╷╷         Colon   │    %64 = (Base.sle_int)(1, %63)::Bool
││╻            Type   │          (Base.sub_int)(%63, 1)
│││┃            unitrange_last   │    %66 = (Base.ifelse)(%64, %63, 0)::Int64
││╻╷╷          isempty   │    %67 = (Base.slt_int)(%66, 1)::Bool
││              └───       goto #26 if not %67
││              25 ─       goto #27
││              26 ─       goto #27
│               27 ┄ %71 = φ (#25 => true, #26 => false)::Bool
│               │    %72 = φ (#26 => 1)::Int64
│               │    %73 = φ (#26 => 1)::Int64
│               │    %74 = (Base.not_int)(%71)::Bool
│               └───       goto #33 if not %74
│               28 ┄ %76 = φ (#27 => 0.0, #32 => %82)::Any
│               │    %77 = φ (#27 => %72, #32 => %88)::Int64
│               │    %78 = φ (#27 => %73, #32 => %89)::Int64
│            8  │    %79 = (Base.getindex)(%59, %77)::Any
│               │    %80 = (Base.getindex)(%62, %77)::Any
│               │    %81 = (%79 * %80)::Any
│               │    %82 = (%76 + %81)::Any
││╻            ==   │    %83 = (%78 === %66)::Bool
││              └───       goto #30 if not %83
││              29 ─       goto #31
││╻            +   30 ─ %86 = (Base.add_int)(%78, 1)::Int64
│╻            iterate   └───       goto #31
│               31 ┄ %88 = φ (#30 => %86)::Int64
│               │    %89 = φ (#30 => %86)::Int64
│               │    %90 = φ (#29 => true, #30 => false)::Bool
│               │    %91 = (Base.not_int)(%90)::Bool
│               └───       goto #33 if not %91
│               32 ─       goto #28
│            10 33 ─ %94 = φ (#31 => %82, #27 => 0.0)::Any
│               └───       return %94

# Solution 1 is to use a barrier function (this should be possible in almost any code).
function f_inner(y,z)
   p = 0.0
   for i in 1:length(y)
       p += y[i]*z[i]
   end
   p
end

function f_barrier() # hand the work over to an inner function
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    f_inner(x[1], x[2])
end

using LinearAlgebra

function f_inbuilt() # or use a built-in function if one is available
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    x[1] ⋅ x[2] #\cdot<tab>
end

@btime f_barrier();
@btime f_inbuilt();

  8.388 ms (44 allocations: 30.52 MiB)
  11.387 ms (44 allocations: 30.52 MiB)

# Solution 2 is to annotate the extracted columns with their types.
# This is simpler, but it cannot be used when the types are not known in advance.
function f_typed()
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y::Vector{Float64}, z::Vector{Float64} = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_typed();

  9.565 ms (44 allocations: 30.52 MiB)
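
The barrier pattern does not depend on DataFrames itself. The sketch below uses only Base Julia and assumes a container with an abstract element type as a stand-in for the `columns` field seen in the `@code_warntype` output; `cols`, `kernel`, `direct`, and `barrier` are hypothetical names chosen for this illustration.

```julia
# Columns stored with an abstract element type, so the compiler cannot
# know the concrete type of cols[1] or cols[2] inside a function.
cols = AbstractVector[rand(1000), rand(1000)]

# Kernel: Julia compiles a specialized version for the concrete
# argument types it is actually called with.
function kernel(y, z)
    p = 0.0
    for i in 1:length(y)
        p += y[i] * z[i]
    end
    p
end

# Loop written directly against the abstract container:
# every y[i] and z[i] has to be dispatched at run time.
function direct(cols)
    y, z = cols[1], cols[2]
    p = 0.0
    for i in 1:length(y)
        p += y[i] * z[i]
    end
    p
end

# Barrier version: one dynamic dispatch at the call to kernel,
# after which the whole loop runs in type-specialized code.
barrier(cols) = kernel(cols[1], cols[2])

results_agree = direct(cols) ≈ barrier(cols)
```

Both functions compute the same value. The difference is where the type uncertainty is resolved: once per call in `barrier`, and once per loop iteration in `direct`.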

Consider creating the `DataFrame` late (delayed creation).

function f1()
    x = DataFrame(Float64, 10^4, 100) # create the DataFrame first and work on it directly
    for c in 1:ncol(x)
        d = x[c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef, 100)
    for c in 1:length(x)
        d = Vector{Float64}(undef,10^4)
        for r in 1:length(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x) # create the DataFrame only after all the computation is done
end

@btime f1();
@btime f2();

  22.731 ms (1950037 allocations: 37.42 MiB)
  2.109 ms (937 allocations: 7.69 MiB)

Adding rows to a `DataFrame` in place is faster.

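The sizes reported after `push!` and `append!` below are far larger than 1000001. `@btime` evaluates the benchmarked expression many times, and every evaluation adds rows to the same `x`, so the rows accumulate across runs. The original tutorial does not comment on this.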

x = DataFrame(rand(10^6, 5))
y = DataFrame(transpose(1.0:5.0))
z = [1.0:5.0;]
println("Size of original x = ",size(x))
@btime vcat($x, $y); # creates a new DataFrame - slow
println("Size of result after running vcat = ", size(vcat(x,y)))
@btime push!($x, $z); # add a single row in place - fast
println("Size of x after running push! = ", size(x))
println(" ")
x = DataFrame(rand(10^6, 5)) # reset to the same starting point
println("Size of original x = ", size(x))
@btime append!($x, $y); # in place - fastest
println("Size of x after running append! = ", size(x))

Size of original x = (1000000, 5)
  6.474 ms (135 allocations: 38.15 MiB)
Size of result after running vcat = (1000001, 5)
  200.355 ns (5 allocations: 80 bytes)
Size of x after running push! = (7610502, 5)
 
Size of original x = (1000000, 5)
  164.579 ns (1 allocation: 16 bytes)
Size of x after running append! = (9220502, 5)
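
The sizes printed above grow by far more than one row, which is what happens when a benchmark repeatedly evaluates a mutating expression against the same object. The following is a minimal sketch of one way to avoid this, using a plain `Vector` as a stand-in for the `DataFrame` and the `setup` and `evals` parameters of BenchmarkTools; the variable names `v`, `w`, and `fresh` are hypothetical.

```julia
using BenchmarkTools

# Mutating benchmark without a reset: every evaluation pushes onto the same vector.
v = collect(1.0:5.0)
@btime push!($v, 0.0);
grown = length(v)       # much larger than 6 once the benchmark has finished

# With setup and evals=1, each sample works on a fresh copy,
# so the original vector is never modified.
w = collect(1.0:5.0)
@btime push!(fresh, 0.0) setup=(fresh = copy($w)) evals=1;
unchanged = length(w)   # still 5
```

The second form includes neither the copy nor any accumulated growth in the measurement, at the price of timing a single evaluation per sample, which is noisier for operations that take only nanoseconds.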

Allowing `missing` values or using `categorical` types slows down computations.

using StatsBase

function test(data) # uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

println("Using test(1:10)")
test(1:10)
println(" ")
println("Using test([randstring() for i in 1:10])")
test([randstring() for i in 1:10])
println(" ")
println("Using test(allowmissing(1:10))")
test(allowmissing(1:10))
println(" ")
println("Using test(allowmissing([randstring() for i in 1:10]))")
test(allowmissing([randstring() for i in 1:10]))

Using test(1:10)
Int64
 raw:
  5.027 ms (8 allocations: 7.63 MiB)
 categorical:
  20.860 ms (4 allocations: 608 bytes)
 
Using test([randstring() for i in 1:10])
String
 raw:
  40.643 ms (4 allocations: 608 bytes)
 categorical:
  42.940 ms (4 allocations: 608 bytes)
 
Using test(allowmissing(1:10))
Union{Missing, Int64}
 raw:
  14.322 ms (4 allocations: 624 bytes)
 categorical:
  21.466 ms (4 allocations: 608 bytes)
 
Using test(allowmissing([randstring() for i in 1:10]))
Union{Missing, String}
 raw:
  24.033 ms (4 allocations: 608 bytes)
 categorical:
  32.374 ms (4 allocations: 608 bytes)
