11. performance
딥스탯
2018. 10. 18. 14:17
Introduction to DataFrames
Bogumił Kamiński, Apr 21, 2018
Reference
Series
- https://deepstat.tistory.com/69 (01. constructors)(in English)
- https://deepstat.tistory.com/70 (01. constructors)(in Korean)
- https://deepstat.tistory.com/71 (02. basicinfo)(in English)
- https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
- https://deepstat.tistory.com/73 (03. missingvalues)(in English)
- https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
- https://deepstat.tistory.com/75 (04. loadsave)(in English)
- https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
- https://deepstat.tistory.com/77 (05. columns)(in English)
- https://deepstat.tistory.com/78 (05. columns)(in Korean)
- https://deepstat.tistory.com/79 (06. rows)(in English)
- https://deepstat.tistory.com/80 (06. rows)(in Korean)
- https://deepstat.tistory.com/81 (07. factors)(in English)
- https://deepstat.tistory.com/82 (07. factors)(in Korean)
- https://deepstat.tistory.com/83 (08. joins)(in English)
- https://deepstat.tistory.com/84 (08. joins)(in Korean)
- https://deepstat.tistory.com/85 (09. reshaping)(in English)
- https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
- https://deepstat.tistory.com/87 (10. transforms)(in English)
- https://deepstat.tistory.com/88 (10. transforms)(in Korean)
- https://deepstat.tistory.com/89 (11. performance)(in English)
- https://deepstat.tistory.com/90 (11. performance)(in Korean)
In [1]:
using DataFrames
using BenchmarkTools
Performance tips
Access by column number is faster than by name
In [2]:
x = DataFrame(rand(5, 1000))
@btime x[500];
@btime x[:x500];
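One way to see why positional access can be cheaper is that a name has to be translated into a position before the column can be fetched. The sketch below uses a hypothetical `ToyFrame` type (an illustration only, not the DataFrames.jl implementation) that stores columns in a vector plus a `Dict` from name to position, so access by name does everything access by number does, plus one hash lookup.

```julia
# Hypothetical toy container, for illustration only; DataFrames.jl internals may differ.
struct ToyFrame
    columns::Vector{Any}        # column vectors, stored positionally
    lookup::Dict{Symbol,Int}    # column name => column position
end

# Build a ToyFrame from a matrix, naming the columns x1, x2, ...
function ToyFrame(m::AbstractMatrix)
    colnames = [Symbol("x", j) for j in 1:size(m, 2)]
    columns = Any[m[:, j] for j in 1:size(m, 2)]
    return ToyFrame(columns, Dict(name => j for (j, name) in enumerate(colnames)))
end

# Access by number: a single array indexing operation.
getcol(t::ToyFrame, j::Int) = t.columns[j]

# Access by name: a dictionary lookup first, then the same array indexing.
getcol(t::ToyFrame, name::Symbol) = getcol(t, t.lookup[name])

t = ToyFrame(rand(5, 1000))
by_number = getcol(t, 500)
by_name = getcol(t, :x500)
println(by_number === by_name)  # both calls return the same column vector
```

The difference is a constant per-access cost, so it matters mainly when a column is looked up repeatedly inside a hot loop.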
When working with a DataFrame, use barrier functions or type annotations
In [3]:
using Random
function f_bad() # this function will be slow
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y, z = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end
@btime f_bad();
In [4]:
@code_warntype f_bad() # the reason is that Julia does not know the types of columns in `DataFrame`
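The inference problem can be reproduced without DataFrames. The following is a minimal sketch, assuming the columns are held in a `Vector{Any}` as a stand-in for a data frame's column storage; `first_column` is a hypothetical helper. Given only the container type, the compiler cannot infer anything better than `Any` for an extracted column, so a loop that uses the column cannot be specialized.

```julia
# Stand-in for a data frame: columns of unknown type stored in a Vector{Any}.
cols = Any[rand(10), rand(10)]

first_column(c) = c[1]

# Ask the compiler what return type it can infer when it only knows that
# the argument is a Vector{Any}.
inferred = Base.return_types(first_column, (Vector{Any},))
println(inferred)

# The run-time type is concrete, but it is only known once the code runs.
println(typeof(first_column(cols)))
```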
In [5]:
# solution 1 is to use barrier function (it should be possible to use it in almost any code)
function f_inner(y,z)
    p = 0.0
    for i in 1:length(y)
        p += y[i]*z[i]
    end
    p
end
function f_barrier() # extract the work to an inner function
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    f_inner(x[1], x[2])
end
using LinearAlgebra
function f_inbuilt() # or use inbuilt function if possible
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    x[1] ⋅ x[2]
end
@btime f_barrier();
@btime f_inbuilt();
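The barrier idea can also be sketched with the standard library alone. The example below assumes a `Vector{Any}` of columns as a stand-in for a data frame; `dot_kernel`, `dot_unstable` and `dot_barrier` are hypothetical names. The barrier variant pays for one dynamic dispatch and then runs a loop compiled for the concrete column types.

```julia
# Inner kernel: compiled separately for the concrete types of y and z.
function dot_kernel(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    return p
end

# Unstable variant: the loop runs on values whose types cannot be inferred.
function dot_unstable(cols::Vector{Any})
    y, z = cols[1], cols[2]
    p = 0.0
    for i in 1:length(y)
        p += y[i] * z[i]
    end
    return p
end

# Barrier variant: one dynamic dispatch, then a fully typed loop.
dot_barrier(cols::Vector{Any}) = dot_kernel(cols[1], cols[2])

cols = Any[rand(10^6), rand(10^6)]
dot_unstable(cols); dot_barrier(cols)   # warm up so compilation is not timed
t_unstable = @elapsed dot_unstable(cols)
t_barrier = @elapsed dot_barrier(cols)
println("unstable: ", t_unstable, " s, barrier: ", t_barrier, " s")
```

Timings depend on the machine and Julia version, so no expected numbers are given here; `@elapsed` is used only to keep the sketch free of package dependencies.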
In [6]:
# solution 2 is to provide the types of extracted columns
# it is simpler but there are cases in which you will not know these types
function f_typed()
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y::Vector{Float64}, z::Vector{Float64} = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end
@btime f_typed();
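A small sketch of what the typed declaration does and where it stops working, again assuming a `Vector{Any}` of columns as a stand-in for a data frame (`dot_typed` is a hypothetical name). A declared local converts the right-hand side to the declared type, so the loop is type-stable, but the call fails when a column cannot be converted, which is the situation where the types are not known in advance.

```julia
function dot_typed(cols::Vector{Any})
    # Declared locals: the right-hand side is converted to Vector{Float64},
    # so the compiler knows the types of y and z inside the loop.
    y::Vector{Float64} = cols[1]
    z::Vector{Float64} = cols[2]
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    return p
end

println(dot_typed(Any[[1.0, 2.0], [3.0, 4.0]]))  # 1*3 + 2*4

# A column holding a missing value cannot be converted to Vector{Float64},
# so the declaration throws.
mixed = Any[[1.0, missing], [3.0, 4.0]]
failed = try
    dot_typed(mixed)
    false
catch err
    true
end
println("call with a missing value failed: ", failed)
```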
Consider using a delayed DataFrame creation technique
In [7]:
function f1()
    x = DataFrame(Float64, 10^4, 100) # we work with DataFrame directly
    for c in 1:ncol(x)
        d = x[c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end
function f2()
    x = Vector{Any}(undef,100)
    for c in 1:length(x)
        d = Vector{Float64}(undef,10^4)
        for r in 1:length(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x) # we delay creation of DataFrame after we have our job done
end
@btime f1();
@btime f2();
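The delayed-creation pattern can be sketched without DataFrames: do all element-level work on concretely typed vectors, and only place the finished vectors into an untyped container at the end. `make_column` and `make_columns` are hypothetical helpers, and the returned `Vector{Any}` stands in for the value that would be handed to the `DataFrame` constructor.

```julia
# Fill one column; d has a concrete type, so this loop is type-stable.
function make_column(n)
    d = Vector{Float64}(undef, n)
    for r in eachindex(d)
        d[r] = rand()
    end
    return d
end

# Build all columns first; only finished vectors enter the untyped container.
function make_columns(nrows, ncols)
    cols = Vector{Any}(undef, ncols)
    for c in eachindex(cols)
        cols[c] = make_column(nrows)
    end
    return cols   # stand-in for the argument of the DataFrame constructor
end

cols = make_columns(10^4, 100)
println(length(cols), " columns of length ", length(cols[1]))
```

Splitting the per-column work into its own function also acts as a barrier, so the inner loop never touches a value of unknown type.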
You can add rows to a DataFrame in place and it is fast
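- Note that the sizes printed after push! and append! grow by far more than one row. The original text does not explain this; the likely cause is that @btime evaluates the expression many times, and each in-place call adds rows to the same x.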
In [8]:
x = DataFrame(rand(10^6, 5))
y = DataFrame(transpose(1.0:5.0))
z = [1.0:5.0;]
println("Size of original x = ",size(x))
@btime vcat($x, $y); # creates a new DataFrame - slow
println("Size of result after running vcat = ", size(vcat(x,y)))
@btime push!($x, $z); # add a single row in place - fast
println("Size of x after running push! = ", size(x))
println(" ")
x = DataFrame(rand(10^6, 5)) # reset to the same starting point
println("Size of original x = ", size(x))
@btime append!($x, $y); # in place - fastest
println("Size of x after running append! = ", size(x))
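The difference between allocating and in-place operations, and the effect of repeating an in-place call, can be shown with plain vectors. This is a sketch under the assumption that a vector behaves like a data frame in this respect; the fixed loop of 1000 calls stands in for the repeated evaluations a benchmarking macro performs.

```julia
v = collect(1.0:5.0)
w = [6.0]

# vcat allocates a new vector; v itself is unchanged.
combined = vcat(v, w)
println(length(v), " ", length(combined))

# push! mutates v, so every repeated call adds one more element.
for _ in 1:1000
    push!(v, 0.0)
end
println(length(v))

# append! also mutates v; repeating it keeps growing the same vector.
for _ in 1:1000
    append!(v, w)
end
println(length(v))
```

When benchmarking a mutating call, the container therefore ends up larger than after a single call, by as many rows as there were evaluations.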
Allowing missing as well as categorical slows down computations
In [9]:
using StatsBase
function test(data) # uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end
println("Using test(1:10)")
test(1:10)
println(" ")
println("Using test([randstring() for i in 1:10])")
test([randstring() for i in 1:10])
println(" ")
println("Using test(allowmissing(1:10))")
test(allowmissing(1:10))
println(" ")
println("Using test(allowmissing([randstring() for i in 1:10]))")
test(allowmissing([randstring() for i in 1:10]))
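A package-free sketch of the same comparison for the `missing` case: the hypothetical `count_values` function below is a simple stand-in for `countmap`, applied to the same values stored once as `Vector{Int}` and once as `Vector{Union{Missing,Int}}`. It does not cover categorical arrays, and the size of the gap (if any) depends on the machine and Julia version.

```julia
# Hypothetical stand-in for countmap: count occurrences with a Dict.
function count_values(x)
    counts = Dict{eltype(x),Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

x = rand(1:10, 10^6)                          # Vector{Int}
xm = convert(Vector{Union{Missing,Int}}, x)   # same values, missing allowed

count_values(x); count_values(xm)             # warm up so compilation is not timed
t_plain = @elapsed count_values(x)
t_missing = @elapsed count_values(xm)
println("Int: ", t_plain, " s, Union{Missing,Int}: ", t_missing, " s")
```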