딥스탯 2018. 10. 1. 19:07
9_Julia_is_fast

Julia is fast

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/intro-to-julia (github : JuliaComputing/JuliaBoxTutorials/introductory-tutorials/intro-to-julia/)

Topics:

  1. Define the sum function
  2. Implementations & benchmarking of sum in...
    1. C (hand-written)
    2. C (hand-written with -ffast-math)
    3. python (built-in)
    4. python (numpy)
    5. python (hand-written)
    6. Julia (built-in)
    7. Julia (hand-written)
    8. Julia (hand-written with SIMD)
  3. Summary of benchmarks


Very often, benchmarks are used to compare languages. These benchmarks can lead to long discussions, first about exactly what is being benchmarked and second about what explains the differences. These simple questions can sometimes become more complicated than you might at first imagine.

The purpose of this notebook is for you to see a simple benchmark for yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096/blob/master/lectures/lecture1/Boxes-and-registers.ipynb.)

Define the sum function

sum: An easy enough function to understand

Consider the sum function sum(a), which computes $$\mathrm{sum}(a) = \sum_{i=1}^n a_i,$$ where $n$ is the length of a.
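
As a tiny sanity check of the definition (the small vector below is purely illustrative):

v = [1.0, 2.0, 3.0, 4.0]   # hypothetical example vector
sum(v)                     # 10.0, i.e. 1.0 + 2.0 + 3.0 + 4.0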

In [1]:
a = rand(10^7)
Out[1]:
10000000-element Array{Float64,1}:
 0.5269491225334235  
 0.19706199325984275 
 0.23592456884028135 
 0.027500385222536616
 0.41237634926292355 
 0.49877120777765604 
 0.19438800783198507 
 0.32757751795945067 
 0.5192727133968746  
 0.4315287732872737  
 0.8990672827433108  
 0.4407671101300359  
 0.9814272231753296  
 ⋮                   
 0.6125295441451339  
 0.041324486675398786
 0.6543162400651628  
 0.15135956476295687 
 0.3108763380233237  
 0.6905120191561878  
 0.5642808289984709  
 0.1535877856700727  
 0.8338274357103401  
 0.19755217821574522 
 0.4735443129236854  
 0.9515910556915412  
In [2]:
sum(a)
Out[2]:
4.9996703916066885e6

The expected result is 0.5 * 10^7, since the mean of each entry is 0.5.
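
A quick way to confirm this (a sketch; the 0.1% relative tolerance is an arbitrary illustrative choice):

isapprox(sum(a), 0.5 * 10^7; rtol = 1e-3)   # true: the sum is within 0.1% of 5.0e6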

Implementations & benchmarking of sum in...

Benchmarking a few ways in a few languages

In [3]:
@time sum(a)
  0.007909 seconds (5 allocations: 176 bytes)
Out[3]:
4.9996703916066885e6
In [4]:
@time sum(a)
  0.011878 seconds (5 allocations: 176 bytes)
Out[4]:
4.9996703916066885e6
In [5]:
@time sum(a)
  0.010598 seconds (5 allocations: 176 bytes)
Out[5]:
4.9996703916066885e6

The @time macro can yield noisy results, so it's not our best choice for benchmarking!

Luckily, Julia has a BenchmarkTools.jl package to make benchmarking easy and accurate:

In [6]:
using Pkg
Pkg.add("BenchmarkTools")
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/.julia/environments/v1.0/Project.toml`
 [no changes]
  Updating `~/.julia/environments/v1.0/Manifest.toml`
 [no changes]
In [7]:
using BenchmarkTools
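
Besides @benchmark, BenchmarkTools also provides the lighter @btime macro, which prints the minimum time and allocation count and returns the value of the expression (a minimal sketch):

@btime sum($a)   # interpolate `a` with $ so that global-variable access isn't part of the timing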

2.A C (hand-written)

C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not get the advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the triple quotes (""") wrap a multi-line string.

In [8]:
using Libdl
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file


# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
Out[8]:
c_sum (generic function with 1 method)
In [9]:
c_sum(a)
Out[9]:
4.99967039160726e6
In [10]:
c_sum(a) ≈ sum(a)  # type \approx and then <TAB> to get the ≈ symbol
Out[10]:
true
In [11]:
c_sum(a) - sum(a)
Out[11]:
5.718320608139038e-7
In [12]:
≈  # alias for the `isapprox` function
Out[12]:
isapprox (generic function with 8 methods)
In [13]:
?isapprox
search: isapprox

Out[13]:
isapprox(x, y; rtol::Real=atol>0 ? 0 : √eps, atol::Real=0, nans::Bool=false, norm::Function)

Inexact equality comparison: true if norm(x-y) <= max(atol, rtol*max(norm(x), norm(y))). The default atol is zero and the default rtol depends on the types of x and y. The keyword argument nans determines whether or not NaN values are considered equal (defaults to false).

For real or complex floating-point values, if an atol > 0 is not specified, rtol defaults to the square root of eps of the type of x or y, whichever is bigger (least precise). This corresponds to requiring equality of about half of the significand digits. Otherwise, e.g. for integer arguments or if an atol > 0 is supplied, rtol defaults to zero.

x and y may also be arrays of numbers, in which case norm defaults to vecnorm but may be changed by passing a norm::Function keyword argument. (For numbers, norm is the same thing as abs.) When x and y are arrays, if norm(x-y) is not finite (i.e. ±Inf or NaN), the comparison falls back to checking whether all elements of x and y are approximately equal component-wise.

The binary operator ≈ is equivalent to isapprox with the default arguments, and x ≉ y is equivalent to !isapprox(x,y).

Note that x ≈ 0 (i.e., comparing to zero with the default tolerances) is equivalent to x == 0 since the default atol is 0. In such cases, you should either supply an appropriate atol (or use norm(x) ≤ atol) or rearrange your code (e.g. use x ≈ y rather than x - y ≈ 0). It is not possible to pick a nonzero atol automatically because it depends on the overall scaling (the "units") of your problem: for example, in x - y ≈ 0, atol=1e-9 is an absurdly small tolerance if x is the radius of the Earth in meters, but an absurdly large tolerance if x is the radius of a Hydrogen atom in meters.

Examples

julia> 0.1 ≈ (0.1 - 1e-10)
true

julia> isapprox(10, 11; atol = 2)
true

julia> isapprox([10.0^9, 1.0], [10.0^9, 2.0])
true

julia> 1e-10 ≈ 0
false

julia> isapprox(1e-10, 0, atol=1e-8)
true

We can now benchmark the C code directly from Julia:

In [14]:
c_bench = @benchmark c_sum($a)
Out[14]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.910 ms (0.00% GC)
  median time:      9.422 ms (0.00% GC)
  mean time:        9.744 ms (0.00% GC)
  maximum time:     23.826 ms (0.00% GC)
  --------------
  samples:          513
  evals/sample:     1
In [15]:
println("C: Fastest time was $(minimum(c_bench.times) / 1e6) msec")
C: Fastest time was 8.909969 msec
In [16]:
d = Dict()  # a "dictionary", i.e. an associative array
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d
Out[16]:
Dict{Any,Any} with 1 entry:
  "C" => 8.90997

2.B C (hand-written with -ffast-math)

If we allow C to re-arrange the floating point operations, then it'll vectorize with SIMD (single instruction, multiple data) instructions.
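
The reason re-association can change the answer is that floating-point addition is not associative; a one-line illustration (the particular values are arbitrary):

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false: the two groupings round differently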

In [17]:
const Clib_fastmath = tempname()   # make a temporary file

# The same as above but with a -ffast-math flag added
open(`gcc -fPIC -O3 -msse3 -xc -shared -ffast-math -o $(Clib_fastmath * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum_fastmath(X::Array{Float64}) = ccall(("c_sum", Clib_fastmath), Float64, (Csize_t, Ptr{Float64}), length(X), X)
Out[17]:
c_sum_fastmath (generic function with 1 method)
In [18]:
c_fastmath_bench = @benchmark $c_sum_fastmath($a)
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.678 ms (0.00% GC)
  median time:      6.029 ms (0.00% GC)
  mean time:        7.398 ms (0.00% GC)
  maximum time:     18.739 ms (0.00% GC)
  --------------
  samples:          675
  evals/sample:     1
In [19]:
d["C -ffast-math"] = minimum(c_fastmath_bench.times) / 1e6  # in milliseconds
Out[19]:
5.678042

2.C python (built-in)

The PyCall package provides a Julia interface to Python:

In [20]:
using Pkg; Pkg.add("PyCall")
using PyCall
 Resolving package versions...
  Updating `~/.julia/environments/v1.0/Project.toml`
 [no changes]
  Updating `~/.julia/environments/v1.0/Manifest.toml`
 [no changes]
In [21]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")
Out[21]:
PyObject <built-in function sum>
In [22]:
pysum(a)
Out[22]:
4.99967039160726e6
In [23]:
pysum(a) ≈ sum(a)
Out[23]:
true
In [24]:
py_list_bench = @benchmark $pysum($a)
Out[24]:
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  8
  --------------
  minimum time:     965.677 ms (0.00% GC)
  median time:      968.887 ms (0.00% GC)
  mean time:        972.928 ms (0.00% GC)
  maximum time:     996.690 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1
In [25]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d
Out[25]:
Dict{Any,Any} with 3 entries:
  "C"               => 8.90997
  "Python built-in" => 965.677
  "C -ffast-math"   => 5.67804

2.D python (numpy)

Takes advantage of hardware "SIMD", but only works when it works.

numpy is an optimized C library, callable from Python. It may be installed within Julia as follows:
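
(A sketch assuming the Conda.jl package, which PyCall uses to manage its own Python environment on many setups; skip this if numpy is already importable.)

# using Pkg; Pkg.add("Conda")   # if Conda.jl itself is not installed yet
using Conda
Conda.add("numpy")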

In [26]:
numpy_sum = pyimport("numpy")["sum"]

py_numpy_bench = @benchmark $numpy_sum($a)
Out[26]:
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  8
  --------------
  minimum time:     5.367 ms (0.00% GC)
  median time:      5.560 ms (0.00% GC)
  mean time:        6.557 ms (0.00% GC)
  maximum time:     17.046 ms (0.00% GC)
  --------------
  samples:          761
  evals/sample:     1
In [27]:
numpy_sum(a)
Out[27]:
4.999670391606692e6
In [28]:
numpy_sum(a) ≈ sum(a)
Out[28]:
true
In [29]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d
Out[29]:
Dict{Any,Any} with 4 entries:
  "C"               => 8.90997
  "Python numpy"    => 5.36704
  "Python built-in" => 965.677
  "C -ffast-math"   => 5.67804

2.E python (hand-written)

In [30]:
py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"
Out[30]:
PyObject <function py_sum at 0x7fd7f32f1ea0>
In [31]:
py_hand = @benchmark $sum_py($a)
Out[31]:
BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  8
  --------------
  minimum time:     1.038 s (0.00% GC)
  median time:      1.043 s (0.00% GC)
  mean time:        1.049 s (0.00% GC)
  maximum time:     1.069 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
In [32]:
sum_py(a)
Out[32]:
4.99967039160726e6
In [33]:
sum_py(a) ≈ sum(a)
Out[33]:
true
In [34]:
d["Python hand-written"] = minimum(py_hand.times) / 1e6
d
Out[34]:
Dict{Any,Any} with 5 entries:
  "C"                   => 8.90997
  "Python numpy"        => 5.36704
  "Python hand-written" => 1038.46
  "Python built-in"     => 965.677
  "C -ffast-math"       => 5.67804

2.F Julia (built-in)

Written directly in Julia, not in C!

In [35]:
@which sum(a)
Out[35]:
sum(a::AbstractArray) in Base at reducedim.jl:645
In [36]:
j_bench = @benchmark sum($a)
Out[36]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.384 ms (0.00% GC)
  median time:      5.512 ms (0.00% GC)
  mean time:        5.988 ms (0.00% GC)
  maximum time:     16.048 ms (0.00% GC)
  --------------
  samples:          834
  evals/sample:     1
In [37]:
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d
Out[37]:
Dict{Any,Any} with 6 entries:
  "C"                   => 8.90997
  "Python numpy"        => 5.36704
  "Python hand-written" => 1038.46
  "Python built-in"     => 965.677
  "Julia built-in"      => 5.3836
  "C -ffast-math"       => 5.67804

2.G Julia (hand-written)

In [38]:
function mysum(A)   
    s = 0.0 # s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end
Out[38]:
mysum (generic function with 1 method)
In [39]:
j_bench_hand = @benchmark mysum($a)
Out[39]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.578 ms (0.00% GC)
  median time:      9.823 ms (0.00% GC)
  mean time:        11.032 ms (0.00% GC)
  maximum time:     23.422 ms (0.00% GC)
  --------------
  samples:          453
  evals/sample:     1
In [40]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d
Out[40]:
Dict{Any,Any} with 7 entries:
  "C"                   => 8.90997
  "Python numpy"        => 5.36704
  "Julia hand-written"  => 9.57754
  "Python hand-written" => 1038.46
  "Python built-in"     => 965.677
  "Julia built-in"      => 5.3836
  "C -ffast-math"       => 5.67804

2.H Julia (hand-written with SIMD)

In [41]:
function mysum_simd(A)   
    s = 0.0 # s = zero(eltype(A))
    @simd for a in A
        s += a
    end
    s
end
Out[41]:
mysum_simd (generic function with 1 method)
In [42]:
j_bench_hand_simd = @benchmark mysum_simd($a)
Out[42]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.325 ms (0.00% GC)
  median time:      5.414 ms (0.00% GC)
  mean time:        5.824 ms (0.00% GC)
  maximum time:     16.182 ms (0.00% GC)
  --------------
  samples:          857
  evals/sample:     1
In [43]:
mysum_simd(a)
Out[43]:
4.999670391606648e6
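
The last few digits differ from sum(a) because @simd lets Julia re-order the additions, just as -ffast-math did for C; the results still agree to within floating-point rounding:

mysum_simd(a) ≈ sum(a)   # true
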
In [44]:
d["Julia hand-written simd"] = minimum(j_bench_hand_simd.times) / 1e6
d
Out[44]:
Dict{Any,Any} with 8 entries:
  "Julia hand-written simd" => 5.32519
  "C"                       => 8.90997
  "Python numpy"            => 5.36704
  "Julia hand-written"      => 9.57754
  "Python hand-written"     => 1038.46
  "Python built-in"         => 965.677
  "Julia built-in"          => 5.3836
  "C -ffast-math"           => 5.67804

Summary of benchmarks

In [45]:
for (key, value) in sort(collect(d), by=last)
    println(rpad(key, 25, "."), lpad(round(value; digits=2), 6, "."))
end
Julia hand-written simd....5.33
Python numpy...............5.37
Julia built-in.............5.38
C -ffast-math..............5.68
C..........................8.91
Julia hand-written.........9.58
Python built-in..........965.68
Python hand-written......1038.46
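
As a final comparison, one could also print each timing relative to the fastest entry (a small sketch re-using the dictionary d):

fastest = minimum(values(d))   # best time in milliseconds
for (key, value) in sort(collect(d), by = last)
    println(rpad(key, 25, "."), lpad(round(value / fastest; digits = 2), 8, "."))
end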