Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)

using DataFrames # load package

Manipulating columns of a DataFrame

Renaming columns

Let's start with a DataFrame of Bools that has default column names.

x = DataFrame(Bool, 3, 4)

With rename, we create a new DataFrame; here we rename the column :x1 to :A. (rename also accepts collections of Pairs.)

rename(x, :x1 => :A)
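As a small sketch of the Pair-based form, several columns can be renamed in one call. The two-column DataFrame `df` below is made up for illustration; the call leaves `df` itself untouched.

```julia
using DataFrames

# Hypothetical two-column DataFrame used only for this illustration.
df = DataFrame(x1 = [true, false], x2 = [false, true])

# Each Pair maps an old column name to a new one; a new DataFrame is returned.
renamed = rename(df, :x1 => :A, :x2 => :B)

# Symbol.() normalizes the result, since names may return Symbols or Strings
# depending on the DataFrames version.
println(Symbol.(names(renamed)))
println(Symbol.(names(df)))
```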

With rename! we perform an in-place transformation.

This time we apply a function to every column name.

rename!(c -> Symbol(string(c)^2), x)
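A hedged note for newer releases: assuming DataFrames 1.x, the function passed to rename receives each column name as a String rather than a Symbol. The sketch below is written to work with either, using a made-up DataFrame `df` and an arbitrary `_col` suffix.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = [1, 2], x2 = [3, 4])

# string(c) works whether c arrives as a Symbol or as a String,
# so the same function can be used across DataFrames versions.
suffixed = rename(c -> Symbol(string(c), "_col"), df)

println(Symbol.(names(suffixed)))
```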

We can also change the name of a particular column without knowing the original.

Here we change the name of the third column, creating a new DataFrame.

rename(x, names(x)[3] => :third)

With names!, we can change the names of all variables.

names!(x, [:a, :b, :c, :d])

We get an error when we try to provide duplicate names

names!(x, fill(:a, 4))

ArgumentError: Duplicate variable names: Symbol[:a, :a, :a, :a].
Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] #names!#3(::Bool, ::Bool, ::Function, ::DataFrames.Index, ::Array{Symbol,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/index.jl:34
 [2] #names! at ./none:0 [inlined]
 [3] #names!#15 at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:139 [inlined]
 [4] names!(::DataFrame, ::Array{Symbol,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:136
 [5] top-level scope at In[7]:1

unless we pass makeunique=true, which allows us to handle duplicates in passed names.

names!(x, fill(:a, 4), makeunique=true)
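names! is not available in later releases. Assuming DataFrames 1.x, rename! with a vector of new names plays the same role and also accepts makeunique. A minimal sketch on a made-up DataFrame:

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = 1:2, x2 = 3:4, x3 = 5:6)

# Replace all column names at once (DataFrames 1.x assumed).
rename!(df, [:a, :b, :c])

# Duplicate names are rejected unless makeunique=true adds suffixes.
rename!(df, fill(:a, 3), makeunique = true)

println(Symbol.(names(df)))
```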

Reordering columns

We can reorder the names(x) vector as needed, creating a new DataFrame.

using Random
Random.seed!(1234) # srand(1234) before Julia 0.7
x[shuffle(names(x))]

We can also reorder the columns of a DataFrame in place with permutecols!.

permutecols!(x, [2, 1, 3, 4])

permutecols!(x, [:a, :a_1, :a_2, :a_3])
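permutecols! is likewise absent from later releases. Assuming DataFrames 1.x, select (which returns a new DataFrame) and select! (which works in place) cover column reordering. The DataFrame `df` below is made up for illustration.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(a = 1:2, b = 3:4, c = 5:6)

# New DataFrame with the columns in the requested order; df is unchanged.
reordered = select(df, [:c, :a, :b])

# In-place reordering, here by column position.
select!(df, [2, 1, 3])

println(Symbol.(names(reordered)))
println(Symbol.(names(df)))
```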

Merging/adding columns

x = DataFrame([(i,j) for i in 1:3, j in 1:4])

With hcat we can merge two DataFrames. The [x y] syntax is also supported, but only when the DataFrames have unique column names.

hcat(x, x, makeunique=true)

[x x]

┌ Warning: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
│   caller = ip:0x0
└ @ Core :-1

We can also use hcat to add a new column; a default name :x1 will be used for this column, so makeunique=true is needed.

y = hcat(x, [1,2,3], makeunique=true)

[x [1,2,3]]

You can also prepend a vector with hcat.

hcat([1,2,3], x, makeunique=true)

[[1,2,3] x]

Alternatively you could append a vector with the following syntax. This is a bit more verbose but cleaner.

y = [x DataFrame(A=[1,2,3])]

Here we do the same but add column :A to the front.

y = [DataFrame(A=[1,2,3]) x]

A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]]

  10.601 μs (120 allocations: 9.36 KiB)

We could also do this with insert!, a specialized in-place method. Let's add :newcol to the DataFrame y.

insert!(y, 2, [1,2,3], :newcol)

If you want to insert the same column name several times, makeunique=true is needed as usual.

insert!(y, 2, [1,2,3], :newcol, makeunique=true)

We can see how much faster it is to insert a column with insert! than with hcat using @btime.

@btime insert!(copy($x), 3, [1,2,3], :A)

  1.086 μs (17 allocations: 1.38 KiB)

Let's use insert! to append a column in place,

insert!(x, ncol(x)+1, [1,2,3], :A)

and to prepend a column in place.

insert!(x, 1, [1,2,3], :B)
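In later releases the column-inserting method is named insertcols! and takes a `name => values` Pair. The sketch below assumes DataFrames 1.x and a made-up DataFrame `df`; it mirrors the insert, append, and prepend steps above.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6])

# Insert a column at position 2 (DataFrames 1.x assumed).
insertcols!(df, 2, :newcol => [7, 8, 9])

# Append a column in place by inserting after the last position.
insertcols!(df, ncol(df) + 1, :A => [1, 2, 3])

# Prepend a column in place by inserting at position 1.
insertcols!(df, 1, :B => [1, 2, 3])

println(Symbol.(names(df)))
```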

With merge!, we merge the second DataFrame into the first, overwriting columns that have duplicate names.

df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
df1, df2, merge!(df1, df2)

(3×4 DataFrame
│ Row │ x    │ y     │ z    │ new   │
│     │ Char │ Int64 │ Char │ Int64 │
├─────┼──────┼───────┼──────┼───────┤
│ 1   │ 'a'  │ 4     │ 'd'  │ 11    │
│ 2   │ 'b'  │ 5     │ 'e'  │ 12    │
│ 3   │ 'c'  │ 6     │ 'f'  │ 13    │, 3×3 DataFrame
│ Row │ x    │ z    │ new   │
│     │ Char │ Char │ Int64 │
├─────┼──────┼──────┼───────┤
│ 1   │ 'a'  │ 'd'  │ 11    │
│ 2   │ 'b'  │ 'e'  │ 12    │
│ 3   │ 'c'  │ 'f'  │ 13    │, 3×4 DataFrame
│ Row │ x    │ y     │ z    │ new   │
│     │ Char │ Int64 │ Char │ Int64 │
├─────┼──────┼───────┼──────┼───────┤
│ 1   │ 'a'  │ 4     │ 'd'  │ 11    │
│ 2   │ 'b'  │ 5     │ 'e'  │ 12    │
│ 3   │ 'c'  │ 6     │ 'f'  │ 13    │)

For comparison, we merge the two DataFrames with hcat, which renames duplicate names instead of overwriting them.

df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
println(df1)
println(df2)
hcat(df1, df2, makeunique=true)

3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │
3×3 DataFrame
│ Row │ x    │ z    │ new   │
│     │ Char │ Char │ Int64 │
├─────┼──────┼──────┼───────┤
│ 1   │ 'a'  │ 'd'  │ 11    │
│ 2   │ 'b'  │ 'e'  │ 12    │
│ 3   │ 'c'  │ 'f'  │ 13    │

merge!(df1,df2)
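The difference between the two approaches can be sketched without merge!, which is useful if that method is unavailable in your DataFrames version. The loop below is a stand-in, not the library's implementation; it assumes DataFrames 1.x indexing (`df[!, name]` and `df[:, name]`), and the variable names `side_by_side` and `overwritten` are illustrative.

```julia
using DataFrames

df1 = DataFrame(x = 1:3, y = 4:6)
df2 = DataFrame(x = 'a':'c', z = 'd':'f', new = 11:13)

# hcat keeps both :x columns; the duplicate gets a suffix.
side_by_side = hcat(df1, df2, makeunique = true)

# Overwriting instead: copy every column of df2 into a copy of df1,
# replacing columns whose names already exist.
overwritten = copy(df1)
for name in names(df2)
    overwritten[!, name] = df2[:, name]  # df2[:, name] copies the column
end

println(Symbol.(names(side_by_side)))
println(Symbol.(names(overwritten)))
```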

Subsetting/removing columns

Let's create a new DataFrame x and show a few ways to create DataFrames with a subset of x's columns.

x = DataFrame([(i,j) for i in 1:3, j in 1:5])

First we could do this by index

x[[1,2,4,5]]

or by column name.

x[[:x1, :x4]]

We can also choose to keep or exclude columns by Bool. (We need a vector whose length is the number of columns in the original DataFrame.)

x[[true, false, true, false, true]]

Here we create a single column DataFrame,

x[[:x1]]

and here we access the vector contained in column :x1.

x[:x1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We could grab the same vector by column number

x[1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We can also remove everything from a DataFrame with empty!.

empty!(y)

Here we create a copy of x and delete the 3rd column from the copy with delete!.

z = copy(x)
x, delete!(z, 3)

(3×5 DataFrame
│ Row │ x1     │ x2     │ x3     │ x4     │ x5     │
│     │ Tuple… │ Tuple… │ Tuple… │ Tuple… │ Tuple… │
├─────┼────────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 3) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 3) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 3) │ (3, 4) │ (3, 5) │, 3×4 DataFrame
│ Row │ x1     │ x2     │ x4     │ x5     │
│     │ Tuple… │ Tuple… │ Tuple… │ Tuple… │
├─────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 4) │ (3, 5) │)
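For readers on newer releases, the same selections can be sketched with two-argument indexing. This assumes DataFrames 1.x, where `df[:, cols]` returns a new DataFrame, `df[!, col]` returns the stored column vector, and select with `Not` drops columns. The DataFrame `df` is made up for illustration.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6], x3 = [7, 8, 9])

# Subset of columns by name and by Bool mask; both return new DataFrames.
by_name = df[:, [:x1, :x3]]
by_mask = df[:, [true, false, true]]

# Access the vector stored in column :x1 (no copy is made).
col = df[!, :x1]

# New DataFrame without :x2; df is unchanged by this call.
without_x2 = select(df, Not(:x2))

# Drop the third column in place.
select!(df, Not(3))
```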

Modify column by name

x = DataFrame([(i,j) for i in 1:3, j in 1:5])

With the following syntax, the existing column is replaced without any copying, so :x1 and :x2 then refer to the same vector.

x[:x1] = x[:x2]
x

We can also use the following syntax to add a new column at the end of a DataFrame.

x[:A] = [1,2,3]
x

A new column name will be added to our DataFrame with the following syntax as well (7 is equal to ncol(x)+1).

x[7] = 11:13
x
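The no-copy behavior can be checked directly. The sketch below assumes DataFrames 1.x, where the equivalent assignment is written `df[!, :name] = vector`; the DataFrame `df` and the variable `aliased` are illustrative.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6])

# Replace :x1 with the vector stored in :x2 without copying it.
df[!, :x1] = df[!, :x2]

# === tests object identity: both columns are now the same vector.
aliased = df.x1 === df.x2

# Assigning to a name that does not exist yet appends a new column.
df[!, :A] = [7, 8, 9]

println(aliased)
println(Symbol.(names(df)))
```

Because the two columns share one vector, a later in-place change to one of them shows up in the other; assign `copy(df.x2)` instead if that is not wanted.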

Find column name

x = DataFrame([(i,j) for i in 1:3, j in 1:5])

We can check if a column with a given name exists via

:x1 in names(x)

true

and determine its index via

findfirst(names(x) .== :x2)

2
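Both lookups also have dedicated helpers in newer releases. Assuming DataFrames 1.x, hasproperty checks for a column and columnindex returns its position, or 0 when the column does not exist. The DataFrame `df` and the name `:nope` are made up for illustration.

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration.
df = DataFrame(x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])

# Does a column named :x1 exist?
has_x1 = hasproperty(df, :x1)

# Position of column :x2 among the columns.
idx = columnindex(df, :x2)

# A name that is not present yields 0 rather than an error.
missing_idx = columnindex(df, :nope)

println((has_x1, idx, missing_idx))
```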
