티스토리 뷰

05_columns

Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

Reference

Series

In [1]:
using DataFrames # load package

Manipulating columns of DataFrame

Renaming columns

Let's start with a DataFrame of Bools that has default column names.

In [2]:
x = DataFrame(Bool, 3, 4)
Out[2]:
x1x2x3x4
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

With rename, we create new DataFrame; here we rename the column :x1 to :A. (rename also accepts collections of Pairs.)

In [3]:
rename(x, :x1 => :A)
Out[3]:
Ax2x3x4
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

With rename! we do an in place transformation.

This time we've applied a function to every column name.

In [4]:
rename!(c -> Symbol(string(c)^2), x)
Out[4]:
x1x1x2x2x3x3x4x4
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

We can also change the name of a particular column without knowing the original.

Here we change the name of the third column, creating a new DataFrame.

In [5]:
rename(x, names(x)[3] => :third)
Out[5]:
x1x1x2x2thirdx4x4
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

With names!, we can change the names of all variables.

In [6]:
names!(x, [:a, :b, :c, :d])
Out[6]:
abcd
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

We get an error when we try to provide duplicate names

In [7]:
names!(x, fill(:a, 4))
ArgumentError: Duplicate variable names: Symbol[:a, :a, :a, :a].
Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] #names!#3(::Bool, ::Bool, ::Function, ::DataFrames.Index, ::Array{Symbol,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/index.jl:34
 [2] #names! at ./none:0 [inlined]
 [3] #names!#15 at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:139 [inlined]
 [4] names!(::DataFrame, ::Array{Symbol,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:136
 [5] top-level scope at In[7]:1

unless we pass makeunique=true, which allows us to handle duplicates in passed names.

In [8]:
names!(x, fill(:a, 4), makeunique=true)
Out[8]:
aa_1a_2a_3
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

Reordering columns

We can reorder the names(x) vector as needed, creating a new DataFrame.

In [9]:
using Random
Random.seed!(1234)#srand(1234)
x[shuffle(names(x))]
Out[9]:
a_1a_3a_2a
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

we can also reorder dataframes with permutecols!.

In [10]:
permutecols!(x, [2, 1, 3, 4])
Out[10]:
a_1aa_2a_3
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse
In [11]:
permutecols!(x, [:a, :a_1, :a_2, :a_3])
Out[11]:
aa_1a_2a_3
BoolBoolBoolBool
1falsefalsefalsefalse
2falsefalsefalsefalse
3falsefalsetruefalse

Merging/adding columns

In [12]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])
Out[12]:
x1x2x3x4
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)
2(2, 1)(2, 2)(2, 3)(2, 4)
3(3, 1)(3, 2)(3, 3)(3, 4)

With hcat we can merge two DataFrames. Also [x y] syntax is supported but only when DataFrames have unique column names.

In [13]:
hcat(x, x, makeunique=true)
Out[13]:
x1x2x3x4x1_1x2_1x3_1x4_1
Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 1)(1, 2)(1, 3)(1, 4)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 1)(2, 2)(2, 3)(2, 4)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 1)(3, 2)(3, 3)(3, 4)
In [14]:
[x x]
┌ Warning: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
│   caller = ip:0x0
└ @ Core :-1
Out[14]:
x1x2x3x4x1_1x2_1x3_1x4_1
Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 1)(1, 2)(1, 3)(1, 4)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 1)(2, 2)(2, 3)(2, 4)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 1)(3, 2)(3, 3)(3, 4)

We can also use hcat to add a new column; a default name :x1 will be used for this column, so makeunique=true is needed.

In [15]:
y = hcat(x, [1,2,3], makeunique=true)
Out[15]:
x1x2x3x4x1_1
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3
In [16]:
[x [1,2,3]]
Out[16]:
x1x2x3x4x1_1
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3

You can also prepend a vector with hcat.

In [17]:
hcat([1,2,3], x, makeunique=true)
Out[17]:
x1x1_1x2x3x4
Int64Tuple…Tuple…Tuple…Tuple…
11(1, 1)(1, 2)(1, 3)(1, 4)
22(2, 1)(2, 2)(2, 3)(2, 4)
33(3, 1)(3, 2)(3, 3)(3, 4)
In [18]:
[[1,2,3] x]
Out[18]:
x1x1_1x2x3x4
Int64Tuple…Tuple…Tuple…Tuple…
11(1, 1)(1, 2)(1, 3)(1, 4)
22(2, 1)(2, 2)(2, 3)(2, 4)
33(3, 1)(3, 2)(3, 3)(3, 4)

Alternatively you could append a vector with the following syntax. This is a bit more verbose but cleaner.

In [19]:
y = [x DataFrame(A=[1,2,3])]
Out[19]:
x1x2x3x4A
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3

Here we do the same but add column :A to the front.

In [20]:
y = [DataFrame(A=[1,2,3]) x]
Out[20]:
Ax1x2x3x4
Int64Tuple…Tuple…Tuple…Tuple…
11(1, 1)(1, 2)(1, 3)(1, 4)
22(2, 1)(2, 2)(2, 3)(2, 4)
33(3, 1)(3, 2)(3, 3)(3, 4)

A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

In [21]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]]
  10.601 μs (120 allocations: 9.36 KiB)
Out[21]:
x1x2Ax3x4
Tuple…Tuple…Int64Tuple…Tuple…
1(1, 1)(1, 2)1(1, 3)(1, 4)
2(2, 1)(2, 2)2(2, 3)(2, 4)
3(3, 1)(3, 2)3(3, 3)(3, 4)

We could also do this with a specialized in place method insert!. Let's add :newcol to the DataFrame y.

In [22]:
insert!(y, 2, [1,2,3], :newcol)
Out[22]:
Anewcolx1x2x3x4
Int64Int64Tuple…Tuple…Tuple…Tuple…
111(1, 1)(1, 2)(1, 3)(1, 4)
222(2, 1)(2, 2)(2, 3)(2, 4)
333(3, 1)(3, 2)(3, 3)(3, 4)

If you want to insert the same column name several times makeunique=true is needed as usual.

In [23]:
insert!(y, 2, [1,2,3], :newcol, makeunique=true)
Out[23]:
Anewcol_1newcolx1x2x3x4
Int64Int64Int64Tuple…Tuple…Tuple…Tuple…
1111(1, 1)(1, 2)(1, 3)(1, 4)
2222(2, 1)(2, 2)(2, 3)(2, 4)
3333(3, 1)(3, 2)(3, 3)(3, 4)

We can see how much faster it is to insert a column with insert! than with hcat using @btime.

In [24]:
@btime insert!(copy($x), 3, [1,2,3], :A)
  1.086 μs (17 allocations: 1.38 KiB)
Out[24]:
x1x2Ax3x4
Tuple…Tuple…Int64Tuple…Tuple…
1(1, 1)(1, 2)1(1, 3)(1, 4)
2(2, 1)(2, 2)2(2, 3)(2, 4)
3(3, 1)(3, 2)3(3, 3)(3, 4)

Let's use insert! to append a column in place,

In [25]:
insert!(x, ncol(x)+1, [1,2,3], :A)
Out[25]:
x1x2x3x4A
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3

and to in place prepend a column.

In [26]:
insert!(x, 1, [1,2,3], :B)
Out[26]:
Bx1x2x3x4A
Int64Tuple…Tuple…Tuple…Tuple…Int64
11(1, 1)(1, 2)(1, 3)(1, 4)1
22(2, 1)(2, 2)(2, 3)(2, 4)2
33(3, 1)(3, 2)(3, 3)(3, 4)3

With merge!, let's merge the second DataFrame into first, but overwriting duplicates.

In [27]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
df1, df2, merge!(df1, df2)
Out[27]:
(3×4 DataFrame
│ Row │ x    │ y     │ z    │ new   │
│     │ CharInt64CharInt64 │
├─────┼──────┼───────┼──────┼───────┤
│ 1   │ 'a'  │ 4     │ 'd'  │ 11    │
│ 2   │ 'b'  │ 5     │ 'e'  │ 12    │
│ 3   │ 'c'  │ 6     │ 'f'  │ 13    │, 3×3 DataFrame
│ Row │ x    │ z    │ new   │
│     │ CharCharInt64 │
├─────┼──────┼──────┼───────┤
│ 1   │ 'a'  │ 'd'  │ 11    │
│ 2   │ 'b'  │ 'e'  │ 12    │
│ 3   │ 'c'  │ 'f'  │ 13    │, 3×4 DataFrame
│ Row │ x    │ y     │ z    │ new   │
│     │ CharInt64CharInt64 │
├─────┼──────┼───────┼──────┼───────┤
│ 1   │ 'a'  │ 4     │ 'd'  │ 11    │
│ 2   │ 'b'  │ 5     │ 'e'  │ 12    │
│ 3   │ 'c'  │ 6     │ 'f'  │ 13    │)

For comparison: merge two DataFramess but renaming duplicate names via hcat.

In [28]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
println(df1)
println(df2)
hcat(df1, df2, makeunique=true)
3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │
3×3 DataFrame
│ Row │ x    │ z    │ new   │
│     │ CharCharInt64 │
├─────┼──────┼──────┼───────┤
│ 1   │ 'a'  │ 'd'  │ 11    │
│ 2   │ 'b'  │ 'e'  │ 12    │
│ 3   │ 'c'  │ 'f'  │ 13    │
Out[28]:
xyx_1znew
Int64Int64CharCharInt64
114'a''d'11
225'b''e'12
336'c''f'13
In [29]:
merge!(df1,df2)
Out[29]:
xyznew
CharInt64CharInt64
1'a'4'd'11
2'b'5'e'12
3'c'6'f'13

Subsetting/removing columns

Let's create a new DataFrame x and show a few ways to create DataFrames with a subset of x's columns.

In [30]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])
Out[30]:
x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

First we could do this by index

In [31]:
x[[1,2,4,5]]
Out[31]:
x1x2x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 4)(3, 5)

or by column name.

In [32]:
x[[:x1, :x4]]
Out[32]:
x1x4
Tuple…Tuple…
1(1, 1)(1, 4)
2(2, 1)(2, 4)
3(3, 1)(3, 4)

We can also choose to keep or exclude columns by Bool. (We need a vector whose length is the number of columns in the original DataFrame.)

In [33]:
x[[true, false, true, false, true]]
Out[33]:
x1x3x5
Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 5)
2(2, 1)(2, 3)(2, 5)
3(3, 1)(3, 3)(3, 5)

Here we create a single column DataFrame,

In [34]:
x[[:x1]]
Out[34]:
x1
Tuple…
1(1, 1)
2(2, 1)
3(3, 1)

and here we access the vector contained in column :x1.

In [35]:
x[:x1]
Out[35]:
3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We could grab the same vector by column number

In [36]:
x[1]
Out[36]:
3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

and remove everything from a DataFrame with empty!.

In [37]:
empty!(y)
Out[37]:

Here we create a copy of x and delete the 3rd column from the copy with delete!.

In [38]:
z = copy(x)
x, delete!(z, 3)
Out[38]:
(3×5 DataFrame
│ Row │ x1     │ x2     │ x3     │ x4     │ x5     │
│     │ Tuple…Tuple…Tuple…Tuple…Tuple… │
├─────┼────────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 3) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 3) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 3) │ (3, 4) │ (3, 5) │, 3×4 DataFrame
│ Row │ x1     │ x2     │ x4     │ x5     │
│     │ Tuple…Tuple…Tuple…Tuple… │
├─────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 4) │ (3, 5) │)

Modify column by name

In [39]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])
Out[39]:
x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

With the following syntax, the existing column is modified without performing any copying.

In [40]:
x[:x1] = x[:x2]
x
Out[40]:
x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 2)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)

We can also use the following syntax to add a new column at the end of a DataFrame.

In [41]:
x[:A] = [1,2,3]
x
Out[41]:
x1x2x3x4x5A
Tuple…Tuple…Tuple…Tuple…Tuple…Int64
1(1, 2)(1, 2)(1, 3)(1, 4)(1, 5)1
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)2
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)3

A new column name will be added to our DataFrame with the following syntax as well (7 is equal to ncol(x)+1).

In [42]:
x[7] = 11:13
x
Out[42]:
x1x2x3x4x5Ax7
Tuple…Tuple…Tuple…Tuple…Tuple…Int64Int64
1(1, 2)(1, 2)(1, 3)(1, 4)(1, 5)111
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)212
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)313

Find column name

In [43]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])
Out[43]:
x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

We can check if a column with a given name exists via

In [44]:
:x1 in names(x)
Out[44]:
true

and determine its index via

In [45]:
findfirst(names(x) .== :x2)
Out[45]:
2

'Flux in Julia > Learning Julia (Intro_to_Julia_DFs)' 카테고리의 다른 글

06. rows  (0) 2018.10.12
05. columns (한글)  (0) 2018.10.11
04. loadsave (한글)  (0) 2018.10.10
04. loadsave  (0) 2018.10.10
03. missingvalues (한글)  (0) 2018.10.09
공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
TAG
more
«   2025/05   »
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
글 보관함