Introduction to DataFrames¶

Bogumił Kamiński, Apr 21, 2018

Source¶

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

See also¶

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(in Korean)

using DataFrames # load package

Working with CategoricalArrays¶

Constructor¶

x = categorical(["A", "B", "B", "C"]) # unordered categorical array

4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

y = categorical(["A", "B", "B", "C"], ordered=true) # ordered categorical array; by default the level order is the sort order

4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

z = categorical(["A","B","B","C", missing]) # unordered categorical array containing a missing value

5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing

c = cut(1:10, 5) # ordered; splits the data into bins of equal counts. Labels can be renamed and custom breaks can be given.

10-element CategoricalArray{String,1,UInt32}:
 "[1.0, 2.8)" 
 "[1.0, 2.8)" 
 "[2.8, 4.6)" 
 "[2.8, 4.6)" 
 "[4.6, 6.4)" 
 "[4.6, 6.4)" 
 "[6.4, 8.2)" 
 "[6.4, 8.2)" 
 "[8.2, 10.0]"
 "[8.2, 10.0]"

by(DataFrame(x=cut(randn(100000), 10)), :x, d -> DataFrame(n=nrow(d)), sort=true) # just a check that cut works as expected

v = categorical([1,2,2,3,3]) # contains integers rather than strings

5-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 2
 3
 3

Vector{Union{String, Missing}}(z) # use this to convert back to a standard vector

5-element Array{Union{Missing, String},1}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing

Managing levels¶

arr = [x,y,z,c,v]

5-element Array{CategoricalArray{T,1,UInt32,V,C,U} where U where C where V where T,1}:
 CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 Union{Missing, CategoricalString{UInt32}}["A", "B", "B", "C", missing]                                                                                                 
 CategoricalString{UInt32}["[1.0, 2.8)", "[1.0, 2.8)", "[2.8, 4.6)", "[2.8, 4.6)", "[4.6, 6.4)", "[4.6, 6.4)", "[6.4, 8.2)", "[6.4, 8.2)", "[8.2, 10.0]", "[8.2, 10.0]"]
 CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3]

isordered.(arr) # check whether each categorical array is ordered

5-element BitArray{1}:
 false
  true
 false
  true
 false

ordered!(x, true), isordered(x) # make x ordered

(CategoricalString{UInt32}["A", "B", "B", "C"], true)

ordered!(x, false), isordered(x) # and make it unordered again

(CategoricalString{UInt32}["A", "B", "B", "C"], false)

levels.(arr) # list the levels

5-element Array{Array{T,1} where T,1}:
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 ["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]

unique.(arr) # unique values; unlike levels, this includes missing

5-element Array{Array{T,1} where T,1}:
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 Union{Missing, String}["A", "B", "C", missing]                         
 ["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]

y[1] < y[2] # comparison works because y is ordered

true

v[1] < v[2] # not comparable: v is unordered, even though its values are integers

ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

Stacktrace:
 [1] <(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:179
 [2] top-level scope at In[16]:1

levels!(y, ["C", "B", "A"]) # the levels can be reordered; this is mostly useful for ordered categorical arrays

4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

y[1] < y[2] # the order has changed

false

levels!(z, ["A", "B"]) # error: every level that is still in use must be listed

ArgumentError: cannot remove level "C" as it is used at position 4 and allow_missing=false.

Stacktrace:
 [1] #levels!#54(::Bool, ::Function, ::CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}, ::Array{String,1}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/array.jl:594
 [2] levels!(::CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}, ::Array{String,1}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/array.jl:582
 [3] top-level scope at In[19]:1

levels!(z, ["A", "B"], allow_missing=true) # with allow_missing=true, values of the omitted levels become missing instead

5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 missing
 missing

z[1] = "B"
z # z now contains only "B"

5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "B"    
 "B"    
 "B"    
 missing
 missing

levels(z) # but "A" is still among the levels

2-element Array{String,1}:
 "A"
 "B"

droplevels!(z) # drop the unused levels
levels(z)

1-element Array{String,1}:
 "B"

Data manipulation¶

x, levels(x)

(CategoricalString{UInt32}["A", "B", "B", "C"], ["A", "B", "C"])

x[2] = "0"
x, levels(x) # a new level was appended at the end (this works only for unordered arrays)

(CategoricalString{UInt32}["A", "0", "B", "C"], ["A", "B", "C", "0"])

v, levels(v)

(CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3], [1, 2, 3])

v[1] + v[2] # arithmetic is not defined for categorical values, even when they hold integers

MethodError: no method matching +(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32})
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:502

Stacktrace:
 [1] top-level scope at In[27]:1

Vector{Int}(v) # either convert the data to integers

5-element Array{Int64,1}:
 1
 2
 2
 3
 3

get(v[1]) + get(v[2]) # or extract the individual values with get

3

get.(v) # broadcasting get works only when there are no missing values

5-element Array{Int64,1}:
 1
 2
 2
 3
 3

get.(z) # fails because z contains missing values

MethodError: no method matching get(::Missing)
Closest candidates are:
  get(!Matched::Base.EnvDict, !Matched::AbstractString, !Matched::Any) at env.jl:77
  get(!Matched::Base.TTY, !Matched::Symbol, !Matched::Any) at stream.jl:415
  get(!Matched::REPL.Terminals.TTYTerminal, !Matched::Any, !Matched::Any) at /home/yt/Julia/julia/usr/share/julia/stdlib/v1.0/REPL/src/Terminals.jl:176
  ...

Stacktrace:
 [1] _broadcast_getindex_evalf at ./broadcast.jl:574 [inlined]
 [2] _broadcast_getindex at ./broadcast.jl:547 [inlined]
 [3] getindex at ./broadcast.jl:507 [inlined]
 [4] macro expansion at ./broadcast.jl:838 [inlined]
 [5] macro expansion at ./simdloop.jl:73 [inlined]
 [6] copyto! at ./broadcast.jl:837 [inlined]
 [7] copyto! at ./broadcast.jl:792 [inlined]
 [8] copy at ./broadcast.jl:768 [inlined]
 [9] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(get),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}}}) at ./broadcast.jl:748
 [10] top-level scope at In[31]:1

Vector{Union{String, Missing}}(z) # here you have to convert instead

5-element Array{Union{Missing, String},1}:
 "B"    
 "B"    
 "B"    
 missing
 missing

z[1]*z[2], z.^2 # the one exception: categorical arrays based on String support the usual string operations

("BB", Union{Missing, String}["BB", "BB", "BB", missing, missing])

recode([1,2,3,4,5,missing], 1=>10) # recode some values of an array; this returns a new array, and recode! is the in-place equivalent

6-element Array{Union{Missing, Int64},1}:
 10       
  2       
  3       
  4       
  5       
   missing

recode([1,2,3,4,5,missing], "a", 1=>10, 2=>20) # a default value ("a") is used for every value without a mapping

6-element Array{Any,1}:
 10       
 20       
   "a"    
   "a"    
   "a"    
   missing

recode([1,2,3,4,5,missing], 1=>10, missing=>"missing") # missing is recoded only when it is mapped explicitly

6-element Array{Any,1}:
 10         
  2         
  3         
  4         
  5         
   "missing"

t = categorical([1:5; missing])
t, levels(t)

(Union{Missing, CategoricalValue{Int64,UInt32}}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])

recode!(t, [1,3]=>2)
t, levels(t) # note that recode! also drops the levels that are no longer used

(Union{Missing, CategoricalValue{Int64,UInt32}}[2, 2, 2, 4, 5, missing], [2, 4, 5])

t = categorical([1,2,3], ordered=true)
levels(recode(t, 2=>0, 1=>-1)) # new levels are appended at the end, in the order in which they appear in the mappings

3-element Array{Int64,1}:
  3
  0
 -1

t = categorical([1,2,3,4,5], ordered=true) # when a default value is given, it becomes the last level
levels(recode(t, 300, [1,2]=>100, 3=>200))

3-element Array{Int64,1}:
 100
 200
 300

Comparisons¶

x = categorical([1,2,3])

3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]

4-element Array{CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}},1}:
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]

levels!(xs[2], [3,2,1])

3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

levels!(xs[4], [2,3,1])

3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

[a == b for a in xs, b in xs] # all are equal: == compares contents only

4×4 Array{Bool,2}:
 true  true  true  true
 true  true  true  true
 true  true  true  true
 true  true  true  true

signature(x::CategoricalArray) = (x, levels(x), isordered(x)) # values, levels, and orderedness together fully describe a categorical array

signature (generic function with 1 method)

signature(xs[1])

(CategoricalValue{Int64,UInt32}[1, 2, 3], [1, 2, 3], false)

signature(xs[2])

(CategoricalValue{Int64,UInt32}[1, 2, 3], [3, 2, 1], false)

signature(xs[3])

(CategoricalValue{Int64,UInt32}[1, 2, 3], [1, 2, 3], true)

signature(xs[4])

(CategoricalValue{Int64,UInt32}[1, 2, 3], [2, 3, 1], true)

# all are different; note that xs[1] and xs[2] are both unordered but have their levels in a different order
[signature(a) == signature(b) for a in xs, b in xs]

4×4 Array{Bool,2}:
  true  false  false  false
 false   true  false  false
 false  false   true  false
 false  false  false   true

x[1] < x[2] # elements of an unordered categorical array cannot be compared with <

ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

Stacktrace:
 [1] <(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:179
 [2] top-level scope at In[52]:1

t[1] < t[2] # but elements of an ordered one can

true

isless(x[1], x[2]) # isless works within the same categorical array even if it is unordered

true

y = deepcopy(x) # but isless does not work across different categorical arrays
isless(x[1], y[2])

ArgumentError: CategoricalValue objects with different pools cannot be tested for order

Stacktrace:
 [1] isless(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:162
 [2] top-level scope at In[55]:2

isless(get(x[1]), get(y[2])) # use get to compare the underlying values instead

true

x[1] == y[2] # equality tests do work across different categorical arrays

false

Categorical columns in a DataFrame¶

df = DataFrame(x = 1:3, y = 'a':'c', z = ["a","b","c"])

categorical!(df) # converts every string column to categorical (Char columns are left unchanged)

showcols(df)

┌ Warning: `showcols(df::AbstractDataFrame, all::Bool=false, values::Bool=true)` is deprecated, use `describe(df, stats=[:eltype, :nmissing, :first, :last])` instead.
│   caller = showcols(::DataFrame) at deprecated.jl:54
└ @ DataFrames ./deprecated.jl:54

categorical!(df, :x) # explicitly convert column :x to categorical

showcols(df)

Output of categorical!(df, :x) (x is now categorical as well):

	x	y	z
	Categorical…	Char	Categorical…
1	1	'a'	a
2	2	'b'	b
3	3	'c'	c

Output of the by(...) call in the Constructor section (counts per bin of cut(randn(100000), 10)):

	x	n
	Categorical…	Int64
1	[-4.77514, -1.28319)	10000
2	[-1.28319, -0.84594)	10000
3	[-0.84594, -0.52835)	10000
4	[-0.52835, -0.261037)	10000
5	[-0.261037, -0.00606204)	10000
6	[-0.00606204, 0.250787)	10000
7	[0.250787, 0.523152)	10000
8	[0.523152, 0.84362)	10000
9	[0.84362, 1.29359)	10000
10	[1.29359, 4.50647]	10000

Output of the first showcols(df), after categorical!(df) (only z has been converted):

	variable	eltype	nmissing	first	last
	Symbol	DataType	Nothing	Any	Any
1	x	Int64		1	3
2	y	Char		'a'	'c'
3	z	CategoricalString{UInt32}		a	c
