07_factors

Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Reference

Series

In [1]:
using DataFrames # load package

Working with CategoricalArrays

Constructor

In [2]:
x = categorical(["A", "B", "B", "C"]) # unordered
Out[2]:
4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"
In [3]:
y = categorical(["A", "B", "B", "C"], ordered=true) # ordered; by default the levels follow sorting order
Out[3]:
4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"
In [4]:
z = categorical(["A","B","B","C", missing]) # unordered with missings
Out[4]:
5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing
In [5]:
c = cut(1:10, 5) # ordered; splits into 5 equal-count bins; labels and break points can be customized
Out[5]:
10-element CategoricalArray{String,1,UInt32}:
 "[1.0, 2.8)" 
 "[1.0, 2.8)" 
 "[2.8, 4.6)" 
 "[2.8, 4.6)" 
 "[4.6, 6.4)" 
 "[4.6, 6.4)" 
 "[6.4, 8.2)" 
 "[6.4, 8.2)" 
 "[8.2, 10.0]"
 "[8.2, 10.0]"
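The comment above mentions custom breaks and renamed labels. A minimal sketch of that variant follows, assuming the `cut(x, breaks; labels=...)` method of CategoricalArrays; the break points and label names are hypothetical and chosen only for illustration.

```julia
using CategoricalArrays

# Hypothetical break points and labels, for illustration only.
# Without `extend=true` every value must fall inside the breaks,
# so the upper break (11) is placed above the maximum of the data.
c2 = cut(1:10, [1, 5, 11], labels=["low", "high"])

levels(c2)     # the custom labels, in break order
isordered(c2)  # cut returns an ordered array
```

Values 1 to 4 fall in the first interval and 5 to 10 in the second, so each element carries one of the two labels.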
In [6]:
by(DataFrame(x=cut(randn(100000), 10)), :x, d -> DataFrame(n=nrow(d)), sort=true) # check that cut yields 10 bins with equal counts
Out[6]:
      x                          n
      Categorical…               Int64
 1    [-4.39486, -1.28182)       10000
 2    [-1.28182, -0.841328)      10000
 3    [-0.841328, -0.523679)     10000
 4    [-0.523679, -0.253647)     10000
 5    [-0.253647, 0.00147159)    10000
 6    [0.00147159, 0.256508)     10000
 7    [0.256508, 0.524727)       10000
 8    [0.524727, 0.846228)       10000
 9    [0.846228, 1.28873)        10000
 10   [1.28873, 4.4538]          10000
In [7]:
v = categorical([1,2,2,3,3]) # contains integers not strings
Out[7]:
5-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 2
 3
 3
In [8]:
Vector{Union{String, Missing}}(z) # sometimes you need to convert back to a standard vector
Out[8]:
5-element Array{Union{Missing, String},1}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing

Managing levels

In [9]:
arr = [x,y,z,c,v]
Out[9]:
5-element Array{CategoricalArray{T,1,UInt32,V,C,U} where U where C where V where T,1}:
 CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 Union{Missing, CategoricalString{UInt32}}["A", "B", "B", "C", missing]                                                                                                 
 CategoricalString{UInt32}["[1.0, 2.8)", "[1.0, 2.8)", "[2.8, 4.6)", "[2.8, 4.6)", "[4.6, 6.4)", "[4.6, 6.4)", "[6.4, 8.2)", "[6.4, 8.2)", "[8.2, 10.0]", "[8.2, 10.0]"]
 CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3]                                                                                                                          
In [10]:
isordered.(arr) # check whether each categorical array is ordered
Out[10]:
5-element BitArray{1}:
 false
  true
 false
  true
 false
In [11]:
ordered!(x, true), isordered(x) # make x ordered
Out[11]:
(CategoricalString{UInt32}["A", "B", "B", "C"], true)
In [12]:
ordered!(x, false), isordered(x) # and unordered again
Out[12]:
(CategoricalString{UInt32}["A", "B", "B", "C"], false)
In [13]:
levels.(arr) # list levels
Out[13]:
5-element Array{Array{T,1} where T,1}:
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 ["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]                                                              
In [14]:
unique.(arr) # missing will be included
Out[14]:
5-element Array{Array{T,1} where T,1}:
 ["A", "B", "C"]                                                        
 ["A", "B", "C"]                                                        
 Union{Missing, String}["A", "B", "C", missing]                         
 ["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]                                                              
In [15]:
y[1] < y[2] # can compare as y is ordered
Out[15]:
true
In [16]:
v[1] < v[2] # not comparable, v is unordered although it contains integers
ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

Stacktrace:
 [1] <(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:179
 [2] top-level scope at In[16]:1
In [17]:
levels!(y, ["C", "B", "A"]) # you can reorder levels, mostly useful for ordered CategoricalArrays
Out[17]:
4-element CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"
In [18]:
y[1] < y[2] # observe that the order is changed
Out[18]:
false
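Reordering levels is most useful when the alphabetical default does not match the meaning of the categories. The sketch below uses hypothetical data (the labels "low", "mid", "high" are not from the notebook) to show that both `<` and `sort` follow the level order once it is set with `levels!`.

```julia
using CategoricalArrays

# Hypothetical data, for illustration only.
s = categorical(["low", "high", "mid", "low"], ordered=true)
levels(s)                           # default: alphabetical sorting order

levels!(s, ["low", "mid", "high"])  # impose the meaningful order
s[1] < s[2]                         # "low" < "high" under the new level order
sort(s)                             # sorting follows level order, not alphabetical order
```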
In [19]:
levels!(z, ["A", "B"]) # you have to specify all levels that are present
ArgumentError: cannot remove level "C" as it is used at position 4 and allow_missing=false.

Stacktrace:
 [1] #levels!#54(::Bool, ::Function, ::CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}, ::Array{String,1}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/array.jl:594
 [2] levels!(::CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}, ::Array{String,1}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/array.jl:582
 [3] top-level scope at In[19]:1
In [20]:
levels!(z, ["A", "B"], allow_missing=true) # unless the array allows missings and you force the removal; entries with a dropped level become missing
Out[20]:
5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 missing
 missing
In [21]:
z[1] = "B"
z # now all non-missing entries of z are "B"
Out[21]:
5-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "B"    
 "B"    
 "B"    
 missing
 missing
In [22]:
levels(z) # but it remembers the levels it had (the reason is mostly performance)
Out[22]:
2-element Array{String,1}:
 "A"
 "B"
In [23]:
droplevels!(z) # this way we can clean it up
levels(z)
Out[23]:
1-element Array{String,1}:
 "B"

Data manipulation

In [24]:
x, levels(x)
Out[24]:
(CategoricalString{UInt32}["A", "B", "B", "C"], ["A", "B", "C"])
In [25]:
x[2] = "0"
x, levels(x) # new level added at the end (works only for unordered)
Out[25]:
(CategoricalString{UInt32}["A", "0", "B", "C"], ["A", "B", "C", "0"])
In [26]:
v, levels(v)
Out[26]:
(CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3], [1, 2, 3])
In [27]:
v[1] + v[2] # even though underlying data is Int, we cannot operate on it
MethodError: no method matching +(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32})
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:502

Stacktrace:
 [1] top-level scope at In[27]:1
In [28]:
Vector{Int}(v) # you either have to retrieve the data by conversion (may be expensive)
Out[28]:
5-element Array{Int64,1}:
 1
 2
 2
 3
 3
In [29]:
get(v[1]) + get(v[2]) # or get a single value
Out[29]:
3
In [30]:
get.(v) # this works for arrays without missings
Out[30]:
5-element Array{Int64,1}:
 1
 2
 2
 3
 3
In [31]:
get.(z) # but will fail on missing values
MethodError: no method matching get(::Missing)
Closest candidates are:
  get(!Matched::Base.EnvDict, !Matched::AbstractString, !Matched::Any) at env.jl:77
  get(!Matched::Base.TTY, !Matched::Symbol, !Matched::Any) at stream.jl:415
  get(!Matched::REPL.Terminals.TTYTerminal, !Matched::Any, !Matched::Any) at /home/yt/Julia/julia/usr/share/julia/stdlib/v1.0/REPL/src/Terminals.jl:176
  ...

Stacktrace:
 [1] _broadcast_getindex_evalf at ./broadcast.jl:574 [inlined]
 [2] _broadcast_getindex at ./broadcast.jl:547 [inlined]
 [3] getindex at ./broadcast.jl:507 [inlined]
 [4] macro expansion at ./broadcast.jl:838 [inlined]
 [5] macro expansion at ./simdloop.jl:73 [inlined]
 [6] copyto! at ./broadcast.jl:837 [inlined]
 [7] copyto! at ./broadcast.jl:792 [inlined]
 [8] copy at ./broadcast.jl:768 [inlined]
 [9] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(get),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing}}}) at ./broadcast.jl:748
 [10] top-level scope at In[31]:1
In [32]:
Vector{Union{String, Missing}}(z) # you have to do the conversion
Out[32]:
5-element Array{Union{Missing, String},1}:
 "B"    
 "B"    
 "B"    
 missing
 missing
In [33]:
z[1]*z[2], z.^2 # the only exception is CategoricalArrays based on String: you can operate on their elements normally
Out[33]:
("BB", Union{Missing, String}["BB", "BB", "BB", missing, missing])
In [34]:
recode([1,2,3,4,5,missing], 1=>10) # recode some values in an array; there is also an in-place equivalent, recode!
Out[34]:
6-element Array{Union{Missing, Int64},1}:
 10       
  2       
  3       
  4       
  5       
   missing
In [35]:
recode([1,2,3,4,5,missing], "a", 1=>10, 2=>20) # here "a" is the default for values that have no mapping
Out[35]:
6-element Array{Any,1}:
 10       
 20       
   "a"    
   "a"    
   "a"    
   missing
In [36]:
recode([1,2,3,4,5,missing], 1=>10, missing=>"missing") # missing is recoded only if you map it explicitly
Out[36]:
6-element Array{Any,1}:
 10         
  2         
  3         
  4         
  5         
   "missing"
In [37]:
t = categorical([1:5; missing])
t, levels(t)
Out[37]:
(Union{Missing, CategoricalValue{Int64,UInt32}}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])
In [38]:
recode!(t, [1,3]=>2)
t, levels(t) # note that the unused levels 1 and 3 are dropped after recode!
Out[38]:
(Union{Missing, CategoricalValue{Int64,UInt32}}[2, 2, 2, 4, 5, missing], [2, 4, 5])
In [39]:
t = categorical([1,2,3], ordered=true)
levels(recode(t, 2=>0, 1=>-1)) # new levels are added at the end, in the order of appearance
Out[39]:
3-element Array{Int64,1}:
  3
  0
 -1
In [40]:
t = categorical([1,2,3,4,5], ordered=true) # when a default value is given, it becomes the last level
levels(recode(t, 300, [1,2]=>100, 3=>200))
Out[40]:
3-element Array{Int64,1}:
 100
 200
 300

Comparisons

In [41]:
x = categorical([1,2,3])
Out[41]:
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
In [42]:
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
Out[42]:
4-element Array{CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}},1}:
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]
In [43]:
levels!(xs[2], [3,2,1])
Out[43]:
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
In [44]:
levels!(xs[4], [2,3,1])
Out[44]:
3-element CategoricalArray{Int64,1,UInt32}:
 1
 2
 3
In [45]:
[a == b for a in xs, b in xs] # all are equal - comparison only by contents
Out[45]:
4×4 Array{Bool,2}:
 true  true  true  true
 true  true  true  true
 true  true  true  true
 true  true  true  true
In [46]:
signature(x::CategoricalArray) = (x, levels(x), isordered(x)) # this is actually the full signature of CategoricalArray
Out[46]:
signature (generic function with 1 method)
In [47]:
signature(xs[1])
Out[47]:
(CategoricalValue{Int64,UInt32}[1, 2, 3], [1, 2, 3], false)
In [48]:
signature(xs[2])
Out[48]:
(CategoricalValue{Int64,UInt32}[1, 2, 3], [3, 2, 1], false)
In [49]:
signature(xs[3])
Out[49]:
(CategoricalValue{Int64,UInt32}[1, 2, 3], [1, 2, 3], true)
In [50]:
signature(xs[4])
Out[50]:
(CategoricalValue{Int64,UInt32}[1, 2, 3], [2, 3, 1], true)
In [51]:
# all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels
[signature(a) == signature(b) for a in xs, b in xs]
Out[51]:
4×4 Array{Bool,2}:
  true  false  false  false
 false   true  false  false
 false  false   true  false
 false  false  false   true
In [52]:
x[1] < x[2] # you cannot compare elements of unordered CategoricalArray
ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

Stacktrace:
 [1] <(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:179
 [2] top-level scope at In[52]:1
In [53]:
t[1] < t[2] # but you can do it for an ordered one
Out[53]:
true
In [54]:
isless(x[1], x[2]) # isless works within the same CategoricalArray even if it is not ordered
Out[54]:
true
In [55]:
y = deepcopy(x) # but not across categorical arrays
isless(x[1], y[2])
ArgumentError: CategoricalValue objects with different pools cannot be tested for order

Stacktrace:
 [1] isless(::CategoricalValue{Int64,UInt32}, ::CategoricalValue{Int64,UInt32}) at /home/yt/.julia/packages/CategoricalArrays/rQrLR/src/value.jl:162
 [2] top-level scope at In[55]:2
In [56]:
isless(get(x[1]), get(y[2])) # you can use get to make a comparison of the contents of CategoricalArray
Out[56]:
true
In [57]:
x[1] == y[2] # equality tests work across CategoricalArrays
Out[57]:
false

Categorical columns in a DataFrame

In [58]:
df = DataFrame(x = 1:3, y = 'a':'c', z = ["a","b","c"])
Out[58]:
     x      y     z
     Int64  Char  String
 1   1      'a'   a
 2   2      'b'   b
 3   3      'c'   c
In [59]:
categorical!(df) # converts all columns whose eltype is a subtype of AbstractString to categorical
Out[59]:
     x      y     z
     Int64  Char  Categorical…
 1   1      'a'   a
 2   2      'b'   b
 3   3      'c'   c
In [60]:
showcols(df)
┌ Warning: `showcols(df::AbstractDataFrame, all::Bool=false, values::Bool=true)` is deprecated, use `describe(df, stats=[:eltype, :nmissing, :first, :last])` instead.
│   caller = showcols(::DataFrame) at deprecated.jl:54
└ @ DataFrames ./deprecated.jl:54
Out[60]:
     variable  eltype                     nmissing  first  last
     Symbol    DataType                   Nothing   Any    Any
 1   x         Int64                                1      3
 2   y         Char                                 'a'    'c'
 3   z         CategoricalString{UInt32}            a      c
In [61]:
categorical!(df, :x) # manually convert column :x to categorical
Out[61]:
     x             y     z
     Categorical…  Char  Categorical…
 1   1             'a'   a
 2   2             'b'   b
 3   3             'c'   c
In [62]:
showcols(df)
Out[62]:
     variable  eltype                          nmissing  first  last
     Symbol    DataType                        Nothing   Any    Any
 1   x         CategoricalValue{Int64,UInt32}            1      3
 2   y         Char                                      'a'    'c'
 3   z         CategoricalString{UInt32}                 a      c
