Introduction to DataFrames
Bogumił Kamiński, Apr 21, 2018
Reference
Series
- https://deepstat.tistory.com/69 (01. constructors)(in English)
- https://deepstat.tistory.com/70 (01. constructors)(in Korean)
- https://deepstat.tistory.com/71 (02. basicinfo)(in English)
- https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
- https://deepstat.tistory.com/73 (03. missingvalues)(in English)
- https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
- https://deepstat.tistory.com/75 (04. loadsave)(in English)
- https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
- https://deepstat.tistory.com/77 (05. columns)(in English)
- https://deepstat.tistory.com/78 (05. columns)(in Korean)
- https://deepstat.tistory.com/79 (06. rows)(in English)
- https://deepstat.tistory.com/80 (06. rows)(in Korean)
- https://deepstat.tistory.com/81 (07. factors)(in English)
- https://deepstat.tistory.com/82 (07. factors)(in Korean)
In [1]:
using DataFrames # load package
Working with CategoricalArrays
Constructor
In [2]:
x = categorical(["A", "B", "B", "C"]) # unordered
Out[2]:
In [3]:
y = categorical(["A", "B", "B", "C"], ordered=true) # ordered; by default the levels are in sorted order
Out[3]:
In [4]:
z = categorical(["A","B","B","C", missing]) # unordered with missings
Out[4]:
In [5]:
c = cut(1:10, 5) # ordered, cut into 5 groups of equal counts; you can also rename labels and give custom breaks
Out[5]:
In [6]:
by(DataFrame(x=cut(randn(100000), 10)), :x, d -> DataFrame(n=nrow(d)), sort=true) # just to make sure it works right
Out[6]:
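The `by` function used above is no longer available in recent DataFrames.jl releases. A minimal sketch of the same sanity check, assuming DataFrames.jl 1.x (where CategoricalArrays.jl has to be loaded explicitly to get `cut`), could look like this. The names `df_cut` and `counts` are only illustrative.

```julia
using DataFrames, CategoricalArrays

# Cut 100,000 standard normal draws into 10 quantile-based groups.
df_cut = DataFrame(x = cut(randn(100_000), 10))

# Count rows per group; sort=true orders the groups by level order.
counts = combine(groupby(df_cut, :x, sort=true), nrow => :n)
```

Each of the 10 groups should hold roughly the same number of rows, which is what the original cell checks.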
In [7]:
v = categorical([1,2,2,3,3]) # contains integers not strings
Out[7]:
In [8]:
Vector{Union{String, Missing}}(z) # sometimes you need to convert back to a standard vector
Out[8]:
Managing levels
In [9]:
arr = [x,y,z,c,v]
Out[9]:
In [10]:
isordered.(arr) # check if each categorical array is ordered
Out[10]:
In [11]:
ordered!(x, true), isordered(x) # make x ordered
Out[11]:
In [12]:
ordered!(x, false), isordered(x) # and unordered again
Out[12]:
In [13]:
levels.(arr) # list levels
Out[13]:
In [14]:
unique.(arr) # missing will be included
Out[14]:
In [15]:
y[1] < y[2] # can compare as y is ordered
Out[15]:
In [16]:
v[1] < v[2] # not comparable, v is unordered although it contains integers
In [17]:
levels!(y, ["C", "B", "A"]) # you can reorder levels, mostly useful for ordered CategoricalArrays
Out[17]:
In [18]:
y[1] < y[2] # observe that the order is changed
Out[18]:
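Since the cell outputs are not shown here, the effect of reordering levels on `<` can be sketched on a fresh array. The names `yy`, `before` and `after` are hypothetical and only keep this example from touching `y` above.

```julia
using CategoricalArrays

yy = categorical(["A", "B", "B", "C"], ordered=true)

# With the default level order A < B < C, "A" sorts before "B".
before = yy[1] < yy[2]

# Reverse the level order; the data is unchanged, only the ordering of levels.
levels!(yy, ["C", "B", "A"])

# Now "A" is the last level, so the same comparison flips.
after = yy[1] < yy[2]
```

Here `before` is `true` and `after` is `false`: the comparison follows the order of the levels, not the order of the underlying strings.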
In [19]:
levels!(z, ["A", "B"]) # you have to specify all levels that are present
In [20]:
levels!(z, ["A", "B"], allow_missing=true) # unless the underlying array allows missing values and you force removal of levels with allow_missing=true
Out[20]:
In [21]:
z[1] = "B"
z # now the only non-missing entries in z are "B"
Out[21]:
In [22]:
levels(z) # but it remembers the levels it had (the reason is mostly performance)
Out[22]:
In [23]:
droplevels!(z) # this way we can clean it up
levels(z)
Out[23]:
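The same behaviour can be seen on an array without missing values. This is a small sketch with a hypothetical array `w`, showing that a level stays in the pool after its last occurrence is overwritten and goes away only after `droplevels!`.

```julia
using CategoricalArrays

w = categorical(["A", "B", "B", "C"])

# Overwrite the only "C"; the level "C" is now unused but still remembered.
w[4] = "A"
levels_before = copy(levels(w))

# Remove levels that no longer occur in the data.
droplevels!(w)
levels_after = copy(levels(w))
```

`levels_before` still lists `"C"`, while `levels_after` contains only `"A"` and `"B"`.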
Data manipulation
In [24]:
x, levels(x)
Out[24]:
In [25]:
x[2] = "0"
x, levels(x) # new level added at the end (works only for unordered)
Out[25]:
In [26]:
v, levels(v)
Out[26]:
In [27]:
v[1] + v[2] # even though underlying data is Int, we cannot operate on it
In [28]:
Vector{Int}(v) # you either have to retrieve the data by conversion (may be expensive)
Out[28]:
In [29]:
get(v[1]) + get(v[2]) # or get a single value
Out[29]:
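In newer CategoricalArrays.jl releases the function for retrieving the wrapped value is, as far as I know, `unwrap` rather than `get`. The sketch below assumes such a version (0.9 or later); `vv`, `s` and `raw` are illustrative names.

```julia
using CategoricalArrays

vv = categorical([1, 2, 2, 3, 3])

# Unwrap single elements to plain Ints before doing arithmetic.
s = unwrap(vv[1]) + unwrap(vv[2])

# Broadcast over the whole array to recover the raw values.
raw = unwrap.(vv)
```

Here `s` is `3` and `raw` holds the original integers.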
In [30]:
get.(v) # this will work for arrays without missings
Out[30]:
In [31]:
get.(z) # but will fail on missing values
In [32]:
Vector{Union{String, Missing}}(z) # you have to do the conversion
Out[32]:
In [33]:
z[1]*z[2], z.^2 # the only exception is CategoricalArrays based on String - you can operate on them normally
Out[33]:
In [34]:
recode([1,2,3,4,5,missing], 1=>10) # recode some values in an array; an in-place recode! equivalent also exists
Out[34]:
In [35]:
recode([1,2,3,4,5,missing], "a", 1=>10, 2=>20) # here we provide a default value for values not covered by any mapping
Out[35]:
In [36]:
recode([1,2,3,4,5,missing], 1=>10, missing=>"missing") # missing is recoded only if you map it explicitly
Out[36]:
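Because the outputs of the `recode` cells are not shown, here is a short sketch of what two of them evaluate to. The variable names are hypothetical; `isequal` is used in the checks because `==` involving `missing` returns `missing`.

```julia
using CategoricalArrays

data = [1, 2, 3, 4, 5, missing]

# Only the value 1 is mapped; everything else, including missing, is kept.
r1 = recode(data, 1 => 10)

# missing is replaced only because it is mapped explicitly.
r2 = recode(data, 1 => 10, missing => "missing")

same1 = isequal(r1, [10, 2, 3, 4, 5, missing])
same2 = isequal(r2, [10, 2, 3, 4, 5, "missing"])
```

Both `same1` and `same2` are `true`.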
In [37]:
t = categorical([1:5; missing])
t, levels(t)
Out[37]:
In [38]:
recode!(t, [1,3]=>2)
t, levels(t) # note that the levels are dropped after recode
Out[38]:
In [39]:
t = categorical([1,2,3], ordered=true)
levels(recode(t, 2=>0, 1=>-1)) # and if you introduce new levels, they are added at the end in the order of appearance
Out[39]:
In [40]:
t = categorical([1,2,3,4,5], ordered=true) # when using default it becomes the last level
levels(recode(t, 300, [1,2]=>100, 3=>200))
Out[40]:
Comparisons
In [41]:
x = categorical([1,2,3])
Out[41]:
In [42]:
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
Out[42]:
In [43]:
levels!(xs[2], [3,2,1])
Out[43]:
In [44]:
levels!(xs[4], [2,3,1])
Out[44]:
In [45]:
[a == b for a in xs, b in xs] # all are equal - comparison only by contents
Out[45]:
In [46]:
signature(x::CategoricalArray) = (x, levels(x), isordered(x)) # this is actually the full signature of CategoricalArray
Out[46]:
In [47]:
signature(xs[1])
Out[47]:
In [48]:
signature(xs[2])
Out[48]:
In [49]:
signature(xs[3])
Out[49]:
In [50]:
signature(xs[4])
Out[50]:
In [51]:
# all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels
[signature(a) == signature(b) for a in xs, b in xs]
Out[51]:
In [52]:
x[1] < x[2] # you cannot compare elements of unordered CategoricalArray
In [53]:
t[1] < t[2] # but you can do it for an ordered one
Out[53]:
In [54]:
isless(x[1], x[2]) # isless works within the same CategoricalArray even if it is not ordered
Out[54]:
In [55]:
y = deepcopy(x) # but not across categorical arrays
isless(x[1], y[2])
In [56]:
isless(get(x[1]), get(y[2])) # you can use get to make a comparison of the contents of CategoricalArray
Out[56]:
In [57]:
x[1] == y[2] # equality tests work OK across CategoricalArrays
Out[57]:
Categorical columns in a DataFrame
In [58]:
df = DataFrame(x = 1:3, y = 'a':'c', z = ["a","b","c"])
Out[58]:
In [59]:
categorical!(df) # converts all columns whose eltype is a subtype of AbstractString to categorical
Out[59]:
In [60]:
showcols(df)
Out[60]:
In [61]:
categorical!(df, :x) # manually convert to categorical column :x
Out[61]:
In [62]:
showcols(df)
Out[62]:
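`categorical!` and `showcols` are not part of recent DataFrames.jl releases. A rough equivalent of the cells above, assuming DataFrames.jl 1.x with CategoricalArrays.jl loaded explicitly, might look like the following. The name `df2` is only illustrative.

```julia
using DataFrames, CategoricalArrays

df2 = DataFrame(x = 1:3, y = 'a':'c', z = ["a", "b", "c"])

# Convert every column whose eltype is a subtype of AbstractString, keeping the names.
transform!(df2, names(df2, AbstractString) .=> categorical, renamecols=false)

# Convert a single column by hand.
df2.x = categorical(df2.x)

# Inspect the column element types (in place of showcols).
coltypes = eltype.(eachcol(df2))
```

After this, `:x` and `:z` are categorical columns, while the `Char` column `:y` is left as it was.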