Introduction to DataFrames¶
Bogumił Kamiński, May 13, 2018
Reference¶
Series¶
- https://deepstat.tistory.com/69 (01. constructors)(in English)
- https://deepstat.tistory.com/70 (01. constructors)(in Korean)
- https://deepstat.tistory.com/71 (02. basicinfo)(in English)
- https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
- https://deepstat.tistory.com/73 (03. missingvalues)(in English)
- https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
- https://deepstat.tistory.com/75 (04. loadsave)(in English)
- https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
- https://deepstat.tistory.com/77 (05. columns)(in English)
- https://deepstat.tistory.com/78 (05. columns)(in Korean)
- https://deepstat.tistory.com/79 (06. rows)(in English)
- https://deepstat.tistory.com/80 (06. rows)(in Korean)
- https://deepstat.tistory.com/81 (07. factors)(in English)
- https://deepstat.tistory.com/82 (07. factors)(in Korean)
- https://deepstat.tistory.com/83 (08. joins)(in English)
- https://deepstat.tistory.com/84 (08. joins)(in Korean)
- https://deepstat.tistory.com/85 (09. reshaping)(in English)
- https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
- https://deepstat.tistory.com/87 (10. transforms)(in English)
- https://deepstat.tistory.com/88 (10. transforms)(in Korean)
- https://deepstat.tistory.com/89 (11. performance)(in English)
- https://deepstat.tistory.com/90 (11. performance)(in Korean)
- https://deepstat.tistory.com/91 (12. pitfalls)(in English)
- https://deepstat.tistory.com/92 (12. pitfalls)(in Korean)
- https://deepstat.tistory.com/93 (13. extras)(in English)
- https://deepstat.tistory.com/94 (13. extras)(in Korean)
In [1]:
using DataFrames
using Statistics
Extras - selected functionalities of selected packages¶
FreqTables: creating cross tabulations¶
In [2]:
using FreqTables
df = DataFrame(a=rand('a':'d', 1000), b=rand(["x", "y", "z"], 1000))
ft = freqtable(df, :a, :b) # observe that dimensions are sorted if possible
Out[2]:
In [3]:
ft[1,1], ft['b', "z"] # you can index the result using numbers or names
Out[3]:
In [4]:
prop(ft, 1) # getting proportions - 1 means we want to calculate them in rows (first dimension)
Out[4]:
In [5]:
prop(ft, 2) # and columns are normalized to 1.0 now
Out[5]:
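To make the two margins concrete, here is a minimal sketch in plain Julia of what row-wise and column-wise proportions compute. The count matrix is made up for illustration, and an ordinary array stands in for the named table that freqtable returns; this is not how FreqTables implements prop.

```julia
# Hypothetical 2x3 count matrix standing in for a frequency table
counts = [10 20 30;
           5  5 10]

# Proportions within rows: divide every row by its row total
row_props = counts ./ sum(counts, dims=2)

# Proportions within columns: divide every column by its column total
col_props = counts ./ sum(counts, dims=1)

println(sum(row_props, dims=2)) # every row total is 1.0
println(sum(col_props, dims=1)) # every column total is 1.0
```

Under this reading, the first form corresponds to prop(ft, 1) and the second to prop(ft, 2).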
In [6]:
x = categorical(rand(1:3, 10))
levels!(x, [3, 1, 2, 4]) # reordering levels and adding an extra level
freqtable(x) # order is preserved and not-used level is shown
Out[6]:
In [7]:
freqtable([1,1,2,3,missing]) # by default missings are listed
Out[7]:
In [8]:
freqtable([1,1,2,3,missing], skipmissing=true) # but we can skip them
Out[8]:
DataFramesMeta - working on DataFrame¶
In [9]:
using DataFramesMeta
df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4))
Out[9]:
In [10]:
@with(df, :x+:z) # expressions with columns of DataFrame
Out[10]:
In [11]:
@with df begin # you can define code blocks
    a = :x[:z]
    b = :x[.!:z]
    :y + [a; b]
end
Out[11]:
In [12]:
a # @with creates a hard scope, so variables do not leak out
In [13]:
df2 = DataFrame(a = [:a, :b, :c])
@with(df2, :a .== ^(:a)) # sometimes we want to work on raw Symbol, ^() escapes it
Out[13]:
In [14]:
df2 = DataFrame(x=1:3, y=4:6, z=7:9)
@with(df2, _I_(2:3)) # _I_(expression) is translated to df2[expression]
Out[14]:
In [15]:
@where(df, :x .< 4, :z .== true) # very useful macro for filtering
Out[15]:
In [16]:
@select(df, :x, y = 2*:x, z=:y) # create a new DataFrame based on the old one
Out[16]:
In [17]:
@transform(df, a=1, x = 2*:x, y=:x) # create a new DataFrame adding columns based on the old one
Out[17]:
In [18]:
@transform(df, a=1, b=:a) # old DataFrame is used and :a is not present there
In [19]:
@orderby(df, :z, -:x) # sorting into a new data frame, less powerful than sort, but lightweight
Out[19]:
In [20]:
@linq df |> # chaining of operations on DataFrame
    where(:x .< 5) |>
    orderby(:z) |>
    transform(x²=:x.^2) |>
    select(:z, :x, :x²)
Out[20]:
In [21]:
f(df, col) = df[col] # you can define your own functions and put them in the chain
@linq df |> where(:x .<= 4) |> f(:x)
Out[21]:
DataFramesMeta - working on grouped DataFrame¶
In [22]:
df = DataFrame(a = 1:12, b = repeat('a':'d', outer=3))
g = groupby(df, :b)
Out[22]:
In [23]:
@by(df, :b, first=first(:a), last=last(:a), mean=mean(:a)) # more convenient than by from DataFrames
Out[23]:
In [24]:
@based_on(g, first=first(:a), last=last(:a), mean=mean(:a)) # the same as @by but on a grouped DataFrame
Out[24]:
In [25]:
@where(g, mean(:a) > 6.5) # filter groups on aggregate conditions
Out[25]:
In [26]:
@orderby(g, -sum(:a)) # order groups on aggregate conditions
Out[26]:
In [27]:
@transform(g, center = mean(:a), centered = :a .- mean(:a)) # perform operations within a group and return an ungrouped DataFrame
Out[27]:
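The within-group centering above can be spelled out by hand. The following is a rough sketch in plain Julia using the same a and b values as vectors; the loop and the names center and centered are illustrative only and are not what DataFramesMeta does internally. Unlike the grouped @transform, this version keeps the rows in their original order.

```julia
using Statistics

a = collect(1:12)
b = repeat('a':'d', outer=3)

# Group mean for every row, filled in one group at a time
center = similar(a, Float64)
for key in unique(b)
    idx = findall(v -> v == key, b)  # rows belonging to this group
    center[idx] .= mean(a[idx])      # broadcast the group mean to its rows
end

# Deviation of every value from the mean of its own group
centered = a .- center
```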
In [28]:
DataFrame(g) # a nice convenience function not defined in DataFrames
Out[28]:
In [29]:
@transform(g) # actually this is the same
Out[29]:
In [30]:
@linq df |> groupby(:b) |> where(mean(:a) > 6.5) |> DataFrame # you can do chaining on grouped DataFrames as well
Out[30]:
DataFramesMeta - rowwise operations on DataFrame¶
In [31]:
df = DataFrame(a = 1:12, b = repeat(1:4, outer=3))
Out[31]:
In [32]:
# such conditions are often needed but are complex to write
@transform(df, x = ifelse.((:a .> 6) .& (:b .== 4), "yes", "no"))
Out[32]:
In [33]:
# one option is to use a function that works on a single observation and broadcast it
myfun(a, b) = a > 6 && b == 4 ? "yes" : "no"
@transform(df, x = myfun.(:a, :b))
Out[33]:
In [34]:
# or you can use @byrow! macro that allows you to process DataFrame rowwise
@byrow! df begin
    @newcol x::Vector{String}
    :x = :a > 6 && :b == 4 ? "yes" : "no"
end
Out[34]:
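For comparison, the row-wise logic can be written as an explicit loop over plain vectors. This is only a sketch of the idea, assuming the same a and b values as above; it does not reproduce how @byrow! is expanded.

```julia
a = 1:12
b = repeat(1:4, outer=3)

# Preallocate the new column, then fill it one observation at a time
x = Vector{String}(undef, length(a))
for i in eachindex(a, b)
    x[i] = a[i] > 6 && b[i] == 4 ? "yes" : "no"
end
```

Because the loop handles one row at a time, the scalar && and == operators can be used directly, with no broadcasting dots.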