Introduction to DataFrames

Bogumił Kamiński, May 13, 2018

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(in Korean)
https://deepstat.tistory.com/83 (08. joins)(in English)
https://deepstat.tistory.com/84 (08. joins)(in Korean)
https://deepstat.tistory.com/85 (09. reshaping)(in English)
https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
https://deepstat.tistory.com/87 (10. transforms)(in English)
https://deepstat.tistory.com/88 (10. transforms)(in Korean)
https://deepstat.tistory.com/89 (11. performance)(in English)
https://deepstat.tistory.com/90 (11. performance)(in Korean)
https://deepstat.tistory.com/91 (12. pitfalls)(in English)
https://deepstat.tistory.com/92 (12. pitfalls)(in Korean)
https://deepstat.tistory.com/93 (13. extras)(in English)
https://deepstat.tistory.com/94 (13. extras)(in Korean)

using DataFrames
using Statistics

Extras - selected functionalities of selected packages

FreqTables: creating cross tabulations

using FreqTables
df = DataFrame(a=rand('a':'d', 1000), b=rand(["x", "y", "z"], 1000))
ft = freqtable(df, :a, :b) # observe that dimensions are sorted if possible

4×3 Named Array{Int64,2}
a ╲ b │   x    y    z
──────┼──────────────
'a'   │  93   76  105
'b'   │  70   73   76
'c'   │  84   82   91
'd'   │  79   88   83

ft[1,1], ft['b', "z"] # you can index the result using numbers or names

(93, 76)

prop(ft, 1) # proportions along margin 1 (the first dimension): each row sums to 1.0

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.339416  0.277372  0.383212
'b'   │ 0.319635  0.333333  0.347032
'c'   │ 0.326848  0.319066  0.354086
'd'   │    0.316     0.352     0.332

prop(ft, 2) # with margin 2 each column sums to 1.0 instead

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.285276  0.238245  0.295775
'b'   │ 0.214724   0.22884  0.214085
'c'   │ 0.257669  0.257053  0.256338
'd'   │ 0.242331  0.275862  0.233803
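To make the two margins concrete, the normalization can be reproduced with plain arrays. This is a minimal sketch that assumes prop simply divides each count by its row or column total, which is consistent with the printed outputs above. The counts matrix is copied from the ft output; the names counts, row_props and col_props are illustrative and not part of FreqTables.

```julia
# Counts copied from the freqtable output above (rows 'a'..'d', columns x, y, z).
counts = [93 76 105;
          70 73 76;
          84 82 91;
          79 88 83]

# Margin 1: divide every entry by the total of its row.
row_props = counts ./ sum(counts, dims=2)

# Margin 2: divide every entry by the total of its column.
col_props = counts ./ sum(counts, dims=1)

println(round.(row_props, digits=6))
println(round.(col_props, digits=6))
```

The first row of row_props starts with 93/274 and the first column of col_props starts with 93/326, which correspond to the top-left entries of the two prop tables.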

x = categorical(rand(1:3, 10))
levels!(x, [3, 1, 2, 4]) # reordering levels and adding an extra level
freqtable(x) # the level order is preserved and the unused level is shown with a count of 0

4-element Named Array{Int64,1}
Dim1  │ 
──────┼──
3     │ 5
1     │ 3
2     │ 2
4     │ 0

freqtable([1,1,2,3,missing]) # by default missings are listed

4-element Named Array{Int64,1}
Dim1    │ 
────────┼──
1       │ 2
2       │ 1
3       │ 1
missing │ 1

freqtable([1,1,2,3,missing], skipmissing=true) # but we can skip them

3-element Named Array{Int64,1}
Dim1  │ 
──────┼──
1     │ 2
2     │ 1
3     │ 1
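The effect of skipmissing can also be seen in a hand-written count. The sketch below is only an illustration of the idea in base Julia, not how FreqTables is implemented; count_values and its skip keyword are hypothetical names introduced here.

```julia
# Count how often each value occurs; optionally drop missing values first.
function count_values(v; skip=false)
    counts = Dict{Any,Int}()
    # skipmissing(v) iterates over v without the missing entries
    for x in (skip ? skipmissing(v) : v)
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

data = [1, 1, 2, 3, missing]
with_missing = count_values(data)               # missing is counted as its own key
without_missing = count_values(data, skip=true) # missing is dropped before counting
```

Dictionary keys are compared with isequal, so missing can serve as a key even though missing == missing does not return true.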

DataFramesMeta - working on `DataFrame`

using DataFramesMeta
df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4))

@with(df, :x+:z) # expressions with columns of DataFrame

8-element Array{Int64,1}:
 2
 2
 4
 4
 6
 6
 8
 8

@with df begin # you can define code blocks
    a = :x[:z]
    b = :x[.!:z]
    :y + [a; b]
end

8-element Array{Char,1}:
 'b'
 'e'
 'h'
 'k'
 'g'
 'j'
 'm'
 'p'

a # @with creates a hard scope, so variables defined inside it do not leak out

UndefVarError: a not defined

Stacktrace:
 [1] top-level scope at In[12]:1

df2 = DataFrame(a = [:a, :b, :c])
@with(df2, :a .== ^(:a)) # sometimes we want to work with a raw Symbol; ^() escapes it

3-element BitArray{1}:
  true
 false
 false

df2 = DataFrame(x=1:3, y=4:6, z=7:9)
@with(df2, _I_(2:3)) # _I_(expression) is translated to df2[expression]

┌ Warning: _I_() for escaping variables is deprecated, use cols() instead
└ @ DataFramesMeta /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:35

@where(df, :x .< 4, :z .== true) # very useful macro for filtering

@select(df, :x, y = 2*:x, z=:y) # create a new DataFrame based on the old one

@transform(df, a=1, x = 2*:x, y=:x) # create a new DataFrame adding columns based on the old one

@transform(df, a=1, b=:a) # all expressions are evaluated against the old DataFrame, which has no :a column, so this fails

KeyError: key :a not found

Stacktrace:
 [1] (::getfield(Main, Symbol("##16#18")))(::DataFrame) at ./dict.jl:478
 [2] #transform#9(::Base.Iterators.Pairs{Symbol,Function,Tuple{Symbol,Symbol},NamedTuple{(:a, :b),Tuple{getfield(Main, Symbol("##15#17")),getfield(Main, Symbol("##16#18"))}}}, ::Function, ::DataFrame) at /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:384
 [3] (::getfield(DataFramesMeta, Symbol("#kw##transform")))(::NamedTuple{(:a, :b),Tuple{getfield(Main, Symbol("##15#17")),getfield(Main, Symbol("##16#18"))}}, ::typeof(DataFramesMeta.transform), ::DataFrame) at ./none:0
 [4] top-level scope at /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:406
 [5] top-level scope at In[18]:1
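One way around the error is to create the new column in one step and refer to it in a second step. The sketch below shows the idea with plain DataFrames.jl functions rather than the DataFramesMeta macros; it assumes a DataFrames.jl 1.x release in which transform and ByRow are available, and the names step1 and step2 are illustrative.

```julia
using DataFrames

df = DataFrame(x=1:8, y='a':'h', z=repeat([true, false], outer=4))

# Step 1: add the constant column :a; df itself is left unchanged.
step1 = transform(df, :x => ByRow(_ -> 1) => :a)

# Step 2: :a now exists in the source table, so it can be copied into :b.
step2 = transform(step1, :a => :b)
```

Because each call only sees the columns of the table passed to it, splitting the work into two calls makes :a visible to the second one.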

@orderby(df, :z, -:x) # sorting into a new data frame, less powerful than sort, but lightweight

@linq df |> # chaining of operations on DataFrame
    where(:x .< 5) |>
    orderby(:z) |>
    transform(x²=:x.^2) |>
    select(:z, :x, :x²)

f(df, col) = df[col] # you can define your own functions and put them in the chain
@linq df |> where(:x .<= 4) |> f(:x)

4-element Array{Int64,1}:
 1
 2
 3
 4

DataFramesMeta - working on grouped `DataFrame`

df = DataFrame(a = 1:12, b = repeat('a':'d', outer=3))
g = groupby(df, :b)

GroupedDataFrame with 4 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 1     │ 'a'  │
│ 2   │ 5     │ 'a'  │
│ 3   │ 9     │ 'a'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │

@by(df, :b, first=first(:a), last=last(:a), mean=mean(:a)) # more convenient than by from DataFrames

@based_on(g, first=first(:a), last=last(:a), mean=mean(:a)) # the same as @by, but applied to an already grouped DataFrame

@where(g, mean(:a) > 6.5) # filter groups on aggregate conditions

GroupedDataFrame with 2 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 3     │ 'c'  │
│ 2   │ 7     │ 'c'  │
│ 3   │ 11    │ 'c'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │

@orderby(g, -sum(:a)) # order groups on aggregate conditions

GroupedDataFrame with 4 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 1     │ 'a'  │
│ 2   │ 5     │ 'a'  │
│ 3   │ 9     │ 'a'  │

@transform(g, center = mean(:a), centered = :a .- mean(:a)) # perform operations within each group and return an ungrouped DataFrame
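The same within-group centering can be written with plain DataFrames.jl functions. This is a hedged sketch that assumes a DataFrames.jl 1.x release, where transform accepts a GroupedDataFrame; res is an illustrative name.

```julia
using DataFrames
using Statistics

df = DataFrame(a=1:12, b=repeat('a':'d', outer=3))
g = groupby(df, :b)

# mean returns one value per group, which is repeated for every row of the group;
# the anonymous function subtracts that group mean from each element.
res = transform(g,
                :a => mean => :center,
                :a => (a -> a .- mean(a)) => :centered)
```

Under that assumption the rows of res keep the order of the parent table df instead of being arranged group by group, so the same values appear in a different row order than in a group-ordered listing.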

DataFrame(g) # a nice convenience function not defined in DataFrames

@transform(g) # with no new columns this gives the same result

@linq df |> groupby(:b) |> where(mean(:a) > 6.5) |> DataFrame # you can do chaining on grouped DataFrames as well

DataFramesMeta - rowwise operations on `DataFrame`

df = DataFrame(a = 1:12, b = repeat(1:4, outer=3))

# such conditions are often needed but are complex to write
@transform(df, x = ifelse.((:a .> 6) .& (:b .== 4), "yes", "no"))

# one option is to use a function that works on a single observation and broadcast it
myfun(a, b) = a > 6 && b == 4 ? "yes" : "no"
@transform(df, x = myfun.(:a, :b))

# or you can use the @byrow! macro, which lets you process a DataFrame row by row
@byrow! df begin
    @newcol x::Vector{String}
    :x = :a > 6 && :b == 4 ? "yes" : "no"
end
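For comparison, the row-by-row condition can also be expressed without macros. The sketch below assumes a DataFrames.jl 1.x release that provides the ByRow wrapper; label is a hypothetical helper playing the same role as myfun above.

```julia
using DataFrames

df = DataFrame(a=1:12, b=repeat(1:4, outer=3))

# A scalar function applied to one observation at a time.
label(a, b) = a > 6 && b == 4 ? "yes" : "no"

# ByRow lifts the scalar function so that it is called once per row
# with the values of columns :a and :b.
res = transform(df, [:a, :b] => ByRow(label) => :x)
```

Only the rows where a is greater than 6 and b equals 4 (rows 8 and 12 for this data) get the value "yes".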

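Printed df from the section on working on `DataFrame` (x=1:8, y='a':'h', z alternating true and false):

│ Row │ x     │ y    │ z     │
│     │ Int64 │ Char │ Bool  │
├─────┼───────┼──────┼───────┤
│ 1   │ 1     │ 'a'  │ true  │
│ 2   │ 2     │ 'b'  │ false │
│ 3   │ 3     │ 'c'  │ true  │
│ 4   │ 4     │ 'd'  │ false │
│ 5   │ 5     │ 'e'  │ true  │
│ 6   │ 6     │ 'f'  │ false │
│ 7   │ 7     │ 'g'  │ true  │
│ 8   │ 8     │ 'h'  │ false │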

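Result of @transform(g, center = mean(:a), centered = :a .- mean(:a)) from the section on grouped `DataFrame`:

│ Row │ a     │ b    │ center  │ centered │
│     │ Int64 │ Char │ Float64 │ Float64  │
├─────┼───────┼──────┼─────────┼──────────┤
│ 1   │ 1     │ 'a'  │ 5.0     │ -4.0     │
│ 2   │ 5     │ 'a'  │ 5.0     │ 0.0      │
│ 3   │ 9     │ 'a'  │ 5.0     │ 4.0      │
│ 4   │ 2     │ 'b'  │ 6.0     │ -4.0     │
│ 5   │ 6     │ 'b'  │ 6.0     │ 0.0      │
│ 6   │ 10    │ 'b'  │ 6.0     │ 4.0      │
│ 7   │ 3     │ 'c'  │ 7.0     │ -4.0     │
│ 8   │ 7     │ 'c'  │ 7.0     │ 0.0      │
│ 9   │ 11    │ 'c'  │ 7.0     │ 4.0      │
│ 10  │ 4     │ 'd'  │ 8.0     │ -4.0     │
│ 11  │ 8     │ 'd'  │ 8.0     │ 0.0      │
│ 12  │ 12    │ 'd'  │ 8.0     │ 4.0      │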

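Result of the rowwise examples (column x is "yes" only where a > 6 and b == 4):

│ Row │ a     │ b     │ x      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 1     │ no     │
│ 2   │ 2     │ 2     │ no     │
│ 3   │ 3     │ 3     │ no     │
│ 4   │ 4     │ 4     │ no     │
│ 5   │ 5     │ 1     │ no     │
│ 6   │ 6     │ 2     │ no     │
│ 7   │ 7     │ 3     │ no     │
│ 8   │ 8     │ 4     │ yes    │
│ 9   │ 9     │ 1     │ no     │
│ 10  │ 10    │ 2     │ no     │
│ 11  │ 11    │ 3     │ no     │
│ 12  │ 12    │ 4     │ yes    │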
