Introduction to DataFrames¶

Bogumił Kamiński, Apr 21, 2018

Reference¶

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series¶

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(한글)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(한글)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(한글)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(한글)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(한글)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(한글)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(한글)
https://deepstat.tistory.com/83 (08. joins)(in English)
https://deepstat.tistory.com/84 (08. joins)(한글)
https://deepstat.tistory.com/85 (09. reshaping)(in English)
https://deepstat.tistory.com/86 (09. reshaping)(한글)
https://deepstat.tistory.com/87 (10. transforms)(in English)
https://deepstat.tistory.com/88 (10. transforms)(한글)

using DataFrames # load package

Split-apply-combine¶

x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

gx1 = groupby(x, :id)

GroupedDataFrame with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.398172 │
│ 2   │ 1     │ 1     │ 0.899705 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.241164 │
│ 2   │ 4     │ 2     │ 0.745005 │

gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: :id, :id2
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.398172 │
│ 2   │ 1     │ 1     │ 0.899705 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.241164 │
│ 2   │ 4     │ 2     │ 0.745005 │

vcat(gx2...) # back to the original DataFrame

x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

show(groupby(x, :id), allgroups=true) # by default groups include mising values and are not sorted

GroupedDataFrame with 4 groups based on key: :id
Group 1: 2 rows
│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 4: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │

show(groupby(x, :id, sort=true, skipmissing=true), allgroups=true) # but we can change it :)

GroupedDataFrame with 3 groups based on key: :id
Group 1: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │

x = DataFrame(id=rand('a':'d', 100), v=rand(100));
using Statistics
by(x, :id, y->mean(y[:v])) # apply a function to each group of a data frame

by(x, :id, y->mean(y[:v]), sort=true) # we can sort the output

by(x, :id, y->DataFrame(res=mean(y[:v]))) # this way we can set a name for a column - DataFramesMeta @by is better

x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

aggregate(x, :id, sum, sort=true) # also can be sorted

We omit the discussion of of map/combine as I do not find them very useful (better to use by)

x = DataFrame(rand(3, 5))

map(mean, eachcol(x)) # map a function over each column and return a data frame

foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # a raw iteration returns a tuple with column name and values

x1: 0.6409819483419316
x2: 0.3691902706497387
x3: 0.10896519700388281
x4: 0.3754821255946908
x5: 0.6423361106326183

colwise(mean, x) # colwise is similar, but produces a vector

5-element Array{Float64,1}:
 0.6409819483419316 
 0.3691902706497387 
 0.10896519700388281
 0.3754821255946908 
 0.6423361106326183

x[:id] = [1,1,2]
colwise(mean,groupby(x, :id)) # and works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.570874, 0.542953, 0.160866, 0.458594, 0.633507, 1.0]   
 [0.781198, 0.0216644, 0.00516414, 0.209259, 0.659995, 2.0]

map(r -> r[:x1]/r[:x2], eachrow(x)) # now the returned value is DataFrameRow which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
  0.2588859485275056
  1.9129355971426416
 36.05911887812408

	id	x1
	Char	Float64
1	'b'	0.451979
2	'a'	0.396662
3	'c'	0.471388
4	'd'	0.335608

	id	x1
	Char	Float64
1	'a'	0.396662
2	'b'	0.451979
3	'c'	0.471388
4	'd'	0.335608

	id	res
	Char	Float64
1	'b'	0.451979
2	'a'	0.396662
3	'c'	0.471388
4	'd'	0.335608

	id	x1_sum	x2_sum
	Char	Float64	Float64
1	'c'	9.56125	6.85763
2	'b'	13.1491	13.6665
3	'd'	13.9508	10.7499
4	'a'	12.0957	15.2495

	id	x1_sum	x2_sum
	Char	Float64	Float64
1	'a'	12.0957	15.2495
2	'b'	13.1491	13.6665
3	'c'	9.56125	6.85763
4	'd'	13.9508	10.7499

DeepStat

티스토리 뷰

10. transforms

Introduction to DataFrames¶

Reference¶

Series¶

Split-apply-combine¶

'Flux in Julia > Learning Julia (Intro_to_Julia_DFs)' 카테고리의 다른 글

티스토리툴바

	id	id2	v
	Int64	Int64	Float64
1	1	1	0.398172
2	2	2	0.688774
3	3	1	0.0905459
4	4	2	0.241164
5	1	1	0.899705
6	2	2	0.329597
7	3	1	0.470818
8	4	2	0.745005

11. performance (0)	2018.10.18
10. transforms (한글) (0)	2018.10.16
09. reshaping(한글) (0)	2018.10.15
09. reshaping (0)	2018.10.15
08. joins (한글) (0)	2018.10.14

	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.146424	0.565594	0.3102	0.153304	0.321919
2	0.995324	0.520312	0.011531	0.763883	0.945094
3	0.781198	0.0216644	0.00516414	0.209259	0.659995

	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.640982	0.36919	0.108965	0.375482	0.642336

« 2025/05 »
일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31