Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Source

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

See also

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(in Korean)
https://deepstat.tistory.com/83 (08. joins)(in English)
https://deepstat.tistory.com/84 (08. joins)(in Korean)
https://deepstat.tistory.com/85 (09. reshaping)(in English)
https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
https://deepstat.tistory.com/87 (10. transforms)(in English)
https://deepstat.tistory.com/88 (10. transforms)(in Korean)

using DataFrames # load package

Split-apply-combine

x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

gx1 = groupby(x, :id)

GroupedDataFrame with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.853822 │
│ 2   │ 1     │ 1     │ 0.624787 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.702739 │
│ 2   │ 4     │ 2     │ 0.393803 │

gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: :id, :id2
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.853822 │
│ 2   │ 1     │ 1     │ 0.624787 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.702739 │
│ 2   │ 4     │ 2     │ 0.393803 │

vcat(gx2...) # stack the groups back into a single data frame
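
As a rough illustration of the round trip above, the sketch below groups a small data frame, inspects the groups, and stacks them back together. It assumes a DataFrames.jl version in which `groupby` accepts the `sort` keyword (as used later in this post). The fixed `v` values stand in for `rand(8)` only so that the result is reproducible.

```julia
using DataFrames

# Same layout as the example above, but with fixed values instead of rand(8)
x = DataFrame(id = [1, 2, 3, 4, 1, 2, 3, 4],
              id2 = [1, 2, 1, 2, 1, 2, 1, 2],
              v = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

gx = groupby(x, [:id, :id2], sort = true)  # sort = true fixes the group order

n_groups = length(gx)   # number of distinct (id, id2) combinations
first_group = gx[1]     # each group is a view into the rows of x

# Splatting the groups into vcat gives one data frame again;
# the rows come back in group order, not in the original row order.
back = vcat(gx...)
```

Note that `back` holds the same rows as `x`, but ordered by group, so it is not row-for-row identical to the original.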

x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

show(groupby(x, :id), allgroups=true) # by default, groups include missing values and are not sorted

GroupedDataFrame with 4 groups based on key: :id
Group 1: 2 rows
│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 4: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │

show(groupby(x, :id, sort=true, skipmissing=true), allgroups=true) # but both defaults can be changed

GroupedDataFrame with 3 groups based on key: :id
Group 1: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │
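
The effect of the two keywords can also be checked programmatically. The following is a minimal sketch, assuming the `sort` and `skipmissing` keywords behave as in the output above; the variable names are chosen only for illustration.

```julia
using DataFrames

x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

g_default = groupby(x, :id)                                  # keeps the missing group
g_clean = groupby(x, :id, sort = true, skipmissing = true)   # drops it and sorts by key

n_default = length(g_default)
n_clean = length(g_clean)

# Key value of each remaining group, in group order
group_keys = [first(g.id) for g in g_clean]
```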

x = DataFrame(id=rand('a':'d', 100), v=rand(100));
using Statistics
by(x, :id, y->mean(y[:v])) # apply a function to each group of the data frame

by(x, :id, y->mean(y[:v]), sort=true) # the result can be sorted

by(x, :id, y->DataFrame(res=mean(y[:v]))) # returning a DataFrame lets you name the result column

x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function to every column, within groups given by :id

aggregate(x, :id, sum, sort=true) # this result can be sorted too

The author of the original tutorial chose not to discuss map and combine in detail, because he did not find them very useful (by is said to be the better choice).
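
The calls above follow the DataFrames.jl API of 2018. In later releases, `by` and `aggregate` were deprecated and then removed, and `combine` applied to a `groupby` result became the usual way to do the same thing. The sketch below shows approximate equivalents under that assumption (DataFrames.jl 1.x); the small fixed data set is made up for reproducibility and is not from the original tutorial.

```julia
using DataFrames, Statistics

# Hypothetical fixed data instead of rand(...) so the results are predictable
x = DataFrame(id = ['a', 'b', 'a', 'b'],
              x1 = [1.0, 2.0, 3.0, 4.0],
              x2 = [10.0, 20.0, 30.0, 40.0])

gx = groupby(x, :id, sort = true)

# Roughly by(x, :id, y -> DataFrame(res = mean(y[:x1])), sort = true)
res_mean = combine(gx, :x1 => mean => :res)

# Roughly aggregate(x, :id, sum, sort = true); result columns are x1_sum and x2_sum
res_sum = combine(gx, [:x1, :x2] .=> sum)
```

The `source => function => target` form names the output column explicitly, which covers the case where the older code returned a `DataFrame` just to rename the result.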

x = DataFrame(rand(3, 5))

map(mean, eachcol(x)) # map a function over each column; the result is a data frame

foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # plain iteration yields a tuple of column name and values

x1: 0.4650023050024165
x2: 0.5103055011163233
x3: 0.6476346358419589
x4: 0.7630800031479401
x5: 0.27687920249388487

colwise(mean, x) # colwise also works column by column, but returns a vector

5-element Array{Float64,1}:
 0.4650023050024165 
 0.5103055011163233 
 0.6476346358419589 
 0.7630800031479401 
 0.27687920249388487

x[:id] = [1,1,2]
colwise(mean,groupby(x, :id)) # colwise also works on a GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.388712, 0.653694, 0.805337, 0.888662, 0.414097, 1.0] 
 [0.617582, 0.223529, 0.33223, 0.511916, 0.00244369, 2.0]

map(r -> r[:x1]/r[:x2], eachrow(x)) # this time the function is applied to each row of the data frame

3-element Array{Float64,1}:
 0.6306527976520862
 0.5412219085291118
 2.7628664958717355
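
For readers on a recent DataFrames.jl, the column-wise and row-wise calls above need small changes: as far as I know, `colwise` and the `x[:id]` indexing form are no longer available, `DataFrame(matrix)` needs `:auto` to generate column names, and `eachcol` iterates over columns only. The following is a sketch of approximate equivalents under those assumptions (DataFrames.jl 1.x), with a fixed matrix in place of `rand(3, 5)`.

```julia
using DataFrames, Statistics

# :auto generates the column names x1, x2
x = DataFrame([1.0 2.0; 3.0 4.0; 5.0 6.0], :auto)

# Column-wise: map over eachcol returns a plain vector (what colwise used to give)
col_means = map(mean, eachcol(x))

# Column names together with values: iterate over pairs(eachcol(x))
for (name, col) in pairs(eachcol(x))
    println(name, ": ", mean(col))
end

# Row-wise: each r is a DataFrameRow, with columns available as properties
ratios = map(r -> r.x1 / r.x2, eachrow(x))

# Per-group column means, in place of colwise(mean, groupby(x, :id))
x.id = [1, 1, 2]
group_means = combine(groupby(x, :id, sort = true), [:x1, :x2] .=> mean)
```

Unlike the `colwise` call on a grouped data frame shown above, this version does not average the `id` column itself, since only `x1` and `x2` are selected.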

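Output of by(x, :id, y->mean(y[:v])):
│ Row │ id   │ x1       │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'c'  │ 0.506402 │
│ 2   │ 'b'  │ 0.495104 │
│ 3   │ 'a'  │ 0.447717 │
│ 4   │ 'd'  │ 0.477769 │

Output of by(x, :id, y->mean(y[:v]), sort=true):
│ Row │ id   │ x1       │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'a'  │ 0.447717 │
│ 2   │ 'b'  │ 0.495104 │
│ 3   │ 'c'  │ 0.506402 │
│ 4   │ 'd'  │ 0.477769 │

Output of by(x, :id, y->DataFrame(res=mean(y[:v]))):
│ Row │ id   │ res      │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'c'  │ 0.506402 │
│ 2   │ 'b'  │ 0.495104 │
│ 3   │ 'a'  │ 0.447717 │
│ 4   │ 'd'  │ 0.477769 │

Output of aggregate(x, :id, sum):
│ Row │ id   │ x1_sum  │ x2_sum  │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1   │ 'b'  │ 9.68519 │ 10.1587 │
│ 2   │ 'c'  │ 14.4105 │ 15.8606 │
│ 3   │ 'a'  │ 8.01008 │ 6.03163 │
│ 4   │ 'd'  │ 15.9544 │ 13.8985 │

Output of aggregate(x, :id, sum, sort=true):
│ Row │ id   │ x1_sum  │ x2_sum  │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1   │ 'a'  │ 8.01008 │ 6.03163 │
│ 2   │ 'b'  │ 9.68519 │ 10.1587 │
│ 3   │ 'c'  │ 14.4105 │ 15.8606 │
│ 4   │ 'd'  │ 15.9544 │ 13.8985 │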


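Output of x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8)):
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.853822 │
│ 2   │ 2     │ 2     │ 0.428594 │
│ 3   │ 3     │ 1     │ 0.784733 │
│ 4   │ 4     │ 2     │ 0.702739 │
│ 5   │ 1     │ 1     │ 0.624787 │
│ 6   │ 2     │ 2     │ 0.43275  │
│ 7   │ 3     │ 1     │ 0.724575 │
│ 8   │ 4     │ 2     │ 0.393803 │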


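Output of x = DataFrame(rand(3, 5)):
│ Row │ x1       │ x2       │ x3       │ x4       │ x5         │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64    │
├─────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ 1   │ 0.492489 │ 0.780919 │ 0.883863 │ 0.929136 │ 0.582756   │
│ 2   │ 0.284936 │ 0.526468 │ 0.726811 │ 0.848188 │ 0.245438   │
│ 3   │ 0.617582 │ 0.223529 │ 0.33223  │ 0.511916 │ 0.00244369 │

Output of map(mean, eachcol(x)):
│ Row │ x1       │ x2       │ x3       │ x4       │ x5       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.465002 │ 0.510306 │ 0.647635 │ 0.76308  │ 0.276879 │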
