Introduction to DataFrames

Bogumił Kamiński, May 13, 2018

Source

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

See also

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)
https://deepstat.tistory.com/81 (07. factors)(in English)
https://deepstat.tistory.com/82 (07. factors)(in Korean)
https://deepstat.tistory.com/83 (08. joins)(in English)
https://deepstat.tistory.com/84 (08. joins)(in Korean)
https://deepstat.tistory.com/85 (09. reshaping)(in English)
https://deepstat.tistory.com/86 (09. reshaping)(in Korean)
https://deepstat.tistory.com/87 (10. transforms)(in English)
https://deepstat.tistory.com/88 (10. transforms)(in Korean)
https://deepstat.tistory.com/89 (11. performance)(in English)
https://deepstat.tistory.com/90 (11. performance)(in Korean)
https://deepstat.tistory.com/91 (12. pitfalls)(in English)
https://deepstat.tistory.com/92 (12. pitfalls)(in Korean)
https://deepstat.tistory.com/93 (13. extras)(in English)
https://deepstat.tistory.com/94 (13. extras)(in Korean)

using DataFrames
using Statistics

Extras - selected functionalities of selected packages

FreqTables: creating cross tabulations

using FreqTables
df = DataFrame(a=rand('a':'d', 1000), b=rand(["x", "y", "z"], 1000))
ft = freqtable(df, :a, :b) # note that the dimensions are sorted when possible

4×3 Named Array{Int64,2}
a ╲ b │  x   y   z
──────┼───────────
'a'   │ 86  83  91
'b'   │ 84  77  73
'c'   │ 75  71  97
'd'   │ 88  85  90

ft[1,1], ft['b', "z"] # you can index the result by position or by name

(86, 73)
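
To make the counting behind a two-way table concrete, here is a minimal plain-Julia sketch. It is not how FreqTables is implemented; the helper name `crosstab` and the small input vectors are hypothetical and chosen only for illustration.

```julia
# Plain-Julia sketch of a two-way frequency table.
# `crosstab` is a hypothetical helper, not part of FreqTables.
function crosstab(a, b)
    rows = sort(unique(a))   # sorted row labels
    cols = sort(unique(b))   # sorted column labels
    counts = zeros(Int, length(rows), length(cols))
    for (ai, bi) in zip(a, b)
        i = findfirst(r -> r == ai, rows)   # row position of this label
        j = findfirst(c -> c == bi, cols)   # column position of this label
        counts[i, j] += 1
    end
    return rows, cols, counts
end

# Hypothetical toy data (six observations)
a = ['a', 'b', 'a', 'c', 'b', 'a']
b = ["x", "y", "x", "z", "x", "y"]
rows, cols, counts = crosstab(a, b)
```

Every observation lands in exactly one cell, so `sum(counts)` equals the number of observations.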

prop(ft, 1) # getting proportions; 1 means we want them computed within rows (first dimension)

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.330769  0.319231      0.35
'b'   │ 0.358974   0.32906  0.311966
'c'   │ 0.308642  0.292181  0.399177
'd'   │ 0.334601  0.323194  0.342205

prop(ft, 2) # and here they are computed within columns

4×3 Named Array{Float64,2}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.258258  0.262658  0.259259
'b'   │ 0.252252  0.243671  0.207977
'c'   │ 0.225225  0.224684  0.276353
'd'   │ 0.264264  0.268987   0.25641
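
The two `prop` calls above amount to dividing each cell by its row total or its column total. A minimal sketch in plain Julia, reusing the counts printed in the frequency table above (the variable names are assumptions for illustration):

```julia
# Counts copied from the frequency table shown above
counts = [86 83 91;
          84 77 73;
          75 71 97;
          88 85 90]

# Proportions within rows: divide each cell by its row total
row_prop = counts ./ sum(counts, dims=2)

# Proportions within columns: divide each cell by its column total
col_prop = counts ./ sum(counts, dims=1)
```

Each row of `row_prop` and each column of `col_prop` sums to 1.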

x = categorical(rand(1:3, 10))
levels!(x, [3, 1, 2, 4]) # reorder the levels and add an extra level
freqtable(x) # the order set by levels! is preserved and zero-count levels are shown

4-element Named Array{Int64,1}
Dim1  │ 
──────┼──
3     │ 5
1     │ 3
2     │ 2
4     │ 0

freqtable([1,1,2,3,missing]) # by default missing values are listed

4-element Named Array{Int64,1}
Dim1    │ 
────────┼──
1       │ 2
2       │ 1
3       │ 1
missing │ 1

freqtable([1,1,2,3,missing], skipmissing=true) # but we can skip them

3-element Named Array{Int64,1}
Dim1  │ 
──────┼──
1     │ 2
2     │ 1
3     │ 1
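
The effect of `skipmissing=true` can be reproduced in plain Julia by counting over `skipmissing(v)` instead of `v`. The helper `tally` below is hypothetical; this is a sketch of the idea, not of the FreqTables internals.

```julia
# Hypothetical helper: count occurrences of each value in an iterable
function tally(itr)
    counts = Dict{Any,Int}()
    for x in itr
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

v = [1, 1, 2, 3, missing]
with_missing = tally(v)                   # missing is counted as its own key
without_missing = tally(skipmissing(v))   # missing values are dropped first
```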

DataFramesMeta - working on `DataFrame`

using DataFramesMeta
df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4))

@with(df, :x+:z) # expressions can refer to columns of the DataFrame

8-element Array{Int64,1}:
 2
 2
 4
 4
 6
 6
 8
 8

@with df begin # you can define code blocks
    a = :x[:z]
    b = :x[.!:z]
    :y + [a; b]
end

8-element Array{Char,1}:
 'b'
 'e'
 'h'
 'k'
 'g'
 'j'
 'm'
 'p'
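
For comparison, the two `@with` calls above can be written out on plain vectors. This sketch assumes the same column contents as `df`; the `let` block stands in for the hard scope that `@with` creates, so `a` and `b` stay local to it.

```julia
# Same contents as the columns of df above, as plain vectors
x = collect(1:8)
y = collect('a':'h')
z = repeat([true, false], outer=4)

# Equivalent of @with(df, :x+:z): Bool counts as 0 or 1 in arithmetic
s = x .+ z

# Equivalent of the @with block; a and b do not leak out of the let block
shifted = let
    a = x[z]        # values of x where z is true
    b = x[.!z]      # values of x where z is false
    y .+ [a; b]     # shift each Char by the corresponding integer
end
```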

a # @with creates a hard scope, so variables do not leak out

UndefVarError: a not defined

Stacktrace:
 [1] top-level scope at In[12]:1

df2 = DataFrame(a = [:a, :b, :c])
@with(df2, :a .== ^(:a)) # sometimes we want to work on a raw Symbol; ^() escapes it

3-element BitArray{1}:
  true
 false
 false

df2 = DataFrame(x=1:3, y=4:6, z=7:9)
@with(df2, _I_(2:3)) # _I_(expression) is translated to df2[expression]

┌ Warning: _I_() for escaping variables is deprecated, use cols() instead
└ @ DataFramesMeta /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:35

@where(df, :x .< 4, :z .== true) # a very useful macro for filtering
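
The conditions passed to `@where` are combined with a logical AND. A minimal sketch of the same row selection on plain vectors (assuming the same `x` and `z` contents as `df`):

```julia
x = collect(1:8)
z = repeat([true, false], outer=4)

# Both conditions must hold for a row to be kept
mask = (x .< 4) .& (z .== true)

# Indices of the rows that pass the filter
kept_rows = findall(mask)
```

Only rows 1 and 3 satisfy both conditions.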

@select(df, :x, y = 2*:x, z=:y) # create a new DataFrame based on the old one

@transform(df, a=1, x = 2*:x, y=:x) # create a new DataFrame by adding columns to the old one

@transform(df, a=1, b=:a) # the old DataFrame is used, and :a is not present in it, so this fails

KeyError: key :a not found

Stacktrace:
 [1] (::getfield(Main, Symbol("##16#18")))(::DataFrame) at ./dict.jl:478
 [2] #transform#9(::Base.Iterators.Pairs{Symbol,Function,Tuple{Symbol,Symbol},NamedTuple{(:a, :b),Tuple{getfield(Main, Symbol("##15#17")),getfield(Main, Symbol("##16#18"))}}}, ::Function, ::DataFrame) at /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:384
 [3] (::getfield(DataFramesMeta, Symbol("#kw##transform")))(::NamedTuple{(:a, :b),Tuple{getfield(Main, Symbol("##15#17")),getfield(Main, Symbol("##16#18"))}}, ::typeof(DataFramesMeta.transform), ::DataFrame) at ./none:0
 [4] top-level scope at /home/yt/.julia/packages/DataFramesMeta/oLnYB/src/DataFramesMeta.jl:406
 [5] top-level scope at In[18]:1

@orderby(df, :z, -:x) # sorts into a new DataFrame; less powerful than sort, but lightweight
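
Ordering by `:z` and then by `-:x` is a lexicographic sort on the pair `(z, -x)`. A sketch on plain vectors with `sortperm` (assuming the same `x` and `z` contents as `df`):

```julia
x = collect(1:8)
z = repeat([true, false], outer=4)

# Sort keys: first by z (false before true), then by x in descending order
keys = collect(zip(z, -x))
perm = sortperm(keys)

# Apply the permutation to see the resulting row order
x_sorted = x[perm]
z_sorted = z[perm]
```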

@linq df |> # chaining of operations on a DataFrame
    where(:x .< 5) |>
    orderby(:z) |>
    transform(x²=:x.^2) |>
    select(:z, :x, :x²)

f(df, col) = df[col] # you can define your own functions and put them in the chain
@linq df |> where(:x .<= 4) |> f(:x)

4-element Array{Int64,1}:
 1
 2
 3
 4
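
The chaining idea is the same as Julia's built-in `|>` operator, which passes the left-hand value to a one-argument function. A small sketch on a plain vector; the helpers `keep_small` and `square` are hypothetical names introduced for illustration.

```julia
x = collect(1:8)

# Hypothetical one-argument helpers to place in the chain
keep_small(v) = filter(e -> e <= 4, v)   # keep values not greater than 4
square(v) = v .^ 2                       # square each element

# Each step receives the result of the previous one
result = x |> keep_small |> square
```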

DataFramesMeta - working on grouped `DataFrame`

df = DataFrame(a = 1:12, b = repeat('a':'d', outer=3))
g = groupby(df, :b)

GroupedDataFrame with 4 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 1     │ 'a'  │
│ 2   │ 5     │ 'a'  │
│ 3   │ 9     │ 'a'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │

@by(df, :b, first=first(:a), last=last(:a), mean=mean(:a)) # more convenient than by from DataFrames

@based_on(g, first=first(:a), last=last(:a), mean=mean(:a)) # the same as by, but applied to a grouped DataFrame
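
What `@by` and `@based_on` compute here is one summary row per group. A minimal sketch on plain vectors (assuming the same `a` and `b` contents as `df`; the name `stats` is an assumption):

```julia
using Statistics

a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# One named tuple of aggregates per group key
stats = Dict{Char,NamedTuple}()
for key in unique(b)
    vals = a[b .== key]   # values of a belonging to this group
    stats[key] = (first=first(vals), last=last(vals), mean=mean(vals))
end
```

For example, group `'a'` holds the values 1, 5 and 9, so its first value is 1, its last is 9 and its mean is 5.0.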

@where(g, mean(:a) > 6.5) # select the groups that meet the condition

GroupedDataFrame with 2 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 3     │ 'c'  │
│ 2   │ 7     │ 'c'  │
│ 3   │ 11    │ 'c'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │
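
On a grouped DataFrame the condition is evaluated once per group, and whole groups are kept or dropped. A sketch of that selection on plain vectors (assuming the same `a` and `b` contents as `df`):

```julia
using Statistics

a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# Keep only the group keys whose group mean of a exceeds 6.5
kept = [key for key in unique(b) if mean(a[b .== key]) > 6.5]
```

The group means are 5, 6, 7 and 8, so only `'c'` and `'d'` remain, matching the output above.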

@orderby(g, -sum(:a)) # order the groups by an aggregate computed per group

GroupedDataFrame with 4 groups based on key: :b
First Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 4     │ 'd'  │
│ 2   │ 8     │ 'd'  │
│ 3   │ 12    │ 'd'  │
⋮
Last Group: 3 rows
│ Row │ a     │ b    │
│     │ Int64 │ Char │
├─────┼───────┼──────┤
│ 1   │ 1     │ 'a'  │
│ 2   │ 5     │ 'a'  │
│ 3   │ 9     │ 'a'  │
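
The group ordering above can be sketched on plain vectors by sorting the group keys on the negated group sum (assuming the same `a` and `b` contents as `df`):

```julia
a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# Sort group keys by descending sum of a within each group
ordered = sort(unique(b), by = key -> -sum(a[b .== key]))
```

The group sums are 15, 18, 21 and 24, so `'d'` comes first and `'a'` last.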

@transform(g, center = mean(:a), centered = :a .- mean(:a)) # perform operations within each group and return an ungrouped DataFrame
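
Group-wise centering can be sketched on plain vectors by attaching each row's group mean and subtracting it. This assumes the same `a` and `b` contents as `df`; unlike the macro, the sketch keeps the original row order rather than arranging rows by group.

```julia
using Statistics

a = collect(1:12)
b = repeat(collect('a':'d'), outer=3)

# For every row, the mean of a over that row's group
center = [mean(a[b .== key]) for key in b]

# Deviation of each value from its group mean
centered = a .- center
```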

DataFrame(g) # a nice convenience function not defined in DataFrames

@transform(g) # this gives the same result

@linq df |> groupby(:b) |> where(mean(:a) > 6.5) |> DataFrame # chaining works on a grouped DataFrame as well

DataFramesMeta - rowwise operations on `DataFrame`

df = DataFrame(a = 1:12, b = repeat(1:4, outer=3))

# such conditions are needed often but are cumbersome to write
@transform(df, x = ifelse.((:a .> 6) .& (:b .== 4), "yes", "no"))

# one option is to write a function that works on a single observation and broadcast it
myfun(a, b) = a > 6 && b == 4 ? "yes" : "no"
@transform(df, x = myfun.(:a, :b))

# or you can use the @byrow! macro, which lets you process a DataFrame row by row
@byrow! df begin
    @newcol x::Vector{String}
    :x = :a > 6 && :b == 4 ? "yes" : "no"
end
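
All three variants above evaluate the same scalar condition once per row. The same logic on plain vectors, as a sketch that assumes the same `a` and `b` contents as `df`:

```julia
a = collect(1:12)
b = repeat(collect(1:4), outer=3)

# Evaluate the scalar condition for each (a, b) pair in turn
x = [ai > 6 && bi == 4 ? "yes" : "no" for (ai, bi) in zip(a, b)]

# Rows where the condition holds
yes_rows = findall(x .== "yes")
```

Only rows 8 and 12 have both `a > 6` and `b == 4`.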


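Output of df = DataFrame(x=1:8, y='a':'h', z=repeat([true,false], outer=4)):

│ Row │ x     │ y    │ z     │
│     │ Int64 │ Char │ Bool  │
├─────┼───────┼──────┼───────┤
│ 1   │ 1     │ 'a'  │ true  │
│ 2   │ 2     │ 'b'  │ false │
│ 3   │ 3     │ 'c'  │ true  │
│ 4   │ 4     │ 'd'  │ false │
│ 5   │ 5     │ 'e'  │ true  │
│ 6   │ 6     │ 'f'  │ false │
│ 7   │ 7     │ 'g'  │ true  │
│ 8   │ 8     │ 'h'  │ false │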

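Output of @transform(g, center = mean(:a), centered = :a .- mean(:a)):

│ Row │ a     │ b    │ center  │ centered │
│     │ Int64 │ Char │ Float64 │ Float64  │
├─────┼───────┼──────┼─────────┼──────────┤
│ 1   │ 1     │ 'a'  │ 5.0     │ -4.0     │
│ 2   │ 5     │ 'a'  │ 5.0     │ 0.0      │
│ 3   │ 9     │ 'a'  │ 5.0     │ 4.0      │
│ 4   │ 2     │ 'b'  │ 6.0     │ -4.0     │
│ 5   │ 6     │ 'b'  │ 6.0     │ 0.0      │
│ 6   │ 10    │ 'b'  │ 6.0     │ 4.0      │
│ 7   │ 3     │ 'c'  │ 7.0     │ -4.0     │
│ 8   │ 7     │ 'c'  │ 7.0     │ 0.0      │
│ 9   │ 11    │ 'c'  │ 7.0     │ 4.0      │
│ 10  │ 4     │ 'd'  │ 8.0     │ -4.0     │
│ 11  │ 8     │ 'd'  │ 8.0     │ 0.0      │
│ 12  │ 12    │ 'd'  │ 8.0     │ 4.0      │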

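Output of the rowwise examples (the ifelse, myfun and @byrow! variants give the same result):

│ Row │ a     │ b     │ x      │
│     │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 1     │ no     │
│ 2   │ 2     │ 2     │ no     │
│ 3   │ 3     │ 3     │ no     │
│ 4   │ 4     │ 4     │ no     │
│ 5   │ 5     │ 1     │ no     │
│ 6   │ 6     │ 2     │ no     │
│ 7   │ 7     │ 3     │ no     │
│ 8   │ 8     │ 4     │ yes    │
│ 9   │ 9     │ 1     │ no     │
│ 10  │ 10    │ 2     │ no     │
│ 11  │ 11    │ 3     │ no     │
│ 12  │ 12    │ 4     │ yes    │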
