Introduction to DataFrames¶

Bogumił Kamiński, May 23, 2018

Source¶

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

See also¶

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)

Let's start by loading the DataFrames package.

using DataFrames

Constructors and conversion¶

Constructors¶

In this section, we will look at many ways to create a DataFrame using the DataFrame() constructor.

First, an empty DataFrame is easy to create.

DataFrame() # empty DataFrame

Or we can pass keyword arguments to add columns to the DataFrame.

DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3])) # fails as written: randstring is not defined until Random is loaded

UndefVarError: randstring not defined

Stacktrace:
 [1] top-level scope at In[3]:1

using Random
DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3]))

A DataFrame can also be created from a dictionary. In that case the keys become the column names, sorted by key.

x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)

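The example above built the dictionary first, but the key-value pairs can also be passed directly as arguments to DataFrame.

Note that in this case the column names are written as Symbols with a leading :, and the columns are not sorted by key. For example, to create a column named A, write :A.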

DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])
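The difference in column order between the two constructors can be checked directly. The sketch below lists the keys deliberately out of alphabetical order; the variable names d, from_dict, and from_pairs are illustrative and do not come from the original tutorial. string.(names(...)) is used so that the comparison does not depend on whether names returns Symbols or Strings, which differs between DataFrames versions.

```julia
using DataFrames

# Keys deliberately listed out of alphabetical order.
d = Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false])
from_dict = DataFrame(d)

# The same columns passed as pairs, in the same order.
from_pairs = DataFrame(:C => ['a', 'b'], :A => [1, 2], :B => [true, false])

# Column names as strings, independent of the DataFrames version.
dict_order = string.(names(from_dict))    # sorted by key
pair_order = string.(names(from_pairs))   # order of the arguments
```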

Here a DataFrame is created from a vector of vectors; each inner vector becomes a column.

DataFrame([rand(3) for i in 1:3])

To create a one-row DataFrame from the elements of a vector, it might seem that the following would work, but it throws an error. (It used to work, but it was deprecated and is no longer supported.)

DataFrame(rand(3))

ArgumentError: unable to construct DataFrame from Array{Float64,1}

Stacktrace:
 [1] DataFrame(::Array{Float64,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/tables.jl:32
 [2] top-level scope at In[8]:1

So a transposed vector has to be used instead. (This effectively passes a two-dimensional array, which the constructor does support.)

DataFrame(transpose([1, 2, 3]))
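The reason the transposed form works can be seen from the shapes involved. The sketch below is not from the original tutorial and assumes DataFrames 1.0 or later, where matrix constructors need either explicit column names or :auto. It also swaps in permutedims for transpose, because permutedims returns an ordinary 1×3 Matrix; the variable names are illustrative.

```julia
using DataFrames

v = [1, 2, 3]

# A plain vector is one-dimensional; its transpose behaves as a 1×3 matrix.
vector_dims = ndims(v)
row_shape = size(transpose(v))

# Assumption: DataFrames 1.0 or later, where matrix constructors need either
# explicit column names or :auto to generate x1, x2, ... automatically.
df = DataFrame(permutedims(v), :auto)
df_shape = size(df)
```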

Column names can be passed as the second argument.

DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

Here a DataFrame is created from a matrix.

DataFrame(rand(3,4))

Likewise, the names can be passed as the second argument.

DataFrame(rand(3,4), Symbol.('a':'d'))
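The expression Symbol.('a':'d') broadcasts Symbol over a range of characters, which gives one column name per matrix column. A minimal sketch of that step, with illustrative variable names:

```julia
using DataFrames

# Broadcasting Symbol over a character range yields one Symbol per character.
col_names = Symbol.('a':'d')

# The number of names must match the number of matrix columns (4 here).
df = DataFrame(rand(3, 4), col_names)
shape = size(df)
name_strings = string.(names(df))
```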

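We can also create an uninitialized DataFrame.

Below, the column types, the column names, and the number of rows are passed in. Column :C holds missing because Any >: Missing, while columns :A and :B hold whatever values happened to be in memory.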

DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

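Here is the same DataFrame, except that column :C is #undef, because a String column cannot hold missing and its entry is left unassigned.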

DataFrame([Int, Float64, String], [:A, :B, :C], 1)

We can also declare a DataFrame that has column names but no rows.

DataFrame([Int, Float64, String], [:A, :B, :C], 0)
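A zero-row DataFrame with typed columns can also be written with empty typed vectors as keyword arguments, which avoids uninitialized memory altogether. This is an alternative sketch, not the tutorial's method; it assumes a recent DataFrames release for eachcol, and the pushed row is made-up sample data.

```julia
using DataFrames

# Empty, typed vectors give a DataFrame with column names and types but no rows.
df = DataFrame(A=Int[], B=Float64[], C=String[])

shape = size(df)                                # (rows, columns)
col_types = [eltype(c) for c in eachcol(df)]    # element type of each column

# Rows can then be appended one at a time.
push!(df, (1, 2.5, "first"))
rows_after_push = size(df, 1)
```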

The following syntax quickly creates a homogeneous DataFrame, here with 3 rows and 5 columns of Int.

DataFrame(Int, 3, 5)

Similarly, when the columns are not homogeneous, pass a vector of types and the number of rows.

DataFrame([Int, Float64], 4)

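Finally, a DataFrame can be created by copying an existing DataFrame.

Note that copy creates a shallow copy. Also note that x here is still the dictionary defined earlier, so y is a DataFrame built from that dictionary and z is a copy of the dictionary itself, as the output below shows.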

y = DataFrame(x)
z = copy(x)

Dict{String,Array{T,1} where T} with 3 entries:
  "B" => Bool[true, false]
  "A" => [1, 2]
  "C" => ['a', 'b']

x

Dict{String,Array{T,1} where T} with 3 entries:
  "B" => Bool[true, false]
  "A" => [1, 2]
  "C" => ['a', 'b']

y

z

Dict{String,Array{T,1} where T} with 3 entries:
  "B" => Bool[true, false]
  "A" => [1, 2]
  "C" => ['a', 'b']

(x === y), (x === z), isequal(x, z)

(false, false, true)
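Because x above is a Dict, the comparison is between dictionaries. The same check applied to an actual DataFrame can be sketched as follows; df and df_copy are hypothetical names introduced for this illustration.

```julia
using DataFrames

df = DataFrame(A=[1, 2], B=[true, false])
df_copy = copy(df)

same_object = df === df_copy          # identity: are they the same object?
same_content = isequal(df, df_copy)   # equality: do they hold the same data?
```

The copy is a distinct object that compares equal to the original, which matches the pattern of the tuple shown above.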

Conversion to a matrix¶

Let's start by creating a DataFrame with two rows and two columns.

x = DataFrame(x=1:2, y=["A", "B"])

A DataFrame can be converted to a matrix with the Matrix function.

Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

This works even if the DataFrame contains missing.

x = DataFrame(x=1:2, y=[missing,"B"])

Matrix(x)

2×2 Array{Any,2}:
 1  missing
 2  "B"

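In the two previous examples, Julia produced a matrix whose elements are of type Any. The element type is inferred from the DataFrame that is passed in, as the next example makes clear: passing a DataFrame of integers to Matrix creates a 2D array of Int64.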

x = DataFrame(x=1:2, y=3:4)

Matrix(x)

2×2 Array{Int64,2}:
 1  3
 2  4

In the next example, Julia correctly infers the element type as a Union.

x = DataFrame(x=1:2, y=[missing,4])

Matrix(x)

2×2 Array{Union{Missing, Int64},2}:
 1   missing
 2  4
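The element types seen in these examples are consistent with what Julia's promote_type returns for the column element types. That Matrix infers its element type this way is an assumption made for illustration, not something the tutorial states; the variable names are illustrative.

```julia
# Element types of the columns in the kinds of examples above.
mixed = promote_type(Int, String)                      # integers and strings
ints = promote_type(Int, Int)                          # integers only
with_missing = promote_type(Int, Union{Missing, Int})  # integers, one column with missing
```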

Note that missing cannot be forcibly converted to Int!

Matrix{Int}(x)

cannot convert a DataFrame containing missing values to array (found for column y)

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] convert(::Type{Array{Int64,2}}, ::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:722
 [3] Array{Int64,2}(::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:729
 [4] top-level scope at In[31]:1
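One way around this error is to replace the missing values before asking for an integer matrix. The sketch below uses coalesce with 0 as the replacement; the choice of 0 is arbitrary and only for illustration, and the variable names are not from the original tutorial.

```julia
using DataFrames

x = DataFrame(x=1:2, y=[missing, 4])
m = Matrix(x)

# Replace every missing with 0 (an arbitrary placeholder chosen for this sketch).
filled = coalesce.(m, 0)

has_missing_before = any(ismissing, m)
filled_type = eltype(filled)
```

Whether a placeholder is acceptable depends on the analysis; dropping the affected rows first is another option.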

Duplicate column names¶

By default, a DataFrame cannot have repeated column names. The makeunique keyword argument automatically renames repeated names so that they become unique.

df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

In the version used here, leaving it out only produces a deprecation warning, and the names are still changed automatically.

df = DataFrame(:a=>1, :a=>2, :a_1=>3)

┌ Warning: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
│   caller = ip:0x0
└ @ Core :-1

When the column names are given as keyword arguments, makeunique cannot help, because repeating a keyword argument is a syntax error.

df = DataFrame(a=1, a=2, makeunique=true)

syntax: keyword argument "a" repeated in call to "DataFrame"
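The effect of makeunique on the pair-based constructor can be verified by inspecting the resulting names. A minimal sketch, with illustrative variable names; it checks only that the names end up distinct, not which suffixes are chosen.

```julia
using DataFrames

df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)

# Column names as strings, independent of the DataFrames version.
col_names = string.(names(df))
all_unique = length(unique(col_names)) == length(col_names)
```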

The displayed results of the constructor examples above are as follows.

Output of DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3])):

	A	B	C
	Int64	Float64	String
1	1	0.263148	9dT
2	2	0.0721952	XYd
3	3	0.813375	WYR

Output of DataFrame([rand(3) for i in 1:3]):

	x1	x2	x3
	Float64	Float64	Float64
1	0.794229	0.377263	0.0714833
2	0.405064	0.411712	0.40019
3	0.858875	0.274074	0.674826

Output of DataFrame(rand(3,4)):

	x1	x2	x3	x4
	Float64	Float64	Float64	Float64
1	0.629159	0.582966	0.240279	0.904271
2	0.998643	0.110063	0.772363	0.69006
3	0.212445	0.713742	0.229775	0.221431

Output of DataFrame(rand(3,4), Symbol.('a':'d')):

	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.571469	0.499821	0.798888	0.892562
2	0.264511	0.643322	0.650872	0.320339
3	0.182328	0.456088	0.29218	0.499611

Output of DataFrame(Int, 3, 5) (uninitialized values):

	x1	x2	x3	x4	x5
	Int64	Int64	Int64	Int64	Int64
1	140067592119856	140067540633344	140066203421808	140067592176128	140067592176128
2	140067444304512	140067591551056	140066210800176	140067444301376	140067444301376
3	140067444304384	140067444301376	140067447490544	140066201352816	140066201352832


Output of DataFrame([Int, Float64], 4) (uninitialized values):

	x1	x2
	Int64	Float64
1	140067444297736	6.92026e-310
2	140066203603024	6.92019e-310
3	140066201375136	6.92019e-310
4	140063178489856	6.92019e-310


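Output of DataFrame(x), built from the dictionary:

	A	B	C
	Int64	Bool	Char
1	1	true	'a'
2	2	false	'b'

Output of DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']):

	A	B	C
	Int64	Bool	Char
1	1	true	'a'
2	2	false	'b'

Output of DataFrame([Int, Float64, Any], [:A, :B, :C], 1):

	A	B	C
	Int64	Float64	Any
1	140067476918704	6.92025e-310	missing

Output of y:

	A	B	C
	Int64	Bool	Char
1	1	true	'a'
2	2	false	'b'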