Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series

https://deepstat.tistory.com/69 (01. constructors) (in English)
https://deepstat.tistory.com/70 (01. constructors) (in Korean)

Let's get started by loading the DataFrames package.

using DataFrames

Constructors and conversion

Constructors

In this section, you'll see many ways to create a DataFrame using the DataFrame() constructor.

First, we could create an empty DataFrame,

DataFrame() # empty DataFrame

Or we could call the constructor using keyword arguments to add columns to the DataFrame.

DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))

UndefVarError: randstring not defined

Stacktrace:
 [1] top-level scope at In[3]:1

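The call fails because randstring is provided by the Random standard library, which has to be loaded first.

using Random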
DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3]))

We can create a DataFrame from a dictionary, in which case keys from the dictionary will be sorted to create the DataFrame columns.

x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)

Rather than explicitly creating a dictionary first, as above, we could pass DataFrame arguments with the syntax of dictionary key-value pairs.

Note that in this case, we use symbols to denote the column names and arguments are not sorted. For example, :A, the symbol, produces A, the name of the first column here:

DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])
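To make the ordering difference concrete, here is a small sketch that builds the same columns both ways, deliberately listing them out of alphabetical order. The column contents are made up for illustration, and string.(names(df)) is used only so that the comparison does not depend on whether names returns symbols or strings in your DataFrames version.

```julia
using DataFrames

# Same data, listed in non-alphabetical order (illustrative values).
from_dict  = DataFrame(Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false]))
from_pairs = DataFrame(:C => ['a', 'b'], :A => [1, 2], :B => [true, false])

# Convert names to strings so the check works for Symbol or String names.
dict_names = string.(names(from_dict))   # dictionary keys end up sorted
pair_names = string.(names(from_pairs))  # pairs keep the order they were passed in

println(dict_names)
println(pair_names)
```

If column order matters, passing pairs is the safer choice of the two.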

Here we create a DataFrame from a vector of vectors, and each vector becomes a column.

DataFrame([rand(3) for i in 1:3])

A Vector of atoms, however, cannot be passed to the constructor directly. In the DataFrames version used here, this throws an error.

DataFrame(rand(3))

ArgumentError: unable to construct DataFrame from Array{Float64,1}

Stacktrace:
 [1] DataFrame(::Array{Float64,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/tables.jl:32
 [2] top-level scope at In[8]:1

Instead, use a transposed vector if you have a vector of atoms. This effectively passes a two-dimensional array to the constructor, which is supported, and produces a DataFrame with a single row.

DataFrame(transpose([1, 2, 3]))
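As a variation on the same idea, the sketch below uses Base's permutedims to turn the vector into a 1×3 matrix and passes explicit column names. The names :a, :b and :c are chosen purely for illustration.

```julia
using DataFrames

v = [1, 2, 3]

# permutedims turns the 3-element vector into a 1×3 matrix,
# so each element becomes its own column in a single row.
row = permutedims(v)
df = DataFrame(row, [:a, :b, :c])  # column names are illustrative

println(size(row))
println(size(df))
```

Unlike transpose, permutedims returns an ordinary Matrix rather than a lazy wrapper, which may be convenient if the result is reused elsewhere.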

Pass a second argument to give the columns names.

DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

Here we create a DataFrame from a matrix,

DataFrame(rand(3,4))

and here we do the same but also pass column names.

DataFrame(rand(3,4), Symbol.('a':'d'))

We can also construct an uninitialized DataFrame.

Here we pass column types, names and number of rows; we get missing in column :C because Any >: Missing.

DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

Here we create a DataFrame, but column :C is #undef, because a String column cannot hold missing and its entries are left unassigned.

DataFrame([Int, Float64, String], [:A, :B, :C], 1)

To initialize a DataFrame with column names, but no rows use

DataFrame([Int, Float64, String], [:A, :B, :C], 0)
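A sketch of an equivalent that relies only on the keyword constructor shown earlier: passing empty typed vectors also yields named, typed columns with no rows. This may be useful if the type-vector form is unavailable in your DataFrames version; the row pushed afterwards is made-up data.

```julia
using DataFrames

# Empty typed vectors give columns with the desired element types and zero rows.
df = DataFrame(A = Int[], B = Float64[], C = String[])
println(size(df))

# Rows can be appended later; push! accepts a tuple matching the column order.
push!(df, (1, 2.5, "x"))
println(size(df))
println(eltype(df.C))
```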

This syntax gives us a quick way to create a homogeneous DataFrame, in which every column has the same type. Note that the values are uninitialized.

DataFrame(Int, 3, 5)

This example is similar, but has heterogeneous columns.

DataFrame([Int, Float64], 4)

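Finally, we can create a DataFrame from existing data and copy it.

Note that copy creates a shallow copy. In the code below, x is still the dictionary defined earlier, so y is a new DataFrame built from it and z is a shallow copy of the dictionary itself.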

y = DataFrame(x)
z = copy(x)
(x === y), (x === z), isequal(x, z)

(false, false, true)
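What "shallow" means for the dictionary x used above can be sketched in plain Julia, without DataFrames. The dictionary here is a shortened, illustrative version of the earlier one, and deepcopy is shown only for contrast.

```julia
# Plain Julia, no packages needed.
x = Dict("A" => [1, 2], "B" => [true, false])

z = copy(x)       # shallow: a new Dict holding the same value arrays
d = deepcopy(x)   # deep: the value arrays are duplicated as well

println(z === x)            # false, a different Dict object
println(z["A"] === x["A"])  # true, the vector is shared
println(d["A"] === x["A"])  # false, deepcopy made a new vector

# Mutating through the shallow copy is visible in the original.
push!(z["A"], 3)
println(x["A"])
println(d["A"])
```

After the push!, x["A"] has three elements while the deep copy still has two.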

Conversion to a matrix

Let's start by creating a DataFrame with two rows and two columns.

x = DataFrame(x=1:2, y=["A", "B"])

We can create a matrix by passing this DataFrame to Matrix.

Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

This would work even if the DataFrame had some missings:

x = DataFrame(x=1:2, y=[missing,"B"])

Matrix(x)

2×2 Array{Any,2}:
 1  missing
 2  "B"

In the two previous matrix examples, Julia created matrices with elements of type Any. We can see more clearly that the type of matrix is inferred when we pass, for example, a DataFrame of integers to Matrix, creating a 2D Array of Int64s:

x = DataFrame(x=1:2, y=3:4)

Matrix(x)

2×2 Array{Int64,2}:
 1  3
 2  4

In this next example, Julia correctly identifies that Union is needed to express the type of the resulting Matrix (which contains missings).

x = DataFrame(x=1:2, y=[missing,4])

Matrix(x)

2×2 Array{Union{Missing, Int64},2}:
 1   missing
 2  4
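The element types seen so far can be checked side by side with eltype. The sketch below repeats the examples above and adds one extra case, mixing an integer column with a floating-point column, which is not part of the original walkthrough.

```julia
using DataFrames

# The element type of the matrix follows from the column types.
mixed    = Matrix(DataFrame(x = 1:2, y = ["A", "B"]))
ints     = Matrix(DataFrame(x = 1:2, y = 3:4))
with_gap = Matrix(DataFrame(x = 1:2, y = [missing, 4]))
floats   = Matrix(DataFrame(x = 1:2, y = [1.5, 2.5]))  # extra case for illustration

println(eltype(mixed))
println(eltype(ints))
println(eltype(with_gap))
println(eltype(floats))
```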

Note that we can't force a conversion of missing values to Ints!

Matrix{Int}(x)

cannot convert a DataFrame containing missing values to array (found for column y)

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] convert(::Type{Array{Int64,2}}, ::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:722
 [3] Array{Int64,2}(::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:729
 [4] top-level scope at In[27]:1
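If an integer matrix is really needed, the missing values have to be dealt with before the conversion. Two possible approaches are sketched here: replacing missings with a sentinel, or dropping incomplete rows. The sentinel 0 is an arbitrary choice made for this example.

```julia
using DataFrames

df = DataFrame(x = 1:2, y = [missing, 4])

# Option 1: replace missings with a sentinel value (0 is arbitrary here).
filled = coalesce.(Matrix(df), 0)

# Option 2: drop rows that contain missings, then convert.
complete = Matrix{Int}(dropmissing(df))

println(filled)
println(complete)
```

Which option is appropriate depends on whether a placeholder value is meaningful for the data at hand.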

Handling of duplicate column names

We cannot use duplicate names in a DataFrame. Passing the makeunique keyword argument deduplicates the names by adding a suffix automatically.

df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

Otherwise, duplicates are deprecated; in the version used here the call only emits a warning.

df = DataFrame(:a=>1, :a=>2, :a_1=>3)

┌ Warning: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
│   caller = ip:0x0
└ @ Core :-1

A constructor that is passed column names as keyword arguments is a corner case. You cannot pass makeunique to allow duplicates here, because Julia itself rejects a repeated keyword argument before the constructor is ever called.

df = DataFrame(a=1, a=2, makeunique=true)

syntax: keyword argument "a" repeated in call to "DataFrame"
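Returning to the form that does work, with column names passed as pairs, a quick check confirms that makeunique leaves no duplicates. The sketch does not assume which suffixes are generated; it only verifies that all names differ.

```julia
using DataFrames

df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique = true)

# Whatever suffixes are chosen, the resulting names must all differ.
col_names = string.(names(df))
println(col_names)
println(allunique(col_names))
```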

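Rendered output of the constructor calls above that display a table (values produced by rand or read from uninitialized memory differ between runs):

Output of DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3])):

	A	B	C
	Int64	Float64	String
1	1	0.1272	2yx
2	2	0.884208	EFK
3	3	0.561774	tMQ

Output of DataFrame([rand(3) for i in 1:3]):

	x1	x2	x3
	Float64	Float64	Float64
1	0.61836	0.956117	0.772712
2	0.339327	0.646796	0.0373572
3	0.383056	0.015768	0.568402

Output of DataFrame(rand(3,4)):

	x1	x2	x3	x4
	Float64	Float64	Float64	Float64
1	0.549398	0.536636	0.564859	0.406246
2	0.354452	0.233334	0.585349	0.239128
3	0.369394	0.38183	0.54614	0.672149

Output of DataFrame(rand(3,4), Symbol.('a':'d')):

	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.301636	0.575057	0.246931	0.0463783
2	0.483671	0.0564641	0.579429	0.231766
3	0.545517	0.214624	0.765022	0.8457

Output of DataFrame(Int, 3, 5):

	x1	x2	x3	x4	x5
	Int64	Int64	Int64	Int64	Int64
1	140482222175792	140482170689280	140480869092464	140482222232064	140482222232064
2	140482074905216	140482221606992	140480876470832	140482074902080	140482074902080
3	140482074905088	140482074902080	140482077542384	140480867023472	140480867023488

Output of DataFrame([Int, Float64], 4):

	x1	x2
	Int64	Float64
1	140482074898440	6.94074e-310
2	140480869273680	6.94068e-310
3	140480867045792	6.94068e-310
4	140479790317568	6.94068e-310

Output of DataFrame(x), with x the dictionary:

	A	B	C
	Int64	Bool	Char
1	1	true	'a'
2	2	false	'b'

Output of DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']):

	A	B	C
	Int64	Bool	Char
1	1	true	'a'
2	2	false	'b'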
