딥스탯 2018. 10. 7. 21:36
01_constructors

Let's get started by loading the DataFrames package.

In [1]:
using DataFrames

Constructors and conversion

Constructors

In this section, you'll see many ways to create a DataFrame using the DataFrame() constructor.

First, we could create an empty DataFrame,

In [2]:
DataFrame() # empty DataFrame
Out[2]:

Or we could call the constructor using keyword arguments to add columns to the DataFrame.

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))
UndefVarError: randstring not defined

Stacktrace:
 [1] top-level scope at In[3]:1
In [4]:
using Random
DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3]))
Out[4]:
     A      B         C
     Int64  Float64   String
1    1      0.1272    2yx
2    2      0.884208  EFK
3    3      0.561774  tMQ
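
The failure in In [3] comes from randstring living in the Random standard library rather than in Base. A minimal sketch of the broadcast call on its own, using only the standard library (the variable name strs is ours, for illustration):

```julia
using Random  # randstring is provided by the Random standard library

# The dot broadcasts randstring over the vector, one call per requested length.
strs = randstring.([3, 3, 3])

println(strs)           # three random strings; the contents differ on every run
println(length.(strs))  # each string has the requested length
```

Because the strings are random, only their number and lengths are predictable, not their contents.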

We can create a DataFrame from a dictionary, in which case keys from the dictionary will be sorted to create the DataFrame columns.

In [5]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)
Out[5]:
     A      B      C
     Int64  Bool   Char
1    1      true   'a'
2    2      false  'b'
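
The sorting matters because a Dict makes no promise about iteration order. A small sketch in plain Julia (no DataFrames needed) of what sorting the keys does; sorted_keys is an illustrative name:

```julia
x = Dict("A" => [1, 2], "B" => [true, false], "C" => ['a', 'b'])

# Iteration order of a Dict is an implementation detail and may vary.
println(collect(keys(x)))

# Sorting the keys gives a deterministic column order.
sorted_keys = sort(collect(keys(x)))
println(sorted_keys)
```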

Rather than explicitly creating a dictionary first, as above, we could pass DataFrame arguments with the syntax of dictionary key-value pairs.

Note that in this case, we use symbols to denote the column names and arguments are not sorted. For example, :A, the symbol, produces A, the name of the first column here:

In [6]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])
Out[6]:
     A      B      C
     Int64  Bool   Char
1    1      true   'a'
2    2      false  'b'

Here we create a DataFrame from a vector of vectors, and each vector becomes a column.

In [7]:
DataFrame([rand(3) for i in 1:3])
Out[7]:
     x1        x2        x3
     Float64   Float64   Float64
1    0.61836   0.956117  0.772712
2    0.339327  0.646796  0.0373572
3    0.383056  0.015768  0.568402

We cannot construct a DataFrame directly from a Vector of atoms; in this version of DataFrames the attempt throws an error.

In [8]:
DataFrame(rand(3))
ArgumentError: unable to construct DataFrame from Array{Float64,1}

Stacktrace:
 [1] DataFrame(::Array{Float64,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/other/tables.jl:32
 [2] top-level scope at In[8]:1

Instead use a transposed vector if you have a vector of atoms (in this way you effectively pass a two dimensional array to the constructor which is supported).

In [9]:
DataFrame(transpose([1, 2, 3]))
Out[9]:
     x1     x2     x3
     Int64  Int64  Int64
1    1      2      3
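
To see why the transposed vector is accepted, it can help to inspect its shape in plain Julia. A minimal sketch (v and row are illustrative names):

```julia
v = [1, 2, 3]
row = transpose(v)

println(size(v))                 # (3,)   a one-dimensional vector
println(size(row))               # (1, 3) one row, three columns
println(row isa AbstractMatrix)  # true
```

The transposed vector behaves like a 1×3 matrix, which is why the result above has one row and three columns.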

Pass a second argument to give the columns names.

In [10]:
DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])
Out[10]:
     A      B      C
     Int64  Int64  Int64
1    1      4      7
2    2      5      8
3    3      6      9

Here we create a DataFrame from a matrix,

In [11]:
DataFrame(rand(3,4))
Out[11]:
     x1        x2        x3        x4
     Float64   Float64   Float64   Float64
1    0.549398  0.536636  0.564859  0.406246
2    0.354452  0.233334  0.585349  0.239128
3    0.369394  0.38183   0.54614   0.672149

and here we do the same but also pass column names.

In [12]:
DataFrame(rand(3,4), Symbol.('a':'d'))
Out[12]:
     a         b          c         d
     Float64   Float64    Float64   Float64
1    0.301636  0.575057   0.246931  0.0463783
2    0.483671  0.0564641  0.579429  0.231766
3    0.545517  0.214624   0.765022  0.8457

We can also construct an uninitialized DataFrame.

Here we pass column types, names and number of rows. Columns :A and :B hold arbitrary uninitialized values, while we get missing in column :C because Any >: Missing.

In [13]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1)
Out[13]:
     A                B             C
     Int64            Float64       Any
1    140482106970544  6.94074e-310  missing

Here we create a DataFrame, but column :C is #undef.

In [14]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)
Out[14]:
     A                B             C
     Int64            Float64       String
1    140479790317568  6.94068e-310  #undef
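
The difference between the garbage numbers and #undef can be reproduced with plain uninitialized arrays. The following is a sketch of the underlying Julia behavior, not of how DataFrames allocates its columns; ints and strs are illustrative names:

```julia
ints = Vector{Int}(undef, 1)     # bits type: the slot holds whatever was in memory
strs = Vector{String}(undef, 1)  # non-bits type: the slot is left unassigned

println(isassigned(ints, 1))  # true, although the value is arbitrary
println(isassigned(strs, 1))  # false, which is what #undef means

# The type relations behind the missing in column :C
println(Any >: Missing)     # true
println(String >: Missing)  # false
```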

To initialize a DataFrame with column names but no rows, use:

In [15]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0)
Out[15]:
     A      B        C
     Int64  Float64  String

This syntax gives us a quick way to create a homogeneous DataFrame, in which every column has the same element type. The values are again uninitialized.

In [16]:
DataFrame(Int, 3, 5)
Out[16]:
     x1               x2               x3               x4               x5
     Int64            Int64            Int64            Int64            Int64
1    140482222175792  140482170689280  140480869092464  140482222232064  140482222232064
2    140482074905216  140482221606992  140480876470832  140482074902080  140482074902080
3    140482074905088  140482074902080  140482077542384  140480867023472  140480867023488

This example is similar, but the columns have different element types.

In [17]:
DataFrame([Int, Float64], 4)
Out[17]:
     x1               x2
     Int64            Float64
1    140482074898440  6.94074e-310
2    140480869273680  6.94068e-310
3    140480867045792  6.94068e-310
4    140479790317568  6.94068e-310

Finally, we can create a DataFrame by copying an existing DataFrame.

Note that copy creates a shallow copy.

In [18]:
x = DataFrame(a=1:2, b='a':'b') # x was still the Dict from In [5]; rebind it to a DataFrame first
y = DataFrame(x)
z = copy(x)
(x === y), (x === z), isequal(x, z)
Out[18]:
(false, false, true)
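
What "shallow" means can be shown with nested vectors in plain Julia. This is a sketch of the general copy and deepcopy semantics only; whether a copied DataFrame shares its column vectors with the original depends on the DataFrames version, so treat the analogy as an assumption. The names outer, shallow and deep are illustrative:

```julia
outer = [[1, 2], [3, 4]]   # stand-in for a container of columns
shallow = copy(outer)      # new outer container, same inner vectors
deep = deepcopy(outer)     # new outer container and new inner vectors

println(shallow === outer)        # false
println(shallow[1] === outer[1])  # true
println(deep[1] === outer[1])     # false

# Mutating through the shallow copy is visible in the original.
shallow[1][1] = 100
println(outer[1])  # [100, 2]
println(deep[1])   # [1, 2]
```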

Conversion to a matrix

Let's start by creating a DataFrame with two rows and two columns.

In [19]:
x = DataFrame(x=1:2, y=["A", "B"])
Out[19]:
     x      y
     Int64  String
1    1      A
2    2      B

We can create a matrix by passing this DataFrame to Matrix.

In [20]:
Matrix(x)
Out[20]:
2×2 Array{Any,2}:
 1  "A"
 2  "B"

This would work even if the DataFrame had some missings:

In [21]:
x = DataFrame(x=1:2, y=[missing,"B"])
Out[21]:
     x      y
     Int64  String⍰
1    1      missing
2    2      B
In [22]:
Matrix(x)
Out[22]:
2×2 Array{Any,2}:
 1  missing
 2  "B"    

In the two previous matrix examples, Julia created matrices with elements of type Any. We can see more clearly that the type of matrix is inferred when we pass, for example, a DataFrame of integers to Matrix, creating a 2D Array of Int64s:

In [23]:
x = DataFrame(x=1:2, y=3:4)
Out[23]:
     x      y
     Int64  Int64
1    1      3
2    2      4
In [24]:
Matrix(x)
Out[24]:
2×2 Array{Int64,2}:
 1  3
 2  4

In this next example, Julia correctly identifies that Union is needed to express the type of the resulting Matrix (which contains missings).

In [25]:
x = DataFrame(x=1:2, y=[missing,4])
Out[25]:
     x      y
     Int64  Int64⍰
1    1      missing
2    2      4
In [26]:
Matrix(x)
Out[26]:
2×2 Array{Union{Missing, Int64},2}:
 1   missing
 2  4       

Note that we can't force a conversion of missing values to Ints!

In [27]:
Matrix{Int}(x)
cannot convert a DataFrame containing missing values to array (found for column y)

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] convert(::Type{Array{Int64,2}}, ::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:722
 [3] Array{Int64,2}(::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:729
 [4] top-level scope at In[27]:1
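
The same element-type behavior can be observed on an ordinary matrix, without DataFrames. The sketch below also shows one possible way to get an Int matrix, namely replacing missing with a placeholder via coalesce first. The placeholder value 0 is our choice for illustration, not something the original prescribes:

```julia
m = [1 missing; 2 4]
println(eltype(m))  # a Union of Missing and the integer type

# Forcing the conversion fails because missing has no Int representation.
converted = try
    Matrix{Int}(m)
catch err
    err
end
println(converted isa Exception)  # true

# Replacing missing with a placeholder first makes the conversion unnecessary.
filled = coalesce.(m, 0)
println(eltype(filled))
println(filled)
```

Whether a placeholder is acceptable depends on the data; dropping the affected rows is the other common option.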

Handling of duplicate column names

We cannot use duplicate column names in a DataFrame. Passing the makeunique keyword argument deduplicates them by adding a suffix.

In [28]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)
Out[28]:
     a      a_2    a_1
     Int64  Int64  Int64
1    1      2      3
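
Note that the second :a became a_2 rather than a_1, because a_1 was already taken. The sketch below is one suffixing rule that reproduces the names shown above. The function dedup_names is hypothetical and written for illustration; it is not taken from the DataFrames source and may differ from what the package actually does:

```julia
# Hypothetical helper (not part of DataFrames): rename duplicates with an _k suffix.
function dedup_names(names::Vector{Symbol})
    seen = Set{Symbol}()
    result = copy(names)
    dups = Int[]
    # First pass: keep the first occurrence of every name, remember the duplicates.
    for (i, nm) in enumerate(names)
        if nm in seen
            push!(dups, i)
        else
            push!(seen, nm)
        end
    end
    # Second pass: give each duplicate the smallest suffix that is still unused.
    for i in dups
        k = 1
        while Symbol(names[i], "_", k) in seen
            k += 1
        end
        result[i] = Symbol(names[i], "_", k)
        push!(seen, result[i])
    end
    return result
end

println(dedup_names([:a, :a, :a_1]))  # [:a, :a_2, :a_1]
```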

Without makeunique, duplicate names are deprecated: this version still deduplicates them but prints a warning.

In [29]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)
┌ Warning: Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.
│   caller = ip:0x0
└ @ Core :-1
Out[29]:
     a      a_2    a_1
     Int64  Int64  Int64
1    1      2      3

A constructor that is passed column names as keyword arguments is a corner case. Julia itself rejects a repeated keyword argument as a syntax error before the constructor is called, so makeunique cannot allow duplicates here.

In [30]:
df = DataFrame(a=1, a=2, makeunique=true)
syntax: keyword argument "a" repeated in call to "DataFrame"
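
The error comes from the language, not from DataFrames, so it can be reproduced with any function name. A minimal sketch, assuming a recent Julia 1.x release; f is a placeholder that does not even need to be defined:

```julia
# Build the call as an expression so the failure can be caught at run time.
expr = Meta.parse("f(a=1, a=2)")
println(expr)

failed = try
    eval(expr)   # rejected because the keyword a is repeated
    false
catch err
    println(typeof(err))
    true
end
println(failed)  # true
```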