Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)

using DataFrames # load package

Handling missing values

A singleton type, Missings.Missing, allows us to deal with missing values; its only instance is missing.

missing, typeof(missing)

(missing, Missing)

Arrays automatically create an appropriate union type.

x = [1, 2, missing, 3]

4-element Array{Union{Missing, Int64},1}:
 1       
 2       
  missing
 3

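ismissing checks whether the value passed to it is missing. Applied to an array it returns false (the array itself is not missing), so broadcast with a dot to test each element.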

ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)

(false, true, false, Bool[false, false, true, false])

We can extract the type combined with Missing from a Union via Missings.T.

(This is useful for arrays!)

eltype(x), Missings.T(eltype(x))

(Union{Missing, Int64}, Int64)

Comparisons involving missing produce missing.

missing == missing, missing != missing, missing < missing

(missing, missing, missing)

This is also true when missings are compared with values of other types.

1 == missing, 1 != missing, 1 < missing

(missing, missing, missing)
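Because a comparison can return missing instead of a Bool, its result cannot be used directly as an if condition. The sketch below uses only Base functions and an illustrative vector v (not part of the original tutorial) to show how the logical operators propagate missing, and one possible way to treat an unknown comparison as false with coalesce.

```julia
# Illustrative data; the name `v` is an assumption, not from the original tutorial.
v = [1, 2, missing, 3]

# Three-valued logic: the result is missing only when it depends on the unknown value.
a = true | missing     # true, whatever the missing value is
b = false & missing    # false, whatever the missing value is
c = true & missing     # missing, because the answer is unknown

# Treat an unknown comparison as false so it can serve as a filter condition.
large = [e for e in v if coalesce(e > 1, false)]
```

Whether an unknown comparison should count as false or true depends on the analysis; coalesce makes that choice explicit.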

isequal, isless, and === produce results of type Bool.

isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)

(true, true, false, true)

Under isless, missing is larger than any numeric value (even infinity!).

isless(Inf,missing)

true

In the next few examples, we see that many (not all) functions handle missing.

map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1

4-element Array{Missing,1}:
 missing
 missing
 missing
 missing

map(x -> x(missing, 1), [+, - , *, /, div]) # part 2

5-element Array{Missing,1}:
 missing
 missing
 missing
 missing
 missing

using Statistics
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, float]) # part 3

5-element Array{Any,1}:
 missing                                   
 missing                                   
 (missing, missing)                        
 missing                                   
 Union{Missing, Float64}[1.0, 2.0, missing]

skipmissing returns an iterator that skips missing values. We can use collect and skipmissing to create an array that excludes these missing values.

collect(skipmissing([1, missing, 2, missing]))

2-element Array{Int64,1}:
 1
 2
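Collecting is not always necessary. As a minimal sketch (the variable names are illustrative), reductions such as sum and mean can consume the skipmissing iterator directly:

```julia
using Statistics  # standard library, provides mean

data = [1, missing, 2, missing]   # same values as the example above

# Reductions accept the skipmissing iterator, so no intermediate array is built.
total = sum(skipmissing(data))
avg = mean(skipmissing(data))
n = count(!ismissing, data)       # number of non-missing entries
```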

Similarly, here we combine collect and Missings.replace to create an array that replaces all missing values with some value (NaN in this case).

collect(Missings.replace([1.0, missing, 2.0, missing], NaN))

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN

Another way to do this:

coalesce.([1.0, missing, 2.0, missing], NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN
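coalesce returns its first non-missing argument, so it can also fill gaps from a second array before falling back to a constant. The vectors primary and backup below are hypothetical data chosen for illustration:

```julia
primary = [1.0, missing, 3.0, missing]   # hypothetical measurements
backup  = [9.0, 2.0, missing, missing]   # hypothetical fallback values

# Broadcasting applies coalesce element by element:
# take primary if present, otherwise backup, otherwise 0.0.
filled = coalesce.(primary, backup, 0.0)
```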

You can use recode if you have homogeneous output types.

recode([1.0, missing, 2.0, missing], missing=>NaN)

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN

You can use unique or levels to get unique values with or without missings, respectively.

unique([1, missing, 2, missing]), levels([1, missing, 2, missing])

(Union{Missing, Int64}[1, missing, 2], [1, 2])

In this next example, we convert x to y with allowmissing, where y has a type that accepts missings.

x = [1,2,3]
y = allowmissing(x)

3-element Array{Union{Missing, Int64},1}:
 1
 2
 3

Then, we convert back with disallowmissing. This would fail if y contained missing values!

z = disallowmissing(y)
x,y,z

([1, 2, 3], Union{Missing, Int64}[1, 2, 3], [1, 2, 3])
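To illustrate the failure case, here is a hedged sketch that assumes allowmissing and disallowmissing are available from Missings.jl (the vector w is illustrative). The exact exception type may differ between package versions, so the code only records that an error occurred.

```julia
using Missings  # assumption: provides allowmissing and disallowmissing

w = allowmissing([1, 2, 3])
w[2] = missing              # now the vector really contains a missing value

# Attempt the conversion and record whether it failed.
failed = try
    disallowmissing(w)
    false
catch
    true
end

# Checking first avoids the exception: drop missings if there are any.
safe = any(ismissing, w) ? collect(skipmissing(w)) : disallowmissing(w)
```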

In this next example, we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in columns 1 and 3, the types of those columns become Unions of Int64 and Missings.Missing.

x = DataFrame(Int, 2, 3)
println("Before: ", eltypes(x))
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("After: ", eltypes(x))

Before: Type[Int64, Int64, Int64]
After: Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we'll use completecases to find all the rows of a DataFrame that have complete data.

x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println(x)
println("Complete cases:\n", completecases(x))

4×2 DataFrame
│ Row │ A       │ B       │
│     │ Int64⍰  │ String⍰ │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ A       │
│ 2   │ missing │ B       │
│ 3   │ 3       │ missing │
│ 4   │ 4       │ C       │
Complete cases:
Bool[true, false, false, true]
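The Boolean vector returned by completecases can be used as a row mask. A small sketch, using the same data under the illustrative name df:

```julia
using DataFrames

# Same data as above, under an illustrative name.
df = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])

mask = completecases(df)       # true for rows without any missing value
complete = df[mask, :]         # rows with complete data
incomplete = df[.!mask, :]     # rows with at least one missing value
```

Indexing with the mask leaves df unchanged and also makes it easy to inspect the incomplete rows before deciding to drop them.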

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame and either create a new DataFrame or mutate the original in-place.

y = dropmissing(x)
dropmissing!(x)
[x, y]

2-element Array{DataFrame,1}:
 2×2 DataFrame
│ Row │ A      │ B       │
│     │ Int64⍰ │ String⍰ │
├─────┼────────┼─────────┤
│ 1   │ 1      │ A       │
│ 2   │ 4      │ C       │
 2×2 DataFrame
│ Row │ A      │ B       │
│     │ Int64⍰ │ String⍰ │
├─────┼────────┼─────────┤
│ 1   │ 1      │ A       │
│ 2   │ 4      │ C       │

When we call eltypes on a DataFrame with dropped missing values, the columns still allow missing values.

eltypes(x)

2-element Array{Type,1}:
 Union{Missing, Int64} 
 Union{Missing, String}

Since we've excluded missing values, we can safely use disallowmissing! so that the columns will no longer accept missing values.

disallowmissing!(x)
eltypes(x)

2-element Array{Type,1}:
 Int64 
 String
