딥스탯 2018. 10. 9. 16:32
03_missingvalues

Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

Reference

Series

In [1]:
using DataFrames # load package

Handling missing values

A singleton type Missings.Missing allows us to deal with missing values.

In [2]:
missing, typeof(missing)
Out[2]:
(missing, Missing)

Arrays automatically create an appropriate union type.

In [3]:
x = [1, 2, missing, 3]
Out[3]:
4-element Array{Union{Missing, Int64},1}:
 1       
 2       
  missing
 3       

ismissing checks if passed value is missing.

In [4]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)
Out[4]:
(false, true, false, Bool[false, false, true, false])

We can extract the type combined with Missing from a Union via

(This is useful for arrays!)

In [5]:
eltype(x), Missings.T(eltype(x))
Out[5]:
(Union{Missing, Int64}, Int64)

missing comparisons produce missing.

In [6]:
missing == missing, missing != missing, missing < missing
Out[6]:
(missing, missing, missing)

This is also true when missings are compared with values of other types.

In [7]:
1 == missing, 1 != missing, 1 < missing
Out[7]:
(missing, missing, missing)

isequal, isless, and === produce results of type Bool.

In [8]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)
Out[8]:
(true, true, false, true)

missing is larger than any other numeric value (even if infinity!).

In [9]:
isless(Inf,missing)
Out[9]:
true

In the next few examples, we see that many (not all) functions handle missing.

In [10]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1
Out[10]:
4-element Array{Missing,1}:
 missing
 missing
 missing
 missing
In [11]:
map(x -> x(missing, 1), [+, - , *, /, div]) # part 2
Out[11]:
5-element Array{Missing,1}:
 missing
 missing
 missing
 missing
 missing
In [12]:
using Statistics
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, float]) # part 3
Out[12]:
5-element Array{Any,1}:
 missing                                   
 missing                                   
 (missing, missing)                        
 missing                                   
 Union{Missing, Float64}[1.0, 2.0, missing]

skipmissing returns iterator skipping missing values. We can use collect and skipmissing to create an array that excludes these missing values.

In [13]:
collect(skipmissing([1, missing, 2, missing]))
Out[13]:
2-element Array{Int64,1}:
 1
 2

Similarly, here we combine collect and Missings.replace to create an array that replaces all missing values with some value (NaN in this case).

In [14]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN))
Out[14]:
4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

Another way to do this:

In [15]:
coalesce.([1.0, missing, 2.0, missing], NaN)
Out[15]:
4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

You can use recode if you have homogenous output types.

In [16]:
recode([1.0, missing, 2.0, missing], missing=>NaN)
Out[16]:
4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

You can use unique or levels to get unique values with or without missings, respectively.

In [17]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing])
Out[17]:
(Union{Missing, Int64}[1, missing, 2], [1, 2])

In this next example, we convert x to y with allowmissing, where y has a type that accepts missings.

In [18]:
x = [1,2,3]
y = allowmissing(x)
Out[18]:
3-element Array{Union{Missing, Int64},1}:
 1
 2
 3

Then, we convert back with disallowmissing. This would fail if y contained missing values!

In [19]:
z = disallowmissing(y)
x,y,z
Out[19]:
([1, 2, 3], Union{Missing, Int64}[1, 2, 3], [1, 2, 3])

In this next example, we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in columns 1 and 3, the types of those columns become Unions of Int64 and Missings.Missing.

In [20]:
x = DataFrame(Int, 2, 3)
println("Before: ", eltypes(x))
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("After: ", eltypes(x))
Before: Type[Int64, Int64, Int64]
After: Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we'll use completecases to find all the rows of a DataFrame that have complete data.

In [21]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println(x)
println("Complete cases:\n", completecases(x))
4×2 DataFrame
│ Row │ A       │ B       │
│     │ Int64⍰String⍰ │
├─────┼─────────┼─────────┤
│ 1   │ 1       │ A       │
│ 2   │ missing │ B       │
│ 3   │ 3       │ missing │
│ 4   │ 4       │ C       │
Complete cases:
Bool[true, false, false, true]

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame and either create a new DataFrame or mutate the original in-place.

In [22]:
y = dropmissing(x)
dropmissing!(x)
[x, y]
Out[22]:
2-element Array{DataFrame,1}:
 2×2 DataFrame
│ Row │ A      │ B       │
│     │ Int64⍰String⍰ │
├─────┼────────┼─────────┤
│ 1   │ 1      │ A       │
│ 2   │ 4      │ C       │
 2×2 DataFrame
│ Row │ A      │ B       │
│     │ Int64⍰String⍰ │
├─────┼────────┼─────────┤
│ 1   │ 1      │ A       │
│ 2   │ 4      │ C       │

When we call eltypes on a DataFrame with dropped missing values, the columns still allow missing values.

In [23]:
eltypes(x)
Out[23]:
2-element Array{Type,1}:
 Union{Missing, Int64} 
 Union{Missing, String}

Since we've excluded missing values, we can safely use disallowmissing! so that the columns will no longer accept missing values.

In [24]:
disallowmissing!(x)
eltypes(x)
Out[24]:
2-element Array{Type,1}:
 Int64 
 String