Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Reference

https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

Series

https://deepstat.tistory.com/69 (01. constructors)(in English)
https://deepstat.tistory.com/70 (01. constructors)(in Korean)
https://deepstat.tistory.com/71 (02. basicinfo)(in English)
https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
https://deepstat.tistory.com/73 (03. missingvalues)(in English)
https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
https://deepstat.tistory.com/75 (04. loadsave)(in English)
https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
https://deepstat.tistory.com/77 (05. columns)(in English)
https://deepstat.tistory.com/78 (05. columns)(in Korean)
https://deepstat.tistory.com/79 (06. rows)(in English)
https://deepstat.tistory.com/80 (06. rows)(in Korean)

using DataFrames, Random # load packages
Random.seed!(1); #srand(1);

Manipulating rows of a DataFrame

Reordering rows

x = DataFrame(id=1:10, x = rand(10), y = [zeros(5); ones(5)]) # and we hope that x[:x] is not sorted :)

issorted(x), issorted(x, :x) # check if a DataFrame or a subset of its columns is sorted

(true, false)

sort!(x, :x) # sort x in place

y = sort(x, :id) # new DataFrame

sort(x, (:y, :x), rev=(true, false)) # sort by two columns, first is decreasing, second is increasing

sort(x, (order(:y, rev=true), :x)) # the same as above

sort(x, (order(:y, rev=true), order(:x, by=v->-v))) # sort :x by a transformed key; negating the values makes both columns decreasing
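The effect of by=v->-v can be checked without DataFrames. The following is a minimal Base Julia sketch, assuming a small hypothetical vector v with distinct values; it only illustrates the ordering rule and is not how DataFrames implements order.

```julia
# Sorting by a negated key versus asking for reverse order directly.
v = [0.3, 0.1, 0.2]                 # hypothetical values, all distinct

p_by  = sortperm(v, by = t -> -t)   # ascending order of the key -t
p_rev = sortperm(v, rev = true)     # descending order of t itself

println(p_by)       # permutation that puts the largest value first
println(v[p_by])    # [0.3, 0.2, 0.1]
println(p_by == p_rev)
```

For distinct numeric values the two permutations agree, which is why the by form behaves like rev=true for that column.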

x[shuffle(1:10), :] # reorder rows (here randomly)

sort!(x, :id)
x[[1,10],:] = x[[10,1],:] # swap rows
x

x[1,:], x[10,:] = x[10,:], x[1,:] # and swap again
x
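Both swaps rely on the right-hand side being evaluated before anything is assigned. A minimal sketch of the same idea on a plain vector (the vector v and the name swapped are illustrative assumptions, not part of the original notebook):

```julia
# Swapping two positions of a vector with the same two idioms.
v = collect(1:5)

v[[1, 5]] = v[[5, 1]]       # v[[5, 1]] is copied first, then assigned
swapped = copy(v)           # keep the intermediate state for inspection
println(swapped)            # [5, 2, 3, 4, 1]

v[1], v[5] = v[5], v[1]     # the tuple (v[5], v[1]) is built first
println(v)                  # [1, 2, 3, 4, 5]
```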

Merging/adding rows

x = DataFrame(rand(3, 5))

[x; x] # concatenate by rows; the data frames must have the same column names; equivalent to vcat(x, x)

y = x[reverse(names(x))] # y has the same columns as x, in reverse order

vcat(x, y) # we get what we want as vcat does column name matching

vcat(x, y[1:3]) # but column names must still match

ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

Stacktrace:
 [1] _vcat(::Array{DataFrame,1}) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:926
 [2] vcat(::DataFrame, ::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/abstractdataframe.jl:906
 [3] top-level scope at In[16]:1

append!(x, x) # like vcat(x, x), but modifies x in place

append!(x, y) # here column names must match exactly

Column names do not match

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] append!(::DataFrame, ::DataFrame) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/dataframe/dataframe.jl:990
 [3] top-level scope at In[18]:1

push!(x, 1:5) # add one row to x at the end; must give correct number of values and correct types
x

push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15)) # also works with dictionaries
x

Subsetting/removing rows

x = DataFrame(id=1:10, val='a':'j')

x[1:2, :] # by index

view(x, 1:2, :) # the same rows, but as a view

x[repeat([true, false], 5), :] # by Bool, exact length required
#x[repmat([true, false], 5), :]

view(x, repeat([true, false], 5), :) # view again
#view(x, repmat([true, false], 5), :)

deleterows!(x, 7) # delete one row

deleterows!(x, 6:7) # delete a collection of rows

x = DataFrame([1:4, 2:5, 3:6])

filter(r -> r[:x1] > 2.5, x) # create a new DataFrame where filtering function operates on DataFrameRow

# in place modification of x, an example with do-block syntax
filter!(x) do r
    if r[:x1] > 2.5
        return r[:x2] < 4.5
    end
    r[:x3] < 3.5
end
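The predicate in the do-block can be traced by hand. Below is a minimal Base Julia sketch, assuming the rows are modelled as NamedTuples with the default column names x1, x2, x3; the names rows, keep and kept are hypothetical and the code does not touch DataFrames.

```julia
# Rows of DataFrame([1:4, 2:5, 3:6]) modelled as NamedTuples (an assumption
# made for illustration only).
rows = [(x1 = i, x2 = i + 1, x3 = i + 2) for i in 1:4]

# Same decision logic as the do-block, written as a single expression:
# if x1 > 2.5 the row is kept when x2 < 4.5, otherwise when x3 < 3.5.
keep(r) = r.x1 > 2.5 ? r.x2 < 4.5 : r.x3 < 3.5

kept = filter(keep, rows)
println([r.x1 for r in kept])   # [1, 3]
```

Rows 1 and 3 pass the test, so those are the rows expected to remain in x after filter!.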

Deduplicating

x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x[:C] = 1:4
x

unique(x, [1,2]) # keep the first row for each distinct combination of values in columns 1 and 2

unique(x) # now we look at whole rows

nonunique(x, :A) # get indicators of non-unique rows

4-element Array{Bool,1}:
 false
 false
  true
  true
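The indicator vector above marks every row whose value in :A has already appeared in an earlier row. A minimal Base Julia sketch of that rule, assuming a hypothetical helper seen_before (this is an illustration of the semantics, not DataFrames' implementation):

```julia
# Flag every value that has already appeared earlier in the vector.
function seen_before(values)
    seen = Set{eltype(values)}()
    flags = Bool[]
    for v in values
        push!(flags, v in seen)   # true only for repeats
        push!(seen, v)
    end
    return flags
end

A = [1, 2, 1, 2]                  # column A of x after append!(x, x)
println(seen_before(A))
```

Keeping only the rows flagged false gives the first occurrence of each value, which is what unique does for the selected columns.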

unique!(x, :B) # modify x in place

Extracting one row from a `DataFrame` into a vector

x = DataFrame(x=[1,missing,2], y=["a", "b", missing], z=[true,false,true])

cols = [:x, :y]
[x[1, col] for col in cols] # subset of columns

2-element Array{Any,1}:
 1   
  "a"

[[x[i, col] for col in names(x)] for i in 1:nrow(x)] # vector of vectors, each entry contains one full row of x

3-element Array{Array{Any,1},1}:
 [1, "a", true]       
 [missing, "b", false]
 [2, missing, true]

Tuple(x[1, col] for col in cols) # similar construct for Tuples

(1, "a")
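The element type Any in the vector outputs above comes from mixing column types in one container, while a tuple keeps the type of each field. A minimal Base Julia sketch with hypothetical values standing in for the first row:

```julia
# A vector needs one element type for all entries; a tuple stores one type per field.
as_vector = [1, "a"]        # mixed Int and String entries
as_tuple  = (1, "a")

println(eltype(as_vector))  # Any
println(typeof(as_tuple))   # a Tuple type with Int and String fields
```

If the concrete types matter downstream, the Tuple form may therefore be the better fit.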

Outputs of selected cells above, grouped by section.

x = DataFrame(id=1:10, x = rand(10), y = [zeros(5); ones(5)])

Row  id     x           y
     Int64  Float64     Float64
1    1      0.236033    0.0
2    2      0.346517    0.0
3    3      0.312707    0.0
4    4      0.00790928  0.0
5    5      0.488613    0.0
6    6      0.210968    1.0
7    7      0.951916    1.0
8    8      0.999905    1.0
9    9      0.251662    1.0
10   10     0.986666    1.0

sort!(x, :x)

Row  id     x           y
     Int64  Float64     Float64
1    4      0.00790928  0.0
2    6      0.210968    1.0
3    1      0.236033    0.0
4    9      0.251662    1.0
5    3      0.312707    0.0
6    2      0.346517    0.0
7    5      0.488613    0.0
8    7      0.951916    1.0
9    10     0.986666    1.0
10   8      0.999905    1.0

y = sort(x, :id)

Row  id     x           y
     Int64  Float64     Float64
1    1      0.236033    0.0
2    2      0.346517    0.0
3    3      0.312707    0.0
4    4      0.00790928  0.0
5    5      0.488613    0.0
6    6      0.210968    1.0
7    7      0.951916    1.0
8    8      0.999905    1.0
9    9      0.251662    1.0
10   10     0.986666    1.0

sort(x, (:y, :x), rev=(true, false))

Row  id     x           y
     Int64  Float64     Float64
1    6      0.210968    1.0
2    9      0.251662    1.0
3    7      0.951916    1.0
4    10     0.986666    1.0
5    8      0.999905    1.0
6    4      0.00790928  0.0
7    1      0.236033    0.0
8    3      0.312707    0.0
9    2      0.346517    0.0
10   5      0.488613    0.0

sort(x, (order(:y, rev=true), :x))

Row  id     x           y
     Int64  Float64     Float64
1    6      0.210968    1.0
2    9      0.251662    1.0
3    7      0.951916    1.0
4    10     0.986666    1.0
5    8      0.999905    1.0
6    4      0.00790928  0.0
7    1      0.236033    0.0
8    3      0.312707    0.0
9    2      0.346517    0.0
10   5      0.488613    0.0

x = DataFrame(rand(3, 5))

Row  x1         x2        x3         x4        x5
     Float64    Float64   Float64    Float64   Float64
1    0.0856352  0.185821  0.0516146  0.279395  0.370971
2    0.553206   0.111981  0.53803    0.178246  0.894166
3    0.46335    0.976312  0.455692   0.548983  0.648054

[x; x]

Row  x1         x2        x3         x4        x5
     Float64    Float64   Float64    Float64   Float64
1    0.0856352  0.185821  0.0516146  0.279395  0.370971
2    0.553206   0.111981  0.53803    0.178246  0.894166
3    0.46335    0.976312  0.455692   0.548983  0.648054
4    0.0856352  0.185821  0.0516146  0.279395  0.370971
5    0.553206   0.111981  0.53803    0.178246  0.894166
6    0.46335    0.976312  0.455692   0.548983  0.648054

y = x[reverse(names(x))]

Row  x5        x4        x3         x2        x1
     Float64   Float64   Float64    Float64   Float64
1    0.370971  0.279395  0.0516146  0.185821  0.0856352
2    0.894166  0.178246  0.53803    0.111981  0.553206
3    0.648054  0.548983  0.455692   0.976312  0.46335

vcat(x, y)

Row  x1         x2        x3         x4        x5
     Float64    Float64   Float64    Float64   Float64
1    0.0856352  0.185821  0.0516146  0.279395  0.370971
2    0.553206   0.111981  0.53803    0.178246  0.894166
3    0.46335    0.976312  0.455692   0.548983  0.648054
4    0.0856352  0.185821  0.0516146  0.279395  0.370971
5    0.553206   0.111981  0.53803    0.178246  0.894166
6    0.46335    0.976312  0.455692   0.548983  0.648054

x = DataFrame(id=1:10, val='a':'j')

Row  id     val
     Int64  Char
1    1      'a'
2    2      'b'
3    3      'c'
4    4      'd'
5    5      'e'
6    6      'f'
7    7      'g'
8    8      'h'
9    9      'i'
10   10     'j'
