딥스탯
2018. 10. 12. 23:55
Introduction to DataFrames
Bogumił Kamiński, Apr 21, 2018
Reference
Series
- https://deepstat.tistory.com/69 (01. constructors)(in English)
- https://deepstat.tistory.com/70 (01. constructors)(in Korean)
- https://deepstat.tistory.com/71 (02. basicinfo)(in English)
- https://deepstat.tistory.com/72 (02. basicinfo)(in Korean)
- https://deepstat.tistory.com/73 (03. missingvalues)(in English)
- https://deepstat.tistory.com/74 (03. missingvalues)(in Korean)
- https://deepstat.tistory.com/75 (04. loadsave)(in English)
- https://deepstat.tistory.com/76 (04. loadsave)(in Korean)
- https://deepstat.tistory.com/77 (05. columns)(in English)
- https://deepstat.tistory.com/78 (05. columns)(in Korean)
- https://deepstat.tistory.com/79 (06. rows)(in English)
- https://deepstat.tistory.com/80 (06. rows)(in Korean)
In [1]:
using DataFrames, Random # load packages
Random.seed!(1); # on Julia 0.6 and earlier: srand(1);
Manipulating rows of a DataFrame
Reordering rows
In [2]:
x = DataFrame(id=1:10, x = rand(10), y = [zeros(5); ones(5)]) # and we hope that x[:x] is not sorted :)
Out[2]:
In [3]:
issorted(x), issorted(x, :x) # check if a DataFrame or a subset of its columns is sorted
Out[3]:
In [4]:
sort!(x, :x) # sort x in place
Out[4]:
In [5]:
y = sort(x, :id) # new DataFrame
Out[5]:
In [6]:
sort(x, (:y, :x), rev=(true, false)) # sort by two columns: :y decreasing, then :x increasing
Out[6]:
In [7]:
sort(x, (order(:y, rev=true), :x)) # the same as above
Out[7]:
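The two calls above order rows by `:y` decreasing and break ties by `:x` increasing. Here is a minimal sketch of that ordering on plain vectors, without DataFrames. The vectors `ids`, `xs` and `ys` are made-up illustration data, and negating `ys` is only one way to build a descending key; it is not a claim about how `sort` is implemented.

```julia
# Made-up data standing in for the :id, :x and :y columns.
ids = collect(1:6)
xs = [0.3, 0.1, 0.2, 0.6, 0.4, 0.5]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Tuples compare lexicographically, so (-y, x) sorts by y decreasing
# and breaks ties by x increasing.
perm = sortperm(collect(zip(-ys, xs)))

# Row ids in the resulting order.
sorted_ids = ids[perm]
println(sorted_ids)
```

The permutation `perm` plays the same role as the row index in `x[perm, :]`.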
In [8]:
sort(x, (order(:y, rev=true), order(:x, by=v->-v))) # per-column ordering: :y decreasing, :x compared by its negated values (so also decreasing)
Out[8]:
In [9]:
x[shuffle(1:10), :] # reorder rows (here randomly)
Out[9]:
In [10]:
sort!(x, :id)
x[[1,10],:] = x[[10,1],:] # swap rows
x
Out[10]:
In [11]:
x[1,:], x[10,:] = x[10,:], x[1,:] # and swap again
x
Out[11]:
Merging/adding rows
In [12]:
x = DataFrame(rand(3, 5))
Out[12]:
In [13]:
[x; x] # concatenate by rows; the data frames must have the same column names; equivalent to vcat(x, x)
Out[13]:
In [14]:
y = x[reverse(names(x))] # y has the same columns as x, in reverse order
Out[14]:
In [15]:
vcat(x, y) # we get what we want as vcat does column name matching
Out[15]:
In [16]:
vcat(x, y[1:3]) # throws an error: the sets of column names must still match
In [17]:
append!(x, x) # the same but modifies x
Out[17]:
In [18]:
append!(x, y) # throws an error: here column names must match exactly, including their order
In [19]:
push!(x, 1:5) # add one row to x at the end; must give correct number of values and correct types
x
Out[19]:
In [20]:
push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15)) # also works with dictionaries
x
Out[20]:
Subsetting/removing rows
In [21]:
x = DataFrame(id=1:10, val='a':'j')
Out[21]:
In [22]:
x[1:2, :] # by index
Out[22]:
In [23]:
view(x, 1:2) # the same but a view
Out[23]:
In [24]:
x[repeat([true, false], 5), :] # by Bool, exact length required
# on Julia 0.6 and earlier: x[repmat([true, false], 5), :]
Out[24]:
In [25]:
view(x, repeat([true, false], 5), :) # view again
# on Julia 0.6 and earlier: view(x, repmat([true, false], 5), :)
Out[25]:
In [26]:
deleterows!(x, 7) # delete one row
Out[26]:
In [27]:
deleterows!(x, 6:7) # delete a collection of rows
Out[27]:
In [28]:
x = DataFrame([1:4, 2:5, 3:6])
Out[28]:
In [29]:
filter(r -> r[:x1] > 2.5, x) # create a new DataFrame; the predicate receives each row as a DataFrameRow
Out[29]:
In [30]:
# in place modification of x, an example with do-block syntax
filter!(x) do r
    if r[:x1] > 2.5
        return r[:x2] < 4.5
    end
    r[:x3] < 3.5
end
Out[30]:
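To see which rows the do-block predicate keeps, the same logic can be traced on plain named tuples. This is only a sketch: `rows` and `keep` are hypothetical names, and the named tuples stand in for the `DataFrameRow` objects that `filter!` passes to the function.

```julia
# Rows mirroring DataFrame([1:4, 2:5, 3:6]): x1 = 1:4, x2 = 2:5, x3 = 3:6.
rows = [(x1 = i, x2 = i + 1, x3 = i + 2) for i in 1:4]

# Same branching as the do-block: test x2 when x1 > 2.5, otherwise test x3.
keep(r) = r.x1 > 2.5 ? r.x2 < 4.5 : r.x3 < 3.5

# Base.filter on a vector keeps the elements for which the predicate is true.
kept = filter(keep, rows)
println(kept)
```

Only the rows with `x1 == 1` and `x1 == 3` pass the predicate.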
Deduplicating
In [31]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x[:C] = 1:4
x
Out[31]:
In [32]:
unique(x, [1,2]) # keep the first row for each distinct combination of values in columns 1 and 2
Out[32]:
In [33]:
unique(x) # now we look at whole rows
Out[33]:
In [34]:
nonunique(x, :A) # Bool vector marking rows whose :A value already appeared in an earlier row
Out[34]:
In [35]:
unique!(x, :B) # modify x in place
Out[35]:
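The relation between `nonunique` and `unique` can be sketched on a single plain vector. This is an illustration under the assumption that a row counts as non-unique when its key was already seen in an earlier row; the names `A`, `seen`, `dup` and `first_rows` are made up for the example, and this is not the library's implementation.

```julia
# Values of column :A after append!(x, x).
A = [1, 2, 1, 2]

seen = Set{Int}()   # keys encountered so far
dup = Bool[]        # true where the key repeats an earlier row
for a in A
    push!(dup, a in seen)
    push!(seen, a)
end

# Indices of the rows a first-occurrence deduplication would keep.
first_rows = findall(!, dup)
println(dup, " ", first_rows)
```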
Extracting one row from a DataFrame into a vector
In [36]:
x = DataFrame(x=[1,missing,2], y=["a", "b", missing], z=[true,false,true])
Out[36]:
In [37]:
cols = [:x, :y]
[x[1, col] for col in cols] # subset of columns
Out[37]:
In [38]:
[[x[i, col] for col in names(x)] for i in 1:nrow(x)] # vector of vectors, each entry contains one full row of x
Out[38]:
In [39]:
Tuple(x[1, col] for col in cols) # similar construct for Tuples
Out[39]:
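The comprehensions above only need element access by row and column, so the same pattern can be tried on a named tuple of column vectors. This is a sketch with a hypothetical stand-in `cols_data` for the data frame; note that rows containing `missing` are best compared with `isequal`, because `==` propagates `missing`.

```julia
# Column vectors matching the data frame above, held in a plain named tuple.
cols_data = (x = [1, missing, 2], y = ["a", "b", missing], z = [true, false, true])

# First row restricted to a subset of columns, as a vector and as a tuple.
row1 = [cols_data[name][1] for name in (:x, :y)]
tup1 = Tuple(cols_data[name][1] for name in (:x, :y))

# Vector of vectors: each entry holds one full row.
all_rows = [[cols_data[name][i] for name in keys(cols_data)] for i in 1:length(cols_data.x)]

println(row1, " ", tup1)
println(isequal(all_rows[2], [missing, "b", false]))
```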