티스토리 뷰

10_transforms

Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Reference

Series

In [1]:
using DataFrames # load package

Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))
Out[2]:
idid2v
Int64Int64Float64
1110.398172
2220.688774
3310.0905459
4420.241164
5110.899705
6220.329597
7310.470818
8420.745005
In [3]:
gx1 = groupby(x, :id)
Out[3]:
GroupedDataFrame with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64Int64Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.398172 │
│ 2   │ 1     │ 1     │ 0.899705 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64Int64Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.241164 │
│ 2   │ 4     │ 2     │ 0.745005 │
In [4]:
gx2 = groupby(x, [:id, :id2])
Out[4]:
GroupedDataFrame with 4 groups based on keys: :id, :id2
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64Int64Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.398172 │
│ 2   │ 1     │ 1     │ 0.899705 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64Int64Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.241164 │
│ 2   │ 4     │ 2     │ 0.745005 │
In [5]:
vcat(gx2...) # back to the original DataFrame
Out[5]:
idid2v
Int64Int64Float64
1110.398172
2110.899705
3220.688774
4220.329597
5310.0905459
6310.470818
7420.241164
8420.745005
In [6]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)
Out[6]:
idx
Int64⍰Int64
1missing1
252
313
434
5missing5
In [7]:
show(groupby(x, :id), allgroups=true) # by default groups include mising values and are not sorted
GroupedDataFrame with 4 groups based on key: :id
Group 1: 2 rows
│ Row │ id      │ x     │
│     │ Int64⍰Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 4: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │
In [8]:
show(groupby(x, :id, sort=true, skipmissing=true), allgroups=true) # but we can change it :)
GroupedDataFrame with 3 groups based on key: :id
Group 1: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
Group 2: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │
Group 3: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │
In [9]:
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
using Statistics
by(x, :id, y->mean(y[:v])) # apply a function to each group of a data frame
Out[9]:
idx1
CharFloat64
1'b'0.451979
2'a'0.396662
3'c'0.471388
4'd'0.335608
In [10]:
by(x, :id, y->mean(y[:v]), sort=true) # we can sort the output
Out[10]:
idx1
CharFloat64
1'a'0.396662
2'b'0.451979
3'c'0.471388
4'd'0.335608
In [11]:
by(x, :id, y->DataFrame(res=mean(y[:v]))) # this way we can set a name for a column - DataFramesMeta @by is better
Out[11]:
idres
CharFloat64
1'b'0.451979
2'a'0.396662
3'c'0.471388
4'd'0.335608
In [12]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id
Out[12]:
idx1_sumx2_sum
CharFloat64Float64
1'c'9.561256.85763
2'b'13.149113.6665
3'd'13.950810.7499
4'a'12.095715.2495
In [13]:
aggregate(x, :id, sum, sort=true) # also can be sorted
Out[13]:
idx1_sumx2_sum
CharFloat64Float64
1'a'12.095715.2495
2'b'13.149113.6665
3'c'9.561256.85763
4'd'13.950810.7499

We omit the discussion of of map/combine as I do not find them very useful (better to use by)

In [14]:
x = DataFrame(rand(3, 5))
Out[14]:
x1x2x3x4x5
Float64Float64Float64Float64Float64
10.1464240.5655940.31020.1533040.321919
20.9953240.5203120.0115310.7638830.945094
30.7811980.02166440.005164140.2092590.659995
In [15]:
map(mean, eachcol(x)) # map a function over each column and return a data frame
Out[15]:
x1x2x3x4x5
Float64Float64Float64Float64Float64
10.6409820.369190.1089650.3754820.642336
In [16]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # a raw iteration returns a tuple with column name and values
x1: 0.6409819483419316
x2: 0.3691902706497387
x3: 0.10896519700388281
x4: 0.3754821255946908
x5: 0.6423361106326183
In [17]:
colwise(mean, x) # colwise is similar, but produces a vector
Out[17]:
5-element Array{Float64,1}:
 0.6409819483419316 
 0.3691902706497387 
 0.10896519700388281
 0.3754821255946908 
 0.6423361106326183 
In [18]:
x[:id] = [1,1,2]
colwise(mean,groupby(x, :id)) # and works on GroupedDataFrame
Out[18]:
2-element Array{Array{Float64,1},1}:
 [0.570874, 0.542953, 0.160866, 0.458594, 0.633507, 1.0]   
 [0.781198, 0.0216644, 0.00516414, 0.209259, 0.659995, 2.0]
In [19]:
map(r -> r[:x1]/r[:x2], eachrow(x)) # now the returned value is DataFrameRow which works similarly to a one-row DataFrame
Out[19]:
3-element Array{Float64,1}:
  0.2588859485275056
  1.9129355971426416
 36.05911887812408  

'Flux in Julia > Learning Julia (Intro_to_Julia_DFs)' 카테고리의 다른 글

11. performance  (0) 2018.10.18
10. transforms (한글)  (0) 2018.10.16
09. reshaping(한글)  (0) 2018.10.15
09. reshaping  (0) 2018.10.15
08. joins (한글)  (0) 2018.10.14
공지사항
최근에 올라온 글
최근에 달린 댓글
Total
Today
Yesterday
링크
TAG
more
«   2025/05   »
1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
글 보관함