09_reshaping

Introduction to DataFrames

Bogumił Kamiński, Apr 21, 2018

Reference

Series

In [1]:
using DataFrames # load package

Reshaping DataFrames

Wide to long

In [2]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1=[11,12,13,14], M2=[111,112,113,114])
Out[2]:
Row  id     id2    M1     M2
     Int64  Int64  Int64  Int64
1    1      1      11     111
2    2      1      12     112
3    3      2      13     113
4    4      2      14     114
In [3]:
melt(x, :id, [:M1, :M2]) # first pass id-variables and then measure variables; meltdf makes a view
Out[3]:
Row  variable  value  id
     Symbol    Int64  Int64
1    M1        11     1
2    M1        12     2
3    M1        13     3
4    M1        14     4
5    M2        111    1
6    M2        112    2
7    M2        113    3
8    M2        114    4
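To make the wide-to-long step concrete, here is a minimal Base-only sketch of the same reshaping. `melt_sketch` is a hypothetical helper written for illustration, not a DataFrames function, and it makes no claim about how the package implements `melt`; it only reproduces the row order shown in the output above.

```julia
# Hypothetical helper (not part of DataFrames): wide-to-long on a Dict of columns.
# For every measure column it emits one row per original row,
# carrying the id value along.
function melt_sketch(wide, id, measures)
    ids = wide[id]
    return [(variable = m, value = wide[m][i], id = ids[i])
            for m in measures for i in eachindex(ids)]
end

wide = Dict(:id => [1, 2, 3, 4],
            :M1 => [11, 12, 13, 14],
            :M2 => [111, 112, 113, 114])

long = melt_sketch(wide, :id, [:M1, :M2])
```

All rows for `:M1` come first, followed by all rows for `:M2`, so 4 rows and 2 measure columns give 8 rows.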
In [4]:
# optionally you can rename columns; melt and stack are identical but order of arguments is reversed
stack(x, [:M1, :M2], :id, variable_name=:key, value_name=:observed) # first measures and then id-s; stackdf creates view
Out[4]:
Row  key     observed  id
     Symbol  Int64     Int64
1    M1      11        1
2    M1      12        2
3    M1      13        3
4    M1      14        4
5    M2      111       1
6    M2      112       2
7    M2      113       3
8    M2      114       4
In [5]:
# if the second argument is omitted in melt or stack, all remaining columns are used in its place
# (here :M1 and :M2 become measure variables); only when no columns are given at all
# are measure variables restricted to columns whose element type is <: AbstractFloat
melt(x, [:id, :id2])
Out[5]:
Row  variable  value  id     id2
     Symbol    Int64  Int64  Int64
1    M1        11     1      1
2    M1        12     2      1
3    M1        13     3      2
4    M1        14     4      2
5    M2        111    1      1
6    M2        112    2      1
7    M2        113    3      2
8    M2        114    4      2
In [6]:
melt(x, [1, 2]) # column indices can be used instead of symbols
Out[6]:
Row  variable  value  id     id2
     Symbol    Int64  Int64  Int64
1    M1        11     1      1
2    M1        12     2      1
3    M1        13     3      2
4    M1        14     4      2
5    M2        111    1      1
6    M2        112    2      1
7    M2        113    3      2
8    M2        114    4      2
In [7]:
bigx = DataFrame(rand(10^6, 10)) # a test comparing creation of new DataFrame and a view
bigx[:id] = 1:10^6
@time melt(bigx, :id)
@time melt(bigx, :id)
@time meltdf(bigx, :id)
@time meltdf(bigx, :id);
  0.255109 seconds (172.28 k allocations: 237.679 MiB, 34.60% gc time)
  0.203728 seconds (144 allocations: 228.889 MiB, 53.30% gc time)
  0.386479 seconds (633.47 k allocations: 32.617 MiB, 15.71% gc time)
  0.000075 seconds (117 allocations: 6.453 KiB)
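The gap between the two approaches in the timings above comes from copying versus viewing the underlying data. As a rough analogy only, using plain arrays from Base rather than `melt` and `meltdf`, a copy allocates new storage while a view keeps referring to the parent:

```julia
# Analogy with plain arrays; this does not call melt or meltdf.
parent_data = collect(1:5)

copied = parent_data[2:4]        # indexing with a range allocates a new array
viewed = view(parent_data, 2:4)  # a view shares memory with the parent

parent_data[2] = 100             # mutate the parent afterwards

# `copied` keeps the old value, `viewed` reflects the change
```

The same trade-off applies to any view: it is cheap to create, but later changes to the source remain visible through it.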
In [8]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))
Out[8]:
Row  id     id2   a1        a2
     Int64  Char  Float64   Float64
1    1      'a'   0.446038  0.735251
2    1      'b'   0.508045  0.783346
3    1      'c'   0.874669  0.724064
In [9]:
melt(x)
Out[9]:
Row  variable  value     id     id2
     Symbol    Float64   Int64  Char
1    a1        0.446038  1      'a'
2    a1        0.508045  1      'b'
3    a1        0.874669  1      'c'
4    a2        0.735251  1      'a'
5    a2        0.783346  1      'b'
6    a2        0.724064  1      'c'
In [10]:
melt(DataFrame(rand(3,2))) # by default stack and melt treat float columns as value columns
Out[10]:
Row  variable  value
     Symbol    Float64
1    x1        0.407512
2    x1        0.958294
3    x1        0.993427
4    x2        0.121015
5    x2        0.987261
6    x2        0.438873
In [11]:
df = DataFrame(rand(3,2))
df[:key] = [1,1,1]
mdf = melt(df) # duplicates in key are silently accepted
Out[11]:
Row  variable  value      key
     Symbol    Float64    Int64
1    x1        0.0148016  1
2    x1        0.0783944  1
3    x1        0.794611   1
4    x2        0.113415   1
5    x2        0.966621   1
6    x2        0.0950933  1

Long to wide

In [12]:
x = DataFrame(id = [1,1,1], id2=['a','b','c'], a1 = rand(3), a2 = rand(3))
Out[12]:
Row  id     id2   a1          a2
     Int64  Char  Float64     Float64
1    1      'a'   0.982001    0.765671
2    1      'b'   0.00268151  0.780911
3    1      'c'   0.333175    0.0896065
In [13]:
y = melt(x, [1,2])
display(x)
display(y)
Row  id     id2   a1          a2
     Int64  Char  Float64     Float64
1    1      'a'   0.982001    0.765671
2    1      'b'   0.00268151  0.780911
3    1      'c'   0.333175    0.0896065

Row  variable  value       id     id2
     Symbol    Float64     Int64  Char
1    a1        0.982001    1      'a'
2    a1        0.00268151  1      'b'
3    a1        0.333175    1      'c'
4    a2        0.765671    1      'a'
5    a2        0.780911    1      'b'
6    a2        0.0896065   1      'c'
In [14]:
unstack(y, :id2, :variable, :value) # standard unstack with a unique key
Out[14]:
Row  id2   a1          a2
     Char  Float64⍰    Float64⍰
1    'a'   0.982001    0.765671
2    'b'   0.00268151  0.780911
3    'c'   0.333175    0.0896065
In [15]:
unstack(y, :variable, :value) # all other columns are treated as keys
Out[15]:
Row  id     id2   a1          a2
     Int64  Char  Float64⍰    Float64⍰
1    1      'a'   0.982001    0.765671
2    1      'b'   0.00268151  0.780911
3    1      'c'   0.333175    0.0896065
In [16]:
# by default :id, :variable and :value names are assumed; in this case it produces duplicate keys
unstack(y)
┌ Warning: In the future `unstack(df)` will call `unstack(df, :variable, :value)`. use `unstack(df, :id, :variable, :value)` to treat `:id` as the only `rowkeys` column
│   caller = top-level scope at In[16]:1
└ @ Core In[16]:1
┌ Warning: Duplicate entries in unstack at row 2 for key 1 and variable a1.
└ @ DataFrames /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:244
Out[16]:
Row  id     a1        a2
     Int64  Float64⍰  Float64⍰
1    1      0.333175  0.0896065
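The single surviving row holds the values from the last long-format row for each variable (0.333175 and 0.0896065), which suggests that later duplicates overwrite earlier ones. A minimal Base-only sketch of that behaviour follows; `unstack_sketch` is a hypothetical helper for illustration, not the DataFrames implementation, and the sample values are made up.

```julia
# Hypothetical helper (not part of DataFrames): long-to-wide keyed on `id`.
# When the same (id, variable) pair occurs again, the later value
# overwrites the earlier one and the duplicate is counted.
function unstack_sketch(long)
    wide = Dict{Any,Dict{Symbol,Any}}()
    duplicates = 0
    for r in long
        row = get!(wide, r.id, Dict{Symbol,Any}())
        if haskey(row, r.variable)
            duplicates += 1
        end
        row[r.variable] = r.value
    end
    return wide, duplicates
end

# Illustrative data: three rows share the key id = 1
long = [(id = 1, variable = :a1, value = 0.1),
        (id = 1, variable = :a1, value = 0.2),
        (id = 1, variable = :a1, value = 0.3)]

wide, duplicates = unstack_sketch(long)
```

Choosing a key that is unique per row, as `:id2` is in the earlier `unstack` call, avoids the overwrite.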
In [17]:
df = stack(DataFrame(rand(3,2)))
Out[17]:
Row  variable  value
     Symbol    Float64
1    x1        0.524652
2    x1        0.990633
3    x1        0.419322
4    x2        0.583264
5    x2        0.0647236
6    x2        0.0752103
In [18]:
unstack(df, :variable, :value) # unable to unstack when no key column is present
ArgumentError: No key column found

Stacktrace:
 [1] unstack(::DataFrame, ::Array{Symbol,1}, ::Int64, ::Int64) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:279
 [2] unstack(::DataFrame, ::Int64, ::Int64) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:269
 [3] unstack(::DataFrame, ::Symbol, ::Symbol) at /home/yt/.julia/packages/DataFrames/1PqZ3/src/abstractdataframe/reshape.jl:265
 [4] top-level scope at In[18]:1
