딥스탯 2018. 10. 10. 20:30
04_loadsave

Introduction to DataFrames

Bogumił Kamiński, May 23, 2018

In [1]:
using DataFrames # load package

Load and save DataFrames

We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we'll load two packages: CSV, to read and write CSV files, and JLD, which lets us work with a Julia native binary format. (JLD does not currently work in Julia v1.0.1, so the JLD cells below are shown but not run.)

In [2]:
using CSV
#using JLD

Let's create a simple DataFrame for testing purposes,

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])
Out[3]:
Row  A      B        C        D
     Bool   Int64⍰   String⍰  Char⍰
1    true   1        missing  'a'
2    false  2        b        missing
3    true   missing  c        'c'

and use eltypes to look at the columnwise types.

In [4]:
eltypes(x)
Out[4]:
4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
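
The element types above follow from how Julia types an array literal that contains missing. As a minimal sketch using only Base Julia (plain vectors stand in for the DataFrame columns; the variable names are illustrative):

```julia
# Plain vectors mirroring the columns of x
A = [true, false, true]
B = [1, 2, missing]
C = [missing, "b", "c"]
D = ['a', missing, 'c']

# eltype reports the element type of each vector;
# any vector holding a missing gets a Union{Missing, T} element type
types = map(eltype, (A, B, C, D))
println(types)
```

Only A has no missing entries, so it is the only column with a plain (non-Union) element type.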

Let's use CSV to save x to disk; make sure x.csv does not conflict with some file in your working directory.

In [5]:
CSV.write("x.csv", x)
Out[5]:
"x.csv"

Now we can see how it was saved by reading x.csv.

In [6]:
print(read("x.csv", String))
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
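
In the raw text, each missing value appears as an empty field between commas. A rough sketch, assuming the file contents shown above and using only Base string functions (this is not how CSV parses files, just a way to locate the empty fields):

```julia
# The file contents printed above, as a string literal
raw = "A,B,C,D\ntrue,1,,a\nfalse,2,b,\ntrue,,c,c\n"

# Split into lines, then into comma-separated fields (empty fields are kept)
rows = [split(line, ',') for line in split(chomp(raw), '\n')]
header = rows[1]

# Collect (data row number, column name) for every empty field
empties = [(r - 1, String(header[c]))
           for r in 2:length(rows) for c in 1:length(header)
           if isempty(rows[r][c])]
println(empties)
```

The positions found this way line up with the missing entries of x: column C in row 1, column D in row 2, and column B in row 3.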

We can also load it back. Passing use_mmap=false disables memory mapping, so that on Windows the file can be deleted in the same session.

In [7]:
y = CSV.read("x.csv", use_mmap=false)
Out[7]:
Row  A      B        C        D
     Bool⍰  Int64⍰   String⍰  String⍰
1    true   1        missing  a
2    false  2        b        missing
3    true   missing  c        c

When a DataFrame is loaded from a CSV file, all columns allow Missing by default. Note that the column types have changed: A now allows missing values, and D was read back as String rather than Char.

In [8]:
eltypes(y)
Out[8]:
4-element Array{Type,1}:
 Union{Missing, Bool}  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
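
If the original Char type of column D matters, it can be restored by hand after loading. A minimal sketch, assuming every non-missing entry is a one-character string; a plain vector stands in for the column here, and the names d_strings and d_chars are illustrative:

```julia
# Stand-in for column D as it comes back from the CSV file
d_strings = Union{Missing, String}["a", missing, "c"]

# Take the first character of each string and pass missing through unchanged
d_chars = Union{Missing, Char}[ismissing(s) ? missing : first(s) for s in d_strings]
println(d_chars)
```

The typed comprehension fixes the element type to Union{Missing, Char}, matching column D of x.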

Now let's save x to a file in a binary format; make sure that x.jld does not exist in your working directory.

In [ ]:
save("x.jld", "x", x)

After loading in x.jld as y, y is identical to x.

In [ ]:
y = load("x.jld", "x")

Note that the column types of y are the same as those of x!

In [ ]:
eltypes(y)
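
Since the JLD cells are not run here, the type-preserving round trip can be illustrated with a different technique: Julia's Serialization standard library. This is a swapped-in sketch, not JLD, and a plain vector stands in for a column of x. Note that the serialization format is not guaranteed to be readable across Julia versions, so it is not suited to long-term storage.

```julia
using Serialization  # standard library; used here instead of JLD

# Stand-in for column D of x
col = Union{Missing, Char}['a', missing, 'c']

path = tempname()                          # temporary file outside the working directory
open(io -> serialize(io, col), path, "w")  # save in a binary format
col2 = open(deserialize, path)             # load it back
rm(path)                                   # remove the temporary file

println(eltype(col2))
```

Unlike the CSV round trip, the element type comes back unchanged.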

Next, we'll create the files bigdf.csv and bigdf.jld, so be careful that you don't already have these files on disk!

In particular, we'll time how long it takes us to write a DataFrame with 10^3 rows and 10^2 columns to .csv and .jld files. You can expect JLD to be faster! Use compress=true to reduce file sizes.

In [ ]:
bigdf = DataFrame(Bool, 10^3, 10^2)
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size)
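
The measurement pattern in this cell (time a write, then read the file size from stat) can be tried without CSV or JLD. The following is a sketch using only Base Julia; timed_write and the file name demo.bin are hypothetical, and a raw Bool matrix of the same shape stands in for bigdf:

```julia
# Hypothetical helper: write `data` to a temporary file and report
# the elapsed time in seconds and the resulting file size in bytes.
function timed_write(data)
    mktempdir() do dir
        path = joinpath(dir, "demo.bin")      # illustrative file name
        elapsed = @elapsed write(path, data)  # time the raw binary write
        (elapsed = elapsed, bytes = stat(path).size)
    end
end

result = timed_write(rand(Bool, 10^3, 10^2))  # same shape as bigdf
println(result.bytes, " bytes written in ", result.elapsed, " s")
```

Because the file lives in a temporary directory that is removed when the do block ends, nothing in the working directory is touched.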

Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [9]:
#foreach(rm, ["x.csv", "x.jld", "bigdf.csv", "bigdf.jld"])
rm("x.csv")
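
A more defensive cleanup removes only the files that actually exist, so files that were never created (such as the .jld files when JLD is unavailable) do not raise an error. A minimal sketch; cleanup is a hypothetical helper, not part of any package:

```julia
# Hypothetical helper: remove only the listed files that exist
# and return the names of the files that were removed.
function cleanup(files)
    removed = String[]
    for f in files
        if isfile(f)
            rm(f)
            push!(removed, f)
        end
    end
    return removed
end
```

Calling cleanup(["x.csv", "x.jld", "bigdf.csv", "bigdf.jld"]) would then delete whichever of these files are present, so the same warning applies: do not run it unless you are sure it will not erase important files.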