Working with data

Author

Mattias Villani

Reading and writing data in Julia

CSV files can be read with the CSV.jl package:

using CSV, DataFrames
titanic = CSV.read("data/titanic.csv", DataFrame);
first(titanic, 3)
3×8 DataFrame
 Row │ survived  pclass  name                                                sex      age      sibling/spouse  parent/child  fare
     │ Int64     Int64   String                                              String7  Float64  Int64           Int64         Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0       3  Mr. Owen Harris Braund                              male        22.0               1             0     7.25
   2 │        1       1  Mrs. John Bradley (Florence Briggs Thayer) Cumings  female      38.0               1             0  71.2833
   3 │        1       3  Miss. Laina Heikkinen                               female      26.0               0             0    7.925
  • Use CSV.Rows or CSV.Chunks to read only some of the observations at a time.

  • With the StringEncodings.jl package, CSV.jl can handle other character encodings.

  • Data can be read from a URL with the help of the standard library Downloads.jl:

using CSV, DataFrames, Downloads
url = "https://github.com/mattiasvillani/Julia4Stats/raw/main/data/titanic.csv";
http_response = Downloads.download(url);  # download returns the path to a local temporary file
titanic = CSV.read(http_response, DataFrame);
first(titanic, 3)
3×8 DataFrame
 Row │ survived  pclass  name                                                sex      age      sibling/spouse  parent/child  fare
     │ Int64     Int64   String                                              String7  Float64  Int64           Int64         Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │        0       3  Mr. Owen Harris Braund                              male        22.0               1             0     7.25
   2 │        1       1  Mrs. John Bradley (Florence Briggs Thayer) Cumings  female      38.0               1             0  71.2833
   3 │        1       3  Miss. Laina Heikkinen                               female      26.0               0             0    7.925
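
Writing a file, and the CSV.Rows option from the list above, can be sketched together. This is only a minimal illustration under stated assumptions: the people table and the temporary file path are made up for the example, and CSV.Rows is used with its default settings, where values arrive as strings and are parsed by hand.

```julia
using CSV, DataFrames

# Made-up data, written to a temporary file so the example is self-contained
people = DataFrame(name = ["Alice", "Bob", "Charlie"], age = [25, 30, 35])
path = tempname() * ".csv"
CSV.write(path, people)

# Stream the file one row at a time instead of materializing the whole table.
# With default settings CSV.Rows hands back each value as a string,
# so the age is converted explicitly before it is added up.
function mean_age(path)
    total, n = 0.0, 0
    for row in CSV.Rows(path)
        total += parse(Float64, String(row.age))
        n += 1
    end
    return total / n
end

avg = mean_age(path)
```

Only a running total is kept in memory here, which is the point of row-wise reading when a file is too large to load at once.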

Managing data

DataFrames.jl is the original dataframe package in Julia.

  • A DataFrame (a labeled table) can be constructed in several ways, for example from keyword arguments:
using DataFrames
df = DataFrame(
    name = ["Alice", "Bob", "Charlie"],
    age  = [25, 30, 35],
    score = [88, 92, 95]
)
3×3 DataFrame
 Row │ name     age    score
     │ String   Int64  Int64
─────┼──────────────────────
   1 │ Alice       25     88
   2 │ Bob         30     92
   3 │ Charlie     35     95
  • DataFrames.jl has data wrangling functions such as transform, filter and combine.
df2 = transform(df, :age => (x -> 2x) => :age_doubled)
3×4 DataFrame
 Row │ name     age    score  age_doubled
     │ String   Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │ Alice       25     88           50
   2 │ Bob         30     92           60
   3 │ Charlie     35     95           70

The data frame can be modified in place. By Julia convention, an exclamation mark (!) at the end of a function name denotes a mutating function:

transform!(df, :age => (x -> 2x) => :age_doubled)      # Modify df in place
3×4 DataFrame
 Row │ name     age    score  age_doubled
     │ String   Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │ Alice       25     88           50
   2 │ Bob         30     92           60
   3 │ Charlie     35     95           70
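
The same naming convention runs through Base Julia, not only DataFrames.jl. A minimal sketch with sort and sort!, using a made-up vector:

```julia
v = [3, 1, 2]
v_before = copy(v)

w = sort(v)                 # returns a sorted copy; v is left untouched
unchanged = (v == v_before) # true: the non-mutating version did not alter v

sort!(v)                    # sorts v itself, in place
```

After the last line v and w hold the same sorted values, but only sort! changed the original vector.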

We can select a subset of the variables:

df_selected = select(df, :name, :age_doubled)
3×2 DataFrame
 Row │ name     age_doubled
     │ String   Int64
─────┼─────────────────────
   1 │ Alice             50
   2 │ Bob               60
   3 │ Charlie           70

Or keep only the observations that satisfy a condition:

df_filtered = filter(:age => >(28), df)
2×4 DataFrame
 Row │ name     age    score  age_doubled
     │ String   Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │ Bob         30     92           60
   2 │ Charlie     35     95           70
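
The combine function, listed earlier alongside transform and filter, is typically paired with groupby to compute per-group summaries. A minimal sketch follows; the people table and its group column are hypothetical and exist only for this illustration.

```julia
using DataFrames, Statistics

# Hypothetical data: the `group` column is made up for the example
people = DataFrame(
    name  = ["Alice", "Bob", "Charlie", "Diana"],
    group = ["a", "b", "a", "b"],
    score = [88, 92, 95, 79],
)

# Split the rows by group (sorted by key), then reduce each group
# to one summary row: the mean score and the number of rows
grouped = groupby(people, :group; sort = true)
summary = combine(grouped, :score => mean => :mean_score, nrow => :n)
```

The result has one row per group, which is the main difference from transform, where the number of rows is preserved.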

DataFramesMeta.jl uses Julia macros to @chain (pipe) together data manipulations, inspired by tidyverse in R.

using DataFramesMeta
df_extra = DataFrame(
    name = ["Alice", "Bob", "Diana"],
    city = ["Stockholm", "Gothenburg", "Malmö"]
)
result = @chain df begin
    @transform(:age_doubled = 2 .* :age)
    @subset(_, :score .> 90)       # pipes to first argument, here explicitly using _
    @select(:name, :score, :age_doubled) 
    leftjoin(df_extra, on = :name)          # normal DataFrames function
end
2×4 DataFrame
 Row │ name     score  age_doubled  city
     │ String   Int64  Int64        String?
─────┼────────────────────────────────────────
   1 │ Bob         92           60  Gothenburg
   2 │ Charlie     95           70  missing
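
The missing value for Charlie appears because the left join keeps every row of the left table, even when the right table has no matching name. One way to deal with it afterwards is sketched below with two small made-up tables; the placeholder "unknown" is an arbitrary choice for the example.

```julia
using DataFrames

# Made-up tables mirroring the join above
left  = DataFrame(name = ["Bob", "Charlie"], score = [92, 95])
right = DataFrame(name = ["Bob"], city = ["Gothenburg"])

joined = leftjoin(left, right, on = :name)
sort!(joined, :name)   # fix the row order so the result is easy to inspect

# Replace each missing city by a placeholder string
joined.city = coalesce.(joined.city, "unknown")
```

Alternatively, dropmissing(joined, :city) would discard the unmatched rows instead of filling them in.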

TidierData.jl is a Julia re-implementation of the dplyr and tidyr packages from R.

Tidier.jl is a meta-package, similar to the tidyverse package in R.

using CSV, DataFrames, Downloads, Tidier

# Read data from a URL using the standard library Downloads
url = "https://github.com/mattiasvillani/Julia4Stats/raw/main/data/titanic.csv";
http_response = Downloads.download(url);
titanic = CSV.read(http_response, DataFrame)

## TidierData.jl for data wrangling using @chain macro
titanic2 = @chain titanic begin
    @mutate(survived = survived == 1)
    @mutate(first_class = pclass == 1)
    @filter(fare > 10)
    @select(name, survived, age, sex, first_class)
end
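
For comparison, roughly the same pipeline can be written with plain DataFrames.jl functions. This is a sketch on three made-up rows that only borrow the column names of the Titanic file, so it runs without downloading anything; it is not meant as the canonical translation of the Tidier code.

```julia
using DataFrames

# Three made-up rows with the same column names as the Titanic data
toy = DataFrame(
    survived = [0, 1, 1],
    pclass   = [3, 1, 3],
    name     = ["A", "B", "C"],
    fare     = [7.25, 71.28, 12.0],
)

# ByRow applies the comparison element by element, like @mutate does
toy2 = transform(toy,
    :survived => ByRow(==(1)) => :survived,
    :pclass   => ByRow(==(1)) => :first_class)
toy2 = filter(:fare => >(10), toy2)                 # keep fares above 10
toy2 = select(toy2, :name, :survived, :first_class)
```

The Tidier macros let bare column names stand in for the :column => function => :new_column pairs used here.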

Plots

Plots.jl - a meta plotting package with many backends

Makie.jl - gaining popularity, advanced features

TidierPlots.jl - Julia implementation of R’s ggplot2.

Plotly.jl - interface to the plot.ly library.

and many many more.

Here is an example using Plots.jl to plot the mtcars data from RDatasets.jl:

using Plots, LaTeXStrings, RDatasets, GLM

mtcars = dataset("datasets", "mtcars")

# Make a scatter plot of Horsepower vs Miles per gallon
scatter(mtcars.HP, mtcars.MPG, 
     xlabel = "Horsepower", ylabel = "Miles per gallon",
     title = "MPG vs Horsepower", label = "Data points", 
     legend = :topright, color = :blue)

# Fit a linear model using GLM.jl
using GLM
lm_model = lm(@formula(MPG ~ HP), mtcars)

# Add the fitted line to the plot, note the mutating plot! function
plot!(mtcars.HP, predict(lm_model), 
     label = "Fitted line", color = :red, linewidth = 2)  

# Add a LaTeX string to the title
βhat = round.(coef(lm_model), digits = 3)
plot!(title = L"\beta_0 = %$(βhat[1])"* " and "* L"\beta_1 = %$(βhat[2])") 
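
To see what the fitted line represents, the least-squares coefficients can also be computed directly from a design matrix. The sketch below uses four made-up data points rather than mtcars, and the backslash operator in place of GLM.jl, so it illustrates the computation without claiming to reproduce how lm works internally.

```julia
using LinearAlgebra

# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

# Design matrix with an intercept column, as implied by a formula like y ~ x
X = hcat(ones(length(x)), x)

# Least-squares solution of X * β ≈ y
β = X \ y

# Fitted values, the analogue of predict(lm_model) in the example above
fitted = X * β
```

For these numbers the intercept is 1.15 and the slope is 1.94, up to floating-point rounding.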

Some random examples from my teaching

Switching backends

Let us first plot a surface in Plots.jl:

using Plots, LaTeXStrings

# Plot a surface with the gr backend
gr()
xs = range(-4, 4; length=150)
ys = range(-4, 4; length=150)
f(x, y) = sin(x) * cos(y) * exp(-(x^2 + y^2)/8)
Plots.surface(xs, ys, f; xlabel = L"x", ylabel = L"y", zlabel = L"f(x,y)", 
    legend=false, camera = (30, 60))

Now we switch the backend to PlotlyJS to get an interactive plot that we can pan, zoom and rotate:

import PlotlyJS
plotlyjs() # switching to plotlyjs for interactive plots
Plots.surface(xs, ys, f; xlabel = "x", ylabel = "y", zlabel = "f(x,y)", 
    legend=false, camera = (30, 60))
