Analyzing email spam data with a Bernoulli model
A notebook for the book Bayesian Learning by Mattias Villani
Problem
The SpamBase dataset from the UCI repository consists of $n = 4601$ emails, each labeled as spam (junk email) or ham (non-junk email).
The dataset also contains a vector of covariates/features for each email, such as the number of capital letters or dollar signs; this information can be used to build a spam filter that automatically separates spam from ham.
This notebook analyzes only the proportion of spam emails, without using the covariates.
Getting started
First, load the libraries and set up the colors.
options(repr.plot.width=16, repr.plot.height=5, lwd = 4)
library("RColorBrewer") # for pretty colors
library("tidyverse") # for string interpolation to print variables in plots.
library("latex2exp") # the TeX() function makes it possible to print latex math
colors = brewer.pal(12, "Paired")[c(1,2,7,8,3,4,5,6,9,10)]
Data
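# Note: with header = TRUE the first row of the file is used as column names,
# which is why the spam label column ends up being called X1.
data = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", sep=",", header = TRUE)
spam = data$X1 # This is the binary data where spam = 1, ham = 0.
n = length(spam)
spam = sample(spam, size = n) # Randomly shuffle the data.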
Model, Prior and Posterior
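Model
$$X_1,\ldots,X_n \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{Bern}(\theta)$$
Prior
$$\theta \sim \mathrm{Beta}(\alpha,\beta)$$
Posterior
$$\theta \mid x_1,\ldots,x_n \sim \mathrm{Beta}(\alpha+s,\beta+f)$$
where $s=\sum_{i=1}^n x_i$ is the number of spam emails and $f=n-s$ is the number of ham emails.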
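For readers who want to see why the posterior is again a Beta distribution, the short derivation below sketches the conjugate update. It assumes the Bernoulli model with a $\mathrm{Beta}(\alpha,\beta)$ prior that the code in this notebook implements, with $s$ the number of ones (spam) and $f = n - s$ the number of zeros (ham) among $x_1,\ldots,x_n$.

```latex
\begin{aligned}
p(\theta \mid x_1,\ldots,x_n)
  &\propto p(x_1,\ldots,x_n \mid \theta)\, p(\theta) \\
  &\propto \theta^{s}(1-\theta)^{f} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \\
  &= \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}
\end{aligned}
```

The last line is the kernel of a $\mathrm{Beta}(\alpha+s,\beta+f)$ density, so the posterior mean is $(\alpha+s)/(\alpha+\beta+n)$.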
Let us define a function that computes the posterior and plots it.
BernPost <- function(x, alphaPrior, betaPrior, legend = TRUE){
    thetaGrid = seq(0,1, length = 1000)
    n = length(x)
    s = sum(x) # number of successes (spam)
    f = n - s  # number of failures (ham)
    alphaPost = alphaPrior + s
    betaPost = betaPrior + f
    priorPDF = dbeta(thetaGrid, alphaPrior, betaPrior)
    normLikePDF = dbeta(thetaGrid, s + 1, f + 1) # Trick to get the normalized likelihood
    postPDF = dbeta(thetaGrid, alphaPost, betaPost)

    # Empty canvas, then one curve each for prior, normalized likelihood and posterior
    plot(1, type="n", axes=FALSE, xlab = expression(theta), ylab = "",
      xlim=c(min(thetaGrid),max(thetaGrid)),
      ylim = c(0,max(priorPDF,postPDF,normLikePDF)),
      main = TeX(sprintf("Prior: $\\mathrm{Beta}(\\alpha = %0.0f, \\beta = %0.0f)$", alphaPrior, betaPrior)))
    axis(side = 1)
    lines(thetaGrid, priorPDF, type = "l", lwd = 4, col = colors[6])
    lines(thetaGrid, normLikePDF, lwd = 4, col = colors[2])
    lines(thetaGrid, postPDF, lwd = 4, col = colors[4])
    if (legend){
      legend(x = "topleft", inset=.05,
        legend = c("Prior", "Likelihood (normalized)", "Posterior"),
        lty = c(1, 1, 1), pt.lwd = c(3, 3, 3),
        col = c(colors[6], colors[2], colors[4]))
    }
    cat("Posterior mean is ", round(alphaPost/(alphaPost + betaPost),3), "\n")
    cat("Posterior standard deviation is ", round(sqrt( alphaPost*betaPost/( (alphaPost+betaPost)^2*(alphaPost+betaPost+1))),3), "\n")
    return(list("alphaPost" = alphaPrior + s, "betaPost" = betaPrior + f))
}
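The comment "Trick to get the normalized likelihood" in BernPost can be checked numerically. The sketch below is not part of the original notebook: it uses hypothetical counts (s = 3, f = 7, not taken from the data), normalizes the Bernoulli likelihood $\theta^s(1-\theta)^f$ with `integrate()`, and compares the result with `dbeta(theta, s + 1, f + 1)`.

```r
# Hypothetical counts for illustration only (not from the SpamBase data).
s <- 3   # number of successes
f <- 7   # number of failures

# Unnormalized Bernoulli likelihood as a function of theta.
likelihood <- function(theta) theta^s * (1 - theta)^f

# Normalizing constant: the area under the likelihood on [0, 1].
normConst <- integrate(likelihood, lower = 0, upper = 1)$value

thetaGrid <- seq(0.01, 0.99, length.out = 99)
normLike <- likelihood(thetaGrid) / normConst
betaPDF <- dbeta(thetaGrid, s + 1, f + 1)

# Largest absolute difference between the two curves; should be close to zero.
maxDiff <- max(abs(normLike - betaPDF))
print(maxDiff)
```

The two curves agree up to numerical error because the likelihood, viewed as a function of $\theta$, is proportional to a Beta(s + 1, f + 1) density.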
Let us start by analyzing only the first 10 data points.
n = 10
x = spam[1:n]
par(mfrow = c(1,3))
post = BernPost(x, alphaPrior = 1, betaPrior = 5, legend = TRUE)
post = BernPost(x, alphaPrior = 5, betaPrior = 5, legend = FALSE)
post = BernPost(x, alphaPrior = 5, betaPrior = 1, legend = FALSE)
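One way to quantify how strongly each prior pulls on the posterior is to write the posterior mean as a weighted average, $(\alpha+s)/(\alpha+\beta+n) = w\,\alpha/(\alpha+\beta) + (1-w)\,s/n$, where $w = (\alpha+\beta)/(\alpha+\beta+n)$ is the weight on the prior mean. The sketch below tabulates $w$ for the three priors used in this notebook. The helper name `priorWeight` is hypothetical, and the sample sizes are chosen for illustration (4601 is the number of emails in SpamBase).

```r
# Hypothetical helper (not part of the notebook): weight on the prior mean
# in the posterior mean of a Beta-Bernoulli model.
priorWeight <- function(n, alphaPrior, betaPrior) {
  (alphaPrior + betaPrior) / (alphaPrior + betaPrior + n)
}

# The three priors used in the notebook, one per row.
priors <- data.frame(alphaPrior = c(1, 5, 5), betaPrior = c(5, 5, 1))

# Sample sizes: 10, 100, and 4601 (the number of emails in SpamBase).
sampleSizes <- c(10, 100, 4601)

# Rows: priors, columns: sample sizes.
weights <- sapply(sampleSizes, function(n) {
  priorWeight(n, priors$alphaPrior, priors$betaPrior)
})
rownames(weights) <- paste0("Beta(", priors$alphaPrior, ",", priors$betaPrior, ")")
colnames(weights) <- paste0("n=", sampleSizes)
print(round(weights, 3))
```

With $n = 10$ the prior mean carries a weight between 0.375 and 0.5, so the choice of prior matters a great deal; the weight shrinks quickly as $n$ grows.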
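Since we only have $n = 10$ data points, the prior still has a clear influence and the three priors give noticeably different posteriors. Let us try with the first $n = 100$ observations.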
n = 100
x = spam[1:n]
par(mfrow = c(1,3))
post = BernPost(x, alphaPrior = 1, betaPrior = 5, legend = TRUE)
post = BernPost(x, alphaPrior = 5, betaPrior = 5, legend = FALSE)
post = BernPost(x, alphaPrior = 5, betaPrior = 1, legend = FALSE)
The effect of the prior is now almost gone. Finally, let's use all the emails in the dataset.
x = spam
par(mfrow = c(1,3))
post = BernPost(x, alphaPrior = 1, betaPrior = 5, legend = TRUE)
post = BernPost(x, alphaPrior = 5, betaPrior = 5, legend = FALSE)
post = BernPost(x, alphaPrior = 5, betaPrior = 1, legend = FALSE)
We see two things:

* The effect of the prior is completely gone. All three priors give practically identical posteriors. We have reached a subjective consensus among the three persons.
* We are now quite sure about the spam probability $\theta$: the posterior is tightly concentrated around the observed proportion of spam.
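How sure we are can be made concrete with a credible interval computed from the Beta posterior. The sketch below is an illustration rather than part of the original analysis: the counts `s` and `f` are hypothetical values of roughly the size of the full dataset (not the actual counts), and the Beta(1, 5) prior is one of the three priors used above.

```r
# Hypothetical counts, roughly the size of the full data set (not the actual counts).
s <- 1800   # number of spam emails (assumed)
f <- 2800   # number of ham emails (assumed)

# Prior hyperparameters, here the Beta(1, 5) prior from the notebook.
alphaPrior <- 1
betaPrior <- 5

# Conjugate update: Beta(alphaPrior + s, betaPrior + f).
alphaPost <- alphaPrior + s
betaPost <- betaPrior + f

# Equal-tail 95% credible interval from the Beta quantile function.
credInt <- qbeta(c(0.025, 0.975), alphaPost, betaPost)
postMean <- alphaPost / (alphaPost + betaPost)

cat("Posterior mean:", round(postMean, 3), "\n")
cat("95% credible interval:", round(credInt, 3), "\n")
```

In the notebook itself, the same two `qbeta()` quantiles could be computed from the `alphaPost` and `betaPost` values in the list returned by BernPost.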
A later notebook will re-analyze this data using, for example, logistic regression.