Skip to content

Bayes The Fun Way - Will Kurts - Notes

How do you get the Probability of your hypothesis \(P(H)\) from the odds \(O(H)\) of it?

\[ P(H) = \frac{O(H)}{O(H) + 1} \]

How do you get the Odds of your hypothesis \(O(H)\) from the probability \(P(H)\) of it?

The odds of the hypothesis is the ratio of the probability of it occuring to the probability of it not occuring

\[ O(H) = \frac{P(H)}{1-P(H)} \]

How can you refer to the negation of a probability of something happening?

\[ \neg P(H) \]

How do you refer to the set of all possible events?

\[ \Omega \]

How do you refer to the possibility of A occuring, GIVEN the case of B (or CONDITIONAL on B)

\[ P(A | B) \]

How do you refer to the possibility of A AND B occuring?

\[ P(A,B) \]

How do you refer to the possibilty of A OR B occuring?

\[ P(A \text{ OR } B) \]

How do you refer to the possibility of A NOT occuring?

\[ \neg P(A) \]

How do you calculate the possibility of A AND B occuring?

\[ P(A,B) = P(A) \times P(B) \]

What do you call calculating the possibility of A AND B occuring?

The PRODUCT RULE

How do you calculate the possibility of A OR B occuring?

\[ P(A \text{ OR } B) = P(A) + P(B) - P(A,B) \]

What do you call calculating the possibilty of A OR B occuring?

The SUM RULE

Binomial Distribution

What is the binomial distribution?

It is a probability distribution, looking at the probabilty of a certain number of successful outcomes. When the probability of a single successful outcome is set.

In a binomial distribution something can be yes/no T/F etc. It only has two outcomes, it is binomial not multinomial.

Notation for a binomial distribution is:

\[ B(n,p) \]

Where \(B\) just means binomial distribution, \(n\) is the number of trials (e.g. number of patients in a clinical trial), and \(p\) is the probability of a successful outcome (e.g. any given patient surviving).

You build up a distribution by working out values for each occurence of:

\[ B(k;n,p) \]

Where \(k\) is the amount of successes (i.e. what is the probability that exactly ten patients in our trial survive).

To work out this full distribution we use the Binomial Coeefficient

\[ \binom{n}{k} \]

What does this coefficient mean? It's the amount of ways you could acheive a given \(k\) with a given \(n\). It's the shorthand for this below:

\[ \binom{n}{k} = \frac{n!}{k!\times(n-k)!} \]

\(n!\) means the factorial of \(n\) (i.e. if \(n=5, n!=5\times4\times3\times2\times1\))

In R there is the binomial coefficient function choose(n,k).

So then we know that a way to describe \(B(k;n,p)\) is:

\[ B(k;n,p) = \binom{n}{k} \times P(\text{Desired Outcome}) \]

Where we have multiplied the binomial coefficient against the probability of the desired outcome.

The Probability of our desired outcome is dependent on the probability of a single event succeeding \(p\), and also \(n\) and \(k\), it can be described as

\[ P(\text{Desired Outcome}) = p^{k} \times (1-p)^{n-k} \]

So this is the probability of success happening exactly k times, and the probability of failure happening the rest of the times.

So the full description of the shorthand of \(B(k;n,p)\) is:

\[ B(k;n,p) = \binom{n}{k} \times p^{k} \times (1-p)^{n-k} \]

This full description is called the Probability Mass Function

The way I would calculate this in R is using the dbinom(k,n,p) function. this would give the probability of exactly this outcome. So for example I'm going to flip a coin 10 times, so \(n = 10\), I want to know the probability of getting exactly 3 heads, so \(k = 3\), and it's a fair coin with a 50/50 chance of heads, so \(p = 0.5\).

dbinom(3,10,0.5)
[1] 0.1171875

So, to then work out the binomial distribution for \(B(n,p)\), we would just need to work through the above, for all values of \(k\). You'd then plot \(k\) on the x-axis and the output from the probability mass function on the y-axis.

So what if I want to describe the full set of Probability Mass Functions, but don't want to write them out one by one for each value of \(k\)?

What I can do here is use the \(\sum\) (summation symbol) Say I care about knowing about the chance of at least one success in my hundred trials, this would be:

\[ \sum^{100}_{k = 1}\binom{100}{k}\times p^k \times (1-p)^{100-k} \]

If I wanted to calculate this in R I would use the pbinom() function

pbinom(0,100,0.5,lower.tail=FALSE)

In the above, the first argument is roughly k, the second is exactly n, the third is p (the probability of any single event succeeding), the fourth and final argument is important though. When lower.tail = FALSE, the output of the function will be the probability of any value greater than the first argument. So if I care about at least 1 success (my k), I need to put 0 as the argument, as I want more than 0. If I put lower.tail=TRUE then I'd be looking at the probability of any value of k less than the first argument. Say I wanted to know the chance of 60 or fewer successes, then I'd have the first argument as 61

If I wanted to plot this distribution in R:

r$> plot(function(p) dbinom(p,100,0.5),0,100)

Beta Distribution

Key point for difficulties: - Beta distributions care about ranges, - Binomial distributions care about discrete values - As in, I give dbeta() a range of values for p, - but I would give dbinom() a single value for k

In a binomial distribution, I know the probability \(p\) of a single event, and I want to know how probable it is that I see \(k\) amount of events, after trying \(n\) amont of times.

I describe that as this:

\[ B(k;n,p) \]

In a beta distribution, I'm coming at it the other way around. I've seen \(k\) amount of events already, after trying \(n\) amount of times. I want to work out what the most likely value for \(p\) is from this.

I make some changes though, there's an infinite range of possible values for \(p\), it's continuous rather than discrete, so I need to take that into account. This means I don't use \(k,n\) and \(p\). Instead I ise \(\alpha, \beta\) and \(p\).

\(\alpha\) is the same as \(k\), it's the amount of events I'm interested in, but \(\beta\) is not the same as \(n\). It's the amount of events I'm not interested in. It's \(n-k\). Or \(n-\alpha\).

\[ Beta(p;\alpha,\beta) \]

This Beta distribution can be described as

\[ Beta(p;\alpha,\beta) = \frac{p^{\alpha-1} \times (1-p)^{\beta-1}}{beta(\alpha,\beta)} \]

The top part of the fraction, is very similar to what's seen in the binomial distribution. The bottom part is the integral from 0-1, for different values of p. Apparently this will be explained more later.

Basically you add on this bit with the integral, and also the subtraction of 1 from \(\alpha\) and \(\beta\) to normalise the data.

The beta distribution when plotted gives you another distribution curve, of different likelihoods for the true value of \(p\), given your current data. Each of these different likelihoods themselves was the output of a binomial distribution. So a beta distribution is a distribution of distributions (and how well these distributions would explain the data we had seen) (I think).

I would use the \(Beta\) distribution to work out how likely it is that the true value for \(p\) sits within a certain range.

To do this, I need to add up all the different possible values for \(p\) that sit under a certain part of the distribution curve. If I were to sum up all of them between p being 0 and 1, I would get the likelihood of 1, that the true value sits somewhere between there (because when you add all possibilities together, you get 1).

To sum up how likely it is that the true value for \(p\) sits within a certain range, I need to use an integral or integration.

This is represented with \(\int_{bottom}^{top}\). The \(\int\) means the integral, and the \(top\) is the top value of the range, and the \(bottom\) is the bottom value in our range.

So if I wanted to desribe the likelihood that the true value for \(p\) sat between 0.45 and 0.55 I would write

\[ \int_{0.45}^{0.55} \frac{p^{\alpha-1}\times(1-p)^{\beta-1}}{beta(\alpha,\beta)} \]

The way I would do that in R code is I would write:

integrate(function(p) dbeta(p, a, b), 0.45, 0.55)
# a = alpha, our observed events/successes
# b = beta, our remaining attempts in trial/failures

If I wanted to plot the beta distribution of my observed trial (lets say there were a total \(n\) of 100 attempts, with a success rate (\(\alpha\)) of 20, and a failure rate (\(\beta\)) of 80),

plot(function(p) dbeta(p, 20,80),0,1)
# This will plot all possible values of p and their probability of being the true value,
# from a min value of p being 0, to a max value of 1
# but ignore the values on the y axis

Conditional Probability

So far we've talked about independent probabilities, the chance that something is true not affecting the chance that something else is true. I.e. the chance that the sky is blue doesn't affect the chance it's tuesday. But some probabilities are dependent, the chance that the sky is blue does affect the chance it's morning or midnight.

We can also talk about conditional probabilities, given that I know it is midnight, what is the probability that the sky is blue? Given that I know the sky is blue, what is the chance that it's midnight?

Say the probability the sky is blue is \(P(b)\), and the probability it is midnight is \(P(m)\)

If I already believe it is midnight, then I could describe the probability that the sky is blue, given that it is midnight as rhe conditional probability \(P(b|m)\)

This then brings us on to Bayes theorum

This states:

\[ P(A|B) = \frac{P(A)\times P(B|A)}{P(B)} \]

The probability of A given B, is the probability of A times the probability of B given A, divided by the probability of B.

It's allowing you to use one conditional probability to work out another. The classic way you could do this is say: call: \(A\) your beliefs \(B\) the observed evidence

You can then go from the probability of the observed events occuring if your beliefs were true \(P(B|A)\), to work out the probabilities of your beliefs being true given the observed events occured. It also means you need to start with a general probability of how likely you believe your beliefs are \(P(A)\), and generally how probable the events were \(P(B)\).

Probability Distributions

What I want to do, is have an estimate for what I believe, measure what happens, and combine these two things.

In combining them, I can work out how likely my previous belief was to be true, and also work out what my new belief is

$$

P(A|B) = \frac{P(B|A)\times P(A)}{P(B)}

$$

\(P(A|B)\) is my posterior belief, my new belief.

\(P(B|A)\) is the likelihood, how likely the data is (\(B\)), if the hypothesis (\(A\)) was true, how likely my prior is to be true, considering the data seen so far.

\(P(A)\) is my prior belief, my hypothesis.

In the beta distribution, I can get a normalised value for \(P(A|B)\) without knowing a value for \(P(B)\).

Remember the beta distribution is

\[ Beta(p;\alpha,\beta) \]

where \(p\) is the probability of a single success, \(\alpha\) is the count of successes, \(\beta\) is the count of failures.

So where the binomial distibution uses \(n, k\). In the beta distribution \(k\) and \(\alpha\) are the same. \(\beta\) is \(n - \alpha\).

To get to my posterior beta distribution, I use the componenets of the beta distributions for my likelihood and my priors.

For example:

Say a building got struck by lightning once a year, for nine years out of ten. That's what I would start with as my values for likelihood.

n = 10
alpha = 9
beta = 1

I think that being struck by lightning is a really uncommon experience. I'd expect a building to get struck by lightning at most one every 500 years. That's what I would use as my prior

n = 500
alpha = 1
beta = 499

I want to know how likely it is that the building will be struck by lightning next year, it should be really unlikely according to my priors. But the data is telling me that something seems off about this building.

\[ Beta(\alpha_{posterior}, \beta_{posterior}) \]
\[ \alpha_{posterior} = \alpha_{likelihood} + \alpha_{prior} = 9 + 1 = 10 \]
\[ \beta_{posterior} = \beta_{likelihood} + \beta_{prior} = 1 + 499 = 500 \]
\[ Beta(\alpha_{posterior}, \beta_{posterior}) = Beta(10,500) \]

This still results in a very low probability that my building will be struck by lightning next year.

plot(function(p) dbeta(p,10,500),0,0.1)

The probability is 10/500 (0.02), 2%. But that's still ten times more likely than my previous estimate of a probability at 0.002, 0.2%.

To expand on this, I might discover a lightning rod on this building, I might assume that a building with a lightning rod has a 1:10 chance of being hit. This means my prior distibution is now:

n = 100
alpha = 50
beta = 50
\[ Beta(\alpha_{posterior}, \beta_{posterior}) \]
\[ \alpha_{posterior} = \alpha_{likelihood} + \alpha_{prior} = 9 + 50 = 59 \]
\[ \beta_{posterior} = \beta_{likelihood} + \beta_{prior} = 1 + 50 = 51 \]
\[ Beta(\alpha_{posterior}, \beta_{posterior}) = Beta(59,51) \]

Now my posterior says the probability is slightly more likely than 50:50

plot(function(p) dbeta(p,59,51),0.4,0.6)

Distribution from the Mean

We want to measure the spread of error of our measurements from the true value. The distance from our measurements to the true value is called just that, the error. However we don't know the true value, so we can't talk about it or the error as such. Instead we are going to use the mean, and differences from it.

There are a few ways to do that.

The first is the Mean Absolute Difference. This is where you look at the average distance each measurement has from the mean. When we say the Absolute difference, that's something specific. The absolute value is the distance that value has from 0. E.g. both 6 and -6 have an absolute value of 6. This would be represented as \(|6|\). Because the spread of differences from the mean should be going equally out both in a positive and negative direction, if you didn't use an absolute value for the difference, they would cancel each other out and the average difference from the mean would be 0.

So this would be written out as:

\[ MAD(x) = \frac{1}{n} \times \sum^n_1|x_i-\mu| \]

The Mean Absolute Difference is the average value (where multiplying by \(\frac{1}{n}\) comes from) of the sum (\(\sum_1^n\)) of the absolute values (\(| |\)) of the difference between our measurements (\(x\)) and the mean (\(\mu\)).

Advantages of this are it's relatively simple to understand, and it gives a meaningful value, if we are measuring in seconds, the MAD will be also using seconds. Disadvantages is it can be heavily influenced by outliers.

The next option would be to use the variance. This attempts to deal with outliers a bit more, and also has an advantage of being more straightforward mathematically (apparently, I don't really know enough maths for that bit!)

Here rather than absolute difference, you use the square of the difference, so the formula looks like:

\[ Var(x) = \frac{1}{n} \times \sum_1^n (x_i - \mu)^2 \]

So I've just changed out that bit towards the end, for the squared difference.

The variance has the advantage of dealing with outliers better, but has the disadvantage of not sharing same units as what I'm measuring anymore.

One last way to deal with that is to use standard deviation

Here you take the square root of the variance to get back to units that go along with our measurements.

\[ \sigma = \sqrt{ \frac{1}{n} \times \sum_1^n (x_i - \mu)^2 } \]

Lastly, in variance, and so also in standard deviation, there's one other change you might need on a day to day basis. What I've put above is the formulae for both. However, if you are calculating the variance or sd for samples, you want to correct for bias even more, with something called Bessels correction. Basically this says that rather than dividing your sum difference from the mean, by the size of your sample, you instead divide by your sample minus 1. This increases the estimated values for variance and standard deviation. You only do this for samples. You wouldn't do this if you had the whole population.

So it would be as an example

\[ Var(x) = \frac{1}{n-1} \times \sum^n_1(x_i - \mu)^2 \]

Normal Distribution

r$> curve(dnorm(x, mean=0, sd=1), from = -4,to = 4, ylab="Density", xlab="Value", main="Normal Distribution")
#' This plots a curve of values taken from dnorm() with it's specified mean and sd,
#' It plots them from a certain value of x to a certain value of x

In estimating the likelihood of our values occuring within a range of possibilities, we can use the normal distribution.

The normal distribution provides a range of probabilities of values, taking the mean (\(\mu\)) and standard deviation (\(\sigma\)) of observed values as arguments.

\[ N(\mu,\sigma) \]

Like other distributions, we can use the probability density function (PDF) of the normal distribution to estimate the likelihood of an exact value occuring. We can use the cumulative density function (CDF) of the normal distribution to estimate the likelihood of values occuring within a certain range.

  • In R you can calculate the probability density function of the normal distribution of a value using dnorm(x, mean, sd)
  • You can use pnorm(q,mean,sd) to calculate the area under the curve of the normal distribution (the CDF), either with lower.tail=T or lower.tail=F.
  • the other name for the area under the curve is to "integrate the PDF". So pnorm is the integrated range of a set of values for dnorm
  • lower.tail=T will give the CDF for the likelihood of the value occuring between negative inf and q
  • lower.tail=F will give the CDF for the likelihood of the value occuring between q and positive inf
  • If you want to work out a different range that isn't simply greater than or less than,
  • you can do one of two things:
    1. pnorm(greater_value) - pnorm(lesser_value)
    1. integrate(function(x) dnorm(x, mean,sd),lesser_value,greater_value)

As the normal distribution follows a pattern, you can work out how likely values are within certain distributions:

SD Percentage of all values
\(1\sigma\) 68%
\(2\sigma\) 95%
\(3\sigma\) 99.7%

e.g.

r$> integrate(function(x) dnorm(x,0,1),-1,1)
0.6826895 with absolute error < 7.6e-15

r$> pnorm(q=1,mean = 0, sd = 1) - pnorm(q=-1,mean = 0,sd = 1)
[1] 0.6826895

r$> integrate(function(x) dnorm(x,0,1),-2,2)
0.9544997 with absolute error < 1.8e-11

You can see something being described as an x sigma ( \(x\sigma\) ) event. What you mean is something that is as likely to occur as values further out than 1 SD from the mean. e.g. a 1 sigma event should happen 32% of the time

Other R tidbits:

dnorm() # Our probability density function (PDF) for a certain value
pnorm() # Our cumulative density function (the integrated PDF) for values within a range,
        # either -ve inf to q, or q to +ve inf,
        # how probable are these values

rnorm() # Generate n amount of random values from within this normal distribution

qnorm() # rather than how probable are specified values,
        # instead it's what would the values be with a specified probability,

# example for qnorm:
r$> qnorm(p=0.3,mean = 0, sd=1,lower.tail = F)
[1] 0.5244005

# This tells us that values need to be greater than 0.5 in the above normal dist, if they are to occur with 30% probability or less

Parameter Estimation

The parameter might be our estimate for the number of events, or the size of the measurement, or the likelihood of something.

We've talked about the

Probability Density Function, the plot of the how likely values are to be the true value of the thing we are estimating.

You can get a PDF for a certain value by using dbeta()

You can plot it like below:

plot(function(p) dbeta(x= p,shape1= 2,shape2=8),0,1)

Remember, in the beta distribution you have this layout:

\(Beta(p;\alpha,\beta)\)

I've seen \(\alpha\) amount of events and \(\beta\) amount of failures and I want to estimate the probability (\(p\)) of another event if I were to try again.

So the plot above is plotting the likelihood of different estimates of \(p\)

If I then want to work out the probability that \(p\) is at least a certain value, then I'm now talking about a cumulative density function (CDF), were I have summed the area under the curve in my PDF. I have integrated it.

So if I wanted to know how probable it is that \(p\) is less than 0.5, I'd go for:

pbeta(0.5,2,8,lower.tail=F)

If I wanted to work out how probable it is that \(p\) is between 0.1 and 0.3 I'd go for:

pbeta(0.3,2,8)-pbeta(0.1,2,8)

or

integrate(function(p) dbeta(p,2,8),0.1,0.3)

Lastly we can talk about the quantile function for the distribution, this is not the same as quantile() in r.

Instead this would be using qbeta()

Rather than saying, I have a value and I want to know the percentage likelihood of it, or a range and the percentage likelihood that the true value is within it. Instead I'm saying, I want to start with the percentage likelihood (like being 99% certain the true value is within a certain range), and get what values would correspond with it.

If I wanted to know the estimated value that I am 99% certain the true value is greater than I would run

qbeta(0.99,2,8,lower.tail=F)
or
qbeta(0.01,2,8,lower.tail=T)

If I wanted the range that I wanted 99% likelihood that the true value sat within, I could just work out the upper and lower ranges. But remember, if I'm chopping at both sides, rather than splitting 99:1, instead it's 0.5:99:0.5

So

qbeta(0.995,2,8,lower.tail=F) # Lower Bound
qbeta(0.995,2,8,lower.tail=T) # Upper Bound

Parameter Estimation with Prior Probabilities

ChatGPT summary of difference between Beta and Binomial Distribution uses:
    Binomial Distribution:

        Assumes you know the probability of a single event (i.e., success or failure).

        It then gives you the likelihood of observing a certain number of successes in a fixed number of trials based on that known probability.

        For example, if you're flipping a known fair coin (probability p=0.5p=0.5 of getting heads), the binomial distribution can tell you the probability of getting exactly 6 heads in 10 flips.
    Beta Distribution:

        Assumes you have observed certain events (i.e., a certain number of successes and failures).

        Based on those observed events, it describes the distribution of possible probabilities for the likelihood of a single success.

        For instance, if you've observed 6 heads and 4 tails in 10 coin flips and want to make a statement about your belief regarding the true bias of the coin (i.e., the probability of getting heads in a single flip), the beta distribution can capture that belief.

In essence:

    The binomial distribution helps answer questions like: "Given a specific probability of success, how likely is it to see this many successes?"
    The beta distribution helps answer questions like: "Given these observed successes and failures, what's my belief or uncertainty about the true probability of success?"

Ultimately, we've already covered how to do updating of a Beta distribution

\[ Beta(\alpha_{Posterior},\beta_{Posterior}) = Beta(\alpha_{prior} + \alpha_{likelihood},\beta_{prior} + \beta_{likelihood}) \]

\(Prior\): our context dependent information, our previously observed information

\(Likelihood\): our observed data in the study

We can also describe the mean of a Beta distribution with \(\frac{\alpha}{\alpha+\beta}\).

So say in our prior we had observed a percentage of 30% successful events but we didn't know the true numbers, we could say Mean for the Beta distribution is, we know \(\alpha\) is 30, we know \(\alpha + \beta\) is 100, so beta is 70. So our prior is:

\[ Beta(\alpha = 30, \beta = 70) \]

we could make that less informative if we were less certain with:

\[ Beta(\alpha = 3, \beta = 7) \]

or even:

\[ Beta(\alpha = 2, \beta = 3) \]

A/B Testing using Monte Carlo Simulations

When we say A/B testing, we are typically comparing between two groups to see which has a greater effect on our variable of interest. When we say Monte Carlo simulations, we mean random sampling from our distributions. So Monte Carlo A/B testing means:

  1. random sample from distribution in group A
  2. random sample from distribution in group B
  3. compare difference between group A and group B in all samples Step 3 you could do one of two main things:

  4. Is the effect of A > than the effect of B? So Yes/No

  5. By how much is the effect of A > than the effect of B? So numeric estimate of difference

With method 1, you then can get a % estimate of how likely it is that A is better than B

With method 2, you can plot the distribution of likely difference in effect sizes, you could do the CDF, you could get confidence intervals

# R pseudocode for the above

## Posterior Values
A_alp_post <- # Our estimate for posterior value for alpha of A
A_bet_post <- # Our estimate for posterior value for beta of A
B_alp_post <- # Our estimate for posterior value for alpha of B
B_bet_post <- # Our estimate for posterior value for beta of B

## Random Sampling
n_trials <- 100000 

samples_A <- rbeta(n = n_trials, shape1 = A_alp_post, shape2 = A_bet_post)
samples_B <- rbeta(n = n_trials, shape1 = B_alp_post, shape2 = B_bet_post)

## Is A better than B?

samples_AB_better <- samples_A > samples_B
samples_AB_percent <- sum(samples_AB_better)/n_trials

# samples_AB_percent is our percentage estimate that A has a greater effect size than B

## How much is A better than B?

samples_AB_difference <- samples_A / samples_B

hist(samples_AB_difference) # Hist of estimates of effect size

plot(ecdf(samples_AB_difference)) # CDF of estimates of effect size

## ecdf() we are using a vector rather than a distribution function, so we use ecdf() rather than pbeta()

Bayes Factor and Comparing Hypotheses

Going back to Bayes Theorem:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
  • \(P(A|B)\): Posterior Probability
  • How likely is our hypothesis to be true,
  • given that we've observed the data we have
  • \(P(B|A)\): Likelihood
  • How well did our hypothesis explain the data we saw,
  • How likely was the observed data, if our hypothesis had been true
  • \(P(A)\): Prior Probability
  • How strongly did we believe our hypothesis was true, prior to additional info
  • \(P(B)\): The Evidence / The Marginal Likelihood
  • How independently likely was the data in the first place, was it common?
  • Marginal Likelihood: It's the average of the sum of all versions of \(P(B|A_x)\)
  • By summing or integrating all the hypotheses you have "marginalised" them,
  • hence the name

To start off we know how strongly we believed our hypothesis to be true (our prior), using this hypothesis we can calculate how likely it was to see our observed data (our likelihood). We might not know how independently likely the data was in the first place though.

This doesn't always matter. If you are comparing two hypotheses, then \(P(B)\) is the same for both, and can be ignored.

You can describe the proportional form of Bayes theorem with:

\[ P(A|B) \propto P(B|A)P(A) \]

And you can compare hypotheses (say \(A_1\) and \(A_2\)) with the posterior odds:

\[ \frac{P(B|A_1)P(A_1)}{P(B|A_2)P(A_2)} \]

You can split up the posterior odds into two bits:

First you have the ratio of prior probabilities, the prior odds (\(O(A_1)\))

\[ \frac{P(A_1)}{P(A_2)} = O(A_1) \]

The prior odds is how much more strongly you believe one hypothesis over the other to start with.

Second you have the ratio of likelihoods, the likelihood ratio or Bayes Factor

\[ \frac{P(B|A_1)}{P(B|A_2)} \]

This is how many times more strongly one hypothesis explains the data over another. How much better does \(A_1\) explain the data than \(A_2\). So greater than 1 = more likely. 100 = 100x more likely. 0.01 = 100x less likely

So our posterior odds can also be stated as:

\[ \frac{P(A_1|B)}{P(A_2|B)} = O(A_1)\frac{P(B|A_1)}{P(B|A_2)} \]

There are some rough estimates for what different posterior odds mean:

Number Strength
0-3 Weak Evidence To Support \(A_1\) over \(A_2\)
3-20 Moderate Evidence
20-150 Strong Evidence
150+ Overwhelming evidence

Doing Bayesian Data Analysis

2 - Credibility, Models, Parameters

He starts by giving some other terms for how you could describe parts of a mathematical distribution, in:

\[ ND(\mu,\sigma) \]

As mean (\(\mu\)) centers the normal distribution at a point on a plot along the x axis, you can call \(\mu\) a location parameter

As the standard deviation (\(\sigma\)) defines how spread out the tails are on the normal distribution, you can call \(\sigma\) a scale parameter

This is giving a high level overview of what bayesian inference is. Basically, bayesian inference is using a bayesian method of analysis to help determine what model best describes your observed data.

Steps:

  1. First determine what you are trying to do,
  2. what are you trying to predict?
  3. what factors do you think would make good predictors?
  4. Describe a rough descriptive model for the relationship between your predictors and outcome.
  5. Pick meaningful and sensible parameters
  6. As well as meaningful/sensible forms they may be distributed in
  7. Specify your initial priors for these parameters in your descriptive predictive model
  8. Use bayesian inference to test how likely different values for these priors would be
  9. Do a form of posterior predictive check against your observed data

So as an example, in trying to make a comparison between two continuous variables, \(x\) and \(y\)

You may start with a linear model as your rough descriptive model:

\[ \hat{y} = \beta_1x + \beta_0 \]

You can see that's the formula for plotting a straightline on a x-y axis. $$ y = mx + c $$

  • \(\hat{y}\) is the predicted value of \(y\)
  • \(\beta_0\) is the intercept (the value you add to \(x\))
  • \(\beta_1\) is the slope, how much you multiply \(x\) by

That's your descriptive model, but you also need to estimate how accurate your model is, i.e. what's the relationship between \(y\) and \(\hat{y}\), the real value for y and what your prediction for it was.

You may think it follows a roughly normal distribution.

\[ y_i \sim ND(\hat{y_i}, \sigma) \]

The real value for \(y\) is somewhere in the normal distribution of my prediction for y, with a certain standard deviation \(\sigma\)

This leaves us with 3 parameters we need to estimate:

  1. \(\beta_0\) the intercept
  2. \(\beta_1\) the slope
  3. \(\sigma\) the standard deviation of our predictions

We can then plot distributions for all of these parameters, and see what values for each parameter seem most credible. One way to estimate that is to look at the probability distribution for each parameter and see what sits in the HDI (the highest density interval). This could be the 95% HDI.

You then can use your best guesses for each of your parameters in your updated model. And use that model to run a posterior predictive check to see how well your data is explained by your model.

4 - Probability

When talking about all possible potential outcomes in your test, you are talking about the sample space

When talking about the probability of any particular potential outcome, there are two aspects you can describe:

  1. The probabilty - how likely that outcome is (\(\theta\))
  2. The degrees of belief - how certain you are in that probability (\(p(\theta)\)) - the probability of the probability!

Both the probability and the degrees of belief have their own sample spaces (a range of potential values for them)

One way to describe probability is in the terms of it's long run relative frequency, if we did the intervention enough times, what would we see. You can either work this out mathematically, or simulate it.

When determining your degrees of belief, you can also describe it using mathematical terms (basically using a probability distribution function, like the normal distribution). Or you can estimate it in a similar way to betting odds.

Working out a probability distribution is easier for discrete events rather than for continuous values. A sample set of a continuous variable is basically infinite, cos you can chop it up into infinite values. So generally you turn your continuous into discrete, one way or another. First way is to explicitly chop up your continuous range into set intervals or bins. Then you can talk about the probability of your value falling into any given bin. This value is known as the probability mass for a given interval.

The other way is to make bins much more narrow, and then take into account the probability mass and the width of the bin itself. If you take the probability mass and divide it by the width of the bin you get the probability density. You can plot these probability densitys to get a curve (like a normal distribution). While any given probability density can be greater than 1, the sum of the area under the curve (the integral) will need to come to 1.

So to make this more maths:

  • \(x = \text{Variable}\)
  • \(x_i = \text{A given bin/interval}\)
  • \(\Delta x = \text{Width of that bin}\)
  • \([x_i , x_i + \Delta x] = \text{Defines a given bin}\)
  • \(p([x_i,x_i + \Delta x]) = \text{The probability of event in a given bin, the probability mass}\)
  • \(\frac{p(x_i,\Delta x)}{\Delta x} = \text{Probability density}\)
  • You then might refer to the bin width (\(\Delta x\)) as \(dx\) instead
  • And then confusingly, then start referring to both the probability density and the probability mass as \(p(x)\):
  • you'll generally be referring to mass functions for discrete variables
  • and density for continuous

Normal distribution

The normal distribution is one version of a probability density function. Defined by \(\mu\) and \(\sigma\) (mean and SD). As mean defines the central location of the curve, it is a location parameter, as SD defines the size of the spread, it is a scale parameter.

Other ways you might describe a distribution is by it's variance (also called the mean squared deviation, where the SD is the root mean squared deviation).

Or you might talk about central tendency (which in a normal distribution is just the mean).

Or you might talk about the highest density interval. This is a range in your density curve, where any value within the range, is more likely than any value outside the range. If the sum of the AUC of a density function must = 1, you can talk about the 95% highest density interval, where the sum of the AUC must = 0.95. The HDI doesn't need to be one continuous range. In a bimodal distribution it might be split. When looking at the density function. The HDI starts from the tops of the peaks of a function and works it's way down.

Other bits of probability

If you are talking about multiple variables, and you are talking about the probability of certain combinations, you are talking about joint probabilities. If you were to sum up the joint probabilities within one of the variables. Say you had variable A and B, both with values of 1234, so you could have A1 to A4, and B1 to B4, giving you combos of A1B1 to A4B4. If you added up A1B1, A1B2, A1B3, A1B4, that would give you the marginalised probability of A1.

Joint probabilities are different to conditional probabilities \(p(a|b)\) where you are talking about the probability of A given B.

You can use conditional probabilities to work out if two variables are independent. If A and B should have no effect on each others outcome then \(p(a|b)\) should be identical to \(p(a)\). If they aren't there might be a relationshup you are missing.

5 - Bayes Rule

This chapter gives us another way to get to Bayes rule, using joint probabilities and conditional probabilitiies.

We can describe a conditional probability as taking the joint probability of two events, and dividing that by the probability of the event we are conditioning on:

\[ P(A|B) = \frac{P(A,B)}{P(B)} \]

that can be rephrased by multiplying both sides by \(P(B)\)to:

\[ P(A|B).P(B) = P(A,B) \]

at the same time, talking about the other side of the conditional probabilities, rather than \(A\) given \(B\), what about \(B\) given \(A\).

\[ P(B|A) = \frac{P(B,A)}{P(A)} \]

which can be rephrased to:

\[ P(B,A).P(A) = P(B,A) \]

but we also know joint probabilities are identical. as in, \(P(A,B) = P(B,A)\)

so that means we can join together the two equations to get:

\[ P(A|B).P(B) = P(B|A).P(A) \]

we can then rephrase this to get:

\[ P(A|B) = \frac{P(B|A).P(A)}{P(B)} \]

which rather neatly is baye's rule.

You can chop up this equation into parts:

  • \(A\) = Our hypothesis
  • \(B\) = Our observed evidence
  • \(P(A)\) = The probability our hypothesis is true - The prior
  • \(P(B)\) = The probability we would see the observed data, regardless of what hypothesis is true
  • The marginal likelihood or evidence
  • \(P(A|B)\) = The probability our hypothesis is true, given the observed data - The Posterior Distribution
  • \(P(B|A)\) = How likely it was that we would see our observed data, if the hypothesis was true - The likelihood

So what is the equation saying?

That the probability that our hypothesis is true, given the observed data, can be calculated by taking the probability of seeing that data if our hypothesis was true, multiplying it against our starting guess for how probable our hypothesis is to be true, and dividing that by the probability of seeing that data regardless of the cause.

Marginal Likelihood

He then clarifies there are two ways to work out the marginal likelihood for something, depending on whether it's a discrete or continuous variable. The discrete version is:

\[ P(B) = \sum_{\theta^*}P(B|\theta^*)P(\theta) \]

whether \(\theta^*\) means, any given value of theta. So, a specific value for the true probability. You've summed up all these specific probabilities.

In a continuous variable you have to calculate the area under the curve, you have to integrate, this is as below:

\[ P(B) = \int d\theta^*(B|\theta^*)P(\theta^*) \]

This continuous version is obviously confusing and unpleasant, but it's what you need for a marginal likelihood. Without a marginal likelihood you cant get your posterior distribution. However. It's also difficult/impossible to calculate! So instead what we do is we estimate it! One method to approximate that area under the curve is to use a technique called Markov Chain Monte Carlo. Other older methods include things like grid approximation.

Bernoulli Distribution

A bit of a tangent, but we would use the bernoulli distribution for describing the probability of an outcome when there are two potential outcomes (Yes/No heads/tails etc), you classfy these outcomes as 1 and 0.

If we say that the underlying probability of someone saying yes (1) to a question is: \(\theta\),

then you can say that the probability of them answering yes given that probability is:

\[ P(A_y|\theta) = \theta \]

and the probability of them answering no (0) is:

\[ P(A_n|\theta) = 1 - \theta \]

But if you want to get both outcomes (\(A_x\)) expressed in a single equation you can use the bernoulli equation of:

\[ P(A_x|\theta) = \theta^{A_x}.(1-\theta)^{1-A_x} \]

because if: \(A_y = 1, A_n = 0\)

Then when you want to use the equation for \(A_y\) it becomes:

\[ P(A_y|\theta) = \theta^{1}.(1-\theta)^{1-1} \\ P(A_y|\theta) = \theta.(1-\theta)^0 \\ P(A_y|\theta) = \theta.1 \\ P(A_y|\theta) = \theta \]

If you want to use the equation for \(A_n\) it becomes:

\[ P(A_n|\theta) = \theta^0.(1-\theta)^{1-0} \\ P(A_n|\theta) = 1.(1-\theta)^1 \\ P(A_n|\theta) = 1.(1-\theta) \\ P(A_n|\theta) = 1-\theta \]

6 - Binomial Probability

When talking about a binomial measurement, we mean one where there are two nominal outcomes, that are mutually exclusive, with no ordinal or metric relationship. So: yes/no or heads/tails etc.

Bernoulli Distribution again + Likelihood vs Probability

\[ P(y|\theta) = \theta^{y}.(1-\theta)^{1-y} \]

So above we have the Bernoilli distribution, this is a probability distribution talking about the probability of getting a certain outcome \(y\) if the probability is set at \(\theta\). \(y\) may be 1 or 0 in this.

But we also may want to work out what is the most likely value for \(\theta\), and treat \(y\) (our observation) as fixed. In a likelihood function, we would be saying, if we have this observation \(y\), what value for \(\theta\) is most likely.

When we are talking about probabilty and likelihood, probability is looking forward, if we know the probability is set at \(x\), what outcome would be most probable. Likelihood is looking backwards. If we have a certain outcome \(y\), how likely are different values for the probability \(\theta\). Probabilities always have to sum to 1. Likelihoods do not.

Conjugate Priors and The Beta Distribution

When we are trying to work out \(p(y|\theta)\) and applying bayes rule to work it out, it would look like:

\[ p(\theta|y) = \frac{p(y|\theta).p(\theta)}{p(y)} \]

In this, on the left hand side is our posterior, the numerator is our prior multiplied by our likelihood the denominator is our evidence

What the aim of a conjugate prior is, is to find a likelihood that is compatible with your prior, so that when you multiply the two, you get a posterior distribution that is in the same form as your original prior, the same distribution family.

Examples would be if you had a beta prior, you could use a bernouli likelihood, and this would give you a beta posterior.

The advantage here is you can plug this new posterior straight back into the equation as the new prior.

What is the Beta Distribution

I won't do too much on this as I have some in my Baye's Fun notes.

But the Beta distribution is one with the following parameters:

\[ beta(\theta|a,b) \\ OR \\ beta(p|a,b) \]

The second way is how i've described it before, but it's the same thing.

What the beta distribution as a function actually means is:

\[ beta(\theta,a,b) = \frac{\theta^{(a-1)} (1-\theta)^{(b-1)}}{B(a,b)} \]

The denominator in this beta distribution is the Beta function, this normalises the distribution so the area under the curve = 1.

There are a few terms you can use to describe the Beta distribution.

\[ \text{Mean} = \mu = a/(a+b) \\ \text{Mode} = \omega = (a-1)/(a+b-2) \\ \text{Concentration (Spread)} = \kappa = a+b \]

The mean will represent the central tendency in non skewed distributions, but most distributions will be skewed, and the mode will be more useful here.

The concentration describes how narrow the distribution is, with higher values being more narrow.

Lastly, \(a\) and \(b\) in the beta distribution are typically your observed data, so \(a\) would be number of successes and \(b\) would be number of failures, for example.

In the beta distribution \(a\) and \(b\) are both shape parameters.

Posterior Distributions

Our posterior distribution is a combination of our prior and our likelihood (our likelihood is how well our current data is explained by the current parameters)

As our likelihood, our observed data increases in size, the influence of the prior on our results falls

It's great to use a beta distribution as the prior, because the posterior is always also a prior (as long as you have a bernoulli likelihood), and is mathemetically exact, and easy to compute.

But not all priors can be represented by the beta distribution, for example, a sharp bimodal distribution with two distinct peaks. Sometimes here, you need to start using some numerical approximation. In addition, only relatively simple likelihoods have conjugate priors.

A way you can do approximation is through MCMC.

6 - MCMC

The point of our bayesian analysis is to get to our new posterior distribution, however, to do so with Baye's rule, one thing we need is the marginal likelihood or evidence: \(p(B)\) (\(p(D)\)).

Getting to a marginal likelihood is computationally difficult. So getting to the posterior distribution is also difficult. Sometimes you can use a conjugate prior to get to the marginal likelihood, but you can't in all situations. Grid approximation allows you to get close to the marginal, but is only useful in discrete ranges for single variables, and gets too computationally inefficient for multiple distributions.

Markov Chain Monte Carlo allows us to sidestep calculating the marginal likelihood, and instead samples from the posterior distribution without the need to know it.

Without the marginal likelihood being known, we are able to make comparisions between proposed values in the posterior distribution based on their unnnormalised relative likelihood, i.e. while we don't know the true values, we know that one option is 4/5 likely in comparison to another option.

What we can then do, is take samples of the posterior distribution, with enough samples, we can approximate the posterior distribution. This allows us to avoid calculating the marginal likelihood.

One example of how to do this sampling, with a MCMC approach, is to use an algorithm caleld the Metropolis-Hastings approach

Metropolis Hastings

At the heart of this, we think there is a "true" posterior distribution that we are trying to map out, but we have no big picture view for what it looks like.

The metropolis-hastings algorithm is one approach to map out the posterior distribution, often described through metaphor.

If we picture the classic view of a distribution as a mountain ridge, with peaks and troughs in elevation. You've been tasked with understanding the heights of the mountain range, but the problem is you've been dropped into the middle of the range, in the middle of the night, with a rubbish torch that only shows you the area immediately next to you.

In taking a metropolis-hastings approach, you take a random walk through the mountain range to plot it out.

First you decide to go left or right on the mountain range, you make this through a simple coin toss, heads you think about left, tails you think about right. Each time before you think about a single step, you flip this coin again. In our algorithm, this part where you decide a direction to look in through flipping the coin, is called our proposal distribution.

Next once you've picked a direction, you then decide whether or not to take a step. To do so you have a set approach. If the next step involves going uphill, you always take it. If the step involves going down hill, you may or may not take it. The more downhill it would involve, the less likely you are to take that step. In our algorithm, this stage is called the acceptance decision. Determining to walk down our probability distribution is probabilistic. Where you divide the relative height of the place you propose going to by the relative height of where you currently are. (This is where our metaphor falls apart a little bit, as on the hill you can see if it is higher, or lower, but not really whether if it is higher or lower by however many times).

A mathematical expression for describing the probability of whether or not you take a step is below:

\[ p_{move} = min(\frac{p(\theta_{proposed})}{p(\theta_{current})},1) \]

This states that to work out the probability of whether or not to step, you are going to take the minimum of these two options within the min function. If your proposed location is greater than your current location, then the minimum option in that function is 1. So the probability of you taking the step is 1. If the proposed location is lower than the current location, then the minimum value in the function is that value of the ratio between \(p(\theta_{proposed})\) and \(p(\theta_{current})\). i.e. the probability you'd take the step is the value of that ratio.

Last, each step you take, you record where on the hillside you are, and you go on to make however many coin tosses and steps you feel like.

At the end of this, you can look at a bar graph of which places on the hill you spent more time on, and that will correspond with the highest points. You've plotted a map of the mountain range, without being able to see it.

Metropolis is a person, not a continuation of the metaphor!

More metropolis

Metropolis algorithms can be extended:

  • You can use continuous values, rather than discrete
  • You can move in any number of dimensions, not just one
  • You can use different proposal distributions (it doesn't have to be that coin toss, it could be random sampling from any distribution)
  • You don't have to just take one step in your decided direction (you can also use a normal distribution to determine the distance to travel. You could use a normal distribution to do both the direction (the proposal distribution) and the distance.)

How to define your proposal distribution itself doesn't change what your eventual posterior looks like, but it does change how quickly you can approximate it. Allowing to take too small or too large a jump will take you longer to get to your posterior distribution.

What you also need to do though, is work out when you think you've got to a good approximation. One way to do this is the effective size of your chain in the algorithm. If you go through the process of determining whether or not to move 50,000 times, but each of these locations is very similar to one another, then you haven't taken 50,000 independent samples from the distribution, you've got a much lower than 50,000 "effective sample size". This ESS is calculated by looking at the "autocorrelations" of the samples.

Multiple Parameters

Metropolis-Hastings may be great, but you may need to sample multiple parameters, e.g. estimating the probability in two different groups (what's the prob of dying in group A, whats the prob of dying in group B, what's the difference in the prob between the groups?)

When plotting the combined probability distribution of two parameters, you might see a contour plot. This is as you have two 2d prob distributions (one for group A, one for group B), combining to make a single 3d prob distribution of the combined prob distribution. You turn two curves into a 3d hill. A contour plot is looking birdseye view down on that probability distribution, like the contours on a map. This 3d space is still a probability density function, so ultimately it still integrates to one, like previous functions did.

You can write out the expression for a combined density function as below:

\[ \int\int d \theta_1 d \theta_2 p(\theta_1 , \theta_2) = 1 \]

Here we are saying one integral for theta 1, one for theta 2, and also the d stands for those infintessimally small bits of theta we are summing up.

One way of doing that may be using Gibb's sampling. This also does a random walk through our parameter space, but as there are multiple parameters, it might cycle sequentially through which parameter it's altering. It might also then pick where to go a bit less randomly. It ensures it picks somewhere representative in that parameter's probability distribution. To do so, it needs to be able to work out that single parameters posterior distribution. So Gibb's is only really useful if you aren't able to calculate the complete combined parameter probability distribution, but you can calculate the individual parameter's posterior probability distributions.

Why is MCMC called that?

Monte-Carlo just means it is a random sampling from an underlying distribution. Markov Chain represents the random walk. When wandering through a probability distribution, any step that doesn't depend on the previous steps is a Markov Process. Any sequence of these markov processes is a markov chain.

Ultimately, when doing any form of MCMC, either with metropolis-hastings, or gibbs, or anything else, your MCMC needs to be 3 things, it needs to be:

  1. Representative of the true distribution
  2. Accurate
  3. Efficient

MCMC - Representative

You can determine whether or not your MCMC is representative of the underlying distribution in two ways. Visually or numerically.

First with visually you can look at convergence, and the trajectory to get to convergence. Convergence is talking about comparing multiple markov chains, and seeing when they start overlapping in their estimate of the true distribution. If they never overlap, they haven't converged. The time (in terms of steps from the start of the process) it takes to reach convergence, is called the burn in period. You would routinely discard the steps in the burn in period, so that your ultimate sampling distribution is only representing the true distribution.

Numerically. What you can do is compare variance in the distribution in an individual chain, to the variance between all the chains. This is called the gelman-rubin statistic or the shrink factor. When all chains converge, the skrink factor reduces to 1. When the shrink factor is greater than 1.1 you need to worry that there hasn't been proper convergence.

MCMC - Accuracy

So next you want accurate MCMC chains. Accurate means they are stable, stable means that they are unlikely to change their opinion on the mean and SD of the distribution with further samples, they have enough information.

How much information is enough? One way to look at the amount of information, isn't simply the number of steps. If all the steps are going over the same part of the parameter space, we might talk about the chain being autocorrelated, or clumpy. We can measure the autocorrelation factor to calculte the effective sample size. This is in effect, a measure of the length of steps taken, once you've taken into account the autocorrelation. You generally want an ESS of ~10,000

Next you might look at the standard error of your chain, the Markov Chain Standard Error (MCSE). This is calculated using the SD of your predicted distribution, and the ESS.

\[ \text{MCSE} = \text{SD} / \sqrt{\text{ESS}} \]

MCMC - Efficiency

Lastly, you need a method of MCMC that actually works. As in, one that you can perform in a reasonable time.

To get to enough steps (a small enough MCSE, a large enough ESS, with convergence), you can try a few different things:

  • Use parallel computing to run multiple chains at same time
  • Use different sampling methods (metropolis-hastings vs gibbs etc)
  • Use different descriptions of your parameters (different ways to express the distribution etc)

8 - JAGS/BRMS

I've used the brms translation of this chapter, rather than the one focusing on JAGS.

First point, brms allows you to estimate model fit through: PPC/Cross Validation/Bayes Factor

The things you need for your model are your likelihood function (the likelihood of data given the set parameter values), and the prior distribution (the probability of paramater values without looking at the data)

What does brms use under the hood? It uses Hamiltonian Monte Carlo rather than the Metropolis-Hastings described above. What differences to care about? Not really anything much, note it talks about a Warmup phase rather than burn-in. These are similar things but not identical.

When you are looking at a brms::brm() output, the way to do so easily would be brms::as_draws_df()

But you can also do things like brms::mcmc_dens_overlay(model) to look at chain densities, to get a summary justprint(model), or if you want median rather than mean think of brms::posterior_summary(model, robust=T)

To get autocorrelation plots: brms::mcmc_acf(model)

If you want specific credible intervals in your model, brms::posterior_summary(model, probs = c(0.2,0.4,etc))

Other packages to look at include:

tidybayes with tidybayes::stat_halfeye() would allow you to calculate mode of posterior distribution, or tidybayes::stat_histinterval()

also ggthemes::scale_fill_colorblind()