Skip to content

Statistical Rethinking

Lecture One - Golem of Prague

Causal inference is about how you justify your statistical tools; it sits above the Bayesian vs frequentist debate

Causal Inference

A statistical model is no good on its own. A statistical model is also no good with just the data. What you need is a scientific model that explains the data, gives an explanation as to what is the cause of the data. (This is why a scientific model is also called the causal model). The causes of the data cannot be extracted from the data alone. "No causes in, no causes out"

Even when your goal is to describe a model, you still need a causal model. You need causal information as to how the sample differs from the population. You need to know how to adjust for biases, and you can only know that by knowing what caused them.

Causal inference is attempting to understand the causal model. Causal inference is using the data we have, to understand the pieces of the causal model. The causal model is the scientific model.

It is possible to learn about causes from data.

There are two ways of thinking about this (but they are saying the same thing really)

  1. Causal inference is a special type of prediction
  2. Causal inference is a special type of imputation
Causal Prediction

Prediction and inference are different. But you can think of inference as a special kind of prediction.

Inference is understanding the causes, and understanding the consequences of interfering with a situation.

Causal Imputation

If you know a cause, you can construct consequences for things that didn't happen. "What if x had happened instead of y"

What's the main point of this lecture?

It's that in doing statistics, you need to always start with your underlying scientific model. Your scientific model is how you think everything interacts. Analysing the data without this framework will not give you the answers as to how everything interacts. If you don't put causes into your statistics, you won't get causes out.

DAGS

DAGs are the simplest type of causal models.

"Directed Acyclic Graphs"

They are heuristic models that help clarify your scientific thinking.

In describing your underlying scientific model (you could also call this your underlying causal model), you can use a graphical representation called a DAG, a directed acyclic graph.

Even if you aren't trying to identify the causes of a process, if you are just trying to describe a population, you still need your underlying causal model to accurately describe it.

What would be the simplest model? It would be the effect of variable $X$ on variable $Y$

$\text{X} \rightarrow \text{Y}$

But there are never really models as simple as that. There are going to be additional variables:

graph LR
C --> X
C --> Y
A --> X
A --> C
B --> C
B --> Y
X --> Y
  • Here we have $\text{B}$ also affecting $\text{Y}$
  • $\text{A}$ affects $\text{X}$ so indirectly affects $\text{Y}$
  • Then you have $\text{C}$, a proper confound, affecting both $\text{X}$ and $\text{Y}$; you'll want to control for this (see the sketch below)
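
A minimal R sketch of simulating this DAG (variable names and effect sizes are invented for illustration). Even with the X → Y effect set to zero, X and Y end up associated, because the confound C causes both:

# hypothetical simulation of the DAG above (effect sizes are made up)
N <- 1000
A <- rnorm(N)
B <- rnorm(N)
C <- rnorm(N, A + B)        # C is caused by A and B
X <- rnorm(N, A + C)        # X is caused by A and C
Y <- rnorm(N, B + C + 0*X)  # the X -> Y effect is deliberately set to zero
cor(X, Y)                   # still clearly non-zero, because of C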

Summarising a DAG: you can ask loads of questions of a DAG, not just how X affects Y. But for each question you will need a different statistical method, possibly a different model. The reason you need loads of different methods depends on the controls. There is such a thing as "good controls" and "bad controls". A DAG helps you map out what you think is going on.

A DAG is a way of helping you plot out your intuition

Golems

In talking about golems, Richard is talking about a tool for automation, a physical or software robot that does what you say and does it well. The key point he's trying to get across is whilst it's powerful, a golem will only powerfully do what you say, not what you mean. You need to ensure you tell it to do exactly what you mean it to do. It can't interpret your intentions.

An example of this with statistical tests is, you might pick the wrong test for your purpose. The test itself is a good and powerful test for what it is designed for, but you've applied it to the wrong task. It will go ahead and do what it's designed for, which might not be what you need.

The Null Hypothesis

A problem with the example above, of picking the wrong test, is that most of the frameworks available to help you pick a test only choose between a variety of tests that are all designed to reject a null hypothesis. Again, nothing wrong with that inherently, but it's not the be-all and end-all of stats, and is actually quite a narrow target.

The main place a lot of these null hypothesis tests come from, and where they are most useful, is industrial quality control. He gives the example of how the t-test was designed by a statistician working at the Guinness brewery.

He argues that null models are less useful in areas where you can't do straightforward experimentation, and instead have to focus on observational work. To me that's interesting as that's what my MD is. In this area he says you need to end up with multiple process models instead (I don't know what they are yet)

A process model is a mathematical model. A mathematical model of what you think is going on (like your scientific model or causal model)

A process model is a tightened-up definition of your hypothesis. It loses the vagueness and pins down what you actually mean in the hypothesis. By creating a specific definition of your hypothesis, you can then test it with a statistical model.

The process model sits in a similar space to where the null hypothesis would. The thing is you can have multiple different process models to represent different aspects of the same hypothesis. There is no ONE null hypothesis; there might not be a unique null model at all. Instead, what you will do with your testing is compare one process model against another process model.

In working these things out you go:

graph LR
Hypothesis ---> P[Process Model] ---> S[Statistical Model]

Problems with Null Hypothesis Testing, and alternatives

A big problem with most statistical models is that they're focused on rejecting null hypotheses rather than research hypotheses. This means it can become unclear how the test bears on your research hypothesis.

The thing here is that you can have multiple process models that could explain a hypothesis. This is the problem he describes as there being no unique null hypothesis. I think he's saying that these models are not specific enough.

So what he's saying with a lot of null hypothesis testing, is that you can test and reject a null hypothesis, but there can be many different scientific models that could result in that same rejection of the null hypothesis. So rejecting a null hypothesis wouldn't tell you which of those models was true.

So if null hypothesis testing is limited, what should we use instead?

Richard says Generative Causal Models, which is something more than a DAG.

Generative = you can simulate data from it

You're then going to create statistical models (estimators) and start off by testing them on this synthetic data, checking they can recover the known values (the estimands) you built into the simulation.

Once you are sure your statistical model works with the synthetic data, then you introduce the actual data to it.

Our Workflow So Far:

  • Start with a DAG,
    • in here work out what will be our "adjustment set", these will be the other variables we are adjusting for, our confounders
  • Go from there to a generative model, that's going to create our synthetic data,
  • then we need a strategy to create our estimates, and to characterise the uncertainty of our estimates:
    • it's in this bit that bayesian data analysis comes in, which is the simplest way to create and characterise our estimates and their uncertainty

Caveat:

  • Bayesian data analysis can be overkill for simple scenarios - such as simple linear regression

Caveat to the caveat! / Reasons for Bayes:

  • In real world scenarios with things like:
    • measurement error
    • missing data
    • latent variables
    • regularisation
  • Bayes has natural and comfortable ways to deal with these problems
  • In addition, Bayesian models are generative, useful for our generative models

Owls

We want to document our steps as we go with our generative models, we want to:

  1. Have Our Generative Simulation
  2. Write an estimator
  3. Validate our estimator using our simulated data
  4. Analyse real data

Process:

  1. Define our theoretical estimand: "What are we even trying to do in the study"
  2. Define a scientific model:
    1. First as a DAG
    2. Then as a generative causal model
  3. Use step one and two to build a statistical model
  4. Testing - Simulate data from step 2 to validate that step 3 yields the result of step 1
  5. Use what we have created to analyse real data
  6. We may loop back to step 2 from this and revise our causal model

Documenting it as we go

There will be step 5 to do things like compute hypotheticals

The process

  • Follow clear steps
  • Understand what you are doing
  • Document what you are doing
  • With a respectable scientific workflow

Structure Of The Course:

  • Think of the first 5 weeks as a course in Bayesian Linear Regression
  • And the second 5 weeks as an expansion on that

Foundations of Bayesian Inference - Video Two

This lecture is walking us through a Bayesian workflow.

The first thing is: what is our estimand, what is the thing we are trying to estimate?

The process goes:

$\text{Estimand} \rightarrow \text{Estimator} \rightarrow \text{Estimate}$

The estimator is our statistical model.

At the end, how do we summarise our estimate, and how do we demonstrate the estimator:

Step One and Two: Have our generative model built up.

Variables can be:

  • data - things you can observe
  • things you want to estimate - things you can't observe - you are measuring indirectly.

The example used is trying to estimate the proportion of water on Earth by spinning a globe and seeing if you land on water or land. This means you have the variables:

  • $\text{p}$ - Proportion of water on earth
  • $\text{N}$ - Number of times the globe was spun
  • $\text{W}$ - Number of times it landed on Water
  • $\text{L}$ - Number of times it landed on Land
graph LR
N --> W
N --> L
p --> W
p --> L

So we know that N influences W and L, but not the other way around. Same for p

Above is a DAG. It's conceptual. Next we want to make it a generative causal model.

If we were to express it as a 'functional relationship':

$W,L = f(p,N)$ - the values of W and L both depend on a function of p and N
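
A minimal R sketch of that generative relationship, assuming some true proportion p (the numbers here are placeholders):

sim_globe <- function(p = 0.7, N = 9) {
  # each toss lands "W" with probability p and "L" otherwise
  sample(c("W", "L"), size = N, replace = TRUE, prob = c(p, 1 - p))
}

tosses <- sim_globe(p = 0.7, N = 9)
W <- sum(tosses == "W")
L <- sum(tosses == "L")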

Step Three: Our Statistical Model

We are using Bayesian data analysis.

What is Bayesian Data Analysis

Count all the ways data can happen. Explanations with more ways to produce the data are more plausible.

As in, you've got what you've observed; you assume that a particular explanation is true, and count all the possible ways that explanation could have produced those observations.

You work your way through all the possible explanations, counting the ways each explanation could be true.

Then you look at which explanation has the most possible ways to be true. These are the more plausible explanations

The Garden of Forking Data

  • First count all the things that could have happened:

  • For each possible proportion of water on the globe

  • Count all the ways the samples that we achieved could have happened
  • Possible proportions with more ways to produce the sample we observed are the more plausible ones.

The difference between this and most Bayesian inference is that rather than using counts, Bayesian inference usually uses probabilities.

Summary:

2022-10-04:

Summary of this lecture:

  • This lecture is the overview lecture to explain what a Bayesian approach is.
  • Bayesian Updating is the name of the process/method
  • How would I explain Bayes?:
    • You have your data you're trying to understand
    • You think of all the possible underlying explanations (explanatory models)
    • For each of these explanatory models, you count up all the potential ways that model would result in the results you have
    • You compare all your explanatory models together
    • The ones with the most potential ways to produce the data we have are the most likely
  • But in that bit above, we haven't actually talked about updating yet, or priors, or posteriors
  • The bit above is more just straight forward probability theory

  • Step One: Collect Data

  • Step Two: Think of possible explanations
  • Step Three: Think how likely each possible explanation is
  • Step Four: Compare likelihoods to see what's most likely:

    • Step Four and a bit: Normalise these likelihoods:
      • Add up all the different routes of all the different explanations. Divide each individual explanation's routes by this total
  • What about the Bayesian Updating Bit?:

    • Here you're multiplying one probability distribution by another:
      • You're multiplying your prior distribution, by the probability distribution of your observed data
      • This gives you your posterior probability distribution
    • A probability distribution is a plot:
      • X - the range of possible explanations
      • Y - how likely each explanation is - the density of an explanation
  • This probability distribution can be represented as:

    • A plot (remember the plot needs to show normalised data; the area under the curve needs to be constant)
    • A vector (a list)
    • An equation
  • It's the whole probability distribution we care about!:

    • Not a point statistic
    • Not a confidence interval
  • How do you generate these probability distributions?:

    • Can be:
      • Grid Approximation
      • Quadratic Approximation
      • Markov Chain Monte Carlo
  • What do you do with these probability distributions?:

    • You might use them to make predictions about future data
    • Or just to explain what you think the most likely underlying explanation of observed data is

Notes Taken During Lecture

The logic of probability theory is this above, the garden of forking data

When you do this, this counting up the possible explanations, you could allocate each explanation a parameter (say a proportion) $p$

Then you would count up the ways; the explanation with the highest number is the most likely.

Now you would "normalise them". You do that by adding up all the different ways of all the different explanations, then you divide each explanation's ways by the total sum.

The advantage of this is you can add new information fairly easily. You take your new information (say your next marble from the bag) and multiply it against your previous counts. You don't have to recalculate the entire garden
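
A tiny R sketch of this counting-and-normalising, then updating. The counts are illustrative (they happen to match the blue/white marble example in the book); the point is the normalise-then-multiply mechanics:

# ways each of five candidate explanations could produce the data so far
ways <- c(0, 3, 8, 9, 0)
plausibility <- ways / sum(ways)         # normalise: divide each count by the total

# a new observation arrives: multiply the old counts by the number of ways
# each explanation could produce that new observation, then normalise again
ways_new <- ways * c(0, 1, 2, 3, 4)
plausibility_new <- ways_new / sum(ways_new)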

Bayesian Updating

So the key bit in updating, is you have to state a causal model for how the observations arise, given each possible explanation.

Then count the ways data could arise from each explanation.

Then you get your 'plausibility'/'probability' from that step above (when normalised).

Posterior distribution graph (see xournal)

"Density" in probability theory means the "relative plausibility" of the potential explanation for our observations.

Remember when you want to update on bayes, you multiply the probabilities against each other

All bayesian updating is doing this multiplication again and again

Each time you update your curve it will get taller and narrower, this is cos it's been normalised, the area under the curve has to sum to one.
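
A small R sketch of this repeated multiply-and-normalise, over a grid of candidate proportions p (the observation sequence is made up):

p_grid <- seq(0, 1, length.out = 200)        # candidate explanations
posterior <- rep(1, length(p_grid))          # start from a flat prior
posterior <- posterior / sum(posterior)

observations <- c("W", "L", "W", "W", "W")   # invented sequence of globe tosses
for (obs in observations) {
  likelihood <- if (obs == "W") p_grid else 1 - p_grid
  posterior <- posterior * likelihood        # multiply by the new evidence
  posterior <- posterior / sum(posterior)    # renormalise so the area is one
}
# each pass through the loop makes the distribution taller and narrower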

Things to note

  • There's no minimum sample size in this process (unlike in frequentist)
  • The whole shape of the distribution curve embodies the sample size. All the information that is relevant has gone into constructing the posterior curve
  • None of this is to say sample size doesn't matter, cos the bigger the sample size the narrower the curve.
  • Another thing that is less useful in bayes is the "point estimate" - there is no valid point estimate, you want to report the entire distribution:
    • The entire distribution is the curve
    • Summary of the distribution is the LAST STEP. You don't do calculations with the summary
  • Again, you don't use confidence intervals in the same way in bayesian stuff, cos there are infinite versions of these intervals:
    • There's nothing special about the 95% interval!
  • Especially as in Bayes we don't use the intervals to do error control (like you would in frequentist stats)

So how do we write down the models

Little $p$ is normally used to denote proportions

There's a binomial probability function in R, dbinom(), that will give you the binomial probability.

dbinom(W, W+L, p) - the probability of W observations of one type, out of the total number of observations (W+L), given an assumed probability p

You multiply your prior by your observations, then normalise it.

So how do we code the model?

  • For each possible p we will compute the product conditional on that p.

  • One method is called Grid Approximation (see the sketch after this list):

    • The grid is the list of the possible explanations we will consider
    • We then set up a prior
    • Then we set up a binomial sampling formula to give us the plausibilities of the observations we've seen
    • Then we multiply our prior by those plausibilities (the likelihood)
    • Then we normalise that to give us our probability distribution - our posterior distribution
  • In practice we'll be using Quadratic Approximation

  • or Markov chain Monte Carlo
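
Here is the sketch referred to above: grid approximation for the globe-tossing model in R (the observed counts W and L are invented):

W <- 6; L <- 3                                         # made-up observations
p_grid <- seq(0, 1, length.out = 1000)                 # the grid of candidate explanations
prior  <- rep(1, length(p_grid))                       # a flat prior
likelihood <- dbinom(W, size = W + L, prob = p_grid)   # plausibility of the data for each p
posterior  <- prior * likelihood                       # prior times likelihood
posterior  <- posterior / sum(posterior)               # normalise to get the posterior
plot(p_grid, posterior, type = "l", xlab = "p", ylab = "posterior")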

What do we now do with this model?

  • We can use this model to make predictions
  • Reasonably complicated models may have dozens of parameters, and you won't be able to understand them just by looking at a summary table. You'll need to use the entire distribution.
  • A prediction is what a model thinks further observations will be

These posterior predictive distributions are a two-step process: you sample a proportion from the posterior, then you simulate predictions assuming that proportion, and you add those to the posterior predictive distribution.

You can do something similar with the prior distributions. The advantage of this is that you can see how your priors interact, to make sure you've got sensible priors (see the sketch below).
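
A sketch of that two-step process in R, reusing p_grid, prior and posterior from the grid approximation sketch above; the same trick with the prior gives a prior predictive distribution:

samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = posterior)   # step 1: sample proportions
W_sim <- rbinom(1e4, size = 9, prob = samples)                            # step 2: simulate new tosses
hist(W_sim)                                             # the posterior predictive distribution for W

prior_samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = prior / sum(prior))
W_prior_sim <- rbinom(1e4, size = 9, prob = prior_samples)                # prior predictive version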

Lecture Two - Geocentric Models

Why Normal?

Generative Argument - If you sum the fluctuations in a lot of natural processes you get something that looks like the normal distribution (symmetric variance around the mean)

Statistical Argument - If you want to estimate the mean and variance, the normal distribution is the "least informative"

A variable doesn't need to be normally distributed to benefit from the normal distribution!

How do you write a model?

$W$ ~ Binomial($N$, $p$)

Geocentric Models - Video Three

Summary of Lecture

The previous lecture was on what bayesian updating is, and high level stuff on how we get probability distributions

This lecture is more about probability distributions, in the context of how linear regression generates them.

It talks through the process of how you go from your scientific question, to your statistical model, to how you test your model, to how you use your model to analyse the data you've got.

What he tries to make clear, is your model isn't necessarily providing you with any underlying truth about what's actually going on. The example he uses is how you can have a completely "wrong" model like a geocentric view of the universe, and use it to still create "correct" predictions about where the stars will be in the sky.

He talks about the normal distribution, and how it can be useful to model distributions of variables, even when they don't have a normal distribution themselves.

The process of linear regression he walks through looks like the process for any stats question.

  • You start with your question about the relationship between two variables (the outcome variable and the variable of interest)
  • You generate a scientific model about these variables ("I think variable X has a causal effect on variable Y")
  • You express this scientific model in the terms of a statistical model
  • You test this statistical model against the starting assumptions you gave it, to make sure it makes sense
  • Then you use the statistical model against the data you have, to try and understand the actual relationship between x and y

Flashcard type:basic
In maths and stats, what is the equation (just the symbols) for how to draw a line of X against Y? $y_{i}=\alpha+\beta x_{i}$
In maths and stats, what is another name for the normal distribution? Gaussian distribution
In maths and stats, what is another name for the Gaussian distribution? Normal distribution
In maths and stats, what could this equation give you a plot of?
$y_{i}=\alpha+\beta x_{i}$
The relationship between two variables, x and y
In maths and stats, describe what this equation means in detail?
$y_{i}=\alpha+\beta x_{i}$
This is an equation of a straight line.
$y_{i}$ is the value on the y axis.
$\alpha$ is the intercept
$\beta$ is the slope
In linear regression, what does this equation mean?
$y_{i} \sim \text{Normal}(\mu_{i}, \sigma)$
The value of $y$ is distributed normally,
dependent on the mean $\mu$ and the standard deviation $\sigma$
In linear regression, how would you say in math notation that variable $y$ is normally distributed? $y \sim \text{Normal}(\mu,\sigma)$
In linear regression equations, what does $\mu$ usually represent? The mean
In linear regression equations, what does $\sigma$ usually represent? The standard deviation
In linear regression equations, what symbol do you use for the mean? $\mu$
In linear regression equations, what symbol do you use for the standard deviation? $\sigma$
In linear regression equations, what equation do you use to represent the mean of a normal distribution? The equation of a straight line!
$\mu_{i} = \alpha + \beta.X_{i}$
In linear regression, when estimating the standard deviation of a variable, why can't you use a normal distribution? Because your standard deviation has to be positive, and a normal distribution would give some values that are negative
In linear regression, when trying to estimate the posterior probability distribution, name some example techniques that could be used? Grid Approximation
Quadratic Approximation
In linear regression, what are some other names for Quadratic Approximation? Laplace Approximation
Gaussian Approximation

Notes from Video

You can create a model that's giving correct predictions, whilst based on underlying science that's wrong.

That's almost always the case in linear regression. It describes associations, with good predictions, but it's mechanistically wrong. It doesn't describe something true about the structure of it.

Here, this is "prediction without explanation"

I guess the theme of this talk, as well as linear regression, is also that idea of:

"all models are wrong but some are useful"

He gives the example on how we were able to give useful predictions on where planets would be in the sky, even when we were using a geocentric model rather than a heliocentric model

You can make great predictions even without a clue as to what's going on!

So linear regression is a good example of this. The linear regression doesn't mean there's a lovely linear relationship between everything

What does a linear regression do?

It models the mean and variance of a variable. And that's it!

Generally it gets the mean as a weighted sum from a load of other variables. So this mean is dependent on them. This mean is to predict our outcome variable.

ANOVA, ANCOVA, t-tests and MANOVAs are all just special cases of linear regression. They have the same machinery as a linear regression.

Linear regression is usually fitted with least-squares estimation, which can also be given a Bayesian interpretation.

When Gauss produced the normal distribution, the gaussian distribution, he used bayesian arguments to make it!

Normal Distribution / Gaussian Distribution

  • A normal distribution is one way to count up all the ways that observations could happen, that first step we were talking about in previous lecture on bayesian updating

There are two arguments to care about a normal distribution,

  • the generative,
  • and the statistical

The generative argument is that for a lot of natural processes, when you sum up their fluctuations, you end up with a normal distribution.

The statistical argument is that in a lot of tests you want to estimate a mean and variance, and the normal distribution is the least informative distribution for that (it makes the fewest assumptions, and only depends on a mean and variance) - more detail later in the course.

So in the statistical we are saying, what's the point of a normal distribution? It's a mechanism, a machine, that allows us to estimate the mean and the variance of a dataset.

When you use a normal distribution, you're still using the garden of forking data.

The variable's frequency distribution does not have to be empirically normally distributed for the normal distribution to be useful. Statistically speaking, you can still use it to correctly estimate the mean and variance.

So the generative argument is saying the normal distribution is useful cos loads of variables will have a normal distribution naturally. The statistical argument is saying that the generative argument may be nice and all, but it doesn't matter if those variables do have a normal distribution anyway, you can still use a normal distribution to estimate mean and variance, even when your variable doesn't have a normal distribution.

Language for 'Geocentric' Statistical Models

  • Aim is to get some language to represent a model.
  • Aim is to show how to calculate a posterior distribution, when there are multiple unknowns.
  • Aim is to learn how to construct & understand a linear model.

What's the workflow in analysis again?

  1. State your question clearly
  2. Sketch your causal assumptions - DAGS
  3. Use the sketch to define a generative model
  4. Use that generative model to build estimator

Step 2 Describing a scientific model:

  1. First you list the variables
  2. Then for each variable define if it comes from a deterministic or distributional function of the other variables.
| Variable | Definition | Explanation |
| --- | --- | --- |
| $W_{i}$ | $= \beta H_{i} + U_{i}$ | Equation for expected weight, where $\beta$ is some proportion of height and $U$ is the unobserved influence (error) |
| $U_{i}$ | $\sim \text{Normal}(0,\sigma)$ | Equation for the unobserved influence: Gaussian error with standard deviation sigma ($\sigma$) |
| $H_{i}$ | $\sim \text{Uniform}(130,170)$ | Equation for height: a uniform distribution between 130 and 170 |

The point of this model is to help us estimate $\beta$. As in, estimate the effect in this case of height on someones weight.

The subscript $i$ means for a particular individual in your dataset.

If your variable is defined with an $=$ then it means it comes from a deterministic function. If it is defined with an $\sim$ then it comes from a distributional function.
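
A minimal R sketch of simulating from this scientific model (the values of $\beta$ and $\sigma$ are invented purely for illustration):

n     <- 200
beta  <- 0.5                    # assumed effect of height on weight
sigma <- 5                      # assumed spread of the unobserved influences
H <- runif(n, 130, 170)         # H_i ~ Uniform(130, 170)
U <- rnorm(n, 0, sigma)         # U_i ~ Normal(0, sigma)
W <- beta * H + U               # W_i = beta * H_i + U_i
plot(H, W)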

Step 3 Statistical Models Estimator

So we now want to build an estimator. How much does the average weight change with height?

$\text{E}(W_{i}|H_{i}) = \alpha + \beta H_{i}$

So above this says the average expected weight, conditional on height, is defined by the linear equation of some intercept ($\alpha$) and the slope ($\beta$) times the height

Posterior Distribution

So in bayesian statistics, our estimator is always a posterior distribution. However there are loads of different ways to generate a posterior distribution. Remember, we have the estimand, the estimator, and the estimate. The estimand is the thing we want to estimate. The estimator is the statistical model we use to get there.

$\text{Pr}(\alpha, \beta, \sigma | H_i, W_i) = \frac{\text{Pr}(W_i|H_i, \alpha, \beta, \sigma) \text{Pr}(\alpha, \beta, \sigma)}{Z}$

Before the = we have the posterior probability of a specific regression line.

We need to generate a posterior distribution of our unknown variables ($\alpha, \beta, \sigma$)

After the = is our first probability statement ($\text{Pr}(W_i|H_i, \alpha, \beta, \sigma)$). This is the same as the garden of forking data: how many ways can you get a certain value of Weight, with certain values of $H, \alpha, \beta, \sigma$. The second Pr is our prior, the prior probability of $\alpha, \beta, \sigma$. Z is our normalising constant.

Posterior Distribution

However, we don't normally bother writing it out like this probability statement above.

Instead we write it as a linear model:

$W_i \sim \text{Normal}(\mu_i, \sigma)$

$\mu_i = \alpha + \beta H_i$

This says W is distributed normally, with a mean that is a linear function of height, H.

Statistical Models - Approximation

Continuous approximation can be done with "quadratic approximation"

In this what you are doing is saying you can "approximate the posterior distribution with a multivariate gaussian distribution"

In R you can do this with a tool called quap(), in the rethinking package.

In quap you enter a list of deterministic and distributional functions, and then quap runs it to find the gaussian approximation for this posterior distribution for this model.
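
A sketch of what such a quap() call could look like for the height/weight model, assuming the rethinking package is installed and reusing the simulated H and W from the sketch earlier; the priors here are just plausible placeholders, and height is centred on its mean (as discussed later in these notes):

library(rethinking)

dat <- list(W = W, H = H, Hbar = mean(H))
m <- quap(
  alist(
    W ~ dnorm(mu, sigma),          # W_i ~ Normal(mu_i, sigma)
    mu <- a + b * (H - Hbar),      # mu_i = alpha + beta * (H_i - mean(H))
    a ~ dnorm(60, 10),             # prior for the intercept
    b ~ dlnorm(0, 1),              # prior for the slope (always positive)
    sigma ~ dunif(0, 10)           # prior for the standard deviation
  ),
  data = dat
)
precis(m)                          # summary of the quadratic approximation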

The statistical model, different to the generative model (in addition to the generative model really), needs priors

So when designing an estimator, and you need to put priors in, you could use really "flat" priors. Or you can use more concentrated priors. Concentrated priors constrain your model into generating more plausible data (before it's ever seen any data).

Prior Predictive Distributions

  • Prior predictive distributions work just like posterior predictive distributions; they're the same procedure, just run from the priors before seeing any data.
  • You can use your model to make predictions from your priors.
  • So you should set your priors to predict reasonable datapoints (as in, they shouldn't predict impossible datapoints)

So for example in the height and weight example.

  • If height = 0, weight should = 0
  • Weight should increase on average with height
  • weight on average in kg should be less than height in cm
  • sigma needs to be positive

Caveat on priors

  • There isn't such a thing as a correct prior, just a justifiable prior
  • You justify your prior with information from outside the data
  • Priors aren't as important in a simple model
  • but very important in a complex model

Step 4 - Validating Our Model

The bare minimum for this is to test your model with some simulated observations from your scientific model.

To see if you can still get out the values that you put in.

Step 5 - Analysing Data

  • When interpreting your model remember: your parameters are not independent of one another, so cannot always be independently interpreted.
  • Instead; produce posterior predictions and describe and interpret those.

  • So you extract samples from your model,

  • and plot the sample set,
  • and plot your slopes from the model

Old Notes:

How do you write a model?

Our Model in math

$W \sim \text{Binomial}(N, p)$

$p \sim \text{Uniform}(0,1)$

Our Model in text

The outcome (W) is distributed using a binomial data distribution, the shape of this binomial distribution is dependent on N and p. Another name for this data distribution of W is a likelihood function. N is the sample size of our data, and p is explained below

The thing that we don't know just from looking at the data, the thing we are trying to work out, is $p$, our "parameter to estimate"

Our starting guess for how $p$ is distributed, our prior, is that it has a uniform data distribution, where the likelihood of $p$ is equally likely between 0 and 1

In the above:

  • $W$ is the Outcome Variable
  • ~ means "is distributed"
  • Binomial($N$, $p$) is the data distribution we assign to W (likelihood function in non bayesian stuff)
  • $N$ is the sample size
  • $p$ is the proportion of interest
  • $N$ and $p$ define the shape of the distribution of the outcome of interest

$p$ ~ Uniform(0,1)

In this:

  • $p$ is the parameter to estimate
  • Uniform is the prior distribution from 0 to 1

How do we turn this into a probability statement?

  • The way to take those two statements and turn them into bayesian updating is to take your outcome and prior proportion, and shove them into the equations:

  • Pr($W | N,p$)= Binomial($W | N,p$)

  • Pr($p$)= Uniform($p|0,1$)

  • In this the $|$ is the "pipe", you can read it as "is conditional on"

So then you can say: $\text{Pr}(p|W,N) \propto \text{Binomial}(W | N,p) \cdot \text{Uniform}(p|0,1)$ - the posterior distribution $\text{Pr}(p|W,N)$ is proportional to ($\propto$) the binomial times the uniform distribution

Linear Regression

  • $\sim$ means "is distributed"
  • $\text{Binomial}(N, p)$ is the data distribution we assign to W (likelihood function in non bayesian stuff)
  • $N$ is the sample size
  • $p$ is the parameter of interest
  • N and p will define the shape of the binomial distribution
  • $p$ is the parameter to estimate
  • Uniform is the prior distribution from 0 to 1

Our model in probability statements

  • So the math statements above, imply a way to do bayesian methods, bayesian updating
  • The way to do that, is to turn these statements into probability statements.
  • The way to take those two statements and turn them into bayesian updating is to take your outcome and parameter to estimate, and shove them into the equations:

$W \sim \text{Binomial}(N, p)$ becomes $\text{Binomial}(W|N,p)$ to give us $\text{Pr}(W|N,p)$

$p \sim \text{Uniform}(0,1)$ becomes $\text{Uniform}(p|0,1)$ to give us $\text{Pr}(p)$

Our probability statements in text:

The first line, in its final equation form, means the probability of W, as it depends on N and p, can be described using a binomial distribution of W that depends on N and p.

The way to say "as it depends on" in stats, would be to say "is conditional on"

  • In this the $|$ is the "pipe", you can read it as "is conditional on"

The complete model in probability statements becomes

$\text{Pr}(p|W,N) \propto \text{Binomial}(W|N,p).\text{Uniform}(p|0,1)$

The complete model in text

The probability of our parameter of interest ($p$) (another name for the probability of the parameter of interest is the posterior distribution), as it depends on the results we've observed ($W$), and the sample size we've collected so far ($N$),

is proportional to ($\propto$)

the probability of the data observed times the prior probability

  • In this the $\propto$ (which is not the same as $\alpha$), means "is proportional to"

Linear Regression

The example in this lecture, is we are looking at the relationship between height and weight in adults. So the example is the relationship between two continuous variables.

  1. Question
  2. Scientific Model
  3. Statistical Model
  4. Validate Model
  5. Analyse Data

  1. What is the question? Describe the association between adult height and weight

  2. What is the scientific model? "How does height influence weight":
    • $W=f(H)$ - Weight is some function of height (Height has a causal influence on weight)
    • As in, if you change someone's height you'll change their weight,
    • but if you change their weight you won't change their height
    • is our scientific model
    • $H \rightarrow W$ : Would be our DAG
  3. Generative models, couple of options:
    1. Dynamic - fancy, model the growth of an individual over time; that way you get your normal distribution from summed fluctuations
    2. Static - more straightforward - changes in height result in changes in weight, but without the mechanism; the gaussian distribution is imposed (rather than grown more organically as in the dynamic model)

Anatomy of a linear model

Start with the equation of how to plot a line

$y_{i}=\alpha+\beta x_{i}$

  • $y_{i}$ : index : there's a variable called $y$; the index $i$ picks out each individual thing being measured (it runs over the sample)
  • $\alpha$ : intercept : where the x and the y axis meet
  • $\beta x_{i}$ : slope

  • If you have different choices for $\alpha$ and $\beta$, you will get different lines on the graph

  • This linear regression doesn't tell you where the observed values are, it tells you where the expected mean values are

Linear Model

  • $y_{i}$ ~ $Normal(\mu_{i}, \sigma)$ :

    • $y_{i}$ : this is our outcome variable
    • the outcome variable is distributed (~ = distributed) over the normal distribution
    • $\mu_{i}$ : with mu being the mean, or the expectation
    • $\sigma$ with sigma being the standard deviation : so the standard deviation is how scattered away from the line the individual points are
    • this is our model
  • $\mu_i = \alpha + \beta x_{i}$:

    • this is our line, that would plot the mean of our model as a line
    • the spread of the dots around the line would be defined by the standard deviation
    • our alpha is still the intercept, where x and y axis meet
    • our beta is still the angle of the slope
    • $\mu_{i} = \alpha + \beta x_{i}$
  • You can describe a lot of relationship models with a linear model

Statistical Model

One. Question

"What is the association between adult weight and height?"

Two. Turn this question into a scientific model.

"How does height influence weight?"

  • How could we show that model as a DAG?:
    • $H \rightarrow W$
    • That right arrow means that Height ($H$) has a causal influence on Weight ($W$)
  • How could we show that model in maths language/modelling?:
    • $W=f(H)$
    • Weight is some function of height

As in, if you change someone's height you'll change their weight, but if you change their weight you won't change their height. That asymmetry is our scientific model.

Let's make a generative model.

We have a few different levels of models from fancy to straight forward. Fancy would be a dynamic model, where you model the growth over time. Here you get a gaussian distribution from the natural fluctuations. Straight forward would be static models, where you apply a function to show that relationship between height and weight, without modeling it.

In this lectures example we are using a static model. You have to impose the gaussian model rather than it coming about more "naturally"

Linear Models (Our static scientific method)

You start with just the equation for how to draw a line:

$y_{i}=\alpha+\beta x_{i}$

There's a variable called $y$. There are multiple different versions of $y$, and they are identified using their index $i$: you could have $y_{1}$ or $y_{2}$, each of which means one version of $y$. In our example, each version of $y$ would be an individual with a certain height and weight.

You have an intercept, where x and y axes on plot meet, called $\alpha$

You have a slope, called $\beta$. The slope $\beta$ is multiplied against an $x_{i}$ value.

Different values for $\alpha$ or $\beta$ gives you different lines

Thing to remember in a linear regression: this value for y is not the observed value in the data. It is our expectation of what the mean value should be. So if we are saying that Height is our x, then our y is our prediction of what the average value for weight would be, at any given height.

  • This linear regression doesn't tell you where the observed values are, it tells you where the expected mean values are

Turning our line into a linear model:

$y_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_i = \alpha + \beta x_{i}$:

Our model says that we expect our outcome (weight, $y$) to be distributed on a normal distribution (also called a Gaussian distribution). That normal distribution is dependent on the values for our mean and the standard deviation.

Another name for the mean, here is the expectation.

we say that $\mu$ (the mean, the expectation) is defined using the equation for the line as above

So the mean is the intercept ($\alpha$) plus the slope ($\beta$) times the value for $x$ (in this case height)

the standard deviation is the scatter of variables around the mean

  • $y_{i}$ : this is our outcome variable
  • the outcome variable is distributed (~ = distributed) over the normal distribution
  • $\mu_{i}$ : with mu being the mean, or the expectation
  • $\sigma$ with sigma being the standard deviation : so the standard deviation is how scattered away from the line the individual points are
  • You can describe a lot of relationships with a linear model. Again, this doesn't mean that this is the "truth" of what's going on; like a lot of models it is wrong but useful. This is our scientific model

Three. Express this scientific model as a statistical model

To turn our scientific model into a statistical model we need:

  • Priors
  • We need to learn the values of alpha, beta, and sigma also!

!!! Parameters like sigma are scale parameters, they stretch across a distribution and are always positive

So how do you simulate these priors?

  • We need to learn the values of $\alpha$ (intercept), $\beta$ (slope), and $\sigma$ (standard deviation) also!

Linear Model (Our statistical method)

$y_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_i = \alpha + \beta x_{i}$:

$\alpha \sim \text{Normal}(0,1)$

$\beta \sim \text{Normal}(0,1)$

$\sigma \sim \text{Uniform}(0,1)$

  • Those first two lines above are the same as when we had our scientific model,
  • We've now added the next three lines to give us our statistical model:
    • these priors for $\alpha \beta \sigma$ are arbitrary, they don't have to be these
    • The only constraint is that $\sigma$ needs a distribution that can't go negative (here a uniform), as you can't have a standard deviation that is less than 0. Standard deviation is a scale parameter, and scale parameters must always be positive
  • $\alpha \sim \text{Normal}(0,1)$ means that our intercept is distributed using a normal distribution with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1
  • $\sigma \sim \text{Uniform}(0,1)$ means that our standard deviation is distributed using a uniform distribution with a range going from 0 to 1

Simulate These Prior Distributions

Why would we do that? You'd do it to understand the implications of the priors you had chosen

# draw a few candidate intercepts and slopes from the priors
n_samples <- 10
alpha <- rnorm(n_samples,0,1)
beta <- rnorm(n_samples,0,1)

### To plot
plot(NULL,
    xlim=c(-2,2),
    ylim=c(-2,2),
    xlab="x",
    ylab="y")
for(i in 1:n_samples){
    abline(alpha[i],
            beta[i],
            lwd=4,
            col=2)
            }

How do you do bayesian updating?

  • You add random datapoints,
  • to learn where the plausible lines might be

Example Statistical Model

It's similar to the generative model BUT:

  • in linear regression it is often useful to re-scale the variables to make the estimations better (less needed for simulation)
  • you must think about priors:
    • you simulate the prior distribution
Rescaling
  • If you are doing height vs weight, you might rescale height so that the intercept ($\alpha$) makes sense
  • If height is affecting weight ($H \rightarrow W$):

    • the meaning of alpha is the average value of weight, when the individual is of average height:
      • $\mu$ when $H_{i}-\bar{H}=0$ (when the value of H minus the mean value of H is 0):
        • $\bar{H}$ = mean value of $H_{i}$
      • $W_{i} \sim \text{Normal}(\mu_{i},\sigma)$
      • $\mu_{i} = \alpha + \beta(H_{i} - \bar{H})$
  • If we do rescale so things make more sense, it means we can set scientifically reasonable priors:

    • we can set $\alpha$, our intercept, to be whatever we think the population average weight is
    • BUT, we need to give that prior a big standard deviation (an SD of 10 would be a variance of 100)
    • You can also set $\beta$ to always be positive (there's no point having a slope where as height goes up weight goes down!):
      • one example of a distribution where it's always positive would be the Log Normal Distribution (see the sketch after this list):
        • a distribution where if you took the logarithm of all of its values, the distribution would be normal
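
Here's the sketch referred to above: a quick check in R that a LogNormal(0, 1) prior for $\beta$ only ever produces positive slopes:

b <- rlnorm(1e4, meanlog = 0, sdlog = 1)   # sample slopes from the prior
min(b)                                     # never below zero
hist(log(b))                               # the log of the samples looks normal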

Priors

  • There isn't such a thing as a correct prior. Rather there's a prior that is scientifically or statistically justifiable.
  • Each time you do bayesian updating, when you multiply the prior against the probability distribution of the new observed data, what you are doing is shrinking the range of what $\alpha$ and $\beta$ in your linear model could be.

Example Statistical Model

When we talk about a statistical model versus a generative model:

  • Our generative model is to make new predictions
  • Our statistical model is to help us understand relationships
  • Sort of, maybe, that's my understanding

So in our statistical model it might make sense to rescale the variables, this will help us estimate the relationships better. It helps the computing of the posterior probability distribution.

Rescaling

We can rescale our variables so that the intercept in the regression gives us more useful information. For example, in how height affects weight, we can rescale height so that the intercept is where average height meets average weight (rather than what weight is when height is 0).

To do this rather than

$\mu_{i} = \alpha + \beta H_{i}$ (where the a given mean = intercept + slope times a given height)

instead you use

$\mu_{i} = \alpha + \beta (H_{i} - \bar{H})$ (where a given mean = intercept + slope times the difference between a given height and the average height)

The reason you do this $(H_{i}-\bar{H})$ is it gives you, for each value of H, its distance from the mean.

If you rescale your $x$ variable, in this case $H$, to be the distance from the mean, it results in the scenario, that when you have an individual who is the average height, you get an equation for the intercept as below:

$\mu_i = \alpha$

So the intercept, $\alpha$, is equal to the mean of our outcome variable, in this case Weight,

because when you take the average height and minus the average height, the equation becomes $\mu_i = \alpha + \beta \cdot 0$ (which simplifies to $\mu_{i} = \alpha + 0$)

Reason for this rescaling

What does this now mean:

  • If we do rescale so things make more sense, it means we can set scientifically reasonable priors:
    • we can set $\alpha$, our intercept, to be whatever we think the population average weight is
    • BUT, we need to give that prior a big standard deviation (an SD of 10 would be a variance of 100):
      • this is to emphasise that our alpha prior is based on some knowledge, but we don't want it to have too much say
      • If our alpha is the average weight in kg,
      • and height is measured in cm,
      • then our beta represents how many kg are gained for each cm gained
Another bit of rescaling
  • You can also set $\beta$ to always be positive (there's no point having a slope where as height goes up weight goes down!):
  • one example of a distribution where it's always positive would be the Log Normal Distribution:
    • $\beta \sim \text{LogNormal}(0,1)$
    • a distribution where if you took the logarithm of all of the values of it, the distribution would be normal
(Little Aside on Priors)
  • There isn't such a thing as a correct prior. Rather there's a prior that is scientifically or statistically justifiable:
    • So you need a scientific or statistical argument for each prior
    • That justification HAS TO come from information OUTSIDE OF the data (Just like everything else in the model, the justification comes from information other than the sample itself)
    • If you use the model itself that's just cheating and will give massive false positives
  • Simple linear models, the priors don't really matter, you could do whatever and it would be fine,
  • but later on it will matter, so start now,
  • especially in later more complex models when priors do start to come from the data

Posterior

  • The posterior is a joint distribution of $\alpha \beta \sigma$
  • In practice, rather than using grid approximation, we use an approximation called Gaussian approximation (continuous, but imposes a Gaussian shape), also called Quadratic approximation or Laplace approximation

Validated Model and Analyse Data

  • Validate model: We would like to take simulated data from that scientific model and put it in the statistical model to check it works (this is the bare minimum):

    • Linear models are fairly predictable, but more complicated ones aren't, and really need this
    • A really strong test for this is simulation-based calibration
  • The parameters ($\alpha, \beta, \sigma$)are not independent of each other, so you cannot interpret them independently!

  • So instead, push out Posterior predictions of the model, and describe/interpret them!

Posterior Predictive Distribution

  1. Plot the sample
  2. Plot your posterior mean (remember, the mean line is not our prediction, the whole distribution is our prediction)
  3. Plot the uncertainty of the mean (the high probability compatibility region)
  4. Plot the uncertainty of the predictions

Categories, Curves, & Splines

2023

  • This lecture expands the model from the previous lecture: you've now got multiple estimands (the true relationships we're trying to describe) and multiple estimators (our statistical models, which produce our guesses at those relationships)
  • Also how to process the posterior distribution to assess how good it is.

  • This lecture is for:

  • Categories
  • Curves (Splines)

Categorical Variables

  • We usually do a different linear relationship for each category

  • Things that are the same as the linear relationships:

  • Draw a DAG (think of your scientific model)
  • Remember that your causes aren't in the data, you won't work out what causes what without thought.

  • So then you turn your DAG into functions to show how one variable affects another.

  • Things can have direct and indirect causes,
  • they can also have unobserved causes,
    • you can ignore these unobserved causes unless variables share them!

Causal Effects

  • We need to think about the difference between the causal effect and the direct causal effect.
  • The causal effect = the total effect (direct plus indirect)

How can you code categorical variables?

Methods:

  1. Using dummy and indicator variables
  2. Using index variables. The advantages of these over dummy and indicator variables:

    1. simpler for code
    2. easier to specify priors
    3. and extend easily to multilevels
  3. Then when you are comparing differences between categories:

    • You need to look at the contrast (the distribution of the difference between the categories)
    • It's never legitimate to just compare the overlap between parameters (when plotting the distributions over each other)
    • (What does this mean? I think it means it's not just mean minus mean, it's entire distribution minus entire distribution - see the sketch after this list)
  4. especially in later more complex models when priors do start to come from the data (this doesn't count as cheating in the way we mentioned above), and the boundaries blur even more in machine learning models.
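
Here's the sketch referred to above for a contrast. The two vectors stand in for posterior samples of the mean outcome in each category; in a real analysis they would come out of your fitted model rather than rnorm():

post_cat1 <- rnorm(1e4, 45, 2)     # hypothetical posterior samples, category 1
post_cat2 <- rnorm(1e4, 42, 2)     # hypothetical posterior samples, category 2
contrast <- post_cat1 - post_cat2  # the distribution of the difference, sample by sample
quantile(contrast, c(0.055, 0.5, 0.945))   # summarise the whole contrast distribution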

Fitting Our Model

Our model so far:

$W_{i} \sim \text{Normal}(\mu_{i},\sigma)$

Weight is distributed using a normal distribution, with a given mean, and a given standard deviation

so this is to describe our probability statement: $\text{Pr}(W_{i}|\mu_{i},\sigma)$ - the probability of a given weight is conditional on its mean (which depends on height) and the standard deviation

$\mu_{i} = \alpha + \beta(H_{i} - \bar{H})$

A given mean weight is equal to the intercept, plus the slope times the difference between a given height, and the mean height (our rescaled height)

$\alpha \sim \text{Normal}(60,10)$

Our prior for our intercept is that it is normally distributed, with a mean of 60 kg and a standard deviation of 10 kg

this describes a probability statement of: $\text{Pr}(\alpha)$

$\beta \sim \text{LogNormal}(0,1)$

Our prior for our slope is that it has a log normal distribution, where the log of the slope has a mean of 0 and a standard deviation of 1

this describes a probability statement of: $\text{Pr}(\beta)$

$\sigma \sim \text{Uniform}(0,10)$

our prior for our standard deviation is that it has a uniform distribution, with a range from 0 to 10

this describes a probability statement of: $\text{Pr}(\sigma)$

So the probability statement for our Posterior probability distribution is defined as:

$\text{Pr}(\alpha,\beta,\sigma|W,H)$

The probability of alpha and beta and sigma as conditional on the data Weight and Height

Ultimately all the unknowns have a joint posterior distribution, they don't have separate distributions. You have to consider ALL the combinations of all the potential values.

Grid Approximation

This is an example of considering all the combinations of all the possible values of alpha beta and sigma as above.

If each of those three has 100 possible values, then 100 * 100 * 100, that = 1,000,000 combinations to calculate
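
A quick R illustration of why that blows up (the grid ranges here are arbitrary):

grid <- expand.grid(
  alpha = seq(30, 80, length.out = 100),
  beta  = seq(0, 2, length.out = 100),
  sigma = seq(0.1, 10, length.out = 100)
)
nrow(grid)   # 100 * 100 * 100 = 1,000,000 combinations to evaluate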

Approximation of Posterior Distribution - Gaussian/Quadratic/Laplace

In practice we don't generally use grid approximation

It turns out in quite a lot of situations, posterior distributions are roughly (approximately) Gaussian in shape.

So we use this approximation called Gaussian approximation, also called Quadratic or Laplace approximation (instead of the grid approximation).

Four. Validate This Statistical Model. and Five. Analyse The Data.

  • Validate model: We would like to take simulated data from that scientific model and put it into the statistical model to check it works (this is the bare minimum to check that the scientific model and the statistical model work together)

  • We are testing to see if the statistical model can get out the values of what we put in

  • Linear models are fairly predictable, but more complicated ones aren't, and really need this testing

  • A really strong test for this is called simulation-based calibration

Getting the quadratic approximation of the posterior

  • So first we simulate some data, like we've done above by setting our priors for alpha, beta, sigma
  • Then one method to get the quadratic approximation of the posterior distribution is to use the quap package in R

    • this is the validation step, it takes the probability statements we've defined above as its arguments
  • To turn this validation into analysing the real data, we just provide the quap code our real data rather than our simulated data

Another aside on statistical interpretation
  1. The parameters ($\alpha, \beta, \sigma$) are not independent of each other, they affect each other, so you cannot interpret them independently!

So rather than trying to interpret them separately, just generate posterior predictions from the model, and describe and interpret those.

Interpreting our model with the Posterior Predictive Distribution

Method:

  1. Plot the observed data
  2. Plot your posterior mean
    • If you take the posterior distribution of the intercept and the slope, and take the mean of that
    • this is the most plausible line
    • but remember it's not this line that's our prediction, the whole distribution is that.
  3. Plot the uncertainty of the mean (the high probability compatibility region)
    • This area is saying, if you have to use a line to represent the distribution, the best line is within this area
  4. Plot the uncertainty of the predictions
    • There is some residual uncertainty, the scatter, the standard deviation, so you can plot that around further.

Categories, Curves, & Splines - Video Four

Summary:

This lecture went beyond linear regression for continuous datasets, and started talking about how you apply it to categorical causal data. How does being in one category or another affect a continuous outcome?

He also brought in some of how you can take multiple causal effects into account, as in, the direct causal effect your variable of interest has on the outcome variable, and the indirect effect it has via some other variable.

He introduced some new terminology:

  • Estimand - The underlying true relationship between two variables we are trying to ascertain
  • Estimate - Our current best guess as to what that estimand relationship is

He explained there were two main ways in stats to code categorical variables: there's the dummy plus indicator variables approach, and the index variables approach. He explained why he preferred the index variables approach, and how to do it.

We talked about how to compare between two categories to get the effect of being in one category or another. Again we talked about the importance of making your comparison between the whole distribution of each category, and not doing something like comparing the mean of each distribution

We also talked about non-linear relationships, as in when the relationship isn't a straight line. There are two main approaches here to fit a curve. The first is to use a polynomial: you add extra terms to your straight-line equation (something times x squared, or x cubed). This gives you some wild curves. The second is to use a spline. This is more fine-grained: you apply a function to just a small bit of the curve, then a different function to the next small bit, which is less likely to cause problems. However, both of these approaches mean that whilst your model can fit the curve, it probably can't tell you anything useful outside the range of the data. Polynomials especially can give wildly inaccurate predictions even near the ends of the range. Both splines and polynomials are not telling you anything true about the actual scientific relationship between the variables.

Whilst a polynomial still uses the straight line equation, the alpha plus beta times x, and adds on extra bits like plus beta-two times x^2, a spline uses something different. It does alpha plus weight times B (B, not beta); B is the basis function, and it says where on the x-axis each weight actually applies.

Notes:

So the last lecture introduced linear models, and explained how linear models don't have to represent some underlying truth to be useful and make good predictions. You can use linear models to make predictions about things that don't even have linear relationships. This lecture continues with linear models, and looks at some of these non-linear things you can do with them.

The first example is how to deal with categorical (discrete) datasets - using index variables. The second example is how to deal with curves, as in, not straight linear relationships - using polynomials and splines (the lecturer prefers splines).

This lecture is about the scientific models more than the statistical ones. However, you need to think about the statistical model whilst creating the scientific one.

Categories

If your cause is a categorical variable, rather than continuous, generally what you do is stratify by category. What this means is you fit a separate line for each group in the categorical variable.

Scientific vs Statistical Relationships

There is a difference between how a model is scientifically related and how it is statistically related.

When we are talking about a scientific relationship in this context we mean a causal relationship. This is different from a statistical association.

When we talk about the causes behind a dataset, those causes aren't built into the data itself. You can't just read the causes out of the data. You might be able to see what the data implies the causes are, but you won't know exactly. For example, you might see an apparent relationship between variables A and B. How do you know whether A causes B, B causes A, or a separate unknown variable C affects both A and B? All of these potential causal relationships could produce the same data with the same association.

Example DAG

    graph LR
    H((H: Height)) --> W((W: Weight))
    S((S: Sex)) --> W
    S -->H

In the graph above, Sex directly influences height and weight separately. Sex also indirectly influences weight, through its effect on height. Height directly influences weight also, but its direct effect is "contaminated" by that indirect effect from sex too.

Example DAG in Function Statements and Text

$H = f_{H}(S)$

Height is equal to some function for height, that is a function of sex

$W = f_{W}(H,S)$

Weight is equal to some function for weight, that uses height and sex

Example What is our question?

What scientific question are we trying to work out?

  • The causal effect of height on weight?
  • Or the causal effect of sex on weight? (so that's direct and indirect effects, including the height bit)
  • Or the direct causal effect of sex on weight?

These different questions will have different statistical models. Answering these questions in statistical terms means that, say for the first one, if you change height by a certain amount, how will weight change?

Estimands VS Estimates

  • Estimand is the thing we are trying to work out, the relationship between two variables we are trying to describe
  • Estimate is our current guess as to what the estimand is

Coding Categorical Variables

There are two main ways (and loads of other ways) to code variables.

  1. A classic/common approach is to use dummy plus indicator variables.
  2. Another approach is index variables

In the dummy plus indicator approach, say you have a variable with 10 categories in it, you'd end up with 10 separate variables in your code, with each cell containing a value of yes/no (or true/false, or 1/0)

In the index variables, with the same variable with 10 categories, your code ends up with one variable, with each cell containing a value between 1 and 10.
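
As a rough sketch of the difference in R (d$group here is a hypothetical factor with 10 levels):

# index coding: one column, values 1..10
d$G_index <- as.integer(d$group)

# dummy/indicator coding: an intercept column plus a 0/1 column for each non-baseline level
G_dummies <- model.matrix(~ group, data = d)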

Index Variables

The advantage for index variables, is that they are a bit cleaner to code, you don't keep adding extra variables. You don't have to rewrite models to add new categories. They are also easier to specify priors for. They also work in multi-level models.

So this means that in your model, you can have an index variable with, say, 4 categories.

When assigning values to parameters, a parameter's value will then depend on which category we are looking at.

i.e:

In your code, say you are looking at the intercept:

$\alpha$.

This would become the intercept:

$\alpha_{\text{CATEGORY}[i]}$

allowing you to have $\alpha_{1}, \alpha_{2}, \alpha_{3}, \alpha_{4}$, which you plonk into the model as appropriate.

Your outcome variable would look like:

$y_{i} \sim \text{Normal}(\mu_{i},\sigma)$

The outcome variable for observation i is distributed normally, determined by the mean for that observation's category, and the standard deviation

$\mu_{i} = \alpha_{\text{CATEGORY}[i]}$

The mean for a certain category, is determined by the intercept for that certain category

Example Model - Weight, Sex, Height

$W_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_{i} = \alpha$

Our outcome, weight, is distributed normally, determined by the mean and the standard deviation

The mean is determined by the intercept alpha

If we want to take into account how the categorical variable affects weight, we'll need to edit alpha

Aside on intercepts in categorical variables

In these models the alpha is still referred to as the intercept. This is despite the fact there isn't actually a slope

When we analyse by categorical variables, we often make a vector of these alphas. This means we can then give alpha a subscript, telling it that alpha is dependent on the sex of the person we are analysing.

$W_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_{i} = \alpha_{S_{i}}$ (can also be written $\alpha_{S[i]}$)

This new version of alpha says it depends on the sex of the i-th person

It doesn't mean that there are hundreds of alpha values, one for each person; there are still just two, one for each biological sex. It's just that at each person's row of data, there's a key telling the code which value of alpha to look up.

To create an actual statistical model we also need priors:

$W_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_{i} = \alpha_{S[i]}$

$\alpha_{j} \sim \text{Normal}(60,10)$

Our prior for alpha, with a sub j showing there may be different values for alpha, is that it is distributed normally, with a mean of 60, and a standard deviation of 10

$\sigma \sim \text{Uniform}(0,10)$

If you are using quap (the quadratic approximation function from the rethinking package) in R to make this model, all you would need to do is:

  1. Make sure your data set had your categorical variable
  2. Provide the parameters in quap ($\alpha$), the categorical variable ([S])
    • a becomes a[S], as in the sketch below
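
Putting those pieces together, a sketch of what that quap call could look like (W for weight and S for the sex index are assumed column names; the priors are copied from the model above):

library(rethinking)

dat <- list(W = d$weight, S = d$sex_index)   # S coded as 1 or 2; column names are assumptions

m_SW <- quap(
    alist(
        W ~ dnorm(mu, sigma),
        mu <- a[S],              # one intercept per category
        a[S] ~ dnorm(60, 10),
        sigma ~ dunif(0, 10)
    ), data = dat
)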

Aside on differences between categories

If you want to look at the difference between categories (in our example, the difference between sex = male, and sex = female) then you need to compute something called contrast

You cannot compare just the overlap. You need to compute the distribution of the difference between the categories. This is what the contrast is.

This distribution of differences, this contrast, is the estimand we've been trying to work out.

To expand on this: you can't just look at the overlap between two distributions, you need to look at the distribution of the differences between them. This also means you can't compare two confidence (compatibility) intervals directly; instead you compute the distribution of the difference and summarise that with its own interval.
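
A sketch of computing that contrast, reusing the hypothetical m_SW fit from the sketch above:

post <- extract.samples(m_SW)          # samples from the joint posterior
diff_a <- post$a[, 1] - post$a[, 2]    # full distribution of the difference, not a difference of means

mean(diff_a)    # average contrast
PI(diff_a)      # compatibility interval of the contrast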

Direct Causal Effect - How do you take indirect effects into account?

$W_{i} \sim \text{Normal}(\mu_{i}, \sigma)$

$\mu_{i} = \alpha_{S[i]} + \beta_{S[i]}.(H_{i} - \bar{H})$

So this is the same linear model as a continuous linear regression, looking at the relationship between Weight and Height, but we've said that $\alpha$ (the intercept) and $\beta$ (the slope) both depend on the sex ($S[i]$), with $\bar{H}$ (the mean value of Height) used to centre the height variable

Then to make a contrast between the two sexes, you compute the posterior predictive for women, and a separate one for men (so not just the mean line, but the entire distribution). Then you subtract one distribution from the other.

This gives you a new distribution, which is the difference in the relationship between Height and Weight between the two sexes. This distribution tells you at each height, whether or not men are heavier than women, or women are heavier than men.
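
A sketch of that contrast at each height, assuming a fit called m_SHW with the sex-indexed intercepts and slopes above (if the model centres height on $\bar{H}$ as a data value, supply that in the data lists too; the height range is made up):

library(rethinking)

H.seq <- seq(from = 130, to = 190, length.out = 50)          # a grid of heights
muF <- link(m_SHW, data = list(H = H.seq, S = rep(1, 50)))   # posterior mean weight for women
muM <- link(m_SHW, data = list(H = H.seq, S = rep(2, 50)))   # posterior mean weight for men
mu_contrast <- muF - muM                                     # at each height, a distribution of differences

plot(H.seq, apply(mu_contrast, 2, mean), type = "l",
     xlab = "height", ylab = "weight contrast (F - M)")
shade(apply(mu_contrast, 2, PI), H.seq)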

What about an alternative analysis?

In the notes above, we used two separate models, one to look at the direct plus indirect effect, one to look at the direct effect. That's fine, and doable, but how about instead we use just one statistical model to estimate the whole system?

If you do that, you then use the joint posterior you create to separate out those two estimands (the direct plus indirect, and the solo direct). You do this by simulating each estimand separately.

You could do this using quap in R. It runs those two linear models simultaneously.

So both of these two approaches are the same amount of work. The first method you do more work at the start. The second method you do more work at the end.

Little Summary On Categorical Variables

Remember: when you create posterior distributions for your separate categories and want to compare them, the difference between them is the effect on your outcome of being in one category rather than another. You don't want to compare the mean of one distribution with the mean of another; instead, you want the distribution of differences between the entirety of the two distributions, and then summarise that. ALWAYS SUMMARISE AT THE END

Updated Notes on Curves - 2023-27-01

How do we use a linear model to produce a non-linear shape?

Two approaches:

  1. Polynomials - DON'T DO THIS
    • Do weird global smoothing
    • They put weird curves where there shouldn't be one
  2. Splines - DO THIS
    • Do local smoothing
    • Splines are nice
    • Simplest version is a basis spline (B-spline)
    • They have a load of locally trained functions
    • Mainly useful to allow you to stratify by continuous variables

Non Linear Relationships - Curves

Loads of relationships in nature aren't really linear. Linear models can fit them anyway!

BUT

Remember if you get your linear model to fit a non linear thing, then you are no longer describing the truth of the relationship between variables. You can still produce accurate predictions. But your model isn't actually describing the mechanics of the real relationship.

One method to fit non-linear relationships: Polynomials

Be cautious with these! There's plenty of ways to do this incorrectly. When you do them right they're fine, and appropriate, but you often won't do it right! (or at least, I wont!)

What is a polynomial?

In a polynomial you start with your equation for a line:

$\mu_{i} = \alpha + \beta.x_{i}$

mean for the outcome accounting for i, equals the intercept plus the slope times the x variable accounting for i

and you stick extra bits on the end:

$\mu_{i} = \alpha + \beta_{1}.x_{i} + \beta_{2}.x_{i}^{2}$

these extra bits add higher-order versions of the predictor (x squared, x cubed, x to the power of 4, etc.), each with its own slope
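
As a sketch, a quadratic (second-order) polynomial regression in quap might look like this (y and x are placeholder variable names, and the priors are just illustrative):

library(rethinking)

d$x_s  <- standardize(d$x)    # standardise the predictor first
d$x_s2 <- d$x_s^2             # the extra, squared term

m_poly <- quap(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- a + b1*x_s + b2*x_s2,
        a ~ dnorm(0, 1),
        b1 ~ dnorm(0, 1),
        b2 ~ dnorm(0, 1),
        sigma ~ dexp(1)
    ), data = d
)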

you can make basically any shape with this

but a big problem, is the uncertainty at the edges can be massive

you can use it to describe the data, but can't use it to extrapolate beyond the sample range

The next issue with a polynomial is that, given half the chance, it will assume the relationship has a big curve in it, even when it's a simple line. It forces a curve, and will give you dumb predictions for what happens beyond the range.

Adding more and more higher order bits to the end of your equation, will allow you to fit the data you have more and more, but will make your predictions less and less useful.

Little aside on polynomials

Rather than shoving more and more bits onto your normal distribution, maybe you shouldn't be using a normal distribution?

Maybe you should be using a log normal? or something else? - will come back to that later in the course though.

Another method to fit non-linear relationships: Splines

Splines are generally better.

There will be more detail on splines later.

Splines are a big family of functions. They do local smoothing in each region.

In a spline, only the points in a certain region determine the fit of the model in that region. In a polynomial, every point in the sample influences the curve everywhere.

Again, remember, these have no mechanistic truth, they are still a "wrong" model that gives you useful predictions.

B-Splines

B Splines stand for Basis Splines

They are linear regressions, but regressions on synthetic variables:

$\mu_{i} = \alpha + w_{1}.B_{i,1} + w_{2}.B_{i,2} + w_{3}.B_{i,3} + \dots$

So here you have:

$w$ is a weight parameter, which is like a slope

$B$ is a basis function, a synthetic variable. Note this is $B$, not $\beta$ - this isn't the slope. The basis function specifies in which region of the x-axis the corresponding weight $w$ applies.
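
A sketch of fitting a B-spline with quap, using splines::bs() to build the synthetic basis columns (x, y, the number of knots, and the priors are all placeholder choices):

library(rethinking)
library(splines)

num_knots <- 15
knots <- quantile(d$x, probs = seq(0, 1, length.out = num_knots))
B <- bs(d$x, knots = knots[-c(1, num_knots)], degree = 3, intercept = TRUE)   # basis functions

m_spline <- quap(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- a + B %*% w,       # one weight per basis column
        a ~ dnorm(0, 1),
        w ~ dnorm(0, 1),
        sigma ~ dexp(1)
    ),
    data = list(y = d$y, B = B),
    start = list(w = rep(0, ncol(B)))
)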

Remember like polynomials, splines aren't saying anything mechanistically true, they're just doing some random curve fitting, if you've got some scientific knowledge a biological model might be better.

Summary on Curves and Splines

  • A lot of the world's relationships aren't linear, so having these non-linear functions are useful!
  • But remember that polynomials and splines do not give you that underlying truth
  • Building models using scientific information is helpful

This talk is about spurious correlations. Correlations (also called associations) are common, which is why we shouldn't rely on them to work out which relationships are causal.

You cannot rely on statistical models alone to explain causal relationships. You need that scientific model.

Lecture 5

  • Estimand - The scientific question we are trying to answer
  • Estimator - Our code our methods
  • Estimate - Our result, our guess as to what the estimand is

This lecture aims to introduce the four forms of confounds. A confound is anything that confuses you. You can start with these four confounds, and build very complex models through combining them.

The Fork

Fork DAG:

$X \leftarrow Z \rightarrow Y$

In this $Z$ is the confounder. It is the common cause of $X$ and $Y$.

But in our stats model we are interested if there is a causal relationship between: $X$ and $Y$

Relationship in Maths

$Y \not\perp\!\!\!\!\perp X$

The variables $X$ and $Y$ are not independent of each other.

They are not independent because they share a common cause ($Z$)

$Y \perp\!\!\!\!\perp X \mid Z$

However if you stratify for $Z$, then there is no association between $X$ and $Y$, they are independent.

Stratifying for $Z$ means, if $Z$ is a yes/no variable, look at the relationship between $X$ and $Y$ when $Z$ is yes, and separately look at the relationship when $Z$ is no.
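
A quick simulation sketch of a fork with a yes/no $Z$ (all the numbers are made up):

set.seed(1)
n <- 1000
Z <- rbinom(n, 1, 0.5)            # a coin-flip common cause
X <- rnorm(n, mean = 2*Z - 1)     # Z pushes X around
Y <- rnorm(n, mean = 2*Z - 1)     # Z pushes Y around; X has no effect on Y

cor(X, Y)                         # clearly associated overall
cor(X[Z == 1], Y[Z == 1])         # roughly zero within each stratum of Z
cor(X[Z == 0], Y[Z == 0])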

Aside on bernoulli distributions.

A bernoulli distribution is just a yes/no distribution. It's a coin flip (but doesn't have to be a fair coin!)

Fork Example:

The example in North America is that rates of marriage are associated with rates of divorce (more marriages = more divorces).

Fork Example DAG:

$M \stackrel{?}{\rightarrow} D$

Our estimand is: what is the relationship between marriage and divorce?

What could be a confounder?

  • Age at marriage? $A$:
    • Places where you marry later, have lower divorce rates
graph TD
    M[Marriage]
    D[Divorce]
    subgraph Marriage to Divorce
    M--->|?|D
    end
    A--->M
    A-->|?|D
    A[Age at marriage]

In the DAG above we are saying we think Age at Marriage $A$ is a fork, a confounder.

$M \leftarrow A \rightarrow D$

So to estimate the actual relationship between $M$ and $D$, we need to stratify by $A$ age of marriage.

Aside on stratifying for a continuous variable

It's simple enough to do it for a binary variable, but a continuous one is more complicated.

In a linear regression, you add to your model another term for your confounder, and a slope for it.

$D_{i} \sim \text{Normal}(\mu_{i},\sigma)$

Divorce ($D$) conditional on i, is normally distributed, determined by the mean conditional on i, and the standard deviation

$\mu_{i} = \alpha + \beta_{M}M_{i} + \beta_{A}A_{i}$

The mean conditional on i is defined using the linear regression (the equation for a straight line): the intercept ($\alpha$), plus the slope for marriage times the value of the marriage variable (both conditional on i), plus the slope for age at marriage times the value of the age-at-marriage variable.

Statistical Model for The fork

Now we need some priors for this example.

It's very good practice (a very good default) to standardise the data variables, for linear regressions, especially when there's more than one predictor variable. By standardising your variable, the mean becomes 0 and the standard deviation becomes 1 (you do this by taking each variable, subtracting the mean from it, then dividing by the standard deviation)

Standardising Data

Flashcard type:basic
How do you standardise a variable's values in linear regression? You subtract its mean from it, then divide by its standard deviation
In linear regression/stats, what does standardising a variable result in? It turns the value for the variable into a z-score
In linear regression/stats, what is a z-score? It is the value for how many standard deviations a variable is away from the mean
In stats/linear regression, when a standardised variable has a value of 0, what does that mean? It means it equals the mean!
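
A sketch of doing the standardisation by hand (rethinking's standardize() does the equivalent), using the age-at-marriage variable from the example code below:

standardise_by_hand <- function(x) (x - mean(x)) / sd(x)

A_std <- standardise_by_hand(d$MedianAgeMarriage)
round(c(mean = mean(A_std), sd = sd(A_std)), 3)   # mean 0, standard deviation 1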

If you standardise your variables, it makes the computation easier, and makes picking sensible priors easier.

If you standardise your variables, you need to think differently about how you assign priors. If you used the same approach as before, "$\alpha \sim \text{Normal}(0,10)$", where you have a standard deviation ($\sigma$) of 10, which equals a variance of 100, then with standardised variables you are assuming there will be a very strong relationship.

Instead, consider something closer to:

$\alpha \sim \text{Normal}(0,0.2)$

$\beta_{M} \sim \text{Normal}(0,0.5)$

$\beta_{A} \sim \text{Normal}(0,0.5)$

$\sigma \sim \text{Exponential}(1)$

This keeps your priors more within the range of likely relationships, from strong relationships in either direction, to no relationship.

Example model analysis

Here is the model we are trying to run:

$D_{i} \sim \text{Normal}(\mu_{i},\sigma)$

$\mu_{i} = \alpha + \beta_{M}M_{i} + \beta_{A}A_{i}$

$\alpha \sim \text{Normal}(0,0.2)$

$\beta_{M} \sim \text{Normal}(0,0.5)$

$\beta_{A} \sim \text{Normal}(0,0.5)$

$\sigma \sim \text{Exponential}(1)$

Code:
# model
library(rethinking)

# assuming d is the WaffleDivorce data from the rethinking package (its column names match)
data(WaffleDivorce)
d <- WaffleDivorce

dat <- list(
    D = standardize(d$Divorce),
    M = standardize(d$Marriage),
    A = standardize(d$MedianAgeMarriage)
    )

m_DMA <- quap(
    alist(
        D ~ dnorm(mu, sigma),
        mu <- a + bM*M + bA*A,
        a ~ dnorm(0,0.2),
        bM ~ dnorm(0,0.5),
        bA ~ dnorm(0,0.5),
        sigma ~ dexp(1)
    ), data = dat)
An aside on Exponential distributions

Why use an Exponential distribution for sigma (the standard deviation)?

  • exponential distributions are constrained to be positive
  • the only information in an exponential is the average displacement from 0
  • this means that for a standardised variable, an Exponential(1) prior on the standard deviation says the average standard deviation is expected to be about 1
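
A tiny sketch of what that prior implies: draws from an Exponential(1) are always positive and average about 1.

sigma_prior <- rexp(1e4, rate = 1)
c(minimum = min(sigma_prior), average = mean(sigma_prior))   # all positive, mean roughly 1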

The Pipe

The Pipe's other name is a mediator

Statistically the pipe is very similar to a fork; structurally the pipe is very different.

$X \rightarrow Z \rightarrow Y$

In this, $X$ and $Y$ are associated, but it is that $X$ affects $Z$, and $Z$ passes on some of $X$ onto $Y$.

$Y \not\perp\!\!\!\!\perp X$ : $X$ and $Y$ are not independent, they are associated.

$Y \perp\!\!\!\!\perp X \mid Z$ : Again, once you stratify by $Z$, there is no association; $Y$ is independent of $X$.

To do the stats for this, it's the same as the fork

Example Pipe DAG

If we were trying to find the TOTAL causal effect of antifungal treatment on plants:

graph TD
    T[T: Treatment]
    H0[H0: Height at Start]
    H1[H1: Height at End]
    F[F: Fungus]

    H0 --->H1
    F--->H1
    T===>H1
    T===>F

So we expect the treatment should directly affect the fungus (that's the whole point). And we'd expect height at start to affect height at end. Treatment may also affect height at end directly (positively or negatively).

How do we adjust for this?

You DON'T stratify for Fungus $F$ ! Which seems odd, cos that is how you would adjust for a fork. But if you did that here, you would block the pipe.

If you stratified by fungus, you'd remove whatever part of the treatment's effect on final height operates through the fungus.

Instead you just ignore fungal status in your model.

Aside on Post Treatment Bias

If you were to make this mistake, this is a type of confounding called post treatment bias.

It arises when you stratify your analysis based on some consequence of the treatment. This is usually a very bad idea!

Consequences of treatments shouldn't be in your statistical model. You should include them in your scientific model though!
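
A small simulation sketch of post-treatment bias, loosely in the spirit of the fungus example (all the numbers are made up):

set.seed(1)
n  <- 1000
H0 <- rnorm(n, 10, 2)                          # height at start
treat  <- rep(0:1, each = n/2)                 # antifungal treatment (T in the DAG)
fungus <- rbinom(n, 1, 0.5 - 0.4*treat)        # treatment reduces fungus (F)
H1 <- H0 + rnorm(n, 5 - 3*fungus)              # fungus stunts growth (H1)

coef(lm(H1 ~ H0 + treat))             # treatment effect shows up
coef(lm(H1 ~ H0 + treat + fungus))    # stratifying by fungus masks it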

TODO: Read "How Conditioning on Posttreatment Variables can ruin your experiment and what to do about it" Montgomery 2018

The Collider

$X \rightarrow Z \leftarrow Y$

In this, $Z$ is the collider

$Y \perp\!\!\!\!\perp X$ : $X$ and $Y$ are not associated in the total sample; they share no causes.

Instead they both influence $Z$

$Y \not\perp\!\!\!\!\perp X \mid Z$ : However this time, once you stratify by $Z$, $X$ and $Y$ are now associated! That doesn't mean there's a true association, just that you've confused yourself by accounting for $Z$.

Kind of an "anti fork".

The reason is that $Z$ depends on both $X$ and $Y$. If you then start stratifying based on $Z$, you are also kind of stratifying based on $X$ and $Y$.

When we stratify by $Z$, when we learn $Z$, we are constraining both $X$ and $Y$, that means that there has to be an association between $X$ and $Y$.

If Z is a collider instead of a fork, and you add it to the regression, you've added a confounder rather than removing one.
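
A simulation sketch of a collider (again, numbers made up):

set.seed(2)
n <- 1000
X <- rnorm(n)
Y <- rnorm(n)
Z <- rbinom(n, 1, plogis(2*(X + Y)))   # Z depends on both X and Y

cor(X, Y)                   # about zero: no real association
cor(X[Z == 1], Y[Z == 1])   # negative: stratifying by the collider induces an association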

An example he gives is:

$A \rightarrow Su \leftarrow Sk$

In the example above an actor could be successful ($Su$) by being attractive ($A$) or by being skilled ($Sk$). So if you try to adjust for/stratify by being successful, you end up with the false conclusion that attractive actors are less skilled (in the high-success group you'll find more attractive, less-skilled actors, as they are the only less-skilled actors that can survive in that group).

Another version of collider bias is one where you create it yourself.

So you can have colliders through selection bias, but you can also have "endogenous colliders" that don't come from selection.

The Descendant

The Descendant is more of a special case

    graph LR
    X[X]
    Z[Z]
    Y[Y]
    A[A]
    X===>Z===>Y
    Z---->A

So in this $A$ is a Descendant of the $Z$ variable.

What this means is that if you stratify by $A$, it's very similar to stratifying by $Z$

Because values for A are dependent on information from Z.

So in this case if you stratified for $A$ it would be like blocking the pipe.

    graph LR
    X[X]
    Z[Z]
    Y[Y]
    A[A]
    X===>Z
    Y===>Z
    Z---->A

In this example above $A$ is still a descendant, just now of a collider

The problem is that lots of the measurements we make are not really the thing we want to measure; they're proxies of the things we really wanted to measure.

We will come back to adjusting for confounders like this later in the course.

You can use your descendants to adjust for unobserved confounding variables too, using something like a factor analysis.

Unobserved Confounds

Unobserved confounds ($U$) are a problem!

Say the relationship we are trying to estimate, our estimand, is the direct effect (just the direct effect, not the total effect) of G on C.

    graph LR
    G[G]-.->P[P]
    G--->C
    P-.->C
    U-.->P
    U-.->C

To do so, we would need to block the pipe of $G \rightarrow P \rightarrow C$, as that is the indirect effect of G on C.

This is a puzzle for the next lecture!

Good and Bad Controls - Video Six

    graph LR
    G[G]-.->P[P]
    G==>C
    P-.->C
    U-.->P
    U-.->C

What we are trying to do here is understand the direct effect of $G$ on $C$. So that is the direct effect of $G$ on $C$, and not the indirect effect of $G$ on $C$ via $P$.

If $U$ weren't there, then $P$ would just be a mediator, and you'd just need to stratify by $P$ to isolate the direct effect.

But $U$ is there! This means $P$ is a mediator and also a collider. That means stratifying by $P$ will open up a connection between $G$ and $U$, creating an unwanted association.

This means at the moment you could estimate the total effect, but you can't estimate the direct effect. Or at least, you can only estimate a biased direct effect.

Lecture 6

  • Randomising without the randomisation
  • There is a way to do this

  • If you have two variables, x and y, and they are both affected by the confound U, then you can account for this by stratifying by U.

  • There is a mathematical framework to do this

  • You do this by marginalising over the control variables (you average over the control variables)

  • So the causal effect of X on Y is NOT (in general) the coefficient relating X to Y. Instead it is the distribution of Y when we change X, averaged over the distributions of the control variables (in this case U)

  • Coefficients are not enough for the causal effect, you need the marginalised results. "Marginal Effects"

  • do-calculus is the system for transforming a DAG into maths

  • do is what allows us to work out if an analysis is possible

Backdoor Criterion

This is a shortcut that lets you apply do-calculus just by looking at a DAG

  • This is a rule that allows us to work out which variables to stratify by, to yield an estimate for our estimand.

  • Identify all paths on the DAG that connect the treatment you're interested in to the outcome:

  • You don't have to follow arrows! These paths can go with or against arrows.
  • But then what you want to do is select the paths that leave the treatment you are interested in by going against an arrow (as in, leave along a path whose arrow points back into the treatment). These paths are the non-causal paths, or backdoor paths.
  • You now need to find a "minimal adjustment set" that closes or blocks all backdoor paths if you want to estimate the causal effect

  • So you can work out what to adjust for by using do-calculus and the backdoor criterion (see the dagitty sketch after this list)

  • An alternative to doing this would just be to do "full luxury bayes", where you use all variables, but in separate submodels instead of a single regression
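
One way to automate the backdoor criterion is the dagitty R package (a sketch, applied to the marriage/divorce fork from the earlier lecture; dagitty itself isn't part of these notes):

library(dagitty)

# A is age at marriage, M is marriage rate, D is divorce rate
dag <- dagitty("dag{
    A -> M
    A -> D
    M -> D
}")

adjustmentSets(dag, exposure = "M", outcome = "D")   # returns { A }: stratify by A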

Picking your control variables

  • There are good ones, and bad ones
  • The "throw everything in the spreadsheet into the model" approach is wrong!
  • So is removing all the collinear ones
  • So is removing all pre-treatment measurements

TODO: Cinelli, Forney, Pearl 2021 A Crash Course in Good and Bad Controls

  • The main point here is that you should think very carefully about which control variables to pick. You should not just think about the variables you have collected, but also the unobserved variables that might mess up these relationships

  • An example of a very bad control would be accidentally inducing case-control bias: don't stratify by a descendant of the outcome!