Applied Stats For Post Grad Students - Adrian Bowman

This is a concept-heavy course.

Statistical Thinking:

  • How do we design these experiments?
  • How should we collect the data?
  • Is there a point in doing this test?
  • How should we visualise this data?
  • How can we construct appropriate models?
  • Which tools would be appropriate?

Adrian has written this book: www.stats.gla.ac.uk/~adrain/Statistical_Modelling

There's more detail on this site. Far more than the course covers!

How to Design Statistical Questions

  • Start with your question
  • Work on reproducibility

Where do data come from?

We need to make a distinction between:

  • Population
  • Sample

Population

  • All the observations we could possibly ever make

Sample

  • The limited number of observations collected in our experiment
  • You need a sample that properly represents the population!

Is your experiment:

  • Observational: you observe and record the variables of interest, without controlling them
  • Designed: you control how the conditions are assigned, so you are able to identify causal relationships between variables

Randomisation:

  • Simple Random Sampling
  • Systematic Sampling
  • Stratified Sampling
  • Cluster Sampling
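
The four sampling schemes listed above can be sketched in a few lines. This is a minimal Python illustration using only the standard library (the course itself works in R); the population, the strata ("young"/"old") and the clusters ("site0" to "site9") are made up for the example and do not come from the lecture.

```python
import random

def simple_random_sample(population, n, rng):
    """Simple random sampling: every unit has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Systematic sampling: random start, then every k-th unit."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified sampling: a simple random sample inside each stratum."""
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: pick whole clusters at random, keep all their units."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

rng = random.Random(1)
population = list(range(100))                                 # hypothetical unit IDs
strata = {"young": population[:60], "old": population[60:]}   # made-up strata
clusters = {f"site{j}": population[j * 10:(j + 1) * 10]       # made-up clusters
            for j in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_s = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)

print("simple random:", sorted(srs))
print("systematic:   ", sys_s)
print("stratified:   ", strat)
print("cluster:      ", clus)
```

Passing an explicit `random.Random` object with a fixed seed keeps the draws reproducible, which fits the earlier point about working on reproducibility.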

Variability - Confidence Intervals

Standard deviation vs Standard Error

SD - describes how spread out the observations in our sample are around the sample mean

SE - describes how much the sample mean is spread around the true mean; it is estimated by \(SD / \sqrt{n}\)
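
The difference between the two can be checked numerically. The sketch below is in Python with the standard library only (the course itself uses R) and assumes a made-up normal population with mean 50 and SD 10. It computes the SD and SE from one sample, then draws many samples to see how much the sample mean really varies.

```python
import math
import random
import statistics

rng = random.Random(42)
TRUE_MEAN, TRUE_SD, N = 50.0, 10.0, 100   # assumed values, for illustration only

# One sample: SD describes the spread of the observations,
# SE describes the uncertainty in the sample mean.
sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
sd = statistics.stdev(sample)
se = sd / math.sqrt(N)

# Many samples: the spread of their means is what the SE is estimating.
means = []
for _ in range(2000):
    s = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    means.append(statistics.mean(s))
sd_of_means = statistics.stdev(means)

print(f"SD of one sample:         {sd:.2f}")
print(f"SE from that sample:      {se:.2f}")
print(f"SD of 2000 sample means:  {sd_of_means:.2f}")
```

The SE from a single sample should land close to the spread of the simulated sample means, while the SD stays near the population SD however large the sample gets.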

Confidence intervals:

Most of the time (roughly 95% of the time), the thing you are trying to estimate and your estimate of it are within two standard errors of one another.

So if you attach two standard errors either side of your sample mean, you create a range that should catch the true mean.
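
The "two standard errors" rule can be checked by simulation. This is a rough Python sketch (standard library only; the course itself uses R), assuming a made-up normal population with mean 20 and SD 5. It builds the interval "sample mean plus or minus two standard errors" many times and counts how often the interval catches the true mean.

```python
import math
import random
import statistics

rng = random.Random(0)
TRUE_MEAN, TRUE_SD, N, REPS = 20.0, 5.0, 50, 2000   # assumed values

def two_se_interval(sample):
    """Return (lower, upper) = sample mean -/+ two standard errors."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - 2 * se, m + 2 * se

hits = 0
for _ in range(REPS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    lower, upper = two_se_interval(sample)
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / REPS
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```

The proportion should come out close to 0.95, which is where the "most of the time" in the notes comes from.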

we can measure how far away our sample mean is from


Power

Power is a thought experiment: it helps us think through what we are going to do.

The scenario is:

Suppose you have two groups, a control and an experiment, and there really is a difference between these two groups.

Power Definition: the probability that, when you analyse the data, you detect this difference that exists.

What do you need to know to work out this probability?

  • The effect size: how far apart do the means need to be before the difference is important enough?
  • The standard deviation: what is the level of variation present in each population? Can we assume it to be the same across the two populations?
  • Normality: is it reasonable to assume a normal distribution for each population?

How do you calculate this?

You use a power tool: a power calculation.
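
As an illustration of what such a tool computes, here is a minimal Python sketch (standard library only; the course itself uses R) of a two-sided power calculation for the difference between two means. It uses a normal approximation with a common, known SD and equal group sizes; the function name `power_two_means`, the sample size and the 5% significance level are assumptions added for the example, not part of the lecture.

```python
import math
from statistics import NormalDist

def power_two_means(effect, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a difference `effect` between two means.

    Assumes normal data, the same SD in both groups, equal group sizes,
    and a two-sided test at significance level `alpha`.
    """
    z = NormalDist()
    se_diff = sd * math.sqrt(2 / n_per_group)   # SE of the difference in means
    crit = z.inv_cdf(1 - alpha / 2)             # two-sided critical value
    shift = effect / se_diff                    # true difference in SE units
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

# Made-up scenario: means 5 units apart, SD of 10 in each group.
for n in (16, 32, 64, 128):
    print(n, round(power_two_means(effect=5, sd=10, n_per_group=n), 3))
```

With these made-up numbers, power reaches roughly 80% at about 64 observations per group, and it falls back to the significance level when the effect is zero.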

This power tool works for looking at the difference between two means, but what about more sophisticated models?

This is where you simulate data in R with the structure of what you are looking at, fit the model to each simulated data set, and see how often the effect is detected.
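
The simulation idea can be sketched as follows, in Python with the standard library only (the lecture does this in R). To keep the example short it reuses the two-group comparison rather than a more sophisticated model, and it tests the difference with a simple z-style statistic; the effect size, SD, group size and number of repetitions are made-up values.

```python
import math
import random
import statistics
from statistics import NormalDist

def simulated_power(effect, sd, n_per_group, alpha=0.05, reps=1000, seed=1):
    """Estimate power by simulating the experiment many times."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    detected = 0
    for _ in range(reps):
        # Simulate data with the structure we expect to see.
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        # Analyse the simulated data exactly as we would the real data.
        diff = statistics.mean(treated) - statistics.mean(control)
        se = math.sqrt(statistics.variance(control) / n_per_group
                       + statistics.variance(treated) / n_per_group)
        if abs(diff / se) > crit:
            detected += 1
    # Power is the proportion of simulated experiments that detect the effect.
    return detected / reps

power_hat = simulated_power(effect=5, sd=10, n_per_group=64)
print(f"Estimated power: {power_hat:.2f}")
```

For a more complex design, only the "simulate" and "analyse" steps inside the loop change; counting how often the effect is detected stays the same.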

Lecture Two - Models - Linear Models

This starts as an introduction to using R; the notes are in the R file with the same name.

Predictive Power

\(R^2\) measures how well your model explains the variation seen in the dataset: it is the proportion of the variation in the response that the model accounts for.

Using \(R^2\), you can see how useful simpler models are at making predictions.
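
To make the definition concrete, here is a small Python sketch (standard library only; the course itself uses R) that fits a straight line by least squares and computes \(R^2\) as one minus the residual sum of squares over the total sum of squares. The data are invented for the example.

```python
import statistics

def fit_line(x, y):
    """Least-squares straight line; returns (intercept, slope)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    my = statistics.mean(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_resid / ss_total

# Made-up data, roughly linear with some noise.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(x, y)
line_fit = [intercept + slope * xi for xi in x]
mean_only_fit = [statistics.mean(y)] * len(y)   # simplest model: just the mean

r2_line = r_squared(y, line_fit)
r2_mean = r_squared(y, mean_only_fit)
print(f"R^2, straight line: {r2_line:.3f}")
print(f"R^2, mean only:     {r2_mean:.3f}")
```

The mean-only model explains none of the variation, so its \(R^2\) is 0; comparing it with the straight line shows how much each extra piece of model structure buys you.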

There can be two main aims:

  • Can we understand the system?
  • Can we make predictions about what's going to happen?

Proportions

Binomial Distribution

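\(\hat{p}\) is the sample proportion (the number of successes divided by the number of trials), our estimate of the true proportion \(p\).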

When you have to compare proportions, what you are doing is comparing the observed counts with the expected values that you would get if the null hypothesis were true.

2x2 Table

The way you do that is with the Chi-Squared Test. It's called that because the test statistic is compared against a Chi-Squared Distribution.
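
A worked version of the calculation for a 2x2 table, as a minimal Python sketch using only the standard library (the course itself uses R). The counts are invented. The expected count in each cell is row total times column total over the grand total, and the p-value uses the fact that a Chi-Squared variable with 1 degree of freedom is the square of a standard normal. No continuity correction is applied here, so the result can differ from software that applies one by default.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test for a 2x2 table of counts.

    Returns (statistic, p_value, expected_counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    stat = 0.0
    expected = []
    for i in range(2):
        exp_row = []
        for j in range(2):
            # Expected count if the null hypothesis (no association) were true.
            e = row_totals[i] * col_totals[j] / n
            exp_row.append(e)
            stat += (table[i][j] - e) ** 2 / e
        expected.append(exp_row)

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Made-up counts: rows = two groups, columns = yes / no.
observed = [[30, 20],
            [15, 35]]
stat, p_value, expected = chi_squared_2x2(observed)
print("expected counts:", expected)
print(f"chi-squared = {stat:.3f}, p = {p_value:.4f}")
```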

Logistic Regression

Why might you go for a logistic regression rather than a linear one?

A linear model might give you wrong answers, in that it will make predictions outside the scale (below 0 or above 1) that don't make sense.

But also, if you are looking at a binomial proportion, where each outcome is yes or no rather than continuous, then why would you apply a continuous distribution to it? You know it's not continuous, it's binomial.

A logistic model is an adjusted linear model, where you've put the linear predictor \(\eta\) inside a function built from exponentials, \(p = e^{\eta} / (1 + e^{\eta})\), which keeps the proportion between 0 and 1.
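
The effect of that transformation can be seen in a short Python sketch (standard library only; the course itself uses R). The intercept and slope are made-up numbers: the linear predictor runs well outside 0 to 1, while the transformed value always stays inside it.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor on (-inf, inf) to a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def logit(p):
    """Log-odds: the scale on which the logistic model is linear."""
    return math.log(p / (1 - p))

intercept, slope = -3.0, 0.8   # hypothetical coefficients, for illustration

for x in (-10, 0, 5, 20):
    eta = intercept + slope * x   # what a plain linear model would predict
    p = inverse_logit(eta)        # what the logistic model predicts
    print(f"x = {x:>3}  linear predictor = {eta:6.1f}  proportion = {p:.4f}")
```

`logit` is the inverse of `inverse_logit`, so a straight line on the log-odds scale corresponds to an S-shaped curve on the proportion scale.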

Random Effects

How do we take into account multiple factors that can affect our outcomes?

\(y_{bci} = \mu + \epsilon_b + \epsilon_{bc} + \epsilon_{bci}\)

This is a linear mixed-effects model. In R it is fitted with the function:

lme() # Linear Mixed Effects Model (from the nlme package)
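
The structure of the formula above can be sketched by simulating from it. The Python example below (standard library only; the fitting itself would be done with `lme()` in R) treats b, c and i as three nested levels, since the notes do not say what they index, and all standard deviations and group counts are invented. Each observation is the overall mean plus one random term per level.

```python
import random
import statistics

def simulate_nested(mu, sd_b, sd_bc, sd_bci, n_b, n_c, n_i, seed=0):
    """Simulate y_bci = mu + eps_b + eps_bc + eps_bci for a nested design.

    Returns a list of (b, c, i, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for b in range(n_b):
        eps_b = rng.gauss(0, sd_b)                 # shared by everything in b
        for c in range(n_c):
            eps_bc = rng.gauss(0, sd_bc)           # shared within (b, c)
            for i in range(n_i):
                eps_bci = rng.gauss(0, sd_bci)     # unique to each observation
                rows.append((b, c, i, mu + eps_b + eps_bc + eps_bci))
    return rows

# Made-up variance components.
rows = simulate_nested(mu=100, sd_b=4, sd_bc=2, sd_bci=1, n_b=200, n_c=5, n_i=5)
y = [r[3] for r in rows]
print(f"overall mean:     {statistics.mean(y):.2f}")
print(f"overall variance: {statistics.variance(y):.2f}")
print(f"sum of components: {4**2 + 2**2 + 1**2}")
```

The overall variance of the simulated values should be close to the sum of the three variance components, which is the sense in which the model splits the variation across levels.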

Flexible Regression