Introduction to statistical concepts

Author

Gianluca Baio

Published

18 October, 2022

Preliminaries

What are these notes about and for?

The aim of these notes is not to provide a comprehensive and detailed introduction to all of statistical theory. That, of course, requires more than a single document or book, and you will need to work your way through the many lectures you will attend during your MSc Programme in Health Economics and Decision Science. What these notes are meant to provide is a critical introduction to the most important concepts that you will encounter specifically in STAT0015, STAT0016 and STAT0019. Of course, the last two modules are not compulsory, so you may not take them. Nevertheless, it is useful to have all these concepts at hand. Arguably, statistical reasoning and analysis are central to all forms of what can be generically called “Health Economics” — both in terms of modelling for cost-effectiveness/utility analysis and when performing econometric modelling. It is then crucial that the concepts described in these notes are clear to you.

The structure of these notes guides you through the very basics of the philosophy underpinning the ideas of sampling and data collection. Suitable methods for summary and visualisation of the data are also presented in Chapter 1. Then Chapter 2 describes several statistical models that are commonly used in the applications you will encounter. These include Bernoulli and Binomial models to describe sampling variability in individual or aggregated binary data, Poisson models for counts, Normal distributions for continuous, symmetric phenomena and more specific distributions (e.g. t, Chi-squared and the Gamma-family), which are the basis for many of the procedures you will see during your Programme. Of course, the presentation is far from exhaustive and there are many more models you may be exposed to in specific modules.

Chapter 3 and Chapter 4 present the central tools of statistical inference — the methods of estimation and testing. These are presented while highlighting the fundamental distinctions among different approaches (e.g. Bayesian, Likelihood and Frequentist), which are often confused or conflated (especially the last two) into an integrated theory, which essentially does not exist. Again, the mathematical sophistication is kept to a low level — these notes are not meant to teach you all the technical details. The point is rather to try and help you understand the basic principles and why things work the way they do, over and above how. Throughout the notes, there are some parts in which it is unavoidable to use mathematics to make the point; but you are not expected to learn proofs of theorems or anything similar — only to understand the process.

Finally, Chapter 5 discusses regression analysis, which is a general tool used in many areas of statistical modelling. Again, we dispense with the most complicated technical details and try to convey the most important ideas underpinning the development of linear and generalised linear models.

Computer software

Throughout the notes, we demonstrate some of the computational problems using the freely available software R, which you can use on UCL machines. You can also download it onto your own machines from CRAN, i.e. the main repository in which the main software, as well as all the relevant “packages”, is stored. This is available at https://cran.r-project.org/index.html.

Notice that you do not have to learn R when reading or studying these notes. Code and output are typeset in grey boxes, something like the following.

# Defines a variable
x=4
# Defines a vector
y=c(1,2,3,4)
# Computes a function of a given input
m=mean(y)
# Returns the output
m
[1] 2.5

You are not expected to have learnt it in preparation for the exam you will have to take before starting the Programme. The code is only presented to help you understand what is actually going on — and you can use it to replicate some of the analyses presented in the notes. Moreover, note that while attending the various modules, you will encounter several statistical software packages, including R, Stata, Matlab and perhaps others. While each has its own syntax and, at times, idiosyncrasies, once you learn to use one of these proper statistical programmes, you will be able to switch to the others — because their common trait is the possibility of scripting the workflow, using functions and packages.

This, in addition to their advanced computational engines, is what makes them more appropriate than spreadsheet calculators, e.g. MS Excel, which are often used, particularly in the field of cost-effectiveness modelling. These are not ideal and have several shortcomings. So, while you will see them at times in the various modules, you are not encouraged to use them for “real” work — and we will see several applications of statistical modelling in the more appropriate software in STAT0015, STAT0016 and STAT0019.

Scientific writing (hints for your dissertation)

These notes are written using quarto, which can be used to combine plain text with advanced formatting and, crucially, R code. In this way, you can annotate and describe the whole analysis process in a single file, where you describe all the technical details as well as the general presentation of the problem. This is something you may consider for your final year dissertation.

Symbols, notation, etc

Although, as mentioned above, we keep the mathematical sophistication to a bare minimum level, we do need to use specific symbols and terminology, for the sake of clarity. Generally speaking, statistical notation distinguishes mainly between observed or observable variables, which we indicate in upper-case Roman letters, e.g. \(Y\), \(W\), \(T\); and unobservable parameters, which are indicated using Greek letters, e.g. \(\theta\), \(\mu\), \(\sigma\).

When data are observed (and thus their realised value is known to us), we usually indicate them in lower-case Roman letters, e.g. \(y\), \(w\), \(t\). When we consider a vector of variables or parameters, we typeset them in bold, e.g. \(\boldsymbol{Y}\) indicates a vector of observable variables. We often describe this fully as \(\boldsymbol{Y}=(Y_1,\ldots,Y_n)\), which can be used to indicate that we have a vector of length \(n\). We apply this to parameters too; for example, if a model is indexed by two parameters \(\mu\) and \(\sigma\), we write that the parameter vector is \(\boldsymbol\theta=(\mu,\sigma)\).

As mentioned above, a crucial part of statistical modelling is to associate variables with probability distributions, e.g. to describe uncertainty or sampling variability. We do this using the notation \[y \sim \mbox{Name of the distribution}(\mbox{Name of the parameters}),\] where the symbol “\(\sim\)” is read “is distributed as”, or more appropriately “is associated with a XXX distribution with parameters YYY”. Alternatively, we may write \(p(\mbox{Name of variable} \mid \mbox{Name of parameters})\) to indicate the probability distribution associated with a variable and indexed by some parameters. An example is \[p(r \mid \theta,n)=\left( \begin{array}{c}n \\ r\end{array} \right) \theta^r (1-\theta)^{(n-r)}\] (see Section 2.1). The symbol “\(\mid\)” (read “given” or “conditionally on”) is used to indicate that the argument to its left is the main variable of interest, while the argument(s) to its right are used as parameters or known values.

The notation \(\Pr(Y=y\mid \theta)\) indicates the probability that the variable \(Y\) takes on the value \(y\) — so this is a slightly different concept to the probability distribution \(p(y\mid\theta)\) — the former is a single value, while the latter is an entire distribution.
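This distinction can be checked numerically in R, using the built-in function dbinom for the Binomial distribution (see Section 2.1). The values of \(\theta\) and \(n\) below are chosen purely for illustration.

```r
# Illustrative (made-up) parameter values
theta <- 0.3   # probability of "success"
n <- 10        # number of trials

# Pr(Y = y | theta, n): a single number, e.g. for y = 4
pr_y <- dbinom(4, size = n, prob = theta)
pr_y

# p(y | theta, n): the entire distribution, i.e. one
# probability for each possible value y = 0, 1, ..., n
p_y <- dbinom(0:n, size = n, prob = theta)
sum(p_y)   # the probabilities sum to 1
```

The first call returns one probability (a single value), while the second returns the whole collection of probabilities over all possible values of \(y\).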

When we use sample data to estimate a model parameter, we may use the “hat” notation, e.g. \(\hat{\mu}\) (read “\(\mu\) hat”) can be used to indicate a function of the data \(\boldsymbol{Y}\) that we use to give our best guess as to what the underlying value for the parameter \(\mu\) is. This is not universal and some other terminology is possible to indicate an estimate for a parameter. When these are used, we will define them appropriately in the text.
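As a standard example of this idea (not specific to any particular model in these notes), the sample mean is a function of the data that can serve as an estimate \(\hat{\mu}\) of an underlying mean \(\mu\). The data below are made up purely for illustration.

```r
# A small illustrative sample (made-up values)
y <- c(2.1, 3.4, 2.8, 3.0, 2.7)

# The sample mean: a function of the data used as our
# best guess ("mu hat") of the underlying parameter mu
mu_hat <- mean(y)
mu_hat
[1] 2.8
```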

Occasionally, we use “text blocks” to include specific bits of text and alert you to their importance. These look something like the following.

Important

This is a block of text that you should read carefully. This may be because the content is very important, or perhaps because it is subtle and requires some thinking before you fully understand its meaning. Or maybe it is a technical note, explaining some more advanced details — in which case, you do not need to learn all the (possibly mathematical) details by heart. As usual, just try and understand the deeper meaning of the text and maths included in the block.

List of mathematical symbols

  • \(\mbox{E}[Y]\): expected value of a variable \(Y\) (see Section 1.5.1). This indicates the mean of a variable and is often indicated with the symbol \(\mu\).
  • \(\mbox{Var}[Y]\): variance of a variable \(Y\) (see Section 1.6). This is often indicated with the symbol \(\sigma^2\).
  • \(\displaystyle \sum_{i=1}^n y_i\): the sum of \(n\) values \(y_1,\ldots,y_n\). Here \(y_i\) indicates one such generic value and the index \(i=1,\ldots,n\) (read: “\(i\) goes from 1 to \(n\)”) identifies each of them.
  • \(\displaystyle \prod_{i=1}^n y_i\): the product of \(n\) values \(y_1,\ldots,y_n\).
  • \(\displaystyle \int_a^b f(x)\, \mathrm{d}x\): the integral of the function \(f(x)\) of the variable \(x\), over the interval \([a;b]\). This is used to compute the area under the curve described by the function \(f(x)\), for values of the \(x-\)axis ranging in \([a;b]\). You will not encounter much of this, although the concept is often discussed in STAT0019.
  • \(\exp\) is the exponential function, \(\exp(x)=e^x\), with properties \(\exp(0)=1\) and \(\exp(1)=e\approx 2.7182818\).
  • \(\log\) is the logarithm function, i.e. the inverse of the exponential function. This means that \(\log\left(\exp(x)\right)=x\), i.e. if you apply the log to exp you essentially cancel out these two functions and are left with the argument to the inner function (exp). The log function only applies to positive numbers; also \(\log(1)=0\) and \(\log(e)=1\).
  • \(n!\): the factorial function indicates the product \(n(n-1)(n-2)\cdots 1\), for any positive integer \(n\). This is used in the definition of some probability distributions, including the Binomial (see Section 2.1), the Student’s t (see Section 2.4) and the Gamma family of distributions (see Section 2.5).
  • \(x \in [a;b]\): read “\(x\) is in the interval \([a;b]\)”. The symbol \(\in\) indicates membership of a group (or set, or interval).
  • \(\rightarrow\): read “tends to” or “approaches”. This is used in expressions such as \(n\rightarrow \infty\) (read “\(n\) approaches infinity”).
  • \(f{'}(x)\): read “\(f\) prime of \(x\)”. This indicates the first derivative of the function \(f\), which measures the change in the value of the function for infinitesimally small changes of the argument \(x\). Derivatives are crucial concepts in differentiation and calculus and are used to determine maxima or minima of a given function. This is, technically, the notation introduced by Giuseppe Luigi Lagrangia, an Italian mathematician (better known by the French version of his surname, Lagrange), who developed much of the early versions of calculus.
  • \(f{''}(x)\): read “the second derivative of \(f\)”. This is computed by differentiating the first derivative, so \(f{''}(x) = \left(f{'}(x)\right){'}\).
  • \(\min_{a} f(a)\): the minimum of a given function with respect to its argument \(a\), i.e. the smallest value that the function \(f(\cdot)\) attains as \(a\) varies. The obvious counterpart is \(\max_{a}f(a)\).
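Several of the symbols above have direct R counterparts, which you can use to check the properties stated in the list. The vector below is made up purely for illustration.

```r
# A small illustrative vector (made-up values)
y <- c(1, 2, 3, 4)

sum(y)          # the summation: 1 + 2 + 3 + 4 = 10
prod(y)         # the product: 1 * 2 * 3 * 4 = 24
exp(0)          # = 1
log(exp(2.5))   # log "cancels" exp, returning 2.5
factorial(4)    # 4! = 4 * 3 * 2 * 1 = 24
```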