# Load the package - assuming it's been installed (check https://gianluca.statistica.it/software/survhe/ for guidance)
library(survHE)
# Loads the fictional dataset
data("survtrial")
# Turns the dataset into a "tibble" object and then inspects it
=data %>% as_tibble()
data
data## # A tibble: 367 × 8
## ID_patient time censored arm sex age imd ethnic
## <int> <dbl> <int> <int> <int> <int> <int> <int>
## 1 1 0.03 0 0 1 34 1 4
## 2 2 0.03 0 0 0 29 4 2
## 3 3 0.92 0 0 0 30 4 5
## 4 4 1.48 0 0 0 31 5 1
## 5 5 1.64 0 0 0 27 4 1
## 6 6 1.64 0 0 0 37 5 4
## 7 7 1.7 0 0 0 31 2 5
## 8 8 1.7 0 0 1 30 2 4
## 9 9 1.74 0 0 0 40 3 5
## 10 10 1.77 0 0 0 34 3 3
## # ℹ 357 more rows
A (quick) tutorial on how to use survHE for survival modelling in health economics
This tutorial is based on the Journal of Statistical Software paper describing the philosophy underlying survHE
and its main functionality. All the technical details (including the statistical background) are described in the paper, while this tutorial is only meant to provide some annotated description of the main important functions provided in survHE
.
The example uses a fictional dataset — in fact the actual data are created by “digitising” an existing Kaplan Meier curve. survHE
is based on the output produced by DigitizeIt, but there are other possibilities — and in the future we plan on including different sources for pre- and post-processing from within survHE
.
The current version of survHE
uses modern R tools and packages for data management and programming (based on the tidyverse
). Thus when survHE
is loaded in the R workspace, relevant functions from dplyr
and “piping” (%>%
) are immediately available for the user.
As a first example, we run a survival model using a few distributional assumptions. The main survHE
function is fit.models
, which can be used to specify:
- the model “formula”, which describes the relationships between the observed data and (potentially) relevant covariates;
- the data, which are used to estimate the model parameters and the relevant survival quantities (including the survival curves);
- the distribution to be used to approximate sampling variability for the outcome;
- the method that should be used to fit the model.
Survival modelling using Maximum Likelihood Estimators
A typical call to fit.models
is like the following.
# First, defines the vector of models to be used
<- c("exp", "weibull", "gamma", "lnorm", "llogis", "gengamma")
mods
# Then the formula specifying the linear predictor
<- Surv(time, censored) ~ as.factor(arm)
formula
# And then runs the models using MLE via flexsurv
<- fit.models(formula = formula, data = data, distr = mods) m1
Here, we are considering 6 different parametric distributions: Exponential, Weibull, Gamma, log-Normal, log-Logistic and Generalised Gamma. Because we’re specifying the value for the argument distr
in a vector (with the 6 models), fit.models
will store the results for all of them into a single object (in this case, m1
). This is not necessary and alternatively, we can use a separate call to fit.models
for each distributional assumption.
The model specified in the object formula
essentially assumes the following specification. \[\begin{eqnarray*}
t_i & \sim & f(t_i \mid \boldsymbol\theta) \\
\boldsymbol\theta & = & (\mu, \alpha) \\
g(\mu_i) & = & \beta_0 + \beta_1 \mbox{Arm}_i.
\end{eqnarray*}\] Here, \(f(t_i \mid \boldsymbol\theta)\) is the density function associated with each of the 6 models specified; \(\boldsymbol\theta = (\mu, \alpha)\) is a vector of model parameters, including a location parameter \(\mu\) and a (set of) ancillary parameters \(\alpha\), indicating a measure of shape or variance for the underlying distribution. Moreover, we assume a generalised linear regression on the location parameter (based on a suitable link function \(g(\cdot)\), typically, the \(\log\)), which is based on an intercept \(\beta_0\) and a single covariate \(\mbox{Arm}_i,\) whose effect is quantified by the coefficient \(\beta_1\). Notice that because the formula specifies as.factor(arm)
, we essentially consider this as a categorical variable — taking values 0 for the controls and 1 for the individuals in the treatment arm.
In the R formula, we need to specify not only the observed outcome (in this case \(t=\) time
), but also the censoring indicator (in this case \(d=\) censored
). This is crucial in the context of survival modelling.
Survival modelling using Integrated Nested Laplace Approximation
For several reasons (many of which you would expect from me…), it is often useful to be Bayesians about modelling, particularly if the overarching objective is to feed the output of the statistical building block to an economic model and, downstream, to a decision-making process.
survHE
uses two Bayesian inferential engine (section 2.3 of the paper explains why). The first one is Integrated Nested Laplace Approximation (INLA), which is a very efficient way of obtaining estimates from the marginal posterior distributions of the model parameters. While INLA currently fits only 4 parametric survival models, it is generally as fast as the MLE counterparts.
In survHE
, the user can specify that they want to use INLA by simply adding an option method="inla"
to the call to fit.models
, something like the following.
# Runs the models using INLA using the formula and data specified above, but with the subset of distributions that INLA can fit
<- fit.models(formula = formula, data = data, distr = c("exp","wei","lno","llo"), method="inla") m2
Survival modelling using Hamiltonian Monte Carlo
Markov Chain Monte Carlo (MCMC) methods are arguably the default mode of fitting Bayesian models. Software such as BUGS
, JAGS
, Nimble
and, more recently, Stan
all implement some MCMC algorithm to obtain samples from the posterior joint distribution of all the model parameters. In particular, rstan
uses a version of MCMC, called Hamiltonian Monte Carlo (HMC), which can be very efficient in exploring the posterior distribution.
survHE
implements rstan
versions of all the parametric models described in the paper (which are recommended by NICE). Similarly to INLA, all the user needs to add is the option method="hmc"
to instruct fit.models
to run rstan
in the background.
# Runs the models using HMC using the formula and data specified above
<- fit.models(formula = formula, data = data, distr = mods, method="hmc") m3
Several more options are described in details in the survHE
paper.
Printing and plotting with survHE
survHE
is meant to simplify the workflow for health economic modellers by standardising the process (e.g. through a common and simplified call to define the models to be fitted, as shown above). To this aim, it comes with “methods” that can be deployed for survHE
objects (such as m1
, m2
and m3
created by the code above), to summarise and visualise the process.
These methods take the output created by either flexsurv
, INLA
or rstan
and format it in a standardised way that the modellers can use to feed into the economic model. For instance, we can summarise the model coefficients using the print
method.
# Prints the output of the three objects (by default shows the first of the models stored in the survHE objects)
print(m1)
##
## Model fit for the Exponential model, obtained using Flexsurvreg
## (Maximum Likelihood Estimate). Running time: 0.021 seconds
##
## mean se L95% U95%
## rate 0.0824203 0.00828355 0.0676839 0.100365
## as.factor(arm)1 -0.4656075 0.15427131 -0.7679738 -0.163241
##
## Model fitting summaries
## Akaike Information Criterion (AIC)....: 1274.576
## Bayesian Information Criterion (BIC)..: 1282.387
print(m2)
##
## Model fit for the Exponential model, obtained using INLA (Bayesian inference via
## Integrated Nested Laplace Approximation). Running time: 0.5566 seconds
##
## mean se L95% U95%
## rate 0.0823381 0.00841851 0.067348 0.0999875
## as.factor(arm)1 -0.4651301 0.15298910 -0.776177 -0.1721019
##
## Model fitting summaries
## Akaike Information Criterion (AIC)....: 1276.579
## Bayesian Information Criterion (BIC)..: 1288.295
## Deviance Information Criterion (DIC)..: 1278.255
print(m3)
##
## Model fit for the Exponential model, obtained using Stan (Bayesian inference via
## Hamiltonian Monte Carlo). Running time: 1.0856 seconds
##
## mean se L95% U95%
## rate 0.082491 0.008755 0.066959 0.100632
## as.factor(arm)1 -0.463966 0.159636 -0.783629 -0.171070
##
## Model fitting summaries
## Akaike Information Criterion (AIC)....: 1276.579
## Bayesian Information Criterion (BIC)..: 1288.295
## Deviance Information Criterion (DIC)..: 1274.847
In this case, we have not specified much information in the prior distributions for the Bayesian versions of the models. In addition, the dataset is relatively “mature” (i.e. there are not very many censored cases) and thus the estimates for the model parameters are fairly similar in all three models. NB: this is not guaranteed in all circumstances. In fact, it is absolutely possible that, particularly in the presence of data subject to a large proportion of censored individuals (as is often the case e.g. in immuno-oncology), genuine prior information may be crucial and instrumental to producing sensible estimates!
survHE
offers various options in the call to the print
method. For example, we can visualise the output in the “native” formatting, i.e. that produced by the package that has been used to run the analysis. This can be obtained by adding the option original=TRUE
in the call to print
. However, the point of using survHE
is in fact to have the output in a standardised format, so as to avoid confusion (e.g. when different software presents data and estimates using different names or terminology). In addition, when the underlying survHE
object contains more than one model, the call to print
automatically generates the summary statistics for the first one (in this case, the Exponential model, as this is the first to be specified in the vector mods
or in the call to fit.models
for m2
). However, we can select any one of the models by adding the option mods=...
, for instance
Perhaps even more interestingly, health economic modellers will be likely interested (almost exclusively) in the survival curves obtained from the fitted models. The user can visually inspect the models in terms of the survival curves, using the plot
method.
The current (and future!) version(s) of survHE
use the plotting capabilities of ggplot2
. As such, all graphs generated by survHE
are in fact ggplot2
objects and thus they can be manipulated and post-processed at will. The expert user can customise the graphs in essentially any way they want. The less-expert users don’t actually need to do much and survHE
does allow some customisation — for example, the call above specifies the options lab.model
, which tells survHE
what labels to use to describe the various models being depicted; and the option lab.profile
, which defines specific labels for the individual “profiles” (i.e. the combination of the covariates). More options are described in the survHE
paper and we will produce more tutorials (in fact, please do contribute your survHE
examples, if you can — you can do so by opening “issues” on the GitHub repository or by sending me an email).
One example of ggplot2
potential is given in survHE
by the model.fit.plot
function, which depicts graphically the measures of model fit, based on information criteria.
Probabilistic sensitivity analysis with survHE
One of the crucial features of a health economic evaluation is represented by the need of performing probabilistic sensitivity analysis (PSA). This is one of my obsessions and I think one of the places where being Bayesian becomes natural and thus the preferred option, particularly when the end product is to use the statistical model for the purposes of an economic evaluation.
survHE
has a dedicated function, make.surv
that does the process of PSA in the context of the survival models. In this case, the idea is to propagate the uncertainty in the model parameters (e.g. \(\boldsymbol\beta=(\beta_0,\beta_1)\) in the current example) all the way to the elements that feed into the economic model (which typically are the survival curves for the different interventions).
In reality, the user may not even need to worry too much about how make.surv
works — more advanced and statistically sophisticated modellers may use all the flexibility and power of this function to obtain a characterisation of the uncertainty underlying the model estimates (this is based on bootstrap for the MLE implementation and on samples from the posterior in the Bayesian versions). In any case, survHE
also has plotting capabilities to visualise the PSA output. This can be done by going through make.surv
and then applying the specialised plotting method psa.plot
, or directly using the general plot
method with some suitable options.
Notice that make.surv
(which is called either explicitly in the left-hand panel, or implicitly through the plot
method in the right-hand panel) allow for a direct extrapolation of the survival curves — this is one of the key feature of survival modelling in health economics. In this case, the survival curves, together with 95% interval estimates, are shown over a time horizon spanning up to 63, while the observed data are only up to about time = 21.
Of course, our recommendation is to perform the whole of the modelling (that includes the statistical model, the economic model, the PSA and the final decision-making process) in R. This reduces the possibility of errors (for instance through copy and paste across different software) and more importantly takes full advantage of a computing environment that is fit for purpose (as opposed to spreadsheets that are great at doing what they are designed to do, but can be spectacularly bad at tasks they shouldn’t be asked to do!). In addition, performing the complete analysis in R makes it possible to link to other tools (to remain in our own backyard, this, this or this — but many more are available and mentioned here).
But: if the user really has to, survHE
allows them to automatically write the simulated values for the survival curves onto an xlsx
file.
# Writes the results of the PSA to an Excel file, containing the simulations for the survival curves
write.surv(psa, file = "temp.xlsx")
These can then be post-processed and used to construct the rest of the economic model, for instance in terms of a Markov model — again, if you must this can be done in Excel, but we recommend very strongly against doing so and insted suggest that, once you’ve come this far, you can continue to work in R!
Conclusions
There are many more options and possibilities when working with survHE
— and many more are under development. These include using “flexible” methods (e.g. Royston-Parmar Splines or Poly-Weibull distributions) and there are many options for customisation of the results/outputs. These can be explored in the survHE
tutorial paper in JSS and, as mentioned above, we welcome additional case-studies or vignettes generated by real experience of the modellers.