Preparing for AI in Health technology assessment: some questions from a very wise old man


Gianluca Baio

Department of Statistical Science   |   University College London

g.baio@ucl.ac.uk


https://gianluca.statistica.it

https://egon.stats.ucl.ac.uk/research/statistics-health-economics

https://github.com/giabaio   https://github.com/StatisticsHealthEconomics  

@gianlubaio@mas.to     gianluca-baio    


The Road Ahead: Future Trends in Artificial Intelligence and Automation for Health Technology Assessment, UCL

14 January 2026

Check out our departmental podcast “Random Talks” on Soundcloud!

Follow our departmental social media accounts + magazine “Sample Space”

Disclaimer…

  • This is probably the most unstructured talk I’ve ever written/given in my career!
    • I’m OK with this, though: ideas are not completely solidified yet – and not just in my mind…
    • Still, it wouldn’t be very serious of me to impart my wisdom on this…
  • Instead, I’ll try and raise some questions and doubts and, hopefully, give some ideas!
  • I’ll use the perspective of a statistician
    • Of course, there is (just about… 😉) more than just Statistics to the whole HTA process
    • BUT: the questions I will pose are, I think, extremely valid and relevant to the whole of our “industry”

AI & HTA: Where are we?

  • Obviously a big and hot topic. For example, at the last ISPOR conference:
    • 2 short courses on the Sunday
    • One of the topics in the Plenary session + at least 8 parallel sessions on the Monday
    • At least 5 parallel sessions on the Tuesday
    • One of the main focus points of the Plenary session + at least 2 parallel sessions on the Wednesday
  • BUT: I think it’s fair to say that the use of AI in HTA has so far been characterised by a few important issues!
    • Inflated expectations
    • Poorly specified use-cases
    • A mismatch between what tools are optimised for and what HTA actually requires

Inflated expectations

  • LLMs are optimised for linguistic plausibility: they are trained to minimise “next-token loss” (statistical, not causal, prediction – see the sketch below) and rewarded for fluency and confidence
  • Conversely, although HTA also relies on inference and extrapolation, it must be centred on explicit recognition of uncertainty over confident point statements, and on being wrong in a visible way rather than right for the wrong reasons
  • Systematic bias in how AI performance may be perceived
    • A fluent answer feels correct, especially to non-specialists
  • Dangerous in HTA where
    • Many errors are subtle
    • Outputs are often difficult to immediately falsify
    • Mistakes propagate downstream into decisions
  • “Benchmark leakage” (= LLMs trained on the very questions & answers used to benchmark them…)
    • AI is evaluated on tasks that resemble exams or coding puzzles, not on end-to-end processes
    • Real HTA workflows involve messy data, conflicting evidence and value judgements
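
To be concrete about what that objective actually is, here is the standard autoregressive cross-entropy (“next-token”) loss in its generic textbook form (not tied to any specific model):

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
\]

Nothing in this objective rewards calibrated uncertainty about decision-relevant quantities; it only rewards plausible continuations of the observed text.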

Poorly specified use-cases

  • HTA tasks are rarely atomic
    • “Do a systematic review” actually involves several distinct judgement calls
    • “Build a model” involves conceptual, statistical and normative steps
    • “Write a report” encodes institutional conventions and implicit assumptions
  • Without precise specification, failures are hard to diagnose, accountability is unclear and validation becomes impossible
  • Decision ownership:
    • Who is responsible when AI output is wrong?
    • Is AI advising, filtering, drafting or deciding?
  • In HTA, ambiguity of responsibility is unacceptable, but many proposed AI uses depend on it…

Mismatch in optimisation objectives

Operational

| LLMs/AI | HTA |
|---|---|
| Tend to collapse uncertainty into a single narrative | Needs uncertainty explicitly represented |
| Optimise average accuracy | Often cares about worst-case or tail behaviour |
| Encourage a single "best answer" | Requires structured disagreement and scenario analysis |

Cultural

  • AI tools often assume iteration is cheap and mistakes are acceptable
  • HTA decisions are high-stakes and constrained
    • e.g. sunk costs of replacing an existing technology

Question 1: Where can AI help us?

Coding

  • This is probably the least controversial – but also the most misunderstood area!
  • Where AI can help
    • Boilerplate code generation (reusable with little alteration, e.g. country adaptation)
    • Refactoring or language translation (e.g. from BUGS to Stan to R)
    • Writing unit tests (which we should do a lot more of, by the way – see the sketch at the end of this section…)
    • Exploratory scaffolding (not final models)
    • Setting clear constraints on responses (“You are a health economic modeller working with R. Acknowledge uncertainty and tell me if you’re not 100% sure”…)
  • Where AI is generally dangerous
    • Silent logical errors/Invented package behaviour (hallucinations)
    • Plausible-looking but incorrect conclusions/flaws
    • Overconfident handling of edge cases (missing data, censoring, treatment switching)
  • For HTA, the key issue is that errors are most often not random: they tend to appear in exactly the places that matter most for decision-making!

  • Bottom line: we’ve come a loooooong way and the core of our “industry” is now more and more proficient in statistical programming – we cannot backtrack on that just because we think that AI can do that job for us!
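
As a deliberately trivial illustration of the kind of unit testing mentioned above (the function, numbers and expectations are all hypothetical, not taken from any real model):

```r
# Minimal sketch: a unit test for a discounting helper, the sort of check that
# should wrap any code an LLM drafts before it enters a cost-effectiveness model
library(testthat)

# Discount a vector of yearly values at a given annual rate (year 1 undiscounted)
discount <- function(x, rate = 0.035) {
  x / (1 + rate)^(seq_along(x) - 1)
}

test_that("discounting behaves as expected", {
  expect_equal(discount(c(100, 100), rate = 0), c(100, 100))  # zero rate: unchanged
  expect_equal(discount(100, rate = 0.035), 100)              # first year: undiscounted
  expect_lt(discount(c(100, 100))[2], 100)                    # later years shrink
})
```

The point is not the example itself: the human still defines what “correct” means. An LLM can draft the scaffolding, but not the expectations.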

Reporting/automation

I’m personally a lot more skeptical about this…

  • Much of what people present as “AI-powered reporting” is
    • Templating
    • Parameter substitution
    • Re-packaged dynamic documents
  • Existing tools that are arguably part of the “standard”(?) workflow, such as Quarto and R Markdown, can already do these tasks
    • They integrate directly with the analysis – no need to use “external” tools! (A minimal example is sketched below)
  • The incremental benefit of LLMs here is modest, while the risk (loss of traceability, hallucinated explanations) is real
  • Feels a bit like change for the sake of change…
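
For instance, the parameter-substitution part is already a one-liner with the standard toolchain; a minimal sketch (the report file and parameter names are hypothetical):

```r
# Minimal sketch: rendering the same parameterised report for two country
# adaptations with rmarkdown alone; the analysis code lives in the .Rmd,
# so the outputs remain fully traceable to the model
library(rmarkdown)

for (ctry in c("UK", "Italy")) {
  render(
    input       = "ce_report.Rmd",                         # hypothetical parameterised report
    params      = list(country = ctry, disc_rate = 0.035),
    output_file = paste0("ce_report_", ctry, ".html")
  )
}
# (quarto::quarto_render() offers the same via its execute_params argument)
```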

Evidence generation: SLRs and modelling

  • This is perhaps where AI enthusiasm is highest in HTA. But: caution is needed, in my view!…

Systematic literature reviews

  • AI can be helpful with tasks such as de-duplication, prioritisation or drafting of data extraction

  • But:

    • Key tasks such as inclusion/exclusion are normative, not purely textual
    • Bias assessment requires (human!) judgement
    • Missing a “good” study may not carry the same penalty as including a “bad” one…

Modelling

  • For economic models, structure matters more than syntax, assumptions matter more than fit and interpretability is non-negotiable (the sketch below illustrates the first point)

  • Replacing

    • Conceptual model development
    • Structural sensitivity analysis
    • Expert elicitation

    with AI would fundamentally misunderstand what HTA models are for

  • Not sure we’d gain so much – modelling of NMA problems is well established in HTA
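
To make the “structure matters more than syntax” point concrete, here is a toy sketch (entirely made-up numbers and a deliberately crude three-state model, no half-cycle correction): the same inputs give very different ICERs depending on a single structural assumption about how long the treatment effect lasts.

```r
# Toy sketch (hypothetical numbers): a crude three-state annual-cycle Markov model.
# The only thing that changes between scenarios is a *structural* assumption:
# how long the treatment effect on progression is assumed to last.
run_model <- function(effect_years, horizon = 20, disc = 0.035) {
  p_prog <- 0.15; p_die_pf <- 0.02; p_die_prog <- 0.20  # annual transition probs (SoC)
  hr    <- 0.60                                          # treatment hazard ratio on progression
  u     <- c(0.80, 0.55, 0)                              # utilities: PF, Progressed, Dead
  k     <- c(2000, 8000, 0)                              # annual care costs by state
  c_trt <- 15000                                         # annual treatment cost while effective

  trace <- function(treated) {
    s <- c(1, 0, 0); qaly <- 0; cost <- 0
    for (t in 1:horizon) {
      on <- treated && t <= effect_years
      pp <- if (on) p_prog * hr else p_prog
      s <- c(s[1] * (1 - pp - p_die_pf),
             s[1] * pp + s[2] * (1 - p_die_prog),
             s[1] * p_die_pf + s[2] * p_die_prog + s[3])
      d <- 1 / (1 + disc)^t
      qaly <- qaly + d * sum(s * u)
      cost <- cost + d * (sum(s * k) + if (on) s[1] * c_trt else 0)
    }
    c(qaly = qaly, cost = cost)
  }

  trt <- trace(TRUE); soc <- trace(FALSE)
  (trt[["cost"]] - soc[["cost"]]) / (trt[["qaly"]] - soc[["qaly"]])  # ICER
}

run_model(effect_years = 20)  # effect sustained over the whole horizon
run_model(effect_years = 2)   # effect assumed to wane after two years
```

Choosing between those two scenarios is a judgement about disease biology and the available evidence, not a coding task, which is exactly why structural sensitivity analysis cannot simply be delegated.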

What else?

  • Where AI may add genuine value (with caveats…)
    • Screening and triage in systematic reviews (ranking, not deciding – see the sketch below)
    • Information extraction with human validation
    • Protocol drafting (again: draft, not authority!)
    • Exploratory scenario generation (“what assumptions should I stress-test?”)
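
As a sketch of what “ranking, not deciding” might look like in practice (records and scores are entirely made up; the scores could come from any classifier or LLM):

```r
# Minimal sketch (hypothetical data): relevance scores are used only to *order*
# records for human screening in an SLR; nothing is excluded automatically
records <- data.frame(
  id    = 1:5,
  title = c("RCT of drug A vs placebo", "Cost study in an unrelated disease",
            "Phase III trial of drug A", "Editorial on drug pricing",
            "Observational study of drug A"),
  score = c(0.92, 0.11, 0.88, 0.05, 0.64)  # e.g. from an LLM or a simple classifier
)

records <- records[order(-records$score), ]                      # rank by predicted relevance
records$priority <- ifelse(records$score >= 0.5,
                           "screen first", "screen later")       # flag, don't exclude
records
```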

  • Notably, these are all assistive roles, not decision-making ones!
  • On a side note…
    • We need to train people as well as LLMs – modellers should know what and how to ask!
    • This is in my view a duty of all the elements of our “industry” (academia, regulators, consultancy, sponsors, etc)

Question 2: What can we do to make AI more helpful to us?

General-purpose vs domain-adapted AI models

“Absolute” models

  • Using a general LLM “as is”
    • Ask ChatGPT to do something for you
  • Zero or minimal domain constraints

“Hybrid” models

  • Constrained systems
    • Specialised models trained in a “safe environment”
  • Infrastructure cost
    • Fine-tuning open models (e.g. LLaMA-class) on HTA-specific text
    • Grounding LLMs in HTA guidelines, model documentation, previous submissions and methodological papers

Investment?

  • Data curation: expensive in human time, not computing time
  • Fine-tuning costs: tens of thousands £ (not millions)??
  • Ongoing infrastructure: modest compared to trial costs

\(\Rightarrow\) The real bottleneck is expert labour, not GPUs!

So what shall we invest in?

  • Not a one-off project, but rather an actively curated methodological public good

  • If AI is to be used responsibly in HTA, it likely requires

    • Publicly (co-)funded, transparent tools
    • Strong academic–regulatory partnerships
    • Industrial involvement
    • Shared benchmarks and failure cases

  • If we don’t do this, in my view the risk is that we end up with
    • Black-box vendor tools
    • Asymmetric information
    • Regulatory capture by tech providers

  • Events like today can (should?!) pave the way for all-round collaborations towards this aim!

Do we really, really, want another revolution?

  • HTA already has
    • Mature methods
    • Hard-won best practice/workflows that still allow for innovative methods
    • Institutional memory
  • The danger is not necessarily that AI fails, but that it
    • Encourages method skipping
    • Devalues careful modelling
    • Reframes rigour as “inefficiency”
  • A conservative yet progressive aim
    • AI should embed existing decision models, probabilistic thinking and transparent uncertainty