Preparing for AI in Health technology assessment: some questions from a very wise old man


Gianluca Baio

Department of Statistical Science   |   University College London

g.baio@ucl.ac.uk


https://gianluca.statistica.it

https://egon.stats.ucl.ac.uk/research/statistics-health-economics

https://github.com/giabaio   https://github.com/StatisticsHealthEconomics  

@gianlubaio@mas.to     gianluca-baio    


The Road Ahead: Future Trends in Artificial Intelligence and Automation for Health Technology Assessment, UCL

14 January 2026

Check out our departmental podcast “Random Talks” on Soundcloud!

Follow our departmental social media accounts + magazine “Sample Space”

Disclaimer…

  • This is probably the most unstructured talk I’ve ever written/given in my career!
    • I’m OK with this, though: ideas are not completely solidified yet – and not just in my mind…
    • Still, it wouldn’t be very serious of me to impart my wisdom on this…
  • Instead, I’ll try and raise some questions and doubts and, hopefully, give some ideas!
  • I’ll use the perspective of a statistician
    • Of course, there is (just about… 😉) more than just Statistics to the whole HTA process
    • BUT: the questions I will pose are, I think, extremely valid and relevant to the whole of our “industry”

AI & HTA: Where are we?

  • Obviously a big and hot topic. For example, at the last ISPOR conference:
    • 2 short courses on the Sunday
    • One of the topics in the Plenary session + at least 8 parallel sessions on the Monday
    • At least 5 parallel sessions on the Tuesday
    • One of the main focus points of the Plenary session + at least 2 parallel sessions on the Wednesday
  • BUT: I think it’s fair to say that the use of AI in HTA has so far been characterised by a few important issues!
    • Inflated expectations
    • Poorly specified use-cases
    • A mismatch between what tools are optimised for and what HTA actually requires

Inflated expectations

  • LLMs are optimised for linguistic plausibility: they are trained to minimise “next-token loss” (statistical, not causal, prediction – see the sketch below) and rewarded for fluency and confidence
  • Conversely, although HTA also relies on inference and extrapolation, it must be centred on explicit recognition of uncertainty over confident point statements, and on being wrong in a visible way rather than right for the wrong reasons
  • Systematic bias in how AI performance may be perceived
    • A fluent answer feels correct, especially to non-specialists
  • Dangerous in HTA where
    • Many errors are subtle
    • Outputs are often difficult to immediately falsify
    • Mistakes propagate downstream into decisions
  • “Benchmark leakage” (= LLMs trained on the very questions & answers used to benchmark them…)
    • AI is evaluated on tasks that resemble exams or coding puzzles, not on end-to-end processes
    • Real HTA workflows involve messy data, conflicting evidence and value judgements
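
To be concrete about what that objective actually is, here is the standard autoregressive cross-entropy (“next-token”) loss in its generic textbook form (not tied to any specific model):

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
\]

Nothing in this objective rewards calibrated uncertainty about decision-relevant quantities; it only rewards plausible continuations of the observed text.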

Poorly specified use-cases

  • HTA tasks are rarely atomic
    • “Do a systematic review” actually involves several distinct judgement calls
    • “Build a model” involves conceptual, statistical and normative steps
    • “Write a report” encodes institutional conventions and implicit assumptions
  • Without precise specification, failures are hard to diagnose, accountability is unclear and validation becomes impossible
  • Decision ownership:
    • Who is responsible when AI output is wrong?
    • Is AI advising, filtering, drafting or deciding?
  • In HTA, ambiguity of responsibility is unacceptable, but many proposed AI uses depend on it…

Mismatch in optimisation objectives

Operational

| LLMs/AI | HTA |
|---|---|
| Tend to collapse uncertainty into a single narrative | Needs uncertainty explicitly represented |
| Optimise average accuracy | Often cares about worst-case or tail behaviour |
| Encourage a single "best answer" | Requires structured disagreement and scenario analysis |

Cultural

  • AI tools often assume iteration is cheap and mistakes are acceptable
  • HTA decisions are high-stakes and constrained
    • e.g. sunk costs of replacing an existing technology

Question 1: Where can AI help us?

Coding

  • This is probably the least controversial – but also the most misunderstood area!
  • Where AI can help
    • Boilerplate code generation (reusable with little alteration, e.g. country adaptation)
    • Refactoring or language translation (e.g. from BUGS to Stan to R)
    • Writing unit tests (which we should do a lot more of, by the way – see the sketch at the end of this section…)
    • Exploratory scaffolding (not final models)
    • Setting clear constraints on responses (“You are a health economic modeller working with R. Acknowledge uncertainty and tell me if you’re not 100% sure”…)
  • Where AI is generally dangerous
    • Silent logical errors/Invented package behaviour (hallucinations)
    • Plausible-looking but incorrect conclusions/flaws
    • Overconfident handling of edge cases (missing data, censoring, treatment switching)
  • For HTA, the key issue is that errors are most often not random: they tend to appear in exactly the places that matter most for decision-making!

  • Bottom line: we’ve come a loooooong way and the core of our “industry” is now more and more proficient in statistical programming – we cannot backtrack on that just because we think that AI can do that job for us!
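
As a deliberately trivial illustration of the kind of unit testing mentioned above (the function, numbers and expectations are all hypothetical, not taken from any real model):

```r
# Minimal sketch: a unit test for a discounting helper, the sort of check that
# should wrap any code an LLM drafts before it enters a cost-effectiveness model
library(testthat)

# Discount a vector of yearly values at a given annual rate (year 1 undiscounted)
discount <- function(x, rate = 0.035) {
  x / (1 + rate)^(seq_along(x) - 1)
}

test_that("discounting behaves as expected", {
  expect_equal(discount(c(100, 100), rate = 0), c(100, 100))  # zero rate: unchanged
  expect_equal(discount(100, rate = 0.035), 100)              # first year: undiscounted
  expect_lt(discount(c(100, 100))[2], 100)                    # later years shrink
})
```

The point is not the example itself: the human still defines what “correct” means. An LLM can draft the scaffolding, but not the expectations.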

Reporting/automation

I’m personally a lot more skeptical about this…

  • Much of what people present as “AI-powered reporting” is
    • Templating
    • Parameter substitution
    • Re-packaged dynamic documents
  • Existing tools that are arguably part of the “standard”(?) workflow, such as Quarto and R Markdown, can already do these tasks
    • They integrate directly with the analysis – no need to use “external” tools! (A minimal example is sketched below)
  • The incremental benefit of LLMs here is modest, while the risk (loss of traceability, hallucinated explanations) is real
  • Feels a bit like change for the sake of change…
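
For instance, the parameter-substitution part is already a one-liner with the standard toolchain; a minimal sketch (the report file and parameter names are hypothetical):

```r
# Minimal sketch: rendering the same parameterised report for two country
# adaptations with rmarkdown alone; the analysis code lives in the .Rmd,
# so the outputs remain fully traceable to the model
library(rmarkdown)

for (ctry in c("UK", "Italy")) {
  render(
    input       = "ce_report.Rmd",                         # hypothetical parameterised report
    params      = list(country = ctry, disc_rate = 0.035),
    output_file = paste0("ce_report_", ctry, ".html")
  )
}
# (quarto::quarto_render() offers the same via its execute_params argument)
```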

Evidence generation: SLRs and modelling

  • This is perhaps where AI enthusiasm is highest in HTA. But: caution is needed, in my view!…

Systematic literature reviews

  • AI can be helpful with tasks such as de-duplication, prioritisation or drafting of data extraction

  • But:

    • Key tasks such as inclusion/exclusion are normative, not purely textual
    • Bias assessment requires (human!) judgement
    • Missing a “good” study may not carry the same penalty as including a “bad” one…

Modelling

  • For economic models, structure matters more than syntax, assumptions matter more than fit and interpretability is non-negotiable (the sketch below illustrates the first point)

  • Replacing

    • Conceptual model development
    • Structural sensitivity analysis
    • Expert elicitation

    with AI would fundamentally misunderstand what HTA models are for

  • Not sure we’d gain so much – modelling of NMA problems is well established in HTA
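
To make the “structure matters more than syntax” point concrete, here is a toy sketch (entirely made-up numbers and a deliberately crude three-state model, no half-cycle correction): the same inputs give very different ICERs depending on a single structural assumption about how long the treatment effect lasts.

```r
# Toy sketch (hypothetical numbers): a crude three-state annual-cycle Markov model.
# The only thing that changes between scenarios is a *structural* assumption:
# how long the treatment effect on progression is assumed to last.
run_model <- function(effect_years, horizon = 20, disc = 0.035) {
  p_prog <- 0.15; p_die_pf <- 0.02; p_die_prog <- 0.20  # annual transition probs (SoC)
  hr    <- 0.60                                          # treatment hazard ratio on progression
  u     <- c(0.80, 0.55, 0)                              # utilities: PF, Progressed, Dead
  k     <- c(2000, 8000, 0)                              # annual care costs by state
  c_trt <- 15000                                         # annual treatment cost while effective

  trace <- function(treated) {
    s <- c(1, 0, 0); qaly <- 0; cost <- 0
    for (t in 1:horizon) {
      on <- treated && t <= effect_years
      pp <- if (on) p_prog * hr else p_prog
      s <- c(s[1] * (1 - pp - p_die_pf),
             s[1] * pp + s[2] * (1 - p_die_prog),
             s[1] * p_die_pf + s[2] * p_die_prog + s[3])
      d <- 1 / (1 + disc)^t
      qaly <- qaly + d * sum(s * u)
      cost <- cost + d * (sum(s * k) + if (on) s[1] * c_trt else 0)
    }
    c(qaly = qaly, cost = cost)
  }

  trt <- trace(TRUE); soc <- trace(FALSE)
  (trt[["cost"]] - soc[["cost"]]) / (trt[["qaly"]] - soc[["qaly"]])  # ICER
}

run_model(effect_years = 20)  # effect sustained over the whole horizon
run_model(effect_years = 2)   # effect assumed to wane after two years
```

Choosing between those two scenarios is a judgement about disease biology and the available evidence, not a coding task, which is exactly why structural sensitivity analysis cannot simply be delegated.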

What else?

  • Where AI may add genuine value (with caveats…)
    • Screening and triage in systematic reviews (ranking, not deciding – see the sketch below)
    • Information extraction with human validation
    • Protocol drafting (again: draft, not authority!)
    • Exploratory scenario generation (“what assumptions should I stress-test?”)
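
As a sketch of what “ranking, not deciding” might look like in practice (records and scores are entirely made up; the scores could come from any classifier or LLM):

```r
# Minimal sketch (hypothetical data): relevance scores are used only to *order*
# records for human screening in an SLR; nothing is excluded automatically
records <- data.frame(
  id    = 1:5,
  title = c("RCT of drug A vs placebo", "Cost study in an unrelated disease",
            "Phase III trial of drug A", "Editorial on drug pricing",
            "Observational study of drug A"),
  score = c(0.92, 0.11, 0.88, 0.05, 0.64)  # e.g. from an LLM or a simple classifier
)

records <- records[order(-records$score), ]                      # rank by predicted relevance
records$priority <- ifelse(records$score >= 0.5,
                           "screen first", "screen later")       # flag, don't exclude
records
```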

  • Notably, these are all assistive roles, not decision-making ones!
  • On a side note…
    • We need to train people as well as LLMs – modellers should know what and how to ask!
    • This is in my view a duty of all the elements of our “industry” (academia, regulators, consultancy, sponsors, etc)

Question 2: What can we do to make AI more helpful to us?

General-purpose vs domain-adapted AI models

“Absolute” models

  • Using a general LLM “as is”
    • Ask ChatGPT to do something for you
  • Zero or minimal domain constraints

“Hybrid” models

  • Constrained systems
    • Specialised models trained in a “safe environment”
  • Infrastructure cost
    • Fine-tuning open models (e.g. LLaMA-class) on HTA-specific text
    • Grounding LLMs in HTA guidelines, model documentation, previous submissions and methodological papers

Investment?

  • Data curation: expensive in human time, not computing time
  • Fine-tuning costs: tens of thousands £ (not millions)??
  • Ongoing infrastructure: modest compared to trial costs

\(\Rightarrow\) The real bottleneck is expert labour, not GPUs!

So what shall we invest in?

  • Not a one-off project, but rather an actively curated methodological public good

  • If AI is to be used responsibly in HTA, it likely requires

    • Publicly (co-)funded, transparent tools
    • Strong academic–regulatory partnerships
    • Industrial involvement
    • Shared benchmarks and failure cases

  • If we don’t do this, in my view the risk is that we end up with
    • Black-box vendor tools
    • Asymmetric information
    • Regulatory capture by tech providers

  • Events like today can (should?!) pave the way for all-round collaborations towards this aim!

Do we really, really, want another revolution?

  • HTA already has
    • Mature methods
    • Hard-won best practice/workflows that still allow for innovative methods
    • Institutional memory
  • The danger is not necessarily that AI fails, but that it
    • Encourages method skipping
    • Devalues careful modelling
    • Reframes rigour as “inefficiency”
  • A conservative yet progressive aim
    • AI should embed existing decision models, probabilistic thinking and transparent uncertainty