The Road Ahead: Future Trends in Artificial Intelligence and Automation for Health Technology Assessment, UCL
14 January 2026
Check out our departmental podcast “Random Talks” on Soundcloud!
Follow our departmental social media accounts + magazine “Sample Space”
Disclaimer…
This is probably the most unstructured talk I’ve ever written/given in my career!
I’m OK with this, though: the ideas are not completely solidified yet – and not just in my mind…
Still, it wouldn’t be very serious of me to simply impart “my wisdom” on this…
Instead, I’ll try to raise some questions and doubts and, hopefully, offer some ideas!
I’ll use the perspective of a statistician
Of course, there is (just about… 😉) more to the whole HTA process than just Statistics
BUT: the questions I will pose are, I think, extremely valid and relevant to the whole of our “industry”
AI & HTA: Where are we?
Obviously a big and hot topic. For example, at the last ISPOR conference:
2 short courses on the Sunday
One of the topics in the Plenary session + at least 8 parallel sessions on the Monday
At least 5 parallel sessions on the Tuesday
One of the main focus points of the Plenary session + at least 2 parallel sessions on the Wednesday
BUT: I think it’s fair to say that the use of AI in HTA has been so far characterised by a few important issues!
Inflated expectations
Poorly specified use-cases
A mismatch between what tools are optimised for and what HTA actually requires
Inflated expectations
LLMs are optimised for linguistic plausibility: trained to minimise “next-token loss” (statistical, not causal, prediction) and rewarded for fluency and confidence (see the schematic objective below)
Conversely, although HTA relies on inference and extrapolation, it must be centred on explicit recognition of uncertainty rather than confident point statements, and on being wrong in a visible way rather than right for the wrong reasons
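Schematically (the notation here is illustrative), the training objective mentioned above is just the cross-entropy of each next token given the preceding ones:
\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
\]
Nothing in this objective rewards calibrated uncertainty or causal correctness – it rewards plausible continuation of the text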
Systematic bias in how AI performance may be perceived
A fluent answer feels correct, especially to non-specialists
Dangerous in HTA where
Many errors are subtle
Outputs are often difficult to immediately falsify
Mistakes propagate downstream into decisions
“Benchmark leakage” (= LLMs trained on the very questions & answers later used to benchmark them…)
AI is evaluated on tasks that resemble exams or coding puzzles, not on end-to-end processes
HTA workflows involve messy data, conflicting evidence and value judgements
Poorly specified use-cases
HTA tasks are rarely atomic
“Do a systematic review” actually involves several distinct judgement calls
“Build a model” involves conceptual, statistical and normative steps
“Write a report” encodes institutional conventions and implicit assumptions
…
Without precise specification, failures are hard to diagnose, accountability is unclear and validation becomes impossible
Decision ownership:
Who is responsible when AI output is wrong?
Is AI advising, filtering, drafting or deciding?
In HTA, ambiguity of responsibility is unacceptable, but many proposed AI uses depend on it…
Mismatch in optimisation objectives
Operational
LLMs/AI tend to collapse uncertainty into a single narrative, whereas HTA needs uncertainty explicitly represented
LLMs/AI optimise average accuracy, whereas HTA often cares about worst-case or tail behaviour (see the toy sketch below)
LLMs/AI encourage a single “best answer”, whereas HTA requires structured disagreement and scenario analysis
Cultural
AI tools often assume iteration is cheap and mistakes are acceptable, whereas HTA decisions are high-stakes and constrained
e.g. the sunk costs of replacing an existing technology
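To make the “average vs tail” row concrete, here is a toy R sketch with entirely made-up numbers: the mean incremental net benefit looks favourable, but the probability of actually making the wrong decision is far from negligible.
```r
## Toy sketch, made-up numbers: average vs tail behaviour.
## Simulate an incremental net benefit (INB) whose mean is positive,
## but which still carries a sizeable probability of a loss.
set.seed(42)
n_sim <- 10000

delta_qaly <- rnorm(n_sim, mean = 0.10, sd = 0.15)   # hypothetical incremental QALYs
delta_cost <- rnorm(n_sim, mean = 1500, sd = 1000)   # hypothetical incremental costs (GBP)
wtp <- 20000                                         # willingness-to-pay threshold (GBP/QALY)

inb <- wtp * delta_qaly - delta_cost

mean(inb)           # the "average accuracy" view: looks fine
mean(inb < 0)       # the tail view: probability the decision is wrong
quantile(inb, 0.05) # worst-case-ish behaviour (5th percentile)
```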
Question 1: Where can AI help us?
Coding
This is probably the least controversial – but also the most misunderstood area!
Where AI can help
Boilerplate code generation (reusable with little alteration, e.g. country adaptation)
Refactoring or language translation (e.g. from BUGS to Stan to R)
Writing unit tests (which we should do a lot more of, by the way… see the sketch after this list)
Exploratory scaffolding (not final models)
Clear settings for responses (“You are a health economic modeller working with R. Acknowledge uncertainty and tell me if you’re not 100% sure”…)
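As a concrete illustration of the unit-testing point, here is a minimal sketch using the testthat package; the helper function and the numbers are hypothetical, but the pattern is what matters.
```r
## Minimal sketch (hypothetical helper + tests) using the testthat package.
## The helper rescales a transition probability to a different cycle length,
## a routine step in health economic models.
library(testthat)

rescale_prob <- function(p, from = 1, to = 1) {
  # probability -> rate -> probability over the new cycle length
  rate <- -log(1 - p) / from
  1 - exp(-rate * to)
}

test_that("monthly rescaling compounds back to the annual probability", {
  p_year  <- 0.30
  p_month <- rescale_prob(p_year, from = 12, to = 1)  # 12 months -> 1 month
  expect_equal(1 - (1 - p_month)^12, p_year, tolerance = 1e-10)
})

test_that("edge cases behave sensibly", {
  expect_equal(rescale_prob(0, from = 1, to = 2), 0)
  expect_true(rescale_prob(0.9999, from = 1, to = 1) < 1)
})
```
Tests like these are cheap to write (with or without AI assistance) and are exactly what catches the subtle, non-random errors mentioned below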
BUT: beware of overconfident handling of edge cases (missing data, censoring, treatment switching)
For HTA, the key issue is that errors are most often not random: they tend to appear in exactly the places that matter most for decision-making!
Bottom line: we’ve come a loooooong way and the core of our “industry” is now more and more proficient in statistical programming – we cannot backtrack on that just because we think that AI can do the job for us!
Reporting/automation
I’m personally a lot more sceptical about this…
Much of what people present as “AI-powered reporting” is
Templating
Parameter substitution
Re-packaged dynamic documents
Existing tools that can be considered part of the “standard”(?) workflow, like Quarto, R Markdown and others, can already do these tasks
They integrate directly with the analysis – no need for “external” tools! (see the sketch below)
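For instance, a parameterised report can be regenerated for a new country or scenario with a single call; the file name and parameters below are hypothetical.
```r
## Hypothetical sketch: parameterised, reproducible reporting without an LLM.
## "ce_report.Rmd" would declare `params` (country, discount rate, ...) in its
## YAML header and use them directly in the analysis code.
library(rmarkdown)

render(
  input       = "ce_report.Rmd",
  params      = list(country = "UK", discount_rate = 0.035),
  output_file = "ce_report_UK.html"
)
```
Every number in the resulting report is traceable back to the code and the parameter set – exactly the property that free-text LLM output struggles to guarantee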
The incremental benefit of LLMs here is modest, while the risk (loss of traceability, hallucinated explanations) is real
Feels a bit like change for the sake of change…
Evidence generation: SLRs and modelling
This is perhaps where AI enthusiasm is highest in HTA. But: caution is needed, in my view!…
Systematic literature reviews
AI can be helpful with tasks such as de-duplication, prioritisation or draft data extraction (see the sketch below)
But:
Key tasks such as inclusion/exclusion are normative, not purely textual
Bias assessment requires (human!) judgement
Missing a “good” study may not carry the same penalty as including a “bad” one…
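As an example of the low-stakes, assistive end of this spectrum, here is a minimal de-duplication sketch in base R (the titles are made up); anything flagged still goes to a human reviewer.
```r
## Hypothetical sketch: flag likely duplicate records by normalised title
## similarity. The flags are suggestions for a human reviewer, not decisions.
titles <- c(
  "Cost-effectiveness of Drug A versus Drug B in advanced disease",
  "Cost effectiveness of drug A vs drug B in advanced disease.",
  "A network meta-analysis of treatments for condition Y"
)

# Normalise: lower case, strip punctuation
norm <- gsub("[[:punct:]]", "", tolower(titles))

# Pairwise edit distance, scaled by the longer title's length
d <- adist(norm) / outer(nchar(norm), nchar(norm), pmax)

# Flag pairs whose scaled distance falls below an (arbitrary) threshold
which(d < 0.15 & upper.tri(d), arr.ind = TRUE)
```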
Modelling
For economic models, structure matters more than syntax, assumptions matter more than fit, and interpretability is non-negotiable
Replacing
Conceptual model development
Structural sensitivity analysis
Expert elicitation
with AI would be to fundamentally misunderstand what HTA models are for
Not sure we’d gain so much either – modelling of NMA (network meta-analysis) problems is well established in HTA
What else?
Where AI may add genuine value (with caveats…)
Screening and triage in systematic reviews (ranking, not deciding)
Information extraction with human validation
Protocol drafting (again: draft, not authority!)
Exploratory scenario generation (“what assumptions should I stress-test?” – see the toy sketch below)
Notably, these are all assistive roles, not decision-making ones!
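To make the scenario-generation point concrete, here is a toy sketch with a made-up model and numbers: an AI tool might suggest which assumptions to vary, but each scenario is run, judged and interpreted by the analyst.
```r
## Toy sketch, made-up numbers: stress-testing assumptions via scenarios.
## The scenarios could be *suggested* by an AI tool; deciding which are
## credible, and what the results mean, remains a human task.
base_case <- list(delta_qaly = 0.10, delta_cost = 1500)

scenarios <- list(
  base_case            = base_case,
  smaller_qaly_gain    = modifyList(base_case, list(delta_qaly = 0.05)),
  higher_incr_cost     = modifyList(base_case, list(delta_cost = 2500)),
  pessimistic_combined = list(delta_qaly = 0.05, delta_cost = 2500)
)

icer <- sapply(scenarios, function(s) s$delta_cost / s$delta_qaly)
round(icer)  # cost per QALY gained under each scenario
```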
On a side note…
We need to train people as well as LLMs – modellers should know what and how to ask!
This is in my view a duty of all the elements of our “industry” (academia, regulators, consultancy, sponsors, etc)
Question 2: What can we do to make AI more helpful to us?
General-purpose vs domain-adapted AI models
“Absolute” models
Using a general LLM “as is”
Ask ChatGPT to do something for you
Zero or minimal domain constraints
“Hybrid” models
Constrained systems
Specialised models trained in a “safe environment”
Infrastructure cost
Fine-tuning open models (e.g. LLaMA-class) on HTA-specific text
Using LLMs grounded in HTA guidelines, model documentation, previous submissions, methodological papers (see the sketch below)
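The retrieval step behind this kind of “hybrid” setup can be illustrated without any LLM at all; the snippets and scoring below are deliberately toy-like, but the idea is to pass the most relevant trusted material to the model as context rather than asking it “cold”.
```r
## Hypothetical sketch of the retrieval step in a domain-adapted setup:
## rank trusted guideline passages by relevance to the analyst's question.
## The snippets are illustrative paraphrases, not real quotations.
guideline_snippets <- c(
  survival_extrap = "Survival extrapolation should compare several parametric models and justify the choice",
  psa             = "Probabilistic sensitivity analysis should reflect parameter uncertainty in all inputs",
  nma             = "Network meta-analysis requires assessment of heterogeneity and inconsistency"
)

query <- "which parametric models should I fit for survival extrapolation?"

# Toy relevance score: number of words shared between query and snippet
score <- sapply(guideline_snippets, function(s) {
  length(intersect(strsplit(tolower(query), "\\W+")[[1]],
                   strsplit(tolower(s), "\\W+")[[1]]))
})

# The top-ranked snippet(s) would then be supplied to the LLM as grounding
names(sort(score, decreasing = TRUE))[1]
```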
Investment?
Data curation: expensive in human time, not computing time
Fine-tuning costs: tens of thousands £ (not millions)??
Ongoing infrastructure: modest compared to trial costs
\(\Rightarrow\) The real bottleneck is expert labour, not GPUs!
So what should we invest in?
Not a one-off project, but rather an actively curated methodological public good
If AI is to be used responsibly in HTA, it likely requires
Publicly (co-)funded, transparent tools
Strong academic–regulatory partnerships
Industrial involvement
Shared benchmarks and failure cases
If we don’t do this, in my view the risk is that we end up with
Black-box vendor tools
Asymmetric information
Regulatory capture by tech providers
Events like today can (should?!) pave the way for all-round collaborations towards this aim!
Do we really, really, want another revolution?
HTA already has
Mature methods
Hard-won best practice/workflows that still allow for innovative methods
Institutional memory
The danger is not necessarily that AI fails, but that it
Encourages method skipping
Devalues careful modelling
Reframes rigour as “inefficiency”
Conservative and progressive aim
AI should embed existing decision models, probabilistic thinking and transparent uncertainty