Review says Babylon’s AI claims lack ‘convincing evidence’

7 November 2018

Researchers have concluded that Babylon Health has not offered ‘convincing evidence’ that its AI-powered diagnostic and triage system can perform better than doctors.

In July 2018, Babylon Health claimed a study had demonstrated that its artificial intelligence (AI) system’s diagnostic ability was ‘on-par with human doctors’.

But in a letter to medical journal The Lancet, Hamish Fraser, Enrico Coiera and David Wong explained their review – ‘Safety of patient-facing digital symptom checkers’ – shows there ‘is a possibility that it [Babylon’s service] might perform significantly worse’.

Fraser, Coiera and Wong – respectively a qualified doctor and associate professor of medical science; a professor of medical informatics; and a lecturer in health informatics – argue that Babylon’s claims have been ‘met with scepticism because of methodological concerns’.

These concerns included the fact that ‘data in the trials were entered by doctors’ rather than by real-life patients or ‘lay users’.

Babylon made its original claim based on feeding a representative sample of questions from the Membership of the Royal College of General Practitioners (MRCGP) exam to its diagnostic and triage system. The company reported the AI scored 81%.
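To see why a headline score from a small test set leaves room for doubt, consider the statistical uncertainty around such an estimate. The sketch below is purely illustrative and comes from neither Babylon’s paper nor the Lancet letter: it computes a 95% Wilson confidence interval for an 81% accuracy score, assuming a hypothetical set of 100 test questions.

```python
# Illustrative sketch (not from Babylon's study or the Lancet letter):
# how wide the uncertainty is around an accuracy score estimated from a
# small test set. The sample size of 100 is a hypothetical assumption.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half_width, centre + half_width

low, high = wilson_interval(successes=81, n=100)  # hypothetical 81/100 score
print(f"95% CI for 81% on 100 questions: {low:.1%} to {high:.1%}")
# Roughly 72% to 87% - wide enough to overlap plausible doctor scores,
# which is why claims of parity call for larger, real-world studies.
```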

The researchers commended Babylon for releasing a ‘fairly detailed description of the system’ and said it ‘potentially showed some improvement’ over the average symptom checker.

However, the letter states: “The study does not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.”

It adds: “Further clinical evaluation is necessary to ensure confidence in patient safety.”

The letter concludes: “Symptom checkers have great potential to improve diagnostics, quality of care and health system performance worldwide.

“However, systems that are poorly designed or lack rigorous clinical evaluation can put patients at risk and likely increase the load on health systems.”

Babylon Health’s chief scientist, Saurabh Johri, thanked the co-authors for their letter and review.

He added: “As we emphasise in the conclusion of our paper, the ability to generalise the findings of our pilot will require further studies.

“We welcome the suggestions of the authors for developing guidelines for robust evaluation of computerised diagnostic decision support systems since they align with our own thinking on how best to perform clinical evaluation. Together with our academic partners, we are currently in the process of performing a larger, real-world study, which we intend to submit for peer-review.”


Babylon’s full statement, from chief scientist Saurabh Johri:

“We would like to thank the authors for their letter and review: ‘Safety of patient-facing digital symptom checkers’.

“As we outline in the original paper, the goal of our pilot study was to assess the performance of our system against a broad set of independently created vignettes, which represent a diverse range of conditions, including both common and rare diseases. Hence, the purpose of the study was to perform an initial comparison through statistical summaries rather than detailed statistical analysis. This setting contrasts with a ‘real-world’ one, which would strongly favour common conditions at the expense of those of lower incidence. Despite the limited number of vignettes in our study, for increased breadth we test against twice as many as in another similar evaluation (Semigran et al. 2015, BMJ).

“It is also important to remark that we took appropriate care to rigorously ground our scientific findings by stating in our paper that ‘further studies using larger, real-world cohorts will be required to demonstrate the relative performance of these systems to human doctors’.

“The authors raise a number of concerns, some of which were addressed by us previously in our response to the online commentary provided by one of the authors (Prof. Coiera).

“In their correspondence, the authors claim that ‘the study does not offer convincing evidence that the Babylon Diagnostic and Triage System can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse.’ As we indicated in our original study, our intention was not to demonstrate or claim that our AI system is capable of performing better than doctors in natural settings. In fact, we stress that our study adopts a ‘semi-naturalistic role-play paradigm’ to simulate a realistic consultation between patient and doctor, and it is in the context of this controlled experiment that we compare AI and doctor performance.

“We would also like to take this opportunity to remark on a number of factual inaccuracies in the commentary. Firstly, the authors have commented that some of the doctors are outliers in terms of their accuracy. However, regardless of their performance in the study, all doctors are GMC-registered and regularly consult with real patients. Also, even if Doctor B is removed, the Babylon Triage and Diagnostic System’s performance is similar to the performance of Doctor A and Doctor D. Secondly, we would also like to clarify that in a previous paper (Semigran et al. 2015, BMJ), its authors included only relevant vignettes for each Symptom Checker tested, not just all adult vignettes as suggested in the appendix to the review letter.

“As we emphasise in the conclusion of our paper, the ability to generalise the findings of our pilot will require further studies. We welcome the suggestions of the authors for developing guidelines for robust evaluation of computerised diagnostic decision support systems since they align with our own thinking on how best to perform clinical evaluation. Together with our academic partners, we are currently in the process of performing a larger, real-world study, which we intend to submit for peer-review.”




Comments


  • It would be interesting to hear Richard Smith’s take on how ‘the hegemony of health people’ reacts to challenge and change.

  • At the moment it doesn’t need to be better than Drs to be of value, does it? Just to be as good (as the best?), because we haven’t got enough Drs, have we?

    The NHS does struggle with this ‘good enough’ concept sometimes, doesn’t it!

    • Tell that to the coroner…

  • It’s not even as good as NHS111 yet! And that’s a pretty low bar to cross.

  • We are using a provider from Switzerland (apimedic.com) that gives us a solid symptom checker. The good thing with them is that if we are not happy with their quality, we can simply swap them out in the background for another provider. So far, this has kept them willing to keep improving.

  • Anyone else get the feeling Babylon is going to implode soon? It’s like watching a car crash in slow motion.

    First they released a private video consultation app – didn’t do well.

    Then they sneaked into winning NHS contracts. They then realised that the bottleneck remains the limited number of doctors who are able to see patients.

    So they then really started to push the AI angle, which has been pretty disastrous. Even if they invest 100 million, that’s a drop in the ocean of what’s needed.

    So after all this noise and hoo-ha, we’re left with an unprofitable company that keeps trying to make noise so that that sweet VC money keeps coming in. Ha!

  • Hey – didn’t they state that they would invest 100 million in this project?

  • So, why is Ali Parsa swanning about saying that it DID perform better than doctors?
