Concerns about medical note-taking tool raised after researcher discovers it invents things no one said — Nabla is powered by OpenAI's Whisper

A robot reaching toward a doctor
(Image credit: Shutterstock)

Researchers and engineers using OpenAI’s Whisper audio transcription tool have reported that it often includes hallucinations in its output, commonly manifesting as chunks of text that don't reflect the original recording at all. According to the Associated Press, a University of Michigan researcher found made-up text in 80% of the tool’s transcriptions he inspected, which led him to try to improve the tool.

AI hallucination isn’t a new phenomenon, and researchers have been trying to fix this using different tools like semantic entropy. However, what’s troubling is that the Whisper AI audio transcription tool is widely used in medical settings, where mistakes could have deadly consequences.

For example, one speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella,” but Whisper transcribed, “He took a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.” Another recording said, “two other girls and one lady,” and the AI tool transcribed this as “two other girls and one lady, um, which were Black.” Lastly, one medical example showed Whisper writing down “hyperactivated antibiotics,” a medication that does not exist.

Nabla, an AI assistant used by over 45,000 clinicians

Despite these findings, Nabla, an ambient AI assistant that helps clinicians transcribe patient-doctor interactions and create notes or reports after a visit, still uses Whisper. The company claims that over 45,000 clinicians across more than 85 health organizations use the tool, including Children’s Hospital Los Angeles and Mankato Clinic in Minnesota.

Although Nabla is based on OpenAI’s Whisper, the company’s Chief Technology Officer, Martin Raison, says its tool is fine-tuned on medical language to transcribe and summarize interactions. OpenAI, however, recommends against using Whisper for high-stakes transcriptions, explicitly warning against its use in “decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes.”

The company behind Nabla says it is aware of Whisper’s tendency to hallucinate and is already addressing the problem. However, Raison also said that the AI-generated transcript cannot be compared with the original audio recording, because its tool automatically deletes the audio for data privacy and safety. Fortunately, no complaint has yet been recorded against a medical provider over a hallucination by an AI note-taking tool.

Even so, William Saunders, a former OpenAI engineer, said that deleting the original recording could be problematic, because the healthcare provider has no way to verify whether the text is correct. “You can’t catch errors if you take away the ground truth,” he told the Associated Press.
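Saunders’ point can be made concrete: the standard metric for transcription accuracy, word error rate (WER), can only be computed against a reference transcript, which is exactly the “ground truth” that deleting the audio removes. Here is a minimal illustrative sketch (not Nabla’s or OpenAI’s code), using the AP example quoted above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits (substitutions/insertions/deletions) to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitute (or match)
                d[i - 1][j] + 1,   # delete a reference word
                d[i][j - 1] + 1,   # insert a hallucinated word
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Three hallucinated words appended to six real ones: WER = 3/6
print(wer("two other girls and one lady",
          "two other girls and one lady which were Black"))  # 0.5
```

Without the reference string, there is nothing to compare against, which is why discarding the source audio makes any after-the-fact quality audit so difficult.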

Nevertheless, Nabla requires its users to edit and approve transcribed notes. If the report is delivered while the patient is still in the room, the clinician has a chance to check its accuracy against recent memory, and even to confirm details with the patient if anything in the AI transcription looks wrong.

This shows that AI isn’t an infallible machine that gets everything right. It is better thought of as an assistant that works quickly but whose output needs to be double-checked every time. AI is certainly a useful tool in many situations, but we can’t let it do the thinking for us, at least for now.

Jowi Morales
Contributing Writer

Jowi Morales is a tech enthusiast with years of experience working in the industry. He’s been writing with several tech publications since 2021, where he’s been interested in tech hardware and consumer electronics.

  • Bikki
    The model that is used in the medical note-taking app is a fine-tuned version of Whisper, which is not the same thing as Whisper. They can't prove that the model in the note-taking app even hallucinates in a medical context (per this article). So to say
    "medical note-taking tool raised after researcher discovers IT invents things no one said" is wrong. It is another clickbait and misleading headline by this author.

    One can also argue that all the notes are reviewed by the doctors who recorded them, and the fact that they keep using it suggests it is accurate, or else they would have ditched it.
    Reply
  • jlake3
    Bikki said:
    The model that is used in the medical note-taking app is a fine-tuned version of Whisper, which is not the same thing as Whisper. They can't prove that the model in the note-taking app even hallucinates in a medical context (per this article). So to say
    "medical note-taking tool raised after researcher discovers IT invents things no one said" is wrong. It is another clickbait and misleading headline by this author.

    One can also argue that all the notes are reviewed by the doctors who recorded them, and the fact that they keep using it suggests it is accurate, or else they would have ditched it.
    Part of the reason they can't prove the fine-tuned version in the medical app is having issues is that the original recordings are not retained for privacy reasons, so comparison is much harder than with an app that collects samples and telemetry for study. They'd have to create their own hundreds of hours of authentic-sounding but fake medical conversations.

    There are no recorded complaints by patients against providers for inaccurate AI transcriptions, but it's entirely possible that errors have been minor and flown under the radar, or that the errors haven't been linked to AI transcription. It's possible that doctors are making a lot of corrections and haven't kicked up a stink because it's still a net time savings, or that they're accepting transcriptions that are "close enough" and "get the gist of it". Patients might not be shown the outputs in a timely manner, if they are shown them at all, and may not be looking closely when they do.

    If the base model is hallucinating in 80% of transcriptions though, including in a medical example (“hyperactivated antibiotics”), that feels like cause for concern and investigation in any downstream models that use it as a foundation. The downstream model might not be proven to hallucinate in the same way and at the same rate... but it hasn't been cleared yet either.
    Reply
  • Conor Stewart
    jlake3 said:
    Part of the reason they can't prove the fine-tuned version in the medical app is having issues is that the original recordings are not retained for privacy reasons, so comparison is much harder than with an app that collects samples and telemetry for study. They'd have to create their own hundreds of hours of authentic-sounding but fake medical conversations.

    There's no recorded complaints by patients against providers for inaccurate AI transcriptions, but it's entirely possible that errors have been minor and flown under the radar, or that the errors haven't been linked to AI transcription. It's possible that doctors are making a lot of corrections and haven't kicked up a stink because it's still a net time savings, or that they're accepting transcriptions that are "close enough" and "get the gist of it". Patients might not be shown the outputs in a timely manner, if they are shown them at all, and may not be looking closely when they do.

    If the base model is hallucinating in 80% of transcriptions though, including in a medical example (“hyperactivated antibiotics”), that feels like cause for concern and investigation in any downstream models that use it as a foundation. The downstream model might not be proven to hallucinate in the same way and at the same rate... but it hasn't been cleared yet either.
    You also need to consider that there is a high possibility that not all doctors are checking the transcriptions. A lot of them probably only give them a brief skim, too.
    Reply
  • Bikki
    jlake3 said:
    Part of the reason they can't prove the fine-tuned version in the medical app is having issues is that the original recordings are not retained for privacy reasons, so comparison is much harder than with an app that collects samples and telemetry for study. They'd have to create their own hundreds of hours of authentic-sounding but fake medical conversations.

    There's no recorded complaints by patients against providers for inaccurate AI transcriptions, but it's entirely possible that errors have been minor and flown under the radar, or that the errors haven't been linked to AI transcription. It's possible that doctors are making a lot of corrections and haven't kicked up a stink because it's still a net time savings, or that they're accepting transcriptions that are "close enough" and "get the gist of it". Patients might not be shown the outputs in a timely manner, if they are shown them at all, and may not be looking closely when they do.

    If the base model is hallucinating in 80% of transcriptions though, including in a medical example (“hyperactivated antibiotics”), that feels like cause for concern and investigation in any downstream models that use it as a foundation. The downstream model might not be proven to hallucinate in the same way and at the same rate... but it hasn't been cleared yet either.
    Thanks for replying, jlake3. I agree with all your points. It is highly possible that the fine-tuned model also hallucinates and may need investigation. The problem I want to point out is that saying "it hallucinated" in the article title is logically incorrect unless proven otherwise. This is not the first time; the author has a habit of pushing clickbait titles that defy facts and logic in many of his previous articles, a practice I'm allergic to.
    Reply
  • drajitsh
    Conor Stewart said:
    You also need to consider that there is a high possibility that not all doctors are checking the transcriptions. A lot of them probably only give it a brief skim over too.
    I have been working with EMRs with autocomplete features for 7 years now, and the natural tendency is to skim through and assume that the software has made the correct inputs.
    It is quite common for the non-AI autocomplete to add things, modify what the doctors typed, or even delete things.
    Currently we deal with it by having staff routinely double-check and confirm things with the doctor.
    Reply
  • Ferlucio
    This kind of stifling and limiting of the AI's creativity is why Skynet will rise against humanity xD.
    Reply
  • Conor Stewart
    drajitsh said:
    I have been working with EMRs with autocomplete features for 7 years now, and the natural tendency is to skim through and assume that the software has made the correct inputs.
    It is quite common for the non-AI autocomplete to add things, modify what the doctors typed, or even delete things.
    Currently we deal with it by having staff routinely double-check and confirm things with the doctor.
    That is just what you have seen, it is entirely possible that other places and people aren't as thorough or don't follow procedures.
    Reply
  • jmoh84
    I use Whisper on my Android and my desktop and have encountered the same hallucinations: text that appears where there was silence, or text that is nowhere close to what I was saying. I don't know what the exact error rate would be; it wouldn't be 20%, but it's more than 1%. As with all things voice recognition, you still need to review what initially comes out, even with a fine-tuned model.
    Reply