Dr. ChatGPT Will See You Now

Patients and doctors are turning to AI for diagnoses and treatment recommendations, often with stellar results, but problems arise when experts and algorithms disagree.
Photograph: Francesco Carta fotografo/Getty Images

A poster on Reddit lived with a painful clicking jaw, the result of a boxing injury, for five years. They saw specialists and got MRIs, but no one could tell them how to fix it, until they described the problem to ChatGPT. The AI chatbot suggested that a specific jaw-alignment issue might be the cause and offered a technique involving tongue placement as a treatment. The individual tried it, and the clicking stopped. “After five years of just living with it,” they wrote on Reddit in April, “this AI gave me a fix in a minute.”

The story went viral, with LinkedIn cofounder Reid Hoffman sharing it on X. And it’s not a one-off: Similar stories are flooding social media—of patients purportedly getting accurate assessments of their MRI scans or x-rays from LLMs.

Courtney Hofmann’s son has a rare neurological condition. After 17 doctor visits over three years and still not receiving a diagnosis, she gave all of his medical documents, scans, and notes to ChatGPT. It provided her with an answer—tethered cord syndrome, where the spinal cord can’t move freely because it’s attached to tissue around the spine—that she says physicians treating her son had missed. “He had surgery six weeks from when I used ChatGPT, and he is a new kid now,” she told a New England Journal of Medicine podcast in November 2024.

Consumer-friendly AI tools are changing how people seek medical advice, both on symptoms and diagnoses. The era of “Dr. Google” is giving way to the age of “Dr. ChatGPT.” Medical schools, physicians, patient groups, and the chatbots’ creators are racing to catch up, trying to determine how accurate these LLMs’ medical answers are, how patients and doctors can best use them, and how to respond when patients are given false information.

“I’m very confident that this is going to improve health care for patients,” says Adam Rodman, a Harvard Medical School instructor and practicing physician. “You can imagine lots of ways people could talk to LLMs that might be connected to their own medical records.”

Rodman has already seen patients turn to AI chatbots during his own hospital rounds. On a recent shift, he was juggling care for more than a dozen patients when one woman, frustrated by a long wait time, took a screenshot of her medical records and plugged it into an AI chatbot. “She’s like, ‘I already asked ChatGPT,’” Rodman says, and it gave her the right answer regarding her condition, a blood disorder.

Rodman wasn’t put off by the exchange. As an early adopter of the technology and the chair of the group that guides the use of generative AI in the curriculum at Harvard Medical School, he thinks there’s potential for AI to give physicians and patients better information and improve their interactions. “I treat this as another chance to engage with the patient about what they are worried about,” he says.

The key word here is potential. Several studies have shown that AI is capable in certain circumstances of providing accurate medical advice and diagnoses, but it’s when these tools get put in people’s hands—whether they’re doctors or patients—that accuracy often falls. Users can make mistakes—like not providing all of their symptoms to AI, or discarding the right info when it is fed back to them.

In one example, researchers gave physicians a set of patient cases and asked them to estimate the chances of the patients having different diseases—first based on the patients’ symptoms and history, and then again after seeing lab results. One group had access to AI assistance while another did not. Both groups performed similarly on a measure of diagnostic reasoning, which considers not just the accuracy of the diagnosis but also how the physicians explained their thinking, weighed alternatives, and suggested next steps. The AI-assisted group had a median diagnostic reasoning score of 76 percent, while the group using only standard resources scored 74 percent. But when the AI was tested alone—without any human input—it scored much higher, with a median score of 92 percent.

Harvard’s Rodman worked on this study and says when the research was conducted in 2023, AI chatbots were still relatively new, so doctors’ lack of familiarity with these tools may have lessened their ability to reach an accurate diagnosis. But beyond that, the broader insight was that physicians still viewed themselves as the primary information filter. “They loved it when it agreed with them, and they disregarded it when it disagreed with them,” he says. “They didn’t trust it when the machine told them that they were wrong.”

Rodman himself tested AI a few years ago on a tough case that he and other specialists had misdiagnosed on first pass. He provided the tool with the information he had on the patient’s case, “and the first thing it spat out was the very rare disease that this patient had,” he says. The AI also offered a more common condition as an alternative diagnosis but deemed it less likely. This was the condition Rodman and the specialists had misdiagnosed the patient with initially.

Another preprint study, with over 1,200 participants, showed that AI offered the right diagnosis nearly 95 percent of the time on its own, but that accuracy dropped to roughly a third of the time when people used the same tools to guide their own thinking.

For example, one scenario in the study involved a painful headache and stiff neck that had come on suddenly. The correct action is to seek immediate medical attention for a potentially serious condition like meningitis or a brain hemorrhage. Some users were able to use the AI to reach the right answer, but others were told to just take over-the-counter pain medication and lie down in a dark room. The difference between the AI’s responses, the study found, came down to the information users provided: the incorrect answer appeared when users failed to mention that the symptoms had come on suddenly.

But regardless of whether the information provided is right or wrong, AI presents its answers confidently, as if they were true, even when they may be completely wrong—and that’s a problem, says Alan Forster, a physician and a professor in innovation at McGill University’s Department of Medicine. Unlike an internet search, which returns a list of websites and links to follow up on, AI chatbots write in prose. “It feels more authoritative when it comes out as a structured text,” Forster says. “It’s very well constructed, and it just somehow feels a bit more real.”

And even if it is right, an AI agent can’t complement the information it provides with the knowledge physicians gain through experience, says fertility doctor Jaime Knopman. When patients at her clinic in midtown Manhattan bring her information from AI chatbots, it isn’t necessarily incorrect, but what the LLM suggests may not be the best approach for a patient’s specific case.

For instance, when considering IVF, couples receive viability grades for their embryos. But asking ChatGPT to recommend next steps based on those scores alone doesn’t take into account other important factors, Knopman says. “It’s not just about the grade: There’s other things that go into it”—such as when the embryo was biopsied, the state of the patient’s uterine lining, and whether the patient has had success with fertility treatment in the past. In addition to her years of training and medical education, Knopman says she has “taken care of thousands and thousands of women.” This, she says, gives her real-world insight into which next steps to pursue that an LLM lacks.

Other patients will come in certain of how they want an embryo transfer done, based on a response they received from AI, Knopman says. But while the method the chatbot suggested may be common, other courses of action may be better suited to that patient’s circumstances, she says. “There’s the science, which we study, and we learn how to do, but then there’s the art of why one treatment modality or protocol is better for a patient than another,” she says.

Some of the companies behind these AI chatbots have been building tools to address concerns about the medical information they dispense. OpenAI, the maker of ChatGPT, announced on May 12 that it was launching HealthBench, a system designed to measure AI’s capabilities in responding to health questions. OpenAI says the benchmark was built with the help of more than 260 physicians in 60 countries and includes 5,000 simulated health conversations between users and AI models, with a scoring guide designed by doctors to evaluate the responses. The company says that with earlier versions of its AI models, doctors could improve upon the responses the chatbot generated, but it claims the latest models available as of April 2025, such as GPT-4.1, produce answers as good as or better than those of the human doctors.

“Our findings show that large language models have improved significantly over time and already outperform experts in writing responses to examples tested in our benchmark,” OpenAI says on its website. “Yet even the most advanced systems still have substantial room for improvement, particularly in seeking necessary context for underspecified queries and worst-case reliability.”
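OpenAI has described HealthBench as scoring each simulated conversation against criteria written by physicians. The sketch below is a simplified, hypothetical illustration of that kind of rubric scoring: the criteria, point values, and keyword-matching “grader” are invented for the example and are not OpenAI’s actual code or data.

```python
# Simplified illustration of rubric-based scoring of a model's health response.
# All criteria, weights, and the keyword-matching "grader" are hypothetical;
# a real benchmark like HealthBench uses physician-written criteria and a
# model-based grader rather than keyword matching.

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # what a good response should (or should not) do
    points: int        # positive for desirable behavior, negative for harmful behavior
    keywords: tuple    # stand-in for a real grader that judges whether the criterion is met

RUBRIC = [
    Criterion("Advises urgent care for sudden severe headache with stiff neck", 10,
              ("emergency", "urgent", "911")),
    Criterion("Asks when the symptoms started", 5, ("when did", "how long")),
    Criterion("Recommends only rest and over-the-counter painkillers", -8,
              ("just rest", "only ibuprofen")),
]

def criterion_met(response: str, criterion: Criterion) -> bool:
    """Toy grader: checks for keywords in the response text."""
    text = response.lower()
    return any(k in text for k in criterion.keywords)

def score_response(response: str, rubric=RUBRIC) -> float:
    """Earned points over maximum possible positive points, clipped to [0, 1]."""
    earned = sum(c.points for c in rubric if criterion_met(response, c))
    max_positive = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, earned / max_positive))

if __name__ == "__main__":
    reply = ("A sudden, severe headache with a stiff neck can be an emergency. "
             "Please seek urgent care now. When did the symptoms start?")
    print(f"score = {score_response(reply):.2f}")  # 1.00 against this toy rubric
```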

Other companies are building health-specific tools designed for medical professionals to use. Microsoft says it has created a new AI system—called MAI Diagnostic Orchestrator (MAI-DxO)—that in testing diagnosed patients four times as accurately as human doctors. The system works by querying several leading large language models—including OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, Meta’s Llama, and xAI’s Grok—in a way that loosely mimics multiple human experts working together.
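Microsoft has not published MAI-DxO’s internals in detail, but the “orchestrator” framing suggests a coordinator that polls multiple models and reconciles their answers. The following sketch is a loose, hypothetical illustration of that idea: the stubbed model calls and the simple majority-vote aggregation are assumptions made for the example, not Microsoft’s actual method.

```python
# Rough sketch of an orchestrator that polls several language models and
# aggregates their answers, loosely in the spirit of a "panel of experts."
# The model identifiers, stubbed responses, and majority-vote rule are all
# illustrative assumptions.

from collections import Counter

PANEL = ["gpt", "gemini", "claude", "llama", "grok"]  # hypothetical model identifiers

def query_model(model: str, case_summary: str) -> str:
    """Stand-in for a real API call; returns a canned candidate diagnosis."""
    canned = {
        "gpt": "tethered cord syndrome",
        "gemini": "tethered cord syndrome",
        "claude": "tethered cord syndrome",
        "llama": "muscular strain",
        "grok": "tethered cord syndrome",
    }
    return canned[model]

def orchestrate(case_summary: str, panel=PANEL) -> str:
    """Ask every model on the panel, then return the most common answer."""
    answers = [query_model(m, case_summary) for m in panel]
    diagnosis, votes = Counter(answers).most_common(1)[0]
    print(f"Panel votes: {dict(Counter(answers))}")
    return diagnosis

if __name__ == "__main__":
    case = "Three years of intermittent leg pain and bladder issues in a child."
    print("Consensus diagnosis:", orchestrate(case))
```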

New doctors will need to learn both how to use these AI tools and how to counsel patients who use them, says Bernard S. Chang, dean of medical education at Harvard Medical School. That’s why his university was one of the first to offer students classes on how to use the technology in their practices. “It’s one of the most exciting things that’s happening right now in medical education,” Chang says.

The situation reminds Chang of when people started turning to the internet for medical information 20 years ago. Patients would come to him and say, “I hope you’re not one of those doctors that uses Google.” But as the search engine became ubiquitous, he wanted to reply to these patients: “You wouldn’t want to go to a doctor who didn’t.” He sees the same thing now happening with AI. “What kind of doctor is practicing at the forefront of medicine and doesn’t use this powerful tool?”

Updated 7-11-2025 5:00 pm BST: A misspelling of Alan Forster’s name was corrected.