AI and Mental Health Care: Issues, Challenges, and Opportunities

QUESTION 3: Must there always be a human in the loop?


Background

The effectiveness and safety of fully or partially autonomous mental health AI tools remain contested. Some patients see potential benefits from fully autonomous AI providing mental health care, turning to conversational agents (e.g., ChatGPT) for therapeutic conversations. A recent survey found that one in four Americans preferred speaking with a chatbot rather than a human therapist, and 80 percent found ChatGPT to be an effective alternative to in-person therapy.43 However, these findings reflect user satisfaction rather than clinical effectiveness and do not assess long-term outcomes or risk exposure.

From a regulatory perspective, the fundamental question about human involvement hinges, at least in part, on how we define the role of LLMs in mental health: Are they a form of therapy, or are they more akin to a friend or companion? If LLMs are intended as therapy, the human being in the loop would be defined as a licensed clinician responsible for prescribing the LLM intervention, monitoring its effectiveness, and managing adverse events. The use of LLMs without clinician oversight would then be analogous to an “over-the-counter” therapy and would be regulated as such (e.g., for validity of claims). Given the widespread public access to these tools, some LLM mental health interventions effectively occupy this status already, despite lacking formal approval or rigorous testing.44 Providers and experts remain divided. Some argue that clinician oversight is essential for patient protection and for maintaining care standards. Clinicians also note potential benefits in using LLMs to handle intermediate tasks, such as note summarization, interim patient support, or improving access to mental health education. However, they recognize that evidence supporting their therapeutic effectiveness is currently limited.45

If, on the other hand, LLMs function primarily as friends or companions, the need for regulation and human oversight might be much lower, although both safety and efficacy concerns would remain. For example, should chatbots that do not claim to provide therapy be freely accessible to potentially vulnerable people, such as children or individuals with severe mental illness, or do their potential unintended consequences still need to be mitigated?

Responses

Daniel Barron

Medical decision-making rests on two broad forms of knowledge: trained judgment and quantitative inquiry. Trained judgment is what a psychiatrist develops over years of medical school, residency, fellowship, and ongoing clinical practice. From a neuroscience perspective, the clinician’s brain serves as sensor, evidence generator, and interpreter—all in one. When a clinician passes their board exam, what this indicates to patients (and payers) is that the clinician’s brain has internalized a baseline level of medical knowledge and demonstrated the ability to apply it under uncertainty in line with community standards. But competence in psychiatry goes beyond memorizing criteria. It involves developing sensitivity to the “texture” of a patient’s life—their tone of voice, shifts in body language, and the felt sense of their emotional state. This is difficult to do well. I often tell my patients that I am a better clinician today than I was five years ago—and that I expect to be better still ten years from now. Practicing medicine under uncertainty requires a kind of humility: I am constantly learning from my patients and my own decisions because I choose to learn. In AI terms, this is akin to “recursive self-improvement.” While recursive self-improvement is foundational to human medical training, it remains a significant challenge for current AI tools.

Quantitative inquiry, by contrast, is decision-making guided by empirical, measurable evidence. In this case, the sensor and evidence generator lie outside the clinician’s brain—embedded in instruments and tests. A cardiologist may adjust an antihypertensive medication based on blood pressure readings; an oncologist may evaluate the effectiveness of chemotherapy by tracking changes in tumor volume. In each case, the decision is anchored to data that are external to the clinician’s brain and observable by others. In mental health care, the potential for quantitative inquiry is vast but largely untapped—because the relevant data are often high-dimensional, subjective, and dispersed across time and context. That said, with emerging tools and capacities—digital phenotyping, voice and speech analysis, passive behavioral monitoring—we now stand at the edge of expanding psychiatry’s empirical foundation.

The fundamental enterprise of evidence-based medicine is to deploy the scientific method to translate trained judgment into quantitative inquiry. Clinical trials are the key mechanism of this translation: They formalize clinical intuition into reproducible knowledge. With this in mind, consider that the long-term goal of all medical practice is to become increasingly fit for automation—not to eliminate human clinicians but to systematize what works and scale it. To some, this may sound unsettling. What of empathy, nuance, or human judgment? Yet many areas of medicine have already embraced this trajectory. We very much expect cardiologists and oncologists to be empathic and perceptive—but we seek their care because they interpret blood pressure, heart rhythms, and tumor volumes through rigorous, validated protocols. The aspiration of mental health care is no different: to blend empathy with increasingly robust, data-driven decision-making.

Putting aside this larger and critical philosophical motivation, whether human oversight needs to accompany AI in mental health care depends entirely on the task at hand. For interpreting terabytes of quantitative measures to apply the latest evidence-based guidelines, human supervision (to the extent it is even feasible) may be minimal. Similarly, straightforward, low-risk tasks—think helping patients book follow-ups or providing standardized information on sleep hygiene—can likely be delegated to AI with confidence. But for higher-risk decisions, such as definitive diagnoses, significant human oversight is both wanted and required. The right approach is task-specific: define the clinical task, characterize the AI solution for it, then compare it to how human beings do the same job. This concrete exercise allows us to weigh the risks and benefits not in the abstract but against real-world standards.

Hyein Lee and colleagues found that, while patients see the usefulness of AI conversational agents for certain tasks, most still want a human involved in AI-driven therapy, particularly for direct care tasks.46 This suggests that patient acceptance of AI autonomy is task-dependent. Hybrid models are often touted in which AI acts as an assistant, augmenting human capabilities for specific jobs. An AI might screen for potential drug interactions (a well-defined task) for a human being to double-check, applying their trained judgment to the AI’s conclusions.
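As a deliberately simple illustration of that division of labor, the sketch below has a program propose drug-interaction alerts from a tiny lookup table while marking every alert as pending clinician confirmation. The table, drug pairs, and function names are illustrative assumptions, not a clinical reference or any particular product’s design.

```python
# Toy sketch of the "AI screens, human double-checks" pattern: the program proposes
# interaction alerts from a tiny lookup table, and every alert is explicitly marked
# as pending clinician confirmation. The table is illustrative, not a formulary.
TOY_INTERACTIONS = {
    frozenset({"fluoxetine", "tranylcypromine"}): "SSRI + MAOI combination: risk of serotonin syndrome",
    frozenset({"lithium", "ibuprofen"}): "NSAIDs can raise lithium levels",
}

def propose_alerts(medications: list[str]) -> list[str]:
    """Return draft alerts for a clinician to confirm or dismiss; the program decides nothing."""
    meds = [m.lower() for m in medications]
    alerts = []
    for i, first in enumerate(meds):
        for second in meds[i + 1:]:
            note = TOY_INTERACTIONS.get(frozenset({first, second}))
            if note:
                alerts.append(f"PENDING CLINICIAN REVIEW: {first} + {second} -> {note}")
    return alerts

if __name__ == "__main__":
    for alert in propose_alerts(["Lithium", "Ibuprofen", "Sertraline"]):
        print(alert)
```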

R. Andrew Taylor and colleagues describe AI streamlining information gathering in emergency rooms, a specific role that complements rather than replaces a clinician’s trained intuition.47 Oliver Higgins and colleagues echo this, stating that AI/ML-based clinical decision support systems should enhance human judgment for defined clinical tasks. They also stress the importance of clinician trust, system transparency, and ethical considerations such as bias and equity, especially for vulnerable populations.48

Where could AI possibly act without human intervention? Perhaps in highly repetitive, data-heavy jobs with low inherent risk, such as performing initial scans of wearable data to flag troubling patterns for human review; in short, tasks where AI demonstrably outperforms humans in speed and accuracy. Anithamol Babu and Akhil Joseph offer examples where human beings and AI can collaborate effectively.49 However, even in these contexts, they caution against “automation bias,” in which clinicians might place undue trust in the AI’s recommendations (this is also a common critique of professional guidelines, which, unfortunately, some clinicians blindly trust, leading to the reminder, “Treat the patient, not the protocol!”). Having clear protocols for when to seek help is always nonnegotiable—whether you are an AI or a medical student.
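As a minimal sketch of that kind of low-risk, repetitive task, the hypothetical script below scans daily wearable summaries and flags troubling patterns for human review without taking any clinical action. The field names, thresholds, and data are illustrative assumptions, not validated criteria.

```python
# Hypothetical sketch: an autonomous first-pass scan of wearable summaries that only
# flags patterns for human review and never acts on its own. Thresholds, field names,
# and example data are illustrative assumptions, not validated clinical criteria.
from dataclasses import dataclass

@dataclass
class DailySummary:
    date: str
    sleep_hours: float
    resting_heart_rate: int
    steps: int

def flag_for_review(days: list[DailySummary],
                    min_sleep: float = 4.0,
                    max_rhr: int = 100,
                    min_steps: int = 500) -> list[str]:
    """Return human-readable flags; escalation decisions stay with the clinician."""
    flags = []
    for d in days:
        if d.sleep_hours < min_sleep:
            flags.append(f"{d.date}: sleep {d.sleep_hours}h below review threshold")
        if d.resting_heart_rate > max_rhr:
            flags.append(f"{d.date}: resting HR {d.resting_heart_rate} above review threshold")
        if d.steps < min_steps:
            flags.append(f"{d.date}: activity {d.steps} steps, possible withdrawal pattern")
    return flags

if __name__ == "__main__":
    week = [
        DailySummary("2025-03-01", 7.2, 64, 6200),
        DailySummary("2025-03-02", 3.1, 71, 400),   # would be flagged
    ]
    for flag in flag_for_review(week):
        print("FOR HUMAN REVIEW:", flag)
    # Per the "clear protocols for when to seek help" point above, anything flagged
    # is routed to a clinician; the script itself takes no clinical action.
```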

 

Nicholas Jacobson

The level of human oversight should be based on the level of evidence of safety and efficacy. Given the current state of generative AI technology and the inherent risks in mental health care, the answer for now is yes, a human should remain in the loop, particularly for oversight and safety, though the nature of that loop can vary. While fully autonomous AI therapy is a potential future goal, the field is nascent, and any move toward autonomy must come with more time and evidence.

What are the risks and benefits of using purpose-built AI tools with and without human oversight?

With human oversight, clinicians can review interactions, intervene in crises, correct AI errors, and integrate AI insights into overall care. The primary downside is the resource cost of maintaining that oversight. The main benefit of forgoing oversight is maximum scalability and reduced cost. However, the risks are currently unacceptably high. These include the AI providing harmful or inappropriate responses, failing to detect or adequately respond to crises (e.g., suicidality), perpetuating biases, or fostering unhealthy dependence. Our Therabot trial, despite strong results, involved human monitoring. No generative AI is ready for fully autonomous operation in mental health care today.

How can requirements around human oversight be established and enforced?

Scientists should make human oversight a norm of scientific work and of peer review. Requirements should be established by regulatory bodies (such as the FDA for tools making clinical claims) based on the tool’s intended use, level of risk, and demonstrated autonomous safety capabilities. Enforcement could involve mandatory reporting, periodic audits, and clear protocols for human review and intervention as part of the approval process. Certification standards for AI mental health tools could mandate specific oversight levels based on their level of evidence.

How can hybrid models effectively balance automation with human clinical judgment? How can hybrid models that enhance human clinical judgment be promoted?

Complicating the use of hybrid models is that the therapeutic orientation and intervention targets of the generative AI and of the broader care system need to be fully aligned. This can be nontrivial and may require the same level of coordination as when multiple outpatient providers try to treat the same patient (something most therapists will not do). Hybrid or blended care systems must demonstrate their value proposition to clinicians (reducing workload, enhancing outcomes) and patients (providing continuous support), and they must be integrated smoothly into clinical workflows, with proper training for therapists on how to use them effectively.

In what scenarios might purpose-built AI operate fully independently without compromising patient safety and care quality?

Although these systems can currently be deployed safely with some human oversight, in no scenario can a generative AI operate fully independently to treat diagnosed mental health conditions without compromising safety. Significant advances in demonstrating safety, reliability, contextual understanding, and fail-safe mechanisms, verified through extensive, rigorous testing and regulatory approval, should be required before considering independent operation for therapeutic purposes.

 

Hank Greely

For the standard “human in the loop” question, the answer seems to me to be clearly “it depends.” Unless or until there is solid evidence that, at least in some categories of AI interventions with some categories of patients, AI without a human in the loop works as well as or “better” than AI with a human in the loop, then, yes, a human being should be required. (Note that defining how well any mental health care treatment works will be tricky, especially if safety moves in one direction and effectiveness—or cost and hence accessibility—moves in another.) It is foolish to say, today, “always” to just about anything in this rapidly developing field. I am in general a skeptic about just how useful AI will be across a range of applications. I certainly do not expect it to be “magic pixie dust.” But I cannot exclude the possibility that it might work better, in some circumstances, when no potentially interfering humans are in the loop. We just do not, and, at this point, cannot know.

I will add that one aspect of this issue does actually give me some sympathy for AI (or its developers). Some people tend to demand that AI be perfect; one sees that occasionally in discussions of driverless cars. But the standard should really be, “Is it better or worse than human-controlled decisions?” (Human drivers, after all, are often terrible.) Or, in this case, “Is the mental health care better with or without humans in the loop?” It seems to me right to put the burden of proof on those promoting AI to show that it is at least as good, and preferably better, but, if they can prove that sufficiently, no humans should be required in the direct use of the AI. (I will still argue, on grounds of political theory, albeit perhaps a Homo sapiens–centric theory, that human beings must be the ultimate decision-makers on whether and how AI is to be used.)

 

Arthur Kleinman

A human being must always be in the loop. In the absence of human assessment, AI has not been demonstrated to be able to provide an unbiased evaluation of its own workings. To the best of my understanding, no evaluation of real outcomes has yet demonstrated that AI is effective and safe in mental health care. For example, in a recent piece in the journal Nature Communications, researchers at Brigham and Women’s Hospital in Boston demonstrated that, in primary care medical evaluations, AI interventions that had been found adequate using multiple-choice questions were nevertheless found to be inadequate when the AI was used in real-world doctor-patient conversations.50 Surely this will improve over time, but outcomes need to be assessed by human experts who are outside the AI intervention under examination.

AI should thus never operate independently of human clinical judgment; instead, it should always augment it. In fact, this is where I believe AI could really advance clinical care, the key elements of which are quality of relationships, communication, and clinical judgment. In most health care today, including mental health care, we usually measure none of these things. Direct measurement of the quality of clinical care is one of the most significant things AI can achieve, and it is a clear example of an operation that augments, but does not substitute for, human therapeutic intervention.

AI systems will become vulnerable to hacking the moment they are introduced. To address this security risk and the privacy concerns it raises, researchers and developers will have to create and adopt best practices and guidelines for preventing serious misuse and abuse. This should be a fundamental concern of regulatory frameworks. Each mental health care system or agency should prioritize the protection of privacy and the maintenance of security.51

 

Robert Levenson

What are the risks and benefits of using purpose-built AI tools with and without human oversight?

This question arguably reflects our inherent mistrust of technology, fanned by decades of popular culture tropes of machines run amok (from HAL to the Terminator and beyond). Inherent in this mistrust is a belief that having human beings in the loop (despite all of their flaws, including difficulty maintaining sustained attention and vulnerability to fatigue) will protect us against the failings of technology in general and of AI mental health interventions (AIMHIs) in particular.

Clearly, having human beings in the loop in the relatively early days of AIMHIs will have precautionary advantages, as we do not yet have high-quality research data available to evaluate bots’ safety, efficacy, and ability to deal with situations that fall outside their rules-based programming and LLM training. In this early period, it is important to avoid some of the foibles of human cognition, such as overweighting negative events and ignoring base rates. Thus, we should expect to see incidents where people working with AIMHIs engage in suicidal behavior or violence toward others that is not detected, reported, or well handled by the bots. As horrifying as these events are, we must continue to ask how the rates of such events compare between AIMHIs and human therapists and take these data into account when developing future mental health policies.

As scientists, we should be working to evaluate whether a human being always needs to be “in the loop.” However, we must also be open to expanding our evaluation to consider whether, in some areas, a bot should also always be in the loop. In contrast to AIMHIs, where we lack sufficient high-quality research data to determine how well AI bots will do with therapy, assessment, and other mental health–related activities, data from medicine already indicate areas where bots are particularly effective. For example, we can imagine a time when radiologists examining a mammogram for early signs of tumor growth will routinely seek a “second opinion” from a bot that has been trained to make these kinds of judgments, which it should be capable of doing without lapses related to fatigue or distraction.

In a similar vein, mental health bots could be trained to digest results from batteries of psychological tests and/or behavioral observations to help provide an accurate clinical diagnosis, whether of the traditional type associated with the Diagnostic and Statistical Manual of Mental Disorders or one of the modern alternatives (e.g., the Hierarchical Taxonomy of Psychopathology).52 Going beyond this, given adequate training materials, it is not far-fetched to imagine bots that could review recordings of psychotherapy sessions to detect patterns of client/patient discourse that suggest heightened risk for suicide or other forms of self- or other harm. Although many therapists and assessors in training have their work double-checked by a supervisor, this is rare after formal training ends. If AI bots are found to be able to provide this kind of quality assurance and backup at high levels of reliability and validity, would any future client want to see a human therapist or assessor without an AI in the loop?
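As a toy illustration of this “digest the test battery” idea, the hypothetical sketch below turns two screening totals into a draft summary that a human assessor must confirm. The cut-offs follow commonly published PHQ-9 and GAD-7 severity bands, but the code is illustrative only and is not a diagnostic tool.

```python
# Minimal sketch of the "digest a test battery for a human assessor" idea above.
# It only summarizes standardized screening scores into a draft report for clinician
# review; it does not diagnose. Cut-offs shown follow commonly published PHQ-9 and
# GAD-7 severity bands, but treat the whole example as illustrative.

PHQ9_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"), (15, "moderately severe"), (20, "severe")]
GAD7_BANDS = [(0, "minimal"), (5, "mild"), (10, "moderate"), (15, "severe")]

def severity(score: int, bands: list[tuple[int, str]]) -> str:
    """Map a total score to the highest band whose floor it meets."""
    label = bands[0][1]
    for floor, name in bands:
        if score >= floor:
            label = name
    return label

def summarize_battery(scores: dict[str, int]) -> str:
    """Produce a draft summary for the human assessor to confirm or override."""
    lines = ["DRAFT SUMMARY (requires clinician sign-off):"]
    if "PHQ-9" in scores:
        lines.append(f"  PHQ-9 = {scores['PHQ-9']} ({severity(scores['PHQ-9'], PHQ9_BANDS)} depressive symptoms)")
    if "GAD-7" in scores:
        lines.append(f"  GAD-7 = {scores['GAD-7']} ({severity(scores['GAD-7'], GAD7_BANDS)} anxiety symptoms)")
    return "\n".join(lines)

if __name__ == "__main__":
    print(summarize_battery({"PHQ-9": 14, "GAD-7": 6}))
```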

Clearly, studies are needed that evaluate the relative safety and efficacy of human practitioners and AIs working alone. But research could also show that having them work together has exciting synergistic effects on client/patient safety, treatment efficacy, and our ability to increase the number of clients/patients with whom human therapists can work.

 

Alison Darcy

While the FDA is convening meetings and gathering feedback, at the time of writing, states are also taking proactive steps to regulate AI. Bills from California (AB 3030, AB 2013, SB 243) will affect AI deployed in health care, require transparency in training data, and specifically target chatbots. Elsewhere, Colorado’s AI Act, Utah’s SB 149, Oregon’s HB 2748, New York’s AI companion restrictions, and Illinois’s HB 1806 will require active policy tracking by AI mental health companies. These companies will need to consider implementing features that enable deployer compliance, and they will benefit from embedding compliance early. Companies will also be required to ensure datasets comply with privacy and secondary-use restrictions, and to audit marketing claims and public documents to ensure transparency with end users. The bills also emphasize the need for continuous performance monitoring to track model drift. All these bills are beyond the FDA’s purview and translate into a growing, differentiated set of expectations at the state level. Estimates suggest there are over two thousand bills addressing AI at the state level; it remains to be seen how many will be specific to AI applications in mental health. While compliance may historically have been viewed as a burden, it is fast becoming a competitive advantage.

Leading bodies, such as the World Health Organization, the American Psychological Association, the American Psychiatric Association, and the American Medical Association, are going on record regarding chatbots, calling for more clinical oversight, transparency around privacy and security, patient safety, and a human in the loop as concerns grow about replacing clinical judgment. At a minimum, chatbots should be clear about their protocols and guardrails, with documented risk taxonomies and playbooks covering self-harm, substance use, referral to crisis hotlines, and appropriate deflection responses. Real-world monitoring should include drift dashboards, periodic red teaming, and the publication of post-market summaries. Special considerations should be added for subgroups such as youth versus adults; for example, chatbots with age-appropriate language, teen-tuned policies, guardian controls, and age checks.
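As a concrete, hypothetical sketch of what a documented risk taxonomy with playbooks might look like in code, the example below maps each risk category to a scripted deflection response and a human-escalation flag. The categories, keyword matching, and wording are placeholder assumptions, not a recommendation for any specific product.

```python
# Hypothetical sketch of a documented risk taxonomy with per-category playbooks,
# in the spirit of the guardrails described above. The categories, example phrases,
# and responses are illustrative assumptions; a real system would use validated
# classifiers, clinician-approved scripts, and logging for post-market review.
from dataclasses import dataclass

@dataclass
class Playbook:
    deflection: str          # what the chatbot says
    escalate_to_human: bool  # whether a human reviewer is notified

RISK_TAXONOMY = {
    "self_harm": Playbook(
        deflection=("I'm not able to help with this safely. If you are thinking about "
                    "harming yourself, please contact a crisis line such as 988 (US) "
                    "or local emergency services."),
        escalate_to_human=True,
    ),
    "substance_use": Playbook(
        deflection="This is outside what I can advise on; I can share substance-use support resources.",
        escalate_to_human=True,
    ),
    "out_of_scope": Playbook(
        deflection="I can't help with that topic, but I can keep supporting the goals we've been working on.",
        escalate_to_human=False,
    ),
}

# Placeholder classifier: keyword matching stands in for whatever risk-detection
# model a real deployment would document and red-team.
KEYWORDS = {"self_harm": ["hurt myself", "end my life"], "substance_use": ["relapse", "overdose"]}

def route(message: str) -> tuple[str, Playbook]:
    """Return the matched risk category and its playbook for a user message."""
    text = message.lower()
    for category, phrases in KEYWORDS.items():
        if any(p in text for p in phrases):
            return category, RISK_TAXONOMY[category]
    return "out_of_scope", RISK_TAXONOMY["out_of_scope"]

if __name__ == "__main__":
    category, playbook = route("I keep thinking I might hurt myself")
    print(category, "| escalate:", playbook.escalate_to_human)
    print(playbook.deflection)
```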

Endnotes