QUESTION 1: How might we measure the effectiveness of AI-driven mental health interventions?
Background
A limited but growing evidence base supports the effectiveness of purpose-built mental health chatbots, particularly earlier-generation tools that do not use large language models (LLMs). Several studies of purpose-built mental health chatbots show moderate improvements over control conditions when efficacy is assessed through symptom reduction on validated clinical scales.10 For example, a randomized controlled trial (RCT) of a chatbot for depression found that the chatbot group had greater reductions in depression and anxiety scores than bibliotherapy controls.11 Two meta-analyses covering twenty-nine studies also demonstrated medium to large effect sizes in alleviating anxiety and depression symptoms, although the underlying studies were generally of poor quality, with high heterogeneity, small sample sizes, and inconsistent blinding.12
Some postintervention studies show that these benefits can be sustained after treatment concludes. In one eight-week intervention, 60 percent of initially depressed patients maintained clinically significant improvement at one-year follow-up.13 However, other research indicates that symptoms can reemerge after the initial treatment period ends, as they can with other forms of therapy.14 Long-term longitudinal data are limited, particularly for individuals with moderate to severe conditions, for whom ongoing care is often required.
Evidence on compliance-related outcomes, such as treatment adherence and engagement, is similarly limited. AI-based interventions generally achieve dropout rates comparable to those of traditional interventions, but treatment engagement varies widely. Factors such as user demographics and app design contribute to this variation, but inconsistent definitions (such as variably defining engagement as app logins or as completion of therapeutic modules) complicate comparison.15
Variation in outcomes among studies may also be explained by the choice of comparison groups. AI tools generally show strong outcomes when compared to waitlist controls. For example, an AI-based anxiety program produced large symptom reductions relative to no-treatment conditions. Against active controls, some (but not all) AI-guided interventions have achieved outcomes that are comparable to traditional cognitive behavioral therapy (CBT).16 Fully automated chatbots for depression and anxiety have also demonstrated results similar to traditional therapy over short durations such as two months.17 However, evidence on longer-term outcomes remains sparse, and few studies include diverse or high-acuity clinical populations.
As most studies have focused on the impact of AI on a patient or user, little is known about its use in mental health care administrative tasks. AI tools are increasingly used in this context, often without formal documentation or patient awareness. Therapists may use LLMs to draft session notes, summarize progress, or generate treatment plans.18 These informal uses are rarely studied in trials, yet AI used in this way may nonetheless shape therapeutic decisions. This reliance raises questions about accuracy, accountability, and consent, especially when tools influence care without being disclosed to patients.
While the research base for non-LLM AI tools is comparatively substantial, the clinical effectiveness of LLM-based interventions remains largely untested. Trials are underway, but, as of this writing, peer-reviewed outcome data on LLM-driven chatbots are limited. Future assessments will need to distinguish clearly between different AI architectures and their respective capabilities, risks, and regulatory needs.
Responses
Robert Levenson
What outcomes should be prioritized: symptom improvement, treatment adherence, or long-term well-being?
In evaluating the efficacy of AI-driven or any other mental health interventions, it is important to recognize that evaluation opportunities will arise at many levels of scientific rigor, ranging from word-of-mouth and “Yelp-like” user satisfaction ratings to formal randomized controlled clinical trials. RCTs will ultimately provide the most definitive determination of efficacy; however, there is much value in building in other kinds of evaluation wherever possible.
Many clients, patients, and users will come to AI-driven mental health interventions (AIMHI) hoping to obtain relief from troublesome symptoms (e.g., reducing anxiety and/or depression, habit abatement); thus, measures of symptom improvement will be paramount. However, others will come with hopes of improving their general quality of life, including strengthening relationships with partners, friends, and family. Although symptom reduction may well occur in those instances, it will be important to include measures that are appropriate for these goals (e.g., measures of well-being and relationship satisfaction). Thus, especially in situations in which only limited assessments are possible, measure selection should start with the factors being targeted by the treatment.
With any measure of symptom reduction, it is important to build in periodic follow-up assessments to determine whether gains (or losses) are maintained over time. There is no shame in finding out that the benefits of a promising AIMHI are short-lived. Many conventional treatments for mental health and relationship issues show declining efficacy over time19 and/or require periodic “booster” interventions.20
Assessing user satisfaction with AIMHIs will be important. Note that satisfaction is not synonymous with efficacy. A person can greatly enjoy their interaction with a therapy bot yet show no symptom reduction or progress toward other treatment goals. Conversely, a poor user experience could still lead to improvement (e.g., getting useful information despite a clumsy user interface). Three decades ago, a satisfaction-focused evaluation of various kinds of therapy conducted under the auspices of Consumer Reports proved to be quite useful (e.g., in helping to understand early termination by clients).21
Measuring intermediate/mediating factors known to be related to good treatment outcomes has great value. In the realm of mental health, the quality of the relationship between the therapist and client has proved to be particularly important. Tracing back to the earliest studies of psychotherapy, factors such as high levels of therapist empathy (i.e., the client perceiving the therapist as listening carefully, understanding, and caring) have been related to better outcomes.22 In more modern conceptualizations, high levels of therapeutic alliance (i.e., the sense that the therapist and client are committed to working together to address the client’s issues) have similarly been associated with better therapeutic outcomes.23 Thus, it seems wise to include these kinds of measures as well as outcome-focused measures (e.g., of symptoms, relationship quality).
Finally, measures of client demographics (e.g., socioeconomic status, rural/urban, ethnicity, age) and treatment course (e.g., number of sessions, early termination) are important for evaluating the effectiveness of AIMHIs. In studies of human therapists, certain ethnic groups have been historically underserved and, when served, tend to end treatment early.24 Similarly, with human therapists, geographic location can be important, with the effectiveness of empirically validated therapies lessening as a function of distance from university-based clinics.25
What should be the standard of comparison for purpose-built tools?
To make optimal use of data derived from research on whether AIMHIs work, for whom, and how, it will be critical to compare these data against data from other approaches. As AI tools proliferate, societal and commercial demand for comparisons among AIMHIs will increase. We expect that many anecdotes and testimonials will also appear, suggesting miracle cures and heartbreaking failures. These will ultimately be of limited scientific value and may create a background of noise that can obscure important decision-making by consumers, providers, insurers, and public health planners. “Horse-race” studies comparing treatments that are replete with disqualifying confounds will also likely appear (e.g., comparing treatment X, used with young, male college students, versus treatment Y, used with more occupationally diverse, middle-aged, female clients and patients). Arguments for high-level empirical standards will need to be repeatedly made and heeded.
The “gold standard” of RCT research designs can help avoid many of the confounding factors that plague less rigorous research designs. Unfortunately, existing published treatment research with human therapists does not always reach these standards. Moreover, even when RCTs are used, active treatments are often compared to nontreatments (e.g., waiting-list controls). Results of such studies almost always indicate that “something is better than nothing,” a finding that is interesting but not fully satisfying. What is needed are more powerful designs in which a treatment of interest is compared with another “active” treatment in ways that control for some of the possible confounds (e.g., different amounts of time spent with a helping person or agent), as well as a minimal treatment (to control for the passage of time, which may cause the problems being treated to decrease or increase). In the case of AI bots, research designs that compare an AI treatment with a human-based treatment and a minimal treatment condition could reveal a great deal about the advantages and disadvantages of AIMHIs compared to human therapists. Importantly, both the public and the scientific community must not let their preconceptions (e.g., machines are better than human beings; human beings are better than machines) cloud their objectivity when evaluating the data from these studies. Low-tech, relatively inexpensive nonhuman treatments (e.g., self-help, psychoeducation) should also be included in research that compares treatments.
Finally, because of the long history in therapy research of finding that factors such as therapist empathy and the therapeutic alliance are important “nonspecific/common factors” that contribute to treatment efficacy, measures of these factors should also be included in evaluations of AIMHI therapies. Legitimate questions have been raised as to whether AI bots can create durable human bonds between the bot and the client and patient. A market flooded with clumsy, poorly designed AIMHIs will do little to assuage such doubts. But, for the best of the implementations, this will be an important question to ask and answer.
Hank Greely
“It depends” is almost always an excellent way to begin an answer. And the answers to all these questions—and to the fundamental underlying issue of the value of AI in mental health—depend on how safe and effective its use is, not “in general,” but, ideally, for individual patients. In practice this will almost certainly require grouping individual patients into categories; that is, the consequences of AI in mental health care for adolescents with anorexia nervosa, patients with geriatric delusional psychosis, or adults with obsessive/compulsive disorders will need to be considered separately. AI in mental health care may be miraculous for some of those groups (or, probably more accurately, for many people in some of those groups) and disastrous for people in other groups (as well as for some people in the groups that largely benefit from it).
How can we know where it works, for whom, and under what conditions? “Rigorous studies” are the obvious answer, but what those are and how to get them are tricky questions that follow immediately. Ideally, requiring such studies before a treatment can be marketed would be a solution; however, I think issues of political economy will make such requirements highly unlikely to be imposed.
A second- (or third- or eighth-) best solution might be to require extensive data collection by those prescribing or using AI in mental health care, with the further requirement that these data be available for use by independent researchers: either one or a few specified entities or a broader category of groups. This raises obvious risks about patient privacy, especially given the likelihood that, among other things, AI will further reduce the already tenuous reality of “deidentification,” the idea that removing directly identifying information from data files will protect privacy. (This hope is increasingly undercut by larger databases and better computer search abilities, including AI.) But it might well be worthwhile, though also not politically easy to implement.
Another problem embedded in this question: How will we know who is using AI in mental health care? Defining what we mean by “AI in mental health care” will be hard. If a therapist uses a broadly available LLM to help write up notes of a patient encounter, is that an example of AI in mental health care? What if she uses it to “hold a conversation” with a patient? To analyze a patient’s responses for signs of mental illness? Identifying people doing something that meets whatever definition is used will be harder. If a therapist does whichever of those things we decide to call “AI in mental health care,” how will we know it? Enforcing any requirements on such uses will be even harder.
Alison Darcy
To measure symptom change, investigations into the efficacy of AI interventions should follow the same principles as those used in traditional treatment outcomes research. Observed improvements in a person are, after all, independent of the type of intervention delivered; therefore, this should be the highest-order objective. The same applies to less clinical, real-world outcomes, such as quality of life, loneliness, and health economics, which require all the usual rigor in administration, validation, and analysis.
However, studies should also be capable of capturing the broader advantages of AI interventions, such as time to care, time to treatment response, preference, and engagement. This additional layer of investigation, which is both necessary and promising, pushes us to expand beyond the methods typical of traditional outcomes research. For example, what role these interventions will play in our current health systems and structures is not yet clear. While much of the field is beginning by augmenting traditional clinical care, often in patients experiencing subclinical and/or mild symptoms, this is not likely where we will end up, given that the most pressing problem in the field of mental health care today is access. Studies show a gap of eleven years from symptom onset to initial treatment.26 Therefore, a full and fruitful exploration of the efficacy of this technology must adopt a robust systems perspective. The question becomes not just whether interventions are safe and effective but how they might change clinical care and what will emerge as the leading opportunities for improving access, efficiency, clinician burden, and patient outcomes.
Engagement is one area where we ought to think differently about the measurement of AI-delivered interventions. The traditional way of measuring engagement looks at total time of exposure. In traditional treatment outcomes research, we can assume that if a person has attended four therapy sessions in a clinic, they have had approximately four hours of therapy. However, this mapping does not work for many AI therapeutic interventions because the structure and shape of interactions are so fundamentally different. The type and quality of interaction can vary widely in a digital world. Mindless scrolling, for example, is not the same as a chat-based therapeutic exchange, so simply quantifying time across all behaviors would be inappropriate. Interactions are usually much shorter but might be expected to occur more frequently than just once per week. At Woebot Health, we have argued in favor of concepts closer to potency, rather than total time spent.27 One such dimension is symptom change with time as the denominator: in this model, a shorter time to symptom reduction, rather than being interpreted as “less adherent,” might actually be viewed as more favorable because it is more potent.
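Because the potency argument is cited here only at a conceptual level, the following is a minimal illustrative sketch of what a potency-style index could look like; the field names, units, and example numbers are assumptions for exposition, not a published Woebot Health metric.

```python
from dataclasses import dataclass


@dataclass
class OutcomeRecord:
    """One participant's scores on a validated symptom scale (e.g., PHQ-9)."""
    baseline_score: float      # symptom score at the start of the intervention
    endpoint_score: float      # symptom score at the chosen endpoint
    days_to_endpoint: float    # elapsed time from baseline to that endpoint


def potency(record: OutcomeRecord) -> float:
    """Hypothetical potency index: symptom reduction per week of elapsed time.

    Higher values mean faster improvement, in contrast to engagement metrics
    that implicitly reward more total time spent in the app.
    """
    reduction = record.baseline_score - record.endpoint_score
    weeks = record.days_to_endpoint / 7.0
    return reduction / weeks if weeks > 0 else 0.0


# A 6-point drop reached in 2 weeks scores higher than the same drop reached
# in 8 weeks, even though the slower case involves more weeks of app use.
fast = OutcomeRecord(baseline_score=15, endpoint_score=9, days_to_endpoint=14)
slow = OutcomeRecord(baseline_score=15, endpoint_score=9, days_to_endpoint=56)
assert potency(fast) > potency(slow)
```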
Finally, this technology may afford new opportunities that can help inform therapeutic models and systems more broadly. Psychotherapy itself is an imperfect and arguably still nascent field in which a lack of data has complicated efforts to innovate. When innovations have occurred, they have come from visionaries who are usually experts in their field and can therefore draw from thousands of therapy hours to emerge with key insights that push the field forward. What AI and AI-delivered therapeutic interventions give us is a crucial opportunity to accelerate innovation through access to a large amount of data and a natural ability to atomize concepts into micro-interventions that can be practiced in real time outside the clinic walls. For example, we can systematically explore moderators and mediators of therapy—what works, for whom, and under what circumstances—because we have datasets with sufficient statistical power for the first time, unlocking a pathway to improving outcomes through precision intervention. Findings here could, in turn, inform human-delivered therapy, helping the field make better, more holistic use of all of the services we are developing. And that is truly exciting.
Eric Horvitz
Advancing applications of the constellation of technologies collectively known as artificial intelligence in mental health interventions will require targeted clinical research that rigorously evaluates both established therapeutic approaches and emerging AI-enabled modalities. Given the breadth of the design space, an essential early step is to clarify concrete use cases and clinical scenarios for systematic study. These may range from standalone AI-based tools to systems that operate under direct clinical supervision. The technologies span from fine-tuned, purpose-built psychological support models to generalist models adapted for therapy.
Usage scenarios fall along a continuum of clinical oversight and patient engagement. At one end are standalone self-help agents accessed independently via computers or smartphones. At the other are deeply integrated decision-support tools embedded in clinical workflows, surfacing timely insights to supervising clinicians. Between these poles lie hybrid models—for example, generative AI-powered chatbots that provide therapeutic interactions between sessions and flag excerpts or generate summaries for therapists. Such systems may be deployed in an ongoing manner or when primary therapists become intermittently unavailable, for example, during professional travel or vacations. Additional possibilities include relapse prevention systems that combine passive sensing with AI-driven outreach, as well as assistants that draft progress notes or recommend evidence-based interventions to clinicians.
To systematically map this landscape, at least three intersecting dimensions must be considered: the degree of system autonomy, the intensity and locus of clinical oversight, and the level of personalization for each user. Cross-cutting all of these is the critical need for clinical validation. Foundational concerns, such as privacy, transparency, equity, and safety, must be addressed.
Emerging evidence from randomized trials, mixed-methods evaluations, and scoping reviews suggests that AI-mediated cognitive behavioral interventions can lead to meaningful symptom improvements and high user engagement. However, these studies also reveal challenges, including inconsistent crisis response, embedded bias in language models, and performance drift as models evolve. These findings point to the need for a staged validation process akin to pharmaceutical development: beginning with feasibility and safety studies, progressing to adequately powered efficacy trials with active comparators, and culminating in pragmatic effectiveness trials that reflect real-world diversity in patients, settings, and implementation fidelity.
As capabilities advance, new psychological and relational concerns are also coming into view that require proactive attention. One emerging issue is what I refer to as the rising “mirage of mind” in conversational AI systems: a perhaps unavoidable tendency for users to perceive these systems, based on their fluency, responsiveness, and affective cues, as possessing personhood. This perception may include assumptions of sentience and human-like capacities for recall, relationship-building, trust, and emotional resonance. The resulting illusion can lead to inappropriate attributions of empathy, continuity of care, and understanding, generating a sense of therapeutic relationship that the system cannot truly support. Patients may believe the system “remembers” past interactions or genuinely cares, when in fact such capabilities do not exist. These misperceptions risk eroding clarity around roles, expectations, and trust, and may foster inappropriate attachment or reliance, especially with psychologically or emotionally vulnerable patients. Without careful design and user education, the illusion of connection may lead to emotional dependency, overreliance, or confusion about capabilities and accountability. Anticipating, studying, and mitigating these effects, particularly when they risk harm, will be necessary.
Meeting both established and emerging challenges with the rise of AI capabilities will be essential for transitioning from proof of concept to routine care. Reproducibility depends on common standards for prompt engineering, model provenance, and version control. Accountability will hinge on governance frameworks that align permissible autonomy levels with clinical risk. And critically, participatory design—engaging patients, clinicians, and ethicists as co-creators—will help ensure solutions are sensitive to diverse cultural and contextual needs.
By combining rigorous science, thoughtful design, close attention to evolving AI capabilities, and strong oversight, the field can chart a responsible and sustainable course for integrating AI into mental health care.
Nicholas Jacobson
I think the suite of tools used to evaluate psychotherapy applies well to generative AI–driven mental health interventions. Specifically, I believe both the benefits and risks can be measured and quantified to determine whether there is efficacy.
What outcomes should be prioritized: symptom improvement, treatment adherence, or long-term well-being?
The answer likely depends on what the system is intended to do. The primary outcome in most mental health applications should be symptom improvement, which is the goal for most persons seeking care. Treatment adherence and long-term well-being may be important primary outcomes in other applications; treatment adherence may be especially useful if, rather than serving as a standalone treatment, generative AI is used alongside other, more traditional treatments (e.g., to support medication adherence).
What should be the standard of comparison for purpose-built tools?
Multiple standards are highly relevant and thus have their place. For purpose-built tools aiming for clinical impact, the gold standard of comparison should ultimately be active, evidence-based treatments delivered by human providers. While waitlist controls are necessary in early phases to establish a baseline effect over no treatment (as in our initial Heinz et al. 2025 trial), demonstrating noninferiority or superiority against established therapies is key for integration into care systems.28 Sham comparisons or comparisons against other digital tools also have their place, but they are less informative about the true degree of clinical efficacy.
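To make the noninferiority logic concrete, here is a minimal sketch using a normal approximation; the margin, effect sizes, and sample sizes are hypothetical illustrations, not values from the Heinz et al. trial or any published study, and a real trial would prespecify its margin and analysis plan.

```python
import math


def noninferiority_check(mean_ai, mean_human, sd_pooled, n_ai, n_human, margin):
    """Crude two-sample check on mean symptom reduction (higher = better).

    The AI arm is treated as noninferior if the lower bound of the 95%
    confidence interval for (mean_ai - mean_human) stays above -margin.
    Uses a normal approximation for simplicity.
    """
    diff = mean_ai - mean_human
    se = sd_pooled * math.sqrt(1.0 / n_ai + 1.0 / n_human)
    lower_95 = diff - 1.96 * se
    return lower_95 > -margin, lower_95


# Hypothetical numbers: the AI arm reduces symptom scores by 6.1 points, the
# human-delivered comparator by 6.5, pooled SD 4.0, 150 participants per arm,
# and a prespecified noninferiority margin of 2 points.
ok, lower = noninferiority_check(6.1, 6.5, 4.0, 150, 150, margin=2.0)
print(f"noninferior: {ok}, lower 95% bound for the difference: {lower:.2f}")
```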
What mechanisms should be implemented to monitor and report long-term patient outcomes?
The implementation of mechanisms for long-term monitoring involves leveraging the technology itself. This can include periodic check-ins via the app using validated questionnaires (e.g., PHQ-9), but it could also include more objective behavioral indicators of well-being (e.g., time spent in conversation or time spent outside the home).
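As a hedged illustration of “leveraging the technology itself,” the sketch below scores periodic PHQ-9 check-ins and flags elevated scores for clinician review. The follow-up schedule, the flagging threshold, and the function names are assumptions for the example rather than a validated monitoring protocol; the standard PHQ-9 itself sums nine items scored 0–3, and a total of 10 is a commonly used cutoff for moderate depression.

```python
from datetime import date, timedelta

FLAG_THRESHOLD = 10  # commonly used PHQ-9 cutoff for moderate depression


def score_phq9(item_responses):
    """Sum nine item responses, each scored 0-3, for a total of 0-27."""
    if len(item_responses) != 9 or any(r not in (0, 1, 2, 3) for r in item_responses):
        raise ValueError("PHQ-9 requires nine responses scored 0-3")
    return sum(item_responses)


def schedule_checkins(treatment_end: date, months: int = 12, interval_weeks: int = 4):
    """Hypothetical follow-up schedule: one in-app check-in every few weeks."""
    checkins, current = [], treatment_end
    end = treatment_end + timedelta(days=months * 30)
    while current <= end:
        checkins.append(current)
        current += timedelta(weeks=interval_weeks)
    return checkins


def needs_clinician_review(total_score: int) -> bool:
    """Route scores at or above the threshold to a human clinician."""
    return total_score >= FLAG_THRESHOLD


# Example: a follow-up check-in scoring 12 would be flagged for review.
total = score_phq9([2, 1, 2, 1, 1, 2, 1, 1, 1])            # -> 12
print(total, needs_clinician_review(total))                 # -> 12 True
print(len(schedule_checkins(date(2025, 1, 1), months=6)))   # -> 7 scheduled check-ins
```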
How can AI tools be designed to promote appropriate disengagement when needed?
A key consideration in delivering appropriate care is examining not just the immediate impact of the technology used but also the long-term effects of actions promoted by a generative AI system. For this reason, generative AI should not be optimized directly for engagement, as doing so may promote dependence and may reward LLMs for reinforcing pathological behavior (e.g., responding with reassurance when a patient engages in reassurance seeking). Explicit behavioral recommendations to go out and experience life are likely important, and drawing on long-standing evidence about how to deliver appropriate care is also important in designing these systems.
How can AI systems be adapted to meet diverse cultural and linguistic needs while ensuring equitable outcomes?
AI systems can be trained with the same fundamental techniques used to train psychologists in multicultural competence, and this is a potential bedrock on which to attempt to adapt them. Adapting AI systems for diverse needs while ensuring equity is a significant challenge requiring dedicated research. Fine-tuning models, as we did with Therabot using expert-curated data, allows for incorporating specific cultural contexts, but this must be done carefully and rigorously for each adaptation to avoid perpetuating biases and to ensure equitable outcomes. This remains a critical area for future development.
Arthur Kleinman
The measurement of AI-driven mental health interventions should be no different from our measurement of health care interventions in general. Outcomes should include symptom improvement and long-term well-being when possible. Compliance is not a measure of health care efficacy and should not be used as a substitute for symptom change and well-being. Patient satisfaction with the quality of care needs to be assessed. Core to the assessment of quality care are measurements of the therapeutic relationship, of the quality of (and problems in) communication, and of the AI-driven equivalent of clinical judgment. One of the real values of AI is that it may itself be able to advance the measurement of these crucial aspects of caregiving, many of which we do not measure today. Again, as with any health outcome assessment, untoward effects must also be recorded.
The standard for comparison should be with established measures of health outcomes. Patients receiving AI-driven mental health interventions need to be followed with periodic outcome assessments. Furthermore, human assessment of AI outcomes is particularly crucial, so the system of evaluation must include evaluation by mental health experts, such as psychiatrists, psychologists, and social workers.
Algorithms central to AI contain cultural bias, and bias can also enter the framing and measurement of AI-driven activities. The same kind of attention to cultural bias that is paid to the psychological testing and algorithms used throughout health care should be applied in the mental health field as well, not only to detect cultural bias in AI but also to see whether AI-driven interventions contain their own kinds of digital bias. All AI interventions for mental health care should also be evaluated for their potential use with non-English speakers. Here AI interventions may have a built-in advantage.29
Daniel Barron
At the outset of this series of questions and responses, I note that my comments will focus on concepts that are important now and, critically, likely to be important in five, fifty, or five hundred years. The presence (or absence) of artificial intelligence does not change the fundamental problem of medicine: determining which tools, strategies, and ideas we can deploy to most effectively alleviate human suffering.
Alleviating human suffering remains medicine’s core task. However, when it comes to tools that involve “artificial intelligence,” adoption is often paralyzed by the grand existential debate of our time, which was eloquently illustrated in our committee proceedings. Here, two fundamentally different conversations clashed and stymied progress. On one side were practical, immediate, measurable, and actionable clinical questions: “Can X AI-based tool reduce patient PHQ-9 scores in Y weeks for Z dollars, thus giving access to mental health care to N people?” On the other were abstract, philosophical, and even apocalyptic questions: “Will superintelligence erode human empathy, cheapen the therapeutic alliance, and diminish friendships and families such that society writ large comes to an end?” The ultimate risks of scaling human capabilities—from empathy to avarice—are intellectually seductive but functionally paralyzing. These are value-laden, yet ultimately irreconcilable debates—what Isaiah Berlin identified as clashes of incommensurable first principles that are immune to evidence or consensus—that distract from our core goal: to relieve human suffering with every available tool.
An overemphasis on hypothetical risks over actionable utility has always bottlenecked innovation. Yet it is understandable, rational, and appropriate to approach new technologies with skepticism. Consider this excerpt from Daniel Immerwahr’s New Yorker essay, “What If the Attention Crisis Is All a Distraction?”:
I’m particularly fond of a hand-wringing essay by Nathaniel Hawthorne, from 1843. Hawthorne warns of the arrival of a technology so powerful that those born after it will lose the capacity for mature conversation. They will seek separate corners rather than common spaces, he prophesies. Their discussions will devolve into acrid debates, and “all mortal intercourse” will be “chilled with a fatal frost.” Hawthorne’s worry? The replacement of the open fireplace by the iron stove.30
My comments here focus not on abstract risks, but on the real opportunities in identifying specific, consequential problems and in building tools to solve those tasks—not hypothetically, but now. Medicine has always advanced by relentlessly pursuing its core mission: relieve human suffering with every available tool.
“Mental health interventions” should be defined broadly—not just as therapy, medication management, or some other procedure (e.g., transcranial magnetic stimulation, or TMS), but as the sum total of the clinical and administrative actions and processes that must be coordinated to bring a patient from their presenting condition (preintervention) to a successful outcome (postintervention). Approaching mental health interventions from a systems and process perspective will facilitate conversations about where, when, and how effective an AI-based tool is within this larger process. Critically, such a conversation will be important for determining responsibility and risk ownership, and for performing cost analyses of the AI-based tool to determine its feasibility in our modern health care system.
To guide this discovery process—and to begin determining how we might measure the efficacy of any AI-based tool—I propose three basic steps (see Table 1):
clearly define the clinical task as it currently exists;
characterize the proposed AI-based solution; and
compare the AI-based solution to other options to inform policy and regulatory processes.
While Table 1 is far from exhaustive, it illustrates what I hope is a useful framework for those developing (or considering developing) an AI-based tool for mental health care. Step 1 requires us to define the status quo for a given task, including the task’s definition, scope, setting, and existing outcome measures. Step 2 involves characterizing the proposed AI-based solution across multiple dimensions: task typology (assist, augment, automate), process design, failure mode, risk profile, cost, and readiness for deployment. Step 3 is a comparative assessment of the AI-based solution in the landscape of alternatives. This includes evaluating face validity (is it even appropriate to have AI perform this task?), payment pathways, policy formation, regulatory mechanisms, and enforcement considerations. Table 1 assumes a baseline level of compliance with the governance and enforcement of the Health Insurance Portability and Accountability Act (HIPAA), which in the United States falls under the purview of the Department of Health and Human Services’ Office for Civil Rights.
As Table 1 makes clear, AI-based tools might assist in many types of clinical tasks, each with its own profile of risk, readiness, and regulatory burden. Rather than debate the existential question of whether AI has a role in mental health care, we would do better to apply a dose of clinical precision to guide our considerations.
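For teams that want to work with this framework programmatically, one possible and entirely unofficial way to operationalize it is as a simple record per clinical task; the field names below are shorthand for the dimensions described above and in Table 1, not an established schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Typology(Enum):
    ASSIST = "assist"
    AUGMENT = "augment"
    AUTOMATE = "automate"


@dataclass
class TaskAssessment:
    """One Table 1-style assessment of a candidate AI-based tool."""
    # Step 1: the clinical task as it exists today
    task_definition: str
    task_setting: str
    task_scope: str
    outcome_measure: str
    # Step 2: the proposed AI-based solution
    typology: Typology
    process: str
    failure_modes: list[str] = field(default_factory=list)
    risk: str = "unassessed"       # e.g., low / moderate / high
    cost: str = "unassessed"
    ready_today: str = "unknown"
    # Step 3: comparison, payment, and regulation
    ai_reasonable: str = "uncertain"
    payer: str = ""
    policy_level: str = ""         # e.g., institution-level vs. federal-level
    regulator: str = ""            # e.g., institutional QA vs. FDA


# Example populated from the visit-scheduling row of Table 1.
scheduling = TaskAssessment(
    task_definition="Visit scheduling",
    task_setting="Before every visit",
    task_scope="Via telephone call to patient",
    outcome_measure="Patient arrives at their appointment on time",
    typology=Typology.AUTOMATE,
    process="Conversational agent converges on a suitable time with the patient",
    failure_modes=["Communication difficulty (speech or text)"],
    risk="low",
    cost="low",
    ready_today="yes",
    ai_reasonable="yes",
    payer="Health care institution (administrative/operations tool)",
    policy_level="Institution-level policy",
    regulator="Administrative metrics + HHS/OCR for HIPAA",
)
print(scheduling.typology.value)  # -> "automate"
```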
An AI tool developed outside—or disconnected from—the health care delivery system is unlikely to survive. To succeed, AI-based tools must solve a real clinical problem, demonstrate that their solution works, and then navigate existing pathways for reimbursement and regulation.
Too often, developers begin with a technically impressive concept (“AI can do this, which would be totally cool”) without adequately considering how the product might meaningfully enter—and endure within—the clinical ecosystem.
TABLE 1: A Framework for Developing AI-Based Tools for Mental Health Care
Developing an AI-based tool for mental health care involves three steps. Step 1 is to define the status quo for a given task, including its definition, scope, setting, and existing outcome measures. Step 2 is to characterize the proposed AI-based solution across multiple dimensions: task typology (assist, augment, automate), process design, failure mode, risk profile, cost, and readiness for deployment. Step 3 is to comparatively assess the AI-based solution in the landscape of alternatives, including evaluating face validity, payment pathways, policy formation, regulatory mechanisms, and enforcement considerations. This table assumes a baseline level of HIPAA compliance, governance, and enforcement, which in the United States falls under the purview of the Department of Health and Human Services’ Office for Civil Rights.
STEP 1: Clearly Define the Clinical Task
TASK DEFINITION – “What clinical task needs to be done?” Visit scheduling
TASK SETTING – “When is this task typically performed?” Before every visit
TASK SCOPE – “How is this task typically performed?” Via telephone call to patient.
OUTCOME MEASURE – “How do we know the task was solved?” Patient arrives at their appointment on time.
TASK DEFINITION – “What clinical task needs to be done?” Medication reconciliation
TASK SETTING – “When is this task typically performed?” Previsit intake/follow-up
TASK SCOPE – “How is this task typically performed?” This task is typically performed by manually combining historical data with pharmacy data, then confirmed with the patient orally.
OUTCOME MEASURE – “How do we know the task was solved?” The EHR (electronic health record) list matches what the patient is actually taking.
TASK DEFINITION – “What clinical task needs to be done?” Medication side-effect screening
TASK SETTING – “When is this task typically performed?” Follow-up visit
TASK SCOPE – “How is this task typically performed?” Clinician asks patient about medication side effect(s).
OUTCOME MEASURE – “How do we know the task was solved?” The AI summarizes what it has learned, and the patient confirms/corrects.
TASK DEFINITION – “What clinical task needs to be done?” Test patellar reflex (or any physical exam finding)
TASK SETTING – “When is this task typically performed?” Diagnostic evaluation and follow-up
TASK SCOPE – “How is this task typically performed?” Human strikes the patellar tendon with a reflex hammer; reflex is rated on a 3/3 scale; HIGH interrater reliability in well-trained people.
OUTCOME MEASURE – “How do we know the task was solved?” Tap patellar tendon at the appropriate location and reliably rate the resulting reflex on 3/3 scale.
TASK DEFINITION – “What clinical task needs to be done?” Evaluate speech process
TASK SETTING – “When is this task typically performed?” Triage, intake, diagnostic evaluation, and follow-up
TASK SCOPE – “How is this task typically performed?” Human listens, thinks on it, jots down some thoughts/impressions. LOW interrater reliability.
OUTCOME MEASURE – “How do we know the task was solved?” Interrater reliability for capturing the content, acoustic properties, flow, and context of speech within and across sessions.
TASK DEFINITION – “What clinical task needs to be done?” Summarize clinical conversation and SOAP note generation
TASK SETTING – “When is this task typically performed?” Triage, intake, diagnostic evaluation, and follow-up
TASK SCOPE – “How is this task typically performed?” Human listens and types up clinical conversations. LOW interrater reliability and little structure.
OUTCOME MEASURE – “How do we know the task was solved?” Adequate summary of pertinent conversational points.
TASK DEFINITION – “What clinical task needs to be done?” Acute psychosis evaluation (i.e., moderate-/high-risk evaluation)
TASK SETTING – “When is this task typically performed?” Emergency room: diagnostic evaluation
TASK SCOPE – “How is this task typically performed?” Observe patient’s visible/audible behavior, think about it, and write down impressions.
OUTCOME MEASURE – “How do we know the task was solved?” Psychosis is detected and managed. NB: human beings struggle at this evaluation.
TASK DEFINITION – “What clinical task needs to be done?” Chronic psychosis evaluation (i.e., low-/moderate-risk evaluation)
TASK SETTING – “When is this task typically performed?” Outpatient visit: diagnostic evaluation or follow-up
TASK SCOPE – “How is this task typically performed?” Observe patient’s visible/audible behavior, think about it, and write down impressions.
OUTCOME MEASURE – “How do we know the task was solved?” Psychosis is detected and managed. NB: human beings struggle at this evaluation.
TASK DEFINITION – “What clinical task needs to be done?” Patient engagement and educational tools
TASK SETTING – “When is this task typically performed?” Between visits
TASK SCOPE – “How is this task typically performed?” Human beings call or message the patient (e.g., via Epic In Basket).
OUTCOME MEASURE – “How do we know the task was solved?” Patient engagement, improvement over time.
TASK DEFINITION – “What clinical task needs to be done?” Therapy follow-up/workbooks
TASK SETTING – “When is this task typically performed?” Outpatient visit: follow-up
TASK SCOPE – “How is this task typically performed?” Written manuals (can be purchased outside clinical visit), discussed with clinician (MD/PhD/LCSW/CMHC/etc.).
OUTCOME MEASURE – “How do we know the task was solved?” Completion of program leading to improvement in function.
STEP 2: Characterize the Proposed AI-based Solution
TASK DEFINITION – “What clinical task needs to be done?” Visit scheduling
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Automate
TOOL PROCESS – “How might an AI tool do the job?” Patient calls and (without having to wait in a phone tree) is asked when they can come in; AI and patient then converge on a suitable time.
FAILURE MODES – “How might this tool fail?” Communication difficulty (speech or text).
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Yes
TASK DEFINITION – “What clinical task needs to be done?” Medication reconciliation
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Augment/automate
TOOL PROCESS – “How might an AI tool do the job?” Automate process, have conversation with patient to clarify medication regimen.
FAILURE MODES – “How might this tool fail?” Communication difficulty (speech or text).
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Yes
TASK DEFINITION – “What clinical task needs to be done?” Medication side-effect screening
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Augment/automate
TOOL PROCESS – “How might an AI tool do the job?” Call/chat
FAILURE MODES – “How might this tool fail?” Communication difficulty (speech or text).
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Likely?
TASK DEFINITION – “What clinical task needs to be done?” Test patellar reflex (or any physical exam finding)
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Assist (if viable)
TOOL PROCESS – “How might an AI tool do the job?” AI software, but would require a robotic aid to “tap” the tendon and to detect and rate the reflex on a 3/3 scale.
FAILURE MODES – “How might this tool fail?” False negative/positive, e.g., hyperreflexia or lead-pipe rigidity → serotonin syndrome (potentially fatal), neuroleptic malignant syndrome (NMS), or another neurologic/metabolic abnormality.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Right now, very expensive; would require robotics + AI.
TOOL READINESS – “Can an AI-based tool do this today?” No
TASK DEFINITION – “What clinical task needs to be done?” Evaluate speech process
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Augment
TOOL PROCESS – “How might an AI tool do the job?” Recording device → speech-to-text → structured LLM-based text summary. Real-time interaction with behavioral probes across clinical visits.
FAILURE MODES – “How might this tool fail?” Speech content and intonation form only part of communication; body language and context (cultural, intersession) may be missed.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Maybe?
TASK DEFINITION – “What clinical task needs to be done?” Summarize clinical conversation and SOAP note generation
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Augment
TOOL PROCESS – “How might an AI tool do the job?” Recording device → speech-to-text → structured LLM-based text summary.
FAILURE MODES – “How might this tool fail?” Security and accuracy.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Yes
TASK DEFINITION – “What clinical task needs to be done?” Acute psychosis evaluation (i.e., moderate-/high-risk evaluation)
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Assist (if ever viable)
TOOL PROCESS – “How might an AI tool do the job?” Observe patient’s visible/audible behavior in moderate-/high-risk situations.
FAILURE MODES – “How might this tool fail?” Patient unable to engage AI-based tool given mental state. Patient violent given mental state. Patient further decompensates.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” High risk
TOOL COST – “What is the cost to build/deploy an AI tool?” Very expensive. Would need to be much more sophisticated and likely robotics-capable.
TOOL READINESS – “Can an AI-based tool do this today?” Unlikely
TASK DEFINITION – “What clinical task needs to be done?” Chronic psychosis evaluation (i.e., low-/moderate-risk evaluation)
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Augment
TOOL PROCESS – “How might an AI tool do the job?” Observe patient’s visible/audible behavior in low-/moderate-risk situation.
FAILURE MODES – “How might this tool fail?” Patient unable to engage AI-based tool.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Moderate
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Uncertain
TASK DEFINITION – “What clinical task needs to be done?” Patient engagement and educational tools
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Automate
TOOL PROCESS – “How might an AI tool do the job?” Trained LLM interacts with patient 24/7 to provide information.
FAILURE MODES – “How might this tool fail?” Patient unable to engage AI-based tool.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Yes
TASK DEFINITION – “What clinical task needs to be done?” Therapy follow-up/workbooks
TOOL TYPOLOGY – “At what level is the AI meant to perform?” Automate/augment
TOOL PROCESS – “How might an AI tool do the job?” Remotely, 24/7 access to predefined therapy modules (CBT/DBT/PRT) or conversational agent.
FAILURE MODES – “How might this tool fail?” Patient unable to engage AI-based tool.
TOOL RISK – “What level of risk is present if an AI-based tool fails?” Low (workbook content) and Moderate (patient conversation)
TOOL COST – “What is the cost to build/deploy an AI tool?” Low
TOOL READINESS – “Can an AI-based tool do this today?” Yes
STEP 3: Compare Solutions, Inform Policy and Regulation
TASK DEFINITION – “What clinical task needs to be done?” Visit scheduling
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Yes
PAYMENT – “Who pays?” Health care institution as administrative/operations tool
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Admin team collects stats on shows/no-shows and evaluates performance based on patient feedback + HHS/OCR for HIPAA.
REGULATION MECHANISM – “What happens if the AI tool fails?” Tool modified/updated based on feedback
TASK DEFINITION – “What clinical task needs to be done?” Medication reconciliation
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Yes
PAYMENT – “Who pays?” Health care institution as administrative/operations tool
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Patient/clinician confirm
REGULATION MECHANISM – “What happens if the AI tool fails?” Tool modified/updated until no errors
TASK DEFINITION – “What clinical task needs to be done?” Medication side-effect screening
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Maybe?
PAYMENT – “Who pays?” Health care institution as administrative/operations tool
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Patient/clinician confirm
REGULATION MECHANISM – “What happens if the AI tool fails?” Tool modified/updated until no errors
TASK DEFINITION – “What clinical task needs to be done?” Test patellar reflex (or any physical exam finding)
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” No
PAYMENT – “Who pays?” Payer, as diagnostic tool
POLICY FORMATION – “Where should policy be formed?” Federal-level policy
POLICY REGULATION – “How is the AI tool regulated?” FDA for clinical instrument
REGULATION MECHANISM – “What happens if the AI tool fails?” FDA approval/clearance/denial
TASK DEFINITION – “What clinical task needs to be done?” Evaluate speech process
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Maybe?
PAYMENT – “Who pays?” Health care institution as administrative/operations tool
POLICY FORMATION – “Where should policy be formed?” Federal-level policy
POLICY REGULATION – “How is the AI tool regulated?” FDA for clinical instrument
REGULATION MECHANISM – “What happens if the AI tool fails?” FDA approval/clearance/denial
TASK DEFINITION – “What clinical task needs to be done?” Summarize clinical conversation and SOAP note generation
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Yes
PAYMENT – “Who pays?” Health care institution as administrative/operations tool
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Patient/clinician confirm
REGULATION MECHANISM – “What happens if the AI tool fails?” Tool modified/updated until no errors
TASK DEFINITION – “What clinical task needs to be done?” Acute psychosis evaluation (i.e., moderate-/high-risk evaluation)
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Unlikely
PAYMENT – “Who pays?” Payer, as diagnostic device
POLICY FORMATION – “Where should policy be formed?” Federal-level policy
POLICY REGULATION – “How is the AI tool regulated?” FDA for clinical instrument?
REGULATION MECHANISM – “What happens if the AI tool fails?” FDA approval/clearance/denial
TASK DEFINITION – “What clinical task needs to be done?” Chronic psychosis evaluation (i.e., low-/moderate-risk evaluation)
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Uncertain
PAYMENT – “Who pays?” Payer, as diagnostic device
POLICY FORMATION – “Where should policy be formed?” Federal-level policy
POLICY REGULATION – “How is the AI tool regulated?” FDA for clinical instrument?
REGULATION MECHANISM – “What happens if the AI tool fails?” FDA approval/clearance/denial
TASK DEFINITION – “What clinical task needs to be done?” Patient engagement and educational tools
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Yes
PAYMENT – “Who pays?” Payer, as remote therapeutic monitoring (RTM)/education
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Engagement measures (institution)
REGULATION MECHANISM – “What happens if the AI tool fails?” Domain/Dx-specific training
TASK DEFINITION – “What clinical task needs to be done?” Therapy follow-up/workbooks
HUMAN OR AI TOOL – “Does an AI-based tool seem reasonable?” Yes
PAYMENT – “Who pays?” Payer, as 99213?/90833?/etc.
POLICY FORMATION – “Where should policy be formed?” Institution-level policy
POLICY REGULATION – “How is the AI tool regulated?” Engagement measures (institution)
REGULATION MECHANISM – “What happens if the AI tool fails?” Domain/Dx-specific training