DIME PROJECT

Implementing AI in Healthcare
AI tool evaluation essentials
Raising the bar for what “good” looks like.
Before you start sourcing and evaluating tools, ask yourself:
- Do we have clean, labeled, representative data to evaluate, validate, and pilot new tools? Read more in our Maturity Model.
- What does “impact” mean in our org? What success are we optimizing for? Our Implementation Survey and Problems that Matter Exercise can help.
Health AI Evaluation Guide
A structured approach to evaluation: Your decision filter
The best health systems aren’t just checking boxes; they’re designing iterative AI evaluation processes that prioritize clinical reality, equity, and stakeholder trust from the start and adapt over time.
Learn from Playbook partners what it takes to move beyond technical vetting and build truly responsible, resilient AI programs.
“To illuminate the blind spots, we have to triangulate stakeholders’ views. We have to identify all of the potentially impacted stakeholders and then identify what they view [as] the tools’ uses, the various values of the tools, how it fits their workflow, and where those collisions occur [to] highlight the important problems.”
-Danton Char, MD, MAS, Stanford University
Unmasking AI’s ethical fault lines: Stanford Medicine’s battle with value collisions in mortality prediction tools
Beyond accuracy: Stanford’s frontier trial of autonomous AI confronts stark realities of equity, bias, and care bottlenecks
Susannah Rose, PhD highlights the role of model cards in building transparency and trust, and explains why equipping clinicians with clear, accessible information on AI tools is critical for safe and effective adoption.
Lessons learned from the frontlines: Recurring system-level issues that need to be addressed prior to AI deployment in healthcare
The most effective AI evaluations happen in the real world, use relevant data, and are driven by clinical end users. In this short video, Dr. Marla Sammer shares how her team validated AI tools on pediatric datasets, underscoring why context and population matter for trustworthy performance.
See the Digital Medicine Society’s resource, The Playbook: Pediatric Digital Medicine, on developing and disseminating digital health technologies designed for children, accounting for developmental, safety, and family-centered care considerations.
Key learnings
Process and principles
Ethical AI evaluation needs a structured process, not just principles. Stanford Health Care’s FURM (fair, useful, reliable models) assessment evaluates fairness, utility, and reliability proactively.
Triangulate stakeholder perspectives
Effective evaluation includes understanding conflicting stakeholder values, especially those of patients. Efforts are underway to incorporate patient voices despite AI literacy challenges.
Patient outcomes drive ethical risk assessment
Ethical evaluations prioritize real-world impact, addressing issues like distress from predictions, unequal care, errors, over-reliance, and disproportionate harm to vulnerable groups.
Transparency and patient autonomy are critical
Hospitals must inform patients about AI usage, supporting patient agency and informed consent through general and specific disclosures.
Systemic barriers undermine patient benefit
Structural issues like biased data, care bottlenecks, and clinician overload can negate AI benefits. Ethical AI requires addressing these inequities.
Narrowing your shortlist from contenders to finalists
You’ve identified a shortlist of promising AI solutions. Now it’s time to pressure-test them on real-world usability, workflow fit, and frontline trust. At this stage, your goal is to move beyond marketing claims and into performance in context.
Use these tactics to make confident, system-ready decisions:
Tap your network
Ask peer systems and trusted experts for candid feedback. Has this tool delivered value elsewhere? What lessons did they learn in implementation?
Run usability testing
Bring in actual end-users—clinicians, nurses, staff—to interact with the tool. Observe how it fits into their workflow. Let them surface friction points and pick what works.
Pilot if needed
For high-risk or high-impact tools, a limited-scope pilot with a small user group can validate assumptions and expose risks early—before full deployment.
Additional considerations for your AI evaluation
1. Move beyond accuracy: Evaluate the whole system
Why it matters: A technically sound model that fails in workflow, safety, or clinician trust won’t scale.
Best practice: Use frameworks like AI-IMPACTS or the AI Healthcare Integration Framework (AI-HIF) to assess not just algorithmic performance, but how AI integrates with workflows, governance, human decision-making, and long-term risk across the full range of care environments.
2. Prioritize real-world usage & continuous monitoring
Why it matters: Models that perform well in sandbox environments often falter in live clinical settings.
Best practice: Prioritize vendors who’ve demonstrated success with real-world datasets, and insist on a monitoring strategy to detect model drift, safety degradation, and equity issues over time.
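One way to make that monitoring expectation concrete and testable is to track how the model’s score distribution shifts between a reference window and live production data. The sketch below illustrates one widely used signal, the population stability index (PSI); the helper name, bin count, alert threshold, and synthetic data are assumptions for illustration, not requirements from any framework cited here.

```python
# Illustrative drift check: population stability index (PSI) between a
# reference window (e.g., validation data) and a live production window.
# All names, thresholds, and data here are assumptions for the sketch.
import numpy as np

def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI over quantile bins of the reference score distribution.
    Common rules of thumb: <0.1 stable, 0.1-0.25 watch, >0.25 drift."""
    # Bin edges come from the reference scores so both windows share a grid.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range live scores
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    ref_pct = np.clip(ref_pct, eps, None)   # avoid log(0) on empty bins
    live_pct = np.clip(live_pct, eps, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.beta(2, 5, size=5000)   # scores at validation time
    live = rng.beta(2.6, 4.2, size=5000)    # shifted production scores
    psi = population_stability_index(reference, live)
    print(f"PSI = {psi:.3f}" + ("  -> investigate drift" if psi > 0.25 else ""))
```

A signal like this is cheap to compute on a schedule; the harder, system-specific work is agreeing with the vendor on who reviews alerts and what triggers retraining or rollback.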
3. Align on meaningful outcomes
Why it matters: AI success isn’t just about technical performance—it should align with meaningful goals, whether improving patient care, operational efficiency, or clinical workflows.
Best practice: Select metrics that reflect your objectives. When relevant, incorporate patient-centered measures (e.g., time to diagnosis, communication clarity, care coordination) and involve patients or care partners in demos, tool selection, and pilot feedback.
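If time to diagnosis is one of your chosen measures, operationalizing it can be as simple as comparing its distribution before and after go-live. The sketch below is a minimal illustration; the column names and the four synthetic encounters are assumptions standing in for your own encounter data.

```python
# Illustrative outcome metric: time to diagnosis before vs. after an AI
# tool goes live. Column names and synthetic rows are assumptions.
import pandas as pd

df = pd.DataFrame({
    "arrival": pd.to_datetime(["2024-01-02 08:00", "2024-01-03 09:30",
                               "2024-06-10 07:45", "2024-06-11 10:15"]),
    "diagnosis": pd.to_datetime(["2024-01-02 14:00", "2024-01-03 17:30",
                                 "2024-06-10 10:45", "2024-06-11 13:15"]),
    "period": ["pre", "pre", "post", "post"],  # before vs. after go-live
})
df["hours_to_dx"] = (df["diagnosis"] - df["arrival"]).dt.total_seconds() / 3600
print(df.groupby("period")["hours_to_dx"].median())
```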
4. Bake equity, transparency & trust into the process
Why it matters: AI systems must be designed to avoid biases and ensure fair, transparent decisions. Without diverse data and clear disclosures, even well-meaning tools can perpetuate harm.
Best practice: Require transparency around training data, bias audits, and model limitations. Use third-party validation where possible, and include equity impact reviews as part of selection.
5. Ensure safety and representation
Why it matters: AI models can perpetuate or even amplify health inequities if not rigorously evaluated for diverse patient groups.
Best practice: Integrate questions about context fit and subgroup validation across all evaluation stages, from initial sourcing to contract negotiations and ongoing validation.
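Bias audits (item 4) and subgroup validation (item 5) share a common core: compute the same performance metric per patient group and inspect the gaps. The sketch below illustrates that pattern with per-group AUC; the column names, the `min_n` floor for small groups, and the synthetic data are assumptions for illustration.

```python
# Illustrative subgroup validation: compare discrimination (AUC) across
# patient groups before sign-off. Column names, thresholds, and data are
# assumptions for this sketch, not part of any cited framework.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc_report(df, group_col, label_col="outcome",
                        score_col="ai_score", min_n=100):
    rows = []
    for group, sub in df.groupby(group_col):
        if len(sub) < min_n:   # flag small groups rather than trusting them
            rows.append({"group": group, "n": len(sub), "auc": np.nan})
            continue
        rows.append({"group": group, "n": len(sub),
                     "auc": roc_auc_score(sub[label_col], sub[score_col])})
    report = pd.DataFrame(rows)
    report["gap_vs_best"] = report["auc"].max() - report["auc"]
    return report.sort_values("auc", ascending=False)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 6000
    df = pd.DataFrame({
        "site": rng.choice(["adult", "pediatric", "rural"], size=n),
        "outcome": rng.integers(0, 2, size=n),
    })
    # Synthetic scores made noisier for one subgroup, to surface a gap.
    noise = np.where(df["site"] == "rural", 0.45, 0.25)
    df["ai_score"] = np.clip(df["outcome"] + rng.normal(0, noise), 0, 1)
    print(subgroup_auc_report(df, "site"))
```

The same report, rerun on monitoring data, doubles as the ongoing-validation artifact your contract can require.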
6. AI and the health of our planet
Why it matters: The carbon footprint of training and running large AI models, especially in high-throughput hospital settings, is substantial. Every new deployment should also account for environmental stewardship.
Best practice: Include environmental impact as a consideration in AI procurement and deployment.
- Ask vendors about compute intensity, data center energy sources, and whether carbon offset strategies are in place.
- Track energy use and emissions associated with large models (a first-order estimate is sketched after this list).
- Integrate sustainability training into onboarding for clinical and technical teams using AI.
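As a starting point for that tracking, a first-order estimate needs only device power draw, runtime, data center overhead (PUE), and the local grid’s carbon intensity. Every figure in the sketch below is a placeholder assumption to be replaced with measured values from your vendor and facility.

```python
# First-order emissions estimate for an always-on AI inference service.
# Every number below is a placeholder assumption; substitute measured
# values from your vendor, data center, and regional grid.
GPU_POWER_KW = 0.3          # average draw per accelerator (kW), assumed
NUM_GPUS = 4                # accelerators serving the model, assumed
HOURS_PER_YEAR = 24 * 365   # always-on inference service
PUE = 1.4                   # data center power usage effectiveness, assumed
GRID_KG_CO2E_PER_KWH = 0.4  # regional grid carbon intensity, assumed

energy_kwh = GPU_POWER_KW * NUM_GPUS * HOURS_PER_YEAR * PUE
emissions_t = energy_kwh * GRID_KG_CO2E_PER_KWH / 1000  # tonnes CO2e

print(f"Estimated energy: {energy_kwh:,.0f} kWh/year")
print(f"Estimated emissions: {emissions_t:,.1f} t CO2e/year")
```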
Next Steps
Now comes the critical question: Will this AI truly work here, for your patients, within your workflows, and on your real-world data? Neither vendor demos nor glowing testimonials can guarantee it. What works elsewhere may face unforeseen challenges in your environment.
To truly know the viability and effectiveness of a chosen solution, a crucial step is unavoidable: you have to deploy it. Move to Implement and Scale AI Across Your System to guide the last mile of your AI implementation. → Planning your deployment