Regulatory

8 Min.

FLARE Deep Dive (2/6): Why AI Validation Starts with Intended Use & Overall Risk

Tibor Zechmeister

Jul 15, 2025

Gauge graphic showing how intended use determines software risk level (low, medium, high) in the FLARE framework for AI validation in MedTech.

At Flinn, we like to say that AI validation in MedTech is like your first cup of coffee: absolutely essential. Miss it, and everything that follows unravels, because nothing runs as it should without a solid start.

But where do you start? Not with the algorithm. Not with model performance. And certainly not with regulatory checklists.

The first and arguably most critical step is this: clearly defining your software’s intended use and assessing its overall risk. This isn’t just a formality; it’s the cornerstone upon which every other part of the validation process depends.

Visual overview of Step 1 in the FLARE framework for AI validation, highlighting how intended use and software risk guide the next validation steps.

So, let’s dive into the foundation of our FLARE validation framework!

Why This Must Be Priority #1

Today, it's not enough to say a tool "uses AI". In MedTech, context is everything. AI that supports a decision is not the same as AI that makes one. And when patient safety is on the line, that distinction is critical.

Most issues we see in AI validation can be traced back to this first step: either because it was underestimated or simply undocumented.

Understanding where, how, and by whom an AI-based system is used is the only way to responsibly evaluate the risks it introduces.

Key Takeaways: What Intended Use + Overall Risk Really Mean in Practice

Turning insight into action: what does it actually look like to define the intended use and assess the overall risk of an AI tool? Let’s break it down:

1. Start With the Intended Use

Just like with any medical device, the intended use must be clearly described:

  • Who will use the software? Is it designed for Regulatory Affairs? Quality Management? Maybe even production teams?

  • What exactly does the tool do? Does it generate draft content for technical documentation? Flag duplicates in a dataset? Recommend actions in Post-Market Surveillance?

  • Which regulatory processes does it impact? Some tools touch a single workflow. Others may affect processes across departments, from regulatory to HR. The intended use needs to reflect that.

Depending on the complexity, this might be documented in a few lines or require several pages. Either way, it should be crystal clear!
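
To keep that documentation consistent across tools, it can help to capture the answers in a structured record. Below is a minimal sketch in Python; the field names (intended_users, affected_processes, and so on) are illustrative choices, not terms prescribed by any standard.

```python
from dataclasses import dataclass, field

@dataclass
class IntendedUse:
    """Illustrative structure for documenting an AI tool's intended use."""
    tool_name: str
    intended_users: list[str]       # who will use the software (e.g., RA, QM, production)
    functionality: str              # what exactly the tool does
    affected_processes: list[str]   # which regulatory processes it impacts
    exclusions: list[str] = field(default_factory=list)  # what the tool is explicitly not meant to do

# Hypothetical example: the duplicate-flagging tool discussed further below
duplicate_flagger = IntendedUse(
    tool_name="Duplicate Entry Flagger",
    intended_users=["Quality Management"],
    functionality="Flags potential duplicate entries in internal spreadsheets; the user decides what to delete.",
    affected_processes=["Document control"],
    exclusions=["No autonomous deletion", "No patient-facing output"],
)
```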

2. Then Assess the Overall Risk

Once the intended use is documented, it’s time to ask: How risky is the software in this context?

We suggest a pragmatic, context-driven classification:

  • Low-risk example: A tool that flags duplicate entries in a spreadsheet. Worst case? Some extra cleanup work.

  • High-risk example: An AI-powered PMS tool that autonomously identifies death-related incidents from vigilance reports. Inaccuracies here could have real-world consequences for patient safety.

What’s important: The risk class is not dictated externally. You, as the user and organization, define it based on your use case, product, and regulatory context.

How to Assess the Overall Risk

To move from gut feeling to a defensible classification, we recommend using a simple 0–10 risk rubric. It breaks risk down into four core dimensions:

  • Worst-case patient impact (score 0–4)
    0 = no patient impact / inconvenience only
    1 = workflow friction; no care impact
    2 = care delay or minor, reversible harm
    3 = serious, non-permanent injury
    4 = death or permanent serious harm

  • Degree of automation & external effect (score 0–3)
    0 = advisory only (no actions)
    1 = triage/prioritization; user decides actions
    2 = action proposed; explicit human approval required to execute
    3 = fully automated action, especially if it triggers external parties/processes (e.g., regulators, clinicians, patients)

  • Detectability & reversibility (score 0–2)
    0 = errors are obvious and easily reversible
    1 = detectable but costly/time-consuming to unwind
    2 = hard to detect or irreversible once executed

  • Scale & blast radius (score 0–1)
    0 = affects a single user / local artifact
    1 = fleet-wide, multi-product, or feeds regulated processes (e.g., PMS, vigilance, submissions)

Score each dimension individually, then add up the total:

  • 0–3 → Low Risk

  • 4–6 → Medium Risk

  • 7–10 → High Risk
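
If you want to make the arithmetic explicit, the rubric can also be expressed in a few lines of code. The sketch below is illustrative, assuming the dimension ranges and thresholds from the table above; the function and variable names are our own.

```python
# Minimal sketch of the 0-10 risk rubric: four dimension scores are summed
# and the total is mapped to Low / Medium / High.
DIMENSION_RANGES = {
    "patient_impact": (0, 4),   # worst-case patient impact
    "automation": (0, 3),       # degree of automation & external effect
    "detectability": (0, 2),    # detectability & reversibility
    "scale": (0, 1),            # scale & blast radius
}

def overall_risk(patient_impact: int, automation: int, detectability: int, scale: int) -> tuple:
    """Validate each dimension score, sum them, and return (total, risk class)."""
    scores = {
        "patient_impact": patient_impact,
        "automation": automation,
        "detectability": detectability,
        "scale": scale,
    }
    for name, value in scores.items():
        low, high = DIMENSION_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")

    total = sum(scores.values())
    if total <= 3:
        return total, "Low"
    if total <= 6:
        return total, "Medium"
    return total, "High"
```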

What Is (and Isn’t) High Risk in Practice?

Let’s bring the scoring rubric to life with a few examples:

  • Low Risk (clear case)
    “AI flags duplicate entries in an internal spreadsheet; user decides what to delete.”
    Scoring:
    Impact 0, Automation 0–1, Detectability 0, Scale 0
    Risk score: 0–1 (Low)


  • Medium Risk (illustrative)
    “AI flags potential death-related incidents in PMS for human review.”
    Scoring:
    Impact 4 (if missed), Automation 1 (triage), Detectability 1 (review reduces risk), Scale 0–1
    Risk score: 6–7 (borderline Medium/High)
    → In most organizations with proper human review, this would likely be treated as Medium risk.


  • High Risk (clear case)
    “AI auto-categorizes a complaint as a vigilance case and auto-submits an authority report.”
    Scoring:
    Impact 3–4, Automation 3, Detectability 2, Scale 1
    Risk score: 9–10 (High)

Rule of thumb:
If the AI can trigger a major process or external consequence without a mandatory human gate, and errors could plausibly harm patients or the business, assume High Risk.
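
Running the three examples above through the rubric sketch reproduces these scores; where the text gives a range, the upper bound is used.

```python
# Upper-bound scoring of the three examples (illustrative).
print(overall_risk(patient_impact=0, automation=1, detectability=0, scale=0))
# (1, 'Low')   duplicate flagging in an internal spreadsheet

print(overall_risk(patient_impact=4, automation=1, detectability=1, scale=1))
# (7, 'High')  PMS death-incident triage at the upper bound; with Scale 0 the
#              total is 6 (Medium), which is why the text calls it borderline

print(overall_risk(patient_impact=4, automation=3, detectability=2, scale=1))
# (10, 'High') auto-categorized and auto-submitted vigilance report
```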

Adjusting the Risk Score: How Controls Change the Outcome

Controls that lower the score and which dimension they influence:

  • Mandatory human approval before any external action → ↓ Automation

  • Shadow mode / dry-run period → ↓ Detectability risk; errors surfaced early

  • Confidence thresholds & abstention → ↓ Automation; AI defers when uncertain

  • Sampling, four-eyes review, or QA gates → ↓ Detectability risk

  • Rate limits / blast-radius guards → ↓ Scale

  • Clear UI disclosures & decision-support context → ↓ Automation; keeps it advisory

  • Rollback plans & audit logs → ↓ Reversibility risk

  • Change control for models & prompts → ↓ Detectability/Reversibility risk

Tip: Document which specific controls you apply and explicitly adjust the rubric score. This makes the “why” behind Low/Med/High defensible.
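
One way to make that documentation explicit is to treat each control as a deliberate reduction of a specific dimension score. The sketch below continues the rubric example from above; the control names and the size of each reduction are assumptions you would replace with your own justified values.

```python
# Illustrative mapping: control -> (dimension it reduces, score delta).
# The deltas are assumptions, not fixed rules; document your own rationale.
CONTROL_ADJUSTMENTS = {
    "mandatory_human_approval": ("automation", -1),
    "shadow_mode": ("detectability", -1),
    "confidence_thresholds": ("automation", -1),
    "four_eyes_review": ("detectability", -1),
    "rate_limits": ("scale", -1),
    "rollback_and_audit_logs": ("detectability", -1),
}

def apply_controls(scores: dict, controls: list) -> dict:
    """Lower the affected dimension score for each documented control, never below zero."""
    adjusted = dict(scores)
    for control in controls:
        dimension, delta = CONTROL_ADJUSTMENTS[control]
        adjusted[dimension] = max(0, adjusted[dimension] + delta)
    return adjusted

# Example: a PMS classifier gated by mandatory human approval and rate limits.
baseline = {"patient_impact": 4, "automation": 2, "detectability": 1, "scale": 1}  # total 8 -> High
adjusted = apply_controls(baseline, ["mandatory_human_approval", "rate_limits"])
print(adjusted, "-> total", sum(adjusted.values()))
# {'patient_impact': 4, 'automation': 1, 'detectability': 1, 'scale': 0} -> total 6 (Medium)
```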

Same Feature, Different Risk: Why Context Matters

Even the same AI model can fall into different risk categories depending on who uses it and how it’s gated.

  • Case A (Low–Medium): Mid-size manufacturer; the PMS classifier suggests categories, with mandatory reviewer sign-off, a two-month shadow phase, and tight rate limits.
    Impact 3–4, Automation 2→1 (due to human gate), Detectability 1, Scale 1
    Score: 5–6 → Medium

  • Case B (Medium): Large enterprise; same model, but triage only, quarterly bias audits, and a CAPA gate before any external action.
    Impact 3–4, Automation 1, Detectability 1, Scale 1
    Score: 6–7 → upper Medium, borderline High

  • Case C (High): Small team auto-routes complaints as vigilance cases; no human approval during peak load; auto-emails the Notified Body.
    Impact 3–4, Automation 3, Detectability 2, Scale 1
    Score: 9–10 → High

Takeaway: The identical model can span Medium → High based on governance, gating, and deployment scope.
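
Using the two sketches above, the same baseline model can be re-scored under different governance setups. The numbers here are illustrative upper-bound assumptions, not a re-calculation of the cases in the table.

```python
# Same model, different gating: an ungated deployment vs. a heavily controlled one.
baseline = {"patient_impact": 4, "automation": 3, "detectability": 2, "scale": 1}

gated = apply_controls(
    baseline,
    ["mandatory_human_approval", "confidence_thresholds", "shadow_mode", "rate_limits"],
)

print(overall_risk(**baseline))  # (10, 'High')   auto-routing with no human gate
print(overall_risk(**gated))     # (6, 'Medium')  identical model, defensible Medium with controls
```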

Let’s Wrap This Up: What Should You Do Next?

After reading this, ask yourself:

  1. Do we have a documented intended use for our AI-powered tools?

  2. Have we assessed and classified the overall risk on a feature level?

If not, that’s the place to start!

And if you’re unsure how to define or structure that assessment, this is exactly the kind of foundational work FLARE was built to support. We're here to help, so let’s talk!

So…What Comes Next in the FLARE Framework?

Once you’ve defined the intended use and assessed the overall risk of your AI tool, FLARE helps you decide where to go from there:

  • If the risk is low, the next step is to assess the vendor. Can you trust the provider behind the tool?

  • If the risk is medium or high, a deeper feature-level risk analysis is needed before moving forward.
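
Expressed as a simple decision rule, again building on the rubric sketch above (the wording of the next steps is paraphrased from this series):

```python
# Illustrative routing of the next FLARE step based on the overall risk class.
def next_flare_step(risk_class: str) -> str:
    if risk_class == "Low":
        return "Assess the vendor: can you trust the provider behind the tool?"
    return "Run a deeper feature-level risk analysis before moving forward."

total, risk_class = overall_risk(patient_impact=0, automation=1, detectability=0, scale=0)
print(next_flare_step(risk_class))  # low-risk tool -> vendor assessment
```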

In our next article, we’ll explore what happens in the low-risk path:
How to evaluate your vendor’s AI maturity, transparency, and quality practices.



© 2025, 1BillionLives GmbH, All Rights Reserved
