FLARE Series (5/6): Required Sample Size – From Probability Theory to Practical Validation

Tibor Zechmeister
Oct 13, 2025
In our last FLARE article, we dissected AI tools feature by feature and assessed their individual risk levels. Low-risk features were routed back to Vendor Assessment for general qualification. So far, so good.

But what about the high-risk ones?
Those bring us to the next step in the FLARE pathway: Required Sample Size – combined with Human Expert Testing, and where necessary, a Full Review with the Vendor.
From Overall Risk to Statistical Confidence: Why This Step Matters
Risk assessment tells us which features need closer scrutiny. But knowing a feature is “high risk” is only the starting point. The real question is:
How many real-world examples do we need to confidently validate that feature?
And this is where probability theory comes into play.
The Probability Primer: Enter the Urn
Think back to the classic urn model from probability theory:
Imagine an urn filled with balls of different colors
You can’t see inside; you only draw one ball at a time
Some colors are common, others are rare
Now replace the colored balls with case reports:
Green balls = common malfunctions
Purple balls = rare events such as patient deaths

In real vigilance datasets, death cases are extremely rare. To find just 10–20, you might need to sift through thousands, even tens of thousands, of reports.
In short: great for patients, tough for validation.
Because you can’t test an AI feature for death classification with only 10 samples. You might need hundreds or even thousands before a death case shows up.
By contrast, injuries are reported much more frequently. In this urn, you can be confident that even a small random sample will include several injury cases.

The takeaway?
The rarity of the outcome directly determines how many examples are needed to test the AI feature.
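This takeaway can be made concrete with a short calculation. Given an assumed prevalence p for the event of interest (the share of purple balls in the urn), the exact binomial distribution tells you the smallest random sample that will contain at least k such events with a chosen confidence. The prevalence values below are purely illustrative, not real vigilance statistics:

```python
import math

def n_for_at_least_k(p: float, k: int, confidence: float = 0.95) -> int:
    """Smallest sample size n such that a random sample of n reports
    contains at least k events of prevalence p, with the given confidence.
    Assumes reports are independent draws (the urn model)."""
    n = k
    while True:
        # P(fewer than k events) for X ~ Binomial(n, p)
        p_fewer = sum(
            math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k)
        )
        if 1 - p_fewer >= confidence:
            return n
        n += 1

# Illustrative: if 1 report in 1,000 is a death case, seeing even a
# single one with 95% confidence takes ~3,000 reports ...
print(n_for_at_least_k(p=0.001, k=1))  # 2995
# ... while a common outcome (say, injuries at 30% of reports) shows
# up several times in a small sample:
print(n_for_at_least_k(p=0.30, k=3))   # 19
```

The gap between 19 and 2,995 reports is the whole story of this step: the rarer the ball, the deeper you have to dig into the urn.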
Two Validation Paths: When to Test Yourself, When to Involve the Vendor
Path 1: Human Expert Testing
If the required sample size is manageable, you can validate directly.
Take an AI feature that distinguishes “injuries” from “device malfunctions”:
With a modest set of 10 vigilance reports, you’ll likely get several examples of both
A human reviewer can check each case, compare it with the AI’s output, and confirm performance
This kind of hands-on testing keeps validation efficient and close to the user.
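The intuition that a small sample suffices for common outcomes can be checked with the same binomial math. As a hypothetical sketch: if, say, 40% of the reports in your dataset are injuries and the rest are malfunctions (the 40/60 split is an assumption for illustration), the chance that 10 randomly drawn reports contain at least three of each class is already above 80%:

```python
import math

def prob_both_classes(n: int, p_injury: float, min_each: int) -> float:
    """Probability that n reports contain at least `min_each` injuries
    AND at least `min_each` malfunctions, assuming each report is an
    injury with probability p_injury and a malfunction otherwise."""
    # Injuries X ~ Binomial(n, p_injury); we need
    # min_each <= X <= n - min_each so both classes are represented.
    return sum(
        math.comb(n, i) * p_injury**i * (1 - p_injury) ** (n - i)
        for i in range(min_each, n - min_each + 1)
    )

print(round(prob_both_classes(10, 0.40, 3), 3))  # 0.82
```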
Path 2: Full Review with the Vendor
When the required sample size becomes unrealistically large, direct validation is no longer practical.
Think back to our death-case example: validating that an AI can detect them might require 10,000+ reports. That’s simply not feasible for human reviewers, not in this lifetime.
Instead, this calls for a structured review with the vendor. Here’s how to make it work in practice:
Meet with the software provider
Involve their data scientists and AI engineers
Request documentation: how they tested the feature, what sample size they used, and which thresholds they applied for acceptable performance
And this is where it gets real. Two big questions now sit on the table:
What level of reliability will you accept? For something as serious as death detection, is 90% sensitivity enough? 95%? 99.999%? The answer depends on your own risk tolerance: how much error can you live with?
Can you get there on your own? Do you actually have access to enough samples to validate yourself or do you need the vendor’s evidence to bridge the gap?
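For the first question, a standard zero-failure acceptance calculation (the "rule of three" family) shows how quickly the required evidence grows with the reliability target. The sketch below assumes the simplest possible design, chosen for illustration: the feature is shown n true death cases and must classify all of them correctly, with n chosen so that passing rules out a sensitivity below the target at 95% confidence:

```python
import math

def positives_needed(target_sensitivity: float, confidence: float = 0.95) -> int:
    """Positive (e.g. death) cases needed in a zero-failure test: if the
    feature's true sensitivity were below the target, the chance of it
    classifying all n cases correctly would be under 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(target_sensitivity))

for s in (0.90, 0.95, 0.99, 0.999):
    print(f"sensitivity >= {s:>5}: {positives_needed(s):>5} death cases")
# 90% needs 29 cases, 95% needs 59, 99% needs 299, 99.9% needs 2,995 --
# and every one of those cases must first be found in the urn.
```

This is where the two questions collide: the stricter your acceptance threshold, the more rare-event samples someone has to produce, and the more likely that someone is the vendor, not you.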
Turning Risk Levels into Real Validation
Required Sample Size is the bridge between knowing the risk and proving the AI can handle it. It makes you face the uncomfortable math of rare events, decide where human testing will do the job, and admit where you’ll need vendor transparency.
What Happens Next?
With this step complete:
Manageable sample sizes → you validate directly with human expert testing
Unrealistic sample sizes → you move into a structured review with the vendor
Which brings us to the final step of FLARE: What happens when validation fails and a feature doesn’t deliver the expected quality?
Final Thoughts
Required Sample Size isn’t just about numbers; it’s about connecting risk, probability, and validation strategy.
Because in regulated AI, validation isn’t one-size-fits-all. Sometimes you can check a feature yourself, and sometimes you need to sit down with the vendor. The key is knowing when to choose which path!
Need help choosing the right validation path?
If sample size or review strategy is a blocker for you, contact us – we’re happy to help!