FLARE Series (5/6): Required Sample Size – From Probability Theory to Practical Validation

Tibor Zechmeister
Oct 13, 2025
In our last FLARE article, we dissected AI tools feature by feature and assessed their individual risk levels. Low-risk features were routed back to Vendor Assessment for general qualification. So far, so good.

But what about the high-risk ones?
Those bring us to the next step in the FLARE pathway: Required Sample Size – combined with Human Expert Testing, and where necessary, a Full Review with the Vendor.
From Overall Risk to Statistical Confidence: Why This Step Matters
Risk assessment tells us which features need closer scrutiny. But knowing a feature is “high risk” is only the starting point. The real question is:
How many real-world examples do we need to confidently validate that feature?
And this is where probability theory comes into play.
The Probability Primer: Enter the Urn
Think back to the classic urn model from probability theory:
Imagine an urn filled with balls of different colors
You can’t see inside; you only draw one ball at a time
Some colors are common, others are rare
Now replace the colored balls with case reports:
Green balls = common malfunctions
Purple balls = rare events such as patient deaths

In real vigilance datasets, death cases are extremely rare. To find just 10–20, you might need to sift through thousands, even tens of thousands, of reports.
In short: great for patients, tough for validation.
Because you can’t test an AI feature for death classification with only 10 samples. You might need hundreds or even thousands before a death case shows up.
By contrast, injuries are reported much more frequently. In this urn, you can be confident that even a small random sample will include several injury cases.

The takeaway?
The rarity of the outcome directly determines how many examples are needed to test the AI feature.
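This takeaway can be made concrete with a short calculation. Given an assumed prevalence p for the event of interest (the share of purple balls in the urn), the exact binomial distribution tells you the smallest random sample that will contain at least k such events with a chosen confidence. The prevalence values below are purely illustrative, not real vigilance statistics:

```python
import math

def n_for_at_least_k(p: float, k: int, confidence: float = 0.95) -> int:
    """Smallest sample size n such that a random sample of n reports
    contains at least k events of prevalence p, with the given confidence.
    Assumes reports are independent draws (the urn model)."""
    n = k
    while True:
        # P(fewer than k events) for X ~ Binomial(n, p)
        p_fewer = sum(
            math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k)
        )
        if 1 - p_fewer >= confidence:
            return n
        n += 1

# Illustrative: if 1 report in 1,000 is a death case, seeing even a
# single one with 95% confidence takes ~3,000 reports ...
print(n_for_at_least_k(p=0.001, k=1))  # 2995
# ... while a common outcome (say, injuries at 30% of reports) shows
# up several times in a small sample:
print(n_for_at_least_k(p=0.30, k=3))   # 19
```

The gap between 19 and 2,995 reports is the whole story of this step: the rarer the ball, the deeper you have to dig into the urn.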
Two Validation Paths: When to Test Yourself, When to Involve the Vendor
Path 1: Human Expert Testing
If the required sample size is manageable, you can validate directly.
Take an AI feature that distinguishes “injuries” from “device malfunctions”:
With a modest set of 10 vigilance reports, you’ll likely get several examples of both
A human reviewer can check each case, compare it with the AI’s output, and confirm performance
This kind of hands-on testing keeps validation efficient and close to the user.
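The intuition that a small sample suffices for common outcomes can be checked with the same binomial math. As a hypothetical sketch: if, say, 40% of the reports in your dataset are injuries and the rest are malfunctions (the 40/60 split is an assumption for illustration), the chance that 10 randomly drawn reports contain at least three of each class is already above 80%:

```python
import math

def prob_both_classes(n: int, p_injury: float, min_each: int) -> float:
    """Probability that n reports contain at least `min_each` injuries
    AND at least `min_each` malfunctions, assuming each report is an
    injury with probability p_injury and a malfunction otherwise."""
    # Injuries X ~ Binomial(n, p_injury); we need
    # min_each <= X <= n - min_each so both classes are represented.
    return sum(
        math.comb(n, i) * p_injury**i * (1 - p_injury) ** (n - i)
        for i in range(min_each, n - min_each + 1)
    )

print(round(prob_both_classes(10, 0.40, 3), 3))  # 0.82
```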
Path 2: Full Review with the Vendor
When the required sample size becomes unrealistically large, direct validation is no longer practical.
Think back to our death-case example: validating that an AI can detect them might require 10,000+ reports. That’s simply not feasible for human reviewers, not in this lifetime.
Instead, this calls for a structured review with the vendor. Here’s how to make it work in practice:
Meet with the software provider
Involve their data scientists and AI engineers
Request documentation: how they tested the feature, what sample size they used, and which thresholds they applied for acceptable performance
And this is where it gets real. Two big questions now sit on the table:
What level of reliability will you accept? For something as serious as death detection, is 90% sensitivity enough? 95%? 99.999%? The answer depends on your own risk tolerance: how much error can you live with?
Can you get there on your own? Do you actually have access to enough samples to validate yourself or do you need the vendor’s evidence to bridge the gap?
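For the first question, a standard zero-failure acceptance calculation (the "rule of three" family) shows how quickly the required evidence grows with the reliability target. The sketch below assumes the simplest possible design, chosen for illustration: the feature is shown n true death cases and must classify all of them correctly, with n chosen so that passing rules out a sensitivity below the target at 95% confidence:

```python
import math

def positives_needed(target_sensitivity: float, confidence: float = 0.95) -> int:
    """Positive (e.g. death) cases needed in a zero-failure test: if the
    feature's true sensitivity were below the target, the chance of it
    classifying all n cases correctly would be under 1 - confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(target_sensitivity))

for s in (0.90, 0.95, 0.99, 0.999):
    print(f"sensitivity >= {s:>5}: {positives_needed(s):>5} death cases")
# 90% needs 29 cases, 95% needs 59, 99% needs 299, 99.9% needs 2,995 --
# and every one of those cases must first be found in the urn.
```

This is where the two questions collide: the stricter your acceptance threshold, the more rare-event samples someone has to produce, and the more likely that someone is the vendor, not you.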
Turning Risk Levels into Real Validation
Required Sample Size is the bridge between knowing the risk and proving the AI can handle it. It makes you face the uncomfortable math of rare events, decide where human testing will do the job, and admit where you’ll need vendor transparency.
What Happens Next?
With this step complete:
Manageable sample sizes → you validate directly with human expert testing
Unrealistic sample sizes → you move into a structured review with the vendor
Which brings us to the final step of FLARE: What happens when validation fails and a feature doesn’t deliver the expected quality?
Final Thoughts
Required Sample Size isn’t just about numbers; it’s about connecting risk, probability, and validation strategy.
Because in regulated AI, validation isn’t one-size-fits-all. Sometimes you can check a feature yourself, and sometimes you need to sit down with the vendor. The key is knowing when to choose which path!
Need help choosing the right validation path?
If sample size or review strategy is a blocker for you, contact us – we’re happy to help!