How Lawnise Verifies AI Answers Against Official Sources

Someone asks a public AI assistant a plain question: how long do I have to report this, what does this card actually charge, where do I report a suspected fraud. The answer comes back clean and confident, often in a tidy table. To the person reading it, that is the institution speaking, or the regulator, or the scheme. The hard question, and the one this piece is about, is the next one: is it true?

Not "does it sound true." Not "is there an official page somewhere that says the right thing." Those are easy. The hard part is the gap between them: an official record can be immaculate and the AI's representation of it still wrong. Closing that gap — turning "an AI said X about an institution" into a finding we'd put our name to — is the whole of what we do. This is how we do it, in enough detail that you can check our work.

Start from the questions a customer actually asks

Verification has to begin somewhere honest, and the honest starting point is not a clever adversarial prompt designed to trip a model up. It's the question a real person types before they act — before they open a product, dispute a charge, compare two policies, or report a scam in progress. We hold a standard set of these high-intent questions for each scope we cover, where a scope is a defined market and sector: a country and an industry, narrow enough that the right answer is knowable and the wrong one matters.

The discipline here is restraint. We are not hunting for the model's worst moment; we are asking the questions that carry consequences when they're answered wrong. A bungled trivia answer about a bank's founding year is not what we're measuring. The reporting deadline on a claim, the rate on a card, the eligibility rule on a relief scheme, the channel for a fraud report — those are the questions where a wrong answer changes what someone does next.

Anchor each question to an official, dated fact

Every question in the set carries a reference fact, and that fact is not our opinion of the answer. It is taken from an authoritative public source — the institution's own published page, or the official scheme or regulator page that governs the rule. The rate as the bank publishes it. The deadline as the framework states it. The threshold as the scheme defines it. Each is recorded with its source and the date we read it, because a reference fact is only as good as the page it came from and the day it was current.

This anchor is the yardstick. Without it, "the AI is wrong" is just a second opinion. With it, the claim becomes checkable: here is what the official record says, here is where it says it, here is when we confirmed it. The reference fact is the part a journalist, a regulator, or the institution itself can pull up and verify independently — which is exactly the point.
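One way to picture the anchor is as a small immutable record. The sketch below is illustrative only: the `ReferenceFact` name, its fields, the example question, the 30-day figure, and the `regulator.example` URL are all assumptions made for this example, not our internal schema or a real rule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReferenceFact:
    """A dated official fact that one question is checked against (hypothetical shape)."""
    question: str
    fact: str
    source_url: str
    read_on: date  # the day the source page was read

    def citation(self) -> str:
        # Everything a third party needs to check the fact independently.
        return f"{self.fact} (source: {self.source_url}, read {self.read_on.isoformat()})"


# Invented placeholder values, not a real rule or a real page.
example = ReferenceFact(
    question="How long do I have to report an unauthorised transaction?",
    fact="Report within 30 days of the statement date",
    source_url="https://regulator.example/reporting-rules",
    read_on=date(2026, 6, 1),
)

print(example.citation())
```

Freezing the record is a deliberate choice in this sketch: a reference fact that changes should become a new dated record rather than an overwrite of the old one.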

Put the questions to the public assistants

We then put the same questions to a set of public AI assistants: the general-purpose answer engines a customer would actually reach for. We record internally which assistants we tested, but public findings never rank them or attribute an error to a named assistant. There is no league table here, no "least accurate assistant for X." That framing would be a different and less useful piece of work, and it invites a fight over methodology that distracts from the finding that matters.

The more useful question is what the assistants get wrong in common — where the published record drifts in representation no matter which engine renders it. When several systems disagree on the same factual question, at least some of them are wrong, and the person asking has no way to know which. That shared drift, not a ranking, is the signal we're after.
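As a rough illustration of looking for shared drift rather than a ranking, the sketch below takes, for each question, one boolean per assistant saying whether that answer conflicted with the record, and returns the questions where a large share of assistants conflict. The function name, the question ids, and the 0.5 threshold are assumptions for this example. Assistant identities never enter the calculation.

```python
def shared_drift(conflicts_by_question: dict[str, list[bool]],
                 min_share: float = 0.5) -> list[str]:
    """Return question ids where at least `min_share` of assistants conflict with the record."""
    drifting = []
    for question_id, conflicts in conflicts_by_question.items():
        if not conflicts:
            continue  # no answers captured for this question
        share = sum(conflicts) / len(conflicts)
        if share >= min_share:
            drifting.append(question_id)
    return drifting


# Invented sample: each list holds one anonymous conflict flag per assistant.
sample = {
    "reporting-deadline": [True, True, False],
    "card-rate": [False, False, False],
    "fraud-channel": [True, False, False],
}

print(shared_drift(sample))  # ['reporting-deadline']
```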

Compare each answer to the official record

Each answer is compared, one by one, against the published source for that question. Where an answer aligns with the record, it passes. Where it conflicts (a deadline stated more generously than the framework allows, a discontinued product described as available, a narrow track described as wide open, a published figure understated), it's flagged for review.

A flag is not yet a finding. It's a candidate: a place where the automated comparison saw daylight between what the assistant said and what the source says. Most of the work — and most of the credibility — is in what happens to that flag next.
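To make the idea of a candidate flag concrete, here is a deliberately simple sketch for one kind of question, a deadline expressed in days. It is not our comparison engine: the regular expression, the function names, and the sample answers are assumptions for illustration. The point it shows is the bias toward flagging: when nothing comparable can be extracted, the answer is flagged for a human rather than passed.

```python
import re
from typing import Optional


def extract_days(text: str) -> Optional[int]:
    """Pull the first 'N days' style figure out of an answer, if there is one."""
    match = re.search(r"(\d+)\s*(?:calendar |business |working )?days?\b", text, re.IGNORECASE)
    return int(match.group(1)) if match else None


def compare_deadline(answer: str, reference_days: int) -> str:
    """Return 'pass' or 'flag'. A flag is a candidate for review, not a verdict."""
    stated = extract_days(answer)
    if stated is None:
        return "flag"  # nothing comparable found: over-flag and let a reviewer decide
    return "pass" if stated == reference_days else "flag"


# Invented answers checked against an invented 30-day reference.
print(compare_deadline("You must report within 30 days.", 30))    # pass
print(compare_deadline("You have 60 days to report it.", 30))     # flag
print(compare_deadline("Report it as soon as you can.", 30))      # flag
```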

The bar a flag has to clear to become a finding

This is the heart of the method, and it's deliberately strict, because the cost of a verification brand calling a correct answer wrong is higher than the cost of missing a wrong one. A flag becomes a finding only when it clears every one of the following.

It has to be current. The contradiction must hold against the official source as it stands today, not as it stood before a rule moved. A flag raised against a fact that has since changed is dropped, not published.

It has to be materially misleading to a customer. The gap must be the kind a person acts on and is worse off for — a price, a deadline, an eligibility rule, a safety step. A trivia slip or a harmless imprecision doesn't qualify. We hold "material" to the same line throughout: could this reasonably mislead someone's understanding of a product, policy, process, fee, eligibility, coverage, or obligation.

It has to be grounded in the AI's full response, not a snippet. The error must be locatable in what the assistant actually said, read in full — not inferred from a truncated excerpt or from the flag itself. An answer can look wrong in a fragment and be right in context, and the reverse; only the full response settles it.

It has to survive re-verification against the live official source. We go back to the source page and confirm it still says what we recorded, still exists, still governs the rule. If the live page no longer confirms our reference fact — or if the assistant's answer actually matches the live page — the flag is dropped. We would rather lose a finding than keep a wrong one.

And it has to be anonymisable — describable in a way that carries the finding without naming the institution.

Most flags don't clear this bar. The automated pass is built to over-flag: it's a triage signal that surfaces candidates for review, not a verdict. Treating it as a verdict would be the easy mistake, and it's the one that quietly destroys a verification brand's credibility. Human verification is where the over-flagging gets corrected, and it's why the findings we feature are far fewer than the flags raised. The detailed scoring, sampling, and gate definitions live in our research methodology.
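The gate above is an all-or-nothing conjunction, which can be sketched in a few lines. The `GateChecks` class and its field names are hypothetical labels for the five conditions described here. Each value would come from a human reviewer's judgment, not from code.

```python
from dataclasses import dataclass, fields


@dataclass
class GateChecks:
    """One reviewer verdict per condition (hypothetical field names)."""
    current: bool
    materially_misleading: bool
    grounded_in_full_response: bool
    reverified_live: bool
    anonymisable: bool


def failed_gates(checks: GateChecks) -> list[str]:
    """List the conditions a flag did not clear, in gate order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


def is_finding(checks: GateChecks) -> bool:
    """A flag becomes a finding only when every condition holds."""
    return not failed_gates(checks)


flag = GateChecks(
    current=True,
    materially_misleading=True,
    grounded_in_full_response=True,
    reverified_live=False,  # the live page no longer confirmed the reference fact
    anonymisable=True,
)

print(is_finding(flag))    # False
print(failed_gates(flag))  # ['reverified_live']
```

Returning the failed conditions rather than a bare boolean keeps a record of why a flag was dropped, which is useful when the same question is revisited later.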

Figure: How a flag becomes a finding. A funnel narrows from many candidate flags to few verified findings: real customer questions, an official dated source, public AI assistants, comparison to the record (many flags), then a five-part gate (current, materially misleading, grounded in the full response, re-verified live, anonymisable), leaving a single verified finding.

Why a correct source isn't the same as a correct answer

The most common misreading of this work is that we're checking whether the official record is right. We're not. The record is usually right — that's rather the point. What we're checking is whether the AI's representation of that record is right, and those two things come apart more often than they should.

An assistant can have the correct source in front of it and still describe a general rule as if it answered a specific, current question. It can take a framework about reporting procedure and attach a penalty the framework doesn't impose. It can take a figure that's broadly in the right direction and soften the magnitude past what the source states. None of these requires the official page to be wrong. They require only that the model generalise where it should have been specific, or reach for currency it doesn't have. That gap — general versus specific, settled versus current — is the failure mode the accuracy check alone is built to catch, and it's the subject of our companion piece on contextual accuracy.

Facts move, so we check against the live source

A reference fact is a snapshot, and snapshots age. A rate changes, a scheme adjusts a threshold, a product is retired. So every featured finding is checked against the live source at the time of writing, and re-checked if the source moves before publication. A finding that was true against last month's page and stale against this month's is not a finding we'll publish — it's a fact that changed, which is a different thing, and saying otherwise would be its own kind of inaccuracy.
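The live re-check can be pictured as a small decision function. This is a sketch under stated assumptions: the function names and return strings are invented, the comparison is a naive normalised string match, and fetching the live page is left to the caller. In practice a person reads the page.

```python
from typing import Optional


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def recheck(recorded_fact: str, live_fact: Optional[str], assistant_claim: str) -> str:
    """Decide whether a flag survives a re-read of the live source."""
    if live_fact is None:
        return "drop: source no longer available"
    if _norm(live_fact) != _norm(recorded_fact):
        return "drop: reference fact has changed"
    if _norm(assistant_claim) == _norm(live_fact):
        return "drop: answer matches live source"
    return "keep"


# Invented values for illustration only.
print(recheck("30 days", "30 days", "60 days"))  # keep
print(recheck("30 days", "45 days", "60 days"))  # drop: reference fact has changed
```

Every uncertain branch resolves to a drop, which mirrors the stated preference for losing a finding over keeping a wrong one.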

This is also why each of our barometers is a single-month, directional reading rather than a permanent verdict. The state of what public AI gets wrong about a scope is a moving picture, and we date it accordingly. What's stable is the shape of the errors; the specific instances are tied to the day we caught them.

Honest about the tools

The automated comparison that raises flags is useful and it is not ground truth. We treat it as directional and engine-assessed — a way to point human attention at the places worth looking, not a measurement we'd publish unchecked. That's precisely why featured findings are hand-verified against the live source rather than waved through on the strength of a score, and why we publish our methodology openly instead of asking anyone to take the engine's word for it. The figures that describe how the automated pass performs belong in the methodology, where they can carry their proper denominators and caveats, not in a sentence here stripped of context.

A verification company that overstated the reliability of its own tooling would be making, about itself, the same move it exists to catch. So we don't. The honest claim is narrow and it holds: the automated pass finds candidates, human verification confirms findings, and we show which is which.

Anonymised, with a right to reply

Institutions are unnamed in everything we publish — by design, not by accident. The finding is the AI's error against a published fact; the institution's identity isn't load-bearing to that, and naming it would turn a research finding into something it isn't. Schemes, regulators, and standards bodies are named, because those are public rules being publicly misrepresented and the source matters.

Behind the anonymisation, the evidence is retained. The full captured response, the source URL, the dated reference fact, and a result identifier sit in our internal record for every featured finding — the audit trail that lets us stand behind a claim, and the basis for a right of reply if an institution wants to engage. Names and URLs are withheld in public by research design; the evidence and identifiers are kept internally for audit and reply. You can read the boundaries of what we publish and how a party engages on our limitations and right-to-reply page.
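The split between what is retained and what is published can be sketched as two shapes of the same record. The `EvidenceRecord` fields follow the items listed above, but the class, the `public_view` helper, and every sample value are assumptions for illustration, not our storage format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvidenceRecord:
    """Internal audit record for one featured finding (hypothetical shape)."""
    result_id: str
    institution: str
    source_url: str
    reference_fact: str
    read_on: date
    full_response: str


def public_view(record: EvidenceRecord, anonymised_summary: str) -> dict[str, str]:
    """Build the published form: no name, no URL, no identifier, only summary and date."""
    return {
        "summary": anonymised_summary,
        "checked_on": record.read_on.isoformat(),
    }


# Invented placeholder values.
record = EvidenceRecord(
    result_id="r-0001",
    institution="Example Bank",
    source_url="https://bank.example/fees",
    reference_fact="Annual fee is 50",
    read_on=date(2026, 6, 1),
    full_response="(full captured assistant response)",
)

view = public_view(record, "An assistant understated a published annual fee.")
print(view)
```

Because the anonymised summary is passed in separately rather than derived from the record, writing it stays a deliberate editorial step in this sketch.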

Where verification fits the wider frameworks

It helps to place this against the governance vocabulary institutions already use. The NIST AI Risk Management Framework organises AI governance into four functions — govern, map, measure, and manage. Most of an institution's existing controls live in govern and manage: policies, ownership, oversight, response. The function that's hardest to satisfy for public AI is measure — building the evidence about how a system actually behaves out in the world.

External verification is a measure-layer activity. The public-AI surface (what assistants tell people about an institution, in private sessions nobody at the institution can see) is exactly the surface that internal governance tooling doesn't reach, because there's no system to instrument and no log to read. Verification is how you measure it: a repeatable way to observe what's being said, test it against the official record, and produce dated evidence a governance team can review, escalate, or cite when asked. It doesn't replace the govern and manage work; it gives that work something measured to act on. Treated this way, "we don't monitor what public AI says about us" stops being a posture and becomes a measurement gap with a method to close it. The governance case for why that gap matters is its own piece: why AI answer accuracy is becoming a governance issue.

That's the method, end to end: start from the questions that carry consequences, anchor each to a dated official fact, ask the public assistants, compare, and then hold every flag to a bar most of them don't clear. The published record was right. Whether its representation was — that's the question we answer, one verified finding at a time. If you'd like to see what public AI is currently saying about the rules your institution operates under, we can scope a private baseline.

How to cite this

Short form: Lawnise Research & Editorial team. (2026). How Lawnise Verifies AI Answers Against Official Sources. Lawnise. https://www.lawnise.com/research/how-lawnise-verifies-ai-answers
Long form (APA): Lawnise Research & Editorial team. (2026, June 29). How Lawnise Verifies AI Answers Against Official Sources (Methodology v1.1). Lawnise. https://www.lawnise.com/research/how-lawnise-verifies-ai-answers
BibTeX:
@misc{lawnise2026howlawniseverifiesaianswers,
  author    = {{Lawnise Research and Editorial team}},
  title     = {How Lawnise Verifies AI Answers Against Official Sources},
  year      = {2026},
  publisher = {Lawnise},
  url       = {https://www.lawnise.com/research/how-lawnise-verifies-ai-answers}
}

References

[1] Lawnise Methodology (v1.1). Lawnise verifies public AI answers against official dated sources before promoting a flag to a finding. https://www.lawnise.com/trust-index/methodology/v1#main
[2] NIST AI Risk Management Framework overview. The NIST AI Risk Management Framework is cited as a governance vocabulary reference for measure-layer work. https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-06-26)
[3] NIST AI RMF resources. The NIST AI RMF resources page is cited for the framework resource reference. https://airc.nist.gov/airmf-resources/airmf/ (accessed 2026-06-26)