We Asked Our 4B Legal AI 10 Questions. It Invented 7 Cases.

George Pu

An honest post-mortem on our own model — and what anyone building or buying AI should take from it.


Two weeks ago we shipped flash-1-mini on Hugging Face.

4 billion parameters. Apache 2.0.

Trained on Canadian Business, Legal, and Regulatory material.

Sovereign Canadian compute: resources in Rimouski, Quebec, the federally backed infrastructure cited in last week's National AI Strategy.

It's not the production model. flash-1 ships September 30.

flash-1-mini is the research preview - the public artifact we're learning in front of.

This week we tested it.

Ten questions across Canadian Charter law, Civil Code of Quebec, federal tax, securities, employment, and privacy.

The standard battery any Canadian lawyer would consider basic.

Here's what failed, and what every team building or buying AI should take from it.

The headline

flash-1-mini hallucinated 7 case citations across 10 questions.

It articulated the doctrines correctly - the Oakes test framework, the Quebec civil law concept of prescription, the general principle of Bill 96, the structure of PIPEDA consent.

The framework reasoning is genuinely there at 4B.

But the citations? Most of them are invented.

It invented R. v. Hufsky [2014] 2 R.C.S. 694 - a citation that doesn't exist.

The real Hufsky is from 1988 and is about random highway stops, not border searches.

For R. v. Oakes, the foundational Canadian Charter case every law student learns, it got the citation right and the holding reversed.

It said the Supreme Court justified the limitation under section 1.

The actual Oakes holding struck the provision down.

The case is famous precisely because the law failed the test Oakes itself created.

The model got it backwards.

It invented a Quebec employment statute called 'ARC' - the Act respecting the independence of the Quebec labor market.

This statute does not exist.

The real Quebec statute is the LNT, the Loi sur les normes du travail.

It told me Canadian securities filings go through EDGAR.

EDGAR is the US SEC system. Canadian filings go through SEDAR+.

On one question - comparing Quebec to Ontario employment termination notice - it went into a reasoning loop.

Forty paragraphs of 'Wait, no, actually...' without ever converging on an answer.

Just oscillation until I cut it off. This is a serving-configuration problem rather than a knowledge failure, but still notable.

Why this matters for anyone working with AI

None of these are Canadian-specific quirks.

They're the universal failure modes of small language models without retrieval grounding.

If you're building or buying AI for anything high-stakes, these are the four patterns to test for.

Confident hallucination.

The model doesn't know what it doesn't know.

It generates plausible-sounding citations because the training objective rewarded plausible-sounding text, not verified sources.

Confidence is not evidence of accuracy - it's the default output style.
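One way to test for this pattern is to check every reporter citation in a response against an index of verified citations. What follows is a minimal sketch, not our evaluation code: `VERIFIED_SCR` and `check_citations` are hypothetical names, the index is a two-entry toy, and the pattern only covers Supreme Court Reports citations.

```python
import re

# Hypothetical verified index keyed by (year, volume, page) in the
# Supreme Court Reports. A real index would be built from an
# authoritative case law dataset, not typed in by hand.
VERIFIED_SCR = {
    (1986, 1, 103): "R. v. Oakes",
    (1988, 1, 621): "R. v. Hufsky",
}

# Matches "[1988] 1 S.C.R. 621" and the French form "[1988] 1 R.C.S. 621".
SCR_PATTERN = re.compile(r"\[(\d{4})\]\s+(\d+)\s+(?:S\.C\.R\.|R\.C\.S\.)\s+(\d+)")


def check_citations(text):
    """Return (citation_text, case_name_or_None) for each SCR citation found."""
    results = []
    for match in SCR_PATTERN.finditer(text):
        key = tuple(int(g) for g in match.groups())
        results.append((match.group(0), VERIFIED_SCR.get(key)))
    return results
```

A citation that resolves to `None` is not proof of fabrication, since the index may simply be incomplete, but it is a cheap signal for human review. This check says nothing about whether the holding attached to a real citation is stated correctly.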

US-data default.

Almost every base model is trained predominantly on US sources.

Deploy it for any non-US use case and you get this exact pattern: US frameworks confidently applied to non-US questions.

EDGAR for Canadian filings.

Delaware for non-US incorporation.

SEC rules for non-US regulation.
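A rough way to screen for this is to flag US-specific terms in answers to non-US questions. The sketch below assumes a hand-written term list; only the EDGAR / SEDAR+ pair comes from our test run, and the other entries and the function name are illustrative.

```python
import re

# Illustrative mapping from US-specific terms to what a reviewer would
# expect in a Canadian answer. This list is an assumption for the sketch,
# not a complete taxonomy.
US_DEFAULTS = {
    "EDGAR": "SEDAR+",
    "SEC": "a Canadian securities regulator",
    "Delaware": "a Canadian incorporating jurisdiction",
}


def flag_us_defaults(answer):
    """Return (term, expected) pairs for US terms that appear as whole words."""
    hits = []
    for term, expected in US_DEFAULTS.items():
        # Case-sensitive, whole-word match so "SEC" does not fire on "second".
        if re.search(rf"\b{re.escape(term)}\b", answer):
            hits.append((term, expected))
    return hits
```

A flag is a prompt for review, not a verdict: an answer that legitimately compares Canadian and US regimes will mention these terms too.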

Right pointer, wrong content.

Even when the citation is real, the substance can be inverted — like Oakes.

This is the dangerous one, because it looks confident, the source check passes, and the answer is still wrong.
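Catching this requires comparing the claimed outcome against a verified one, not just confirming the citation exists. The sketch below is deliberately crude and assumes a hypothetical digest (`VERIFIED_OUTCOMES`) plus keyword cues invented for illustration; in practice the comparison would be done by a reviewer or a stronger classifier.

```python
# Hypothetical digest of verified outcomes. Only the Oakes entry reflects
# something stated in this post; a real digest would come from reviewed
# case summaries.
VERIFIED_OUTCOMES = {
    "R. v. Oakes": "struck_down",
}

# Crude keyword cues, assumed for illustration only.
OUTCOME_CUES = {
    "struck_down": ("struck down", "struck the provision down",
                    "unconstitutional", "not justified"),
    "upheld": ("upheld", "justified the limitation", "was justified"),
}


def claimed_outcome(sentence):
    """Guess which outcome a sentence claims, or None if no cue matches."""
    lowered = sentence.lower()
    # Check struck_down first so "not justified" is not read as "justified".
    for outcome in ("struck_down", "upheld"):
        if any(cue in lowered for cue in OUTCOME_CUES[outcome]):
            return outcome
    return None


def holding_matches(case_name, sentence):
    """True or False when both sides are known, None when either is unknown."""
    verified = VERIFIED_OUTCOMES.get(case_name)
    claimed = claimed_outcome(sentence)
    if verified is None or claimed is None:
        return None
    return verified == claimed
```

Returning `None` for unknown cases keeps "not checked" distinct from "checked and wrong", which matters when the results feed a hallucination rate.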

Generation loops.

Without proper output controls, models get stuck oscillating.

Anyone who's used a chat model long enough has seen it.
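Loops are the easiest of the four to detect automatically. A minimal heuristic sketch, where the marker phrase and both thresholds are assumptions chosen for illustration rather than tuned values:

```python
import re


def looks_like_loop(text, marker="wait, no", max_marker_hits=3, max_repeat_ratio=0.5):
    """Heuristic loop check: too many self-corrections or repeated sentences."""
    lowered = text.lower()
    if lowered.count(marker) > max_marker_hits:
        return True

    # Split on sentence-ending punctuation and normalize whitespace.
    sentences = [" ".join(s.split()) for s in re.split(r"[.!?]+", lowered)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return False

    # Share of sentences that duplicate an earlier sentence.
    repeated = len(sentences) - len(set(sentences))
    return repeated / len(sentences) > max_repeat_ratio
```

A check like this belongs in the test harness as well as in serving: it turns "the model looped once in a demo" into a rate you can track across a question set.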

Test for these on the questions your actual users will ask - not on the demo prompts.

Benchmark hallucination rates do not predict operational hallucination rates.

The honest decomposition

I want to be precise about what's happening here - and equally precise that the split below is my judgment, not a measurement.

I haven't run an ablation that cleanly attributes each failure to a cause.

This is my read as the person who trained the thing.

I'd rather give you my honest estimate than false precision.

Roughly, it breaks into three layers, and they have different fixes.

A large share is inherent to 4B model size.

A 4-billion-parameter model can't memorize the full Code civil du Québec verbatim, plus every Supreme Court case, plus every National Instrument, plus every Quebec statute, with perfect article-number recall.

Some hallucination at this scale is structural.

Even frontier models hallucinate Canadian-specific citations.

The biggest share is our training data choices.

This is the part we control.

flash-1-mini was trained on what we had at the time, and the corpus was US-heavy by default and underweighted Canadian primary sources: verbatim CCQ text, SCC case digests with verified holdings, CSA National Instruments, Quebec employment law.

We can fix this for flash-1.

It's the highest-leverage fix we have.

A smaller share is inference configuration.

The infinite loop on Quebec-vs-Ontario was a generation control issue.

vLLM has parameters for it - a max-tokens cap, repetition penalty, stop sequences.

We're shipping that fix this week before any external user touches the demo.
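As a rough illustration of those three controls, here is a request body for vLLM's OpenAI-compatible completions endpoint. The 1500-token cap matches the config described later in this post; the penalty value, the stop string, the model id, and the function name are assumptions for this sketch, and the field names should be checked against the vLLM version you deploy.

```python
def build_completion_request(prompt, model="flash-1-mini"):
    """Build a request body with the three loop-related generation controls.

    Assumes a vLLM OpenAI-compatible server. `repetition_penalty` is a
    vLLM extension to the standard OpenAI fields, so confirm it against
    your version before relying on it.
    """
    return {
        "model": model,               # assumed served model id
        "prompt": prompt,
        "max_tokens": 1500,           # hard cap on output length
        "repetition_penalty": 1.1,    # assumed value; above 1.0 penalizes repeats
        "stop": ["Wait, no"],         # assumed stop string for the loop pattern
    }
```

The stop string is blunt, since it ends generation at the first self-correction. The token cap is the backstop that holds even when the loop takes a form nobody anticipated.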

The takeaway for buyers: when a vendor says 'our model performs well,' the honest follow-up is 'on what dataset, with what corpus emphasis, with what inference controls?'

Those three layers determine more than parameter count does.

Why I'm publishing this

A few reasons.

Hidden failures compound.

If I don't publish this, every team building specialized AI hits the same walls in private.

Their cofounders think they shipped something solid.

Their first customer finds out the hard way.

The market for specialized regulated AI gets a reputation for being unreliable.

Nobody wins.

Published failures get fixed - by us, by anyone reading this, by the next team that doesn't have to rediscover the same patterns.

Substrate matters more than size.

I've been writing for months about why specialized AI for regulated sectors needs to be a model + RAG architecture, not a standalone model.

People nodded politely. Now I have receipts.

A 4B model without retrieval grounding hallucinates because it can't ground.

A 9B model isn't dramatically better on this dimension.

Going from 9B to 70B doesn't solve it either.

The architectural commitment is retrieval-augmented generation - grounding every output against authoritative sources like A2AJ's Canadian case law dataset and the federal court archive at inference time.

That's what flash-1 ships with September 30. It's not a feature. It's the whole architecture.
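To make "grounding" concrete, here is a toy sketch of the three steps: retrieve passages, put them in the prompt, then reject any cited source that was not retrieved. It is not the flash-1 pipeline. The word-overlap retriever, the bracketed-id convention, and every name here are assumptions for illustration; a production system would use a proper index and authoritative sources.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str   # hypothetical identifier for a case or statute
    text: str


def tokenize(text):
    """Lowercase word and number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, k=2):
    """Rank passages by word overlap with the question (toy retriever)."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(p.text)), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question, passages):
    """Put retrieved passages in the prompt and ask for bracketed source ids."""
    sources = "\n".join(f"[{p.source_id}] {p.text}" for p in passages)
    return (
        "Answer using only the sources below. Cite sources by id in brackets.\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n"
    )


def ungrounded_citations(answer, passages):
    """Return cited ids that were not among the retrieved passages."""
    allowed = {p.source_id for p in passages}
    cited = re.findall(r"\[([^\]]+)\]", answer)
    return [c for c in cited if c not in allowed]
```

The post-check is the part that matters: whatever the model writes, a citation only survives if it traces back to a retrieved passage. Note that the bracket pattern would also match a year like `[2014]` in a reporter citation, so a real implementation needs a less ambiguous id format.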

Sophisticated buyers are smart.

The AMF begins enforcing its AI Guideline on Quebec financial institutions in May 2027.

Treasury Board's Directive on Automated Decision-Making has a compliance milestone on June 24.

These buyers will not tolerate a vendor that pretends its model never hallucinates.

They'll buy the vendor that documented the failure modes, fixed them publicly, and ships with audit-trail architecture.

Transparency about limitations is not vulnerability. It's the only credible position.

What we're doing about it

Specific fixes, ordered by timeline:

This week:

Ayush updates the vLLM serving config - output capped at 1500 tokens, repetition penalty above 1.0, stop sequences on the loop pattern.

The Quebec-vs-Ontario question won't loop on the next test.

The system prompt now instructs the model to warn about citation hallucination in every response, so the warning travels with every screenshot.

Through July:

Closed beta with 10–15 trusted contacts - people who understand model limitations and can give us back operational signal.

Not a public launch.

The demo URL stays unpublished except to people we directly invite.

Through September:

Corpus rebalancing for flash-1.

Priority on SCC case digests with verified holdings (fixes Oakes-style reversals), CCQ verbatim text (fixes Quebec civil law article hallucination), CSA National Instruments verbatim (fixes EDGAR / SEDAR+ confusion), LNT verbatim (fixes the 'Quebec ARC' fabrication), federal tax sources, PIPEDA Schedule 1, and Bill 96 amended Charter text.

We're paying domain experts to validate the dataset - not because human-only labeling is cheaper, but because LLM-assisted pipelines with human validation produce higher-integrity training data than pure-machine generation.

September 30:

flash-1 ships. 9 billion parameters.

Bilingual Canadian Business + Legal + Regulatory specialization.

RAG architecture grounding every citation against A2AJ, the federal court archive, and our curated corpus.

Target citation accuracy above 87 percent. Audit trail per response. Open weights, Apache 2.0.
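How 'citation accuracy' is defined matters as much as the target number. Below is a minimal sketch of one possible definition alongside a per-response audit record. The field names and the rule in `is_accurate` are assumptions for illustration, not the CBLRE definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CitationCheck:
    citation: str                      # citation as it appeared in the response
    source_id: Optional[str]           # matching source id, or None if unmatched
    holding_ok: Optional[bool] = None  # reviewer verdict, None if not reviewed


@dataclass
class AuditRecord:
    question: str
    answer: str
    checks: List[CitationCheck] = field(default_factory=list)


def is_accurate(check):
    """Assumed rule: resolves to a source and the holding was not marked wrong."""
    return check.source_id is not None and check.holding_ok is not False


def citation_accuracy(records):
    """Fraction of accurate citations across records, or None if there are none."""
    checks = [c for r in records for c in r.checks]
    if not checks:
        return None
    return sum(is_accurate(c) for c in checks) / len(checks)
```

Under this rule an Oakes-style answer, with a real citation and a reversed holding, counts as inaccurate once a reviewer marks it. An unreviewed but resolvable citation counts as accurate, which is a lenient choice; a stricter definition would require `holding_ok is True`.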

Through March 31, 2027:

flash-1-pro. 27B mixture-of-experts. The specialized variant track.

These aren't promises.

They're the operational sequence we're executing.

You can hold us to it on September 30.

What this doesn't mean

A few things I want to push back on before they get said.

This is not a failure of the project.

A 4B research preview hallucinating Canadian legal citations is the documented failure mode of small models without retrieval.

We knew it going in.

The CBLRE evaluation methodology we published explicitly measures these failures.

flash-1-mini is performing exactly to its documented limitations.

This is not an argument for bigger models.

Going from 4B to 9B roughly doubles the parameter count.

It doesn't fix the structural problem.

Frontier models with hundreds of billions of parameters also hallucinate Canadian-specific citations.

The architectural commitment is RAG, not size.

This is not adversarial to other AI labs.

Cohere, OpenAI, Anthropic, Mistral - they're building different products for different customers, and they make the right calls for their position.

Canadian Business, Legal, and Regulatory specialization is a lane the frontier labs don't economically serve, not a lane they're losing.

This is not the end of testing.

We'll run more diagnostics on flash-1 when it ships, and publish those too.

Test, document, fix, retest — that's the operational discipline.

CBLRE is the public artifact of it.

What I want from you reading this

If you're a lawyer or regulatory professional - try the demo when closed beta opens this month.

Break it. Tell us what failed, and tell us where the framework reasoning was useful even when the citations weren't.

If you're an AI builder - copy whatever's useful from CBLRE.

It's on Hugging Face under a permissive license.

The methodology is the artifact, not the model. And four things worth carrying into your own work:

  1. Test on the questions your real users will ask, not demo prompts. Benchmark hallucination rates don't predict operational ones.
  2. Watch the corpus, not just the parameter count. A 4B model with the right corpus and architecture can beat a 70B general model on a specialized domain. Size and capability stop being linear once you specialize.
  3. Retrieval isn't a feature - for high-stakes domain use, it's the architecture. A standalone model sold for regulated deployment without grounding is a confident-hallucination machine.
  4. Publish your failure modes. The moat in specialized AI isn't which model hallucinates least. It's which team documented its failures most honestly and fixed them most publicly.

If you're a regulator or procurement officer - the audit-trail transparency we're building toward is what your buyer playbook should require from every AI vendor. Not just us. Every one.

If you work at a Canadian financial institution preparing for the AMF AI Guideline - the procurement window is now. flash-1 ships inside your runway. Reach out when you want to talk.

The brand thesis, made concrete

I've been writing for months about 'Own or Be Owned' as the durable thesis for this work.

This essay is what that means at the model layer.

A model whose failures you can audit.

A training corpus whose sources you can verify.

A compute infrastructure whose sovereignty you can confirm.

An architecture whose grounding you can trace.

You can rent frontier capability.

That's a legitimate choice for many use cases. It's not the only choice.

You can also own the substrate.

That's what we're building toward, one published failure at a time.


flash-1-mini is available on Hugging Face under Apache 2.0. The CBLRE evaluation methodology is published alongside it. flash-1 ships September 30, 2026 with RAG architecture targeting above 87 percent citation accuracy.

Closed beta access for trusted contacts opens this month - reach out directly if you want to participate.

— George Pu, SimpleDirect®
