Accuracy

How accurate is EurekaWrite?

The test results, the method, and the limits. Updated every time the scorer changes.

Every AI marking tool claims it's accurate. Almost none of them show you how they checked. I'm a data analyst by profession, so EurekaWrite gets the same treatment I would demand from any model at work: a fixed test set with known answers, numbers published, limitations stated in plain language.

Quick summary: Before any change goes live, the scorer must pass a regression test against 42 essays that already have human-assigned marks, including all 8 sample writing responses publicly released for the NSW Selective test. On those NSW essays, the scorer lands within about 1.9 marks of the published total (out of 25) on average. It puts 4 of the 8 in exactly the right band, and all 8 within one band of it (a band is about 4 marks wide, so "within one band" is the easy bar and "exactly right" is the harder one). Re-scoring the same essay moves the total by up to 2 marks on the NSW essays. Full numbers below, including what the AI is not good at.

How we test the scorer

The core of our quality control is a fixed test set we call the golden dataset: 42 essays, every one of them already marked by humans before our AI ever saw them.

Group	Essays	Source	What it checks
NSW released samples	8	Sample writing responses publicly released for the NSW Selective test, with marker commentary	Same rubric, same 25-mark scale. Can we match the published mark?
New Zealand exemplars	18	e-asTTle published writing exemplars, marked by trained markers	A different rubric conservatively mapped to ours. Do we rank essays the same way their markers did?
US Grade 5 samples	16	Published state assessment writing samples (Florida)	Weak and average essays NSW never publishes. Do we catch weak writing instead of over-praising it?

A note on materials and copyright. The NSW sample responses and marker commentary are published by the NSW Department of Education, and copyright in those materials is owned by the Department and Cambridge University Press & Assessment. None of that material is reproduced on this site, and EurekaWrite is not affiliated with or endorsed by either organisation. What we publish here is only the comparison: how close our scorer's marks come to the published ones.

Why the overseas essays? NSW publishes very few marked scripts, and almost all of them are strong. A test run only on strong essays tells us nothing about how the scorer handles weak ones. The New Zealand and US sets give us officially marked weak and average scripts, with their scales mapped to ours. For those groups we hold the scorer to a direction test (does it place the essay in the right band range?) rather than an exact-mark test, because cross-system mapping is approximate by nature.

This test runs before every change to the scoring system: a new prompt, a new model, a rubric adjustment. If the numbers get worse, the change doesn't ship.
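As a rough illustration of what a gate like this can look like, here is a minimal sketch. It is not the actual EurekaWrite pipeline: the `RunMetrics` fields, the `passes_gate` function, and the rule "no worse on every metric" are all assumptions made for the example, and the candidate numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Summary of one full run over the fixed test set (field names are illustrative)."""
    mean_gap: float       # average gap to the published total; lower is better
    exact_band: int       # essays placed in exactly the right band; higher is better
    within_one_band: int  # essays placed within one band; higher is better


def passes_gate(baseline: RunMetrics, candidate: RunMetrics) -> bool:
    """Return True only if the candidate is no worse than the baseline on every metric."""
    return (
        candidate.mean_gap <= baseline.mean_gap
        and candidate.exact_band >= baseline.exact_band
        and candidate.within_one_band >= baseline.within_one_band
    )


# The baseline reuses the NSW figures published on this page.
# The candidate is a made-up example of a change that should be rejected.
baseline = RunMetrics(mean_gap=1.9, exact_band=4, within_one_band=8)
candidate = RunMetrics(mean_gap=2.3, exact_band=4, within_one_band=8)

print(passes_gate(baseline, candidate))  # False: the average gap got worse
```

A strict "no worse on anything" rule is the simplest choice to reason about; a real gate might allow small tolerances for run-to-run noise.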

The numbers

Last full run: 18 June 2026, scoring engine v009 (the version scoring your child's essay today). The exact model and pipeline are in the FAQ below.

On the 8 publicly released NSW sample responses

What we measure	Result
Average gap to the published total (out of 25)	1.9 marks
Average gap on Set A: content, structure, style (out of 15)	0.9 marks
Average gap on Set B: sentences, punctuation, spelling (out of 10)	1.0 marks
Overall band exactly right	4 of 8
Overall band within one band of the published band	8 of 8
Average gap per dimension (each out of 5 or less)	0.35 marks

One pattern worth knowing: every band miss in this run went the same direction. The scorer marked four published Band 6 scripts down to Band 5, by 2 to 5 marks; it did not push any weaker script up a band. If EurekaWrite errs, it leans tough on the very strongest writing far more often than it flatters the rest.

Here is every one of those 8 essays, the published total beside ours. Only the marks are shown, never the essays or the markers' comments.

NSW sample	Published total (/25)	EurekaWrite	Gap
1	23	21	2
2	24	21	3
3	20	21	1
4	25	24	1
5	23	20	3
6	20	20	0
7	24	19	5
8	19	19	0

The gaps cluster at the top. All five scripts published at 23–25 were marked down (by 1 to 5 marks), while the three 19–20 scripts landed within a point. That is the same pattern as above, shown in the raw numbers: strict on the strongest writing, accurate on the rest.
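If you want to check the arithmetic yourself, the headline average can be reproduced from the table in a few lines. This is only a sketch of the calculation; the variable names are illustrative, and the mark pairs are copied from the eight rows above.

```python
# (published total, EurekaWrite total) for NSW samples 1 to 8, copied from the table
marks = [(23, 21), (24, 21), (20, 21), (25, 24), (23, 20), (20, 20), (24, 19), (19, 19)]

# Gap = absolute difference, so over-marking and under-marking both count as error
gaps = [abs(published - ours) for published, ours in marks]
mean_gap = sum(gaps) / len(gaps)

# Direction of each miss: did the scorer land below or above the published total?
marked_down = sum(1 for published, ours in marks if ours < published)
marked_up = sum(1 for published, ours in marks if ours > published)

print(f"average gap: {mean_gap:.3f}")  # 1.875, reported as about 1.9
print(f"marked down: {marked_down}, marked up: {marked_up}")
```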

Across the full 42-essay set

What we measure	Result
Evidence quotes verified against the essay text	99.8% of quotes (in 41 of 42 essays every quote checked out; 0 flagged for review)
NZ exemplars: band within the expected range (±1 band)	18 of 18
NZ exemplars: total within the mapped range (the tighter test)	12 of 18
US samples: band within the expected range (±1 band)	15 of 16
Re-scoring stability (5 essays × 3 runs): max spread	3 marks (NSW essays within 2; the wider spread was a cross-system essay)

How to read your child's score: treat the total as accurate to within about 2 marks, treat the band as the primary signal, and treat the quoted evidence as the most reliable part of the report. That ±2 is mostly the model's own run-to-run variance: the same essay re-scored moves by up to 2 marks. So if two practice essays score 1 mark apart, that's noise; a full band apart is signal.
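For readers who like to see the mechanics, "spread" here simply means the highest total minus the lowest total across repeated runs of the same essay. The sketch below shows that calculation along with a simple noise check. The run totals are invented for illustration and are not our test results; the names `spread` and `likely_noise` and the 2-mark threshold are assumptions based on the figures above.

```python
def spread(totals: list[int]) -> int:
    """Largest minus smallest total across repeated runs of one essay."""
    return max(totals) - min(totals)


def likely_noise(score_a: int, score_b: int, run_to_run: int = 2) -> bool:
    """Treat a difference no larger than the run-to-run variance as noise."""
    return abs(score_a - score_b) <= run_to_run


# Hypothetical totals from three runs each (NOT real EurekaWrite data)
runs = {
    "essay_a": [21, 21, 22],
    "essay_b": [19, 20, 19],
    "essay_c": [14, 17, 15],
}

spreads = {name: spread(totals) for name, totals in runs.items()}
worst = max(spreads.values())

print(spreads)               # {'essay_a': 1, 'essay_b': 1, 'essay_c': 3}
print(worst)                 # 3
print(likely_noise(20, 21))  # True: 1 mark apart is within run-to-run variance
print(likely_noise(20, 24))  # False: about a full band apart
```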

How the system keeps itself honest

Quote-verified feedback. Every piece of evidence must be an exact quote from your child's essay. Our code checks that each quote actually appears in the submitted text. Failed quotes trigger one repair pass; anything still unverified is flagged for review, never silently displayed.
Consistency by design. The scorer runs at a low randomness setting, and we test stability by re-scoring the same essays multiple times before release.
Everything is versioned. Every score stores the exact prompt version, rubric version, and model version that produced it. When the scorer improves, we can tell you exactly what changed and when.
A regression gate, not a vibe check. The 42-essay test ran before each of the nine prompt versions we've shipped. Versions that scored worse never reached you.
No ghostwriting. Feedback shows how to improve a sentence or a paragraph. It never writes the essay for your child. That's a deliberate product rule, not a technical limitation.
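To make the quote check concrete, here is a minimal sketch of how a verification step with a single repair pass could work. It is an illustration, not our production code: `verify_quotes`, `normalise`, and the `repair` callback are hypothetical names, and collapsing whitespace before comparing is an assumption added so that line breaks in the essay do not cause false misses.

```python
from typing import Callable, Optional


def normalise(text: str) -> str:
    """Collapse runs of whitespace (an assumption; a stricter check could skip this)."""
    return " ".join(text.split())


def appears_in(quote: str, essay_text: str) -> bool:
    """True if the non-empty, normalised quote is a substring of the essay text."""
    cleaned = normalise(quote)
    return bool(cleaned) and cleaned in essay_text


def verify_quotes(
    essay: str,
    quotes: list[str],
    repair: Optional[Callable[[str, str], Optional[str]]] = None,
) -> tuple[list[str], list[str]]:
    """Split quotes into (verified, flagged), giving each failure one repair attempt."""
    essay_text = normalise(essay)
    verified: list[str] = []
    flagged: list[str] = []
    for quote in quotes:
        if appears_in(quote, essay_text):
            verified.append(quote)
            continue
        # One repair pass only: ask the hypothetical callback for a corrected quote
        fixed = repair(quote, essay) if repair is not None else None
        if fixed is not None and appears_in(fixed, essay_text):
            verified.append(fixed)
        else:
            flagged.append(quote)  # held for review, never shown as fact
    return verified, flagged


essay = "The storm arrived at dusk.\nWe ran for the old shed."
quotes = ["The storm arrived at dusk.", "We sprinted for the old shed."]

verified, flagged = verify_quotes(essay, quotes)
print(verified)  # ['The storm arrived at dusk.']
print(flagged)   # ['We sprinted for the old shed.']
```

The second quote is flagged because the essay says "ran", not "sprinted": a paraphrase fails the check even when it is close in meaning.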

What these numbers do not mean

This is not an official score. EurekaWrite has no affiliation with the NSW Department of Education or any test provider. A practice score is guidance, not a prediction of placement.
Eight released scripts is a small sample. That's all NSW has made public. It's why we report "about 1.9 marks average gap on those 8 essays" instead of quoting a confident-sounding percentage.
The precise figure is on strong essays only. NSW publishes mostly strong scripts, so that 1.9-mark gap is measured on strong writing. For weak and average essays (the overseas sets) we run only the looser direction test, not an exact-mark comparison. The reassuring part is that wherever we can measure, the scorer is hard on strong writing and only rarely over-scores the weaker end, by a mark or two when it does.
Human markers vary too. In the real test, scripts are seen by more than one marker precisely because trained humans differ by a mark or two. AI variance is the same kind of uncertainty, and we publish ours.
A practice score is a snapshot. Test-day performance moves with nerves, the prompt, and the day. Use the trend across several essays, not any single number.

Frequently asked questions

Is EurekaWrite an official NSW marking service?

No. EurekaWrite is an independent practice tool with no affiliation to the NSW Department of Education or any test provider. Scores are practice guidance calibrated against publicly available marked essays, not predictions of an official result.

Why does the same essay sometimes get a slightly different mark?

AI scoring has a small amount of natural variance, the same way two human markers can differ by a mark or two. We run the scorer at a low randomness setting and test stability by re-scoring the same essays multiple times. A 1-mark difference between runs is noise; a full band difference is signal.

How do you stop the AI from making up feedback?

Every piece of evidence in a report must be an exact quote from the student's essay. After the AI responds, our code checks that each quote actually appears in the submitted text. Quotes that fail the check trigger a repair pass, and anything still unverified is flagged for review rather than shown as fact.

What AI model does EurekaWrite use?

EurekaWrite currently runs on OpenAI's GPT-4.1 mini, wrapped in a scoring pipeline calibrated over nine prompt versions against the test set above. The model receives a structured rubric and must return scores within hard boundaries, with quote-backed evidence for every dimension. For how the rubric itself works, see our marking criteria guide.

Questions about the methodology? I read every email: eurekawrite@haorix.com. The story of why I built this is on the About page. For the bigger picture of why writing is the test component AI is best suited to, see using AI for selective writing practice.

Try EurekaWrite, it's free

Score your child's writing in 30 seconds. No signup needed.