
Accuracy

How accurate is EurekaWrite?

The test results, the method, and the limits. Updated every time the scorer changes.

Every AI marking tool claims it's accurate. Almost none of them show you how they checked. I'm a data analyst by profession, so EurekaWrite gets the same treatment I would demand from any model at work: a fixed test set with known answers, numbers published, limitations stated in plain language.

Quick summary: Before any change goes live, the scorer must pass a regression test against 42 essays that already have human-assigned marks, including all 8 sample writing responses publicly released for the NSW Selective test. On those NSW essays, the current scorer lands within about 1.9 marks of the published total (out of 25) on average, and its band is never more than one off the published band. Re-scoring the same essay produces totals within a 2-mark spread. Full numbers below, including what the AI is not good at.

How we test the scorer

The core of our quality control is a fixed test set we call the golden dataset: 42 essays, every one of them already marked by humans before our AI ever saw them.

| Group | Essays | Source | What it checks |
| --- | --- | --- | --- |
| NSW released samples | 8 | Sample writing responses publicly released for the NSW Selective test, with marker commentary | Same rubric, same 25-mark scale. Can we match the published mark? |
| New Zealand exemplars | 18 | e-asTTle published writing exemplars, marked by trained markers | A different rubric conservatively mapped to ours. Do we rank essays the same way their markers did? |
| US Grade 5 samples | 16 | Published state assessment writing samples (Florida) | Weak and average essays NSW never publishes. Do we catch weak writing instead of over-praising it? |

A note on materials and copyright. The NSW sample responses and marker commentary are published by the NSW Department of Education, and copyright in those materials is owned by the Department and Cambridge University Press & Assessment. None of that material is reproduced on this site, and EurekaWrite is not affiliated with or endorsed by either organisation. What we publish here is only the comparison: how close our scorer's marks come to the published ones.

Why the overseas essays? NSW publishes very few marked scripts, and almost all of them are strong. A scorer tested only on strong essays learns nothing about weak ones. The New Zealand and US sets give us officially marked weak and average scripts, with their scales mapped to ours. For those groups we hold the scorer to a direction test (does it place the essay in the right band range?) rather than an exact-mark test, because cross-system mapping is approximate by nature.
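
As a rough illustration of the direction test, the sketch below checks whether a predicted band falls inside a mapped band range. The function names, the tuple layout and the example values are assumptions made for this sketch, not the production code or real test data.

```python
def in_expected_range(predicted_band: int, expected_low: int, expected_high: int) -> bool:
    """True when the predicted band sits inside the mapped range (inclusive)."""
    return expected_low <= predicted_band <= expected_high


def direction_test(cases):
    """Count how many essays land in their expected band range.

    `cases` is a list of (predicted_band, expected_low, expected_high) tuples.
    Returns (number passed, number of cases).
    """
    passed = sum(1 for pred, low, high in cases if in_expected_range(pred, low, high))
    return passed, len(cases)


# Illustrative values only, not real test data.
example_cases = [(3, 2, 4), (5, 4, 5), (2, 3, 4)]
passed, total = direction_test(example_cases)
print(f"{passed} of {total} in expected range")  # prints "2 of 3 in expected range"
```

A range check like this tolerates the approximate cross-system mapping while still failing a scorer that rates a weak script as strong.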

This test runs before every change to the scoring system: a new prompt, a new model, a rubric adjustment. If the numbers get worse, the change doesn't ship.
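
A minimal sketch of such a ship/no-ship gate, assuming metrics are held as plain dictionaries. The metric names and the candidate values are hypothetical; only the baseline figures come from the summary above.

```python
def regression_gate(baseline: dict, candidate: dict, lower_is_better: set) -> list:
    """Return the names of metrics that got worse.

    An empty list means the candidate is no worse than the baseline on any metric.
    """
    worse = []
    for name, base_value in baseline.items():
        new_value = candidate[name]
        if name in lower_is_better:
            regressed = new_value > base_value
        else:
            regressed = new_value < base_value
        if regressed:
            worse.append(name)
    return worse


# Baseline uses the published figures; the candidate run is invented for illustration.
baseline = {"mean_total_gap": 1.9, "bands_within_one": 8}
candidate = {"mean_total_gap": 2.3, "bands_within_one": 8}

failures = regression_gate(baseline, candidate, lower_is_better={"mean_total_gap"})
print("ship" if not failures else f"blocked: {failures}")
```

Here the hypothetical candidate is blocked because its average gap grew, even though the band result held steady.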

The numbers

Last full run: 12 June 2026, scoring engine v008 on GPT-4.1 mini (the version scoring your child's essay today).

On the 8 publicly released NSW sample responses

| What we measure | Result |
| --- | --- |
| Average gap to the published total (out of 25) | 1.9 marks |
| Average gap on Set A: content, structure, style (out of 15) | 0.9 marks |
| Average gap on Set B: sentences, punctuation, spelling (out of 10) | 1.0 marks |
| Overall band exactly right | 4 of 8 |
| Overall band within one band of the published band | 8 of 8 |
| Average gap per dimension (each out of 5 or less) | 0.35 marks |

One pattern worth knowing: every band miss in this run went the same direction. The scorer called four published Band 6 scripts a high Band 5; it never lifted a weaker script into a higher band. If EurekaWrite errs, it errs slightly tough on the very strongest writing rather than flattering the rest.
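
For readers who want to see how figures of this kind can be computed, here is a small sketch. It assumes "average gap" means the mean absolute difference between the scorer's mark and the published mark, which is my reading rather than a documented definition. The function names and example numbers are illustrative, not the real test data.

```python
def mean_absolute_gap(predicted, published):
    """Average of |predicted - published| across essays."""
    return sum(abs(p - q) for p, q in zip(predicted, published)) / len(published)


def band_agreement(predicted_bands, published_bands):
    """Return (exact matches, matches within one band)."""
    pairs = list(zip(predicted_bands, published_bands))
    exact = sum(1 for p, q in pairs if p == q)
    within_one = sum(1 for p, q in pairs if abs(p - q) <= 1)
    return exact, within_one


def miss_directions(predicted_bands, published_bands):
    """Return (misses below the published band, misses above it)."""
    pairs = list(zip(predicted_bands, published_bands))
    below = sum(1 for p, q in pairs if p < q)
    above = sum(1 for p, q in pairs if p > q)
    return below, above


# Illustrative values only.
totals_predicted, totals_published = [20, 18, 23], [22, 18, 24]
bands_predicted, bands_published = [5, 5, 6], [6, 5, 6]

print(mean_absolute_gap(totals_predicted, totals_published))  # 1.0
print(band_agreement(bands_predicted, bands_published))       # (2, 3)
print(miss_directions(bands_predicted, bands_published))      # (1, 0)
```

Counting misses by direction is what reveals a pattern like the one above, where every miss falls below the published band.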

Across the full 42-essay set

| What we measure | Result |
| --- | --- |
| Evidence quotes verified to exist in the essay | 100% |
| NZ exemplars placed in the expected band range | 18 of 18 |
| US samples placed in the expected band range | 16 of 16 |
| Re-scoring stability (5 essays scored 3 times each) | max spread 2 marks |

How to read your child's score: treat the total as accurate to within about 2 marks, treat the band as the primary signal, and treat the quoted evidence as the most reliable part of the report. If two practice essays score 1 mark apart, that's noise. If they sit a full band apart, that's signal.
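
The re-scoring stability figure can be read as the largest difference between the highest and lowest total any single essay received across repeated runs. A minimal sketch under that assumption, with hypothetical essay labels and invented totals:

```python
def max_spread(runs_by_essay: dict) -> int:
    """Largest (max - min) total across repeated runs of any single essay."""
    return max(max(totals) - min(totals) for totals in runs_by_essay.values())


# Illustrative totals only: two essays, three runs each.
runs = {
    "essay_a": [18, 19, 18],
    "essay_b": [21, 21, 23],
}
print(max_spread(runs))  # 2
```

Reporting the maximum rather than the average spread is the stricter choice: one unstable essay is enough to move the number.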

How the system keeps itself honest

What these numbers do not mean

Frequently asked questions

Is EurekaWrite an official NSW marking service?

No. EurekaWrite is an independent practice tool with no affiliation to the NSW Department of Education or any test provider. Scores are practice guidance calibrated against publicly available marked essays, not predictions of an official result.

Why does the same essay sometimes get a slightly different mark?

AI scoring has a small amount of natural variance, the same way two human markers can differ by a mark or two. We run the scorer at a low randomness setting and test stability by re-scoring the same essays multiple times. A 1-mark difference between runs is noise; a full band difference is signal.

How do you stop the AI from making up feedback?

Every piece of evidence in a report must be an exact quote from the student's essay. After the AI responds, our code checks that each quote actually appears in the submitted text. Quotes that fail the check trigger a repair pass, and anything still unverified is flagged for review rather than shown as fact.
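
The check itself is conceptually simple. The sketch below is an assumption about how such a check could look, not the production code. In particular, collapsing whitespace before comparing is my own choice, made so that a line break inside the essay does not cause a false failure.

```python
def normalise(text: str) -> str:
    """Collapse all runs of whitespace to single spaces."""
    return " ".join(text.split())


def unverified_quotes(essay: str, quotes: list) -> list:
    """Return the quotes that do not appear in the essay text.

    Empty quotes are treated as unverified, since an empty string
    would otherwise match any essay.
    """
    haystack = normalise(essay)
    return [q for q in quotes if not normalise(q) or normalise(q) not in haystack]


# Illustrative essay and quotes.
essay = "The storm broke at dawn.\nWe ran for the old shed."
quotes = ["storm broke at dawn", "dawn. We ran", "We sprinted for the shed"]

print(unverified_quotes(essay, quotes))  # ['We sprinted for the shed']
```

Anything returned by a check like this would go to the repair pass rather than into the report.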

What AI model does EurekaWrite use?

EurekaWrite currently runs on OpenAI's GPT-4.1 mini, wrapped in a scoring pipeline calibrated over eight prompt versions against the test set above. The model receives a structured rubric and must return scores within hard boundaries, with quote-backed evidence for every dimension. For how the rubric itself works, see our marking criteria guide.
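
To give a sense of what "hard boundaries" can mean in practice, here is a sketch of a range check on returned scores. The dimension names follow Set A above. The maximum of 5 per dimension is inferred from the 15-mark Set A total, and the lower bound of 0 is an assumption. The function name and output wording are hypothetical.

```python
def out_of_bounds(scores: dict, max_marks: dict) -> list:
    """List every dimension whose score is missing or outside 0..maximum."""
    problems = []
    for dimension, maximum in max_marks.items():
        value = scores.get(dimension)
        if value is None:
            problems.append(f"{dimension}: missing")
        elif not 0 <= value <= maximum:
            problems.append(f"{dimension}: {value} outside 0..{maximum}")
    return problems


# Assumed maxima for the three Set A dimensions.
set_a_max = {"content": 5, "structure": 5, "style": 5}
model_output = {"content": 4, "structure": 6}  # invented response with two faults

print(out_of_bounds(model_output, set_a_max))
# ['structure: 6 outside 0..5', 'style: missing']
```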

Questions about the methodology? I read every email: eurekawrite@haorix.com. The story of why I built this is on the About page.
