Accuracy
How accurate is EurekaWrite?
The test results, the method, and the limits. Updated every time the scorer changes.
Every AI marking tool claims it's accurate. Almost none of them show you how they checked. I'm a data analyst by profession, so EurekaWrite gets the same treatment I would demand from any model at work: a fixed test set with known answers, numbers published, limitations stated in plain language.
How we test the scorer
The core of our quality control is a fixed test set we call the golden dataset: 42 essays, every one of them already marked by humans before our AI ever saw them.
| Group | Essays | Source | What it checks |
|---|---|---|---|
| NSW released samples | 8 | Sample writing responses publicly released for the NSW Selective test, with marker commentary | Same rubric, same 25-mark scale. Can we match the published mark? |
| New Zealand exemplars | 18 | e-asTTle published writing exemplars, marked by trained markers | A different rubric conservatively mapped to ours. Do we rank essays the same way their markers did? |
| US Grade 5 samples | 16 | Published state assessment writing samples (Florida) | Weak and average essays of the kind NSW rarely publishes. Do we catch weak writing instead of over-praising it? |
Why the overseas essays? NSW publishes very few marked scripts, and almost all of them are strong. Testing a scorer only on strong essays tells us nothing about how it handles weak ones. The New Zealand and US sets give us officially marked weak and average scripts, with their scales mapped to ours. For those groups we hold the scorer to a direction test (does it place the essay in the right band range?) rather than an exact-mark test, because cross-system mapping is approximate by nature.
This test runs before every change to the scoring system: a new prompt, a new model, a rubric adjustment. If the numbers get worse, the change doesn't ship.
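
For readers who want to see the shape of that check, here is a minimal sketch of a direction test plus a release gate. It is an illustration under assumptions, not our production code: the names `GoldenEssay`, `passes_direction_test`, and `gate_allows_release` are hypothetical, and the gate is simplified to "pass at least as many essays as the current version".

```python
from dataclasses import dataclass


@dataclass
class GoldenEssay:
    essay_id: str
    expected_low: int   # lowest acceptable band after mapping to our scale (assumed field)
    expected_high: int  # highest acceptable band after mapping to our scale (assumed field)


def passes_direction_test(essay: GoldenEssay, predicted_band: int) -> bool:
    """True when the predicted band falls inside the expected band range."""
    return essay.expected_low <= predicted_band <= essay.expected_high


def gate_allows_release(baseline_passes: int, candidate_passes: int) -> bool:
    """Simplified gate: a candidate ships only if it passes at least as many essays."""
    return candidate_passes >= baseline_passes


# Illustrative values only, not real test data.
essay = GoldenEssay("nz-example", expected_low=2, expected_high=3)
print(passes_direction_test(essay, 3))   # → True
print(gate_allows_release(18, 17))       # → False
```

A range check is deliberately looser than an exact-mark check, which suits essays whose original marks come from a different rubric.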
The numbers
Last full run: 12 June 2026, scoring engine v008 on GPT-4.1 mini (the version scoring your child's essay today).
On the 8 publicly released NSW sample responses
| What we measure | Result |
|---|---|
| Average gap to the published total (out of 25) | 1.9 marks |
| Average gap on Set A: content, structure, style (out of 15) | 0.9 marks |
| Average gap on Set B: sentences, punctuation, spelling (out of 10) | 1.0 marks |
| Overall band exactly right | 4 of 8 |
| Overall band within one band of the published band | 8 of 8 |
| Average gap per dimension (each out of 5 or less) | 0.35 marks |
One pattern worth knowing: every band miss in this run went the same direction. The scorer called four published Band 6 scripts a high Band 5; it never lifted a weaker script into a higher band. If EurekaWrite errs, it errs slightly tough on the very strongest writing rather than flattering the rest.
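
"Average gap" here means the mean absolute difference between the published mark and ours. The sketch below shows that arithmetic and a simple count of which way band misses went. The function names are hypothetical and the sample numbers are invented for illustration; they are not the eight NSW results.

```python
def mean_absolute_gap(published, predicted):
    """Average of |published - predicted| over paired marks."""
    if len(published) != len(predicted) or not published:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - q) for p, q in zip(published, predicted)) / len(published)


def miss_directions(published_bands, predicted_bands):
    """Count band misses where the scorer landed lower vs higher than published."""
    pairs = list(zip(published_bands, predicted_bands))
    return {
        "scored_lower": sum(1 for p, q in pairs if q < p),
        "scored_higher": sum(1 for p, q in pairs if q > p),
    }


# Invented marks out of 25: gaps are 2, 0 and 1.
print(mean_absolute_gap([22, 18, 20], [20, 18, 21]))  # → 1.0
print(miss_directions([6, 6, 5], [5, 6, 5]))
# → {'scored_lower': 1, 'scored_higher': 0}
```

Using the absolute difference means a script scored 2 marks low and another scored 2 marks high do not cancel out to a flattering zero.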
Across the full 42-essay set
| What we measure | Result |
|---|---|
| Evidence quotes verified to exist in the essay | 100% |
| NZ exemplars placed in the expected band range | 18 of 18 |
| US samples placed in the expected band range | 16 of 16 |
| Re-scoring stability (5 essays scored 3 times each) | max spread 2 marks |
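
The stability figure is the largest difference between the highest and lowest total any single essay received across its repeat scorings. A minimal sketch, with a hypothetical function name and invented totals:

```python
def max_spread(runs_by_essay):
    """Largest (max - min) total across repeated scorings of the same essay."""
    return max(max(totals) - min(totals) for totals in runs_by_essay.values())


# Invented totals out of 25 for two essays, each scored three times.
runs = {"essay_a": [18, 19, 18], "essay_b": [21, 21, 23]}
print(max_spread(runs))  # → 2
```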
How the system keeps itself honest
- Quote-verified feedback. Every piece of evidence must be an exact quote from your child's essay. Our code checks each quote actually appears in the submitted text. Failed quotes trigger one repair pass; anything still unverified is flagged for review, never silently displayed.
- Consistency by design. The scorer runs at a low randomness setting, and we test stability by re-scoring the same essays multiple times before release.
- Everything is versioned. Every score stores the exact prompt version, rubric version, and model version that produced it. When the scorer improves, we can tell you exactly what changed and when.
- A regression gate, not a vibe check. The 42-essay test ran before each of the eight prompt versions we've shipped. Versions that scored worse never reached you.
- No ghostwriting. Feedback shows how to improve a sentence or a paragraph. It never writes the essay for your child. That's a deliberate product rule, not a technical limitation.
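
The quote check in the first point above can be sketched roughly as follows. This is an illustration, not our production code: the function names are hypothetical, `repair` stands in for whatever produces corrected quotes, and collapsing whitespace before matching is an assumption made here so that line breaks do not cause false failures.

```python
import re


def normalise(text: str) -> str:
    """Collapse whitespace runs (an assumption, so line breaks don't fail a match)."""
    return re.sub(r"\s+", " ", text).strip()


def unverified_quotes(essay: str, quotes: list[str]) -> list[str]:
    """Return the quotes that are empty or do not appear in the essay text."""
    haystack = normalise(essay)
    return [q for q in quotes if not normalise(q) or normalise(q) not in haystack]


def review_quotes(essay, quotes, repair):
    """One repair pass for failed quotes; anything still failing is flagged."""
    failed = unverified_quotes(essay, quotes)
    if not failed:
        return {"verified": list(quotes), "flagged": []}
    repaired = repair(failed)  # hypothetical callback, e.g. a second model call
    still_failed = unverified_quotes(essay, repaired)
    verified = [q for q in quotes if q not in failed]
    verified += [q for q in repaired if q not in still_failed]
    return {"verified": verified, "flagged": still_failed}


essay = "The dog ran\nquickly across the field."
print(unverified_quotes(essay, ["ran quickly", "the cat"]))  # → ['the cat']
```

The matching is case-sensitive on purpose in this sketch: a quote that only matches after changing letters is no longer an exact quote.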
What these numbers do not mean
- This is not an official score. EurekaWrite has no affiliation with the NSW Department of Education or any test provider. A practice score is guidance, not a prediction of placement.
- Eight released scripts is a small sample. That's all NSW has made public. It's why we report "about 1.9 marks average gap on those 8 essays" instead of quoting a confident-sounding percentage.
- Human markers vary too. In the real test, scripts are seen by more than one marker precisely because trained humans differ by a mark or two. AI variance is the same kind of uncertainty, and we publish ours.
- A practice score is a snapshot. Test-day performance moves with nerves, the prompt, and the day. Use the trend across several essays, not any single number.
Frequently asked questions
Is EurekaWrite an official NSW marking service?
No. EurekaWrite is an independent practice tool and is not affiliated with the NSW Department of Education or any test provider. Scores are practice guidance calibrated against publicly available marked essays, not predictions of an official result.
Why does the same essay sometimes get a slightly different mark?
AI scoring has a small amount of natural variance, the same way two human markers can differ by a mark or two. We run the scorer at a low randomness setting and test stability by re-scoring the same essays multiple times; in the last full run, the largest spread across three scorings of the same essay was 2 marks. A 1-mark difference between runs is noise; a full band difference is signal.
How do you stop the AI from making up feedback?
Every piece of evidence in a report must be an exact quote from the student's essay. After the AI responds, our code checks that each quote actually appears in the submitted text. Quotes that fail the check trigger one repair pass, and anything still unverified is flagged for review rather than shown as fact.
What AI model does EurekaWrite use?
EurekaWrite currently runs on OpenAI's GPT-4.1 mini, wrapped in a scoring pipeline calibrated over eight prompt versions against the test set above. The model receives a structured rubric and must return scores within hard boundaries, with quote-backed evidence for every dimension. For how the rubric itself works, see our marking criteria guide.
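
To make "hard boundaries" concrete, here is a minimal sketch of a response check. It assumes a simple dictionary layout and per-dimension caps that are hypothetical: only the set totals (15 for Set A, 10 for Set B) come from this page, and the split across individual dimensions below is invented for illustration.

```python
# Hypothetical caps; only the 15-mark and 10-mark set totals are stated on this page.
MAX_MARKS = {
    "content": 5, "structure": 5, "style": 5,        # Set A, sums to 15
    "sentences": 4, "punctuation": 3, "spelling": 3,  # Set B, sums to 10
}


def validate_scores(scores: dict) -> list[str]:
    """Return a list of problems; an empty list means the response is acceptable."""
    problems = []
    for dim, cap in MAX_MARKS.items():
        if dim not in scores:
            problems.append(f"missing dimension: {dim}")
            continue
        entry = scores[dim]
        mark = entry.get("mark")
        is_number = isinstance(mark, (int, float)) and not isinstance(mark, bool)
        if not is_number or not 0 <= mark <= cap:
            problems.append(f"{dim}: mark {mark!r} outside 0..{cap}")
        if not entry.get("evidence"):
            problems.append(f"{dim}: no evidence quote")
    return problems
```

A response that returns any problems would be rejected or retried rather than shown; how that retry works is outside this sketch.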
Questions about the methodology? I read every email: eurekawrite@haorix.com. The story of why I built this is on the About page.