States of Learning

Rethinking how we detect learning gaps

On why educational AI detects failure too late, and what detecting it earlier would require

Picture this: the teacher returns carefully graded exams to the class. A student who seemed to be keeping up, who completed the homework, who raised a hand occasionally and gave reasonable answers, has failed. And looking back, the signs were there: a slight uncertainty three weeks ago, a pattern of errors that seemed random at the time but now, with hindsight, looks systematic. The understanding was becoming fragile long before the exam, yet nobody caught it.

This is not a failure of teaching, but rather a failure of the instrument. The instruments available to most teachers (tests at defined intervals, classroom observation, homework completion) are all retrospective. They reveal what has already happened, not what is developing.

The question that interests me is whether this is a structural limitation or a solvable problem. I think it is solvable, but solving it requires being precise about why current approaches - including the most advanced AI tutoring systems - still fail to catch the problem early.

Why AI tutoring does not solve this

The evidence for AI tutoring is genuinely encouraging. A 2025 randomised controlled trial at Harvard found effect sizes between 0.73 and 1.3 standard deviations for AI tutoring versus active classroom learning.¹ A 2025 Google DeepMind study found that supervised AI tutoring matched or exceeded human tutors on immediate learning outcomes across five UK secondary schools.² These are real results. The technology works, in the short term, under specific conditions. But notice what neither study measures: what happens to those students six weeks later? Had the misconception that was apparently resolved in the tutoring session actually been resolved, or had it merely been suppressed, quietly reforming? The studies cannot measure this because the systems they evaluated have no longitudinal model of the learner: every session begins fresh. There is no way to look at what has happened over time and ask whether the trajectory is heading somewhere good.

A 2025 paper by Weidlich and colleagues in the Journal of Computer Assisted Learning made this point with precision.³ Auditing a widely cited meta-analysis claiming a large positive effect of AI on learning, they found that the vast majority of included studies lacked the methodological structure to draw causal conclusions, because they were measuring performance during AI use, not durable understanding after it. The field is observing effects it cannot yet explain, because it has no model of what is happening inside the learner over time.

The three things you cannot see without longitudinal data

What detecting it earlier would require

The instrument that would catch these three problems earlier is not a better test or a more frequent test. It is a different kind of representation entirely, one that tracks not what a student scored, but how their understanding is structured and how that structure is changing. One that is updated continuously, not at defined intervals. One that is trained to recognise the temporal signature of developing fragility - the specific patterns in longitudinal interaction data that precede performance failure - across thousands of learners with months of outcome data to learn from.
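
To make that concrete, here is one way such a representation could be sketched: a recurrent model that reads a learner's session history and emits a risk score after every session, supervised by a later assessment outcome. This is a minimal illustration under stated assumptions, not a description of any existing system; the feature set, model size, and training signal are all invented for the example.

```python
# A minimal sketch, not any deployed system: a recurrent model that reads a
# learner's session history and emits a fragility risk score after every
# session, supervised by a later assessment outcome. Feature names,
# dimensions, and the loss setup are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FragilityTracker(nn.Module):
    """GRU over per-session interaction features; one risk logit per session."""

    def __init__(self, n_features: int = 8, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, sessions: torch.Tensor) -> torch.Tensor:
        # sessions: (batch, n_sessions, n_features), e.g. accuracy, response
        # latency, hint usage, and error-type mix for each tutoring session.
        states, _ = self.encoder(sessions)      # (batch, n_sessions, hidden)
        return self.head(states).squeeze(-1)    # (batch, n_sessions) logits


if __name__ == "__main__":
    model = FragilityTracker()
    history = torch.randn(32, 20, 8)             # 32 learners, 20 sessions each
    risk = torch.sigmoid(model(history))         # per-session risk in [0, 1]

    # Assumed training target: did the learner fail the later assessment?
    outcome = torch.randint(0, 2, (32,)).float()
    # Supervising every time step with the eventual outcome rewards the model
    # for flagging fragility weeks before the exam, not only at the end.
    loss = F.binary_cross_entropy(risk, outcome.unsqueeze(1).expand_as(risk))
    loss.backward()
    print(risk.shape, float(loss))
```

The design choice that matters here is the per-session output: a model supervised this way is rewarded for raising the flag in October rather than in the week of the exam.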

This is a harder thing to build than a tutoring system. It requires longitudinal data that most educational platforms do not yet have in usable form. It requires an architecture specifically designed for temporal sequence modelling rather than dialogue generation. And it requires the institutional partnerships that would allow such a system to be trained on real learners in real schools over real time, which is a governance and trust problem as much as a technical one. But it is buildable and the technical primitives exist. The data is there, in school databases across Europe, largely unanalysed. The question is whether anyone builds the infrastructure to use it - before the exam comes back, again, revealing what could have been caught in October.
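
Concretely, "usable form" might look something like the record below: each interaction timestamped, tied to the concept being exercised, and joinable months later to an outcome. The field names are assumptions chosen for illustration, not a description of any existing platform's schema.

```python
# An illustrative sketch of what "usable form" could mean: one timestamped
# record per learner interaction, tied to a concept and joinable, months
# later, to an assessment outcome. Field names are assumptions, not any
# platform's actual schema.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class InteractionRecord:
    learner_id: str                         # pseudonymous, stable across months
    timestamp: datetime                     # when the interaction happened
    concept_id: str                         # which skill or concept was exercised
    correct: bool                           # whether the response was correct
    response_time_s: float                  # latency, a cheap signal of uncertainty
    hints_used: int                         # scaffolding consumed before answering
    outcome_label: Optional[float] = None   # later assessment result, joined retrospectively


# Example: a single record from a hypothetical October session.
record = InteractionRecord(
    learner_id="learner-0042",
    timestamp=datetime(2025, 10, 3, 9, 15),
    concept_id="fractions.addition",
    correct=False,
    response_time_s=41.0,
    hints_used=2,
)
```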

¹ Kestin, G. et al. (2025). AI tutoring outperforms in-class active learning: an RCT. Scientific Reports, 15, 17458.
² LearnLM Team & Eedi (2025). AI tutoring can safely and effectively support students: an exploratory RCT in UK classrooms. arXiv:2512.23633.
³ Weidlich, J., Gašević, D., Drachsler, H., & Kirschner, P. (2025). ChatGPT in education: An effect in search of a cause. Journal of Computer Assisted Learning, 41(5), e70105.
⁴ Bastani, H. et al. (2025). Generative AI without guardrails can harm learning. PNAS, 122(26), e2422633122.
⁵ Kosmyna, N. et al. (2025). Your brain on ChatGPT: Accumulation of cognitive debt. arXiv:2506.08872.