States of Learning

These are papers I have been reading recently to inform my approach to modelling knowledge states. I update this list regularly.

I. Foundational Educational Research

The 2 Sigma Problem & Tutoring Effectiveness

[1] Bloom, B.S. (1984). The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educational Researcher, 13(6), 4–16. The foundational study: one-to-one tutoring produces a 2 SD improvement; the average tutored student performs above the 98th percentile of a conventional class

[2] Kraft, M.A., Schueler, B.E., & Falken, G. (2024). What impacts should we expect from tutoring at scale? Exploring meta-analytic generalizability. EdWorkingPaper 24-1031. Effect sizes from controlled research often do not replicate in real-world deployment

Misconceptions & Conceptual Change

[3] Smith, J.P., diSessa, A.A., & Roschelle, J. (1994). Misconceptions reconceived: A constructivist analysis of knowledge in transition. Journal of the Learning Sciences, 3(2), 115–163. Foundational: misconceptions are not gaps but actively held frameworks that resist correction

[4] diSessa, A.A. (1993). Toward an epistemology of physics. Cognition and Instruction, 10(2–3), 105–225. Knowledge-in-Pieces (KiP) framework; p-prims as fragmentary, context-dependent knowledge structures

[5] Rittle-Johnson, B., Siegler, R.S., & Alibali, M.W. (2001). Developing conceptual understanding and procedural skill in mathematics: An iterative process. Journal of Educational Psychology, 93(2), 346–362. Procedural-without-conceptual knowledge: the mechanism by which fragile understanding forms

Cognitive Load Theory

[6] Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. Foundational paper establishing cognitive load theory

II. AI Tutoring: Evidence & Risks

Randomised Controlled Trials: Positive Effects

[7] Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2025). AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting. Scientific Reports, 15, 17458. Harvard RCT (N=194); effect sizes 0.73–1.3 SD for AI tutoring vs. active learning

[8] LearnLM Team, Google DeepMind & Eedi (2025). AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms. arXiv:2512.23633. N=165; five UK secondary schools; LearnLM matched or exceeded human tutors on immediate outcomes; 5.5pp better knowledge transfer. Explicitly calls for longitudinal follow-up

AI Harm & Cognitive Offloading

[9] Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakci, O., & Mariman, R. (2025). Generative AI without guardrails can harm learning: Evidence from high school mathematics. PNAS, 122(26), e2422633122. RCT: unguarded AI tutoring significantly worsened unassisted performance; guardrailed AI did not. Key evidence for the hollow shell problem

[10] Kosmyna, N., Hauptmann, E., Yuan, Y.T., Situ, J., Liao, X.H., Beresnitzky, A.V., Braunstein, I., & Maes, P. (2025). Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv:2506.08872. Neuroimaging evidence of cognitive debt: measurable cortical disengagement accumulating with repeated AI-assisted writing

Methodological Critiques

[11] Weidlich, J., Gašević, D., Drachsler, H., & Kirschner, P. (2025). ChatGPT in education: An effect in search of a cause. Journal of Computer Assisted Learning, 41(5), e70105. Audit of the 19 studies included in a meta-analysis: only 4 met minimum methodological criteria. We observe effects but lack mechanistic understanding

[12] Jurenka, I. et al. (LearnLM Team) (2024). Towards responsible development of generative AI for education: An evaluation-driven approach. arXiv:2407.12687. Google DeepMind's framework for LearnLM: the five pedagogical principles (active learning, cognitive load, adaptation, curiosity, metacognition)

[13] Gillick, D. (2025). AI tutors should not approximate human tutors. AI Policy Perspectives. Google DeepMind team member argues AI tutors should do what AI uniquely can, rather than approximate human dialogue

III. Knowledge Tracing & Learner State Modelling

Foundational Knowledge Tracing

[14] Piech, C., Spencer, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., & Sohl-Dickstein, J. (2015). Deep Knowledge Tracing. NeurIPS 2015. Foundational paper applying deep learning (LSTM) to student knowledge state modelling

Longitudinal & Long-Sequence Modelling

[15] Gao, A. & Liu, Z. (2025). Long sequence temporal knowledge tracing for student performance prediction via integrating LSTM and Informer. PLOS One, 20(9), e0330433. "Long-sequence modelling is more valuable than short sequence prediction but underexplored." Direct evidence for the longitudinal gap

[16] Liu, H. et al. (2025). Advancing Knowledge Tracing by Exploring Follow-up Performance Trends (FINER). arXiv:2508.08019. Forward-looking KT using follow-up trends outperforms 10 SOTA KT methods by 8.74–84.85%

Personalised & Structured Knowledge Tracing

[17] Eedi / Oxford University Press (2026). Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs. arXiv:2603.02830. Specialised KT models outperform LLMs on longitudinal prediction: 600–12,000× cheaper, higher accuracy. Core empirical support for LearnOS architecture choice

[18] Tran, C. et al. (2024). Towards modelling learner performance with large language models. EDM 2024. Zero-shot LLMs fail at KT; fine-tuned LLMs only match Bayesian KT baselines, not state-of-the-art specialised models

Systematic Reviews

[19] Preprints.org Authors (2025). A systematic review of Deep Knowledge Tracing (2015–2025): Toward responsible AI for education. Preprints.org, October 2025. 1,047 articles reviewed. Only 3.6% assess sequential stability; only 11.9% include interpretability measures. Real-world adoption remains limited

IV. Temporal Modelling Architecture

State Space Models

[20] Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modelling with selective state spaces. arXiv:2312.00752. Selective SSM with input-dependent state transitions; linear time complexity vs. quadratic for transformers. Core architectural choice for LearnOS temporal layer

V. AI in Education: Policy & System-Level

[21] UNESCO (2023). Guidance for generative AI in education and research. UNESCO Publications. <10% of schools globally have formal AI guidance; 40% communicated policies only verbally. First global policy framework

[22] McKinsey Global Institute (2024). A new future of work: The race to deploy AI and raise skills in Europe and beyond. McKinsey Global Institute. Up to 12 million occupational transitions required in Europe by 2030, double pre-pandemic pace; demand for technological skills up 25%

[23] Kasneci, E. et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. Broad review of LLM opportunities and challenges in education

VI. Misconception Detection & Educational Datasets

[24] Eedi / The Learning Agency Lab (2024). MAP - Charting student math misunderstandings (Kaggle Competition). Kaggle, 2024. 1,850+ teams; MAP@25 >0.94 achieved. Demonstrated that contrastive learning on 1,800 real + 10,000 synthetic examples reaches high-quality misconception mapping

[25] ASSISTments Dataset (multiple releases) (2004–2019). ASSISTments knowledge tracing benchmark dataset. Carnegie Mellon / WPI. Benchmark dataset used in 82.1% of published DKT studies. Essential for credible comparison

[26] Harrison, W., Dobson, E., Higgins, S., Uwimpuhwe, G., & Khowaja, R. (2025). Eedi 2024 impact report: A study to evaluate the effectiveness of Eedi on raising attainment in mathematics at KS3 (Year 7). WhatWorked Education. Students on Eedi experience the equivalent of two additional months of academic progress; impact doubles for highly engaged students

VII. Cognitive Science of Learning

[27] Brown, J.S. et al. (2013). Dynamic systems perspectives on thinking and learning. Frontiers in Education. Misconceptions as emergent patterns of varying stability rather than stable objects students "have." Directly relevant to the KiP debate