Towards a formal representation of learner knowledge state
On the gap between measuring learning and modelling it, and the architectural choices that gap implies
There is a question that sounds simple and turns out to be surprisingly hard: what does it mean to know something?
Not to have encountered it, or to be able to reproduce it under familiar conditions, but to genuinely know it - in the sense that the knowledge is stable, connected to other knowledge, applicable under novel conditions, and durable under the pressure of time and competing demands. This is the question that every assessment system is trying to answer, and that most assessment systems answer imperfectly.
The gap between measuring learning and modelling it is where most educational technology currently lives. Measurement asks: what can this student produce right now? Modelling asks something harder: what is the current structure of this student's understanding, how is it changing, and what does that trajectory predict? These are not the same question. The instruments adequate to the first are not adequate to the second. And the architectural choices required to build systems capable of the second are substantially different from the choices made by systems optimised for the first.
The three things a test score conceals
A test score tells you something real: a student who consistently scores 85% on fraction problems knows more about fractions than a student who scores 40%. This is useful information. But it systematically conceals three things that matter at least as much.
The first is the stability of understanding. Two students with identical scores on fraction problems may have qualitatively different knowledge states. One has a genuine conceptual grasp of fractions as part-whole relationships and multiplicative structures: knowledge that will transfer when fractions appear in unfamiliar contexts, in algebraic manipulation, in statistical reasoning. The other has learned to execute standard fraction algorithms correctly in standard formats without the underlying conceptual foundation - a condition the cognitive science literature describes as procedural-without-conceptual knowledge.¹ Smith, diSessa, and Roschelle (1994) demonstrated that such procedural knowledge is not merely incomplete but actively interfering: it generates systematic error patterns when the surface structure of problems changes, and it resists correction precisely because the student cannot locate the source of their errors in any conscious belief they hold.² A test score cannot distinguish these two students. Their trajectories are radically different.
The second is the cascade risk. Mathematical knowledge is hierarchically structured: concepts have dependency relationships such that genuine understanding of later concepts requires not merely exposure to but conceptual grasp of prior ones. A student with procedural-without-conceptual knowledge of fractions who encounters ratios, then percentages, then proportional reasoning in statistics, will apply the same surface procedures at each stage without the underlying relational understanding. The fragility compounds at each node of the dependency graph. By the time it becomes visible in assessment data, it has propagated through three or four dependent concepts and remediation requires reconstructing the entire conceptual chain, a process that takes months if it succeeds at all.
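The compounding can be made concrete with a toy calculation (the probabilities are illustrative assumptions, not empirical estimates): if genuine grasp of each concept requires genuine grasp of its prerequisite, a fragile foundation decays multiplicatively along the chain.

```python
def cascade_mastery(p_foundation, p_conditional, chain):
    """Probability of genuine grasp at each node of a prerequisite chain,
    assuming grasp of a concept requires grasp of its predecessor.
    All probabilities here are illustrative, not measured."""
    probs = {}
    p = p_foundation
    for concept in chain:
        probs[concept] = p
        p *= p_conditional  # fragility compounds at each dependent node
    return probs

chain = ["fractions", "ratios", "percentages", "proportional reasoning"]
probs = cascade_mastery(p_foundation=0.6, p_conditional=0.6, chain=chain)
# a 0.6 chance of genuine grasp at the foundation has decayed to roughly
# 0.13 three nodes downstream, before any assessment has flagged a problem
```

The multiplicative decay is the quantitative face of the qualitative claim above: by the time the failure is visible at the end of the chain, it is no longer local.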
The third is the trajectory. A student who scored 70% three weeks ago and 65% today is not in the same situation as a student who scored 70% three weeks ago and 75% today, even if both score 70% this week. The direction and rate of change of a knowledge state carries predictive information that the point estimate alone does not. A declining trajectory on a foundational concept, even from a comfortable baseline, predicts future difficulty in a way that a stable trajectory at the same level does not. Point-in-time assessment, however frequent, cannot recover this information. The trajectory is in the change, and the change is only legible from longitudinal data.
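The difference between the two students above is recoverable from nothing more than a least-squares slope over timestamped scores - a deliberately crude sketch of the trajectory signal that point-in-time assessment discards:

```python
def score_slope(observations):
    """Least-squares slope of scores over time.
    observations: list of (weeks_elapsed, score) pairs."""
    n = len(observations)
    mean_t = sum(t for t, _ in observations) / n
    mean_s = sum(s for _, s in observations) / n
    num = sum((t - mean_t) * (s - mean_s) for t, s in observations)
    den = sum((t - mean_t) ** 2 for t, _ in observations)
    return num / den

# both students sit at 70% at week 0; the slopes tell different stories
declining = [(0, 70), (2, 68), (3, 65)]
improving = [(0, 70), (2, 72), (3, 75)]
```

A real system would model far richer temporal features than a linear slope, but even this caricature separates two students that any single score collapses together.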
From measurement to modelling: what a knowledge state representation requires
The instrument that would capture what measurement misses is not a better test. It is a fundamentally different kind of representation - one that tracks not what a student scored at a moment but how their understanding is currently structured and how that structure is evolving.
Formally, such a representation must satisfy four properties. It must be continuous - updated with each new interaction rather than sampled at defined intervals, because the trajectory information is in the rate of change and discretising time destroys it. It must be structured rather than scalar, because understanding is not a single quantity but a graph of connected concepts with varying degrees of consolidation, and the topology of that graph carries information that any scalar reduction loses. It must be longitudinal — trained on interaction sequences of sufficient temporal depth that the model can learn what the temporal signature of developing fragility looks like before observing the failure it predicts. And it must be predictive in a specific sense: not merely predictive of next-question performance, but predictive of future performance on dependent concepts not yet encountered in the current interaction history.
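As a minimal sketch, the four properties suggest a state representation shaped roughly like the following - all field names and the update rule are hypothetical, and a real model would learn the update rather than hard-code it:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeState:
    """A structured learner state: per-concept estimates on a prerequisite
    graph, timestamped and updated with every interaction (hypothetical)."""
    mastery: dict = field(default_factory=dict)         # concept -> estimate in [0, 1]
    last_practised: dict = field(default_factory=dict)  # concept -> timestamp
    prerequisites: dict = field(default_factory=dict)   # concept -> prerequisite concepts

    def update(self, concept, correct, timestamp, rate=0.2):
        """Continuous update: nudge the estimate toward the observed outcome.
        A stand-in for a learned state transition, not a proposal for one."""
        prior = self.mastery.get(concept, 0.5)
        self.mastery[concept] = prior + rate * (float(correct) - prior)
        self.last_practised[concept] = timestamp

state = KnowledgeState(prerequisites={"ratios": ["fractions"]})
state.update("fractions", correct=True, timestamp=1)
```

Even this toy structure exhibits the contrast with a scalar score: the state is a graph with per-node estimates and per-node recency, not a single number.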
The deep knowledge tracing literature has made substantial progress on the first three properties. Models from DKT (Piech et al., 2015) through AKT (Ghosh et al., 2020) and SAINT+ (Shin et al., 2021) have progressively improved prediction of next-question correctness on held-out test sets.³ But the systematic review of DKT research covering 2015–2025 found that only 3.6% of studies assessed sequential stability of knowledge estimates over time, and only 11.9% included interpretability measures sufficient for a practitioner to act on the model's outputs.⁴ The field has optimised for short-horizon prediction accuracy on benchmark datasets while the fourth property - longitudinal predictive validity across dependent concepts in real curricula - remains largely unaddressed.
The architectural question: why the choice of sequence model matters
The requirement for longitudinal, structured, continuously updated learner representations is not neutral with respect to architectural choice. Different sequence modelling architectures make different tradeoffs that bear directly on which of the four properties above they can satisfy, and at what computational cost.
Transformer-based knowledge tracing models (AKT, SAINT+, and their successors) use self-attention mechanisms that, in principle, can attend to any prior interaction in the student's history when updating the knowledge state estimate. This global attention is architecturally attractive for modelling long-range dependencies - the relationship between a student's errors on fractions six months ago and their current performance on algebraic fractions, for example. However, standard transformer attention scales quadratically with sequence length: processing a student's full interaction history over an academic year, which might comprise several thousand interactions, becomes computationally prohibitive. In practice, transformer-based KT models are evaluated on sequences of 50–200 interactions, which corresponds roughly to a few weeks of engagement - far shorter than the longitudinal depth required for the predictive task described above. Attempts to extend the usable context through sparse attention mechanisms (Longformer- and BigBird-style patterns) recover some of that length, but only by giving up the full global attention that motivated the architecture in the first place.
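A rough back-of-envelope comparison makes the scaling gap concrete; the dimensions and sequence lengths below are illustrative, not benchmarked:

```python
def attention_flops(n, d):
    """Approximate cost of one full self-attention pass: O(n^2 * d)."""
    return n * n * d

def recurrent_flops(n, d):
    """Approximate cost of processing the same sequence step by step
    with a fixed-size state: O(n * d^2)."""
    return n * d * d

d = 256                  # illustrative model dimension
short, year = 200, 5000  # a few weeks of interactions vs. an academic year

ratio_attn = attention_flops(year, d) / attention_flops(short, d)
ratio_rnn = recurrent_flops(year, d) / recurrent_flops(short, d)
# growing the sequence 25x grows attention cost 625x, recurrence only 25x
```

The constants matter in practice, but the asymptotics alone explain why published transformer KT evaluations stop at a few hundred interactions.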
Recurrent architectures - LSTM and GRU-based models, the basis of the original DKT - handle long sequences more efficiently by compressing prior history into a fixed-dimensional hidden state that is updated at each time step. The computational cost of an update is constant regardless of sequence length, which is the property required for deployment at scale. The limitation is that the compression is lossy in ways determined by the training objective rather than by the structure of the knowledge domain: the hidden state may preserve information about recent performance well while losing information about older interactions that are nonetheless diagnostically relevant. A further limitation is that knowledge consolidation and decay have specific temporal signatures that depend on the interval since last practice (forgetting-curve dynamics), and these are difficult to encode explicitly in a standard recurrent architecture.
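The constant-cost property can be seen in a deliberately minimal recurrent step, with toy dimensions and made-up fixed weights standing in for learned ones:

```python
import math

def step(h, x, W_h, W_x):
    """One recurrent update: new hidden state from old state and current
    input. Cost is O(d^2) per step, independent of how many interactions
    have already been folded into h - which is also why h is lossy."""
    d = len(h)
    return [math.tanh(sum(W_h[i][j] * h[j] for j in range(d))
                      + sum(W_x[i][j] * x[j] for j in range(len(x))))
            for i in range(d)]

d, k = 4, 2                             # toy state and input dimensions
W_h = [[0.1] * d for _ in range(d)]     # illustrative fixed weights;
W_x = [[0.5] * k for _ in range(d)]     # a trained model would learn these

h = [0.0] * d
for interaction in [[1, 0], [1, 1], [0, 1]]:  # e.g. (attempted, correct) flags
    h = step(h, interaction, W_h, W_x)  # same cost at step 3 as at step 3000
```

Whatever the third update preserves about the first interaction is whatever survived two rounds of compression - the architecture gives no guarantee that diagnostically relevant history survives.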
State space models (SSMs), and specifically the selective SSM architecture Mamba (Gu & Dao, 2023), offer a third option that addresses several of the limitations above.⁵ SSMs represent the learner state as a fixed-dimension vector updated by a learned linear recurrence at each time step. The key property of Mamba's selective SSM is that the parameters of the state transition are input-dependent. The model learns to selectively retain or discard information from the current input based on its content, rather than applying a fixed compression rule. This allows the model to implement something closer to principled selective attention over the interaction history without the quadratic cost of transformer attention. For the learner state modelling task, this is attractive: the model can learn that certain types of interaction (a novel problem variant answered without hints, a specific error pattern recurring after a period of apparent resolution) are more diagnostically significant than others, and weight them accordingly in the state update, while processing sequences of arbitrary length at constant cost.
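A scalar caricature of the selective mechanism illustrates the input-dependent update; the sigmoid gates below are an assumption made for brevity, not Mamba's actual discretised state-space parameterisation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def selective_update(h, x, w_a, w_b):
    """One selective state-space step on a scalar state: both the retention
    gate a and the write gate b are functions of the CURRENT input x, so
    the model chooses what to keep based on content, at constant cost."""
    a = sigmoid(w_a * x)  # input-dependent retention of the prior state
    b = sigmoid(w_b * x)  # input-dependent write strength for this input
    return a * h + b * x

# illustrative per-interaction diagnostic signal; w_a, w_b would be learned
h = 0.0
for x in [0.2, 0.9, 0.1]:
    h = selective_update(h, x, w_a=2.0, w_b=1.0)
```

The contrast with the fixed-weight recurrent step is the whole point: here the compression rule itself varies with the input, so a rare, diagnostically loaded interaction can be retained strongly while routine ones are allowed to fade.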
The FINER paper (Liu et al., 2025) demonstrated a related insight from a different direction: incorporating forward-looking performance trends - what happens to a student's performance in the weeks following a given interaction - into the training signal improves prediction accuracy on six real-world datasets by 8.74% to 84.85% over state-of-the-art KT baselines.⁶ The magnitude of the improvement reflects how much predictive information is present in temporal patterns that standard session-level evaluation cannot access.

Graph neural network components are a natural complement to the temporal backbone for the structured knowledge representation property. The dependency structure of a curriculum — fractions → ratios → proportional reasoning → algebraic fractions — can be represented as a directed graph, with concepts as nodes and prerequisite relationships as edges. Embedding the learner state in this graph, rather than in an unstructured vector space, means that uncertainty about a student's grasp of a foundational concept propagates forward through the graph to dependent concepts in a way that is structurally grounded rather than learned implicitly from co-occurrence statistics. Structure-aware knowledge tracing models (SKT, SINKT) have demonstrated that incorporating the knowledge graph into the state representation improves prediction accuracy and interpretability relative to graph-agnostic baselines.⁷
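The propagation idea can be sketched with a toy prerequisite graph and a noisy-OR combination rule - both the graph and the rule are illustrative assumptions, not what SKT or SINKT actually implement:

```python
# Hypothetical curriculum fragment: concept -> its prerequisite concepts
graph = {
    "fractions": [],
    "ratios": ["fractions"],
    "proportional reasoning": ["ratios"],
    "algebraic fractions": ["fractions", "proportional reasoning"],
}

def propagate_fragility(local_fragility, graph):
    """Effective fragility of each concept: its own local estimate combined
    (noisy-OR) with the propagated fragility of every prerequisite, so a
    weak foundation surfaces downstream through the graph structure."""
    effective = {}
    def fragility(c):
        if c in effective:
            return effective[c]
        f = local_fragility.get(c, 0.0)
        for p in graph[c]:
            f = 1 - (1 - f) * (1 - fragility(p))  # any weak prerequisite weakens c
        effective[c] = f
        return f
    for c in graph:
        fragility(c)
    return effective

# only "fractions" has an observed local problem, yet every dependent
# concept inherits elevated effective fragility through the graph
eff = propagate_fragility({"fractions": 0.5}, graph)
```

This is structural grounding in miniature: the downstream estimates are elevated because of the graph's topology, not because the model happened to see correlated errors in training data.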
The training signal problem
The most important and least discussed challenge for longitudinal learner state modelling is not architectural but data-related, and it is worth being precise about why.
Training a model to predict the future stability or fragility of a student's understanding requires training examples where both the trajectory signal and the subsequent outcome are present in the data. This means: longitudinal interaction sequences of sufficient length (months, not weeks), with ground truth outcomes also present in the dataset. Most publicly available educational datasets are either too short (ASSISTments interactions are typically sparse over time), too coarse (single binary correctness labels without timing or hint request data), or lack the curricular structure annotations required to identify dependent concepts.
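A concrete filter of the kind this requirement implies might look as follows; the thresholds are assumptions, and appropriate values depend on the curriculum and the platform:

```python
from datetime import datetime, timedelta

def has_longitudinal_depth(interactions, min_span_days=120, min_count=100):
    """Keep only students whose logs span months rather than weeks, with
    enough interactions to expose trajectory structure.
    interactions: list of (timestamp, outcome) pairs.
    Thresholds are illustrative assumptions, not validated values."""
    if len(interactions) < min_count:
        return False
    times = [t for t, _ in interactions]
    return max(times) - min(times) >= timedelta(days=min_span_days)

start = datetime(2024, 9, 1)
long_log = [(start + timedelta(days=i), 1) for i in range(180)]   # ~6 months
short_log = [(start + timedelta(days=i), 1) for i in range(21)]   # ~3 weeks
```

Applied to public benchmarks, a filter like this is exactly what exposes the problem: most sequences fail it, which is why the longitudinal patterns described above are so rarely learnable from existing datasets.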
The implication is that the architectural choices above are necessary but not sufficient. The training data must have the temporal and structural depth that the model requires to learn the longitudinal patterns it is being asked to detect. Building that training data, e.g. through institutional partnerships that generate sustained, structured, annotated interaction logs at meaningful scale, is the constraint that determines whether the system described here is buildable in practice, not merely in principle.
The teacher who sees December in October
The question of what it would mean to actually know a student — not measure them at intervals but model the current structure and trajectory of their cognitive development — is the question that the next generation of educational AI needs to answer. The architectural primitives for approaching it exist: selective state space models for efficient long-range temporal representation, graph neural networks for structured knowledge state encoding, forward-looking training objectives for longitudinal predictive validity. What does not yet exist is a system that combines these components, trains them on data of sufficient depth, and deploys them at the scale that would make the detection meaningful.
The teacher who sees December coming in October is not running a better assessment. They are maintaining a richer internal model: one that represents not just where a student is but how they got there and where the trajectory leads. That is the capability that needs to be formalised, trained, and scaled.
¹ For the procedural-conceptual distinction in mathematics education, see: Rittle-Johnson, B., Siegler, R.S., & Alibali, M.W. (2001). Developing conceptual understanding and procedural skill in mathematics: An iterative process. Journal of Educational Psychology, 93(2), 346–362.
² Smith, J.P., diSessa, A.A., & Roschelle, J. (1994). Misconceptions reconceived: A constructivist analysis of knowledge in transition. Journal of the Learning Sciences, 3(2), 115–163.
³ Piech, C. et al. (2015). Deep Knowledge Tracing. NeurIPS 2015. Ghosh, A., Heffernan, N., & Lan, A.S. (2020). Context-aware attentive knowledge tracing. KDD 2020. Shin, D. et al. (2021). SAINT+: Integrating temporal features for EdNet correctness prediction. LAK 2021.
⁴ Preprints.org (2025). A Systematic Review of Deep Knowledge Tracing (2015–2025): Toward Responsible AI for Education.
⁵ Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modelling with selective state spaces. arXiv:2312.00752.
⁶ Liu, H. et al. (2025). Advancing Knowledge Tracing by Exploring Follow-up Performance Trends (FINER). arXiv:2508.08019.
⁷ Tong, S. et al. (2020). Structure-based Knowledge Tracing: An Influence Propagation View. ICDM 2020. Fu, L. et al. (2024). SINKT: A structure-aware inductive knowledge tracing model with large language model. CIKM 2024.