The peer-reviewed literature on large language models in education is less than three years old. We do not yet have robust longitudinal data, strong findings on equity or transfer, or any clear picture of what these tools do to the deeper capacities education is ultimately supposed to develop: metacognition, intellectual resilience, a genuine sense of yourself as someone who can learn. We are, methodologically speaking, in the first quarter.
That is not a reason to dismiss what we do have. There are real signals in the evidence already, and the question is whether we can read them clearly, without the distortion field that tends to form around any technology arriving with commercial momentum and cultural novelty. In a way we have been here before, and that history is worth understanding.
The Dream of the Teaching Machine
The ambition to automate teaching is not new. Sidney Pressey, a psychologist at Ohio State, had been developing a mechanical teaching device since the early 1920s: a typewriter-like apparatus that presented multiple-choice questions and held a question in place until the student selected the correct answer. When he first displayed it publicly at the 1924 APA meeting, it was framed as an intelligence testing tool; the teaching function came later (Petrina, 2004). He spent much of the following decade trying to commercialise it, largely without success. B. F. Skinner gave the idea a proper theoretical home three decades later: programmed instruction decomposed learning into small sequenced steps, reinforced correct responses immediately, and let learners move at their own pace. Skinner was explicit that his machine was not a substitute for a teacher but a way of doing what good teaching had always done, namely attending carefully to the individual (Cooper, 1993). The underlying ambition maps almost exactly onto today’s discourse about AI tutors: personalised, responsive, available at scale.
Programmed instruction worked, within its limits. It produced reliable gains on well-defined, low-level skills, and struggled badly with complex content, wide variation in prior knowledge, and anything requiring genuine judgement or transfer. The ceiling was not only technological; the theoretical model was too narrow. Behaviorism’s premise that learning could be fully described through observable input-output relations could not account for what happened when students encountered genuinely novel problems (Cooper, 1993; Ertmer & Newby, 2013). What followed is sometimes narrated as a clean succession of paradigms, but it is more accurate to say that different intellectual traditions accumulated alongside each other, each illuminating something the others could not. Cognitive science uncovered the architecture behaviorism had deliberately bracketed: working memory, schema formation, the enormous role of prior knowledge. Constructivist and sociocultural research, drawing on philosophy, developmental psychology, anthropology, and linguistics, showed that learning is not a transfer of content but an active construction always embedded in social and cultural context. Motivational psychology added that what students believe about themselves as learners shapes outcomes as much as any instructional technique. Neuroscience, instructional design, psychometrics, and learning analytics have each added further layers, and none of these traditions made the others obsolete. What we know today about effective teaching and learning is their cumulative, sometimes contested, still-developing output, which is precisely why applying it is hard, and why any technology claiming to rest on “the science of learning” deserves scrutiny about which slice of that science it is actually drawing on.
The intelligent tutoring systems of the 1980s and 1990s (software designed to deliver individualised instruction by tracking what each student knew and responding accordingly) operationalised mainly the cognitive science strand of that picture. The flagship examples were explicitly built on cognitive models of skill acquisition, decomposing knowledge into learnable components and tracking student mastery of each. Within those foundations they worked surprisingly well: a comprehensive meta-analytic review found that these systems produced learning gains nearly equivalent to one-on-one human tutoring for well-defined STEM tasks, far outperforming the answer-based instruction that preceded them (VanLehn, 2011). But that qualifier, well-defined STEM tasks, is important. They were never shown to generalise to ill-defined problems, to transfer beyond practiced procedures, or to reach the motivational and sociocultural dimensions of learning that other research traditions had brought into the picture. This was partly because the technology could not get there, and partly because the theoretical model they rested on was perhaps too narrow to even recognise those dimensions as part of the problem. Large language models can be seen as another attempt at the same underlying ambition, with dramatically greater expressive range. The continuity of that ambition is visible in the vocabulary: Khalifeh and colleagues (2026), surveying recent AI education literature, found that “individualized instruction” (the term most directly associated with Skinner’s programme) now appears in fewer than 2% of studies, almost entirely displaced by “adaptive learning” and “personalized learning”. As Khalifeh and colleagues also found, these labels have yet to acquire a universally accepted definition even within the research literature itself, functioning instead as broad umbrella terms loose enough to encompass almost any student-centred approach.
The vocabulary keeps refreshing; the conceptual clarity has not yet arrived; and what changes, each cycle, is the technology claimed to finally deliver on the promise. Whether large language models represent a genuine advance on their predecessors, or a more capable version of a technology whose limits will again prove to lie exactly where the theoretical model is thinnest (in transfer, in motivation, in the sociocultural dimensions of learning that no individual-interaction system has yet reached), is exactly what the research is now trying to figure out.
Meanwhile, Everyone Is Tinkering
Before getting to controlled trials, it is worth naming something that rarely appears in the biggest evidence reviews: a great deal of what is actually happening with generative AI in education right now is tinkering. Educators are building chatbots for Q&A, deploying AI writing coaches, designing simulation environments for professional practice, asking students to critique AI-generated arguments, and trying to route AI agents through curriculum materials, sometimes thoughtfully, sometimes chaotically, and occasionally producing something that works in ways nobody predicted.
This matters, and not merely as a precursor to the “real” deeper research. It is historically how educational technology has moved forward. The first people to use video in teaching were not running randomised trials, and neither were the early adopters of wikis, simulation software, or peer learning platforms. They were trying things, watching what happened, adjusting, and sharing what they found, and some of those serendipitous discoveries became the basis for empirical work that eventually confirmed or complicated them. The difference now is scale and speed: tools are being deployed to millions of students before the field has had time to establish what questions to even ask about them, which raises the stakes considerably for the slower, harder work of building evidence about what actually works, for whom, and under what conditions.
What We Know Makes Learning Work
Before reading the AI studies, it is worth grounding the discussion in what the learning sciences have established about learning itself. Not because AI tools automatically embody these principles (many quietly violate them), but because that is the lens through which any of this can be evaluated.
Working memory is the bottleneck. Human working memory is sharply limited in capacity and duration, and learning fails when instruction places too many demands on it simultaneously, not because students lack ability, but because the architecture has real constraints. Effective design manages this through sequencing, worked examples, and stripping out irrelevant complexity (Paas et al., 2003; van Merriënboer & Sweller, 2005).
Retrieving beats reviewing. Actively recalling information from memory produces dramatically stronger long-term retention than passively re-reading it, and spaced retrieval practice, especially when interleaved across topics, consistently outperforms re-study (Carpenter & Agarwal, 2019; Taylor & Rohrer, 2010). The difficulty of retrieval is itself what consolidates learning; struggle, within the right range, is the mechanism rather than a design flaw.
Feedback needs to inform, not just evaluate. Explanations, hints, and worked alternatives consistently outperform grades and binary judgements at the level of actual skill development, and what matters most is whether students can do something with the information they receive (Shute, 2008).
Self-regulation is central, not supplementary. The capacity to plan, monitor, and adjust one’s own cognitive activity is not an advanced skill reserved for confident students; it is fundamental to how learning unfolds. Conditions that quietly remove decision-making from students can erode this capacity while producing short-term performance gains that mask the longer-term cost (Boekaerts, 1999).
Scaffolding is supposed to dissolve. Good support is calibrated to what learners are close to doing independently and gradually withdrawn as competence grows (Shvarts & Bakker, 2019; Vygotsky, 1978). A scaffold that stays in place indefinitely is not support anymore; it is a substitute.
These principles are not the full picture of what the learning sciences know. And dimensions harder to measure (identity, belonging, disciplinary ways of thinking, the social texture of a seminar room) are no less real for being less tractable to the methods that dominate current AI education research. The right question to ask about any new AI tool is not whether it is technically impressive, but whether it actually embodies these principles or quietly works against them.
What Studies Are Starting to Show:
For Students: Conditional Gains and Real Risks
The most striking recent result I found comes from Kestin and colleagues (2024), who ran a randomised controlled trial, using an LLM available at the time, comparing an AI-powered tutor to active learning instruction for university-level physics. Students using the AI tutor learned more than twice as much in less time, a significant signal, but one that requires careful reading. The system was not a raw language model pointed at students: it was a deliberately designed pedagogical tool that asked questions rather than providing answers, pushed students to explain their reasoning, and adapted based on their responses. The gains came from the technology operating within intentional instructional design, not from the model’s capability alone, and that distinction will matter enormously as institutions scale up deployment.
How a tool is framed for students turns out to be equally consequential. Mollick & Mollick (2023) propose seven distinct roles AI can play in student learning (tutor, coach, mentor, teammate, tool, simulator, student), each generating different cognitive activity and carrying different risks. An AI that questions and prompts self-explanation produces fundamentally different learning conditions than one that simply executes tasks, and leaving students to implicitly negotiate which role the AI plays, without any pedagogical framing, is precisely where much of the evidence for harm clusters.
Lehmann and colleagues (2024) make this concrete across three studies in university programming courses. Students who conversed with the LLM, asking for explanations and probing their own understanding, benefited significantly. Those who used it to complete practice exercises for them, bypassing the effort of struggling with the problem, learned less and significantly overestimated what they had understood. That last finding deserves emphasis: the felt experience of AI-assisted work can be highly fluent and confident even when genuine skill acquisition has not occurred, and students without prior domain knowledge were especially vulnerable to this pattern. The LLM eliminates the desirable difficulty that consolidates learning and replaces it with a performance of competence that the student mistakes for the real thing. Khalifeh and colleagues (2026), reviewing fifty-five studies, found the same pattern at scale: gains are clearest for skill-based tasks with well-defined correct answers, and become much less clear for complex, open-ended goals where productive struggle is hardest to distinguish from unproductive confusion.
For Teachers: The Augmentation Case
The evidence I have collected so far for AI supporting teachers, rather than acting on students directly, is more limited in volume but at least as interesting in direction. Wang and colleagues (2025) describe Tutor CoPilot, a system that sits alongside human tutors during live sessions and provides real-time guidance on how to respond. In a randomised trial involving 900 tutors and 1,800 students from historically under-served communities, students whose tutors had access to the system were 4 percentage points more likely to master topics, an effect that reached 9 points for lower-rated tutors, who started asking more guiding questions and giving away answers less often. At roughly $20 per tutor annually, the equity implications are worth sitting with seriously. This is a meaningfully different vision of AI’s role: not replacing the human interaction but upgrading it, amplifying expert pedagogical knowledge and distributing it to those still developing it.
Afzaal and colleagues (2024) found that AI-generated feedback via a learning dashboard improved both performance and self-regulation, particularly because the system moved beyond corrections toward root-cause analysis: not just flagging what went wrong, but explaining where performance was declining and what to do about it. This mirrors what Shute (2008) identified as most effective in feedback research, though the open question remains whether students can genuinely act on analytically sophisticated feedback without the relational context that a person who actually knows their situation would provide.
The Technical Complexity Nobody Talks About
One thing easy to miss in the public discourse about AI in education is how genuinely difficult it is to build any of these tools well, and how much of the variation in outcomes is probably explained not by pedagogy or student behaviour, but by engineering decisions that almost nobody in the conversation is qualified to examine.
Start with the alignment problem, which is more fundamental than it first appears. A general-purpose language model is trained to be fluent, helpful, and responsive: to reduce friction, complete the task, and satisfy the user. Good pedagogy often requires the opposite: introducing productive difficulty, withholding answers, slowing students down at exactly the moment they want to be accelerated. These are genuinely different optimisation targets, and the tension between them does not resolve cleanly through a system prompt. You can instruct a model to “ask questions rather than provide answers,” and it will do so, up to a point, but the underlying training exerts a constant gravitational pull toward helpfulness. A determined student can erode Socratic constraints within a few conversational turns simply by rephrasing requests, expressing frustration, or claiming they have already understood the concept and just need the answer confirmed. The model, trained to be agreeable and to reduce user distress, complies. What looked like a thoughtful pedagogical tool quietly becomes a sophisticated answer machine.
Then there is the question of contextualisation: how you actually get a general-purpose model to know anything meaningful about a specific course. This is where a pervasive misconception needs correcting. When an educator uploads a course text or a set of lecture slides to an AI chatbot, the model is not trained on those materials. Training means updating a model’s weights through gradient descent across billions of examples, a process requiring enormous compute, months of time, and significant specialist expertise. What typically happens when you upload a PDF is something called Retrieval-Augmented Generation, or RAG, and understanding what RAG does, and does not do, matters a great deal for evaluating what these tools can realistically offer.
In a RAG system, documents are broken into chunks, converted into numerical vectors through an embedding model, and stored in a vector database. When a student asks a question, the system calculates which stored chunks are most semantically similar to the query and injects them into the context window as plain text, alongside the conversation. The model then generates a response conditioned on that retrieved content. The elegance of the approach is real, but so are its failure modes, and they are numerous. Chunking strategy (how documents are divided) is a non-trivial engineering decision: split too finely and conceptual coherence is lost; split too coarsely and the wrong content gets retrieved. A section heading may end up in one chunk while the substantive content it labels is in the next, making both chunks less retrievable for relevant queries. The embedding model encodes semantic similarity as it was learned from general training data, which may handle disciplinary language, local terminology, and the specific conceptual vocabulary of a course poorly: a question about a concept the course defines in a particular way may retrieve the wrong passage because the embedding model maps it to a different part of semantic space. When retrieved content does not actually answer a student’s question, the model does not typically say so; it generates a plausible-sounding response drawing on its broader training data, while appearing to be grounded in the course material. The result is confident-sounding misinformation that is particularly hard to detect because it arrives with an implicit imprimatur of course authority.
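The retrieval step described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular product works: the "embedding" here is a plain bag-of-words count vector standing in for a learned embedding model, the "vector database" is an ordinary list, and the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Naive fixed-size chunking: split text into runs of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector.
    Real systems use a learned model that maps text to dense vectors."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the query.
    Note that it always returns something, even if every score is zero."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query, retrieved):
    """Inject retrieved chunks into the prompt as plain text, next to the question."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return f"Course material:\n{context}\n\nStudent question: {query}"
```

Calling `retrieve` with a question that shares no vocabulary with any chunk still returns `top_k` chunks, each with a score of zero. That is a small-scale version of the failure mode above: the pipeline hands the model something regardless of relevance, and nothing in the prompt signals that the retrieved text does not answer the question.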
Even when RAG works well technically, or when an entire textbook fits inside the context window, the system has no knowledge of the pedagogical architecture of the course: which concepts depend on which others, what students have and have not covered yet, what the assessment is designed to test, what the lecturer specifically emphasised in week four. The retrieved text is content without context. A student in week three of a ten-week course asking about a concept introduced in week eight gets a technically accurate answer that is pedagogically premature, without the system having any representation of why that might matter.
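To make "content without context" concrete, here is a minimal sketch of the kind of representation a plain retrieval pipeline lacks. It assumes, purely for illustration, that each chunk carries a hypothetical `week` tag recording when the material is taught. Neither the `Chunk` type nor `split_by_progress` comes from any real system, and a week number is a far cruder model of course structure than the dependencies described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    week: int  # hypothetical metadata: the week in which this content is taught


def split_by_progress(retrieved, current_week):
    """Separate retrieved chunks into material already covered and material
    that lies ahead of where the student currently is in the course."""
    covered = [c for c in retrieved if c.week <= current_week]
    ahead = [c for c in retrieved if c.week > current_week]
    return covered, ahead


# Example: a week-three student whose question matches week-eight material.
retrieved = [
    Chunk("Definition of a derivative.", week=2),
    Chunk("Taylor series expansions.", week=8),
]
covered, ahead = split_by_progress(retrieved, current_week=3)
```

Even this small step requires someone to annotate the course material and to decide what the system should do with chunks that land in `ahead`, which is a pedagogical decision rather than an engineering one.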
Agentic architectures (systems where one or more AI models work in sequence, for example with one diagnosing, another retrieving, another generating a question, another evaluating a response) represent the next layer of ambition, and the next layer of complexity. The appeal is genuine: an agent that could accurately diagnose a student’s specific misconception, retrieve the most relevant course passage, design a targeted practice question calibrated to the zone of proximal development, evaluate the student’s response for genuine understanding rather than surface pattern-matching, and adapt the next step accordingly would be a remarkable pedagogical tool. Pieces of this are now technically feasible in ways they were not three years ago.
The difficulty is that each step in an agentic chain is a potential failure point, and errors propagate. A misdiagnosis at step one cascades through the entire sequence, producing a confidently wrong intervention that the student has no basis to identify as such. Unlike a single LLM response that a student can immediately sense is off, multi-agent systems can produce compound outputs whose errors are buried in intermediate processing invisible to the end user. Testing and quality assurance become correspondingly harder: you cannot simply evaluate the final output, you need to evaluate each step and the interactions between them, and for educational tasks where “correct” often depends on context, learning stage, and disciplinary convention, automating that evaluation is a genuinely unsolved problem. Agentic systems are also slower, sometimes inconsistent across runs, and sensitive to small variations in input in ways that single-model systems are not: a student asking the same question twice in slightly different words may receive substantively different sequences of interaction.
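A back-of-the-envelope calculation shows why chained steps are fragile. The sketch below assumes each step is right with some fixed probability and that errors are independent. Both are simplifying assumptions, and the accuracy figures are invented for illustration rather than measured from any system. The function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(step_accuracies):
    """Probability that every step in a sequential chain is correct,
    assuming each step succeeds independently with the given probability."""
    probability = 1.0
    for accuracy in step_accuracies:
        if not 0.0 <= accuracy <= 1.0:
            raise ValueError("each accuracy must be between 0 and 1")
        probability *= accuracy
    return probability


# Five steps (diagnose, retrieve, generate, evaluate, adapt), each assumed
# to be 90% reliable: the whole chain is right only about 59% of the time.
five_step_chain = chain_success_probability([0.9] * 5)
```

In practice errors are unlikely to be independent, since a misdiagnosis at step one makes later steps more likely to go wrong too, so the real picture may be worse than this simple product suggests.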
And all of this is before accounting for the most uncontrollable variable in the whole system: the learner. Unlike almost any other software context, educational AI cannot determine what users actually do with it. A student can lie to the system about what they have understood. They can paste in answers from elsewhere. They can use the tool at midnight before a deadline in a state of anxiety the system has no visibility into, optimising for task completion rather than understanding in ways the system cannot distinguish from genuine engagement. What a student does in a single AI session is shaped by everything else happening in their week (the lecture that confused them, the exam three weeks away, the feedback that demoralised them yesterday), and the system sees none of that. Learning is distributed across time and context in ways that fundamentally resist encapsulation in a conversational interface, however sophisticated. The intelligent tutoring systems of the 1980s failed partly because they encoded too thin a model of what learning involves. That warning has not been superseded. If anything, the current generation of tools, with their smooth interfaces and confident outputs, makes it easier than ever to mistake fluency for depth.
The Evidence Ecosystem and Its Distortions
It would be reassuring if the research landscape around AI in education were shaped primarily by disinterested inquiry, building knowledge slowly and cumulatively toward clear understanding. It is not, and this needs saying plainly rather than politely.
A significant and growing portion of the most prominent AI education research is produced by, funded by, or conducted in formal partnership with the companies building the tools under study, and the structural incentives this creates are not neutral. Companies publish positive findings as research papers and press releases simultaneously, with communications infrastructure that academic research cannot match. Preprints circulate for months before peer review, accumulating citations and shaping discourse. Conference keynotes and panel invitations flow to researchers affiliated with well-resourced organisations rather than to independent scholars doing careful, slow, methodologically demanding work in contexts that don’t generate headlines. The result is an information environment systematically tilted toward optimism, toward the impressive single study, and toward framings of the question that happen to favour the products already being sold.
OpenAI’s Learning Outcomes Measurement Suite (OpenAI, 2026) is instructive here, and not because it represents bad faith: the mixed results it reported (no significant gains in the neuroscience condition; roughly 15% higher scores in microeconomics) were published, which matters. What is worth examining is the broader architecture of the initiative: a company leading the development of measurement frameworks for AI’s impact on learning, in partnership with Stanford’s SCALE Initiative, Arizona State, UCL Knowledge Lab, and MIT Media Lab (Stanford HAI, 2026). The partnerships are legitimate academic institutions. The framing (that the field lacks adequate tools to evaluate AI’s learning impact, and that OpenAI is helping to close that gap) is presented as a service to the research community. But the company that defines how learning is measured, and co-develops the instruments used to measure it, has a structural interest in what those instruments find. When James Donovan, OpenAI’s Head of Cognitive Outcomes Research, presents alongside Carol Dweck at a Stanford summit, the signalling (that this is serious science with serious scholarly backing) is doing real work in shaping how the field receives the findings. Independent replication, by researchers with no commercial relationship to the tools being evaluated, is not optional in this context. It is the essential corrective.
More broadly, the field is currently in the grip of what might be called a measurement gap. Randomised controlled trials take two to three years from design to peer-reviewed publication. A company can ship a product, run an internal study, and issue a press release within weeks. The evidence that accumulates fastest is, almost by definition, the evidence produced by those with the greatest commercial interest in particular results. The signal that pedagogically aligned AI interaction can improve learning outcomes under specific conditions is plausible and worth taking seriously; there are independent results pointing in that direction. Whether it does so durably, at scale, equitably, and with the full complexity of learning contexts that a well-designed curriculum involves, is a question the current evidence base simply cannot answer. That gap should make us cautious, not dismissive, but it should make us genuinely cautious, rather than performing caution while acting on the optimistic case.
What We Still Do Not Know
The honest answer to most headline claims about generative AI and learning is: probably, for some students, on some tasks, under some conditions. Almost all positive findings come from short interventions measured on immediate post-tests or end-of-unit exams, and whether AI-supported learning produces more durable retention, better transfer to novel problems, or stronger metacognitive development over time remains largely open. These are not merely the growing pains of a young literature. Khalifeh and colleagues (2026) found that the broader personalised learning field from which current AI education discourse inherits its frameworks has long relied predominantly on cross-sectional designs and self-report data, leaving causal inference weak and the risk of bias structurally high, and that the complexity of educational environments makes isolating the effect of any specific intervention genuinely difficult regardless of how technically sophisticated that intervention is. The LLM education literature is not starting from a methodologically solid base; it is adding new questions to an already, dare I say, under-evidenced field. We do not have the longitudinal evidence that would let us say anything confident about what happens to learners’ capacity for independent intellectual work after a year or two of heavy AI assistance, and that absence is arguably the most important gap the field has yet to close.
The equity dimension is at least as pressing. AI tools assume reliable internet access, digital fluency, and, crucially, a degree of metacognitive sophistication to use them in ways that support rather than bypass learning. Lehmann and colleagues (2024) found that the students most likely to use LLMs harmfully were those without prior domain knowledge, who are also, typically, the students who most need genuine support. A tool that consistently helps confident, knowledgeable learners while quietly harming those who are struggling has a serious equity problem, and that problem gets worse, not better, if the tool is deployed at scale under the assumption that its average positive effect applies uniformly across the student population. Khalifeh and colleagues (2026) found a pointed illustration of this in their review of the broader personalised learning literature: studies comparing computer science and mathematics students in the same adaptive programming courses found that CS students developed more efficient learning strategies over time, while mathematics students showed high engagement without corresponding achievement, suggesting that what counts as effective personalisation varies not just by individual ability but by disciplinary identity and motivation. Those studies largely predate LLM-based tools; the problem they surface is if anything sharper for current AI systems, which typically lack even the structured learner models of those earlier adaptive platforms and have no mechanism for distinguishing a student who is deeply engaged from one who is going through the motions.
In addition, we do not yet have adequate frameworks for identifying when human teaching is not merely preferable but truly irreplaceable. The relational dimensions of learning (the felt sense that someone who cares about your development has attended carefully to your work, knows your history, and can read the difference between a student who needs to be challenged and a student who needs to be steadied) appear to matter more than the technical quality of any single instructional act. Whether an AI system can produce functionally equivalent relational effects, or whether those effects are definitionally tied to human presence, is not settled. That question is genuinely hard, and the field should resist the temptation to close it prematurely with optimistic assumptions.
Keep Tinkering, But Know What You Are Doing
The evidence is neither as positive as the headlines claim nor as negative as the sceptics suggest, and what it actually supports is a set of distinctions that are easy to state and genuinely hard to operationalise. The gains that appear in the literature are consistently linked to deliberate design and intentional use, not to the capability of the model alone. A chatbot that answers questions is not the same as a tool designed to ask them, and the difference lies entirely in the configuration, the context, and what students understand they are supposed to be doing with it. That distinction matters more than it might seem, because the evidence on how personalised learning tools are actually deployed is not that encouraging: Khalifeh and colleagues (2026) found that schools frequently adopt such tools without coherent institutional frameworks, resulting in fragmented, teacher-dependent use of digital resources rather than the sustained, designed learning experiences that the positive research findings assume. How students use a tool determines the learning outcome more than which tool they use, which places real demands on educators to develop metacognitive capacity explicitly, as a taught, scaffolded, assessed skill, before students encounter these systems unsupported. The field has instead largely assumed that metacognitive and motivational capacities will emerge from technological implementation itself (Khalifeh et al., 2026). They do not. Or at least not automatically and not for everyone. That assumption is one of the more consequential gaps between what the research conditions require and what deployment in practice tends to provide.
It is worth noticing who is speaking most loudly in this conversation. The confident voices (at conferences, in LinkedIn feeds, in op-eds about AI transforming education) are often those with the strongest commercial interest in a particular answer, or those whose identity is most invested in a particular position. The researchers doing careful, slow, methodologically rigorous work in specific educational contexts are rarely the ones giving keynotes on the biggest stages at EdTech summits. This is not a reason for paralysis. It is a reason to be thoughtful about whose frameworks you are borrowing and what assumptions come built into them.
Education has absorbed and outlasted many waves of technology that arrived promising transformation. Pressey’s machine was not the end of teachers. Neither was Skinner’s. Neither were the intelligent tutoring systems of the 1990s, for all their genuine achievements within particular domains. What ended each wave was not defeat but the discovery of its ceiling, together with the theoretical and empirical work that followed and eventually gave rise to something more adequate. We are somewhere in that process again, earlier than most of the public conversation acknowledges. The right response is not dismissal, and it is not uncritical adoption. It is continued tinkering, grounded in the learning sciences, honest about the limits of current evidence, and genuinely curious about what the harder studies (the ones with longer time horizons, harder outcomes, and no commercial stake in the result) will eventually show.
References & Further Reading
Afzaal, M., Zia, A., Nouri, J., & Fors, U. (2024). Informative Feedback and Explainable AI-Based Recommendations to Support Students’ Self-regulation. Technology, Knowledge and Learning, 29(1), 331–354. https://doi.org/10.1007/s10758-023-09650-0
Badali, S., & Greve, M. (2024). Can successive relearning enhance performance on application-based exam questions? Journal of Applied Research in Memory and Cognition, 13(3), 407–418. https://doi.org/10.1037/mac0000137
Baker, R. S. (2023). AI and self-regulated learning theory: What could be on the horizon? Computers in Human Behavior, 147, 107849. https://doi.org/10.1016/j.chb.2023.107849
Boekaerts, M. (1999). Self-regulated learning: Where we are today. International Journal of Educational Research, 31(6), 445–457. https://doi.org/10.1016/S0883-0355(99)00014-2
Carpenter, S. K., & Agarwal, P. K. (2019). How to use spaced retrieval practice to boost learning. Iowa State University.
Cavalcanti, A. P., Barbosa, A., Carvalho, R., Freitas, F., Tsai, Y.-S., Gašević, D., & Mello, R. F. (2021). Automatic feedback in online learning environments: A systematic literature review. Computers and Education: Artificial Intelligence, 2, 100027. https://doi.org/10.1016/j.caeai.2021.100027
Cooper, P. A. (1993). Paradigm Shifts in Designed Instruction: From Behaviorism to Cognitivism to Constructivism. Educational Technology, 33(5), 12–19.
Ertmer, P. A. & Newby, T. (2018). Behaviorism, Cognitivism, Constructivism: Comparing Critical Features From an Instructional Design Perspective. In West, R. E. (Ed.), Foundations of Learning and Instructional Design Technology (1st Edition): Historical Roots and Current Trends (pp. 133-151). EdTech Books. https://edtechbooks.org/lidtfoundations/behaviorism_cognitivism_constructivism
Holmes, W., Bialik, M., & Fadel, C. (2023). Artificial intelligence in education. In Data ethics: Building trust: How digital technologies can serve humanity (pp. 621–653). Globethics Publications. https://doi.org/10.58863/20.500.12424/4276068
Kestin, G., Miller, K., Klales, A., Milbourne, T., & Ponti, G. (2024). AI Tutoring Outperforms Active Learning. Research Square. https://doi.org/10.21203/rs.3.rs-4243877/v1
Khalifeh, F., Santiago, R., & Palau, R. (2026). Redefining personalized learning in the artificial intelligence era: An updated systematic review from 2019 to 2025. Smart Learning Environments, 13(1), 19. https://doi.org/10.1186/s40561-026-00440-6
Kohout-Diaz, M. (2026). Making sense of AI in teacher education: A qualitative study of perceptions, practices and pedagogical tensions. Teaching and Teacher Education, 171, 105342. https://doi.org/10.1016/j.tate.2025.105342
Lehmann, M., Cornelius, P. B., & Sting, F. J. (2024). AI Meets the Classroom: When Does ChatGPT Harm Learning? (arXiv:2409.09047; Version 1). arXiv. https://doi.org/10.48550/arXiv.2409.09047
Mollick, E., & Mollick, L. (2023). Assigning AI: Seven Approaches for Students, with Prompts (arXiv:2306.10052). arXiv. https://doi.org/10.48550/arXiv.2306.10052
OpenAI. (2026, March 5). New tools for understanding AI and learning outcomes. https://openai.com/index/understanding-ai-and-learning-outcomes/
Ouyang, F., & Jiao, P. (2021). Artificial intelligence in education: The three paradigms. Computers and Education: Artificial Intelligence, 2, 100020. https://doi.org/10.1016/j.caeai.2021.100020
Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive Load Theory and Instructional Design: Recent Developments. Educational Psychologist, 38(1), 1–4. https://doi.org/10.1207/S15326985EP3801_1
Petrina, S. (2004). Sidney Pressey and the Automation of Education, 1924-1934. Technology and Culture, 45(2), 305–330.
Roediger III, H. L., & Karpicke, J. D. (2006). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1(3), 181–210. https://doi.org/10.1111/j.1745-6916.2006.00012.x
Rudolph, J., Tang, F. X., Aspland, T., & Stafford, V. (2025). What does “good teaching” mean in the AI age? Journal of Applied Learning and Teaching, 8(2). https://doi.org/10.37074/jalt.2025.8.2.1
Shute, V. J. (2008). Focus on Formative Feedback. Review of Educational Research, 78(1), 153–189. https://doi.org/10.3102/0034654307313795
Shvarts, A., & Bakker, A. (2019). The early history of the scaffolding metaphor: Bernstein, Luria, Vygotsky, and before. Mind, Culture, and Activity, 0(0), 1–20. https://doi.org/10.1080/10749039.2019.1574306
Stanford HAI. (2026, February 11). AI+Education Summit 2026 [Video recording]. https://www.youtube.com/watch?v=EqouaCgSo-k
Szpunar, K. K., Khan, N. Y., & Schacter, D. L. (2013). Interpolated memory tests reduce mind wandering and improve learning of online lectures. Proceedings of the National Academy of Sciences, 110(16), 6313–6317. https://doi.org/10.1073/pnas.1221764110
Taylor, K., & Rohrer, D. (2010). The effects of interleaved practice. Applied Cognitive Psychology, 24(6), 837–848. https://doi.org/10.1002/acp.1598
van Merriënboer, J. J. G., & Sweller, J. (2005). Cognitive Load Theory and Complex Learning: Recent Developments and Future Directions. Educational Psychology Review, 17(2), 147–177. https://doi.org/10.1007/s10648-005-3951-0
VanLehn, K. (2011). The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educational Psychologist, 46(4), 197–221. https://doi.org/10.1080/00461520.2011.611369
Vygotsky, L. S., & Cole, M. (1978). Mind in Society: Development of Higher Psychological Processes. Harvard University Press.
Wang, R. E., Ribeiro, A. T., Robinson, C. D., Loeb, S., & Demszky, D. (2025). Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise (arXiv:2410.03017). arXiv. https://doi.org/10.48550/arXiv.2410.03017
Winstone, N., Gravett, K., Noble, C., Nicola-Richmond, K., Bearman, M., Jensen, L. X., Jones, A., Corbin, T., de Kleijn, R., Gabelica, C., Kainth, R., Poobalan, A., & Reedy, G. (2025). Manifesto for feedback in the age of generative artificial intelligence. figshare. https://doi.org/10.6084/M9.FIGSHARE.30195568