Akesha M. Horton

Most conversations about generative AI in programming courses still run on instinct. One camp wants to ban the tools to protect the fundamentals. The other wants to adopt them quickly so students are ready for an industry that has already changed. Both positions get argued from conviction, and both skip the step that should come first: looking at what the research says.

That step is now possible. In the past two years, computing education researchers have produced enough studies to move the question from opinion to evidence. The findings do not hand a clean win to either camp. They support bringing AI into the programming classroom, and they show that doing it badly can widen the gap between students who are already succeeding and those who are struggling. The case for integration and the case for careful design turn out to be the same case.

What the tools are good at

Start with the capability question, because for the small problems we usually assign in introductory courses it is no longer in serious dispute: current large language models solve them. A review of the field by Paul Denny and colleagues documents how code-generation tools went from a curiosity to a standard part of the programming workflow in a matter of months, and how readily they handle the small, well-specified problems that fill most CS1 assignments. The exam evidence is concrete. When James Finnie-Ansley and colleagues ran OpenAI’s Codex through two CS1 tests in 2022, its answers ranked seventeenth in a class of seventy-one, inside the top quartile, and that was an early model. Newer ones have only widened the margin.

The industry signal points the same direction. In April 2025, Microsoft’s CEO said as much as 30% of the company’s code was being written by AI. That figure is a remark on a conference stage, not a measured result, and “written by AI” has no agreed definition, so it deserves caution. As a marker of where professional practice is heading, though, it is hard to ignore. If we want graduates who can work the way their future colleagues already work, ignoring the tools is its own kind of risk.

The second strength is tutoring. A first-year case study by Liam Stienstra and colleagues used a ChatGPT-based tutor to teach a new topic, then tested students without assistance. The tool supported learning, and most students preferred the AI tutor to a human one. Their reasons are worth considering: it was always available, and it did not make them feel judged for asking a basic question. Purpose-built tutors push this further. Tools like CodeHelp, from Mark Liffiton and colleagues, and CodeAid, from Majeed Kazemitabaar and colleagues, wrap the model in guardrails so it nudges a student toward an answer instead of handing one over. Work by Harsh Kumar and colleagues, run across a classroom study and a larger controlled experiment, found that the structure an instructor puts around the tool changes how students use it. Guidance that prompted students to attempt a problem before turning to the model, or to reflect on their own thinking, cut down on aimless queries and on the habit of pasting an assignment verbatim into the chatbot.

There is also early evidence that the help does not vanish the moment the tool is taken away. In a controlled study, Kazemitabaar and colleagues gave sixty-nine beginners, ages ten to seventeen, a sequence of forty-five Python tasks. Half had an AI code generator during the authoring work. That group completed more of the authoring tasks and scored higher on them. The natural worry is that they would fall apart once the tool was gone. They did not. On the modification tasks they had to do by hand, they performed no worse than the students who never had access. On a retention test a week later they came out slightly ahead, though the difference was not statistically significant. The result is encouraging and modest at once, which is the honest way to hold it.

So far this is the optimistic story, and it is a genuine one. It is also incomplete.

The catch is an equity problem

Learning with AI is not automatic, and the students it fails are the ones we should worry about most.

The clearest evidence comes from James Prather and colleagues, in a study with the apt title “The Widening Gap.” Using eye-tracking and think-aloud methods with novice programmers, they found that every metacognitive difficulty students had before AI is still present, and that new ones have appeared. The concerning pattern is this: students with higher grades and stronger self-efficacy used the tools to accelerate, while weaker students were often slowed down and left with what the authors call an illusion of competence. The work got finished, but the understanding did not follow. The same tool that helps a confident student move faster can help a struggling student fail without noticing.

Part of the reason can be found in a paper by Lev Tankelevitch and colleagues on the metacognitive demands of generative AI. Working with these systems asks the user to do things that are cognitively hard: state a goal precisely, break it into parts, judge whether the output is any good, and adjust course when it is not. The authors compare it to a manager delegating to a team. That is a demanding skill, and it is not one novices arrive with. Studies of how beginners actually use these tools back this up. A close look by James Prather and colleagues at novices working with Copilot found students leaving out details that matter, like input and output types, and, when a prompt failed, changing a few words in the prompt rather than supplying the missing requirements.
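To make the missing-details point concrete, here is a minimal sketch of the difference between an underspecified request and one that states input and output types. The task, the name `average_word_length`, and the docstring wording are illustrative assumptions, not material from the Prather study.

```python
# Hypothetical illustration; the task and names are not from the cited study.

# An underspecified prompt of the kind a novice might write.
VAGUE_PROMPT = "write a function that gets the average"


# The same request with the details novices tend to leave out:
# the input type, the output type, and the behavior on an edge case.
def average_word_length(words: list[str]) -> float:
    """Return the mean length of the strings in `words`.

    Input:  a list of strings (possibly empty).
    Output: a float; 0.0 when the list is empty.
    """
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


if __name__ == "__main__":
    # "loop" has 4 letters and "recursion" has 9, so the mean is 6.5.
    print(average_word_length(["loop", "recursion"]))
```

A signature and docstring like this double as a specification: a student who can write them has already supplied the requirements that word-level prompt tweaks leave out.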

This is where the equity lens is needed. A tool that rewards students who already plan well, monitor their own understanding, and know what they do not know will tend to reward the students who were going to do fine anyway. Left unaddressed, AI widens the equity gap.

What evidence-based design looks like

The same body of research points to what helps, and the through-line is consistent: slow students down enough to make them think, and build the supports that strong students supply for themselves.

Majeed Kazemitabaar and colleagues tested this directly. They compared a baseline that simply showed students AI-generated code with an explanation against seven techniques that required some engagement with the code before or after it appeared. The technique that worked best, called Lead-and-Reveal, brought what students thought they had learned closest to what they actually learned, without adding cognitive load. The effects were modest and the samples small, which the authors are candid about, but the direction matches everything else in the literature. Passive acceptance of generated code produces the illusion of learning. Forcing a small act of reasoning interrupts it.

Other design moves follow from the same logic. Scaffold the tool rather than handing it over, as Kumar’s guidance strategies did. Teach prompting as an explicit skill, since novices do not pick it up on their own. Move assessment toward process and away from the take-home artifact that a model can now generate in seconds. And give upper-level students structured practice with the messier task they will actually face at work: a study by Anshul Shah, Leo Porter, and colleagues taught students to use GitHub Copilot inside a large codebase and found a pattern they call one-shot prompting, where students ask the tool to build an entire feature at once and then spend their effort debugging what comes back. That is a teachable habit, and naming it is the first step to correcting it.

There is a quieter finding that reframes the whole debate. A 2025 replication study by Rita Garcia and Michelle Craig revisited a 2004 survey of what CS1 instructors teach and find hard to teach. Twenty years on, the basic concepts are largely unchanged, recursion is still the hardest thing to teach, and the newest challenge instructors name is teaching students to think and plan before they code. That difficulty predates generative AI. The shift toward planning, decomposition, and self-monitoring, the very skills AI makes more important, was already underway. The course it was happening in was already struggling: introductory programming has carried high failure rates for decades, documented by Jens Bennedsen and Michael Caspersen and revisited by Christopher Watson and Frederick Li, and the long line of work on Soloway’s rainfall problem, surveyed by Kathi Fisler, showed that many students finish CS1 unable to write a basic loop correctly. AI did not break the introductory course. It exposed problems the course already had.
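For readers who have not met it, the rainfall problem is small enough to sketch. The version below follows a commonly used formulation: average the non-negative readings that appear before a sentinel value. The sentinel `-999`, the function name, and the choice to return `None` when there is nothing to average are assumptions for illustration; published variants differ on these details.

```python
from typing import Optional

SENTINEL = -999  # assumed end-of-data marker; variants of the problem use other values


def rainfall_average(readings: list[float]) -> Optional[float]:
    """Average the non-negative readings that come before the sentinel."""
    total = 0.0
    count = 0
    for value in readings:
        if value == SENTINEL:
            break      # stop at the sentinel and ignore everything after it
        if value < 0:
            continue   # negative readings are treated as invalid and skipped
        total += value
        count += 1
    if count == 0:
        return None    # no valid data, so avoid dividing by zero
    return total / count


if __name__ == "__main__":
    # Valid readings before the sentinel are 2 and 4, so the average is 3.0.
    print(rainfall_average([2, -1, 4, -999, 100]))
```

Each piece is simple on its own: a loop, a filter, an early exit, and a guard against empty data. The task asks students to combine them correctly in one short function.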

You do not have to figure this out alone

The research is scattered across conferences and journals, which makes it hard for any individual instructor to assemble. That gap is part of what the Consortium for Generative AI in CS Education (https://www.teachcswithai.org/about) exists to close. Housed at UC San Diego’s Center for Research on Education, Assessment, and Teaching Excellence and led by Leo Porter, Daniel Zingaro, and Beth Simon, the consortium gathers course materials, research summaries, and a community of educators working through the same questions. Its partner network runs from professional bodies like ACM and SIGCSE to organizations focused on access and inclusion, which matters given what the research shows about who gets left behind. The point of the consortium is not to sell a position. It is to make the evidence usable, so that a faculty member redesigning a course in the summer is not starting from a blank page. About half of the sources I drew on for this article can be found in the consortium’s resources section.

Two of the consortium leads, Porter and Zingaro, have also written the classroom-ready version. Their book, Learn AI-Assisted Python Programming, reorders the introductory course around the skills the research elevates. They argue that the skills needed to write good software are evolving: problem decomposition, writing a specification, reading code, and testing matter more than they used to, while memorizing syntax and library details matters less. Two of the chapters teach students to read code, on the logic that if the assistant writes the code, the student’s job is to judge whether the assistant has done what the student intended. A full chapter covers testing and prompt engineering. Top-down design, the practice of breaking a large problem into smaller ones, runs across several chapters rather than appearing once. Debugging gets its own deep dive, a skill that think-aloud studies by Jacqueline Whalley and colleagues show novices tend to approach unsystematically. Porter and Zingaro grant that a future professional engineer should eventually learn to write code from scratch, but they argue it no longer makes sense as the starting place for most learners. That is a concrete answer to the camp that wants to ban the tools to protect the fundamentals: the fundamentals are not dropped. They are sequenced differently and weighted toward the skills that matter most now, like reading code, testing, and breaking problems down, rather than writing syntax from memory. The book is available to the IU community through IUCAT on the Bloomington and Indianapolis campuses, in digital ebook (https://iucat.iu.edu/catalog/20590350) and streaming audio (https://iucat.iu.edu/catalog/21587005) formats.

What about large courses?

Class size does not change the goals, but it splits these design moves into two groups. Some work just as well with three hundred students as with thirty and cost the instructor no extra effort. Requiring an attempt before a student reaches for the tutor is a policy built into the assignment or the tool, not labor spent per student. Kumar’s guidance strategies were tested as exactly this kind of scalable intervention. The cognitive-engagement step is the same: Kazemitabaar’s Lead-and-Reveal shipped as a reusable task builder, so the prediction step runs for every student without the instructor touching it. The AI tutor itself has its strongest case here. Kumar frames LLM tutors as a response to growing class sizes where teacher presence is thin, CodeHelp was built for scalable support in large classes, and Stienstra’s students valued the tutor because it was always available, which a human teaching assistant in a 400-person course cannot be.

The moves that resist scale are both forms of assessment. Reading hundreds of process artifacts or grading prompts by hand does not work past a certain size. The realistic versions are to make the process artifact auto-gradeable or peer-reviewable, push it into teaching-assistant-led lab sections, assess prompts by outcome rather than by reading each one, and reserve the labor-heavy formats like oral checks for a rotating sample. The teaching of these skills scales fine, since a lecture is a lecture. The grading is what has to be redesigned.
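One way to read "assess prompts by outcome" is to grade the code a student's prompt produced against instructor-written test cases, rather than grading the prompt text itself. The sketch below assumes each submission arrives as a Python callable and that test cases are pairs of arguments and expected results. The name `score_submission` and the sample data are hypothetical, and a real autograder would also need sandboxing and timeouts.

```python
from typing import Any, Callable

# Each test case pairs positional arguments with the expected return value.
TestCase = tuple[tuple[Any, ...], Any]


def score_submission(func: Callable[..., Any], cases: list[TestCase]) -> float:
    """Return the fraction of test cases the submitted function passes."""
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case, not a grader error
    return passed / len(cases)


if __name__ == "__main__":
    # Stand-in for code generated from a student's prompt (hypothetical).
    def generated_max(a, b):
        return a if a > b else b

    cases = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]
    print(score_submission(generated_max, cases))
```

The score says nothing about how the student got there, which is why the process-focused checks still need a place somewhere in the course, even if only for a sample of students.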

Underneath this is the equity point. In a small class you can see who is drifting. In a large one, the student coasting on an illusion of competence is invisible, and that is the student Prather’s widening-gap study identifies as most at risk. The large-course adaptation is a triage logic: let the tools that scale support the median student, and aim scarce human attention (teaching-assistant hours, oral checks, early-alert outreach) at the students the tools are most likely to fail. Human help is the thing a big course cannot afford to spread evenly and the thing struggling students most need.

The opportunity

The strongest argument for bringing AI into programming education is not that the tools are impressive, though they are. It is that they force a reckoning with weaknesses the field has tolerated for decades: introductory courses where many students do not learn what we think they learn, and a persistent distance between what we teach and what graduates need. We can treat that reckoning as a threat to manage or as a chance to fix the course. The evidence says the fix is within reach, and that it depends less on the tools than on how we design around them.

A few things to consider trying next term:

Replace one take-home assignment with an in-class task that assesses process, not just the finished program.
Before students use an AI tutor, require them to attempt the problem and write down what they think the answer should be.
Teach one short lesson on writing a specification for a model, including input and output types, and grade a prompt the way you would grade code.
Pick a single cognitive-engagement move, such as having students predict what generated code will do before running it, and use it consistently for a unit.
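The last suggestion can be prototyped in a few lines. This is a sketch of a generic predict-then-run check, not the Lead-and-Reveal tool from the Kazemitabaar study; the function `check_prediction` and the sample snippet are hypothetical.

```python
import contextlib
import io


def check_prediction(code: str, predicted_output: str) -> bool:
    """Run `code` and report whether its printed output matches the prediction.

    Use this only with trusted, instructor-supplied snippets: exec() runs
    arbitrary code and is not a sandbox.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # fresh namespace so snippets do not share variables
    return buffer.getvalue().strip() == predicted_output.strip()


if __name__ == "__main__":
    snippet = "total = 0\nfor n in [1, 2, 3]:\n    total += n\nprint(total)"
    print(check_prediction(snippet, "6"))  # prediction matches the output
    print(check_prediction(snippet, "3"))  # prediction does not match
```

The point of the exercise is the student's written prediction, so the comparison can stay this simple. A mismatch is the prompt for a conversation, not a grade.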

References

Denny, P., Prather, J., Becker, B. A., Finnie-Ansley, J., Hellas, A., Leinonen, J., Luxton-Reilly, A., Reeves, B. N., Santos, E. A., & Sarsa, S. (2024). Computing Education in the Era of Generative AI. Communications of the ACM. https://doi.org/10.1145/3624720
Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., & Prather, J. (2022). The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. Australasian Computing Education Conference (ACE ’22). https://doi.org/10.1145/3511861.3511863
Kazemitabaar, M., Chow, J., Ma, C. K. T., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). Studying the Effect of AI Code Generators on Supporting Novice Learners in Introductory Programming. CHI ’23. https://doi.org/10.1145/3544548.3580919
Prather, J., Reeves, B. N., Denny, P., Becker, B. A., Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-Ansley, J., & Santos, E. A. (2023). “It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers. ACM Transactions on Computer-Human Interaction. arXiv:2304.02491
Liffiton, M., Sheese, B., Savelka, J., & Denny, P. (2023). CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes. Koli Calling ’23. arXiv:2308.06921
Kazemitabaar, M., Ye, R., Wang, X., Henley, A. Z., Denny, P., Craig, M., & Grossman, T. (2024). CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs. CHI ’24. arXiv:2401.11314
Prather, J., Reeves, B. N., Leinonen, J., MacNeil, S., Randrianasolo, A. S., Becker, B. A., Kimmel, B., Wright, J., & Briggs, B. (2024). The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. ICER. https://doi.org/10.1145/3632620.3671116
Tankelevitch, L., Kewenig, V., Simkute, A., Scott, A. E., Sarkar, A., Sellen, A., & Rintel, S. (2024). The Metacognitive Demands and Opportunities of Generative AI. CHI ’24. https://doi.org/10.1145/3613904.3642902
Kumar, H., Musabirov, I., Reza, M., Shi, J., Wang, X., Williams, J. J., Kuzminykh, A., & Liut, M. (2024). Guiding Students in Using LLMs in Supported Learning Environments: Effects on Interaction Dynamics, Learner Performance, Confidence, and Trust. CSCW. https://doi.org/10.1145/3687038
Kazemitabaar, M., Huang, O., Suh, S., Henley, A. Z., & Grossman, T. (2025). Exploring the Design Space of Cognitive Engagement Techniques with AI-Generated Code for Enhanced Learning. IUI ’25. https://doi.org/10.1145/3708359.3712104
Stienstra, L., Mohamed, A., & Mohamed, M. (2025). Exploring GenAI as a Tutoring Tool: A Case Study in First-Year Computer Programming. ITiCSE ’25. https://doi.org/10.1145/3724363.3729060
Shah, A., Chernova, A., Tomson, E., Porter, L., Griswold, W. G., & Soosai Raj, A. G. (2025). Students’ Use of GitHub Copilot for Working with Large Code Bases. SIGCSE ’25. https://doi.org/10.1145/3641554.3701800
Garcia, R., & Craig, M. (2025). 20 Years Later: A Replication Study on Teaching CS1 Concepts. ACM Transactions on Computing Education. https://doi.org/10.1145/3730405
Whalley, J., Settle, A., & Luxton-Reilly, A. (2023). A Think-Aloud Study of Novice Debugging. ACM Transactions on Computing Education. https://doi.org/10.1145/3589004
Fisler, K. (2014). The Recurring Rainfall Problem. ICER ’14. https://doi.org/10.1145/2632320.2632346
Watson, C., & Li, F. W. B. (2014). Failure Rates in Introductory Programming Revisited. ITiCSE ’14. https://doi.org/10.1145/2591708.2591749
Bennedsen, J., & Caspersen, M. E. (2007). Failure Rates in Introductory Programming. ACM SIGCSE Bulletin. https://doi.org/10.1145/1272848.1272879
Porter, L., & Zingaro, D. Learn AI-Assisted Python Programming, Second Edition: With GitHub Copilot and ChatGPT. Manning (Function). Kindle Edition. Available via IUCAT, Bloomington and Indianapolis campuses: ebook https://iucat.iu.edu/catalog/20590350, streaming audio https://iucat.iu.edu/catalog/21587005
Novet, J., & Vanian, J. (2025, April 29). Satya Nadella says as much as 30% of Microsoft code is written by AI. CNBC.
Consortium for Generative AI in CS Education, UC San Diego CREATE. Faculty leads: Leo Porter, Daniel Zingaro, Beth Simon. https://www.teachcswithai.org/
Porter, L. Research on GenAI and Learning To Program [Video]. YouTube. https://www.youtube.com/watch?v=faY5gDTlvs0