In case you missed my first post, I am blogging an unpublished paper as a series of posts over several days. You can read that post to understand the story and the reasoning behind this. Comments on each post are welcome. This is the seventh post, and it covers Attitudes Towards Automated Essay Grading, as part of the Findings section. Part 1 covered the abstract/references, Part 2 the intro/lit review, Part 3 the methodology/positionality, Part 4 Findings: General Attitudes Towards AI, Part 5 Attitudes Towards Turnitin, and Part 6 Attitudes Towards Teacher Bots.
Findings
Attitudes Towards Automated Essay Grading
Participants were quite skeptical of this application of AI. None of them could imagine it working well for the more difficult task of understanding diverse ways of writing and diverse structures. They could see it working for grammar and the more technical aspects of writing, but not for actually assessing writing quality. Most mentioned a willingness to work with it as a first line of assessment before a human looked at the writing in more depth. A few expressed concern that an automated grader would reproduce biases and expect more standard or dominant modes of writing, and would therefore wrongly downgrade writing that was less common or more creative.
AUC1 and AUC5’s immediate reaction to this was a firm “no”. AUC5 believed that a human may need to read something several times to understand it correctly, and that a machine would miss nuances, perhaps misunderstand colloquial language, and might not be able to judge whether someone had made the perfect word choice rather than merely a correct one. They also said “when dealing with students, you have a history”; for example, you know whether you have taught them a particular word or structure before and can refer to that.
AUC3 reminded us that even when two humans grade the same paper, there are discrepancies, sometimes of up to a full grade in either direction. AUC1 raised a similar concern:
“I teach and I know how difficult… interrater reliability and bias and standardizing [are] and I feel the writing part has to have a human element involved. You can do this [use software] in grammar, count mistakes, but for, like, ideas and doing thesis statement and details and examples… You can’t do that. I wouldn’t trust the number. Could be like a starting point to filter, then [you would need] a second eye, second grader”, a human one.

Several participants (e.g. AUC1, AUC2, SAU1, SAU4) were comfortable allowing an automated essay grading system to give preliminary feedback before a human looked at the essay, what SAU2 called a “balance” of “moderating” it afterwards. SAU4 put it this way: “don’t take the human out, but take the boring stuff that the human has to do”, so that face-to-face time is reserved for higher-value human interaction; they likened this to the rationale for flipped classroom teaching.
SAU4 had some experience teaching writing. When asked about automated essay grading, she said:
“On the face of it, ‘ew’… coz you want a human to be reading… but research I found is that machine grading compared to human grading is not as far off. We valorize how good human grading is in the first place.”
Her point was that in large first-year classes, grading is often done by less experienced tutors, and students either don’t receive quality feedback or it takes so long to reach them that it is too late to be useful. SAU3 made a similar point: students are probably not getting a great experience as it is. Both SAU3 and SAU4 suggested we collect evidence on the quality of these tools and test them before we judge.
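An aside from me, not something the participants discussed in these terms: when researchers do collect that kind of evidence, agreement between machine scores and human scores is commonly summarized with a statistic such as quadratic weighted kappa, which also captures the human-human discrepancies AUC3 mentioned. The sketch below is a minimal, hypothetical illustration of that metric; the grade scale and the example scores are invented purely for illustration.

```python
# Minimal sketch of quadratic weighted kappa, the agreement statistic commonly
# used when automated essay scores are compared against human raters.
# The 1-5 grade scale and the example scores below are hypothetical.

import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_grade, max_grade):
    """Agreement between two raters on an ordinal grade scale.
    1.0 = perfect agreement; 0.0 = no better than chance."""
    n = max_grade - min_grade + 1

    # Observed confusion matrix: how often rater A gave grade i while rater B gave grade j.
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_grade, b - min_grade] += 1

    # Expected matrix under independence (outer product of each rater's grade histogram),
    # scaled to the same total count as the observed matrix.
    hist_a = observed.sum(axis=1)
    hist_b = observed.sum(axis=0)
    expected = np.outer(hist_a, hist_b) / observed.sum()

    # Quadratic weights: disagreements are penalised by the squared grade distance.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical example: two raters (human-human, or human vs. machine) grading
# the same ten essays on a 1-5 scale, disagreeing by one grade on a few of them.
rater_1 = [3, 4, 2, 5, 3, 4, 3, 2, 4, 5]
rater_2 = [3, 3, 2, 4, 3, 4, 4, 2, 4, 4]
print(round(quadratic_weighted_kappa(rater_1, rater_2, 1, 5), 3))
```

The same calculation applied to two human raters gives a baseline, which is essentially the point SAU4 raised: before asking whether the machine matches human grading, it is worth measuring how well human graders match each other.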
SAU3 and AUC5 expressed concern about how these services would assess the writing of non-native speakers: whether they would be trained on more dominant ways of writing or expression, and whether they might unnecessarily mark down correct but less commonly used phrasing, or colloquial language. In a similar vein, AUC3 was concerned that the software would be biased towards one standard way of writing, because people have different writing styles and “it’s not necessary that every A paper looks the same”; she worried such software would expect exactly that. SAU1 was concerned that if the knowledge bank used to train the AI were “Northern or Western”, so that it disempowered the expression of local knowledge in local ways and instead assumed “standardized means a particular [Western or Northern] discourse”, then it would be “quite problematic”.
Participants at AUC all mentioned particular aspects they felt the software would be unable to judge, such as nuance, metaphor, context, and actual meaning beyond how the sentences look on the surface. A couple of people at both institutions mentioned Grammarly as a tool that did help with grammar, since grammar is a little more rule-based and easier to work on than other aspects of language that are more nuanced and culturally dependent.
While AUC4 recognized that even human grading of essays could be highly problematic, they remained skeptical of automated essay grading, because we should always ask “does it make education better?” before we try something, and should not use a tool “just because you can”.
SAU5 said “I suppose it really challenges what we understand as a teacher and the position of the teacher and lecturer. It’s quite a new role that is being developed and one needs to kind of see how that feeds into our understanding of teaching and learning.”
That’s it for now – what do you think? Will we someday have tools that help students write their papers automatically, only for those papers to be graded automatically and plagiarism-checked automatically? What the heck? It does not feel like an exaggeration to me to imagine this slippery slope!
Photo by Setyaki Irham on Unsplash. This is another one that is not literal: I thought that when I searched for “grading” I’d find something with a teacher’s markings over a paper, but I got odd things. This one showed up because of “color grading”, and it reminded me of how much color is used in things like Turnitin.com, so it looks kinda pretty and a bit messy.
When I saw the photo above this installment of your article, I thought the circles looked like targets. I immediately thought: what are we targeting in our students’ writing? Grammar? Cohesion? Content? Style? Voice? Word choice? Academic honesty? Spelling? …

So we as teachers need to give some thought to what we want to prioritize when we teach and assess writing, and presumably many of us focus on more than one aspect, because the whole is greater than the sum of its parts. Can software look at all of the parts of writing and assess the whole? Not yet. Can it assess specific aspects? Some, yes, but not always accurately. Will the software become more accurate as it is developed further? Most likely, but if at that point teachers use software that can assess grammar, punctuation, and spelling, will we be tempted to prioritize these aspects at the expense of equally important aspects of writing? If we ask our students to use this kind of software to check their grammar, will they focus more on grammar, spelling, and punctuation and less on the message they are conveying? And will only certain language varieties be considered “correct”? In other words, will we lose voices in this conversation, or will those voices be forced to adapt their writing style to the ones that are in the programming?
Exactly that! Will students and teachers then emphasize what is measurable (as all measurement biases us to find those things more valuable)? Or will we start valuing what is human and not automatable? The point you made about everyone starting to write the same way and losing their voice is also spot on! Are we creating replicas?