Last month, I attended a training session at Sussex about Generative AI and assessment. We had to create a custom GPT for our modules and then use it to complete an assessment task for that module, using various levels and types of prompts. Afterwards, we marked the outputs and reflected on how “automatable” the different types of assessment we use are.
In my case, the initial attempt was a bare pass. There were some correct points, but the answer lacked detail; there were mistakes resulting from literal interpretations of key terms rather than uses in line with the discipline / module tradition; the recommendations lacked contextual relevance; and the answers were in bullet-point format.
After various rounds of additional prompting, including the provision of additional material, I was able to get the answer to a very solid mark. Though, to be fair, the additional prompting would require solid knowledge of the course material (for instance, to detect that the GPT was misusing key terms). It would also require strong critical skills (for instance, to develop contextually relevant recommendations). So, effectively, the student would be coaching the GPT to produce a good answer. And, in my view, a student who could do that kind of coaching would also be able to produce the answer by themselves; though I appreciate that starting with the GPT (versus a blank page) might remove some of the anxiety of doing the assignment and improve the readability of the final output.
Participating in this training session and doing this exercise reminded me of the paper “Falling Asleep at the Wheel: Human/AI Collaboration in a Field Experiment on HR Recruiters” by Fabrizio Dell’Acqua. Dell’Acqua conducted an experiment in which HR experts had to complete a task on their own or with the help of AI (not generative AI). He found that the HR experts performed better when using a less sophisticated AI (75% accuracy) than a more sophisticated one (85% accuracy). Dell’Acqua argues that the detrimental effect of the 85%-accurate AI on the experts’ performance arose because the experts let their guard down when using the more sophisticated AI. They simply accepted the AI’s answer, without using their experience and expertise to judge it. As a result, these experts underperformed the experts who augmented the weaker AI’s answers.
In summary, for the types of tasks tested in university assessment and in Dell’Acqua’s study, the best output is the one that is augmented or authenticated by the user. That means that users still need to develop the relevant skills and acquire the relevant knowledge. That is good news, as it means that, with some tweaking, we can design assessments / tasks that differentiate between those students / experts who know what they are doing and those who are cheating with Generative AI.
However, it also means that users need to “mistrust” generative AI in order to stay engaged with the task and not “fall asleep at the wheel”. And that means developing Gen AI literacy, as well as reminding users of its limitations. As I start planning for next year’s teaching and assessment, it would be great to hear how others have managed to do that. Please do share your examples with me.

