UK students due to sit exams this spring (for instance, A-level exams for university entrance) saw their examinations cancelled as a result of the Covid-19 pandemic and the need to avoid group gatherings. Instead, as explained on the gov.uk website:
“For each student, schools and colleges have provided a ‘centre assessment grade’ for each subject – the grade they would be most likely to have achieved had exams gone ahead – taking into account a range of evidence including, for example, non-exam assessment and mock results.
To make sure that grades are fair between schools and colleges, exam boards are putting all centre assessment grades through a process of standardisation using a model developed with Ofqual, the independent qualifications regulator”
However, it soon transpired that the predicted grades submitted by the teachers were out of kilter with previous years. According to the BBC, “they were so generous that it would have meant a huge increase in top grades, up to 38%.”
Therefore, the government developed an algorithm to calculate students’ grades, which considered the information provided by the teachers as well as a number of contextual factors, such as:
- The school’s past performance – the reasoning being that schools don’t usually experience sharp changes in performance from one year to the next;
- The school / class size – the reasoning being that teachers of small classes knew their students better than those teaching large groups, so the former’s predictions were deemed more reliable than the latter’s.
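The class-size reasoning above can be sketched in code. To be clear, this is a toy illustration of the *idea* – blending a teacher’s prediction with the school’s history, with more weight on the teacher the smaller the class – and not the regulator’s actual model, which is far more elaborate. The function name, the 15-pupil threshold and the numeric grade scale are all invented for illustration.

```python
def standardise(teacher_prediction, school_historical_mean, class_size,
                small_class=15):
    """Toy standardisation: blend the teacher's predicted grade with the
    school's historical average grade. The smaller the class, the more
    weight the teacher's prediction gets (capped at 100%)."""
    teacher_weight = min(1.0, small_class / class_size)
    return (teacher_weight * teacher_prediction
            + (1 - teacher_weight) * school_historical_mean)

# A 10-pupil class: the teacher's prediction stands unchanged.
print(standardise(7.0, 5.0, class_size=10))   # 7.0
# A 30-pupil class: the prediction is pulled halfway towards school history.
print(standardise(7.0, 5.0, class_size=30))   # 6.0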
[Here is a short explanation of what an algorithm is]
The explanation of the methodology used by the regulator to calculate the grades is available here. Mind you, the document has over 300 pages!
As a result of using the algorithm to determine students’ grades, more than a third of the grades were adjusted downwards.
Because of the criteria used to adjust the grades, the downward adjustments mostly – though not exclusively – affected students enrolled in schools or exam centres in deprived areas. Such schools tend to do less well in national exams, and they tend to have larger classes, than schools in leafy suburbs. So their pupils suffered a double penalty, which, understandably, resulted in outrage. The A-levels debacle is, many pointed out, an example of how algorithms reinforce biases and discrimination, because they are based on historical data. By adjusting students’ grades downwards based on the nature and location of their schools (among other factors), the algorithm indirectly adjusted grades based on social class and ethnicity. [If you want to learn more about how and why algorithms perpetuate the past, check this TED talk by Cathy O’Neil.]
But that is not the only fault. In addition to the problem of perpetuating past biases, algorithms have the problem of legitimising ambiguity.
Here is why.
In order to make calculations, algorithms draw on data. These data are like pixels in a picture: tiny crumbs of information brought together to create an image – the profile – of the data subject.
Some of these data can be directly linked to the phenomenon being profiled, but others can’t, and the interpretation of the latter is loaded with assumptions, or even prejudice. For instance, a number on a thermometer’s display has a direct relationship to the temperature of the “object” being monitored – for instance, air temperature. In turn, a wilted outdoor plant could indicate a very hot day… but it could also indicate a lack of water, or of soil nutrients. So an outdoor plant is an ambiguous – and thus low-quality – indicator of air temperature in a particular location. Sure, air temperature can have an effect on how a plant looks, but it is not a determinant.
Likewise, class sizes can have an impact on students’ performance in national exams, as can economic deprivation, but those are circumstantial factors. They are not determinants of any individual student’s performance. And, because of that, they make for poor predictors of an individual student’s performance (even if, overall, there is a strong correlation between those contextual factors and the group’s average performance).
Don’t get me wrong: students learning in those schools are at a disadvantage compared with those learning in small, well-resourced schools, with food security and a warm home to return to. Hence, on average, the grades of the first group will be lower than those of the second group. But the keyword here is “average”. We can’t infer that any individual student in the first group will perform worse than any individual student in the second one, because there are other factors that affect a student’s performance in exams (just as air temperature is not the only reason why a plant may look droopy).
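The “average versus individual” point can be made concrete with a small simulation. All the numbers here are invented: two groups of students whose exam scores differ in average, but with the same spread within each group.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical exam scores: group A (deprived-area schools) has a lower
# mean than group B (well-resourced schools).
group_a = [random.gauss(60, 10) for _ in range(1000)]
group_b = [random.gauss(70, 10) for _ in range(1000)]

# The group-level pattern is clear...
print(statistics.mean(group_a) < statistics.mean(group_b))  # True

# ...yet a sizeable share of group A students still outscore the
# *average* group B student, so group membership alone cannot decide
# any individual's grade.
share_above = sum(s > statistics.mean(group_b) for s in group_a) / len(group_a)
print(f"{share_above:.0%} of group A score above group B's mean")
```

With these made-up parameters, roughly one in six group A students beats the group B average – exactly the students a purely group-based adjustment would wrong.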
What this all means is that, when using algorithms for decision making, we need lots and lots of data points which are unambiguously linked to the phenomenon being modelled. In the case of the A-levels algorithm, this would mean lots of data points about each individual student’s performance – both past grades and the grades’ trajectory. Failing that, we would need a model with all possible factors affecting a student’s performance – not just the contextual factors of class size and school history, but also individual-level factors such as sleep and eating habits, hours spent studying, and so on. That would be hard and expensive for the regulator. But the alternative – the social and economic cost of the false negatives – is disastrous for the individuals in question.
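To show what an individual-level predictor could look like, here is a minimal sketch that extrapolates a student’s own grade trajectory. It assumes grades sit on a simple numeric scale and that a straight-line trend is a reasonable model – both assumptions of mine, not the regulator’s method.

```python
def predict_next_grade(past_grades):
    """Extrapolate a student's own grade trajectory one step ahead,
    using an ordinary least-squares straight-line fit.
    Needs at least two past grades to establish a trend."""
    n = len(past_grades)
    if n < 2:
        raise ValueError("need at least two past grades")
    x_mean = (n - 1) / 2                      # time points are 0, 1, ..., n-1
    y_mean = sum(past_grades) / n
    slope = (sum((x - x_mean) * (y - y_mean)
                 for x, y in enumerate(past_grades))
             / sum((x - x_mean) ** 2 for x in range(n)))
    return y_mean + slope * (n - x_mean)      # projected value at time n

# A student improving steadily from 5 to 7 is projected to reach 8.
print(predict_next_grade([5, 6, 7]))   # 8.0
```

Unlike class size or postcode, this input is unambiguously about the student being graded – which is precisely the kind of data the A-levels model lacked.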
What this debacle has shown us is that, if we don’t have good data, then we shouldn’t be using algorithms, because the result is a myth with a veneer of legitimacy.