These are the notes from a talk I delivered recently about the importance of data quality, and how to assess it.
In my talk, I started by noting the critical role of data as a source of insight and, subsequently, as an enabler of service automation. I then went on to note that data are not only a key input but also an output of service automation: every interaction generates more data (for instance, whether a product recommendation is accepted or not), which feeds back into the analytics cycle.
This is not a new phenomenon, of course. For instance, personalised recommendations have long existed in the form of staff suggestions, based on an employee’s knowledge of a customer’s past purchases or their real-time reading of that customer’s preferences and needs. However, the phenomenon has reached mass scale because we are constantly generating digital traces, which machine learning can then use to generate automated interactions. This combination of big data and machine learning promises to deliver novel insights, cost savings and branding benefits.
But what are the pitfalls of this combination of big data and machine learning?
To explore this, I revisited the findings from the gender and money project.
As discussed here, when you search the leading image banks for the keywords “Money” plus “Men”, or “Money” plus “Women”, the top results (i.e., the images most frequently used) depict each gender very differently as financial citizens. Of course, you and I can see that these are not factual depictions. However, an algorithm does not know truth from falsehood. It only cares which image (or word, or number) is statistically most likely to be linked to a certain label. So, tools like generative AI take this faulty representation of a phenomenon (in this case, gender and money) and produce a response that is, of course, faulty too. It doesn’t matter how powerful the algorithm is: if the dataset is erroneous, the result will be erroneous.
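To make that mechanism concrete, here is a minimal sketch in Python. The (label, depiction) counts are entirely invented for illustration; the point is that a purely frequency-based learner can only return the depiction most often associated with a label, with no notion of whether that association is true.

```python
from collections import Counter

# Hypothetical, deliberately skewed (label, depiction) pairs, standing in
# for the stock-photo search results discussed above.
observations = (
    [("men", "counting banknotes")] * 70
    + [("men", "window shopping")] * 30
    + [("women", "window shopping")] * 80
    + [("women", "counting banknotes")] * 20
)

def most_likely_depiction(label: str) -> str:
    """Return the depiction most frequently paired with `label`.

    This is all a purely statistical learner "knows": frequency of
    association, not truth or falsehood.
    """
    counts = Counter(dep for lab, dep in observations if lab == label)
    return counts.most_common(1)[0][0]

print(most_likely_depiction("men"))    # counting banknotes
print(most_likely_depiction("women"))  # window shopping
```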
That is, the quality of the dataset is crucial for the quality of the analytical exercise and the outputs it produces.
But what is data quality, and how do we measure it?
Based on the work of Kahn, Strong and Wang (2002), I proposed four dimensions of data quality, arising from the combination of two distinctions:
- First, we need to distinguish between quality from the point of view of what is produced vs what we need. For instance, a sarcastic comment made about a product on Twitter may have been perfect for fulfilling the author’s goal of poking fun at the product. However, it will be useless for assessing what that user really thinks about the product.
- Second, we need to distinguish between quality from the point of view of data as a product vs access to those data. For instance, how the data are shared will determine how incomplete or biased the available datasets are. Moreover, if my systems can’t access, process or store certain data (for instance, emojis), the quality of the available dataset is reduced further (see the sketch below).
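Here is the sketch for that second point, assuming a hypothetical pipeline whose storage layer only handles Latin-1. The tweet is invented; the encoding behaviour is standard Python.

```python
# A storage layer that cannot represent emojis silently drops them,
# so the dataset is already incomplete before any analysis begins.
tweet = "Great phone 😡🔥 lasted two whole days"

stored = tweet.encode("latin-1", errors="ignore").decode("latin-1")

print(stored)           # Great phone  lasted two whole days
print(stored == tweet)  # False: the sentiment signal in the emojis is gone
```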
It’s not because a dataset is big that it will be good. And it is not because we used a sophisticated algorithm that the decision will be fine. Poor data quality leads to poor insight and poor decisions. Thus, it is important that organisations audit the quality of their datasets. I hope that this multi-dimensional approach to thinking about data quality will help organisations develop a holistic approach to assessing the quality of their datasets.
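As a closing illustration, here is a minimal, hypothetical scaffold for such an audit. The four quadrant names (sound, useful, dependable, usable) come from Kahn, Strong and Wang (2002); every individual check below is a placeholder of my own invention, which an organisation would replace with its own tests.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # toy schema: (author, comment)

def complete(rows: List[Record]) -> bool:
    # "Sound": the data conform to specification (no empty fields).
    return all(author and comment for author, comment in rows)

def fit_for_task(rows: List[Record]) -> bool:
    # "Useful": the data serve *our* question, not just the producer's
    # purpose (placeholder: e.g., filter out sarcasm, spam, bots).
    return len(rows) > 0

def delivered_reliably(rows: List[Record]) -> bool:
    # "Dependable": the data service met its delivery spec (placeholder).
    return True

def processable(rows: List[Record]) -> bool:
    # "Usable": our systems can store and process the data; here, a
    # Latin-1 storage layer would corrupt any comment containing emojis.
    return all(
        c.encode("latin-1", errors="ignore").decode("latin-1") == c
        for _, c in rows
    )

CHECKS: Dict[str, Callable[[List[Record]], bool]] = {
    "sound": complete,
    "useful": fit_for_task,
    "dependable": delivered_reliably,
    "usable": processable,
}

def audit(rows: List[Record]) -> Dict[str, bool]:
    """Report a pass/fail verdict for each quality quadrant."""
    return {quadrant: check(rows) for quadrant, check in CHECKS.items()}

sample = [("user1", "Great phone 😡🔥"), ("user2", "")]
print(audit(sample))
# {'sound': False, 'useful': True, 'dependable': True, 'usable': False}
```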