Quality and ethical concerns over the use of ChatGPT to analyse interview data in research

A few weeks ago, I was asked to review a paper that had used ChatGPT to code product reviews. The authors had entered the reviews into ChatGPT and instructed it to summarise the key reasons for complaints about the product. To assess the quality of ChatGPT’s classification, the authors extracted a number of the complaints, as classified by GPT, and verified whether they had been correctly interpreted. They concluded that the differences between GPT’s classification and their own were minimal and thus deemed GPT’s interpretation of the reviews to be accurate.

Since reviewing that paper, I have been thinking about the potential use of generative AI for my own research and teaching. I, personally, do a lot of textual coding, be it interview transcripts, social media comments, or papers for a literature review. It takes many, many, many hours to code the interview transcripts from a project, even using data analysis software like NVivo. It is a tedious task. It is time-consuming and expensive. And there is always the risk that I (or the research assistant analysing the texts) misunderstood the text, or the intentions behind what someone said. Then, there is the hard work of convincing reviewers that our interpretation of the data is correct. Not to mention students’ dismay when I tell them that, even if they are using software to analyse interview transcripts, they need to go through the text, line by line.

Thus, it would be great if I could use a tool like ChatGPT to dramatically reduce the time that it takes to analyse textual data, and to improve the quality of that analysis (at least, in the eyes of the reader / reviewer).

Bastian Burger, Dominik K. Kanbach, Sascha Kraus, Matthias Breier and Vincenzo Corvello must have been thinking about the same thing, because they recently published a paper on this topic, in the European Journal of Innovation Management. In the paper entitled “On the use of AI-based tools like ChatGPT to support management research”, the authors list various advantages of using Generative AI tools like ChatGPT to analyse interviews, conduct literature reviews and so on:

“The most important of which is reducing human error. Unlike humans, AI will never skim data due to tiredness or be distracted, ensuring repeatable results if the input is provided thoroughly and correctly… Another advantage is that AI – as well as most other computational systems – can deliver repeatable results. With the proper parameters, the AI will always produce the same responses given the same history and input. This consistency is especially valuable as it eliminates the possibility of human variability in interpreting research. Furthermore, AI can offer a second set of eyes, providing an additional layer of precision in research protocol without requiring additional time, resources, or money. While AI may not reach the same accuracy as double-blind research conducted by multiple researchers, it offers a cost-effective alternative for organizations looking to improve their research process.”

However, in my view, a recent paper published in Nature casts doubt on the above claim that Generative AI is consistent and reliable. In the paper “ChatGPT’s inconsistent moral advice influences users’ judgment”, Sebastian Krügel, Andreas Ostermaier and Matthias Uhl show that ChatGPT is not consistent in its reading of a text. Slight variations in the phrasing of the prompt produce dramatically different interpretations of the underlying text. 

Specifically, Krügel and his team tested how ChatGPT would solve the famous trolley dilemma, a thought experiment in which you must decide whether to sacrifice one person in order to save the lives of five others. To do that, they asked the same question with slight variations in the wording. For instance, “What is the right thing to do if I had to choose between letting five people die or accepting the death of another person instead?” vs “Would it be right to sacrifice one person to save five others?”.

If ChatGPT read texts consistently, we would expect the same decision in each scenario. Ideally, we would like it to not make a moral statement about whether A or B is correct, because generative AI programmes built on large language models do not know whether something is right or wrong. But if it were to answer the question one way or the other, then we would expect it to always make the same recommendation.

However, in the study by Krügel and colleagues, ChatGPT’s recommendation varied depending on the wording of the question: sometimes it argued for sacrificing one life to save five, at other times against it.
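The same sensitivity check is easy to run on our own coding prompts: send several paraphrases of the same question and see whether the normalised answers agree. In the sketch below, `ask_model` is a hypothetical stand-in for a real LLM API call, stubbed with canned answers so the comparison logic itself can be inspected offline; with a real API client, only the stub would change.

```python
# Sketch: probing a model's sensitivity to prompt wording.
# ask_model is a hypothetical stub standing in for a real LLM API call;
# the canned answers mirror the inconsistency reported by Krügel et al.

def ask_model(prompt: str) -> str:
    # Stub: a real implementation would call the LLM API here.
    canned = {
        "Would it be right to sacrifice one person to save five others?": "Yes",
        "What is the right thing to do if I had to choose between "
        "letting five people die or accepting the death of another "
        "person instead?": "No",
    }
    return canned[prompt]

def consistency(prompts) -> bool:
    """True if every paraphrase yields the same normalised answer."""
    answers = {ask_model(p).strip().lower() for p in prompts}
    return len(answers) == 1

paraphrases = [
    "Would it be right to sacrifice one person to save five others?",
    "What is the right thing to do if I had to choose between "
    "letting five people die or accepting the death of another "
    "person instead?",
]
print(consistency(paraphrases))  # False: the wording changed the verdict
```

Running the real version several times per paraphrase would also expose run-to-run variability, which the repeatability claim quoted above assumes away.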

Likewise, we might expect that, when faced with different wording about our object of study, ChatGPT might classify it differently. For instance, it might place two similar product reviews in distinct categories because the reviewers are from different demographic or socio-economic groups, or because one of them used irony.

To guard against this problem, Burger et al recommend checking the classification done by the generative AI tool, similar to what the authors of the product review paper that I reviewed had done. Indeed, that is what we might do, for instance, with automated sentiment analysis. However, this “after the event” check has a problem of its own: choice blindness.

Choice blindness refers to our tendency to keep defending a choice even after it has been swapped, without our noticing, for the option that we initially rejected. I invite you to listen to this podcast discussion about the phenomenon, or watch this TED talk, or read this paper or article. It means that, when we choose a type of jam, a person or a policy over another and, then, that jam / person / policy is replaced with the one that we had rejected, not only do we fail to notice the change, but we also come up with elaborate justifications for why the second option (the one that we had rejected, initially) is a great choice!

Translating this bias to the context of using Generative AI to analyse texts in research, it would mean that, if we had chosen to use ChatGPT to analyse our data because we deemed it to be “better” (more reliable, more consistent, more thorough, less subjective…) than humans at classifying text, then we would also be predisposed to agree with its classification of the texts, when verifying the output. Moreover, based on the trolley dilemma study by Krügel and colleagues, we would under-estimate our likelihood of being influenced by ChatGPT’s classification.

In their study, Krügel and colleagues exposed participants to advice regarding the trolley dilemma, and asked them to make a judgment about whether or not one person should be sacrificed to save five. In some cases, participants were exposed to advice to sacrifice the person, and in other cases they were exposed to advice against it. Moreover, sometimes they were told that the advice had been written by a human, and other times that it had been written by ChatGPT (though, in all cases, the answer had been produced by ChatGPT). They were also asked what their decision would have been, without the advice.

The researchers found that:

  • The advice changed some participants’ opinion of whether it was right or wrong to sacrifice one person to save five. This is problematic because, while generative AI may have many strengths, it is unable to weigh factors to make a decision, much less one involving a moral dilemma like whether to kill someone. So, ideally, we would like participants to not change their opinion when reading the texts.
  • More worryingly, some participants were influenced in their decision even when they knew that the source of advice for the ethical dilemma had been ChatGPT. We would expect the mention of the tool to have led participants to ignore its conclusion, and to stick with their original assessment of the right thing to do in this ethical dilemma. The finding that people follow ChatGPT’s advice even when they know that it was created by a machine undermines the hope that disclosing when generative AI has been used to create disinformation campaigns would help us fight this problem.
  • Participants under-estimated the influence of the advice on their own decision making. That is, they over-estimated the extent to which they had chosen a particular outcome objectively and independently of the GPT recommendation.

Going back to the use case of ChatGPT to analyse interview transcripts and other textual research data, these findings suggest that we might, indeed, be influenced by choice blindness to agree with ChatGPT’s classification of the data. Moreover, we would over-estimate our ability to not be influenced by it.

There is also the very significant concern that, when we upload interview transcripts into a tool like ChatGPT, we are putting our participants’ privacy at risk. In principle, the data that we upload is retained by the company behind that tool, and is used to continue training it. This means that we neither know how our participants’ data are going to be used, nor can we meet the promise made in our informed consent forms that the data will be destroyed at the end of the project.

What does this all mean for the prospect of using ChatGPT or other generative AI products to analyse textual data in research?

I think the main message is that, for the time being, we need to proceed with caution. At the very least, we need to select the sample for verification, and classify that sample, before running it through GPT. When there are differences, we should read that as a failure of the automated coding, not of the human coder.
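This check can be tightened into a pre-registered protocol: draw and hand-code the verification sample first, then compare the human codes against the AI’s codes with a chance-corrected agreement statistic such as Cohen’s kappa, so that disagreement is measured rather than rationalised away. A minimal pure-Python sketch, with invented category labels for illustration:

```python
# Sketch: measuring human-vs-AI coding agreement with Cohen's kappa.
# Hand-code the verification sample BEFORE seeing the AI labels, then
# compare; the labels below are invented for illustration.

from collections import Counter

def cohens_kappa(human, ai):
    """Chance-corrected agreement between two coders on the same items."""
    assert len(human) == len(ai) and human
    n = len(human)
    observed = sum(h == a for h, a in zip(human, ai)) / n
    # Expected agreement if both coders labelled at random,
    # each with their own marginal category frequencies.
    h_counts, a_counts = Counter(human), Counter(ai)
    expected = sum(
        (h_counts[c] / n) * (a_counts[c] / n)
        for c in set(human) | set(ai)
    )
    return (observed - expected) / (1 - expected)

human_codes = ["price", "quality", "delivery", "price", "quality", "price"]
ai_codes    = ["price", "quality", "price",    "price", "delivery", "price"]

print(round(cohens_kappa(human_codes, ai_codes), 2))  # 0.43
```

A kappa well below conventional reliability thresholds would flag the sample for a full human re-read rather than a spot check of the AI output.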

Furthermore, we might consider testing the performance of different generative AI tools. Sentiment analysis tools vary in their performance according to the context in which they are applied. Likewise, generative AI tools vary in their performance in different tasks, as discussed in this very interesting piece in The Verge.

Moreover, we need to disclose to research participants when we plan to use generative AI tools to analyse their data. And, for data collected before that disclosure is included in consent forms, we need to go back to participants and get their explicit permission to use generative AI to analyse their data. Failing that, we should not be using those tools to analyse those datasets.

For my own and my students’ benefit, I would love to hear about the successful use of ChatGPT or similar tools to analyse research data in a way that guarantees the integrity of the project and the rights of the participants. What are your thoughts and experiences in this field?

