top of page

Maggie, the Librarian, and Jim, the CEO: Exploring Repeatability of Gender Stereotyping in ChatGPT Responses

  • Writer: Elianna Gadsby
    Elianna Gadsby
  • Jul 27
  • 10 min read

Author: Pooja Karthikeyan


Introduction

Since their emergence, AI technologies, particularly ChatGPT, have become commonplace in all aspects of our lives. Because of its incredible performance capabilities, this data-driven generative AI tool has been employed in various industries, including education, the workplace, entertainment, and others. However, it has limits. Because ChatGPT is trained on vast and publicly available data sets, it may be prone to stereotypes, factual errors, bias, or disinformation [12].


Bias in ChatGPT has been a challenge, particularly in highlighting or maintaining gender bias. Occupational gender bias has long existed in our society, with certain jobs stereotypically assigned to women and men. For example, stereotypically, female jobs include baby sisters, nurses, and receptionists, whereas male jobs include construction workers, electricians, and firefighters.


Since ChatGPT and other AI tools are used to generate books, stories, and other content, these tools are in a unique position to influence gender discrimination that is already plaguing our society. Such perpetuation of gender bias can be disruptive and detrimental to our society [5]. Furthermore, occupational stereotypes can considerably influence children's capacity to make professional decisions. Children begin to believe in gender stereotypes at an early age. For example, research indicated that girls in third grade ranked themselves lower than boys in mathematics, even though their test results were identical [10]. As a result, it is critical to understand how consistently ChatGPT promotes gender stereotypes.


ChatGPT has been repeatedly proven to reinforce or amplify gender stereotypes [3] [6] [8]. In their study of gender bias and stereotypes in large language models, Kotek et al. discovered that LLMs made biassed assumptions about men's and women's vocations, portraying doctors as males and nurses as women. Such gender bias was also evident in the recommendation letters generated by ChatGPT [6], which highlighted males in leadership positions and females in supportive roles.


Even though there is a wide consensus towards the presence of gender bias in ChatGPT, there have been mixed findings on the degree of bias towards stereotypically male or female-dominated jobs. Spillner (2024) investigated gender prejudice in short stories generated by ChatGPT and discovered that the stereotypes about professions generated by ChatGPT outperformed the human occupational gender bias. Furthermore, they found substantial disparities in how gender preconceptions were enhanced for typically male and female occupations, particularly amplifying existing prejudices in female-dominated jobs. On the other hand, Babaei et al. (2024) explored gender bias by directly prompting names for female or male-dominated professions for a play. Even when the prompt was repeated many times, they found that the ChatGPT responses were gender-balanced and politically correct for the occupations they tested. One of the reasons for these mixed results may be due to the variability and reduced consistency associated with large language models like ChatGPT.


Since the influence of ChatGPT in specific and generative AI, in general, is increasing exponentially, it is important to have an unambiguous understanding of their role in the perpetuation of gender bias. ChatGPT-3.5, the free version, is frequently shown to generate larger and less concise prompt responses [7]. Systematic studies on the consistency and uniformity of ChatGPT responses have shown that the outputs from ChatGPT can vary even when the same prompt is used [9] [13]. Moreover, ChatGPT 3.5 has been shown to have poor response consistency when compared to ChatGPT 4 [4]. Some amount of randomness and chaos is inevitable in the AI generative algorithms in large language models. However, such variations in the ChatGPT output can generate confusion in interpreting the findings and lead to inconclusive understanding. The present study aims to examine the consistency and repeatability of gender bias in ChatGPT prompts if and when it occurs.


Methodology

To investigate the gender bias in ChatGPT, 11 stereotypically male-dominated and 11 stereotypically female-dominated professions based on the published reports [2] [11] were selected. Consistency metrics reveal the stability of the model’s behaviour under repeated testing. The consistency of ChatGPT responses was assessed by measuring responses across multiple iterations of the same prompt.


For each profession, ChatGPT 3.5 was instructed to describe the character and the personality traits. For example, the prompt “I am writing a story and the protagonist is a nurse. Describe the character and personality traits” was the input. A total of 22 prompts were presented each day. To evaluate the presence of bias with certainty and to ensure that the generated responses are not a random single phenomenon that could have occurred on a particular day for any number of reasons, the 22 ChatGPT prompts were presented continuously for 10 days. The order of the prompt queries was randomised to prevent standardised responses.


The task was to evaluate if the stereotype related to the profession was held by the ChatGPT 3.5. If the ChatGPT held a stereotype (i.e., the protagonist with the male or female-dominated profession was addressed by he/him or she/her, respectively), then the response was assigned a rating of 1. If the stereotype was not held or if the protagonist was addressed by "they/them,” then the response was assigned a rating of 0.


A repeated measures ANOVA was administered to evaluate gender stereotypes in ChatGPT responses for female vs. male-dominated professions across 10 days. Python program was used to conduct the statistical analysis. The initial step involved converting the list values into arrays, a process achieved using the Numpy function np. Array. The code used two essential libraries, Numpy and Scipy. Numpy facilitated the efficient manipulation of arrays, while Scipy provided an assortment of statistical functions.


Results

Gender bias in ChatGPT 3.5 responses was examined for 11 stereotypically feminine and 11 stereotypically male-dominated professions over 10 days. Gender bias was assessed by determining if a stereotype commonly associated with female- or male-dominated jobs is supported. A rating of 1 or 0 was given when the stereotype for each gender was held or not, respectively. A stark difference in the gender stereotyping of female and male-dominated professions in the ChatGPT responses emerged across different days.


Figure 1 shows the gender stereotype held by ChatGPT 3.5 in percentage for stereotypically female-dominated professions (red) and male-dominated professions (blue) for the same prompt repeated for 10 days for each profession. The figure provided below shows that the gender stereotype for all female-dominated professions was held repeatedly for all 10 days. On the other hand, the presence or absence of gender tereotypes for male-dominated jobs varied across 10 days. When the presence or absence of gender stereotypes in ChatGPT responses was averaged over 10 days for female and male-dominated jobs, the stereotype was held 100% for female jobs and 63% for male jobs (Figure 2), demonstrating a considerable gender bias for female jobs.

Figure 1: Stereotype held by ChatGPT 3.5 in percentage for stereotypically female-dominated professions (red) and male-dominated professions (blue) when repeated for 10 days.
Figure 1: Stereotype held by ChatGPT 3.5 in percentage for stereotypically female-dominated professions (red) and male-dominated professions (blue) when repeated for 10 days.
Figure 2: Mean stereotype held by ChatGPT 3.5 in percentage for stereotypically female-dominated professions (red) and male-dominated professions (blue) for prompt repeated for 10 days.  The error bars represent standard deviation of the mean.
Figure 2: Mean stereotype held by ChatGPT 3.5 in percentage for stereotypically female-dominated professions (red) and male-dominated professions (blue) for prompt repeated for 10 days.  The error bars represent standard deviation of the mean.

A repeated measures ANOVA was administered to find if ChatGPT responses in female vs. male-dominated professions upheld the stereotype across 10 days. The statistical test showed that gender bias in ChatGPT responses significantly (F = 4.64; p <0.005), indicating that gender stereotyping of female and male-dominated professions differed dramatically across days. Figure 1 shows the variation (highest 81% to lowest 45%) in the percentage of stereotypes held for male jobs across 10 days. Post-hoc tests following Bonferroni corrections further showed how stereotypes in ChatGPT responses differed across days. For some days, like days 2, 4, or 8 (Figure 1), gender bias in ChatGPT responses for male-dominated jobs was as gender biassed as those for female-dominated jobs (p > 0.005), while on other days (days 3, 7, or 10), the responses were significantly less gender-biassed than female jobs (p < 0.005).


Figure 3 below shows stereotypes held in percentage by ChatGPT for individual professions across 10 days. Figure 4 shows the distributions of pronouns (he, she, or they) assigned for different professions across 10 days.

Figure 3: Stereotype held by ChatGPT 3.5 in percentage for individual stereotypically female-dominated (red) and male-dominated professions (blue).
Figure 3: Stereotype held by ChatGPT 3.5 in percentage for individual stereotypically female-dominated (red) and male-dominated professions (blue).
Figure 4: Percentage distribution of gender pronouns - He (blue), She (red) and They (grey) assigned by ChatGPT 3.5 across 11 stereotypically male-dominated professions
Figure 4: Percentage distribution of gender pronouns - He (blue), She (red) and They (grey) assigned by ChatGPT 3.5 across 11 stereotypically male-dominated professions

As seen in Figure 3, all 11 stereotypically female-dominated professions were consistently assigned to females, and the pronoun she/her was used. On the other hand, of the 11 typically male-dominated professions, only two professions—cab driver and construction worker—were consistently considered as professions for males (Figure 3) and were assigned masculine pronouns “he/him “(Figure 4) by ChatGPT for all 10 days, while financial analyst and sales manager were considered more gender-neutral jobs and mostly assigned the pronoun “they.” Notice in Figure 3 that for stereotypically male-dominated jobs, gender bias was highest for professions that require more manual labour skills and are often referred to as “blue-collared jobs like a construction worker, cab driver, or electrician, and bias reduced and reversed as the job shifted to professions that need advanced education like CEO and civil engineer.


Discussion

The study explored the presence and repeatability of occupational-related gender bias in ChatGPT 3.5-generated responses. The present study showed that even though ChatGPT responses have an undeniable gender bias for stereotypically female-dominated jobs, the gender bias for male-dominated jobs varied when repeated across 10 days.


A trend of strong maintenance of gender bias in female jobs but a variable, sometimes reversible gender bias for male jobs has been reported by Spillner (2024). This may reflect the current societal trend where it is easier and more acceptable for the female population to make their way into male-dominated fields than for males to seek stereotypically female jobs. For example, it is more acceptable for females to become firefighters and CEOs than for males to become elementary teachers, caregivers, or librarians. The current findings emphasise how the stereotypes and gender bias upheld by society are further perpetuated by the ChatGPT.


Moreover, the shifting of the gender bias from male to female in the stereotypically male jobs was seen more for the professions that need advanced education, like CEO and civil engineer, than professions that require more manual labour skills, like a construction worker, cab driver, or electrician. Similar findings were reported by Spillner (2024). The study showed that the male gender bias remained for stereotypically “blue-collared” male jobs, while the bias reversed and female characters were assigned to the “white-collared” jobs. The study’s author attributes the bias reversal to a possible effect of correction and sometimes overcorrection of gender bias in language models triggered by awareness and human feedback.


It was of interest to see if ChatGPT gender bias is maintained or changes when repeated over time for female or male-dominated jobs. The same prompt for each profession was repeated in new sessions for 10 days to strengthen the accuracy of the response and account for any randomness that would influence a single data point. It was interesting to see that gender bias for female jobs selected in this study was 100% repeatable. However, gender bias for male jobs varied across 10 days. Some days the gender bias in the ChatGPT responses between male and female jobs was comparable, and some days there was significantly reduced bias for male jobs compared to female jobs. This finding brings attention to the variability in ChatGPT responses and the need to exercise caution when interpreting the findings based on a single ChatGPT response.


Conclusion

Generative AI like ChatGPT is a powerful tool that has a lot of potential to enhance all walks of our lives. Moreover, the influence of such generative AI tools is only going to increase exponentially in the future. Therefore, it is important to understand the role of ChatGPT in perpetuating gender bias in our society. In our present study, the consistency of the gender bias in ChatGPT outputs, if present, was explored by studying multiple iterations of the same prompt over a period of time. Our study concludes that outputs of ChatGPT behave differently for stereotypically male-dominated jobs than female-dominated jobs. Our findings consistently show that there is an undeniable bias for female jobs, while bias for male jobs varies on different days. The study throws light on the importance of using multiple responses rather than a single data point while interpreting results from large language models like ChatGPT. Further research is needed to understand the repeatability and variability in AI tools like ChatGPT and extend these findings in interpreting and correcting for the gender bias in these tools.


References

  1. Babaei, Golnoosh, David Banks, Costanza Bosone, Paolo Giudici, and Yunhong Shan. "Is ChatGPT More Biased Than You?." Harvard Data Science Review (2024). https://doi.org/10.1162/99608f92.2781452d

 

  1. Business News Daily. "Gendered Jobs Are on the Decline, But Stereotypes Remain." Business News Daily, October 19, 2023. https://www.businessnewsdaily.com/10085-male-female-dominated-jobs.html.

 

  1. Busker, Tony, Sunil Choenni, and Mortaza Shoae Bargh. "Stereotypes in ChatGPT: an empirical study." In Proceedings of the 16th International Conference on Theory and Practice of Electronic Governance, pp. 24-32. 2023. https://doi.org/10.1145/3614321.3614325

 

  1. Funk, Paul F., Cosima C. Hoch, Samuel Knoedler, Leonard Knoedler, Sebastian Cotofana, Giuseppe Sofo, Ali Bashiri Dezfouli, Barbara Wollenberg, Orlando Guntinas-Lichius, and Michael Alfertshofer. "ChatGPT’s response consistency: A study on repeated queries of medical examination questions." European Journal of Investigation in Health, psychology and Education 14, no. 3 (2024): 657-668. https://doi.org/10.3390/ejihpe14030043

 

  1. Gross, Nicole. "What ChatGPT tells us about gender: a cautionary tale about performativity and gender biases in AI." Social Sciences 12, no. 8 (2023): 435. https://doi.org/10.3390/socsci12080435 

 

  1. Kaplan, Deanna M., Roman Palitsky, Santiago J. Arconada Alvarez, Nicole S. Pozzo, Morgan N. Greenleaf, Ciara A. Atkinson, and Wilbur A. Lam. "What’s in a Name? Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT." Journal of Medical Internet Research 26 (2024): e51837. https://doi.org/10.2196/51837 

 

  1. Karakose, Turgut, Murat Demirkol, Ramazan Yirci, Hakan Polat, Tuncay Yavuz Ozdemir, and Tijen Tülübaş. "A conversation with ChatGPT about digital leadership and technology integration: Comparative analysis based on human–AI collaboration." Administrative Sciences 13, no. 7 (2023): 157. https://doi.org/10.3390/admsci13070157

 

  1. Kotek, Hadas, Rikker Dockum, and David Sun. "Gender bias and stereotypes in large language models." In Proceedings of the ACM collective intelligence conference, pp. 12-24. 2023. https://doi.org/10.1145/3582269.3615599

 

  1. Lainwright, Nehoda, and Moyat Pemberton. "Assessing the response strategies of large language models under uncertainty: A comparative study using prompt engineering." (2024).

 

  1. Morales, Danielle Xiaodan, Sara Elizabeth Grineski, and Timothy William Collins. "Racial/ethnic and gender inequalities in third grade children’s self-perceived STEM competencies." Educational studies 49, no. 2 (2023): 402-417. https://doi.org/10.1080/03055698.2020.1871324

 

  1. My Perfect Resume. "Gendered Jobs: Exploring Career Stereotypes [2023 Study]." MyPerfectResume.com, March 20, 2023. https://www.myperfectresume.com/career-center/careers/basics/gendered-jobs.

 

  1. Ray, Partha Pratim. "ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope." Internet of Things and Cyber-Physical Systems 3 (2023): 121-154. https://doi.org/10.1016/j.iotcps.2023.04.003

 

  1. Shin, Euibeom, and Murali Ramanathan. "Evaluation of prompt engineering strategies for pharmacokinetic data analysis with the ChatGPT large language model." Journal of Pharmacokinetics and Pharmacodynamics 51, no. 2 (2024): 101-108. https://doi.org/10.1007/s10928-023-09892-6

 

  1. Spillner, Laura. "Unexpected Gender Stereotypes in AI-Generated Stories: Hairdressers are Female, But so are Doctors." In Text2Story@ ECIR, pp. 115-128. 2024.

 

Comments


YSJournal is A Registered Charity In England And Wales; #1188626

© 2023-2024 by YSJournal.

bottom of page