AI hate speech tools show gaps as online abuse spreads

Artificial intelligence systems are taking on more of the work of policing hate speech online, but researchers and experts say the systems often disagree on what should be flagged. The gap matters because abuse that once spread in smaller offline circles can now move quickly through anonymous accounts on major platforms, according to Al Jazeera.

The issue drew renewed attention on June 18, when the United Nations marked the International Day for Countering Hate Speech. UN Secretary-General Antonio Guterres has warned that social media platforms are intensifying the threat, Al Jazeera reported.

How hate speech is defined

The UN defines hate speech as communication that discriminates against a person or group or encourages violence against them. According to the UN, it can be spoken, written or behavioural, and can target actual or perceived identity, including race, ethnicity, religion, gender, sexual orientation or disability.

The UN also says hate speech is not confined to words. Images, cartoons, gestures and objects can fall under the definition, depending on how they are used.

A 2023 UNESCO-Ipsos survey of 8,000 people in 16 countries found that 67 percent of internet users had come across some form of hate speech online. In the same survey, 33 percent of respondents said LGBTQI people faced the most hate speech, followed by ethnic and racial minorities at 28 percent and women at 18 percent.

Platform moderation is shifting

Meta removed fewer hateful posts in late 2025 than a year earlier, according to figures reported by Al Jazeera. In the fourth quarter of 2025, the company took down 1.3 million posts from Instagram and 1.3 million from Facebook, compared with 7.4 million Instagram posts and 5.8 million Facebook posts in the fourth quarter of 2024. Based on those figures, removals fell by roughly 82 percent on Instagram and 78 percent on Facebook.

Al Jazeera reported that the decline came as Meta moved away from proactive hate speech detection and placed more weight on user reports. TikTok, by contrast, said that 96.3 percent of the hate speech and hateful content it removed in the fourth quarter of 2025 was taken down before users reported it.

Why AI systems disagree

Social media companies increasingly use AI moderation tools built with large language models to check large volumes of posts, Al Jazeera reported. These systems typically rely on labelled datasets and pretrained language models, then use rules or score cutoffs to decide whether content breaches platform policies.
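The score-cutoff step can be illustrated with a minimal Python sketch. It assumes a model has already returned a hate speech score between 0 and 1. The two cutoff values and the middle "human_review" tier are hypothetical, chosen for illustration, and do not describe any platform's actual policy.

```python
# Hypothetical cutoffs for illustration only; platforms set their own
# values and do not necessarily use this two-tier layout.
REMOVE_CUTOFF = 0.8
REVIEW_CUTOFF = 0.5


def decide(score: float) -> str:
    """Map a model's hate speech score (0 to 1) to a moderation action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= REMOVE_CUTOFF:
        return "remove"
    if score >= REVIEW_CUTOFF:
        return "human_review"
    return "allow"


if __name__ == "__main__":
    for s in (0.92, 0.61, 0.12):
        print(s, decide(s))
```

In a setup like this, the outcome depends as much on where the cutoffs sit as on the model itself: moving the removal cutoff changes which posts come down even when the scores stay the same.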

A 2025 University of Pennsylvania study found wide differences in how AI moderation systems identified and ranked hate speech. The researchers tested seven systems, including models from OpenAI, Anthropic, DeepSeek, Mistral and Google, and found inconsistencies across categories and demographic groups.

According to the study, Mistral’s Moderation Endpoint often assigned test examples scores close to 1 on a 0-to-1 severity scale, marking many items as highly hateful across target groups. OpenAI’s Moderation Endpoint often gave lower scores in many categories, in some cases less than half the score assigned by other systems, Al Jazeera reported.

The University of Pennsylvania researchers said such disagreement can weaken trust in moderation decisions when identical content is removed by one system but allowed by another.
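A short Python sketch shows how that kind of split can arise. The system names, the scores and the single cutoff are all hypothetical. They are not figures from the University of Pennsylvania study and only mimic the pattern of one system scoring near 1 while another scores less than half as high.

```python
# Hypothetical removal cutoff shared by both systems.
CUTOFF = 0.5

# Illustrative scores for the same post; NOT data from the study.
example_scores = {"system_a": 0.95, "system_b": 0.40}


def decisions(scores: dict[str, float], cutoff: float) -> dict[str, str]:
    """Return each system's action for one post under a shared cutoff."""
    return {
        name: "remove" if score >= cutoff else "allow"
        for name, score in scores.items()
    }


def systems_disagree(scores: dict[str, float], cutoff: float) -> bool:
    """True when the systems reach more than one distinct action."""
    return len(set(decisions(scores, cutoff).values())) > 1


if __name__ == "__main__":
    print(decisions(example_scores, CUTOFF))
    print(systems_disagree(example_scores, CUTOFF))
```

With these made-up numbers, the same post is removed by one system and allowed by the other, even though both apply an identical cutoff.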

Context remains a problem

Arkaitz Zubiaga, an associate professor at Queen Mary University of London and co-lead of its Social Data Science lab, told Al Jazeera that AI can catch direct abuse, including slurs aimed at protected groups, but struggles with indirect hate speech.

Zubiaga said some hateful posts may begin with positive language before turning against a demographic group, making them harder to catch for systems that focus on surface wording. He also said AI can make the opposite mistake by flagging reclaimed language, where communities reuse, in non-hateful ways, words once used to demean them.

This story draws on original reporting from Al Jazeera.