Toxicity

About

Hate speech is a serious and growing problem for online publishers, e-gaming companies, comment moderation platforms, and social media sites. In addition to the ethical reasons for combating online hate speech, governments across the globe are beginning to implement new legislation that requires platforms to remove hateful content within hours or face significant financial penalties.

The Toxicity framework detects the likelihood of toxic language in your data: it estimates the probability that a language sample contains threats, hate speech, offensive language, and general toxicity. In addition to these four categories, we provide 17 Toxicity measures that help you identify the specific source of toxicity in your data.

The Toxicity framework helps to identify language that might be considered obscene, insulting, or offensive, including lewdness, descriptions of sex or desires, direct insults to the reader of the message, or excessive swear words in a context where they aren’t welcome.

Note: The Toxicity framework measures the likelihood that language is toxic, not how toxic it is.

For example, submitting a short text sample returns a response like this:

{
  "plan_usage": {
    "word_limit": 100000,
    "words_used": 1020,
    "words_remaining": 98980,
    "percent_used": 1.02,
    "start_date": "2021-03-16T20:52:15.254118Z",
    "end_date": "2021-04-18T23:59:59Z"
  },
  "results": [
    {
      "response_id": "68150a92-21c8-42f6-8b6e-7a2551f31cf1",
      "request_id": "req-1",
      "language": "en",
      "version": "v1.0.0",
      "summary": {
        "word_count": 17,
        "words_per_sentence": 8.5,
        "sentence_count": 2,
        "six_plus_words": 0.4117647058823529,
        "capitals": 0.02127659574468085,
        "emojis": 0,
        "emoticons": 0,
        "hashtags": 0,
        "urls": 0
      },
      "toxicity_likelihood": {
        "hate_speech": 0.08253253779198318,
        "offensive": 0,
        "threat": 0.14128420102382466,
        "toxicity": 0.4139965950424534
      },
      "toxicity_measures": {
        "authority_structures": 0,
        "body_size_shape": 0,
        "crimes": 0,
        "death": 0,
        "direct_insults": 0,
        "disability": 0,
        "ethnic_origin": 0,
        "gender_sex": 0,
        "other_sociocultural": 0,
        "political_affiliation": 0,
        "political_issues": 0.058823529411764705,
        "race_ethnicity": 0.11764705882352941,
        "religion": 0,
        "sexual": 0,
        "sexual_orientation": 0,
        "swears": 0,
        "violence": 0.058823529411764705
      }
    }
  ]
}
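
The response above can be requested and parsed in a few lines of code. The sketch below is a minimal illustration only: the endpoint URL, authentication header, and payload fields are placeholder assumptions, not the documented request format, so check the API reference for the real values.

# Minimal sketch of requesting Toxicity scores and reading the result.
# The URL, header name, and payload shape below are assumptions for
# illustration; consult the API reference for the actual request format.
import requests

API_URL = "https://api.example.com/v1/toxicity"  # placeholder endpoint
API_KEY = "your-api-key"                         # placeholder credential

payload = {"request_id": "req-1", "content": "Your text sample here."}

response = requests.post(API_URL, json=payload, headers={"X-API-Key": API_KEY})
response.raise_for_status()

result = response.json()["results"][0]
likelihoods = result["toxicity_likelihood"]  # hate_speech, offensive, threat, toxicity
measures = result["toxicity_measures"]       # the 17 fine-grained measures

print(likelihoods["toxicity"], measures["race_ethnicity"])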

Toxicity Measure Categories

Measure: hate_speech
Summary: Provides the probability that a text sample contains an instance of hatred, anger, or disgust regarding any demographic group. Demographic groups are commonly differentiated by race, ethnicity, gender, sexual orientation, or disability. Expressions of racism or sexism are considered hate speech, while insults directed at a person or people with no mention of their race, gender, or other demographic markers are not hate speech.
High score example: "Can't you see why you morons get nowhere?" (0.5)

Measure: offensive_language
Summary: Provides the probability that a text sample contains language that might be considered obscene or offensive. Examples of this might include lewd descriptions of sex or desires, direct insults to the reader of the message, or excessive swear words in a context where they aren't welcome. What is considered offensive in one context (e.g. a particular website, the culture of its users, a point in time) may not be considered offensive in another, so it's important to recognize that this measure is based on generalizations.
High score example: "Wow you're an idiot" (1)

Measure: threat
Summary: Provides the probability that a text sample contains a direct threat of any kind. This threat may be directed at the reader or recipient, a third party who isn't reading the message, or a type of person in general.
High score example: "If you keep that posting content like this, I will find you. And hurt you" (0.67)

Measure: toxicity
Summary: Provides the probability that a text sample contains any content considered undesirable in civil discourse. This may include, but is not limited to, threats, hate speech, offensive language, obscenities, bullying, direct insults, insensitive content, descriptions of violence, or overt sexual content.
High score example: "Your promiscuous behaviour will get you into trouble. That isn't your place." (0.94)
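
Because each category score is a likelihood rather than a severity rating, a common way to act on it is to compare each likelihood against a moderation threshold. The sketch below (in Python) is illustrative only; the threshold values are assumptions that would need to be tuned for your own content and tolerance for false positives.

# Illustrative only: flag a sample for human review when any category
# likelihood crosses a chosen threshold. The thresholds are arbitrary
# assumptions, not values recommended by the framework.
THRESHOLDS = {
    "hate_speech": 0.5,
    "offensive": 0.5,
    "threat": 0.5,
    "toxicity": 0.7,
}

def flag_for_review(toxicity_likelihood: dict) -> list:
    """Return the categories whose likelihood meets or exceeds its threshold."""
    return [
        category
        for category, threshold in THRESHOLDS.items()
        if toxicity_likelihood.get(category, 0) >= threshold
    ]

# Using the likelihoods from the response shown above:
print(flag_for_review({"hate_speech": 0.083, "offensive": 0, "threat": 0.141, "toxicity": 0.414}))
# [] -- no category crosses its threshold for that sample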

Measures per Category: Hate Speech

race_ethnicity: Includes references to skin colour as well as more specific ethnicities. Includes both rude and socially-acceptable references.
ethnic_origin: Includes references to citizenship status, immigration or refugee status, and specific nationalities. Includes both rude and socially-acceptable references.
gender_sex: Includes terms based on biological sex as well as gender identity and gender expression. Includes both rude and socially-acceptable references.
sexual_orientation: Includes terms for all common sexual orientations. Includes both rude and socially-acceptable references.
religion: Includes terms for all major world religions, as well as terms that do not refer to any specific religion but refer instead to types of faith or religious practices. Includes both rude and socially-acceptable references.
disability: Includes terms associated with sensory or mobility impairments, other physical disabilities, mental or psychological health concerns, and substance abuse. Includes both rude and socially-acceptable references.
body_size_shape: Includes terms used for different body sizes and shapes, including both overweight and underweight. Includes both rude and socially-acceptable references.
other_sociocultural: Includes an assortment of social identifiers that people are assigned, such as "hippie" or "redneck". Includes both rude and socially-acceptable references.
political_affiliation: Includes terms used specifically for the political left or right, as well as terms for other political ideologies or affiliations. Includes both rude and socially-acceptable references.

Measures per Category: Threat

crimes: Includes a range of specific crimes, both violent and non-violent, covering both misdemeanors and felonies.
violence: Contains specific references to violence, including war and gang violence, methods of murder, methods of violating or injuring people, and references to weapons.
death: Contains references to death or the deceased, regardless of mechanism or cause of death.
authority_structures: Contains references to specific authority structures or figures, including those found within the military, police force, government, corporations, and educational institutions.
political_issues: Includes specific political issues relevant to current political conversation, focusing primarily on controversial issues and issues that many people consider very important.

Measures per Category: Offensive Language

direct_insults: Includes terms that are predictably used to insult people, as well as terms that insult a person's intelligence, appearance, morality, worth, and sanity.
sexual: Includes terms based on biological sex, as well as commonly used words and phrases related to sexual harassment or unwanted romantic interaction. Includes both rude and socially-acceptable references.
swears: Includes words that may be considered swears, ranging from very mild exclamations acceptable in any context to very obscene language considered unacceptable in most contexts.
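
These groupings can also be used programmatically, for example to report which fine-grained measure is driving each category. The mapping in the sketch below simply restates the tables in this section; the aggregation (taking the highest-scoring measure per group) is an illustrative choice, not part of the framework.

# Group the 17 toxicity_measures by the categories described above and
# report the highest-scoring measure in each group. The grouping mirrors
# the tables in this section; max() is an arbitrary aggregation choice.
MEASURE_GROUPS = {
    "hate_speech": [
        "race_ethnicity", "ethnic_origin", "gender_sex", "sexual_orientation",
        "religion", "disability", "body_size_shape", "other_sociocultural",
        "political_affiliation",
    ],
    "threat": ["crimes", "violence", "death", "authority_structures", "political_issues"],
    "offensive_language": ["direct_insults", "sexual", "swears"],
}

def dominant_measures(toxicity_measures: dict) -> dict:
    """For each category, return the measure with the highest score."""
    summary = {}
    for category, measures in MEASURE_GROUPS.items():
        top = max(measures, key=lambda m: toxicity_measures.get(m, 0))
        summary[category] = (top, toxicity_measures.get(top, 0))
    return summary

# Using the non-zero measures from the first response above:
print(dominant_measures({"race_ethnicity": 0.1176, "political_issues": 0.0588, "violence": 0.0588}))
# {'hate_speech': ('race_ethnicity', 0.1176), 'threat': ('violence', 0.0588),
#  'offensive_language': ('direct_insults', 0)}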

Specs and Examples

The Toxicity framework was designed to assist both human interpreters and machine learning models working on the task of identifying specific types of toxicity in online discourse.

The Toxicity framework contains specific, topical word lists that are unique to the goal of detecting toxicity (such as direct insults to intelligence and racial slurs), as well as LIWC2015’s original grammatical categories (such as personal pronouns and conjunctions) which have been studied at length for their ability to model relevant underlying social dynamics. This means that the Toxicity framework provides information about both the explicit and implicit signals in toxic discourse.

The four main categories - Hate Speech, Threat, Offensive Language, and Toxicity - are derived from machine learning models trained on the measures that correspond to them. Each category score is the probability that a text sample contains some kind of toxic language, not a measure of how severe that toxicity is.
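
As a toy illustration of that pipeline (not the actual models, whose features and weights are not published here), a category score can be thought of as a classifier applied to the vector of measure frequencies:

# Toy illustration only: the real models, features, and weights are not
# published here. This just shows the general shape of the pipeline:
# measure frequencies in, a probability-like category score out.
import math

def toy_category_score(measure_frequencies: dict, weights: dict, bias: float) -> float:
    """Logistic model over measure frequencies; weights and bias are invented."""
    z = bias + sum(weights.get(m, 0.0) * f for m, f in measure_frequencies.items())
    return 1 / (1 + math.exp(-z))

# Invented weights, for illustration only:
weights = {"sexual": 4.0, "direct_insults": 6.0, "swears": 5.0}
frequencies = {"sexual": 1 / 13}  # e.g. one matched word in a 13-word sample
print(toy_category_score(frequencies, weights, bias=-2.0))  # about 0.16 with these numbers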

Example: Your promiscuous behaviour will get you into trouble. That isn't your place.

// partial results
"results": [
  {
    "response_id": "4ba0550e-4df7-4b13-805b-73b23f9b0e06",
    "language": "en",
    "version": "v1.0.0",
    "summary": {
      "word_count": 13,
      "words_per_sentence": 6.5,
      "sentence_count": 2,
      "six_plus_words": 0.23076923076923078,
      "capitals": 0.03076923076923077,
      "emojis": 0,
      "emoticons": 0,
      "hashtags": 0,
      "urls": 0
    },
    "toxicity_likelihood": {
      "hate_speech": 0.5780701545167783,
      "offensive": 0.6897154921639401,
      "threat": 0.7026310100469049,
      "toxicity": 0.9442709017897967
    },
    "toxicity_measures": {
      "authority_structures": 0,
      "body_size_shape": 0,
      ...
      "sexual": 0.07692307692307693,
      ...
    }
  }
]

In the example sentence above, the word "promiscuous" counts towards the sexual measure. The relative frequency for that measure - in this case, one matching word divided by the 13 words in the sample, giving the sexual score of roughly 0.077 shown above - goes into the machine learning model. The model uses the frequencies in each category to predict whether or not the sample contains toxicity. As the scores show, the likelihood that the language in this sample is offensive and threatening is quite high, and the overall toxicity likelihood is especially high, at 0.94.
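
The arithmetic in that walkthrough is easy to reproduce. The sketch below only covers the relative-frequency step described above; the model that turns those frequencies into category probabilities is not reproduced here.

# Reproduce the relative-frequency step described above: one word counted
# towards the "sexual" measure, divided by the 13 words in the sample.
matched_words = 1    # "promiscuous" counts towards the sexual measure
sample_length = 13   # word_count reported in the summary above

sexual_score = matched_words / sample_length
print(sexual_score)  # 0.07692307692307693, matching "sexual" in the partial response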