Preparing Your Data
This page helps you determine which text is valid input and which could skew your results. Our technology performs best on samples of written or spoken language, whether conversational, formal, or informal, drawn from sources such as blog posts, survey responses, social media posts, transcribed calls, short text samples, and text messages.
Raw text works best, meaning that it's unnecessary to tokenize, lemmatize, stem, remove stop words, or remove punctuation.
During data preparation, raw text may need to be aggregated or parsed depending on the goal of your analysis. Raw text should be prepared in a way that corresponds with the level of insight you aim to produce. For example, parse raw text into discrete sentences before calling the API if you aim to produce sentence-level insights; aggregate raw text into paragraphs before calling the API if you aim to produce paragraph-level insights. The API will analyze input text submissions as a whole unit.
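As a rough illustration of parsing versus aggregating, the sketch below prepares the same raw text at two levels. The function names are hypothetical, and the sentence splitter is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), so treat it as a starting point and not a robust segmenter.

```python
import re


def split_paragraphs(raw_text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", raw_text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_sentences(raw_text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(raw_text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


raw = "I loved the demo. Can we talk pricing?\n\nThanks for your time!"
paragraphs = split_paragraphs(raw)  # one API submission per paragraph
sentences = split_sentences(raw)    # one API submission per sentence
```

Each element of `paragraphs` or `sentences` would then be submitted as its own text sample, since the API analyzes every submission as a whole unit.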
Refer to the table below for details surrounding what to include and what to exclude from your text before using the Receptiviti API.
Element | Example(s) | Include? | Action | Comment |
---|---|---|---|---|
Text Encoding | utf-8 | | Encode your text strings in utf-8 | The API currently accepts only JSON as input. JSON is encoded in Unicode, with utf-8 as the default encoding. More details here and here. |
@Mentions | @bigScaryPup | Yes | Leave in your text if this is relevant to your use case. Exclude, if not. | For most use cases, @Mentions are data noise and not natural language and do not indicate underlying psychology or emotion. Currently, an @Mention adds 1 to the word count (wc). |
Hashtags | #lolnotlol | Yes | Retain hashtags: we score them. | Hashtags are separated and parsed by the API, and their individual components count towards the word count. For example, #thiswillbescored is split into "this will be scored" and counts as 4 words. Currently, a hashtag adds the number of tokens it contains to wc and 1 to hashtags. |
Emojis | \xf0\x9f\x8c\xbb 😂😡 | Yes | Retain emojis in your text. | Emojis are visual representations of emotions, common objects and situations. They are powerful tools to uncover psychological and emotional meaning in language. |
URLs | http://receptiviti.com | Yes | Leave in your text if this is relevant to your use case. Exclude, if not. | For most use cases, URLs are data noise and not natural language and do not indicate underlying psychology or emotion. However, if they are relevant to your use case, feel free to leave them in your text. Currently, a URL adds 1 to the wc and to the urls category. |
Email headers | From: [email protected] | No | Remove all email headers, and only use the email body as text. Remember, if your email body is in HTML, follow the instructions below to strip HTML tags from your text. | Email headers are data noise and not natural language for Receptiviti’s metrics. They do not indicate underlying psychology or emotion. They will count towards the total number of words and thereby skew scores. |
Email metadata | Mon, 24 Aug 2020 10:16:07 -0700 (PDT) | No | Remove all email metadata, and only use the email body as text. Remember, if your email body is in HTML, follow the instructions below to strip HTML tags from your text. | Email metadata are data noise and not natural language for Receptiviti’s metrics. They do not indicate underlying psychology or emotion. They will count towards the total number of words and thereby skew scores. |
Email footers and confidentiality disclaimers | Head office: 150 Bloor St. West, Suite 310, Toronto, Ontario | No | Remove all email footers and legal disclaimers from your email. Remember to use only email body as text. | Email footers and confidentiality disclaimers are data noise and not natural language for Receptiviti's metrics. They do not indicate underlying psychology or emotion. They will count towards the total number of words and thereby skew scores. |
HTML | <!DOCTYPE html> | No | Strip all HTML tags and only retain relevant content within the tags e.g., text within the <p> tags could be natural language and therefore valid for analysis. | HTML tags specify formatting, not naturally spoken or written language. The text within some HTML tags may be useful (depending on your application). Tools like BeautifulSoup can help you do this. |
Code | Print("Hello World") | No | Remove all code snippets from your text. | Code snippets are not natural language and do not indicate underlying psychology or emotion. They will count towards the total number of words and thereby skew scores. |
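
For the text encoding row, a minimal sketch of serializing one sample as utf-8 encoded JSON with Python's standard library might look like the following. The `content` field name and the `to_utf8_json` helper are assumptions made for illustration only; check the API reference for the actual request schema.

```python
import json


def to_utf8_json(text):
    """Serialize one text sample as utf-8 encoded JSON bytes."""
    # "content" is a hypothetical field name used only for this illustration.
    payload = {"content": text}
    # ensure_ascii=False keeps emojis and accented letters as real characters
    # instead of \uXXXX escapes; both forms are valid JSON.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


body = to_utf8_json("Loved it 😂 #lolnotlol")
round_trip = json.loads(body.decode("utf-8"))["content"]
```

Decoding the bytes and parsing them again returns the original string, which is a quick way to confirm that emojis and hashtags survive the encoding step unchanged.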
Code Snippets to Remove Unwanted Elements
@Mentions
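```python
import re


def remove_at_mentions(line):
    # Raw string so that \w reaches the regex engine unchanged.
    return re.sub(r"@\w+", "", line)


def scrub_text(text):
    # Remove the mentions, then trim the whitespace left at either end.
    scrubbed_text = remove_at_mentions(text).strip()
    return scrubbed_text


scrubbed_text = scrub_text("@dogwalker I hope you had a nice long walk!")
```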
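One caveat with the pattern `@\w+` is that it also matches the part of an email address that follows the `@`. If your text can contain email addresses, a possible variant (a sketch with hypothetical names, not part of the Receptiviti tooling) only removes an `@` token that is not preceded by a word character or a dot:

```python
import re

# Hypothetical variant: "@" must not follow a word character or a dot,
# so the "@example" inside an email address is left alone.
MENTION_RE = re.compile(r"(?<![\w.])@\w+")


def remove_standalone_mentions(text):
    without_mentions = MENTION_RE.sub("", text)
    # Collapse the whitespace left behind by removed mentions.
    return " ".join(without_mentions.split())


sample = "@dogwalker I hope you had a nice walk! Email jane@example.com"
cleaned = remove_standalone_mentions(sample)
naive = re.sub(r"@\w+", "", sample).strip()  # the simpler pattern, for comparison
```

Here `cleaned` keeps the email address intact, while `naive` turns it into `jane.com`.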
URLs
```python
# Regex from https://gist.github.com/dperini/729294
import re


def remove_urls(line):
    # Raw string so that regex escapes such as \S, \d and \u00a1 reach the re module unchanged.
    URL_REGEX = r"(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[/?#]\S*)?"
    # The gist applies this regex case-insensitively, so do the same here.
    return re.sub(URL_REGEX, "", line, flags=re.IGNORECASE)


remove_urls("have you visited https://www.receptiviti.com/company to find out more")
```
Email headers and Email Metadata
```python
from email import message_from_string


def extract_body_from_email(text):
    """Return the first text/plain, non-attachment body as bytes (None if there is none)."""
    msg = message_from_string(text)
    if not msg.is_multipart():
        return msg.get_payload(decode=True)
    for part in msg.walk():
        content_type = part.get_content_type()
        content_disposition = str(part.get('Content-Disposition'))
        if content_type == 'text/plain' and 'attachment' not in content_disposition:
            # Stop at the first plain-text part so later parts cannot overwrite it.
            return part.get_payload(decode=True)
    return None


text = """Received: by 2002:a05:6000:1188:0:0:0:0 with SMTP id g8csp2511771wrx;
        Mon, 14 Feb 2021 05:11:19 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <[email protected]>
From: Test User <[email protected]>
Date: Mon, 14 Feb 2021 08:11:06 -0400
Subject: Fwd: Spam email from you
To: All full-time employees <[email protected]>
Content-Type: multipart/related; boundary="000000000000754a2e05af44eeac"

--000000000000754a2e05af44eeac
Content-Type: multipart/alternative; boundary="000000000000754a2c05af44eeab"

--000000000000754a2c05af44eeab
Content-Type: text/plain; charset="UTF-8"

As part of our ongoing security awareness, if you see emails like this,
please mark them as phishing.
Note the warning signs - the actual email address doesn't match sender, the
signature isn't right, and there's a giant red warning banner :)
[image: Screen Shot 2020-09-14 at 9.24.41 am.png]
Marking something as phishing is slightly different than using the spam
button in gmail.
The email goes through different processes / to a different team and helps
Google prevent these from landing in our inboxes.
Thanks all!
*Test User* Desig, Nation | Head, Software @ Receptiviti |
This message is intended only for the use of the intended recipients, and
it may be privileged and confidential. If you are not the intended
recipient, you are hereby notified that any review, re-transmission,
conversion to hard copy, copying, circulation or other use of this message
is strictly prohibited and may be illegal. If you are not the intended
recipient, please notify me immediately by return email and delete this
message from your system. Thank you.

--000000000000754a2c05af44eeab
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div>As part of our ongoing security awareness, if you see
emails like this, please mark them as phishing. <br></div>
<div>Note the warning signs - the actual email address doesn't match sender, the signature isn't right, and there's a giant red warning banner :)</div>
<div><br></div>
<div><img alt=3D"Screen Shot 2020-09-14 at 9.24.41 am.png" src=3D"cid:1748c828c6b7ef93e481" width=3D"542" height=3D"309"></div>
<div><font face=3D"Times"><span style=3D"font-size:12px">This message is intended only for the use of the
intended recipients, and it may be privileged and confidential. If you are not the intended
recipient, you are hereby notified that any review, re-transmission,
conversion to hard copy, copying, circulation or other use of this message
is strictly prohibited and may be illegal. If you are not the intended
recipient, please notify me immediately by return email and delete this
message from your system. Thank you.</span></font></div></div>

--000000000000754a2c05af44eeab--
--000000000000754a2e05af44eeac
Content-Type: image/png; name="Screen Shot 2020-09-14 at 9.24.41 am.png"
Content-Disposition: inline; filename="Screen Shot 2020-09-14 at 9.24.41 am.png"
Content-Transfer-Encoding: base64
Content-ID: <1748c828c6b7ef93e481>
X-Attachment-Id: 1748c828c6b7ef93e481

--000000000000754a2e05af44eeac--
"""

extract_body_from_email(text)
```
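Extracting the plain-text body removes headers and metadata, but a footer or confidentiality disclaimer inside the body remains. One possible approach, sketched below, is to cut the body at the first phrase that marks the start of a footer. The marker phrases and the `strip_footer` name are hypothetical; you would replace them with wording that actually appears in your own emails. The function expects a `str`, so decode the extracted bytes first.

```python
import re

# Hypothetical marker phrases; replace with the wording your senders use.
FOOTER_MARKERS = [
    r"this message is intended only for",
    r"head office:",
]
FOOTER_RE = re.compile("|".join(FOOTER_MARKERS), re.IGNORECASE)


def strip_footer(body):
    """Cut the body at the first footer marker, if one is present."""
    match = FOOTER_RE.search(body)
    if match:
        return body[:match.start()].rstrip()
    return body.strip()


email_body = (
    "Thanks all!\n\n"
    "This message is intended only for the use of the intended recipients, "
    "and it may be privileged and confidential."
)
clean = strip_footer(email_body)
```

Cutting at the first marker is a blunt rule: it assumes nothing worth analyzing follows the disclaimer, which may not hold for replies quoted below a signature.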
HTML
```python
# pip3 install bs4 lxml
# The default "lxml-xml" parser needs lxml; pass parser="html.parser" to use
# Python's built-in HTML parser instead.
from bs4 import BeautifulSoup


def strip_html_basic(message_string, parser="lxml-xml"):
    soup = BeautifulSoup(message_string, parser)
    # Drop <style> blocks, whose contents are CSS rather than natural language.
    for tag in soup("style"):
        tag.decompose()
    # Join the remaining text nodes with newlines, trimming whitespace around each.
    plain = soup.get_text("\n", strip=True)
    return plain


html_doc = """<html><head><title>The Receptiviti Story</title></head>
<body>
<p class="story">Every word counts...
<a href="http://receptiviti.com/i" class="link">I</a>,
<a href="http://receptiviti.com/fountain" class="link">Fountain</a> and
<a href="http://receptiviti.com/puppy" class="link">Puppy</a>;
When language first emerged, they were not made equally.</p>
<p class="story">...</p>
"""

strip_html_basic(html_doc)
```
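The table above also lists code snippets as content to remove. How to detect code depends heavily on your source, so the sketch below covers only one case and rests on a stated assumption: that code in your text is delimited Markdown-style, with triple-backtick fences or inline backticks, as is common in forum posts and chat exports. The names are hypothetical.

```python
import re

# A Markdown code fence is three backticks; it is built here from a single
# backtick so that this example does not contain a literal fence itself.
FENCE = "`" * 3
FENCED_BLOCK_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_CODE_RE = re.compile(r"`[^`\n]+`")


def remove_code(text):
    """Drop fenced blocks and inline code spans, then normalize whitespace."""
    text = FENCED_BLOCK_RE.sub(" ", text)
    text = INLINE_CODE_RE.sub(" ", text)
    return " ".join(text.split())


sample = (
    "The script keeps crashing and I am so frustrated.\n"
    + FENCE + "python\nprint('Hello World')\n" + FENCE + "\n"
    "Running `pip install` again did not help."
)
cleaned = remove_code(sample)
```

Note that the final whitespace normalization also flattens paragraph breaks, so apply any sentence or paragraph splitting before this step if you need those boundaries.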