# import libraries
import pandas as pd
import spacy
import nltk
nltk.download('punkt', quiet=True) # download punkt (if not already downloaded)
nltk.download('stopwords', quiet=True) # download stopwords (if not already downloaded)

# change this to your own data directory
data_dir = "data/"

# read and preprocess data
text_file_name = "osdg-community-data-v2023-01-01.csv"
text_df = pd.read_csv(data_dir + text_file_name, sep="\t", quotechar='"')
col_names = text_df.columns.values[0].split('\t')
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
text_df = text_df.astype({'sdg':int, 'labels_negative': int, 'labels_positive':int, 'agreement': float}, copy=True)
text_df.drop(text_df.columns.values[0], axis=1, inplace=True)

14.2. Preprocessing#

Natural language processing (NLP) refers to the use of computers to process, analyze, and interpret human language. NLP often involves machine learning (ML), which is explained in more detail in section 4. Natural language processors typically operate on both speech data and textual data. Several components are necessary to understand the structure and meaning of human language.

From a linguistics perspective, it is important to look at the following:

  • syntax, the actual rules that define how words are combined to form understandable sentences;

  • semantics, referring to the meaning behind the phrases and sentences formed by syntax;

  • morphology, referring to the ways in which words can take different forms.

Then, from a computational perspective, we can take the rules from these linguistics aspects and transform linguistic knowledge into rule-based and/or ML-based algorithms to solve problems related to natural language processing.

Exercise 1.1

Explore the Lexalytics NLP demo found at https://www.lexalytics.com/nlp-demo/. Choose any of the provided packs and look at the provided sample texts.

  • Look under Document and Themes. What words and phrases does the model highlight for each sample?

  • Look under Topics. What topics does the model detect in each sample? How accurate and comprehensive is its list?

14.2.1. Tasks#

When preprocessing textual data, we first must perform certain detection and analysis tasks; this ensures that the text we work with makes sense and is meaningful. The following all help ensure data cleanliness:

  • Syntactic analysis: Does the grammar make sense? Is the text grammatically correct?

  • Semantic analysis: Are we aware of the meaning of the words in context? Do we understand the structure, word interaction, and related concepts of the text?

  • Keyword extraction: What are the most important words in the text? (See the sketch after this list.)

  • Named entity recognition: How do we identify and extract entities (names, organizations, addresses, and places) and the relationships between them?

  • Text classification: Can we organize our text into predefined categories? Ways of doing this include:

    • Sentiment analysis: Can we classify text (such as customer feedback) into positive and negative feedback?

    • Email filtering: If we are taking our texts from emails, can we classify email text as spam mail and remove spam?

    • Intent detection: What is the person generating the text trying to achieve? For example, searching “apple” could indicate an intent to buy, eat, or research apples.

    • Language detection: Can we classify the body of text into languages, with an associated probability?
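As a rough illustration of the keyword-extraction task above, the following sketch counts word frequencies after removing stop words. The example sentence is invented for illustration, and real keyword extraction usually involves more sophisticated weighting (such as TF-IDF).

# a minimal keyword-extraction sketch: count the non-stop-word tokens in a sentence
# (the example sentence is invented for illustration)
example_sentence = "Clean water and sanitation are essential to sustainable development and to public health."
stop_words = set(nltk.corpus.stopwords.words('english'))
word_tokens = [t.lower() for t in nltk.word_tokenize(example_sentence) if t.isalpha()]
keyword_counts = nltk.FreqDist(t for t in word_tokens if t not in stop_words)
print(keyword_counts.most_common(5))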

Exercise 1.2

Take another look at the Lexalytics NLP demo found at https://www.lexalytics.com/nlp-demo/. Choose any of the provided packs and look at the provided sample texts.

  • Look under Document and Themes, which present the model’s sentiment analysis. What sentiment scores does the model give to the various words and phrases it has extracted? Do you agree or disagree with its verdicts?

Looking at the UN SDG dataset, we find that it includes counts of positive and negative labels. Since each text is classified with a single goal, positive labels indicate agreement of that text with the goal, while negative labels indicate disagreement. We use sum() to total the negative labels and the positive labels:

text_df.labels_negative.sum() + text_df.labels_positive.sum()
277524

Compare this to the number of text samples in the dataset:

len(text_df)
40062

You may notice that the sum of the positive and negative labels does not equal the number of texts in the dataset. This is because the texts have been labeled by more than one volunteer. As the table below shows, between 3 and 990 volunteers labeled each text, with an average (and median) of about 7 volunteers per text. The columns labels_positive and labels_negative give the total number of positive and negative labels assigned to each text by these volunteers.

(text_df['labels_positive'] + text_df['labels_negative']).describe(percentiles=[0.5])
count    40062.000000
mean         6.927363
std         15.578931
min          3.000000
50%          7.000000
max        990.000000
dtype: float64

With this in mind, determining a text’s agreement with the assigned SDG goal will be more complicated than simply reading the positive and negative labels. These observations thus provide some beginning rationale for the necessity of preprocessing.

The data preprocessing we will need to perform includes some tasks specific to textual data, but it also includes some work that can be done on datasets more generally. For our UN SDG dataset, it is useful to remove the columns that we don’t need, which for this dataset are any columns with N/A values in all entries. We can use the following code to do so:

# save the original number of columns in a variable
ncols_old = text_df.shape[1]
# remove columns comprised solely of N/A values
text_df.dropna(axis=1, how='all', inplace=True)
print("Number of columns removed:", ncols_old - text_df.shape[1])
Number of columns removed: 0

In this case, there were no columns comprised solely of N/A values.

It is also a good idea to check our data for any other discrepancies, such as missing or incorrect entries.

Exercise 1.3

  • Modify the code above to remove any rows that contain only N/A values. (Hint: change the value of the axis parameter.) How many rows, if any, were removed?

  • Using the isna() and any() functions, check the data for any remaining N/A values. Is it a good idea to remove entries with N/A values?

For our UN SDG dataset, we want texts that are clearly classified into a single goal. Positive labels indicate agreement with the labeled goal, while negative labels indicate disagreement. With this in mind, our filtering will remove the rows with labeling agreement less than or equal to 0.5, as well as the rows where the number of positive labels minus the number of negative labels is less than or equal to 2.

text_df = text_df.query("agreement > 0.5 and (labels_positive - labels_negative) > 2")
text_df.reset_index(inplace=True, drop=True)

Exercise 1.4

Use text_df.info() to examine information about the dataset after running the preprocessing in this section. What columns are included? What data types are there, and how many entries? Does this seem like a reasonable size of data to work with?

14.2.2. Levels of Processing#

In (almost) any language, there are distinct levels of a text we can focus on: characters are strung together to form words, which are put together into sentences, then paragraphs, documents, and a whole set of documents, or a corpus. In all these cases, we put multiples together to make the next level, and decisions in preprocessing at one level will inherently impact the next level.

We will not be focusing on using algorithms to process whole documents or anything above that level, so we can look at preprocessing at the remaining levels: characters, words, sentences, and paragraphs.

For characters, we typically process textual data by removing special characters and punctuation and by normalizing characters. However, this is not appropriate for every NLP approach: in some cases, removing special characters or punctuation, or normalizing the text, will completely alter the meaning at higher levels of the text.

For words, we first have to identify the words within our text snippets. This is known as segmentation. In English, we can use white space to segment text into words, but this is not always the case in other languages, so it is important to know the linguistic rules of the primary language you are working with. Additionally, we remove and/or replace irregular tokens such as abbreviations, acronyms, numbers, and misspelled words. Finally, we normalize our text data; this is typically done by stemming (covered later in this section) or lemmatization.
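As a rough sketch of this kind of word-level normalization, the code below lemmatizes a few word forms with NLTK's WordNet lemmatizer. Note that this requires the 'wordnet' corpus (and, on some NLTK versions, 'omw-1.4'), an extra download not included in the setup at the top of this section.

# a minimal lemmatization sketch using NLTK's WordNet lemmatizer
# (requires the 'wordnet' corpus, an extra download)
nltk.download('wordnet', quiet=True)
lemmatizer = nltk.stem.WordNetLemmatizer()
print(lemmatizer.lemmatize('watches', pos='v'))  # verb forms reduce to 'watch'
print(lemmatizer.lemmatize('geese'))             # irregular nouns map to their dictionary form, 'goose'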

For sentences, we first have to define sentence boundaries. In most languages, including English, sentences begin with a capital letter and end with a period. However, periods and capital letters are also used within sentences, and capital letters or periods might not even exist in other languages, which again highlights the importance of knowing the language’s rules. Within sentences, we can also mark phrases. In English sentences, we can typically label phrases with a subject, predicate, and object; the subject performs the action, or predicate, on or directed towards the object. Finally, we parse the sentence by tagging the words with their respective parts of speech. Note that this can only be done at the sentence level, not the word level, as the same word can have a different part of speech depending on how it is used in a sentence.
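To make this concrete, the sketch below splits a short invented passage into sentences and tags the words of one sentence with their parts of speech using NLTK. The POS tagger requires the 'averaged_perceptron_tagger' resource, an extra download not included in the setup above.

# a minimal sketch of sentence segmentation and part-of-speech tagging
# (requires the 'averaged_perceptron_tagger' resource, an extra download)
nltk.download('averaged_perceptron_tagger', quiet=True)
passage = "Dr. Smith studies water policy. She watches the debates closely."
sentences = nltk.sent_tokenize(passage)  # the period in "Dr." is usually not treated as a sentence boundary
print(sentences)
print(nltk.pos_tag(nltk.word_tokenize(sentences[1])))  # tag each token in the second sentence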

Finally, for paragraphs, the primary goal of preprocessing is to extract the meaning of the text. This can include sentiments, emotions, intentions, and so on. A good way of doing this is abstractive summarization, which involves constructing new text that concisely captures the meaning of the original.

The actual process of text preprocessing, however, does not necessarily flow distinctly from one level to another. The order of preprocessing steps, as well as whether or not to include specific steps, depends largely on the application. An example process is given here:

  1. Segmentation (breaking text into sentences)

  2. Spelling correction

  3. Noise removal (includes removal of text that would otherwise confuse the main text, including emojis, foreign language, and hyperlinks)

  4. Language detection (identifying the language used in a body of text)

  5. Stop-words removal (these words are typically high frequency, generic, and less context-specific)

  6. Case-folding (normalizing variations in case, such as lowercase, uppercase, titlecase, and so on)

  7. Lemmatization

  8. Tokenization (breaking text into words, phrases, symbols, and other semantically useful or meaningful units)

  9. Parsing (part of speech identification)

  10. Standardization

  11. Stemming (reducing tokens to their base forms, or stems, by removing affixes)

Many of these steps may be unfamiliar; that is perfectly fine, as they will be covered later in this section.

Preprocessing text is widely accepted to improve the accuracy of downstream models, so it is critical to choose good, appropriate preprocessing steps.
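Before looking at individual procedures, here is a minimal sketch that chains a few of the steps listed above (case-folding, tokenization, noise and stop-word removal, and stemming) on one invented sentence. A real pipeline would choose and order its steps to suit the application.

# a minimal preprocessing sketch: case-folding, tokenization, noise/stop-word removal, stemming
sentence = "The governments agreed on new policies for renewable energy investments."
tokens = nltk.word_tokenize(sentence.lower())                        # case-folding and tokenization
stop_words = set(nltk.corpus.stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # drop punctuation and stop words
porter = nltk.stem.porter.PorterStemmer()
print([porter.stem(t) for t in tokens])                              # stemming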

14.2.3. Procedures#

Stemming#

As mentioned previously, stemming is performed to help normalize our textual data by reducing tokens to their base forms. This is done so that we do not end up with many different words with very similar meanings. For example, stemming turns “watch”, “watched”, “watching”, and “watches” into “watch”. Most stemmers do this through rule-based suffix stripping: they trim common affixes (such as “-ed”, “-ing”, and “-es”) so that the different forms of a word map to the same stem.
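For instance, a quick check with NLTK’s Porter Stemmer (compared with other stemmers below) reduces all four forms to the same stem:

# reduce different forms of "watch" to a common stem
porter = nltk.stem.porter.PorterStemmer()
print([porter.stem(w) for w in ['watch', 'watched', 'watching', 'watches']])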

Stemming is also typically combined with other preprocessing tasks, such as case-folding (converting words to lowercase) and removing stop words: common words like “the” and “that” which show up in a variety of contexts and typically do not change meaning across these contexts. The NLTK library contains a set of English stop words, which the following code prints out. Notice how many of these stop words are common words like pronouns, prepositions, and conjunctions.

print(nltk.corpus.stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

These preprocessing steps can be very useful, but there are always irregularities. Take the following examples of how case-folding and stop-word removal might transform sentences:

  • “I bought an Apple” => change to lowercase => “I bought an apple”

    • Here, case-folding has removed the capital on “Apple”. We can clearly see that this drastically altered the meaning; now, the speaker has bought a fruit instead of a product from the company Apple.

  • “I watched The Who” => remove stop words => “I watched”

    • Here, stop-word removal has dropped “the” and “who”, which again changes the meaning of the sentence. Instead of watching the band The Who, the speaker has now watched something unspecified.

  • “Take that!” => remove stop words => “Take”

    • Like the above example, stop-word removal has dropped “that”. However, the sentence now loses all meaning: it is no longer a complete sentence, and without the “that” which gave the original sentence its combative intent, that emotion and intent are completely gone.
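The sketch below reproduces the last two examples with NLTK’s stop-word list. The exact result depends on the list used; NLTK’s list also contains “i”, so it removes slightly more than the examples above suggest.

# demonstrate how stop-word removal can change or destroy meaning
stop_words = set(nltk.corpus.stopwords.words('english'))
for sentence in ["I watched The Who", "Take that!"]:
    kept = [w for w in nltk.word_tokenize(sentence) if w.lower() not in stop_words]
    print(sentence, '=>', kept)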

Developed by Martin Porter in 1980, the Porter Stemmer is one of the oldest and most widely used stemming algorithms for English. It can be somewhat slow, so another stemming algorithm that can be used is the Lancaster Stemmer. This stemmer is very aggressive, however, and is especially harmful for short tokens, which may be truncated to the point where they become unclear or lose their meaning.

Let’s try these stemmers, as well as a third type, the Snowball Stemmer, on some tokens:

tokens =  ['narrative', 'comprehensives', 'publicised', 'moderately', 'prosperous', 'legitimizes', 'according', 'glorifies'] 

porter = nltk.stem.porter.PorterStemmer() # Porter stemmer
snowball = nltk.stem.snowball.EnglishStemmer() # Snowball stemmer
lancaster = nltk.stem.lancaster.LancasterStemmer() # Lancaster stemmer

# apply each stemmer to the tokens
porter_results = [porter.stem(i) for i in tokens]
snowball_results = [snowball.stem(i) for i in tokens]
lancaster_results = [lancaster.stem(i) for i in tokens]

# display results
pd.DataFrame(
    index = ['Porter Stemmer', 'Snowball Stemmer', 'Lancaster Stemmer'],
    columns = tokens,
    data = [porter_results, snowball_results, lancaster_results]
)
                   narrative  comprehensives  publicised  moderately  prosperous  legitimizes  according  glorifies
Porter Stemmer     narr       comprehens      publicis    moder       prosper     legitim      accord     glorifi
Snowball Stemmer   narrat     comprehens      publicis    moder       prosper     legitim      accord     glorifi
Lancaster Stemmer  nar        comprehend      publ        mod         prosp       legitim      accord     glor

Exercise 1.5

How do the different stemmers compare for the list of words in the above example? Which stemmer might have the most trouble on a larger set of tokens?

Exercise 1.6

Modify the above code to stem your own tokens. How are words commonly shortened, and what types of words seem to cause the most trouble for the stemmers?

Tokenizers#

The primary goal of a tokenizer is to break text into tokens. This is highly language dependent; in English, we use white space and punctuation as word and token separators, but this may not work in some languages. Since tokens are semantically useful units of text, a token is not necessarily the same as a word; in fact, some tokens may be pieces of words, or even symbols such as punctuation.

For example, the following code uses NLTK to tokenize one of the texts from our UN SDG dataset. Note how not only words like “policy” but also numbers like “70”, suffixes like “'s”, and symbols like “%” and “(” have been separated into individual tokens.

example = text_df.loc[2425, 'text']
print(f'Original text: "{example}"\n')
print('Tokens:', nltk.word_tokenize(example))
Original text: "About 70% of them have reported at least one trade policy targeting women's economic empowerment. Overall, in four years, almost half of the WTO membership has implemented trade policies in support of women (at least one). It simply provides examples of trade policies as reported by WTO members themselves."

Tokens: ['About', '70', '%', 'of', 'them', 'have', 'reported', 'at', 'least', 'one', 'trade', 'policy', 'targeting', 'women', "'s", 'economic', 'empowerment', '.', 'Overall', ',', 'in', 'four', 'years', ',', 'almost', 'half', 'of', 'the', 'WTO', 'membership', 'has', 'implemented', 'trade', 'policies', 'in', 'support', 'of', 'women', '(', 'at', 'least', 'one', ')', '.', 'It', 'simply', 'provides', 'examples', 'of', 'trade', 'policies', 'as', 'reported', 'by', 'WTO', 'members', 'themselves', '.']

Exercise 1.7

Modify the above code to tokenize your own text. Does anything surprise you about how the tokenizer breaks up the text? What makes the resultant tokens semantically useful or meaningful?

For the purposes of building ML models, there are several things that tokenizers track in addition to creating the tokens themselves. Tokenizers track features, or the frequency of occurrence of each individual token (normalized or not). They also produce a sample, or a vector of all the token frequencies for a given document.

When a corpus is vectorized, a tokenizer first breaks each document into tokens; the occurrences of each token are then counted, normalized, and weighted, and an n-gram or other representation of the text is constructed, as discussed in the next section.
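As a rough, library-free sketch of these ideas, the code below builds an (unnormalized) frequency vector for two invented documents using the standard library’s Counter; in practice you would typically use a vectorizer from a library such as scikit-learn.

# a minimal sketch of turning documents into token-frequency vectors
from collections import Counter

docs = ["Trade policy supports women.", "Trade policy supports growth and trade."]
counts = [Counter(t.lower() for t in nltk.word_tokenize(d) if t.isalpha()) for d in docs]
vocabulary = sorted(set().union(*counts))                       # the feature names (one per unique token)
vectors = [[c[token] for token in vocabulary] for c in counts]  # one frequency vector (sample) per document
print(vocabulary)
print(vectors)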

Named Entity Recognition#

The purpose of named entity recognition (NER) is to identify named entities in a text, extract them, and classify them according to their type (people, locations, organizations, and so on). For example, the following code uses the spaCy library to recognize named entities in one of the texts from the UN SDG dataset and then displays them with spaCy’s built-in displacy visualizer.

Note

Before running the following code, you may have to download en_core_web_sm. This is a trained pipeline which spaCy uses to perform the NLP tasks we’ve been discussing on English-language texts. You can download it by uncommenting the second line in the code below.

## uncomment the next line to download the 'en_core_web_sm' pipeline
# ! python3 -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')
sample = text_df.loc[5707, 'text']
doc = nlp(sample)
spacy.displacy.render(doc, style='ent', jupyter=True)
Source: OECD Development Centre ORG , based on IEA ORG (2015a), World Energy Outlook 2015 ORG , IEA ORG ( 2015b DATE ), World Energy Outlook ORG 2015 DATE : Special Report on Southeast Asia LOC . By definition, TPES ORG is equal to Total Primary Energy Demand ORG (TPED), and includes power generation, other energy sector and total final energy consumption ( IEA ORG , 2015a). China GPE will continue to account for the largest share of the energy demand in Emerging Asia LOC , even though its share of the region’s TPES ORG decreases from 69% PERCENT in 2013 DATE to 57% PERCENT in 2040 DATE owing to the strong growth in energy demand from ASEAN ORG and India GPE .

At first glance, the meaning of these labels may not be clear. We can use the spacy.explain() function to learn more about the meaning of each label:

labels = set([ent.label_ for ent in doc.ents])
for label in labels:
    print(f'{label}: {spacy.explain(label)}')
PERCENT: Percentage, including "%"
GPE: Countries, cities, states
LOC: Non-GPE locations, mountain ranges, bodies of water
DATE: Absolute or relative dates or periods
ORG: Companies, agencies, institutions, etc.

In the example above, the trained model has automatically identified and classified various entities in the text. Notice how the types of entities range from countries and organizations to dates and percentages.

Exercise 1.8

Use the following code to perform NER on a random sample from the UN SDG dataset. Rerun the code several times to get different samples. What types of entities does the model identify?

sample = text_df['text'].sample(1).values[0]
doc = nlp(sample)
spacy.displacy.render(doc, style='ent', jupyter=True)

Although the trained model has correctly identified many of the named entities in this text, it has still made some mistakes. For example, it incorrectly labels “Total Primary Energy Demand” as an organization. Additionally, the model often relies on capitalization to identify entities. If we change the sample text to lowercase, notice how the model fails to recognize many of the entities that it previously identified.

doc = nlp(sample.lower())
spacy.displacy.render(doc, style='ent', jupyter=True)
source: oecd development centre ORG , based on iea (2015a), world energy outlook 2015 DATE , iea ( 2015b DATE ), world energy outlook 2015 DATE : special report on southeast asia LOC . by definition, tpes is equal to total primary energy demand (tped), and includes power generation, other energy sector and total final energy consumption (iea, 2015a). china GPE will continue to account for the largest share of the energy demand in emerging asia LOC , even though its share of the region’s tpes decreases from 69% PERCENT in 2013 DATE to 57% PERCENT in 2040 DATE owing to the strong growth in energy demand from asean GPE and india GPE .

Exercise 1.9

Modify the code from Exercise 1.8 to perform NER on a random sample from the UN SDG dataset, first in the original case and then in lowercase. (Hint: use the lower() function to make the sample lowercase.) Rerun your code several times to get different samples. What mistakes does the model tend to make? What types of entities does it tend to misidentify when the text is converted to lowercase?

String Operations#

Now that we know about various preprocessing techniques, we can review some basic string operations. These will be important for locating particular sentences and tokens within our dataset in case we would like to focus on specific words. They can also help with checking case, detecting numeric characters, and other tests we might need to perform. A few are given below:

s.startswith(t) # test if s starts with t
s.endswith(t) # test if s ends with t
t in s # test if t is a substring of s
s.islower() # test if s contains cased characters and all are lowercase
s.isupper() # test if s contains cased characters and all are uppercase
s.isalpha() # test if s is non-empty and all characters in s are alphabetic
s.isalnum() # test if s is non-empty and all characters in s are alphanumeric
s.isdigit() # test if s is non-empty and all characters in s are digits
s.istitle() # test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)

For example, the following code uses the startswith() function to test if the first text from the UN SDG dataset begins with the string ‘The’.

text_df.loc[0, 'text'].startswith('The')
False

The functions above are all called on one string at a time. In pandas, many of these string operations can also be performed on an entire series or dataframe, not just a single value. (These are known as vectorized functions.) For example, the following code uses str.startswith, the vectorized version of startswith, to test which texts from the UN SDG dataset begin with the string ‘The’.

text_df['text'].str.startswith('The')
0        False
1         True
2         True
3         True
4        False
         ...  
24664    False
24665    False
24666    False
24667    False
24668     True
Name: text, Length: 24669, dtype: bool

Exercise 1.10

Practice using some of the string operations on your own sample sentences and observe the results.

As an example to work through, let’s take the Chinese phrase ‘四个全面’ and see if we can locate it in the texts. This phrase, romanized as “sigequanmian” and roughly translating to “four comprehensives”, was publicized by Chinese President Xi Jinping as a set of goals for China. We will use the vectorized str.contains function to search the entire dataset.

text_df[text_df['text'].str.contains('四个全面')]
                               doi                           text_id                                                text  sdg  labels_negative  labels_positive  agreement
18009  10.32890/UUMJLS.6.2015.4584  baa468803d1c7dba3c1096b1665ed7f7  In China today,President Xi Jinping’s new gran...   16                1                4        0.6

Another possible string operation is comparison, as given below:

t1 = text_df[text_df['text'].str.contains('四个全面')]["text"].values[0]
# compare using '=='
print(t1 == text_df.iloc[18009].text)
# compare casefolded version
print(t1 == text_df.iloc[18009].text.casefold())
# compare lowercase and casefold
print(t1.lower() == text_df.iloc[18009].text.casefold())
True
False
True

We can see that Python string comparison is case-sensitive: two strings are not equal unless their characters, including case, match exactly. Note that in the above example, lower() and casefold() yielded equal results. However, this is not the case for all languages. Take the German letter ß:

text = 'groß'

# convert text to lowercase using lower()
print('Using lower():', text.lower())

# convert text to lowercase using casefold()
print('Using casefold():', text.casefold())

# check equality
print(f'{text.lower()} == {text.casefold()}:', text.lower() == text.casefold())
Using lower(): groß
Using casefold(): gross
groß == gross: False

The German word groß translates as “large” or “big”, and this meaning is kept using lower(). However, when using casefold(), the word becomes gross, which is an acceptable alternate spelling in German but is also used in English to indicate disgust. Even though groß and gross represent the same word in German, Python’s exact comparison will treat them as different strings. This is another reason to perform consistent and appropriate preprocessing of text.

It is up to you to decide which of these functions to use when converting to lowercase; however, it is best to keep this consistent across all your code so as to avoid potential problems such as this.
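For example, one low-risk way to stay consistent is to wrap your chosen normalization in a single helper and use it everywhere; the normalize_text function below is purely hypothetical and simply standardizes on casefold().

# a hypothetical helper that standardizes on one normalization choice
def normalize_text(s):
    """Casefold a string so that comparisons behave consistently across the codebase."""
    return s.casefold()

print(normalize_text('Groß') == normalize_text('GROSS'))  # True: both become 'gross'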