14.8. Solutions to Exercises

# import libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt', quiet=True) # download punkt (if not already downloaded)
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# change this to your own data directory
data_dir = "data/"

# read and preprocess data
text_file_name = "osdg-community-data-v2023-01-01.csv"
text_df = pd.read_csv(data_dir + text_file_name, sep="\t", quotechar='"')
# the fields arrive packed into a single column; split them into separate columns and cast them to the proper types
col_names = text_df.columns.values[0].split('\t')
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
text_df = text_df.astype({'sdg': int, 'labels_negative': int, 'labels_positive': int, 'agreement': float}, copy=True)
text_df.drop(text_df.columns.values[0], axis=1, inplace=True)

14.8.1. Preprocessing

Exercise 1.1

Answers may vary.

Exercise 1.2

Answers may vary.

Exercise 1.3 The following code removes any rows that contain only N/A values. In this case, there are no such rows to remove.

nrows_old = text_df.shape[0]
text_df.dropna(axis=0, how='all', inplace=True)
print("Number of rows removed:", nrows_old - text_df.shape[0])
Number of rows removed: 0

The next line of code checks for the existence of any remaining N/A values. It turns out that there are none.

text_df.isna().any()
doi                False
text_id            False
text               False
sdg                False
labels_negative    False
labels_positive    False
agreement          False
dtype: bool

Whether or not entries with N/A values should be removed depends on the dataset and the nature of the problem. Sometimes, entries with N/A values should be dropped, while at other times, they should be kept unchanged, or replaced with interpolated or placeholder values. Consult the pandas documentation for more information about how to deal with missing values in dataframes.
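
For illustration, here is a minimal sketch of the main options pandas provides, using a small hypothetical dataframe rather than text_df:

# hypothetical dataframe with missing values, for illustration only
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': ['x', None, 'z']})

df.dropna()                     # drop rows containing any N/A value
df.fillna({'a': 0.0, 'b': ''})  # replace N/A values with placeholders
df['a'].interpolate()           # fill numeric gaps by interpolation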

Exercise 1.4

After filtering the dataset, we inspect it using the info() method.

# filter the dataset
text_df = text_df.query("agreement > 0.5 and (labels_positive - labels_negative) > 2")
text_df.reset_index(inplace=True, drop=True)

# inspect it
text_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24669 entries, 0 to 24668
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   doi              24669 non-null  object 
 1   text_id          24669 non-null  object 
 2   text             24669 non-null  object 
 3   sdg              24669 non-null  int64  
 4   labels_negative  24669 non-null  int64  
 5   labels_positive  24669 non-null  int64  
 6   agreement        24669 non-null  float64
dtypes: float64(1), int64(3), object(3)
memory usage: 1.3+ MB

After filtering, we have 24,669 entries with 7 features (see section 0 for details). The data types range from object (likely denoting strings) to int64 (integers) and float64 (floating-point numbers). This is a reasonable amount of data to work with.

Exercise 1.5

The Porter and Snowball stemmers produce largely comparable results, while the Lancaster stemmer is the most aggressive. As a result, the Lancaster stemmer is the most likely to run into trouble (over-stemming) on a larger set of tokens.
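
As a quick illustration (a minimal sketch; the word list is arbitrary), the three NLTK stemmers can be compared side by side:

# compare the three stemmers on a handful of sample tokens
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter, snowball, lancaster = PorterStemmer(), SnowballStemmer('english'), LancasterStemmer()
for word in ['running', 'agreement', 'university', 'maximum', 'geese']:
    print(f"{word}: {porter.stem(word)} | {snowball.stem(word)} | {lancaster.stem(word)}")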

Exercise 1.6

Answers may vary. Possible observations include that stemmers tend to remove affixes (such as -ing, -ed, and -s in English) and that irregular forms are particularly likely to give the stemmers trouble.

Exercise 1.7

Answers may vary.

Exercise 1.8

Answers may vary. Some possible entity labels include GPE (“countries, cities, states”), TIME (“times smaller than a day”), QUANTITY (“measurements, as of weight or distance”), and WORK_OF_ART (“titles of books, songs, etc.”).
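
The meaning of any label used by the model can be checked with spacy.explain, for example:

# print spaCy's built-in description of each entity label
for label in ['GPE', 'TIME', 'QUANTITY', 'WORK_OF_ART']:
    print(label, '-', spacy.explain(label))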

Exercise 1.9

Sample code solution:

# load trained pipeline
nlp = spacy.load('en_core_web_sm')

# perform NER on random sample in both original and lower case
sample = text_df['text'].sample(1).values[0]
doc = nlp(sample)
print('ORIGINAL CASE')
spacy.displacy.render(doc, style='ent', jupyter=True)
print('\nLOWERCASE')
doc = nlp(sample.lower())
spacy.displacy.render(doc, style='ent', jupyter=True)
ORIGINAL CASE
The legal protection of religious freedom in Australia GPE has been subject to significant debate over recent years DATE . In the last four years DATE this question has formed the basis of inquiries by the Australian Law Reform Commission ORG , a Parliamentary Committee ORG , as well as a specially formed Expert Panel ORG , chaired by Philip Ruddock ORG . In this article we outline the international and comparative approach taken to protect freedom of religion, and contrast this to the position in Australia GPE . We find that Australian NORP law does not adequately protect this foundational human right. We then assess the recommendations proposed by the Ruddock Review ORG . We argue that although the Expert Panel ORG recognised the extent of the problem, it did not propose a comprehensive or holistic solution that will resolve existing inadequacies. To protect religious freedom, and indeed human rights more generally, the Commonwealth Parliament ORG should enact a national human rights act
LOWERCASE
the legal protection of religious freedom in australia GPE has been subject to significant debate over recent years DATE . in the last four years DATE this question has formed the basis of inquiries by the australian NORP law reform commission, a parliamentary committee, as well as a specially formed expert panel, chaired by philip ruddock. in this article we outline the international and comparative approach taken to protect freedom of religion, and contrast this to the position in australia GPE . we find that australian NORP law does not adequately protect this foundational human right. we then assess the recommendations proposed by the ruddock review. we argue that although the expert panel recognised the extent of the problem, it did not propose a comprehensive or holistic solution that will resolve existing inadequacies. to protect religious freedom, and indeed human rights more generally, the commonwealth parliament should enact a national human rights act

Answers may vary depending on the samples chosen. In this sample, the model tags the person Philip Ruddock as an organization, showing that it sometimes confuses people with organizations. The lowercase version also shows that the model often fails to recognize organization names once they are converted to lowercase.

Exercise 1.10

Answers may vary.

14.8.2. About Text Data

Exercise 2.1

# get document-term matrix
docs = text_df.text
cv = CountVectorizer()
cv_fit = cv.fit_transform(docs)

# get feature names and total counts
feature_names = cv.get_feature_names_out()
total_counts = cv_fit.sum(axis=0)

# get the index of the most frequent word
most_freq_feature = total_counts.argmax()

# get the most frequent word itself
most_freq_token = feature_names[most_freq_feature]
print(f"Most frequent word: '{most_freq_token}'")
Most frequent word: 'the'

Exercise 2.2

# get document-term matrix with stop words removed
cv2 = CountVectorizer(stop_words='english') # exclude English stop words
cv2_fit = cv2.fit_transform(text_df.text)

original_len = len(cv.vocabulary_) # length of the original vocabulary (with stop words)
new_len = len(cv2.vocabulary_) # length of the new vocabulary (without stop words)
stopwords = cv2.get_stop_words()

print('Length of the original vocabulary (with stop words):', original_len)
print('Length of the new vocabulary (without stop words):', new_len)
print('Number of stop words:', len(stopwords))
print('Difference between original and new vocabularies:', original_len - new_len)
Length of the original vocabulary (with stop words): 45738
Length of the new vocabulary (without stop words): 45440
Number of stop words: 318
Difference between original and new vocabularies: 298

The difference between the original and new vocabularies is less than the number of stop words. This is because not all of the stop words actually occur in the original vocabulary. The following code lists the stop words that are missing from the original vocabulary. Note how the difference between the original and new vocabulary lengths (298) added to the number of missing stopwords (20) is equal to the total number of stop words (318).

missing_stopwords = stopwords - cv.vocabulary_.keys()
print(f'{len(missing_stopwords)} missing stopwords:', missing_stopwords)
20 missing stopwords: {'whereafter', 'whence', 'noone', 'thereupon', 'i', 'thence', 'a', 'latterly', 'yours', 'whereupon', 'couldnt', 'whoever', 'anyhow', 'hasnt', 'whither', 'hers', 'amoungst', 'hereupon', 'yourselves', 'beforehand'}

Exercise 2.3

# get feature names and total counts
feature_names = cv2.get_feature_names_out()
total_counts = cv2_fit.sum(axis=0)

# get the index of the most frequent word
most_freq_feature = total_counts.argmax()

# get the most frequent word itself
most_freq_token = feature_names[most_freq_feature]
print(f"Most frequent word: '{most_freq_token}'")
Most frequent word: 'countries'

Exercise 2.4

First, we fit the one-hot encoder to the sample text.

sample_text = text_df.text.iloc[12737].lower()
tokens = nltk.word_tokenize(sample_text)

def ohe_reshape(tokens):
    return np.asarray(tokens).reshape(-1,1)

ohe = OneHotEncoder(handle_unknown='ignore') # encode unknown tokens as vectors of all zeros
ohe.fit(ohe_reshape(tokens));

Next, we transform each token only once by using a set to remove duplicates.

token_set = list(set(tokens))
encodings = ohe.transform(ohe_reshape(token_set)).toarray() # encode the tokens

There are multiple ways to check that the resulting encodings are unique, but one simple way is to use the pandas library. The following code transforms the encodings into a pandas dataframe and then verifies that there are no duplicates. This confirms that each learned token has a unique encoding.

pd.DataFrame(encodings).duplicated().any()
False
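
Alternatively, a quick check with NumPy counts the distinct encoding rows directly; if every encoding is unique, the comparison below evaluates to True as well.

# number of distinct encoding rows equals the total number of rows
len(np.unique(encodings, axis=0)) == len(encodings)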

Exercise 2.5

print('SDG:', text_df.sdg.iloc[118])
print('Text:', text_df.text.iloc[118])
SDG: 5
Text: "Female economic activities were critically examined and new light was shed on existing conceptions of traditional housework. Oxford University Press, 2007). An edited version of Ihe chapter is available al www.rci.rutgers.edu/~cwgl/globalcenler/charlotte/UN-Handbook.pdf. Targets were also set for the improvement of women's access to economic, social and cultural rights, including improvements in health, reproductive services and sanitation. The women in development approach is embodied in article 14 of the Convention, which focuses on rural women and calls on States to ensure that women ""participate in and benefit from rural development"" and also that they ""participate in the elaboration and implementation of development planning at all levels"".15 Participation is an important component of the right to development, as discussed below."

The most frequent words are “women” and “development”, which occur 4 times each. This, together with the label of SDG 5 (gender equality), suggests that this document is about equality for women.
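
One way to arrive at these counts (a sketch reusing CountVectorizer with English stop words excluded, as before) is to vectorize just this document and sort its term frequencies:

# count word frequencies within this single document (stop words excluded)
cv_doc = CountVectorizer(stop_words='english')
doc_counts = cv_doc.fit_transform([text_df.text.iloc[118]]).toarray()[0]
doc_vocab = cv_doc.get_feature_names_out()
print(sorted(zip(doc_counts, doc_vocab), reverse=True)[:5])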

Exercise 2.6

Each token in a given document, except for the first and last, appears in two different bigrams (one with the previous token and one with the next), but each bigram spans two tokens, so a document with n tokens yields n-1 bigrams versus n unigrams. Summed over the corpus, the total count of bigrams is therefore somewhat smaller than the total count of unigrams. At the same time, specific word pairs recur far less often than individual words, so the number of distinct bigrams in the corpus (the bigram vocabulary) is much larger than the corresponding unigram vocabulary.
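
As a toy illustration of the counting argument (the sentence is arbitrary), a document with five tokens yields five unigrams but only four bigrams:

# five tokens produce five unigrams but only four bigrams
toy = ["clean water and sanitation matter"]
uni = CountVectorizer(ngram_range=(1, 1)).fit_transform(toy)
bi = CountVectorizer(ngram_range=(2, 2)).fit_transform(toy)
print(uni.sum(), bi.sum())  # 5 4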

Exercise 2.7

count_vectorizer = CountVectorizer(ngram_range=(3,3), stop_words='english') 
count_vector = count_vectorizer.fit_transform(docs)
print('Total count of trigrams (without stop words):', count_vector.sum())
print('Number of unique trigrams (without stop words):', len(count_vectorizer.vocabulary_))
Total count of trigrams (without stop words): 1301713
Number of unique trigrams (without stop words): 1214215

The total count of trigrams is smaller than the total count of bigrams, but the number of unique trigrams is larger than the number of unique bigrams. The explanation parallels the reasoning in the previous solution, with bigrams playing the role of unigrams and trigrams the role of bigrams.

Exercise 2.8

Answers may vary depending on the sentences chosen.

Exercise 2.9

tp = 398
fp = 153
fn = 83

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall)/(precision + recall)

print(f'Precision = {precision}, recall = {recall}, f1 = {f1}')
Precision = 0.7223230490018149, recall = 0.8274428274428275, f1 = 0.7713178294573643