Solutions to Exercises: Sections 1 to 4

# import libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt', quiet=True) # download punkt (if not already downloaded)
from nltk import word_tokenize, sent_tokenize
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub

# change this to your own data directory
data_dir = "data/"

# read and preprocess data
text_file_name = "osdg-community-data-v2023-01-01.csv"
text_df = pd.read_csv(data_dir + text_file_name, sep="\t", quotechar='"')
col_names = text_df.columns.values[0].split('\t')
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
text_df = text_df.astype({'sdg': int, 'labels_negative': int, 'labels_positive': int, 'agreement': float}, copy=True)
text_df.drop(text_df.columns.values[0], axis=1, inplace=True)

14.8. Solutions to Exercises: Sections 1 to 4

14.8.1. Preprocessing

Exercise 1

Answers may vary.

Exercise 2

Answers may vary.

Exercise 3

The following code removes any rows that contain only N/A values. In this case, there are no such rows to remove.

nrows_old = text_df.shape[0]
text_df.dropna(axis=0, how='all', inplace=True)
print("Number of rows removed:", nrows_old - text_df.shape[0])
Number of rows removed: 0

The next line of code checks for the existence of any remaining N/A values. It turns out that there are none.

text_df.isna().any()
doi                False
text_id            False
text               False
sdg                False
labels_negative    False
labels_positive    False
agreement          False
dtype: bool

Whether or not entries with N/A values should be removed depends on the dataset and the nature of the problem. Sometimes, entries with N/A values should be dropped, while at other times, they should be kept unchanged, or replaced with interpolated or placeholder values. Consult the pandas documentation for more information about how to deal with missing values in dataframes.
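As an illustration, the following sketch shows these strategies on a small hypothetical dataframe (not the OSDG data); which one is appropriate depends on the dataset and the downstream task.

# a small hypothetical dataframe with missing values
df = pd.DataFrame({"value": [1.0, np.nan, 3.0], "label": ["a", None, "c"]})

df.dropna()                      # drop rows containing any N/A value
df.fillna({"label": "unknown"})  # replace N/A with a placeholder value
df["value"].interpolate()        # fill numeric N/A values by interpolation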

Exercise 4

After filtering the dataset, we inspect it using the info() function.

# filter the dataset
text_df = text_df.query("agreement > 0.5 and (labels_positive - labels_negative) > 2")
text_df.reset_index(inplace=True, drop=True)

# inspect it
text_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24669 entries, 0 to 24668
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   doi              24669 non-null  object 
 1   text_id          24669 non-null  object 
 2   text             24669 non-null  object 
 3   sdg              24669 non-null  int64  
 4   labels_negative  24669 non-null  int64  
 5   labels_positive  24669 non-null  int64  
 6   agreement        24669 non-null  float64
dtypes: float64(1), int64(3), object(3)
memory usage: 1.3+ MB

After filtering, we have 24669 entries with 7 features (see Section 1 for details). The data types include object (here denoting strings), int64 (integers), and float64 (floating-point numbers). This is a reasonable amount of data to work with.

Exercise 5

The Porter and Snowball stemmers are largely comparable, while the Lancaster stemmer is the most aggressive. As a result, the Lancaster stemmer is likely to have the most trouble on a larger set of tokens.
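One way to see the difference is to run the same tokens through all three stemmers. Here is a minimal sketch (the token list is our own); the Lancaster stems are typically the shortest.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# compare the three stemmers on the same tokens
tokens = ['running', 'studies', 'organization', 'maximum', 'crises']
stemmers = {'Porter': PorterStemmer(),
            'Snowball': SnowballStemmer('english'),
            'Lancaster': LancasterStemmer()}
for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(t) for t in tokens])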

Exercise 6

Answers may vary. Some possible observations include the fact that stemmers tend to remove affixes (such as -ing, -ed, and -s in English) and the fact that irregular words are particularly likely to give the stemmers trouble.
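For instance, applying a single stemmer to regular and irregular forms makes both observations concrete. A minimal sketch using the Porter stemmer (the word lists are our own):

from nltk.stem import PorterStemmer

porter = PorterStemmer()
regular = ['walked', 'walking', 'walks']  # regular affixes are stripped
irregular = ['ran', 'geese', 'better']    # irregular forms are not reduced to their base forms
print([porter.stem(w) for w in regular])
print([porter.stem(w) for w in irregular])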

Exercise 7

Answers may vary.

Exercise 8

Answers may vary. Some possible entity labels include NORP (“nationalities or religious or political groups”), TIME (“times smaller than a day”), QUANTITY (“measurements, as of weight or distance”), and WORK_OF_ART (“titles of books, songs, etc.”).
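The description of any entity label can be looked up programmatically with spacy.explain(), for example:

for label in ['NORP', 'TIME', 'QUANTITY', 'WORK_OF_ART']:
    print(label, '-', spacy.explain(label))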

Exercise 9

Sample code solution:

# load trained pipeline
nlp = spacy.load('en_core_web_sm')

# perform NER on random sample in both original and lower case
sample = text_df['text'].sample(1).values[0]
doc = nlp(sample)
print('ORIGINAL CASE')
spacy.displacy.render(doc, style='ent', jupyter=True)
print('\nLOWERCASE')
doc = nlp(sample.lower())
spacy.displacy.render(doc, style='ent', jupyter=True)
ORIGINAL CASE
The legal protection of religious freedom in Australia GPE has been subject to significant debate over recent years DATE . In the last four years DATE this question has formed the basis of inquiries by the Australian Law Reform Commission ORG , a Parliamentary Committee ORG , as well as a specially formed Expert Panel ORG , chaired by Philip Ruddock ORG . In this article we outline the international and comparative approach taken to protect freedom of religion, and contrast this to the position in Australia GPE . We find that Australian NORP law does not adequately protect this foundational human right. We then assess the recommendations proposed by the Ruddock Review ORG . We argue that although the Expert Panel ORG recognised the extent of the problem, it did not propose a comprehensive or holistic solution that will resolve existing inadequacies. To protect religious freedom, and indeed human rights more generally, the Commonwealth Parliament ORG should enact a national human rights act
LOWERCASE
the legal protection of religious freedom in australia GPE has been subject to significant debate over recent years DATE . in the last four years DATE this question has formed the basis of inquiries by the australian NORP law reform commission, a parliamentary committee, as well as a specially formed expert panel, chaired by philip ruddock. in this article we outline the international and comparative approach taken to protect freedom of religion, and contrast this to the position in australia GPE . we find that australian NORP law does not adequately protect this foundational human right. we then assess the recommendations proposed by the ruddock review. we argue that although the expert panel recognised the extent of the problem, it did not propose a comprehensive or holistic solution that will resolve existing inadequacies. to protect religious freedom, and indeed human rights more generally, the commonwealth parliament should enact a national human rights act

Answers may vary depending on the samples chosen. This sample demonstrates that the model sometimes mislabels people as organizations (here, Philip Ruddock is tagged ORG). It also shows that the model often fails to recognize organization names once the text is converted to lowercase.

Exercise 10

Answers may vary.

14.8.2. About Text Data

Exercise 1

# get document-term matrix
docs = text_df.text
cv = CountVectorizer()
cv_fit = cv.fit_transform(docs)

# get feature names and total counts
feature_names = cv.get_feature_names_out()
total_counts = cv_fit.sum(axis=0)

# get the index of the most frequent word
most_freq_feature = total_counts.argmax()

# get the most frequent word itself
most_freq_token = feature_names[most_freq_feature]
print(f"Most frequent word: '{most_freq_token}'")
Most frequent word: 'the'

Exercise 2

# get document-term matrix with stop words removed
cv2 = CountVectorizer(stop_words='english') # exclude English stop words
cv2_fit = cv2.fit_transform(text_df.text)

original_len = len(cv.vocabulary_) # length of the original vocabulary (with stop words)
new_len = len(cv2.vocabulary_) # length of the new vocabulary (without stop words)
stopwords = cv2.get_stop_words()

print('Length of the original vocabulary (with stop words):', original_len)
print('Length of the new vocabulary (without stop words):', new_len)
print('Number of stop words:', len(stopwords))
print('Difference between original and new vocabularies:', original_len - new_len)
Length of the original vocabulary (with stop words): 45738
Length of the new vocabulary (without stop words): 45440
Number of stop words: 318
Difference between original and new vocabularies: 298

The difference between the original and new vocabularies is less than the number of stop words. This is because not all of the stop words actually occur in the original vocabulary. The following code lists the stop words that are missing from the original vocabulary. Note how the difference between the original and new vocabulary lengths (298) added to the number of missing stopwords (20) is equal to the total number of stop words (318).

missing_stopwords = stopwords - cv.vocabulary_.keys()
print(f'{len(missing_stopwords)} missing stopwords:', missing_stopwords)
20 missing stopwords: {'whereafter', 'whence', 'noone', 'thereupon', 'i', 'thence', 'a', 'latterly', 'yours', 'whereupon', 'couldnt', 'whoever', 'anyhow', 'hasnt', 'whither', 'hers', 'amoungst', 'hereupon', 'yourselves', 'beforehand'}

Exercise 3

# get feature names and total counts
feature_names = cv2.get_feature_names_out()
total_counts = cv2_fit.sum(axis=0)

# get the index of the most frequent word
most_freq_feature = total_counts.argmax()

# get the most frequent word itself
most_freq_token = feature_names[most_freq_feature]
print(f"Most frequent word: '{most_freq_token}'")
Most frequent word: 'countries'

Exercise 4

First, we fit the one-hot encoder to the sample text.

sample_text = text_df.text.iloc[12737].lower()
tokens = nltk.word_tokenize(sample_text)

def ohe_reshape(tokens):
    return np.asarray(tokens).reshape(-1,1)

ohe = OneHotEncoder(handle_unknown='ignore') # encode unknown tokens as vectors of all zeros
ohe.fit(ohe_reshape(tokens));

Next, we transform each token only once by using a set to remove duplicates.

token_set = list(set(tokens))
encodings = ohe.transform(ohe_reshape(token_set)).toarray() # encode the tokens

There are multiple ways to check that the resulting encodings are unique, but one simple way is to use the pandas library. The following code transforms the encodings into a pandas dataframe and then verifies that there are no duplicates. This confirms that each learned token has a unique encoding.

pd.DataFrame(encodings).duplicated().any()
False

Exercise 5

print('SDG:', text_df.sdg.iloc[118])
print('Text:', text_df.text.iloc[118])
SDG: 5
Text: "Female economic activities were critically examined and new light was shed on existing conceptions of traditional housework. Oxford University Press, 2007). An edited version of Ihe chapter is available al www.rci.rutgers.edu/~cwgl/globalcenler/charlotte/UN-Handbook.pdf. Targets were also set for the improvement of women's access to economic, social and cultural rights, including improvements in health, reproductive services and sanitation. The women in development approach is embodied in article 14 of the Convention, which focuses on rural women and calls on States to ensure that women ""participate in and benefit from rural development"" and also that they ""participate in the elaboration and implementation of development planning at all levels"".15 Participation is an important component of the right to development, as discussed below."

The most frequent words are “women” and “development”, which occur 4 times each. This, together with the label of SDG 5 (gender equality), suggests that this document is about equality for women.
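This can be verified with a quick word count over the single document. The sketch below uses NLTK tokenization and the stop word list from the vectorizer in Exercise 2 (cv2), so exact counts may differ slightly from a CountVectorizer-based count.

from collections import Counter

# count non-stop-word tokens in the sample document
tokens = nltk.word_tokenize(text_df.text.iloc[118].lower())
stop_words = cv2.get_stop_words()
counts = Counter(t for t in tokens if t.isalpha() and t not in stop_words)
print(counts.most_common(5))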

Exercise 6

Each token in a given document, except for the first and last, appears in two different bigrams (one with the previous token and one with the next). In this case, adjacent-token pairs repeat far less often across the corpus than individual tokens do, so the bigram vocabulary is much larger than the corresponding unigram vocabulary. The total count of bigrams, however, is smaller than the total count of unigrams: with the same vectorizer settings, a document containing n tokens contributes n unigrams but only n − 1 bigrams.
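This can be checked by fitting unigram and bigram vectorizers on the same corpus and comparing vocabulary sizes and total counts. A sketch, here using the same stop word removal as the trigram code in the next exercise (adjust the settings to match whichever bigram counts are being compared):

# compare unigram and bigram vocabularies and total counts on the same corpus
unigram_cv = CountVectorizer(ngram_range=(1, 1), stop_words='english')
bigram_cv = CountVectorizer(ngram_range=(2, 2), stop_words='english')
unigram_counts = unigram_cv.fit_transform(docs)
bigram_counts = bigram_cv.fit_transform(docs)

print('Unique unigrams:', len(unigram_cv.vocabulary_), '| total count:', unigram_counts.sum())
print('Unique bigrams: ', len(bigram_cv.vocabulary_), '| total count:', bigram_counts.sum())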

Exercise 7

count_vectorizer = CountVectorizer(ngram_range=(3,3), stop_words='english') 
count_vector = count_vectorizer.fit_transform(docs)
print('Total count of trigrams (without stop words):', count_vector.sum())
print('Number of unique trigrams (without stop words):', len(count_vectorizer.vocabulary_))
Total count of trigrams (without stop words): 1301713
Number of unique trigrams (without stop words): 1214215

The total count of trigrams is smaller than the total count of bigrams, but the number of unique trigrams is larger than the number of unique bigrams. The explanation is the same as in the previous exercise, with bigrams playing the role of unigrams and trigrams playing the role of bigrams: trigram combinations are even less likely to repeat, and each document contributes one fewer trigram than it does bigrams.

Exercise 8

Answers may vary depending on the sentences chosen.

Exercise 9

tp = 398
fp = 153
fn = 83

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall)/(precision + recall)

print(f'Precision = {precision}, recall = {recall}, f1 = {f1}')
Precision = 0.7223230490018149, recall = 0.8274428274428275, f1 = 0.7713178294573643

Exercise 10

Answers may vary depending on the parameters chosen. Here is a sample answer using the parameters ngram_range = (2,2) (for bigrams), stop_words = 'english', and min_df = 3.

docs = text_df.text
count_vectorizer = CountVectorizer(ngram_range=(2,2), stop_words='english', min_df=3)
count_vector = count_vectorizer.fit_transform(docs).toarray()
term_freq = pd.DataFrame({"term": count_vectorizer.get_feature_names_out(), "freq" : count_vector.sum(axis=0)})

# find the frequencies of the 5 most common bigrams
term_freq.sort_values(by='freq', ascending=False).iloc[0:5]
term freq
27774 human rights 1981
10071 climate change 1301
20488 et al 1167
40775 oecd countries 898
26527 health care 881
An alternative is to build the full document-term matrix as a dataframe and sum its columns:

count_vector_df = pd.DataFrame(count_vector, columns=count_vectorizer.get_feature_names_out())

# find the frequencies of the 5 most common bigrams
count_vector_df.sum().sort_values(ascending=False).iloc[0:5]

Exercise 11

def analyze_frequency(corpus, stop_words=None):
    # create term-document matrix
    count_vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=stop_words)
    count_vector = count_vectorizer.fit_transform(corpus).toarray()
    
    # calculate term frequencies, sorted from most to least frequent
    freq_df = pd.DataFrame(
        {"term": count_vectorizer.get_feature_names_out(), "freq" : count_vector.sum(axis=0)}
    ).sort_values(by='freq', ascending=False)

    # calculate cumulative word counts
    csum = np.cumsum(freq_df.iloc[0:50].freq).values
    
    # create plot
    fig, ax = plt.subplots()
    plt.plot(csum)
    ax.set_ylabel('cumulative word count')
    ax.set_xlabel('rank')
    ax.set_title('Cumulative Word Count (Most Frequent to 50th Most Frequent)')
    
    # fraction of the total word count accounted for by the 50 most frequent terms
    comp = csum[-1] / freq_df.freq.sum()
    
    return (freq_df.iloc[0:50], ax, f'{comp:.2%}')

Exercise 12

First, we obtain our corpus:

sdg8 = text_df[text_df.sdg == 8].text

Then we run the function from the previous exercise on that corpus and examine the results:

(top_50_words, plot, pct) = analyze_frequency(sdg8)

print('Level of cumulation percentage:', pct)
print()
print('Top 50 words:')
top_50_words
Level of cumulation percentage: 38.58%

Top 50 words:
term freq
6933 the 5400
4806 of 3294
636 and 3261
3578 in 2777
7000 to 2618
3018 for 1212
3842 is 1005
6932 that 745
712 are 734
4843 on 694
747 as 683
7546 with 601
2498 employment 566
927 be 550
1137 by 539
3994 labour 530
6958 this 485
7565 workers 430
4888 or 386
7560 work 365
3326 have 360
3094 from 353
4552 more 346
590 also 331
3856 it 314
1771 countries 312
627 an 303
4729 not 300
798 at 290
3322 has 274
7513 which 273
6935 their 266
6929 than 256
3883 job 248
7222 unemployment 243
1162 can 241
4317 market 235
4921 other 228
7451 was 220
6201 sector 215
6946 these 213
3269 growth 212
6947 they 209
6735 such 209
2396 economic 202
6451 social 199
3601 income 189
7059 training 187
4804 oecd 186
939 been 184
[Figure: cumulative word count of the 50 most frequent terms in the SDG 8 texts]

Exercise 13

With stop word removal:

docs = text_df.text
(top_50_words, plot, pct) = analyze_frequency(docs, 'english')

print('Level of cumulation percentage:', pct)
print()
print('Top 50 words:')
top_50_words
Level of cumulation percentage: 12.27%

Top 50 words:
term freq
10295 countries 7761
44859 women 5984
12072 development 5312
19188 health 4685
44337 water 4664
33322 public 4591
38326 social 4538
13847 education 4535
31876 policy 4367
21846 international 4360
24043 law 4240
14495 energy 4224
27973 national 4087
35753 rights 3905
29304 oecd 3547
13726 economic 3517
43320 use 3391
28342 new 3337
24376 level 3267
20913 income 3171
32235 poverty 3167
11084 data 3078
8513 climate 3033
18280 government 3012
7227 care 3002
37356 services 2999
17714 gender 2991
20001 human 2975
5194 based 2879
39986 support 2759
44896 work 2742
19483 high 2674
4016 areas 2611
37030 sector 2583
41272 time 2581
31873 policies 2477
36770 school 2466
25497 management 2465
2098 access 2424
7882 change 2416
18543 growth 2386
19484 higher 2367
21274 information 2363
20898 including 2357
15415 example 2318
24819 local 2285
33571 quality 2255
20724 important 2236
25028 low 2217
12270 different 2211
[Figure: cumulative word count of the 50 most frequent terms, with stop words removed]

Without stop word removal:

(top_50_words, plot, pct) = analyze_frequency(docs, None)

print('Level of cumulation percentage:', pct)
print()
print('Top 50 words:')
top_50_words
Level of cumulation percentage: 36.81%

Top 50 words:
term freq
41229 the 143100
29487 of 95834
3469 and 93357
20921 in 67152
41630 to 64701
16900 for 30010
22403 is 25175
41221 that 20395
4217 as 18926
29700 on 18365
4040 are 18215
6918 by 14171
45094 with 14162
41378 this 14153
5337 be 12608
17320 from 9833
29892 or 9382
22504 it 9364
19212 have 9070
3397 an 7885
10354 countries 7761
4447 at 7643
19159 has 7551
41248 their 7356
28988 not 7182
3182 also 7151
44893 which 6968
27490 more 6953
41327 these 6246
7121 can 6016
45152 women 5984
40001 such 5612
12135 development 5312
30114 other 5018
41211 than 4751
19298 health 4685
41339 they 4666
44606 water 4664
33516 public 4591
38540 social 4538
13916 education 4535
5660 between 4466
5401 been 4377
32070 policy 4367
21975 international 4360
44573 was 4339
24180 law 4240
14571 energy 4224
6886 but 4088
28132 national 4087
[Figure: cumulative word count of the 50 most frequent terms, without stop word removal]

Frequent words that are not stop words, such as ‘countries’, ‘women’, and ‘development’, occur often enough to show up in both lists. As one might expect, stop words such as ‘the’, ‘of’, and ‘and’ occur much more frequently than terms that are not stop words. As a result, the level of cumulation percentage is much smaller and the cumulative word count curve is more linear with stop word removal than without stop word removal.

14.8.3. Document Embedding

Exercise 1

After creating sentence_df, we can compare its dimensions to those of text_df.

def tokenize_into_sentences(corpus):
    corpus_sentence = []
    corpus_sdg = []
    corpus_sample = []
    for (text, sdg, i) in iter(zip(corpus.text, corpus.sdg, corpus.index)):
        sentences = nltk.sent_tokenize(text) 
        corpus_sentence += sentences
        corpus_sdg += [sdg]*len(sentences)
        corpus_sample += [i]*len(sentences)
    sentence_df = pd.DataFrame({"text": corpus_sentence, "sdg": corpus_sdg, "sample": corpus_sample})
    return sentence_df

sentence_df = tokenize_into_sentences(text_df)
print('text_df dimensions:', text_df.shape)
print('sentence_df dimensions:', sentence_df.shape)
text_df dimensions: (24669, 7)
sentence_df dimensions: (92839, 3)

The dimensions of text_df represent the number of sample texts and the number of features, respectively. The dimensions of sentence_df represent the number of sentences and the number of features, respectively.

Exercise 2

Student answers may vary. As a sample answer, we choose a text containing direct quotations, which can increase the difficulty of sentence tokenization:

text_df.text.loc[465]
'When asked “Have you no morals?” Alfred Doolittle in George Bernard Shaw’s Pygmalion answered: “Can’t afford them governor. Neither could you if you was as poor as me.” The modern concept of human rights underpins a moral society and holds governments responsible for fulfilling these rights. From informed consent to the right to privacy civil and political rights have dominated the human rights focus of the HIV-1 epidemic. Yet the economic and social rights of people with HIV-1 infection in particular the rights to health care and to share in scientific advances are glaringly disparate between rich and poor countries. This disparity has become the focus of debate in transnational HIV-1 vaccine research. (excerpt)'

Despite the increased difficulty, the sentence tokenizer is able to separate out the sentences correctly:

sentence_df[sentence_df['sample'] == 465]
text sdg sample
1785 When asked “Have you no morals?” Alfred Doolit... 16 465
1786 Neither could you if you was as poor as me.” T... 16 465
1787 From informed consent to the right to privacy ... 16 465
1788 Yet the economic and social rights of people w... 16 465
1789 This disparity has become the focus of debate ... 16 465
1790 (excerpt) 16 465

Exercise 3

Answers may vary depending on the samples chosen. For simplicity, this sample solution chooses two samples in text_df with the same number of sentences.

samples = text_df.loc[[32,6]]
sentences = tokenize_into_sentences(samples)
sentences
text sdg sample
0 This points to the possibility that the effect... 1 32
1 One possible explanation for this is that incr... 1 32
2 These results are similar to those obtained by... 1 32
3 This analysis is presented in the following se... 1 32
4 Prescription rates appear to be higher where l... 8 6
5 There is also a possible relationship between ... 8 6
6 This may arise after the definition of disabil... 8 6
7 Krueger (2017(47)) found that around one-fifth... 8 6
# change this to your own embedding directory
embedding_dir = "embeddings/"

# load the embedding
embed = hub.load(embedding_dir + "universal-sentence-encoder_4")
sdg1_embedding = embed(sentences[sentences['sdg'] == 1].text.tolist())
sdg8_embedding = embed(sentences[sentences['sdg'] == 8].text.tolist())
np.tensordot(sdg1_embedding, sdg8_embedding)
array(0.2418205, dtype=float32)