14.1. Before Reading

The dataset used in the exercises for this section comes from a citizen science project. It consists of 40,067 text excerpts labeled with 16 United Nations Sustainable Development Goal (UN SDG) categories and is available at https://zenodo.org/record/7540165#.ZAF10uzMKfU. All of the exercises use the 2023.01 version of the dataset, which is the version used when the exercises and content for this section were designed. Be aware that the data file format has changed significantly since the first release, 2021.09.

The UN SDG framework is the United Nations' 2030 thematic framework for world development. Adopted in 2015, it comprises 17 comprehensive goals, with 169 targets and 232 indicators. The UN SDGs are universal, accepted and adopted by countries all over the globe, and represent a proxy for an ideal future.


We recommend creating a new Python environment for this section; you can do this in the terminal by running the following code chunk:

conda deactivate 
conda create -n nlp
conda activate nlp
conda install -c conda-forge spacy
conda install -c conda-forge scikit-learn
conda install ipykernel
conda install seaborn
conda install nltk
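Before moving on, it can help to confirm that the new environment actually has the key packages. Below is a minimal sketch; the `check_imports` helper is our own, not part of any library:

```python
import importlib

def check_imports(packages):
    """Return a dict mapping each package name to whether it imports cleanly."""
    status = {}
    for pkg in packages:
        try:
            importlib.import_module(pkg)
            status[pkg] = True
        except ImportError:
            status[pkg] = False
    return status

# Report which of this section's dependencies are available.
print(check_imports(["spacy", "sklearn", "seaborn", "nltk"]))
```

Any package reported as `False` should be installed (e.g. via conda, as above) before continuing.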

The libraries used in this section include scikit-learn, nltk, fasttext, googletrans, spaCy, and Hugging Face libraries, along with some of their sub-packages. Use the following code to import the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns 
from matplotlib import pyplot as plt
import matplotlib as mpl
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Most importantly, you need to download the full dataset, place it in a directory you can access, and point the code at that directory as follows:

data_dir = "[your-directory]"

In the authors' case, [your-directory] might be replaced with /Users/yingli/Documents/Development/Data/ (note the trailing slash, since the file name is appended directly to this string). This line must be modified before running the following code. We can then load the dataset as follows:

text_file_name = "osdg-community-data-v2023-01-01.csv"
# The file is read in as a single packed column; the real fields are tab-separated.
text_df = pd.read_csv(data_dir + text_file_name, sep="\t", quotechar='"')
# Recover the column names from the packed header string.
col_names = text_df.columns.values[0].split('\t')
# Split each packed row into its fields and assign them to the new columns.
text_df[col_names] = text_df[text_df.columns.values[0]].apply(lambda x: pd.Series(str(x).split("\t")))
# Cast the numeric columns to their proper types.
text_df = text_df.astype({'sdg': int, 'labels_negative': int, 'labels_positive': int, 'agreement': float}, copy=True)
text_df.info()
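Once the dataset is loaded, a quick look at the class balance across the 16 SDG labels is a useful sanity check. The sketch below assumes the `sdg` column of the 2023.01 release; the `sdg_label_distribution` helper is our own:

```python
import pandas as pd

def sdg_label_distribution(df):
    """Count excerpts per SDG label, sorted by label number."""
    return df["sdg"].value_counts().sort_index()

# Assuming text_df has been loaded as above:
# print(sdg_label_distribution(text_df))
```

An imbalanced distribution here would matter later, when train/test splits and classifier metrics are chosen.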

This provides the framework for the rest of the examples and exercises we will use throughout this section.