14.4. Document Embedding#

In the previous section, we represented variable-length texts as fixed-length numeric vectors using the traditional Bag of Words (BoW) approach, which tokenizes a text into words (tokens). BoW ignores the order of the tokens, though it may account for their frequency. The resulting representation is high-dimensional and very sparse, which can lead to overfitting and high time complexity.

A more modern text vectorization approach is word embedding (often called simply embedding), which relies on neural representations. This approach takes distributional semantics into account; that is, a word’s meaning is inferred from the words that frequently appear close by. Hence, we can construct a word’s context from the set of words that appear nearby within a fixed-size window.
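As a small illustration of this idea, here is a minimal sketch (not part of this section’s pipeline; the example sentence and the window size of 2 are made up for demonstration) that collects each word’s context within a fixed-size window:

# collect, for each word, the set of words appearing within
# a window of 2 positions on either side
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
contexts = {}
for i, word in enumerate(sentence):
    left = sentence[max(0, i - window):i]
    right = sentence[i + 1:i + 1 + window]
    contexts.setdefault(word, set()).update(left + right)
contexts["fox"]  # {'quick', 'brown', 'jumps', 'over'}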

Semantically similar texts then appear close to each other in the vector space. We can also capture semantic relationships through operations on the vectors; for example, the similarity between two texts can be measured by the dot product of their vectors. We can likewise perform algebraic operations; for example,

\(\text{vector(``King'')} - \text{vector(``Man'')} + \text{vector(``Woman'')} \sim \text{vector(``Queen'')}\).
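To make these operations concrete, here is a minimal sketch using small made-up vectors (the values are illustrative only, not real embeddings):

import numpy as np

# hypothetical 4-dimensional "embeddings", for illustration only
king  = np.array([0.8, 0.1, 0.7, 0.2])
man   = np.array([0.6, 0.1, 0.2, 0.1])
woman = np.array([0.6, 0.9, 0.2, 0.1])
queen = np.array([0.8, 0.9, 0.7, 0.2])

def cosine(u, v):
    # cosine similarity: dot product of the normalized vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

analogy = king - man + woman
cosine(analogy, queen)  # close to 1: the analogy vector resembles "queen"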

Modern-day representations are typically learned from vast bodies of text, often with deep neural networks, and are usually distributed as pre-trained models.

To get embeddings, we use the tensorflow library, installed and imported as follows:

## uncomment the next two lines of code to install tensorflow
# ! pip install tensorflow
# ! pip install tensorflow_hub

import tensorflow as tf
import tensorflow_hub as hub

Warning

The tensorflow package can be finicky, especially on certain distributions of Python and on computers that are not very powerful. However, it is a versatile and useful package for document embedding, which is an important part of natural language processing. If you encounter trouble with tensorflow, try using a kernel on a different version of Python (such as Python 3.9) and/or a reduced version of the dataset. If necessary, the code from this section can be skipped, and later sections will not depend on code from this one.

Next, we need to download the embedding model itself, the Universal Sentence Encoder, found at https://tfhub.dev/google/universal-sentence-encoder/4. Go to this link, download the embedding (a roughly 1GB file), and put it in a directory you can access.

embed = hub.load("[your-directory]/universal-sentence-encoder_4")

We’re also going to use the sentence tokenizer from the nltk library:

import nltk.data
from nltk import word_tokenize, sent_tokenize

# the Punkt sentence tokenizer models must be available;
# uncomment the next line to download them if necessary
# nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
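As a quick illustration of what the sentence tokenizer does, here is a small made-up example (the string is not from the corpus):

example = ("Dr. Smith studied poverty in rural areas. "
           "Her results were published in 2019. They informed later policy.")
sent_tokenize(example)
# expected output (three sentences):
# ['Dr. Smith studied poverty in rural areas.',
#  'Her results were published in 2019.',
#  'They informed later policy.']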

To start, we first need to break our document corpus into sentences:

# break every document into sentences, keeping track of each sentence's
# SDG label and the index of the document it came from
text_df_sentence = []
text_df_sdg = []
text_df_sample = []
for text, sdg, i in zip(text_df.text, text_df.sdg, text_df.index):
    sentences = sent_tokenize(text)
    text_df_sentence += sentences
    text_df_sdg += [sdg] * len(sentences)
    text_df_sample += [i] * len(sentences)
sentence_df = pd.DataFrame(
    {"text": text_df_sentence, "sdg": text_df_sdg, "sample": text_df_sample}
)

The dataframe sentence_df now contains all of the sentences in the corpus. The column text gives the sentences, the column sdg gives the SDG, and the column sample gives the index of the original text sample from which the sentence was extracted. We can use the head function to see the first few sentences:

sentence_df.head(10)
text sdg sample
0 "From a gender perspective, Paulgaard points o... 5 0
1 But the fact that young people are still worki... 5 0
2 When Paulgaard refers to continuity with tradi... 5 0
3 As described earlier, Paulgaard (2015) conclud... 5 0
4 The average figure also masks large difference... 3 1
5 The number of annual contacts ranges from 2.0 ... 3 1
6 In addition, poor coverage of outpatient presc... 3 1
7 These findings are consistent with previous wo... 10 2
8 Returns to education are also found to be an i... 10 2
9 In these countries, however, the magnitude of ... 10 2

Exercise 3.1

What are the dimensions of the text_df and sentence_df dataframes? What do each of the dimension numbers represent?

Exercise 3.2

Evaluate the performance of the sentence tokenizer by picking one of the sample texts, manually breaking it into sentences, and determining whether your sentence divisions match the corresponding ones in sentence_df.

An important question to ask is how many sentences each document has; in other words, what is the distribution of the number of sentences in each text? We can determine that by grouping the dataframe by sample, counting the number of sentences in each group, and then counting those counts:

sentence_df.groupby(by='sample').count()['text'].value_counts()
text
3     12015
4      7579
5      2263
6      1586
2       586
7       354
8       119
1        89
9        21
10       14
12       12
13        6
15        6
11        4
14        4
16        3
17        2
18        1
24        1
31        1
21        1
19        1
20        1
Name: count, dtype: int64

Notice that the vast majority of the documents have fewer than 10 sentences, with most being only 3 or 4 sentences long.
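If we want numeric summaries of this distribution rather than raw counts, one option (an optional sketch reusing the same grouped counts as above) is to describe the per-document sentence counts:

# summary statistics of the number of sentences per original text sample
sentences_per_doc = sentence_df.groupby(by='sample').count()['text']
sentences_per_doc.describe()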

This type of sentence tokenization can help us with further NLP tasks down the line.

14.4.1. Universal Sentence Encoder#

The Universal Sentence Encoder (USE) was first published by Google in 2018. It maps a sentence, word, or short paragraph to a fixed-length numeric vector (typically of length 512). As a result, semantically similar sentences are placed close to each other in the embedding space.

Embeddings are typically computed from raw text, so no pre-processing is required. The resulting sentence embeddings can then be used for downstream applications such as classification, clustering, and language prediction.

USE is a pre-trained model trained on a variety of data, such as books and Wikipedia. It was trained with a deep averaging network (DAN) encoder; more information about the process behind USE can be found at https://arxiv.org/pdf/1803.11175.pdf.

To utilize USE, we can take one of three approaches:

  1. We could take our desired document, break it into a collection of sentences, and then map each sentence to its respective vector;

  2. We could treat each document as a short paragraph and map the whole document to its respective vector; or

  3. We could take a similar approach to #1, but then aggregate the sentence vectors for each document to form a single vector per document (a rough sketch of this follows below).
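As a rough sketch of approach #3 (assuming the USE model is loaded as embed, as shown above and again below; the choice of document and the use of mean pooling are illustrative assumptions, not prescribed by USE), we could average a document’s sentence embeddings into a single document vector:

import numpy as np

# illustrative only: pick one document, embed its sentences, and mean-pool them
sample_id = 0  # hypothetical choice of document index
doc_sentences = sentence_df.loc[sentence_df["sample"] == sample_id, "text"].tolist()

sentence_vectors = embed(doc_sentences)                  # shape: (n_sentences, 512)
doc_vector = np.mean(sentence_vectors.numpy(), axis=0)   # shape: (512,)
doc_vector.shape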

USE can be found at https://tfhub.dev/google/universal-sentence-encoder/4. Go to this link, download the embedding (which is around 1GB in size), and put it in a directory you can access. Change the next line of code to refer to this directory.

# change this to your own embedding directory
embedding_dir = "embeddings/"

# load the embedding
embed = hub.load(embedding_dir + "universal-sentence-encoder_4")

Note that the first time you run this, it may take some time (5+ minutes) to complete the process.

We are now ready to use USE! We will return to embedding in more depth in Section 5, but as an example, we will embed some training and testing vectors:

# train_test_split comes from sklearn.model_selection
# (it may already have been imported earlier in the chapter)
from sklearn.model_selection import train_test_split

docs = sentence_df.text
categories = sentence_df.sdg
X_train, X_test, y_train, y_test = \
    train_test_split(docs, categories, test_size=0.33, random_state=7)

X_train_use_vector = embed(X_train.tolist())
X_test_use_vector = embed(X_test.tolist())
X_train_use_vector
<tf.Tensor: shape=(62202, 512), dtype=float32, numpy=
array([[ 0.05109644,  0.05008151, -0.00049568, ..., -0.02072422,
         0.03613124,  0.01162022],
       [ 0.06358109, -0.0565166 , -0.05213946, ...,  0.00228676,
        -0.04131475, -0.00125164],
       [ 0.05327134, -0.02345247,  0.02675822, ...,  0.06473472,
         0.0628962 ,  0.03613978],
       ...,
       [ 0.01227464,  0.00510351, -0.00631195, ...,  0.0456263 ,
        -0.04639212, -0.07870013],
       [ 0.06549167,  0.08835391,  0.00940975, ...,  0.03276522,
         0.02704752, -0.00052535],
       [-0.04562642, -0.00093942, -0.0363333 , ..., -0.03696235,
        -0.03747366, -0.03414274]], dtype=float32)>
X_test_use_vector
<tf.Tensor: shape=(30637, 512), dtype=float32, numpy=
array([[-0.06285849,  0.05835171, -0.0444552 , ..., -0.02508012,
        -0.04616681, -0.02821783],
       [ 0.04709839, -0.03887603,  0.01155359, ...,  0.02760062,
        -0.06390072, -0.0412748 ],
       [ 0.01141804, -0.01593475, -0.04189894, ...,  0.05544081,
         0.03996534,  0.03494091],
       ...,
       [ 0.07271196, -0.01527977, -0.02277797, ...,  0.02758262,
        -0.00676865,  0.05427777],
       [-0.07911933, -0.03813467,  0.00067008, ...,  0.00562062,
         0.06976797, -0.041276  ],
       [-0.04857083,  0.06084292, -0.03894068, ...,  0.01013256,
        -0.03109165, -0.07248887]], dtype=float32)>
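
As a brief preview of the downstream classification use mentioned earlier, a simple classifier can be fit directly on these vectors. This is only a sketch, assuming scikit-learn is available; the model choice and settings are illustrative, not the approach developed later in the chapter:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# fit a simple classifier on the USE vectors (converted to NumPy arrays)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_use_vector.numpy(), y_train)

y_pred = clf.predict(X_test_use_vector.numpy())
accuracy_score(y_test, y_pred)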

14.4.2. More Exercises#

Exercise 3.3

Take two documents, one labeled as SDG 1 and the other as SDG 8. Segment these into sentences, compute the embedding, and find the dot product between the embeddings.