14.10. JNB Lab: United States Grant Data#
The United States government publishes grant opportunities to solicit applications from eligible applicants. The dataset of grant opportunities is updated every day and can be downloaded as an XML file from https://www.grants.gov/xml-extract. In this lab, we will classify these grant entries into the UN Sustainable Development Goals (SDGs) that we discussed throughout the chapter.
14.10.1. Lab Exercises, Part 1: Supervised Learning and Vectorizations#
Use `pandas` to read in the provided XML file, which is from June 25, 2024, when this lab was first written. You can use XML files from other days, but the solutions and exercises for this lab are based on that file.
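One way to start, sketched below, is with `pandas`'s `read_xml` function. The local file name and the record element name (`OpportunitySynopsisDetail_1_0`) are assumptions based on the grants.gov extract format; adjust them to match the file you actually downloaded.

```python
import pandas as pd

# Assumed local file name; the daily extract you download will be named differently.
xml_path = "GrantsDBExtract20240625.xml"

# Each grant is assumed to live in an <OpportunitySynopsisDetail_1_0> element;
# local-name() sidesteps the document's XML namespace.
df = pd.read_xml(xml_path, xpath="//*[local-name()='OpportunitySynopsisDetail_1_0']")
print(df.shape)
```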
Now that we’ve read in the data, we can take a look at it.
df.head()
|  | OpportunityID | OpportunityTitle | OpportunityNumber | OpportunityCategory | FundingInstrumentType | CategoryOfFundingActivity | CategoryExplanation | CFDANumbers | EligibleApplicants | AdditionalInformationOnEligibility | ... | CloseDateExplanation | OpportunityCategoryExplanation | EstimatedSynopsisPostDate | FiscalYear | EstimatedSynopsisCloseDate | EstimatedSynopsisCloseDateExplanation | EstimatedAwardDate | EstimatedProjectStartDate | GrantorContactName | GrantorContactPhoneNumber |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 262148 | Establishment of the Edmund S. Muskie Graduate... | SCAPPD-14-AW-161-SCA-08152014 | D | CA | O | Public Diplomacy | 19.040 | 25.0 | Eligibility for U.S. institutions is limited t... | ... | None | None | NaN | NaN | NaN | None | NaN | NaN | None | None |
1 | 262149 | Eradication of Yellow Crazy Ants on Johnston A... | F14AS00402 | D | CA | NR | None | 15.608 | 99.0 | The recipient has already been selected for th... | ... | None | None | NaN | NaN | NaN | None | NaN | NaN | None | None |
2 | 131073 | Cooperative Ecosystem Studies Unit, Piedmont S... | G12AS20003 | D | CA | ST | None | 15.808 | 25.0 | This financial assistance opportunity is being... | ... | None | None | NaN | NaN | NaN | None | NaN | NaN | None | None |
3 | 131094 | Plant Feedstock Genomics for Bioenergy: A Joi... | DE-FOA-0000598 | D | G | ST | None | 81.049 | 99.0 | DOE Eligibility Criteria: Applicants from U.S.... | ... | None | None | NaN | NaN | NaN | None | NaN | NaN | None | None |
4 | 131095 | Management of HIV-Related Lung Disease and Car... | RFA-HL-12-034 | D | G | HL | None | 93.838 | 25.0 | Other Eligible Applicants include the followin... | ... | None | None | NaN | NaN | NaN | None | NaN | NaN | None | None |
5 rows × 38 columns
Which variable seems best suited for classifying these grant applications into the various UN SDGs?
Take the variable you chose above and transform it into a bag-of-words matrix. (Hedged starting-point sketches for this and the remaining exercises in Part 1 appear after the exercise list.)
From the bag-of-words matrix you've made, identify the first ten features and print them out.
Refer to the section on Document-Term Matrices to create another vectorization of the documents. Show features 100-110.
(Bonus, if you have a strong computer) Create yet another vectorization, similar to the above, but using bigrams. Again, show features 100-110.
Once again, make another vectorization, this time using the TF-IDF vectorizer. Show features 100-110.
Using the function you made in Exercise 4 from Section 5, modify it to train a model on the entire UN SDG dataset. Then use this model to assign predicted classes to each of the entries in the grants dataset, doing so with the Perceptron, Naive Bayes, and Ridge Classifier models.
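A hedged starting point for the vectorization exercises is sketched below, assuming the `OpportunityTitle` column was chosen; swap in whichever variable you selected.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumes the OpportunityTitle column was chosen in the earlier exercise.
titles = df["OpportunityTitle"].fillna("").astype(str)

# Bag-of-words / document-term matrix and its first ten features.
count_vec = CountVectorizer()
dtm = count_vec.fit_transform(titles)
print(count_vec.get_feature_names_out()[:10])

# Bigram variant (bonus exercise) and TF-IDF variant; show features 100-110.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_dtm = bigram_vec.fit_transform(titles)
print(bigram_vec.get_feature_names_out()[100:110])

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(titles)
print(tfidf_vec.get_feature_names_out()[100:110])
```

For the modeling exercise, a minimal sketch follows; `sdg_df` and its `text` and `sdg` columns are hypothetical stand-ins for the labeled UN SDG dataset used earlier in the chapter, and your own Exercise 4 function will differ in its details.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB  # one common Naive Bayes variant for text
from sklearn.pipeline import make_pipeline

def predict_grant_sdgs(sdg_df, grant_texts, classifier):
    """Train on the labeled UN SDG data, then predict classes for the grant texts."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(sdg_df["text"], sdg_df["sdg"])  # hypothetical column names
    return model.predict(grant_texts)

for clf in (Perceptron(), MultinomialNB(), RidgeClassifier()):
    df[f"pred_{type(clf).__name__}"] = predict_grant_sdgs(sdg_df, titles, clf)
```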
One thing you’ve probably noticed is that this dataset does not explicitly have UN SDG classes associated with it. While we can do what we did in the previous exercise, it is often the case that we do not have pre-labeled data. In such situations, we can turn to unsupervised learning, which was mentioned in the section giving an overview of machine learning and involves data that is not already labeled with the “correct” category.
14.10.2. Unsupervised Learning#
Two popular algorithms for unsupervised learning revolve around what is known as clustering. Clustering, as the name suggests, groups data points into clusters such that similarity is high within each cluster and low between different clusters. The two algorithms we will look at are a deterministic method called \(k\)-means, which places each data point into a definitive cluster, and a probabilistic method called Gaussian Mixture Modeling, which assigns each data point a probability of belonging to each cluster.
\(k\)-means is described in more detail in the chapter on linear algebra and optimization. In practice, \(k\)-means clustering follows this procedure (a minimal NumPy sketch appears after the list):
Create \(k\) random points to serve as centroids; you choose the value of \(k\). These will be the “centers” of each of our clusters.
Assign each existing data point to its closest centroid. This is typically done with Euclidean distance,

\[
d(x, a) = \sqrt{\sum_{i=1}^{j} (x_i - a_i)^2},
\]

for data point \(x\) and centroid \(a\), each with \(j\) features (columns in the dataset).
Measure the distance of each point from its assigned centroid and sum all these distances for all \(n\) points. Again, this is done with Euclidean distance.
Re-calculate the centroid of each cluster by taking the mean vector of all points in the cluster.
Repeat steps 2-4 until the total distance metric changes marginally between iterations, or until the centroids do not change position between iterations.
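A minimal NumPy sketch of this loop is below. It assumes a dense feature array `X` and initializes the centroids at \(k\) randomly chosen data points; it is meant to mirror the steps above, not to replace `scikit-learn`'s optimized implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones k-means on a dense array X of shape (n, j)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids at k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: total distance of all points to their assigned centroids.
        total_dist = dists[np.arange(len(X)), labels].sum()
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 5: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, total_dist
```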
Gaussian Mixture Modeling is similar to \(k\)-means, except that it assigns each point a probability of belonging to each cluster, assuming the points in each cluster follow a multivariate normal distribution centered at that cluster's mean. In practice, it is fit with the Expectation-Maximization (EM) algorithm, whose procedure works as follows (a compact NumPy sketch of one iteration appears after the list):
Create \(k\) random points to serve as means, and assign each cluster a \(j \times j\) covariance matrix; this can be randomly set, but it is more common to use the identity matrix. These mean vectors and covariance matrices are the mixture parameters. We also need a prior probability to help with normalization in the next step, which is a single vector of length \(k\) detailing the prior probability of a point belonging in any one cluster. This vector is known as the mixing proportions.
Expectation Step: Calculate the log-likelihood of the data points given the current parameters (at first, the randomly set ones). This involves two main steps:
(a) Calculate the probability of each point belonging in each cluster using the multivariate normal probability density function. Following this, multiply these values by the respective probability found in the mixing proportions, then normalize these probabilities so that they sum to 1. These probabilities are stored in a matrix known as a hidden matrix.
(b) For each point, take the cluster of highest probability and take the natural log of that probability. Do this for all points and sum the natural logs.
Maximization Step: Given the points assigned to each cluster of maximum likelihood, re-calculate the mean vector and covariance matrix for each cluster, as well as the mixing proportions.
To recalculate the mixing proportions, assign each point to the cluster in which it has the greatest probability of membership. Then, for each cluster, the new proportion is simply the number of points assigned to that cluster divided by the total number of points.
To recalculate the mean vectors, for each entry in each mean vector, take the corresponding feature (column) in the data and compute its dot product with each of the columns of the hidden matrix. Normalize the result by the sum of the probabilities for that cluster, which gives a single value for each of the \(k\) clusters; doing this for all \(j\) features gives the full set of mean vectors.
To recalculate the covariance matrices, for each cluster, take the deviation of each point from the mean of the cluster and use these deviations to calculate a new covariance matrix. Then normalize the entries of this covariance matrix by the sum of the probabilities for that cluster.
Repeat the Expectation and Maximization Steps until the increase in log-likelihood falls below a specified threshold. It can be shown mathematically that the log-likelihood never decreases between iterations, but at some point the increase becomes minimal.
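A compact NumPy sketch of one EM iteration is given below. It follows the standard "soft" updates, which weight every point by its cluster probabilities rather than by a hard assignment to its most likely cluster, but the overall structure matches the steps described above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, mix):
    """One EM iteration for a Gaussian mixture.

    X: (n, j) data; means: (k, j); covs: (k, j, j); mix: (k,) mixing proportions.
    """
    n, k = len(X), len(mix)

    # Expectation step: build the "hidden matrix" of cluster probabilities.
    hidden = np.column_stack([
        mix[c] * multivariate_normal.pdf(X, means[c], covs[c]) for c in range(k)
    ])
    log_likelihood = np.log(hidden.sum(axis=1)).sum()
    hidden /= hidden.sum(axis=1, keepdims=True)   # normalize rows to sum to 1

    # Maximization step: update mixing proportions, means, and covariances.
    weights = hidden.sum(axis=0)                  # total probability per cluster
    new_mix = weights / n
    new_means = (hidden.T @ X) / weights[:, None]
    new_covs = np.empty_like(covs)
    for c in range(k):
        dev = X - new_means[c]
        new_covs[c] = (hidden[:, c, None] * dev).T @ dev / weights[c]
    return new_means, new_covs, new_mix, log_likelihood
```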
Clusters can be evaluated in a variety of ways, including consulting domain experts or computing the Jaccard index. Evaluation of these clusters is beyond the scope of this lab.
14.10.3. Lab Exercises, Part 2: Unsupervised Learning#
Look at the documentation for \(k\)-means and the EM algorithm in `scikit-learn` and use these with various values of \(k\) to cluster the grants. If you have a powerful computer, try \(k = 17\) to match the number of UN SDGs. (A hedged starting-point sketch appears after this exercise list.)
Look at the entries for one of the \(k\)-means clusters you made in the previous exercise. How similar are the entries to each other?
Compare the entries from the previous cluster you analyzed to a different cluster. How similar are the entries in the first cluster to the ones in the new cluster?
(Bonus) Implement \(k\)-means and the EM algorithm using only the `numpy` package; feel free to consult other resources for mathematical help. This exercise is not for the faint of heart, but it is advisable for those who want to improve their understanding of the mathematics underlying these methods!
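As a hedged starting point for the first exercise, the sketch below assumes `tfidf` is the sparse TF-IDF matrix from Part 1; the choice of 100 SVD components for the Gaussian mixture is an arbitrary assumption, made only because `GaussianMixture` requires dense input.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

k = 17  # one cluster per UN SDG; use a smaller k on a modest machine

# k-means works directly on the sparse TF-IDF matrix.
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)

# GaussianMixture (fit by the EM algorithm) needs dense input, so reduce first.
reduced = TruncatedSVD(n_components=100, random_state=0).fit_transform(tfidf)
gmm_labels = GaussianMixture(n_components=k, random_state=0).fit_predict(reduced)

# Peek at the titles in one of the k-means clusters.
print(df.loc[kmeans_labels == 0, "OpportunityTitle"].head(10))
```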
14.10.4. Lab Exercises, Part 3: Similarity#
Construct a heatmap of pairwise similarities for 40 of the grants in the dataset.
Isolate twenty of the entries, then write a function that takes one of these twenty entries and an integer \(k\) as inputs; the function should return the \(k\) most similar entries from the rest of the dataset.
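A hedged sketch for both exercises follows. It assumes `tfidf` is the TF-IDF matrix from Part 1, that "similarity" means cosine similarity between TF-IDF vectors, and that the twenty isolated entries are simply the first twenty rows; adapt these choices as you see fit.

```python
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Heatmap of pairwise cosine similarities for the first 40 grants.
sims = cosine_similarity(tfidf[:40])
plt.figure(figsize=(8, 8))
plt.imshow(sims, cmap="viridis")
plt.colorbar(label="cosine similarity")
plt.title("Pairwise similarity of 40 grants")
plt.show()

def most_similar(entry_index, k, n_isolated=20):
    """For one of the first n_isolated entries, return the indices of the
    k most similar grants drawn from the rest of the dataset."""
    scores = cosine_similarity(tfidf[entry_index], tfidf).ravel()
    ranked = scores.argsort()[::-1]
    return [i for i in ranked if i >= n_isolated][:k]

print(most_similar(entry_index=3, k=5))
```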
14.10.5. Generative AI and Language Models#
With the rise of generative AI and large language models (LLMs) like the GPT series of models developed by OpenAI, it is easier than ever to give a model a string of text and have it classify that text into a predicted UN SDG.
So how exactly do these models work? The exact mathematical theory behind them is highly complex, as they build on years of research in AI, natural language processing, and machine learning. We mentioned generative text models briefly in a previous section, noting that they are similar to probabilistic language models but are generative in the sense that they generate the next word in a sequence based on a highly complex model.
We won’t have you recreate any generative AI programs here. Instead, we will provide a quick guide through best practices to use them in the context of text classification.
The guiding principle is to be as specific as possible and narrow the desired task as much as you can. Expect that you will sometimes get incorrect results, or ones that do not align with the task you intended. As you go, iterate and fine-tune the prompts so that they become more specific.
Additionally, some advanced prompting techniques exist, including few-shot prompting and chain-of-thought prompting, which provide the LLM with a few worked examples or guide the LLM through a few reasoning steps, respectively.
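As a purely illustrative example of few-shot prompting, the snippet below assembles a prompt string from placeholder titles and labels; nothing here calls an actual LLM API, and the placeholders are yours to fill in.

```python
# Placeholder examples: substitute grant titles you have already classified
# and the UN SDG labels you assigned to them.
examples = [
    ("<title of a grant you already classified>", "<its UN SDG>"),
    ("<another classified grant title>", "<its UN SDG>"),
]
target_title = "<title of the grant you want classified>"

prompt = "".join(
    f'The grant titled "{title}" falls under {sdg}.\n' for title, sdg in examples
)
prompt += f'The grant titled "{target_title}" falls under which UN SDG?'
print(prompt)
```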
14.10.6. Lab Exercises, Part 4: Generative AI and Language Models#
Use an LLM available online, such as ChatGPT, and ask it, with simple direct questions, which UN SDGs it predicts some of the grants fall under. For example, “For a grant with the title __________, what UN SDG aligns best with the grant?” Do the classifications make sense?
Use few-shot prompting to classify some other grants. Use your previous classification models (or your own manual classifications) to provide the example “shots” for the prompt. For example, “The grant titled _____________ falls under UN SDG _____. The grant titled __________ falls under which UN SDG?” Compare the classifications from this step with those from the previous step.
(Bonus) Check out the following additional resources for more details on LLM prompting. Code along with the examples provided on those pages and provide your resulting notebooks to answer this question. Extend the principles found in these resources to some of these grant titles as well.
https://cookbook.openai.com/examples/multiclass_classification_for_transactions: a resource from OpenAI, utilizing the capabilities of some of their own models, to classify text documents
https://huggingface.co/docs/transformers/main/tasks/prompting: a resource from HuggingFace, a package we used in this section, that talks more about general LLM prompting.