
14.6. Evaluating Classification#

In many classification problems, the metrics we use include precision, recall, F1, accuracy, and their macro- and micro-averaged variants. Additional common metrics for classification include ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve), often combined as ROC-AUC, and for multi-class classification, we can test the model’s ability to classify items on a One vs. Rest (OvR) or One vs. One (OvO) basis.

In Section 2, we defined precision, recall, and F1. Accuracy is simply the proportion of samples whose class is predicted correctly, taken over the whole dataset. For the averaged metrics, the macro average computes the metric separately for each class and takes the unweighted mean, so every class counts equally regardless of its size; the micro average pools the counts (true positives, false positives, and false negatives) across all classes before computing the metric, so larger classes carry proportionally more weight.
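As a quick illustration (using a handful of made-up labels rather than our SDG data), scikit-learn exposes both averaging schemes through the average parameter of its metric functions:

from sklearn import metrics

# toy labels, purely for illustration
y_true = [1, 1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1]

# macro: unweighted mean of the per-class F1 scores (every class counts equally)
print(metrics.f1_score(y_true, y_pred, average="macro"))
# micro: F1 computed from the pooled TP/FP/FN counts (dominated by the large class)
print(metrics.f1_score(y_true, y_pred, average="micro"))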

For this section, we’re going to focus on the results from our ridge classification in the previous section. Notice that we get per-class results for each goal, in addition to a confusion matrix for the predictions on the testing set.

# rescale the document embeddings before fitting
docs = text_df.embedding.tolist()
scaler = preprocessing.MinMaxScaler().fit(docs)
X = scaler.transform(docs)
y = text_df.sdg

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=7)

# fit the ridge classifier and report per-class metrics on the test set
ridge_clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")
ridge_clf = ridge_clf.fit(X_train, y_train)
y_pred = ridge_clf.predict(X_test)

print(metrics.classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
              precision    recall  f1-score   support

           1     0.7340    0.7401    0.7371       481
           2     0.7529    0.8291    0.7892       316
           3     0.9084    0.9125    0.9104       674
           4     0.8607    0.9525    0.9043       863
           5     0.8723    0.9359    0.9030       920
           6     0.8478    0.8624    0.8550       465
           7     0.8171    0.8753    0.8452       730
           8     0.6942    0.4759    0.5647       353
           9     0.7310    0.6463    0.6861       328
          10     0.7006    0.4297    0.5327       256
          11     0.7747    0.7965    0.7855       462
          12     0.8841    0.5622    0.6873       217
          13     0.7468    0.7923    0.7689       443
          14     0.8849    0.8479    0.8660       263
          15     0.8576    0.8083    0.8322       313
          16     0.9004    0.9499    0.9245      1057

    accuracy                         0.8312      8141
   macro avg     0.8105    0.7761    0.7870      8141
weighted avg     0.8276    0.8312    0.8257      8141
[Figure: confusion matrix of the ridge classifier’s predictions on the test set]

14.6.1. The ROC Curve#

The goal of any classifier is to maximize the true positive rate (TPR) while minimizing the false positive rate (FPR). The ROC curve is constructed as follows: for each possible decision threshold, we compute the resulting TPR and FPR and plot the point, with TPR on the Y-axis and FPR on the X-axis. Points at the top left corner of the plot imply an FPR of 0 and a TPR of 1 (i.e., perfect classification).
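Concretely, at a given threshold both rates are computed from the confusion-matrix counts:

\[
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}
\]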

A model can be evaluated qualitatively by judging the “steepness” of its ROC curve; this evaluation can be made quantitative by computing the Area Under the Curve (AUC). A larger AUC implies a steeper curve and is usually better.

For binary classification, each point on the ROC curve corresponds to a different threshold and is a candidate operating point for the final classifier; the choice made typically depends on the business requirements and constraints.
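As a minimal sketch (with made-up labels and scores rather than our model’s output), metrics.roc_curve returns the thresholds alongside the FPR/TPR pairs, so each point on the curve can be traced back to a concrete decision threshold:

import numpy as np
from sklearn import metrics

# toy binary labels and predicted scores, purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
print("AUC =", metrics.auc(fpr, tpr))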

14.6.2. Multiclass Classifiers#

In any multiclass classification problem, there are multiple class labels \((c_1, c_2, …, c_k)\) in the data, and each training sample is labeled as belonging to one class only. The techniques used with binary classifiers can be applied to multiclass classifiers by reducing the problem to binary classification. Two strategies are typically used to do this: One vs. Rest (OvR) and One vs. One (OvO).

One vs Rest#

The OvR method of evaluation judges the model’s ability to determine whether a given item belongs to a given class or to one of the other classes. For each class label \(c\) in \((c_1, c_2, …, c_k)\), we say that the positive samples are those that are labeled as class \(c\), and the negative samples are all the rest of the samples. We then fit and predict for class \(c\).

To score a sample, we run all \(k\) per-class classifiers and assign the class whose classifier returns the highest score.
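As a minimal sketch with a made-up score matrix (three samples, three classes), the assignment is simply an argmax over the per-class scores:

import numpy as np

# toy per-class scores: rows are samples, columns are the k one-vs-rest classifiers
scores = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.2, 0.6]])
classes = np.array([1, 2, 3])  # illustrative class labels

predicted = classes[np.argmax(scores, axis=1)]
print(predicted)  # [2 1 3]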

When looking at the ROC curve for the OvR strategy on our classification, we get the following graphic:

docs = text_df.embedding.tolist()
scaler = preprocessing.MinMaxScaler().fit(docs)
X = scaler.transform(docs)
y = text_df.sdg

label_binarizer = LabelBinarizer().fit(y)
y_onehot = label_binarizer.transform(y)
n_classes = len(label_binarizer.classes_)
class_names = [sdg_names[sdg_names["sdg"] == label_binarizer.classes_[i]].sdg_name.item() \
               for i in range(n_classes)]

X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=.33, random_state=0)
ovr_mlp_clf = OneVsRestClassifier(MLPClassifier(random_state=0, max_iter=300)).fit(X_train,y_train)
y_score = ovr_mlp_clf.predict_proba(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for class_id in range(n_classes):
    fpr[class_id], tpr[class_id], _ = metrics.roc_curve(y_test[:, class_id], y_score[:, class_id]) # roc_curve works on binary
    roc_auc[class_id] = metrics.auc(fpr[class_id], tpr[class_id])

# 'tab20' provides enough visually distinct colors for all 16 classes
from matplotlib import colormaps
colors = list(colormaps['tab20'].colors)

for class_id, color in zip(range(n_classes), colors):
    plt.plot(fpr[class_id], tpr[class_id], color=color, lw=1.5, alpha=1,
             label='SDG {0} - {1} (AUC = {2:0.2f})'.format(
                 class_id + 1, class_names[class_id], roc_auc[class_id]))
plt.plot([0, 1], [0, 1], '--', lw=1, color="lightgrey")
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC for SDG multiclass classifier')
plt.legend(loc="lower right", fontsize=6)
plt.show()
[Figure: one-vs-rest ROC curves for each SDG class, with the per-class AUC reported in the legend]

We can see that each of the goals gets its own curve (goal 17 does not appear in the data, so there are 16 curves), and the AUC score for each one is reported in the legend.

Exercise 5.1

Create a similar ROC curve plot for your Naive Bayes classification and Multilayer Perceptron classification from the previous section.

One vs One#

The OvO approach compares each class against every other class, one pair at a time. For each class label \(c\) in \((c_1, c_2, …, c_k)\) and each pair \((c, c_j)\) with \(c_j \neq c\), we say that the positive samples are those labeled as class \(c\), and the negative samples are those labeled as class \(c_j\).

Each sample to be scored is passed to all \(k(k-1)/2\) pairwise classifiers, and we then vote by majority to determine the class for the sample.
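A minimal sketch of the voting step, with made-up pairwise decisions for a single sample and \(k = 3\) classes:

from collections import Counter
from itertools import combinations

classes = [1, 2, 3]
# hypothetical winner of each pairwise classifier for one sample
pairwise_winner = {(1, 2): 2, (1, 3): 1, (2, 3): 2}

votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
print(votes.most_common(1)[0][0])  # class 2 wins two of the three pairwise contests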

Similar to the OvR strategy, we can compute an overall multiclass ROC-AUC score and then plot ROC curves for individual classes of interest:

docs = text_df.embedding.tolist()
scaler = preprocessing.MinMaxScaler().fit(docs)
X = scaler.transform(docs)
y = text_df.sdg

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=7)

logit_clf = LogisticRegression(tol=1e-2, solver="liblinear")
logit_clf = logit_clf.fit(X_train, y_train)

# metrics.roc_auc_score works for both binary and multiclass problems
y_score = logit_clf.predict_proba(X_test)
print(metrics.roc_auc_score(y_test, y_score, multi_class='ovr'))
0.9830440625522275
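The score printed above averages the per-class one-vs-rest AUCs. If we want the One vs. One flavor instead, roc_auc_score also accepts multi_class='ovo', which macro-averages the AUC over all pairs of classes (this reuses y_test and y_score from the cell above):

# macro-average of the pairwise (one-vs-one) ROC-AUC scores
print(metrics.roc_auc_score(y_test, y_score, multi_class='ovo'))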
label_binarizer = LabelBinarizer().fit(y)
y_onehot_test = label_binarizer.transform(y_test)

fig, ax = plt.subplots(figsize=(6,6))

class_of_interest = 8 # SDG 8
class_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)[0]
RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id],
    y_score[:, class_id],
    name=f"SDG {class_of_interest} vs the rest",
    color="darkorange",
    ax = ax,
)

class_of_interest = 16
class_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)[0]
RocCurveDisplay.from_predictions(
    y_onehot_test[:, class_id],
    y_score[:, class_id],
    name=f"SDG {class_of_interest} vs the rest",
    color="purple",
    ax = ax,
)

plt.plot([0, 1], [0, 1], "--", label="chance level (AUC = 0.5)", color="grey")
plt.plot([0.05, 0.05], [0, 1], "--", label="FPR at 0.05 level", color="green")

plt.axis("square")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("One-vs-Rest ROC curves")
plt.legend()
plt.show()
[Figure: one-vs-rest ROC curves for SDG 8 and SDG 16, with the chance level and the FPR = 0.05 reference line]

Here, we identified two classes that we wanted to examine specifically, and these two are the only curves present on the plot. Note that the AUC values for the goals we picked differ from those obtained with the OvR strategy above.

Exercise 5.2

Create another ROC curve plot for three additional goals of your choosing.

Exercise 5.3

Using SDGs 8 and 16, recreate the ROC curve plot using your Naive Bayes and Multilayer Perceptron classifications from the previous sections.

14.6.3. More Exercises#

Exercise 5.4

Modify your function from Exercise 3.1 so that it also outputs the test set metrics for the classification you run. Be sure to include precision, recall, F1, and accuracy.

Exercise 5.5

What does a point on the ROC curve at (1,1) indicate about the corresponding model? Similarly, what does a point at (0,0) on the curve indicate about the model?