10.1. Introduction

This chapter offers a brief introduction to several important topics in data analysis:

  • Linear approximation of data by OLS regression (Section 2)

  • Clustering of unlabeled data using \(k\)-means (Section 3)

  • Dimensionality reduction of data by principal component analysis (Section 4)

  • Binary classification of labeled data using support vector machines (Section 5)

For each topic, we first illustrate the goal or key idea underlying the analysis using a real data set (e.g., COVID test rates and death rates by zip code). We then explain how linear algebra and the solution of an optimization problem achieve that goal. For simplicity, concepts and examples are presented in two-dimensional settings that are easy to visualize. Aggarwal [2020] gives an in-depth treatment of how the basic theory covered in this module can be extended to more advanced settings.
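
For readers who want to see the four techniques side by side before reading the individual sections, the following minimal sketch runs each of them on a small two-dimensional toy data set. The use of scikit-learn, the synthetic data, and the variable names are illustrative assumptions and are not taken from the chapter's notebooks.

```python
# A minimal sketch: the four techniques of this chapter applied to toy 2-D data.
import numpy as np
from sklearn.linear_model import LinearRegression   # OLS regression (Section 2)
from sklearn.cluster import KMeans                   # k-means clustering (Section 3)
from sklearn.decomposition import PCA                # principal component analysis (Section 4)
from sklearn.svm import SVC                          # support vector machines (Section 5)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # 100 points in two dimensions
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
labels = (y > 0).astype(int)                         # binary labels for classification

ols = LinearRegression().fit(X, y)                   # fit a line/plane by least squares
print("OLS coefficients:", ols.coef_)

km = KMeans(n_clusters=2, n_init=10).fit(X)          # partition the points into 2 clusters
print("Cluster sizes:", np.bincount(km.labels_))

pca = PCA(n_components=1).fit(X)                     # project onto the leading principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)

svm = SVC(kernel="linear").fit(X, labels)            # separate the two classes with a hyperplane
print("SVM training accuracy:", svm.score(X, labels))
```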

The exercises include problems solvable by hand to test the reader's understanding of the theory, as well as problems that use Python Jupyter Notebooks (JNBs). Instructors can tailor these JNBs for use as computer labs, depending on students' familiarity with Python programming. The JNB exercises should at least be read, if not worked through to completion, since they show how the basic examples and concepts introduced with small two-dimensional datasets can be applied to real data. The table below lists the JNBs used in this chapter; they are available at https://tinyurl.com/2neb4z4c.

Name of JNB                  | Exercise           | Data File
OLS LINEAR REGRESSION        | Exercise 2.3       | ACTCollegeEligible.csv
K-MEANS CLUSTERING           | Exercise 3.3       | CPDdistricts.geojson
PRINCIPAL COMPONENT ANALYSIS | Exercises 4.5, 4.6 | standardizedindicators.xlsx
SUPPORT VECTOR MACHINES      | Exercise 5.3       | Housing.xlsx
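
The short sketch below shows one way to load these data files once they have been downloaded from the link above; it is not taken from the JNBs, and it assumes the files sit in the current working directory and that pandas and geopandas are installed.

```python
# A minimal sketch for loading the data files listed in the table above.
import pandas as pd
import geopandas as gpd                                     # needed for the .geojson file

act = pd.read_csv("ACTCollegeEligible.csv")                 # OLS linear regression exercise
districts = gpd.read_file("CPDdistricts.geojson")           # k-means clustering exercise
indicators = pd.read_excel("standardizedindicators.xlsx")   # PCA exercises (requires openpyxl)
housing = pd.read_excel("Housing.xlsx")                     # support vector machines exercise

# Quick check that each file loaded: print (rows, columns) for each data set.
print(act.shape, districts.shape, indicators.shape, housing.shape)
```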