10.1. Introduction#
This chapter offers a brief introduction to several important topics in data analysis:
- Linear approximation of data by OLS regression (Section 2)
- Clustering of unlabeled data using \(k\)-means (Section 3)
- Dimensionality reduction of data by principal component analysis (Section 4)
- Binary classification of labeled data using support vector machines (Section 5)
For each topic, we first illustrate the goal or key ideas underlying the analysis using a real data set (e.g., COVID test rates and death rates by zip code). We then explain how linear algebra and the solution to an optimization problem yield the desired result. For simplicity, we explain concepts and give examples in two-dimensional settings that are easy to visualize. Aggarwal [2020] gives an in-depth treatment of how the basic theory covered in this module can be extended to more advanced settings.
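As a foretaste of the two-dimensional approach used throughout the chapter, the sketch below fits an OLS regression line to a small synthetic data set with NumPy. The data values here are made up for illustration; they are not drawn from the chapter's data files.

```python
import numpy as np

# Hypothetical 2D data: x-values and noisy y-values (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix [1, x]: each row pairs an intercept term with one x-value
A = np.column_stack([np.ones_like(x), x])

# Solve the least-squares optimization problem min ||A b - y||^2
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # intercept = 0.05, slope = 1.99
```

The call to `np.linalg.lstsq` solves exactly the kind of optimization problem described above: it finds the line minimizing the sum of squared vertical distances to the data points, which is the subject of Section 2.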
The exercises include problems solvable by hand to test the reader's understanding of the theory, as well as problems that utilize Python Jupyter Notebooks (JNBs). Instructors can tailor these JNBs for use as computer labs depending on the level of students' familiarity with Python programming. The JNB exercises should at least be read, if not completed, since they show how the basic examples and concepts introduced using small two-dimensional datasets can be applied to real data. The table below lists the JNBs used in this chapter. The JNBs are available at https://tinyurl.com/2neb4z4c.
| Name of JNB | Exercise | Data File |
|---|---|---|
| OLS LINEAR REGRESSION | Exercise 2.3 | ACTCollegeEligible.csv |
| K-MEANS CLUSTERING | Exercise 3.3 | CPDdistricts.geojson |
| PRINCIPAL COMPONENT ANALYSIS | Exercises 4.5, 4.6 | standardizedindicators.xlsx |
| SUPPORT VECTOR MACHINES | Exercise 5.3 | Housing.xlsx |