4.1. Introduction to Part I
Exploratory Data Analysis (EDA) is the process of taking a first pass through a dataset to correct or remove errors and to find initial insights. This portion of the chapter covers some of the basic procedures behind EDA. Specifically, we will use a sample dataset to cover the following topics:
Basic data examination
Removing N/A values
Plotting with matplotlib
Basic statistical tests
This section starts with background on what EDA is and what it involves. It assumes knowledge of the previous sections, especially the Python programming guide. We will focus primarily on practical methods rather than the mathematical theory behind them; later sections cover the theoretical foundations of the statistics used here. The second half of the chapter applies more advanced EDA procedures to a different, much larger dataset that relates more closely to real-world problems.
4.1.1. Why is EDA Important?
In the real world, regardless of application, the datasets you have to work with are often not immediately suitable for advanced uses such as machine learning. Many contain errors, missing values, and outliers that can confound your analysis; any analysis done with these problems left in can bias your conclusions.
Additionally, it is usually a bad idea to apply advanced procedures to data without first knowing what the data contains and how it is distributed. Data analysis often builds on existing knowledge about the data and, in turn, provides new insights about it.
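To make the point about bias concrete, here is a minimal sketch using only the Python standard library. The income values are invented for illustration, and the cutoff of 200,000 used to drop the suspicious value is an arbitrary assumption, not a recommended rule.

```python
import statistics

# Hypothetical yearly incomes. The last value simulates a data-entry error
# (for example, an extra zero typed by mistake).
incomes = [32_000, 41_000, 38_500, 45_000, 36_000, 400_000]

# Summary statistics computed with the erroneous value still included.
mean_with_error = statistics.mean(incomes)
median_with_error = statistics.median(incomes)

# Drop the suspicious value. The 200_000 threshold is an assumption made
# purely for this example.
cleaned = [x for x in incomes if x < 200_000]
mean_cleaned = statistics.mean(cleaned)

print("Mean with error:  ", mean_with_error)
print("Median with error:", median_with_error)
print("Mean after cleanup:", mean_cleaned)
```

A single bad value pulls the mean far away from the typical income, while the median barely moves. Checking for values like this before running any further analysis is one of the main reasons to do EDA first.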
4.1.2. How Do We Do EDA?
The process of EDA varies between datasets and between analysis purposes, so there is no single fixed procedure. However, it typically involves the following steps:
Visually inspecting the data
Cleaning the data
Extracting insights from the data
EDA also does not have to be a one-time process. The more you examine the data and draw insights from it, the better you understand it, which in turn helps you find further insights.
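The three steps above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the small table below is made up and is not the census dataset, the column names are hypothetical, and dropping every row with a missing value is just one possible cleaning strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real census dataset.
df = pd.DataFrame({
    "age": [39, 50, np.nan, 28, 37],
    "hours_per_week": [40, 13, 40, np.nan, 45],
    "income": ["<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Step 1: inspect the data (first rows, column types, missing-value counts).
print(df.head())
df.info()
missing_per_column = df.isna().sum()
print(missing_per_column)

# Step 2: clean the data. Here we simply drop rows with any missing value.
clean = df.dropna()

# Step 3: extract a first insight, e.g. average weekly hours per income group.
mean_hours_by_income = clean.groupby("income")["hours_per_week"].mean()
print(mean_hours_by_income)
```

In practice these steps are repeated: a summary from step 3 may reveal an odd value, which sends you back to inspection and cleaning.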
4.1.3. About the Dataset
The dataset used for this portion of the chapter can be found at https://archive.ics.uci.edu/dataset/20/census+income. It comes from the UC Irvine Machine Learning Repository and concerns census data from 1994. It was originally intended for machine learning projects to predict whether a given person (represented as one row in the dataset) makes over $50,000 a year.
Knowing the context of a dataset is important, because that context informs the preliminary analyses you perform on it. We will start looking at the data in the next section.