4.5. Introduction to Part II#

In this portion of the chapter, we will introduce some more challenging methods for exploratory data analysis. This portion can be read on its own or in conjunction with the previous sections on basic EDA.

This portion of the chapter will go over the following:

  • Basic data examination

  • Removing N/A values

  • Basic transformations

  • Advanced transformations

  • Plotting with matplotlib and seaborn

  • Statistical testing

  • Writing and producing reports

Here, we will focus on the practical methodology of implementing these procedures, and much less on the mathematical theory behind how and why they work. This portion is also more open-ended in terms of exercises and may require you to consult outside resources, such as Python package documentation.

4.5.1. About the Dataset#

The dataset used for the advanced EDA portion of this chapter comes from the Department of Human Services (DHS) in Allegheny County, Pennsylvania, USA, where the city of Pittsburgh is located. The data are a synthetic version of the records the DHS keeps on the usage of its services; these records have been used in projects utilizing machine learning to assess child maltreatment risk (see https://proceedings.mlr.press/v81/chouldechova18a.html). Since the raw version of the records contains large amounts of sensitive and private information, the DHS, along with CountyStat, Pittsburgh’s Urban Institute, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center (WPRDC), created a synthetic version that carries no sensitive information and is therefore freely available to the public.

The data can be found at https://data.wprdc.org/dataset/synthetic-integrated-services-data, along with other information about the dataset and synthetic data in general. However, since it is a large dataset and may take a while to download, we recommend downloading the compressed version at https://data.wprdc.org/dataset/synthetic-integrated-services-data/resource/d5b12f18-ce4e-4565-b7ed-d5565e18bd43. We will explore the data in more detail in later sections.
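
As a minimal sketch of loading the data, assuming the compressed file has been saved in your working directory as `synthetic_data.csv.gz` (the actual filename may differ depending on what you downloaded):

```python
import pandas as pd

# pandas reads gzip-compressed CSVs directly; the filename here is an
# assumption -- adjust it to match the file you actually downloaded
df = pd.read_csv("synthetic_data.csv.gz")

print(df.shape)  # (number of rows, number of columns)
```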

4.5.2. What is EDA?#

EDA, or exploratory data analysis, is the process of looking at, cleaning, transforming, and drawing initial insights from data before it is used in more advanced methodologies such as machine learning.

The process of EDA varies between datasets and does not have a specific procedure. However, it typically proceeds through the following steps:

  1. Looking at the data

  2. Cleaning the data

  3. Transforming the data

  4. Getting insights from data

Data scientists like to describe EDA as the process of “having conversations” with the data. Just as with real-life people, the more conversations you have, the more you find out. The first time you talk to someone new, you may only learn about their day-to-day life or some of their hobbies; with more conversations, you get to know them better, and that knowledge opens the door to learning even more. The same holds when working with data: the more you work with it, look at it, and experiment with it, the more you will learn about the data.
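
As a minimal sketch of the first step, looking at the data, assuming the dataset has been loaded into a pandas DataFrame named `df` as in the earlier sketch:

```python
import pandas as pd

# assumes the file downloaded earlier; adjust the name as needed
df = pd.read_csv("synthetic_data.csv.gz")

df.info()               # column names, dtypes, and non-null counts
print(df.head())        # the first five rows
print(df.isna().sum())  # number of missing values in each column
```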

4.5.3. Why is EDA Important?#

Very few times in academia or industry will you encounter a dataset that is ready-made and good to go for advanced uses like machine learning. In fact, even if you have such a dataset, it is often a bad idea to jump straight into advanced uses of the data and draw conclusions about it without looking at it first. Real-world data is often full of missing values, incomplete entries, and improper formatting, and using the data without understanding it first will let these errors influence any conclusions you make about it.

A majority of the time spent in EDA goes toward cleaning the data and preparing it for use. While humans may be able to read small datasets and perform operations on them just fine, computers are often more finicky and behave differently under different circumstances. For example, a single missing value can prevent a computer from computing the mean of a set of numbers; in another case, storing a variable as the “string” datatype instead of “category” may make operations slower and more memory-hungry, since the two are handled differently internally.
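
Both behaviors are easy to demonstrate. Here is a small sketch using NumPy and pandas, with made-up values for illustration:

```python
import numpy as np
import pandas as pd

# a single missing value makes the plain NumPy mean undefined...
values = np.array([1.0, 2.0, np.nan, 4.0])
print(np.mean(values))           # nan

# ...whereas pandas skips missing values by default
print(pd.Series(values).mean())  # 2.333...

# string vs. category dtype: same data, very different memory footprint
s = pd.Series(["low", "medium", "high"] * 100_000)
print(s.memory_usage(deep=True))                     # string dtype
print(s.astype("category").memory_usage(deep=True))  # far smaller
```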

4.5.4. Integrating Domain Knowledge#

Every data analysis takes place in a context, whether you are doing the analysis for a course project or for your job. In any case, data analysis does not occur in a bubble.

Data analysis needs the input of those who are knowledgeable about the context of the data; these people are called domain experts, and their knowledge is called domain knowledge. Data analysis should build on what domain experts already know; in turn, domain knowledge can grow with subsequent data analysis. Without each other, progress towards understanding the data will be slow, if it happens at all.

4.5.5. Guiding Questions#

When performing EDA, we often have guiding questions that help us chart a straightforward path for our analysis. These questions can come from domain experts, papers that have used the data before, or, in some cases, simply thinking about the data itself (again, more conversations with the data open up avenues to get to know it even better).

4.5.6. What Makes EDA Here “Advanced”?#

While the EDA here builds on the basics of EDA covered in previous sections, it does not simply rely on pre-existing Python packages. In some cases, it requires writing your own functions to accomplish a specific goal, which involves thinking carefully about how to break the goal into logical steps and implement them in code.
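
For instance, here is a minimal sketch of the kind of function you might write yourself: a hypothetical helper, `missing_summary` (not from any existing package), that reports how many values are missing in each column of a DataFrame and what fraction of the column that represents:

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its missing-value count and fraction."""
    counts = df.isna().sum()
    return pd.DataFrame({
        "n_missing": counts,
        "frac_missing": counts / len(df),
    }).sort_values("frac_missing", ascending=False)

# example usage on a tiny made-up DataFrame
toy = pd.DataFrame({"a": [1, None, 3], "b": [None, None, "x"]})
print(missing_summary(toy))
```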

Another reason the EDA in this section is more advanced is that there are far fewer exercises; they are replaced by a section-spanning project concerning the whole dataset. This project may also require outside resources and problem-solving skills that cannot be succinctly explained in this chapter. Because of this, the project will require more knowledge and take more time to complete.

It should also be noted that using Python to conduct EDA is just one way of doing EDA. There are a myriad of tools, each with its own distinct uses, that can be helpful for EDA. These include Tableau, commonly used software for data visualization that requires little coding; even Excel can be useful for looking at datasets, though its capabilities are far more limited than those of a programming language. Additionally, since the dataset we are using is very large, other programming languages can be helpful.

So far, this entire book has used Python, which is an interpreted language: every line you write is executed at run time by an interpreter, which in turn calls layer upon layer of built-in code, some written in Python and some in lower-level languages like C++. These many layers of code calling other code make Python memory-intensive and often slower than its counterparts, known as compiled languages. With a compiled language, the code you write is translated directly into machine code for the computer to execute, which means far fewer layers of calls and a less memory-intensive procedure.

The downside of many compiled languages is that they are semantically farther from natural human language (in this case, English) than interpreted languages like Python, so they are harder to read and understand. Additionally, since compiled languages have fewer widely used, purpose-built packages, you often have to write more code yourself, and that code tends to be less flexible (e.g., certain languages cannot handle datasets with both integers and strings).

Ultimately, it is up to you which tools to use. In this section, we will use only Python, but knowing that you can use whatever tools you are most comfortable with may be helpful in your future endeavors and in the project presented in this chapter.