4.2. Data Investigation I#
Oftentimes, when we explore data for the first time, we have to spend time identifying what is missing or what needs to be changed. Data files are rarely, if ever, ready for immediate use when you first receive them. Moreover, it is often unwise to use data for predictive modeling or statistical analysis without knowing some of its context and examining its contents.
Python has a helpful package called pandas that is purpose-built for viewing and manipulating data, and much of this section will rely on it.
import pandas as pd
import numpy as np

# Read in the census income data
data = pd.read_csv('census_income.csv')
data.head()
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
data.head() gives a quick look at the first few rows of the data.
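By default, head() shows the first five rows; it also accepts a row count if you want more context (the choice of 10 below is arbitrary):

```python
# Show the first 10 rows instead of the default 5
data.head(10)
```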
Exercise 2.1
Thinking of the opposite of head(), how do we get the last few rows of the data?
First things first: let's get a look at the data's size. How many rows (samples) are there? How many columns (variables)? These two numbers make up the shape of the dataframe.
data.shape
(32561, 15)
We can see that this dataset has 32,561 instances of 15 variables. This may seem like a lot, but it is actually relatively small compared to many other modern datasets, like the one used in the advanced EDA portion of this chapter, which has millions of instances!
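Since shape is just a (rows, columns) tuple, the two numbers can also be unpacked directly when you want to use them separately; a minimal sketch:

```python
# shape is a (rows, columns) tuple, so it unpacks cleanly
n_rows, n_cols = data.shape
print(f"{n_rows:,} rows and {n_cols} columns")
```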
Second, we should look at the features themselves. What do they mean? What datatype is each one? Calling data.head() shows the column names, which helps with the first question, but it does not answer the second. Instead, we can do the following:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 30725 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 30718 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 31978 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
This gives us a full list of the variables along with the datatype of each. The object datatype corresponds to strings, such as “United-States” or “Female”, while the int64 datatype corresponds to integer numbers.
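If you want to work with only one kind of column, pandas can filter the dataframe by datatype; a small sketch using select_dtypes():

```python
# Keep only the string-valued (object) columns
data.select_dtypes(include='object').head()

# Or keep only the integer columns
data.select_dtypes(include='int64').head()
```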
Exercise 2.2
Hypothesize as to what each variable name indicates. Then consult the website where the dataset is found, https://archive.ics.uci.edu/dataset/20/census+income. Do your hypotheses match the actual meaning?
You may or may not have been able to guess what each variable means just from its name; in the real world, it is common to have no idea what some variable names mean. When that happens, you will have to consult outside resources, especially your data source, to understand them.
Next, we can get quick summaries of the data. This is very easy to do with pandas dataframes.
data.describe()
|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.000000 | 3.256100e+04 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 |
| mean | 38.581647 | 1.897784e+05 | 10.080679 | 1077.648844 | 87.303830 | 40.437456 |
| std | 13.640433 | 1.055500e+05 | 2.572720 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 28.000000 | 1.178270e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
| 50% | 37.000000 | 1.783560e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
| 75% | 48.000000 | 2.370510e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
| max | 90.000000 | 1.484705e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
This describes all the numerical variables. Note that describe() excludes N/A values from its calculations: the count row reports only the number of non-missing entries, and every other statistic is computed from those entries alone. A column riddled with N/As can therefore quietly produce misleading summaries, which is one reason handling N/A values is important. We were lucky here: none of the numerical columns contain N/A values, so every count matches the full 32,561 rows. However, this dataset does have N/A values elsewhere! This can be problematic.
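To see this exclusion behavior in isolation, here is a tiny illustration with a made-up series containing one missing value:

```python
# describe() skips the NaN: count is 3, and the mean is computed
# from 1.0, 2.0, and 4.0 only
s = pd.Series([1.0, 2.0, np.nan, 4.0])
s.describe()
```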
4.2.1. Cleaning Data#
Data cleaning can mean many different things, but most commonly, the cleaning you do to make a dataset usable for analysis or machine learning is removing or otherwise managing N/A values. These values are empty, and if they are still present when you perform further analysis, they can bias any conclusions you draw from the data.
To find the number of N/As in the dataset, we can chain two functions: isna(), which flags missing values, and sum(), which totals those flags for every column.
Warning
If you use the raw dataset from the UC Irvine Machine Learning Repository, the N/A values are not actually given as N/A, but rather as a question mark (?), so later functions may not count them as missing. You can replace them with the line data = data.replace('?', np.nan).
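As a runnable sketch of that fix (the second replacement is a precaution for files read without skipinitialspace=True, where the marker can appear as ' ?' with a leading space):

```python
# Convert '?' placeholders into proper NaN values so that isna()
# and dropna() recognize them as missing
data = data.replace('?', np.nan)
# Some parses of the raw file keep a leading space, so catch ' ?' too
data = data.replace(' ?', np.nan)
```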
data.isna().sum()
age 0
workclass 1836
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 1843
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 583
income 0
dtype: int64
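To put these counts in perspective, we can express them as fractions of the total row count; a short sketch (isna().mean() works because a column of True/False values averages to the share that are True):

```python
# Fraction of missing values per column, largest first
data.isna().mean().sort_values(ascending=False).head(3)
```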
This dataset is relatively clean: most columns have no missing values, and the columns that do have only a small share of them. When N/A values are this sparse, a reasonable option is to remove every row that has an N/A in any column. Let's try that out and see how it changes our data.
#remove rows with missing values
data = data.dropna()
data.shape
(30162, 15)
This simple act dropped 2,399 rows, roughly 7% of our dataset. Now we should be ready to use this dataset for further analysis and get insights from it, which we discuss in the next section.
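Dropping rows is not the only way to manage missing values. As a sketch of one alternative, we could instead keep every row and fill the missing entries in the affected categorical columns with an explicit placeholder (the label 'Unknown' is an arbitrary choice, not part of the dataset):

```python
# Alternative to dropna(): keep every row and label missing categories.
# Assumes `data` still contains its N/A values, i.e., this is run
# before the dropna() call above.
filled = data.fillna({'workclass': 'Unknown',
                      'occupation': 'Unknown',
                      'native-country': 'Unknown'})
```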
4.2.2. More Exercises#
Exercise 2.3
Re-calculate the summary statistics after dropping N/A values and describe any changes.
Exercise 2.4
To describe categorical data, we can print out the distinct categories within a variable and the number of instances of each category. Try this out for the education column of the dataset.
Exercise 2.5
(Bonus) Write a function that can do Exercise 2.4 for any column of your choosing.