4.6. Data Investigation II#
Often, when we explore data for the first time, we have to spend time looking at what is missing or what needs to be changed; data files are rarely, if ever, ready for immediate use the moment you receive them. Additionally, it is unwise to use data for predictive modeling or statistical analysis without knowing some of its context and examining its contents.
Python has a helpful package called pandas
which is purpose-built for looking at and manipulating data, and a lot of what we will work on in this section will involve using it.
4.6.1. Setup#
The first thing we typically do when dealing with data in Python is to import the pandas package and one of its dependencies, numpy.
import pandas as pd
import numpy as np
Due to the large size of the provided data file, we store it in a different directory than the one where this notebook lives. We can set the working directory using the os package.
import os
os.chdir('<your-directory>')
df = pd.read_csv('dhs_service_records_synthesized_final.csv')
4.6.2. Simple Data Inspection#
Before doing anything with the data, it is good practice to look at a few samples from the dataset. Here we look at the first five rows and the last seven rows, the head and the tail of the data. These methods are cheap to call, and calling them often is a good habit so that you know how the dataset evolves as you work with it.
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
df.tail(7)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116127 | SYNTHETIC DATA | 530696 | 2021 | 06/30/2021 | Suicides | NaN | 21 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116128 | SYNTHETIC DATA | 530765 | 2021 | 06/30/2021 | Suicides | NaN | 15 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116129 | SYNTHETIC DATA | 530832 | 2021 | 06/30/2021 | Suicides | NaN | 17 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116130 | SYNTHETIC DATA | 532121 | 2021 | 04/30/2021 | Suicides | NaN | 22 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
7116131 | SYNTHETIC DATA | 533991 | 2021 | 11/30/2021 | Suicides | NaN | 16 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
7116132 | SYNTHETIC DATA | 534127 | 2021 | 07/31/2021 | Suicides | NaN | 19 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 2~Not Hispanic/Latinx | NaN | NaN | 4~Widowed | GRAD~Graduate Degree | NaN |
7116133 | SYNTHETIC DATA | 535156 | 2021 | 06/30/2021 | Suicides | NaN | 24 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
4.6.3. Dataframe Attributes#
You can get a lot of helpful information from the attributes of the dataframe; specifically, we are going to look at dataframe dimensions, column names, column data types, and the ranges of column values. To get these, we can easily call some prebuilt attributes that are assigned to pandas dataframes. These help us to understand our data before we even start modifying it.
df.shape
(7116134, 20)
df.columns
Index(['synthetic_data', 'MCI_UNIQ_ID', 'CALDR_YR', 'DATE_OF_EVENT', 'service',
'GEO_AREA', 'age', 'DOB', 'DOD', 'GENDER', 'GENDER_IDENTITY',
'SEX_ORIENT', 'LEGAL_SEX', 'RACE', 'ETHNICITY', 'LIVING_ARRANGEMENT',
'EMPLOYMENT_STATUS', 'MARITAL_STATUS', 'EDUCATION_LEVEL',
'VETERAN_FLAG'],
dtype='object')
Pandas dataframes also have a helpful method that summarizes the whole frame at once: df.info().
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7116134 entries, 0 to 7116133
Data columns (total 20 columns):
# Column Dtype
--- ------ -----
0 synthetic_data object
1 MCI_UNIQ_ID int64
2 CALDR_YR int64
3 DATE_OF_EVENT object
4 service object
5 GEO_AREA float64
6 age int64
7 DOB float64
8 DOD float64
9 GENDER object
10 GENDER_IDENTITY float64
11 SEX_ORIENT float64
12 LEGAL_SEX float64
13 RACE object
14 ETHNICITY object
15 LIVING_ARRANGEMENT float64
16 EMPLOYMENT_STATUS float64
17 MARITAL_STATUS object
18 EDUCATION_LEVEL object
19 VETERAN_FLAG float64
dtypes: float64(9), int64(3), object(8)
memory usage: 1.1+ GB
Another way to check the data types, if you just want that, is to use the dtypes
attribute of the dataframe.
df.dtypes
synthetic_data object
MCI_UNIQ_ID int64
CALDR_YR int64
DATE_OF_EVENT object
service object
GEO_AREA float64
age int64
DOB float64
DOD float64
GENDER object
GENDER_IDENTITY float64
SEX_ORIENT float64
LEGAL_SEX float64
RACE object
ETHNICITY object
LIVING_ARRANGEMENT float64
EMPLOYMENT_STATUS float64
MARITAL_STATUS object
EDUCATION_LEVEL object
VETERAN_FLAG float64
dtype: object
We can also find the number of unique entries in each column. This matters because in some cases we do not want duplicates. In the case of these data from Allegheny County, unique IDs are important: each one corresponds to a single person, and we can track that person across their usage of many different services.
df.nunique()
synthetic_data 1
MCI_UNIQ_ID 533799
CALDR_YR 1
DATE_OF_EVENT 12
service 22
GEO_AREA 0
age 112
DOB 0
DOD 0
GENDER 3
GENDER_IDENTITY 0
SEX_ORIENT 0
LEGAL_SEX 0
RACE 7
ETHNICITY 3
LIVING_ARRANGEMENT 0
EMPLOYMENT_STATUS 0
MARITAL_STATUS 7
EDUCATION_LEVEL 8
VETERAN_FLAG 0
dtype: int64
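A related check, sketched here on a small toy frame rather than the full dataset, is counting exact duplicate rows with the duplicated() method:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 2],
    "service": ["A", "A", "B"],
})

# duplicated() flags each row that exactly repeats an earlier row;
# sum() counts those flags.
n_dupes = toy.duplicated().sum()
print(n_dupes)  # 1 -- the second row repeats the first
```

For this dataset, repeated IDs are expected (one person can use many services over many months), so exact-row duplicates, not repeated IDs, are what would signal a data problem.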
With the complete dataset, the synthetic_data column contains a single value, so the next two cells simply confirm that every row is flagged as synthetic.
#every row should carry the same 'SYNTHETIC DATA' flag;
#we group by the column to confirm.
df.groupby("synthetic_data").count()
MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
synthetic_data | |||||||||||||||||||
SYNTHETIC DATA | 7116134 | 7116134 | 7116134 | 7116134 | 0 | 7116134 | 0 | 0 | 7116134 | 0 | 0 | 0 | 7116134 | 7116134 | 0 | 0 | 7116134 | 7116134 | 0 |
df["synthetic_data"].value_counts()
SYNTHETIC DATA 7116134
Name: synthetic_data, dtype: int64
One way to get an overview of the data values is another helpful method, df.describe(). This provides several summary statistics for each numeric column in the set.
df.describe()
MCI_UNIQ_ID | CALDR_YR | GEO_AREA | age | DOB | DOD | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7.116134e+06 | 7116134.0 | 0.0 | 7.116134e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
mean | 2.674853e+05 | 2021.0 | NaN | 3.267702e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1.544397e+05 | 0.0 | NaN | 2.258095e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 1.000000e+00 | 2021.0 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 1.337730e+05 | 2021.0 | NaN | 1.300000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 2.675410e+05 | 2021.0 | NaN | 3.000000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 4.010820e+05 | 2021.0 | NaN | 4.900000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 5.356080e+05 | 2021.0 | NaN | 1.210000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Some of these are not helpful: the mean ID value means nothing, and the calendar year is constant since every record comes from the same year. Note also that the mean, standard deviation, and quartiles show up as NaN for the columns that contain only NaN values; describe() skips missing values, so a statistic is NaN only when there is nothing left to summarize. We should find out where these NaN values are.
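To see this behavior of describe() concretely, here is a small toy frame (the column names are illustrative): a partially missing column is still summarized, while a column of only NaNs yields NaN statistics.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "some_nan": [1.0, np.nan, 3.0],       # partially missing
    "all_nan": [np.nan, np.nan, np.nan],  # entirely missing
})

stats = toy.describe()
print(stats.loc["mean", "some_nan"])  # 2.0 -- NaNs are skipped
print(stats.loc["mean", "all_nan"])   # nan -- nothing left to summarize
```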
4.6.4. Finding NaN values#
To count the NaN values (also written N/A or NA) in each column, we can chain the isna() method with the sum() method.
df.isna().sum()
synthetic_data 0
MCI_UNIQ_ID 0
CALDR_YR 0
DATE_OF_EVENT 0
service 0
GEO_AREA 7116134
age 0
DOB 7116134
DOD 7116134
GENDER 0
GENDER_IDENTITY 7116134
SEX_ORIENT 7116134
LEGAL_SEX 7116134
RACE 0
ETHNICITY 0
LIVING_ARRANGEMENT 7116134
EMPLOYMENT_STATUS 7116134
MARITAL_STATUS 0
EDUCATION_LEVEL 0
VETERAN_FLAG 7116134
dtype: int64
Note that df.isnull() is simply an alias for df.isna(), so the cell below produces identical results.
df.isnull().sum()
synthetic_data 0
MCI_UNIQ_ID 0
CALDR_YR 0
DATE_OF_EVENT 0
service 0
GEO_AREA 7116134
age 0
DOB 7116134
DOD 7116134
GENDER 0
GENDER_IDENTITY 7116134
SEX_ORIENT 7116134
LEGAL_SEX 7116134
RACE 0
ETHNICITY 0
LIVING_ARRANGEMENT 7116134
EMPLOYMENT_STATUS 7116134
MARITAL_STATUS 0
EDUCATION_LEVEL 0
VETERAN_FLAG 7116134
dtype: int64
We can see the columns [‘GEO_AREA’, ‘DOB’, ‘DOD’, ‘GENDER_IDENTITY’, ‘SEX_ORIENT’, ‘LEGAL_SEX’, ‘LIVING_ARRANGEMENT’, ‘EMPLOYMENT_STATUS’, ‘VETERAN_FLAG’] have only NaN values.
An earlier version of the data ended in a row of only NA values; with the complete version, we can verify that the last row contains real data.
df.isna().tail(1)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116133 | False | False | False | False | False | True | False | True | True | False | True | True | True | False | False | True | True | False | False | True |
Let's clean up the dataframe: we will drop any rows, and then any columns, that contain only NA values.
df.tail(1)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116133 | SYNTHETIC DATA | 535156 | 2021 | 06/30/2021 | Suicides | NaN | 24 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
4.6.5. Removing NA Columns and Rows#
To drop all rows with only NaN values, we can use the dropna() method.
df.dropna(how='all', inplace=True)
df.shape
(7116134, 20)
With this call, the axis parameter defaulted to 0 (rows), so only rows consisting entirely of NA values were removed; there are none in this dataset, and the shape is unchanged. To remove columns instead, we must set axis=1.
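The axis convention is easy to mix up, so here is a toy frame (illustrative, not from the dataset) with no all-NA rows but one all-NA column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [np.nan, np.nan],  # an entirely-NA column, like GEO_AREA
})

print(toy.dropna(how='all').shape)          # (2, 2) -- no row is entirely NA
print(toy.dropna(axis=1, how='all').shape)  # (2, 1) -- column 'b' is dropped
```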
#we want to remove all the columns that only have NA values
df.dropna(axis=1, how='all', inplace=True)
df.shape
(7116134, 11)
What if we instead wanted to drop any row containing even a single NA value? Removing the how='all' argument falls back to the default, how='any', which drops every row with at least one NA.
# df.dropna(inplace=True) # will drop rows with any NA.
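Between those two extremes, dropna() also accepts a thresh argument: keep only the rows that have at least that many non-NA values. A small sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 3.0, np.nan],
    "c": [4.0, np.nan, np.nan],
})

# keep only rows with at least 2 non-NA values
kept = toy.dropna(thresh=2)
print(kept.shape)  # (1, 3) -- only the first row qualifies
```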
In addition to the head and the tail, we will also look at three random rows; it is important to inspect data beyond the edges of the file.
# set seed for random function so that we get same rows when re-run the cell
np.random.seed(17)
df.iloc[np.random.randint(0, df.shape[0],3)]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
589041 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539983 | SYNTHETIC DATA | 449747 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
2078086 | SYNTHETIC DATA | 109525 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
We will also take a look at the records for a specific MCI_UNIQ_ID. Let's look at client 364757.
df[df['MCI_UNIQ_ID'] == 364757]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
589034 | SYNTHETIC DATA | 364757 | 2021 | 02/28/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589035 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589036 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589037 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589038 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589039 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589040 | SYNTHETIC DATA | 364757 | 2021 | 09/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589041 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589042 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589043 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675581 | SYNTHETIC DATA | 364757 | 2021 | 01/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675582 | SYNTHETIC DATA | 364757 | 2021 | 02/28/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675583 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675584 | SYNTHETIC DATA | 364757 | 2021 | 04/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675585 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675586 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675587 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675588 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675589 | SYNTHETIC DATA | 364757 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675590 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675591 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675592 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710438 | SYNTHETIC DATA | 364757 | 2021 | 01/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710439 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710440 | SYNTHETIC DATA | 364757 | 2021 | 04/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710441 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710442 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710443 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710444 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710445 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710446 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710447 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
And similarly for client 449747.
df[df['MCI_UNIQ_ID'] == 449747]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
5539976 | SYNTHETIC DATA | 449747 | 2021 | 01/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539977 | SYNTHETIC DATA | 449747 | 2021 | 03/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539978 | SYNTHETIC DATA | 449747 | 2021 | 04/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539979 | SYNTHETIC DATA | 449747 | 2021 | 05/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539980 | SYNTHETIC DATA | 449747 | 2021 | 06/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539981 | SYNTHETIC DATA | 449747 | 2021 | 07/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539982 | SYNTHETIC DATA | 449747 | 2021 | 08/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539983 | SYNTHETIC DATA | 449747 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539984 | SYNTHETIC DATA | 449747 | 2021 | 10/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539985 | SYNTHETIC DATA | 449747 | 2021 | 11/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4.6.6. Basic Transformations#
Often when we encounter data, it is in a format that is clunky to use, and part of exploratory data analysis is making the dataset friendlier to work with. Here we will go over four transformations: substitution, generation, outlier removal, and illogical value removal.
Substitution#
Several columns hold raw strings such as "Individuals_Receiving_DHS_Services". Plain strings are clunky to work with; converting these columns to the pandas category dtype makes plots easier to read and lets us run statistical tests on the numeric codes that back each category.
#we want to take the columns for gender, race, ethnicity, marital status, and education level
#and turn these columns into the category data type.
#first we add a new column to the dataframe that has this dtype
#but named with a _cat suffix.
df["GENDER_cat"] = df["GENDER"].astype('category')
df["RACE_cat"] = df["RACE"].astype('category')
df["ETHNICITY_cat"] = df["ETHNICITY"].astype('category')
df["MARITAL_STATUS_cat"] = df["MARITAL_STATUS"].astype('category')
df["EDUCATION_LEVEL_cat"] = df["EDUCATION_LEVEL"].astype('category')
df["service_cat"] = df["service"].astype('category')
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
#the ID column should be categorical as well
df["MCI_UNIQ_ID_cat"] = df["MCI_UNIQ_ID"].astype('category')
Generation#
In other cases, we may want to generate a new variable. Sometimes these are combinations of existing variables; here, we will convert the age column into a categorical variable by putting ages into bins of ten years.
#first we will define a helper function that takes in an int
#and returns the corresponding age bin value (grouped by 10), so a 5 gets returned as "0-9"
def age_bin(age):
if age < 10:
return "0-9"
elif age < 20:
return "10-19"
elif age < 30:
return "20-29"
elif age < 40:
return "30-39"
elif age < 50:
return "40-49"
elif age < 60:
return "50-59"
elif age < 70:
return "60-69"
elif age < 80:
return "70-79"
elif age < 90:
return "80-89"
else:
return "90+"
#a chain of comparisons is verbose, but it is explicit
#and gets the job done.
#we will now apply this function to the age column and create a new column called AGE_BIN
df["AGE_BIN"] = df["age"].apply(age_bin)
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | MCI_UNIQ_ID_cat | AGE_BIN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
Another kind of generation is splitting the date of event to get the month in which each service occurred. Since every date uses the same MM/DD/YYYY layout, we can build a month label from the first two characters (the month) and the last five characters (the year) of the DATE_OF_EVENT column.
#make a new month column
#built from the first two characters (month) plus the last five characters (/year)
df["month"] = df["DATE_OF_EVENT"].str[:2] + df["DATE_OF_EVENT"].str[-5:]
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | MCI_UNIQ_ID_cat | AGE_BIN | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 01/2021 |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 02/2021 |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 03/2021 |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 04/2021 |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 05/2021 |
Outlier Removal#
Often, data contain outliers: points that sit far from the bulk of the data. With purely categorical variables, outlier removal usually does not apply, but it is common with numeric features. Here, age is the only substantive numeric column, so let's remove all rows with an age of 105 or greater.
#want to remove all entries from the dataset where the age entry is 105 or more
df = df[df["age"] < 105]
df.shape
(7115529, 20)
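The 105-year cutoff is a domain judgment. A common data-driven alternative, sketched here on a toy series with one implausible value, is the 1.5 * IQR rule:

```python
import pandas as pd

ages = pd.Series([5, 13, 30, 49, 54, 300])  # one implausible value

# flag anything beyond 1.5 interquartile ranges above the third quartile
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

kept = ages[ages <= upper]
print(300 in kept.values)  # False -- the implausible value is removed
```

Whether a rule like this is appropriate depends on the distribution; for ages, a hard domain cutoff is often the safer choice.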
Illogical Value Removal#
This kind of transformation is slightly harder because we must look at multiple features at once, but it is still important. Some rows in the dataset describe children aged 5 yet list them as being in high school.
#we want to remove all the values where the age is 0-9
#AND where they are listed as in high school as "9-12~High School (grade 9-12)"
df = df[~((df["AGE_BIN"] == "0-9") & (df["EDUCATION_LEVEL"] == "9-12~High School (grade 9-12)"))]
df.shape
(6581506, 20)
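When removing contradictory rows, it can help to build the boolean mask separately and count how many rows it flags before dropping anything. Here is a self-contained sketch on an invented toy frame that mimics the AGE_BIN / EDUCATION_LEVEL contradiction:

```python
import pandas as pd

# toy frame mimicking the AGE_BIN / EDUCATION_LEVEL contradiction (values invented)
sample = pd.DataFrame({
    "AGE_BIN": ["0-9", "0-9", "10-19", "20-29"],
    "EDUCATION_LEVEL": [
        "9-12~High School (grade 9-12)",
        "99~Unknown",
        "9-12~High School (grade 9-12)",
        "UNDERGRAD~Some College",
    ],
})

# build the mask separately so we can inspect it before filtering
mask = (sample["AGE_BIN"] == "0-9") & (
    sample["EDUCATION_LEVEL"] == "9-12~High School (grade 9-12)"
)
removed = mask.sum()      # number of contradictory rows
cleaned = sample[~mask]   # keep everything else
```

Checking `mask.sum()` first is a cheap sanity check: a surprisingly large count often signals a mistake in the condition rather than genuinely bad data.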
4.6.7. Advanced Transformations#
Sometimes when doing exploratory data analysis, the format in which a table is stored does not give a complete picture. In this dataset, we found earlier that several rows share the same ID, which means there are multiple entries for some people; in other words, they used the services captured in the dataset multiple times.
To get a better view of the data, we can build a new dataframe with one row per ID, adding columns for the number of services used, the number of distinct services used, and the number of months in which services were used. Because the dataset is large, this may take a while.
In general, to find useful transformations for a dataset, you need to think about the direction you want to take the analysis, which can come from reasoning about the data logically. A better approach, however, is to consult domain knowledge: look to sources related to the data (research articles, the people who collected it, subject-matter experts) and transform the data as they advise, or ask them where to look for ideas.
Here, we are going to perform two aggregations. These group the entries of the dataset by ID, and by (ID, month) pairs, and compute summary statistics, including the number of services used, the number of months in which services were used, and the range of dates over which services were used.
def get_client_attributes(services):
client_attributes = services.groupby(['MCI_UNIQ_ID']).agg(
num_services = ('service', 'count'),
distinct_services = ('service', 'nunique'),
first_date = ('DATE_OF_EVENT', 'min'),
last_date = ('DATE_OF_EVENT', 'max'),
num_months = ('month', 'count'),
distinct_month = ('month', 'nunique')
).reset_index()
return client_attributes
client_attributes = get_client_attributes(df)
client_attributes.head()
MCI_UNIQ_ID | num_services | distinct_services | first_date | last_date | num_months | distinct_month | |
---|---|---|---|---|---|---|---|
0 | 1 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 |
1 | 2 | 1 | 1 | 02/28/2021 | 02/28/2021 | 1 | 1 |
2 | 3 | 11 | 1 | 02/28/2021 | 12/31/2021 | 11 | 11 |
3 | 5 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 |
4 | 6 | 24 | 3 | 01/31/2021 | 12/31/2021 | 24 | 12 |
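One caveat: the `min` and `max` aggregations on `DATE_OF_EVENT` compare strings, which happens to work here because every date shares the year 2021 and the MM/DD/YYYY format zero-pads months and days. Across multiple years, string comparison would order dates incorrectly. A safer sketch (assuming the MM/DD/YYYY format shown above) parses the column first:

```python
import pandas as pd

dates = pd.Series(["12/31/2021", "01/31/2022"])

# string comparison picks "01/31/2022", which is chronologically the LATER date
string_min = dates.min()

# parsing to datetime gives the true earliest date
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
true_min = parsed.min()
```

On the full dataset you would convert `DATE_OF_EVENT` with `pd.to_datetime` once, before the `groupby`, so that `first_date` and `last_date` are real timestamps.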
Similarly, we perform the client-month attribute aggregations.
def get_client_date_attributes(services):
client_date_attributes = services.groupby(['MCI_UNIQ_ID', 'month']).agg(
num_services = ('service', 'count'),
distinct_services = ('service', 'nunique')
).reset_index()
return client_date_attributes
client_date_attributes = get_client_date_attributes(df)
client_date_attributes.head()
MCI_UNIQ_ID | month | num_services | distinct_services | |
---|---|---|---|---|
0 | 1 | 01/2021 | 1 | 1 |
1 | 1 | 02/2021 | 1 | 1 |
2 | 1 | 03/2021 | 1 | 1 |
3 | 1 | 04/2021 | 1 | 1 |
4 | 1 | 05/2021 | 1 | 1 |
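To see what this (ID, month) aggregation produces without loading the full file, here is a self-contained toy version; the IDs, months, and service names are invented, but the `groupby` logic matches the function above:

```python
import pandas as pd

# toy services table: client 1 used two services in 01/2021, one in 02/2021
services = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 1, 2],
    "month": ["01/2021", "01/2021", "02/2021", "01/2021"],
    "service": ["A", "B", "A", "A"],
})

# one row per (client, month), with counts of total and distinct services
per_client_month = services.groupby(["MCI_UNIQ_ID", "month"]).agg(
    num_services=("service", "count"),
    distinct_services=("service", "nunique"),
).reset_index()
```

The toy output has three rows, one per (client, month) pair, with client 1's 01/2021 row showing two services used, both distinct.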
Now, we can merge information from the original dataset into the new dataframes so that some of the details about each person carry over. Specifically, we want to carry over age, gender, race, ethnicity, education level, and marital status.
#start with client_attributes
#we want to add columns of age, gender, race, ethnicity, education level, and marital status
#from the original dataset, making sure IDs match up and that there is only one occurrence of each ID.
client_attributes = pd.merge(client_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove duplicate rows
client_attributes = client_attributes.drop_duplicates(subset=['MCI_UNIQ_ID'])
client_attributes.head()
MCI_UNIQ_ID | num_services | distinct_services | first_date | last_date | num_months | distinct_month | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
12 | 2 | 1 | 1 | 02/28/2021 | 02/28/2021 | 1 | 1 | 70 | 2~Female | 1~White | 2~Not Hispanic/Latinx | 2~Married | 99~Unknown |
13 | 3 | 11 | 1 | 02/28/2021 | 12/31/2021 | 11 | 11 | 29 | 2~Female | 1~White | 2~Not Hispanic/Latinx | 2~Married | 9-12~High School (grade 9-12) |
24 | 5 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 | 40 | 1~Male | 2~Black/African American | 2~Not Hispanic/Latinx | 2~Married | UNDERGRAD~Some College |
36 | 6 | 24 | 3 | 01/31/2021 | 12/31/2021 | 24 | 12 | 71 | 2~Female | 2~Black/African American | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
#we'll do the same with client_date_attributes
client_date_attributes = pd.merge(client_date_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove rows that have duplicate ID-month pairs
client_date_attributes = client_date_attributes.drop_duplicates(subset=['MCI_UNIQ_ID', 'month'])
client_date_attributes.head()
MCI_UNIQ_ID | month | num_services | distinct_services | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 01/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
12 | 1 | 02/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
24 | 1 | 03/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
36 | 1 | 04/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
48 | 1 | 05/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
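Merging the full `df` and then dropping duplicates works, but it copies every client's demographic rows many times before discarding them. A lighter alternative, sketched here on invented toy frames, is to deduplicate the demographic columns before the merge:

```python
import pandas as pd

# toy aggregated table and toy raw records (values invented)
attrs = pd.DataFrame({"MCI_UNIQ_ID": [1, 2], "num_services": [12, 1]})
records = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 2],
    "GENDER": ["1~Male", "1~Male", "2~Female"],
})

# keep one demographic row per client *before* merging
demo = records[["MCI_UNIQ_ID", "GENDER"]].drop_duplicates(subset=["MCI_UNIQ_ID"])
merged = pd.merge(attrs, demo, on="MCI_UNIQ_ID", how="left")
```

The result matches the merge-then-deduplicate approach whenever each client's demographic values are consistent across their rows, while avoiding a large intermediate dataframe.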
4.6.8. Saving Data Transformations#
If you want to close your work and come back to it without rerunning all of the code above, or if you want to use the same data in a new notebook, an easy approach is to save the data to intermediate files and reload them later. We'll save all of our datasets to files that we will use for plotting in the next section.
#we will save df to a new csv file
df.to_csv('dhs_cleaned.csv', index=False)
client_attributes.to_csv('dhs_client_attributes.csv', index=False)
client_date_attributes.to_csv('dhs_client_date_attributes.csv', index=False)
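Passing `index=False` matters here: without it, `read_csv` would bring the saved index back as an unnamed extra column. A quick round trip through a temporary file (a self-contained sketch with invented values) confirms the data survives intact:

```python
import os
import tempfile

import pandas as pd

tbl = pd.DataFrame({"MCI_UNIQ_ID": [1, 2], "num_services": [12, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dhs_client_attributes.csv")
    tbl.to_csv(path, index=False)   # index=False keeps the old index out of the file
    reloaded = pd.read_csv(path)

# the reloaded frame matches the original exactly
round_trip_ok = tbl.equals(reloaded)
```

Note that CSV round trips can still change dtypes for some columns (dates reload as strings, for example), so it is worth calling `df.dtypes` after reloading.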
4.6.9. EDA Project: Part 1#
Answer the following questions:
Which MCI_UNIQ_ID has the highest number of records among all IDs?
What is the average number of interactions for clients in the file?
For a client of your choice, in how many distinct months did they use a service?
What other general observations can you make from the data?
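These questions map naturally onto short pandas expressions. Here is a sketch on a small invented frame so you can verify the logic before running it on the real `df` (the IDs and months are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 1, 2, 3, 3],
    "month": ["01/2021", "02/2021", "02/2021", "01/2021", "01/2021", "03/2021"],
})

# ID with the most records
top_id = toy["MCI_UNIQ_ID"].value_counts().idxmax()

# average number of records per client
avg_records = toy.groupby("MCI_UNIQ_ID").size().mean()

# distinct months of use for one chosen client (here, client 1)
months_for_client = toy.loc[toy["MCI_UNIQ_ID"] == 1, "month"].nunique()
```

On the toy frame, client 1 has the most records, clients average two records each, and client 1 appears in two distinct months.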