4.6. Data Investigation II#

Oftentimes when we explore data for the first time, we have to spend time looking at what is missing in the data or what needs to be changed. Data files are rarely, if ever, ready to use the moment you receive them. Additionally, it is often unwise to use data for predictive modeling or statistical analysis without knowing some of the context for the data and looking at its contents.

Python has a helpful package called pandas that is purpose-built for inspecting and manipulating data, and much of what we do in this section will rely on it.

4.6.1. Setup#

The first thing we typically do when dealing with data in Python is to import the pandas package and one of its dependencies, numpy.

import pandas as pd
import numpy as np

Because the provided data file is large, we store it in a different directory than the one where this notebook lives. We can change the working directory to that location using the os package.

import os
os.chdir('<your-directory>')
df = pd.read_csv('dhs_service_records_synthesized_final.csv')
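
If loading the full file is slow on your machine, one option is to read only a subset of rows first for a quick look; pd.read_csv's nrows argument does this. The sketch below is optional (the df_preview name and the 100,000-row cutoff are arbitrary choices), and the rest of this section assumes the full df loaded above.

#optional: peek at a sample of the file without loading all of it
df_preview = pd.read_csv('dhs_service_records_synthesized_final.csv', nrows=100000)
df_preview.shape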

4.6.2. Simple Data Inspection#

Before doing anything with the data, it is good practice to take a look at a few samples from the dataset. Here, we look at the first five rows and the last seven rows, the head and tail of the data. Both are easy to call, and it is good practice to call them often so that you can see how a dataset evolves as you work with it.

df.head()
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG
0 SYNTHETIC DATA 1626 2021 01/31/2021 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 9-12~High School (grade 9-12) NaN
1 SYNTHETIC DATA 1626 2021 02/28/2021 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 9-12~High School (grade 9-12) NaN
2 SYNTHETIC DATA 1626 2021 03/31/2021 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 9-12~High School (grade 9-12) NaN
3 SYNTHETIC DATA 1626 2021 04/30/2021 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 9-12~High School (grade 9-12) NaN
4 SYNTHETIC DATA 1626 2021 05/31/2021 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 9-12~High School (grade 9-12) NaN
df.tail(7)
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG
7116127 SYNTHETIC DATA 530696 2021 06/30/2021 Suicides NaN 21 NaN NaN 2~Female NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 99~Unknown NaN
7116128 SYNTHETIC DATA 530765 2021 06/30/2021 Suicides NaN 15 NaN NaN 2~Female NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 99~Unknown NaN
7116129 SYNTHETIC DATA 530832 2021 06/30/2021 Suicides NaN 17 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 99~Unknown 99~Unknown NaN
7116130 SYNTHETIC DATA 532121 2021 04/30/2021 Suicides NaN 22 NaN NaN 2~Female NaN NaN NaN 1~White 99~Unknown NaN NaN 1~Single-Never Married 9-12~High School (grade 9-12) NaN
7116131 SYNTHETIC DATA 533991 2021 11/30/2021 Suicides NaN 16 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 1~Single-Never Married 9-12~High School (grade 9-12) NaN
7116132 SYNTHETIC DATA 534127 2021 07/31/2021 Suicides NaN 19 NaN NaN 1~Male NaN NaN NaN 1~White 2~Not Hispanic/Latinx NaN NaN 4~Widowed GRAD~Graduate Degree NaN
7116133 SYNTHETIC DATA 535156 2021 06/30/2021 Suicides NaN 24 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 1~Single-Never Married 9-12~High School (grade 9-12) NaN

4.6.3. Dataframe Attributes#

You can get a lot of helpful information from the attributes of the dataframe; specifically, we are going to look at the dataframe's dimensions, column names, column data types, and the ranges of column values. Pandas dataframes come with prebuilt attributes for these, and they help us understand our data before we even start modifying it.

df.shape
(7116134, 20)
df.columns
Index(['synthetic_data', 'MCI_UNIQ_ID', 'CALDR_YR', 'DATE_OF_EVENT', 'service',
       'GEO_AREA', 'age', 'DOB', 'DOD', 'GENDER', 'GENDER_IDENTITY',
       'SEX_ORIENT', 'LEGAL_SEX', 'RACE', 'ETHNICITY', 'LIVING_ARRANGEMENT',
       'EMPLOYMENT_STATUS', 'MARITAL_STATUS', 'EDUCATION_LEVEL',
       'VETERAN_FLAG'],
      dtype='object')

Pandas dataframes also have a helpful method that summarizes much of this at once, which you can call with df.info().

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7116134 entries, 0 to 7116133
Data columns (total 20 columns):
 #   Column              Dtype  
---  ------              -----  
 0   synthetic_data      object 
 1   MCI_UNIQ_ID         int64  
 2   CALDR_YR            int64  
 3   DATE_OF_EVENT       object 
 4   service             object 
 5   GEO_AREA            float64
 6   age                 int64  
 7   DOB                 float64
 8   DOD                 float64
 9   GENDER              object 
 10  GENDER_IDENTITY     float64
 11  SEX_ORIENT          float64
 12  LEGAL_SEX           float64
 13  RACE                object 
 14  ETHNICITY           object 
 15  LIVING_ARRANGEMENT  float64
 16  EMPLOYMENT_STATUS   float64
 17  MARITAL_STATUS      object 
 18  EDUCATION_LEVEL     object 
 19  VETERAN_FLAG        float64
dtypes: float64(9), int64(3), object(8)
memory usage: 1.1+ GB

Another way to check the data types, if you just want that, is to use the dtypes attribute of the dataframe.

df.dtypes
synthetic_data         object
MCI_UNIQ_ID             int64
CALDR_YR                int64
DATE_OF_EVENT          object
service                object
GEO_AREA              float64
age                     int64
DOB                   float64
DOD                   float64
GENDER                 object
GENDER_IDENTITY       float64
SEX_ORIENT            float64
LEGAL_SEX             float64
RACE                   object
ETHNICITY              object
LIVING_ARRANGEMENT    float64
EMPLOYMENT_STATUS     float64
MARITAL_STATUS         object
EDUCATION_LEVEL        object
VETERAN_FLAG          float64
dtype: object

We can also find the number of unique entries in each column. This can be important, since in some cases we may not want duplicates. In the case of these data from Allegheny County, unique IDs matter because each one corresponds to a single person, and we can track that person across their use of many different services.

df.nunique()
synthetic_data             1
MCI_UNIQ_ID           533799
CALDR_YR                   1
DATE_OF_EVENT             12
service                   22
GEO_AREA                   0
age                      112
DOB                        0
DOD                        0
GENDER                     3
GENDER_IDENTITY            0
SEX_ORIENT                 0
LEGAL_SEX                  0
RACE                       7
ETHNICITY                  3
LIVING_ARRANGEMENT         0
EMPLOYMENT_STATUS          0
MARITAL_STATUS             7
EDUCATION_LEVEL            8
VETERAN_FLAG               0
dtype: int64

The synthetic_data column flags every row of this file as synthetic. We can confirm that it contains only a single value, "SYNTHETIC DATA", by grouping on it and by counting its values.

#the synthetic_data column should contain only one value; group on it to confirm
#(the counts also show how many non-null entries each of the other columns has)
df.groupby("synthetic_data").count()
MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG
synthetic_data
SYNTHETIC DATA 7116134 7116134 7116134 7116134 0 7116134 0 0 7116134 0 0 0 7116134 7116134 0 0 7116134 7116134 0
df["synthetic_data"].value_counts()
SYNTHETIC DATA    7116134
Name: synthetic_data, dtype: int64

One way to get an overview of the data values is to use another helpful method, df.describe(). This provides several summary statistics for each of the numeric columns in the set.

df.describe()
MCI_UNIQ_ID CALDR_YR GEO_AREA age DOB DOD GENDER_IDENTITY SEX_ORIENT LEGAL_SEX LIVING_ARRANGEMENT EMPLOYMENT_STATUS VETERAN_FLAG
count 7.116134e+06 7116134.0 0.0 7.116134e+06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mean 2.674853e+05 2021.0 NaN 3.267702e+01 NaN NaN NaN NaN NaN NaN NaN NaN
std 1.544397e+05 0.0 NaN 2.258095e+01 NaN NaN NaN NaN NaN NaN NaN NaN
min 1.000000e+00 2021.0 NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN NaN NaN
25% 1.337730e+05 2021.0 NaN 1.300000e+01 NaN NaN NaN NaN NaN NaN NaN NaN
50% 2.675410e+05 2021.0 NaN 3.000000e+01 NaN NaN NaN NaN NaN NaN NaN NaN
75% 4.010820e+05 2021.0 NaN 4.900000e+01 NaN NaN NaN NaN NaN NaN NaN NaN
max 5.356080e+05 2021.0 NaN 1.210000e+02 NaN NaN NaN NaN NaN NaN NaN NaN

In some cases these are not helpful: for the ID column, the mean ID value means nothing, and the same goes for calendar year, especially since all the records are from the same year. Also note that the mean, standard deviation, and quartiles show up as NaN for several columns. That happens because every value in those columns is missing (their count is 0); describe() otherwise ignores NaN values, so a column with even a few real values would still get statistics. Either way, we need to find out where these NaN values are.
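
To see this behavior in isolation, here is a small toy dataframe, constructed purely for illustration (it is not part of the DHS data): describe() still reports statistics for a column that has some missing values, while a column that is entirely NaN gets a count of 0 and NaN for everything else.

#toy example: describe() skips NaN values, but an all-NaN column has nothing to summarize
demo = pd.DataFrame({'partly_missing': [1.0, 2.0, np.nan, 4.0],
                     'all_missing': [np.nan, np.nan, np.nan, np.nan]})
demo.describe()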

4.6.4. Finding NaN values#

To get the number of NaN values (also written NA or N/A) in each column, we can use the isna() method in combination with the sum() method.

df.isna().sum()
synthetic_data              0
MCI_UNIQ_ID                 0
CALDR_YR                    0
DATE_OF_EVENT               0
service                     0
GEO_AREA              7116134
age                         0
DOB                   7116134
DOD                   7116134
GENDER                      0
GENDER_IDENTITY       7116134
SEX_ORIENT            7116134
LEGAL_SEX             7116134
RACE                        0
ETHNICITY                   0
LIVING_ARRANGEMENT    7116134
EMPLOYMENT_STATUS     7116134
MARITAL_STATUS              0
EDUCATION_LEVEL             0
VETERAN_FLAG          7116134
dtype: int64

Note that df.isnull() is simply an alias for df.isna(), so the code cell below produces exactly the same result.

df.isnull().sum()
synthetic_data              0
MCI_UNIQ_ID                 0
CALDR_YR                    0
DATE_OF_EVENT               0
service                     0
GEO_AREA              7116134
age                         0
DOB                   7116134
DOD                   7116134
GENDER                      0
GENDER_IDENTITY       7116134
SEX_ORIENT            7116134
LEGAL_SEX             7116134
RACE                        0
ETHNICITY                   0
LIVING_ARRANGEMENT    7116134
EMPLOYMENT_STATUS     7116134
MARITAL_STATUS              0
EDUCATION_LEVEL             0
VETERAN_FLAG          7116134
dtype: int64

We can see the columns [‘GEO_AREA’, ‘DOB’, ‘DOD’, ‘GENDER_IDENTITY’, ‘SEX_ORIENT’, ‘LEGAL_SEX’, ‘LIVING_ARRANGEMENT’, ‘EMPLOYMENT_STATUS’, ‘VETERAN_FLAG’] have only NaN values.
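
Rather than reading that list off by eye, we can also pull the all-NA columns out programmatically; a minimal sketch (the all_na_cols name is just for illustration):

#columns in which every single value is missing
all_na_cols = df.columns[df.isna().all()]
list(all_na_cols)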

We can also check which values are missing within a single row. For example, here are the NA flags for the last row of the dataframe:

df.isna().tail(1)
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG
7116133 False False False False False True False True True False True True True False False True True False False True

For comparison, the next cell shows that last row itself; the True values above correspond exactly to the all-NA columns we identified. Let's now clean up the dataframe: drop any rows made up entirely of NAs and drop the columns that have only NAs.

df.tail(1)
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG
7116133 SYNTHETIC DATA 535156 2021 06/30/2021 Suicides NaN 24 NaN NaN 1~Male NaN NaN NaN 1~White 99~Unknown NaN NaN 1~Single-Never Married 9-12~High School (grade 9-12) NaN

4.6.5. Removing NA Columns and Rows#

To drop all rows that contain only NaN values, we can use the dropna() method with the argument how='all'.

df.dropna(how='all', inplace=True)
df.shape
(7116134, 20)

With this method, the default value of the axis parameter targets rows, so the call above removed rows consisting only of NA values, of which there are none in the dataset (the shape is unchanged). To remove columns instead, we need to specify the axis.

#we want to remove all the columns that only have NA values
df.dropna(axis=1, how='all', inplace=True)
df.shape
(7116134, 11)

Now what if we wanted to drop every row that has even a single NA value? If we remove the how='all' argument, dropna() falls back to its default of how='any' and removes rows with at least one NA value.

# df.dropna(inplace=True) # will drop rows with any NA.

In addition to looking at the head and the tail, at this point we are also going to look at three random rows, since it is important to inspect data beyond just the first and last few entries.

# set a seed for the random number generator so that we get the same rows when the cell is re-run
np.random.seed(17)
np.random.randint(2, df.shape[0],3)
array([ 589043, 5539985, 2078088])
np.random.randint(3, df.shape[0],3)
array([ 491929,  828732, 6417139])
# set a seed for the random number generator so that we get the same rows when the cell is re-run
np.random.seed(17)
df.iloc[np.random.randint(0, df.shape[0],3)]
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL
589041 SYNTHETIC DATA 364757 2021 10/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539983 SYNTHETIC DATA 449747 2021 09/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
2078086 SYNTHETIC DATA 109525 2021 09/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
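
As an aside, pandas also has a built-in sample() method that accomplishes this more directly; a quick sketch (the rows it returns will not necessarily match the ones above, since it draws its own sample):

#draw three random rows; random_state makes the draw reproducible
df.sample(3, random_state=17)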

We will also take a look at the records for a specific MCI_UNIQ_ID. Let's look at client 364757.

df[df['MCI_UNIQ_ID'] == 364757]
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL
589034 SYNTHETIC DATA 364757 2021 02/28/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589035 SYNTHETIC DATA 364757 2021 03/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589036 SYNTHETIC DATA 364757 2021 05/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589037 SYNTHETIC DATA 364757 2021 06/30/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589038 SYNTHETIC DATA 364757 2021 07/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589039 SYNTHETIC DATA 364757 2021 08/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589040 SYNTHETIC DATA 364757 2021 09/30/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589041 SYNTHETIC DATA 364757 2021 10/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589042 SYNTHETIC DATA 364757 2021 11/30/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
589043 SYNTHETIC DATA 364757 2021 12/31/2021 Individuals_Receiving_DHS_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675581 SYNTHETIC DATA 364757 2021 01/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675582 SYNTHETIC DATA 364757 2021 02/28/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675583 SYNTHETIC DATA 364757 2021 03/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675584 SYNTHETIC DATA 364757 2021 04/30/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675585 SYNTHETIC DATA 364757 2021 05/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675586 SYNTHETIC DATA 364757 2021 06/30/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675587 SYNTHETIC DATA 364757 2021 07/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675588 SYNTHETIC DATA 364757 2021 08/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675589 SYNTHETIC DATA 364757 2021 09/30/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675590 SYNTHETIC DATA 364757 2021 10/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675591 SYNTHETIC DATA 364757 2021 11/30/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
4675592 SYNTHETIC DATA 364757 2021 12/31/2021 Individuals_Receiving_Income_Supports 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710438 SYNTHETIC DATA 364757 2021 01/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710439 SYNTHETIC DATA 364757 2021 03/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710440 SYNTHETIC DATA 364757 2021 04/30/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710441 SYNTHETIC DATA 364757 2021 05/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710442 SYNTHETIC DATA 364757 2021 06/30/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710443 SYNTHETIC DATA 364757 2021 07/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710444 SYNTHETIC DATA 364757 2021 08/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710445 SYNTHETIC DATA 364757 2021 10/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710446 SYNTHETIC DATA 364757 2021 11/30/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
6710447 SYNTHETIC DATA 364757 2021 12/31/2021 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown

And similarly for client 449747.

df[df['MCI_UNIQ_ID'] == 449747]
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL
5539976 SYNTHETIC DATA 449747 2021 01/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539977 SYNTHETIC DATA 449747 2021 03/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539978 SYNTHETIC DATA 449747 2021 04/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539979 SYNTHETIC DATA 449747 2021 05/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539980 SYNTHETIC DATA 449747 2021 06/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539981 SYNTHETIC DATA 449747 2021 07/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539982 SYNTHETIC DATA 449747 2021 08/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539983 SYNTHETIC DATA 449747 2021 09/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539984 SYNTHETIC DATA 449747 2021 10/31/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
5539985 SYNTHETIC DATA 449747 2021 11/30/2021 Individuals_Receiving_Income_Supports 51 1~Male 1~White 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown

4.6.6. Basic Transformations#

Oftentimes when we encounter data, we find that it is in a format that is a bit clunky to use. Part of EDA is making datasets friendlier to work with. Here we will go over four transformations: substitution, generation, outlier removal, and illogical value removal.

Substitution#

Several columns store their entries as strings, like "Individuals_Receiving_DHS_Services" in the service column, and we may want to alter these. Plain strings can be clunky to work with, and plots generally read better when the values are treated as categories. Converting the columns to pandas' category data type also gives each value a numerical category code, which lets us apply some mathematical tests to them.

#we want to take the columns for gender, race, ethnicity, marital status, education level, and service
#and turn these columns into the category data type.

#first we add a new column to the dataframe that has this dtype
#but named with a _cat suffix.
df["GENDER_cat"] = df["GENDER"].astype('category')
df["RACE_cat"] = df["RACE"].astype('category')
df["ETHNICITY_cat"] = df["ETHNICITY"].astype('category')
df["MARITAL_STATUS_cat"] = df["MARITAL_STATUS"].astype('category')
df["EDUCATION_LEVEL_cat"] = df["EDUCATION_LEVEL"].astype('category')
df["service_cat"] = df["service"].astype('category')

df.head()
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat service_cat
0 SYNTHETIC DATA 1626 2021 01/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma...
1 SYNTHETIC DATA 1626 2021 02/28/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma...
2 SYNTHETIC DATA 1626 2021 03/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma...
3 SYNTHETIC DATA 1626 2021 04/30/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma...
4 SYNTHETIC DATA 1626 2021 05/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma...
#also convert the unique ID column to the category type
df["MCI_UNIQ_ID_cat"] = df["MCI_UNIQ_ID"].astype('category')

Generation#

In other cases, we may want to generate a new variable. This can be as simple as combining existing variables, but for this dataset we will also convert the "age" column to a categorical variable, by grouping ages into bins of 10 years.

#first we will define a helper function that takes in an int
#and returns the corresponding age bin value (grouped by 10), so a 5 gets returned as "0-9"
def age_bin(age):
    if age < 10:
        return "0-9"
    elif age < 20:
        return "10-19"
    elif age < 30:
        return "20-29"
    elif age < 40:
        return "30-39"
    elif age < 50:
        return "40-49"
    elif age < 60:
        return "50-59"
    elif age < 70:
        return "60-69"
    elif age < 80:
        return "70-79"
    elif age < 90:
        return "80-89"
    else:
        return "90+"
    
#a chain of comparisons like this gets the job done; pandas also provides pd.cut
#for this kind of binning, which we sketch after applying this function below.

#we will now apply this function to the age column and create a new column called AGE_BIN
df["AGE_BIN"] = df["age"].apply(age_bin)
df.head()
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat service_cat MCI_UNIQ_ID_cat AGE_BIN
0 SYNTHETIC DATA 1626 2021 01/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9
1 SYNTHETIC DATA 1626 2021 02/28/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9
2 SYNTHETIC DATA 1626 2021 03/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9
3 SYNTHETIC DATA 1626 2021 04/30/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9
4 SYNTHETIC DATA 1626 2021 05/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9
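
As the comment in the cell above hints, pandas also provides pd.cut for exactly this kind of binning. Here is a sketch of an equivalent approach; the age_bin_cut variable is a throwaway used only for illustration, and 200 is just a catch-all upper edge.

#bin edges 0, 10, ..., 90 plus a catch-all upper edge, with labels that mirror age_bin()
bins = list(range(0, 100, 10)) + [200]
labels = [f"{b}-{b + 9}" for b in range(0, 90, 10)] + ["90+"]
age_bin_cut = pd.cut(df["age"], bins=bins, labels=labels, right=False)
age_bin_cut.head()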

Another type of generation we can do is to split the date of event so that we get the month in which each service occurred. We can do this by combining the first two characters of the DATE_OF_EVENT column (the month) with its last five characters (the slash and the year), giving values like "01/2021".

#make a new month column
#this is the first two characters of the date column and the last 5 characters of the column
df["month"] = df["DATE_OF_EVENT"].str[:2] + df["DATE_OF_EVENT"].str[-5:]
df.head()
synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat service_cat MCI_UNIQ_ID_cat AGE_BIN month
0 SYNTHETIC DATA 1626 2021 01/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 01/2021
1 SYNTHETIC DATA 1626 2021 02/28/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 02/2021
2 SYNTHETIC DATA 1626 2021 03/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 03/2021
3 SYNTHETIC DATA 1626 2021 04/30/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 04/2021
4 SYNTHETIC DATA 1626 2021 05/31/2021 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 05/2021
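
Since DATE_OF_EVENT is a date string, another option is to parse it as an actual date and take the monthly period from that instead of slicing characters. A sketch, shown here without storing the result (we will keep using the month column created above):

#parse the date strings and pull out a year-month period
dates = pd.to_datetime(df["DATE_OF_EVENT"], format="%m/%d/%Y")
dates.dt.to_period("M").head()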

Outlier Removal#

Oftentimes data will have outliers: data points that lie very far from the majority of the data. With a dataset made up entirely of categorical variables, outlier removal is usually not feasible, but it is common with numeric features. Let's use the age column and remove all the rows with age 105 or greater.

#want to remove all entries from the dataset where the age entry is 105 or more
df = df[df["age"] < 105]
df.shape
(7115529, 20)
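
The cutoff of 105 is a judgment call. A more data-driven alternative is to derive the threshold from the distribution itself; the sketch below only flags the rows rather than dropping them, and the 99.9th percentile is an arbitrary choice made purely for illustration.

#flag ages beyond the 99.9th percentile instead of using a fixed threshold
age_cutoff = df["age"].quantile(0.999)
flagged = df[df["age"] > age_cutoff]
print(age_cutoff, len(flagged))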

Illogical Value Removal#

This portion of data transformations is slightly harder, as we have to look at multiple features at the same time. However, it is still important. Some records in the dataset are for children as young as age 5, yet list them as being in high school.

#we want to remove all the values where the age is 0-9
#AND where they are listed as in high school as "9-12~High School (grade 9-12)"
df = df[~((df["AGE_BIN"] == "0-9") & (df["EDUCATION_LEVEL"] == "9-12~High School (grade 9-12)"))]
df.shape
(6581506, 20)

4.6.7. Advanced Transformations#

Sometimes when doing exploratory data analysis, the format in which the table is stored does not give a complete picture. In the case of this dataset, we found earlier that there were several rows with the same ID, which means there are multiple entries for some people. This indicates that they used the services covered by the dataset multiple times.

To get a better picture of these patterns, we can make a new dataframe with one row for each ID, to which we will add columns for the number of services used, the number of distinct services used, and the number of months in which services were used. Because the dataset is big, this may take a while.

In general, to find useful transformations for the datasets you work with, you will need to think about the direction you want to take the data, which can start with simply reasoning about the data logically. A better approach, however, is to consult domain knowledge: look to sources related to the data (research articles, the people who collected it, people knowledgeable about the subject) and transform the data as they advise, or at least get pointers on where to look for ideas.

Here, we are going to perform two aggregation transformations. They group the entries of the dataset by ID, and by the (ID, month) pair, and compute summary statistics including the number of services used, the number of months in which services were used, and the range of dates over which they were used.

def get_client_attributes(services):
    client_attributes = services.groupby(['MCI_UNIQ_ID']).agg(
        num_services = ('service', 'count'),
        distinct_services = ('service', 'nunique'), 
        first_date = ('DATE_OF_EVENT', 'min'), 
        last_date = ('DATE_OF_EVENT', 'max'), 
        num_months = ('month', 'count'), 
        distinct_month = ('month', 'nunique')
    ).reset_index()

    return client_attributes

client_attributes = get_client_attributes(df)
client_attributes.head()
MCI_UNIQ_ID num_services distinct_services first_date last_date num_months distinct_month
0 1 12 1 01/31/2021 12/31/2021 12 12
1 2 1 1 02/28/2021 02/28/2021 1 1
2 3 11 1 02/28/2021 12/31/2021 11 11
3 5 12 1 01/31/2021 12/31/2021 12 12
4 6 24 3 01/31/2021 12/31/2021 24 12

Similarly, we perform the client-month attribute aggregations.

def get_client_date_attributes(services):
    client_date_attributes = services.groupby(['MCI_UNIQ_ID', 'month']).agg(
        num_services = ('service', 'count'),
        distinct_services = ('service', 'nunique')
    ).reset_index()

    return client_date_attributes

client_date_attributes = get_client_date_attributes(df)
client_date_attributes.head()
MCI_UNIQ_ID month num_services distinct_services
0 1 01/2021 1 1
1 1 02/2021 1 1
2 1 03/2021 1 1
3 1 04/2021 1 1
4 1 05/2021 1 1

Now we can merge information from the original dataframe into the new dataframes we've created, so that some details about each person carry over. Specifically, we want to carry over age, gender, race, ethnicity, marital status, and education level.

#start with client_attributes
#we want to add columns of age, gender, race, ethnicity, education level, and marital status
#from the original dataset, making sure IDs match up and that there is only one occurrence of each ID.
client_attributes = pd.merge(client_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove duplicate rows
client_attributes = client_attributes.drop_duplicates(subset=['MCI_UNIQ_ID'])
client_attributes.head()
MCI_UNIQ_ID num_services distinct_services first_date last_date num_months distinct_month age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL
0 1 12 1 01/31/2021 12/31/2021 12 12 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
12 2 1 1 02/28/2021 02/28/2021 1 1 70 2~Female 1~White 2~Not Hispanic/Latinx 2~Married 99~Unknown
13 3 11 1 02/28/2021 12/31/2021 11 11 29 2~Female 1~White 2~Not Hispanic/Latinx 2~Married 9-12~High School (grade 9-12)
24 5 12 1 01/31/2021 12/31/2021 12 12 40 1~Male 2~Black/African American 2~Not Hispanic/Latinx 2~Married UNDERGRAD~Some College
36 6 24 3 01/31/2021 12/31/2021 24 12 71 2~Female 2~Black/African American 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown
#we'll do the same with client_date_attributes
client_date_attributes = pd.merge(client_date_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove rows that have duplicate ID-month pairs
client_date_attributes = client_date_attributes.drop_duplicates(subset=['MCI_UNIQ_ID', 'month'])
client_date_attributes.head()
MCI_UNIQ_ID month num_services distinct_services age GENDER RACE ETHNICITY MARITAL_STATUS EDUCATION_LEVEL
0 1 01/2021 1 1 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
12 1 02/2021 1 1 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
24 1 03/2021 1 1 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
36 1 04/2021 1 1 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
48 1 05/2021 1 1 13 1~Male 99~Unknown 2~Not Hispanic/Latinx 99~Unknown GED~High School Diploma/GED Completed
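
Because the merge temporarily multiplies rows before drop_duplicates pares them back down, it is worth a quick sanity check that each ID, and each (ID, month) pair, now appears exactly once. A minimal sketch:

#the first check should print True (IDs are unique) and the second False (no duplicate ID-month pairs)
print(client_attributes["MCI_UNIQ_ID"].is_unique)
print(client_date_attributes.duplicated(subset=["MCI_UNIQ_ID", "month"]).any())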

4.6.8. Saving Data Transformations#

If you want to close your work and come back to it without rerunning all of the code above, or if you want to use the same data in a new notebook, an easy approach is to save the data to intermediate files and reload them later. We will save all of our datasets to files that we will then plot in the next section.

#we will save df to a new csv file
df.to_csv('dhs_cleaned.csv', index=False)
client_attributes.to_csv('dhs_client_attributes.csv', index=False)
client_date_attributes.to_csv('dhs_client_date_attributes.csv', index=False)
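
To pick the work back up later, the saved files can simply be read back in with pd.read_csv. Keep in mind that CSV does not preserve the category dtypes we created, so those conversions would need to be re-applied after reloading; a sketch:

#reload the saved files in this notebook or a new one
df = pd.read_csv('dhs_cleaned.csv')
client_attributes = pd.read_csv('dhs_client_attributes.csv')
client_date_attributes = pd.read_csv('dhs_client_date_attributes.csv')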

4.6.9. EDA Project: Part 1#

Answer the following questions:

  • Which MCI_UNIQ_ID has the highest number of records among all IDs?

  • What is the average number of interactions for clients in the file?

  • How many months did a particular client use the service?

  • What other general observations can you make from the data?