4.6. Data Investigation II#
Often, when we explore data for the first time, we have to spend time looking at what is missing or what needs to be changed; data files are rarely, if ever, ready for immediate use the moment you receive them. Additionally, it is unwise to use data for predictive modeling or statistical analysis without knowing some of its context and examining its contents.
Python has a helpful package called pandas
which is purpose-built for looking at and manipulating data, and a lot of what we will work on in this section will involve using it.
4.6.1. Setup#
The first thing we typically do when dealing with data in Python is to import the pandas package and one of its dependencies, numpy.
import pandas as pd
import numpy as np
Due to the large size of the provided data file, we store it in a different directory than the one where this notebook lives. We can set the working directory using the os package.
import os
os.chdir('<your-directory>')
df = pd.read_csv('dhs_service_records_synthesized_final.csv')
4.6.2. Simple Data Inspection#
Before doing anything with the data, it is good practice to look at a few samples from the dataset. Here we look at the first five rows and the last seven rows, the head and the tail of the data. These methods are cheap to call, and calling them often is a good habit so that you know how the dataset evolves as you work with it.
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | NaN | 5 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 9-12~High School (grade 9-12) | NaN |
df.tail(7)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116127 | SYNTHETIC DATA | 530696 | 2021 | 06/30/2021 | Suicides | NaN | 21 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116128 | SYNTHETIC DATA | 530765 | 2021 | 06/30/2021 | Suicides | NaN | 15 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116129 | SYNTHETIC DATA | 530832 | 2021 | 06/30/2021 | Suicides | NaN | 17 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 99~Unknown | 99~Unknown | NaN |
7116130 | SYNTHETIC DATA | 532121 | 2021 | 04/30/2021 | Suicides | NaN | 22 | NaN | NaN | 2~Female | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
7116131 | SYNTHETIC DATA | 533991 | 2021 | 11/30/2021 | Suicides | NaN | 16 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
7116132 | SYNTHETIC DATA | 534127 | 2021 | 07/31/2021 | Suicides | NaN | 19 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 2~Not Hispanic/Latinx | NaN | NaN | 4~Widowed | GRAD~Graduate Degree | NaN |
7116133 | SYNTHETIC DATA | 535156 | 2021 | 06/30/2021 | Suicides | NaN | 24 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
4.6.3. Dataframe Attributes#
You can get a lot of helpful information from the attributes of the dataframe; specifically, we are going to look at dataframe dimensions, column names, column data types, and the ranges of column values. To get these, we can easily call some prebuilt attributes that are assigned to pandas dataframes. These help us to understand our data before we even start modifying it.
df.shape
(7116134, 20)
df.columns
Index(['synthetic_data', 'MCI_UNIQ_ID', 'CALDR_YR', 'DATE_OF_EVENT', 'service',
'GEO_AREA', 'age', 'DOB', 'DOD', 'GENDER', 'GENDER_IDENTITY',
'SEX_ORIENT', 'LEGAL_SEX', 'RACE', 'ETHNICITY', 'LIVING_ARRANGEMENT',
'EMPLOYMENT_STATUS', 'MARITAL_STATUS', 'EDUCATION_LEVEL',
'VETERAN_FLAG'],
dtype='object')
Pandas dataframes also have a helpful method that summarizes the whole frame at once: df.info().
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7116134 entries, 0 to 7116133
Data columns (total 20 columns):
# Column Dtype
--- ------ -----
0 synthetic_data object
1 MCI_UNIQ_ID int64
2 CALDR_YR int64
3 DATE_OF_EVENT object
4 service object
5 GEO_AREA float64
6 age int64
7 DOB float64
8 DOD float64
9 GENDER object
10 GENDER_IDENTITY float64
11 SEX_ORIENT float64
12 LEGAL_SEX float64
13 RACE object
14 ETHNICITY object
15 LIVING_ARRANGEMENT float64
16 EMPLOYMENT_STATUS float64
17 MARITAL_STATUS object
18 EDUCATION_LEVEL object
19 VETERAN_FLAG float64
dtypes: float64(9), int64(3), object(8)
memory usage: 1.1+ GB
Another way to check the data types, if you just want that, is to use the dtypes
attribute of the dataframe.
df.dtypes
synthetic_data object
MCI_UNIQ_ID int64
CALDR_YR int64
DATE_OF_EVENT object
service object
GEO_AREA float64
age int64
DOB float64
DOD float64
GENDER object
GENDER_IDENTITY float64
SEX_ORIENT float64
LEGAL_SEX float64
RACE object
ETHNICITY object
LIVING_ARRANGEMENT float64
EMPLOYMENT_STATUS float64
MARITAL_STATUS object
EDUCATION_LEVEL object
VETERAN_FLAG float64
dtype: object
We can also find the number of unique entries in each column. This matters because in some cases we do not want duplicates. In the case of these data from Allegheny County, unique IDs are important: each one corresponds to a single person, and we can track that person across their usage of many different services.
df.nunique()
synthetic_data 1
MCI_UNIQ_ID 533799
CALDR_YR 1
DATE_OF_EVENT 12
service 22
GEO_AREA 0
age 112
DOB 0
DOD 0
GENDER 3
GENDER_IDENTITY 0
SEX_ORIENT 0
LEGAL_SEX 0
RACE 7
ETHNICITY 3
LIVING_ARRANGEMENT 0
EMPLOYMENT_STATUS 0
MARITAL_STATUS 7
EDUCATION_LEVEL 8
VETERAN_FLAG 0
dtype: int64
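A related check, sketched here on a small toy frame rather than the full dataset, is counting exact duplicate rows with the duplicated() method:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 2],
    "service": ["A", "A", "B"],
})

# duplicated() flags each row that exactly repeats an earlier row;
# sum() counts those flags.
n_dupes = toy.duplicated().sum()
print(n_dupes)  # 1 -- the second row repeats the first
```

For this dataset, repeated IDs are expected (one person can use many services over many months), so exact-row duplicates, not repeated IDs, are what would signal a data problem.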
With the complete dataset, the synthetic_data column contains a single value, so the next two cells simply confirm that every row is flagged as synthetic.
#every row should carry the same 'SYNTHETIC DATA' flag;
#we group by the column to confirm.
df.groupby("synthetic_data").count()
MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
synthetic_data | |||||||||||||||||||
SYNTHETIC DATA | 7116134 | 7116134 | 7116134 | 7116134 | 0 | 7116134 | 0 | 0 | 7116134 | 0 | 0 | 0 | 7116134 | 7116134 | 0 | 0 | 7116134 | 7116134 | 0 |
df["synthetic_data"].value_counts()
SYNTHETIC DATA 7116134
Name: synthetic_data, dtype: int64
One way to get an overview of the data values is another helpful method, df.describe(). This provides several summary statistics for each numeric column in the set.
df.describe()
MCI_UNIQ_ID | CALDR_YR | GEO_AREA | age | DOB | DOD | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7.116134e+06 | 7116134.0 | 0.0 | 7.116134e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
mean | 2.674853e+05 | 2021.0 | NaN | 3.267702e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
std | 1.544397e+05 | 0.0 | NaN | 2.258095e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
min | 1.000000e+00 | 2021.0 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25% | 1.337730e+05 | 2021.0 | NaN | 1.300000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
50% | 2.675410e+05 | 2021.0 | NaN | 3.000000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
75% | 4.010820e+05 | 2021.0 | NaN | 4.900000e+01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
max | 5.356080e+05 | 2021.0 | NaN | 1.210000e+02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Some of these are not helpful: the mean ID value means nothing, and the calendar year is constant since every record comes from the same year. Note also that the mean, standard deviation, and quartiles show up as NaN for the columns that contain only NaN values; describe() skips missing values, so a statistic is NaN only when there is nothing left to summarize. We should find out where these NaN values are.
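To see this behavior of describe() concretely, here is a small toy frame (the column names are illustrative): a partially missing column is still summarized, while a column of only NaNs yields NaN statistics.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "some_nan": [1.0, np.nan, 3.0],       # partially missing
    "all_nan": [np.nan, np.nan, np.nan],  # entirely missing
})

stats = toy.describe()
print(stats.loc["mean", "some_nan"])  # 2.0 -- NaNs are skipped
print(stats.loc["mean", "all_nan"])   # nan -- nothing left to summarize
```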
4.6.4. Finding NaN values#
To count the NaN values (also written N/A or NA) in each column, we can chain the isna() method with the sum() method.
df.isna().sum()
synthetic_data 0
MCI_UNIQ_ID 0
CALDR_YR 0
DATE_OF_EVENT 0
service 0
GEO_AREA 7116134
age 0
DOB 7116134
DOD 7116134
GENDER 0
GENDER_IDENTITY 7116134
SEX_ORIENT 7116134
LEGAL_SEX 7116134
RACE 0
ETHNICITY 0
LIVING_ARRANGEMENT 7116134
EMPLOYMENT_STATUS 7116134
MARITAL_STATUS 0
EDUCATION_LEVEL 0
VETERAN_FLAG 7116134
dtype: int64
Note that df.isnull() is simply an alias for df.isna(), so the cell below produces identical results.
df.isnull().sum()
synthetic_data 0
MCI_UNIQ_ID 0
CALDR_YR 0
DATE_OF_EVENT 0
service 0
GEO_AREA 7116134
age 0
DOB 7116134
DOD 7116134
GENDER 0
GENDER_IDENTITY 7116134
SEX_ORIENT 7116134
LEGAL_SEX 7116134
RACE 0
ETHNICITY 0
LIVING_ARRANGEMENT 7116134
EMPLOYMENT_STATUS 7116134
MARITAL_STATUS 0
EDUCATION_LEVEL 0
VETERAN_FLAG 7116134
dtype: int64
We can see the columns [‘GEO_AREA’, ‘DOB’, ‘DOD’, ‘GENDER_IDENTITY’, ‘SEX_ORIENT’, ‘LEGAL_SEX’, ‘LIVING_ARRANGEMENT’, ‘EMPLOYMENT_STATUS’, ‘VETERAN_FLAG’] have only NaN values.
An earlier version of the data ended in a row of only NA values; with the complete version, we can verify that the last row contains real data.
df.isna().tail(1)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116133 | False | False | False | False | False | True | False | True | True | False | True | True | True | False | False | True | True | False | False | True |
Let's clean up the dataframe: we will drop any rows, and then any columns, that contain only NA values.
df.tail(1)
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | GEO_AREA | age | DOB | DOD | GENDER | GENDER_IDENTITY | SEX_ORIENT | LEGAL_SEX | RACE | ETHNICITY | LIVING_ARRANGEMENT | EMPLOYMENT_STATUS | MARITAL_STATUS | EDUCATION_LEVEL | VETERAN_FLAG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7116133 | SYNTHETIC DATA | 535156 | 2021 | 06/30/2021 | Suicides | NaN | 24 | NaN | NaN | 1~Male | NaN | NaN | NaN | 1~White | 99~Unknown | NaN | NaN | 1~Single-Never Married | 9-12~High School (grade 9-12) | NaN |
4.6.5. Removing NA Columns and Rows#
To drop all rows with only NaN values, we can use the dropna() method.
df.dropna(how='all', inplace=True)
df.shape
(7116134, 20)
With this call, the axis parameter defaulted to 0 (rows), so only rows consisting entirely of NA values were removed; there are none in this dataset, and the shape is unchanged. To remove columns instead, we must set axis=1.
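The axis convention is easy to mix up, so here is a toy frame (illustrative, not from the dataset) with no all-NA rows but one all-NA column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0],
    "b": [np.nan, np.nan],  # an entirely-NA column, like GEO_AREA
})

print(toy.dropna(how='all').shape)          # (2, 2) -- no row is entirely NA
print(toy.dropna(axis=1, how='all').shape)  # (2, 1) -- column 'b' is dropped
```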
#we want to remove all the columns that only have NA values
df.dropna(axis=1, how='all', inplace=True)
df.shape
(7116134, 11)
What if we instead wanted to drop any row containing even a single NA value? Removing the how='all' argument falls back to the default, how='any', which drops every row with at least one NA.
# df.dropna(inplace=True) # will drop rows with any NA.
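Between those two extremes, dropna() also accepts a thresh argument: keep only the rows that have at least that many non-NA values. A small sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 3.0, np.nan],
    "c": [4.0, np.nan, np.nan],
})

# keep only rows with at least 2 non-NA values
kept = toy.dropna(thresh=2)
print(kept.shape)  # (1, 3) -- only the first row qualifies
```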
In addition to the head and the tail, we will also look at three random rows; it is important to inspect data beyond the edges of the file.
# set seed for random function so that we get same rows when re-run the cell
np.random.seed(17)
df.iloc[np.random.randint(0, df.shape[0],3)]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
589041 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539983 | SYNTHETIC DATA | 449747 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
2078086 | SYNTHETIC DATA | 109525 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
We will also take a look at the records for a specific MCI_UNIQ_ID. Let's look at client 364757.
df[df['MCI_UNIQ_ID'] == 364757]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
589034 | SYNTHETIC DATA | 364757 | 2021 | 02/28/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589035 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589036 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589037 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589038 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589039 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589040 | SYNTHETIC DATA | 364757 | 2021 | 09/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589041 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589042 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
589043 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_DHS_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675581 | SYNTHETIC DATA | 364757 | 2021 | 01/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675582 | SYNTHETIC DATA | 364757 | 2021 | 02/28/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675583 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675584 | SYNTHETIC DATA | 364757 | 2021 | 04/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675585 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675586 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675587 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675588 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675589 | SYNTHETIC DATA | 364757 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675590 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675591 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4675592 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_Income_Supports | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710438 | SYNTHETIC DATA | 364757 | 2021 | 01/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710439 | SYNTHETIC DATA | 364757 | 2021 | 03/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710440 | SYNTHETIC DATA | 364757 | 2021 | 04/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710441 | SYNTHETIC DATA | 364757 | 2021 | 05/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710442 | SYNTHETIC DATA | 364757 | 2021 | 06/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710443 | SYNTHETIC DATA | 364757 | 2021 | 07/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710444 | SYNTHETIC DATA | 364757 | 2021 | 08/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710445 | SYNTHETIC DATA | 364757 | 2021 | 10/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710446 | SYNTHETIC DATA | 364757 | 2021 | 11/30/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
6710447 | SYNTHETIC DATA | 364757 | 2021 | 12/31/2021 | Individuals_Receiving_Mental_Health_Services | 54 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
And similarly for client 449747.
df[df['MCI_UNIQ_ID'] == 449747]
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|
5539976 | SYNTHETIC DATA | 449747 | 2021 | 01/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539977 | SYNTHETIC DATA | 449747 | 2021 | 03/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539978 | SYNTHETIC DATA | 449747 | 2021 | 04/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539979 | SYNTHETIC DATA | 449747 | 2021 | 05/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539980 | SYNTHETIC DATA | 449747 | 2021 | 06/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539981 | SYNTHETIC DATA | 449747 | 2021 | 07/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539982 | SYNTHETIC DATA | 449747 | 2021 | 08/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539983 | SYNTHETIC DATA | 449747 | 2021 | 09/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539984 | SYNTHETIC DATA | 449747 | 2021 | 10/31/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
5539985 | SYNTHETIC DATA | 449747 | 2021 | 11/30/2021 | Individuals_Receiving_Income_Supports | 51 | 1~Male | 1~White | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
4.6.6. Basic Transformations#
Often when we encounter data, it is in a format that is clunky to use, and part of exploratory data analysis is making the dataset friendlier to work with. Here we will go over four transformations: substitution, generation, outlier removal, and illogical value removal.
Substitution#
Several columns hold raw strings such as "Individuals_Receiving_DHS_Services". Plain strings are clunky to work with; converting these columns to the pandas category dtype makes plots easier to read and lets us run statistical tests on the numeric codes that back each category.
#we want to take the columns for gender, race, ethnicity, marital status, and education level
#and turn these columns into the category data type.
#first we add a new column to the dataframe that has this dtype
#but named with a _cat suffix.
df["GENDER_cat"] = df["GENDER"].astype('category')
df["RACE_cat"] = df["RACE"].astype('category')
df["ETHNICITY_cat"] = df["ETHNICITY"].astype('category')
df["MARITAL_STATUS_cat"] = df["MARITAL_STATUS"].astype('category')
df["EDUCATION_LEVEL_cat"] = df["EDUCATION_LEVEL"].astype('category')
df["service_cat"] = df["service"].astype('category')
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... |
#the ID column should be categorical as well
df["MCI_UNIQ_ID_cat"] = df["MCI_UNIQ_ID"].astype('category')
Generation#
In other cases, we may want to generate a new variable. Sometimes these are combinations of existing variables; here, we will convert the age column into a categorical variable by putting ages into bins of ten years.
#first we will define a helper function that takes in an int
#and returns the corresponding age bin value (grouped by 10), so a 5 gets returned as "0-9"
def age_bin(age):
if age < 10:
return "0-9"
elif age < 20:
return "10-19"
elif age < 30:
return "20-29"
elif age < 40:
return "30-39"
elif age < 50:
return "40-49"
elif age < 60:
return "50-59"
elif age < 70:
return "60-69"
elif age < 80:
return "70-79"
elif age < 90:
return "80-89"
else:
return "90+"
#a chain of comparisons is verbose, but it is explicit
#and gets the job done.
#we will now apply this function to the age column and create a new column called AGE_BIN
df["AGE_BIN"] = df["age"].apply(age_bin)
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | MCI_UNIQ_ID_cat | AGE_BIN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 |
Another kind of generation is splitting the date of event to get the month in which each service occurred. Since every date uses the same MM/DD/YYYY layout, we can build a month label from the first two characters (the month) and the last five characters (the year) of the DATE_OF_EVENT column.
#make a new month column
#built from the first two characters (month) plus the last five characters (/year)
df["month"] = df["DATE_OF_EVENT"].str[:2] + df["DATE_OF_EVENT"].str[-5:]
df.head()
synthetic_data | MCI_UNIQ_ID | CALDR_YR | DATE_OF_EVENT | service | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | GENDER_cat | RACE_cat | ETHNICITY_cat | MARITAL_STATUS_cat | EDUCATION_LEVEL_cat | service_cat | MCI_UNIQ_ID_cat | AGE_BIN | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SYNTHETIC DATA | 1626 | 2021 | 01/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 01/2021 |
1 | SYNTHETIC DATA | 1626 | 2021 | 02/28/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 02/2021 |
2 | SYNTHETIC DATA | 1626 | 2021 | 03/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 03/2021 |
3 | SYNTHETIC DATA | 1626 | 2021 | 04/30/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 04/2021 |
4 | SYNTHETIC DATA | 1626 | 2021 | 05/31/2021 | Children_Attending_Early_Childhood_Programs_Ma... | 5 | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | 1~Male | 1~White | 99~Unknown | 99~Unknown | 9-12~High School (grade 9-12) | Children_Attending_Early_Childhood_Programs_Ma... | 1626 | 0-9 | 05/2021 |
Outlier Removal#
Often, data contain outliers: points that sit far from the bulk of the data. With purely categorical variables, outlier removal usually does not apply, but it is common with numeric features. Here, age is the only substantive numeric column, so let's remove all rows with an age of 105 or greater.
#want to remove all entries from the dataset where the age entry is 105 or more
df = df[df["age"] < 105]
df.shape
(7115529, 20)
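The 105-year cutoff is a domain judgment. A common data-driven alternative, sketched here on a toy series with one implausible value, is the 1.5 * IQR rule:

```python
import pandas as pd

ages = pd.Series([5, 13, 30, 49, 54, 300])  # one implausible value

# flag anything beyond 1.5 interquartile ranges above the third quartile
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

kept = ages[ages <= upper]
print(300 in kept.values)  # False -- the implausible value is removed
```

Whether a rule like this is appropriate depends on the distribution; for ages, a hard domain cutoff is often the safer choice.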
Illogical Value Removal#
This kind of transformation is slightly harder because we must look at multiple features at once, but it is still important. Some rows in the dataset describe children aged 5 yet list them as being in high school.
#we want to remove all the values where the age is 0-9
#AND where they are listed as in high school as "9-12~High School (grade 9-12)"
df = df[~((df["AGE_BIN"] == "0-9") & (df["EDUCATION_LEVEL"] == "9-12~High School (grade 9-12)"))]
df.shape
(6581506, 20)
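When removing contradictory rows, it can help to build the boolean mask separately and count how many rows it flags before dropping anything. Here is a self-contained sketch on an invented toy frame that mimics the AGE_BIN / EDUCATION_LEVEL contradiction:

```python
import pandas as pd

# toy frame mimicking the AGE_BIN / EDUCATION_LEVEL contradiction (values invented)
sample = pd.DataFrame({
    "AGE_BIN": ["0-9", "0-9", "10-19", "20-29"],
    "EDUCATION_LEVEL": [
        "9-12~High School (grade 9-12)",
        "99~Unknown",
        "9-12~High School (grade 9-12)",
        "UNDERGRAD~Some College",
    ],
})

# build the mask separately so we can inspect it before filtering
mask = (sample["AGE_BIN"] == "0-9") & (
    sample["EDUCATION_LEVEL"] == "9-12~High School (grade 9-12)"
)
removed = mask.sum()      # number of contradictory rows
cleaned = sample[~mask]   # keep everything else
```

Checking `mask.sum()` first is a cheap sanity check: a surprisingly large count often signals a mistake in the condition rather than genuinely bad data.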
4.6.7. Advanced Transformations#
Sometimes when doing exploratory data analysis, the format in which a table is stored does not give a complete picture. In this dataset, we found earlier that several rows share the same ID, which means there are multiple entries for some people; in other words, they used the services captured in the dataset multiple times.
To get a better view of the data, we can build a new dataframe with one row per ID, adding columns for the number of services used, the number of distinct services used, and the number of months in which services were used. Because the dataset is large, this may take a while.
In general, to find useful transformations for a dataset, you need to think about the direction you want to take the analysis, which can come from reasoning about the data logically. A better approach, however, is to consult domain knowledge: look to sources related to the data (research articles, the people who collected it, subject-matter experts) and transform the data as they advise, or ask them where to look for ideas.
Here, we are going to perform two aggregations. These group the entries of the dataset by ID, and by (ID, month) pairs, and compute summary statistics, including the number of services used, the number of months in which services were used, and the range of dates over which services were used.
def get_client_attributes(services):
client_attributes = services.groupby(['MCI_UNIQ_ID']).agg(
num_services = ('service', 'count'),
distinct_services = ('service', 'nunique'),
first_date = ('DATE_OF_EVENT', 'min'),
last_date = ('DATE_OF_EVENT', 'max'),
num_months = ('month', 'count'),
distinct_month = ('month', 'nunique')
).reset_index()
return client_attributes
client_attributes = get_client_attributes(df)
client_attributes.head()
MCI_UNIQ_ID | num_services | distinct_services | first_date | last_date | num_months | distinct_month | |
---|---|---|---|---|---|---|---|
0 | 1 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 |
1 | 2 | 1 | 1 | 02/28/2021 | 02/28/2021 | 1 | 1 |
2 | 3 | 11 | 1 | 02/28/2021 | 12/31/2021 | 11 | 11 |
3 | 5 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 |
4 | 6 | 24 | 3 | 01/31/2021 | 12/31/2021 | 24 | 12 |
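One caveat: the `min` and `max` aggregations on `DATE_OF_EVENT` compare strings, which happens to work here because every date shares the year 2021 and the MM/DD/YYYY format zero-pads months and days. Across multiple years, string comparison would order dates incorrectly. A safer sketch (assuming the MM/DD/YYYY format shown above) parses the column first:

```python
import pandas as pd

dates = pd.Series(["12/31/2021", "01/31/2022"])

# string comparison picks "01/31/2022", which is chronologically the LATER date
string_min = dates.min()

# parsing to datetime gives the true earliest date
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
true_min = parsed.min()
```

On the full dataset you would convert `DATE_OF_EVENT` with `pd.to_datetime` once, before the `groupby`, so that `first_date` and `last_date` are real timestamps.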
Similarly, we perform the client-month attribute aggregations.
def get_client_date_attributes(services):
client_date_attributes = services.groupby(['MCI_UNIQ_ID', 'month']).agg(
num_services = ('service', 'count'),
distinct_services = ('service', 'nunique')
).reset_index()
return client_date_attributes
client_date_attributes = get_client_date_attributes(df)
client_date_attributes.head()
MCI_UNIQ_ID | month | num_services | distinct_services | |
---|---|---|---|---|
0 | 1 | 01/2021 | 1 | 1 |
1 | 1 | 02/2021 | 1 | 1 |
2 | 1 | 03/2021 | 1 | 1 |
3 | 1 | 04/2021 | 1 | 1 |
4 | 1 | 05/2021 | 1 | 1 |
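To see what this (ID, month) aggregation produces without loading the full file, here is a self-contained toy version; the IDs, months, and service names are invented, but the `groupby` logic matches the function above:

```python
import pandas as pd

# toy services table: client 1 used two services in 01/2021, one in 02/2021
services = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 1, 2],
    "month": ["01/2021", "01/2021", "02/2021", "01/2021"],
    "service": ["A", "B", "A", "A"],
})

# one row per (client, month), with counts of total and distinct services
per_client_month = services.groupby(["MCI_UNIQ_ID", "month"]).agg(
    num_services=("service", "count"),
    distinct_services=("service", "nunique"),
).reset_index()
```

The toy output has three rows, one per (client, month) pair, with client 1's 01/2021 row showing two services used, both distinct.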
Now, we can merge information from the original dataset into the new dataframes so that some of the details about each person carry over. Specifically, we want to carry over age, gender, race, ethnicity, education level, and marital status.
#start with client_attributes
#we want to add columns of age, gender, race, ethnicity, education level, and marital status
#from the original dataset, making sure IDs match up and that there is only one occurrence of each ID.
client_attributes = pd.merge(client_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove duplicate rows
client_attributes = client_attributes.drop_duplicates(subset=['MCI_UNIQ_ID'])
client_attributes.head()
MCI_UNIQ_ID | num_services | distinct_services | first_date | last_date | num_months | distinct_month | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
12 | 2 | 1 | 1 | 02/28/2021 | 02/28/2021 | 1 | 1 | 70 | 2~Female | 1~White | 2~Not Hispanic/Latinx | 2~Married | 99~Unknown |
13 | 3 | 11 | 1 | 02/28/2021 | 12/31/2021 | 11 | 11 | 29 | 2~Female | 1~White | 2~Not Hispanic/Latinx | 2~Married | 9-12~High School (grade 9-12) |
24 | 5 | 12 | 1 | 01/31/2021 | 12/31/2021 | 12 | 12 | 40 | 1~Male | 2~Black/African American | 2~Not Hispanic/Latinx | 2~Married | UNDERGRAD~Some College |
36 | 6 | 24 | 3 | 01/31/2021 | 12/31/2021 | 24 | 12 | 71 | 2~Female | 2~Black/African American | 2~Not Hispanic/Latinx | 1~Single-Never Married | 99~Unknown |
#we'll do the same with client_date_attributes
client_date_attributes = pd.merge(client_date_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')
#now remove rows that have duplicate ID-month pairs
client_date_attributes = client_date_attributes.drop_duplicates(subset=['MCI_UNIQ_ID', 'month'])
client_date_attributes.head()
MCI_UNIQ_ID | month | num_services | distinct_services | age | GENDER | RACE | ETHNICITY | MARITAL_STATUS | EDUCATION_LEVEL | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 01/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
12 | 1 | 02/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
24 | 1 | 03/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
36 | 1 | 04/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
48 | 1 | 05/2021 | 1 | 1 | 13 | 1~Male | 99~Unknown | 2~Not Hispanic/Latinx | 99~Unknown | GED~High School Diploma/GED Completed |
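Merging the full `df` and then dropping duplicates works, but it copies every client's demographic rows many times before discarding them. A lighter alternative, sketched here on invented toy frames, is to deduplicate the demographic columns before the merge:

```python
import pandas as pd

# toy aggregated table and toy raw records (values invented)
attrs = pd.DataFrame({"MCI_UNIQ_ID": [1, 2], "num_services": [12, 1]})
records = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 2],
    "GENDER": ["1~Male", "1~Male", "2~Female"],
})

# keep one demographic row per client *before* merging
demo = records[["MCI_UNIQ_ID", "GENDER"]].drop_duplicates(subset=["MCI_UNIQ_ID"])
merged = pd.merge(attrs, demo, on="MCI_UNIQ_ID", how="left")
```

The result matches the merge-then-deduplicate approach whenever each client's demographic values are consistent across their rows, while avoiding a large intermediate dataframe.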
4.6.8. Saving Data Transformations#
If you want to close your work and come back to it without rerunning all of the code above, or if you want to use the same data in a new notebook, an easy approach is to save the data to intermediate files and reload them later. We'll save all of our datasets to files that we will use for plotting in the next section.
#we will save df to a new csv file
df.to_csv('dhs_cleaned.csv', index=False)
client_attributes.to_csv('dhs_client_attributes.csv', index=False)
client_date_attributes.to_csv('dhs_client_date_attributes.csv', index=False)
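Passing `index=False` matters here: without it, `read_csv` would bring the saved index back as an unnamed extra column. A quick round trip through a temporary file (a self-contained sketch with invented values) confirms the data survives intact:

```python
import os
import tempfile

import pandas as pd

tbl = pd.DataFrame({"MCI_UNIQ_ID": [1, 2], "num_services": [12, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dhs_client_attributes.csv")
    tbl.to_csv(path, index=False)   # index=False keeps the old index out of the file
    reloaded = pd.read_csv(path)

# the reloaded frame matches the original exactly
round_trip_ok = tbl.equals(reloaded)
```

Note that CSV round trips can still change dtypes for some columns (dates reload as strings, for example), so it is worth calling `df.dtypes` after reloading.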
4.6.9. EDA Project: Part 1#
Answer the following questions:
Which MCI_UNIQ_ID has the highest number of records among all IDs?
What is the average number of interactions for clients in the file?
For a client of your choice, in how many distinct months did they use a service?
What other general observations can you make from the data?
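These questions map naturally onto short pandas expressions. Here is a sketch on a small invented frame so you can verify the logic before running it on the real `df` (the IDs and months are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "MCI_UNIQ_ID": [1, 1, 1, 2, 3, 3],
    "month": ["01/2021", "02/2021", "02/2021", "01/2021", "01/2021", "03/2021"],
})

# ID with the most records
top_id = toy["MCI_UNIQ_ID"].value_counts().idxmax()

# average number of records per client
avg_records = toy.groupby("MCI_UNIQ_ID").size().mean()

# distinct months of use for one chosen client (here, client 1)
months_for_client = toy.loc[toy["MCI_UNIQ_ID"] == 1, "month"].nunique()
```

On the toy frame, client 1 has the most records, clients average two records each, and client 1 appears in two distinct months.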