{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Investigation II\n", "\n", "Often times when we explore data for the first time, we have to spend time looking at what is missing in the data, or what needs to be changed. Data files rarely if ever are perfect for use immediately once you get them. Additionally, it is often unwise to utilize data for predictive modeling or statistical analysis without knowing some of the context for the data and looking at its contents.\n", "\n", "Python has a helpful package called `pandas` which is purpose-built for looking at and manipulating data, and a lot of what we will work on in this section will involve using it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "The first thing we typically do when dealing with data in Python is to import the `pandas` package and one of its dependencies, `numpy`. " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "id": "2eYcfvLr2ahn", "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the large size of the provided data file, we store it in a different directory than the one where this notebook lies. We can easily set this directory using the `os` package.\n", "\n", "```python\n", "import os\n", "os.chdir('')\n", "df = pd.read_csv('dhs_service_records_synthesized_final.csv')\n", "```" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "import os\n", "os.chdir('/Users/jonathanzhu/Documents/ev work/allegheny')\n", "df = pd.read_csv('dhs_service_records_synthesized_final.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Data Inspection\n", "\n", "Before doing anything with the data, it is good practice to take a look at a few samples from the dataset. Here, we're going to take a look at the first ten rows and the last seven rows, or the head and tail of the data; this is a function that is easy to call, and it is good practice to call either of these often so that you know how a dataset evolves over time." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceGEO_AREAageDOBDODGENDERGENDER_IDENTITYSEX_ORIENTLEGAL_SEXRACEETHNICITYLIVING_ARRANGEMENTEMPLOYMENT_STATUSMARITAL_STATUSEDUCATION_LEVELVETERAN_FLAG
0SYNTHETIC DATA1626202101/31/2021Children_Attending_Early_Childhood_Programs_Ma...NaN5NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown9-12~High School (grade 9-12)NaN
1SYNTHETIC DATA1626202102/28/2021Children_Attending_Early_Childhood_Programs_Ma...NaN5NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown9-12~High School (grade 9-12)NaN
2SYNTHETIC DATA1626202103/31/2021Children_Attending_Early_Childhood_Programs_Ma...NaN5NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown9-12~High School (grade 9-12)NaN
3SYNTHETIC DATA1626202104/30/2021Children_Attending_Early_Childhood_Programs_Ma...NaN5NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown9-12~High School (grade 9-12)NaN
4SYNTHETIC DATA1626202105/31/2021Children_Attending_Early_Childhood_Programs_Ma...NaN5NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown9-12~High School (grade 9-12)NaN
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "0 SYNTHETIC DATA 1626 2021 01/31/2021 \n", "1 SYNTHETIC DATA 1626 2021 02/28/2021 \n", "2 SYNTHETIC DATA 1626 2021 03/31/2021 \n", "3 SYNTHETIC DATA 1626 2021 04/30/2021 \n", "4 SYNTHETIC DATA 1626 2021 05/31/2021 \n", "\n", " service GEO_AREA age DOB DOD \\\n", "0 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN \n", "1 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN \n", "2 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN \n", "3 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN \n", "4 Children_Attending_Early_Childhood_Programs_Ma... NaN 5 NaN NaN \n", "\n", " GENDER GENDER_IDENTITY SEX_ORIENT LEGAL_SEX RACE ETHNICITY \\\n", "0 1~Male NaN NaN NaN 1~White 99~Unknown \n", "1 1~Male NaN NaN NaN 1~White 99~Unknown \n", "2 1~Male NaN NaN NaN 1~White 99~Unknown \n", "3 1~Male NaN NaN NaN 1~White 99~Unknown \n", "4 1~Male NaN NaN NaN 1~White 99~Unknown \n", "\n", " LIVING_ARRANGEMENT EMPLOYMENT_STATUS MARITAL_STATUS \\\n", "0 NaN NaN 99~Unknown \n", "1 NaN NaN 99~Unknown \n", "2 NaN NaN 99~Unknown \n", "3 NaN NaN 99~Unknown \n", "4 NaN NaN 99~Unknown \n", "\n", " EDUCATION_LEVEL VETERAN_FLAG \n", "0 9-12~High School (grade 9-12) NaN \n", "1 9-12~High School (grade 9-12) NaN \n", "2 9-12~High School (grade 9-12) NaN \n", "3 9-12~High School (grade 9-12) NaN \n", "4 9-12~High School (grade 9-12) NaN " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceGEO_AREAageDOBDODGENDERGENDER_IDENTITYSEX_ORIENTLEGAL_SEXRACEETHNICITYLIVING_ARRANGEMENTEMPLOYMENT_STATUSMARITAL_STATUSEDUCATION_LEVELVETERAN_FLAG
7116127SYNTHETIC DATA530696202106/30/2021SuicidesNaN21NaNNaN2~FemaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown99~UnknownNaN
7116128SYNTHETIC DATA530765202106/30/2021SuicidesNaN15NaNNaN2~FemaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown99~UnknownNaN
7116129SYNTHETIC DATA530832202106/30/2021SuicidesNaN17NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN99~Unknown99~UnknownNaN
7116130SYNTHETIC DATA532121202104/30/2021SuicidesNaN22NaNNaN2~FemaleNaNNaNNaN1~White99~UnknownNaNNaN1~Single-Never Married9-12~High School (grade 9-12)NaN
7116131SYNTHETIC DATA533991202111/30/2021SuicidesNaN16NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN1~Single-Never Married9-12~High School (grade 9-12)NaN
7116132SYNTHETIC DATA534127202107/31/2021SuicidesNaN19NaNNaN1~MaleNaNNaNNaN1~White2~Not Hispanic/LatinxNaNNaN4~WidowedGRAD~Graduate DegreeNaN
7116133SYNTHETIC DATA535156202106/30/2021SuicidesNaN24NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN1~Single-Never Married9-12~High School (grade 9-12)NaN
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service \\\n", "7116127 SYNTHETIC DATA 530696 2021 06/30/2021 Suicides \n", "7116128 SYNTHETIC DATA 530765 2021 06/30/2021 Suicides \n", "7116129 SYNTHETIC DATA 530832 2021 06/30/2021 Suicides \n", "7116130 SYNTHETIC DATA 532121 2021 04/30/2021 Suicides \n", "7116131 SYNTHETIC DATA 533991 2021 11/30/2021 Suicides \n", "7116132 SYNTHETIC DATA 534127 2021 07/31/2021 Suicides \n", "7116133 SYNTHETIC DATA 535156 2021 06/30/2021 Suicides \n", "\n", " GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT \\\n", "7116127 NaN 21 NaN NaN 2~Female NaN NaN \n", "7116128 NaN 15 NaN NaN 2~Female NaN NaN \n", "7116129 NaN 17 NaN NaN 1~Male NaN NaN \n", "7116130 NaN 22 NaN NaN 2~Female NaN NaN \n", "7116131 NaN 16 NaN NaN 1~Male NaN NaN \n", "7116132 NaN 19 NaN NaN 1~Male NaN NaN \n", "7116133 NaN 24 NaN NaN 1~Male NaN NaN \n", "\n", " LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT \\\n", "7116127 NaN 1~White 99~Unknown NaN \n", "7116128 NaN 1~White 99~Unknown NaN \n", "7116129 NaN 1~White 99~Unknown NaN \n", "7116130 NaN 1~White 99~Unknown NaN \n", "7116131 NaN 1~White 99~Unknown NaN \n", "7116132 NaN 1~White 2~Not Hispanic/Latinx NaN \n", "7116133 NaN 1~White 99~Unknown NaN \n", "\n", " EMPLOYMENT_STATUS MARITAL_STATUS \\\n", "7116127 NaN 99~Unknown \n", "7116128 NaN 99~Unknown \n", "7116129 NaN 99~Unknown \n", "7116130 NaN 1~Single-Never Married \n", "7116131 NaN 1~Single-Never Married \n", "7116132 NaN 4~Widowed \n", "7116133 NaN 1~Single-Never Married \n", "\n", " EDUCATION_LEVEL VETERAN_FLAG \n", "7116127 99~Unknown NaN \n", "7116128 99~Unknown NaN \n", "7116129 99~Unknown NaN \n", "7116130 9-12~High School (grade 9-12) NaN \n", "7116131 9-12~High School (grade 9-12) NaN \n", "7116132 GRAD~Graduate Degree NaN \n", "7116133 9-12~High School (grade 9-12) NaN " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} dataframe attributes\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframe Attributes \n", "\n", "You can get a lot of helpful information from the attributes of the dataframe; specifically, we are going to look at dataframe dimensions, column names, column data types, and the ranges of column values. To get these, we can easily call some prebuilt attributes that are assigned to pandas dataframes. These help us to understand our data before we even start modifying it." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7116134, 20)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fiOM4VZs9ZRV", "outputId": "58b2036b-2ad6-4100-d7e2-520f823c9a93", "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "Index(['synthetic_data', 'MCI_UNIQ_ID', 'CALDR_YR', 'DATE_OF_EVENT', 'service',\n", " 'GEO_AREA', 'age', 'DOB', 'DOD', 'GENDER', 'GENDER_IDENTITY',\n", " 'SEX_ORIENT', 'LEGAL_SEX', 'RACE', 'ETHNICITY', 'LIVING_ARRANGEMENT',\n", " 'EMPLOYMENT_STATUS', 'MARITAL_STATUS', 'EDUCATION_LEVEL',\n", " 'VETERAN_FLAG'],\n", " dtype='object')" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas dataframes also have a helpful method for easy summarization of the set, which you can call with `df.info()`." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 7116134 entries, 0 to 7116133\n", "Data columns (total 20 columns):\n", " # Column Dtype \n", "--- ------ ----- \n", " 0 synthetic_data object \n", " 1 MCI_UNIQ_ID int64 \n", " 2 CALDR_YR int64 \n", " 3 DATE_OF_EVENT object \n", " 4 service object \n", " 5 GEO_AREA float64\n", " 6 age int64 \n", " 7 DOB float64\n", " 8 DOD float64\n", " 9 GENDER object \n", " 10 GENDER_IDENTITY float64\n", " 11 SEX_ORIENT float64\n", " 12 LEGAL_SEX float64\n", " 13 RACE object \n", " 14 ETHNICITY object \n", " 15 LIVING_ARRANGEMENT float64\n", " 16 EMPLOYMENT_STATUS float64\n", " 17 MARITAL_STATUS object \n", " 18 EDUCATION_LEVEL object \n", " 19 VETERAN_FLAG float64\n", "dtypes: float64(9), int64(3), object(8)\n", "memory usage: 1.1+ GB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to check the data types, if you just want that, is to use the `dtypes` attribute of the dataframe." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synthetic_data object\n", "MCI_UNIQ_ID int64\n", "CALDR_YR int64\n", "DATE_OF_EVENT object\n", "service object\n", "GEO_AREA float64\n", "age int64\n", "DOB float64\n", "DOD float64\n", "GENDER object\n", "GENDER_IDENTITY float64\n", "SEX_ORIENT float64\n", "LEGAL_SEX float64\n", "RACE object\n", "ETHNICITY object\n", "LIVING_ARRANGEMENT float64\n", "EMPLOYMENT_STATUS float64\n", "MARITAL_STATUS object\n", "EDUCATION_LEVEL object\n", "VETERAN_FLAG float64\n", "dtype: object" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also find the number of unique entries in the dataframe. This can be important as in some cases we may not want duplicates. In the case of these data from Allegheny County, unique IDs are important as each one will correspond to a single person, and we can track that person across their usage of many different services." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synthetic_data 1\n", "MCI_UNIQ_ID 533799\n", "CALDR_YR 1\n", "DATE_OF_EVENT 12\n", "service 22\n", "GEO_AREA 0\n", "age 112\n", "DOB 0\n", "DOD 0\n", "GENDER 3\n", "GENDER_IDENTITY 0\n", "SEX_ORIENT 0\n", "LEGAL_SEX 0\n", "RACE 7\n", "ETHNICITY 3\n", "LIVING_ARRANGEMENT 0\n", "EMPLOYMENT_STATUS 0\n", "MARITAL_STATUS 7\n", "EDUCATION_LEVEL 8\n", "VETERAN_FLAG 0\n", "dtype: int64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note to collaborators: the next two code cells are irrelevant with the complete dataset as there are no entries with \"SYNTHET\" in the synthetic_data column." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceGEO_AREAageDOBDODGENDERGENDER_IDENTITYSEX_ORIENTLEGAL_SEXRACEETHNICITYLIVING_ARRANGEMENTEMPLOYMENT_STATUSMARITAL_STATUSEDUCATION_LEVELVETERAN_FLAG
synthetic_data
SYNTHETIC DATA7116134711613471161347116134071161340071161340007116134711613400711613471161340
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service GEO_AREA \\\n", "synthetic_data \n", "SYNTHETIC DATA 7116134 7116134 7116134 7116134 0 \n", "\n", " age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT \\\n", "synthetic_data \n", "SYNTHETIC DATA 7116134 0 0 7116134 0 0 \n", "\n", " LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT \\\n", "synthetic_data \n", "SYNTHETIC DATA 0 7116134 7116134 0 \n", "\n", " EMPLOYMENT_STATUS MARITAL_STATUS EDUCATION_LEVEL \\\n", "synthetic_data \n", "SYNTHETIC DATA 0 7116134 7116134 \n", "\n", " VETERAN_FLAG \n", "synthetic_data \n", "SYNTHETIC DATA 0 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#note: I am not sure what 'SYNTHET\" versus \"SYNTHETIC DATA\" means.\n", "#but we split by them anyway.\n", "df.groupby(\"synthetic_data\").count()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SYNTHETIC DATA 7116134\n", "Name: synthetic_data, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"synthetic_data\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to get an overview of the data values is to use another helpful function, `df.describe()`. This will provide several summary statistics for you for each of the columns in the set." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDCALDR_YRGEO_AREAageDOBDODGENDER_IDENTITYSEX_ORIENTLEGAL_SEXLIVING_ARRANGEMENTEMPLOYMENT_STATUSVETERAN_FLAG
count7.116134e+067116134.00.07.116134e+060.00.00.00.00.00.00.00.0
mean2.674853e+052021.0NaN3.267702e+01NaNNaNNaNNaNNaNNaNNaNNaN
std1.544397e+050.0NaN2.258095e+01NaNNaNNaNNaNNaNNaNNaNNaN
min1.000000e+002021.0NaN0.000000e+00NaNNaNNaNNaNNaNNaNNaNNaN
25%1.337730e+052021.0NaN1.300000e+01NaNNaNNaNNaNNaNNaNNaNNaN
50%2.675410e+052021.0NaN3.000000e+01NaNNaNNaNNaNNaNNaNNaNNaN
75%4.010820e+052021.0NaN4.900000e+01NaNNaNNaNNaNNaNNaNNaNNaN
max5.356080e+052021.0NaN1.210000e+02NaNNaNNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID CALDR_YR GEO_AREA age DOB DOD \\\n", "count 7.116134e+06 7116134.0 0.0 7.116134e+06 0.0 0.0 \n", "mean 2.674853e+05 2021.0 NaN 3.267702e+01 NaN NaN \n", "std 1.544397e+05 0.0 NaN 2.258095e+01 NaN NaN \n", "min 1.000000e+00 2021.0 NaN 0.000000e+00 NaN NaN \n", "25% 1.337730e+05 2021.0 NaN 1.300000e+01 NaN NaN \n", "50% 2.675410e+05 2021.0 NaN 3.000000e+01 NaN NaN \n", "75% 4.010820e+05 2021.0 NaN 4.900000e+01 NaN NaN \n", "max 5.356080e+05 2021.0 NaN 1.210000e+02 NaN NaN \n", "\n", " GENDER_IDENTITY SEX_ORIENT LEGAL_SEX LIVING_ARRANGEMENT \\\n", "count 0.0 0.0 0.0 0.0 \n", "mean NaN NaN NaN NaN \n", "std NaN NaN NaN NaN \n", "min NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN \n", "max NaN NaN NaN NaN \n", "\n", " EMPLOYMENT_STATUS VETERAN_FLAG \n", "count 0.0 0.0 \n", "mean NaN NaN \n", "std NaN NaN \n", "min NaN NaN \n", "25% NaN NaN \n", "50% NaN NaN \n", "75% NaN NaN \n", "max NaN NaN " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} NaN values\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some cases, these are not helpful, like in the IDs portion, having the mean ID value is not important. Same with calendar year, especially since these are all from the same year. But something to note is that the values for the mean, std error, and quartiles will not show up if there are ANY NaN values in the column. So we need to find out where these NaN values are.\n", "\n", "## Finding NaN values\n", "\n", "To get the number of NaN (N/A or NA, as they are sometimes called), we can use the `isna()` method in combination with the `sum()` method." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synthetic_data 0\n", "MCI_UNIQ_ID 0\n", "CALDR_YR 0\n", "DATE_OF_EVENT 0\n", "service 0\n", "GEO_AREA 7116134\n", "age 0\n", "DOB 7116134\n", "DOD 7116134\n", "GENDER 0\n", "GENDER_IDENTITY 7116134\n", "SEX_ORIENT 7116134\n", "LEGAL_SEX 7116134\n", "RACE 0\n", "ETHNICITY 0\n", "LIVING_ARRANGEMENT 7116134\n", "EMPLOYMENT_STATUS 7116134\n", "MARITAL_STATUS 0\n", "EDUCATION_LEVEL 0\n", "VETERAN_FLAG 7116134\n", "dtype: int64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note for collaborators: df.isna() and df.isnull() now produce the same results, so the below code cell is not relevant." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "synthetic_data 0\n", "MCI_UNIQ_ID 0\n", "CALDR_YR 0\n", "DATE_OF_EVENT 0\n", "service 0\n", "GEO_AREA 7116134\n", "age 0\n", "DOB 7116134\n", "DOD 7116134\n", "GENDER 0\n", "GENDER_IDENTITY 7116134\n", "SEX_ORIENT 7116134\n", "LEGAL_SEX 7116134\n", "RACE 0\n", "ETHNICITY 0\n", "LIVING_ARRANGEMENT 7116134\n", "EMPLOYMENT_STATUS 7116134\n", "MARITAL_STATUS 0\n", "EDUCATION_LEVEL 0\n", "VETERAN_FLAG 7116134\n", "dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the columns ['GEO_AREA', 'DOB', 'DOD', 'GENDER_IDENTITY', 'SEX_ORIENT', 'LEGAL_SEX', 'LIVING_ARRANGEMENT', 'EMPLOYMENT_STATUS', 'VETERAN_FLAG'] have only NaN values.\n", "\n", "Note to collaborators: the bottom row that has only NAs is gone with the complete version of the data.\n", "" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceGEO_AREAageDOBDODGENDERGENDER_IDENTITYSEX_ORIENTLEGAL_SEXRACEETHNICITYLIVING_ARRANGEMENTEMPLOYMENT_STATUSMARITAL_STATUSEDUCATION_LEVELVETERAN_FLAG
7116133FalseFalseFalseFalseFalseTrueFalseTrueTrueFalseTrueTrueTrueFalseFalseTrueTrueFalseFalseTrue
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service \\\n", "7116133 False False False False False \n", "\n", " GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT \\\n", "7116133 True False True True False True True \n", "\n", " LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT EMPLOYMENT_STATUS \\\n", "7116133 True False False True True \n", "\n", " MARITAL_STATUS EDUCATION_LEVEL VETERAN_FLAG \n", "7116133 False False True " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().tail(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "let's clean up the data frame then, drop last row, drop the columns that only have NAs\n", "\n", "Note: I am also not quite sure what the code cell above and below are doing in this context." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceGEO_AREAageDOBDODGENDERGENDER_IDENTITYSEX_ORIENTLEGAL_SEXRACEETHNICITYLIVING_ARRANGEMENTEMPLOYMENT_STATUSMARITAL_STATUSEDUCATION_LEVELVETERAN_FLAG
7116133SYNTHETIC DATA535156202106/30/2021SuicidesNaN24NaNNaN1~MaleNaNNaNNaN1~White99~UnknownNaNNaN1~Single-Never Married9-12~High School (grade 9-12)NaN
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT service \\\n", "7116133 SYNTHETIC DATA 535156 2021 06/30/2021 Suicides \n", "\n", " GEO_AREA age DOB DOD GENDER GENDER_IDENTITY SEX_ORIENT \\\n", "7116133 NaN 24 NaN NaN 1~Male NaN NaN \n", "\n", " LEGAL_SEX RACE ETHNICITY LIVING_ARRANGEMENT \\\n", "7116133 NaN 1~White 99~Unknown NaN \n", "\n", " EMPLOYMENT_STATUS MARITAL_STATUS \\\n", "7116133 NaN 1~Single-Never Married \n", "\n", " EDUCATION_LEVEL VETERAN_FLAG \n", "7116133 9-12~High School (grade 9-12) NaN " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing NA Columns and Rows\n", "\n", "To drop all rows with only NaN values, we can use a simple function called `df.dropna()`." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7116134, 20)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna(how='all', inplace=True)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this function, the default parameter for a hidden value `axis` removed rows of only NA values, of which there are none in the dataset. To remove columns, we need to specify the axis." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7116134, 11)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we want to remove all the columns that only have NA values\n", "df.dropna(axis=1, how='all', inplace=True)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now what if we wanted to drop a row with a single NA value? We can simply remove the how='all' parameter and it will remove the rows with at least one NA value." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "# df.dropna(inplace=True) # will drop rows with any NA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to looking at the head and the tail, we are also at this point going to take a look at 3 random rows, as it is important to inspect data beyond the head and tail." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 589043, 5539985, 2078088])" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set seed for random function so that we get same rows when re-run the cell\n", "np.random.seed(17)\n", "np.random.randint(2, df.shape[0],3)\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 491929, 828732, 6417139])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.randint(3, df.shape[0],3)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVEL
589041SYNTHETIC DATA364757202110/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539983SYNTHETIC DATA449747202109/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
2078086SYNTHETIC DATA109525202109/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "589041 SYNTHETIC DATA 364757 2021 10/31/2021 \n", "5539983 SYNTHETIC DATA 449747 2021 09/30/2021 \n", "2078086 SYNTHETIC DATA 109525 2021 09/30/2021 \n", "\n", " service age GENDER RACE \\\n", "589041 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "5539983 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "2078086 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL \n", "589041 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539983 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "2078086 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set seed for random function so that we get same rows when re-run the cell\n", "np.random.seed(17)\n", "df.iloc[np.random.randint(0, df.shape[0],3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also take a look at the record for a specific MCI_UNIQ_ID. Let's look at client 364757." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVEL
589034SYNTHETIC DATA364757202102/28/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589035SYNTHETIC DATA364757202103/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589036SYNTHETIC DATA364757202105/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589037SYNTHETIC DATA364757202106/30/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589038SYNTHETIC DATA364757202107/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589039SYNTHETIC DATA364757202108/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589040SYNTHETIC DATA364757202109/30/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589041SYNTHETIC DATA364757202110/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589042SYNTHETIC DATA364757202111/30/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
589043SYNTHETIC DATA364757202112/31/2021Individuals_Receiving_DHS_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675581SYNTHETIC DATA364757202101/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675582SYNTHETIC DATA364757202102/28/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675583SYNTHETIC DATA364757202103/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675584SYNTHETIC DATA364757202104/30/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675585SYNTHETIC DATA364757202105/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675586SYNTHETIC DATA364757202106/30/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675587SYNTHETIC DATA364757202107/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675588SYNTHETIC DATA364757202108/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675589SYNTHETIC DATA364757202109/30/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675590SYNTHETIC DATA364757202110/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675591SYNTHETIC DATA364757202111/30/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
4675592SYNTHETIC DATA364757202112/31/2021Individuals_Receiving_Income_Supports541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710438SYNTHETIC DATA364757202101/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710439SYNTHETIC DATA364757202103/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710440SYNTHETIC DATA364757202104/30/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710441SYNTHETIC DATA364757202105/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710442SYNTHETIC DATA364757202106/30/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710443SYNTHETIC DATA364757202107/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710444SYNTHETIC DATA364757202108/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710445SYNTHETIC DATA364757202110/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710446SYNTHETIC DATA364757202111/30/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
6710447SYNTHETIC DATA364757202112/31/2021Individuals_Receiving_Mental_Health_Services541~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "589034 SYNTHETIC DATA 364757 2021 02/28/2021 \n", "589035 SYNTHETIC DATA 364757 2021 03/31/2021 \n", "589036 SYNTHETIC DATA 364757 2021 05/31/2021 \n", "589037 SYNTHETIC DATA 364757 2021 06/30/2021 \n", "589038 SYNTHETIC DATA 364757 2021 07/31/2021 \n", "589039 SYNTHETIC DATA 364757 2021 08/31/2021 \n", "589040 SYNTHETIC DATA 364757 2021 09/30/2021 \n", "589041 SYNTHETIC DATA 364757 2021 10/31/2021 \n", "589042 SYNTHETIC DATA 364757 2021 11/30/2021 \n", "589043 SYNTHETIC DATA 364757 2021 12/31/2021 \n", "4675581 SYNTHETIC DATA 364757 2021 01/31/2021 \n", "4675582 SYNTHETIC DATA 364757 2021 02/28/2021 \n", "4675583 SYNTHETIC DATA 364757 2021 03/31/2021 \n", "4675584 SYNTHETIC DATA 364757 2021 04/30/2021 \n", "4675585 SYNTHETIC DATA 364757 2021 05/31/2021 \n", "4675586 SYNTHETIC DATA 364757 2021 06/30/2021 \n", "4675587 SYNTHETIC DATA 364757 2021 07/31/2021 \n", "4675588 SYNTHETIC DATA 364757 2021 08/31/2021 \n", "4675589 SYNTHETIC DATA 364757 2021 09/30/2021 \n", "4675590 SYNTHETIC DATA 364757 2021 10/31/2021 \n", "4675591 SYNTHETIC DATA 364757 2021 11/30/2021 \n", "4675592 SYNTHETIC DATA 364757 2021 12/31/2021 \n", "6710438 SYNTHETIC DATA 364757 2021 01/31/2021 \n", "6710439 SYNTHETIC DATA 364757 2021 03/31/2021 \n", "6710440 SYNTHETIC DATA 364757 2021 04/30/2021 \n", "6710441 SYNTHETIC DATA 364757 2021 05/31/2021 \n", "6710442 SYNTHETIC DATA 364757 2021 06/30/2021 \n", "6710443 SYNTHETIC DATA 364757 2021 07/31/2021 \n", "6710444 SYNTHETIC DATA 364757 2021 08/31/2021 \n", "6710445 SYNTHETIC DATA 364757 2021 10/31/2021 \n", "6710446 SYNTHETIC DATA 364757 2021 11/30/2021 \n", "6710447 SYNTHETIC DATA 364757 2021 12/31/2021 \n", "\n", " service age GENDER RACE \\\n", "589034 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589035 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589036 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589037 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589038 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589039 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589040 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589041 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589042 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "589043 Individuals_Receiving_DHS_Services 54 1~Male 1~White \n", "4675581 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675582 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675583 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675584 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675585 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675586 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675587 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675588 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675589 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675590 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675591 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "4675592 Individuals_Receiving_Income_Supports 54 1~Male 1~White \n", "6710438 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710439 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710440 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710441 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710442 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710443 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710444 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710445 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710446 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "6710447 Individuals_Receiving_Mental_Health_Services 54 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL \n", "589034 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589035 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589036 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589037 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589038 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589039 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589040 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589041 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589042 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "589043 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675581 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675582 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675583 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675584 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675585 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675586 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675587 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675588 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675589 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675590 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675591 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "4675592 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710438 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710439 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710440 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710441 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710442 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710443 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710444 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710445 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710446 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "6710447 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['MCI_UNIQ_ID'] == 364757]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And similarly for client 449747." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVEL
5539976SYNTHETIC DATA449747202101/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539977SYNTHETIC DATA449747202103/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539978SYNTHETIC DATA449747202104/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539979SYNTHETIC DATA449747202105/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539980SYNTHETIC DATA449747202106/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539981SYNTHETIC DATA449747202107/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539982SYNTHETIC DATA449747202108/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539983SYNTHETIC DATA449747202109/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539984SYNTHETIC DATA449747202110/31/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
5539985SYNTHETIC DATA449747202111/30/2021Individuals_Receiving_Income_Supports511~Male1~White2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "5539976 SYNTHETIC DATA 449747 2021 01/31/2021 \n", "5539977 SYNTHETIC DATA 449747 2021 03/31/2021 \n", "5539978 SYNTHETIC DATA 449747 2021 04/30/2021 \n", "5539979 SYNTHETIC DATA 449747 2021 05/31/2021 \n", "5539980 SYNTHETIC DATA 449747 2021 06/30/2021 \n", "5539981 SYNTHETIC DATA 449747 2021 07/31/2021 \n", "5539982 SYNTHETIC DATA 449747 2021 08/31/2021 \n", "5539983 SYNTHETIC DATA 449747 2021 09/30/2021 \n", "5539984 SYNTHETIC DATA 449747 2021 10/31/2021 \n", "5539985 SYNTHETIC DATA 449747 2021 11/30/2021 \n", "\n", " service age GENDER RACE \\\n", "5539976 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539977 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539978 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539979 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539980 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539981 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539982 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539983 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539984 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "5539985 Individuals_Receiving_Income_Supports 51 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL \n", "5539976 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539977 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539978 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539979 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539980 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539981 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539982 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539983 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539984 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown \n", "5539985 2~Not Hispanic/Latinx 1~Single-Never Married 99~Unknown " ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['MCI_UNIQ_ID'] == 449747]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} substitution\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Transformations\n", "\n", "Often times when we encounter data, we find that it is in a format that is a bit clunky to use. Part of EDA will be to make datasets more friendly. Here we will go over four transformations: substitution, generation, outlier removal, and illogical value removal.\n", "\n", "### Substitution\n", "\n", "For the data entries that have strings in certain columns, like \"Individuals_Receiving_DHS_Services\", we may need to alter these. Strings can be a bit clunky to work with and are generally read better for plots when we have them in categories. Additionally, we can try and do some mathematical tests on them when we have numerical category labels." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVELGENDER_catRACE_catETHNICITY_catMARITAL_STATUS_catEDUCATION_LEVEL_catservice_cat
0SYNTHETIC DATA1626202101/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...
1SYNTHETIC DATA1626202102/28/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...
2SYNTHETIC DATA1626202103/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...
3SYNTHETIC DATA1626202104/30/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...
4SYNTHETIC DATA1626202105/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "0 SYNTHETIC DATA 1626 2021 01/31/2021 \n", "1 SYNTHETIC DATA 1626 2021 02/28/2021 \n", "2 SYNTHETIC DATA 1626 2021 03/31/2021 \n", "3 SYNTHETIC DATA 1626 2021 04/30/2021 \n", "4 SYNTHETIC DATA 1626 2021 05/31/2021 \n", "\n", " service age GENDER RACE \\\n", "0 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "1 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "2 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "3 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "4 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat \\\n", "0 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "1 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "2 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "3 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "4 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "\n", " RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat \\\n", "0 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "1 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "2 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "3 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "4 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "\n", " service_cat \n", "0 Children_Attending_Early_Childhood_Programs_Ma... \n", "1 Children_Attending_Early_Childhood_Programs_Ma... \n", "2 Children_Attending_Early_Childhood_Programs_Ma... \n", "3 Children_Attending_Early_Childhood_Programs_Ma... \n", "4 Children_Attending_Early_Childhood_Programs_Ma... " ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we want to take the columns for gender, race, ethnicity, marital status, and education level\n", "#and turn these columns into the category data type.\n", "\n", "#first we add a new column to the dataframe that has this dtype\n", "#but named with a _cat suffix.\n", "df[\"GENDER_cat\"] = df[\"GENDER\"].astype('category')\n", "df[\"RACE_cat\"] = df[\"RACE\"].astype('category')\n", "df[\"ETHNICITY_cat\"] = df[\"ETHNICITY\"].astype('category')\n", "df[\"MARITAL_STATUS_cat\"] = df[\"MARITAL_STATUS\"].astype('category')\n", "df[\"EDUCATION_LEVEL_cat\"] = df[\"EDUCATION_LEVEL\"].astype('category')\n", "df[\"service_cat\"] = df[\"service\"].astype('category')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "#forgot to add for ids\n", "df[\"MCI_UNIQ_ID_cat\"] = df[\"MCI_UNIQ_ID\"].astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} generation\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generation\n", "In other cases, we may want to generate a new variable. These can simply be combinations of previous variables, but in the case of this dataset, we will convert the \"age\" column to a categorical datatype variable as well. We will do this by putting ages in bins of 10 years." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVELGENDER_catRACE_catETHNICITY_catMARITAL_STATUS_catEDUCATION_LEVEL_catservice_catMCI_UNIQ_ID_catAGE_BIN
0SYNTHETIC DATA1626202101/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-9
1SYNTHETIC DATA1626202102/28/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-9
2SYNTHETIC DATA1626202103/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-9
3SYNTHETIC DATA1626202104/30/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-9
4SYNTHETIC DATA1626202105/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-9
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "0 SYNTHETIC DATA 1626 2021 01/31/2021 \n", "1 SYNTHETIC DATA 1626 2021 02/28/2021 \n", "2 SYNTHETIC DATA 1626 2021 03/31/2021 \n", "3 SYNTHETIC DATA 1626 2021 04/30/2021 \n", "4 SYNTHETIC DATA 1626 2021 05/31/2021 \n", "\n", " service age GENDER RACE \\\n", "0 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "1 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "2 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "3 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "4 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat \\\n", "0 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "1 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "2 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "3 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "4 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "\n", " RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat \\\n", "0 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "1 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "2 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "3 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "4 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "\n", " service_cat MCI_UNIQ_ID_cat AGE_BIN \n", "0 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "1 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "2 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "3 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "4 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#first we will define a helper function that takes in an int\n", "#and returns the corresponding age bin value (grouped by 10), so a 5 gets returned as \"0-9\"\n", "def age_bin(age):\n", " if age < 10:\n", " return \"0-9\"\n", " elif age < 20:\n", " return \"10-19\"\n", " elif age < 30:\n", " return \"20-29\"\n", " elif age < 40:\n", " return \"30-39\"\n", " elif age < 50:\n", " return \"40-49\"\n", " elif age < 60:\n", " return \"50-59\"\n", " elif age < 70:\n", " return \"60-69\"\n", " elif age < 80:\n", " return \"70-79\"\n", " elif age < 90:\n", " return \"80-89\"\n", " else:\n", " return \"90+\"\n", " \n", "#is there a better way to do this? I am not sure, but we will go with this anyway. \n", "#it gets the job done.\n", "\n", "#we will now apply this function to the age column and create a new column called AGE_BIN\n", "df[\"AGE_BIN\"] = df[\"age\"].apply(age_bin)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another type of generation we can do is to split the date of event so we get the month in which each service occurs. We can do this by simply getting the first two characters of the `DATE_OF_EVENT` column. " ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
synthetic_dataMCI_UNIQ_IDCALDR_YRDATE_OF_EVENTserviceageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVELGENDER_catRACE_catETHNICITY_catMARITAL_STATUS_catEDUCATION_LEVEL_catservice_catMCI_UNIQ_ID_catAGE_BINmonth
0SYNTHETIC DATA1626202101/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-901/2021
1SYNTHETIC DATA1626202102/28/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-902/2021
2SYNTHETIC DATA1626202103/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-903/2021
3SYNTHETIC DATA1626202104/30/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-904/2021
4SYNTHETIC DATA1626202105/31/2021Children_Attending_Early_Childhood_Programs_Ma...51~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)1~Male1~White99~Unknown99~Unknown9-12~High School (grade 9-12)Children_Attending_Early_Childhood_Programs_Ma...16260-905/2021
\n", "
" ], "text/plain": [ " synthetic_data MCI_UNIQ_ID CALDR_YR DATE_OF_EVENT \\\n", "0 SYNTHETIC DATA 1626 2021 01/31/2021 \n", "1 SYNTHETIC DATA 1626 2021 02/28/2021 \n", "2 SYNTHETIC DATA 1626 2021 03/31/2021 \n", "3 SYNTHETIC DATA 1626 2021 04/30/2021 \n", "4 SYNTHETIC DATA 1626 2021 05/31/2021 \n", "\n", " service age GENDER RACE \\\n", "0 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "1 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "2 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "3 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "4 Children_Attending_Early_Childhood_Programs_Ma... 5 1~Male 1~White \n", "\n", " ETHNICITY MARITAL_STATUS EDUCATION_LEVEL GENDER_cat \\\n", "0 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "1 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "2 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "3 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "4 99~Unknown 99~Unknown 9-12~High School (grade 9-12) 1~Male \n", "\n", " RACE_cat ETHNICITY_cat MARITAL_STATUS_cat EDUCATION_LEVEL_cat \\\n", "0 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "1 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "2 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "3 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "4 1~White 99~Unknown 99~Unknown 9-12~High School (grade 9-12) \n", "\n", " service_cat MCI_UNIQ_ID_cat AGE_BIN \\\n", "0 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "1 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "2 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "3 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "4 Children_Attending_Early_Childhood_Programs_Ma... 1626 0-9 \n", "\n", " month \n", "0 01/2021 \n", "1 02/2021 \n", "2 03/2021 \n", "3 04/2021 \n", "4 05/2021 " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#make a new month column\n", "#this is the first two characters of the date column and the last 5 characters of the column\n", "df[\"month\"] = df[\"DATE_OF_EVENT\"].str[:2] + df[\"DATE_OF_EVENT\"].str[-5:]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} outlier removal\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outlier Removal\n", "Often times, data will have outliers. These are data points that are very far from the majority of the data. In the case of a dataset with all categorical variables, this is not often feasible. However, such an act is more common with numeric features. Let's try the age category and remove all the rows with age 105 or greater." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7115529, 20)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#want to remove all entries from the dataset where the age entry is 105 or more\n", "df = df[df[\"age\"] < 105]\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} illogical value removal\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Illogical Value Removal\n", "This portion of data transformations is slightly harder as we will have to look at multiple features at the same time. However, it is still important. Some data points in the dataset have children of age 5, yet lists them as in high school." ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6581506, 20)" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we want to remove all the values where the age is 0-9\n", "#AND where they are listed as in high school as \"9-12~High School (grade 9-12)\"\n", "df = df[~((df[\"AGE_BIN\"] == \"0-9\") & (df[\"EDUCATION_LEVEL\"] == \"9-12~High School (grade 9-12)\"))]\n", "df.shape\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Transformations\n", "\n", "Sometimes when doing exploratory data analysis, the storage format of the table does not give a complete view of the picture. In the case of this dataset, we found earlier that there were several rows with the same ID, which means that there are multiple entries for some people. This indicates that they used the services provided in the dataset multiple times. \n", "\n", "To get a better picture of some data insights, we can make a new dataframe; this one will have one row for each ID, and we will add the columns regarding the number of services used, number of distinct services used, and the number of months the services were used. Because the dataset is big, this may take a while.\n", "\n", "In general cases, to find new transformations for the datasets you work with, you will need to think about the general direction to take the data, which can involve simply thinking about the data logically. However, a better approach is to consult domain knowledge. This basically involves looking to sources related to the data (research articles, people who collected the data, people who are knowledgeable on the subject) and transforming the data as they may advise, or getting advice on where to look for ideas on transformations.\n", "\n", "Here, we are going to perform two aggregative transformations. These will group the entries of the dataset by the ID, as well as by the (ID, Month) pairing and compute some summary statistics, including the number of services used, the number of months the services were used, and the range of dates the services were used." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDnum_servicesdistinct_servicesfirst_datelast_datenum_monthsdistinct_month
0112101/31/202112/31/20211212
121102/28/202102/28/202111
2311102/28/202112/31/20211111
3512101/31/202112/31/20211212
4624301/31/202112/31/20212412
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID num_services distinct_services first_date last_date \\\n", "0 1 12 1 01/31/2021 12/31/2021 \n", "1 2 1 1 02/28/2021 02/28/2021 \n", "2 3 11 1 02/28/2021 12/31/2021 \n", "3 5 12 1 01/31/2021 12/31/2021 \n", "4 6 24 3 01/31/2021 12/31/2021 \n", "\n", " num_months distinct_month \n", "0 12 12 \n", "1 1 1 \n", "2 11 11 \n", "3 12 12 \n", "4 24 12 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_client_attributes(services):\n", " client_attributes = services.groupby(['MCI_UNIQ_ID']).agg(\n", " num_services = ('service', 'count'),\n", " distinct_services = ('service', 'nunique'), \n", " first_date = ('DATE_OF_EVENT', 'min'), \n", " last_date = ('DATE_OF_EVENT', 'max'), \n", " num_months = ('month', 'count'), \n", " distinct_month = ('month', 'nunique')\n", " ).reset_index()\n", "\n", " return client_attributes\n", "\n", "client_attributes = get_client_attributes(df)\n", "client_attributes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we perform the client-month attribute aggregations." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDmonthnum_servicesdistinct_services
0101/202111
1102/202111
2103/202111
3104/202111
4105/202111
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID month num_services distinct_services\n", "0 1 01/2021 1 1\n", "1 1 02/2021 1 1\n", "2 1 03/2021 1 1\n", "3 1 04/2021 1 1\n", "4 1 05/2021 1 1" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_client_date_attributes(services):\n", " client_date_attributes = services.groupby(['MCI_UNIQ_ID', 'month']).agg(\n", " num_services = ('service', 'count'),\n", " distinct_services = ('service', 'nunique')\n", " ).reset_index()\n", "\n", " return client_date_attributes\n", "\n", "client_date_attributes = get_client_date_attributes(df)\n", "client_date_attributes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can merge the datasets in a sense so that some of the information about each person can carry over to the new datasets we've created. Specifically, we want to carry over gender, race, ethnicity, education level, and marital status. " ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDnum_servicesdistinct_servicesfirst_datelast_datenum_monthsdistinct_monthageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVEL
0112101/31/202112/31/20211212131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
1221102/28/202102/28/202111702~Female1~White2~Not Hispanic/Latinx2~Married99~Unknown
13311102/28/202112/31/20211111292~Female1~White2~Not Hispanic/Latinx2~Married9-12~High School (grade 9-12)
24512101/31/202112/31/20211212401~Male2~Black/African American2~Not Hispanic/Latinx2~MarriedUNDERGRAD~Some College
36624301/31/202112/31/20212412712~Female2~Black/African American2~Not Hispanic/Latinx1~Single-Never Married99~Unknown
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID num_services distinct_services first_date last_date \\\n", "0 1 12 1 01/31/2021 12/31/2021 \n", "12 2 1 1 02/28/2021 02/28/2021 \n", "13 3 11 1 02/28/2021 12/31/2021 \n", "24 5 12 1 01/31/2021 12/31/2021 \n", "36 6 24 3 01/31/2021 12/31/2021 \n", "\n", " num_months distinct_month age GENDER RACE \\\n", "0 12 12 13 1~Male 99~Unknown \n", "12 1 1 70 2~Female 1~White \n", "13 11 11 29 2~Female 1~White \n", "24 12 12 40 1~Male 2~Black/African American \n", "36 24 12 71 2~Female 2~Black/African American \n", "\n", " ETHNICITY MARITAL_STATUS \\\n", "0 2~Not Hispanic/Latinx 99~Unknown \n", "12 2~Not Hispanic/Latinx 2~Married \n", "13 2~Not Hispanic/Latinx 2~Married \n", "24 2~Not Hispanic/Latinx 2~Married \n", "36 2~Not Hispanic/Latinx 1~Single-Never Married \n", "\n", " EDUCATION_LEVEL \n", "0 GED~High School Diploma/GED Completed \n", "12 99~Unknown \n", "13 9-12~High School (grade 9-12) \n", "24 UNDERGRAD~Some College \n", "36 99~Unknown " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#start with client_attributes\n", "#we want to add columns of age, gender, race, ethnicity, education level, and marital status\n", "#from the original dataset, making sure IDs match up and that there is only one occurrence of each ID.\n", "client_attributes = pd.merge(client_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')\n", "#now remove duplicate rows\n", "client_attributes = client_attributes.drop_duplicates(subset=['MCI_UNIQ_ID'])\n", "client_attributes.head()" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MCI_UNIQ_IDmonthnum_servicesdistinct_servicesageGENDERRACEETHNICITYMARITAL_STATUSEDUCATION_LEVEL
0101/202111131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
12102/202111131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
24103/202111131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
36104/202111131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
48105/202111131~Male99~Unknown2~Not Hispanic/Latinx99~UnknownGED~High School Diploma/GED Completed
\n", "
" ], "text/plain": [ " MCI_UNIQ_ID month num_services distinct_services age GENDER \\\n", "0 1 01/2021 1 1 13 1~Male \n", "12 1 02/2021 1 1 13 1~Male \n", "24 1 03/2021 1 1 13 1~Male \n", "36 1 04/2021 1 1 13 1~Male \n", "48 1 05/2021 1 1 13 1~Male \n", "\n", " RACE ETHNICITY MARITAL_STATUS \\\n", "0 99~Unknown 2~Not Hispanic/Latinx 99~Unknown \n", "12 99~Unknown 2~Not Hispanic/Latinx 99~Unknown \n", "24 99~Unknown 2~Not Hispanic/Latinx 99~Unknown \n", "36 99~Unknown 2~Not Hispanic/Latinx 99~Unknown \n", "48 99~Unknown 2~Not Hispanic/Latinx 99~Unknown \n", "\n", " EDUCATION_LEVEL \n", "0 GED~High School Diploma/GED Completed \n", "12 GED~High School Diploma/GED Completed \n", "24 GED~High School Diploma/GED Completed \n", "36 GED~High School Diploma/GED Completed \n", "48 GED~High School Diploma/GED Completed " ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we'll do the same with client_date_attributes\n", "client_date_attributes = pd.merge(client_date_attributes, df[['MCI_UNIQ_ID', 'age', 'GENDER', 'RACE', 'ETHNICITY', 'MARITAL_STATUS', 'EDUCATION_LEVEL']], on='MCI_UNIQ_ID', how='left')\n", "#now remove rows that have duplicate ID-month pairs\n", "client_date_attributes = client_date_attributes.drop_duplicates(subset=['MCI_UNIQ_ID', 'month'])\n", "client_date_attributes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving Data Transformations\n", "If you want to close your work and come back to it easily without having to rerun all of the above code, or if you want to use the same data in a new notebook, an easy workaround is to save the data to an intermediate file and then reload it. We'll be saving all of our datasets to files that we will then plot in the next section." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "#we will save df to a new csv file\n", "df.to_csv('dhs_cleaned.csv', index=False)\n", "client_attributes.to_csv('dhs_client_attributes.csv', index=False)\n", "client_date_attributes.to_csv('dhs_client_date_attributes.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## EDA Project: Part 1\n", "Answer the following questions:\n", " - Which MCI_UNIQ_ID has the highest number of records among all IDs?\n", " - What is the average number of interactions for clients in the file?\n", " - How many months did a particular client use the service?\n", " - What other general observations can you make from the data?" ] } ], "metadata": { "colab": { "name": "AwanTunai.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" }, "vscode": { "interpreter": { "hash": "e57d65b3db466a255fb366a0fd9ddb903b84d832e81688bf0eb6e7b10d96915d" } } }, "nbformat": 4, "nbformat_minor": 4 }