{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Investigation I\n", "\n", "Often times when we explore data for the first time, we have to spend time looking at what is missing in the data, or what needs to be changed. Data files rarely if ever are perfect for use immediately once you get them. Additionally, it is often unwise to utilize data for predictive modeling or statistical analysis without knowing some of the context for the data and looking at its contents.\n", "\n", "Python has a helpful package called `pandas` which is purpose-built for looking at and manipulating data, and a lot of what we will work on in this section will involve using it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} pandas\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countryincome
039State-gov77516Bachelors13Never-marriedAdm-clericalNot-in-familyWhiteMale2174040United-States<=50K
150Self-emp-not-inc83311Bachelors13Married-civ-spouseExec-managerialHusbandWhiteMale0013United-States<=50K
238Private215646HS-grad9DivorcedHandlers-cleanersNot-in-familyWhiteMale0040United-States<=50K
353Private23472111th7Married-civ-spouseHandlers-cleanersHusbandBlackMale0040United-States<=50K
428Private338409Bachelors13Married-civ-spouseProf-specialtyWifeBlackFemale0040Cuba<=50K
\n", "
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "\n", " marital-status occupation relationship race sex \\\n", "0 Never-married Adm-clerical Not-in-family White Male \n", "1 Married-civ-spouse Exec-managerial Husband White Male \n", "2 Divorced Handlers-cleaners Not-in-family White Male \n", "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", "4 Married-civ-spouse Prof-specialty Wife Black Female \n", "\n", " capital-gain capital-loss hours-per-week native-country income \n", "0 2174 0 40 United-States <=50K \n", "1 0 0 13 United-States <=50K \n", "2 0 0 40 United-States <=50K \n", "3 0 0 40 United-States <=50K \n", "4 0 0 40 Cuba <=50K " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "#read in data\n", "data = pd.read_csv('census_income.csv')\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`data.head()` is a good way to get a good look at the first few rows of the data. \n", "\n", ":::{admonition} Exercise 2.1\n", "Thinking of the opposite of the head, how do we get the last few rows of the data?\n", ":::\n", "\n", "First things first is to get a look at the data. How many rows (samples) are there? How many columns (variables) are there? These numbers correspond to the *shape* of the dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32561, 15)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this dataset has 32,561 instances of 15 variables. This may seem like a lot, but it is actually relatively small compared to many other modern datasets, like the one used in the advanced EDA portion of this chapter, which has millions of instances!\n", "\n", "Secondly, we should get a look at the features themselves. What do they mean? What datatype are they? While we can do the first one by looking at the names of the columns by calling `data.head()`, this does not allow us to do the second. So we can instead do the following:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 32561 entries, 0 to 32560\n", "Data columns (total 15 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 age 32561 non-null int64 \n", " 1 workclass 30725 non-null object\n", " 2 fnlwgt 32561 non-null int64 \n", " 3 education 32561 non-null object\n", " 4 education-num 32561 non-null int64 \n", " 5 marital-status 32561 non-null object\n", " 6 occupation 30718 non-null object\n", " 7 relationship 32561 non-null object\n", " 8 race 32561 non-null object\n", " 9 sex 32561 non-null object\n", " 10 capital-gain 32561 non-null int64 \n", " 11 capital-loss 32561 non-null int64 \n", " 12 hours-per-week 32561 non-null int64 \n", " 13 native-country 31978 non-null object\n", " 14 income 32561 non-null object\n", "dtypes: int64(6), object(9)\n", "memory usage: 3.7+ MB\n" ] } ], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us the datatypes of each variable and gives us a full list of the variables. The `object` datatype corresponds to strings, such as \"United States\" or \"Female\", while the `int64` datatype corresponds to integer numbers.\n", "\n", ":::{admonition} Exercise 2.2\n", "Hypothesize as to what each variable name indicates. Then consult the website where the dataset is found, https://archive.ics.uci.edu/dataset/20/census+income. Do your hypotheses match the actual meaning?\n", ":::\n", "\n", "While you may or may not have been able to guess what each variable meant just by looking at the name, it is often the case in the real world that you may have no idea what some variable names mean. This implies that you will have to look to outside resources, especially your data source, to understand what these might mean.\n", "\n", "Next, we can get quick summaries of the data. This is very easy to do with `pandas` dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} summary statistics\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
count32561.0000003.256100e+0432561.00000032561.00000032561.00000032561.000000
mean38.5816471.897784e+0510.0806791077.64884487.30383040.437456
std13.6404331.055500e+052.5727207385.292085402.96021912.347429
min17.0000001.228500e+041.0000000.0000000.0000001.000000
25%28.0000001.178270e+059.0000000.0000000.00000040.000000
50%37.0000001.783560e+0510.0000000.0000000.00000040.000000
75%48.0000002.370510e+0512.0000000.0000000.00000045.000000
max90.0000001.484705e+0616.00000099999.0000004356.00000099.000000
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain \\\n", "count 32561.000000 3.256100e+04 32561.000000 32561.000000 \n", "mean 38.581647 1.897784e+05 10.080679 1077.648844 \n", "std 13.640433 1.055500e+05 2.572720 7385.292085 \n", "min 17.000000 1.228500e+04 1.000000 0.000000 \n", "25% 28.000000 1.178270e+05 9.000000 0.000000 \n", "50% 37.000000 1.783560e+05 10.000000 0.000000 \n", "75% 48.000000 2.370510e+05 12.000000 0.000000 \n", "max 90.000000 1.484705e+06 16.000000 99999.000000 \n", "\n", " capital-loss hours-per-week \n", "count 32561.000000 32561.000000 \n", "mean 87.303830 40.437456 \n", "std 402.960219 12.347429 \n", "min 0.000000 1.000000 \n", "25% 0.000000 40.000000 \n", "50% 0.000000 40.000000 \n", "75% 0.000000 45.000000 \n", "max 4356.000000 99.000000 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} N/A value\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will describe all the numerical variables, so long as no N/A values exist in these variables. Note that one N/A value will make all summary statistics in the description table show up as N/A, except for count.\n", "\n", "Therefore, you can see that removing N/A values is important. We were lucky to stumble across a dataset with no N/A values in the numerical dataset. However, this dataset does have N/A values in it! This can be problematic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{index} cleaning data\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Data\n", "\n", "Data cleaning has a lot of meanings in it, but most commonly, the type of cleaning you do to make a dataset more usable for analysis or machine learning is to remove or somehow manage N/A values. These values are empty and can bias any conclusions you make from the data if they are still present when you perform further analysis.\n", "\n", "To find the number of N/As in the dataset, we can call a function that only isolates N/A values from the dataset and sums them for every column.\n", "\n", ":::{warning}\n", "If you use the raw dataset from the UC Irvine Machine Learning Repository, the N/A values are not actually given as N/A in the raw data, but rather as a question mark `?` and later functions may not count these as N/A values. You can easily replace these with the line `data = data.replace('?', np.nan)`.\n", ":::" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 0\n", " workclass 1836\n", " fnlwgt 0\n", " education 0\n", " education-num 0\n", " marital-status 0\n", " occupation 1843\n", " relationship 0\n", " race 0\n", " sex 0\n", " capital-gain 0\n", " capital-loss 0\n", " hours-per-week 0\n", " native-country 583\n", " income 0\n", "dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is relatively clean and so it has many columns with no missing values; the columns with missing values have relatively few of them. In the case where the N/A values are relatively few, we can remove the rows with an N/A value in any column. Let's try that out and see how it changes our data." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30162, 15)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#remove rows with missing values\n", "data = data.dropna()\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This simple act dropped around 2000 rows from our dataset. Now we should be ready to use this dataset for further analysis and get insights from it, which we discuss in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Exercises\n", "\n", ":::{admonition} Exercise 2.3\n", "Re-calculate the summary statistics after dropping N/A values and describe any changes.\n", ":::\n", "\n", ":::{admonition} Exercise 2.4\n", "To describe categorical data, we can have the program print out the different categories within the variable, and the number of instances of each category. Try this out for the `education` column in the dataset.\n", ":::\n", "\n", ":::{admonition} Exercise 2.5\n", "(Bonus) Write a function that can do Exercise 2.4 for any column of your choosing.\n", ":::" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 2 }