{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Investigation I\n",
    "\n",
    "Often times when we explore data for the first time, we have to spend time looking at what is missing in the data, or what needs to be changed. Data files rarely if ever are perfect for use immediately once you get them. Additionally, it is often unwise to utilize data for predictive modeling or statistical analysis without knowing some of the context for the data and looking at its contents.\n",
    "\n",
    "Python has a helpful package called `pandas` which is purpose-built for looking at and manipulating data, and a lot of what we will work on in this section will involve using it. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{index} pandas\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education</th>\n",
       "      <th>education-num</th>\n",
       "      <th>marital-status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "      <th>native-country</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>77516</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2174</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>83311</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>Private</td>\n",
       "      <td>215646</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>53</td>\n",
       "      <td>Private</td>\n",
       "      <td>234721</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>28</td>\n",
       "      <td>Private</td>\n",
       "      <td>338409</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age          workclass   fnlwgt   education   education-num  \\\n",
       "0   39          State-gov    77516   Bachelors              13   \n",
       "1   50   Self-emp-not-inc    83311   Bachelors              13   \n",
       "2   38            Private   215646     HS-grad               9   \n",
       "3   53            Private   234721        11th               7   \n",
       "4   28            Private   338409   Bachelors              13   \n",
       "\n",
       "        marital-status          occupation    relationship    race      sex  \\\n",
       "0        Never-married        Adm-clerical   Not-in-family   White     Male   \n",
       "1   Married-civ-spouse     Exec-managerial         Husband   White     Male   \n",
       "2             Divorced   Handlers-cleaners   Not-in-family   White     Male   \n",
       "3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   \n",
       "4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   \n",
       "\n",
       "    capital-gain   capital-loss   hours-per-week  native-country  income  \n",
       "0           2174              0               40   United-States   <=50K  \n",
       "1              0              0               13   United-States   <=50K  \n",
       "2              0              0               40   United-States   <=50K  \n",
       "3              0              0               40   United-States   <=50K  \n",
       "4              0              0               40            Cuba   <=50K  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "#read in data\n",
    "data = pd.read_csv('census_income.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`data.head()` is a good way to get a good look at the first few rows of the data. \n",
    "\n",
    ":::{admonition} Exercise 2.1\n",
    "Thinking of the opposite of the head, how do we get the last few rows of the data?\n",
    ":::\n",
    "\n",
    "First things first is to get a look at the data. How many rows (samples) are there? How many columns (variables) are there? These numbers correspond to the *shape* of the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(32561, 15)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that this dataset has 32,561 instances of 15 variables. This may seem like a lot, but it is actually relatively small compared to many other modern datasets, like the one used in the advanced EDA portion of this chapter, which has millions of instances!\n",
    "\n",
    "Secondly, we should get a look at the features themselves. What do they mean? What datatype are they? While we can do the first one by looking at the names of the columns by calling `data.head()`, this does not allow us to do the second. So we can instead do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 32561 entries, 0 to 32560\n",
      "Data columns (total 15 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   age              32561 non-null  int64 \n",
      " 1    workclass       30725 non-null  object\n",
      " 2    fnlwgt          32561 non-null  int64 \n",
      " 3    education       32561 non-null  object\n",
      " 4    education-num   32561 non-null  int64 \n",
      " 5    marital-status  32561 non-null  object\n",
      " 6    occupation      30718 non-null  object\n",
      " 7    relationship    32561 non-null  object\n",
      " 8    race            32561 non-null  object\n",
      " 9    sex             32561 non-null  object\n",
      " 10   capital-gain    32561 non-null  int64 \n",
      " 11   capital-loss    32561 non-null  int64 \n",
      " 12   hours-per-week  32561 non-null  int64 \n",
      " 13   native-country  31978 non-null  object\n",
      " 14   income          32561 non-null  object\n",
      "dtypes: int64(6), object(9)\n",
      "memory usage: 3.7+ MB\n"
     ]
    }
   ],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tells us the datatypes of each variable and gives us a full list of the variables. The `object` datatype corresponds to strings, such as \"United States\" or \"Female\", while the `int64` datatype corresponds to integer numbers.\n",
    "\n",
    ":::{admonition} Exercise 2.2\n",
    "Hypothesize as to what each variable name indicates. Then consult the website where the dataset is found, https://archive.ics.uci.edu/dataset/20/census+income. Do your hypotheses match the actual meaning?\n",
    ":::\n",
    "\n",
    "While you may or may not have been able to guess what each variable meant just by looking at the name, it is often the case in the real world that you may have no idea what some variable names mean. This implies that you will have to look to outside resources, especially your data source, to understand what these might mean.\n",
    "\n",
    "Next, we can get quick summaries of the data. This is very easy to do with `pandas` dataframes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{index} summary statistics\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education-num</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>32561.000000</td>\n",
       "      <td>3.256100e+04</td>\n",
       "      <td>32561.000000</td>\n",
       "      <td>32561.000000</td>\n",
       "      <td>32561.000000</td>\n",
       "      <td>32561.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>38.581647</td>\n",
       "      <td>1.897784e+05</td>\n",
       "      <td>10.080679</td>\n",
       "      <td>1077.648844</td>\n",
       "      <td>87.303830</td>\n",
       "      <td>40.437456</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>13.640433</td>\n",
       "      <td>1.055500e+05</td>\n",
       "      <td>2.572720</td>\n",
       "      <td>7385.292085</td>\n",
       "      <td>402.960219</td>\n",
       "      <td>12.347429</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>1.228500e+04</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>28.000000</td>\n",
       "      <td>1.178270e+05</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>37.000000</td>\n",
       "      <td>1.783560e+05</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>48.000000</td>\n",
       "      <td>2.370510e+05</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>45.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>90.000000</td>\n",
       "      <td>1.484705e+06</td>\n",
       "      <td>16.000000</td>\n",
       "      <td>99999.000000</td>\n",
       "      <td>4356.000000</td>\n",
       "      <td>99.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                age        fnlwgt   education-num   capital-gain  \\\n",
       "count  32561.000000  3.256100e+04    32561.000000   32561.000000   \n",
       "mean      38.581647  1.897784e+05       10.080679    1077.648844   \n",
       "std       13.640433  1.055500e+05        2.572720    7385.292085   \n",
       "min       17.000000  1.228500e+04        1.000000       0.000000   \n",
       "25%       28.000000  1.178270e+05        9.000000       0.000000   \n",
       "50%       37.000000  1.783560e+05       10.000000       0.000000   \n",
       "75%       48.000000  2.370510e+05       12.000000       0.000000   \n",
       "max       90.000000  1.484705e+06       16.000000   99999.000000   \n",
       "\n",
       "        capital-loss   hours-per-week  \n",
       "count   32561.000000     32561.000000  \n",
       "mean       87.303830        40.437456  \n",
       "std       402.960219        12.347429  \n",
       "min         0.000000         1.000000  \n",
       "25%         0.000000        40.000000  \n",
       "50%         0.000000        40.000000  \n",
       "75%         0.000000        45.000000  \n",
       "max      4356.000000        99.000000  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{index} N/A value\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will describe all the numerical variables, so long as no N/A values exist in these variables. Note that one N/A value will make all summary statistics in the description table show up as N/A, except for count.\n",
    "\n",
    "Therefore, you can see that removing N/A values is important. We were lucky to stumble across a dataset with no N/A values in the numerical dataset. However, this dataset does have N/A values in it! This can be problematic."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{index} cleaning data\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleaning Data\n",
    "\n",
    "Data cleaning has a lot of meanings in it, but most commonly, the type of cleaning you do to make a dataset more usable for analysis or machine learning is to remove or somehow manage N/A values. These values are empty and can bias any conclusions you make from the data if they are still present when you perform further analysis.\n",
    "\n",
    "To find the number of N/As in the dataset, we can call a function that only isolates N/A values from the dataset and sums them for every column.\n",
    "\n",
    ":::{warning}\n",
    "If you use the raw dataset from the UC Irvine Machine Learning Repository, the N/A values are not actually given as N/A in the raw data, but rather as a question mark `?` and later functions may not count these as N/A values. You can easily replace these with the line `data = data.replace('?', np.nan)`.\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age                   0\n",
       " workclass         1836\n",
       " fnlwgt               0\n",
       " education            0\n",
       " education-num        0\n",
       " marital-status       0\n",
       " occupation        1843\n",
       " relationship         0\n",
       " race                 0\n",
       " sex                  0\n",
       " capital-gain         0\n",
       " capital-loss         0\n",
       " hours-per-week       0\n",
       " native-country     583\n",
       " income               0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset is relatively clean and so it has many columns with no missing values; the columns with missing values have relatively few of them. In the case where the N/A values are relatively few, we can remove the rows with an N/A value in any column. Let's try that out and see how it changes our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(30162, 15)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#remove rows with missing values\n",
    "data = data.dropna()\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This simple act dropped around 2000 rows from our dataset. Now we should be ready to use this dataset for further analysis and get insights from it, which we discuss in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More Exercises\n",
    "\n",
    ":::{admonition} Exercise 2.3\n",
    "Re-calculate the summary statistics after dropping N/A values and describe any changes.\n",
    ":::\n",
    "\n",
    ":::{admonition} Exercise 2.4\n",
    "To describe categorical data, we can have the program print out the different categories within the variable, and the number of instances of each category. Try this out for the `education` column in the dataset.\n",
    ":::\n",
    "\n",
    ":::{admonition} Exercise 2.5\n",
    "(Bonus) Write a function that can do Exercise 2.4 for any column of your choosing.\n",
    ":::"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}