# import libraries
# Always run this cell first!
import numpy as np
import pandas as pd
import scipy
import statsmodels.api # importing the api also loads submodules such as statsmodels.stats, which a bare "import statsmodels" does not
import statsmodels as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.graph_objects as go
import plotly.offline
plotly.offline.init_notebook_mode(connected=True) # make plotly work with Jupyter Notebook using CDN
6.7. JNB Lab: The Framingham Heart Study
The Framingham Heart Study is an ongoing study of cardiovascular health in the United States. The initial study followed over 5,000 volunteers from Framingham, Massachusetts, USA over the course of several decades, and it still continues today. The study led to important findings in many areas, including a link between cholesterol and heart disease. (These exercises are inspired by an assignment from UC Berkeley’s Data 8.)
We load in the data below from the file framingham.csv (click here to download a copy). There are a number of different interesting variables to explore, but we will focus on exploring total cholesterol levels (TOTCHOL) versus the occurrence of heart disease (ANYCHD). The variable ANYCHD takes the value 0 if the patient does not have heart disease and the value 1 if they do.
framingham = pd.read_csv("framingham.csv")
framingham.head()
| | AGE | SYSBP | DIABP | TOTCHOL | CURSMOKE | DIABETES | GLUCOSE | DEATH | ANYCHD |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 106.0 | 70.0 | 195.0 | 0 | 0 | 77.0 | 0 | 1 |
| 1 | 46 | 121.0 | 81.0 | 250.0 | 0 | 0 | 76.0 | 0 | 0 |
| 2 | 48 | 127.5 | 80.0 | 245.0 | 1 | 0 | 70.0 | 0 | 0 |
| 3 | 61 | 150.0 | 95.0 | 225.0 | 1 | 0 | 103.0 | 1 | 0 |
| 4 | 46 | 130.0 | 84.0 | 285.0 | 1 | 0 | 85.0 | 0 | 0 |
6.7.1. Part 1: Explore the data
As always, it’s important to take a look at your data before you dive into any kind of inference. We first note that we have both categorical and numerical variables, which changes what types of visualization we might be interested in. First, we’ll look at the size of the data frame, which shows that we have 3842 observations of 9 variables.
framingham.shape
(3842, 9)
Since we are interested in how cholesterol connects to the occurrence of heart disease, we next explore those variables. Like the penguin example in the text, we will separate our data into two samples. Then we’ll calculate the mean total cholesterol level for each group.
chd = framingham[framingham['ANYCHD'] == 1]
nochd = framingham[framingham['ANYCHD'] == 0]
print(f"The mean total cholesterol for those with an occurrence of CHD is {chd['TOTCHOL'].mean():.3f}.")
print(f"The mean total cholesterol for those without occurrence of CHD is {nochd['TOTCHOL'].mean():.3f}.")
The mean total cholesterol for those with an occurrence of CHD is 249.482.
The mean total cholesterol for those without occurrence of CHD is 232.846.
The means are different, but it’s not clear whether the difference is large or small compared to the natural variation in the data. That depends on the sample sizes and the distribution of cholesterol levels!
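To see why sample size matters here, the following is a minimal sketch using made-up, normally distributed values (not the Framingham data). The helper `standard_error_of_difference` is a hypothetical name introduced only for this illustration; it computes the unpooled standard error of a difference in sample means. The same underlying gap in means is much easier to detect when the samples are large.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_error_of_difference(a, b):
    """Unpooled standard error of mean(a) - mean(b) (hypothetical helper)."""
    return np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# Synthetic values chosen only for illustration; these are NOT study data.
small_a = rng.normal(250, 45, size=20)
small_b = rng.normal(233, 45, size=20)
large_a = rng.normal(250, 45, size=2000)
large_b = rng.normal(233, 45, size=2000)

se_small = standard_error_of_difference(small_a, small_b)
se_large = standard_error_of_difference(large_a, large_b)

print(f"Standard error of the difference with n = 20 per group:   {se_small:.2f}")
print(f"Standard error of the difference with n = 2000 per group: {se_large:.2f}")
```

A difference in means is "large" only relative to this standard error, which shrinks as the samples grow.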
Write a line or two of code to figure out how many people in the study have an occurrence of CHD and how many do not.
Make a histogram of the cholesterol levels for both samples. Describe the center and shape of each distribution, and compare the two.
6.7.2. Part 2: Two Sample T-tests
We want to determine whether there is a true difference in average cholesterol levels between people with heart disease and those without. The means differ in our sample, but we need to see whether the evidence is strong enough to make a claim about the population. We are testing the hypotheses:
\( H_0: \mu_1 = \mu_2\) \( H_1: \mu_1 \neq \mu_2\)
where \(\mu_1\) and \(\mu_2\) represent the average total cholesterol levels of the CHD and No CHD populations, respectively. But first we need to know if using a T-test is valid!
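For reference, one common form of the two-sample test statistic is the pooled version, which is what the ttest_ind method of a statsmodels CompareMeans object computes by default. The symbols \(\bar{x}_i\), \(s_i\), and \(n_i\) (sample mean, sample standard deviation, and sample size of group \(i\)) are notation introduced here, not taken from the lab itself.

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
\]

Under \(H_0\), and when the test's assumptions hold, this statistic follows a \(t\) distribution with \(n_1 + n_2 - 2\) degrees of freedom.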
Describe the assumptions of the T-test and comment on whether they hold for this example. Your work in Part 1 should be enough to decide.
Once we know that the test is appropriate for our problem, we want to perform the test. Recall that to do this we want to set up a CompareMeans object from our two samples. Just like in the chapter, we do that in the following way.
sample1 = chd['TOTCHOL']
sample2 = nochd['TOTCHOL']
# create a CompareMeans object from the two samples
cm = sm.stats.weightstats.CompareMeans.from_data(sample1, sample2)
Now we are ready to perform the test!
Compute the test statistic and \(P\)-value using cm.ttest_ind.
Write your conclusion in a complete sentence.
Give a confidence interval for the difference in average total cholesterol.
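As a reminder of how the CompareMeans workflow fits together, here is a minimal sketch on synthetic data (made-up values, not the Framingham samples). The names `group_a`, `group_b`, and `cm_demo` are hypothetical and used only for this illustration; the sketch assumes the default settings of ttest_ind (two-sided alternative, pooled variance).

```python
import numpy as np
from statsmodels.stats.weightstats import CompareMeans

rng = np.random.default_rng(1)

# Synthetic samples for illustration only; these are NOT study data.
group_a = rng.normal(250, 45, size=300)
group_b = rng.normal(233, 45, size=900)

# Build the CompareMeans object from the two samples
cm_demo = CompareMeans.from_data(group_a, group_b)

# ttest_ind returns the test statistic, the P-value, and the degrees of freedom
t_stat, p_value, dof = cm_demo.ttest_ind()

# tconfint_diff returns the lower and upper bounds of a confidence interval
# for the difference in means (alpha=0.05 gives a 95% interval)
ci_low, ci_high = cm_demo.tconfint_diff(alpha=0.05)

print(f"t = {t_stat:.3f}, P-value = {p_value:.4g}, df = {dof:.0f}")
print(f"95% CI for the difference in means: ({ci_low:.2f}, {ci_high:.2f})")
```

Replacing the synthetic arrays with sample1 and sample2 from above applies the same steps to the cholesterol data.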