2.4. Glimpse of Chicago#
Note
In the summer of 2020, Hope Wood (Wheaton College Urban Studies staff), Demetrius Crawford (Wheaton College Urban Studies outreach program participant), and I (Dr. Paul Isihara, Wheaton College math professor) launched the idea of creating JNB labs as a tool to teach math, data analysis, and STEAM to disadvantaged Chicago grade-school students.
In Fall 2020, Mr. Jim Wilkes, principal of Cornerstone Academy, an alternative Christian high school serving mainly low- to middle-income students on the West Side of Chicago, held the pilot offering of a JNB after-school program via Zoom. A class of Wheaton College student mentors worked online in small groups of 2 or 3 with Cornerstone students. Mr. Wilkes and I served as co-program directors, and Hope Wood and Antoinette Ratliff as program staff.
The following JNBs were developed for use with 8th graders participating in an after-school program at a non-profit organization called Celestial Ministries. For the first two years, the scholars’ middle school math teacher brought them to the program and participated in the class. I served as teacher; our goal was for the middle school teacher to eventually take over the primary teaching responsibilities.
Celestial Ministries serves mainly children of single-parent families in the Lawndale neighborhood on the West Side of Chicago. Parents and friends were invited to a recognition ceremony in which students demonstrated various functionalities of JNBs covered in the program. Students who successfully completed the program were given a certificate and a 100 dollar stipend.
Word of Advice
There are likely opportunities to introduce Python Jupyter Notebooks in disadvantaged communities near your workplace or residence. Rather than trying to start a program on your own, we highly recommend establishing a good working relationship with a well-established community organization and a local community math teacher.
2.4.1. Michael Jordan’s Greatest Scoring Game#
Note
In this section, we will introduce a very important programming tool called a function.
STEP #1 Review the PowerPoint on this lesson: link to PPT
STEP #2 From the PowerPoint, a variable is like a b ____ . A variable has a n ______ . We can store some i ______ such as a number (e.g., 3) or a message (e.g., ‘God is Good!’) in a variable. Let’s give names to two variables:
feet
inch
In Michael Jordan’s case, the info stored in the variable feet is the number 6, and the info stored in the variable inch is also 6. For short, we just write feet=6 and inch=6. Let’s go around and each person state your name and your values for the variables feet and inch.
(Answers: box, name, info)
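For example, here is how we store Michael Jordan’s info in these two variables in a code cell:
feet=6 #store the number 6 in the variable feet
inch=6 #store the number 6 in the variable inch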
STEP #3 The word for a short computer program is a c ______. The programming language we are using is Py ______ n. In the next cell, we give Python code to create a function called height that uses the variables from STEP #2 to compute a height in inches. (If it looks complicated, try to find someone to help explain it to you.)
(Answers: code, Python)
def height(feet,inch):
    height_in_inches = 12*feet + inch
    return height_in_inches
height(6,0) #Test on Seimone's height feet=6, inch=0
72
STEP #4 What do we have to do in the next cell to test our function on 6’6” Michael Jordan?
# Test our function on Michael Jordan's height.
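(One possible answer, if you get stuck: since Michael Jordan is 6’6”, we call the function with feet=6 and inch=6.)
height(6,6) #Michael Jordan: 12*6 + 6 = 78 inches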
STEP #5 One of the fun parts of programming is that we can try to design a function to do almost anything we want.
Define a function called points(fg,ft,tp) which computes the points scored by a basketball player, given the input variables
fg = number of 2-point field goals made
ft = number of free throws made
tp = number of 3-point shots made
Use a variable called pts to store the answer.
STEP #6 In his highest scoring game, Michael Jordan made 21 two-point field goals, 21 free throws, and 2 three-point shots. Check that your function gives the correct value of 69 for his point total.
#Supply the missing values for points(fg,ft,tp)
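If you get stuck, here is one possible sketch of the function (your version may look different):
def points(fg,ft,tp):
    pts = 2*fg + ft + 3*tp #2 points per field goal, 1 per free throw, 3 per three-pointer
    return pts
points(21,21,2) #should give 69 for Michael Jordan's highest scoring game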
For Discussion
What would you like your mentors to explain about today’s lesson?
What is one of your favorite sports memories?
2.4.2. How Bad is COVID-19?#
Note
In this section we will analyze data imported directly from the Chicago Data Portal.
STEP ONE Let’s read in up to 100,000 rows of COVID data for Chicago from the Chicago Data Portal.
Show code cell source
import pandas as pd
import numpy as np
rawCOVID = pd.read_json('https://data.cityofchicago.org/resource/yhhz-zm2v.json?$limit=100000') #Import data directly from data portal
rawCOVID.head(10)
Let’s check the number of rows and columns.
rawCOVID.shape
We’ll list the 26 column names.
list(rawCOVID.columns)
Q1 Does case matter when referring to a column name?
STEP TWO Let’s streamline the data to just 4 columns, drop rows with missing data, simplify the column names, and then display the first 20 rows.
COVID=rawCOVID[['deaths_cumulative', "population", 'tests_cumulative','zip_code']].copy()
COVID=COVID.dropna() #drop rows with missing data
COVID.columns=["deaths","population","tests","zip"] #simplify the column names
COVID.head(20)
Q2 Why is there different information for the same zip code?
STEP THREE
Let’s check how many rows have data in each column.
COVID.count()
Q3 How many data rows are there in each column?
STEP FOUR
Let’s find out how many times each zip code appears.
COVID["zip"].value_counts()
STEP FIVE
Let’s make a copy of the COVID dataframe and display the first 5 rows.
df1=COVID.copy()
df1.head(5)
STEP SIX
Let’s filter the data by a specific zip code, for example, ‘60611’.
df2=df1[df1['zip']=='60611']
df2.head()
df2["zip"].value_counts()
STEP SEVEN
Let’s find the largest value in the ‘tests’ column for zip ‘60611’.
df2 = df1[df1["zip"]=='60611'] #get just rows with zip 60611
numtested=df2["tests"].max() #get the largest number of tests
numtested
Q4 How can we find how many deaths due to COVID have occurred in zip ‘60623’? (Create a new dataframe for 60623 called temp)
Show code cell source
temp = COVID[COVID["zip"]=='60623'] #get just rows with zip 60623
numdeaths=temp["deaths"].max() #get the largest number of deaths
numdeaths
Q5 How can we get the population in 60623?
Show code cell source
df3 = COVID[COVID["zip"]=='60623'] #get just rows with zip 60623
pop=df3["population"].max() #get the population
pop
STEP EIGHT We can instruct the computer to give us the number of COVID tests for any Chicago zip code.
def MyCOVID(COVID,zip):
    alreadychecked=0 #eliminate duplication of information
    for z in COVID.index: #go through all the index values
        if COVID.loc[z,"zip"]==zip and alreadychecked==0: #found the zip we requested (first time)
            alreadychecked=1 #we will only do this once
            df=COVID[COVID["zip"]==zip]
            numtested=df["tests"].max()
            print("Zip code: ", zip) #print zipcode
            print("number tested is ", numtested) #print number tested
    return ("Enter a different zip code if you wish.")
Q6 Test out the function on zip code ‘60610’
Show code cell source
MyCOVID(COVID,'60610')
Exercises#
Exercises
Define a function myCOVID2() which outputs for each given zip code the population, number tested, and number of deaths. (One possible sketch appears after these exercises.)
Use your function myCOVID2() to determine the COVID data for each of the following Chicago landmarks:
a) North Park University (zip ‘60625’)
b) Wheaton in Chicago (zip ‘60637’)
Why is COVID disproportionately impacting black and brown communities?
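One possible sketch of myCOVID2(), modeled on the MyCOVID() function above (your version may differ):
def myCOVID2(COVID,zip):
    df=COVID[COVID["zip"]==zip] #keep just the rows for the requested zip
    print("Zip code: ", zip)
    print("population is ", df["population"].max())
    print("number tested is ", df["tests"].max())
    print("number of deaths is ", df["deaths"].max())
myCOVID2(COVID,'60625') #North Park University
myCOVID2(COVID,'60637') #Wheaton in Chicago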
2.4.3. Mapping Famous Chicagoans#
Note
In this section we will learn how to use Python to put information on a map of a geographical location.
STEP ONE Let’s begin by importing data about some famous people born in Chicago.
import pandas as pd
Chi=pd.read_csv('chicagoans.csv')
Chi
Q1 How old was Bobby Fischer when he became a grandmaster?
STEP TWO Let’s use a Python library called folium to make a map of Chicago.
!pip install folium
import folium # map rendering library
from folium.features import DivIcon #used to add popup info to a map
Chicago_map = folium.Map(location=[41.886456, -87.62325], tiles="openstreetmap", zoom_start=10)
Chicago_map
Q2
Try adjusting the “zoom_start” value. What happens?
What happens if you change the numbers in location=[41.886456, -87.62325]?
STEP THREE Let’s add our data about famous Chicagoans to the map.
Chicago_map=folium.Map(location=[41.886456,-87.62325],tiles="openstreetmap",zoom_start=11)
for i in Chi.index:
    p=[Chi.loc[i,"Lat"],Chi.loc[i,"Lon"]]
    folium.Marker(p,icon=DivIcon(
        icon_size=(100,0),
        icon_anchor=(0,8),
        html='<div style="font-size:20pt; color:red">'+str(Chi.loc[i,"Name"]) +'</div>',
    )).add_to(Chicago_map)
    Chicago_map.add_child(folium.CircleMarker(p, radius=1,color='black'))
Chicago_map.save("Chicagoans.html")
Chicago_map
Q3 On what side of the city was Michelle Obama born?
STEP FOUR Let’s make a function which adds another person to our dataframe Chi and marks them on the map.
Show code cell source
def addperson(map_name,df,name,age,alive,noted,birth,zipcode,lat,lon,fact):
    our_map=map_name
    new_row = {'Name':name,
               'Age':age,
               'Alive':alive,
               'Noted For':noted,
               'Place of Birth':birth,
               'Zip':zipcode,
               'Lat':lat,
               'Lon':lon,
               'fun fact':fact }
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    i=len(df)-1
    p=[df.loc[i,"Lat"],df.loc[i,"Lon"]]
    folium.Marker(p,icon=DivIcon(
        icon_size=(100,0),
        icon_anchor=(0,8),
        html='<div style="font-size: 15pt; color : blue">'+str(df.loc[i,"Name"]) +'</div>',
    )).add_to(our_map)
    our_map.add_child(folium.CircleMarker(p, radius=1,color='blue'))
    return df,our_map
Q4 What command is used to add a new line to the dataframe df?
STEP FIVE Let’s add Jennifer Hudson to our dataframe using the following info:
'Jennifer Hudson',38,'yes','Actress','Englewood',60621,41.779699,-87.633194,'worked at Burger King'
[Chi,Chicago_map]=addperson(Chicago_map,Chi,'Jennifer Hudson',38,'yes','Actress','Englewood',60621,41.779699,-87.633194,'worked at Burger King')
Chicago_map
Q5 Where did Jennifer Hudson work before becoming famous?
STEP SIX Add your info to the map by editing the info in the next cell.
#answer to Step Six
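A template you might edit (every value below is a placeholder, not real data; replace each one with your own info):
#answer to Step Six (placeholder values)
[Chi,Chicago_map]=addperson(Chicago_map,Chi,'Your Name',14,'yes','Student','Chicago',60623,41.8500,-87.7150,'your fun fact here')
Chicago_map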
STEP SEVEN Use the following two commands to save your map and your info in a new Excel file:
Chicago_map.save("Chicagoans.html")
Chi.to_excel('ChiRevised.xlsx')
Enter and run the commands in the cell below.
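The completed cell looks like this:
Chicago_map.save("Chicagoans.html") #save the map as an html file
Chi.to_excel('ChiRevised.xlsx') #save the dataframe, including your new row, to a new Excel file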
Q7 Did your map display correctly? Did the new Excel file include your name?
Discussion
Whom do you admire? Why so?
What character qualities can help someone to be successful in life?
What can help you to be more successful?
2.4.4. Predicting Exemplary Schools#
Note
This section is a friendly competition to predict whether a K-8 Chicago school’s IL State Board of Education Summative Designation is ‘Exemplary’ (the ‘Non-Exemplary’ designations are “Commendable”, “Targeted”, and “Comprehensive”).
The prediction will be based on the file MiddleSchool.xlsx with the following data:
Student Enrollment - Black or African American
Student Enrollment - Hispanic or Latino
Student Enrollment - Children with Disabilities
Student Enrollment - Low Income
Total Number of School Days
8th Grade Passing Algebra 1
Student Attendance Rate
Student Chronic Truancy Rate
Avg Class Size – All Grades
Teacher Retention Rate
Scoring
Scoring for our competition is based on the four values in the confusion matrix, where
TP=True Positive: your model predicts exemplary and the school is exemplary
TN=True Negative: your model predicts not exemplary and the school is not exemplary
FP=False Positive: your model predicts exemplary but the school is not exemplary
FN=False Negative: your model predicts not exemplary but the school is exemplary
The number of each type of prediction then determines
Accuracy = \(\frac{\mid TP\mid + \mid TN \mid}{\mid TP\mid + \mid TN \mid+ \mid FP\mid + \mid FN \mid}\) (proportion that were correctly predicted out of all the schools)
Precision = \(\frac{\mid TP\mid}{\mid TP\mid + \mid FP\mid }\) (proportion that were correct out of those you predicted to be exemplary)
Recall (Sensitivity) = \(\frac{\mid TP\mid}{\mid TP\mid + \mid FN\mid }\) (proportion that you predicted correctly among just the exemplary schools)
Your competition (F1) score is the harmonic mean of the precision and recall: \(F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)
STEP ONE Exploratory Data Analysis
1a) Import the usual libraries, including matplotlib.pyplot as plt, and read the MiddleSchool report card data into a dataframe df.
Show code cell source
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_excel("MiddleSchool.xlsx")
df.head(2)
1b) Display the column names and the number of data rows in each column.
df.count()
1c) Get the max and min values in each column.
Show code cell source
n1min=df["# Student Enrollment"].min()
n1max=df["# Student Enrollment"].max()
n2min=df["% Student Enrollment - Black or African American"].min()
n2max=df["% Student Enrollment - Black or African American"].max()
n3min=df["% Student Enrollment - Hispanic or Latino"].min()
n3max=df["% Student Enrollment - Hispanic or Latino"].max()
n4min=df["% Student Enrollment - Children with Disabilities"].min()
n4max=df["% Student Enrollment - Children with Disabilities"].max()
n5min=df["% Student Enrollment - Low Income"].min()
n5max=df["% Student Enrollment - Low Income"].max()
n6min=df["Total Number of School Days"].min()
n6max=df["Total Number of School Days"].max()
n7min=df["% 8th Grade Passing Algebra 1"].min()
n7max=df["% 8th Grade Passing Algebra 1"].max()
n8min=df["Student Attendance Rate"].min()
n8max=df["Student Attendance Rate"].max()
n9min=df["Student Chronic Truancy Rate"].min()
n9max=df["Student Chronic Truancy Rate"].max()
n10min=df["Avg Class Size – All Grades"].min()
n10max=df["Avg Class Size – All Grades"].max()
n11min=df["Teacher Retention Rate"].min()
n11max=df["Teacher Retention Rate"].max()
print("min enroll",n1min)
print("max enroll",n1max)
print("min % Student Enrollment - Black or African American",n2min)
print("max % Student Enrollment - Black or African American",n2max)
print("min % Student Enrollment - Hispanic or Latino",n3min)
print("max % Student Enrollment - Hispanic or Latino",n3max)
print("min % Student Enrollment - Children with Disabilities",n4min)
print("max % Student Enrollment - Children with Disabilities",n4max)
print("min % Student Enrollment - Low Income",n5min)
print("max % Student Enrollment - Low Income",n5max)
print("min Total Number of School Days",n6min)
print("max Total Number of School Days",n6max)
print("min % 8th Grade Passing Algebra 1",n7min)
print("max % 8th Grade Passing Algebra 1",n7max)
print("min Student Attendance Rate",n8min)
print("max Student Attendance Rate",n8max)
print("min Student Chronic Truancy Rate",n9min)
print("max Student Chronic Truancy Rate",n9max)
print("Avg Class Size – All Grades",n10min)
print("Avg Class Size – All Grades",n10max)
print("Teacher Retention Rate",n11min)
print("Teacher Retention Rate",n11max)
1d) Check how many schools are in each category.
df["Summative Designation"].value_counts()
STEP TWO Define a function which predicts whether a school is exemplary (1) or not exemplary (0).
2a) For a simple prediction, let us predict that a school is exemplary if its Teacher Retention Rate is greater than 90%.
Show code cell source
#---PREDICTION MODEL----#
def mypredict(df):
    for i in df.index:
        if df.loc[i,"Teacher Retention Rate"]>90:
            df.loc[i,"Prediction"]=1
        else:
            df.loc[i,"Prediction"]=0
    return df
#---APPLY MODEL TO OUR DATA---#
mydf=mypredict(df)
mydf=mydf.reset_index(drop=True)
#---COMPUTE YOUR SCORE---#
TP=0
TN=0
FP=0
FN=0
numschools=0
for i in mydf.index:
    if mydf.loc[i,"Prediction"]==1 and mydf.loc[i,"Summative Designation"]=="Exemplary":
        TP=TP+1
    if mydf.loc[i,"Prediction"]==0 and mydf.loc[i,"Summative Designation"]!="Exemplary":
        TN=TN+1
    if mydf.loc[i,"Prediction"]==1 and mydf.loc[i,"Summative Designation"]!="Exemplary":
        FP=FP+1
    if mydf.loc[i,"Prediction"]==0 and mydf.loc[i,"Summative Designation"]=="Exemplary":
        FN=FN+1
    numschools=numschools+1
print("|TP|=",TP)
print("|TN|=",TN)
print("|FP|=",FP)
print("|FN|=",FN)
accuracy=round((TP+TN)/numschools,2)
precision=round(TP/(TP+FP),2)
recall=round(TP/(TP+FN),2)
F1score=2*(precision*recall)/(precision+recall)
print("Accuracy (% correct all 122 schools)=",100*accuracy,"%")
print("Precision (% correct of those you predicted to be exemplary) =",100*precision,"%")
print("Recall (% correct of schools that are exemplary) =",100*recall,"%")
print('COMPETITION F1 SCORE=',round(F1score*100,2),"%" )
Assignment#
Assignment
Modify the Prediction Model to see how high you can score.
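As one starting point, here is a sketch of a modified rule combining two indicators (the threshold values are arbitrary guesses; experiment with different columns and cutoffs):
#---EXAMPLE MODIFICATION (thresholds are arbitrary guesses)----#
def mypredict(df):
    for i in df.index:
        if df.loc[i,"Teacher Retention Rate"]>90 and df.loc[i,"Student Chronic Truancy Rate"]<10:
            df.loc[i,"Prediction"]=1
        else:
            df.loc[i,"Prediction"]=0
    return df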
2.4.5. The Great Migration#
DATASETS:
migration.xlsx (source: https://www.census.gov/dataviz/visualizations/020/508.php); uscities.xlsx (source: https://simplemaps.com/data/us-cities)
SUMMARY: We create a heat map to show the % change in African Americans living in cities in the Northern and Southern USA as a result of the Great Migrations. The cities with the largest increases (over 20%) were all in the North.
BACKGROUND: Isabel Wilkerson gives a good introduction to the Great Migration. (Run the next cell)
“The Great Migration and the power of a single decision | Isabel Wilkerson” YouTube, uploaded by TED, 6 April 2018, https://youtu.be/n3qA8DNc2Ss?si=kF0l5_YhY-cmmLfO. Permissions: YouTube Terms of Service
from IPython.display import YouTubeVideo
YouTubeVideo('n3qA8DNc2Ss')
Show code cell source
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Creating a Dataframe with % Changes in African American Population by (Lat,Lon) Location#
Use pandas to import the datafile migration.xlsx into a dataframe called migration.
migration=pd.read_excel('migration.xlsx')
Display the first two rows of migration.
migration.head(2)
Shorten the column names (Mig1=1st Great Migration, Mig2=2nd Great Migration), separate the City and State, and create a multi-index with city and state.
Show code cell source
migdf=migration.copy() #make a copy of the original dataframe
migdf.columns=["City","Mig2","Mig1"] #rename the columns
migdf["city"]="city" #create a column for the city
migdf["state"]="state" #create a column for the state
for m in migdf.index:
    x=migdf.loc[m,"City"].split(", ") #split the city from the state
    migdf.loc[m,"city"]=x[0] #add the city to the city column
    migdf.loc[m,"state"]=x[1] #add the state to the state column
migdf.drop(['City'], axis=1, inplace=True) #Drop the original City column
migdf=migdf.set_index(["city","state"],drop=True) #create multi-index
migdf.head(5)
Read the lat lon data for US cities in the file "uscities.xlsx" and make ["city","state"] the multi-index.
Show code cell source
rawlatlon=pd.read_excel("uscities.xlsx") #read data
latlon=rawlatlon[["city_ascii","lat","lng","state_id"]] #select columns
latlon.columns=["city","lat","lon","state"] #rename columns
latlon=latlon.set_index(["city","state"],drop=True) #create multi-index
latlon.head(2)
We’ll perform an inner join (intersection) of the two dataframes migdf and latlon. This adds the lat lon data to the migdf data.
df=pd.merge(latlon,migdf, how='inner', left_index=True,right_index=True)
df.columns=["lat","lon","Mig2","Mig1"]
df.to_excel("GM.xlsx")
df.head(5)
Check for missing data.
df[df["Mig1"]=="No data"].count()
df[df["Mig2"]=="No data"].count()
Remove rows with missing data.
df.count()
df=df[df["Mig1"]!="No data"]
df=df[df["Mig2"]!="No data"]
df.count()
Creating a Map from our Dataframe#
Make a heat map of the 1st Great Migration.
Show code cell source
fig=plt.figure(figsize=(35, 40))
X=df["Mig1"].astype(float)
Y=df["lat"]
heat_map= plt.hist2d(X, Y, bins=6) #heat map is a 2dimensional histogram
plt.xlabel("% Change in Afr. American Population First Great Migration(1910-1940)",size=30)
plt.ylabel("Latitude",size=30)
plt.yticks(fontsize=30)
plt.xticks(fontsize=30)
names=df.reset_index()
for i in names.index: ##add the names of the cities and %change in Afr. American population
    plt.text(names.Mig1[i],names.lat[i],names.city[i]+' '+str(names.Mig1[i]),fontsize=10,color='white')
cbar = plt.colorbar()
cbar.set_label('# of cities', rotation=270,size=50)
cbar.ax.tick_params(labelsize=30)
fig.savefig("Migration1.png")
Improve clarity by specifying certain cities to display.
Show code cell source
fig,ax=plt.subplots(figsize=(35, 40))
X=df["Mig1"].astype(float)
Y=df["lat"]
heat_map= plt.hist2d(X, Y, bins=6,alpha=.6) #heat map is a 2dimensional histogram
plt.xlabel("% Change in Afr. American Population First Great Migration(1910-1940)",size=30)
plt.ylabel("Latitude",size=30)
plt.yticks(fontsize=30)
plt.xticks(fontsize=30)
plt.xlim(-20,45)
names=df.reset_index()
for i in names.index: ##add the names of the cities and %change in Afr. American population
    if names.Mig1[i]<0 and any(names.loc[i,'city'] in x for x in ["Austin","Chicago","Detroit","Cleveland","Dallas","Denver","Grand Rapids","Houston","Huntsville","Indianapolis","Jacksonville","Louisville","Miami","Milwaukee","Minneapolis","Montgomery","Mobile","Nashville","New York","New Orleans","Newark","Omaha","Philadelphia","Pittsburgh","Providence","Raleigh","San Antonio","San Francisco","Seattle","St. Louis","Washington, DC",]):
        plt.text(names.Mig1[i],names.lat[i],"x"+names.city[i]+' '+str(names.Mig1[i]),fontsize=30,color='red')
    else:
        if any(names.loc[i,'city'] in x for x in ["Chicago","Detroit","Cleveland","Dallas","Grand Rapids","Indianapolis","Louisville","Milwaukee","Minneapolis","Montgomery","Nashville","New York","Newark","Omaha","Philadelphia","Pittsburgh","Providence","Raleigh","San Francisco","Seattle","St. Louis","Washington, DC",]):
            plt.text(names.Mig1[i],names.lat[i],"x"+names.city[i]+' '+str(names.Mig1[i]),fontsize=30,color='black')
cbar = plt.colorbar()
cbar.set_label('# of cities', rotation=270,size=50)
cbar.ax.tick_params(labelsize=25)
fig.savefig("Migration1Simplified.png")
ASSIGNMENT#
Assignment
Make heat maps for Mig2 and compare them with the heat maps for Mig1. (E.g., what happened in Chicago?)
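To get started, the Mig1 heat map code above can be reused with the Mig2 column (a minimal sketch; add city labels the same way as before):
fig=plt.figure(figsize=(35, 40))
X=df["Mig2"].astype(float) #use the 2nd Great Migration column instead of Mig1
Y=df["lat"]
heat_map= plt.hist2d(X, Y, bins=6) #heat map is a 2-dimensional histogram
plt.xlabel("% Change in Afr. American Population Second Great Migration",size=30)
plt.ylabel("Latitude",size=30)
cbar = plt.colorbar()
cbar.set_label('# of cities', rotation=270,size=50)
fig.savefig("Migration2.png")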
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # plotting library
import sklearn
from sklearn.cluster import KMeans
2.4.6. The Chicago Hardship Index#
Note
This section is all about Chicago, so it has been included here. Since it contains more advanced coding methods (explained in later chapters), the code cells may be skimmed or skipped entirely and returned to later. The contents first appeared in an open access article distributed under the terms of the Creative Commons CC BY license:
Amdat, W. C. (2021). The Chicago Hardship Index: An Introduction to Urban Inequity. Journal of Statistics and Data Science Education, 29(3), 328–336. https://doi.org/10.1080/26939169.2021.1994489
Raw Data and Summary Statistics#
The hardship index is a way to use data to explore urban inequities. Exploratory data analysis for social justice issues should aim to be accessible, significant, equitable, impartial, and transparent.
The hardship index (HI) is the mean of six indicator estimates. Each indicator has been normalized and scaled from 0-100 (except per capita income). A higher hardship index value indicates greater hardship. The 6 raw indicator estimates are complicated by time variation over 5-year periods (US Census Bureau 2008) and geographical boundary approximation (Great Cities Institute 2019). The indicators are
UNEMP = % of community age 16 and older who are unemployed
NOHS = % of community age 25 and older without a high school diploma
DEP = % of community who are dependent (under age 18 or over age 64)
HOUS= % of community with overcrowded housing (more than 1 occupant per room)
POV = % below federal poverty line
INC = per capita income
Datafile: ‘HIHOM20142017.xlsx’
Data Source: https://greatcities.uic.edu/wp-content/uploads/2016/07/GCI-Hardship-Index-Fact-SheetV2.pdf (2010-2014) https://greatcities.uic.edu/wp-content/uploads/2019/12/Hardship-Index-Fact-Sheet-2017-ACS-Final-1.pdf (2013-2017).
Summary Statistics#
Load the hardship index data for 2014 and 2017. The first two of the seventy-seven community areas in Chicago are printed below.
raw_hardship=pd.read_excel('HIHOM20142017.xlsx')
raw_hardship.head(2)
Separate the 2014 and 2017 hardship index (HI) data into two dataframes called “dfHI14” and “dfHI17.” The column names will reflect the year.
dfHI14=raw_hardship[["Community","HI14","UNEMP14","NOHS14","DEP14","HOUS14","POV14","INC14"]]
dfHI14 = dfHI14.rename(columns = {'Community':'Community14'})
dfHI17=raw_hardship[["Community","HI17","UNEMP17","NOHS17","DEP17","HOUS17","POV17","INC17"]]
dfHI17 = dfHI17.rename(columns = {'Community':'Community17'})
The first two communities’ HI and the indicator scores for 2014 are printed below.
dfHI14.head(2)
The scores for two communities in 2017 are printed below.
dfHI17.head(2)
Calculate the summary statistics for the indicators using the data from all seventy-seven communities. For example, below we calculate statistics for the HOUS indicator in 2017. By replacing “HOUS17”, statistics can be calculated for other indicators, such as “HI17” or “NOHS17”.
x=dfHI17["HOUS17"]
import numpy
print("Minimum: ", numpy.min(x))
print("Maximum: ", numpy.max(x))
print("Standard Deviation: ", numpy.std(x))
print("Mean: ", numpy.mean(x))
print("Median: ", numpy.median(x))
The table below lists the indicator values for several communities, as shown above, and the summary statistics for each indicator.
K-Means Clustering#
We will use a machine learning method called K-means clustering (see Chapter 10.3) to visualize patterns in the geographic information. By plotting the homicide data alongside it, we can investigate the relationship between location, hardship cluster, and number of homicides. We use the sklearn library for machine learning.
Datafile: ‘standardizedindicators.xlsx’
Read standardized hardship index and homicide data.
hom_df = pd.read_excel('standardizedindicators.xlsx')
hom_df.head(2)
Create a dataframe with just the six standardized 2017 hardship indicators.
HIHOM=hom_df[["UNEMP17","NOHS17","DEP17","HOUS17","POV17","INC17"]]
HIHOM.head()
Use the KMeans() function to make n_clusters=2 clusters and get the labels indicating which cluster each point belongs to.
# Fit the k means model
k_means = KMeans(init="k-means++", n_clusters=2, n_init=2)
k_means.fit(HIHOM)
#Get Labels
k_means_labels = k_means.labels_
k_means_labels
Add the labels to hom_df.
hom_df["CLASS"]=k_means_labels
hom_df.head(2)
Make a geographic plot of Chicago’s 77 community areas with marker color (blue or red) indicating the k-means clustering into two groups based only on the 6 standardized economic hardship indicators. Affluent community areas such as the Loop (central business district), Near North Side (which includes the “Gold Coast”), and Hyde Park (site of the University of Chicago) appear in blue. Lower-income communities with a history of injustices, including Woodlawn, Englewood, and Austin, appear in red. Marker sizes are proportional to homicide counts, with actual numbers in parentheses following named community areas.
Show code cell source
fig=plt.figure(figsize=(25,20))
for i in hom_df.index:
    if hom_df.loc[i,"CLASS"]==0: #toggle class (0 or 1) if "Loop" does not appear on the map
        plt.scatter(hom_df.loc[i,'LON'], hom_df.loc[i,'LAT'],s=20*hom_df.loc[i,'HOM17']+20,color='b', alpha=0.95)
        if hom_df.loc[i,"Community"] in ["Loop"]:
            plt.gca().text(hom_df.loc[i,'LON']+.003, hom_df.loc[i,'LAT']-.005, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=30)
        if hom_df.loc[i,"Community"] in ["Hyde Park","Near West Side","Kenwood","Near North Side","Near South Side","West Town"]:
            plt.gca().text(hom_df.loc[i,'LON']+.003, hom_df.loc[i,'LAT']-.005, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
        if hom_df.loc[i,"Community"] in ["Lincoln Park","Lakeview","Uptown","Edgewater","Rogers Park","Logan Square","Avondale","North Center"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015, hom_df.loc[i,'LAT']-.0075, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
        if hom_df.loc[i,"Community"] in ["Lincoln Square","Irving Park","Portage Park","Jefferson Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.025, hom_df.loc[i,'LAT']-.0075, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
        if hom_df.loc[i,"Community"] in ["Dunning","O’Hare","Edison Park","Norwood Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015, hom_df.loc[i,'LAT']+.005, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
        if hom_df.loc[i,"Community"] in ["Forest Glen","Garfield Ridge","Clearning","Ashburn","Beverly","Mount Greenwood","Morgan Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015, hom_df.loc[i,'LAT']+.0015, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
        if hom_df.loc[i,"Community"] in ["Hegewisch"]:
            plt.gca().text(hom_df.loc[i,'LON']-.01, hom_df.loc[i,'LAT']-.0075, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='blue', size=20)
    else:
        plt.scatter(hom_df.loc[i,'LON'], hom_df.loc[i,'LAT'],s=20*hom_df.loc[i,'HOM17']+20,color='r', alpha=0.95)
        if hom_df.loc[i,"Community"] in ["Austin","Belmont Craigin","Montclare"]:
            plt.gca().text(hom_df.loc[i,'LON']-.02,hom_df.loc[i,'LAT']-.013, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Woodlawn","Englewood","Chicago Lawn","South Lawndale","McKinley Park","Brighton Park","Archer Heights","West Elsdon","West Lawn","New City","Greater Grand Crossing","Auburn Gresham","South Chicago","East Side","South Deering","Riverdale"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015,hom_df.loc[i,'LAT']-.009, hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Avalon Park","Burnside","Pullman"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015,hom_df.loc[i,'LAT']-.006,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Roseland","North Lawndale","East Garfield Park","Hermosa","West Englewood","Humboldt Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015,hom_df.loc[i,'LAT']+.007,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["West Garfield Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.04,hom_df.loc[i,'LAT']-.007,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Lower West Side","Oakland","Douglas","Armour Square"]:
            plt.gca().text(hom_df.loc[i,'LON']-.001,hom_df.loc[i,'LAT']+.002,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Bridgeport"]:
            plt.gca().text(hom_df.loc[i,'LON']-.01,hom_df.loc[i,'LAT']-.005,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["South Shore"]:
            plt.gca().text(hom_df.loc[i,'LON']-.015,hom_df.loc[i,'LAT']+.003,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Washington Park","West Pullman","Grand Boulevard"]:
            plt.gca().text(hom_df.loc[i,'LON']-.025,hom_df.loc[i,'LAT']+.003,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Fuller Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.027,hom_df.loc[i,'LAT']+.002,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["Washington Heights"]:
            plt.gca().text(hom_df.loc[i,'LON']-.05,hom_df.loc[i,'LAT']+.005,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
        if hom_df.loc[i,"Community"] in ["West Ridge","North Park","Albany Park"]:
            plt.gca().text(hom_df.loc[i,'LON']-.02,hom_df.loc[i,'LAT']+.005,hom_df.loc[i,'Community']+'('+str(hom_df.loc[i,'HOM17'])+')',color='red', size=20)
#plt.gca().set_facecolor('lightgray')
plt.gca().grid()
plt.yticks(fontsize=20)
plt.xticks(fontsize=20)
#title
plt.title('KMeans Classification of Community Areas by 6 Standardized Hardship Indicators',size=25)
plt.xlabel("Longitude",size=30)
plt.ylabel("Latitude",size=30)
plt.legend(loc="lower left")
fig.savefig("Fig4.png")
#show the plot
plt.show()
Exercise
Note that the Near North Side has the same homicide count as Riverdale. However, the homicide rate in Riverdale was 10 times higher than in the Near North Side. Using the 2017 data, change the marker size so that it corresponds to the homicide rate (per 10,000 people) rather than the homicide count.
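A sketch of one way to do this, assuming hom_df contains a 2017 population column (the name 'POP17' below is a placeholder; check hom_df.columns for the actual name):
# Re-plot with marker size proportional to the 2017 homicide rate per 10,000 residents.
# NOTE: 'POP17' is a placeholder column name -- replace it with the actual population column.
fig=plt.figure(figsize=(25,20))
for i in hom_df.index:
    rate=10000*hom_df.loc[i,'HOM17']/hom_df.loc[i,'POP17'] #homicides per 10,000 people
    color='b' if hom_df.loc[i,"CLASS"]==0 else 'r'
    plt.scatter(hom_df.loc[i,'LON'], hom_df.loc[i,'LAT'], s=20*rate+20, color=color, alpha=0.95)
plt.xlabel("Longitude",size=30)
plt.ylabel("Latitude",size=30)
plt.show()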