import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

3.4. Solution to Exercises#

3.4.1. Introduction to Python#

3.4.2. Working with Data#

An Overview of Data and Data Analysis#

Exercise 1

Answers may vary. Data of chess moves from human players: sophisticated algorithms are needed to understand the meaning of the moves and the impact of each move towards the winning of the chess game. Real-time data from background radiation: it is nearly impossible to predict the next minute’s exact radiation level by “analyzing” the historical values, though an average level over a time period is available. Written texts of a now-extinct human language: because there is no gold-standard human verification, a machine’s interpretation of such a language can be less convincing.

Exercise 2

There are several suggested options, but all depend on the nature of the specific data. One can implement parallel systems, consider data compression if possible, or just prioritize the volume with less accurate results, etc.

Exercise 3

Not really. The structure and the quality of data are two different things. One individual may use a numeric value from 0-5 to self-record the tastiness level for a certain type of food that he or she eats every day. However, such kind of record can be very much subjective and does not imply a high quality or accuracy.

Data Operations: Importing and Exporting, Reading and Writing Data#

Exercise 1

The purpose of making this question is to provide a general solution in case that one does not want a program to display too many data at a time.

All user entered elements to be stored in a list

list_numbers = list()

Accepting user inputs

user_input = input()

Continue to accept user inputs until user types STOP

while user_input != ‘STOP’: list_numbers.append(user_input) user_input = input()

Displaying the smaller of: number of user entered elements or 6

records_to_display = min(len(list_numbers), 6)

Using a loop to display uer input values

for i in range(records_to_display): print(list_numbers[i])

Exercise 2

Except the first line, all other 99 lines (rows) have actual data. Usually, there is no comma after last data record in each line. There will be 26 records in each line (row). However, since the first column is storing an index value, there will be 25 actual data records in each line (row). Therefore, the total number of records in this .csv file can store (100-1) * (25+1-1) = 2475.

Exercise 3

We must treat each element as a number type value. The following code does not work, though the output looks okay.

user_input_line = input() output_format = user_input_line.replace(“,” , “\n”) print(output_format)

The correct function that we should use is the split() function.

user_input_line = input()

output_list = user_input_line.split(“,”) output_numbers = []

for i in range(len(output_list)): # adding and displaying the numbers as number type values output_numbers.append(int(output_list[i])) print(output_numbers[i])

Exercise 4

One way to solve the problem is to replace all other symbols with the same separator such as a comma throughout the file.

#. The separator symbol (e.g. semicolon) to be replaced
separator_other = ";"
#.The separator symbol (e.g. comma) to replace other separator symbols
separator_comma = ","
#. Assume that all_data is the entire text from the file. Using .replace() function to update ; with ,
all_data = all_data.replace(separator_other, separator_comma)

Another way is to recognize both commas and semicolons as data separators.

#. Example: In the sep parameter part, put both , and ; as separators.
df = pd.read_csv("file_name.csv", sep = ",;")

Introduction to Pandas#

Exercise 1

# a) Read the data file 
import pandas as pd
population_raw_data = pd.read_excel('world_population.xlsx')

# b) print the first five lines raw data
population_raw_data.head(5)
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
0 Aruba ABW International migrant stock, total SM.POP.TOTL 7893.0 NaN NaN NaN NaN 7677.0 ... NaN NaN 36114.0 NaN NaN NaN NaN NaN NaN NaN
1 Africa Eastern and Southern AFE International migrant stock, total SM.POP.TOTL NaN NaN NaN NaN NaN NaN ... NaN NaN 10406165.0 NaN NaN NaN NaN NaN NaN NaN
2 Afghanistan AFG International migrant stock, total SM.POP.TOTL 46468.0 NaN NaN NaN NaN 49535.0 ... NaN NaN 382365.0 NaN NaN NaN NaN NaN NaN NaN
3 Africa Western and Central AFW International migrant stock, total SM.POP.TOTL NaN NaN NaN NaN NaN NaN ... NaN NaN 8270665.0 NaN NaN NaN NaN NaN NaN NaN
4 Angola AGO International migrant stock, total SM.POP.TOTL 122089.0 NaN NaN NaN NaN 77075.0 ... NaN NaN 106845.0 NaN NaN NaN NaN NaN NaN NaN

5 rows × 67 columns

# c) Display the population values for Ethiopia

Ethiopia=population_raw_data[population_raw_data["Country Name"] == "Ethiopia"]
Ethiopia
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
72 Ethiopia ETH International migrant stock, total SM.POP.TOTL 393260.0 NaN NaN NaN NaN 383551.0 ... NaN NaN 1072949.0 NaN NaN NaN NaN NaN NaN NaN

1 rows × 67 columns

population_raw_data.columns
Index([  'Country Name',   'Country Code', 'Indicator Name', 'Indicator Code',
                   1960,             1961,             1962,             1963,
                   1964,             1965,             1966,             1967,
                   1968,             1969,             1970,             1971,
                   1972,             1973,             1974,             1975,
                   1976,             1977,             1978,             1979,
                   1980,             1981,             1982,             1983,
                   1984,             1985,             1986,             1987,
                   1988,             1989,             1990,             1991,
                   1992,             1993,             1994,             1995,
                   1996,             1997,             1998,             1999,
                   2000,             2001,             2002,             2003,
                   2004,             2005,             2006,             2007,
                   2008,             2009,             2010,             2011,
                   2012,             2013,             2014,             2015,
                   2016,             2017,             2018,             2019,
                   2020,             2021,             2022],
      dtype='object')
# d) Display all countries/regions whose population was over 10 million in the year 2000
df = population_raw_data[["Country Name",2000]]
large=df[df[2000]>10000000]
large
Country Name 2000
7 Arab World 16216741.0
62 Early-demographic dividend 38155943.0
63 East Asia & Pacific 15674381.0
64 Europe & Central Asia (excluding high income) 27780057.0
65 Europe & Central Asia 64005283.0
68 Euro area 26129209.0
73 European Union 29602637.0
74 Fragile and conflict affected situations 10005606.0
95 High income 100949072.0
98 Heavily indebted poor countries (HIPC) 10632626.0
102 IBRD only 51103969.0
103 IDA & IBRD total 72738421.0
104 IDA total 21634452.0
107 IDA only 13832671.0
135 Least developed countries: UN classification 10073307.0
139 Lower middle income 31018425.0
140 Low & middle income 70316148.0
142 Late-demographic dividend 30615531.0
153 Middle East & North Africa 20000241.0
156 Middle income 63267150.0
170 North America 40343635.0
181 OECD members 86944671.0
191 Pre-demographic dividend 10751023.0
198 Post-demographic dividend 92403421.0
202 Russian Federation 11900297.0
204 South Asia 12474215.0
215 Sub-Saharan Africa (excluding high income) 13462901.0
217 Sub-Saharan Africa 13469475.0
231 Europe & Central Asia (IDA & IBRD countries) 29190606.0
240 South Asia (IDA & IBRD) 12474215.0
241 Sub-Saharan Africa (IDA & IBRD countries) 13469475.0
249 Upper middle income 32248725.0
251 United States 34814053.0
259 World 172278883.0

Exercise 2

Hide code cell source
# Assume a Python list was used to store all elements. Then, transform the list into a Series.
import pandas as pd
list_numbers = [10301, 4994, 8872, 9624, 3666, 924, 73712, 3823, 55900, 62, 6498, 852, 24540, 421, 67891, 924, 80192, 3667, 494, 1788]
series_numbers = pd.Series(list_numbers)

# a) the number count, i.e. how many elements in this set.
print(len(list_numbers))
# Or
print(series_numbers.size)

# b) the sum of all these numbers.
print(series_numbers.sum()) 

# c) the minimum value from this set.
print(series_numbers.min()) 

# d) the maximum value from this set.
print(series_numbers.max()) 

# e) the median value from this set.
print(series_numbers.median()) 

# f) the mode of the elements in this set.
print(series_numbers.mode()) 

# g) the standard deviation from this set.
print(series_numbers.std()) 

# h) the variance (defined as the square of the standard deviation from #7). 
print(series_numbers.var()) 
20
20
359145
62
80192
4408.5
0    924
dtype: int64
27283.438618124135
744386022.8289474

Basics of MATPLOTLIB#

A bar chart may be considered; however, some modifications from the original bar chart are necessary. The values of total births in different months are likely close to each other. A simple bar chart that is showing the absolute values may not present the difference of values very well. Different months have different number of days. It is likely that February has the smallest number, but such a fact does not mean “February is the month that people avoid giving births.”

One suggestion is that we use the number of births per day in a month to represent data. 

The following chart is produced with the reflection of modifications.
Hide code cell source
# Data initialization
number_births = [253, 229, 247, 251, 261, 239, 244, 252, 235, 242, 245, 249]
days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
month_labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Transforming the original data of total births to the births by day
births_by_day = []

# Calculating and adding numbers to the list of births by day
for i in range(len(number_births)):
    births_by_day.append(number_births[i] / days_per_month[i])

# Assume that we want to see the difference between a benchmark value, e.g. 8.0, and the actual monthly births
benchmark_number = 8.0
monthly_deviation = []

# Calculating and adding numbers to the list of monthly deviations
for i in range(len(number_births)):
    monthly_deviation.append(births_by_day[i] - 8.0)

# Beginning ploting
import matplotlib.pyplot as plt
import numpy as np

# X: monthly deviation from 8.0 births per day; Y: months
x_months = np.array(month_labels)
y_deviations = np.array(monthly_deviation)

# Displaying the bar chart with labels
plt.bar(x_months, y_deviations)
plt.title("Monthly births data")
plt.xlabel("Months")
plt.ylabel("Deviation from 8.0 births per day")
plt.show()
../_images/ae11cfc68359bddb16bfadcbbef4f7a74865f313af6ee7b5c6aa76d6ac3a42d2.png