Solution to Exercises

3.4. Solution to Exercises#

3.4.1. Introduction to Python#

3.4.2. Working with Data#

An Overview of Data and Data Analysis#

Exercise 1

Answers may vary. Data of chess moves from human players: sophisticated algorithms are needed to understand the meaning of the moves and the impact of each move towards the winning of the chess game. Real-time data from background radiation: it is nearly impossible to predict the next minute’s exact radiation level by “analyzing” the historical values, though an average level over a time period is available. Written texts of a now-extinct human language: because there is no gold-standard human verification, a machine’s interpretation of such a language can be less convincing.

Exercise 2

There are several suggested options, but all depend on the nature of the specific data. One can implement parallel systems, consider data compression if possible, or just prioritize the volume with less accurate results, etc.

Exercise 3

Not really. The structure and the quality of data are two different things. One individual may use a numeric value from 0-5 to self-record the tastiness level for a certain type of food that he or she eats every day. However, such kind of record can be very much subjective and does not imply a high quality or accuracy.

Data Operations: Importing and Exporting, Reading and Writing Data#

Exercise 1

The purpose of making this question is to provide a general solution in case that one does not want a program to display too many data at a time.

All user entered elements to be stored in a list

list_numbers = list()

Accepting user inputs

user_input = input()

Continue to accept user inputs until user types STOP

while user_input != ‘STOP’: list_numbers.append(user_input) user_input = input()

Displaying the smaller of: number of user entered elements or 6

records_to_display = min(len(list_numbers), 6)

Using a loop to display uer input values

for i in range(records_to_display): print(list_numbers[i])

Exercise 2

Except the first line, all other 99 lines (rows) have actual data. Usually, there is no comma after last data record in each line. There will be 26 records in each line (row). However, since the first column is storing an index value, there will be 25 actual data records in each line (row). Therefore, the total number of records in this .csv file can store (100-1) * (25+1-1) = 2475.

Exercise 3

We must treat each element as a number type value. The following code does not work, though the output looks okay.

user_input_line = input() output_format = user_input_line.replace(“,” , “\n”) print(output_format)

The correct function that we should use is the split() function.

user_input_line = input()

output_list = user_input_line.split(“,”) output_numbers = []

for i in range(len(output_list)): # adding and displaying the numbers as number type values output_numbers.append(int(output_list[i])) print(output_numbers[i])

Exercise 4

One way to solve the problem is to replace all other symbols with the same separator such as a comma throughout the file.

#. The separator symbol (e.g. semicolon) to be replaced
separator_other = ";"
#.The separator symbol (e.g. comma) to replace other separator symbols
separator_comma = ","
#. Assume that all_data is the entire text from the file. Using .replace() function to update ; with ,
all_data = all_data.replace(separator_other, separator_comma)

Another way is to recognize both commas and semicolons as data separators.

#. Example: In the sep parameter part, put both , and ; as separators.
df = pd.read_csv("file_name.csv", sep = ",;")

Introduction to Pandas#

Exercise 1

# a) Read the data file 
import pandas as pd
population_raw_data = pd.read_excel('world_population.xlsx')

# b) print the first five lines raw data
population_raw_data.head(5)

	Country Name	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	...	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
0	Aruba	ABW	International migrant stock, total	SM.POP.TOTL	7893.0	NaN	NaN	NaN	NaN	7677.0	...	NaN	NaN	36114.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Africa Eastern and Southern	AFE	International migrant stock, total	SM.POP.TOTL	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	10406165.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Afghanistan	AFG	International migrant stock, total	SM.POP.TOTL	46468.0	NaN	NaN	NaN	NaN	49535.0	...	NaN	NaN	382365.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Africa Western and Central	AFW	International migrant stock, total	SM.POP.TOTL	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	8270665.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	AGO	International migrant stock, total	SM.POP.TOTL	122089.0	NaN	NaN	NaN	NaN	77075.0	...	NaN	NaN	106845.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 67 columns

# c) Display the population values for Ethiopia

Ethiopia=population_raw_data[population_raw_data["Country Name"] == "Ethiopia"]
Ethiopia

	Country Name	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	...	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022
72	Ethiopia	ETH	International migrant stock, total	SM.POP.TOTL	393260.0	NaN	NaN	NaN	NaN	383551.0	...	NaN	NaN	1072949.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

1 rows × 67 columns

population_raw_data.columns

Index([  'Country Name',   'Country Code', 'Indicator Name', 'Indicator Code',
                   1960,             1961,             1962,             1963,
                   1964,             1965,             1966,             1967,
                   1968,             1969,             1970,             1971,
                   1972,             1973,             1974,             1975,
                   1976,             1977,             1978,             1979,
                   1980,             1981,             1982,             1983,
                   1984,             1985,             1986,             1987,
                   1988,             1989,             1990,             1991,
                   1992,             1993,             1994,             1995,
                   1996,             1997,             1998,             1999,
                   2000,             2001,             2002,             2003,
                   2004,             2005,             2006,             2007,
                   2008,             2009,             2010,             2011,
                   2012,             2013,             2014,             2015,
                   2016,             2017,             2018,             2019,
                   2020,             2021,             2022],
      dtype='object')

# d) Display all countries/regions whose population was over 10 million in the year 2000
df = population_raw_data[["Country Name",2000]]
large=df[df[2000]>10000000]
large

	Country Name	2000
7	Arab World	16216741.0
62	Early-demographic dividend	38155943.0
63	East Asia & Pacific	15674381.0
64	Europe & Central Asia (excluding high income)	27780057.0
65	Europe & Central Asia	64005283.0
68	Euro area	26129209.0
73	European Union	29602637.0
74	Fragile and conflict affected situations	10005606.0
95	High income	100949072.0
98	Heavily indebted poor countries (HIPC)	10632626.0
102	IBRD only	51103969.0
103	IDA & IBRD total	72738421.0
104	IDA total	21634452.0
107	IDA only	13832671.0
135	Least developed countries: UN classification	10073307.0
139	Lower middle income	31018425.0
140	Low & middle income	70316148.0
142	Late-demographic dividend	30615531.0
153	Middle East & North Africa	20000241.0
156	Middle income	63267150.0
170	North America	40343635.0
181	OECD members	86944671.0
191	Pre-demographic dividend	10751023.0
198	Post-demographic dividend	92403421.0
202	Russian Federation	11900297.0
204	South Asia	12474215.0
215	Sub-Saharan Africa (excluding high income)	13462901.0
217	Sub-Saharan Africa	13469475.0
231	Europe & Central Asia (IDA & IBRD countries)	29190606.0
240	South Asia (IDA & IBRD)	12474215.0
241	Sub-Saharan Africa (IDA & IBRD countries)	13469475.0
249	Upper middle income	32248725.0
251	United States	34814053.0
259	World	172278883.0

Exercise 2

20
20
359145
62
80192
4408.5
0    924
dtype: int64
27283.438618124135
744386022.8289474

Basics of MATPLOTLIB#

A bar chart may be considered; however, some modifications from the original bar chart are necessary. The values of total births in different months are likely close to each other. A simple bar chart that is showing the absolute values may not present the difference of values very well. Different months have different number of days. It is likely that February has the smallest number, but such a fact does not mean “February is the month that people avoid giving births.”

One suggestion is that we use the number of births per day in a month to represent data. 

The following chart is produced with the reflection of modifications.

../_images/ae11cfc68359bddb16bfadcbbef4f7a74865f313af6ee7b5c6aa76d6ac3a42d2.png

Solution to Exercises

Contents

3.4. Solution to Exercises#

3.4.1. Introduction to Python#

3.4.2. Working with Data#

An Overview of Data and Data Analysis#

Data Operations: Importing and Exporting, Reading and Writing Data#

Introduction to Pandas#

Basics of MATPLOTLIB#