The Covid-19 pandemic has had a devastating impact on the world in recent years. It is much more than a health threat and has affected every individual in one way or another. With lockdowns and emergencies in all parts of the world, it has changed the way everything used to function. The economic and social disruption it caused is huge and has led to crises even in well-developed countries. Millions of people lost their jobs and are at risk of extreme poverty, and many enterprises faced an existential threat. All of this can be improved in the future. What we cannot change is the dramatic loss of human life worldwide and the effect the pandemic has had on people's health.
The best way to understand how the coronavirus pandemic has affected the world is through statistics. Let us start by analyzing its spread and the number of deaths it caused.
# Importing the libraries needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import widgets, interactive
import plotly.io as pio
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
pio.renderers.default='notebook'
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
# Reading the data
covid_df = pd.read_csv("./data/owid-covid-data.csv.gz", compression="gzip")
A lot of data in the dataset is missing, and thus we drop the rows which have missing data in the necessary columns.
Note: Interpolation techniques cannot be used here to fill the missing data since interpolating features like 'iso_code', 'location', 'continent', and 'date' does not make sense.
required_columns = ["iso_code", "location", "continent", "date"]
covid_df = covid_df.dropna(subset = required_columns)
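To see what `dropna(subset=...)` does, here is a minimal sketch on a made-up three-row frame. The values and country names are hypothetical and are not taken from the dataset. Rows missing an identifying column are dropped, while rows that only lack a numeric measurement are kept.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the structure of the real dataset
toy_df = pd.DataFrame({
    "iso_code": ["AAA", "BBB", "CCC"],
    "location": ["A-land", "B-land", "C-land"],
    "continent": ["Asia", np.nan, "Europe"],    # B-land has no continent
    "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "new_cases": [10.0, 20.0, np.nan],           # C-land has no case count
})

required_columns = ["iso_code", "location", "continent", "date"]
cleaned = toy_df.dropna(subset=required_columns)

# Only B-land is removed; C-land stays even though new_cases is missing
print(cleaned["location"].tolist())
```

Restricting `subset` to the identifying columns keeps as many rows as possible for the later plots, each of which drops missing values only in the columns it needs.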
The following is a sample of the data we will be using. Some of the relevant columns in our data for each country include:
# Visualizing the data
cols = ['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases', 'total_deaths', 'new_deaths', 'total_cases_per_million', 'new_cases_per_million', 'total_deaths_per_million', 'new_deaths_per_million']
covid_df[cols].fillna(0).sample(5)
index | iso_code | continent | location | date | total_cases | new_cases | total_deaths | new_deaths | total_cases_per_million | new_cases_per_million | total_deaths_per_million | new_deaths_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|
138357 | ZAF | Africa | South Africa | 2021-08-30 | 2770575.0 | 5644.0 | 81830.0 | 235.0 | 46143.952 | 94.001 | 1362.879 | 3.914 |
12234 | BHR | Asia | Bahrain | 2021-03-07 | 126602.0 | 476.0 | 472.0 | 3.0 | 72414.552 | 272.265 | 269.977 | 1.716 |
103271 | MAR | Africa | Morocco | 2021-09-10 | 899581.0 | 2668.0 | 13436.0 | 66.0 | 24088.529 | 71.442 | 359.782 | 1.767 |
13068 | BGD | Asia | Bangladesh | 2021-06-17 | 841087.0 | 3840.0 | 13345.0 | 63.0 | 5057.543 | 23.090 | 80.245 | 0.379 |
9654 | AUS | Oceania | Australia | 2022-02-22 | 3099249.0 | 25229.0 | 5025.0 | 59.0 | 120180.817 | 978.315 | 194.856 | 2.288 |
The following plot shows the (smoothed) number of new cases per day. Dragging the slider along the timeline shows the plot for a particular day, which gives an idea of when a country was hit by a Covid wave. Hovering over a country on the world map shows its exact number of cases.
For example, around April 2021 India shows the largest rise in cases. This was when India was going through the second wave of the pandemic, and thus it recorded the most new cases in a day.
# Plotting interactive graph for new covid cases worldwide
tmp_df = covid_df.dropna(subset=['new_cases_smoothed'])
fig = px.scatter_geo(tmp_df, locations="iso_code", color="continent",
hover_name="location", size="new_cases_smoothed",
projection="natural earth", animation_frame="date", template="seaborn")
fig.show()
The next plot shows the total number of cases over time for each country. Dragging the slider along the timeline shows the rise in cases around the world.
# Plotting interactive graph for total covid cases worldwide
tmp_df = covid_df.dropna(subset=['total_cases'])
fig = px.scatter_geo(tmp_df, locations="iso_code", color="continent",
hover_name="location", size="total_cases",
projection="natural earth", animation_frame="date", template="seaborn")
fig.show()
The gravest and most devastating loss due to Covid-19 was the loss of human life. Let us now visualize the number of deaths.
We first drop the rows which have missing data in the relevant columns.
cols = ['location', 'total_deaths_per_million', 'continent', 'total_deaths']
df = covid_df.dropna(subset = cols)
countries = ['India', 'Georgia', 'Brazil', 'United States', 'Chad', 'Canada', 'Germany', 'China']
df_country = df[df['location'].isin(countries)]
# df.sample(5)
The following plot shows the total number of deaths over time for a selection of countries. We have chosen countries from all three categories: developed, developing, and under-developed. The x-axis represents the date and the y-axis represents the number of deaths.
To view the plot for a single country, double-click that country. To add further countries for comparison, single-click each country we wish to add. To go back to the plot for all countries, double-click any unselected country.
We can also hover over the graph to see the exact number of deaths in a particular country up to that date.
# Total Deaths for each country
fig = px.line(df_country, x="date", y="total_deaths", title='Total Deaths', hover_name='location', color='location')
fig.show()
Does the above plot mean that the US, Brazil, and India are the countries worst affected by Covid? No! Each country has a different population, so we should not compare them by total deaths. One possible solution is to compare deaths per million. A simple example makes this clear.
Below is a similar plot, but for total deaths per million instead of total deaths. If we select India and Georgia in the above plot, we can see that the total number of deaths in India (514k) is far higher than in Georgia (16k). Now select the same two countries in the plot below. The total deaths per million for India is around 350, whereas for Georgia it is around 4000!
This is a misconception that a lot of people have. As a result, small countries like Georgia don't make the news even when they are more severely affected, and they don't get the attention and help they should.
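The normalization behind this comparison is simple arithmetic: deaths per million = total deaths / population × 1,000,000. The sketch below applies it to the rounded death counts quoted above. The population figures are rough assumptions added for illustration and are not taken from the dataset, and `deaths_per_million` is a hypothetical helper name.

```python
def deaths_per_million(total_deaths, population):
    """Scale a raw death count to deaths per one million inhabitants."""
    return total_deaths / population * 1_000_000

# Death counts are the rounded figures quoted in the text.
# Populations are approximate, assumed values for illustration only.
examples = {
    "India": {"total_deaths": 514_000, "population": 1_390_000_000},
    "Georgia": {"total_deaths": 16_000, "population": 4_000_000},
}

for country, stats in examples.items():
    rate = deaths_per_million(stats["total_deaths"], stats["population"])
    print(f"{country}: {rate:.0f} deaths per million")
```

Under these assumed populations, Georgia's rate comes out more than ten times higher than India's even though its raw death count is far smaller.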
# Total Deaths per million for all countries
fig = px.line(df_country, x="date", y="total_deaths_per_million", title='Total Deaths per million', hover_name='location', color='location')
fig.show()
To get better insight, we should view the statistics for each continent separately, as each of them has different resources. Below is the plot for the five countries in Asia with the most deaths per million and the five with the fewest.
We can pass other continents to the Python function defined below to view their statistics.
def death_per_continent(df, continent, top=5, bottom=5):
    # Keep one continent and order rows from highest to lowest deaths per million
    df_continent = df[df['continent'] == continent]
    df_continent = df_continent.sort_values('total_deaths_per_million', ascending=False)
    # unique() preserves order of first appearance, so locations are ranked by their peak value
    locs = list(df_continent.location.unique()[:top])
    locs += list(df_continent.location.unique()[-bottom:])
    # Keep only the selected countries and plot one line per country
    df_continent = df_continent[df_continent['location'].isin(locs)]
    fig = px.line(df_continent, x="date", y="total_deaths_per_million", title=f'Total Deaths Per Million in {continent}', hover_name='location', color='location')
    return fig
continent = covid_df.continent.unique()
fig = death_per_continent(df, continent[0])
fig.show()
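An alternative way to pick the most and least affected countries, sketched here on made-up data, is to take each country's latest cumulative value and rank those values. This is not the approach used in `death_per_continent` above, which relies on sort order. The helper name `rank_locations` and all values below are hypothetical.

```python
import pandas as pd

def rank_locations(df, value_col, top=5, bottom=5):
    """Return the locations with the highest and lowest latest value."""
    latest = (
        df.sort_values("date")
          .groupby("location")[value_col]
          .last()                      # last non-null value per location
          .sort_values(ascending=False)
    )
    return list(latest.index[:top]), list(latest.index[-bottom:])

# Toy data (hypothetical values)
toy_df = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "C"],
    "date": ["2021-01-01", "2021-02-01"] * 3,
    "total_deaths_per_million": [10.0, 50.0, 5.0, 8.0, 100.0, 300.0],
})

most, least = rank_locations(toy_df, "total_deaths_per_million", top=1, bottom=1)
print(most, least)
```

Ranking on one value per country makes the selection explicit, which can be easier to check than reading it off the order of a sorted frame.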
We saw how much damage Covid has done to the world. What is the solution? Vaccines!
To bring this long-running pandemic to an end, an efficient and inclusive distribution of Covid-19 vaccines is our most promising next step. In order to act along these lines, we first need to understand how the current Covid vaccination drives are running. Interpreting this data will let us identify any underlying inequalities in the distribution of vaccines. So, our task is to analyze Covid-19 vaccination data worldwide, draw inferences about how the vaccination drives are going, and understand the underlying inequalities across the world.
As mentioned, let us compare the distribution of Covid-19 vaccines in different countries. To better understand the inequalities in the distribution, we compare the statistics for three countries: Canada, India, and Chad. Canada is a developed country, India is a developing country, and Chad is an under-developed country.
In each plot, we show three trends: the distribution of the first dose, the second dose, and the booster dose. To compare the three, we plot the percentage of people in the country who have received each dose. We can hover over the plots to see the percentage of vaccinated people at a particular point in time.
We have also created a custom Python function that takes as input the country whose trends we want to see.
def vacc_by_country(df, country):
    # Keep only the rows for the requested country
    df_country = df[df['location'] == country]
    # One line each for first dose, full vaccination, and boosters (per hundred people)
    fig = px.line(df_country, x='date', y=['people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred'], title=f'% of people vaccinated in {country}')
    return fig
# Removing rows with missing data in relevant columns
cols = ['date','location','people_vaccinated_per_hundred']
covid_df = covid_df.dropna(subset = cols)
fig = vacc_by_country(covid_df, 'Canada')
fig.show()
fig = vacc_by_country(covid_df, 'India')
fig.show()
fig = vacc_by_country(covid_df, 'Chad')
fig.show()
We can clearly observe that Canada is the most vaccinated of the three countries, with around 81% of people doubly vaccinated, and the trends show that the vaccination drive was organized in a planned and timely manner. This shows that developed countries were able to vaccinate a majority of their population in a short span of time. The booster-dose drive also started quite early in developed countries, as can be seen from the line plots.
In India, around 57% of the population is fully vaccinated. It is evident from the curve that developing countries like India took a long time to complete their vaccination drives, whereas developed countries have a steep vaccination curve. Moreover, the vaccination drive in India started a couple of months later than in Canada, which adds to the delay in India's booster-dose drive.
Under-developed countries like Chad are in a very bad situation, with only 1% of people fully vaccinated and the booster-dose drive not even started. Such countries need a lot of attention and help, as they have a weak economy and lack the resources needed to carry out planned vaccination drives.
Now let us look at how the number of doses of different vaccines rose in various countries. The type of vaccine, and thus the manufacturer, played an important role in the vaccination drives because of cost and success rate.
The data contains the number of total doses of different vaccines with time in each country.
# Importing the data
v_by_manu = pd.read_csv("./data/vaccinations/vaccinations-by-manufacturer.csv")
# v_by_manu.sample(10)
The following is the plot for Argentina. The x-axis represents time and the y-axis the total number of doses of each vaccine. We can observe how certain vaccines saw a sudden rise in use.
j = "Argentina"
v_arg = v_by_manu[v_by_manu["location"] == j]
# Draw one line per vaccine used in the country
for i in v_arg["vaccine"].unique():
    v_arg_spu = v_arg[v_arg["vaccine"] == i]
    plt.plot(v_arg_spu["date"], v_arg_spu["total_vaccinations"], label=i)
# Show one tick per quarter instead of one per date
x_ticks = ["2021-01-01", "2021-04-01", "2021-07-01", "2021-10-01", "2022-01-01"]
x_labels = ['1-21', '4-21', '7-21', '10-21', '1-22']
plt.xticks(ticks=x_ticks, labels=x_labels)
plt.legend()
plt.xlabel('Date')
plt.ylabel('Total Vaccinations till Date')
plt.title(j)
We can also create an interactive plot in the following way. When run in a Jupyter notebook, this gives us a dropdown to select the country whose vaccine doses we want to analyse.
area = widgets.Dropdown(
options=v_by_manu["location"].unique(),
value='Argentina',
description='Country',
)
def plotit(area):
    # Filter the data for the selected country
    v_arg = v_by_manu[v_by_manu["location"] == area]
    # Label only the first and last dates on the x-axis
    x_ticks = [v_arg["date"].min(), v_arg["date"].max()]
    # Draw one line per vaccine used in the country
    for i in v_arg["vaccine"].unique():
        v_arg_spu = v_arg[v_arg["vaccine"] == i]
        plt.plot(v_arg_spu["date"], v_arg_spu["total_vaccinations"], label=i)
    x_labels = x_ticks
    plt.xticks(ticks=x_ticks, labels=x_labels)
    plt.legend()
    plt.xlabel('Date')
    plt.ylabel('Total Vaccinations till Date')
    plt.title(area)
interactive(plotit, area=area)
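The per-vaccine loop used in the plots above can also be written as a pivot from long to wide format, after which each vaccine becomes one column. The sketch below uses made-up numbers and vaccine names. It assumes the same `location`, `date`, `vaccine`, and `total_vaccinations` columns that the code above reads.

```python
import pandas as pd

# Toy data (hypothetical values) in the same long format as v_by_manu
toy_df = pd.DataFrame({
    "location": ["X"] * 4,
    "date": ["2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01"],
    "vaccine": ["Vaccine A", "Vaccine B", "Vaccine A", "Vaccine B"],
    "total_vaccinations": [100, 50, 400, 250],
})

# Parse dates so the x-axis is a real time axis rather than text labels
toy_df["date"] = pd.to_datetime(toy_df["date"])

# One row per date, one column per vaccine
wide = toy_df.pivot(index="date", columns="vaccine", values="total_vaccinations")
print(wide)

# wide.plot() would then draw one line per vaccine (requires matplotlib)
```

Parsing the dates first is a design choice: with real datetime values, matplotlib can choose sensible tick positions on its own instead of relying on hand-written tick labels.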
What explains the different percentages of vaccinated people in different countries? We look at two main factors: the economy of the country and its Human Development Index (HDI).
First, let's understand the correlation between vaccination and the economy of a country. We have drawn a scatter plot of the percentage of people fully vaccinated against the GDP per capita of the country. Countries on the same continent share a colour so that we can see trends between continents. We have also plotted the trendline that fits the data.
We can hover over the points to get the country name, GDP per capita, and vaccination percentage for that country.
cols = ['iso_code', 'location', 'continent', 'gdp_per_capita', 'people_fully_vaccinated_per_hundred']
df1 = covid_df[covid_df['date'] == '2021-09-12'][cols]
df1.isna().sum()/df1.shape[0]
df1 = df1.dropna(axis=0)
fig = px.scatter(df1, x='gdp_per_capita', y='people_fully_vaccinated_per_hundred', color='continent', hover_name='location', trendline='lowess', trendline_scope="overall", trendline_color_override="black")
fig.show()
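To put a number on the relationship shown in such a scatter plot, one option is a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, using made-up values. It is an addition for illustration, not part of the original analysis, and the figures are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values), one row per country
toy_df = pd.DataFrame({
    "gdp_per_capita": [1_000, 5_000, 20_000, 40_000, 60_000],
    "people_fully_vaccinated_per_hundred": [2.0, 15.0, 40.0, 70.0, 65.0],
})

# Spearman's rho equals the Pearson correlation of the ranked values
ranks = toy_df.rank()
rho = ranks["gdp_per_capita"].corr(ranks["people_fully_vaccinated_per_hundred"])
print(f"Spearman rank correlation: {rho:.2f}")
```

A value close to 1 would indicate that richer countries tend to have higher vaccination percentages, without assuming that the relationship is linear.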
Based on the above plot, we can make the following inferences:
The second factor that might affect vaccination drives is the Human Development Index (HDI). HDI is a composite measure of a country's life expectancy, education, and per capita income. It indicates the overall development of a country. Below is the scatter plot of the percentage of people fully vaccinated against the HDI. As in the previous plot, countries on the same continent share a colour, and we have also plotted the trendline fitting the data.
cols = ['iso_code', 'location', 'continent', 'human_development_index', 'people_fully_vaccinated_per_hundred']
df = covid_df[covid_df['date'] == '2021-09-12'][cols]
df.isna().sum()/df.shape[0]
df = df.dropna(axis=0)
fig = px.scatter(df, x='human_development_index', y='people_fully_vaccinated_per_hundred', color='continent', hover_name='location', trendline='lowess', trendline_scope="overall",trendline_color_override="black")
fig.show()
Based on the above plot we can make the following observations:
From the analysis above, we get good insight into the effects of Covid-19, the vaccination drives, and the inequalities beneath them. The data we see is often misleading, and severely affected under-developed countries with small populations don't get the help they actually need. The economy and the Human Development Index of a country play a very important role in its vaccination drive. Countries with a high HDI and a strong economy have better vaccination percentages and were able to organize planned vaccination drives in less time. We can also infer that HDI and the economy are positively related. The booster-dose drive is yet to start in many under-developed countries and even in some developing countries. The only way to end the coronavirus pandemic is to get vaccinated. The entire world needs to fight this together, and that can only happen by breaking down the inequalities between developed, developing, and under-developed countries.