COVID-19 Per Capita
Day 3 of our “official” lockdown (we’ve been sheltered in place for 9 days already) and I am breaking my own rule, giving in and completing my first COVID-19 data visualization. I know that everyone’s news feed has been absolutely flooded with coronavirus updates; mine has had a special data science focus, mainly dashboards and data visualizations.
I hesitated for a few reasons. First, like most people, I initially underestimated how serious this pandemic was going to be. Second, while I have a background in statistics and analytics, I believe any significant analysis should be left to the experts, i.e. immunologists and biostatisticians who can provide proper context. Third and finally, so many data visualizations have already been put out there; what could I contribute that would add to the collective good?
But then…I saw the news yesterday (3/26/20).
These articles may be updated over time, so I have included the headlines as well:
So here’s the thing: all those headlines are true…but do they provide the proper context?
I submit that they miss two key factors: population and completed COVID-19 tests. Let’s tackle each of those separately.
First, let’s try to match the numbers provided in the articles above. To do that, we just need to connect to the Johns Hopkins University GitHub repository and pull in the COVID-19 data they maintain and update daily:
library(RCurl)

# Pull data from Johns Hopkins GitHub repo
c0 <- data.frame(read.csv(text = getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-26-2020.csv"), header = T), stringsAsFactors = FALSE)
Now that we have the data, we can roll it up by country, summarize the relevant fields, and produce a list of the top 10 countries by number of confirmed cases:
Confirmed Cases by Country
So yes, the United States did just pass China for the most confirmed cases of COVID-19 on 3/26/20. However, China also had more than four times as many people as the U.S. had in 2019, according to the United Nations in the 2019 revision of World Population Prospects.
To put these numbers in better perspective, I pulled in the 2019 population numbers from Wikipedia to create a per capita summary. You can view the code at the bottom of this post, but here is the ggplot output:
Confirmed Cases per 100,000 People
This is exactly why I believe it’s important for news outlets to report not only the raw numbers, but also the per capita figures. The confirmed cases view would make it seem that the U.S. and Italy are on par, but it ignores that the U.S. has five and a half times as many people as Italy!
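The per capita adjustment itself is a one-line calculation. Here is a minimal sketch of it; the helper name `cases_per_100k` and the two example countries are made up for illustration and do not come from the Johns Hopkins data:

```r
# Hypothetical helper (not part of the original code): cases per 100,000 residents
cases_per_100k <- function(confirmed, population) {
  confirmed / population * 100000
}

# Made-up round numbers, for illustration only: two places with identical
# case counts but a tenfold difference in population
confirmed  <- c(small_country = 5000, large_country = 5000)
population <- c(small_country = 1e6,  large_country = 1e7)

rates <- cases_per_100k(confirmed, population)
print(rates)
```

With equal case counts, the place with ten times the population ends up with one tenth of the rate, which is the effect the raw totals hide.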
Let’s now put these charts side by side for a better comparison.
Putting It All Together
Now with a side-by-side view, you can easily see that while China, the U.S. and Italy all have a similar number of total confirmed cases, China has about 5.7 confirmed cases per 100,000 people, while the U.S. has 25.5 and Italy has 133! Admittedly, I may have just missed this in the news, but I had no idea that Switzerland would be at the top, at nearly 137.5 cases per 100,000 people. To put that in perspective, as of 3/26/20 Switzerland and the U.K. have nearly identical numbers of confirmed cases (only one person off)…but the U.K. also has nearly 8 times as many residents.
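The Switzerland and U.K. comparison comes down to simple arithmetic: with near-identical case counts, the per capita rates differ by roughly the population ratio. A rough sketch using only the rounded figures quoted above (the factor of 8 is approximate, so the result is an estimate rather than a value read from the data):

```r
# Rounded figures quoted in the text above
swiss_rate <- 137.5   # Switzerland, confirmed cases per 100K
pop_ratio  <- 8       # the U.K. has "nearly 8 times" as many residents

# Same number of cases spread over ~8x the people gives ~1/8 of the rate
uk_rate_estimate <- swiss_rate / pop_ratio
print(round(uk_rate_estimate, 1))
```

That puts the U.K. at roughly 17 confirmed cases per 100,000 people, even though its raw total matches Switzerland’s.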
A quick note about testing: the Johns Hopkins data does not contain any information about testing. If it did, I think we could put the number of confirmed cases in much better perspective. A big part of the reason that confirmed cases continue to increase is just because the world continues to test more and more people. The confirmed cases should serve only as a baseline.
I started using a new R package for this project, cowplot, which made it super easy to stack the charts, both vertically and horizontally, while adding custom chart labels as well.
library(rvest)
library(RCurl)
library(dplyr)
library(cowplot)
library(ggplot2)

# Pull data from Johns Hopkins GitHub repo
c0 <- data.frame(read.csv(text = getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/03-26-2020.csv"), header = T), stringsAsFactors = FALSE)
# Summarize by country and get the top 10 countries by confirmed cases
c1 <- c0 %>%
  group_by(Country_Region) %>%
  summarise(Confirmed = sum(Confirmed), Deaths = sum(Deaths), Recovered = sum(Recovered), Active = sum(Active)) %>%
  arrange(-Confirmed) %>%
  top_n(10, Confirmed) %>%
  rename(country = Country_Region) %>%
  mutate_if(is.factor, as.character)

# Shorten long country names for the chart axis
c1$country[c1$country == "Korea, South"] <- "S. Korea"
c1$country[c1$country == "United Kingdom"] <- "UK"
# Create chart for raw confirmed cases (bars sorted from most to fewest cases)
raw <- ggplot(data = c1, aes(x = reorder(country, -Confirmed), y = Confirmed, fill = country)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "", y = "Total Confirmed Cases")
# Wikipedia data source
url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

# Read HTML code from the URL
webpage <- read_html(url)

# Bring in the table with the population by country
c2 <- html_table(html_nodes(webpage, 'table'), fill = T)[]
# Format Wikipedia data to match Johns Hopkins data
c3 <- c2 %>%
  select(`Country or area`, `Population(1 July 2019)`) %>%
  rename(country = `Country or area`, pop_2019 = `Population(1 July 2019)`) %>%
  # Strip bracketed footnote markers such as "[a]" from country names
  mutate(country = gsub("\\[.]", "", country)) %>%
  filter(country %in% c('China', 'United States', 'Italy', 'Spain', 'Germany', 'France', 'Iran', 'United Kingdom', 'Switzerland', 'South Korea'))
# Rename countries to match the labels used in c1
c3$country[c3$country == "United States"] <- "US"
c3$country[c3$country == "United Kingdom"] <- "UK"
c3$country[c3$country == "South Korea"] <- "S. Korea"
# Calculate the rate per 100K residents
c4 <- merge(x = c1, y = c3, by = "country") %>%
  mutate(per_con = (as.numeric(Confirmed) / as.numeric(gsub(",", "", pop_2019))) * 100000) %>%
  arrange(-per_con)
# Create chart for confirmed cases per 100K (bars sorted from highest to lowest rate)
per <- ggplot(data = c4, aes(x = reorder(country, -per_con), y = per_con, fill = country)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "", y = "Confirmed Cases per 100K Citizens")
# Stack the two charts vertically with a label on each
plot_grid(raw, per, ncol = 1, align = "v", labels = c('Total Confirmed Covid-19 Cases by Country', 'Confirmed Covid-19 Cases by Country per 100K'))
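The `plot_grid` call above stacks the charts vertically; switching to a horizontal layout is mostly a matter of changing `ncol`/`nrow` and `align`. Here is a small sketch of both layouts. It uses two toy charts built from R’s built-in `mtcars` data rather than the COVID-19 charts, and the names `p1`, `p2`, `stacked` and `side_by_side` are only for illustration:

```r
library(ggplot2)
library(cowplot)

# Two toy charts from the built-in mtcars data (stand-ins for raw and per)
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + theme_minimal()
p2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar() + theme_minimal()

# One column: charts stacked vertically, aligned along their vertical axes
stacked <- plot_grid(p1, p2, ncol = 1, align = "v", labels = c("A", "B"))

# One row: charts placed side by side, aligned horizontally
side_by_side <- plot_grid(p1, p2, nrow = 1, align = "h", labels = c("A", "B"))
```

Printing `stacked` or `side_by_side` at the console draws the combined figure.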
A Note of Hope
I for one am torn between watching the news to keep up with the world and avoiding it for its impact on my mental health. This analysis is not meant to downplay the severity of confirmed cases in any country, but rather to help put things in perspective. While things are dark now, they will get better soon, and there is still a lot of positivity in the world. For me, this tweet really brightened my day:
(@momuses) March 25, 2020