Data Literacy

Navigate the Data Deluge.

The spread of COVID-19 has brought numbers, statistics and charts to the forefront of all of our lives. However, with so much data out there it’s hard to fully digest and make sense of it all. Furthermore, due to the authority of numbers, we are more likely to accept them at face value without questioning their veracity.

Numbers represent a refinement of a data set, taken on a sample of the population, by humans who make mistakes, using error-prone tools, in constantly changing conditions at a single point in time. It is then analysed by another human (who works for an organisation with their own motives) before the statistic or graph is delivered to you by a publisher (who has their own interests in persuading you of a story).

We must remember data are a helpful reduction of the world and do not capture everything. This is not to say we should abandon statistics altogether but rather recognise the many factors influencing the final numbers we see.

Confirmed Cases

COVID-19 Cases

Due to the rapid spread of the virus, conclusions are often derived from sparse or incomplete data. Being able to understand the deficiencies in COVID-19 data is essential to protect ourselves from incorrect information.

Confirmed Cases: Estimated by the number of positive test results collected and reported in any given period.

In the absence of testing everyone, confirmed cases likely under-represent the number of true cases due to:

    • Testing lag: Infected people can take up to 14 days to show symptoms and might wait even longer to get tested.
    • Processing time: All reported numbers are aggregates of numbers reported by state or local public health organisations which aren’t always updated in a timely manner.

As this data shapes policy, we, the public, need to know the truth surrounding the confirmed cases – even if the truth means a great deal of uncertainty which will not be resolved for some time. 

Always check the quality of your data to understand limitations before making decisions.

Number of Deaths

COVID-19 Deaths

Number of Deaths: No standard definition exists. 

True deaths from COVID-19 are most likely underestimated due to:

    • Changing definitions over time: Several countries have changed the requirements to be counted as a COVID-19 death which can drastically affect the number (e.g. the number of deaths in France attributed to the virus soared in early April after officials began including previously unreported deaths in nursing homes, boosting the count by more than 2,000). Even within a country, official statistics can vary.
    • Lack of post-mortem testing: Often seen as a misuse of scarce resources that could be used on the living. In addition, medical examiners don’t typically get involved with deaths caused by COVID-19.
    • Misattribution (e.g. someone dying from causes, such as a heart attack or stroke, triggered by the infection, with no one ever suspecting or testing for the disease).
    • Under-reporting: The death toll has become a heavily politicised benchmark with pressures to keep this as low as possible.
    • Processing time: All reported numbers are aggregates of numbers reported by state or local public health organisations which aren’t always updated in a timely manner.

This is not an isolated issue: a 2013 study of the 2009 H1N1 (swine flu) pandemic in the U.S. found that lab-confirmed tests accounted for only about 1 in every 7 deaths attributable to the disease.

Always check the quality of your data to understand limitations before making decisions.

Data Lag

COVID-19 Data Lag

One of the challenges of tracking the virus is the lag time between an individual becoming infected, acquiring symptoms, seeking care, testing positive and being reported.

Today’s reported cases most probably belong to people who started showing symptoms a few days ago which means they could have contracted the virus up to two weeks ago.

Similarly, when talking about today’s reported deaths, we are referring to events which occurred in the past. Deaths aren’t reported immediately or consistently by federal and state/territory health authorities and as such, numbers change retrospectively as new data comes in. This is especially true in places with overworked hospitals, demonstrated below for NYC.

This chart shows how the reported deaths per day for the period March 11 – April 1 kept increasing as late reports arrived in April.

Premature reports of daily death tolls are dangerous because they distort our perception of the progress of the epidemic. The misleadingly low numbers announced each day can make it look like we’re perpetually on the verge of peaking – giving us false hope the worst is over, when in fact the true peak may remain days or weeks away.

We should also be sceptical of daily spikes as they tend to represent backlogs being cleared or a change in processes instead of real increases.

Numbers can give us a sense of what’s happening, as long as we recognise their flaws.

Source Comparison

COVID-19 Data Sources

The three key publicly-available COVID-19 data sets are JHU, ECDC and WHO. All follow a similar trend but variances exist due to differences in:

  1. Underlying sources.
    1. WHO: Direct reports from national governments (UN organisation).
    2. ECDC: Multiple sources including governments, the WHO, as well as other national, health and media authorities.
    3. JHU: Multiple sources including DXY, ECDC, WHO, as well as other private, government, state, health and media authorities.
  2. Criteria definitions. E.g. JHU includes presumptive cases.
  3. Timing of retrospective data changes. E.g. the WHO were slow to report 1,700+ Chinese health care workers who became infected (Feb spike).
  4. Daily data cut off times. The WHO and ECDC periods end 0000 CET while JHU’s ends 0200 CET. Prior to 18/03/20 the WHO’s ended at 0900 CET.

In addition to publishing these data sets daily, each source maintains dashboards to provide a current view of events. These update at different frequencies which means the time of day can also impact the numbers you see. ECDC updates its dashboard once a day, WHO three times a day, and JHU’s is near real time.

Too much faith in a single data set can blind you to its limitations. Always use more than one data source when making decisions.

Testing Metrics

Testing numbers have become highly politicised as governments around the world lean on different metrics to present themselves in a better light. Two commonly used metrics are:

    • Total Tests (referred to as “testing” by Donald Trump).
    • Tests per population unit (per 100,000 people by Scott Morrison).

“Just reported that the United States has done far more “testing” than any other nation, by far! Over an 8 day span, the US now does more testing than South Korea does over an 8 week span.” – Donald Trump 25/03/2020

“Australia has now reached a testing rate of more than 1000 tests per 100,000 population. We are the first country, to the best of our knowledge, that has been able to exceed that mark.” – Scott Morrison 02/04/2020

COVID-19 Testing

Large populations can hide low testing rates per population unit (left). The right chart more accurately reflects testing coverage.

In the midst of differing fiscal situations, healthcare systems, self-sufficiency, and social and political relations, one way to standardise for country comparisons is to use testing rates per population unit.

Avoid drawing conclusions from raw numbers when comparing values among groups of different sizes.

Sampling Bias

COVID-19 Sampling Bias

As COVID-19 spreads throughout the world, many countries are faced with a growing number of cases and limited testing kits. In certain places, this has led to tighter restrictions on who gets tested and thus a higher proportion of positive results (as testing is reserved for the more severe cases).

One metric which assists in identifying this bias is the number of COVID-19 tests per confirmed case. Comparing this metric can give us an indication of how successful countries have been in tracking and diagnosing the true extent of the pandemic.

The higher the number, the more accurately confirmed cases reflect the true number of infections in those countries (as cases of mild/moderate symptoms are more likely to have been identified).

The chart tells us Australia’s confirmed cases are likely a better estimate of our true cases when compared to countries like the USA, UK, Italy and France (which likely have a higher proportion of undiagnosed cases).

This does not mean we are immune from sampling bias, only that our figures are probably less impacted by it.

When unrepresentative samples are used to draw conclusions about a population, the results should always be questioned.

Tests per Day

COVID-19 Tests

Scott Morrison has declared Australia has one of the most rigorous coronavirus testing systems in the world. Many media agencies have echoed his thoughts, relaying Australia is testing an average of over 10,000 people per day.

The chart below shows the number of daily tests in Australia for the past month.

When dealing with a virus which can spread so quickly, quoting averages as evidence for successful testing rates can mislead the public. Each new day is different and as important as the last. Grouping days together to produce an average masks this individual variation. In addition, the average value itself almost never occurs on any single day.

Furthermore, the statement of testing an average of 10,000 people per day, while technically correct, doesn’t tell us anything about trends over time. From the 1st to 14th April tests per day trended downwards (a time when many countries were ramping up testing). From the 15th of April, we saw our testing criteria relaxed which started to reverse the trend.
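A short sketch of why an average alone hides the story. The daily counts below are hypothetical, chosen so that the mean sits just above 10,000 while the underlying days fall and then recover.

```python
from statistics import mean, stdev

# Hypothetical daily test counts for eight consecutive days (not real data).
daily_tests = [14_000, 12_500, 11_000, 9_500, 8_000, 7_500, 9_000, 10_500]

average = mean(daily_tests)
spread = stdev(daily_tests)

print(f"Average tests per day: {average:,.0f}")
print(f"Standard deviation:    {spread:,.0f}")
print(f"Lowest / highest day:  {min(daily_tests):,} / {max(daily_tests):,}")
print(f"Any day equal to the average? {average in daily_tests}")

# Day-on-day changes reveal the trend that the average conceals.
changes = [later - earlier for earlier, later in zip(daily_tests, daily_tests[1:])]
print(f"Day-on-day changes: {changes}")
```

The average is technically correct, but only the spread and the day-on-day changes show that testing fell for most of the period before turning around.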

Averages quoted alone are dangerous pieces of information. Always plot your data and look for other summary statistics, especially variation.

Modelling Part 1

COVID-19 Modelling

On the 7th of April the Federal Government released modelling on how COVID-19 could affect Australia. The modelling compared the peak ICU bed demand under three scenarios (uncontrolled spread, isolation + quarantine and isolation + quarantine + social distancing).

It’s important to remember all models are built on assumptions. Small changes in these assumptions can produce wildly divergent outcomes. Assumptions used in this modelling include:

    • Overseas data accurately reflects Australia’s circumstances.
    • How the virus behaves: its ability to spread (transmission parameters).
    • How humans behave: a 25% to 33% reduction in social contacts.
    • No regional variation in how the virus or humans behave.
    • The proportion of infected individuals requiring hospital and ICU beds.
    • The average bed days for inpatients (hospital 7.5 days and ICU 10 days).
    • Excluding the probability of a subsequent peak(s).
    • Consistent implementation of any measures over time and across states.

The modelling is highly sensitive to all these variables, especially those regarding how the virus behaves. For these types of models, simply changing the virus’s reproductive number (R_0: how many new people on average each virus carrier infects) from 2.3 to 2.4 can triple the outcome.
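To get a feel for this sensitivity, the sketch below uses the simplest possible growth rule, in which every generation of infections is R_0 times larger than the last. This is an illustrative assumption and not the Government's model: it ignores immunity, interventions and every other input listed above. The function names are hypothetical.

```python
import math


def cases_after(r0: float, generations: int, seed_cases: int = 1) -> float:
    """Cases in a given generation under unchecked growth (toy assumption)."""
    return seed_cases * r0 ** generations


def generations_until_ratio(r_low: float, r_high: float, ratio: float) -> int:
    """Smallest number of generations after which the higher R_0 gives
    at least `ratio` times as many cases as the lower R_0."""
    return math.ceil(math.log(ratio) / math.log(r_high / r_low))


n = generations_until_ratio(2.3, 2.4, 3.0)
gap = cases_after(2.4, n) / cases_after(2.3, n)
print(f"After {n} generations, R_0 = 2.4 yields {gap:.2f}x the cases of R_0 = 2.3")
```

Under this toy rule a difference of 0.1 in R_0 compounds every generation, which is why a seemingly small change in one input can move the output so far.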

Model outputs can only reflect their assumptions and at this stage, due to the small amount of viable data, our inputs contain a lot of uncertainty.

Modelling outputs should never be considered as truth.

Modelling Part 2

On the 16th of April, the Federal Government released its second round of modelling for the current conditions of COVID-19 in Australia. Referred to as ‘nowcasting’, the modelling uses real Australian data to estimate the symptomatic case detection rate and effective R_0 (the virus’ reproductive number when there is some immunity or intervention measures in place).

Key findings: Australia is detecting approximately 92% of all symptomatic cases with an effective R_0 < 1 in all states except Tasmania. While this message was widely communicated in the media, the assumptions feeding into these results were not. The modelling assumed:

    • The Chinese case fatality rate (the proportion of confirmed positive patients who end up dying of the disease) accurately reflects the true baseline case fatality rate.
    • There are no differences in age-structure or differential risks of severe outcomes across age groups when compared to China.
    • A constant proportion of cases are detected over time.

The three types of analysis utilised to calculate the symptomatic case detection rate and the effective R_0 varied in assumptions regarding:

    • The serial interval (time between symptom onset in an infector and an infectee).
    • Levels of infectiousness between “Imported” and “Local” cases.
    • Inclusion/exclusion of “Imported” or “Under Investigation” cases.
    • Attribution of “Unknown” or “Unconfirmed” source of acquisition.
    • Accounting for delays in reporting.
    • Size of initial outbreak.

It’s important to remember assumptions are propagated through models; many small uncertainties can together yield significant aggregate uncertainties. As such, the greater the number of critical assumptions used in a COVID-19 model, the more rigour we need to apply in our assessment of their plausibility and applicability to Australia.
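As a rough illustration of how uncertainty compounds, suppose a model output is the product of several inputs and each input could be off by 10% in either direction. Both the multiplicative structure and the 10% figure are assumptions made for this sketch, not properties of the Government's modelling.

```python
def compounded_range(relative_error: float, n_factors: int) -> tuple[float, float]:
    """Return the (lowest, highest) multiplier on the output when each of
    n_factors multiplied inputs is off by up to +/- relative_error."""
    lowest = (1 - relative_error) ** n_factors
    highest = (1 + relative_error) ** n_factors
    return lowest, highest


for n_factors in (1, 3, 6):
    lowest, highest = compounded_range(0.10, n_factors)
    print(f"{n_factors} inputs, each +/-10%: output between "
          f"{lowest:.0%} and {highest:.0%} of the central estimate")
```

With six such inputs the output could plausibly sit anywhere from roughly half to nearly double the central estimate, even though no single input is far off.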

Modelling results should always be carefully evaluated in light of their assumptions.

Modelling Part 3

COVID-19 Modelling

Initial COVID-19 models were based on assumptions for which we had little to no data (e.g. the exact value of R_0, the proportion of infected patients who require hospitalisation and the proportion of infected patients who will die).

Good models don’t use a single exact number for each assumption but instead use a probability distribution (a range of numbers with differing degrees of likelihood) for the input. This accepts the inherent uncertainty in each variable as it’s almost impossible to know its actual value with total certainty. Unfortunately, not all COVID-19 models were built like this.
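The idea of feeding a distribution rather than a single number into a model can be sketched with a small simulation. Everything here is an assumption for illustration: the uniform range for R_0, the toy growth rule (each generation is R_0 times larger than the last) and the simple percentile helper. It is not how any official model is built.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def cases_after(r0: float, generations: int = 10) -> float:
    """Toy model: unchecked growth from a single case."""
    return r0 ** generations


# Assumed input distribution: R_0 equally likely anywhere between 2.0 and 3.0.
outcomes = sorted(cases_after(rng.uniform(2.0, 3.0)) for _ in range(10_000))


def percentile(sorted_values: list[float], p: float) -> float:
    """Return the value at roughly the p-th percentile of a sorted list."""
    index = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[index]


low, median, high = (percentile(outcomes, p) for p in (2.5, 50, 97.5))
print(f"Median outcome: {median:,.0f} cases")
print(f"95% of simulated outcomes fall between {low:,.0f} and {high:,.0f} cases")
```

Reporting the range alongside the median makes the uncertainty in the input visible in the output, instead of hiding it behind a single figure.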

Luckily for us, Australia’s modelling DOES incorporate this uncertainty in its assumptions. This is demonstrated through the inclusion of confidence intervals with each output. Confidence intervals communicate how certain we are the correct answer lies within a range produced by the model.

The charts below show the 50% (dark blue) and 95% (light blue) confidence intervals for the average symptomatic case detection rate (dark blue line). ACT’s results should be read as “We are 95% confident the average symptomatic case detection rate is between 30% and 100% for ACT”.

The real danger in all of this is when the media presents modelling results without confidence intervals to the public. This can result in a false sense of certainty around the pandemic and wrongly inform decisions. 

Always look for confidence intervals to communicate uncertainty and assign confidence, statistical or otherwise.

See It to Believe It

COVID-19 Cases

Visual information is powerfully and unavoidably persuasive in ways text and speech aren’t. An often-repeated, though poorly sourced, claim is that the human brain processes images 60,000 times faster than text and that 90% of information transmitted to the brain is visual. Whatever the exact figures, data visualisations are extremely effective ways to tell a story and convince a reader of a particular point.

When demand for data is high, agencies further capitalise on our susceptibility to visual information. Watching the pandemic evolve through data visualisations therefore provides an interesting lens into the motives of the authors behind them. Their goal is to persuade instead of inform and they leverage a well-known toolkit to make this happen (choosing the right chart, designing for their audience, etc.).

The chart below was created on the 22nd March for maximum eyeballs, to sell a story. What better story than “Everybody panic!”? The choice of cumulative cases matches this intent as values can only ever stay the same or increase. This chart also masks the rate at which the virus is spreading and the success of any intervention measures in place.

Your immediate visceral reaction to this chart was no accident. The author knew their objective (you to click/read the article) and presented the data in a specific visual way to make this happen.

When looking at data visualisations, the goal is to separate the information from its packaging. Be alert to the author’s motive and how it might manipulate the overall presentation.


COVID-19 Cases

As COVID-19 continues to sweep the globe, media and government agencies use data visualisations to both inform and influence. Each visualisation tells a different story as authors choose to show the metric most in favour of their message.

Recent government messaging around social distancing has included orders to stay at home other than for essential activities. To demonstrate this was the right decision, charts like the one above have been circulated.

This type of visualisation clearly shows the benefits of social distancing. It should therefore come as no surprise it was made by an organisation that just so happens to enforce social distancing. 

The point here is not that social distancing isn’t working (because it is) but instead that certain charts and metrics are presented to the public for a reason. As people begin to draw parallels between the shape of this curve and COVID-19 modelling around the world, they may start to believe we have successfully contained the virus and no future peaks will occur.

The danger in drawing this bold conclusion from the above is that the chart tells us nothing about detection rates, local transmission rates or how long measures will last which can all influence the probability of a second wave.

Authors’ motives influence the chart and metric you are presented with; always try to look at more than one type to get the full story.

Log Scales

COVID-19 Cases

Infectious diseases don’t spread in an even or linear fashion. As such, using a linear scale to track cumulative numbers makes it difficult to draw conclusions about the rate of spread and the success of interventions in place. This is because the line will continue to increase in the same direction, creating the impression social distancing isn’t working (first chart). One way to show this exponential growth in a way which helps better explain the spread is to use a logarithmic (log) scale (second chart).

On a log scale, numbers on the y-axis don’t move up in equal increments (e.g. 0, 10, 20, 30, etc.) but instead increase by a constant factor, often 10 (e.g. 1, 10, 100, etc.). Compressing the y-axis in this way allows us to see when the rate of growth starts to slow which may indicate certain public health measures are starting to have the desired effect.
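A small sketch of what the log scale does to the numbers. The series below is hypothetical: a cumulative count that doubles every day from 100 cases.

```python
import math

# Hypothetical cumulative case counts that double each day (not real data).
cumulative = [100 * 2 ** day for day in range(6)]

# On a linear axis, each day's step is bigger than the last.
linear_steps = [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

# On a log10 axis, constant-rate growth produces equal steps (a straight line).
log_steps = [
    math.log10(later) - math.log10(earlier)
    for earlier, later in zip(cumulative, cumulative[1:])
]

print("Cumulative cases:", cumulative)
print("Linear steps:    ", linear_steps)
print("Log10 steps:     ", [round(step, 3) for step in log_steps])
```

Because steady exponential growth is a straight line on a log axis, any bend in that line signals a change in the growth rate, which is the thing the linear chart hides.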

While log scales are an effective way to visualise changes in exponential growth rates, they also have the potential to downplay the urgency of the pandemic. The recent levelling out in April could suggest the virus is under control, whereas in reality, the WHO reported a new highest daily number of confirmed cases on April 25th. The same mistaken conclusion could have been drawn in early March.

Always check for irregular axis intervals to determine how the data should be interpreted.

Excess Deaths

COVID-19 Excess Deaths

It is widely accepted official COVID-19 death tolls under-count the true number of fatalities caused by the disease (see Element #2). Furthermore, many argue this metric doesn’t accurately measure the true impact on the population. As a result, a better way is to look at a region’s “excess deaths” (the gap between the total number of people who died from any cause in 2020 and the historical average) to reveal what’s really happening.

The charts below show excess deaths from different urban areas worldwide, reinforcing the idea fatalities are much greater than those reported.

There are three possible explanations for the red excess deaths above:

    • COVID-19 deaths.
    • Under-reported or misattributed COVID-19 deaths.
    • Collateral damage: Deaths of uninfected people whose normal medical treatment has been disrupted. (E.g. reluctance to attend hospital in spite of illness, unable to get treatment due to lack of hospital capacity).  

All three scenarios are still COVID-19 related deaths. As shown historically, if we really want to understand how the pandemic has impacted total mortality, excess deaths provide the most accurate estimate.
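The calculation behind excess deaths is a subtraction. The sketch below uses hypothetical weekly figures for an imaginary city; none of the numbers are real.

```python
def excess_deaths(observed_deaths: int, historical_average: int) -> int:
    """Deaths from all causes above what is normal for the same period."""
    return observed_deaths - historical_average


# Hypothetical figures for one week in one city (not real data).
observed_all_causes = 1_900
historical_average = 1_200
reported_covid_deaths = 450

excess = excess_deaths(observed_all_causes, historical_average)
not_in_official_toll = excess - reported_covid_deaths

print(f"Excess deaths:                   {excess}")
print(f"Officially reported COVID-19:    {reported_covid_deaths}")
print(f"Excess deaths not in that count: {not_in_official_toll}")
```

The final figure covers the second and third explanations in the list above: under-reported or misattributed COVID-19 deaths and collateral damage.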

Excess deaths provide an objective, consistent and comparable measurement across time and space to assess the scale of the pandemic (and to ultimately formulate lessons learned).

False Causality

A preliminary study released on the 21st March by the New York Institute of Technology found countries with a Bacillus Calmette-Guerin (BCG) vaccination policy saw 10 times lower incidence and mortality of COVID-19 when compared to those without a BCG policy. It went on to imply causality by saying “BCG vaccination also reduced the number of reported COVID-19 cases in a country”.

Statements like this are dangerously misleading as ecological studies are inherently limited – they take aggregate data and use it to make inferences at the individual level. What may hold at the country level will not necessarily be true for every person in that country (the ecological fallacy).

In addition, when underlying factors influence both the exposure (BCG vaccine) and the outcome (COVID-19 cases/deaths), any correlation found can be spurious (this situation is known as confounding). Such factors relating to this study include:

    • Poor data quality of confirmed cases (see Element #1): Underestimation in lower income countries could explain the authors’ observed results (e.g. India has one of the lowest COVID-19 testing rates in the world).
    • Timing: The study was done early in the pandemic and since the paper was published, many countries applying BCG (such as India or South Africa) have seen considerable increases in cases.

The danger lies in the fact the study received a lot of media attention, even before peer-review. Accepting these findings at face value can create a false sense of security and lead to inaction, especially from the most vulnerable populations who received BCG vaccination at birth.

Correlation does not imply causation. Randomised controlled trials are required to establish causality.

Herd Immunity

COVID-19 flattening the curve herd immunity

Medicine, money and mathematics have all been used to combat the COVID-19 pandemic. As nations attempt to navigate the virus, experts use models to estimate the impact on healthcare systems. Policies are designed (e.g. social distancing) to ensure the number of daily cases remains below a threshold (based on available ICU beds). The goal is to minimise lives lost. This is commonly referred to as ‘flattening the curve’.

To protect the most vulnerable from COVID-19, the general population must achieve herd immunity through either a vaccine or infection.

This occurs when a significant proportion of the population develops immunity to the disease, effectively stopping the spread of the virus. The Australian Government estimates the R_0 of the virus as 2.53, which means 60% of the population will need to contract the virus in order to achieve herd immunity.

An unrealistic best case scenario (which minimises loss of life) sees 6920 ICU beds available in Australia and 5% of cases requiring ICU admission. Therefore, the maximum number of cases we can experience at any time to ensure enough ICU beds is 138,400. As each patient requires on average 10 days in ICU (and assuming we want to reach herd immunity as quickly as possible), we can experience a maximum of 13,840 new daily cases. 

Assuming this unrealistic best case scenario and a population of 25,464,116, it would take Australia over 3 years to achieve herd immunity via infection. Considering no vaccine has ever been developed for a coronavirus, if achieving herd immunity this way is our goal, we may be in for a long ride.
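The arithmetic in this scenario can be reproduced in a few lines. All inputs are the figures quoted above; the one addition is the standard herd immunity threshold formula, 1 - 1/R_0, which is consistent with the 60% quoted for an R_0 of 2.53. Variable names are illustrative.

```python
r0 = 2.53                 # Government estimate quoted above
population = 25_464_116   # Australian population quoted above
icu_beds = 6_920          # available ICU beds in the best-case scenario
icu_share = 0.05          # share of cases requiring ICU admission
icu_days = 10             # average days each patient spends in ICU

# Standard herd immunity threshold: the share of people who must be immune.
threshold = 1 - 1 / r0

# Most cases the ICU system can carry at once, and the daily intake it implies.
max_concurrent_cases = icu_beds / icu_share
max_new_cases_per_day = max_concurrent_cases / icu_days

days_needed = population * threshold / max_new_cases_per_day
years_needed = days_needed / 365

print(f"Herd immunity threshold: {threshold:.1%}")
print(f"Max concurrent cases:    {max_concurrent_cases:,.0f}")
print(f"Max new cases per day:   {max_new_cases_per_day:,.0f}")
print(f"Time to herd immunity:   {days_needed:,.0f} days ({years_needed:.2f} years)")
```

Even with every input set to its most favourable value, the result is a little over three years, which is why best-case scenarios are useful as a floor on the timeline.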

While it’s impossible to know how long it will take to achieve natural herd immunity, best case scenarios can shed light on minimum timelines.

COVIDSafe Adoption

COVIDSafe App Adoption

The Government has released its new contact tracing app (COVIDSafe) as it looks for strategies to lift social distancing measures whilst minimising the risk of resurgence. When tracing is delayed (even by a day) we can lose control of the pandemic, resulting in more people quarantined and more deaths. Digital tracing provides speed and scalability.

The app, based on Singapore’s TraceTogether software, uses Bluetooth to detect other phones running the app nearby (1.5m). This data is retained on phones for 21 days (maximum incubation period + time taken for positive test result). If an app user tests positive, data for phones that recorded 15 minutes or more of contact (duration likely to put someone at risk of infection) are uploaded to a Government server, making it easier for health authorities to contact potentially infected users.

Scott Morrison says the required take-up rate for this to be successful is 40%; however, experts elsewhere have suggested this is too low. With a 40% uptake, a quick calculation indicates the probability that both people in a random encounter have the app is only 16% (40% x 40%), which is quite low.

Oxford’s Big Data Institute calculated an adoption rate of 80% of all phone users (56% of population) is required to reliably suppress an epidemic (below).
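The quick calculation can be written out as follows. It assumes app users are spread evenly through the population and that whether one person has the app is independent of whether the other does; real adoption is unlikely to be that uniform, so treat this as a rough sketch.

```python
def both_have_app(adoption_rate: float) -> float:
    """Probability that both people in a random encounter run the app,
    assuming independent and evenly spread adoption."""
    return adoption_rate ** 2


for label, rate in [
    ("40% of population (Government target)", 0.40),
    ("56% of population (Oxford figure)", 0.56),
]:
    print(f"{label}: {both_have_app(rate):.0%} of encounters traceable")
```

An encounter is only traceable when both phones run the app, which is why coverage of encounters is always lower than the adoption rate itself.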

If the probabilities of two independent events are low, the probability of both occurring together is lower still.


COVID-19 Mutations

Along with varying human DNA, testing levels, testing accuracy, sampling bias, asymptomatic cases, definitions, misattribution, intervention measures, political pressures and data lag, different strains of COVID-19 could also be a reason for the high levels of uncertainty and variability around official statistics. For example:

A Cambridge University study analysed the first 160 virus genomes sequenced from human patients and found three distinct strains of COVID-19 around the globe. These are shown by the larger circles below.

Each time a person is infected with COVID-19, the virus replicates and roughly six genetic mutations occur. While we haven’t witnessed these mutations alter transmissibility or lethality just yet, some mutations can lead to stronger or weaker viruses. Scientists are watching closely in case these types appear and how they affect the virus’ characteristics.

The absence of evidence is not the evidence of absence.

Fatality Rates

COVID-19 Fatality Rates

Calculating the probability someone who is infected with COVID-19 dies from the disease is surprisingly difficult. The Case Fatality Rate and Crude Mortality Rate are two rates presented in the media which attempt to estimate the true fatality rate for an infected person. They each have their own limitations and vary significantly by region. 

Case Fatality Rate (CFR) = Confirmed Deaths / Confirmed Cases. 

    • Most common rate seen in the media.
    • Confirmed Cases under-represent true cases (Element #1). Because the denominator is too small, the rate is inflated and small changes in the numerator have a larger impact on it (e.g. the CFR looks worse in countries with less widespread testing or overworked hospitals as it will seem like a higher proportion of infected people are dying).
    • Confirmed Deaths under-represent true deaths (Element #2).
    • The CFR is volatile while an epidemic is still going. It stabilises over time as confirmed cases and deaths approach their true figures.

Crude Mortality Rate (CMR) = Confirmed Deaths / Total Population. 

    • Probability an individual will die from the disease regardless of being infected or not.
    • The CMR usually increases over time as confirmed deaths increase to accurately reflect the true number.
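The two formulas above can be compared side by side in a short sketch. All figures are hypothetical, including the assumed number of true infections, which in practice is unknown.

```python
def case_fatality_rate(confirmed_deaths: int, confirmed_cases: int) -> float:
    """CFR = Confirmed Deaths / Confirmed Cases."""
    return confirmed_deaths / confirmed_cases


def crude_mortality_rate(confirmed_deaths: int, total_population: int) -> float:
    """CMR = Confirmed Deaths / Total Population."""
    return confirmed_deaths / total_population


# Hypothetical region (not real data).
confirmed_deaths = 500
confirmed_cases = 10_000
total_population = 5_000_000
assumed_true_infections = 50_000  # assumption: only 1 in 5 infections confirmed

cfr = case_fatality_rate(confirmed_deaths, confirmed_cases)
cmr = crude_mortality_rate(confirmed_deaths, total_population)
rate_if_all_found = case_fatality_rate(confirmed_deaths, assumed_true_infections)

print(f"CFR: {cfr:.1%}")
print(f"CMR: {cmr:.2%}")
print(f"Rate if every infection had been confirmed: {rate_if_all_found:.1%}")
```

The same death count produces three very different rates depending on the denominator, which is why it matters which one a headline is quoting.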

Before drawing conclusions, it’s best to understand how metrics have been calculated and what limitations may cause them to be more/less volatile over time.


COVID-19 Comparisons

Comparing the daily global mortality of COVID-19 at the very start of an emerging pandemic with the steady state mortalities of other diseases is like comparing apples and oranges. The other diseases have been around for a long time, allowing us to collect enough data to comfortably draw conclusions about mortality rates. COVID-19, however, has not reached maximum circulation and the number of people dying per day is constantly changing.

The danger in comparing COVID-19 like this is that it may lead the audience to perceive the risk of COVID-19 as low despite its incredible potential to overtake all of these in a very short period of time. This is already evident in the above chart progression throughout time (left to right).

At best, these charts are an inaccurate comparison due to major differences in our knowledge of treatment and testing resources for COVID-19 compared to other diseases. At worst, they significantly understate the seriousness of COVID-19 and may lead people to ignore the advice of public health professionals on social distancing and other individual actions which can slow the spread of the virus.

Always make like for like comparisons when using insights to make decisions.


Infection Fatality Rate

One of the biggest unanswered questions around the world is “what percentage of people who get COVID-19 will die of the disease?” Also known as the Infection Fatality Rate (IFR), this seemingly easy question is incredibly hard to answer as the numerator (COVID-19 deaths) and denominator (COVID-19 infections) are both moving targets in the midst of a pandemic.

In saying this, the missing data on deaths is likely dwarfed by the expected increase in the denominator when the total number of infections is better understood (including asymptomatic and undiagnosed cases). This means the CFR (Element #19) almost certainly overstates the lethality of the virus.

Knowing the IFR is important because it informs modelling predictions and therefore assists in calibrating interventions.

Serology surveys (testing for antibodies to get a better sense of total infections) point to an IFR anywhere between 0.12% and 1.08%. The latter should be given more weight due to:

    • A meta-analysis of 13 studies from around the world gave an overall estimate of 0.75% IFR, with the 95% confidence interval ranging from 0.49% to 1.01%.
    • One of the most exhaustive and up-to-date peer-reviewed studies, by researchers at Imperial College London, estimates the IFR to be 0.66%, with the 95% confidence interval ranging from 0.39% to 1.33%.
    • For the low-end estimate of a 0.1% IFR to be correct, there must be 230,000 undetected infections in South Korea (currently 10,800 confirmed cases). This seems unlikely in a country where the outbreak is under control.

In other words, the best guess of the proportion of people who die from COVID-19 infections seems to be between 0.49% and 1.01% (meta-analysis). If the IFR of the seasonal flu is 0.04%, COVID-19 is roughly 12 to 25 times deadlier. Despite the large range, it gives us an idea of what the plausible reality is likely to be.
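The comparisons in this section can be checked with simple ratios. The sketch uses only the figures quoted above; the variable names are illustrative.

```python
# IFR figures quoted above, expressed in percent.
covid_ifr_low, covid_ifr_high = 0.49, 1.01   # meta-analysis 95% interval
flu_ifr = 0.04                               # seasonal flu figure quoted above

ratio_low = covid_ifr_low / flu_ifr
ratio_high = covid_ifr_high / flu_ifr
print(f"COVID-19 vs seasonal flu: {ratio_low:.1f}x to {ratio_high:.1f}x deadlier")

# South Korea check: undetected infections needed for a 0.1% IFR to hold.
confirmed_cases = 10_800
undetected_needed = 230_000
undetected_per_confirmed = undetected_needed / confirmed_cases
print(f"Undetected infections per confirmed case: {undetected_per_confirmed:.1f}")
```

For the 0.1% figure to hold, South Korea would need to have missed about 21 infections for every one it confirmed.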

It’s very hard to draw accurate estimates from poor quality data.

Risk Factors

While scientists are confident the virus isn’t mutating very quickly, it appears COVID-19 can be more or less deadly in different countries. When tracking lethality it’s important to remember the Infection Fatality Rate (IFR) is not an innate feature of the pathogen. The IFR depends on many factors which are unevenly distributed throughout the population. Such variables include:

    • Age: Estimated IFRs only rise above 0.5% after age 50 and peak at 13% for those above 80. This means, holding all else constant, countries with an older population will experience more deaths from the disease. This is evident when looking at the death toll in Italy, the country with the second-oldest population in the world.
    • Underlying Health: It has become apparent COVID-19 is deadlier for those with existing conditions such as lung disease, cardiovascular disease, severe obesity, diabetes, hypertension, chronic respiratory conditions and cancer.
    • Access to Treatment: The IFR increases if the healthcare system is overwhelmed, as well as by the extent to which this happens. Speed and effectiveness of government responses are vital factors in determining this. Controlling the outbreak early (e.g. Taiwan) reduces the impact on the healthcare system which keeps the IFR low.

As a result, your personal risk of death from COVID-19 will not match the global, country-specific or even local IFR. These aggregated measures don’t take into account the personal characteristics which alter your individual probability. Other variables such as dose and dosing route are also thought to influence severity.
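To see how an aggregate IFR hides its subgroups, here is a minimal sketch that treats the overall IFR as a weighted average of age-specific IFRs. Apart from the 13% figure for the over-80s, every number below is a hypothetical placeholder, as are the age bands and the two example populations.

```python
# Hypothetical age-specific IFRs in percent. Only the 13% figure for the
# over-80s comes from the text; the other two values are placeholders.
IFR_BY_AGE = {"under 50": 0.1, "50 to 79": 2.0, "80 and over": 13.0}

# Hypothetical age structures (share of infections in each age band).
POPULATIONS = {
    "younger country": {"under 50": 0.75, "50 to 79": 0.22, "80 and over": 0.03},
    "older country": {"under 50": 0.55, "50 to 79": 0.38, "80 and over": 0.07},
}


def aggregate_ifr(age_shares, ifr_by_age):
    """Overall IFR as the share-weighted average of the age-specific IFRs."""
    return sum(share * ifr_by_age[band] for band, share in age_shares.items())


overall = {name: aggregate_ifr(shares, IFR_BY_AGE)
           for name, shares in POPULATIONS.items()}
for name, ifr in overall.items():
    print(f"{name}: overall IFR of about {ifr:.2f}%")
```

With identical age-specific risks, the older population's headline IFR comes out nearly double, and neither headline figure matches the risk faced by any single age band.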

Always look for lurking variables or meaningful subgroups when drawing conclusions from aggregated data.


COVID-19 fake news

How individuals respond to the pandemic will depend, in large part, on the quality of information to which they are exposed as well as the stories and reports they find credible. While COVID-19 conspiracy theories may appear comical, the abundance of misinformation in the current environment is a legitimate concern as audiences find it hard to separate fact from fiction. This has been exacerbated by each person’s ability to share and promote content regardless of its truth.

Misinformation has the potential to undermine social distancing efforts, divide our community, or promote the adoption of potentially dangerous fake cures. The good news is it’s easy to fight misinformation by following these 7 steps:

    • Check if the information is a joke: Is it meant to be satirical or factual?
    • Read beyond the headline: Headlines can be misleading and purely designed to capture a reader’s attention.
    • Check the source: How credible is it? Is it a personal blog or a recognised organisation such as a UN agency?
    • Check the supporting sources: Click the links to see where the information originated and whether sources provide evidence-based claims from experts in the field. Consult multiple sources before considering information as factual.
    • Check the author: Investigate the author by looking into whether they are known. Do they write for a credible publication? Are they real?
    • Check the date: Old news stories may not be relevant to current events.
    • Check your biases: Consider how your own beliefs could affect your judgement of the information you are taking in.

Take a moment to pause and assess the accuracy of the information before drawing conclusions and sharing articles.

Pie Charts

Covid-19 Pie Chart

On the 5th of May, Scott Morrison used a pie chart (below) to visualise how the estimated 1.5 million people out of work in the first half of 2020 compared across industry sectors.

Covid-19 Pie Chart

Pie charts communicate values by either the relative area of each slice or the angles formed by the slices as they radiate out from the centre. Unfortunately for humans, neither of these attributes is naturally easy to compare, especially when segments are close in size. Our eyes are great at comparing differences in 1-D line lengths but not 2-D areas and angles.

Some may argue adding data labels and values to each segment allows this comparison. However, in doing this, we’ve only turned the pie chart into the equivalent of an awkwardly arranged table containing labels and values. In addition, as the labels don’t line up, the chart becomes cluttered and hard to read.

One approach is to use a horizontal bar chart organised in order of relative size (unless there is some natural ordering to categories). Bar charts make it easy to assess relative size as the bars are aligned to a common baseline and our eyes only have to compare the end points. This makes it straightforward to see not only which segment is the largest but also how much larger it is than the other segments.

The one thing pie charts do well is communicate the “part-to-whole relationship”, signified by all the slices adding up to a complete circle. That’s about it, though, and even this can be accomplished, in part, by using a percentage scale in a bar chart.
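As a rough illustration of the alternative, the sketch below renders a sorted horizontal bar chart on a percentage scale using plain text. The sector names and job-loss counts are made up for the example and are not the figures from the chart discussed above; `bar_chart` is a hypothetical helper.

```python
def bar_chart(values, width=40):
    """Return text bars sorted from largest to smallest share of the total."""
    total = sum(values.values())
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    label_width = max(len(name) for name in values)
    lines = []
    for name, value in ordered:
        share = value / total
        bar = "#" * round(share * width)  # every bar starts at the same baseline
        lines.append(f"{name:<{label_width}} | {bar} {share:.0%}")
    return lines


# Hypothetical job-loss counts (thousands) for made-up sectors.
job_losses = {"Sector A": 300, "Sector B": 450, "Sector C": 150, "Sector D": 600}
chart = bar_chart(job_losses)
print("\n".join(chart))
```

Because the bars are sorted and share a baseline, the ranking and the size of each gap can be read at a glance, while the percentage labels retain the part-to-whole message.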

The goal of any data visualisation is to make data easier to understand without increasing the complexity of what’s being presented.

Transmission Source

COVID-19 Transmission Source

As the Government aims to balance the economic and social benefits of easing restrictions with the risk of a future spike in cases, a close eye is being kept on community transmissions. These represent locally acquired infections from an unknown source and currently account for 10% of all cases in Australia.

While imported cases are relatively “easy to control” with proper checks and tracing (63% of Australian cases have been acquired overseas), a rise in community transmissions can signal a loss of control of the virus.

The inability to identify the source of a case means unknown people may be circulating the virus in the community. This leads to unidentified transmission events and further spread of the virus. By understanding where the spread originates, we have a better chance of blocking the transmission pathway. 

The charts below show daily cases by transmission source and highlight that there is still some evidence of low-level community transmission (right). This is why we cannot get complacent as Australia starts to ease restrictions. It is also why Premiers are pushing for many more people to get tested: the more people who get tested, the better our chance of reducing the number of community transmissions.

Always look out for meaningful subgroups within data sets.


COVID-19 Uncertainty

Data visualisations are instrumental to how we process COVID-19 information. They carry an aura of objectivity which gives designers tremendous power when communicating messages. If designers wield that authority irresponsibly, they may downplay the pandemic, which can lead to the rejection of public health advice.

One enduring issue for any visualisation design is accurately communicating uncertainty. The default is usually to shy away from it because:

    • Uncertainty is hard to calculate and show: Calculating intervals or probability distributions is not always easy and uncertainty symbology is very difficult to use.
    • The risk of misinterpretation: People aren’t used to visualising uncertainty. As a result, some may perceive the author as not being confident in what they’re showing.  
    • Designers fear their audience will be confused: For a lot of people, looking at graphs is hard enough. If you must further explain things like probability, you’re going to lose readers.

One way to visualise uncertainty is to use a dotted line and shaded areas for possible values (below). As the chart doesn’t use a clean, certain-looking line, the author doesn’t mislead people into thinking the data has given a definitive answer. Other common methods include error bars or explicit explanations and disclaimers in narratives.
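Before uncertainty can be drawn as error bars or a shaded band, it has to be calculated. As a minimal sketch, the example below uses the textbook normal approximation to put a roughly 95% interval around a proportion. The survey numbers are hypothetical and `proportion_interval` is an illustrative helper, not the method behind any chart referred to here.

```python
import math


def proportion_interval(positives, sample_size, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 gives ~95%)."""
    p = positives / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical survey: 50 positive antibody tests out of 1,000 people.
estimate, low, high = proportion_interval(50, 1000)
print(f"Estimate {estimate:.1%}, plausible range {low:.1%} to {high:.1%}")
```

Reporting the range alongside the point estimate, or plotting `low` and `high` as the ends of an error bar, stops a single number from looking more certain than it is.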

Try to avoid taking values in charts as the precise truth, especially if the uncertainty behind them has not been properly visualised.


COVID-19 Sweden

Sweden has taken an alternative route to fighting COVID-19, keeping the country relatively open as many nations impose heavy restrictions. This ‘trust-based’ approach aims to slow the spread of the virus while mitigating the economic and social impact.

Whether their strategy is working depends on which metrics you look at and how you interpret them.

Relatively high numbers of deaths per million people, case fatality rates and excess deaths may look like a significant defeat. However, if Sweden’s aim is to protect the health system rather than to prevent every death, it’s worth looking at a few other points:

    • Sweden’s healthcare system has not been overburdened: There has always been at least 20% spare Intensive Care Unit (ICU) capacity.
    • The country may be able to achieve natural herd immunity faster (since a higher proportion of the population has already been infected): It is estimated 15-20% of people in Stockholm have reached a level of immunity which could slow the spread of the virus.
    • Sweden’s high government ratings: Approval ratings for the Social Democratic Party have increased for the second month in a row, suggesting the majority of society believes in the strategy.

It’s too early to tell if Sweden’s strategy has been a success. As each country will have to reach herd immunity through either infection or vaccine (which could be greater than 11 months away), it’s worth keeping an eye on different metrics to judge the overall efficacy of measures.

Different statistics tell different stories. Try to look at a diverse range before drawing conclusions.

Confirmation Bias

COVID-19 Confirmation Bias

As we continue to be bombarded with COVID-19 information, determining what to believe is no easy task. To cope with the overwhelming volume, uncertain outcomes and the stressful environment, our brains develop shortcuts (cognitive biases) to help us make decisions quickly. Once you’ve made up your mind about a certain topic, it’s easier to look for ways to confirm it, even if there is data to the contrary. This is called confirmation bias.

Studies show people tend to “cherry-pick” isolated studies, Facebook posts, or Tweets which support their desires or logic in order to normalise their activities. Social network algorithms facilitate this by presenting you with posts similar to those you’ve previously liked (lowering your chances of seeing contradictory information).

Most of the time this almost automatic response of our brains leads us to sub-optimal decisions. However, there are a few ways to fight confirmation bias during COVID-19:

    • Hold your opinions loosely: The more adaptable you are to new information, the less likely you are to anchor your beliefs or skew your perception. Admit when you are wrong.
    • Seek out contradictory sources: News channels, newspapers and radio stations all present different perspectives meant to attract a certain type of audience. Experiment with other reputable sources.  
    • Listen to the experts: It’s particularly critical to scrutinise whether a person is actually an expert. Elon Musk’s Tweet on hydroxychloroquine was highly influential despite this not being his expertise.
    • Pause before sharing: Misinformation is itself a virus that can spread quickly and get out of control (Element #23). 

The most dangerous aspect of cognitive biases is not succumbing to them but failing to realise we are succumbing to them. Always be aware of your own confirmation bias when processing information.

Infection Risk

Infection risk of COVID-19 = Intensity of exposure × Time of exposure. This means you can become infected relatively quickly from a single sneeze or by sharing the same air in an enclosed space for a prolonged period (e.g. a 1.5-hour dinner at a restaurant or working in a call centre).

In an article analysing 54 Super Spreader Events (SSEs), 70% involved one or more of the following activities: parties, funerals, religious meetups and business networking sessions. These types of events all involved similar behaviour: extended, close-range, face-to-face conversation – typically in crowded, socially animated spaces.

The remainder, while not in these categories, involved almost identical interpersonal dynamics (e.g. concert-goers and singing groups in Japan, Skagit County, WA, and Singapore, as well as outbreaks in meat-processing plants, where workers must communicate with one another over the deafening noise of machinery).

With few exceptions, almost all the SSEs took place indoors, where people tend to pack closer together and ventilation is poorer (less than 0.3% of traced infections have occurred outdoors). High levels of noise also seem to be a common feature of SSEs as these environments force people to talk, yell, laugh, cheer, sob or sing at extremely close range.

As restrictions ease and you assess your risk of infection with each interaction, you need to consider:

    • The volume of air space (how much airflow is around you).
    • The number of people.
    • The length of time you will be in this environment.

Remember, dose AND time are both crucial elements in your chance of infection. This explains why a trip to the grocery store (weak exposure, short time) is much safer than a visit to a beauty parlour (close exposure, long time).
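The intensity-times-time rule can be sketched as a simple relative index. The intensity values below sit on an arbitrary, hypothetical scale chosen purely for illustration; the scores rank situations against each other and are not infection probabilities.

```python
def exposure_score(intensity, minutes):
    """Relative exposure index: intensity of exposure multiplied by time."""
    return intensity * minutes


# Hypothetical intensity values on an arbitrary scale (higher = closer contact).
scenarios = {
    "grocery shopping": exposure_score(intensity=1, minutes=15),
    "beauty parlour visit": exposure_score(intensity=4, minutes=60),
    "1.5-hour restaurant dinner": exposure_score(intensity=3, minutes=90),
}
for name, score in sorted(scenarios.items(), key=lambda item: item[1]):
    print(f"{name}: {score}")
```

Even with made-up inputs, the ordering shows how a long stay in close contact can outweigh a brief encounter by an order of magnitude.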

Outliers shouldn’t always be excluded from your analysis; sometimes these extreme events provide valuable insight.

Short Term Trends

COVID-19 cubic fit

On the 5th of May, the White House Council of Economic Advisers (CEA) released the chart below of actual and projected USA COVID-19 daily deaths. They included a cubic fit to “improve data visualisation” which saw daily deaths head to zero in a mere 10 days.

This extremely misleading representation of the data doesn’t match reality due to:

    • The cubic fit ignoring the data lag (Element #3). We know on average over 20,000 new cases are being discovered each day and it can take two weeks from someone catching it to dying. This means there will definitely be people dying in 10 days’ time, as is evident from the 1,703 deaths today.
    • Cubic modelling is an inappropriate smoothing method in the context of a pandemic. After peaking, deaths do not rapidly decline to zero, let alone below it, for any virus.
    • Day-of-the-week variability. The cubic prediction will vary significantly depending on which day of the week it was estimated on.
    • The stark difference between the cubic fit and the latest IHME projection (green). 

While the US Government later clarified the chart in question was not a forecast of deaths, the red line projected into the future makes it seem otherwise. Sitting the line alongside outdated, more optimistic IHME models only made the cubic fit seem more reasonable to the viewer.

Fitting a curve to known data to assist in visualisation is fine but extending it outside that range can quickly send the wrong message.
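The danger of extrapolating a polynomial can be reproduced with toy numbers. The sketch below fits a cubic by ordinary least squares to a hypothetical 31-day series that rises to a peak and starts to fall, then evaluates the fitted curve beyond the data. The series is invented for illustration and is not the CEA's data; `fit_polynomial` and `evaluate` are hypothetical helpers written for this example.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    size = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, size):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, size + 1):
                rows[r][k] -= factor * rows[col][k]
    # Back substitution; coeffs[k] multiplies x**k.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        known = sum(rows[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (rows[i][size] - known) / rows[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Hypothetical daily deaths: a hump peaking at 2,000 on day 20, observed to day 30.
days = list(range(31))
deaths = [2000 - 5 * (d - 20) ** 2 for d in days]

coeffs = fit_polynomial(days, deaths, degree=3)
projected = evaluate(coeffs, 45)  # 15 days beyond the last observation
print(f"Fitted value on day 20: {evaluate(coeffs, 20):.0f}")
print(f"Projected value on day 45: {projected:.0f}")
```

Inside the observed range the curve tracks the data closely, but the projection for day 45 is a negative number of deaths. The fit simply continues the most recent downward bend; it knows nothing about how an epidemic behaves.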

Never assume short term trends in a data set last forever.
